Coverage Evals: Stop Agents Missing Things

The biggest AI failure mode we see in building-industry workflows is not “wrong math.” It’s “right-sounding answer, missing the thing that makes the answer safe.”

An agent can complete a task (“recommend three HVAC options,” “draft a scope,” “summarize a vendor spec,” “compare bids”) and still be unusable if it misses a critical exclusion, a code requirement, a lead-time constraint, or a better alternative. This is a coverage problem, not a completion problem.

What a coverage eval is

A coverage eval scores whether the agent included the required sources, alternatives, caveats, and constraints — not just whether it produced an output.

Think of it as the contractor version of “did we check the obvious stuff?”: permits, code, service area, warranties, lead times, compatibility, and what changes price.

A building-industry example (how agents go wrong)

Imagine an agent helping a remodeler pick windows. It recommends three “top” models and even includes price ranges. But it fails to surface the one constraint that matters: lead time, and the risk it poses to the install date. The agent’s output is not “incorrect.” It is incomplete in a way that creates real downstream cost.

This is the pattern: agents over-focus on what’s easy to summarize (features and price) and under-surface what protects operations (constraints, exceptions, alternatives, and confirmation steps).
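
As a rough illustration, the window example above can be turned into a tiny coverage check. This is a minimal sketch, not a production evaluator: the names `RequiredItem`, `coverage_report`, and `WINDOW_MUST_NOT_MISS` are hypothetical, the keyword lists are invented for illustration, and plain substring matching stands in for whatever evaluator (a smaller model or a human reviewer) you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequiredItem:
    label: str                 # name of the must-not-miss item
    keywords: tuple[str, ...]  # any one of these phrases counts as covered


def coverage_report(output: str, required: list[RequiredItem]) -> dict:
    """Return the covered fraction and the labels the output never mentions."""
    text = output.lower()
    missing = [
        item.label
        for item in required
        if not any(keyword.lower() in text for keyword in item.keywords)
    ]
    score = (len(required) - len(missing)) / len(required) if required else 1.0
    return {"score": score, "missing": missing}


# Hypothetical must-not-miss list for the window-selection workflow.
WINDOW_MUST_NOT_MISS = [
    RequiredItem("lead time", ("lead time", "weeks to deliver")),
    RequiredItem("install date risk", ("install date", "schedule risk")),
    RequiredItem("warranty", ("warranty",)),
]

# A draft that reads fine but never mentions lead time or install date.
draft = (
    "Top picks: Model A, Model B, Model C. "
    "Expect $400 to $900 per window, each with a 10-year warranty."
)
report = coverage_report(draft, WINDOW_MUST_NOT_MISS)
print(report["missing"])
```

The draft would pass a completion check (three options, with prices), but the report flags the two items that protect the schedule. Keyword matching is crude and will miss paraphrases, so treat the score as a prompt for review rather than a verdict.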

How to implement coverage evals (operator playbook)

Define a “must-not-miss” list per workflow (sources, alternatives, caveats, and constraints).
Write a short rubric the reviewer can score in under 60 seconds (coverage, caveat retention, and wrong-action risk).
Log the run: prompt, tools used, retrieved sources, output, and the reviewer’s notes (failed traces become future eval cases).
Split the work: the task model generates the draft; a smaller evaluator checks coverage; a human approves changes that affect cost, schedule, or client commitments.
Keep the page/workflow agent-readable: stable labels, clear state, and explicit confirmation steps so the agent can’t “silently decide.”
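
The playbook steps above could be sketched roughly as follows. Everything here is an assumption made for illustration: the 0 to 2 scoring scale, the pass rule, the field names, and the `HUMAN_APPROVAL_TRIGGERS` set are hypothetical, not a prescribed schema. The sketch shows three things: a rubric small enough to score quickly, a run log that captures prompt, tools, sources, output, and reviewer notes, and a rule that sends cost, schedule, or client-commitment changes to a human.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

# Assumed categories of change that always require human approval.
HUMAN_APPROVAL_TRIGGERS = {"cost", "schedule", "client_commitment"}


@dataclass
class Rubric:
    coverage: int           # 0 = missed a must-not-miss item, 2 = all present
    caveat_retention: int   # 0 = caveats dropped, 2 = caveats kept intact
    wrong_action_risk: int  # 0 = low risk, 2 = likely to trigger a wrong action

    def passed(self) -> bool:
        # Assumed pass rule: full coverage, caveats intact, low risk.
        return (
            self.coverage == 2
            and self.caveat_retention == 2
            and self.wrong_action_risk == 0
        )


@dataclass
class RunLog:
    prompt: str
    tools_used: list[str]
    retrieved_sources: list[str]
    output: str
    reviewer_notes: str = ""
    changes: list[str] = field(default_factory=list)  # what the draft would change


def needs_human_approval(log: RunLog) -> bool:
    """True when the draft touches cost, schedule, or client commitments."""
    return any(change in HUMAN_APPROVAL_TRIGGERS for change in log.changes)


def to_eval_case(log: RunLog, rubric: Rubric) -> Optional[str]:
    """Serialize a failed run so it can be replayed as a future eval case."""
    if rubric.passed():
        return None
    return json.dumps({"run": asdict(log), "rubric": asdict(rubric)}, sort_keys=True)


log = RunLog(
    prompt="Compare three window bids",
    tools_used=["search"],
    retrieved_sources=["vendor-spec.pdf"],
    output="Bid B is the cheapest option.",
    reviewer_notes="No lead time mentioned.",
    changes=["schedule"],
)
rubric = Rubric(coverage=0, caveat_retention=1, wrong_action_risk=2)
case = to_eval_case(log, rubric)
```

Keeping the rubric to three integer fields is what makes a sub-60-second review plausible; the serialized failure keeps the reviewer's note next to the trace that caused it.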

What to publish on your site so coverage improves

Coverage starts with source-of-truth content. If the constraint isn’t visible on your pages, the agent can’t reliably include it.

Add “truth set” blocks: the exact scope, constraints, and exclusions for one decision.
Add proof modules: one real example, one checklist, one diagram, or one before/after that shows you’ve done the work.
Add an explicit “what changes price / schedule” section (short, scannable, dated).
Add clear next steps and boundaries (what happens after the CTA, and what requires paid review).
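
One way to keep a “truth set” block consistent is to hold it as structured data and render it with stable labels. The sketch below is only an assumption about how that might look: the `TruthSet` fields, the 90-day staleness threshold, and the sample window content are all hypothetical placeholders, not real product data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TruthSet:
    decision: str
    scope: list[str]
    constraints: list[str]
    exclusions: list[str]
    price_schedule_drivers: list[str]
    last_reviewed: date  # the "dated" part: when a human last confirmed this block


def render(ts: TruthSet) -> str:
    """Render the block as plain text with the same labels every time."""
    lines = [
        f"Decision: {ts.decision}",
        f"Last reviewed: {ts.last_reviewed.isoformat()}",
    ]
    sections = (
        ("Scope", ts.scope),
        ("Constraints", ts.constraints),
        ("Exclusions", ts.exclusions),
        ("What changes price / schedule", ts.price_schedule_drivers),
    )
    for label, items in sections:
        lines.append(f"{label}:")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


def is_stale(ts: TruthSet, today: date, max_age_days: int = 90) -> bool:
    """Flag blocks nobody has re-confirmed within the assumed review window."""
    return (today - ts.last_reviewed).days > max_age_days


# Placeholder content for illustration only.
windows = TruthSet(
    decision="Replacement windows, single-family remodel",
    scope=["Supply and install of replacement units"],
    constraints=["Lead time must be confirmed before the install date is set"],
    exclusions=["Structural framing repair"],
    price_schedule_drivers=["Custom sizes", "Supplier lead time"],
    last_reviewed=date(2025, 1, 15),
)
block = render(windows)
```

Stable labels such as “Exclusions:” give an agent a predictable place to find each constraint, and the review date lets you flag blocks that have gone too long without a human check.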

Datum’s bottom line

If you’re shipping agents into real operations, you need more than “it finished.” You need “it didn’t miss the things that change cost, schedule, or client promises.” Coverage evals make that visible, testable, and improvable.

Sources Read

WebMCP evals (Chrome for Developers)
WebMCP best practices (Chrome for Developers)
A study on search-agent coverage failures, arXiv:2605.27905 (arXiv)
The next evolution of the Agents SDK (OpenAI)

