Your Agent Builder Demo Needs a Workflow Receipt

The AI market keeps trying to make agents feel like a button. Click once, watch the model work, and trust the result. That is not a serious operating model for a remodeler, builder, showroom, supplier, or trade business. If an agent is touching scopes, estimates, client updates, vendor data, purchase decisions, ad spend, or production schedules, the demo is not the deliverable. The workflow receipt is.

OpenAI quietly made that lesson easier to explain. On June 3, 2026, its AgentKit page was updated to say Agent Builder and Evals are being wound down on the OpenAI platform, with Agents SDK recommended for workflows that should continue as code. That does not mean visual builders are useless. It does mean teams should be careful about treating a pretty canvas as the durable business system.

The receipt is the product surface

OpenAI describes the newer Agents SDK direction as a harness for agents that can inspect files, run commands, edit code, and work inside controlled sandbox environments. The example matters less than the posture: give the agent a controlled workspace, explicit instructions, scoped tools, and evidence it can cite. That is the same posture building-industry AI should use, even when the task is not software. A workflow receipt answers five questions about every run:

What source data did the agent read: plans, selections, CRM notes, vendor PDFs, price sheets, emails, or prior proposals?
What was it allowed to do: draft, compare, classify, summarize, notify, price, schedule, or update a record?
What state did it move through: queued, reading, extracting, drafting, waiting for approval, completed, failed, or retried?
What did it produce: a draft scope, a discrepancy list, a client update, a quote packet, a task list, or an exception report?
What should a human review before anything leaves the business?
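The state question in particular is easy to make concrete. Below is a minimal sketch, assuming a run only moves forward through the states listed above and that a failed run must be marked retried before it resumes. The names RunState, ALLOWED, and RunLog are hypothetical, and the transition table is one plausible reading of the list, not a standard.

```python
from enum import Enum


class RunState(Enum):
    QUEUED = "queued"
    READING = "reading"
    EXTRACTING = "extracting"
    DRAFTING = "drafting"
    WAITING_FOR_APPROVAL = "waiting for approval"
    COMPLETED = "completed"
    FAILED = "failed"
    RETRIED = "retried"


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    RunState.QUEUED: {RunState.READING, RunState.FAILED},
    RunState.READING: {RunState.EXTRACTING, RunState.FAILED},
    RunState.EXTRACTING: {RunState.DRAFTING, RunState.FAILED},
    RunState.DRAFTING: {RunState.WAITING_FOR_APPROVAL, RunState.FAILED},
    RunState.WAITING_FOR_APPROVAL: {RunState.COMPLETED, RunState.FAILED},
    RunState.FAILED: {RunState.RETRIED},
    RunState.RETRIED: {RunState.READING, RunState.FAILED},
    RunState.COMPLETED: set(),
}


class RunLog:
    """Keeps the full state history so the receipt can show every step."""

    def __init__(self):
        self.history = [RunState.QUEUED]

    @property
    def state(self):
        return self.history[-1]

    def advance(self, new_state):
        # Refuse jumps the table does not allow, e.g. drafting -> completed.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {new_state.value}"
            )
        self.history.append(new_state)
```

The point of the table is that a run cannot reach completed without passing through waiting for approval, so the review step is enforced by the structure and not left to convention.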

Why this matters in a building business

A bad AI answer in a generic chat is annoying. A bad AI action inside a construction business can create rework, margin leakage, awkward client communication, or a vendor mistake. The right first implementation is rarely an autonomous agent that does everything. It is usually a narrow worker with typed inputs, typed outputs, source links, visible status, and a review queue.

Think about a bid-review workflow. The agent should not simply announce that one supplier is cheaper. It should show which documents were compared, which line items matched, which assumptions changed, which alternates were excluded, which numbers need human confirmation, and where the source evidence lives. The output should be reviewable by an operator who understands the job, not trusted because the model sounded confident.
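As a rough illustration of that bid review, the sketch below compares two quotes by a shared item code and reports matched lines, lines that appear in only one quote, and the source of each number. It assumes both suppliers' quotes have already been extracted into line items with a common code. LineItem, compare_bids, and the field names are hypothetical, and real quotes usually need fuzzier matching than an exact code lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    code: str         # hypothetical shared item code, e.g. a SKU or cost code
    description: str
    amount: float     # quoted price for this line
    source: str       # where the number came from, e.g. "quote_a.pdf p.2"


def compare_bids(bid_a, bid_b):
    """Separate true price differences from lines missing in one quote."""
    a = {item.code: item for item in bid_a}
    b = {item.code: item for item in bid_b}

    matched = []
    for code in sorted(a.keys() & b.keys()):
        matched.append({
            "code": code,
            # Positive delta means supplier B is more expensive on this line.
            "delta": round(b[code].amount - a[code].amount, 2),
            "sources": (a[code].source, b[code].source),
        })

    return {
        "matched": matched,
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        # Any unmatched line means the totals are not directly comparable.
        "needs_human_confirmation": bool(a.keys() ^ b.keys()),
    }
```

If supplier B's total is lower only because a countertop line is missing, that line lands in only_in_a and the report is flagged for human confirmation; it is not presented as a saving.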

Evals are how you define good before the client sees it

OpenAI frames evals as a loop: specify what great means, measure against real-world conditions, and improve from errors. For Datum clients, that means the first question is not which model is newest. The first question is what failure would hurt this business and what test catches it before the workflow ships.

For scope drafting: did the AI preserve exclusions, allowances, and owner responsibilities?
For vendor comparison: did it separate true price differences from missing line items?
For client updates: did it keep promises inside what the team can actually deliver?
For search and marketing pages: did it keep structured data aligned with visible content and avoid invented claims?
For lead follow-up: did it route the right role, offer, urgency, and consent state into the next step?
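The scope-drafting check can be sketched as a small eval. This assumes the required exclusions, allowances, and owner responsibilities are available as exact phrases from the source documents. The function names are hypothetical, and the plain substring match is a deliberate simplification: a production eval would likely need to tolerate paraphrase.

```python
def eval_scope_draft(required_clauses, draft_text):
    """Pass only if every required clause appears in the draft.

    Matching is case-insensitive and ignores extra whitespace,
    but it is still a literal phrase match (a simplifying assumption).
    """
    normalized = " ".join(draft_text.lower().split())
    missing = [
        clause
        for clause in required_clauses
        if " ".join(clause.lower().split()) not in normalized
    ]
    return {"passed": not missing, "missing": missing}


def run_evals(cases):
    """Run the check over several cases and tally the results."""
    results = [eval_scope_draft(c["required"], c["draft"]) for c in cases]
    return {
        "passed": sum(r["passed"] for r in results),
        "total": len(results),
        "results": results,
    }
```

A failing result names the clause that went missing, which gives the reviewer something specific to fix where a bare score would not.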

Honesty beats fake progress

Anthropic highlighted honesty as a major improvement area in its recent Claude Opus 4.8 release, specifically the tendency for models to claim progress when the evidence is thin. That is exactly the behavior a workflow receipt is meant to constrain. A useful agent should be able to say: I read these sources, I could not access this one, I matched these fields, I am uncertain about this assumption, and this needs approval.
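That kind of statement can be generated from the run data instead of being left to the model's phrasing. A minimal sketch follows, assuming the workflow already tracks which sources were read, which could not be accessed, which fields matched, and which assumptions remain open. The function name and parameters are hypothetical.

```python
def honest_summary(read, unreadable, matched_fields, uncertain):
    """Build a plain-text run summary that reports gaps explicitly."""
    lines = [f"Read {len(read)} source(s): {', '.join(read) or 'none'}."]
    if unreadable:
        lines.append(f"Could not access: {', '.join(unreadable)}.")
    lines.append(f"Matched fields: {', '.join(matched_fields) or 'none'}.")
    if uncertain:
        lines.append(f"Uncertain assumptions: {', '.join(uncertain)}.")
    # Any gap or open assumption forces an explicit approval request.
    if unreadable or uncertain:
        lines.append("Status: needs approval.")
    else:
        lines.append("Status: ready for routine review.")
    return "\n".join(lines)
```

Because the status line is derived from the recorded gaps, the summary cannot claim a clean run while a source is still unread.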

AI Search wants the same discipline

Google Search Central still says that showing up in AI Search does not require special AI schema. The strategy is helpful original content, crawlability, technically clean pages, and structured data that matches what users can see. Google has also started rolling out generative AI performance views in Search Console. That creates a useful parallel: your public pages and your private AI workflows both need visible source truth, not hidden claims.

If an agent may summarize your business, route a prospect, draft a follow-up, or prepare an internal recommendation, it needs the same content discipline your website needs: clear facts, current sources, stable next steps, and no unsupported promises.

The operator checklist

Before you approve an agent demo, ask for the receipt. It can be a simple admin panel, a table, or a generated run report. The format matters less than the contents.

Inputs: the exact documents, records, pages, and user fields used.
Permissions: what the agent could read, draft, write, send, spend, or approve.
Tool calls: every lookup, extraction, calculation, search, or external action.
State and timing: queued, running, waiting, failed, retried, approved, and completed timestamps.
Sources and citations: visible links back to the files or pages that support the output.
Human review: who approved it, what changed, and what was rejected.
Eval result: the product-specific test that says whether the run was good enough.
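The checklist maps naturally onto a single record per run. The sketch below is one possible shape, not a prescribed schema: WorkflowReceipt, ToolCall, and every field name are hypothetical, and the ready_to_send rule (reviewer present, eval passed, at least one citation) is an assumption about what a team might require.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ToolCall:
    name: str     # e.g. "extract", "lookup", "calculate"
    detail: str   # what the call touched


@dataclass
class WorkflowReceipt:
    inputs: list                    # exact documents, records, pages, fields
    permissions: list               # e.g. ["read", "draft"]
    tool_calls: list = field(default_factory=list)
    state_log: list = field(default_factory=list)   # (state, ISO timestamp)
    citations: list = field(default_factory=list)
    reviewer: Optional[str] = None
    review_notes: str = ""
    eval_passed: Optional[bool] = None   # None means the eval has not run

    def record_state(self, state, when=None):
        """Append a state change with a UTC timestamp."""
        when = when or datetime.now(timezone.utc)
        self.state_log.append((state, when.isoformat()))

    def ready_to_send(self):
        """Assumed rule: reviewed, eval passed, and at least one citation."""
        return (
            bool(self.reviewer)
            and self.eval_passed is True
            and bool(self.citations)
        )

    def to_json(self):
        """Serialize the receipt for an admin panel or run report."""
        return json.dumps(asdict(self), indent=2)
```

Whether this renders as an admin panel, a table, or a generated report is a presentation choice. The record underneath stays the same, and nothing leaves the business until ready_to_send returns True.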

That is less flashy than an AI magic button. It is also the difference between a demo and an operating system.

Sources Read

Introducing AgentKit (OpenAI)
The next evolution of the Agents SDK (OpenAI)
How evals drive the next chapter in AI for businesses (OpenAI)
Introducing Claude Opus 4.8 (Anthropic)
AI features and your website (Google Search Central)
Optimizing your website for generative AI features on Google Search (Google Search Central)
Introducing Search Generative AI performance reports in Search Console (Google Search Central)