
The Verification Gap

Most software verification rests on an assumption that has served us well: you know what the program should do, and you can check whether it did it. Write the expected output, run the code, compare. Green or red.

That assumption starts to break down as code becomes agentic. When an LLM generates a greeting, there is no single expected output — there are many acceptable ones and no practical way to enumerate them. In our experience building tools that mix deterministic and agentic code, the gap between what we can specify and what we can verify widens at every level of the spectrum.

We’ve been calling this the verification gap. This post is our attempt to map the spectrum — five levels from fully deterministic to fully agentic — and describe the verification methods each level demands. It is a simplification. The boundaries between levels are fuzzy, and we’ve only applied this framework to our own small codebase. But we’ve found the taxonomy useful for making verification decisions, and we’re sharing it in case others do too.

Five Levels

Consider a simple task: greet a user by name and ask a follow-up question. The same requirement can be implemented at five different levels of determinism, each with fundamentally different verification properties.

Level 1: Fully Deterministic

Every path is hand-coded. No machine learning, no external models. The program is a finite function from inputs to outputs.

MALE_NAMES = {"james", "john", "robert"}
FEMALE_SUFFIXES = ("lyn", "een", "ine", "ary")

name = input("What is your name? ").strip()
n = name.lower()
if n in MALE_NAMES:
    g = "M"
elif n.endswith(FEMALE_SUFFIXES) or n.endswith("a"):
    g = "F"
else:
    g = "?"

FOLLOWUPS = {"M": "Catch the game?",
             "F": "Read anything good?",
             "?": "What's fun?"}
print(f"Nice to meet you, {name}!")
print(FOLLOWUPS[g])

Verification paradigm: Classical testing. The oracle is exact — for every input, there is one correct output. Unit tests assert input-to-output pairs. Property tests confirm the output is always one of three known responses. Formal proof is trivial: finite input classes map to finite outputs. Mutation testing works because every mutation is detectable.
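As a concrete sketch, here is what that exact oracle looks like in test form, assuming the rules above are factored into a hypothetical guess_gender() function:

```python
# Sketch: classical testing at Level 1. Assumes the gender rule from the
# example above is factored into a hypothetical guess_gender() function.
MALE_NAMES = {"james", "john", "robert"}
FEMALE_SUFFIXES = ("lyn", "een", "ine", "ary")

def guess_gender(name: str) -> str:
    n = name.lower()
    if n in MALE_NAMES:
        return "M"
    if n.endswith(FEMALE_SUFFIXES) or n.endswith("a"):
        return "F"
    return "?"

# Unit tests: the oracle is exact, one correct output per input.
assert guess_gender("James") == "M"
assert guess_gender("Carolyn") == "F"
assert guess_gender("Alex") == "?"

# Property test: the output is always one of three known responses.
for name in ["Maria", "Robert", "Kathleen", "X", ""]:
    assert guess_gender(name) in {"M", "F", "?"}
```

Mutation testing works here for the same reason the unit tests do: flip any branch and some exact-match assertion fails.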

Coverage: 100% branch and path coverage is achievable and meaningful. Every behavior is enumerable.

Verification gap: Zero. This is the level that classical testing was designed for — and where it works best.

Level 2: Deterministic + Narrow ML

A trained classifier replaces the hand-coded rules, but the surrounding flow remains deterministic.

import gender_guesser.detector as gd

detector = gd.Detector(case_sensitive=False)  # match names regardless of input case
name = input("What is your name? ").strip()

result = detector.get_gender(name)
gender_map = {
    "male": "M", "mostly_male": "M",
    "female": "F", "mostly_female": "F",
}
g = gender_map.get(result, "?")

FOLLOWUPS = {"M": "Catch the game?",
             "F": "Read anything good?",
             "?": "What's fun?"}
print(f"Nice to meet you, {name}!")
print(FOLLOWUPS[g])

Verification paradigm: Statistical + classical. The deterministic flow still gets exact-match tests. But the classifier is a black box — you test its behavior, not its reasoning. Accuracy alone is misleading — if 70% of names in your dataset are male, a classifier that always returns “male” achieves 70% accuracy. The real metrics are precision and recall per class: of the names the classifier labels female, how many are correct (precision)? Of the actually female names, how many does it find (recall)? Bias audits stratify these metrics by ethnicity and culture. Integration tests use a golden set of ~500 names with regression thresholds on per-class F1 scores [1].
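Per-class precision and recall are cheap to compute. The sketch below does it over a toy golden set; the names, labels, and the predict() stub are illustrative, not a real eval:

```python
# Sketch: per-class precision/recall over a toy golden set.
# The golden labels and the predict() stub are illustrative only.
from collections import Counter

golden = [("Maria", "F"), ("James", "M"), ("Alex", "?"),
          ("John", "M"), ("Carolyn", "F"), ("Robin", "?")]

def predict(name):
    # Stand-in for the classifier under test; misclassifies "Alex".
    return {"Maria": "F", "James": "M", "Alex": "M",
            "John": "M", "Carolyn": "F", "Robin": "?"}[name]

tp, fp, fn = Counter(), Counter(), Counter()
for name, truth in golden:
    pred = predict(name)
    if pred == truth:
        tp[pred] += 1
    else:
        fp[pred] += 1   # predicted this class wrongly
        fn[truth] += 1  # missed the true class

metrics = {}
for cls in ("M", "F", "?"):
    precision = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
    recall = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
    metrics[cls] = (precision, recall)
    print(f"{cls}: precision={precision:.2f} recall={recall:.2f}")
```

Here "M" ends up with perfect recall but imperfect precision, because it absorbs the misclassified "Alex" — exactly the kind of asymmetry that a single accuracy number hides.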

Coverage: Branches: 100%. Classifier precision/recall: statistical bounds, not guarantees.

Verification gap: Measurement gap. You can measure the classifier’s precision and recall on your dataset. You cannot prove absence of bias — only bound it on the data you have. The gap is the distance between your dataset and the real distribution of names [2].

Level 3: Deterministic Scaffold + LLM Functions

The deterministic scaffold controls the flow — the classifier still decides the gender, the code still structures the interaction — but the LLM generates the response. The program’s output is no longer one of three strings. It is natural language. Your code calls the LLM; the LLM does not call your code.

import gender_guesser.detector as gd
from anthropic import Anthropic

client = Anthropic()
detector = gd.Detector()
name = input("What is your name? ").strip()

result = detector.get_gender(name)
g = {"male": "M", "mostly_male": "M",
     "female": "F", "mostly_female": "F"
     }.get(result, "unknown")

resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=150,
    messages=[{"role": "user",
      "content": f"Greet {name} (gender: {g}). "
      "Say nice to meet you, ask ONE "
      "follow-up. Natural, not stereotyped. "
      "2 sentences max."}]
)
print(resp.content[0].text)

Verification paradigm: Behavioral contracts + LLM-as-judge. You can no longer check for exact output. Instead, you assert structural invariants: the output contains the name, is at most two sentences, and ends with a question mark. For semantic properties — “is it stereotypical?” — you need a second model to judge the first [3]. Adversarial inputs (profanity-adjacent names, prompt-injection strings) probe boundary behavior. Regression testing runs N=100 samples per name and asserts that >=95% pass the behavioral contract.
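The structural half of the contract is plain code. A minimal sketch, mirroring the invariants in the prompt above (meets_contract is our own hypothetical name):

```python
# Sketch: a structural behavioral contract for the Level 3 greeting.
# Checks invariants of the output, not exact text.
import re

def meets_contract(output: str, name: str) -> bool:
    text = output.strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return (
        name.lower() in text.lower()  # mentions the user's name
        and len(sentences) <= 2       # at most two sentences
        and text.endswith("?")        # ends with a follow-up question
    )

assert meets_contract("Nice to meet you, Maria! Read anything good lately?", "Maria")
assert not meets_contract("Hello there.", "Maria")  # no name, no question
```

The semantic half ("is it stereotypical?") has no such check; that is where the LLM judge comes in.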

Coverage: Deterministic scaffold: 100%. LLM output: statistical coverage over a behavioral contract.

Verification gap: Oracle gap. No exact oracle exists. The test oracle is itself probabilistic (an LLM judging another LLM) or structural (regex and length checks — a weak claim). Verification shifts from “is the output correct?” to “is the output acceptable?” [4]. Model updates can silently change behavior with no code change.

Level 4: LLM Orchestrator + Deterministic Tools

Control inverts. The LLM drives the workflow — deciding what to do, in what order, and when to stop. But it calls deterministic tools for precise operations: database lookups, API calls, file writes, calculations. The LLM is the outer loop; your code is the inner function.

This is what most production agentic systems actually look like. Claude Code, coding agents, and orchestrated workflows all sit here. The LLM reasons about what to do; tools do it reliably.

from anthropic import Anthropic

client = Anthropic()
name = input("What is your name? ").strip()

tools = [{
    "name": "lookup_name",
    "description": "Look up cultural origin and demographics for a name",
    "input_schema": {
        "type": "object",
        "properties": {"name": {"type": "string"}},
        "required": ["name"]
    }
}]

def lookup_name(n):
    # Deterministic tool: exact inputs, exact outputs, testable at Level 1.
    return f"No demographic data on file for {n}."  # stub

# Agent loop: keep going until the model stops requesting tools.
messages = [{"role": "user", "content": name}]
while True:
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        system="You are a friendly greeter. Use the lookup tool "
        "to learn about the person's name, then greet them "
        "with a thoughtful follow-up question. 2 sentences max.",
        tools=tools,
        messages=messages,
    )
    if resp.stop_reason != "tool_use":
        break
    # Feed each tool result back to the model.
    messages.append({"role": "assistant", "content": resp.content})
    messages.append({"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": lookup_name(block.input["name"])}
        for block in resp.content if block.type == "tool_use"]})

print(resp.content[0].text)

Verification paradigm: Tool contract testing + trajectory evaluation. The deterministic tools are fully testable at Level 1 — exact inputs, exact outputs, 100% coverage. But the sequence in which the LLM calls them is non-deterministic. Verification splits: tool correctness (classical) and orchestration behavior (statistical). You test trajectories — given this input, did the agent call the right tools in a reasonable order and produce an acceptable result? [5]
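Trajectory evaluation can start as simply as recording which tools the agent called and scoring the sequence against a policy. A sketch, with illustrative runs and an illustrative policy:

```python
# Sketch: scoring sampled agent trajectories against a simple policy.
# The runs and the policy are illustrative; real trajectories come from logs.

def score_trajectory(calls: list[str]) -> bool:
    # Policy: the agent should call lookup_name exactly once per greeting.
    return calls.count("lookup_name") == 1

# Each run is the sequence of tool names the agent invoked.
runs = [
    ["lookup_name"],                 # expected trajectory
    ["lookup_name"],
    [],                              # skipped the lookup entirely
    ["lookup_name", "lookup_name"],  # redundant second call
]
pass_rate = sum(score_trajectory(r) for r in runs) / len(runs)
assert pass_rate == 0.5  # half the sampled trajectories meet the policy
```

Production suites score richer properties along the same lines: argument validity, ordering constraints, step budgets.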

Coverage: Tools: 100%. Orchestration trajectories: sampled and scored.

Verification gap: Orchestration gap. Each tool is verifiable. The composition is not. The LLM may call tools in unexpected orders, skip steps, or hallucinate tool inputs. You can constrain this with structured outputs and tool schemas, but you cannot prove the agent will always choose the right trajectory.

Level 5: Fully Agentic

No deterministic code. No tools. The LLM is the program. Everything — interpretation, reasoning, response generation — is delegated entirely to the model. In practice, this is a chatbot.

from anthropic import Anthropic

client = Anthropic()
name = input("What is your name? ").strip()

resp = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system="You are a friendly greeter. "
    "Given a name, say nice to meet them. "
    "Infer what you can from their name "
    "(culture, gender, etc.) and ask ONE "
    "thoughtful follow-up. Never be "
    "stereotypical.",
    messages=[{"role": "user", "content": name}]
)
print(resp.content[0].text)

Verification paradigm: Specification-as-prompt + adversarial evaluation. The system prompt is the specification, but it is natural language, not formal. Behavioral eval suites run N=1000 inputs and score each on a rubric: greeting presence, relevance, cultural sensitivity, tone. Red-teaming probes for prompt injection via the name field [5]. Drift detection pins model versions and re-runs the eval suite on each update, alerting on score-distribution shift (e.g., a Kolmogorov-Smirnov test) [6].
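Drift detection needs nothing exotic. A two-sample Kolmogorov-Smirnov statistic over eval scores is a few lines of pure Python; the scores and threshold below are illustrative:

```python
# Sketch: drift detection on eval-score distributions across model
# versions, via a two-sample Kolmogorov-Smirnov statistic.

def ks_statistic(a, b):
    # Maximum distance between the two empirical CDFs.
    def cdf(xs, x):
        return sum(v <= x for v in xs) / len(xs)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in sorted(set(a) | set(b)))

baseline = [0.90, 0.92, 0.88, 0.95, 0.91]   # rubric scores, pinned model
candidate = [0.80, 0.78, 0.85, 0.79, 0.82]  # scores after a model update
THRESHOLD = 0.5  # alert threshold; tune against your own eval suite

if ks_statistic(baseline, candidate) > THRESHOLD:
    print("score distribution shifted: hold the model update")
```

In practice you would use a library implementation with a proper p-value (e.g. scipy.stats.ks_2samp) and far more than five samples; the point is that the check is mechanical once the scores exist.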

Coverage: Undefined in the classical sense. “Coverage” becomes: what percentage of your behavioral spec do your evals exercise?

Verification gap: Fundamental. As far as we can tell, you cannot prove correctness at this level — only accumulate evidence of acceptability. The program’s behavior is a function of weights you do not control. A model version change is a new program. Your “specification” (the prompt) is natural language, not formal. Refinement calculus has no purchase here without a formalization layer [7]. Level 5 is not rare — it is what every foundation model is. Every LLM chat interface, every conversational AI, every system that takes a prompt and returns a completion operates here. What most production teams do is wrap Level 5 models with deterministic tools to pull the system down to Level 4, where individual operations are verifiable even if the orchestration is not.

This is the same verification challenge that model providers themselves face at a much larger scale. Anthropic’s Constitutional AI [13] trains models to self-critique against a written constitution — a natural-language specification that guides RLHF. OpenAI’s Preparedness Framework [14] runs thousands of hours of external red-teaming and only deploys models scoring “medium” risk or below. Their joint evaluation exercise [15] stress-tested both companies’ models against adversarial scenarios, finding no consistent pattern that reasoning models are more or less aligned. These are Level 5 verification strategies applied to the models themselves: no formal proof, only accumulated evidence from increasingly sophisticated evaluation infrastructure. The techniques — constitutional principles, red-teaming, behavioral evals, drift detection — are the same ones available to application developers. The difference is scale.

The Gap Widens

The progression is not smooth — it is a series of phase transitions in what verification can claim:

Level                           | Oracle                              | Gap
1. Fully Deterministic          | Exact match                         | None
2. Deterministic + Narrow ML    | Precision/recall threshold          | Measurement gap
3. Deterministic Scaffold + LLM | Structural + LLM judge              | Oracle gap
4. LLM Orchestrator + Tools     | Tool contracts + trajectory scoring | Orchestration gap
5. Fully Agentic                | Rubric only                         | Fundamental gap

The critical phase transition is between Level 3 and Level 4: the inversion of control. In Level 3, your code calls the LLM — you control execution order. In Level 4, the LLM calls your code — it decides what to do and when. That’s where verification takes its biggest hit.

Most production systems are a mix of all five levels. The model itself is always Level 5. Everything we build around it — tools, scaffolds, classifiers, hand-coded logic — is an attempt to pull the system down the spectrum toward levels where verification is stronger.

Built At vs. Designed For

There is one more distinction the five levels need: a tool can be built at one level while being designed for consumption at another.

A text-to-speech engine might be built entirely at Level 1 — deterministic provider selection, audio codec handling, file I/O. But its API is designed to be called by Level 4 agents: an LLM orchestrator decides what to say, the TTS engine handles how to say it. The engine’s verification gap is zero (Level 1). But the system it participates in has an orchestration gap (Level 4).

This matters because the verification strategy depends on both levels. Building tools at Level 1 that are designed to be called at Level 4 is a deliberate strategy: maximize the surface area where verification is strong, minimize the surface area where it is not. The universal access pattern (library, CLI, MCP server, REST API) is one way to project the same deterministic capability across every consumption level.

Here’s how we classify our own tools:

Building blocks have a built-at level (internal implementation) and a designed-for level (consumption surface). Applications are just whatever level they are.

                                       | Built at           | Designed for
Building Blocks                        |                    |
Quarry, Biff, Vox, Lux, Punt Kit       | L1 — Deterministic | L4 — LLM + Tools
Applications                           |                    |
Koch Trainer                           | L1 — Deterministic | —
PR/FAQ, langlearn-tts, Z Spec, Dungeon | L4 — LLM + Tools   | —

Closing the Gap

The verification gap is not an argument against agentic code. Level 4 code can do things that Level 1 code cannot — reason about context, select strategies, recover from errors. In our experience, the gap is an argument for knowing where your code sits on the spectrum and choosing verification methods that match.

We are not alone in working on this. Huyen’s AI Engineering [12] provides the conceptual framework — evaluation becomes more important precisely as it becomes harder. Teams building eval infrastructure — Braintrust [16], LangSmith [17], and Weights & Biases — are tackling the tooling side. What we are adding is the formal methods angle: mathematical specifications to close the gap where classical testing still works (Levels 1–2) and to provide contracts where it does not (Levels 3–4).

Three strategies that have helped us:

Push behavior down the spectrum. If a Level 5 component can be replaced by a Level 4 agent with deterministic tools, do it. If a Level 4 orchestration can be replaced by a Level 3 scaffold that calls the LLM at known points, do it. Every level you descend narrows the gap.

The deterministic scaffold in Level 3 exists precisely for this reason — keep what you can verify exact, and confine what you cannot to the smallest possible surface.

Formalize the specification. The gaps at Levels 4 and 5 exist partly because the spec (the prompt or system instructions) is natural language. Formal specifications — Z schemas [8], TLA+ models [9], B machines [10] — give you something a natural language prompt cannot: a mathematical object that admits type-checking, animation, and model-checking. You can prove properties of the specification even when you cannot prove properties of the LLM’s output. The spec becomes a contract that the agentic output is tested against, not a suggestion it may or may not follow.

We intend to test this with Dungeon — a text adventure where the LLM is the game engine (Level 4). A Z specification of the game state (inventory, rooms, quest flags) would give us a formal contract: after the player picks up the sword, the invariant says it is in inventory. The LLM narrates the story; the spec verifies the state transitions. If this works, it would demonstrate formal verification of agentic behavior — not by constraining the LLM, but by checking its output against a mathematical model.
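For flavor, here is a fragment of what such a spec might look like in Z. All schema and type names are hypothetical placeholders, not the actual Dungeon spec:

```latex
% Hypothetical Z fragment (fuzz/zed notation); not the actual Dungeon spec.
\begin{schema}{GameState}
  inventory : \power ITEM \\
  location  : ROOM
\end{schema}

\begin{schema}{PickUp}
  \Delta GameState \\
  item? : ITEM
\where
  item? \notin inventory \\
  inventory' = inventory \cup \{ item? \} \\
  location' = location
\end{schema}
```

The PickUp postcondition is the invariant from the sword example: after the operation, the item is a member of inventory', and the model-checker can verify every narrated transition against it.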

The deterministic parts of your system — the Level 1 and Level 2 code, and the tools that Level 4 agents call — are exactly where formal specification pays off most: the gap is zero or small, and the spec can close it completely.

Invest in behavioral eval infrastructure. For Level 3, 4, and 5 code, the eval suite is the test suite. Treat it with the same rigor: version it, run it in CI, set statistical thresholds, alert on drift [11].
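Concretely, "run it in CI with statistical thresholds" can start as small as this (the threshold and results are illustrative):

```python
# Sketch: the behavioral eval suite as a CI gate with a statistical
# threshold. Numbers are illustrative; wire this into your CI runner.

def ci_gate(results: list[bool], threshold: float = 0.95) -> bool:
    # Fail the build when the eval pass rate drops below the threshold.
    return sum(results) / len(results) >= threshold

assert ci_gate([True] * 98 + [False] * 2)        # 98% pass rate: gate opens
assert not ci_gate([True] * 90 + [False] * 10)   # 90% pass rate: gate fails
```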

We think the question is not whether to write agentic code — the capabilities are too valuable. The question is whether to write agentic code without knowing what you lost.

References

  1. Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” ACL 2020. aclanthology.org

  2. Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). “Model Cards for Model Reporting.” FAT* 2019. arxiv.org

  3. Zheng, L., Chiang, W.-L., Sheng, Y., et al. (2023). “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.” NeurIPS 2023. arxiv.org

  4. Ribeiro, M. T. & Lundberg, S. (2022). “Adaptive Testing and Debugging of NLP Models.” ACL 2022. aclanthology.org

  5. Schulhoff, S., Pinto, J., Khan, A., et al. (2023). “Ignore This Title and HackAPrompt: Exposing Systemic Weaknesses of LLMs through a Global Scale Prompt Hacking Competition.” EMNLP 2023. arxiv.org

  6. Shankar, S., Garcia, R., Hellerstein, J. M., & Parameswaran, A. G. (2024). “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences.” arxiv.org

  7. Morgan, C. (1994). Programming from Specifications. 2nd ed. Prentice Hall.

  8. Spivey, J. M. (1992). The Z Notation: A Reference Manual. 2nd ed. Prentice Hall. spivey.oriel.ox.ac.uk

  9. Lamport, L. (2002). Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley.

  10. Abrial, J.-R. (1996). The B-Book: Assigning Programs to Meanings. Cambridge University Press.

  11. Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” IEEE BigData 2017. research.google

  12. Huyen, C. (2025). AI Engineering: Building Applications with Foundation Models. O’Reilly Media. oreilly.com

  13. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). “Constitutional AI: Harmlessness from AI Feedback.” Anthropic. arxiv.org

  14. OpenAI. (2025). “GPT-5 System Card.” cdn.openai.com

  15. OpenAI & Anthropic. (2025). “Findings from a pilot Anthropic–OpenAI alignment evaluation exercise.” openai.com

  16. Braintrust. (2025). “AI Evaluation and Observability Platform.” braintrust.dev

  17. LangSmith. (2025). “LLM Application Observability and Testing.” LangChain. smith.langchain.com