
AI Coding + Grounding and Formal Methods = Agentic Software Engineering

The formal methods community has been working on verified software for decades. Leslie Lamport’s TLA+ [1] specifies concurrent systems. Alloy [2] models structural properties. Lean [3] and Coq [4] prove theorems about programs. Dafny [5] verifies code against specifications at compile time. This is deep, serious work with a long lineage — and none of these notations are competing. Each serves different verification needs.

One significant barrier to adoption was always the cost. Writing a formal spec for a notification scheduler or an authentication flow takes hours of skilled effort. Most teams can’t justify that investment for most features — and the time required to learn and apply these techniques is time most businesses never give their engineers. The specification stays in the textbook.

We hypothesize that AI changes that — not just the cost of writing specs, but the cost of learning and consistently applying them. If an AI agent can teach you the notation, draft an initial spec, and help you refine it, formal methods become approachable and practical for ordinary projects. Our Z Spec plugin is how we’re testing that hypothesis — starting with Z, with early support for B, and more notations to follow.

The Notation

The notation we’re using is Z — a schema language grounded in set theory and first-order predicate logic, developed at the University of Oxford and standardized as ISO/IEC 13568 [6]. Our interest in Z comes from Andrew Simpson’s Software Engineering Mathematics and State-Based Modelling courses in Oxford’s Department of Computer Science [10], where formal specification is taught as a practical engineering discipline — not just theory, but a tool for reasoning about systems before building them.

A Z schema describes a system as a set of states, invariants that constrain those states, and operations that transition between them. When the type-checker (fuzz [7]) accepts a specification, every declaration and predicate is well-typed and the schemas fit together coherently. When the animator (ProB [8]) explores the state space, you see every reachable configuration — including ones you forgot to think about.
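For a flavor of the notation, here is a minimal Z schema pair in fuzz-style LaTeX (a hypothetical counter, not taken from the plugin's output): the first schema declares two state variables and constrains them with an invariant, and the second defines an operation over that state.

```latex
\begin{schema}{Counter}
  correct, attempts : \nat
\where
  correct \leq attempts
\end{schema}

\begin{schema}{RecordCorrect}
  \Delta Counter
\where
  attempts' = attempts + 1 \\
  correct' = correct + 1
\end{schema}
```

Here `\Delta Counter` declares that the operation may change `Counter`'s state, and primed names denote post-state values; fuzz checks that every declaration and predicate is type-correct.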

Not Waterfall

Z Spec is not “write a spec, then generate code.” That’s waterfall with extra steps.

The plugin has three core commands that work in any direction:

  • model2code — Generate an implementation from a Z spec. The spec defines the acceptance criteria; Claude [9] writes code that satisfies the schema.
  • code2model — Extract a Z spec from existing code. Now you have a formal description of what your code actually does, which may differ from what you intended.
  • audit — Compare an existing spec against existing code. Find where implementation diverges from specification.

You can start from either end. Write a spec first and generate code from it. Or write code first and extract a spec to verify your assumptions. Or do both independently and audit the gap. The workflow is iterative, not sequential.

Case Study: Koch Trainer

Koch Trainer is an iOS Morse code app where formal specification drove the implementation iteratively — not all at once, but domain by domain as complexity demanded it.

The first spec modeled training sessions and learner state — spaced repetition scheduling, character progression, accuracy tracking. That spec was type-checked, animated, and used to generate the core Swift implementation. Then the half-duplex radio simulation (the kind found in ham radio rigs) needed its own state model — transmit/receive switching, squelch behavior, timing constraints. A second spec captured that domain. Finally, the keyer — a deeply complex electromechanical device with iambic squeeze logic, dit/dah memory, and precise timing ratios — got its own spec. Each domain was specified and implemented independently, then composed in the app. The specs are committed to the repository as both LaTeX source and compiled PDFs — equally readable by the developer reviewing a pull request and the agent generating the implementation.

The audit maps every Z invariant, precondition, effect, and bound to its corresponding test. It found that the half-duplex radio model had no guard preventing keying while in receive mode — a constraint the Z spec expressed as ¬(radioMode = receiving ∧ toneActive). The session counter could theoretically report more correct answers than total attempts — the spec’s invariant correct ≤ attempts made that visible. Neither bug had been caught by existing tests because no test happened to exercise those specific state combinations.
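A sketch of how those two constraints might read in fuzz-style Z (hypothetical schema and type names; the actual Koch Trainer spec may organize its state differently):

```latex
\begin{zed}
  MODE ::= transmitting | receiving \\
  FLAG ::= on | off
\end{zed}

\begin{schema}{Radio}
  radioMode : MODE \\
  toneActive : FLAG
\where
  \lnot (radioMode = receiving \land toneActive = on)
\end{schema}

\begin{schema}{Session}
  correct, attempts : \nat
\where
  correct \leq attempts
\end{schema}
```

Because ProB enumerates reachable states rather than running hand-picked inputs, combinations like "tone active while receiving" surface even when no test happens to exercise them.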

The fix wasn’t more tests. It was domain types that make violations structurally impossible. Radio.key() throws unless radioMode = transmitting. SessionCounter.recordAttempt() always increments attempts before correctness. The spec identified the missing constraints; enforcing those constraints through the type system brought coverage from 71% to 89%.
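A minimal Swift sketch of what such domain types could look like (hypothetical names and simplified logic, not the actual Koch Trainer implementation):

```swift
enum RadioMode { case transmitting, receiving }

struct KeyingError: Error {}

final class Radio {
    private(set) var mode: RadioMode = .receiving
    private(set) var toneActive = false

    func set(mode newMode: RadioMode) {
        mode = newMode
        // Invariant from the spec: no active tone while receiving.
        if newMode == .receiving { toneActive = false }
    }

    // Guard from the spec: keying is only legal in transmit mode.
    func key() throws {
        guard mode == .transmitting else { throw KeyingError() }
        toneActive = true
    }
}

struct SessionCounter {
    private(set) var attempts = 0
    private(set) var correct = 0

    // attempts is incremented before correctness is recorded,
    // so correct <= attempts holds by construction.
    mutating func recordAttempt(wasCorrect: Bool) {
        attempts += 1
        if wasCorrect { correct += 1 }
    }
}
```

The point is that the invariant lives in the type's API, not in a test: callers cannot construct a state the spec forbids, so the whole class of bug disappears rather than one instance of it.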

Eric Bowman, CTO of King, used Z Spec’s audit on an iOS notification scheduler and found two bugs in the same category — the system was notifying when it shouldn’t have been. The spec’s invariants caught them structurally, not through test cases that happened to cover the right scenario. As he put it: “Z-spec skill is a real game changer… Not sure whether to sell this pick to the miners or use it to go after gold.”

Why Now

Hand-writing a Z spec for a notification scheduler was always the right approach — it just took more time to learn and apply than most teams had. When Claude produces an initial draft that fuzz can verify and you can refine — a cycle that takes minutes instead of hours — the cost collapses. What remains is the payoff: a mathematically precise definition that catches an entire class of bugs, not just the specific inputs you thought to test.

Formal specification — which was always the rigorous approach — finally fits in the timeline of ordinary feature work. The audience isn’t people who should have been doing this all along. It’s practitioners who knew the value but faced real constraints. We believe that AI can remove those constraints.

It’s also a question we’re pursuing through a master’s dissertation in Oxford’s Professional Software Engineering programme.

We’re going after gold.

References

  1. Lamport, L. “TLA+: Temporal Logic of Actions.” lamport.azurewebsites.net
  2. Jackson, D. “Alloy: A Language and Tool for Exploring Software Designs.” alloytools.org
  3. “Lean Theorem Prover.” lean-lang.org
  4. “The Coq Proof Assistant.” coq.inria.fr
  5. Microsoft Research. “Dafny.” dafny.org
  6. ISO. “ISO/IEC 13568: Z Formal Specification Notation.” 2002. iso.org
  7. Spivey, M. “fuzz: The Z Type Checker.” spivey.oriel.ox.ac.uk
  8. “ProB: The Animator and Model Checker.” prob.hhu.de
  9. Anthropic. “Claude Code.” 2024–present. docs.anthropic.com
  10. Simpson, A. “Software Engineering Mathematics” and “State-Based Modelling.” Department of Computer Science, University of Oxford. cs.ox.ac.uk