Software has always been hard to get right. That is why test-driven development exists [1], why formal methods were invented [2], why code review is widespread, and why entire careers are built around software quality. Humans are non-deterministic producers of code — we make mistakes, hold incorrect mental models, and write bugs under time pressure. None of this is new.
What software engineering has always relied on, though, is that humans have legal status. When a developer writes a bug that causes a production outage, there is a chain of accountability: the developer, the reviewer, the team lead, the company. Someone is responsible. Insurance covers the liability. Contracts define remedies. Employment law creates obligations. The developer may not have written correct code, but they are a legal person who can be held to account — and that matters, because it creates an incentive structure that drives quality even when verification is expensive.
AI coding assistants are not legal persons. When an LLM generates code that causes a failure, the terms of service of every major provider explicitly disclaim liability for the output [3]. The model does not know why the code works. The model does not take ownership of the code. The model will not be available to explain the design decisions six months later when something breaks. And the human who accepted the generated code may not understand it deeply enough to take accountability either — particularly when the code was generated at speed, across dozens of files, in a session that scrolled past faster than it could be reviewed.
This is the accountability gap: the distance between the code that ships and the person who stands behind it.
The Gap Is Not About Code Quality
This is important to say clearly: the accountability gap is not a claim that AI-generated code is worse than human-written code. AI-generated code is probably better than the median human output already — more consistent, more thoroughly documented, less likely to contain the copy-paste errors and off-by-one mistakes that plague tired developers at 2am. The code quality may be higher. The accountability is lower.
The gap exists because verification and accountability are different things. Verification asks: “does this code do what it should?” Accountability asks: “who is responsible when it doesn’t?” You can have perfect verification with no accountability (a fully automated pipeline with no human in the loop), or perfect accountability with no verification (a developer who signs off on untested code). Both are failure modes.
The pre-AI world had a rough equilibrium: humans wrote code, humans reviewed it, humans were accountable for it. Verification was often insufficient — most teams had less test coverage than they wanted, fewer formal specs than they needed, and more technical debt than they admitted. But the accountability chain was clear. If something went wrong, you could trace it to a decision made by a person.
What Changes
AI coding changes both sides of this equation simultaneously:
Verification becomes more important. When the producer of code is a model that cannot explain its reasoning, the burden of verification shifts entirely to the consumer. You cannot ask the LLM “why did you write it this way?” and get a reliable answer — the model will confabulate a plausible explanation whether or not it reflects the actual generation process [4]. The only reliable evidence is what the tests prove, what the specs guarantee, and what the reviewers catch. In our view, verification moves from “nice to have” to “load-bearing.”
Verification becomes more obtainable. The same AI that produces unaccountable code can also help you verify it. Writing a Z specification used to take hours of skilled effort — now it takes minutes with an AI assistant that knows the notation [5]. Running a comprehensive test suite used to require careful manual construction — now the agent can generate and iterate on tests while you review. The time cost of formal methods, TDD, and structured requirements collapses. What remains is the payoff: code you can verify, not just trust.
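Concretely, an "executable specification" can be as small as a handful of properties the generated code must satisfy. A minimal sketch in Python — `slugify` and its three properties are our own hypothetical illustration, not from any cited source. The human states the properties; whatever the agent generates must pass them:

```python
# Sketch of an executable specification: the human writes the
# properties, the agent drafts an implementation, and the checks
# below are the evidence it meets them. `slugify` is hypothetical.

import re
import string

def slugify(title: str) -> str:
    """Candidate implementation (imagine the agent wrote this)."""
    slug = title.lower().strip()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")

def check_spec(inputs: list[str]) -> None:
    for title in inputs:
        slug = slugify(title)
        # Property 1: output uses only lowercase letters, digits, hyphens.
        assert set(slug) <= set(string.ascii_lowercase + string.digits + "-")
        # Property 2: no leading, trailing, or doubled separators.
        assert not slug.startswith("-") and not slug.endswith("-")
        assert "--" not in slug
        # Property 3: idempotence — slugifying a slug changes nothing.
        assert slugify(slug) == slug

check_spec(["Hello, World!", "  AI & Accountability  ", "2026 Roadmap"])
print("all properties hold")
```

The point is not this particular function — it is that the properties are cheap to write, survive regeneration of the implementation, and give the reviewer something to check against.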
The accountability chain thins. In a traditional team, code passes through multiple humans: developer, reviewer, QA, release manager. Each adds a layer of accountability. In an AI-accelerated workflow, one developer with an agent can produce and ship code that previously required a team. The code may be correct. But the accountability surface — the number of humans who understood and endorsed the code — shrinks. Faros.ai’s telemetry across 10,000+ developers shows PR review time increased 91% even as code generation accelerated [6]: the bottleneck migrated to the one remaining human checkpoint.
The Legal Dimension
The terms of service of major AI coding assistants follow a consistent pattern. GitHub Copilot’s terms: “You retain ownership of Your Code and you retain responsibility for Suggestions you include in Your Code” [3]. Anthropic’s terms: similar disclaimers on output quality and fitness. OpenAI’s terms: the user is responsible for evaluating and using the output [7].
This is reasonable from the providers’ perspective — they cannot guarantee the correctness of probabilistic output. But it means the legal accountability for AI-generated code falls entirely on the user who accepted it. If you ship code that an LLM wrote, you are as liable as if you wrote it yourself — but potentially with less understanding of what it does and why.
The emerging legal landscape is still forming. The EU AI Act [8] classifies AI systems by risk level but does not yet create clear liability chains for AI-assisted code. Professional indemnity insurance for software engineering does not distinguish between human-written and AI-generated code — yet. We are not lawyers, and this landscape is evolving fast. But our read is that the accountability gap will eventually need to be filled by some combination of legal frameworks, industry standards, and engineering practice. Of the three, engineering practice is the one you can control today. Teams like Entire.io [11] are building provenance infrastructure that captures the reasoning behind AI-generated code, and SageOx [12] is working on persistent context for agentic sessions — both are working on pieces of this puzzle from different angles.
Closing the Accountability Gap
The accountability gap does not close by avoiding AI coding — the productivity gains are too significant, and the competitive pressure too real. It closes by building engineering practices that restore the accountability chain:
Verify what you ship. When a human wrote the code, you could sometimes rely on their judgment as a partial substitute for verification. When an LLM writes the code, judgment is not available — only evidence. Every claim about the code’s behavior needs to be backed by a test, a spec, or a proof. This is not a higher bar than good engineering practice always demanded — it is the bar that time pressure used to let teams duck under.
Record the provenance. Design decision logs [9], commit messages, and specification documents are no longer just documentation — they are the audit trail that connects shipped code to human intent. When an AI generated the code, the specification that prompted it and the review that accepted it become the primary evidence of accountability. The specification is the human saying: “this is what I intended.” The review is the human saying: “this is what I endorsed.”
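What such a trail might look like in practice — the field names and structure below are our own sketch, not a standard or any cited tool's format. The point is that intent, generation, and endorsement each leave an explicit, queryable trace:

```python
# A minimal sketch of a provenance record for one AI-generated change.
# All field names here are illustrative assumptions, not a standard.

import json
from datetime import datetime, timezone

def provenance_record(commit, spec_doc, model, reviewer, verdict):
    return {
        "commit": commit,              # the shipped artifact
        "specification": spec_doc,     # "this is what I intended"
        "generated_by": model,         # the tool, which carries no liability
        "reviewed_by": reviewer,       # "this is what I endorsed"
        "review_verdict": verdict,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record(
    commit="4f2a9c1",
    spec_doc="specs/payment-retry.md",
    model="example-llm-2026-01",
    reviewer="jane@example.com",
    verdict="approved: all spec properties covered by tests",
)
print(json.dumps(record, indent=2))
```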
Learn the techniques. The accountability gap is widest when the human in the loop does not understand the verification methods available to them. Test-driven development [1], formal specification [2], structured requirements [10] — these techniques have always produced trustworthy software. They were under-adopted because of time cost and learning curve. AI collapses both. An agent that knows Z notation can teach you the notation while drafting the spec. An agent that knows TDD can scaffold the test harness while explaining the design. The tools become tutors — not just doing the work, but teaching the discipline. The accountability gap narrows when the human who accepts the code genuinely understands what it does, why it was built that way, and how to verify that it works.
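The test-first discipline mentioned above can be seen in miniature. In this sketch — `backoff_delays` is a hypothetical helper, not from the article — the tests are written first to encode intent, and the implementation exists only to satisfy them:

```python
# Sketch of a test-first scaffold: the tests state intended behavior
# before any implementation exists. `backoff_delays` is hypothetical.

def test_doubles_until_cap():
    # Written first: this encodes the human's intent.
    assert backoff_delays(1.0, 5, 8.0) == [1.0, 2.0, 4.0, 8.0, 8.0]

def test_zero_retries_means_no_delays():
    assert backoff_delays(1.0, 0, 8.0) == []

# Implementation, written second, only to make the tests above pass.
def backoff_delays(base: float, retries: int, cap: float) -> list[float]:
    """Exponential backoff schedule: base * 2**n, capped at `cap`."""
    return [min(base * (2 ** n), cap) for n in range(retries)]

test_doubles_until_cap()
test_zero_retries_means_no_delays()
print("tests pass")
```

An agent can draft both halves, but the ordering is what transfers understanding: the human reads and endorses the tests before the implementation arrives.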
We think the question is not whether AI should write code. It already does. The question is whether the humans who ship that code can stand behind it — with evidence, not hope. We don’t know how this plays out at scale — our experience is limited to a small team. But the accountability gap is real in our own workflow, and these practices are how we’re trying to close it.
References
1. Beck, K. (2002). Test Driven Development: By Example. Addison-Wesley.
2. Woodcock, J., Larsen, P. G., Bicarregui, J., & Fitzgerald, J. (2009). "Formal Methods: Practice and Experience." ACM Computing Surveys, 41(4). doi.org
3. GitHub. (2024). "GitHub Terms for Additional Products and Features." Section on GitHub Copilot. docs.github.com
4. Ji, Z., Lee, N., Frieske, R., et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 55(12). doi.org
5. Freeman, J. (2026). "AI Coding + Grounding and Formal Methods = Agentic Software Engineering." Punt Labs. punt-labs.com
6. Faros.ai. (2025). "AI Software Engineering: What the Data Actually Shows." Telemetry from 10,000+ developers across 1,255 teams. faros.ai
7. OpenAI. (2024). "Terms of Use." Section 3: Content. openai.com
8. European Parliament. (2024). "Regulation (EU) 2024/1689 — Artificial Intelligence Act." eur-lex.europa.eu
9. Freeman, J. (2026). "Design Decision Logs When AI Is Your Co-Author." Punt Labs. punt-labs.com
10. Cockburn, A. (2000). Writing Effective Use Cases. Addison-Wesley.
11. Entire.io. (2025). Provenance infrastructure for AI-generated code. entire.io
12. SageOx. (2025). Persistent context infrastructure for agentic sessions. sageox.com