Open source tools for software engineering teams using AI

Uncovering systematic failures of LLMs in verifying code against natural language specifications

Haolin Jin and Huaming Chen | inproceedings | 2025 | Source: z-spec

Note

LLMs exhibit a systematic over-correction bias when verifying code against natural language specifications. GPT-4o's accuracy drops from 52.4% to 11.0% under complex prompts.

Book/Proceedings: ASE 2025: 40th IEEE/ACM International Conference on Automated Software Engineering

jin2025overcorrection