Uncovering systematic failures of LLMs in verifying code against natural language specifications

Note

LLMs exhibit a systematic over-correction bias when verifying code against natural language specifications. GPT-4o's accuracy drops from 52.4% to 11.0% under complex prompts.

Details

Book/Proceedings
ASE 2025: 40th IEEE/ACM International Conference on Automated Software Engineering

Citation Key

jin2025overcorrection