AI coding agents find the right file but miss the exact lines that matter, study shows

Originally reported by The Decoder

"AI coding agents struggle to find crucial code lines, despite locating the right files. This weakness undermines their overall performance."

Shanghai Jiao Tong University researchers have identified a significant flaw in AI coding agents, which are widely used to fix bugs and improve software. These agents can locate the correct files but often fail to pinpoint the exact lines of code that matter, leading to subpar performance. This discovery was made possible by SWE-Explore, a new benchmark that separates code search from the actual fix, exposing a hidden weakness in AI coding agents.

The research team, involving international collaborators, created SWE-Explore to evaluate the first phase of the AI coding process. In this phase, an agent receives a bug description and a software project, then returns a ranked list of code sections it considers relevant. By analyzing the results, the researchers found that AI coding agents land in the right neighborhood but miss the crucial spots. This weakness is not immediately apparent, as the outcome of the coding process often hides what actually went wrong.

To conduct their study, the researchers used a dataset of 848 problems from 203 open-source projects across ten programming languages. Python dominated the dataset with 547 tasks, followed by Go, JavaScript, and Rust. The comparison pitted traditional search methods against five general-purpose coding agents, including Claude Code, Codex, and OpenHands, along with four research systems built specifically for code search. The results showed that old-school keyword search barely beats chance, as a bug description like "RuntimeWarning on Overflow" contains terms that appear more frequently in project templates and docs than in the actual source code.
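A toy illustration of how keyword matching can go wrong in this way. It assumes a plain term-count scorer, which is not necessarily the baseline the study used, and the file names and contents are invented for the example:

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query: str, document: str) -> int:
    """Count how often the query's distinct terms occur in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[term] for term in set(tokenize(query)))


def rank_files(query: str, files: dict[str, str]) -> list[str]:
    """Return file paths sorted by descending keyword score."""
    return sorted(files, key=lambda path: keyword_score(query, files[path]), reverse=True)


# Hypothetical repository contents, made up for this sketch.
files = {
    "docs/bug_report_template.md": (
        "Describe the RuntimeWarning or overflow. "
        "Paste the RuntimeWarning text. Overflow on which platform?"
    ),
    "src/stats.py": "def scale(x):\n    return np.exp(x) * factor",
}

ranking = rank_files("RuntimeWarning on Overflow", files)
print(ranking[0])  # docs/bug_report_template.md
```

The source file that would actually trigger the warning never mentions the words in the bug description, so it scores zero while the template ranks first.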

The AI agents, on the other hand, search the project step by step instead of sorting all hits at once, which allows them to pull ahead clearly. At the file level, the agents perform well: they find the right source file, rank it early, and keep the selection tight. However, when the test zooms in to individual lines of code, performance falls apart. General coding agents cover only 14 to 19 percent of the lines that actually matter. According to the researchers, throwing a more powerful language model at the problem doesn't fix it.
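The gap between the two levels can be sketched with two simple metrics. The definitions below are assumptions for illustration (the `Span` type, `file_hit`, and `line_coverage` are hypothetical names, and the benchmark's exact scoring may differ), and the example spans are invented:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A contiguous range of lines in one file, inclusive on both ends."""
    path: str
    start: int
    end: int

    def lines(self) -> set[tuple[str, int]]:
        return {(self.path, n) for n in range(self.start, self.end + 1)}


def file_hit(predicted: list[Span], gold: list[Span]) -> bool:
    """True if any predicted span lies in a file that contains relevant lines."""
    gold_files = {span.path for span in gold}
    return any(span.path in gold_files for span in predicted)


def line_coverage(predicted: list[Span], gold: list[Span]) -> float:
    """Fraction of relevant lines that fall inside some predicted span."""
    gold_lines = set().union(*(span.lines() for span in gold))
    if not gold_lines:
        return 0.0
    predicted_lines = set().union(*(span.lines() for span in predicted))
    return len(gold_lines & predicted_lines) / len(gold_lines)


# Invented example: 20 relevant lines, of which the agent's spans touch 3.
gold = [Span("src/stats.py", 40, 59)]
predicted = [Span("src/stats.py", 10, 42), Span("src/utils.py", 1, 30)]

hit = file_hit(predicted, gold)
coverage = line_coverage(predicted, gold)
print(hit, f"{coverage:.0%}")  # True 15%
```

In this made-up case the agent gets full credit at the file level while covering only 3 of the 20 lines that matter, which is the kind of mismatch a file-level score alone would hide.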

The team ran the same agent with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. The GPT family leads, but the pattern holds: file hit rates stay consistently higher than actual line coverage, and the various agent architectures land strikingly close to each other. The CoSIL research system is the outlier, as it scans code as a network of interconnected building blocks and achieves much higher line coverage. Among the specialized localization systems, AutoCodeRover works precisely but stays conservative, while OrcaLoca produces little noise but misses many relevant spots.

In a controlled ablation experiment, the team artificially varied the context so that the repair model saw only 0, 25, 50, 75, or 100 percent of the core regions, sometimes padded with irrelevant non-core code. For the easier tasks in the dataset, a clear threshold effect appears. As long as less than half of the necessary core regions are visible, repairs mostly fail. The success rate jumps only between 50 and 75 percent coverage. Fixes don't improve gradually; they need a minimum amount of clues before anything clicks.
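A minimal sketch of how such a context could be assembled, assuming regions are sampled at random with a fixed seed. The function `build_context`, its parameters, and the region names are hypothetical, and the study's actual selection procedure may differ:

```python
import random


def build_context(core: list[str], non_core: list[str], fraction: float,
                  n_padding: int = 0, seed: int = 0) -> list[str]:
    """Pick a fraction of the core regions and optionally pad with non-core ones."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_core = round(fraction * len(core))
    visible = rng.sample(core, n_core)
    padding = rng.sample(non_core, min(n_padding, len(non_core)))
    context = visible + padding
    rng.shuffle(context)  # avoid signalling relevance through position
    return context


# Placeholder region identifiers, invented for this sketch.
core = ["core_a", "core_b", "core_c", "core_d"]
non_core = ["noise_1", "noise_2", "noise_3"]

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    context = build_context(core, non_core, fraction, n_padding=2)
    n_visible = sum(region in core for region in context)
    print(f"{fraction:.0%}: {n_visible} core regions, {len(context) - n_visible} padding")
```

Each resulting context would then be handed to the repair model, so that repair success can be compared across coverage levels with and without padding.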

For harder tasks, the effect is much narrower, and even better context doesn't help much. Once the critical spots are available, irrelevant extra code barely gets in the way. An agent that reads too little code will struggle to fix the bug, but providing more context doesn't necessarily lead to better results. This finding has significant implications for the development of AI coding agents, as it highlights the need to improve their ability to locate crucial code lines.

The findings also underscore the value of evaluating AI coding agents beyond their end-to-end results. By separating code search from the actual fix, SWE-Explore gives a more nuanced picture of where these agents are strong and where they are weak. As coding agents become more widespread, improving their ability to locate the crucial lines, and not just the right files, could make them more effective at supporting software development and maintenance.