The team led by Sven Apel, Professor of Software Engineering at Saarland University, and Dr. Mariya Toneva, researcher at the Max Planck Institute for Software Systems, investigated how humans and large language models respond to confusing program code. Such confusing constructs, known as "atoms of confusion," are well studied: they are short, syntactically correct programming patterns that mislead human readers and can throw even experienced developers off track.
To find out whether LLMs and humans “think” about the same stumbling blocks, the research team used an interdisciplinary approach: On the one hand, they used data from an earlier study by Apel and colleagues, in which participants read confusing and clean code variants while their brain activity and attention were measured using electroencephalography (EEG) and eye tracking. On the other hand, they analyzed the “confusion,” or model uncertainty, of LLMs using so-called perplexity values. Perplexity is an established metric for evaluating language models: it quantifies how uncertain a model is when predicting a sequence of text tokens, based on the probabilities it assigns to them.
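To illustrate the idea, the following is a minimal sketch of how per-token uncertainty and overall perplexity can be computed with an off-the-shelf causal language model. The model name and the code snippet are placeholders chosen for demonstration, not the setup used in the study.

```python
# Sketch: per-token surprisal and perplexity of a code snippet.
# "gpt2" is a placeholder model; the study's models may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

code = 'int v = 1;\nwhile (v <= 128) { printf("%d ", v); v <<= 1; }'

# Tokenize and obtain next-token log-probabilities.
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Shift so that each position predicts the *next* token.
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
targets = inputs["input_ids"][:, 1:]
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Per-token surprisal (negative log-probability) and overall perplexity.
surprisal = -token_log_probs[0]
perplexity = torch.exp(surprisal.mean())
print(f"perplexity: {perplexity.item():.2f}")
for tok, s in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()), surprisal):
    print(f"{tok!r:>12}  surprisal = {s.item():.2f}")
```

Tokens with unusually high surprisal mark the places where the model is most “surprised” by the code it reads.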
The result: Wherever humans got stuck on code, the LLM also showed increased perplexity. EEG signals from participants, in particular the so-called late frontal positivity, which in language research is associated with unexpected sentence endings, rose precisely where the language model’s uncertainty spiked. “We were astounded that the peaks in brain activity and model uncertainty showed significant correlations,” says Youssef Abdelsalam, a doctoral researcher advised by Toneva and Apel who played a central role in conducting the study.
Based on this similarity, the researchers developed a data-driven method that automatically detects and highlights unclear parts of code. In more than 60 percent of cases, the algorithm successfully identified known, manually annotated confusing patterns in the test code and even discovered more than 150 new, previously unrecognized patterns that also coincided with increased brain activity.
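As a hypothetical sketch of how such highlighting could work (not the authors’ published algorithm), one can flag tokens whose surprisal stands out from the rest of a snippet. The `token_surprisal` values are assumed to come from a language model as in the earlier example; the threshold and the example numbers are invented for illustration.

```python
# Hypothetical sketch: flag tokens whose surprisal exceeds the snippet's
# mean by a chosen margin, as candidate "confusing" regions.
def flag_confusing_spans(tokens, token_surprisal, std_factor=1.5):
    """Return (token, surprisal) pairs whose surprisal is unusually high."""
    mean = sum(token_surprisal) / len(token_surprisal)
    var = sum((s - mean) ** 2 for s in token_surprisal) / len(token_surprisal)
    threshold = mean + std_factor * var ** 0.5
    return [(t, s) for t, s in zip(tokens, token_surprisal) if s > threshold]

# Made-up numbers: the compound shift-assignment operator stands out.
tokens = ["int", "v", "=", "1", ";", "v", "<<=", "1", ";"]
surprisal = [2.1, 1.8, 0.9, 1.2, 0.5, 1.7, 6.3, 1.1, 0.6]
print(flag_confusing_spans(tokens, surprisal))  # -> [('<<=', 6.3)]
```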
“With this work, we are taking a step toward a better understanding of the alignment between humans and machines,” says Max Planck researcher Mariya Toneva. “If we know when and why LLMs and humans stumble in the same places, we can develop tools that make code more understandable and significantly improve human–AI collaboration,” adds Professor Sven Apel.
Through their project, the researchers are building a bridge between neuroscience, software engineering, and artificial intelligence. The study, currently available as a preprint, has been accepted for publication at the International Conference on Software Engineering (ICSE), one of the world’s leading conferences in the field of software engineering. The conference will take place in Rio de Janeiro in April 2026. The authors of the study are Youssef Abdelsalam, Norman Peitek, Anna-Maria Maurer, Mariya Toneva, and Sven Apel.
Preprint:
Y. Abdelsalam, N. Peitek, A.-M. Maurer, M. Toneva, S. Apel (2025): “How do Humans and LLMs Process Confusing Code?” arXiv:2508.18547v1 [cs.SE], August 25, 2025. https://arxiv.org/abs/2508.18547
Further information:
Chair of Software Engineering: https://www.se.cs.uni-saarland.de
Max Planck research group “Bridging AI and Neuroscience”: https://mtoneva.com/index.html
Scientific contacts:
Prof. Dr. Sven Apel
Chair of Software Engineering
Saarland University
Tel.: +49 681 302 57211
E-mail: apel(at)cs.uni-saarland.de
Dr. Mariya Toneva
Head of the Research Group “Bridging AI and Neuroscience”
Max Planck Institute for Software Systems
Tel.: +49 681 9303 9801
E-mail: mtoneva@mpi-sws.org
Editorial contact:
Philipp Zapf-Schramm
Saarland Informatics Campus
Tel.: +49 681 9325 4509
E-mail: pzs@mpi-klsb.mpg.de