The team led by Sven Apel, Professor of Software Engineering at Saarland University, and Dr. Mariya Toneva, researcher at the Max Planck Institute for Software Systems, investigated how humans and large language models respond to confusing program code. Such confusing constructs, known as "atoms of confusion," are well studied: they are short, syntactically correct programming patterns that mislead human readers and can throw even experienced developers off track.
To find out whether LLMs and humans “think” about the same stumbling blocks, the research team used an interdisciplinary approach: On the one hand, they used data from an earlier study by Apel and colleagues, in which participants read confusing and clean code variants while their brain activity and attention were measured using electroencephalography (EEG) and eye tracking. On the other hand, they analyzed the “confusion,” or model uncertainty, of LLMs using so-called perplexity values. Perplexity is an established metric for evaluating language models: it quantifies how uncertain a model is when predicting a sequence of text tokens, based on the probabilities it assigns to them.
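To illustrate the idea, the following is a minimal sketch of how per-token uncertainty and overall perplexity can be computed with an off-the-shelf causal language model. The model name and the code snippet are placeholders chosen for demonstration, not the setup used in the study.

```python
# Sketch: per-token surprisal and perplexity of a code snippet.
# "gpt2" is a placeholder model; the study's models may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

code = 'int v = 1;\nwhile (v <= 128) { printf("%d ", v); v <<= 1; }'

# Tokenize and obtain next-token log-probabilities.
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Shift so that each position predicts the *next* token.
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
targets = inputs["input_ids"][:, 1:]
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Per-token surprisal (negative log-probability) and overall perplexity.
surprisal = -token_log_probs[0]
perplexity = torch.exp(surprisal.mean())
print(f"perplexity: {perplexity.item():.2f}")
for tok, s in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()), surprisal):
    print(f"{tok!r:>12}  surprisal = {s.item():.2f}")
```

Tokens with unusually high surprisal mark the places where the model is most “surprised” by the code it reads.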
The result: Wherever humans got stuck on code, the LLM also showed increased perplexity. EEG signals from participants, in particular the so-called late frontal positivity, which in language research is associated with unexpected sentence endings, rose precisely where the language model’s uncertainty spiked. “We were astounded that the peaks in brain activity and model uncertainty showed significant correlations,” says Youssef Abdelsalam, a doctoral researcher advised by Toneva and Apel who played a central role in conducting the study.
Based on this similarity, the researchers developed a data-driven method that automatically detects and highlights unclear parts of code. In more than 60 percent of cases, the algorithm successfully identified known, manually annotated confusing patterns in the test code and even discovered more than 150 new, previously unrecognized patterns that also coincided with increased brain activity.
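As a hypothetical sketch of how such highlighting could work (not the authors’ published algorithm), one can flag tokens whose surprisal stands out from the rest of a snippet. The `token_surprisal` values are assumed to come from a language model as in the earlier example; the threshold and the example numbers are invented for illustration.

```python
# Hypothetical sketch: flag tokens whose surprisal exceeds the snippet's
# mean by a chosen margin, as candidate "confusing" regions.
def flag_confusing_spans(tokens, token_surprisal, std_factor=1.5):
    """Return (token, surprisal) pairs whose surprisal is unusually high."""
    mean = sum(token_surprisal) / len(token_surprisal)
    var = sum((s - mean) ** 2 for s in token_surprisal) / len(token_surprisal)
    threshold = mean + std_factor * var ** 0.5
    return [(t, s) for t, s in zip(tokens, token_surprisal) if s > threshold]

# Made-up numbers: the compound shift-assignment operator stands out.
tokens = ["int", "v", "=", "1", ";", "v", "<<=", "1", ";"]
surprisal = [2.1, 1.8, 0.9, 1.2, 0.5, 1.7, 6.3, 1.1, 0.6]
print(flag_confusing_spans(tokens, surprisal))  # -> [('<<=', 6.3)]
```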
“With this work, we are taking a step toward a better understanding of the alignment between humans and machines,” says Max Planck researcher Mariya Toneva. “If we know when and why LLMs and humans stumble in the same places, we can develop tools that make code more understandable and significantly improve human–AI collaboration,” adds Professor Sven Apel.
Through their project, the researchers are building a bridge between neuroscience, software engineering, and artificial intelligence. The study, currently available as a preprint, has been accepted for publication at the International Conference on Software Engineering (ICSE), one of the world’s leading conferences in the field of software engineering. The conference will take place in Rio de Janeiro in April 2026. The authors of the study are Youssef Abdelsalam, Norman Peitek, Anna-Maria Maurer, Mariya Toneva, and Sven Apel.
Preprint:
Y. Abdelsalam, N. Peitek, A.-M. Maurer, M. Toneva, S. Apel (2025): “How do Humans and LLMs Process Confusing Code?” arXiv:2508.18547v1 [cs.SE], August 25, 2025. https://arxiv.org/abs/2508.18547
Further information:
Chair of Software Engineering: https://www.se.cs.uni-saarland.de
Max Planck research group “Bridging AI and Neuroscience”: https://mtoneva.com/index.html
Scientific contacts:
Prof. Dr. Sven Apel
Chair of Software Engineering
Saarland University
Tel.: +49 681 302 57211
E-mail: apel(at)cs.uni-saarland.de
Dr. Mariya Toneva
Head of the Research Group “Bridging AI and Neuroscience”
Max Planck Institute for Software Systems
Tel.: +49 681 9303 9801
E-mail: mtoneva@mpi-sws.org
Editorial contact:
Philipp Zapf-Schramm
Saarland Informatics Campus
Tel.: +49 681 9325 4509
E-mail: pzs@mpi-klsb.mpg.de