11/24/2025

Natural language is more complex than it strictly needs to be – and for good reason

Portrait photo of Prof. Dr. Michael Hahn. © Thorsten Mohr

Human languages are complex, rich and varied. From an information-theoretic perspective, however, they could convey the same information in a much more compact form. So why don't we speak 'digitally', encoding information in strings of ones and zeros like a computer? Michael Hahn, a linguist from Saarbrücken, has explored this question together with a research colleague from the US. They have developed a model that explains why we speak the way we do.

Their findings have recently been published in Nature Human Behaviour.

Human languages are complex phenomena. Around 7,000 languages are spoken worldwide, some with only a handful of remaining speakers while others, such as Chinese, English, Spanish and Hindi, are spoken by billions. Despite their profound differences, they all share a common function: they convey information by combining individual words into phrases – groups of related words – which are then assembled into sentences. Each of these units has its own meaning, which in combination ultimately form a comprehensible whole.

'This is actually a very complex structure. Since the natural world tends towards maximizing efficiency and conserving resources, it's perfectly reasonable to ask why the brain encodes linguistic information in such an apparently complicated way instead of digitally, like a computer,' explains Michael Hahn. Hahn, Professor of Computational Linguistics at Saarland University, has been examining this question together with his colleague Richard Futrell from the University of California, Irvine. Encoding information in a classical binary sequence of ones and zeros would, in theory at least, be far more efficient because it compresses information much more tightly than natural languages. So why don't we all communicate – metaphorically speaking – like R2-D2 from Star Wars, but instead speak the way we do? Hahn and Futrell have now found an answer to this conundrum.
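To make the compactness claim concrete, here is a toy illustration (our own, not taken from the paper): assigning each word in a small vocabulary a fixed-length binary code word yields a far shorter message than spelling the words out, which is the sense in which a 'digital' code would be more efficient.

```python
import math

# Toy vocabulary of 8 words (an assumption for illustration only).
vocabulary = ["the", "five", "green", "cars", "cat", "dog", "and", "bananas"]

# With 8 words, each word can be assigned a fixed 3-bit code word.
bits_per_word = math.ceil(math.log2(len(vocabulary)))  # 3 bits

phrase = ["the", "five", "green", "cars"]
binary_length = len(phrase) * bits_per_word            # 4 words x 3 bits = 12 bits

# Spelling the same phrase out costs 8 bits per ASCII character.
ascii_length = sum(len(w) for w in phrase) * 8         # 16 characters x 8 bits = 128 bits

print(binary_length, ascii_length)  # the binary code is roughly 10x shorter
```

The binary version is an order of magnitude more compact, yet, as the article explains, it would be detached from lived experience and far harder for a brain to process.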

'Human language is shaped by the realities of life around us,' says Michael Hahn. 'If, for instance, I were to talk about half a cat paired with half a dog and I referred to this using the abstract term "gol", nobody would know what I meant, as it's pretty certain that no one has seen a gol – it simply does not reflect anyone's lived experience. Equally, it makes no sense to blend the words "cat" and "dog" into a string of characters that uses the same letters but is impossible to interpret,' he continues. We simply wouldn't be able to process a string like 'gadcot', even if it technically contains the letters of both words. In contrast, the phrase 'cat and dog' does form a meaningful linguistic unit because the two words 'cat' and 'dog' refer to animals that virtually everyone will be familiar with.

Hahn summarizes the main findings of the study as follows: 'Put simply, it's easier for our brain to take what might seem to be the more complicated route.' Although the information is not in its most compressed form, the computational load for the brain is much lower because the human brain processes language in constant interaction with the familiar natural environment. Coding the information in a purely binary digital form might seem more efficient, as the information can be transmitted in a shorter time, but such a code would be detached from our real-world experience. Michael Hahn says the daily drive to work provides a good analogy: 'On our usual commute, the route is so familiar to us that we drive almost on autopilot. Our brain knows exactly what to expect, so the effort it needs to make is much lower. Taking a shorter but less familiar route feels much more tiring, as the new route demands that we be far more attentive during the drive.' Mathematically speaking: 'The number of bits the brain needs to process is far smaller when we speak in familiar, natural ways.'

Encoding and decoding information digitally would therefore require significantly more cognitive effort for both speaker and listener. Instead, the human brain continuously calculates the probabilities of words and phrases occurring in sequence, and because we use our native language daily for tens of thousands of days across a lifetime, these sequence patterns become deeply ingrained, reducing the computational load even further.
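The idea that familiar sequences cost fewer bits can be sketched with a toy bigram model (our illustration; the probabilities below are invented, and this is not the authors' actual model). Each word-to-word transition has a probability, and its processing cost in bits is its surprisal, the negative log probability: familiar orderings are probable and cheap, scrambled ones improbable and expensive.

```python
import math

# Invented bigram probabilities for illustration only: familiar German
# transitions get high probability, scrambled ones get almost none.
bigram_prob = {
    ("die", "fünf"): 0.30, ("fünf", "grünen"): 0.25, ("grünen", "autos"): 0.20,
    ("grünen", "fünf"): 0.001, ("fünf", "die"): 0.001, ("die", "autos"): 0.01,
}

def surprisal_bits(sequence):
    # Sum of -log2 P(transition) over consecutive word pairs;
    # unseen pairs are given a tiny floor probability.
    return sum(-math.log2(bigram_prob.get(pair, 1e-6))
               for pair in zip(sequence, sequence[1:]))

familiar = surprisal_bits(["die", "fünf", "grünen", "autos"])
scrambled = surprisal_bits(["grünen", "fünf", "die", "autos"])
print(round(familiar, 1), round(scrambled, 1))  # the familiar order costs far fewer bits
```

Because we rehearse these transition statistics every day of our lives, the familiar ordering carries a much smaller processing cost, which is the intuition behind Hahn's 'number of bits' remark above.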

Hahn offers another example: 'When I say the German phrase "Die fünf grünen Autos" (Engl.: "the five green cars"), the phrase will almost certainly make sense to another German speaker, whereas "Grünen fünf die Autos" (Engl.: "green five the cars") won't,' he says.

Consider what happens when a speaker utters the phrase 'Die fünf grünen Autos'. It begins with the German definite article 'Die'. At that point, a German-speaking listener will already know that the word 'Die' is likely to signal a feminine singular noun or a plural noun of any gender. This allows the brain to rule out masculine or neuter singular nouns immediately. The next word, 'fünf', is highly likely to refer to something countable, which rules out non-enumerable concepts like 'love' or 'thirst'. The next word in the sequence, 'grünen', tells the listener that the as-yet-unknown noun will be in the plural form and is green in colour. It could be cars, but could just as well be bananas or frogs. Only when the final word in the sequence, 'Autos', is uttered does the brain resolve the remaining ambiguity. As the phrase unfolds, the number of interpretative possibilities narrows until (in most cases) only one final interpretation is left.
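The incremental narrowing described above can be sketched as successive filters over a set of candidate readings (a toy illustration of our own; the candidate nouns and features are invented, and real comprehension is of course probabilistic rather than a hard filter).

```python
# Candidate nouns a listener might entertain, with invented features.
candidates = [
    {"noun": "Autos",   "plural": True,  "countable": True,  "green": True},
    {"noun": "Bananen", "plural": True,  "countable": True,  "green": True},
    {"noun": "Liebe",   "plural": False, "countable": False, "green": False},
    {"noun": "Hund",    "plural": False, "countable": True,  "green": False},
]

# Each incoming word adds a constraint (simplified: we treat "Die" + "fünf"
# as jointly forcing a plural reading).
cues = [
    lambda c: c["plural"],            # "Die" ... with "fünf" following: plural
    lambda c: c["countable"],         # "fünf": something countable
    lambda c: c["green"],             # "grünen": something green
    lambda c: c["noun"] == "Autos",   # "Autos": the noun itself
]

for cue in cues:
    candidates = [c for c in candidates if cue(c)]

print([c["noun"] for c in candidates])  # a single interpretation survives
```

After the colour adjective, both 'Autos' and 'Bananen' remain live options, exactly as in the article; only the final noun collapses the set to one reading.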

However, in the phrase 'Grünen fünf die Autos' (Engl.: 'green five the cars'), this logical chain of predictions and correlations breaks down. Our brain cannot construct meaning from the utterance because the expected sequence of cues is disrupted.

Michael Hahn and his US colleague Richard Futrell have now demonstrated these relationships mathematically. The significance of their study is underscored by its publication in the high-impact journal Nature Human Behaviour. Their insights could prove valuable, for example, in the further development of the large language models (LLMs) that underpin generative AI systems such as ChatGPT or Microsoft's Copilot.

Original publication:
Futrell, R. & Hahn, M. Linguistic structure from a bottleneck on sequential information processing. Nature Human Behaviour (2025). https://doi.org/10.1038/s41562-025-02336-w

Further information: 
Professor Michael Hahn
Email: mhahn(at)lst.uni-saarland.de