ChatGPT doesn’t understand language rules - it reconstructs words from memory

A new study reveals that large language models (LLMs) like GPT-J generalize language not through abstract grammatical rules but through analogy-based memory mechanisms. The study, titled “Derivational Morphology Reveals Analogical Generalization in Large Language Models” and published in the Proceedings of the National Academy of Sciences, challenges dominant assumptions about how LLMs process and generalize linguistic input.

Focusing on derivational morphology - specifically English adjective-to-noun nominalization with suffixes such as -ity and -ness - the researchers probe whether LLMs follow symbolic rules or rely on exemplar similarity. They test this using nonce words (invented adjectives) to eliminate the possibility of memorization and compare the predictions of GPT-J with those of two leading cognitive models: the rule-based Minimal Generalization Learner (MGL) and the analogy-based Generalized Context Model (GCM). Their findings decisively support the analogical hypothesis: GPT-J’s predictions align closely with those of the token-level analogy model and deviate significantly from rule-based logic.

What kind of evidence supports the analogy hypothesis?

To isolate how GPT-J generalizes derivational forms, the authors constructed a controlled experiment using pseudowords tailored to mimic real adjective endings. They targeted four adjective classes - two with regular suffix patterns (-able → -ity, -ish → -ness) and two with variable ones (-ive, -ous) - and generated 200 nonce adjectives. GPT-J’s task was to choose between the two possible nominalizations for each pseudoword.
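In practice, this kind of forced-choice probe can be run by comparing the probability a causal language model assigns to each candidate nominalization. The sketch below uses the publicly released GPT-J checkpoint through the Hugging Face transformers library; the prompt wording, the scoring procedure, and the pseudoword (borrowed from the example discussed below) are illustrative assumptions rather than the study’s exact protocol.

```python
# Minimal sketch of a forced-choice nominalization probe for a causal LM.
# The prompt wording, scoring, and pseudoword are illustrative assumptions,
# not the exact protocol used in the study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-j-6B"  # the model examined in the paper
# (swap in a smaller checkpoint such as "gpt2" for a quick local test)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities assigned to `continuation` given `prompt`.
    Assumes the prompt's token sequence is a prefix of the joint tokenization."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    total = 0.0
    for pos in range(prompt_len - 1, full_ids.shape[1] - 1):
        total += log_probs[pos, full_ids[0, pos + 1]].item()
    return total

# Which nominalization does the model prefer for a nonce -ive adjective?
prompt = "The quality of being pepulative is called"
score_ity = continuation_logprob(prompt, " pepulativity")
score_ness = continuation_logprob(prompt, " pepulativeness")
print("-ity" if score_ity > score_ness else "-ness")
```

Whichever candidate receives the higher summed log-probability is counted as the model’s choice for that item.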

In regular cases, both cognitive models and GPT-J performed similarly, consistently choosing the expected suffix. However, in variable classes, GPT-J’s behavior diverged from the rule-based model and instead mirrored the analogy model’s predictions. For example, GPT-J favored -ness for the pseudoword pepulative - matching what a frequency-sensitive analogy model would choose based on similar words like manipulativeness rather than a rule predicting -ity for -ive endings.
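To make the contrast concrete, the toy comparison below pits a categorical rule against a similarity-weighted vote over stored exemplars for the same pseudoword. The mini-lexicon, frequencies, and string-overlap similarity are invented for illustration; they are not the study’s actual MGL or GCM implementations.

```python
# Toy contrast between a rule-based and an analogy-based prediction for one
# nonce adjective. The mini-lexicon, frequencies, and string-overlap similarity
# are invented for illustration; this is not the study's MGL or GCM.
from difflib import SequenceMatcher

# (exemplar adjective, suffix it takes, invented corpus token frequency)
LEXICON = [
    ("manipulative", "-ness", 900),
    ("talkative",    "-ness", 400),
    ("productive",   "-ity",  300),
    ("creative",     "-ity",  250),
]

def rule_based(adjective: str) -> str:
    """A categorical rule in the spirit of the MGL: -ive adjectives take -ity."""
    return "-ity" if adjective.endswith("ive") else "-ness"

def analogy_based(adjective: str) -> str:
    """A GCM-flavored vote: similar and frequent exemplars pull the decision."""
    votes = {"-ity": 0.0, "-ness": 0.0}
    for word, suffix, freq in LEXICON:
        similarity = SequenceMatcher(None, adjective, word).ratio()
        votes[suffix] += similarity * freq  # token-weighted, as GPT-J appears to be
    return max(votes, key=votes.get)

nonce = "pepulative"
print("rule:   ", rule_based(nonce))     # -> -ity
print("analogy:", analogy_based(nonce))  # -> -ness, pulled by 'manipulative'
```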

Crucially, the study revealed that GPT-J is sensitive to token-level frequency effects. Nonce forms whose analogical neighbors occur frequently in the training data were predicted more confidently, while those with rare or ambiguous suffix neighborhoods led to more tentative predictions. This finding contradicts rule-based theories, which posit that once a rule is learned, the frequency of individual instances should not affect output generation. The presence of significant frequency-driven differences in confidence reinforces the conclusion that LLMs operate via analogy over stored exemplars.

How do LLMs differ from humans in linguistic generalization?

While the study confirms that LLMs exhibit analogical behavior in derivational morphology, it also finds that they diverge from human cognition in a critical way: humans generalize over types (distinct word forms), whereas LLMs generalize over tokens (the raw frequency of word instances in their training data). This distinction has meaningful consequences for the human-likeness of AI linguistic performance.
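A toy version of the same analogical vote shows how much this weighting choice can matter. The neighbors and counts below are invented; the only thing that differs between the two runs is whether each neighbor is counted once (type weighting) or in proportion to its corpus frequency (token weighting).

```python
# Sketch of the type/token distinction: the same analogical vote, weighted either
# by distinct word forms (types) or by raw corpus counts (tokens). The neighbors
# and counts are invented to show how the weighting scheme can flip a prediction.
NEIGHBORS = [
    # (suffix chosen by the neighbor, invented corpus token frequency)
    ("-ity", 5000),                                # one very frequent -ity neighbor
    ("-ness", 40), ("-ness", 30), ("-ness", 25),   # several rarer -ness neighbors
]

def vote(weight_by_tokens: bool) -> str:
    totals = {"-ity": 0.0, "-ness": 0.0}
    for suffix, freq in NEIGHBORS:
        totals[suffix] += freq if weight_by_tokens else 1  # token vs. type weighting
    return max(totals, key=totals.get)

print("type-weighted  (human-like):", vote(weight_by_tokens=False))  # -ness
print("token-weighted (GPT-J-like):", vote(weight_by_tokens=True))   # -ity
```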

In human evaluation experiments, GPT-J and GPT-4 were tested against judgments by 22 native English speakers on the preferred nominalization of the same nonce adjectives. The results showed that GPT-4, despite its more advanced architecture, matched human preferences less accurately than GPT-J, especially in irregular morphological cases. Both LLMs performed worse than the simpler type-level GCM model, revealing that their reliance on token frequency distorts their generalization when abstract pattern recognition is required.
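A simple way to picture how such alignment is scored is the share of nonce items on which a model’s preferred suffix matches the human majority choice. The judgments below are fabricated placeholders arranged only to mirror the reported ordering; they are not the study’s data.

```python
# Sketch of scoring model-human alignment on the forced-choice task: the share of
# items where a model's preferred suffix matches the human majority choice.
# All judgments below are fabricated placeholders, not the study's data.
human_majority = {"item01": "-ness", "item02": "-ity", "item03": "-ness", "item04": "-ness"}
gptj_choice    = {"item01": "-ness", "item02": "-ity", "item03": "-ity",  "item04": "-ness"}
gpt4_choice    = {"item01": "-ity",  "item02": "-ity", "item03": "-ity",  "item04": "-ness"}

def agreement(model_choices: dict) -> float:
    """Fraction of items on which the model agrees with the human majority."""
    hits = sum(model_choices[item] == human_majority[item] for item in human_majority)
    return hits / len(human_majority)

print(f"GPT-J vs. humans: {agreement(gptj_choice):.2f}")  # 0.75
print(f"GPT-4 vs. humans: {agreement(gpt4_choice):.2f}")  # 0.50
```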

Additional tests using real-word data from the Hoosier Lexicon showed that humans judged morphologically complex low-frequency words as more familiar than their raw frequency would predict, because their internal structure (a recognizable base plus a recognizable suffix) is itself familiar. GPT-J, however, rated these words as less familiar than their simplex counterparts, suggesting it lacks an internal mechanism for decomposing rare words into familiar substructures. This again indicates that LLMs store surface forms and frequencies rather than the structured mental lexicon characteristic of human cognition.
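One way to picture this asymmetry: a purely surface-based familiarity score rates a rare derived word by its own count, while a decompositional score lets it inherit familiarity from its base. The words, counts, and suffix list in the sketch below are placeholders, not the Hoosier Lexicon data or the paper’s analysis.

```python
# Toy illustration of the asymmetry described above: a surface-frequency score
# rates a rare derived word by its own count, while a decompositional score lets
# it inherit familiarity from its base. Words, counts, and suffixes are placeholders.
SURFACE_FREQ = {"allotment": 40, "allot": 600, "gherkin": 45}
SUFFIXES = ("ment", "ness", "ity")

def surface_familiarity(word: str) -> int:
    """Rate a word by its own corpus count only (the pattern GPT-J showed)."""
    return SURFACE_FREQ.get(word, 0)

def decomposed_familiarity(word: str) -> int:
    """Let a derived word inherit its base's count when base + suffix are
    recognizable (closer to the human pattern reported in the study)."""
    for suffix in SUFFIXES:
        base = word[: -len(suffix)]
        if word.endswith(suffix) and base in SURFACE_FREQ:
            return max(SURFACE_FREQ.get(word, 0), SURFACE_FREQ[base])
    return SURFACE_FREQ.get(word, 0)

# Two equally rare words: one morphologically complex, one simplex.
for w in ("allotment", "gherkin"):
    print(w, surface_familiarity(w), decomposed_familiarity(w))
# allotment: 40 on the surface but 600 once decomposed; gherkin: 45 either way.
```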

Why does this matter for AI development and language science?

The implications of the study extend far beyond morphology. It challenges the prevailing narrative that LLMs acquire grammar-like rules akin to those internalized by humans. Instead, the authors argue that LLMs achieve linguistic generalization via dense memory networks that store massive quantities of exemplars and interpolate analogies when faced with novel forms. This offers a compelling explanation for their success with nuanced language tasks, but also exposes their limitations in abstract reasoning and rule flexibility.

Furthermore, the study suggests that larger, more powerful models like GPT-4 are not necessarily more cognitively aligned with human linguistic behavior. In fact, their deeper entrenchment in training data frequencies might widen the human-machine gap in abstract language understanding - a phenomenon the authors term an “inverse scaling effect.”

The research also reinforces broader findings that LLMs struggle with meta-level representation. Their inability to semantically ascend - to abstract away from specific examples toward general linguistic rules - means that their performance, while impressive, remains fundamentally different from that of human minds.
