Various claims have been made about the linguistic capacities of large language models. Some have asserted that, as next-token prediction models, they do not display human-like linguistic “competence.” Others have claimed that the language abilities of LLMs are practically identical to those of humans. One way of testing these claims is to perform “psycholinguistic” experiments on LLMs.
One of the most influential psycholinguistic experiments was the Wug Test, in which small children were asked to complete sentences that required applying a morphological operation to a nonce word (a made-up word). For example, children might be shown a picture of a bird-like creature and be told, “this is a wug.” Then, shown two of the creatures, they would be told, “now there are two of them. There are two….” English-speaking children, even very small ones, continue, “wugs!” The test showed that small children can generalize morphological and phonological patterns to words they have never encountered before.
Our research applies the same technique to language models, investigating the degree to which they can generalize morphology, whether inflectional or derivational, to nonce words: in other words, the degree to which LLMs display human-like morphological behavior.
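To make the setup concrete, here is a minimal sketch of what a wug-style probe of a language model can look like, using the Hugging Face transformers library. This is an illustration only, not the protocol from the papers below; the checkpoint name and prompt wording are placeholders.

```python
# Minimal wug-style probe: will a causal LM pluralize a nonce noun?
# Illustrative only; checkpoint and prompt wording are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Classic Berko-style frame with the nonce noun "wug".
prompt = (
    "This is a wug. Now there is another one. "
    "There are two of them. There are two"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=3,                     # only the inflected form is needed
    do_sample=False,                      # greedy decoding, deterministic
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
)

# Decode only the newly generated tokens.
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(completion)  # human-like morphological behavior predicts " wugs"
```

A real experiment runs many such items, across morphological operations and languages, and compares the model’s completions against human responses.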
This project on testing the linguistic capabilities of large language models has so far resulted in the two papers below, with more to come:
In particular, ongoing work compares derivational morphology in humans and LLMs and finds that, like humans, LLMs use analogical relationships to derive words, but that, unlike humans, current LLMs rely on the token frequencies of exemplars rather than their type frequencies.
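The type/token distinction matters here, so a toy computation may help (the words and counts below are invented for illustration, not data from the study): the type frequency of a pattern counts each distinct exemplar once, while its token frequency sums how often those exemplars occur in a corpus.

```python
# Toy contrast between type and token frequency of exemplars.
# Words and counts are invented for illustration, not data from the study.
from collections import Counter

# Hypothetical corpus counts for words exemplifying the -ness pattern.
corpus_counts = Counter({
    "happiness": 9000,  # very frequent token
    "darkness": 3000,
    "sadness": 800,
    "greenness": 5,     # rare token, but still a distinct type
})

type_frequency = len(corpus_counts)            # 4 distinct exemplars
token_frequency = sum(corpus_counts.values())  # 12805 total occurrences

print(type_frequency, token_frequency)
```

A learner weighting exemplars by type frequency treats the rare “greenness” as evidence for the pattern on a par with “happiness”; a learner weighting by token frequency is dominated by the handful of very frequent words. The finding above is that current LLMs behave like the latter.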
References
2024
Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs
David R. Mortensen, Valentina Izrailevitch, Yunze Xiao, and 2 more authors
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
Lexical-syntactic flexibility, in the form of conversion (or zero-derivation), is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility—the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models: two proprietary models (GPT-3.5 and GPT-4) and three open-source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open-source language models are also able to perform it and that the 7-billion-parameter Mistral displays as little difference between its baseline performance on the natural language inference task and its performance on the non-prototypical syntactic category task as the massive GPT-4.
2023
Counting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Leonie Weissweiler, Valentin Hofmann, Anjali Kantharuban, and 10 more authors
In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Dec 2023
Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.