Historical Linguistics as Code Generation

Modeling phonological reconstruction as a code generation problem using LLMs

Since the Neogrammarians, linguists who write the phonological history of a language have essentially been writing programs: sequences of sound laws that convert protoforms into reflexes. From a computational perspective, this is an example of “coding by example,” in which a program is inferred from pairs of inputs and outputs. In this project, we ask: can we treat comparative reconstruction as a code-generation problem and solve it with LLMs?
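To make the analogy concrete, here is a minimal sketch in Python of a phonological history written as a program. Everything in it is an assumption made for illustration: the helper names (`make_rule`, `apply_laws`, `consistent`), the regular-expression encoding of sound laws, and the toy protoforms and reflexes, which are invented and not taken from any real language or from the project's own code.

```python
import re
from typing import Callable, List, Tuple

# A sound law is modeled as a function from one word form to another.
SoundLaw = Callable[[str], str]


def make_rule(pattern: str, replacement: str) -> SoundLaw:
    """Build a sound law from a regular-expression rewrite (hypothetical helper)."""
    compiled = re.compile(pattern)
    return lambda form: compiled.sub(replacement, form)


def apply_laws(protoform: str, laws: List[SoundLaw]) -> str:
    """Apply the sound laws in order, mimicking a relative chronology."""
    form = protoform
    for law in laws:
        form = law(form)
    return form


def consistent(laws: List[SoundLaw], examples: List[Tuple[str, str]]) -> bool:
    """Check that a candidate program maps every protoform to its attested reflex."""
    return all(apply_laws(proto, laws) == reflex for proto, reflex in examples)


# Invented toy sound laws, listed in chronological order.
LAWS = [
    make_rule(r"p", "f"),                         # p > f in all positions
    make_rule(r"(?<=[aeiou])t(?=[aeiou])", "d"),  # t > d between vowels
    make_rule(r"[aeiou]$", ""),                   # loss of a word-final vowel
]

# Invented (protoform, reflex) pairs that play the role of the examples.
EXAMPLES = [("pata", "fad"), ("tapi", "taf"), ("atipu", "adif")]

print(apply_laws("pata", LAWS))    # pata > fata > fada > fad
print(consistent(LAWS, EXAMPLES))  # True
```

Rule order matters in this sketch: if the final vowel were lost before the voicing rule applied, "pata" would come out as "fat" rather than "fad". Under this framing, a code-generation model would be asked to propose something like `LAWS` given only pairs like those in `EXAMPLES`.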

We have made initial progress on this front, experimenting both with data sets involving a single rule (sound law) and with data sets involving multiple sound laws (Naik et al., 2024). Over the next year, we hope to extend this work to naturalistic data sets.

References

2024

  1. Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction
    Atharva Naik, Kexun Zhang, Nathaniel Robinson, et al.
    Dec 2024