Historical Linguistics as Code Generation

Since the Neogrammarians, linguists who write the phonological history of a language have essentially been writing programs that convert protoforms into reflexes. From a computational perspective, this is an instance of “coding by example.” In this project, we ask: can we treat comparative reconstruction as a code-generation problem and solve it with LLMs?
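To make the analogy concrete, here is a minimal sketch of a phonological history written as a program. It assumes that each sound law can be approximated by a regular-expression rewrite and that the laws apply in a fixed chronological order. The two laws, the word forms, and the names `SOUND_LAWS`, `derive`, and `consistent` are all invented for illustration and are not taken from the project or its data.

```python
import re

VOWELS = "aeiou"

# Hypothetical toy sound laws, listed in chronological order.
# Each law is a (regex pattern, replacement) pair.
SOUND_LAWS = [
    (rf"(?<=[{VOWELS}])t(?=[{VOWELS}])", "d"),  # t > d between vowels
    (rf"[{VOWELS}]$", ""),                      # word-final vowel is lost
]


def derive(protoform, laws=SOUND_LAWS):
    """Apply each sound law in order to turn a protoform into a reflex."""
    form = protoform
    for pattern, replacement in laws:
        form = re.sub(pattern, replacement, form)
    return form


def consistent(laws, examples):
    """Check whether a candidate program reproduces every (protoform, reflex) pair."""
    return all(derive(proto, laws) == reflex for proto, reflex in examples)


# Invented (protoform, reflex) pairs that play the role of the "examples".
EXAMPLES = [("pata", "pad"), ("kati", "kad"), ("tap", "tap")]

if __name__ == "__main__":
    for proto, _ in EXAMPLES:
        print(f"*{proto} > {derive(proto)}")
    print(consistent(SOUND_LAWS, EXAMPLES))
    # Reversing the order deletes the final vowel first, so voicing never applies.
    print(consistent(list(reversed(SOUND_LAWS)), EXAMPLES))
```

In this framing, the code-generation task is the inverse of `derive`: given only pairs like those in `EXAMPLES`, produce a rule list for which `consistent` holds. The reversed-order check illustrates why rule ordering is part of what has to be recovered.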

We have made initial progress on this front, experimenting both with data sets involving a single rule (sound law) and with data sets involving multiple sound laws (Naik et al., 2024). Over the next year, we hope to extend this work to naturalistic data sets.

References

2024