Automating Comparative Reconstruction
Work on developing models that reconstruct protolanguages based on collections of cognate sets
In the 19th century, European philologists made a discovery that would change the direction of the human sciences: languages change in systematic ways, and by leveraging these systematic patterns it is possible to reproducibly reconstruct the ancestors of language families (proto-languages) even when no record of those languages survives. This technique, called the comparative method, provided an unprecedented window into the human past: its cultures, its migrations, and its encounters between populations.
The assumption that historical changes in pronunciation (“sound changes”) are regular, known as “the regularity principle” or “the Neogrammarian hypothesis”, is fundamental to the comparative method. As the 19th-century Neogrammarians Hermann Osthoff and Karl Brugmann put it:
Every sound change, in so far as it proceeds mechanically, is completed in accordance with laws admitting of no exceptions; i.e. the direction in which the change takes place is always the same for all members of a language community, apart from the case of dialect division, and all words in which the sound subject to change occurs in the same conditions are affected by the change without exception (Morphologische Untersuchungen auf dem Gebiete der indogermanischen Sprachen i).
The comparative method, however, is challenging for humans to apply, largely because it involves dealing with large volumes of data and modeling numerous interactions between competing patterns. One must balance the need for phonetic similarity between reconstructed words and their descendants (reflexes) with the need to derive those reflexes deterministically from the reconstructed words using a single set of sound changes. This balancing act imposes a heavy cognitive load. For this reason, researchers have long aspired to implement the comparative method computationally.
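As a rough illustration of the derivability requirement, the sketch below treats sound changes as ordered rewrite rules and checks whether a candidate reconstruction deterministically yields every attested reflex. The daughter languages (A and B), the rules, and the word forms are all invented for illustration; they are not drawn from any real language family or from the papers listed here.

```python
import re

# Hypothetical, invented sound changes. Each daughter language has an
# ordered list of (regex pattern, replacement) rules applied in sequence.
SOUND_CHANGES = {
    "A": [
        (r"^p", "f"),                        # p > f word-initially
        (r"(?<=[aeiou])t(?=[aeiou])", "d"),  # t > d between vowels
    ],
    "B": [
        (r"a$", "e"),                        # a > e word-finally
    ],
}


def derive_reflex(protoform, rules):
    """Apply each sound change, in order, to every matching position."""
    form = protoform
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form


def is_consistent(protoform, cognate_set):
    """True if the protoform derives every attested reflex exactly."""
    return all(
        derive_reflex(protoform, SOUND_CHANGES[lang]) == reflex
        for lang, reflex in cognate_set.items()
    )


# Invented cognate set and candidate reconstructions.
cognate_set = {"A": "fada", "B": "pate"}
candidates = ["pata", "fata", "pate"]
consistent = [c for c in candidates if is_consistent(c, cognate_set)]
print(consistent)
```

Only one of the three candidates derives both reflexes under these toy rules. With real data the rules themselves are unknown and must be inferred jointly with the reconstructions, which is where much of the difficulty lies.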
Our work builds on past research in this area:
- In (Chang et al., 2022), we propose a new resource for Chinese historical phonology (including Middle Chinese and modern Chinese varieties). This data is foundational to our later papers.
- In (Kim et al., 2023), we show that Transformer-based models can perform better than RNN (e.g., GRU) based models for supervised protoform reconstruction.
- In (Chang et al., 2023), we semi-automate intermediate sound change prediction (AISCP) for Tukanoan phylogenetic inference (the process of determining how languages branched off from their relatives). Traditionally, linguists have manually predicted the intermediate stages that proto-sounds pass through, and these predictions are then used to group different varieties.
- In (Lu et al., 2024), we further improve automatic comparative reconstruction by using reflex prediction to perform reranking on the beam search results from protoform prediction, emulating the methodology of practicing historical linguists.
- In (Cui et al., 2024), we explore variational autoencoders (VAEs) for supervised comparative reconstruction.
- Finally, in (Lu et al., 2024), we show that it is possible to achieve strong performance on the protoform reconstruction task using only a fraction of the labeled data by using the Proto-Daughter-Proto architecture, an end-to-end architecture that favors protoforms that can be derived from cognate sets and from which cognate sets can be derived.
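The reranking idea mentioned above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the scoring used in the paper: the beam candidates and their log-probabilities are invented, `predict_reflexes` is a hypothetical rule-based stand-in for a trained reflex prediction model, and the combined score (log-probability minus a weighted edit distance) is chosen only for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def predict_reflexes(protoform):
    """Hypothetical stand-in for a reflex prediction model (invented rules)."""
    return {
        "A": protoform.replace("p", "f"),
        "B": protoform.replace("t", "d"),
    }


def rerank(beam, attested, weight=1.0):
    """Rescore candidates by how well their predicted reflexes match the data."""
    rescored = []
    for protoform, log_prob in beam:
        predicted = predict_reflexes(protoform)
        mismatch = sum(
            edit_distance(predicted[lang], reflex)
            for lang, reflex in attested.items()
        )
        rescored.append((protoform, log_prob - weight * mismatch))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Invented beam search output: (candidate protoform, log-probability).
beam = [("bata", -0.4), ("pata", -0.9), ("pada", -1.5)]
attested = {"A": "fata", "B": "pada"}
reranked = rerank(beam, attested)
print(reranked[0][0])
```

In this toy run, the candidate that the reconstruction model ranked second moves to the top because it is the only one whose predicted reflexes match the attested forms exactly, mirroring how a historical linguist checks a proposed reconstruction against its daughters.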
References
2024
- Improved Neural Protoform Reconstruction via Reflex Prediction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
- Neural Proto-Language Reconstruction. May 2024
- Semisupervised Neural Proto-Language Reconstruction. May 2024
2023
- Transformed Protoform Reconstruction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jul 2023
- Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study. In Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, Dec 2023
2022
- WikiHan: A New Comparative Dataset for Chinese Languages. In COLING 2022, Dec 2022