Automating Comparative Reconstruction

Work on developing models that reconstruct proto-languages from collections of cognate sets

In the 19th century, European philologists discovered something that would change the direction of the human sciences: languages change in systematic ways, and by leveraging these systematic patterns one can reproducibly reconstruct the ancestors of families of languages (proto-languages) even when no record of those ancestors survives. This technique, called the comparative method, provided an unprecedented window into the human past: its cultures, its migrations, and the encounters between its populations.

The assumption that historical changes in pronunciation (“sound changes”) are regular, known as ‘the regularity principle’ or ‘the Neogrammarian hypothesis’, is fundamental to the comparative method. As the 19th-century Neogrammarians Hermann Osthoff and Karl Brugmann put it:

Every sound change, in so far as it proceeds mechanically, is completed in accordance with laws admitting of no exceptions; i.e. the direction in which the change takes place is always the same for all members of a language community, apart from the case of dialect division, and all words in which the sound subject to change occurs in the same conditions are affected by the change without exception (Morphologische Untersuchungen auf dem Gebiete der indogermanischen Sprachen i).
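
To make the principle concrete, the toy Python sketch below applies a single conditioned sound change, loosely modeled on part of Grimm’s law (*p becomes *f except after *s), to a handful of simplified word forms. The point is that every form meeting the conditioning environment changes and none is skipped; the rule and the forms are illustrative only and are not drawn from the papers listed below.

    # A toy illustration of the regularity principle: a conditioned sound change
    # applies to every form that meets its environment, without exception.
    # The rule is loosely modeled on part of Grimm's law (*p > *f, blocked after *s);
    # the word forms are simplified illustrations.

    def apply_sound_change(word: str) -> str:
        """Rewrite every 'p' as 'f' unless it is immediately preceded by 's'."""
        segments = []
        for i, seg in enumerate(word):
            if seg == "p" and (i == 0 or word[i - 1] != "s"):
                segments.append("f")
            else:
                segments.append(seg)
        return "".join(segments)

    if __name__ == "__main__":
        for w in ["pater", "ped", "spek", "nepot"]:
            print(f"{w} > {apply_sound_change(w)}")
        # pater > fater, ped > fed, spek > spek (unchanged after s), nepot > nefot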

The comparative method, however, is challenging for humans to apply, largely because it involves sifting through large volumes of data and modeling numerous interactions between competing patterns. One must balance the need for phonetic similarity between reconstructed words and their descendants (reflexes) against the need to derive all of the reflexes deterministically from the reconstructions with a single set of sound changes. The result is a heavy cognitive load. For this reason, researchers have long aspired to implement the comparative method computationally.
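
As a small illustration of the bookkeeping involved, the toy sketch below tabulates sound correspondences from three pre-aligned Romance cognate sets. A real application of the method works over hundreds or thousands of sets, must compute the alignments itself, and must reconcile many competing correspondence patterns at once; the forms and alignments here are simplified for illustration.

    from collections import Counter

    # Toy, pre-aligned cognate sets from three Romance languages (Italian,
    # Spanish, Portuguese). Each form is a list of segments, and corresponding
    # positions are already aligned; in practice, alignment is itself nontrivial.
    COGNATE_SETS = {
        "night": [["n", "o", "tt", "e"], ["n", "o", "ch", "e"], ["n", "o", "it", "e"]],
        "eight": [["o", "tt", "o"], ["o", "ch", "o"], ["o", "it", "o"]],
        "milk":  [["l", "a", "tt", "e"], ["l", "e", "ch", "e"], ["l", "e", "it", "e"]],
    }

    # Tabulate sound correspondences: the tuples of segments that occupy the
    # same aligned slot across all three languages.
    correspondences = Counter()
    for forms in COGNATE_SETS.values():
        for slot in zip(*forms):
            correspondences[slot] += 1

    for corr, count in correspondences.most_common():
        print(" : ".join(corr), f"(x{count})")
    # The recurring correspondence tt : ch : it points back to Latin -ct-
    # (noctem, octo, lactem) and is the kind of pattern the linguist reconstructs.
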

In this research, we build on past work in this area through a series of papers:

  • In Chang et al. (2022), we introduce WikiHan, a new resource for Chinese historical phonology covering Middle Chinese and modern Chinese varieties. This dataset is foundational to much of our later work.
  • In Kim et al. (2023), we show that Transformer-based models can outperform RNN-based models (e.g., GRUs) on supervised protoform reconstruction (a toy sketch of the sequence-to-sequence framing of this task appears after this list).
  • In Chang et al. (2023), we semi-automate intermediate sound change prediction (AISCP) for Tukanoan phylogenetic inference (the process of determining how the languages of a family branched off from one another). Traditionally, linguists have manually predicted the intermediate stages that proto-sounds pass through on the way to their modern reflexes; these intermediate stages are then used to group varieties.
  • In Lu et al. (2024), we further improve automatic comparative reconstruction by using reflex prediction to rerank the beam-search results of protoform prediction, emulating the methodology of practicing historical linguists (see the reranking sketch after this list).
  • In Cui et al. (2024), we explore variational autoencoders (VAEs) for supervised comparative reconstruction.
  • Finally, in a second paper by Lu et al. (2024), we show that it is possible to achieve strong performance on the protoform reconstruction task with only a fraction of the labeled data by using the Proto-Daughter-Proto architecture, an end-to-end architecture that favors protoforms that can be derived from cognate sets and from which the cognate sets can, in turn, be derived (a conceptual sketch of this round-trip objective also follows the list).
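
Supervised protoform reconstruction, as in the Transformer paper above, is typically framed as sequence-to-sequence prediction: the daughter forms of a cognate set are serialized into a single source sequence with language tags, and the model is trained to emit the protoform. The sketch below shows only this input/output framing; the language tags, segments, and target are illustrative conventions, not the exact format used in any of the papers.

    # A minimal sketch of framing supervised protoform reconstruction as
    # sequence-to-sequence prediction. The language tags and segments are
    # illustrative conventions only.

    def serialize_cognate_set(reflexes: dict[str, list[str]]) -> list[str]:
        """Flatten a cognate set into a single source token sequence.

        Each daughter form is prefixed with a language tag so that one encoder
        can read all reflexes jointly.
        """
        source = []
        for lang, segments in sorted(reflexes.items()):
            source.append(f"<{lang}>")
            source.extend(segments)
        return source

    if __name__ == "__main__":
        # Toy cognate set with simplified, IPA-like segments.
        cognate_set = {
            "cantonese": ["j", "a", "t"],
            "hakka": ["j", "i", "t"],
            "mandarin": ["i"],
        }
        target_protoform = ["ʔ", "i", "t"]  # illustrative Middle Chinese-style target

        print("source:", serialize_cognate_set(cognate_set))
        print("target:", target_protoform)
        # A Transformer encoder-decoder (or an RNN-based seq2seq model) is then
        # trained to map source sequences of this kind to target protoforms.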
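
The reranking approach described above can be sketched as follows: a protoform model proposes an n-best list via beam search, and each candidate is rescored by how well a separate reflex-prediction model regenerates the attested daughter forms from it. The functions and dummy components below are hypothetical stand-ins used to make the idea concrete, not APIs from the paper.

    # Schematic sketch of reranking protoform hypotheses with a reflex-prediction
    # model. The model interfaces below are hypothetical stand-ins; real systems
    # would wrap trained neural models.

    from typing import Callable

    Candidate = tuple[list[str], float]  # (protoform segments, beam-search log-probability)

    def rerank(
        candidates: list[Candidate],
        reflexes: dict[str, list[str]],
        reflex_log_prob: Callable[[list[str], str, list[str]], float],
        weight: float = 1.0,
    ) -> list[Candidate]:
        """Rescore beam-search candidates by how well each explains the reflexes."""
        def score(cand: Candidate) -> float:
            protoform, beam_score = cand
            reflex_score = sum(
                reflex_log_prob(protoform, lang, form) for lang, form in reflexes.items()
            )
            return beam_score + weight * reflex_score

        return sorted(candidates, key=score, reverse=True)

    if __name__ == "__main__":
        # Dummy inputs and a crude stand-in scorer, just to make the sketch run.
        candidates = [(["p", "a", "t"], -1.2), (["b", "a", "t"], -1.0)]
        reflexes = {"daughterA": ["p", "a", "d"], "daughterB": ["f", "a", "t"]}

        def dummy_reflex_log_prob(protoform, lang, form):
            # Reward candidates whose initial segment is voiceless, like the reflexes'.
            return 0.0 if protoform[0] in {"p", "f"} and form[0] in {"p", "f"} else -2.0

        best_protoform, _ = rerank(candidates, reflexes, dummy_reflex_log_prob)[0]
        print("best protoform:", " ".join(best_protoform))
        # The beam preferred *bat, but reranking with reflex prediction picks *pat.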
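
Finally, the Proto-Daughter-Proto idea described above suggests a round-trip objective: an encoder maps daughter forms to a protoform and a decoder maps the protoform back to daughter forms, so that unlabeled cognate sets still provide a training signal through how well they can be regenerated. The sketch below expresses only this loss composition at a conceptual level, with hypothetical encode, decode, and loss placeholders; it is not the architecture or training procedure from the paper.

    # Conceptual sketch of a semisupervised, round-trip objective for protoform
    # reconstruction. `encode`, `decode`, and `seq_loss` are hypothetical
    # placeholders standing in for neural modules and a sequence loss.

    from typing import Callable, Optional

    def semisupervised_loss(
        cognate_sets: list[dict[str, list[str]]],
        protoforms: list[Optional[list[str]]],                # None where no gold protoform exists
        encode: Callable[[dict[str, list[str]]], list[str]],  # daughters -> protoform
        decode: Callable[[list[str], str], list[str]],        # protoform, language -> reflex
        seq_loss: Callable[[list[str], list[str]], float],    # predicted vs. gold sequence
        recon_weight: float = 1.0,
    ) -> float:
        """Supervised loss on labeled sets plus a round-trip loss on every set."""
        total = 0.0
        for reflexes, gold_proto in zip(cognate_sets, protoforms):
            predicted_proto = encode(reflexes)
            if gold_proto is not None:
                # Labeled: the predicted protoform should match the gold protoform.
                total += seq_loss(predicted_proto, gold_proto)
            # Labeled or not: the predicted protoform should regenerate the reflexes.
            for lang, form in reflexes.items():
                total += recon_weight * seq_loss(decode(predicted_proto, lang), form)
        return total

    if __name__ == "__main__":
        # Dummy components so the sketch runs end to end.
        def dummy_encode(reflexes):
            return next(iter(reflexes.values()))  # pretend the first reflex is the protoform

        def dummy_decode(protoform, lang):
            return protoform  # pretend every daughter preserves the protoform

        def zero_one_loss(predicted, gold):
            return float(predicted != gold)

        sets = [{"daughterA": ["p", "a"], "daughterB": ["f", "a"]}]
        golds = [None]  # an unlabeled cognate set
        print(semisupervised_loss(sets, golds, dummy_encode, dummy_decode, zero_one_loss))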

References

2024

  1. Improved Neural Protoform Reconstruction via Reflex Prediction
    Liang Lu, Jingzhi Wang, and David R. Mortensen
    In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
  2. Neural Proto-Language Reconstruction
    Chenxuan Cui, Ying Chen, Qinxin Wang, and 1 more author
    May 2024
  3. Semisupervised Neural Proto-Language Reconstruction
    Liang Lu, Peirong Xie, and David R. Mortensen
    May 2024

2023

  1. Transformed Protoform Reconstruction
    Young Min Kim, Kalvin Chang, Chenxuan Cui, and 1 more author
    In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jul 2023
  2. Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study
    Kalvin Chang, Nathaniel Robinson, Anna Cai, and 3 more authors
    In Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change, Dec 2023

2022

  1. WikiHan: A New Comparative Dataset for Chinese Languages
    Kalvin Chang, Chenxuan Cui, Youngmin Kim, and 1 more author
    In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Dec 2022