The ways in which phonology can serve NLP remain underexplored. Members of this lab paved the way for work in this area with tools like Epitran (Mortensen et al., 2018) and PanPhon (Mortensen et al., 2016). We are now applying phonological representations to a variety of tasks, following the path cleared by Bharadwaj et al. (2016) and Chaudhary et al. (2018). We have recently extended this work to modern classes of pretrained models, using phoneme-based encoders such as XPhoneBERT for zero-shot cross-lingual NER (Sohn et al., 2024). This year, we plan to generalize this investigation to a broader range of linguistic tasks (rather than only NER and MT, as in past work) and to develop better techniques for exploiting phonological resources.
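As a concrete illustration of the kind of phonological representations these tools provide, here is a minimal sketch using the epitran and panphon Python packages released with the cited papers. The language code, example word, and printed output are our own illustrative choices, not drawn from the papers themselves.

```python
# Minimal sketch: orthography -> IPA with Epitran, then IPA -> articulatory
# feature vectors with PanPhon. Assumes `pip install epitran panphon`.
import epitran
import panphon

# Epitran maps orthography to IPA for a given language/script pair.
epi = epitran.Epitran("tur-Latn")  # Turkish in Latin script (illustrative choice)
ipa = epi.transliterate("merhaba")
print(ipa)  # e.g. a string of IPA segments for "merhaba"

# PanPhon maps each IPA segment to a vector of subsegmental articulatory
# features, giving a dense, linguistically motivated input representation.
ft = panphon.FeatureTable()
vectors = ft.word_to_vector_list(ipa, numeric=True)  # one vector per segment
print(len(vectors), "segments,", len(vectors[0]), "features per segment")
```

Representations like these feature vectors, rather than one-hot character encodings, are what the downstream NER and MT models in the work below consume.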
References
- Jimin Sohn, Haeji Jung, Alex Cheng, et al. Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages. Dec 2024.
- David R. Mortensen, Siddharth Dalmia, and Patrick Littell. Epitran: Precision G2P for Many Languages. In Proceedings of the 11th Language Resources and Evaluation Conference, May 2018.
- Aditi Chaudhary, Chunting Zhou, Lori Levin, et al. Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Oct 2018.
  Abstract: Much work in Natural Language Processing (NLP) has been for resource-rich languages, making generalization to new, less-resourced languages challenging. We present two approaches for improving generalization to low-resourced languages by adapting continuous word representations using linguistically motivated subword units: phonemes, morphemes and graphemes. Our method requires neither parallel corpora nor bilingual dictionaries and provides a significant gain in performance over previous methods relying on these resources. We demonstrate the effectiveness of our approaches on Named Entity Recognition for four languages, namely Uyghur, Turkish, Bengali and Hindi, of which Uyghur and Bengali are low resource languages, and also perform experiments on Machine Translation. Exploiting subwords with transfer learning gives us a boost of +15.2 NER F1 for Uyghur and +9.7 F1 for Bengali. We also show improvements in the monolingual setting where we achieve (avg.) +3 F1 and (avg.) +1.35 BLEU.
- David R. Mortensen, Patrick Littell, Akash Bharadwaj, et al. PanPhon: A Resource for Mapping IPA Segments to Articulatory Feature Vectors. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Dec 2016.
  Abstract: This paper contributes to a growing body of evidence that, when coupled with appropriate machine-learning techniques, linguistically motivated, information-rich representations can outperform one-hot encodings of linguistic data. In particular, we show that phonological features outperform character-based models. PanPhon is a database relating over 5,000 IPA segments to 21 subsegmental articulatory features. We show that this database boosts performance in various NER-related tasks. Phonologically aware, neural CRF models built on PanPhon features are able to perform better on monolingual Spanish and Turkish NER tasks than character-based models. They have also been shown to work well in transfer models (as between Uzbek and Turkish). PanPhon features also contribute measurably to Orthography-to-IPA conversion tasks.
- Akash Bharadwaj, David Mortensen, Chris Dyer, et al. Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Nov 2016.