Systematic Relationships for Improved ASR

Better ASR for low-resource varieties

We seek to take advantage of the systematic relationships between languages—the consequence of regular sound change and morphological developments—to build better ASR systems for low-resource varieties given ASR models for high-resource varieties.

In Chou et al. (2023), we collected a corpus of soap operas spoken in Taiwanese Hokkien, a low-resource variety spoken in Taiwan, and trained end-to-end (E2E) ASR models on top of frozen self-supervised speech model (S3M) representations. The S3M pretrained on Mandarin performed best, demonstrating the effectiveness of transfer learning from a high-resource sister language. However, word error rates (WERs) remained high because of the limited amount of training data.
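
To make the frozen-representation setup concrete, here is a minimal PyTorch sketch: a frozen S3M encoder with a small trainable CTC head on top. The checkpoint name, vocabulary size, and single linear head are illustrative assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class FrozenS3MCTC(nn.Module):
    """E2E ASR sketch: frozen self-supervised encoder + trainable CTC head."""

    def __init__(self, s3m_name: str, vocab_size: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(s3m_name)
        for p in self.encoder.parameters():
            p.requires_grad = False  # freeze the S3M; only the head is trained
        self.head = nn.Linear(self.encoder.config.hidden_size, vocab_size)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # no gradients flow through the frozen encoder
            hidden = self.encoder(input_values).last_hidden_state
        return self.head(hidden).log_softmax(dim=-1)  # (batch, frames, vocab)


# Hypothetical usage: a publicly available Mandarin-pretrained wav2vec2-style
# checkpoint and a toy vocabulary size. nn.CTCLoss expects log-probabilities
# shaped (frames, batch, vocab), so transpose before computing the loss.
model = FrozenS3MCTC("TencentGameMate/chinese-wav2vec2-base", vocab_size=64)
log_probs = model(torch.randn(1, 16000))  # one second of 16 kHz audio
```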

As we show in Chang et al. (2024), SSL models pretrained on Mainstream American English, despite being touted for lowering WERs with less labeled data and for the linguistic (phonetic, phonological, etc.) information they encode, do not generalize zero-shot to unseen pronunciation variants in African American Vernacular English. This motivates a new approach based on speech in-context learning and on the regularity of sound change.
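
For illustration, a zero-shot evaluation of this kind reduces to decoding with a model trained only on the majority variety and scoring its output against transcripts of the target variety. The sketch below uses the jiwer library to compute WER; the reference and hypothesis strings are invented examples, not data from the paper.

```python
import jiwer

# Invented example: a transcript produced by an ASR model trained only on
# Mainstream American English, scored against an AAVE reference.
references = ["she been done finished that work"]
hypotheses = ["she bend on finish that work"]

wer = jiwer.wer(references, hypotheses)  # word error rate over the pair
print(f"zero-shot WER: {wer:.2%}")
```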

References

Kalvin Chang, Yi-Hui Chou, Jiatong Shi, et al. "Self-supervised Speech Representations Still Struggle with African American Vernacular English." In Proc. INTERSPEECH 2024.

Yi-Hui Chou, Kalvin Chang, Meng-Ju Wu, et al. "Evaluating self-supervised speech models on a Taiwanese Hokkien corpus." In Proc. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2023.