Universal Phone Recognition

Recognizing phonetic units in a language-neutral fashion

Modern ASR systems typically recognize units larger than an individual sound. However, it is sometimes desirable to recognize individual sounds, whether as structural units of a particular language (phonemes) or as language-neutral idealizations of acoustic/articulatory units (phones). Recognizing phones is valuable for a variety of applications:

  • Language documentation
  • Very low resource ASR
  • Zero-shot language identification from speech
  • Analysis of atypical speech (e.g., dysarthric or non-native speech)

However, existing universal ASR systems suffer from two deficits:

  • They display very high phone error rates.
  • They do not handle some important phenomena, such as tone.
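
Phone error rate is commonly computed as the edit distance between a reference phone sequence and a hypothesized one, divided by the length of the reference. The sketch below illustrates that calculation, assuming phones have already been tokenized into lists of IPA symbols. The function names and the example transcriptions are illustrative and are not taken from any of the systems cited here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j symbols of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference phone missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis phone)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phones."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substitution and one insertion
# against a three-phone reference.
reference = ["k", "æ", "t"]
hypothesis = ["k", "a", "t", "s"]
per = phone_error_rate(reference, hypothesis)
```

Because insertions count as errors, a rate computed this way can exceed 1.0 when the hypothesis is much longer than the reference.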

Tone and other suprasegmentals are very challenging because they are phonological rather than phonetic. All speech, for example, displays acoustic variation in fundamental frequency, but only some languages use this variation to distinguish words from each other. Thus, characterizing tone in a language-neutral fashion requires substantial work.

Building on both past efforts at universal phone recognition (Li et al., 2020; Yan et al., 2021) and current self-supervised speech models, we aim to build high-accuracy models that can transcribe speech in IPA (International Phonetic Alphabet) as reliably as a human linguist.

References

2021

  1. Differentiable Allophone Graphs for Language-Universal Speech Recognition
    Brian Yan, Siddharth Dalmia, David R. Mortensen, and 2 more authors
    In Proc. Interspeech 2021, 2021

2020

  1. Universal Phone Recognition with a Multilingual Allophone System
    Xinjian Li, Siddharth Dalmia, Juncheng Li, and 8 more authors
    In ICASSP 2020, 2020