Universal Phone Recognition
Recognizing phonetic units in a language-neutral fashion
Modern ASR systems typically recognize units larger than an individual sound. However, it is sometimes desirable to recognize individual sounds, whether as structural units of a particular language (phonemes) or as language-neutral idealizations of an acoustic/articulatory unit (phones). Recognizing phones is valuable for a variety of applications:
- Language documentation
- Very low resource ASR
- Zero-shot language identification from speech
- Analysis of atypical speech (e.g., dysarthric or non-native speech)
However, existing universal ASR systems suffer from two deficits:
- They display very high phone error rates.
- They do not handle some important phenomena, such as tone.
Tone and other suprasegmentals are very challenging because they are phonological rather than purely phonetic. All speech, for example, displays acoustic variation in frequency, but only some languages use this variation to distinguish words from each other. Characterizing tone in a language-neutral fashion therefore requires careful work.
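To make the phone error rate mentioned above concrete, here is a minimal sketch of one common way to compute it: the edit distance between a reference and a hypothesized phone sequence, divided by the reference length. The function name `phone_error_rate` and the example transcriptions are illustrative assumptions, not part of any existing system described here.

```python
def phone_error_rate(reference, hypothesis):
    """Edit distance between two phone sequences, normalized by reference length.

    Both arguments are lists of phone symbols (e.g., IPA strings).
    This is an illustrative sketch, not the scoring code of any particular system.
    """
    if not reference:
        raise ValueError("reference must contain at least one phone")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j phones of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            cost = 0 if ref_phone == hyp_phone else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phone
                curr[j - 1] + 1,     # insertion of a hypothesis phone
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical example: one substitution and one insertion against 3 reference phones.
reference = ["k", "æ", "t"]
hypothesis = ["k", "a", "t", "s"]
per = phone_error_rate(reference, hypothesis)
print(f"PER = {per:.3f}")
```

Because insertions count as errors, a rate computed this way can exceed 1.0 when the hypothesis is much longer than the reference.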
Building on both past efforts at universal phone recognition (Li et al., 2020; Yan et al., 2021) and current self-supervised speech models, we aim to build high-accuracy models that can transcribe speech as IPA (International Phonetic Alphabet) with the same reliability as a human linguist.
References
2021
- Differentiable Allophone Graphs for Language-Universal Speech Recognition. In Proc. Interspeech 2021, 2021
2020
- Universal Phone Recognition with a Multilingual Allophone System. In ICASSP 2020, 2020