It can also produce text-to-speech in over 1,100 languages.
Meta has created an AI language model that (in a refreshing change of pace) isn’t a ChatGPT clone. The company’s Massively Multilingual Speech (MMS) project can recognize over 4,000 spoken languages and produce speech (text-to-speech) in over 1,100. Like most of its other publicly announced AI projects, Meta is open-sourcing MMS today to help preserve language diversity and encourage researchers to build on its foundation. “Today, we are publicly sharing our models and code so that others in the research community can build upon our work,” the company wrote. “Through this work, we hope to make a small contribution to preserve the incredible language diversity of the world.”
Speech recognition and text-to-speech models typically require training on thousands of hours of audio with accompanying transcription labels. (Labels are crucial to machine learning, allowing the algorithms to correctly categorize and “understand” the data.) But for languages that aren’t widely used in industrialized nations — many of which are in danger of disappearing in the coming decades — “this data simply does not exist,” as Meta puts it.
Meta used an unconventional approach to collecting audio data: tapping into audio recordings of translated religious texts. “We turned to religious texts, such as the Bible, that have been translated in many different languages and whose translations have been widely studied for text-based language translation research,” the company said. “These translations have publicly available audio recordings of people reading these texts in different languages.” Incorporating the unlabeled recordings of the Bible and similar texts, Meta’s researchers increased the model’s available languages to over 4,000.
If you’re like me, that approach may raise your eyebrows at first glance, as it sounds like a recipe for an AI model heavily biased toward Christian worldviews. But Meta says that isn’t the case. “While the content of the audio recordings is religious, our analysis shows that this does not bias the model to produce more religious language,” Meta wrote. “We believe this is because we use a connectionist temporal classification (CTC) approach, which is far more constrained compared with large language models (LLMs) or sequence-to-sequence models for speech recognition.” Furthermore, despite most of the religious recordings being read by male speakers, that didn’t introduce a male bias either — performing equally well in female and male voices.
After training an alignment model to make the data more usable, Meta used wav2vec 2.0, the company’s “self-supervised speech representation learning” model, which can train on unlabeled data. Combining unconventional data sources and a self-supervised speech model led to impressive outcomes. “Our results show that the Massively Multilingual Speech models perform well compared with existing models and cover 10 times as many languages.” Specifically, Meta compared MMS to OpenAI’s Whisper, and it exceeded expectations. “We found that models trained on the Massively Multilingual Speech data achieve half the word error rate, but Massively Multilingual Speech covers 11 times more languages.”
Meta cautions that its new models aren’t perfect. “For example, there is some risk that the speech-to-text model may mistranscribe select words or phrases,” the company wrote. “Depending on the output, this could result in offensive and/or inaccurate language. We continue to believe that collaboration across the AI community is critical to the responsible development of AI technologies.”
Now that Meta has released MMS for open-source research, it hopes it can reverse the trend of technology dwindling the world’s languages to the 100 or fewer most often supported by Big Tech. It sees a world where assistive technology, TTS and even VR / AR tech allow everyone to speak and learn in their native tongues. It said, “We envision a world where technology has the opposite effect, encouraging people to keep their languages alive since they can access information and use technology by speaking in their preferred language.”