Mozilla updates Common Voice dataset with 1,400 hours of speech across 18 languages

By Kyle Wiggers

Mozilla wants to make it easier for startups, researchers, and hobbyists to build voice-enabled apps, services, and devices. Toward that end, it’s today releasing the latest version of Common Voice, its open source collection of transcribed voice data that now comprises over 1,400 hours of voice samples from 42,000 contributors across 18 languages, including English, French, German, Dutch, Hakha-Chin, Esperanto, Farsi, Basque, Spanish, Mandarin Chinese, Welsh, and Kabyle.

It’s one of the largest multi-language datasets of its kind, Mozilla claims, and substantially larger than the Common Voice corpus it made publicly available eight months ago, which contained 500 hours (400,000 recordings) from 20,000 volunteers in English. The corpus will soon grow larger still: the organization says that data collection efforts in 70 languages are actively underway via the Common Voice website and mobile apps.

“From the onset, our vision for Common Voice has been to build the world’s most diverse voice dataset, optimized for building voice technologies,” the company wrote in a blog post. “Since we enabled multi-language support … Common Voice has grown to be more global and more inclusive.”


Common Voice, which can be integrated into DeepSpeech, a suite of open-source speech-to-text and text-to-speech engines and trained models maintained by Mozilla’s Machine Learning Group, consists not only of voice snippets but also of voluntarily contributed metadata useful for training speech engines, such as speakers’ ages, sex, and accents. Collecting the metadata and the snippets themselves requires a lot of legwork: the speech prompts on the Common Voice website have to be translated into each target language.

In an effort to streamline the process, Mozilla is this week rolling out an improved Common Voice web tool with simplified prompts that vary from clip to clip, plus new controls for reviewing, re-recording, and skipping clips; a toggle that quickly switches between the dashboard’s “speak” and “listen” modes; and an option to opt out of speech sessions. Additionally, it’s debuting new profile functionality that allows users to track their progress and metrics across languages and to add demographic information.

Mozilla says that in the coming months, it’ll experiment with different approaches to “increase the quantity and quality of data [collected],” both through community efforts and “new partnerships.” And it says that eventually, it plans to use some of the recordings to develop voice-enabled products. (It’s already demonstrated that DeepSpeech, when trained on Common Voice data supplemented with other sources, can transcribe lectures, phone conversations, television programs, radio shows, and other live streams with “human accuracy.”) But the company contends that the ultimate goal is to provide “more and better [speech] data” to those who seek to “build and use voice technology.”

“Mozilla aims to contribute to a more diverse and innovative voice technology ecosystem,” it added. “The Common Voice Website is one of our main vehicles for building voice data sets that are useful for voice-interaction technology. The way it looks today is the result of an ongoing process of iteration. We listened to community feedback about the pain points of contributing while also conducting usability research to make contribution easier, more engaging, and fun.”