From their newsletter:

We’re so excited to share that the 22nd dataset release for Common Voice is now available for download.

Common Voice 22.0 has an additional 281 hours of speech data, bringing the total number of hours to 33,815. This release has also seen a jump of 296 newly validated hours, for a total of 22,640 validated hours of clips. This release welcomes the addition of Aromanian (rup), Tajik (tg), and Venda/Tshivenda (ve) languages.

Aromanian is spoken by around 210,000 people in the Balkans, while Tajik is a language closely related to Persian spoken in Tajikistan and Uzbekistan by over 10 million people. Venda/Tshivenda is spoken by over 2 million people as a first or other language in South Africa and Zimbabwe.

This brings the total number of languages available in this Scripted Speech release to 137.

For those unfamiliar:

Common Voice is a crowdsourcing project started by Mozilla to create a free and open speech corpus. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0.[1] This license ensures that developers can use the database for voice-to-text and text-to-voice applications without restrictions or costs.

  • Kissaki@programming.dev
    2 days ago
    • 44% Male/Masculine
    • 39% No information
    • 18% Female/Feminine

    Tech bias even on public domain open contribution datasets. Apparently could use more female contributors.