Daan van Esch

Journal Article

Building Machine Translation Systems for the Next Thousand Languages

- 09 May 2022 -

TL;DR: Results in three research domains are described, including building clean, web-mined datasets for 1500+ languages (by leveraging semi-supervised pre-training for language identification and developing data-driven language identification techniques) and developing practical MT models for under-served languages.

Proceedings Article

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Isaac Caswell, +3 more

TL;DR: Two classes of techniques are proposed: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models. Together they increase median dataset precision from 5.5% to 71.2% and enable an initial dataset covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.

Proceedings Article

Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS)

Ben Foley, +15 more

TL;DR: The development of Elpis is described. Elpis is a pipeline that language documentation workers with minimal computational experience can use to build their own speech recognition models, and it has been used to build models for 16 languages from the Asia-Pacific region.

Posted Content

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Isaac Caswell, +51 more

- 23 Mar 2021 -

arXiv: Computation and Language

TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).

Proceedings Article

Text Normalization Infrastructure that Scales to Hundreds of Language Varieties

Mason Chua, +5 more

TL;DR: The automated multi-language text normalization infrastructure is described; it prepares textual data, across hundreds of language varieties, to train language models used in Google’s keyboards and speech recognition systems.
