Daan van Esch

Researcher at Google

Publications -  23
Citations -  310

Daan van Esch is an academic researcher from Google. The author has contributed to research in topics: Computer science & Variety (linguistics). The author has an h-index of 9, co-authored 17 publications receiving 152 citations.

Papers
Journal Article

Building Machine Translation Systems for the Next Thousand Languages

TL;DR: Results in three research domains are described, which include building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven language identification techniques and developing practical MT models for under-served languages.
Proceedings Article

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

TL;DR: Two classes of techniques are proposed: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2% and enable an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
Posted Content

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).
Proceedings Article

Text Normalization Infrastructure that Scales to Hundreds of Language Varieties

TL;DR: The paper describes the automated multi-language text normalization infrastructure that prepares textual data, across hundreds of language varieties, to train the language models used in Google’s keyboards and speech recognition systems.