Daan van Esch
Researcher at Google
Publications - 23
Citations - 310
Daan van Esch is an academic researcher from Google. The author has contributed to research in topics: Computer science & Variety (linguistics). The author has an h-index of 9 and has co-authored 17 publications receiving 152 citations.
Papers
Journal Article
Building Machine Translation Systems for the Next Thousand Languages
Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva A. Shah, Yanping Huang, Zhi Chen, Yonghui Wu, Macduff Hughes
TL;DR: Results in three research domains are described, including building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven language identification techniques, and developing practical MT models for under-served languages.
Proceedings Article
Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus
TL;DR: Two classes of techniques are proposed: wordlist-based tunable-precision filters and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2% and enable an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
Proceedings Article
Building Speech Recognition Systems for Language Documentation: The CoEDL Endangered Language Pipeline and Inference System (ELPIS)
Ben Foley, Joshua Arnold, Rolando Coto-Solano, Gautier Durantin, T. Mark Ellison, Daan van Esch, Scott Heath, František Kratochvíl, Zara Maxwell-Smith, David Nash, Ola Olsson, Mark Richards, Nay San, Hywel Stoakes, Nick Thieberger, Janet Wiles
TL;DR: Describes the development of Elpis, a pipeline that language documentation workers with minimal computational experience can use to build their own speech recognition models. Models have been built with it for 16 languages from the Asia-Pacific region.
Posted Content
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Auguste Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara E. Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi
TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).
Proceedings Article
Text Normalization Infrastructure that Scales to Hundreds of Language Varieties
TL;DR: Describes the automated multi-language text normalization infrastructure that prepares textual data, across hundreds of language varieties, to train the language models used in Google’s keyboards and speech recognition systems.