Monang Setyawan
Researcher at Google
Publications - 1
Citations - 24
Monang Setyawan is an academic researcher from Google. The author has contributed to research on the following topics: Audit and Natural language processing. The author has an h-index of 1 and has co-authored 1 publication, which has received 17 citations.
Papers
Posted Content
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Auguste Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara E. Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi +51 more
TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).