Author

Rubungo Andre Niyongabo

Bio: Rubungo Andre Niyongabo is an academic researcher from the University of Electronic Science and Technology of China. The author has contributed to research in the topics of Languages of Africa and Computer science. The author has an h-index of 3 and has co-authored 9 publications receiving 87 citations.

Papers
Proceedings ArticleDOI
05 Oct 2020
TL;DR: The feasibility and scalability of participatory research are demonstrated with a case study on MT for African languages, which leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution.
Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

109 citations
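
The released benchmarks are plain parallel test sets, so a trained system can be scored with a standard automatic metric. Below is a minimal sketch using sacreBLEU; the file names are hypothetical placeholders rather than actual files from the masakhane-mt repository.

```python
# A minimal sketch: score one system's translations against a benchmark test set
# with sacreBLEU. The file names below are hypothetical placeholders.
import sacrebleu

# One hypothesis and one reference per line, in the same order.
with open("hypotheses.en-sw.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.en-sw.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```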

Posted Content
TL;DR: GEM as discussed by the authors is a living benchmark for natural language generation (NLG), its Evaluation and Metrics, which provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested.
Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.

44 citations
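
Because GEM bundles its tasks as standard datasets, a model can be pointed at them with very little glue code. A minimal sketch follows, assuming the GEM data is published on the Hugging Face datasets hub under the "gem" name with per-task configurations such as "common_gen".

```python
# A minimal sketch of loading one GEM task; the hub name "gem" and the
# "common_gen" configuration are assumptions about how the data is published.
from datasets import load_dataset

gem_common_gen = load_dataset("gem", "common_gen")
example = gem_common_gen["validation"][0]
print(example)  # inspect the input fields and the reference target text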

Proceedings ArticleDOI
04 May 2022
TL;DR: It is demonstrated that the most effective strategy for transferring both additional languages and additional domains is to leverage small quantities of high-quality translation data to fine-tune large pre-trained models.
Abstract: Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to build these datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a novel African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both additional languages and additional domains is to leverage small quantities of high-quality translation data to fine-tune large pre-trained models.

43 citations
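
The headline finding, that a small amount of high-quality parallel data goes a long way when fine-tuning a large pre-trained model, can be sketched as follows. The sketch assumes M2M-100 from Hugging Face transformers as the pre-trained model (with a version recent enough to support the tokenizer's text_target argument); the sentence pair is a toy example, not drawn from the released news corpus.

```python
# A minimal sketch of fine-tuning a large pre-trained multilingual MT model on a
# small, high-quality parallel corpus. The sentence pair is illustrative only.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang, tokenizer.tgt_lang = "en", "sw"  # English -> Swahili
pairs = [("The president spoke today.", "Rais alizungumza leo.")]

batch = tokenizer([s for s, _ in pairs],
                  text_target=[t for _, t in pairs],
                  return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
loss = model(**batch).loss  # labels come from text_target
loss.backward()
optimizer.step()
print(f"one fine-tuning step, loss = {loss.item():.3f}")
```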

Proceedings ArticleDOI
01 Dec 2020
TL;DR: The experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi, and the design of the created datasets allows for a wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a basis for new annotations for tasks such as parsing, POS tagging, and NER.
Abstract: Recent progress in text classification has been focused on high-resource languages such as English and Chinese. For low-resource languages, amongst them most African languages, the lack of well-annotated data and effective preprocessing is hindering progress and the transfer of successful methods. In this paper, we introduce two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible, but while Kinyarwanda has been studied in Natural Language Processing (NLP) to some extent, this work constitutes the first study on Kirundi. Along with the datasets, we provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models. Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi. In addition, the design of the created datasets allows for a wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a basis for new annotations for tasks such as parsing, POS tagging, and NER. The datasets, stopwords, and pre-trained embeddings are publicly available at https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus.

26 citations
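
The cross-lingual transfer result can be illustrated with a small sketch: embeddings are trained on the higher-resourced Kinyarwanda text and then reused to featurize Kirundi documents for classification. The tiny corpora, labels, and hyperparameters below are placeholders, not the actual KINNEWS/KIRNEWS data or baselines.

```python
# A minimal sketch of cross-lingual transfer: embeddings trained on Kinyarwanda,
# reused to featurize Kirundi documents. All data below is placeholder toy text.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Placeholder tokenized Kinyarwanda sentences (embedding training data).
kinyarwanda_sents = [["amakuru", "ya", "politiki"], ["imikino", "ya", "none"]]
emb = Word2Vec(sentences=kinyarwanda_sents, vector_size=50, window=3,
               min_count=1, epochs=20)

def featurize(tokens):
    """Average the Kinyarwanda embeddings over the tokens of a document."""
    vecs = [emb.wv[t] for t in tokens if t in emb.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.vector_size)

# Placeholder Kirundi documents with class labels (0 = politics, 1 = sports).
kirundi_docs = [["amakuru", "ya", "politiki"], ["imikino", "ya", "none"]]
labels = [0, 1]

clf = LogisticRegression().fit([featurize(d) for d in kirundi_docs], labels)
print(clf.predict([featurize(["imikino"])]))
```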

Posted Content
TL;DR: In this paper, the authors manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4) and audit the correctness of language codes in a sixth (JW300).
Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.

24 citations
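
A lightweight version of such an audit can be automated before any human pass, for example by flagging lines that are unlikely to be acceptable sentences in the claimed language. The heuristics and thresholds below are illustrative only, not the paper's audit protocol.

```python
# A minimal sketch of a heuristic corpus audit: sample lines and flag those that
# look like fragments or non-linguistic content. Thresholds are illustrative.
import random
import re

def looks_problematic(line: str) -> bool:
    """Flag lines that are too short or contain mostly non-letter characters."""
    text = line.strip()
    if len(text.split()) < 3:                 # fragments / boilerplate
        return True
    letters = len(re.findall(r"[^\W\d_]", text))
    return letters / max(len(text), 1) < 0.5  # mostly digits or punctuation

def audit_sample(lines, sample_size=100, seed=0):
    random.seed(seed)
    sample = random.sample(lines, min(sample_size, len(lines)))
    flagged = [l for l in sample if looks_problematic(l)]
    return len(flagged) / len(sample)

corpus = ["Aya ni amakuru mashya yo mu gihugu.", "12345 !!!", "ok"]
print(f"flagged fraction: {audit_sample(corpus):.2f}")
```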


Cited by
Posted Content
TL;DR: WILDS, a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, is presented, in the hope of encouraging the development of general-purpose methods that are anchored to real-world distribution shifts and that work well across different applications and problem settings.
Abstract: Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity, these real-world distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts which naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training results in substantially lower out-of-distribution than in-distribution performance, and that this gap remains even with models trained by existing methods for handling distribution shifts. This underscores the need for new training methods that produce models which are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at this https URL.

579 citations
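
The open-source package mentioned in the abstract exposes the datasets and loaders through a small API. A minimal sketch follows, assuming the `wilds` package is installed and using one of the vision datasets (camelyon17) as an illustrative choice.

```python
# A minimal sketch of loading a WILDS dataset and iterating over its standard
# training split. Dataset choice and transform are illustrative.
import torchvision.transforms as T
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

dataset = get_dataset(dataset="camelyon17", download=True)
train_data = dataset.get_subset("train", transform=T.Compose([T.ToTensor()]))
train_loader = get_train_loader("standard", train_data, batch_size=16)

for x, y, metadata in train_loader:  # metadata carries the domain (hospital) id
    print(x.shape, y.shape, metadata.shape)
    break
```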

Journal ArticleDOI
TL;DR: BLOOM as discussed by the authors is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).
Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

407 citations
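
Since the models are released openly, they can be loaded directly with Hugging Face transformers. A minimal sketch, assuming one of the smaller public checkpoints (bigscience/bloom-560m) so the example fits on modest hardware:

```python
# A minimal sketch: load a small public BLOOM checkpoint and generate text.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Les grands modèles de langue", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```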

Posted Content
TL;DR: This work creates a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models.
Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

378 citations
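
Direct translation between two non-English languages is the model's main selling point. A minimal sketch, assuming the released 418M-parameter checkpoint on the Hugging Face hub (facebook/m2m100_418M) and French-to-Swahili as the illustrative direction:

```python
# A minimal sketch of direct non-English translation with M2M-100.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)

tokenizer.src_lang = "fr"
encoded = tokenizer("La vie est belle.", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("sw"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```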

Posted Content
TL;DR: This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Abstract: The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples for task-specific NLG evaluations for automatic text summarization and long text generation, and conclude the paper by proposing future research directions.

186 citations
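
Of the three categories, the automatic metrics that require no training are the easiest to illustrate. A minimal sketch computing ROUGE for a toy summarization output, assuming the Hugging Face `evaluate` library with its `rouge_score` backend installed:

```python
# A minimal sketch of an untrained automatic NLG metric: ROUGE on a toy summary.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["the cabinet approved the new budget on friday"]
references = ["the cabinet approved a new national budget on friday"]
print(rouge.compute(predictions=predictions, references=references))
```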

Proceedings ArticleDOI
01 Jun 2021
TL;DR: A structured overview is given of methods that enable learning when training data is sparse, including mechanisms to create additional labeled data, such as data augmentation and distant supervision, as well as transfer learning settings that reduce the need for target supervision.
Abstract: Deep neural networks and huge language models are becoming omnipresent in natural language applications. As they are known for requiring large amounts of training data, there is a growing body of work to improve the performance in low-resource settings. Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing. After a discussion about the different dimensions of data availability, we give a structured overview of methods that enable learning when training data is sparse. This includes mechanisms to create additional labeled data like data augmentation and distant supervision as well as transfer learning settings that reduce the need for target supervision. A goal of our survey is to explain how these methods differ in their requirements as understanding them is essential for choosing a technique suited for a specific low-resource setting. Further key aspects of this work are to highlight open issues and to outline promising directions for future research.

138 citations
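
One of the surveyed mechanisms for creating additional labeled data, token-level augmentation, is simple enough to sketch directly; the random-swap operation, example sentence, and label below are illustrative only.

```python
# A minimal sketch of data augmentation for low-resource text classification:
# random-swap copies of a labeled sentence enlarge the training set.
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Return a copy of `tokens` with `n_swaps` random pairs of positions swapped."""
    rng = random.Random(seed)
    augmented = tokens[:]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(augmented)), 2)
        augmented[i], augmented[j] = augmented[j], augmented[i]
    return augmented

sentence = "the new dam will supply power to three provinces".split()
label = "infrastructure"

# Each augmented copy keeps the original label.
augmented_set = [(random_swap(sentence, n_swaps=1, seed=s), label) for s in range(3)]
for tokens, lab in augmented_set:
    print(lab, " ".join(tokens))
```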