Author

Chris Chinenye Emezue

Bio: Chris Chinenye Emezue is an academic researcher from Technische Universität München. The author has contributed to research on the topics Computer science and Languages of Africa. The author has an h-index of 5 and has co-authored 13 publications receiving 124 citations.

Papers

Journal Article

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick +383 more

09 Nov 2022-arXiv.org

TL;DR: BLOOM is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total).

Abstract: Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

407 citations

Proceedings Article

Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Wilhelmina Nekoto, Vukosi Marivate¹, Tshinondiwa Matsila, Timi E. Fasubaa², Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen Hassan Muhammad³, Salomon Kabongo⁴, Salomey Osei⁵, Sackey Freshia, Rubungo Andre Niyongabo⁶, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Martinus⁷, Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji⁸, Kathleen Siminyu⁹, Julia Kreutzer¹⁰, Jason Webster, Jamiil Toure Ali, Jade Abbott¹, Iroro Orife¹¹, Ignatius Ezeani¹², Idris Abdulkabir Dangana, Herman Kamper¹³, Hady Elsahar¹⁴, Goodness Duru, Ghollah Kioko, Espoir Murhabazi, Elan van Biljon¹³, Daniel Whitenack, Christopher Onyefuluchi, Chris Chinenye Emezue¹⁵, Bonaventure F. P. Dossou¹⁶, Blessing Sibanda, Blessing Itoro Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem¹⁷, Adewale Akinfaderin¹⁸, Abdallah Bashir

Institutions (18): University of Pretoria¹, University of California, Berkeley², University of Porto³, Leibniz University of Hanover⁴, African Institute for Mathematical Sciences⁵, University of Electronic Science and Technology of China⁶, Council of Scientific and Industrial Research⁷, University of Waterloo⁸, Georgia Institute of Technology⁹, Google¹⁰, Carnegie Mellon University¹¹, Lancaster University¹², Stellenbosch University¹³, Naver Corporation¹⁴, Technische Universität München¹⁵, Jacobs University Bremen¹⁶, Pompeu Fabra University¹⁷, Florida State University¹⁸

05 Oct 2020

TL;DR: The feasibility and scalability of participatory research is demonstrated with a case study on MT for African languages, which leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution.

Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), that plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt.

109 citations

Proceedings Article

Bayesian Structure Learning with Generative Flow Networks

Tristan Deleu, António Góis, Chris Chinenye Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, Yoshua Bengio

28 Feb 2022

TL;DR: This work proposes to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations, and it compares favorably against other methods based on MCMC or variational inference.

Abstract: In Bayesian structure learning, we are interested in inferring a distribution over the directed acyclic graph (DAG) structure of Bayesian networks, from data. Defining such a distribution is very challenging, due to the combinatorially large sample space, and approximations based on MCMC are often required. Recently, a novel class of probabilistic models, called Generative Flow Networks (GFlowNets), have been introduced as a general framework for generative modeling of discrete and composite objects, such as graphs. In this work, we propose to use a GFlowNet as an alternative to MCMC for approximating the posterior distribution over the structure of Bayesian networks, given a dataset of observations. Generating a sample DAG from this approximate distribution is viewed as a sequential decision problem, where the graph is constructed one edge at a time, based on learned transition probabilities. Through evaluation on both simulated and real data, we show that our approach, called DAG-GFlowNet, provides an accurate approximation of the posterior over DAGs, and it compares favorably against other methods based on MCMC or variational inference.

50 citations

Proceedings Article

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Anuoluwapo Aremu, Saheed Abdul, Pavel Brazdil

20 Jan 2022

TL;DR: This work introduces the first large-scale human-annotated Twitter sentiment dataset for Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language, including a significant fraction of code-mixed tweets.

Abstract: Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria—Hausa, Igbo, Nigerian-Pidgin, and Yorùbá—consisting of around 30,000 annotated tweets per language, including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.

48 citations

Posted Content

Masakhane - Machine Translation For Africa.

Iroro Orife, Julia Kreutzer, Blessing Sibanda, Daniel Whitenack, Kathleen Siminyu, Laura Martinus, Jamiil Toure Ali, Jade Abbott, Vukosi Marivate, Salomon Kabongo, Musie Meressa, Espoir Murhabazi, Orevaoghene Ahia, Elan van Biljon, Arshath Ramkilowan, Adewale Akinfaderin, Alp Öktem, Wole Akin, Ghollah Kioko, Kevin Degila, Herman Kamper, Bonaventure F. P. Dossou, Chris Chinenye Emezue, Kelechi Ogueji, Abdallah Bashir

13 Mar 2020-arXiv: Computation and Language

TL;DR: The methodology for building the community and spurring research from the African continent, as well as the success of the community in terms of addressing the identified problems affecting African NLP are discussed.

Abstract: Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To begin to address the identified problems, MASAKHANE, an open-source, continent-wide, distributed, online research effort for machine translation for African languages, was founded. In this paper, we discuss our methodology for building the community and spurring research from the African continent, as well as outline the success of the community in terms of addressing the identified problems affecting African NLP.

47 citations

Cited by

Journal Article

Ethnologue: Languages of the World

Sarah L. Nesbeitt

01 Nov 1999-Electronic Resources Review

1,364 citations

Journal Article

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

27 Feb 2023-arXiv.org

TL;DR: This article introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters and trained on trillions of tokens, and shows that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.

Abstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

809 citations

Posted Content

WILDS: A Benchmark of in-the-Wild Distribution Shifts

Pang Wei Koh¹, Shiori Sagawa¹, Henrik Marklund¹, Sang Michael Xie², Marvin Zhang¹, Akshay Balsubramani¹, Weihua Hu¹, Michihiro Yasunaga³, Richard Lanas Phillips¹, Irena Gao¹, Tony Lee¹, Etienne David⁴, Ian Stavness⁵, Wei Guo⁵, Berton A. Earnshaw, Imran S. Haque⁶, Sara Beery¹, Jure Leskovec¹, Anshul Kundaje⁷, Emma Pierson², Sergey Levine¹, Chelsea Finn¹, Percy Liang¹

Institutions (7): Stanford University¹, University of California, Berkeley², Cornell University³, University of Saskatchewan⁴, University of Tokyo⁵, California Institute of Technology⁶, Microsoft⁷

14 Dec 2020-arXiv: Learning

TL;DR: This work presents WILDS, a benchmark of in-the-wild distribution shifts spanning diverse data modalities and applications, intended to encourage the development of general-purpose methods that are anchored to real-world distribution shifts and that work well across different applications and problem settings.

Abstract: Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity, these real-world distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated collection of 8 benchmark datasets that reflect a diverse range of distribution shifts which naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training results in substantially lower out-of-distribution than in-distribution performance, and that this gap remains even with models trained by existing methods for handling distribution shifts. This underscores the need for new training methods that produce models which are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at this https URL.

579 citations

Journal Article

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao, Angela Fan, Christopher Akiki, Elizabeth-Jane Pavlick +383 more

09 Nov 2022-arXiv.org

407 citations

Posted Content

Beyond English-Centric Multilingual Machine Translation

Angela Fan¹, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin

Institutions (1): Facebook¹

21 Oct 2020-arXiv: Computation and Language

TL;DR: This work creates a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models.

Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.

378 citations
