Author

Francisco Guzmán

Other affiliations: University of Melbourne, Johns Hopkins University, Khalifa University ...

Bio: Francisco Guzmán is an academic researcher at Facebook. The author has contributed to research on the topics Machine translation and Sentence. The author has an h-index of 24 and has co-authored 82 publications receiving 3433 citations. Previous affiliations of Francisco Guzmán include University of Melbourne and Johns Hopkins University.

Papers published on a yearly basis (chart; years shown: 2009, 2010, 2012–2017, 2019–2021)

Papers

Proceedings Article

Unsupervised Cross-lingual Representation Learning at Scale

Alexis Conneau¹, Kartikay Khandelwal², Naman Goyal¹, Vishrav Chaudhary¹, Guillaume Wenzek¹, Francisco Guzmán³, Edouard Grave¹, Myle Ott¹, Luke Zettlemoyer¹, Veselin Stoyanov¹

Facebook¹, Microsoft², Johns Hopkins University³

01 Jul 2020

TL;DR: It is shown that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and the possibility of multilingual modeling without sacrificing per-language performance is shown for the first time.

Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

3,248 citations

Posted Content

Unsupervised Cross-lingual Representation Learning at Scale

Facebook¹, Microsoft², Johns Hopkins University³

05 Nov 2019-arXiv: Computation and Language

TL;DR: This paper showed that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks, and trained a Transformer-based masked language model on one hundred languages using more than two terabytes of filtered CommonCrawl data.

669 citations

Proceedings Article

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

Guillaume Wenzek¹, Marie-Anne Lachaux¹, Alexis Conneau¹, Vishrav Chaudhary¹, Francisco Guzmán¹, Armand Joulin¹, Edouard Grave¹

Facebook¹

01 Nov 2019

TL;DR: An automatic pipeline extracts massive high-quality monolingual datasets from Common Crawl for a variety of languages, following the data processing introduced in fastText, which deduplicates documents and identifies their language.

Abstract: Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), that deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high quality corpora like Wikipedia.

313 citations

Posted Content

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

Holger Schwenk¹, Vishrav Chaudhary¹, Shuo Sun², Hongyu Gong³, Francisco Guzmán¹

Facebook¹, Johns Hopkins University², University of Illinois at Urbana–Champaign³

10 Jul 2019-arXiv: Computation and Language

TL;DR: An approach based on multilingual sentence embeddings is presented to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages.

Abstract: We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 85 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but systematically consider all possible language pairs. In total, we are able to extract 135M parallel sentences for 1620 different language pairs, out of which only 34M are aligned with English. This corpus of parallel sentences is freely available at this https URL. To get an indication of the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

287 citations

Posted Content

The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

Francisco Guzmán¹, Peng-Jen Chen¹, Myle Ott¹, Juan Pino¹, Guillaume Lample¹, Philipp Koehn¹, Vishrav Chaudhary¹, Marc'Aurelio Ranzato²

Facebook¹, Johns Hopkins University²

04 Feb 2019-arXiv: Computation and Language

TL;DR: This work introduces the FLoRes evaluation datasets for Nepali–English and Sinhala–English, based on sentences translated from Wikipedia, and demonstrates that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT.

Abstract: For machine translation, a vast majority of language pairs in the world are considered low-resource because they have little parallel data available. Besides the technical challenges of learning with limited supervision, it is difficult to evaluate methods trained on low-resource language pairs because of the lack of freely and publicly available benchmarks. In this work, we introduce the FLoRes evaluation datasets for Nepali-English and Sinhala-English, based on sentences translated from Wikipedia. Compared to English, these are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low-resource MT. Data and code to reproduce our experiments are available at this https URL.

168 citations

Cited by

Proceedings Article

Unsupervised Cross-lingual Representation Learning at Scale

Facebook¹, Microsoft², Johns Hopkins University³

01 Jul 2020

3,248 citations

Proceedings Article

mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Linting Xue¹, Noah Constant¹, Adam Roberts², Mihir Kale¹, Rami Al-Rfou¹, Aditya Siddhant¹, Aditya Barua¹, Colin Raffel¹

Google¹, University of Chester²

01 Jun 2021

TL;DR: This paper proposed a multilingual variant of T5, mT5, which was pre-trained on a new Common Crawl-based dataset covering 101 languages and achieved state-of-the-art performance on many multilingual benchmarks.

Abstract: The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this paper, we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages. We detail the design and modified training of mT5 and demonstrate its state-of-the-art performance on many multilingual benchmarks. We also describe a simple technique to prevent “accidental translation” in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language. All of the code and model checkpoints used in this work are publicly available.

1,016 citations

Journal Article

Multilingual Denoising Pre-training for Neural Machine Translation

Yinhan Liu, Jiatao Gu¹, Naman Goyal¹, Xian Li¹, Sergey Edunov¹, Marjan Ghazvininejad¹, Michael Lewis¹, Luke Zettlemoyer¹

Facebook¹

27 Nov 2020-Transactions of the Association for Computational Linguistics

TL;DR: This article proposed mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective.

Abstract: This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART -- a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show it enables transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.

921 citations

Journal Article

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

27 Feb 2023-arXiv.org

TL;DR: This article introduced LLaMA, a collection of foundation language models ranging from 7B to 65B parameters trained on trillions of tokens, and showed that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets.

Abstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

809 citations

Journal Article

Pre-trained Models for Natural Language Processing: A Survey

Xipeng Qiu¹, Tianxiang Sun¹, Yige Xu¹, Yunfan Shao¹, Ning Dai¹, Xuanjing Huang¹

Fudan University¹

18 Mar 2020-Science China Technological Sciences

TL;DR: This survey provides a comprehensive review of pre-trained models (PTMs) for natural language processing (NLP), whose emergence has brought the field to a new era.

Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

755 citations
