CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web
Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, Angela Fan
pp. 6490–6500
TLDR
This article uses one unified approach for 90 languages to mine 10.8 billion parallel sentences, of which only 2.9 billion are aligned with English, and achieves state-of-the-art results on the WMT’19 test set for English-German/Russian/Chinese.

Abstract:
We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated Common Crawl corpus (Wenzek et al., 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, of which only 2.9 billion are aligned with English. We illustrate the capability of our scalable mining system to create high-quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT’19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single systems by over 4 BLEU points and are on par with the best WMT’19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.
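The ratio-margin criterion at the heart of this mining pipeline can be sketched in a few lines of pure Python (an illustrative sketch that assumes sentence embeddings, e.g. from a multilingual encoder such as LASER, are already computed; `cos` and `margin_score` are our own hypothetical helper names, not the paper's code):

```python
import math

def cos(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_score(x, y, x_neighbors, y_neighbors):
    """Ratio margin: cos(x, y) divided by the average similarity of x and y
    to their k nearest neighbors in the other language.  A candidate pair
    stands out from its neighborhood when this score exceeds a threshold
    (scores above 1 mean the pair is closer than the background)."""
    k = len(x_neighbors)
    background = (sum(cos(x, z) for z in x_neighbors) / (2 * k) +
                  sum(cos(y, z) for z in y_neighbors) / (2 * k))
    return cos(x, y) / background
```

In the paper's setting the neighbor lists come from a large-scale approximate nearest-neighbor search over the embedded monolingual corpora; here they are plain Python lists for illustration.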
Citations
Posted Content
Beyond English-Centric Multilingual Machine Translation
Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin
TL;DR: This work creates a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages, and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models.
Proceedings Article
WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
TL;DR: The authors extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects and low-resource languages, train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs.
Journal Article
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
TL;DR: Samanantar is the largest publicly available parallel corpora collection for Indic languages, containing 49.7 million sentence pairs between English and 11 languages (from two language families).
Journal Article
Survey of Low-Resource Machine Translation
TL;DR: This paper surveys the state of the art in low-resource machine translation (MT) research.
Posted Content
Scalable and Efficient MoE Training for Multitask Multilingual Models.
Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andrés Felipe Cruz-Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He, Hany Hassan Awadalla
TL;DR: This article presents a system capable of efficiently scaling Mixture of Experts (MoE) models to trillions of parameters. Because large-scale MoE training poses its own system and modeling challenges, the authors first develop a system that harmoniously combines multi-dimensional parallelism and heterogeneous memory technologies with MoE to enable 8x larger models on the same hardware compared with existing work.
References
Proceedings Article
Neural Machine Translation of Rare Words with Subword Units
TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
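The subword-unit idea can be illustrated with a minimal byte-pair-encoding (BPE) learner (a simplified sketch, not the paper's reference implementation; the vocabulary maps space-separated symbol sequences to word frequencies, with `</w>` marking word ends):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of all adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge: fuse every adjacent occurrence of `pair` into one symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Greedily learn merge operations: repeatedly fuse the most frequent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

On the paper's toy vocabulary ({'l o w &lt;/w&gt;': 5, 'l o w e r &lt;/w&gt;': 2, 'n e w e s t &lt;/w&gt;': 6, 'w i d e s t &lt;/w&gt;': 3}), the first learned merges are ('e', 's') and ('es', 't').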
Europarl: A Parallel Corpus for Statistical Machine Translation
TL;DR: A corpus of parallel text in 11 languages is collected from the proceedings of the European Parliament, with a focus on its acquisition and its application as training data for statistical machine translation (SMT).
Journal Article
Product Quantization for Nearest Neighbor Search
TL;DR: This paper introduces a product quantization-based approach for approximate nearest neighbor search that decomposes the space into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately.
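The core encode/decode step of product quantization can be sketched as follows (pure Python, assuming the per-subspace codebooks have already been trained, e.g. with k-means; function names are ours):

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def encode_pq(x, codebooks):
    """Split x into len(codebooks) subvectors and replace each with the
    index of its nearest centroid, yielding a compact code."""
    m = len(codebooks)
    d_sub = len(x) // m
    codes = []
    for j, centroids in enumerate(codebooks):
        sub = x[j * d_sub:(j + 1) * d_sub]
        codes.append(min(range(len(centroids)),
                         key=lambda c: sq_dist(sub, centroids[c])))
    return codes

def decode_pq(codes, codebooks):
    """Reconstruct an approximation of x by concatenating the centroids
    selected by the code."""
    out = []
    for j, c in enumerate(codes):
        out.extend(codebooks[j][c])
    return out
```

With m subspaces of 256 centroids each, a vector is stored in m bytes, which is what makes billion-scale indexes fit in memory.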
Posted Content
Billion-scale similarity search with GPUs
TL;DR: In this paper, the authors propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art.
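The k-selection problem optimized here on GPUs (keeping the k smallest distances among huge candidate sets) has a simple exact CPU analogue (a brute-force sketch using the standard-library heap; real billion-scale search relies on GPU kernels and approximate indexes such as those in Faiss):

```python
import heapq

def knn(query, database, k):
    """Indices of the k nearest database vectors to `query` under squared
    L2 distance, via heap-based k-selection (heapq.nsmallest maintains a
    bounded heap instead of sorting all distances)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return heapq.nsmallest(k, range(len(database)),
                           key=lambda i: sq_dist(query, database[i]))
```

This runs in O(n log k) per query; the paper's contribution is making the same selection step fast on GPU register files rather than changing its semantics.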
Proceedings Article
Parallel Data, Tools and Interfaces in OPUS
TL;DR: This paper reports on new data sets and their features, additional annotation tools and models provided from the website, and essential interfaces and on-line services included in the OPUS project.