Open Access · Proceedings Article

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

TLDR
Using a unified approach for 90 languages, this article mined 10.8 billion parallel sentences, of which only 2.9 billion are aligned with English, and achieved state-of-the-art results on the WMT'19 test set for English-German/Russian/Chinese.
Abstract
We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated common crawl corpus (Wenzek et al., 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billion are aligned with English. We illustrate the capability of our scalable mining system to create high-quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT'19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single systems by over 4 BLEU points and are on par with the best WMT'19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.
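For intuition, the margin criterion behind this kind of mining can be sketched in a few lines. This is a minimal illustration, assuming unit-normalized (LASER-style) sentence embeddings and a brute-force similarity matrix; at CCMatrix scale the k-nearest-neighbor search is done approximately with FAISS, and the threshold value below is illustrative.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between L2-normalized sentence embeddings.

    score(x, y) = cos(x, y) / (avg cos of x to its k NNs / 2
                               + avg cos of y to its k NNs / 2)
    """
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    sim = src_emb @ tgt_emb.T                             # (n_src, n_tgt)

    # Average similarity of each sentence to its k nearest neighbors
    # on the other side; this forms the margin's denominator.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # (n_tgt,)

    denom = knn_src[:, None] / 2.0 + knn_tgt[None, :] / 2.0
    return sim / denom

def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.06):
    """Keep forward best matches whose margin score clears the threshold.
    A full implementation also checks the backward direction."""
    scores = margin_scores(src_emb, tgt_emb, k)
    best_tgt = scores.argmax(axis=1)
    return [(i, j, scores[i, j])
            for i, j in enumerate(best_tgt)
            if scores[i, j] >= threshold]
```

The margin normalizes each cosine score by how similar the two sentences are to their nearest neighbors in general, which filters out sentences that are close to everything.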



Citations
Posted Content

Beyond English-Centric Multilingual Machine Translation

TL;DR: This work creates a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages, and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models.
Proceedings Article

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

TL;DR: The authors extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects and low-resource languages, train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs.
Journal Article

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

TL;DR: Samanantar is the largest publicly available parallel corpora collection for Indic languages, containing 49.7 million sentence pairs between English and 11 languages from two language families.
Journal Article

Survey of Low-Resource Machine Translation

TL;DR: This paper surveys the state of the art in low-resource machine translation (MT) research.
Posted Content

Scalable and Efficient MoE Training for Multitask Multilingual Models.

TL;DR: This article presents a system capable of efficiently scaling Mixture of Experts (MoE) models to trillions of parameters. Because large-scale MoE training brings its own system and modeling challenges, the authors develop a system that combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE, enabling 8x larger models on the same hardware compared with existing work.
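As a rough illustration of the modeling side only, below is a minimal top-1 (Switch-style) gated MoE feed-forward layer in PyTorch. It shows just token routing; the multi-dimensional parallelism and memory offloading this paper is about are systems work layered on top and are not sketched here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Minimal top-1 gated mixture-of-experts feed-forward layer."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)           # routing probabilities
        top_p, top_idx = probs.max(dim=-1)                # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because each token activates a single expert, parameter count grows with the number of experts while per-token compute stays roughly constant, which is what makes trillion-parameter scaling attractive.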
References
Proceedings Article

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach that makes the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline on the WMT 15 English-German and English-Russian translation tasks by up to 1.3 BLEU.
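The merge-learning loop at the core of byte-pair encoding is compact enough to sketch directly. The following mirrors the algorithm described in the paper; the toy vocabulary is purely illustrative.

```python
from collections import Counter

def merge_word(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs: dict, num_merges: int) -> list:
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Start from characters, with an end-of-word marker so merges
    # cannot cross word boundaries.
    vocab = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for bigram in zip(symbols, symbols[1:]):
                pairs[bigram] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent symbol pair
        merges.append(best)
        vocab = {merge_word(s, best): f for s, f in vocab.items()}
    return merges

# Frequent character pairs such as ('e', 'r') get merged first.
print(learn_bpe({'lower': 5, 'newer': 3, 'wider': 2}, num_merges=5))
```

Applying the learned merges in order segments any word, including unseen ones, into known subword units, which is what gives the model an open vocabulary.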

Europarl: A Parallel Corpus for Statistical Machine Translation

Philipp Koehn
TL;DR: This paper collects a corpus of parallel text in 11 languages from the proceedings of the European Parliament and focuses on its acquisition and application as training data for statistical machine translation (SMT).
Journal Article

Product Quantization for Nearest Neighbor Search

TL;DR: This paper introduces a product-quantization-based approach for approximate nearest neighbor search that decomposes the space into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately.
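A minimal sketch of the encode/decode half of product quantization, training the per-subspace codebooks with off-the-shelf k-means (the asymmetric-distance lookup tables used at query time are omitted; the vector dimension is assumed divisible by the number of subspaces):

```python
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    """Encode vectors as short codes: split into m subvectors, k-means each."""
    def __init__(self, m: int = 8, k: int = 256):
        self.m, self.k = m, k          # k <= 256 so each code fits in a byte

    def fit(self, X: np.ndarray):
        # Requires at least k training vectors and X.shape[1] % m == 0.
        self.d_sub = X.shape[1] // self.m
        self.codebooks = []
        for i in range(self.m):
            sub = X[:, i * self.d_sub:(i + 1) * self.d_sub]
            self.codebooks.append(
                KMeans(n_clusters=self.k, n_init=1).fit(sub).cluster_centers_)
        return self

    def encode(self, X: np.ndarray) -> np.ndarray:
        codes = np.empty((X.shape[0], self.m), dtype=np.uint8)
        for i, cb in enumerate(self.codebooks):
            sub = X[:, i * self.d_sub:(i + 1) * self.d_sub]
            # Nearest centroid per subvector = one byte of the code.
            dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            codes[:, i] = dists.argmin(axis=1)
        return codes

    def decode(self, codes: np.ndarray) -> np.ndarray:
        # Concatenate the selected centroids to reconstruct each vector.
        return np.hstack([cb[codes[:, i]] for i, cb in enumerate(self.codebooks)])
```

With m = 8 subspaces and k = 256 centroids each, a high-dimensional float vector compresses to 8 bytes; real implementations compute query-to-code distances via per-subspace lookup tables rather than decoding.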
Posted Content

Billion-scale similarity search with GPUs

TL;DR: In this paper, the authors propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art.
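The system behind this entry is the Faiss library. As a minimal sketch of the GPU brute-force k-NN search it enables (index type, data sizes, and k here are illustrative; billion-scale deployments use compressed IVF/PQ indexes rather than a flat index):

```python
import numpy as np
import faiss  # pip install faiss-gpu

d = 128
xb = np.random.rand(1_000_000, d).astype('float32')  # database vectors
xq = np.random.rand(10, d).astype('float32')         # query vectors

# Exact inner-product search, moved from CPU to GPU 0.
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, faiss.IndexFlatIP(d))
index.add(xb)

D, I = index.search(xq, 8)  # top-8 neighbors: scores D, row ids I
```

For cosine similarity, as used in margin-based mining, the vectors are L2-normalized first so that inner product and cosine coincide.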
Proceedings Article

Parallel Data, Tools and Interfaces in OPUS

TL;DR: This paper reports on new data sets and their features, additional annotation tools and models provided on the website, and essential interfaces and on-line services included in the OPUS project.