Open Access · Proceedings Article

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

TLDR
Using a unified approach for 90 languages, this article mined 10.8 billion parallel sentences, of which only 2.9 billion are aligned with English, and achieved state-of-the-art results on the WMT'19 test set for English-German/Russian/Chinese.
Abstract
We show that margin-based bitext mining in a multilingual sentence space can be successfully scaled to operate on monolingual corpora of billions of sentences. We use 32 snapshots of a curated common crawl corpus (Wenzek et al., 2019) totaling 71 billion unique sentences. Using one unified approach for 90 languages, we were able to mine 10.8 billion parallel sentences, out of which only 2.9 billion are aligned with English. We illustrate the capability of our scalable mining system to create high-quality training sets from one language to any other by training hundreds of different machine translation models and evaluating them on the many-to-many TED benchmark. Further, we evaluate on competitive translation benchmarks such as WMT and WAT. Using only mined bitext, we set a new state of the art for a single system on the WMT'19 test set for English-German/Russian/Chinese. In particular, our English/German and English/Russian systems outperform the best single systems by over 4 BLEU points and are on par with the best WMT'19 systems, which train on the WMT training data and augment it with backtranslation. We also achieve excellent results for distant language pairs like Russian/Japanese, outperforming the best submission at the 2020 WAT workshop. All of the mined bitext will be freely available.
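For intuition, the margin criterion behind this kind of mining can be sketched in a few lines. This is a minimal illustration, assuming unit-normalized (LASER-style) sentence embeddings and a brute-force similarity matrix; at CCMatrix scale the k-nearest-neighbor search is done approximately with FAISS, and the threshold value below is illustrative.

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between L2-normalized sentence embeddings.

    score(x, y) = cos(x, y) / (avg cos of x to its k NNs / 2
                               + avg cos of y to its k NNs / 2)
    """
    # Cosine similarity reduces to a dot product for unit-norm vectors.
    sim = src_emb @ tgt_emb.T                             # (n_src, n_tgt)

    # Average similarity of each sentence to its k nearest neighbors
    # on the other side; this forms the margin's denominator.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # (n_src,)
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # (n_tgt,)

    denom = knn_src[:, None] / 2.0 + knn_tgt[None, :] / 2.0
    return sim / denom

def mine_pairs(src_emb, tgt_emb, k=4, threshold=1.06):
    """Keep forward best matches whose margin score clears the threshold.
    A full implementation also checks the backward direction."""
    scores = margin_scores(src_emb, tgt_emb, k)
    best_tgt = scores.argmax(axis=1)
    return [(i, j, scores[i, j])
            for i, j in enumerate(best_tgt)
            if scores[i, j] >= threshold]
```

The margin normalizes each cosine score by how similar the two sentences are to their nearest neighbors in general, which filters out sentences that are close to everything.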



Citations
Posted Content

Beyond English-Centric Multilingual Machine Translation

TL;DR: This work creates a true many-to-many multilingual translation model that can translate directly between any pair of 100 languages, and explores how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high-quality models.
Proceedings Article

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

TL;DR: The authors extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects and low-resource languages, train neural MT baseline systems on the mined data only for 1886 language pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs.
Journal Article

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

TL;DR: Samanantar is the largest publicly available parallel corpora collection for Indic languages, containing 49.7 million sentence pairs between English and 11 languages from two language families.
Journal Article

Survey of Low-Resource Machine Translation

TL;DR: This paper surveys the state of the art in low-resource machine translation (MT) research.
Posted Content

Scalable and Efficient MoE Training for Multitask Multilingual Models.

TL;DR: This article presents a system capable of efficiently scaling Mixture of Experts (MoE) models to trillions of parameters. Because large-scale MoE training brings its own system and modeling challenges, the authors develop a system that combines multi-dimensional parallelism and heterogeneous memory technologies harmoniously with MoE, enabling 8x larger models on the same hardware compared with existing work.
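As a rough illustration of the modeling side only, below is a minimal top-1 (Switch-style) gated MoE feed-forward layer in PyTorch. It shows just token routing; the multi-dimensional parallelism and memory offloading this paper is about are systems work layered on top and are not sketched here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Minimal top-1 gated mixture-of-experts feed-forward layer."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)           # routing probabilities
        top_p, top_idx = probs.max(dim=-1)                # one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because each token activates a single expert, parameter count grows with the number of experts while per-token compute stays roughly constant, which is what makes trillion-parameter scaling attractive.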
References
Proceedings Article

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach that makes the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline on the WMT 15 English-German and English-Russian translation tasks by up to 1.3 BLEU.
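The merge-learning loop at the core of byte-pair encoding is compact enough to sketch directly. The following mirrors the algorithm described in the paper; the toy vocabulary is purely illustrative.

```python
from collections import Counter

def merge_word(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs: dict, num_merges: int) -> list:
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Start from characters, with an end-of-word marker so merges
    # cannot cross word boundaries.
    vocab = {tuple(w) + ('</w>',): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for bigram in zip(symbols, symbols[1:]):
                pairs[bigram] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent symbol pair
        merges.append(best)
        vocab = {merge_word(s, best): f for s, f in vocab.items()}
    return merges

# Frequent character pairs such as ('e', 'r') get merged first.
print(learn_bpe({'lower': 5, 'newer': 3, 'wider': 2}, num_merges=5))
```

Applying the learned merges in order segments any word, including unseen ones, into known subword units, which is what gives the model an open vocabulary.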

Europarl: A Parallel Corpus for Statistical Machine Translation

Philipp Koehn
TL;DR: This paper collects a corpus of parallel text in 11 languages from the proceedings of the European Parliament and focuses on its acquisition and application as training data for statistical machine translation (SMT).
Journal Article

Product Quantization for Nearest Neighbor Search

TL;DR: This paper introduces a product-quantization-based approach for approximate nearest neighbor search that decomposes the space into a Cartesian product of low-dimensional subspaces and quantizes each subspace separately.
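A minimal sketch of the encode/decode half of product quantization, training the per-subspace codebooks with off-the-shelf k-means (the asymmetric-distance lookup tables used at query time are omitted; the vector dimension is assumed divisible by the number of subspaces):

```python
import numpy as np
from sklearn.cluster import KMeans

class ProductQuantizer:
    """Encode vectors as short codes: split into m subvectors, k-means each."""
    def __init__(self, m: int = 8, k: int = 256):
        self.m, self.k = m, k          # k <= 256 so each code fits in a byte

    def fit(self, X: np.ndarray):
        # Requires at least k training vectors and X.shape[1] % m == 0.
        self.d_sub = X.shape[1] // self.m
        self.codebooks = []
        for i in range(self.m):
            sub = X[:, i * self.d_sub:(i + 1) * self.d_sub]
            self.codebooks.append(
                KMeans(n_clusters=self.k, n_init=1).fit(sub).cluster_centers_)
        return self

    def encode(self, X: np.ndarray) -> np.ndarray:
        codes = np.empty((X.shape[0], self.m), dtype=np.uint8)
        for i, cb in enumerate(self.codebooks):
            sub = X[:, i * self.d_sub:(i + 1) * self.d_sub]
            # Nearest centroid per subvector = one byte of the code.
            dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            codes[:, i] = dists.argmin(axis=1)
        return codes

    def decode(self, codes: np.ndarray) -> np.ndarray:
        # Concatenate the selected centroids to reconstruct each vector.
        return np.hstack([cb[codes[:, i]] for i, cb in enumerate(self.codebooks)])
```

With m = 8 subspaces and k = 256 centroids each, a high-dimensional float vector compresses to 8 bytes; real implementations compute query-to-code distances via per-subspace lookup tables rather than decoding.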
Posted Content

Billion-scale similarity search with GPUs

TL;DR: In this paper, the authors propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5x faster than prior GPU state of the art.
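The system behind this entry is the Faiss library. As a minimal sketch of the GPU brute-force k-NN search it enables (index type, data sizes, and k here are illustrative; billion-scale deployments use compressed IVF/PQ indexes rather than a flat index):

```python
import numpy as np
import faiss  # pip install faiss-gpu

d = 128
xb = np.random.rand(1_000_000, d).astype('float32')  # database vectors
xq = np.random.rand(10, d).astype('float32')         # query vectors

# Exact inner-product search, moved from CPU to GPU 0.
res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, faiss.IndexFlatIP(d))
index.add(xb)

D, I = index.search(xq, 8)  # top-8 neighbors: scores D, row ids I
```

For cosine similarity, as used in margin-based mining, the vectors are L2-normalized first so that inner product and cosine coincide.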
Proceedings Article

Parallel Data, Tools and Interfaces in OPUS

TL;DR: This paper reports on new data sets and their features, additional annotation tools and models provided on the website, and essential interfaces and on-line services included in the OPUS project.