Proceedings ArticleDOI

Earth Mover’s Distance Minimization for Unsupervised Bilingual Lexicon Induction

01 Sep 2017, pp. 1934-1945
TL;DR: Viewing word embedding spaces as distributions, this paper proposes to minimize their earth mover’s distance, a measure of divergence between distributions, and demonstrates the success of this approach on the unsupervised bilingual lexicon induction task.
Abstract: Cross-lingual natural language processing hinges on the premise that there exists invariance across languages. At the word level, researchers have identified such invariance in the word embedding semantic spaces of different languages. However, in order to connect the separate spaces, cross-lingual supervision encoded in parallel data is typically required. In this paper, we attempt to establish the cross-lingual connection without relying on any cross-lingual supervision. By viewing word embedding spaces as distributions, we propose to minimize their earth mover’s distance, a measure of divergence between distributions. We demonstrate the success on the unsupervised bilingual lexicon induction task. In addition, we reveal an interesting finding that the earth mover’s distance shows potential as a measure of language difference.
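The abstract treats each embedding space as a discrete distribution over its word vectors and minimizes the earth mover’s distance between the two. As an illustrative sketch only (not the paper’s actual adversarial optimization), for two equal-size point clouds with uniform weights the EMD reduces to a min-cost assignment problem:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def earth_movers_distance(X, Y):
    """EMD between two equal-size point clouds with uniform weights.

    With n points of mass 1/n on each side, the optimal transport plan
    is a permutation, so the EMD reduces to an assignment problem.
    """
    # Pairwise Euclidean costs between source and target points.
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # min-cost perfect matching
    return cost[rows, cols].mean()

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 50))   # toy "source language" embeddings
shift = np.zeros(50)
shift[0] = 3.0
# The EMD of a pure translation equals the length of the shift vector.
print(earth_movers_distance(src, src + shift))  # → 3.0
```

For large vocabularies an exact assignment is too expensive, which is one reason the paper instead approximates the EMD with a critic network.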
Citations
Proceedings Article
15 Feb 2018
TL;DR: It is shown that a bilingual dictionary can be built between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.

1,068 citations

Posted Content
TL;DR: The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
Abstract: Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders XTREME benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.

538 citations


Cites background from "Earth Mover’s Distance Minimization..."


  • ...Later approaches reduced the amount of supervision required using self-training (Artetxe et al., 2017) and unsupervised strategies such as adversarial training (Conneau et al., 2018a), heuristic initialisation (Artetxe et al., 2018), and optimal transport (Zhang et al., 2017)....


Proceedings ArticleDOI
01 Jul 2018
TL;DR: This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution.
Abstract: Recent work has managed to learn cross-lingual word embeddings without parallel data by mapping monolingual embeddings to a shared space through adversarial training. However, their evaluation has focused on favorable conditions, using comparable corpora or closely-related languages, and we show that they often fail in more realistic scenarios. This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution. Our method succeeds in all tested scenarios and obtains the best published results in standard datasets, even surpassing previous supervised systems. Our implementation is released as an open source project at https://github.com/artetxem/vecmap.
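The self-learning loop described above alternates between inducing a dictionary and refitting the mapping. A toy reconstruction of that idea (not the released vecmap implementation; all names and the synthetic setup below are illustrative): induce pairs by cosine nearest neighbour under the current map, then refit an orthogonal map on those pairs via Procrustes:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||XW - Y||_F (closed form via SVD)."""
    u, _, vt = np.linalg.svd(X.T @ Y)
    return u @ vt

def self_learning(X, Y, W, steps=3):
    """Alternate nearest-neighbour dictionary induction and re-mapping."""
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    for _ in range(steps):
        M = X @ W
        Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
        nn = (Mn @ Yn.T).argmax(axis=1)  # cosine nearest neighbours
        W = procrustes(X, Y[nn])         # refit on the induced dictionary
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                   # "source" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))  # hidden orthogonal map
Y = X @ Q                                       # isometric "target" copy
W0 = Q + 0.02 * rng.normal(size=(20, 20))       # noisy initial mapping
print(np.allclose(self_learning(X, Y, W0), Q))  # → True
```

The real method adds the unsupervised initialization and robustness tricks the abstract mentions; this sketch only shows why iterating from a rough mapping can converge on easy, isometric data.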

414 citations


Cites background or methods or result from "Earth Mover’s Distance Minimization..."

  • ...We report the results in the dataset of Zhang et al. (2017a) at Table 1....


  • ...Given that Zhang et al. (2017a) report using a different value of their hyperparameter λ for different language pairs (λ = 10 for English-Turkish and λ = 1 for the rest), we test both values in all our experiments to… [footnote 4: The test dictionaries were obtained through personal communication with the…]...


  • ...Despite our efforts, Zhang et al. (2017b) was left out because: 1) it does not create a one-to-one dictionary, thus difficulting direct comparison, 2) it depends on expensive proprietary software, 3) its computational cost is orders of magnitude higher (running the experiments would have taken…...


  • ...The method of Zhang et al. (2017a) does not work at all in this more challenging scenario, which is in line with the negative results reported by the authors themselves for similar conditions (only 2.53% accuracy in their large Gigaword dataset)....


  • ...Together with it, we also test the methods of Zhang et al. (2017a) and Conneau et al. (2018) using the publicly available implementations from the authors....


Proceedings ArticleDOI
01 Sep 2018
TL;DR: This paper proposes an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems, and profits from the modular architecture of SMT.
Abstract: While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses

270 citations


Cites result from "Earth Mover’s Distance Minimization..."

  • ...…dictionary, typically in the range of a few thousand entries, although a recent line of work has managed to achieve comparable results in a fully unsupervised manner based on either self-learning (Artetxe et al., 2017, 2018b) or adversarial training (Zhang et al., 2017a,b; Conneau et al., 2018)....


Book
23 Jul 2020
TL;DR: A comprehensive treatment of the topic, ranging from introduction to neural networks, computation graphs, description of the currently dominant attentional sequence-to-sequence model, recent refinements, alternative architectures and challenges.
Abstract: Deep learning is revolutionizing how machine translation systems are built today. This book introduces the challenge of machine translation and evaluation - including historical, linguistic, and applied context - then develops the core deep learning methods used for natural language applications. Code examples in Python give readers a hands-on blueprint for understanding and implementing their own machine translation systems. The book also provides extensive coverage of machine learning tricks, issues involved in handling various forms of data, model enhancements, and current challenges and methods for analysis and visualization. Summaries of the current research in the field make this a state-of-the-art textbook for undergraduate and graduate classes, as well as an essential reference for researchers and developers interested in other applications of neural methods in the broader field of human language processing.

239 citations

References
Journal ArticleDOI
08 Dec 2014
TL;DR: A new framework for estimating generative models via an adversarial process, in which two models are simultaneously trained: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than G.
Abstract: We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to ½ everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
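The abstract's claim that the unique solution has D equal to ½ everywhere can be checked numerically on the minimax value function. A minimal sketch (toy constant discriminator, not a trained network): when the generator already matches the data distribution, the optimal D outputs 1/2 and the value bottoms out at -log 4:

```python
import numpy as np

def gan_value(D, real, fake):
    """Two-player value V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    return np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(fake)))

rng = np.random.default_rng(0)
real = rng.normal(size=1000)   # samples from the data distribution
fake = rng.normal(size=1000)   # generator output already matches the data

# With G matching the data, the optimal D is 1/2 everywhere and the
# value equals log(1/2) + log(1/2) = -log 4 ≈ -1.386.
half = lambda x: np.full_like(x, 0.5)
print(np.isclose(gan_value(half, real, fake), -np.log(4.0)))  # → True
```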

38,211 citations


"Earth Mover’s Distance Minimization..." refers background in this paper

  • ...Generative adversarial nets (GANs) are originally proposed to generate natural images (Goodfellow et al., 2014)....


Proceedings Article
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
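The negative sampling objective mentioned above replaces the full softmax with a handful of binary decisions: attract the true context word, repel k sampled noise words. A minimal sketch with hypothetical toy vectors (not the word2vec implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center, context, negatives):
    """Skip-gram negative sampling for one (center, context) pair:
    loss = -log σ(u_o · v_c) - Σ_k log σ(-u_k · v_c)."""
    pos = np.log(sigmoid(context @ center))
    neg = np.sum(np.log(sigmoid(-(negatives @ center))))
    return -(pos + neg)

# Toy 4-dim vectors: an aligned context with repelled negatives gives a
# lower loss than the reversed configuration.
c = np.full(4, 0.5)                                  # c·c = 1
good = negative_sampling_loss(c, c, np.stack([-c] * 5))
bad = negative_sampling_loss(c, -c, np.stack([c] * 5))
print(good < bad)  # → True
```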

24,012 citations


"Earth Mover’s Distance Minimization..." refers background or methods or result in this paper

  • ...This is exactly the supervised scenario, and previous works typically resort to gradient-based solvers (Mikolov et al., 2013a)....


  • ...This idea has led to previous supervised methods: • Translation matrix (TM) (Mikolov et al., 2013a): the pioneer of this type of methods, using linear transformation....


  • ...Interestingly, as computational models of word semantics, monolingual word embeddings also exhibit isomorphism across languages (Mikolov et al., 2013a)....


  • ...As we aim to eliminate the need for crosslingual supervision from word translation pairs, the measure cannot be defined at the word level as in previous work (Mikolov et al., 2013a)....


  • ...…et al., 2016), or the word level (i.e. in the form of seed lexicon) (Gouws and Søgaard, 2015; Wick et al., 2016; Duong et al., 2016; Shi et al., 2015; Mikolov et al., 2013a; Faruqui and Dyer, 2014; Lu et al., 2015; Dinu et al., 2015; Lazaridou et al., 2015; Ammar et al., 2016; Zhang et al.,…...

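The translation matrix baseline quoted above fits a linear map from source to target embeddings over a seed dictionary. The original work uses gradient descent; a closed-form least-squares sketch on synthetic data (all names below are illustrative) shows the same fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 30
X = rng.normal(size=(n, d))        # seed source-word embeddings (rows)
W_true = rng.normal(size=(d, d))   # hidden map used to fabricate targets
Y = X @ W_true                     # embeddings of the seed translations

# Fit W minimizing ||X W - Y||_F: ordinary least squares, closed form.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(W, W_true))      # → True on this noiseless toy data
```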

Posted Content
TL;DR: This work introduces a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrates that they are a strong candidate for unsupervised learning.
Abstract: In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.

6,759 citations


"Earth Mover’s Distance Minimization..." refers background in this paper

  • ...Therefore, a lot of research efforts have been dedicated to the investigation into stabler training (Radford et al., 2015; Salimans et al., 2016; Nowozin et al., 2016; Metz et al., 2016; Poole et al., 2016; Arjovsky and Bottou, 2017), and the recently proposed Wasserstein GAN (Arjovsky et al.,…...

Posted Content
TL;DR: In this article, the authors present a variety of new architectural features and training procedures that apply to the generative adversarial networks (GANs) framework and achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN.
Abstract: We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes.

5,711 citations

Journal ArticleDOI
TL;DR: It is shown that standard multilayer feedforward networks with as few as a single hidden layer and arbitrary bounded and nonconstant activation function are universal approximators with respect to L^p(μ) performance criteria, for arbitrary finite input environment measures μ.

5,593 citations


"Earth Mover’s Distance Minimization..." refers methods in this paper

  • ...As neural networks are universal function approximators (Hornik, 1991), we can attempt to approximate f with a neural network, called the critic D, with weight clipping to ensure the function family is K-Lipschitz....

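The weight clipping referenced in the quote is simple to state: after each critic update, every parameter is clamped to a small interval [-c, c] so the critic stays inside a family of K-Lipschitz functions. A minimal sketch with hypothetical numpy parameter arrays (a real critic would be a trained network):

```python
import numpy as np

def clip_weights(params, c=0.01):
    """WGAN-style weight clipping: clamp every parameter to [-c, c] so
    the critic stays within a family of K-Lipschitz functions."""
    return [np.clip(w, -c, c) for w in params]

rng = np.random.default_rng(0)
critic = [rng.normal(scale=0.1, size=(16, 16)),  # hypothetical layer weights
          rng.normal(scale=0.1, size=16)]        # hypothetical biases
critic = clip_weights(critic)
print(max(np.abs(w).max() for w in critic) <= 0.01)  # → True
```

Clipping bounds the weights rather than the gradient norm itself, which is why later work replaced it with a gradient penalty; the quote only concerns the original clipping scheme.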