Gromov-Wasserstein Alignment of Word Embedding Spaces

doi:10.18653/V1/D18-1214

Proceedings Article•DOI•

Gromov-Wasserstein Alignment of Word Embedding Spaces

David Alvarez-Melis¹, Tommi S. Jaakkola¹

Massachusetts Institute of Technology¹

31 Aug 2018-pp 1881-1890

TL;DR: The authors cast the correspondence problem directly as an optimal transport (OT) problem, building on the idea that word embeddings arise from metric recovery algorithms, and exploit the Gromov-Wasserstein distance that measures how similarities between pairs of words relate across languages.

Abstract: Cross-lingual or cross-domain correspondences play key roles in tasks ranging from machine translation to transfer learning. Recently, purely unsupervised methods operating on monolingual embeddings have become effective alignment tools. Current state-of-the-art methods, however, involve multiple steps, including heuristic post-hoc refinement strategies. In this paper, we cast the correspondence problem directly as an optimal transport (OT) problem, building on the idea that word embeddings arise from metric recovery algorithms. Indeed, we exploit the Gromov-Wasserstein distance that measures how similarities between pairs of words relate across languages. We show that our OT objective can be estimated efficiently, requires little or no tuning, and results in performance comparable with the state-of-the-art in various unsupervised word translation tasks.
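
To make the abstract's idea concrete, the following is a minimal sketch of the squared-loss Gromov-Wasserstein objective, which scores a coupling between two vocabularies by how well it preserves within-language pairwise distances. It is not the authors' implementation: the function names, the use of Euclidean distances, and the synthetic "languages" (one point cloud and a shuffled, rotated copy of it) are assumptions made for illustration, and the code only evaluates the objective for given couplings rather than optimizing it.

```python
import numpy as np

def pairwise_dist(X):
    """Euclidean distance between every pair of rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def gw_cost(C1, C2, T):
    """Squared-loss Gromov-Wasserstein cost of a coupling T.

    Equals sum_{i,j,k,l} (C1[i,k] - C2[j,l])**2 * T[i,j] * T[k,l],
    expanded algebraically so that no 4-index tensor is built.
    """
    p, q = T.sum(axis=1), T.sum(axis=0)        # marginals of the coupling
    within_src = p @ (C1 ** 2) @ p             # sum C1[i,k]^2 p_i p_k
    within_tgt = q @ (C2 ** 2) @ q             # sum C2[j,l]^2 q_j q_l
    cross = np.sum(T * (C1 @ T @ C2.T))        # sum C1[i,k] C2[j,l] T[i,j] T[k,l]
    return within_src + within_tgt - 2.0 * cross

# Synthetic stand-in for two embedding spaces (assumption, not real data):
# Y holds the same points as X, shuffled and rotated.
rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
perm = rng.permutation(n)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix
Y = X[perm] @ Q                                # row j of Y is row perm[j] of X

C1, C2 = pairwise_dist(X), pairwise_dist(Y)

# Coupling that matches each source point to its shuffled copy.
T_true = np.zeros((n, n))
T_true[perm, np.arange(n)] = 1.0 / n

# Uninformative coupling that spreads mass uniformly.
T_unif = np.full((n, n), 1.0 / n ** 2)

print("cost of true matching:   ", gw_cost(C1, C2, T_true))
print("cost of uniform coupling:", gw_cost(C1, C2, T_unif))
```

Because the objective compares only distances within each space, the rotation `Q` does not affect it, which is why no explicit mapping between the two spaces is needed to score a correspondence.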

Citations

Book•

Neural Machine Translation

Philipp Koehn¹

Johns Hopkins University¹

23 Jul 2020

TL;DR: A comprehensive treatment of the topic, ranging from introduction to neural networks, computation graphs, description of the currently dominant attentional sequence-to-sequence model, recent refinements, alternative architectures and challenges.

Abstract: Deep learning is revolutionizing how machine translation systems are built today. This book introduces the challenge of machine translation and evaluation, including historical, linguistic, and applied context, then develops the core deep learning methods used for natural language applications. Code examples in Python give readers a hands-on blueprint for understanding and implementing their own machine translation systems. The book also provides extensive coverage of machine learning tricks, issues involved in handling various forms of data, model enhancements, and current challenges and methods for analysis and visualization. Summaries of the current research in the field make this a state-of-the-art textbook for undergraduate and graduate classes, as well as an essential reference for researchers and developers interested in other applications of neural methods in the broader field of human language processing.

239 citations

Proceedings Article•DOI•

JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages

Željko Agić¹, Ivan Vulić¹

University of Cambridge¹

01 Jul 2019

TL;DR: JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average is introduced and its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection is showcased.

Abstract: Viable cross-lingual transfer critically depends on the availability of parallel texts. Shortage of such resources imposes a development and evaluation bottleneck in multilingual processing. We introduce JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. In this paper, we present the resource and showcase its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.

171 citations

Proceedings Article•DOI•

Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing.

Tal Schuster¹, Ori Ram², Regina Barzilay¹, Amir Globerson³

Massachusetts Institute of Technology¹, Tel Aviv University², Google³

01 Jun 2019

TL;DR: The authors introduce a method for multilingual transfer that uses deep contextual embeddings pretrained in an unsupervised fashion; it consistently outperforms the previous state of the art on 6 tested languages, improving LAS by 6.8 points on average.

Abstract: We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces. This mapping readily supports processing of a target language, improving transfer by context-aware embeddings. Our experimental results demonstrate the effectiveness of this approach for zero-shot and few-shot learning of dependency parsing. Specifically, our method consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.

167 citations

Proceedings Article•DOI•

How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions

Goran Glavaš¹, Robert Litschko¹, Sebastian Ruder², Ivan Vulić³

University of Mannheim¹, Allen Institute for Artificial Intelligence², University of Cambridge³

01 Feb 2019

TL;DR: The authors evaluate both supervised and unsupervised cross-lingual word embeddings (CLEs) for bilingual lexicon induction (BLI), and empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance.

Abstract: Cross-lingual word embeddings (CLEs) facilitate cross-lingual transfer of NLP models. Despite their ubiquitous downstream usage, increasingly popular projection-based CLE models are almost exclusively evaluated on bilingual lexicon induction (BLI). Even the BLI evaluations vary greatly, hindering our ability to correctly interpret performance and properties of different CLE models. In this work, we take the first step towards a comprehensive evaluation of CLE models: we thoroughly evaluate both supervised and unsupervised CLE models, for a large number of language pairs, on BLI and three downstream tasks, providing new insights concerning the ability of cutting-edge CLE models to support cross-lingual NLP. We empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance. We indicate the most robust supervised and unsupervised CLE models and emphasize the need to reassess simple baselines, which still display competitive performance across the board. We hope our work catalyzes further research on CLE evaluation and model analysis.

165 citations

Posted Content•

Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing

Tal Schuster¹, Ori Ram², Regina Barzilay¹, Amir Globerson³

Massachusetts Institute of Technology¹, Tel Aviv University², Google³

25 Feb 2019-arXiv: Computation and Language

TL;DR: The authors use context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces, improving transfer by context-aware embeddings.

122 citations

References

Journal Article•DOI•

Enriching Word Vectors with Subword Information

Piotr Bojanowski¹, Edouard Grave¹, Armand Joulin¹, Tomas Mikolov¹

Facebook¹

12 Jun 2017-Transactions of the Association for Computational Linguistics

TL;DR: This paper proposes an approach based on the skip-gram model in which each word is represented as a bag of character n-grams and a word's vector is the sum of its n-gram vectors; the method trains quickly on large corpora and can compute representations for words that did not appear in the training data.

Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
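
The bag-of-character-n-grams representation described above can be sketched as follows. This is an illustrative toy, not the fastText implementation: the `<` and `>` boundary markers and the default n-gram lengths of 3 to 6 follow the paper's description rather than this abstract, while the class name `SubwordEmbeddings`, the CRC32 bucket hashing, the table size, and the random (untrained) vectors are all assumptions.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

class SubwordEmbeddings:
    """Toy lookup table: one vector per hashed n-gram bucket (untrained)."""

    def __init__(self, dim=8, buckets=1000, seed=0):
        rng = np.random.default_rng(seed)
        # Random vectors stand in for parameters a real model would learn.
        self.table = rng.normal(scale=0.1, size=(buckets, dim))

    def _row(self, gram):
        # CRC32 is a stable stand-in hash (assumption, chosen for determinism).
        return zlib.crc32(gram.encode("utf-8")) % len(self.table)

    def vector(self, word):
        """Word vector = sum of the vectors of its n-grams and the full token."""
        units = dict.fromkeys(char_ngrams(word) + [f"<{word}>"])  # de-duplicated
        rows = [self._row(g) for g in units]
        return self.table[rows].sum(axis=0)

emb = SubwordEmbeddings()
print(char_ngrams("where", 3, 3))
print(emb.vector("unseenword").shape)
```

Since a vector is assembled from n-grams alone, `emb.vector` accepts any string, which is how the approach produces representations for words absent from the training data.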

7,537 citations

Proceedings Article•

Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Marco Cuturi¹

Kyoto University¹

05 Dec 2013

TL;DR: This work smooths the classic optimal transport problem with an entropic regularization term, and shows that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers.

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms' dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximum-entropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.
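
The matrix-scaling iteration mentioned in this abstract can be sketched in a few lines. The version below is a plain, non-stabilized variant for small problems; the function name `sinkhorn`, the parameter name `reg` for the entropic regularization strength, the fixed iteration count, and the random cost matrix are assumptions made for illustration.

```python
import numpy as np

def sinkhorn(a, b, M, reg, n_iters=500):
    """Entropy-regularized optimal transport plan between histograms a and b.

    a: source histogram, shape (n,); b: target histogram, shape (m,)
    M: cost matrix, shape (n, m); reg: regularization strength (> 0)
    """
    K = np.exp(-M / reg)                 # Gibbs kernel derived from the costs
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale columns to match b
        u = a / (K @ v)                  # rescale rows to match a
    return u[:, None] * K * v[None, :]   # plan = diag(u) K diag(v)

rng = np.random.default_rng(0)
a = np.full(4, 1 / 4)                    # uniform source histogram
b = np.full(5, 1 / 5)                    # uniform target histogram
M = rng.random((4, 5))                   # illustrative costs in [0, 1)

P = sinkhorn(a, b, M, reg=0.5)
print("row sums:   ", P.sum(axis=1))
print("column sums:", P.sum(axis=0))
print("transport cost <P, M>:", np.sum(P * M))
```

Each iteration is only two matrix-vector products, which is the source of the speed advantage over linear-programming solvers. For very small `reg` the kernel `K` underflows, and a log-domain formulation would be needed instead of this direct one.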

2,681 citations

Journal Article•DOI•

A generalized solution of the orthogonal procrustes problem

Peter H. Schönemann¹

University of North Carolina at Chapel Hill¹

01 Mar 1966-Psychometrika

TL;DR: This paper presents a solution T of the least-squares problem AT = B + E under the orthogonality constraint T′T = I that, unlike Green's earlier solution, also applies to matrices A and B of less than full column rank.

Abstract: A solution T of the least-squares problem AT = B + E, given A and B so that trace(E′E) = minimum and T′T = I, is presented. It is compared with a less general solution of the same problem which was given by Green [5]. The present solution, in contrast to Green's, is applicable to matrices A and B which are of less than full column rank. Some technical suggestions for the numerical computation of T and an illustrative example are given.
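
The orthogonal Procrustes problem stated in this abstract has a well-known closed form in terms of the singular value decomposition, sketched below. The SVD-based formula is the standard modern presentation rather than a transcription of the paper's derivation, and the function name and synthetic test matrices are assumptions made for illustration.

```python
import numpy as np

def orthogonal_procrustes(A, B):
    """Orthogonal T minimizing the Frobenius norm of A @ T - B."""
    U, _, Vt = np.linalg.svd(A.T @ B)    # A'B = U diag(s) V'
    return U @ Vt                        # optimal T = U V'

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # ground-truth orthogonal map
B = A @ Q

T = orthogonal_procrustes(A, B)
print("recovered Q:", np.allclose(T, Q))
print("T'T = I:    ", np.allclose(T.T @ T, np.eye(3)))

# Rank-deficient case: zeroing a column leaves A_low with column rank 2.
A_low = A.copy()
A_low[:, 2] = 0.0
B_low = A_low @ Q
T_low = orthogonal_procrustes(A_low, B_low)
print("residual (rank-deficient):", np.linalg.norm(A_low @ T_low - B_low))
```

In the rank-deficient case the minimizer is no longer unique, so `T_low` need not equal `Q`; it is still orthogonal and still fits `B_low`.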

1,924 citations

Posted Content•

[...]

Tomas Mikolov, Quoc V. Le, Ilya Sutskever

17 Sep 2013-arXiv: Computation and Language

TL;DR: The method translates missing word and phrase entries by learning language structures from large monolingual data and a mapping between languages from small bilingual data; it uses distributed word representations and learns a linear mapping between the vector spaces of the two languages.

Abstract: Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.
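
The linear-mapping idea in this abstract can be sketched as a least-squares fit on a small seed dictionary followed by nearest-neighbour lookup in the target space. This is a simplified illustration under stated assumptions: a closed-form least-squares solve is swapped in for iterative training, the function names are hypothetical, and the "embeddings" are synthetic vectors related by an exactly linear map, so real data would behave far less cleanly.

```python
import numpy as np

def fit_translation_matrix(X, Z):
    """Least-squares W such that X @ W approximates Z.

    X: source vectors of the seed dictionary, shape (n_pairs, d_src)
    Z: vectors of their translations,        shape (n_pairs, d_tgt)
    """
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def translate(x, W, target_vectors):
    """Index of the target vector most cosine-similar to the mapped x."""
    z = x @ W
    norms = np.linalg.norm(target_vectors, axis=1) * np.linalg.norm(z)
    return int(np.argmax((target_vectors @ z) / norms))

# Synthetic data (assumption): target vectors are an exact linear image
# of the source vectors, and row i of Z translates row i of X.
rng = np.random.default_rng(0)
n_words, d_src, d_tgt = 50, 5, 4
X = rng.normal(size=(n_words, d_src))
W_true = rng.normal(size=(d_src, d_tgt))
Z = X @ W_true

# Fit on the first 40 pairs (the small bilingual dictionary), then
# translate a held-out word that was not part of the fit.
W = fit_translation_matrix(X[:40], Z[:40])
print("predicted index for held-out word 45:", translate(X[45], W, Z))
```

The two spaces may have different dimensionalities, since `W` has shape `(d_src, d_tgt)`. Returning the top five candidates instead of the single best would correspond to the precision@5 measure quoted in the abstract.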

1,564 citations

Posted Content•

Computational Optimal Transport

Gabriel Peyré, Marco Cuturi

01 Mar 2018-arXiv: Machine Learning

TL;DR: This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds light on the theoretical properties of OT that make it particularly useful for some of these applications.

Abstract: Optimal transport (OT) theory can be informally described using the words of the French mathematician Gaspard Monge (1746-1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to erect with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovelfuls of sand. Mathematicians interested in OT cast that problem as that of comparing two probability distributions, two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a "global" cost to every such transport, using the "local" consideration of how much it costs to move a grain of sand from one place to another. Recent years have witnessed the spread of OT in several fields, thanks to the emergence of approximate solvers that can scale to sizes and dimensions that are relevant to data sciences. Thanks to this newfound scalability, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), computer vision and graphics (for shape manipulation) or machine learning (for regression, classification and density fitting). This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds light on the theoretical properties of OT that make it particularly useful for some of these applications.

1,355 citations