scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

Gromov-Wasserstein Alignment of Word Embedding Spaces

TL;DR: The authors cast the correspondence problem directly as an optimal transport (OT) problem, building on the idea that word embeddings arise from metric recovery algorithms, and exploit the Gromov-Wasserstein distance that measures how similarities between pairs of words relate across languages.
Abstract: Cross-lingual or cross-domain correspondences play key roles in tasks ranging from machine translation to transfer learning. Recently, purely unsupervised methods operating on monolingual embeddings have become effective alignment tools. Current state-of-the-art methods, however, involve multiple steps, including heuristic post-hoc refinement strategies. In this paper, we cast the correspondence problem directly as an optimal transport (OT) problem, building on the idea that word embeddings arise from metric recovery algorithms. Indeed, we exploit the Gromov-Wasserstein distance that measures how similarities between pairs of words relate across languages. We show that our OT objective can be estimated efficiently, requires little or no tuning, and results in performance comparable with the state-of-the-art in various unsupervised word translation tasks.

Content maybe subject to copyright    Report

Citations
More filters
Book
23 Jul 2020
TL;DR: A comprehensive treatment of the topic, ranging from introduction to neural networks, computation graphs, description of the currently dominant attentional sequence-to-sequence model, recent refinements, alternative architectures and challenges.
Abstract: Deep learning is revolutionizing how machine translation systems are built today This book introduces the challenge of machine translation and evaluation - including historical, linguistic, and applied context -- then develops the core deep learning methods used for natural language applications Code examples in Python give readers a hands-on blueprint for understanding and implementing their own machine translation systems The book also provides extensive coverage of machine learning tricks, issues involved in handling various forms of data, model enhancements, and current challenges and methods for analysis and visualization Summaries of the current research in the field make this a state-of-the-art textbook for undergraduate and graduate classes, as well as an essential reference for researchers and developers interested in other applications of neural methods in the broader field of human language processing

239 citations

Proceedings ArticleDOI
01 Jul 2019
TL;DR: JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average is introduced and its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection is showcased.
Abstract: Viable cross-lingual transfer critically depends on the availability of parallel texts. Shortage of such resources imposes a development and evaluation bottleneck in multilingual processing. We introduce JW300, a parallel corpus of over 300 languages with around 100 thousand parallel sentences per language pair on average. In this paper, we present the resource and showcase its utility in experiments with cross-lingual word embedding induction and multi-source part-of-speech projection.

171 citations

Proceedings ArticleDOI
01 Jun 2019
TL;DR: A novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion, that consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.
Abstract: We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces. This mapping readily supports processing of a target language, improving transfer by context-aware embeddings. Our experimental results demonstrate the effectiveness of this approach for zero-shot and few-shot learning of dependency parsing. Specifically, our method consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.

167 citations

Proceedings ArticleDOI
01 Feb 2019
TL;DR: The authors evaluate both supervised and unsupervised cross-lingual word embeddings (CLEs) for bilingual lexicon induction (BLI), and empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance.
Abstract: Cross-lingual word embeddings (CLEs) facilitate cross-lingual transfer of NLP models. Despite their ubiquitous downstream usage, increasingly popular projection-based CLE models are almost exclusively evaluated on bilingual lexicon induction (BLI). Even the BLI evaluations vary greatly, hindering our ability to correctly interpret performance and properties of different CLE models. In this work, we take the first step towards a comprehensive evaluation of CLE models: we thoroughly evaluate both supervised and unsupervised CLE models, for a large number of language pairs, on BLI and three downstream tasks, providing new insights concerning the ability of cutting-edge CLE models to support cross-lingual NLP. We empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI may hurt downstream performance. We indicate the most robust supervised and unsupervised CLE models and emphasize the need to reassess simple baselines, which still display competitive performance across the board. We hope our work catalyzes further research on CLE evaluation and model analysis.

165 citations

Posted Content
TL;DR: The authors use context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces, improving transfer by context-aware embeddings.
Abstract: We introduce a novel method for multilingual transfer that utilizes deep contextual embeddings, pretrained in an unsupervised fashion. While contextual embeddings have been shown to yield richer representations of meaning compared to their static counterparts, aligning them poses a challenge due to their dynamic nature. To this end, we construct context-independent variants of the original monolingual spaces and utilize their mapping to derive an alignment for the context-dependent spaces. This mapping readily supports processing of a target language, improving transfer by context-aware embeddings. Our experimental results demonstrate the effectiveness of this approach for zero-shot and few-shot learning of dependency parsing. Specifically, our method consistently outperforms the previous state-of-the-art on 6 tested languages, yielding an improvement of 6.8 LAS points on average.

122 citations

References
More filters
Journal ArticleDOI
TL;DR: This paper proposed a new approach based on skip-gram model, where each word is represented as a bag of character n-grams, words being represented as the sum of these representations, allowing to train models on large corpora quickly and allowing to compute word representations for words that did not appear in the training data.
Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

7,537 citations

Proceedings Article
Marco Cuturi1
05 Dec 2013
TL;DR: This work smooths the classic optimal transport problem with an entropic regularization term, and shows that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers.
Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms' dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximum-entropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

2,681 citations

Journal ArticleDOI
TL;DR: In this paper, a solution for the least square problem with respect to matrices of less than full column rank is presented. But this solution is applicable to only matrices A and B and is not applicable to all matrices.
Abstract: A solutionT of the least-squares problemAT=B +E, givenA andB so that trace (E′E)= minimum andT′T=I is presented. It is compared with a less general solution of the same problem which was given by Green [5]. The present solution, in contrast to Green's, is applicable to matricesA andB which are of less than full column rank. Some technical suggestions for the numerical computation ofT and an illustrative example are given.

1,924 citations

Posted Content
TL;DR: This method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data and uses distributed representation of words and learns a linear mapping between vector spaces of languages.
Abstract: Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.

1,564 citations

Posted Content
TL;DR: This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds lights on the theoretical properties of OT that make it particularly useful for some of these applications.
Abstract: Optimal transport (OT) theory can be informally described using the words of the French mathematician Gaspard Monge (1746-1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to erect with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovelfuls of sand. Mathematicians interested in OT cast that problem as that of comparing two probability distributions, two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a "global" cost to every such transport, using the "local" consideration of how much it costs to move a grain of sand from one place to another. Recent years have witnessed the spread of OT in several fields, thanks to the emergence of approximate solvers that can scale to sizes and dimensions that are relevant to data sciences. Thanks to this newfound scalability, OT is being increasingly used to unlock various problems in imaging sciences (such as color or texture processing), computer vision and graphics (for shape manipulation) or machine learning (for regression, classification and density fitting). This short book reviews OT with a bias toward numerical methods and their applications in data sciences, and sheds lights on the theoretical properties of OT that make it particularly useful for some of these applications.

1,355 citations