Open Access · Proceedings Article
Generalizing and Improving Bilingual Word Embedding Mappings with a Multi-Step Framework of Linear Transformations
Mikel Artetxe, Gorka Labaka, Eneko Agirre
Vol. 32, Iss. 1, pp. 5012-5019
TL;DR:
A multi-step framework of linear transformations is proposed that generalizes a substantial body of previous work, provides new insights into the behavior of existing methods (including the effectiveness of inverse regression), and motivates a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction.
Abstract:
Using a dictionary to map independently trained word embeddings to a shared space has been shown to be an effective approach to learning bilingual word embeddings. In this work, we propose a multi-step framework of linear transformations that generalizes a substantial body of previous work. The core step of the framework is an orthogonal transformation, and existing methods can be explained in terms of the additional normalization, whitening, re-weighting, de-whitening and dimensionality reduction steps. This allows us to gain new insights into the behavior of existing methods, including the effectiveness of inverse regression, and to design a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction. The corresponding software is released as an open source project.
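The pipeline named in the abstract (normalization, whitening, orthogonal transformation, re-weighting, de-whitening) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' released software: `X` and `Z` are row-aligned dictionary embeddings, `whitener` and `multistep_map` are illustrative names, and the re-weighting here uses the raw singular values rather than a tunable power.

```python
import numpy as np

def whitener(M, eps=1e-12):
    # ZCA-style whitening matrix: (M @ W) has (approximately) identity covariance.
    C = M.T @ M / M.shape[0]
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def multistep_map(X, Z):
    """Sketch of the multi-step framework for row-aligned dictionary
    embeddings X (source) and Z (target). Illustrative, not the paper's code."""
    # Step 1: normalization -- scale every embedding to unit length.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # Step 2: whitening -- decorrelate the dimensions of each space.
    Wx, Wz = whitener(X), whitener(Z)
    Xw, Zw = X @ Wx, Z @ Wz
    # Step 3: orthogonal transformation -- Procrustes solution via SVD
    # of the cross-covariance between the whitened spaces.
    U, s, Vt = np.linalg.svd(Xw.T @ Zw / X.shape[0])
    # Step 4: re-weighting -- scale each direction by its cross-correlation
    # (singular value); the paper's variants differ in how this is applied.
    W = U @ np.diag(s) @ Vt
    # Step 5: de-whitening -- undo the target whitening so the mapped
    # vectors land in the original target space.
    return X @ Wx @ W @ np.linalg.inv(Wz)
```

When the two spaces are related by an exact rotation, this sketch recovers the (length-normalized) target embeddings, which is the sanity check one would expect from an orthogonal core step.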
Citations
Posted Content
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
Mikel Artetxe, Holger Schwenk
TL;DR: This article used a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
Proceedings ArticleDOI
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
TL;DR: This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution.
Posted Content
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Nils Reimers, Iryna Gurevych
TL;DR: An easy and efficient method to extend existing sentence embedding models to new languages by using the original (monolingual) model to generate sentence embeddings for the source language and then training a new system on translated sentences to mimic the original model.
Proceedings ArticleDOI
Unsupervised Statistical Machine Translation
TL;DR: This paper proposes an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems, and profits from the modular architecture of SMT.
References
Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Posted Content
Exploiting Similarities among Languages for Machine Translation
TL;DR: This method uses distributed representations of words and learns a linear mapping between the vector spaces of two languages, allowing missing word and phrase entries to be translated by exploiting structure learned from large monolingual data and a small amount of bilingual data.
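The linear-mapping idea summarized above reduces to an ordinary least-squares problem: given row-aligned embeddings X (source) and Z (target) for a small seed dictionary, find W minimizing ||XW - Z||². A minimal sketch, with illustrative function names that are not from the cited paper:

```python
import numpy as np

def learn_linear_map(X, Z):
    # Least-squares solution of X @ W ~= Z for the seed dictionary pairs.
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def translate(x, W, Z_vocab):
    # Map a source vector into the target space, then return the index of
    # the most cosine-similar target vocabulary vector.
    mapped = x @ W
    sims = (Z_vocab @ mapped) / (
        np.linalg.norm(Z_vocab, axis=1) * np.linalg.norm(mapped) + 1e-12
    )
    return int(np.argmax(sims))
```

The later multi-step framework constrains and decomposes this mapping; unconstrained least squares is the simplest baseline in that family.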
Proceedings Article
Parallel Data, Tools and Interfaces in OPUS
TL;DR: This paper reports on new data sets and their features, additional annotation tools and models provided on the website, and essential interfaces and on-line services included in the OPUS project.
Proceedings ArticleDOI
Zero-shot Learning with Semantic Output Codes
TL;DR: A semantic output code classifier which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes and can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.
Proceedings ArticleDOI
Improving Vector Space Word Representations Using Multilingual Correlation
Manaal Faruqui, Chris Dyer
TL;DR: This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually.
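The CCA-based technique summarized above projects both languages onto maximally correlated directions. A minimal NumPy sketch of classical CCA under stated assumptions (row-aligned samples, full-rank covariances; `cca_projections` is an illustrative name, not the cited paper's code):

```python
import numpy as np

def cca_projections(X, Y, k):
    # Classical CCA: whiten each space, then take the SVD of the
    # whitened cross-covariance; columns of A, B are projection directions.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(C, eps=1e-12):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]   # project source: X @ A
    B = Wy @ Vt.T[:, :k]  # project target: Y @ B
    return A, B
```

Projecting monolingual vectors of both languages through A and B places them in a shared space where translation pairs are highly correlated, which is the multilingual-evidence idea the entry describes.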