Open Access Proceedings Article

Generalizing and Improving Bilingual Word Embedding Mappings with a Multi-Step Framework of Linear Transformations.

Mikel Artetxe, +2 more
Vol. 32, Iss. 1, pp. 5012-5019
TL;DR
A multi-step framework of linear transformations is proposed that generalizes a substantial body of previous work, yields new insights into the behavior of existing methods, including the effectiveness of inverse regression, and leads to a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction.
Abstract
Using a dictionary to map independently trained word embeddings to a shared space has been shown to be an effective approach to learning bilingual word embeddings. In this work, we propose a multi-step framework of linear transformations that generalizes a substantial body of previous work. The core step of the framework is an orthogonal transformation, and existing methods can be explained in terms of the additional normalization, whitening, re-weighting, de-whitening and dimensionality reduction steps. This allows us to gain new insights into the behavior of existing methods, including the effectiveness of inverse regression, and to design a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction. The corresponding software is released as an open source project.
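
To make the step sequence in the abstract concrete, here is a minimal NumPy sketch of a multi-step mapping of this kind (normalization, whitening, orthogonal mapping via SVD, re-weighting, de-whitening, dimensionality reduction). The function names, the re-weighting exponent `p`, and the defaults are illustrative assumptions, not the authors' released implementation; consult their open source project for the reference code.

```python
import numpy as np

def normalize(m, eps=1e-9):
    # Length-normalize rows, mean-center columns, length-normalize again.
    m = m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)
    m = m - m.mean(axis=0, keepdims=True)
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)

def whitening_transform(m, eps=1e-9):
    # (M^T M)^(-1/2), computed from the SVD of M; eps guards tiny singular values.
    _, s, vt = np.linalg.svd(m, full_matrices=False)
    return vt.T @ np.diag(1.0 / np.maximum(s, eps)) @ vt

def multi_step_map(x, z, p=0.5, k=None):
    """Map dictionary-aligned matrices x, z (n x d; row i of x translates
    to row i of z) into a shared space. p and k are assumed defaults."""
    x, z = normalize(x), normalize(z)                   # step 1: normalization
    wx, wz = whitening_transform(x), whitening_transform(z)
    xw, zw = x @ wx, z @ wz                             # step 2: whitening
    u, s, vt = np.linalg.svd(xw.T @ zw)                 # step 3: orthogonal map
    xm, zm = xw @ u, zw @ vt.T
    xm = xm * s ** p                                    # step 4: re-weighting
    xm = xm @ u.T @ np.linalg.inv(wx) @ u               # step 5: de-whitening
    zm = zm @ vt @ np.linalg.inv(wz) @ vt.T
    if k is not None:                                   # step 6: dim. reduction
        xm, zm = xm[:, :k], zm[:, :k]
    return xm, zm
```

Bilingual lexicon extraction then reduces to nearest-neighbor retrieval between the rows of the two mapped matrices.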


Citations
Posted Content

Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

TL;DR: This article used a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
Proceedings ArticleDOI

A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings

TL;DR: This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution.
Posted Content

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

TL;DR: An easy and efficient method to extend existing sentence embedding models to new languages by using the original (monolingual) model to generate sentence embeddings for the source language and then training a new system on translated sentences to mimic the original model.
Proceedings ArticleDOI

Unsupervised Statistical Machine Translation

TL;DR: This paper proposes an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems, and profits from the modular architecture of SMT.
References
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Posted Content

Exploiting Similarities among Languages for Machine Translation

TL;DR: This method can translate missing word and phrase entries by learning language structures from large monolingual data and a mapping between languages from small bilingual data; it uses distributed representations of words and learns a linear mapping between the vector spaces of the two languages.
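
The linear mapping summarized in this TL;DR is the starting point that the framework above generalizes. A minimal sketch of the idea as least-squares regression over seed dictionary pairs follows; the variable names and random data are hypothetical placeholders, not the cited paper's setup.

```python
import numpy as np

# Hypothetical seed dictionary: row i of x_dict and z_dict are the
# embeddings of a source word and its translation.
rng = np.random.default_rng(0)
x_dict = rng.standard_normal((5000, 300))
z_dict = rng.standard_normal((5000, 300))

# Learn W minimizing ||x_dict @ W - z_dict||_F by least squares.
w, *_ = np.linalg.lstsq(x_dict, z_dict, rcond=None)

def translate(x_vec, z_vocab):
    # Map a source vector through W and return the index of the most
    # similar target vector by cosine similarity.
    proj = x_vec @ w
    sims = (z_vocab @ proj) / (
        np.linalg.norm(z_vocab, axis=1) * np.linalg.norm(proj) + 1e-9)
    return int(np.argmax(sims))
```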
Proceedings Article

Parallel Data, Tools and Interfaces in OPUS

TL;DR: This paper reports on new data sets and their features, additional annotation tools and models provided on the project website, and essential interfaces and on-line services included in the OPUS project.
Proceedings ArticleDOI

Zero-shot Learning with Semantic Output Codes

TL;DR: A semantic output code classifier is proposed that utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes; it can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.
Proceedings ArticleDOI

Improving Vector Space Word Representations Using Multilingual Correlation

TL;DR: This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually.
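
The CCA technique summarized in this TL;DR is one of the methods the multi-step framework subsumes: it corresponds to whitening both sides before the orthogonal step. A minimal NumPy sketch, assuming mean-centered dictionary-aligned matrices and an illustrative number of components `k`:

```python
import numpy as np

def cca_project(x, z, k=100, eps=1e-8):
    # x, z: n x d mean-centered matrices of dictionary pairs.
    def inv_sqrt_cov(m):
        # (M^T M)^(-1/2) via an eigendecomposition of the Gram matrix.
        vals, vecs = np.linalg.eigh(m.T @ m)
        return vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    wx, wz = inv_sqrt_cov(x), inv_sqrt_cov(z)
    # The SVD of the whitened cross-covariance yields the canonical
    # directions, sorted by decreasing correlation.
    u, _, vt = np.linalg.svd((x @ wx).T @ (z @ wz), full_matrices=False)
    return x @ wx @ u[:, :k], z @ wz @ vt.T[:, :k]
```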