Open Access · Proceedings Article
Generalizing and Improving Bilingual Word Embedding Mappings with a Multi-Step Framework of Linear Transformations
Mikel Artetxe, Gorka Labaka, Eneko Agirre
Vol. 32, Iss. 1, pp. 5012-5019
TL;DR:
A multi-step framework of linear transformations is proposed that generalizes a substantial body of previous work, provides new insights into the behavior of existing methods (including the effectiveness of inverse regression), and motivates a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction.
Abstract:
Using a dictionary to map independently trained word embeddings to a shared space has been shown to be an effective approach to learning bilingual word embeddings. In this work, we propose a multi-step framework of linear transformations that generalizes a substantial body of previous work. The core step of the framework is an orthogonal transformation, and existing methods can be explained in terms of the additional normalization, whitening, re-weighting, de-whitening and dimensionality reduction steps. This allows us to gain new insights into the behavior of existing methods, including the effectiveness of inverse regression, and to design a novel variant that obtains the best published results in zero-shot bilingual lexicon extraction. The corresponding software is released as an open source project.
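The pipeline named in the abstract (normalization, whitening, orthogonal transformation, re-weighting, de-whitening) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' released software: `X` and `Z` are row-aligned dictionary embeddings, `whitener` and `multistep_map` are illustrative names, and the re-weighting here uses the raw singular values rather than a tunable power.

```python
import numpy as np

def whitener(M, eps=1e-12):
    # ZCA-style whitening matrix: (M @ W) has (approximately) identity covariance.
    C = M.T @ M / M.shape[0]
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def multistep_map(X, Z):
    """Sketch of the multi-step framework for row-aligned dictionary
    embeddings X (source) and Z (target). Illustrative, not the paper's code."""
    # Step 1: normalization -- scale every embedding to unit length.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # Step 2: whitening -- decorrelate the dimensions of each space.
    Wx, Wz = whitener(X), whitener(Z)
    Xw, Zw = X @ Wx, Z @ Wz
    # Step 3: orthogonal transformation -- Procrustes solution via SVD
    # of the cross-covariance between the whitened spaces.
    U, s, Vt = np.linalg.svd(Xw.T @ Zw / X.shape[0])
    # Step 4: re-weighting -- scale each direction by its cross-correlation
    # (singular value); the paper's variants differ in how this is applied.
    W = U @ np.diag(s) @ Vt
    # Step 5: de-whitening -- undo the target whitening so the mapped
    # vectors land in the original target space.
    return X @ Wx @ W @ np.linalg.inv(Wz)
```

When the two spaces are related by an exact rotation, this sketch recovers the (length-normalized) target embeddings, which is the sanity check one would expect from an orthogonal core step.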
Citations
Posted Content
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
Mikel Artetxe, Holger Schwenk
TL;DR: This article used a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
Proceedings ArticleDOI
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
TL;DR: This work proposes an alternative approach based on a fully unsupervised initialization that explicitly exploits the structural similarity of the embeddings, and a robust self-learning algorithm that iteratively improves this solution.
Posted Content
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Nils Reimers, Iryna Gurevych
TL;DR: An easy and efficient method to extend existing sentence embedding models to new languages by using the original (monolingual) model to generate sentence embeddings for the source language and then training a new system on translated sentences to mimic the original model.
Proceedings ArticleDOI
Unsupervised Statistical Machine Translation
TL;DR: This paper proposes an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems, and profits from the modular architecture of SMT.
References
Proceedings Article
Distributed Representations of Words and Phrases and their Compositionality
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Posted Content
Exploiting Similarities among Languages for Machine Translation
TL;DR: This method uses distributed representations of words and learns a linear mapping between the vector spaces of two languages, allowing missing word and phrase entries to be translated by exploiting structure learned from large monolingual data and a small amount of bilingual data.
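The linear-mapping idea summarized above reduces to an ordinary least-squares problem: given row-aligned embeddings X (source) and Z (target) for a small seed dictionary, find W minimizing ||XW - Z||². A minimal sketch, with illustrative function names that are not from the cited paper:

```python
import numpy as np

def learn_linear_map(X, Z):
    # Least-squares solution of X @ W ~= Z for the seed dictionary pairs.
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W

def translate(x, W, Z_vocab):
    # Map a source vector into the target space, then return the index of
    # the most cosine-similar target vocabulary vector.
    mapped = x @ W
    sims = (Z_vocab @ mapped) / (
        np.linalg.norm(Z_vocab, axis=1) * np.linalg.norm(mapped) + 1e-12
    )
    return int(np.argmax(sims))
```

The later multi-step framework constrains and decomposes this mapping; unconstrained least squares is the simplest baseline in that family.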
Proceedings Article
Parallel Data, Tools and Interfaces in OPUS
TL;DR: This paper reports on new data sets and their features, additional annotation tools and models provided on the website, and essential interfaces and on-line services included in the OPUS project.
Proceedings ArticleDOI
Zero-shot Learning with Semantic Output Codes
TL;DR: A semantic output code classifier which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes and can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.
Proceedings ArticleDOI
Improving Vector Space Word Representations Using Multilingual Correlation
Manaal Faruqui, Chris Dyer
TL;DR: This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually.
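The CCA-based technique summarized above projects both languages onto maximally correlated directions. A minimal NumPy sketch of classical CCA under stated assumptions (row-aligned samples, full-rank covariances; `cca_projections` is an illustrative name, not the cited paper's code):

```python
import numpy as np

def cca_projections(X, Y, k):
    # Classical CCA: whiten each space, then take the SVD of the
    # whitened cross-covariance; columns of A, B are projection directions.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n

    def inv_sqrt(C, eps=1e-12):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]   # project source: X @ A
    B = Wy @ Vt.T[:, :k]  # project target: Y @ B
    return A, B
```

Projecting monolingual vectors of both languages through A and B places them in a shared space where translation pairs are highly correlated, which is the multilingual-evidence idea the entry describes.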