Open Access Proceedings Article (DOI)

URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors

TLDR
The paper introduces the URIEL knowledge base for massively multilingual NLP and the lang2vec utility, which provide information-rich vector identifications of languages drawn from typological, geographical, and phylogenetic databases and normalized to have straightforward and consistent formats, naming, and semantics.
Abstract
We introduce the URIEL knowledge base for massively multilingual NLP and the lang2vec utility, which provides information-rich vector identifications of languages drawn from typological, geographical, and phylogenetic databases and normalized to have straightforward and consistent formats, naming, and semantics. The goal of URIEL and lang2vec is to enable multilingual NLP, especially on less-resourced languages, and to make possible types of experiments (especially but not exclusively related to NLP tasks) that would otherwise be difficult or impossible due to the sparsity and incommensurability of the data sources. lang2vec vectors have been shown to reduce perplexity in multilingual language modeling when compared to one-hot language identification vectors.
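As a concrete illustration, the sketch below queries lang2vec for a few of its feature sets. It assumes the pip-installable lang2vec package and its documented get_features interface; the exact feature-set names ("syntax_knn", "geo", "fam") follow the package documentation and may differ across versions.

```python
# Minimal sketch (assumes `pip install lang2vec`); feature-set names are
# taken from the package documentation and are an assumption here.
import lang2vec.lang2vec as l2v

languages = ["eng", "fra", "swa"]  # ISO 639-3 codes

# Typological syntax features, with missing values imputed by k-nearest neighbours.
syntax = l2v.get_features(languages, "syntax_knn")

# Geographical and phylogenetic (family membership) vectors from the same interface.
geo = l2v.get_features(languages, "geo")
fam = l2v.get_features(languages, "fam")

for lang in languages:
    print(lang,
          len(syntax[lang]), "syntax dims,",
          len(geo[lang]), "geo dims,",
          len(fam[lang]), "family dims")
```

Each call returns a dictionary mapping language codes to fixed-length vectors, which can be concatenated or fed directly to a model in place of one-hot language identifiers.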



Citations
Proceedings Article (DOI)

From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers

TL;DR: It is demonstrated that inexpensive few-shot transfer (i.e., additional fine-tuning on a few target-language instances) is surprisingly effective across the board, warranting more research effort beyond the limiting zero-shot conditions.
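A minimal sketch of the few-shot recipe this summary describes, assuming a pretrained multilingual encoder from the Hugging Face transformers library and a hypothetical handful of labelled target-language examples; the model name, task, and data are placeholders, not the paper's exact setup.

```python
# Sketch: continue fine-tuning on a few target-language examples
# (hypothetical German sentiment pairs) after source-language training.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # assumption: any multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Dies ist großartig.", "Das war enttäuschend."]  # the "few shots"
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a handful of epochs for a handful of examples
    out = model(**batch, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```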
Proceedings Article (DOI)

Learning Language Representations for Typology Prediction

TL;DR: Experiments show that the proposed method is able to infer not only syntactic but also phonological and phonetic inventory features, and that it improves over a baseline with access to information about each language's geographic and phylogenetic neighbors.
Proceedings Article (DOI)

On Difficulties of Cross-Lingual Transfer with Order Differences: A Case Study on Dependency Parsing

TL;DR: The authors compare encoders and decoders based on Recurrent Neural Networks (RNNs) with modified self-attentive architectures for cross-lingual transfer, showing that RNN-based architectures transfer well to languages close to English, while the self-attentive architectures perform especially well on distant languages.
References
Proceedings Article (DOI)

Improving Vector Space Word Representations Using Multilingual Correlation

TL;DR: This paper argues that lexico-semantic content should additionally be invariant across languages and proposes a simple technique based on canonical correlation analysis (CCA) for incorporating multilingual evidence into vectors generated monolingually.
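A sketch of the kind of CCA projection this summary describes, using scikit-learn; the embedding matrices stand in for monolingual word vectors paired through a bilingual dictionary, and all sizes and names are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: project two monolingual embedding spaces onto maximally correlated
# dimensions using translation pairs; random arrays stand in for real vectors.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_en = rng.normal(size=(1000, 300))  # English vectors for dictionary words
X_de = rng.normal(size=(1000, 300))  # German vectors for their translations

cca = CCA(n_components=100, max_iter=1000)
cca.fit(X_en, X_de)

# Either space (including out-of-dictionary vectors) can now be mapped into
# the shared, correlated subspace.
en_proj, de_proj = cca.transform(X_en, X_de)
print(en_proj.shape, de_proj.shape)  # (1000, 100) (1000, 100)
```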
Proceedings Article (DOI)

Multilingual Models for Compositional Distributed Semantics

TL;DR: The authors leverage parallel data and learn to strongly align the embeddings of semantically equivalent sentences, while maintaining sufficient distance between those of dissimilar sentences, without relying on word alignments or any syntactic information.
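A minimal sketch of the alignment objective described above: a margin-based loss that pulls embeddings of parallel sentences together while pushing a sampled non-parallel sentence at least a margin away. The encoder is omitted and the tensors are placeholders; this illustrates the objective, not the paper's full model.

```python
# Sketch of a margin-based alignment loss over sentence embeddings.
import torch
import torch.nn.functional as F

def hinge_alignment_loss(src, tgt, noise, margin=1.0):
    """src, tgt, noise: (batch, dim) sentence embeddings."""
    pos = ((src - tgt) ** 2).sum(dim=1)       # distance to the true translation
    neg = ((src - noise) ** 2).sum(dim=1)     # distance to a non-translation
    return F.relu(margin + pos - neg).mean()  # zero once neg exceeds pos by the margin

# Toy usage with random "embeddings"; in practice these would come from
# composing word vectors over each side of a parallel corpus.
src = torch.randn(32, 128)
tgt = src + 0.1 * torch.randn(32, 128)
noise = torch.randn(32, 128)
print(hinge_alignment_loss(src, tgt, noise).item())
```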