Open Access Proceedings Article

Polyglot: Distributed Word Representations for Multilingual NLP

TL;DR
The authors trained word embeddings for more than 100 languages from their corresponding Wikipedias and found their performance to be competitive with near state-of-the-art methods in English, Danish and Swedish.
Abstract
Distributed word representations (word embeddings) have recently contributed to competitive performance in language modeling and several NLP tasks. In this work, we train word embeddings for more than 100 languages using their corresponding Wikipedias. We quantitatively demonstrate the utility of our word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages. We find their performance to be competitive with near state-of-the-art methods in English, Danish and Swedish. Moreover, we investigate the semantic features captured by these embeddings through the proximity of word groupings. We will release these embeddings publicly to help researchers in the development and enhancement of multilingual applications.
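As a rough illustration of the evaluation described in the abstract, the sketch below uses pre-trained embeddings as the sole features for a part-of-speech tagger: each token is represented by the concatenated embeddings of a small context window and fed to an off-the-shelf classifier. The embedding file name, the plain-text format, the window size, and the use of scikit-learn are assumptions for the sketch, not the paper's actual pipeline.

```python
# Sketch: pre-trained embeddings as the sole features for POS tagging (not the paper's exact setup).
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_embeddings(path):
    """Read a hypothetical plain-text embedding file: one 'word v1 v2 ... vd' per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            vectors[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vectors

def window_features(words, i, vectors, dim, window=2):
    """Concatenate embeddings of words[i-window .. i+window]; out-of-range or unknown words get zeros."""
    feats = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(words) and words[j] in vectors:
            feats.append(vectors[words[j]])
        else:
            feats.append(np.zeros(dim))
    return np.concatenate(feats)

vectors = load_embeddings("embeddings.txt")      # placeholder path
dim = len(next(iter(vectors.values())))
corpus = [(["the", "dog", "barks"], ["DET", "NOUN", "VERB"])]  # toy tagged corpus
X = np.stack([window_features(ws, i, vectors, dim) for ws, _ in corpus for i in range(len(ws))])
y = [t for _, ts in corpus for t in ts]
tagger = LogisticRegression(max_iter=1000).fit(X, y)
print(tagger.predict(X[:1]))
```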


Citations
Proceedings Article

DeepWalk: online learning of social representations

TL;DR: DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences; the resulting representations encode social relations in a continuous vector space that is easily exploited by statistical models.
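As an illustration of the walk-as-sentence idea, the sketch below generates truncated random walks over a small graph and trains a skip-gram model on them; networkx and gensim are assumed, and all hyperparameters are placeholders rather than DeepWalk's published settings.

```python
# Sketch: truncated random walks treated as "sentences" for a skip-gram model.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walk(graph, start, length):
    walk = [start]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(node) for node in walk]  # Word2Vec expects string tokens

graph = nx.karate_club_graph()
walks = [random_walk(graph, node, length=10)
         for _ in range(20)          # walks per node (placeholder)
         for node in graph.nodes()]

# Skip-gram (sg=1) over the walks; vector_size and window are placeholder values.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
print(model.wv.most_similar("0", topn=5))  # nodes that share social context with node 0
```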
Proceedings Article

Dependency-Based Word Embeddings

TL;DR: The skip-gram model with negative sampling introduced by Mikolov et al. is generalized to include arbitrary contexts, and experiments with dependency-based contexts are performed, showing that they produce markedly different embeddings.
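To make the contrast with linear-window contexts concrete, the following sketch extracts dependency-based (word, context) pairs from a spaCy parse; spaCy and its small English model are assumptions here, and the pair format only approximates the one used in the paper.

```python
# Sketch: dependency-based (word, context) pairs instead of linear-window contexts.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Australian scientist discovers star with telescope")

pairs = []
for token in doc:
    if token.dep_ == "ROOT":
        continue
    # Pair each word with its syntactic head, labeled by the dependency relation,
    # and pair the head with the dependent in the inverse direction.
    pairs.append((token.text, f"{token.head.text}/{token.dep_}"))
    pairs.append((token.head.text, f"{token.text}/{token.dep_}-1"))

print(pairs)  # pairs such as ('scientist', 'discovers/nsubj') and ('discovers', 'scientist/nsubj-1')
```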
Proceedings Article

Learning Word Vectors for 157 Languages

TL;DR: This article used two sources of data to train these word vectors: the free online encyclopedia Wikipedia and data from the Common Crawl project; it also introduced three new word analogy datasets, for French, Hindi and Polish, to evaluate them.
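Vectors from this line of work are distributed in the standard word2vec text format, so a minimal way to load them and run an analogy query is shown below; the file name is a placeholder and the example analogy is illustrative rather than taken from the paper's new datasets.

```python
# Sketch: load pre-trained .vec word vectors (word2vec text format) and query an analogy.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("wiki.fr.vec", binary=False)  # placeholder path

# Additive analogy query: roi - homme + femme ~ reine ("king - man + woman ~ queen").
print(kv.most_similar(positive=["roi", "femme"], negative=["homme"], topn=5))

# Plain nearest-neighbor query.
print(kv.most_similar("paris", topn=5))
```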
Proceedings Article

Linguistic Regularities in Sparse and Explicit Word Representations

TL;DR: It is demonstrated that analogy recovery is not restricted to neural word embeddings, and that a similar amount of relational similarity can be recovered from traditional distributional word representations.
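The analogy-recovery objectives studied in this line of work, the additive 3CosAdd and the multiplicative 3CosMul, can be written directly over any row-normalized word-vector matrix, whether sparse-explicit or dense; the numpy sketch below assumes unit-length rows and uses random vectors purely as stand-ins for real representations.

```python
# Sketch: 3CosAdd vs. 3CosMul analogy scoring over a row-normalized embedding matrix.
import numpy as np

def analogy(vectors, vocab, a, a_star, b, mul=True, eps=1e-3):
    """Answer 'a is to a_star as b is to ?' over unit-normalized rows of `vectors`."""
    idx = {w: i for i, w in enumerate(vocab)}
    # Cosine similarity of every word to a, a*, and b, shifted into [0, 1] for 3CosMul.
    sims = lambda w: (vectors @ vectors[idx[w]] + 1.0) / 2.0
    sa, sa_star, sb = sims(a), sims(a_star), sims(b)
    scores = sa_star * sb / (sa + eps) if mul else sa_star + sb - sa
    for w in (a, a_star, b):        # exclude the query words themselves
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# Toy usage with random unit vectors standing in for real (sparse or dense) representations.
vocab = ["king", "queen", "man", "woman", "apple"]
vecs = np.random.randn(len(vocab), 50)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(analogy(vecs, vocab, "man", "king", "woman"))
```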
References
Journal Article

Learning representations by back-propagating errors

TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector; as a result, the network's internal "hidden" units come to represent important features of the task domain.
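The update rule summarized above, repeatedly nudging weights downhill on the output error so that hidden units pick up useful features, fits in a few lines for a tiny one-hidden-layer network with squared error; this numpy sketch on toy XOR data is purely illustrative.

```python
# Sketch: backpropagation for a tiny one-hidden-layer network with squared error (toy XOR data).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):
    H = sigmoid(X @ W1)           # hidden activations
    out = sigmoid(H @ W2)         # actual output vector
    err = out - Y                 # difference from the desired output vector
    # Backward pass: propagate the error and adjust the weights downhill.
    d_out = err * out * (1 - out)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    W2 -= 0.5 * H.T @ d_out
    W1 -= 0.5 * X.T @ d_hid

print(np.round(out, 2))  # should approach [[0], [1], [1], [0]]
```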
Report

Building a large annotated corpus of English: the Penn Treebank

TL;DR: As a result of this grant, the researchers have now published on CD-ROM a corpus of over 4 million words of running text annotated with part-of-speech (POS) tags, which includes a fully hand-parsed version of the classic Brown corpus.
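A small sample of this corpus ships with NLTK, so one quick way to look at the POS-tagged and hand-parsed annotations is shown below; it assumes the NLTK data package has been downloaded.

```python
# Sketch: inspecting the Penn Treebank sample bundled with NLTK.
import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)        # fetch the sample distributed with NLTK

print(treebank.tagged_sents()[0])            # (word, POS) pairs, e.g. ('Pierre', 'NNP'), ...
treebank.parsed_sents()[0].pretty_print()    # the corresponding hand-parsed tree
```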
Journal Article

Natural Language Processing (Almost) from Scratch

TL;DR: This work proposes a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks, including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling.
Proceedings Article

A unified architecture for natural language processing: deep neural networks with multitask learning

TL;DR: This work describes a single convolutional neural network architecture that, given a sentence, outputs a host of language processing predictions: part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words and the likelihood that the sentence makes sense using a language model.
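A drastically simplified version of the shared-network, multiple-output idea can be sketched in PyTorch as one shared convolutional encoder over word embeddings with a separate linear head per task; the layer sizes, task list, and label-set sizes below are placeholders, not the architecture from the paper.

```python
# Sketch: one shared convolutional encoder with per-task output heads (multitask tagging).
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=50, hidden=100, task_sizes=None):
        super().__init__()
        task_sizes = task_sizes or {"pos": 45, "chunk": 23, "ner": 9}  # placeholder label-set sizes
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        # One linear head per task, all sharing the encoder above.
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in task_sizes.items()})

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)        # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x)).transpose(1, 2)     # (batch, seq_len, hidden)
        return {task: head(h) for task, head in self.heads.items()}

model = MultiTaskTagger(vocab_size=10000)
scores = model(torch.randint(0, 10000, (2, 7)))          # per-task scores for a toy batch
print({task: s.shape for task, s in scores.items()})
```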
Proceedings Article

Recurrent neural network based language model

TL;DR: Results indicate that it is possible to obtain around a 50% reduction of perplexity by using a mixture of several RNN LMs, compared to a state-of-the-art backoff language model.
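The interpolation and perplexity arithmetic behind that comparison is straightforward; the sketch below mixes per-word probabilities from several (here invented) language models with fixed weights and computes perplexity, with all numbers serving only as placeholders.

```python
# Sketch: interpolating per-word probabilities from several LMs and measuring perplexity.
import numpy as np

def perplexity(word_probs):
    """Perplexity = exp of the average negative log-probability per word."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# Probabilities each model assigns to the same test words (placeholder numbers).
rnn_lm_1 = np.array([0.10, 0.05, 0.20, 0.08])
rnn_lm_2 = np.array([0.12, 0.04, 0.15, 0.10])
backoff  = np.array([0.06, 0.03, 0.10, 0.05])

# Linear interpolation with fixed mixture weights (normally tuned on held-out data).
weights = np.array([0.4, 0.4, 0.2])
mixture = weights[0] * rnn_lm_1 + weights[1] * rnn_lm_2 + weights[2] * backoff

print(perplexity(backoff), perplexity(mixture))  # the mixture's perplexity is lower here
```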