Open Access Proceedings Article

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

TLDR
The authors presented BPEmb, a collection of pre-trained subword unit embeddings in 275 languages based on Byte-Pair Encoding (BPE), and evaluated it using fine-grained entity typing as a testbed.
Abstract
We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages better than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at this https URL
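To make the underlying technique concrete, the following is a minimal, hypothetical sketch of BPE vocabulary learning and segmentation. It is not the BPEmb implementation (BPEmb uses SentencePiece-trained models); `learn_bpe` and `segment` are illustrative names, and the toy word frequencies are invented.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a word-frequency dict.

    Minimal illustrative sketch: repeatedly merge the most frequent
    adjacent symbol pair. Real toolkits add end-of-word markers,
    frequency thresholds, and much faster data structures.
    """
    # Represent each word as a tuple of single-character symbols.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge to every word in the working vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a word into subwords by replaying the learned merges."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy usage: frequent character sequences become single subword units,
# so unseen words like "lowest" decompose into known pieces.
merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
print(segment("lowest", merges))
```

Because segmentation only replays learned merges over raw characters, the approach needs no language-specific tokenizer, which is the property the abstract highlights.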


Citations
Proceedings ArticleDOI

FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP

TL;DR: The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to "mix and match" various embeddings with little effort.
Proceedings ArticleDOI

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

TL;DR: The SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020) comprised three subtasks corresponding to the hierarchical taxonomy of the OLID schema and was offered in five languages: Arabic, Danish, English, Greek, and Turkish.
Proceedings ArticleDOI

Are All Languages Created Equal in Multilingual BERT?

TL;DR: This work explores how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages as measured by within-language performance, and finds that better models for low-resource languages require more efficient pretraining techniques or more data.
References
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
Proceedings ArticleDOI

The Stanford CoreNLP Natural Language Processing Toolkit

TL;DR: The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis, and it is suggested that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
Proceedings ArticleDOI

Freebase: a collaboratively created graph database for structuring human knowledge

TL;DR: MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.
Proceedings Article

Algorithms for Hyper-Parameter Optimization

TL;DR: This work contributes novel techniques for making response surface models P(y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements.