Open AccessProceedings Article
BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages
TLDR
The authors presented BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE) for fine-grained entity typing.Abstract:
We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages bet- ter than alternative subword approaches, while requiring vastly fewer resources and no tokenization. BPEmb is available at this https URLread more
Citations
More filters
Proceedings ArticleDOI
FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP
TL;DR: The core idea of the FLAIR framework is to present a simple, unified interface for conceptually very different types of word and document embeddings, which effectively hides all embedding-specific engineering complexity and allows researchers to “mix and match” variousembeddings with little effort.
Journal Article
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Proceedings ArticleDOI
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri,Preslav Nakov,Sara Rosenthal,Pepa Atanasova,Georgi Karadzhov,Hamdy Mubarak,Leon Derczynski,Zeses Pitenis,Çağrı Çöltekin +8 more
TL;DR: The SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020) as mentioned in this paper included three subtasks corresponding to the hierarchical taxonomy of the OLID schema, and was offered in five languages: Arabic, Danish, English, Greek, and Turkish.
Posted Content
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri,Preslav Nakov,Sara Rosenthal,Pepa Atanasova,Georgi Karadzhov,Hamdy Mubarak,Leon Derczynski,Zeses Pitenis,Çağrı Çöltekin +8 more
TL;DR: The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish.
Proceedings ArticleDOI
Are All Languages Created Equal in Multilingual BERT
Shijie Wu,Mark Dredze +1 more
TL;DR: This work explores how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance, and finds that better models for low resource languages require more efficient pretraining techniques or more data.
References
More filters
Proceedings ArticleDOI
Glove: Global Vectors for Word Representation
TL;DR: A new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.
Posted Content
Efficient Estimation of Word Representations in Vector Space
TL;DR: This paper proposed two novel model architectures for computing continuous vector representations of words from very large data sets, and the quality of these representations is measured in a word similarity task and the results are compared to the previously best performing techniques based on different types of neural networks.
Proceedings ArticleDOI
The Stanford CoreNLP Natural Language Processing Toolkit
Christopher D. Manning,Mihai Surdeanu,John Bauer,Jenny Rose Finkel,Steven Bethard,David McClosky +5 more
TL;DR: The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis, and it is suggested that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
Proceedings ArticleDOI
Freebase: a collaboratively created graph database for structuring human knowledge
TL;DR: MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.
Proceedings Article
Algorithms for Hyper-Parameter Optimization
TL;DR: This work contributes novel techniques for making response surface models P(y|x) in which many elements of hyper-parameter assignment (x) are known to be irrelevant given particular values of other elements.