Cross-lingual Name Tagging and Linking for 282 Languages

doi:10.18653/v1/P17-1178

Open Access Proceedings Article


Xiaoman Pan, +5 more

- Vol. 1, pp 1946-1958


TLDR

This work develops a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. The framework identifies name mentions, assigns a coarse-grained or fine-grained type to each mention, and links it to an English Knowledge Base (KB) if it is linkable.

Abstract:

The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by applying a series of new KB mining methods: generating “silver-standard” annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and on non-Wikipedia data.
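The first mining step in the abstract, transferring English annotations to another language through cross-lingual links, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_silver_tags`, the toy lookup tables, the BIO tagging scheme, and the `PER`/`GPE` labels are all assumptions made for illustration. It only shows the general idea that an anchor link whose target page has an English counterpart with a known type can yield a "silver-standard" tag sequence.

```python
from typing import Dict, List, Tuple

# Hypothetical toy resources (assumptions, not data from the paper).
# Maps a page title in the source language to its English page title.
CROSS_LINGUAL_LINKS: Dict[str, str] = {
    "Angela Merkel": "Angela Merkel",
    "München": "Munich",
}
# Maps an English page title to an entity type taken from the English KB.
ENGLISH_TYPES: Dict[str, str] = {
    "Angela Merkel": "PER",
    "Munich": "GPE",
}

# An anchor link: (start token index, end token index exclusive, target page title).
Anchor = Tuple[int, int, str]


def project_silver_tags(
    tokens: List[str],
    anchors: List[Anchor],
    links: Dict[str, str],
    types: Dict[str, str],
) -> List[str]:
    """Turn anchor links into BIO tags when the target page maps to a typed English entry."""
    tags = ["O"] * len(tokens)
    for start, end, title in anchors:
        english_title = links.get(title)
        label = types.get(english_title) if english_title is not None else None
        if label is None:
            # No cross-lingual link or no known type: leave the tokens untagged.
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


if __name__ == "__main__":
    sentence = ["Angela", "Merkel", "besuchte", "München", "."]
    sentence_anchors = [(0, 2, "Angela Merkel"), (2, 3, "Besuch"), (3, 4, "München")]
    print(project_silver_tags(sentence, sentence_anchors, CROSS_LINGUAL_LINKS, ENGLISH_TYPES))
```

In this toy sentence the anchor pointing at "Besuch" has no entry in the link table, so its token stays untagged. Annotations produced this way are noisy and incomplete, which is why the abstract describes further refinement through self-training and topic selection.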

Citations


Posted Content

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization

Junjie Hu, +5 more

- 24 Mar 2020 -

arXiv: Computation and Language

TL;DR: The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is introduced, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.


Proceedings Article

IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages

Divyanshu Kakwani, +6 more

TL;DR: This paper introduces NLP resources for 11 major Indian languages from two major language families, and creates datasets for the following tasks: Article Genre Classification, Headline Prediction, Wikipedia Section-Title Prediction, Cloze-style Multiple choice QA, Winograd NLI and COPA.


Proceedings Article

Emerging Cross-lingual Structure in Pretrained Language Models

Alexis Conneau, +4 more

TL;DR: It is shown that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains, and it is strongly suggested that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces.


Proceedings Article

MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer

Jonas Pfeiffer, +3 more

TL;DR: This paper proposed MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages by learning modular language and task representations, and introduced a novel invertible adapter architecture and a strong baseline method for adapting a pre-trained multilingual model to a new language.


Proceedings Article

XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation

Junjie Hu, +5 more

TL;DR: The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark is a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.


References


Proceedings Article

The Stanford CoreNLP Natural Language Processing Toolkit

Christopher D. Manning, +5 more

TL;DR: The design and use of the Stanford CoreNLP toolkit is described, an extensible pipeline that provides core natural language analysis, and it is suggested that its wide use follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.


Proceedings Article

Freebase: a collaboratively created graph database for structuring human knowledge

Kurt Bollacker, +4 more

TL;DR: MQL provides an easy-to-use object-oriented interface to the tuple data in Freebase and is designed to facilitate the creation of collaborative, Web-based data-oriented applications.


Journal Article

A systematic comparison of various statistical alignment models

Franz Josef Och, +1 more

- 01 Mar 2003 -

Computational Linguistics

TL;DR: An important result is that refined alignment models with a first-order dependence and a fertility model yield significantly better results than simple heuristic models.


Journal Article

Word association norms, mutual information, and lexicography

Kenneth Church, +1 more

- 01 Mar 1990 -

Computational Linguistics

TL;DR: The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.


Proceedings Article

Neural Architectures for Named Entity Recognition

Guillaume Lample, +4 more

TL;DR: Paper presented at the 2016 Conference of the North American Chapter of the Association for Computational Linguistics, held in San Diego (CA, USA), June 12 to 17, 2016.

