Open Access · Proceedings Article DOI

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

TLDR
This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Abstract
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.



Citations
Proceedings ArticleDOI

Transformers: State-of-the-Art Natural Language Processing

TL;DR: Transformers is an open-source library that consists of carefully engineered state-of-the art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
Journal ArticleDOI

An introduction to Deep Learning in Natural Language Processing: Models, techniques, and tools

TL;DR: A survey of the application of deep learning techniques in NLP, with a focus on the various tasks where deep learning is demonstrating stronger impact.
Posted Content

SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

TL;DR: The authors leverage multilingual word embeddings, both static and contextualized, for word alignment without relying on any parallel data or dictionaries, and find that alignments created from embeddings are competitive with and mostly superior to those of traditional statistical aligners, even in scenarios with abundant parallel data.
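The unified API of the Transformers library cited above can be illustrated with a short sketch (an assumption-laden example, not taken from either paper: the model name `bert-base-uncased` and the PyTorch backend are illustrative choices, and the first call downloads pretrained weights):

```python
# Sketch of the unified "Auto" API in HuggingFace Transformers:
# one pair of classes loads any supported pretrained architecture by name.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the pretrained encoder.
inputs = tokenizer("Stanza and Transformers are complementary toolkits.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```

The same two lines of loading code work for other curated checkpoints, which is what the "unified API and curated collection of pretrained models" in the TL;DR refers to.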
References
Journal ArticleDOI

Enriching Word Vectors with Subword Information

TL;DR: This paper proposes a new approach based on the skip-gram model, in which each word is represented as a bag of character n-grams and a word vector is the sum of these n-gram representations, allowing models to be trained quickly on large corpora and word representations to be computed for words that did not appear in the training data.
Proceedings ArticleDOI

The Stanford CoreNLP Natural Language Processing Toolkit

TL;DR: This paper describes the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, and suggests that its adoption follows from a simple, approachable design, straightforward interfaces, the inclusion of robust, high-quality analysis components, and the absence of a large amount of associated baggage.
Proceedings ArticleDOI

Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

TL;DR: This paper introduces the CoNLL-2003 shared task on language-independent named entity recognition (NER), describing its data sets and evaluation method and giving a general overview of the participating systems and their performance.
Proceedings Article

Contextual String Embeddings for Sequence Labeling

TL;DR: This paper proposes to leverage the internal states of a trained character language model to produce a novel type of word embedding, referred to as contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.
Posted Content

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

TL;DR: This paper proposed a new benchmark corpus for measuring progress in statistical language modeling, which consists of almost one billion words of training data and can be used to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques.
Trending Questions (1)
What is Stanza?

Stanza is an open-source Python natural language processing toolkit that supports 66 human languages and features a language-agnostic fully neural pipeline for text analysis.