Open Access · Proceedings Article DOI

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

TLDR
This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
Abstract
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.



Citations
Proceedings ArticleDOI

Transformers: State-of-the-Art Natural Language Processing

TL;DR: Transformers is an open-source library that consists of carefully engineered state-of-the art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
Journal ArticleDOI

An introduction to Deep Learning in Natural Language Processing: Models, techniques, and tools

TL;DR: A survey of the application of deep learning techniques in NLP, with a focus on the various tasks where deep learning is demonstrating stronger impact.
Posted Content

SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

TL;DR: The authors leverage multilingual word embeddings, both static and contextualized, for word alignment without relying on any parallel data or dictionaries, and find that alignments created from embeddings are competitive with and mostly superior to those of traditional statistical aligners, even in scenarios with abundant parallel data.
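The unified API of the Transformers library cited above can be illustrated with a short sketch (an assumption-laden example, not taken from either paper: the model name `bert-base-uncased` and the PyTorch backend are illustrative choices, and the first call downloads pretrained weights):

```python
# Sketch of the unified "Auto" API in HuggingFace Transformers:
# one pair of classes loads any supported pretrained architecture by name.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a sentence and run it through the pretrained encoder.
inputs = tokenizer("Stanza and Transformers are complementary toolkits.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```

The same two lines of loading code work for other curated checkpoints, which is what the "unified API and curated collection of pretrained models" in the TL;DR refers to.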
References
Journal ArticleDOI

Enriching Word Vectors with Subword Information

TL;DR: This paper proposes a new approach based on the skip-gram model, in which each word is represented as a bag of character n-grams and a word vector is the sum of these n-gram representations, allowing models to be trained quickly on large corpora and word representations to be computed for words that did not appear in the training data.
Proceedings ArticleDOI

The Stanford CoreNLP Natural Language Processing Toolkit

TL;DR: This paper describes the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, and suggests that its adoption follows from a simple, approachable design, straightforward interfaces, the inclusion of robust, high-quality analysis components, and the absence of a large amount of associated baggage.
Proceedings ArticleDOI

Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

TL;DR: This paper introduces the CoNLL-2003 shared task on language-independent named entity recognition (NER), describing its data sets and evaluation method and giving a general overview of the participating systems and their performance.
Proceedings Article

Contextual String Embeddings for Sequence Labeling

TL;DR: This paper proposes to leverage the internal states of a trained character language model to produce a novel type of word embedding, referred to as contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.
Posted Content

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

TL;DR: This paper proposed a new benchmark corpus for measuring progress in statistical language modeling, which consists of almost one billion words of training data and can be used to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques.
Trending Questions (1)
What is Stanza?

Stanza is an open-source Python natural language processing toolkit that supports 66 human languages and features a language-agnostic fully neural pipeline for text analysis.