Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, Christopher D. Manning
pp. 101–108
TLDR
This work introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages that features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.

Abstract:
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.
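The abstract lists the stages of Stanza's neural pipeline. As a minimal sketch of how those stages are driven in practice (assuming `pip install stanza` and a one-time model download; the helper function name `analyze` is our own, but the processor names and `stanza.Pipeline` API are the documented ones):

```python
# Stages of the Stanza pipeline named in the abstract, in order.
PROCESSORS = "tokenize,mwt,pos,lemma,depparse,ner"

def analyze(text, lang="en"):
    """Run the full neural pipeline and return (word, lemma, UPOS, deprel) tuples."""
    import stanza                                  # requires `pip install stanza`
    stanza.download(lang, verbose=False)           # fetch pretrained models (once)
    nlp = stanza.Pipeline(lang=lang, processors=PROCESSORS, verbose=False)
    doc = nlp(text)
    return [(w.text, w.lemma, w.upos, w.deprel)
            for sent in doc.sentences
            for w in sent.words]
```

For example, `analyze("Stanza supports 66 human languages.")` would return one tuple per word, each carrying the lemma, universal POS tag, and dependency relation produced by the corresponding pipeline stage.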
Citations
Proceedings ArticleDOI
Transformers: State-of-the-Art Natural Language Processing
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, Alexander M. Rush +15 more
TL;DR: Transformers is an open-source library that consists of carefully engineered state-of-the-art Transformer architectures under a unified API and a curated collection of pretrained models made by and available for the community.
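The unified API noted in this TL;DR centers on a single `pipeline` entry point that dispatches a named task to a pretrained model. A minimal sketch, assuming `pip install transformers` and a model download on first use (the wrapper function `sentiment` is our own; `pipeline` is the library's documented entry point):

```python
def sentiment(texts, model_name=None):
    """Classify sentiment with a pretrained Transformer via the unified API."""
    from transformers import pipeline              # requires `pip install transformers`
    clf = pipeline("sentiment-analysis", model=model_name)
    return clf(texts)   # list of {"label": ..., "score": ...} dicts
```

The same `pipeline(...)` call pattern covers other tasks (e.g. named entity recognition or question answering) by changing the task string, which is what makes the API "unified."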
Journal ArticleDOI
An introduction to Deep Learning in Natural Language Processing: Models, techniques, and tools
TL;DR: A survey of the application of deep learning techniques in NLP, with a focus on the tasks where deep learning has had the strongest impact.
Posted Content
SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings
TL;DR: The authors leverage multilingual word embeddings, both static and contextualized, for word alignment without relying on any parallel data or dictionaries, and find that alignments created from embeddings are competitive with, and mostly superior to, traditional statistical aligners, even in scenarios with abundant parallel data.
References
Journal ArticleDOI
Enriching Word Vectors with Subword Information
TL;DR: This paper proposed a new approach based on the skip-gram model in which each word is represented as a bag of character n-grams, a word being the sum of these representations; this makes it possible to train models on large corpora quickly and to compute representations for words that did not appear in the training data.
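The subword scheme summarized above pads a word with boundary markers and decomposes it into character n-grams (the paper uses n from 3 to 6), whose vectors are summed to form the word vector. A minimal, illustrative extraction of those n-grams (the function name and defaults are our own):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, padded with the boundary
    markers '<' and '>' as in the subword approach described above.
    The whole padded word is also kept as its own unit."""
    padded = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    if padded not in grams:          # whole-word unit, kept separately
        grams.append(padded)
    return grams

# e.g. the trigrams of "where":
# char_ngrams("where", 3, 3) -> ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

The boundary markers let the model distinguish a prefix or suffix from the same character sequence occurring word-internally (e.g. `'re>'` vs. `'re'` inside another word).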
Proceedings ArticleDOI
The Stanford CoreNLP Natural Language Processing Toolkit
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, David McClosky
TL;DR: Describes the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis, and suggests that its wide adoption follows from a simple, approachable design, straightforward interfaces, the inclusion of robust, good-quality analysis components, and minimal associated baggage.
Proceedings ArticleDOI
Introduction to the CoNLL-2003 shared task: language-independent named entity recognition
TL;DR: Introduces the CoNLL-2003 shared task, which provided data sets and an evaluation method for language-independent named entity recognition (NER), and gives a general overview of the participating systems and their performance.
Proceedings Article
Contextual String Embeddings for Sequence Labeling
TL;DR: This paper proposes leveraging the internal states of a trained character-level language model to produce a novel type of word embedding, termed contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.
Posted Content
One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson
TL;DR: This paper proposed a new benchmark corpus for measuring progress in statistical language modeling, which consists of almost one billion words of training data and can be used to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques.