Open Access Posted Content

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization.

TLDR
This article proposes a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion, enumerating candidate subword blocks and scoring them in a position-wise fashion using a block scoring network.
Abstract
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
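To make the GBST description above concrete, here is a minimal NumPy sketch of the idea under some stated assumptions: candidate subword blocks of a few sizes are formed by mean-pooling character (byte) embeddings, a block scoring network (reduced here to a single linear map) scores each candidate per position, and a softmax over the scores mixes the candidates into a latent subword representation for every position. The block sizes, pooling choice, and layer shapes are illustrative rather than the authors' exact implementation, which, per the paper, also downsamples the resulting sequence before the Transformer layers.

```python
# Minimal NumPy sketch of soft gradient-based subword tokenization (GBST):
# enumerate candidate subword blocks, score them position-wise, and mix them
# with a softmax. Block sizes, mean pooling, and the single linear scorer are
# illustrative assumptions, not the authors' exact implementation.
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 12, 16          # number of byte positions, embedding size
block_sizes = (1, 2, 3, 4)         # candidate subword block lengths (assumed)

char_embeddings = rng.normal(size=(seq_len, d_model))   # byte/char embeddings
scorer_w = rng.normal(size=(d_model,))                  # "block scoring network"
                                                        # (here: one linear map to a scalar)

def block_candidates(x, b):
    """Mean-pool non-overlapping blocks of size b, then broadcast each pooled
    vector back to every position it covers, giving one candidate per position."""
    n = x.shape[0]
    pad = (-n) % b
    xp = np.pad(x, ((0, pad), (0, 0)))
    pooled = xp.reshape(-1, b, x.shape[1]).mean(axis=1)         # (n_blocks, d)
    return np.repeat(pooled, b, axis=0)[:n]                     # (seq_len, d)

# Candidates: shape (num_block_sizes, seq_len, d_model)
candidates = np.stack([block_candidates(char_embeddings, b) for b in block_sizes])

# Position-wise scores for each block size, softmaxed over block sizes.
scores = candidates @ scorer_w                                   # (num_sizes, seq_len)
weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# Soft "tokenization": every position gets a mixture of its block candidates.
latent_subwords = (weights[..., None] * candidates).sum(axis=0)  # (seq_len, d_model)
print(latent_subwords.shape)                                     # (12, 16)
```

Because the mixing weights come from a softmax over learned scores, the whole module is differentiable and can be trained end-to-end with the rest of the model, which is the key contrast with separate, rigid subword tokenizers.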


Citations
Posted Content

On the Opportunities and Risks of Foundation Models.

Rishi Bommasani, +113 more
16 Aug 2021
TL;DR: The authors provide a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications.
Posted Content

Perceiver IO: A General Architecture for Structured Inputs & Outputs

TL;DR: Perceiver IO, proposed in this paper, learns to flexibly query the model's latent space to produce outputs of arbitrary size and semantics, and achieves state-of-the-art results on tasks with highly structured output spaces.
Posted Content

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

TL;DR: In this paper, the authors present scaling insights from pre-training and fine-tuning Transformers, showing that model shape matters for downstream fine-tuning beyond model size alone, that scaling protocols operate differently at different compute regions, and that the widely adopted T5-Base and T5-Large sizes are Pareto-inefficient.
Posted Content

Evaluating Various Tokenizers for Arabic Text Classification.

TL;DR: In this paper, the authors introduce three new tokenization algorithms for Arabic, compare them to three baselines using unsupervised evaluations, and evaluate all six algorithms on three tasks: sentiment analysis, news classification, and poetry classification.
Posted Content

Demystifying Neural Language Models' Insensitivity to Word-Order

TL;DR: This paper investigated the sensitivity of NLP models to word-order perturbations and analyzed their effect on neural models' performance on language understanding tasks in the GLUE benchmark, finding that neural language models rely on local word order more than on the global ordering of tokens.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
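As a pointer for readers unfamiliar with the mechanism this summary refers to, below is a minimal NumPy sketch of single-head scaled dot-product self-attention; the projection matrices and sizes are illustrative, and masking, multiple heads, and the feed-forward layers of the full architecture are omitted.

```python
# Minimal NumPy sketch of single-head scaled dot-product self-attention.
# Variable names and sizes are illustrative; no masking or multi-head logic.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8

x = rng.normal(size=(seq_len, d_model))        # input token representations
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ w_q, x @ w_k, x @ w_v            # queries, keys, values
scores = q @ k.T / np.sqrt(d_model)            # pairwise compatibility, scaled
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
output = weights @ v                           # each position attends to all others
print(output.shape)                            # (6, 8)
```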
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT, as introduced in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
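As a small illustration of the negative sampling objective named in this summary, here is a minimal NumPy sketch of the skip-gram loss for one (center, context) pair with k negative samples; the vocabulary size, embedding size, and uniform negative sampler are assumptions for brevity (the paper draws negatives from a smoothed unigram distribution).

```python
# Minimal NumPy sketch of the skip-gram negative sampling loss, an alternative
# to the hierarchical softmax. Sizes and the uniform sampler are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k_neg = 1000, 50, 5

in_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))   # "input" word vectors
out_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # "output" word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(center, context, negatives):
    """-log sigma(v_c . u_o) - sum_j log sigma(-v_c . u_neg_j)"""
    v_c = in_vecs[center]
    pos = np.log(sigmoid(out_vecs[context] @ v_c))
    neg = np.log(sigmoid(-out_vecs[negatives] @ v_c)).sum()
    return -(pos + neg)

# One (center, context) pair with k negatives drawn uniformly here.
negatives = rng.integers(0, vocab_size, size=k_neg)
print(negative_sampling_loss(center=3, context=7, negatives=negatives))
```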
Book

Foundations of Statistical Natural Language Processing

TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Proceedings Article

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).