Open Access Posted Content

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization.

TLDR
This article proposes a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion, enumerating candidate subword blocks and scoring them in a position-wise fashion using a block scoring network.
Abstract
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
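To make the GBST description above concrete, here is a minimal NumPy sketch of the idea under some stated assumptions: candidate subword blocks of a few sizes are formed by mean-pooling character (byte) embeddings, a block scoring network (reduced here to a single linear map) scores each candidate per position, and a softmax over the scores mixes the candidates into a latent subword representation for every position. The block sizes, pooling choice, and layer shapes are illustrative rather than the authors' exact implementation, which, per the paper, also downsamples the resulting sequence before the Transformer layers.

```python
# Minimal NumPy sketch of soft gradient-based subword tokenization (GBST):
# enumerate candidate subword blocks, score them position-wise, and mix them
# with a softmax. Block sizes, mean pooling, and the single linear scorer are
# illustrative assumptions, not the authors' exact implementation.
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 12, 16          # number of byte positions, embedding size
block_sizes = (1, 2, 3, 4)         # candidate subword block lengths (assumed)

char_embeddings = rng.normal(size=(seq_len, d_model))   # byte/char embeddings
scorer_w = rng.normal(size=(d_model,))                  # "block scoring network"
                                                        # (here: one linear map to a scalar)

def block_candidates(x, b):
    """Mean-pool non-overlapping blocks of size b, then broadcast each pooled
    vector back to every position it covers, giving one candidate per position."""
    n = x.shape[0]
    pad = (-n) % b
    xp = np.pad(x, ((0, pad), (0, 0)))
    pooled = xp.reshape(-1, b, x.shape[1]).mean(axis=1)         # (n_blocks, d)
    return np.repeat(pooled, b, axis=0)[:n]                     # (seq_len, d)

# Candidates: shape (num_block_sizes, seq_len, d_model)
candidates = np.stack([block_candidates(char_embeddings, b) for b in block_sizes])

# Position-wise scores for each block size, softmaxed over block sizes.
scores = candidates @ scorer_w                                   # (num_sizes, seq_len)
weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# Soft "tokenization": every position gets a mixture of its block candidates.
latent_subwords = (weights[..., None] * candidates).sum(axis=0)  # (seq_len, d_model)
print(latent_subwords.shape)                                     # (12, 16)
```

Because the mixing weights come from a softmax over learned scores, the whole module is differentiable and can be trained end-to-end with the rest of the model, which is the key contrast with separate, rigid subword tokenizers.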


Citations
Posted Content

On the Opportunities and Risks of Foundation Models.

Rishi Bommasani, +113 more
16 Aug 2021
TL;DR: The authors provide a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications.
Posted Content

Perceiver IO: A General Architecture for Structured Inputs & Outputs

TL;DR: Perceiver IO, proposed in this paper, learns to flexibly query the model's latent space to produce outputs of arbitrary size and semantics, and achieves state-of-the-art results on tasks with highly structured output spaces.
Posted Content

Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

TL;DR: In this paper, the authors present scaling insights from pre-training and fine-tuning Transformers, showing that model shape matters for downstream fine-tuning beyond model size alone, that scaling protocols operate differently at different compute regions, and that the widely adopted T5-Base and T5-Large sizes are Pareto-inefficient.
Posted Content

Evaluating Various Tokenizers for Arabic Text Classification.

TL;DR: In this paper, the authors introduce three new tokenization algorithms for Arabic, compare them to three baselines using unsupervised evaluations, and evaluate all six algorithms on three tasks: sentiment analysis, news classification, and poetry classification.
Posted Content

Demystifying Neural Language Models' Insensitivity to Word-Order

TL;DR: This paper investigated the sensitivity of NLP models to word-order perturbations and analyzed their effect on neural models' performance on language understanding tasks in the GLUE benchmark, finding that neural language models rely on local word order more than on the global ordering of tokens.
References
Proceedings Article

Attention is All you Need

TL;DR: This paper proposed a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
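As a pointer for readers unfamiliar with the mechanism this summary refers to, below is a minimal NumPy sketch of single-head scaled dot-product self-attention; the projection matrices and sizes are illustrative, and masking, multiple heads, and the feed-forward layers of the full architecture are omitted.

```python
# Minimal NumPy sketch of single-head scaled dot-product self-attention.
# Variable names and sizes are illustrative; no masking or multi-head logic.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8

x = rng.normal(size=(seq_len, d_model))        # input token representations
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

q, k, v = x @ w_q, x @ w_k, x @ w_v            # queries, keys, values
scores = q @ k.T / np.sqrt(d_model)            # pairwise compatibility, scaled
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
output = weights @ v                           # each position attends to all others
print(output.shape)                            # (6, 8)
```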
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT, as introduced in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
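As a small illustration of the negative sampling objective named in this summary, here is a minimal NumPy sketch of the skip-gram loss for one (center, context) pair with k negative samples; the vocabulary size, embedding size, and uniform negative sampler are assumptions for brevity (the paper draws negatives from a smoothed unigram distribution).

```python
# Minimal NumPy sketch of the skip-gram negative sampling loss, an alternative
# to the hierarchical softmax. Sizes and the uniform sampler are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, k_neg = 1000, 50, 5

in_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))   # "input" word vectors
out_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # "output" word vectors

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(center, context, negatives):
    """-log sigma(v_c . u_o) - sum_j log sigma(-v_c . u_neg_j)"""
    v_c = in_vecs[center]
    pos = np.log(sigmoid(out_vecs[context] @ v_c))
    neg = np.log(sigmoid(-out_vecs[negatives] @ v_c)).sum()
    return -(pos + neg)

# One (center, context) pair with k negatives drawn uniformly here.
negatives = rng.integers(0, vocab_size, size=k_neg)
print(negative_sampling_loss(center=3, context=7, negatives=negatives))
```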
Book

Foundations of Statistical Natural Language Processing

TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Proceedings Article

Deep contextualized word representations

TL;DR: This paper introduced a new type of deep contextualized word representation that models both complex characteristics of word use (e.g., syntax and semantics), and how these uses vary across linguistic contexts (i.e., to model polysemy).