Open Access Posted Content

Language Models with Transformers

TL;DR
This paper explores effective Transformer architectures for language modeling, including adding LSTM layers to better capture sequential context while keeping computation efficient, and proposes Coordinate Architecture Search (CAS) to find an effective architecture through iterative refinement of the model.
Abstract
The Transformer architecture is superior to RNN-based models in computational efficiency. Recently, GPT and BERT have demonstrated the efficacy of Transformer models on various NLP tasks using pre-trained language models on large-scale corpora. Surprisingly, these Transformer architectures are suboptimal for language modeling itself. Neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context crucial to language modeling. In this paper, we explore effective Transformer architectures for language modeling, including adding LSTM layers to better capture the sequential context while keeping the computation efficient. We propose Coordinate Architecture Search (CAS) to find an effective architecture through iterative refinement of the model. Experimental results on PTB, WikiText-2, and WikiText-103 show that CAS achieves perplexities between 20.42 and 34.11 on all problems, i.e., an average improvement of 12.0 perplexity units over state-of-the-art LSTMs. The source code is publicly available.
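The hybrid the abstract describes, Transformer layers with an LSTM layer reintroducing word-level sequential context before the output softmax, can be sketched as follows. This is a minimal PyTorch illustration rather than the paper's exact CAS-selected model: layer sizes are arbitrary, positional encodings are omitted, and the reuse of pre-trained GPT/BERT weights that CAS builds on is not shown.

```python
# Minimal sketch (not the paper's exact CAS model): Transformer encoder layers
# followed by an LSTM layer to capture word-level sequential context.
import torch
import torch.nn as nn

class TransformerLSTMLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)  # added sequential layer
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # causal mask so each position only attends to earlier positions
        L = tokens.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        h, _ = self.lstm(h)        # LSTM re-introduces sequential order
        return self.out(h)         # next-token logits

logits = TransformerLSTMLM(vocab_size=10000)(torch.randint(0, 10000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 10000])
```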


Citations
Proceedings Article

Masked Language Model Scoring

TL;DR: Rescoring with RoBERTa's masked-LM (pseudo-log-likelihood) scores reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation.
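The scoring idea behind this work is pseudo-log-likelihood: mask each token in turn and sum its log-probability under a masked LM. A minimal sketch, assuming the Hugging Face transformers package and the public roberta-base checkpoint:

```python
# Sketch of pseudo-log-likelihood scoring with a masked LM (assumes the
# `transformers` package and the public `roberta-base` checkpoint).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

def pll_score(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip <s> and </s>
        masked = ids.clone()
        masked[i] = tok.mask_token_id         # mask one token at a time
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total                              # higher = more fluent under the MLM

print(pll_score("The cat sat on the mat."))
```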
Posted Content

Neural Machine Translation: A Review

TL;DR: This work traces the origins of modern NMT architectures back to word and sentence embeddings and earlier examples of the encoder-decoder network family, and concludes with a survey of recent trends in the field.
Posted Content

Towards Debiasing Sentence Representations

TL;DR: It is shown that Sent-Debias is effective in removing biases while preserving performance on sentence-level downstream tasks such as sentiment analysis, linguistic acceptability, and natural language understanding.
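At its core, this line of work removes the component of a sentence embedding that lies in an estimated bias subspace. The sketch below shows only that projection step, with a random unit vector standing in for the bias subspace that Sent-Debias actually estimates from counterfactual sentence pairs:

```python
# Sketch of the projection step used by subspace-debiasing methods:
# remove the component of each sentence embedding that lies along an
# estimated bias direction (here a stand-in random unit vector).
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 768))   # stand-in sentence embeddings
bias_dir = rng.normal(size=768)
bias_dir /= np.linalg.norm(bias_dir)     # unit-length bias direction

debiased = embeddings - np.outer(embeddings @ bias_dir, bias_dir)
print(np.abs(debiased @ bias_dir).max())  # ~0: bias component removed
```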
Proceedings Article

Open World Compositional Zero-Shot Learning

TL;DR: The authors propose using the cosine similarity between visual features and compositional embeddings to mask the output space, boosting the performance of compositional zero-shot learning (CZSL) models in the open-world scenario.
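A toy sketch of the scoring mechanism described above: rank attribute-object compositions by cosine similarity to a visual feature, and mask compositions deemed infeasible out of the open-world output space. The shapes, threshold, and feasibility scores are illustrative assumptions, not values from the paper:

```python
# Toy sketch: score attribute-object compositions by cosine similarity with a
# visual feature, and mask out compositions judged infeasible in the open world.
import numpy as np

rng = np.random.default_rng(0)
visual = rng.normal(size=300)                 # image feature
compositions = rng.normal(size=(1000, 300))   # embeddings of all attr-obj pairs
feasibility = rng.uniform(size=1000)          # assumed feasibility scores

def cosine(a, b):
    return b @ a / (np.linalg.norm(b, axis=-1) * np.linalg.norm(a))

scores = cosine(visual, compositions)
scores[feasibility < 0.3] = -np.inf           # mask unlikely compositions
print(int(scores.argmax()))                   # predicted composition index
```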
Posted Content

Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

TL;DR: This paper proposes a novel weakly-supervised moment retrieval framework that requires only coarse video-level annotations for training, and devises a proposal generation module that aggregates context information to generate and score all candidate proposals in a single pass.
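A generic illustration of single-pass proposal generation and scoring (not the SCN architecture itself): enumerate candidate temporal segments, pool their frame features, and score them all against a query embedding in one matrix product. All names and shapes here are assumptions:

```python
# Generic sketch (not the exact SCN architecture): enumerate candidate temporal
# segments, aggregate their frame features, and score them all in one pass.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(32, 512))   # per-frame video features
query = rng.normal(size=512)          # sentence/query embedding

proposals = [(s, e) for s in range(32) for e in range(s + 4, 33, 4)]
pooled = np.stack([frames[s:e].mean(axis=0) for s, e in proposals])
scores = pooled @ query               # score every proposal at once
best = proposals[int(scores.argmax())]
print(best)                           # predicted (start, end) segment
```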
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
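The update rule is compact enough to state directly; below is a plain-numpy sketch using the paper's default hyperparameters (beta1 = 0.9, beta2 = 0.999, eps = 1e-8), applied to a toy quadratic:

```python
# Adam update in plain numpy, with the paper's default hyperparameters.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # biased first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# toy example: minimize f(x) = x^2, whose gradient is 2x
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.1)
print(theta)  # approaches 0
```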
Proceedings Article

Attention is All you Need

TL;DR: This paper proposes the Transformer, a simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
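The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal numpy sketch of that single operation (multi-head projections, masking, and positional encodings are omitted):

```python
# Scaled dot-product attention, the core operation of the Transformer:
# softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                     # weighted sum of values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 64))       # self-attention over 5 tokens
print(attention(Q, K, V).shape)            # (5, 64)
```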
Proceedings Article

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
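A sketch of the "one additional output layer" recipe: a linear classifier over the pre-trained encoder's [CLS] representation. It assumes the Hugging Face transformers package and the public bert-base-uncased checkpoint; the two-class head is an arbitrary example:

```python
# Sketch of "one additional output layer" fine-tuning: a linear classifier on
# top of a pre-trained BERT's [CLS] representation (assumes `transformers`).
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = nn.Linear(encoder.config.hidden_size, 2)   # e.g. binary classification

batch = tok(["a great movie", "a dull movie"], return_tensors="pt", padding=True)
cls = encoder(**batch).last_hidden_state[:, 0]    # [CLS] token representation
logits = head(cls)                                # fine-tune encoder + head end-to-end
print(logits.shape)                               # torch.Size([2, 2])
```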
Proceedings Article

Neural Machine Translation of Rare Words with Subword Units

TL;DR: This paper introduces a simpler and more effective approach that makes the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline on the WMT '15 English-German and English-Russian translation tasks by 1.3 BLEU.
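The underlying subword-learning step is byte-pair encoding: repeatedly merge the most frequent pair of adjacent symbols, so that rare words decompose into known subword units. A minimal pure-Python sketch on a toy vocabulary:

```python
# Minimal byte-pair encoding (BPE) learner on a toy vocabulary: repeatedly merge
# the most frequent adjacent symbol pair, so rare words decompose into subwords.
import collections
import re

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def get_stats(vocab):
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[a, b] += freq
    return pairs

def merge_vocab(pair, vocab):
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for _ in range(10):          # learn 10 merge operations
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_vocab(best, vocab)
    print(best)              # e.g. ('e', 's'), ('es', 't'), ...
```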
Trending Questions (1)
What are transformer models?

The paper does not provide a direct definition of transformer models.