Open Access · Posted Content
Language Models with Transformers
TL;DR
This paper explores effective Transformer architectures for language modeling, including adding LSTM layers to better capture sequential context while keeping computation efficient, and proposes Coordinate Architecture Search (CAS) to find an effective architecture through iterative refinement of the model.

Abstract
The Transformer architecture is superior to RNN-based models in computational efficiency. Recently, GPT and BERT have demonstrated the efficacy of Transformer models on various NLP tasks using language models pre-trained on large-scale corpora. Surprisingly, these Transformer architectures are suboptimal for language modeling itself: neither self-attention nor the positional encoding in the Transformer is able to efficiently incorporate the word-level sequential context crucial to language modeling.
In this paper, we explore effective Transformer architectures for language modeling, including adding LSTM layers to better capture sequential context while keeping computation efficient. We propose Coordinate Architecture Search (CAS) to find an effective architecture through iterative refinement of the model. Experimental results on PTB, WikiText-2, and WikiText-103 show that CAS achieves perplexities between 20.42 and 34.11 on all three datasets, i.e. an average improvement of 12.0 perplexity units over state-of-the-art LSTMs. The source code is publicly available.
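The iterative refinement behind CAS can be pictured as a greedy coordinate search over candidate architecture modifications: at each step, try each modification in isolation, keep the one that most lowers validation perplexity, and stop when nothing helps. A minimal, runnable sketch; the candidate operation names and the `evaluate` scoring function below are illustrative stand-ins, not the paper's exact search space or training procedure:

```python
# Hypothetical search space: candidate modifications to a base Transformer.
# The operation names are illustrative, not the paper's exact set.
CANDIDATE_OPS = ["add_lstm_layer", "fix_subset_of_weights", "remove_top_layer"]

def evaluate(arch):
    """Stand-in for fine-tuning plus validation perplexity (lower is better).
    A real implementation would train the modified model; here we score
    architectures with a toy function so the loop is runnable."""
    score = 50.0
    score -= 10.0 * arch.count("add_lstm_layer")   # toy: LSTM layers help
    score += 2.0 * arch.count("remove_top_layer")  # toy: pruning hurts
    return score

def coordinate_architecture_search(n_iters=3):
    """Greedy coordinate search: per iteration, try each candidate
    modification alone and keep the one that lowers perplexity most."""
    arch = []                      # start from the unmodified base model
    best_ppl = evaluate(arch)
    for _ in range(n_iters):
        trials = [(evaluate(arch + [op]), op) for op in CANDIDATE_OPS]
        ppl, op = min(trials)
        if ppl >= best_ppl:        # no modification improves: stop early
            break
        arch.append(op)
        best_ppl = ppl
    return arch, best_ppl
```

With the toy scorer, the loop repeatedly picks the LSTM-adding modification, mirroring the paper's finding that added sequential recurrence helps.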
Citations
Proceedings Article · DOI
Masked Language Model Scoring
TL;DR: RoBERTa reduces an end-to-end LibriSpeech model’s WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation.
Posted Content
Neural Machine Translation: A Review
TL;DR: This work traces back the origins of modern NMT architectures to word and sentence embeddings and earlier examples of the encoder-decoder network family and concludes with a survey of recent trends in the field.
Posted Content
Towards Debiasing Sentence Representations
Paul Pu Liang, Irene Mengze Li, Emily Zheng, Yao Chong Lim, Ruslan Salakhutdinov, Louis-Philippe Morency
TL;DR: It is shown that Sent-Debias is effective in removing biases, and at the same time, preserves performance on sentence-level downstream tasks such as sentiment analysis, linguistic acceptability, and natural language understanding.
Proceedings Article · DOI
Open World Compositional Zero-Shot Learning
TL;DR: The authors propose using the cosine similarity between visual features and compositional embeddings to mask the output space, boosting the performance of CZSL models in the open-world scenario.
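The masking idea in that TL;DR can be sketched in a few lines: score each candidate composition by cosine similarity to the visual feature, and mask out compositions judged infeasible. The composition embeddings, feasibility scores, and threshold below are toy assumptions, not the paper's learned values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical composition (state-object) embeddings; in the paper these
# come from learned word embeddings, here they are toy 2-d vectors.
compositions = {
    "wet dog":    [0.9, 0.1],
    "wet car":    [0.5, 0.5],
    "sliced dog": [0.1, 0.9],   # an implausible composition
}

def masked_scores(visual_feat, feasibility, threshold=0.3):
    """Score each composition by cosine similarity to the visual feature,
    masking (setting to -inf) compositions below the feasibility threshold."""
    return {
        c: cosine(visual_feat, e) if feasibility[c] >= threshold else float("-inf")
        for c, e in compositions.items()
    }
```

Masked compositions can never win the argmax, which is how the open-world output space is pruned.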
Posted Content
Weakly-Supervised Video Moment Retrieval via Semantic Completion Network
TL;DR: This paper proposes a novel weakly-supervised moment retrieval framework requiring only coarse video-level annotations for training, and devise a proposal generation module that aggregates the context information to generate and score all candidate proposals in one single pass.
References
Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
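The per-parameter update behind Adam can be written out directly. This is a plain-Python sketch of Algorithm 1 from the paper (exponential moving averages of the first and second moments, with bias correction), without the vectorization of a real implementation:

```python
def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a list of parameters (Kingma & Ba, Alg. 1)."""
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]        # 1st-moment EMA
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]    # 2nd-moment EMA
    m_hat = [mi / (1 - b1 ** t) for mi in m]                      # bias correction
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    theta = [p - lr * mh / (vh ** 0.5 + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v
```

Because the step is normalized by the second-moment estimate, early updates have magnitude close to the learning rate regardless of the gradient scale.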
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
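The core operation of that architecture, scaled dot-product attention, can be sketched without any framework. This is a single head with no batching or masking, a toy illustration rather than the paper's full multi-head setup:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V,
    where Q, K, V are row-major lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        # weighted average of the value vectors
        out.append([sum(wi * vi[j] for wi, vi in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

When a query aligns strongly with one key, the softmax weights concentrate there and the output approaches that key's value vector.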
Proceedings ArticleDOI
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
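BERT's pre-training objective corrupts the input with a masked-LM procedure that can be sketched as follows. The 15%/80%/10%/10% rates follow the paper; the token strings and vocabulary here are hypothetical:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masked-LM corruption: select ~15% of positions; of those,
    80% become [MASK], 10% become a random vocabulary token, 10% stay
    unchanged. Targets map selected positions to their original tokens."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets
```

Keeping 10% of selected tokens unchanged forces the model to maintain a useful representation of every input position, not only the masked ones.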
Proceedings Article · DOI
Neural Machine Translation of Rare Words with Subword Units
TL;DR: This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.