Proceedings ArticleDOI

Universal Sentence Encoder for English

TL;DR: Transfer learning using sentence-level embeddings is shown to outperform models without transfer learning and often those that use only word-level transfer.
Abstract: We present easy-to-use TensorFlow Hub sentence embedding models having good task transfer performance. Model variants allow for trade-offs between accuracy and compute resources. We report the relationship between model complexity, resources, and transfer performance. Comparisons are made with baselines without transfer learning and to baselines that incorporate word-level transfer. Transfer learning using sentence-level embeddings is shown to outperform models without transfer learning and often those that use only word-level transfer. We show good transfer task performance with minimal training data and obtain encouraging results on word embedding association tests (WEAT) of model bias.
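As a quick illustration of the "easy-to-use" claim, the sketch below loads a Universal Sentence Encoder module from TensorFlow Hub and embeds two sentences. The module handle and the 512-dimensional output are assumptions based on the published TF Hub models, not details quoted from this abstract.

    # Minimal sketch: load a Universal Sentence Encoder module from TensorFlow Hub
    # and embed a few sentences. The module URL is an assumed TF Hub handle.
    import tensorflow_hub as hub

    embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    sentences = ["The quick brown fox jumps over the lazy dog.",
                 "Sentence embeddings support transfer learning."]
    embeddings = embed(sentences)   # assumed output: a (2, 512) float32 tensor
    print(embeddings.shape)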


Citations
Posted Content
TL;DR: This article used a single BiLSTM encoder with a shared BPE vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts.
Abstract: We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages, and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder and the multilingual test set are available at this https URL
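The transfer recipe this abstract describes — train a classifier on English embeddings only, then apply it unchanged to other languages — can be sketched as follows. The `embed` function is a hypothetical stand-in for a multilingual encoder such as the released LASER model, and the sentences and labels are invented for illustration.

    # Illustrative sketch (not the authors' code) of zero-shot cross-lingual
    # classification: fit a classifier on English sentence embeddings only, then
    # apply it unchanged to embeddings of another language.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def embed(sentences):
        # hypothetical stand-in: a real multilingual encoder (e.g. LASER) would
        # map each sentence to a fixed-size vector in a shared space
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(sentences), 1024))

    en_train = ["A man is playing a guitar.", "The economy shrank last quarter."]
    labels = [0, 1]                                  # English-only training labels

    clf = LogisticRegression().fit(embed(en_train), labels)

    es_test = ["La economía se contrajo el último trimestre."]
    print(clf.predict(embed(es_test)))               # applied without modification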

512 citations

Journal ArticleDOI
TL;DR: A survey of data augmentation for text data can be found in this article, where the major motifs of Data Augmentation are summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form.
Abstract: Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
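As a concrete illustration of the augmentation frameworks the survey catalogs, here is a toy sketch of two common token-level text augmentations (random deletion and random swap, in the spirit of EDA); it is not code from the survey.

    # Toy sketch (not from the survey) of two common token-level augmentations:
    # random word deletion and random word swap.
    import random

    def random_deletion(tokens, p=0.1, seed=0):
        rng = random.Random(seed)
        kept = [t for t in tokens if rng.random() > p]
        return kept or tokens                 # never return an empty sentence

    def random_swap(tokens, n_swaps=1, seed=0):
        rng = random.Random(seed)
        tokens = tokens[:]
        for _ in range(n_swaps):
            i, j = rng.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    sentence = "data augmentation expands the effective training set".split()
    print(random_deletion(sentence))
    print(random_swap(sentence))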

487 citations

Posted Content
TL;DR: SimCSE, as discussed by the authors, is a contrastive learning framework for sentence embeddings that takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise.
Abstract: This paper presents SimCSE, a simple contrastive learning framework that greatly advances the state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We hypothesize that dropout acts as minimal data augmentation and removing it leads to a representation collapse. Then, we draw inspiration from the recent success of learning sentence embeddings from natural language inference (NLI) datasets and incorporate annotated pairs from NLI datasets into contrastive learning by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT-base achieve an average of 74.5% and 81.6% Spearman's correlation respectively, a 7.9 and 4.6 points improvement compared to previous best results. We also show that contrastive learning theoretically regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
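The unsupervised objective described above can be sketched in a few lines: the same batch is encoded twice with dropout active, and the two views of each sentence form a positive pair against in-batch negatives. The `encoder` below is a hypothetical dropout-bearing sentence encoder (the paper uses BERT-base), and the temperature value is illustrative.

    # Simplified sketch of the unsupervised SimCSE objective.
    import torch
    import torch.nn.functional as F

    def simcse_loss(encoder, batch, temperature=0.05):
        encoder.train()                      # keep dropout ON for both passes
        z1 = encoder(batch)                  # (batch_size, dim), first dropout view
        z2 = encoder(batch)                  # (batch_size, dim), second dropout view
        sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
        sim = sim / temperature              # (batch_size, batch_size) similarities
        labels = torch.arange(sim.size(0))   # diagonal entries are the positives
        return F.cross_entropy(sim, labels)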

345 citations

Posted Content
Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang
TL;DR: It is shown that introducing a pre-trained multilingual language model dramatically reduces the amount of parallel training data required to achieve good performance by 80%, and a model that achieves 83.7% bi-text retrieval accuracy over 112 languages on Tatoeba is released.
Abstract: We adapt multilingual BERT to produce language-agnostic sentence embeddings for 109 languages. While English sentence embeddings have been obtained by fine-tuning a pretrained BERT model, such models have not been applied to multilingual sentence embeddings. Our model combines masked language model (MLM) and translation language model (TLM) pretraining with a translation ranking task using bi-directional dual encoders. The resulting multilingual sentence embeddings improve average bi-text retrieval accuracy over 112 languages to 83.7%, well above the 65.5% achieved by the prior state-of-the-art on Tatoeba. Our sentence embeddings also establish new state-of-the-art results on BUCC and UN bi-text retrieval.
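The bi-text retrieval evaluation mentioned above reduces to nearest-neighbour search over sentence embeddings. A minimal sketch, with randomly generated stand-in embeddings and an assumed dimensionality, is:

    # Minimal sketch of bi-text retrieval over sentence embeddings: normalize and
    # pair each source sentence with its nearest target by cosine similarity.
    # The embeddings here are random placeholders, not real encoder outputs.
    import numpy as np

    def retrieve(src_emb, tgt_emb):
        src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        return (src @ tgt.T).argmax(axis=1)   # best target index per source

    rng = np.random.default_rng(0)
    src_emb = rng.normal(size=(5, 768))
    tgt_emb = rng.normal(size=(5, 768))
    print(retrieve(src_emb, tgt_emb))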

344 citations


Additional excerpts

  • ..., 2016), QuickThought (Logeswaran and Lee, 2018), USETrans (Cer et al., 2018), and m-USETrans (Yang et al....

    [...]

Journal ArticleDOI
TL;DR: An architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts using a single BiLSTM encoder with a shared byte-pair encoding vocabulary for all languages, coupled with an auxiliary decoder and trained on publicly available parallel corpora.
Abstract: We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach. We also introduce a new test set of aligned sentences in 112 languages and show that our sentence embeddings obtain strong results in multilingual similarity search even for low-resource languages. Our implementation, the pre-trained encoder and the multilingual test set are available at https://github.com/facebookresearch/LASER

316 citations


Cites background or methods from "Universal Sentence Encoder for English"

  • ...Multitask learning has been shown to be helpful to learn English sentence embeddings (Subramanian et al., 2018; Cer et al., 2018)....

    [...]

  • ...This was recently extended to multitask learning, combining different training objectives like that of skip-thought, NLI and machine translation (Cer et al., 2018; Subramanian et al., 2018)....

    [...]

References
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
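A didactic sketch of a single LSTM step follows, showing how multiplicative gates regulate access to the cell state (the "constant error carousel"). Note this is the now-standard variant with a forget gate, which the original 1997 formulation did not yet include; dimensions and weights are illustrative.

    # One LSTM step: gates control what is written to and read from the cell state.
    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(x, h, c, W, U, b):
        z = W @ x + U @ h + b                          # stacked gate pre-activations
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input/forget/output gates
        c_new = f * c + i * np.tanh(g)                 # gated update of the cell state
        h_new = o * np.tanh(c_new)                     # gated output
        return h_new, c_new

    d, x_dim = 3, 2
    rng = np.random.default_rng(0)
    W, U, b = rng.normal(size=(4 * d, x_dim)), rng.normal(size=(4 * d, d)), np.zeros(4 * d)
    h, c = lstm_step(rng.normal(size=x_dim), np.zeros(d), np.zeros(d), W, U, b)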

72,897 citations


"Universal Sentence Encoder for Engl..." refers methods in this paper

  • ...This section presents the data used for the transfer learning experiments and word embedding association tests (WEAT): (MR) Movie review sentiment on a five star scale (Pang and Lee, 2005); (CR) Sentiment of customer reviews (Hu and Liu, 2004); (SUBJ) Subjectivity of movie reviews and plot summaries (Pang and Lee, 2004)....

    [...]

  • ...The Skip-Thought like task replaces the LSTM (Hochreiter and Schmidhuber, 1997) in the original formulation with a transformer model....

    [...]

Proceedings Article
12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
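The core operation of the proposed architecture is scaled dot-product attention; a minimal NumPy sketch (omitting multi-head projections and masking) is:

    # Scaled dot-product attention over a set of queries, keys, and values.
    import numpy as np

    def attention(Q, K, V):
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
        return weights @ V                                # weighted sum of values

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)                       # (4, 8)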

52,856 citations

Proceedings Article
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
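The negative-sampling alternative to the hierarchical softmax mentioned above can be written as a per-pair loss: an observed (center, context) pair is scored against k sampled negative words. The vectors below are random placeholders; this is a sketch of the objective, not the authors' implementation.

    # Skip-gram negative-sampling loss for one (center, context) pair and k negatives.
    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def sgns_loss(center_vec, context_vec, negative_vecs):
        pos = np.log(sigmoid(center_vec @ context_vec))             # observed pair
        neg = np.sum(np.log(sigmoid(-negative_vecs @ center_vec)))  # k sampled negatives
        return -(pos + neg)                   # negative log-likelihood to minimize

    rng = np.random.default_rng(0)
    print(sgns_loss(rng.normal(size=100), rng.normal(size=100), rng.normal(size=(5, 100))))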

24,012 citations


"Universal Sentence Encoder for Engl..." refers methods in this paper

  • ...For word-level transfer, we incorporate word embeddings from a word2vec skip-gram model trained on a corpus of news data (Mikolov et al., 2013)....

    [...]

Proceedings ArticleDOI
02 Nov 2016
TL;DR: TensorFlow as mentioned in this paper is a machine learning system that operates at large scale and in heterogeneous environments, using dataflow graphs to represent computation, shared state, and the operations that mutate that state.
Abstract: TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that TensorFlow achieves for several real-world applications.
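A tiny example of the dataflow model described above: a Python function decorated with tf.function is traced into a graph of ops that TensorFlow can then place across devices. This is illustrative TF 2 usage, not code from the paper.

    # tf.function traces the Python body into a dataflow graph of ops.
    import tensorflow as tf

    @tf.function
    def affine(x, w, b):
        return tf.matmul(x, w) + b        # each op becomes a node in the graph

    x, w, b = tf.ones((2, 3)), tf.ones((3, 4)), tf.zeros((4,))
    print(affine(x, w, b).shape)          # (2, 4)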

10,913 citations

Proceedings ArticleDOI
Yoon Kim
25 Aug 2014
TL;DR: The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification, and a simple architecture modification is proposed to allow the use of both task-specific and static vectors.
Abstract: We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.
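A minimal Keras sketch of the sentence-classification CNN described above: convolutions over a sequence of word vectors, max-over-time pooling, then a classifier. Vocabulary size, filter settings, and the binary output are illustrative, not the paper's exact configuration.

    # Sentence-classification CNN over word-vector sequences (illustrative settings).
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=20000, output_dim=300),   # word vectors
        tf.keras.layers.Conv1D(filters=100, kernel_size=3, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),                         # max-over-time pooling
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(2, activation="softmax"),               # e.g. binary sentiment
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")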

9,776 citations


"Universal Sentence Encoder for Engl..." refers methods in this paper

  • ...However, we observe that our best performing model still makes use of transformer sentence-level transfer but combined with a CNN with no word-level transfer, UT+CNNrnd. Table 5 contrasts Caliskan et al. (2017)’s findings on bias within GloVe embeddings with results from the transformer and DAN encoders....

    [...]

  • ...On the STS Benchmark, we compare with InferSent and the state-of-the-art neural STS systems CNN (HCTI) (Shao, 2017) and gConv (Yang et al., 2018)....

    [...]

  • ...The pretrained word embeddings are included as input to two model types: a convolutional neural network model (CNN) (Kim, 2014); a DAN....

    [...]

  • ...Additional baseline CNN and DAN models are trained without using any pretrained word or sentence embeddings....

    [...]

  • ...Training with 1k labeled examples and the transformer sentence embeddings surpasses word-level transfer using the full training set, CNNw2v, and approaches the performance of the best model without transfer learning trained on the complete dataset, CNNrnd@67.3k....

    [...]