Top 8 papers published by Zihang Dai from Google in 2020

Posted Content•

[...]

Hieu Pham¹, Zihang Dai¹, Qizhe Xie¹, Minh-Thang Luong¹, Quoc V. Le¹ - Show less +1 more•Institutions (1)

23 Mar 2020-arXiv: Learning

TL;DR: This work presents Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state of the art [16].

...read moreread less

Abstract: We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher in Meta Pseudo Labels is constantly adapted by the feedback of the student's performance on the labeled dataset. As a result, the teacher generates better pseudo labels to teach the student. Our code will be available at this https URL.

...read moreread less

288 citations

Proceedings Article•

Unsupervised Data Augmentation for Consistency Training

[...]

Qizhe Xie¹, Zihang Dai², Eduard Hovy¹, Thang Luong², Quoc V. Le² - Show less +1 more•Institutions (2)

Carnegie Mellon University¹, Google²

01 Jan 2020

TL;DR: The authors proposed a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning.

...read moreread less

Abstract: Semi-supervised learning lately has shown much promise in improving deep learning models when labeled data is scarce. Common among recent approaches is the use of consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. In this work, we present a new perspective on how to effectively noise unlabeled examples and argue that the quality of noising, specifically those produced by advanced data augmentation methods, plays a crucial role in semi-supervised learning. By substituting simple noising operations with advanced data augmentation methods such as RandAugment and back-translation, our method brings substantial improvements across six language and three vision tasks under the same consistency training framework. On the IMDb text classification dataset, with only 20 labeled examples, our method achieves an error rate of 4.20, outperforming the state-of-the-art model trained on 25,000 labeled examples. On a standard semi-supervised learning benchmark, CIFAR-10, our method outperforms all previous approaches and achieves an error rate of 5.43 with only 250 examples. Our method also combines well with transfer learning, e.g., when finetuning from BERT, and yields improvements in high-data regime, such as ImageNet, whether when there is only 10% labeled data or when a full labeled set with 1.3M extra unlabeled examples is used. Code is available at this https URL.

...read moreread less

224 citations

Proceedings Article•

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

[...]

Zihang Dai¹, Guokun Lai¹, Yiming Yang¹, Quoc V. Le²•Institutions (2)

Carnegie Mellon University¹, Google²

05 Jun 2020

TL;DR: This work proposes Funnel-Transformer, a model which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost and outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.

...read moreread less

Abstract: With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at this https URL.

...read moreread less

123 citations

Proceedings Article•

A Mutual Information Maximization Perspective of Language Representation Learning

[...]

Lingpeng Kong¹, Cyprien de Masson d'Autume¹, Lei Yu¹, Wang Ling¹, Zihang Dai², Dani Yogatama¹ - Show less +2 more•Institutions (2)

Google¹, Carnegie Mellon University²

30 Apr 2020

TL;DR: This work shows state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence).

...read moreread less

Abstract: We show state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspirations from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).

...read moreread less

119 citations

Proceedings Article•

Wiki-40B: Multilingual Language Model Dataset

[...]

Mandy Guo¹, Zihang Dai², Denny Vrandecic¹, Rami Al-Rfou¹•Institutions (2)

Google¹, Carnegie Mellon University²

01 May 2020

TL;DR: A new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families with around 40 billion characters is proposed, and the task of multilingual causal language modeling is introduced.

...read moreread less

Abstract: We propose a new multilingual language model benchmark that is composed of 40+ languages spanning several scripts and linguistic families. With around 40 billion characters, we hope this new resource will accelerate the research of multilingual modeling. We train monolingual causal language models using a state-of-the-art model (Transformer-XL) establishing baselines for many languages. We also introduce the task of multilingual causal language modeling where we train our model on the combined text of 40+ languages from Wikipedia with different vocabulary sizes and evaluate on the languages individually. We released the cleaned-up text of 40+ Wikipedia language editions, the corresponding trained monolingual language models, and several multilingual language models with different fixed vocabulary sizes.

...read moreread less

47 citations

Posted Content•

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

[...]

Zihang Dai¹, Guokun Lai¹, Yiming Yang¹, Quoc V. Le²•Institutions (2)

Carnegie Mellon University¹, Google²

05 Jun 2020-arXiv: Learning

TL;DR: In this paper, the authors propose a truncated version of the standard Transformer, which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost.

...read moreread less

Abstract: With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With this intuition, we propose Funnel-Transformer which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer is able to recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at this https URL.

...read moreread less

10 citations

Posted Content•

Unsupervised Parallel Corpus Mining on Web Data.

[...]

Guokun Lai¹, Zihang Dai¹, Yiming Yang¹•Institutions (1)

Carnegie Mellon University¹

18 Sep 2020-arXiv: Computation and Language

TL;DR: A pipeline to mine the parallel corpus from the Internet in an unsupervised manner is presented and the machine translator trained with the data extracted by the pipeline achieves very close performance to the supervised results.

...read moreread less

Abstract: With a large amount of parallel data, neural machine translation systems are able to deliver human-level performance for sentence-level translation. However, it is costly to label a large amount of parallel data by humans. In contrast, there is a large-scale of parallel corpus created by humans on the Internet. The major difficulty to utilize them is how to filter them out from the noise website environments. Current parallel data mining methods all require labeled parallel data as the training source. In this paper, we present a pipeline to mine the parallel corpus from the Internet in an unsupervised manner. On the widely used WMT'14 English-French and WMT'16 English-German benchmarks, the machine translator trained with the data extracted by our pipeline achieves very close performance to the supervised results. On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results, 39.81 and 38.95 BLEU scores, even compared with supervised approaches.

...read moreread less

4 citations

Patent•

Training machine learning models using unsupervised data augmentation

[...]

Thang Luong, Le Quoc, Qizhe Xie, Zihang Dai

29 Oct 2020

Showing papers by "Zihang Dai published in 2020"