Home
/
Authors
/
Iulia Turc

Author

Iulia Turc

Bio: Iulia Turc is an academic researcher from Google. The author has contributed to research in topics: Initialization & Computer science. The author has an hindex of 4, co-authored 7 publications receiving 360 citations.

Topics: Initialization, Computer science, Upsampling, Encoder, Language model ...read more

Papers

PDF

Open Access

More filters

Posted Content•

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

[...]

Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova¹•Institutions (1)

Google¹

25 Sep 2019-arXiv: Computation and Language

TL;DR: It is shown that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work.

...read moreread less

Abstract: Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to down-stream tasks, several model compression techniques on pre-trained language representations have been proposed (Sun et al., 2019; Sanh, 2019). However, surprisingly, the simple baseline of just pre-training and fine-tuning compact models has been overlooked. In this paper, we first show that pre-training remains important in the context of smaller architectures, and fine-tuning pre-trained compact models can be competitive to more elaborate methods proposed in concurrent work. Starting with pre-trained compact models, we then explore transferring task knowledge from large fine-tuned models through standard knowledge distillation. The resulting simple, yet effective and general algorithm, Pre-trained Distillation, brings further improvements. Through extensive experiments, we more generally explore the interaction between pre-training and distillation under two variables that have been under-studied: model size and properties of unlabeled task data. One surprising observation is that they have a compound effect even when sequentially applied on the same data. To accelerate future research, we will make our 24 pre-trained miniature BERT models publicly available.

...read moreread less

422 citations

Posted Content•

Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

[...]

Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

23 Aug 2019

TL;DR: It is observed that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks.

...read moreread less

Abstract: Recent developments in NLP have been accompanied by large, expensive models. Knowledge distillation is the standard method to realize these gains in applications with limited resources: a compact student is trained to recover the outputs of a powerful teacher. While most prior work investigates student architectures and transfer techniques, we focus on an often-neglected aspect---student initialization. We argue that a random starting point hinders students from fully leveraging the teacher expertise, even in the presence of a large transfer set. We observe that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks. We conduct experiments on 4 NLP tasks and 24 sizes of Transformer-based students; for sentiment classification on the Amazon Book Reviews dataset, pre-training boosts size reduction and TPU speed-up from 3.1x/1.25x to 31x/16x. Extensive ablation studies dissect the interaction between pre-training and distillation, revealing a compound effect even when they are applied on the same unlabeled dataset.

...read moreread less

159 citations

Posted Content•

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

[...]

Jonathan H. Clark¹, Dan Garrette¹, Iulia Turc¹, John Wieting¹•Institutions (1)

Google¹

11 Mar 2021-arXiv: Computation and Language

TL;DR: CANINE as discussed by the authors is a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.

...read moreread less

Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.

...read moreread less

36 citations

Journal Article•DOI•

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

[...]

Kenton Lee, Mandar S. Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova - Show less +6 more

07 Oct 2022-arXiv.org

TL;DR: This work presents Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be used on tasks containing visually-situated language, and shows that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.

...read moreread less

Abstract: Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.

...read moreread less

31 citations

Posted Content•

The MultiBERTs: BERT Reproductions for Robustness Analysis

[...]

Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick¹ - Show less +8 more•Institutions (1)

Google¹

30 Jun 2021-arXiv: Computation and Language

TL;DR: The authors introduce MultiBERTs, a set of 25 BERT-base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random initialization and data shuffling.

...read moreread less

Abstract: Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this question, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.

...read moreread less

8 citations

Cited by

PDF

Open Access

More filters

Posted Content•

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

[...]

Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

02 Oct 2019-arXiv: Computation and Language

TL;DR: This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performances on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.

...read moreread less

Abstract: As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

...read moreread less

3,877 citations

Proceedings Article•

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

[...]

Zhenzhong Lan¹, Mingda Chen², Sebastian Goodman¹, Kevin Gimpel³, Piyush Sharma¹, Radu Soricut¹ - Show less +2 more•Institutions (3)

Google¹, Toyota Technological Institute at Chicago², New York University³

30 Apr 2020

TL;DR: This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.

...read moreread less

Abstract: Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

...read moreread less

2,367 citations

Posted Content•

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

[...]

Zhenzhong Lan¹, Mingda Chen², Sebastian Goodman¹, Kevin Gimpel³, Piyush Sharma¹, Radu Soricut¹ - Show less +2 more•Institutions (3)

Google¹, Toyota Technological Institute at Chicago², New York University³

26 Sep 2019-arXiv: Computation and Language

TL;DR: The authors proposed a self-supervised loss that focuses on modeling inter-sentence coherence, and showed it consistently helps downstream tasks with multientence inputs, achieving state-of-the-art results on the GLUE, RACE, and \squad benchmarks.

...read moreread less

Abstract: Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and \squad benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at this https URL.

...read moreread less

2,247 citations

Journal Article•DOI•

Knowledge Distillation: A Survey

[...]

Jianping Gou¹, Jianping Gou², Baosheng Yu², Stephen J. Maybank³, Dacheng Tao² - Show less +1 more•Institutions (3)

Jiangsu University¹, University of Sydney², Birkbeck, University of London³

09 Jun 2020-arXiv: Learning

TL;DR: A comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications can be found in this paper.

...read moreread less

Abstract: In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapid increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.

...read moreread less

1,027 citations

Journal Article•DOI•

Pre-trained Models for Natural Language Processing: A Survey

[...]

Xipeng Qiu¹, Tianxiang Sun¹, Yige Xu¹, Yunfan Shao¹, Ning Dai¹, Xuanjing Huang¹ - Show less +2 more•Institutions (1)

Fudan University¹

18 Mar 2020-Science China-technological Sciences

TL;DR: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era as mentioned in this paper, and a comprehensive review of PTMs for NLP can be found in this survey.

...read moreread less

Abstract: Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

...read moreread less

755 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126

Collapse