ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Citations
6,953 citations
Cites background, methods, or results from "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations"
...Some current work along these lines include distillation [Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2019], parameter sharing [Lan et al., 2019], and conditional computation [Shazeer et al., 2017]....
[...]
...For example, the variant of ALBERT [Lan et al., 2019] that achieved the previous state-of-the-art uses a model similar in size and architecture to our 3B variant (though it has dramatically fewer parameters due to clever parameter sharing)....
[...]
...Recent results suggest that this may hold true for transfer learning in NLP [Liu et al., 2019c; Radford et al., 2019; Yang et al., 2019; Lan et al., 2019], i.e. it has repeatedly been shown that scaling up produces improved performance....
[...]
...Concurrent work [Lan et al., 2019] also found that sharing parameters across Transformer blocks can be an effective means of lowering the total parameter count without sacrificing much performance....
[...]
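The cross-layer parameter sharing mentioned in the excerpt above can be sketched in a few lines: instead of stacking independently parameterized Transformer layers, a single layer's weights are reused at every depth. A minimal PyTorch-style sketch, assuming only standard library components (the class name SharedEncoder and the hyperparameter values are illustrative, not taken from either paper):

import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    # One Transformer layer whose weights are reused at every depth
    # (ALBERT-style cross-layer parameter sharing).
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads,
            dim_feedforward=4 * hidden_size, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # the same parameters are applied at every depth
        return x

shared = SharedEncoder()
out = shared(torch.randn(2, 16, 768))  # shape (2, 16, 768)
# The parameter count is that of a single layer, not num_layers separate layers.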
...This approach has recently been used to obtain state-of-the-art results in many of the most common NLP benchmarks [Devlin et al., 2018; Yang et al., 2019; Dong et al., 2019; Liu et al., 2019c; Lan et al., 2019]....
[...]
4,798 citations
Additional excerpts
...ALBERT (Lan et al., 2019) Electra (Clark et al....
[...]
References
30,558 citations
"ALBERT: A Lite BERT for Self-superv..." refers background in this paper
...One of the most significant changes in the last two years is the shift from pre-training word embeddings, whether standard (Mikolov et al., 2013; Pennington et al., 2014) or contextualized (McCann et al., 2017; Peters et al., 2018), to full-network pre-training followed by task-specific fine-tuning…...
[...]
29,480 citations
"ALBERT: A Lite BERT for Self-superv..." refers background or methods in this paper
...To keep the comparison as meaningful as possible, we follow the BERT (Devlin et al., 2019) setup in using the BOOKCORPUS (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2019) for pretraining baseline models....
[...]
...Following Devlin et al. (2019), we set the feed-forward/filter size to be 4H and the number of attention heads to be H/64....
[...]
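The convention quoted above fixes two hyperparameters as simple functions of the hidden size H. A small worked example, assuming only that convention (the helper name derived_sizes is illustrative):

def derived_sizes(hidden_size: int) -> dict:
    # Feed-forward/filter size = 4H; number of attention heads = H / 64,
    # i.e. every attention head has width 64.
    return {
        "ffn_size": 4 * hidden_size,
        "num_heads": hidden_size // 64,
        "head_dim": 64,
    }

print(derived_sizes(768))   # {'ffn_size': 3072, 'num_heads': 12, 'head_dim': 64}
print(derived_sizes(1024))  # {'ffn_size': 4096, 'num_heads': 16, 'head_dim': 64}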
...We adapt these hyperparameters from Liu et al. (2019), Devlin et al. (2019), and Yang et al. (2019)....
[...]
...The experiments done up to this point use only the Wikipedia and BOOKCORPUS datasets, as in (Devlin et al., 2019)....
[...]
...BERT (Devlin et al., 2019) uses a loss based on predicting whether the second segment in a pair has been swapped with a segment from another document....
[...]
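The sentence-pair loss described in that excerpt needs training examples in which the second segment is sometimes replaced by a segment from another document. A minimal sketch of how such pairs could be built, assuming documents are already split into segments (the function name and the 50/50 swap rate are illustrative assumptions, not details taken from the papers):

import random

def make_swap_pair(doc_segments, other_docs):
    # Pick a segment and either keep its true successor (label 0)
    # or swap in a segment from a different document (label 1).
    i = random.randrange(len(doc_segments) - 1)   # assumes >= 2 segments
    seg_a = doc_segments[i]
    if random.random() < 0.5:
        return seg_a, doc_segments[i + 1], 0      # consecutive segments
    other = random.choice(other_docs)
    return seg_a, random.choice(other), 1         # swapped in from another document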
24,012 citations
"ALBERT: A Lite BERT for Self-superv..." refers background in this paper
...Learning representations of natural language has been shown to be useful for a wide range of NLP tasks and has been widely adopted (Mikolov et al., 2013; Le & Mikolov, 2014; Dai & Le, 2015; Peters et al., 2018; Devlin et al., 2019; Radford et al., 2018; 2019)....
[...]
...An orthogonal line of research, which could provide additional representation power, includes hard example mining (Mikolov et al., 2013) and more efficient language modeling training (Yang et al., 2019)....
[...]
13,994 citations
"ALBERT: A Lite BERT for Self-superv..." refers background or methods in this paper
...Following Liu et al. (2019), we fine-tune for RTE, STS, and MRPC using an MNLI checkpoint....
[...]
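One way to realize that recipe with the Hugging Face transformers API is to load an MNLI-fine-tuned checkpoint and re-initialize the classification head for the smaller target task. A hedged sketch for RTE (the checkpoint path is a hypothetical placeholder, and the choice of two labels reflects RTE's entailment/not-entailment setup rather than anything stated in the excerpt):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

mnli_checkpoint = "path/to/model-finetuned-on-mnli"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(mnli_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    mnli_checkpoint,
    num_labels=2,                  # RTE is a 2-way task; MNLI has 3 labels
    ignore_mismatched_sizes=True,  # reinitialize the 3-way MNLI head
)
# ...then run a standard fine-tuning loop (e.g. transformers.Trainer) on RTE.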
...A segment is usually comprised of more than one natural sentence, which has been shown to benefit performance by Liu et al. (2019)....
[...]