Posted Content

Uncertainty-aware Self-training for Text Classification with Few Labels.

TL;DR: This work proposes an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network, leveraging recent advances in Bayesian deep learning, and introduces acquisition functions that use Monte Carlo (MC) Dropout to select instances from the unlabeled pool.
Abstract: The recent success of large-scale pre-trained language models crucially hinges on fine-tuning them on large amounts of labeled data for the downstream task, which are typically expensive to acquire. In this work, we study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck by making use of large-scale unlabeled data for the target task. The standard self-training mechanism randomly samples instances from the unlabeled pool to pseudo-label and augment labeled data. In this work, we propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network, leveraging recent advances in Bayesian deep learning. Specifically, we propose (i) acquisition functions to select instances from the unlabeled pool leveraging Monte Carlo (MC) Dropout, and (ii) a learning mechanism leveraging model confidence for self-training. As an application, we focus on text classification on five benchmark datasets. We show that our methods, leveraging only 20-30 labeled samples per class for each task for training and validation, can perform within 3% of fully supervised pre-trained language models fine-tuned on thousands of labeled instances, with an aggregate accuracy of 91% and improving by up to 12% over baselines.
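As a rough illustration of the mechanism described above, the sketch below selects pseudo-labeled instances using Monte Carlo Dropout: several stochastic forward passes are run with dropout kept active, and a BALD-style disagreement score ranks unlabeled examples. The model interface, sample counts, and the "pick the k most certain examples" rule are illustrative assumptions, not the paper's exact UST implementation, which also explores other acquisition functions and weights the student loss by model confidence.

    # Illustrative sketch (not the authors' code): MC Dropout-based selection
    # of pseudo-labeled instances for a self-training round. Assumes `model`
    # returns classification logits and contains dropout layers.
    import torch
    import torch.nn.functional as F

    def mc_dropout_predict(model, inputs, n_samples=10):
        model.train()  # keep dropout active at inference time
        with torch.no_grad():
            probs = torch.stack(
                [F.softmax(model(inputs), dim=-1) for _ in range(n_samples)]
            )                                    # (n_samples, batch, n_classes)
        mean_probs = probs.mean(dim=0)
        # BALD-style score: entropy of the mean minus mean of the entropies
        entropy_of_mean = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(-1)
        mean_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)
        return mean_probs, entropy_of_mean - mean_entropy

    def select_pseudo_labels(model, unlabeled_inputs, k=32):
        mean_probs, uncertainty = mc_dropout_predict(model, unlabeled_inputs)
        pseudo_labels = mean_probs.argmax(dim=-1)
        idx = torch.argsort(uncertainty)[:k]     # least uncertain examples first
        return unlabeled_inputs[idx], pseudo_labels[idx]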
Citations
Proceedings ArticleDOI
01 Aug 2021
TL;DR: By jointly training the BERT and GCN modules within BertGCN, the proposed model leverages the advantages of both worlds: large-scale pretraining, which exploits the massive amount of raw data, and transductive learning.
Abstract: In this work, we propose BertGCN, a model that combines large-scale pretraining and transductive learning for text classification. BertGCN constructs a heterogeneous graph over the dataset and represents documents as nodes using BERT representations. By jointly training the BERT and GCN modules within BertGCN, the proposed model leverages the advantages of both worlds: large-scale pretraining, which takes advantage of the massive amount of raw data, and transductive learning, which jointly learns representations for both training data and unlabeled test data by propagating label influence through graph convolution. Experiments show that BertGCN achieves SOTA performance on a wide range of text classification datasets.
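The core idea — BERT [CLS] vectors as document-node features, propagated through a GCN, with the final prediction interpolating the two branches — can be sketched roughly as below. The two-layer GCN, the interpolation weight, and the document-only adjacency matrix are simplifying assumptions; BertGCN's actual graph also contains word nodes built from TF-IDF and PMI statistics.

    # Rough sketch, not the released BertGCN implementation. Assumes a
    # Hugging Face-style encoder and a pre-normalized adjacency matrix `adj`.
    import torch.nn as nn
    import torch.nn.functional as F

    class BertGCNSketch(nn.Module):
        def __init__(self, bert_encoder, hidden_dim, n_classes, lam=0.7):
            super().__init__()
            self.bert = bert_encoder
            self.bert_clf = nn.Linear(hidden_dim, n_classes)
            self.gcn_w1 = nn.Linear(hidden_dim, hidden_dim)
            self.gcn_w2 = nn.Linear(hidden_dim, n_classes)
            self.lam = lam                        # weight on the GCN branch

        def forward(self, input_ids, attention_mask, adj):
            # Document nodes initialized with BERT [CLS] representations.
            h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
            bert_logits = self.bert_clf(h)
            g = F.relu(adj @ self.gcn_w1(h))      # first graph convolution
            gcn_logits = adj @ self.gcn_w2(g)     # second graph convolution
            # Interpolate the two branches for the final prediction.
            return self.lam * F.softmax(gcn_logits, -1) + (1 - self.lam) * F.softmax(bert_logits, -1)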

102 citations


Additional excerpts

  • ...Different neural model architectures (Kim, 2014; Zhou et al., 2015; Radford et al., 2018; Chai et al., 2020) have demonstrated their effectiveness against traditional statistical feature based methods (Wallach, 2006)....

Proceedings ArticleDOI
01 Jun 2021
TL;DR: This work develops a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision, underpinned by contrastive regularization and confidence-based reweighting, which gradually improves model fitting while effectively suppressing error propagation.
Abstract: Fine-tuned pre-trained language models (LMs) have achieved enormous success in many natural language processing (NLP) tasks, but they still require excessive labeled data in the fine-tuning stage. We study the problem of fine-tuning pre-trained LMs using only weak supervision, without any labeled data. This problem is challenging because the high capacity of LMs makes them prone to overfitting the noisy labels generated by weak supervision. To address this problem, we develop a contrastive self-training framework, COSINE, to enable fine-tuning LMs with weak supervision. Underpinned by contrastive regularization and confidence-based reweighting, our framework gradually improves model fitting while effectively suppressing error propagation. Experiments on sequence, token, and sentence pair classification tasks show that our model outperforms the strongest baseline by large margins and achieves competitive performance with fully-supervised fine-tuning methods. Our implementation is available on https://github.com/yueyu1030/COSINE.
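A hedged sketch of the confidence-based reweighting component follows: pseudo-labels below a confidence threshold are discarded and the remainder are weighted by confidence when computing the student loss. The threshold and weighting function are illustrative choices, and COSINE's contrastive regularization term is omitted entirely here.

    # Illustration only; not COSINE's exact formulation.
    import torch.nn.functional as F

    def confidence_weighted_loss(student_logits, teacher_probs, threshold=0.8):
        confidence, pseudo_labels = teacher_probs.max(dim=-1)
        weights = (confidence >= threshold).float() * confidence   # drop low-confidence labels
        per_example = F.cross_entropy(student_logits, pseudo_labels, reduction="none")
        return (weights * per_example).sum() / weights.sum().clamp_min(1e-8)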

83 citations


Cites background from "Uncertainty-aware Self-training for..."

  • ...UST (Mukherjee and Awadallah, 2020) is state-of-the-art for self-training with limited labels....

  • ...We implement Self-ensemble, FreeLB, Mixup and UST based on their original paper....

  • ...COSINE and most of the baselines (RoBERTa-WL / RoBERTa-CL / SMART / WeSTClass / Self-ensemble / FreeLB / Mixup / UST) are built on the RoBERTa-base model with about 125M parameters....

  • ...Compared with advanced fine-tuning and self-training methods (e.g., SMART and UST), our model consistently outperforms the baselines....

  • ...We highlight that although UST, the state-of-the-art method to date, achieves strong performance under few-shot settings, their approach cannot estimate confidence well with noisy labels, and this yields inferior performance....

Journal ArticleDOI
Kai He, Rui Mao, Tieliang Gong, Chen Li, Erik Cambria 
TL;DR: A meta-based self-training method with a meta-weighter (MSM) is proposed, based on the view that a generalizable model can be achieved through appropriate symbolic representation selection and effective learning control (regulation) in a neural system.
Abstract: Aspect-based sentiment analysis (ABSA) aims to identify fine-grained aspects, opinions, and sentiment polarities. Recent ABSA research focuses on utilizing multi-task learning (MTL) to achieve lower computational costs and better performance. However, MTL-based ABSA has certain limits. For example, unbalanced labels and differing sub-task learning difficulties may introduce biases in which some labels and sub-tasks overfit while others underfit. To address these issues, inspired by neuro-symbolic learning systems, we propose a meta-based self-training method with a meta-weighter (MSM). We believe that a generalizable model can be achieved by appropriate symbolic representation selection (in-domain knowledge) and effective learning control (regulation) in a neural system. Thus, MSM trains a teacher model to generate in-domain knowledge (e.g., unlabeled data selection and pseudo-label generation), where the generated pseudo-labels are used by a student model for supervised learning. Then, the meta-weighter of MSM is jointly trained with the student model to provide each instance with sub-task-specific weights that coordinate convergence rates, balance class labels, and alleviate the noise introduced by self-training. Experiments indicate that MSM can use 50% of the labeled data to achieve results comparable to state-of-the-art ABSA models, and outperforms them with all labeled data.
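The weighting idea can be pictured with a small sketch of a per-instance, sub-task-specific weighted loss; the bi-level (meta) optimization that actually trains MSM's meta-weighter, and the teacher's data selection, are not shown, and all names here are placeholders.

    # Sketch of instance- and sub-task-level loss weighting (illustrative only).
    import torch.nn.functional as F

    def weighted_multitask_loss(subtask_logits, pseudo_labels, instance_weights):
        # All three arguments are dicts keyed by sub-task name; instance_weights
        # holds one weight per training instance for that sub-task.
        total = 0.0
        for name, logits in subtask_logits.items():
            per_example = F.cross_entropy(logits, pseudo_labels[name], reduction="none")
            total = total + (instance_weights[name] * per_example).mean()
        return total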

25 citations

Posted Content
TL;DR: A novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and it is shown that human users significantly prefer these explanations over those produced by state-of-the-art techniques.
Abstract: Using personalized explanations to support recommendations has been shown to increase trust and perceived quality. However, to actually obtain better recommendations, there needs to be a means for users to modify the recommendation criteria by interacting with the explanation. We present a novel technique using aspect markers that learns to generate personalized explanations of recommendations from review texts, and we show that human users significantly prefer these explanations over those produced by state-of-the-art techniques. Our work's most important innovation is that it allows users to react to a recommendation by critiquing the textual explanation: removing (symmetrically adding) certain aspects they dislike or that are no longer relevant (symmetrically that are of interest). The system updates its user model and the resulting recommendations according to the critique. This is based on a novel unsupervised critiquing method for single- and multi-step critiquing with textual explanations. Experiments on two real-world datasets show that our system is the first to achieve good performance in adapting to the preferences expressed in multi-step critiquing.

17 citations

Posted Content
TL;DR: This article proposes three orthogonal schemes to improve model generalization in few-shot settings: (1) meta-learning to construct prototypes for different entity types, (2) supervised pre-training on noisy web data to extract entity-related generic representations, and (3) self-training to leverage unlabeled in-domain data.
Abstract: This paper presents a comprehensive study of how to efficiently build named entity recognition (NER) systems when only a small amount of in-domain labeled data is available. Based upon recent Transformer-based self-supervised pre-trained language models (PLMs), we investigate three orthogonal schemes to improve model generalization ability in few-shot settings: (1) meta-learning to construct prototypes for different entity types, (2) supervised pre-training on noisy web data to extract entity-related generic representations, and (3) self-training to leverage unlabeled in-domain data. Different combinations of these schemes are also considered. We perform extensive empirical comparisons on 10 public NER datasets with various proportions of labeled data, suggesting useful insights for future research. Our experiments show that (i) in the few-shot learning setting, the proposed NER schemes significantly improve on or outperform the commonly used baseline, a PLM-based linear classifier fine-tuned on domain labels; and (ii) we achieve new state-of-the-art results in both few-shot and training-free settings compared with existing methods. We will release our code and pre-trained models for reproducible research.
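Scheme (1) can be illustrated with a minimal prototype sketch: each entity type is represented by the mean embedding of its labeled support tokens, and query tokens are assigned to the nearest prototype. The distance metric and the flat per-type averaging are assumptions for illustration, not the paper's exact protocol.

    # Illustrative prototype construction and nearest-prototype labeling.
    import torch

    def build_prototypes(token_embeddings, token_labels, n_types):
        # token_embeddings: (n_tokens, dim); token_labels: (n_tokens,) integer type ids
        return torch.stack([
            token_embeddings[token_labels == t].mean(dim=0) for t in range(n_types)
        ])

    def nearest_prototype(query_embeddings, prototypes):
        dists = torch.cdist(query_embeddings, prototypes)    # Euclidean distances
        return dists.argmin(dim=-1)                          # predicted type per token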

15 citations

References
Proceedings Article
01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Abstract: We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
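The update rule the abstract refers to can be written compactly; the sketch below follows the algorithm with its default hyperparameters (the placement of epsilon inside versus outside the square root varies slightly across common formulations).

    # One Adam step on a parameter vector, following Kingma and Ba (2015).
    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad              # biased first-moment estimate
        v = b2 * v + (1 - b2) * grad ** 2         # biased second-moment estimate
        m_hat = m / (1 - b1 ** t)                 # bias-corrected moments (t starts at 1)
        v_hat = v / (1 - b2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v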

111,197 citations


"Uncertainty-aware Self-training for..." refers methods in this paper

  • ...We use Adam [Kingma and Ba, 2015] as the optimizer with early stopping and use the best model found so far from the validation loss for all the models....

Journal ArticleDOI
TL;DR: This final installment of the paper considers the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now.
Abstract: In this final installment of the paper we consider the case where the signals or the messages or both are continuously variable, in contrast with the discrete nature assumed until now. To a considerable extent the continuous case can be obtained through a limiting process from the discrete case by dividing the continuum of messages and signals into a large but finite number of small regions and calculating the various parameters involved on a discrete basis. As the size of the regions is decreased these parameters in general approach as limits the proper values for the continuous case. There are, however, a few new effects that appear and also a general change of emphasis in the direction of specialization of the general results to particular cases.

65,425 citations

Journal Article
TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
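The mechanism in the abstract reduces to a few lines; the sketch below uses the "inverted dropout" convention common in modern implementations, where activations are rescaled during training so the unthinned network can be used unchanged at test time (the paper instead describes scaling the weights at test time, which is equivalent in expectation).

    # Minimal inverted-dropout forward pass (illustrative).
    import numpy as np

    def dropout_forward(x, p_drop=0.5, training=True, rng=None):
        rng = rng or np.random.default_rng()
        if not training or p_drop == 0.0:
            return x                               # test time: full network, no masking
        mask = rng.random(x.shape) >= p_drop       # keep each unit with prob 1 - p_drop
        return x * mask / (1.0 - p_drop)           # rescale to preserve expected activation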

33,597 citations

Proceedings ArticleDOI
11 Oct 2018
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
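"Fine-tuned with just one additional output layer" amounts, in practice, to a linear classifier on the [CLS] representation; the sketch below uses the Hugging Face transformers API for convenience and is not the original BERT codebase.

    # Hedged sketch of BERT fine-tuning for classification.
    import torch.nn as nn
    from transformers import AutoModel

    class BertClassifier(nn.Module):
        def __init__(self, n_classes, model_name="bert-base-uncased"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(model_name)
            self.classifier = nn.Linear(self.encoder.config.hidden_size, n_classes)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            return self.classifier(out.last_hidden_state[:, 0])   # [CLS] token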

24,672 citations

Posted Content
TL;DR: It is found that BERT was significantly undertrained, and can match or exceed the performance of every model published after it, and the best model achieves state-of-the-art results on GLUE, RACE and SQuAD.
Abstract: Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

13,994 citations