
Semi-Supervised Learning Literature Survey

01 Jan 2005
About: The article was published on 2005-01-01 and is currently open access. It has received 4,189 citations to date. The article focuses on the topics: Literature survey & Semi-supervised learning.


Citations
Journal ArticleDOI
TL;DR: The relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift are discussed.
Abstract: A major assumption in many machine learning and data mining algorithms is that the training and future data must be in the same feature space and have the same distribution. However, in many real-world applications, this assumption may not hold. For example, we sometimes have a classification task in one domain of interest, but we only have sufficient training data in another domain of interest, where the latter data may be in a different feature space or follow a different data distribution. In such cases, knowledge transfer, if done successfully, would greatly improve the performance of learning by avoiding expensive data-labeling efforts. In recent years, transfer learning has emerged as a new learning framework to address this problem. This survey focuses on categorizing and reviewing the current progress on transfer learning for classification, regression, and clustering problems. In this survey, we discuss the relationship between transfer learning and other related machine learning techniques such as domain adaptation, multitask learning and sample selection bias, as well as covariate shift. We also explore some potential future issues in transfer learning research.
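The covariate-shift setting named at the end of this abstract is commonly handled by importance weighting: source (training) examples are reweighted so that the reweighted source distribution resembles the target distribution. Below is a minimal sketch of one standard recipe, estimating the density ratio with a logistic-regression domain discriminator; it assumes scikit-learn and NumPy are available, and the arrays X_src, X_tgt, y_src in the usage comment are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_source, X_target):
    """Estimate p_target(x) / p_source(x) with a domain discriminator.

    A classifier trained to tell target from source gives p(target | x),
    and the density ratio is recovered (up to the class prior) as
    p(target | x) / p(source | x).
    """
    X = np.vstack([X_source, X_target])
    d = np.r_[np.zeros(len(X_source)), np.ones(len(X_target))]  # 0 = source, 1 = target
    disc = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = disc.predict_proba(X_source)[:, 1]
    return p_target / np.clip(1.0 - p_target, 1e-6, None)

# Usage sketch (hypothetical arrays): fit the task classifier on reweighted source data.
# w = covariate_shift_weights(X_src, X_tgt)
# clf = LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=w)
```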

18,616 citations


Cites background from "Semi-Supervised Learning Literature..."

  • ...However, many machine learning methods work well only under a common assumption: the training and test data are drawn from the same feature space and the same distribution....


Journal ArticleDOI
TL;DR: A critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario is provided.
Abstract: With the continuous expansion of data availability in many large-scale, complex, and networked systems, such as surveillance, security, Internet, and finance, it becomes critical to advance the fundamental understanding of knowledge discovery and analysis from raw data to support decision-making processes. Although existing knowledge discovery and data engineering techniques have shown great success in many real-world applications, the problem of learning from imbalanced data (the imbalanced learning problem) is a relatively new challenge that has attracted growing attention from both academia and industry. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation. In this paper, we provide a comprehensive review of the development of research in learning from imbalanced data. Our focus is to provide a critical review of the nature of the problem, the state-of-the-art technologies, and the current assessment metrics used to evaluate learning performance under the imbalanced learning scenario. Furthermore, in order to stimulate future research in this field, we also highlight the major opportunities and challenges, as well as potential important research directions for learning from imbalanced data.
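As a concrete illustration of the simplest countermeasures to class-distribution skew, the sketch below shows random oversampling of minority classes and, as an alternative, cost-sensitive class weighting. It assumes scikit-learn and NumPy; the oversampling helper is an illustrative baseline, not a method proposed in the review.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def random_oversample(X, y, rng=None):
    """Duplicate minority-class examples until every class matches the majority size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        idx.append(members)
        if n < n_max:  # resample with replacement to close the gap
            idx.append(rng.choice(members, size=n_max - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# Alternative: cost-sensitive learning via class weights instead of resampling.
# clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```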

6,320 citations


Cites background from "Semi-Supervised Learning Literature..."

  • ...The key idea of semisupervised learning is to exploit the unlabeled examples by using the labeled examples to modify, refine, or reprioritize the hypothesis obtained from the labeled data alone [135]....


01 Jan 2009
TL;DR: This report provides a general introduction to active learning and a survey of the literature, including a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date.
Abstract: The key idea behind active learning is that a machine learning algorithm can achieve greater accuracy with fewer training labels if it is allowed to choose the data from which it learns. An active learner may pose queries, usually in the form of unlabeled data instances to be labeled by an oracle (e.g., a human annotator). Active learning is well-motivated in many modern machine learning problems, where unlabeled data may be abundant or easily obtained, but labels are difficult, time-consuming, or expensive to obtain. This report provides a general introduction to active learning and a survey of the literature. This includes a discussion of the scenarios in which queries can be formulated, and an overview of the query strategy frameworks proposed in the literature to date. An analysis of the empirical and theoretical evidence for successful active learning, a summary of problem setting variants and practical issues, and a discussion of related topics in machine learning research are also presented.
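A minimal sketch of the pool-based, uncertainty-sampling scenario described above, assuming scikit-learn and NumPy; oracle_label stands in for the human annotator and is hypothetical, as are the model choice and query budget.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_labeled, y_labeled, X_pool, n_queries=10):
    """Pool-based active learning: repeatedly query the least-confident pool instance."""
    pool_idx = np.arange(len(X_pool))
    queried = []
    for _ in range(n_queries):
        clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
        proba = clf.predict_proba(X_pool[pool_idx])
        confidence = proba.max(axis=1)       # least confident = most informative
        pick = pool_idx[np.argmin(confidence)]
        label = oracle_label(pick)           # hypothetical oracle, e.g. a human annotator
        X_labeled = np.vstack([X_labeled, X_pool[pick:pick + 1]])
        y_labeled = np.append(y_labeled, label)
        pool_idx = pool_idx[pool_idx != pick]
        queried.append(pick)
    return clf, queried
```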

5,227 citations


Cites background from "Semi-Supervised Learning Literature..."

  • ...Zhu (2005a) reports that annotation at the word level can take ten times longer than the actual audio (e.g., one minute of speech takes ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly seven hours)....


  • ...Active learning and semi-supervised learning (for a good introduction, see Zhu, 2005b) both traffic in making the most out of unlabeled data....


Posted Content
TL;DR: It is shown that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.
Abstract: The ever-increasing size of modern data sets combined with the difficulty of obtaining label information has made semi-supervised learning one of the problems of significant practical importance in modern data analysis. We revisit the approach to semi-supervised learning with generative models and develop new models that allow for effective generalisation from small labelled data sets to large unlabelled ones. Generative approaches have thus far been either inflexible, inefficient or non-scalable. We show that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.

2,194 citations


Cites background from "Semi-Supervised Learning Literature..."

  • ...Existing generative approaches based on models such as Gaussian mixture or hidden Markov models (Zhu, 2006) have not been very successful due to the need for a large number of mixture components or states to perform well....


  • ...Existing generative approaches based on models such as Gaussian mixture or hidden Markov models (Zhu, 2006), have not been very successful due to the limited capacity and the need for many states to perform well....

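The excerpts above refer to the classical generative baseline: a mixture model fit by EM, where labeled points keep fixed class responsibilities and unlabeled points receive soft ones. Below is a minimal NumPy sketch of that baseline with one Gaussian component per class; the initialization and regularization choices are illustrative and not taken from any of the cited papers.

```python
import numpy as np

def semi_supervised_gmm(X_l, y_l, X_u, n_classes, n_iter=50, reg=1e-6):
    """EM for a Gaussian mixture with one component per class.

    Labeled points (integer labels 0..n_classes-1) keep fixed one-hot
    responsibilities; unlabeled points get soft responsibilities.
    """
    X = np.vstack([X_l, X_u])
    n, d = X.shape
    R = np.zeros((n, n_classes))                       # responsibilities
    R[np.arange(len(y_l)), y_l] = 1.0                  # clamp labeled rows

    # Initialize parameters from the labeled data alone (illustrative choice).
    pi = np.full(n_classes, 1.0 / n_classes)
    mu = np.array([X_l[y_l == k].mean(axis=0) for k in range(n_classes)])
    cov = np.array([np.cov(X_l[y_l == k], rowvar=False) + reg * np.eye(d)
                    for k in range(n_classes)])

    def log_gauss(Xs, m, C):
        diff = Xs - m
        _, logdet = np.linalg.slogdet(C)
        sol = np.linalg.solve(C, diff.T).T
        return -0.5 * (d * np.log(2 * np.pi) + logdet + np.sum(diff * sol, axis=1))

    for _ in range(n_iter):
        # E-step: soft responsibilities for the unlabeled rows only.
        log_r = np.stack([np.log(pi[k]) + log_gauss(X_u, mu[k], cov[k])
                          for k in range(n_classes)], axis=1)
        log_r -= log_r.max(axis=1, keepdims=True)
        r_u = np.exp(log_r)
        r_u /= r_u.sum(axis=1, keepdims=True)
        R[len(y_l):] = r_u
        # M-step: update priors, means, and covariances from all rows.
        Nk = R.sum(axis=0)
        pi = Nk / n
        mu = (R.T @ X) / Nk[:, None]
        for k in range(n_classes):
            diff = X - mu[k]
            cov[k] = (R[:, k][:, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    return pi, mu, cov
```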

Book
29 Jun 2009
TL;DR: This introductory book presents some popular semi-supervised learning models, including self-training, mixture models, co-training and multiview learning, graph-based methods, and semi-supervised support vector machines, and discusses their basic mathematical formulation.
Abstract: Semi-supervised learning is a learning paradigm concerned with the study of how computers and natural systems such as humans learn in the presence of both labeled and unlabeled data. Traditionally, learning has been studied either in the unsupervised paradigm (e.g., clustering, outlier detection) where all the data is unlabeled, or in the supervised paradigm (e.g., classification, regression) where all the data is labeled. The goal of semi-supervised learning is to understand how combining labeled and unlabeled data may change the learning behavior, and design algorithms that take advantage of such a combination. Semi-supervised learning is of great interest in machine learning and data mining because it can use readily available unlabeled data to improve supervised learning tasks when the labeled data is scarce or expensive. Semi-supervised learning also shows potential as a quantitative tool to understand human category learning, where most of the input is self-evidently unlabeled. In this introductory book, we present some popular semi-supervised learning models, including self-training, mixture models, co-training and multiview learning, graph-based methods, and semi-supervised support vector machines. For each model, we discuss its basic mathematical formulation. The success of semi-supervised learning depends critically on some underlying assumptions. We emphasize the assumptions made by each model and give counterexamples when appropriate to demonstrate the limitations of the different models. In addition, we discuss semi-supervised learning for cognitive psychology. Finally, we give a computational learning theoretic perspective on semi-supervised learning, and we conclude the book with a brief discussion of open questions in the field.
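Of the models listed in this abstract, self-training is the simplest to state: fit on the labeled data, pseudo-label the most confident unlabeled examples, absorb them, and repeat. A minimal sketch assuming scikit-learn and NumPy; the confidence threshold and base learner are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_rounds=10):
    """Self-training: iteratively absorb confidently pseudo-labeled points."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)
        confident = conf >= threshold
        if not confident.any():          # nothing left that the model trusts
            break
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]
    return clf
```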

1,913 citations


Cites background from "Semi-Supervised Learning Literature..."

  • ...For further readings on these and other semi-supervised learning topics, there is a book collection from a machine learning perspective [37], a survey article with up-to-date papers [208], a book written for computational linguists [1], and a technical report [151]....


References
More filters
Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
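As a usage illustration only (not part of the cited paper), the LDA model described above is available in scikit-learn; the toy corpus below is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "labeled data is expensive to obtain",
    "unlabeled data is cheap and plentiful",
    "topic models describe documents as mixtures of topics",
]

# Bag-of-words counts, then a 2-topic LDA fit with variational Bayes.
counts = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic proportions
print(doc_topics.shape)                  # (3, 2)
```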

30,570 citations


"Semi-Supervised Learning Literature..." refers background or methods in this paper

  • ...Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is one step further. It assumes the topic proportion of each document is drawn from a Dirichlet distribution. With variational approximation, each document is represented by a posterior Dirichlet over the topics. This is a much lower dimensional representation. Griffiths et al. (2005) extend the LDA model to 'HMM-LDA', which uses both short-term syntactic and long-term topical dependencies, as an effort to integrate semantics and syntax. Li and McCallum (2005) apply the HMM-LDA model to obtain word clusters, as a rudimentary way for semi-supervised learning on sequences. Some algorithms derive a metric entirely from the density of U. These are motivated by unsupervised clustering and based on the intuition that data points in the same high-density 'clump' should be close in the new metric. For instance, if U is generated from a single Gaussian, then the Mahalanobis distance induced by the covariance matrix is such a metric. Tipping (1999) generalizes the Mahalanobis distance by fitting U with a mixture of Gaussians, and defines a Riemannian manifold with the metric at x being the weighted average of the individual components' inverse covariances. The distance between x1 and x2 is computed along the straight line (in Euclidean space) between the two points. Rattray (2000) further generalizes the metric so that it only depends on the change in log probabilities of the density, not on a particular Gaussian mixture assumption. And the distance is computed along a curve that minimizes the distance. The new metric is invariant to linear transformation of the features, and connected regions of relatively homogeneous density in U will be close to each other. Such a metric is attractive, yet it depends on the homogeneity of the initial Euclidean space. Their application in semi-supervised learning needs further investigation. Sajama and Orlitsky (2005) analyze the lower and upper bounds on estimating data-density-based distance....
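To make the density-derived metric in the excerpt concrete: when the unlabeled set U is modeled by a single Gaussian, the induced metric is the Mahalanobis distance computed from the sample covariance of U. A minimal NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)

# Mahalanobis distance induced by the covariance of the unlabeled data U.
cov = np.cov(U, rowvar=False)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x1, x2, cov_inv=cov_inv):
    diff = np.asarray(x1) - np.asarray(x2)
    return float(np.sqrt(diff @ cov_inv @ diff))

# Points separated along the high-variance direction come out closer than
# equally (Euclidean-)distant points along the low-variance direction.
print(mahalanobis([0, 0], [2, 0]), mahalanobis([0, 0], [0, 2]))
```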

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"Semi-Supervised Learning Literature..." refers background in this paper

  • ...The decision boundary has the smallest generalization error bound on unlabeled data (Vapnik, 1998)....


  • ...The name TSVM originates from the intention to work only on the observed data (though people use them for induction anyway), which according to (Vapnik, 1998) is solving a simpler problem....

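The TSVM quoted above seeks a decision boundary that passes through a low-density region of the unlabeled data; the exact optimization over unlabeled labels is combinatorial. The sketch below is only a naive flavour of the idea, alternating pseudo-labeling with down-weighted retraining of a supervised SVM, and is not Joachims' TSVM algorithm. It assumes scikit-learn and NumPy.

```python
import numpy as np
from sklearn.svm import SVC

def naive_transductive_svm(X_l, y_l, X_u, pseudo_weight=0.1, rounds=5):
    """Crude stand-in for TSVM: alternate pseudo-labeling and weighted retraining."""
    clf = SVC(kernel="linear").fit(X_l, y_l)
    for _ in range(rounds):
        y_u = clf.predict(X_u)                      # current guesses for unlabeled points
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, y_u])
        w = np.concatenate([np.ones(len(y_l)),      # trust labeled data fully
                            np.full(len(y_u), pseudo_weight)])
        clf = SVC(kernel="linear").fit(X, y, sample_weight=w)
    return clf
```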

Proceedings Article
03 Jan 2001
TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.

25,546 citations

Journal ArticleDOI
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.
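Of the three fundamental HMM problems mentioned in this abstract, the evaluation problem (computing the probability of an observation sequence given the model) is solved by the forward algorithm. A minimal NumPy sketch with an illustrative two-state, two-symbol model in the spirit of the coin-tossing example:

```python
import numpy as np

# Illustrative two-state HMM; rows index hidden states, columns of B index observation symbols.
pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3],                    # state transition probabilities
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                    # emission probabilities (2 symbols)
              [0.2, 0.8]])

def forward(obs, pi, A, B):
    """Forward algorithm: P(observation sequence | model)."""
    alpha = pi * B[:, obs[0]]                # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]        # induction step
    return alpha.sum()                       # termination

print(forward([0, 1, 1, 0], pi, A, B))
```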

21,819 citations