Open Access Journal Article

The Use of Unlabeled Data in Predictive Modeling

TL;DR
The fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data are overviewed, highlighting the relevance of venerable concepts of sampling design and prior specification.
Abstract
The incorporation of unlabeled data in regression and classification analysis is an increasing focus of the applied statistics and machine learning literatures, with a number of recent examples demonstrating the potential for unlabeled data to contribute to improved predictive accuracy. The statistical basis for this semisupervised analysis does not appear to have been well delineated; as a result, the underlying theory and rationale may be underappreciated, especially by nonstatisticians. There is also room for statisticians to become more fully engaged in the vigorous research in this important area of intersection of the statistical and computer sciences. Much of the theoretical work in the literature has focused, for example, on geometric and structural properties of the unlabeled data in the context of particular algorithms, rather than probabilistic and statistical questions. This paper overviews the fundamental statistical foundations for predictive modeling and the general questions associated with unlabeled data, highlighting the relevance of venerable concepts of sampling design and prior specification. This theory, illustrated with a series of central illustrative examples and two substantial real data analyses, shows precisely when, why and how unlabeled data matter.


Citations
Journal Article

Covariance-regularized regression and classification for high-dimensional problems

TL;DR: It is shown that ridge regression, the lasso and the elastic net are special cases of covariance-regularized regression, and it is demonstrated that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations.
Journal Article

Penalized model-based clustering with unconstrained covariance matrices.

TL;DR: This article proposes a regularized Gaussian mixture model permitting a treatment of general covariance matrices, taking various dependencies into account, and derives an E-M algorithm utilizing the graphical lasso for parameter estimation, achieving better clustering and variable selection.
Journal Article

Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications

TL;DR: A model-based discriminant analysis method that includes variable selection is presented; it outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.
Journal Article

Estimation of Gradients and Coordinate Covariation in Classification

TL;DR: An algorithm that simultaneously estimates a classification function as well as its gradient in the supervised learning framework to find salient variables and estimate how they covary is introduced.
References

Statistical learning theory

TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Book

Kernel Methods for Pattern Analysis

TL;DR: This book provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.
Proceedings Article

Combining labeled and unlabeled data with co-training

TL;DR: A PAC-style analysis is provided for a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views, to allow inexpensive unlabeled data to augment a much smaller set of labeled examples.
Journal Article

A Bayesian Analysis of Some Nonparametric Problems

TL;DR: In this article, a class of prior distributions, called Dirichlet process priors, is proposed for nonparametric problems, for which treatment of many nonparametric statistical problems may be carried out, yielding results that are comparable to the classical theory.