New theoretical frameworks for machine learning

TL;DR: This thesis introduces a new discriminative model (a PAC or Statistical Learning Theory style model) for semi-supervised learning that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community.
Abstract: This thesis has two primary thrusts. The first is developing new models and algorithms for important modern and classic learning problems. The second is establishing new connections between Machine Learning and Algorithmic Game Theory. The formulation of the PAC learning model by Valiant [201] and the Statistical Learning Theory framework by Vapnik [203] have been instrumental in the development of machine learning and the design and analysis of algorithms for supervised learning. However, while extremely influential, these models do not capture or explain other important classic learning paradigms such as Clustering, nor do they capture important emerging learning paradigms such as Semi-Supervised Learning and other ways of incorporating unlabeled data in the learning process. In this thesis, we develop the first analog of these general discriminative models for the problems of Semi-Supervised Learning and Clustering, and we analyze both their algorithmic and sample-complexity implications. We also provide the first generalization of the well-established theory of learning with kernel functions to the case of general pairwise similarity functions, and in addition we provide new positive theoretical results for Active Learning. Finally, this dissertation presents new applications of techniques from Machine Learning to Algorithmic Game Theory, which has been a major area of research at the intersection of Computer Science and Economics. In machine learning, there has been growing interest in using unlabeled data together with labeled data due to the availability of large amounts of unlabeled data in many contemporary applications. As a result, a number of different semi-supervised learning methods such as Co-training, transductive SVM, or graph-based methods have been developed. However, the underlying assumptions of these methods are often quite distinct and not captured by standard theoretical models. This thesis introduces a new discriminative model (a PAC or Statistical Learning Theory style model) for semi-supervised learning that can be used to reason about many of the different approaches taken over the past decade in the Machine Learning community. This model provides a unified framework for analyzing when and why unlabeled data can help in the semi-supervised learning setting, in which one can analyze both sample-complexity and algorithmic issues. In particular, our model allows us to address in a unified way key questions such as "Under what conditions will unlabeled data help and by how much?" and "How much data should I expect to need in order to perform well?". Another important part of this thesis is Active Learning, for which we provide several new theoretical results. In particular, this dissertation includes the first active learning algorithm that works in the presence of arbitrary forms of noise, as well as several margin-based active learning algorithms. In the context of Kernel methods (another flourishing area of machine learning research), this thesis shows how Random Projection techniques can be used to convert a given kernel function into an explicit, distribution-dependent set of features, which can then be fed into more general (not necessarily kernelizable) learning algorithms. In addition, this work shows how such methods can be extended to more general pairwise similarity functions and also gives a formal theory that matches the standard intuition that a good kernel function is one that acts as a good measure of similarity.
We thus strictly generalize and simplify the existing theory of kernel methods. Our approach brings a new perspective as well as a much simpler explanation for the effectiveness of kernel methods, which can help in the design of good kernel functions for new learning problems. We also show how we can use this perspective to think about Clustering in a novel way. While the study of clustering is centered around an intuitively compelling goal (and it has been a major tool in many different fields), reasoning about it in a generic and unified way has been difficult, in part due to the lack of a general theoretical framework along the lines we have for supervised classification. In our work we develop the first general discriminative clustering framework for analyzing accuracy without probabilistic assumptions. This dissertation also contributes new connections between Machine Learning and Mechanism Design. Specifically, this thesis presents the first general framework in which machine learning methods can be used to reduce mechanism design problems to standard algorithmic questions for a wide range of revenue maximization problems in an unlimited supply setting. Our results substantially generalize the previous work based on random sampling mechanisms, both by broadening the applicability of such mechanisms and by simplifying the analysis. From a learning perspective, these settings present several unique challenges: the loss function is discontinuous and asymmetric, and the range of bidders' valuations may be large.
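
The abstract's kernel-to-explicit-features idea can be made concrete with a small sketch. The snippet below is only a toy rendering of the general recipe, evaluating the kernel against landmark points drawn from an unlabeled sample to obtain an explicit, distribution-dependent feature map; the landmark count, the RBF kernel, and the synthetic data are illustrative assumptions, not the thesis's exact construction or guarantees.

import numpy as np

# Toy kernel-to-features sketch: draw d "landmark" points from an unlabeled
# sample and map each example x to the vector (K(x, z_1), ..., K(x, z_d)),
# which any linear (or otherwise non-kernelized) learner can then consume.

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

def kernel_feature_map(X_unlabeled, d, kernel, rng):
    # Choose d landmarks at random from the unlabeled pool.
    landmarks = X_unlabeled[rng.choice(len(X_unlabeled), size=d, replace=False)]
    return lambda x: np.array([kernel(x, z) for z in landmarks])

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(500, 10))        # stand-in for an unlabeled sample
phi = kernel_feature_map(X_unlabeled, d=50, kernel=rbf_kernel, rng=rng)

x = rng.normal(size=10)
print(phi(x).shape)                             # (50,): explicit, distribution-dependent features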


Citations
Journal Article
TL;DR: It is shown that under the assumption that the best classifier in the learner's hypothesis class has generalization error at most β > 0, the label complexity of active learning is Ω(β²/ε² log(1/δ)), where the accuracy parameter ε measures how close to optimal within the hypothesis class the active learner has to get and δ is the confidence parameter.
Abstract: Most of the existing active learning algorithms are based on the realizability assumption: The learner's hypothesis class is assumed to contain a target function that perfectly classifies all training and test examples. This assumption can hardly ever be justified in practice. In this paper, we study how relaxing the realizability assumption affects the sample complexity of active learning. First, we extend existing results on query learning to show that any active learning algorithm for the realizable case can be transformed to tolerate random bounded rate class noise. Thus, bounded rate class noise adds little extra complication to active learning, and in particular exponential label complexity savings over passive learning are still possible. However, it is questionable whether this noise model is any more realistic in practice than assuming no noise at all. Our second result shows that if we move to the truly non-realizable model of statistical learning theory, then the label complexity of active learning has the same dependence Ω(1/ε²) on the accuracy parameter ε as the passive learning label complexity. More specifically, we show that under the assumption that the best classifier in the learner's hypothesis class has generalization error at most β > 0, the label complexity of active learning is Ω(β²/ε² log(1/δ)), where the accuracy parameter ε measures how close to optimal within the hypothesis class the active learner has to get and δ is the confidence parameter. The implication of this lower bound is that exponential savings should not be expected in realistic models of active learning, and thus the label complexity goals in active learning should be refined.
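
For readability, the lower bound quoted above can be restated in standard notation (this is only a transcription of the abstract's claim, not a new result; exact constants and conditions are in the cited paper):

\[
  \text{label complexity of agnostic active learning}
  \;=\; \Omega\!\left(\frac{\beta^{2}}{\epsilon^{2}} \log\frac{1}{\delta}\right),
\]

where $\beta$ bounds the generalization error of the best classifier in the hypothesis class, $\epsilon$ is the target excess error (so the dependence matches the $\Omega(1/\epsilon^{2})$ label complexity of passive learning), and $\delta$ is the confidence parameter.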

98 citations

Proceedings ArticleDOI
19 Oct 2012
TL;DR: A carefully selected nonparametric, semi-supervised learning algorithm applied to network intrusion detection yields dramatic performance improvements over supervised learning and anomaly detection in discriminating real, previously unseen, malicious network traffic, while generating an order of magnitude fewer false alerts than any alternative, including a signature IDS tool deployed on the same network.
Abstract: A barrier to the widespread adoption of learning-based network intrusion detection tools is the in-situ training requirements for effective discrimination of malicious traffic. Supervised learning techniques necessitate a quantity of labeled examples that is often intractable, and at best cost-prohibitive. Recent advances in semi-supervised techniques have demonstrated the ability to generalize well based on a significantly smaller set of labeled samples. In network intrusion detection, placing reasonable requirements on the number of training examples provides realistic expectations that a learning-based system can be trained in the environment where it will be deployed. This in-situ training is necessary to ensure that the assumptions associated with the learning process hold, and thereby support a reasonable belief in the generalization ability of the resulting model. In this paper, we describe the application of a carefully selected nonparametric, semi-supervised learning algorithm to the network intrusion problem, and compare the performance to other model types using feature-based data derived from an operational network. We demonstrate dramatic performance improvements over supervised learning and anomaly detection in discriminating real, previously unseen, malicious network traffic while generating an order of magnitude fewer false alerts than any alternative, including a signature IDS tool deployed on the same network.
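
The paper's specific nonparametric semi-supervised algorithm is not reproduced here; as a generic stand-in, the sketch below trains a graph-based semi-supervised classifier (scikit-learn's LabelSpreading) on a small labeled set plus a large unlabeled pool, mirroring the in-situ training setting described above. The synthetic flow features, class labels, and kernel width are assumptions for illustration only.

import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# A few labeled flow records plus a large unlabeled pool (all synthetic).
n_labeled, n_unlabeled = 50, 2000
X_labeled = rng.normal(size=(n_labeled, 8))                 # e.g., per-flow statistics
y_labeled = rng.integers(0, 2, size=n_labeled)              # 0 = benign, 1 = malicious
X_unlabeled = rng.normal(size=(n_unlabeled, 8))

X = np.vstack([X_labeled, X_unlabeled])
y = np.concatenate([y_labeled, -np.ones(n_unlabeled, dtype=int)])   # -1 marks unlabeled rows

# Labels propagate over an RBF similarity graph built from all samples.
model = LabelSpreading(kernel="rbf", gamma=0.5)
model.fit(X, y)

X_new = rng.normal(size=(3, 8))
print(model.predict(X_new))                                  # predicted benign/malicious labels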

36 citations

Proceedings Article
01 Jan 2010
TL;DR: The novel SSL algorithms proposed in this thesis have numerous advantages over the existing semi-supervised algorithms as they yield convex, scalable, inherently multi-class loss functions that can be kernelized naturally.
Abstract: The maximum entropy (MaxEnt) framework has been studied extensively in the supervised setting. Here, the goal is to find a distribution p that maximizes an entropy function while enforcing data constraints so that the expected values of some (pre-defined) features with respect to p match their empirical counterparts approximately. Using different entropy measures, different model spaces for p, and different approximation criteria for the data constraints yields a family of discriminative supervised learning methods (e.g., logistic regression, conditional random fields, least squares and boosting) (Dudik & Schapire, 2006; Friedlander & Gupta, 2006; Altun & Smola, 2006). This framework is known as the generalized maximum entropy framework. Semi-supervised learning (SSL) is a promising field that has increasingly attracted attention in the last decade. SSL algorithms utilize unlabeled data along with labeled data so as to increase the accuracy and robustness of inference algorithms. However, most SSL algorithms to date have had trade-offs, e.g., in terms of scalability or applicability to multi-categorical data. In this thesis, we extend the generalized MaxEnt framework to develop a family of novel SSL algorithms using two different approaches: (1) Introducing Similarity Constraints. We incorporate unlabeled data via modifications to the primal MaxEnt objective in terms of additional potential functions. A potential function stands for a closed proper convex function that can take the form of a constraint and/or a penalty representing our structural assumptions on the data geometry. Specifically, we impose similarity constraints as additional penalties based on the semi-supervised smoothness assumption, i.e., we restrict the MaxEnt problem such that similar samples have similar model outputs. The motivation is reminiscent of that of Laplacian SVM (Sindhwani et al., 2005) and manifold transductive neural networks (Karlen et al., 2008); however, instead of regularizing the loss function in the dual, we integrate our constraints directly into the primal MaxEnt problem, which has a more straightforward and natural interpretation. (2) Augmenting Constraints on Model Features. We incorporate unlabeled data to enhance the moment matching constraints of the generalized MaxEnt problem in the primal. We improve the estimates of the model and empirical expectations of the feature functions using our assumptions on the data geometry. In particular, we derive the semi-supervised formulations for three specific instances of the generalized MaxEnt framework on conditional distributions, namely logistic regression and kernel logistic regression for multi-class problems, and conditional random fields for structured output prediction problems. A thorough empirical evaluation on standard data sets that are widely used in the literature demonstrates the validity and competitiveness of the proposed algorithms. In addition to these benchmark data sets, we apply our approach to two real-life problems, vision-based robot grasping and remote sensing image classification, where the scarcity of labeled training samples is the main bottleneck in the learning process. For the particular case of grasp learning, we also propose a combination of semi-supervised learning and active learning, another sub-field of machine learning that is focused on the scarcity of labeled samples, when the problem setup is suitable for incremental labeling.
To conclude, the novel SSL algorithms proposed in this thesis have numerous advantages over the existing semi-supervised algorithms as they yield convex, scalable, inherently multi-class loss functions that can be kernelized naturally.
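
The "similarity constraints" idea, penalizing models whose outputs differ on similar samples, can be sketched for the simplest member of the MaxEnt family, binary logistic regression. The graph Laplacian penalty, its weight, and the plain gradient-descent optimizer below are illustrative assumptions, not the thesis's formulation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ssl_logreg(X_lab, y_lab, X_all, W, lam=0.1, lr=0.05, steps=500):
    # Labeled data (y_lab in {0, 1}), all samples X_all (labeled + unlabeled),
    # and a symmetric similarity matrix W over all samples.
    L = np.diag(W.sum(axis=1)) - W              # graph Laplacian of the similarity graph
    w = np.zeros(X_lab.shape[1])
    n_all = len(X_all)
    for _ in range(steps):
        p = sigmoid(X_lab @ w)
        grad = X_lab.T @ (p - y_lab) / len(y_lab)          # supervised log-loss term
        f = X_all @ w                                      # model outputs on all samples
        grad += lam * 2.0 * (X_all.T @ (L @ f)) / n_all    # smoothness penalty f^T L f
        w -= lr * grad
    return w

# Toy usage with an RBF similarity graph over labeled + unlabeled samples.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 5))
y_lab = rng.integers(0, 2, size=20)
X_unl = rng.normal(size=(200, 5))
X_all = np.vstack([X_lab, X_unl])
d2 = ((X_all[:, None, :] - X_all[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 2.0)
w = fit_ssl_logreg(X_lab, y_lab, X_all, W)
print(sigmoid(X_lab @ w)[:5])                              # fitted probabilities on labeled points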

32 citations

01 Jan 2006
TL;DR: This chapter contains sections titled: Introduction, A Formal Framework, Sample Complexity Results, Algorithmic Results, Related Models and Discussion.
Abstract: This chapter contains sections titled: Introduction, A Formal Framework, Sample Complexity Results, Algorithmic Results, Related Models and Discussion

21 citations

Proceedings ArticleDOI
24 Feb 2014
TL;DR: An architecture designed to transparently and automatically scale the performance of sequential programs as a function of the hardware resources available is presented, including a collection of state predictors and a mechanism for speculatively executing threads that explore potential states along the execution path.
Abstract: We present an architecture designed to transparently and automatically scale the performance of sequential programs as a function of the hardware resources available. The architecture is predicated on a model of computation that views program execution as a walk through the enormous state space composed of the memory and registers of a single-threaded processor. Each instruction execution in this model moves the system from its current point in state space to a deterministic subsequent point. We can parallelize such execution by predictively partitioning the complete path and speculatively executing each partition in parallel. Accurately partitioning the path is a challenging prediction problem. We have implemented our system using a functional simulator that emulates the x86 instruction set, including a collection of state predictors and a mechanism for speculatively executing threads that explore potential states along the execution path. While the overhead of our simulation makes it impractical to measure speedup relative to native x86 execution, experiments on three benchmarks show scalability of up to a factor of 256 on a 1024 core machine when executing unmodified sequential programs.
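
The predict-partition-validate loop described in the abstract can be illustrated with a toy sequential "program". The state, step function, and deliberately perfect midpoint predictor below are stand-ins; a real implementation would partition full machine state (registers and memory) and run partitions on separate cores with learned predictors.

def step(state):
    # One deterministic "instruction": state -> next state (a toy LCG update).
    return (state * 6364136223846793005 + 1442695040888963407) % 2**64

def run(state, n):
    for _ in range(n):
        state = step(state)
    return state

def speculative_run(start, n, predict_midpoint):
    half = n // 2
    guess = predict_midpoint(start, half)   # predicted state at the partition boundary
    first = run(start, half)                # partition 1 (would run on core 0)
    second = run(guess, n - half)           # partition 2, executed speculatively (core 1)
    if first == guess:                      # prediction validated: keep the speculative work
        return second
    return run(first, n - half)             # misprediction: redo partition 2 from the true state

# Here the predictor simply simulates ahead, so it is always right; the paper's
# point is that real predictors must guess this boundary state accurately.
result = speculative_run(12345, 1000, predict_midpoint=run)
assert result == run(12345, 1000)
print(hex(result))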

13 citations


Cites methods from "New theoretical frameworks for mach..."

  • ...Most instruction control flow analysis and optimization techniques can be transformed into probabilistic predictors via a technique from machine learning called “the kernel trick” [6, 17]....

    [...]

References
Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated, and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensure high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
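
A minimal usage sketch of a soft-margin support-vector classifier with a polynomial kernel (a polynomial input transformation applied implicitly) on synthetic non-separable data; the data and hyperparameters are illustrative, not the paper's OCR benchmark.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # two classes, not linearly separable

clf = SVC(kernel="poly", degree=3, C=1.0)             # soft margin handles non-separable data
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
print("support vectors:", int(clf.n_support_.sum()))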

37,861 citations


"New theoretical frameworks for mach..." refers background in this paper

  • ...Typical kernel functions for structured data include the polynomial kernel K(x, x′) = (1 + x · x′)^d and the Gaussian kernel K(x, x′) = e^(−||x−x′||²/(2σ²)), and a number of special-purpose kernels have been developed for sequence data, image data, and other types of data as well [55, 56, 93, 103, 118]....

    [...]
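
A direct rendering of the two kernels named in the excerpt above; the degree d and width σ are free parameters chosen per problem, and the input vectors below are arbitrary.

import numpy as np

def poly_kernel(x, xp, d=3):
    # K(x, x') = (1 + x . x')^d
    return (1.0 + x @ xp) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

x, xp = np.array([1.0, 0.5]), np.array([0.2, -0.3])
print(poly_kernel(x, xp), gaussian_kernel(x, xp))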

Journal Article

28,685 citations


"New theoretical frameworks for mach..." refers background in this paper

  • ...[table-of-contents excerpt: 5.2.2 The Non-realizable Case under the Uniform Distribution; 5.2.3 Discussion; 5.3 Other Results in Active Learning]....

    [...]

  • ...) At a high level, in [35, 42] we point out that traditional analyses [96] have studied the number of label requests required before an algorithm can both produce an ε-good classifier and prove that the classifier's error is no more than ε....

    [...]

  • ...Another learning setting where unlabeled data is useful and which has been increasingly popular for the past few years is Active Learning [31, 34, 35, 42, 87, 96]....

    [...]

  • ...• In Chapter 5 we analyze Active Learning and present two main results....

    [...]

  • ...(Full details of the model and results can be found in [35, 42] ....

    [...]

01 Jan 1998
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Abstract: A comprehensive look at learning and generalization theory. The statistical theory of learning and generalization concerns the problem of choosing desired functions on the basis of empirical data. Highly applicable to a variety of computer science and robotics fields, this book offers lucid coverage of the theory as a whole. Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.

26,531 citations


"New theoretical frameworks for mach..." refers methods in this paper

  • ...To obtain our bounds we use and extend sample-complexity techniques from machine learning theory (see [11, 42, 88, 121]) and to design our mechanisms we employ machine learning methods such as structural risk minimization....

    [...]
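
Structural risk minimization, mentioned in the excerpt above, picks a hypothesis from nested classes of growing complexity by minimizing empirical error plus a complexity penalty. The sketch below uses polynomial-threshold classifiers and a simple sqrt(complexity/n) penalty as illustrative stand-ins; it is not the mechanism-design application from the thesis.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = (np.sin(3 * x) > 0).astype(int)                        # unknown target concept

def empirical_error(x, y, degree):
    # Fit a degree-d polynomial score by least squares and threshold it at 0.5.
    A = np.vander(x, degree + 1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = (A @ coef > 0.5).astype(int)
    return np.mean(pred != y)

# SRM-style rule: empirical risk plus a penalty growing with class complexity.
best_degree = min(range(1, 15), key=lambda d: empirical_error(x, y, d) + np.sqrt(d / n))
print("degree chosen by the SRM-style rule:", best_degree)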

Book
01 Jan 1968
TL;DR: The arrangement of this invention provides a strong vibration free hold-down mechanism while avoiding a large pressure drop to the flow of coolant fluid.
Abstract: A fuel pin hold-down and spacing apparatus for use in nuclear reactors is disclosed. Fuel pins forming a hexagonal array are spaced apart from each other and held-down at their lower end, securely attached at two places along their length to one of a plurality of vertically disposed parallel plates arranged in horizontally spaced rows. These plates are in turn spaced apart from each other and held together by a combination of spacing and fastening means. The arrangement of this invention provides a strong vibration free hold-down mechanism while avoiding a large pressure drop to the flow of coolant fluid. This apparatus is particularly useful in connection with liquid cooled reactors such as liquid metal cooled fast breeder reactors.

17,939 citations


"New theoretical frameworks for mach..." refers methods in this paper

  • ...Note that any given tree has at most 2^(2k) prunings of size k [154], so this model is at least as strict as the list model....

    [...]
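
As a sanity check on the quoted bound, the sketch below counts prunings of a complete binary tree, taking "pruning" to mean replacing some internal nodes by leaves and "size" to mean the number of leaves (both are assumptions made for illustration), and verifies that the count never exceeds 2^(2k).

class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def complete(depth):
    # Complete binary tree of the given depth (depth 0 = a single leaf).
    return Node() if depth == 0 else Node(complete(depth - 1), complete(depth - 1))

def prunings(node, k):
    # Number of prunings of the subtree rooted at `node` having exactly k leaves.
    if k == 1:
        return 1                      # cut here: this node becomes a leaf of the pruning
    if node.left is None:             # an original leaf admits only the size-1 pruning
        return 0
    return sum(prunings(node.left, i) * prunings(node.right, k - i) for i in range(1, k))

root = complete(6)                    # complete binary tree with 64 leaves
for k in range(1, 8):
    count = prunings(root, k)
    assert count <= 2 ** (2 * k)
    print(k, count, "<=", 2 ** (2 * k))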