
Showing papers on "Statistical learning theory published in 2018"


Journal ArticleDOI
TL;DR: This work proposes a robust approach based on a recent nature-inspired metaheuristic called multi-verse optimizer (MVO) for selecting optimal features and optimizing the parameters of SVM simultaneously.
Abstract: Support vector machine (SVM) is a well-regarded machine learning algorithm widely applied to classification tasks and regression problems. SVM is founded on statistical learning theory and the structural risk minimization principle. Despite the high prediction rate of this technique in a wide range of real applications, the efficiency of SVM and its classification accuracy depend heavily on the parameter settings as well as on feature subset selection. This work proposes a robust approach based on a recent nature-inspired metaheuristic called the multi-verse optimizer (MVO) for selecting optimal features and optimizing the parameters of SVM simultaneously. In fact, the MVO algorithm is employed as a tuner to manipulate the main parameters of SVM and find the optimal set of features for this classifier. The proposed approach is implemented and tested on two different system architectures. MVO is benchmarked and compared with four classic and recent metaheuristic algorithms using ten binary and multi-class labeled datasets. Experimental results demonstrate that MVO can effectively reduce the number of features while maintaining a high prediction accuracy.
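
To make the wrapper idea above concrete, the sketch below shows the inner evaluation loop such an approach needs: each candidate encodes SVM hyperparameters plus a feature mask and is scored by cross-validation. A plain random search stands in for MVO here, and the dataset, parameter ranges, and number of candidates are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: tuning SVM hyperparameters and a feature mask with a
# population-based search. A plain random search stands in for MVO; the
# dataset, ranges, and candidate count are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(log_c, log_gamma, mask):
    if not mask.any():                      # an empty feature subset is invalid
        return -np.inf
    clf = SVC(C=10.0 ** log_c, gamma=10.0 ** log_gamma, kernel="rbf")
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

best_score, best = -np.inf, None
for _ in range(50):                         # candidate solutions ("universes")
    cand = (rng.uniform(-2, 3), rng.uniform(-4, 1), rng.random(X.shape[1]) < 0.5)
    score = fitness(*cand)
    if score > best_score:
        best_score, best = score, cand

print(f"best CV accuracy: {best_score:.3f} with {best[2].sum()} features")
```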

187 citations


Book
28 Feb 2018
TL;DR: This monograph surveys direct approaches to non-convex optimization for large-scale machine learning problems, such as projected gradient descent and alternating minimization, which frequently outperform convex relaxation-based techniques yet are often poorly understood in terms of convergence.
Abstract: A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks. The freedom to express the learning problem as a non-convex optimization problem gives immense modeling power to the algorithm designer, but often such problems are NP-hard to solve. A popular workaround to this has been to relax non-convex problems to convex ones and use traditional methods to solve the convex relaxed optimization problems. However this approach may be lossy and nevertheless presents significant challenges for large scale optimization. On the other hand, direct approaches to non-convex optimization have met with resounding success in several domains and remain the methods of choice for the practitioner, as they frequently outperform relaxation-based techniques - popular heuristics include projected gradient descent and alternating minimization. However, these are often poorly understood in terms of their convergence and other properties. This monograph presents a selection of recent advances that bridge a long-standing gap in our understanding of these heuristics. We hope that an insight into the inner workings of these methods will allow the reader to appreciate the unique marriage of task structure and generative models that allow these heuristic techniques to provably succeed. The monograph will lead the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal of this monograph is to both introduce the rich literature in this area, as well as equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.
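
As a concrete instance of the heuristics named above, the sketch below runs projected gradient descent on a sparsity-constrained least-squares problem (iterative hard thresholding). The problem sizes, step size, and sparsity level are illustrative assumptions, not taken from the monograph.

```python
# Hedged sketch: projected gradient descent for sparsity-constrained least
# squares (iterative hard thresholding). Sizes, step size, and the sparsity
# level k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 400, 5
A = rng.normal(size=(n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:k] = rng.normal(size=k)
y = A @ x_true

def project_sparse(v, k):
    """Project onto the non-convex set of k-sparse vectors."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

step = 1.0 / np.linalg.norm(A, 2) ** 2      # safe step: 1 / Lipschitz constant
x = np.zeros(d)
for _ in range(500):
    grad = A.T @ (A @ x - y)                # gradient of 0.5 * ||A x - y||^2
    x = project_sparse(x - step * grad, k)

print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```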

184 citations


Journal ArticleDOI
TL;DR: A support vector machine (SVM) classifier is used for fault detection in WSNs; the resulting decision function is light enough to be executed at cluster heads to detect anomalous sensors.

Abstract: Wireless sensor networks (WSNs) are prone to many failures such as hardware failures, software failures, and communication failures. Fault detection in WSNs is a challenging problem due to sensor resource limitations and the variety of deployment fields. Furthermore, the detection has to be precise to avoid negative alerts, and rapid to limit loss. The use of machine learning seems to be one of the most convenient solutions for detecting failures in WSNs. In this paper, the support vector machine (SVM) classification method is used for this purpose. Based on statistical learning theory, SVM is used in our context to define a decision function. As a light process in terms of required resources, this decision function can be easily executed at cluster heads to detect anomalous sensors. The effectiveness of SVM for fault detection in WSNs is shown through an experimental study comparing it to the latest methods for the same application.

174 citations


Posted Content
TL;DR: CGnets, a deep learning approach that learns coarse-grained free energy functions and can be trained by a force-matching scheme, is introduced; CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface.

Abstract: Atomistic or ab-initio molecular dynamics simulations are widely used to predict thermodynamics and kinetics and relate them to molecular structure. A common approach to go beyond the time- and length-scales accessible with such computationally expensive simulations is the definition of coarse-grained molecular models. Existing coarse-graining approaches define an effective interaction potential to match defined properties of high-resolution models or experimental data. In this paper, we reformulate coarse-graining as a supervised machine learning problem. We use statistical learning theory to decompose the coarse-graining error and cross-validation to select and compare the performance of different models. We introduce CGnets, a deep learning approach that learns coarse-grained free energy functions and can be trained by a force-matching scheme. CGnets maintain all physically relevant invariances and allow one to incorporate prior physics knowledge to avoid sampling of unphysical structures. We show that CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface. Thus, CGnets are able to capture multi-body terms that emerge from the dimensionality reduction.
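
The sketch below illustrates a force-matching loss in the spirit of the approach above: a network predicts a scalar free energy from coarse-grained coordinates, forces are obtained by automatic differentiation, and training matches them to reference forces. The tiny MLP, shapes, and random stand-in data are assumptions, not the CGnets architecture.

```python
# Hedged sketch of a force-matching loss for a coarse-grained energy network.
# The MLP, bead count, and random data are illustrative assumptions.
import torch
import torch.nn as nn

n_beads = 10
model = nn.Sequential(nn.Linear(3 * n_beads, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def force_matching_loss(coords, ref_forces):
    coords = coords.requires_grad_(True)
    energy = model(coords.reshape(coords.shape[0], -1)).sum()
    # predicted forces are the negative gradient of the energy w.r.t. coordinates
    pred_forces = -torch.autograd.grad(energy, coords, create_graph=True)[0]
    return ((pred_forces - ref_forces) ** 2).mean()

# one illustrative training step on random stand-in data
coords = torch.randn(32, n_beads, 3)
ref_forces = torch.randn(32, n_beads, 3)
loss = force_matching_loss(coords, ref_forces)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```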

106 citations


Proceedings ArticleDOI
12 Jan 2018
TL;DR: This paper derives generalization error bounds for a broad class of iterative algorithms characterized by bounded, noisy updates with Markovian structure, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm.
Abstract: In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.
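
For context, the Xu and Raginsky (2017) bound referenced above controls the expected generalization gap of an algorithm with output $W$ trained on a sample $S$ of $n$ examples, when the loss is $\sigma$-sub-Gaussian; in one common formulation:

```latex
\Bigl|\, \mathbb{E}\bigl[ R(W) - R_{\mathrm{emp}}(W, S) \bigr] \,\Bigr|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S;\, W)}.
```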

84 citations


Journal ArticleDOI
TL;DR: A short and elementary proof that PhaseMax exactly recovers real-valued vectors from random measurements under optimal sample complexity is presented, yielding a simpler and more direct proof than those that require statistical learning theory, geometric probability or the highly technical arguments for Wirtinger Flow-like approaches.
Abstract: The phase retrieval problem has garnered significant attention since the development of the PhaseLift algorithm, which is a convex program that operates in a lifted space of matrices. Because of the substantial computational cost due to lifting, many approaches to phase retrieval have been developed, including non-convex optimization algorithms which operate in the natural parameter space, such as Wirtinger Flow. Very recently, a convex formulation called PhaseMax has been discovered, and it has been proven to achieve phase retrieval via linear programming in the natural parameter space under optimal sample complexity. The current proofs of PhaseMax rely on statistical learning theory or geometric probability theory. Here, we present a short and elementary proof that PhaseMax exactly recovers real-valued vectors from random measurements under optimal sample complexity. Our proof only relies on standard probabilistic concentration and covering arguments, yielding a simpler and more direct proof than those that require statistical learning theory, geometric probability or the highly technical arguments for Wirtinger Flow-like approaches.
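
For reference, the PhaseMax program discussed above is the following convex problem in the natural parameter space: given magnitude measurements $b_i = |\langle a_i, x^{\natural}\rangle|$ and an anchor vector $\hat{x}$ (an initial guess correlated with $x^{\natural}$), one solves (the notation is a common convention, not necessarily the paper's):

```latex
\max_{x \in \mathbb{R}^{d}} \;\; \langle \hat{x},\, x \rangle
\quad \text{subject to} \quad |\langle a_i,\, x \rangle| \le b_i, \quad i = 1,\dots,m.
```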

47 citations


Journal ArticleDOI
TL;DR: The SVM framework is reformulated such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix—the latter modeling the uncertainty.
Abstract: In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the SVM framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix—the latter modeling the uncertainty. We address the classification problem and define a cost function that is the expected value of the classical SVM cost when data samples are drawn from the multi-dimensional Gaussian distributions that form the set of the training examples. Our formulation approximates the classical SVM formulation when the training examples are isotropic Gaussians with variance tending to zero. We arrive at a convex optimization problem, which we solve efficiently in the primal form using a stochastic gradient descent approach. The resulting classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is tested on synthetic data and five publicly available and popular datasets; namely, the MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID MED datasets. Experimental results verify the effectiveness of the proposed method.

40 citations


Book ChapterDOI
01 Jan 2018
TL;DR: In statistical learning theory (regression, classification, etc.) there are many regression models, such as algebraic polynomials, which help in the development of models for classification.
Abstract: In statistical learning theory (regression, classification, etc.) there are many regression models, such as algebraic polynomials,

28 citations


Journal ArticleDOI
TL;DR: This paper presents the convex programming problems underlying SVM, focusing on supervised binary classification, analyzes the most important and widely used optimization methods for SVM training problems, and discusses how the properties of these problems can be incorporated into the design of useful algorithms.
Abstract: Support Vector Machine (SVM) is one of the most important class of machine learning models and algorithms, and has been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM focusing on supervised binary classification. We analyze the most important and used optimization methods for SVM training problems, and we discuss how the properties of these problems can be incorporated in designing useful algorithms.
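
For reference, the core convex program in question is the standard soft-margin SVM primal for binary labels $y_i \in \{-1,+1\}$ (the textbook formulation, not anything specific to this paper):

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\bigl(w^{\top}x_i + b\bigr) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,n.
```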

26 citations


Proceedings ArticleDOI
17 Jun 2018
TL;DR: This paper derives the minimax universal learning solution, a redundancy-capacity theorem, and an upper bound on the performance of the optimal solution for batch learning with log-loss.

Abstract: In this paper we consider the problem of batch learning with log-loss, in a stochastic setting where, given the data features, the outcome is generated by an unknown distribution from a class of models. Utilizing the minimax theorem and information-theoretic tools, we derive the minimax universal learning solution, a redundancy-capacity theorem and an upper bound on the performance of the optimal solution. The resulting universal learning solution is a mixture over the models in the considered class. Furthermore, we obtain a better bound on the generalization error that decays as $O(\log N/N)$ , where $N$ is the sample size, instead of the $O(\sqrt{\log N/N})$ that is commonly attained in statistical learning theory for the empirical risk minimizer.

20 citations


Posted Content
TL;DR: A new theoretical framework for model compression is developed and a new pruning method called spectral pruning is proposed based on this framework; the framework defines the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes.

Abstract: Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between practically effective compression methods and their rigorous grounding in statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias-variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.
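
To make the "degrees of freedom" idea tangible, the sketch below computes the eigenvalue spectrum of a layer's activation covariance and a ridge-type effective-dimensionality quantity. This is an illustrative reading of the quantity described above, not the paper's exact definition or its pruning rule; the random activations and the regularization level are stand-ins.

```python
# Hedged sketch: eigenvalue spectrum of the activation covariance of one layer
# and a ridge-type "degrees of freedom" quantity. The random activations and
# the level lam are stand-ins; the paper's exact definition may differ.
import numpy as np

rng = np.random.default_rng(0)
n_samples, width = 1000, 256
hidden = rng.normal(size=(n_samples, width)) @ rng.normal(size=(width, width)) * 0.1

cov = np.cov(hidden, rowvar=False)           # covariance across internal nodes
eigvals = np.linalg.eigvalsh(cov)[::-1]      # sorted, largest first

lam = 1e-2
dof = np.sum(eigvals / (eigvals + lam))      # effective dimensionality at level lam
print(f"width = {width}, degrees of freedom ~ {dof:.1f}")
```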

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A kernel-target alignment based fuzzy least square twin bounded support vector machine (KTA-FLSTBSVM) is proposed to reduce the effects of outliers and noise; it solves two systems of linear equations.

Abstract: A kernel-target alignment based fuzzy least square twin bounded support vector machine (KTA-FLSTBSVM) is proposed to reduce the effects of outliers and noise. The proposed model is an effective and efficient fuzzy least square twin bounded support vector machine for binary classification, where the membership values are assigned based on a kernel-target alignment approach. The proposed KTA-FLSTBSVM solves two systems of linear equations, which makes it computationally very fast with comparable performance. To develop a robust model, this approach minimizes the structural risk, which is the gist of statistical learning theory. This powerful KTA-FLSTBSVM approach is tested on artificial datasets as well as benchmark real-world datasets and provides significantly better results in terms of generalization performance and computational time.
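
For reference, the kernel-target alignment score mentioned above measures how well a kernel matrix $K$ matches the ideal target kernel $yy^{\top}$ built from binary labels $y \in \{-1,+1\}^{n}$; in its standard form:

```latex
A\bigl(K,\, y y^{\top}\bigr) \;=\;
\frac{\bigl\langle K,\, y y^{\top} \bigr\rangle_F}
     {\sqrt{\bigl\langle K,\, K \bigr\rangle_F \,\bigl\langle y y^{\top},\, y y^{\top} \bigr\rangle_F}}
\;=\; \frac{y^{\top} K y}{n\, \lVert K \rVert_F}.
```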

Posted Content
26 Aug 2018
TL;DR: This work develops a new theoretical framework for model compression, and proposes a new method called Spectral-Pruning based on the theory, which makes use of both "input" and "output" in each layer and is easy to implement.

Abstract: Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between practically effective compression methods and their rigorous grounding in statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias-variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.

Proceedings ArticleDOI
19 Jul 2018
TL;DR: This work presents MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different interestingness measures, from a random sample of a transactional dataset, and describes a new formulation of these measures that makes it possible to approximate them using sampling.
Abstract: We present MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different interestingness measures, from a random sample of a transactional dataset. We describe a new formulation of these measures that makes it possible to approximate them using sampling. We then discuss how pseudodimension, a key concept from statistical learning theory, relates to the sample size needed to obtain a high-quality approximation of the most interesting subgroups. We prove an upper bound on the pseudodimension of the problem at hand, which results in small sample sizes. Our evaluation on real datasets shows that MiSoSouP outperforms state-of-the-art algorithms offering the same guarantees, and it vastly speeds up the discovery of subgroups compared to analyzing the whole dataset.
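
For context, bounds of the kind used in this analysis relate the pseudodimension $d$ of the relevant function family to the sample size $m$ needed for an $\varepsilon$-approximation with probability at least $1-\delta$; in one standard additive form (constants and exact shape vary across formulations, and this is not the paper's specific bound):

```latex
m \;\ge\; \frac{c}{\varepsilon^{2}} \left( d + \ln\frac{1}{\delta} \right).
```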

Posted Content
TL;DR: A novel measure-theoretic theory for machine learning that does not require statistical assumptions is introduced and a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN.
Abstract: This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.

Posted Content
TL;DR: The mismatch principle is studied, which is a simple recipe to establish theoretical error bounds for the generalized Lasso, and the benefits of the mismatch principle are demonstrated for a variety of popular problem classes, such as single-index models, generalized linear models, and variable selection.
Abstract: We study the estimation capacity of the generalized Lasso, i.e., least squares minimization combined with a (convex) structural constraint. While Lasso-type estimators were originally designed for noisy linear regression problems, it has recently turned out that they are in fact robust against various types of model uncertainties and misspecifications, most notably, non-linearly distorted observation models. This work provides more theoretical evidence for this somewhat astonishing phenomenon. At the heart of our analysis stands the mismatch principle, which is a simple recipe to establish theoretical error bounds for the generalized Lasso. The associated estimation guarantees are of independent interest and are formulated in a fairly general setup, permitting arbitrary sub-Gaussian data, possibly with strongly correlated feature designs; in particular, we do not assume a specific observation model which connects the input and output variables. Although the mismatch principle is conceived based on ideas from statistical learning theory, its actual application area are (high-dimensional) estimation tasks for semi-parametric models. In this context, the benefits of the mismatch principle are demonstrated for a variety of popular problem classes, such as single-index models, generalized linear models, and variable selection. Apart from that, our findings are also relevant to recent advances in quantized and distributed compressed sensing.
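
For reference, the generalized (constrained) Lasso studied here takes the following form, where $K$ is a convex constraint set encoding the structural prior and $(a_i, y_i)$ are the observed input-output pairs; the notation is a common convention rather than the paper's exact one:

```latex
\hat{x} \;\in\; \operatorname*{arg\,min}_{x \in K} \;\; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \langle a_i,\, x \rangle \bigr)^{2}.
```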

Posted Content
TL;DR: This work proposes the first theoretical framework to deal with part-based data from a general perspective and explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.
Abstract: Key to structured prediction is exploiting the problem structure to simplify the learning process. A major challenge arises when data exhibit a local structure (e.g., are made by "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, has shown that capturing these aspects is indeed essential to achieve state-of-the-art performance. While such algorithms are typically derived on a case-by-case basis, in this work we propose the first theoretical framework to deal with part-based data from a general perspective. We derive a novel approach to deal with these problems and study its generalization properties within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.

Posted Content
TL;DR: The analyses of generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of $f$-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.
Abstract: We derive upper bounds on the generalization error of learning algorithms based on their algorithmic transport cost: the expected Wasserstein distance between the output hypothesis and the output hypothesis conditioned on an input example. The bounds provide a novel approach to studying the generalization of learning algorithms from an optimal transport view and impose fewer constraints on the loss function, such as sub-Gaussianity or boundedness. We further provide several upper bounds on the algorithmic transport cost in terms of total variation distance, relative entropy (or KL-divergence), and VC dimension, thus further bridging optimal transport theory and information theory with statistical learning theory. Moreover, we also study different conditions for loss functions under which the generalization error of a learning algorithm can be upper bounded by different probability metrics between distributions relating to the output hypothesis and/or the input data. Finally, under our established framework, we analyze the generalization in deep learning and conclude that the generalization error in deep neural networks (DNNs) decreases exponentially to zero as the number of layers increases. Our analyses of generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of $f$-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.

Journal ArticleDOI
Niangang Jiao, Feng Wang, Hongjian You, Mudan Yang, Xinghui Yao
24 May 2018-Sensors
TL;DR: This work proposes a new method to improve the geometric positioning accuracy without ground control points (GCPs) of ZY-3 satellite imagery with the newly proposed inherent error compensation model based on statistical learning theory.
Abstract: With the increasing demand for high-resolution remote sensing images for mapping and monitoring the Earth’s environment, geometric positioning accuracy improvement plays a significant role in the image preprocessing step. Based on statistical learning theory, we propose a new method to improve the geometric positioning accuracy without ground control points (GCPs). Multi-temporal images from the ZY-3 satellite are tested and the bias-compensated rational function model (RFM) is applied as the block adjustment model in our experiment. An easy and stable weight strategy and the fast iterative shrinkage-thresholding algorithm (FISTA), which is widely used in the field of compressive sensing, are improved and utilized to define the normal equation matrix and solve it. Then, the residual errors after traditional block adjustment are acquired and tested with the newly proposed inherent error compensation model based on statistical learning theory. The final results indicate that the geometric positioning accuracy of ZY-3 satellite imagery can be improved greatly with our proposed method.
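
The sketch below shows the generic fast iterative shrinkage-thresholding iteration for an l1-regularized least-squares problem, which is the solver family referenced above; the design matrix, step size, and regularization weight are illustrative assumptions, not the paper's adapted block-adjustment setup.

```python
# Hedged sketch: FISTA for min_x 0.5*||A x - b||^2 + lam*||x||_1.
# A, b, lam, and the iteration count are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(80, 200))
b = rng.normal(size=80)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(200)
y = x.copy()
t = 1.0
for _ in range(300):
    grad = A.T @ (A @ y - b)
    x_new = soft_threshold(y - grad / L, lam / L)
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
    x, t = x_new, t_new

print("nonzeros:", int(np.count_nonzero(np.abs(x) > 1e-8)))
```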

Posted Content
01 Oct 2018
TL;DR: A novel measure-theoretic theory for machine learning that does not require statistical assumptions is introduced and a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN.
Abstract: This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter starts by describing the concepts and assumptions necessary to ensure supervised learning, and then details the Empirical Risk Minimization (ERM) principle, which is the key point of Statistical Learning Theory (SLT).
Abstract: This chapter starts by describing the necessary concepts and assumptions to ensure supervised learning. Later on, it details the Empirical Risk Minimization (ERM) principle, which is the key point for the Statistical Learning Theory (SLT). The ERM principle provides upper bounds to make the empirical risk a good estimator for the expected risk, given the bias of some learning algorithm. This bound is the main theoretical tool to provide learning guarantees for classification tasks. Afterwards, other useful tools and concepts are introduced.
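
For reference, the two risks involved are defined as follows in standard notation, with loss $\ell$, joint distribution $P(X \times Y)$, and a sample of $n$ examples; the ERM principle then selects $f_n = \arg\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f)$:

```latex
R(f) \;=\; \mathbb{E}_{(x,y)\sim P(X \times Y)}\!\left[\, \ell\bigl(f(x),\, y\bigr) \,\right],
\qquad
R_{\mathrm{emp}}(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr).
```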

Posted Content
TL;DR: In this paper, the authors present a framework to examine the tradeoff between estimation and approximation errors, and between prediction and interpretation losses, by formulating the metrics of interpretation loss as the difference between true and estimated choice probability functions.
Abstract: While researchers increasingly use deep neural networks (DNN) to analyze individual choices, overfitting and interpretability issues remain obstacles in theory and practice. Using statistical learning theory, this study presents a framework to examine the tradeoff between estimation and approximation errors, and between prediction and interpretation losses. It operationalizes DNN interpretability in choice analysis by formulating the metrics of interpretation loss as the difference between true and estimated choice probability functions. This study also uses statistical learning theory to upper bound the estimation error of both prediction and interpretation losses in DNN, shedding light on why DNN does not have the overfitting issue. Three scenarios are then simulated to compare DNN to the binary logit model (BNL). We find that DNN outperforms BNL in terms of both prediction and interpretation for most of the scenarios, and a larger sample size unleashes the predictive power of DNN but not BNL. DNN is also used to analyze the choice of trip purposes and travel modes based on the National Household Travel Survey 2017 (NHTS2017) dataset. These experiments indicate that DNN can be used for choice analysis beyond the current practice of demand forecasting because it has an inherent utility interpretation, the flexibility to accommodate various information formats, and the power of automatically learning the utility specification. DNN is both more predictive and interpretable than BNL unless the modelers have complete knowledge about the choice task and the sample size is small. Overall, statistical learning theory can be a foundation for future studies in the non-asymptotic data regime or using high-dimensional statistical models in choice analysis, and the experiments show the feasibility and effectiveness of DNN for its wide applications to policy and behavioral analysis.

Journal ArticleDOI
TL;DR: In this article, the authors consider a version of optimal scoring in reproducing kernel Hilbert spaces, where estimators are constructed by minimizing regularized (penalized) empirical variances, as in earlier work on penalized optimal scoring.

Posted Content
TL;DR: The Shattering coefficient for any Hilbert space H containing the input space X is derived, and its effects on learning guarantees for supervised machine learning algorithms are discussed.

Abstract: Statistical Learning Theory (SLT) provides the theoretical guarantees for supervised machine learning based on the Empirical Risk Minimization Principle (ERMP). This principle defines an upper bound to ensure the uniform convergence of the empirical risk Remp(f), i.e., the error measured on a given data sample, to the expected value of the risk R(f) (a.k.a. actual risk), which depends on the joint probability distribution P(X × Y) mapping input examples x in X to class labels y in Y. The uniform convergence is only ensured when the Shattering coefficient N(F,2n) has polynomial growth behavior. This paper derives the Shattering coefficient for any Hilbert space H containing the input space X and discusses its effects in terms of learning guarantees for supervised machine learning algorithms.
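
The uniform convergence statement referenced here has the familiar Vapnik-style shape below, where the constants vary across presentations; the bound is informative only when the Shattering coefficient N(F,2n) grows polynomially in n:

```latex
P\!\left( \sup_{f \in \mathcal{F}} \bigl| R(f) - R_{\mathrm{emp}}(f) \bigr| > \varepsilon \right)
\;\le\; c_1\, \mathcal{N}(\mathcal{F},\, 2n)\, e^{-c_2\, n\, \varepsilon^{2}}.
```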

DOI
01 Jan 2018
TL;DR: This thesis theoretically investigates conditions under which HTL guarantees improved generalization on a novel task given relevant auxiliary (source) hypotheses, considering three scenarios: HTL through regularized least squares with biased regularization, through convex empirical risk minimization, and through stochastic optimization.

Abstract: The design and analysis of machine learning algorithms typically considers the problem of learning on a single task, and the nature of learning in such a scenario is well explored. On the other hand, very often tasks faced by machine learning systems arrive sequentially, and therefore it is reasonable to ask whether a better approach can be taken than retraining such systems from scratch given newly available data. Indeed, by drawing an analogy with human learning, a novel skill can be acquired more easily whenever the learner has relevant past experience. In response to this observation, the machine learning community has drawn its attention towards a form of learning known as transfer learning - learning a novel task by leveraging auxiliary information extracted from previous tasks. Tangible progress has been made in both theory and practice of transfer learning; however, many questions are still to be addressed. In this thesis we focus on an efficient type of transfer learning, known as Hypothesis Transfer Learning (HTL), where auxiliary information is retained in the form of previously induced hypotheses. This is in contrast to the large body of work where one transfers from the data associated with previously encountered tasks. In particular, we theoretically investigate conditions under which HTL guarantees improved generalization on a novel task given the relevant auxiliary (source) hypotheses. We investigate HTL theoretically by considering three scenarios: HTL through regularized least squares with biased regularization, through convex empirical risk minimization, and through stochastic optimization, which also touches on the theory of non-convex transfer learning problems. In addition, we demonstrate the benefits of HTL empirically, by proposing two algorithms tailored for real-life situations with application to visual learning problems - learning a new class in a multi-class classification setting by transferring from known classes, and an efficient greedy HTL algorithm for learning with a large number of source hypotheses. From a theoretical point of view, this thesis consistently identifies the key quantitative characteristics of relatedness between novel and previous tasks, and makes them explicit in generalization bounds. These findings corroborate many previous works in the transfer learning literature and provide a theoretical basis for the design and analysis of new HTL algorithms.
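
As a concrete instance of the first scenario, regularized least squares with biased regularization shrinks the new hypothesis toward a source hypothesis $w_{\mathrm{src}}$ rather than toward zero; this is the standard form of the estimator, with the thesis analyzing when and how much such transfer helps:

```latex
\hat{w} \;=\; \operatorname*{arg\,min}_{w} \;\; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \langle w,\, x_i \rangle \bigr)^{2}
\;+\; \lambda\, \bigl\lVert w - w_{\mathrm{src}} \bigr\rVert^{2}.
```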

Proceedings ArticleDOI
29 Dec 2018
TL;DR: The article systematically introduces the theory of support vector machines, summarizes the common training algorithms of standard (traditional) support vector machines and their existing problems, along with the new learning models and algorithms developed on this basis, and verifies the practical effect and scope of each support vector machine model through an application to transformer fault diagnosis.

Abstract: Support Vector Machine (SVM) is a machine learning method based on statistical learning theory that solves classification and regression problems by means of optimization methods. The method can effectively handle problems with small numbers of samples, nonlinearity, and high dimensionality, and largely avoids the "curse of dimensionality", over-fitting, and the local minima caused by traditional statistical theory. However, there are still some problems, such as the high complexity of the algorithm and difficulty in adapting to large-scale data. This article systematically introduces the theory of support vector machines, summarizes the common training algorithms of standard (traditional) support vector machines and their existing problems, and presents the new learning models and algorithms developed on this basis. It also verifies the practical effect and scope of each support vector machine model through an application to transformer fault diagnosis.

DissertationDOI
08 Aug 2018
TL;DR: This dissertation investigates how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data, by taking advantage of the raw feature extraction provided by deep learning techniques to produce robust results for acoustic modeling without a dependency on big data.
Abstract: SHULBY, C. D. RAMBLE: robust acoustic modeling for Brazilian learners of English. 2018. 160 p. Thesis (Doctorate in Science – Computer Science and Computational Mathematics) – Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos – SP, 2018. The gains made by current deep-learning techniques have often come with the price tag of big data, and where that data is not available, a new solution must be found. Such is the case for accented and noisy speech, where large databases do not exist and data augmentation techniques, which are less than perfect, present an even larger obstacle. Another problem is that state-of-the-art results are rarely reproducible because they use proprietary datasets, pretrained networks and/or weight initializations from other larger networks. An example of a low-resource scenario exists even in the fifth largest country in the world, home to most of the speakers of the seventh most spoken language on earth. Brazil is the leader in the Latin-American economy and as a BRIC country aspires to become an ever-stronger player in the global marketplace. Still, English proficiency is low, even for professionals in businesses and universities. Low intelligibility and strong accents can damage professional credibility. It has been established in the literature on foreign language teaching that it is important that adult learners are made aware of their errors, as outlined by the "Noticing Theory", which explains that a learner is more successful when he is able to learn from his own mistakes. An essential objective of this dissertation is to classify phonemes in the acoustic model, which is needed to properly identify phonemic errors automatically. A common belief in the community is that deep learning requires large datasets to be effective. This happens because brute-force methods create a highly complex hypothesis space which requires large and complex networks, which in turn demand a great number of data samples in order to generate useful networks. Besides that, the loss functions used in neural learning do not provide statistical learning guarantees and only guarantee that the network can memorize the training space well. In the case of accented or noisy speech, where a new sample can carry a great deal of variation from the training samples, the generalization of such models suffers. The main objective of this dissertation is to investigate how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data. The approach here is to take advantage of the raw feature extraction provided by deep learning techniques and instead focus on how learning guarantees can be provided for small datasets to produce robust results for acoustic modeling without the dependency on big data. This has been done by careful and intelligent parameter and architecture selection within the framework of statistical learning theory. Here, an intelligently defined CNN architecture, together with context windows and a knowledge-driven hierarchical tree of SVM classifiers, achieves nearly state-of-the-art frame-wise phoneme recognition results with absolutely no pretraining or external weight initialization. A goal of this thesis is to produce transparent and reproducible architectures with high frame-level accuracy, comparable to the state of the art.
Additionally, a convergence analysis based on the learning guarantees of statistical learning theory is performed in order to demonstrate the generalization capacity of the model. The model achieves 39.7% error in frame-wise classification and a 43.5% phone error rate using deep feature extraction and SVM classification, even with little data (less than 7 hours). These results are comparable to studies which use well over ten times that amount of data. Beyond the intrinsic evaluation, the model also achieves an accuracy of 88% in the identification of epenthesis, the error which is most difficult for Brazilian speakers of English. This is a 69% relative gain over the previous values in the literature. The results are significant because they show how deep feature extraction can be applied to little-data scenarios, contrary to popular belief. The extrinsic, task-based results also show how this approach could be useful in tasks like automatic error diagnosis. Another contribution is the publication of a number of freely available resources which previously did not exist, meant to aid future research in dataset creation.

Posted Content
TL;DR: A review of statistical machine learning from a Bayesian decision theoretic point of view is presented in this article, where the authors argue that many SML techniques are closely connected to making inference by using the so-called Bayesian paradigm.
Abstract: Statistical Machine Learning (SML) refers to a body of algorithms and methods by which computers are allowed to discover important features of input data sets which are often very large in size. The very task of feature discovery from data is essentially the meaning of the keyword 'learning' in SML. Theoretical justifications for the effectiveness of the SML algorithms are underpinned by sound principles from different disciplines, such as Computer Science and Statistics. The theoretical underpinnings particularly justified by statistical inference methods are together termed statistical learning theory. This paper provides a review of SML from a Bayesian decision theoretic point of view, in which we argue that many SML techniques are closely connected to making inference by using the so-called Bayesian paradigm. We discuss many important SML techniques such as supervised and unsupervised learning, deep learning, online learning and Gaussian processes, especially in the context of very large data sets where these are often employed. We present a dictionary which maps the key concepts of SML from Computer Science and Statistics. We illustrate the SML techniques with three moderately large data sets, where we also discuss many practical implementation issues. Thus the review is especially targeted at statisticians and computer scientists who are aspiring to understand and apply SML to moderately large to big data sets.

Posted Content
TL;DR: There is a strong need for further mathematical developments on the foundations of machine learning methods to increase the level of rigor of employed methods and to ensure more reliable and interpretable results.
Paul J. Atzberger
Abstract: There has been a lot of recent interest in adopting machine learning methods for scientific and engineering applications. This has in large part been inspired by recent successes and advances in the domains of Natural Language Processing (NLP) and Image Classification (IC). However, scientific and engineering problems have their own unique characteristics and requirements, raising new challenges for the effective design and deployment of machine learning approaches. There is a strong need for further mathematical developments on the foundations of machine learning methods to increase the level of rigor of employed methods and to ensure more reliable and interpretable results. Also, as reported in the recent literature on state-of-the-art results and as indicated by the No Free Lunch Theorems of statistical learning theory, incorporating some form of inductive bias and domain knowledge is essential to success. Consequently, even for existing and widely used methods there is a strong need for further mathematical work to facilitate ways to incorporate prior scientific knowledge and related inductive biases into learning frameworks and algorithms. We briefly discuss these topics and some ideas for proceeding in this direction.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: This work highlights the landscape of the empirical error in the generative case, completing the full picture through an exquisite design of image super-resolution under norm-based capacity control; a theoretical advance in the interpretation of the training dynamics is achieved from both mathematical and biological sides.

Abstract: Despite its remarkable empirical success as a highly competitive branch of artificial intelligence, deep learning is often blamed for its widely known low interpretability and lack of a firm and rigorous mathematical foundation. However, most theoretical endeavor has been devoted to the discriminative deep learning case, whose complementary part is generative deep learning. To the best of our knowledge, we are the first to highlight the landscape of the empirical error in the generative case to complete the full picture, through an exquisite design of image super-resolution under norm-based capacity control. Our theoretical advance in the interpretation of the training dynamics is achieved from both mathematical and biological sides.