
Showing papers on "Statistical learning theory published in 2018"


Journal ArticleDOI
TL;DR: This work proposes a robust approach based on a recent nature-inspired metaheuristic called multi-verse optimizer (MVO) for selecting optimal features and optimizing the parameters of SVM simultaneously.
Abstract: Support vector machine (SVM) is a well-regarded machine learning algorithm widely applied to classification tasks and regression problems. SVM is founded on statistical learning theory and the structural risk minimization principle. Despite the high prediction rate of this technique in a wide range of real applications, the efficiency of SVM and its classification accuracy depend heavily on the parameter settings as well as on feature subset selection. This work proposes a robust approach based on a recent nature-inspired metaheuristic called the multi-verse optimizer (MVO) for selecting optimal features and optimizing the parameters of SVM simultaneously. In fact, the MVO algorithm is employed as a tuner to manipulate the main parameters of SVM and find the optimal set of features for this classifier. The proposed approach is implemented and tested on two different system architectures. MVO is benchmarked and compared with four classic and recent metaheuristic algorithms using ten binary and multi-class labeled datasets. Experimental results demonstrate that MVO can effectively reduce the number of features while maintaining a high prediction accuracy.
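
To make the wrapper idea above concrete, the sketch below shows the inner evaluation loop such an approach needs: each candidate encodes SVM hyperparameters plus a feature mask and is scored by cross-validation. A plain random search stands in for MVO here, and the dataset, parameter ranges, and number of candidates are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch: tuning SVM hyperparameters and a feature mask with a
# population-based search. A plain random search stands in for MVO; the
# dataset, ranges, and candidate count are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

def fitness(log_c, log_gamma, mask):
    if not mask.any():                      # an empty feature subset is invalid
        return -np.inf
    clf = SVC(C=10.0 ** log_c, gamma=10.0 ** log_gamma, kernel="rbf")
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

best_score, best = -np.inf, None
for _ in range(50):                         # candidate solutions ("universes")
    cand = (rng.uniform(-2, 3), rng.uniform(-4, 1), rng.random(X.shape[1]) < 0.5)
    score = fitness(*cand)
    if score > best_score:
        best_score, best = score, cand

print(f"best CV accuracy: {best_score:.3f} with {best[2].sum()} features")
```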

187 citations


Book
28 Feb 2018
TL;DR: This monograph surveys direct approaches to non-convex optimization for large-scale machine learning problems, such as projected gradient descent and alternating minimization, which frequently outperform convex relaxation-based techniques yet are often poorly understood in terms of convergence.
Abstract: A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks. The freedom to express the learning problem as a non-convex optimization problem gives immense modeling power to the algorithm designer, but often such problems are NP-hard to solve. A popular workaround to this has been to relax non-convex problems to convex ones and use traditional methods to solve the convex relaxed optimization problems. However this approach may be lossy and nevertheless presents significant challenges for large scale optimization. On the other hand, direct approaches to non-convex optimization have met with resounding success in several domains and remain the methods of choice for the practitioner, as they frequently outperform relaxation-based techniques - popular heuristics include projected gradient descent and alternating minimization. However, these are often poorly understood in terms of their convergence and other properties. This monograph presents a selection of recent advances that bridge a long-standing gap in our understanding of these heuristics. We hope that an insight into the inner workings of these methods will allow the reader to appreciate the unique marriage of task structure and generative models that allow these heuristic techniques to provably succeed. The monograph will lead the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal of this monograph is to both introduce the rich literature in this area, as well as equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.
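
As a concrete instance of the heuristics named above, the sketch below runs projected gradient descent on a sparsity-constrained least-squares problem (iterative hard thresholding). The problem sizes, step size, and sparsity level are illustrative assumptions, not taken from the monograph.

```python
# Hedged sketch: projected gradient descent for sparsity-constrained least
# squares (iterative hard thresholding). Sizes, step size, and the sparsity
# level k are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 400, 5
A = rng.normal(size=(n, d)) / np.sqrt(n)
x_true = np.zeros(d)
x_true[:k] = rng.normal(size=k)
y = A @ x_true

def project_sparse(v, k):
    """Project onto the non-convex set of k-sparse vectors."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

step = 1.0 / np.linalg.norm(A, 2) ** 2      # safe step: 1 / Lipschitz constant
x = np.zeros(d)
for _ in range(500):
    grad = A.T @ (A @ x - y)                # gradient of 0.5 * ||A x - y||^2
    x = project_sparse(x - step * grad, k)

print("relative recovery error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```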

184 citations


Journal ArticleDOI
TL;DR: A support vector machine (SVM) classifier is used for fault detection in WSNs; the resulting decision function is light enough to be executed at cluster heads to detect anomalous sensors.

Abstract: Wireless sensor networks (WSNs) are prone to many failures such as hardware failures, software failures, and communication failures. Fault detection in WSNs is a challenging problem due to sensor resource limitations and the variety of deployment fields. Furthermore, the detection has to be precise to avoid negative alerts, and rapid to limit loss. The use of machine learning seems to be one of the most convenient solutions for detecting failures in WSNs. In this paper, the support vector machine (SVM) classification method is used for this purpose. Based on statistical learning theory, SVM is used in our context to define a decision function. As a light process in terms of required resources, this decision function can be easily executed at cluster heads to detect anomalous sensors. The effectiveness of SVM for fault detection in WSNs is shown through an experimental study comparing it to the latest methods for the same application.

174 citations


Posted Content
TL;DR: CGnets, a deep learning approach that learns coarse-grained free energy functions and can be trained by a force-matching scheme, is introduced; CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface.

Abstract: Atomistic or ab-initio molecular dynamics simulations are widely used to predict thermodynamics and kinetics and relate them to molecular structure. A common approach to go beyond the time- and length-scales accessible with such computationally expensive simulations is the definition of coarse-grained molecular models. Existing coarse-graining approaches define an effective interaction potential to match defined properties of high-resolution models or experimental data. In this paper, we reformulate coarse-graining as a supervised machine learning problem. We use statistical learning theory to decompose the coarse-graining error and cross-validation to select and compare the performance of different models. We introduce CGnets, a deep learning approach that learns coarse-grained free energy functions and can be trained by a force-matching scheme. CGnets maintain all physically relevant invariances and allow one to incorporate prior physics knowledge to avoid sampling of unphysical structures. We show that CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface. Thus, CGnets are able to capture multi-body terms that emerge from the dimensionality reduction.
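
The sketch below illustrates a force-matching loss in the spirit of the approach above: a network predicts a scalar free energy from coarse-grained coordinates, forces are obtained by automatic differentiation, and training matches them to reference forces. The tiny MLP, shapes, and random stand-in data are assumptions, not the CGnets architecture.

```python
# Hedged sketch of a force-matching loss for a coarse-grained energy network.
# The MLP, bead count, and random data are illustrative assumptions.
import torch
import torch.nn as nn

n_beads = 10
model = nn.Sequential(nn.Linear(3 * n_beads, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def force_matching_loss(coords, ref_forces):
    coords = coords.requires_grad_(True)
    energy = model(coords.reshape(coords.shape[0], -1)).sum()
    # predicted forces are the negative gradient of the energy w.r.t. coordinates
    pred_forces = -torch.autograd.grad(energy, coords, create_graph=True)[0]
    return ((pred_forces - ref_forces) ** 2).mean()

# one illustrative training step on random stand-in data
coords = torch.randn(32, n_beads, 3)
ref_forces = torch.randn(32, n_beads, 3)
loss = force_matching_loss(coords, ref_forces)
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```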

106 citations


Proceedings ArticleDOI
12 Jan 2018
TL;DR: This paper derives generalization error bounds for a broad class of iterative algorithms characterized by bounded, noisy updates with Markovian structure, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm.
Abstract: In statistical learning theory, generalization error is used to quantify the degree to which a supervised machine learning algorithm may overfit to training data. Recent work [Xu and Raginsky (2017)] has established a bound on the generalization error of empirical risk minimization based on the mutual information $I(S;W)$ between the algorithm input $S$ and the algorithm output $W$, when the loss function is sub-Gaussian. We leverage these results to derive generalization error bounds for a broad class of iterative algorithms that are characterized by bounded, noisy updates with Markovian structure. Our bounds are very general and are applicable to numerous settings of interest, including stochastic gradient Langevin dynamics (SGLD) and variants of the stochastic gradient Hamiltonian Monte Carlo (SGHMC) algorithm. Furthermore, our error bounds hold for any output function computed over the path of iterates, including the last iterate of the algorithm or the average of subsets of iterates, and also allow for non-uniform sampling of data in successive updates of the algorithm.
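
For context, the Xu and Raginsky (2017) bound referenced above controls the expected generalization gap of an algorithm with output $W$ trained on a sample $S$ of $n$ examples, when the loss is $\sigma$-sub-Gaussian; in one common formulation:

```latex
\Bigl|\, \mathbb{E}\bigl[ R(W) - R_{\mathrm{emp}}(W, S) \bigr] \,\Bigr|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I(S;\, W)}.
```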

84 citations


Journal ArticleDOI
TL;DR: A short and elementary proof that PhaseMax exactly recovers real-valued vectors from random measurements under optimal sample complexity is presented, yielding a simpler and more direct proof than those that require statistical learning theory, geometric probability or the highly technical arguments for Wirtinger Flow-like approaches.
Abstract: The phase retrieval problem has garnered significant attention since the development of the PhaseLift algorithm, which is a convex program that operates in a lifted space of matrices. Because of the substantial computational cost due to lifting, many approaches to phase retrieval have been developed, including non-convex optimization algorithms which operate in the natural parameter space, such as Wirtinger Flow. Very recently, a convex formulation called PhaseMax has been discovered, and it has been proven to achieve phase retrieval via linear programming in the natural parameter space under optimal sample complexity. The current proofs of PhaseMax rely on statistical learning theory or geometric probability theory. Here, we present a short and elementary proof that PhaseMax exactly recovers real-valued vectors from random measurements under optimal sample complexity. Our proof only relies on standard probabilistic concentration and covering arguments, yielding a simpler and more direct proof than those that require statistical learning theory, geometric probability or the highly technical arguments for Wirtinger Flow-like approaches.
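
For reference, the PhaseMax program discussed above is the following convex problem in the natural parameter space: given magnitude measurements $b_i = |\langle a_i, x^{\natural}\rangle|$ and an anchor vector $\hat{x}$ (an initial guess correlated with $x^{\natural}$), one solves (the notation is a common convention, not necessarily the paper's):

```latex
\max_{x \in \mathbb{R}^{d}} \;\; \langle \hat{x},\, x \rangle
\quad \text{subject to} \quad |\langle a_i,\, x \rangle| \le b_i, \quad i = 1,\dots,m.
```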

47 citations


Journal ArticleDOI
TL;DR: The SVM framework is reformulated such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix—the latter modeling the uncertainty.
Abstract: In this paper, we propose a maximum margin classifier that deals with uncertainty in data input. More specifically, we reformulate the SVM framework such that each training example can be modeled by a multi-dimensional Gaussian distribution described by its mean vector and its covariance matrix—the latter modeling the uncertainty. We address the classification problem and define a cost function that is the expected value of the classical SVM cost when data samples are drawn from the multi-dimensional Gaussian distributions that form the set of the training examples. Our formulation approximates the classical SVM formulation when the training examples are isotropic Gaussians with variance tending to zero. We arrive at a convex optimization problem, which we solve efficiently in the primal form using a stochastic gradient descent approach. The resulting classifier, which we name SVM with Gaussian Sample Uncertainty (SVM-GSU), is tested on synthetic data and five publicly available and popular datasets; namely, the MNIST, WDBC, DEAP, TV News Channel Commercial Detection, and TRECVID MED datasets. Experimental results verify the effectiveness of the proposed method.

40 citations


Book ChapterDOI
01 Jan 2018
TL;DR: In statistical learning theory (regression, classification, etc.) there are many regression models, such as algebraic polynomials, which help in the development of models for classification.
Abstract: In statistical learning theory (regression, classification, etc.) there are many regression models, such as algebraic polynomials,

28 citations


Journal ArticleDOI
TL;DR: This paper presents the convex programming problems underlying SVM, focusing on supervised binary classification, analyzes the most important and widely used optimization methods for SVM training problems, and discusses how the properties of these problems can be incorporated into the design of useful algorithms.
Abstract: Support Vector Machine (SVM) is one of the most important class of machine learning models and algorithms, and has been successfully applied in various fields. Nonlinear optimization plays a crucial role in SVM methodology, both in defining the machine learning models and in designing convergent and efficient algorithms for large-scale training problems. In this paper we present the convex programming problems underlying SVM focusing on supervised binary classification. We analyze the most important and used optimization methods for SVM training problems, and we discuss how the properties of these problems can be incorporated in designing useful algorithms.
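
For reference, the core convex program in question is the standard soft-margin SVM primal for binary labels $y_i \in \{-1,+1\}$ (the textbook formulation, not anything specific to this paper):

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\bigl(w^{\top}x_i + b\bigr) \ge 1 - \xi_i,\;\; \xi_i \ge 0,\;\; i = 1,\dots,n.
```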

26 citations


Proceedings ArticleDOI
17 Jun 2018
TL;DR: This paper derives the minimax universal learning solution, a redundancy-capacity theorem, and an upper bound on the performance of the optimal solution for batch learning with log-loss.

Abstract: In this paper we consider the problem of batch learning with log-loss, in a stochastic setting where, given the data features, the outcome is generated by an unknown distribution from a class of models. Utilizing the minimax theorem and information-theoretic tools, we derive the minimax universal learning solution, a redundancy-capacity theorem and an upper bound on the performance of the optimal solution. The resulting universal learning solution is a mixture over the models in the considered class. Furthermore, we obtain a better bound on the generalization error that decays as $O(\log N/N)$ , where $N$ is the sample size, instead of the $O(\sqrt{\log N/N})$ that is commonly attained in statistical learning theory for the empirical risk minimizer.

20 citations


Posted Content
TL;DR: A new theoretical framework for model compression is developed and a new pruning method called spectral pruning is proposed based on this framework; the framework defines the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes.

Abstract: Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between practically effective compression methods and their rigorous grounding in statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias-variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.
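
To make the "degrees of freedom" idea tangible, the sketch below computes the eigenvalue spectrum of a layer's activation covariance and a ridge-type effective-dimensionality quantity. This is an illustrative reading of the quantity described above, not the paper's exact definition or its pruning rule; the random activations and the regularization level are stand-ins.

```python
# Hedged sketch: eigenvalue spectrum of the activation covariance of one layer
# and a ridge-type "degrees of freedom" quantity. The random activations and
# the level lam are stand-ins; the paper's exact definition may differ.
import numpy as np

rng = np.random.default_rng(0)
n_samples, width = 1000, 256
hidden = rng.normal(size=(n_samples, width)) @ rng.normal(size=(width, width)) * 0.1

cov = np.cov(hidden, rowvar=False)           # covariance across internal nodes
eigvals = np.linalg.eigvalsh(cov)[::-1]      # sorted, largest first

lam = 1e-2
dof = np.sum(eigvals / (eigvals + lam))      # effective dimensionality at level lam
print(f"width = {width}, degrees of freedom ~ {dof:.1f}")
```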

Proceedings ArticleDOI
01 Nov 2018
TL;DR: A kernel-target alignment based fuzzy least square twin bounded support vector machine (KTA-FLSTBSVM) is proposed to reduce the effects of outliers and noise; it solves two systems of linear equations.

Abstract: A kernel-target alignment based fuzzy least square twin bounded support vector machine (KTA-FLSTBSVM) is proposed to reduce the effects of outliers and noise. The proposed model is an effective and efficient fuzzy least square twin bounded support vector machine for binary classification, where the membership values are assigned based on a kernel-target alignment approach. The proposed KTA-FLSTBSVM solves two systems of linear equations, which makes it computationally very fast with comparable performance. To develop a robust model, this approach minimizes the structural risk, which is the gist of statistical learning theory. This powerful KTA-FLSTBSVM approach is tested on artificial datasets as well as benchmark real-world datasets and provides significantly better results in terms of generalization performance and computational time.
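
For reference, the kernel-target alignment score mentioned above measures how well a kernel matrix $K$ matches the ideal target kernel $yy^{\top}$ built from binary labels $y \in \{-1,+1\}^{n}$; in its standard form:

```latex
A\bigl(K,\, y y^{\top}\bigr) \;=\;
\frac{\bigl\langle K,\, y y^{\top} \bigr\rangle_F}
     {\sqrt{\bigl\langle K,\, K \bigr\rangle_F \,\bigl\langle y y^{\top},\, y y^{\top} \bigr\rangle_F}}
\;=\; \frac{y^{\top} K y}{n\, \lVert K \rVert_F}.
```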

Posted Content
26 Aug 2018
TL;DR: This work develops a new theoretical framework for model compression, and proposes a new method called Spectral-Pruning based on the theory, which makes use of both "input" and "output" in each layer and is easy to implement.

Abstract: Compression techniques for deep neural network models are becoming very important for the efficient execution of high-performance deep learning systems on edge-computing devices. The concept of model compression is also important for analyzing the generalization error of deep learning, known as the compression-based error bound. However, there is still a huge gap between practically effective compression methods and their rigorous grounding in statistical learning theory. To resolve this issue, we develop a new theoretical framework for model compression and propose a new pruning method called spectral pruning based on this framework. We define the "degrees of freedom" to quantify the intrinsic dimensionality of a model by using the eigenvalue distribution of the covariance matrix across the internal nodes and show that the compression ability is essentially controlled by this quantity. Moreover, we present a sharp generalization error bound of the compressed model and characterize the bias-variance tradeoff induced by the compression procedure. We apply our method to several datasets to justify our theoretical analyses and show the superiority of the proposed method.

Proceedings ArticleDOI
19 Jul 2018
TL;DR: This work presents MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different interestingness measures, from a random sample of a transactional dataset, and describes a new formulation of these measures that makes it possible to approximate them using sampling.
Abstract: We present MiSoSouP, a suite of algorithms for extracting high-quality approximations of the most interesting subgroups, according to different interestingness measures, from a random sample of a transactional dataset. We describe a new formulation of these measures that makes it possible to approximate them using sampling. We then discuss how pseudodimension, a key concept from statistical learning theory, relates to the sample size needed to obtain a high-quality approximation of the most interesting subgroups. We prove an upper bound on the pseudodimension of the problem at hand, which results in small sample sizes. Our evaluation on real datasets shows that MiSoSouP outperforms state-of-the-art algorithms offering the same guarantees, and it vastly speeds up the discovery of subgroups compared to analyzing the whole dataset.
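
For context, bounds of the kind used in this analysis relate the pseudodimension $d$ of the relevant function family to the sample size $m$ needed for an $\varepsilon$-approximation with probability at least $1-\delta$; in one standard additive form (constants and exact shape vary across formulations, and this is not the paper's specific bound):

```latex
m \;\ge\; \frac{c}{\varepsilon^{2}} \left( d + \ln\frac{1}{\delta} \right).
```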

Posted Content
TL;DR: A novel measure-theoretic theory for machine learning that does not require statistical assumptions is introduced and a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN.
Abstract: This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.

Posted Content
TL;DR: The mismatch principle is studied, which is a simple recipe to establish theoretical error bounds for the generalized Lasso, and the benefits of the mismatch principle are demonstrated for a variety of popular problem classes, such as single-index models, generalized linear models, and variable selection.
Abstract: We study the estimation capacity of the generalized Lasso, i.e., least squares minimization combined with a (convex) structural constraint. While Lasso-type estimators were originally designed for noisy linear regression problems, it has recently turned out that they are in fact robust against various types of model uncertainties and misspecifications, most notably, non-linearly distorted observation models. This work provides more theoretical evidence for this somewhat astonishing phenomenon. At the heart of our analysis stands the mismatch principle, which is a simple recipe to establish theoretical error bounds for the generalized Lasso. The associated estimation guarantees are of independent interest and are formulated in a fairly general setup, permitting arbitrary sub-Gaussian data, possibly with strongly correlated feature designs; in particular, we do not assume a specific observation model which connects the input and output variables. Although the mismatch principle is conceived based on ideas from statistical learning theory, its actual application area are (high-dimensional) estimation tasks for semi-parametric models. In this context, the benefits of the mismatch principle are demonstrated for a variety of popular problem classes, such as single-index models, generalized linear models, and variable selection. Apart from that, our findings are also relevant to recent advances in quantized and distributed compressed sensing.
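
For reference, the generalized (constrained) Lasso studied here takes the following form, where $K$ is a convex constraint set encoding the structural prior and $(a_i, y_i)$ are the observed input-output pairs; the notation is a common convention rather than the paper's exact one:

```latex
\hat{x} \;\in\; \operatorname*{arg\,min}_{x \in K} \;\; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \langle a_i,\, x \rangle \bigr)^{2}.
```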

Posted Content
TL;DR: This work proposes the first theoretical framework to deal with part-based data from a general perspective and explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.
Abstract: Key to structured prediction is exploiting the problem structure to simplify the learning process. A major challenge arises when data exhibit a local structure (e.g., are made by "parts") that can be leveraged to better approximate the relation between (parts of) the input and (parts of) the output. Recent literature on signal processing, and in particular computer vision, has shown that capturing these aspects is indeed essential to achieve state-of-the-art performance. While such algorithms are typically derived on a case-by-case basis, in this work we propose the first theoretical framework to deal with part-based data from a general perspective. We derive a novel approach to deal with these problems and study its generalization properties within the setting of statistical learning theory. Our analysis is novel in that it explicitly quantifies the benefits of leveraging the part-based structure of the problem with respect to the learning rates of the proposed estimator.

Posted Content
TL;DR: The analyses of generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of $f$-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.
Abstract: We derive upper bounds on the generalization error of learning algorithms based on their algorithmic transport cost: the expected Wasserstein distance between the output hypothesis and the output hypothesis conditioned on an input example. The bounds provide a novel approach to studying the generalization of learning algorithms from an optimal transport view and impose fewer constraints on the loss function, such as sub-Gaussianity or boundedness. We further provide several upper bounds on the algorithmic transport cost in terms of total variation distance, relative entropy (or KL-divergence), and VC dimension, thus further bridging optimal transport theory and information theory with statistical learning theory. Moreover, we also study different conditions for loss functions under which the generalization error of a learning algorithm can be upper bounded by different probability metrics between distributions relating to the output hypothesis and/or the input data. Finally, under our established framework, we analyze the generalization in deep learning and conclude that the generalization error in deep neural networks (DNNs) decreases exponentially to zero as the number of layers increases. Our analyses of generalization error in deep learning mainly exploit the hierarchical structure in DNNs and the contraction property of $f$-divergence, which may be of independent interest in analyzing other learning models with hierarchical structure.

Journal ArticleDOI
Niangang Jiao, Feng Wang, Hongjian You, Mudan Yang, Xinghui Yao
24 May 2018-Sensors
TL;DR: This work proposes a new method to improve the geometric positioning accuracy without ground control points (GCPs) of ZY-3 satellite imagery with the newly proposed inherent error compensation model based on statistical learning theory.
Abstract: With the increasing demand for high-resolution remote sensing images for mapping and monitoring the Earth’s environment, geometric positioning accuracy improvement plays a significant role in the image preprocessing step. Based on statistical learning theory, we propose a new method to improve the geometric positioning accuracy without ground control points (GCPs). Multi-temporal images from the ZY-3 satellite are tested and the bias-compensated rational function model (RFM) is applied as the block adjustment model in our experiment. An easy and stable weight strategy and the fast iterative shrinkage-thresholding algorithm (FISTA), which is widely used in the field of compressive sensing, are improved and utilized to define the normal equation matrix and solve it. Then, the residual errors after traditional block adjustment are acquired and tested with the newly proposed inherent error compensation model based on statistical learning theory. The final results indicate that the geometric positioning accuracy of ZY-3 satellite imagery can be improved greatly with our proposed method.
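
The sketch below shows the generic fast iterative shrinkage-thresholding iteration for an l1-regularized least-squares problem, which is the solver family referenced above; the design matrix, step size, and regularization weight are illustrative assumptions, not the paper's adapted block-adjustment setup.

```python
# Hedged sketch: FISTA for min_x 0.5*||A x - b||^2 + lam*||x||_1.
# A, b, lam, and the iteration count are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(80, 200))
b = rng.normal(size=80)
lam = 0.1
L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the smooth part

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(200)
y = x.copy()
t = 1.0
for _ in range(300):
    grad = A.T @ (A @ y - b)
    x_new = soft_threshold(y - grad / L, lam / L)
    t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
    x, t = x_new, t_new

print("nonzeros:", int(np.count_nonzero(np.abs(x) > 1e-8)))
```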

Posted Content
01 Oct 2018
TL;DR: A novel measure-theoretic theory for machine learning that does not require statistical assumptions is introduced and a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN.
Abstract: This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.

Book ChapterDOI
01 Jan 2018
TL;DR: This chapter starts by describing the concepts and assumptions necessary to ensure supervised learning, and then details the Empirical Risk Minimization (ERM) principle, which is the key point of Statistical Learning Theory (SLT).
Abstract: This chapter starts by describing the necessary concepts and assumptions to ensure supervised learning. Later on, it details the Empirical Risk Minimization (ERM) principle, which is the key point for the Statistical Learning Theory (SLT). The ERM principle provides upper bounds to make the empirical risk a good estimator for the expected risk, given the bias of some learning algorithm. This bound is the main theoretical tool to provide learning guarantees for classification tasks. Afterwards, other useful tools and concepts are introduced.
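
For reference, the two risks involved are defined as follows in standard notation, with loss $\ell$, joint distribution $P(X \times Y)$, and a sample of $n$ examples; the ERM principle then selects $f_n = \arg\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f)$:

```latex
R(f) \;=\; \mathbb{E}_{(x,y)\sim P(X \times Y)}\!\left[\, \ell\bigl(f(x),\, y\bigr) \,\right],
\qquad
R_{\mathrm{emp}}(f) \;=\; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(f(x_i),\, y_i\bigr).
```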

Posted Content
TL;DR: In this paper, the authors present a framework to examine the tradeoff between estimation and approximation errors, and between prediction and interpretation losses, by formulating the metrics of interpretation loss as the difference between true and estimated choice probability functions.
Abstract: While researchers increasingly use deep neural networks (DNN) to analyze individual choices, overfitting and interpretability issues remain obstacles in theory and practice. Using statistical learning theory, this study presents a framework to examine the tradeoff between estimation and approximation errors, and between prediction and interpretation losses. It operationalizes DNN interpretability in choice analysis by formulating the metrics of interpretation loss as the difference between true and estimated choice probability functions. This study also uses statistical learning theory to upper bound the estimation error of both prediction and interpretation losses in DNN, shedding light on why DNN does not have the overfitting issue. Three scenarios are then simulated to compare DNN to the binary logit model (BNL). We find that DNN outperforms BNL in terms of both prediction and interpretation for most of the scenarios, and a larger sample size unleashes the predictive power of DNN but not BNL. DNN is also used to analyze the choice of trip purposes and travel modes based on the National Household Travel Survey 2017 (NHTS2017) dataset. These experiments indicate that DNN can be used for choice analysis beyond the current practice of demand forecasting because it has an inherent utility interpretation, the flexibility to accommodate various information formats, and the power of automatically learning the utility specification. DNN is both more predictive and interpretable than BNL unless the modelers have complete knowledge about the choice task and the sample size is small. Overall, statistical learning theory can be a foundation for future studies in the non-asymptotic data regime or using high-dimensional statistical models in choice analysis, and the experiments show the feasibility and effectiveness of DNN for its wide applications to policy and behavioral analysis.

Journal ArticleDOI
TL;DR: In this article, the authors consider a version of optimal scoring in reproducing kernel Hilbert spaces, where estimators are constructed by minimizing regularized (penalized) empirical variances, as in earlier work on penalized optimal scoring.

Posted Content
TL;DR: The Shattering coefficient for any Hilbert space H containing the input space X is derived, and its effects on learning guarantees for supervised machine learning algorithms are discussed.

Abstract: Statistical Learning Theory (SLT) provides the theoretical guarantees for supervised machine learning based on the Empirical Risk Minimization Principle (ERMP). This principle defines an upper bound to ensure the uniform convergence of the empirical risk Remp(f), i.e., the error measured on a given data sample, to the expected value of the risk R(f) (a.k.a. actual risk), which depends on the joint probability distribution P(X × Y) mapping input examples x in X to class labels y in Y. The uniform convergence is only ensured when the Shattering coefficient N(F,2n) has polynomial growth behavior. This paper derives the Shattering coefficient for any Hilbert space H containing the input space X and discusses its effects in terms of learning guarantees for supervised machine learning algorithms.
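
The uniform convergence statement referenced here has the familiar Vapnik-style shape below, where the constants vary across presentations; the bound is informative only when the Shattering coefficient N(F,2n) grows polynomially in n:

```latex
P\!\left( \sup_{f \in \mathcal{F}} \bigl| R(f) - R_{\mathrm{emp}}(f) \bigr| > \varepsilon \right)
\;\le\; c_1\, \mathcal{N}(\mathcal{F},\, 2n)\, e^{-c_2\, n\, \varepsilon^{2}}.
```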

DOI
01 Jan 2018
TL;DR: This thesis theoretically investigates conditions under which HTL guarantees improved generalization on a novel task given relevant auxiliary (source) hypotheses, considering three scenarios: HTL through regularized least squares with biased regularization, through convex empirical risk minimization, and through stochastic optimization.

Abstract: The design and analysis of machine learning algorithms typically considers the problem of learning on a single task, and the nature of learning in such a scenario is well explored. On the other hand, very often tasks faced by machine learning systems arrive sequentially, and therefore it is reasonable to ask whether a better approach can be taken than retraining such systems from scratch given newly available data. Indeed, by drawing an analogy with human learning, a novel skill can be acquired more easily whenever the learner has relevant past experience. In response to this observation, the machine learning community has drawn its attention towards a form of learning known as transfer learning - learning a novel task by leveraging auxiliary information extracted from previous tasks. Tangible progress has been made in both theory and practice of transfer learning; however, many questions are still to be addressed. In this thesis we focus on an efficient type of transfer learning, known as Hypothesis Transfer Learning (HTL), where auxiliary information is retained in the form of previously induced hypotheses. This is in contrast to the large body of work where one transfers from the data associated with previously encountered tasks. In particular, we theoretically investigate conditions under which HTL guarantees improved generalization on a novel task given the relevant auxiliary (source) hypotheses. We investigate HTL theoretically by considering three scenarios: HTL through regularized least squares with biased regularization, through convex empirical risk minimization, and through stochastic optimization, which also touches on the theory of non-convex transfer learning problems. In addition, we demonstrate the benefits of HTL empirically, by proposing two algorithms tailored for real-life situations with application to visual learning problems - learning a new class in a multi-class classification setting by transferring from known classes, and an efficient greedy HTL algorithm for learning with a large number of source hypotheses. From a theoretical point of view, this thesis consistently identifies the key quantitative characteristics of relatedness between novel and previous tasks, and makes them explicit in generalization bounds. These findings corroborate many previous works in the transfer learning literature and provide a theoretical basis for the design and analysis of new HTL algorithms.
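
As a concrete instance of the first scenario, regularized least squares with biased regularization shrinks the new hypothesis toward a source hypothesis $w_{\mathrm{src}}$ rather than toward zero; this is the standard form of the estimator, with the thesis analyzing when and how much such transfer helps:

```latex
\hat{w} \;=\; \operatorname*{arg\,min}_{w} \;\; \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \langle w,\, x_i \rangle \bigr)^{2}
\;+\; \lambda\, \bigl\lVert w - w_{\mathrm{src}} \bigr\rVert^{2}.
```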

Proceedings ArticleDOI
29 Dec 2018
TL;DR: The article systematically introduces the theory of support vector machines, summarizes the common training algorithms of standard (traditional) support vector machines and their existing problems, along with the new learning models and algorithms developed on this basis, and verifies the practical effect and scope of each support vector machine model through an application to transformer fault diagnosis.

Abstract: Support Vector Machine (SVM) is a machine learning method based on statistical learning theory that solves classification and regression problems by means of optimization methods. The method can effectively handle problems with small numbers of samples, nonlinearity, and high dimensionality, and largely avoids the "curse of dimensionality", over-fitting, and the local minima caused by traditional statistical theory. However, there are still some problems, such as the high complexity of the algorithm and difficulty in adapting to large-scale data. This article systematically introduces the theory of support vector machines, summarizes the common training algorithms of standard (traditional) support vector machines and their existing problems, and presents the new learning models and algorithms developed on this basis. It also verifies the practical effect and scope of each support vector machine model through an application to transformer fault diagnosis.

DissertationDOI
08 Aug 2018
TL;DR: This dissertation investigates how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data, by taking advantage of the raw feature extraction provided by deep learning techniques to produce robust results for acoustic modeling without a dependency on big data.
Abstract: SHULBY, C. D. RAMBLE: robust acoustic modeling for Brazilian learners of English. 2018. 160 p. Thesis (Doctorate in Science – Computer Science and Computational Mathematics) – Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos – SP, 2018. The gains made by current deep-learning techniques have often come with the price tag of big data, and where that data is not available, a new solution must be found. Such is the case for accented and noisy speech, where large databases do not exist and data augmentation techniques, which are less than perfect, present an even larger obstacle. Another problem is that state-of-the-art results are rarely reproducible because they use proprietary datasets, pretrained networks and/or weight initializations from other larger networks. An example of a low-resource scenario exists even in the fifth largest country in the world, home to most of the speakers of the seventh most spoken language on earth. Brazil is the leader in the Latin-American economy and as a BRIC country aspires to become an ever-stronger player in the global marketplace. Still, English proficiency is low, even for professionals in businesses and universities. Low intelligibility and strong accents can damage professional credibility. It has been established in the literature on foreign language teaching that it is important that adult learners are made aware of their errors, as outlined by the "Noticing Theory", which explains that a learner is more successful when he is able to learn from his own mistakes. An essential objective of this dissertation is to classify phonemes in the acoustic model, which is needed to properly identify phonemic errors automatically. A common belief in the community is that deep learning requires large datasets to be effective. This happens because brute-force methods create a highly complex hypothesis space which requires large and complex networks, which in turn demand a great number of data samples in order to generate useful networks. Besides that, the loss functions used in neural learning do not provide statistical learning guarantees and only guarantee that the network can memorize the training space well. In the case of accented or noisy speech, where a new sample can carry a great deal of variation from the training samples, the generalization of such models suffers. The main objective of this dissertation is to investigate how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data. The approach here is to take advantage of the raw feature extraction provided by deep learning techniques and instead focus on how learning guarantees can be provided for small datasets to produce robust results for acoustic modeling without the dependency on big data. This has been done by careful and intelligent parameter and architecture selection within the framework of statistical learning theory. Here, an intelligently defined CNN architecture, together with context windows and a knowledge-driven hierarchical tree of SVM classifiers, achieves nearly state-of-the-art frame-wise phoneme recognition results with absolutely no pretraining or external weight initialization. A goal of this thesis is to produce transparent and reproducible architectures with high frame-level accuracy, comparable to the state of the art.
Additionally, a convergence analysis based on the learning guarantees of statistical learning theory is performed in order to demonstrate the generalization capacity of the model. The model achieves 39.7% error in frame-wise classification and a 43.5% phone error rate using deep feature extraction and SVM classification, even with little data (less than 7 hours). These results are comparable to studies which use well over ten times that amount of data. Beyond the intrinsic evaluation, the model also achieves an accuracy of 88% in the identification of epenthesis, the error which is most difficult for Brazilian speakers of English. This is a 69% relative gain over the previous values in the literature. The results are significant because they show how deep feature extraction can be applied to little-data scenarios, contrary to popular belief. The extrinsic, task-based results also show how this approach could be useful in tasks like automatic error diagnosis. Another contribution is the publication of a number of freely available resources which previously did not exist, meant to aid future research in dataset creation.

Posted Content
TL;DR: A review of statistical machine learning from a Bayesian decision theoretic point of view is presented in this article, where the authors argue that many SML techniques are closely connected to making inference by using the so-called Bayesian paradigm.
Abstract: Statistical Machine Learning (SML) refers to a body of algorithms and methods by which computers are allowed to discover important features of input data sets which are often very large in size. The very task of feature discovery from data is essentially the meaning of the keyword 'learning' in SML. Theoretical justifications for the effectiveness of the SML algorithms are underpinned by sound principles from different disciplines, such as Computer Science and Statistics. The theoretical underpinnings particularly justified by statistical inference methods are together termed statistical learning theory. This paper provides a review of SML from a Bayesian decision theoretic point of view, in which we argue that many SML techniques are closely connected to making inference by using the so-called Bayesian paradigm. We discuss many important SML techniques such as supervised and unsupervised learning, deep learning, online learning and Gaussian processes, especially in the context of very large data sets where these are often employed. We present a dictionary which maps the key concepts of SML from Computer Science and Statistics. We illustrate the SML techniques with three moderately large data sets, where we also discuss many practical implementation issues. Thus the review is especially targeted at statisticians and computer scientists who are aspiring to understand and apply SML to moderately large to big data sets.

Posted Content
TL;DR: There is a strong need for further mathematical developments on the foundations of machine learning methods to increase the level of rigor of employed methods and to ensure more reliable and interpretable results.
Paul J. Atzberger
Abstract: There has been a lot of recent interest in adopting machine learning methods for scientific and engineering applications. This has in large part been inspired by recent successes and advances in the domains of Natural Language Processing (NLP) and Image Classification (IC). However, scientific and engineering problems have their own unique characteristics and requirements, raising new challenges for the effective design and deployment of machine learning approaches. There is a strong need for further mathematical developments on the foundations of machine learning methods to increase the level of rigor of employed methods and to ensure more reliable and interpretable results. Also, as reported in the recent literature on state-of-the-art results and as indicated by the No Free Lunch Theorems of statistical learning theory, incorporating some form of inductive bias and domain knowledge is essential to success. Consequently, even for existing and widely used methods there is a strong need for further mathematical work to facilitate ways to incorporate prior scientific knowledge and related inductive biases into learning frameworks and algorithms. We briefly discuss these topics and some ideas for proceeding in this direction.

Proceedings ArticleDOI
01 Nov 2018
TL;DR: This work highlights the landscape of the empirical error in the generative case, completing the full picture through an exquisite design of image super-resolution under norm-based capacity control; a theoretical advance in the interpretation of the training dynamics is achieved from both mathematical and biological sides.

Abstract: Despite its remarkable empirical success as a highly competitive branch of artificial intelligence, deep learning is often blamed for its widely known low interpretability and lack of a firm and rigorous mathematical foundation. However, most theoretical endeavor has been devoted to the discriminative deep learning case, whose complementary part is generative deep learning. To the best of our knowledge, we are the first to highlight the landscape of the empirical error in the generative case to complete the full picture, through an exquisite design of image super-resolution under norm-based capacity control. Our theoretical advance in the interpretation of the training dynamics is achieved from both mathematical and biological sides.