
Showing papers on "Statistical learning theory" published in 2017


Journal ArticleDOI
TL;DR: This overview reviews the theoretical underpinnings of multi-view learning, attempts to identify promising avenues, and points out specific challenges that can hopefully promote further research in this rapidly developing field.

679 citations


Book
28 Jun 2017
TL;DR: The kernel mean embedding (KME), as discussed by the authors, maps probability distributions into a reproducing kernel Hilbert space and can be viewed as a generalization of the original feature map of support vector machines (SVMs) and other kernel methods.
Abstract: A Hilbert space embedding of a distribution—in short, a kernel mean embedding—has recently emerged as a powerful tool for machine learning and statistical inference. The basic idea behind this framework is to map distributions into a reproducing kernel Hilbert space (RKHS) in which the whole arsenal of kernel methods can be extended to probability measures. It can be viewed as a generalization of the original “feature map” common to support vector machines (SVMs) and other kernel methods. In addition to the classical applications of kernel methods, the kernel mean embedding has found novel applications in fields ranging from probabilistic modeling to statistical inference, causal discovery, and deep learning. Kernel Mean Embedding of Distributions: A Review and Beyond provides a comprehensive review of existing work and recent advances in this research area, and discusses some of the most challenging issues and open problems that could potentially lead to new research directions. The targeted audience includes graduate students and researchers in machine learning and statistics who are interested in the theory and applications of kernel mean embeddings.
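
For intuition, a minimal numpy sketch (ours, not from the review): each sample set is mapped to its empirical mean embedding in the RKHS of an RBF kernel, and the maximum mean discrepancy (MMD) is the RKHS distance between two such embeddings.

```python
import numpy as np

def rbf_gram(X, Y, gamma=1.0):
    """Gram matrix k(x, y) = exp(-gamma * ||x - y||^2) between rows of X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=1.0):
    """Squared MMD between samples X ~ P and Y ~ Q: the squared RKHS distance
    between the empirical kernel mean embeddings of P and Q (biased estimator)."""
    return (rbf_gram(X, X, gamma).mean()
            - 2 * rbf_gram(X, Y, gamma).mean()
            + rbf_gram(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))  # sample from P
Y = rng.normal(0.5, 1.0, size=(500, 2))  # sample from a shifted Q
print(mmd2(X, Y))                        # clearly positive, reflecting the shift
```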

375 citations


Journal ArticleDOI
TL;DR: The basic theory of S3VM is expounded and discussed in detail; the mainstream models of S3VM are presented, including the transductive support vector machine, the Laplacian support vector machine, S3VM training via the label mean, and S3VM based on cluster kernels; conclusions are then given.
Abstract: Support vector machine (SVM) is a machine learning method based on statistical learning theory. It has many advantages, such as a solid theoretical foundation, global optimization, sparsity of the solution, nonlinear modeling capability, and good generalization. The standard form of SVM applies only to supervised learning. A large amount of the data generated in real life is unlabeled, and the standard form of SVM cannot make good use of these data to improve its learning ability. However, the semi-supervised support vector machine (S3VM) is a good solution to this problem. This paper reviews recent progress in semi-supervised support vector machines. First, the basic theory of S3VM is expounded and discussed in detail; then, the mainstream models of S3VM are presented, including the transductive support vector machine, the Laplacian support vector machine, S3VM training via the label mean, and S3VM based on cluster kernels; finally, we give conclusions and look ahead to future research on S3VM.

95 citations


Journal ArticleDOI
TL;DR: The results show that the KRNGM model significantly outperforms the existing nonhomogeneous models, namely NGM, ONGM, and NDGM.

78 citations


Posted Content
TL;DR: In this article, the authors study the generalization error of stochastic gradient Langevin dynamics with non-convex objectives and propose two theories with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian results respectively.
Abstract: Algorithm-dependent generalization error bounds are central to statistical learning theory. A learning algorithm may use a large hypothesis space, but the limited number of iterations controls its model capacity and generalization error. The impact of stochastic gradient methods on generalization error for non-convex learning problems not only has important theoretical consequences, but is also critical to the generalization errors of deep learning. In this paper, we study the generalization errors of Stochastic Gradient Langevin Dynamics (SGLD) with non-convex objectives. Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian results respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_k}\right)$, where $L$ is the uniform Lipschitz parameter, $\beta$ is the inverse temperature, and $T_k$ is the aggregated step size. For the PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step is shown to carry an exponentially decaying factor by imposing $\ell^2$ regularization, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the trajectory. Our bounds have no implicit dependence on dimensions, norms, or other capacity measures of the parameter, which elegantly characterizes the phenomenon of "Fast Training Guarantees Generalization" in non-convex settings. This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and it has important implications for the statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.
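
A concrete sketch of the analyzed algorithm may help: the following numpy toy (ours, not the authors' code) runs SGLD on a small non-convex regression objective, tracking the inverse temperature $\beta$ and the aggregated step size $T_k$ from the bound explicitly. The model, data, and constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_minibatch_loss(theta, xb, yb):
    """Gradient of the minibatch squared loss for a toy non-convex model y ~ sin(x . theta)."""
    resid = np.sin(xb @ theta) - yb
    return (xb * (resid * np.cos(xb @ theta))[:, None]).mean(axis=0)

# Toy data from a non-convex model.
n, d = 1000, 3
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = np.sin(X @ theta_true) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
beta = 1e4    # inverse temperature (the paper's beta)
eta = 1e-2    # step size; constant here, schedules are common
T_k = 0.0     # aggregated step sizes (the paper's T_k)
for t in range(2000):
    idx = rng.integers(0, n, size=32)                        # minibatch
    g = grad_minibatch_loss(theta, X[idx], y[idx])
    xi = rng.normal(size=d)                                  # Gaussian injection
    theta = theta - eta * g + np.sqrt(2 * eta / beta) * xi   # SGLD step
    T_k += eta
print("theta:", theta, "aggregated step size T_k:", T_k)
```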

60 citations


Journal ArticleDOI
TL;DR: In this article, the authors present eight PAC-Bayes bounds to analyze the generalization performance of multi-view classifiers; the bounds are based on two derived logarithmic determinant inequalities whose difference lies in whether the dimensionality of the data is involved.

44 citations


Journal ArticleDOI
TL;DR: This paper's models capture several state-of-the-art empirical and theoretical approaches to the problem, ranging from self-improving algorithms to empirical performance models, and the results identify conditions under which these approaches are guaranteed to perform well.
Abstract: The best algorithm for a computational problem generally depends on the “relevant inputs,” a concept that depends on the application domain and often defies formal articulation. While there is a large body of literature on empirical approaches to selecting the best algorithm for a given application domain, there has been surprisingly little theoretical analysis of the problem. This paper adapts concepts from statistical and online learning theory to reason about application-specific algorithm selection. Our models capture several state-of-the-art empirical and theoretical approaches to the problem, ranging from self-improving algorithms to empirical performance models, and our results identify conditions under which these approaches are guaranteed to perform well. We present one framework that models algorithm selection as a statistical learning problem, and our work here shows that dimension notions from statistical learning theory, historically used to measure the complexity of classes of binary- and re...

44 citations


Proceedings Article
01 Jan 2017
TL;DR: In this paper, the average top-k loss is introduced as a new ensemble loss for supervised learning. It is shown that the average top-k loss leads to convex optimization problems that can be solved effectively with conventional sub-gradient-based methods.
Abstract: In this work, we introduce the average top-$k$ (AT$_k$) loss as a new ensemble loss for supervised learning. The AT$_k$ loss provides a natural generalization of the two widely used ensemble losses, namely the average loss and the maximum loss. Furthermore, the AT$_k$ loss combines their advantages and can alleviate their corresponding drawbacks to better adapt to different data distributions. We show that the AT$_k$ loss affords an intuitive interpretation that reduces the penalty of continuous and convex individual losses on correctly classified data. The AT$_k$ loss leads to convex optimization problems that can be solved effectively with conventional sub-gradient-based methods. We further study the statistical learning theory of MAT$_k$ learning by establishing its classification calibration and statistical consistency, which provide useful insights on the practical choice of the parameter $k$. We demonstrate the applicability of MAT$_k$ learning combined with different individual loss functions for binary and multi-class classification and regression using synthetic and real datasets.
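
The loss itself is a one-liner; this small numpy sketch (ours) shows how the average top-$k$ loss interpolates between the maximum loss ($k=1$) and the average loss ($k=n$).

```python
import numpy as np

def average_top_k_loss(losses, k):
    """Average of the k largest individual losses.

    k = 1           -> maximum loss
    k = len(losses) -> average loss
    """
    losses = np.asarray(losses)
    return np.sort(losses)[-k:].mean()

hinge = np.array([0.0, 0.2, 1.5, 3.0])   # per-example losses
print(average_top_k_loss(hinge, 1))      # 3.0   (maximum loss)
print(average_top_k_loss(hinge, 4))      # 1.175 (average loss)
print(average_top_k_loss(hinge, 2))      # 2.25  (in between)
```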

34 citations


Journal ArticleDOI
TL;DR: In this paper, the support vector machine (SVM), based on the VC dimension theory of statistical learning theory, is described and applied to machinery condition prediction; the wavelet transform is introduced into the SVM model to reduce the influence of irregular characteristics and simultaneously simplify the complexity of the original signal.
Abstract: The soft failure of mechanical equipment makes its performance drop gradually; it accounts for a large proportion of failures and exhibits certain regularity. The performance can be evaluated and predicted through early state monitoring and data analysis. In this paper, the support vector machine (SVM), a novel learning machine based on the VC dimension theory of statistical learning theory, is described and applied to machinery condition prediction. To improve the modeling capability, the wavelet transform (WT) is introduced into the SVM model to reduce the influence of irregular characteristics and simultaneously simplify the complexity of the original signal. The paper models the vibration signal from a double-row bearing, and a wavelet transform and SVM model (WT-SVM model) is constructed and trained for bearing degradation process prediction. In addition, the Hazen plotting position relationship is applied to describe the degradation trend distribution, and a 95% confidence level based on the $t$-distribution is given. The single SVM model and a neural network (NN) approach are also investigated for comparison. The modeling results indicate that the WT-SVM model outperforms the NN and single SVM models, and is feasible and effective in machinery condition prediction.
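
A rough sketch of the WT-SVM pipeline (ours), assuming PyWavelets and scikit-learn are available; the signal, wavelet, threshold, and lag order are invented for illustration and are not the paper's settings.

```python
import numpy as np
import pywt
from sklearn.svm import SVR

rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 1024)
signal = np.sin(t) + 0.02 * t + 0.3 * rng.normal(size=t.size)  # trend + noise

# Wavelet denoising: soft-threshold the detail coefficients, then reconstruct.
coeffs = pywt.wavedec(signal, "db4", level=4)
coeffs = [coeffs[0]] + [pywt.threshold(c, 0.3, mode="soft") for c in coeffs[1:]]
smooth = pywt.waverec(coeffs, "db4")[: signal.size]

# One-step-ahead prediction from p lagged values of the denoised signal.
p = 8
X = np.array([smooth[i : i + p] for i in range(smooth.size - p)])
y = smooth[p:]
model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X[:-100], y[:-100])
print("test MSE:", np.mean((model.predict(X[-100:]) - y[-100:]) ** 2))
```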

34 citations


Proceedings Article
19 Jul 2017
TL;DR: This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and it has important implications for the statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.
Abstract: Algorithm-dependent generalization error bounds are central to statistical learning theory. A learning algorithm may use a large hypothesis space, but the limited number of iterations controls its model capacity and generalization error. The impact of stochastic gradient methods on generalization error for non-convex learning problems not only has important theoretical consequences, but is also critical to the generalization errors of deep learning. In this paper, we study the generalization errors of Stochastic Gradient Langevin Dynamics (SGLD) with non-convex objectives. Two theories are proposed with non-asymptotic discrete-time analysis, using stability and PAC-Bayesian results respectively. The stability-based theory obtains a bound of $O\left(\frac{1}{n}L\sqrt{\beta T_k}\right)$, where $L$ is the uniform Lipschitz parameter, $\beta$ is the inverse temperature, and $T_k$ is the aggregated step size. For the PAC-Bayesian theory, though the bound has a slower $O(1/\sqrt{n})$ rate, the contribution of each step is shown to carry an exponentially decaying factor by imposing $\ell^2$ regularization, and the uniform Lipschitz constant is also replaced by actual norms of gradients along the trajectory. Our bounds have no implicit dependence on dimensions, norms, or other capacity measures of the parameter, which elegantly characterizes the phenomenon of "Fast Training Guarantees Generalization" in non-convex settings. This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and it has important implications for the statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.

28 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: A version of the classic Linear Discriminant Analysis (LDA) classification rule is proposed for nonstationary data, using a linear-Gaussian state space model, based on the Kalman Smoother algorithm to estimate the evolving population parameters.
Abstract: Changes in population distributions over time are common in many applications. However, the vast majority of statistical learning theory operates under the assumption that all points in the training data are identically distributed (and independent); that is, non-stationarity of the data is disregarded. In this paper, a version of the classic Linear Discriminant Analysis (LDA) classification rule is proposed for nonstationary data, using a linear-Gaussian state-space model. This Nonstationary LDA (NSLDA) classification rule is based on the Kalman smoother algorithm to estimate the evolving population parameters. When the dynamics of the system are not fully known, a combination of the Expectation-Maximization (EM) algorithm and the Kalman smoother is employed to simultaneously estimate the population and state-space equation parameters. Performance is assessed in a set of numerical experiments using simulated data, where the average error rates obtained by NSLDA are compared to the error produced by a naive application of LDA to the pooled nonstationary data. Results demonstrate the promise of the proposed NSLDA classification rule.
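
A simplified toy version of the idea (ours): each class mean follows a random walk, a Kalman filter tracks it from per-time-step sample means, and LDA classifies with the tracked means. The paper uses a Kalman smoother, plus EM when the dynamics are unknown; this filter-only sketch omits both.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n_per = 2, 50, 20
Sigma = np.eye(d)                        # common within-class covariance
Si = np.linalg.inv(Sigma)
drift = np.array([0.05, -0.03])          # population drift (unknown to the filter)
Q = 0.01 * np.eye(d)                     # assumed random-walk state noise
R = Sigma / n_per                        # noise of the sample-mean observation

mu  = {0: np.zeros(d), 1: 2 * np.ones(d)}    # true, moving class means
est = {0: np.zeros(d), 1: 2 * np.ones(d)}    # tracked means
P   = {0: np.eye(d), 1: np.eye(d)}           # filter covariances

errors = []
for t in range(T):
    test = {}
    for c in (0, 1):
        mu[c] = mu[c] + drift                              # population drifts
        xs = rng.multivariate_normal(mu[c], Sigma, n_per)  # labeled batch
        z = xs.mean(axis=0)                                # observed sample mean
        Pp = P[c] + Q                                      # Kalman predict
        K = Pp @ np.linalg.inv(Pp + R)                     # Kalman gain
        est[c] = est[c] + K @ (z - est[c])                 # Kalman update
        P[c] = (np.eye(d) - K) @ Pp
        test[c] = rng.multivariate_normal(mu[c], Sigma, 100)
    # LDA with equal covariances: nearest tracked mean in Mahalanobis distance.
    def classify(x):
        d0 = (x - est[0]) @ Si @ (x - est[0])
        d1 = (x - est[1]) @ Si @ (x - est[1])
        return 0 if d0 < d1 else 1
    errors.append(np.mean([classify(x) != c for c in (0, 1) for x in test[c]]))
print("average error over time:", np.mean(errors))
```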

Posted Content
TL;DR: This work studies universally consistent function learning, where the objective is to obtain low long-run average loss for any target function, when the data follow a given stochastic process, and finds that optimistically universal learning rules do indeed exist in the self-adaptive learning setting.
Abstract: This work initiates a general study of learning and generalization without the i.i.d. assumption, starting from first principles. While the traditional approach to statistical learning theory typically relies on standard assumptions from probability theory (e.g., i.i.d. or stationary ergodic), in this work we are interested in developing a theory of learning based only on the most fundamental and necessary assumptions implicit in the requirements of the learning problem itself. We specifically study universally consistent function learning, where the objective is to obtain low long-run average loss for any target function, when the data follow a given stochastic process. We are then interested in the question of whether there exist learning rules guaranteed to be universally consistent given only the assumption that universally consistent learning is possible for the given data process. The reasoning that motivates this criterion emanates from a kind of optimist's decision theory, and so we refer to such learning rules as being optimistically universal. We study this question in three natural learning settings: inductive, self-adaptive, and online. Remarkably, as our strongest positive result, we find that optimistically universal learning rules do indeed exist in the self-adaptive learning setting. Establishing this fact requires us to develop new approaches to the design of learning algorithms. Along the way, we also identify concise characterizations of the family of processes under which universally consistent learning is possible in the inductive and self-adaptive settings. We additionally pose a number of enticing open problems, particularly for the online learning setting.

Journal ArticleDOI
TL;DR: This paper focuses on analysis of the generalization error for this regularized least squares ranking algorithm, and improves the existing learning rates by virtue of an error decomposition technique from regression and Hoeffding’s decomposition for U-statistics.
Abstract: The ranking problem aims at learning real-valued functions to order instances, which has attracted great interest in statistical learning theory. In this paper, we consider the regularized least squares ranking algorithm within the framework of reproducing kernel Hilbert space. In particular, we focus on analysis of the generalization error for this ranking algorithm, and improve the existing learning rates by virtue of an error decomposition technique from regression and Hoeffding’s decomposition for U-statistics.
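
For reference, one common form of the regularized least squares ranking objective over an RKHS $\mathcal{H}_K$ is (our notation; the paper's normalization may differ)

$$f_{\mathbf{z}} = \operatorname*{arg\,min}_{f \in \mathcal{H}_K} \; \frac{1}{n(n-1)} \sum_{i \neq j} \big( (y_i - y_j) - (f(x_i) - f(x_j)) \big)^2 + \lambda \|f\|_K^2,$$

whose pairwise structure is what makes Hoeffding's decomposition for U-statistics the natural tool in the error analysis.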

Proceedings Article
17 Jul 2017
TL;DR: This manuscript analyzes non-decomposable metrics such as the F-measure and the Jaccard measure from statistical and algorithmic points of view, and provides guidance to the theory and practice of binary classification with complex metrics.
Abstract: Statistical learning theory is at an inflection point enabled by recent advances in understanding and optimizing a wide range of metrics. Of particular interest are non-decomposable metrics such as the F-measure and the Jaccard measure which cannot be represented as a simple average over examples. Non-decomposability is the primary source of difficulty in theoretical analysis, and interestingly has led to two distinct settings and notions of consistency. In this manuscript we analyze both settings, from statistical and algorithmic points of view, to explore the connections and to highlight differences between them for a wide range of metrics. The analysis complements previous results on this topic, clarifies common confusions around both settings, and provides guidance to the theory and practice of binary classification with complex metrics.
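
Non-decomposability is easy to see in code: both metrics are nonlinear functions of the confusion counts rather than averages of per-example losses. A small sketch (ours):

```python
import numpy as np

def f_measure(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN): not expressible as a per-example average."""
    return 2 * tp / (2 * tp + fp + fn)

def jaccard(tp, fp, fn):
    """Jaccard = TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
print(f_measure(tp, fp, fn), jaccard(tp, fp, fn))
```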

Journal ArticleDOI
TL;DR: A kernel function is applied to reconstruct time-dependent open-ended sequences of observations, also referred to as data streams in the context of machine learning, into multidimensional spaces, in an attempt to satisfy the data independence assumption.
Abstract: We employ a Monte-Carlo approach to find the best phase space for a given data stream.We propose kFTCV, a novel approach to validate data stream classification.Results show Taken's theorem can transform data streams into independent states.Therefore, we can rely on SLT framework to ensure learning when dealing with data streams. The Statistical Learning Theory (SLT) defines five assumptions to ensure learning for supervised algorithms. Data independency is one of those assumptions, once the SLT relies on the Law of Large Numbers to ensure learning bounds. As a consequence, this assumption imposes a strong limitation to guarantee learning on time-dependent scenarios. In order to tackle this issue, some researchers relax this assumption with the detriment of invalidating all theoretical results provided by the SLT. In this paper we apply a kernel function, more precisely the Takens' immersion theorem, to reconstruct time-dependent open-ended sequences of observations, also referred to as data streams in the context of Machine Learning, into multidimensional spaces (a.k.a. phase spaces) in attempt to hold the data independency assumption. At first, we study the best immersion parameterization for our kernel function using the Distance-Weighted Nearest Neighbors (DWNN). Next, we use this best immersion to recursively forecast next observations based on the prediction horizon, estimated using the Lyapunov exponent. Afterwards, predicted observations are compared against the expected ones using the Mean Distance from the Diagonal Line (MDDL). Theoretical and experimental results based on a cross-validation strategy provide stronger evidences of generalization, what allows us to conclude that one can learn from time-dependent data after using our approach. This opens up a very important possibility for ensuring supervised learning when it comes to time-dependent data, being useful to tackle applications such as in the climate, animal tracking, biology and other domains.

Journal ArticleDOI
TL;DR: Different formal definitions of expressiveness of a kernel are provided by exploiting the most recent results in the field of Statistical Learning Theory, and the differences among some state-of-the-art graph kernels are analyzed.

Journal ArticleDOI
TL;DR: A predictive system based on kernel methods, a type of machine learning algorithm grounded in statistical learning theory, is presented, which employs a flexible graph encoding to preserve multiple structural hypotheses and exploit recent advances in representation and model induction to scale to large data volumes.
Abstract: Motivation: The importance of RNA protein-coding gene regulation is by now well appreciated. Non-coding RNAs (ncRNAs) are known to regulate gene expression at practically every stage, ranging from chromatin packaging to mRNA translation. However the functional characterization of specific instances remains a challenging task in genome scale settings. For this reason, automatic annotation approaches are of interest. Existing computational methods are either efficient but non-accurate or they offer increased precision, but present scalability problems. Results: In this article, we present a predictive system based on kernel methods, a type of machine learning algorithm grounded in statistical learning theory. We employ a flexible graph encoding to preserve multiple structural hypotheses and exploit recent advances in representation and model induction to scale to large data volumes. Experimental results on tens of thousands of ncRNA sequences available from the Rfam database indicate that we can not only improve upon state-of-the-art predictors, but also achieve speedups of several orders of magnitude. Availability and implementation: The code is available from http://www.bioinf.uni-freiburg.de/~costa/EDeN.tgz. Contact: f.costa@exeter.ac.uk

Journal ArticleDOI
TL;DR: This paper shows how to build an ELM model with a novel scalable approach and how to carefully assess its performance, using the most recent results from SLT, for a sentiment analysis problem.
Abstract: Big social data analysis is the area of research focusing on collecting, examining, and processing large multi-modal and multi-source datasets in order to discover patterns/correlations and extract information from the Social Web. This is usually accomplished through the use of supervised and unsupervised machine learning algorithms that learn from the available data. However, these are usually highly computationally expensive, either in the training or in the prediction phase, as they are often not able to handle current data volumes. Parallel approaches have been proposed in order to boost processing speeds, but this clearly requires technologies that support distributed computations. Extreme learning machines (ELMs) are an emerging learning paradigm, presenting an efficient unified solution for generalized feed-forward neural networks. ELM offers significant advantages such as fast learning speed, ease of implementation, and minimal human intervention. However, ELM cannot be easily parallelized, due to the presence of a pseudo-inverse calculation. Therefore, this paper aims to find a reliable method to realize a parallel implementation of ELM that can be applied to large datasets typical of Big Data problems, employing the most recent technology for parallel in-memory computation, i.e., Spark, which is designed to efficiently deal with iterative procedures that recursively perform operations over the same data. Moreover, this paper shows how to take advantage of the most recent advances in statistical learning theory (SLT) in order to address the issue of selecting ELM hyperparameters that give the best generalization performance. This involves assessing the performance of such algorithms (i.e., resampling methods and in-sample methods) by exploiting the most recent results in SLT and adapting them to the Big Data framework. The proposed approach has been tested on two affective analogical reasoning datasets. Affective analogical reasoning can be defined as the intrinsically human capacity to interpret the cognitive and affective information associated with natural language. In particular, we employed two benchmarks, each composed of 21,743 common-sense concepts; each concept is represented according to two models of a semantic network in which common-sense concepts are linked to a hierarchy of affective domain labels. The labeled data have been split into two sets: the first 20,000 samples have been used for building the model with the ELM with the different SLT strategies, while the rest of the labeled samples, numbering 1743, have been kept apart as a reference set in order to test the performance of the learned model. The splitting process has been repeated 30 times in order to obtain statistically relevant results. We ran the experiments on the Google Cloud Platform, in particular the Google Compute Engine, employing NM = 4 machines with two cores and 1.8 GB of RAM (machine type n1-highcpu-2) and an HDD of 30 GB, equipped with Spark. Results on the affective datasets both show the effectiveness of the proposed parallel approach and underline the most suitable SLT strategies for the specific Big Data problem. In this paper we showed how to build an ELM model with a novel scalable approach and how to carefully assess its performance, with the use of the most recent results from SLT, for a sentiment analysis problem. Thanks to recent technologies, the computational requirements of these methods have been reduced, allowing them to scale to the large datasets typical of Big Data applications.
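
For orientation, here is a minimal single-machine ELM in numpy (ours, not the paper's Spark implementation): the input weights are random and fixed, and the output weights come from a pseudo-inverse, which is exactly the step the paper identifies as the obstacle to parallelization.

```python
import numpy as np

def elm_train(X, y, n_hidden=200, seed=0):
    """Minimal ELM: random fixed hidden layer, output weights by pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never trained)
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # the hard-to-parallelize step
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=1000))  # toy labels
W, b, beta = elm_train(X[:800], y[:800])
print("test accuracy:", np.mean(np.sign(elm_predict(X[800:], W, b, beta)) == y[800:]))
```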

Journal ArticleDOI
02 Apr 2017-Filomat
TL;DR: This paper proposes two smoothing approaches for implicit Lagrangian twin support vector machine (TWSVM) classifiers by formulating a pair of unconstrained minimization problems in dual variables whose solutions are obtained using a finite Newton method.
Abstract: In this paper, we propose two smoothing approaches for implicit Lagrangian twin support vector machine (TWSVM) classifiers by formulating a pair of unconstrained minimization problems in dual variables whose solutions are obtained using a finite Newton method. The idea of our formulation is to reformulate TWSVM as a strongly convex problem by incorporating regularization techniques to improve robustness. The solution of the two modified unconstrained minimization problems reduces to solving just two systems of linear equations, as opposed to solving two quadratic programming problems in TWSVM and TBSVM, which leads to an extremely simple and fast algorithm. Unlike the classical TWSVM, the structural risk minimization principle is implemented by adding a regularization term to the primal problems of our proposed algorithm. This embodies the essence of statistical learning theory. To demonstrate the effectiveness of the proposed method, we performed numerical experiments on a number of interesting real-world datasets and compared the results with other SVMs. Comparison with GEPSVM and TWSVM clearly demonstrates the effectiveness and suitability of the proposed method.

Journal ArticleDOI
15 Nov 2017-Entropy
TL;DR: A survey of various techniques used to derive information-theoretic lower bounds for estimation and learning, focusing on the settings of parameter and function estimation, community recovery, and online learning for multi-armed bandits.
Abstract: In recent years, tools from information theory have played an increasingly prevalent role in statistical machine learning. In addition to developing efficient, computationally feasible algorithms for analyzing complex datasets, it is of theoretical importance to determine whether such algorithms are “optimal” in the sense that no other algorithm can lead to smaller statistical error. This paper provides a survey of various techniques used to derive information-theoretic lower bounds for estimation and learning. We focus on the settings of parameter and function estimation, community recovery, and online learning for multi-armed bandits. A common theme is that lower bounds are established by relating the statistical learning problem to a channel decoding problem, for which lower bounds may be derived involving information-theoretic quantities such as the mutual information, total variation distance, and Kullback–Leibler divergence. We close by discussing the use of information-theoretic quantities to measure independence in machine learning applications ranging from causality to medical imaging, and mention techniques for estimating these quantities efficiently in a data-driven manner.
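
A prototypical instance of the channel-decoding reduction surveyed here is Fano's inequality (standard form, our notation): if $V$ is uniform on $\{1, \dots, M\}$ and $\hat{V}$ is any estimator of $V$ from an observation $Y$, then

$$\mathbb{P}(\hat{V} \neq V) \;\geq\; 1 - \frac{I(V; Y) + \log 2}{\log M},$$

so any learning problem into which such an $M$-ary testing problem can be embedded inherits a lower bound governed by the mutual information $I(V; Y)$.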

Book ChapterDOI
01 Jan 2017
TL;DR: Two intelligent diagnosis methods are introduced based on the idea of deep learning, which uses advanced intelligent techniques for both feature extraction and fault classification and can replace diagnosticians to efficiently process the massive collected signals and automatically diagnose the mechanical faults.
Abstract: This chapter introduces the intelligent diagnosis methods based on individual intelligent techniques. The concept and advantages of intelligent diagnosis are first described, as well as the main steps commonly included in intelligent diagnosis. Second, three methods using artificial neural networks, which are able to learn and generalize nonlinear relationships between input data and output data, are presented for diagnosing the mechanical faults. Then, two diagnosis methods based on statistical learning theory are detailed and they can give better generalization abilities, especially for limited sample cases. Finally, two intelligent diagnosis methods are introduced based on the idea of deep learning, which uses advanced intelligent techniques for both feature extraction and fault classification. The effectiveness of each method is verified by various diagnosis cases, involving intelligent diagnosis of rub faults, bearing faults, and gear faults, and these methods can replace diagnosticians to efficiently process the massive collected signals and automatically diagnose the mechanical faults.

Journal ArticleDOI
TL;DR: This paper proposes an ELM implementation that exploits Spark's distributed in-memory technology and shows how to take advantage of SLT results in order to select ELM hyperparameters able to provide the best generalization performance.
Abstract: Recently, social networks and other forms of media communication have been gathering the interest of both the scientific and the business world, leading to the increasing development of the science of opinion and sentiment analysis. Handling the huge amount of information present on the Web is a crucial task and motivates the study and creation of efficient models able to tackle it. To this end, current research proposes an efficient approach to support emotion recognition and polarity detection in natural language text. In this paper, we show how the most recent advances in statistical learning theory (SLT) can support the development of an efficient extreme learning machine (ELM) and the assessment of the resultant model’s performance when applied to big social data analysis. ELM, developed to overcome some issues in back-propagation networks, represents a powerful learning tool. However, the main problem is the necessity of coping with a large number of available samples, and the generalization performance has to be carefully assessed. For this reason, we propose an ELM implementation that exploits the Spark distributed in-memory technology and show how to take advantage of SLT results in order to select ELM hyperparameters able to provide the best generalization performance.

Book ChapterDOI
01 Jan 2017
TL;DR: In this paper, the authors illustrate how quantum computing can be useful for addressing the computational issues of building, tuning, and estimating the performance of a model learned from data, which is a promising paradigm for solving complex problems, such as large number factorization, exhaustive search, optimization, and mean and median computation.
Abstract: Quantum computing represents a promising paradigm for solving complex problems, such as large-number factorization, exhaustive search, optimization, and mean and median computation. On the other hand, supervised learning deals with the classical induction problem where an unknown input-output relation is inferred from a set of data that consists of examples of this relation. Lately, because of the rapid growth of the size of datasets, the dimensionality of the input and output space, and the variety and structure of the data, conventional learning techniques have started to show their limits. Considering these problems, the purpose of this chapter is to illustrate how quantum computing can be useful for addressing the computational issues of building, tuning, and estimating the performance of a model learned from data.

Journal ArticleDOI
TL;DR: A novel and efficient pairing support vector algorithm for data regression, called PSVR, is proposed; it embodies the essence of statistical learning theory by adopting the principle of structural risk minimization, resulting in better generalization capability than TSVR.

Journal ArticleDOI
TL;DR: In this paper, a novel application of statistical learning theory to the structural reliability analysis of transmission lines is described, considering the uncertainties of climatic variables such as wind speed, ice thickness, and wind angle, and of the resistance of structural elements.
Abstract: This paper describes a novel application of statistical learning theory to structural reliability analysis of transmission lines, considering the uncertainties of climatic variables such as wind speed, ice thickness, and wind angle, and of the resistance of structural elements. The problem of reliability analysis of complex structural systems with implicit limit state functions is addressed by statistical model selection, where the goal is to select a surrogate model of the finite element solver that provides the value of the performance function for each conductor, insulator, or tower element. After determining the performance function for each structural element, Monte Carlo simulation is used to calculate their failure probabilities. The failure probabilities of towers and the entire line are then estimated from the failure probabilities of their elements/components, considering the correlation between failure events. In order to quantify the relative importance of line components and provide the ...
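
The Monte Carlo step can be illustrated with a toy sketch (ours); every distribution and the load model below are invented placeholders for the paper's finite-element surrogate and real climatic data.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000  # number of Monte Carlo samples

# Invented climatic and resistance distributions for a single element.
wind_speed = rng.weibull(2.0, N) * 15.0            # m/s
ice = rng.lognormal(mean=0.0, sigma=0.5, size=N)   # cm
resistance = rng.normal(50.0, 5.0, N)              # element capacity

# Stand-in for the surrogate performance function: failure when load > capacity.
load = 0.05 * wind_speed**2 + 4.0 * ice
print("estimated failure probability:", np.mean(load > resistance))
```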

Book ChapterDOI
01 Jan 2017
TL;DR: This work proposes a new learning framework, relying on Statistical Learning Theory, that includes resource constraints inside the learning process itself, allowing advanced resource-sparing ML models to be trained and efficiently deployed on smart mobile devices.
Abstract: Most state-of-the-art machine learning (ML) algorithms do not consider the computational constraints of implementing their learned models on mobile devices. These constraints include, for example, the limited depth of the arithmetic unit, the memory availability, and the battery capacity. We propose a new learning framework, relying on Statistical Learning Theory, that includes these constraints inside the learning process itself. This new framework allows training advanced resource-sparing ML models and efficiently deploying them on smart mobile devices. The advantages of our proposal are presented on a smartphone-based Human Activity Recognition application and compared against a conventional ML approach.

Posted Content
TL;DR: From the theoretical formulation, the conditions under which Deep Neural Networks learn are shown, and another issue is pointed out: DL benchmarks may be strictly driven by empirical risk, disregarding the complexity of algorithm biases.
Abstract: Deep Learning (DL) is one of the most common subjects when Machine Learning and Data Science approaches are considered. There are clearly two movements related to DL: the first aggregates researchers in a quest to outperform other algorithms from the literature, trying to win contests by chasing often small decreases in empirical risk; the second investigates evidence of overfitting, questioning the learning capabilities of DL classifiers. Motivated by such opposed points of view, this paper employs Statistical Learning Theory (SLT) to study the convergence of Deep Neural Networks, with particular interest in Convolutional Neural Networks. In order to draw theoretical conclusions, we propose an approach to estimate the Shattering coefficient of those classification algorithms, providing a lower bound for the complexity of their space of admissible functions, a.k.a. the algorithm bias. Based on this estimator, we generalize the complexity of network biases, and we then study the AlexNet and VGG16 architectures from the point of view of their Shattering coefficients and the number of training examples required to provide theoretical learning guarantees. From our theoretical formulation, we show the conditions under which Deep Neural Networks learn, as well as point out another issue: DL benchmarks may be strictly driven by empirical risks, disregarding the complexity of algorithm biases.
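
For context, the Shattering coefficient $\mathcal{S}(\mathcal{F}, n)$ enters the classical SLT uniform-convergence bound, which in one standard form (constants vary across texts) reads

$$\mathbb{P}\left(\sup_{f \in \mathcal{F}} \left|R(f) - R_{\mathrm{emp}}(f)\right| > \epsilon\right) \;\leq\; 2\, \mathcal{S}(\mathcal{F}, 2n)\, e^{-n\epsilon^{2}/4},$$

so a lower bound on $\mathcal{S}$, as estimated in the paper, translates into a lower bound on the number of training examples $n$ required for a given guarantee.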

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper introduces an approach that uses sparse regression and covariance matrix estimation to improve matrix completion accuracy, and as a result enhances feature selection precision, which leads to a reduction in prediction Mean Squared Error (MSE).
Abstract: Inference and estimation in Missing Information (MI) scenarios are important topics in statistical learning theory and Machine Learning (ML). In the ML literature, attempts have been made to enhance prediction through precise feature selection methods. In sparse linear models, LASSO is well known for extracting the desired support of the signal and for its robustness to noise. When sparse models also suffer from MI, sparse recovery and inference of the missing models are taken into account simultaneously. In this paper, we introduce an approach that uses sparse regression and covariance matrix estimation to improve matrix completion accuracy, and as a result enhances feature selection precision, which leads to a reduction in the prediction Mean Squared Error (MSE). We compare the effect of employing the covariance matrix in enhancing estimation accuracy to the case where it is not used in feature selection. Simulations show the improvement in performance compared to the case where covariance matrix estimation is not used.
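
Purely as a toy illustration of the sparse-regression ingredient (ours; the paper's covariance-aware completion step is not reproduced), LASSO can recover a sparse support even after a crude mean imputation of missing entries:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[3, 17, 40]] = [2.0, -1.5, 1.0]        # sparse ground-truth support
y = X @ beta + 0.1 * rng.normal(size=n)

# Erase 10% of the entries, then impute with column means (a crude completion).
mask = rng.random(X.shape) < 0.1
X_mi = X.copy()
X_mi[mask] = np.nan
X_filled = np.where(np.isnan(X_mi), np.nanmean(X_mi, axis=0), X_mi)

model = Lasso(alpha=0.05).fit(X_filled, y)
print("recovered support:", np.flatnonzero(np.abs(model.coef_) > 1e-3))
```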

Book ChapterDOI
13 Jun 2017
TL;DR: This paper uses lattice properties to derive the probability of overfitting for a set of classifiers represented by concepts, and proposes exact combinatorial bounds for families of classifiers forming a lattice.
Abstract: Obtaining accurate bounds on the probability of overfitting is a fundamental question in statistical learning theory. In this paper we propose exact combinatorial bounds for families of classifiers forming a lattice. We use lattice properties to derive the probability of overfitting for a set of classifiers represented by concepts. The extent of a concept, in turn, matches the set of objects correctly classified by the corresponding classifier. Experiments illustrate that the proposed bounds are consistent with Monte Carlo bounds.

Posted Content
TL;DR: This work provides the first proof in the literature of the NP-hardness of computing function norms of DNNs, motivating the necessity of an approximate approach; it derives a generalization bound for functions trained with weighted norms and proves that a natural stochastic optimization strategy minimizes the bound.
Abstract: Deep neural networks (DNNs) have become increasingly important due to their excellent empirical performance on a wide range of problems. However, regularization is generally achieved by indirect means, largely due to the complex set of functions defined by a network and the difficulty in measuring function complexity. There exists no method in the literature for additive regularization based on a norm of the function, as is classically considered in statistical learning theory. In this work, we propose sampling-based approximations to weighted function norms as regularizers for deep neural networks. We provide, to the best of our knowledge, the first proof in the literature of the NP-hardness of computing function norms of DNNs, motivating the necessity of an approximate approach. We then derive a generalization bound for functions trained with weighted norms and prove that a natural stochastic optimization strategy minimizes the bound. Finally, we empirically validate the proposed regularization strategies for both convex function sets and DNNs on real-world classification and image segmentation tasks, demonstrating improved performance over weight decay, dropout, and batch normalization. Source code will be released at the time of publication.
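
A sketch of the sampling-based idea (ours; the paper's weighted norms, architectures, and optimizer are not reproduced): estimate $\|f\|_{\mu}^2 = \mathbb{E}_{x \sim \mu}[f(x)^2]$ by Monte Carlo and add it to the empirical risk as an additive regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def net(params, X):
    """A tiny two-layer network f(x), standing in for a DNN."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

def sampled_sq_norm(params, sampler, n_samples=256):
    """Monte Carlo estimate of the weighted L2 function norm E_{x~mu}[f(x)^2]."""
    return np.mean(net(params, sampler(n_samples)) ** 2)

d, h = 4, 16
params = (0.5 * rng.normal(size=(d, h)), np.zeros(h),
          0.5 * rng.normal(size=(h, 1)), np.zeros(1))
sampler = lambda n: rng.normal(size=(n, d))   # weighting measure mu = N(0, I)

X, y = rng.normal(size=(100, d)), rng.normal(size=(100, 1))
lam = 0.1
objective = np.mean((net(params, X) - y) ** 2) + lam * sampled_sq_norm(params, sampler)
print("regularized objective:", objective)
```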