Universal approximation bounds for superpositions of a sigmoidal function
TL;DR: Feedforward networks with one hidden layer of sigmoidal units achieve integrated squared approximation error of order O(1/n) in the number of nodes n, whereas fixed series expansions with n adjustable linear coefficients cannot do better than order (1/n)^(2/d) uniformly over the same smoothness class; the approximation rate and the parsimony of the network parameterization are therefore advantageous in high-dimensional settings.
Abstract: Approximation properties of a class of artificial neural networks are established. It is shown that feedforward networks with one layer of sigmoidal nonlinearities achieve integrated squared error of order O(1/n), where n is the number of nodes. The approximated function is assumed to have a bound on the first moment of the magnitude distribution of the Fourier transform. The nonlinear parameters associated with the sigmoidal nodes, as well as the parameters of linear combination, are adjusted in the approximation. In contrast, it is shown that for series expansions with n terms, in which only the parameters of linear combination are adjusted, the integrated squared approximation error cannot be made smaller than order (1/n)^(2/d) uniformly for functions satisfying the same smoothness assumption, where d is the dimension of the input to the function. For the class of functions examined, the approximation rate and the parsimony of the parameterization of the networks are shown to be advantageous in high-dimensional settings.
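For reference, the bound summarized above can be written out; the following LaTeX restatement paraphrases the paper's main approximation theorem using its standard notation (C_f for the Fourier first moment, B_r for the ball of radius r), and is not a verbatim quotation:

```latex
% Barron's approximation bound (paraphrase, not a verbatim quotation).
For $f$ with Fourier transform $\hat f$, define the first moment of the
Fourier magnitude distribution
\[
  C_f \;=\; \int_{\mathbb{R}^d} |\omega|\,\bigl|\hat f(\omega)\bigr|\,d\omega .
\]
If $C_f < \infty$, then for every $n \ge 1$ there is a one-hidden-layer
sigmoidal network
\[
  f_n(x) \;=\; \sum_{k=1}^{n} c_k\,\sigma(a_k \cdot x + b_k) + c_0
\]
such that, for any probability measure $\mu$ on the ball
$B_r = \{x : |x| \le r\}$,
\[
  \int_{B_r} \bigl(f(x) - f_n(x)\bigr)^2 \,\mu(dx) \;\le\; \frac{(2 r C_f)^2}{n}.
\]
```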
Citations
•
01 Jan 1995
TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Abstract: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
40,147 citations
Cites background from "Universal approximation bounds for ..."
...f(x) = Σ_{i=1}^{N} a_i S(x · w_i + v_i), where a_i and v_i are arbitrary values, w_i are arbitrary vectors, and S(u) is a sigmoid function (a monotonically increasing function such that lim_{u→−∞} S(u) = −1, lim_{u→+∞} S(u) = 1) (Barron, 1993).... [a concrete sketch of this superposition form follows these excerpts]
[...]
...In 1992-1993 Jones, Barron, and Breiman described a structure on different sets of functions that has a fast rate of approximation (Jones, 1992), (Barron, 1993), and (Breiman, 1993)....
[...]
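As flagged in the first excerpt, here is a minimal Python sketch of the quoted superposition form; tanh is one sigmoid satisfying the stated limits of −1 and +1, and all parameter values below are arbitrary illustrations rather than anything taken from either source:

```python
import numpy as np

def sigmoidal_superposition(x, a, W, v):
    """Evaluate f(x) = sum_i a_i * S(w_i . x + v_i) with S = tanh.

    tanh is monotonically increasing with limits -1 and +1, matching
    the sigmoid definition quoted above.
    """
    return a @ np.tanh(W @ x + v)

# Arbitrary illustrative parameters: N = 3 sigmoidal units, d = 2 inputs.
rng = np.random.default_rng(0)
a = rng.normal(size=3)        # outer coefficients a_i
W = rng.normal(size=(3, 2))   # inner weight vectors w_i
v = rng.normal(size=3)        # biases v_i
print(sigmoidal_superposition(np.array([0.5, -1.0]), a, W, v))
```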
•
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.
38,208 citations
•
01 Jan 1995
TL;DR: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition; designed as a text, with over 100 exercises, it will benefit anyone involved in the fields of neural computation and pattern recognition.
Abstract: From the Publisher:
This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition. After introducing the basic concepts, the book examines techniques for modelling probability density functions and the properties and merits of the multi-layer perceptron and radial basis function network models. Also covered are various forms of error functions, principal algorithms for error function minimization, learning and generalization in neural networks, and Bayesian techniques and their applications. Designed as a text, with over 100 exercises, this fully up-to-date work will benefit anyone involved in the fields of neural computation and pattern recognition.
19,056 citations
•
01 Jan 1996
TL;DR: The Bayes error and Vapnik-Chervonenkis theory are applied as guides for empirical classifier selection, alongside lower bounds for such selection and a treatment of the maximum likelihood principle.
Abstract: Preface * Introduction * The Bayes Error * Inequalities and alternate distance measures * Linear discrimination * Nearest neighbor rules * Consistency * Slow rates of convergence * Error estimation * The regular histogram rule * Kernel rules * Consistency of the k-nearest neighbor rule * Vapnik-Chervonenkis theory * Combinatorial aspects of Vapnik-Chervonenkis theory * Lower bounds for empirical classifier selection * The maximum likelihood principle * Parametric classification * Generalized linear discrimination * Complexity regularization * Condensed and edited nearest neighbor rules * Tree classifiers * Data-dependent partitioning * Splitting the data * The resubstitution estimate * Deleted estimates of the error probability * Automatic kernel rules * Automatic nearest neighbor rules * Hypercubes and discrete spaces * Epsilon entropy and totally bounded sets * Uniform laws of large numbers * Neural networks * Other error estimates * Feature extraction * Appendix * Notation * References * Index
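As a pointer to the book's central quantity, the Bayes error of a binary classification problem has a standard closed form; this is textbook material rather than a quotation from the listing above:

```latex
% Bayes error for binary classification (standard definition).
For a pair $(X, Y)$ with $Y \in \{0, 1\}$ and a posteriori probability
$\eta(x) = \Pr\{Y = 1 \mid X = x\}$, the Bayes error is the minimum
probability of error over all classifiers $g$:
\[
  L^{*} \;=\; \inf_{g : \mathcal{X} \to \{0,1\}} \Pr\{ g(X) \neq Y \}
        \;=\; \mathbb{E}\bigl[\min\bigl(\eta(X),\, 1 - \eta(X)\bigr)\bigr].
\]
```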
3,598 citations
•
TL;DR: This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation, and presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions.
Abstract: Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, robustness, and/or speed. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log k) floating-point operations (flops) in contrast to O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multiprocessor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
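The two-stage scheme described in this abstract (randomized range finding, then deterministic factorization of the compressed matrix) can be sketched in a few lines of numpy. This is a minimal illustration in the spirit of the paper's prototype algorithm, not the authors' reference implementation; the oversampling value p = 10 and the test matrix sizes are assumptions made for the demonstration:

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Rank-k SVD sketch via a randomized range finder.

    A: (m, n) array; k: target rank; p: oversampling parameter.
    Returns U (m, k), s (k,), Vt (k, n) with A ~= U @ diag(s) @ Vt.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Stage A: sample the range of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, k + p))
    Q, _ = np.linalg.qr(A @ Omega)           # orthonormal basis, (m, k+p)
    # Stage B: compress A to the subspace, then factor deterministically.
    B = Q.T @ A                               # small (k+p, n) matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]

# Quick check on a synthetic exactly-rank-40 matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))
U, s, Vt = randomized_svd(A, k=40)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))
```

Because the synthetic matrix has exact rank 40 and the sketch oversamples that rank, the printed relative error should be near machine precision.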
3,248 citations
References
•
TL;DR: It is rigorously established that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available.
18,794 citations
•
TL;DR: It is demonstrated that finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables with support in the unit hypercube.
Abstract: In this paper we demonstrate that finite linear combinations of compositions of a fixed, univariate function and a set of affine functionals can uniformly approximate any continuous function of n real variables with support in the unit hypercube; only mild conditions are imposed on the univariate function. Our results settle an open question about representability in the class of single hidden layer neural networks. In particular, we show that arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity. The paper discusses approximation properties of other possible types of nonlinearities that might be implemented by artificial neural networks.
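As a toy illustration of this kind of single-hidden-layer approximation, the sketch below fits a continuous target on [0, 1] by drawing random inner weights and solving least squares for the outer coefficients only. The target sin(2πx), the unit counts, and the weight ranges are arbitrary choices for the demonstration; note that, unlike in Cybenko's result or Barron's fully adjusted networks, the inner parameters here are not optimized:

```python
import numpy as np

def fit_sigmoid_net(x, y, n_units, seed=0):
    """Least-squares fit of y ~ sum_k c_k * sigmoid(a_k * x + b_k).

    Inner parameters (a_k, b_k) are drawn at random; only the outer
    coefficients c_k are optimized (a random-feature fit).
    """
    rng = np.random.default_rng(seed)
    a = rng.uniform(-20, 20, n_units)   # arbitrary inner weight scale
    b = rng.uniform(-20, 20, n_units)
    Phi = 1.0 / (1.0 + np.exp(-(np.outer(x, a) + b)))   # logistic sigmoid
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ c

x = np.linspace(0.0, 1.0, 400)
y = np.sin(2 * np.pi * x)               # arbitrary continuous target
for n in (5, 20, 80):
    err = np.sqrt(np.mean((y - fit_sigmoid_net(x, y, n)) ** 2))
    print(f"n = {n:3d} units, RMS error = {err:.4f}")
```

The RMS error typically shrinks as the number of hidden units grows, in line with the qualitative content of the theorem.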
12,286 citations
•
01 Nov 1971
TL;DR: In this book, the authors present a unified treatment of basic topics that arise in Fourier analysis, and illustrate the role played by the structure of Euclidean spaces, particularly the action of translations, dilatations, and rotations.
Abstract: The authors present a unified treatment of basic topics that arise in Fourier analysis. Their intention is to illustrate the role played by the structure of Euclidean spaces, particularly the action of translations, dilatations, and rotations, and to motivate the study of harmonic analysis on more general spaces having an analogous structure, e.g., symmetric spaces.
5,579 citations
•
TL;DR: A generalization of the PAC learning model based on statistical decision theory is described, in which the learner receives randomly drawn examples, each consisting of an instance x in X and an outcome y in Y, and tries to find a hypothesis h : X → A, with h in H, that specifies the appropriate action a in A to take for each instance x, in order to minimize the expectation of a loss l(y,a).
Abstract: We describe a generalization of the PAC learning model that is based on statistical decision theory. In this model the learner receives randomly drawn examples, each example consisting of an instance x in X and an outcome y in Y, and tries to find a hypothesis h : X → A, where h in H, that specifies the appropriate action a in A to take for each instance x, in order to minimize the expectation of a loss l(y,a). Here X, Y, and A are arbitrary sets, l is a real-valued function, and examples are generated according to an arbitrary joint distribution on X × Y. Special cases include the problem of learning a function from X into Y, the problem of learning the conditional probability distribution on Y given X (regression), and the problem of learning a distribution on X (density estimation). We give theorems on the uniform convergence of empirical loss estimates to true expected loss rates for certain hypothesis spaces H, and show how this implies learnability with bounded sample size, disregarding computational complexity. As an application, we give distribution-independent upper bounds on the sample size needed for learning with feedforward neural networks. Our theorems use a generalized notion of VC dimension that applies to classes of real-valued functions, adapted from Pollard's work, and a notion of *capacity* and *metric dimension* for classes of functions that map into a bounded metric space. (Supersedes 89-30 and 90-52.) [Also in "Information and Computation", Vol. 100, No. 1, September 1992]
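The uniform convergence statement mentioned in this abstract has a standard shape, written below as a generic paraphrase of the decision-theoretic setup rather than Haussler's exact theorem:

```latex
% Generic uniform-convergence setup (paraphrase, not the exact theorem).
Given i.i.d.\ examples $(x_1,y_1),\dots,(x_m,y_m)$ drawn from a joint
distribution $P$ on $X \times Y$, define the true and empirical losses of a
hypothesis $h \in H$ by
\[
  L(h) = \mathbb{E}_{(x,y)\sim P}\,\ell\bigl(y, h(x)\bigr),
  \qquad
  \hat L_m(h) = \frac{1}{m}\sum_{i=1}^{m} \ell\bigl(y_i, h(x_i)\bigr).
\]
Uniform convergence of empirical loss estimates means that for every
$\varepsilon > 0$,
\[
  \Pr\Bigl\{\, \sup_{h \in H} \bigl|\hat L_m(h) - L(h)\bigr| > \varepsilon \,\Bigr\}
  \;\longrightarrow\; 0 \quad \text{as } m \to \infty,
\]
with sample-size bounds controlled by a capacity measure of $H$ (a
real-valued generalization of VC dimension).
```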
1,025 citations
•
22 Jan 1985
TL;DR: Tchebycheff systems and total positivity are developed as tools for the study of n-widths, including the existence of optimal subspaces for d_n.
Abstract: I. Introduction.- II. Basic Properties of n-Widths.- 1. Properties of d_n.- 2. Existence of Optimal Subspaces for d_n.- 3. Properties of d^n.- 4. Properties of δ_n.- 5. Inequalities Between n-Widths.- 6. Duality Between d_n and d^n.- 7. n-Widths of Mappings of the Unit Ball.- 8. Some Relationships Between d_n(T), d^n(T) and δ_n(T).- Notes and References.- III. Tchebycheff Systems and Total Positivity.- 1. Tchebycheff Systems.- 2. Matrices.- 3. Kernels.- 4. More on Kernels.- IV. n-Widths in Hilbert Spaces.- 1. Introduction.- 2. n-Widths of Compact Linear Operators.- 3. n-Widths, with Constraints.- 3.1 Restricted Approximating Subspaces.- 3.2 Restricting the Unit Ball and Optimal Recovery.- 3.3 n-Widths Under a Pair of Constraints.- 3.4 A Theorem of Ismagilov.- 4. n-Widths of Compact Periodic Convolution Operators.- 4.1 n-Widths as Fourier Coefficients.- 4.2 A Return to Ismagilov's Theorem.- 4.3 Bounded mth Modulus of Continuity.- 5. n-Widths of Totally Positive Operators in L^2.- 5.1 The Main Theorem.- 5.2 Restricted Approximating Subspaces.- 6. Certain Classes of Periodic Functions.- 6.1 n-Widths of Cyclic Variation Diminishing Operators.- 6.2 n-Widths for Kernels Satisfying Property B.- Notes and References.- V. Exact n-Widths of Integral Operators.- 1. Introduction.- 2. Exact n-Widths of K_∞ in L^q and K_p in L^1.- 3. Exact n-Widths of K_∞^r in L^q and K_p^r in L^1.- 4. Exact n-Widths for Periodic Functions.- 5. n-Widths of Rank n + 1 Kernels.- Notes and References.- VI. Matrices and n-Widths.- 1. Introduction and General Remarks.- 2. n-Widths of Diagonal Matrices.- 2.1 The Exact Solution for q ≤ p and p = 1, q = 2.- 2.2 Various Estimates for p = 1, q = ∞.- 3. n-Widths of Strictly Totally Positive Matrices.- Notes and References.- VII. Asymptotic Estimates for n-Widths of Sobolev Spaces.- 1. Introduction.- 2. Optimal Lower Bounds.- 3. Optimal Upper Bounds.- 4. Another Look at δ_n(B_1^(r); L_∞).- Notes and References.- VIII. n-Widths of Analytic Functions.- 1. Introduction.- 2. n-Widths of Analytic Functions with Bounded mth Derivative.- 3. n-Widths of Analytic Functions in H^2.- 4. n-Widths of Analytic Functions in H^∞.- 5. n-Widths of a Class of Entire Functions.- Notes and References.- Glossary of Selected Symbols.- Author Index.
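For readers arriving from the Barron paper above, the central quantity of this book is the Kolmogorov n-width; its standard definition (textbook material, not quoted from this listing) is:

```latex
% Kolmogorov n-width (standard definition).
For a subset $A$ of a normed linear space $X$,
\[
  d_n(A; X) \;=\; \inf_{X_n} \, \sup_{f \in A} \, \inf_{g \in X_n} \|f - g\|_X ,
\]
where the outer infimum runs over all $n$-dimensional linear subspaces
$X_n$ of $X$.
```

Barron's lower bound for fixed basis functions is a statement of exactly this type: no sequence of n-dimensional linear subspaces can approximate the whole smoothness class at a rate better than (1/n)^(2/d).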
894 citations