02 Apr 2004-Science (American Association for the Advancement of Science)-Vol. 304, Iss: 5667, pp 78-80
TL;DR: This paper presents echo state networks (ESNs), a method for learning nonlinear systems that employs artificial recurrent neural networks in a way recently proposed independently as a learning mechanism in biological brains.
Abstract: We present a method for learning nonlinear systems, echo state networks (ESNs). ESNs employ artificial recurrent neural networks in a way that has recently been proposed independently as a learning mechanism in biological brains. The learning method is computationally efficient and easy to use. On a benchmark task of predicting a chaotic time series, accuracy is improved by a factor of 2400 over previous techniques. The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.
Most technical systems, however, become nonlinear if operated at higher operational points (that is, closer to saturation).
The output neuron was equipped with random connections that project back into the reservoir (Fig. 2B).
This was ensured by a sparse interconnectivity of 1% within the reservoir.
The network output y(3084) was compared with the correct continuation d(3084).
The authors showed analytically (16) that under certain conditions an ESN of size N may be able to “remember” a number of previous inputs that is of the same order of magnitude as N.
This sequence is first transformed into an analog envelope signal d(n), then modulated on a high-frequency carrier signal and transmitted, then received and demodulated into an analog signal u(n), which is a corrupted version of d(n).
The quality measure for the entire process is the fraction of incorrect symbols finally obtained (symbol error rate).
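The excerpts above mention a sparsely connected reservoir and a trained output. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: it omits the output feedback connections, uses a toy sine-wave prediction task instead of the paper's benchmarks, and fits the readout by ridge regression. The function names, reservoir size, spectral radius of 0.8, washout length, and ridge constant are all illustrative choices.

```python
import numpy as np

def make_reservoir(n, density=0.01, spectral_radius=0.8, seed=0):
    """Random sparse recurrent weight matrix, rescaled to a chosen spectral radius."""
    rng = np.random.default_rng(seed)
    mask = rng.random((n, n)) < density            # keep roughly 1% of connections
    w = np.where(mask, rng.uniform(-1.0, 1.0, (n, n)), 0.0)
    radius = np.max(np.abs(np.linalg.eigvals(w)))
    if radius == 0.0:
        raise ValueError("reservoir has zero spectral radius; try a higher density")
    return w * (spectral_radius / radius)

def run_reservoir(w, w_in, u):
    """Drive the fixed reservoir with input u and collect its states."""
    x = np.zeros(w.shape[0])
    states = np.empty((len(u), w.shape[0]))
    for t, u_t in enumerate(u):
        x = np.tanh(w @ x + w_in * u_t)
        states[t] = x
    return states

def train_readout(states, targets, ridge=1e-6):
    """Only the output weights are learned, here by ridge regression."""
    n = states.shape[1]
    return np.linalg.solve(states.T @ states + ridge * np.eye(n), states.T @ targets)

# Toy task (an assumption for illustration): predict the next sample of a sine wave.
rng = np.random.default_rng(1)
signal = np.sin(0.2 * np.arange(1201))
u, d = signal[:-1], signal[1:]
w = make_reservoir(300)
w_in = rng.uniform(-1.0, 1.0, 300)
states = run_reservoir(w, w_in, u)
washout = 100                                       # discard the initial transient
w_out = train_readout(states[washout:1000], d[washout:1000])
pred = states[1000:] @ w_out
nrmse = np.sqrt(np.mean((pred - d[1000:]) ** 2) / np.var(d[1000:]))
```

Because the recurrent and input weights stay fixed, training reduces to a single linear solve for `w_out`, which is one way to read the abstract's claim that the method is computationally efficient.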
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, and reviews deep supervised learning, unsupervised learning, reinforcement learning and evolutionary computation, and indirect search for short programs encoding deep and large networks.
14,635 citations
Cites background or methods or result from "Harnessing Nonlinearity: Predicting..."
...In fact, some popular RNN algorithms restricted credit assignment to a single step backwards (Elman, 1990; Jordan, 1986, 1997), also in more recent studies (Jaeger, 2001, 2004; Maass et al., 2002)....
[...]
...Compare other RNN algorithms (Jaeger, 2004; Schmidhuber et al., 2007; Pascanu et al., 2013b; Koutník et al., 2014) that also at least sometimes yield better results than steepest descent for LSTM RNNs....
[...]
...Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable....
[...]
...Gradient-based LSTM is no panacea though—other methods sometimes outperformed it at least on certain tasks (Jaeger, 2004; Schmidhuber et al., 2007; Martens and Sutskever, 2011; Pascanu et al., 2013b; Koutník et al., 2014) (compare Sec....
[...]
...In fact, some popular RNN algorithms restricted credit assignment to a single step backwards (Elman, 1988; Jordan, 1986), also in more recent studies (Jaeger, 2002; Maass et al., 2002; Jaeger, 2004)....
TL;DR: It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
Abstract: Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned.
Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.
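As a rough illustration of gradient descent with momentum and a slowly increasing momentum parameter, here is a minimal sketch. The linear ramp in `momentum_schedule`, the learning rate, and the quadratic test objective are assumptions chosen for illustration; they are not the schedule, initialization, or tasks used in the paper.

```python
import numpy as np

def momentum_schedule(step, mu_start=0.5, mu_max=0.9, ramp_steps=200):
    """Hypothetical schedule: ramp the momentum linearly, then hold it at mu_max."""
    frac = min(step / ramp_steps, 1.0)
    return mu_start + frac * (mu_max - mu_start)

def sgd_momentum(grad_fn, theta, lr=0.01, steps=500):
    """Classical momentum: the velocity accumulates a decaying sum of past gradients."""
    v = np.zeros_like(theta)
    for t in range(steps):
        mu = momentum_schedule(t)
        v = mu * v - lr * grad_fn(theta)
        theta = theta + v
    return theta

# Toy objective (an assumption): f(theta) = 0.5 * theta^T A theta, minimized at 0.
A = np.diag([1.0, 10.0])
theta_final = sgd_momentum(lambda th: A @ th, np.array([1.0, 1.0]))
```

On this poorly conditioned quadratic the velocity term speeds up progress along the low-curvature direction, which is the kind of curvature issue the abstract says well-tuned momentum can handle.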
4,121 citations
Cites background from "Harnessing Nonlinearity: Predicting..."
...As argued by Jaeger & Haas (2004), the spectral radius of the hidden-to-hidden matrix has a profound effect on the dynamics of the RNN’s hidden state (with a tanh nonlinearity)....
TL;DR: This paper proposes a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem and validates empirically the hypothesis and proposed solutions.
Abstract: There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.
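The gradient norm clipping idea named in the abstract can be sketched in a few lines: when the gradient norm exceeds a threshold, rescale the gradient so its norm equals the threshold while keeping its direction. The function name and the threshold values below are illustrative assumptions, and the soft constraint for vanishing gradients is not shown.

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale grad to have norm `threshold` if its norm is larger; otherwise return it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

large = clip_gradient_norm(np.array([3.0, 4.0]), threshold=1.0)   # norm 5 is rescaled to 1
small = clip_gradient_norm(np.array([0.3, 0.4]), threshold=1.0)   # norm 0.5 is left alone
```

Clipping the norm rather than each component keeps the update pointing along the original gradient direction.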
3,549 citations
Cites background or result from "Harnessing Nonlinearity: Predicting..."
...Echo State Networks (Jaeger and Haas, 2004) avoid the exploding and vanishing gradients problem by not learning Wrec and Win....
[...]
...In most cases, these results outperforms Martens and Sutskever (2011) in terms of success rate, they deal with longer sequences than in Hochreiter and Schmidhuber (1997) and compared to (Jaeger, 2012) they generalize to longer sequences....
[...]
...Echo State Networks (Lukoševičius and Jaeger, 2009) avoid the exploding and vanishing gradients problem by not learning the recurrent and input weights....
TL;DR: This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis, and describes further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations or true recurrent networks, and other practical issues which arise with this method.
Abstract: Basic backpropagation, which is a simple method now being widely used in areas like pattern recognition and fault diagnosis, is reviewed. The basic equations for backpropagation through time, and applications to areas like pattern recognition involving dynamic systems, systems identification, and control are discussed. Further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations, or true recurrent networks, and other practical issues arising with the method are described. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly discussed. The focus is on designing a simpler version of backpropagation which can be translated into computer code and applied directly by neural network users.
TL;DR: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Abstract: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.
TL;DR: First-order nonlinear differential-delay equations describing physiological control systems displaying a broad diversity of dynamical behavior including limit cycle oscillations, with a variety of wave forms, and apparently aperiodic or "chaotic" solutions are studied.
Abstract: First-order nonlinear differential-delay equations describing physiological control systems are studied. The equations display a broad diversity of dynamical behavior including limit cycle oscillations, with a variety of wave forms, and apparently aperiodic or "chaotic" solutions. These results are discussed in relation to dynamical respiratory and hematopoietic diseases.
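Assuming this entry refers to the delay equation commonly known as the Mackey-Glass equation, dx/dt = beta * x(t - tau) / (1 + x(t - tau)^n) - gamma * x(t), the sketch below generates a series from it by simple Euler integration. The parameter values (tau = 17, beta = 0.2, gamma = 0.1, n = 10) are ones commonly used in time-series benchmarks; together with the step size and the constant initial history, they are assumptions rather than values taken from this abstract.

```python
import numpy as np

def mackey_glass(n_steps, tau=17.0, beta=0.2, gamma=0.1, power=10, dt=0.1, x0=1.2):
    """Euler integration of the delay equation, using a constant history equal to x0."""
    delay = int(round(tau / dt))                 # delay expressed in integration steps
    x = np.full(n_steps + delay, x0)
    for t in range(delay, n_steps + delay - 1):
        x_tau = x[t - delay]                     # delayed state x(t - tau)
        x[t + 1] = x[t] + dt * (beta * x_tau / (1.0 + x_tau ** power) - gamma * x[t])
    return x[delay:]

series = mackey_glass(5000)                      # 500 time units at dt = 0.1
```

A finer step size or a higher-order integrator would give a more accurate trajectory; Euler is used here only to keep the sketch short.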
TL;DR: A new computational model for real-time computing on time-varying input that provides an alternative to paradigms based on Turing machines or attractor neural networks, based on principles of high-dimensional dynamical systems in combination with statistical learning theory and can be implemented on generic evolved or found recurrent circuitry.
Abstract: A key challenge for neural modeling is to explain how a continuous stream of multimodal input from a rapidly changing environment can be processed by stereotypical recurrent circuits of integrate-and-fire neurons in real time. We propose a new computational model for real-time computing on time-varying input that provides an alternative to paradigms based on Turing machines or attractor neural networks. It does not require a task-dependent construction of neural circuits. Instead, it is based on principles of high-dimensional dynamical systems in combination with statistical learning theory and can be implemented on generic evolved or found recurrent circuitry. It is shown that the inherent transient dynamics of the high-dimensional dynamical system formed by a sufficiently large and heterogeneous neural circuit may serve as universal analog fading memory. Readout neurons can learn to extract in real time from the current state of such recurrent neural circuit information about current and past inputs that may be needed for diverse tasks. Stable internal states are not required for giving a stable output, since transient internal states can be transformed by readout neurons into stable target outputs due to the high dimensionality of the dynamical system. Our approach is based on a rigorous computational model, the liquid state machine, that, unlike Turing machines, does not require sequential transitions between well-defined discrete internal states. It is supported, as the Turing machine is, by rigorous mathematical results that predict universal computational power under idealized conditions, but for the biologically more realistic scenario of real-time processing of time-varying inputs. Our approach provides new perspectives for the interpretation of neural coding, the design of experiments and data analysis in neurophysiology, and the solution of problems in robotics and neurotechnology.
TL;DR: It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure are determined by the gradient of an energy function whose shape can be modulated through a neighborhood determining parameter and resemble the dynamics of Brownian particles moving in a potential determined by the data point density.
Abstract: A neural network algorithm based on a soft-max adaptation rule is presented. This algorithm exhibits good performance in reaching the optimum minimization of a cost function for vector quantization data compression. The soft-max rule employed is an extension of the standard K-means clustering procedure and takes into account a neighborhood ranking of the reference (weight) vectors. It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure are determined by the gradient of an energy function whose shape can be modulated through a neighborhood determining parameter and resemble the dynamics of Brownian particles moving in a potential determined by the data point density. The network is used to represent the attractor of the Mackey-Glass equation and to predict the Mackey-Glass time series, with additional local linear mappings for generating output values. The results obtained for the time-series prediction compare favorably with the results achieved by backpropagation and radial basis function networks.
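One common way to write a rank-based soft-max adaptation step of the kind this abstract describes is sketched below: every reference vector moves toward the input, with a step that decays exponentially in its distance rank. The function name, the step size `eps`, and the neighborhood parameter `lam` are illustrative assumptions, and the local linear mappings used for prediction are not shown.

```python
import numpy as np

def rank_based_update(weights, x, eps=0.1, lam=1.0):
    """Move each reference vector toward x, weighted by exp(-rank / lam)."""
    dists = np.linalg.norm(weights - x, axis=1)
    ranks = np.argsort(np.argsort(dists))        # 0 for the closest vector, 1 for the next, ...
    h = np.exp(-ranks / lam)                     # neighborhood weighting by rank
    return weights + eps * h[:, None] * (x - weights)

weights = np.array([[0.0, 0.0], [2.0, 0.0], [5.0, 0.0]])
updated = rank_based_update(weights, np.array([0.5, 0.0]))
```

As `lam` approaches zero only the closest vector moves, which recovers an online K-means style update and matches the abstract's description of the rule as an extension of K-means.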
Q1. What contributions have the authors mentioned in the paper "Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication"?
The authors present a method for learning nonlinear systems, echo state networks (ESNs). The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.