
Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication

02 Apr 2004-Science (American Association for the Advancement of Science)-Vol. 304, Iss: 5667, pp 78-80
TL;DR: A method for learning nonlinear systems, echo state networks (ESNs), which employ artificial recurrent neural networks in a way that has recently been proposed independently as a learning mechanism in biological brains is presented.
Abstract: We present a method for learning nonlinear systems, echo state networks (ESNs). ESNs employ artificial recurrent neural networks in a way that has recently been proposed independently as a learning mechanism in biological brains. The learning method is computationally efficient and easy to use. On a benchmark task of predicting a chaotic time series, accuracy is improved by a factor of 2400 over previous techniques. The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.

Summary (1 min read)

Summary

  • The authors present a method for learning nonlinear systems, echo state networks (ESNs).
  • The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.
  • Most technical systems, however, become nonlinear if operated at higher operational points (that is, closer to saturation).
  • The output neuron was equipped with random connections that project back into the reservoir (Fig. 2B).
  • This was ensured by a sparse interconnectivity of 1% within the reservoir.
  • The network output y(3084) was compared with the correct continuation d(3084).
  • The authors showed analytically (16) that under certain conditions an ESN of size N may be able to “remember” a number of previous inputs that is of the same order of magnitude as N.
  • This sequence is first transformed into an analog envelope signal d(n), then modulated on a high-frequency carrier signal and transmitted, then received and demodulated into an analog signal u(n), which is a corrupted version of d(n).
  • The quality measure for the entire process is the fraction of incorrect symbols finally obtained (symbol error rate).


Harnessing Nonlinearity: Predicting
Chaotic Systems and Saving Energy
in Wireless Communication
Herbert Jaeger* and Harald Haas
We present a method for learning nonlinear systems, echo state networks
(ESNs). ESNs employ artificial recurrent neural networks in a way that has
recently been proposed independently as a learning mechanism in biological
brains. The learning method is computationally efficient and easy to use. On
a benchmark task of predicting a chaotic time series, accuracy is improved by
a factor of 2400 over previous techniques. The potential for engineering ap-
plications is illustrated by equalizing a communication channel, where the signal
error rate is improved by two orders of magnitude.
Nonlinear dynamical systems abound in the
sciences and in engineering. If one wishes to
simulate, predict, filter, classify, or control such
a system, one needs an executable system mod-
el. However, it is often infeasible to obtain
analytical models. In such cases, one has to
resort to black-box models, which ignore the
internal physical mechanisms and instead re-
produce only the outwardly observable input-
output behavior of the target system.
If the target system is linear, efficient
methods for black-box modeling are avail-
able. Most technical systems, however, be-
come nonlinear if operated at higher opera-
tional points (that is, closer to saturation).
Although this might lead to cheaper and more
energy-efficient designs, it is not done be-
cause the resulting nonlinearities cannot be
harnessed. Many biomechanical systems use
their full dynamic range (up to saturation)
and thereby become lightweight, energy effi-
cient, and thoroughly nonlinear.
Here, we present an approach to learn-
ing black-box models of nonlinear systems,
echo state networks (ESNs). An ESN is an
artificial recurrent neural network (RNN).
RNNs are characterized by feedback (“re-
current”) loops in their synaptic connection
pathways. They can maintain an ongoing
activation even in the absence of input and
thus exhibit dynamic memory. Biological
neural networks are typically recurrent.
Like biological neural networks, an artifi-
cial RNN can learn to mimic a target
system—in principle, with arbitrary accu-
racy (1). Several learning algorithms are
known (2–4) that incrementally adapt the
synaptic weights of an RNN in order to
tune it toward the target system. These
algorithms have not been widely employed
in technical applications because of slow
convergence and suboptimal solutions (5,
6). The ESN approach differs from these
methods in that a large RNN is used (on the
order of 50 to 1000 neurons; previous tech-
niques typically use 5 to 30 neurons) and in
that only the synaptic connections from the
RNN to the output readout neurons are
modified by learning; previous techniques
tune all synaptic connections (Fig. 1). Be-
cause there are no cyclic dependencies be-
tween the trained readout connections,
training an ESN becomes a simple linear
regression task.
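To make the readout-only training scheme concrete, the following minimal sketch (not the authors' code; the reservoir size, weight ranges, and tanh nonlinearity are our assumptions) drives a fixed random reservoir with a toy input and fits only the output weights by linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)

N, T = 300, 2000                        # reservoir size and training length (assumed)
u = np.sin(0.2 * np.arange(T))          # toy input signal
d = np.roll(u, -1)                      # toy target: the next value of the input

W_in = rng.uniform(-0.5, 0.5, N)        # fixed random input weights
W = rng.uniform(-0.5, 0.5, (N, N))      # fixed random recurrent weights
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # keep the reservoir dynamics stable (assumption)

# Harvest reservoir states with a single forward run; no backpropagation through time.
x, X = np.zeros(N), np.zeros((T, N))
for n in range(T):
    x = np.tanh(W @ x + W_in * u[n])
    X[n] = x

# Only the readout is trained, by ordinary least squares.
washout = 100                           # discard the initial transient
w_out, *_ = np.linalg.lstsq(X[washout:], d[washout:], rcond=None)
print("training MSE:", np.mean((X[washout:] @ w_out - d[washout:]) ** 2))
```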
We illustrate the ESN approach on a
task of chaotic time series prediction (Fig.
2) (7). The Mackey-Glass system (MGS)
(8) is a standard benchmark system for time
series prediction studies. It generates a sub-
tly irregular time series (Fig. 2A). The
prediction task has two steps: (i) using an
initial teacher sequence generated by the
original MGS to learn a black-box model M
of the generating system, and (ii) using M
to predict the value of the sequence some
steps ahead.
First, we created a random RNN with
1000 neurons (called the “reservoir”) and one
output neuron. The output neuron was
equipped with random connections that
project back into the reservoir (Fig. 2B). A
3000-step teacher sequence d(1),...,
d(3000) was generated from the MGS equa-
tion and fed into the output neuron. This
excited the internal neurons through the out-
put feedback connections. After an initial
transient period, they started to exhibit sys-
tematic individual variations of the teacher
sequence (Fig. 2B).
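The precise state-update equation is given in the supporting online material; a commonly used form for such an output-feedback-driven reservoir with tanh units, stated here only as an assumption and not as a quotation from the paper, is

\[
x(n+1) = \tanh\bigl( W\,x(n) + W^{\mathrm{back}}\, d(n) \bigr),
\]

where W collects the fixed internal reservoir weights and W^back the fixed feedback weights from the output neuron into the reservoir.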
The fact that the internal neurons display
systematic variants of the exciting external
signal is constitutional for ESNs: The internal
neurons must work as “echo functions” for
the driving signal. Not every randomly gen-
erated RNN has this property, but it can
effectively be built into a reservoir (support-
ing online text).
It is important that the echo signals be
richly varied. This was ensured by a sparse
interconnectivity of 1% within the reservoir.
This condition lets the reservoir decompose
into many loosely coupled subsystems, estab-
lishing a richly structured reservoir of excit-
able dynamics.
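A reservoir with these two ingredients, sparse random connectivity and stable echo-state dynamics, can be set up in a few lines. The sketch below is illustrative: the 1% connectivity follows the text, while scaling the spectral radius below 1 is a common recipe and an assumption here; the precise conditions are in the supporting online text.

```python
import numpy as np

rng = np.random.default_rng(1)

N, density = 1000, 0.01                          # 1000 neurons, 1% interconnectivity
mask = rng.random((N, N)) < density              # sparse connection pattern
W = np.where(mask, rng.uniform(-1.0, 1.0, (N, N)), 0.0)

# Rescale so the spectral radius is below 1 (a standard way to obtain the
# echo state property; an assumption, see the supporting online text).
W *= 0.8 / max(abs(np.linalg.eigvals(W)))
print("average connections per neuron:", mask.sum() / N)
```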
After time n = 3000, output connection weights w_i (i = 1, . . . , 1000) were computed (dashed arrows in Fig. 2B) from the last 2000 steps n = 1001, . . . , 3000 of the training run such that the training error

\[
\mathrm{MSE}_{\mathrm{train}} = \frac{1}{2000} \sum_{n=1001}^{3000} \Bigl( d(n) - \sum_{i=1}^{1000} w_i\, x_i(n) \Bigr)^{2}
\]

was minimized [x_i(n), activation of the ith internal neuron at time n]. This is a simple linear regression.
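In matrix form this regression is solved in one line; a sketch with our own variable names (X holding the 2000 harvested state vectors, d the corresponding teacher values):

```python
import numpy as np

# Placeholders standing in for the quantities collected during training:
# X[k] = reservoir state x(n) and d[k] = teacher value d(n) for n = 1001 + k.
X = np.random.randn(2000, 1000)
d = np.random.randn(2000)

# w minimizes MSE_train = (1/2000) * sum_n (d(n) - sum_i w_i x_i(n))^2
w = np.linalg.pinv(X) @ d
print("MSE_train:", np.mean((d - X @ w) ** 2))
```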
With the new w_i in place, the ESN was disconnected from the teacher after step 3000 and left running freely. A bidirectional dynamical interplay of the network-generated output signal with the internal signals x_i(n) unfolded. The output signal y(n) was created from the internal neuron activation signals x_i(n) through the trained connections w_i, by

\[
y(n) = \sum_{i=1}^{1000} w_i\, x_i(n).
\]

Conversely, the internal signals were echoed from that output signal through the fixed output feedback connections (supporting online text).
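A free-running (generative) phase of this kind might look as follows; this is our own sketch of the idea, with the exact update deferred to the supporting online material:

```python
import numpy as np

def free_run(W, W_back, w_out, x, steps):
    """Run the trained network autonomously: the readout y(n) is fed back
    into the reservoir through the fixed feedback weights (illustrative)."""
    ys = []
    y = float(w_out @ x)                  # output at the last teacher-forced step
    for _ in range(steps):
        x = np.tanh(W @ x + W_back * y)   # internal signals echo the output
        y = float(w_out @ x)              # y(n) = sum_i w_i x_i(n)
        ys.append(y)
    return np.array(ys), x

# Toy usage with a small reservoir (shapes only; weights would come from training):
rng = np.random.default_rng(2)
N = 50
W = rng.uniform(-1, 1, (N, N)); W *= 0.8 / max(abs(np.linalg.eigvals(W)))
W_back, w_out, x0 = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N), np.zeros(N)
y_pred, _ = free_run(W, W_back, w_out, x0, steps=84)
print(y_pred.shape)
```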
For testing, an 84-step continuation d(3001), . . . , d(3084) of the original signal was computed for reference. The network output y(3084) was compared with the correct continuation d(3084). Averaged over 100 independent trials, a normalized root mean square error

\[
\mathrm{NRMSE} = \left( \frac{1}{100\,\sigma^{2}} \sum_{j=1}^{100} \bigl( d_j(3084) - y_j(3084) \bigr)^{2} \right)^{1/2} \approx 10^{-4.2}
\]

was obtained (d_j and y_j, teacher and network output in trial j; σ², variance of the MGS signal), improving the best previous techniques (9–15), which used training sequences of length 500 to 10,000, by a factor of 700. If the prediction run was continued, deviations typically became visible after about 1300 steps (Fig. 2A). With a refined variant of the learning method (7), the improvement factor rises to 2400. Models of similar accuracy were also obtained for other chaotic systems (supporting online text).

International University Bremen, Bremen D-28759, Germany.

*To whom correspondence should be addressed. E-mail: h.jaeger@iu-bremen.de

Fig. 1. (A) Schema of previous approaches to RNN learning. (B) Schema of ESN approach. Solid bold arrows, fixed synaptic connections; dotted arrows, adjustable connections. Both approaches aim at minimizing the error d(n) – y(n), where y(n) is the network output and d(n) is the teacher time series observed from the target system.
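For concreteness, the NRMSE defined above can be computed from the per-trial values d_j(3084) and y_j(3084) as follows (a sketch with hypothetical arrays, not the authors' evaluation script):

```python
import numpy as np

def nrmse_at_step(d_ref, y_pred, signal_var):
    """NRMSE across independent trials at one prediction step:
    sqrt( sum_j (d_j - y_j)^2 / (n_trials * signal_var) )."""
    return np.sqrt(np.mean((d_ref - y_pred) ** 2) / signal_var)

# Hypothetical example with 100 trials
rng = np.random.default_rng(3)
d_ref = rng.standard_normal(100)                  # d_j(3084) for each trial j
y_pred = d_ref + 1e-4 * rng.standard_normal(100)  # y_j(3084) for each trial j
print(nrmse_at_step(d_ref, y_pred, signal_var=d_ref.var()))
```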
The main reason for the jump in modeling
accuracy is that ESNs capitalize on a massive
short-term memory. We showed analytically
(16) that under certain conditions an ESN of
size N may be able to “remember” a number
of previous inputs that is of the same order of
magnitude as N. This information is more
massive than the information used in other
techniques (supporting online text).
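Such a memory can be probed empirically by training separate linear readouts to reproduce delayed copies of an i.i.d. input and summing the squared correlations, along the general lines of (16); the sketch below is our own construction, not a procedure taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, max_delay = 50, 4000, 40

u = rng.uniform(-0.5, 0.5, T)                   # i.i.d. input signal
W_in = rng.uniform(-0.5, 0.5, N)
W = rng.uniform(-1, 1, (N, N))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))

x, X = np.zeros(N), np.zeros((T, N))
for n in range(T):
    x = np.tanh(W @ x + W_in * u[n])
    X[n] = x

# For each delay k, fit a readout that outputs u(n - k); the sum of squared
# correlations estimates the short-term memory capacity, which under suitable
# conditions is expected to be on the order of N.
capacity = 0.0
for k in range(1, max_delay + 1):
    w, *_ = np.linalg.lstsq(X[k:], u[:-k], rcond=None)
    capacity += np.corrcoef(X[k:] @ w, u[:-k])[0, 1] ** 2
print("estimated memory capacity:", capacity)
```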
We now illustrate the approach in a task
of practical relevance, namely, the equaliza-
tion of a wireless communication channel
(7). The essentials of equalization are as fol-
lows: A sender wants to communicate a sym-
bol sequence s(n). This sequence is first
transformed into an analog envelope signal
d(n), then modulated on a high-frequency
carrier signal and transmitted, then received
and demodulated into an analog signal u(n),
which is a corrupted version of d(n). Major
sources of corruption are noise (thermal or
due to interfering signals), multipath propa-
gation, which leads to a superposition of ad-
jacent symbols (intersymbol interference),
and nonlinear distortion induced by operating
the sender's power amplifier in the high-gain
region. To avoid the latter, the actual power
amplification is run well below the maximum
amplification possible, thereby incurring a
substantial loss in energy efficiency, which is
clearly undesirable in cell-phone and satellite
communications. The corrupted signal u(n) is
then passed through an equalizing filter
whose output y(n) should restore u(n) as
closely as possible to d(n). Finally, the equal-
ized signal y(n) is converted back into a
symbol sequence. The quality measure for
the entire process is the fraction of incorrect
symbols finally obtained (symbol error rate).
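The structure of such a task is easy to mock up. The channel below is purely illustrative (it is not the model of (17)): it combines intersymbol interference, a weak second- and third-order nonlinearity, and additive white Gaussian noise, followed by a symbol error rate count; the 4-level symbol alphabet is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)

def toy_channel(d, snr_db):
    """Illustrative corruption only: ISI over a few symbols, a mild
    polynomial nonlinearity, and additive white Gaussian noise."""
    isi = np.convolve(d, [1.0, 0.5, -0.3, 0.1], mode="same")  # intersymbol interference
    q = isi + 0.05 * isi**2 - 0.02 * isi**3                   # nonlinear distortion
    noise_power = np.var(q) / 10 ** (snr_db / 10)
    return q + rng.normal(0.0, np.sqrt(noise_power), q.shape)

def symbol_error_rate(decided, sent):
    return np.mean(decided != sent)

s = rng.choice([-3, -1, 1, 3], size=10_000)       # 4-level symbol sequence (assumed)
u = toy_channel(s.astype(float), snr_db=24)
# A trivial "equalizer" that simply quantizes u(n) back to the nearest symbol:
decided = 2 * np.round((np.clip(u, -3, 3) + 3) / 2) - 3
print("SER with no real equalizer:", symbol_error_rate(decided, s))
```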
To compare the performance of an ESN
equalizer with standard techniques, we took
a channel model for a nonlinear wireless
transmission system from a study (17) that
compared three customary nonlinear equal-
ization methods: a linear decision feedback
equalizer (DFE), which is actually a non-
linear method; a Volterra DFE; and a bilin-
ear DFE. The model equation featured
intersymbol interference across 10 consec-
utive symbols, a second-order and a third-
order nonlinear distortion, and additive
white Gaussian noise. All methods investi-
gated in that study had 47 adjustable pa-
rameters and used sequences of 5000
symbols for training. To make the ESN
equalizer comparable with the equalizers
studied in (17), we took ESNs with a res-
ervoir of 46 neurons (which is small for the
ESN approach), which yielded 47 adjust-
able parameters. (The 47th comes from a
direct connection from the input to the
output neuron.)
We carried out numerous learning trials
(7) to obtain ESN equalizers, using an online
learning method (a version of the recursive
least square algorithm known from linear
adaptive filters) to train the output weights on
5000-step training sequences. We chose an
online adaptation scheme here because the
methods in (17) were online adaptive, too,
and because wireless communication chan-
nels mostly are time-varying, such that an
equalizer must adapt to changing system
characteristics. The entire learning-testing
procedure was repeated for signal-to-noise
ratios ranging from 12 to 32 dB. Figure 3
compares the average symbol error rates ob-
tained with the results reported in (17), show-
ing an improvement of two orders of magnitude for
high signal-to-noise ratios.
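The online weight adaptation can be realized with the textbook recursive least squares (RLS) update from adaptive filtering; the sketch below is generic, and the forgetting factor, initialization, and the 47-dimensional feature vector combining 46 reservoir states with the direct input are our assumptions about the setup, not values taken from the paper.

```python
import numpy as np

class RLSReadout:
    """Generic recursive least squares for the output weights only."""

    def __init__(self, n_features, forgetting=0.999, delta=1.0):
        self.w = np.zeros(n_features)
        self.P = np.eye(n_features) / delta     # inverse correlation matrix estimate
        self.lam = forgetting

    def update(self, x, d):
        """One online step: x is the feature vector (reservoir state plus direct
        input), d the desired output; returns the prediction made before updating."""
        y = self.w @ x
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)            # gain vector
        self.w += k * (d - y)
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return y

# Toy usage with 47 adjustable parameters, mirroring the equalizer setting:
rls = RLSReadout(n_features=47)
rng = np.random.default_rng(6)
for _ in range(5000):
    x = rng.standard_normal(47)
    d = x[:5].sum()                             # toy target
    rls.update(x, d)
print(rls.w[:5])                                # should approach [1, 1, 1, 1, 1]
```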
For tasks with multichannel input and/or
output, the ESN approach can be accommo-
dated simply by adding more input or output
neurons (16, 18).
ESNs can be applied to all basic tasks of
signal processing and control, including time
series prediction, inverse modeling, pattern
generation, event detection and classification,
modeling distributions of stochastic process-
es, filtering, and nonlinear control (16, 18,
19, 20). Because a single learning run takes
only a few seconds (or minutes, for very large
data sets and networks), engineers can test
out variants at a high turnover rate, a crucial
factor for practical usability.
ESNs have been developed from a mathe-
matical and engineering perspective, but exhibit
typical features of biological RNNs: a large
number of neurons, recurrent pathways, sparse
random connectivity, and local modification of
synaptic weights. The idea of using randomly
connected RNNs to represent and memorize
dynamic input in network states has frequently
been explored in specific contexts, for instance,
in artificial intelligence models of associative
memory (21), models of prefrontal cortex func-
tion in sensory-motor sequencing tasks (22),
models of birdsong (23), models of the cerebel-
lum (24), and general computational models of
neural oscillators (25). Many different learning
mechanisms were considered, mostly within
the RNN itself. The contribution of the ESN is
to elucidate the mathematical properties of
large RNNs such that they can be used with a
linear, trainable readout mechanism for general
black-box modeling. An approach essentially
equivalent to ESNs, liquid state networks (26,
27), has been developed independently to mod-
el computations in cortical microcircuits. Re-
cent findings in neurophysiology suggest that
the basic ESN/liquid state network principle
seems not uncommon in biological networks
(28–30) and could eventually be exploited to
control prosthetic devices by signals collected
from a collective of neurons (31).
References and Notes
1. K.-I. Funahashi, Y. Nakamura, Neural Netw. 6, 801
(1993).
2. D. Zipser, R. J. Williams, Neural Comput. 1, 270
(1989).
3. P. J. Werbos, Proc. IEEE 78, 1550 (1990).
4. L. A. Feldkamp, D. V. Prokhorov, C. F. Eagen, F. Yuan,
in Nonlinear Modeling: Advanced Black-Box Tech-
niques, J. A. K. Suykens, J. Vandewalle, Eds. (Kluwer,
Dordrecht, Netherlands, 1998), pp. 29–54.
5. K. Doya, in The Handbook of Brain Theory and Neural
Networks, M. A. Arbib, Ed. (MIT Press, Cambridge, MA,
1995), pp. 796–800.
6. H. Jaeger, “Tutorial on training recurrent neural
networks” (GMD-Report 159, German National Re-
search Institute for Computer Science, 2002); ftp://
borneo.gmd.de/pub/indy/publications_herbert/
CompleteTutorialTechrep.pdf.
Fig. 2. (A) Prediction output of the trained ESN
(dotted) overlaid with the correct continuation
(solid). (B) Learning the MG attractor. Three
sample activation traces of internal neurons are
shown. They echo the teacher signal d(n). After
training, the desired output is recreated from
the echo signals through output connections
(dotted arrows) whose weights w_i are the result
of the training procedure.
Fig. 3. Results of using an ESN for nonlinear
channel equalization. Plot shows signal error
rate (SER) versus signal-to-noise ratio (SNR).
(a) Linear DFE. (b) Volterra DFE. (c) Bilinear
DFE. [(a) to (c) taken from (20)]. (d) Blue line
represents average ESN performance with ran-
domly generated reservoirs. Error bars, varia-
tion across networks. (e) Green line indicates
performance of best network chosen from the
networks averaged in (d). Error bars, variation
across learning trials.

7. Materials and methods are available as supporting
material on Science Online.
8. M. C. Mackey, L. Glass, Science 197, 287 (1977).
9. J. Vesanto, in Proc. WSOM ’97 (1997); www.cis.hut.fi/
projects/monitor/publications/papers/wsom97.ps.
10. L. Chudy, I. Farkas, Neural Network World 8, 481
(1998).
11. H. Bersini, M. Birattari, G. Bontempi, in Proc. IEEE
World Congr. on Computational Intelligence (IJCNN
’98) (1997), pp. 2102–2106; ftp://iridia.ulb.ac.be/
pub/lazy/papers/IridiaTr1997-13_2.ps.gz.
12. T. M. Martinetz, S. G. Berkovich, K. J. Schulten, IEEE
Trans. Neural Netw. 4, 558 (1993).
13. X. Yao, Y. Liu, IEEE Trans. Neural Netw. 8, 694 (1997).
14. F. Gers, D. Eck, J. F. Schmidhuber, “Applying LSTM to
time series predictable through time-window ap-
proaches” (IDSIA-IDSIA-22-00, 2000); www.idsia.ch/
felix/Publications.html.
15. J. McNames, J. A. K. Suykens, J. Vandewalle, Int. J.
Bifurcat. Chaos 9, 1485 (1999).
16. H. Jaeger, “Short term memory in echo state net-
works” (GMD-Report 152, German National Re-
search Institute for Computer Science, 2002); ftp://
borneo.gmd.de/pub/indy/publications_herbert/
STMEchoStatesTechRep.pdf.
17. V. J. Mathews, J. Lee, in Advanced Signal Processing:
Algorithms, Architectures, and Implementations V
(Proc. SPIE Vol. 2296), (SPIE, San Diego, CA, 1994),
pp. 317–327.
18. J. Hertzberg, H. Jaeger, F. Schönherr, in Proc. 15th
Europ. Conf. on Art. Int. (ECAI 02), F. van Harmelen,
Ed. (IOS Press, Amsterdam, 2002), pp. 708–712; www.
ais.fhg.de/schoenhe/papers/ECAI02.pdf.
19. H. Jaeger, “The echo state approach to analysing and
training recurrent neural networks” (GMD-Report
148, German National Research Institute for Com-
puter Science, 2001); ftp://borneo.gmd.de/pub/indy/
publications_herbert/EchoStatesTechRep.pdf.
20. H. Jaeger, in Advances in Neural Information Process-
ing Systems 15, S. Becker, S. Thrun, K. Obermayer,
Eds. (MIT Press, Cambridge, MA, 2003) pp. 593– 600.
21. G. E. Hinton, in Parallel Models of Associative Mem-
ory, G. E. Hinton, J. A. Anderson, Eds. (Erlbaum, Hills-
dale, NJ, 1981), pp. 161–187.
22. D. G. Beiser, J. C. Houk, J. Neurophysiol. 79, 3168
(1998).
23. S. Dehaene, J.-P. Changeux, J.-P. Nadal, Proc. Natl.
Acad. Sci. U.S.A. 84, 2727 (1987).
24. M. Kawato, in The Handbook of Brain Theory and
Neural Networks, M. Arbib, Ed. (MIT Press, Cam-
bridge, MA, 1995), pp. 172–178.
25. K. Doya, S. Yoshizawa, Neural Netw. 2, 375 (1989).
26. W. Maass, T. Natschläger, H. Markram, Neural Com-
put. 14, 2531 (2002).
27. W. Maass, T. Natschläger, H. Markram, in Compu-
tational Neuroscience: A Comprehensive Approach,
J. Feng, Ed. (Chapman & Hall/CRC, 2003), pp. 575–
605.
28. G. B. Stanley, F. F. Li, Y. Dan, J. Neurosci. 19, 8036
(1999).
29. G. B. Stanley, Neurocomputing 38–40, 1703 (2001).
30. W. M. Kistler, Ch. I. de Zeeuw, Neural Comput. 14,
2597 (2002).
31. S. Mussa-Ivaldi, Nature 408, 361 (2000).
32. The first author thanks T. Christaller for unfaltering
support and W. Maass for friendly cooperation. Inter-
national patents are claimed by Fraunhofer AIS (PCT/
EP01/11490).
Supporting Online Material
www.sciencemag.org/cgi/content/full/304/5667/78/DC1
Materials and Methods
SOM Text
Figs. S1 to S4
References
8 September 2003; accepted 26 February 2004
Citations
Book
18 Nov 2016
TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background or methods or result from "Harnessing Nonlinearity: Predicting..."

  • ...In fact, some popular RNN algorithms restricted credit assignment to a single step backwards (Elman, 1990; Jordan, 1986, 1997), also in more recent studies (Jaeger, 2001, 2004; Maass et al., 2002)....

    [...]

  • ...Compare other RNN algorithms (Jaeger, 2004; Schmidhuber et al., 2007; Pascanu et al., 2013b; Koutník et al., 2014) that also at least sometimes yield better results than steepest descent for LSTM RNNs....

    [...]

  • ...Certain SL RNNs with fixed weights for all connections except those to output units (Jaeger, 2001; Maass et al., 2002; Jaeger, 2004; Schrauwen et al., 2007) have a maximal problem depth of 1, because only the final links in the corresponding CAPs are modifiable....

    [...]

  • ...Gradient-based LSTM is no panacea though—other methods sometimes outperformed it at least on certain tasks (Jaeger, 2004; Schmidhuber et al., 2007; Martens and Sutskever, 2011; Pascanu et al., 2013b; Koutník et al., 2014) (compare Sec....

    [...]

  • ...In fact, some popular RNN algorithms restricted credit assignment to a single step backwards (Elman, 1988; Jordan, 1986), also in more recent studies (Jaeger, 2002; Maass et al., 2002; Jaeger, 2004)....

    [...]

Proceedings Article
16 Jun 2013
TL;DR: It is shown that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs to levels of performance that were previously achievable only with Hessian-Free optimization.
Abstract: Deep and recurrent neural networks (DNNs and RNNs respectively) are powerful models that were considered to be almost impossible to train using stochastic gradient descent with momentum. In this paper, we show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. We find that both the initialization and the momentum are crucial since poorly initialized networks cannot be trained with momentum and well-initialized networks perform markedly worse when the momentum is absent or poorly tuned. Our success training these models suggests that previous attempts to train deep and recurrent neural networks from random initializations have likely failed due to poor initialization schemes. Furthermore, carefully tuned momentum methods suffice for dealing with the curvature issues in deep and recurrent network training objectives without the need for sophisticated second-order methods.

4,121 citations


Cites background from "Harnessing Nonlinearity: Predicting..."

  • ...As argued by Jaeger & Haas (2004), the spectral radius of the hidden-to-hidden matrix has a profound effect on the dynamics of the RNN’s hidden state (with a tanh nonlinearity)....

    [...]

Posted Content
TL;DR: This paper proposes a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem and validates empirically the hypothesis and proposed solutions.
Abstract: There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

3,549 citations


Cites background or result from "Harnessing Nonlinearity: Predicting..."

  • ...Echo State Networks (Jaeger and Haas, 2004) avoid the exploding and vanishing gradients problem by not learning W_rec and W_in....

    [...]

  • ...In most cases, these results outperforms Martens and Sutskever (2011) in terms of success rate, they deal with longer sequences than in Hochreiter and Schmidhuber (1997) and compared to (Jaeger, 2012) they generalize to longer sequences....

    [...]

  • ...Echo State Networks (Lukoševičius and Jaeger, 2009) avoid the exploding and vanishing gradients problem by not learning the recurrent and input weights....

    [...]

Proceedings Article
16 Jun 2013
TL;DR: In this article, a gradient norm clipping strategy is proposed to deal with the vanishing and exploding gradient problems in recurrent neural networks. But the proposed solution is limited to the case of RNNs.
Abstract: There are two widely known issues with properly training recurrent neural networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

2,586 citations

References
Journal ArticleDOI
01 Jan 1990
TL;DR: This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis, and describes further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations or true recurrent networks, and other practical issues which arise with this method.
Abstract: Basic backpropagation, which is a simple method now being widely used in areas like pattern recognition and fault diagnosis, is reviewed. The basic equations for backpropagation through time, and applications to areas like pattern recognition involving dynamic systems, systems identification, and control are discussed. Further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations, or true recurrent networks, and other practical issues arising with the method are described. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives-the theorem which underlies backpropagation-is briefly discussed. The focus is on designing a simpler version of backpropagation which can be translated into computer code and applied directly by neutral network users. >

4,572 citations

Journal ArticleDOI
TL;DR: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Abstract: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.

4,351 citations

Journal ArticleDOI
15 Jul 1977-Science
TL;DR: First-order nonlinear differential-delay equations describing physiological control systems displaying a broad diversity of dynamical behavior including limit cycle oscillations, with a variety of wave forms, and apparently aperiodic or "chaotic" solutions are studied.
Abstract: First-order nonlinear differential-delay equations describing physiological control systems are studied. The equations display a broad diversity of dynamical behavior including limit cycle oscillations, with a variety of wave forms, and apparently aperiodic or "chaotic" solutions. These results are discussed in relation to dynamical respiratory and hematopoietic diseases.

3,839 citations

Journal ArticleDOI
TL;DR: A new computational model for real-time computing on time-varying input that provides an alternative to paradigms based on Turing machines or attractor neural networks, based on principles of high-dimensional dynamical systems in combination with statistical learning theory and can be implemented on generic evolved or found recurrent circuitry.
Abstract: A key challenge for neural modeling is to explain how a continuous stream of multimodal input from a rapidly changing environment can be processed by stereotypical recurrent circuits of integrate-and-fire neurons in real time. We propose a new computational model for real-time computing on time-varying input that provides an alternative to paradigms based on Turing machines or attractor neural networks. It does not require a task-dependent construction of neural circuits. Instead, it is based on principles of high-dimensional dynamical systems in combination with statistical learning theory and can be implemented on generic evolved or found recurrent circuitry. It is shown that the inherent transient dynamics of the high-dimensional dynamical system formed by a sufficiently large and heterogeneous neural circuit may serve as universal analog fading memory. Readout neurons can learn to extract in real time from the current state of such recurrent neural circuit information about current and past inputs that may be needed for diverse tasks. Stable internal states are not required for giving a stable output, since transient internal states can be transformed by readout neurons into stable target outputs due to the high dimensionality of the dynamical system. Our approach is based on a rigorous computational model, the liquid state machine, that, unlike Turing machines, does not require sequential transitions between well-defined discrete internal states. It is supported, as the Turing machine is, by rigorous mathematical results that predict universal computational power under idealized conditions, but for the biologically more realistic scenario of real-time processing of time-varying inputs. Our approach provides new perspectives for the interpretation of neural coding, the design of experiments and data analysis in neurophysiology, and the solution of problems in robotics and neurotechnology.

3,446 citations

Journal ArticleDOI
TL;DR: It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure are determined by the gradient of an energy function whose shape can be modulated through a neighborhood determining parameter and resemble the dynamicsof Brownian particles moving in a potential determined by a data point density.
Abstract: A neural network algorithm based on a soft-max adaptation rule is presented. This algorithm exhibits good performance in reaching the optimum minimization of a cost function for vector quantization data compression. The soft-max rule employed is an extension of the standard K-means clustering procedure and takes into account a neighborhood ranking of the reference (weight) vectors. It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure are determined by the gradient of an energy function whose shape can be modulated through a neighborhood determining parameter and resemble the dynamics of Brownian particles moving in a potential determined by the data point density. The network is used to represent the attractor of the Mackey-Glass equation and to predict the Mackey-Glass time series, with additional local linear mappings for generating output values. The results obtained for the time-series prediction compare favorably with the results achieved by backpropagation and radial basis function networks. >

1,504 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication"?

The authors present a method for learning nonlinear systems, echo state networks ( ESNs ). The potential for engineering applications is illustrated by equalizing a communication channel, where the signal error rate is improved by two orders of magnitude.