Journal ArticleDOI

Learning long-term dependencies with gradient descent is difficult

TL;DR: This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.
Abstract: Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching on information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
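To make the abstract's core difficulty concrete, here is a minimal numpy sketch (not from the paper) of how the gradient signal decays across time in a vanilla recurrent network: the Jacobian of the state at time t with respect to the state at time 0 is a product of per-step Jacobians, and its norm typically shrinks (or explodes) exponentially with the time lag. The network size and weight scale below are arbitrary choices for the illustration.

```python
# Illustrative sketch only: a free-running vanilla RNN state
# h_t = tanh(W h_{t-1}), with the Jacobian dh_t/dh_0 accumulated by the
# chain rule. Its norm decays exponentially when W is modest, so error
# signals from distant targets vanish -- the difficulty analyzed here.
import numpy as np

rng = np.random.default_rng(0)
n = 20
W = rng.normal(scale=0.5 / np.sqrt(n), size=(n, n))  # modest spectral radius
h = rng.normal(size=n)

J = np.eye(n)                                # dh_0/dh_0
for t in range(1, 101):
    h = np.tanh(W @ h)                       # state update
    J = np.diag(1.0 - h ** 2) @ W @ J        # chain rule: dh_t/dh_0
    if t % 20 == 0:
        print(f"t={t:3d}  ||dh_t/dh_0|| = {np.linalg.norm(J):.3e}")
```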
Citations
Proceedings ArticleDOI
27 Jun 2016
TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks substantially deeper than those used previously; an ensemble of these residual nets won 1st place on the ILSVRC 2015 classification task.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers (8× deeper than VGG nets [40]) but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
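The core reformulation is easy to state in code. Below is a hedged numpy sketch of a residual block y = F(x) + x; it is illustrative only (the paper's blocks use convolutions, batch normalization, and ReLU), and the weight names are invented for the example.

```python
# Sketch of the residual reformulation y = F(x, {W_i}) + x described in
# the abstract. Illustrative fully-connected version, not the paper's
# exact convolutional block; W1 and W2 are hypothetical weights.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Two-layer residual function F(x) plus an identity shortcut."""
    f = relu(x @ W1) @ W2      # the residual mapping F(x)
    return relu(f + x)         # add the skip connection, then nonlinearity

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
y = residual_block(x, W1, W2)  # if F is driven to zero, the block is ~identity
```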

123,388 citations

Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
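As a rough illustration of the mechanism the abstract describes, here is a simplified single-cell sketch, close to the original formulation (input and output gates, no forget gate); the sizes and parameter names are invented for the example.

```python
# Simplified sketch of one LSTM step without a forget gate, as in the
# original formulation. The additive cell update c <- c + i * g is the
# "constant error carousel": its derivative w.r.t. the old c is exactly 1,
# so error can flow back over long lags. Illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, Wi, Ui, Wg, Ug, Wo, Uo):
    i = sigmoid(Wi @ x + Ui @ h)   # input gate: controls writing
    g = np.tanh(Wg @ x + Ug @ h)   # candidate value to store
    o = sigmoid(Wo @ x + Uo @ h)   # output gate: controls reading
    c = c + i * g                  # constant error carousel
    h = o * np.tanh(c)             # exposed state
    return h, c

rng = np.random.default_rng(0)
n, m = 8, 3                        # hypothetical state and input sizes
Wi, Wg, Wo = (rng.normal(scale=0.2, size=(n, m)) for _ in range(3))
Ui, Ug, Uo = (rng.normal(scale=0.2, size=(n, n)) for _ in range(3))
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(100, m)):    # run over a toy sequence
    h, c = lstm_step(x, h, c, Wi, Ui, Wg, Ug, Wo, Uo)
```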

72,897 citations


Cites background or methods from "Learning long-term dependencies with gradient descent is difficult"

  • ...For instance, guessing solved a variant of Bengio and Frasconi's parity problem (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994)....

    [...]

  • ...Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs. Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms. Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs. Subsection 4.7 will provide a detailed summary of experimental conditions in two tables for reference. 4.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem....

    [...]


  • ...1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks; some of their problems can be solved quickly by simple weight guessing though. To deal with long time lags, Mozer (1992) uses time constants influencing the activation changes. However, for long time lags the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input....

    [...]
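The time-constant idea mentioned in the last excerpt (Mozer 1992; see also Sun et al. 1993) amounts to a leaky-integrator update. Here is a hedged numpy sketch, with the decay constants chosen arbitrarily for the illustration.

```python
# Illustrative leaky-integrator unit in the spirit of Mozer (1992):
# a_{t+1} = tau * a_t + (1 - tau) * f(net_t). Units with tau near 1
# retain information longer, but tau must be tuned externally for long
# lags, which is the limitation noted in the excerpt above.
import numpy as np

def leaky_update(a, net, tau):
    return tau * a + (1.0 - tau) * np.tanh(net)

a = np.zeros(4)
tau = np.array([0.5, 0.9, 0.99, 0.999])   # larger tau -> longer memory
for t in range(100):
    x = 1.0 if t == 0 else 0.0            # single input impulse at t = 0
    a = leaky_update(a, x, tau)
print(a)  # only units with tau near 1 retain a trace of the impulse
```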

Journal ArticleDOI
28 May 2015-Nature
TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

46,982 citations

Posted Content
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers (8× deeper than VGG nets) but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

44,703 citations


Cites background from "Learning long-term dependencies with gradient descent is difficult"

  • ...Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning....

    [...]

Journal ArticleDOI
01 Jan 1998
TL;DR: In this article, the authors review gradient-based learning methods for handwritten character recognition, show that convolutional neural networks outperform the alternatives, and propose graph transformer networks (GTNs), which let multimodule recognition systems be trained globally by gradient-based methods.
Abstract: Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of 2D shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank cheque is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal cheques. It is deployed commercially and reads several million cheques per day.
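The key operation behind the convolutional networks this abstract credits is a small shared-weight filter slid over the image. Below is a minimal, hedged numpy sketch of a "valid" 2D convolution (strictly, cross-correlation, as in most deep-learning libraries), with an invented edge filter as the example.

```python
# Minimal 2D "valid" convolution: one small filter is reused at every
# image location, which is what lets convolutional nets tolerate shifts
# in the 2D shapes they classify. Illustrative only.
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)    # simple vertical-edge filter
img = np.zeros((8, 8)); img[:, 4:] = 1.0   # image with a step edge
print(conv2d_valid(img, edge))             # strong response along the edge
```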

42,067 citations

References
Journal ArticleDOI
13 May 1983-Science
TL;DR: There is a deep and useful connection between statistical mechanics and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters), and a detailed analogy with annealing in solids provides a framework for optimization of very large and complex systems.
Abstract: There is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). A detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. This connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods.

41,772 citations

Book ChapterDOI
01 Jan 1988
TL;DR: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion.
Abstract: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion
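For concreteness, here is a hedged numpy sketch of the generalized delta rule for one hidden layer, applied to XOR (a classic test from this literature); the layer sizes, learning rate, and iteration count are arbitrary choices for the example.

```python
# Illustrative generalized delta rule (backpropagation) for a 2-3-1 net
# on XOR: deltas are computed at the output, propagated back through the
# hidden layer, and each weight moves along -dE/dw.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr = 0.5
for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                    # forward pass
    out = sigmoid(h @ W2 + b2)
    delta_out = (out - y) * out * (1 - out)     # output-layer delta
    delta_h = (delta_out @ W2.T) * h * (1 - h)  # back-propagated delta
    W2 -= lr * h.T @ delta_out; b2 -= lr * delta_out.sum(axis=0)
    W1 -= lr * X.T @ delta_h;   b1 -= lr * delta_h.sum(axis=0)
print(out.round(2))  # should approach [0, 1, 1, 0] (may vary with the seed)
```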

17,604 citations


"Learning long-term dependencies wit..." refers methods in this paper

  • ...Learning algorithms used for recurrent networks are usually based on computing the gradient of a cost function with respect to the weights of the network [22], [21]....

    [...]

Book
03 Jan 1986
TL;DR: In this chapter, the generalized delta rule is presented, together with simulation results and some further generalizations.
Abstract: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion

13,579 citations

Journal ArticleDOI
TL;DR: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Abstract: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.
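Here is a hedged sketch of the forward (real-time) gradient computation this abstract describes, for a tiny vanilla RNN: a sensitivity tensor ∂h/∂W is carried forward alongside the state, so no fixed training interval or backward pass is needed, at the cost the abstract notes (this update is O(n⁴) per step). The sizes and toy target are invented, and input-weight sensitivities are omitted for brevity.

```python
# Illustrative real-time recurrent learning (RTRL) for h_t = tanh(W h + U x).
# P[k, i, j] = dh_k / dW_ij is updated forward in time, so the gradient is
# available while the network runs.
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
W = rng.normal(scale=0.3, size=(n, n))
U = rng.normal(scale=0.3, size=(n, m))
h = np.zeros(n)
P = np.zeros((n, n, n))               # sensitivities dh/dW

lr = 0.01
for t in range(200):
    x = rng.normal(size=m)
    target = x.sum()                  # toy target for unit 0
    h_new = np.tanh(W @ h + U @ x)
    d = 1.0 - h_new ** 2              # tanh'
    # dh_k/dW_ij = d_k * (delta_{ki} * h_j + sum_l W_kl * P[l, i, j])
    P_new = np.einsum('kl,lij->kij', W, P)
    for i in range(n):
        P_new[i, i, :] += h           # explicit dependence of unit i on row i
    P = d[:, None, None] * P_new
    err = h_new[0] - target
    W -= lr * err * P[0]              # online gradient step, no backward pass
    h = h_new
```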

4,351 citations


"Learning long-term dependencies wit..." refers methods in this paper

  • ...Other algorithms, such as the forward propagation algorithms [14], [23], are much more computationally expensive (for...

    [...]

Journal ArticleDOI
TL;DR: A new global optimization algorithm for functions of continuous variables is presented, derived from the “Simulated Annealing” algorithm recently introduced in combinatorial optimization, which is quite costly in terms of function evaluations, but its cost can be predicted in advance, depending only slightly on the starting point.
Abstract: A new global optimization algorithm for functions of continuous variables is presented, derived from the "Simulated Annealing" algorithm recently introduced in combinatorial optimization. The algorithm is essentially an iterative random search procedure with adaptive moves along the coordinate directions. It permits uphill moves under the control of a probabilistic criterion, thus tending to avoid the first local minima encountered. The algorithm has been tested against the Nelder and Mead simplex method and against a version of Adaptive Random Search. The test functions were Rosenbrock valleys and multiminima functions in 2, 4, and 10 dimensions. The new method proved to be more reliable than the others, being always able to find the optimum, or at least a point very close to it. It is quite costly in terms of function evaluations, but its cost can be predicted in advance, depending only slightly on the starting point.
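Below is a much-simplified sketch of this style of continuous-variable simulated annealing (coordinate-wise random moves, Metropolis acceptance of uphill steps, geometric cooling), run on a 2-D Rosenbrock valley as in the tests described. The schedule constants are invented, and the full algorithm's adaptive step-range machinery is omitted.

```python
# Simplified continuous simulated annealing: random moves along one
# coordinate at a time; uphill moves accepted with probability
# exp(-delta/T); temperature and step size shrink geometrically.
import numpy as np

def rosenbrock(x):
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def anneal(f, x0, T=10.0, step=1.0, iters=20000, seed=0):
    rng = np.random.default_rng(seed)
    x, fx = x0.copy(), f(x0)
    best, fbest = x.copy(), fx
    for k in range(iters):
        cand = x.copy()
        cand[k % len(x)] += step * rng.uniform(-1.0, 1.0)  # coordinate move
        fc = f(cand)
        if fc <= fx or rng.random() < np.exp(-(fc - fx) / T):
            x, fx = cand, fc                 # Metropolis acceptance
            if fx < fbest:
                best, fbest = x.copy(), fx
        T *= 0.9995                          # geometric cooling
        step = max(step * 0.9995, 1e-3)      # shrink the move range
    return best, fbest

x, fx = anneal(rosenbrock, np.zeros(2))
print(x, fx)  # should move toward the global minimum at (1, 1)
```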

1,598 citations


"Learning long-term dependencies wit..." refers methods in this paper

  • ...We implemented the simulated annealing algorithm presented in [6] for optimizing functions of continuous variables....

    [...]