Learning long-term dependencies with gradient descent is difficult
Citations
123,388 citations
72,897 citations
Cites background or methods from "Learning long-term dependencies wit..."
...For instance, guessing solved a variant of Bengio and Frasconi's parity problem (1994) much faster than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994)....
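The "guessing" referred to here is the random-weight-search baseline used in the citing paper. A minimal sketch of such a baseline is given below; the function name, the evaluation callback, and the uniform sampling range are assumptions for illustration, not the exact protocol of the cited experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_guess_search(evaluate, n_weights, scale=1.0, max_trials=100_000):
    """Random weight guessing as a baseline: draw weight vectors at random
    until `evaluate` (a user-supplied callback that returns True when the
    parameterized network solves the task) succeeds or the budget runs out."""
    for trial in range(1, max_trials + 1):
        w = rng.uniform(-scale, scale, size=n_weights)
        if evaluate(w):
            return w, trial      # solved after `trial` random guesses
    return None, max_trials      # no solution found within the budget
```

The point of the comparison in the excerpt is that when solutions occupy a large region of weight space, even this trivially simple search can succeed faster than gradient-based training.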
[...]
...Experiments 3a/3b focus on Bengio et al.'s 1994 "2-sequence problem". Because this problem actually can be solved quickly by random weight guessing, we also include a far more difficult 2-sequence problem (3c) which requires learning real-valued, conditional expectations of noisy targets, given the inputs. Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms. Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs. Subsection 4.7 will provide a detailed summary of experimental conditions in two tables for reference. 4.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR Task. Our first task is to learn the "embedded Reber grammar", e.g. Smith and Zipser (1989), Cleeremans et al. (1989), and Fahlman (1991). Since it allows for training sequences with short time lags (of as few as 9 steps), it is not a long time lag problem....
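For orientation, the flavour of the 2-sequence task can be reproduced with a toy generator like the one below; the sequence length, noise level, and encoding of the class-defining first element are assumptions, not the exact specification used in the cited experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_two_sequence_example(length=100, noise_std=0.2):
    """Toy example in the spirit of the 2-sequence problem: the class is
    fixed by the first input, all later inputs are noise, and the target is
    only given at the end, so the learner must bridge a long time lag."""
    label = int(rng.integers(0, 2))                 # class 0 or 1
    x = rng.normal(0.0, noise_std, size=length)     # irrelevant noisy inputs
    x[0] = 1.0 if label == 1 else -1.0              # class-defining first input
    return x, label
```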
[...]
...1990) and Plate's method (Plate 1993), which updates unit activations based on a weighted sum of old activations (see also de Vries and Principe 1991). Lin et al. (1995) propose variants of time-delay networks called NARX networks; some of their problems can be solved quickly by simple weight guessing though. To deal with long time lags, Mozer (1992) uses time constants influencing the activation changes. However, for long time lags the time constants need external fine tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input....
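The update rules described here amount to leaky integration of activations. A one-step sketch is shown below; the decay constant, input scaling, and nonlinearity are illustrative assumptions rather than the exact formulations of Mozer (1992) or Sun et al. (1993).

```python
import numpy as np

def leaky_unit_update(a_prev, net_input, tau=0.9, alpha=0.1):
    """One step of a time-constant ('leaky integrator') recurrent unit:
    the old activation decays with factor tau while the (scaled) current
    net input is added on top, letting information persist across steps."""
    return tau * a_prev + alpha * np.tanh(net_input)
```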
[...]
[...]
46,982 citations
44,703 citations
Cites background from "Learning long-term dependencies wit..."
...Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning....
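That obstacle is exactly what the cited paper analyzes: backpropagated error is multiplied by one Jacobian per layer (or time step), so its norm shrinks or grows roughly geometrically with depth. The toy computation below illustrates this; the dimensions, scales, and random Jacobian are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_norm_after(scale, steps=100, dim=8):
    """Push an error signal backward through `steps` identical linear maps
    and return its final norm: it vanishes when the map contracts and
    explodes when the map expands."""
    J = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)  # per-step Jacobian
    g = np.ones(dim)                                            # error at the last step
    for _ in range(steps):
        g = J.T @ g                                             # chain rule, one step back
    return float(np.linalg.norm(g))

print(gradient_norm_after(0.5))   # contracting map: gradient vanishes
print(gradient_norm_after(1.5))   # expanding map: gradient explodes
```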
[...]
42,067 citations
References
41,772 citations
17,604 citations
"Learning long-term dependencies wit..." refers methods in this paper
...Learning algorithms used for recurrent networks are usually based on computing the gradient of a cost function with respect to the weights of the network [22], [21]....
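Both families of algorithms compute the same quantity, the gradient of the cost with respect to the recurrent weights. Below is a minimal backpropagation-through-time sketch for a scalar linear recurrence; the architecture, loss, and variable names are simplifications assumed for illustration, not the setups of the cited references.

```python
import numpy as np

def bptt_gradients(x, y, w, u):
    """Gradient of a squared loss w.r.t. the weights of a tiny linear RNN,
    computed by backpropagation through time.

    Forward:  h[t] = w * h[t-1] + u * x[t],  with h[0] = 0
    Loss:     L = 0.5 * (h[T] - y) ** 2
    """
    T = len(x)
    h = np.zeros(T + 1)
    for t in range(1, T + 1):                 # forward pass, store activations
        h[t] = w * h[t - 1] + u * x[t - 1]

    dL_dw, dL_du = 0.0, 0.0
    delta = h[T] - y                          # dL/dh[T]
    for t in range(T, 0, -1):                 # backward pass through time
        dL_dw += delta * h[t - 1]
        dL_du += delta * x[t - 1]
        delta *= w                            # dL/dh[t-1] = dL/dh[t] * w
    return dL_dw, dL_du

print(bptt_gradients(np.array([1.0, 0.0, 0.0, 0.5]), y=1.0, w=0.9, u=0.5))
```

The backward loop also makes the paper's central difficulty visible: the error signal delta is multiplied by w at every step back in time, so it decays or grows geometrically with the length of the lag.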
[...]
13,579 citations
4,351 citations
"Learning long-term dependencies wit..." refers methods in this paper
...Other algorithms, such as the forward propagation algorithms [14], [23], are much more computationally expensive (for...
[...]
1,598 citations
"Learning long-term dependencies wit..." refers methods in this paper
...We implemented the simulated annealing algorithm presented in [6] for optimizing functions of continuous variables....
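As a rough illustration of what such an optimizer does (not the specific algorithm of reference [6]), a generic simulated-annealing loop over continuous variables might look as follows; the proposal scale, cooling schedule, and acceptance rule are standard textbook choices assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(f, x0, step=0.5, t0=1.0, cooling=0.95,
                        iters_per_temp=50, t_min=1e-3):
    """Generic simulated annealing for continuous variables: propose Gaussian
    perturbations, always accept improvements, and accept worse points with
    probability exp(-increase / temperature) under geometric cooling."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    temp = t0
    while temp > t_min:
        for _ in range(iters_per_temp):
            cand = x + rng.normal(0.0, step, size=x.shape)   # Gaussian proposal
            fc = f(cand)
            if fc < fx or rng.random() < np.exp(-(fc - fx) / temp):
                x, fx = cand, fc                             # accept the move
                if fx < best_f:
                    best_x, best_f = x.copy(), fx            # track the best point seen
        temp *= cooling                                      # lower the temperature
    return best_x, best_f

# Example: minimize a simple quadratic bowl in three dimensions.
print(simulated_annealing(lambda v: float(np.sum(v ** 2)), np.ones(3)))
```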
[...]