Journal ArticleDOI

Finding Structure in Time

01 Mar 1990 - Cognitive Science (Lawrence Erlbaum Associates, Inc.) - Vol. 14, Iss. 2, pp. 179-211
TL;DR: Develops a proposal, first described by Jordan (1986), that uses recurrent links to give networks a dynamic memory, and suggests a method for representing lexical categories and the type/token distinction.
About: This article was published in Cognitive Science on 1990-03-01 and is currently open access. It has received 10,264 citations to date.
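
The network the TL;DR describes, the simple recurrent ("Elman") network, feeds the hidden layer back to itself through context units that hold a copy of the previous hidden state. A minimal NumPy sketch of the forward pass; the sizes, names, and tanh nonlinearity (the paper uses sigmoidal units) are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper's simulations use task-specific layer sizes.
n_in, n_hidden, n_out = 5, 8, 5

W_xh = rng.normal(0, 0.1, (n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
W_hy = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output

def step(x, context):
    # Hidden units see the current input plus the context units,
    # which hold an exact copy of the previous hidden state.
    h = np.tanh(W_xh @ x + W_ch @ context)
    y = W_hy @ h          # e.g., a prediction of the next input
    return y, h           # h is copied back as the next context

context = np.zeros(n_hidden)
for x in rng.normal(size=(10, n_in)):   # a dummy 10-step input sequence
    y, context = step(x, context)
```

At each step the context units give the hidden layer access to its own prior state, which is the "dynamic memory" the TL;DR refers to.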
Citations
Journal ArticleDOI
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back-propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
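
The abstract's multiplicative gates and constant error carousel can be sketched as follows. This is the modern textbook cell with a forget gate (introduced by Gers et al., 1999), not the exact 1997 formulation; all sizes and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, b):
    # W maps the concatenated [h_prev; x] to four gate pre-activations.
    z = W @ np.concatenate([h_prev, x]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input / forget / output gates
    c = f * c_prev + i * np.tanh(g)  # additive cell update: the error "carousel"
    h = o * np.tanh(c)               # gated exposure of the cell state
    return h, c

# Tiny usage example with arbitrary sizes.
rng = np.random.default_rng(0)
n_x, n_h = 3, 4
W = rng.normal(0, 0.1, (4 * n_h, n_h + n_x))
b = np.zeros(4 * n_h)
h = c = np.zeros(n_h)
for x in rng.normal(size=(5, n_x)):
    h, c = lstm_cell(x, h, c, W, b)
```

Because the cell state is updated additively rather than by repeated matrix multiplication, error flowing back along c is not forced to shrink or grow at every step.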

72,897 citations


Cites methods from "Finding Structure in Time"

  • ...The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter’s comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see sections 1 and 3)....

    [...]

  • ...In experimental comparisons with RTRL, BPTT, Recurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex long time lag tasks that have never been solved by previous recurrent network algorithms. It works with local, distributed, real-valued, and noisy pattern representations. 1 INTRODUCTION Recurrent networks can in principle use their feedback connections to store representations of recent input events in form of activations ("short-term memory", as opposed to "long-term memory" embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or don't work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. With conventional "Back-Propagation Through Time" (BPTT, e.g., Williams and Zipser 1992) or "Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987), error signals "flowing backwards in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights. Case (1) may lead to oscillating weights, while in case (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all. This paper presents "Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. The combination of both is designed to overcome these error back-flow problems. Unlike Schmidhuber's (1992b) chunking systems (which work well if input sequences contain local regularities that make them partly predictable), LSTM can learn to bridge time intervals in excess of 1000 steps even in noisy, ...

    [...]
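The excerpt's claim that backpropagated error "exponentially depends on the size of the weights" can be reproduced with a toy calculation; the weight values below are arbitrary:

```python
# Error flowing back T steps through a recurrent weight w is scaled
# roughly by w**T (nonlinearity derivatives ignored for simplicity).
for w in (0.9, 1.1):
    err = 1.0
    for _ in range(100):           # 100 time steps back
        err *= w
    print(f"w={w}: error after 100 steps ~ {err:.3g}")
# w=0.9 -> ~2.66e-05 (vanishes); w=1.1 -> ~1.38e+04 (blows up)
```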

Posted Content
TL;DR: Two novel model architectures are proposed for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best-performing techniques based on different types of neural networks.
Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
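
The word similarity task the abstract mentions reduces to comparing learned vectors, typically by cosine similarity. A sketch with invented 3-d vectors (trained models use hundreds of dimensions learned from corpus statistics):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors invented for illustration, not learned values.
vec = {
    "strong":   np.array([0.8, 0.2, 0.1]),
    "powerful": np.array([0.9, 0.1, 0.0]),
    "paris":    np.array([0.0, 0.1, 0.9]),
}
print(cosine(vec["strong"], vec["powerful"]))  # high: related meanings
print(cosine(vec["strong"], vec["paris"]))     # low: unrelated
```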

20,077 citations

Journal ArticleDOI
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning, and evolutionary computation, as well as indirect search for short programs encoding deep and large networks.

14,635 citations


Cites result from "Finding Structure in Time"

  • ...In fact, some popular RNN algorithms restricted credit assignment to a single step backwards (Elman, 1988; Jordan, 1986), also in more recent studies (Jaeger, 2002; Maass et al....

    [...]

  • ...In fact, some popular RNN algorithms restricted credit assignment to a single step backwards (Elman, 1990; Jordan, 1986, 1997), also in more recent studies (Jaeger, 2001, 2004; Maass et al., 2002)....

    [...]
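"Credit assignment restricted to a single step backwards", as in the two excerpts above, means the copied context is treated as a constant during the backward pass, so no error flows to earlier time steps. A schematic sketch under that assumption (squared-error loss; all names and sizes invented here):

```python
import numpy as np

def one_step_grads(x, h_prev, target, W_xh, W_ch, W_hy):
    # Forward pass of a simple recurrent net at one time step.
    h = np.tanh(W_xh @ x + W_ch @ h_prev)
    y = W_hy @ h
    # Backward pass stops at h_prev: it is treated as a constant,
    # so no error is assigned to earlier time steps.
    dy = y - target                    # squared-error gradient at the output
    dh = (W_hy.T @ dy) * (1.0 - h**2)  # through tanh, one step back only
    return np.outer(dy, h), np.outer(dh, x), np.outer(dh, h_prev)

# Usage with arbitrary shapes: gradients for W_hy, W_xh, W_ch.
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
g_hy, g_xh, g_ch = one_step_grads(
    rng.normal(size=n_in), np.zeros(n_h), rng.normal(size=2),
    rng.normal(0, 0.1, (n_h, n_in)), rng.normal(0, 0.1, (n_h, n_h)),
    rng.normal(0, 0.1, (2, n_h)))
```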

Proceedings Article
Quoc V. Le, Tomas Mikolov
21 Jun 2014
TL;DR: Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents, and its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.
Abstract: Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
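
The abstract's observation that "powerful", "strong", and "Paris" are equally distant under bag-of-words follows directly from one-hot encoding; a minimal check:

```python
import numpy as np

vocab = ["powerful", "strong", "paris"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Any two distinct one-hot vectors differ in exactly two coordinates,
# so every pair of words sits at Euclidean distance sqrt(2).
for i, a in enumerate(vocab):
    for b in vocab[i + 1:]:
        print(a, b, np.linalg.norm(one_hot[a] - one_hot[b]))  # all 1.414...
```

Paragraph Vector replaces such one-hot features with dense vectors trained to predict words in the document, so distances can reflect meaning.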

7,119 citations

Journal ArticleDOI
TL;DR: The authors propose to learn a distributed representation for words that allows each training sentence to inform the model about an exponential number of semantically neighboring sentences, with the probability function for word sequences expressed in terms of these representations.
Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.
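
As a sketch of the idea, each conditional probability P(w_t | context) is computed from the concatenated feature vectors of the context words. The version below omits the paper's tanh hidden layer and all training code; sizes and word ids are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n_ctx = 1000, 32, 3        # vocabulary size, feature dim, context length

C = rng.normal(0, 0.1, (V, d))           # shared word-feature matrix
W = rng.normal(0, 0.1, (V, n_ctx * d))   # output layer (hidden layer omitted)

def next_word_probs(context_ids):
    # Concatenate the context words' feature vectors and softmax
    # over the whole vocabulary: P(w_t | w_{t-3}, w_{t-2}, w_{t-1}).
    x = np.concatenate([C[i] for i in context_ids])
    logits = W @ x
    e = np.exp(logits - logits.max())    # numerically stable softmax
    return e / e.sum()

p = next_word_probs([12, 7, 403])        # arbitrary context word ids
assert abs(p.sum() - 1.0) < 1e-9
```

Because C is shared across positions, similar words receive similar feature vectors, which is what lets one training sentence generalize to its semantic neighbors.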

6,832 citations

References
Book ChapterDOI
01 Jan 1988
TL;DR: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion.
Abstract: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion
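
The generalized delta rule named in the section titles extends the delta rule to hidden units by propagating error derivatives backwards through the network. A minimal one-hidden-layer sketch with sigmoid units and squared error; the layer sizes and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generalized_delta_step(x, target, W1, W2, lr=0.1):
    # Forward pass through one hidden layer of sigmoid units.
    h = sigmoid(W1 @ x)
    y = sigmoid(W2 @ h)
    # Output deltas, then deltas propagated back to the hidden layer.
    delta_out = (y - target) * y * (1.0 - y)
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)
    # Incremental weight updates (squared-error loss, learning rate lr).
    W2 -= lr * np.outer(delta_out, h)
    W1 -= lr * np.outer(delta_hid, x)
    return W1, W2

# Tiny usage example on random data.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (4, 3))
W2 = rng.normal(0, 0.5, (2, 4))
W1, W2 = generalized_delta_step(rng.normal(size=3), np.array([0.0, 1.0]), W1, W2)
```

This is the error-backpropagation procedure the excerpt below says "Finding Structure in Time" uses to adjust connection strengths.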

17,604 citations


"Finding Structure in Time" refers methods in this paper

  • ...If so, the output is compared with a teacher input and backpropagation of error (Rumelhart, Hinton, & Williams, 1986) is used to incrementally adjust connection strengths....

    [...]

Book
03 Jan 1986
TL;DR: This chapter presents the problem of learning internal representations, introduces the generalized delta rule, and reports simulation results along with some further generalizations.
Abstract: This chapter contains sections titled: The Problem, The Generalized Delta Rule, Simulation Results, Some Further Generalizations, Conclusion

13,579 citations

Journal ArticleDOI
TL;DR: Methodological preliminaries of generative grammars as theories of linguistic competence; theory of performance; organization of a generative grammar; justification of grammars; descriptive and explanatory theories; evaluation procedures; linguistic theory and language learning.

12,586 citations


"Finding Structure in Time" refers background in this paper

  • ...In addition, it has been argued that generalizations about word order cannot be accounted for solely in terms of linear order (Chomsky, 1957, 1965)....

    [...]

  • ...In addition, it has been argued that generalizations about word order cannot be accounted for solely in terms of linear order (Chomsky, 1957; Chomsky, 1965)....

    [...]

Book
01 May 1965
TL;DR: Presents generative grammars as theories of linguistic competence, covering methodological preliminaries, categories and relations in syntactic theory, deep structures and grammatical transformations, and residual problems at the boundaries of syntax and semantics.
Abstract: Contents: Methodological preliminaries: Generative grammars as theories of linguistic competence; theory of performance; organization of a generative grammar; justification of grammars; formal and substantive grammars; descriptive and explanatory theories; evaluation procedures; linguistic theory and language learning; generative capacity and its linguistic relevance. Categories and relations in syntactic theory: Scope of the base; aspects of deep structure; illustrative fragment of the base component; types of base rules. Deep structures and grammatical transformations. Residual problems: Boundaries of syntax and semantics; structure of the lexicon.

12,225 citations