Learning representations by back-propagating errors

doi:10.1038/323533A0

Home
/
Papers
/
Learning representations by back-propagating errors

Journal Article•DOI•

Learning representations by back-propagating errors

David E. Rumelhart¹, Geoffrey E. Hinton², Ronald J. Williams¹•Institutions (2)

University of California, San Diego¹, Carnegie Mellon University²

01 Jan 1988-Nature (Nature Publishing Group)-Vol. 323, Iss: 6088, pp 696-699

TL;DR: Back-propagation repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector, which helps to represent important features of the task domain.

read less

Abstract: We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure1.

...read moreread less

Citations

PDF

Open Access

More filters

Book•

Learning Deep Architectures for AI

[...]

Yoshua Bengio¹•Institutions (1)

Université de Montréal¹

01 Jan 2009

TL;DR: The motivations and principles regarding learning algorithms for deep architectures, in particular those exploiting as building blocks unsupervised learning of single-layer modelssuch as Restricted Boltzmann Machines, used to construct deeper models such as Deep Belief Networks are discussed.

...read moreread less

Abstract: Can machine learning deliver AI? Theoretical results, inspiration from the brain and cognition, as well as machine learning experiments suggest that in order to learn the kind of complicated functions that can represent high-level abstractions (e.g. in vision, language, and other AI-level tasks), one would need deep architectures. Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers, graphical models with many levels of latent variables, or in complicated propositional formulae re-using many sub-formulae. Each level of the architecture represents features at a different level of abstraction, defined as a composition of lower-level features. Searching the parameter space of deep architectures is a difficult task, but new algorithms have been discovered and a new sub-area has emerged in the machine learning community since 2006, following these discoveries. Learning algorithms such as those for Deep Belief Networks and other related unsupervised learning algorithms have recently been proposed to train deep architectures, yielding exciting results and beating the state-of-the-art in certain areas. Learning Deep Architectures for AI discusses the motivations for and principles of learning algorithms for deep architectures. By analyzing and comparing recent results with different learning algorithms for deep architectures, explanations for their success are proposed and discussed, highlighting challenges and suggesting avenues for future explorations in this area.

...read moreread less

7,767 citations

Cites background or methods from "Learning representations by back-pr..."

...The idea of distributed representation is an oldidea in machine learning and neural networks research (Hinton, 1986; Rumelhart et al., 1986a; Miikkulainen & Dyer, 1991; Bengio, Ducharme, & Vincent, 2001; Schwenk & Gauvain, 2002), and it may be of help in dealing with the curse of dimensionality and…...
[...]
...…Stacked Auto-Encoders) exploit as component or monitoring device a particular type of neural network: the auto-encoder, also called autoassociator, or Diabolo network (Rumelhart et al., 1986b; Bourlard & Kamp, 1988; Hinton & Zemel, 1994; Schwenk & Milgram, 1995; Japkowicz, Hanson, & Gluck, 2000)....
[...]
...• When we put artificial neurons (affine transformation followed by a non-linearity) in our set of elements, we obtain ordinary multi-layer neural networks (Rumelhart et al., 1986b)....
[...]
...4.1 Multi-Layer Neural Networks A typical set of equations for multi-layer neural networks (Rumelhart et al., 1986b) is the following....
[...]
...In the realm of supervised learning, multi-layer neural networks (Rumelhart et al., 1986a, 1986b) and in the realm of unsupervised learning, Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) have been introduced with the goal of learning distributed internal representations in the hidden…...
[...]

Journal Article•DOI•

Enriching Word Vectors with Subword Information

[...]

Piotr Bojanowski¹, Edouard Grave¹, Armand Joulin¹, Tomas Mikolov¹•Institutions (1)

Facebook¹

12 Jun 2017-Transactions of the Association for Computational Linguistics

TL;DR: This paper proposed a new approach based on skip-gram model, where each word is represented as a bag of character n-grams, words being represented as the sum of these representations, allowing to train models on large corpora quickly and allowing to compute word representations for words that did not appear in the training data.

...read moreread less

Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models to learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram, words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

...read moreread less

7,537 citations

Proceedings Article•

Distributed Representations of Sentences and Documents

[...]

Quoc V. Le¹, Tomas Mikolov¹•Institutions (1)

Google¹

21 Jun 2014

TL;DR: Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents, and its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.

...read moreread less

Abstract: Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

...read moreread less

7,119 citations

Posted Content•

Improving neural networks by preventing co-adaptation of feature detectors

[...]

Geoffrey E. Hinton¹, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov¹ - Show less +1 more•Institutions (1)

University of Toronto¹

03 Jul 2012-arXiv: Neural and Evolutionary Computing

TL;DR: The authors randomly omits half of the feature detectors on each training case to prevent complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors.

...read moreread less

Abstract: When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate. Random "dropout" gives big improvements on many benchmark tasks and sets new records for speech and object recognition.

...read moreread less

6,899 citations

Journal Article•DOI•

Training feedforward networks with the Marquardt algorithm

[...]

Martin T. Hagan, Mohammad Bagher Menhaj

01 Nov 1994-IEEE Transactions on Neural Networks

TL;DR: The Marquardt algorithm for nonlinear least squares is presented and is incorporated into the backpropagation algorithm for training feedforward neural networks and is found to be much more efficient than either of the other techniques when the network contains no more than a few hundred weights.

...read moreread less

Abstract: The Marquardt algorithm for nonlinear least squares is presented and is incorporated into the backpropagation algorithm for training feedforward neural networks. The algorithm is tested on several function approximation problems, and is compared with a conjugate gradient algorithm and a variable learning rate algorithm. It is found that the Marquardt algorithm is much more efficient than either of the other techniques when the network contains no more than a few hundred weights. >

...read moreread less

6,899 citations