scispace - formally typeset
Author

Sepp Hochreiter

Bio: Sepp Hochreiter is an academic researcher at Johannes Kepler University Linz. The author has contributed to research in topics including Deep learning and Artificial neural network. The author has an h-index of 42 and has co-authored 168 publications receiving 72,856 citations. Previous affiliations of Sepp Hochreiter include the Information Technology University and the Dalle Molle Institute for Artificial Intelligence Research.


Papers

Journal Article · DOI: 10.1162/NECO.1997.9.8.1735
Sepp Hochreiter, Jürgen Schmidhuber · Institutions (2)
01 Nov 1997 · Neural Computation
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
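
The gating and "constant error carousel" mechanism described above can be made concrete with a toy forward step. Below is a minimal sketch in the spirit of the 1997 formulation (input and output gates, no forget gate); the weight names (`W_in`, `W_ig`, `W_og`), shapes, and squashing functions are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

def lstm_1997_step(x, y_prev, s_prev, W_in, W_ig, W_og):
    """One step of a single memory cell, 1997-style: the cell state s is a
    constant error carousel (CEC) updated additively, and multiplicative
    gates control write (input gate) and read (output gate) access."""
    z = np.concatenate([x, y_prev])          # current input plus previous cell output
    g = np.tanh(W_in @ z)                    # candidate cell input
    i = 1.0 / (1.0 + np.exp(-(W_ig @ z)))    # input gate in (0, 1)
    o = 1.0 / (1.0 + np.exp(-(W_og @ z)))    # output gate in (0, 1)
    s = s_prev + i * g                       # CEC: additive update keeps error flow constant
    y = o * np.tanh(s)                       # gated cell output
    return y, s
```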


49,735 Citations


Open access · Proceedings Article
01 Jan 2017
Abstract: Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP), outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark.
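
The core of TTUR, as described above, is simply that the discriminator and the generator are updated with their own learning rates. A minimal sketch of one coupled SGD step follows; the gradient callables, parameter vectors, and the particular learning-rate values are placeholders, not the paper's recommended settings.

```python
def ttur_sgd_step(theta_d, theta_g, grad_d, grad_g, lr_d=3e-4, lr_g=1e-4):
    """One two-time-scale SGD step: the discriminator moves on a faster time
    scale (larger lr_d) than the generator (smaller lr_g). grad_d and grad_g
    return gradients of the chosen GAN losses w.r.t. the respective parameters."""
    theta_d = theta_d - lr_d * grad_d(theta_d, theta_g)   # discriminator update
    theta_g = theta_g - lr_g * grad_g(theta_d, theta_g)   # generator update (sees new discriminator)
    return theta_d, theta_g
```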


3,731 Citations


Open access · Posted Content
23 Nov 2015 · arXiv: Learning
Abstract: We introduce the "exponential linear unit" (ELU), which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs), and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However, ELUs have improved learning characteristics compared to units with other activation functions. In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero, like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward-propagated variation and information. Therefore, ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. On CIFAR-100, ELU networks significantly outperform ReLU networks with batch normalization, while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single-crop, single-model network.
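
The unit itself is a one-liner: identity for positive inputs and an exponential that saturates at -alpha for negative inputs, which is what yields the negative activations and the noise-robust deactivation state mentioned above. A small numpy sketch (alpha defaulting to 1; the clamping trick is mine, added only to avoid overflow warnings):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: f(x) = x for x > 0, alpha * (exp(x) - 1) otherwise.
    Negative inputs saturate toward -alpha, pushing mean activations toward zero."""
    neg = alpha * (np.exp(np.minimum(x, 0.0)) - 1.0)   # clamp so exp() is not evaluated on large positives
    return np.where(x > 0, x, neg)
```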


3,303 Citations


Open access · Posted Content
26 Jun 2017 · arXiv: Learning
Abstract: Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent on arbitrary GAN loss functions. TTUR has an individual learning rate for both the discriminator and the generator. Using the theory of stochastic approximation, we prove that the TTUR converges under mild assumptions to a stationary local Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs and Improved Wasserstein GANs (WGAN-GP), outperforming conventional GAN training on CelebA, CIFAR-10, SVHN, LSUN Bedrooms, and the One Billion Word Benchmark.
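
The FID introduced here compares two Gaussians fitted to Inception features of real and generated images. A minimal sketch of the distance itself, assuming the feature means and covariances have already been computed upstream (the Inception feature extraction is not shown and is not part of this sketch):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(mu_r, cov_r, mu_g, cov_g):
    """Frechet distance between N(mu_r, cov_r) and N(mu_g, cov_g):
    ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r @ cov_g)^(1/2))."""
    diff = mu_r - mu_g
    covmean = sqrtm(cov_r @ cov_g).real     # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```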


2,373 Citations


Open access
01 Jan 2001
Abstract: (Not available; the extracted text is a garbled PDF author/affiliation block and the abstract could not be recovered.)


Topics: Term (time) (63%)

1,424 Citations


Cited by

Open access · Proceedings Article · DOI: 10.1109/CVPR.2016.90
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun · Institutions (1)
27 Jun 2016
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets [40], but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
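
The residual reformulation described above amounts to adding an identity shortcut around a stack of layers, so the stack only has to learn the residual F(x) and the block outputs F(x) + x. A minimal sketch with plain matrix multiplications standing in for the convolutional layers (the shapes and the choice of two layers are illustrative simplifications):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """Basic residual block sketch: the weighted layers learn F(x) = W2 @ relu(W1 @ x)
    and the identity shortcut adds x back, so the block returns relu(F(x) + x)."""
    return relu(W2 @ relu(W1 @ x) + x)
```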


Topics: Deep learning (53%), Residual (53%), Convolutional neural network (53%)

93,356 Citations


Journal Article · DOI: 10.1162/NECO.1997.9.8.1735
Sepp Hochreiter, Jürgen Schmidhuber · Institutions (2)
01 Nov 1997 · Neural Computation
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, backpropagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.
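
The constant-error-flow claim rests on the cell state's identity self-connection: holding the gates fixed for the purpose of the argument, the derivative of the cell state with respect to its previous value is exactly 1, so error flowing back through the carousel neither vanishes nor explodes over long lags. A sketch of that step (notation my own, not the paper's):

```latex
s_t = s_{t-1} + i_t \, g(\mathrm{net}_t)
\quad\Rightarrow\quad
\frac{\partial s_t}{\partial s_{t-1}} = 1
\quad\Rightarrow\quad
\frac{\partial s_t}{\partial s_{t-k}} = \prod_{j=1}^{k} \frac{\partial s_{t-j+1}}{\partial s_{t-j}} = 1
```

A conventional recurrent unit instead contributes a factor of roughly (recurrent weight times activation derivative) per step, which shrinks or grows geometrically with the lag; this is the decaying error backflow the abstract describes.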


49,735 Citations


Journal Article · DOI: 10.1038/NATURE14539
Yann LeCun, Yoshua Bengio, Geoffrey E. Hinton · Institutions (5)
28 May 2015 · Nature
Abstract: Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
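
The review's central mechanism, backpropagation adjusting the parameters that compute each layer's representation from the previous layer's, can be illustrated with a two-layer toy network. The layer sizes, squared loss, and tanh nonlinearity below are arbitrary choices for the sketch, not anything prescribed by the article:

```python
import numpy as np

def two_layer_backprop_step(x, y, W1, W2, lr=0.01):
    """One gradient-descent step for a tiny two-layer network: compute the
    representations forward, then propagate the error backward to obtain the
    gradient of a squared loss with respect to every weight."""
    h = np.tanh(W1 @ x)                   # hidden-layer representation
    y_hat = W2 @ h                        # output layer (linear)
    err = y_hat - y                       # dL/dy_hat for L = 0.5 * ||y_hat - y||^2
    grad_W2 = np.outer(err, h)
    grad_h = W2.T @ err                   # error propagated back through the output layer
    grad_W1 = np.outer(grad_h * (1.0 - h ** 2), x)   # and through the tanh nonlinearity
    return W1 - lr * grad_W1, W2 - lr * grad_W2
```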


33,931 Citations


Open access · Book
Richard S. Sutton, Andrew G. Barto · Institutions (1)
01 Jan 1988
Abstract: Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.
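
Of the solution methods listed in Part II, temporal-difference learning is the easiest to show in a few lines. Below is a tabular TD(0) value update using a hypothetical dict-based value table; the step size and discount values are arbitrary choices for the sketch:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular TD(0) update: move the estimate V(s) toward the
    bootstrapped one-step target r + gamma * V(s_next).
    V is a dict mapping states to value estimates."""
    v_s = V.get(s, 0.0)
    target = r + gamma * V.get(s_next, 0.0)   # one-step bootstrapped return
    V[s] = v_s + alpha * (target - v_s)       # nudge V(s) toward the target
    return V
```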


Topics: Learning classifier system (69%), Reinforcement learning (69%), Apprenticeship learning (65%)

32,257 Citations


Open access · Book
18 Nov 2016
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.


Topics: Feature learning (61%), Deep learning (59%), Approximate inference (51%)

26,972 Citations


Performance Metrics

Author's H-index: 42

No. of papers from the Author in previous years

Year   Papers
2021   20
2020   19
2019   23
2018   14
2017   12
2016   6

Top Attributes


Author's top 5 most impactful journals

arXiv: Learning

25 papers, 7K citations

Bioinformatics

10 papers, 1.5K citations

bioRxiv

7 papers, 56 citations

Nucleic Acids Research

4 papers, 427 citations

Neural Computation

4 papers, 49.9K citations

Network Information
Related Authors (5)
Andreas Mayr

27 papers, 3.4K citations

87% related
Günter Klambauer

71 papers, 6.5K citations

84% related
Hubert Ramsauer

12 papers, 6.2K citations

83% related
Elisabeth Rumetshofer

7 papers, 89 citations

82% related
Kristina Preuer

6 papers, 372 citations

82% related