Proceedings Article

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

TL;DR: This paper presents a novel method for training RNNs to label unsegmented sequences directly, removing the need for both pre-segmented training data and post-processing of the network outputs.
Abstract: Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
Citations
Book
18 Nov 2016
TL;DR: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts; it is used in applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Abstract: Deep learning is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts. Because the computer gathers knowledge from experience, there is no need for a human computer operator to formally specify all the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones; a graph of these hierarchies would be many layers deep. This book introduces a broad range of topics in deep learning. The text offers mathematical and conceptual background, covering relevant concepts in linear algebra, probability theory and information theory, numerical computation, and machine learning. It describes deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology; and it surveys such applications as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames. Finally, the book offers research perspectives, covering such theoretical topics as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. Deep Learning can be used by undergraduate or graduate students planning careers in either industry or research, and by software engineers who want to begin using deep learning in their products or platforms. A website offers supplementary material for both readers and instructors.

38,208 citations

Journal Article
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning and evolutionary computation, and indirect search for short programs encoding deep and large networks.

14,635 citations


Cites background or result from "Connectionist temporal classificati..."

  • ...Currently successful techniques: LSTM RNNs and GPU-MPCNNs. Most competition-winning or benchmark record-setting Deep Learners actually use one of two supervised techniques: (a) recurrent LSTM (1997) trained by CTC (2006) (Sections 5.13, 5.17, 5.21, 5.22), or (b) feedforward GPU-MPCNNs (2011, Sections 5.19, 5.21, 5.22) based on CNNs (1979, Section 5.4) with MP (1992, Section 5.11) trained through BP (1989–2007, Sections 5.8, 5.16)....

    [...]

  • ...2009: first official competitions won by RNNs, and with MPCNNs. Stacks of LSTM RNNs trained by CTC (Sections 5.13, 5.16) became the first RNNs to win official international pattern recognition contests (with secret test sets known only to the organizers)....

    [...]

  • ...CTC-LSTM also helped to score first at NIST’s OpenHaRT2013 evaluation (Bluche et al., 2014)....

    [...]

  • ...Unlike traditional methods for automatic sequential program synthesis (e.g., Balzer, 1985; Deville & Lau, 1994; Soloway, 1986; Waldinger & Lee, 1969), RNNs can learn programs that mix sequential and parallel information processing in a natural and efficient way, exploiting the massive parallelism viewed as crucial for sustaining the rapid decline of computation cost observed over the past 75 years....

    Abbreviations in alphabetical order: AE: Autoencoder; AI: Artificial Intelligence; ANN: Artificial Neural Network; BFGS: Broyden–Fletcher–Goldfarb–Shanno; BNN: Biological Neural Network; BM: Boltzmann Machine; BP: Backpropagation; BRNN: Bi-directional Recurrent Neural Network; CAP: Credit Assignment Path; CEC: Constant Error Carousel; CFL: Context Free Language; CMA-ES: Covariance Matrix Estimation ES; CNN: Convolutional Neural Network; CoSyNE: Co-Synaptic Neuro-Evolution; CSL: Context Sensitive Language; CTC: Connectionist Temporal Classification; DBN: Deep Belief Network; DCT: Discrete Cosine Transform; DL: Deep Learning; DP: Dynamic Programming; DS: Direct Policy Search; EA: Evolutionary Algorithm; EM: Expectation Maximization; ES: Evolution Strategy; FMS: Flat Minimum Search; FNN: Feedforward Neural Network; FSA: Finite State Automaton; GMDH: Group Method of Data Handling; GOFAI: Good Old-Fashioned AI; GP: Genetic Programming; GPU: Graphics Processing Unit; GPU-MPCNN: GPU-Based MPCNN; HMM: Hidden Markov Model; HRL: Hierarchical Reinforcement Learning; HTM: Hierarchical Temporal Memory; HMAX: Hierarchical Model "and X"; LSTM: Long Short-Term Memory (RNN); MDL: Minimum Description Length; MDP: Markov Decision Process; MNIST: Mixed National Institute of Standards and Technology Database; MP: Max-Pooling; MPCNN: Max-Pooling CNN; NE: NeuroEvolution; NEAT: NE of Augmenting Topologies; NES: Natural Evolution Strategies; NFQ: Neural Fitted Q-Learning; NN: Neural Network; OCR: Optical Character Recognition; PCC: Potential Causal Connection; PDCC: Potential Direct Causal Connection; PM: Predictability Minimization; POMDP: Partially Observable MDP; RAAM: Recursive Auto-Associative Memory; RBM: Restricted Boltzmann Machine; ReLU: Rectified Linear Unit; RL: Reinforcement Learning; RNN: Recurrent Neural Network; R-prop: Resilient Backpropagation; SL: Supervised Learning; SLIM NN: Self-Delimiting Neural Network; SOTA: Self-Organizing Tree Algorithm; SVM: Support Vector Machine; TDNN: Time-Delay Neural Network; TIMIT: TI/SRI/MIT Acoustic-Phonetic Continuous Speech Corpus; UL: Unsupervised Learning; WTA: Winner-Take-All.

    [...]

  • ...CTC-LSTM performs simultaneous segmentation (alignment) and recognition (Section 5.22)....

    [...]

Proceedings Article
08 Dec 2014
TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

12,299 citations
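
The source-reversal trick mentioned at the end of the abstract above can be illustrated with a minimal sketch. It assumes sentences are already tokenized into lists of words; the function name `reverse_source` and the sample sentence pair are hypothetical and are not taken from the paper.

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence; leave targets untouched."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]


# Hypothetical (source, target) pair, already split into tokens.
pairs = [(["a", "b", "c"], ["x", "y", "z"])]
reversed_pairs = reverse_source(pairs)
```

After reversal, the first source word sits next to the start of the target sequence, which is one way to picture the "short term dependencies" the abstract refers to.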

Posted Content
TL;DR: This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Abstract: Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT'14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous best result on this task. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

11,936 citations


Cites methods from "Connectionist temporal classificati..."

  • ...The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, but it assumes a monotonic alignment between the inputs and the outputs [11]....

    [...]

Proceedings Article
26 May 2013
TL;DR: This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Abstract: Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.

7,316 citations


Cites background or methods from "Connectionist temporal classificati..."

  • ...In the past CTC networks have been decoded using either a form of best-first decoding known as prefix search, or by simply taking the most active output at every timestep [8]....

    [...]

  • ...Instead of combining RNNs with HMMs, it is possible to train RNNs ‘end-to-end’ for speech recognition [8, 9, 10]....

    [...]

  • ...possible alignments and determine the normalised probability Pr(z|x) of the target sequence given the input sequence [8]....

    [...]

  • ...The first method, known as Connectionist Temporal Classification (CTC) [8, 9], uses a softmax layer to define a separate output distribution Pr(k|t) at every step t along the input sequence....

    [...]
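
The snippets above mention two CTC operations: decoding by taking the most active output at every timestep, and summing over possible alignments to obtain Pr(z|x). The following is a minimal pure-Python sketch of both, not the authors' implementation. It assumes that `probs[t][k]` holds the per-timestep softmax output Pr(k|t), that the blank symbol has index 0, and that the input has at least one timestep; the names `BLANK`, `best_path_decode`, and `ctc_probability` are illustrative.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def best_path_decode(probs):
    """Pick the most active output per timestep, merge repeats, drop blanks."""
    path = [max(range(len(p)), key=p.__getitem__) for p in probs]
    decoded, prev = [], None
    for k in path:
        if k != prev and k != BLANK:
            decoded.append(k)
        prev = k
    return decoded


def ctc_probability(probs, target):
    """Pr(z|x): sum over all alignments of `target`, via a forward recursion."""
    # Interleave blanks around the labels: [blank, z1, blank, z2, ..., blank].
    ext = [BLANK]
    for k in target:
        ext += [k, BLANK]
    size = len(ext)

    # Initialisation: an alignment may start with a blank or the first label.
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end in the final blank or the final label.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)
```

For real sequence lengths the recursion would normally be carried out in log space or with rescaling to avoid underflow; the plain-probability form is kept here only for readability.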

References
Journal Article
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Abstract: Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

72,897 citations


"Connectionist temporal classificati..." refers background in this paper

  • ...BLSTM combines the ability of Long Short-Term Memory (LSTM; Hochreiter & Schmidhuber, 1997) to bridge long time lags with the access of bidirectional RNNs (BRNNs; Schuster & Paliwal, 1997) to past and future context....

    [...]


Journal Article
Lawrence R. Rabiner
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

21,819 citations
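
As a rough illustration of an HMM in which the observation is a probabilistic function of a hidden state, the sketch below computes the likelihood of an observation sequence with the standard forward recursion. It is a minimal sketch, not code from the tutorial: the function name `hmm_forward` and the two-coin parameters (`pi`, `A`, `B`) are made-up values chosen only for illustration, and the observation sequence is assumed to be non-empty.

```python
def hmm_forward(pi, A, B, obs):
    """Likelihood of `obs` under a discrete HMM, via the forward recursion.

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[i][o] : probability of emitting symbol o while in state i
    """
    n = len(pi)
    # alpha[i] = Pr(observations so far, current state = i)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical two-coin model: state 0 is a fair coin, state 1 is biased.
# Symbols: 0 = heads, 1 = tails. All numbers are illustrative assumptions.
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.5, 0.5], [0.9, 0.1]]
likelihood = hmm_forward(pi, A, B, [0, 0, 1])
```

The recursion costs on the order of N²·T operations for N states and T observations, rather than enumerating all N^T state sequences.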

Book
01 Jan 1995
TL;DR: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition, and is designed as a text, with over 100 exercises, to benefit anyone involved in the fields of neural computation and pattern recognition.
Abstract: From the Publisher: This is the first comprehensive treatment of feed-forward neural networks from the perspective of statistical pattern recognition. After introducing the basic concepts, the book examines techniques for modelling probability density functions and the properties and merits of the multi-layer perceptron and radial basis function network models. Also covered are various forms of error functions, principal algorithms for error function minimalization, learning and generalization in neural networks, and Bayesian techniques and their applications. Designed as a text, with over 100 exercises, this fully up-to-date work will benefit anyone involved in the fields of neural computation and pattern recognition.

19,056 citations


"Connectionist temporal classificati..." refers background in this paper

  • ...Note that this is the same principle underlying the standard neural network objective functions (Bishop, 1995)....

    [...]

Proceedings Article
28 Jun 2001
TL;DR: This work presents iterative parameter estimation algorithms for conditional random fields and compares the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Abstract: We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

13,190 citations


"Connectionist temporal classificati..." refers background in this paper

  • ...Currently, graphical models such as hidden Markov Models (HMMs; Rabiner, 1989), conditional random fields (CRFs; Lafferty et al., 2001) and their variants, are the predominant framework for sequence la- Appearing in Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh,…...

    [...]