
Showing papers on "Word error rate" published in 2014


Proceedings Article
21 Jun 2014
TL;DR: A speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation is presented, based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.
Abstract: This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.
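
Every result on this page is quoted as a word error rate (WER), which is conventionally computed from a word-level Levenshtein alignment between reference and hypothesis. The following is a minimal illustrative sketch (the function name and example sentences are made up, not taken from the paper):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ≈ 0.333
```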

2,131 citations


Journal ArticleDOI
TL;DR: It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Abstract: Recently, the hybrid deep neural network (DNN)-hidden Markov model (HMM) has been shown to significantly improve speech recognition performance over the conventional Gaussian mixture model (GMM)-HMM. The performance improvement is partially attributed to the ability of the DNN to model complex correlations in speech features. In this paper, we show that further error rate reduction can be obtained by using convolutional neural networks (CNNs). We first present a concise description of the basic CNN and explain how it can be used for speech recognition. We further propose a limited-weight-sharing scheme that can better model speech features. The special structure of CNNs, such as local connectivity, weight sharing, and pooling, exhibits some degree of invariance to small shifts of speech features along the frequency axis, which is important for dealing with speaker and environment variations. Experimental results show that CNNs reduce the error rate by 6%-10% compared with DNNs on the TIMIT phone recognition and the voice search large vocabulary speech recognition tasks.
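
As a rough illustration of the frequency-axis convolution and pooling the abstract refers to (not the paper's limited-weight-sharing scheme), the following toy NumPy sketch convolves one filterbank frame with a bank of filters and max-pools over neighboring frequency positions; all shapes and values are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
frame = rng.standard_normal(40)          # one frame of 40 log mel filterbank energies
filters = rng.standard_normal((8, 5))    # 8 convolutional filters, each spanning 5 frequency bands

# Full weight sharing: slide each filter along the frequency axis.
n_pos = frame.shape[0] - filters.shape[1] + 1          # 36 positions
feature_maps = np.empty((filters.shape[0], n_pos))
for k, w in enumerate(filters):
    for p in range(n_pos):
        feature_maps[k, p] = np.maximum(w @ frame[p:p + 5], 0.0)  # ReLU activation

# Max pooling over groups of 4 adjacent frequency positions gives a
# representation tolerant to small shifts along the frequency axis.
pooled = feature_maps[:, :n_pos - n_pos % 4].reshape(filters.shape[0], -1, 4).max(axis=2)
print(pooled.shape)   # (8, 9)
```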

1,948 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: A novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR) to produce frame alignments.
Abstract: We propose a novel framework for speaker recognition in which extraction of sufficient statistics for the state-of-the-art i-vector model is driven by a deep neural network (DNN) trained for automatic speech recognition (ASR). Specifically, the DNN replaces the standard Gaussian mixture model (GMM) to produce frame alignments. The use of an ASR-DNN system in the speaker recognition pipeline is attractive as it integrates the information from speech content directly into the statistics, allowing the standard backends to remain unchanged. The improvement from the proposed framework compared to a state-of-the-art system is 30% relative at the equal error rate when evaluated on the telephone conditions from the 2012 NIST speaker recognition evaluation (SRE). The proposed framework is a successful way to efficiently leverage transcribed data for speaker recognition, thus opening up a wide spectrum of research directions.
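
A hedged sketch of the statistics extraction step described here: the DNN's per-frame senone posteriors play the role that GMM component posteriors normally play when accumulating the zeroth- and first-order Baum-Welch statistics consumed by an i-vector extractor. Shapes and variable names below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, K = 200, 20, 50                          # frames, feature dim, senones (toy sizes)
feats = rng.standard_normal((T, D))            # acoustic features, one row per frame
logits = rng.standard_normal((T, K))           # DNN outputs for the same frames
post = np.exp(logits - logits.max(axis=1, keepdims=True))
post /= post.sum(axis=1, keepdims=True)        # per-frame senone posteriors (softmax)

# Zeroth-order statistics: expected frame count per senone.
N = post.sum(axis=0)                           # shape (K,)
# First-order statistics: posterior-weighted sum of features per senone.
F = post.T @ feats                             # shape (K, D)

# N and F are exactly what a GMM/UBM would normally supply to the
# i-vector extractor; here the DNN defines the frame alignments instead.
print(N.shape, F.shape)
```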

631 citations


Journal ArticleDOI
TL;DR: An approach in which both word images and text strings are embedded in a common vectorial subspace, allowing one to cast recognition and retrieval tasks as a nearest neighbor problem; the representation is very fast to compute and, especially, to compare.
Abstract: This paper addresses the problems of word spotting and word recognition on images. In word spotting, the goal is to find all instances of a query word in a dataset of images. In recognition, the goal is to recognize the content of the word image, usually aided by a dictionary or lexicon. We describe an approach in which both word images and text strings are embedded in a common vectorial subspace. This is achieved by a combination of label embedding and attributes learning, and a common subspace regression. In this subspace, images and strings that represent the same word are close together, allowing one to cast recognition and retrieval tasks as a nearest neighbor problem. Contrary to most other existing methods, our representation has a fixed length, is low dimensional, and is very fast to compute and, especially, to compare. We test our approach on four public datasets of both handwritten documents and natural images, showing results comparable to or better than the state of the art on spotting and recognition tasks.
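
The nearest-neighbor casting of recognition can be illustrated generically: embed the lexicon strings and the word image into the common subspace and pick the closest string by cosine similarity. The embeddings below are random stand-ins, not the paper's attribute/label embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64                                                    # dimension of the common subspace
lexicon = ["receipt", "payment", "invoice", "total"]
word_vecs = rng.standard_normal((len(lexicon), d))        # stand-in embeddings of the text strings
image_vec = word_vecs[2] + 0.1 * rng.standard_normal(d)   # stand-in embedding of a word image

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Recognition = nearest neighbor of the image embedding among the lexicon embeddings.
scores = [cosine(image_vec, w) for w in word_vecs]
print("recognized as:", lexicon[int(np.argmax(scores))])  # "invoice"
```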

522 citations


Patent
17 Oct 2014
TL;DR: A method is described for identifying possible errors made by a speech recognition system without using a transcript of the words input to the system; the method does not consider the use of a word-to-word model.
Abstract: Methods are disclosed for identifying possible errors made by a speech recognition system without using a transcript of words input to the system. A method for model adaptation for a speech recognition system includes determining an error rate, corresponding to either recognition of instances of a word or recognition of instances of various words, without using a transcript of words input to the system. The method may further include adjusting an adaptation, of the model for the word or various models for the various words, based on the error rate. Apparatus are disclosed for identifying possible errors made by a speech recognition system without using a transcript of words input to the system. An apparatus for model adaptation for a speech recognition system includes a processor adapted to estimate an error rate, corresponding to either recognition of instances of a word or recognition of instances of various words, without using a transcript of words input to the system. The apparatus may further include a controller adapted to adjust an adaptation of the model for the word or various models for the various words, based on the error rate.

306 citations


Posted Content
TL;DR: To perform inference after model selection, this work proposes controlling the selective type I error, i.e., the error rate of a test given that it was performed; doing so recovers long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context.
Abstract: To perform inference after model selection, we propose controlling the selective type I error; i.e., the error rate of a test given that it was performed. By doing so, we recover long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context. Our proposal is closely related to data splitting and has a similar intuitive justification, but is more powerful. Exploiting the classical theory of Lehmann
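
As a hedged restatement of the quantity being controlled (using notation not taken from the paper), the selective type I error conditions the usual level-α requirement on the event that the hypothesis was selected for testing:

```latex
P_{H_0}\left(\text{reject } H_0 \,\middle|\, H_0 \text{ selected}\right) \le \alpha
```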

296 citations


Journal ArticleDOI
TL;DR: This work investigates convolutional neural networks for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM), and proposes a channel-wise convolution with two-way pooling.
Abstract: We investigate convolutional neural networks (CNNs) for large vocabulary distant speech recognition, trained using speech recorded from a single distant microphone (SDM) and multiple distant microphones (MDM). In the MDM case we explore a beamformed signal input representation compared with the direct use of multiple acoustic channels as a parallel input to the CNN. We have explored different weight sharing approaches, and propose a channel-wise convolution with two-way pooling. Our experiments, using the AMI meeting corpus, found that CNNs improve the word error rate (WER) by 6.5% relative compared to conventional deep neural network (DNN) models and 15.7% over a discriminatively trained Gaussian mixture model (GMM) baseline. For cross-channel CNN training, the WER improves by 3.5% relative over the comparable DNN structure. Compared with the best beamformed GMM system, cross-channel convolution reduces the WER by 9.7% relative, and matches the accuracy of a beamformed DNN.

248 citations


Journal ArticleDOI
29 Jan 2014
TL;DR: This work presents a large-scale benchmark evaluation on the second iteration of the publicly released NinaPro database, which contains surface electromyography data for 6 DOF force activations as well as for 40 discrete hand movements, and proposes the movement error rate as an alternative to the standard window-based accuracy.
Abstract: There has been increasing interest in applying learning algorithms to improve the dexterity of myoelectric prostheses. In this work, we present a large-scale benchmark evaluation on the second iteration of the publicly released NinaPro database, which contains surface electromyography data for 6 DOF force activations as well as for 40 discrete hand movements. The evaluation involves a modern kernel method and compares performance of three feature representations and three kernel functions. Both the force regression and movement classification problems can be learned successfully when using a nonlinear kernel function, while the exp-χ² kernel outperforms the more popular radial basis function kernel in all cases. Furthermore, combining surface electromyography and accelerometry in a multimodal classifier results in significant increases in accuracy as compared to when either modality is used individually. Since window-based classification accuracy should not be considered in isolation to estimate prosthetic controllability, we also provide results in terms of classification mistakes and prediction delay. To this extent, we propose the movement error rate as an alternative to the standard window-based accuracy. This error rate is insensitive to prediction delays and it allows us therefore to quantify mistakes and delays as independent performance characteristics. This type of analysis confirms that the inclusion of accelerometry is superior, as it results in fewer mistakes while at the same time reducing prediction delay.

175 citations


Posted Content
Xiangyu Zhang1, Jianhua Zou1, Xiang Ming1, Kaiming He2, Jian Sun2 
TL;DR: In this article, the reconstruction error of the nonlinear responses is minimized subject to a low-rank constraint, which helps to reduce the complexity of filters and reduces the accumulated error when multiple layers are approximated.
Abstract: This paper aims to accelerate the test-time computation of deep convolutional neural networks (CNNs). Unlike existing methods that are designed for approximating linear filters or linear responses, our method takes the nonlinear units into account. We minimize the reconstruction error of the nonlinear responses, subject to a low-rank constraint which helps to reduce the complexity of filters. We develop an effective solution to this constrained nonlinear optimization problem. An algorithm is also presented for reducing the accumulated error when multiple layers are approximated. A whole-model speedup ratio of 4x is demonstrated on a large network trained for ImageNet, while the top-5 error rate is only increased by 0.9%. Our accelerated model is comparable in speed to "AlexNet", but is 4.7% more accurate.
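
The paper's method constrains the nonlinear responses, which is more involved than plain weight factorization; as a simpler illustration of the underlying low-rank idea, the sketch below replaces one dense layer by two thin ones via a truncated SVD (layer sizes and rank are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, k = 512, 256, 32                 # toy layer sizes and target rank
W = rng.standard_normal((d_out, d_in))        # original dense layer weights
x = rng.standard_normal(d_in)                 # one input activation vector

# Rank-k truncated SVD: W ≈ (U_k * s_k) @ Vt_k, i.e. two thin layers.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :k] * s[:k]                          # shape (d_out, k)
B = Vt[:k, :]                                 # shape (k, d_in)

full = W @ x                                  # original cost: d_out * d_in multiplies
approx = A @ (B @ x)                          # reduced cost:  k * (d_in + d_out) multiplies
print("relative error:", np.linalg.norm(full - approx) / np.linalg.norm(full))
```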

173 citations


Proceedings ArticleDOI
Samy Bengio1, Georg Heigold1
14 Sep 2014
TL;DR: This work presents an alternative construction, where words are projected into a continuous embedding space in which words that sound alike are nearby in the Euclidean sense, and shows how such embeddings can still be used to score words that were not in the training dictionary.
Abstract: Speech recognition systems have used the concept of states as a way to decompose words into sub-word units for decades. As the number of such states now reaches the number of words used to train acoustic models, it is interesting to consider approaches that relax the assumption that words are made of states. We present here an alternative construction, where words are projected into a continuous embedding space where words that sound alike are nearby in the Euclidean sense. We show how such embeddings can still be used to score words that were not in the training dictionary. Initial experiments using a lattice rescoring approach and model combination on a large realistic dataset show improvements in word error rate.

166 citations


Journal ArticleDOI
TL;DR: This paper presents a new approach for the free text analysis of keystrokes that combines monograph and digraph analysis, and uses a neural network to predict missing digraphs based on the relation between the monitored keystrokes.
Abstract: Accurate recognition of free text keystroke dynamics is challenging due to the unstructured and sparse nature of the data and its underlying variability. As a result, most of the approaches published in the literature on free text recognition, except for one recent one, have reported extremely high error rates. In this paper, we present a new approach for the free text analysis of keystrokes that combines monograph and digraph analysis, and uses a neural network to predict missing digraphs based on the relation between the monitored keystrokes. Our proposed approach achieves an accuracy level comparable to the best results obtained through related techniques in the literature, while achieving a far lower processing time. Experimental evaluation involving 53 users in a heterogeneous environment yields a false acceptance ratio (FAR) of 0.0152% and a false rejection ratio (FRR) of 4.82%, at an equal error rate (EER) of 2.46%. Our follow-up experiment, in a homogeneous environment with 17 users, yields FAR = 0% and FRR = 5.01%, at EER = 2.13%.
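
A small illustrative sketch of how an equal error rate such as the one reported here is typically obtained: sweep the decision threshold, compute FAR and FRR at each value, and read off the point where the two curves cross. The score distributions below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
genuine = rng.normal(2.0, 1.0, 1000)     # similarity scores for genuine keystroke samples
impostor = rng.normal(0.0, 1.0, 1000)    # similarity scores for impostor samples

thresholds = np.sort(np.concatenate([genuine, impostor]))
far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance ratio
frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejection ratio

i = np.argmin(np.abs(far - frr))         # threshold where FAR and FRR (nearly) cross
print(f"EER ≈ {(far[i] + frr[i]) / 2:.3%} at threshold {thresholds[i]:.2f}")
```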

Proceedings ArticleDOI
04 May 2014
TL;DR: This paper shows how i-vector based speaker representations can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system and reports excellent results on a French language audio transcription task.
Abstract: State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and we report excellent results on a French language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented by the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker independent way without having to make any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task, and show that i-vector based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training, we obtained a word error rate (WER) reduction from 22.16% to 20.67%, whereas for sequence training the WER reduces from 19.93% to 18.40%.
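
The augmentation step itself amounts to concatenation: every acoustic feature vector in a speaker cluster gets the same i-vector appended before being fed to the DNN. A toy sketch with invented dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
frames = rng.standard_normal((300, 40))        # 300 acoustic feature vectors for one speaker cluster
ivector = rng.standard_normal(100)             # the cluster's i-vector (one per speaker)

# Repeat the i-vector for every frame and concatenate along the feature axis.
augmented = np.hstack([frames, np.tile(ivector, (frames.shape[0], 1))])
print(augmented.shape)                         # (300, 140): DNN input dimension grows by 100
```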

01 Jan 2014
TL;DR: On this type of noisy data, it is shown that on average, the BN features provide a 45% relative improvement in the Cavg or Equal Error Rate (EER) metrics across several test duration conditions, with respect to the single best acoustic features.
Abstract: This paper presents the application of Neural Network Bottleneck (BN) features in Language Identification (LID). BN features are generally used for Large Vocabulary Speech Recognition in conjunction with conventional acoustic features, such as MFCC or PLP. We compare the BN features to several common types of acoustic features used in state-of-the-art LID systems. The test set is from the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded radio communication channels. On this type of noisy data, we show that on average, the BN features provide a 45% relative improvement in the Cavg or Equal Error Rate (EER) metrics across several test duration conditions, with respect to our single best acoustic features. Index Terms: language identification, noisy speech, robust feature extraction

Proceedings ArticleDOI
14 Sep 2014
TL;DR: Experiments show that compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features respectively.
Abstract: We investigate the concept of speaker adaptive training (SAT) in the context of deep neural network (DNN) acoustic models. Previous studies have shown success of performing speaker adaptation for DNNs in speech recognition. In this paper, we apply SAT to DNNs by learning two types of feature mapping neural networks. Given an initial DNN model, these networks take speaker i-vectors as additional information and project DNN inputs into a speaker-normalized space. The final SAT model is obtained by updating the canonical DNN in the normalized feature space. Experiments on a Switchboard 110-hour setup show that compared with the baseline DNN, the SAT-DNN model brings 7.5% and 6.0% relative improvement when DNN inputs are speaker-independent and speaker-adapted features respectively. Further evaluations on the more challenging BABEL datasets reveal significant word error rate reduction achieved by SAT-DNN.

Proceedings ArticleDOI
Chunpeng Wu1, Wei Fan1, Yuan He1, Jun Sun1, Satoshi Naoi1 
15 Dec 2014
TL;DR: The relaxation convolution layer adopted in the R-CNN, unlike a traditional convolutional layer, does not require neurons within a feature map to share the same convolutional kernel, endowing the neural network with more expressive power.
Abstract: Deep learning methods have recently achieved impressive performance in the areas of visual recognition and speech recognition. In this paper, we propose a handwriting recognition method based on a relaxation convolutional neural network (R-CNN) and an alternately trained relaxation convolutional neural network (ATR-CNN). Previous methods regularize the CNN at the fully connected layer or the spatial-pooling layer; we instead focus on the convolutional layer. The relaxation convolution layer adopted in our R-CNN, unlike a traditional convolutional layer, does not require neurons within a feature map to share the same convolutional kernel, endowing the neural network with more expressive power. As relaxation convolution sharply increases the total number of parameters, we adopt alternate training in ATR-CNN to regularize the neural network during the training procedure. Our previous CNN took 1st place in the ICDAR'13 Chinese Handwriting Character Recognition Competition, while our latest ATR-CNN outperforms it and achieves state-of-the-art accuracy with an error rate of 3.94%, further narrowing the gap between machine and human observers (3.87%).
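
To illustrate what relaxing weight sharing means (only as a toy contrast, not the authors' training procedure), the sketch below compares an ordinary shared-kernel convolution with a "relaxed" one in which every output position owns its own kernel; this is why the parameter count explodes and extra regularization is needed:

```python
import numpy as np

rng = np.random.default_rng(6)
img = rng.standard_normal((8, 8))              # toy single-channel input
k = 3                                          # kernel size
out = 8 - k + 1                                # 6x6 output positions

shared = rng.standard_normal((k, k))                      # ordinary convolution: one kernel
relaxed = rng.standard_normal((out, out, k, k))           # relaxation convolution: one kernel per position

conv_shared = np.empty((out, out))
conv_relaxed = np.empty((out, out))
for i in range(out):
    for j in range(out):
        patch = img[i:i + k, j:j + k]
        conv_shared[i, j] = np.sum(shared * patch)          # same weights everywhere
        conv_relaxed[i, j] = np.sum(relaxed[i, j] * patch)  # position-specific weights

print("shared parameters:", shared.size)      # 9
print("relaxed parameters:", relaxed.size)    # 324 -- hence the need for extra regularization
```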

Proceedings ArticleDOI
04 May 2014
TL;DR: Results presented in this paper indicate that channel concatenation gives similar or better results than beamforming; augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system and yields additional improvements for far field speech recognition.
Abstract: This paper presents an investigation of far field speech recognition using beamforming and channel concatenation in the context of Deep Neural Network (DNN) based feature extraction. While speech enhancement with beamforming is attractive, the algorithms are typically signal-based with no information about the special properties of speech. A simple alternative to beamforming is concatenating multiple channel features. Results presented in this paper indicate that channel concatenation gives similar or better results. On average the DNN front-end yields a 25% relative reduction in Word Error Rate (WER). Further experiments aim at including relevant information in training adapted DNN features. Augmenting the standard DNN input with the bottleneck feature from a Speaker Aware Deep Neural Network (SADNN) shows a general advantage over the standard DNN based recognition system, and yields additional improvements for far field speech recognition.

Proceedings ArticleDOI
31 May 2014
TL;DR: In this article, the authors consider the task of interactive communication in the presence of adversarial errors and present tight bounds on the tolerable error-rates in a number of different settings, including adaptive interactive communication where the communicating parties decide who should speak next based on the history of the interaction.
Abstract: We consider the task of interactive communication in the presence of adversarial errors and present tight bounds on the tolerable error-rates in a number of different settings. Most significantly, we explore adaptive interactive communication where the communicating parties decide who should speak next based on the history of the interaction. In particular, this decision can depend on estimates of the amount of errors that have occurred so far. Braverman and Rao [STOC'11] show that non-adaptively one can code for any constant error rate below 1/4 but not more. They asked whether this bound could be improved using adaptivity. We answer this open question in the affirmative (with a slightly different collection of resources): Our adaptive coding scheme tolerates any error rate below 2/7 and we show that tolerating a higher error rate is impossible. We also show that in the setting of Franklin et al. [CRYPTO'13], where parties share randomness not known to the adversary, adaptivity increases the tolerable error rate from 1/2 to 2/3. For list-decodable interactive communications, where each party outputs a constant size list of possible outcomes, the tight tolerable error rate is 1/2. Our negative results hold even if the communication and computation are unbounded, whereas for our positive results communication and computations are polynomially bounded. Most prior work considered coding schemes with linear communication bounds, while allowing unbounded computations. We argue that studying tolerable error rates in this relaxed context helps to identify a setting's intrinsic optimal error rate. We set forward a strong working hypothesis which stipulates that for any setting the maximum tolerable error rate is independent of many computational and communication complexity measures. We believe this hypothesis to be a powerful guideline for the design of simple, natural, and efficient coding schemes and for understanding the (im)possibilities of coding for interactive communications.

Proceedings ArticleDOI
04 May 2014
TL;DR: The two network architectures, convolution along the frequency axis and time-domain convolution, can be readily combined; the combined model reports an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.
Abstract: Convolutional neural networks have proved very successful in image recognition, thanks to their tolerance to small translations. They have recently been applied to speech recognition as well, using a spectral representation as input. However, in this case the translations along the two axes - time and frequency - should be handled quite differently. So far, most authors have focused on convolution along the frequency axis, which offers invariance to speaker and speaking style variations. Other researchers have developed a different network architecture that applies time-domain convolution in order to process a longer time-span of input in a hierarchical manner. These two approaches have different background motivations, and both offer significant gains over a standard fully connected network. Here we show that the two network architectures can be readily combined, so that their advantages add up. With the combined model we report an error rate of 16.7% on the TIMIT phone recognition task, a new record on this dataset.

Book ChapterDOI
19 Oct 2014
TL;DR: This work combines four different state-of-the-art approaches by using 15 different algorithms for ensemble learning and evaluates their performance on five different datasets, suggesting that ensemble learning can reduce the error rate of state-of-the-art named entity recognition systems by 40%, thereby leading to over 95% F-score in the best run.
Abstract: A considerable portion of the information on the Web is still only available in unstructured form. Implementing the vision of the Semantic Web thus requires transforming this unstructured data into structured data. One key step during this process is the recognition of named entities. Previous works suggest that ensemble learning can be used to improve the performance of named entity recognition tools. However, no comparison of the performance of existing supervised machine learning approaches on this task has been presented so far. We address this research gap by presenting a thorough evaluation of named entity recognition based on ensemble learning. To this end, we combine four different state-of-the-art approaches by using 15 different algorithms for ensemble learning and evaluate their performance on five different datasets. Our results suggest that ensemble learning can reduce the error rate of state-of-the-art named entity recognition systems by 40%, thereby leading to over 95% F-score in our best run.
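
The simplest ensemble rule for this task is per-token majority voting over the label sequences produced by the individual NER systems; the sketch below uses hard-coded stand-in predictions rather than the output of the four tools evaluated in the paper:

```python
from collections import Counter

def majority_vote(predictions):
    """Per-token majority vote over label sequences from several NER systems."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]

# Stand-in label sequences for the sentence "Barack Obama visited Berlin".
system_a = ["B-PER", "I-PER", "O", "B-LOC"]
system_b = ["B-PER", "I-PER", "O", "B-ORG"]
system_c = ["B-PER", "O",     "O", "B-LOC"]

print(majority_vote([system_a, system_b, system_c]))
# ['B-PER', 'I-PER', 'O', 'B-LOC']
```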

Journal ArticleDOI
TL;DR: This paper investigates and compares the performance of three supervised methods of classification including support vector machine (SVM), probabilistic neural network (PNN), and k-nearest neighbor (KNN) for water quality classification and demonstrates that the SVM algorithm presents the best performance with no errors in calibration and validation phases.
Abstract: Water quality is one of the major criteria for determining the planning and operation policies of water resources systems. In order to classify the quality of a water resource such as an aquifer, it is necessary that the quality of a large number of water samples be determined, which might be a very time consuming process. The goal of this paper is to classify the water quality using classification algorithms in order to reduce the computational time. The question is whether and to what extent the results of the classification algorithms are different. Another question is what method provides the most accurate results. In this regard, this paper investigates and compares the performance of three supervised methods of classification including support vector machine (SVM), probabilistic neural network (PNN), and k-nearest neighbor (KNN) for water quality classification. Using two performance evaluation statistics including error rate and error value, the efficiency of the algorithms is investigated. Furthermore, a 5-fold cross validation is performed to assess the effect of data value on the performance of the applied algorithms. Results demonstrate that the SVM algorithm presents the best performance with no errors in calibration and validation phases. The KNN algorithm, which has the largest total number and total value of errors, is the weakest for classification of water quality data.
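
A minimal scikit-learn sketch of this kind of comparison (SVM versus k-nearest neighbors under 5-fold cross-validation) is shown below; it uses a bundled toy dataset in place of water quality samples and omits the PNN, which has no standard scikit-learn implementation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)              # stand-in for labeled water quality samples

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5)     # 5-fold cross-validation, as in the paper
    print(f"{name}: mean error rate = {1 - acc.mean():.3f}")
```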

Proceedings ArticleDOI
04 May 2014
TL;DR: This work has evaluated the proposed direct SC-based adaptation method in the large scale 320-hr Switchboard task and shown that the proposed method leads to up to 8% relative reduction in word error rate in Switchboard by using only a very small number of adaptation utterances per speaker.
Abstract: Recently an effective fast speaker adaptation method using discriminative speaker code (SC) has been proposed for the hybrid DNN-HMM models in speech recognition [1]. This adaptation method depends on a joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes using the standard back-propagation algorithm. In this paper, we propose an alternative direct adaptation in model space, where speaker codes are directly connected to the original DNN models through a set of new connection weights, which can be estimated very efficiently from all or part of training data. As a result, the proposed method is more suitable for large scale speech recognition tasks since it eliminates the time-consuming process of training a separate adaptation neural network. In this work, we have evaluated the proposed direct SC-based adaptation method in the large scale 320-hr Switchboard task. Experimental results have shown that the proposed SC-based rapid adaptation method is very effective not only for small recognition tasks but also for very large scale tasks. For example, the proposed method leads to up to an 8% relative reduction in word error rate on Switchboard using only a very small number of adaptation utterances per speaker (from 10 to a few dozen). Moreover, the extra training time required for adaptation is also significantly reduced compared to the method in [1].

Posted Content
TL;DR: Finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model are provided and can be used to analyze many aggregation methods, including majority voting, weighted majority voting and the oracle Maximum A Posteriori rule.
Abstract: Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task, and to aggregate the labels in order to obtain results of high quality. In this paper, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model. The bounds are derived for multi-class labeling, and can be used to analyze many aggregation methods, including majority voting, weighted majority voting and the oracle Maximum A Posteriori (MAP) rule. We show that the oracle MAP rule approximately optimizes our upper bound on the mean error rate of weighted majority voting in certain settings. We propose an iterative weighted majority voting (IWMV) method that optimizes the error rate bound and approximates the oracle MAP rule. Its one-step version has a provable theoretical guarantee on the error rate. The IWMV method is intuitive and computationally simple. Experimental results on simulated and real data show that IWMV performs at least on par with the state-of-the-art methods, and it has a much lower computational cost (around one hundred times faster) than the state-of-the-art methods.
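
A toy sketch in the spirit of one-step iterative weighted majority voting for binary labels: workers are first scored by their agreement with the plain majority vote, then votes are recombined with log-odds weights. This is a simplified illustration, not the authors' exact estimator, and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(7)
n_workers, n_tasks = 11, 200
truth = rng.choice([-1, 1], size=n_tasks)
accuracy = rng.uniform(0.55, 0.9, size=n_workers)         # unknown worker reliabilities
# Each worker reports the true label with probability equal to their accuracy.
labels = np.where(rng.random((n_workers, n_tasks)) < accuracy[:, None], truth, -truth)

# Step 1: plain majority vote.
mv = np.sign(labels.sum(axis=0))

# Step 2: estimate each worker's accuracy by agreement with the majority vote,
# then recombine with log-odds weights (one-step weighted majority voting).
est_acc = np.clip((labels == mv).mean(axis=1), 1e-3, 1 - 1e-3)
weights = np.log(est_acc / (1 - est_acc))
wmv = np.sign(weights @ labels)

print("majority vote error rate:         ", (mv != truth).mean())
print("weighted majority vote error rate:", (wmv != truth).mean())
```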

Yun Lei1, Luciana Ferrer1, Aaron Lawson1, Mitchell McLaren1, Nicolas Scheffer1 
01 Jan 2014
TL;DR: Two novel frontends are proposed for robust language identification (LID) using a convolutional neural network (CNN) trained for automatic speech recognition (ASR); in the CNN/i-vector frontend, the CNN is used to obtain the posterior probabilities for i-vector training and extraction instead of a universal background model (UBM).
Abstract: This paper proposes two novel frontends for robust language identification (LID) using a convolutional neural network (CNN) trained for automatic speech recognition (ASR). In the CNN/i-vector frontend, the CNN is used to obtain the posterior probabilities for i-vector training and extraction instead of a universal background model (UBM). The CNN/posterior frontend is somewhat similar to a phonetic system in that the occupation counts of (tied) triphone states (senones) given by the CNN are used for classification. They are compressed to a low dimensional vector using probabilistic principal component analysis (PPCA). Evaluated on heavily degraded speech data, the proposed front ends provide significant improvements of up to 50% on average equal error rate compared to a UBM/i-vector baseline. Moreover, the proposed frontends are complementary and give significant gains of up to 20% relative to the best single system when combined.

Journal ArticleDOI
TL;DR: An MCMC algorithm is presented that achieves significantly lower logical error rates than MWPM at the cost of a runtime complexity increased by a factor O(L^2); for depolarizing noise with error rate p, the logical error rate is suppressed as O((p/3)^(L/2)), an exponential improvement over all previously existing efficient algorithms.
Abstract: Minimum-weight perfect matching (MWPM) has been the primary classical algorithm for error correction in the surface code, since it is of low runtime complexity and achieves relatively low logical error rates [Phys. Rev. Lett. 108, 180501 (2012)]. A Markov chain Monte Carlo (MCMC) algorithm [Phys. Rev. Lett. 109, 160503 (2012)] is able to achieve lower logical error rates and higher thresholds than MWPM, but requires a classical runtime complexity which is super-polynomial in L, the linear size of the code. In this work we present an MCMC algorithm that achieves significantly lower logical error rates than MWPM at the cost of a runtime complexity increased by a factor O(L^2). This advantage is due to taking correlations between bit- and phase-flip errors (as they appear, for example, in depolarizing noise) as well as entropic factors (i.e., the numbers of likely error paths in different equivalence classes) into account. For depolarizing noise with error rate p, we present an efficient algorithm for which the logical error rate is suppressed as O((p/3)^(L/2)) as p → 0, an exponential improvement over all previously existing efficient algorithms. Our algorithm allows for tradeoffs between runtime and achieved logical error rates as well as for parallelization, and can also be used for correction in the case of imperfect stabilizer measurements.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work proposes a novel second-order stochastic optimization algorithm based on analytic results showing that a non-zero mean of features is harmful for the optimization, and proves convergence of the algorithm in a convex setting.
Abstract: Deep neural networks are typically optimized with stochastic gradient descent (SGD). In this work, we propose a novel second-order stochastic optimization algorithm. The algorithm is based on analytic results showing that a non-zero mean of features is harmful for the optimization. We prove convergence of our algorithm in a convex setting. In our experiments we show that our proposed algorithm converges faster than SGD. Further, in contrast to earlier work, our algorithm allows for training models with a factorized structure from scratch. We found this structure to be very useful not only because it accelerates training and decoding, but also because it is a very effective means against overfitting. Combining our proposed optimization algorithm with this model structure, model size can be reduced by a factor of eight and still improvements in recognition error rate are obtained. Additional gains are obtained by improving the Newbob learning rate strategy.

Journal ArticleDOI
TL;DR: Experimental results obtained on the IAM off-line database demonstrate that consistent word error rate reductions can be achieved with neural network language models when compared with statistical N-gram language models on the three tested systems.

Proceedings ArticleDOI
01 Dec 2014
TL;DR: It is shown that it is possible to learn an efficient acoustic model using only a small amount of easily available word-level similarity annotations, and the resulting model is shown to perform much better than raw speech features in an ABX minimal-pair discrimination task.
Abstract: We show that it is possible to learn an efficient acoustic model using only a small amount of easily available word-level similarity annotations. In contrast to the detailed phonetic labeling required by classical speech recognition technologies, the only information our method requires are pairs of speech excerpts which are known to be similar (same word) and pairs of speech excerpts which are known to be different (different words). An acoustic model is obtained by training shallow and deep neural networks, using an architecture and a cost function well-adapted to the nature of the provided information. The resulting model is evaluated in an ABX minimal-pair discrimination task and is shown to perform much better (11.8% ABX error rate) than raw speech features (19.6%), not far from a fully supervised baseline (best neural network: 9.2%, HMM-GMM: 11%).
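
A generic sketch of the kind of pairwise objective this description suggests: embeddings of same-word excerpts are pulled together while different-word excerpts are pushed at least a margin apart. The loss form, margin, and embeddings below are illustrative stand-ins, not the authors' architecture or cost function:

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_word, margin=1.0):
    """Pairwise loss: same-word pairs are pulled together, different-word
    pairs are pushed at least `margin` apart (hinge on the distance)."""
    dist = np.linalg.norm(emb_a - emb_b, axis=1)
    pos = same_word * dist**2                                   # attract matching pairs
    neg = (1 - same_word) * np.maximum(margin - dist, 0.0)**2   # repel non-matching pairs
    return np.mean(pos + neg)

rng = np.random.default_rng(8)
emb_a = rng.standard_normal((4, 16))            # embeddings of four speech excerpts
emb_b = np.vstack([emb_a[:2] + 0.05 * rng.standard_normal((2, 16)),  # two same-word pairs
                   rng.standard_normal((2, 16))])                    # two different-word pairs
same_word = np.array([1, 1, 0, 0])              # which pairs are known to be the same word
print(contrastive_loss(emb_a, emb_b, same_word))
```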

Proceedings ArticleDOI
14 Sep 2014
TL;DR: The proposed models are feedforward networks with the property that the unfolded layers corresponding to the recurrent layer have time-shifted inputs and tied weight matrices; they can be implemented efficiently through matrix-matrix operations on GPU architectures, which makes them scalable for large tasks.
Abstract: We introduce recurrent neural networks (RNNs) for acoustic modeling which are unfolded in time for a fixed number of time steps. The proposed models are feedforward networks with the property that the unfolded layers which correspond to the recurrent layer have time-shifted inputs and tied weight matrices. Besides the temporal depth due to unfolding, hierarchical processing depth is added by means of several non-recurrent hidden layers inserted between the unfolded layers and the output layer. The training of these models: (a) has a complexity that is comparable to deep neural networks (DNNs) with the same number of layers; (b) can be done on frame-randomized minibatches; (c) can be implemented efficiently through matrix-matrix operations on GPU architectures, which makes it scalable for large tasks. Experimental results on the 300-hour Switchboard English conversational telephony task show a 5% relative improvement in word error rate over state-of-the-art DNNs trained on FMLLR features with i-vector speaker adaptation and Hessian-free sequence discriminative training. Index Terms: recurrent neural networks, speech recognition
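
A toy sketch of the unfolding idea: the same (tied) input and recurrent weight matrices are applied at every time step over time-shifted inputs, so the unrolled recurrence is just a fixed-depth feedforward pass. Sizes, nonlinearity, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(9)
steps, d_in, d_h = 5, 40, 64                   # fixed unrolling depth, input dim, hidden dim
W_in = rng.standard_normal((d_h, d_in)) * 0.1  # tied input weights, shared by all unfolded layers
W_rec = rng.standard_normal((d_h, d_h)) * 0.1  # tied recurrent weights, shared by all unfolded layers
frames = rng.standard_normal((steps, d_in))    # time-shifted inputs x_{t-4}, ..., x_t

h = np.zeros(d_h)
for x in frames:                               # one unfolded "layer" per time step
    h = np.tanh(W_in @ x + W_rec @ h)          # same weight matrices at every step

print(h.shape)                                 # final hidden state, fed to further non-recurrent layers
```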

Journal ArticleDOI
TL;DR: Deep bidirectional LSTM networks processing log Mel filterbank outputs deliver best results with clean models, reaching down to 42% word error rate (WER) at signal-to-noise ratios ranging from −6 to 9 dB.

Journal ArticleDOI
TL;DR: This paper introduces multimode systems that allow seamless switching between audible and silent speech, investigates speaking mode variations, investigates different measures which quantify speaking mode differences, and presents the spectral mapping algorithm, which improves the word error rate on silent speech by up to 14.3% relative.
Abstract: An electromyographic (EMG) silent speech recognizer is a system that recognizes speech by capturing the electric potentials of the human articulatory muscles, thus enabling the user to communicate silently. After having established a baseline EMG-based continuous speech recognizer, in this paper, we investigate speaking mode variations, i.e., discrepancies between audible and silent speech that deteriorate recognition accuracy. We introduce multimode systems that allow seamless switching between audible and silent speech, investigate different measures which quantify speaking mode differences, and present the spectral mapping algorithm, which improves the word error rate (WER) on silent speech by up to 14.3% relative. Our best average silent speech WER is 34.7%, and our best WER on audibly spoken speech is 16.8%.