
Showing papers on "Word error rate published in 2015"


Proceedings ArticleDOI
TL;DR: FaceNet as discussed by the authors uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches, and achieves state-of-the-art face recognition performance using only 128 bytes per face.
Abstract: Despite significant recent advances in the field of face recognition, implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors. Our method uses a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer as in previous deep learning approaches. To train, we use triplets of roughly aligned matching / non-matching face patches generated using a novel online triplet mining method. The benefit of our approach is much greater representational efficiency: we achieve state-of-the-art face recognition performance using only 128 bytes per face. On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%. On YouTube Faces DB it achieves 95.12%. Our system cuts the error rate in comparison to the best published result by 30% on both datasets. We also introduce the concept of harmonic embeddings, and a harmonic triplet loss, which describe different versions of face embeddings (produced by different networks) that are compatible with each other and allow for direct comparison.

4,560 citations
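
FaceNet's training signal is the triplet loss. As a concrete reference, here is a minimal NumPy sketch, assuming the embeddings are already L2-normalized and batched as (N, 128) arrays; the margin alpha = 0.2 follows the paper's setting.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Mean hinge loss over (anchor, positive, negative) embedding triplets:
    pushes d(a, p) + alpha below d(a, n), where d is squared L2 distance."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # distance to same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # distance to other identity
    return np.mean(np.maximum(d_pos - d_neg + alpha, 0.0))
```

The online triplet mining the abstract mentions amounts to selecting, within each batch, negatives for which this hinge term is active (hard or semi-hard triplets), since easy triplets contribute zero gradient.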


Proceedings Article
07 Dec 2015
TL;DR: The authors propose a location-aware attention mechanism for speech recognition: while a machine-translation-style attention model reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it only works on utterances roughly as long as those it was trained on, and the proposed mechanism makes the model robust to long inputs.
Abstract: Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis [1,2] and image caption generation [3]. We extend the attention-mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation in [2] reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER in single utterances and 20% in 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to the 17.6% level.

1,574 citations
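
A minimal sketch of the location-aware scoring step described above, assuming a single convolutional filter over the previous alignment (the paper uses a bank of filters) and illustrative weight shapes:

```python
import numpy as np

def location_aware_attention(s, H, prev_align, W, V, u, w, conv_kernel):
    """One decoding step. s: decoder state (d_s,); H: encoder states (T, d_h);
    prev_align: previous attention weights (T,). Convolving prev_align yields
    location features, so the scores know where the model attended before."""
    f = np.convolve(prev_align, conv_kernel, mode="same")   # location features (T,)
    e = np.tanh(W @ s + H @ V.T + np.outer(f, u)) @ w       # scores (T,)
    a = np.exp(e - e.max())
    return a / a.sum()                                      # new alignment (T,)
```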


Posted Content
TL;DR: The attention-mechanism is extended with features needed for speech recognition and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
Abstract: Recurrent sequence generators conditioned on input data through an attention mechanism have recently shown very good performance on a range of tasks including machine translation, handwriting synthesis and image caption generation. We extend the attention-mechanism with features needed for speech recognition. We show that while an adaptation of the model used for machine translation reaches a competitive 18.7% phoneme error rate (PER) on the TIMIT phoneme recognition task, it can only be applied to utterances which are roughly as long as the ones it was trained on. We offer a qualitative explanation of this failure and propose a novel and generic method of adding location-awareness to the attention mechanism to alleviate this issue. The new method yields a model that is robust to long inputs and achieves 18% PER in single utterances and 20% in 10-times longer (repeated) utterances. Finally, we propose a change to the attention mechanism that prevents it from concentrating too much on single frames, which further reduces PER to the 17.6% level.

1,447 citations


Book ChapterDOI
25 Aug 2015
TL;DR: It is demonstrated that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task.
Abstract: We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naively' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furthermore, simple, feature-level fusion based extensions to the framework are proposed to improve the integration with the ASR back-end. These yield a best result of 13.76% average word error rate, which is, to our knowledge, the best score to date.

603 citations


Proceedings ArticleDOI
06 Sep 2015
TL;DR: It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
Abstract: Learning an acoustic model directly from the raw waveform has been an active area of research. However, waveform-based models have not yet matched the performance of log-mel trained neural networks. We will show that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech. Specifically, we will show the benefit of the CLDNN, namely the time convolution layer in reducing temporal variations, the frequency convolution layer for preserving locality and reducing frequency variations, as well as the LSTM layers for temporal modeling. In addition, by stacking raw waveform features with log-mel features, we achieve a 3% relative reduction in word error rate.

506 citations


Journal ArticleDOI
TL;DR: The results show that the proposed designs accomplish significant reductions in power dissipation, delay and transistor count compared to an exact design; moreover, two of the proposed multiplier designs provide excellent capabilities for image multiplication with respect to average normalized error distance and peak signal-to-noise ratio.
Abstract: Inexact (or approximate) computing is an attractive paradigm for digital processing at nanometric scales. Inexact computing is particularly interesting for computer arithmetic designs. This paper deals with the analysis and design of two new approximate 4-2 compressors for utilization in a multiplier. These designs rely on different features of compression, such that imprecision in computation (as measured by the error rate and the so-called normalized error distance) can be traded off against circuit-based figures of merit of a design (number of transistors, delay and power consumption). Four different schemes for utilizing the proposed approximate compressors are proposed and analyzed for a Dadda multiplier. Extensive simulation results are provided and an application of the approximate multipliers to image processing is presented. The results show that the proposed designs accomplish significant reductions in power dissipation, delay and transistor count compared to an exact design; moreover, two of the proposed multiplier designs provide excellent capabilities for image multiplication with respect to average normalized error distance and peak signal-to-noise ratio (more than 50 dB for the considered image examples).

447 citations
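
To make the abstract's error metrics concrete, the sketch below compares an exact 4-2 compressor against a hypothetical approximate variant (illustrative only, not one of the paper's two designs) and computes the error rate and mean error distance over all 32 input patterns.

```python
from itertools import product

def exact_42(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor: x1+x2+x3+x4+cin == s + 2*(carry + cout)."""
    total = x1 + x2 + x3 + x4 + cin
    cout = 1 if total >= 4 else 0
    carry = (total >> 1) - cout
    return total & 1, carry, cout

def approx_42(x1, x2, x3, x4, cin):
    """Hypothetical simplification that ignores cin (for illustration)."""
    s = (x1 ^ x2) | (x3 ^ x4)
    carry = (x1 & x2) | (x3 & x4)
    cout = x1 & x3
    return s, carry, cout

errors, dist = 0, 0
for bits in product((0, 1), repeat=5):
    se, ce, oe = exact_42(*bits)
    sa, ca, oa = approx_42(*bits)
    diff = abs((se + 2 * (ce + oe)) - (sa + 2 * (ca + oa)))
    errors += diff != 0
    dist += diff

print(f"error rate = {errors / 32:.3f}, mean error distance = {dist / 32:.3f}")
```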


Proceedings ArticleDOI
01 Jan 2015
TL;DR: A solution is proposed which normalizes the word vectors on a hypersphere and constrains the linear transform to be an orthogonal transform, offering better performance on a word similarity task and an English-to-Spanish word translation task.
Abstract: Word embeddings have been found to be highly effective for translating words from one language to another by a simple linear transform. However, we found some inconsistency among the objective functions of the embedding and the transform learning, as well as the distance measurement. This paper proposes a solution which normalizes the word vectors on a hypersphere and constrains the linear transform to be an orthogonal transform. The experimental results confirm that the proposed solution offers better performance on a word similarity task and an English-to-Spanish word translation task.

436 citations
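
A sketch of the normalize-then-constrain idea: the paper learns the transform by constrained gradient descent, while the closed-form orthogonal Procrustes solution below is a standard stand-in. X and Y are assumed to be row-aligned source and target word vectors from a seed dictionary.

```python
import numpy as np

def learn_orthogonal_map(X, Y):
    """Normalize word vectors onto the unit hypersphere, then solve
    min_W ||X W - Y||_F subject to W^T W = I via SVD."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd(Xn.T @ Yn)
    return U @ Vt  # orthogonal, so cosine similarities are preserved
```

To translate a word, map its normalized source vector through W and take the nearest target vector by cosine distance, which is exactly the distance measure the normalization makes consistent.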


Journal ArticleDOI
TL;DR: This work presents a statistical recognition approach performing large-vocabulary continuous sign language recognition across different signers; it is the first thorough presentation of a system design on a large data set with a true focus on real-life applicability.

309 citations


Book
Oded Ghitza
03 May 2015
TL;DR: It is postulated that decoding time is governed by a cascade of neuronal oscillators, which guide template-matching operations at a hierarchy of temporal scales and is argued to be crucial for speech intelligibility.
Abstract: The premise of this study is that current models of speech perception, which are driven by acoustic features alone, are incomplete, and that the role of decoding time during memory access must be incorporated to account for the patterns of observed recognition phenomena. It is postulated that decoding time is governed by a cascade of neuronal oscillators, which guide template-matching operations at a hierarchy of temporal scales. Cascaded cortical oscillations in the theta, beta and gamma frequency bands are argued to be crucial for speech intelligibility. Intelligibility is high so long as these oscillations remain phase-locked to the auditory input rhythm. A model (Tempo) is presented which is capable of emulating recent psychophysical data on the intelligibility of speech sentences as a function of “packaging” rate (Ghitza and Greenberg, 2009). The data show that intelligibility of speech that is time-compressed by a factor of 3 (i.e., a high syllabic rate) is poor (above 50% word error rate), but is substantially restored when the information stream is re-packaged by the insertion of silence gaps in between successive compressed-signal intervals – a counterintuitive finding, difficult to explain using classical models of speech perception, but emerging naturally from the Tempo architecture.

302 citations


Journal ArticleDOI
TL;DR: The main contribution is the strategy of pooling several different feature models, together with feature selection, to optimize the fault diagnosis system; robust performance estimation techniques usually not encountered in the engineering context are also employed.
Abstract: Distinct feature extraction methods are simultaneously used to describe bearing faults. This approach produces a large number of heterogeneous features that augment discriminative information but, at the same time, create irrelevant and redundant information. A subsequent feature selection phase filters out the most discriminative features. The feature models are based on the complex envelope spectrum, statistical time- and frequency-domain parameters, and wavelet packet analysis. Feature selection is achieved by conventional search of the feature space by greedy methods. For the final fault diagnosis, the k-nearest neighbor classifier, feedforward net, and support vector machine are used. Performance criteria are the estimated error rate and the area under the receiver operating characteristic curve (AUC-ROC). Experimental results are shown for the Case Western Reserve University Bearing Data. The main contribution of this paper is the strategy to use several different feature models in a single pool, together with feature selection to optimize the fault diagnosis system. Moreover, robust performance estimation techniques usually not encountered in the context of engineering are employed.

291 citations
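
A compact scikit-learn sketch of the pipeline the abstract outlines: a pooled heterogeneous feature matrix, greedy feature selection wrapped around a k-NN classifier, and cross-validated error-rate estimation. The random X and y stand in for precomputed features and fault labels; the budget of 10 selected features is arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # stand-in for the pooled feature matrix
y = rng.integers(0, 4, size=200)  # stand-in fault labels

knn = KNeighborsClassifier(n_neighbors=5)
pipe = make_pipeline(
    SequentialFeatureSelector(knn, n_features_to_select=10),  # greedy search
    KNeighborsClassifier(n_neighbors=5),
)
acc = cross_val_score(pipe, X, y, cv=5)
print("estimated error rate:", 1.0 - acc.mean())
```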


Proceedings ArticleDOI
Xiangyu Zhang, Jianhua Zou, Xiang Ming, Kaiming He, Jian Sun
07 Jun 2015
TL;DR: This paper aims to accelerate the test-time computation of deep convolutional neural networks (CNNs), and takes the nonlinear units into account, subject to a low-rank constraint which helps to reduce the complexity of filters.
Abstract: This paper aims to accelerate the test-time computation of deep convolutional neural networks (CNNs). Unlike existing methods that are designed for approximating linear filters or linear responses, our method takes the nonlinear units into account. We minimize the reconstruction error of the nonlinear responses, subject to a low-rank constraint which helps to reduce the complexity of filters. We develop an effective solution to this constrained nonlinear optimization problem. An algorithm is also presented for reducing the accumulated error when multiple layers are approximated. A whole-model speedup ratio of 4× is demonstrated on a large network trained for ImageNet, while the top-5 error rate is only increased by 0.9%. Our accelerated model has a speed comparable to "AlexNet" [11], but is 4.7% more accurate.
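
As a reference point for the low-rank constraint, the sketch below factorizes one layer's weight matrix by truncated SVD, which minimizes reconstruction error of the linear responses; the paper's contribution is doing this for the nonlinear (post-activation) responses, which the sketch omits.

```python
import numpy as np

def low_rank_split(W, rank):
    """Approximate a (d_out, d_in) weight matrix as A @ B with inner size
    `rank`, cutting multiply-adds from d_out*d_in to rank*(d_out + d_in)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (d_out, rank): first thin layer
    B = Vt[:rank]                # (rank, d_in): second thin layer
    return A, B
```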

Proceedings ArticleDOI
01 Dec 2015
TL;DR: NTT's CHiME-3 system is described, which integrates advanced speech enhancement and recognition techniques and achieves a 3.45% development error rate and a 5.83% evaluation error rate.
Abstract: CHiME-3 is a research community challenge organised in 2015 to evaluate speech recognition systems for mobile multi-microphone devices used in noisy daily environments. This paper describes NTT's CHiME-3 system, which integrates advanced speech enhancement and recognition techniques. Newly developed techniques include the use of spectral masks for acoustic beam-steering vector estimation and acoustic modelling with deep convolutional neural networks based on the "network in network" concept. In addition to these improvements, our system has several key differences from the official baseline system. The differences include multi-microphone training, dereverberation, and cross adaptation of neural networks with different architectures. The impacts that these techniques have on recognition performance are investigated. By combining these advanced techniques, our system achieves a 3.45% development error rate and a 5.83% evaluation error rate. Three simpler systems are also developed to perform evaluations with constrained set-ups.

Posted Content
TL;DR: A neural machine translation model that views the input and output sentences as sequences of characters rather than words, which alleviates much of the challenges associated with preprocessing/tokenization of the source and target languages.
Abstract: We introduce a neural machine translation model that views the input and output sentences as sequences of characters rather than words. Since word-level information provides a crucial source of bias, our input model composes representations of character sequences into representations of words (as determined by whitespace boundaries), and then these are translated using a joint attention/translation model. In the target language, the translation is modeled as a sequence of word vectors, but each word is generated one character at a time, conditional on the previous character generations in each word. As the representation and generation of words is performed at the character level, our model is capable of interpreting and generating unseen word forms. A secondary benefit of this approach is that it alleviates much of the challenges associated with preprocessing/tokenization of the source and target languages. We show that our model can achieve translation results that are on par with conventional word-based models.

Proceedings Article
12 Aug 2015
TL;DR: It is shown that recurrent neural networks can identify functions in binaries with greater accuracy and efficiency than the state-of-the-art machine-learning-based method.
Abstract: Binary analysis facilitates many important applications like malware detection and automatically fixing vulnerable software. In this paper, we propose to apply artificial neural networks to solve important yet difficult problems in binary analysis. Specifically, we tackle the problem of function identification, a crucial first step in many binary analysis techniques. Although neural networks have undergone a renaissance in the past few years, achieving breakthrough results in multiple application domains such as visual object recognition, language modeling, and speech recognition, no researchers have yet attempted to apply these techniques to problems in binary analysis. Using a dataset from prior work, we show that recurrent neural networks can identify functions in binaries with greater accuracy and efficiency than the state-of-the-art machine-learning-based method. We can train the model an order of magnitude faster and evaluate it on binaries hundreds of times faster. Furthermore, it halves the error rate on six out of eight benchmarks, and performs comparably on the remaining two.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: In this article, a multimodal learning approach was proposed for fusing speech and visual modalities for audio-visual automatic speech recognition (AV-ASR) using uni-modal deep networks.
Abstract: In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean condition on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of 35.83%, demonstrating the tremendous value of the visual channel in phone classification even in audio with a high signal-to-noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of 34.03%.
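
A minimal sketch of the first fusion scheme: the final hidden layers of the separately trained audio and video networks are concatenated into a joint feature space, on top of which another network is built (the shapes and the ReLU choice here are illustrative).

```python
import numpy as np

def av_fusion_layer(h_audio, h_video, W, b):
    """Concatenate uni-modal hidden activations, then apply one layer of the
    joint network trained on the fused representation."""
    joint = np.concatenate([h_audio, h_video], axis=-1)
    return np.maximum(W @ joint + b, 0.0)
```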

Proceedings ArticleDOI
19 Apr 2015
TL;DR: This work proposes a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN) that has the flexibility of taking into consideration the full context of graphemes and transforms the problem from a series of grapheme-to-phoneme conversions to a word-to-pronunciation conversion.
Abstract: Grapheme-to-phoneme (G2P) models are key components in speech recognition and text-to-speech systems as they describe how words are pronounced. We propose a G2P model based on a Long Short-Term Memory (LSTM) recurrent neural network (RNN). In contrast to traditional joint-sequence based G2P approaches, LSTMs have the flexibility of taking into consideration the full context of graphemes and transform the problem from a series of grapheme-to-phoneme conversions to a word-to-pronunciation conversion. Training joint-sequence based G2P requires explicit grapheme-to-phoneme alignments, which are not straightforward since graphemes and phonemes don't correspond one-to-one. The LSTM based approach forgoes the need for such explicit alignments. We experiment with unidirectional LSTM (ULSTM) with different kinds of output delays and deep bidirectional LSTM (DBLSTM) with a connectionist temporal classification (CTC) layer. The DBLSTM-CTC model achieves a word error rate (WER) of 25.8% on the public CMU dataset for US English. Combining the DBLSTM-CTC model with a joint n-gram model results in a WER of 21.3%, which is a 9% relative improvement compared to the previous best WER of 23.4% from a hybrid system.

Proceedings ArticleDOI
06 Sep 2015
TL;DR: Several integration architectures are proposed and tested, including a pipeline architecture of L STM-based SE and ASR with sequence training, an alternating estimation architecture, and a multi-task hybrid LSTM network architecture.
Abstract: Long Short-Term Memory (LSTM) recurrent neural network has proven effective in modeling speech and has achieved outstanding performance in both speech enhancement (SE) and automatic speech recognition (ASR). To further improve the performance of noise-robust speech recognition, a combination of speech enhancement and recognition was shown to be promising in earlier work. This paper aims to explore options for consistent integration of SE and ASR using LSTM networks. Since SE and ASR have different objective criteria, it is not clear what kind of integration would finally lead to the best word error rate for noise-robust ASR tasks. In this work, several integration architectures are proposed and tested, including: (1) a pipeline architecture of LSTM-based SE and ASR with sequence training, (2) an alternating estimation architecture, and (3) a multi-task hybrid LSTM network architecture. The proposed models were evaluated on the 2nd CHiME speech separation and recognition challenge task, and show significant improvements relative to prior results.

Proceedings ArticleDOI
06 Sep 2015
TL;DR: This work proposes RNN and LSTM models for utterance intent classification and finds that RNNs work best when utterances are short, while LSTMs are best when utterances are longer.
Abstract: Utterance classification is a critical pre-processing step for many speech understanding and dialog systems. In multi-user settings, one needs to first identify if an utterance is even directed at the system, followed by another level of classification to determine the intent of the user’s input. In this work, we propose RNN and LSTM models for both these tasks. We show how both models outperform baselines based on ngram-based language models (LMs), feedforward neural network LMs, and boosting classifiers. To deal with the high rate of singleton and out-of-vocabulary words in the data, we also investigate a word input encoding based on character ngrams, and show how this representation beats the standard one-hot vector word encoding. Overall, these proposed approaches achieve over 30% relative reduction in equal error rate compared to the boosting classifier baseline on an ATIS utterance intent classification task, and over 3.9% absolute reduction in equal error rate compared to the maximum entropy LM baseline of 27.0% on an addressee detection task. We find that RNNs work best when utterances are short, while LSTMs are best when utterances are longer.
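
A sketch of the character n-gram word encoding the abstract credits for handling singleton and out-of-vocabulary words; the boundary marker and n = 3 are illustrative choices.

```python
def char_ngram_vector(word, vocab, n=3):
    """Bag-of-character-n-grams encoding: an OOV word still shares n-grams
    (and hence input features) with words seen in training.
    `vocab` maps n-gram -> index into the input vector."""
    padded = "#" + word + "#"  # mark word boundaries
    vec = [0.0] * len(vocab)
    for i in range(len(padded) - n + 1):
        idx = vocab.get(padded[i:i + n])
        if idx is not None:
            vec[idx] += 1.0
    return vec
```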

Journal ArticleDOI
TL;DR: This research work uses C5.0 as the base classifier, so the proposed system classifies the result set with high accuracy and low memory usage; the overfitting problem of the decision tree is solved using the reduced-error-pruning technique.
Abstract: Data mining is a knowledge discovery process that analyzes data and generates useful patterns from it. Classification is the technique that uses pre-classified examples to classify the required results. A decision tree is used to model the classification process: using the feature values of instances, decision trees classify those instances, and each node in a decision tree represents a feature of an instance to be classified. In this research work, ID3, C4.5, and C5.0 are compared with each other; among these classifiers, C5.0 gives the most accurate and efficient results. This work uses C5.0 as the base classifier, so the proposed system classifies the result set with high accuracy and low memory usage. The classification process generates fewer rules compared to other techniques, so the proposed system has low memory usage; the error rate is low, so accuracy on the result set is high, and a pruned tree is generated, so the system produces results faster than other techniques. The proposed system uses a C5.0 classifier that performs feature selection and reduced-error pruning, both of which are described in this paper. The feature-selection technique assumes that the data contain many redundant features; it removes features that provide no useful information in any context and selects the relevant features that are useful in model construction. Cross-validation gives a more reliable estimate of predictive performance. The overfitting problem of the decision tree is solved by the reduced-error-pruning technique. The proposed system achieves a 1 to 3% improvement in accuracy, a reduced error rate, and a decision tree constructed in less time.

Journal ArticleDOI
TL;DR: In this article, a nearly optimal algorithm for denoising a mixture of sinusoids from noisy equispaced samples was derived by viewing line spectral estimation as a sparse recovery problem with a continuous, infinite dictionary.
Abstract: This paper establishes a nearly optimal algorithm for denoising a mixture of sinusoids from noisy equispaced samples. We derive our algorithm by viewing line spectral estimation as a sparse recovery problem with a continuous, infinite dictionary. We show how to compute the estimator via semidefinite programming and provide guarantees on its mean-squared error rate. We derive a complementary minimax lower bound on this estimation rate, demonstrating that our approach nearly achieves the best possible estimation error. Furthermore, we establish bounds on how well our estimator localizes the frequencies in the signal, showing that the localization error tends to zero as the number of samples grows. We verify our theoretical results in an array of numerical experiments, demonstrating that the semidefinite programming approach outperforms three classical spectral estimation techniques.
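
For reference, the estimator described here follows the atomic-norm denoising template (a sketch; notation follows the line-spectral estimation literature, and the paper's exact formulation may differ). Given noisy samples y in C^n, one solves

$$\hat{x} = \arg\min_{x}\ \tfrac{1}{2}\|y - x\|_2^2 + \tau\,\|x\|_{\mathcal{A}}, \qquad \|x\|_{\mathcal{A}} = \inf_{u,\,t}\left\{ \frac{\operatorname{tr}\,T(u)}{2n} + \frac{t}{2} \;:\; \begin{bmatrix} T(u) & x \\ x^{*} & t \end{bmatrix} \succeq 0 \right\},$$

where the atoms are complex sinusoids, T(u) is the Hermitian Toeplitz matrix with first column u, and tau is a noise-dependent regularization weight; the semidefinite characterization on the right is what makes the estimator computable by semidefinite programming.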

Proceedings ArticleDOI
01 Jan 2015
TL;DR: An approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks.
Abstract: We present an approach to speech recognition that uses only a neural network to map acoustic input to characters, a character-level language model, and a beam search decoding procedure. This approach eliminates much of the complex infrastructure of modern speech recognition systems, making it possible to directly train a speech recognizer using errors generated by spoken language understanding tasks. The system naturally handles out of vocabulary words and spoken word fragments. We demonstrate our approach using the challenging Switchboard telephone conversation transcription task, achieving a word error rate competitive with existing baseline systems. To our knowledge, this is the first entirely neural-network-based system to achieve strong speech transcription results on a conversational speech task. We analyze qualitative differences between transcriptions produced by our lexicon-free approach and transcriptions produced by a standard speech recognition system. Finally, we evaluate the impact of large context neural network character language models as compared to standard n-gram models within our framework.
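
To illustrate the character-level mapping, here is the simplest decoding rule for such a network's per-frame character posteriors: a greedy collapse that merges repeats and drops blanks. The paper instead combines a character-level language model with beam search, which this sketch omits.

```python
def greedy_char_decode(frame_probs, alphabet, blank=0):
    """frame_probs: iterable of per-frame probability lists over the alphabet
    (index `blank` is the CTC blank). Returns the collapsed character string."""
    out, prev = [], blank
    for probs in frame_probs:
        idx = max(range(len(probs)), key=probs.__getitem__)
        if idx != blank and idx != prev:  # drop blanks and repeated symbols
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)
```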

Proceedings ArticleDOI
01 Dec 2015
TL;DR: A new beamformer front-end for Automatic Speech Recognition that leverages the power of a bi-directional Long Short-Term Memory network to robustly estimate soft masks for a subsequent beamforming step and achieves a 53% relative reduction of the word error rate over the best baseline enhancement system for the relevant test data set.
Abstract: We present a new beamformer front-end for Automatic Speech Recognition and apply it to the 3rd-CHiME Speech Separation and Recognition Challenge. Without any further modification of the back-end, we achieve a 53% relative reduction of the word error rate over the best baseline enhancement system for the relevant test data set. Our approach leverages the power of a bi-directional Long Short-Term Memory network to robustly estimate soft masks for a subsequent beamforming step. The utilized Generalized Eigenvalue beamforming operation with an optional Blind Analytic Normalization does not rely on a Direction-of-Arrival estimate and can cope with multi-path sound propagation, while at the same time only introducing very limited speech distortions. Our quite simple setup exploits the possibilities provided by simulated training data while still being able to generalize well to the fairly different real data. Finally, combining our front-end with data augmentation and another language model yields nearly a 64% reduction of the word error rate on the real data test set.
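
A per-frequency-bin sketch of the GEV beamforming step, assuming the speech and noise spatial covariance matrices have already been accumulated from the network's soft masks (the optional blind analytic normalization is omitted):

```python
import numpy as np
from scipy.linalg import eigh

def gev_weights(phi_speech, phi_noise):
    """Return the weight vector w maximizing the expected output SNR
    (w^H Phi_xx w) / (w^H Phi_nn w): the principal generalized eigenvector."""
    _, vecs = eigh(phi_speech, phi_noise)  # eigenvalues in ascending order
    return vecs[:, -1]                     # (num_mics,) beamformer weights

# per bin f and frame t: enhanced[f, t] = gev_weights(...).conj() @ obs[f, :, t]
```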

Journal ArticleDOI
TL;DR: It is argued that quantum error correction for the circuit causes the quantum bucket brigade architecture to lose its primary advantage of a small number of "active" gates, since all components have to be actively error corrected.
Abstract: We study the robustness of the bucket brigade quantum random access memory model introduced by Giovannetti et al (2008 Phys. Rev. Lett. 100 160501). Due to a result of Regev and Schiff (ICALP '08 733), we show that for a class of error models the error rate per gate in the bucket brigade quantum memory has to be of order o(2^(-n/2)) (where 2^n is the size of the memory) whenever the memory is used as an oracle for the quantum searching problem. We conjecture that this is the case for any realistic error model that will be encountered in practice, and that for algorithms with super-polynomially many oracle queries the error rate must be super-polynomially small, which further motivates the need for quantum error correction. By contrast, for algorithms such as matrix inversion (Harrow et al 2009 Phys. Rev. Lett. 103 150502) or quantum machine learning (Rebentrost et al 2014 Phys. Rev. Lett. 113 130503) that only require a polynomial number of queries, the error rate only needs to be polynomially small and quantum error correction may not be required. We introduce a circuit model for the quantum bucket brigade architecture and argue that quantum error correction for the circuit causes the quantum bucket brigade architecture to lose its primary advantage of a small number of 'active' gates, since all components have to be actively error corrected.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: By visualizing the localized filters learned in the convolutional layer, it is shown that edge detectors in varying directions can be automatically learned and it is established that the CNN structure combined with maxout units is the most effective model under small-sizing constraints for the purpose of deploying small-footprint models to devices.
Abstract: Despite the fact that several sites have reported the effectiveness of convolutional neural networks (CNNs) on some tasks, there is no deep analysis regarding why CNNs perform well and in which case we should see CNNs' advantage. In the light of this, this paper aims to provide some detailed analysis of CNNs. By visualizing the localized filters learned in the convolutional layer, we show that edge detectors in varying directions can be automatically learned. We then identify four domains we think CNNs can consistently provide advantages over fully-connected deep neural networks (DNNs): channel-mismatched training-test conditions, noise robustness, distant speech recognition, and low-footprint models. For distant speech recognition, a CNN trained on 1000 hours of Kinect distant speech data obtains relative 4% word error rate reduction (WERR) over a DNN of a similar size. To our knowledge, this is the largest corpus so far reported in the literature for CNNs to show its effectiveness. Lastly, we establish that the CNN structure combined with maxout units is the most effective model under small-sizing constraints for the purpose of deploying small-footprint models to devices. This setup gives relative 9.3% WERR from DNNs with sigmoid units.

Proceedings ArticleDOI
01 Dec 2015
TL;DR: Inspired by human spectrogram reading, this model first scans the frequency bands to generate a summary of the spectral information, and then uses the output layer activations as the input to a traditional time LSTM (T-LSTM).
Abstract: Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks (DNNs). A key aspect of these models is the use of time recurrence, combined with a gating architecture that ameliorates the vanishing gradient problem. Inspired by human spectrogram reading, in this paper we propose an extension to LSTMs that performs the recurrence in frequency as well as in time. This model first scans the frequency bands to generate a summary of the spectral information, and then uses the output layer activations as the input to a traditional time LSTM (T-LSTM). Evaluated on a Microsoft short message dictation task, the proposed model obtained a 3.6% relative word error rate reduction over the T-LSTM.

Journal ArticleDOI
TL;DR: This work investigates techniques based on deep neural networks for attacking the single-channel multi-talker speech recognition problem and demonstrates that the proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker.
Abstract: We investigate techniques based on deep neural networks (DNNs) for attacking the single-channel multi-talker speech recognition problem. Our proposed approach contains five key ingredients: a multi-style training strategy on artificially mixed speech data, a separate DNN to estimate senone posterior probabilities of the louder and softer speakers at each frame, a weighted finite-state transducer (WFST)-based two-talker decoder to jointly estimate and correlate the speaker and speech, a speaker switching penalty estimated from the energy pattern change in the mixed-speech, and a confidence based system combination strategy. Experiments on the 2006 speech separation and recognition challenge task demonstrate that our proposed DNN-based system has remarkable noise robustness to the interference of a competing speaker. The best setup of our proposed systems achieves an average word error rate (WER) of 18.8% across different SNRs and outperforms the state-of-the-art IBM superhuman system by 2.8% absolute with fewer assumptions.

Journal ArticleDOI
TL;DR: In this paper, the authors use reported average gate fidelity to determine an upper bound on the quantum-gate error rate, which is the appropriate metric for assessing progress towards fault-tolerant quantum computation.
Abstract: Remarkable experimental advances in quantum computing are exemplified by recent announcements of impressive average gate fidelities exceeding 99.9% for single-qubit gates and 99% for two-qubit gates. Although these high numbers engender optimism that fault-tolerant quantum computing is within reach, the connection of average gate fidelity with fault-tolerance requirements is not direct. Here we use reported average gate fidelity to determine an upper bound on the quantum-gate error rate, which is the appropriate metric for assessing progress towards fault-tolerant quantum computation, and we demonstrate that this bound is asymptotically tight for general noise. Although this bound is unlikely to be saturated by experimental noise, we demonstrate using explicit examples that the bound indicates a realistic deviation between the true error rate and the reported average fidelity. We introduce the Pauli distance as a measure of this deviation, and we show that knowledge of the Pauli distance enables tighter estimates of the error rate of quantum gates.

Proceedings ArticleDOI
01 Dec 2015
TL;DR: In this paper, a novel variant of dropout tailored for RNNs is proposed, rnnDrop, which applies dropout to the recurrent connections as well, in such a way that RNNs generalize well.
Abstract: Recently, recurrent neural networks (RNN) have achieved the state-of-the-art performance in several applications that deal with temporal data, e.g., speech recognition, handwriting recognition and machine translation. While the ability of handling long-term dependency in data is the key for the success of RNN, combating over-fitting in training the models is a critical issue for achieving the cutting-edge performance particularly when the depth and size of the network increase. To that end, there have been some attempts to apply the dropout, a popular regularization scheme for the feed-forward neural networks, to RNNs, but they do not perform as well as other regularization schemes such as weight noise injection. In this paper, we propose rnnDrop, a novel variant of the dropout tailored for RNNs. Unlike the existing methods where dropout is applied only to the non-recurrent connections, the proposed method applies dropout to the recurrent connections as well, in such a way that RNNs generalize well. Our experiments show that rnnDrop is a better regularization method than others, including weight noise injection. Namely, when deep bidirectional long short-term memory (LSTM) RNNs were trained with rnnDrop as acoustic models for phoneme and speech recognition, they significantly outperformed the previous state of the art; we achieved a phoneme error rate of 16.29% on the TIMIT core test set for phoneme recognition and a word error rate of 5.53% on the Wall Street Journal (WSJ) dataset, dev93, for speech recognition, which are the best reported results on both datasets.
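
A minimal sketch of the rnnDrop idea on a vanilla tanh RNN (the paper applies it to LSTM cell states): one dropout mask is sampled per sequence and reused at every time step on the recurrent path only.

```python
import numpy as np

def rnn_forward_rnndrop(xs, W_in, W_rec, b, p=0.25, rng=None):
    """Forward pass with dropout on the recurrent connections only; the mask
    is fixed for the whole sequence (inverted-dropout scaling)."""
    rng = rng or np.random.default_rng()
    h = np.zeros(W_rec.shape[0])
    mask = (rng.random(h.shape) >= p) / (1.0 - p)  # one mask per sequence
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ (mask * h) + b)
        states.append(h)
    return states
```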

Proceedings ArticleDOI
18 Apr 2015
TL;DR: It is demonstrated that VelociTap has a significantly lower error rate than Google's keyboard while retaining the same entry rate, that intermediate visual feedback does not significantly affect entry or error rates, and that using the space key yields the most accurate results.
Abstract: We present VelociTap: a state-of-the-art touchscreen keyboard decoder that supports a sentence-based text entry approach. VelociTap enables users to seamlessly choose from three word-delimiter actions: pushing a space key, swiping to the right, or simply omitting the space key and letting the decoder infer spaces automatically. We demonstrate that VelociTap has a significantly lower error rate than Google's keyboard while retaining the same entry rate. We show that intermediate visual feedback does not significantly affect entry or error rates, and we find that using the space key yields the most accurate results. We also demonstrate that enabling flexible word-delimiter options does not incur an error rate penalty. Finally, we investigate how small we can make the keyboard when using VelociTap. We show that novice users can reach a mean entry rate of 41 wpm on a 40 mm wide smartwatch-sized keyboard at a 3% character error rate.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: Two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments are proposed, each using a parameterization of the reverberant environment extracted from the observed signal to train a room-aware DNN.
Abstract: In this paper, we propose two approaches to improve deep neural network (DNN) acoustic models for speech recognition in reverberant environments. Both methods utilize auxiliary information in training the DNN but differ in the type of information and the manner in which it is used. The first method uses parallel training data for multi-task learning, in which the network is trained to perform both a primary senone classification task and a secondary feature enhancement task using a shared representation. The second method uses a parameterization of the reverberant environment extracted from the observed signal to train a room-aware DNN. Experiments were performed on the single microphone task of the REVERB Challenge corpus. The proposed approach obtained a word error rate of 7.8% on the SimData test set, which is lower than all reported systems using the same training data and evaluation conditions, and 27.5% on the mismatched RealData test set, which is lower than all but two systems.