Showing papers on "Word error rate published in 2013"

PDF

Open Access

Proceedings Article•DOI•

Hybrid speech recognition with Deep Bidirectional LSTM

[...]

Alex Graves¹, Navdeep Jaitly¹, Abdelrahman Mohamed¹•Institutions (1)

01 Dec 2013

TL;DR: The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates, and the improvement in word error rate over the deep network is modest, despite a great increase in framelevel accuracy.

...read moreread less

Abstract: Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in framelevel accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.

...read moreread less

1,619 citations

Proceedings Article•DOI•

PhotoOCR: Reading Text in Uncontrolled Conditions

[...]

Alessandro Bissacco¹, Mark Cummins¹, Yuval Netzer¹, Hartmut Neven¹•Institutions (1)

Google¹

01 Dec 2013

TL;DR: This work describes Photo OCR, a system for text extraction from images that is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions.

...read moreread less

Abstract: We describe Photo OCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification, we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern data center-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency, mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.

...read moreread less

499 citations

Vocal Tract Length Perturbation (VTLP) improves speech recognition

[...]

Navdeep Jaitly¹, E. Hinton¹•Institutions (1)

University of Toronto¹

01 Jan 2013

TL;DR: Improvements in speech recognition are suggested without increasing the number of training epochs, and it is suggested that data transformations should be an important component of training neural networks for speech, especially for data limited projects.

...read moreread less

Abstract: Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of the state of the art methods for object recognition using neural networks. However this approach has (to our knowledge) not been exploited successfully in speech recognition (with or without neural networks). In this paper we lay the foundation for this approach, and show one way of augmenting speech datasets by transforming spectrograms, using a random linear warping along the frequency dimension. In practice this can be achieved by using warping techniques that are used for vocal tract length normalization (VTLN) - with the difference that a warp factor is generated randomly each time, during training, rather than tting a single warp factor to each training and test speaker (or utterance). At test time, a prediction is made by averaging the predictions over multiple warp factors. When this technique is applied to TIMIT using Deep Neural Networks (DNN) of dierent depths, the Phone Error Rate (PER) improved by an average of 0.65% on the test set. For a Convolutional neural network (CNN) with convolutional layer in the bottom, a gain of 1.0% was observed. These improvements were achieved without increasing the number of training epochs, and suggest that data transformations should be an important component of training neural networks for speech, especially for data limited projects.

...read moreread less

351 citations

Proceedings Article•DOI•

Extracting deep bottleneck features using stacked auto-encoders

[...]

Jonas Gehring¹, Yajie Miao², Florian Metze², Alex Waibel¹•Institutions (2)

Karlsruhe Institute of Technology¹, Carnegie Mellon University²

26 May 2013

TL;DR: It is found that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available.

...read moreread less

Abstract: In this work, a novel training scheme for generating bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the bottleneck layer and an additional layer are added and the whole network is fine-tuned to predict target phoneme states. We perform experiments on a Cantonese conversational telephone speech corpus and find that increasing the number of auto-encoders in the network produces more useful features, but requires pre-training, especially when little training data is available. Using more unlabeled data for pre-training only yields additional gains. Evaluations on larger datasets and on different system setups demonstrate the general applicability of our approach. In terms of word error rate, relative improvements of 9.2% (Cantonese, ML training), 9.3% (Tagalog, BMMI-SAT training), 12% (Tagalog, confusion network combinations with MFCCs), and 8.7% (Switchboard) are achieved.

...read moreread less

282 citations

Proceedings Article•DOI•

Multi-task learning in deep neural networks for improved phoneme recognition

[...]

Michael L. Seltzer¹, Jasha Droppo¹•Institutions (1)

Microsoft¹

26 May 2013

TL;DR: It is demonstrated that, even on a strong baseline, multi-task learning can provide a significant decrease in error rate, and this paper explores three natural choices for the secondary task: the phone label, the phone context, and the state context.

...read moreread less

Abstract: In this paper we demonstrate how to improve the performance of deep neural network (DNN) acoustic models using multi-task learning. In multi-task learning, the network is trained to perform both the primary classification task and one or more secondary tasks using a shared representation. The additional model parameters associated with the secondary tasks represent a very small increase in the number of trained parameters, and can be discarded at runtime. In this paper, we explore three natural choices for the secondary task: the phone label, the phone context, and the state context. We demonstrate that, even on a strong baseline, multi-task learning can provide a significant decrease in error rate. Using phone context, the phonetic error rate (PER) on TIMIT is reduced from 21.63% to 20.25% on the core test set, and surpassing the best performance in the literature for a DNN that uses a standard feed-forward network architecture.

...read moreread less

256 citations

Proceedings Article•DOI•

Aggregating crowdsourced binary ratings

[...]

Nilesh Dalvi¹, Anirban Dasgupta², Ravi Kumar³, Vibhor Rastogi³•Institutions (3)

Facebook¹, Yahoo!², Google³

13 May 2013

TL;DR: This paper obtains bounds on the error rate of the algorithm and shows it is governed by the expansion of the graph, and demonstrates, using several synthetic and real datasets, that the algorithm outperforms the state of the art.

...read moreread less

Abstract: In this paper we analyze a crowdsourcing system consisting of a set of users and a set of binary choice questions. Each user has an unknown, fixed, reliability that determines the user's error rate in answering questions. The problem is to determine the truth values of the questions solely based on the user answers. Although this problem has been studied extensively, theoretical error bounds have been shown only for restricted settings: when the graph between users and questions is either random or complete. In this paper we consider a general setting of the problem where the user--question graph can be arbitrary. We obtain bounds on the error rate of our algorithm and show it is governed by the expansion of the graph. We demonstrate, using several synthetic and real datasets, that our algorithm outperforms the state of the art.

...read moreread less

230 citations

Proceedings Article•DOI•

Cross-Entropy vs. Squared Error Training: a Theoretical and Experimental Comparison

[...]

Pavel Golik¹, Patrick Doetsch¹, Hermann Ney¹•Institutions (1)

RWTH Aachen University¹

25 Aug 2013

TL;DR: It is found that with randomly initialized weights, the squared error based ANN does not converge to a good local optimum, and with a good initialization by pre-training, the word error rate of the best CE trained system could be reduced.

...read moreread less

Abstract: In this paper we investigate the error criteria that are optimized during the training of artificial neural networks (ANN). We compare the bounds of the squared error (SE) and the crossentropy (CE) criteria being the most popular choices in stateof-the art implementations. The evaluation is performed on automatic speech recognition (ASR) and handwriting recognition (HWR) tasks using a hybrid HMM-ANN model. We find that with randomly initialized weights, the squared error based ANN does not converge to a good local optimum. However, with a good initialization by pre-training, the word error rate of our best CE trained system could be reduced from 30.9% to 30.5% on the ASR, and from 22.7% to 21.9% on the HWR task by performing a few additional “fine-tuning” iterations with the SE criterion.

...read moreread less

226 citations

Journal Article•DOI•

Topological quantum computing with a very noisy network and local error rates approaching one percent

[...]

Naomi H. Nickerson¹, Ying Li², Simon C. Benjamin², Simon C. Benjamin³•Institutions (3)

Imperial College London¹, National University of Singapore², University of Oxford³

23 Apr 2013-Nature Communications

TL;DR: This work describes a method by which even error-prone cells can perform purification: groups of cells generate shared resource states, which then enable stabilization of topologically encoded data.

...read moreread less

Abstract: A scalable quantum computer could be built by networking together many simple processor cells, thus avoiding the need to create a single complex structure. The difficulty is that realistic quantum links are very error prone. A solution is for cells to repeatedly communicate with each other and so purify any imperfections; however prior studies suggest that the cells themselves must then have prohibitively low internal error rates. Here we describe a method by which even error-prone cells can perform purification: groups of cells generate shared resource states, which then enable stabilization of topologically encoded data. Given a realistically noisy network (≥10% error rate) we find that our protocol can succeed provided that intra-cell error rates for initialisation, state manipulation and measurement are below 0.82%. This level of fidelity is already achievable in several laboratory systems.

...read moreread less

203 citations

Proceedings Article•DOI•

Recurrent neural networks for voice activity detection

[...]

Thad Hughes¹, Keir Banks Mierle¹•Institutions (1)

Google¹

26 May 2013

TL;DR: This work presents a novel recurrent neural network model for voice activity detection, in which nodes compute quadratic polynomials and outperforms a much larger baseline system composed of Gaussian mixture models and a hand-tuned state machine for temporal smoothing.

...read moreread less

Abstract: We present a novel recurrent neural network (RNN) model for voice activity detection. Our multi-layer RNN model, in which nodes compute quadratic polynomials, outperforms a much larger baseline system composed of Gaussian mixture models (GMMs) and a hand-tuned state machine (SM) for temporal smoothing. All parameters of our RNN model are optimized together, so that it properly weights its preference for temporal continuity against the acoustic features in each frame. Our RNN uses one tenth the parameters and outperforms the GMM+SM baseline system by 26% reduction in false alarms, reducing overall speech recognition computation time by 17% while reducing word error rate by 1% relative.

...read moreread less

193 citations

Proceedings Article•DOI•

A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion

[...]

Li Deng¹, Ossama Abdel-Hamid², Dong Yu¹•Institutions (2)

Microsoft¹, York University²

26 May 2013

TL;DR: A novel deep convolutional neural network architecture is developed, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance.

...read moreread less

Abstract: We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, lowest on this standard task reported in the literature with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.

...read moreread less

185 citations

Proceedings Article•DOI•

Audio-visual deep learning for noise robust speech recognition

[...]

Jing Huang¹, Brian Kingsbury¹•Institutions (1)

IBM¹

26 May 2013

TL;DR: This work uses DBNs for audio-visual speech recognition; in particular, it uses deep learning from audio and visual features for noise robust speech recognition and test two methods for using DBN’s in a multimodal setting.

...read moreread less

Abstract: Deep belief networks (DBN) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we use DBNs for audio-visual speech recognition; in particular, we use deep learning from audio and visual features for noise robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative over a baseline multi-stream audio-visual GMM/HMM system.

...read moreread less

Proceedings Article•DOI•

Learning filter banks within a deep neural network framework

[...]

Tara N. Sainath¹, Brian Kingsbury¹, Abdelrahman Mohamed², Bhuvana Ramabhadran¹•Institutions (2)

IBM¹, University of Toronto²

01 Dec 2013

TL;DR: This paper replaces the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network, and shows that on a 50-hour English Broadcast News task, it can achieve a 5% relative improvement in word error rate using thefilter bank learning approach.

...read moreread less

Abstract: Mel-filter banks are commonly used in speech recognition, as they are motivated from theory related to speech production and perception. While features derived from mel-filter banks are quite popular, we argue that this filter bank is not really an appropriate choice as it is not learned for the objective at hand, i.e. speech recognition. In this paper, we explore replacing the filter bank with a filter bank layer that is learned jointly with the rest of a deep neural network. Thus, the filter bank is learned to minimize cross-entropy, which is more closely tied to the speech recognition objective. On a 50-hour English Broadcast News task, we show that we can achieve a 5% relative improvement in word error rate (WER) using the filter bank learning approach, compared to having a fixed set of filters.

...read moreread less

Journal Article•DOI•

The Move-Split-Merge Metric for Time Series

[...]

Alexandra Stefan¹, Vassilis Athitsos¹, Gautam Das¹•Institutions (1)

University of Texas at Arlington¹

01 Jun 2013-IEEE Transactions on Knowledge and Data Engineering

TL;DR: A novel metric for time series, called Move-Split-Merge (MSM), is proposed, which uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series.

...read moreread less

Abstract: A novel metric for time series, called Move-Split-Merge (MSM), is proposed. This metric uses as building blocks three fundamental operations: Move, Split, and Merge, which can be applied in sequence to transform any time series into any other time series. A Move operation changes the value of a single element, a Split operation converts a single element into two consecutive elements, and a Merge operation merges two consecutive elements into one. Each operation has an associated cost, and the MSM distance between two time series is defined to be the cost of the cheapest sequence of operations that transforms the first time series into the second one. An efficient, quadratic-time algorithm is provided for computing the MSM distance. MSM has the desirable properties of being metric, in contrast to the Dynamic Time Warping (DTW) distance, and invariant to the choice of origin, in contrast to the Edit Distance with Real Penalty (ERP) metric. At the same time, experiments with public time series data sets demonstrate that MSM is a meaningful distance measure, that oftentimes leads to lower nearest neighbor classification error rate compared to DTW and ERP.

...read moreread less

Proceedings Article•DOI•

Synthetic speech detection using temporal modulation feature

[...]

Zhizheng Wu¹, Xiong Xiao¹, Eng Siong Chng¹, Haizhou Li¹•Institutions (1)

Nanyang Technological University¹

26 May 2013

TL;DR: From the synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features, and the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate.

...read moreread less

Abstract: Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attack and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the facts that current analysis-synthesis techniques operate on frame level and make the frame-by-frame independence assumption, we proposed to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation features derived from magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of speech signal. From our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features.

...read moreread less

Proceedings Article•DOI•

A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR.

[...]

Zhi-Jie Yan¹, Qiang Huo¹, Jian Xu¹•Institutions (1)

Microsoft¹

25 Aug 2013

TL;DR: A new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR).

...read moreread less

Abstract: We present a new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR). The DNN-based feature extractor is trained from a subset of training data to mitigate the scalability issue of DNN training, while GMM-HMMs are trained by using state-of-the-art scalable training methods and tools to leverage the whole training set. In a benchmark evaluation, we used 309-hour Switchboard-I (SWB) training data to train a DNN first, which achieves a word error rate (WER) of 15.4% on NIST-2000 Hub5 evaluation set by a traditional DNN-HMM based approach. When the same DNN is used as a feature extractor and 2,000-hour “SWB+Fisher” training data is used to train the GMM-HMMs, our DNN-GMM-HMM approach achieves a WER of 13.8%. If per-conversation-side based unsupervised adaptation is performed, a WER of 13.1% can be achieved.

...read moreread less

Proceedings Article•DOI•

Hybrid acoustic models for distant and multichannel large vocabulary speech recognition

[...]

Pawel Swietojanski¹, Arnab Ghoshal¹, Steve Renals¹•Institutions (1)

University of Edinburgh¹

01 Dec 2013

TL;DR: The accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.

...read moreread less

Abstract: We investigate the application of deep neural network (DNN)-hidden Markov model (HMM) hybrid acoustic models for far-field speech recognition of meetings recorded using microphone arrays We show that the hybrid models achieve significantly better accuracy than conventional systems based on Gaussian mixture models (GMMs) We observe up to 8% absolute word error rate (WER) reduction from a discriminatively trained GMM baseline when using a single distant microphone, and between 4-6% absolute WER reduction when using beamforming on various combinations of array channels By training the networks on audio from multiple channels, we find the networks can recover significant part of accuracy difference between the single distant microphone and beamformed configurations Finally, we show that the accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones

...read moreread less

Book•DOI•

Spoken Word Recognition

[...]

Delphine Dahan¹, James S. Magnuson², James S. Magnuson³•Institutions (3)

University of Pennsylvania¹, University of Connecticut², Haskins Laboratories³

11 Mar 2013

TL;DR: Spoken word recognition is a distinct subsystem providing the interface among low-level perception and cognitive processes of retrieval, parsing, and interpretation, and sentence processing.

...read moreread less

Abstract: Publisher Summary Spoken word recognition is a distinct subsystem providing the interface among low-level perception and cognitive processes of retrieval, parsing, and interpretation. The narrowest conception of the process of recognizing a spoken word is that it starts from a string of phonemes, establishes how these phonemes should be grouped to form words, and passes these words onto the next level of processing. Some theories, though, take a broader view and blur the distinctions among speech perception, spoken word recognition, and sentence processing. The broader view of spoken word recognition has empirical and theoretical motivations. One consideration is that by assuming that the input to spoken word recognition is a string of abstract, phonemic category labels, one implicitly assumes that the nonphonemic variability carried on the speech signal is not relevant for spoken word recognition and higher levels of processing. However, if this variability and detail is not random but is lawfully related to linguistic categories, the simplifying assumption that the output of speech perception is a string of phonemes may actually be a complicating assumption.

...read moreread less

Posted Content•

Near Minimax Line Spectral Estimation

[...]

Gongguo Tang¹, Badri Narayan Bhaskar¹, Benjamin Recht¹•Institutions (1)

University of Wisconsin-Madison¹

18 Mar 2013-arXiv: Information Theory

TL;DR: This paper establishes that using atomic norm soft thresholding (AST), it can achieve near minimax optimal prediction error rate for line spectral estimation, in spite of having a highly coherent dictionary corresponding to arbitrarily close frequencies.

...read moreread less

Abstract: This paper establishes a nearly optimal algorithm for estimating the frequencies and amplitudes of a mixture of sinusoids from noisy equispaced samples. We derive our algorithm by viewing line spectral estimation as a sparse recovery problem with a continuous, infinite dictionary. We show how to compute the estimator via semidefinite programming and provide guarantees on its mean-square error rate. We derive a complementary minimax lower bound on this estimation rate, demonstrating that our approach nearly achieves the best possible estimation error. Furthermore, we establish bounds on how well our estimator localizes the frequencies in the signal, showing that the localization error tends to zero as the number of samples grows. We verify our theoretical results in an array of numerical experiments, demonstrating that the semidefinite programming approach outperforms two classical spectral estimation techniques.

...read moreread less

Journal Article•DOI•

Fusion of feature sets and classifiers for facial expression recognition

[...]

Thiago H. H. Zavaschi¹, Alceu S. Britto¹, Luiz S. Oliveira², Alessandro L. Koerich¹•Institutions (2)

Pontifícia Universidade Católica do Paraná¹, Federal University of Paraná²

01 Feb 2013-Expert Systems With Applications

TL;DR: A novel method for facial expression recognition that employs the combination of two different feature sets in an ensemble approach, which improves the recognition rates between 5% and 10% over conventional approaches that employ single feature sets and single classifiers is presented.

...read moreread less

Abstract: This paper presents a novel method for facial expression recognition that employs the combination of two different feature sets in an ensemble approach. A pool of base support vector machine classifiers is created using Gabor filters and Local Binary Patterns. Then a multi-objective genetic algorithm is used to search for the best ensemble using as objective functions the minimization of both the error rate and the size of the ensemble. Experimental results on JAFFE and Cohn-Kanade databases have shown the efficiency of the proposed strategy in finding powerful ensembles, which improves the recognition rates between 5% and 10% over conventional approaches that employ single feature sets and single classifiers.

...read moreread less

Journal Article•DOI•

i-vector representation based on bottleneck features for language identification

[...]

Yan Song¹, Bing Jiang¹, Yebo Bao¹, Si Wei¹, Li-Rong Dai¹ - Show less +1 more•Institutions (1)

University of Science and Technology of China¹

21 Nov 2013-Electronics Letters

TL;DR: An i-vector representation based on bottleneck (BN) features is presented for language identification (LID) and the resulting performance of LID has been significantly improved with the proposed BN feature based i- vector representation.

...read moreread less

Abstract: An i-vector representation based on bottleneck (BN) features is presented for language identification (LID). In the proposed system, the BN features are extracted from a deep neural network, which can effectively mine the contextual information embedded in speech frames. The i-vector representation of each utterance is then obtained by applying a total variability approach on the BN features. The resulting performance of LID has been significantly improved with the proposed BN feature based i-vector representation. Compared with the stateof- the-art techniques, the equal error rate is relatively reduced by about 40% on the National Institute of Standards and Technology (NIST) 2009 evaluation sets.

...read moreread less

Journal Article•DOI•

Correcting phoneme recognition errors in learning word pronunciation through speech interaction

[...]

Xiang Zuo¹, Taisuke Sumii¹, Naoto Iwahashi², Mikio Nakano³, Kotaro Funakoshi³, Natsuki Oka¹ - Show less +2 more•Institutions (3)

Kyoto Institute of Technology¹, National Institute of Information and Communications Technology², Honda³

01 Jan 2013-Speech Communication

TL;DR: Experimental results show that the proposed IPU method reduces the error rate by a factor of three over a previously proposed maximum-likelihood-based method.

...read moreread less

Proceedings Article•DOI•

Recurrent neural network language modeling for code switching conversational speech

[...]

Heike Adel, Ngoc Thang Vu, Franziska Kraus, Tim Schlippe, Haizhou Li¹, Tanja Schultz - Show less +2 more•Institutions (1)

Nanyang Technological University¹

26 May 2013

TL;DR: This paper proposes a structure of recurrent neural networks to predict code-switches based on textual features with focus on Part-of-Speech tags and trigger words and extends the networks by adding POS information to the input layer and by factorizing the output layer into languages.

...read moreread less

Abstract: Code-switching is a very common phenomenon in multilingual communities. In this paper, we investigate language modeling for conversational Mandarin-English code-switching (CS) speech recognition. First, we investigate the prediction of code switches based on textual features with focus on Part-of-Speech (POS) tags and trigger words. Second, we propose a structure of recurrent neural networks to predict code-switches. We extend the networks by adding POS information to the input layer and by factorizing the output layer into languages. The resulting models are applied to our task of code-switching language modeling. The final performance shows 10.8% relative improvement in perplexity on the SEAME development set which transforms into a 2% relative improvement in terms of Mixed Error Rate and a relative improvement of 16.9% in perplexity on the evaluation set which leads to a 2.7% relative improvement of MER.

...read moreread less

Journal Article•DOI•

Acoustic Modeling With Hierarchical Reservoirs

[...]

Fabian Triefenbach¹, Azarakhsh Jalalvand¹, Kris Demuynck¹, Jean-Pierre Martens¹•Institutions (1)

Ghent University¹

01 Nov 2013-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: It is shown here that simple reservoir-based AMs achieve reasonable phone recognition and that deep hierarchical and bi-directional reservoir architectures lead to a very competitive Phone Error Rate (PER) of 23.1% on the well-known TIMIT task.

...read moreread less

Abstract: Accurate acoustic modeling is an essential requirement of a state-of-the-art continuous speech recognizer. The Acoustic Model (AM) describes the relation between the observed speech signal and the non-observable sequence of phonetic units uttered by the speaker. Nowadays, most recognizers use Hidden Markov Models (HMMs) in combination with Gaussian Mixture Models (GMMs) to model the acoustics, but neural-based architectures are on the rise again. In this work, the recently introduced Reservoir Computing (RC) paradigm is used for acoustic modeling. A reservoir is a fixed - and thus non-trained - Recurrent Neural Network (RNN) that is combined with a trained linear model. This approach combines the ability of an RNN to model the recent past of the input sequence with a simple and reliable training procedure. It is shown here that simple reservoir-based AMs achieve reasonable phone recognition and that deep hierarchical and bi-directional reservoir architectures lead to a very competitive Phone Error Rate (PER) of 23.1% on the well-known TIMIT task.

...read moreread less

Proceedings Article•DOI•

Improving Low-Resource CD-DNN-HMM Using Dropout and Multilingual DNN Training

[...]

Yajie Miao¹, Florian Metze¹•Institutions (1)

Carnegie Mellon University¹

25 Aug 2013

TL;DR: Two strategies to improve the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) in low-resource speech recognition are investigated, exploiting dropout which prevents overfitting in DNN finetuning and improves model robustness under data sparseness.

...read moreread less

Abstract: We investigate two strategies to improve the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) in low-resource speech recognition. Although outperforming the conventional Gaussian mixture model (GMM) HMM on various tasks, CD-DNN-HMM acoustic modeling becomes challenging with limited transcribed speech, e.g., less than 10 hours. To resolve this issue, we firstly exploit dropout which prevents overfitting in DNN finetuning and improves model robustness under data sparseness. Then, the effectiveness of multilingual DNN training is evaluated when additional auxiliary languages are available. The hidden layer parameters of the target language are shared and learned over multiple languages. Experiments show that both strategies boost the recognition performance significantly. Combining them results in further reduction in word error rate, achieving 11.6% and 6.2% relative improvement on two limited data conditions.

...read moreread less

Journal Article•DOI•

Effects of artificially intelligent tools on pattern recognition

[...]

Tanzila Saba¹, Amjad Rehman¹•Institutions (1)

Universiti Teknologi Malaysia¹

01 Jan 2013-International Journal of Machine Learning and Cybernetics

TL;DR: A comparative study between artificially trained and heuristics rule based techniques employed for pattern recognition in the state of the art focused on script pattern recognition is provided.

...read moreread less

Abstract: Pattern recognition is classification process that attempts to assign each input value to one of a given set of classes The process of pattern recognition in the state of art has been achieved either by training of artificially intelligent tools or using heuristic rule based approaches The objective of this paper is to provide a comparative study between artificially trained and heuristics rule based techniques employed for pattern recognition in the state of the art focused on script pattern recognition It is observed that mainly there are two categories of script pattern recognition techniques First category involves assistance of artificial intelligent learning and next, is based on heuristic-rules for cursive script pattern segmentation/recognition Accordingly, a detailed critical study is performed that focuses on size of training/testing data and implication of artificial learning on script pattern recognition accuracy Moreover, the techniques are described in details that are employed to identify character patterns Finally, performances of different techniques on benchmark database are compared regarding pattern recognition accuracy, error rate, single or multiple classifiers being employed Problems that still persist are also highlighted and possible directions are set

...read moreread less

Journal Article•DOI•

Performance Indicator for MIMO MMSE Receivers in the Presence of Channel Estimation Error

[...]

Eren Eraslan¹, Babak Daneshrad¹, Chung-Yu Lou¹•Institutions (1)

University of California, Los Angeles¹

01 Feb 2013-IEEE Wireless Communications Letters

TL;DR: The derivation of post-processing SNR for minimum mean-squared error (MMSE) receivers with imperfect channel estimates is presented and it is shown that it is an accurate indicator of the error rate performance of MIMO systems in the presence of channel estimation error.

...read moreread less

Abstract: We present the derivation of post-processing SNR for minimum mean-squared error (MMSE) receivers with imperfect channel estimates, and show that it is an accurate indicator of the error rate performance of MIMO systems in the presence of channel estimation error. Simulation results show the tightness of the analysis.

...read moreread less

Journal Article•

Beyond Fano's inequality: bounds on the optimal F-score, BER, and cost-sensitive risk and their implications

[...]

Ming-Jie Zhao¹, Narayanan Unny Edakunni¹, Adam Craig Pocock¹, Gavin Brown¹•Institutions (1)

University of Manchester¹

01 Jan 2013-Journal of Machine Learning Research

TL;DR: This work focuses on the two-class problem and uses a general definition of conditional entropy (including Shannon's as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano's result.

...read moreread less

Abstract: Fano's inequality lower bounds the probability of transmission error through a communication channel Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle In modern machine learning, we are often interested in more than just the error rate In medical diagnosis, different errors incur different cost; hence, the overall risk is cost-sensitive Two other popular criteria are balanced error rate (BER) and F-score In this work, we focus on the two-class problem and use a general definition of conditional entropy (including Shannon's as a special case) to derive upper/lower bounds on the optimal F-score, BER and cost-sensitive risk, extending Fano's result As a consequence, we show that Infomax is not suitable for optimizing F-score or cost-sensitive risk, in that it can potentially lead to low F-score and high risk For cost-sensitive risk, we propose a new conditional entropy formulation which avoids this inconsistency In addition, we consider the common practice of using a threshold on the posterior probability to tune performance of a classifier As is widely known, a threshold of 05, where the posteriors cross, minimizes error rate--we derive similar optimal thresholds for F-score and BER

...read moreread less

Journal Article•DOI•

Subcarrier intensity modulated optical wireless communications in atmospheric turbulence with pointing errors

[...]

Xuegui Song¹, Fan Yang¹, Julian Cheng¹•Institutions (1)

University of British Columbia¹

01 Apr 2013-IEEE\/OSA Journal of Optical Communications and Networking

TL;DR: The findings suggest that pointing error compensation is necessary as pointing errors can severely degrade the error rate and outage probability performance of an uncompensated system.

...read moreread less

Abstract: An optical wireless communication system using subcarrier intensity modulation is analyzed for gamma-gamma turbulence channels with pointing errors We study the error rate performance of such a system employing M-ary phase-shift keying, differential phase-shift keying, and noncoherent frequency-shift keying Highly accurate error rate approximations are derived using a series expansion approach Furthermore, outage probability expressions are obtained for such a system Asymptotic error rate and outage probability analyses are also presented Our asymptotic analysis reveals some unique transmission characteristics of such a system Our findings suggest that pointing error compensation is necessary as pointing errors can severely degrade the error rate and outage probability performance of an uncompensated system

...read moreread less

Proceedings Article•DOI•

The distribution of calibrated likelihood-ratios in speaker recognition

[...]

David A. van Leeuwen¹, Niko Brümmer²•Institutions (2)

Radboud University Nijmegen¹, Brno University of Technology²

25 Aug 2013

TL;DR: Property of the score distributions of calibrated log-likelihood-ratios that are used in automatic speaker recognition are studied and closed-form expressions for these allow for a new way of computing the offset and scaling parameters for linear calibration.

...read moreread less

Abstract: This paper studies properties of the score distributions of calibrated log-likelihood-ratios that are used in automatic speaker recognition. We derive the essential condition for calibration that the log likelihood ratio of the log-likelihood-ratio is the log-likelihood-ratio. We then investigate what the consequence of this condition is to the probability density functions (PDFs) of the loglikelihood-ratio score. We show that if the PDF of the non-target distribution is Gaussian, then the PDF of the target distribution must be Gaussian as well. The means and variances of these two PDFs are interrelated, and determined completely by the discrimination performance of the recognizer characterized by the equal error rate. These relations allow for a new way of computing the offset and scaling parameters for linear calibration, and we derive closed-form expressions for these and show that for modern i-vector systems with PLDA scoring this leads to good calibration, comparable to traditional logistic regression, over a wide range of system performance.

...read moreread less

Proceedings Article•DOI•

Improved mixed language speech recognition using asymmetric acoustic model and language model with code-switch inversion constraints

[...]

Ying Li¹, Pascale Fung¹•Institutions (1)

Hong Kong University of Science and Technology¹

26 May 2013

TL;DR: An integrated framework for large vocabulary continuous mixed language speech recognition that handles the accent effect in the bilingual acoustic model and the inversion constraint well known to linguists in the language model is proposed.

...read moreread less

Abstract: We propose an integrated framework for large vocabulary continuous mixed language speech recognition that handles the accent effect in the bilingual acoustic model and the inversion constraint well known to linguists in the language model. Our asymmetric acoustic model with phone set extension improves upon previous work by striking a balance between data and phonetic knowledge. Our language model improves upon previous work by (1) using the inversion constraint to predict code switching points in the mixed language and (2) integrating a code-switch prediction model, a translation model and a reconstruction model together. This integration means that our language model avoids the pitfall of propagated error that could arise from decoupling these steps. Finally, a WFST-based decoder integrates the acoustic models, code-switch language model and a monolingual language model in the matrix language all together. Our system reduces word error rate by 1.88% on a lecture speech corpus and by 2.43% on a lunch conversation corpus, with statistical significance, over the conventional bilingual acoustic model and interpolated language model.

...read moreread less

Collapse