
Showing papers on "Speaker recognition published in 2013"


Proceedings ArticleDOI
01 Dec 2013
TL;DR: Deep neural network acoustic models are adapted to a target speaker by supplying speaker identity vectors (i-vectors) as input features in parallel with the regular acoustic features for ASR; the resulting networks are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
Abstract: We propose to adapt deep neural network (DNN) acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR. For both training and test, the i-vector for a given speaker is concatenated to every frame belonging to that speaker and changes across different speakers. Experimental results on a Switchboard 300 hours corpus show that DNNs trained on speaker independent features and i-vectors achieve a 10% relative improvement in word error rate (WER) over networks trained on speaker independent features only. These networks are comparable in performance to DNNs trained on speaker-adapted features (with VTLN and FMLLR) with the advantage that only one decoding pass is needed. Furthermore, networks trained on speaker-adapted features and i-vectors achieve a 5-6% relative improvement in WER after hessian-free sequence training over networks trained on speaker-adapted features only.

714 citations
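As a minimal illustration of the input construction described in the abstract above, the sketch below concatenates a fixed per-speaker i-vector to every acoustic frame before it is fed to the network. The array shapes and the helper name `append_ivector` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def append_ivector(frames, ivector):
    """Concatenate a fixed per-speaker i-vector to every acoustic frame.

    frames:  (T, d_acoustic) array of frame-level features (e.g. log-mel)
    ivector: (d_ivec,) speaker identity vector, constant within a speaker
    returns: (T, d_acoustic + d_ivec) network input
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))  # repeat the i-vector for all T frames
    return np.hstack([frames, tiled])

# toy usage: 300 frames of 40-dim features plus a 100-dim i-vector
x = append_ivector(np.random.randn(300, 40), np.random.randn(100))
assert x.shape == (300, 140)
```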


Journal ArticleDOI
TL;DR: The potential of a simple photonic architecture to process information at unprecedented data rates is demonstrated using a learning-based approach: all spoken digits are identified with very low classification error and chaotic time-series prediction is performed with 10% error.
Abstract: Inspired by neural networks, reservoir computing uses nonlinear transient states to perform computations, offering faster parallel information processing. Brunner et al. show a photonic approach to reservoir computing capable of simultaneous spoken digit and speaker recognition at high data rates.

712 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper is intended to be a reference on the 2nd 'CHiME' Challenge, an initiative designed to analyze and evaluate the performance of ASR systems in a real-world domestic environment.
Abstract: Distant-microphone automatic speech recognition (ASR) remains a challenging goal in everyday environments involving multiple background sources and reverberation. This paper is intended to be a reference on the 2nd 'CHiME' Challenge, an initiative designed to analyze and evaluate the performance of ASR systems in a real-world domestic environment. Two separate tracks have been proposed: a small-vocabulary task with small speaker movements and a medium-vocabulary task without speaker movements. We discuss the rationale for the challenge and provide a detailed description of the datasets, tasks and baseline performance results for each track.

377 citations


Proceedings ArticleDOI
Hank Liao
26 May 2013
TL;DR: This work explores how deep neural networks may be adapted to speakers by re-training the input layer, the output layer or the entire network, and examines how L2 regularization that decays the weights toward the speaker-independent model improves generalization.
Abstract: There has been little work on examining how deep neural networks may be adapted to speakers for improved speech recognition accuracy. Past work has examined using a discriminatively trained affine transformation of the input features applied at a frame level or the re-training of the entire shallow network for a specific speaker. This work explores how deep neural networks may be adapted to speakers by re-training the input layer, the output layer or the entire network. We look at how L2 regularization using weight decay to the speaker independent model improves generalization. Other training factors are examined including the role momentum plays and stochastic mini-batch versus batch training. While improvements are significant for smaller networks, the largest show little gain from adaptation on a large vocabulary mobile speech recognition task.

271 citations
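A rough sketch of the regularized adaptation objective the abstract describes: the speaker-specific task gradient is combined with an L2 penalty that pulls the adapted weights back toward the speaker-independent (SI) model. The function names and the plain SGD update are assumptions made for illustration, not the paper's exact recipe.

```python
import numpy as np

def l2_to_si_penalty(weights, si_weights, lam):
    """L2 regularizer measuring how far the adapted layers drift from the SI model."""
    return 0.5 * lam * sum(np.sum((w - w0) ** 2) for w, w0 in zip(weights, si_weights))

def adapt_layer_step(W, W_si, grad_task, lam, lr=1e-3):
    """One SGD step on a single adapted layer (input, output, or the whole network,
    applied per layer): task gradient plus a weight-decay term toward the SI weights."""
    return W - lr * (grad_task + lam * (W - W_si))
```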


Proceedings ArticleDOI
26 May 2013
TL;DR: A new fast speaker adaptation method for the hybrid NN-HMM speech recognition model that can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
Abstract: In this paper, we propose a new fast speaker adaptation method for the hybrid NN-HMM speech recognition model. The adaptation method depends on a joint learning of a large generic adaptation neural network for all speakers as well as multiple small speaker codes (one per speaker). The joint training method uses all training data along with speaker labels to update adaptation NN weights and speaker codes based on the standard back-propagation algorithm. In this way, the learned adaptation NN is capable of transforming each speaker's features into a generic speaker-independent feature space when a small speaker code is given. Adaptation to a new speaker can be simply done by learning a new speaker code using the same back-propagation algorithm without changing any NN weights. In this method, a separate speaker code is learned for each speaker while the large adaptation NN is learned from the whole training set. The main advantage of this method is that the size of speaker codes is very small. As a result, it is possible to conduct a very fast adaptation of the hybrid NN/HMM model for each speaker based on only a small amount of adaptation data (i.e., just a few utterances). Experimental results on TIMIT have shown that it can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.

269 citations
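The sketch below illustrates the adaptation step for a new speaker as the abstract describes it: the adaptation network stays frozen and only a small speaker code is updated by gradient descent. The single linear adaptation layer and the MSE loss toward speaker-independent target features are simplifying assumptions standing in for the full network and the back-propagated ASR objective.

```python
import numpy as np

def adapt_features(x, code, W, b):
    """Adaptation layer: maps a frame x plus a small speaker code to a normalized feature."""
    return W @ np.concatenate([x, code]) + b

def learn_speaker_code(frames, targets, W, b, code_dim=8, lr=0.05, epochs=50):
    """Estimate a new speaker's code while the adaptation weights W, b stay frozen."""
    code = np.zeros(code_dim)
    W_code = W[:, -code_dim:]                        # columns of W that multiply the code
    for _ in range(epochs):
        grad = np.zeros(code_dim)
        for x, t in zip(frames, targets):
            err = adapt_features(x, code, W, b) - t  # residual against the target feature
            grad += W_code.T @ err                   # d(0.5*||err||^2)/d(code)
        code -= lr * grad / len(frames)
    return code
```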


Book
17 Sep 2013
TL;DR: This book presents the methods, tools and techniques that are currently being used to recognise (automatically) the affect, emotion, personality and everything else beyond linguistics (paralinguistics) expressed by or embedded in human speech and language.
Abstract: This book presents the methods, tools and techniques that are currently being used to recognise (automatically) the affect, emotion, personality and everything else beyond linguistics (paralinguistics) expressed by or embedded in human speech and language. It is the first book to provide such a systematic survey of paralinguistics in speech and language processing. The technology described has evolved mainly from automatic speech and speaker recognition and processing, but also takes into account recent developments within speech signal processing, machine intelligence and data mining. Moreover, the book offers a hands-on approach by integrating actual data sets, software, and open-source utilities which will make the book invaluable as a teaching tool and similarly useful for those professionals already in the field. Key features: Provides an integrated presentation of basic research (in phonetics/linguistics and humanities) with state-of-the-art engineering approaches for speech signal processing and machine intelligence. Explains the history and state of the art of all of the sub-fields which contribute to the topic of computational paralinguistics. Covers the signal processing and machine learning aspects of the actual computational modelling of emotion and personality and explains the detection process from corpus collection to feature extraction and from model testing to system integration. Details aspects of real-world system integration including distribution, weakly supervised learning and confidence measures. Outlines machine learning approaches including static, dynamic and context-sensitive algorithms for classification and regression. Includes a tutorial on freely available toolkits, such as the open-source openEAR toolkit for emotion and affect recognition co-developed by one of the authors, and a listing of standard databases and feature sets used in the field to allow for immediate experimentation, enabling the reader to build an emotion detection model on an existing corpus.

234 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: This paper shows how to quantify the uncertainty associated with the i-vector extraction process and propagate it into a PLDA classifier, and finds that doing so leads to substantial improvements in accuracy.
Abstract: The duration of speech segments has traditionally been controlled in the NIST speaker recognition evaluations so that researchers working in this framework have been relieved of the responsibility of dealing with the duration variability that arises in practical applications. The fixed dimensional i-vector representation of speech utterances is ideal for working under such controlled conditions and ignoring the fact that i-vectors extracted from short utterances are less reliable than those extracted from long utterances leads to a very simple formulation of the speaker recognition problem. However, a more realistic approach seems to be needed to handle duration variability properly. In this paper, we show how to quantify the uncertainty associated with the i-vector extraction process and propagate it into a PLDA classifier. We evaluated this approach using test sets derived from the NIST 2010 core and extended core conditions by randomly truncating the utterances in the female, telephone speech trials so that the durations of all enrollment and test utterances lay in the range 3-60 seconds and we found that it led to substantial improvements in accuracy. Although the likelihood ratio computation for speaker verification is more computationally expensive than in the standard i-vector/PLDA classifier, it is still quite modest as it reduces to computing the probability density functions of two full covariance Gaussians (irrespective of the number of utterances used to enroll a speaker).

233 citations
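As a reference point for the closing remark of the abstract, here is the standard pairwise verification score in a two-covariance (PLDA-style) model, which indeed reduces to evaluating two full-covariance Gaussians over the stacked i-vector pair. The paper's actual contribution, propagating the i-vector extraction uncertainty, would additionally add each utterance's posterior covariance to the within-speaker term; that step is not shown, and all names here are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plda_llr(x1, x2, mu, B, W):
    """Log-likelihood ratio for 'same speaker' vs 'different speakers'.

    mu: global i-vector mean, B: between-speaker covariance, W: within-speaker covariance.
    """
    d = len(mu)
    z = np.concatenate([x1 - mu, x2 - mu])
    T = B + W
    same = np.block([[T, B], [B, T]])  # under H_same the two i-vectors are correlated through B
    diff = np.block([[T, np.zeros((d, d))], [np.zeros((d, d)), T]])
    return (multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=same)
            - multivariate_normal.logpdf(z, mean=np.zeros(2 * d), cov=diff))
```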


01 Sep 2013
TL;DR: The MSR Identity Toolbox is released, which contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition, and provides many of the functionalities available in other open-source speaker recognition toolkits.
Abstract: We are happy to announce the release of the MSR Identity Toolbox: A MATLAB toolbox for speaker-recognition research. This toolbox contains a collection of MATLAB tools and routines that can be used for research and development in speaker recognition. It provides researchers with a test bed for developing new front-end and back-end techniques, allowing replicable evaluation of new advancements. It will also help newcomers in the field by lowering the "barrier to entry," enabling them to quickly build baseline systems for their experiments. Although the focus of this toolbox is on speaker recognition, it can also be used for other speech related applications such as language, dialect, and accent identification. Additionally, it provides many of the functionalities available in other open-source speaker recognition toolkits (e.g., ALIZE).

224 citations


Proceedings ArticleDOI
Jing Huang, Brian Kingsbury
26 May 2013
TL;DR: This work uses DBNs for audio-visual speech recognition; in particular, it uses deep learning from audio and visual features for noise robust speech recognition and tests two methods for using DBNs in a multimodal setting.
Abstract: Deep belief networks (DBN) have shown impressive improvements over Gaussian mixture models for automatic speech recognition. In this work we use DBNs for audio-visual speech recognition; in particular, we use deep learning from audio and visual features for noise robust speech recognition. We test two methods for using DBNs in a multimodal setting: a conventional decision fusion method that combines scores from single-modality DBNs, and a novel feature fusion method that operates on mid-level features learned by the single-modality DBNs. On a continuously spoken digit recognition task, our experiments show that these methods can reduce word error rate by as much as 21% relative over a baseline multi-stream audio-visual GMM/HMM system.

182 citations
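A toy contrast between the two fusion strategies mentioned in the abstract, with weights and shapes chosen arbitrarily for illustration: decision fusion combines the single-modality scores, while feature fusion concatenates mid-level (hidden-layer) representations for a jointly trained classifier.

```python
import numpy as np

def decision_fusion(audio_log_scores, visual_log_scores, w=0.7):
    """Late fusion: weighted combination of per-class log-scores from the two single-modality models."""
    return w * audio_log_scores + (1.0 - w) * visual_log_scores

def feature_fusion(audio_hidden, visual_hidden):
    """Mid-level fusion: concatenate hidden activations of the single-modality networks;
    a joint classifier is then trained on the fused representation."""
    return np.concatenate([audio_hidden, visual_hidden], axis=-1)
```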


Journal ArticleDOI
TL;DR: An improved clustering method is integrated with an existing re-segmentation algorithm and an iterative optimization scheme is implemented that demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner.
Abstract: In speaker diarization, standard approaches typically perform speaker clustering on some initial segmentation before refining the segment boundaries in a re-segmentation step to obtain a final diarization hypothesis. In this paper, we integrate an improved clustering method with an existing re-segmentation algorithm and, in iterative fashion, optimize both speaker cluster assignments and segmentation boundaries jointly. For clustering, we extend our previous research using factor analysis for speaker modeling. In continuing to take advantage of the effectiveness of factor analysis as a front-end for extracting speaker-specific features (i.e., i-vectors), we develop a probabilistic approach to speaker clustering by applying a Bayesian Gaussian Mixture Model (GMM) to principal component analysis (PCA)-processed i-vectors. We then utilize information at different temporal resolutions to arrive at an iterative optimization scheme that, in alternating between clustering and re-segmentation steps, demonstrates the ability to improve both speaker cluster assignments and segmentation boundaries in an unsupervised manner. Our proposed methods attain results that are comparable to those of a state-of-the-art benchmark set on the multi-speaker CallHome telephone corpus. We further compare our system with a Bayesian nonparametric approach to diarization and attempt to reconcile their differences in both methodology and performance.

181 citations
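A compact sketch of the clustering front-end described in the abstract: segment-level i-vectors are PCA-processed and then clustered with a Bayesian GMM whose sparse Dirichlet prior lets superfluous components die out, approximating automatic selection of the number of speakers. The scikit-learn estimators and all hyperparameters are illustrative assumptions; the paper's iterative re-segmentation loop is omitted.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

def cluster_segment_ivectors(ivectors, max_speakers=10):
    """Assign a speaker-cluster label to each segment i-vector."""
    x = PCA(n_components=0.95, svd_solver='full', whiten=True).fit_transform(ivectors)
    bgmm = BayesianGaussianMixture(n_components=max_speakers,
                                   weight_concentration_prior=1e-2,  # favors few active clusters
                                   covariance_type='diag',
                                   max_iter=200,
                                   random_state=0)
    return bgmm.fit_predict(x)
```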


Proceedings ArticleDOI
26 May 2013
TL;DR: This study reveals that the nonlinear rectification primarily accounts for the noise robustness differences, and suggests how to enhance MFCC robustness and how to further improve GFCC robustness by adopting a different time-frequency representation.
Abstract: Automatic speaker recognition can achieve a high level of performance in matched training and testing conditions. However, such performance drops significantly in mismatched noisy conditions. Recent research indicates that a new speaker feature, gammatone frequency cepstral coefficients (GFCC), exhibits superior noise robustness to commonly used mel-frequency cepstral coefficients (MFCC). To gain a deep understanding of the intrinsic robustness of GFCC relative to MFCC, we design speaker identification experiments to systematically analyze their differences and similarities. This study reveals that the nonlinear rectification accounts for the noise robustness differences primarily. Moreover, this study suggests how to enhance MFCC robustness, and further improve GFCC robustness by adopting a different time-frequency representation.
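To make the "nonlinear rectification" finding concrete, the sketch below computes cepstra from filterbank energies where only the compression step is switched between a log (MFCC-style) and a cubic root (GFCC-style); the real features also differ in their mel vs. gammatone filterbanks, which is not shown, and all parameter choices are assumptions.

```python
import numpy as np
from scipy.fft import dct

def cepstra(filterbank_energies, compression='log', n_coeffs=13):
    """Cepstral coefficients from frame-level filterbank energies (last axis = bands)."""
    e = np.maximum(filterbank_energies, 1e-10)
    compressed = np.log(e) if compression == 'log' else np.cbrt(e)  # the rectification under study
    return dct(compressed, type=2, norm='ortho', axis=-1)[..., :n_coeffs]

# mfcc_like = cepstra(mel_energies, 'log');  gfcc_like = cepstra(gammatone_energies, 'cbrt')
```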

Proceedings ArticleDOI
01 Aug 2013
TL;DR: A set of abstractions, algorithms, and applications natively efficient for TrueNorth, a non-von Neumann architecture inspired by the brain's function and efficiency, is developed, including seven applications spanning speaker recognition, music composer recognition, digit recognition, sequence prediction, collision avoidance, optical flow, and eye detection.
Abstract: Marching along the DARPA SyNAPSE roadmap, IBM unveils a trilogy of innovations towards the TrueNorth cognitive computing system inspired by the brain's function and efficiency. The non-von Neumann nature of the TrueNorth architecture necessitates a novel approach to efficient system design. To this end, we have developed a set of abstractions, algorithms, and applications that are natively efficient for TrueNorth. First, we developed repeatedly-used abstractions that span neural codes (such as binary, rate, population, and time-to-spike), long-range connectivity, and short-range connectivity. Second, we implemented ten algorithms that include convolution networks, spectral content estimators, liquid state machines, restricted Boltzmann machines, hidden Markov models, looming detection, temporal pattern matching, and various classifiers. Third, we demonstrate seven applications that include speaker recognition, music composer recognition, digit recognition, sequence prediction, collision avoidance, optical flow, and eye detection. Our results showcase the parallelism, versatility, rich connectivity, spatio-temporality, and multi-modality of the TrueNorth architecture as well as compositionality of the corelet programming paradigm and the flexibility of the underlying neuron model.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space.
Abstract: This paper presents a voice conversion technique using Deep Belief Nets (DBNs) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. DBNs have a deep architecture that automatically discovers abstractions to maximally express the original input features. If we train the DBNs using only the speech of an individual speaker, it can be considered that there is less phonological information and relatively more speaker individuality in the output features at the highest layer. Training the DBNs for a source speaker and a target speaker, we can then connect and convert the speaker individuality abstractions using Neural Networks (NNs). The converted abstraction of the source speaker is then brought back to the cepstrum space using an inverse process of the DBNs of the target speaker. We conducted speaker voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method.

Proceedings ArticleDOI
26 May 2013
TL;DR: The synthetic speech detection results show that the modulation features provide complementary information to magnitude/phase features; the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%.
Abstract: Voice conversion and speaker adaptation techniques present a threat to current state-of-the-art speaker verification systems. To prevent such spoofing attacks and enhance the security of speaker verification systems, the development of anti-spoofing techniques to distinguish synthetic and human speech is necessary. In this study, we continue the quest to discriminate synthetic and human speech. Motivated by the fact that current analysis-synthesis techniques operate at the frame level and make the frame-by-frame independence assumption, we propose to adopt magnitude/phase modulation features to detect synthetic speech from human speech. Modulation features derived from the magnitude/phase spectrum carry long-term temporal information of speech, and may be able to detect temporal artifacts caused by the frame-by-frame processing in the synthesis of the speech signal. From our synthetic speech detection results, the modulation features provide complementary information to magnitude/phase features. The best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate of 0.89%, which is significantly lower than the 1.25% of phase features and 10.98% of MFCC features.
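A minimal version of the modulation-feature idea in the abstract: take the frame-level magnitude (or phase-derived) feature trajectories and compute a short FFT along time, so that long-term temporal structure, and hence frame-by-frame synthesis artefacts, becomes visible. The block length, hop, and number of retained modulation bins are arbitrary assumptions.

```python
import numpy as np

def modulation_features(trajectories, win=32, hop=16, n_mod=8):
    """Modulation spectrum of frame-level features.

    trajectories: (T, d) matrix of per-frame magnitude/phase features.
    returns:      one flattened low-modulation-frequency vector per block of frames.
    """
    T, d = trajectories.shape
    blocks = []
    for start in range(0, T - win + 1, hop):
        seg = trajectories[start:start + win]
        seg = seg - seg.mean(axis=0)                    # remove per-dimension DC
        mod = np.abs(np.fft.rfft(seg, axis=0))[:n_mod]  # low modulation frequencies along time
        blocks.append(mod.flatten())
    return np.array(blocks)
```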

Proceedings ArticleDOI
26 May 2013
TL;DR: The effect of duration variability on the phoneme distributions of speech utterances and on i-vector length is analyzed, demonstrating that, as utterance duration decreases, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively.
Abstract: Speaker recognition systems trained on long duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively. Assuming duration variability to be an additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training in the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short duration test conditions, especially with the QMF calibration approach.
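Strategy (ii) in the abstract amounts to a linear score calibration whose quality measure is the log test-utterance duration; a one-line sketch of that transform is given below, with the weight vector assumed to be trained on a development set (e.g. by logistic regression). The function name and weight ordering are illustrative assumptions.

```python
import numpy as np

def qmf_calibrate(raw_score, duration_sec, w):
    """Calibrated score = bias + score weight * raw score + QMF weight * log(duration)."""
    return w[0] + w[1] * raw_score + w[2] * np.log(duration_sec)
```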

Proceedings ArticleDOI
01 Dec 2013
TL;DR: The accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.
Abstract: We investigate the application of deep neural network (DNN)-hidden Markov model (HMM) hybrid acoustic models for far-field speech recognition of meetings recorded using microphone arrays. We show that the hybrid models achieve significantly better accuracy than conventional systems based on Gaussian mixture models (GMMs). We observe up to 8% absolute word error rate (WER) reduction from a discriminatively trained GMM baseline when using a single distant microphone, and between 4-6% absolute WER reduction when using beamforming on various combinations of array channels. By training the networks on audio from multiple channels, we find the networks can recover a significant part of the accuracy difference between the single distant microphone and beamformed configurations. Finally, we show that the accuracy of a network recognising speech from a single distant microphone can approach that of a multi-microphone setup by training with data from other microphones.
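For readers unfamiliar with the beamformed configuration mentioned above, here is a naive delay-and-sum sketch: each channel is aligned to a reference by the lag maximizing their cross-correlation, then the channels are averaged. Practical meeting-room systems use weighted filter-and-sum beamforming; this simplification is only illustrative and not the paper's setup.

```python
import numpy as np

def delay_and_sum(channels, ref=0):
    """channels: list/array of equal-length 1-D signals from the microphone array."""
    ref_sig = np.asarray(channels[ref])
    n = len(ref_sig)
    out = np.zeros(n)
    for ch in channels:
        sig = np.asarray(ch)
        corr = np.correlate(sig, ref_sig, mode='full')
        lag = int(np.argmax(corr)) - (n - 1)  # positive lag: this channel arrives later
        out += np.roll(sig, -lag)             # crude alignment (wrap-around ignored)
    return out / len(channels)
```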

Journal ArticleDOI
TL;DR: A wide range of features employed for speech emotion recognition and their acoustic characteristics are presented, and important parameters of the features such as precision, recall, F-measure and recognition rate are analyzed.
Abstract: Speech Emotion Recognition (SER) represents one of the emerging fields in human-computer interaction. The quality of a human-computer interface that mimics human speech emotions relies heavily on the types of features used and also on the classifier employed for recognition. The main purpose of this paper is to present a wide range of features employed for speech emotion recognition and the acoustic characteristics of those features. We also analyze performance in terms of some important parameters such as precision, recall, F-measure and recognition rate of the features, using two commonly used emotional speech databases, namely the Berlin emotional database and the Danish emotional database. Emotional speech recognition is being applied in modern human-computer interfaces, and an overview of 10 interesting applications is also presented in this paper to illustrate the importance of this technique.

Proceedings ArticleDOI
26 May 2013
TL;DR: The advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages, are described.
Abstract: This paper describes the advances in the multilingual text and speech database GlobalPhone, a multilingual database of high-quality read speech with corresponding transcriptions and pronunciation dictionaries in 20 languages. GlobalPhone was designed to be uniform across languages with respect to the amount of data, speech quality, the collection scenario, the transcription and phone set conventions. With more than 400 hours of transcribed audio data from more than 2000 native speakers GlobalPhone supplies an excellent basis for research in the areas of multilingual speech recognition, rapid deployment of speech processing systems to yet unsupported languages, language identification tasks, speaker recognition in multiple languages, multilingual speech synthesis, as well as monolingual speech recognition in a large variety of languages.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: The latest version of the toolkit and its performance on the NIST-SRE 2010 extended task are presented; the toolkit includes a set of high-level tools dedicated to speaker recognition based on the latest developments in the field.
Abstract: ALIZE is an open-source platform for speaker recognition. The ALIZE library implements a low-level statistical engine based on the well-known Gaussian mixture modelling. The toolkit includes a set of high level tools dedicated to speaker recognition based on the latest developments in speaker recognition such as Joint Factor Analysis, Support Vector Machine, i-vector modelling and Probabilistic Linear Discriminant Analysis. Since 2005, the performance of ALIZE has been demonstrated in a series of Speaker Recognition Evaluations (SREs) conducted by NIST and it has been used by many participants in the last NIST-SRE 2012. This paper presents the latest version of the toolkit and its performance on the NIST-SRE 2010 extended task.

Journal ArticleDOI
TL;DR: It seems that the state-of-the-art LID system performs much better on the standard 12-class NIST 2003 Language Recognition Evaluation task or the two-class ethnic group recognition task than on the 14-class regional accent recognition task.

Proceedings ArticleDOI
21 Oct 2013
TL;DR: This work studies an alternative, likelihood ratio based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs) and provides open-source implementation of the method.
Abstract: A voice activity detector (VAD) plays a vital role in robust speaker verification, where energy VAD is most commonly used. Energy VAD works well in noise-free conditions but deteriorates in noisy conditions. One way to tackle this is to introduce speech enhancement preprocessing. We study an alternative, likelihood ratio based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs). The training labels are obtained from enhanced energy VAD. As the speech and nonspeech models are re-trained for each utterance, minimum assumptions of the background noise are made. According to both VAD error analysis and speaker verification results utilizing state-of-the-art i-vector system, the proposed method outperforms energy VAD variants by a wide margin. We provide open-source implementation of the method.
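A condensed sketch of the utterance-level training loop the abstract describes: a crude energy threshold gives initial speech/nonspeech labels, two GMMs are fitted to this utterance's MFCCs, and frames are relabelled by their log-likelihood ratio. The median-energy initialization and all GMM settings are stand-in assumptions, not the authors' exact recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def likelihood_ratio_vad(mfcc, log_energy, n_components=4, threshold=0.0):
    """mfcc: (T, d) frame features; log_energy: (T,) frame log-energies. Returns a boolean speech mask."""
    init_speech = log_energy > np.median(log_energy)   # stand-in for the enhanced energy VAD labels
    gmm_speech = GaussianMixture(n_components, covariance_type='diag').fit(mfcc[init_speech])
    gmm_nonspeech = GaussianMixture(n_components, covariance_type='diag').fit(mfcc[~init_speech])
    llr = gmm_speech.score_samples(mfcc) - gmm_nonspeech.score_samples(mfcc)
    return llr > threshold
```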


Patent
25 Feb 2013
TL;DR: A method and apparatus employing classifier adaptation based on field data in a deployed voice-based interactive system comprise: collecting representations of voice characteristics, in association with corresponding speakers, the representations being generated by the deployed voice-based interactive system; and updating parameters of the classifier, used in speaker recognition, based on the representations collected.
Abstract: Typical speaker verification systems usually employ speakers' audio data collected during an enrollment phase when users enroll with the system and provide respective voice samples. Due to technical, business, or other constraints, the enrollment data may not be large enough or rich enough to encompass different inter-speaker and intra-speaker variations. According to at least one embodiment, a method and apparatus employing classifier adaptation based on field data in a deployed voice-based interactive system comprise: collecting representations of voice characteristics, in association with corresponding speakers, the representations being generated by the deployed voice-based interactive system; updating parameters of the classifier, used in speaker recognition, based on the representations collected; and employing the classifier, with the corresponding parameters updated, in performing speaker recognition.

Patent
13 Jun 2013
TL;DR: A mobile terminal and a voice recognition method thereof are described, in which a user's voice is received and provided to a first voice recognition engine provided in a server and a second voice recognition engine provided in the mobile terminal.
Abstract: The present disclosure relates to a mobile terminal and a voice recognition method thereof. The voice recognition method may include receiving a user's voice; providing the received voice to a first voice recognition engine provided in the server and a second voice recognition engine provided in the mobile terminal; acquiring first voice recognition data as a result of recognizing the received voice by the first voice recognition engine; acquiring second voice recognition data as a result of recognizing the received voice by the second voice recognition engine; estimating a function corresponding to the user's intention based on at least one of the first and the second voice recognition data; calculating a similarity between the first and the second voice recognition data when personal information is required for the estimated function; and selecting either one of the first and the second voice recognition data based on the calculated similarity.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A phrase-dependent PLDA model with uncertainty propagation is introduced, and it is shown that, despite the low channel variability of the dataset, improved results over the GMM-UBM model are attained.
Abstract: In this paper, we apply and enhance the i-vector-PLDA paradigm to text-dependent speaker recognition. Due to its origin in text-independent speaker recognition, this paradigm does not make use of the phonetic content of each utterance. Moreover, the uncertainty in the i-vector estimates should be taken into account in the PLDA model, due to the short duration of the utterances. To bridge this gap, a phrase-dependent PLDA model with uncertainty propagation is introduced. We examine it on the RSR2015 dataset and show that, despite its low channel variability, improved results over the GMM-UBM model are attained.

Proceedings ArticleDOI
01 Sep 2013
TL;DR: A novel countermeasure based on the analysis of speech signals using local binary patterns followed by a one-class classification approach is presented, which captures differences in the spectro-temporal texture of genuine and spoofed speech, but relies only on a model of the former.
Abstract: The vulnerability of automatic speaker verification systems to spoofing is now well accepted. While recent work has shown the potential to develop countermeasures capable of detecting spoofed speech signals, existing solutions typically function well only for specific attacks on which they are optimised. Since the exact nature of spoofing attacks can never be known in practice, there is thus a need for generalised countermeasures which can detect previously unseen spoofing attacks. This paper presents a novel countermeasure based on the analysis of speech signals using local binary patterns followed by a one-class classification approach. The new countermeasure captures differences in the spectro-temporal texture of genuine and spoofed speech, but relies only on a model of the former. We report experiments with three different approaches to spoofing and with a state-of-the-art i-vector speaker verification system which uses probabilistic linear discriminant analysis for intersession compensation. While a support vector machine classifier is tuned with examples of converted voice, it delivers reliable detection of spoofing attacks using synthesized speech and artificial signals, attacks for which it is not optimised.
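To give a feel for the countermeasure, the sketch below computes plain 3x3 local binary pattern codes over a time-frequency feature map, summarizes them as a histogram, and fits a one-class SVM on genuine speech only, so spoofed inputs should fall outside the learned region. The exact feature map, LBP variant, and classifier settings in the paper differ; everything here is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def lbp_histogram(feature_map):
    """Normalized 256-bin histogram of 3x3 LBP codes over a (freq x time) feature map."""
    s = np.asarray(feature_map)
    centre = s[1:-1, 1:-1]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(centre, dtype=np.uint8)
    for bit, (di, dj) in enumerate(offsets):
        neighbour = s[1 + di:s.shape[0] - 1 + di, 1 + dj:s.shape[1] - 1 + dj]
        codes |= (neighbour >= centre).astype(np.uint8) << bit
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / hist.sum()

# one-class model of genuine speech only (hypothetical variable names):
# genuine = np.stack([lbp_histogram(m) for m in genuine_feature_maps])
# detector = OneClassSVM(kernel='rbf', nu=0.1, gamma='scale').fit(genuine)
# accepted = detector.predict(lbp_histogram(test_map)[None, :]) == 1
```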

Journal ArticleDOI
TL;DR: This work addresses the problem of increased degradation in performance when moving from speaker-dependent to speaker-independent conditions for connectionist hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR).
Abstract: Model adaptation techniques are an efficient way to reduce the mismatch that typically occurs between the training and test conditions of any automatic speech recognition (ASR) system. This work addresses the problem of increased degradation in performance when moving from speaker-dependent (SD) to speaker-independent (SI) conditions for connectionist (or hybrid) hidden Markov model/artificial neural network (HMM/ANN) systems in the context of large vocabulary continuous speech recognition (LVCSR). Adapting hybrid HMM/ANN systems on a small amount of adaptation data has been proven to be a difficult task, and has been a limiting factor in the widespread deployment of hybrid techniques in operational ASR systems. Addressing the crucial issue of speaker adaptation (SA) for hybrid HMM/ANN systems can thereby have a great impact on the connectionist paradigm, which will play a major role in the design of next-generation LVCSR given the great success reported by deep neural networks (ANNs with many hidden layers that adopt the pre-training technique) on many speech tasks. Current adaptation techniques for ANNs based on injecting an adaptable linear transformation network connected to either the input or the output layer are not effective, especially with a small amount of adaptation data, e.g., a single adaptation utterance. In this paper, a novel solution is proposed to overcome those limits and make adaptation robust to scarce adaptation resources. The key idea is to adapt the hidden activation functions rather than the network weights. The adoption of Hermitian activation functions makes this possible. Experimental results on an LVCSR task demonstrate the effectiveness of the proposed approach.

Proceedings ArticleDOI
26 May 2013
TL;DR: This work proposes a method to learn a universal speech model from a general corpus of speech and shows how to use this model to separate speech from other sound sources; it also shows that the method improves performance when training data of the non-speech source is available.
Abstract: Supervised and semi-supervised source separation algorithms based on non-negative matrix factorization have been shown to be quite effective. However, they require isolated training examples of one or more sources, which is often difficult to obtain. This limits the practical applicability of these algorithms. We examine the problem of efficiently utilizing general training data in the absence of specific training examples. Specifically, we propose a method to learn a universal speech model from a general corpus of speech and show how to use this model to separate speech from other sound sources. This model is used in lieu of a speech model trained on speaker-dependent training examples, and thus circumvents the aforementioned problem. Our experimental results show that our method achieves nearly the same performance as when speaker-dependent training examples are used. Furthermore, we show that our method improves performance when training data of the non-speech source is available.
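Below is a small semi-supervised NMF separation sketch in the spirit of the abstract: a speech dictionary learned from a general corpus is kept fixed, a noise dictionary and all activations are estimated on the mixture with standard multiplicative updates (Euclidean cost), and speech is recovered through a Wiener-like mask. The paper's universal speech model adds further structure on top of plain semi-supervised NMF; that refinement is not reproduced here, and all parameters are assumptions.

```python
import numpy as np

def separate_speech(V, W_speech, n_noise=20, n_iter=100, eps=1e-9):
    """V: (F, T) magnitude spectrogram of the mixture; W_speech: (F, K) fixed speech dictionary."""
    F, T = V.shape
    K = W_speech.shape[1]
    rng = np.random.default_rng(0)
    W_noise = rng.random((F, n_noise)) + eps
    H = rng.random((K + n_noise, T)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_speech, W_noise])
        H *= (W.T @ V) / (W.T @ W @ H + eps)                     # update activations of both parts
        V_hat = W @ H
        H_noise = H[K:]
        W_noise *= (V @ H_noise.T) / (V_hat @ H_noise.T + eps)   # update noise atoms only
    W = np.hstack([W_speech, W_noise])
    speech_model = W_speech @ H[:K]
    return speech_model / (W @ H + eps) * V                      # Wiener-like masking of the mixture
```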

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A novel HMM-based system for off-line handwriting recognition that outperforms current state-of-the-art approaches on two standard evaluation corpora for English and French handwriting.
Abstract: In this paper we describe a novel HMM-based system for off-line handwriting recognition. We adapt successful techniques from the domains of large vocabulary speech recognition and image object recognition: moment-based image normalization, writer adaptation, discriminative feature extraction and training, and open-vocabulary recognition. We evaluate those methods and examine their cumulative effect on the recognition performance. The final system outperforms current state-of-the-art approaches on two standard evaluation corpora for English and French handwriting.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: This work studies the vulnerability to mimicry attacks of two well-known speaker recognition systems, the traditional Gaussian mixture model – universal background model (GMM-UBM) and a state-of-the-art i-vector classifier with cosine scoring, using material from one professional Finnish imitator impersonating five well-known Finnish public figures.
Abstract: Voice imitation is mimicry of another speaker’s voice characteristics and speech behavior. Professional voice mimicry can create entertaining, yet realistic sounding target speaker renditions. As mimicry tends to exaggerate prosodic, idiosyncratic and lexical behavior, it is unclear how modern spectral-feature automatic speaker verification systems respond to mimicry “attacks”. We study the vulnerability of two well-known speaker recognition systems, the traditional Gaussian mixture model – universal background model (GMM-UBM) and a state-of-the-art i-vector classifier with cosine scoring. The material consists of one professional Finnish imitator impersonating five well-known Finnish public figures. In a carefully controlled setting, the mimicry attack does slightly increase the false acceptance rate for the i-vector system, but generally this is not alarmingly large in comparison to voice conversion or playback attacks.