Showing papers on "Speaker recognition published in 2010"

PDF

Open Access

Journal Article•DOI•

An overview of text-independent speaker recognition: From features to supervectors

[...]

Tomi Kinnunen¹, Haizhou Li²•Institutions (2)

University of Eastern Finland¹, Institute for Infocomm Research Singapore²

01 Jan 2010-Speech Communication

TL;DR: This paper starts with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling and elaborate advanced computational techniques to address robustness and session variability.

...read moreread less

1,433 citations

Journal Article•DOI•

Spectral Mapping Using Artificial Neural Networks for Voice Conversion

[...]

Srinivas Desai, Alan W. Black¹, B. Yegnanarayana, Kishore Prahallad•Institutions (1)

Carnegie Mellon University¹

01 Jul 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker is proposed and it is demonstrated that such a voice Conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.

...read moreread less

Abstract: In this paper, we use artificial neural networks (ANNs) for voice conversion and exploit the mapping abilities of an ANN model to perform mapping of spectral features of a source speaker to that of a target speaker. A comparative study of voice conversion using an ANN model and the state-of-the-art Gaussian mixture model (GMM) is conducted. The results of voice conversion, evaluated using subjective and objective measures, confirm that an ANN-based VC system performs as good as that of a GMM-based VC system, and the quality of the transformed speech is intelligible and possesses the characteristics of a target speaker. In this paper, we also address the issue of dependency of voice conversion techniques on parallel data between the source and the target speakers. While there have been efforts to use nonparallel data and speaker adaptation techniques, it is important to investigate techniques which capture speaker-specific characteristics of a target speaker, and avoid any need for source speaker's data either for training or for adaptation. In this paper, we propose a voice conversion approach using an ANN model to capture speaker-specific characteristics of a target speaker and demonstrate that such a voice conversion approach can perform monolingual as well as cross-lingual voice conversion of an arbitrary source speaker.

...read moreread less

269 citations

Proceedings Article•

Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition

[...]

Dong Yu¹, Li Deng¹, George E. Dahl²•Institutions (2)

Microsoft¹, University of Toronto²

01 Dec 2010

TL;DR: It is shown that pre-training can initialize weights to a point in the space where fine-tuning can be effective and thus is crucial in training deep structured models and in the recognition performance of a CD-DBN-HMM based large-vocabulary speech recognizer.

...read moreread less

Abstract: Recently, deep learning techniques have been successfully applied to automatic speech recognition tasks -first to phonetic recognition with context-independent deep belief network (DBN) hidden Markov models (HMMs) and later to large vocabulary continuous speech recognition using context-dependent (CD) DBN-HMMs. In this paper, we report our most recent experiments designed to understand the roles of the two main phases of the DBN learning -pre-training and fine tuning -in the recognition performance of a CD-DBN-HMM based large-vocabulary speech recognizer. As expected, we show that pre-training can initialize weights to a point in the space where fine-tuning can be effective and thus is crucial in training deep structured models. However, a moderate increase of the amount of unlabeled pre-training data has an insignificant effect on the final recognition results as long as the original training size is sufficiently large to initialize the DBN weights. On the other hand, with additional labeled training data, the fine-tuning phase of DBN training can significantly improve the recognition accuracy.

...read moreread less

235 citations

Proceedings Article•DOI•

A novel approach for MFCC feature extraction

[...]

Md. Afzal Hossan¹, Sheeraz Memon¹, Mark A. Gregory¹•Institutions (1)

RMIT University¹

01 Dec 2010

TL;DR: A new MFCC feature extraction method based on distributed Discrete Cosine Transform (DCT-II) is presented and speaker verification tests are proposed based on three different feature extraction methods.

...read moreread less

Abstract: The Mel-Frequency Cepstral Coefficients (MFCC) feature extraction method is a leading approach for speech feature extraction and current research aims to identify performance enhancements. One of the recent MFCC implementations is the Delta-Delta MFCC, which improves speaker verification. In this paper, a new MFCC feature extraction method based on distributed Discrete Cosine Transform (DCT-II) is presented. Speaker verification tests are proposed based on three different feature extraction methods including: conventional MFCC, Delta-Delta MFCC and distributed DCT-II based Delta-Delta MFCC with a Gaussian Mixture Model (GMM) classifier.

...read moreread less

224 citations

[...]

Najim Dehak¹, Réda Dehak², James Glass¹, Douglas A. Reynolds¹, Patrick Kenny - Show less +1 more•Institutions (2)

Massachusetts Institute of Technology¹, École Normale Supérieure²

01 Jan 2010

TL;DR: This paper introduces a modification to the cosine similarity that does not require explicit score normalization, relying instead on simple mean and covariance statistics from a collection of impostor speaker ivectors to enable application of a new unsupervised speaker adaptation technique to models defined in the ivector space.

...read moreread less

Abstract: In recent work [1], a simplified and highly effective approach to speaker recognition based on the cosine similarity between lowdimensional vectors, termed ivectors, defined in a total variability space was introduced. The total variability space representation is motivated by the popular Joint Factor Analysis (JFA) approach, but does not require the complication of estimating separate speaker and channel spaces and has been shown to be less dependent on score normalization procedures, such as znorm and t-norm. In this paper, we introduce a modification to the cosine similarity that does not require explicit score normalization, relying instead on simple mean and covariance statistics from a collection of impostor speaker ivectors. By avoiding the complication of zand t-norm, the new approach further allows for application of a new unsupervised speaker adaptation technique to models defined in the ivector space. Experiments are conducted on the core condition of the NIST 2008 corpora, where, with adaptation, the new approach produces an equal error rate (EER) of 4.8% and min decision cost function (MinDCF) of 2.3% on all female speaker trials.

...read moreread less

188 citations

Journal Article•DOI•

Super-human multi-talker speech recognition: A graphical modeling approach

[...]

John R. Hershey¹, Steven J. Rennie¹, Peder A. Olsen¹, Trausti Kristjansson²•Institutions (2)

IBM¹, Google²

01 Jan 2010-Computer Speech & Language

TL;DR: A system that can separate and recognize the simultaneous speech of two people recorded in a single channel is presented and how belief-propagation reduces the complexity of temporal inference from exponential to linear in the number of sources and the size of the language model is shown.

...read moreread less

187 citations

The speaker partitioning problem

[...]

Niko Brümmer, Edward de Villiers

01 Jan 2010

TL;DR: This work gives a unification of several different speaker recognition problems in terms of the general speaker partitioning problem, where a set of N inputs has to be partitioned into subsets according to speaker.

...read moreread less

Abstract: We give a unification of several different speaker recognition problems in terms of the general speaker partitioning problem, where a set of N inputs has to be partitioned into subsets according to speaker. We show how to solve this problem in terms of a simple generative model and demonstrate performance on NIST SRE 2006 and 2008 data. Our solution yields probabilistic outputs, which we show how to evaluate with a cross-entropy criterion. Finally, we show improved accuracy of the generative model via a discriminatively trained re-calibration transformation of log-likelihoods.

...read moreread less

179 citations

Journal Article•DOI•

Modeling coarticulation in EMG-based continuous speech recognition

[...]

Tanja Schultz¹, Michael Wand¹•Institutions (1)

Karlsruhe Institute of Technology¹

01 Apr 2010-Speech Communication

TL;DR: The new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition is described and results on theEMG-PIT corpus, a multiple speaker large vocabulary database of silent and audible EMG speech recordings, which was recently collected are reported.

...read moreread less

161 citations

Patent•

System and method for open speech recognition

[...]

Mazin Gilbert¹, Srinivas Bangalore¹, Patrick Haffner¹, Robert M. Bell¹•Institutions (1)

AT&T¹

30 Sep 2010

TL;DR: In this article, the authors present systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the received speech.

...read moreread less

Abstract: Disclosed herein are systems, methods and non-transitory computer-readable media for performing speech recognition across different applications or environments without model customization or prior knowledge of the domain of the received speech. The disclosure includes recognizing received speech with a collection of domain-specific speech recognizers, determining a speech recognition confidence for each of the speech recognition outputs, selecting speech recognition candidates based on a respective speech recognition confidence for each speech recognition output, and combining selected speech recognition candidates to generate text based on the combination.

...read moreread less

152 citations

Proceedings Article•

Phoneme Recognition with Large Hierarchical Reservoirs

[...]

Fabian Triefenbach¹, Azarakhsh Jalalvand¹, Benjamin Schrauwen¹, Jean-Pierre Martens¹•Institutions (1)

Ghent University¹

06 Dec 2010

TL;DR: It is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology, and in a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built.

...read moreread less

Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.

...read moreread less

146 citations

Journal Article•DOI•

Diarization of Telephone Conversations Using Factor Analysis

[...]

Patrick Kenny, Douglas A. Reynolds¹, Fabio Castaldo•Institutions (1)

Massachusetts Institute of Technology¹

30 Sep 2010-IEEE Journal of Selected Topics in Signal Processing

TL;DR: Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation, with the Variational Bayes system proving to be the most effective.

...read moreread less

Abstract: We report on work on speaker diarization of telephone conversations which was begun at the Robust Speaker Recognition Workshop held at Johns Hopkins University in 2008. Three diarization systems were developed and experiments were conducted using the summed-channel telephone data from the 2008 NIST speaker recognition evaluation. The systems are a Baseline agglomerative clustering system, a Streaming system which uses speaker factors for speaker change point detection and traditional methods for speaker clustering, and a Variational Bayes system designed to exploit a large number of speaker factors as in state of the art speaker recognition systems. The Variational Bayes system proved to be the most effective, achieving a diarization error rate of 1.0% on the summed-channel data. This represents an 85% reduction in errors compared with the Baseline agglomerative clustering system. An interesting aspect of the Variational Bayes approach is that it implicitly performs speaker clustering in a way which avoids making premature hard decisions. This type of soft speaker clustering can be incorporated into other diarization systems (although causality has to be sacrificed in the case of the Streaming system). With this modification, the Baseline system achieved a diarization error rate of 3.5% (a 50% reduction in errors).

...read moreread less

Journal Article•DOI•

On-line Emotion Recognition in a 3-D Activation-Valence-Time Continuum using Acoustic and Linguistic Cues

[...]

Florian Eyben¹, Martin Wöllmer¹, Alex Graves¹, Björn Schuller¹, Ellen Douglas-Cowie², Roddy Cowie² - Show less +2 more•Institutions (2)

Technische Universität München¹, Queen's University²

01 Mar 2010-Journal on Multimodal User Interfaces

TL;DR: This work presents a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks, which recognition is performed on low-level signal frames, similar to those used for speech recognition.

...read moreread less

Abstract: For many applications of emotion recognition, such as virtual agents, the system must select responses while the user is speaking. This requires reliable on-line recognition of the user’s affect. However most emotion recognition systems are based on turnwise processing. We present a novel approach to on-line emotion recognition from speech using Long Short-Term Memory Recurrent Neural Networks. Emotion is recognised frame-wise in a two-dimensional valence-activation continuum. In contrast to current state-of-the-art approaches, recognition is performed on low-level signal frames, similar to those used for speech recognition. No statistical functionals are applied to low-level feature contours. Framing at a higher level is therefore unnecessary and regression outputs can be produced in real-time for every low-level input frame. We also investigate the benefits of including linguistic features on the signal frame level obtained by a keyword spotter.

...read moreread less

Journal Article•DOI•

Evolving spiking neural networks for audiovisual information processing

[...]

Simei Gomes Wysoski¹, Lubica Benuskova¹, Nikola Kasabov¹•Institutions (1)

Auckland University of Technology¹

01 Sep 2010-Neural Networks

TL;DR: A new modular and integrative sensory information system inspired by the way the brain performs information processing, in particular, pattern recognition is presented, trained to perform the specific task of person authentication.

...read moreread less

Proceedings Article•DOI•

The NIST 2010 speaker recognition evaluation.

[...]

Alvin F. Martin¹, Craig S. Greenberg¹•Institutions (1)

National Institute of Standards and Technology¹

26 Sep 2010

TL;DR: The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text independent speaker detection and utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium.

...read moreread less

Abstract: The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primary evaluation metric giving increased weight to false alarm errors compared to misses is being used. A small evaluation test with a limited number of trials is also being offered for systems that include human expertise in their processing.

...read moreread less

Journal Article•DOI•

Modulation Spectral Features for Robust Far-Field Speaker Identification

[...]

Tiago H. Falk¹, Wai-Yip Chan¹•Institutions (1)

Queen's University¹

01 Jan 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this article, auditory inspired modulation spectral features are used to improve automatic speaker identification (ASI) performance in the presence of room reverberation, and a Gaussian mixture model based ASI system, trained on the proposed features, consistently outperforms a baseline system trained on mel-frequency cepstral coefficients.

...read moreread less

Abstract: In this paper, auditory inspired modulation spectral features are used to improve automatic speaker identification (ASI) performance in the presence of room reverberation. The modulation spectral signal representation is obtained by first filtering the speech signal with a 23-channel gammatone filterbank. An eight-channel modulation filterbank is then applied to the temporal envelope of each gammatone filter output. Features are extracted from modulation frequency bands ranging from 3-15 H z and are shown to be robust to mismatch between training and testing conditions and to increasing reverberation levels. To demonstrate the gains obtained with the proposed features, experiments are performed with clean speech, artificially generated reverberant speech, and reverberant speech recorded in a meeting room. Simulation results show that a Gaussian mixture model based ASI system, trained on the proposed features, consistently outperforms a baseline system trained on mel-frequency cepstral coefficients. For multimicrophone ASI applications, three multichannel score combination and adaptive channel selection techniques are investigated and shown to further improve ASI performance.

...read moreread less

An i-vector extractor suitable for speaker recognition with both microphone and telephone speech

[...]

Mohammed Senoussaoui, Patrick Kenny, Najim Dehak, Pierre Dumouchel

01 Jan 2010

TL;DR: This paper proposes a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context.

...read moreread less

Abstract: It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications however, training data from the same type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that are satisfactorily trained by virtue of a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (total variability space) proposed by Dehak [1]. Our aim is to extend Dehak’s work to speaker recognition on sparse data, namely microphone speech. The main challenge is to overcome the fact that insufficient application-specific data is available to accurately estimate the total variability covariance matrix. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with the following two approaches: Support Vector Machines (SVM) and Cosine Distance Scoring (CDS) classifier, based on cosine distances. We present recognition results for the part of female voices in the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA. We achieve 13% relative improvement on equal error rate and the minimum value of detection cost function decreases from 0.0219 to 0.0164.

...read moreread less

Dissertation•

Measuring, refining and calibrating speaker and language information extracted from speech

[...]

Niko Brümmer

01 Dec 2010

TL;DR: A new methodology, based on proper scoring rules, is proposed, allowing for the evaluation of the goodness of pattern recognizers with probabilistic outputs, which are intended to be usefully applied over a wide range of applications, having variable priors and costs.

...read moreread less

Abstract: We propose a new methodology, based on proper scoring rules, for the evaluation of the goodness of pattern recognizers with probabilistic outputs. The recognizers of interest take an input, known to belong to one of a discrete set of classes, and output a calibrated likelihood for each class. This is a generalization of the traditional use of proper scoring rules to evaluate the goodness of probability distributions. A recognizer with outputs in well-calibrated probability distribution form can be applied to make cost-effective Bayes decisions over a range of applications, having different cost functions. A recognizer with likelihood output can additionally be employed for a wide range of prior distributions for the to-be-recognized classes. We use automatic speaker recognition and automatic spoken language recognition as prototypes of this type of pattern recognizer. The traditional evaluation methods in these fields, as represented by the series of NIST Speaker and Language Recognition Evaluations, evaluate hard decisions made by the recognizers. This makes these recognizers cost-and-prior-dependent. The proposed methodology generalizes that of the NIST evaluations, allowing for the evaluation of recognizers which are intended to be usefully applied over a wide range of applications, having variable priors and costs. The proposal includes a family of evaluation criteria, where each member of the family is formed by a proper scoring rule. We emphasize two members of this family: (i) A non-strict scoring rule, directly representing error-rate at a given prior. (ii) The strict logarithmic scoring rule which represents information content, or which equivalently represents summarized error-rate, or expected cost, over a wide range of applications. We further show how to form a family of secondary evaluation criteria, which by contrasting with the primary criteria, form an analysis of the goodness of calibration of the recognizer’s likelihoods. Finally, we show how to use the logarithmic scoring rule as an objective function for the discriminative training of fusion and calibration of speaker and language recognizers.

...read moreread less

Patent•

Voice control system

[...]

Fletcher R. Rothkopf¹, Stephen Brian Lynch¹, Adam D. Mittleman¹, Hobson Phil M¹•Institutions (1)

Apple Inc.¹

24 Sep 2010

TL;DR: In this article, a first electronic device communicatively coupled to a server and configured to receive a speech recognition file from the server is described, and the file may include a speech-recognition algorithm for converting one or more voice commands into text and a database including a set of commands associated with the voice commands.

...read moreread less

Abstract: One embodiment of a voice control system includes a first electronic device communicatively coupled to a server and configured to receive a speech recognition file from the server. The speech recognition file may include a speech recognition algorithm for converting one or more voice commands into text and a database including one or more entries comprising one or more voice commands and one or more executable commands associated with the one or more voice commands.

...read moreread less

Journal Article•DOI•

Progressive associative phonagnosia: a neuropsychological analysis.

[...]

Julia C. Hailstone¹, Sebastian J. Crutch¹, Martin D. Vestergaard², Roy D. Patterson², Jason D. Warren¹ - Show less +1 more•Institutions (2)

University College London¹, University of Cambridge²

01 Mar 2010-Neuropsychologia

TL;DR: Two patients with progressive phonagnosia in the context of frontotemporal lobar degeneration are described and both patients demonstrated preserved ability to analyse perceptual properties of voices and to recognise vocal emotions.

...read moreread less

Proceedings Article•DOI•

Towards multi-speaker unsupervised speech pattern discovery

[...]

Yaodong Zhang¹, James Glass¹•Institutions (1)

Massachusetts Institute of Technology¹

14 Mar 2010

TL;DR: The viability of using the posteriorgram approach to handle many talkers by finding clusters of words in the TIMIT corpus is demonstrated.

...read moreread less

Abstract: In this paper, we explore the use of a Gaussian posteriorgram based representation for unsupervised discovery of speech patterns. Compared with our previous work, the new approach provides significant improvement towards speaker independence. The framework consists of three main procedures: a Gaussian posteriorgram generation procedure which learns an unsupervised Gaussian mixture model and labels each speech frame with a Gaussian posteriorgram representation; a segmental dynamic time warping procedure which locates pairs of similar sequences of Gaussian posteriorgram vectors; and a graph clustering procedure which groups similar sequences into clusters. We demonstrate the viability of using the posteriorgram approach to handle many talkers by finding clusters of words in the TIMIT corpus.

...read moreread less

Journal Article•DOI•

GMM-SVM Kernel With a Bhattacharyya-Based Distance for Speaker Recognition

[...]

Chang Huai You¹, Kong Aik Lee¹, Haizhou Li¹•Institutions (1)

Agency for Science, Technology and Research¹

01 Aug 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper introduces a Bhattacharyya-based GMM-distance to measure the distance between two GMM distributions, and introduces a GUMI kernel which can be used in conjunction with support vector machine (SVM) for speaker recognition.

...read moreread less

Abstract: Among conventional methods for text-independent speaker recognition, Gaussian mixture model (GMM) is known for its effectiveness and scalability in modeling the spectral distribution of speech. A GMM-supervector characterizes a speaker's voice by the GMM parameters such as the mean vectors, covariance matrices and mixture weights. Besides the first-order statistics, it is generally believed that speaker's cues are partly conveyed by the second-order statistics. In this paper, we introduce a Bhattacharyya-based GMM-distance to measure the distance between two GMM distributions. Subsequently, the GMM-UBM mean interval (GUMI) concept is introduced to derive a GUMI kernel which can be used in conjunction with support vector machine (SVM) for speaker recognition. The GUMI kernel allows us to exploit the speaker's information not only from the mean vectors of GMM but also from the covariance matrices. Moreover, by analyzing the Bhattacharyya-based GMM-distance measure, we extend the Bhattacharyya-based kernel by involving both the mean and covariance statistical dissimilarities. We demonstrate the effectiveness of the new kernel on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2006 dataset.

...read moreread less

Patent•

Personalized text-to-speech synthesis and personalized speech feature extraction

[...]

Qingfang Wang¹, Shouchun He²•Institutions (2)

Ericsson Mobile Communications¹, Sony Broadcast & Professional Research Laboratories²

06 Dec 2010

TL;DR: In this paper, a personalized text-to-speech synthesizer is used to synthesize personalized speech with the speech characteristics of a specific speaker, based on the personalized speech feature library associated with the specific speaker.

...read moreread less

Abstract: A personalized text-to-speech synthesizing device includes: a personalized speech feature library creator, configured to recognize personalized speech features of a specific speaker by comparing a random speech fragment of the specific speaker with preset keywords, thereby to create a personalized speech feature library associated with the specific speaker, and store the personalized speech feature library in association with the specific speaker; and a text-to-speech synthesizer, configured to perform a speech synthesis of a text message from the specific speaker, based on the personalized speech feature library associated with the specific speaker and created by the personalized speech feature library creator, thereby to generate and output a speech fragment having pronunciation characteristics of the specific speaker. A personalized speech feature library of a specific speaker is established without a deliberate training process, and a text is synthesized into personalized speech with the speech characteristics of the speaker.

...read moreread less

Proceedings Article•DOI•

English digits speech recognition system based on Hidden Markov Models

[...]

Ahmad A. M. Abushariah¹, Teddy Surya Gunawan¹, Othman Omran Khalifa¹, Mohammad A. M. Abushariah²•Institutions (2)

International Islamic University Malaysia¹, Information Technology University²

11 May 2010

TL;DR: This work was based on the Hidden Markov Model (HMM), which provides a highly reliable way for recognizing speech, and two modules were developed, namely the isolated words speech recognition and the continuous speech recognition.

...read moreread less

Abstract: This paper aims to design and implement English digits speech recognition system using Matlab (GUI). This work was based on the Hidden Markov Model (HMM), which provides a highly reliable way for recognizing speech. The system is able to recognize the speech waveform by translating the speech waveform into a set of feature vectors using Mel Frequency Cepstral Coefficients (MFCC) technique This paper focuses on all English digits from (Zero through Nine), which is based on isolated words structure. Two modules were developed, namely the isolated words speech recognition and the continuous speech recognition. Both modules were tested in both clean and noisy environments and showed a successful recognition rates. In clean environment and isolated words speech recognition module, the multi-speaker mode achieved 99.5% whereas the speaker-independent mode achieved 79.5%. In clean environment and continuous speech recognition module, the multi-speaker mode achieved 72.5% whereas the speaker-independent mode achieved 56.25%. However in noisy environment and isolated words speech recognition module, the multi-speaker mode achieved 88% whereas the speaker-independent mode achieved 67%. In noisy environment and continuous speech recognition module, the multi-speaker mode achieved 82.5% whereas the speaker-independent mode achieved 76.67%. These recognition rates are relatively successful if compared to similar systems.

...read moreread less

Journal Article•DOI•

Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora

[...]

Junichi Yamagishi¹, Bela Usabaev, Simon King¹, Oliver Watts¹, John Dines², Jilei Tian³, Yong Guan³, Rile Hu³, Keiichiro Oura⁴, Yi-Jian Wu⁴, Keiichi Tokuda⁴, Reima Karhila⁵, Mikko Kurimo⁵ - Show less +9 more•Institutions (5)

University of Edinburgh¹, Idiap Research Institute², Nokia³, Nagoya Institute of Technology⁴, Helsinki University of Technology⁵

01 Jul 2010-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper demonstrates the thousands of voices for HMM-based speech synthesis that are made from several popular ASR corpora such as the Wall Street Journal, Resource Management, Globalphone, and SPEECON databases.

...read moreread less

Abstract: In conventional speech synthesis, large amounts of phonetically balanced speech data recorded in highly controlled recording studio environments are typically required to build a voice. Although using such data is a straightforward solution for high quality synthesis, the number of voices available will always be limited, because recording costs are high. On the other hand, our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an “average voice model” plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on “non-TTS” corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper, we demonstrate the thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal (WSJ0, WSJ1, and WSJCAM0), Resource Management, Globalphone, and SPEECON databases. We also present the results of associated analysis based on perceptual evaluation, and discuss remaining issues.

...read moreread less

Proceedings Article•DOI•

Sparse Representation for Speaker Identification

[...]

Imran Naseem¹, Roberto Togneri¹, Mohammed Bennamoun¹•Institutions (1)

University of Western Australia¹

23 Aug 2010

TL;DR: The closed-set problem of speaker identification is addressed by presenting a novel sparse representation classification algorithm using the GMM mean super vector kernel for all the training utterances to generate a naturally sparse representation.

...read moreread less

Abstract: We address the closed-set problem of speaker identification by presenting a novel sparse representation classification algorithm. We propose to develop an over complete dictionary using the GMM mean super vector kernel for all the training utterances. A given test utterance corresponds to only a small fraction of the whole training database. We therefore propose to represent a given test utterance as a linear combination of all the training utterances, thereby generating a naturally sparse representation. Using this sparsity, the unknown vector of coefficients is computed via l1minimization which is also the sparsest solution [12]. Ideally, the vector of coefficients so obtained has nonzero entries representing the class index of the given test utterance. Experiments have been conducted on the standard TIMIT [14] database and a comparison with the state-of-art speaker identification algorithms yields a favorable performance index for the proposed algorithm.

...read moreread less

Unsupervised Speaker Adaptation based on the Cosine Similarity for Text-Independent Speaker Verification.

[...]

Stephen Shum¹, Najim Dehak¹, Réda Dehak¹, James Glass²•Institutions (2)

Massachusetts Institute of Technology¹, École Normale Supérieure²

01 Jan 2010

TL;DR: The Symmetric Normalization (S-norm) method is proposed, which takes advantage of the symmetry in cosine similarity scoring and achieves competitive performance to that of the ZT-norm while requiring fewer parameter calculations.

...read moreread less

Abstract: This paper proposes a new approach to unsupervised speaker adaptation inspired by the recent success of the factor analysisbased Total Variability Approach to text-independent speaker verification [1]. This approach effectively represents speaker variability in terms of low-dimensional total factor vectors and, when paired alongside the simplicity of cosine similarity scoring, allows for easy manipulation and efficient computation [2]. The development of our adaptation algorithm is motivated by the desire to have a robust method of setting an adaptation threshold, to minimize the amount of required computation for each adaptation update, and to simplify the associated score normalization procedures where possible. To address the final issue, we propose the Symmetric Normalization (S-norm) method, which takes advantage of the symmetry in cosine similarity scoring and achieves competitive performance to that of the ZT-norm while requiring fewer parameter calculations. In subsequent experiments, we also assess an attempt to replace the use of score normalization procedures altogether with a Normalized Cosine Similarity scoring function [3]. We evaluated the performance of our unsupervised speaker adaptation algorithm under various score normalization procedures on the 10sec-10sec and core conditions of the 2008 NIST SRE dataset. Using results without adaptation as our baseline, it was found that the proposed methods are consistent in successfully improving speaker verification performance to achieve state-of-the-art results.

...read moreread less

Journal Article•DOI•

Temporally Weighted Linear Prediction Features for Tackling Additive Noise in Speaker Verification

[...]

Rahim Saeidi¹, Jouni Pohjalainen², Tomi Kinnunen¹, Paavo Alku²•Institutions (2)

University of Eastern Finland¹, Aalto University²

19 Apr 2010-IEEE Signal Processing Letters

TL;DR: Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and they are compared to FFT, which is normally used in computing MFCCs, and to conventional linear prediction and the effect of speech enhancement (spectral subtraction) on the system performance with each of the four feature representations is investigated.

...read moreread less

Abstract: Text-independent speaker verification under additive noise corruption is considered. In the popular mel-frequency cepstral coefficient (MFCC) front-end, the conventional Fourier-based spectrum estimation is substituted with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and they are compared to FFT, which is normally used in computing MFCCs, and to conventional linear prediction. The effect of speech enhancement (spectral subtraction) on the system performance with each of the four feature representations is also investigated. Experiments by the authors on the NIST 2002 SRE corpus indicate that the accuracy of the conventional and proposed features are close to each other on clean data. For factory noise at 0 dB SNR level, baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a preprocessing method. The new features hold a promise for noise-robust speaker verification.

...read moreread less

Journal Article•DOI•

An improved method for voice pathology detection by means of a HMM-based feature space transformation

[...]

Julián D. Arias-Londoño¹, Juan Ignacio Godino-Llorente¹, Nicolás Sáenz-Lechón¹, Víctor Osma-Ruiz¹, Germán Castellanos-Domínguez² - Show less +1 more•Institutions (2)

Technical University of Madrid¹, National University of Colombia²

01 Sep 2010-Pattern Recognition

TL;DR: The proposed feature space transformation technique demonstrates a significant improvement of the performance with no addition of new features to the original input space and it is expected that this technique could provide good results in other areas such as speaker verification and/or identification.

...read moreread less

Proceedings Article•DOI•

Speaker identification by combining MFCC and phase information in noisy environments

[...]

Longbiao Wang¹, Kazue Minami², Kazumasa Yamamoto², Seiichi Nakagawa²•Institutions (2)

Shizuoka University¹, Toyohashi University of Technology²

14 Mar 2010

TL;DR: The effectiveness of phase information for noisy environments on speaker identification in noisy environments with integrated MFCC with phase information is described.

...read moreread less

Abstract: In conventional speaker recognition methods based on MFCC, the phase information has been ignored. Recently, we proposed a method that integrated MFCC with the phase information on a speaker recognition method. Using the phase information, the speaker identification error rate was reduced by 78% for clean speech. In this paper, we describe the effectiveness of phase information for noisy environments on speaker identification. Integrationg MFCC with phase information, the speaker error identification rates were reduced by 20%∼70% in comparison with using only MFCC in noisy environments.

...read moreread less

Proceedings Article•DOI•

Choice of Mel filter bank in computing MFCC of a resampled speech

[...]

Sunil Kumar Kopparapu¹, M Laxminarayana¹•Institutions (1)

Tata Consultancy Services¹

10 May 2010

TL;DR: The effect of resampling a speech signal on these speech features is studied to identify the most effective choice of Mel-filter band that enables the computed MFCC of the resampled speech to be as close as possible to the MF CC of the original speech.

...read moreread less

Abstract: Mel Frequency Cepstral Coefficients (MFCCs) are the most popularly used speech features in many speech and speaker recognition applications. In this paper, we study the effect of resampling a speech signal on these speech features. We first derive a relationship between the MFCC parameters of the resampled speech and the MFCC parameters of the original speech. We propose six methods of calculating the MFCC parameters of downsampled speech by transforming the Mel filter bank used to compute MFCC of the original speech. We then experimentally compute the MFCC parameters of the down sampled speech using the proposed methods and compute the Pearson coefficient between the MFCC parameters of the downsampled speech and that of the original speech to identify the most effective choice of Mel-filter band that enables the computed MFCC of the resampled speech to be as close as possible to the MFCC of the original speech.

...read moreread less

Collapse