Showing papers on "Speaker recognition published in 2017"


Proceedings ArticleDOI
15 Mar 2017
TL;DR: This work studies the use of deep learning to automatically discover emotionally relevant features from speech and proposes a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient.
Abstract: Automatic emotion recognition from speech is a challenging task which relies heavily on the effectiveness of the speech features used for classification. In this work, we study the use of deep learning to automatically discover emotionally relevant features from speech. It is shown that using a deep recurrent neural network, we can learn both the short-time frame-level acoustic features that are emotionally relevant, as well as an appropriate temporal aggregation of those features into a compact utterance-level representation. Moreover, we propose a novel strategy for feature pooling over time which uses local attention in order to focus on specific regions of a speech signal that are more emotionally salient. The proposed solution is evaluated on the IEMOCAP corpus, and is shown to provide more accurate predictions compared to existing emotion recognition algorithms.
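
A minimal sketch of attention-based pooling over frame-level features, in the spirit of the local attention described above (NumPy only; the attention vector and feature dimensions are illustrative placeholders, not the paper's learned parameters):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, u):
    """Weighted average of frame-level features.

    frames: (T, D) frame-level features from a recurrent encoder
    u:      (D,)   attention parameter (hypothetical stand-in for a learned vector)
    Returns a (D,) utterance-level representation that emphasises frames
    with large scores u . h_t, i.e. the more salient regions of the signal.
    """
    scores = frames @ u        # (T,) unnormalised attention scores
    alphas = softmax(scores)   # (T,) attention weights, sum to 1
    return alphas @ frames     # (D,) weighted mean over time

# toy usage with random features
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 64))   # 200 frames, 64-dim features
u = rng.normal(size=64)
print(attention_pool(frames, u).shape)   # (64,)
```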

556 citations


Journal ArticleDOI
TL;DR: This paper proposes to analyze a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering.
Abstract: Speech enhancement and separation are core problems in audio signal processing, with commercial applications in devices as diverse as mobile phones, conference call systems, hands-free systems, or hearing aids. In addition, they are crucial preprocessing steps for noise-robust automatic speech and speaker recognition. Many devices now have two to eight microphones. The enhancement and separation capabilities offered by these multichannel interfaces are usually greater than those of single-channel interfaces. Research in speech enhancement and separation has followed two convergent paths, starting with microphone array processing and blind source separation, respectively. These communities are now strongly interrelated and routinely borrow ideas from each other. Yet, a comprehensive overview of the common foundations and the differences between these approaches is lacking at present. In this paper, we propose to fill this gap by analyzing a large number of established and recent techniques according to four transverse axes: 1) the acoustic impulse response model, 2) the spatial filter design criterion, 3) the parameter estimation algorithm, and 4) optional postfiltering. We conclude this overview paper by providing a list of software and data resources and by discussing perspectives and future trends in the field.

452 citations


Posted Content
TL;DR: Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.
Abstract: We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering. We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean pool to produce utterance-level speaker embeddings, and train using triplet loss based on cosine similarity. Experiments on three distinct datasets suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For example, Deep Speaker reduces the verification equal error rate by 50% (relatively) and improves the identification accuracy by 60% (relatively) on a text-independent dataset. We also present results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition.
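
The cosine-similarity triplet loss that Deep Speaker trains with can be written as a hinge on similarity differences; the sketch below is a simplified illustration (NumPy; the margin and embedding size are assumed values, not the paper's settings):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss pushing the anchor-positive cosine similarity to exceed
    the anchor-negative similarity by at least `margin`."""
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)

rng = np.random.default_rng(1)
a, p, n = (rng.normal(size=512) for _ in range(3))   # 512-dim embeddings (illustrative)
print(triplet_loss(a, p, n))
```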

446 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: It is found that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.
Abstract: In this work, we conduct a detailed evaluation of various all-neural, end-to-end trained, sequence-to-sequence models applied to the task of speech recognition. Notably, each of these systems directly predicts graphemes in the written domain, without using an external pronunciation lexicon, or a separate language model. We examine several sequence-to-sequence models including connectionist temporal classification (CTC), the recurrent neural network (RNN) transducer, an attention-based model, and a model which augments the RNN transducer with an attention mechanism. We find that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.

271 citations


Proceedings ArticleDOI
05 Mar 2017
TL;DR: This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Abstract: Speaker diarization is an important front-end for many speech technologies in the presence of multiple speakers, but current methods that employ i-vector clustering for short segments of speech are potentially too cumbersome and costly for the front-end role. In this work, we propose an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely. The proposed architecture simultaneously learns a fixed-dimensional embedding for acoustic segments of variable length and a scoring function for measuring the likelihood that the segments originated from the same or different speakers. Through tests on the CALLHOME conversational telephone speech corpus, we demonstrate that, in addition to streamlining the diarization architecture, the proposed system matches or exceeds the performance of state-of-the-art baselines. We also show that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.

248 citations


Proceedings ArticleDOI
30 Oct 2017
TL;DR: This work proposes VoiceGesture, a liveness detection system for replay attack detection on smartphones that detects a live user by leveraging both the unique articulatory gesture of the user when speaking a passphrase and the mobile audio hardware advances.
Abstract: Voice biometrics is drawing increasing attention as it is a promising alternative to legacy passwords for mobile authentication. Recently, a growing body of work shows that voice biometrics is vulnerable to spoofing through replay attacks, where an adversary tries to spoof voice authentication systems by using a pre-recorded voice sample collected from a genuine user. In this work, we propose VoiceGesture, a liveness detection system for replay attack detection on smartphones. It detects a live user by leveraging both the unique articulatory gesture of the user when speaking a passphrase and advances in mobile audio hardware. Specifically, our system re-uses the smartphone as a Doppler radar, which transmits a high-frequency acoustic sound from the built-in speaker and listens to the reflections at the microphone when a user speaks a passphrase. The signal reflections due to the user's articulatory gesture result in Doppler shifts, which are then analyzed for live user detection. VoiceGesture is practical as it requires neither cumbersome operations nor additional hardware, but only a speaker and a microphone that are commonly available on smartphones. Our experimental evaluation with 21 participants and different types of phones shows that it achieves over 99% detection accuracy at around 1% Equal Error Rate (EER). Results also show that it is robust to different phone placements and is able to work with different sampling frequencies.
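
As a rough illustration of the Doppler idea, one can emit a high-frequency tone and measure how much energy leaks into the side bands around the carrier while the passphrase is spoken; the sketch below is an assumption-laden simplification (the carrier frequency, bandwidths, and threshold are illustrative values, not the VoiceGesture parameters):

```python
import numpy as np

def doppler_energy(mic_signal, fs=48000, carrier_hz=20000, band_hz=500):
    """Ratio of energy in the Doppler side bands around the carrier to the
    energy at the carrier itself; articulatory motion while speaking shifts
    part of the reflected tone into the side bands."""
    spectrum = np.abs(np.fft.rfft(mic_signal * np.hanning(len(mic_signal))))
    freqs = np.fft.rfftfreq(len(mic_signal), d=1.0 / fs)
    carrier = spectrum[np.abs(freqs - carrier_hz) < 50].sum()
    sideband = spectrum[(np.abs(freqs - carrier_hz) >= 50)
                        & (np.abs(freqs - carrier_hz) < band_hz)].sum()
    return sideband / (carrier + 1e-9)

# a live speaker would be declared when the side-band ratio exceeds a
# calibrated threshold (hypothetical value):
# is_live = doppler_energy(recording) > 0.05
```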

173 citations


Posted Content
TL;DR: This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out- of-domain data from voice search logs.
Abstract: For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.
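
The diarization back-end can be approximated by clustering segment-level d-vectors on a cosine affinity matrix; this sketch uses scikit-learn spectral clustering as a stand-in for the paper's non-parametric clustering and assumes the number of speakers is known:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize(d_vectors, n_speakers):
    """Assign a speaker label to each segment embedding.

    d_vectors:  (N, D) array of segment-level speaker embeddings
    n_speakers: assumed known here; in practice it must be estimated
    """
    d_vectors = d_vectors / np.linalg.norm(d_vectors, axis=1, keepdims=True)
    affinity = np.clip(d_vectors @ d_vectors.T, 0.0, 1.0)  # cosine similarities
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return labels
```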

170 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This work proposes to use deep neural networks to learn short-duration speaker embeddings based on a deep convolutional architecture wherein recordings are treated as images, and advocates treating utterances as images or 'speaker snapshots', much like in face recognition.
Abstract: The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the idea of learning a speaker classifier one step further: we apply deep neural networks directly to time-frequency speech representations. We propose two feedforward network architectures for this task. Our best model is based on a deep convolutional architecture wherein recordings are treated as images. From our experimental findings we advocate treating utterances as images or 'speaker snapshots', much like in face recognition. Our convolutional speaker embeddings perform significantly better than i-vectors when scoring is done using cosine distance, where the relative improvement is 23.5%. The proposed deep embeddings combined with cosine distance also outperform a state-of-the-art i-vector verification system by 1%, providing further empirical evidence in favor of our learned speaker features.

155 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: Results from the evaluation suggest that systems found it easier to reject non-target trials where the test speaker was among the target speakers.
Abstract: In 2018, the U.S. National Institute of Standards and Technology (NIST) conducted the most recent in an ongoing series of speaker recognition evaluations (SRE). SRE18 was organized in a similar manner to SRE16, focusing on speaker detection over conversational telephony speech (CTS) collected outside North America. SRE18 also featured several new aspects, including two new data domains, namely voice over internet protocol (VoIP) and audio extracted from amateur online videos (AfV), as well as a new language (Tunisian Arabic). A total of 78 organizations (forming 48 teams) from academia and industry participated in SRE18 and submitted 129 valid system outputs under the fixed and open training conditions first introduced in SRE16. This paper presents an overview of the evaluation and several analyses of system performance for all primary conditions in SRE18. The evaluation results suggest that 1) speaker recognition on AfV was more challenging than on telephony data, 2) speaker representations (aka embeddings) extracted using end-to-end neural network frameworks were most effective, 3) top performing systems exhibited similar performance, and 4) the greatest performance improvements were largely due to data augmentation, the use of extended and more complex models for data representation, and effective use of the provided development sets.

124 citations


Journal ArticleDOI
TL;DR: It is identified that the current SI research trend is to develop a robust universal SI framework to address the important problems of SI such as adaptability, complexity, multi-lingual recognition, and noise robustness.
Abstract: Speaker Identification (SI) is the process of identifying the speaker of a given utterance by comparing the voice biometrics of the utterance with utterance models stored beforehand. SI technologies have taken a new direction due to advances in artificial intelligence and have been used widely in various domains. Feature extraction is one of the most important aspects of SI, which significantly influences the SI process and performance. This systematic review is conducted to identify, compare, and analyze various feature extraction approaches, methods, and algorithms of SI to provide a reference on feature extraction approaches for SI applications and future studies. The review was conducted according to the Kitchenham systematic review methodology and guidelines, and provides an in-depth analysis of proposals and implementations of SI feature extraction methods discussed in the literature between 2011 and 2016. Three research questions were determined and an initial set of 535 publications was identified to answer the questions. After applying exclusion criteria, 160 related publications were shortlisted and reviewed in this paper; these papers were considered to answer the research questions. Results indicate that pure Mel-Frequency Cepstral Coefficients (MFCCs) based feature extraction approaches have been used more than any other approach. Furthermore, other MFCC variations, such as MFCC fusion and cleansing approaches, are proven to be very popular as well. This study identified that the current SI research trend is to develop a robust universal SI framework to address the important problems of SI such as adaptability, complexity, multi-lingual recognition, and noise robustness. The results presented in this research are based on past publications, citations, and number of implementations, with citations being most relevant. This paper also presents the general process of SI.

122 citations


Journal ArticleDOI
TL;DR: This paper proposes the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio–visual streams using the learned multimodal features.
Abstract: Audio–visual recognition (AVR) has been considered as a solution for speech recognition tasks when the audio is corrupted, as well as a visual recognition method used for speaker verification in multi-speaker scenarios. The approach of AVR systems is to leverage the extracted information from one modality to improve the recognition ability of the other modality by complementing the missing information. The essential problem is to find the correspondence between the audio and visual streams, which is the goal of this paper. We propose the use of a coupled 3D convolutional neural network (3D CNN) architecture that can map both modalities into a representation space to evaluate the correspondence of audio–visual streams using the learned multimodal features. The proposed architecture will incorporate both spatial and temporal information jointly to effectively find the correlation between temporal information for different modalities. By using a relatively small network architecture and much smaller data set for training, our proposed method surpasses the performance of the existing similar methods for audio–visual matching, which use 3D CNNs for feature representation. We also demonstrate that an effective pair selection method can significantly increase the performance. The proposed method achieves relative improvements over 20% on the equal error rate and over 7% on the average precision in comparison to the state-of-the-art method.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The analysis shows that the adaptive score normalization (using the top-scoring files per trial) selects cohorts in which 68% of the recordings are of the same language, and 92% of the same gender, as the enrollment and test recordings.
Abstract: The NIST Speaker Recognition Evaluation 2016 revealed the importance of score normalization for mismatched data conditions. This paper analyzes several score normalization techniques for test conditions with multiple languages. The best performing one for a PLDA classifier is an adaptive s-norm with 30% relative improvement over the system without any score normalization. The analysis shows that the adaptive score normalization (using the top-scoring files per trial) selects cohorts in which 68% of the recordings are of the same language, and 92% of the same gender, as the enrollment and test recordings. Our results suggest that the data used to select score normalization cohorts should be a pool of several languages and channels and, if possible, a subset of it should contain data from the target domain.
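
Adaptive s-norm, as analysed here, normalizes each raw trial score using statistics of the top-scoring cohort files on the enrollment and test sides; a compact sketch follows (the cohort size is an illustrative choice, not the paper's setting):

```python
import numpy as np

def adaptive_s_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=200):
    """Adaptive symmetric score normalization.

    raw_score:            score of the (enrollment, test) trial
    enroll_cohort_scores: scores of the enrollment model against the cohort
    test_cohort_scores:   scores of the test segment against the cohort
    top_n:                number of best-matching cohort files kept per side
                          (illustrative value)
    """
    e_top = np.sort(enroll_cohort_scores)[-top_n:]
    t_top = np.sort(test_cohort_scores)[-top_n:]
    z = (raw_score - e_top.mean()) / (e_top.std() + 1e-9)
    t = (raw_score - t_top.mean()) / (t_top.std() + 1e-9)
    return 0.5 * (z + t)
```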

Proceedings ArticleDOI
05 Mar 2017
TL;DR: This study proposes to use DNNs to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time, which demonstrates the effectiveness of these methods on speech emotion and age/gender recognition tasks.
Abstract: Accurately recognizing speaker emotion and age/gender from speech can provide better user experience for many spoken dialogue systems. In this study, we propose to use deep neural networks (DNNs) to encode each utterance into a fixed-length vector by pooling the activations of the last hidden layer over time. The feature encoding process is designed to be jointly trained with the utterance-level classifier for better classification. A kernel extreme learning machine (ELM) is further trained on the encoded vectors for better utterance-level classification. Experiments on a Mandarin dataset demonstrate the effectiveness of our proposed methods on speech emotion and age/gender recognition tasks.

Proceedings ArticleDOI
19 Jun 2017
TL;DR: Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
Abstract: Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that in parallel to conventional text input also take speaker, gender, and age codes as inputs, in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders and ages between tens and eighties, we performed three experiments: 1) First, we used a subset of speakers to construct a DNN-based, multi-speaker acoustic model with speaker codes. 2) Next, we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material. 3) Finally, we experimented with manually manipulating input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.

Proceedings ArticleDOI
10 Mar 2017
TL;DR: Different types of methods for collecting emotional speech data, and the issues related to them, are covered along with a review of previous work.
Abstract: Emotion recognition, or affect detection, from speech is an old and challenging problem in the field of artificial intelligence, and many significant research works have addressed it. In this paper, recent work on affect detection from speech and the different issues related to it are presented. The primary challenges of emotion recognition are choosing the emotion recognition corpora (speech databases), identifying the different features related to speech, and making an appropriate choice of classification model. Different types of methods for collecting emotional speech data, and the issues related to them, are covered along with a review of previous work. A literature survey on the different features used for recognizing emotion from human speech is also discussed. The significance of various classification models is presented along with a review of some recent research. A detailed description of a key feature extraction technique, the Mel-Frequency Cepstral Coefficient (MFCC), and a brief description of the working principles of some classification models are also given. In this paper the terms affect detection and emotion recognition are used interchangeably.
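
Since MFCCs are the feature highlighted above, a minimal extraction example may help; it relies on librosa's standard MFCC routine, with a placeholder file name and commonly used (but assumed) frame settings:

```python
import librosa
import numpy as np

# load an utterance (path is a placeholder)
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 cepstral coefficients per 25 ms frame with a 10 ms hop
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),
                            hop_length=int(0.010 * sr))

# delta and delta-delta features are commonly appended
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),
                   librosa.feature.delta(mfcc, order=2)])
print(feats.shape)   # (39, number_of_frames)
```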

Proceedings ArticleDOI
05 Mar 2017
TL;DR: The final results on the speaker diarization system indicate that the use of CNN-based speaker change detection is beneficial, with a relative improvement in diarization error rate of 28%.
Abstract: The aim of this paper is to propose a speaker change detection technique based on a Convolutional Neural Network (CNN) and evaluate its contribution to the performance of a speaker diarization system for telephone conversations. For the comparison we used an i-vector based speaker diarization system. The baseline speaker change detection uses the Generalized Likelihood Ratio (GLR) metric. Experiments were conducted on the English part of the CallHome corpus. Our proposed CNN speaker change detection outperformed the GLR approach, reducing the Equal Error Rate by 46% relative. The final results on the speaker diarization system indicate that the use of CNN-based speaker change detection is beneficial, with a relative improvement in diarization error rate of 28%.

Journal ArticleDOI
TL;DR: This work proposes a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification and presents the best published results obtained with a single system on both RSR2015 and RedDots dataset.
Abstract: The low-dimensional i-vector representation of speech segments is used in state-of-the-art text-independent speaker verification systems. However, i-vectors were deemed unsuitable for the text-dependent task, where simpler and older speaker recognition approaches were found more effective. In this work, we propose a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification. In our approach, the Universal Background Model (UBM) for training the phrase-independent i-vector extractor is based on a set of monophone HMMs instead of the standard Gaussian Mixture Model (GMM). To compensate for the channel variability, we propose to precondition i-vectors using a regularized variant of within-class covariance normalization, which can be robustly estimated in a phrase-dependent fashion on the small datasets available for the text-dependent task. The verification scores are cosine similarities between the i-vectors normalized using phrase-dependent s-norm. The experimental results on the RSR2015 and RedDots databases confirm the effectiveness of the proposed approach, especially in rejecting test utterances with a wrong phrase. A simple MFCC based i-vector/HMM system performs competitively when compared to very computationally expensive DNN-based approaches or the conventional relevance MAP GMM-UBM, which does not allow for compact speaker representations. To our knowledge, this paper presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.

Proceedings ArticleDOI
18 Jun 2017
TL;DR: An analysis of how a text-independent voice identification system can be built is presented and the ability for such systems to be utilized in both speaker identification and speaker verification tasks is shown.
Abstract: Speaker identification systems are becoming more important in today's world. This is especially true as devices rely on the user to speak commands. In this article, an analysis of how a text-independent voice identification system can be built is presented. Extracting the Mel-Frequency Cepstral Coefficients is evaluated and a support vector machine is trained and tested on two different data sets, one from LibriSpeech and one from in-house recorded audio files. The results show the ability for such systems to be utilized in both speaker identification and speaker verification tasks.
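
A hedged sketch of such a pipeline: pool MFCC statistics per utterance and train a support vector machine speaker classifier (librosa and scikit-learn; the pooling and kernel choices are simplifying assumptions rather than the authors' exact configuration):

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(path):
    """One fixed-length vector per utterance: mean and std of MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_speaker_id(paths, speaker_labels):
    """paths: list of wav files; speaker_labels: speaker id per file."""
    X = np.stack([utterance_features(p) for p in paths])
    clf = SVC(kernel="rbf", probability=True)
    return clf.fit(X, speaker_labels)

# identification of a new recording (placeholder file name):
# clf = train_speaker_id(train_paths, train_labels)
# predicted_speaker = clf.predict([utterance_features("test.wav")])
```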

Posted Content
TL;DR: In this article, a convolutional time-delay deep neural network structure (CT-DNN) was proposed for speaker feature learning, which can produce high quality speaker features.
Abstract: Recently deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.

Journal ArticleDOI
TL;DR: The proposed AFDBN method finds the optimal weights used to recognize emotion efficiently, attaining 99.17% accuracy on the Berlin database and 97.74% on the Telugu database.
Abstract: Due to the rapid development of human-computer interaction systems, emotion recognition has become a challenging task. Various handheld devices such as smartphones and PCs are used to recognize human emotion from speech. However, emotion recognition is difficult for human-computer interaction systems because emotional expression differs from speaker to speaker. To address this problem, the Adaptive Fractional Deep Belief Network (AFDBN) is proposed in this paper. Initially, spectral features are extracted from the input speech signal: the tonal power ratio, spectral flux, pitch chroma, and MFCCs. The extracted feature set is then given to the network for classification. The AFDBN is newly designed by combining fractional theory with a deep belief network, and is used to find the optimal weights for recognizing emotion efficiently. Finally, the experimental results are evaluated, and performance is analyzed with evaluation metrics and compared with existing systems. The proposed method attains 99.17% accuracy on the Berlin database and 97.74% on the Telugu database.

Journal ArticleDOI
14 Dec 2017-PLOS ONE
TL;DR: The goal of this work is to identify and interpret new reliable and complementary articulatory biomarkers that could be applied to predict/evaluate Parkinson’s Disease from a diadochokinetic test, contributing to the possibility of a further multidimensional analysis of the speech of parkinsonian patients.
Abstract: Although a large number of acoustic indicators have already been proposed in the literature to evaluate the hypokinetic dysarthria of people with Parkinson's Disease, the goal of this work is to identify and interpret new reliable and complementary articulatory biomarkers that could be applied to predict/evaluate Parkinson's Disease from a diadochokinetic test, contributing to the possibility of a further multidimensional analysis of the speech of parkinsonian patients. The new biomarkers proposed are based on the kinetic behaviour of the envelope trace, which is directly linked to the articulatory dysfunctions introduced by the disease from the early stages. The interest of these new articulatory indicators lies in their ease of identification and interpretation, and their potential to be translated into computer-based automatic methods to screen for the disease from speech. Throughout this paper, the accuracy provided by these acoustic kinetic biomarkers is compared with that obtained with a baseline system based on speaker identification techniques. Results show accuracies around 85% that are in line with those obtained with complex state-of-the-art speaker recognition techniques, but with an easier physical interpretation, which opens the possibility of transfer to a clinical setting.
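
The envelope-kinetics idea can be illustrated by extracting the amplitude envelope of the diadochokinetic recording and its rate of change; the sketch below is only a rough proxy for the paper's biomarkers (the smoothing window is an assumed value):

```python
import numpy as np
from scipy.signal import hilbert

def envelope_kinetics(y, fs, smooth_ms=20):
    """Amplitude envelope of the signal and its first derivative, a rough
    proxy for the 'kinetic behaviour' of the envelope trace."""
    env = np.abs(hilbert(y))                                  # analytic-signal envelope
    win = max(1, int(fs * smooth_ms / 1000))
    env = np.convolve(env, np.ones(win) / win, mode="same")   # light smoothing
    velocity = np.gradient(env) * fs                          # envelope slope per second
    return env, velocity
```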

Journal ArticleDOI
TL;DR: Through a comprehensive study, it is shown that the multitask recurrent neural net models deliver improved performance on both automatic speech and speaker recognition tasks as compared to single-task systems.
Abstract: Automatic speech and speaker recognition are traditionally treated as two independent tasks and are studied separately. The human brain in contrast deciphers the linguistic content, and the speaker traits from the speech in a collaborative manner. This key observation motivates the work presented in this paper. A collaborative joint training approach based on multitask recurrent neural network models is proposed, where the output of one task is backpropagated to the other tasks. This is a general framework for learning collaborative tasks and fits well with the goal of joint learning of automatic speech and speaker recognition. Through a comprehensive study, it is shown that the multitask recurrent neural net models deliver improved performance on both automatic speech and speaker recognition tasks as compared to single-task systems. The strength of such multitask collaborative learning is analyzed, and the impact of various training configurations is investigated.

Proceedings ArticleDOI
06 Mar 2017
TL;DR: A novel raw waveform based deep model for spoofing detection is presented, which jointly acts as a feature extractor and classifier, thus allowing it to directly classify speech signals.
Abstract: Although recent progress in speaker verification has produced powerful models, malicious attacks in the form of spoofed speech are generally not coped with. Recent results in the ASVSpoof2015 and BTAS2016 challenges indicate that spoof-aware features are a possible solution to this problem. Most successful methods in both challenges focus on spoof-aware features rather than on a powerful classifier. In this paper we present a novel raw-waveform based deep model for spoofing detection, which jointly acts as a feature extractor and classifier, thus allowing it to directly classify speech signals. This approach can be considered an end-to-end classifier, which removes the need for any pre- or post-processing on the data, making training and evaluation a streamlined process and consuming less time than other neural-network based approaches. The experiments on the BTAS2016 dataset show that system performance is significantly improved by the proposed raw-waveform convolutional long short-term memory deep neural network (CLDNN), from the previous best published 1.26% half total error rate (HTER) to the current 0.82% HTER. Moreover, it shows that the proposed system also performs well under the unknown (RE-PH2-PH3, RE-LPPH2-PH3) conditions.

Journal ArticleDOI
TL;DR: The multi-level framework combining the three modules achieves better performance than the individual modules, which shows its potential for practical deployment.
Abstract: This work presents the development of a multi-level speech-based person authentication system with attendance marking as an application. The multi-level system consists of three different speaker verification modules, namely voice-password, text-dependent, and text-independent speaker verification. The three speaker verification modules are combined in a sequential manner to develop a multi-level framework, which is ported over a telephone network through an interactive voice response (IVR) system to aid remote authentication. Users call from a fixed set of mobile handsets to verify their claims against their respective models, and each claim is then authenticated in a multi-level mode using the three modules stated above. An analysis of the performance of the multi-level system in attendance marking over a period of two months is presented. The multi-level framework combining the three modules achieves better performance than the individual modules, which shows its potential for practical deployment.

Proceedings ArticleDOI
05 Mar 2017
TL;DR: A two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks is proposed, and it substantially outperforms one-stage enhancement baselines.
Abstract: In daily listening environments, speech is commonly corrupted by room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also severely degrade the performance of automatic speech and speaker recognition systems. In this paper, we propose a two-stage algorithm to deal with the confounding effects of noise and reverberation separately, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase information during training. As the objective function emphasizes more important time-frequency (T-F) units, better estimated magnitude is obtained during testing. By jointly training the two-stage model to optimize the proposed objective function, our algorithm improves objective metrics of speech intelligibility and quality significantly, and substantially outperforms one-stage enhancement baselines.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: A sequence-summarizing scheme is proposed that enables the speaker representation to be learned jointly with the network; the method's potential as a front-end for speech recognition is demonstrated and the effect of additional noise on its performance is explored.
Abstract: Recently, schemes employing deep neural networks (DNNs) for extracting speech from noisy observations have demonstrated great potential for noise-robust automatic speech recognition. However, these schemes are not well suited when the interfering noise is another speaker. To enable extracting a target speaker from a mixture of speakers, we have recently proposed to inform the neural network using speaker information extracted from an adaptation utterance of the same speaker. In our previous work, we explored ways of informing the network about the speaker and found a speaker adaptive layer approach to be suitable for this task. In those experiments, we used speaker features designed for speaker recognition tasks as the additional speaker information, which may not be optimal for the speaker extraction task. In this paper, we propose the use of a sequence-summarizing scheme that enables the speaker representation to be learned jointly with the network. Furthermore, we extend the previous experiments to demonstrate the potential of our proposed method as a front-end for speech recognition and explore the effect of additional noise on the performance of the method.

Journal ArticleDOI
13 Mar 2017
TL;DR: A speaker adaptation method based on a combination of L2 regularization and confusion-reducing regularization, which can enhance discriminability between the categorical distributions of the KL-HMM states while preserving speaker-specific information, is proposed.
Abstract: This paper addresses the problem of recognizing the speech uttered by patients with dysarthria, which is a motor speech disorder impeding the physical production of speech. Patients with dysarthria have articulatory limitations, and therefore, they often have trouble in pronouncing certain sounds, resulting in undesirable phonetic variation. Modern automatic speech recognition systems designed for regular speakers are ineffective for dysarthric sufferers due to this phonetic variation. To capture the phonetic variation, a Kullback–Leibler divergence-based hidden Markov model (KL-HMM) is adopted, where the emission probability of each state is parameterized by a categorical distribution using phoneme posterior probabilities obtained from a deep neural network-based acoustic model. To further reflect speaker-specific phonetic variation patterns, a speaker adaptation method based on a combination of L2 regularization and confusion-reducing regularization, which can enhance discriminability between the categorical distributions of the KL-HMM states while preserving speaker-specific information, is proposed. Evaluation of the proposed speaker adaptation method on a database of several hundred words for 30 speakers consisting of 12 mildly dysarthric, 8 moderately dysarthric, and 10 non-dysarthric control speakers showed that the proposed approach significantly outperformed the conventional deep neural network-based speaker adapted system on dysarthric as well as non-dysarthric speech.
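
For context, a KL-HMM scores each frame by the divergence between a state's categorical distribution over phoneme classes and the DNN posterior for that frame; the sketch below shows one common variant of that local cost and is illustrative rather than the authors' implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def kl_hmm_local_cost(state_dist, dnn_posterior):
    """Local emission cost of one KL-HMM state for one frame: the divergence
    between the state's categorical distribution over phoneme classes and the
    DNN posterior for that frame (one of several KL variants in the literature)."""
    return kl_divergence(state_dist, dnn_posterior)
```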

Proceedings ArticleDOI
13 Sep 2017
TL;DR: The study evaluated the influence of manual and speech input on driving quality, stress and strain, and user acceptance when using a Driver Information System (DIS).
Abstract: The study evaluated the influence of manual and speech input on driving quality, stress and strain, and user acceptance when using a Driver Information System (DIS). The study is part of the EU project SENECA. Sixteen subjects took part in the investigation. A car was equipped with a modified DIS to carry out the evaluation in real traffic situations. The DIS used is a standard product with manual input control elements; it was extended with a speech input system using a speaker-independent speech recogniser. For the use of the different DIS devices (radio, CD player, telephone, navigation), 12 representative tasks were given to the subjects. Independently of the type of task, speech input needs longer operation times than manual input. For complex tasks, a distinct improvement in driving quality can be observed with speech instead of manual input. The subjective feeling of safety is stronger with speech than with manual input. With speech input, the number of glances at the mirrors and to the side is clearly higher than with manual input. The most frequent user errors can be explained by problems with spelling and by the selection of wrong speech commands. The speech recognition error rate amounts on average to 20.6%, which makes it necessary to increase the recognition performance of the examined speech system. This improvement of system performance is the task of the development of the system demonstrator in the second half of the SENECA project.

Posted Content
TL;DR: This work develops an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline that outperforms the i- vector +PLDA baseline on both long and short duration utterances.
Abstract: Recently several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, for text-independent tasks with longer utterances, end-to-end systems are still outperformed by standard i-vector + PLDA systems. In this work, we develop an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline. The system is then further trained in an end-to-end manner but regularized so that it does not deviate too far from the initial system. In this way we mitigate overfitting which normally limits the performance of end-to-end systems. The proposed system outperforms the i-vector + PLDA baseline on both long and short duration utterances.

Patent
11 Jan 2017
TL;DR: A voice recognition method based on a long short-term memory (LSTM) recurrent neural network, trained end-to-end with connectionist temporal classification (CTC), is disclosed.
Abstract: The invention discloses a voice recognition method using a long short-term memory (LSTM) recurrent neural network. The method comprises training and recognition. The training process introduces voice data and text data to generate jointly trained acoustic and language models, and uses an RNN transducer for decoding to form the model parameters. The recognition process converts the voice input into a spectrogram through the Fourier transform, performs directed-search decoding with the LSTM recurrent neural network, and finally generates the recognition result. The method adopts recurrent neural networks (RNNs) trained end-to-end with connectionist temporal classification (CTC). The LSTM units work well and, combined with multi-level representations, prove effective in deep networks. Only one neural network model (an end-to-end model) exists between the speech features (the input end) and the character string (the output end), and the network can be trained directly with an objective function that acts as a proxy for WER, which avoids wasted effort in optimizing separate objective functions.
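
The recipe sketched in the patent (an LSTM acoustic model on spectrogram input, trained end-to-end with CTC) corresponds to a standard setup; below is a minimal PyTorch illustration under assumed input and vocabulary sizes:

```python
import torch
import torch.nn as nn

class SpeechRecognizer(nn.Module):
    """Spectrogram frames -> per-frame character log-probabilities."""
    def __init__(self, n_freq=161, n_chars=29, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, n_chars)   # 29 classes incl. CTC blank at index 0

    def forward(self, x):                            # x: (batch, time, n_freq)
        h, _ = self.lstm(x)
        return self.proj(h).log_softmax(dim=-1)      # (batch, time, n_chars)

model = SpeechRecognizer()
ctc = nn.CTCLoss(blank=0)

x = torch.randn(2, 100, 161)                         # two utterances, 100 frames each
targets = torch.randint(1, 29, (2, 20))              # dummy character indices (non-blank)
log_probs = model(x).transpose(0, 1)                 # CTCLoss expects (time, batch, chars)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 100, dtype=torch.long),
           target_lengths=torch.full((2,), 20, dtype=torch.long))
loss.backward()
```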