
Showing papers on "Speaker diarisation published in 2016"


Posted Content
TL;DR: This work presents persona-based models for handling the issue of speaker consistency in neural response generation that yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models.
Abstract: We present persona-based models for handling the issue of speaker consistency in neural response generation. A speaker model encodes personas in distributed embeddings that capture individual characteristics such as background information and speaking style. A dyadic speaker-addressee model captures properties of interactions between two interlocutors. Our models yield qualitative performance improvements in both perplexity and BLEU scores over baseline sequence-to-sequence models, with similar gains in speaker consistency as measured by human judges.
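The core mechanism of the speaker model can be illustrated with a short sketch; this assumes a PyTorch LSTM decoder with illustrative layer sizes and names, not the authors' implementation:

```python
# Minimal sketch (not the authors' code): a response decoder that conditions every
# generation step on a learned persona embedding for the responding speaker.
import torch
import torch.nn as nn

class PersonaDecoder(nn.Module):
    def __init__(self, vocab_size, n_speakers, emb_dim=128, spk_dim=64, hid_dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)
        self.spk_emb = nn.Embedding(n_speakers, spk_dim)   # distributed persona vector
        self.rnn = nn.LSTM(emb_dim + spk_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tokens, speaker_ids, state=None):
        # tokens: (batch, time); speaker_ids: (batch,)
        t = self.tok_emb(tokens)
        s = self.spk_emb(speaker_ids).unsqueeze(1).expand(-1, t.size(1), -1)
        h, state = self.rnn(torch.cat([t, s], dim=-1), state)   # persona fed at every step
        return self.out(h), state

# Next-token logits for two response prefixes generated as speaker 3 and speaker 7.
dec = PersonaDecoder(vocab_size=1000, n_speakers=10)
logits, _ = dec(torch.randint(0, 1000, (2, 5)), torch.tensor([3, 7]))
```

A speaker-addressee variant would condition on a combination of two such embeddings rather than one.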

647 citations


Journal ArticleDOI
TL;DR: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously.
Abstract: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of the presented metrics.
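The segment-based error rate discussed in the paper can be made concrete with a small sketch; the function below is an independent illustration of the standard definition (substitutions, deletions and insertions counted per segment), not the authors' toolbox:

```python
# Minimal sketch: segment-based error rate for polyphonic sound event detection.
# reference/estimated: one set of active event labels per fixed-length segment.
def segment_error_rate(reference, estimated):
    S = D = I = N = 0
    for ref, est in zip(reference, estimated):
        fn = len(ref - est)        # events missed in this segment
        fp = len(est - ref)        # events falsely detected in this segment
        S += min(fn, fp)           # a miss paired with a false alarm counts as a substitution
        D += max(0, fn - fp)       # remaining misses are deletions
        I += max(0, fp - fn)       # remaining false alarms are insertions
        N += len(ref)              # number of reference events
    return (S + D + I) / max(N, 1)

# Two segments: {'speech','dog'} vs {'speech'}, then {'car'} vs {'car','dog'}.
print(segment_error_rate([{'speech', 'dog'}, {'car'}], [{'speech'}, {'car', 'dog'}]))  # ~0.67
```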

493 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: A data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time.
Abstract: In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly optimizes the system's components using the same evaluation protocol and metric as at test time. Such an approach will result in simple and efficient systems, requiring little domain-specific knowledge and making few model assumptions. We implement the idea by formulating the problem as a single neural network architecture, including the estimation of a speaker model on only a few utterances, and evaluate it on our internal "Ok Google" benchmark for text-dependent speaker verification. The proposed approach appears to be very effective for big data applications like ours that require highly accurate, easy-to-maintain systems with a small footprint.
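A sketch of the verification step only (averaging a few enrollment embeddings into a speaker model and cosine-scoring the test embedding); the paper's contribution is training this pipeline end to end, which the sketch does not show, and the embedding dimension and threshold below are illustrative:

```python
# Minimal sketch (not the paper's network): score a test utterance against a speaker
# model estimated from only a few reference (enrollment) utterance embeddings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_embedding, enrollment_embeddings, threshold=0.7):
    speaker_model = np.mean(enrollment_embeddings, axis=0)   # few-utterance speaker model
    score = cosine(test_embedding, speaker_model)
    return score, score >= threshold

# Hypothetical 256-dimensional utterance embeddings from some encoder network.
enroll = [np.random.randn(256) for _ in range(3)]
score, accepted = verify(np.random.randn(256), enroll)
```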

378 citations


Proceedings ArticleDOI
Shi-Xiong Zhang1, Zhuo Chen2, Yong Zhao1, Jinyu Li1, Yifan Gong1 
01 Dec 2016
TL;DR: In this article, a speaker-discriminative CNN is used to extract noise-robust frame-level features, which are combined into an utterance-level speaker vector through an attention mechanism.
Abstract: A new type of end-to-end system for text-dependent speaker verification is presented in this paper. Previously, using a phonetically discriminative / speaker-discriminative DNN as a feature extractor for speaker verification has shown promising results. The extracted frame-level (bottleneck, posterior or d-vector) features are equally weighted and aggregated to compute an utterance-level speaker representation (d-vector or i-vector). In this work we use a speaker-discriminative CNN to extract noise-robust frame-level features. These features are combined to form an utterance-level speaker vector through an attention mechanism. The proposed attention model uses the speaker-discriminative information and the phonetic information to learn the weights. The whole system, including the CNN and the attention model, is jointly optimized using an end-to-end criterion. The training algorithm imitates the evaluation process exactly, directly mapping a test utterance and a few target speaker utterances into a single verification score. The algorithm can also select the most similar impostor for each target speaker to train the network. We demonstrate the effectiveness of the proposed end-to-end system on the Windows 10 "Hey Cortana" speaker verification task.
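The attention pooling that turns frame-level features into an utterance-level speaker vector can be sketched as follows; the shapes and the scoring network are assumptions, and the real model also feeds phonetic information into the weight computation:

```python
# Minimal sketch (illustrative, not the authors' model): learned attention weights
# aggregate frame-level speaker features into a single utterance-level vector.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, feat_dim, attn_dim=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, attn_dim), nn.Tanh(),
                                   nn.Linear(attn_dim, 1))

    def forward(self, frames):                          # frames: (batch, time, feat_dim)
        w = torch.softmax(self.score(frames), dim=1)    # per-frame weights summing to 1
        return (w * frames).sum(dim=1)                  # (batch, feat_dim) speaker vector

pool = AttentionPool(feat_dim=128)
utt_vec = pool(torch.randn(4, 200, 128))                # 4 utterances, 200 frames each
```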

137 citations



Proceedings ArticleDOI
01 Sep 2016
TL;DR: This paper uses simple spectrograms as input to a CNN and studies the optimal design of such networks for speaker identification and clustering, demonstrating the approach on the well-known TIMIT dataset and achieving results comparable with the state of the art, without the need for handcrafted features.
Abstract: Deep learning, especially in the form of convolutional neural networks (CNNs), has triggered substantial improvements in computer vision and related fields in recent years. This progress is attributed to the shift from designing features and subsequent individual sub-systems towards learning features and recognition systems end to end from nearly unprocessed data. For speaker clustering, however, it is still common to use handcrafted processing chains such as MFCC features and GMM-based models. In this paper, we use simple spectrograms as input to a CNN and study the optimal design of those networks for speaker identification and clustering. Furthermore, we elaborate on the question of how to transfer a network trained for speaker identification to speaker clustering. We demonstrate our approach on the well-known TIMIT dataset, achieving results comparable with the state of the art, without the need for handcrafted features.
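As a rough illustration of the setup, a small CNN over fixed-size spectrogram excerpts might look like the sketch below; the architecture and sizes are assumptions, not the network studied in the paper:

```python
# Minimal sketch: classify spectrogram excerpts by speaker identity with a small CNN.
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_speakers):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_speakers))

    def forward(self, spec):                     # spec: (batch, 1, freq_bins, frames)
        return self.classifier(self.features(spec))

model = SpectrogramCNN(n_speakers=630)           # TIMIT contains 630 speakers in total
logits = model(torch.randn(8, 1, 128, 100))      # 8 spectrogram excerpts
```

For clustering, an internal layer of such a network would be used as an embedding rather than the classification output.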

73 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: It is shown that error-rates for speaker-independent lip-reading can be very significantly reduced and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Abstract: Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error-rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error-rates can be even further reduced by the additional use of Deep Neural Networks (DNN). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.

63 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper presents a DNN-based autoencoder for speech enhancement and its use in speaker recognition systems for distant microphones and noisy data, with a detailed analysis on various conditions of NIST SRE 2010 and PRISM suggesting that the proposed preprocessing is a promising and efficient way to build a robust speaker recognition system.
Abstract: In this paper we present the design of a DNN-based autoencoder for speech enhancement and its use in speaker recognition systems for distant microphones and noisy data. We started by augmenting the Fisher database with artificially noised and reverberated data and trained the autoencoder to map noisy and reverberated speech to its clean version. We use the autoencoder as a preprocessing step in the later stage of modelling in state-of-the-art text-dependent and text-independent speaker recognition systems. We report relative improvements of up to 50% for the text-dependent system and up to 48% for the text-independent one. For the text-independent system, we present a more detailed analysis on various conditions of NIST SRE 2010 and PRISM, suggesting that the proposed preprocessing is a promising and efficient way to build a robust speaker recognition system for distant-microphone and noisy data.
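The training objective of such an enhancement autoencoder reduces to regressing clean features from noisy ones; the sketch below uses assumed 257-bin spectral features and random tensors as stand-ins for real (noisy, clean) pairs, and is not the paper's system:

```python
# Minimal sketch: feed-forward autoencoder trained to map noisy/reverberated feature
# frames to their clean versions, later used as a preprocessing step.
import torch
import torch.nn as nn

enhancer = nn.Sequential(
    nn.Linear(257, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 257))                        # predict the clean frame

opt = torch.optim.Adam(enhancer.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

noisy, clean = torch.randn(64, 257), torch.randn(64, 257)   # stand-in parallel data
for _ in range(10):                              # in practice: many epochs of augmented data
    opt.zero_grad()
    loss = loss_fn(enhancer(noisy), clean)
    loss.backward()
    opt.step()

# At test time, enhanced = enhancer(noisy_features) feeds the speaker recognition front-end.
```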

63 citations


Journal ArticleDOI
TL;DR: Transfer learning is an enabling technology for speaker adaptation, since it outperforms both the transformation-based adaptation algorithms usually adopted in the speech community and multi-condition training schemes, the data combination methods often adopted to cover more acoustic variability in speech when data from the source and target domains are both available at training time.

58 citations


Book ChapterDOI
08 Oct 2016
TL;DR: This work is the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset, and is seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision.
Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion - facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.

47 citations


Patent
10 Jun 2016
TL;DR: In this article, machine learning-based technologies are used to analyze an audio input and provide speaker state predictions in response to the audio input, which can be selected and customized for each of a variety of different applications.
Abstract: Disclosed are machine learning-based technologies that analyze an audio input and provide speaker state predictions in response to the audio input. The speaker state predictions can be selected and customized for each of a variety of different applications.

Proceedings ArticleDOI
Weizhong Zhu1, Jason W. Pelecanos1
20 Mar 2016
TL;DR: This paper proposes a novel Maximum a Posteriori (MAP) adapted transform within an i-vector speaker diarization framework, that operates in a strict left-to-right fashion.
Abstract: Many speaker diarization systems operate in an off-line mode. Such systems typically find homogeneous segments and then cluster these segments according to speaker. Such algorithms, like bottom-up clustering, k-means or spectral clustering, generally require the registration of all segments before clustering can begin. However, for real-time applications such as with multi-person voice interactive systems, there is a need to perform online speaker assignment in a strict left-to-right fashion. In this paper we propose a novel Maximum a Posteriori (MAP) adapted transform within an i-vector speaker diarization framework, that operates in a strict left-to-right fashion. Previous work by the community has shown that the principal components of variation of fixed dimensional i-vectors learned across segments tend to indicate a strong basis by which to separate speakers. However, determining this basis can be problematic when there are few segments or when operating in an online manner. The proposed method blends the prior with the estimated subspace as more i-vectors are observed. Given oracle SAD segments, with adaptation we achieve 3.2% speaker diarization error for a strict left-to-right constraint on the LDC Callhome English Corpus compared to 4.8% without adaptation.
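The strict left-to-right constraint can be illustrated independently of the proposed MAP-adapted transform: each incoming segment i-vector is either assigned to the closest existing speaker or opens a new one. The threshold and cosine scoring below are assumptions for illustration only:

```python
# Minimal sketch (not the MAP-adapted transform itself): online, left-to-right speaker
# assignment of per-segment i-vectors using running centroids and a cosine threshold.
import numpy as np

def online_assign(ivectors, threshold=0.5):
    centroids, counts, labels = [], [], []
    for v in ivectors:
        v = v / np.linalg.norm(v)
        sims = [float((v @ c) / np.linalg.norm(c)) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            centroids[k] = (centroids[k] * counts[k] + v) / (counts[k] + 1)  # update mean
            counts[k] += 1
        else:                                    # no close speaker yet: start a new one
            centroids.append(v.copy())
            counts.append(1)
            k = len(centroids) - 1
        labels.append(k)
    return labels

print(online_assign([np.random.randn(100) for _ in range(6)]))
```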

Proceedings ArticleDOI
20 Mar 2016
TL;DR: The role of bottleneck features in language and speaker identification tasks is analyzed by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer.
Abstract: Using bottleneck features extracted from a deep neural network (DNN) trained to predict senone posteriors has resulted in new, state-of-the-art technology for language and speaker identification. For language identification, the features' dense phonetic information is believed to enable improved performance by better representing language-dependent phone distributions. For speaker recognition, the role of these features is less clear, given that a bottleneck layer near the DNN output layer is thought to contain limited speaker information. In this article, we analyze the role of bottleneck features in these identification tasks by varying the DNN layer from which they are extracted, under the hypothesis that speaker information is traded for dense phonetic information as the layer moves toward the DNN output layer. Experiments support this hypothesis under certain conditions, and highlight the benefit of using a bottleneck layer close to the DNN output layer when DNN training data are matched to the evaluation conditions, and a layer more central to the DNN otherwise.
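Varying the extraction layer amounts to cutting a trained network at different depths and keeping the activations; the sketch below uses a stand-in feed-forward network with illustrative layer sizes, not the senone DNN from the paper:

```python
# Minimal sketch: extract "bottleneck" features from a chosen hidden layer of a DNN.
import torch
import torch.nn as nn

dnn = nn.Sequential(                       # stand-in for a senone-classification DNN
    nn.Linear(440, 1024), nn.Sigmoid(),    # indices 0-1
    nn.Linear(1024, 80), nn.Sigmoid(),     # indices 2-3: narrow bottleneck layer
    nn.Linear(80, 1024), nn.Sigmoid(),     # indices 4-5
    nn.Linear(1024, 3000))                 # index 6: senone posterior layer

def extract_layer(net, x, layer_index):
    """Run the network up to and including layer_index; return those activations."""
    for i, layer in enumerate(net):
        x = layer(x)
        if i == layer_index:
            return x
    return x

frames = torch.randn(100, 440)                # 100 spliced acoustic feature frames
bnf_central = extract_layer(dnn, frames, 3)   # features from a central layer
bnf_late = extract_layer(dnn, frames, 5)      # features closer to the output layer
```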

Journal ArticleDOI
TL;DR: A novel solution that distributes speech signals into a multitude of acoustic subregions defined by speech units and models speakers within those subregions; with a model synthesis approach, performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.
Abstract: Short utterance speaker recognition (SUSR) is highly challenging due to the limited enrollment and/or test data. We argue that the difficulty can be largely attributed to the mismatched prior distributions of the speech data used to train the universal background model (UBM) and those for enrollment and test. This paper presents a novel solution that distributes speech signals into a multitude of acoustic subregions that are defined by speech units, and models speakers within the subregions. To avoid data sparsity, a data-driven approach is proposed to cluster speech units into speech unit classes, based on which robust subregion models can be constructed. Furthermore, we propose a model synthesis approach based on maximum likelihood linear regression (MLLR) to deal with no-data speech unit classes. The experiments were conducted on the publicly available database SUD12. The results demonstrated that on a text-independent speaker recognition task where the test utterances are no longer than 2 seconds and mostly shorter than 0.5 seconds, the proposed subregion modeling offered a 21.51% relative reduction in equal error rate (EER), compared with the standard GMM-UBM baseline. In addition, with the model synthesis approach, the performance can be greatly improved in scenarios where no enrollment data are available for some speech unit classes.

Journal ArticleDOI
01 Apr 2016-PeerJ
TL;DR: A speech technology-based system is proposed to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of psychotherapy interactions, and provides useful information that can contribute to automatic quality assurance and therapist training.
Abstract: Scaling up psychotherapy services such as addiction counseling is a critical societal need. One challenge is ensuring the quality of therapy, due to the heavy cost of manual observational assessment. This work proposes a speech technology-based system to automate the assessment of therapist empathy, a key therapy quality index, from audio recordings of the psychotherapy interactions. We designed a speech processing system that includes voice activity detection and diarization modules, and an automatic speech recognizer plus a speaker role matching module to extract the therapist's language cues. We employed Maximum Entropy models, Maximum Likelihood language models, and a Lattice Rescoring method to characterize high vs. low empathic language. We estimated therapy-session-level empathy codes using utterance-level evidence obtained from these models. Our experiments showed that the fully automated system achieved a correlation of 0.643 between expert-annotated empathy codes and machine-derived estimations, and an accuracy of 81% in classifying high vs. low empathy, in comparison to a 0.721 correlation and 86% accuracy in the oracle setting using manual transcripts. The results show that the system provides useful information that can contribute to automatic quality assurance and therapist training.

Patent
10 Jun 2016
TL;DR: In this paper, an audio diarization system segments the audio input into speech and non-speech segments, and these segments are convolved with one or more head-related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
Abstract: Speech and/or non-speech in an audio input are convolved to localize sounds to different locations for a user. An audio diarization system segments the audio input into speech and non-speech segments. These segments are convolved with one or more head-related transfer functions (HRTFs) so the sounds localize to different sound localization points (SLPs) for the user.
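The localization step reduces to convolving each diarized segment with a left/right impulse-response pair; in the sketch below, toy two-tap filters stand in for measured HRTF data, and speech and non-speech segments would simply use different pairs:

```python
# Minimal sketch: convolve a mono segment with a left/right head-related impulse
# response pair so it is perceived at one sound localization point (SLP).
import numpy as np

def localize(segment, hrir_left, hrir_right):
    left = np.convolve(segment, hrir_left)
    right = np.convolve(segment, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((n, 2))                 # stereo output: column 0 = left, column 1 = right
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out

speech_segment = np.random.randn(16000)    # stand-in for 1 s of a diarized segment at 16 kHz
binaural = localize(speech_segment, np.array([1.0, 0.5]), np.array([0.3, 0.1]))
```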

Journal ArticleDOI
TL;DR: A prediction model that features three processing steps to predict whether turn-changing or turn-keeping will occur, who will be the next speaker in turn-changing, and the timing of the start of the next speaker's utterance is proposed.
Abstract: In multiparty meetings, participants need to predict the end of the speaker’s utterance and who will start speaking next, as well as consider a strategy for good timing to speak next. Gaze behavior plays an important role in smooth turn-changing. This article proposes a prediction model that features three processing steps to predict (I) whether turn-changing or turn-keeping will occur, (II) who will be the next speaker in turn-changing, and (III) the timing of the start of the next speaker’s utterance. For the feature values of the model, we focused on gaze transition patterns and the timing structure of eye contact between a speaker and a listener near the end of the speaker’s utterance. Gaze transition patterns provide information about the order in which gaze behavior changes. The timing structure of eye contact is defined as who looks at whom and who looks away first, the speaker or listener, when eye contact between the speaker and a listener occurs. We collected corpus data of multiparty meetings, using the data to demonstrate relationships between gaze transition patterns and timing structure and situations (I), (II), and (III). The results of our analyses indicate that the gaze transition pattern of the speaker and listener and the timing structure of eye contact have a strong association with turn-changing, the next speaker in turn-changing, and the start time of the next utterance. On the basis of the results, we constructed prediction models using the gaze transition patterns and timing structure. The gaze transition patterns were found to be useful in predicting turn-changing, the next speaker in turn-changing, and the start time of the next utterance. Contrary to expectations, we did not find that the timing structure is useful for predicting the next speaker and the start time. This study opens up new possibilities for predicting the next speaker and the timing of the next utterance using gaze transition patterns in multiparty meetings.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Experimental results on a speech corpus of multiple speakers in both Mandarin and English show that the proposed factorized DNN can not only achieve a similar voice quality as that of a multi-speaker DNN, but also perform polyglot synthesis with a monolingual speaker's voice.
Abstract: We previously proposed multi-speaker modelling in DNN-based TTS synthesis to improve voice quality when only limited data is available from a speaker. In this paper, we propose a new speaker- and language-factorized DNN, where speaker-specific layers are used for multi-speaker modelling, and shared layers and language-specific layers are employed for multi-language, linguistic feature transformation. Experimental results on a speech corpus of multiple speakers in both Mandarin and English show that the proposed factorized DNN can not only achieve a similar voice quality to that of a multi-speaker DNN, but can also perform polyglot synthesis with a monolingual speaker's voice.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: This paper analyses systems built for the RedDots challenge, the effort to collect and compare initial results on this new evaluation data set obtained at different sites, using the recently introduced HMM-based i-vector approach, where a set of phone-specific HMMs is used to collect the sufficient statistics for i-vector extraction.
Abstract: Recently, a new data collection was initiated within the RedDots project in order to evaluate text-dependent and text-prompted speaker recognition technology on data from a wider speaker population and with more realistic noise, channel and phonetic variability. This paper analyses our systems built for the RedDots challenge, the effort to collect and compare the initial results on this new evaluation data set obtained at different sites. We use our recently introduced HMM-based i-vector approach, where, instead of the traditional GMM, a set of phone-specific HMMs is used to collect the sufficient statistics for i-vector extraction. Our systems are trained in a completely phrase-independent way on data from the RSR2015 and LibriSpeech databases. We compare systems making use of standard cepstral features and their combination with neural-network-based bottleneck features. The best results are obtained with a score-level fusion of such systems.

Proceedings ArticleDOI
01 Oct 2016
TL;DR: This paper proposes to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization, and two approaches are tested and evaluated on the first season of Game Of Thrones TV series.
Abstract: While successful on broadcast news, meetings or telephone conversation, state-of-the-art speaker diarization techniques tend to perform poorly on TV series or movies. In this paper, we propose to rely on state-of-the-art face clustering techniques to guide acoustic speaker diarization. Two approaches are tested and evaluated on the first season of Game Of Thrones TV series. The second (better) approach relies on a novel talking-face detection module based on bi-directional long short-term memory recurrent neural network. Both audio-visual approaches outperform the audio-only baseline. A detailed study of the behavior of these approaches is also provided and paves the way to future improvements.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was set at optimal for the amount of training data of the selected target speaker.
Abstract: Recent studies have shown that DNN-based speech synthesis can produce more natural synthesized speech than conventional HMM-based speech synthesis. However, it remains an open problem whether the synthesized speech quality can be improved by utilizing a multi-speaker speech corpus. To address this problem, this paper proposes DNN-based speech synthesis using speaker codes as a simple method to improve on the conventional speaker-dependent DNN-based method. In order to model speaker variation in the DNN, the augmented feature (speaker code) is fed to the hidden layer(s) of the conventional DNN. The proposed method trains the connection weights of the whole DNN using a multi-speaker speech corpus. When synthesizing a speech parameter sequence, a target speaker is chosen from the corpus and the speaker code corresponding to the selected target speaker is fed to the DNN to generate that speaker's voice. We investigated the relationship between prediction performance and the architecture of the DNN by changing the hidden layer to which the speaker codes are input. Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was set optimally for the amount of training data of the selected target speaker.
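Feeding a speaker code into a hidden layer can be sketched as concatenating a one-hot code to that layer's input; the dimensions below are assumptions, not the paper's configuration:

```python
# Minimal sketch: TTS-style DNN whose first hidden layer also receives a one-hot
# speaker code, so the chosen code selects the synthesized voice.
import torch
import torch.nn as nn

class SpeakerCodeDNN(nn.Module):
    def __init__(self, ling_dim=300, n_speakers=50, hid=512, out_dim=187):
        super().__init__()
        self.n_speakers = n_speakers
        self.h1 = nn.Linear(ling_dim + n_speakers, hid)   # speaker code enters here
        self.h2 = nn.Linear(hid, hid)
        self.out = nn.Linear(hid, out_dim)                # acoustic parameter frame

    def forward(self, linguistic, speaker_id):
        code = torch.nn.functional.one_hot(speaker_id, self.n_speakers).float()
        x = torch.relu(self.h1(torch.cat([linguistic, code], dim=-1)))
        x = torch.relu(self.h2(x))
        return self.out(x)

net = SpeakerCodeDNN()
params = net(torch.randn(4, 300), torch.tensor([0, 3, 3, 12]))   # voices chosen via the code
```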

Proceedings ArticleDOI
20 Mar 2016
TL;DR: A Deep Neural Network/Hidden Markov Model Automatic Speech Recognition (DNN/HMM ASR) system is used to extract content-related posterior probabilities and outperforms systems using Gaussian mixture model posteriors by at least 50% Equal Error Rate (EER) on the RSR2015 in content mismatch trials.
Abstract: The i-vector and Joint Factor Analysis (JFA) systems for text-dependent speaker verification use sufficient statistics computed from a speech utterance to estimate speaker models. These statistics average the acoustic information over the utterance thereby losing all the sequence information. In this paper, we study explicit content matching using Dynamic Time Warping (DTW) and present the best achievable error rates for speaker-dependent and speaker-independent content matching. For this purpose, a Deep Neural Network/Hidden Markov Model Automatic Speech Recognition (DNN/HMM ASR) system is used to extract content-related posterior probabilities. This approach outperforms systems using Gaussian mixture model posteriors by at least 50% Equal Error Rate (EER) on the RSR2015 in content mismatch trials. DNN posteriors are also used in i-vector and JFA systems, obtaining EERs as low as 0.02%.
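Explicit content matching with DTW can be illustrated by aligning two sequences of per-frame posterior vectors; the cosine-style frame distance and the random stand-in posteriors below are assumptions, not the exact setup of the paper:

```python
# Minimal sketch: DTW alignment cost between two sequences of frame posteriors,
# usable as a content-match score (lower cost = better match).
import numpy as np

def dtw_cost(post_a, post_b):
    # post_a: (Ta, D), post_b: (Tb, D)
    dist = 1.0 - post_a @ post_b.T / (
        np.linalg.norm(post_a, axis=1)[:, None] * np.linalg.norm(post_b, axis=1)[None, :])
    Ta, Tb = len(post_a), len(post_b)
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[Ta, Tb] / (Ta + Tb)          # length-normalized alignment cost

a = np.abs(np.random.randn(50, 40))         # stand-ins for DNN/HMM posterior sequences
b = np.abs(np.random.randn(60, 40))
print(dtw_cost(a, b))
```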

Proceedings ArticleDOI
20 Mar 2016
TL;DR: The experimental results show that appropriate sound can efficiently degrade the speaker verification performance without degrading speech intelligibility.
Abstract: In this paper, a privacy protection method to prevent speaker identification from recorded speech is proposed and evaluated. Although many techniques for preserving various private information included in speech have been proposed, their impacts on human speech communication in physical space are not taken into account. To overcome this problem, this paper proposes privacy-preserving sound as a privacy protection method. The privacy-preserving sound can degrade speaker verification performance without interfering with human speech communication in physical space. To make a first step toward solving this problem, suitable sound characteristics for preserving privacy are evaluated in terms of the speaker verification performance and speech intelligibility. The experimental results show that appropriate sound can efficiently degrade the speaker verification performance without degrading speech intelligibility.

Proceedings ArticleDOI
01 Aug 2016
TL;DR: This paper considers speaker classification as an auxiliary task in order to improve the generalization abilities of the acoustic model, by training the model to recognize the speaker, or find the closest one inside the training set.
Abstract: In order to address the commonly encountered issue of overfitting in speech recognition, this article investigates Multi-Task Learning with speaker classification as the auxiliary task. Overfitting occurs when the amount of training data is limited, leading to an acoustic model that is overly sensitive to the training data. Multi-Task Learning is one of many regularization methods; it decreases the impact of overfitting by forcing the acoustic model to train jointly on multiple different, but related, tasks. In this paper, we consider speaker classification as an auxiliary task in order to improve the generalization abilities of the acoustic model, by training the model to recognize the speaker, or to find the closest one inside the training set. We investigate this Multi-Task Learning setup on the TIMIT database, with acoustic modelling performed using a Recurrent Neural Network with Long Short-Term Memory cells.
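The multi-task setup amounts to one shared recurrent body with two output heads and a weighted sum of losses; the sizes and the 0.3 auxiliary weight below are assumptions (TIMIT has 61 phone labels and 462 training speakers), not the paper's exact configuration:

```python
# Minimal sketch: LSTM acoustic model with a main phone head and an auxiliary
# speaker-classification head used as a regularizer.
import torch
import torch.nn as nn

class MultiTaskAM(nn.Module):
    def __init__(self, feat_dim=40, hid=256, n_phones=61, n_speakers=462):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hid, batch_first=True)
        self.phone_head = nn.Linear(hid, n_phones)        # main task: framewise phones
        self.speaker_head = nn.Linear(hid, n_speakers)    # auxiliary task: utterance speaker

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.phone_head(h), self.speaker_head(h.mean(dim=1))

model = MultiTaskAM()
feats = torch.randn(8, 300, 40)                           # 8 utterances, 300 frames each
phone_logits, spk_logits = model(feats)
loss = (nn.functional.cross_entropy(phone_logits.reshape(-1, 61),
                                    torch.randint(0, 61, (8 * 300,)))
        + 0.3 * nn.functional.cross_entropy(spk_logits, torch.randint(0, 462, (8,))))
```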

Patent
26 Jan 2016
TL;DR: In this paper, the authors present methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models.
Abstract: Disclosed herein are methods of diarizing audio data using first-pass blind diarization and second-pass blind diarization that generate speaker statistical models, wherein the first-pass blind diarization is on a per-frame basis and the second-pass blind diarization is on a per-word basis, and methods of creating acoustic signatures for a common speaker based only on the statistical models of the speakers in each audio session.

Journal ArticleDOI
TL;DR: Statistics of pauses appearing in Polish were described as a potential source of biometric information for automatic speaker recognition; the quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntax of the spoken language were the features that characterized speakers most.
Abstract: Statistics of pauses appearing in Polish were described as a potential source of biometric information for automatic speaker recognition. The usage of three main types of acoustic pauses (silent, filled and breath pauses) and syntactic pauses (punctuation marks in speech transcripts) was investigated quantitatively in three types of spontaneous speech (presentations, simultaneous interpretation and radio interviews) and in read speech (audio books). Selected parameters of pauses, extracted for each speaker separately or for speaker groups, were examined statistically to verify the usefulness of information on pauses for speaker recognition and speaker profile estimation. The quantity and duration of filled pauses, audible breaths, and the correlation between the temporal structure of speech and the syntax of the spoken language were the features that characterized speakers most. An experiment using pauses in a speaker biometry system (based on a Universal Background Model and i-vectors) resulted in a 30% equal error rate. Adding pause-related features to the baseline Mel-frequency cepstral coefficient system did not significantly improve its performance. In an experiment on automatic recognition of three types of spontaneous speech, we achieved 78% accuracy using a GMM classifier. Silent-pause-related features allowed distinguishing between read and spontaneous speech by extreme gradient boosting with 75% accuracy.

Journal ArticleDOI
01 Mar 2016
TL;DR: A scalable algorithm for real-time text-independent speaker identification based on vowel recognition that requires less than 100 bytes of data to be saved for each speaker to be identified, and can identify the speaker within a second.
Abstract: Automatic speaker identification has become a challenging research problem due to its wide variety of applications. Neural networks and audio-visual identification systems can be very powerful, but they have limitations related to the number of speakers. The performance drops gradually as more and more users are registered with the system. This paper proposes a scalable algorithm for real-time text-independent speaker identification based on vowel recognition. Vowel formants are unique across different speakers and reflect the vocal tract information of a particular speaker. The contribution of this paper is the design of a scalable system based on vowel formant filters and a scoring scheme for classification of an unseen instance. Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) have both been analysed for comparison to extract vowel formants by windowing the given signal. All formants are filtered by known formant frequencies to separate the vowel formants for further processing. The formant frequencies of each speaker are collected during the training phase. A test signal is also processed in the same way to find vowel formants and compare them with the saved vowel formants to identify the speaker for the current signal. A score-based scheme allows the speaker with the highest matching formants to own the current signal. This model requires less than 100 bytes of data to be saved for each speaker to be identified, and can identify the speaker within a second. Tests conducted on multiple databases show that this score-based scheme outperforms the back propagation neural network and Gaussian mixture models. Usually, the longer the speech files, the more significant were the improvements in accuracy.
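The score-based matching can be illustrated with a toy version; the tolerance, the stored formant lists, and the scoring rule below are assumptions for illustration, not the paper's exact scheme:

```python
# Minimal sketch: identify the speaker whose stored vowel formants best match the
# formants extracted from the test signal.
def formant_score(test_formants, speaker_formants, tolerance_hz=50.0):
    """Count test formants that fall within tolerance of any stored formant."""
    return sum(any(abs(f - g) <= tolerance_hz for g in speaker_formants)
               for f in test_formants)

def identify(test_formants, enrolled):
    """enrolled: dict speaker_id -> formant frequencies collected during training."""
    return max(enrolled, key=lambda spk: formant_score(test_formants, enrolled[spk]))

enrolled = {"spk1": [310.0, 870.0, 2250.0], "spk2": [360.0, 920.0, 2400.0]}
print(identify([305.0, 880.0, 2260.0], enrolled))     # -> "spk1"
```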

Journal ArticleDOI
TL;DR: While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
Abstract: The state-of-the-art speaker-recognition systems suffer from significant performance loss on degraded speech conditions and acoustic mismatch between enrolment and test phases. Past international evaluation campaigns, such as the NIST speaker recognition evaluation (SRE), have partly addressed these challenges in some evaluation conditions. This work aims at further assessing and compensating for the effect of a wide variety of speech-degradation processes on speaker-recognition performance. We present an open-source simulator generating degraded telephone, VoIP, and interview-speech recordings using a comprehensive list of narrow-band, wide-band, and audio codecs, together with a database of over 60 h of environmental noise recordings and over 100 impulse responses collected from publicly available data. We provide speaker-verification results obtained with an i-vector-based system using either a clean or degraded PLDA back-end on a NIST SRE subset of data corrupted by the proposed simulator. While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
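The two core degradation operations, additive noise at a target SNR and convolutive reverberation, can be sketched as below; this covers neither the codec processing nor the released simulator's interface, and the signals are random stand-ins:

```python
# Minimal sketch: degrade clean speech by adding noise at a chosen SNR and
# convolving with a room impulse response.
import numpy as np

def add_noise(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)            # repeat/trim noise to match length
    p_s, p_n = np.mean(speech ** 2), np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

def reverberate(speech, impulse_response):
    return np.convolve(speech, impulse_response)[: len(speech)]

clean = np.random.randn(16000)                        # stand-ins for real recordings
rir = np.array([1.0, 0.0, 0.4])                       # toy impulse response
degraded = reverberate(add_noise(clean, np.random.randn(16000), snr_db=10), rir)
```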

Proceedings ArticleDOI
31 Oct 2016
TL;DR: This work shows how to co-train a classifier for active speaker detection using audio-visual data and uses the training of a personalized audio voice classifier to detect active speakers.
Abstract: In this work, we show how to co-train a classifier for active speaker detection using audio-visual data. First, audio Voice Activity Detection (VAD) is used to train a personalized video-based active speaker classifier in a weakly supervised fashion. The video classifier is in turn used to train a voice model for each person. The individual voice models are then used to detect active speakers. There is no manual supervision - audio weakly supervises video classification, and the co-training loop is completed by using the trained video classifier to supervise the training of a personalized audio voice classifier.

Proceedings ArticleDOI
01 Aug 2016
TL;DR: An architecture for speech synthesis using multiple speakers that addresses the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need to modify the base optimized model.
Abstract: Deep Learning has been applied successfully to speech processing. In this paper we propose an architecture for speech synthesis using multiple speakers. Some hidden layers are shared by all the speakers, while there is a specific output layer for each speaker. Objective and perceptual experiments prove that this scheme produces much better results in comparison with a single-speaker model. Moreover, we also tackle the problem of speaker adaptation by adding a new output branch to the model and successfully training it without the need to modify the base optimized model. This fine-tuning method achieves better results than training the new speaker from scratch with its own model.
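The shared-plus-branch layout and the adaptation step can be sketched as follows; the layer sizes are illustrative and this is not the authors' implementation:

```python
# Minimal sketch: hidden layers shared by all speakers, one output branch per speaker;
# a new speaker is added by training only a freshly appended branch.
import torch
import torch.nn as nn

class SharedMultiSpeakerTTS(nn.Module):
    def __init__(self, ling_dim=300, hid=512, out_dim=187, n_speakers=4):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(ling_dim, hid), nn.ReLU(),
                                    nn.Linear(hid, hid), nn.ReLU())
        self.branches = nn.ModuleList([nn.Linear(hid, out_dim) for _ in range(n_speakers)])

    def forward(self, x, speaker_index):
        return self.branches[speaker_index](self.shared(x))

    def add_speaker(self, out_dim=187):
        self.branches.append(nn.Linear(self.branches[0].in_features, out_dim))
        return len(self.branches) - 1            # index of the new output branch

tts = SharedMultiSpeakerTTS()
new_id = tts.add_speaker()
# Adaptation: optimize only the new branch while the shared layers stay frozen.
opt = torch.optim.Adam(tts.branches[new_id].parameters(), lr=1e-3)
```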