
Showing papers on "Speaker diarisation published in 2017"


Posted Content
TL;DR: Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.
Abstract: We present Deep Speaker, a neural speaker embedding system that maps utterances to a hypersphere where speaker similarity is measured by cosine similarity. The embeddings generated by Deep Speaker can be used for many tasks, including speaker identification, verification, and clustering. We experiment with ResCNN and GRU architectures to extract the acoustic features, then mean pool to produce utterance-level speaker embeddings, and train using triplet loss based on cosine similarity. Experiments on three distinct datasets suggest that Deep Speaker outperforms a DNN-based i-vector baseline. For example, Deep Speaker reduces the verification equal error rate by 50% (relatively) and improves the identification accuracy by 60% (relatively) on a text-independent dataset. We also present results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition.

446 citations
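
To make the training objective above concrete, here is a minimal NumPy sketch of a triplet loss on cosine similarity between L2-normalized utterance embeddings; the embedding dimension, margin, and random vectors are illustrative placeholders, not values from the paper.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cosine_triplet_loss(anchor, positive, negative, margin=0.1):
    """Triplet loss on cosine similarity: push same-speaker pairs to be
    more similar than different-speaker pairs by at least a margin."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    sim_ap = np.sum(a * p, axis=-1)   # cosine similarity, same speaker
    sim_an = np.sum(a * n, axis=-1)   # cosine similarity, different speaker
    return np.maximum(0.0, margin + sim_an - sim_ap).mean()

# Toy batch of 512-dimensional utterance embeddings (random, for illustration).
rng = np.random.default_rng(0)
anchor, positive, negative = rng.normal(size=(3, 4, 512))
print(cosine_triplet_loss(anchor, positive, negative))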


Proceedings ArticleDOI
05 Mar 2017
TL;DR: This work proposes an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely and shows that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
Abstract: Speaker diarization is an important front-end for many speech technologies in the presence of multiple speakers, but current methods that employ i-vector clustering for short segments of speech are potentially too cumbersome and costly for the front-end role. In this work, we propose an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely. The proposed architecture simultaneously learns a fixed-dimensional embedding for acoustic segments of variable length and a scoring function for measuring the likelihood that the segments originated from the same or different speakers. Through tests on the CALLHOME conversational telephone speech corpus, we demonstrate that, in addition to streamlining the diarization architecture, the proposed system matches or exceeds the performance of state-of-the-art baselines. We also show that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.

248 citations


Posted Content
TL;DR: This work combines LSTM-based d-vector audio embeddings with recent work in nonparametric clustering to obtain a state-of-the-art speaker diarization system that achieves a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while the model is trained with out- of-domain data from voice search logs.
Abstract: For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.

170 citations
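
The clustering stage described above can be sketched with scikit-learn: spectral clustering applied to a cosine-similarity affinity matrix over per-segment d-vectors. The embeddings below are random placeholders and the speaker count is assumed known, which a real system would have to estimate.

import numpy as np
from sklearn.cluster import SpectralClustering

def diarize_embeddings(dvectors, n_speakers):
    """Cluster per-segment d-vectors into speakers using spectral
    clustering on a cosine-similarity affinity matrix."""
    dvectors = dvectors / np.linalg.norm(dvectors, axis=1, keepdims=True)
    affinity = np.clip(dvectors @ dvectors.T, 0.0, 1.0)  # cosine similarity in [0, 1]
    labels = SpectralClustering(n_clusters=n_speakers,
                                affinity="precomputed").fit_predict(affinity)
    return labels  # one speaker label per segment

# Placeholder d-vectors for 10 segments (a real system would extract
# these with an LSTM speaker-verification network).
rng = np.random.default_rng(1)
segments = rng.normal(size=(10, 256))
print(diarize_embeddings(segments, n_speakers=2))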


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This work proposes to use deep neural networks to learn short-duration speaker embeddings based on a deep convolutional architecture wherein recordings are treated as images, and advocates treating utterances as images or ‘speaker snapshots’, much like in face recognition.
Abstract: The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the idea of learning a speaker classifier one step further: we apply deep neural networks directly to time-frequency speech representations. We propose two feedforward network architectures for this task. Our best model is based on a deep convolutional architecture wherein recordings are treated as images. From our experimental findings we advocate treating utterances as images or ‘speaker snapshots’, much like in face recognition. Our convolutional speaker embeddings perform significantly better than i-vectors when scoring is done using cosine distance, where the relative improvement is 23.5%. The proposed deep embeddings combined with cosine distance also outperform a state-of-the-art i-vector verification system by 1%, providing further empirical evidence in favor of our learned speaker features.

155 citations


Proceedings ArticleDOI
Dimitrios Dimitriadis, Petr Fousek
20 Aug 2017
TL;DR: This paper presents novel ideas for speeding up the system by using Automatic Speech Recognition (ASR) transcripts as an input to diarization, and introduces a concept of active window to keep the computational complexity linear.
Abstract: In this paper we describe the process of converting a research prototype system for Speaker Diarization into a fully deployed product running in real time and with low latency. The deployment is a part of the IBM Cloud Speech-to-Text (STT) Service. First, the prototype system is described and the requirements for the on-line, deployable system are introduced. Then we describe the technical approaches we took to satisfy these requirements and discuss some of the challenges we have faced. In particular, we present novel ideas for speeding up the system by using Automatic Speech Recognition (ASR) transcripts as an input to diarization, we introduce a concept of active window to keep the computational complexity linear, we improve the speaker model using a new speaker-clustering algorithm, we automatically keep track of the number of active speakers and we enable the users to set an operating point on a continuous scale between low latency and optimal accuracy. The deployed system has been tuned on real-life data reaching average Speaker Error Rates around 3% and improving over the prototype system by about 10% relative.

80 citations


Proceedings ArticleDOI
19 Jun 2017
TL;DR: Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
Abstract: Methods for adapting and controlling the characteristics of output speech are important topics in speech synthesis. In this work, we investigated the performance of DNN-based text-to-speech systems that, in parallel to conventional text input, also take speaker, gender, and age codes as inputs, in order to 1) perform multi-speaker synthesis, 2) perform speaker adaptation using small amounts of target-speaker adaptation data, and 3) modify synthetic speech characteristics based on the input codes. Using a large-scale, studio-quality speech corpus with 135 speakers of both genders and ages ranging from the tens to the eighties, we performed three experiments: 1) First, we used a subset of speakers to construct a DNN-based, multi-speaker acoustic model with speaker codes. 2) Next, we performed speaker adaptation by estimating code vectors for new speakers via backpropagation from a small amount of adaptation material. 3) Finally, we experimented with manually manipulating input code vectors to alter the gender and/or age characteristics of the synthesised speech. Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.

71 citations
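
A minimal sketch of the input-conditioning idea above: appending one-hot speaker, gender, and age codes to every frame of linguistic features before the acoustic model. All dimensions and values here are illustrative, not the paper's actual configuration.

import numpy as np

def build_input(linguistic_feats, speaker_id, gender_id, age_bin,
                n_speakers=135, n_genders=2, n_age_bins=8):
    """Append one-hot speaker / gender / age codes to every frame of
    linguistic features, as in code-conditioned multi-speaker TTS."""
    def one_hot(idx, size):
        v = np.zeros(size, dtype=np.float32)
        v[idx] = 1.0
        return v

    codes = np.concatenate([one_hot(speaker_id, n_speakers),
                            one_hot(gender_id, n_genders),
                            one_hot(age_bin, n_age_bins)])
    frames = linguistic_feats.shape[0]
    return np.hstack([linguistic_feats, np.tile(codes, (frames, 1))])

# 20 frames of 300-dimensional linguistic features (values are placeholders).
x = build_input(np.zeros((20, 300), dtype=np.float32),
                speaker_id=3, gender_id=1, age_bin=4)
print(x.shape)  # (20, 445)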


Proceedings ArticleDOI
20 Aug 2017
TL;DR: This work provides a command line interface (CLI) to improve reproducibility and comparison of speaker diarization research results, and shows that it can also be used for detailed error analysis purposes.
Abstract: pyannote.metrics is an open-source Python library aimed at researchers working in the wide area of speaker diarization. It provides a command line interface (CLI) to improve reproducibility and comparison of speaker diarization research results. Through its application programming interface (API), a large set of evaluation metrics is available for diagnostic purposes of all modules of typical speaker diarization pipelines (speech activity detection, speaker change detection, clustering, and identification). Finally, thanks to visualization capabilities, we show that it can also be used for detailed error analysis purposes. pyannote.metrics can be downloaded from http://pyannote.github.io.

70 citations
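
A short usage sketch based on the library's documented API for the diarization error rate metric; the labels and segment boundaries are made up, and exact import paths may vary slightly across versions.

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Reference (ground-truth) speaker segments for one file.
reference = Annotation()
reference[Segment(0, 10)] = 'spk_A'
reference[Segment(10, 20)] = 'spk_B'

# Hypothesis produced by a diarization system.
hypothesis = Annotation()
hypothesis[Segment(0, 12)] = 'spk_1'
hypothesis[Segment(12, 20)] = 'spk_2'

metric = DiarizationErrorRate()
print(metric(reference, hypothesis))  # DER for this file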


Proceedings ArticleDOI
05 Mar 2017
TL;DR: The final results on the speaker diarization system indicate that the use of speaker change detection based on CNN is beneficial, with a relative improvement in diarization error rate of 28%.
Abstract: The aim of this paper is to propose a speaker change detection technique based on a Convolutional Neural Network (CNN) and evaluate its contribution to the performance of a speaker diarization system for telephone conversations. For the comparison we used an i-vector based speaker diarization system. The baseline speaker change detection uses the Generalized Likelihood Ratio (GLR) metric. Experiments were conducted on the English part of the CallHome corpus. Our proposed CNN speaker change detection outperformed the GLR approach, reducing the Equal Error Rate by 46% relative. The final results on the speaker diarization system indicate that the use of speaker change detection based on CNN is beneficial, with a relative improvement in diarization error rate of 28%.

68 citations
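
The CNN itself is not reproduced here, but a typical post-processing step, turning per-frame change probabilities into change points by thresholded peak picking, can be sketched as follows. The threshold, frame shift, and minimum gap are illustrative choices, not the paper's settings.

import numpy as np
from scipy.signal import find_peaks

def change_points(change_probs, frame_shift=0.01, threshold=0.5, min_gap=1.0):
    """Convert per-frame speaker-change probabilities (e.g. CNN outputs)
    into change-point times via thresholded peak picking."""
    peaks, _ = find_peaks(change_probs,
                          height=threshold,
                          distance=int(min_gap / frame_shift))
    return peaks * frame_shift  # times in seconds

# Synthetic probability curve with two clear peaks, for illustration.
t = np.arange(0, 30, 0.01)
probs = 0.9 * np.exp(-((t - 8) ** 2) / 0.1) + 0.8 * np.exp(-((t - 21) ** 2) / 0.1)
print(change_points(probs))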


Proceedings ArticleDOI
20 Aug 2017
TL;DR: The result shows that the proposed model brings good improvement over conventional methods based on BIC and Gaussian Divergence.
Abstract: Speaker change detection is an important step in a speaker diarization system. It aims at finding speaker change points in the audio stream. In this paper, it is treated as a sequence labeling task and addressed by bidirectional long short-term memory networks (Bi-LSTM). The system is trained and evaluated on the Broadcast TV subset from the ETAPE database. The results show that the proposed model brings good improvement over conventional methods based on BIC and Gaussian divergence. For instance, in comparison to Gaussian divergence, it produces speech turns that are 19.5% longer on average, with the same level of purity.

67 citations


Journal ArticleDOI
TL;DR: This work proposes a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification and presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.
Abstract: The low-dimensional i-vector representation of speech segments is used in state-of-the-art text-independent speaker verification systems. However, i-vectors were deemed unsuitable for the text-dependent task, where simpler and older speaker recognition approaches were found more effective. In this work, we propose a straightforward hidden Markov model (HMM) based extension of the i-vector approach, which allows i-vectors to be successfully applied to text-dependent speaker verification. In our approach, the Universal Background Model (UBM) used for training the phrase-independent i-vector extractor is based on a set of monophone HMMs instead of the standard Gaussian Mixture Model (GMM). To compensate for the channel variability, we propose to precondition i-vectors using a regularized variant of within-class covariance normalization, which can be robustly estimated in a phrase-dependent fashion on the small datasets available for the text-dependent task. The verification scores are cosine similarities between the i-vectors normalized using phrase-dependent s-norm. The experimental results on the RSR2015 and RedDots databases confirm the effectiveness of the proposed approach, especially in rejecting test utterances with a wrong phrase. A simple MFCC based i-vector/HMM system performs competitively when compared to very computationally expensive DNN-based approaches or the conventional relevance MAP GMM-UBM, which does not allow for compact speaker representations. To our knowledge, this paper presents the best published results obtained with a single system on both the RSR2015 and RedDots datasets.

65 citations
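
One component of this recipe that translates directly into code is symmetric score normalization (s-norm) of cosine scores. Below is a minimal NumPy sketch with random placeholder i-vectors and cohort; in the paper the cohort and normalization are phrase-dependent, which is not modeled here.

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def s_norm(enroll, test, cohort):
    """Symmetric score normalization: average of the raw cosine score
    z-normalized against the enrolment-side and test-side cohort scores."""
    raw = cosine(enroll, test)
    enroll_scores = np.array([cosine(enroll, c) for c in cohort])
    test_scores = np.array([cosine(test, c) for c in cohort])
    z_e = (raw - enroll_scores.mean()) / enroll_scores.std()
    z_t = (raw - test_scores.mean()) / test_scores.std()
    return 0.5 * (z_e + z_t)

# Placeholder 400-dimensional i-vectors and a small cohort.
rng = np.random.default_rng(2)
enroll, test = rng.normal(size=(2, 400))
cohort = rng.normal(size=(50, 400))
print(s_norm(enroll, test, cohort))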


Proceedings ArticleDOI
18 Jun 2017
TL;DR: An analysis of how a text-independent voice identification system can be built is presented and the ability for such systems to be utilized in both speaker identification and speaker verification tasks is shown.
Abstract: Speaker identification systems are becoming more important in today's world. This is especially true as devices rely on the user to speak commands. In this article, an analysis of how a text-independent voice identification system can be built is presented. Mel-frequency cepstral coefficients are extracted, and a support vector machine is trained and tested on two different data sets, one from LibriSpeech and one from in-house recorded audio files. The results show the ability of such systems to be utilized in both speaker identification and speaker verification tasks.
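
A condensed sketch of the described pipeline, mean MFCC vectors per utterance fed to an SVM classifier, using librosa and scikit-learn; the file names and labels are hypothetical placeholders.

import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_mfcc(path, n_mfcc=13):
    """Load audio and summarize it as the mean MFCC vector
    (a simple utterance-level representation)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Hypothetical training lists: one speaker label per audio file.
train_files = ["spk1_a.wav", "spk1_b.wav", "spk2_a.wav", "spk2_b.wav"]
train_labels = ["spk1", "spk1", "spk2", "spk2"]

X = np.vstack([utterance_mfcc(f) for f in train_files])
clf = SVC(kernel="rbf").fit(X, train_labels)
print(clf.predict([utterance_mfcc("unknown.wav")]))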

Posted Content
TL;DR: In this article, a convolutional time-delay deep neural network structure (CT-DNN) was proposed for speaker feature learning, which can produce high quality speaker features.
Abstract: Recently deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not sufficiently good, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when applied to speaker verification, just as with raw features. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrated that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the EER can be as low as 7.68%. This effectively confirmed that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and therefore can be extracted from just dozens of frames.

Journal ArticleDOI
TL;DR: The multi-level framework combining the three modules achieves better performance than the individual modules, which shows its potential for practical deployment.
Abstract: This work presents the development of a multi-level speech based person authentication system with attendance as an application. The multi-level system consists of three different modules of speaker verification, namely voice-password, text-dependent and text-independent speaker verification. The three speaker verification modules are combined in a sequential manner to develop a multi-level framework which is ported over a telephone network through an interactive voice response (IVR) system for aiding remote authentication. The users call from a fixed set of mobile handsets to verify their claim against their respective models, which is then authenticated in a multi-level mode using the above stated three modules. An analysis over a period of two months is shown on the performance of the multi-level system in attendance marking. The multi-level framework combining the three modules achieves better performance than the individual modules, which shows its potential for practical deployment.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: This paper describes the JHU team's Kaldi system submission to the Arabic MGB-3: The Arabic speech recognition in the Wild Challenge for ASRU-2017 and describes its own approach for speaker diarization and audio-transcript alignment.
Abstract: This paper describes the JHU team's Kaldi system submission to the Arabic MGB-3: The Arabic Speech Recognition in the Wild Challenge for ASRU-2017. We use a weights transfer approach to adapt a neural network trained on the out-of-domain MGB-2 multi-dialect Arabic TV broadcast corpus to the MGB-3 Egyptian YouTube video corpus. The neural network has a TDNN-LSTM architecture and is trained using the lattice-free maximum mutual information (LF-MMI) objective followed by sMBR discriminative training. For supervision, we fuse transcripts from 4 independent transcribers into confusion network training graphs. We also describe our own approach for speaker diarization and audio-transcript alignment. We use this to prepare lightly supervised transcriptions for training the seed system used for adaptation to MGB-3. Our primary submission to the challenge gives a multi-reference WER of 32.78% on the MGB-3 test set.

Journal ArticleDOI
TL;DR: The experimental results indicate that the proposed DNN-based approach achieves large performance gains over the state-of-the-art unsupervised techniques without using any specific knowledge about the mixed target and interfering speakers being segregated.
Abstract: We propose an unsupervised speech separation framework for mixtures of two unseen speakers in a single-channel setting based on deep neural networks (DNNs). We rely on a key assumption that two speakers could be well segregated if they are not too similar to each other. A dissimilarity measure between two speakers is first proposed to characterize the separation ability between competing speakers. We then show that speakers with the same or different genders can often be separated if two speaker clusters, with large enough distances between them, for each gender group could be established, resulting in four speaker clusters. Next, a DNN-based gender mixture detection algorithm is proposed to determine whether the two speakers in the mixture are females, males, or from different genders. This detector is based on a newly proposed DNN architecture with four outputs, two of them representing the female speaker clusters and the other two characterizing the male groups. Finally, we propose to construct three independent speech separation DNN systems, one for each of the female–female, male–male, and female–male mixture situations. Each DNN gives dual outputs, one representing the target speaker group and the other characterizing the interfering speaker cluster. Trained and tested on the speech separation challenge corpus, our experimental results indicate that the proposed DNN-based approach achieves large performance gains over the state-of-the-art unsupervised techniques without using any specific knowledge about the mixed target and interfering speakers being segregated.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation, and listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach; and (2) for speaker similarity, both d-vector and i-vector based adaptation perform similarly.
Abstract: The paper presents a mechanism to perform speaker adaptation in speech synthesis based on deep neural networks (DNNs). The mechanism extracts speaker identification vectors, so-called d-vectors, from the training speakers and uses them jointly with the linguistic features to train a multi-speaker DNN-based text-to-speech synthesizer (DNN-TTS). The d-vectors are derived by applying principal component analysis (PCA) on the bottleneck features of a speaker classifier network. At the adaptation stage, three variants are explored: (1) d-vectors calculated using data from the target speaker, (2) d-vectors calculated as a weighted sum of d-vectors from training speakers, or (3) d-vectors calculated as an average of the above two approaches. The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation. Listening tests show that: (1) for speech quality, the d-vector based approach is significantly preferred over the i-vector based approach, and all the d-vector variants perform similarly; (2) for speaker similarity, both d-vector and i-vector based adaptation were found to perform similarly, except for a small significant preference for the d-vector calculated as an average over the i-vector.
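
The d-vector extraction step described above can be sketched as PCA over utterance-averaged bottleneck features of a speaker classifier. Random arrays stand in for real bottleneck activations, and the target dimensionality is an arbitrary choice.

import numpy as np
from sklearn.decomposition import PCA

def extract_dvectors(bottleneck_feats, dim=32):
    """Derive low-dimensional speaker codes (d-vectors) by applying PCA
    to utterance-averaged bottleneck features of a speaker classifier."""
    utterance_means = np.stack([u.mean(axis=0) for u in bottleneck_feats])
    pca = PCA(n_components=dim)
    return pca.fit_transform(utterance_means), pca

# Placeholder bottleneck activations: 40 utterances, variable frame counts,
# 256-dimensional bottleneck layer.
rng = np.random.default_rng(3)
utts = [rng.normal(size=(rng.integers(100, 300), 256)) for _ in range(40)]
dvecs, _ = extract_dvectors(utts)
print(dvecs.shape)  # (40, 32)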

Journal ArticleDOI
TL;DR: A technique is proposed in this paper in which a pool of pre-trained transformations between a set of speakers is used, making it possible to produce de-identified speech in real time with a high level of naturalness.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: Online adaptive beamforming for automatic speech recognition (ASR) in meetings in noisy, reverberant environments is proposed, based on recently developed mask-based beamforming, which reduced the word error rate (WER) on real meeting data by 54.8% relative to the previous beamforming method.
Abstract: Here we propose online adaptive beamforming for automatic speech recognition (ASR) in meetings in noisy, reverberant environments. The proposed method is based on recently developed mask-based beamforming, in which accurate mask estimation and diarization are paramount. Real-world experiments have shown that mask-based beamforming enables accurate ASR in meetings with modest noise and reverberation, i.e., a signal-to-noise ratio (SNR) of 15–25 dB and a reverberation time (RT) of 120–350 ms. In this paper, we deal with a more adverse condition: meetings with strong noise and reverberation, with an SNR of 3–15 dB and an RT of 500 ms. To this end, we exploit a probabilistic spatial dictionary, a dictionary that consists of a pre-trained probability distribution of source location features for each potential speaker location. This dictionary enables us to perform mask estimation and diarization for beamforming accurately, even in the above adverse condition. The proposed method reduced the word error rate (WER) on real meeting data by 54.8% relative to our previous beamforming method.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This study investigates a phonetically-aware i-vector system in noisy conditions and proposes a front-end to tackle the noise problem by performing speech separation and examines its performance for both verification and identification tasks.
Abstract: Recent research shows that the i-vector framework for speaker recognition can significantly benefit from phonetic information. A common approach is to use a deep neural network (DNN) trained for automatic speech recognition to generate a universal background model (UBM). Studies in this area have been done in relatively clean conditions. However, strong background noise is known to severely reduce speaker recognition performance. This study investigates a phonetically-aware i-vector system in noisy conditions. We propose a front-end to tackle the noise problem by performing speech separation and examine its performance for both verification and identification tasks. The proposed separation system trains a DNN to estimate the ideal ratio mask of the noisy speech. The separated speech is then used to extract enhanced features for the i-vector framework. We compare the proposed system against a multi-condition trained baseline and a traditional GMM-UBM i-vector system. Our proposed system provides an absolute average improvement of 8% in identification accuracy and 1.2% in equal error rate.
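
The DNN mask estimator is not shown here, but the training target it regresses and its application can be sketched; this uses one common ideal-ratio-mask definition with random placeholder spectrograms, not necessarily the exact variant used in the paper.

import numpy as np

def ideal_ratio_mask(clean_power, noise_power):
    """One common ideal-ratio-mask definition: the square root of the
    speech-to-(speech+noise) power ratio in each time-frequency bin."""
    return np.sqrt(clean_power / (clean_power + noise_power + 1e-10))

def apply_mask(noisy_magnitude, mask):
    """Masking the noisy magnitude spectrogram yields the separated-speech
    estimate whose features then feed the i-vector front end."""
    return mask * noisy_magnitude

# Placeholder power spectrograms (frequency bins x frames).
rng = np.random.default_rng(4)
clean, noise = rng.random((2, 257, 100))
mask = ideal_ratio_mask(clean, noise)
enhanced = apply_mask(np.sqrt(clean + noise), mask)
print(mask.shape, enhanced.shape)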

Proceedings ArticleDOI
01 Mar 2017
TL;DR: A speaker identification technique is studied which takes a person's original voice signals, e.g., Bob's, normalizes their audio energies, converts them to the frequency domain via the Fourier transform, and models MFCC-based characteristics with a Gaussian mixture model.
Abstract: Nowadays, many speech recognition applications are used by people around the world. Typical examples are the iPhone's Siri, the Google speech recognition system, and voice-operated mobile phones. By contrast, speaker identification in its current stage is relatively immature. Therefore, in this paper, we study a speaker identification technique which first takes the original voice signals of a person, e.g., Bob, and then normalizes the audio energies of the signals. After that, the audio signals are converted from the time domain to the frequency domain using the Fourier transform. Next, an MFCC-based human auditory filtering model is utilized to identify the energy levels of different frequencies as the quantified characteristics of Bob's voice. Further, the probability density function of a Gaussian mixture model is utilized to model the distribution of the quantified characteristics as Bob's specific acoustic model. When the system receives an unknown person's voice, e.g., x's, it processes the voice with the same procedure and compares the result, x's acoustic model, against known speakers' acoustic models collected beforehand in an acoustic-model database to identify the most likely speaker.
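
A condensed sketch of this enrolment-and-identification scheme: one MFCC-based Gaussian mixture model per known speaker, scored by average log-likelihood. File paths, sampling rate, and mixture size are hypothetical placeholders.

import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path):
    y, sr = librosa.load(path, sr=8000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # frames x coefficients

def enroll(speaker_files, n_components=16):
    """Fit one GMM per known speaker on that speaker's MFCC frames."""
    return {name: GaussianMixture(n_components=n_components).fit(mfcc_frames(path))
            for name, path in speaker_files.items()}

def identify(models, path):
    """Pick the enrolled speaker whose GMM gives the highest average
    log-likelihood for the unknown recording."""
    feats = mfcc_frames(path)
    return max(models, key=lambda name: models[name].score(feats))

# Hypothetical enrolment data and test file.
models = enroll({"bob": "bob.wav", "alice": "alice.wav"})
print(identify(models, "unknown.wav"))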

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The experiments on the English part of the CallHome corpus show that the proposed refinement of the statistics accumulation is beneficial, with a relative improvement in Diarization Error Rate of almost 16% when compared to the speaker diarization system without statistics refinement.
Abstract: The aim of this paper is to investigate the benefit of information from a speaker change detection system based on a Convolutional Neural Network (CNN) when applied to the process of accumulating statistics for i-vector generation. The investigation is carried out on the problem of diarization. In our system, the output of the CNN is the probability of a speaker change in a conversation for a given time segment. According to this probability, we cut the conversation into short segments that are then each represented by an i-vector (to describe the speaker in it). We propose a technique to utilize the information from the CNN for weighting the acoustic data in a segment to refine the statistics accumulation process. This technique enables us to represent the speaker better in the final i-vector. The experiments on the English part of the CallHome corpus show that our proposed refinement of the statistics accumulation is beneficial, with a relative improvement in Diarization Error Rate of almost 16% when compared to the speaker diarization system without statistics refinement.
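
The core operation, accumulating sufficient statistics for i-vector extraction with per-frame weights, can be sketched directly. The weight values below are a plain illustration rather than the paper's exact scheme derived from CNN change probabilities.

import numpy as np

def weighted_stats(posteriors, features, weights):
    """Accumulate zeroth-order (N) and first-order (F) sufficient statistics
    for i-vector extraction, with each frame scaled by a weight (e.g. derived
    from a CNN speaker-change probability)."""
    w_post = posteriors * weights[:, None]   # frames x components
    zeroth = w_post.sum(axis=0)              # N_c per mixture component
    first = w_post.T @ features              # F_c, components x feature dim
    return zeroth, first

# Placeholder UBM posteriors (frames x 64 components), 39-dim features, and
# frame weights that down-weight frames near a detected change point.
rng = np.random.default_rng(5)
frames, comps, dim = 200, 64, 39
post = rng.random((frames, comps))
post /= post.sum(axis=1, keepdims=True)
feats = rng.normal(size=(frames, dim))
weights = np.ones(frames)
weights[90:110] = 0.2
N, F = weighted_stats(post, feats, weights)
print(N.shape, F.shape)  # (64,) (64, 39)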

Journal ArticleDOI
TL;DR: Two methods are proposed to find the MFCC feature vectors with the highest similarity for a text-independent speaker identification system, and experimental results indicate that the system's performance improves in terms of both accuracy and time consumption.
Abstract: Speaker recognition has been one of the interesting issues in signal and speech processing over the last few decades. Feature selection is one of the main parts of a speaker recognition system and can improve its performance. In this paper, we propose two methods to find the MFCC feature vectors with the highest similarity, applied to a text-independent speaker identification system. These feature vectors capture individual properties of each person's vocal tract that are mostly repeated. They are used to build the speaker's model and to specify the decision boundary. We apply the MFCCs of each window over the main signal as feature vectors and use clustering to obtain the feature vectors with the highest similarity. The speaker identification experiments are performed on the ELSDSR database, which consists of 22 speakers (12 male and 10 female), with a neural network as the classifier. The effect of three main parameters is considered in the two proposed methods. Experimental results indicate that the performance of the speaker identification system improves in terms of both accuracy and time consumption.

Journal ArticleDOI
TL;DR: New advances are described with the state-of-the-art i-vector based approach to text-dependent speaker verification, which also makes use of different DNN techniques.

Book ChapterDOI
17 Sep 2017
TL;DR: A new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings using a recurrent convolutional neural network applied directly on magnitude spectrograms is proposed.
Abstract: In this paper we propose a new method of speaker diarization that employs a deep learning architecture to learn speaker embeddings. In contrast to the traditional approaches that build their speaker embeddings using manually hand-crafted spectral features, we propose to train for this purpose a recurrent convolutional neural network applied directly on magnitude spectrograms. To compare our approach with the state of the art, we collect and release for the public an additional dataset of over 6 h of fully annotated broadcast material. The results of our evaluation on the new dataset and three other benchmark datasets show that our proposed method significantly outperforms the competitors and reduces diarization error rate by a large margin of over 30% with respect to the baseline.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.
Abstract: In this paper, we investigate several automatic transcription schemes for using raw bilingual broadcast news data in semi-supervised bilingual acoustic model training. Specifically, we compare the transcription quality provided by a bilingual ASR system with another system performing language diarization at the front-end followed by two monolingual ASR systems chosen based on the assigned language label. Our research focuses on the Frisian-Dutch code-switching (CS) speech that is extracted from the archives of a local radio broadcaster. Using 11 hours of manually transcribed Frisian speech as a reference, we aim to increase the amount of available training data by using these automatic transcription techniques. By merging the manually and automatically transcribed data, we learn bilingual acoustic models and run ASR experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic transcriptions. Using these acoustic models, we present speech recognition and CS detection accuracies. The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.

Journal ArticleDOI
TL;DR: A literature review on speaker-specific information extraction from speech is presented, covering the latest studies that offer solutions to the aforementioned problem and categorizing them by their robustness against channel mismatch, additive noise, and other degradations such as vocal effort and emotion mismatch.
Abstract: Speech is a signal that includes the speaker's emotion, characteristic specifications, phoneme information, etc. Various methods have been proposed for speaker recognition by extracting specifications of a given utterance. Among them, short-term cepstral features are used extensively in the speech and speaker recognition areas because of their low complexity and high performance in controlled environments. On the other hand, their performance decreases dramatically under degraded conditions such as channel mismatch, additive noise, emotional variability, etc. In this paper, a literature review on speaker-specific information extraction from speech is presented by considering the latest studies offering solutions to the aforementioned problem. The studies are categorized in three groups considering their robustness against channel mismatch, additive noise, and other degradations such as vocal effort, emotion mismatch, etc. For a more understandable representation, they are also classified into two tables b...

Patent
30 Jan 2017
TL;DR: In this paper, features are disclosed for automatically identifying a speaker, and scores are determined based on individual components of Gaussian mixture models (GMMs) that score best for frames of audio data of an utterance.
Abstract: Features are disclosed for automatically identifying a speaker. Artifacts of automatic speech recognition (“ASR”) and/or other automatically determined information may be processed against individual user profiles or models. Scores may be determined reflecting the likelihood that individual users made an utterance. The scores can be based on, e.g., individual components of Gaussian mixture models (“GMMs”) that score best for frames of audio data of an utterance. A user associated with the highest likelihood score for a particular utterance can be identified as the speaker of the utterance. Information regarding the identified user can be provided to components of a spoken language processing system, separate applications, etc.

Journal ArticleDOI
TL;DR: The proposed DTW approach obtained at least 74% relative improvement in equal error rate on the RSR corpus over other state-of-the-art approaches, including i-vector and JFA.

Journal ArticleDOI
TL;DR: The results indicate that the proposed active learning algorithms are able to reduce diarization error rate significantly with a relatively small amount of human supervision.
Abstract: Most speaker diarization research has focused on unsupervised scenarios, where no human supervision is available. However, in many real-world applications, a certain amount of human input could be expected, especially when minimal human supervision brings significant performance improvement. In this study, we propose an active learning based bottom-up speaker clustering algorithm to effectively improve speaker diarization performance with limited human input. Specifically, the proposed active learning based speaker clustering has two different stages: explore and constrained clustering. The explore stage aims to quickly discover at least one sample for each speaker, boosting the speaker clustering process with reliable initial speaker clusters. After discovering all, or a majority, of the involved speakers during the explore stage, constrained clustering is performed. Constrained clustering is similar to the traditional bottom-up clustering process, with the important difference that the clusters created during the explore stage are restricted from merging with each other. Constrained clustering continues until only the clusters generated from the explore stage are left. Since the objective of the active learning based speaker clustering algorithm is to provide good initial speaker models, performance saturates as soon as sufficient examples are ensured for each cluster. To further improve diarization performance with increasing human input, we propose a second method which actively selects speech segments that account for the largest expected speaker error from existing cluster assignments for human evaluation and reassignment. The algorithms are evaluated on our recently created Apollo Mission Control Center dataset as well as the Augmented Multi-party Interaction (AMI) meeting corpus. The results indicate that the proposed active learning algorithms are able to reduce diarization error rate significantly with a relatively small amount of human supervision.
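
A toy reading of the constrained-clustering stage described above: clusters seeded during the explore stage may absorb unlabeled segments but never merge with each other. This is a simplified sketch with random embeddings and centroid-based cosine merging, not the authors' implementation.

import numpy as np

def constrained_bottom_up(embeddings, explore_ids):
    """Bottom-up merging with a cannot-link constraint between clusters
    seeded by human-labelled 'explore' segments; stops when only the
    seeded clusters remain."""
    clusters = {i: [i] for i in range(len(embeddings))}
    seeded = set(explore_ids)

    def centroid(members):
        return np.mean([embeddings[m] for m in members], axis=0)

    while len(clusters) > len(seeded):
        keys = list(clusters)
        best, best_sim = None, -np.inf
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a in seeded and b in seeded:
                    continue  # cannot-link constraint between seeded clusters
                ca, cb = centroid(clusters[a]), centroid(clusters[b])
                sim = ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb))
                if sim > best_sim:
                    best, best_sim = (a, b), sim
        a, b = best
        # Keep the seeded cluster (if any) so its label survives the merge.
        keep, drop = (a, b) if a in seeded or b not in seeded else (b, a)
        clusters[keep].extend(clusters.pop(drop))
    return clusters

# Placeholder segment embeddings; segments 0 and 5 were labelled during 'explore'.
rng = np.random.default_rng(6)
embs = np.vstack([rng.normal(0, 1, (5, 16)), rng.normal(3, 1, (5, 16))])
print(constrained_bottom_up(embs, explore_ids=[0, 5]))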

Proceedings ArticleDOI
20 Aug 2017
TL;DR: It is found that by tandeming MFCC with bottleneck features, EERs can be further reduced.
Abstract: We investigate how to improve the performance of DNN i-vector based speaker verification for short, text-constrained test utterances, e.g. connected digit strings. A text-constrained verification, due to its smaller, limited vocabulary, can deliver better performance than a text-independent one for a short utterance. We study the problem with a “phonetically aware” Deep Neural Net (DNN) in its capability on “stochastic phonetic-alignment” in constructing supervectors and estimating the corresponding i-vectors with two speech databases: a large vocabulary, conversational, speaker independent database (Fisher) and a small vocabulary, continuous digit database (RSR2015 Part III). The phonetic alignment efficiency and resultant speaker verification performance are compared with differently sized senone sets which can characterize the phonetic pronunciations of utterances in the two databases. Performance on the RSR2015 Part III evaluation shows a relative improvement in EER of 7.89% for male speakers and 3.54% for female speakers with only digit-related senones. The DNN bottleneck features were also studied to investigate their capability of extracting phonetically sensitive information which is useful for text-independent or text-constrained speaker verification. We found that by tandeming MFCC with bottleneck features, EERs can be further reduced.