
Showing papers on "Speaker recognition published in 2018"


Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Abstract: In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.
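As an illustration (not code from the paper), a minimal numpy/scipy sketch of the two augmentations named above, assuming additive noise mixed at a target SNR and reverberation simulated by convolving the waveform with a room impulse response:

```python
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix a noise segment into speech at a target SNR (in dB)."""
    noise = np.resize(noise, speech.shape)             # loop/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response and keep the original length."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)         # simple peak normalisation

# toy usage with random signals standing in for real audio and RIRs
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                     # 1 s at 16 kHz
noisy = add_noise(clean, rng.standard_normal(8000), snr_db=10)
reverberant = add_reverb(clean, np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000))
```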

2,300 citations


Proceedings ArticleDOI
14 Jun 2018
TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.
Abstract: The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.

1,289 citations


Proceedings ArticleDOI
29 Jul 2018
TL;DR: This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
Abstract: Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only the low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
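For illustration, a minimal numpy sketch of the sinc band-pass idea: each filter is the difference of two windowed sinc low-pass filters and is fully determined by its low and high cutoffs. In SincNet these cutoffs are the learned parameters; the sketch below just evaluates fixed ones:

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251, fs=16000):
    """Band-pass FIR filter built as the difference of two windowed sinc low-pass
    filters. f_low and f_high are cutoff frequencies in Hz; in SincNet these two
    values per filter are what the network learns."""
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / fs   # time axis centred at 0
    lp_high = 2 * f_high * np.sinc(2 * f_high * t)               # low-pass at f_high
    lp_low = 2 * f_low * np.sinc(2 * f_low * t)                  # low-pass at f_low
    band = (lp_high - lp_low) * np.hamming(kernel_size)          # band-pass + window
    return band / np.sum(np.abs(band))

# one filter of a (hypothetical) first convolutional layer applied to raw audio
h = sinc_bandpass(f_low=300.0, f_high=3400.0)
x = np.random.randn(16000)                                       # 1 s of raw waveform
y = np.convolve(x, h, mode="same")                               # first-layer convolution
```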

605 citations


Patent
12 Mar 2018
TL;DR: This patent specification covers new algorithms, methods, and systems for Artificial Intelligence and a first application of General-AI (versus Specific, Vertical, or Narrow-AI, and including Explainable-AI or XAI); the addition of reasoning, inference, and cognitive layers/engines to the learning module/engine/layer; soft computing; the Information Principle; Stratification; the Incremental Enlargement Principle; and deep-level/detailed recognition, e.g., image, speech, and speaker recognition.
Abstract: Specification covers new algorithms, methods, and systems for: Artificial Intelligence; the first application of General-AI. (versus Specific, Vertical, or Narrow-AI) (as humans can do) (which also includes Explainable-AI or XAI); addition of reasoning, inference, and cognitive layers/engines to learning module/engine/layer; soft computing; Information Principle; Stratification; Incremental Enlargement Principle; deep-level/detailed recognition, e.g., image recognition (e.g., for action, gesture, emotion, expression, biometrics, fingerprint, tilted or partial-face, OCR, relationship, position, pattern, and object); Big Data analytics; machine learning; crowd-sourcing; classification; clustering; SVM; similarity measures; Enhanced Boltzmann Machines; Enhanced Convolutional Neural Networks; optimization; search engine; ranking; semantic web; context analysis; question-answering system; soft, fuzzy, or un-sharp boundaries/impreciseness/ambiguities/fuzziness in class or set, e.g., for language analysis; Natural Language Processing (NLP); Computing-with-Words (CWW); parsing; machine translation; music, sound, speech, or speaker recognition; video search and analysis (e.g., “intelligent tracking”, with detailed recognition); image annotation; image or color correction; data reliability; Z-Number; Z-Web; Z-Factor; rules engine; playing games; control system; autonomous vehicles or drones; self-diagnosis and self-repair robots; system diagnosis; medical diagnosis/images; genetics; drug discovery; biomedicine; data mining; event prediction; financial forecasting (e.g., for stocks); economics; risk assessment; fraud detection (e.g., for cryptocurrency); e-mail management; database management; indexing and join operation; memory management; data compression; event-centric social network; social behavior; drone/satellite vision/navigation; smart city/home/appliances/IoT; and Image Ad and Referral Networks, for e-commerce, e.g., 3D shoe recognition, from any view angle.

216 citations


Posted Content
TL;DR: Two networks are trained: a speaker recognition network that produces speaker-discriminative embeddings, and a spectrogram masking network that takes both a noisy spectrogram and a speaker embedding as input and produces a mask.
Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
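A hedged sketch (not the paper's code) of how a predicted time-frequency mask would be applied to the noisy spectrogram, using scipy's STFT; the masking network itself, conditioned on the target speaker's embedding, is assumed and not shown:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(noisy_wave, mask, fs=16000, nperseg=400, noverlap=240):
    """Apply a [0, 1] time-frequency mask to the noisy spectrogram and resynthesise
    with the noisy phase. In the described system the mask comes from a network
    that sees both the noisy spectrogram and the target speaker's embedding."""
    _, _, Z = stft(noisy_wave, fs=fs, nperseg=nperseg, noverlap=noverlap)
    enhanced = mask * np.abs(Z) * np.exp(1j * np.angle(Z))   # masked magnitude, noisy phase
    _, x_hat = istft(enhanced, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x_hat

# toy usage: an all-pass mask leaves the signal (approximately) unchanged
wave = np.random.randn(16000)
shape = stft(wave, fs=16000, nperseg=400, noverlap=240)[2].shape   # (freq_bins, frames)
recon = apply_mask(wave, np.ones(shape))
```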

197 citations


Proceedings ArticleDOI
12 Apr 2018
TL;DR: The authors explored different topologies and variants of the attention layer, compared different pooling methods on the attention weights, and showed that attention-based models can improve the Equal Error Rate (EER) of a speaker verification system by a relative 14% compared to a non-attention LSTM baseline model.
Abstract: Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.
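A minimal numpy sketch of attention-based pooling over frame-level outputs, one plausible form of the attention layers compared in the paper (a single learned scoring vector is assumed for illustration):

```python
import numpy as np

def attention_pool(frames, v):
    """Attention pooling: score each frame with a learned vector v, softmax the
    scores over time, and return the weighted average as the utterance embedding.
    frames: (T, D) frame-level outputs (e.g. LSTM states); v: (D,) attention vector."""
    scores = frames @ v                                  # (T,)
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over time
    return weights @ frames                              # (D,) pooled embedding

T, D = 120, 64
frames = np.random.randn(T, D)
v = np.random.randn(D)
embedding = attention_pool(frames, v)                    # replaces last-frame / mean pooling
```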

174 citations


Journal ArticleDOI
TL;DR: A novel text-independent speaker verification framework based on the triplet loss and a very deep convolutional neural network architecture is investigated in this study, where a fixed-length speaker discriminative embedding is learned from sparse speech features and utilized as a feature representation for the SV tasks.
Abstract: The effectiveness of introducing deep neural networks into conventional speaker recognition pipelines has been broadly shown to benefit system performance. A novel text-independent speaker verification (SV) framework based on the triplet loss and a very deep convolutional neural network architecture (i.e., Inception-Resnet-v1) is investigated in this study, where a fixed-length speaker discriminative embedding is learned from sparse speech features and utilized as a feature representation for the SV tasks. A concise description of the neural network based speaker discriminative training with triplet loss is presented. A Euclidean distance similarity metric is applied in both network training and SV testing, which allows the SV system to follow an end-to-end fashion. By replacing the final max/average pooling layer with a spatial pyramid pooling layer in the Inception-Resnet-v1 architecture, the fixed-length input constraint is relaxed and an obvious performance gain is achieved compared with the fixed-length input speaker embedding system. For datasets with more severe training/test condition mismatches, the probabilistic linear discriminant analysis (PLDA) back end is further introduced to replace the distance based scoring for the proposed speaker embedding system. Thus, we reconstruct the SV task with a neural network based front-end speaker embedding system and a PLDA that provides channel and noise variability compensation in the back end. Extensive experiments are conducted to provide useful hints that lead to a better testing performance. Comparison with the state-of-the-art SV frameworks on three public datasets (i.e., a prompt speech corpus, a conversational speech Switchboard corpus, and NIST SRE10 10 s–10 s condition) justifies the effectiveness of our proposed speaker embedding system.
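A minimal numpy sketch of the triplet loss with a Euclidean distance metric, as used for the discriminative embedding training described above (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with Euclidean distance: pull same-speaker embeddings together
    and push different-speaker embeddings at least `margin` further away."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# toy embeddings standing in for the network's outputs
rng = np.random.default_rng(1)
a, p, n = rng.standard_normal((3, 128))
print(triplet_loss(a, p, n))
```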

151 citations


Proceedings ArticleDOI
25 Apr 2018
TL;DR: Experiments demonstrate that the proposed domain adversarial training method is not only effective in solving the dataset mismatch problem, but also outperforms the compared unsupervised domain adaptation methods.
Abstract: The i-vector approach to speaker recognition has achieved good performance when the domain of the evaluation dataset is similar to that of the training dataset. However, in real-world applications there is always a mismatch between the training and evaluation datasets, which leads to performance degradation. To address this problem, this paper proposes to learn domain-invariant and speaker-discriminative speech representations via domain adversarial training. Specifically, with the domain adversarial training method, we use a gradient reversal layer to remove the domain variation and project the different domain data into the same subspace. Moreover, we compare the proposed method with other state-of-the-art unsupervised domain adaptation techniques for the i-vector approach to speaker recognition (e.g. autoencoder based domain adaptation, inter-dataset variability compensation, dataset-invariant covariance normalization, and so on). Experiments on the 2013 domain adaptation challenge (DAC) dataset demonstrate that the proposed method is not only effective in solving the dataset mismatch problem, but also outperforms the compared unsupervised domain adaptation methods.
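A minimal PyTorch sketch of a gradient reversal layer, the standard building block of domain adversarial training mentioned above (layer sizes and the lambda value are illustrative assumptions):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor is trained to confuse the domain
    classifier stacked on top of it."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # None: no gradient w.r.t. lambd

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# toy usage: features flow unchanged forward, gradients are reversed backward
feats = torch.randn(8, 64, requires_grad=True)
domain_logits = torch.nn.Linear(64, 2)(grad_reverse(feats, lambd=0.5))
domain_logits.sum().backward()
```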

132 citations


Journal ArticleDOI
TL;DR: A novel model is proposed to enhance the recognition accuracy of short-utterance speaker recognition, using a convolutional neural network to process spectrograms, which describe speakers better; the resulting system achieves considerable accuracy as well as reasonable convergence speed.
Abstract: During the last few years, speaker recognition has attracted wide attention for its extensive application in many fields, such as speech communications, domestic services, and smart terminals. As a critical method, the Gaussian mixture model (GMM) makes it possible to achieve recognition capability close to human hearing ability on long speech. However, the GMM fails to recognize short-utterance speakers with high accuracy. To solve this problem, in this paper we propose a novel model to enhance the recognition accuracy of a short-utterance speaker recognition system. Different from traditional models based on the GMM, we design a method to train a convolutional neural network to process spectrograms, which can describe speakers better. Thus, the recognition system achieves considerable accuracy as well as reasonable convergence speed. The experimental results show that our model decreases the equal error rate of the recognition from 4.9% to 2.5%.
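A minimal scipy sketch of the log-spectrogram input such a CNN would consume (frame and hop sizes are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(wave, fs=16000, nperseg=400, noverlap=240):
    """Log-magnitude spectrogram used as the CNN input image (frequency x time)."""
    f, t, S = spectrogram(wave, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.log(S + 1e-10)

spec = log_spectrogram(np.random.randn(2 * 16000))   # a 2-second "short utterance"
print(spec.shape)                                     # (freq_bins, time_frames)
```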

123 citations


Journal ArticleDOI
TL;DR: In this article, a flexible piezoelectric acoustic sensor (f-PAS) with a highly sensitive multi-resonant frequency band was fabricated by mimicking the operating mechanism of the basilar membrane in the human cochlea.

113 citations


Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper describes a novel approach for deriving semantics directly from the speech signal without the need for an explicit speech recognition step and demonstrates its effectiveness in comparison to the conventional approach.
Abstract: While conventional approaches to spoken language understanding involve cascading a speech recognizer with a language understanding system, in this paper, we describe a novel approach for deriving semantics directly from the speech signal without the need for an explicit speech recognition step. We evaluate this approach in the context of a customer care dialog system and demonstrate its effectiveness in comparison to the conventional approach.

Journal ArticleDOI
TL;DR: The authors present an extensive survey of SV with short utterances considering the studies from recent past and include latest research offering various solutions and analyses to address the limited data issue within the scope of SV.
Abstract: Automatic speaker verification (ASV) technology now reports a reasonable level of accuracy in its applications in voice-based biometric systems. However, it requires an adequate amount of speech data for enrolment and verification; otherwise, the performance becomes considerably degraded. For this reason, the trade-off between convenience and security is difficult to maintain in practical scenarios. The utterance duration remains a critical issue while deploying a voice biometric system in real-world applications. A large amount of research work has been carried out to address the limited data issue within the scope of speaker verification (SV). The advancements and research activities in mitigating the challenges due to short utterances have seen a significant rise in recent times. In this study, the authors present an extensive survey of SV with short utterances, considering studies from the recent past and including the latest research offering various solutions and analyses. The review also summarises the major findings of studies of the duration variability problem in ASV systems. Finally, they discuss a number of possible future directions promoting further research in this field.

Posted Content
TL;DR: Results of experiments suggest that simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%, and that the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
Abstract: Incremental improvements in accuracy of Convolutional Neural Networks are usually achieved through use of deeper and more complex models trained on larger datasets. However, enlarging datasets and models increases the computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without use of extra data or deeper and more complex models, by augmenting the training and testing data, finding the optimal dimensionality of the embedding space, and using more discriminative loss functions. Results of experiments on the VoxCeleb dataset suggest that: (i) simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%; (ii) lower dimensional embeddings are more suitable for verification; (iii) use of the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
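A minimal numpy sketch of the two waveform-level augmentations named in the abstract, repetition and time-reversal (one plausible reading of the procedure):

```python
import numpy as np

def augment(utterance):
    """Return the two augmented variants discussed above:
    the utterance repeated twice, and the utterance reversed in time."""
    repeated = np.concatenate([utterance, utterance])
    reversed_ = utterance[::-1].copy()
    return repeated, reversed_

wave = np.random.randn(16000)          # stand-in for a real utterance
rep, rev = augment(wave)
```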

Journal ArticleDOI
TL;DR: This paper studies the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts, and shows that the audio word count vectors generated from a mixture of DBN features at different layers give better performance than the MFCC features.
Abstract: Learning representation from audio data has shown advantages over handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach to combine the learned features and the MFCC features for the speaker recognition task, which can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts. These vectors represent audio scripts of different lengths, which makes it easier to train a classifier. We show in the experiments that the audio word count vectors generated from a mixture of DBN features at different layers give better performance than the MFCC features. We can also achieve further improvement by combining the audio word count vector and the MFCC features.
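A hedged sketch of the bag-of-audio-words step: frame-level features are quantized against a codebook and counted into a fixed-length vector regardless of utterance length. A k-means codebook is used here purely for illustration; the paper derives the quantization from DBN features:

```python
import numpy as np
from sklearn.cluster import KMeans

# codebook learned on training frames (features could be MFCCs or DBN activations)
train_frames = np.random.randn(5000, 40)
codebook = KMeans(n_clusters=256, n_init=4, random_state=0).fit(train_frames)

def audio_word_counts(frames, codebook):
    """Quantise each frame to its nearest codeword and count occurrences,
    giving a fixed-length vector for an utterance of any duration."""
    words = codebook.predict(frames)
    counts = np.bincount(words, minlength=codebook.n_clusters)
    return counts / counts.sum()                      # normalised histogram

utt = np.random.randn(300, 40)                        # 300 frames of 40-dim features
vec = audio_word_counts(utt, codebook)                # shape (256,)
```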

Proceedings ArticleDOI
12 Sep 2018
TL;DR: A Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings is proposed and it is found that the networks are better at discriminating broad phonetic classes than individual phonemes.
Abstract: In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and has the potential for other analyses to improve speaker recognition.

Journal ArticleDOI
01 Jan 2018
TL;DR: Results suggest that Rasta-PLP is the most reliable parameterization for the proposed task among all the tested features, while the two employed classification schemes perform similarly; results also confirm that kinetic changes provide a substantial performance improvement in Parkinson's Disease automatic detection systems and should be considered in the future.
Abstract: The diagnosis of Parkinson's Disease is a challenging task which might be supported by new tools to objectively evaluate the presence of deviations in patient's motor capabilities. To this respect, the dysarthric nature of patient's speech has been exploited in several works to detect the presence of this disease, but none of them has deeply studied the use of state-of-the-art speaker recognition techniques for this task. In this paper, two classification schemes (GMM-UBM and i-Vectors-GPLDA) are employed separately with several parameterization techniques, namely PLP, MFCC and LPC. Additionally, the influence of the kinetic changes, described by their derivatives, is analysed. With the proposed methodology, an accuracy of 87% with an AUC of 0.93 is obtained in the optimal configuration. These results are comparable to those obtained in other works employing speech for Parkinson's Disease detection and confirm that the selected speaker recognition techniques are a solid baseline to compare with future works. Results suggest that Rasta-PLP is the most reliable parameterization for the proposed task among all the tested features while the two employed classification schemes perform similarly. Additionally, results confirm that kinetic changes provide a substantial performance improvement in Parkinson's Disease automatic detection systems and should be considered in the future.

Journal ArticleDOI
TL;DR: A novel age estimation system based on LSTM-RNNs is presented that is able to deal with short utterances, can be easily deployed in a real-time architecture, and is compared with a state-of-the-art i-vector approach.
Abstract: Age estimation from speech has recently received increased interest as it is useful for many applications such as user-profiling, targeted marketing, or personalized call-routing. These kinds of applications need to quickly estimate the age of the speaker and might greatly benefit from real-time capabilities. Long short-term memory (LSTM) recurrent neural networks (RNNs) have been shown to outperform state-of-the-art approaches in related speech-based tasks, such as language identification or voice activity detection, especially when an accurate real-time response is required. In this paper, we propose a novel age estimation system based on LSTM-RNNs. This system is able to deal with short utterances (from 3 to 10 s) and can be easily deployed in a real-time architecture. The proposed system has been tested and compared with a state-of-the-art i-vector approach using data from the NIST speaker recognition evaluation 2008 and 2010 data sets. Experiments on short duration utterances show a relative improvement of up to 28% in terms of mean absolute error of this new approach over the baseline system.

Journal ArticleDOI
TL;DR: The performances of several classification methods are compared, including Gaussian Mixture Model–Universal Background Model (GMM–UBM), GMM–Support Vector Machine (G MM–SVM) and i-vector based approaches, and the utility of different frequency bands for speaker, age-group and gender recognition from children’s speech is assessed.

Posted Content
TL;DR: The Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described and the proposed approach is a fusion of two different Convolutional Neural Network topologies.
Abstract: In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described. Also, an analysis of different methods on the leaderboard set is provided. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is a common two-dimensional CNN, which is mainly used in image classification. The second is a one-dimensional CNN for extracting fixed-length audio segment embeddings, so-called x-vectors, which have also been used in speech processing, especially for speaker recognition. In addition to the different topologies, two types of features were tested: log mel-spectrogram and CQT features. Finally, the outputs of the different systems are fused using simple output averaging in the best performing system. Our submissions ranked third among 24 teams in the ASC sub-task A (task1a).

Proceedings ArticleDOI
02 Sep 2018
TL;DR: This work demonstrates that performance of deep speaker embeddings based systems can be improved by using Cosine Similarity Metric Learning (CSML) with the triplet loss training scheme.
Abstract: Deep neural network based speaker embeddings have become increasingly popular in the text-independent speaker recognition task. In contrast to a generatively trained i-vector extractor, a DNN speaker embedding extractor is usually trained discriminatively in the closed-set classification scenario using softmax. The problem addressed in this paper is choosing a DNN-based speaker embedding back-end solution for speaker verification scoring. There are several options to perform speaker verification in the DNN embedding space. One of them is using a simple heuristic speaker similarity metric for scoring (e.g. the cosine metric). As with i-vector based systems, standard Linear Discriminant Analysis (LDA) followed by Probabilistic Linear Discriminant Analysis (PLDA) can be used for segregating speaker information. As an alternative, a discriminative metric learning approach can be considered. This work demonstrates that the performance of deep speaker embedding based systems can be improved by using Cosine Similarity Metric Learning (CSML) with the triplet loss training scheme. Results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the superiority and robustness of CSML based systems.
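A minimal numpy sketch of cosine-similarity scoring between an enrolment embedding and a test embedding, the simplest of the back-end options listed above (the decision threshold is illustrative):

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between an enrolment embedding and a test embedding;
    the verification decision is a threshold on this score."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-12))

enroll_emb = np.random.randn(512)
test_emb = np.random.randn(512)
accept = cosine_score(enroll_emb, test_emb) > 0.4     # threshold tuned on a dev set
```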

Journal ArticleDOI
TL;DR: This paper aims to implement speaker recognition for Hindi speech samples using mel frequency cepstral coefficient–vector quantization (MFCC-VQ) and mel frequency cepstral coefficient–Gaussian mixture model (MFCC-GMM) approaches for text-dependent and text-independent phrases.

Posted Content
TL;DR: It is demonstrated that using an angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized, discriminative speaker embedding extractor.
Abstract: We investigate deep neural network performance in the text-independent speaker recognition task. We demonstrate that using an angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized, discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements for previous DNN-based extractor systems to increase the speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding back-end. The results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the robustness of the proposed systems when dealing with close-to-real-life conditions.
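A simplified numpy sketch of the cosine-logit idea behind angular softmax: embeddings and per-speaker weight vectors are length-normalized so each logit is a scaled cosine of the angle between them; the full angular softmax additionally applies an angular margin to the target class, which is omitted here:

```python
import numpy as np

def cosine_logits(embeddings, class_weights, scale=30.0):
    """Length-normalise embeddings and per-speaker weight vectors so each logit
    equals s * cos(theta). Angular-margin variants further enlarge the angle of
    the target class before the softmax / cross-entropy step."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return scale * (e @ w.T)                          # (batch, num_speakers)

logits = cosine_logits(np.random.randn(8, 256), np.random.randn(1000, 256))
```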

Posted Content
TL;DR: In this paper, a CNN-based speaker recognition model was proposed for extracting robust speaker embeddings, which can be extracted efficiently with linear activation in the embedding layer with text-independent input.
Abstract: In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and has the potential for other analyses to improve speaker recognition.

Journal ArticleDOI
TL;DR: This study introduces a novel class of curriculum learning (CL) based algorithms for noise-robust speaker recognition, applied at two stages within a state-of-the-art speaker verification system: at the i-Vector extractor estimation and at the probabilistic linear discriminant analysis (PLDA) back-end.
Abstract: Performance of speaker identification (SID) systems is known to degrade rapidly in the presence of mismatch such as noise and channel degradations. This study introduces a novel class of curriculum learning (CL) based algorithms for noise-robust speaker recognition. We introduce CL-based approaches at two stages within a state-of-the-art speaker verification system: at the i-Vector extractor estimation and at the probabilistic linear discriminant analysis (PLDA) back-end. Our proposed CL-based approaches operate by categorizing the available training data into progressively more challenging subsets using a suitable difficulty criterion. Next, the corresponding training algorithms are initialized with a subset that is closest to a clean noise-free set, and progressively move to subsets that are more challenging for training as the algorithms progress. We evaluate the performance of our proposed approaches on the noisy and severely degraded data from the DARPA RATS SID task, and show consistent and significant improvement across multiple test sets over a baseline SID framework with a standard i-Vector extractor and multisession PLDA-based back-end. We also construct a very challenging evaluation set by adding noise to the NIST SRE 2010 C5 extended condition trials, where our proposed CL-based PLDA is shown to offer significant improvements over a traditional PLDA-based back-end.

Journal ArticleDOI
18 Sep 2018
TL;DR: In this article, an attention-based Encoder-Decoder RNN (Recurrent Neural Network) structure was used to assign varying attention weights to different EEG channels based on their importance, and the discriminative representations learned from the attention-based RNN were used to identify the user through a boosting classifier.
Abstract: Person identification technology recognizes individuals by exploiting their unique, measurable physiological and behavioral characteristics. However, the state-of-the-art person identification systems have been shown to be vulnerable, e.g., anti-surveillance prosthetic masks can thwart face recognition, contact lenses can trick iris recognition, vocoder can compromise voice identification and fingerprint films can deceive fingerprint sensors. EEG (Electroencephalography)-based identification, which utilizes the user's brainwave signals for identification and offers a more resilient solution, has recently drawn a lot of attention. However, the state-of-the-art systems cannot achieve similar accuracy as the aforementioned methods. We propose MindID, an EEG-based biometric identification approach, with the aim of achieving high accuracy and robust performance. At first, the EEG data patterns are analyzed and the results show that the Delta pattern contains the most distinctive information for user identification. Next, the decomposed Delta signals are fed into an attention-based Encoder-Decoder RNNs (Recurrent Neural Networks) structure which assigns varying attention weights to different EEG channels based on their importance. The discriminative representations learned from the attention-based RNN are used to identify the user through a boosting classifier. The proposed approach is evaluated over 3 datasets (two local and one public). One local dataset (EID-M) is used for performance assessment and the results illustrate that our model achieves an accuracy of 0.982 and significantly outperforms the state-of-the-art and relevant baselines. The second local dataset (EID-S) and a public dataset (EEG-S) are utilized to demonstrate the robustness and adaptability, respectively. The results indicate that the proposed approach has the potential to be widely deployed in practical settings.

Proceedings ArticleDOI
18 Apr 2018
TL;DR: In this paper, an end-to-end speaker verification system was developed that is initialized to mimic an i-vector + PLDA baseline, which is then further trained in an end to end manner but regularized so that it does not deviate too far from the initial system.
Abstract: Recently, several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, for text-independent tasks with longer utterances, end-to-end systems are still outperformed by standard i-vector + PLDA systems. In this work, we develop an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline. The system is then further trained in an end-to-end manner but regularized so that it does not deviate too far from the initial system. In this way we mitigate overfitting which normally limits the performance of end-to-end systems. The proposed system outperforms the i-vector + PLDA baseline on both long and short duration utterances.

Proceedings ArticleDOI
13 Apr 2018
TL;DR: The work shows that a speaker recognition system working robustly in the far-field scenario can be developed: weighted prediction error based dereverberation, combined with a generalized eigenvalue beamformer using power-spectral density weighting masks generated by neural networks, provides results approaching the clean close-microphone setup.
Abstract: This paper deals with far-field speaker recognition. On a corpus of NIST SRE 2010 data retransmitted in a real room with multiple microphones, we first demonstrate how room acoustics cause significant degradation of a state-of-the-art i-vector based speaker recognition system. We then investigate several techniques to improve performance, ranging from probabilistic linear discriminant analysis (PLDA) re-training, through dereverberation, to beamforming. We found that weighted prediction error (WPE) based dereverberation combined with a generalized eigenvalue beamformer using power-spectral density (PSD) weighting masks generated by neural networks (NNs) provides results approaching the clean close-microphone setup. Further improvement was obtained by re-training the PLDA or the mask-generating NNs on simulated target data. The work shows that a speaker recognition system working robustly in the far-field scenario can be developed.

Journal ArticleDOI
TL;DR: This work addresses the problem of normal-whisper acoustic mismatch compensation from the viewpoint of robust feature extraction using a novel method, frequency-domain linear prediction with time-varying linear prediction (FDLP-TVLP), which is an extension of the 2-dimensional autoregressive (2DAR) model that allows vocal tract filter parameters to be time- varying, rather than piecewise constant as in classic short-term speech analysis.

Journal ArticleDOI
TL;DR: This paper proposes the design of a robust speaker identification algorithm embedding a smart preprocessing method based on voice activity detection, which can effectively reduce the influence of noise and distance on classification.
Abstract: The importance of robust audio speech processing has rapidly increased in recent years, as the number of smart and connected devices grows. This effect is strongly related to the Internet of Things framework, introducing concepts such as connected vehicles and future smart cities. Context-aware applications are fundamental in this evolving environment, enabling smart and custom-tailored services for a variety of users. The use of on-board speaker recognition (SR) systems can play a key role in enhancing the customization of in-vehicle applications, by identifying the actual users and personalizing services based on their identity. Driven by this motivation, in this paper we present a performance study of an SR system designed to face the typical challenging conditions of an in-vehicle environment. We propose the design of a robust speaker identification algorithm embedding a smart preprocessing method based on voice activity detection, which can effectively reduce the influence of noise and distance on classification. Results show that our solution is able to efficiently improve the correct classification rate, even in the case of distant audio acquisition and in a variety of noisy environments.
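A hedged sketch of a simple energy-based voice activity detector that drops low-energy frames before feature extraction; the paper's actual VAD preprocessing may differ:

```python
import numpy as np

def energy_vad(wave, fs=16000, frame_ms=30, threshold_db=-35):
    """Keep only non-overlapping frames whose log energy is within `threshold_db`
    of the loudest frame, and concatenate them as the speech-only signal."""
    frame = int(fs * frame_ms / 1000)
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    keep = energy_db > energy_db.max() + threshold_db
    return frames[keep].reshape(-1)

speech_only = energy_vad(np.random.randn(3 * 16000))   # stand-in for a noisy recording
```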

Journal ArticleDOI
TL;DR: A speaker identification system named the mel frequency cepstral coefficient-based speaker identification system for access control (MSIAC for short) is presented, which identifies a speaker U by first collecting U's voice signals and converting the signals to the frequency domain, and then determines whether access will be accepted or denied.
Abstract: In recent years, by virtue of their convenient and unique features, bio-authentication techniques have been applied to identify and authenticate a person based on his/her spoken words and/or sentences. Among these techniques, speaker recognition/identification is the most convenient one, providing a secure and strong authentication solution viable for a wide range of applications. In this paper, to safeguard real-world objects, like buildings, we develop a speaker identification system named the mel frequency cepstral coefficient (MFCC)-based speaker identification system for access control (MSIAC for short), which identifies a speaker U by first collecting U's voice signals and converting the signals to the frequency domain. An MFCC-based human auditory filtering model is utilized to adjust the energy levels of different frequencies as U's quantified voice features. Next, a Gaussian mixture model is employed to represent the distribution of the logarithmic features as U's specific acoustic model. When a person, e.g., x, would like to access a real-world object protected by the MSIAC, x's acoustic model is compared with the acoustic models of known people. Based on the identification result, the MSIAC will determine whether the access will be accepted or denied.
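A minimal sklearn sketch of the GMM identification step described above: one GMM per enrolled speaker is fit on that speaker's MFCC frames, and a test utterance is assigned to the model with the highest average log-likelihood (the feature matrices here are random stand-ins for real MFCC features):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(speaker_features, n_components=16):
    """Fit one GMM per enrolled speaker on that speaker's MFCC frames."""
    return {spk: GaussianMixture(n_components, covariance_type="diag",
                                 random_state=0).fit(feats)
            for spk, feats in speaker_features.items()}

def identify(models, test_features):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_features))

# toy MFCC matrices (frames x coefficients) standing in for real features
rng = np.random.default_rng(2)
models = enroll({"alice": rng.standard_normal((400, 13)),
                 "bob": rng.standard_normal((400, 13)) + 1.0})
print(identify(models, rng.standard_normal((200, 13)) + 1.0))   # likely "bob"
```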