
Showing papers on "Speaker recognition published in 2018"


Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Abstract: In this paper, we use data augmentation to improve performance of deep neural network (DNN) embeddings for speaker recognition. The DNN, which is trained to discriminate between speakers, maps variable-length utterances to fixed-dimensional embeddings that we call x-vectors. Prior studies have found that embeddings leverage large-scale training datasets better than i-vectors. However, it can be challenging to collect substantial quantities of labeled data for training. We use data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness. The x-vectors are compared with i-vector baselines on Speakers in the Wild and NIST SRE 2016 Cantonese. We find that while augmentation is beneficial in the PLDA classifier, it is not helpful in the i-vector extractor. However, the x-vector DNN effectively exploits data augmentation, due to its supervised training. As a result, the x-vectors achieve superior performance on the evaluation datasets.
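As an illustration (not code from the paper), a minimal numpy/scipy sketch of the two augmentations named above, assuming additive noise mixed at a target SNR and reverberation simulated by convolving the waveform with a room impulse response:

```python
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech, noise, snr_db):
    """Mix a noise segment into speech at a target SNR (in dB)."""
    noise = np.resize(noise, speech.shape)             # loop/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response and keep the original length."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-12)         # simple peak normalisation

# toy usage with random signals standing in for real audio and RIRs
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                     # 1 s at 16 kHz
noisy = add_noise(clean, rng.standard_normal(8000), snr_db=10)
reverberant = add_reverb(clean, np.exp(-np.arange(2000) / 300.0) * rng.standard_normal(2000))
```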

2,300 citations


Proceedings ArticleDOI
14 Jun 2018
TL;DR: In this article, a large-scale audio-visual speaker recognition dataset, VoxCeleb2, is presented, which contains over a million utterances from over 6,000 speakers.
Abstract: The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.

1,289 citations


Proceedings ArticleDOI
29 Jul 2018
TL;DR: This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters, based on parametrized sinc functions, which implement band-pass filters.
Abstract: Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly. Rather than employing standard hand-crafted features, the latter CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This paper proposes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all elements of each filter, only the low and high cutoff frequencies are directly learned from data with the proposed method. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Our experiments, conducted on both speaker identification and speaker verification tasks, show that the proposed architecture converges faster and performs better than a standard CNN on raw waveforms.
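For illustration, a minimal numpy sketch of the sinc band-pass idea: each filter is the difference of two windowed sinc low-pass filters and is fully determined by its low and high cutoffs. In SincNet these cutoffs are the learned parameters; the sketch below just evaluates fixed ones:

```python
import numpy as np

def sinc_bandpass(f_low, f_high, kernel_size=251, fs=16000):
    """Band-pass FIR filter built as the difference of two windowed sinc low-pass
    filters. f_low and f_high are cutoff frequencies in Hz; in SincNet these two
    values per filter are what the network learns."""
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / fs   # time axis centred at 0
    lp_high = 2 * f_high * np.sinc(2 * f_high * t)               # low-pass at f_high
    lp_low = 2 * f_low * np.sinc(2 * f_low * t)                  # low-pass at f_low
    band = (lp_high - lp_low) * np.hamming(kernel_size)          # band-pass + window
    return band / np.sum(np.abs(band))

# one filter of a (hypothetical) first convolutional layer applied to raw audio
h = sinc_bandpass(f_low=300.0, f_high=3400.0)
x = np.random.randn(16000)                                       # 1 s of raw waveform
y = np.convolve(x, h, mode="same")                               # first-layer convolution
```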

605 citations


Patent
12 Mar 2018
TL;DR: This patent specification covers new algorithms, methods, and systems for Artificial Intelligence and a first application of General-AI (versus Specific, Vertical, or Narrow-AI, and including Explainable-AI or XAI); the addition of reasoning, inference, and cognitive layers/engines to the learning module/engine/layer; soft computing; the Information Principle; Stratification; the Incremental Enlargement Principle; and deep-level/detailed recognition, e.g., image, speech, and speaker recognition.
Abstract: Specification covers new algorithms, methods, and systems for: Artificial Intelligence; the first application of General-AI. (versus Specific, Vertical, or Narrow-AI) (as humans can do) (which also includes Explainable-AI or XAI); addition of reasoning, inference, and cognitive layers/engines to learning module/engine/layer; soft computing; Information Principle; Stratification; Incremental Enlargement Principle; deep-level/detailed recognition, e.g., image recognition (e.g., for action, gesture, emotion, expression, biometrics, fingerprint, tilted or partial-face, OCR, relationship, position, pattern, and object); Big Data analytics; machine learning; crowd-sourcing; classification; clustering; SVM; similarity measures; Enhanced Boltzmann Machines; Enhanced Convolutional Neural Networks; optimization; search engine; ranking; semantic web; context analysis; question-answering system; soft, fuzzy, or un-sharp boundaries/impreciseness/ambiguities/fuzziness in class or set, e.g., for language analysis; Natural Language Processing (NLP); Computing-with-Words (CWW); parsing; machine translation; music, sound, speech, or speaker recognition; video search and analysis (e.g., “intelligent tracking”, with detailed recognition); image annotation; image or color correction; data reliability; Z-Number; Z-Web; Z-Factor; rules engine; playing games; control system; autonomous vehicles or drones; self-diagnosis and self-repair robots; system diagnosis; medical diagnosis/images; genetics; drug discovery; biomedicine; data mining; event prediction; financial forecasting (e.g., for stocks); economics; risk assessment; fraud detection (e.g., for cryptocurrency); e-mail management; database management; indexing and join operation; memory management; data compression; event-centric social network; social behavior; drone/satellite vision/navigation; smart city/home/appliances/IoT; and Image Ad and Referral Networks, for e-commerce, e.g., 3D shoe recognition, from any view angle.

216 citations


Posted Content
TL;DR: Two networks are trained: a speaker recognition network that produces speaker-discriminative embeddings, and a spectrogram masking network that takes both a noisy spectrogram and a speaker embedding as input and produces a mask.
Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
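A hedged sketch (not the paper's code) of how a predicted time-frequency mask would be applied to the noisy spectrogram, using scipy's STFT; the masking network itself, conditioned on the target speaker's embedding, is assumed and not shown:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(noisy_wave, mask, fs=16000, nperseg=400, noverlap=240):
    """Apply a [0, 1] time-frequency mask to the noisy spectrogram and resynthesise
    with the noisy phase. In the described system the mask comes from a network
    that sees both the noisy spectrogram and the target speaker's embedding."""
    _, _, Z = stft(noisy_wave, fs=fs, nperseg=nperseg, noverlap=noverlap)
    enhanced = mask * np.abs(Z) * np.exp(1j * np.angle(Z))   # masked magnitude, noisy phase
    _, x_hat = istft(enhanced, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x_hat

# toy usage: an all-pass mask leaves the signal (approximately) unchanged
wave = np.random.randn(16000)
shape = stft(wave, fs=16000, nperseg=400, noverlap=240)[2].shape   # (freq_bins, frames)
recon = apply_mask(wave, np.ones(shape))
```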

197 citations


Proceedings ArticleDOI
12 Apr 2018
TL;DR: The authors explored different topologies and variants of the attention layer, compared different pooling methods on the attention weights, and showed that attention-based models can improve the Equal Error Rate (EER) of a speaker verification system by a relative 14% compared to a non-attention LSTM baseline model.
Abstract: Attention-based models have recently shown great performance on a range of tasks, such as speech recognition, machine translation, and image captioning, due to their ability to summarize relevant information that expands through the entire length of an input sequence. In this paper, we analyze the use of attention mechanisms for the problem of sequence summarization in our end-to-end text-dependent speaker recognition system. We explore different topologies and variants of the attention layer, and compare different pooling methods on the attention weights. Ultimately, we show that attention-based models can improve the Equal Error Rate (EER) of our speaker verification system by a relative 14% compared to our non-attention LSTM baseline model.
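A minimal numpy sketch of attention-based pooling over frame-level outputs, one plausible form of the attention layers compared in the paper (a single learned scoring vector is assumed for illustration):

```python
import numpy as np

def attention_pool(frames, v):
    """Attention pooling: score each frame with a learned vector v, softmax the
    scores over time, and return the weighted average as the utterance embedding.
    frames: (T, D) frame-level outputs (e.g. LSTM states); v: (D,) attention vector."""
    scores = frames @ v                                  # (T,)
    scores = scores - scores.max()                       # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()      # softmax over time
    return weights @ frames                              # (D,) pooled embedding

T, D = 120, 64
frames = np.random.randn(T, D)
v = np.random.randn(D)
embedding = attention_pool(frames, v)                    # replaces last-frame / mean pooling
```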

174 citations


Journal ArticleDOI
TL;DR: A novel text-independent speaker verification framework based on the triplet loss and a very deep convolutional neural network architecture is investigated in this study, where a fixed-length speaker discriminative embedding is learned from sparse speech features and utilized as a feature representation for the SV tasks.
Abstract: The effectiveness of introducing deep neural networks into conventional speaker recognition pipelines has been broadly shown to benefit system performance. A novel text-independent speaker verification (SV) framework based on the triplet loss and a very deep convolutional neural network architecture (i.e., Inception-Resnet-v1) is investigated in this study, where a fixed-length speaker discriminative embedding is learned from sparse speech features and utilized as a feature representation for the SV tasks. A concise description of the neural network based speaker discriminative training with triplet loss is presented. A Euclidean distance similarity metric is applied in both network training and SV testing, which allows the SV system to follow an end-to-end fashion. By replacing the final max/average pooling layer with a spatial pyramid pooling layer in the Inception-Resnet-v1 architecture, the fixed-length input constraint is relaxed and an obvious performance gain is achieved compared with the fixed-length input speaker embedding system. For datasets with more severe training/test condition mismatches, the probabilistic linear discriminant analysis (PLDA) back end is further introduced to replace the distance based scoring for the proposed speaker embedding system. Thus, we reconstruct the SV task with a neural network based front-end speaker embedding system and a PLDA that provides channel and noise variability compensation in the back end. Extensive experiments are conducted to provide useful hints that lead to a better testing performance. Comparison with the state-of-the-art SV frameworks on three public datasets (i.e., a prompt speech corpus, a conversational speech Switchboard corpus, and NIST SRE10 10 s–10 s condition) justifies the effectiveness of our proposed speaker embedding system.
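A minimal numpy sketch of the triplet loss with a Euclidean distance metric, as used for the discriminative embedding training described above (the margin value is illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with Euclidean distance: pull same-speaker embeddings together
    and push different-speaker embeddings at least `margin` further away."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# toy embeddings standing in for the network's outputs
rng = np.random.default_rng(1)
a, p, n = rng.standard_normal((3, 128))
print(triplet_loss(a, p, n))
```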

151 citations


Proceedings ArticleDOI
25 Apr 2018
TL;DR: Experiments demonstrate that the proposed domain adversarial training method is not only effective in solving the dataset mismatch problem, but also outperforms the compared unsupervised domain adaptation methods.
Abstract: The i-vector approach to speaker recognition has achieved good performance when the domain of the evaluation dataset is similar to that of the training dataset. However, in real-world applications there is always a mismatch between the training and evaluation datasets, which leads to performance degradation. To address this problem, this paper proposes to learn domain-invariant and speaker-discriminative speech representations via domain adversarial training. Specifically, with the domain adversarial training method, we use a gradient reversal layer to remove the domain variation and project the different domain data into the same subspace. Moreover, we compare the proposed method with other state-of-the-art unsupervised domain adaptation techniques for the i-vector approach to speaker recognition (e.g. autoencoder based domain adaptation, inter-dataset variability compensation, dataset-invariant covariance normalization, and so on). Experiments on the 2013 domain adaptation challenge (DAC) dataset demonstrate that the proposed method is not only effective in solving the dataset mismatch problem, but also outperforms the compared unsupervised domain adaptation methods.
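A minimal PyTorch sketch of a gradient reversal layer, the standard building block of domain adversarial training mentioned above (layer sizes and the lambda value are illustrative assumptions):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the feature extractor is trained to confuse the domain
    classifier stacked on top of it."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # None: no gradient w.r.t. lambd

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# toy usage: features flow unchanged forward, gradients are reversed backward
feats = torch.randn(8, 64, requires_grad=True)
domain_logits = torch.nn.Linear(64, 2)(grad_reverse(feats, lambd=0.5))
domain_logits.sum().backward()
```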

132 citations


Journal ArticleDOI
TL;DR: A novel model is proposed to enhance the recognition accuracy of short-utterance speaker recognition, using a convolutional neural network to process spectrograms, which describe speakers better; the resulting system achieves considerable accuracy as well as reasonable convergence speed.
Abstract: During the last few years, speaker recognition has attracted wide attention for its extensive application in many fields, such as speech communications, domestic services, and smart terminals. As a critical method, the Gaussian mixture model (GMM) makes it possible to achieve recognition capability close to human hearing ability on long speech. However, the GMM fails to recognize short-utterance speakers with high accuracy. To solve this problem, in this paper we propose a novel model to enhance the recognition accuracy of a short-utterance speaker recognition system. Different from traditional models based on the GMM, we design a method to train a convolutional neural network to process spectrograms, which can describe speakers better. Thus, the recognition system achieves considerable accuracy as well as reasonable convergence speed. The experimental results show that our model decreases the equal error rate of the recognition from 4.9% to 2.5%.
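A minimal scipy sketch of the log-spectrogram input such a CNN would consume (frame and hop sizes are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(wave, fs=16000, nperseg=400, noverlap=240):
    """Log-magnitude spectrogram used as the CNN input image (frequency x time)."""
    f, t, S = spectrogram(wave, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.log(S + 1e-10)

spec = log_spectrogram(np.random.randn(2 * 16000))   # a 2-second "short utterance"
print(spec.shape)                                     # (freq_bins, time_frames)
```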

123 citations


Journal ArticleDOI
TL;DR: In this article, a flexible piezoelectric acoustic sensor (f-PAS) with a highly sensitive multi-resonant frequency band was fabricated by mimicking the operating mechanism of the basilar membrane in the human cochlea.

113 citations


Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper describes a novel approach for deriving semantics directly from the speech signal without the need for an explicit speech recognition step and demonstrates its effectiveness in comparison to the conventional approach.
Abstract: While conventional approaches to spoken language understanding involve cascading a speech recognizer with a language understanding system, in this paper, we describe a novel approach for deriving semantics directly from the speech signal without the need for an explicit speech recognition step. We evaluate this approach in the context of a customer care dialog system and demonstrate its effectiveness in comparison to the conventional approach.

Journal ArticleDOI
TL;DR: The authors present an extensive survey of SV with short utterances considering the studies from recent past and include latest research offering various solutions and analyses to address the limited data issue within the scope of SV.
Abstract: Automatic speaker verification (ASV) technology now reports a reasonable level of accuracy in its applications in voice-based biometric systems. However, it requires an adequate amount of speech data for enrolment and verification; otherwise, the performance becomes considerably degraded. For this reason, the trade-off between convenience and security is difficult to maintain in practical scenarios. The utterance duration remains a critical issue while deploying a voice biometric system in real-world applications. A large amount of research work has been carried out to address the limited data issue within the scope of speaker verification (SV). The advancements and research activities in mitigating the challenges due to short utterances have seen a significant rise in recent times. In this study, the authors present an extensive survey of SV with short utterances, considering studies from the recent past and including the latest research offering various solutions and analyses. The review also summarises the major findings of studies of the duration variability problem in ASV systems. Finally, they discuss a number of possible future directions promoting further research in this field.

Posted Content
TL;DR: Results of experiments suggest that simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%, and that the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
Abstract: Incremental improvements in accuracy of Convolutional Neural Networks are usually achieved through use of deeper and more complex models trained on larger datasets. However, enlarging datasets and models increases the computation and storage costs and cannot be done indefinitely. In this work, we seek to improve the identification and verification accuracy of a text-independent speaker recognition system without use of extra data or deeper and more complex models, by augmenting the training and testing data, finding the optimal dimensionality of the embedding space, and using more discriminative loss functions. Results of experiments on the VoxCeleb dataset suggest that: (i) simple repetition and random time-reversion of utterances can reduce prediction errors by up to 18%; (ii) lower dimensional embeddings are more suitable for verification; (iii) use of the proposed logistic margin loss function leads to unified embeddings with state-of-the-art identification and competitive verification accuracies.
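A minimal numpy sketch of the two waveform-level augmentations named in the abstract, repetition and time-reversal (one plausible reading of the procedure):

```python
import numpy as np

def augment(utterance):
    """Return the two augmented variants discussed above:
    the utterance repeated twice, and the utterance reversed in time."""
    repeated = np.concatenate([utterance, utterance])
    reversed_ = utterance[::-1].copy()
    return repeated, reversed_

wave = np.random.randn(16000)          # stand-in for a real utterance
rep, rev = augment(wave)
```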

Journal ArticleDOI
TL;DR: This paper studies the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts, and shows that the audio word count vectors generated from a mixture of DBN features at different layers give better performance than the MFCC features.
Abstract: Learning representation from audio data has shown advantages over handcrafted features such as mel-frequency cepstral coefficients (MFCCs) in many audio applications. In most representation learning approaches, connectionist systems have been used to learn and extract latent features from fixed-length data. In this paper, we propose an approach to combine the learned features and the MFCC features for the speaker recognition task, which can be applied to audio scripts of different lengths. In particular, we study the use of features from different levels of a deep belief network (DBN) for quantizing the audio data into vectors of audio word counts. These vectors represent audio scripts of different lengths, which makes it easier to train a classifier. We show in the experiments that the audio word count vectors generated from a mixture of DBN features at different layers give better performance than the MFCC features. We can also achieve further improvement by combining the audio word count vector and the MFCC features.
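A hedged sketch of the bag-of-audio-words step: frame-level features are quantized against a codebook and counted into a fixed-length vector regardless of utterance length. A k-means codebook is used here purely for illustration; the paper derives the quantization from DBN features:

```python
import numpy as np
from sklearn.cluster import KMeans

# codebook learned on training frames (features could be MFCCs or DBN activations)
train_frames = np.random.randn(5000, 40)
codebook = KMeans(n_clusters=256, n_init=4, random_state=0).fit(train_frames)

def audio_word_counts(frames, codebook):
    """Quantise each frame to its nearest codeword and count occurrences,
    giving a fixed-length vector for an utterance of any duration."""
    words = codebook.predict(frames)
    counts = np.bincount(words, minlength=codebook.n_clusters)
    return counts / counts.sum()                      # normalised histogram

utt = np.random.randn(300, 40)                        # 300 frames of 40-dim features
vec = audio_word_counts(utt, codebook)                # shape (256,)
```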

Proceedings ArticleDOI
12 Sep 2018
TL;DR: A Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings is proposed and it is found that the networks are better at discriminating broad phonetic classes than individual phonemes.
Abstract: In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and has the potential for other analyses to improve speaker recognition.

Journal ArticleDOI
01 Jan 2018
TL;DR: Results suggest that Rasta-PLP is the most reliable parameterization for the proposed task among all the tested features, while the two employed classification schemes perform similarly; results also confirm that kinetic changes provide a substantial performance improvement in Parkinson's Disease automatic detection systems and should be considered in the future.
Abstract: The diagnosis of Parkinson's Disease is a challenging task which might be supported by new tools to objectively evaluate the presence of deviations in patient's motor capabilities. To this respect, the dysarthric nature of patient's speech has been exploited in several works to detect the presence of this disease, but none of them has deeply studied the use of state-of-the-art speaker recognition techniques for this task. In this paper, two classification schemes (GMM-UBM and i-Vectors-GPLDA) are employed separately with several parameterization techniques, namely PLP, MFCC and LPC. Additionally, the influence of the kinetic changes, described by their derivatives, is analysed. With the proposed methodology, an accuracy of 87% with an AUC of 0.93 is obtained in the optimal configuration. These results are comparable to those obtained in other works employing speech for Parkinson's Disease detection and confirm that the selected speaker recognition techniques are a solid baseline to compare with future works. Results suggest that Rasta-PLP is the most reliable parameterization for the proposed task among all the tested features while the two employed classification schemes perform similarly. Additionally, results confirm that kinetic changes provide a substantial performance improvement in Parkinson's Disease automatic detection systems and should be considered in the future.

Journal ArticleDOI
TL;DR: A novel age estimation system based on LSTM-RNNs is presented that is able to deal with short utterances, can be easily deployed in a real-time architecture, and is compared with a state-of-the-art i-vector approach.
Abstract: Age estimation from speech has recently received increased interest as it is useful for many applications such as user-profiling, targeted marketing, or personalized call-routing. These kinds of applications need to quickly estimate the age of the speaker and might greatly benefit from real-time capabilities. Long short-term memory (LSTM) recurrent neural networks (RNNs) have been shown to outperform state-of-the-art approaches in related speech-based tasks, such as language identification or voice activity detection, especially when an accurate real-time response is required. In this paper, we propose a novel age estimation system based on LSTM-RNNs. This system is able to deal with short utterances (from 3 to 10 s) and can be easily deployed in a real-time architecture. The proposed system has been tested and compared with a state-of-the-art i-vector approach using data from the NIST speaker recognition evaluation 2008 and 2010 data sets. Experiments on short duration utterances show a relative improvement of up to 28% in terms of mean absolute error of this new approach over the baseline system.

Journal ArticleDOI
TL;DR: The performances of several classification methods are compared, including Gaussian Mixture Model–Universal Background Model (GMM–UBM), GMM–Support Vector Machine (G MM–SVM) and i-vector based approaches, and the utility of different frequency bands for speaker, age-group and gender recognition from children’s speech is assessed.

Posted Content
TL;DR: The Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described and the proposed approach is a fusion of two different Convolutional Neural Network topologies.
Abstract: In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described. Also, an analysis of different methods on the leaderboard set is provided. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first is a common two-dimensional CNN, which is mainly used in image classification. The second is a one-dimensional CNN for extracting fixed-length audio segment embeddings, so-called x-vectors, which have also been used in speech processing, especially for speaker recognition. In addition to the different topologies, two types of features were tested: log mel-spectrogram and CQT features. Finally, the outputs of the different systems are fused using simple output averaging in the best performing system. Our submissions ranked third among 24 teams in the ASC sub-task A (task1a).

Proceedings ArticleDOI
02 Sep 2018
TL;DR: This work demonstrates that performance of deep speaker embeddings based systems can be improved by using Cosine Similarity Metric Learning (CSML) with the triplet loss training scheme.
Abstract: Deep neural network based speaker embeddings have become increasingly popular in the text-independent speaker recognition task. In contrast to a generatively trained i-vector extractor, a DNN speaker embedding extractor is usually trained discriminatively in the closed-set classification scenario using softmax. The problem addressed in this paper is choosing a DNN-based speaker embedding back-end solution for speaker verification scoring. There are several options to perform speaker verification in the DNN embedding space. One of them is using a simple heuristic speaker similarity metric for scoring (e.g. the cosine metric). As with i-vector based systems, standard Linear Discriminant Analysis (LDA) followed by Probabilistic Linear Discriminant Analysis (PLDA) can be used for segregating speaker information. As an alternative, a discriminative metric learning approach can be considered. This work demonstrates that the performance of deep speaker embedding based systems can be improved by using Cosine Similarity Metric Learning (CSML) with the triplet loss training scheme. Results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the superiority and robustness of CSML based systems.
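A minimal numpy sketch of cosine-similarity scoring between an enrolment embedding and a test embedding, the simplest of the back-end options listed above (the decision threshold is illustrative):

```python
import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between an enrolment embedding and a test embedding;
    the verification decision is a threshold on this score."""
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test) + 1e-12))

enroll_emb = np.random.randn(512)
test_emb = np.random.randn(512)
accept = cosine_score(enroll_emb, test_emb) > 0.4     # threshold tuned on a dev set
```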

Journal ArticleDOI
TL;DR: This paper aims to implement speaker recognition for Hindi speech samples using mel frequency cepstral coefficient–vector quantization (MFCC-VQ) and mel frequency cepstral coefficient–Gaussian mixture model (MFCC-GMM) approaches for text-dependent and text-independent phrases.

Posted Content
TL;DR: It is demonstrated that using an angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized, discriminative speaker embedding extractor.
Abstract: We investigate deep neural network performance in the text-independent speaker recognition task. We demonstrate that using an angular softmax activation at the last classification layer of a classification neural network, instead of a simple softmax activation, allows training a more generalized, discriminative speaker embedding extractor. Cosine similarity is an effective metric for speaker verification in this embedding space. We also address the problem of choosing an architecture for the extractor. We found that deep networks with residual frame-level connections outperform wide but relatively shallow architectures. This paper also proposes several improvements for previous DNN-based extractor systems to increase the speaker recognition accuracy. We show that the discriminatively trained similarity metric learning approach outperforms the standard LDA-PLDA method as an embedding back-end. The results obtained on the Speakers in the Wild and NIST SRE 2016 evaluation sets demonstrate the robustness of the proposed systems when dealing with close-to-real-life conditions.
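A simplified numpy sketch of the cosine-logit idea behind angular softmax: embeddings and per-speaker weight vectors are length-normalized so each logit is a scaled cosine of the angle between them; the full angular softmax additionally applies an angular margin to the target class, which is omitted here:

```python
import numpy as np

def cosine_logits(embeddings, class_weights, scale=30.0):
    """Length-normalise embeddings and per-speaker weight vectors so each logit
    equals s * cos(theta). Angular-margin variants further enlarge the angle of
    the target class before the softmax / cross-entropy step."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    return scale * (e @ w.T)                          # (batch, num_speakers)

logits = cosine_logits(np.random.randn(8, 256), np.random.randn(1000, 256))
```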

Posted Content
TL;DR: In this paper, a CNN-based speaker recognition model was proposed for extracting robust speaker embeddings, which can be extracted efficiently with linear activation in the embedding layer with text-independent input.
Abstract: In this paper, we propose a Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings. The embedding can be extracted efficiently with linear activation in the embedding layer. To understand how the speaker recognition model operates with text-independent input, we modify the structure to extract frame-level speaker embeddings from each hidden layer. We feed utterances from the TIMIT dataset to the trained network and use several proxy tasks to study the network's ability to represent speech input and differentiate voice identity. We found that the networks are better at discriminating broad phonetic classes than individual phonemes. In particular, frame-level embeddings that belong to the same phonetic classes are similar (based on cosine distance) for the same speaker. The frame-level representation also allows us to analyze the networks at the frame level, and has the potential for other analyses to improve speaker recognition.

Journal ArticleDOI
TL;DR: This study introduces a novel class of curriculum learning (CL) based algorithms for noise-robust speaker recognition, applied at two stages within a state-of-the-art speaker verification system: at the i-Vector extractor estimation and at the probabilistic linear discriminant analysis (PLDA) back-end.
Abstract: Performance of speaker identification (SID) systems is known to degrade rapidly in the presence of mismatch such as noise and channel degradations. This study introduces a novel class of curriculum learning (CL) based algorithms for noise-robust speaker recognition. We introduce CL-based approaches at two stages within a state-of-the-art speaker verification system: at the i-Vector extractor estimation and at the probabilistic linear discriminant analysis (PLDA) back-end. Our proposed CL-based approaches operate by categorizing the available training data into progressively more challenging subsets using a suitable difficulty criterion. Next, the corresponding training algorithms are initialized with a subset that is closest to a clean noise-free set, and progressively move to subsets that are more challenging for training as the algorithms progress. We evaluate the performance of our proposed approaches on the noisy and severely degraded data from the DARPA RATS SID task, and show consistent and significant improvement across multiple test sets over a baseline SID framework with a standard i-Vector extractor and multisession PLDA-based back-end. We also construct a very challenging evaluation set by adding noise to the NIST SRE 2010 C5 extended condition trials, where our proposed CL-based PLDA is shown to offer significant improvements over a traditional PLDA-based back-end.

Journal ArticleDOI
18 Sep 2018
TL;DR: In this article, an attention-based Encoder-Decoder RNN (Recurrent Neural Network) structure was used to assign varying attention weights to different EEG channels based on their importance, and the discriminative representations learned from the attention-based RNN were used to identify the user through a boosting classifier.
Abstract: Person identification technology recognizes individuals by exploiting their unique, measurable physiological and behavioral characteristics. However, the state-of-the-art person identification systems have been shown to be vulnerable, e.g., anti-surveillance prosthetic masks can thwart face recognition, contact lenses can trick iris recognition, vocoder can compromise voice identification and fingerprint films can deceive fingerprint sensors. EEG (Electroencephalography)-based identification, which utilizes the user's brainwave signals for identification and offers a more resilient solution, has recently drawn a lot of attention. However, the state-of-the-art systems cannot achieve similar accuracy as the aforementioned methods. We propose MindID, an EEG-based biometric identification approach, with the aim of achieving high accuracy and robust performance. At first, the EEG data patterns are analyzed and the results show that the Delta pattern contains the most distinctive information for user identification. Next, the decomposed Delta signals are fed into an attention-based Encoder-Decoder RNNs (Recurrent Neural Networks) structure which assigns varying attention weights to different EEG channels based on their importance. The discriminative representations learned from the attention-based RNN are used to identify the user through a boosting classifier. The proposed approach is evaluated over 3 datasets (two local and one public). One local dataset (EID-M) is used for performance assessment and the results illustrate that our model achieves an accuracy of 0.982 and significantly outperforms the state-of-the-art and relevant baselines. The second local dataset (EID-S) and a public dataset (EEG-S) are utilized to demonstrate the robustness and adaptability, respectively. The results indicate that the proposed approach has the potential to be widely deployed in practical settings.

Proceedings ArticleDOI
18 Apr 2018
TL;DR: In this paper, an end-to-end speaker verification system was developed that is initialized to mimic an i-vector + PLDA baseline, which is then further trained in an end to end manner but regularized so that it does not deviate too far from the initial system.
Abstract: Recently, several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, for text-independent tasks with longer utterances, end-to-end systems are still outperformed by standard i-vector + PLDA systems. In this work, we develop an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline. The system is then further trained in an end-to-end manner but regularized so that it does not deviate too far from the initial system. In this way we mitigate overfitting which normally limits the performance of end-to-end systems. The proposed system outperforms the i-vector + PLDA baseline on both long and short duration utterances.

Proceedings ArticleDOI
13 Apr 2018
TL;DR: The work shows that a speaker recognition system working robustly in the far-field scenario can be developed: weighted prediction error based dereverberation, combined with a generalized eigenvalue beamformer using power-spectral density weighting masks generated by neural networks, provides results approaching the clean close-microphone setup.
Abstract: This paper deals with far-field speaker recognition. On a corpus of NIST SRE 2010 data retransmitted in a real room with multiple microphones, we first demonstrate how room acoustics cause significant degradation of a state-of-the-art i-vector based speaker recognition system. We then investigate several techniques to improve performance, ranging from probabilistic linear discriminant analysis (PLDA) re-training, through dereverberation, to beamforming. We found that weighted prediction error (WPE) based dereverberation combined with a generalized eigenvalue beamformer using power-spectral density (PSD) weighting masks generated by neural networks (NNs) provides results approaching the clean close-microphone setup. Further improvement was obtained by re-training the PLDA or the mask-generating NNs on simulated target data. The work shows that a speaker recognition system working robustly in the far-field scenario can be developed.

Journal ArticleDOI
TL;DR: This work addresses the problem of normal-whisper acoustic mismatch compensation from the viewpoint of robust feature extraction using a novel method, frequency-domain linear prediction with time-varying linear prediction (FDLP-TVLP), which is an extension of the 2-dimensional autoregressive (2DAR) model that allows vocal tract filter parameters to be time- varying, rather than piecewise constant as in classic short-term speech analysis.

Journal ArticleDOI
TL;DR: This paper proposes the design of a robust speaker identification algorithm embedding a smart preprocessing method based on voice activity detection, which can effectively reduce the influence of noise and distance on classification.
Abstract: The importance of robust audio speech processing has rapidly increased in recent years, as the number of smart and connected devices grows. This effect is strongly related to the Internet of Things framework, introducing concepts such as connected vehicles and future smart cities. Context-aware applications are fundamental in this evolving environment, enabling smart and custom-tailored services for a variety of users. The use of on-board speaker recognition (SR) systems can play a key role in enhancing the customization of in-vehicle applications, by identifying the actual users and personalizing services based on their identity. Driven by this motivation, in this paper we present a performance study of an SR system designed to face the typical challenging conditions of an in-vehicle environment. We propose the design of a robust speaker identification algorithm embedding a smart preprocessing method based on voice activity detection, which can effectively reduce the influence of noise and distance on classification. Results show that our solution is able to efficiently improve the correct classification rate, even in the case of distant audio acquisition and in a variety of noisy environments.
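A hedged sketch of a simple energy-based voice activity detector that drops low-energy frames before feature extraction; the paper's actual VAD preprocessing may differ:

```python
import numpy as np

def energy_vad(wave, fs=16000, frame_ms=30, threshold_db=-35):
    """Keep only non-overlapping frames whose log energy is within `threshold_db`
    of the loudest frame, and concatenate them as the speech-only signal."""
    frame = int(fs * frame_ms / 1000)
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    keep = energy_db > energy_db.max() + threshold_db
    return frames[keep].reshape(-1)

speech_only = energy_vad(np.random.randn(3 * 16000))   # stand-in for a noisy recording
```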

Journal ArticleDOI
TL;DR: A speaker identification system named the mel frequency cepstral coefficient-based speaker identification system for access control (MSIAC for short) is presented, which identifies a speaker U by first collecting U's voice signals and converting the signals to the frequency domain, and then determines whether access will be accepted or denied.
Abstract: In recent years, by virtue of their convenient and unique features, bio-authentication techniques have been applied to identify and authenticate a person based on his/her spoken words and/or sentences. Among these techniques, speaker recognition/identification is the most convenient one, providing a secure and strong authentication solution viable for a wide range of applications. In this paper, to safeguard real-world objects, like buildings, we develop a speaker identification system named the mel frequency cepstral coefficient (MFCC)-based speaker identification system for access control (MSIAC for short), which identifies a speaker U by first collecting U's voice signals and converting the signals to the frequency domain. An MFCC-based human auditory filtering model is utilized to adjust the energy levels of different frequencies as U's quantified voice features. Next, a Gaussian mixture model is employed to represent the distribution of the logarithmic features as U's specific acoustic model. When a person, e.g., x, would like to access a real-world object protected by the MSIAC, x's acoustic model is compared with the acoustic models of known people. Based on the identification result, the MSIAC will determine whether the access will be accepted or denied.
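A minimal sklearn sketch of the GMM identification step described above: one GMM per enrolled speaker is fit on that speaker's MFCC frames, and a test utterance is assigned to the model with the highest average log-likelihood (the feature matrices here are random stand-ins for real MFCC features):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def enroll(speaker_features, n_components=16):
    """Fit one GMM per enrolled speaker on that speaker's MFCC frames."""
    return {spk: GaussianMixture(n_components, covariance_type="diag",
                                 random_state=0).fit(feats)
            for spk, feats in speaker_features.items()}

def identify(models, test_features):
    """Pick the speaker whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(test_features))

# toy MFCC matrices (frames x coefficients) standing in for real features
rng = np.random.default_rng(2)
models = enroll({"alice": rng.standard_normal((400, 13)),
                 "bob": rng.standard_normal((400, 13)) + 1.0})
print(identify(models, rng.standard_normal((200, 13)) + 1.0))   # likely "bob"
```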