
Showing papers on "Speaker recognition published in 2022"


Journal ArticleDOI
01 Mar 2022
TL;DR: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity as mentioned in this paper, or in short, identifying "who spoke when" in audio and video recordings.
Abstract: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify “who spoke when”. In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application over time to provide speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these two components to be complementary to each other. By considering such exciting technical trends, we believe that this paper is a valuable contribution to the community to provide a survey work by consolidating the recent developments with neural methods and thus facilitating further progress toward a more efficient speaker diarization.

57 citations


Journal ArticleDOI
TL;DR: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity as mentioned in this paper, or in short, identifying "who spoke when" in audio and video recordings.

55 citations


Proceedings ArticleDOI
24 Jan 2022
TL;DR: It is shown that bias exists at every development stage in the well-known VoxCeleb Speaker Recognition Challenge, including data generation, model building, and implementation, and most affected are female speakers and non-US nationalities, who experience significant performance degradation.
Abstract: Automated speaker recognition uses data processing to identify speakers by their voice. Today, automated speaker recognition is deployed on billions of smart devices and in services such as call centres. Despite their wide-scale deployment and known sources of bias in related domains like face recognition and natural language processing, bias in automated speaker recognition has not been studied systematically. We present an in-depth empirical and analytical study of bias in the machine learning development workflow of speaker verification, a voice biometric and core task in automated speaker recognition. Drawing on an established framework for understanding sources of harm in machine learning, we show that bias exists at every development stage in the well-known VoxCeleb Speaker Recognition Challenge, including data generation, model building, and implementation. Most affected are female speakers and non-US nationalities, who experience significant performance degradation. Leveraging the insights from our findings, we make practical recommendations for mitigating bias in automated speaker recognition, and outline future research directions.
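The subgroup performance degradation reported above is typically quantified by computing the equal error rate (EER) separately per demographic group. Below is a minimal sketch, assuming scored verification trials are available as NumPy arrays of scores, binary same-speaker labels, and group tags; the array names are illustrative, not artifacts of the paper.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(scores, labels):
        """EER: the operating point where false-accept and false-reject rates meet."""
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2

    def per_group_eer(scores, labels, groups):
        """Compute EER separately for each subgroup (e.g., gender or nationality)."""
        return {g: equal_error_rate(scores[groups == g], labels[groups == g])
                for g in np.unique(groups)}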

19 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, two methods are introduced for enhancing unsupervised speaker information extraction: integrating an utterance-wise contrastive loss into the SSL objective via multi-task learning, and an utterance mixing strategy for data augmentation, in which additional overlapped utterances are created without supervision and incorporated during training.
Abstract: Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have witnessed great successes in applying self-supervised learning to speech recognition, while only limited exploration has been attempted in applying SSL to modeling speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced for enhancing the unsupervised speaker information extraction. First, we apply multi-task learning to the current SSL framework, where we integrate the utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where additional overlapped utterances are created without supervision and incorporated during training. We integrate the proposed methods into the HuBERT framework. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification oriented tasks. An ablation study is performed to verify the efficacy of each proposed method. Finally, we scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement in all SUPERB tasks.
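As a rough illustration of the utterance mixing strategy described above, a secondary utterance can be scaled and added onto a random span of the primary waveform to create overlapped speech without any labels; the overlap ratio and energy scaling below are illustrative choices, not the paper's exact settings.

    import numpy as np

    def mix_utterances(primary, secondary, max_overlap_ratio=0.5, rng=None):
        """Overlay a random chunk of `secondary` onto `primary` to simulate overlapped speech."""
        rng = rng or np.random.default_rng()
        overlap_len = min(int(len(primary) * rng.uniform(0.1, max_overlap_ratio)), len(secondary))
        start = rng.integers(0, len(primary) - overlap_len + 1)
        mixed = primary.astype(float)
        mixed[start:start + overlap_len] += secondary[:overlap_len] * rng.uniform(0.3, 1.0)
        return mixed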

19 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: W2V2-Speaker as discussed by the authors applies the wav2vec2 framework to speaker recognition instead of speech recognition, proposing a single-utterance classification variant with cross-entropy or additive angular softmax loss and an utterance-pair classification variant with BCE loss.
Abstract: This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with cross-entropy or additive angular softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at github.com/nikvaessen/w2v2-speaker.
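As a simplified illustration of the pooling step discussed above, the wav2vec2 hidden-state sequence can be mean-pooled into a fixed-length embedding and passed to a classification head. The sketch below uses a Hugging Face checkpoint name as an assumption and is not the authors' released code (which is linked in the abstract).

    import torch
    from transformers import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint
    head = torch.nn.Linear(model.config.hidden_size, 1211)           # e.g., VoxCeleb1 dev speakers

    def speaker_logits(waveform):                    # waveform: (batch, samples) at 16 kHz
        hidden = model(waveform).last_hidden_state   # (batch, frames, hidden)
        embedding = hidden.mean(dim=1)               # simple mean pooling over time
        return head(embedding)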

16 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, the authors explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN, as a downstream model.
Abstract: The speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and thus attract a lot of interest to be applied for various downstream tasks. In this paper, we explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as a downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. The experimental results on the VoxCeleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. Accordingly, the ensemble system with three pre-trained models can further improve the EER to 0.479%, 0.536% and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
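The learnable weighted average over hidden layers described above can be written as a softmax-normalized weight per layer. A minimal PyTorch sketch, with the downstream ECAPA-TDNN omitted:

    import torch
    import torch.nn as nn

    class WeightedLayerSum(nn.Module):
        """Combine hidden states from all pre-trained layers with learnable weights."""
        def __init__(self, num_layers):
            super().__init__()
            self.weights = nn.Parameter(torch.zeros(num_layers))

        def forward(self, layer_states):                       # list of (batch, frames, dim) tensors
            stacked = torch.stack(layer_states, dim=0)         # (layers, batch, frames, dim)
            w = torch.softmax(self.weights, dim=0)
            return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (batch, frames, dim)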

15 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, a multi-scale frequency-channel attention (MFA) was proposed to characterize speakers at different scales through a dual-path design which consists of a convolutional neural network and a time delay neural network.
Abstract: The time delay neural network (TDNN) represents one of the state-of-the-art neural solutions to text-independent speaker verification. However, such networks require a large number of filters to capture the speaker characteristics at any local frequency region. In addition, the performance of such systems may degrade under short utterance scenarios. To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and a TDNN. We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and computational complexity. Further, the MFA mechanism is found to be effective for speaker verification with short test utterances.

14 citations


Journal ArticleDOI
TL;DR: The Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational Paralinguistics challengE (ComParE) as discussed by the authors focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not.

14 citations


Proceedings ArticleDOI
14 Feb 2022
TL;DR: Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which indeed reflects the effectiveness of the proposed iGMM integration.
Abstract: Speaker diarization has been investigated extensively as an important central task for meeting analysis. A recent trend shows that the integration of end-to-end neural (EEND)- and clustering-based diarization is a promising approach for handling realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and it has achieved state-of-the-art results on various tasks. However, the approaches proposed so far have not realized tight integration, because the clustering employed therein was not optimal in any sense for clustering the speaker embeddings estimated by the EEND module. To address this problem, this paper introduces a trainable clustering algorithm into the integration framework by deep-unfolding a non-parametric Bayesian model called the infinite Gaussian mixture model (iGMM). Specifically, the speaker embeddings are optimized during training such that they better fit iGMM clustering, using a novel clustering loss based on the Adjusted Rand Index (ARI). Experimental results on CALLHOME data show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which indeed reflects the effectiveness of the proposed iGMM integration.
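The Adjusted Rand Index underlying the clustering loss measures agreement between estimated and reference speaker assignments, corrected for chance. The paper's differentiable formulation is not reproduced here, but the metric itself can be checked with scikit-learn:

    from sklearn.metrics import adjusted_rand_score

    reference = [0, 0, 1, 1, 2, 2]   # oracle speaker labels per embedding
    estimated = [1, 1, 0, 0, 2, 2]   # clustering output; label IDs are permutation-invariant
    print(adjusted_rand_score(reference, estimated))   # 1.0: identical partition up to relabeling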

14 citations


Journal ArticleDOI
Abstract: We study the scenario where individuals (speakers) contribute to the publication of an anonymized speech corpus. Data users leverage this public corpus for downstream tasks, e.g., training an automatic speech recognition (ASR) system, while attackers may attempt to de-anonymize it using auxiliary knowledge. Motivated by this scenario, speaker anonymization aims to conceal speaker identity while preserving the quality and usefulness of speech data. In this article, we study x-vector based speaker anonymization, the leading approach in the VoicePrivacy Challenge, which converts the speaker's voice into that of a random pseudo-speaker. We show that the strength of anonymization varies significantly depending on how the pseudo-speaker is chosen. We explore four design choices for this step: the distance metric between speakers, the region of speaker space where the pseudo-speaker is picked, its gender, and whether to assign it to one or all utterances of the original speaker. We assess the quality of anonymization from the perspective of the three actors involved in our threat model, namely the speaker, the user and the attacker. To measure privacy and utility, we use respectively the linkability score achieved by the attackers and the decoding word error rate achieved by an ASR model trained on the anonymized data. Experiments on LibriSpeech show that the best combination of design choices yields state-of-the-art performance in terms of both privacy and utility. Experiments on Mozilla Common Voice further show that it guarantees the same anonymization level against re-identification attacks among 50 speakers as original speech among 20,000 speakers.
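A caricature of the pseudo-speaker selection step studied above: rank a pool of x-vectors by similarity to the source speaker, keep a region of that ranking (here the farthest candidates), optionally restrict to one gender, and average a random subset. The pool, distance metric, and region size below are illustrative assumptions, not the challenge baseline's exact values.

    import numpy as np

    def select_pseudo_xvector(source_xvec, pool_xvecs, pool_genders,
                              target_gender=None, region=200, n_average=100, rng=None):
        rng = rng or np.random.default_rng()
        if target_gender is not None:
            pool_xvecs = pool_xvecs[pool_genders == target_gender]
        # cosine similarity between each pool x-vector and the source speaker
        sims = pool_xvecs @ source_xvec / (
            np.linalg.norm(pool_xvecs, axis=1) * np.linalg.norm(source_xvec))
        farthest = np.argsort(sims)[:region]                  # region of the speaker space
        chosen = rng.choice(farthest, size=min(n_average, len(farthest)), replace=False)
        return pool_xvecs[chosen].mean(axis=0)                # pseudo-speaker x-vector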

12 citations


Journal ArticleDOI
TL;DR: Comprehensive evaluations, which include a detailed mathematical analysis, a simulation on amplitude- and frequency-modulated (AM-FM) signals, and a spectrographic inspection involving different filterbank structures, along with their experimental results, are provided in this paper.

Journal ArticleDOI
TL;DR: Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.
Abstract: Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anonymization often provides brittle privacy protection, even less so any provable guarantee. In this work, we show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information. We remove speaker information from these attributes by introducing differentially private feature extractors based on an autoencoder and an automatic speech recognizer, respectively, trained using noise layers. We plug these extractors in the state-of-the-art anonymization pipeline and generate, for the first time, private speech utterances with a provable upper bound on the speaker information they contain. We evaluate empirically the privacy and utility resulting from our differentially private speaker anonymization approach on the LibriSpeech data set. Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.

Proceedings ArticleDOI
16 Feb 2022
TL;DR: Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
Abstract: Automatic emotion recognition plays a key role in computer-human interaction as it has the potential to enrich the next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task using transfer learning along with a spectrogram augmentation approach to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from the text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve the emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
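The late-fusion step amounts to combining per-class scores from the two single-modality models. A minimal sketch, assuming each model outputs class posteriors and the fusion weight is tuned on a validation set (the value below is a placeholder):

    import numpy as np

    def late_fusion(audio_probs, text_probs, alpha=0.5):
        """Weighted sum of per-class posteriors from the speech (ResNet) and text (BERT) models."""
        fused = alpha * audio_probs + (1 - alpha) * text_probs
        return fused.argmax(axis=-1)   # predicted emotion class per utterance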

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, a zero-shot multi-speaker text-to-speech (nnSpeech) model was proposed to generate natural and similar speech with only one adaptation utterance.
Abstract: Multi-speaker text-to-speech (TTS) using only a few adaptation samples is a challenge in practical applications. To address that, we propose a zero-shot multi-speaker TTS, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Compared with using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and can generate a variable Z, which contains both speaker characteristics and content information. The latent variable Z distribution is approximated by another variable conditioned on the reference mel-spectrogram and phoneme. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural and similar speech with only one adaptation utterance.

Journal ArticleDOI
TL;DR: In this paper, a speaker de-identification method was proposed, which, in addition to simple formant shifts, manipulates f0 trajectories based on functional data analysis.

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, a neural network (NN) model is proposed that uses features such as MFCC and mel spectrogram, extracted from the speech signal, to recognize gender from voice, which is considered one of the essential tasks for such speech applications.
Abstract: Human speech contains paralinguistic information used in many speech recognition applications like automatic speech recognition, speaker recognition, and verification. Gender recognition from voice is considered one of the essential tasks for such applications. To build a model from a training set, a set of relevant speech features is extracted in order to distinguish gender (i.e., female or male) from a speech signal. This paper focuses on comparing the performance of the proposed neural network (NN) model with different features, such as MFCC and mel spectrogram, extracted from the speech signal to recognize gender. Experiments are carried out on the Mozilla voice dataset and the performance of the network is evaluated. Experiments show that the combination of MFCC and mel feature sets achieves the best accuracy of 94.32%.
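The MFCC and mel-spectrogram features referred to above can be extracted with librosa and pooled into a fixed-length vector per utterance; the frame settings below are common defaults, not necessarily those used in the chapter.

    import librosa
    import numpy as np

    def extract_features(path, sr=16000, n_mfcc=13, n_mels=40):
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (n_mfcc, frames)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, frames)
        mel_db = librosa.power_to_db(mel)
        # average over time so each utterance yields one feature vector for the NN
        return np.concatenate([mfcc.mean(axis=1), mel_db.mean(axis=1)])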

Proceedings ArticleDOI
02 Feb 2022
TL;DR: A new feature-based unsupervised domain adaptation algorithm is proposed that builds on the well-known CORrelation ALignment (CORAL), hence the name CORAL++, and its use on the NIST 2019 Speaker Recognition Evaluation (SRE19) shows the effectiveness of the algorithm.
Abstract: State-of-the-art speaker recognition systems are trained with a large amount of human-labeled training data. Such a training set is usually composed of various data sources to enhance the modeling capability of models. However, in practical deployment, unseen conditions are almost inevitable. Domain mismatch is a common problem in real-life applications due to the statistical difference between the training and testing data sets. To alleviate the degradation caused by domain mismatch, we propose a new feature-based unsupervised domain adaptation algorithm. The algorithm we propose is a further optimization based on the well-known CORrelation ALignment (CORAL), so we call it CORAL++. On the NIST 2019 Speaker Recognition Evaluation (SRE19), we use the SRE18 CTS set as the development set to verify the effectiveness of CORAL++. With the typical x-vector/PLDA setup, CORAL++ outperforms CORAL by 9.40% relative in terms of EER.
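The abstract does not detail CORAL++ itself, but the CORrelation ALignment recipe it extends is simple: whiten the out-of-domain features and re-color them with the in-domain covariance. A sketch of plain CORAL (CORAL++'s additional optimizations are not reproduced here):

    import numpy as np
    from scipy.linalg import fractional_matrix_power

    def coral(source_feats, target_feats, eps=1e-5):
        """Align second-order statistics of source features to the target domain."""
        d = source_feats.shape[1]
        cs = np.cov(source_feats, rowvar=False) + eps * np.eye(d)
        ct = np.cov(target_feats, rowvar=False) + eps * np.eye(d)
        whiten = fractional_matrix_power(cs, -0.5)   # decorrelate source features
        recolor = fractional_matrix_power(ct, 0.5)   # impose target correlations
        return source_feats @ whiten @ recolor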

Proceedings ArticleDOI
18 Sep 2022
TL;DR: In this paper, the authors study the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments, and find that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size.
Abstract: Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

Proceedings ArticleDOI
18 Feb 2022
TL;DR: In this article, a multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC) is proposed, where vector quantization with contrastive predictive coding is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units.
Abstract: Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here [1].

Journal ArticleDOI
TL;DR: In this paper, the authors explore the utility of performing pooling operations across different levels of the convolutional stack and further propose an approach to efficiently combine such a set of representations.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper , a transformer transducer is used to detect the speaker turns and represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns.
Abstract: In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, a loss-gated learning (LGL) strategy was proposed to extract the reliable labels through the fitting ability of the neural network during training, which obtains a 46.3% performance gain over the system without it.
Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn’t always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to study a loss-gated learning (LGL) strategy, which extracts the reliable labels through the fitting ability of the neural network during training. With the proposed LGL, our speaker recognition model obtains a 46.3% performance gain over the system without it. Further, the proposed self-supervised speaker recognition with LGL trained on the VoxCeleb2 dataset without any labels achieves an equal error rate of 1.66% on the VoxCeleb1 original test set.
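The loss-gating idea can be illustrated as a per-sample gate inside the training loop: only samples whose loss falls below a threshold, i.e. those the network already fits and whose pseudo labels are therefore presumed reliable, contribute to the gradient. A minimal PyTorch sketch; the gate value is a placeholder hyper-parameter, not the paper's setting.

    import torch
    import torch.nn.functional as F

    def loss_gated_step(model, optimizer, batch, pseudo_labels, gate=3.0):
        logits = model(batch)
        per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
        mask = (per_sample < gate).float()            # keep only "reliable" samples
        loss = (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), int(mask.sum())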

Proceedings ArticleDOI
23 May 2022
TL;DR: FastAudio as discussed by the authors proposes replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks and achieves a relative improvement of 29.7% when compared with fixed front-ends.
Abstract: Spoof speech can be used to try to fool speaker verification systems that determine the identity of the speaker based on voice characteristics. This paper compares popular learnable front-ends on this task. We categorize the front-ends by defining two generic architectures and then analyze the filtering stages of both types in terms of learning constraints. We propose replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks. The proposed FastAudio front-end is then tested with two popular back-ends to measure the performance on the Logical Access track of the ASVspoof 2019 dataset. The FastAudio front-end achieves a relative improvement of 29.7% when compared with fixed front-ends, outperforming all other learnable front-ends on this task.

Journal ArticleDOI
TL;DR: In this article, a 3D CNN architecture is proposed by extracting the spatio-temporal features and eventually mapping the prediction probabilities of the elements in the corpus to validate the concept of person-independence, achieving a training accuracy of 80.2% and a testing accuracy of 77.9%.
Abstract: From a broader perspective, the objective of Visual Speech Recognition (VSR) is to comprehend the speech spoken by an individual using visual deformations. However, some of the significant limitations of existing solutions include the dearth of training data, improper end-to-end deployed solutions, lack of holistic feature representation, and less accuracy. To resolve these limitations, this study proposes a novel, scalable, and robust VSR system that uses the videotape of the user to determine the word which is being spoken. In this regard, a customized 3-Dimensional Convolutional Neural Network (3D CNN) architecture is proposed by extracting the Spatio-temporal features and eventually mapping the prediction probabilities of the elements in the corpus. We have created a customized dataset resembling the metadata contained in the MIRACL-VC1 dataset to validate the concept of person-independence. While being robust to a broad spectrum of lighting conditions across multiple devices, our model achieves a training accuracy of 80.2% and a testing accuracy of 77.9% in predicting the word spoken by the user.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, the authors propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset, and develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data.
Abstract: This paper investigates methods to effectively retrieve speaker information from the personalized speaker-adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models, where a global model is learnt on the server based on the updates received from multiple clients. We propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can provide an equal error rate (EER) of 1-2%.

Proceedings ArticleDOI
23 May 2022
TL;DR: This paper proposes a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation, achieving state-of-the-art results on the challenging IEMOCAP dataset.
Abstract: Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
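One standard way to realize this kind of speaker normalization is a gradient reversal layer placed before an auxiliary speaker classifier, so the emotion encoder is pushed to discard speaker information. The sketch below shows that generic construct, not necessarily the paper's exact formulation.

    import torch

    class GradientReversal(torch.autograd.Function):
        """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.clone()

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def reverse_gradient(features, lambd=1.0):
        # route features to the speaker classifier through this call; its gradient is reversed
        return GradientReversal.apply(features, lambd)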

Proceedings ArticleDOI
23 May 2022
TL;DR: This study presents a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone, that is able to integrate SAD, OSD and speaker segmentation/clustering and yields competitive results on the VoxConverse20 benchmark.
Abstract: Speaker diarization consists of many components, e.g., front-end processing, speech activity detection (SAD), overlapped speech detection (OSD) and speaker segmentation/clustering. Conventionally, most of the involved components are separately developed and optimized. The resulting speaker diarization systems are complicated and sometimes lack satisfactory generalization capabilities. In this study, we present a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone. The whole system can be simplified to contain only two major parts, a speaker embedding extractor followed by a clustering module. Both parts are implemented with neural networks. In the training phase, an on-the-fly spoken dialogue generator is designed to provide the system with audio streams and the corresponding annotations in the categories of non-speech, overlapped speech and active speakers. Chunk-wise inference and a speaker-verification-based tracing module are used to handle an arbitrary number of speakers. We demonstrate that the proposed speaker diarization system is able to integrate SAD, OSD and speaker segmentation/clustering, and yields competitive results on the VoxConverse20 benchmarks.

Journal ArticleDOI
01 Jan 2022
TL;DR: In this paper, a meta-learning algorithm is applied to speaker adaptation, using Model-Agnostic Meta-Learning (MAML) to find a meta-initialization from which a multi-speaker TTS model can quickly be adapted to unseen speakers.
Abstract: Personalizing a speech synthesis system is a highly desired application, where the system can generate speech with the user's voice from only a few enrolled recordings. There are two main approaches to build such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user's speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a good meta-initialization from which the model can quickly adapt to any few-shot speaker adaptation task. Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation baseline and a speaker encoding baseline. The evaluation results show that Meta-TTS can synthesize speech with high speaker similarity from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline, and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with data from an extra 8,371 speakers, Meta-TTS can still outperform the baseline on the LibriTTS dataset and achieve comparable results on the VCTK dataset.
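The MAML loop referred to above alternates a few inner-loop adaptation steps per speaker with an outer-loop update of the shared meta-initialization. A generic sketch follows; `tts_loss` (a functional loss that accepts explicit weights) and the task sampler are assumptions standing in for the full Meta-TTS model.

    import torch

    def maml_step(model, meta_optimizer, tasks, tts_loss, inner_lr=1e-3, inner_steps=3):
        """One meta-update over a batch of few-shot speaker-adaptation tasks."""
        meta_optimizer.zero_grad()
        for support_batch, query_batch in tasks:          # one task per speaker
            fast_weights = {n: p.clone() for n, p in model.named_parameters()}
            for _ in range(inner_steps):                  # inner loop: adapt to this speaker
                loss = tts_loss(model, support_batch, fast_weights)
                grads = torch.autograd.grad(loss, list(fast_weights.values()), create_graph=True)
                fast_weights = {n: w - inner_lr * g
                                for (n, w), g in zip(fast_weights.items(), grads)}
            # outer loop: evaluate the adapted weights on held-out utterances of the same speaker
            tts_loss(model, query_batch, fast_weights).backward()
        meta_optimizer.step()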

Proceedings ArticleDOI
10 Feb 2022
TL;DR: Results indicate that the proposed technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker’s identity.
Abstract: We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker’s identity. The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers. The voice converted data is then pooled with natural data from the target speaker and used to train a single-speaker multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some controllability of speaking style, when using multiple supporting speakers. We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages. Results indicate that our technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker’s identity.

Journal ArticleDOI
01 Mar 2022-Entropy
TL;DR: This paper proposes a simple and constrained convolutional neural network for speaker recognition tasks, examines its robustness for recognition under emotional speech conditions, and investigates three quantization methods for developing a constrained network.
Abstract: Speaker recognition is an important classification task, which can be solved using several approaches. Although building a speaker recognition model on a closed set of speakers under neutral speaking conditions is a well-researched task and there are solutions that provide excellent performance, the classification accuracy of developed models significantly decreases when applying them to emotional speech or in the presence of interference. Furthermore, deep models may require a large number of parameters, so constrained solutions are desirable in order to implement them on edge devices in the Internet of Things systems for real-time detection. The aim of this paper is to propose a simple and constrained convolutional neural network for speaker recognition tasks and to examine its robustness for recognition in emotional speech conditions. We examine three quantization methods for developing a constrained network: floating-point eight format, ternary scalar quantization, and binary scalar quantization. The results are demonstrated on the recently recorded SEAC dataset.
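Of the three quantization schemes examined, ternary and binary scalar quantization of the weights are easy to sketch (FP8 depends on hardware or library support and is omitted). The magnitude threshold below is one common choice and is assumed, not taken from the paper.

    import numpy as np

    def ternary_quantize(w, threshold_ratio=0.7):
        """Map weights to {-a, 0, +a} using a magnitude threshold."""
        delta = threshold_ratio * np.mean(np.abs(w))
        mask = np.abs(w) > delta
        scale = np.abs(w[mask]).mean() if mask.any() else 0.0
        return np.sign(w) * mask * scale

    def binary_quantize(w):
        """Map weights to {-a, +a}, preserving the mean magnitude."""
        return np.sign(w) * np.abs(w).mean()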