
Showing papers on "Speaker recognition published in 2022"


Journal ArticleDOI
01 Mar 2022
TL;DR: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity as mentioned in this paper, or in short, identifying "who spoke when" in audio and video recordings.
Abstract: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify “who spoke when”. In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application over time to provide speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these two components to be complementary to each other. By considering such exciting technical trends, we believe that this paper is a valuable contribution to the community to provide a survey work by consolidating the recent developments with neural methods and thus facilitating further progress toward a more efficient speaker diarization.

57 citations


Journal ArticleDOI
TL;DR: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity as mentioned in this paper, or in short, identifying "who spoke when" in audio and video recordings.

55 citations


Proceedings ArticleDOI
24 Jan 2022
TL;DR: It is shown that bias exists at every development stage in the well-known VoxCeleb Speaker Recognition Challenge, including data generation, model building, and implementation, and most affected are female speakers and non-US nationalities, who experience significant performance degradation.
Abstract: Automated speaker recognition uses data processing to identify speakers by their voice. Today, automated speaker recognition is deployed on billions of smart devices and in services such as call centres. Despite their wide-scale deployment and known sources of bias in related domains like face recognition and natural language processing, bias in automated speaker recognition has not been studied systematically. We present an in-depth empirical and analytical study of bias in the machine learning development workflow of speaker verification, a voice biometric and core task in automated speaker recognition. Drawing on an established framework for understanding sources of harm in machine learning, we show that bias exists at every development stage in the well-known VoxCeleb Speaker Recognition Challenge, including data generation, model building, and implementation. Most affected are female speakers and non-US nationalities, who experience significant performance degradation. Leveraging the insights from our findings, we make practical recommendations for mitigating bias in automated speaker recognition, and outline future research directions.
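The subgroup performance degradation reported above is typically quantified by computing the equal error rate (EER) separately per demographic group. Below is a minimal sketch, assuming scored verification trials are available as NumPy arrays of scores, binary same-speaker labels, and group tags; the array names are illustrative, not artifacts of the paper.

    import numpy as np
    from sklearn.metrics import roc_curve

    def equal_error_rate(scores, labels):
        """EER: the operating point where false-accept and false-reject rates meet."""
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))
        return (fpr[idx] + fnr[idx]) / 2

    def per_group_eer(scores, labels, groups):
        """Compute EER separately for each subgroup (e.g., gender or nationality)."""
        return {g: equal_error_rate(scores[groups == g], labels[groups == g])
                for g in np.unique(groups)}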

19 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, two methods are introduced for enhancing unsupervised speaker information extraction: integrating an utterance-wise contrastive loss into the SSL objective via multi-task learning, and an utterance mixing strategy for data augmentation, in which additional overlapped utterances are created without supervision and incorporated during training.
Abstract: Self-supervised learning (SSL) is a long-standing goal for speech processing, since it utilizes large-scale unlabeled data and avoids extensive human labeling. Recent years have witnessed great successes in applying self-supervised learning to speech recognition, while only limited exploration has been attempted in applying SSL to modeling speaker characteristics. In this paper, we aim to improve the existing SSL framework for speaker representation learning. Two methods are introduced for enhancing the unsupervised speaker information extraction. First, we apply multi-task learning to the current SSL framework, where we integrate the utterance-wise contrastive loss with the SSL objective function. Second, for better speaker discrimination, we propose an utterance mixing strategy for data augmentation, where additional overlapped utterances are created without supervision and incorporated during training. We integrate the proposed methods into the HuBERT framework. Experimental results on the SUPERB benchmark show that the proposed system achieves state-of-the-art performance in universal representation learning, especially for speaker identification oriented tasks. An ablation study is performed to verify the efficacy of each proposed method. Finally, we scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement in all SUPERB tasks.
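As a rough illustration of the utterance mixing strategy described above, a secondary utterance can be scaled and added onto a random span of the primary waveform to create overlapped speech without any labels; the overlap ratio and energy scaling below are illustrative choices, not the paper's exact settings.

    import numpy as np

    def mix_utterances(primary, secondary, max_overlap_ratio=0.5, rng=None):
        """Overlay a random chunk of `secondary` onto `primary` to simulate overlapped speech."""
        rng = rng or np.random.default_rng()
        overlap_len = min(int(len(primary) * rng.uniform(0.1, max_overlap_ratio)), len(secondary))
        start = rng.integers(0, len(primary) - overlap_len + 1)
        mixed = primary.astype(float)
        mixed[start:start + overlap_len] += secondary[:overlap_len] * rng.uniform(0.3, 1.0)
        return mixed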

19 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: W2V2-Speaker as discussed by the authors applies the wav2vec2 framework to speaker recognition instead of speech recognition, proposing a single-utterance classification variant with cross-entropy or additive angular softmax loss and an utterance-pair classification variant with BCE loss.
Abstract: This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with cross-entropy or additive angular softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at github.com/nikvaessen/w2v2-speaker.
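As a simplified illustration of the pooling step discussed above, the wav2vec2 hidden-state sequence can be mean-pooled into a fixed-length embedding and passed to a classification head. The sketch below uses a Hugging Face checkpoint name as an assumption and is not the authors' released code (which is linked in the abstract).

    import torch
    from transformers import Wav2Vec2Model

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint
    head = torch.nn.Linear(model.config.hidden_size, 1211)           # e.g., VoxCeleb1 dev speakers

    def speaker_logits(waveform):                    # waveform: (batch, samples) at 16 kHz
        hidden = model(waveform).last_hidden_state   # (batch, frames, hidden)
        embedding = hidden.mean(dim=1)               # simple mean pooling over time
        return head(embedding)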

16 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, the authors explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN, as a downstream model.
Abstract: The speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and thus attract a lot of interest to be applied for various downstream tasks. In this paper, we explore the limits of speech representations learned by different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as a downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. The experimental results on the VoxCeleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. Accordingly, the ensemble system with three pre-trained models can further improve the EER to 0.479%, 0.536% and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
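The learnable weighted average over hidden layers described above can be written as a softmax-normalized weight per layer. A minimal PyTorch sketch, with the downstream ECAPA-TDNN omitted:

    import torch
    import torch.nn as nn

    class WeightedLayerSum(nn.Module):
        """Combine hidden states from all pre-trained layers with learnable weights."""
        def __init__(self, num_layers):
            super().__init__()
            self.weights = nn.Parameter(torch.zeros(num_layers))

        def forward(self, layer_states):                       # list of (batch, frames, dim) tensors
            stacked = torch.stack(layer_states, dim=0)         # (layers, batch, frames, dim)
            w = torch.softmax(self.weights, dim=0)
            return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # (batch, frames, dim)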

15 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, a multi-scale frequency-channel attention (MFA) was proposed to characterize speakers at different scales through a dual-path design which consists of a convolutional neural network and a time delay neural network.
Abstract: The time delay neural network (TDNN) represents one of the state-of-the-art neural solutions to text-independent speaker verification. However, such networks require a large number of filters to capture the speaker characteristics at any local frequency region. In addition, the performance of such systems may degrade under short utterance scenarios. To address these issues, we propose a multi-scale frequency-channel attention (MFA), where we characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and a TDNN. We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and computational complexity. Further, the MFA mechanism is found to be effective for speaker verification with short test utterances.

14 citations


Journal ArticleDOI
TL;DR: The Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational Paralinguistics challengE (ComParE) as discussed by the authors focused on the following classification task: given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not.

14 citations


Proceedings ArticleDOI
14 Feb 2022
TL;DR: Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which indeed reflects the effectiveness of the proposed iGMM integration.
Abstract: Speaker diarization has been investigated extensively as an important central task for meeting analysis. A recent trend shows that the integration of end-to-end neural (EEND)- and clustering-based diarization is a promising approach for handling realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and it has achieved state-of-the-art results on various tasks. However, the approaches proposed so far have not realized tight integration, because the clustering employed therein was not optimal in any sense for clustering the speaker embeddings estimated by the EEND module. To address this problem, this paper introduces a trainable clustering algorithm into the integration framework by deep-unfolding a non-parametric Bayesian model called the infinite Gaussian mixture model (iGMM). Specifically, the speaker embeddings are optimized during training such that they better fit iGMM clustering, using a novel clustering loss based on the Adjusted Rand Index (ARI). Experimental results on CALLHOME data show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which indeed reflects the effectiveness of the proposed iGMM integration.
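The Adjusted Rand Index underlying the clustering loss measures agreement between estimated and reference speaker assignments, corrected for chance. The paper's differentiable formulation is not reproduced here, but the metric itself can be checked with scikit-learn:

    from sklearn.metrics import adjusted_rand_score

    reference = [0, 0, 1, 1, 2, 2]   # oracle speaker labels per embedding
    estimated = [1, 1, 0, 0, 2, 2]   # clustering output; label IDs are permutation-invariant
    print(adjusted_rand_score(reference, estimated))   # 1.0: identical partition up to relabeling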

14 citations


Journal ArticleDOI
Abstract: We study the scenario where individuals (speakers) contribute to the publication of an anonymized speech corpus. Data users leverage this public corpus for downstream tasks, e.g., training an automatic speech recognition (ASR) system, while attackers may attempt to de-anonymize it using auxiliary knowledge. Motivated by this scenario, speaker anonymization aims to conceal speaker identity while preserving the quality and usefulness of speech data. In this article, we study x-vector based speaker anonymization, the leading approach in the VoicePrivacy Challenge, which converts the speaker's voice into that of a random pseudo-speaker. We show that the strength of anonymization varies significantly depending on how the pseudo-speaker is chosen. We explore four design choices for this step: the distance metric between speakers, the region of speaker space where the pseudo-speaker is picked, its gender, and whether to assign it to one or all utterances of the original speaker. We assess the quality of anonymization from the perspective of the three actors involved in our threat model, namely the speaker, the user and the attacker. To measure privacy and utility, we use respectively the linkability score achieved by the attackers and the decoding word error rate achieved by an ASR model trained on the anonymized data. Experiments on LibriSpeech show that the best combination of design choices yields state-of-the-art performance in terms of both privacy and utility. Experiments on Mozilla Common Voice further show that it guarantees the same anonymization level against re-identification attacks among 50 speakers as original speech among 20,000 speakers.
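A caricature of the pseudo-speaker selection step studied above: rank a pool of x-vectors by similarity to the source speaker, keep a region of that ranking (here the farthest candidates), optionally restrict to one gender, and average a random subset. The pool, distance metric, and region size below are illustrative assumptions, not the challenge baseline's exact values.

    import numpy as np

    def select_pseudo_xvector(source_xvec, pool_xvecs, pool_genders,
                              target_gender=None, region=200, n_average=100, rng=None):
        rng = rng or np.random.default_rng()
        if target_gender is not None:
            pool_xvecs = pool_xvecs[pool_genders == target_gender]
        # cosine similarity between each pool x-vector and the source speaker
        sims = pool_xvecs @ source_xvec / (
            np.linalg.norm(pool_xvecs, axis=1) * np.linalg.norm(source_xvec))
        farthest = np.argsort(sims)[:region]                  # region of the speaker space
        chosen = rng.choice(farthest, size=min(n_average, len(farthest)), replace=False)
        return pool_xvecs[chosen].mean(axis=0)                # pseudo-speaker x-vector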

12 citations


Journal ArticleDOI
TL;DR: Comprehensive evaluations, which include a detailed mathematical analysis, a simulation on amplitude- and frequency-modulated (AM-FM) signals, and a spectrographic inspection involving different filterbank structures, along with their experimental results, are provided in this paper.

Journal ArticleDOI
TL;DR: Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.
Abstract: Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anonymization often provides brittle privacy protection, even less so any provable guarantee. In this work, we show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information. We remove speaker information from these attributes by introducing differentially private feature extractors based on an autoencoder and an automatic speech recognizer, respectively, trained using noise layers. We plug these extractors in the state-of-the-art anonymization pipeline and generate, for the first time, private speech utterances with a provable upper bound on the speaker information they contain. We evaluate empirically the privacy and utility resulting from our differentially private speaker anonymization approach on the LibriSpeech data set. Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.

Proceedings ArticleDOI
16 Feb 2022
TL;DR: Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
Abstract: Automatic emotion recognition plays a key role in computer-human interaction as it has the potential to enrich the next-generation artificial intelligence with emotional intelligence. It finds applications in customer and/or representative behavior analysis in call centers, gaming, personal assistants, and social robots, to mention a few. Therefore, there has been an increasing demand to develop robust automatic methods to analyze and recognize the various emotions. In this paper, we propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities. More specifically, we i) adapt a residual network (ResNet) based model trained on a large-scale speaker recognition task using transfer learning along with a spectrogram augmentation approach to recognize emotions from speech, and ii) use a fine-tuned bidirectional encoder representations from transformers (BERT) based model to represent and recognize emotions from the text. The proposed system then combines the ResNet and BERT-based model scores using a late fusion strategy to further improve the emotion recognition performance. The proposed multimodal solution addresses the data scarcity limitation in emotion recognition using transfer learning, data augmentation, and fine-tuning, thereby improving the generalization performance of the emotion recognition models. We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset. Experimental results indicate that both audio and text-based models improve the emotion recognition performance and that the proposed multimodal solution achieves state-of-the-art results on the IEMOCAP benchmark.
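The late-fusion step amounts to combining per-class scores from the two single-modality models. A minimal sketch, assuming each model outputs class posteriors and the fusion weight is tuned on a validation set (the value below is a placeholder):

    import numpy as np

    def late_fusion(audio_probs, text_probs, alpha=0.5):
        """Weighted sum of per-class posteriors from the speech (ResNet) and text (BERT) models."""
        fused = alpha * audio_probs + (1 - alpha) * text_probs
        return fused.argmax(axis=-1)   # predicted emotion class per utterance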

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, a zero-shot multi-speaker text-to-speech (nnSpeech) model was proposed to generate natural and similar speech with only one adaptation utterance.
Abstract: Multi-speaker text-to-speech (TTS) using only a few adaptation samples is a challenge in practical applications. To address that, we propose a zero-shot multi-speaker TTS, named nnSpeech, that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance. Compared with using a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and can generate a variable Z, which contains both speaker characteristics and content information. The latent variable Z distribution is approximated by another variable conditioned on the reference mel-spectrogram and phoneme. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural and similar speech with only one adaptation utterance.

Journal ArticleDOI
TL;DR: In this paper, a speaker de-identification method was proposed, which, in addition to simple formant shifts, manipulates f0 trajectories based on functional data analysis.

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, a neural network (NN) model is proposed that uses features such as MFCC and mel spectrogram, extracted from the speech signal, to recognize gender from voice, which is considered one of the essential tasks for such speech applications.
Abstract: Human speech contains paralinguistic information used in many speech recognition applications like automatic speech recognition, speaker recognition, and verification. Gender recognition from voice is considered one of the essential tasks for such applications. To build a model from a training set, a set of relevant speech features is extracted in order to distinguish gender (i.e., female or male) from a speech signal. This paper focuses on comparing the performance of the proposed neural network (NN) model with different features, such as MFCC and mel spectrogram, extracted from the speech signal to recognize gender. Experiments are carried out on the Mozilla voice dataset and the performance of the network is evaluated. Experiments show that the combination of MFCC and mel feature sets achieves the best accuracy of 94.32%.
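The MFCC and mel-spectrogram features referred to above can be extracted with librosa and pooled into a fixed-length vector per utterance; the frame settings below are common defaults, not necessarily those used in the chapter.

    import librosa
    import numpy as np

    def extract_features(path, sr=16000, n_mfcc=13, n_mels=40):
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)           # (n_mfcc, frames)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # (n_mels, frames)
        mel_db = librosa.power_to_db(mel)
        # average over time so each utterance yields one feature vector for the NN
        return np.concatenate([mfcc.mean(axis=1), mel_db.mean(axis=1)])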

Proceedings ArticleDOI
02 Feb 2022
TL;DR: A new feature-based unsupervised domain adaptation algorithm is proposed that builds on the well-known CORrelation ALignment (CORAL), hence the name CORAL++, and its use on the NIST 2019 Speaker Recognition Evaluation (SRE19) shows the effectiveness of the algorithm.
Abstract: State-of-the-art speaker recognition systems are trained with a large amount of human-labeled training data. Such a training set is usually composed of various data sources to enhance the modeling capability of models. However, in practical deployment, unseen conditions are almost inevitable. Domain mismatch is a common problem in real-life applications due to the statistical difference between the training and testing data sets. To alleviate the degradation caused by domain mismatch, we propose a new feature-based unsupervised domain adaptation algorithm. The algorithm we propose is a further optimization based on the well-known CORrelation ALignment (CORAL), so we call it CORAL++. On the NIST 2019 Speaker Recognition Evaluation (SRE19), we use the SRE18 CTS set as the development set to verify the effectiveness of CORAL++. With the typical x-vector/PLDA setup, CORAL++ outperforms CORAL by 9.40% relative in terms of EER.
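The abstract does not detail CORAL++ itself, but the CORrelation ALignment recipe it extends is simple: whiten the out-of-domain features and re-color them with the in-domain covariance. A sketch of plain CORAL (CORAL++'s additional optimizations are not reproduced here):

    import numpy as np
    from scipy.linalg import fractional_matrix_power

    def coral(source_feats, target_feats, eps=1e-5):
        """Align second-order statistics of source features to the target domain."""
        d = source_feats.shape[1]
        cs = np.cov(source_feats, rowvar=False) + eps * np.eye(d)
        ct = np.cov(target_feats, rowvar=False) + eps * np.eye(d)
        whiten = fractional_matrix_power(cs, -0.5)   # decorrelate source features
        recolor = fractional_matrix_power(ct, 0.5)   # impose target correlations
        return source_feats @ whiten @ recolor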

Proceedings ArticleDOI
18 Sep 2022
TL;DR: In this paper, the authors study the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments, and find that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size.
Abstract: Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factors lead to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to the SV task comes from a combination of the masked speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

Proceedings ArticleDOI
18 Feb 2022
TL;DR: In this article, a multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC) is proposed, where vector quantization with contrastive predictive coding is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units.
Abstract: Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here [1].

Journal ArticleDOI
TL;DR: In this paper, the authors explore the utility of performing pooling operations across different levels of the convolutional stack and further propose an approach to efficiently combine such a set of representations.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper , a transformer transducer is used to detect the speaker turns and represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns.
Abstract: In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, a loss-gated learning (LGL) strategy was proposed to extract the reliable labels through the fitting ability of the neural network during training, which obtains a 46.3% performance gain over the system without it.
Abstract: In self-supervised learning for speaker recognition, pseudo labels are useful as the supervision signals. It is a known fact that a speaker recognition model doesn’t always benefit from pseudo labels due to their unreliability. In this work, we observe that a speaker recognition network tends to model the data with reliable labels faster than those with unreliable labels. This motivates us to study a loss-gated learning (LGL) strategy, which extracts the reliable labels through the fitting ability of the neural network during training. With the proposed LGL, our speaker recognition model obtains a 46.3% performance gain over the system without it. Further, the proposed self-supervised speaker recognition with LGL trained on the VoxCeleb2 dataset without any labels achieves an equal error rate of 1.66% on the VoxCeleb1 original test set.
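The loss-gating idea can be illustrated as a per-sample gate inside the training loop: only samples whose loss falls below a threshold, i.e. those the network already fits and whose pseudo labels are therefore presumed reliable, contribute to the gradient. A minimal PyTorch sketch; the gate value is a placeholder hyper-parameter, not the paper's setting.

    import torch
    import torch.nn.functional as F

    def loss_gated_step(model, optimizer, batch, pseudo_labels, gate=3.0):
        logits = model(batch)
        per_sample = F.cross_entropy(logits, pseudo_labels, reduction="none")
        mask = (per_sample < gate).float()            # keep only "reliable" samples
        loss = (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item(), int(mask.sum())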

Proceedings ArticleDOI
23 May 2022
TL;DR: FastAudio as discussed by the authors proposes replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks and achieves a relative improvement of 29.7% when compared with fixed front-ends.
Abstract: Spoof speech can be used to try to fool speaker verification systems that determine the identity of the speaker based on voice characteristics. This paper compares popular learnable front-ends on this task. We categorize the front-ends by defining two generic architectures and then analyze the filtering stages of both types in terms of learning constraints. We propose replacing fixed filterbanks with a learnable layer that can better adapt to anti-spoofing tasks. The proposed FastAudio front-end is then tested with two popular back-ends to measure the performance on the Logical Access track of the ASVspoof 2019 dataset. The FastAudio front-end achieves a relative improvement of 29.7% when compared with fixed front-ends, outperforming all other learnable front-ends on this task.

Journal ArticleDOI
TL;DR: In this article, a 3D CNN architecture is proposed by extracting the spatio-temporal features and eventually mapping the prediction probabilities of the elements in the corpus to validate the concept of person-independence, achieving a training accuracy of 80.2% and a testing accuracy of 77.9%.
Abstract: From a broader perspective, the objective of Visual Speech Recognition (VSR) is to comprehend the speech spoken by an individual using visual deformations. However, some of the significant limitations of existing solutions include the dearth of training data, improper end-to-end deployed solutions, lack of holistic feature representation, and less accuracy. To resolve these limitations, this study proposes a novel, scalable, and robust VSR system that uses the videotape of the user to determine the word which is being spoken. In this regard, a customized 3-Dimensional Convolutional Neural Network (3D CNN) architecture is proposed by extracting the Spatio-temporal features and eventually mapping the prediction probabilities of the elements in the corpus. We have created a customized dataset resembling the metadata contained in the MIRACL-VC1 dataset to validate the concept of person-independence. While being robust to a broad spectrum of lighting conditions across multiple devices, our model achieves a training accuracy of 80.2% and a testing accuracy of 77.9% in predicting the word spoken by the user.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this article, the authors propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset, and develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data.
Abstract: This paper investigates methods to effectively retrieve speaker information from the personalized speaker-adapted neural network acoustic models (AMs) in automatic speech recognition (ASR). This problem is especially important in the context of federated learning of ASR acoustic models, where a global model is learnt on the server based on the updates received from multiple clients. We propose an approach to analyze information in neural network AMs based on a neural network footprint on the so-called Indicator dataset. Using this method, we develop two attack models that aim to infer speaker identity from the updated personalized models without access to the actual users' speech data. Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can provide an equal error rate (EER) of 1-2%.

Proceedings ArticleDOI
23 May 2022
TL;DR: This paper proposes a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation, achieving state-of-the-art results on the challenging IEMOCAP dataset.
Abstract: Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics out of the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
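One standard way to realize this kind of speaker normalization is a gradient reversal layer placed before an auxiliary speaker classifier, so the emotion encoder is pushed to discard speaker information. The sketch below shows that generic construct, not necessarily the paper's exact formulation.

    import torch

    class GradientReversal(torch.autograd.Function):
        """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.clone()

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None

    def reverse_gradient(features, lambd=1.0):
        # route features to the speaker classifier through this call; its gradient is reversed
        return GradientReversal.apply(features, lambd)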

Proceedings ArticleDOI
23 May 2022
TL;DR: This study presents a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone, that is able to integrate SAD, OSD and speaker segmentation/clustering and yields competitive results on the VoxConverse20 benchmark.
Abstract: Speaker diarization consists of many components, e.g., front-end processing, speech activity detection (SAD), overlapped speech detection (OSD) and speaker segmentation/clustering. Conventionally, most of the involved components are separately developed and optimized. The resulting speaker diarization systems are complicated and sometimes lack satisfactory generalization capabilities. In this study, we present a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone. The whole system can be simplified to contain only two major parts, a speaker embedding extractor followed by a clustering module. Both parts are implemented with neural networks. In the training phase, an on-the-fly spoken dialogue generator is designed to provide the system with audio streams and the corresponding annotations in the categories of non-speech, overlapped speech and active speakers. Chunk-wise inference and a speaker-verification-based tracing module are used to handle an arbitrary number of speakers. We demonstrate that the proposed speaker diarization system is able to integrate SAD, OSD and speaker segmentation/clustering, and yields competitive results on the VoxConverse20 benchmarks.

Journal ArticleDOI
01 Jan 2022
TL;DR: In this paper, a meta-learning algorithm is applied to speaker adaptation, using Model-Agnostic Meta-Learning (MAML) to find a meta-initialization from which a multi-speaker TTS model can quickly be adapted to unseen speakers.
Abstract: Personalizing a speech synthesis system is a highly desired application, where the system can generate speech with the user's voice from only a few enrolled recordings. There are two main approaches to build such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user's speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a good meta-initialization from which the model can quickly adapt to any few-shot speaker adaptation task. Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation baseline and a speaker encoding baseline. The evaluation results show that Meta-TTS can synthesize speech with high speaker similarity from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline, and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with data from an extra 8,371 speakers, Meta-TTS can still outperform the baseline on the LibriTTS dataset and achieve comparable results on the VCTK dataset.
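The MAML loop referred to above alternates a few inner-loop adaptation steps per speaker with an outer-loop update of the shared meta-initialization. A generic sketch follows; `tts_loss` (a functional loss that accepts explicit weights) and the task sampler are assumptions standing in for the full Meta-TTS model.

    import torch

    def maml_step(model, meta_optimizer, tasks, tts_loss, inner_lr=1e-3, inner_steps=3):
        """One meta-update over a batch of few-shot speaker-adaptation tasks."""
        meta_optimizer.zero_grad()
        for support_batch, query_batch in tasks:          # one task per speaker
            fast_weights = {n: p.clone() for n, p in model.named_parameters()}
            for _ in range(inner_steps):                  # inner loop: adapt to this speaker
                loss = tts_loss(model, support_batch, fast_weights)
                grads = torch.autograd.grad(loss, list(fast_weights.values()), create_graph=True)
                fast_weights = {n: w - inner_lr * g
                                for (n, w), g in zip(fast_weights.items(), grads)}
            # outer loop: evaluate the adapted weights on held-out utterances of the same speaker
            tts_loss(model, query_batch, fast_weights).backward()
        meta_optimizer.step()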

Proceedings ArticleDOI
10 Feb 2022
TL;DR: Results indicate that the proposed technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker’s identity.
Abstract: We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker’s identity. The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers. The voice converted data is then pooled with natural data from the target speaker and used to train a single-speaker multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some controllability of speaking style, when using multiple supporting speakers. We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages. Results indicate that our technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker’s identity.

Journal ArticleDOI
01 Mar 2022-Entropy
TL;DR: This paper proposes a simple and constrained convolutional neural network for speaker recognition tasks, examines its robustness for recognition under emotional speech conditions, and investigates three quantization methods for developing a constrained network.
Abstract: Speaker recognition is an important classification task, which can be solved using several approaches. Although building a speaker recognition model on a closed set of speakers under neutral speaking conditions is a well-researched task and there are solutions that provide excellent performance, the classification accuracy of developed models significantly decreases when applying them to emotional speech or in the presence of interference. Furthermore, deep models may require a large number of parameters, so constrained solutions are desirable in order to implement them on edge devices in the Internet of Things systems for real-time detection. The aim of this paper is to propose a simple and constrained convolutional neural network for speaker recognition tasks and to examine its robustness for recognition in emotional speech conditions. We examine three quantization methods for developing a constrained network: floating-point eight format, ternary scalar quantization, and binary scalar quantization. The results are demonstrated on the recently recorded SEAC dataset.
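Of the three quantization schemes examined, ternary and binary scalar quantization of the weights are easy to sketch (FP8 depends on hardware or library support and is omitted). The magnitude threshold below is one common choice and is assumed, not taken from the paper.

    import numpy as np

    def ternary_quantize(w, threshold_ratio=0.7):
        """Map weights to {-a, 0, +a} using a magnitude threshold."""
        delta = threshold_ratio * np.mean(np.abs(w))
        mask = np.abs(w) > delta
        scale = np.abs(w[mask]).mean() if mask.any() else 0.0
        return np.sign(w) * mask * scale

    def binary_quantize(w):
        """Map weights to {-a, +a}, preserving the mean magnitude."""
        return np.sign(w) * np.abs(w).mean()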