
Showing papers on "Speaker diarisation published in 2022"


Journal ArticleDOI
TL;DR: This work shows that VBx achieves superior performance on three of the most popular datasets for evaluating diarization: the CALLHOME, AMI and DIHARD II datasets, and presents for the first time the derivation and update formulae for the VBx model.

110 citations


Journal ArticleDOI
01 Mar 2022
TL;DR: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity as mentioned in this paper, or in short, identifying "who spoke when" in audio or video recordings.
Abstract: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify “who spoke when”. In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application over time to provide speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these two components to be complementary to each other. By considering such exciting technical trends, we believe that this paper is a valuable contribution to the community to provide a survey work by consolidating the recent developments with neural methods and thus facilitating further progress toward a more efficient speaker diarization.
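To make the core task concrete, here is a minimal sketch (not from the survey) of the classical clustering-based pipeline it reviews: cut the recording into overlapping segments, embed each segment, and cluster the embeddings into speakers. The embed_segment function is a hypothetical stand-in for an x-vector or d-vector extractor, and the window sizes are illustrative.

# Minimal clustering-based diarization sketch (illustrative only):
# segment the audio, embed each segment, cluster embeddings into speakers.
# `embed_segment` is a hypothetical stand-in for an x-vector/d-vector extractor.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize(audio, sr, embed_segment, win=1.5, hop=0.75, n_speakers=2):
    """Return (start_sec, end_sec, speaker_id) tuples for uniform segments."""
    win_s, hop_s = int(win * sr), int(hop * sr)
    starts = range(0, max(len(audio) - win_s, 1), hop_s)
    segments = [audio[s:s + win_s] for s in starts]
    X = np.stack([embed_segment(seg) for seg in segments])       # (N, D)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)             # cosine geometry
    labels = AgglomerativeClustering(n_clusters=n_speakers,
                                     linkage="average").fit_predict(X)
    return [(s / sr, (s + win_s) / sr, int(l)) for s, l in zip(starts, labels)]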

57 citations


Journal ArticleDOI
TL;DR: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity as mentioned in this paper, or in short, identifying "who spoke when" in audio and video recordings.

55 citations


Journal ArticleDOI
TL;DR: The VBx model as discussed by the authors uses a Bayesian hidden Markov model to find speaker clusters in a sequence of x-vectors and achieves superior performance on three popular datasets for evaluating diarization: CALLHOME, AMI and DIHARD II.

41 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: W2V2-Speaker as discussed by the authors applies the wav2vec2 framework to speaker recognition instead of speech recognition, and proposes a single-utterance classification with cross-entropy or additive angular softmax loss, and an utterance-pair classification variant with BCE loss.
Abstract: This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition. We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding. To adapt the framework to speaker recognition, we propose a single-utterance classification variant with cross-entropy or additive angular softmax loss, and an utterance-pair classification variant with BCE loss. Our best performing variant achieves a 1.88% EER on the extended voxceleb1 test set compared to 1.69% EER with an ECAPA-TDNN baseline. Code is available at github.com/nikvaessen/w2v2-speaker.
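A minimal sketch of the mean-pooling plus cross-entropy variant described above, written against the Hugging Face wav2vec2 implementation rather than the authors' released code (see github.com/nikvaessen/w2v2-speaker for the actual system); the checkpoint name and layer sizes are illustrative.

# Sketch: pool wav2vec2 frame outputs into a fixed-length speaker embedding
# and train it as a speaker classifier with cross-entropy (illustrative,
# not the authors' released implementation).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2V2SpeakerClassifier(nn.Module):
    def __init__(self, n_speakers, pretrained="facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(pretrained)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(hidden, n_speakers)

    def forward(self, waveform):                          # waveform: (B, samples)
        frames = self.encoder(waveform).last_hidden_state  # (B, T, H)
        embedding = frames.mean(dim=1)                      # mean pooling -> (B, H)
        return self.head(embedding), embedding

model = W2V2SpeakerClassifier(n_speakers=100)
logits, emb = model(torch.randn(2, 16000))                 # 1 s of 16 kHz audio
loss = nn.functional.cross_entropy(logits, torch.tensor([3, 7]))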

16 citations


Proceedings ArticleDOI
08 Feb 2022
TL;DR: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and the most challenging scenarios of speech technologies, and releases 120 hours of real-recorded Mandarin meeting speech data with manual annotation.
Abstract: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge (M2MeT) focuses on one of the most valuable and most challenging scenarios for speech technologies. The M2MeT challenge has set up two tracks, speaker diarization (track 1) and multi-speaker automatic speech recognition (ASR) (track 2). Along with the challenge, we released 120 hours of real-recorded Mandarin meeting speech data with manual annotation, including far-field data collected by an 8-channel microphone array as well as near-field data collected by each participant's headset microphone. We briefly describe the released dataset, track setups and baselines, and summarize the challenge results and major techniques used in the submissions.

16 citations


Proceedings ArticleDOI
14 Feb 2022
TL;DR: Experimental results show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which reflects the effectiveness of the proposed iGMM integration.
Abstract: Speaker diarization has been investigated extensively as an important central task for meeting analysis. A recent trend shows that integration of end-to-end neural (EEND)- and clustering-based diarization is a promising approach to handle realistic conversational data containing overlapped speech with an arbitrarily large number of speakers, and it has achieved state-of-the-art results on various tasks. However, the approaches proposed so far have not realized tight integration yet, because the clustering employed therein was not optimal in any sense for clustering the speaker embeddings estimated by the EEND module. To address this problem, this paper introduces a trainable clustering algorithm into the integration framework by deep-unfolding a non-parametric Bayesian model called the infinite Gaussian mixture model (iGMM). Specifically, the speaker embeddings are optimized during training such that they better fit iGMM clustering, using a novel clustering loss based on the Adjusted Rand Index (ARI). Experimental results on CALLHOME data show that the proposed approach outperforms the conventional approach in terms of diarization error rate (DER), especially by substantially reducing speaker confusion errors, which reflects the effectiveness of the proposed iGMM integration.
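For reference, the Adjusted Rand Index that the clustering loss is built around can be computed with scikit-learn as below; the paper's loss is a differentiable ARI-based formulation, which this standard (non-differentiable) score does not reproduce.

# Adjusted Rand Index between reference speaker labels and a clustering result.
# The paper uses a differentiable ARI-based loss; sklearn's score shown here
# is the standard non-differentiable version, for illustration only.
from sklearn.metrics import adjusted_rand_score

reference = [0, 0, 1, 1, 2, 2]     # true speaker per embedding
hypothesis = [1, 1, 0, 0, 0, 2]    # cluster IDs from an iGMM-style clustering
print(adjusted_rand_score(reference, hypothesis))  # 1.0 only for a perfect match up to relabeling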

14 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: The AliMeeting corpus as discussed by the authors contains 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by headset microphones.
Abstract: Recent development of speech signal processing, such as speech recognition, speaker diarization, etc., has inspired numerous applications of speech technologies. The meeting scenario is one of the most valuable and, at the same time, most challenging scenarios for the deployment of speech technologies. Speaker diarization and multi-speaker automatic speech recognition in meeting scenarios have attracted much attention recently. However, the lack of large public meeting data has been a major obstacle to advancement of the field. Therefore, we make available the AliMeeting corpus, which consists of 120 hours of recorded Mandarin meeting data, including far-field data collected by an 8-channel microphone array as well as near-field data collected by headset microphones. Each meeting session is composed of 2-4 speakers with different speaker overlap ratios, recorded in meeting rooms of different sizes. Along with the dataset, we launch the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (M2MeT) with two tracks, namely speaker diarization and multi-speaker ASR, aiming to provide a common testbed for meeting rich transcription and promote reproducible research in this field. In this paper we provide a detailed introduction to the AliMeeting dataset, challenge rules, evaluation methods and baseline systems.

13 citations


Journal ArticleDOI
TL;DR:
Abstract: We study the scenario where individuals (speakers) contribute to the publication of an anonymized speech corpus. Data users leverage this public corpus for downstream tasks, e.g., training an automatic speech recognition (ASR) system, while attackers may attempt to de-anonymize it using auxiliary knowledge. Motivated by this scenario, speaker anonymization aims to conceal speaker identity while preserving the quality and usefulness of speech data. In this article, we study x-vector based speaker anonymization, the leading approach in the VoicePrivacy Challenge, which converts the speaker's voice into that of a random pseudo-speaker. We show that the strength of anonymization varies significantly depending on how the pseudo-speaker is chosen. We explore four design choices for this step: the distance metric between speakers, the region of speaker space where the pseudo-speaker is picked, its gender, and whether to assign it to one or all utterances of the original speaker. We assess the quality of anonymization from the perspective of the three actors involved in our threat model, namely the speaker, the user and the attacker. To measure privacy and utility, we use respectively the linkability score achieved by the attackers and the decoding word error rate achieved by an ASR model trained on the anonymized data. Experiments on LibriSpeech show that the best combination of design choices yields state-of-the-art performance in terms of both privacy and utility. Experiments on Mozilla Common Voice further show that it guarantees the same anonymization level against re-identification attacks among 50 speakers as original speech among 20,000 speakers.
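One of the design choices studied, picking a distant pseudo-speaker in x-vector space, can be sketched as follows: rank an external pool of x-vectors by cosine distance to the source speaker, keep the farthest candidates, and average a random subset of them. This follows the general VoicePrivacy-style recipe only in spirit; the function name, pool and parameters are illustrative.

# Sketch of pseudo-speaker selection for x-vector anonymization: rank a pool
# of external x-vectors by cosine distance to the source speaker, keep the
# farthest candidates, and average a random subset as the pseudo-speaker.
# Illustrative recipe; parameters are arbitrary.
import numpy as np

def select_pseudo_xvector(src_xvec, pool, n_farthest=200, n_average=100, rng=None):
    rng = rng or np.random.default_rng()
    pool_n = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    src_n = src_xvec / np.linalg.norm(src_xvec)
    dist = 1.0 - pool_n @ src_n                       # cosine distance to source
    farthest = np.argsort(dist)[-n_farthest:]         # most dissimilar candidates
    chosen = rng.choice(farthest, size=n_average, replace=False)
    return pool[chosen].mean(axis=0)                  # averaged pseudo-speaker x-vector

pseudo = select_pseudo_xvector(np.random.randn(512), np.random.randn(5000, 512))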

12 citations


Journal ArticleDOI
TL;DR: Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.
Abstract: Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anonymization often provides brittle privacy protection, even less so any provable guarantee. In this work, we show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information. We remove speaker information from these attributes by introducing differentially private feature extractors based on an autoencoder and an automatic speech recognizer, respectively, trained using noise layers. We plug these extractors in the state-of-the-art anonymization pipeline and generate, for the first time, private speech utterances with a provable upper bound on the speaker information they contain. We evaluate empirically the privacy and utility resulting from our differentially private speaker anonymization approach on the LibriSpeech data set. Experimental results show that the generated utterances retain very high utility for automatic speech recognition training and inference, while being much better protected against strong adversaries who leverage the full knowledge of the anonymization process to try to infer the speaker identity.
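The key mechanism, injecting calibrated noise into the feature extractors, can be illustrated with a standard Laplace mechanism on a norm-clipped feature vector; this generic sketch is not the paper's exact noise layers or its privacy accounting.

# Generic differential-privacy-style noise layer: clip the vector's L1 norm
# to bound sensitivity, then add Laplace noise scaled by sensitivity / epsilon.
# Illustrative only; the paper trains autoencoder/ASR extractors with noise layers.
import torch

def laplace_noise_layer(features, clip_norm=1.0, epsilon=1.0):
    l1 = features.abs().sum()
    clipped = features * min(1.0, clip_norm / (l1.item() + 1e-12))
    scale = clip_norm / epsilon                        # Laplace scale b = sensitivity / epsilon
    noise = torch.distributions.Laplace(0.0, scale).sample(clipped.shape)
    return clipped + noise

private_feat = laplace_noise_layer(torch.randn(128), clip_norm=1.0, epsilon=5.0)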

10 citations


Journal ArticleDOI
TL;DR: A neural-network-based similarity measurement method is proposed to learn the similarity between any two speaker embeddings, where both previous and future contexts are considered; the framework is further extended to target-speaker voice activity detection (TS-VAD).
Abstract: In this paper, we propose a neural-network-based similarity measurement method to learn the similarity between any two speaker embeddings, where both previous and future contexts are considered. Moreover, we propose the segmental pooling strategy and jointly train the speaker embedding network along with the similarity measurement model. Later, this joint training framework is further extended to the target-speaker voice activity detection (TS-VAD), with only slight modification in the network architecture. Experimental results of the DIHARD II, DIHARD III and VoxConverse datasets show that our clustering-based system with the neural similarity measurement achieves superior performance to recent approaches on all three datasets. In addition, the segment-level TS-VAD method further improves the clustering-based results and achieves DER of 16.48%, 11.62% and 4.39% on the DIHARD II, DIHARD III and VoxConverse datasets, respectively.
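Once a learned scorer produces pairwise similarities between embeddings, clustering can operate directly on that matrix. The sketch below (illustrative, not the paper's code) converts a similarity matrix from any scorer into distances and runs agglomerative clustering with a precomputed metric.

# Clustering from a learned pairwise similarity matrix (illustrative):
# convert similarities in [0, 1] to distances and run agglomerative clustering
# with a precomputed metric, as a stand-in for a neural scorer + clustering system.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_from_similarity(sim, distance_threshold=0.5):
    dist = 1.0 - sim                                   # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    # note: older scikit-learn versions use affinity="precomputed" instead of metric=
    ahc = AgglomerativeClustering(n_clusters=None,
                                  distance_threshold=distance_threshold,
                                  metric="precomputed", linkage="average")
    return ahc.fit_predict(dist)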

Journal ArticleDOI
TL;DR: In this paper, a speaker de-identification method was proposed, which, in addition to simple formant shifts, manipulates f0 trajectories based on functional data analysis.

Proceedings ArticleDOI
18 Feb 2022
TL;DR: In this article, a multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion is proposed, where vector quantization with contrastive predictive coding is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units.
Abstract: Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained vocabulary and open vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, a transformer transducer is used to detect the speaker turns and represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns.
Abstract: In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.

Proceedings ArticleDOI
23 May 2022
TL;DR: TitaNet as mentioned in this paper employs 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer to map variable-length utterances to a fixed-length embedding (t-vector).
Abstract: In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and also on speaker diarization tasks with a diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results on diarization tasks.
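A shape-level PyTorch sketch of the two building blocks named above, a 1D depthwise separable convolution and a Squeeze-and-Excitation layer with global context; the actual TitaNet (released in NVIDIA NeMo) stacks many such blocks and adds attentive statistics pooling, none of which is reproduced here.

# Shape-level sketch of TitaNet-style building blocks (not the NeMo implementation):
# a 1D depthwise separable convolution followed by a Squeeze-and-Excitation layer.
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, T)
        return self.pointwise(self.depthwise(x))

class SqueezeExcite1d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # global context over time
        scale = self.fc(x.mean(dim=2))           # (B, C) channel weights
        return x * scale.unsqueeze(-1)

feats = torch.randn(4, 256, 300)                 # (batch, channels, frames)
out = SqueezeExcite1d(256)(DepthwiseSeparableConv1d(256)(feats))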

Proceedings ArticleDOI
23 May 2022
TL;DR: An end-to-end target-speaker voice activity detection method for speaker diarization using a ResNet-based network and a BiLSTM-based TS-VAD model, which achieves better performance than the original TS-VAD method with the clustering-based initialization.
Abstract: In this paper, we propose an end-to-end target-speaker voice activity detection (E2E-TS-VAD) method for speaker diarization. First, a ResNet-based network extracts the frame-level speaker embeddings from the acoustic features. Then, the L2-normalized frame-level speaker embeddings are fed to the transformer encoder which produces the initialization of the speaker diarization results. Later, the frame-level speaker embeddings are aggregated to several target-speaker embeddings based on the output from the transformer encoder. Finally, a BiLSTM-based TS-VAD model predicts the refined diarization results. Several aggregation methods are explored, including soft/hard decisions with/without normalization. Results show that E2E-TS-VAD achieves better performance than the original TS-VAD method with the clustering-based initialization.
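The aggregation step, turning frame-level embeddings into target-speaker embeddings using the transformer's initial diarization output, can be sketched as a posterior-weighted average; the soft and hard variants correspond to using the posteriors directly or thresholding them first. Shapes and names are illustrative.

# Sketch of soft/hard aggregation of frame-level speaker embeddings into
# target-speaker embeddings, weighted by initial per-speaker activity posteriors.
# Illustrative shapes; not the paper's exact implementation.
import torch

def aggregate_target_embeddings(frame_emb, posteriors, hard=False, threshold=0.5):
    # frame_emb:  (T, D) L2-normalized frame-level embeddings
    # posteriors: (T, S) initial diarization output, one column per speaker
    w = (posteriors > threshold).float() if hard else posteriors
    w = w / (w.sum(dim=0, keepdim=True) + 1e-8)        # normalize over time
    return w.t() @ frame_emb                           # (S, D) target-speaker embeddings

targets = aggregate_target_embeddings(torch.randn(500, 256), torch.rand(500, 4))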

Proceedings ArticleDOI
04 Feb 2022
TL;DR: This paper describes the speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks.
Abstract: This paper describes our speaker diarization system submitted to the Multi-channel Multi-party Meeting Transcription (M2MeT) challenge, where Mandarin meeting data were recorded in multi-channel format for diarization and automatic speech recognition (ASR) tasks. In these meeting scenarios, the uncertainty of the speaker number and the high ratio of overlapped speech present great challenges for diarization. Based on the assumption that there is valuable complementary information between acoustic features, spatial-related and speaker-related features, we propose a multi-level feature fusion mechanism based target-speaker voice activity detection (FFM-TS-VAD) system to improve the performance of the conventional TS-VAD system. Furthermore, we propose a data augmentation method during training to improve the system robustness when the angular difference between two speakers is relatively small. We provide comparisons for different sub-systems we used in M2MeT challenge. Our submission is a fusion of several sub-systems and ranks second in the diarization task.

Proceedings ArticleDOI
23 May 2022
TL;DR: This study presents a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone, able to integrate SAD, OSD and speaker segmentation/clustering, and yield competitive results in the VoxConverse20 benchmarks.
Abstract: Speaker diarization consists of many components, e.g., front-end processing, speech activity detection (SAD), overlapped speech detection (OSD) and speaker segmentation/clustering. Conventionally, most of the involved components are separately developed and optimized. The resulting speaker diarization systems are complicated and sometimes lack satisfying generalization capabilities. In this study, we present a novel speaker diarization system, with a generalized neural speaker clustering module as the backbone. The whole system can be simplified to contain only two major parts, a speaker embedding extractor followed by a clustering module. Both parts are implemented with neural networks. In the training phase, an on-the-fly spoken dialogue generator is designed to provide the system with audio streams and the corresponding annotations in categories of non-speech, overlapped speech and active speakers. The chunk-wise inference and a speaker verification based tracing module are conducted to handle an arbitrary number of speakers. We demonstrate that the proposed speaker diarization system is able to integrate SAD, OSD and speaker segmentation/clustering, and yields competitive results on the VoxConverse20 benchmark.

Journal ArticleDOI
01 Jan 2022
TL;DR: In this paper, a meta-learning algorithm was proposed to adapt a multi-speaker TTS model to unseen speakers, using Model-Agnostic Meta-Learning (MAML) to find a good meta-initialization.
Abstract: Personalizing a speech synthesis system is a highly desired application, where the system can generate speech with the user’s voice from only a few enrolled recordings. There are two main approaches to build such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user’s speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model-Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a good meta-initialization from which the model can quickly adapt to any few-shot speaker adaptation task. Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation method baseline and a speaker encoding method baseline. The evaluation results show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with data from an extra 8371 speakers, Meta-TTS can still outperform the baseline on the LibriTTS dataset and achieve comparable results on the VCTK dataset.
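A minimal first-order MAML loop, to make the meta-initialization idea concrete. The paper meta-trains a full multi-speaker TTS model; this sketch uses a generic model and loss_fn (both placeholders) and omits the second-order gradient term.

# First-order MAML sketch: adapt a copy of the model on a speaker's support set,
# evaluate on the query set, and use that loss to update the meta-initialization.
# Generic model/loss placeholders; the paper meta-trains a multi-speaker TTS model.
import copy
import torch

def maml_outer_step(model, tasks, loss_fn, meta_opt, inner_lr=1e-2, inner_steps=3):
    meta_opt.zero_grad()
    for support, query in tasks:                        # one task per speaker
        fast = copy.deepcopy(model)                     # start from the meta-initialization
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                    # few-shot adaptation
            inner_opt.zero_grad()
            loss_fn(fast, support).backward()
            inner_opt.step()
        query_loss = loss_fn(fast, query)
        grads = torch.autograd.grad(query_loss, list(fast.parameters()))
        for p, g in zip(model.parameters(), grads):     # first-order approximation
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()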

Proceedings ArticleDOI
18 Sep 2022
TL;DR: A novel end-to-end neural-network-based audio-visual speaker diarization method that can explicitly handle the overlapping speech and distinguish between speech and non-speech accurately with multi-modal information is proposed.
Abstract: In this paper, we propose a novel end-to-end neural-network-based audio-visual speaker diarization method. Unlike most existing audio-visual methods, our audio-visual model takes audio features (e.g., FBANKs), multi-speaker lip regions of interest (ROIs), and multi-speaker i-vector embeddings as multi-modal inputs. And a set of binary classification output layers produces activities of each speaker. With the finely designed end-to-end structure, the proposed method can explicitly handle the overlapping speech and distinguish between speech and non-speech accurately with multi-modal information. I-vectors are the key point to solve the alignment problem caused by visual modality error (e.g., occlusions, off-screen speakers or unreliable detection). Besides, our audio-visual model is robust to the absence of visual modality, whereas the diarization performance degrades significantly when using the visual-only model. Evaluated on the datasets of the first multi-modal information based speech processing (MISP) challenge, the proposed method achieved diarization error rates (DERs) of 10.1%/9.5% on the development/eval set with reference voice activity detection (VAD) information, while the audio-only and video-only systems yielded DERs of 27.9%/29.0% and 14.6%/13.1% respectively.

Proceedings ArticleDOI
10 Feb 2022
TL;DR: Results indicate that the proposed technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker’s identity.
Abstract: We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker’s identity. The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers. The voice converted data is then pooled with natural data from the target speaker and used to train a single-speaker multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some controllability of speaking style, when using multiple supporting speakers. We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages. Results indicate that our technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker’s identity.

Proceedings ArticleDOI
09 Feb 2022
TL;DR: This paper describes the submission to ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge, and proposes a neural front-end module to model multi-channel audio and train the model end-to-end.
Abstract: This paper describes our submission to the ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) Challenge. For Track 1, we propose several approaches to enable the clustering-based speaker diarization system to handle overlapped speech. Front-end dereverberation and direction-of-arrival (DOA) estimation are used to improve the accuracy of speaker diarization. Multi-channel combination and overlap detection are applied to reduce the missed speaker error. A modified DOVER-Lap is also proposed to fuse the results from different systems. We achieve a final DER of 5.79% on the Eval set and 7.23% on the Test set, which ranks 4th in the diarization challenge. For Track 2, we develop our system using the Conformer model in a joint CTC-attention architecture. Serialized output training (SOT) is adopted for multi-speaker overlapped speech recognition. We propose a neural front-end module to model multi-channel audio and train the model end-to-end. Various data augmentation methods are utilized to mitigate over-fitting in the multi-channel multi-speaker E2E system. Transformer language model fusion is developed to achieve better performance. The final CER is 19.2% on the Eval set and 20.8% on the Test set, which ranks 2nd in the ASR challenge.

Proceedings ArticleDOI
21 Feb 2022
TL;DR: In this paper, the authors proposed an end-to-end localized target speaker extraction method based on pure speech cues, called L-SpEx, and designed a speaker localizer driven by the target speaker's embedding to extract spatial features, including the direction-of-arrival (DOA) of the target speaker and the beamforming output.
Abstract: Speaker extraction aims to extract the target speaker’s voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker’s location is known in advance or detected by an extra visual cue, e.g., face image or video. In this paper, we propose end-to-end localized target speaker extraction based on pure speech cues, called L-SpEx. Specifically, we design a speaker localizer driven by the target speaker’s embedding to extract the spatial features, including direction-of-arrival (DOA) of the target speaker and beamforming output. Then, the spatial cues and target speaker’s embedding are both used to form a top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset called MCLibri2Mix show that our L-SpEx approach significantly outperforms the baseline system.

Proceedings ArticleDOI
10 Feb 2022
TL;DR: Two improvements to target-speaker voice activity detection (TS-VAD) are proposed, the core component in the proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge.
Abstract: We propose two improvements to target-speaker voice activity detection (TS-VAD), the core component in our proposed speaker diarization system that was submitted to the 2022 Multi-Channel Multi-Party Meeting Transcription (M2MeT) challenge. These techniques are designed to handle multi-speaker conversations in real-world meeting scenarios with high speaker-overlap ratios and under heavily reverberant and noisy conditions. First, for data preparation and augmentation in training TS-VAD models, speech data containing both real meetings and simulated indoor conversations are used. Second, in refining results obtained after TS-VAD based decoding, we perform a series of post-processing steps to improve the VAD results needed to reduce diarization error rates (DERs). Tested on the ALIMEETING corpus, the newly released Mandarin meeting dataset used in M2MeT, we demonstrate that our proposed system can decrease the DER by up to 66.55/60.59% relatively when compared with classical clustering-based diarization on the Eval/Test set.
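The post-processing steps are not spelled out in the abstract; a typical sketch of such refinement (median-smoothing each speaker's activity posteriors, thresholding, and dropping very short segments) is shown below with illustrative parameters.

# Typical TS-VAD post-processing sketch (illustrative, not the paper's exact steps):
# median-filter each speaker's activity posteriors, threshold, and remove
# active runs shorter than a minimum duration.
import numpy as np
from scipy.signal import medfilt

def postprocess_vad(posteriors, threshold=0.5, kernel=11, min_frames=10):
    # posteriors: (S, T) per-speaker frame posteriors in [0, 1]
    decisions = []
    for p in posteriors:
        smoothed = medfilt(p, kernel_size=kernel) > threshold
        run_start = None
        for t in range(len(smoothed) + 1):             # zero out runs shorter than min_frames
            active = t < len(smoothed) and smoothed[t]
            if active and run_start is None:
                run_start = t
            elif not active and run_start is not None:
                if t - run_start < min_frames:
                    smoothed[run_start:t] = False
                run_start = None
        decisions.append(smoothed)
    return np.stack(decisions)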


Proceedings ArticleDOI
23 May 2022
TL;DR: An InfoMax domain separation and adaptation network (InfoMax–DSAN) is proposed to disentangle domain-specific features and domain-invariant speaker features based on domain adaptation techniques, along with a frame-based mutual information neural estimator that maximizes the mutual information between frame-level features and input acoustic features, which helps retain more useful information.
Abstract: Entanglement of speaker features and redundant features may lead to poor performance when evaluating speaker verification systems on an unseen domain. To address this issue, we propose an InfoMax domain separation and adaptation network (InfoMax–DSAN) to disentangle the domain-specific features and domain-invariant speaker features based on domain adaptation techniques. A frame-based mutual information neural estimator is proposed to maximize the mutual information between frame-level features and input acoustic features, which can help retain more useful information. Furthermore, we propose adopting triplet loss based on the idea of self-supervised learning to overcome the label mismatch problem. Experimental results on VOiCES Challenge 2019 demonstrate that our proposed method can help learn more discriminative and robust speaker embeddings.
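The triplet objective mentioned above can be illustrated with PyTorch's built-in triplet margin loss on anchor/positive/negative embeddings; the paper combines it with mutual-information maximization and domain separation, which this snippet does not attempt.

# Triplet loss on speaker embeddings (illustrative): pull the anchor toward a
# positive (same source) and push it away from a negative, with a margin.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3)
anchor = torch.randn(8, 192)     # batch of anchor embeddings
positive = torch.randn(8, 192)   # same-speaker / same-utterance embeddings
negative = torch.randn(8, 192)   # different-speaker embeddings
loss = triplet(anchor, positive, negative)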

Journal ArticleDOI
TL;DR: In this paper, a self-supervised pre-training strategy is proposed to exploit the speech-lip synchronization cue for target speaker extraction, which makes it possible to leverage abundant unlabeled in-domain data.
Abstract: A speaker extraction algorithm seeks to extract the speech of a target speaker from a multi-talker speech mixture when given a cue that represents the target speaker, such as a pre-enrolled speech utterance, or an accompanying video track. Visual cues are particularly useful when a pre-enrolled speech is not available. In this work, we don’t rely on the target speaker’s pre-enrolled speech, but rather use the target speaker’s face track as the speaker cue, that is referred to as the auxiliary reference, to form an attractor towards the target speaker. We advocate that the temporal synchronization between the speech and its accompanying lip movements is a direct and dominant audio-visual cue. Therefore, we propose a self-supervised pre-training strategy, to exploit the speech-lip synchronization cue for target speaker extraction, which allows us to leverage abundant unlabeled in-domain data. We transfer the knowledge from the pre-trained model to the attractor encoder of the speaker extraction network. We show that the proposed speaker extraction network outperforms various competitive baselines in terms of signal quality, perceptual quality, and intelligibility, achieving state-of-the-art performance.

Proceedings ArticleDOI
18 Sep 2022
TL;DR: A novel online speaker diarization approach based on the VBx algorithm, which works well on offline speaker diarization tasks, and solves the label ambiguity problem with a global constrained clustering algorithm.
Abstract: We propose a novel online speaker diarization approach based on the VBx algorithm which works well on the offline speaker diarization tasks. To efficiently process long-time recordings, we perform the online diarization in a block-wise manner. First, we devise a core samples updating strategy utilizing time penalty function, which can preserve important historical information with a low memory cost. Then we select clustering samples from core samples by stratified sampling to enhance the variability among samples and retain sufficient speaker identity information, which helps VBx to improve classification accuracy on a small amount of data. Finally, we solve the label ambiguity problem by a global constrained clustering algorithm. We evaluate our system on DIHARD and AMI datasets. The experimental results demonstrate that our online approach achieves superior performance compared with the state-of-the-art.

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper, the authors show that it is possible to retrieve not only the gender of a speaker but also their identity by exploiting the weight matrix changes of a neural acoustic model locally adapted to that speaker, i.e., the weights of personalized models that could be exchanged instead of user data.
Abstract: The widespread use of powerful personal devices capable of collecting the voice of their users has opened the opportunity to build speaker-adapted speech recognition (ASR) systems or to participate in collaborative learning of ASR. In both cases, personalized acoustic models (AM), i.e. AM fine-tuned with specific speaker data, can be built. A question that naturally arises is whether the dissemination of personalized acoustic models can leak personal information. In this paper, we show that it is possible to retrieve the gender of the speaker, but also their identity, by just exploiting the weight matrix changes of a neural acoustic model locally adapted to this speaker. Incidentally, we observe phenomena that may be useful towards explainability of deep neural networks in the context of speech processing. Gender can be identified almost surely using only the first layers, and speaker verification performs well when using middle-up layers. Our experimental study on the TED-LIUM 3 dataset with HMM/TDNN models shows a purity of 95% for gender detection, and an Equal Error Rate of 9.07% for a speaker verification task, by only exploiting the weights from personalized models that could be exchanged instead of user data.
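The attack idea can be sketched as follows: treat the difference between a personalized model's weights and the shared initial weights as a feature vector and train an off-the-shelf classifier on it. The feature construction and classifier here are illustrative; the paper works with HMM/TDNN acoustic models on TED-LIUM 3 and analyzes which layers are informative.

# Sketch of inferring speaker attributes from personalized-model weight changes:
# flatten (personalized - global) weight deltas of selected layers into a feature
# vector and fit a simple classifier (e.g., for gender). Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def weight_delta_features(personal_weights, global_weights, layers):
    # both arguments: dicts mapping layer name -> weight matrix (np.ndarray)
    return np.concatenate([(personal_weights[l] - global_weights[l]).ravel()
                           for l in layers])

def fit_attribute_classifier(X, y):
    # X: one weight-delta feature vector per personalized model, y: attribute labels
    return LogisticRegression(max_iter=1000).fit(X, y)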

Journal ArticleDOI
TL;DR: Huang et al. as discussed by the authors proposed region proposal network-based speaker diarization (RPNSD) to handle overlapping speech and integrate an automatic speech recognition (ASR) component into the system.