Showing papers in "arXiv: Audio and Speech Processing in 2019"

Posted Content•

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

Ryuichi Yamamoto, Eunwoo Song¹, Jae-Min Kim¹•Institutions (1)

Naver Corporation¹

25 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment, which is comparative to the best distillation-based Parallel WaveNet system.

Abstract: We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment. Perceptual listening test results verify that our proposed method achieves 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparative to the best distillation-based Parallel WaveNet system.
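
The multi-resolution spectrogram loss mentioned in the abstract can be illustrated with a small NumPy sketch. The abstract does not spell out the loss terms, so the spectral-convergence and log-magnitude terms below, the function names, and the three `(fft_size, hop, win_length)` settings are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np

def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude spectrogram from framed, Hann-windowed real FFTs."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack([x[i * hop : i * hop + win_length] * window
                       for i in range(n_frames)])
    # rfft zero-pads each frame up to fft_size when win_length < fft_size
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=1))

def stft_loss(x_hat, x, fft_size, hop, win_length, eps=1e-7):
    """Spectral convergence plus mean log-magnitude distance (assumed terms)."""
    s_hat = stft_magnitude(x_hat, fft_size, hop, win_length)
    s = stft_magnitude(x, fft_size, hop, win_length)
    spectral_convergence = np.linalg.norm(s - s_hat) / (np.linalg.norm(s) + eps)
    log_magnitude = np.mean(np.abs(np.log(s + eps) - np.log(s_hat + eps)))
    return spectral_convergence + log_magnitude

# Hypothetical analysis settings: (fft_size, hop, win_length)
RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]

def multi_resolution_stft_loss(x_hat, x, resolutions=RESOLUTIONS):
    """Average the single-resolution loss over several STFT settings."""
    return float(np.mean([stft_loss(x_hat, x, *r) for r in resolutions]))
```

Averaging over several window sizes keeps the generator from fitting one fixed time-frequency trade-off; in training, a loss of this kind would be combined with the adversarial loss.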

256 citations

Posted Content•

BUT System Description to VoxCeleb Speaker Recognition Challenge 2019

Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matejka, Oldrich Plchot

16 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: This report describes the submission of the Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019: a fusion of 4 Convolutional Neural Network (CNN) topologies, whose best systems for the Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set, respectively.

Abstract: In this report, we describe the submission of Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on VoxCeleb-1 test sets. Submitted systems for both Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have ResNet34 topology and use two-dimensional CNNs. The last two networks are one-dimensional CNN and are based on the x-vector extraction topology. Some of the networks are fine-tuned using additive margin angular softmax. Kaldi FBanks and Kaldi PLPs were used as features. The difference between Fixed and Open systems lies in the used training data and fusion strategy. The best systems for Fixed and Open conditions achieved 1.42% and 1.26% EER on the challenge evaluation set respectively.

167 citations

Posted Content•

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

Kundan Kumar¹, Rithesh Kumar², Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo², Alexandre de Brebisson³, Yoshua Bengio², Aaron Courville²•Institutions (3)

Indian Institute of Technology Kanpur¹, Université de Montréal², Imperial College London³

08 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: This paper proposes a non-autoregressive, fully convolutional GAN for mel-spectrogram inversion and shows qualitative results in speech synthesis, music domain translation and unconditional music synthesis.

Abstract: Previous works (Donahue et al., 2018a; Engel et al., 2019a) have found that generating coherent raw audio waveforms with GANs is challenging. In this paper, we show that it is possible to train GANs reliably to generate high quality coherent waveforms by introducing a set of architectural changes and simple training techniques. Subjective evaluation metric (Mean Opinion Score, or MOS) shows the effectiveness of the proposed approach for high quality mel-spectrogram inversion. To establish the generality of the proposed techniques, we show qualitative results of our model in speech synthesis, music domain translation and unconditional music synthesis. We evaluate the various components of the model through ablation studies and suggest a set of guidelines to design general purpose discriminators and generators for conditional sequence synthesis tasks. Our model is non-autoregressive, fully convolutional, with significantly fewer parameters than competing models and generalizes to unseen speakers for mel-spectrogram inversion. Our pytorch implementation runs at more than 100x faster than realtime on GTX 1080Ti GPU and more than 2x faster than real-time on CPU, without any hardware specific optimization tricks.

136 citations

Posted Content•

Speech Model Pre-training for End-to-End Spoken Language Understanding

Loren Lugosch¹, Mirco Ravanelli², Patrick Ignoto, Vikrant Singh Tomar¹, Yoshua Bengio²•Institutions (2)

McGill University¹, Université de Montréal²

07 Apr 2019-arXiv: Audio and Speech Processing

TL;DR: This work proposes a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU; the method improves performance both when the full dataset is used for training and when only a small subset is used.

Abstract: Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.

130 citations

Posted Content•

Imperceptible, Robust, and Targeted Adversarial Examples for Automatic Speech Recognition

Yao Qin, Nicholas Carlini, Ian Goodfellow, Garrison W. Cottrell, Colin Raffel

22 Mar 2019-arXiv: Audio and Speech Processing

TL;DR: This paper develops effectively imperceptible audio adversarial examples by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets, and makes progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions.

Abstract: Adversarial examples are inputs to machine learning models designed by an adversary to cause an incorrect output. So far, adversarial examples have been studied most extensively in the image domain. In this domain, adversarial examples can be constructed by imperceptibly modifying images to cause misclassification, and are practical in the physical world. In contrast, current targeted adversarial examples applied to speech recognition systems have neither of these properties: humans can easily identify the adversarial perturbations, and they are not effective when played over-the-air. This paper makes advances on both of these fronts. First, we develop effectively imperceptible audio adversarial examples (verified through a human study) by leveraging the psychoacoustic principle of auditory masking, while retaining 100% targeted success rate on arbitrary full-sentence targets. Next, we make progress towards physical-world over-the-air audio adversarial examples by constructing perturbations which remain effective even after applying realistic simulated environmental distortions.

122 citations

Posted Content•

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

28 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieved word error rates of 6.37% on the test-clean set and 15.30% on the test-other set, while remaining streamable, compact, and computationally efficient with complexity of O(T), where T is input sequence length.

Abstract: We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and comes with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieved word error rates of 6.37 % on the test-clean set and 15.30 % on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is input sequence length.
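
The truncated self-attention idea in the abstract can be sketched as a banded attention mask. This is a minimal illustration; the function name and the `left`/`right` context sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

def truncated_attention_mask(seq_len, left, right):
    """Boolean mask where mask[t, s] is True if frame t may attend to frame s."""
    idx = np.arange(seq_len)
    offset = idx[None, :] - idx[:, None]  # s - t for every (t, s) pair
    return (offset >= -left) & (offset <= right)

# Hypothetical context sizes: 2 past frames and 1 future frame per query.
mask = truncated_attention_mask(seq_len=6, left=2, right=1)
print(mask.astype(int))
```

Each row has at most `left + right + 1` allowed positions regardless of the sequence length, which is why the total attention cost grows linearly in T, and a small `right` bounds the look-ahead needed for streaming.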

121 citations

Posted Content•

MelNet: A Generative Model for Audio in the Frequency Domain

Sean Vasquez, Michael Lewis

04 Jun 2019-arXiv: Audio and Speech Processing

TL;DR: This work designs a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve, and applies it to a variety of audio generation tasks, showing improvements over previous approaches in both density estimates and human judgments.

Abstract: Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis---showing improvements over previous approaches in both density estimates and human judgments.

119 citations

Posted Content•

Who is Real Bob? Adversarial Attacks on Speaker Recognition Systems

Guangke Chen, Sen Chen, Lingling Fan, Xiaoning Du, Zhe Zhao, Fu Song, Yang Liu

03 Nov 2019-arXiv: Audio and Speech Processing

TL;DR: This paper conducts the first comprehensive and systematic study of the adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical black-box setting, and proposes an adversarial attack, named FakeBob, to craft adversarial samples.

Abstract: Speaker recognition (SR) is widely used in our daily life as a biometric authentication or identification mechanism. The popularity of SR brings in serious security concerns, as demonstrated by recent adversarial attacks. However, the impacts of such threats in the practical black-box setting are still open, since current attacks consider the white-box setting only. In this paper, we conduct the first comprehensive and systematic study of the adversarial attacks on SR systems (SRSs) to understand their security weakness in the practical blackbox setting. For this purpose, we propose an adversarial attack, named FAKEBOB, to craft adversarial samples. Specifically, we formulate the adversarial sample generation as an optimization problem, incorporated with the confidence of adversarial samples and maximal distortion to balance between the strength and imperceptibility of adversarial voices. One key contribution is to propose a novel algorithm to estimate the score threshold, a feature in SRSs, and use it in the optimization problem to solve the optimization problem. We demonstrate that FAKEBOB achieves 99% targeted attack success rate on both open-source and commercial systems. We further demonstrate that FAKEBOB is also effective on both open-source and commercial systems when playing over the air in the physical world. Moreover, we have conducted a human study which reveals that it is hard for human to differentiate the speakers of the original and adversarial voices. Last but not least, we show that four promising defense methods for adversarial attack from the speech recognition domain become ineffective on SRSs against FAKEBOB, which calls for more effective defense methods. We highlight that our study peeks into the security implications of adversarial attacks on SRSs, and realistically fosters to improve the security robustness of SRSs.

98 citations

Posted Content•

Jasper: An End-to-End Convolutional Neural Acoustic Model

Jason Li¹, Vitaly Lavrukhin¹, Boris Ginsburg¹, Ryan Leary¹, Oleksii Kuchaiev¹, Jonathan Cohen¹, Huyen Nguyen¹, Ravi Teja Gadde²•Institutions (2)

Nvidia¹, Amazon.com²

05 Apr 2019-arXiv: Audio and Speech Processing

TL;DR: Jasper uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections; to improve training, the paper further introduces a new layer-wise optimizer called NovoGrad.

Abstract: In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.

91 citations

Posted Content•

Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation

Yi Luo¹, Zhuo Chen², Takuya Yoshioka²•Institutions (2)

Columbia University¹, Microsoft²

14 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: Experiments show that by replacing 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a 20 times smaller model than the previous best system.

Abstract: Recent studies in deep learning-based speech separation have proven the superiority of time-domain approaches to conventional time-frequency-based methods. Unlike the time-frequency domain approaches, the time-domain separation systems often receive input sequences consisting of a huge number of time steps, which introduces challenges for modeling extremely long sequences. Conventional recurrent neural networks (RNNs) are not effective for modeling such long sequences due to optimization difficulties, while one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when its receptive field is smaller than the sequence length. In this paper, we propose dual-path recurrent neural network (DPRNN), a simple yet effective method for organizing RNN layers in a deep structure to model extremely long sequences. DPRNN splits the long sequential input into smaller chunks and applies intra- and inter-chunk operations iteratively, where the input length can be made proportional to the square root of the original sequence length in each operation. Experiments show that by replacing 1-D CNN with DPRNN and apply sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a 20 times smaller model than the previous best system.
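
The square-root claim in the abstract can be checked with a short chunking sketch. The 50% overlap, the zero padding, and the choice of a chunk size near sqrt(2L) are assumptions made for illustration; the abstract only states that the input is split into smaller chunks.

```python
import math
import numpy as np

def split_into_chunks(x, chunk_size):
    """Split a 1-D sequence into 50%-overlapping chunks, zero-padding the tail."""
    hop = chunk_size // 2
    n_chunks = max(1, math.ceil((len(x) - chunk_size) / hop) + 1)
    padded_len = (n_chunks - 1) * hop + chunk_size
    padded = np.pad(x, (0, padded_len - len(x)))
    return np.stack([padded[i * hop : i * hop + chunk_size]
                     for i in range(n_chunks)])

seq_len = 32000                              # hypothetical input length L
chunk_size = round(math.sqrt(2 * seq_len))   # 253, close to sqrt(2L)
chunks = split_into_chunks(np.arange(seq_len, dtype=float), chunk_size)
print(chunks.shape)  # (253, 253)
```

An intra-chunk pass then runs over 253 steps within each chunk and an inter-chunk pass runs over 253 chunks, so neither recurrent pass sees anything close to the original 32000 steps.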

89 citations

Journal Article•DOI•

Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition.

Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

10 Nov 2019-arXiv: Audio and Speech Processing

TL;DR: Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show that such a non-autoregressive network can be trained for ASR; on Aishell, it matches the performance of the state-of-the-art autoregressive transformer with a 7x speedup.

Abstract: Recently very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, to put it into production usage, inference computation cost is still a serious concern in real scenarios. In this paper, we study two different non-autoregressive transformer structure for automatic speech recognition (ASR): A-CMLM and A-FMLM. During training, for both frameworks, input tokens fed to the decoder are randomly replaced by special mask tokens. The network is required to predict the tokens corresponding to those mask tokens by taking both unmasked context and input speech into consideration. During inference, we start from all mask tokens and the network iteratively predicts missing tokens based on partial results. We show that this framework can support different decoding strategies, including traditional left-to-right. A new decoding strategy is proposed as an example, which starts from the easiest predictions to the most difficult ones. Results on Mandarin (Aishell) and Japanese (CSJ) ASR benchmarks show the possibility to train such a non-autoregressive network for ASR. Especially in Aishell, the proposed method outperformed the Kaldi ASR system and it matches the performance of the state-of-the-art autoregressive transformer with 7x speedup. Pretrained models and code will be made available after publication.
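
The easy-first decoding strategy described in the abstract can be sketched as a loop that starts from all mask tokens and commits the most confident prediction first. The `MASK` id, the `predict_proba` callback, the known output length, and the toy model are all hypothetical stand-ins, not the paper's code.

```python
import numpy as np

MASK = -1  # hypothetical id for the special mask token

def easy_first_decode(predict_proba, length, tokens_per_step=1):
    """Fill masked positions iteratively, most confident predictions first.

    predict_proba(tokens) must return an array of shape (length, vocab)
    holding token probabilities for every position, given the partially
    filled sequence (and, in a real system, the input speech).
    """
    tokens = np.full(length, MASK)
    while (tokens == MASK).any():
        probs = predict_proba(tokens)
        best_token = probs.argmax(axis=1)
        confidence = probs.max(axis=1)
        confidence[tokens != MASK] = -np.inf  # never overwrite filled slots
        for pos in np.argsort(-confidence)[:tokens_per_step]:
            if tokens[pos] == MASK:
                tokens[pos] = best_token[pos]
    return tokens

def toy_model(tokens):
    # Stand-in for the decoder: fixed probabilities that ignore context.
    return np.array([[0.6, 0.4], [0.1, 0.9], [0.8, 0.2]])

print(easy_first_decode(toy_model, length=3))  # [0 1 0]
```

With the toy model, position 1 is filled first (confidence 0.9), then position 2, then position 0. Raising `tokens_per_step` trades accuracy for fewer decoder passes, which is where a speedup over left-to-right decoding would come from.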

Posted Content•

The VOiCES from a Distance Challenge 2019 Evaluation Plan.

Mahesh Kumar Nandwana, Julien van Hout, Mitchell McLaren, Colleen Richey¹, Aaron Lawson, Maria Alejandra Barrios•Institutions (1)

SRI International¹

27 Feb 2019-arXiv: Audio and Speech Processing

TL;DR: The "VOiCES from a Distance Challenge 2019" is designed to foster research in the area of speaker recognition and automatic speech recognition with the special focus on single channel distant/far-field audio, under noisy conditions.

Abstract: The "VOiCES from a Distance Challenge 2019" is designed to foster research in the area of speaker recognition and automatic speech recognition (ASR) with the special focus on single channel distant/far-field audio, under noisy conditions. The main objectives of this challenge are to: (i) benchmark state-of-the-art technology in the area of speaker recognition and automatic speech recognition (ASR), (ii) support the development of new ideas and technologies in speaker recognition and ASR, (iii) support new research groups entering the field of distant/far-field speech processing, and (iv) provide a new, publicly available dataset to the community that exhibits realistic distance characteristics.

Posted Content•

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition

Xu Xiang¹, Shuai Wang¹, Houjun Huang, Yanmin Qian¹, Kai Yu¹•Institutions (1)

Shanghai Jiao Tong University¹

18 Jun 2019-arXiv: Audio and Speech Processing

TL;DR: Three different margin based losses which not only separate classes but also demand a fixed margin between classes are introduced to deep speaker embedding learning and it could be demonstrated that the margin is the key to obtain more discriminative speaker embeddings.

Abstract: Recently, speaker embeddings extracted from a speaker discriminative deep neural network (DNN) yield better performance than the conventional methods such as i-vector. In most cases, the DNN speaker classifier is trained using cross entropy loss with softmax. However, this kind of loss function does not explicitly encourage inter-class separability and intra-class compactness. As a result, the embeddings are not optimal for speaker recognition tasks. In this paper, to address this issue, three different margin based losses which not only separate classes but also demand a fixed margin between classes are introduced to deep speaker embedding learning. It could be demonstrated that the margin is the key to obtain more discriminative speaker embeddings. Experiments are conducted on two public text independent tasks: VoxCeleb1 and Speaker in The Wild (SITW). The proposed approach can achieve the state-of-the-art performance, with 25% ~ 30% equal error rate (EER) reduction on both tasks when compared to strong baselines using cross entropy loss with softmax, obtaining 2.238% EER on VoxCeleb1 test set and 2.761% EER on SITW core-core test set, respectively.
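
To make the idea of demanding a fixed margin between classes concrete, here is a minimal NumPy sketch of an additive-margin softmax loss. The abstract does not name its three losses, so this particular formulation, the function name, and the `margin` and `scale` values are assumptions for illustration.

```python
import numpy as np

def additive_margin_softmax_loss(embeddings, class_weights, labels,
                                 margin=0.2, scale=30.0):
    """Cross entropy on scaled cosine logits, with a margin on the target class."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = e @ w.T                                   # (batch, n_classes)
    rows = np.arange(len(labels))
    logits = scale * cos
    # Subtract the margin from the target-class cosine only.
    logits[rows, labels] = scale * (cos[rows, labels] - margin)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())
```

Setting `margin=0.0` recovers plain softmax cross entropy on normalized embeddings. A positive margin forces the target cosine to exceed the others by at least that amount before the loss becomes small, which encourages inter-class separability and intra-class compactness.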

Posted Content•

Recognizing long-form speech using streaming end-to-end models

Arun Narayanan¹, Rohit Prabhavalkar¹, Chung-Cheng Chiu¹, David Rybach¹, Tara N. Sainath¹, Trevor Strohman¹•Institutions (1)

Google¹

24 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: This work examines the ability of E2E models to generalize to unseen domains, and proposes two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances.

Abstract: All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 27% relative.

Posted Content•

Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings

Erica Cooper¹, Cheng-I Lai², Yusuke Yasuda¹, Fuming Fang¹, Xin Wang¹, Nanxin Chen³, Junichi Yamagishi¹•Institutions (3)

National Institute of Informatics¹, Massachusetts Institute of Technology², Johns Hopkins University³

23 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.

Abstract: While speaker adaptation for end-to-end speech synthesis using speaker embeddings can produce good speaker similarity for speakers seen during training, there remains a gap for zero-shot adaptation to unseen speakers. We investigate multi-speaker modeling for end-to-end text-to-speech synthesis and study the effects of different types of state-of-the-art neural speaker embeddings on speaker similarity for unseen speakers. Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task; these embeddings also improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.

Posted Content•

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Anjuli Kannan¹, Arindrima Datta¹, Tara N. Sainath¹, Eugene Weinstein¹, Bhuvana Ramabhadran², Yonghui Wu¹, Ankur Bapna¹, Zhifeng Chen¹, Seungji Lee¹•Institutions (2)

Google¹, IBM²

11 Sep 2019-arXiv: Audio and Speech Processing

TL;DR: This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in training data across languages.

Abstract: Multilingual end-to-end (E2E) models have shown great promise in expansion of automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in training data across languages. Using nine Indic languages, we compare a variety of techniques, and find that a combination of conditioning on a language vector and training language-specific adapter layers produces the best model. The resulting E2E multilingual model achieves a lower word error rate (WER) than both monolingual E2E models (eight of nine languages) and monolingual conventional systems (all nine languages).

Posted Content•

Integrating the Data Augmentation Scheme with Various Classifiers for Acoustic Scene Modeling.

Hangting Chen, Zuozhen Liu, Zongming Liu, Pengyuan Zhang, Yonghong Yan

15 Jul 2019-arXiv: Audio and Speech Processing

TL;DR: The IOA team's submission for TASK1A of DCASE2019 challenge adopts a data augmentation scheme employing generative adversary networks, and the final fusion systems A-D could achieve an accuracy higher than 85% on the officially provided fold 1 evaluation dataset.

Abstract: This technical report describes the IOA team's submission for TASK1A of DCASE2019 challenge. Our acoustic scene classification (ASC) system adopts a data augmentation scheme employing generative adversary networks. Two major classifiers, 1D deep convolutional neural network integrated with scalogram features and 2D fully convolutional neural network integrated with Mel filter bank features, are deployed in the scheme. Other approaches, such as adversary city adaptation, temporal module based on discrete cosine transform and hybrid architectures, have been developed for further fusion. The results of our experiments indicates that the final fusion systems A-D could achieve an accuracy higher than 85% on the officially provided fold 1 evaluation dataset.

Posted Content•

The Second DIHARD Diarization Challenge: Dataset, task, and baselines

Neville Ryant¹, Kenneth Church², Christopher Cieri¹, Alejandrina Cristia³, Jun Du⁴, Sriram Ganapathy⁵, Mark Liberman¹•Institutions (5)

University of Pennsylvania¹, Baidu², Centre national de la recherche scientifique³, University of Science and Technology of China⁴, Indian Institute of Science⁵

18 Jun 2019-arXiv: Audio and Speech Processing

TL;DR: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain.

Abstract: This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.

Posted Content•

MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Xuankai Chang¹, Wangyou Zhang¹, Yanmin Qian¹, Jonathan Le Roux², Shinji Watanabe³•Institutions (3)

Shanghai Jiao Tong University¹, Mitsubishi Electric Research Laboratories², Johns Hopkins University³

15 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: In this paper, a neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, was proposed to deal with multi-channel input and output.

Abstract: Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function.

Posted Content•

T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement

Jae-Young Kim¹, Mostafa El-Khamy¹, Jungwon Lee¹•Institutions (1)

Samsung¹

13 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: This paper proposes a Transformer with Gaussian-weighted self-attention (T-GSA), whose attention weights are attenuated according to the distance between target and context symbols, and shows that it significantly improves speech-enhancement performance compared to the Transformer and RNNs.

Abstract: Transformer neural networks (TNNs) have demonstrated state-of-the-art performance on many natural language processing (NLP) tasks, replacing recurrent neural networks (RNNs) such as LSTMs or GRUs. However, TNNs did not perform well in speech enhancement, whose contextual nature is different from that of NLP tasks such as machine translation. Self-attention is a core building block of the Transformer, which not only enables parallelization of sequence computation, but also provides the constant path length between symbols that is essential to learning long-range dependencies. In this paper, we propose a Transformer with Gaussian-weighted self-attention (T-GSA), whose attention weights are attenuated according to the distance between target and context symbols. The experimental results show that the proposed T-GSA has significantly improved speech-enhancement performance, compared to the Transformer and RNNs.
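
To make the idea of distance-attenuated attention concrete, here is a minimal sketch. It assumes one simple realization: a Gaussian penalty on the squared distance between positions is subtracted from the raw scores before the softmax. The exact weighting used in the paper may differ, and the function name, the `sigma` parameter, and the list-of-lists score matrix are all illustrative assumptions.

```python
import math


def gaussian_weighted_attention(scores, sigma):
    """Row-wise softmax over scores with a Gaussian distance penalty.

    scores[i][j] is the raw attention score of target position i on
    context position j; sigma controls how fast attention decays.
    """
    weights = []
    for i, row in enumerate(scores):
        # Subtracting (i - j)^2 / sigma^2 in the log domain multiplies
        # each softmax numerator by a Gaussian window centered on i.
        biased = [s - ((i - j) ** 2) / (sigma ** 2) for j, s in enumerate(row)]
        peak = max(biased)  # subtract the max for numerical stability
        exps = [math.exp(b - peak) for b in biased]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With identical raw scores, attention concentrates on nearby positions.
uniform_scores = [[0.0] * 3 for _ in range(3)]
for row in gaussian_weighted_attention(uniform_scores, sigma=1.0):
    print([round(w, 3) for w in row])
```

A large `sigma` recovers ordinary softmax attention, while a small `sigma` restricts each position to its immediate neighbors.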

Posted Content•

Non-intrusive speech quality assessment using neural networks

Anderson R. Avila¹, Hannes Gamper², Chandan K A Reddy², Ross Cutler², Ivan Tashev², Johannes Gehrke²•Institutions (2)

Institut national de la recherche scientifique¹, Microsoft²

16 Mar 2019-arXiv: Audio and Speech Processing

TL;DR: This work presents an investigation of the applicability of neural networks for non-intrusive audio quality assessment, and proposes three neural network-based approaches for mean opinion score (MOS) estimation.

Abstract: Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to three instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563, and the speech-to-reverberation energy ratio. Our evaluation uses a speech dataset contaminated with convolutive and additive noise, labeled using a crowd-based QoE evaluation, evaluated with Pearson correlation with MOS labels, and mean-squared-error of the estimated MOS. Our proposed approaches outperform the aforementioned instrumental measures, with a fully connected deep neural network using Mel-frequency features providing the best correlation (0.87) and the lowest mean squared error (0.15).
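
The two evaluation measures named above, Pearson correlation and mean squared error, can be computed as in the following sketch. The helper names and the toy MOS values are made up for illustration and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def mse(xs, ys):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


mos_labels = [1.0, 2.0, 3.0, 4.0]      # hypothetical crowd-sourced MOS
mos_estimates = [1.5, 2.5, 3.5, 4.5]   # hypothetical model outputs
print(pearson(mos_labels, mos_estimates), mse(mos_labels, mos_estimates))
```

In this toy case the estimates are perfectly correlated with the labels yet still carry a constant bias, which only the MSE reveals. That is one reason to report both numbers.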

Posted Content•

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Yi Ren¹, Xu Tan², Tao Qin², Sheng Zhao², Zhou Zhao¹, Tie-Yan Liu²•Institutions (2)

Zhejiang University¹, Microsoft²

13 May 2019-arXiv: Audio and Speech Processing

TL;DR: In this paper, a denoising auto-encoder is used to reconstruct the speech and text sequences respectively to develop the capability of language modeling both in the speech domain and the text domain.

Abstract: Text to speech (TTS) and automatic speech recognition (ASR) are two dual tasks in speech processing, and both achieve impressive performance thanks to recent advances in deep learning and large amounts of aligned speech and text data. However, the lack of aligned data poses a major practical problem for TTS and ASR on low-resource languages. In this paper, by leveraging the dual nature of the two tasks, we propose an almost unsupervised learning method that leverages only a few hundred paired data and extra unpaired data for TTS and ASR. Our method consists of the following components: (1) a denoising auto-encoder, which reconstructs speech and text sequences respectively to develop the capability of language modeling in both the speech and text domains; (2) dual transformation, where the TTS model transforms the text $y$ into speech $\hat{x}$, and the ASR model leverages the transformed pair $(\hat{x},y)$ for training, and vice versa, to boost the accuracy of the two tasks; (3) bidirectional sequence modeling, which addresses error propagation, especially in long speech and text sequences, when training with few paired data; (4) a unified model structure, which combines all the above components for TTS and ASR based on the Transformer model. Our method achieves 99.84% in terms of word level intelligible rate and 2.68 MOS for TTS, and 11.7% PER for ASR on the LJSpeech dataset, by leveraging only 200 paired speech and text data (about 20 minutes of audio), together with extra unpaired speech and text data.

Posted Content•

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining.

Wen-Chin Huang¹, Tomoki Hayashi¹, Yi-Chiao Wu¹, Hirokazu Kameoka², Tomoki Toda¹•Institutions (2)

Nagoya University¹, Nippon Telegraph and Telephone²

14 Dec 2019-arXiv: Audio and Speech Processing

TL;DR: Experimental results show that a simple yet effective pretraining technique, which transfers knowledge from learned TTS models that benefit from large-scale, easily accessible TTS corpora, can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.

Abstract: We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While seq2seq models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been successfully applied to VC, the use of the Transformer network, which has shown promising results in various speech processing tasks, has not yet been investigated. Nonetheless, their data-hungry property and the mispronunciation of converted speech make seq2seq models far from practical. To this end, we propose a simple yet effective pretraining technique to transfer knowledge from learned TTS models, which benefit from large-scale, easily accessible TTS corpora. VC models initialized with such pretrained model parameters are able to generate effective hidden representations for high-fidelity, highly intelligible converted speech. Experimental results show that such a pretraining scheme can facilitate data-efficient training and outperform an RNN-based seq2seq VC model in terms of intelligibility, naturalness, and similarity.

Posted Content•

QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions.

Samuel Kriman¹, Stanislav Beliaev², Boris Ginsburg², Jocelyn Huang², Oleksii Kuchaiev², Vitaly Lavrukhin², Ryan Leary², Jason Li², Yang Zhang²•Institutions (2)

University of Illinois at Urbana–Champaign¹, Nvidia²

22 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: In this paper, an end-to-end neural acoustic model for automatic speech recognition is proposed, which is composed of multiple blocks with residual connections between them, each block consists of one or more modules with 1D time-channel separable convolutional layers.

Abstract: We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets.
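
The claim about having fewer parameters follows from how a separable convolution factorizes its weights. The sketch below compares weight counts for a standard 1D convolution and for a time-channel separable one, understood here as a depthwise convolution over time followed by a pointwise (kernel size 1) convolution across channels. The kernel size and channel counts are illustrative values chosen for this example, and biases are ignored.

```python
def conv1d_weights(kernel, c_in, c_out):
    """Weights in a standard 1D convolution (biases ignored)."""
    return kernel * c_in * c_out


def separable_conv1d_weights(kernel, c_in, c_out):
    """Weights in a depthwise (per-channel, over time) convolution
    followed by a pointwise convolution that mixes channels."""
    depthwise = kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


kernel, channels = 33, 256  # illustrative layer sizes
print(conv1d_weights(kernel, channels, channels))            # 2162688
print(separable_conv1d_weights(kernel, channels, channels))  # 73984
```

For these sizes the separable layer needs roughly 29 times fewer weights. The saving grows with the kernel size, which makes wide temporal kernels affordable.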

Patent•

Self-Supervised Audio Representation Learning for Mobile Devices

Marco Tagliasacchi¹, Beat Gfeller¹, Felix de Chaumont Quitry¹, Dominik Roblek¹•Institutions (1)

Google¹

24 May 2019-arXiv: Audio and Speech Processing

TL;DR: The quality of the embeddings produced by the self-supervised learning models is evaluated, and it is shown that they can be re-used for a variety of downstream tasks, and for some tasks even approach the performance of fully supervised models of similar size.

Abstract: Systems and methods for training a machine-learned model are provided. A method can include obtaining an unlabeled audio signal, sampling the unlabeled audio signal to select one or more sampled slices, inputting the one or more sampled slices into a machine-learned model, receiving, as an output of the machine-learned model, one or more determined characteristics associated with the audio signal, determining a loss function for the machine-learned model based at least in part on a difference between the one or more determined characteristics and one or more corresponding ground truth characteristics of the audio signal, and training the machine-learned model from end to end based at least in part on the loss function. The one or more determined characteristics can include one or more reconstructed portions of the audio signal temporally adjacent to the one or more sampled slices or an estimated distance between two sampled slices.

Posted Content•

Semi-Supervised Speech Emotion Recognition with Ladder Networks

Srinivas Parthasarathy¹, Carlos Busso¹•Institutions (1)

University of Texas at Dallas¹

08 May 2019-arXiv: Audio and Speech Processing

TL;DR: The proposed approach to ladder networks for emotion recognition outperforms fully supervised single-task learning (STL) and MTL baselines, and is implemented with sentence-level or frame-level features, demonstrating the flexibility of the approach.

Abstract: Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming. Another approach is to increase the generalization of the models. An effective way to achieve this goal is by regularizing the models through multitask learning (MTL), where auxiliary tasks are learned along with the primary task. These methods often require the use of labeled data which is computationally expensive to collect for emotion recognition (gender, speaker identity, age or other emotional descriptors). This study proposes the use of ladder networks for emotion recognition, which utilizes an unsupervised auxiliary task. The primary task is a regression problem to predict emotional attributes. The auxiliary task is the reconstruction of intermediate feature representations using a denoising autoencoder. This auxiliary task does not require labels so it is possible to train the framework in a semi-supervised fashion with abundant unlabeled data from the target domain. This study shows that the proposed approach creates a powerful framework for SER, achieving better performance than fully supervised single-task learning (STL) and MTL baselines. The approach is implemented with several acoustic features, showing that ladder networks generalize significantly better in cross-corpus settings. Compared to the STL baselines, the proposed approach achieves relative gains in concordance correlation coefficient (CCC) between 3.0% and 3.5% for within corpus evaluations, and between 16.1% and 74.1% for cross corpus evaluations, highlighting the power of the architecture.
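
The concordance correlation coefficient (CCC) used to report the gains above rewards predictions that are both correlated with the labels and close to them in mean and scale. A minimal sketch using the standard definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), follows. The function name and toy values are assumptions made for illustration.

```python
def ccc(xs, ys):
    """Concordance correlation coefficient with population statistics."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return 2.0 * cov / (var_x + var_y + (mean_x - mean_y) ** 2)


labels = [1.0, 2.0, 3.0, 4.0]   # hypothetical emotional attribute scores
shifted = [1.5, 2.5, 3.5, 4.5]  # same shape as the labels, constant offset
print(ccc(labels, labels), round(ccc(labels, shifted), 4))
```

Identical sequences score 1, while a constant offset lowers the score even though the two sequences remain perfectly linearly related.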

Posted Content•

Adversarial Attacks on Spoofing Countermeasures of automatic speaker verification

Songxiang Liu¹, Haibin Wu², Hung-yi Lee², Helen Meng¹•Institutions (2)

The Chinese University of Hong Kong¹, National Taiwan University²

19 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: In this paper, the authors investigate the vulnerability of spoofing countermeasures for ASV under both white-box and black-box adversarial attacks with the fast gradient sign method (FGSM) and the projected gradient descent (PGD) method.

Abstract: High-performance spoofing countermeasure systems for automatic speaker verification (ASV) have been proposed in the ASVspoof 2019 challenge. However, the robustness of such systems under adversarial attacks has not been studied yet. In this paper, we investigate the vulnerability of spoofing countermeasures for ASV under both white-box and black-box adversarial attacks with the fast gradient sign method (FGSM) and the projected gradient descent (PGD) method. We implement high-performing countermeasure models in the ASVspoof 2019 challenge and conduct adversarial attacks on them. We compare the performance of black-box attacks across spoofing countermeasure models with different network architectures and different numbers of model parameters. The experimental results show that all implemented countermeasure models are vulnerable to FGSM and PGD attacks under the scenario of white-box attack. The more dangerous black-box attacks also prove to be effective by the experimental results.
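
For readers unfamiliar with FGSM, the sketch below shows the core step on a toy logistic-regression classifier: perturb the input by a small step `eps` in the direction of the sign of the loss gradient. The toy model, its weights, and all names are assumptions made for this example. The countermeasure models attacked in the paper are neural networks, and PGD would repeat a step like this several times with projection back into an allowed perturbation range.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(x, y, w, b, eps):
    """One FGSM step: x + eps * sign(d loss / d x).

    For logistic regression with cross-entropy loss, the gradient of
    the loss with respect to the input is (p - y) * w.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [1.0, -2.0], 0.0        # hypothetical model parameters
x, y = [0.5, -0.5], 1          # input correctly scored as class 1
x_adv = fgsm(x, y, w, b, eps=0.1)
print(predict(x, w, b), predict(x_adv, w, b))
```

The adversarial input lowers the model's confidence in the true class while changing each feature by at most `eps`.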

Posted Content•

Ensemble Models for Spoofing Detection in Automatic Speaker Verification

Bhusan Chettri¹, Daniel Stoller¹, Veronica Morfi¹, Marco A. Martínez Ramírez¹, Emmanouil Benetos¹, Bob L. Sturm²•Institutions (2)

Queen Mary University of London¹, Royal Institute of Technology²

09 Apr 2019-arXiv: Audio and Speech Processing

TL;DR: This work investigates why some models on the PA dataset strongly outperform others and finds that spoofed recordings in the dataset tend to have longer silences at the end than genuine ones.

Abstract: Detecting spoofing attempts of automatic speaker verification (ASV) systems is challenging, especially when using only one modeling approach. For robustness, we use both deep neural networks and traditional machine learning models and combine them as ensemble models through logistic regression. They are trained to detect logical access (LA) and physical access (PA) attacks on the dataset released as part of the ASV Spoofing and Countermeasures Challenge 2019. We propose dataset partitions that ensure different attack types are present during training and validation to improve system robustness. Our ensemble model outperforms all our single models and the baselines from the challenge for both attack types. We investigate why some models on the PA dataset strongly outperform others and find that spoofed recordings in the dataset tend to have longer silences at the end than genuine ones. By removing them, the PA task becomes much more challenging, with the tandem detection cost function (t-DCF) of our best single model rising from 0.1672 to 0.5018 and equal error rate (EER) increasing from 5.98% to 19.8% on the development set.

Posted Content•

A Framework for the Robust Evaluation of Sound Event Detection

Cagdas Bilen, Giacomo Ferroni, Francesco Tuveri, Juan Azcarreta, Sacha Krstulovic

18 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: A new framework for performance evaluation of polyphonic sound event detection (SED) systems is defined, which overcomes the limitations of the conventional collar-based event decisions, event F-scores and event error rates and introduces a definition of event detection that is more robust against labelling subjectivity.

Abstract: This work defines a new framework for performance evaluation of polyphonic sound event detection (SED) systems, which overcomes the limitations of the conventional collar-based event decisions, event F-scores and event error rates. The proposed framework introduces a definition of event detection that is more robust against labelling subjectivity. It also resorts to polyphonic receiver operating characteristic (ROC) curves to deliver more global insight into system performance than F1-scores, and proposes a reduction of these curves into a single polyphonic sound detection score (PSDS), which allows system comparison independently from operating points (OPs). The presented method also delivers better insight into data biases and classification stability across sound classes. Furthermore, it can be tuned to varying applications in order to match a variety of user experience requirements. The benefits of the proposed approach are demonstrated by re-evaluating the baseline and two of the top-performing systems from DCASE 2019 Task 4.

Posted Content•

End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

Yi Luo¹, Zhuo Chen², Nima Mesgarani¹, Takuya Yoshioka²•Institutions (2)

Columbia University¹, Microsoft²

30 Oct 2019-arXiv: Audio and Speech Processing

TL;DR: In this article, the authors proposed transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation.

Abstract: An important problem in ad-hoc microphone speech separation is how to guarantee the robustness of a system with respect to the locations and numbers of microphones. The former requires the system to be invariant to different indexing of the microphones with the same locations, while the latter requires the system to be able to process inputs with varying dimensions. Conventional optimization-based beamforming techniques satisfy these requirements by definition, while for deep learning-based end-to-end systems those constraints are not fully addressed. In this paper, we propose transform-average-concatenate (TAC), a simple design paradigm for channel permutation and number invariant multi-channel speech separation. Based on the filter-and-sum network (FaSNet), a recently proposed end-to-end time-domain beamforming system, we show how TAC significantly improves the separation performance across various numbers of microphones in noisy reverberant separation tasks with ad-hoc arrays. Moreover, we show that TAC also significantly improves the separation performance with fixed geometry array configuration, further proving the effectiveness of the proposed paradigm in the general problem of multi-microphone speech separation.
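
The name transform-average-concatenate describes three steps that can be sketched in a few lines. In the sketch below, `transform` and `merge` are hypothetical stand-ins for the learned layers of a real system, and each channel is a plain list of features. This is a simplified illustration of the paradigm as the abstract describes it, not the authors' implementation.

```python
def tac(channels, transform, merge):
    """Transform-average-concatenate over a list of per-channel features.

    transform: applied to every channel independently (shared weights).
    merge: applied to each channel's features concatenated with the
    cross-channel average.
    """
    transformed = [transform(ch) for ch in channels]
    n, dim = len(transformed), len(transformed[0])
    # Averaging over channels is invariant to channel order and count.
    average = [sum(t[d] for t in transformed) / n for d in range(dim)]
    return [merge(t + average) for t in transformed]


def double(features):
    """Toy stand-in for the learned per-channel transform."""
    return [2 * v for v in features]


def identity(features):
    """Toy stand-in for the learned merge layer."""
    return features


mics = [[1, 2], [3, 4]]
print(tac(mics, double, identity))
```

Because the only cross-channel operation is an average, reordering the microphones simply reorders the outputs, and the same function accepts any number of channels.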
