
Showing papers in "IEEE Transactions on Audio, Speech, and Language Processing in 2020"


Journal ArticleDOI
TL;DR: This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
Abstract: Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs: https://github.com/qiuqiangkong/audioset_tagging_cnn .
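As a rough illustration of the log-mel input feature mentioned above (the sample rate, window, hop size, and mel-bin count below are assumptions for illustration, not necessarily the PANNs configuration, and the learned Wavegram branch is not shown), a minimal feature extractor could look like this:

```python
import numpy as np
import librosa

def logmel_feature(wav_path, sr=32000, n_fft=1024, hop_length=320, n_mels=64):
    """Compute a log-mel spectrogram as a CNN input feature.

    The sample rate, FFT size, hop length and mel-bin count are illustrative
    assumptions, not necessarily the settings used for PANNs.
    """
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    logmel = librosa.power_to_db(mel, ref=np.max)   # (n_mels, n_frames)
    return logmel.T.astype(np.float32)              # (n_frames, n_mels) for a CNN
```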

560 citations


Journal ArticleDOI
TL;DR: A gated convolutional recurrent network (GCRN) for complex spectral mapping is proposed, which amounts to a causal system for monaural speech enhancement and yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking.
Abstract: Phase is important for perceptual quality of speech. However, it seems intractable to directly estimate phase spectra through supervised learning due to their lack of spectrotemporal structure. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances magnitude and phase responses of speech. Inspired by multi-task learning, we propose a gated convolutional recurrent network (GCRN) for complex spectral mapping, which amounts to a causal system for monaural speech enhancement. Our experimental results suggest that the proposed GCRN substantially outperforms an existing convolutional neural network (CNN) for complex spectral mapping in terms of both objective speech intelligibility and quality. Moreover, the proposed approach yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking. We also find that complex spectral mapping with the proposed GCRN provides an effective phase estimate.
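A minimal sketch of how training pairs for complex spectral mapping can be formed, assuming 16 kHz audio and illustrative STFT settings (the GCRN architecture itself is not shown):

```python
import numpy as np
from scipy.signal import stft

def ri_training_pair(noisy, clean, fs=16000, nperseg=320, noverlap=160):
    """Build (input, target) pairs for complex spectral mapping.

    The model input is the real/imaginary (RI) spectrogram of noisy speech and
    the target is the RI spectrogram of clean speech, so magnitude and phase
    are enhanced jointly. STFT parameters are illustrative assumptions.
    """
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg, noverlap=noverlap)
    _, _, S = stft(clean, fs=fs, nperseg=nperseg, noverlap=noverlap)
    x = np.stack([Y.real, Y.imag], axis=0)   # network input, shape (2, F, T)
    t = np.stack([S.real, S.imag], axis=0)   # mapping target, shape (2, F, T)
    return x.astype(np.float32), t.astype(np.float32)
```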

237 citations


Journal ArticleDOI
TL;DR: A novel method of time-varying beamforming with estimated complex spectra for single- and multi-channel speech enhancement, where deep neural networks are used to predict the real and imaginary components of the direct-path signal from noisy and reverberant ones.
Abstract: This study proposes a complex spectral mapping approach for single- and multi-channel speech enhancement, where deep neural networks (DNNs) are used to predict the real and imaginary (RI) components of the direct-path signal from noisy and reverberant ones. The proposed system contains two DNNs. The first one performs single-channel complex spectral mapping. The estimated complex spectra are used to compute a minimum variance distortionless response (MVDR) beamformer. The RI components of beamforming results, which encode spatial information, are then combined with the RI components of the mixture to train the second DNN for multi-channel complex spectral mapping. With estimated complex spectra, we also propose a novel method of time-varying beamforming. State-of-the-art performance is obtained on the speech enhancement and recognition tasks of the CHiME-4 corpus. More specifically, our system obtains 6.82%, 3.19% and 1.99% word error rates (WER) respectively on the single-, two-, and six-microphone tasks of CHiME-4, significantly surpassing the current best results of 9.15%, 3.91% and 2.24% WER.
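A hedged sketch of the MVDR computation step, assuming speech and noise spatial covariance matrices have already been accumulated from DNN-estimated complex spectra; the steering-vector estimate and regularization below are generic choices, not necessarily those used in the paper:

```python
import numpy as np

def mvdr_weights(phi_s, phi_n, ref_mic=0):
    """Compute per-frequency MVDR weights from spatial covariance matrices.

    phi_s, phi_n: (F, M, M) speech and noise covariance estimates, e.g.
    accumulated from DNN-estimated complex spectra (illustrative, not the
    paper's exact estimator). The steering vector is taken as the principal
    eigenvector of phi_s, normalized to a reference microphone.
    """
    F, M, _ = phi_s.shape
    w = np.zeros((F, M), dtype=complex)
    for f in range(F):
        eigvals, eigvecs = np.linalg.eigh(phi_s[f])
        d = eigvecs[:, -1]                       # principal eigenvector
        d = d / d[ref_mic]                       # normalize to the reference mic
        num = np.linalg.solve(phi_n[f] + 1e-6 * np.eye(M), d)
        w[f] = num / (d.conj() @ num)            # w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
    return w  # apply as S_hat[f, t] = w[f].conj() @ Y[:, f, t]
```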

141 citations


Journal ArticleDOI
TL;DR: It was demonstrated that the NSF models generated waveforms at least 100 times faster than the authors' WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
Abstract: Neural waveform models have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. One of the best models, called WaveNet, uses an autoregressive (AR) approach to model the distribution of waveform sampling points, but it has to generate a waveform in a time-consuming sequential manner. Some new models that use inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner but require either a larger amount of training time or a complicated model architecture plus a blend of training criteria. As an alternative to AR and IAF-based frameworks, we propose a neural source-filter (NSF) waveform modeling framework that is straightforward to train and fast to generate waveforms. This framework requires three components to generate waveforms: a source module that generates a sine-based signal as excitation, a non-AR dilated-convolution-based filter module that transforms the excitation into a waveform, and a conditional module that pre-processes the input acoustic features for the source and filter modules. This framework minimizes spectral-amplitude distances for model training, which can be efficiently implemented using short-time Fourier transform routines. As an initial NSF study, we designed three NSF models under the proposed framework and compared them with WaveNet using our deep learning toolkit. It was demonstrated that the NSF models generated waveforms at least 100 times faster than our WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
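A minimal sketch of the sine-based source module idea, with illustrative constants (frame hop, amplitude, noise level) that are not taken from the paper; harmonic stacking and the conditional/filter modules are omitted:

```python
import numpy as np

def sine_excitation(f0, fs=16000, hop=80, noise_std=0.003):
    """Generate a sine-based excitation signal from a frame-level F0 contour.

    f0: array of frame-level F0 values in Hz (0 for unvoiced frames).
    Upsamples F0 to the waveform rate, integrates the instantaneous frequency
    to get phase, and uses Gaussian noise in unvoiced regions. The constants
    are illustrative and not taken from the NSF paper.
    """
    f0_up = np.repeat(f0, hop).astype(np.float64)         # sample-level F0
    phase = 2.0 * np.pi * np.cumsum(f0_up / fs)           # integrate frequency
    voiced = f0_up > 0
    excitation = np.where(voiced,
                          0.1 * np.sin(phase),
                          noise_std * np.random.randn(len(f0_up)))
    return excitation.astype(np.float32)
```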

104 citations


Journal ArticleDOI
TL;DR: A novel model is proposed to generate high-quality pseudo OOD samples that are akin to In-Domain (IND) input utterances, thereby improving the performance of OOD detection; the generated pseudo OOD samples are demonstrated to be effective in NLU.
Abstract: Natural Language Understanding (NLU) is a vital component of dialogue systems, and its ability to detect Out-of-Domain (OOD) inputs is critical in practical applications, since accepting OOD inputs that are unsupported by the current system may lead to catastrophic failure. However, most existing OOD detection methods rely heavily on manually labeled OOD samples and cannot take full advantage of unlabeled data. This limits the feasibility of these models in practical applications. In this paper, we propose a novel model to generate high-quality pseudo OOD samples that are akin to In-Domain (IND) input utterances and thereby improve the performance of OOD detection. To this end, an autoencoder is trained to map an input utterance into a latent code. Moreover, the codes of IND and OOD samples are trained to be indistinguishable by utilizing a generative adversarial network. To provide more supervision signals, an auxiliary classifier is introduced to regularize the generated OOD samples to have indistinguishable intent labels. Experiments show that the pseudo OOD samples generated by our model can be used to effectively improve OOD detection in NLU. Besides, we also demonstrate that the effectiveness of these pseudo OOD data can be further improved by efficiently utilizing unlabeled data.

100 citations


Journal ArticleDOI
TL;DR: The authors proposed a new sentence embedding method by dissecting BERT-based word models through geometric analysis of the space spanned by the word representation, which is called the SBERT-WK method.
Abstract: Sentence embedding is an important research topic in natural language processing (NLP) since it can transfer knowledge to downstream tasks. Meanwhile, a contextualized word representation, called BERT, achieves the state-of-the-art performance in quite a few NLP tasks. Yet, it is an open problem to generate a high quality sentence representation from BERT-based word models. It was shown in previous studies that different layers of BERT capture different linguistic properties. This allows us to fuse information across layers to find better sentence representations. In this work, we study the layer-wise pattern of the word representation of deep contextualized models. Then, we propose a new sentence embedding method by dissecting BERT-based word models through geometric analysis of the space spanned by the word representation. It is called the SBERT-WK method. No further training is required in SBERT-WK. We evaluate SBERT-WK on semantic textual similarity and downstream supervised tasks. Furthermore, ten sentence-level probing tasks are presented for detailed linguistic analysis. Experiments show that SBERT-WK achieves state-of-the-art performance. Our code is publicly available.
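A simplified illustration of fusing information across BERT layers into a sentence vector using the Hugging Face transformers API; this uses uniform layer averaging and mean pooling over tokens, not the SBERT-WK dissection and weighting procedure itself:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def layer_fused_sentence_embedding(sentence, model_name="bert-base-uncased"):
    """Fuse information across BERT layers into one sentence vector.

    This is a simplified illustration of layer-wise fusion (uniform layer
    averaging followed by mean pooling over tokens), not the SBERT-WK
    weighting scheme.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of (1, seq_len, hidden) for the embedding layer + each layer
    layers = torch.stack(outputs.hidden_states, dim=0)   # (L+1, 1, T, H)
    token_vectors = layers.mean(dim=0).squeeze(0)        # fuse layers, (T, H)
    return token_vectors.mean(dim=0)                     # pool tokens, (H,)
```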

98 citations


Journal ArticleDOI
TL;DR: The LOCAlization and Tracking Challenge (LOCATA) as discussed by the authors is an open-access framework for the objective evaluation and benchmarking of broad classes of algorithms for sound source localization and tracking.
Abstract: The ability to localize and track acoustic events is a fundamental prerequisite for equipping machines with the ability to be aware of and engage with humans in their surrounding environment. However, in realistic scenarios, audio signals are adversely affected by reverberation, noise, interference, and periods of speech inactivity. In dynamic scenarios, where the sources and microphone platforms may be moving, the signals are additionally affected by variations in the source-sensor geometries. In practice, approaches to sound source localization and tracking are often impeded by missing estimates of active sources, estimation errors, as well as false estimates. The aim of the LOCAlization and TrAcking (LOCATA) Challenge is to provide an open-access framework for the objective evaluation and benchmarking of broad classes of algorithms for sound source localization and tracking. This article provides a review of relevant localization and tracking algorithms and, within the context of the existing literature, a detailed evaluation and dissemination of the LOCATA submissions. The evaluation highlights achievements in the field and open challenges, and identifies potential future directions.

91 citations


Journal ArticleDOI
TL;DR: The proposed noise PSD tracker, called DeepMMSE, makes no assumptions about the characteristics of the noise or the speech, exhibits no tracking delay, and produces an accurate estimate that requires no bias correction; when employed in a speech enhancement framework, it outperforms state-of-the-art noise PSD trackers as well as multiple deep learning approaches to speech enhancement.
Abstract: An accurate noise power spectral density (PSD) tracker is an indispensable component of a single-channel speech enhancement system. Bayesian-motivated minimum mean-square error (MMSE)-based noise PSD estimators have been the most prominent in recent time. However, they lack the ability to track highly non-stationary noise sources due to current methods of a priori signal-to-noise ratio (SNR) estimation. This is caused by the underlying assumption that the noise signal changes at a slower rate than the speech signal. As a result, MMSE-based noise PSD trackers exhibit a large tracking delay and produce noise PSD estimates that require bias compensation. Motivated by this, we propose an MMSE-based noise PSD tracker that employs a temporal convolutional network (TCN) a priori SNR estimator. The proposed noise PSD tracker, called DeepMMSE, makes no assumptions about the characteristics of the noise or the speech, exhibits no tracking delay, and produces an accurate estimate that requires no bias correction. Our extensive experimental investigation shows that the proposed DeepMMSE method outperforms state-of-the-art noise PSD trackers and demonstrates the ability to track abrupt changes in the noise level. Furthermore, when employed in a speech enhancement framework, the proposed DeepMMSE method is able to outperform state-of-the-art noise PSD trackers, as well as multiple deep learning approaches to speech enhancement. Availability: DeepMMSE is available at: https://github.com/anicolson/DeepXi .
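As background for how an a priori SNR estimate can drive a noise PSD tracker, the sketch below uses the textbook MMSE estimate of the noise periodogram under complex-Gaussian assumptions with an assumed recursive smoothing factor; it is not claimed to be the exact DeepMMSE update, and in DeepMMSE the a priori SNR would come from the TCN-based estimator:

```python
import numpy as np

def mmse_noise_psd_update(noisy_periodogram, xi, prev_noise_psd, alpha=0.9):
    """One recursive update of a noise PSD estimate from an a priori SNR.

    Under zero-mean complex-Gaussian speech/noise assumptions, the MMSE
    estimate of the noise periodogram given the noisy periodogram |Y|^2 and
    a priori SNR xi is
        E[|N|^2 | Y] = (1/(1+xi))^2 * |Y|^2 + (xi/(1+xi)) * sigma_n^2.
    The recursive smoothing factor alpha is an illustrative assumption.
    """
    gain = 1.0 / (1.0 + xi)
    noise_periodogram = gain**2 * noisy_periodogram + (xi * gain) * prev_noise_psd
    return alpha * prev_noise_psd + (1.0 - alpha) * noise_periodogram
```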

88 citations


Journal ArticleDOI
TL;DR: In this paper, a sequence-to-sequence (seq2seq) voice conversion method using non-parallel training data is proposed, which preserves the linguistic representations of source utterances while replacing the speaker representations with the target ones.
Abstract: This article presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion is achieved by preserving the linguistic representations of source utterances while replacing the speaker representations with the target ones. Our model is built under the framework of encoder-decoder neural networks. A recognition encoder is designed to learn the disentangled linguistic representations with two strategies. First, phoneme transcriptions of training data are introduced to provide the references for learning linguistic representations of audio signals. Second, an adversarial training strategy is employed to further wipe out speaker information from the linguistic representations. Meanwhile, speaker representations are extracted from audio signals by a speaker encoder. The model parameters are estimated by two-stage training, including a pre-training stage using a multi-speaker dataset and a fine-tuning stage using the dataset of a specific conversion pair. Since both the recognition encoder and the decoder for recovering acoustic features are seq2seq neural networks, there are no constraints of frame alignment and frame-by-frame conversion in our proposed method. Experimental results showed that our method obtained higher similarity and naturalness than the best non-parallel voice conversion method in Voice Conversion Challenge 2018. Besides, the performance of our proposed method was close to that of the state-of-the-art parallel seq2seq voice conversion method.

87 citations


Journal ArticleDOI
TL;DR: These models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.
Abstract: This article investigates deep learning based single- and multi-channel speech dereverberation. For single-channel processing, we extend magnitude-domain masking and mapping based dereverberation to complex-domain mapping, where deep neural networks (DNNs) are trained to predict the real and imaginary (RI) components of the direct-path signal from reverberant (and noisy) ones. For multi-channel processing, we first compute a minimum variance distortionless response (MVDR) beamformer to cancel the direct-path signal, and then feed the RI components of the cancelled signal, which is expected to be a filtered version of non-target signals, as additional features to perform dereverberation. Trained on a large dataset of simulated room impulse responses, our models show excellent speech dereverberation and recognition performance on the test set of the REVERB challenge, consistently better than single- and multi-channel weighted prediction error (WPE) algorithms.

74 citations


Journal ArticleDOI
TL;DR: A tree-structured regional CNN-LSTM model consisting of two parts: regional CNN and LSTM to predict the VA ratings of texts is proposed, showing that the proposed method outperforms lexicon-, regression-, conventional NN and other structured NN methods proposed in previous studies.
Abstract: Dimensional sentiment analysis aims to recognize continuous numerical values in multiple dimensions such as the valence-arousal (VA) space. Compared to the categorical approach that focuses on sentiment classification such as binary classification (i.e., positive and negative), the dimensional approach can provide a more fine-grained sentiment analysis. This article proposes a tree-structured regional CNN-LSTM model consisting of two parts: regional CNN and LSTM to predict the VA ratings of texts. Unlike a conventional CNN which considers a whole text as input, the proposed regional CNN uses a part of the text as a region, dividing an input text into several regions such that the useful affective information in each region can be extracted and weighted according to their contribution to the VA prediction. Such regional information is sequentially integrated across regions using LSTM for VA prediction. By combining the regional CNN and LSTM, both local (regional) information within sentences and long-distance dependencies across sentences can be considered in the prediction process. To further improve performance, a region division strategy is proposed to discover task-relevant phrases and clauses to incorporate structured information into VA prediction. Experimental results on different corpora show that the proposed method outperforms lexicon-, regression-, conventional NN and other structured NN methods proposed in previous studies.
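A simplified PyTorch sketch of the regional CNN-LSTM idea: a 1-D CNN encodes each region of word embeddings, an LSTM integrates the region vectors, and a linear head predicts valence and arousal. Embedding size, channel counts, and the fixed region shape are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class RegionalCNNLSTM(nn.Module):
    """Simplified regional CNN-LSTM for valence-arousal regression."""

    def __init__(self, emb_dim=300, conv_ch=100, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, conv_ch, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_ch, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # valence and arousal

    def forward(self, regions):
        # regions: (batch, n_regions, region_len, emb_dim) of word embeddings
        b, r, t, d = regions.shape
        x = regions.view(b * r, t, d).transpose(1, 2)       # (b*r, emb_dim, t)
        x = torch.relu(self.conv(x)).max(dim=2).values      # max-pool over words
        x = x.view(b, r, -1)                                # region vectors
        _, (h, _) = self.lstm(x)                            # integrate regions in order
        return self.head(h[-1])                             # (batch, 2) VA ratings
```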

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra.
Abstract: Speaker extraction aims to mimic humans’ selective auditory attention by extracting a target speaker's voice from a multi-talker environment. It is common to perform the extraction in frequency-domain, and reconstruct the time-domain signal from the extracted magnitude and estimated phase spectra. However, such an approach is adversely affected by the inherent difficulty of phase estimation. Inspired by Conv-TasNet, we propose a time-domain speaker extraction network (SpEx) that converts the mixture speech into multi-scale embedding coefficients instead of decomposing the speech signal into magnitude and phase spectra. In this way, we avoid phase estimation. The SpEx network consists of four network components, namely speaker encoder , speech encoder , speaker extractor , and speech decoder . Specifically, the speech encoder converts the mixture speech into multi-scale embedding coefficients, the speaker encoder learns to represent the target speaker with a speaker embedding. The speaker extractor takes the multi-scale embedding coefficients and target speaker embedding as input and estimates a receptive mask. Finally, the speech decoder reconstructs the target speaker's speech from the masked embedding coefficients. We also propose a multi-task learning framework and a multi-scale embedding implementation. Experimental results show that the proposed SpEx achieves 37.3%, 37.7% and 15.0% relative improvements over the best baseline in terms of signal-to-distortion ratio (SDR), scale-invariant SDR (SI-SDR), and perceptual evaluation of speech quality (PESQ) under an open evaluation condition.

Journal ArticleDOI
TL;DR: Experimental results show the effectiveness of the proposed batch-wise variable-length training with online data augmentation and the LDE layer, which significantly outperforms the baseline methods.
Abstract: In this article, our recent efforts on directly modeling utterance-level aggregation for speaker and language recognition are summarized. First, an on-the-fly data loader for efficient network training is proposed. The data loader acts as a bridge between the full-length utterances and the network. It generates mini-batch samples on the fly, which allows batch-wise variable-length training and online data augmentation. Second, the traditional dictionary learning and Baum-Welch statistical accumulation mechanisms are applied to the network structure, and a learnable dictionary encoding (LDE) layer is introduced. This layer accumulates discriminative statistics from the variable-length input sequence and outputs a single fixed-dimensional utterance-level representation. Experiments were conducted on four different datasets, namely NIST LRE 2007, AP17-OLR, SITW, and NIST SRE 2016. Experimental results show the effectiveness of the proposed batch-wise variable-length training with online data augmentation and the LDE layer, which significantly outperforms the baseline methods.
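A simplified sketch of an LDE-style pooling layer: frame-level features are softly assigned to learnable dictionary components and the weighted residuals are aggregated into a fixed-size utterance representation. The exact parameterization (scaling, normalization) in the paper may differ:

```python
import torch
import torch.nn as nn

class LDEPooling(nn.Module):
    """Simplified learnable dictionary encoding (LDE) pooling layer."""

    def __init__(self, feat_dim=128, n_components=32):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(n_components, feat_dim))  # dictionary
        self.scale = nn.Parameter(torch.ones(n_components))          # per-component scale

    def forward(self, x):
        # x: (batch, n_frames, feat_dim) frame-level features
        resid = x.unsqueeze(2) - self.mu                     # (B, T, C, D) residuals
        dist = (resid ** 2).sum(dim=-1)                      # (B, T, C) squared distances
        w = torch.softmax(-self.scale * dist, dim=-1)        # soft assignment weights
        enc = (w.unsqueeze(-1) * resid).sum(dim=1)           # (B, C, D) aggregated residuals
        enc = enc / x.shape[1]                               # average over frames
        return enc.flatten(start_dim=1)                      # (B, C * D) utterance vector
```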

Journal ArticleDOI
TL;DR: In this paper, a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) was proposed as a general-purpose loss function for speech enhancement systems.
Abstract: Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of time-domain deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous over classical loss functions like MSE. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution, as in certain situations they can fail completely. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.
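A common formulation of SI-SDR as a time-domain training loss is sketched below (negated so that minimizing the loss maximizes SI-SDR); the epsilon constant is an assumption for numerical stability:

```python
import torch

def si_sdr_loss(estimate, target, eps=1e-8):
    """Negative scale-invariant SDR, usable as a time-domain training loss.

    estimate, target: (batch, n_samples) waveforms. Signals are zero-meaned,
    the target is rescaled by the optimal projection, and the ratio of target
    energy to residual energy is expressed in dB.
    """
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # optimal scaling of the target (projection of the estimate onto the target)
    dot = (estimate * target).sum(dim=-1, keepdim=True)
    energy = (target ** 2).sum(dim=-1, keepdim=True) + eps
    s_target = dot / energy * target
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(
        (s_target ** 2).sum(dim=-1) / ((e_noise ** 2).sum(dim=-1) + eps) + eps)
    return -si_sdr.mean()
```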

Journal ArticleDOI
TL;DR: In this article, the authors proposed the use of ladder networks for emotion recognition, which utilizes an unsupervised auxiliary task, which is the reconstruction of intermediate feature representations using a denoising autoencoder.
Abstract: Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. For example, systems that show superior performance on certain databases show poor performance when tested on other corpora. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming. Another approach is to increase the generalization of the models. An effective way to achieve this goal is by regularizing the models through multitask learning (MTL), where auxiliary tasks are learned along with the primary task. These methods often require labeled data that is expensive to collect for emotion recognition (gender, speaker identity, age or other emotional descriptors). This study proposes the use of ladder networks for emotion recognition, which utilizes an unsupervised auxiliary task. The primary task is a regression problem to predict emotional attributes. The auxiliary task is the reconstruction of intermediate feature representations using a denoising autoencoder. This auxiliary task does not require labels, so it is possible to train the framework in a semi-supervised fashion with abundant unlabeled data from the target domain. This study shows that the proposed approach creates a powerful framework for SER, achieving better performance than fully supervised single-task learning (STL) and MTL baselines. We implement the approach with sentence-level or frame-level features, demonstrating the flexibility of our approach. Additionally, the generalization of the ladder networks is evaluated in cross-corpus settings using sentence-level features, obtaining important improvements. Compared to the STL baselines, the proposed approach achieves relative gains in concordance correlation coefficient (CCC) between 3.0% and 3.5% for within-corpus evaluations, and between 16.1% and 74.1% for cross-corpus evaluations, highlighting the power of the architecture.

Journal ArticleDOI
TL;DR: The usefulness of images for entity-level sentiment detection in social media posts is explored, and an Entity-Sensitive Attention and Fusion Network (ESAFN) is proposed for this task, which can significantly outperform several highly competitive unimodal and multimodal methods.
Abstract: Entity-level (aka target-dependent) sentiment analysis of social media posts has recently attracted increasing attention, and its goal is to predict the sentiment orientations over individual target entities mentioned in users’ posts. Most existing approaches to this task primarily rely on the textual content, but fail to consider the other important data sources (e.g., images, videos, and user profiles), which can potentially enhance these text-based approaches. Motivated by this observation, we study entity-level multimodal sentiment classification in this article, and aim to explore the usefulness of images for entity-level sentiment detection in social media posts. Specifically, we propose an Entity-Sensitive Attention and Fusion Network (ESAFN) for this task. First, to capture the intra-modality dynamics, ESAFN leverages an effective attention mechanism to generate entity-sensitive textual representations, followed by aggregating them with a textual fusion layer. Next, ESAFN learns the entity-sensitive visual representation with an entity-oriented visual attention mechanism, followed by a gated mechanism to eliminate the noisy visual context. Moreover, to capture the inter-modality dynamics, ESAFN further fuses the textual and visual representations with a bilinear interaction layer. To evaluate the effectiveness of ESAFN, we manually annotate the sentiment orientation over each given entity based on two recently released multimodal NER datasets, and show that ESAFN can significantly outperform several highly competitive unimodal and multimodal methods.

Journal ArticleDOI
TL;DR: A filtered-x generalized maximum correntropy criterion (FxGMCC) algorithm is proposed that adopts the generalized Gaussian density (GGD) function as its kernel; the proposed algorithms are superior to most of the existing robust adaptive algorithms.
Abstract: As a robust nonlinear similarity measure, the maximum correntropy criterion (MCC) has been successfully applied to active noise control (ANC) for impulsive noise. The default kernel function of the filtered-x maximum correntropy criterion (FxMCC) algorithm is the Gaussian kernel, which is desirable in many cases because it is smooth and strictly positive-definite. However, it is not always the best choice. In this study, a filtered-x generalized maximum correntropy criterion (FxGMCC) algorithm is proposed, which adopts the generalized Gaussian density (GGD) function as its kernel. The FxGMCC algorithm is more robust in non-Gaussian environments, but it still adopts a single error norm, which exhibits a poor convergence rate and noise reduction performance. To surmount this problem, an improved FxGMCC (IFxGMCC) algorithm with a continuous mixed Lp-norm is proposed. Moreover, to make a trade-off between fast convergence rate and low steady-state misalignment, a convexly combined IFxGMCC (C-IFxGMCC) algorithm is further developed. The stability mechanism and computational complexity of the proposed algorithms are analyzed. Simulation results in the context of different impulsive noises as well as real noise signals verify that the proposed algorithms are superior to most of the existing robust adaptive algorithms.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed ConvS2S-VC, which uses a model with a fully convolutional architecture and enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks.
Abstract: This article proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is suitable for parallel computations using GPUs. It is also beneficial since it enables effective normalization techniques such as batch normalization to be used for all the hidden layers in the networks. Second, it achieves many-to-many conversion by simultaneously learning mappings among multiple speakers using only a single model instead of separately learning mappings between each speaker pair using a different model. This enables the model to fully utilize available training data collected from multiple speakers by capturing common latent features that can be shared across different speakers. Owing to this structure, our model works reasonably well even without source speaker information, thus making it able to handle any-to-many conversion tasks. Third, we introduce a mechanism, called conditional batch normalization, that switches batch normalization layers in accordance with the target speaker. This particular mechanism has been found to be extremely effective for our many-to-many conversion model. We conducted speaker identity conversion experiments and found that ConvS2S-VC obtained higher sound quality and speaker similarity than baseline methods. We also found from audio examples that it could perform well in various tasks including emotional expression conversion, electrolaryngeal speech enhancement, and English accent conversion.
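A generic sketch of conditional batch normalization, in which per-speaker scale and shift parameters are selected by the target speaker index after an affine-free normalization; this illustrates the mechanism rather than the paper's exact module:

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """Batch normalization whose scale/shift depend on the target speaker.

    Features are normalized with a shared, affine-free BatchNorm, then scaled
    and shifted with per-speaker embeddings selected by the speaker index.
    """
    def __init__(self, num_channels, num_speakers):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_channels, affine=False)
        self.gamma = nn.Embedding(num_speakers, num_channels)
        self.beta = nn.Embedding(num_speakers, num_channels)
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, speaker_id):
        # x: (batch, channels, time); speaker_id: (batch,) target speaker indices
        h = self.bn(x)
        g = self.gamma(speaker_id).unsqueeze(-1)   # (batch, channels, 1)
        b = self.beta(speaker_id).unsqueeze(-1)
        return g * h + b
```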

Journal ArticleDOI
TL;DR: In this paper, a convolutional neural network transformer (CNN-Transformer) was proposed for audio tagging and sound event detection, which outperformed the previous state-of-the-art.
Abstract: Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets are weakly labelled. That is, there are only audio tags for each audio clip without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming that without threshold optimization of 0.629, and a sound event detection F1 of 0.584, outperforming that without threshold optimization of 0.564.
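A minimal sketch of the second-stage idea of optimizing thresholds on a validation set, here as a per-class grid search maximizing F1; the paper's actual optimization procedure may differ:

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(val_probs, val_labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick one decision threshold per class by maximizing validation F1.

    val_probs, val_labels: (n_clips, n_classes) predicted probabilities and
    binary reference labels. This grid search illustrates threshold
    optimization as a second stage after training.
    """
    n_classes = val_probs.shape[1]
    thresholds = np.full(n_classes, 0.5)
    for c in range(n_classes):
        scores = [f1_score(val_labels[:, c], val_probs[:, c] >= t,
                           zero_division=0) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]
    return thresholds  # apply as predictions = probs >= thresholds
```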

Journal ArticleDOI
TL;DR: A comprehensive statistical convergence analysis of the FxLMS algorithm is presented without assuming a specific model for the reference signal; the covariance matrix of the augmented weight-error vector is evaluated using the vectorization operation, which makes the analysis easy to follow and suitable for arbitrary input distributions.
Abstract: The filtered-x least-mean-square (FxLMS) algorithm has been widely used for the active noise control. A fundamental analysis of the convergence behavior of the FxLMS algorithm, including the transient and steady-state performance, could provide some new insights into the algorithm and can be also helpful for its practical applications, e.g., the choice of the step size. Although many efforts have been devoted to the statistical analysis of the FxLMS algorithm, it was usually assumed that the reference signal is Gaussian or white. However, non-Gaussian and/or non-white processes could be very widespread in practice as well. Moreover, the step-size bound that guarantees both of the mean and mean-square stability of the FxLMS for an arbitrary reference signal and a general secondary path is still not available in the literature. To address these problems, this article presents a comprehensive statistical convergence analysis of the FxLMS algorithm without assuming a specific model for the reference signal. We formulate the mean weight behavior and the mean-square error (MSE) in terms of an augmented weight vector. The covariance matrix of the augmented weight-error vector is then evaluated using the vectorization operation, which makes the analysis easy to follow and suitable for arbitrary input distributions. The stability bound is derived based on the first-order and second-order moments analysis of the FxLMS. Computer simulations confirmed the effectiveness of the proposed theoretical model.
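For readers unfamiliar with the algorithm under analysis, a minimal single-channel FxLMS simulation is sketched below; the path models, filter length, and step size are illustrative assumptions, not values from the paper:

```python
import numpy as np

def fxlms_anc(x, primary_path, sec_path, sec_path_est, n_taps=64, mu=5e-4):
    """Minimal single-channel FxLMS active noise control simulation.

    x: reference noise signal. The disturbance at the error microphone is
    x filtered by primary_path; the controller output reaches the error
    microphone through sec_path; the reference is filtered through
    sec_path_est before the weight update (the "filtered-x" signal).
    """
    d = np.convolve(x, primary_path)[:len(x)]          # disturbance at the error mic
    x_filt = np.convolve(x, sec_path_est)[:len(x)]     # filtered reference x'(n)
    w = np.zeros(n_taps)                               # adaptive control filter
    x_buf = np.zeros(n_taps)
    xf_buf = np.zeros(n_taps)
    ybuf = np.zeros(len(sec_path))                     # controller output history
    e = np.zeros(len(x))
    for n in range(len(x)):
        x_buf = np.roll(x_buf, 1); x_buf[0] = x[n]
        xf_buf = np.roll(xf_buf, 1); xf_buf[0] = x_filt[n]
        y = w @ x_buf                                  # anti-noise sample
        ybuf = np.roll(ybuf, 1); ybuf[0] = y
        e[n] = d[n] - sec_path @ ybuf                  # residual at the error mic
        w = w + mu * e[n] * xf_buf                     # FxLMS weight update
    return w, e
```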

Journal ArticleDOI
TL;DR: The proposed knowledge guided capsule network (KGCapsAN) implements the routing method by an attention mechanism, and the results show that the proposed method yields state-of-the-art results.
Abstract: Aspect-based (aspect-level) sentiment analysis is an important task in fine-grained sentiment analysis, which aims to automatically infer the sentiment towards an aspect in its context. Previous studies have shown that utilizing the attention-based method can effectively improve the accuracy of aspect-based sentiment analysis. Despite the outstanding progress, aspect-based sentiment analysis in the real world still faces several challenges. (1) The current attention-based method may cause a given aspect to incorrectly focus on syntactically unrelated words. (2) Conventional methods fail to identify the sentiment of special sentence structures, such as double negatives. (3) Most of the studies leverage only one vector to represent context and target. However, utilizing one vector to represent the sentence is limited, as natural languages are delicate and complex. In this paper, we propose a knowledge guided capsule network (KGCapsAN), which can address the above deficiencies. Our method is composed of two parts, a Bi-LSTM network and a capsule attention network. The capsule attention network implements the routing method by an attention mechanism. Moreover, we utilize two kinds of prior knowledge, syntactical and n-gram structures, to guide the capsule attention process. Extensive experiments are conducted on six datasets, and the results show that the proposed method yields state-of-the-art results.

Journal ArticleDOI
TL;DR: The mean and mean-square behaviors of the M-estimate based normalized subband adaptive filter algorithm (M-NSAF) with robustness against impulsive noise are studied and the stability condition, transient and steady-state results of the algorithm are formulated analytically.
Abstract: This article studies the mean and mean-square behaviors of the M-estimate based normalized subband adaptive filter algorithm (M-NSAF) with robustness against impulsive noise. Based on the contaminated-Gaussian noise model, the stability condition, transient and steady-state results of the algorithm are formulated analytically. These analysis results help us to better understand the M-NSAF performance in impulsive noise. To further obtain fast convergence and low steady-state estimation error, we derive a variable step size (VSS) M-NSAF algorithm. This VSS scheme is also generalized to the proportionate M-NSAF variant for sparse systems. Computer simulations on the system identification in impulsive noise and the acoustic echo cancellation with double-talk are performed to demonstrate our theoretical analysis and the effectiveness of the proposed algorithms.

Journal ArticleDOI
TL;DR: Systematic evaluations and comparisons on the NIST SRE 2010 retransmitted corpus show that both monaural and multi-channel speech enhancement significantly improve x-vector performance, and the covariance matrix estimate is effective for the MVDR beamformer.
Abstract: Deep neural network (DNN) embeddings for speaker recognition have recently attracted much attention. Compared to i-vectors, they are more robust to noise and room reverberation as DNNs leverage large-scale training. This article addresses the question of whether speech enhancement approaches are still useful when DNN embeddings are used for speaker recognition. We investigate single- and multi-channel speech enhancement for text-independent speaker verification based on x-vectors in conditions where strong diffuse noise and reverberation are both present. Single-channel (monaural) speech enhancement is based on complex spectral mapping and is applied to individual microphones. We use a masking-based minimum variance distortionless response (MVDR) beamformer and its rank-1 approximation for multi-channel speech enhancement. We propose a novel method of deriving time-frequency masks from the estimated complex spectrogram. In addition, we investigate gammatone frequency cepstral coefficients (GFCCs) as robust speaker features. Systematic evaluations and comparisons on the NIST SRE 2010 retransmitted corpus show that both monaural and multi-channel speech enhancement significantly improve x-vector performance, and our covariance matrix estimate is effective for the MVDR beamformer.

Journal ArticleDOI
TL;DR: A new theory of differential beamformers with uniform linear arrays is proposed, which shows clearly the connection between the conventional differential beamforming and the null-constrained differential beamforming methods.
Abstract: This article presents a theoretical study of differential beamforming with uniform linear arrays. By defining a forward spatial difference operator, any order of the spatial difference of the observed signals can be represented as a product of a difference operator matrix and the microphone array observations. Consequently, differential beamforming is implemented in two stages, where the first one obtains spatial difference of the observations and the second stage optimizes the beamformer. The major contributions of this article are as follows. First, we propose a new theory of differential beamforming with uniform linear arrays, which shows clearly the connection between the conventional differential beamforming and the null-constrained differential beamforming methods. This provides some new insight into the design of differential beamformers. Second, we deduce some new differential beamformers, where conventional beamforming may be seen as a particular case. Specifically, we derive the maximum white noise gain (MWNG), maximum directivity factor (MDF), parameterized MDF, and parameterized maximum front-to-back ratio differential beamformers. Third, we further extend the idea of how to design optimal differential beamformers by combining both the observed signals and their spatial differences.
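The two-stage structure described above can be written compactly. The first-order forward spatial difference operator below is a generic form for an M-microphone uniform linear array; the paper's exact definition (for example, any normalization or phase compensation) may differ:

```latex
% Illustrative first-order forward spatial difference of the observation
% vector y(omega) = [Y_1(omega), ..., Y_M(omega)]^T for an M-element ULA.
\[
\mathbf{D}_1 =
\begin{bmatrix}
-1 & 1 & 0 & \cdots & 0 \\
0 & -1 & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & -1 & 1
\end{bmatrix}
\in \mathbb{R}^{(M-1)\times M},
\qquad
\mathbf{y}^{(1)}(\omega) = \mathbf{D}_1\,\mathbf{y}(\omega).
\]
% Stage 1: take the n-th order spatial difference of the observations;
% Stage 2: apply a beamformer h(omega) to the differenced signals.
\[
\mathbf{y}^{(n)}(\omega) = \mathbf{D}_n \mathbf{D}_{n-1}\cdots \mathbf{D}_1\,\mathbf{y}(\omega),
\qquad
Z(\omega) = \mathbf{h}^{H}(\omega)\,\mathbf{y}^{(n)}(\omega).
\]
```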

Journal ArticleDOI
TL;DR: This article describes the first work to provide the full-stack online solution for CTC/attention end-to-end ASR architecture, and proposes stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention and a dynamic waiting joint decoding algorithm to dynamically collect the predictions of CTC and attention in an online manner.
Abstract: Recently, there has been increasing progress in end-to-end automatic speech recognition (ASR) architecture, which transcribes speech to text without any pre-trained alignments. One popular end-to-end approach is the hybrid Connectionist Temporal Classification (CTC) and attention (CTC/attention) based ASR architecture, which utilizes the advantages of both CTC and attention. The hybrid CTC/attention ASR systems exhibit performance comparable to that of the conventional deep neural network (DNN)/ hidden Markov model (HMM) ASR systems. However, how to deploy hybrid CTC/attention systems for online speech recognition is still a non-trivial problem. This article describes our proposed online hybrid CTC/attention end-to-end ASR architecture, which replaces all the offline components of conventional CTC/attention ASR architecture with their corresponding streaming components. Firstly, we propose stable monotonic chunk-wise attention (sMoChA) to stream the conventional global attention, and further propose monotonic truncated attention (MTA) to simplify sMoChA and solve the training-and-decoding mismatch problem of sMoChA. Secondly, we propose truncated CTC (T-CTC) prefix score to stream CTC prefix score calculation. Thirdly, we design dynamic waiting joint decoding (DWJD) algorithm to dynamically collect the predictions of CTC and attention in an online manner. Finally, we use latency-controlled bidirectional long short-term memory (LC-BLSTM) to stream the widely-used offline bidirectional encoder network. Experiments with LibriSpeech English and HKUST Mandarin tasks demonstrate that, compared with the offline CTC/attention model, our proposed online CTC/attention model improves the real time factor in human-computer interaction services and maintains its performance with moderate degradation. To the best of our knowledge, this is the first work to provide the full-stack online solution for CTC/attention end-to-end ASR architecture.

Journal ArticleDOI
TL;DR: A monaural (single-channel) speech dereverberation algorithm using temporal convolutional networks with self attention to combat the effect of reverberation, which improves objective metrics of speech quality in a wide range of reverberant conditions.
Abstract: In daily listening environments, human speech is often degraded by room reverberation, especially under highly reverberant conditions. Such degradation poses a challenge for many speech processing systems, where the performance becomes much worse than in anechoic environments. To combat the effect of reverberation, we propose a monaural (single-channel) speech dereverberation algorithm using temporal convolutional networks with self attention. Specifically, the proposed system includes a self-attention module to produce dynamic representations given input features, a temporal convolutional network to learn a nonlinear mapping from such representations to the magnitude spectrum of anechoic speech, and a one-dimensional (1-D) convolution module to smooth the enhanced magnitude among adjacent frames. Systematic evaluations demonstrate that the proposed algorithm improves objective metrics of speech quality in a wide range of reverberant conditions. In addition, it generalizes well to untrained reverberation times, room sizes, measured room impulse responses, real-world recorded noisy-reverberant speech, and different speakers.

Journal ArticleDOI
TL;DR: A more complete analysis of the Diarization system is presented; the inference of the model is fully described and derivations of all update formulas are provided for a complete understanding of the algorithm.
Abstract: In our previous work, we introduced our Bayesian Hidden Markov Model with eigenvoice priors, which has been recently recognized as the state-of-the-art model for Speaker Diarization. In this article we present a more complete analysis of the Diarization system. The inference of the model is fully described and derivations of all update formulas are provided for a complete understanding of the algorithm. An extensive analysis of the effect, sensitivity and interactions of all model parameters is provided, which might be used as a guide for their optimal setting. The newly introduced speaker regularization coefficient allows us to control the number of speakers inferred in an utterance. A naive speaker model merging strategy is also presented, which allows the variational inference to be driven out of local optima. Experiments for the different diarization scenarios are presented on the CALLHOME and DIHARD datasets.

Journal ArticleDOI
TL;DR: The proposed techniques to address cross-corpus generalization, namely channel normalization, a better training corpus, and a smaller frame shift in the short-time Fourier transform (STFT), together significantly improve objective intelligibility and quality scores on untrained corpora.
Abstract: In recent years, supervised approaches using deep neural networks (DNNs) have become the mainstream for speech enhancement. It has been established that DNNs generalize well to untrained noises and speakers if trained using a large number of noises and speakers. However, we find that DNNs fail to generalize to new speech corpora in low signal-to-noise ratio (SNR) conditions. In this work, we establish that the lack of generalization is mainly due to the channel mismatch, i.e. different recording conditions between the trained and untrained corpus. Additionally, we observe that traditional channel normalization techniques are not effective in improving cross-corpus generalization. Further, we evaluate publicly available datasets that are promising for generalization. We find one particular corpus to be significantly better than others. Finally, we find that using a smaller frame shift in short-time processing of speech can significantly improve cross-corpus generalization. The proposed techniques to address cross-corpus generalization include channel normalization, better training corpus, and smaller frame shift in short-time Fourier transform (STFT). These techniques together improve the objective intelligibility and quality scores on untrained corpora significantly.
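To make the frame-shift point concrete, the sketch below computes an STFT with a hop smaller than the usual half-window shift; the 20 ms window and 5 ms shift are illustrative values, not necessarily those used in the paper:

```python
import librosa

def stft_with_small_shift(audio, sr=16000, win_ms=20, shift_ms=5):
    """STFT with a frame shift smaller than the usual half-window hop.

    The 20 ms window with a 5 ms shift (75% overlap) is an illustrative
    choice for the idea of reducing the frame shift in short-time processing.
    """
    n_fft = int(sr * win_ms / 1000)      # e.g. 320 samples at 16 kHz
    hop = int(sr * shift_ms / 1000)      # e.g. 80 samples at 16 kHz
    return librosa.stft(audio, n_fft=n_fft, hop_length=hop, win_length=n_fft)
```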

Journal ArticleDOI
TL;DR: The experimental results confirm the outstanding denoising capability of the proposed SE systems on the three tasks and the benefits of using the residual architecture on the overall SE performance.
Abstract: In recent years, waveform-mapping-based speech enhancement (SE) methods have garnered significant attention. These methods generally use a deep learning model to directly process and reconstruct speech waveforms. Because both the input and output are in waveform format, the waveform-mapping-based SE methods can overcome the distortion caused by imperfect phase estimation, which may be encountered in spectral-mapping-based SE systems. So far, most waveform-mapping-based SE methods have focused on single-channel tasks. In this article, we propose a novel fully convolutional network (FCN) with Sinc and dilated convolutional layers (termed SDFCN) for multichannel SE that operates in the time domain. We also propose an extended version of SDFCN, called the residual SDFCN (termed rSDFCN). The proposed methods are evaluated on three multichannel SE tasks, namely the dual-channel inner-ear microphones SE task, the distributed microphones SE task, and the CHiME-3 dataset. The experimental results confirm the outstanding denoising capability of the proposed SE systems on the three tasks and the benefits of using the residual architecture on the overall SE performance.

Journal ArticleDOI
TL;DR: This article develops a conditional VAE (CVAE) where the audio speech generative process is conditioned on visual information of the lip region, and it improves the speech enhancement performance compared with the audio-only VAE model.
Abstract: Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. One advantage of this generative approach is that it does not require pairs of clean and noisy speech signals at training. In this article, we propose audio-visual variants of VAEs for single-channel and speaker-independent speech enhancement. We develop a conditional VAE (CVAE) where the audio speech generative process is conditioned on visual information of the lip region. At test time, the audio-visual speech generative model is combined with a noise model based on nonnegative matrix factorization, and speech enhancement relies on a Monte Carlo expectation-maximization algorithm. Experiments are conducted with the recently published NTCD-TIMIT dataset as well as the GRID corpus. The results confirm that the proposed audio-visual CVAE effectively fuses audio and visual information, and it improves the speech enhancement performance compared with the audio-only VAE model, especially when the speech signal is highly corrupted by noise. We also show that the proposed unsupervised audio-visual speech enhancement approach outperforms a state-of-the-art supervised deep learning method.