Convolutional neural networks (CNNs) are extensions of deep neural networks (DNNs) that serve as alternative acoustic models with state-of-the-art performance in speech recognition. In this paper, CNNs are used as acoustic models for speech activity detection (SAD) on data collected over noisy radio communication channels. When these SAD models are tested on audio recorded from radio channels not seen during training, performance degrades severely. We attribute this degradation to mismatches between the two-dimensional filters learnt in the initial CNN layers and the novel channel data. Using a small amount of supervised data from the novel channels, the filters can be adapted to provide significant improvements in SAD performance. In mismatched acoustic conditions, the adapted models provide significant improvements (about 10-25%) relative to conventional DNN-based SAD systems. These results illustrate that CNNs have a considerable advantage in fast adaptation for acoustic modeling in these settings.

Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions

In this paper we describe improvements to the IBM speech activity detection (SAD) system for the third phase of the DARPA RATS program. The progress during this final phase comes from jointly training convolutional and regular deep neural networks with rich time-frequency representations of speech. With these additions, the phase 3 system reduces the equal error rate (EER) significantly on both of the program's development sets (relative improvements of 20% on dev1 and 7% on dev2) compared to an earlier phase 2 system. For the final program evaluation, the newly developed system also performs well past the program target of 3% Pmiss at 1% Pfa with a performance of 1.2% Pmiss at 1% Pfa and 0.3% Pfa at 3% Pmiss.

The IBM speech activity detection system for the DARPA RATS program.
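The operating points quoted in the abstract above (Pmiss at a fixed Pfa, and the equal error rate) can be illustrated with a minimal sketch. It assumes each frame has a scalar speech score and that frames scoring at or above a threshold are labeled speech; the function names and the threshold-scanning approximation of the EER are illustrative choices, not the scoring procedure used in the RATS evaluation.

```python
def miss_and_false_alarm(speech_scores, nonspeech_scores, threshold):
    """Return (Pmiss, Pfa) when frames with score >= threshold count as speech."""
    p_miss = sum(s < threshold for s in speech_scores) / len(speech_scores)
    p_fa = sum(s >= threshold for s in nonspeech_scores) / len(nonspeech_scores)
    return p_miss, p_fa


def equal_error_rate(speech_scores, nonspeech_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    The threshold where |Pmiss - Pfa| is smallest is selected, and the
    mean of the two error rates at that threshold is returned.
    """
    candidates = sorted(set(speech_scores) | set(nonspeech_scores))
    best_gap, best_eer = None, None
    for threshold in candidates:
        p_miss, p_fa = miss_and_false_alarm(speech_scores, nonspeech_scores, threshold)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (p_miss + p_fa) / 2
    return best_eer


# Toy scores (hypothetical data, for illustration only).
speech = [0.9, 0.8, 0.7, 0.4]
nonspeech = [0.1, 0.2, 0.3, 0.6]
print(miss_and_false_alarm(speech, nonspeech, 0.5))
print(equal_error_rate(speech, nonspeech))
```

Sweeping the threshold trades misses against false alarms, which is why a system is reported either at a fixed Pfa (such as 1%) or at the point where the two rates coincide.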

Voice activity detection (VAD) is the task of predicting which parts of an utterance contain speech versus background noise. It is an important first step to determine which samples to send to the decoder and when to close the microphone. The long short-term memory neural network (LSTM) is a popular architecture for sequential modeling of acoustic signals, and has been successfully used in several VAD applications. However, it has been observed that LSTMs suffer from state saturation problems when the utterance is long (i.e., for voice dictation tasks), and thus require the LSTM state to be periodically reset. In this paper, we propose an alternative architecture that does not suffer from saturation problems by modeling temporal variations through a stateless dilated convolution neural network (CNN). The proposed architecture differs from conventional CNNs in three respects: it uses dilated causal convolution, gated activations, and residual connections. Results on a Google Voice Typing task show that the proposed architecture achieves a 14% relative FA improvement at an FR of 1% over state-of-the-art LSTMs for the VAD task. We also include detailed experiments investigating the factors that distinguish the proposed architecture from conventional convolution.

Temporal Modeling Using Dilated Convolution and Gating for Voice-Activity-Detection
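The three ingredients named in the abstract above (dilated causal convolution, gated activations, and residual connections) can be sketched for a single-channel sequence. This is a toy illustration under stated assumptions: the tanh-times-sigmoid gate follows the common WaveNet-style form, and the weights, function names, and single-channel setup are hypothetical rather than taken from the paper.

```python
import math


def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Samples before the start of the sequence are treated as zero, so
    y[t] never depends on inputs later than t (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def gated_residual_layer(x, w_filter, w_gate, dilation):
    """One layer: x + tanh(filter conv) * sigmoid(gate conv)."""
    f = causal_dilated_conv(x, w_filter, dilation)
    g = causal_dilated_conv(x, w_gate, dilation)
    return [
        xi + math.tanh(fi) * (1.0 / (1.0 + math.exp(-gi)))
        for xi, fi, gi in zip(x, f, g)
    ]


# An impulse shows which past samples a dilation-2, two-tap filter reaches.
impulse_response = causal_dilated_conv([1.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0], 2)
print(impulse_response)

# Stacking layers with dilations 1, 2, 4 widens the temporal context
# without any recurrent state that would need resetting.
signal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
hidden = signal
for d in (1, 2, 4):
    hidden = gated_residual_layer(hidden, [0.5, -0.3], [0.2, 0.1], d)
```

Because the output at each frame is a fixed function of a bounded window of past inputs, the layer is stateless, which is the property the abstract contrasts with LSTM state saturation.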

Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification

Reliable automatic detection of speech/non-speech activity in degraded, noisy audio signals is a fundamental and challenging task in robust signal processing. As various speech technology applications rely on the accuracy of a Voice Activity Detection (VAD) system for their effectiveness and robustness, the problem has gained considerable research interest over the years. It has been shown that in highly distorted conditions, an accurate segmentation of the target speech can be achieved by combining multiple feature streams. In this paper, we extract four one-dimensional streams, each attempting to separate speech from the disturbing background by exploiting a different speech-related characteristic, i.e. (i) the spectral shape, (ii) spectro-temporal modulations, (iii) the periodicity structure due to the presence of pitch harmonics, and (iv) the long-term spectral variability profile. The information from these streams is then expanded over long-duration context windows and applied to the input layer of a standard Multilayer Perceptron classifier. The proposed VAD was evaluated on the DARPA RATS corpora and is shown to be very competitive with current state-of-the-art systems.

A robust frontend for VAD: exploiting contextual, discriminative and spectral cues of human voice.
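The context expansion described in the abstract above, where per-frame stream values are stacked over a window of neighboring frames before reaching the classifier's input layer, can be sketched as follows. The window sizes, the edge handling (repeating the first or last frame), and the function name are assumptions made for illustration; the abstract does not specify them.

```python
def stack_context(frames, left, right):
    """Concatenate each frame with `left` past and `right` future frames.

    `frames` is a list of per-frame feature vectors (lists of floats).
    Indices outside the utterance are clamped, so the first and last
    frames are repeated at the edges (an assumed padding choice).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        row = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)
            row.extend(frames[idx])
        stacked.append(row)
    return stacked


# Four one-dimensional streams per frame (hypothetical values), expanded
# over two frames of context on each side: 4 * (2 + 1 + 2) = 20 inputs.
frames = [[0.1, 0.2, 0.3, 0.4] for _ in range(6)]
mlp_inputs = stack_context(frames, left=2, right=2)
print(len(mlp_inputs), len(mlp_inputs[0]))
```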

Speech activity detection (SAD) on channel transmissions is a critical preprocessing task for speech, speaker and language recognition or for further human analysis. This paper presents a feature combination approach to improve SAD on highly channel-degraded speech as part of the Defense Advanced Research Projects Agency’s (DARPA) Robust Automatic Transcription of Speech (RATS) program. The key contribution is the exploration of feature combinations of different novel SAD features, based on pitch and spectro-temporal processing, with the standard Mel Frequency Cepstral Coefficients (MFCC) acoustic feature. The SAD features are: (1) a Gabor feature representation followed by a multilayer perceptron (MLP); (2) a feature that combines multiple voicing features and spectral flux measures (Combo); (3) a feature based on subband autocorrelation (SAcC) with MLP postprocessing; and (4) a multiband comb-filter F0 (MBCombF0) voicing measure. We present single, pairwise, and all-feature combinations, show large error reductions from pairwise feature-level combination over the MFCC baseline, and show that the best performance is achieved by the combination of all features. Index Terms: speech detection, channel-degraded speech, robust voicing features

All for One: Feature Combination for Highly Channel-Degraded Speech Activity Detection
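A minimal sketch of the feature-level combination mentioned in the abstract above, assuming it means frame-wise concatenation of time-aligned feature streams (the usual reading, though the abstract does not spell it out). The stream names mirror the abstract's feature list, while the values, dimensions, and function names are hypothetical.

```python
from itertools import combinations


def concat_features(streams, names):
    """Concatenate the named streams frame by frame.

    `streams` maps a feature name to a list of per-frame vectors; all
    selected streams must have the same number of frames.
    """
    n_frames = len(streams[names[0]])
    if any(len(streams[name]) != n_frames for name in names):
        raise ValueError("streams must be time-aligned")
    return [
        [value for name in names for value in streams[name][t]]
        for t in range(n_frames)
    ]


def pairwise_combinations(streams):
    """Build one concatenated feature set per unordered pair of streams."""
    return {
        pair: concat_features(streams, list(pair))
        for pair in combinations(sorted(streams), 2)
    }


# Two frames of toy data for the baseline plus the four SAD features.
streams = {
    "MFCC": [[1.0, 2.0], [3.0, 4.0]],
    "Gabor": [[0.1], [0.2]],
    "Combo": [[0.5], [0.7]],
    "SAcC": [[0.3], [0.4]],
    "MBCombF0": [[0.9], [0.8]],
}
pairs = pairwise_combinations(streams)
all_features = concat_features(streams, sorted(streams))
print(len(pairs), len(all_features[0]))
```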

The goal of addressee detection is to answer the question, “Are you talking to me?” When a dialogue system interacts with multiple users, it is crucial to detect when a user is speaking to the system as opposed to another person. We study this problem in a multimodal scenario, using lexical, acoustic, visual, dialogue state, and beamforming information. Using data from a multiparty dialogue system, we quantify the benefits of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities, as well as of key individual features, in estimating the addressee. We find that energy-based acoustic features are by far the most important, that information from speech recognition and system state is useful as well, and that visual and beamforming features provide little additional benefit. While we find that head pose is affected by whom the speaker is addressing, it yields little nonredundant information due to the system acting as a situational attractor. Our findings would be relevant to multiparty, open-world dialogue systems in which the agent plays an active, conversational role, such as an interactive assistant deployed in a public, open space. For these scenarios, our study suggests that acoustic, lexical, and system-state information is an effective and practical combination of modalities to use for addressee detection. We also consider how our analyses might be affected by the ongoing development of more realistic, natural dialogue systems.

A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction

TED talks are the pinnacle of public speaking. They combine compelling content with flawless delivery, and their popularity is attested by the millions of views they attract. In this work, we compare the prosodic voice characteristics of TED speakers and university professors. Our aim is to identify the characteristics that separate TED speakers from other public speakers. Based on a simple set of features derived from pitch and energy, we train a discriminative classifier to predict whether a 5-minute audio sample is from a TED talk or a university lecture. We achieve an equal error rate below 10%. We then investigate which features are most discriminative, and discuss conflating factors that might contribute to those features.

Are you TED talk material? Comparing prosody in professors and TED speakers.

We have incorporated spectrotemporal features in a speech activity detection (SAD) task for the Speech in Noisy Environments 2 (SPINE2) data set. The features were generated by applying 2D Gabor filters to the mel spectrogram in order to measure the strength of various spectral and temporal modulation frequencies in different patches of the spectrogram. Using several different back-ends, the Gabor features significantly outperformed MFCCs, yielding relative reductions in equal error rate (EER) of between 40% and 50%. Compared to the other back-ends, AdaBoost with tree stumps performed particularly well with Gabor features and particularly poorly with MFCCs. An investigation into the reasons for this disparity suggests that the most useful features for SAD incorporate information over longer time scales.

Longer Features: They do a speech detector good.
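The Gabor feature extraction described in the abstract above can be sketched as a 2D kernel (a Gaussian envelope times a sinusoidal carrier) whose inner product with a mel-spectrogram patch measures the strength of one temporal and spectral modulation frequency. The real-valued cosine carrier, the parameter names, and the patch size are assumptions for illustration; the abstract does not give the filter bank's exact form.

```python
import math


def gabor_kernel(half_t, half_f, omega_t, omega_f, sigma_t, sigma_f):
    """Real 2D Gabor kernel indexed as kernel[frequency][time].

    omega_t and omega_f are modulation frequencies in cycles per frame
    and cycles per mel band; sigma_t and sigma_f set the envelope widths.
    """
    kernel = []
    for f in range(-half_f, half_f + 1):
        row = []
        for t in range(-half_t, half_t + 1):
            envelope = math.exp(
                -(t * t / (2 * sigma_t ** 2) + f * f / (2 * sigma_f ** 2))
            )
            carrier = math.cos(2 * math.pi * (omega_t * t + omega_f * f))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(kernel, patch):
    """Inner product of the kernel with an equally sized spectrogram patch."""
    return sum(
        k * p
        for kernel_row, patch_row in zip(kernel, patch)
        for k, p in zip(kernel_row, patch_row)
    )


# A purely temporal modulation filter: 7 frames wide, 5 mel bands tall.
kernel = gabor_kernel(half_t=3, half_f=2, omega_t=0.1, omega_f=0.0,
                      sigma_t=2.0, sigma_f=1.5)
# A patch matching the kernel's own pattern gives a positive response.
matched = filter_response(kernel, kernel)
```

Widening `half_t` and `sigma_t` lets a filter integrate over more frames, which connects to the abstract's observation that the most useful SAD features span longer time scales.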

Addressee detection answers the question, “Are you talking to me?” When multiple users interact with a dialogue system, it is important to know when a user is speaking to the computer and when he or she is speaking to another person. We approach this problem from a multimodal perspective, using lexical, acoustic, visual, dialogue-state, and beamforming information. Using data from a multiparty dialogue system, we demonstrate the benefit of using multiple modalities over using a single modality. We also assess the relative importance of the various modalities in predicting the addressee. In our experiments, we find that acoustic features are by far the most important, that ASR and system-state information are useful, and that visual and beamforming features provide little additional benefit. Our study suggests that acoustic, lexical, and system-state information are an effective, economical combination of modalities to use in addressee detection.

T. J. Tsai

Papers

All for One: Feature Combination for Highly Channel-Degraded Speech Activity Detection

A Study of Multimodal Addressee Detection in Human-Human-Computer Interaction

Are you TED talk material? Comparing prosody in professors and TED speakers.

Longer Features: They do a speech detector good.

Multimodal addressee detection in multiparty dialogue systems