
Showing papers on "Voice activity detection" published in 2016


Proceedings ArticleDOI
20 Mar 2016
TL;DR: Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers, is presented.
Abstract: We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters, and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.

2,279 citations
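
As an illustration of the listener/speller structure described in the abstract above, the following PyTorch sketch shows a pyramidal bidirectional-LSTM encoder layer and a simple attention-based character decoder. It is not the authors' implementation; the layer sizes, the 2x time reduction per pyramid layer, and the dot-product attention are illustrative assumptions.

```python
# Minimal sketch of a pyramidal encoder layer and an attention decoder (assumed shapes).
import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    """One pyramid layer: concatenate every two frames, then a BiLSTM (halves the time axis)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, x):                                   # x: (batch, T, input_dim)
        b, t, d = x.shape
        x = x[:, : t - (t % 2)].reshape(b, t // 2, 2 * d)   # stack frame pairs
        out, _ = self.blstm(x)                              # (batch, T/2, 2*hidden_dim)
        return out

class Speller(nn.Module):
    """Attention decoder that emits one character per step, conditioned on the encoder output."""
    def __init__(self, enc_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.LSTMCell(hidden_dim + enc_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, enc_dim)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, enc, chars):                          # enc: (B, T, enc_dim), chars: (B, L)
        b = enc.size(0)
        h = enc.new_zeros(b, self.rnn.hidden_size)
        c = enc.new_zeros(b, self.rnn.hidden_size)
        context = enc.mean(dim=1)                           # simple initial context
        logits = []
        for step in range(chars.size(1)):
            inp = torch.cat([self.embed(chars[:, step]), context], dim=-1)
            h, c = self.rnn(inp, (h, c))
            att = torch.softmax(torch.bmm(enc, self.query(h).unsqueeze(2)).squeeze(2), dim=1)
            context = torch.bmm(att.unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                   # (B, L, vocab) character logits
```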


Proceedings ArticleDOI
01 Jun 2016
TL;DR: A list of criteria founded in critical race theory is provided and used to annotate a publicly available corpus of more than 16k tweets, together with a dictionary based on the most indicative words in the data.
Abstract: Hate speech in the form of racist and sexist remarks is a common occurrence on social media. For that reason, many social media services address the problem of identifying hate speech, but definitions of hate speech vary markedly and identification is largely a manual effort (BBC, 2015; Lomas, 2015). We provide a list of criteria founded in critical race theory, and use them to annotate a publicly available corpus of more than 16k tweets. We analyze the impact of various extra-linguistic features in conjunction with character n-grams for hate speech detection. We also present a dictionary based on the most indicative words in our data.

1,368 citations
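
The character n-gram setup mentioned above can be reproduced in spirit with a few lines of scikit-learn. The sketch below is not the authors' code; the toy tweets, labels, and the logistic-regression classifier are placeholders and assumptions.

```python
# Character n-gram text classification sketch (toy data, assumed classifier choice).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["example tweet one", "another example tweet"]       # placeholder data
labels = [0, 1]                                                # 0 = none, 1 = hate speech

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),  # character n-gram features
    LogisticRegression(max_iter=1000),
)
clf.fit(tweets, labels)
print(clf.predict(["a new unseen tweet"]))
```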


Journal ArticleDOI
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech, and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing. Key words: speech analysis, speech synthesis, vocoder, sound quality, real-time processing

1,025 citations
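
For readers who want to try WORLD-style analysis, manipulation and synthesis, the sketch below uses the open-source pyworld Python bindings (assumed installed; 'input.wav' is a placeholder path). It decomposes speech into F0, spectral envelope and aperiodicity, then resynthesizes with a modified pitch.

```python
# WORLD-style analysis/manipulation/synthesis via pyworld (placeholder file names).
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("input.wav")              # mono waveform, placeholder path
x = x.astype(np.float64)                  # pyworld expects float64

f0, t = pw.dio(x, fs)                     # coarse F0 estimation
f0 = pw.stonemask(x, f0, t, fs)           # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)          # spectral envelope
ap = pw.d4c(x, f0, t, fs)                 # aperiodicity

y = pw.synthesize(f0 * 1.2, sp, ap, fs)   # e.g. raise pitch by 20% before synthesis
sf.write("output.wav", y, fs)
```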


Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

699 citations
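
The training target described above, the ideal ratio mask defined in the complex domain, can be written out compactly: if Y is the noisy STFT and S the clean STFT, the mask M satisfies S = M * Y, so M = Y*S / |Y|^2 split into real and imaginary parts. A small numpy sketch, assuming the STFTs are already computed:

```python
# Complex ideal ratio mask (cIRM) construction and application (STFTs assumed given).
import numpy as np

def complex_ideal_ratio_mask(Y, S, eps=1e-8):
    """Y: noisy STFT, S: clean STFT (complex arrays of the same shape)."""
    denom = Y.real ** 2 + Y.imag ** 2 + eps
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real, m_imag          # real/imaginary targets a DNN could be trained to estimate

def apply_mask(Y, m_real, m_imag):
    """Reconstruct an estimate of the clean STFT from the (estimated) mask."""
    return (m_real + 1j * m_imag) * Y
```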


Proceedings ArticleDOI
01 Nov 2016
TL;DR: It is found that amateur annotators are more likely than expert annotators to label items as hate speech, and that systems trained on expert annotations outperform systems trained on amateur annotations.
Abstract: Hate speech in the form of racism and sexism is commonplace on the internet (Waseem and Hovy, 2016). For this reason, there has been both an academic and an industry interest in detection of hate speech. The volume of data to be reviewed for creating data sets encourages a use of crowd sourcing for the annotation efforts. In this paper, we provide an examination of the influence of annotator knowledge of hate speech on classification models by comparing classification results obtained from training on expert and amateur annotations. We provide an evaluation on our own data set and run our models on the data set released by Waseem and Hovy (2016). We find that amateur annotators are more likely than expert annotators to label items as hate speech, and that systems trained on expert annotations outperform systems trained on amateur annotations.

487 citations


Proceedings ArticleDOI
Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, Helen Meng
11 Jul 2016
TL;DR: This paper proposes a novel approach to voice conversion with non-parallel training data to bridge between speakers by means of Phonetic PosteriorGrams obtained from a speaker-independent automatic speech recognition system.
Abstract: This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system. It is assumed that these PPGs can represent articulation of speech sounds in a speaker-normalized space and correspond to spoken content speaker-independently. The proposed approach first obtains PPGs of target speech. Then, a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) structure is used to model the relationships between the PPGs and acoustic features of the target speech. To convert arbitrary source speech, we obtain its PPGs from the same SI-ASR and feed them into the trained DBLSTM for generating converted speech. Our approach has two main advantages: 1) no parallel training data is required; 2) a trained model can be applied to any other source speaker for a fixed target speaker (i.e., many-to-one conversion). Experiments show that our approach performs equally well or better than state-of-the-art systems in both speech quality and speaker similarity.

296 citations
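
A minimal PyTorch sketch of the conversion model described above: a deep bidirectional LSTM mapping phonetic posteriorgram frames (from an SI-ASR) to the target speaker's acoustic features. All dimensions and the MSE training objective shown are illustrative assumptions, not the paper's exact configuration.

```python
# PPG-to-acoustic mapping with a deep bidirectional LSTM (illustrative sizes).
import torch
import torch.nn as nn

class PPGToAcoustic(nn.Module):
    def __init__(self, ppg_dim=144, acoustic_dim=60, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, acoustic_dim)

    def forward(self, ppg):                  # ppg: (batch, frames, ppg_dim)
        h, _ = self.blstm(ppg)
        return self.proj(h)                  # (batch, frames, acoustic_dim)

model = PPGToAcoustic()
loss_fn = nn.MSELoss()
# Training pairs: PPGs of target speech -> acoustic features of the same target speech
# (dummy tensors stand in for real features here).
ppg = torch.randn(8, 200, 144)
target = torch.randn(8, 200, 60)
loss = loss_fn(model(ppg), target)
loss.backward()
```

At conversion time the same SI-ASR extracts PPGs from any source speaker, which makes the trained mapping many-to-one by construction.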


DOI
22 Sep 2016
TL;DR: The authors collected potentially hateful messages and asked two groups of internet users to determine whether they were hate speech or not, whether they should be banned or not and to rate their degree of offensiveness.
Abstract: Some users of social media are spreading racist, sexist, and otherwise hateful content. For the purpose of training a hate speech detection system, the reliability of the annotations is crucial, but there is no universally agreed-upon definition. We collected potentially hateful messages and asked two groups of internet users to determine whether they were hate speech or not, whether they should be banned or not and to rate their degree of offensiveness. One of the groups was shown a definition prior to completing the survey. We aimed to assess whether hate speech can be annotated reliably, and the extent to which existing definitions are in accordance with subjective ratings. Our results indicate that showing users a definition caused them to partially align their own opinion with the definition but did not improve reliability, which was very low overall. We conclude that the presence of hate speech should perhaps not be considered a binary yes-or-no decision, and raters need more detailed instructions for the annotation.

248 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: A large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone is described.
Abstract: We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.

159 citations
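
The SVD-based compression mentioned above can be illustrated with a short numpy sketch: a large weight matrix is replaced by two thin factors. The matrix size and rank below are illustrative choices, not the paper's settings.

```python
# Low-rank (SVD) compression of a weight matrix (illustrative dimensions).
import numpy as np

def svd_compress(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # (out_dim, rank)
    B = Vt[:rank, :]                    # (rank, in_dim)
    return A, B                         # W is approximated by A @ B

W = np.random.randn(640, 2048)
A, B = svd_compress(W, rank=128)
print(W.size, "->", A.size + B.size, "parameters")   # rough compression ratio
```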


Journal ArticleDOI
TL;DR: When trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization performance on unseen test scenarios, approaching the performance with noise-dependent training.
Abstract: Voice activity detection (VAD) is an important topic in audio signal processing. Contextual information is important for improving the performance of VAD at low signal-to-noise ratios. Here we explore contextual information by machine learning methods at three levels. At the top level, we employ an ensemble learning framework, named multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block takes as input the concatenation of the predictions of its lower building blocks and the expansion of the raw acoustic feature by a given window (called a resolution). At the middle level, we describe a base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions from different contexts of a single frame by only one DNN and then aggregates the base predictions for a better prediction of the frame, and it is different from computationally-expensive boosting methods that train ensembles of classifiers for multiple base predictions. At the bottom level, we employ the multi-resolution cochleagram feature, which incorporates the contextual information by concatenating the cochleagram features at multiple spectrotemporal resolutions. Experimental results show that the MRS-based VAD outperforms other VADs by a considerable margin. Moreover, when trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization performance on unseen test scenarios, approaching the performance with noise-dependent training.

135 citations
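
At the middle level described above, one network produces base predictions for every frame in a context window, and the per-frame VAD score aggregates all base predictions covering that frame. The numpy sketch below illustrates only this aggregation step; `predict_window` is a hypothetical stand-in for the trained bDNN.

```python
# Aggregation of boosted (per-window) predictions into per-frame VAD scores.
import numpy as np

def boosted_vad_scores(features, predict_window, w):
    """features: (T, D); predict_window maps a flattened (2w+1, D) context to
    (2w+1,) scores, one per frame in the context (stand-in for the trained DNN)."""
    T, D = features.shape
    padded = np.pad(features, ((w, w), (0, 0)), mode="edge")
    scores = np.zeros(T)
    counts = np.zeros(T)
    for t in range(T):
        context = padded[t : t + 2 * w + 1].reshape(-1)    # context around frame t
        preds = predict_window(context)                    # base predictions for frames t-w..t+w
        lo, hi = max(0, t - w), min(T, t + w + 1)
        scores[lo:hi] += preds[(lo - (t - w)) : (hi - (t - w))]
        counts[lo:hi] += 1
    return scores / counts                                 # aggregated score per frame

# Usage with a dummy predictor (random scores), just to show the shapes.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))
dummy = lambda ctx: rng.random(2 * 5 + 1)
print(boosted_vad_scores(feats, dummy, w=5)[:10])
```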


Journal ArticleDOI
TL;DR: It is shown that phase-aware signal processing is an important emerging field with high potential in current speech communication applications and can complement the possible solutions that magnitude-only methods suggest.

126 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: This paper proposes a novel approach to VAD to tackle both feature and model selection jointly and shows that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially for noisy environments.
Abstract: Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD to tackle both feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially for noisy environments. In addition, using a CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with LSTM, is a much better model for VAD compared to the DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) on both clean and noisy conditions compared to a DNN of comparable size trained with log-mel features. In addition, we study the impact of the model size and the learned features to provide a better understanding of the proposed architecture.
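
A compact PyTorch sketch of a CLDNN-style VAD fed with raw waveform, in the spirit of the architecture above: a time convolution acts as a learned filterbank, an LSTM models temporal context, and a small DNN emits per-frame speech/non-speech logits. All layer sizes are illustrative assumptions.

```python
# CLDNN-style raw-waveform VAD sketch (illustrative sizes, 16 kHz audio assumed).
import torch
import torch.nn as nn

class CLDNNVad(nn.Module):
    def __init__(self, conv_channels=40, kernel=400, stride=160, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(1, conv_channels, kernel_size=kernel, stride=stride)
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, wav):                        # wav: (batch, samples)
        x = self.conv(wav.unsqueeze(1))            # (batch, channels, frames) learned filterbank
        x = torch.relu(x).transpose(1, 2)          # (batch, frames, channels)
        x, _ = self.lstm(x)                        # temporal modeling
        return self.dnn(x)                         # (batch, frames, 2) speech/non-speech logits

logits = CLDNNVad()(torch.randn(4, 16000))         # one second of 16 kHz audio per example
```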

Journal ArticleDOI
TL;DR: It is argued that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes.
Abstract: Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the risk of disturbing bystanders, or the inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used from neural signals, we discuss the Brain-to-text system.

Patent
29 Jun 2016
TL;DR: In this article, a system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech, where desired speech is speech that is from a same speaker as reference speech.
Abstract: A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.

Posted Content
TL;DR: This paper deals with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech, and proposes several strategies based on Deep Neural Networks for speech enhancement in these scenarios.
Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper proposes to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech, and develops a masking-based method for denoising and compares it with the spectral mapping method.
Abstract: In the real world, speech is usually distorted by both reverberation and background noise. In such conditions, speech intelligibility is degraded substantially, especially for hearing-impaired (HI) listeners. As a consequence, it is essential to enhance speech in the noisy and reverberant environment. Recently, deep neural networks have been introduced to learn a spectral mapping to enhance corrupted speech, and shown significant improvements in objective metrics and automatic speech recognition score. However, listening tests have not yet shown any speech intelligibility benefit. In this paper, we propose to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech. A preliminary listening test was conducted, and the results show that the proposed algorithm is able to improve speech intelligibility of HI listeners in some conditions. Moreover, we develop a masking-based method for denoising and compare it with the spectral mapping method. Evaluation results show that the masking-based method outperforms the mapping-based method.

Journal ArticleDOI
TL;DR: The proposed technique is called the separable deep auto encoder (SDAE); given the under-determined nature of the optimization problem, the clean speech reconstruction is confined to the convex hull spanned by a pre-trained speech dictionary.
Abstract: Unseen noise estimation is a key yet challenging step to make a speech enhancement algorithm work in adverse environments. At worst, the only prior knowledge we know about the encountered noise is that it is different from the involved speech. Therefore, by subtracting the components which cannot be adequately represented by a well defined speech model, the noises can be estimated and removed. Given the good performance of deep learning in signal representation, a deep auto encoder (DAE) is employed in this work for accurately modeling the clean speech spectrum. In the subsequent stage of speech enhancement, an extra DAE is introduced to represent the residual part obtained by subtracting the estimated clean speech spectrum (by using the pre-trained DAE) from the noisy speech spectrum. By adjusting the estimated clean speech spectrum and the unknown parameters of the noise DAE, one can reach a stationary point to minimize the total reconstruction error of the noisy speech spectrum. The enhanced speech signal is thus obtained by transforming the estimated clean speech spectrum back into time domain. The above proposed technique is called separable deep auto encoder (SDAE). Given the under-determined nature of the above optimization problem, the clean speech reconstruction is confined in the convex hull spanned by a pre-trained speech dictionary. New learning algorithms are investigated to respect the non-negativity of the parameters in the SDAE. Experimental results on TIMIT with 20 noise types at various noise levels demonstrate the superiority of the proposed method over the conventional baselines.
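
The decomposition described above can be sketched roughly in PyTorch: a pre-trained (here randomly initialized and frozen) clean-speech DAE models the speech spectrum, a second DAE models the residual, and the estimated clean spectrum plus the noise-DAE parameters are adjusted to minimize the total reconstruction error of the noisy spectrum. Architectures, sizes and the optimizer are illustrative assumptions, and the paper's non-negativity constraints and dictionary projection are omitted.

```python
# Rough sketch of the separable deep auto encoder (SDAE) enhancement objective.
import torch
import torch.nn as nn

def make_dae(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, dim), nn.ReLU())

dim = 257                                   # magnitude-spectrum bins (assumed)
speech_dae = make_dae(dim, 512)             # assumed pre-trained on clean speech, then frozen
for p in speech_dae.parameters():
    p.requires_grad_(False)
noise_dae = make_dae(dim, 128)              # adapted to the residual at enhancement time

noisy = torch.rand(100, dim)                                # placeholder noisy magnitude spectra
speech_est = torch.rand(100, dim, requires_grad=True)       # clean-spectrum estimate being optimised

opt = torch.optim.Adam([speech_est] + list(noise_dae.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    clean_recon = speech_dae(speech_est)                    # speech model's reconstruction
    recon = clean_recon + noise_dae(noisy - clean_recon)    # speech part + residual (noise) part
    loss = ((noisy - recon) ** 2).mean()                    # total reconstruction error
    loss.backward()
    opt.step()
```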

Journal ArticleDOI
TL;DR: In this article, the authors investigate the joint use of source- and filter-based features, and two strategies are proposed to merge source and filter information: feature fusion and decision fusion.
Abstract: Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose. Some are based on features derived from the power spectral density, others exploit the periodicity of the signal. The goal of this letter is to investigate the joint use of source and filter-based features. Interestingly, a mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones. The features are then used as the input of an artificial neural network-based classifier trained on a multi-condition database. Two strategies are proposed to merge source and filter information: feature and decision fusion. Our experiments indicate an absolute reduction of 3% of the equal error rate when using decision fusion. The final proposed system is compared to four state-of-the-art methods on 150 minutes of data recorded in real environments. Thanks to the robustness of its source-related features, its multi-condition training and its efficient information fusion, the proposed system yields a substantial increase in accuracy over the best state-of-the-art VAD across all conditions (24% absolute on average).
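
The two fusion strategies compared above can be illustrated with scikit-learn, assuming per-frame source and filter features are already extracted (random placeholders below): feature fusion concatenates the two sets before a single classifier, while decision fusion averages the posteriors of two separately trained classifiers.

```python
# Feature fusion vs. decision fusion for frame-level VAD (placeholder features/labels).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
source_feats = rng.standard_normal((500, 6))     # e.g. periodicity-related features (placeholder)
filter_feats = rng.standard_normal((500, 13))    # e.g. spectral-envelope features (placeholder)
labels = rng.integers(0, 2, 500)                 # frame-level speech / non-speech labels

# Feature fusion: one classifier on the concatenated feature vector.
feat_fusion = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
feat_fusion.fit(np.hstack([source_feats, filter_feats]), labels)

# Decision fusion: average the posteriors of two separately trained classifiers.
src_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(source_feats, labels)
flt_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(filter_feats, labels)
posterior = 0.5 * (src_clf.predict_proba(source_feats)[:, 1]
                   + flt_clf.predict_proba(filter_feats)[:, 1])
vad_decision = posterior > 0.5
```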

Journal ArticleDOI
TL;DR: It is found that real-time synthesis of vowels and consonants was possible with good intelligibility, opening the way to future speech BCI applications using such an articulatory-based speech synthesizer.
Abstract: Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer.
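
A minimal PyTorch sketch of the articulatory-to-acoustic mapping stage: a feed-forward DNN regresses vocoder parameters from EMA sensor coordinates frame by frame. The feature dimensions and network shape are illustrative assumptions, not the reported configuration.

```python
# Frame-by-frame articulatory-to-acoustic regression sketch (illustrative dimensions).
import torch
import torch.nn as nn

ema_dim = 18          # e.g. x/y/z coordinates of six sensors (assumed)
vocoder_dim = 25      # e.g. spectral + excitation parameters per frame (assumed)

dnn = nn.Sequential(
    nn.Linear(ema_dim, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, vocoder_dim),
)

ema_frames = torch.randn(1000, ema_dim)     # placeholder articulatory trajectories
acoustic_frames = dnn(ema_frames)           # to be fed to a vocoder for waveform synthesis
```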


Patent
22 Jun 2016
TL;DR: In this paper, a system uses trained models to detect a speech quality and generate an indicator of the speech quality, which is sent to downstream components of the system such as a command processor or TTS system.
Abstract: A system matches text-to-speech (TTS) or other output to a quality of an input spoken utterance. The system uses trained models to detect a speech quality and generates an indicator of the speech quality. The speech quality may be determined from audio or non-audio data. The indicator is sent to downstream components of the system such as a command processor or TTS system. The output of the system is then determined using the indicator of speech quality, thus customizing an output of the system to the manner in which the utterance was spoken.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: It is shown that error-rates for speaker-independent lip-reading can be very significantly reduced and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Abstract: Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error-rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error-rates can be even further reduced by the additional use of Deep Neural Networks (DNN). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: An experiment to detect Alzheimer’s disease from spontaneous conversational speech achieves an F-score of 0.8, clearly showing the approach detects dementia well.
Abstract: The worldwide population is aging. With a larger population of elderly people, the numbers of people affected by cognitive impairment such as Alzheimer’s disease are growing. Unfortunately, there is no known cure for Alzheimer’s disease. The only way to alleviate its serious effects is to start therapy very early, before the disease has wrought too much irreversible damage. Current diagnostic procedures are neither cost- nor time-efficient and therefore do not meet the demands for frequent mass screening required to mitigate the consequences of cognitive impairments on the global scale. We present an experiment to detect Alzheimer’s disease using spontaneous conversational speech. The speech data was recorded during biographic interviews in the Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE), a large data resource on healthy and satisfying aging in middle adulthood and later life in Germany. From these recordings we extract ten speech-based features using voice activity detection and transcriptions. In an experimental setup with 98 data samples we train a linear discriminant analysis classifier to distinguish subjects with Alzheimer’s disease from the control group. This setup results in an F-score of 0.8 for the detection of Alzheimer’s disease, clearly showing our approach detects dementia well.
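
The classification setup described above is easy to sketch with scikit-learn: a linear discriminant analysis classifier over ten speech-based features, evaluated with the F-score. The feature matrix and labels below are random placeholders, not the ILSE data.

```python
# LDA classification of speech-based features with cross-validated F-score (placeholder data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.standard_normal((98, 10))          # 98 samples x 10 speech-based features (placeholder)
y = rng.integers(0, 2, 98)                 # 1 = Alzheimer's, 0 = control (placeholder labels)

pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=5)
print("F-score:", f1_score(y, pred))
```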

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper investigated the use of DNNs for automatic scream and shouted speech detection, within the framework of surveillance systems in public transportation, and recorded a database of sounds occurring in subway trains.
Abstract: Deep Neural Networks (DNNs) have recently become a popular technique for regression and classification problems. Their capacity to learn high-order correlations between input and output data proves to be very powerful for automatic speech recognition. In this paper we investigate the use of DNNs for automatic scream and shouted speech detection, within the framework of surveillance systems in public transportation. We recorded a database of sounds occurring in subway trains in real conditions of exploitation and used DNNs to classify the sounds into screams, shouts and other categories. We report encouraging results, given the difficulty of the task, especially when a high level of surrounding noise is present.


Patent
11 Mar 2016
TL;DR: In this paper, a multimodal system using at least one speech recognizer to perform speech recognition utilizing a circular buffer to unify all modal events into a single interpretation of the user's intent is presented.
Abstract: A multimodal system using at least one speech recognizer to perform speech recognition utilizing a circular buffer to unify all modal events into a single interpretation of the user's intent.

Journal ArticleDOI
TL;DR: The developed system can effectively be used in voice pathology detection and classification systems, and the proposed features can visually differentiate between normal and pathological samples.

Journal ArticleDOI
TL;DR: Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step, and the SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable.

Journal ArticleDOI
TL;DR: A significant gap is revealed in the performance of state-of-the-art spoofing detectors between clean and noisy conditions, and a study with two score fusion strategies shows that combining different feature-based systems improves recognition accuracy for known and unknown attacks in both clean and noisy conditions.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: To improve the robustness of deep learning based VAD models, a new noise-aware training (NAT) approach is also proposed and experiments show that LSTM-based VAD is most robust but the performance degrades dramatically in the conditions with unseen noise or diverse SNR.
Abstract: Voice activity detection (VAD) is an important step for real-world automatic speech recognition (ASR) systems. Deep learning approaches, such as DNN, RNN or CNN, have been widely used in model-based VAD. Although they have achieved success in practice, they have been developed on different VAD tasks separately. Whilst VAD performance under noisy conditions, especially with unseen noise or very low SNR, is of great interest, there has been no robustness comparison of different deep learning approaches so far. In this paper, to learn the robustness property, VAD models based on DNN, LSTM and CNN are thoroughly compared at both frame and segment level under various noisy conditions on Aurora 4, a commonly used speech corpus with rich noises. To improve the robustness of deep learning based VAD models, a new noise-aware training (NAT) approach is also proposed. Experiments show that LSTM-based VAD is most robust but the performance degrades dramatically in conditions with unseen noise or diverse SNR. By incorporating NAT, significant performance gains can be obtained in these conditions.
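
The noise-aware training (NAT) idea mentioned above is commonly implemented by appending a noise estimate to each frame's input features. The numpy sketch below uses the mean of the first few frames as that estimate, which is a common simple choice and an assumption here, not necessarily the paper's exact estimator.

```python
# Noise-aware training (NAT) feature construction: append a per-utterance noise estimate.
import numpy as np

def nat_features(frames, n_noise_frames=10):
    """frames: (T, D) log-spectral features of one utterance."""
    noise_estimate = frames[:n_noise_frames].mean(axis=0)                     # (D,) noise estimate
    return np.hstack([frames, np.tile(noise_estimate, (len(frames), 1))])     # (T, 2D) NAT input

feats = np.random.randn(300, 40)     # placeholder utterance features
augmented = nat_features(feats)      # input to the DNN/LSTM/CNN VAD models
```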

Book ChapterDOI
08 Oct 2016
TL;DR: This work is the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset, and is seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision.
Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion - facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.