
Showing papers on "Voice activity detection" published in 2016


Proceedings ArticleDOI
20 Mar 2016
TL;DR: Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers, is presented.
Abstract: We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers. In LAS, the neural network architecture subsumes the acoustic, pronunciation and language models making it not only an end-to-end trained system but an end-to-end model. In contrast to DNN-HMM, CTC and most other models, LAS makes no independence assumptions about the probability distribution of the output character sequences given the acoustic sequence. Our system has two components: a listener and a speller. The listener is a pyramidal recurrent network encoder that accepts filter bank spectra as inputs. The speller is an attention-based recurrent network decoder that emits each character conditioned on all previous characters, and the entire acoustic sequence. On a Google voice search task, LAS achieves a WER of 14.1% without a dictionary or an external language model and 10.3% with language model rescoring over the top 32 beams. In comparison, the state-of-the-art CLDNN-HMM model achieves a WER of 8.0% on the same set.

2,279 citations
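
As an illustration of the listener/speller structure described in the abstract above, the following PyTorch sketch shows a pyramidal bidirectional-LSTM encoder layer and a simple attention-based character decoder. It is not the authors' implementation; the layer sizes, the 2x time reduction per pyramid layer, and the dot-product attention are illustrative assumptions.

```python
# Minimal sketch of a pyramidal encoder layer and an attention decoder (assumed shapes).
import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    """One pyramid layer: concatenate every two frames, then a BiLSTM (halves the time axis)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, x):                                   # x: (batch, T, input_dim)
        b, t, d = x.shape
        x = x[:, : t - (t % 2)].reshape(b, t // 2, 2 * d)   # stack frame pairs
        out, _ = self.blstm(x)                              # (batch, T/2, 2*hidden_dim)
        return out

class Speller(nn.Module):
    """Attention decoder that emits one character per step, conditioned on the encoder output."""
    def __init__(self, enc_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.LSTMCell(hidden_dim + enc_dim, hidden_dim)
        self.query = nn.Linear(hidden_dim, enc_dim)
        self.out = nn.Linear(hidden_dim + enc_dim, vocab_size)

    def forward(self, enc, chars):                          # enc: (B, T, enc_dim), chars: (B, L)
        b = enc.size(0)
        h = enc.new_zeros(b, self.rnn.hidden_size)
        c = enc.new_zeros(b, self.rnn.hidden_size)
        context = enc.mean(dim=1)                           # simple initial context
        logits = []
        for step in range(chars.size(1)):
            inp = torch.cat([self.embed(chars[:, step]), context], dim=-1)
            h, c = self.rnn(inp, (h, c))
            att = torch.softmax(torch.bmm(enc, self.query(h).unsqueeze(2)).squeeze(2), dim=1)
            context = torch.bmm(att.unsqueeze(1), enc).squeeze(1)
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)                   # (B, L, vocab) character logits
```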


Proceedings ArticleDOI
01 Jun 2016
TL;DR: A list of criteria founded in critical race theory is provided and used to annotate a publicly available corpus of more than 16k tweets, together with a dictionary based on the most indicative words in the data.
Abstract: Hate speech in the form of racist and sexist remarks is a common occurrence on social media. For that reason, many social media services address the problem of identifying hate speech, but definitions of hate speech vary markedly and identification is largely a manual effort (BBC, 2015; Lomas, 2015). We provide a list of criteria founded in critical race theory, and use them to annotate a publicly available corpus of more than 16k tweets. We analyze the impact of various extra-linguistic features in conjunction with character n-grams for hate speech detection. We also present a dictionary based on the most indicative words in our data.

1,368 citations
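
The character n-gram setup mentioned above can be reproduced in spirit with a few lines of scikit-learn. The sketch below is not the authors' code; the toy tweets, labels, and the logistic-regression classifier are placeholders and assumptions.

```python
# Character n-gram text classification sketch (toy data, assumed classifier choice).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["example tweet one", "another example tweet"]       # placeholder data
labels = [0, 1]                                                # 0 = none, 1 = hate speech

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),  # character n-gram features
    LogisticRegression(max_iter=1000),
)
clf.fit(tweets, labels)
print(clf.predict(["a new unseen tweet"]))
```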


Journal ArticleDOI
TL;DR: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech, and showed that it was superior to the other systems in terms of both sound quality and processing speed.
Abstract: A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. This new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing. Key words: speech analysis, speech synthesis, vocoder, sound quality, real-time processing

1,025 citations
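
For readers who want to try WORLD-style analysis, manipulation and synthesis, the sketch below uses the open-source pyworld Python bindings (assumed installed; 'input.wav' is a placeholder path). It decomposes speech into F0, spectral envelope and aperiodicity, then resynthesizes with a modified pitch.

```python
# WORLD-style analysis/manipulation/synthesis via pyworld (placeholder file names).
import numpy as np
import soundfile as sf
import pyworld as pw

x, fs = sf.read("input.wav")              # mono waveform, placeholder path
x = x.astype(np.float64)                  # pyworld expects float64

f0, t = pw.dio(x, fs)                     # coarse F0 estimation
f0 = pw.stonemask(x, f0, t, fs)           # F0 refinement
sp = pw.cheaptrick(x, f0, t, fs)          # spectral envelope
ap = pw.d4c(x, f0, t, fs)                 # aperiodicity

y = pw.synthesize(f0 * 1.2, sp, ap, fs)   # e.g. raise pitch by 20% before synthesis
sf.write("output.wav", y, fs)
```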


Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.

699 citations
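
The training target described above, the ideal ratio mask defined in the complex domain, can be written out compactly: if Y is the noisy STFT and S the clean STFT, the mask M satisfies S = M * Y, so M = Y*S / |Y|^2 split into real and imaginary parts. A small numpy sketch, assuming the STFTs are already computed:

```python
# Complex ideal ratio mask (cIRM) construction and application (STFTs assumed given).
import numpy as np

def complex_ideal_ratio_mask(Y, S, eps=1e-8):
    """Y: noisy STFT, S: clean STFT (complex arrays of the same shape)."""
    denom = Y.real ** 2 + Y.imag ** 2 + eps
    m_real = (Y.real * S.real + Y.imag * S.imag) / denom
    m_imag = (Y.real * S.imag - Y.imag * S.real) / denom
    return m_real, m_imag          # real/imaginary targets a DNN could be trained to estimate

def apply_mask(Y, m_real, m_imag):
    """Reconstruct an estimate of the clean STFT from the (estimated) mask."""
    return (m_real + 1j * m_imag) * Y
```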


Proceedings ArticleDOI
01 Nov 2016
TL;DR: It is found that amateur annotators are more likely than expert annotators to label items as hate speech, and that systems trained on expert annotations outperform systems trained on amateur annotations.
Abstract: Hate speech in the form of racism and sexism is commonplace on the internet (Waseem and Hovy, 2016). For this reason, there has been both an academic and an industry interest in detection of hate speech. The volume of data to be reviewed for creating data sets encourages a use of crowd sourcing for the annotation efforts. In this paper, we provide an examination of the influence of annotator knowledge of hate speech on classification models by comparing classification results obtained from training on expert and amateur annotations. We provide an evaluation on our own data set and run our models on the data set released by Waseem and Hovy (2016). We find that amateur annotators are more likely than expert annotators to label items as hate speech, and that systems trained on expert annotations outperform systems trained on amateur annotations.

487 citations


Proceedings ArticleDOI
Lifa Sun, Kun Li, Hao Wang, Shiyin Kang, Helen Meng
11 Jul 2016
TL;DR: This paper proposes a novel approach to voice conversion with non-parallel training data to bridge between speakers by means of Phonetic PosteriorGrams obtained from a speaker-independent automatic speech recognition system.
Abstract: This paper proposes a novel approach to voice conversion with non-parallel training data. The idea is to bridge between speakers by means of Phonetic PosteriorGrams (PPGs) obtained from a speaker-independent automatic speech recognition (SI-ASR) system. It is assumed that these PPGs can represent articulation of speech sounds in a speaker-normalized space and correspond to spoken content speaker-independently. The proposed approach first obtains PPGs of target speech. Then, a Deep Bidirectional Long Short-Term Memory based Recurrent Neural Network (DBLSTM) structure is used to model the relationships between the PPGs and acoustic features of the target speech. To convert arbitrary source speech, we obtain its PPGs from the same SI-ASR and feed them into the trained DBLSTM for generating converted speech. Our approach has two main advantages: 1) no parallel training data is required; 2) a trained model can be applied to any other source speaker for a fixed target speaker (i.e., many-to-one conversion). Experiments show that our approach performs equally well or better than state-of-the-art systems in both speech quality and speaker similarity.

296 citations
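
A minimal PyTorch sketch of the conversion model described above: a deep bidirectional LSTM mapping phonetic posteriorgram frames (from an SI-ASR) to the target speaker's acoustic features. All dimensions and the MSE training objective shown are illustrative assumptions, not the paper's exact configuration.

```python
# PPG-to-acoustic mapping with a deep bidirectional LSTM (illustrative sizes).
import torch
import torch.nn as nn

class PPGToAcoustic(nn.Module):
    def __init__(self, ppg_dim=144, acoustic_dim=60, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, acoustic_dim)

    def forward(self, ppg):                  # ppg: (batch, frames, ppg_dim)
        h, _ = self.blstm(ppg)
        return self.proj(h)                  # (batch, frames, acoustic_dim)

model = PPGToAcoustic()
loss_fn = nn.MSELoss()
# Training pairs: PPGs of target speech -> acoustic features of the same target speech
# (dummy tensors stand in for real features here).
ppg = torch.randn(8, 200, 144)
target = torch.randn(8, 200, 60)
loss = loss_fn(model(ppg), target)
loss.backward()
```

At conversion time the same SI-ASR extracts PPGs from any source speaker, which makes the trained mapping many-to-one by construction.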


DOI
22 Sep 2016
TL;DR: The authors collected potentially hateful messages and asked two groups of internet users to determine whether they were hate speech or not, whether they should be banned or not and to rate their degree of offensiveness.
Abstract: Some users of social media are spreading racist, sexist, and otherwise hateful content. For the purpose of training a hate speech detection system, the reliability of the annotations is crucial, but there is no universally agreed-upon definition. We collected potentially hateful messages and asked two groups of internet users to determine whether they were hate speech or not, whether they should be banned or not and to rate their degree of offensiveness. One of the groups was shown a definition prior to completing the survey. We aimed to assess whether hate speech can be annotated reliably, and the extent to which existing definitions are in accordance with subjective ratings. Our results indicate that showing users a definition caused them to partially align their own opinion with the definition but did not improve reliability, which was very low overall. We conclude that the presence of hate speech should perhaps not be considered a binary yes-or-no decision, and raters need more detailed instructions for the annotation.

248 citations


Proceedings ArticleDOI
20 Mar 2016
TL;DR: A large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone is described.
Abstract: We describe a large vocabulary speech recognition system that is accurate, has low latency, and yet has a small enough memory and computational footprint to run faster than real-time on a Nexus 5 Android smartphone. We employ a quantized Long Short-Term Memory (LSTM) acoustic model trained with connectionist temporal classification (CTC) to directly predict phoneme targets, and further reduce its memory footprint using an SVD-based compression scheme. Additionally, we minimize our memory footprint by using a single language model for both dictation and voice command domains, constructed using Bayesian interpolation. Finally, in order to properly handle device-specific information, such as proper names and other context-dependent information, we inject vocabulary items into the decoder graph and bias the language model on-the-fly. Our system achieves 13.5% word error rate on an open-ended dictation task, running with a median speed that is seven times faster than real-time.

159 citations
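
The SVD-based compression mentioned above can be illustrated with a short numpy sketch: a large weight matrix is replaced by two thin factors. The matrix size and rank below are illustrative choices, not the paper's settings.

```python
# Low-rank (SVD) compression of a weight matrix (illustrative dimensions).
import numpy as np

def svd_compress(W, rank):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]          # (out_dim, rank)
    B = Vt[:rank, :]                    # (rank, in_dim)
    return A, B                         # W is approximated by A @ B

W = np.random.randn(640, 2048)
A, B = svd_compress(W, rank=128)
print(W.size, "->", A.size + B.size, "parameters")   # rough compression ratio
```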


Journal ArticleDOI
TL;DR: When trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization performance on unseen test scenarios, approaching the performance with noise-dependent training.
Abstract: Voice activity detection (VAD) is an important topic in audio signal processing. Contextual information is important for improving the performance of VAD at low signal-to-noise ratios. Here we explore contextual information by machine learning methods at three levels. At the top level, we employ an ensemble learning framework, named multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block takes as input the concatenation of the predictions of its lower building blocks and the expansion of the raw acoustic feature by a given window (called a resolution). At the middle level, we describe a base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions from different contexts of a single frame by only one DNN and then aggregates the base predictions for a better prediction of the frame, and it is different from computationally-expensive boosting methods that train ensembles of classifiers for multiple base predictions. At the bottom level, we employ the multi-resolution cochleagram feature, which incorporates the contextual information by concatenating the cochleagram features at multiple spectrotemporal resolutions. Experimental results show that the MRS-based VAD outperforms other VADs by a considerable margin. Moreover, when trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization performance on unseen test scenarios, approaching the performance with noise-dependent training.

135 citations
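
At the middle level described above, one network produces base predictions for every frame in a context window, and the per-frame VAD score aggregates all base predictions covering that frame. The numpy sketch below illustrates only this aggregation step; `predict_window` is a hypothetical stand-in for the trained bDNN.

```python
# Aggregation of boosted (per-window) predictions into per-frame VAD scores.
import numpy as np

def boosted_vad_scores(features, predict_window, w):
    """features: (T, D); predict_window maps a flattened (2w+1, D) context to
    (2w+1,) scores, one per frame in the context (stand-in for the trained DNN)."""
    T, D = features.shape
    padded = np.pad(features, ((w, w), (0, 0)), mode="edge")
    scores = np.zeros(T)
    counts = np.zeros(T)
    for t in range(T):
        context = padded[t : t + 2 * w + 1].reshape(-1)    # context around frame t
        preds = predict_window(context)                    # base predictions for frames t-w..t+w
        lo, hi = max(0, t - w), min(T, t + w + 1)
        scores[lo:hi] += preds[(lo - (t - w)) : (hi - (t - w))]
        counts[lo:hi] += 1
    return scores / counts                                 # aggregated score per frame

# Usage with a dummy predictor (random scores), just to show the shapes.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 40))
dummy = lambda ctx: rng.random(2 * 5 + 1)
print(boosted_vad_scores(feats, dummy, w=5)[:10])
```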


Journal ArticleDOI
TL;DR: It is shown that phase-aware signal processing is an important emerging field with high potential in current speech communication applications and can complement the possible solutions that magnitude-only methods suggest.

126 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: This paper proposes a novel approach to VAD to tackle both feature and model selection jointly and shows that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially for noisy environments.
Abstract: Voice Activity Detection (VAD) is an important preprocessing step in any state-of-the-art speech recognition system. Choosing the right set of features and model architecture can be challenging and is an active area of research. In this paper we propose a novel approach to VAD to tackle both feature and model selection jointly. The proposed method is based on a CLDNN (Convolutional, Long Short-Term Memory, Deep Neural Networks) architecture fed directly with the raw waveform. We show that using the raw waveform allows the neural network to learn features directly for the task at hand, which is more powerful than using log-mel features, especially for noisy environments. In addition, using a CLDNN, which takes advantage of both frequency modeling with the CNN and temporal modeling with LSTM, is a much better model for VAD compared to the DNN. The proposed system achieves over 78% relative improvement in False Alarms (FA) at the operating point of 2% False Rejects (FR) on both clean and noisy conditions compared to a DNN of comparable size trained with log-mel features. In addition, we study the impact of the model size and the learned features to provide a better understanding of the proposed architecture.
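
A compact PyTorch sketch of a CLDNN-style VAD fed with raw waveform, in the spirit of the architecture above: a time convolution acts as a learned filterbank, an LSTM models temporal context, and a small DNN emits per-frame speech/non-speech logits. All layer sizes are illustrative assumptions.

```python
# CLDNN-style raw-waveform VAD sketch (illustrative sizes, 16 kHz audio assumed).
import torch
import torch.nn as nn

class CLDNNVad(nn.Module):
    def __init__(self, conv_channels=40, kernel=400, stride=160, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(1, conv_channels, kernel_size=kernel, stride=stride)
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.dnn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, wav):                        # wav: (batch, samples)
        x = self.conv(wav.unsqueeze(1))            # (batch, channels, frames) learned filterbank
        x = torch.relu(x).transpose(1, 2)          # (batch, frames, channels)
        x, _ = self.lstm(x)                        # temporal modeling
        return self.dnn(x)                         # (batch, frames, 2) speech/non-speech logits

logits = CLDNNVad()(torch.randn(4, 16000))         # one second of 16 kHz audio per example
```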

Journal ArticleDOI
TL;DR: It is argued that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes.
Abstract: Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the risk of disturbing bystanders, or the inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used from neural signals, we discuss the Brain-to-text system.

Patent
29 Jun 2016
TL;DR: In this article, a system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech, where desired speech is speech that is from a same speaker as reference speech.
Abstract: A system configured to process speech commands may classify incoming audio as desired speech, undesired speech, or non-speech. Desired speech is speech that is from a same speaker as reference speech. The reference speech may be obtained from a configuration session or from a first portion of input speech that includes a wakeword. The reference speech may be encoded using a recurrent neural network (RNN) encoder to create a reference feature vector. The reference feature vector and incoming audio data may be processed by a trained neural network classifier to label the incoming audio data (for example, frame-by-frame) as to whether each frame is spoken by the same speaker as the reference speech. The labels may be passed to an automatic speech recognition (ASR) component which may allow the ASR component to focus its processing on the desired speech.

Posted Content
TL;DR: This paper deals with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech, and proposes several strategies based on Deep Neural Networks for speech enhancement in these scenarios.
Abstract: In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in an office environment where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper proposes to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech, and develops a masking-based method for denoising and compares it with the spectral mapping method.
Abstract: In the real world, speech is usually distorted by both reverberation and background noise. In such conditions, speech intelligibility is degraded substantially, especially for hearing-impaired (HI) listeners. As a consequence, it is essential to enhance speech in the noisy and reverberant environment. Recently, deep neural networks have been introduced to learn a spectral mapping to enhance corrupted speech, and shown significant improvements in objective metrics and automatic speech recognition score. However, listening tests have not yet shown any speech intelligibility benefit. In this paper, we propose to enhance the noisy and reverberant speech by learning a mapping to reverberant target speech rather than anechoic target speech. A preliminary listening test was conducted, and the results show that the proposed algorithm is able to improve speech intelligibility of HI listeners in some conditions. Moreover, we develop a masking-based method for denoising and compare it with the spectral mapping method. Evaluation results show that the masking-based method outperforms the mapping-based method.

Journal ArticleDOI
TL;DR: The proposed technique is called the separable deep auto encoder (SDAE); given the under-determined nature of the optimization problem, the clean speech reconstruction is confined to the convex hull spanned by a pre-trained speech dictionary.
Abstract: Unseen noise estimation is a key yet challenging step to make a speech enhancement algorithm work in adverse environments. At worst, the only prior knowledge we know about the encountered noise is that it is different from the involved speech. Therefore, by subtracting the components which cannot be adequately represented by a well defined speech model, the noises can be estimated and removed. Given the good performance of deep learning in signal representation, a deep auto encoder (DAE) is employed in this work for accurately modeling the clean speech spectrum. In the subsequent stage of speech enhancement, an extra DAE is introduced to represent the residual part obtained by subtracting the estimated clean speech spectrum (by using the pre-trained DAE) from the noisy speech spectrum. By adjusting the estimated clean speech spectrum and the unknown parameters of the noise DAE, one can reach a stationary point to minimize the total reconstruction error of the noisy speech spectrum. The enhanced speech signal is thus obtained by transforming the estimated clean speech spectrum back into time domain. The above proposed technique is called separable deep auto encoder (SDAE). Given the under-determined nature of the above optimization problem, the clean speech reconstruction is confined in the convex hull spanned by a pre-trained speech dictionary. New learning algorithms are investigated to respect the non-negativity of the parameters in the SDAE. Experimental results on TIMIT with 20 noise types at various noise levels demonstrate the superiority of the proposed method over the conventional baselines.
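
The decomposition described above can be sketched roughly in PyTorch: a pre-trained (here randomly initialized and frozen) clean-speech DAE models the speech spectrum, a second DAE models the residual, and the estimated clean spectrum plus the noise-DAE parameters are adjusted to minimize the total reconstruction error of the noisy spectrum. Architectures, sizes and the optimizer are illustrative assumptions, and the paper's non-negativity constraints and dictionary projection are omitted.

```python
# Rough sketch of the separable deep auto encoder (SDAE) enhancement objective.
import torch
import torch.nn as nn

def make_dae(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, dim), nn.ReLU())

dim = 257                                   # magnitude-spectrum bins (assumed)
speech_dae = make_dae(dim, 512)             # assumed pre-trained on clean speech, then frozen
for p in speech_dae.parameters():
    p.requires_grad_(False)
noise_dae = make_dae(dim, 128)              # adapted to the residual at enhancement time

noisy = torch.rand(100, dim)                                # placeholder noisy magnitude spectra
speech_est = torch.rand(100, dim, requires_grad=True)       # clean-spectrum estimate being optimised

opt = torch.optim.Adam([speech_est] + list(noise_dae.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    clean_recon = speech_dae(speech_est)                    # speech model's reconstruction
    recon = clean_recon + noise_dae(noisy - clean_recon)    # speech part + residual (noise) part
    loss = ((noisy - recon) ** 2).mean()                    # total reconstruction error
    loss.backward()
    opt.step()
```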

Journal ArticleDOI
TL;DR: In this article, the authors investigate the joint use of source- and filter-based features, and two strategies are proposed to merge source and filter information: feature fusion and decision fusion.
Abstract: Voice Activity Detection (VAD) refers to the problem of distinguishing speech segments from background noise. Numerous approaches have been proposed for this purpose. Some are based on features derived from the power spectral density, others exploit the periodicity of the signal. The goal of this letter is to investigate the joint use of source and filter-based features. Interestingly, a mutual information-based assessment shows superior discrimination power for the source-related features, especially the proposed ones. The features are then used as the input of an artificial neural network-based classifier trained on a multi-condition database. Two strategies are proposed to merge source and filter information: feature and decision fusion. Our experiments indicate an absolute reduction of 3% of the equal error rate when using decision fusion. The final proposed system is compared to four state-of-the-art methods on 150 minutes of data recorded in real environments. Thanks to the robustness of its source-related features, its multi-condition training and its efficient information fusion, the proposed system yields a substantial increase in accuracy over the best state-of-the-art VAD across all conditions (24% absolute on average).
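
The two fusion strategies compared above can be illustrated with scikit-learn, assuming per-frame source and filter features are already extracted (random placeholders below): feature fusion concatenates the two sets before a single classifier, while decision fusion averages the posteriors of two separately trained classifiers.

```python
# Feature fusion vs. decision fusion for frame-level VAD (placeholder features/labels).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
source_feats = rng.standard_normal((500, 6))     # e.g. periodicity-related features (placeholder)
filter_feats = rng.standard_normal((500, 13))    # e.g. spectral-envelope features (placeholder)
labels = rng.integers(0, 2, 500)                 # frame-level speech / non-speech labels

# Feature fusion: one classifier on the concatenated feature vector.
feat_fusion = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
feat_fusion.fit(np.hstack([source_feats, filter_feats]), labels)

# Decision fusion: average the posteriors of two separately trained classifiers.
src_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(source_feats, labels)
flt_clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(filter_feats, labels)
posterior = 0.5 * (src_clf.predict_proba(source_feats)[:, 1]
                   + flt_clf.predict_proba(filter_feats)[:, 1])
vad_decision = posterior > 0.5
```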

Journal ArticleDOI
TL;DR: It is found that real-time synthesis of vowels and consonants was possible with good intelligibility, opening the way to future speech BCI applications using such an articulatory-based speech synthesizer.
Abstract: Restoring natural speech in paralyzed and aphasic people could be achieved using a Brain-Computer Interface (BCI) controlling a speech synthesizer in real-time. To reach this goal, a prerequisite is to develop a speech synthesizer producing intelligible speech in real-time with a reasonable number of control parameters. We present here an articulatory-based speech synthesizer that can be controlled in real-time for future BCI applications. This synthesizer converts movements of the main speech articulators (tongue, jaw, velum, and lips) into intelligible speech. The articulatory-to-acoustic mapping is performed using a deep neural network (DNN) trained on electromagnetic articulography (EMA) data recorded on a reference speaker synchronously with the produced speech signal. This DNN is then used in both offline and online modes to map the position of sensors glued on different speech articulators into acoustic parameters that are further converted into an audio signal using a vocoder. In offline mode, highly intelligible speech could be obtained as assessed by perceptual evaluation performed by 12 listeners. Then, to anticipate future BCI applications, we further assessed the real-time control of the synthesizer by both the reference speaker and new speakers, in a closed-loop paradigm using EMA data recorded in real time. A short calibration period was used to compensate for differences in sensor positions and articulatory differences between new speakers and the reference speaker. We found that real-time synthesis of vowels and consonants was possible with good intelligibility. In conclusion, these results open the way to future speech BCI applications using such an articulatory-based speech synthesizer.
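
A minimal PyTorch sketch of the articulatory-to-acoustic mapping stage: a feed-forward DNN regresses vocoder parameters from EMA sensor coordinates frame by frame. The feature dimensions and network shape are illustrative assumptions, not the reported configuration.

```python
# Frame-by-frame articulatory-to-acoustic regression sketch (illustrative dimensions).
import torch
import torch.nn as nn

ema_dim = 18          # e.g. x/y/z coordinates of six sensors (assumed)
vocoder_dim = 25      # e.g. spectral + excitation parameters per frame (assumed)

dnn = nn.Sequential(
    nn.Linear(ema_dim, 256), nn.Tanh(),
    nn.Linear(256, 256), nn.Tanh(),
    nn.Linear(256, vocoder_dim),
)

ema_frames = torch.randn(1000, ema_dim)     # placeholder articulatory trajectories
acoustic_frames = dnn(ema_frames)           # to be fed to a vocoder for waveform synthesis
```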


Patent
22 Jun 2016
TL;DR: In this paper, a system uses trained models to detect a speech quality and generate an indicator of the speech quality, which is sent to downstream components of the system such as a command processor or TTS system.
Abstract: A system matches text-to-speech (TTS) or other output to a quality of an input spoken utterance. The system uses trained models to detect a speech quality and generates an indicator of the speech quality. The speech quality may be determined from audio or non-audio data. The indicator is sent to downstream components of the system such as a command processor or TTS system. The output of the system is then determined using the indicator of speech quality, thus customizing an output of the system to the manner in which the utterance was spoken.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: It is shown that error-rates for speaker-independent lip-reading can be very significantly reduced and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Abstract: Recent improvements in tracking and feature extraction mean that speaker-dependent lip-reading of continuous speech using a medium size vocabulary (around 1000 words) is realistic. However, the recognition of previously unseen speakers has been found to be a very challenging task, because of the large variation in lip-shapes across speakers and the lack of large, tracked databases of visual features, which are very expensive to produce. By adapting a technique that is established in speech recognition but has not previously been used in lip-reading, we show that error-rates for speaker-independent lip-reading can be very significantly reduced. Furthermore, we show that error-rates can be even further reduced by the additional use of Deep Neural Networks (DNN). We also find that there is no need to map phonemes to visemes for context-dependent visual speech transcription.

Proceedings ArticleDOI
08 Sep 2016
TL;DR: An experiment to detect Alzheimer’s disease from spontaneous conversational speech achieves an F-score of 0.8, clearly showing the approach detects dementia well.
Abstract: The worldwide population is aging. With a larger population of elderly people, the numbers of people affected by cognitive impairment such as Alzheimer’s disease are growing. Unfortunately, there is no known cure for Alzheimer’s disease. The only way to alleviate its serious effects is to start therapy very early, before the disease has wrought too much irreversible damage. Current diagnostic procedures are neither cost- nor time-efficient and therefore do not meet the demands for frequent mass screening required to mitigate the consequences of cognitive impairments on the global scale. We present an experiment to detect Alzheimer’s disease using spontaneous conversational speech. The speech data was recorded during biographic interviews in the Interdisciplinary Longitudinal Study on Adult Development and Aging (ILSE), a large data resource on healthy and satisfying aging in middle adulthood and later life in Germany. From these recordings we extract ten speech-based features using voice activity detection and transcriptions. In an experimental setup with 98 data samples we train a linear discriminant analysis classifier to distinguish subjects with Alzheimer’s disease from the control group. This setup results in an F-score of 0.8 for the detection of Alzheimer’s disease, clearly showing our approach detects dementia well.
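
The classification setup described above is easy to sketch with scikit-learn: a linear discriminant analysis classifier over ten speech-based features, evaluated with the F-score. The feature matrix and labels below are random placeholders, not the ILSE data.

```python
# LDA classification of speech-based features with cross-validated F-score (placeholder data).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.standard_normal((98, 10))          # 98 samples x 10 speech-based features (placeholder)
y = rng.integers(0, 2, 98)                 # 1 = Alzheimer's, 0 = control (placeholder labels)

pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=5)
print("F-score:", f1_score(y, pred))
```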

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This paper investigated the use of DNNs for automatic scream and shouted speech detection, within the framework of surveillance systems in public transportation, and recorded a database of sounds occurring in subway trains.
Abstract: Deep Neural Networks (DNNs) have recently become a popular technique for regression and classification problems. Their capacity to learn high-order correlations between input and output data proves to be very powerful for automatic speech recognition. In this paper we investigate the use of DNNs for automatic scream and shouted speech detection, within the framework of surveillance systems in public transportation. We recorded a database of sounds occurring in subway trains in real conditions of exploitation and used DNNs to classify the sounds into screams, shouts and other categories. We report encouraging results, given the difficulty of the task, especially when a high level of surrounding noise is present.


Patent
11 Mar 2016
TL;DR: In this paper, a multimodal system using at least one speech recognizer to perform speech recognition utilizing a circular buffer to unify all modal events into a single interpretation of the user's intent is presented.
Abstract: A multimodal system using at least one speech recognizer to perform speech recognition utilizing a circular buffer to unify all modal events into a single interpretation of the user's intent.

Journal ArticleDOI
TL;DR: The developed system can effectively be used in voice pathology detection and classification systems, and the proposed features can visually differentiate between normal and pathological samples.

Journal ArticleDOI
TL;DR: Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step, and the SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable.

Journal ArticleDOI
TL;DR: A significant gap is revealed in the performance of state-of-the-art spoofing detectors between clean and noisy conditions, and a study with two score fusion strategies shows that combining different feature-based systems improves recognition accuracy for known and unknown attacks in both clean and noisy conditions.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: To improve the robustness of deep learning based VAD models, a new noise-aware training (NAT) approach is also proposed and experiments show that LSTM-based VAD is most robust but the performance degrades dramatically in the conditions with unseen noise or diverse SNR.
Abstract: Voice activity detection (VAD) is an important step for real-world automatic speech recognition (ASR) systems. Deep learning approaches, such as DNN, RNN or CNN, have been widely used in model-based VAD. Although they have achieved success in practice, they have been developed on different VAD tasks separately. Whilst VAD performance under noisy conditions, especially with unseen noise or very low SNR, is of great interest, there has been no robustness comparison of different deep learning approaches so far. In this paper, to learn the robustness property, VAD models based on DNN, LSTM and CNN are thoroughly compared at both frame and segment level under various noisy conditions on Aurora 4, a commonly used speech corpus with rich noises. To improve the robustness of deep learning based VAD models, a new noise-aware training (NAT) approach is also proposed. Experiments show that LSTM-based VAD is most robust but the performance degrades dramatically in conditions with unseen noise or diverse SNR. By incorporating NAT, significant performance gains can be obtained in these conditions.
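
The noise-aware training (NAT) idea mentioned above is commonly implemented by appending a noise estimate to each frame's input features. The numpy sketch below uses the mean of the first few frames as that estimate, which is a common simple choice and an assumption here, not necessarily the paper's exact estimator.

```python
# Noise-aware training (NAT) feature construction: append a per-utterance noise estimate.
import numpy as np

def nat_features(frames, n_noise_frames=10):
    """frames: (T, D) log-spectral features of one utterance."""
    noise_estimate = frames[:n_noise_frames].mean(axis=0)                     # (D,) noise estimate
    return np.hstack([frames, np.tile(noise_estimate, (len(frames), 1))])     # (T, 2D) NAT input

feats = np.random.randn(300, 40)     # placeholder utterance features
augmented = nat_features(feats)      # input to the DNN/LSTM/CNN VAD models
```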

Book ChapterDOI
08 Oct 2016
TL;DR: This work is the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset, and is seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision.
Abstract: In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion - facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.