Open Access Journal Article

Unsupervised Speech Activity Detection Using Voicing Measures and Perceptual Spectral Flux

Abstract
Effective speech activity detection (SAD) is a necessary first step for robust speech applications. In this letter, we propose a robust and unsupervised SAD solution that leverages four different speech voicing measures combined with a perceptual spectral flux feature, for audio-based surveillance and monitoring applications. Effectiveness of the proposed technique is evaluated and compared against several commonly adopted unsupervised SAD methods under simulated and actual harsh acoustic conditions with varying distortion levels. Experimental results indicate that the proposed SAD scheme is highly effective and provides superior and consistent performance across various noise types and distortion levels.
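
To make the spectral flux idea concrete, here is a minimal sketch of plain (non-perceptual) spectral flux: the Euclidean distance between the magnitude spectra of consecutive frames. This is a generic textbook formulation, not the perceptual variant proposed in the letter, whose weighting and voicing measures are not specified in the abstract. The function names, `frame_len`, `hop`, and the toy signal are all hypothetical choices made for illustration.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitude of the one-sided DFT of a frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def spectral_flux(signal, frame_len=64, hop=32):
    """Euclidean distance between magnitude spectra of consecutive frames.

    frame_len and hop are illustrative defaults, not values from the letter.
    """
    spectra = [
        magnitude_spectrum(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, cur)))
        for prev, cur in zip(spectra, spectra[1:])
    ]


# Toy input (assumed for illustration): 128 samples of silence followed by
# 128 samples of a tone whose 8-sample period divides the frame length and hop.
tone = [math.sin(2 * math.pi * n / 8) for n in range(128)]
signal = [0.0] * 128 + tone
flux = spectral_flux(signal)
```

On this toy input the flux is zero while consecutive frames are both silent, rises where the tone begins, and returns to nearly zero once consecutive frames contain the same steady tone. A detector built on this feature alone would therefore respond to spectral change rather than to speech as such, which is presumably one reason the letter combines it with voicing measures.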

