
Showing papers on "Linear predictive coding published in 2015"


Journal ArticleDOI
TL;DR: In this article, Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) are used for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences.
Abstract: Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature.

203 citations


01 Jan 2015
TL;DR: This article systematically reviews emerging speech generation approaches with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation.
Abstract: Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. This article systematically reviews these emerging speech generation approaches, with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation. In speech signal and information processing, many applications have been formulated as machine-learning tasks. ASR is a typical classification task that predicts word sequences from speech waveforms or feature sequences. There are also many regression tasks in speech processing that are aimed to generate speech signals from various types of inputs. They are referred to as speech generation tasks in this article. Speech generation covers a wide range of research topics in speech processing, such as text-to-speech (TTS) synthesis (generating speech from text), voice conversion (modifying nonlinguistic information of the input speech), speech enhancement (improving speech quality by noise reduction or other processing), and articulatory-to-acoustic mapping (converting articulatory movements to acoustic features). These

189 citations


Journal ArticleDOI
TL;DR: This paper proposes to model the desired speech signal using a general sparse prior that can be represented in a convex form as a maximization over scaled complex Gaussian distributions, which can be interpreted as a generalization of the commonly used time-varying Gaussian model.
Abstract: The quality of speech signals recorded in an enclosure can be severely degraded by room reverberation. In this paper, we focus on a class of blind batch methods for speech dereverberation in a noiseless scenario with a single source, which are based on multi-channel linear prediction in the short-time Fourier transform domain. Dereverberation is performed by maximum-likelihood estimation of the model parameters that are subsequently used to recover the desired speech signal. Contrary to the conventional method, we propose to model the desired speech signal using a general sparse prior that can be represented in a convex form as a maximization over scaled complex Gaussian distributions. The proposed model can be interpreted as a generalization of the commonly used time-varying Gaussian model. Furthermore, we reformulate both the conventional and the proposed method as an optimization problem with an ℓp-norm cost function, emphasizing the role of sparsity in the considered speech dereverberation methods. Experimental evaluations in different acoustic scenarios show that the proposed approach results in improved performance compared to the conventional approach in terms of instrumental measures for speech quality.
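
A minimal sketch of the conventional multi-channel linear prediction (weighted-prediction-error style) iteration in a single STFT frequency bin, using the time-varying Gaussian model rather than the sparse prior proposed in the paper; the function name and parameters are illustrative.

```python
import numpy as np

def mclp_dereverb_bin(X, delay=3, order=10, iters=3, eps=1e-8):
    """X: (M, N) complex STFT coefficients of M microphones in one frequency bin.
    Returns the dereverberated reference-channel signal d of length N."""
    M, N = X.shape
    d = X[0].copy()                                # initialise with the reference channel
    # Stacked, delayed observation matrix of shape (M*order, N).
    Xt = np.zeros((M * order, N), dtype=complex)
    for k in range(order):
        shift = delay + k
        Xt[k * M:(k + 1) * M, shift:] = X[:, :N - shift]
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, eps)      # time-varying variance estimate
        Xw = Xt / lam                              # weight each frame by 1/lambda
        R = Xw @ Xt.conj().T                       # weighted correlation matrix
        p = Xw @ X[0].conj()                       # weighted cross-correlation vector
        g = np.linalg.solve(R + eps * np.eye(M * order), p)
        d = X[0] - g.conj() @ Xt                   # subtract the predicted reverberation
    return d

# Toy usage: two channels, 200 frames of random complex data in one bin.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200))
d = mclp_dereverb_bin(X)
```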

97 citations


Journal ArticleDOI
TL;DR: It is evident from the results that the modified forms of the spectral subtraction method reduce remnant noise significantly and that the enhanced speech contains minimal speech distortion.
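
A minimal sketch of magnitude spectral subtraction with two common modifications (an over-subtraction factor and a spectral floor that limits remnant "musical" noise); the specific modified forms evaluated in the paper are not reproduced here, and the parameter values are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, alpha=2.0, beta=0.02):
    f, t, Y = stft(noisy, fs, nperseg=512)
    mag, phase = np.abs(Y), np.angle(Y)
    # Noise magnitude estimated from the first few (assumed speech-free) frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = mag - alpha * noise_mag            # over-subtraction
    clean_mag = np.maximum(clean_mag, beta * mag)  # spectral floor
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=512)
    return enhanced
```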

80 citations


Proceedings ArticleDOI
Yannis Agiomyrgiannakis1
19 Apr 2015
TL;DR: A new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines and a super fast cosine generator is presented.
Abstract: Vocoders received renewed attention recently as basic components in speech synthesis applications such as voice transformation, voice conversion and statistical parametric speech synthesis. This paper presents a new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines and a super fast cosine generator. Extensive evaluations are made against several state-of-the-art methods in Copy-Synthesis and Text-To-Speech synthesis experiments. Vocaine matches or outperforms STRAIGHT in Copy-Synthesis experiments and outperforms our baseline real-time optimized Mixed-Excitation vocoder with the same computational cost. We report that Vocaine considerably improves our statistical TTS synthesizers and that our new statistical parametric synthesizer [1] matched the quality of our mature production Unit-Selection system with uncompressed waveforms.
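
A minimal sketch of synthesising one non-stationary sinusoid over a frame with a quadratic phase segment (linear frequency interpolation between frame boundaries) and linear amplitude modulation, in the spirit of the AM-FM model; Vocaine's actual spline construction and fast cosine generator are not reproduced.

```python
import numpy as np

def synth_frame(f_start, f_end, a_start, a_end, phi0, n_samples, fs):
    t = np.arange(n_samples) / fs
    T = n_samples / fs
    # Linear frequency interpolation across the frame gives a quadratic phase segment.
    phase = phi0 + 2 * np.pi * (f_start * t + (f_end - f_start) * t ** 2 / (2 * T))
    amp = np.linspace(a_start, a_end, n_samples)              # linear amplitude modulation
    end_phase = phi0 + 2 * np.pi * (f_start + f_end) * T / 2  # phase at the frame boundary
    return amp * np.cos(phase), end_phase

# One 20 ms voiced frame at 16 kHz, frequency gliding from 200 Hz to 220 Hz.
frame, phi_next = synth_frame(200.0, 220.0, 0.5, 0.6, 0.0, 320, 16000)
```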

79 citations


Journal ArticleDOI
TL;DR: Objective and subjective evaluations indicated that the proposed spectral envelope estimation algorithm can obtain a temporally stable spectral envelope and synthesize speech with higher sound quality than speech synthesized with other algorithms.

73 citations


Journal ArticleDOI
TL;DR: This letter presents a novel method to estimate the clean speech phase spectrum, given the noisy speech observation in single-channel speech enhancement, which relies on the phase decomposition of the instantaneous noisy phase spectrum followed by temporal smoothing in order to reduce the large variance of noisy phase.
Abstract: Conventional speech enhancement methods typically utilize the noisy phase spectrum for signal reconstruction. This letter presents a novel method to estimate the clean speech phase spectrum, given the noisy speech observation in single-channel speech enhancement. The proposed method relies on the phase decomposition of the instantaneous noisy phase spectrum followed by temporal smoothing in order to reduce the large variance of noisy phase, and consequently reconstructs an enhanced instantaneous phase spectrum for signal reconstruction. The effectiveness of the proposed method is evaluated in two ways: phase enhancement-only and by quantifying the additional improvement on top of the conventional amplitude enhancement scheme where noisy phase is often used in signal reconstruction. The instrumental metrics predict a consistent improvement in perceived speech quality and speech intelligibility when the noisy phase is enhanced using the proposed phase estimation method.
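
A minimal sketch of the temporal-smoothing idea: averaging unit phasors of the noisy STFT phase along time in each frequency bin and reconstructing with the unmodified amplitude. The harmonic phase decomposition used in the letter is omitted, and the parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def smooth_noisy_phase(noisy, fs, smooth_len=3):
    f, t, Y = stft(noisy, fs, nperseg=256)
    phasor = np.exp(1j * np.angle(Y))                 # unit phasors of the noisy phase
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.zeros_like(phasor)
    for k in range(phasor.shape[0]):                  # temporal smoothing per frequency bin
        smoothed[k] = np.convolve(phasor[k], kernel, mode="same")
    enhanced_phase = np.angle(smoothed)
    # Reconstruct with the original amplitude and the smoothed (enhanced) phase.
    _, x = istft(np.abs(Y) * np.exp(1j * enhanced_phase), fs, nperseg=256)
    return x
```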

51 citations


Proceedings ArticleDOI
06 Sep 2015
TL;DR: The proposed method aims to estimate a spectral envelope from visual features which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal.
Abstract: This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech features. The proposed method aims to estimate a spectral envelope from visual features which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners, and then compared to the intelligibility of the video signal only and when combined with the reconstructed audio.
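
A minimal sketch of the reconstruction stage only: a frame is synthesised by exciting an estimated all-pole (LPC) spectral envelope with an artificial pulse-train excitation. How the envelope is predicted from visual features (statistical model or DNN) is outside this sketch, and the parameter values are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def synth_voiced_frame(lpc_coeffs, gain, f0, n_samples, fs):
    # Artificial excitation: impulse train at the fundamental frequency.
    excitation = np.zeros(n_samples)
    period = int(round(fs / f0))
    excitation[::period] = 1.0
    # All-pole synthesis filter 1/A(z), with lpc_coeffs = [1, a1, ..., ap].
    return lfilter([gain], lpc_coeffs, excitation)

# One 20 ms frame at 16 kHz with a single-pole envelope and f0 = 120 Hz.
frame = synth_voiced_frame(np.array([1.0, -0.9]), 0.1, 120.0, 320, 16000)
```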

47 citations


Journal ArticleDOI
TL;DR: This paper proposes a simple formant-based VAD algorithm to overcome the problem of detecting formants under conditions with severe noise, which achieves a much faster processing time and outperforms standard VAD algorithms under various noise conditions.
Abstract: Voice activity detection (VAD) can be used to distinguish human speech from other sounds, and various applications can benefit from VAD, including speech coding and speech recognition. To accurately detect voice activity, the algorithm must take into account the characteristic features of human speech and/or background noise. In many real-life applications, noise frequently occurs in an unexpected manner, and in such situations, it is difficult to determine the characteristics of noise with sufficient accuracy. As a result, robust VAD algorithms that depend less on making correct noise estimates are desirable for real-life applications. Formants are the major spectral peaks of the human voice, and these are highly useful to distinguish vowel sounds. The characteristics of the spectral peaks are such that these peaks are likely to survive in a signal after severe corruption by noise, and so formants are attractive features for voice activity detection under low signal-to-noise ratio (SNR) conditions. However, it is difficult to accurately extract formants from noisy signals when background noise introduces unrelated spectral peaks. Therefore, this paper proposes a simple formant-based VAD algorithm to overcome the problem of detecting formants under conditions with severe noise. The proposed method achieves a much faster processing time and outperforms standard VAD algorithms under various noise conditions. The proposed method is robust against various types of noise and produces a light computational load, so it is suitable for use in various applications.
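
A simplified sketch of extracting formant candidates from the roots of the LPC polynomial and using the lowest candidate for a per-frame voice activity decision; the thresholds and the decision rule are illustrative, not the paper's algorithm.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    # Autocorrelation (Yule-Walker) method for the LPC coefficients.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))             # A(z) = 1 - sum_k a_k z^-k

def formant_candidates(frame, fs, order=12):
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0]              # one root per complex-conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 90.0]                     # discard near-DC candidates

def is_speech(frame, fs, f1_range=(250.0, 1000.0)):
    f = formant_candidates(frame, fs)
    # Declare speech if the lowest formant candidate falls in a typical F1 range.
    return len(f) > 0 and f1_range[0] <= f[0] <= f1_range[1]
```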

42 citations


Patent
26 Mar 2015
TL;DR: In this article, a system of an environment-sensitive automatic speech recognition is described, which includes steps for obtaining audio data including human speech, determining at least one characteristic of the environment in which the audio data was obtained, and modifying the speech recognition parameters depending on the characteristic.
Abstract: In a system of an environment-sensitive automatic speech recognition, a method includes steps for obtaining audio data including human speech, determining at least one characteristic of the environment in which the audio data was obtained, and modifying at least one parameter to be used to perform speech recognition depending on the characteristic.

35 citations


Journal ArticleDOI
TL;DR: This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries, which results in improved word error rates for speech recognition tasks using HMM-GMM and deep neural network (DNN) based systems.
Abstract: Exemplar-based speech enhancement systems work by decomposing the noisy speech as a weighted sum of speech and noise exemplars stored in a dictionary and use the resulting speech and noise estimates to obtain a time-varying filter in the full-resolution frequency domain to enhance the noisy speech. To obtain the decomposition, exemplars sampled in lower dimensional spaces are preferred over the full-resolution frequency domain for their reduced computational complexity and the ability to better generalize to unseen cases. But the resulting filter may be sub-optimal as the mapping of the obtained speech and noise estimates to the full-resolution frequency domain yields a low-rank approximation. This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries: an input dictionary containing atoms from the desired exemplar space to obtain the decomposition and a coupled output dictionary containing exemplars from the full-resolution frequency domain. We also introduce modulation spectrogram features for the exemplar-based tasks using this approach. The proposed system was evaluated for various choices of input exemplars and yielded improved speech enhancement performances on the AURORA-2 and AURORA-4 databases. We further show that the proposed approach also results in improved word error rates (WERs) for the speech recognition tasks using HMM-GMM and deep-neural network (DNN) based systems.
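
A minimal sketch of the coupled-dictionary idea: activations are estimated by a non-negative decomposition in a low-dimensional input space and then applied to a coupled full-resolution output dictionary to build a Wiener-like filter. Multiplicative updates with a Euclidean cost are used here for brevity; the paper's exemplar spaces and divergence differ.

```python
import numpy as np

def coupled_enhance(Y_in, Y_full, D_in_s, D_in_n, D_out_s, D_out_n, iters=100, eps=1e-12):
    """Y_in: noisy features in the input exemplar space; Y_full: full-resolution spectrogram.
    D_in_* / D_out_*: coupled speech and noise dictionaries in the two spaces."""
    D_in = np.hstack([D_in_s, D_in_n])                       # concatenated input dictionary
    H = np.abs(np.random.default_rng(0).standard_normal((D_in.shape[1], Y_in.shape[1])))
    for _ in range(iters):                                   # NMF with a fixed dictionary
        H *= (D_in.T @ Y_in) / (D_in.T @ D_in @ H + eps)
    ks = D_in_s.shape[1]
    S_full = D_out_s @ H[:ks]                                # full-resolution speech estimate
    N_full = D_out_n @ H[ks:]                                # full-resolution noise estimate
    mask = S_full / (S_full + N_full + eps)                  # Wiener-like time-varying filter
    return mask * Y_full
```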

01 Jan 2015
TL;DR: In this article, a solution using linear predictive coding, the slope of the frequency spectrum and the zero crossing rate was evaluated, and it was concluded that drone detection using audio analysis is possible.
Abstract: Drones used for illegal purposes are a growing problem and a way to detect these is needed. This thesis has evaluated the possibility of using sound analysis as the detection mechanism. A solution using linear predictive coding, the slope of the frequency spectrum and the zero crossing rate was evaluated. The results showed that a solution using linear predictive coding and the slope of the frequency spectrum gives a good result for the distance it is calibrated for. The zero crossing rate, on the other hand, does not improve the result and was not part of the final solution. The amount of false positives increases when calibrating for longer distances, and a compromise between detecting drones at long distances and the number of false positives needs to be made in the implemented solution. It was concluded that drone detection using audio analysis is possible, and that the implemented solution, with linear predictive coding and slope of the frequency spectrum, could with further improvements become a usable product.
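
A minimal sketch of the three frame-level features discussed in the thesis: LPC coefficients, the slope of the frequency spectrum, and the zero crossing rate (which the thesis found unhelpful). The calibration and detection logic is not shown, and the parameter values are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def frame_features(frame, fs, lpc_order=10):
    # Linear predictive coding coefficients (autocorrelation method).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + lpc_order]
    lpc_coeffs = solve_toeplitz(r[:lpc_order], r[1:lpc_order + 1])
    # Slope of the log-magnitude spectrum via a least-squares line fit (dB per Hz).
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    slope = np.polyfit(freqs, 20 * np.log10(spec + 1e-12), 1)[0]
    # Zero crossing rate, kept here only for completeness.
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return lpc_coeffs, slope, zcr
```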

Proceedings ArticleDOI
09 Jul 2015
TL;DR: A novel approach is introduced using a combination of prosody features, quality features, derived features and dynamic features for robust automatic recognition of a speaker's state of emotion in five native Assamese languages.
Abstract: Speech emotion recognition is one of the recent challenges in speech processing and human-computer interaction (HCI), addressing various operational needs of real-world applications. Besides human facial expressions, speech has proven to be one of the most valuable modalities for automatic recognition of human emotions. Speech is a spontaneous medium of perceiving emotions which provides in-depth information related to different cognitive states of a human being. In this context, a novel approach is introduced using a combination of prosody features (pitch, energy, zero crossing rate), quality features (formant frequencies, spectral features, etc.), derived features (Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC)) and a dynamic feature (Mel-energy spectrum dynamic coefficients (MEDC)) for robust automatic recognition of a speaker's state of emotion. A multilevel SVM classifier is used for identification of seven discrete emotional states, namely anger, disgust, fear, happy, neutral, sad and surprise, in five native Assamese languages. The overall results of the conducted experiments revealed that the approach of combining these features achieved an average accuracy rate of 82.26% for speaker-independent cases.

Journal ArticleDOI
TL;DR: Experimental results show that the self-embedding speech signal is recoverable with proper speech quality for high tampering rates, without significant loss in the quality of the original speech signal.
Abstract: Authentication and tampering detection of the digital signals is one of the main applications of the digital watermarking. Recently, watermarking algorithms for digital images are developed to not only detect the image tampering, but also to recover the lost content to some extent. In this paper, a new watermarking scheme is introduced to generate digital self-embedding speech signals enjoying the self-recovery feature. For this purpose, the compressed version of the speech signal generated by a speech codec and protected against the tampering by the proper channel coding is embedded into the original speech signal. Experimental results show that the self-embedding speech signal is recoverable with proper speech quality for high tampering rates, without significant loss in the quality of the original speech signal.

Book ChapterDOI
01 Jan 2015
TL;DR: From the exhaustive analysis, it is evident that HMM performs better than other modeling techniques such as SVM.
Abstract: Speech recognition aims to recognize text from a speech utterance, which can be helpful to people with hearing disabilities. The support vector machine (SVM) and the hidden Markov model (HMM) are widely used techniques for speech recognition systems. Acoustic features, namely linear predictive coding (LPC), linear prediction cepstral coefficients (LPCC) and Mel frequency cepstral coefficients (MFCC), are extracted. Modeling techniques such as SVM and HMM were used to model each individual word, resulting in 620 models trained for the system. Each isolated word segment from the test sentence is matched against these models to find the semantic representation of the test input speech. The performance of the system is evaluated for words related to the computer domain, and the system shows an accuracy of 91.46% for SVM and 98.92% for HMM. From the exhaustive analysis, it is evident that HMM performs better than other modeling techniques such as SVM.
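
A minimal sketch of per-word HMM modelling and isolated-word scoring. The chapter does not name a toolkit; hmmlearn is assumed here purely for illustration, and feature extraction (LPC/LPCC/MFCC) is abstracted away.

```python
import numpy as np
from hmmlearn import hmm

def train_word_model(feature_sequences, n_states=5):
    """feature_sequences: list of (frames, dims) arrays for one word."""
    X = np.vstack(feature_sequences)                     # stack all training utterances
    lengths = [len(seq) for seq in feature_sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def recognise(models, features):
    # Pick the word whose HMM assigns the highest log-likelihood to the test segment.
    return max(models, key=lambda word: models[word].score(features))
```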

Proceedings ArticleDOI
19 Apr 2015
TL;DR: Subjective measurements show that the proposed method gives a statistically significant improvement in perceptual quality when the bit-rate is held constant, and the method has been adopted into the 3GPP Enhanced Voice Services speech coding standard.
Abstract: Unified speech and audio codecs often use a frequency domain coding technique of the transform coded excitation (TCX) type. It is based on modeling the speech source with a linear predictor, spectral weighting by a perceptual model and entropy coding of the frequency components. While previous approaches have used neighbouring frequency components to form a probability model for the entropy coder of spectral components, we propose to use the magnitude of the linear predictor to estimate the variance of spectral components. Since the linear predictor is transmitted in any case, this method does not require any additional side information. Subjective measurements show that the proposed method gives a statistically significant improvement in perceptual quality when the bit-rate is held constant. Consequently, the proposed method has been adopted into the 3GPP Enhanced Voice Services speech coding standard.
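
A minimal sketch of the core idea: the magnitude response of the LP synthesis filter 1/A(z), which the decoder already has, is evaluated on the spectral-line grid and used as a per-line variance estimate for the entropy coder. The arithmetic-coding details of the TCX codec are not reproduced.

```python
import numpy as np
from scipy.signal import freqz

def lp_variance_estimate(lpc_coeffs, n_lines=256):
    # Evaluate |1/A(e^{jw})| on n_lines uniformly spaced frequencies.
    w, h = freqz([1.0], lpc_coeffs, worN=n_lines)
    envelope = np.abs(h)
    return envelope ** 2          # per-line variance proxy, up to a global gain

# Toy usage with a stable second-order predictor A(z) = 1 - 1.2 z^-1 + 0.5 z^-2.
variances = lp_variance_estimate(np.array([1.0, -1.2, 0.5]))
```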

Journal ArticleDOI
TL;DR: A distortion measure is proposed to characterize the deviation of the dynamics of the noisy modified speech from the dynamics of natural speech, and a parametric relationship between the signal band-power before and after modification is derived.
Abstract: Speech intelligibility in noisy environments decreases with an increase in the noise power. We hypothesize that the differences of subsequent short-term spectra of speech, which we collectively refer to as the speech spectral dynamics, can be used to characterize speech intelligibility. We propose a distortion measure to characterize the deviation of the dynamics of the noisy modified speech from the dynamics of natural speech. Optimizing this distortion measure, we derive a parametric relationship between the signal band-power before and after modification. The parametric nature of the solution ensures adaptation to the noise level, the speech statistics and a penalty on the power gain. A multi-band speech modification system based on the single-band optimal solution is designed under a total signal power constraint and evaluated in selected noise conditions. The results indicate that the proposed approach compares favorably to a reference method based on optimizing a measure of the speech intelligibility index. Very low computational complexity and high intelligibility gain make this an attractive approach for speech modification in a wide range of application scenarios.

Journal ArticleDOI
TL;DR: This paper describes the implementation of a speaker-independent, isolated word recognizer for the Assamese language, in which the hidden Markov model toolkit (HTK) has been used to build the different recognition models.
Abstract: This paper describes the work done in the implementation of a speaker-independent, isolated word recognizer for the Assamese language. Linear predictive coding (LPC) analysis, LPC cepstral coefficients (LPCEPSTRA), linear mel-filter bank channel outputs and mel frequency cepstral coefficients (MFCC) are used to get the acoustical features. The hidden Markov model toolkit (HTK), based on the hidden Markov model (HMM), has been used to build the different recognition models. The speech recognition model is trained for 10 Assamese words representing the digits from 0 (shounya) to 9 (no) in the Assamese language using fifteen speakers. Different models were created for each word which varied in the number of input feature values and the number of hidden states. The system obtained a maximum accuracy of 80% for 39 MFCC features and a 7-state HMM model with 5 hidden states for a system with clean data, and a maximum accuracy of 95% for 26 LPCEPSTRA features and a 7-state HMM model with 5 hidden states for a system with noisy data.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: In this article, the reliability of existing instrumental metrics for performance evaluation of phase-aware methods is studied in terms of predicting the perceived quality achieved by phase-aware methods, and novel phase-aware instrumental metrics, including a phase deviation metric, are proposed.
Abstract: To approximate the speech quality of a given speech enhancement system, most of the existing instrumental metrics rely on the calculation of a distortion metric defined between the clean reference signal and the enhanced signal in the spectral amplitude domain. Several recent studies have demonstrated the effectiveness of employing a phase modification stage in single-channel speech enhancement showing positive impact brought by modifying both amplitude and phase in contrast to the conventional methods where the noisy spectral amplitude is only modified and noisy phase is used for signal reconstruction. In this work we present two contributions; First we study the reliability of the existing instrumental metrics for performance evaluation of phase-aware methods, and second we propose novel phase-aware instrumental metrics and evaluate their reliability in terms of predicting the perceived quality achieved by the phase-aware methods. Our objective and subjective evaluations demonstrate that PESQ and the proposed phase deviation metric perform as reliable speech quality estimators following the subjective results.

Proceedings ArticleDOI
Anssi Rämö1, Henri Toukomaa1
19 Apr 2015
TL;DR: Comparison was made to Opus, the IETF-driven open source codec, as well as to industry standard voice codecs (3GPP AMR and AMR-WB, and ITU-T G.718B, G.722.1C and G.719) and to direct signals at varying bandwidths.
Abstract: This paper discusses the voice and audio quality characteristics of EVS, the recently standardized 3GPP codec. Comparison to Opus, IETF driven open source codec as well as industry standard voice codecs: 3GPP AMR and AMR-WB, and ITU-T G.718B, G.722.1C and G.719 as well as direct signals at varying bandwidths was made. Voice and audio quality was evaluated with three subjective listening tests containing clean and noisy speech in Finnish language as well as a mixed condition test containing both speech and music intermixed. Nine-scale subjective mean opinion score was calculated for all tested conditions.

Journal ArticleDOI
31 Dec 2015-Sensors
TL;DR: The experimental results show that human speech can be effectively acquired by a 94 GHz MMW radar sensor when the detection distance is 20 m, and the noise of the radar speech is greatly suppressed and the speech sounds become more pleasant to human listeners after being enhanced by the proposed algorithm.
Abstract: In order to improve the speech acquisition ability of a non-contact method, a 94 GHz millimeter wave (MMW) radar sensor was employed to detect speech signals. This novel non-contact speech acquisition method was shown to have high directional sensitivity, and to be immune to strong acoustical disturbance. However, MMW radar speech is often degraded by combined sources of noise, which mainly include harmonic, electrical circuit and channel noise. In this paper, an algorithm combining empirical mode decomposition (EMD) and mutual information entropy (MIE) was proposed for enhancing the perceptibility and intelligibility of radar speech. Firstly, the radar speech signal was adaptively decomposed into oscillatory components called intrinsic mode functions (IMFs) by EMD. Secondly, MIE was used to determine the number of reconstructive components, and then an adaptive threshold was employed to remove the noise from the radar speech. The experimental results show that human speech can be effectively acquired by a 94 GHz MMW radar sensor when the detection distance is 20 m. Moreover, the noise of the radar speech is greatly suppressed and the speech sounds become more pleasant to human listeners after being enhanced by the proposed algorithm, suggesting that this novel speech acquisition and enhancement method will provide a promising alternative for various applications associated with speech detection.
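
A minimal sketch of the EMD-based enhancement stage. The PyEMD package is assumed here for the decomposition, and the mutual-information-entropy rule that the paper uses to choose the reconstruction components is replaced by a fixed index purely for illustration.

```python
import numpy as np
from PyEMD import EMD

def emd_denoise(radar_speech, keep_from=2):
    imfs = EMD().emd(radar_speech)          # adaptive decomposition into intrinsic mode functions
    # Discard the first IMFs (highest-frequency, typically noise-dominated)
    # and reconstruct the speech from the remaining components.
    return np.sum(imfs[keep_from:], axis=0)
```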

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A new signal-noise-dependent (SND) deep neural network (DNN) framework to further improve the separation and recognition performance of the recently developed technique for general DNN-based speech separation and Experimental results on the Speech Separation Challenge (SSC) task show that SND-DNNs could yield significant performance improvements.
Abstract: In this paper, we propose a new signal-noise-dependent (SND) deep neural network (DNN) framework to further improve the separation and recognition performance of the recently developed technique for general DNN-based speech separation. We adopt a divide and conquer strategy to design the proposed SND-DNNs with higher resolutions that a single general DNN could not well accommodate for all the speaker mixing variabilities at different levels of signal-to-noise ratios (SNRs). In this study two kinds of SNR-dependent DNNs, namely positive and negative DNNs, are trained to cover the mixed speech signals with positive and negative SNR levels, respectively. At the separation stage, a first-pass separation using a general DNN can give an accurate SNR estimation for a model selection. Experimental results on the Speech Separation Challenge (SSC) task show that SND-DNNs could yield significant performance improvements for both speech separation and recognition over a general DNN. Furthermore, this purely front-end processing method achieves a relative word error rate reduction of 11.6% over a state-of-the-art recognition system where a complicated joint decoding framework needs to be implemented in the back-end.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: By constructing speech signals that lie in between natural speech and the output from a complete HMM synthesis system, this work manipulates the temporal smoothness and the variance of the spectral parameters to create stimuli, and listeners made 'same or different' pairwise judgements, from which a perceptual map is generated using Multidimensional Scaling.
Abstract: Even the best statistical parametric speech synthesis systems do not achieve the naturalness of good unit selection. We investigated possible causes of this. By constructing speech signals that lie in between natural speech and the output from a complete HMM synthesis system, we investigated various effects of modelling. We manipulated the temporal smoothness and the variance of the spectral parameters to create stimuli, then presented these to listeners alongside natural and vocoded speech, as well as output from a full HMM-based text-to-speech system and from an idealised ‘pseudo-HMM’. All speech signals, except the natural waveform, were created using vocoders employing one of two popular spectral parameterisations: Mel-Cepstra or Mel-Line Spectral Pairs. Listeners made ‘same or different’ pairwise judgements, from which we generated a perceptual map using Multidimensional Scaling. We draw conclusions about which aspects of HMM synthesis are limiting the naturalness of the synthetic speech.

Journal ArticleDOI
TL;DR: Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods.
Abstract: As a means of speech separation, time-frequency masking applies a gain function to the time-frequency representation of noisy speech. On the other hand, nonnegative matrix factorization (NMF) addresses separation by linearly combining basis vectors from speech and noise models to approximate noisy speech. This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. An ideal ratio mask is estimated, which separates speech from noise with reasonable sound quality. A deep neural network then approximates clean speech by estimating activation weights from the ratio-masked speech, where the weights linearly combine elements from a NMF speech model. Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods. In addition, a listening test was performed and its results show that the output of the proposed algorithm is preferred over the comparison systems in terms of speech quality.
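
A minimal sketch of the final combination step: the ratio-masked spectrogram is fed to a network that predicts activation weights, and clean speech is approximated as a non-negative combination of the NMF speech basis. The mask estimator and the trained DNN are abstracted away; the placeholder "network" below is only for illustration.

```python
import numpy as np

def reconstruct_speech(noisy_mag, ratio_mask, speech_basis, activation_net):
    masked = ratio_mask * noisy_mag                     # time-frequency masking stage
    H = activation_net(masked)                          # DNN -> activations, shape (rank, frames)
    return speech_basis @ H                             # NMF approximation W @ H

# Toy usage with a dummy "network" that simply projects onto the basis.
rng = np.random.default_rng(0)
W = np.abs(rng.standard_normal((257, 40)))              # speech basis (freq x rank)
net = lambda M: np.maximum(np.linalg.pinv(W) @ M, 0.0)  # placeholder for the trained DNN
clean_est = reconstruct_speech(np.abs(rng.standard_normal((257, 100))),
                               np.full((257, 100), 0.7), W, net)
```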

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A noise robust overlapped speech detection algorithm to estimate the likelihood of overlapping speech in a given audio file in the presence of environment noise and achieves 35% relative improvement over previous efforts, which included speech enhancement using spectral subtraction and silence removal.
Abstract: The ability to estimate the number of words spoken by an individual over a certain period of time is valuable in second language acquisition, healthcare, and assessing language development. However, establishing a robust automatic framework to achieve high accuracy is non-trivial in realistic/naturalistic scenarios due to various factors such as different styles of conversation or types of noise that appear in audio recordings, especially in multi-party conversations. In this study, we propose a noise robust overlapped speech detection algorithm to estimate the likelihood of overlapping speech in a given audio file in the presence of environment noise. This information is embedded into a word-count estimator, which uses a linear minimum mean square estimator (LMMSE) to predict the number of words from the syllable rate. Syllables are detected using a modified version of the mrate algorithm. The proposed word-count estimator is tested on long duration files from the Prof-Life-Log corpus. Data is recorded using a LENA recording device, worn by a primary speaker in various environments and under different noise conditions. The overlap detection system significantly outperforms baseline performance in noisy conditions. Furthermore, applying overlap detection results to word-count estimation achieves 35% relative improvement over our previous efforts, which included speech enhancement using spectral subtraction and silence removal.

Journal ArticleDOI
TL;DR: The research presented in this paper automatically recognizes a singer without separating instrumental and singing sounds, using audio features like timbre coefficients, pitch class, mel frequency cepstral coefficients, linear predictive coding coefficients, and loudness of an audio signal from Indian video songs (IVS).
Abstract: Singer identification is a difficult topic in music information retrieval because background instrumental music is included with the singing voice, which reduces the performance of a system. One of the main disadvantages of existing systems is that vocals and instrumentals are separated manually and only vocals are used to build the training model. The research presented in this paper automatically recognizes a singer without separating instrumental and singing sounds, using audio features like timbre coefficients, pitch class, mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC) coefficients, and loudness of an audio signal from Indian video songs (IVS). Initially, various IVS of distinct playback singers (PS) are collected. After that, 53 audio features (a 12-dimensional timbre audio feature vector, 12 pitch classes, 13 MFCC coefficients, 13 LPC coefficients, and a 3-dimensional loudness feature vector of an audio signal) are extracted from each segment. The dimension of the extracted audio features is reduced using the principal component analysis (PCA) method. A playback singer model (PSM) is trained using multiclass classification algorithms like back propagation, AdaBoost.M2, the k-nearest neighbor (KNN) algorithm, the naive Bayes classifier (NBC), and the Gaussian mixture model (GMM). The proposed approach is tested on various combinations of the dataset and different combinations of audio feature vectors with various Indian male and female PS's songs.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: Experimental results show that the proposed method improved instrumental speech quality measures, where using speech temporal dynamics was found to be beneficial in severe reverberation conditions.
Abstract: This paper proposes a single-channel speech dereverberation method enhancing the spectrum of the reverberant speech signal. The proposed method uses a non-negative approximation of the convolutive transfer function (N-CTF) to simultaneously estimate the magnitude spectrograms of the speech signal and the room impulse response (RIR). To utilize the speech spectral structure, we propose to model the speech spectrum using non-negative matrix factorization, which is directly used in the N-CTF model resulting in a new cost function. We derive new estimators for the parameters by minimizing the obtained cost function. Additionally, to investigate the effect of the speech temporal dynamics for dereverberation, we use a frame stacking method and derive optimal estimators. Experiments are performed for two measured RIRs and the performance of the proposed method is compared to the performance of a state-of-the-art dereverberation method enhancing the speech spectrum. Experimental results show that the proposed method improved instrumental speech quality measures, where using speech temporal dynamics was found to be beneficial in severe reverberation conditions.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed VVQ outperforms the recently introduced Dirichlet mixture model-based VQ and the conventional Gaussian mixture model-based VQ, in terms of modeling performance and D-R relation.

Journal ArticleDOI
TL;DR: A good correspondence between the data and the corresponding sEPSM predictions was obtained when the noise was manipulated and mixed with the unprocessed speech, consistent with the hypothesis that SNRenv is indicative of speech intelligibility.
Abstract: Jorgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475–1487] suggested a metric for speech intelligibility prediction based on the signal-to-noise envelope power ratio (SNRenv), calculated at the output of a modulation-frequency selective process. In the framework of the speech-based envelope power spectrum model (sEPSM), the SNRenv was demonstrated to account for speech intelligibility data in various conditions with linearly and nonlinearly processed noisy speech, as well as for conditions with stationary and fluctuating interferers. Here, the relation between the SNRenv and speech intelligibility was investigated further by systematically varying the modulation power of either the speech or the noise before mixing the two components, while keeping the overall power ratio of the two components constant. A good correspondence between the data and the corresponding sEPSM predictions was obtained when the noise was manipulated and mixed with the unprocessed speech, consistent with the hypothesis that SNRenv is indicative of speech intelligibility.
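
A minimal sketch of the SNRenv idea in a single channel and a single modulation band: the envelope power of the speech divided by the envelope power of the noise after modulation-frequency selective filtering. The full sEPSM uses a gammatone filterbank, a modulation filterbank and further stages that are omitted here, and the modulation band chosen below is illustrative.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def envelope_power(x, fs, mod_band=(1.0, 8.0)):
    env = np.abs(hilbert(x))                  # Hilbert envelope of the (sub-band) signal
    env = env - np.mean(env)                  # remove DC before computing modulation power
    sos = butter(2, mod_band, btype="bandpass", fs=fs, output="sos")
    return np.mean(sosfiltfilt(sos, env) ** 2)

def snr_env_db(speech, noise, fs):
    # Envelope power ratio of the separate speech and noise components, in dB.
    return 10 * np.log10(envelope_power(speech, fs) / envelope_power(noise, fs))
```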