
Showing papers on "Linear predictive coding published in 2015"


Journal ArticleDOI
TL;DR: In this article, Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs) are used for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences.
Abstract: Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature.

203 citations


01 Jan 2015
TL;DR: This article systematically reviews emerging speech generation approaches with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation.
Abstract: Hidden Markov models (HMMs) and Gaussian mixture models (GMMs) are the two most common types of acoustic models used in statistical parametric approaches for generating low-level speech waveforms from high-level symbolic inputs via intermediate acoustic feature sequences. However, these models have their limitations in representing complex, nonlinear relationships between the speech generation inputs and the acoustic features. Inspired by the intrinsically hierarchical process of human speech production and by the successful application of deep neural networks (DNNs) to automatic speech recognition (ASR), deep learning techniques have also been applied successfully to speech generation, as reported in recent literature. This article systematically reviews these emerging speech generation approaches, with the dual goal of helping readers gain a better understanding of the existing techniques as well as stimulating new work in the burgeoning area of deep learning for parametric speech generation. In speech signal and information processing, many applications have been formulated as machine-learning tasks. ASR is a typical classification task that predicts word sequences from speech waveforms or feature sequences. There are also many regression tasks in speech processing that are aimed to generate speech signals from various types of inputs. They are referred to as speech generation tasks in this article. Speech generation covers a wide range of research topics in speech processing, such as text-to-speech (TTS) synthesis (generating speech from text), voice conversion (modifying nonlinguistic information of the input speech), speech enhancement (improving speech quality by noise reduction or other processing), and articulatory-to-acoustic mapping (converting articulatory movements to acoustic features). These

189 citations


Journal ArticleDOI
TL;DR: This paper proposes to model the desired speech signal using a general sparse prior that can be represented in a convex form as a maximization over scaled complex Gaussian distributions, which can be interpreted as a generalization of the commonly used time-varying Gaussian model.
Abstract: The quality of speech signals recorded in an enclosure can be severely degraded by room reverberation. In this paper, we focus on a class of blind batch methods for speech dereverberation in a noiseless scenario with a single source, which are based on multi-channel linear prediction in the short-time Fourier transform domain. Dereverberation is performed by maximum-likelihood estimation of the model parameters that are subsequently used to recover the desired speech signal. Contrary to the conventional method, we propose to model the desired speech signal using a general sparse prior that can be represented in a convex form as a maximization over scaled complex Gaussian distributions. The proposed model can be interpreted as a generalization of the commonly used time-varying Gaussian model. Furthermore, we reformulate both the conventional and the proposed method as an optimization problem with an ℓp-norm cost function, emphasizing the role of sparsity in the considered speech dereverberation methods. Experimental evaluations in different acoustic scenarios show that the proposed approach results in improved performance compared to the conventional approach in terms of instrumental measures for speech quality.
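
A minimal sketch of the conventional multi-channel linear prediction (weighted-prediction-error style) iteration in a single STFT frequency bin, using the time-varying Gaussian model rather than the sparse prior proposed in the paper; the function name and parameters are illustrative.

```python
import numpy as np

def mclp_dereverb_bin(X, delay=3, order=10, iters=3, eps=1e-8):
    """X: (M, N) complex STFT coefficients of M microphones in one frequency bin.
    Returns the dereverberated reference-channel signal d of length N."""
    M, N = X.shape
    d = X[0].copy()                                # initialise with the reference channel
    # Stacked, delayed observation matrix of shape (M*order, N).
    Xt = np.zeros((M * order, N), dtype=complex)
    for k in range(order):
        shift = delay + k
        Xt[k * M:(k + 1) * M, shift:] = X[:, :N - shift]
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, eps)      # time-varying variance estimate
        Xw = Xt / lam                              # weight each frame by 1/lambda
        R = Xw @ Xt.conj().T                       # weighted correlation matrix
        p = Xw @ X[0].conj()                       # weighted cross-correlation vector
        g = np.linalg.solve(R + eps * np.eye(M * order), p)
        d = X[0] - g.conj() @ Xt                   # subtract the predicted reverberation
    return d

# Toy usage: two channels, 200 frames of random complex data in one bin.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200)) + 1j * rng.standard_normal((2, 200))
d = mclp_dereverb_bin(X)
```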

97 citations


Journal ArticleDOI
TL;DR: It is evident from the results that the modified forms of the spectral subtraction method reduce remnant noise significantly and that the enhanced speech contains minimal speech distortion.
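
A minimal sketch of magnitude spectral subtraction with two common modifications (an over-subtraction factor and a spectral floor that limits remnant "musical" noise); the specific modified forms evaluated in the paper are not reproduced here, and the parameter values are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, alpha=2.0, beta=0.02):
    f, t, Y = stft(noisy, fs, nperseg=512)
    mag, phase = np.abs(Y), np.angle(Y)
    # Noise magnitude estimated from the first few (assumed speech-free) frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = mag - alpha * noise_mag            # over-subtraction
    clean_mag = np.maximum(clean_mag, beta * mag)  # spectral floor
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs, nperseg=512)
    return enhanced
```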

80 citations


Proceedings ArticleDOI
Yannis Agiomyrgiannakis1
19 Apr 2015
TL;DR: A new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines and a super fast cosine generator is presented.
Abstract: Vocoders received renewed attention recently as basic components in speech synthesis applications such as voice transformation, voice conversion and statistical parametric speech synthesis. This paper presents a new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using quadratic phase splines and a super fast cosine generator. Extensive evaluations are made against several state-of-the-art methods in Copy-Synthesis and Text-To-Speech synthesis experiments. Vocaine matches or outperforms STRAIGHT in Copy-Synthesis experiments and outperforms our baseline real-time optimized Mixed-Excitation vocoder with the same computational cost. We report that Vocaine considerably improves our statistical TTS synthesizers and that our new statistical parametric synthesizer [1] matched the quality of our mature production Unit-Selection system with uncompressed waveforms.
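
A minimal sketch of synthesising one non-stationary sinusoid over a frame with a quadratic phase segment (linear frequency interpolation between frame boundaries) and linear amplitude modulation, in the spirit of the AM-FM model; Vocaine's actual spline construction and fast cosine generator are not reproduced.

```python
import numpy as np

def synth_frame(f_start, f_end, a_start, a_end, phi0, n_samples, fs):
    t = np.arange(n_samples) / fs
    T = n_samples / fs
    # Linear frequency interpolation across the frame gives a quadratic phase segment.
    phase = phi0 + 2 * np.pi * (f_start * t + (f_end - f_start) * t ** 2 / (2 * T))
    amp = np.linspace(a_start, a_end, n_samples)              # linear amplitude modulation
    end_phase = phi0 + 2 * np.pi * (f_start + f_end) * T / 2  # phase at the frame boundary
    return amp * np.cos(phase), end_phase

# One 20 ms voiced frame at 16 kHz, frequency gliding from 200 Hz to 220 Hz.
frame, phi_next = synth_frame(200.0, 220.0, 0.5, 0.6, 0.0, 320, 16000)
```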

79 citations


Journal ArticleDOI
TL;DR: Objective and subjective evaluations indicated that the proposed spectral envelope estimation algorithm can obtain a temporally stable spectral envelope and synthesize speech with higher sound quality than speech synthesized with other algorithms.

73 citations


Journal ArticleDOI
TL;DR: This letter presents a novel method to estimate the clean speech phase spectrum, given the noisy speech observation in single-channel speech enhancement, which relies on the phase decomposition of the instantaneous noisy phase spectrum followed by temporal smoothing in order to reduce the large variance of noisy phase.
Abstract: Conventional speech enhancement methods typically utilize the noisy phase spectrum for signal reconstruction. This letter presents a novel method to estimate the clean speech phase spectrum, given the noisy speech observation in single-channel speech enhancement. The proposed method relies on the phase decomposition of the instantaneous noisy phase spectrum followed by temporal smoothing in order to reduce the large variance of noisy phase, and consequently reconstructs an enhanced instantaneous phase spectrum for signal reconstruction. The effectiveness of the proposed method is evaluated in two ways: phase enhancement-only and by quantifying the additional improvement on top of the conventional amplitude enhancement scheme where noisy phase is often used in signal reconstruction. The instrumental metrics predict a consistent improvement in perceived speech quality and speech intelligibility when the noisy phase is enhanced using the proposed phase estimation method.
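
A minimal sketch of the temporal-smoothing idea: averaging unit phasors of the noisy STFT phase along time in each frequency bin and reconstructing with the unmodified amplitude. The harmonic phase decomposition used in the letter is omitted, and the parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def smooth_noisy_phase(noisy, fs, smooth_len=3):
    f, t, Y = stft(noisy, fs, nperseg=256)
    phasor = np.exp(1j * np.angle(Y))                 # unit phasors of the noisy phase
    kernel = np.ones(smooth_len) / smooth_len
    smoothed = np.zeros_like(phasor)
    for k in range(phasor.shape[0]):                  # temporal smoothing per frequency bin
        smoothed[k] = np.convolve(phasor[k], kernel, mode="same")
    enhanced_phase = np.angle(smoothed)
    # Reconstruct with the original amplitude and the smoothed (enhanced) phase.
    _, x = istft(np.abs(Y) * np.exp(1j * enhanced_phase), fs, nperseg=256)
    return x
```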

51 citations


Proceedings ArticleDOI
06 Sep 2015
TL;DR: The proposed method aims to estimate a spectral envelope from visual features which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal.
Abstract: This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech features. The proposed method aims to estimate a spectral envelope from visual features which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners, and then compared to the intelligibility of the video signal only and when combined with the reconstructed audio.
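
A minimal sketch of the reconstruction stage only: a frame is synthesised by exciting an estimated all-pole (LPC) spectral envelope with an artificial pulse-train excitation. How the envelope is predicted from visual features (statistical model or DNN) is outside this sketch, and the parameter values are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def synth_voiced_frame(lpc_coeffs, gain, f0, n_samples, fs):
    # Artificial excitation: impulse train at the fundamental frequency.
    excitation = np.zeros(n_samples)
    period = int(round(fs / f0))
    excitation[::period] = 1.0
    # All-pole synthesis filter 1/A(z), with lpc_coeffs = [1, a1, ..., ap].
    return lfilter([gain], lpc_coeffs, excitation)

# One 20 ms frame at 16 kHz with a single-pole envelope and f0 = 120 Hz.
frame = synth_voiced_frame(np.array([1.0, -0.9]), 0.1, 120.0, 320, 16000)
```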

47 citations


Journal ArticleDOI
TL;DR: This paper proposes a simple formant-based VAD algorithm to overcome the problem of detecting formants under conditions with severe noise, which achieves a much faster processing time and outperforms standard VAD algorithms under various noise conditions.
Abstract: Voice activity detection (VAD) can be used to distinguish human speech from other sounds, and various applications can benefit from VAD, including speech coding and speech recognition. To accurately detect voice activity, the algorithm must take into account the characteristic features of human speech and/or background noise. In many real-life applications, noise frequently occurs in an unexpected manner, and in such situations, it is difficult to determine the characteristics of noise with sufficient accuracy. As a result, robust VAD algorithms that depend less on making correct noise estimates are desirable for real-life applications. Formants are the major spectral peaks of the human voice, and these are highly useful to distinguish vowel sounds. The characteristics of the spectral peaks are such that these peaks are likely to survive in a signal after severe corruption by noise, and so formants are attractive features for voice activity detection under low signal-to-noise ratio (SNR) conditions. However, it is difficult to accurately extract formants from noisy signals when background noise introduces unrelated spectral peaks. Therefore, this paper proposes a simple formant-based VAD algorithm to overcome the problem of detecting formants under conditions with severe noise. The proposed method achieves a much faster processing time and outperforms standard VAD algorithms under various noise conditions. The proposed method is robust against various types of noise and produces a light computational load, so it is suitable for use in various applications.
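
A simplified sketch of extracting formant candidates from the roots of the LPC polynomial and using the lowest candidate for a per-frame voice activity decision; the thresholds and the decision rule are illustrative, not the paper's algorithm.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order=12):
    # Autocorrelation (Yule-Walker) method for the LPC coefficients.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.concatenate(([1.0], -a))             # A(z) = 1 - sum_k a_k z^-k

def formant_candidates(frame, fs, order=12):
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0]              # one root per complex-conjugate pair
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    return freqs[freqs > 90.0]                     # discard near-DC candidates

def is_speech(frame, fs, f1_range=(250.0, 1000.0)):
    f = formant_candidates(frame, fs)
    # Declare speech if the lowest formant candidate falls in a typical F1 range.
    return len(f) > 0 and f1_range[0] <= f[0] <= f1_range[1]
```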

42 citations


Patent
26 Mar 2015
TL;DR: In this article, a system of an environment-sensitive automatic speech recognition is described, which includes steps for obtaining audio data including human speech, determining at least one characteristic of the environment in which the audio data was obtained, and modifying the speech recognition parameters depending on the characteristic.
Abstract: In a system of an environment-sensitive automatic speech recognition, a method includes steps for obtaining audio data including human speech, determining at least one characteristic of the environment in which the audio data was obtained, and modifying at least one parameter to be used to perform speech recognition depending on the characteristic.

35 citations


Journal ArticleDOI
TL;DR: This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries, which results in improved word error rates for speech recognition tasks using HMM-GMM and deep neural network (DNN) based systems.
Abstract: Exemplar-based speech enhancement systems work by decomposing the noisy speech as a weighted sum of speech and noise exemplars stored in a dictionary and use the resulting speech and noise estimates to obtain a time-varying filter in the full-resolution frequency domain to enhance the noisy speech. To obtain the decomposition, exemplars sampled in lower dimensional spaces are preferred over the full-resolution frequency domain for their reduced computational complexity and the ability to better generalize to unseen cases. But the resulting filter may be sub-optimal as the mapping of the obtained speech and noise estimates to the full-resolution frequency domain yields a low-rank approximation. This paper proposes an efficient way to directly compute the full-resolution frequency estimates of speech and noise using coupled dictionaries: an input dictionary containing atoms from the desired exemplar space to obtain the decomposition and a coupled output dictionary containing exemplars from the full-resolution frequency domain. We also introduce modulation spectrogram features for the exemplar-based tasks using this approach. The proposed system was evaluated for various choices of input exemplars and yielded improved speech enhancement performances on the AURORA-2 and AURORA-4 databases. We further show that the proposed approach also results in improved word error rates (WERs) for the speech recognition tasks using HMM-GMM and deep-neural network (DNN) based systems.
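
A minimal sketch of the coupled-dictionary idea: activations are estimated by a non-negative decomposition in a low-dimensional input space and then applied to a coupled full-resolution output dictionary to build a Wiener-like filter. Multiplicative updates with a Euclidean cost are used here for brevity; the paper's exemplar spaces and divergence differ.

```python
import numpy as np

def coupled_enhance(Y_in, Y_full, D_in_s, D_in_n, D_out_s, D_out_n, iters=100, eps=1e-12):
    """Y_in: noisy features in the input exemplar space; Y_full: full-resolution spectrogram.
    D_in_* / D_out_*: coupled speech and noise dictionaries in the two spaces."""
    D_in = np.hstack([D_in_s, D_in_n])                       # concatenated input dictionary
    H = np.abs(np.random.default_rng(0).standard_normal((D_in.shape[1], Y_in.shape[1])))
    for _ in range(iters):                                   # NMF with a fixed dictionary
        H *= (D_in.T @ Y_in) / (D_in.T @ D_in @ H + eps)
    ks = D_in_s.shape[1]
    S_full = D_out_s @ H[:ks]                                # full-resolution speech estimate
    N_full = D_out_n @ H[ks:]                                # full-resolution noise estimate
    mask = S_full / (S_full + N_full + eps)                  # Wiener-like time-varying filter
    return mask * Y_full
```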

01 Jan 2015
TL;DR: In this article, a solution using linear predictive coding, the slope of the frequency spectrum and the zero crossing rate was evaluated, and it was concluded that drone detection using audio analysis is possible.
Abstract: Drones used for illegal purposes are a growing problem and a way to detect these is needed. This thesis has evaluated the possibility of using sound analysis as the detection mechanism. A solution using linear predictive coding, the slope of the frequency spectrum and the zero crossing rate was evaluated. The results showed that a solution using linear predictive coding and the slope of the frequency spectrum gives a good result for the distance it is calibrated for. The zero crossing rate, on the other hand, does not improve the result and was not part of the final solution. The amount of false positives increases when calibrating for longer distances, and a compromise between detecting drones at long distances and the number of false positives needs to be made in the implemented solution. It was concluded that drone detection using audio analysis is possible, and that the implemented solution, with linear predictive coding and slope of the frequency spectrum, could with further improvements become a usable product.
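
A minimal sketch of the three frame-level features discussed in the thesis: LPC coefficients, the slope of the frequency spectrum, and the zero crossing rate (which the thesis found unhelpful). The calibration and detection logic is not shown, and the parameter values are illustrative.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def frame_features(frame, fs, lpc_order=10):
    # Linear predictive coding coefficients (autocorrelation method).
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + lpc_order]
    lpc_coeffs = solve_toeplitz(r[:lpc_order], r[1:lpc_order + 1])
    # Slope of the log-magnitude spectrum via a least-squares line fit (dB per Hz).
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    slope = np.polyfit(freqs, 20 * np.log10(spec + 1e-12), 1)[0]
    # Zero crossing rate, kept here only for completeness.
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return lpc_coeffs, slope, zcr
```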

Proceedings ArticleDOI
09 Jul 2015
TL;DR: A novel approach is introduced using a combination of prosody features, quality features, derived features and dynamic features for robust automatic recognition of a speaker's state of emotion in five native Assamese languages.
Abstract: Speech emotion recognition is one of the recent challenges in speech processing and human-computer interaction (HCI), addressing various operational needs of real-world applications. Besides human facial expressions, speech has proven to be one of the most valuable modalities for automatic recognition of human emotions. Speech is a spontaneous medium of perceiving emotions which provides in-depth information related to different cognitive states of a human being. In this context, a novel approach is introduced using a combination of prosody features (pitch, energy, zero crossing rate), quality features (formant frequencies, spectral features, etc.), derived features (Mel-frequency cepstral coefficients (MFCC), linear predictive coding coefficients (LPCC)) and a dynamic feature (Mel-energy spectrum dynamic coefficients (MEDC)) for robust automatic recognition of a speaker's state of emotion. A multilevel SVM classifier is used for identification of seven discrete emotional states, namely anger, disgust, fear, happy, neutral, sad and surprise, in five native Assamese languages. The overall results of the conducted experiments revealed that the approach of combining these features achieved an average accuracy rate of 82.26% for speaker-independent cases.

Journal ArticleDOI
TL;DR: Experimental results show that the self-embedding speech signal is recoverable with proper speech quality for high tampering rates, without significant loss in the quality of the original speech signal.
Abstract: Authentication and tampering detection of the digital signals is one of the main applications of the digital watermarking. Recently, watermarking algorithms for digital images are developed to not only detect the image tampering, but also to recover the lost content to some extent. In this paper, a new watermarking scheme is introduced to generate digital self-embedding speech signals enjoying the self-recovery feature. For this purpose, the compressed version of the speech signal generated by a speech codec and protected against the tampering by the proper channel coding is embedded into the original speech signal. Experimental results show that the self-embedding speech signal is recoverable with proper speech quality for high tampering rates, without significant loss in the quality of the original speech signal.

Book ChapterDOI
01 Jan 2015
TL;DR: From the exhaustive analysis, it is evident that HMM performs better than other modeling techniques such as SVM.
Abstract: Speech recognition aims to recognize text from a speech utterance, which can be helpful to people with hearing disabilities. The support vector machine (SVM) and the hidden Markov model (HMM) are widely used techniques for speech recognition systems. Acoustic features, namely linear predictive coding (LPC), linear prediction cepstral coefficients (LPCC) and Mel frequency cepstral coefficients (MFCC), are extracted. Modeling techniques such as SVM and HMM were used to model each individual word, resulting in 620 models trained for the system. Each isolated word segment from the test sentence is matched against these models to find the semantic representation of the test input speech. The performance of the system is evaluated for words related to the computer domain, and the system shows an accuracy of 91.46% for SVM and 98.92% for HMM. From the exhaustive analysis, it is evident that HMM performs better than other modeling techniques such as SVM.
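
A minimal sketch of per-word HMM modelling and isolated-word scoring. The chapter does not name a toolkit; hmmlearn is assumed here purely for illustration, and feature extraction (LPC/LPCC/MFCC) is abstracted away.

```python
import numpy as np
from hmmlearn import hmm

def train_word_model(feature_sequences, n_states=5):
    """feature_sequences: list of (frames, dims) arrays for one word."""
    X = np.vstack(feature_sequences)                     # stack all training utterances
    lengths = [len(seq) for seq in feature_sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def recognise(models, features):
    # Pick the word whose HMM assigns the highest log-likelihood to the test segment.
    return max(models, key=lambda word: models[word].score(features))
```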

Proceedings ArticleDOI
19 Apr 2015
TL;DR: Subjective measurements show that the proposed method gives a statistically significant improvement in perceptual quality when the bit-rate is held constant, and the method has been adopted into the 3GPP Enhanced Voice Services speech coding standard.
Abstract: Unified speech and audio codecs often use a frequency domain coding technique of the transform coded excitation (TCX) type. It is based on modeling the speech source with a linear predictor, spectral weighting by a perceptual model and entropy coding of the frequency components. While previous approaches have used neighbouring frequency components to form a probability model for the entropy coder of spectral components, we propose to use the magnitude of the linear predictor to estimate the variance of spectral components. Since the linear predictor is transmitted in any case, this method does not require any additional side information. Subjective measurements show that the proposed method gives a statistically significant improvement in perceptual quality when the bit-rate is held constant. Consequently, the proposed method has been adopted into the 3GPP Enhanced Voice Services speech coding standard.
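
A minimal sketch of the core idea: the magnitude response of the LP synthesis filter 1/A(z), which the decoder already has, is evaluated on the spectral-line grid and used as a per-line variance estimate for the entropy coder. The arithmetic-coding details of the TCX codec are not reproduced.

```python
import numpy as np
from scipy.signal import freqz

def lp_variance_estimate(lpc_coeffs, n_lines=256):
    # Evaluate |1/A(e^{jw})| on n_lines uniformly spaced frequencies.
    w, h = freqz([1.0], lpc_coeffs, worN=n_lines)
    envelope = np.abs(h)
    return envelope ** 2          # per-line variance proxy, up to a global gain

# Toy usage with a stable second-order predictor A(z) = 1 - 1.2 z^-1 + 0.5 z^-2.
variances = lp_variance_estimate(np.array([1.0, -1.2, 0.5]))
```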

Journal ArticleDOI
TL;DR: A distortion measure is proposed to characterize the deviation of the dynamics of the noisy modified speech from the dynamics of natural speech, and a parametric relationship between the signal band-power before and after modification is derived.
Abstract: Speech intelligibility in noisy environments decreases with an increase in the noise power. We hypothesize that the differences of subsequent short-term spectra of speech, which we collectively refer to as the speech spectral dynamics, can be used to characterize speech intelligibility. We propose a distortion measure to characterize the deviation of the dynamics of the noisy modified speech from the dynamics of natural speech. Optimizing this distortion measure, we derive a parametric relationship between the signal band-power before and after modification. The parametric nature of the solution ensures adaptation to the noise level, the speech statistics and a penalty on the power gain. A multi-band speech modification system based on the single-band optimal solution is designed under a total signal power constraint and evaluated in selected noise conditions. The results indicate that the proposed approach compares favorably to a reference method based on optimizing a measure of the speech intelligibility index. Very low computational complexity and high intelligibility gain make this an attractive approach for speech modification in a wide range of application scenarios.

Journal ArticleDOI
TL;DR: This paper describes the implementation of a speaker-independent, isolated word recognizer for the Assamese language, in which the hidden Markov model toolkit (HTK) has been used to build the different recognition models.
Abstract: This paper describes the work done in the implementation of a speaker-independent, isolated word recognizer for the Assamese language. Linear predictive coding (LPC) analysis, LPC cepstral coefficients (LPCEPSTRA), linear mel-filter bank channel outputs and mel frequency cepstral coefficients (MFCC) are used to get the acoustical features. The hidden Markov model toolkit (HTK), based on the hidden Markov model (HMM), has been used to build the different recognition models. The speech recognition model is trained for 10 Assamese words representing the digits from 0 (shounya) to 9 (no) in the Assamese language using fifteen speakers. Different models were created for each word which varied in the number of input feature values and the number of hidden states. The system obtained a maximum accuracy of 80% for 39 MFCC features and a 7-state HMM model with 5 hidden states for a system with clean data, and a maximum accuracy of 95% for 26 LPCEPSTRA features and a 7-state HMM model with 5 hidden states for a system with noisy data.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: In this article, the reliability of existing instrumental metrics for performance evaluation of phase-aware methods is studied in terms of predicting the perceived quality achieved by phase-aware methods, and novel phase-aware instrumental metrics, including a phase deviation metric, are proposed.
Abstract: To approximate the speech quality of a given speech enhancement system, most of the existing instrumental metrics rely on the calculation of a distortion metric defined between the clean reference signal and the enhanced signal in the spectral amplitude domain. Several recent studies have demonstrated the effectiveness of employing a phase modification stage in single-channel speech enhancement showing positive impact brought by modifying both amplitude and phase in contrast to the conventional methods where the noisy spectral amplitude is only modified and noisy phase is used for signal reconstruction. In this work we present two contributions; First we study the reliability of the existing instrumental metrics for performance evaluation of phase-aware methods, and second we propose novel phase-aware instrumental metrics and evaluate their reliability in terms of predicting the perceived quality achieved by the phase-aware methods. Our objective and subjective evaluations demonstrate that PESQ and the proposed phase deviation metric perform as reliable speech quality estimators following the subjective results.

Proceedings ArticleDOI
Anssi Rämö1, Henri Toukomaa1
19 Apr 2015
TL;DR: Comparison was made to Opus, the IETF-driven open source codec, as well as to industry standard voice codecs (3GPP AMR and AMR-WB, and ITU-T G.718B, G.722.1C and G.719) and to direct signals at varying bandwidths.
Abstract: This paper discusses the voice and audio quality characteristics of EVS, the recently standardized 3GPP codec. Comparison to Opus, IETF driven open source codec as well as industry standard voice codecs: 3GPP AMR and AMR-WB, and ITU-T G.718B, G.722.1C and G.719 as well as direct signals at varying bandwidths was made. Voice and audio quality was evaluated with three subjective listening tests containing clean and noisy speech in Finnish language as well as a mixed condition test containing both speech and music intermixed. Nine-scale subjective mean opinion score was calculated for all tested conditions.

Journal ArticleDOI
31 Dec 2015-Sensors
TL;DR: The experimental results show that human speech can be effectively acquired by a 94 GHz MMW radar sensor when the detection distance is 20 m, and the noise of the radar speech is greatly suppressed and the speech sounds become more pleasant to human listeners after being enhanced by the proposed algorithm.
Abstract: In order to improve the speech acquisition ability of a non-contact method, a 94 GHz millimeter wave (MMW) radar sensor was employed to detect speech signals. This novel non-contact speech acquisition method was shown to have high directional sensitivity, and to be immune to strong acoustical disturbance. However, MMW radar speech is often degraded by combined sources of noise, which mainly include harmonic, electrical circuit and channel noise. In this paper, an algorithm combining empirical mode decomposition (EMD) and mutual information entropy (MIE) was proposed for enhancing the perceptibility and intelligibility of radar speech. Firstly, the radar speech signal was adaptively decomposed into oscillatory components called intrinsic mode functions (IMFs) by EMD. Secondly, MIE was used to determine the number of reconstructive components, and then an adaptive threshold was employed to remove the noise from the radar speech. The experimental results show that human speech can be effectively acquired by a 94 GHz MMW radar sensor when the detection distance is 20 m. Moreover, the noise of the radar speech is greatly suppressed and the speech sounds become more pleasant to human listeners after being enhanced by the proposed algorithm, suggesting that this novel speech acquisition and enhancement method will provide a promising alternative for various applications associated with speech detection.
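
A minimal sketch of the EMD-based enhancement stage. The PyEMD package is assumed here for the decomposition, and the mutual-information-entropy rule that the paper uses to choose the reconstruction components is replaced by a fixed index purely for illustration.

```python
import numpy as np
from PyEMD import EMD

def emd_denoise(radar_speech, keep_from=2):
    imfs = EMD().emd(radar_speech)          # adaptive decomposition into intrinsic mode functions
    # Discard the first IMFs (highest-frequency, typically noise-dominated)
    # and reconstruct the speech from the remaining components.
    return np.sum(imfs[keep_from:], axis=0)
```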

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A new signal-noise-dependent (SND) deep neural network (DNN) framework to further improve the separation and recognition performance of the recently developed technique for general DNN-based speech separation and Experimental results on the Speech Separation Challenge (SSC) task show that SND-DNNs could yield significant performance improvements.
Abstract: In this paper, we propose a new signal-noise-dependent (SND) deep neural network (DNN) framework to further improve the separation and recognition performance of the recently developed technique for general DNN-based speech separation. We adopt a divide and conquer strategy to design the proposed SND-DNNs with higher resolutions that a single general DNN could not well accommodate for all the speaker mixing variabilities at different levels of signal-to-noise ratios (SNRs). In this study two kinds of SNR-dependent DNNs, namely positive and negative DNNs, are trained to cover the mixed speech signals with positive and negative SNR levels, respectively. At the separation stage, a first-pass separation using a general DNN can give an accurate SNR estimation for a model selection. Experimental results on the Speech Separation Challenge (SSC) task show that SND-DNNs could yield significant performance improvements for both speech separation and recognition over a general DNN. Furthermore, this purely front-end processing method achieves a relative word error rate reduction of 11.6% over a state-of-the-art recognition system where a complicated joint decoding framework needs to be implemented in the back-end.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: By constructing speech signals that lie in between natural speech and the output from a complete HMM synthesis system, this work manipulates the temporal smoothness and the variance of the spectral parameters to create stimuli, and listeners made 'same or different' pairwise judgements, from which a perceptual map is generated using Multidimensional Scaling.
Abstract: Even the best statistical parametric speech synthesis systems do not achieve the naturalness of good unit selection. We investigated possible causes of this. By constructing speech signals that lie in between natural speech and the output from a complete HMM synthesis system, we investigated various effects of modelling. We manipulated the temporal smoothness and the variance of the spectral parameters to create stimuli, then presented these to listeners alongside natural and vocoded speech, as well as output from a full HMM-based text-to-speech system and from an idealised ‘pseudo-HMM’. All speech signals, except the natural waveform, were created using vocoders employing one of two popular spectral parameterisations: Mel-Cepstra or Mel-Line Spectral Pairs. Listeners made ‘same or different’ pairwise judgements, from which we generated a perceptual map using Multidimensional Scaling. We draw conclusions about which aspects of HMM synthesis are limiting the naturalness of the synthetic speech.

Journal ArticleDOI
TL;DR: Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods.
Abstract: As a means of speech separation, time-frequency masking applies a gain function to the time-frequency representation of noisy speech. On the other hand, nonnegative matrix factorization (NMF) addresses separation by linearly combining basis vectors from speech and noise models to approximate noisy speech. This paper presents an approach for improving the perceptual quality of speech separated from background noise at low signal-to-noise ratios. An ideal ratio mask is estimated, which separates speech from noise with reasonable sound quality. A deep neural network then approximates clean speech by estimating activation weights from the ratio-masked speech, where the weights linearly combine elements from a NMF speech model. Systematic comparisons using objective metrics, including the perceptual evaluation of speech quality, show that the proposed algorithm achieves higher speech quality than related masking and NMF methods. In addition, a listening test was performed and its results show that the output of the proposed algorithm is preferred over the comparison systems in terms of speech quality.
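
A minimal sketch of the final combination step: the ratio-masked spectrogram is fed to a network that predicts activation weights, and clean speech is approximated as a non-negative combination of the NMF speech basis. The mask estimator and the trained DNN are abstracted away; the placeholder "network" below is only for illustration.

```python
import numpy as np

def reconstruct_speech(noisy_mag, ratio_mask, speech_basis, activation_net):
    masked = ratio_mask * noisy_mag                     # time-frequency masking stage
    H = activation_net(masked)                          # DNN -> activations, shape (rank, frames)
    return speech_basis @ H                             # NMF approximation W @ H

# Toy usage with a dummy "network" that simply projects onto the basis.
rng = np.random.default_rng(0)
W = np.abs(rng.standard_normal((257, 40)))              # speech basis (freq x rank)
net = lambda M: np.maximum(np.linalg.pinv(W) @ M, 0.0)  # placeholder for the trained DNN
clean_est = reconstruct_speech(np.abs(rng.standard_normal((257, 100))),
                               np.full((257, 100), 0.7), W, net)
```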

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A noise robust overlapped speech detection algorithm to estimate the likelihood of overlapping speech in a given audio file in the presence of environment noise and achieves 35% relative improvement over previous efforts, which included speech enhancement using spectral subtraction and silence removal.
Abstract: The ability to estimate the number of words spoken by an individual over a certain period of time is valuable in second language acquisition, healthcare, and assessing language development. However, establishing a robust automatic framework to achieve high accuracy is non-trivial in realistic/naturalistic scenarios due to various factors such as different styles of conversation or types of noise that appear in audio recordings, especially in multi-party conversations. In this study, we propose a noise robust overlapped speech detection algorithm to estimate the likelihood of overlapping speech in a given audio file in the presence of environment noise. This information is embedded into a word-count estimator, which uses a linear minimum mean square estimator (LMMSE) to predict the number of words from the syllable rate. Syllables are detected using a modified version of the mrate algorithm. The proposed word-count estimator is tested on long duration files from the Prof-Life-Log corpus. Data is recorded using a LENA recording device, worn by a primary speaker in various environments and under different noise conditions. The overlap detection system significantly outperforms baseline performance in noisy conditions. Furthermore, applying overlap detection results to word-count estimation achieves 35% relative improvement over our previous efforts, which included speech enhancement using spectral subtraction and silence removal.

Journal ArticleDOI
TL;DR: The research presented in this paper automatically recognizes a singer without separating instrumental and singing sounds, using audio features like timbre coefficients, pitch class, mel frequency cepstral coefficients, linear predictive coding coefficients, and loudness of an audio signal from Indian video songs (IVS).
Abstract: Singer identification is a difficult topic in music information retrieval because background instrumental music is included with the singing voice, which reduces the performance of a system. One of the main disadvantages of existing systems is that vocals and instrumentals are separated manually and only vocals are used to build the training model. The research presented in this paper automatically recognizes a singer without separating instrumental and singing sounds, using audio features like timbre coefficients, pitch class, mel frequency cepstral coefficients (MFCC), linear predictive coding (LPC) coefficients, and loudness of an audio signal from Indian video songs (IVS). Initially, various IVS of distinct playback singers (PS) are collected. After that, 53 audio features (a 12-dimensional timbre audio feature vector, 12 pitch classes, 13 MFCC coefficients, 13 LPC coefficients, and a 3-dimensional loudness feature vector of an audio signal) are extracted from each segment. The dimension of the extracted audio features is reduced using the principal component analysis (PCA) method. A playback singer model (PSM) is trained using multiclass classification algorithms like back propagation, AdaBoost.M2, the k-nearest neighbor (KNN) algorithm, the naive Bayes classifier (NBC), and the Gaussian mixture model (GMM). The proposed approach is tested on various combinations of the dataset and different combinations of audio feature vectors with various Indian male and female PS's songs.

Proceedings ArticleDOI
19 Apr 2015
TL;DR: Experimental results show that the proposed method improved instrumental speech quality measures, where using speech temporal dynamics was found to be beneficial in severe reverberation conditions.
Abstract: This paper proposes a single-channel speech dereverberation method enhancing the spectrum of the reverberant speech signal. The proposed method uses a non-negative approximation of the convolutive transfer function (N-CTF) to simultaneously estimate the magnitude spectrograms of the speech signal and the room impulse response (RIR). To utilize the speech spectral structure, we propose to model the speech spectrum using non-negative matrix factorization, which is directly used in the N-CTF model resulting in a new cost function. We derive new estimators for the parameters by minimizing the obtained cost function. Additionally, to investigate the effect of the speech temporal dynamics for dereverberation, we use a frame stacking method and derive optimal estimators. Experiments are performed for two measured RIRs and the performance of the proposed method is compared to the performance of a state-of-the-art dereverberation method enhancing the speech spectrum. Experimental results show that the proposed method improved instrumental speech quality measures, where using speech temporal dynamics was found to be beneficial in severe reverberation conditions.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed VVQ outperforms the recently introduced Dirichlet mixture model-based VQ and the conventional Gaussian mixture model-based VQ, in terms of modeling performance and D-R relation.

Journal ArticleDOI
TL;DR: A good correspondence between the data and the corresponding sEPSM predictions was obtained when the noise was manipulated and mixed with the unprocessed speech, consistent with the hypothesis that SNRenv is indicative of speech intelligibility.
Abstract: Jorgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475–1487] suggested a metric for speech intelligibility prediction based on the signal-to-noise envelope power ratio (SNRenv), calculated at the output of a modulation-frequency selective process. In the framework of the speech-based envelope power spectrum model (sEPSM), the SNRenv was demonstrated to account for speech intelligibility data in various conditions with linearly and nonlinearly processed noisy speech, as well as for conditions with stationary and fluctuating interferers. Here, the relation between the SNRenv and speech intelligibility was investigated further by systematically varying the modulation power of either the speech or the noise before mixing the two components, while keeping the overall power ratio of the two components constant. A good correspondence between the data and the corresponding sEPSM predictions was obtained when the noise was manipulated and mixed with the unprocessed speech, consistent with the hypothesis that SNRenv is indicative of speech intelligibility.
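
A minimal sketch of the SNRenv idea in a single channel and a single modulation band: the envelope power of the speech divided by the envelope power of the noise after modulation-frequency selective filtering. The full sEPSM uses a gammatone filterbank, a modulation filterbank and further stages that are omitted here, and the modulation band chosen below is illustrative.

```python
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def envelope_power(x, fs, mod_band=(1.0, 8.0)):
    env = np.abs(hilbert(x))                  # Hilbert envelope of the (sub-band) signal
    env = env - np.mean(env)                  # remove DC before computing modulation power
    sos = butter(2, mod_band, btype="bandpass", fs=fs, output="sos")
    return np.mean(sosfiltfilt(sos, env) ** 2)

def snr_env_db(speech, noise, fs):
    # Envelope power ratio of the separate speech and noise components, in dB.
    return 10 * np.log10(envelope_power(speech, fs) / envelope_power(noise, fs))
```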