
Showing papers on "Linear predictive coding published in 2014"


Journal ArticleDOI
TL;DR: It is shown that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal-to-noise ratios, even without explicit amplitude enhancement.
Abstract: The enhancement of speech corrupted by noise is commonly performed in the short-time discrete Fourier transform domain. In case only a single microphone signal is available, typically only the spectral amplitude is modified. However, it has recently been shown that an improved spectral phase can also be utilized for speech enhancement, e.g., for phase-sensitive amplitude estimation. In this paper, we therefore present a method to reconstruct the spectral phase of voiced speech from only the fundamental frequency and the noisy observation. The importance of the spectral phase is highlighted, and we elaborate on the reason why noise reduction can be achieved by modifications of the spectral phase. We show that, when the noisy phase is enhanced using the proposed phase reconstruction, instrumental measures predict an increase of speech quality over a range of signal-to-noise ratios, even without explicit amplitude enhancement.

197 citations
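
The core mechanic behind such f0-based phase reconstruction, advancing the phase of each voiced STFT bin according to the harmonic that dominates it, can be sketched in a few lines. This is a minimal numpy illustration of the general idea, not the paper's exact algorithm; all parameter values and the simple nearest-harmonic model are assumptions.

```python
import numpy as np

def reconstruct_voiced_phase(f0, n_fft=512, hop=128, fs=16000):
    """Propagate STFT phase along time for voiced speech, assuming each
    frequency bin is dominated by the harmonic of f0 nearest its center.
    Sketch only: the published method also treats phase relations across
    bands and handles unvoiced sounds separately."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.arange(n_bins) * fs / n_fft
    phase = np.zeros((n_bins, len(f0)))
    for l in range(1, len(f0)):
        # harmonic number nearest to each bin's center frequency
        h = np.maximum(1, np.round(bin_freqs / max(f0[l], 1e-6)))
        # advance each bin's phase by 2*pi*f*hop/fs from frame to frame
        phase[:, l] = phase[:, l - 1] + 2 * np.pi * h * f0[l] * hop / fs
    return np.angle(np.exp(1j * phase))  # wrapped to (-pi, pi]
```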


Journal ArticleDOI
TL;DR: In this paper, the classification of various human activities based on micro-Doppler signatures is studied using linear predictive coding (LPC) to reduce the computational cost of feature extraction, which makes real-time processing feasible.
Abstract: In this letter, the classification of various human activities based on micro-Doppler signatures is studied using linear predictive coding (LPC). LPC is proposed to extract features of micro-Doppler signatures, which are mixtures of different frequencies. The use of LPC can not only decrease the time frame required to capture the Doppler signature of human motion but also reduce the computational cost of extracting its features, which makes real-time processing feasible. The measured data of 12 human subjects performing seven different activities using a Doppler radar are used. These activities include running, walking, walking while holding a stick, crawling, boxing while moving forward, boxing while standing in place, and sitting still. A support vector machine is then trained using the output of LPC to classify the activities. Multiclass classification is implemented using a one-versus-one decision structure. The resulting classification accuracy is found to be over 85%. The effects of the number of LPC coefficients and the size of the sliding time window, as well as the decision time-frame size used in the extraction of micro-Doppler signatures, are also discussed.

113 citations
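
As an illustration of this kind of LPC-plus-SVM pipeline, here is a hedged sketch: autocorrelation-method LPC coefficients computed with the Levinson-Durbin recursion and fed to an SVM (scikit-learn's SVC, which handles multiclass problems one-versus-one internally). The random signals, labels, and sizes are placeholders, not the paper's radar data or settings.

```python
import numpy as np
from sklearn.svm import SVC  # multiclass handled one-versus-one internally

def lpc_coefficients(x, order=10):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a, e = np.zeros(order + 1), r[0]
    a[0] = 1.0
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= 1.0 - k * k
    return a[1:]  # prediction coefficients (leading 1 dropped)

# placeholder training set: one LPC vector per micro-Doppler time window,
# seven activity classes with ten windows each
X = np.vstack([lpc_coefficients(np.random.randn(256)) for _ in range(70)])
y = np.repeat(np.arange(7), 10)
clf = SVC(kernel="rbf").fit(X, y)
```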


Proceedings ArticleDOI
04 May 2014
TL;DR: It is demonstrated that distortion caused by reverberation is substantially attenuated by the DNN, whose outputs can be resynthesized into the dereverberated speech signal.
Abstract: Reverberation distorts human speech and usually has negative effects on speech intelligibility, especially for hearing-impaired listeners. It also causes performance degradation in automatic speech recognition and speaker identification systems. Therefore, the dereverberation problem must be dealt with in daily listening environments. We propose to use deep neural networks (DNNs) to learn a spectral mapping from the reverberant speech to the anechoic speech. The trained DNN produces the estimated spectral representation of the corresponding anechoic speech. We demonstrate that distortion caused by reverberation is substantially attenuated by the DNN, whose outputs can be resynthesized into the dereverberated speech signal. The proposed approach is simple, and our systematic evaluation shows promising dereverberation results, which are significantly better than those of related systems.

87 citations
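
A minimal sketch of spectral-mapping dereverberation in this spirit, using scikit-learn's MLPRegressor as a stand-in DNN. The random arrays are placeholders for paired reverberant/anechoic log-magnitude spectra; the paper's network architecture and training setup differ.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# placeholder paired training data: log-magnitude STFT frames of reverberant
# speech (with a 5-frame context window) and the matching anechoic frames
n_frames, n_bins = 500, 257
X_reverb = np.random.randn(n_frames, n_bins * 5)
Y_anechoic = np.random.randn(n_frames, n_bins)

dnn = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=20)
dnn.fit(X_reverb, Y_anechoic)

# estimated anechoic log-spectrum for one frame; resynthesis would combine
# this magnitude with the reverberant phase and apply an inverse STFT
est = dnn.predict(X_reverb[:1])
```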


Journal ArticleDOI
TL;DR: Speech processing has vast applications in voice dialing, telephone communication, call routing, domestic appliance control, speech-to-text conversion, text-to-speech conversion, lip synchronization, automation systems, etc.
Abstract: Automatic recognition of speech means enabling a natural and easy mode of communication between human and machine. Speech processing has vast applications in voice dialing, telephone communication, call routing, domestic appliance control, speech-to-text conversion, text-to-speech conversion, lip synchronization, automation systems, etc. Here we discuss some of the most widely used feature extraction techniques: Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) analysis, Dynamic Time Warping (DTW), Relative Spectral Processing (RASTA), and Zero Crossings with Peak Amplitudes (ZCPA). Techniques like RASTA and MFCC take the perceptual nature of speech into account while extracting features, whereas LPC predicts future features from previous ones.

53 citations
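
For concreteness, here is a compact, self-contained implementation of one of the surveyed techniques, MFCC (framing, power spectrum, mel filterbank, log compression, DCT). Parameter values are common defaults rather than anything prescribed by this survey.

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, fs=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    """Minimal MFCC: framing -> power spectrum -> mel filterbank -> log -> DCT."""
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)

    # slice the signal into overlapping, Hamming-windowed frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len) + hop * np.arange(n_frames)[:, None]
    frames = np.asarray(signal)[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # triangular filters spaced uniformly on the mel scale
    pts = np.floor((n_fft + 1) * mel2hz(np.linspace(0, hz2mel(fs / 2),
                                                    n_mels + 2)) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = pts[m - 1], pts[m], pts[m + 1]
        fbank[m - 1, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[m - 1, c:r] = np.linspace(1, 0, r - c, endpoint=False)

    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```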


Journal ArticleDOI
TL;DR: Instrumental measures predict that by incorporating uncertain prior information about the phase, the quality and intelligibility of processed speech can be improved both over traditional phase-insensitive approaches and over approaches that treat prior information on the phase as deterministic.
Abstract: While most short-time discrete Fourier transform-based single-channel speech enhancement algorithms only modify the noisy spectral amplitude, in recent years the interest in phase processing has increased in the field. The goal of this paper is twofold. First, we derive Bayesian probability density functions and estimators for the clean speech phase when different amounts of prior knowledge about the speech and noise amplitudes are given. Second, we derive a joint Bayesian estimator of the clean speech amplitudes and phases when uncertain a priori knowledge on the phase is available. Instrumental measures predict that by incorporating uncertain prior information about the phase, the quality and intelligibility of processed speech can be improved both over traditional phase-insensitive approaches and over approaches that treat prior information on the phase as deterministic.

47 citations


Proceedings ArticleDOI
04 May 2014
TL;DR: The proposed Modulation of Medium Duration Speech Amplitude (MMeDuSA) feature is a composite feature capturing subband speech modulations and a summary modulation that improved recognition performance under both noisy and channel degraded conditions in almost all the recognition tasks.
Abstract: Studies have shown that the performance of state-of-the-art automatic speech recognition (ASR) systems deteriorates significantly with increased noise levels and channel degradations when compared to human speech recognition capability. Traditionally, noise-robust acoustic features are deployed to improve speech recognition performance under varying background conditions and to compensate for the performance degradations. In this paper, we present the Modulation of Medium Duration Speech Amplitude (MMeDuSA) feature, a composite feature capturing subband speech modulations and a summary modulation. We analyze MMeDuSA's speech recognition performance using SRI International's DECIPHER® large vocabulary continuous speech recognition (LVCSR) system on noise- and channel-degraded Levantine Arabic speech distributed through the Defense Advanced Research Projects Agency (DARPA) Robust Automatic Speech Transcription (RATS) program. We also analyze MMeDuSA's performance on the Aurora-4 noise- and channel-degraded English corpus. Our results from all these experiments suggest that the proposed MMeDuSA feature improves recognition performance under both noisy and channel-degraded conditions in almost all the recognition tasks.

41 citations


Proceedings ArticleDOI
09 Jul 2014
TL;DR: Experimental results show that the quality of the estimated clean speech signal is improved both subjectively and objectively in terms of perceptual evaluation of speech quality (PESQ), especially in mismatched environments where the additive noise is not seen during DNN training.
Abstract: We address an over-smoothing issue of enhanced speech in deep neural network (DNN) based speech enhancement and propose a global variance equalization framework with two schemes, namely post-processing and post-training with a modified objective function, to equalize the global variance of the estimated speech with that of the reference speech. Experimental results show that the quality of the estimated clean speech signal is improved both subjectively and objectively in terms of perceptual evaluation of speech quality (PESQ), especially in mismatched environments where the additive noise is not seen during DNN training.

40 citations
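
The post-processing scheme reduces to a short transform: rescale each feature dimension of the enhanced speech so that its global variance matches a reference statistic from clean speech. A sketch of that idea follows; the paper's exact formulation, and its post-training variant, are not reproduced here.

```python
import numpy as np

def gv_equalize(est_feats, gv_ref):
    """Rescale each feature dimension of DNN-enhanced speech features so
    that its global variance matches gv_ref, a per-dimension variance
    statistic gathered from clean reference speech.
    est_feats: array of shape (frames, dims)."""
    mean = est_feats.mean(axis=0)
    gv_est = est_feats.var(axis=0)
    scale = np.sqrt(gv_ref / np.maximum(gv_est, 1e-10))
    return mean + scale * (est_feats - mean)
```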


Journal ArticleDOI
TL;DR: If the speech correlation is properly estimated, the previously derived subband filters discussed in this work show significantly less speech distortion than conventional noise reduction algorithms; the quality and intelligibility of the processed signals are predicted by objective measures.
Abstract: Recently, it has been proposed to use the minimum-variance distortionless-response (MVDR) approach in single-channel speech enhancement in the short-time frequency domain. By applying optimal FIR filters to each subband signal, these filters reduce additive noise components with less speech distortion compared to conventional approaches. An important ingredient of these filters is the temporal correlation of the speech signals. We derive algorithms that provide a blind estimation of this quantity based on maximum-likelihood and maximum a-posteriori estimation. To derive proper models for the inter-frame correlation of the speech and noise signals, we investigate their statistics on a large dataset. If the speech correlation is properly estimated, the previously derived subband filters discussed in this work show significantly less speech distortion compared to conventional noise reduction algorithms. Therefore, the focus of the experimental part of this work lies on the quality and intelligibility of the processed signals. To evaluate the performance of the subband filters in combination with the clean speech inter-frame correlation estimators, we predict the speech quality and intelligibility by objective measures.

37 citations


Journal ArticleDOI
TL;DR: The experiments presented here show that the analysis-synthesis technique based on GSS can produce speech comparable to that of a high-quality vocoder based on the spectral envelope representation, and that it permits control over voice qualities, namely transforming a modal voice into breathy or tense, by modifying the glottal parameters.
Abstract: This paper proposes an analysis method, called Glottal Spectral Separation (GSS), to separate the glottal source and vocal tract components of speech. This method can produce high-quality synthetic speech using an acoustic glottal source model. The source-filter models commonly used in speech technology applications assume that the source is a spectrally flat excitation signal and that the vocal tract filter can be represented by the spectral envelope of speech. Although this model can produce high-quality speech, it has limitations for voice transformation because it does not allow control over glottal parameters, which are correlated with voice quality. The main problem with using a speech model that better represents the glottal source and the vocal tract filter is that current analysis methods for separating these components are not robust enough to produce the same speech quality as a model based on the spectral envelope of speech. The proposed GSS method is an attempt to overcome this problem and consists of three steps. First, the glottal source signal is estimated from the speech signal. Then, the speech spectrum is divided by the spectral envelope of the glottal source signal in order to remove the glottal source effects from the speech signal. Finally, the vocal tract transfer function is obtained by computing the spectral envelope of the resulting signal. In this work, the glottal source signal is represented using the Liljencrants-Fant model (LF-model). The experiments we present here show that the analysis-synthesis technique based on GSS can produce speech comparable to that of a high-quality vocoder based on the spectral envelope representation. Moreover, it also permits control over voice qualities, namely transforming a modal voice into breathy or tense, by modifying the glottal parameters.

34 citations
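
The central separation step of GSS is a spectral division; a toy sketch follows. It divides raw magnitude spectra, whereas the paper divides by a proper spectral envelope of an LF-model glottal source estimate.

```python
import numpy as np

def gss_vocal_tract_spectrum(speech_mag, glottal_mag, eps=1e-10):
    """Divide the speech magnitude spectrum by the spectrum of the
    estimated glottal source; the spectral envelope of the result
    approximates the vocal tract transfer function."""
    return speech_mag / np.maximum(glottal_mag, eps)
```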


Proceedings ArticleDOI
04 May 2014
TL;DR: This study focuses on the SNR level of -5 dB, at which speech is generally dominated by background noise, and proposes a new feature called the Multi-Resolution Cochleagram (MRCG), which is extracted from four cochleagrams of different resolutions to capture both local information and spectrotemporal context.
Abstract: Speech separation is a challenging problem at low signal-to-noise ratios (SNRs). Separation can be formulated as a classification problem. In this study, we focus on the SNR level of -5 dB, at which speech is generally dominated by background noise. In such a low SNR condition, extracting robust features from a noisy mixture is crucial for successful classification. Using a common neural network classifier, we systematically compare the separation performance of many monaural features. In addition, we propose a new feature called the Multi-Resolution Cochleagram (MRCG), which is extracted from four cochleagrams of different resolutions to capture both local information and spectrotemporal context. Comparisons using two non-stationary noises show a range of feature robustness for speech separation, with the proposed MRCG performing the best. We also find that ARMA filtering, a post-processing technique previously used for robust speech recognition, improves speech separation performance by smoothing the temporal trajectories of feature dimensions.

31 citations
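
A sketch of how the four MRCG resolutions might be assembled, assuming two gammatone cochleagrams (short and long analysis windows, resampled to a common frame rate) are already computed. The 11- and 23-point mean-smoothing sizes follow the commonly cited MRCG recipe and should be checked against the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def mrcg(cg_short, cg_long):
    """Stack four cochleagram resolutions: the high-resolution cochleagram,
    a low-resolution (long-window) one, and two mean-smoothed versions of
    the high-resolution one that capture spectrotemporal context.
    Both inputs are (channels, frames) arrays of equal shape."""
    cg3 = uniform_filter(cg_short, size=11)  # local smoothing
    cg4 = uniform_filter(cg_short, size=23)  # wider context
    return np.concatenate([cg_short, cg_long, cg3, cg4], axis=0)
```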


Proceedings ArticleDOI
04 May 2014
TL;DR: Experimental results, obtained using measured impulse responses, indicate that the proposed approach could be used to improve the dereverberation performance compared to the classical technique.
Abstract: Reverberation has a considerable impact on the quality and intelligibility of captured speech signals. In this paper we present an approach for blind multi-microphone speech dereverberation based on the weighted prediction error method, where the reverberant observations are modeled using multi-channel linear prediction in the short-time Fourier transform domain. Instead of using the commonly employed Gaussian distribution for the desired speech signal, the proposed approach uses a Laplacian distribution which is known to be more accurate in modeling speech signals. Maximum-likelihood estimation is used for estimating the model parameters, leading to a linear programming optimization problem. Experimental results, obtained using measured impulse responses, indicate that the proposed approach could be used to improve the dereverberation performance compared to the classical technique.
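
To make the linear prediction model concrete, here is a single-channel, Gaussian (least-squares) weighted-prediction-error sketch applied per frequency bin. The paper's actual contribution, the Laplacian model solved via linear programming, and the multi-microphone case are not reproduced here; names and defaults are assumptions.

```python
import numpy as np

def wpe_dereverb(Y, taps=10, delay=3, iters=3, eps=1e-8):
    """Estimate late reverberation in each frequency bin as a delayed
    linear prediction from past STFT frames and subtract it.
    Y: complex STFT of shape (freq_bins, frames)."""
    F, T = Y.shape
    X = Y.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(X) ** 2, eps)  # time-varying variance estimate
        for f in range(F):
            # stack delayed past observations: column k holds Y delayed by delay+k
            G = np.zeros((T, taps), dtype=complex)
            for k in range(taps):
                d = delay + k
                G[d:, k] = Y[f, :T - d]
            w = 1.0 / np.sqrt(lam[f])  # per-frame least-squares weights
            g, *_ = np.linalg.lstsq(G * w[:, None], Y[f] * w, rcond=None)
            X[f] = Y[f] - G @ g  # dereverberated bin trajectory
    return X
```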

Proceedings ArticleDOI
20 Nov 2014
TL;DR: Six different single-channel dereverberation algorithms are evaluated subjectively in terms of speech intelligibility and speech quality and the best performance was observed for the regularized spectral inverse approach with pre-echo removal.
Abstract: In this contribution, six different single-channel dereverberation algorithms are evaluated subjectively in terms of speech intelligibility and speech quality. In order to study the influence of the dereverberation algorithms on speech intelligibility, speech reception thresholds in noise were measured for different reverberation times. The quality ratings were obtained following the ITU-T P.835 recommendations (with slight changes for adaptation to the problem of dereverberation) and included assessment of the attributes reverberant, colored, distorted, and overall quality. Most of the algorithms improved speech intelligibility for short as well as long reverberation times compared to the reverberant condition. The best performance in terms of speech intelligibility and quality was observed for the regularized spectral inverse approach with pre-echo removal. The overall quality of the processed signals was highly correlated with the attributes reverberant and/or distorted. To generalize the present outcomes, further studies are needed to account for the influence of estimation errors.

Journal ArticleDOI
TL;DR: This paper focuses on a survey of various feature extraction techniques in speech processing, such as Fast Fourier Transforms, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, Discrete Wavelet Transforms, Wavelet Packet Transforms, and their applications in speech processing.
Abstract: Speech processing includes various techniques such as speech coding, speech synthesis, speech recognition, and speaker recognition. In the area of digital signal processing, speech processing has versatile applications, so it is still an intensive field of research. Speech processing mostly performs two fundamental operations: feature extraction and classification. The main criterion for a good speech processing system is the selection of the feature extraction technique, which plays an important role in the system's accuracy. This paper intends to focus on a survey of various feature extraction techniques in speech processing, such as Fast Fourier Transforms, Linear Predictive Coding, Mel Frequency Cepstral Coefficients, Discrete Wavelet Transforms, Wavelet Packet Transforms, the hybrid algorithm DWPD, and their applications in speech processing.

Proceedings ArticleDOI
04 May 2014
TL;DR: A mask-based enhancer for very low quality speech that is able to preserve important cues in a noise-robust manner by identifying the time-frequency regions that contain significant speech energy is proposed.
Abstract: We propose a mask-based enhancer for very low quality speech that is able to preserve important cues in a noise-robust manner by identifying the time-frequency regions that contain significant speech energy. We use a classifier to estimate a time-frequency mask from an input feature set that provides information about the energy distribution of both voiced and unvoiced speech. We evaluate the enhancer on a range of noisy speech signals and demonstrate that it yields consistent improvements in an objective intelligibility measure.
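
The final masking-and-resynthesis step is simple to sketch; estimating the mask with the paper's classifier and voiced/unvoiced feature set is the hard part and is assumed given here.

```python
import numpy as np
from scipy.signal import stft, istft

def apply_tf_mask(noisy, mask, fs=16000, nperseg=512):
    """Apply an estimated time-frequency mask (same shape as the STFT,
    values in [0, 1]) to a noisy signal and resynthesize it."""
    _, _, Z = stft(noisy, fs=fs, nperseg=nperseg)
    _, enhanced = istft(Z * mask, fs=fs, nperseg=nperseg)
    return enhanced
```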

Proceedings ArticleDOI
04 May 2014
TL;DR: Several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models.
Abstract: This study focuses on acoustic variations in speech introduced by whispering, and proposes several strategies to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences in neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated to speech recognition, several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed. The proposed neutral-trained system employing redistributed filter bank and reduced features provides a 7.7% absolute WER reduction over the baseline system trained on neutral speech, and a 1.3% reduction over a baseline system with whisper-adapted acoustic models.

Journal ArticleDOI
TL;DR: A method for modifying the Mel cepstral coefficients generated by statistical parametric models that have been trained on plain speech such that the glimpse proportion - an objective measure of the intelligibility of speech in noise - increases, while keeping the speech energy fixed.

Proceedings ArticleDOI
04 May 2014
TL;DR: This work addresses the F0 modeling in whisper-to-speech conversion and shows that F0 contours can be derived from the mapped spectral vectors, which can be used for the synthesis of a speech signal.
Abstract: In this work, we address the issues involved in whisper-to-audible speech conversion. Spectral mapping techniques using Gaussian mixture models or artificial neural networks borrowed from voice conversion have been applied to transform whisper spectral features to normally phonated audible speech. However, the modeling and generation of the fundamental frequency (F0) and its contour in the converted speech is a major issue. Whispered speech does not contain explicit voicing characteristics, and hence it is hard to derive a suitable F0, making it difficult to generate a natural prosody after conversion. Our work addresses the F0 modeling in whisper-to-speech conversion. We show that F0 contours can be derived from the mapped spectral vectors, which can be used for the synthesis of a speech signal. We also present a hybrid unit selection approach for whisper-to-speech conversion. Unit selection is performed on the spectral vectors, where F0 and its contour can be obtained as a byproduct without any additional modeling.

Proceedings ArticleDOI
04 May 2014
TL;DR: This paper first obtains a clean speech phase estimate using a recent phase reconstruction algorithm, then proposes to treat this reconstructed phase as uncertain a priori knowledge when deriving a joint MMSE estimate of the clean speech amplitude and phase.
Abstract: In most STFT-based speech enhancement algorithms, only the STFT amplitude of speech is processed, while the STFT phase of the noisy signal is neither modified nor employed to improve amplitude estimation. This is also because modifying the spectral phase often yields undesired artifacts and unnatural sounding speech. In this paper, we first obtain a clean speech phase estimate using a recent phase reconstruction algorithm. Then, we propose to treat this reconstructed phase as uncertain a priori knowledge when deriving a joint MMSE estimate of the clean speech amplitude and phase. The resulting MMSE estimator yields a compromise between the phase of the noisy signal and the prior phase estimate. Instrumental measures and informal listening show that the proposed estimator reduces undesired artifacts and results in improved speech quality.

Patent
18 Jul 2014
TL;DR: In this paper, the authors propose a method for generating clean speech from a speech signal representing a mixture of noise and speech, using a model of speech based on auditory and speech production principles.
Abstract: Provided are systems and methods for generating clean speech from a speech signal representing a mixture of noise and speech. The clean speech may be generated from synthetic speech parameters. The synthetic speech parameters are derived from the speech signal components and a model of speech based on auditory and speech production principles. The modeling may utilize a source-filter structure of the speech signal. One or more spectral analyses are performed on the speech signal to generate spectral representations. Feature data are derived from a spectral representation. Features corresponding to the target speech are grouped according to a model of speech and separated from the feature data. The synthetic speech parameters, including spectral envelope, pitch data, and voice classification data, are generated based on the features corresponding to the target speech.

Proceedings ArticleDOI
20 Nov 2014
TL;DR: This paper proposes a probabilistic method to estimate the clean speech phase from noisy observation using von Mises phase priors, which leads to improved speech quality and intelligibility predicted by instrumental measures without explicit incorporation of amplitude enhancement.
Abstract: In many artificial intelligence systems, the human voice is considered the medium for information transmission. Human-machine communication by voice becomes difficult when speech is mixed with background noise. As a remedy, single-channel speech enhancement is indispensable for reducing background noise from noisy speech to make it suitable for automatic speech recognition and telephony. While conventional techniques for single-channel speech enhancement incorporate the noisy phase in both the amplitude estimation and signal reconstruction stages, in this paper we propose a probabilistic method to estimate the clean speech phase from the noisy observation. Our proposed method consists of phase unwrapping followed by threshold-based temporal smoothing using von Mises phase priors. The proposed phase enhancement method leads to improved speech quality and intelligibility as predicted by instrumental measures, without explicit incorporation of amplitude enhancement.

Journal ArticleDOI
TL;DR: Results show that the proposed coding scheme achieves an average Mean Opinion Score of 3.083 for the synthesized speech at a moderate bit rate (4.2 kbps), outperforming the quality of Code-Excited Linear Prediction (CELP).
Abstract: In this paper, we propose a novel speech coding scheme based on compressed sensing and sparse representation. Compressed sensing (CS) attracts great interest for its ability to recover original signals from a few measurements. The measurements, obtained by projection with a row-echelon matrix, preserve part of the speech features. A dictionary is learned so as to contain redundant information about the speech measurements. The synthesized speech is recovered from a sparse approximation of the corresponding measurement. A low-pass post-filter is adopted to improve the subjective quality of the synthesized speech. Results show that the proposed coding scheme achieves an average Mean Opinion Score (MOS) of 3.083 for the synthesized speech at a moderate bit rate (4.2 kbps), which outperforms the quality of Code-Excited Linear Prediction (CELP).

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis using rich context models, which are statistical models that represent individual acoustic parameter segments.
Abstract: In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining Hidden Markov Model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach. However, the speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually lose the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian Mixture Models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is iteratively generated from GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods is the same as the traditional framework, the capability of flexibly modeling acoustic features remains. The experimental results demonstrate that: (1) approximation with a single Gaussian component sequence yields better synthetic speech quality than the EM algorithm in the proposed parameter generation method; (2) state-based model selection yields quality improvements at the same level as frame-based model selection; (3) using the initial parameters generated from the over-fitted speech probability distributions is very effective for further improving speech quality; and (4) the proposed methods for the spectral and F0 components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.

Proceedings ArticleDOI
03 Apr 2014
TL;DR: This work presents a simple novel scheme for classifying audio speech signals into male speech and female speech, using popular salient low-level time-domain acoustic features that are closely related to the physical properties of the source audio signal.
Abstract: In this work, we present a novel, simple scheme for classifying audio speech signals into male speech and female speech. In the context of content-based multimedia indexing, gender identification based on the speech signal is an important task. Popular salient low-level time-domain acoustic features that are closely related to the physical properties of the source audio signal, such as zero crossing rate (ZCR) and short-time energy (STE), along with spectral flux, a low-level frequency-domain feature, are used for this discrimination. RANSAC and a neural network have been used as classifiers. The experimental results demonstrate the efficiency of the proposed scheme.
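
The time-domain features named above are one-liners over framed audio. A sketch, with the frame matrix (one frame per row) assumed to be precomputed:

```python
import numpy as np

def zcr_ste(frames):
    """Per-frame zero crossing rate and short-time energy for a matrix of
    overlapping frames (one frame per row)."""
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    ste = np.sum(frames ** 2, axis=1)
    return zcr, ste
```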

Journal ArticleDOI
TL;DR: The highest speech recognition rate was obtained using a 10 ms analysis window with a frame shift varying from 7.5 to 10 ms (regardless of analysis type); the highest increase in recognition rate was 2.5%.
Abstract: The speech signal is redundant and non-stationary by nature. Because of the inertia of the vocal tract, its variations are not very rapid, and the signal can be considered stationary over short segments. It is presumed that the short-time magnitude spectrum contains the most distinctive information of speech. This is the main reason for analyzing the speech signal frame by frame. For this purpose, the speech signal is segmented into overlapping segments (so-called frames). Segments of 15-25 ms with an overlap of 10-15 ms are usually used. In this paper, we present the results of our investigation of the influence of analysis window length and frame shift on the speech recognition rate. We analyzed three different cepstral analysis approaches for this purpose: mel frequency cepstral analysis (MFCC), linear prediction cepstral analysis (LPCC), and perceptual linear prediction cepstral analysis (PLPC). The highest speech recognition rate was obtained using a 10 ms analysis window with a frame shift varying from 7.5 to 10 ms (regardless of analysis type). The highest increase in recognition rate was 2.5%.
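
The frame-by-frame segmentation described above is a short indexing exercise; a sketch with the study's best-performing window and shift as defaults (the Hamming window is an assumption):

```python
import numpy as np

def frame_signal(x, fs, win_ms=10.0, shift_ms=7.5):
    """Segment a signal into overlapping Hamming-windowed frames."""
    win = int(fs * win_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - win) // shift
    idx = np.arange(win) + shift * np.arange(n_frames)[:, None]
    return np.asarray(x)[idx] * np.hamming(win)
```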

Patent
30 Dec 2014
TL;DR: In this paper, the authors propose a speech separation method and system: extract a speech feature of the mixture speech signal, input it into a regression model for speech separation to obtain an estimated speech feature of the target speech signal, and synthesize the target speech signal from the estimated feature.
Abstract: An example of the present invention discloses a speech separation method and system. The method comprises: receiving a mixture speech signal to be separated; extracting a speech feature of the mixture speech signal; inputting the extracted speech feature into a regression model for speech separation to obtain an estimated speech feature of a target speech signal; and synthesizing the target speech signal according to the estimated speech feature. The speech separation effect can be effectively improved using the present invention.

Proceedings ArticleDOI
25 Sep 2014
TL;DR: It is shown in terms of objective measures, spectrogram analysis, and subjective listening tests that the proposed method consistently outperforms one of the state-of-the-art speech enhancement methods for noisy speech corrupted by babble or car noise at high as well as very low SNR levels.
Abstract: In this paper, a noisy speech enhancement method based on modified spectral subtraction performed on the short-time magnitude spectrum is presented. Here, the cross-terms containing the spectra of the noise and clean signals are taken into consideration; these are neglected in the traditional spectral subtraction method on the basis of the assumption that clean speech and noise signals are completely uncorrelated, which is not true for most noises. In this method, the noise estimate to be subtracted from the noisy speech spectrum is determined by exploiting the low-frequency regions of the noisy speech of the current frame, rather than depending only on the initial silence frames. We argue that this approach to noise estimation is capable of tracking the time variation of non-stationary noise. By employing the noise estimates thus obtained, a procedure is formulated to reduce noise from the magnitude spectrum of the noisy speech signal. The noise-reduced magnitude spectrum is then recombined with the unchanged phase spectrum to produce a modified complex spectrum prior to synthesizing an enhanced frame. Extensive simulations are carried out using the NOIZEUS database in order to evaluate the performance of the proposed method. It is shown in terms of objective measures, spectrogram analysis, and subjective listening tests that the proposed method consistently outperforms one of the state-of-the-art speech enhancement methods for noisy speech corrupted by babble or car noise at high as well as very low SNR levels.
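
For orientation, here is the classical magnitude spectral-subtraction baseline that the paper modifies. The cross-term modeling and the per-frame low-frequency noise tracking that constitute the paper's contribution are not included; all parameters are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, n_init_frames=6, alpha=2.0, beta=0.01):
    """Classical spectral subtraction: estimate the noise spectrum from
    initial silence frames, over-subtract it from the noisy magnitude with
    factor alpha, floor the result at beta times the noise estimate, and
    recombine with the unchanged noisy phase."""
    _, _, Z = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)
    noise_mag = mag[:, :n_init_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - alpha * noise_mag, beta * noise_mag)
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```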

Proceedings ArticleDOI
04 May 2014
TL;DR: A novel scheme is proposed to estimate the missing values occurring during LPC model transmission; it applies the Dirichlet mixture model (DMM) to capture the correlations among the elements of the ΔLSF vector.
Abstract: In packet networks, a reliable scheme to handle packet loss during speech transmission is of great importance. As a common representation of the linear predictive coding (LPC) model, the line spectral frequency (LSF) parameters are widely used in speech quantization and transmission. In this paper, we propose a novel scheme to estimate the missing values occurring during LPC model transmission. In order to exploit the boundary and ordering properties of the LSF parameters, we utilize the ΔLSF representation and apply the Dirichlet mixture model (DMM) to capture the correlations among the elements in the ΔLSF vector. With the conditional distribution of the missing part given the received part, an optimal nonlinear minimum mean square error estimator for the missing values is proposed. Compared to the previously presented Gaussian mixture model based method, the proposed DMM based nonlinear estimator shows a convincing improvement.

Proceedings ArticleDOI
18 Sep 2014
TL;DR: Aiming at improved accuracy and robustness of the VAD technique to noise, feature selection is conducted by introducing the class separation measure (CSM) criterion to evaluate the capability of the extracted feature vectors to classify speech and non-speech segments.
Abstract: Voice Activity Detection (VAD) is one of the key techniques for many speech applications. Existing VAD algorithms have shown unsatisfactory performance under non-stationary noise and low signal-to-noise-ratio (SNR) conditions. Motivated by the fact that people are able to distinguish speech from non-speech even in low SNR situations, this paper studies the VAD technique from a pattern recognition point of view, where VAD is essentially formulated as a binary classification problem. Specifically, VAD is implemented by classifying the speech signal into speech and non-speech segments. A radial basis function (RBF) based support vector machine (SVM) is employed in a supervised manner, which is well suited for binary classification tasks with some training samples. Aiming at improved accuracy and robustness of the VAD technique to noise, feature selection is conducted by introducing the class separation measure (CSM) criterion to evaluate the capability of the extracted feature vectors to classify speech and non-speech segments. The most common speech features have been taken into account, including Mel-frequency cepstral coefficients (MFCC), the principal component analysis of the MFCC (PCA-MFCC), linear predictive coding (LPC), and linear predictive cepstral coding (LPCC). Intensive experimental results show that the MFCC features capture the most relevant information of speech and keep good class separability in different noisy conditions, as do the PCA-MFCC features. Moreover, the PCA-MFCC features are more robust to noise with less computational cost. As a result, a VAD method using PCA-MFCC features and an RBF-SVM classifier has been developed, termed PCA-SVM-VAD for short. Experimental results with the NOIZEUS database show that the proposed PCA-SVM-VAD method clearly improves over other VAD methods and performs much more robustly in car noise environments at various SNRs.
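
The resulting PCA-SVM-VAD pipeline maps naturally onto scikit-learn. A sketch with placeholder features and labels; frame-level MFCC extraction and the CSM-based feature selection are assumed to happen upstream.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# placeholder frame-level features (e.g. 12 MFCCs per frame) and labels:
# 1 = speech frame, 0 = non-speech frame
X_train = np.random.randn(1000, 12)
y_train = np.random.randint(0, 2, 1000)

# PCA-reduced MFCC features classified by an RBF-kernel SVM
vad = make_pipeline(StandardScaler(), PCA(n_components=8), SVC(kernel="rbf"))
vad.fit(X_train, y_train)

is_speech = vad.predict(X_train[:5])  # per-frame speech/non-speech decisions
```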

Journal ArticleDOI
TL;DR: This method proposes the use of the signal-whitening property of LPC as a preprocessing step in OFDM systems and can achieve a significant reduction in PAPR without degrading the power spectral level, error performance, or computational complexity of the system.
Abstract: High peak-to-average power ratio (PAPR) has always been a major problem in orthogonal frequency division multiplexing (OFDM), leading to severe nonlinear distortion in practical hardware implementations of high-power amplifiers. In this article, a new PAPR reduction method based on linear predictive coding (LPC) is proposed. The method uses the signal-whitening property of LPC as a preprocessing step in OFDM systems. The error filtering of the proposed method removes the predictable content of stationary stochastic processes, which reduces the autocorrelation of the input data sequences and is shown to be a very effective solution to the PAPR problem in OFDM transmissions. It is shown that the proposed method can achieve a significant reduction in PAPR without degrading the power spectral level, error performance, or computational complexity of the system. It is also shown that the proposed method is independent of the modulation scheme and can be applied to any number of subcarriers under both additive white Gaussian noise and wireless Rayleigh fading channels.
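
A hedged sketch of the whitening idea: filter the subcarrier data with an LPC analysis (error) filter before the OFDM IFFT and compare PAPR. The placeholder coefficients and sizes are assumptions, not the article's configuration; in practice the coefficients would be estimated from the data statistics, e.g. with a Levinson-Durbin routine.

```python
import numpy as np
from scipy.signal import lfilter

def papr_db(x):
    """Peak-to-average power ratio of a time-domain symbol, in dB."""
    p = np.abs(x) ** 2
    return 10 * np.log10(p.max() / p.mean())

# hypothetical QPSK data on 256 subcarriers
rng = np.random.default_rng(0)
data = (rng.choice([-1, 1], 256) + 1j * rng.choice([-1, 1], 256)) / np.sqrt(2)

# LPC error (analysis) filter A(z) = 1 + a1 z^-1 + ... + ap z^-p
a = np.array([-0.5, 0.2, -0.1])  # placeholder LPC coefficients
whitened = lfilter(np.concatenate(([1.0], a)), [1.0], data)

print(papr_db(np.fft.ifft(data)), papr_db(np.fft.ifft(whitened)))
```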

Proceedings ArticleDOI
04 May 2014
TL;DR: A novel probabilistic model is proposed that precisely models speech by taking into account the underlying speech production process as well as its dynamics, and that outperforms state-of-the-art methods in terms of objective measures.
Abstract: Model-based speech enhancement methods, which rely on separately modeling the speech and the noise, have been shown to be powerful in many different problem settings. When the structure of the noise can be arbitrary, which is often the case in practice, model-based methods have to focus on developing good speech models, whose quality will be key to their performance. In this study, we propose a novel probabilistic model for speech enhancement which precisely models the speech by taking into account the underlying speech production process as well as its dynamics. The proposed model follows a source-filter approach, where the excitation and filter parts are modeled as non-negative dynamical systems. We present convergence-guaranteed update rules for each latent factor. In order to assess performance, we evaluate our model on a challenging speech enhancement task where the speech is observed under non-stationary noises recorded in a car. We show that our model outperforms state-of-the-art methods in terms of objective measures.