scispace - formally typeset

Showing papers on "Linear predictive coding published in 2012"


Journal ArticleDOI
TL;DR: In this paper, five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers.
Abstract: The pseudo-periodicity of voiced speech can be exploited in several speech processing applications. This requires, however, that the precise locations of the glottal closure instants (GCIs) are available. The focus of this paper is the evaluation of automatic methods for the detection of GCIs directly from the speech waveform. Five state-of-the-art GCI detection algorithms are compared using six different databases with contemporaneous electroglottographic recordings as ground truth, and containing many hours of speech by multiple speakers. The five techniques compared are the Hilbert Envelope-based detection (HE), the Zero Frequency Resonator-based method (ZFR), the Dynamic Programming Phase Slope Algorithm (DYPSA), the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) and the Yet Another GCI Algorithm (YAGA). The efficacy of these methods is first evaluated on clean speech, both in terms of reliability and accuracy. Their robustness to additive noise and to reverberation is also assessed. A further contribution of the paper is the evaluation of their performance on a concrete application of speech processing: the causal-anticausal decomposition of speech. It is shown that for clean speech, SEDREAMS and YAGA are the best performing techniques, both in terms of identification rate and accuracy. ZFR and SEDREAMS also show a superior robustness to additive noise and reverberation.

241 citations


Patent
09 Jul 2012
TL;DR: In this paper, a method for processing speech, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system, is presented.
Abstract: A method for processing speech, comprising semantically parsing a received natural language speech input with respect to a plurality of predetermined command grammars in an automated speech processing system; determining if the parsed speech input unambiguously corresponds to a command and is sufficiently complete for reliable processing, then processing the command; if the speech input ambiguously corresponds to a single command or is not sufficiently complete for reliable processing, then prompting a user for further speech input to reduce ambiguity or increase completeness, in dependence on a relationship of previously received speech input and at least one command grammar of the plurality of predetermined command grammars, reparsing the further speech input in conjunction with previously parsed speech input, and iterating as necessary. The system also monitors abort, fail or cancel conditions in the speech input.

226 citations


Journal ArticleDOI
TL;DR: Three speaking-aid systems are proposed that enhance three different types of EL speech signals: EL speech, EL speech using an air-pressure sensor (EL-air speech), and silent EL speech which is produced with a new sound source unit that generates signals with extremely low energy.

190 citations


Journal ArticleDOI
TL;DR: In this paper, a set of speech processing tools created by introducing sparsity constraints into the linear prediction framework is presented; these tools have been shown to be effective in several problems related to modeling and coding of speech signals.
Abstract: The aim of this paper is to provide an overview of Sparse Linear Prediction, a set of speech processing tools created by introducing sparsity constraints into the linear prediction framework. These tools have been shown to be effective in several problems related to modeling and coding of speech signals. For speech analysis, we provide predictors that are accurate in modeling the speech production process and overcome problems related to traditional linear prediction. In particular, the predictors obtained offer a more effective decoupling of the vocal tract transfer function and its underlying excitation, making the approach a very efficient method for the analysis of voiced speech. For speech coding, we provide predictors that shape the residual according to the characteristics of the sparse encoding techniques, resulting in more straightforward coding strategies. Furthermore, encouraged by the promising application of compressed sensing in signal compression, we investigate its formulation and application to sparse linear predictive coding. The proposed estimators are all solutions to convex optimization problems, which can be solved efficiently and reliably using, e.g., interior-point methods. Extensive experimental results are provided to support the effectiveness of the proposed methods, showing the improvements over traditional linear prediction in both speech analysis and coding.

128 citations
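The core idea above, replacing the 2-norm residual of standard linear prediction with a sparsity-promoting 1-norm, can be sketched in a few lines. This is a minimal iteratively-reweighted-least-squares (IRLS) approximation, not the interior-point solvers the paper relies on; the AR(2) test signal, model order and iteration count are illustrative assumptions.

```python
import numpy as np

def sparse_lp(x, order, iters=20, eps=1e-6):
    """Linear prediction with an l1 residual objective, solved by IRLS.

    Minimizing the 1-norm of the residual promotes a sparse (spiky)
    excitation; this reweighted least-squares loop is a cheap stand-in
    for the convex solvers used in sparse linear prediction.
    """
    n = len(x)
    # Row t of X holds the `order` samples preceding x[t].
    X = np.column_stack([x[order - k - 1:n - k - 1] for k in range(order)])
    y = x[order:]
    a = np.linalg.lstsq(X, y, rcond=None)[0]          # l2 warm start
    for _ in range(iters):
        r = y - X @ a
        w = 1.0 / np.maximum(np.abs(r), eps)          # IRLS weights for l1
        a = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return a

# Toy check: an AR(2) signal driven by heavy-tailed Laplace innovations.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 1.3 * x[t - 1] - 0.6 * x[t - 2] + rng.laplace()
a = sparse_lp(x, order=2)
```

On this signal the recovered coefficients land close to the true (1.3, −0.6); with real voiced speech the l1 objective concentrates the residual energy at the glottal pulses.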


Journal ArticleDOI
TL;DR: The method enables the control of the source distortion and source confusion trade-off, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.
Abstract: The enhancement of speech degraded by real-world interferers is a highly relevant and difficult task. Its importance arises from the multitude of practical applications, whereas the difficulty is due to the fact that interferers are often nonstationary and potentially similar to speech. The goal of monaural speech enhancement is to separate a single mixture into its underlying clean speech and interferer components. This under-determined problem is solved by incorporating prior knowledge in the form of learned speech and interferer dictionaries. The clean speech is recovered from the degraded speech by sparse coding of the mixture in a composite dictionary consisting of the concatenation of a speech and an interferer dictionary. Enhancement performance is measured using objective measures and is limited by two effects. An overly sparse coding of the mixture causes the speech component to be explained with too few speech dictionary atoms, which induces an approximation error we denote source distortion. Conversely, an overly dense coding of the mixture results in source confusion, where parts of the speech component are explained by interferer dictionary atoms and vice versa. Our method enables control of the trade-off between source distortion and source confusion, and therefore achieves superior performance compared to powerful approaches like geometric spectral subtraction and codebook-based filtering, for a number of challenging interferer classes such as speech babble and wind noise.

111 citations


Journal ArticleDOI
TL;DR: A new algorithm is proposed for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding, thus maintaining synchronization between information hiding and speech encoding.
Abstract: Low bit-rate speech codecs have been widely used in audio communications such as VoIP and mobile communications, so steganography in low bit-rate audio streams would have broad applications in practice. In this paper, the authors propose a new algorithm for steganography in low bit-rate VoIP audio streams by integrating information hiding into the process of speech encoding. The proposed algorithm performs data embedding while pitch period prediction is conducted during low bit-rate speech encoding, thus maintaining synchronization between information hiding and speech encoding. The steganography algorithm not only achieves high speech quality and resists detection by steganalysis, but is also fully compatible with a standard low bit-rate speech codec, since data embedding and extraction cause no additional delay. Testing shows that, with the proposed algorithm, the data embedding rate of the secret message can attain 4 bits/frame (133.3 bits/second).

109 citations
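The paper's embedding runs inside the codec's pitch-period prediction; as a loose, self-contained illustration of hiding one bit per frame in a pitch parameter, the sketch below forces the parity of each integer pitch lag to match a secret bit. The ±1-sample nudge and the toy lag values are assumptions for the example, not the paper's actual scheme.

```python
def embed_bits(pitch_lags, bits):
    """Force the parity of each integer pitch lag to carry one secret bit.

    A toy stand-in for embedding during pitch-period prediction: each
    lag moves by at most one sample, so speech quality is barely touched.
    """
    stego = []
    for lag, bit in zip(pitch_lags, bits):
        if lag % 2 != bit:
            lag += 1                      # nudge one sample to fix parity
        stego.append(lag)
    return stego

def extract_bits(stego_lags):
    """Receiver side: the secret is just the lag parity."""
    return [lag % 2 for lag in stego_lags]

lags = [40, 41, 42, 43]                   # assumed per-frame pitch lags
secret = [1, 1, 0, 0]
stego = embed_bits(lags, secret)          # -> [41, 41, 42, 44]
```

Because embedding happens in a codec parameter rather than in decoded samples, sender and receiver stay synchronized frame by frame, which is the property the paper emphasizes.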


Journal ArticleDOI
TL;DR: Experimental results demonstrate the effectiveness of the proposed hiding technique: the stego signals are perceptually indistinguishable from the equivalent cover signals, while the secret speech message can be recovered with only slight quality degradation.
Abstract: A new method to secure speech communication using the discrete wavelet transform (DWT) and the fast Fourier transform is presented in this article. In the first phase of the hiding technique, we separate the high-frequency components of the speech from the low-frequency components using the DWT. In the second phase, we exploit the low-pass spectral properties of the speech spectrum to hide another secret speech signal in the low-amplitude high-frequency regions of the cover speech signal. The proposed method allows hiding a large amount of secret information while rendering steganalysis more complex. Experimental results demonstrate the effectiveness of the proposed hiding technique: the stego signals are perceptually indistinguishable from the equivalent cover signals, while the secret speech message can be recovered with only slight quality degradation.

69 citations
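A single-level Haar DWT is enough to illustrate the subband split the method relies on: the secret goes into the high-frequency detail band and is read back after another forward transform. The embedding gain and the full replacement of the detail band are simplifying assumptions; the paper mixes the secret into selected low-amplitude regions instead.

```python
import numpy as np

def haar_step(x):
    """One level of the Haar DWT: returns (approximation, detail)."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_inverse(approx, detail):
    """Invert one Haar level exactly."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

GAIN = 0.1   # assumed embedding gain, not from the paper

cover = np.sin(2 * np.pi * np.arange(64) / 16.0)    # low-frequency cover
secret = np.cos(2 * np.pi * np.arange(32) / 8.0)    # secret, half the length
approx, _detail = haar_step(cover)
stego = haar_inverse(approx, GAIN * secret)         # secret replaces the detail band
recovered = haar_step(stego)[1] / GAIN              # receiver re-runs the DWT
```

Since the Haar step is perfectly invertible, the receiver recovers the secret exactly in this idealized, lossless setting; quantization in a real channel is what causes the "slight degradation" the abstract mentions.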


Patent
21 Sep 2012
TL;DR: In this article, a system-effected method for synthesizing speech, or recognizing speech including a sequence of expressive speech utterances, can be computer-implemented and can include system-generating a speech signal embodying the sequence of utterances.
Abstract: A system-effected method for synthesizing speech, or recognizing speech including a sequence of expressive speech utterances. The method can be computer-implemented and can include system-generating a speech signal embodying the sequence of expressive speech utterances. Other possible steps include: system-marking the speech signal with a pitch marker indicating a pitch change at or near a first zero amplitude crossing point of the speech signal following a glottal closure point, at a minimum, at a maximum or at another location; system marking the speech signal with at least one further pitch marker; system-aligning a sequence of prosodically marked text with the pitch-marked speech signal according to the pitch markers; and system outputting the aligned text or the aligned speech signal, respectively. Computerized systems, and stored programs for implementing method embodiments of the invention are also disclosed.

57 citations


Journal ArticleDOI
TL;DR: New feature extraction methods, which utilize wavelet decomposition and reduced-order linear predictive coding (LPC) coefficients, are proposed for speech recognition; the experimental results show the superiority of the proposed techniques over conventional methods like linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, spectral subtraction, and cepstral mean normalization in the presence of additive white Gaussian noise.
Abstract: In this article, new feature extraction methods, which utilize wavelet decomposition and reduced-order linear predictive coding (LPC) coefficients, are proposed for speech recognition. The coefficients are derived from speech frames decomposed using the discrete wavelet transform. LPC coefficients derived from subband decomposition of a speech frame (abbreviated as WLPC) provide a better representation than modeling the frame directly. The WLPC coefficients are further normalized in the cepstrum domain to obtain a new set of features denoted wavelet subband cepstral mean normalized features. The proposed approaches provide effective (better recognition rate), efficient (reduced feature vector dimension), and noise-robust features. The performance of these techniques has been evaluated on the TI-46 isolated-word database and on a Marathi digits database created by the authors, in a white-noise environment, using continuous-density hidden Markov models. The experimental results also show the superiority of the proposed techniques over conventional methods like linear predictive cepstral coefficients, Mel-frequency cepstral coefficients, spectral subtraction, and cepstral mean normalization in the presence of additive white Gaussian noise.

57 citations
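The WLPC idea, reduced-order LPC computed on a wavelet subband rather than the raw frame, can be sketched with a single Haar level and the autocorrelation normal equations. The subband depth, the order of 4 and the noisy test frame are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def wlpc(frame, order=4):
    """Reduced-order LPC on the Haar low-pass subband (a WLPC-style sketch).

    One Haar level halves the frame length, so a low-order predictor on
    the subband stands in for a higher-order predictor on the raw frame.
    """
    approx = (frame[0::2] + frame[1::2]) / np.sqrt(2.0)   # low-pass subband
    r = np.correlate(approx, approx, mode="full")[len(approx) - 1:]
    # Autocorrelation (Yule-Walker) normal equations for the predictor.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

rng = np.random.default_rng(0)
t = np.arange(256)
frame = np.sin(2 * np.pi * t / 20.0) + 0.05 * rng.standard_normal(256)
feat = wlpc(frame, order=4)   # 4 coefficients instead of, say, 8 on the raw frame
```

The halved feature dimension is exactly the "efficient (reduced feature vector dimension)" property the abstract claims for WLPC.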


01 Jan 2012
TL;DR: An accuracy of 85% is obtained with the combination of features when the proposed approach is tested on a dataset of 280 speech samples, higher than the accuracy obtained with any of the features used singly.
Abstract: This paper proposes an approach to recognize the English words corresponding to the digits zero to nine, spoken in isolation by different male and female speakers. A set of features consisting of a combination of Mel Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC), Zero Crossing Rate (ZCR), and Short Time Energy (STE) of the audio signal is used to generate a 63-element feature vector, which is subsequently used for discrimination. Classification is done using artificial neural networks (ANN) with feedforward back-propagation architectures. An accuracy of 85% is obtained with the combination of features when the proposed approach is tested on a dataset of 280 speech samples, higher than the accuracy obtained with any of the features used singly.

42 citations
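Two of the four feature families combined above, zero-crossing rate and short-time energy, are simple enough to show directly; the frame length and hop below are illustrative choices, not the paper's settings.

```python
import numpy as np

def frame_features(sig, frame_len=256, hop=128):
    """Per-frame zero-crossing rate (ZCR) and short-time energy (STE)."""
    feats = []
    for start in range(0, len(sig) - frame_len + 1, hop):
        frame = sig[start:start + frame_len]
        # ZCR: fraction of adjacent sample pairs whose sign differs.
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # STE: energy of the frame.
        ste = float(np.sum(frame ** 2))
        feats.append((zcr, ste))
    return np.array(feats)

t = np.arange(8000) / 8000.0
f = frame_features(np.sin(2 * np.pi * 100.0 * t))   # 100 Hz tone at 8 kHz
```

For the 100 Hz tone the ZCR sits near 200 crossings per 8000 samples (about 0.025 per sample pair), which is how ZCR separates low-pitched voiced frames from noisy unvoiced ones.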


Patent
05 Dec 2012
TL;DR: In this article, a method and a system for speech recognition are provided, in which vocal characteristics are captured from speech data and used to identify a speaker identification of the speech data.
Abstract: A method and a system for speech recognition are provided. In the method, vocal characteristics are captured from speech data and used to identify a speaker identification of the speech data. Next, a first acoustic model is used to recognize a speech in the speech data. According to the recognized speech and the speech data, a confidence score of the speech recognition is calculated and it is determined whether the confidence score is over a threshold. If the confidence score is over the threshold, the recognized speech and the speech data are collected, and the collected speech data is used for performing a speaker adaptation on a second acoustic model corresponding to the speaker identification.

Proceedings ArticleDOI
21 Mar 2012
TL;DR: This study proposes limited vocabulary isolated word recognition using Linear Predictive Coding and Mel Frequency Cepstral Coefficients for feature extraction, Dynamic Time Warping (DTW) and discrete Hidden Markov Model (HMM) for recognition and their comparisons.
Abstract: This study proposes limited-vocabulary isolated word recognition using Linear Predictive Coding (LPC) and Mel Frequency Cepstral Coefficients (MFCC) for feature extraction, and Dynamic Time Warping (DTW) and discrete Hidden Markov Models (HMM) for recognition, and compares them. Feature extraction is carried out over speech frames of 300 samples with a 100-sample overlap, at an 8 kHz sampling rate of the input speech. MFCC analysis provides a better recognition rate than LPC, as it operates on a logarithmic scale resembling the human auditory system, whereas LPC has uniform resolution over the frequency plane. This is followed by pattern recognition. Since voice signals tend to have different temporal rates, DTW is one of the methods that provide non-linear alignment between two voice signals. Another method, HMM, which statistically models the words, is also presented. Experimentally it is observed that recognition accuracy is better for HMM than for DTW. The database used is the TI-46 isolated word corpus (zero to nine) from the Linguistic Data Consortium.
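The DTW alignment step compared against HMM decoding here is the textbook dynamic program below; real systems align MFCC vectors frame by frame, while scalar sequences keep the sketch short.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences (O(n*m))."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allowed steps: diagonal match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

same = dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 4])   # time-stretched copy
diff = dtw_distance([1, 2, 3, 4], [4, 3, 2, 1])      # reversed sequence
```

The stretched copy costs 0 because DTW absorbs the duplicated sample, which is exactly the "different temporal rate" robustness the abstract describes.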

Journal ArticleDOI
TL;DR: A speech-model-based method, using the linear predictive (LP) residual of the speech signal together with the maximum-likelihood (ML) estimator proposed in "Blind estimation of reverberation time," is shown to estimate RT60 even in the presence of (moderate) background noise.
Abstract: In this paper, we propose a speech-model based method using the linear predictive (LP) residual of the speech signal and the maximum-likelihood (ML) estimator proposed in “Blind estimation of reverberation time,” (R. Ratnam , J. Acoust. Soc. Amer., 2004) to blindly estimate the reverberation time (RT60). The input speech is passed through a low order linear predictive coding (LPC) filter to obtain the LP residual signal. It is proven that the unbiased autocorrelation function of this LP residual has the required properties to be used as an input to the ML estimator. It is shown that this method can successfully estimate the reverberation time with less data than existing blind methods. Experiments show that the proposed method can produce better estimates of RT60, even in highly reverberant rooms. This is because the entire input speech data is used in the estimation process. The proposed method is not sensitive to the type of input data (voiced, unvoiced), number of gaps, or window length. In addition, evaluation using white Gaussian noise and recorded babble noise shows that it can estimate RT60 in the presence of (moderate) background noise.
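The front end of the method, passing speech through a low-order LPC filter to obtain the LP residual, can be sketched with the autocorrelation (Yule-Walker) equations and an FIR inverse filter; the order and the synthetic AR(2) "speech" are illustrative assumptions.

```python
import numpy as np

def lp_residual(sig, order):
    """Whiten a signal with a low-order LP inverse filter.

    Predictor coefficients come from the autocorrelation (Yule-Walker)
    normal equations; a residual like this is what feeds the ML RT60
    estimator described above.
    """
    r = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    e = sig.astype(float).copy()
    for k in range(order):                 # e[t] = x[t] - sum_k a[k] x[t-1-k]
        e[k + 1:] -= a[k] * sig[:len(sig) - k - 1]
    return e

# Toy check: LP inverse filtering flattens AR(2)-coloured noise back to
# roughly unit-variance white noise.
rng = np.random.default_rng(1)
y = np.zeros(4000)
for t in range(2, 4000):
    y[t] = 0.9 * y[t - 1] - 0.5 * y[t - 2] + rng.standard_normal()
e = lp_residual(y, order=2)
```

The whitened residual is closer to the decaying-noise model the ML estimator assumes than the raw speech is, which is why the paper works on the residual.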

Proceedings ArticleDOI
21 Mar 2012
TL;DR: In this paper, a set of different feature extraction methods such as linear predictive coding (LPC), mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), with several feature normalization techniques including RASTA filtering and cepstral mean subtraction (CMS), are compared for text-independent speaker identification using a combination of Gaussian mixture models (GMM) and linear or non-linear kernels based on support vector machines (SVM).
Abstract: Speech feature extraction has been a key focus in robust speech recognition research; it significantly affects recognition performance. In this paper, we first study a set of different feature extraction methods such as linear predictive coding (LPC), mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), with several feature normalization techniques including RASTA filtering and cepstral mean subtraction (CMS). Based on this, a comparative evaluation of these features is performed on the task of text-independent speaker identification using a combination of Gaussian mixture models (GMM) and linear or non-linear kernels based on support vector machines (SVM).

Journal ArticleDOI
TL;DR: The experimental results confirm the superiority of the proposed VAD over the reference methods, particularly in terms of speech detection rate under dominant noise conditions.
Abstract: A new voice activity detection (VAD) algorithm with soft decision output in the Mel-frequency domain is developed based on a hidden Markov model (HMM) and is incorporated in an HMM-based speech enhancement system. The proposed VAD uses a two-state ergodic HMM representing speech presence and speech absence. The states are constructed from the noisy speech and noise HMMs used in the speech enhancement system. This composite model provides robust detection of speech segments in the presence of noise and obviates the need for extra modeling in HMM-based speech enhancement applications. As the main purpose of the proposed VAD is to detect speech segments accurately, a hang-over mechanism is proposed and applied to the output of the VAD to improve the speech detection rate. The VAD is integrated in the HMM-based speech enhancement system in the Mel-frequency spectral (MFS) and cepstral (MFC) domains. The performance of the proposed VAD, the effectiveness of the hang-over mechanism and the performance of the VAD-integrated speech enhancement system are evaluated on four noise types at different SNR levels. The experimental results confirm the superiority of the proposed VAD over the reference methods, particularly in terms of speech detection rate under dominant noise conditions.
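The hang-over mechanism mentioned above is, in its simplest form, a hold timer on the binary VAD decision; the hold length below is an assumed value, not the paper's.

```python
def apply_hangover(vad, hang=5):
    """Hold a VAD 'speech' decision for `hang` extra frames.

    A minimal sketch of the hang-over idea: after the raw detector
    drops to 0, keep reporting speech for a few frames so that weak
    word endings are not clipped.
    """
    out = []
    counter = 0
    for frame_is_speech in vad:
        if frame_is_speech:
            counter = hang           # re-arm the hold timer
        out.append(1 if (frame_is_speech or counter > 0) else 0)
        if not frame_is_speech and counter > 0:
            counter -= 1
    return out

raw = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0]
held = apply_hangover(raw, hang=3)   # -> [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```

The trade-off is the usual one: a longer hold raises the speech detection rate at word tails but also lets more trailing noise through, which is why the paper evaluates the mechanism separately.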

Journal ArticleDOI
TL;DR: The integer fast Fourier transform (FFT) is used to replace the floating-point FFT to speed up the computation for capturing speech features through a mel-frequency cepstrum coefficient, resulting in a significant reduction in the calculation time without influencing the speech recognition rate.
Abstract: A field-programmable gate array (FPGA)-based robust speech measurement and recognition system is the focus of this paper, and the environmental noise problem is its main concern. To accelerate the recognition speed of the FPGA-based speech recognition system, the discrete hidden Markov model is used here to lessen the computation burden inherent in speech recognition. Furthermore, the empirical mode decomposition is used to decompose the measured speech signal contaminated by noise into several intrinsic mode functions (IMFs). The IMFs are then weighted and summed to reconstruct the original clean speech signal. Unlike previous research, in which IMFs were selected by trial and error for specific applications, the weights for each IMF are designed by the genetic algorithm to obtain an optimal solution. The experimental results in this paper reveal that this method achieves a better speech recognition rate for speech subject to various environmental noises. Moreover, this paper also explores the hardware realization of the designed speech measurement and recognition systems on an FPGA-based embedded system with the System-On-a-Chip (SOC) architecture. Since the central-processing-unit core adopted in the SOC has limited computation ability, this paper uses the integer fast Fourier transform (FFT) to replace the floating-point FFT to speed up the computation for capturing speech features through a mel-frequency cepstrum coefficient. The result is a significant reduction in the calculation time without influencing the speech recognition rate. It can be seen from the experiments in this paper that the performance of the implemented hardware is significantly better than that of existing research.

Patent
10 May 2012
TL;DR: In this article, the authors proposed a method of processing an audio signal comprises steps for determining whether the audio signal encoding type is musical signal encoding types using first type information and second type information.
Abstract: FIELD: information technology. ^ SUBSTANCE: method of processing an audio signal comprises steps for determining whether the audio signal encoding type is musical signal encoding type using first type information. If the audio signal encoding type is not a musical signal encoding type, it determined whether the audio signal encoding type is a speech signal encoding type or a mixed signal encoding type using second type information. If the audio signal encoding type is a mixed signal encoding type, spectral data and a linear prediction coefficient is extracted from the audio signal; a difference signal for linear prediction by performing inverse frequency transformation over spectral data is generated; the audio signal is reconstructed by performing linear predictive coding over the linear prediction coefficient and the difference signal and the high-frequency domain signal is reconstructed using a base extension signal corresponding to the frequency domain of the reconstructed audio signal and range extension information. ^ EFFECT: higher efficiency of coding/decoding a audio signals. ^ 15 cl, 14 dwg

Patent
Ho-Sang Sung1, Eunmi Oh1
23 Apr 2012
TL;DR: In this paper, a quantizing apparatus is provided with a quantization path determiner that selects, based on a criterion evaluated before quantization of the input signal, between a first path not using inter-frame prediction and a second path using inter-frame prediction.
Abstract: A quantizing apparatus is provided that includes a quantization path determiner that determines a path from a first path not using inter-frame prediction and a second path using the inter-frame prediction, as a quantization path of an input signal, based on a criterion before quantization of the input signal; a first quantizer that quantizes the input signal, if the first path is determined as the quantization path of the input signal; and a second quantizer that quantizes the input signal, if the second path is determined as the quantization path of the input signal.

Proceedings ArticleDOI
25 Mar 2012
TL;DR: This contribution presents a new consistent solution for MMSE speech amplitude (SA) estimation under SPU, based on the generalized gamma distribution representing a variety of speech priors; it is shown to outperform both the SPU-based MMSE-SA estimator relying on a Gaussian speech prior and gamma MMSE-SA estimation without SPU.
Abstract: Several investigations have shown that speech enhancement approaches can be improved by speech presence uncertainty (SPU) estimation. Although there has been a strong focus on the use of correct statistical models for spectral weighting rules over the last few decades, there are only a few publications on SPU estimation based on a speech prior consistent with the spectral weighting rule. This contribution presents a new consistent solution for MMSE speech amplitude (SA) estimation under SPU, based on the generalized gamma distribution representing a variety of speech priors. Employing the gamma speech model, which is a special case of the generalized gamma distribution, the new approach is shown to outperform both the SPU-based MMSE-SA estimator relying on a Gaussian speech prior and gamma MMSE-SA estimation without SPU.

Patent
26 Sep 2012
TL;DR: In this article, a computing device is able to use an embedded speech recognizer and a network speech recognition system for speech recognition, and the captured audio is forwarded to at least one application.
Abstract: A computing device is able to use an embedded speech recognizer and a network speech recognizer for speech recognition. In response to detecting speech in the captured audio, the computing device may forward the captured audio to its embedded speech recognizer and to a speech client for the network speech recognizer. The embedded speech recognizer provides an embedded-recognizer result for the captured audio. If a network- recognition criterion is met, the speech client forwards the captured audio to the network speech recognizer and receives a network-recognizer result for the captured audio from the network speech recognizer. A speech recognition result for the captured audio is forwarded to at least one application, wherein the speech recognition result is based on at least one of the embedded-recognizer result and the network-recognizer result.

Proceedings Article
18 Oct 2012
TL;DR: 1-D local binary patterns (LBP) are proposed for speech signal segmentation and voice activity detection (VAD), and are combined with a hidden Markov model (HMM) for advanced speech recognition.
Abstract: In this paper, 1-D local binary patterns (LBP) are proposed for speech signal segmentation and voice activity detection (VAD), combined with a hidden Markov model (HMM) for advanced speech recognition. Speech is first de-noised by Adaptive Empirical Mode Decomposition (AEMD) and then processed using LBP-based VAD. The short-time energy of the speech activity detected by the VAD is finally smoothed and used as the input of the HMM recognition process. The enhanced performance of the proposed system for speech recognition is compared with other VAD techniques at SNRs ranging from 15 dB down to a severely noisy condition of −5 dB.
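A 1-D LBP code compares each sample with its neighbours and packs the comparisons into bits. The sketch below uses a symmetric neighbourhood of radius `radius` with a ≥ comparison; these are common choices, not necessarily the paper's exact definition.

```python
import numpy as np

def lbp_1d(x, radius=2):
    """2*radius-bit local binary pattern code per sample.

    Bit b is set when the b-th neighbour is >= the centre sample;
    edge samples without a full neighbourhood are dropped.
    """
    n = len(x)
    centers = x[radius:n - radius]
    codes = np.zeros(n - 2 * radius, dtype=int)
    bit = 0
    for off in list(range(-radius, 0)) + list(range(1, radius + 1)):
        neigh = x[radius + off:n - radius + off]
        codes |= (neigh >= centers).astype(int) << bit
        bit += 1
    return codes

codes = lbp_1d(np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0]), radius=1)  # -> [0, 3, 0, 3]
```

Histograms of these codes over a frame give a texture-like descriptor: voiced, unvoiced and silent frames produce different code distributions, which is what makes the patterns usable for VAD.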

Proceedings ArticleDOI
01 Dec 2012
TL;DR: A prototype model is proposed for speech recognition in the Assamese language using Linear Predictive Coding and Mel frequency cepstral coefficients; the combined architecture yields a 10% gain in recognition rate over the individual architectures.
Abstract: The current work proposes a prototype model for speech recognition in the Assamese language using Linear Predictive Coding (LPC) and Mel frequency cepstral coefficients (MFCC). The speech recognition is part of a speech-to-text conversion system. The LPC and MFCC features are extracted by two different Recurrent Neural Networks (RNN), which are used to recognize spoken Assamese, a major language in the north-eastern part of India. In this work, a decision block combines the two RNN feature-extraction architectures. Using this combined architecture, the system achieves a 10% gain in recognition rate over the case when the individual architectures are used.

Proceedings ArticleDOI
18 Oct 2012
TL;DR: It is shown that the influence of wind noise can be greatly reduced by the proposed concept; a comparison with a state-of-the-art speech enhancement system and with an algorithm specially designed to reduce wind noise is included.
Abstract: In this contribution, we propose a method to enhance single channel speech signals which are degraded by wind noise. In contrast to common speech enhancement systems, a special processing is required due to the highly non-stationary characteristics of wind signals. The basic idea is to exploit the fact that wind noise is mainly located at low frequencies and thus, a large frequency range of the speech is almost noise free. Techniques which artificially extend the bandwidth of telephone speech towards lower frequencies are applied to replace the highly disturbed low frequency parts. Here, the discrete model of speech production is used to reconstruct the required parts of the speech signal. Important parameters for this model are pitch frequency, the spectral envelope and a spectral gain. In this context, an evaluation is carried out which determines the robustness of several pitch estimators against wind noise. The frequency range of the reconstructed speech is finally adapted to the actual level of wind noise. Based on realistic scenarios it is shown that the influence of the wind noise can greatly be reduced by the proposed concept. This includes a comparison with a state-of-the-art speech enhancement system and an algorithm specially designed to reduce wind noise.

Journal ArticleDOI
TL;DR: This paper presents long-term nonlinear prediction based on second-order Volterra filters that can outperform conventional linear prediction techniques in terms of prediction gain and “whiter” residuals.
Abstract: Previous studies of nonlinear prediction of speech have been mostly focused on short-term prediction. This paper presents long-term nonlinear prediction based on second-order Volterra filters. It will be shown that the presented predictor can outperform conventional linear prediction techniques in terms of prediction gain and “whiter” residuals.
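A second-order Volterra long-term predictor adds a quadratic term in the lagged sample to the usual linear tap. The least-squares sketch below uses a single lag and a toy signal with a genuine quadratic dependence; because the linear model is nested inside the Volterra one, the Volterra residual variance can only be lower, which is the "prediction gain" effect the paper measures.

```python
import numpy as np

def volterra_lt(x, lag):
    """Least-squares long-term predictor with a quadratic Volterra tap.

    Predicts x[t] from x[t-lag] and x[t-lag]^2 (plus a bias); the
    single-lag, single-kernel form is a deliberate simplification of a
    full second-order Volterra filter.
    """
    past, target = x[:-lag], x[lag:]
    A = np.column_stack([past, past ** 2, np.ones_like(past)])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return coef, target - A @ coef

# Toy signal with a genuine quadratic dependence on its lagged value.
rng = np.random.default_rng(2)
x = np.zeros(5000)
for t in range(1, 5000):
    x[t] = 0.4 * x[t - 1] + 0.3 * x[t - 1] ** 2 + rng.uniform(-0.1, 0.1)
coef, res2 = volterra_lt(x, lag=1)
lin = np.polyfit(x[:-1], x[1:], 1)                   # linear-only baseline
res1 = x[1:] - np.polyval(lin, x[:-1])
```

On real voiced speech the lag would be the pitch period, and a "whiter" residual means the long-term predictor has removed more of the pitch structure than a linear tap alone.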

Patent
12 Apr 2012
TL;DR: In this paper, a method of noise-robust speech classification is disclosed, where internal classification parameters are generated in the speech classifier from at least one of the input parameters, and a Normalized Auto-correlation Coefficient Function threshold is set.
Abstract: A method of noise-robust speech classification is disclosed. Classification parameters are input to a speech classifier from external components. Internal classification parameters are generated in the speech classifier from at least one of the input parameters. A Normalized Auto-correlation Coefficient Function threshold is set. A parameter analyzer is selected according to a signal environment. A speech mode classification is determined based on a noise estimate of multiple frames of input speech.
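The Normalized Auto-correlation Coefficient Function at a candidate pitch lag is the normalized inner product below; the 0.6 threshold is an assumed illustrative value, since the abstract only states that a threshold is set.

```python
import numpy as np

def nacf(frame, lag):
    """Normalized autocorrelation coefficient at one candidate lag."""
    a, b = frame[lag:], frame[:-lag]
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b) + 1e-12))

THRESHOLD = 0.6            # assumed value; the patent only says one is set

fs = 8000
t = np.arange(800) / fs
voiced = np.sin(2 * np.pi * 200.0 * t)     # 200 Hz -> 40-sample period
noise = np.random.default_rng(3).standard_normal(800)
v = nacf(voiced, 40)       # ~1: strongly periodic, classified as voiced
n = nacf(noise, 40)        # ~0: aperiodic, classified as unvoiced
```

Values near 1 indicate strong periodicity (voiced speech), values near 0 indicate noise-like frames; making the threshold depend on a noise estimate is what gives the classifier its noise robustness.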

01 Jan 2012
TL;DR: In this project, the linear predictor model provides a robust, reliable and accurate method for estimating parameters that characterize the linear, time varying system.
Abstract: One of the most powerful speech analysis techniques is the method of linear predictive analysis. This method has become the predominant technique for representing speech for low bit-rate transmission or storage. The importance of this method lies both in its ability to provide extremely accurate estimates of the speech parameters and in its relative speed of computation. The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past samples. The linear predictor model provides a robust, reliable and accurate method for estimating the parameters that characterize the linear, time-varying system. In this project, we implement a voice excited
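The "linear combination of past samples" model described above is usually fitted with the Levinson-Durbin recursion on the frame's autocorrelation. A compact version, checked on the textbook AR(1) case where the autocorrelation is r[m] = 0.8^m:

```python
import numpy as np

def levinson_durbin(r, order):
    """Autocorrelation sequence r -> LP coefficients and error power.

    Returns a with x[t] ~ sum_k a[k] * x[t-1-k], via the standard
    O(order^2) recursion.
    """
    a = np.zeros(order)
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k = acc / err                        # reflection coefficient
        a_prev = a[:i].copy()
        a[:i] = a_prev - k * a_prev[::-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

# AR(1) check: r[m] = 0.8^m must give a = [0.8, 0] and error power 0.36.
r = 0.8 ** np.arange(3)
a, err = levinson_durbin(r, order=2)
```

The recursion's speed relative to a general linear solve is exactly the "relative speed of computation" advantage the abstract mentions.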

Proceedings Article
18 Oct 2012
TL;DR: In this paper, the authors propose modifying the original speech signal before it is presented in the noisy environment, combining a signal-to-noise-ratio recovery approach with dynamic range compression to improve the intelligibility of speech in noise.
Abstract: The ability to detect speech in noise plays a significant role in our communication with others. In this work we propose modifying the original speech signal before it is presented in the noisy environment, combining a signal-to-noise-ratio recovery approach with dynamic range compression, in order to improve the intelligibility of speech in noise. The modification is performed under the constraint of equal global signal power before and after modification. Experiments with speech-shaped noise (SSN) and competing-speaker (CS) noise at various low SNR values show that the suggested approach outperforms state-of-the-art methods in terms of the Speech Intelligibility Index (SII) as well as in informal listening tests. Compared with a state-of-the-art method, there is an improvement of 4 dB and 8 dB in terms of SNR gain for the SSN and CS noise types, respectively.
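The equal-power constraint can be sketched directly: after any modification (here a crude sample-wise compression as a stand-in for the paper's actual method), rescale the result so its global power matches the original. The compression exponent is an arbitrary illustrative choice.

```python
import math

def compress(x, exponent=0.5):
    # Crude instantaneous dynamic range compression: shrink large
    # amplitudes more than small ones, preserving sign.
    return [math.copysign(abs(v) ** exponent, v) for v in x]

def match_power(modified, original):
    # Scale `modified` so its global power equals that of `original`.
    p_orig = sum(v * v for v in original)
    p_mod = sum(v * v for v in modified)
    g = math.sqrt(p_orig / p_mod)
    return [g * v for v in modified]

x = [0.8, -0.2, 0.05, -0.6]
y = match_power(compress(x), x)
```

The constraint matters for a fair comparison: without it, any method could "improve" intelligibility simply by playing the speech louder.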

01 Jan 2012
TL;DR: Modifications to the commonly used MFCC-based technique are proposed to make it more robust, more accurate, faster, and more computationally efficient, so that it can be considered for real-time applications.
Abstract: Speech processing has emerged as one of the important application areas of digital signal processing. Research fields in speech processing include speech recognition, speaker recognition, speech synthesis, speech coding, etc. Speaker recognition has been an interesting research field for many decades and still has a number of unsolved problems. The objective of automatic speaker recognition is to extract, characterize and recognize the information about speaker identity. Direct analysis and synthesis of the complex voice signal is difficult because of the large amount of information contained in the signal. Therefore, digital signal processes such as feature extraction and feature matching are introduced to represent the voice signal. Several methods such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc., are evaluated with a view to identifying a straightforward and effective method for speech signals. In this paper, the Mel Frequency Cepstrum Coefficient (MFCC) technique is explained for designing a speaker recognition system, and some modifications to the existing MFCC feature extraction technique are suggested. The purpose of modifying the commonly used MFCC-based technique is to improve its performance by making it more robust, more accurate, faster, and more computationally efficient, so that the technique can be considered for real-time applications.
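The mel-scale mapping at the heart of MFCC extraction can be sketched as follows: filterbank center frequencies are spaced uniformly on the mel scale rather than in hertz, so resolution is finer at low frequencies. The 0-8000 Hz range and 26 filters are common illustrative choices, not the paper's settings.

```python
import math

def hz_to_mel(f):
    # Standard mel-scale formula used in MFCC pipelines.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_centers(f_low, f_high, n_filters):
    """Center frequencies (Hz) of triangular filters spaced
    uniformly on the mel scale between f_low and f_high."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    return [mel_to_hz(lo + (hi - lo) * i / (n_filters + 1))
            for i in range(1, n_filters + 1)]

centers = mel_centers(0.0, 8000.0, 26)
```

The remaining MFCC steps (framing, FFT magnitude, triangular filterbank energies, log, DCT) all hang off this frequency warping.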

Proceedings ArticleDOI
Nicolas Obin1, Marco Liuni1
01 Dec 2012
TL;DR: An entropy-based spectral representation is introduced as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition, and is to be extended to the classification of voice quality for the design of an automatic voice casting system in video games.
Abstract: This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Renyi entropy, which is a generalization of the Shannon entropy. In audio signal representation, Renyi entropy presents the advantage of focusing either on the harmonic content (prominent amplitude within a distribution) or on the noise content (equal distribution of amplitudes). The proposed representation outperforms all other noisiness measures — including Shannon and Wiener entropies — in a large-scale classification of vocal effort (whispered-soft/normal/loud-shouted) in the real scenario of multi-language massive role-playing video games. The improvement is around 10% in relative error reduction, and is particularly significant for the recognition of noisy speech — i.e., whispery/breathy speech. This confirms the role of noisiness for speech recognition, and will further be extended to the classification of voice quality for the design of an automatic voice casting system in video games.
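The Renyi entropy underlying the representation can be sketched directly on a normalized magnitude spectrum: H_alpha = log(sum p_i^alpha) / (1 - alpha), which reduces to the Shannon entropy as alpha approaches 1. The four-bin spectra below are toy illustrations of the harmonic-versus-noise contrast the abstract describes.

```python
import math

def renyi_entropy(p, alpha):
    """Renyi entropy of a normalized distribution p (sums to 1), alpha != 1."""
    assert alpha != 1.0
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

flat = [0.25] * 4                  # noise-like: equal distribution of amplitudes
peaked = [0.97, 0.01, 0.01, 0.01]  # harmonic-like: one prominent component
h_flat = renyi_entropy(flat, 2.0)
h_peak = renyi_entropy(peaked, 2.0)
```

A flat (noise-like) spectrum maximizes the entropy while a peaked (harmonic) one minimizes it, which is exactly what makes the measure a usable noisiness descriptor; varying alpha shifts the emphasis between the prominent and the low-amplitude parts of the distribution.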

Proceedings Article
18 Oct 2012
TL;DR: Two straightforward post-filtering methods for near-end speech enhancement in difficult noise conditions are introduced, both using a perceptually motivated high-pass filter to transfer energy from the first formant to higher frequencies.
Abstract: Post-filtering can be used to enhance the quality and intelligibility of speech in mobile phones. This paper introduces two straightforward post-filtering methods for near-end speech enhancement in difficult noise conditions. Both of the algorithms use a perceptually motivated high-pass filter to transfer energy from the first formant to higher frequencies. A Speech Reception Threshold (SRT) test was conducted to determine the performance of the proposed methods in comparison to a similar post-filtering approach and unprocessed speech. The results of the listening tests indicate that all of the post-filtering methods provide intelligibility enhancement compared to unprocessed speech, but there were no significant differences between the methods themselves.
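A first-order high-pass (pre-emphasis) filter illustrates the kind of energy transfer described: low-frequency content around the first formant is attenuated while higher frequencies are boosted. The coefficient 0.95 is an illustrative assumption, not the paper's filter design.

```python
# First-order high-pass filter y[n] = x[n] - a * x[n-1].

def preemphasis(x, a=0.95):
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

# A constant (DC, lowest-frequency) signal is almost entirely suppressed,
# while an alternating (highest-frequency) signal is amplified.
dc = [1.0] * 8
alt = [(-1.0) ** n for n in range(8)]
y_dc = preemphasis(dc)
y_alt = preemphasis(alt)
```

The filter's gain runs from 1 - a at DC up to 1 + a at the Nyquist frequency, which is the spectral tilt that moves energy away from the first-formant region.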