
Showing papers on "Linear predictive coding published in 2010"


Journal Article
TL;DR: In this paper, an index for predicting the effects of noise, nonlinear distortion, and linear filtering on speech quality is developed for both normal-hearing and hearing-impaired listeners.
Abstract: Signal modifications in audio devices such as hearing aids include both nonlinear and linear processing. An index is developed for predicting the effects of noise, nonlinear distortion, and linear filtering on speech quality. The index is designed for both normal-hearing and hearing-impaired listeners. It starts with a representation of the auditory periphery that incorporates aspects of impaired hearing. The cochlear model is followed by the extraction of signal features related to the quality judgments. One set of features measures the effects of noise and nonlinear distortion on speech quality, whereas a second set of features measures the effects of linear filtering. The hearing-aid speech quality index (HASQI) is the product of the subindices computed for each of the two sets of features. The models are evaluated by comparing the model predictions with quality judgments made by normal-hearing and hearing-impaired listeners for speech stimuli containing noise, nonlinear distortion, linear processing, and combinations of these signal degradations.

124 citations
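The abstract states that HASQI is the product of the two subindices; a minimal sketch of that combination rule follows. The subindex values and the clamping to [0, 1] are illustrative assumptions, and the paper's actual feature extraction is omitted.

```python
def hasqi(q_noise_distortion, q_linear_filtering):
    """Combine the two HASQI subindices by multiplication, as the
    abstract describes. Each subindex is clamped to [0, 1] here as an
    illustrative assumption."""
    def clamp(q):
        return min(1.0, max(0.0, q))
    return clamp(q_noise_distortion) * clamp(q_linear_filtering)

# A device scoring 0.8 on the noise/distortion subindex and 0.9 on the
# linear-filtering subindex gets an overall index of about 0.72.
print(hasqi(0.8, 0.9))
```

Because the index is multiplicative, degradation on either dimension alone is enough to pull the overall quality score down.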


Journal ArticleDOI
TL;DR: Experimental results demonstrate the potential of compressed sensing in speech coding techniques, offering high perceptual quality with a very sparse approximated prediction residual.
Abstract: Encouraged by the promising application of compressed sensing in signal compression, we investigate its formulation and application in the context of speech coding based on sparse linear prediction. In particular, a compressed sensing method can be devised to compute a sparse approximation of speech in the residual domain when sparse linear prediction is involved. We compare the method of computing a sparse prediction residual with the optimal technique based on an exhaustive search of the possible nonzero locations and the well known Multi-Pulse Excitation, the first encoding technique to introduce the sparsity concept in speech coding. Experimental results demonstrate the potential of compressed sensing in speech coding techniques, offering high perceptual quality with a very sparse approximated prediction residual.

79 citations


Journal Article
TL;DR: In the proposed system, incorporating non-fiducial features from the LPC spectrum produced a segment and subject recognition rate of 99.52% and 100% respectively, which allows for LPC to be used in a practical ECG biometric system that requires fast, stringent and accurate recognition.
Abstract: In this paper, a novel method for a biometric system based on the ECG signal is proposed, using spectral coefficients computed through linear predictive coding (LPC). ECG biometric systems have traditionally incorporated characteristics of fiducial points of the ECG signal as the feature set. These systems have been shown to contain loopholes and thus a non-fiducial system allows for tighter security. In the proposed system, incorporating non-fiducial features from the LPC spectrum produced a segment and subject recognition rate of 99.52% and 100% respectively. The recognition rates outperformed the biometric system that is based on the wavelet packet decomposition (WPD) algorithm in terms of recognition rates and computation time. This allows for LPC to be used in a practical ECG biometric system that requires fast, stringent and accurate recognition. Keywords—biometric, ecg, linear predictive coding, wavelet packet decomposition

49 citations
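LPC coefficients such as those used as features above are typically obtained from frame autocorrelations via the Levinson-Durbin recursion. A minimal stdlib-only sketch (the AR(1) test signal is an illustrative assumption, not the paper's ECG data):

```python
import random

def autocorr(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) / n
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations for predictor coefficients
    a[0..order-1], where x[n] is predicted as sum_j a[j] * x[n-1-j]."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        a_new = a[:]
        a_new[i] = k
        for j in range(i):
            a_new[j] = a[j] - k * a[i - 1 - j]
        a = a_new
        err *= 1.0 - k * k
    return a, err

# Synthetic AR(1) signal x[n] = 0.9 x[n-1] + e[n]; order-2 LPC should
# recover a[0] close to 0.9 and a[1] close to 0.
random.seed(1)
x = [0.0]
for _ in range(20000):
    x.append(0.9 * x[-1] + random.gauss(0.0, 1.0))
a, err = levinson_durbin(autocorr(x, 2), 2)
print(a)
```

The recursion also yields the prediction error energy `err`, whose decay with model order is itself a usable feature.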


Proceedings ArticleDOI
19 Jul 2010
TL;DR: A two-step framework is proposed to estimate the background noise with minimal speech leakage signal and a correlation based similarity measure is applied to determine the integrity of speech signal, showing that it performs better than the existing speech enhancement algorithms with significant improvement in terms of SNR value.
Abstract: This paper presents a new audio forensics method based on the background noise in audio signals. Traditional speech enhancement algorithms improve the quality of speech signals; however, existing methods leave traces of speech in the removed noise. Noise estimated using these methods contains traces of the speech signal, also known as the leakage signal. Although this speech leakage signal has a low SNR, it can be perceived easily by listening to the estimated noise signal, and the estimate therefore cannot be used for audio forensics applications. For reliable audio authentication, a better noise estimation method is desirable. To achieve this goal, a two-step framework is proposed to estimate the background noise with minimal speech leakage signal. A correlation based similarity measure is then applied to determine the integrity of the speech signal. The proposed method has been evaluated for different speech signals recorded in various environments. The results show that it performs better than existing speech enhancement algorithms, with significant improvement in terms of SNR value.

48 citations
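The "correlation based similarity measure" mentioned above is commonly a normalized cross-correlation between two noise estimates; a stdlib-only sketch (the zero-lag, mean-removed form is an assumption about the specific variant used):

```python
import math

def ncc(x, y):
    """Zero-lag normalized cross-correlation of two equal-length
    signals, in [-1, 1]; values near 1 indicate consistent content."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

sig = [math.sin(0.1 * n) for n in range(200)]
print(ncc(sig, sig))                   # identical signals -> ~1.0
print(ncc(sig, [-v for v in sig]))     # inverted signal   -> ~-1.0
```

For integrity checking, a recording whose estimated background noise correlates poorly with a reference noise profile would be flagged as suspect.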


Proceedings Article
01 Aug 2010
TL;DR: This contribution uses a theoretical analysis of the Speech Intelligibility Index (SII) to develop an algorithm which numerically maximizes the SII under the constraint of an unchanged average power of the audio signal.
Abstract: In speech communications, signal processing algorithms for near end listening enhancement make it possible to improve the intelligibility of clean (far end) speech for the near end listener, who perceives not only the far end speech but also ambient background noise. A typical scenario is mobile telephony in acoustical background noise such as traffic or babble noise. In these situations, it is often not acceptable/possible to increase the audio power amplification. In this contribution we use a theoretical analysis of the Speech Intelligibility Index (SII) to develop an algorithm which numerically maximizes the SII under the constraint of an unchanged average power of the audio signal.

47 citations


Journal ArticleDOI
TL;DR: A novel method of enhancing esophageal speech using statistical voice conversion based on Gaussian mixture models to improve intelligibility and naturalness; one-to-many eigenvoice conversion is also applied to esophageal speech enhancement to make it possible to flexibly control the voice quality of the enhanced speech.
Abstract: This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it doesn't require any external devices, generated voices usually sound unnatural compared with normal speech. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement to make it possible to flexibly control the voice quality of enhanced speech.

39 citations


Patent
02 Apr 2010
TL;DR: In this paper, a linear prediction coefficient of a signal represented in a frequency domain is obtained by performing linear prediction analysis in the frequency direction by using a covariance method or an autocorrelation method.
Abstract: A linear prediction coefficient of a signal represented in a frequency domain is obtained by performing linear prediction analysis in a frequency direction by using a covariance method or an autocorrelation method. After the filter strength of the obtained linear prediction coefficients is adjusted, filtering is performed in the frequency direction on the signal by using the adjusted coefficients, whereby the temporal envelope of the signal is shaped. This reduces the occurrence of pre-echo and post-echo and improves the subjective quality of the decoded signal, without significantly increasing the bit rate in a bandwidth extension technique in the frequency domain represented by SBR.

36 citations


Proceedings ArticleDOI
14 Mar 2010
TL;DR: Experimental results prove the effectiveness of the reweighted 1-norm minimization, offering better coding properties compared to 1-norm minimization.
Abstract: Linear prediction of speech based on 1-norm minimization has already proved to be an interesting alternative to 2-norm minimization. In particular, choosing the 1-norm as a convex relaxation of the 0-norm, the corresponding linear prediction model offers a sparser residual better suited for coding applications. In this paper, we propose a new speech modeling technique based on reweighted 1-norm minimization. The purpose of the reweighted scheme is to overcome the mismatch between 0-norm minimization and 1-norm minimization while keeping the problem solvable with convex estimation tools. Experimental results prove the effectiveness of the reweighted 1-norm minimization, offering better coding properties compared to 1-norm minimization.

30 citations
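Reweighted 1-norm minimization is typically realized as iteratively reweighted least squares (IRLS), with each weight inversely proportional to the current residual magnitude. A first-order toy version (the sparse-excitation test signal, weight floor, and iteration count are illustrative assumptions): when the true prediction residual is sparse, the reweighted estimate recovers the coefficient that plain 2-norm least squares misses.

```python
# Signal with a first-order AR part (a = 0.8) driven by a sparse
# excitation (a unit pulse every 5 samples), so the true prediction
# residual is sparse.
x = [0.0]
for n in range(1, 400):
    x.append(0.8 * x[-1] + (1.0 if n % 5 == 0 else 0.0))

pairs = [(x[n - 1], x[n]) for n in range(1, len(x))]

def weighted_ls(pairs, w):
    """One-coefficient weighted least-squares predictor estimate."""
    num = sum(wi * xp * xc for wi, (xp, xc) in zip(w, pairs))
    den = sum(wi * xp * xp for wi, (xp, xc) in zip(w, pairs))
    return num / den

# Plain least squares (2-norm) is biased by the excitation pulses.
a_ls = weighted_ls(pairs, [1.0] * len(pairs))

# IRLS: reweight by inverse residual magnitude to approximate the
# 1-norm criterion, concentrating the error on the sparse pulses.
a = a_ls
for _ in range(50):
    w = [1.0 / max(abs(xc - a * xp), 1e-8) for xp, xc in pairs]
    a = weighted_ls(pairs, w)

print(a_ls, a)  # the IRLS estimate is much closer to the true 0.8
```

This is the same mechanism the paper exploits at full predictor order: the reweighting bridges the gap between the convex 1-norm surrogate and the sparsity-counting 0-norm.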


Book
10 Mar 2010
TL;DR: This book describes several modules of the Code Excited Linear Prediction (CELP) algorithm, using the Federal Standard-1016 CELP MATLAB® software to illustrate in detail several functions and parameter computations associated with analysis-by-synthesis linear prediction.
Abstract: This book describes several modules of the Code Excited Linear Prediction (CELP) algorithm. The authors use the Federal Standard-1016 CELP MATLAB® software to describe in detail several functions and parameter computations associated with analysis-by-synthesis linear prediction. The book begins with a description of the basics of linear prediction followed by an overview of the FS-1016 CELP algorithm. Subsequent chapters describe the various modules of the CELP algorithm in detail. In each chapter, an overall functional description of CELP modules is provided along with detailed illustrations of their MATLAB® implementation. Several code examples and plots are provided to highlight some of the key CELP concepts. Table of Contents: Introduction to Linear Predictive Coding / Autocorrelation Analysis and Linear Prediction / Line Spectral Frequency Computation / Spectral Distortion / The Codebook Search / The FS-1016 Decoder

28 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed method for formant and anti-formant tracking as well as speech resynthesis is the computationally most efficient approach, when the optimization criterion considers a psychoacoustical frequency scale.
Abstract: A speech production model consists of a linear, slowly time-varying filter. Pole-zero models are required for a good representation of certain types of speech sounds, like nasals and laterals. From a perceptual point of view, designing them by minimizing a logarithmic criterion appears as a very suitable approach. The most accurate available results are obtained by using Newton-like search algorithms to optimize pole and zero positions, or the coefficients of a decomposition into quadratic factors. In this paper, we propose to optimize the numerator and denominator coefficients instead. Experimental results show that this is the computationally most efficient approach, especially when the optimization criterion considers a psychoacoustical frequency scale. To illustrate its applicability in speech processing, we used the proposed method for formant and anti-formant tracking as well as speech resynthesis.

26 citations
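The logarithmic criterion being minimized can be evaluated directly from the pole-zero (numerator/denominator) coefficients. A stdlib-only sketch of the error computation (the uniform frequency grid and squared-log-magnitude form are assumptions; the paper's psychoacoustical frequency scale and Newton-like optimizer are omitted):

```python
import cmath
import math

def logmag(b, a, w):
    """Log magnitude of H(z) = B(z)/A(z) at angular frequency w."""
    z = cmath.exp(-1j * w)
    B = sum(bk * z ** k for k, bk in enumerate(b))
    A = sum(ak * z ** k for k, ak in enumerate(a))
    return math.log(abs(B / A))

def log_error(b, a, target, freqs):
    """Mean squared log-magnitude error against a target spectrum."""
    return sum((logmag(b, a, w) - t) ** 2
               for w, t in zip(freqs, target)) / len(freqs)

freqs = [math.pi * k / 64 for k in range(1, 64)]
b_true, a_true = [1.0, -0.5], [1.0, -0.9]        # one zero, one pole
target = [logmag(b_true, a_true, w) for w in freqs]
print(log_error(b_true, a_true, target, freqs))  # exact model -> 0.0
```

Optimizing `b` and `a` directly under this criterion is the coefficient parameterization the paper argues is the most computationally efficient.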


Patent
09 Jul 2010
TL;DR: In this paper, a speech sensor may generate a unique signal based on the facial, bone, lips and/or throat movements, which can be used in the coding, compression, noise reduction and other aspects of signal processing.
Abstract: Speech detection is a technique to determine and classify periods of speech. In a normal conversation, each speaker speaks less than half the time. The remaining time is devoted to listening to the other end and pauses between speech and silence. The classification is usually done by comparing the signal energy to a threshold. Classifying speech as noise and noise as speech may affect the performance of the communication device. The current invention overcomes such problems by utilizing an alternate sensor signal indicating the presence or absence of speech. In the current invention, the communication device receives an audio signal via single or multiple microphones. The speech sensor may generate a unique signal based on the facial, bone, lips and/or throat movements. The system then combines the information received by the microphones and the speech sensor to decide the presence or absence of speech. This decision can be used in the coding, compression, noise reduction and other aspects of signal processing.

Journal ArticleDOI
TL;DR: A joint time-frequency segmentation algorithm is developed in which the wavelet packet coefficients of the analyzed speech signal are represented as tiles of a time-frequency representation adapted to the characteristics of the signal itself.
Abstract: We develop an algorithm, the joint time-frequency segmentation algorithm, where the wavelet packet coefficients of the analyzed speech signal are represented as tiles of a time-frequency representation adapted to the characteristics of the signal itself. Further, our algorithm enables the decomposition of the speech signal into transient and non-transient components. Any block of wavelet packet coefficients whose tiling height is larger than or equal to the tiling width belongs to the transient component, and vice versa for the non-transient component. The transient component is selectively amplified and recombined with the original speech to generate the modified speech, with energy adjusted to be equal to that of the original speech. The intelligibility of the original and modified speech is evaluated by 16 human listeners. Word recognition rate results show that the modified speech significantly improves speech intelligibility in background noise, i.e., by 10% absolute at 0 dB to 27% absolute at -30 dB.

Proceedings ArticleDOI
01 Dec 2010
TL;DR: The effect of speech coding on text independent speaker identification (SI) is presented; a significant reduction of SI performance due to coding is observed, and the effect is more prominent for SI systems built with source features.
Abstract: The increasing use of wireless systems is creating a great deal of interest in the development of robust speech systems for wireless environments. The major degradations involved in a wireless environment are the effect of varying background conditions, degradation due to speech coders, and errors due to wireless channels. In this paper, we present the effect of speech coding on text independent speaker identification (SI). The speech coders considered in this work are GSM full rate (ETSI 06.10), CELP (FS-1016), and MELP (TI 2.4 kbps). The amount of distortion introduced by coding is measured using the log-likelihood ratio (LLR), weighted spectral slope (WSS), and log-spectral distance (LSD). The effect of coding on SI is analyzed by building SI systems using both vocal tract system and excitation source features. We observe that there is a significant reduction in SI performance due to coding, and the effect is more prominent for the SI system built with source features. We also observe that speaker characteristics are better preserved by MELP than by CELP, even though the MELP coder's bit rate is lower than CELP's.
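Of the three distortion measures mentioned, the log-spectral distance (LSD) is the simplest to sketch: the RMS difference between the log power spectra of the original and coded frames. A stdlib-only version (the direct O(N^2) DFT, dB scaling, and random test frame are implementation assumptions):

```python
import cmath
import math
import random

def power_spectrum(frame):
    """Power spectrum (non-negative frequency bins) via a direct DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2 for k in range(n // 2 + 1)]

def lsd(x, y, floor=1e-12):
    """Root-mean-square distance between log power spectra, in dB."""
    px, py = power_spectrum(x), power_spectrum(y)
    diffs = [(10 * math.log10(max(a, floor)) -
              10 * math.log10(max(b, floor))) ** 2
             for a, b in zip(px, py)]
    return math.sqrt(sum(diffs) / len(diffs))

random.seed(0)
frame = [random.uniform(-1.0, 1.0) for _ in range(64)]
print(lsd(frame, frame))                   # identical frames -> 0 dB
print(lsd(frame, [2 * v for v in frame]))  # 2x gain -> ~6.02 dB
```

A uniform gain change shifts every bin by 20*log10(2) dB, which is exactly what LSD reports; coder distortion shows up as frequency-dependent deviations instead.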

Patent
07 Dec 2010
TL;DR: In this article, noise suppression information is used to optimize or improve automatic speech recognition performed for a signal using a gain value, and the gain to apply to the noisy speech signal is selected to optimize speech recognition analysis of the resulting signal.
Abstract: Noise suppression information is used to optimize or improve automatic speech recognition performed for a signal. Noise suppression can be performed on a noisy speech signal using a gain value. The gain to apply to the noisy speech signal is selected to optimize speech recognition analysis of the resulting signal. The gain may be selected based on one or more features for a current sub band and time frame, as well as one or more features for other sub bands and/or time frames. Noise suppression information can be provided to a speech recognition module to improve the robustness of the speech recognition analysis. Noise suppression information can also be used to encode and identify speech.

Book ChapterDOI
08 Nov 2010
TL;DR: Development of an experimental, speaker-dependent, real-time, isolated word recognizer for the Indian regional language Punjabi, emphasizing a template-based recognizer approach using linear predictive coding with dynamic programming computation, and vector quantization with Hidden Markov Model based recognizers, which also significantly reduces computational costs.
Abstract: The issue of speech interfaces to computers has been capturing global attention because of the convenience they offer. Although speech recognition is not a new phenomenon in user-machine interface studies, existing developments mainly provide solutions for the widely accepted English language. This paper presents the development of an experimental, speaker-dependent, real-time, isolated word recognizer for the Indian regional language Punjabi. The research is further extended to a comparison of speech recognition systems for a small vocabulary of speaker-dependent isolated spoken words in Punjabi using the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) techniques. Punjabi exhibits immense changes between consecutive phonemes, so end point detection becomes highly difficult. The presented work emphasizes a template-based recognizer approach using linear predictive coding with dynamic programming computation, and vector quantization with Hidden Markov Model based recognizers, in isolated word recognition tasks, which also significantly reduces computational costs. The parametric variation enhances the feature vector for recognition of a 500-isolated-word vocabulary in Punjabi; the Hidden Markov Model and Dynamic Time Warping techniques give 91.3% and 94.0% accuracy respectively.
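The dynamic-programming core of the DTW recognizer above can be sketched in a few lines; the absolute-difference local cost on 1-D sequences is a simplification (real recognizers compare LPC feature vectors per frame):

```python
def dtw(a, b):
    """Dynamic time warping distance between two sequences using an
    absolute-difference local cost and the standard step pattern."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[-1][-1]

# A time-stretched copy aligns perfectly; a genuinely different
# sequence does not.
print(dtw([1, 2, 3], [1, 2, 2, 3]))   # -> 0.0
print(dtw([1, 2, 3], [1, 2, 5]))      # -> 2.0
```

The warping path's tolerance of local time stretching is what makes DTW attractive for isolated-word templates where speaking rate varies.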

Proceedings ArticleDOI
14 Mar 2010
TL;DR: This paper presents a novel method of enhancing esophageal speech using statistical voice conversion based on Gaussian mixture models and applies one-to-many eigenvoice conversion to esophageal speech enhancement for flexibly controlling enhanced voice quality.
Abstract: This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it doesn't require any external devices, generated voices sound unnatural. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement for flexibly controlling enhanced voice quality.

Journal ArticleDOI
TL;DR: A parametric speech spectrum model is developed that allows the F0 and spectral envelope to be estimated simultaneously; experiments confirm the significant advantage of this joint estimation approach for both F0 estimation and spectral envelope estimation.
Abstract: Although considerable effort has been devoted to both fundamental frequency (F0) and spectral envelope estimation in the field of speech processing, the problem of determining F0 and spectral envelopes has largely been tackled independently. If F0 were known in advance, then the spectral envelope could be estimated very reliably. On the other hand, if the spectral envelope were known in advance, then we could obtain a reliable F0 estimate. F0 and the spectral envelope, each of which is a prerequisite of the other, should thus be estimated jointly rather than independently in succession. On this basis, we develop a parametric speech spectrum model that allows us to estimate the F0 and spectral envelope simultaneously. We confirmed experimentally the significant advantage of this joint estimation approach for both F0 estimation and spectral envelope estimation.

Journal ArticleDOI
TL;DR: A novel media-specific Forward Error Correction (FEC) technique which retrieves LTP resynchronization with no additional delay at the cost of a very small bit overhead, and can cope with the presence of advanced LTP filters and the usual subframe segmentation applied in modern codecs.
Abstract: The widely used code-excited linear prediction (CELP) paradigm relies on a strong interframe dependency which renders CELP-based codecs vulnerable to packet loss. The use of long-term prediction (LTP) or adaptive codebooks (ACB) is the main source of interframe dependency in these codecs, since they employ the excitation from previous frames. After a frame erasure, the previous excitation is unavailable and a desynchronization between the encoder and the decoder appears, causing an additional distortion which is propagated to the subsequent frames. In this paper, we propose a novel media-specific Forward Error Correction (FEC) technique which retrieves LTP resynchronization with no additional delay at the cost of a very small bit overhead. In particular, the proposed FEC code contains a multipulse signal which replaces the excitation of the previous frame (i.e., the ACB memory) when this has been lost. This multipulse description of the previous excitation is optimized to minimize the perceptual error between the synthesized speech signal and the original one. To this end, we develop a multipulse formulation which includes the additional CELP processing and, in addition, can cope with the presence of advanced LTP filters and the usual subframe segmentation applied in modern codecs. Finally, a quantization scheme is proposed to encode the pulse parameters. Objective and subjective quality tests applied to our proposal show that the propagation error due to the LTP filter can practically be removed with a very small bandwidth increase.

Proceedings ArticleDOI
26 Nov 2010
TL;DR: A novel audio-visual speech recognition approach with enhanced performance over the traditional methods reported so far, employing a neural network with LPC, MFCC and PLP parameters of the speech.
Abstract: Many multimedia applications and entertainment industry products like games, cartoons and film dubbing require speech driven face animation and audio-video synchronization. An Automatic Speech Recognition system (ASR) alone does not give good results in a noisy environment. An Audio Visual Speech Recognition system plays a vital role in such harsh environments as it uses both audio and visual information. In this paper, we propose a novel approach with enhanced performance over the traditional methods reported so far. Our algorithm works on the basis of acoustic and visual parameters to achieve better results. We have tested our system for the English language using LPC, MFCC and PLP parameters of the speech. Lip parameters such as lip width and lip height are extracted from the video, and both the acoustic and visual parameters are used to train systems such as Artificial Neural Network (ANN), Vector Quantization (VQ), Dynamic Time Warping (DTW), and Support Vector Machine (SVM). We have employed a neural network in our research work with LPC, MFCC and PLP parameters. Results show that our system gives a very good response for the tested vowels.

Patent
12 Nov 2010
TL;DR: In this article, an apparatus for processing a signal and method thereof are disclosed, which includes decoding the signal according to the speech coding scheme or the audio coding scheme based on the coding mode information.
Abstract: An apparatus for processing a signal and method thereof are disclosed. The present invention includes receiving coding mode information indicating a speech coding scheme or an audio coding scheme, linear prediction coding degree information indicating a linear prediction coding degree, and the signal including at least one of a speech signal and an audio signal; decoding the signal according to the speech coding scheme or the audio coding scheme based on the coding mode information; decoding linear prediction coding coefficients of the signal based on the linear prediction coding degree information; and generating an output signal by applying the decoded linear prediction coding coefficients to the decoded signal. In this case, the linear prediction coding degree information is determined based on a variation of a value of an LPC residual generated from performing the linear prediction coding on the signal.

Patent
15 Jul 2010
TL;DR: In this article, an acoustic signal analyzer receives a digital acoustic signal containing a speech signal and a noise signal, generates a non-speech GMM and a speech GMM adapted to a noise environment, and calculates the output probabilities of dominant Gaussian distributions of the GMMs.
Abstract: The processing efficiency and estimation accuracy of a voice activity detection apparatus are improved. An acoustic signal analyzer receives a digital acoustic signal containing a speech signal and a noise signal, generates a non-speech GMM and a speech GMM adapted to a noise environment, by using a silence GMM and a clean-speech GMM in each frame of the digital acoustic signal, and calculates the output probabilities of dominant Gaussian distributions of the GMMs. A speech state probability to non-speech state probability ratio calculator calculates a speech state probability to non-speech state probability ratio based on a state transition model of a speech state and a non-speech state, by using the output probabilities; and a voice activity detection unit judges, from the speech state probability to non-speech state probability ratio, whether the acoustic signal in the frame is in the speech state or in the non-speech state and outputs only the acoustic signal in the speech state.
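The speech/non-speech decision above reduces to a likelihood ratio between two models. Collapsing the abstract's GMMs to single Gaussians over a scalar frame feature gives this minimal sketch (all distribution parameters are illustrative assumptions, and the state-transition smoothing is omitted):

```python
import math

def gaussian_loglik(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def speech_log_ratio(energy, noise=(0.1, 0.01), speech=(1.0, 0.1)):
    """Log ratio of speech-model to non-speech-model likelihoods for a
    frame energy; positive values favor the speech state."""
    return gaussian_loglik(energy, *speech) - gaussian_loglik(energy, *noise)

print(speech_log_ratio(1.0) > 0)   # high-energy frame -> speech (True)
print(speech_log_ratio(0.1) > 0)   # low-energy frame  -> non-speech (False)
```

The full apparatus replaces each single Gaussian with a noise-adapted GMM and smooths this ratio through a speech/non-speech state transition model before thresholding.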

Journal ArticleDOI
TL;DR: A two-stage speech activity detection system is presented which first takes advantage of a voice activity detector to discard pause segments from the audio signals; this is done even in the presence of stationary background noises.

Patent
16 Apr 2010
TL;DR: In this paper, a speech detection apparatus and a method were proposed to determine whether a frame is speech or not using feature information extracted from an input signal. But, they did not specify which feature information is required for speech detection for each frame in the estimated situation.
Abstract: A speech detection apparatus and method are provided. The speech detection apparatus and method determine whether a frame is speech or not using feature information extracted from an input signal. The speech detection apparatus may estimate a situation related to an input frame and determine which feature information is required for speech detection for the input frame in the estimated situation. The speech detection apparatus may detect a speech signal using dynamic feature information that may be more suitable to the situation of a particular frame, instead of using the same feature information for each and every frame.

Book
05 May 2010
TL;DR: This is the story of the development of linear predictive coded (LPC) speech and how it came to be used in the first successful packet speech experiments and addresses some of the common assumptions made when modeling random signals.
Abstract: In December 1974 the first realtime conversation on the ARPAnet took place between Culler-Harrison Incorporated in Goleta, California, and MIT Lincoln Laboratory in Lexington, Massachusetts. This was the first successful application of realtime digital speech communication over a packet network and an early milestone in the explosion of realtime signal processing of speech, audio, images, and video that we all take for granted today. It could be considered as the first voice over Internet Protocol (VoIP), except that the Internet Protocol (IP) had not yet been established. In fact, the interest in realtime signal processing had an indirect, but major, impact on the development of IP. This is the story of the development of linear predictive coded (LPC) speech and how it came to be used in the first successful packet speech experiments. Several related stories are recounted as well. The history is preceded by a tutorial on linear prediction methods which incorporates a variety of views to provide context for the stories. This part is a technical survey of the fundamental ideas of linear prediction that are important for speech processing, but the development departs from traditional treatments and takes advantage of several shortcuts, simplifications, and unifications that come with years of hindsight. In particular, some of the key results are proved using short and simple techniques that are not as well known as they should be, and it also addresses some of the common assumptions made when modeling random signals. The reader interested only in the history and already familiar with or uninterested in the technical details of linear prediction and speech may skip Part I entirely.

Proceedings Article
01 Aug 2010
TL;DR: It is shown that bone conducted speech can also be used for robust pitch determination even in a highly noisy environment, which can be very useful in many practical speech communication applications like speech enhancement and speech/speaker recognition.
Abstract: This paper investigates the pitch characteristics of bone conducted speech. Pitch determination of speech signals cannot attain the expected level of accuracy in adverse conditions. Bone conducted speech is robust to ambient noise and has a regular harmonic structure in the lower spectral region. These two properties make it very suitable for pitch tracking. A few works have been reported in the literature on using bone conducted speech to facilitate the detection and removal of unwanted signals from simultaneously recorded air conducted speech. In this paper, we show that bone conducted speech can also be used for robust pitch determination even in highly noisy environments, which can be very useful in many practical speech communication applications like speech enhancement and speech/speaker recognition.
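The regular harmonic structure that makes bone-conducted speech attractive for pitch tracking is usually exploited with an autocorrelation peak search. A stdlib-only sketch (the 60-400 Hz search range and the synthetic test tone are assumptions, not the paper's method or data):

```python
import math

def pitch_autocorr(x, fs, f_min=60.0, f_max=400.0):
    """Estimate pitch as the lag maximizing the autocorrelation within
    the plausible pitch-period range."""
    lag_lo, lag_hi = int(fs / f_max), int(fs / f_min)
    best_lag = max(range(lag_lo, lag_hi + 1),
                   key=lambda lag: sum(x[n] * x[n - lag]
                                       for n in range(lag, len(x))))
    return fs / best_lag

fs = 8000
tone = [math.sin(2 * math.pi * 100.0 * n / fs) for n in range(800)]
print(pitch_autocorr(tone, fs))   # -> 100.0
```

On air-conducted speech this peak picking degrades in noise; the paper's point is that the same idea stays reliable on the bone-conducted channel.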

Proceedings ArticleDOI
16 Apr 2010
TL;DR: The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications, and a number of these coders have already been adopted in national and international cellular telephony standards.
Abstract: The past decade has witnessed substantial progress towards the application of low-rate speech coders to civilian and military communications as well as computer-related voice applications. Central to this progress has been the development of new speech coders capable of producing high-quality speech at low data rates. Most of these coders incorporate mechanisms to represent the spectral properties of speech, provide for speech waveform matching, and optimize the coder's performance for the human ear. A number of these coders have already been adopted in national and international cellular telephony standards.

Patent
22 Oct 2010
TL;DR: In this patent, a first band filter (103) extracts from the wideband speech signal the frequency components lying outside the narrow band, and a band combining unit (104) combines the extracted components with the up-sampled narrowband speech signal.
Abstract: A synthesis filter (106) synthesizes wideband phonological signals and sound source signals selected from a speech signal codebook (105) into a plurality of wideband speech signals, and a distortion evaluation unit (107) selects the wideband speech signal having the lowest waveform distortion relative to the up-sampled narrowband speech signal output from a sampling conversion unit (101). A first band filter (103) extracts from that wideband speech signal the frequency components lying outside the narrow band, and a band combining unit (104) combines the extracted components with the up-sampled narrowband speech signal.
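The patent describes the roles of the band filter (103) and band combining unit (104) but not their designs. A hypothetical sketch of those two roles is given below: a crude moving-average low-pass is subtracted to keep only the high band of the selected wideband candidate, which is then added to the up-sampled narrowband signal. The filter choice and length are assumptions for illustration only, not the patent's implementation.

```python
def highband(x, k=5):
    """Crude band filter (role of unit 103): subtract a moving-average
    low-pass estimate, keeping only components above the narrow band."""
    half = k // 2
    out = []
    for n in range(len(x)):
        win = x[max(0, n - half): n + half + 1]
        out.append(x[n] - sum(win) / len(win))
    return out

def band_combine(nb_up, wb):
    """Band combining (role of unit 104): add the high band extracted
    from the wideband candidate to the up-sampled narrowband signal."""
    hb = highband(wb)
    return [a + b for a, b in zip(nb_up, hb)]

# A constant (purely low-band) wideband candidate contributes nothing:
print(highband([1.0] * 8))  # all zeros
```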

Journal ArticleDOI
TL;DR: This paper explores the possibility of estimating the entropy of frames after applying their score function, instead of using the original frames, and shows that this enables voice activity detection under high noise, where the simple entropy method fails.
Abstract: This paper deals with non-linear transformations for improving the performance of an entropy-based voice activity detector (VAD). The idea of using a non-linear transformation has already been applied in the field of speech linear prediction (linear predictive coding) based on source separation techniques, where a score function is added to the classical equations in order to take into account the true distribution of the signal. We explore the possibility of estimating the entropy of frames after applying their score function, instead of using the original frames. We observe that if the signal is clean, the estimated entropy is essentially the same; if the signal is noisy, however, the frames transformed by the score function may yield entropy values that differ between voiced and unvoiced frames. Experimental evidence is given to show that this enables voice activity detection under high noise, where the simple entropy method fails.
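As a rough illustration of the baseline entropy VAD that the paper improves on (omitting the score-function transform, which depends on the assumed source distribution), a frame can be classified by the Shannon entropy of its amplitude histogram: a peaked, structured amplitude distribution gives low entropy, while broadband noise gives high entropy. The bin count, threshold, and synthetic signals below are illustrative assumptions.

```python
import math
import random

def frame_entropy(frame, bins=16):
    """Shannon entropy (bits) of the frame's normalized amplitude histogram."""
    lo, hi = min(frame), max(frame)
    if hi == lo:                      # constant frame: a single bin, zero entropy
        return 0.0
    counts = [0] * bins
    for s in frame:
        b = min(int((s - lo) / (hi - lo) * bins), bins - 1)
        counts[b] += 1
    n = len(frame)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def is_speech(frame, threshold=3.5):
    """Toy VAD decision: structured (speech-like) frames have a peaked,
    low-entropy amplitude distribution; noise is close to log2(bins)."""
    return frame_entropy(frame) < threshold

random.seed(1)
# Laplacian-like (peaked) amplitudes vs. uniform noise.
speechlike = [random.choice([-1, 1]) * random.expovariate(4.0) for _ in range(4000)]
noise = [random.uniform(-1.0, 1.0) for _ in range(4000)]
print(is_speech(speechlike), is_speech(noise))  # True False
```

The paper's contribution is to apply the score function to each frame before this entropy estimate, which keeps the voiced/unvoiced contrast visible under heavy noise.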

Proceedings ArticleDOI
01 Aug 2010
TL;DR: A new method for the bandwidth extension of telephone speech is presented that uses only the information in the narrowband speech; listening tests show improved speech quality compared with a previously published bandwidth extension method.
Abstract: The limited audio bandwidth used in telephone systems degrades both the quality and the intelligibility of speech. This paper presents a new method for the bandwidth extension of telephone speech. Frequency components are added to the frequency band 4–8 kHz using only the information in the narrowband speech. First, a wideband excitation is generated by spectral folding from the narrowband linear prediction residual. The highband of this signal is divided into four subbands with a filter bank, and a neural network is used to weight the subbands based on features calculated from the narrowband speech. Bandwidth-extended speech is obtained by summing the weighted subbands and the original narrowband signal. Listening tests show that this new method improves speech quality compared with a previously published bandwidth extension method.
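The first step of the method, generating a wideband excitation by spectral folding of the narrowband linear prediction residual, can be illustrated directly in the time domain: inserting a zero between consecutive residual samples doubles the sampling rate and mirrors the 0–4 kHz spectrum into the 4–8 kHz image band. The subband filter bank and neural-network weighting that follow are beyond this sketch, and the function name is an assumption.

```python
def spectral_fold(residual):
    """Zero-insertion upsampling by 2: the output spectrum contains the
    original 0-4 kHz band plus its mirror image in 4-8 kHz, which the
    method then shapes with the weighted subband filter bank."""
    wideband = []
    for s in residual:
        wideband.append(s)
        wideband.append(0.0)
    return wideband

r = [0.5, -0.25, 0.125]
print(spectral_fold(r))  # [0.5, 0.0, -0.25, 0.0, 0.125, 0.0]
```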

Book
01 Jan 2010
TL;DR: This comprehensive text provides an in-depth examination of the underlying signal processing techniques used in speech coding and covers the most recent research findings on topics ranging from a general introduction to speech processing and digital signal processing concepts to sampling theory and related topics.
Abstract: It is becoming increasingly apparent that all forms of communication, including voice, will be transmitted through packet-switched networks based on the Internet Protocol (IP). Therefore, the design of modern devices that rely on speech interfaces, such as cell phones and PDAs, requires a complete and up-to-date understanding of the basics of speech coding. Offering a detailed yet easily accessible introduction to the field, Principles of Speech Coding provides an in-depth examination of the underlying signal processing techniques used in speech coding and outlines key signal processing algorithms used to mitigate impairments to speech quality in VoIP networks. The authors present coding standards from various organizations, including the International Telecommunication Union (ITU). With a focus on applications such as Voice-over-IP telephony, this comprehensive text covers the most recent research findings on topics including: a general introduction to speech processing; digital signal processing concepts; sampling theory and related topics; principles of the pulse code modulation (PCM) and adaptive differential pulse code modulation (ADPCM) standards; linear prediction (LP) and use of the linear predictive coding (LPC) model; vector quantization and its applications in speech coding; case studies of practical speech coders from the ITU and others; and the Internet low-bit-rate coder (iLBC). Developed from the authors' combined teachings, this book also illustrates its contents by providing a real-time implementation of a speech coder on a digital signal processing chip. With its balance of theory and practical coverage, it is ideal for senior-level undergraduate and graduate students in electrical and computer engineering. It is also suitable for engineers and researchers designing or using speech coding systems in their work.