
Showing papers on "Linear predictive coding published in 2006"


Journal ArticleDOI
TL;DR: Experimental results show that the use of a priori information and the calculation of the instantaneous speech and noise excitation variances on a frame-by-frame basis result in good performance in both stationary and nonstationary noise conditions.
Abstract: In this paper, we present a new technique for the estimation of short-term linear predictive parameters of speech and noise from noisy data and their subsequent use in waveform enhancement schemes. The method exploits a priori information about speech and noise spectral shapes stored in trained codebooks, parameterized as linear predictive coefficients. The method also uses information about noise statistics estimated from the noisy observation. Maximum-likelihood estimates of the speech and noise short-term predictor parameters are obtained by searching for the combination of codebook entries that optimizes the likelihood. The estimation involves the computation of the excitation variances of the speech and noise auto-regressive models on a frame-by-frame basis, using the a priori information and the noisy observation. The high computational complexity resulting from a full search of the joint speech and noise codebooks is avoided through an iterative optimization procedure. We introduce a classified noise codebook scheme that uses different noise codebooks for different noise types. Experimental results show that the use of a priori information and the calculation of the instantaneous speech and noise excitation variances on a frame-by-frame basis result in good performance in both stationary and nonstationary noise conditions.
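To make the search concrete, here is a minimal Python sketch of the codebook idea under simplifying assumptions: each codebook entry is an LPC coefficient vector, the per-frame excitation variances are fitted by least squares to the noisy periodogram, and the Itakura-Saito divergence stands in for the negative log-likelihood (the paper's exact likelihood and iterative search are not reproduced). All arrays are hypothetical placeholders.

import numpy as np

def ar_psd(a, n_fft=512):
    # Power spectrum of a unit-variance AR model, a = [1, a1, ..., ap].
    return 1.0 / np.abs(np.fft.rfft(a, n_fft)) ** 2

def fit_variances(p_noisy, psd_s, psd_n):
    # Least-squares fit of the speech/noise excitation variances (>= 0).
    B = np.stack([psd_s, psd_n], axis=1)
    g, *_ = np.linalg.lstsq(B, p_noisy, rcond=None)
    return np.maximum(g, 1e-8)

def codebook_search(p_noisy, speech_cb, noise_cb):
    # p_noisy: noisy periodogram on the same rfft grid as ar_psd (257 bins).
    best, best_d = None, np.inf
    for a_s in speech_cb:
        for a_n in noise_cb:
            ps, pn = ar_psd(a_s), ar_psd(a_n)
            gs, gn = fit_variances(p_noisy, ps, pn)
            model = gs * ps + gn * pn
            r = p_noisy / model
            d = np.mean(r - np.log(r) - 1.0)   # Itakura-Saito divergence
            if d < best_d:
                best_d, best = d, (a_s, a_n, gs, gn)
    return best   # ML-style estimate of speech/noise LPC and variances

The full joint search is O(|speech_cb| * |noise_cb|) per frame, which is exactly the cost the paper's iterative optimization avoids.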

200 citations


Patent
Min Chu1, Peng Liu1, Yong Zhao1, Yusheng Li1
22 Aug 2006
TL;DR: In this article, a concatenating speech synthesizer concatenates selected speech units to obtain the desired synthesized speech by selecting replacement speech units based on measures representative of the difference between the HMM acoustic models of the desired speech unit and available speech units.
Abstract: A concatenating speech synthesizer concatenates selected speech units to obtain the desired synthesized speech. When desired speech units of phonetic and/or prosodic context are not available, the synthesizer selects replacement speech units based on measures representative of the difference between the HMM acoustic models of the desired speech unit and available speech units.

171 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A new approach is presented that applies unit selection to find corresponding time frames in source and target speech to achieve the same performance as the conventional text-dependent training.
Abstract: So far, most of the voice conversion training procedures are text-dependent, i.e., they are based on parallel training utterances of source and target speaker. Since several applications (e.g. speech-to-speech translation or dubbing) require text-independent training, over the last two years, training techniques that use non-parallel data were proposed. In this paper, we present a new approach that applies unit selection to find corresponding time frames in source and target speech. By means of a subjective experiment it is shown that this technique achieves the same performance as the conventional text-dependent training.

129 citations


Journal ArticleDOI
TL;DR: This paper recollects the events that led to proposing the linear prediction coding (LPC) method, then the multipulse LPC and the code-excited LPC.
Abstract: This paper recollects the events that led to proposing the linear prediction coding (LPC) method, then the multipulse LPC and the code-excited LPC.

118 citations


Proceedings ArticleDOI
01 Aug 2006
TL;DR: The simulations show that the performance of the proposed gender classifier is excellent; it is very robust to noise and completely independent of language; the classification accuracy is above 98% for all clean speech and remains above 95% for most noisy speech.
Abstract: A novel gender classification system is proposed based on Gaussian Mixture Models, which use the combined parameters of pitch and 10th-order relative spectral perceptual linear predictive coefficients to model the characteristics of male and female speech. The performance of the gender classification system has been evaluated on clean speech, noisy speech, and multiple languages. The simulations show that the performance of the proposed gender classifier is excellent: it is very robust to noise and completely independent of language; the classification accuracy is above 98% for all clean speech and remains above 95% for most noisy speech, even when the SNR is degraded to 0 dB.
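A minimal sketch of the classification stage, assuming the per-frame feature vectors (pitch stacked with the relative spectral PLP coefficients) are already extracted into arrays of shape (frames, dim); names and model sizes are illustrative, not the paper's settings.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(X_male, X_female, n_components=16):
    gm_m = GaussianMixture(n_components, covariance_type="diag").fit(X_male)
    gm_f = GaussianMixture(n_components, covariance_type="diag").fit(X_female)
    return gm_m, gm_f

def classify(X_utt, gm_m, gm_f):
    # score() is the average per-frame log-likelihood; the larger one wins.
    return "male" if gm_m.score(X_utt) > gm_f.score(X_utt) else "female"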

86 citations


Journal ArticleDOI
TL;DR: It is found that when using a fixed-length analysis window, the best measures can detect the instant of glottal closure in 97% of larynx cycles with a standard deviation of 0.6 ms and that some improvement in detection rate may be obtained if the analysis window length is adapted to the speech pitch.
Abstract: Measures based on the group delay of the LPC residual have been used by a number of authors to identify the time instants of glottal closure in voiced speech. In this paper, we discuss the theoretical properties of three such measures and we also present a new measure having useful properties. We give a quantitative assessment of each measure's ability to detect glottal closure instants evaluated using a speech database that includes a direct measurement of glottal activity from a Laryngograph/EGG signal. We find that when using a fixed-length analysis window, the best measures can detect the instant of glottal closure in 97% of larynx cycles with a standard deviation of 0.6 ms and that in 9% of these cycles an additional excitation instant is found that normally corresponds to glottal opening. We show that some improvement in detection rate may be obtained if the analysis window length is adapted to the speech pitch. If the measures are applied to the preemphasized speech instead of to the LPC residual, we find that the timing accuracy worsens but the detection rate improves slightly. We assess the computational cost of evaluating the measures and we present new recursive algorithms that give a substantial reduction in computation in all cases.
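A minimal sketch of the underlying measure, assuming a precomputed LPC residual: for a window dominated by a single excitation impulse, the frequency-averaged group delay equals the impulse position within the window, so the centred average group delay crosses zero (negative-going) when a glottal closure sits at the window centre. Window length, hop, and LPC order are illustrative.

import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_residual(x, order=16):
    r = np.correlate(x, x, "full")[len(x) - 1: len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])        # Levinson-style solve
    return x - np.convolve(np.concatenate(([0.0], a)), x)[: len(x)]

def centred_avg_group_delay(e, win=200, hop=5):
    n, out = np.arange(win), []
    for start in range(0, len(e) - win, hop):
        seg = e[start: start + win] * np.hanning(win)
        X, Y = np.fft.rfft(seg), np.fft.rfft(n * seg)  # Y = DFT{n * x[n]}
        gd = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
        out.append(gd.mean() - win / 2)                # centre the measure
    return np.asarray(out)

def gci_candidates(agd, win=200, hop=5):
    zc = np.where((agd[:-1] > 0) & (agd[1:] <= 0))[0]  # negative-going zeros
    return zc * hop + win // 2                         # sample positions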

67 citations


Proceedings ArticleDOI
14 May 2006
TL;DR: A procedure is proposed that first establishes a band-limited interpolation of the observed spectrum using a recently rediscovered true envelope estimator and then uses the band-limited envelope to derive an all-pole envelope model named TE-LPC.
Abstract: In this work we address the problem of all-pole spectral envelope estimation for speech signals. The currently widely used all-pole spectral envelope model suffers from well-known systematic errors and, more severely, from model order mismatch. We propose a procedure that first establishes a band-limited interpolation of the observed spectrum using a recently rediscovered true envelope estimator and then uses the band-limited envelope to derive an all-pole envelope model named TE-LPC. The band-limited envelope used to derive the all-pole model reduces the problem of the unknown all-pole model order. For the experimental investigation we propose a new perceptually motivated residual spectral peak flatness measure. The experimental results demonstrate that the proposed method significantly increases the spectral flatness for the perceptually especially important low-order harmonics of voiced utterances.
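A minimal sketch of the two stages under common assumptions about the true-envelope algorithm (iterative cepstral smoothing) and a standard all-pole fit to the resulting envelope; cepstral order and tolerance are illustrative.

import numpy as np
from scipy.linalg import solve_toeplitz

def true_envelope(mag, cep_order=40, n_iter=100, tol=1e-4):
    # Iteratively push a cepstrally smoothed envelope above the log spectrum.
    target = np.log(mag + 1e-12)         # log magnitude on an rfft grid
    env = target.copy()
    for _ in range(n_iter):
        target = np.maximum(target, env)
        cep = np.fft.irfft(target)
        cep[cep_order:-cep_order] = 0.0  # low-quefrency lifter
        env = np.fft.rfft(cep).real
        if np.max(target - env) < tol:
            break
    return np.exp(env)                   # band-limited magnitude envelope

def te_lpc(mag_env, order=20):
    # All-pole fit to the envelope: autocorrelation via inverse FFT of the
    # power envelope, then the normal equations (the TE-LPC idea).
    r = np.fft.irfft(mag_env ** 2)[: order + 1]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))   # [1, -a1, ..., -ap]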

60 citations


Journal ArticleDOI
TL;DR: A new technique for separating two speech signals from a single recording is presented and effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on an underdetermined blind source separation.
Abstract: We present a new technique for separating two speech signals from a single recording. The proposed method bridges the gap between underdetermined blind source separation techniques and those techniques that model the human auditory system, that is, computational auditory scene analysis (CASA). For this purpose, we decompose the speech signal into the excitation signal and the vocal-tract-related filter and then estimate the components from the mixed speech using a hybrid model. We first express the probability density function (PDF) of the mixed speech's log spectral vectors in terms of the PDFs of the underlying speech signal's vocal-tract-related filters. Then, the mean vectors of PDFs of the vocal-tract-related filters are obtained using a maximum likelihood estimator given the mixed signal. Finally, the estimated vocal-tract-related filters along with the extracted fundamental frequencies are used to reconstruct estimates of the individual speech signals. The proposed technique effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on an underdetermined blind source separation. We compare our model with both an underdetermined blind source separation and a CASA method. The experimental results show that our model outperforms both techniques in terms of SNR improvement and the percentage of crosstalk suppression.

53 citations


Journal ArticleDOI
Li Deng1, Alejandro Acero1, I. Bazzi1
TL;DR: A new technique for high-accuracy tracking of vocal-tract resonances (which coincide with formants for nonnasalized vowels) in natural speech is presented, based on a discretized nonlinear prediction function which is embedded in a temporal constraint on the quantized input values over adjacent time frames as the prior knowledge for their temporal behavior.
Abstract: This paper presents a new technique for high-accuracy tracking of vocal-tract resonances (which coincide with formants for nonnasalized vowels) in natural speech. The technique is based on a discretized nonlinear prediction function, which is embedded in a temporal constraint on the quantized input values over adjacent time frames as the prior knowledge for their temporal behavior. The nonlinear prediction is constructed, based on its analytical form derived in detail in this paper, as a parameter-free, discrete mapping function that approximates the “forward” relationship from the resonance frequencies and bandwidths to the Linear Predictive Coding (LPC) cepstra of real speech. Discretization of the function permits the “inversion” of the function via a search operation. We further introduce the nonlinear-prediction residual, characterized by a multivariate Gaussian vector with trainable mean vectors and covariance matrices, to account for the errors due to the functional approximation. We develop and describe an expectation–maximization (EM)-based algorithm for training the parameters of the residual, and a dynamic programming-based algorithm for resonance tracking. Details of the algorithm implementation for computation speedup are provided. Experimental results are presented which demonstrate the effectiveness of our new paradigm for tracking vocal-tract resonances. In particular, we show the effectiveness of training the prediction-residual parameters in obtaining high-accuracy resonance estimates, especially during consonantal closure.
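The forward mapping has a simple closed form worth stating: for an all-pole model whose pole pairs correspond to resonance frequencies f_k and bandwidths b_k (in Hz) at sampling rate fs, the LPC cepstra are c_n = (2/n) * sum_k exp(-pi n b_k / fs) * cos(2 pi n f_k / fs). Quantizing (f_k, b_k) and precomputing c_n for every codeword is what makes the search-based inversion tractable. A small sketch (values illustrative):

import numpy as np

def resonances_to_cepstra(freqs_hz, bws_hz, fs, n_cep=15):
    n = np.arange(1, n_cep + 1)[:, None]      # cepstral index as a column
    decay = np.exp(-np.pi * n * np.asarray(bws_hz) / fs)
    osc = np.cos(2 * np.pi * n * np.asarray(freqs_hz) / fs)
    return (2.0 / n[:, 0]) * (decay * osc).sum(axis=1)

# e.g. three resonances of a neutral vowel at 16 kHz:
c = resonances_to_cepstra([500, 1500, 2500], [60, 80, 100], fs=16000)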

51 citations


Proceedings ArticleDOI
06 Jun 2006
TL;DR: The zero crossing extraction and the energy level detection are applied to the recorded speech signal for voiced/unvoiced area detection and the extracted MFCC data are further used as inputs for neural network training.
Abstract: Speech recognition is a major topic in speech signal processing. Speech recognition is considered one of the most popular and reliable biometric technologies used in automatic personal identification systems. Speech recognition systems are used for a variety of applications such as multimedia browsing tools, access centers, security, and finance. They allow people who work in active environments to use computers. For reliable, high-accuracy speech recognition, simple and efficient representation methods are required. In this paper, zero-crossing extraction and energy level detection are applied to the recorded speech signal for voiced/unvoiced region detection. The detected voiced signals are then segmented. Further, the MFCC method is applied to all of the segmented windows. The extracted MFCC data are then used as inputs for neural network training.
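A minimal sketch of the voiced/unvoiced front end, assuming a signal normalized to [-1, 1]; frame size, hop, and thresholds are illustrative, not the paper's.

import numpy as np

def frame_signal(x, win=400, hop=160):
    idx = np.arange(0, len(x) - win, hop)[:, None] + np.arange(win)
    return x[idx]

def voiced_mask(x, win=400, hop=160, e_thresh=1e-3, zcr_thresh=0.15):
    frames = frame_signal(x, win, hop)
    energy = (frames ** 2).mean(axis=1)                  # short-time energy
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    # Voiced speech: relatively high energy and low zero-crossing rate.
    return (energy > e_thresh) & (zcr < zcr_thresh)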

48 citations


Journal ArticleDOI
TL;DR: Analytical results from statistical room acoustics are utilized to analyze the AR modeling of speech under reverberant conditions, and it is demonstrated that at each individual source-microphone position (without spatial expectation), the M-channel AR coefficients provide the best approximation to the clean speech coefficients when the microphones are closely spaced.
Abstract: Hands-free speech input is required in many modern telecommunication applications that employ autoregressive (AR) techniques such as linear predictive coding. When the hands-free input is obtained in enclosed reverberant spaces such as typical office rooms, the speech signal is distorted by the room transfer function. This paper utilizes theoretical results from statistical room acoustics to analyze the AR modeling of speech under these reverberant conditions. Three cases are considered: (i) AR coefficients calculated from a single observation; (ii) AR coefficients calculated jointly from an M-channel observation (M > 1); and (iii) AR coefficients calculated from the output of a delay-and-sum beamformer. The statistical analysis, with supporting simulations, shows that the spatial expectations of the AR coefficients for cases (i) and (ii) are approximately equal to those of the original speech, while for case (iii) there is a discrepancy due to spatial correlation between the microphones which can be significant. It is subsequently demonstrated that at each individual source-microphone position (without spatial expectation), the M-channel AR coefficients from case (ii) provide the best approximation to the clean speech coefficients when the microphones are closely spaced (< 0.3 m).
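A minimal sketch of case (ii), under the assumption that "jointly" amounts to pooling the per-channel normal equations (equivalent to stacking the per-channel least-squares problems); the paper's exact formulation may differ.

import numpy as np
from scipy.linalg import solve_toeplitz

def multichannel_ar(channels, order=12):
    r = np.zeros(order + 1)
    for x in channels:                  # sum the autocorrelation sequences
        r += np.correlate(x, x, "full")[len(x) - 1: len(x) + order]
    a = solve_toeplitz((r[:-1], r[:-1]), r[1:])
    return np.concatenate(([1.0], -a))  # joint AR polynomial [1, -a1, ...]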

Proceedings ArticleDOI
01 Aug 2006
TL;DR: It is shown that the application of spectrum subtraction also helps decrease some constant level background noise found in these types of microphones and the proposed equalizer provides notable quality improvement on the bone-conducted speech input, both subjectively and objectively.
Abstract: We propose an equalizer that attempts to improve the perceived speech quality of bone-conducted speech input with ear-insert microphones, which can provide clean speech input in noisy environments. We first show that the transfer characteristics of bone-conducted speech are both speaker and microphone dependent, and propose an equalizer that is trained using simultaneously recorded airborne and bone-conducted speech. The short-term FFT amplitude ratio of airborne and bone-conducted speech is used. The amplitudes are averaged and smoothed extensively before the ratio is calculated. The trained equalizer is applied to bone-conducted speech in the frequency domain. We show that the proposed equalizer provides notable quality improvement on the bone-conducted speech input, both subjectively and objectively. We also show that applying spectrum subtraction helps decrease some constant-level background noise found in these types of microphones.
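A minimal sketch of training and applying such an equalizer from paired recordings; FFT size and the moving-average smoothing length are illustrative stand-ins for the paper's extensive smoothing.

import numpy as np
from scipy.signal import stft, istft

def train_equalizer(air, bone, fs, nperseg=512, smooth=9):
    _, _, A = stft(air, fs, nperseg=nperseg)
    _, _, B = stft(bone, fs, nperseg=nperseg)
    mag_a = np.abs(A).mean(axis=1)              # average over frames
    mag_b = np.abs(B).mean(axis=1)
    k = np.ones(smooth) / smooth                # smooth before the ratio
    mag_a = np.convolve(mag_a, k, "same")
    mag_b = np.convolve(mag_b, k, "same")
    return mag_a / (mag_b + 1e-12)              # per-bin equalizer gain

def equalize(bone, gain, fs, nperseg=512):
    _, _, B = stft(bone, fs, nperseg=nperseg)
    _, y = istft(B * gain[:, None], fs, nperseg=nperseg)
    return y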

Proceedings Article
01 Jan 2006
TL;DR: A speech fragment decoding technique that treats segregation and recognition as coupled problems and produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance whereby performance increases as the target-masker ratio is reduced below -3 dB.
Abstract: This paper addresses the problem of recognising speech in the presence of a competing speaker. We employ a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of spectro-temporal fragments, such that each fragment is dominated by one or other of the speech sources. A speech fragment decoder is used which employs missing data techniques and clean speech models to simultaneously search for the set of fragments and the word sequence that best matches the target speaker model. The paper reports recent advances in this technique, and presents an evaluation based on artificially mixed speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance whereby performance increases as the target-masker ratio is reduced below -3 dB. Index Terms: speech recognition, speech separation, simultaneous speech, auditory scene analysis, noise robustness.

Journal ArticleDOI
01 Dec 2006
TL;DR: Two new simple non-linear methods of frequency scale mapping are introduced for transforming voice characteristics from a male voice to female, childish, or young male voices.
Abstract: Voice conversion, i.e. modification of a speech signal to sound as if spoken by a different speaker, finds its use in speech synthesis with a new voice without the necessity of a new database. This paper introduces two new simple non-linear methods of frequency scale mapping for transforming voice characteristics from male to female or childish voices. The frequency scale mapping methods were developed primarily for use in the Czech and Slovak text-to-speech (TTS) system designed for the blind and based on the Pocket PC device platform. It uses a cepstral description of the diphone speech inventory of a male speaker using the source-filter speech model or the harmonic speech model. Three new diphone speech inventories corresponding to female, childish and young male voices are created from the original male speech inventory. Listening tests are used for evaluation of the voice transformation and the quality of the synthetic speech.

Proceedings Article
01 Sep 2006
TL;DR: A new “syllable-like” speech unit that is suitable for concatenative speech synthesis is described, automatically generated using a group delay based segmentation algorithm and acoustically correspond to the form C*VC* (C: consonant, V: vowel).
Abstract: In this work we describe a new “syllable-like” speech unit that is suitable for concatenative speech synthesis. These units are automatically generated using a group delay based segmentation algorithm and acoustically correspond to the form C*VC* (C: consonant, V: vowel). The effectiveness of the unit is demonstrated by synthesizing natural-sounding speech in Tamil, a regional Indian language. Significant quality improvement is obtained if bisyllable units are also used, rather than just monosyllables, with results far superior to the traditional diphone-based approach. An important advantage of this approach is the elimination of prosody rules. Since f0 is part of the target cost, the unit selection procedure chooses the best unit from among the many candidates. The naturalness of the synthesized speech demonstrates the effectiveness of this approach.

Proceedings ArticleDOI
14 May 2006
TL;DR: Several speaker adaptation algorithms and MAP modification are described to develop a consistent method for synthesizing speech in a unified way for an arbitrary amount of speech data.
Abstract: In HMM-based speech synthesis, we have to choose the modeling strategy for speech synthesis units depending on the amount of available speech data in order to generate synthetic speech of better quality. In general, speaker-dependent modeling is the ideal choice when a large amount of speech data is available, whereas speaker adaptation from an average voice model becomes promising when the available speech data of a target speaker is limited. This paper describes several speaker adaptation algorithms and MAP modification to develop a consistent method for synthesizing speech in a unified way for an arbitrary amount of speech data. We incorporate these adaptation algorithms into our HSMM-based speech synthesis system and show its effectiveness through several evaluation tests.

Book ChapterDOI
16 Aug 2006
TL;DR: An adaptive LSB algorithm is presented that embeds dynamic secret speech data bits into public G.711-PCM (Pulse Code Modulation) speech for secure communication according to the energy distribution, with high steganographic efficiency and good output speech quality.
Abstract: This paper presents an adaptive LSB (Least Significant Bit) algorithm that embeds dynamic secret speech data bits into public G.711-PCM (Pulse Code Modulation) speech for the purpose of secure communication according to the energy distribution, with high efficiency in steganography and good quality in the output speech. It is superior to the classical LSB algorithm. Experiments show that the proposed approach is capable of embedding up to 20 kbps of secret speech data into G.711 speech at an average embedding error rate of 10⁻⁵. It meets the requirements of information hiding and satisfies the speech quality constraints of secure communication, with excellent speech quality while complicating speaker recognition.
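A minimal sketch of the energy-adaptive LSB idea, assuming 16-bit linear samples (the paper embeds into G.711 code words, but the adaptive mechanism is the same): bits go only into frames whose energy clears a threshold, and the energy is computed on LSB-masked samples so that embedding cannot flip a frame's decision and desynchronize the extractor. Frame size and threshold are illustrative.

import numpy as np

def frame_active(seg, e_thresh):
    masked = seg.astype(np.int64) & ~1          # energy ignores the LSBs
    return (masked ** 2).mean() >= e_thresh

def embed(cover, bits, frame=80, e_thresh=1e6):
    stego, i = cover.copy(), 0
    for s in range(0, len(cover) - frame + 1, frame):
        if i >= len(bits) or not frame_active(stego[s: s + frame], e_thresh):
            continue
        for j in range(frame):
            if i >= len(bits):
                break
            stego[s + j] = (int(stego[s + j]) & ~1) | int(bits[i])
            i += 1
    return stego

def extract(stego, n_bits, frame=80, e_thresh=1e6):
    out = []
    for s in range(0, len(stego) - frame + 1, frame):
        if frame_active(stego[s: s + frame], e_thresh):
            out.extend(int(v) & 1 for v in stego[s: s + frame])
        if len(out) >= n_bits:
            break
    return np.asarray(out[:n_bits])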

Patent
23 Mar 2006
TL;DR: In this article, a system and method for improving the quality and intelligibility of speech signals is proposed, which applies frequency compression to the higher frequency components of speech signal while leaving lower frequency components substantially unchanged.
Abstract: A system and method are provided for improving the quality and intelligibility of speech signals. The system and method apply frequency compression to the higher frequency components of speech signals while leaving lower frequency components substantially unchanged. This preserves higher frequency information related to consonants which is typically lost to filtering and bandpass constraints. This information is preserved without significantly altering the fundamental pitch of the speech signal so that when the speech signal is reproduced its overall tone qualities are preserved. The system and method further apply frequency expansion to speech signals. Like the compression, only the upper frequencies of a received speech signal are expanded. When the frequency expansion is applied to a speech signal that has been compressed according to the invention, the speech signal is substantially returned to its pre-compressed state. However, frequency compression according to the invention provides improved intelligibility even when the speech signal is not subsequently re-expanded. Likewise, speech signals may be expanded even though the original signal was not compressed, without significant degradation of the speech signal quality. Thus, a transmitter may include the system for applying high frequency compression without regard to whether a receiver will be capable of re-expanding the signal. Likewise, a receiver may expand a received speech signal without regard to whether the signal was previously compressed.
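A minimal sketch of the core compression idea in STFT terms, as one plausible reading of the patent: bins below a cutoff pass through, bins above it are folded down toward the cutoff by a fixed ratio. The linear bin remapping, cutoff, and ratio are all illustrative assumptions.

import numpy as np
from scipy.signal import stft, istft

def compress_highs(x, fs, f_cut=2000.0, ratio=2.0, nperseg=512):
    f, _, Z = stft(x, fs, nperseg=nperseg)
    out = np.zeros_like(Z)
    k_cut = np.searchsorted(f, f_cut)
    out[:k_cut] = Z[:k_cut]                       # low band unchanged
    src = np.arange(k_cut, len(f))                # high-band bins
    dst = k_cut + ((src - k_cut) / ratio).astype(int)
    np.add.at(out, dst, Z[src])                   # fold energy downward
    _, y = istft(out, fs, nperseg=nperseg)
    return y

Expansion would apply the inverse mapping (spreading bins back upward), which restores a previously compressed signal only approximately, matching the patent's observation that re-expansion is optional.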

Journal ArticleDOI
TL;DR: An accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments based on a hard decision clustering approach where a set of prototypes is used to characterize the noisy channel.
Abstract: This paper presents an accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments. The proposed method is based on a hard-decision clustering approach where a set of prototypes is used to characterize the noisy channel. Detecting the presence of speech is enabled by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model. The algorithm benefits from using contextual information, a strategy that considers not only a single speech frame but also a neighborhood of data in order to smooth the decision function and improve speech detection robustness. The proposed scheme exhibits reduced computational cost, making it adequate for real-time applications such as automatic speech recognition systems. An exhaustive analysis is conducted on the AURORA 2 and AURORA 3 databases in order to assess the performance of the algorithm and to compare it to existing standard voice activity detection (VAD) methods. The results show significant improvements in detection accuracy and speech recognition rate over standard VADs such as ITU-T G.729, ETSI GSM AMR, and ETSI AFE for distributed speech recognition, and over a representative set of recently reported VAD algorithms.
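A minimal sketch of the decision rule, assuming per-frame feature vectors (e.g., log filter-bank energies) and k-means noise prototypes; K, the context length, and the threshold are illustrative placeholders.

import numpy as np
from sklearn.cluster import KMeans

def train_noise_model(noise_feats, k=4):
    return KMeans(n_clusters=k, n_init=10).fit(noise_feats).cluster_centers_

def vad(feats, prototypes, context=4, thresh=2.0):
    # Average each frame with its neighborhood before measuring distance,
    # which smooths the decision function as the paper describes.
    pad = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    smoothed = np.stack([pad[i: i + 2 * context + 1].mean(axis=0)
                         for i in range(len(feats))])
    d = np.linalg.norm(smoothed[:, None, :] - prototypes[None], axis=2)
    return d.min(axis=1) > thresh   # far from every noise prototype -> speech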

Patent
Yang Gao1
23 Oct 2006
TL;DR: In this paper, a speech post-processor for enhancing a speech signal (320) divided into a plurality of sub-bands (330) in the frequency domain is presented, where an envelope modification factor is generated using FAC = α ENV / Max + (1-α), where FAC is the modification factor, ENV is the envelope, Max is the maximum envelope, and α is a value between 0 and 1, where α is different constant value for each speech coding rate.
Abstract: There is provided a speech post-processor (250) for enhancing a speech signal (320) divided into a plurality of sub-bands (330) in frequency domain. The speech post-processor comprises an envelope modification factor generator (260) configured to use frequency domain coefficients representative of an envelope derived from the plurality of sub-bands to generate an envelope modification factor for the envelope derived from the plurality of sub-bands, where the envelope modification factor is generated using FAC = α ENV / Max + (1-α), where FAC is the envelope modification factor, ENV is the envelope, Max is the maximum envelope, and α is a value between 0 and 1, where α is a different constant value for each speech coding rate. The speech post-processor further comprises an envelope modifier (265) configured to modify the envelope derived from the plurality of sub-bands by the envelope modification factor corresponding to each of the plurality of sub-bands.
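The claimed rule is easy to state in code: FAC ranges from (1 - α) for a zero-amplitude sub-band up to 1 for the peak sub-band, so low-level sub-bands are attenuated relative to the spectral peak, a sharpening typical of post-filters. A direct transcription (values illustrative):

import numpy as np

def modify_envelope(env, alpha):
    # FAC = alpha * ENV / Max + (1 - alpha), applied per sub-band.
    fac = alpha * env / env.max() + (1.0 - alpha)
    return env * fac

print(modify_envelope(np.array([0.2, 0.5, 1.0, 0.3]), alpha=0.4))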

PatentDOI
TL;DR: A stereo encoding device is disclosed that accurately encodes a stereo signal at a low bit rate while suppressing delay in audio communication, performing monaural encoding in its first layer.
Abstract: There is disclosed a stereo encoding device capable of accurately encoding a stereo signal at a low bit rate and suppressing delay in audio communication. The device performs monaural encoding in its first layer (110). In a second layer (120), a filtering unit (103) generates LPC (Linear Predictive Coding) coefficients and a left-channel drive sound source signal. A time region evaluation unit (104) and a frequency region evaluation unit (105) perform signal evaluation and prediction in their respective domains. A residual encoding unit (106) encodes a residual signal. A bit distribution control unit (107) adaptively distributes bits to the time region evaluation unit (104), the frequency region evaluation unit (105), and the residual encoding unit (106) according to the condition of the audio signal.

Journal ArticleDOI
TL;DR: A frame-by-frame adaptation method adding the reflection signal to the means of the acoustic model to compensate for the reflection noise in hands-free speech recognition.
Abstract: This paper describes a hands-free speech recognition technique based on acoustic model adaptation to reverberant speech. In hands-free speech recognition, the recognition accuracy is degraded by reverberation, since each segment of speech is affected by the reflection energy of the preceding segment. To compensate for the reflection signal we introduce a frame-by-frame adaptation method adding the reflection signal to the means of the acoustic model. The reflection signal is approximated by a first-order linear prediction from the observation signal at the preceding frame, and the linear prediction coefficient is estimated with a maximum likelihood method by using the EM algorithm, which maximizes the likelihood of the adaptation data. Its effectiveness is confirmed by word recognition experiments on reverberant speech.

Proceedings ArticleDOI
14 May 2006
TL;DR: A novel multichannel speech activity detection algorithm is presented, which explicitly models the overlap incurred by participants taking turns at speaking, and which almost halves the number of frames missed by a competitive algorithm within regions of overlapped speech.
Abstract: The study of meetings, and multi-party conversation in general, is currently the focus of much attention, calling for more robust and more accurate speech activity detection systems. We present a novel multichannel speech activity detection algorithm, which explicitly models the overlap incurred by participants taking turns at speaking. Parameters for overlapped speech states are estimated during decoding by using and combining knowledge from other observed states in the same meeting, in an unsupervised manner. We demonstrate on the NIST Rich Transcription Spring 2004 data set that the new system almost halves the number of frames missed by a competitive algorithm within regions of overlapped speech. The overall speech detection error on unseen data is reduced by 36% relative.

Journal ArticleDOI
TL;DR: The results indicate that the hybrid use of articulatory, perceptual and prosodic features of speech, combined with a supervised dimensionality-reduction procedure, is able to outperform any individual acoustic model for speech-driven facial animation.

Proceedings ArticleDOI
14 May 2006
TL;DR: ITU-T test results showed that this coder passed all the requirements of the G729EV qualification phase.
Abstract: This paper describes an 8–32 kbit/s scalable speech and audio coder submitted as a candidate for the ITU-T G729-based Embedded Variable bitrate (G729EV) standardization. The coder is built upon a 3-stage coding structure consisting of: narrowband cascade CELP coding at 8 and 12 kbit/s, bandwidth extension based on wideband linear-predictive coding (WB-LPC) at 14 kbit/s, and MDCT coding in a WB-LPC weighted signal domain from 14 to 32 kbit/s. ITU-T test results showed that this coder passed all the requirements of the G729EV qualification phase.

Patent
10 Mar 2006
TL;DR: In this article, a speech piece editing section predicts prosody of a fixed message and selects an item of the retrieved speech piece data most matching each speech piece of the fixed message one by one according to the prosody prediction results.
Abstract: A speech piece editing section (5) retrieves speech piece data on a speech piece the read of which matches that of a speech piece in a fixed message from a speech piece database (7) and converts the speech piece so as to match the speed specified by utterance speed data. The speech piece editing section (5) predicts the prosody of a fixed message and selects an item of the retrieved speech piece data most matching each speech piece of the fixed message one by one according to the prosody prediction results. However, if the proportion of the speech piece corresponding to the selected item of the speech piece data does not reach a predetermined value, the selection is cancelled. Concerning the speech piece for which selection is not made, waveform data representing the waveform of each unit speech is supplied to a sound processing section (41). The selected speech piece data and the supplied waveform data are interconnected thereby to create data representing a synthesized speech. Thus, a speech synthesis device for quickly producing a synthesized speech without any uncomfortable feeling with a simple structure is provided.

Journal ArticleDOI
TL;DR: The proposed speech bandwidth extension method is based on models of speech acoustics and fundamentals of human hearing, and care has been taken to deal with implementation aspects, such as noisy speech signals, speech signal delays, computational complexity, and processing memory usage.
Abstract: Today's telecommunications systems use a limited audio signal bandwidth. A typical bandwidth is 0.3-3.4 kHz, but recently it has been suggested that mobile phone networks will facilitate an audio signal bandwidth of 50 Hz-7 kHz. This has been suggested because an increased bandwidth improves the sound quality of the speech signals. Since initially only few telephones will have this facility, a method is suggested that extends the conventional narrow-band speech signal into a wide-band speech signal using only the receiving telephone. This gives the impression of a wide-band speech signal. The proposed speech bandwidth extension method is based on models of speech acoustics and fundamentals of human hearing. The extension maps each speech feature separately. Care has been taken to deal with implementation aspects, such as noisy speech signals, speech signal delays, computational complexity, and processing memory usage.

Proceedings ArticleDOI
18 Dec 2006
TL;DR: The proposed methods modify the shape of the vocal tract system and the characteristics of the prosody according to the desired requirement by manipulating instants of significant excitation from the linear prediction residual of the speech signals.
Abstract: In this paper we propose some flexible methods that are useful in the process of voice conversion. The proposed methods modify the shape of the vocal tract system and the characteristics of the prosody according to the desired requirement. The shape of the vocal tract system is modified by shifting the major resonant frequencies (formants) of the short-term spectrum and altering their bandwidths accordingly. In the case of prosody modification, the required durational and intonational characteristics are imposed on the given speech signal. In the proposed method, the prosodic characteristics are manipulated using instants of significant excitation. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations such as the onset of a burst in the case of nonvoiced speech. Instants of significant excitation are computed from the linear prediction (LP) residual of the speech signals by using the property of the average group delay of minimum-phase signals. The manipulation of durational characteristics and the pitch contour (intonation pattern) is achieved by manipulating the LP residual with the help of the knowledge of the instants of significant excitation. The modified LP residual is used to excite a time-varying filter. The filter parameters are updated according to the desired vocal tract characteristics. The proposed methods are evaluated using listening tests.

Proceedings ArticleDOI
07 Jun 2006
TL;DR: This paper compares the recognition rates obtained in the ASR system with front-ends based on features extracted by perceptual variants of cepstral analysis and linear prediction and by simple linear prediction, using hidden Markov models (HMM).
Abstract: This paper describes continuous speech recognition experiments on a Romanian language speech database using Hidden Markov Models (HMM). We compare the recognition rates obtained in our ASR system with front-ends based on features extracted by perceptual variants of cepstral analysis and linear prediction and by simple linear prediction. The best results, obtained with 36 mel-frequency cepstral coefficients (MFCC), are used as the basis for ranking the LPC-based front-ends. The second rank, achieved with 5 perceptual linear prediction (PLP) coefficients, is very promising and clearly better than the last-ranked performance of the simple linear prediction coefficients (LPC). We reorganized the database as follows: one database for male speakers, one for female speakers, and one for both male and female speakers.

Patent
Doh-Suk Kim1
30 Jun 2006
TL;DR: In one embodiment, distortion in a received speech signal is estimated using at least one model trained based on subjective quality assessment data and a speech quality assessment is then determined based on the estimated distortion as discussed by the authors.
Abstract: In one embodiment, distortion in a received speech signal is estimated using at least one model trained based on subjective quality assessment data. A speech quality assessment for the received speech signal is then determined based on the estimated distortion.