
Showing papers on "Speech coding" published in 2009


Patent
10 Dec 2009
Abstract: A speech signal processing system comprises an audio processor (103) for providing a first signal representing an acoustic speech signal of a speaker. An EMG processor (109) provides a second signal which represents an electromyographic signal for the speaker captured simultaneously with the acoustic speech signal. A speech processor (105) is arranged to process the first signal in response to the second signal to generate a modified speech signal. The processing may, for example, be beamforming, noise compensation, or speech encoding. Improved speech processing may be achieved, in particular, in an acoustically noisy environment.

547 citations


Journal ArticleDOI
TL;DR: The results demonstrate that the qTA model is both an effective tool for research on tone and intonation and a potentially effective system for automatic synthesis of tone and intonation.
Abstract: This paper reports the development of a quantitative target approximation (qTA) model for generating F0 contours of speech. The qTA model simulates the production of tone and intonation as a process of syllable-synchronized sequential target approximation [Xu, Y. (2005). “Speech melody as articulatorily implemented communicative functions,” Speech Commun. 46, 220–251]. It adopts a set of biomechanical and linguistic assumptions about the mechanisms of speech production. The communicative functions directly modeled are lexical tone in Mandarin, lexical stress in English, and focus in both languages. The qTA model is evaluated by extracting function-specific model parameters from natural speech via supervised learning (automatic analysis by synthesis) and comparing the F0 contours generated with the extracted parameters to those of natural utterances through numerical evaluation and perceptual testing. The F0 contours generated by the qTA model with the learned parameters were very close to the natural contours.
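As a rough illustration of the syllable-synchronized sequential target approximation idea described in the abstract, the following Python sketch lets each syllable's F0 approach a linear pitch target through a critically damped third-order response, with the F0, velocity and acceleration state carried across syllable boundaries. The parameter values, sampling rate and semitone scale are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def qta_contour(syllables, state=(90.0, 0.0, 0.0), fs=200):
    """Sequential target approximation, sketched after the qTA idea.

    syllables: list of (m, b, lam, dur) per syllable -- target slope (st/s),
               target height (st), approximation rate (1/s), duration (s).
    state:     (F0, velocity, acceleration) at the onset of the first
               syllable, in semitones.
    fs:        number of F0 samples per second.
    """
    f0_all = []
    y, dy, ddy = state
    for m, b, lam, dur in syllables:
        # Polynomial coefficients chosen so the contour starts from the
        # transferred state (continuity of F0, velocity and acceleration).
        c1 = y - b
        c2 = dy + c1 * lam - m
        c3 = (ddy + 2 * c2 * lam - c1 * lam ** 2) / 2.0
        t = np.arange(0.0, dur, 1.0 / fs)
        p = c1 + c2 * t + c3 * t ** 2
        dp = c2 + 2 * c3 * t
        ddp = 2 * c3 * np.ones_like(t)
        decay = np.exp(-lam * t)
        f0 = p * decay + m * t + b                          # surface F0
        df0 = (dp - lam * p) * decay + m                    # velocity
        ddf0 = (ddp - 2 * lam * dp + lam ** 2 * p) * decay  # acceleration
        f0_all.append(f0)
        y, dy, ddy = f0[-1], df0[-1], ddf0[-1]  # state handed to the next syllable
    return np.concatenate(f0_all)

# e.g. a static high target followed by a falling target (illustrative values)
contour = qta_contour([(0.0, 96.0, 20.0, 0.25), (-40.0, 96.0, 20.0, 0.30)])
```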

171 citations


Journal ArticleDOI
TL;DR: This paper proposes effective algorithms to automatically classify audio clips into one of six classes: music, news, sports, advertisement, cartoon and movie, using support vector machines and a neural network for the classification of audio.
Abstract: In the age of digital information, audio data has become an important part of many modern computer applications. Audio classification has become a focus of research in audio processing and pattern recognition. Automatic audio classification is very useful for audio indexing, content-based audio retrieval and on-line audio distribution, but it is a challenge to extract the most common and salient themes from unstructured raw audio data. In this paper, we propose effective algorithms to automatically classify audio clips into one of six classes: music, news, sports, advertisement, cartoon and movie. For these categories, a number of acoustic features, including linear predictive coefficients, linear predictive cepstral coefficients and mel-frequency cepstral coefficients, are extracted to characterize the audio content. Support vector machines are applied to classify audio into the respective classes by learning from training data. The proposed method then extends the approach with a radial basis function neural network (RBFNN) for the classification of audio; the RBFNN applies a nonlinear transformation followed by a linear transformation to reach a higher-dimensional hidden space. Experiments on different genres across the various categories show that the classification results are significant and effective.
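A minimal sketch of the feature-plus-SVM stage described above, assuming librosa and scikit-learn and using only MFCC statistics as clip-level features (the paper also uses LPC and LPCC features and an RBF neural network, which are not shown). Function names, file names and SVM settings are illustrative.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def clip_features(path, sr=22050, n_mfcc=13):
    """Summarize a clip by the mean and standard deviation of its MFCCs
    (one of the feature families named in the paper; LPC and LPCC features
    would be appended to the same vector)."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_audio_classifier(paths, labels):
    """Fit an SVM on labelled clips, with labels drawn from the six classes
    (music, news, sports, advertisement, cartoon, movie)."""
    X = np.stack([clip_features(p) for p in paths])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    return clf.fit(X, labels)

# usage (file names are hypothetical):
# clf = train_audio_classifier(["news_001.wav", "music_001.wav"], ["news", "music"])
# print(clf.predict([clip_features("unknown_clip.wav")]))
```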

160 citations


Patent
24 Jun 2009
TL;DR: In this article, the problem of insufficient noise content in a reconstructed highband is addressed by using Adaptive Noise-floor Addition, which is applicable to both speech coding and natural audio coding systems.
Abstract: Methods and an apparatus for enhancement of source coding systems utilizing high frequency reconstruction (HFR) are introduced. The problem of insufficient noise content in a reconstructed highband is addressed by using Adaptive Noise-floor Addition. New methods are also introduced for enhanced performance by means of limiting unwanted noise, and interpolation and smoothing of envelope adjustment amplification factors. The methods and apparatus are applicable to both speech coding and natural audio coding systems.

160 citations


Patent
16 Feb 2009
TL;DR: In this paper, a user preference processor (109) receives user preference feedback for the test audio signals and generates a personalization parameter for the user in response to the user preferences and a noise parameter for each noise component of at least one of the audio signals.
Abstract: An audio device is arranged to present a plurality of test audio signals to a user where each test audio signal comprises a signal component and a noise component. A user preference processor (109) receives user preference feedback for the test audio signals and generates a personalization parameter for the user in response to the user preference feedback and a noise parameter for the noise component of at least one of the test audio signals. An audio processor (113) then processes an audio signal in response to the personalization parameter and the resulting signal is presented to the user. The invention may allow improved characterization of a user thereby resulting in improved adaptation of the processing and thus an improved personalization of the presented signal. The invention may e.g. be beneficial for hearing aids for hearing impaired users.

143 citations


Patent
06 Jul 2009
TL;DR: In this paper, an apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects comprises a processor for processing an audio input signal to provide an object representation of the input signal, where this object representation can be generated by a parametrically guided approximation of original objects using an object downmix signal.
Abstract: An apparatus for generating at least one audio output signal representing a superposition of at least two different audio objects comprises a processor for processing an audio input signal to provide an object representation of the audio input signal, where this object representation can be generated by a parametrically guided approximation of original objects using an object downmix signal. An object manipulator individually manipulates objects using audio object based metadata referring to the individual audio objects to obtain manipulated audio objects. The manipulated audio objects are mixed using an object mixer for finally obtaining an audio output signal having one or several channel signals depending on a specific rendering setup.

134 citations


Proceedings ArticleDOI
19 Apr 2009
TL;DR: This new codec forms the basis of the reference model in the ongoing MPEG standardization activity for Unified Speech and Audio Coding, which results in a codec that exhibits consistently high quality for speech, music and mixed audio content.
Abstract: Traditionally, speech coding and audio coding were separate worlds. Based on different technical approaches and different assumptions about the source signal, neither of the two coding schemes could efficiently represent both speech and music at low bitrates. This paper presents a unified speech and audio codec, which efficiently combines techniques from both worlds. This results in a codec that exhibits consistently high quality for speech, music and mixed audio content. The paper gives an overview of the codec architecture and presents results of formal listening tests comparing this new codec with HE-AAC(v2) and AMR-WB+. This new codec forms the basis of the reference model in the ongoing MPEG standardization activity for Unified Speech and Audio Coding.

108 citations


Patent
16 Jun 2009
TL;DR: Spatialized audio is generated for voice data received at a telecommunications device based on spatial audio information received with the voice data and based on a determined virtual position of the source for producing spatialized audio signals as discussed by the authors.
Abstract: Spatialized audio is generated for voice data received at a telecommunications device based on spatial audio information received with the voice data and based on a determined virtual position of the source of the voice data for producing spatialized audio signals.

106 citations


PatentDOI
TL;DR: In this paper, two-channel input audio signals are processed to construct output audio signals by decomposing the input signal into a plurality of subband audio signals, and the output signal is synthesized from the generated subband signals.
Abstract: Two-channel input audio signals are processed to construct output audio signals by decomposing the two-channel input audio signals into a plurality of two-channel subband audio signals. Separately, in each of a plurality of subbands, at least three generated subband audio signals are generated by steering the two-channel subband audio signals into at least three generated signal locations. The output audio signals are synthesized from the generated subband audio signals. The steering applies differing construction rules in at least two of the plurality of subbands.

102 citations


Patent
12 Mar 2009
TL;DR: In this article, a speech recognition system includes a mobile device and a remote server, where the mobile device receives the speech from the user and extracts the features and phonemes from the speech.
Abstract: A speech recognition system includes a mobile device and a remote server. The mobile device receives the speech from the user and extracts the features and phonemes from the speech. Selected phonemes and measures of uncertainty are transmitted to the server, which processes the phonemes for speech understanding and transmits a text of the speech (or the context or understanding of the speech) back to the mobile device.

94 citations


Journal ArticleDOI
TL;DR: To evaluate envelope recovery at the output of the cochlea, neural cross-correlation coefficients were developed that quantify the similarity between two sets of spike-train responses and can be used to quantitatively evaluate a wide range of perceptually significant temporal coding issues relevant to normal and impaired hearing.
Abstract: Any sound can be separated mathematically into a slowly varying envelope and rapidly varying fine-structure component. This property has motivated numerous perceptual studies to understand the relative importance of each component for speech and music perception. Specialized acoustic stimuli, such as auditory chimaeras with the envelope of one sound and fine structure of another have been used to separate the perceptual roles for envelope and fine structure. Cochlear narrowband filtering limits the ability to isolate fine structure from envelope; however, envelope recovery from fine structure has been difficult to evaluate physiologically. To evaluate envelope recovery at the output of the cochlea, neural cross-correlation coefficients were developed that quantify the similarity between two sets of spike-train responses. Shuffled auto- and cross-correlogram analyses were used to compute separate correlations for responses to envelope and fine structure based on both model and recorded spike trains from auditory nerve fibers. Previous correlogram analyses were extended to isolate envelope coding more effectively in auditory nerve fibers with low center frequencies, which are particularly important for speech coding. Recovered speech envelopes were present in both model and recorded responses to one- and 16-band speech fine-structure chimaeras and were significantly greater for the one-band case, consistent with perceptual studies. Model predictions suggest that cochlear recovered envelopes are reduced following sensorineural hearing loss due to broadened tuning associated with outer-hair cell dysfunction. In addition to the within-fiber cross-stimulus cases considered here, these neural cross-correlation coefficients can also be used to evaluate spatiotemporal coding by applying them to cross-fiber within-stimulus conditions. Thus, these neural metrics can be used to quantitatively evaluate a wide range of perceptually significant temporal coding issues relevant to normal and impaired hearing.

Proceedings ArticleDOI
04 Feb 2009
TL;DR: A comparison among different structures of neural networks is conducted for a better understanding of the problem and its possible solutions.
Abstract: This paper presents a Bangla speech recognition system. The system is divided into two major parts: the first part is speech signal processing and the second part is the speech pattern recognition technique. The speech processing stage consists of speech starting- and end-point detection, windowing, filtering, calculation of the Linear Predictive Coding (LPC) and cepstral coefficients, and finally construction of the codebook by vector quantization. The second part is a pattern recognition system using an Artificial Neural Network (ANN). Speech signals are recorded using an audio wave recorder in a normal room environment. The recorded speech signal is passed through the starting- and end-point detection algorithm to detect the presence of speech and remove the silence and pause portions of the signal. The resulting signal is then filtered to remove unwanted background noise, and the filtered signal is windowed with half-frame overlap. After windowing, the LPC and cepstral coefficients are calculated. The feature extractor uses a standard LPC cepstrum coder, which converts the incoming speech signal into the LPC cepstrum feature space. A Self-Organizing Map (SOM) neural network maps each variable-length LPC trajectory of an isolated word onto a fixed-length LPC trajectory, producing a fixed-length feature vector to be fed into the recognizer. The structure of the neural network is designed with a Multi-Layer Perceptron approach and tested with 3, 4 and 5 hidden layers using tanh-sigmoid transfer functions for the Bangla speech recognition system. A comparison among different structures of neural networks is conducted for a better understanding of the problem and its possible solutions.
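The LPC-and-cepstrum front end described above can be sketched as follows, assuming librosa for the LPC analysis and framing; the LPC-to-cepstrum step is the standard recursion for an all-pole filter. Frame sizes, model order and file names are illustrative assumptions, and the VQ codebook, SOM and MLP stages are not shown.

```python
import numpy as np
import librosa

def lpc_cepstrum(frame, order=12, n_ceps=12):
    """LPC analysis of one windowed frame followed by the standard
    LPC-to-cepstrum recursion for the all-pole filter 1/A(z)."""
    a = librosa.lpc(frame, order=order)            # a[0] == 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc -= (k / n) * c[k] * a[n - k]
        c[n] = acc
    return c[1:]

def lpc_cepstral_trajectory(y, frame_length=400, hop_length=200):
    """Half-overlapping Hamming-windowed frames, one cepstral vector each."""
    frames = librosa.util.frame(y, frame_length=frame_length,
                                hop_length=hop_length).T
    window = np.hamming(frame_length)
    return np.array([lpc_cepstrum(window * f) for f in frames])

# usage (file name, sampling rate and frame sizes are illustrative):
# y, sr = librosa.load("isolated_word.wav", sr=16000)
# feats = lpc_cepstral_trajectory(y)
```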

Proceedings ArticleDOI
19 Apr 2009
TL;DR: It is found that MP is efficient and effective in recovering CS-encoded speech while jointly estimating the signal-dependent, unknown linear transform.
Abstract: Compressive sensing (CS) has been proposed for signals with sparsity in a linear transform domain. We explore a signal-dependent, unknown linear transform, namely the impulse response matrix operating on a sparse excitation, as in the linear model of speech production, for recovering compressive sensed speech. Since the linear transform is signal dependent and unknown, unlike in the standard CS formulation, a codebook of transfer functions is proposed in a matching pursuit (MP) framework for CS recovery. It is found that MP is efficient and effective in recovering CS-encoded speech as well as jointly estimating the linear model. A moderate number of CS measurements and a low-order sparsity estimate result in MP converging to the same linear transform as direct VQ of the LP vector derived from the original signal. There is also a high positive correlation between the signal-domain approximation and the CS measurement-domain approximation for a large variety of speech spectra.
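A much-simplified sketch of the recovery idea: for each impulse-response matrix in a (hypothetical) codebook, matching pursuit is run on the combined sensing-plus-synthesis matrix and the entry with the smallest residual is kept, jointly giving the sparse excitation and the linear model. This illustrates the general approach, not the authors' exact algorithm or parameter choices.

```python
import numpy as np

def matching_pursuit(y, A, n_atoms):
    """Greedy MP: repeatedly pick the dictionary column best correlated with
    the residual and subtract its contribution."""
    r = y.astype(float).copy()
    x = np.zeros(A.shape[1])
    norms = np.linalg.norm(A, axis=0) + 1e-12
    for _ in range(n_atoms):
        corr = A.T @ r / norms                  # normalized correlations
        k = int(np.argmax(np.abs(corr)))
        coef = corr[k] / norms[k]
        x[k] += coef
        r -= coef * A[:, k]
    return x, r

def recover_frame(y, Phi, codebook_H, n_atoms=10):
    """y: CS measurements of one speech frame, Phi: sensing matrix,
    codebook_H: candidate impulse-response matrices mapping a sparse
    excitation to a speech frame.  Returns the best reconstruction and
    the selected model."""
    best = None
    for H in codebook_H:
        e, r = matching_pursuit(y, Phi @ H, n_atoms)
        if best is None or np.linalg.norm(r) < best[0]:
            best = (np.linalg.norm(r), H @ e, H)
    return best[1], best[2]
```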

Patent
Nobuyuki Washio1
22 Dec 2009
TL;DR: In this article, an information processing apparatus for speech recognition includes a first speech dataset storing speech data uttered by low-recognition-rate speakers, a second speech dataset storing speech data uttered by a plurality of speakers, and a third speech dataset containing speech data to be mixed with the speech data of the second dataset.
Abstract: An information processing apparatus for speech recognition includes a first speech dataset storing speech data uttered by low recognition rate speakers; a second speech dataset storing speech data uttered by a plurality of speakers; a third speech dataset storing speech data to be mixed with the speech data of the second speech dataset; a similarity calculating part obtaining, for each piece of the speech data in the second speech dataset, a degree of similarity to a given average voice in the first speech dataset; a speech data selecting part recording the speech data, the degree of similarity of which is within a given selection range, as selected speech data in the third speech dataset; and an acoustic model generating part generating a first acoustic model using the speech data recorded in the second speech dataset and the third speech dataset.

Journal ArticleDOI
TL;DR: An efficient algorithm for segmentation of audio signals into speech or music that can be easily adapted to different audio types, and is suitable for real-time operation is presented.
Abstract: We present an efficient algorithm for segmentation of audio signals into speech or music. The central motivation for our study is consumer audio applications, where various real-time enhancements are often applied. The algorithm consists of a learning phase and a classification phase. In the learning phase, predefined training data is used for computing various time-domain and frequency-domain features, for speech and music signals separately, and estimating the optimal speech/music thresholds, based on the probability density functions of the features. An automatic procedure is employed to select the best features for separation. In the test phase, initial classification is performed for each segment of the audio signal, using a three-stage sieve-like approach, applying both Bayesian and rule-based methods. To avoid erroneous rapid alternations in the classification, a smoothing technique is applied, averaging the decision on each segment with past segment decisions. Extensive evaluation of the algorithm, on a database of more than 12 hours of speech and more than 22 hours of music, showed correct identification rates of 99.4% and 97.8%, respectively, and quick adjustment to alternating speech/music sections. In addition to its accuracy and robustness, the algorithm can be easily adapted to different audio types, and is suitable for real-time operation.
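The decision-smoothing stage mentioned above (averaging each segment's decision with past segment decisions before thresholding) could look roughly like the sketch below; the window length, the threshold and the use of per-segment speech probabilities as classifier scores are illustrative assumptions.

```python
import numpy as np

def smooth_decisions(scores, history=4, threshold=0.5):
    """scores: per-segment 'speech' scores in [0, 1] from the initial
    classifier.  Each segment's score is averaged with the scores of the
    previous `history` segments before thresholding, which suppresses
    erroneous rapid alternations between classes."""
    labels = []
    for i, s in enumerate(scores):
        window = scores[max(0, i - history):i + 1]
        avg = float(np.mean(window))
        labels.append("speech" if avg >= threshold else "music")
    return labels

# an isolated outlier at index 2 stays 'speech'; the sustained change at the
# end eventually flips the decision to 'music'
print(smooth_decisions([0.9, 0.8, 0.2, 0.9, 0.85, 0.1, 0.15, 0.1]))
```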

Journal ArticleDOI
TL;DR: In this book, an introduction to pitch estimation is given and a number of statistical methods for pitch estimation are presented, which include both single- and multi-pitch estimators based on statistical approaches, like maximum likelihood and maximum a posteriori methods.
Abstract: Periodic signals can be decomposed into sets of sinusoids having frequencies that are integer multiples of a fundamental frequency. The problem of finding such fundamental frequencies from noisy observations is important in many speech and audio applications, where it is commonly referred to as pitch estimation. These applications include analysis, compression, separation, enhancement, automatic transcription and many more. In this book, an introduction to pitch estimation is given and a number of statistical methods for pitch estimation are presented. The basic signal models and associated estimation theoretical bounds are introduced, and the properties of speech and audio signals are discussed and illustrated. The presented methods include both single- and multi-pitch estimators based on statistical approaches, like maximum likelihood and maximum a posteriori methods, filtering methods based on both static and optimal adaptive designs, and subspace methods based on the principles of subspace orthogonality and shift-invariance. The application of these methods to analysis of speech and audio signals is demonstrated using both real and synthetic signals, and their performance is assessed under various conditions and their properties discussed. Finally, the estimators are compared in terms of computational and statistical efficiency, generalizability and robustness.
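As a small illustration of the statistical approach to pitch estimation described in the book, the sketch below implements a least-squares (maximum-likelihood under white Gaussian noise) single-pitch estimator with a fixed number of harmonics; model-order selection, multi-pitch models, filtering and subspace methods are left out, and the candidate grid and harmonic count are illustrative.

```python
import numpy as np

def ls_pitch(x, fs, f0_grid, n_harm=5):
    """For each candidate f0, fit a harmonic sinusoidal model by least
    squares and keep the candidate whose fit explains the most energy."""
    t = np.arange(len(x)) / fs
    h = np.arange(1, n_harm + 1)
    best_f0, best_energy = None, -np.inf
    for f0 in f0_grid:
        phase = 2 * np.pi * f0 * np.outer(t, h)
        Z = np.hstack([np.cos(phase), np.sin(phase)])   # harmonic basis
        amps, *_ = np.linalg.lstsq(Z, x, rcond=None)
        energy = np.linalg.norm(Z @ amps) ** 2          # projection energy
        if energy > best_energy:
            best_f0, best_energy = f0, energy
    return best_f0

# synthetic check: a 220 Hz harmonic signal in white noise
fs = 8000
t = np.arange(0, 0.04, 1 / fs)
x = sum(np.cos(2 * np.pi * 220 * k * t) / k for k in range(1, 6))
x = x + 0.1 * np.random.randn(len(t))
print(ls_pitch(x, fs, np.arange(80.0, 400.0, 1.0)))     # ~220
```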

Proceedings ArticleDOI
19 Apr 2009
TL;DR: A new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding and overcomes the drawbacks of generative HMM modeling.
Abstract: In this paper we propose a new technique for robust keyword spotting that uses bidirectional Long Short-Term Memory (BLSTM) recurrent neural nets to incorporate contextual information in speech decoding. Our approach overcomes the drawbacks of generative HMM modeling by applying a discriminative learning procedure that non-linearly maps speech features into an abstract vector space. By incorporating the outputs of a BLSTM network into the speech features, it is able to make use of past and future context for phoneme predictions. The robustness of the approach is evaluated on a keyword spotting task using the HUMAINE Sensitive Artificial Listener (SAL) database, which contains accented, spontaneous, and emotionally colored speech. The test is particularly stringent because the system is not trained on the SAL database, but only on the TIMIT corpus of read speech. We show that our method prevails over a discriminative keyword spotter without BLSTM-enhanced feature functions, which in turn has been proven to outperform HMM-based techniques.
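A minimal PyTorch sketch of the BLSTM feature-enhancement idea: a bidirectional LSTM maps frame-wise acoustic features to phoneme posteriors, which are then appended to the original features before keyword decoding. Layer sizes, feature dimensionality and the log-softmax output are illustrative assumptions, not the authors' configuration, and the discriminative keyword spotter itself is not shown.

```python
import torch
import torch.nn as nn

class BLSTMPhonemeNet(nn.Module):
    """Bidirectional LSTM producing frame-wise phoneme posteriors."""
    def __init__(self, n_feats=39, n_hidden=128, n_phonemes=41):
        super().__init__()
        self.blstm = nn.LSTM(input_size=n_feats, hidden_size=n_hidden,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_phonemes)

    def forward(self, x):                 # x: (batch, frames, n_feats)
        h, _ = self.blstm(x)              # (batch, frames, 2 * n_hidden)
        return torch.log_softmax(self.out(h), dim=-1)

net = BLSTMPhonemeNet()
feats = torch.randn(1, 200, 39)              # 200 frames of dummy features
posteriors = net(feats)
enhanced = torch.cat([feats, posteriors.exp()], dim=-1)  # BLSTM-enhanced features
```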

Journal Article
TL;DR: This paper describes this codec in detail and shows that the new reference model reaches the goal of consistent high quality for all signal types.
Abstract: Coding of speech signals at low bitrates, such as 16 kbps, has to rely on an efficient speech reproduction model to achieve reasonable speech quality. However, for audio signals that do not fit the model, this approach generally fails. On the other hand, generic audio codecs, designed to handle any kind of audio signal, tend to show unsatisfactory results for speech signals, especially at low bitrates. To overcome this, a process was initiated by ISO/MPEG aiming to standardize a new codec with consistent high quality for speech, music and mixed content over a broad range of bitrates. After a formal listening test evaluating several proposals, MPEG selected the best-performing codec as the reference model for the standardization process. This paper describes this codec in detail and shows that the new reference model reaches the goal of consistent high quality for all signal types.

Journal Article
TL;DR: A new set of cross-fade windows designed in order to provide an adequate trade-off between overlap duration and time/frequency resolution, and to maintain the benefits of critical sampling through all coding modes are presented.
Abstract: The reference model selected by MPEG for the forthcoming unified speech and audio codec (USAC) switches between a non-LPC based coding mode (based on AAC) operating in the transform domain and an LPC-based coding mode (derived from AMR-WB+) operating either in the time domain (ACELP) or in the frequency domain (wLPT). Seamlessly switching between these different coding modes required the design of a new set of cross-fade windows optimized to minimize the amount of overhead information sent during transitions between LPC-based and non-LPC based coding. This paper presents the new set of windows which was designed in order to provide an adequate trade-off between overlap duration and time/frequency resolution, and to maintain the benefits of critical sampling through all coding modes.

Patent
01 Sep 2009
TL;DR: In this paper, an active noise reduction (ANR) circuit is used to adjust the hearing compensated audio signal based on an ANR signal to produce an output audio signal, wherein the ANR signal is generated based on the output audio signal.
Abstract: A circuit includes a microphone circuit, an audio processing module, a digital audio processing module, and an active noise reduction (ANR) circuit. The microphone circuit receives acoustic vibrations and generates an audio signal therefrom. The audio processing module generates a representation of the audio signal. The digital audio processing module compensates the representation of the audio signal based on hearing compensation data to produce a hearing compensated audio signal. The ANR circuit receives the hearing compensated audio signal and an ANR signal. The ANR circuit further functions to adjust the hearing compensated audio signal based on the ANR signal to produce an output audio signal, wherein the ANR signal is generated based on the output audio signal.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed scheme is inaudible and robust against common signal processing operations, including MP3 compression, low-pass filtering, noise addition, and equalization, and that it survives several desynchronization attacks.

Journal ArticleDOI
TL;DR: Results indicate that it is possible to construct a variable-rate harmonic codec that is equivalent to iLBC at approximately 13 kbps.
Abstract: The harmonic representation of speech signals has found many applications in speech processing. This paper presents a novel statistical approach to model the behavior of harmonic phases. Phase information is decomposed into three parts: a minimum phase part, a translation term, and a residual term referred to as dispersion phase. Dispersion phases are modeled by wrapped Gaussian mixture models (WGMMs) using an expectation-maximization algorithm suitable for circular vector data. A multivariate WGMM-based phase quantizer is then proposed and constructed using novel scalar quantizers for circular random variables. The proposed phase modeling and quantization scheme is evaluated in the context of a narrowband harmonic representation of speech. Results indicate that it is possible to construct a variable-rate harmonic codec that is equivalent to iLBC at approximately 13 kbps.
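For intuition about the wrapped Gaussian mixture models used for dispersion phases, the sketch below evaluates a scalar wrapped Gaussian density by summing an ordinary Gaussian over 2π shifts and mixes several such components; the EM training for circular vector data and the phase quantizers from the paper are not reproduced, and all parameter values are illustrative.

```python
import numpy as np

def wrapped_gauss_pdf(theta, mu, sigma, n_wraps=3):
    """Wrapped Gaussian density on (-pi, pi]: an ordinary Gaussian density
    summed over 2*pi shifts (truncated to a few wraps)."""
    theta = np.asarray(theta, dtype=float)
    k = np.arange(-n_wraps, n_wraps + 1)
    d = theta[..., None] - mu + 2.0 * np.pi * k
    return np.sum(np.exp(-0.5 * (d / sigma) ** 2) /
                  (sigma * np.sqrt(2.0 * np.pi)), axis=-1)

def wgmm_pdf(theta, weights, mus, sigmas):
    """Mixture of (scalar) wrapped Gaussians."""
    return sum(w * wrapped_gauss_pdf(theta, m, s)
               for w, m, s in zip(weights, mus, sigmas))

# two-component example with illustrative parameters
theta = np.linspace(-np.pi, np.pi, 7)
print(wgmm_pdf(theta, [0.6, 0.4], [0.0, 2.5], [0.3, 0.5]))
```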

Proceedings ArticleDOI
01 Dec 2009
TL;DR: This work investigates a novel ASR approach using Bidirectional Long Short-Term Memory Recurrent Neural Networks and Connectionist Temporal Classification, which is capable of transcribing graphemes directly and yields results highly competitive with phoneme transcription.
Abstract: Mainstream Automatic Speech Recognition systems are based on modelling acoustic sub-word units such as phonemes. Phonemisation dictionaries and language-model-based decoding techniques are applied to transform the phoneme hypotheses into orthographic transcriptions. Direct modelling of graphemes as sub-word units using HMMs has not been successful. We investigate a novel ASR approach using Bidirectional Long Short-Term Memory Recurrent Neural Networks and Connectionist Temporal Classification, which is capable of transcribing graphemes directly and yields results highly competitive with phoneme transcription. In the design of such a grapheme-based speech recognition system, phonemisation dictionaries are no longer required; all that is needed is text transcribed on the sentence level, which greatly simplifies the training procedure. The novel approach is evaluated extensively on the Wall Street Journal 1 corpus.

Journal ArticleDOI
TL;DR: A point process-based computational framework for the task of spotting keywords in continuous speech is formulated and it is found that even with a noisy and extremely sparse phonetic landmark-based point process representation, keywords can be spotted with accuracy levels comparable to recently studied hidden Markov model-based keyword spotting systems.
Abstract: We investigate the hypothesis that the linguistic content underlying human speech may be coded in the pattern of timings of various acoustic "events" (landmarks) in the speech signal. This hypothesis is supported by several strands of research in the fields of linguistics, speech perception, and neuroscience. In this paper, we put these scientific motivations to the test by formulating a point process-based computational framework for the task of spotting keywords in continuous speech. We find that even with a noisy and extremely sparse phonetic landmark-based point process representation, keywords can be spotted with accuracy levels comparable to recently studied hidden Markov model-based keyword spotting systems. We show that the performance of our keyword spotting system in the high-precision regime is better predicted by the median duration of the keyword rather than simply the number of its constituent syllables or phonemes. When we are confronted with very few (in the extreme case, zero) examples of the keyword in question, we find that constructing a keyword detector from its component syllable detectors provides a viable approach.

Proceedings ArticleDOI
01 Nov 2009
TL;DR: The results demonstrate that the principles of compressed sensing can be applied to sparse decompositions of speech and audio signals and that it offers a significant reduction of the computational complexity, but also that such signals may pose a challenge due to their non-stationary and complex nature with varying levels of sparsity.
Abstract: In this paper, we consider the application of compressed sensing (aka compressive sampling) to speech and audio signals. We discuss the design considerations and issues that must be addressed in doing so, and we apply compressed sensing as a pre-processor to sparse decompositions of real speech and audio signals using dictionaries composed of windowed complex sinusoids. Our results demonstrate that the principles of compressed sensing can be applied to sparse decompositions of speech and audio signals and that it offers a significant reduction of the computational complexity, but also that such signals may pose a challenge due to their non-stationary and complex nature with varying levels of sparsity.

Patent
19 Jun 2009
TL;DR: In this paper, systems, methods, and apparatus for low-bit-rate coding of transitional speech frames are described and evaluated.
Abstract: Systems, methods, and apparatus for low-bit-rate coding of transitional speech frames are disclosed.

Patent
06 Oct 2009
TL;DR: In this paper, a context-based entropy decoder is proposed to decode the entropy-encoded audio information in dependence on a context, where the context is based on previously decoded audio information in a non-reset state of operation.
Abstract: An audio decoder for providing a decoded audio information on the basis of an entropy encoded audio information comprises a context-based entropy decoder configured to decode the entropy-encoded audio information in dependence on a context, which context is based on a previously-decoded audio information in a non-reset state-of-operation. The context-based entropy decoder is configured to select a mapping information, for deriving the decoded audio information from the encoded audio information, in dependence on the context. The context-based entropy decoder comprises a context resetter configured to reset the context for selecting the mapping information to a default context, which default context is independent from the previously-decoded audio information, in response to a side information of the encoded audio information.

Proceedings ArticleDOI
19 Apr 2009
TL;DR: A novel framework which combines the advantages of different well known segmentation methods is introduced, using an automatically estimated log-linear segment model to determine the segmentation of an audio stream in a holistic way by a maximum a posteriori decoding strategy.
Abstract: Audio segmentation is an essential preprocessing step in several audio processing applications with a significant impact e.g. on speech recognition performance. We introduce a novel framework which combines the advantages of different well known segmentation methods. An automatically estimated log-linear segment model is used to determine the segmentation of an audio stream in a holistic way by a maximum a posteriori decoding strategy, instead of classifying change points locally. A comparison to other segmentation techniques in terms of speech recognition performance is presented, showing a promising segmentation quality of our approach.

Dissertation
01 Jan 2009
TL;DR: A study of the implementation of a speech generative model, whereby the speech is synthesized and recovered back from its MFCC representations, and the spectral distance between the original speech signal and the one produced from the MFCC vectors has been computed.
Abstract: The classical front-end analysis in speech recognition is a spectral analysis which parametrizes the speech signal into feature vectors; the most popular set of these is the Mel Frequency Cepstral Coefficients (MFCC). They are based on a standard power spectrum estimate which is first subjected to a log-based transform of the frequency axis (mel-frequency scale), and then decorrelated by using a modified discrete cosine transform. Following a focused introduction on speech production, perception and analysis, this paper presents a study of the implementation of a speech generative model, whereby the speech is synthesized and recovered back from its MFCC representation. The work has been developed in two steps: first, the computation of the MFCC vectors from the source speech files using the HTK software; and second, the implementation of the generative model itself, which represents the conversion chain from HTK-generated MFCC vectors back to speech. In order to assess the quality of the speech coding into feature vectors and to evaluate the generative model, the spectral distance between the original speech signal and the one produced from the MFCC vectors has been computed. For this, spectral models based on Linear Prediction Coding (LPC) analysis have been used. During the implementation of the generative model, results have been obtained in terms of the reconstruction of the spectral representation and the quality of the synthesized speech.
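The forward part of the conversion chain studied here (power spectrum, mel filterbank, log, DCT) can be sketched as below using librosa and SciPy rather than HTK; the generative model essentially runs this chain approximately in reverse. Filterbank size, FFT length and hop size are illustrative assumptions, and the excitation synthesis and LPC-based evaluation are not shown.

```python
import numpy as np
import librosa
from scipy.fftpack import dct, idct

def mfcc_frames(y, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Forward MFCC chain: power spectrum -> mel filterbank -> log -> DCT."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    logmel = np.log(fb @ S + 1e-10)
    return dct(logmel, type=2, axis=0, norm="ortho")[:n_ceps]

def approx_mel_from_mfcc(mfcc, n_mels=26):
    """First step of the reverse chain: undo the truncated DCT and the log
    to obtain an approximate mel spectrum, from which a spectral envelope
    and an excitation-driven resynthesis can then be derived."""
    padded = np.zeros((n_mels, mfcc.shape[1]))
    padded[:mfcc.shape[0]] = mfcc
    return np.exp(idct(padded, type=2, axis=0, norm="ortho"))

# usage (file name is hypothetical):
# y, sr = librosa.load("utterance.wav", sr=16000)
# mel_hat = approx_mel_from_mfcc(mfcc_frames(y))
```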

Proceedings ArticleDOI
19 Apr 2009
TL;DR: In this paper, a codebook of pitch-synchronous residual frames is used to construct a more realistic source signal, which is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input.
Abstract: This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The latter coefficients contain both the pitch and a compact representation of target residual frames. The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input. Subjective results show a relevant improvement compared to the basic technique.
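A sketch of how the synthesis-time source construction described above could look: for each frame, the residual frame whose compact representation is closest to the HMM-generated target coefficients is selected from the codebook, windowed and overlap-added at the pitch period. The nearest-neighbour criterion, the Hann window and all variable names are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def build_excitation(target_coeffs, codebook_frames, codebook_coeffs, periods):
    """target_coeffs:  per-frame target residual coefficients from the HMMs
    codebook_frames: array (n_entries, frame_len) of pitch-synchronous
                     residual frames from the training database
    codebook_coeffs: array (n_entries, n_coeffs), the compact representation
                     of each codebook frame, in the same space as the targets
    periods:         pitch periods in samples, one per frame."""
    frame_len = codebook_frames.shape[1]
    out = np.zeros(int(np.sum(periods)) + frame_len)
    pos = 0
    for tgt, period in zip(target_coeffs, periods):
        # selection criterion: nearest codebook entry in the coefficient space
        idx = int(np.argmin(np.linalg.norm(codebook_coeffs - tgt, axis=1)))
        # overlap-add the chosen residual frame at the current pitch mark
        out[pos:pos + frame_len] += np.hanning(frame_len) * codebook_frames[idx]
        pos += int(period)
    return out
```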