
Showing papers on "Cepstrum published in 2002"


Journal ArticleDOI
TL;DR: A notion of subspace angles between two linear, autoregressive moving average, single-input–single-output models is defined by considering the principal angles between subspaces that are derived from these models.

136 citations


Patent
03 Jul 2002
TL;DR: A method for the early recognition and prediction of unit damage in machine plants, in particular mobile machine plants, is presented.
Abstract: The invention relates to a device and a method for the early recognition and prediction of unit damage in machine plants, in particular mobile machine plants. The structural noise of the machine plant is recorded by a sensor (1), transmitted as an acceleration signal and analysed in a digital signal processor (DSP). In order to inhibit negative influences from environmental vibrations or structural sound waves which have nothing to do with the status of the machine plant, the acceleration signal is first transformed into the frequency domain by means of a fast Fourier transformation, and the data thus obtained is then transformed back into the time domain by means of a cepstrum analysis, such that resonance data from single shock impulses (a cepstrum) are obtained in the time domain. Said cepstrum is then compared with a comparative cepstrum, available in a memory unit, corresponding to load and speed signals for the current operating status for a new machine plant. When threshold values are exceeded, the diagnostic signal, in particular information about the unit diagnosed as being damaged and its predicted remaining service life, is displayed for the user and an emergency operation is initiated.

58 citations
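
The comparison described in this abstract (log spectrum, transform back, compare against a stored reference cepstrum) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented implementation: the naive `dft` helper, the `flag_quefrencies` function, the threshold value and the toy impulse data are all hypothetical.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform; fine for short illustrative frames."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result


def real_cepstrum(frame, eps=1e-6):
    """Inverse DFT of the log magnitude spectrum (eps avoids log(0))."""
    log_magnitude = [math.log(abs(c) + eps) for c in dft(frame)]
    return [c.real for c in dft(log_magnitude, inverse=True)]


def flag_quefrencies(cepstrum, reference, threshold):
    """Quefrency indices (first half, excluding 0) where the measured
    cepstrum exceeds the reference cepstrum by more than `threshold`."""
    half = len(cepstrum) // 2
    return [q for q in range(1, half + 1)
            if cepstrum[q] - reference[q] > threshold]


# Toy data (assumed): a single shock stands in for the "new machine"
# reference, and a shock repeating every 32 samples for the measurement.
N = 128
reference_frame = [1.0] + [0.0] * (N - 1)
measured_frame = [1.0 if t % 32 == 0 else 0.0 for t in range(N)]

flagged = flag_quefrencies(real_cepstrum(measured_frame),
                           real_cepstrum(reference_frame),
                           threshold=1.0)
print(flagged)
```

In this toy case the repeating shock shows up as cepstral peaks at the repetition period and its multiples, which is the kind of deviation a threshold test would pick up.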


Proceedings ArticleDOI
13 May 2002
TL;DR: This work has extended this filter characteristic to the mfcc algorithm and found that the increased filter bandwidth improves recognition performance in clean speech and provides added noise robustness as well.
Abstract: Many speech recognition systems use mel-frequency cepstral coefficient (mfcc) feature extraction as a front end. In the algorithm, a speech spectrum passes through a filter bank of mel-spaced triangular filters, and the filter output energies are log-compressed and transformed to the cepstral domain by the DCT. The spacing of filter bank center frequencies mimics the known warped-frequency characteristics of the human auditory system, yet the bandwidths of these filters are not chosen through biological inspiration. Instead they are set by aligning endpoints of the triangle, which is itself an arbitrary shape. It is surprising that for such a popular speech recognition front end, proper analysis or optimization of the filter bandwidths has not been performed. With complex cochlear models, realistic filter shapes that more closely approximate critical bands are used. And these filters, compared to the filters used in mfcc, are considerably wider and overlap with neighboring filters more. We have extended this filter characteristic to the mfcc algorithm and found that the increased filter bandwidth improves recognition performance in clean speech and provides added noise robustness as well.

51 citations
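
The conventional mfcc steps summarized in this abstract (mel-spaced triangular filters whose endpoints align with the neighboring center frequencies, log compression, then a DCT) might be sketched as below. It is a minimal sketch under assumptions: the mel formula with constants 2595 and 700 is one common convention, and the function names, filter count and coefficient count are illustrative choices, not taken from the paper. The wider filters the authors propose are not modelled here.

```python
import math


def hz_to_mel(f):
    # One common mel-scale convention (assumed here).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(num_filters, num_bins, sample_rate):
    """Triangles with mel-spaced centers; each triangle's endpoints sit on
    the centers of its two neighbors (the conventional bandwidth choice)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = [(sample_rate / 2.0) * b / (num_bins - 1) for b in range(num_bins)]
    bank = []
    for i in range(1, num_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, num_coeffs):
    """Unnormalized type-II DCT."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(num_coeffs)]


def mfcc(power_spectrum, sample_rate, num_filters=20, num_coeffs=13, eps=1e-10):
    bank = triangular_filterbank(num_filters, len(power_spectrum), sample_rate)
    energies = [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]
    return dct2([math.log(e + eps) for e in energies], num_coeffs)


# Flat toy spectrum with 129 bins covering 0 to 4 kHz (assumed values).
coeffs = mfcc([1.0] * 129, sample_rate=8000)
```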


PatentDOI
TL;DR: In this article, a method for simulating a 3D sound environment in an audio system using an at least two-channel reproduction device is presented, which includes generating first and second pseudo head-related transfer function (HRTF) data, first using at least one speaker and then using headphones.
Abstract: The invention provides a method for simulating a 3D sound environment in an audio system using an at least two-channel reproduction device, the method including generating first and second pseudo head-related transfer function (HRTF) data, first using at least one speaker and then using headphones; dividing the first and second frequency representation of the data or using a deconvolution operator on the time domain representation of the first and second data, or subtracting the cepstrum representation of the first and second data, and using the results of the division or subtraction to prepare filters having an impulse response operable to initiate natural sounds of a remote speaker for preparing at least two filters connectable to the system in the audio path from an audio source to sound reproduction devices to be used by a listener.

46 citations


Proceedings ArticleDOI
18 Nov 2002
TL;DR: This paper presents a methodology for combining acoustic-phonetic knowledge with statistical learning for automatic segmentation and classification of continuous speech and achieves performance on segmentation of continuous speech better than the HMM-based approach that uses 39 cepstrum-based speech parameters.
Abstract: In this paper, we present a methodology for combining acoustic-phonetic knowledge with statistical learning for automatic segmentation and classification of continuous speech. At present we focus on the recognition of broad classes: vowel, stop, fricative, sonorant consonant and silence. Judicious use is made of 13 knowledge-based acoustic parameters (APs) and support vector machines (SVMs). It has been shown earlier that SVMs perform comparably to hidden Markov models (HMMs) for detection of stop consonants. We achieve performance on segmentation of continuous speech better than the HMM-based approach that uses 39 cepstrum-based speech parameters.

41 citations


Proceedings ArticleDOI
07 Nov 2002
TL;DR: The proposed method combines an energy-feature basis idea in the time domain to solve the synchronization problem and achieved blind audio watermarking in the cepstrum domain.
Abstract: The synchronization problem is a serious issue in the watermark area. It arises when the location of the embedded watermark is lost through cropping, shifting and so on, so that the detector finds it difficult to recover the copyright information. In this paper, the proposed method combines an energy-feature basis idea in the time domain to solve the synchronization problem and achieves blind audio watermarking in the cepstrum domain. The simulation results show a high security performance against the MP3 attack and improved robustness against several kinds of digital distortion attacks such as pitch-shifting and cut samples.

37 citations


Proceedings ArticleDOI
07 Aug 2002
TL;DR: A continuous-time mel-frequency cepstrum encoding IC using analog circuits and floating-gate computational arrays and a novel approach to programmable signal spectrum decomposition, analog frequency transforms, and spectrum compaction is presented.
Abstract: This paper presents a continuous-time mel-frequency cepstrum encoding IC using analog circuits and floating-gate computational arrays. We present the dynamics of several floating-gate computational building blocks and accompanying experimental measurements. We also present a novel approach to programmable signal spectrum decomposition, analog frequency transforms, and spectrum compaction. Experimental data is presented from circuits fabricated on a 0.5 μm n-well CMOS process available through MOSIS. This system can act as the front-end for larger digital or analog speech processing systems.

37 citations


Proceedings ArticleDOI
13 May 2002
TL;DR: Four different pitch tracking algorithms - autocorrelation, cepstrum, harmonic product spectrum, and a new method based on the modulation spectrum - are compared, and their fitness for noisy environments is investigated.
Abstract: In this paper we compare four different pitch tracking algorithms - autocorrelation, cepstrum, harmonic product spectrum, and a new method based on the modulation spectrum - and investigate their fitness for noisy environments. From each tracker, possible f0 candidates in every time window are stored over a 10-frame interval, and the best contour is computed through a Viterbi search. The audio data comprises speech samples recorded in a moving car with a microphone setup commonly used for car speech applications like speakerphones. To see how performances deteriorate with increasing noise level, a second audio test set is prepared by blending clean speech recordings and noise at different signal-to-noise ratios.

32 citations
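
As an illustration of the first of the four trackers named in this abstract, a single-frame autocorrelation pitch estimate could look roughly like the sketch below. The function name, the search range and the synthetic test tone are assumptions made for illustration; the candidate storage over 10 frames and the Viterbi contour search described by the authors are not shown.

```python
import math


def autocorrelation_pitch(frame, sample_rate, f_min, f_max):
    """Return the f0 (Hz) whose lag maximizes the frame's autocorrelation,
    searching only lags that correspond to f_min..f_max."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic 250 Hz tone sampled at 8 kHz (toy data, not car recordings).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 250 * t / SAMPLE_RATE) for t in range(256)]
estimate = autocorrelation_pitch(tone, SAMPLE_RATE, f_min=200, f_max=400)
print(estimate)
```

A real tracker would keep several of the highest-scoring lags per frame as candidates rather than only the maximum, so that a later contour search can choose among them.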


Proceedings ArticleDOI
13 May 2002
TL;DR: In this paper, the authors explore modern methods and algorithms from chaotic systems theory for modeling speech signals in a multidimensional phase space and for extracting nonlinear acoustic features, and integrate these chaotic-type features with the standard linear ones (based on cepstrum) to develop a generalized hybrid set of short-time acoustic features for speech signals and demonstrate its efficacy by showing significant improvements in HMM-based word recognition.
Abstract: Nonlinear systems based on chaos theory can model various aspects of the nonlinear dynamic phenomena occurring during speech production. In this paper, we explore modern methods and algorithms from chaotic systems theory for modeling speech signals in a multidimensional phase space and for extracting nonlinear acoustic features. Further, we integrate these chaotic-type features with the standard linear ones (based on cepstrum) to develop a generalized hybrid set of short-time acoustic features for speech signals and demonstrate its efficacy by showing significant improvements in HMM-based word recognition.

32 citations


Proceedings Article
01 Jan 2002
TL;DR: This work is mainly focused on showing experimental results using a combination of two methods for noise compensation which are shown to be complementary: classical spectral subtraction algorithm and histogram equalization.
Abstract: This work is mainly focused on showing experimental results using a combination of two methods for noise compensation which are shown to be complementary: classical spectral subtraction algorithm and histogram equalization. While spectral subtraction is focused on the reduction of the additive noise in the spectral domain, histogram equalization is applied in the cepstral domain to compensate the remaining non-linear effects associated to channel distortion and additive noise. The estimation of the noise spectrum for the spectral subtraction method relies on a new algorithm for speech / non-speech detection (SND) based on order statistics. This SND classification is also used for dropping long speech pauses. Results on Aurora 2 and Aurora 3 are reported.

29 citations


Patent
30 Aug 2002
TL;DR: In this article, a computer performs filtering of speech data and specifies a pitch length according to the timing when the filtering result zero-crosses, and adjusts the phase and the number of samples of the respective intervals, thereby eliminating the effect of pitch fluctuation.
Abstract: A computer performs filtering of speech data and specifies a pitch length according to the timing when the filtering result zero-crosses. It should be noted that the center frequency of the passing band in the filtering is controlled to be a value equal to the reciprocal of the pitch length specified according to the zero-cross timing, unless the deviation from the pitch length extracted from the cepstrum and periodogram of the speech data exceeds a predetermined amount. Next, the computer divides the speech data into unit pitch intervals according to the filtering result and adjusts the phase and the number of samples of the respective intervals, thereby eliminating the effect of pitch fluctuation. The pitch waveform data obtained is interpolated by a plurality of methods, and those having few higher harmonic components are output together with data on the original number of samples and amplitude of the respective intervals.

Proceedings ArticleDOI
05 Nov 2002
TL;DR: In this article, the authors present signal methods dedicated to the fault detection in the mechanical part of an induction drive: bearing damage, eccentricity and rotor unbalance, and an experimental bench test is described and used to create and characterise these faults.
Abstract: This paper presents signal methods dedicated to the fault detection in the mechanical part of an induction drive: bearing damage, eccentricity and rotor unbalance. An experimental bench test is described and used to create and characterise these faults. Two methods for bearing fault detection are also detailed: the cepstrum and an original method (parcels summation method). These algorithms are tested on synthetic and real signals.

Journal ArticleDOI
TL;DR: Although experimental results indicate certain inconsistency between objective and subjective measures, it can still be concluded that spectral smoothing and comb filtering contribute to the melioration of speech quality for speech corrupted by additive white Gaussian noise.

Proceedings Article
01 Jan 2002
TL;DR: An integrated system is described which segments musical signals according to the presence or absence of drum instruments; it combines periodicity detection with a straightforward acoustic pattern recognition approach using mel-frequency cepstrum coefficients as features and a Gaussian mixture model classifier, and achieves 88% correct segmentation over a database of 28 hours of music from different musical genres.
Abstract: A system is described which segments musical signals according to the presence or absence of drum instruments. Two different yet approximately equally accurate approaches were taken to solve the problem. The first is based on periodicity detection in the amplitude envelopes of the signal at subbands. The band-wise periodicity estimates are aggregated into a summary autocorrelation function, the characteristics of which reveal the drums. The other mechanism applies a straightforward acoustic pattern recognition approach with mel-frequency cepstrum coefficients as features and a Gaussian mixture model classifier. The integrated system achieves 88% correct segmentation over a database of 28 hours of music from different musical genres. For both methods, errors occur for borderline cases with soft percussive-like drum accompaniment, or transient-like instrumentation without drums.

Proceedings ArticleDOI
07 Nov 2002
TL;DR: This paper investigates the application of better signal processing techniques to improve the signal to noise ratio in vibrational signals used for condition monitoring and a simple vibrational model is proposed, which includes the effect of period variation.
Abstract: This paper investigates the application of better signal processing techniques to improve the signal to noise ratio in vibrational signals used for condition monitoring. Environmental conditions such as instantaneous speed variations as well as the presence of multiple fault conditions can however obscure the defect signals that are required for reliable diagnostics and can lead to faulty diagnostic decisions. While these problems can be solved with the right combination of techniques, the difficulty of obtaining sufficiently large measured data sets on which to train these techniques remains. Artificially generated training data sets, obtained by empirical modeling of defects, are hence investigated and a simple vibrational model, which includes the effect of period variation, is proposed for the bearing defect data set by Hoffman and van der Merwe (see Proceedings of the 5th WSES International Conference on Circuits, Systems, Communications and Computers (CSCC 2001), Rethymno, Greece, July 2001, p.209-214). Signal processing techniques, such as the cepstrum, can be influenced by the noise caused by instantaneous angular speed variations of the shaft. A novel modified cepstrum analysis is proposed which is less sensitive to small Fourier components encountered in a simulated vibration signal.

Patent
23 Jul 2002
TL;DR: In this article, an LSP decoding section extracts and decodes only LSP information from coded speech data, which is read for each block, and a Cepstrum conversion section converts the obtained LPC information into an LPC cepstrum which represents features of speech.
Abstract: A process of identifying a speaker in coded speech data and a process of searching for the speaker are efficiently performed with fewer computations and with a smaller storage capacity. In an information search apparatus, an LSP decoding section extracts and decodes only LSP information from coded speech data which is read for each block. An LPC conversion section converts the LSP information into LPC information. A Cepstrum conversion section converts the obtained LPC information into an LPC Cepstrum which represents features of speech. A vector quantization section performs vector quantization on the LPC Cepstrum. A speaker identification section identifies a speaker on the basis of the result of the vector quantization. Furthermore, the identified speaker is compared with a search condition in a condition comparison section, and based on the result, the search result is output.
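
The LPC-to-cepstrum conversion step mentioned in this abstract is commonly done with a short recursion. The sketch below shows the standard textbook recursion, not necessarily the one used in the patent, and assumes the sign convention A(z) = 1 - sum_k a_k z^-k for the predictor; the function name is hypothetical and the gain term c_0 is omitted.

```python
def lpc_to_cepstrum(lpc, num_coeffs):
    """Convert predictor coefficients a_1..a_p (lpc[k-1] = a_k, for
    A(z) = 1 - sum_k a_k z^-k) into cepstral coefficients c_1..c_num_coeffs
    of the all-pole model 1/A(z)."""
    order = len(lpc)
    cepstrum = []
    for n in range(1, num_coeffs + 1):
        # a_n contributes directly only while n does not exceed the order.
        acc = lpc[n - 1] if n <= order else 0.0
        for k in range(max(1, n - order), n):
            acc += (k / n) * cepstrum[k - 1] * lpc[n - k - 1]
        cepstrum.append(acc)
    return cepstrum


# First-order check: for 1 / (1 - a z^-1) the cepstrum is c_n = a**n / n.
print(lpc_to_cepstrum([0.5], 3))
```

Because the recursion works directly on the predictor coefficients, no spectrum has to be computed, which fits the abstract's emphasis on fewer computations.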

Proceedings ArticleDOI
13 May 2002
TL;DR: The Analog Speech Recognition project combines low-power analog signal processing and digital signal processing theory to provide low-power and robust speech processing systems.
Abstract: The Analog Speech Recognition project combines low-power analog signal processing and digital signal processing theory to provide low-power and robust speech processing systems. This project looks to bring together multiple analog signal processing (ASP) blocks into one large ASP system. These component blocks include an analog cepstrum, vector quantizer, and analog HMM. Finally, there are power dissipation comparisons made between the analog and digital systems based on the computations performed.

Patent
Takafumi Hitotsumatsu
03 Sep 2002
TL;DR: In this article, the voice of a user is input to a speech recognition section from depression of a talk switch until the start of a no-voice domain. LPC cepstrum coefficients are calculated from the voice in an LPC analysis section and a cepstrum calculation section, and then temporarily stored in a parameter backward output section.
Abstract: The voice of a user is input to a speech recognition section from depression of a talk switch until the start of a no-voice domain. LPC cepstrum coefficients are calculated from the voice in an LPC analysis section and a cepstrum calculation section, and then temporarily stored in a parameter backward output section. The series of LPC cepstrum coefficients is re-arranged into a series in which the time axis is inverted and then output to a collating section. The collating section calculates a degree of similarity between the LPC cepstrum coefficients and a recognition dictionary of a backward tree structure stored in a standard pattern section through backward collating.

Proceedings ArticleDOI
13 May 2002
TL;DR: A new speech recognition technique is proposed that generates a speech trajectory using an HMM-based speech synthesis method that generates an acoustic trajectory by maximizing the likelihood of the trajectory while taking into account the relation between the cepstrum, delta-cepstrum, and delta-delta cepstrum.
Abstract: Parametric trajectory models have been proposed to exploit this time-dependency. However, parametric trajectory modeling methods are unable to take advantage of efficient HMM training and recognition methods. We have proposed a new speech recognition technique that generates a speech trajectory using an HMM-based speech synthesis method. This method generates an acoustic trajectory by maximizing the likelihood of the trajectory while taking into account the relation between the cepstrum, delta-cepstrum, and delta-delta cepstrum. In this paper, we extend our method to a general formulation including variance training procedure. Speaker independent speech recognition experiments show that the proposed method is effective for speech recognition.
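
The relation between cepstrum, delta-cepstrum and delta-delta cepstrum mentioned in this abstract is a linear one: dynamic features are typically computed from the static ones with a regression window. The sketch below shows one widely used regression formula as an assumption; the window length, the edge handling (repeating the first and last frame) and the function name are illustrative and not taken from the paper.

```python
def delta(frames, window=2):
    """Regression-based delta features:
    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2), n = 1..window.
    Frames beyond the ends are replaced by the first or last frame."""
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        row = []
        for d in range(len(frames[t])):
            num = sum(n * (frames[min(t + n, last)][d] - frames[max(t - n, 0)][d])
                      for n in range(1, window + 1))
            row.append(num / denom)
        result.append(row)
    return result


# A cepstral coefficient rising by 1 per frame has delta 1 in the interior.
static = [[float(t)] for t in range(6)]
deltas = delta(static)
delta_deltas = delta(deltas)  # applying the same operator again
print(deltas[2], deltas[3])
```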

Proceedings ArticleDOI
Hong Kook Kim, R.C. Rose
13 May 2002
TL;DR: The distinguishing aspect of the method is that noise-corrupted speech is decomposed into clean speech and noise components directly in the cepstrum domain without having to transform to the linear spectrum domain as is necessary for many existing model combination approaches.
Abstract: In this paper, we propose a cepstrum-domain model combination method for automatic speech recognition in noisy environments. The distinguishing aspect of the method is that noise-corrupted speech is decomposed into clean speech and noise components directly in the cepstrum domain without having to transform to the linear spectrum domain as is necessary for many existing model combination approaches. This is accomplished by exploiting the properties of the minimum mean squared error-log spectral amplitude (MMSE-LSA) based speech enhancement algorithm. As a result, a clean speech hidden Markov model (HMM) is easily compensated for a noise-corrupted domain by adding the means and covariance matrices of the clean speech HMM and those of an estimated noise model. The complexity of the proposed model combination procedure is significantly reduced with respect to conventional parallel model combination. The procedure was applied to a noisy connected digit recognition task. A 40% reduction in word error rate was achieved when it was combined with acoustic feature compensation techniques under mismatched environmental and channel conditions.

Book ChapterDOI
16 Dec 2002
TL;DR: Training and testing on a database of 410 audio files have shown asymptotic classification improvement by AdaBoost.
Abstract: AdaBoost is used to boost and select the best sequence of weak classifiers for the speech/non-speech classification. These weak classifiers are chosen to be simple threshold functions. Statistical mean and variance of the Mel-frequency Cepstrum Coefficients (MFCC) over all overlapping frames of an audio file are used as audio features. Training and testing on a database of 410 audio files have shown asymptotic classification improvement by AdaBoost. A classification accuracy of 99.51% has been achieved on the test data. A comparison of AdaBoost with Nearest Neighbor and Nearest Center classifiers is also given.
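
The two ingredients described in this abstract, file-level mean and variance of MFCC frames and simple threshold functions as weak classifiers, can be sketched as follows. The names, the label convention (+1 for speech, -1 for non-speech) and the toy numbers are assumptions; the AdaBoost loop that weights and selects such stumps is not shown.

```python
def file_features(mfcc_frames):
    """Concatenate the per-coefficient mean and (population) variance
    taken over all frames of one audio file."""
    n = len(mfcc_frames)
    dims = len(mfcc_frames[0])
    means = [sum(f[d] for f in mfcc_frames) / n for d in range(dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in mfcc_frames) / n
                 for d in range(dims)]
    return means + variances


def threshold_classifier(features, index, threshold, polarity=1):
    """Weak classifier on a single feature: returns +1 or -1.
    Assumed convention: +1 means speech, -1 means non-speech."""
    return polarity if features[index] > threshold else -polarity


# Two toy frames with two coefficients each (made-up values).
features = file_features([[1.0, 2.0], [3.0, 6.0]])
print(features)
print(threshold_classifier(features, index=3, threshold=2.0))
```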

Journal ArticleDOI
TL;DR: A speech recognition and compression chip for portable memopad devices, especially suitable for use by the visually impaired, is presented; it is based on several cores that can be regarded as intellectual property cores to be used for a variety of speech-related application systems.
Abstract: This paper presents the design of a speech recognition and compression chip for portable memopad devices, especially suitable for use by the visually impaired. The proposed chip design is based on several cores that can be regarded as intellectual property (IP) cores to be used for a variety of speech-related application systems. A cepstrum extraction core and a dynamic warping core are designed for mapping the speech recognition algorithms. In the cepstrum extraction core, a novel architecture computes the autocorrelation between the overlapping frames using two pairs of shift registers and an intelligent accumulation procedure. The architecture of the dynamic time warping core uses only a single processing element, and is based on our extensive study of the relationship among the nodes in the dynamic time warping lattice. Bit rate is the key factor affecting the memory size for speech compression; therefore, a very low bit-rate speech coder is used. The speech coder exploits a line-spectrum-based interpolation method, which yields fine quality synthesized speech despite the low 1.6 kbps bit rate. The 1.6 kbps vocoder core is cost-effective, and it integrates both encoder and decoder algorithms. The proposed design has been tested via hardware simulations on Xilinx Virtex series FPGAs and a semi-custom chip fabricated by 0.35 μm CMOS single-poly-four-metal technology on a die size of approximately 4.46 × 4.46 mm².
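
For readers unfamiliar with the dynamic time warping lattice mentioned in this abstract, a plain software version of the recurrence is sketched below. It is the classic textbook formulation with unit step weights, given as an assumption; it says nothing about the chip's single-processing-element architecture, and the function name and the scalar distance are illustrative (real systems would compare cepstral vectors).

```python
def dtw_distance(seq_a, seq_b, dist=lambda x, y: abs(x - y)):
    """Accumulated cost of the best monotonic alignment of two sequences.
    Each lattice node depends only on its left, lower and diagonal neighbors."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch seq_b
                                    cost[i][j - 1],      # stretch seq_a
                                    cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]


# The same contour at a different speed aligns with zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```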

Patent
29 Mar 2002
TL;DR: In this article, the noise-corrupted speech signal is transformed into the cepstrum domain and combined with a cepstrum of a nonlinear gain function representing noise to obtain an estimate of the clean speech component in the cepstrum domain; in another implementation, the clean speech component is decomposed from the background noise component and channel distortion in the noisy speech cepstrum domain.
Abstract: The invention provides devices and methods to decompose the noise-corrupted speech into an estimate for the clean speech component and an estimate for a noise component in a domain wherein the two components additively interact. In one implementation, the noise-corrupted speech signal is transformed into the cepstrum domain and combined with a cepstrum of a nonlinear gain function representing noise to obtain an estimate of the clean speech component in the cepstrum domain. In another implementation, the clean speech component is decomposed from the background noise component and channel distortion in the noisy speech cepstrum domain.

Proceedings ArticleDOI
04 Aug 2002
TL;DR: The proposed partial-correlation (PARCOR) coefficients scheme to model the cross areas of the several cylinders from the vocal tract can yield better identification performance than the conventional approach.
Abstract: In this work, we propose the partial-correlation (PARCOR) coefficients scheme to model the cross areas of the several cylinders from the vocal tract. By using the relationship of the acoustic impedance proportional to the reciprocal of cross areas, the ratios of cross areas between neighboring cylinders are used to model a speaker's vocal tract. The autoregressive model (AR model) is performed on the speech residual signals that are produced from the inverse vocal tract transform based on the PARCOR, to generate features. These features with the conventional features from the Mel-Frequency Cepstral Coefficient (MFCC) are used for the identification engine of the Gaussian Mixture Model (GMM). According to our computer analyses in the TIMIT speech database, the proposed system can yield better identification performance than the conventional approach.

Journal ArticleDOI
TL;DR: The spectral estimation problem of nonstationary autoregressive moving-average (ARMA) processes is considered and a new method is proposed for the estimation of the time-varying spectral content of such signals.
Abstract: The spectral estimation problem of nonstationary autoregressive moving-average (ARMA) processes is considered and a new method is proposed for the estimation of the time-varying spectral content of such signals. The proposed method can be viewed as an extension of the estimator proposed earlier by the authors to the time-varying case. The AR part of the model is estimated by solving the time-varying modified Yule-Walker equations using an estimated time-varying autocorrelation function. An evolutionary cepstrum estimator is proposed, which is then used in a simple recursion to obtain the MA parameters of the model.

Proceedings ArticleDOI
14 Oct 2002
TL;DR: This paper proposes the use of the cepstrum method (CEP), which uses the same configurations as in MFCC extraction for the extraction of pitch-related features, and finds that the addition of a properly normalized and transformed set of pitch-related features can reduce the recognition error rate.
Abstract: Chinese is a tonal language that uses fundamental frequency, in addition to phones, for word differentiation. Commonly used front-end features, such as mel-frequency cepstral coefficients (MFCC), however, are optimized for non-tonal languages such as English and are not mainly focused on pitch information that is important for tone identification. In this paper, we examine the integration of tone-related acoustic features for Chinese recognition. We propose the use of the cepstrum method (CEP), which uses the same configurations as in MFCC extraction for the extraction of pitch-related features. The pitch periods extracted from the CEP algorithm can be used directly for speech recognition and do not require any special treatment for unvoiced frames. In addition, we explore a number of feature transformations and find that the addition of a properly normalized and transformed set of pitch-related features can reduce the recognition error rate from 34.61% to 29.45% on the Chinese 1998 National Performance Assessment (Project 863) corpus.


Journal ArticleDOI
TL;DR: In this paper, Chi's real one-dimensional (1-D) parametric nonminimum-phase Fourier series-based model (FSBM) is extended to two-dimensional FSBM for a 2-D nonminimum-phase linear shift-invariant system by using finite 2-D Fourier series approximations to its amplitude response and phase response, respectively.
Abstract: In this paper, Chi's (1997, 1999) real one-dimensional (1-D) parametric nonminimum-phase Fourier series-based model (FSBM) is extended to two-dimensional (2-D) FSBM for a 2-D nonminimum-phase linear shift-invariant system by using finite 2-D Fourier series approximations to its amplitude response and phase response, respectively. The proposed 2-D FSBM is guaranteed stable, and its complex cepstrum can be obtained from its amplitude and phase parameters through a closed-form formula without involving complicated 2-D phase unwrapping and polynomial rooting. A consistent estimator is proposed for the amplitude estimation of the 2-D FSBM using a 2-D half plane causal minimum-phase linear prediction error filter (modeled by a 2-D minimum-phase FSBM), and then, two consistent estimators are proposed for the phase estimation of the 2-D FSBM using the Chien et al. (1997) 2-D phase equalizer (modeled by a 2-D all-pass FSBM). The estimated 2-D FSBM can be applied to modeling of 2-D non-Gaussian random signals and 2-D signal classification using complex cepstra. Some simulation results are presented to support the efficacy of the three proposed estimators. Furthermore, classification of texture images (2-D non-Gaussian signals) using the estimated FSBM, second-, and higher order statistics is presented together with some experimental results. Finally, we draw some conclusions.

Proceedings ArticleDOI
13 May 2002
TL;DR: Two techniques are researched for Jacobian adaptation (JA) in the presence of additive noise and these methods have been compared to cepstral mean normalization (CMN) and parallel model combination (PMC) in an isolated word recognition task having a vocabulary of 200 English words.
Abstract: In this paper, two techniques are researched for Jacobian adaptation (JA) in the presence of additive noise. Since the original concept of JA was presented only for static cepstral coefficients, the performance of JA is researched when it is extended to cover also the delta cepstrum. However, neither this extension nor the original concept can provide accurate recognition performance when the mismatch between the training and recognition environments is out of the linear range of JA. This problem can be alleviated to some extent by dividing JA into two steps. At first, the adaptation is done e.g. from clean to the target environment having a "high" SNR level. After that, the new JA matrices are calculated and they are used in the second step to adapt the system to the lower target SNR level. Both of the above adaptation methods have been compared to cepstral mean normalization (CMN) and parallel model combination (PMC) in an isolated word recognition task having a vocabulary of 200 English words. The best performance was achieved with PMC but JA showed comparable performance to CMN and outperformed it when JA was done in two steps from SNR of 25 dB to 5 dB. The system was tested with the SpeechDat(II) database by adding noise recorded inside a car to the test set utterances at various SNR levels.
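
Cepstral mean normalization, used as a baseline in this abstract, is simple enough to show in a few lines. The sketch below is the usual per-utterance form, given as an assumption about the baseline rather than a detail taken from the paper; the function name and the toy frames are illustrative.

```python
def cepstral_mean_normalization(frames):
    """Subtract the per-coefficient mean over the utterance from every frame.
    A fixed channel adds a constant offset in the cepstral domain, so this
    subtraction removes it."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


utterance = [[1.0, 2.0], [3.0, 4.0]]
# The same utterance through a hypothetical channel with a constant offset.
shifted = [[c + 10.0 for c in frame] for frame in utterance]

print(cepstral_mean_normalization(utterance))
print(cepstral_mean_normalization(shifted))
```

Both calls give the same normalized frames, which is why CMN helps against channel mismatch but, being linear, does little against additive noise at low SNR.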

Journal ArticleDOI
Tadashi Emori, Koichi Shinoda
TL;DR: The proposed method can estimate a more optimal parameter for a speaker with a small amount of computation than in past schemes using multiple vocal tract length parameters in advance.
Abstract: Speaker normalization techniques for correcting differences in the vocal tract lengths of different speakers, referred to as vocal tract length normalization, in a large vocabulary voice recognition system using a hidden Markov model (HMM), have been proposed in recent years. In this paper, a scheme for approximating especially small changes in the vocal tract length by linear mapping using a vocal tract length parameter in cepstrum space and maximum-likelihood estimation of this parameter from vocalization is proposed. The proposed method can estimate a more optimal parameter for a speaker with a small amount of computation than in past schemes using multiple vocal tract length parameters in advance. In evaluation tests of the recognition of 5000 single Japanese words, the proposed scheme decreased errors by 7.1% alone and 14.6% in combination with cepstrum mean normalization (CMN). © 2002 Wiley Periodicals, Inc. Syst Comp Jpn, 33(5): 30–40, 2002; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.1125