
Showing papers on "Cepstrum published in 2001"


Book
01 Jan 2001
TL;DR: This book develops a discrete-time framework for speech signal processing, covering speech production and perception, homomorphic (cepstral) and linear prediction analysis, short-time Fourier, filter-bank, and sinusoidal analysis/synthesis, and applications in coding, enhancement, and speaker recognition.
Abstract: (NOTE: Each chapter begins with an introduction and concludes with a Summary, Exercises and Bibliography.) 1. Introduction. Discrete-Time Speech Signal Processing. The Speech Communication Pathway. Analysis/Synthesis Based on Speech Production and Perception. Applications. Outline of Book. 2. A Discrete-Time Signal Processing Framework. Discrete-Time Signals. Discrete-Time Systems. Discrete-Time Fourier Transform. Uncertainty Principle. z-Transform. LTI Systems in the Frequency Domain. Properties of LTI Systems. Time-Varying Systems. Discrete-Fourier Transform. Conversion of Continuous Signals and Systems to Discrete Time. 3. Production and Classification of Speech Sounds. Anatomy and Physiology of Speech Production. Spectrographic Analysis of Speech. Categorization of Speech Sounds. Prosody: The Melody of Speech. Speech Perception. 4. Acoustics of Speech Production. Physics of Sound. Uniform Tube Model. A Discrete-Time Model Based on Tube Concatenation. Vocal Fold/Vocal Tract Interaction. 5. Analysis and Synthesis of Pole-Zero Speech Models. Time-Dependent Processing. All-Pole Modeling of Deterministic Signals. Linear Prediction Analysis of Stochastic Speech Sounds. Criterion of "Goodness". Synthesis Based on All-Pole Modeling. Pole-Zero Estimation. Decomposition of the Glottal Flow Derivative. Appendix 5.A: Properties of Stochastic Processes. Random Processes. Ensemble Averages. Stationary Random Process. Time Averages. Power Density Spectrum. Appendix 5.B: Derivation of the Lattice Filter in Linear Prediction Analysis. 6. Homomorphic Signal Processing. Concept. Homomorphic Systems for Convolution. Complex Cepstrum of Speech-Like Sequences. Spectral Root Homomorphic Filtering. Short-Time Homomorphic Analysis of Periodic Sequences. Short-Time Speech Analysis. Analysis/Synthesis Structures. Contrasting Linear Prediction and Homomorphic Filtering. 7. Short-Time Fourier Transform Analysis and Synthesis. Short-Time Analysis. Short-Time Synthesis. 
Short-Time Fourier Transform Magnitude. Signal Estimation from the Modified STFT or STFTM. Time-Scale Modification and Enhancement of Speech. Appendix 7.A: FBS Method with Multiplicative Modification. 8. Filter-Bank Analysis/Synthesis. Revisiting the FBS Method. Phase Vocoder. Phase Coherence in the Phase Vocoder. Constant-Q Analysis/Synthesis. Auditory Modeling. 9. Sinusoidal Analysis/Synthesis. Sinusoidal Speech Model. Estimation of Sinewave Parameters. Synthesis. Source/Filter Phase Model. Additive Deterministic-Stochastic Model. Appendix 9.A: Derivation of the Sinewave Model. Appendix 9.B: Derivation of Optimal Cubic Phase Parameters. 10. Frequency-Domain Pitch Estimation. A Correlation-Based Pitch Estimator. Pitch Estimation Based on a "Comb Filter." Pitch Estimation Based on a Harmonic Sinewave Model. Glottal Pulse Onset Estimation. Multi-Band Pitch and Voicing Estimation. 11. Nonlinear Measurement and Modeling Techniques. The STFT and Wavelet Transform Revisited. Bilinear Time-Frequency Distributions. Aeroacoustic Flow in the Vocal Tract. Instantaneous Teager Energy Operator. 12. Speech Coding. Statistical Models of Speech. Scalar Quantization. Vector Quantization (VQ). Frequency-Domain Coding. Model-Based Coding. LPC Residual Coding. 13. Speech Enhancement. Introduction. Preliminaries. Wiener Filtering. Model-Based Processing. Enhancement Based on Auditory Masking. Appendix 13.A: Stochastic-Theoretic Parameter Estimation. 14. Speaker Recognition. Introduction. Spectral Features for Speaker Recognition. Speaker Recognition Algorithms. Non-Spectral Features in Speaker Recognition. Signal Enhancement for the Mismatched Condition. Speaker Recognition from Coded Speech. Appendix 14.A: Expectation-Maximization (EM) Estimation. Glossary. Speech Signal Processing. Units. Databases. Index. About the Author.

984 citations


Proceedings ArticleDOI
29 Nov 2001
TL;DR: This work proposes the use of the linear predictive coding (LPC) cepstrum for clustering ARIMA time series, by using the Euclidean distance between the LPC cepstra of two time series as their dissimilarity measure.
Abstract: Much environmental and socioeconomic time-series data can be adequately modeled using autoregressive integrated moving average (ARIMA) models. We call such time series "ARIMA time series". We propose the use of the linear predictive coding (LPC) cepstrum for clustering ARIMA time series, by using the Euclidean distance between the LPC cepstra of two time series as their dissimilarity measure. We demonstrate that LPC cepstral coefficients have the desired features for accurate clustering and efficient indexing of ARIMA time series. For example, just a few LPC cepstral coefficients are sufficient in order to discriminate between time series that are modeled by different ARIMA models. In fact, this approach requires fewer coefficients than traditional approaches, such as DFT (discrete Fourier transform) and DWT (discrete wavelet transform). The proposed distance measure can be used for measuring the similarity between different ARIMA models as well. We cluster ARIMA time series using the "partition around medoids" method with various similarity measures. We present experimental results demonstrating that, using the proposed measure, we achieve significantly better clusterings of ARIMA time series data as compared to clusterings obtained by using other traditional similarity measures, such as DFT, DWT, PCA (principal component analysis), etc. Experiments were performed both on simulated and real data.
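The distance measure described above is straightforward to sketch: convert each series' fitted AR coefficients into LPC cepstral coefficients with the standard LPC-to-cepstrum recursion, then take the Euclidean distance between the truncated cepstra. A minimal illustration (the coefficient sign convention and the truncation order are assumptions, not values taken from the paper):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    # a[k-1] holds a_k for the all-pole model 1 / (1 - sum_k a_k z^-k);
    # this sign convention is an assumption -- flip signs for the other one.
    p = len(a)
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

def lpc_cepstral_distance(a1, a2, n_ceps=16):
    # Euclidean distance between truncated LPC cepstra, used as the
    # dissimilarity between two fitted AR/ARIMA models.
    return float(np.linalg.norm(lpc_to_cepstrum(a1, n_ceps) -
                                lpc_to_cepstrum(a2, n_ceps)))
```

For an AR(1) model with coefficient a, the recursion reproduces the closed form c_n = a^n / n, so the coefficients decay quickly and a handful of them already separates models with different poles.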

445 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: In this paper, the authors present several mechanisms that enable effective spread-spectrum audio watermarking systems: prevention against detection desynchronization, cepstrum filtering, and chess watermarks.
Abstract: We present several mechanisms that enable effective spread-spectrum audio watermarking systems: prevention against detection desynchronization, cepstrum filtering, and chess watermarks. We have incorporated these techniques into a system capable of reliably detecting a watermark in an audio clip that has been modified using a composition of attacks that degrade the original audio characteristics well beyond the limit of acceptable quality. Such attacks include: fluctuating scaling in the time and frequency domain, compression, addition and multiplication of noise, resampling, requantization, normalization, filtering, and random cutting and pasting of signal samples.

182 citations


Proceedings Article
Jasha Droppo1, Li Deng1, Alex Acero1
01 Sep 2001
TL;DR: This paper describes recent improvements to SPLICE, Stereo-based Piecewise Linear Compensation for Environments, which produces an estimate of the cepstrum of undistorted speech given the observed cepstrum of distorted speech.
Abstract: This paper describes recent improvements to SPLICE, Stereo-based Piecewise Linear Compensation for Environments, which produces an estimate of the cepstrum of undistorted speech given the observed cepstrum of distorted speech. For distributed speech recognition applications, SPLICE can be placed at the server, thus limiting the processing that would take place at the client. We evaluated this algorithm on the Aurora2 task, which consists of digit sequences within the TIDigits database that have been digitally corrupted by passing them through a linear filter and/or by adding different types of realistic noises at SNRs ranging from 20dB to -5dB. On set A data, for which matched training data is available, we achieved a 66% decrease in word error rate over the baseline system with clean models. This preliminary result is of practical significance because in a server implementation, new noise conditions can be added as they are identified once the service is running.

158 citations


Journal ArticleDOI
TL;DR: The method is based on noise-robust 2-D phase unwrapping and a noise-robust procedure to estimate the pulse in the complex cepstrum domain, and gave stable results with respect to noise and gray levels through several image sequences.
Abstract: This paper presents a new method for 2-D blind homomorphic deconvolution of medical B-scan ultrasound images. The method is based on noise-robust 2-D phase unwrapping and a noise-robust procedure to estimate the pulse in the complex cepstrum domain. Ordinary Wiener filtering is used in the subsequent deconvolution. The resulting images became much sharper with better defined tissue structures compared with the ordinary images. The deconvolved images had a resolution gain of the order of 3 to 7, and the signal-to-noise ratio (SNR) doubled for many of the images used in our experiments. The method gave stable results with respect to noise and gray levels through several image sequences.

112 citations


Journal ArticleDOI
Hong Kook Kim1, R.V. Cox
TL;DR: The proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals and its performance is comparable to that of a wireline recognition system that uses the cepstrum as a feature set.
Abstract: We propose a feature extraction method for a speech recognizer that operates in digital communication networks. The feature parameters are basically extracted by converting the quantized spectral information of a speech coder into a cepstrum. We also include the voiced/unvoiced information obtained from the bitstream of the speech coder in the recognition feature set. We performed speaker-independent connected digit HMM recognition experiments under clean, background noise, and channel impairment conditions. From these results, we found that the speech recognition system employing the proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals. Its performance is comparable to that of a wireline recognition system that uses the cepstrum as a feature set. Next, we extended the evaluation of the proposed bitstream-based front-end to large vocabulary speech recognition with a name database. The recognition results proved that the proposed bitstream-based front-end also gives a comparable performance to the conventional wireline front-end.

91 citations


Proceedings ArticleDOI
07 May 2001
TL;DR: In this article, a method for speech/non-speech detection using a linear discriminant analysis (LDA) applied to mel frequency cepstrum coefficients (MFCC) is presented.
Abstract: In speech recognition, speech/non-speech detection must be robust to noise. In this paper, a method for speech/non-speech detection using a linear discriminant analysis (LDA) applied to mel frequency cepstrum coefficients (MFCC) is presented. The energy is the most discriminant parameter between noise and speech, but with this single parameter alone, the speech/non-speech detection system detects too many noise segments. The LDA applied to MFCC and the associated test reduces the detection of noise segments. This new algorithm is compared to one based on the signal-to-noise ratio (Mauuary and Monne, 1993).
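The core of such a detector, a two-class Fisher discriminant on MFCC (plus energy) frames, can be sketched in a few lines (illustrative only; the paper's exact feature set, training data, and decision test are not reproduced here):

```python
import numpy as np

def fisher_lda_direction(speech_frames, noise_frames):
    # Two-class Fisher LDA: w maximizes between-class separation over
    # within-class scatter. Inputs are (n_frames, n_features) arrays of
    # MFCC (+ energy) vectors for each class.
    mu_s = speech_frames.mean(axis=0)
    mu_n = noise_frames.mean(axis=0)
    # within-class scatter = sum of the per-class scatter matrices
    Sw = (np.cov(speech_frames, rowvar=False) * (len(speech_frames) - 1) +
          np.cov(noise_frames, rowvar=False) * (len(noise_frames) - 1))
    w = np.linalg.solve(Sw, mu_s - mu_n)
    return w / np.linalg.norm(w)
```

Projecting each frame onto w yields a scalar score; thresholding that score, rather than the energy alone, is what cuts down the false detection of noise segments.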

80 citations


Proceedings ArticleDOI
02 May 2001
TL;DR: The performance of the test system has proved the feasibility of modeling a language with a single Gaussian mixture model instead of using a complex system such as a phonetic recogniser followed by language modelling or a large-vocabulary continuous speech recognition system.
Abstract: The speech parametrization methods linear prediction cepstrum coefficients and mel-frequency cepstrum coefficients were compared with regard to language identification accuracy in a Gaussian mixture model based language identification system. Ten different languages were tested against a set of ten-second test files. The 12th-order linear prediction cepstrum coefficients with delta and acceleration coefficients resulted in the best accuracy of 60.0 percent. This shows that information obtained from linear prediction analysis has increased the ability to discriminate between different languages. It also shows that language identification performance may be increased by encompassing temporal information through delta and acceleration features. Moreover, the performance of our test system has proved the feasibility of modeling a language with a single Gaussian mixture model instead of using a complex system such as a phonetic recogniser followed by language modelling or a large-vocabulary continuous speech recognition system.

80 citations


Journal ArticleDOI
TL;DR: In this article, the authors combine some extensive numerical simulations of a single stage geared unit with localized tooth faults and the use of several detection techniques whose performances are compared and critically assessed.
Abstract: The early detection of failures in geared systems is an important industrial problem which has still to be addressed from both an experimental and theoretical viewpoint. The proposed paper combines some extensive numerical simulations of a single stage geared unit with localized tooth faults and the use of several detection techniques whose performances are compared and critically assessed. A model aimed at simulating the contributions of local tooth defects such as spalling to the gear dynamic behavior is set up. The pinion and the gear of a pair are assimilated to two rigid cylinders with all six degrees of freedom connected by a series of springs which represent gear body and gear tooth compliances on the base plane. Classical shaft finite elements including torsional, flexural and axial displacements can be superimposed to the gear element together with some lumped stiffnesses, masses, inertias, … which account for the load machines, bearings and couplings. Tooth defects are modeled by a distribution of normal deviations over a zone which can be located anywhere on the active tooth flanks. Among the numerous available signal processing techniques used in vibration monitoring, cepstrum analysis is sensitive, reliable and it can be adapted to complex geared system with several meshes. From an analytical analysis of the equations of motion, two complementary detection techniques based upon acceleration power cepstrum are proposed. The equations of motion and the contact problem between mating flanks are simultaneously solved by coupling an implicit time-step integration scheme and a unilateral normal contact algorithm. The results of the numerical simulations are used as a data base for the proposed detection techniques. 
The combined influence of the defect location, depth and extent is analyzed for two examples of spur and helical gears with various profile modifications and the effectiveness of the two complementary detection methods is discussed before some conclusions are drawn.

57 citations


Patent
11 Jul 2001
TL;DR: In this paper, the initial weighting coefficients are calculated from a cepstrum extracted from the repetitive-PN1023 sequence ECR signal by DFT methods or with a PN1023 auto-correlation match filter.
Abstract: DTV signals transmitted over the air with a symbol rate of around 10.76 million samples per second include echo-cancellation reference (ECR) signals each of which includes or essentially consists of a repetitive-PN1023 sequence with baud-rate symbols, which repetitive-PN1023 sequence incorporates a number of consecutive data-segment synchronization signals. Receivers for these DTV signals respond to these ECR signals to generate initial weighting coefficients for adaptive filters used for channel equalization and echo suppression. The initial weighting coefficients are calculated from a cepstrum extracted from the repetitive-PN1023 sequence ECR signal by DFT methods or with a PN1023 auto-correlation match filter.

52 citations


Journal ArticleDOI
TL;DR: In this article, a new indicator for the vibratory diagnosis of gear systems is proposed based on the power cepstrum of the accelerometer signal, which is derived from an analytical analysis of the equations of motion.

Journal ArticleDOI
TL;DR: In this paper, a new cepstral analysis procedure with the complex cepstrum for recovering excitations causing multiple transient signal components from vibration signals, especially from rotor vibration signals has been developed.
Abstract: A new cepstral analysis procedure with the complex cepstrum for recovering excitations causing multiple transient signal components from vibration signals, especially from rotor vibration signals, has been developed. Along with the problem of singularity, a major problem of the cepstrum is that it cannot provide a correct distribution of the excitations. To solve these problems, a signal preprocessing method, whose function is to provide a definition for the distribution of the excitations along the quefrency axis and remove singular points from the transform, has been added to the cepstrum analysis. With this procedure, a correct distribution of the excitations can be obtained. An example of application to the condition monitoring of rotor machinery is also presented.
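For reference, a basic FFT-based complex cepstrum, the transform underlying the procedure above, looks like this (a minimal sketch that assumes a roughly minimum-phase signal so that no linear-phase term has to be removed; the paper's preprocessing step is not included):

```python
import numpy as np

def complex_cepstrum(x):
    # Complex cepstrum: log magnitude plus unwrapped phase of the DFT,
    # inverse-transformed back to the quefrency domain.
    X = np.fft.fft(x)
    log_X = np.log(np.abs(X)) + 1j * np.unwrap(np.angle(X))
    return np.fft.ifft(log_X).real
```

An impulse followed by a scaled echo at lag D produces cepstral peaks at multiples of D with alternating signs, which is what makes the transform useful for separating transient excitations from the transmission path.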

Patent
19 Jun 2001
TL;DR: The inverse discrete Fourier transform of the logarithm of the two-sided autospectral density is used to evaluate the performance of a rocket engine during static test firing.
Abstract: Vibration and tachometer measurements are used to assess the health of rotating equipment: two-sided cepstrum parameters are computed and stored, and the engine's performance is compared against a class of engines to determine out-of-family behavior indicating the healthy or defective nature of the engine under test. The cepstrum parameters can be viewed after static test firing of a rocket engine and analyzed for changes that indicate defect growth during static test firing. Engine-to-engine comparisons of vibration-related parameters can be used to provide information on abnormal gear behavior. The cepstrum is defined as the inverse discrete Fourier transform of the logarithm of the two-sided autospectral density. The test method is an effective screen for determining defective rocket engine components during preflight static testing.
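The definition used here, the inverse DFT of the log of the two-sided power spectrum, is directly computable (illustrative sketch; the windowing and averaging choices a real test procedure would apply are omitted):

```python
import numpy as np

def power_cepstrum(x):
    # Inverse DFT of the log of the two-sided power spectrum
    # (the autospectral density up to normalization/averaging choices).
    power = np.abs(np.fft.fft(x)) ** 2
    return np.fft.ifft(np.log(power)).real
```

A family of rahmonic peaks at a fixed quefrency spacing in this cepstrum flags echo-like or periodic sideband structure in a vibration signal, which is why it lends itself to engine-to-engine comparison.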

PatentDOI
Juha Iso-Sipila1
TL;DR: In this article, a low-pass filter is used to filter the normalized modulation spectrum in order to improve the signal-to-noise ratio (SNR) in the speech signal.
Abstract: A method and apparatus for speech processing in a distributed speech recognition system having a front-end and a back-end. The speech processing steps in the front-end are as follows: extracting speech features from a speech signal and normalizing the speech features in order to alter the power of the noise component in the modulation spectrum in relation to the power of the signal component, especially with frequencies above 10 Hz. A low-pass filter is then used to filter the normalized modulation spectrum in order to improve the signal-to-noise ratio (SNR) in the speech signal. The combination of feature vector normalization and low-pass filtering is effective in noise removal, especially in a low SNR environment.

Proceedings Article
01 Jan 2001
TL;DR: The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance; the resulting noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples.
Abstract: A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples.

Proceedings ArticleDOI
01 May 2001
TL;DR: This paper compares the recognition performance of the mel-LPC cepstrum with those of both the standard LPC mel-cepstrum and the MFCC through the Japanese dictation system with 20,000 word vocabulary, and finds that this performance is slightly superior to that of MFCC.
Abstract: This paper presents a simple and efficient time domain technique to estimate an all-pole model on the mel-frequency scale (mel-LPC), and compares the recognition performance of the mel-LPC cepstrum with those of both the standard LPC mel-cepstrum and the MFCC (mel-frequency cepstral coefficient) through the Japanese dictation system (Julius) with 20,000 word vocabulary. First, the optimal value of the frequency warping factor is examined in terms of monosyllable accuracy. When using the optimal warping factors, the mel-LPC cepstrum attains word accuracies of 93.0% for male speakers and 93.1% for female speakers, which are 2.1% and 1.7% higher than those of the LPC mel-cepstrum, respectively. Furthermore, this performance is slightly superior to that of MFCC.
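The mel-like warping at the heart of mel-LPC is the phase response of a first-order all-pass filter; a small sketch of the warping curve (the warping factor used here is a common ballpark value, not the paper's tuned optimum):

```python
import numpy as np

def warp_frequency(omega, alpha):
    # Phase response of the all-pass (z^-1 - alpha) / (1 - alpha z^-1):
    # the frequency warping used to approximate the mel scale.
    # alpha ~ 0.35-0.45 is typical at common sample rates (illustrative).
    return omega + 2.0 * np.arctan(
        alpha * np.sin(omega) / (1.0 - alpha * np.cos(omega)))
```

Positive alpha stretches low frequencies and compresses high ones while keeping 0 and pi fixed, which gives the mel-like allocation of spectral resolution.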

Proceedings Article
01 Jan 2001
TL;DR: This paper compares the Root-cepstrum to Mel-Frequency Cepstrum Coefficients (MFCC) in terms of noise immunity during modeling and decoding speed, and observes that for 84% of the phonemes, the average distance to all other acoustic units is increased in the Root-cepstrum domain compared to MFCC, resulting in a sharper acoustic model set.
Abstract: Root-cepstral analysis has been proposed previously for speech recognition in car environments [9]. In this paper, we focus on an alternative aspect of Root-cepstrum as it applies to discriminative acoustic modeling and fast speech recognizer decoding. We compare Root-cepstrum to Mel-Frequency Cepstrum Coefficients (MFCC) in terms of their noise immunity during modeling and decoding speed. Our experiments use the SPINE [5] corpus, which is composed of clean and noisy data with a 5K vocabulary size. Experiments were performed that allow pair-wise comparisons of acoustic models across different feature sets and acoustic units. We observed that for 84% of the phonemes, the average distance to all other acoustic units is increased in the Root-cepstrum domain compared to MFCC, resulting in a sharper acoustic model set. Therefore, the ambiguity in the Root-cepstrum space is reduced. Large vocabulary noisy speech recognition experiments showed a 27.5% reduction in real-time processing factor (RTF) compared to MFCC features while improving overall recognition accuracy.
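Root-cepstral analysis replaces the logarithm in the cepstrum computation with a power-law (root) compression of the magnitude spectrum. A minimal sketch (gamma and the number of retained coefficients are illustrative choices, not values from the paper):

```python
import numpy as np

def root_cepstrum(frame, gamma=0.1, n_ceps=13):
    # Power-law compression |X|^gamma in place of the usual log|X|,
    # followed by the inverse transform back to the quefrency domain.
    spectrum = np.abs(np.fft.rfft(frame))
    return np.fft.irfft(spectrum ** gamma)[:n_ceps]
```

As gamma shrinks toward zero the compression approaches log behavior; moderate gamma keeps spectral valleys from dominating the features, which is the usual argument for its noise robustness.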

Journal ArticleDOI
TL;DR: A new performance criterion for spectral envelope fitting, based on a statistical analysis of the behavior of the empirical sinusoidal magnitude estimates, is introduced, and it is shown that penalization is an efficient approach to control the smoothness of the estimated envelope.
Abstract: Estimation of the spectral envelope (magnitude of the transfer function) of a filter driven by a periodic signal is a long-standing problem in speech and audio processing. Recently, there has been a renewed interest in this issue in connection with the rapid developments of processing techniques based on sinusoidal modeling. In this paper, we introduce a new performance criterion for spectral envelope fitting which is based on the statistical analysis of the behavior of the empirical sinusoidal magnitude estimates. We further show that penalization is an efficient approach to control the smoothness of the estimated envelope. In low-noise situations, the proposed method can be approximated by a two-step weighted least-squares procedure which also provides an interesting insight into the limitations of the previously proposed "discrete cepstrum" approach. A systematic simulation study confirms that the proposed methods perform significantly better than existing ones for high-pitched and noisy signals.
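The "discrete cepstrum" baseline analyzed here fits a low-order cepstral envelope to the sinusoidal peaks by regularized least squares; a compact sketch (a plain ridge penalty stands in for the paper's smoothness penalty, and the model order is an arbitrary illustrative choice):

```python
import numpy as np

def discrete_cepstrum(freqs, amps, order, lam=1e-4):
    # Fit log|H(w)| = c0 + 2 * sum_n c_n cos(n w) to sinusoidal peaks
    # (freqs in radians, amps linear) by regularized least squares.
    M = np.column_stack(
        [np.ones_like(freqs)] +
        [2.0 * np.cos(n * freqs) for n in range(1, order + 1)])
    A = M.T @ M + lam * np.eye(order + 1)
    return np.linalg.solve(A, M.T @ np.log(amps))
```

With few, high-pitched partials the system is poorly conditioned, which is exactly where the penalty term (or the paper's weighted two-step refinement) matters.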

Proceedings Article
01 Jan 2001
TL;DR: A new VTLN method is proposed, in which the vocal tract length is normalized in the cepstrum space by means of a linear mapping whose parameter is derived using maximum-likelihood estimation, which offers greater precision in determining parameters for individual speakers.
Abstract: Recently, vocal tract length normalization (VTLN) techniques have been developed for speaker normalization in speech recognition. This paper proposes a new VTLN method, in which the vocal tract length is normalized in the cepstrum space by means of a linear mapping whose parameter is derived using maximum-likelihood estimation. The computational costs of this method are much lower than those of such conventional methods as ML-VTLN, in which the parameter for mapping is selected from among several parameters. Further, the new method offers greater precision in determining parameters for individual speakers. Experimental use of the method resulted in an error reduction rate of 7.1%. A combination of the proposed method with the cepstrum mean normalization (CMN) method was also examined and found to reduce the error rate even more, by 14.6%.
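The CMN step combined with the proposed method is the standard per-utterance mean subtraction in the cepstrum domain:

```python
import numpy as np

def cepstral_mean_normalization(feats):
    # Subtract the per-utterance mean of each cepstral coefficient;
    # feats has shape (n_frames, n_ceps). A stationary channel adds a
    # constant offset in the cepstrum domain, which this removes.
    return feats - feats.mean(axis=0, keepdims=True)
```

Because a fixed linear channel is additive in the cepstrum, CMN removes it exactly, which is why it composes naturally with a linear VTLN mapping in the same domain.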

Proceedings Article
03 Jan 2001
TL;DR: The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance, and the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples.
Abstract: A challenging, unsolved problem in the speech recognition community is recognizing speech signals that are corrupted by loud, highly nonstationary noise. One approach to noisy speech recognition is to automatically remove the noise from the cepstrum sequence before feeding it into a clean speech recognizer. In previous work published in Eurospeech, we showed how a probability model trained on clean speech and a separate probability model trained on noise could be combined for the purpose of estimating the noise-free speech from the noisy speech. We showed how an iterative 2nd order vector Taylor series approximation could be used for probabilistic inference in this model. In many circumstances, it is not possible to obtain examples of noise without speech. Noise statistics may change significantly during an utterance, so that speech-free frames are not sufficient for estimating the noise model. In this paper, we show how the noise model can be learned even when the data contains speech. In particular, the noise model can be learned from the test utterance and then used to denoise the test utterance. The approximate inference technique is used as an approximate E step in a generalized EM algorithm that learns the parameters of the noise model from a test utterance. For both Wall Street Journal data with added noise samples and the Aurora benchmark, we show that the new noise adaptive technique performs as well as or significantly better than the non-adaptive algorithm, without the need for a separate training set of noise examples.

Patent
07 Sep 2001
TL;DR: In this paper, the data is transformed into the frequency domain and back into the time domain to allow suppression of interference frequencies, and the cepstrum is compared with a comparison value contained in memory that corresponds to load and speed signals for the current operating status.
Abstract: Method and device for monitoring machine plant based on vibration monitoring whereby the data is transformed into the frequency domain and back into the time domain to allow suppression of interference frequencies. According to the method structural noise of the machine plant is recorded using a sensor (1), transmitted as an acceleration signal and analyzed in a digital signal processor (DSP). To inhibit interference due to environmental vibrations or structural sound waves that do not relate to the status of the machine plant, the signal is transformed into the frequency domain using an FFT. It is then transformed back into the time domain using a cepstrum analysis such that single shock impulses (a cepstrum) are obtained in the time domain. The cepstrum is compared with a comparison value contained in memory that corresponds to load and speed signals for the current operating status. If the threshold is exceeded, conclusions can be made about damage to the unit and a remaining service life predicted. An emergency operation can be initiated.

Proceedings Article
01 Jan 2001
TL;DR: A variety of techniques for robust digit recognition in noise are considered using the AURORA 2.0 corpus, and it is shown that MFCC adaptation could not outperform RCC parameterization with front-end enhancement, which is much more computationally efficient than model adaptation.
Abstract: In this paper, a variety of techniques for robust digit recognition in noise are considered using the AURORA 2.0 corpus. Current recognizers perform as well as humans in small vocabulary tasks, but computer recognition performance degrades substantially when noise is introduced into the speech, while human performance is much less sensitive. To make the recognizer robust, several methodologies are employed. These include feature processing, enhancement before recognition, and model adaptation. We considered a number of processing and adaptation scenarios depending on noise type. The best performance, as expected, was obtained in matched training conditions, which in general have limited applicability in real-world problems. As a feature processing step, using RCCs (root cepstrum coefficients) instead of MFCCs gave substantial improvement. MFCC with front-end enhancement increased performance considerably, but results were far from those obtained with matched training. When we combine the RCC with enhancement, however, we get the best results. In the next step, we employed model adaptation techniques which outperformed MFCC+enhancement and gave much closer results to the matched condition limits. However, MFCC adaptation could not outperform RCC parameterization with front-end enhancement, which we show is much more computationally efficient than model adaptation.

Proceedings ArticleDOI
04 Jul 2001
TL;DR: New structures are proposed for an effective realization of cepstral vocal tract models that model both formants and antiformants.
Abstract: Speech is an analog sound signal produced by exciting the human vocal tract. The magnitude response of the vocal tract exhibits both peaks (formants) and valleys (antiformants). Vocal tract models are differentiated according to whether they model the formants alone (LPC models) or also antiformants (ARMA and cepstral models). New structures are proposed for an effective realization of cepstral vocal tract models that model both formants and antiformants.

Journal ArticleDOI
R. Hariharan1, I. Kiss1, I. Viikki1
TL;DR: A multiresolution-based feature extraction technique for speech recognition in adverse conditions improves word recognition accuracy by 41%, and the proposed algorithm clearly outperformed the mel cepstral front-end when the same number of HMM parameters was used in both systems.
Abstract: In this paper, we present a multiresolution-based feature extraction technique for speech recognition in adverse conditions. The proposed front-end algorithm uses mel cepstrum-based feature computation of subbands in order not to spread noise distortions over the entire feature space. Conventional full-band features are also augmented to the final feature vector which is fed to the recognition unit. Other novel features of the proposed front-end algorithm include emphasis of long-term spectral information combined with cepstral domain feature vector normalization and the use of the PCA transform, instead of the DCT, to provide the final cepstral parameters. The proposed algorithm was experimentally evaluated in a connected digit recognition task under various noise conditions. The results obtained show that the new feature extraction algorithm improves word recognition accuracy by 41% compared with the mel cepstrum front-end. A substantial increase in recognition accuracy was observed in all tested noise environments at all SNRs. The good performance of the multiresolution front-end is not only due to the higher feature vector dimension: the proposed algorithm clearly outperformed the mel cepstral front-end when the same number of HMM parameters was used in both systems. We also propose methods to reduce the computational complexity of the multiresolution front-end-based speech recognition system. Experimental results indicate the viability of the proposed techniques.
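Two ingredients of the front-end above can be sketched compactly: computing cepstra per subband (so noise confined to one band stays in that band's coefficients) and replacing the DCT with a data-driven PCA basis. The two-band split, dimensions, and synthetic data below are illustrative only, not the paper's configuration:

```python
import numpy as np

def subband_cepstra(log_energies, bands):
    """Cepstra computed separately over subbands of the log mel filterbank
    energies; `bands` lists (start, stop) filter-index ranges."""
    feats = []
    for a, b in bands:
        seg = log_energies[a:b]
        n = b - a
        dct = np.cos(np.pi * np.outer(np.arange(n), np.arange(n) + 0.5) / n)
        feats.append(dct @ seg)
    return np.concatenate(feats)

def pca_basis(frames, n_out):
    """Decorrelating transform used in place of a fixed DCT: the top
    principal components estimated from training feature frames."""
    x = frames - frames.mean(axis=0)
    vals, vecs = np.linalg.eigh(x.T @ x / len(x))
    return vecs[:, np.argsort(vals)[::-1][:n_out]]

# Toy usage: 24 mel filters split into two subbands, PCA fit on 200 frames
# of synthetic "log energies" standing in for real training data.
rng = np.random.default_rng(0)
frames = np.array([subband_cepstra(rng.standard_normal(24), [(0, 12), (12, 24)])
                   for _ in range(200)])
basis = pca_basis(frames, 13)
final_feats = (frames - frames.mean(axis=0)) @ basis
```

Unlike the DCT, the PCA basis adapts to the actual covariance of the subband features, which is the motivation the abstract gives for swapping it in.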

Journal ArticleDOI
TL;DR: This work examines the problem of identifying temporal regions or frames as either one-speaker or two-speaker speech, and proposes a new pitch prediction feature (PPF), which is compared with linear predictive cepstral coefficients (LPCC) and mel-frequency cepstral coefficients (MFCC).
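The listing gives only the TL;DR, so the paper's actual pitch prediction feature is not specified here; as a generic stand-in, a simple autocorrelation pitch detector illustrates the kind of periodicity cue such a feature exploits. Overlapped two-speaker speech tends to break the single clear periodicity this detector relies on, which is the intuition for using pitch to separate the two cases:

```python
import numpy as np

def pitch_lag(frame, sr=8000, fmin=60.0, fmax=400.0):
    """Autocorrelation pitch estimate: pick the lag, within the plausible
    pitch range, where the frame best matches a shifted copy of itself."""
    ac = np.correlate(frame, frame, "full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return lag, sr / lag

# Synthetic voiced frame: a pulse-like periodic signal at 125 Hz
sr = 8000
t = np.arange(512) / sr
frame = np.sign(np.sin(2 * np.pi * 125 * t))
lag, f0 = pitch_lag(frame, sr)
```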

01 Jan 2001
TL;DR: Experimental results indicate that trajectories on such reduced dimension spaces can provide reliable representations of spoken words, while reducing the training complexity and the operation of the Recognizer.
Abstract: Although speech recognition products are already available in the market at present, their development is mainly based on statistical techniques which work under very specific assumptions. The work presented in this thesis investigates the feasibility of alternative approaches for solving the problem more efficiently. A speech recognizer system comprised of two distinct blocks, a Feature Extractor and a Recognizer, is presented. The Feature Extractor block uses a standard LPC Cepstrum coder, which translates the incoming speech into a trajectory in the LPC Cepstrum feature space, followed by a Self Organizing Map, which tailors the outcome of the coder in order to produce optimal trajectory representations of words in reduced dimension feature spaces. Designs of the Recognizer block based on three different approaches are compared. The performance of recognizers based on Templates, MultiLayer Perceptrons, and Recurrent Neural Networks is tested on a small isolated speaker dependent word recognition problem. Experimental results indicate that trajectories on such reduced dimension spaces can provide reliable representations of spoken words, while reducing the training complexity and the operation of the Recognizer. The comparison between the different approaches to the design of the Recognizers conducted here gives a better understanding of the problem and its possible solutions. A new learning procedure that optimizes the usage of the training set is also presented. Optimal tailoring of trajectories, new
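The Self Organizing Map stage above can be sketched with a minimal 1-D SOM: feature vectors (e.g. LPC cepstra) are quantized onto a line of nodes, so a word becomes a trajectory of node indices in a reduced space. Node count, learning schedule, and data below are illustrative stand-ins, not the thesis's configuration:

```python
import numpy as np

def train_som(data, n_nodes=16, epochs=50, lr=0.5, sigma=3.0, seed=0):
    """Minimal 1-D self-organizing map trained by the classic online rule:
    find the best-matching unit, then pull it and its line neighbours
    toward the sample, with learning rate and neighbourhood decaying."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((n_nodes, data.shape[1]))
    for t in range(epochs):
        decay = 1.0 - t / epochs
        for x in rng.permutation(data):
            bmu = int(np.argmin(np.sum((w - x) ** 2, axis=1)))
            h = np.exp(-((np.arange(n_nodes) - bmu) ** 2)
                       / (2.0 * (sigma * decay + 1e-3) ** 2))
            w += (lr * decay) * h[:, None] * (x - w)
    return w

def trajectory(frames, w):
    """Map a sequence of feature frames to its SOM node-index trajectory."""
    return np.array([int(np.argmin(np.sum((w - f) ** 2, axis=1)))
                     for f in frames])

# Toy usage: three synthetic clusters standing in for cepstral frames
rng = np.random.default_rng(3)
data = np.vstack([rng.standard_normal((50, 8)) + c for c in (-3.0, 0.0, 3.0)])
w = train_som(data, n_nodes=12, epochs=20)
traj = trajectory(data[:5], w)
```

The node-index trajectory is what the Recognizer block (Templates, MLP, or RNN) would then consume in place of the raw high-dimensional cepstral sequence.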

Journal ArticleDOI
TL;DR: In this paper, a blind deconvolution method based on the cepstrum technique is proposed to identify specific damage modes in fiber-reinforced composites, where the acoustic emission signal is demodulated and information on the wave source can be revealed, and thus damage can be identified.
Abstract: The analysis of acoustic emission signals has been widely applied to damage detection and damage characterization in composites. Features of acoustic emission signals, such as amplitude, frequency, and counts, are usually used to identify the type of a damage. Recently, time-frequency distribution techniques, such as the wavelet transform and the Choi-Williams distribution, have also been applied to characterize damage. A common feature of these approaches is that the analysis is on the acoustic emission signal itself. Nevertheless, this signal is not the wave source signal as it has been modulated by the signal transfer path. Real information on damage is actually hidden behind the signal. To reveal direct information on damage, a blind deconvolution method has been developed. It is a frequency domain method based on the cepstrum technique. With the method, the acoustic emission signal is demodulated, and information on the wave source can be revealed, and thus damage can be identified. This paper presents preliminary test data to assess the validity of the proposed methodology as a means of identifying specific damage modes in fiber-reinforced composites.
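The cepstral deconvolution idea above rests on the convolution-to-addition property: a received signal x = s * h has log|X| = log|S| + log|H|, so liftering in the quefrency domain can separate the slowly varying transfer-path envelope from the source detail. A minimal sketch, with an illustrative quefrency cutoff (the paper's actual processing chain is not given in this listing):

```python
import numpy as np

def cepstral_deconvolve(signal, cutoff):
    """Split a convolved signal in the quefrency domain: keep low
    quefrencies as the transfer-path log spectrum, and treat the residue
    as the demodulated source (wave-source) log spectrum."""
    n = len(signal)
    log_mag = np.log(np.abs(np.fft.rfft(signal)) + 1e-12)
    ceps = np.fft.irfft(log_mag, n)            # real cepstrum
    path = ceps.copy()
    path[cutoff:n - cutoff] = 0.0              # short-pass lifter (keeps ends)
    path_logspec = np.fft.rfft(path).real      # transfer-path estimate
    source_logspec = log_mag - path_logspec    # source part, path removed
    return path_logspec, source_logspec

# Toy usage on a synthetic broadband record
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
path_ls, src_ls = cepstral_deconvolve(x, cutoff=32)
```

Note the lifter zeroes the middle of the cepstrum symmetrically, preserving the even symmetry of the real cepstrum so the liftered log spectrum stays real.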

Patent
15 Aug 2001
TL;DR: In this paper, an apparatus and method for generating a parametric representation of input speech based on a mel-frequency warping of the vocal tract spectrum which is computationally efficient and provides increased recognition accuracy over conventional LP cepstrum approaches is presented.
Abstract: The present invention is an apparatus and method for generating parametric representation of input speech based on a mel-frequency warping of the vocal tract spectrum which is computationally efficient and provides increased recognition accuracy over conventional LP cepstrum approaches. It is capable of rapid processing operable in many different devices. The invention is a speech recognition system comprising linear prediction (LP) signal processor and a mel-frequency linear prediction (MFLP) generator for mel-frequency warping the LP parameters to generate MFLP parametric representations for robust, perceptually modeled speech recognition requiring minimal computation and storage.
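The patent abstract does not spell out the MFLP computation, but one plausible illustration of mel-frequency warping applied to LP parameters is to sample the LP log spectrum on a mel-spaced frequency grid and then decorrelate with a DCT. Everything below (analysis order, grid size, window) is an assumption for the sketch:

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method linear prediction via Levinson-Durbin."""
    r = np.correlate(frame, frame, "full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[1:i][::-1]) / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def mel_lp_cepstrum(frame, sr=8000, order=10, n_ceps=13, n_grid=64):
    """Mel-warped LP cepstrum sketch: evaluate log|1/A| of the LP model on
    a mel-spaced grid of frequencies, then apply a DCT over that grid."""
    a = lpc(frame * np.hamming(len(frame)), order)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    w = 2.0 * np.pi * imel(np.linspace(0.0, mel(sr / 2), n_grid)) / sr
    A = np.exp(-1j * np.outer(w, np.arange(order + 1))) @ a
    log_spec = -np.log(np.abs(A) + 1e-12)      # LP log spectrum, warped grid
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  np.arange(n_grid) + 0.5) / n_grid)
    return dct @ log_spec / n_grid

rng = np.random.default_rng(4)
ceps_vec = mel_lp_cepstrum(rng.standard_normal(256))
```

Because the LP model has only `order + 1` coefficients, evaluating its spectrum on the warped grid is cheap, which is consistent with the abstract's claim of minimal computation and storage relative to filterbank-based front-ends.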

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a method to measure the range and depth from the sound source to the receiver by analyzing the radiated-noise cepstrum of a moving vessel or torpedo.
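The cepstral step behind such ranging can be illustrated simply: a direct arrival plus a surface-reflected echo puts a periodic ripple in the spectrum of the radiated noise, which shows up as a cepstral peak at the echo delay. The mapping from that delay to range and depth depends on the multipath geometry and is omitted here; the sketch below only recovers the delay from synthetic data:

```python
import numpy as np

def echo_delay(signal, sr, min_lag=8):
    """Estimate the relative delay of a reflected path from the position
    of the dominant cepstral peak (quefrency), skipping the lowest lags."""
    log_mag = np.log(np.abs(np.fft.rfft(signal)) + 1e-12)
    ceps = np.fft.irfft(log_mag, len(signal))
    half = len(signal) // 2
    q = min_lag + int(np.argmax(ceps[min_lag:half]))
    return q / sr

# Synthetic check: broadband noise plus a 5 ms surface echo at 0.8 amplitude
sr = 8000
rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
d = int(0.005 * sr)
y = x.copy()
y[d:] += 0.8 * x[:-d]
delay = echo_delay(y, sr)
```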

Patent
21 Sep 2001
TL;DR: A speech recognition device is proposed that uses speaker normalization processing capable of rapidly normalizing to the individual features of a speaker through on-line unsupervised adaptation, without requiring the user to utter specific content.
Abstract: PROBLEM TO BE SOLVED: To provide a speech recognition device that does not require the user to utter specific content and that uses speaker normalization processing capable of rapidly normalizing to the individual features of a speaker through on-line unsupervised adaptation. SOLUTION: Feature values such as LPC cepstrum coefficients are extracted from the speech digitized by A/D conversion as input signals (S10). Frequency axis conversion is then applied to the feature values such as the LPC cepstrum in order to normalize the effect of the speaker's individual vocal tract length (S30). Matching is then performed between the frequency-axis-converted feature values of the input speech and acoustic model feature values learned beforehand from plural speakers (S50). After that, the input utterances are used as teacher signals based on the recognition result computed in S50, and optimum conversion coefficients are obtained (S60). The conversion coefficients are then smoothed to absorb the dispersion caused by speakers and phonemes, and new updated frequency axis conversion coefficients are obtained (S70).
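The frequency-axis conversion and coefficient search in steps S30-S60 can be sketched with a simple vocal-tract-length-normalization-style warp: resample the log spectrum at scaled frequencies and pick the scale that best matches the acoustic model. The linear warp, the candidate grid, and the squared-error score below are illustrative stand-ins; the patent does not pin down its exact conversion function or matching criterion:

```python
import numpy as np

def warp_frequency_axis(log_spec, alpha):
    """Linear frequency-axis conversion: resample the log spectrum at
    w' = alpha * w using linear interpolation between bins."""
    n = len(log_spec)
    src = np.clip(np.arange(n) * alpha, 0, n - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    frac = src - lo
    return (1.0 - frac) * log_spec[lo] + frac * log_spec[hi]

def best_warp(log_spec, model_spec, alphas=np.linspace(0.88, 1.12, 13)):
    """Unsupervised selection of the conversion coefficient: score each
    candidate warp against the model (squared error standing in for the
    recognizer-based score of steps S50/S60)."""
    errors = [np.sum((warp_frequency_axis(log_spec, a) - model_spec) ** 2)
              for a in alphas]
    return float(alphas[int(np.argmin(errors))])

# Toy usage: recover a known warp factor from a synthetic log spectrum
rng = np.random.default_rng(2)
spec = rng.standard_normal(128).cumsum()     # smooth-ish toy log spectrum
target = warp_frequency_axis(spec, 1.04)     # "speaker" warped by 1.04
alpha_hat = best_warp(spec, target)
```

Smoothing the estimated coefficient across utterances (step S70) would then be a running average over `alpha_hat` values, absorbing per-phoneme scatter.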