
Showing papers on "Cepstrum published in 2000"


Proceedings Article
01 Jan 2000
TL;DR: The results show that the use of the Mel scale for modeling music is at least not harmful for this problem, although further experimentation is needed to verify that it is the optimal scale in the general case; the results also indicate that the DCT is an appropriate transform for decorrelating both music and speech spectra.
Abstract: We examine in some detail Mel Frequency Cepstral Coefficients (MFCCs), the dominant features used for speech recognition, and investigate their applicability to modeling music. In particular, we examine two of the main assumptions of the process of forming MFCCs: the use of the Mel frequency scale to model the spectra; and the use of the Discrete Cosine Transform (DCT) to decorrelate the Mel-spectral vectors. We examine the first assumption in the context of speech/music discrimination. Our results show that the use of the Mel scale for modeling music is at least not harmful for this problem, although further experimentation is needed to verify that this is the optimal scale in the general case. We investigate the second assumption by examining the basis vectors of the theoretically optimal transform to decorrelate music and speech spectral vectors. Our results demonstrate that the use of the DCT to decorrelate vectors is appropriate for both speech and music spectra.

MFCCs for Music Analysis

Of all the human-generated sounds which influence our lives, speech and music are arguably the most prolific. Speech has received much focused attention, and decades of research in this community have led to usable systems and convergence of the features used for speech analysis. In the music community, however, although the field of synthesis is very mature, a dominant paradigm has yet to emerge to solve other problems such as music classification or transcription. Consequently, many representations for music have been proposed (e.g. (Martin1998), (Scheirer1997), (Blum1999)). In this paper, we examine some of the assumptions behind Mel Frequency Cepstral Coefficients (MFCCs), the dominant features used for speech recognition, and examine whether these assumptions are valid for modeling music. MFCCs have been used by other authors to model music and audio sounds (e.g. (Blum1999)).
These works, however, use cepstral features merely because they have been so successful for speech recognition, without examining the assumptions made in great detail. MFCCs (e.g. see (Rabiner1993)) are short-term spectral features. They are calculated as follows (the steps and assumptions made are explained in more detail in the full paper):
1. Divide the signal into frames.
2. For each frame, obtain the amplitude spectrum.
3. Take the logarithm.
4. Convert to the Mel (a perceptually based) spectrum.
5. Take the discrete cosine transform (DCT).
We seek to determine whether this process is suitable for creating features to model music. We examine only steps 4 and 5 since, as explained in the full paper, the other steps are less controversial. Step 4 calculates the log amplitude spectrum on the so-called Mel scale. This transformation emphasizes lower frequencies, which are perceptually more meaningful for speech. It is possible, however, that the Mel scale may not be optimal for music, as there may be more information in, say, higher frequencies. Step 5 takes the DCT of the Mel spectra. For speech, this approximates principal components analysis (PCA), which decorrelates the components of the feature vectors. We investigate whether this transform is valid for music spectra.

Mel vs Linear Spectral Modeling

To investigate the effect of using the Mel scale, we examine the performance of a simple speech/music discriminator. We use around 3 hours of labeled data from a broadcast news show, divided into 2 hours of training data and 40 minutes of testing data. We convert the data to ‘Mel’ and ‘Linear’ cepstral features and train mixture-of-Gaussians classifiers for each class. We then classify each segment in the test data using these models. This process is described in more detail in the full paper.
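The five steps above can be sketched in a few lines of NumPy. This is a minimal, hypothetical implementation (the window length, hop, filter count, and the O'Shaughnessy Mel formula are common defaults, not values from the paper). Note that most implementations apply the Mel filterbank to the amplitude spectrum first and take the logarithm of the filterbank outputs, so steps 3 and 4 appear in swapped order below.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy's formula, the usual Mel-scale approximation
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced equally on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    """Steps 1-5 of the text: frame, amplitude spectrum, Mel warp,
    logarithm, DCT (log and Mel warp appear in the common order here)."""
    fb = mel_filterbank(n_filters, n_fft, sr)
    window = np.hamming(frame_len)
    # DCT-II matrix, keeping only the first n_ceps rows (step 5)
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * j + 1) / (2 * n_filters))
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = np.empty((n_frames, n_ceps))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window   # step 1
        spec = np.abs(np.fft.rfft(frame, n_fft))                 # step 2
        mel_spec = fb @ spec                                     # step 4
        log_mel = np.log(np.maximum(mel_spec, 1e-10))            # step 3
        feats[t] = dct @ log_mel
    return feats
```

Truncating the DCT output to the first few coefficients keeps the smooth spectral envelope and discards fine detail, which is why only n_ceps coefficients are retained.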
We find that for this speech/music classification problem, the results are (statistically) significantly better if Mel-based cepstral features rather than linear-based cepstral features are used. However, whether this is simply because the Mel scale models speech better or because it also models music better is not clear. At worst, we can conclude that using the Mel cepstrum to model music in this speech/music discrimination problem is not harmful. Further tests are needed to verify that the Mel cepstrum is appropriate for modeling music in the general case.

Using the DCT to Approximate Principal Components Analysis

We additionally investigate the effectiveness of using the DCT to decorrelate Mel spectral features. The mathematically correct way to decorrelate components is to use PCA (or, equivalently, the KL transform). This transform uses the eigenvectors of the covariance matrix of the data to be modeled as basis vectors. By investigating how closely these vectors approximate cosine functions, we can get a feel for how well the DCT approximates PCA. By inspecting the eigenvectors for the Mel log spectra of around 3 hours of speech and 4 hours of music, we see that the DCT is an appropriate transform for decorrelating music (and speech) log spectra.

Future Work

Future work should focus on a more thorough examination of the parameters used to generate MFCC features, such as the sampling rate of the signal, the frequency scaling (Mel or otherwise) and the number of bins to use when smoothing. Also worthy of investigation are the windowing size and frame rate.

Suggested Readings

Blum, T., Keislar, D., Wheaton, J. and Wold, E., 1999, Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information, U.S. Patent 5,918,223. Martin, K., 1998, Toward automatic sound source recognition: identifying musical instruments, Proceedings NATO Computational Hearing Advanced Study Institute. Rabiner, L.
and Juang, B., 1993, Fundamentals of Speech Recognition, Prentice-Hall. Scheirer, E. and Slaney, M., 1997, Construction and evaluation of a robust multifeature speech/music discriminator, Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing.
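The eigenvector inspection described above can be illustrated numerically. The sketch below uses hypothetical data, not the paper's speech/music corpora: the covariance of a first-order Markov process is a standard stand-in for smoothly correlated log-spectral vectors, and we check how closely the leading PCA eigenvectors match DCT basis vectors.

```python
import numpy as np

# Toeplitz covariance of a first-order Markov process (rho = 0.9): a
# standard stand-in for the strong neighbour correlation of log spectra.
n = 20
rho = 0.9
cov = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# PCA (KL transform) basis: eigenvectors of the covariance matrix,
# ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(cov)
pca_basis = eigvecs[:, np.argsort(eigvals)[::-1]]

# Orthonormal DCT-II basis vectors over the same dimensionality.
k = np.arange(n)[:, None]
j = np.arange(n)[None, :]
dct_basis = (np.cos(np.pi * k * (2 * j + 1) / (2 * n)) * np.sqrt(2.0 / n)).T
dct_basis[:, 0] /= np.sqrt(2.0)

# |cosine similarity| between matched basis vectors: values close to 1
# mean the DCT is a good approximation to the optimal decorrelator.
sims = [abs(pca_basis[:, i] @ dct_basis[:, i]) for i in range(4)]
```

For correlation coefficients near 1, the DCT is known to approach the KLT of a first-order Markov process, which is the standard theoretical argument behind step 5 of the MFCC pipeline.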

1,189 citations


Journal ArticleDOI
TL;DR: In this paper, results for a gear pair affected by a fatigue crack are compared with those obtained by means of the well-accepted cepstrum analysis and time-synchronous average analysis.

330 citations


Journal ArticleDOI
01 Aug 2000
TL;DR: This work inserts a digital watermark into the cepstral components of the audio signal using a technique analogous to spread spectrum communications, hiding a narrow band signal in a wideband channel.
Abstract: We propose a digital audio watermarking technique in the cepstrum domain. We insert a digital watermark into the cepstral components of the audio signal using a technique analogous to spread spectrum communications, hiding a narrow band signal in a wideband channel. In our method, we use pseudo-random sequences to watermark the audio signal. The watermark is then weighted in the cepstrum domain according to the distribution of cepstral coefficients and the frequency masking characteristics of the human auditory system. Watermark embedding minimizes the audibility of the watermark signal. The embedded watermark is robust to multiple watermarks, MPEG audio coding and additive noise.
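A minimal sketch of the embedding and detection idea, assuming a single frame, a simple real cepstrum, and a fixed embedding strength. The paper instead shapes the weight by the cepstral distribution and auditory masking; the PN length, quefrency band, and alpha below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 2048
# Hypothetical host frame: low-pass filtered noise standing in for audio.
audio = np.convolve(rng.standard_normal(n), np.ones(8) / 8, mode="same")

spec = np.fft.rfft(audio)
c = np.fft.irfft(np.log(np.abs(spec) + 1e-12))    # real cepstrum

# Spread-spectrum embedding: add a +/-1 pseudo-random key to a band of
# cepstral coefficients (both symmetric halves, keeping the cepstrum even).
idx = np.arange(20, 148)
pn = rng.choice([-1.0, 1.0], size=idx.size)
alpha = 0.05                                      # fixed strength (illustrative)
c_m = c.copy()
c_m[idx] += alpha * pn
c_m[n - idx] += alpha * pn

# Resynthesize: new magnitude from the modified cepstrum, original phase.
mag = np.exp(np.fft.rfft(c_m).real)
marked = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n)

# Detection: correlate the received cepstrum with the key. The narrowband
# watermark hidden in the wideband host shows up as a large correlation.
c_rx = np.fft.irfft(np.log(np.abs(np.fft.rfft(marked)) + 1e-12))
score_key = c_rx[idx] @ pn
score_wrong = c_rx[idx] @ rng.choice([-1.0, 1.0], size=idx.size)
```

The correct key yields a correlation score near alpha times the sequence length, while a wrong key correlates only with the host cepstrum, which is the spread-spectrum detection principle the abstract refers to.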

143 citations


Proceedings ArticleDOI
05 Jun 2000
TL;DR: New speech features, obtained by applying independent component analysis to human speech, are proposed; the learned basis functions resemble Gabor filters, and the features give much better recognition rates than conventional mel-frequency cepstral features.
Abstract: In this paper, we propose new speech features obtained by applying independent component analysis to human speech. When independent component analysis is applied to speech signals for efficient encoding, the adapted basis functions resemble Gabor-like features. The trained basis functions have some redundancies, so we select a subset of them by a reordering method. The basis functions are almost ordered from the low-frequency basis vector to the high-frequency basis vector, which is consistent with the fact that human speech signals carry much more information in the low-frequency range. These features can be used in automatic speech recognition systems, and the proposed method gives much better recognition rates than conventional mel-frequency cepstral features.

117 citations


Proceedings ArticleDOI
Xin Li1, H.H. Yu
30 Jul 2000
TL;DR: The experimental results show that the novel audio data hiding scheme in the cepstrum domain can achieve transparent and robust data hiding at capacities above 20 bps.
Abstract: We propose a novel data hiding scheme for audio signals in the cepstrum domain. The cepstrum representation of audio can be shown to be very robust to a wide range of attacks, including the most challenging time-scale and pitch-shift warping. In the cepstrum domain, we propose to embed data by manipulating the statistical mean of selected cepstrum coefficients. An intuitive psychoacoustic model is employed to control the audibility of the introduced distortion. Our experimental results show that the novel audio data hiding scheme in the cepstrum domain can achieve transparent and robust data hiding at capacities above 20 bps.
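A toy sketch of embedding by statistical-mean manipulation, assuming one bit per frame and a fixed mean shift. The paper's psychoacoustic control of the distortion is omitted, and the frame size, quefrency band, and delta below are illustrative choices.

```python
import numpy as np

def frame_cepstrum(x):
    """Real cepstrum of one frame."""
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12))

def embed_bit(frame, bit, idx, delta=0.05):
    """Shift the statistical mean of selected cepstrum coefficients to
    +delta (bit 1) or -delta (bit 0), then resynthesize the frame."""
    spec = np.fft.rfft(frame)
    c = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
    target = delta if bit else -delta
    shift = target - c[idx].mean()
    c[idx] += shift
    c[len(frame) - idx] += shift          # keep the cepstrum even-symmetric
    mag = np.exp(np.fft.rfft(c).real)     # new magnitude, original phase
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), len(frame))

def extract_bit(frame, idx):
    """Read the bit back from the sign of the band's cepstral mean."""
    return int(frame_cepstrum(frame)[idx].mean() > 0)
```

Because the detector only needs the sign of a mean, the scheme survives perturbations that move individual coefficients but not the band average, which is the robustness argument in the abstract.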

88 citations


Book
01 Jan 2000
TL;DR: Auditory processing of speech, perceptual coding considerations, and research in perceptual speech coding, with an appendix of related Internet sites.
Abstract: INTRODUCTION
SPEECH PRODUCTION: The Speech Chain; Articulation; Source-Filter Model
SPEECH ANALYSIS TECHNIQUES: Sampling and the Speech Waveform; Systems and Filtering; z Transform; Fourier Transform; Discrete Fourier Transform; Windowing Signal Segments
LINEAR PREDICTION VOCAL TRACT MODELING: Sound Propagation in the Vocal Tract; Estimation of LP Parameters; Transformations of LP Parameters for Quantization; Examples of LP Modeling
PITCH EXTRACTION: Autocorrelation Pitch Extraction; Cepstral Pitch Extraction; Frequency-Domain Error Minimization; Pitch Tracking
AUDITORY INFORMATION PROCESSING: The Basilar Membrane: A Spectrum Analyzer; Critical Bands; Thresholds of Audibility and Detectability; Monaural Masking
QUANTIZATION AND WAVEFORM CODERS: Uniform Quantization; Nonlinear Quantization; Adaptive Quantization; Vector Quantization
QUALITY EVALUATION: Objective Measures; Subjective Measures; Perceptual Objective Measures
VOICE CODING CONCEPTS: Channel Vocoder; Formant Vocoders; The Sinusoidal Speech Coder; Linear Prediction Vocoder
LINEAR PREDICTION ANALYSIS BY SYNTHESIS: Analysis by Synthesis; Estimation of Excitation; Multi-Pulse Linear Prediction Coder; Regular Pulse Excited LP Coder; Code Excited Linear Prediction Coder
MIXED EXCITATION CODING: Multi-Band Excitation Vocoder; Mixed Excitation Linear Prediction Coder; Split Band LPC Coder; Harmonic Vector Excitation Coder; Waveform Interpolation Coding
PERCEPTUAL SPEECH CODING: Auditory Processing of Speech; Perceptual Coding Considerations; Research in Perceptual Speech Coding
APPENDIX: RELATED INTERNET SITES

83 citations


Journal ArticleDOI
TL;DR: A robust noniterative algorithm to design optimal minimum-phase digital FIR filters with real or complex coefficients is presented and the minimum fast Fourier transform length is derived for computing the DHT to achieve a desired coefficient accuracy.
Abstract: We present a robust noniterative algorithm to design optimal minimum-phase digital FIR filters with real or complex coefficients. We derive: (1) the discrete Hilbert transform (DHT) of the complex cepstrum of a causal complex minimum-phase sequence and (2) the minimum fast Fourier transform length for computing the DHT to achieve a desired coefficient accuracy.
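The paper's algorithm hinges on the cepstral route to a minimum-phase spectrum. The sketch below is the textbook homomorphic (FFT-based) version of that idea, not the authors' DHT formulation: fold the real cepstrum of the log magnitude onto its causal part and exponentiate. The FFT length controls the coefficient accuracy, which is exactly the trade-off the paper quantifies.

```python
import numpy as np

def minimum_phase(h, nfft=4096):
    """Minimum-phase sequence with the same magnitude response as h,
    via the real cepstrum of log|H| (homomorphic method)."""
    H = np.abs(np.fft.fft(h, nfft))
    c = np.fft.ifft(np.log(np.maximum(H, 1e-12))).real  # real cepstrum
    # Fold onto the causal part: double positive quefrencies, keep
    # quefrency 0 and nfft/2 as-is, zero the anticausal half.
    w = np.zeros(nfft)
    w[0] = 1.0
    w[1:nfft // 2] = 2.0
    w[nfft // 2] = 1.0
    c_min = c * w                                       # complex cepstrum of h_min
    h_min = np.fft.ifft(np.exp(np.fft.fft(c_min))).real
    return h_min[:len(h)]
```

For example, h = [1, 2.5, 1] has zeros at -2 and -0.5; its minimum-phase equivalent reflects the outside zero into the unit circle, giving [2, 2, 0.5] with an identical magnitude response.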

57 citations


Proceedings ArticleDOI
23 Jul 2000
TL;DR: The implementation of a system for automatic detection of laryngeal pathologies using acoustic analysis of speech in the frequency domain is described; the results suggest that this approach is promising as a support tool for the diagnosis of pathologies of the vocal system.
Abstract: It is well known that most laryngeal diseases and vocal fold pathologies cause significant changes in speech. Different clinical procedures for laryngeal examination exist, all of them invasive in nature. In the evaluation of speech quality, acoustic analysis of normal and pathological voices has become increasingly interesting to researchers in laryngology and speech pathology because of its non-intrusive nature and its potential for providing quantitative data within a reasonable analysis time. In this article, the implementation of a system for automatic detection of laryngeal pathologies using acoustic analysis of speech in the frequency domain is described. Different speech signal processing techniques are applied: cepstrum, mel-cepstrum, delta cepstrum, delta mel-cepstrum, and FFT. The resulting data are fed to neural networks, which classify the voice patterns. Two types of neural network were examined: a system trained to distinguish between normal and pathological voices (regardless of the pathology), and a more complex system trained to classify normal, bicyclic and rough voices. High recognition rates are obtained, with cepstral analysis being the processing technique that achieves the best performance. This indicates that this type of analysis provides a characterization of the voice in pathological conditions in a direct and noninvasive way. The results suggest that this approach is promising as a support tool for the diagnosis of pathologies of the vocal system.

51 citations


Journal ArticleDOI
TL;DR: Experimental results indicate that the proposed feature estimation technique leads to convergent identification of channel and noise and significantly improved recognition accuracy for speaker-independent continuous speech.
Abstract: A feature estimation technique is proposed for speech signals that are degraded by both additive and convolutive noises. An EM algorithm is formulated in the frequency-domain for identification of the magnitude response of the distortion channel and power spectrum of additive noise, and posterior estimates of short-time power spectra of speech are obtained based on the identified channel and noise. The estimated posterior power spectra are used to calculate perceptually-based linear prediction cepstral coefficients, and the estimated cepstral features and their temporal regression coefficients are used for automatic speech recognition using acoustic models trained from clean speech. Experiments were performed on speaker independent continuous speech recognition, where the speech data were taken from the TIMIT database and were degraded by a distortion channel and simulated additive noises with white or colored spectral characteristics at various SNR levels. Experimental results indicate that the proposed technique leads to convergent identification of channel and noise and significantly improved recognition accuracy for speaker-independent continuous speech.

41 citations


Journal ArticleDOI
TL;DR: The GA-based M-TDC (modified TDC) method is proposed to improve the representativeness and robustness of the selected TDC coefficients in noisy environments, and has better recognition results than the original TDC approach in noisy environments.
Abstract: Among various kinds of speech features, the two-dimensional (2-D) cepstrum (TDC) is a special one, which can simultaneously represent several types of information contained in the speech waveform: static and dynamic features, as well as global and fine frequency structures. Analysis results show that the coefficients located in the lower-index portion of the TDC matrix seem to be more significant than others. Hence, to represent an utterance, only some TDC coefficients need to be selected to form a feature vector instead of a sequence of feature vectors. This has the advantages of simple computation and less storage space. However, our experiments show that the selection of TDC coefficients is quite sensitive to background noise. In order to solve this problem, we propose the GA-based M-TDC (modified TDC) method in this paper to improve the representativeness and robustness of the selected TDC coefficients in noisy environments. The M-TDC differs from the standard TDC by the use of filters to remove the noise components. Furthermore, in the GA-based M-TDC method, we apply genetic algorithms (GAs) to find the robust coefficients in the M-TDC matrix. From the experiments with five noise types, we find that the GA-based M-TDC method has better recognition results than the original TDC approach in noisy environments.

38 citations


Patent
28 Dec 2000
TL;DR: In this paper, a harmonic-noise speech coder is presented that comprises a noise spectral estimating means for coding the noise component by predicting its spectrum via LPC analysis after separating the noise, i.e. the unvoiced component, from the input LPC residual signal using the cepstrum.
Abstract: The present invention relates to a harmonic-noise speech coder and a coding algorithm for the mixed voiced/unvoiced signal using a harmonic model. The harmonic-noise speech coder comprises a noise spectral estimating means for coding the noise component by predicting its spectrum via LPC analysis after separating the noise, which is the unvoiced component, from the input LPC residual signal using the cepstrum. Improved speech quality is obtained by analyzing the noise effectively, using the noise spectral model predicted through cepstrum-LPC analysis of the mixed voiced/unvoiced signal in addition to the existing harmonic model, and then coding the signal.

Proceedings ArticleDOI
Hong Kook Kim1, R. Cox
05 Jun 2000
TL;DR: From speaker-independent connected digit HMM recognition, it is found that the speech recognition system employing the proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals.
Abstract: In this paper, we propose a feature extraction method for a speech recognizer that operates in digital communication networks. The feature parameters are basically extracted by converting the quantized spectral information of a speech coder into a cepstrum. We also combine the voiced/unvoiced information obtained from the bitstream of the speech coder into the recognition feature set. From speaker-independent connected digit HMM recognition, we find that the speech recognition system employing the proposed bitstream-based front-end gives superior word and string accuracies over a recognizer constructed from decoded speech signals. Its performance is comparable to that of the wireline recognition system that uses only the cepstrum as a feature set.

Journal ArticleDOI
TL;DR: The spectral estimation problem of a stationary autoregressive moving average (ARMA) process is considered, and a new method for the estimation of the MA part is proposed that requires neither any initial estimates nor fitting of a large order AR model.
Abstract: In this letter, the spectral estimation problem of a stationary autoregressive moving average (ARMA) process is considered, and a new method for the estimation of the MA part is proposed. A simple recursion relating the ARMA parameters and the cepstral coefficients of an ARMA process is derived and utilized for the estimation of the MA parameters. The method requires neither any initial estimates nor fitting of a large order AR model, both of which require further a priori knowledge of the signal and increase the computational complexity. Simulation results illustrating the performance of the new method are also given.
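The letter's recursion builds on the classical pole-zero form of the cepstrum of a minimum-phase ARMA model: c_n = (Σ_i p_i^n − Σ_j z_j^n)/n for n ≥ 1, where p_i are the AR roots and z_j the MA roots. The sketch below checks that closed form numerically on a hypothetical ARMA(2,2) model (not the letter's simulation setup):

```python
import numpy as np

# Hypothetical minimum-phase ARMA(2,2) model H(z) = B(z)/A(z)
zeros = np.array([0.5, -0.4])          # roots of the MA polynomial B(z)
poles = np.array([0.8, -0.3])          # roots of the AR polynomial A(z)
b = np.poly(zeros)                     # B(z) coefficients
a = np.poly(poles)                     # A(z) coefficients

# Cepstral coefficients numerically: inverse FFT of log H on the unit circle
nfft = 4096
H = np.fft.fft(b, nfft) / np.fft.fft(a, nfft)
log_H = np.log(np.abs(H)) + 1j * np.unwrap(np.angle(H))
c_num = np.fft.ifft(log_H).real

# Closed form underlying the recursion: c_n = (sum_i p_i^n - sum_j z_j^n) / n
n = np.arange(1, 16)
c_ref = (np.sum(poles[None, :] ** n[:, None], axis=1)
         - np.sum(zeros[None, :] ** n[:, None], axis=1)) / n
```

Because the AR contribution to c_n is easy to remove once the AR part is known, matching cepstral coefficients against this form isolates the MA parameters without initial estimates, which is the spirit of the letter's method.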

Proceedings ArticleDOI
23 Jul 2000
TL;DR: The present work uses delta cepstral coefficients on the Mel scale, together with wavelet and wavelet packet transforms, to feed a neural-network-based automatic speaker identification system, which provided excellent results compared with those reported in the literature using other methods.
Abstract: The present work uses delta cepstral coefficients on the Mel scale, together with wavelet and wavelet packet transforms, to feed a system for automatic speaker identification based on neural networks. Different alternatives are tested for the neural-net-based classifier, achieving very good performance for closed groups of speakers in a text-independent form. When a single neural net is used for all the speakers, the results decay abruptly as the number of speakers to identify increases. This leads to a system with one neural net for each speaker, which provided excellent results compared with those reported in the literature using other methods. This classifier structure has other advantages; for example, adding a new speaker to the system only requires training a net for the speaker in question, in contrast with a system where the classifier is a single large net, which in general must be completely retrained.

Journal ArticleDOI
TL;DR: An artificial neural network (ANN) based helicopter identification system is proposed and linear prediction, reflection coefficients, cepstrum, and line spectral frequencies (LSF) are compared in terms of recognition accuracy and robustness against additive noise.
Abstract: An artificial neural network (ANN) based helicopter identification system is proposed. The feature vectors are based on both the tonal and the broadband spectrum of the helicopter signal. ANN pattern classifiers are trained using various parametric spectral representation techniques. Specifically, linear prediction, reflection coefficients, cepstrum, and line spectral frequencies (LSF) are compared in terms of recognition accuracy and robustness against additive noise. Finally, an 8-helicopter ANN classifier is evaluated. It is also shown that the classifier performance is dramatically improved if it is trained using both clean data and data corrupted with additive noise.

01 Jan 2000
TL;DR: Some revealing aspects of human auditory perception are considered and the mel-scaled cepstrum algorithm is examined in order to draw some conclusions.
Abstract: The mel-scaled cepstrum is a signal representation scheme used in the analysis of speech signals. Due to its reported superior performance, especially under adverse conditions, it is becoming an increasingly popular choice as a feature extraction front end for spoken language systems. Having evolved over a period of more than fifty years, the mel-scaled cepstrum owes part of its heritage to the pattern recognition community and part to perceptual and acoustical research. It represents a good trade-off between computational efficiency and perceptual considerations. Unfortunately, maybe because of its hybrid nature, the literature tends to be vague on the implementation details of mel-scaled cepstrum algorithms. In this paper we clarify some of the issues regarding the algorithm and its implementation. Our investigation also serves to expose some fundamental flaws remaining in the established approach to speech signal feature extraction.

I. Introduction

The pre-processing and feature extraction stages of a pattern recognition system serve as an interface between the real world and a classifier operating on an idealised model of reality. Information that is discarded in this stage is forever lost; conversely, noise that is accepted will degrade the performance of the classifier stage, which is typically sensitive to complexity in the data. The signals that spoken language systems have to deal with are unique in the sense that they are generated by a biological system, for a biological system. Human speech is the evolutionary product of the vocal and auditory systems, and not the other way around. The result shows a distinct lack of engineering common sense. As a matter of fact, psychophysical studies over the last number of decades tend to leave us with the uncomfortable feeling that the world perceived through our senses is rather different from the one that we measure with our instruments.
We will now consider some revealing aspects of human auditory perception and then examine the mel-scaled cepstrum algorithm in order to draw some conclusions.

PatentDOI
Yifan Gong1
TL;DR: In this paper, an estimate of a clean speech vector, typically a Mel-Frequency Cepstral Coefficient (MFCC) vector, given its noisy observation is provided, making use of two Gaussian mixtures.
Abstract: An estimate of a clean speech vector, typically a Mel-Frequency Cepstral Coefficient (MFCC) vector, given its noisy observation is provided. The method makes use of two Gaussian mixtures. The first is trained on clean speech, and the second is derived from the first using noise samples. The method gives an estimate of a clean speech feature vector as the conditional expectation of clean speech given an observed noisy vector.
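The conditional-expectation estimate can be sketched in one dimension. Everything numeric below is hypothetical (the means, variances, and the simple additive-noise derivation of the second mixture); the point is just the form E[x|y] = Σ_k P(k|y) E[x|y, k].

```python
import numpy as np

# Hypothetical 1-D stand-in for a single MFCC dimension.
# Clean-speech Gaussian mixture (as if trained on clean speech):
mu_x = np.array([-2.0, 0.0, 3.0])
var_x = np.array([0.5, 0.8, 0.6])
w = np.array([0.3, 0.4, 0.3])

# Noisy mixture derived from the clean one using noise samples: here
# simply the clean means shifted and the variances inflated by the noise.
noise_mean, noise_var = 1.0, 0.4
mu_y = mu_x + noise_mean
var_y = var_x + noise_var

def estimate_clean(y):
    """MMSE estimate E[x|y] = sum_k P(k|y) * E[x|y, k], where each
    per-component estimate shrinks (y - mu_y_k) by the clean-to-noisy
    variance ratio (standard Gaussian conditioning)."""
    lik = w * np.exp(-0.5 * (y - mu_y) ** 2 / var_y) / np.sqrt(2 * np.pi * var_y)
    post = lik / lik.sum()                     # component posteriors P(k|y)
    cond = mu_x + (var_x / var_y) * (y - mu_y) # per-component E[x|y, k]
    return float(post @ cond)
```

Observations that fall near a noisy-mixture component are pulled toward the corresponding clean-mixture mean, which is the behaviour the abstract describes.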

Proceedings Article
01 Jan 2000
TL;DR: A research work for a better understanding of the effects of room acoustics on speech features, comparing simultaneous recordings of close-talking and distant-talking speech utterances; a Vector Quantization based model is used to study the influence of the variation on the feature vector distribution.
Abstract: Automatic speech recognition systems attain high performance for close-talking applications, but they deteriorate significantly in distant-talking environments. The reason is the mismatch between training and testing conditions. We have carried out a research work for a better understanding of the effects of room acoustics on speech features by comparing simultaneous recordings of close-talking and distant-talking speech utterances. The characteristics of two degrading sources, background noise and room reverberation, are discussed. Their impacts on the spectrum are different: the noise affects the valleys of the spectrum, while the reverberation causes distortion at the peaks at the pitch frequency and its multiples. In the situation of very few training data, we attempt to choose efficient compensation approaches in the spectrum, spectrum subband or cepstrum domain. A Vector Quantization based model is used to study the influence of the variation on the feature vector distribution. The results of speaker identification experiments are presented for both close-talking and distant-talking data.

Journal ArticleDOI
TL;DR: Performance results from ultrasonic phantom experiments and Monte Carlo simulations for detecting and estimating duct wall spacings on the order of those typically found in breast tissue using methods based on the generalized spectrum (GS) and cepstrum are presented.

Journal ArticleDOI
TL;DR: In this article, a pitch extraction method is proposed for real-time sequential speech processing: autocorrelation functions of the input speech are calculated at each analysis point using multiple analysis-window lengths, the largest peaks of each autocorrelation function are detected within appropriate ranges, and the optimum pitch period is then selected by weighting the candidate pitch periods by the number of windows.
Abstract: A high-performance method for pitch extraction is proposed for the purposes of real-time sequential speech processing that can be used in such applications as speech rate conversion systems. According to this method, autocorrelation functions of the input speech waveforms are calculated for one analyzed point in time using multiple lengths of the analysis windows, and the largest peaks of each autocorrelation function are detected within the appropriate ranges, after which the optimum pitch period is selected by weighting the candidates of the pitch period obtained by the number of windows. Such selection processing is carried out independently for each analyzed point without using such characteristics as the continuity of the fundamental frequencies of the entire speech segment. This method was applied to analysis of a large number of speech materials, including recordings made by different speakers and speech samples mixed with noise. The tests have demonstrated that the proposed method offers pitch extraction performance superior to that of the cepstrum pitch determination method and the LPC residual autocorrelation method within a wide range of fundamental frequencies and power levels. © 1999 Scripta Technica, Electron Comm Jpn Pt 3, 83(2): 67–79, 2000
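The multi-window voting scheme can be sketched as follows; the window lengths, search range, and the ±2-sample agreement tolerance are illustrative choices, not the paper's parameters.

```python
import numpy as np

def pitch_multiwindow(x, sr, window_lens=(400, 600, 800),
                      fmin=60.0, fmax=400.0):
    """Pick the pitch at one analysis point by voting across
    autocorrelation functions computed with several window lengths."""
    lag_min = int(sr / fmax)
    lag_max = int(sr / fmin)
    candidates = []
    for wl in window_lens:
        seg = x[:wl]
        ac = np.correlate(seg, seg, mode="full")[wl - 1:]  # lags 0..wl-1
        ac = ac / ac[0]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
        candidates.append(lag)
    # weight each candidate by how many windows voted for (nearly) the
    # same lag, then keep the best-supported one
    best = max(candidates,
               key=lambda l: sum(abs(l - m) <= 2 for m in candidates))
    return sr / best
```

Because each analysis point is decided independently, no pitch-continuity constraint across the utterance is needed, matching the sequential-processing requirement stated in the abstract.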

Patent
Hong Heather Yu1, Xin Li1
10 Feb 2000
TL;DR: In this paper, an audio signal is received in a base domain and then transformed into a non-base domain, such as cepstrum domain or LP residue domain, to embed hidden data.
Abstract: A computer-implemented method and apparatus for embedding hidden data in an audio signal. An audio signal is received in a base domain and then transformed into a non-base domain, such as the cepstrum domain or the LP residue domain. Statistical mean manipulation is employed on selected transform coefficients to embed hidden data. The introduced distortion is controlled by a psychoacoustic model to ensure the imperceptibility of the embedded hidden data. Scrambling techniques can be plugged in to further increase the security of the data hiding system. The new audio data hiding scheme provides transparent audio quality, sufficient embedding capacity, and high survivability over a wide range of common signal processing attacks.

PatentDOI
TL;DR: In this article, a method of model adaptation for noisy speech recognition is proposed, which determines the cepstral mean vector and covariance matrix of adapted noisy speech from the CSPV and the covariance matrices of speech and noise.
Abstract: A method of model adaptation for noisy speech recognition determines the cepstral mean vector and covariance matrix of adapted noisy speech from the cepstral mean vectors and covariance matrices of speech and noise. The cepstral mean vectors of noise and speech are first transferred into the linear spectral domain, respectively. The linear spectral mean vectors of noise and speech are then combined to obtain a linear spectral mean vector of noisy speech. Next, the linear spectral mean vector of noisy speech is transferred from the linear spectral domain into the cepstral domain, so as to determine the cepstral mean vector of adapted noisy speech. Further, the cepstral covariance matrices of speech and noise are multiplied by a first and a second scaling factor, respectively, and the multiplied cepstral covariance matrices are combined together, so as to determine the cepstral covariance matrix of adapted noisy speech.
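The mean-vector combination can be sketched with a truncated DCT as the cepstral transform. This is a simplified, hypothetical log-add sketch of the cepstral → linear-spectral → cepstral round trip (the channel/cepstrum sizes and the gain g are illustrative, and the covariance combination with scaling factors is omitted).

```python
import numpy as np

n_chan, n_ceps = 24, 13

# Truncated DCT-II as the cepstral transform (illustrative sizes)
k = np.arange(n_ceps)[:, None]
j = np.arange(n_chan)[None, :]
C = np.cos(np.pi * k * (2 * j + 1) / (2 * n_chan))
C_inv = np.linalg.pinv(C)   # pseudo-inverse maps cepstra back to log spectra

def adapt_mean(mu_speech_cep, mu_noise_cep, g=1.0):
    """Cepstral mean of adapted noisy speech: transfer both means to the
    linear spectral domain, combine them there, and transfer back."""
    lin_speech = np.exp(C_inv @ mu_speech_cep)   # linear spectral mean, speech
    lin_noise = np.exp(C_inv @ mu_noise_cep)     # linear spectral mean, noise
    lin_noisy = lin_speech + g * lin_noise       # combine in the linear domain
    return C @ np.log(lin_noisy)                 # back to the cepstral domain
```

When the noise mean is negligible the adapted mean reduces to the speech mean, and for dominant noise it approaches the noise mean, which is the expected limiting behaviour of the combination.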

Proceedings ArticleDOI
17 Mar 2000
TL;DR: High-resolution range-bearing estimators have been developed using narrowband signatures and broadband signatures and a cepstrum-based range rate indicator was demonstrated using experimental data displaying multipath propagation (Lloyd mirror).
Abstract: The problem of interest is a rapidly closing object (e.g. a torpedo) emitting broadband and narrowband acoustic signatures. A high negative range rate is indicative of a torpedo threat, and good range-bearing estimates provide good target localization. A linear, towed hydrophone array is used as a receiver, and in many cases the received signals will display high signal-to-noise ratio. It is desirable to detect at ranges greater than the near field range of the array. High-resolution range-bearing estimators have been developed using narrowband signatures and broadband signatures. The range difference to the target across an array causes target signals to display a wave-front curvature. Range estimation is based on wave-front curvature determination using a matched-filtering post beamformer approach. First a target is selected for range estimation in which a subset of beams is chosen for sector inverse beamforming. Curvature pattern replicas over range and bearing are computed and used in a Capon form beamformer. Simulations and experiments with in-water data compare the algorithm performance to a derived benchmark. In addition, a cepstrum-based range rate indicator was demonstrated using experimental data displaying multipath propagation (Lloyd mirror).
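The cepstrum-based indicator rests on a classical property: a delayed, scaled echo multiplies the spectrum by (1 + a e^{-jωd}), and the logarithm turns that product into an additive ripple whose cepstral peak sits at the echo delay d. A minimal sketch with synthetic data follows (the delay, echo strength, and white source are hypothetical, not the sea-trial signals).

```python
import numpy as np

rng = np.random.default_rng(4)

n = 4096
delay = 64                      # hypothetical multipath delay in samples
direct = rng.standard_normal(n)
received = direct.copy()
received[delay:] += 0.6 * direct[:-delay]   # delayed, attenuated reflection

# Real cepstrum of the received signal: the echo shows up as a peak at
# quefrency = delay; tracking that peak over time gives a range (rate)
# indicator as the multipath delay changes with target geometry.
ceps = np.fft.irfft(np.log(np.abs(np.fft.rfft(received)) + 1e-12))
search_from = 10                # skip the low-quefrency envelope region
est_delay = search_from + int(np.argmax(ceps[search_from : n // 2]))
```

In a Lloyd-mirror geometry the surface-reflected path produces exactly this kind of delayed replica, which is why the cepstrum exposes the multipath structure.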

Proceedings ArticleDOI
21 Aug 2000
TL;DR: Text-independent and text-dependent speaker recognition systems suitable for verification and identification (open set and closed set) are presented, and a limited-vocabulary word recognition system is developed using the vowel phonemes of the vocabulary.
Abstract: Speaker recognition systems attempt to recognize a speaker by his/her voice through measurements of the specifically individual characteristics arising in the speaker's voice. Among transformations of LPC parameters, the adaptive component weighted (ACW) cepstrum has been shown to be less susceptible to channel effects than others. Text-independent and text-dependent speaker recognition systems suitable for verification and identification (open set and closed set) are presented. The system is based on locating the vowel phonemes of the test utterance. A preprocessing step is applied to the speech signal. The centers of the vowel phonemes are located and identified as speech events using a three-step vowel phoneme locating process. The steps of the locating process are: (1) average magnitude function calculation; (2) vowel phoneme candidate location; and (3) ripple rejection. For each vowel phoneme (20 ms), 10 ACW cepstrum coefficients are calculated and used as inputs to neural networks, and the outputs are accumulated and averaged. The system hardware requirements are a microphone and a sound card. The system software is written in C++ for Windows. The system was tested with a population of 10 speakers (7 male and 3 female), and the statistics were taken (95.67% for text-dependent verification, 93% for text-dependent identification, 92.2% for text-independent verification and 88.95% for text-independent identification). These tests were done with utterances of one word having one vowel phoneme (20 ms used for recognizing the speaker). A vowel phoneme recognition application is also presented: a limited-vocabulary recognition system is developed using the vowel phonemes in the limited vocabulary. The feature vector calculation is the same as in the speaker recognition system; the only difference is in the neural network training and size (97.5% word recognition).
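Step (1) of the vowel-locating process above, the average magnitude function, can be sketched roughly as follows. The frame length and hop are assumptions (20 ms frames at 8 kHz); the paper's exact settings are not given:

```python
import numpy as np

def average_magnitude(x, frame_len=160, hop=80):
    # short-time average magnitude contour; vowel phonemes, being
    # high-energy voiced segments, appear as broad peaks
    n_frames = 1 + (len(x) - frame_len) // hop
    amf = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_len]
        amf[i] = np.mean(np.abs(frame))
    return amf

# silence - "vowel" - silence: the contour peaks over the voiced segment
t = np.arange(800)
x = np.concatenate([np.zeros(800),
                    0.9 * np.sin(2 * np.pi * 100 * t / 8000),
                    np.zeros(800)])
```

Candidate vowel centres would then be picked from the peaks of this contour, with the subsequent ripple-rejection step discarding spurious local maxima.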

Proceedings Article
01 Jan 2000
TL;DR: Experiments show that the FBE-MFCC and the frame energy, with their corresponding auto-regressive analysis coefficients, form a better combination, reducing the syllable error rate (SER) by 10.0% across a large speech database compared to the traditional MFCC with its corresponding auto-regressive analysis coefficients.
Abstract: The Mel-Frequency Cepstrum Coefficients (MFCC) are a widely used feature set for automatic speech recognition systems, introduced in 1980 by Davis and Mermelstein [2]. In the traditional implementation, the 0th coefficient is excluded on the grounds that it is somewhat unreliable. In this paper, we analyze this term and find that it can be regarded as a generalized frequency band energy (FBE) and is hence useful, resulting in the FBE-MFCC. We also propose a better analysis of the frame energy, called auto-regressive analysis, which performs better than its 1st- and/or 2nd-order differential derivatives. Experiments show that the FBE-MFCC and the frame energy, with their corresponding auto-regressive analysis coefficients, form a better combination, reducing the syllable error rate (SER) by 10.0% across a large speech database, compared to the traditional MFCC with its corresponding auto-regressive analysis coefficients.
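The "generalized frequency band energy" reading of the 0th coefficient is easy to see: in a DCT-II over log filter-bank energies, the k = 0 basis row is constant, so c0 is simply the sum of the log band energies, i.e. the log of their product. A generic sketch (not the paper's exact front end):

```python
import numpy as np

def cepstra_with_c0(log_fbe, n_cep=13):
    # DCT-II of the log filter-bank energies, keeping the 0th coefficient;
    # the k = 0 row of the basis is all ones, so c0 = sum(log_fbe)
    n_band = log_fbe.shape[-1]
    k = np.arange(n_cep)[:, None]
    m = np.arange(n_band)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_band))
    return log_fbe @ basis.T

rng = np.random.default_rng(3)
log_fbe = rng.standard_normal(24)   # e.g. 24 mel bands, log energies
ceps = cepstra_with_c0(log_fbe)
```

Dropping `ceps[0]` recovers the conventional MFCC vector; keeping it retains the FBE term the paper argues is useful.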

PatentDOI
Tadashi Emori1, Koichi Shinoda1
TL;DR: In this paper, a voice recognition system comprises an analyzer for converting an input voice signal to an input pattern including cepstrum, a memory for storing reference patterns, an elongation/contraction estimating unit for outputting an elongation/contraction parameter in the frequency-axis direction, and a recognizing unit for calculating the distances between the converted input pattern and the reference patterns and outputting the reference pattern corresponding to the shortest distance as the recognition result.
Abstract: A voice recognition system comprises an analyzer for converting an input voice signal to an input pattern including cepstrum, a memory for storing reference patterns, an elongation/contraction estimating unit for outputting an elongation/contraction parameter in the frequency-axis direction by using the input pattern and the reference patterns, and a recognizing unit for calculating the distances between the converted input pattern from the analyzer and the reference patterns and outputting the reference pattern corresponding to the shortest distance as the recognition result. The elongation/contraction unit estimates the elongation/contraction parameter by using the cepstrum included in the input pattern. The elongation/contraction unit does not hold various candidate values in advance for determining the elongation/contraction parameter, nor does it need to execute distance calculations for those various values.
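For contrast, the conventional approach the patent avoids is a grid search: warp the frequency axis of the input spectrum by each candidate factor and keep the one minimizing the distance to the reference. A rough sketch of that baseline (the function names and the simple linear-warping choice are illustrative, not from the patent):

```python
import numpy as np

def warp_log_spectrum(log_spec, alpha):
    # stretch/compress the frequency axis by a factor alpha,
    # interpolating the log spectrum at the warped bin positions
    n = len(log_spec)
    grid = np.arange(n, dtype=float)
    return np.interp(np.clip(grid * alpha, 0, n - 1), grid, log_spec)

def best_alpha(log_spec, ref_log_spec, alphas):
    # exhaustive distance calculation over candidate warping factors --
    # exactly the per-candidate search the patent's direct estimate avoids
    dists = [np.sum((warp_log_spectrum(log_spec, a) - ref_log_spec) ** 2)
             for a in alphas]
    return alphas[int(np.argmin(dists))]
```

The patent's contribution is to estimate the parameter directly from the input cepstrum, eliminating this per-candidate distance loop.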

Proceedings ArticleDOI
05 Jun 2000
TL;DR: Within the context of automatic speech recognition (ASR) applications for telephony, this work investigates the acoustic preprocessing issues that are at stake in going from the fixed line to the cellular network and investigates the relative advantages and drawbacks of conventional mel-frequency cepstral coefficient parameters derived from a non-parametric fast Fourier transform and linear predictive coding spectral estimate.
Abstract: Within the context of automatic speech recognition (ASR) applications for telephony, we investigate the acoustic preprocessing issues that are at stake in going from the fixed line to the cellular network. Because the spectral representation used in enhanced full rate GSM is linear prediction, we investigate the relative advantages and drawbacks of conventional mel-frequency cepstral coefficient (MFCC) parameters derived from a non-parametric fast Fourier transform (FFT) and MFCC parameters derived from a linear predictive coding (LPC) spectral estimate. Robust formant parameters, also derived from an LPC description of the spectrum, are studied as an alternative to MFCCs. Within the framework of connected digit recognition based on hidden Markov models, ASR performance was measured for clean conditions, as well as for three different additive noise conditions. In addition, the performance of a conventional recognition procedure was compared with the performance of an ASR system based on our acoustic backing-off implementation of missing feature theory (MFT).
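The LPC spectral estimate at issue can be computed from the autocorrelation sequence via the Levinson-Durbin recursion; mel filter-bank energies and MFCCs would then be taken from this parametric spectrum instead of the FFT periodogram. A minimal sketch using the standard textbook recursion (not the paper's exact configuration or the GSM codec's analysis):

```python
import numpy as np

def lpc(x, order):
    # autocorrelation method + Levinson-Durbin recursion
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:] @ r[1:i][::-1]) / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        err *= 1.0 - k * k
    return a, err

def lpc_power_spectrum(a, gain, n_fft=512):
    # all-pole spectral estimate gain / |A(e^jw)|^2 -- the parametric
    # alternative to the FFT periodogram for deriving MFCCs
    A = np.fft.rfft(a, n_fft)
    return gain / np.abs(A) ** 2
```

On a synthetic AR(2) signal the recursion recovers the generating coefficients, which is the sanity check usually run on such an implementation.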

Proceedings ArticleDOI
Jacek Ilow1
14 Aug 2000
TL;DR: This paper uses FARIMA processes with non-Gaussian innovations to describe the packet arrival rate per unit time, investigates cepstrum-based approaches for parameter estimation in these processes, and examines a fractional differencing parameter estimation procedure based on the smoothed periodogram and the log spectrum.
Abstract: Traffic measurements in many network environments demonstrate the coexistence of both long- and short-range dependence in traffic traces. In this paper, we use the fractionally integrated autoregressive moving average (FARIMA) processes with non-Gaussian innovations to describe packet arrival rate in unit time. Specifically, we investigate cepstrum-based approaches for parameter estimation in FARIMA processes. We examine the fractional differencing parameter estimation procedure based on the smoothed periodogram and the log spectrum. The simulation results demonstrate that the proposed cepstrum approach gives better estimation accuracy than the conventional least-square spectrum fit. Usefulness of the results presented is demonstrated on real network traffic traces by considering spectral fitting metrics.
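A log-periodogram regression for the fractional differencing parameter d, in the spirit of the Geweke-Porter-Hudak estimator, can be sketched as follows; the paper's smoothed-periodogram variant differs in detail, and the bandwidth choice here is an assumption:

```python
import numpy as np

def gph_estimate(x, m=None):
    # regress the log periodogram on log(4 sin^2(freq/2)) at the lowest
    # m Fourier frequencies; minus the slope estimates d
    n = len(x)
    if m is None:
        m = int(n ** 0.5)
    freqs = 2.0 * np.pi * np.arange(1, m + 1) / n
    I = np.abs(np.fft.fft(x - x.mean())[1 : m + 1]) ** 2 / (2.0 * np.pi * n)
    X = np.log(4.0 * np.sin(freqs / 2.0) ** 2)
    slope = np.polyfit(X, np.log(I), 1)[0]
    return -slope

# synthetic FARIMA(0, d, 0): fractionally integrate white noise by
# truncating the (1 - B)^(-d) moving-average expansion
d_true = 0.3
psi = np.ones(2000)
for k in range(1, len(psi)):
    psi[k] = psi[k - 1] * (k - 1 + d_true) / k
rng = np.random.default_rng(2)
e = rng.standard_normal(4096 + len(psi))
x = np.convolve(e, psi)[len(psi) : len(psi) + 4096]
```

For a pure FARIMA(0, d, 0) series the regression relation is exact in expectation at all frequencies, so a fairly wide bandwidth can be used; short-memory ARMA structure would force m toward the low frequencies.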

Proceedings ArticleDOI
13 May 2000
TL;DR: In this article, a blind deconvolution method based on the cepstrum technique, a quefrency-domain method, is proposed to identify specific damage modes in fiber-reinforced composites.
Abstract: The analysis of acoustic emission signals has been widely applied to damage detection and damage characterization in composites. Features of acoustic emission signals, such as amplitude, frequency, and counts, are usually utilized to identify the type of damage. Recently, time-frequency distribution techniques, such as the wavelet transform and the Choi-Williams distribution, have also been applied to characterize damage. A common feature of these approaches is that the analysis is performed on the acoustic emission signal itself. Nevertheless, this signal is not the wave source signal, as it has been modulated by the signal transfer path. Real information on damage is hidden behind the signal. To reveal direct information on damage, a blind deconvolution method has been developed. It is a quefrency-domain method based on the cepstrum technique. With this method, the acoustic emission signal is demodulated, information on the wave source can be revealed, and thus damage can be identified. This paper presents preliminary test data to assess the validity of the proposed methodology as a means of identifying specific damage modes in fiber-reinforced composites.
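The quefrency-domain separation underlying such a method rests on the cepstrum turning convolution into addition: the cepstrum of the received signal is the sum of the source and transfer-path cepstra, so liftering can pull the two apart. A generic homomorphic sketch (the lifter cutoff and test signals are illustrative, not from the paper):

```python
import numpy as np

def homomorphic_split(x, cutoff):
    # cepstrum of a convolved signal = sum of the component cepstra;
    # a low-quefrency lifter keeps the slowly varying transfer-path
    # log spectrum, leaving the source's fine structure as the residual
    log_spec = np.log(np.abs(np.fft.rfft(x)) + 1e-12)
    c = np.fft.irfft(log_spec)
    lifter = np.zeros(len(c))
    lifter[:cutoff] = 1.0
    lifter[len(c) - cutoff + 1:] = 1.0        # symmetric low quefrencies
    path_log = np.fft.rfft(c * lifter).real   # smooth envelope (path)
    source_log = log_spec - path_log          # fine structure (source)
    return path_log, source_log

# periodic AE-like source convolved with a smooth, decaying path response
imp = np.zeros(1024)
imp[::64] = 1.0
h = np.exp(-np.arange(64) / 8.0)
x = np.convolve(imp, h)[:1024]
```

By construction the two log-spectral parts sum back to the full log spectrum, and the liftered path estimate is much smoother than the raw log spectrum, which is what makes the source structure recoverable.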