Showing papers on "Cepstrum" published in 2003


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new spectral representation of speech signals based on group delay functions is explored; a modified group delay function reduces the effects of zeroes close to the unit circle in the z-domain, which otherwise clutter the spectra, and is converted to cepstral coefficients.
Abstract: We explore a new spectral representation of speech signals through group delay functions. The group delay functions by themselves are noisy and difficult to interpret owing to zeroes that are close to the unit circle in the z-domain and these clutter the spectra. A new modified group delay function (Yegnanarayan, B. and Murthy, H.A., IEEE Trans. Sig. Processing, vol.40, p.2281-9, 1992) that reduces the effects of zeroes close to the unit circle is used. Assuming that this new function is minimum phase, the modified group delay spectrum is converted to a sequence of cepstral coefficients. A preliminary phoneme recogniser is built using features derived from these cepstra. Results are compared with those obtained from features derived from the traditional mel frequency cepstral coefficients (MFCC). The baseline MFCC performance is 34.7%, while that of the best modified group delay cepstrum is 39.2%. The performance of the composite MFCC feature, which includes the derivatives and double derivatives, is 60.7%, while that of the composite modified group delay feature is 57.3%. When these two composite features are combined, ~2% improvement in performance is achieved (62.8%). When this new system is combined with linear frequency cepstra (LFC) (Gadde, V.R.R. et al., The SRI SPINE 2001 Evaluation System. http://elazar.itd.nrl.navy.mil/spine/sri2/presentation/sri2001.html, 2001), the system performance results in another ~0.8% improvement (63.6%).

156 citations
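
To make the group delay entry above more concrete, here is a minimal sketch of the plain (unmodified) group delay function, computed with the standard identity tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, where X is the DFT of x[n] and Y is the DFT of n*x[n]. This is not the modified function from the cited paper, whose details the abstract does not give. The function names, the naive DFT, and the `eps` guard are illustrative assumptions.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (kept dependency-free on purpose)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def group_delay(x, eps=1e-12):
    """Group delay (in samples) at each DFT bin of the real sequence x.

    Uses tau = Re(Y / X) = (X_R*Y_R + X_I*Y_I) / |X|^2 with Y = DFT(n * x[n]).
    The eps floor is an assumption added here to avoid division by zero.
    """
    spectrum = dft(x)
    weighted = dft([t * v for t, v in enumerate(x)])
    delays = []
    for xk, yk in zip(spectrum, weighted):
        power = abs(xk) ** 2
        delays.append((xk.real * yk.real + xk.imag * yk.imag) / max(power, eps))
    return delays


# A pure delay of 3 samples has a flat group delay of 3.
pure_delay = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
flat = group_delay(pure_delay)

# A zero at z = 0.99, very close to the unit circle, makes |X|^2 tiny at DC.
near_zero = [1.0, -0.99, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
spiky = group_delay(near_zero)
```

In the second example the denominator |X|^2 is about 1e-4 at DC, so the group delay there is about -99 samples. Large spikes of this kind are what the abstract describes as clutter from zeroes near the unit circle.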


Proceedings ArticleDOI
20 Jul 2003
TL;DR: This work presents the development of an automatic recognition system of infant cry, with the objective to classify two types of cry: normal and pathological cry from deaf babies, using acoustic characteristics obtained by the mel-frequency cepstrum technique and a feedforward neural network trained with several learning methods.
Abstract: This work presents the development of an automatic recognition system of infant cry, with the objective to classify two types of cry: normal and pathological cry from deaf babies. In this study, we used acoustic characteristics obtained by the mel-frequency cepstrum technique and, as a classifier, a feedforward neural network that was trained with several learning methods, of which the scaled conjugate gradient algorithm gave the best results. Current results are shown, which, at the moment, are very encouraging with an accuracy up to 97.43%.

73 citations


Journal ArticleDOI
Hong Kook Kim, Richard Rose
TL;DR: In this paper, a set of acoustic feature pre-processing techniques for improving automatic speech recognition (ASR) performance on noisy speech recognition tasks is presented; the main contribution is an approach for cepstrum-domain feature compensation in ASR which is motivated by techniques for decomposing speech and noise that were originally developed for noisy speech enhancement.
Abstract: This paper presents a set of acoustic feature pre-processing techniques that are applied to improving automatic speech recognition (ASR) performance on noisy speech recognition tasks. The principal contribution of this paper is an approach for cepstrum-domain feature compensation in ASR which is motivated by techniques for decomposing speech and noise that were originally developed for noisy speech enhancement. This approach is applied in combination with other feature compensation algorithms to compensating ASR features obtained from a mel-filterbank cepstrum coefficient front-end. Performance comparisons are made with respect to the application of the minimum mean squared error log spectral amplitude (MMSE-LSA) estimator based speech enhancement algorithm prior to feature analysis. An experimental study is presented where the feature compensation approaches described in the paper are found to greatly reduce ASR word error rate compared to uncompensated features under environmental and channel mismatched conditions.

70 citations


Journal ArticleDOI
TL;DR: A new probability density function (PDF) projection theorem makes it possible to project probability density functions from a low-dimensional feature space back to the raw data space and by recursive application of the projection theorem, it is possible to analyze complex signal processing chains.
Abstract: We present the theoretical foundation for optimal classification using class-specific features and provide examples of its use. A new probability density function (PDF) projection theorem makes it possible to project probability density functions from a low-dimensional feature space back to the raw data space. An M-ary classifier is constructed by estimating the PDFs of class-specific features, then transforming each PDF back to the raw data space where they can be fairly compared. Although statistical sufficiency is not a requirement, the classifier thus constructed becomes equivalent to the optimal Bayes classifier if the features meet sufficiency requirements individually for each class. This classifier is completely modular and avoids the dimensionality curse associated with large complex problems. By recursive application of the projection theorem, it is possible to analyze complex signal processing chains. We apply the method to feature sets, including linear functions of independent random variables, cepstrum, and Mel cepstrum. In addition, we demonstrate how it is possible to automate the feature and model selection process by direct comparison of log-likelihood values on the common raw data domain.

67 citations


Journal ArticleDOI
TL;DR: A computationally efficient algorithm is proposed for the pulse-spectrum estimation, which can be viewed as a modified version of the classical Donoho's three-step de-noising procedure, and it is shown that this algorithm, developed on the assumption of the "Gaussian" reflectivity function, remains applicable for broader classes of distributions.
Abstract: A different approach to the problem of estimation of the ultrasound pulse spectrum, which usually arises as a part of ultrasound image restoration algorithms, is presented. It is shown that this estimation problem can be reformulated in terms of a de-noising problem. In this formulation, the log-spectrum of a radio-frequency line (RF-line) is viewed as a noisy measurement of the signal that needs to be estimated, i.e., the ultrasound pulse log-spectrum. The log-spectrum of the tissue reflectivity function (i.e., tissue response) is considered as the noise to be rejected. The contribution of the paper is twofold. First, it provides statistical description of the reflectivity function log-spectrum for the case, when the samples of the reflectivity function are independent identically distributed (i.i.d.) Gaussian random variables. Moreover, it is shown that the problem of the pulse spectrum recovery is essentially a de-noising problem. Consequently, it is suggested to solve the problem within the framework of the de-noising by wavelet shrinkage. Second, a computationally efficient algorithm is proposed for the pulse-spectrum estimation, which can be viewed as a modified version of the classical Donoho's three-step de-noising procedure. This modification is necessary, because of specific properties of the noise to be rejected. It is shown, that whenever the samples of the reflectivity function can be assumed to be i.i.d. Gaussian random variables, the samples of its log-spectrum obey the Fisher-Tippet distribution. For this type of noise, straightforward implementation of the standard de-noising can cause serious estimation errors. In order to overcome this difficulty, an outlier-resistant de-noising is performed. The unique properties of this modified de-noising algorithm allow estimating the pulse spectrum adaptively to its properties, as they are continuously influenced by the frequency-dependent attenuation process. 
The performance of the proposed algorithm is examined in a series of computer-simulations. It is shown that this algorithm, developed on the assumption of the "Gaussian" reflectivity function, remains applicable for broader classes of distributions. The results obtained in a series of in vivo experiments reveal superior performance of the novel approach over some of alternative estimation techniques, e.g., cepstrum-based estimation.

62 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A new class of noise robust acoustic features derived from a new measure of autocorrelation, and explicitly exploiting the phase variation of the speech signal frame over time, are introduced, and are expected to be more robust to noise.
Abstract: We introduce a new class of noise robust acoustic features derived from a new measure of autocorrelation, and explicitly exploiting the phase variation of the speech signal frame over time. This family of features, referred to as "phase autocorrelation" (PAC) features, include PAC spectrum and PAC MFCC (Mel-frequency cepstral coefficient), among others. In regular autocorrelation based features, the correlation between two signal segments (signal vectors), separated by a particular time interval k, is calculated as a dot product of these two vectors. In our proposed PAC approach, the angle between the two vectors is used as a measure of correlation. Since dot product is usually more affected by noise than the angle, PAC-features are expected to be more robust to noise. This is indeed significantly confirmed by the presented experimental results. The experiments were conducted on the Numbers 95 database, on which "stationary" (car) and "non -stationary" (factory) Noisex 92 noises were added with varying SNR. In most of the cases, without any specific tuning, PAC-MFCC features perform better.

62 citations
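
The contrast drawn in the PAC entry above, dot product versus angle, can be sketched in a few lines. This is a minimal illustration only: it assumes the lag-k segment is formed by circularly shifting the frame, so both vectors have the same norm. The abstract does not spell out that framing, and the function name is hypothetical.

```python
import math


def phase_autocorrelation(frame, k):
    """Angle (radians) between a frame and its copy circularly shifted by k.

    Regular autocorrelation would return the dot product itself; here the
    dot product is normalised by the frame energy and passed through acos.
    The circular shift is an assumption made for this sketch.
    """
    n = len(frame)
    shifted = [frame[(i + k) % n] for i in range(n)]
    dot = sum(a * b for a, b in zip(frame, shifted))
    energy = sum(a * a for a in frame)
    if energy == 0.0:
        raise ValueError("frame has zero energy; the angle is undefined")
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / energy))
    return math.acos(cos_theta)


frame = [1.0, 0.0, -1.0, 0.0]
angles = [phase_autocorrelation(frame, k) for k in range(3)]
```

For this toy frame the angles at lags 0, 1 and 2 are 0, pi/2 and pi: identical, orthogonal and opposite vectors respectively.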


Journal ArticleDOI
TL;DR: The proposed feature set embedded with a nonlinear liftering transformation is quite effective for robust speech recognition and can be decomposed as the superposition of the standard cepstrum and its nonlinearly liftered counterpart.

62 citations


Patent
22 Aug 2003
TL;DR: In this article, methods and apparatuses are provided for detecting blur within digital images using wavelet transform and/or Cepstrum analysis blur detection techniques that are able to detect motion blur and out-of-focus blur.
Abstract: Methods and apparatuses are provided for detecting blur within digital images using wavelet transform and/or Cepstrum analysis blur detection techniques that are able to detect motion blur and/or out-of-focus blur.

62 citations


Proceedings ArticleDOI
30 Nov 2003
TL;DR: These new dynamic features derived from the modulation spectrum of the cepstral trajectories of the speech signal yield a significant increase in the speech recognition performance in various noise conditions when compared directly to the standard temporal derivative features and C-JRASTA PLP features.
Abstract: In this paper, we present new dynamic features derived from the modulation spectrum of the cepstral trajectories of the speech signal. Cepstral trajectories are projected over the basis of sines and cosines yielding the cepstral modulation frequency response of the speech signal. We show that the different sines and cosines basis vectors select different modulation frequencies, whereas the frequency responses of the delta and the double delta filters are only centered over 15 Hz. Therefore, projecting cepstral trajectories over the basis of sines and cosines yield a more complementary and discriminative range of features. In this work, the cepstrum reconstructed from the lower cepstral modulation frequency components is used as the static feature. In experiments, it is shown that, as well as providing an improvement in clean conditions, these new dynamic features yield a significant increase in the speech recognition performance in various noise conditions when compared directly to the standard temporal derivative features and C-JRASTA PLP features.

56 citations


Proceedings ArticleDOI
06 Jul 2003
TL;DR: A comparison of 6 methods for classification of sports audio shows that all the combinations achieve classification accuracy of around 90% with the best and the second best being, respectively, MPEG-7 features with EP-HMM and MFCC with ML-HMM.
Abstract: We present a comparison of 6 methods for classification of sports audio. For feature extraction, we have two choices: MPEG-7 audio features and Mel-scale frequency cepstrum coefficients (MFCC). For classification, we also have two choices: maximum likelihood hidden Markov models (ML-HMM) and entropic prior HMMs (EP-HMM). EP-HMMs, in turn, have two variations: with and without trimming of the model parameters. We thus have 6 possible methods, each of which corresponds to a combination. Our results show that all the combinations achieve classification accuracy of around 90% with the best and the second best being, respectively, MPEG-7 features with EP-HMM and MFCC with ML-HMM.

53 citations


Proceedings Article
01 Jan 2003
TL;DR: An attempt is made to investigate the use of phase function in the analytic signal of critical-band filtered speech for deriving a representation of frequencies present in the speech signal.
Abstract: Cepstral features derived from power spectrum are widely used for automatic speech recognition. Very little work, if any, has been done in speech research to explore phase-based representations. In this paper, an attempt is made to investigate the use of phase function in the analytic signal of critical-band filtered speech for deriving a representation of frequencies present in the speech signal. Results are presented which show the validity of this approach.

Proceedings Article
01 Jan 2003
TL;DR: This paper shows that Mel-frequency warping can equally well be integrated into the framework of VTN as linear transformation on the cepstrum and there is a strong interdependence of VTN and Maximum Likelihood Linear Regression for the case of Gaussian emission probabilities.
Abstract: We have shown previously that vocal tract normalization (VTN) results in a linear transformation in the cepstral domain. In this paper we show that Mel-frequency warping can equally well be integrated into the framework of VTN as linear transformation on the cepstrum. We show examples of transformation matrices to obtain VTN warped Mel-frequency cepstral coefficients (VTN-MFCC) as linear transformation of the original MFCC and discuss the effect of Mel-frequency warping on the Jacobian determinant of the transformation matrix. Finally we show that there is a strong interdependence of VTN and Maximum Likelihood Linear Regression (MLLR) for the case of Gaussian emission probabilities.

Journal ArticleDOI
TL;DR: In this paper, a forensic-phonetic speaker identification experiment is described, which tests to what extent same-speaker pairs from a 60 speaker Japanese data base can be discriminated from different-speaker pairs using a Bayesian likelihood ratio (LR) as discriminant function.
Abstract: A forensic-phonetic speaker identification experiment is described which tests to what extent same-speaker pairs from a 60 speaker Japanese data base can be discriminated from different-speaker pairs using a Bayesian likelihood ratio (LR) as discriminant function. Non-contemporaneous telephone recordings are used, with comparison based on mean values from three segments only: a nasal, a voiceless fricative, and a vowel. It is shown that discrimination using the LR-based distance is better than with a conventional distance, and that the cepstrum outperforms the formants. A LR for the test of 50 is obtained for formant-based discrimination, compared to c. 900 for the cepstrum, and the tests are thus shown to be capable of yielding a probative strength of support for the prosecution hypothesis that is conventionally quantified as ‘moderate’ for formants but ‘moderately strong’ for the cepstrum. Comparisons are made with results from similar experiments

Journal ArticleDOI
TL;DR: In this paper, a model is developed to predict the variation with time of the multipath delay for a jet aircraft or other broadband acoustic source in level flight with constant velocity over a hard ground.
Abstract: The signal emitted by an airborne acoustic source arrives at a stationary sensor located above a flat ground via a direct path and a ground-reflected path. The difference in the times of arrival of the direct path and ground-reflected path signal components, referred to as the multipath delay, provides an instantaneous estimate of the elevation angle of the source. A model is developed to predict the variation with time of the multipath delay for a jet aircraft or other broadband acoustic source in level flight with constant velocity over a hard ground. Based on this model, two methods are formulated to estimate the speed and altitude of the aircraft. Both methods require the estimation of the multipath delay as a function of time. The methods differ only in the way the multipath delay is estimated; the first method uses the autocorrelation function, and the second uses the cepstrum, of the sensor output over a short time interval. The performances of both methods are evaluated and compared using real acoustic data. The second method provides the most precise aircraft speed and altitude estimates as compared with the first and two other existing methods.
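
The idea of reading a multipath delay off the cepstrum can be sketched on a toy signal. This is a minimal illustration, not the paper's estimator. It assumes a unit impulse as the simplest broadband source, a single echo with a hypothetical gain of 0.5 and a delay of 10 samples, and a naive DFT. The real cepstrum of direct path plus echo has its largest positive peak at the echo delay.

```python
import cmath
import math


def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude spectrum (naive O(N^2))."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    log_mag = [math.log(abs(v)) for v in spectrum]
    # log_mag is real and even in k, so the inverse DFT reduces to a cosine sum.
    return [
        sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
        for q in range(n)
    ]


def estimate_delay(x, min_lag=1):
    """Return the quefrency (in samples) of the largest cepstral peak."""
    cepstrum = real_cepstrum(x)
    half = len(x) // 2
    return max(range(min_lag, half), key=lambda q: cepstrum[q])


# Toy sensor output: direct path plus one ground reflection (assumed values).
n_samples, true_delay, echo_gain = 64, 10, 0.5
sensor = [0.0] * n_samples
sensor[0] = 1.0
sensor[true_delay] += echo_gain

estimated = estimate_delay(sensor)
peak_value = real_cepstrum(sensor)[true_delay]
```

Here `estimated` comes out as 10 and the peak height is about `echo_gain / 2`. With a real broadband source the source's own cepstrum is superimposed on the echo peak, so the search range and any smoothing would need tuning.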

Journal ArticleDOI
TL;DR: A root cepstrum based approach is presented to derive a minimum phase signal from a given magnitude spectrum and it is found that the causal portion of the signal obtained by taking the inverse Fourier transform of the squared magnitude spectrum is a minimum phase signal.
Abstract: A root cepstrum based approach is presented to derive a minimum phase signal from a given magnitude spectrum. The approach is based on computing the root homomorphic cepstrum. It is found that the causal portion of the signal obtained by taking the inverse Fourier transform of the squared magnitude spectrum is a minimum phase signal. Two separate root cepstra for a signal are defined, one which is derived from the squared magnitude spectrum referred to as x_rp(n) and the other from the inverted squared magnitude spectrum referred to as x_rz(n). It is observed that, for any non-minimum phase test signal, the causal portion of x_rp(n) and x_rz(n) contain information about the exact locations of poles and zeros respectively, which correspond to the minimum phase equivalent poles and zeros of the original signal.

Proceedings Article
01 Jan 2003
TL;DR: Improved schemes for simultaneous speech interpolation and demodulation based on continuous-time models are developed and robust nonlinear modulation features are used to enhance the classic cepstrum-based features and use the augmented feature set for ASR applications.
Abstract: In this paper, we develop improved schemes for simultaneous speech interpolation and demodulation based on continuous-time models. This leads to robust algorithms to estimate the instantaneous amplitudes and frequencies of the speech resonances and extract novel acoustic features for ASR. The continuous-time models retain the excellent time resolution of the ESAs based on discrete energy operators and perform better in the presence of noise. We also introduce a robust algorithm based on the ESAs for amplitude compensation of the filtered signals. Furthermore, we use robust nonlinear modulation features to enhance the classic cepstrum-based features and use the augmented feature set for ASR applications. ASR experiments show promising evidence that the robust modulation features improve recognition.

Proceedings Article
01 Jan 2003
TL;DR: MFT is expressed in any domain that is a linear transform of (log-)spectra, for example for cepstra and their time derivatives, and the success of the method is shown.
Abstract: When applying Missing Feature Theory to noise robust speech recognition, spectral features are labeled as either reliable or unreliable in the time-frequency plane. The acoustic model evaluation of the unreliable features is modified to express that their clean values are unknown or confined within bounds. Classically, MFT requires an assumption of statistical independence in the spectral domain, which deteriorates the accuracy on clean speech. In this paper, MFT is expressed in any domain that is a linear transform of (log-)spectra, for example for cepstra and their time derivatives. The acoustic model evaluation is recast as a nonnegative least squares problem. Approximate solutions are proposed and the success of the method is shown through experiments on the AURORA-2 database.

Journal ArticleDOI
TL;DR: Stochastic methods for designing feature extraction methods which are trained to alleviate the unwanted variability present in speech signals are proposed and shown to provide significant advantages over the conventional methods both in terms of performance of ASR and in providing understanding about the nature of speech signal.

Proceedings ArticleDOI
01 Jan 2003
TL;DR: Experimental results show that the proposed method is robust against data compression and some kinds of synchronization attacks, such as MP3, Gaussian white noise, filtering and so on.
Abstract: In this paper, we propose embedding a visually recognizable binary image as a watermark in the cepstrum domain of audio signals. The cepstrum representation of audio can be shown to be very robust to a wide range of attacks, including the most challenging time-scaling and pitch-shifting warping. An intuitive psychoacoustic model is employed to control the audibility of the introduced distortion. The results show that the watermark is imperceptible and robust against some signal processing, and that our method succeeded in detecting the embedded binary image. Extensive experimental results show that the proposed method is robust against data compression and some kinds of synchronization attacks, such as MP3, Gaussian white noise, filtering and so on.

Proceedings Article
01 Jan 2003
TL;DR: None of the feature transformations could outperform the baseline when used alone, but improvement in the word error rate was gained when the baseline feature was combined with the feature transformation stream.
Abstract: In this work, linear and nonlinear feature transformations were tested in an ASR front end. Unsupervised transformations were based on principal component analysis and independent component analysis. Discriminative transformations were based on linear discriminant analysis and multilayer perceptron networks. The acoustic models were trained using a subset of HUB5 training data and they were tested using OGI Numbers corpus. Baseline feature vector consisted of PLP cepstrum and energy with first and second order deltas. None of the feature transformations could outperform the baseline when used alone, but improvement in the word error rate was gained when the baseline feature was combined with the feature transformation stream. Two combination methods were tested: feature vector concatenation and n-best list combination using ROVER. Best results were obtained using the combination of the baseline PLP cepstrum and the feature transform based on multilayer perceptron network. The word error rate in the number recognition task was reduced from 4.1 to 3.1.

01 Jan 2003
TL;DR: The speaker recognition method presented here is based on short-time spectra, however the feature extraction process does not correspond to the MFCC process, and is text-independent, invariant over time, and robust to channel variability.
Abstract: Automatic methods to determine voiceprints in speech samples predominantly use short-time spectra to yield specific features of a given speaker. Among these, the Mel Frequency Cepstrum Coefficient (MFCC) features are widely used today. The speaker recognition method presented here is based on short-time spectra, however the feature extraction process does not correspond to the MFCC process. The motivation was to avoid what we see as shortcomings of present approaches, particularly the blurring effect in the frequency domain, which confuses rather than helps in distinguishing speakers. We introduce a speech synthesis model that can be identified using Independent Component Analysis (ICA). The ICA representations of log spectral data result in cepstral-like, independent coefficients, which capture correlations among frequency bands specific to the given speaker. It also results in speaker specific basis functions. Coefficients determined from test data using a speaker’s true basis functions show a low degree of correlation, while those determined using other basis functions do not. This enables the system to reliably recognize speakers. The resulting speaker recognition method is text-independent, invariant over time, and robust to channel variability. Its effectiveness has been tested in representing and recognizing speakers from a set of 462 people from the TIMIT database.

Journal ArticleDOI
TL;DR: This research presents a novel probabilistic approach to estimating the response of the immune system to laser-spot assisted, 3D image analysis of central nervous system injury.

PatentDOI
TL;DR: In this paper, a method and apparatus map a set of vocal tract resonant frequencies, together with their corresponding bandwidths, into a simulated acoustic feature vector in the form of LPC cepstrum by calculating a separate function for each individual vocal tract resonance frequency/bandwidth and summing the result to form an element of the simulated feature vector.
Abstract: A method and apparatus map a set of vocal tract resonant frequencies, together with their corresponding bandwidths, into a simulated acoustic feature vector in the form of LPC cepstrum by calculating a separate function for each individual vocal tract resonant frequency/bandwidth and summing the result to form an element of the simulated feature vector. The simulated feature vector is applied to a model along with an input feature vector to determine a probability that the set of vocal tract resonant frequencies is present in a speech signal. Under one embodiment, the model includes a target-guided transition model that provides a probability of a vocal tract resonant frequency based on a past vocal tract resonant frequency and a target for the vocal tract resonant frequency. Under another embodiment, the phone segmentation is provided by an HMM system and is used to precisely determine which target value to use at each frame.

Proceedings ArticleDOI
01 Jan 2003
TL;DR: In this article, an exponentially weighted autoregressive (EWAR) spectral model is proposed to model the peak amplitudes, bandwidths, and frequencies in an ARMA spectral model without any explicit model of the spectral zeros.
Abstract: This paper proposes a formant tracker capable of computing the maximum a posteriori probability formant frequencies (eigenfrequencies of the vocal tract) during periods of consonant closure. Two specific novel algorithms are proposed. First, an exponentially weighted autoregressive (EWAR) spectral model is proposed. The EWAR model is capable of modeling the peak amplitudes, bandwidths, and frequencies in an ARMA spectral model without any explicit model of the spectral zeros. Instead of explicit zero models, the amplitudes of spectral peaks are adjusted by exponential coupling weights. It is demonstrated that the parameters of the EWAR model may be efficiently computed from the observed speech cepstrum. Second, the smoothness of formant frequency trajectories is modeled using a linear dynamic systems model with a nonlinear output map, and maximum a posteriori probability tracking of dynamic formant frequencies is demonstrated using a particle filtering approach.

Journal ArticleDOI
TL;DR: In this article, a method for the design of all-pass filters approximating a given all band group delay function with small approximation error is presented, based on the calculation of the complex cepstrum coefficients of the denominator of the allpass filter which approximate the desired group delay in a weighted least square sense.
Abstract: One method for the design of all-pass filters approximating a given all band group delay function with small approximation error is presented. The design procedure is based on the calculation of the complex cepstrum coefficients of the denominator of the all-pass filter which approximate the desired group delay in a weighted least square sense.

Journal ArticleDOI
TL;DR: A new and practical approach using the cepstrum technique is proposed in the design of minimum-phase digital filters as the sum of two allpass functions, which works well for both classical filter specification and general magnitude specification in the frequency domain.
Abstract: A new and practical approach using the cepstrum technique is proposed in the design of minimum-phase digital filters as the sum of two allpass functions. The desired magnitude response is specified in the frequency domain. Its corresponding minimum-phase response is then obtained from the desired magnitude response. The desired phases for the two allpass filters are obtained from the magnitude and phase responses. For both filters to be stable, the corresponding denominator polynomials are minimum phase. The filter coefficients are obtained from the desired phases using the cepstrum technique. Design examples show that the method works well for both classical filter specification and general magnitude specification in the frequency domain.
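
The step in which a minimum-phase response is obtained from a desired magnitude response can be illustrated with the classical homomorphic (cepstral folding) construction. This is a standard textbook technique swapped in for illustration, not necessarily the procedure used in the paper. The function names and the naive DFT are assumptions, and the magnitude must be strictly positive and sampled on an even number of DFT bins.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT / inverse DFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def minimum_phase_from_magnitude(magnitude):
    """Impulse response of the minimum-phase system with the given |H[k]|."""
    n = len(magnitude)
    # Real cepstrum of the log magnitude.
    cepstrum = [v.real for v in dft([math.log(m) for m in magnitude], inverse=True)]
    # Fold the anti-causal half onto the causal half.
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    for q in range(1, n // 2):
        folded[q] = 2.0 * cepstrum[q]
    folded[n // 2] = cepstrum[n // 2]
    # Back to the frequency domain, exponentiate, and return to time.
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]


# The maximum-phase sequence [0.5, 1] shares its magnitude with [1, 0.5].
n_bins = 32
max_phase = [0.5, 1.0] + [0.0] * (n_bins - 2)
magnitude = [abs(v) for v in dft(max_phase)]
h_min = minimum_phase_from_magnitude(magnitude)
```

The recovered `h_min` is approximately `[1.0, 0.5, 0, ...]`, the minimum-phase counterpart of the input, up to a small cepstral aliasing error that shrinks as the number of bins grows.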

Proceedings Article
01 Jan 2003
TL;DR: A speaker identification system that uses multiple supplementary information sources for computing a combined match score for the unknown speaker and is evaluated with a corpus of 110 Finnish speakers.
Abstract: In this work, we describe a speaker identification system that uses multiple supplementary information sources for computing a combined match score for the unknown speaker. Each speaker profile in the database consists of multiple feature vector sets that can vary in their scale, dimensionality, and the number of vectors. The evidence from a given feature set is weighted by its reliability that is set in a priori fashion. The confidence of the identification result is also estimated. The system is evaluated with a corpus of 110 Finnish speakers. The evaluated feature sets include mel-cepstrum, LPC-cepstrum, dynamic cepstrum, long-term averaged spectrum of /A/ vowel, and F0.

Journal ArticleDOI
TL;DR: A novel and simple feature extraction method for speech recognition using the two-dimensional root cepstrum (TDRC) is introduced; its adjustable root index parameter (γ) gives it some advantages over the original two-dimensional cepstrum (TDC), and it shows promising results compared with the original TDC.
Abstract: A novel and simple feature extraction method for speech recognition using the two-dimensional root cepstrum (TDRC) has been introduced. Because of the adjustable root index parameter (γ) it has some advantages over the original two-dimensional cepstrum (TDC). Experiments on isolated word recognition using the TIMIT data base show promising results compared with the original TDC.

Patent
07 Feb 2003
TL;DR: In this patent, the LPC coefficients of a block determined as a voiced sound block by a voiced sound determining part 14 are inputted to a cepstrum converting part 17 and converted into LPC cepstrum coefficients.
Abstract: PROBLEM TO BE SOLVED: To reduce recognition errors in a part sufficiently including a mixed voice due to background noise and a plurality of speakers when the statistical speech frequency of the speakers in AV data is detected. SOLUTION: In an information detecting apparatus 10, a voice signal D11 of the AV data inputted from an inputting part 11 is LPC-analyzed by an LPC analyzing part 12. An LPC coefficient of a block determined as a voiced sound block by a voiced sound determining part 14 is inputted to a cepstrum converting part 17 and converted into an LPC cepstrum coefficient. The LPC cepstrum coefficient D12 is vector-quantized by a vector-quantizing part 18. A quantization distortion D18 is inputted to and evaluated by a speaker identifying part 19 for identifying and determining the speaker per the predetermined recognition block. The identified speaker D20 is inputted to a part for calculating the frequency of determining the speaker 20. The part 20 calculates the frequency of determining the speaker respectively recognized in an interval per the predetermined evaluating interval and outputs it as frequency information of appearance of the speaker D21.

Proceedings ArticleDOI
06 Apr 2003
TL;DR: An extension of the method for dealing with HMMs whose distributions are mixture Gaussian distributions, which chooses the sequence of Gaussian distributions by selecting the best Gaussian distribution in the state during Viterbi decoding, achieves an 18.2% reduction in error rate.
Abstract: We have proposed a new speech recognition technique that generates a speech trajectory from HMMs by maximizing the likelihood of the trajectory, while accounting for the relation between the cepstrum and the dynamic cepstrum coefficients. This method has the major advantage that the relation, which is ignored in conventional speech recognition, is directly used in the speech recognition phase. This paper describes an extension of the method for dealing with HMMs whose distributions are mixture Gaussian distributions. The method chooses the sequence of Gaussian distributions by selecting the best Gaussian distribution in the state during Viterbi decoding. Speaker-independent speech recognition experiments were carried out. The proposed method obtained an 18.2% reduction in error rate for the task, proving that the proposed method is effective even for Gaussian mixture HMMs.