
Showing papers on "Cepstrum published in 2005"


Journal ArticleDOI
TL;DR: In this paper, the Jacobian determinant of the transformation matrix is computed analytically for three typical warping functions, and it is shown that the matrices are diagonally dominant and thus can be approximated by quindiagonal matrices.
Abstract: Vocal tract normalization (VTN) is a widely used speaker normalization technique which reduces the effect of different lengths of the human vocal tract and results in improved recognition accuracy of automatic speech recognition systems. We show that VTN results in a linear transformation in the cepstral domain, although frequency warping and linear cepstral transformations have so far been considered independent approaches to speaker normalization. We are now able to compute the Jacobian determinant of the transformation matrix, which allows the normalization of the probability distributions used in speaker-normalized automatic speech recognition. We show that VTN can be viewed as a special case of Maximum Likelihood Linear Regression (MLLR). Consequently, we can explain previous experimental results showing that the improvements obtained by VTN and subsequent MLLR are not additive in some cases. For three typical warping functions the transformation matrix is calculated analytically, and we show that the matrices are diagonally dominant and thus can be approximated by quindiagonal matrices.
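The claim that frequency warping acts as a linear map on the cepstral coefficients can be checked numerically. The sketch below (numpy; the piecewise-linear warping function and its breakpoint are illustrative assumptions, not the paper's exact formulas) builds the transformation matrix by projecting warped cosine basis functions back onto the cepstral cosine basis:

```python
import numpy as np

def _trapz(y, x):
    """Composite trapezoidal rule."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def vtn_cepstral_matrix(alpha, n_cep=8, n_grid=4096):
    """Matrix A with c_warped ~= A @ c for a piecewise-linear warping
    g(w) = alpha*w below a breakpoint w0, then linear up to g(pi) = pi.
    With L(w) = 2 * sum_n c_n cos(n*w), the matrix entries are
    A[k-1, n-1] = (2/pi) * integral_0^pi cos(n*g(w)) * cos(k*w) dw."""
    w = np.linspace(0.0, np.pi, n_grid)
    w0 = 0.875 * np.pi / max(alpha, 1.0)   # breakpoint: an assumed convention
    g = np.where(w <= w0, alpha * w,
                 alpha * w0 + (np.pi - alpha * w0) * (w - w0) / (np.pi - w0))
    A = np.empty((n_cep, n_cep))
    for n in range(1, n_cep + 1):
        for k in range(1, n_cep + 1):
            A[k - 1, n - 1] = (2.0 / np.pi) * _trapz(
                np.cos(n * g) * np.cos(k * w), w)
    return A
```

For alpha = 1 the warping is the identity and A reduces to the identity matrix; for mild warps the mass concentrates near the diagonal, which is what makes the quindiagonal approximation plausible.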

217 citations


Book
31 Jan 2005
TL;DR: This is a newly revised and greatly expanded edition of a classic Artech House book that offers practical guidance in electronic warfare target location and provides practitioners with critical information on a variety of geolocation algorithms and techniques.
Abstract: Introduction to Emitter Geolocation -Introduction. Gradient Descent Algorithm. Concluding Remarks. Triangulation -Introduction. Basic Concepts. Least-Squares Error Estimation. Total Least-Squares Estimation. Least-Squares Distance Error PF Algorithm. Minimum Mean-Squares Error Estimation. The Discrete Probability Density Method. Generalized Bearings. Maximum Likelihood PF Algorithm. Multiple Sample Correlation. Bearing-Only Target Motion Analysis. Sources of Error in Triangulation. Concluding Remarks. DF Techniques -Introduction. Array Processing Direction of Arrival Measurement Methods. Other Methods of Estimating the AOA. MSE Phase Interferometer. DF with a Butler Matrix. Phase Difference Estimation Using SAW Devices. Concluding Remarks. MUSIC -Introduction. MUSIC Overview. MUSIC. Performance of MUSIC in the Presence of Modeling Errors. Determining the Number of Wavefields. Effect of Phase Errors on the Accuracy of MUSIC. Other Superresolution Algorithms. Concluding Remarks. Quadratic Position-Fixing Methods -Introduction. TDOA Position-Fixing Techniques. Differential Doppler. Range Difference Methods. Concluding Remarks. Time Delay Estimation -Introduction. System Overview. Cross Correlation. Generalized Cross-Correlation. Estimating the Time Delay with the Generalized Correlation Method. Time Delay Estimation Using the Phase of the Cross-Spectral Density. Effects of Frequency and Phase Errors in EW TDOA Direction-Finding Systems. Concluding Remarks. Single-Site Location Techniques -Introduction. HF Signal Propagation. Single-Site Location. Passive SSL. Determining the Reflection Delay with the Cepstrum. MUSIC Cepstrum SSL. Earth Curvature. Skywave DF Errors. Ray Tracing. Accuracy Comparison of SSL and Triangulation for Ionospherically Propagated Signals. Concluding Remarks.

151 citations


01 Sep 2005
TL;DR: In this article, a cepstrum-based iterative true envelope estimator is proposed for pitch shifting with preservation of the spectral envelope in the phase vocoder, which can reduce the run time by a factor of 2.5-11.
Abstract: In this article the estimation of the spectral envelope of sound signals is addressed. The intended application for the developed algorithm is pitch shifting with preservation of the spectral envelope in the phase vocoder. As a first step, the different existing envelope estimation algorithms are investigated and their specific properties discussed. The cepstrum-based iterative true envelope estimator is selected as the most promising algorithm. By means of controlled sub-sampling of the log amplitude spectrum and a simple step size control for the iterative algorithm, the run time of the algorithm can be decreased by a factor of 2.5-11. As a remedy for the ringing effects in the spectral envelope that are due to the rectangular filter used for spectral smoothing, we propose the use of a Hamming window as the smoothing filter. The resulting implementation of the algorithm has slightly increased computational complexity compared to the standard LPC algorithm but offers significantly improved control over the envelope characteristics. The application of the true envelope estimator in a pitch shifting application is investigated. The main problems for pitch shifting with envelope preservation in a phase vocoder are identified and a simple yet efficient remedy is proposed.
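The iterative true-envelope idea can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: the function names, fixed iteration count, and rectangular lifter are assumptions, and the paper's sub-sampling and step-size accelerations are omitted.

```python
import numpy as np

def cepstral_smooth(log_mag, order):
    """Low-pass lifter the real cepstrum of a full, symmetric log spectrum."""
    c = np.fft.ifft(log_mag).real
    n = len(c)
    lifter = np.zeros(n)
    lifter[:order + 1] = 1.0     # keep quefrencies |q| <= order
    lifter[n - order:] = 1.0
    return np.fft.fft(c * lifter).real

def true_envelope(log_mag, order, n_iter=60):
    """Iterative true-envelope estimator: replace the working spectrum by
    the max of spectrum and current envelope, then re-smooth."""
    a = log_mag.copy()
    env = cepstral_smooth(a, order)
    for _ in range(n_iter):
        a = np.maximum(a, env)
        env = cepstral_smooth(a, order)
    return env
```

Each pass lifts the working spectrum up to the running envelope, so the smoothed curve converges onto the spectral peaks instead of averaging through them.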

146 citations


Journal ArticleDOI
TL;DR: In this paper, an indirect measurement of the cylinder pressure in diesel engines is demonstrated for a large two-stroke marine diesel engine and a small four-stroke diesel engine; the method involves reconstructing the cylinder pressure diagram in the crank-angle domain from the acoustic emission generated during the combustion phase.

83 citations


Journal ArticleDOI
TL;DR: In this article, modelling a one-stage spur gear transmission as a two-degrees-of-freedom system produces two modes, rigid body and elastic; the time-varying meshing stiffness is the main internal excitation source for the transmission and governs the behaviour of the elastic mode.
Abstract: Modelling a one-stage spur gear transmission as a two-degrees-of-freedom system produces two modes: rigid body and elastic. The time-varying meshing stiffness is the main internal excitation source for the transmission and governs the behaviour of the elastic mode. Deterioration of one or several teeth, which affects the gear mesh stiffness, is considered in this work. The onset of a crack or of spalling is modelled by a tooth having a localised or a distributed defect, respectively, and both are taken into account in the model. Simulation results are analysed by cepstrum and spectrum techniques. It is found that the cepstrum and spectrum techniques are very efficient for localised and distributed defects, respectively. A series of tests was made on the experimental setup. Spectrum and cepstrum analyses of the recorded responses, with and without defects, are compared with numerical results and confirm their usefulness in gear monitoring.

81 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: It is shown that features based on the proposed technique yield a significant increase in speech recognition performance in non-stationary noise conditions when compared directly to the MFCC and RASTA-PLP features.

Abstract: It is well known that the peaks in the spectrum of a log Mel-filter bank are important cues in characterizing speech sounds. However, low energy perturbations in the power spectrum may become numerically significant after the log compression. We show that even if the spectral peaks are kept constant, the low energy perturbations in the power spectrum can create huge variations in the cepstral coefficients. We show, both analytically and experimentally, that exponentiating the log Mel-filter bank spectrum before the cepstrum computation can significantly reduce the sensitivity of the cepstra to spurious low energy perturbations. The Mel-cepstrum modulation spectrum (Tyagi, V. et al., Proc. IEEE ASRU, 2003) is computed from the processed cepstra, which results in further noise robustness of the composite feature vector. In experiments with speech signals, it is shown that features based on the proposed technique yield a significant increase in speech recognition performance in non-stationary noise conditions when compared directly to the MFCC and RASTA-PLP features.
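The sensitivity argument is easy to reproduce: exponentiating the log Mel spectrum (i.e. applying a root compression S**gamma) shrinks the cepstral effect of a low-energy perturbation while leaving the peaks dominant. A toy numpy illustration follows; the five-bin "filter bank" and the gamma value are invented for the example.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II, the usual final step of cepstrum computation."""
    n = len(x)
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)
    return np.sqrt(2.0 / n) * (basis @ x)

# Toy "Mel filter bank" energies: strong peaks plus one near-zero bin.
mel = np.array([1.0, 0.8, 1e-6, 0.9, 1.0])
mel_pert = mel.copy()
mel_pert[2] = 1e-8                 # tiny power change; the peaks are untouched

gamma = 0.1                        # assumed root-compression exponent
d_log = np.linalg.norm(dct_ii(np.log(mel)) - dct_ii(np.log(mel_pert)))
d_root = np.linalg.norm(dct_ii(mel ** gamma) - dct_ii(mel_pert ** gamma))
# d_log is ~50x larger: log compression amplifies low-energy perturbations.
```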

79 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.
Abstract: In this paper, we consider the use of multiple acoustic features of the speech signal for robust speech recognition. We investigate the combination of various auditory based (mel frequency cepstrum coefficients, perceptual linear prediction, etc.) and articulatory based (voicedness) features. Features are combined by linear discriminant analysis and log-linear model combination based techniques. We describe the two feature combination techniques and compare the experimental results. Experiments performed on the large-vocabulary task VerbMobil II (German conversational speech) show that the accuracy of automatic speech recognition systems can be improved by the combination of different acoustic features.

78 citations


Proceedings ArticleDOI
04 Sep 2005
TL;DR: Error analysis and speech recognition experiments show that the TECCs and the mel frequency cepstrum coefficients (MFCCs) perform similarly for clean recording conditions; while the TECCs perform significantly better than the MFCCs for noisy recognition tasks.
Abstract: In this paper, a feature extraction algorithm for robust speech recognition is introduced. The feature extraction algorithm is motivated by the human auditory processing and the nonlinear Teager-Kaiser energy operator that estimates the true energy of the source of a resonance. The proposed features are labeled as Teager Energy Cepstrum Coefficients (TECCs). TECCs are computed by first filtering the speech signal through a dense non constant-Q Gammatone filterbank and then by estimating the “true” energy of the signal’s source, i.e., the short-time average of the output of the Teager-Kaiser energy operator. Error analysis and speech recognition experiments show that the TECCs and the mel frequency cepstrum coefficients (MFCCs) perform similarly for clean recording conditions; while the TECCs perform significantly better than the MFCCs for noisy recognition tasks. Specifically, relative word error rate improvement of 60% over the MFCC baseline is shown for the Aurora-3 database for the high-mismatch condition. Absolute error rate improvement ranging from 5% to 20% is shown for a phone recognition task in (various types of additive) noise.
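The Teager-Kaiser energy operator at the heart of the TECC front end is a three-sample nonlinear filter; a minimal numpy sketch (the Gammatone filterbank and short-time averaging stages of the full feature pipeline are omitted):

```python
import numpy as np

def teager_kaiser(x):
    """Discrete Teager-Kaiser energy operator:
    psi[n] = x[n]**2 - x[n-1]*x[n+1] (valid for interior samples)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For a pure tone A*cos(W*n + phi) the operator returns exactly A**2 * sin(W)**2, so it tracks both the amplitude and the frequency of the resonance, unlike the plain squared amplitude.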

77 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: The paper shows mathematically that there exists an acoustic universal structure in speech, which can be interpreted as a physical implementation of structural phonology, and implies that there always exists a distortion-free communication channel between a speaker and a listener.
Abstract: The paper shows mathematically that there exists an acoustic universal structure in speech, which can be interpreted as a physical implementation of structural phonology. The structure is completely free of the multiplicative and linear transformational distortions that are inevitably involved in speech communication as differences of vocal tract shape, gender, age, microphone, room, line, hearing characteristics, and so on. A speech event, such as a phone, is probabilistically modeled as a distribution of parameters calculated by a linear transformation of a log spectrum, e.g., cepstrum. A set of events, such as a word, is relatively captured as a structure composed of the distributions. An n-point structure is uniquely determined by fixing the lengths of its n(n-1)/2 diagonal lines, namely, the distance matrix among the n points. The distance between two distributions is calculated as a Bhattacharyya distance. The resulting structure has very interesting characteristics: multiplicative and linear transformational distortions are geometrically interpreted as shift and rotation of the structure, respectively. This fact implies that there always exists a distortion-free communication channel between a speaker and a listener.
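The distance matrix underlying the structure uses the Bhattacharyya distance between Gaussian event distributions. A small numpy sketch of that formula (full-covariance Gaussians assumed; not the paper's code):

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    maha = 0.125 * diff @ np.linalg.solve(cov, diff)
    logdet = 0.5 * np.log(np.linalg.det(cov) /
                          np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return maha + logdet
```

A common shift of all means (which is how a multiplicative spectral distortion appears in the cepstral domain) leaves every pairwise distance unchanged, which is the shift-invariance the abstract describes.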

65 citations


Proceedings ArticleDOI
01 Oct 2005
TL;DR: Experimental results on various configurations of front-end techniques reported herein demonstrate that, besides providing robustness against channel mismatch and noise as found in existing literature, feature warping is useful more generally as a technique for pre-mapping data for improved compatibility with a GMM back-end.
Abstract: This paper proposes the novel use of feature warping for automatic language identification, in combination with the shifted delta cepstrum (SDC) and perceptual linear predictive coefficients in a Gaussian mixture model (GMM) based system. Experimental results on various configurations of front-end techniques reported herein demonstrate that, besides providing robustness against channel mismatch and noise as found in existing literature, feature warping is useful more generally as a technique for pre-mapping data for improved compatibility with a GMM back-end. The configuration reported in this paper provides a language identification performance of 76.4% using the OGI/NIST database, a 46.5% relative reduction in error rate when compared with a benchmark system employing Mel frequency cepstral coefficients and the SDC.
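The shifted delta cepstrum (SDC) stacks k delta-cepstra spaced P frames apart into one long vector per frame. A compact numpy sketch (parameter names follow the usual N-d-P-k convention; the default values are illustrative, not taken from the paper):

```python
import numpy as np

def sdc(cep, d=1, p=3, k=7):
    """Shifted delta cepstrum: for frame t, stack the k delta vectors
    cep[t + i*p + d] - cep[t + i*p - d], i = 0..k-1 (N-d-P-k convention).
    `cep` is a (frames, N) array of cepstral vectors."""
    t_max = len(cep) - ((k - 1) * p + d)
    rows = []
    for t in range(d, t_max):
        deltas = [cep[t + i * p + d] - cep[t + i * p - d] for i in range(k)]
        rows.append(np.concatenate(deltas))
    return np.array(rows)
```

The stacking gives each frame a long temporal context (here (k-1)*p + 2*d frames) while keeping the per-frame computation trivial, which is why SDC features pair well with a GMM back-end.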

54 citations


Proceedings ArticleDOI
18 Mar 2005
TL;DR: Results of embedding using a clean and a noisy host utterance show the embedded information is robust to additive noise and bandpass filtering.
Abstract: A method of embedding information in the cepstral domain of a cover audio signal is described for audio steganography applications. The proposed technique combines the commonly employed psychoacoustical masking property of the human auditory system with the decorrelation property of the speech cepstrum, and achieves imperceptible embedding, large payload, and accurate data retrieval. Results of embedding using a clean and a noisy host utterance show the embedded information is robust to additive noise and bandpass filtering.

Journal ArticleDOI
17 May 2005
TL;DR: Inspired by time-frequency duality, this paper proposes the use of Linear Predictive Coding (LPC) and Cepstrum coefficients to model time varying software artifact histories to recover time variant information from software repositories.
Abstract: This paper presents an approach to recover time-variant information from software repositories. It is widely accepted that software evolves due to factors such as defect removal, market opportunity or adding new features. Software evolution details are stored in software repositories, which often contain the change history. On the other hand, there is a lack of approaches, technologies and methods to efficiently extract and represent time-dependent information. Disciplines such as signal and image processing or speech recognition adopt frequency domain representations to mitigate differences of signals evolving in time. Inspired by time-frequency duality, this paper proposes the use of Linear Predictive Coding (LPC) and Cepstrum coefficients to model time-varying software artifact histories. LPC and cepstrum coefficients yield very compact representations with linear complexity. These representations can be used to highlight components and artifacts that evolved in the same way or with very similar evolution patterns. To assess the proposed approach we applied LPC and cepstral analysis to 211 Linux kernel releases (i.e., from 1.0 to 1.3.100) to identify files with very similar size histories. The approach, the preliminary results and the lessons learned are presented in this paper.
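The LPC coefficients fitted to an artifact history (e.g. a file-size series) can be obtained with the standard autocorrelation method and the Levinson-Durbin recursion; a self-contained numpy sketch (a generic implementation, not the paper's):

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion.
    Returns predictor coefficients a (x[t] ~= sum_j a[j-1] * x[t-j])
    and the final prediction error energy."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)                # a[0] is a placeholder
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coeff.
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        err *= (1.0 - k * k)
    return a[1:], err
```

A handful of coefficients summarizes a whole series, which is what makes comparing hundreds of release histories cheap.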

Proceedings ArticleDOI
04 Sep 2005
TL;DR: By exploiting previous research in mel cepstrum feature enhancement, a unified probabilistic framework is created under which the feature denoising and bandwidth extension processes are tightly integrated using a single shared statistical model.
Abstract: We present a new bandwidth extension algorithm for converting narrowband telephone speech into wideband speech using a transformation in the mel cepstral domain. Unlike previous approaches, the proposed method is designed specifically for bandwidth extension of narrowband speech that has been corrupted by environmental noise. We show that by exploiting previous research in mel cepstrum feature enhancement, we can create a unified probabilistic framework under which the feature denoising and bandwidth extension processes are tightly integrated using a single shared statistical model. By doing so, we are able to both denoise the observed narrowband speech and robustly extend its bandwidth in a jointly optimal manner. A series of experiments on clean and noise-corrupted narrowband speech is performed to validate our approach.

Journal ArticleDOI
12 Dec 2005
TL;DR: A hidden Markov model was constructed and conditions were investigated that would provide improved performance for a dysarthric speech (isolated word) recognition system; it was found that a Mel cepstrum based model outperformed fast Fourier transform and linear prediction based models.
Abstract: In this study, a hidden Markov model was constructed and conditions were investigated that would provide improved performance for a dysarthric speech (isolated word) recognition system. The speaker-dependent system was intended to act as an assistive/control tool. A small vocabulary spoken by three cerebral palsy subjects was chosen. Fast Fourier transform, linear predictive, and Mel frequency cepstral coefficients extracted from the data provided training input to several whole-word hidden Markov model configurations. The effects of model structure, number of states, and frame rate were also investigated. It was noted that a 10-state ergodic model using 15 msec frames was better than other configurations. Furthermore, it was found that a Mel cepstrum based model outperformed fast Fourier transform and linear prediction based models. The system offers effective and robust application as a rehabilitation and/or control tool to assist dysarthric motor-impaired individuals.

Proceedings ArticleDOI
18 Mar 2005
TL;DR: It is found that the dynamic cepstrum is more robust to additive noise than its static counterpart, and a simple yet effective strategy of exponentially weighting the likelihoods that are contributed by the static and dynamic features during the decoding process is proposed.
Abstract: In this paper, we investigate the relative noise robustness of dynamic and static spectral features, using two speaker independent continuous digit databases in English (Aurora2) and Cantonese (CUDigit). It is found that the dynamic cepstrum is more robust to additive noise than its static counterpart. The results are consistent across different types of noise and under various SNRs. Optimal exponential weights for exploiting the unequal noise robustness of the two features are discriminatively trained on a development set. When tested under various noise conditions, the optimal weights yielded relative word error rate reductions of 36.6% and 41.9% for Aurora2 and CUDigit, respectively. The proposed weighting is attractive for many ASR applications in noise because: (1) no noise estimation is needed for feature compensation; (2) no adaptation of clean HMMs to a noisy environment is required; and (3) only a trivial change in the decoding process is needed, weighting the log likelihoods of static and dynamic components separately.

Proceedings Article
01 Sep 2005
TL;DR: The real cepstrum is used to design an arbitrary-length minimum-phase FIR filter from a mixed-phase sequence; the resulting magnitude response is exactly the same as that of the original sequence.
Abstract: The real cepstrum is used to design an arbitrary-length minimum-phase FIR filter from a mixed-phase sequence. There is no need to start with the odd-length equiripple linear-phase sequence first. Neither phase unwrapping nor root finding is needed. Only two FFTs and an iterative procedure are required to compute the filter impulse response from the real cepstrum; the resulting magnitude response is exactly the same as that of the original sequence.
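The basic construction (real cepstrum, cepstral folding, exponentiation, two FFT round trips) is a standard homomorphic recipe and fits in a few lines of numpy. This sketch omits the iterative refinement step mentioned in the abstract:

```python
import numpy as np

def minimum_phase(h, n_fft=4096):
    """Minimum-phase counterpart of h with the same magnitude response,
    obtained by folding the real cepstrum of |H|."""
    mag = np.abs(np.fft.fft(h, n_fft))
    c = np.fft.ifft(np.log(mag)).real          # real cepstrum of the magnitude
    fold = np.zeros(n_fft)
    fold[0] = c[0]                             # fold the anticausal half
    fold[1:n_fft // 2] = 2.0 * c[1:n_fft // 2] # onto the causal half
    fold[n_fft // 2] = c[n_fft // 2]
    h_min = np.fft.ifft(np.exp(np.fft.fft(fold))).real
    return h_min[:len(h)]
```

Folding doubles the causal cepstral coefficients, converting the even (magnitude-only) cepstrum into the complex cepstrum of a minimum-phase sequence; a large FFT size keeps cepstral aliasing negligible.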

Proceedings ArticleDOI
18 Mar 2005
TL;DR: A novel acoustic model of speech, based on statistical hidden trajectory modeling (HTM) with bi-directional vocal tract resonance (VTR) target filtering, with dramatic reduction of an upper error bound is achieved in the standard TIMIT phonetic recognition task using a large-scale N-best rescoring paradigm.
Abstract: We present a novel acoustic model of speech, based on statistical hidden trajectory modeling (HTM) with bi-directional vocal tract resonance (VTR) target filtering, for speech recognition. The HTM consists of two stages of the generative process of speech: from the phone sequence to VTR dynamics and then to the cepstrum-based acoustic observation. Two types of model implementation are detailed, one with straightforward two-stage cascading, and another which integrates over the statistical distribution of VTR in model construction and in computing acoustic likelihood. With the use of first-order Taylor series approximation to the nonlinearity in the VTR-to-cepstrum prediction component of HTM, the acoustic likelihood is established in an analytical form. It is a Gaussian with the time-varying mean that gives structured long-span context dependence over the entire utterance, and with the dynamically adjusted variance proportional to the squared "local slope" in the nonlinear mapping function from VTR to cepstrum. When the HTM parameters are trained via maximizing this "integrated" likelihood, dramatic reduction of an upper error bound is achieved in the standard TIMIT phonetic recognition task using a large-scale N-best rescoring paradigm.

Journal ArticleDOI
TL;DR: In this paper, the authors presented a detailed study on the suitability of homomorphic prediction as a formant tracking tool for high-pitched speech where linear prediction fails to obtain accurate estimation The formant frequencies estimated using the proposed method are found to be accurate by more than an order of magnitude compared to the conventional procedure.
Abstract: The conventional model of the linear prediction analysis suffers from difficulties in estimating vocal tract characteristics of high-pitched speakers This is because the autocorrelation function used by the autocorrelation method of linear prediction for estimating autoregressive coefficients is actually an aliased version of that of the vocal tract impulse response This aliasing occurs due to the periodic nature of voiced speech Generally it is accepted that homomorphic filtering can be used to obtain an estimate of vocal tract impulse response which is free from periodicity Thus linear prediction of the resulting vocal tract impulse response (referred to as homomorphic prediction) is expected to be free from variations of fundamental frequencies To our knowledge any experimental study, however, has not yet appeared on the suitability of this method for analyzing high-pitched speech This paper presents a detail study on the prospects of homomorphic prediction as a formant tracking tool especially for high-pitched speech where linear prediction fails to obtain accurate estimation The formant frequencies estimated using the proposed method are found to be accurate by more than an order of magnitude compared to the conventional procedure The accuracy of formant estimation is verified on synthetic vowels for a wide range of pitch periods covering typical male and high-pitched female speakers The validity of the proposed method is also examined by inspecting the spectral envelopes of natural speech spoken by high-pitched female speakers We noticed that almost all the previous methods dealing with this limitation of linear prediction are based on the covariance technique where the obtained AR filter can be unstable The solutions obtained by the current method are guaranteed to be stable which makes it superior for many speech analysis applications

Book ChapterDOI
TL;DR: A new approach is introduced and shown to provide accurate HNR measurements for synthesised glottal and voiced speech waveforms; the action of cepstral low-pass liftering and subsequent Fourier transformation is shown to be analogous to a moving average filter.
Abstract: The estimation of the harmonics-to-noise ratio (HNR) in voiced speech provides an indication of the ratio between the periodic to aperiodic components of the signal. Time-domain methods for HNR estimation are problematic because of the difficulty of estimating the period markers for (pathological) voiced speech. Frequency-domain methods encounter the problem of estimating the noise level at harmonic locations. Cepstral techniques have been introduced to supply noise estimates at all frequency locations in the spectrum. A detailed description of cepstral processing is provided in order to motivate its use as a HNR estimator. The action of cepstral low-pass liftering and subsequent Fourier transformation is shown to be analogous to the action of a moving average filter. Based on this description, shortcomings of two existing cepstral-based HNRs are illustrated and a new approach is introduced and shown to provide accurate HNR measurements for synthesised glottal and voiced speech waveforms.
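The equivalence between low-pass liftering and moving-average smoothing of the spectrum can be demonstrated directly: liftering in the cepstral domain is circular convolution of the log spectrum with a Dirichlet (periodic sinc) kernel. A numpy sketch (function names are ours):

```python
import numpy as np

def lifter_smooth(log_spec, q_max):
    """Keep cepstral bins |q| <= q_max, then transform back to the spectrum."""
    n = len(log_spec)
    c = np.fft.ifft(log_spec)
    lifter = np.zeros(n)
    lifter[:q_max + 1] = 1.0
    lifter[n - q_max:] = 1.0
    return np.fft.fft(c * lifter).real

def kernel_smooth(log_spec, q_max):
    """The same operation written as circular convolution with a Dirichlet
    kernel, i.e. a weighted moving average across the log spectrum."""
    n = len(log_spec)
    lifter = np.zeros(n)
    lifter[:q_max + 1] = 1.0
    lifter[n - q_max:] = 1.0
    kernel = np.fft.fft(lifter).real / n    # Dirichlet (periodic sinc) kernel
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return (log_spec[None, :] * kernel[idx]).sum(axis=1)
```

The two functions return identical results, which is the "liftering acts as a moving average filter" observation in concrete form: the smoothed curve passes between the harmonics and can serve as a noise baseline for HNR estimation.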

Book ChapterDOI
19 Apr 2005
TL;DR: An overview is given of advanced methods for inverse filtering: model based, adaptive iterative, higher order statistics and cepstral approaches are examined and the advantages and disadvantages of these methods are highlighted.
Abstract: Glottal inverse filtering is a technique used to derive the glottal waveform during voiced speech. Closed phase inverse filtering (CPIF) is a common approach for achieving this goal. During the closed phase there is no input to the vocal tract and hence the impulse response of the vocal tract can be determined through linear prediction. However, a number of problems are known to exist with the CPIF approach. This review paper briefly details the CPIF technique and highlights certain associated theoretical and methodological problems. An overview is then given of advanced methods for inverse filtering: model based, adaptive iterative, higher order statistics and cepstral approaches are examined. The advantages and disadvantages of these methods are highlighted. Outstanding issues and suggestions for further work are outlined.

Book ChapterDOI
19 Apr 2005
TL;DR: The present study highlights the cepstrum-based noise baseline estimation process; it is shown to be analogous to the action of a moving average filter applied to the power spectrum of voiced speech.
Abstract: Cepstral analysis is used to estimate the harmonics-to-noise ratio (HNR) in speech signals. The inverse Fourier transformed liftered cepstrum approximates a noise baseline from which the harmonics-to-noise ratio is estimated. The present study highlights the cepstrum-based noise baseline estimation process; it is shown to be analogous to the action of a moving average filter applied to the power spectrum of voiced speech. The noise baseline, which is taken to approximate the noise-excited vocal tract, is influenced by the window length and the shape of the glottal source spectrum. Two existing estimation techniques are tested systematically using synthetically generated glottal flow and voiced speech signals with a priori knowledge of the HNR. The source influence is removed using a novel harmonic pre-emphasis technique. The results indicate accurate HNR estimation using the present approach. A preliminary investigation of the method on a set of normal/pathological data is also presented.

01 Jan 2005
TL;DR: A hybrid neural network method is proposed for speech recognition that combines a Self-Organizing Map (SOM) and a Multilayer Perceptron (MLP) and improves recognition accuracy by up to about 4%.
Abstract: In this paper, a hybrid neural-network-based method is proposed for speech recognition. The proposed method combines a Self-Organizing Map (SOM), an unsupervised network, and a Multilayer Perceptron (MLP), a supervised network, for Malay speech recognition. After the acoustic preprocessing, where Linear Prediction Coding (LPC) is used to extract the acoustic information from the raw signal, a 2-dimensional (2D) self-organizing feature map is used as a feature extractor, acting as a sequential mapping function that transforms the acoustic vector sequences of the speech signal into trajectories. The SOM is used to produce the trajectory vector for classification: it converts the cepstrum vectors into a binary matrix with the same dimensions as the SOM. The idea behind this method is to accumulate all the winner nodes of a syllable utterance in a map of the same dimensions, where each winner node is scaled to the value "1" and all others to "0". This results in a binary pattern in the 2D map which represents the speech content. The transformation of the feature vectors by the SOM simplifies the classification task for the Multilayer Perceptron recognizer, which classifies the feature vectors corresponding to each utterance. Various experiments were conducted on 15 Malay syllables from one speaker (a speaker-dependent system) for the conventional technique (MLP only) and the proposed method (SOM and MLP). The proposed algorithm achieved better performance, improving recognition accuracy by up to about 4%.

Proceedings ArticleDOI
14 Nov 2005
TL;DR: Audio watermarking is classified into three categories: patchwork in the frequency domain, echo hiding in the time domain, and cepstrum-domain methods; experimental results show which scheme has good robustness against common signal processing manipulations.
Abstract: In this paper, we survey audio watermarking. The watermarking implementation techniques are briefly summarized and analyzed. Audio watermarking schemes are classified into three categories: patchwork in the frequency domain, echo hiding in the time domain, and cepstrum-domain methods. Experimental results show which scheme has good robustness against common signal processing manipulations.
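Of the three categories, echo hiding is the one the cepstrum serves most directly: the hidden delay reappears as a cepstral peak. A minimal numpy sketch (the delay, echo amplitude, and signal length are illustrative values, not taken from any surveyed scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
cover = rng.standard_normal(4096)

delay, alpha = 50, 0.8                     # illustrative echo parameters
stego = cover.copy()
stego[delay:] += alpha * cover[:-delay]    # echo hiding: y[n] = x[n] + a*x[n-d]

# Detection: the log spectrum of y gains a ripple log|1 + a*exp(-jwd)|,
# so the real cepstrum shows a peak at quefrency d.
spec = np.abs(np.fft.fft(stego)) + 1e-12
ceps = np.fft.ifft(np.log(spec)).real
detected = 20 + int(np.argmax(ceps[20:2048]))   # skip low quefrencies
```

In a real scheme the payload bits would toggle between two delays per block, and the decoder would compare the cepstral values at the two candidate quefrencies.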

Patent
19 May 2005
TL;DR: In this article, the authors proposed a speech recognition system for recognizing vowels and consonants according to myoelectric signals, which is based on the hidden Markov model.
Abstract: PROBLEM TO BE SOLVED: To provide a speech recognition device recognizing vowels and consonants according to myoelectric signals. SOLUTION: The speech recognition device 10 is equipped with: a myoelectric signal detecting section 101, which detects the myoelectric signals generated during an utterance action from a plurality of regions; an LPC analysis section 102, which separates the detected myoelectric signals into spectrum envelope portions and fine change portions based on linear prediction coefficient analysis; a feature extraction section 103, which calculates the linear prediction coefficient cepstrum from the separated spectrum envelope portions and, based on the results, calculates myoelectric signal feature values for each of the channels corresponding to the regions; a likelihood calculating section 106, which receives the calculated feature values of each channel as input vectors and calculates the likelihood based on a hidden Markov model; and a speech recognition section 107, which identifies the speech corresponding to the utterance action based on the calculated likelihood. COPYRIGHT: (C)2005,JPO&NCIPI

Journal ArticleDOI
TL;DR: In this article, the authors proposed multiple linear regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone, and extended the MRLS concept to nonlinear regressions.
Abstract: In this paper, we address issues in improving hands-free speech recognition performance in different car environments using multiple spatially distributed microphones. In previous work, we proposed multiple linear regression of the log spectra (MRLS) for estimating the log spectra of speech at a close-talking microphone. In this paper, the concept is extended to nonlinear regressions. Regressions in the cepstrum domain are also investigated. An effective algorithm is developed to adapt the regression weights automatically to different noise environments. Compared to the nearest distant microphone and an adaptive beamformer (Generalized Sidelobe Canceller), the proposed adaptive nonlinear regression approach yields average relative word error rate (WER) reductions of 58.5% and 10.3%, respectively, for isolated word recognition in 15 real car environments.
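A minimal sketch of the linear-regression idea (with synthetic data and hypothetical array shapes, not the paper's adaptation algorithm): estimate per-microphone weights that map distant-microphone log spectra onto the close-talking reference by least squares:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic log spectra: 3 distant microphones, 200 frames, 32 frequency bins
n_mics, n_frames, n_bins = 3, 200, 32
distant = rng.standard_normal((n_mics, n_frames, n_bins))

# pretend the close-talking log spectrum is a fixed mixture plus small noise
true_w = np.array([0.6, 0.3, 0.1])
close = np.tensordot(true_w, distant, axes=1) \
        + 0.01 * rng.standard_normal((n_frames, n_bins))

# least-squares estimate of the per-microphone regression weights
X = distant.reshape(n_mics, -1).T        # (frames * bins, mics)
y = close.reshape(-1)
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# regressed estimate of the close-talking log spectrum
estimate = np.tensordot(w_hat, distant, axes=1)
```

The paper's contribution goes beyond this: nonlinear and cepstrum-domain regressions, and automatic adaptation of the weights to unseen noise environments.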

Book ChapterDOI
31 Aug 2005
TL;DR: It is pointed out that the use of artificial reverberation leads to more robustness to noise in general and most TRAP-based features excel in phone recognition.
Abstract: In this paper we investigate the performance of TRAP features on clean and noisy data. Multiple feature sets are evaluated on a corpus recorded in clean and noisy environments. In addition, the clean version was reverberated artificially. The feature sets are assembled from selected energy bands: multiple recognizers are trained on different energy bands, and the outputs of all recognizers are combined with ROVER to obtain a single recognition result. This system is compared to a baseline recognizer that uses Mel-frequency cepstrum coefficients (MFCC). We show that the use of artificial reverberation leads to more robustness to noise in general. Furthermore, most TRAP-based features excel in phone recognition. While MFCC features prove to be better in a matched training/test situation, TRAP features clearly outperform them in a mismatched training/test situation: when we train on clean data and evaluate on noisy data, the word accuracy (WA) can be raised by 173% relative (from 12.0% to 32.8% WA).
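ROVER itself aligns the hypotheses into a word transition network before voting; as a toy sketch only, a majority vote over pre-aligned, equal-length hypotheses looks like this:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote over pre-aligned, equal-length word hypotheses.
    (Real ROVER first builds a word transition network by dynamic-programming
    alignment, so hypotheses of different lengths can be combined.)"""
    result = []
    for slot in zip(*[h.split() for h in hypotheses]):
        result.append(Counter(slot).most_common(1)[0][0])
    return " ".join(result)

# three band-limited recognizers disagreeing on single words
combined = rover_vote(["the cat sat", "the bat sat", "the cat mat"])
```

With one recognizer per energy band, errors confined to a single band can be outvoted by the others, which is the intuition behind the multi-band combination described above.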

Journal ArticleDOI
TL;DR: The model suggests that the linear transformation can be acquired through learning from actual acoustic signals and is expected to provide a useful feature extraction method that has often been given by the cepstrum analysis.
Abstract: In this letter, we propose a noisy nonlinear version of independent component analysis (ICA). Assuming that the probability density function (p.d.f.) of sources is known, a learning rule is derived based on maximum likelihood estimation (MLE). Our model involves some algorithms of noisy linear ICA (e.g., Bermond & Cardoso, 1999) or noise-free nonlinear ICA (e.g., Lee, Koehler, & Orglmeister, 1997) as special cases. Especially when the nonlinear function is linear, the learning rule derived as a generalized expectation-maximization algorithm has a similar form to the noisy ICA algorithm previously presented by Douglas, Cichocki, and Amari (1998). Moreover, our learning rule becomes identical to the standard noise-free linear ICA algorithm in the noiseless limit, while existing MLE-based noisy ICA algorithms do not rigorously include the noise-free ICA. We trained our noisy nonlinear ICA by using acoustic signals such as speech and music. The model after learning successfully simulates virtual pitch phenomena, and the existence region of virtual pitch is qualitatively similar to that observed in a psychoacoustic experiment. Although a linear transformation hypothesized in the central auditory system can account for the pitch sensation, our model suggests that the linear transformation can be acquired through learning from actual acoustic signals. Since our model includes a cepstrum analysis in a special case, it is expected to provide a useful feature extraction method that has often been given by the cepstrum analysis.
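In the noiseless limit the letter's learning rule reduces to standard noise-free linear ICA. A minimal natural-gradient sketch with a tanh score function (illustrative only, with synthetic Laplacian sources, not the letter's noisy nonlinear model):

```python
import numpy as np

rng = np.random.default_rng(2)

# two independent super-Gaussian (Laplacian) sources, linearly mixed
s = rng.laplace(size=(2, 5000))
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s

# noise-free natural-gradient ICA update: W <- W + mu * (I - E[tanh(y) y^T]) W
W = np.eye(2)
mu = 0.05
for _ in range(500):
    y = W @ x
    W += mu * (np.eye(2) - np.tanh(y) @ y.T / y.shape[1]) @ W

y = W @ x
# each recovered component should match one source up to scale and sign
corr = np.abs(np.corrcoef(np.vstack([y, s]))[:2, 2:])
```

The tanh score is appropriate for super-Gaussian sources such as speech; the letter's contribution is extending this family consistently to the noisy and nonlinear cases.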

Patent
13 Jun 2005
TL;DR: In this article, a post-CMN acoustic model is synthesized by obtaining an approximate cepstral mean (CM) of the training speech data and subtracting it from the mean parameter of each cepstrum distribution in an acoustic model created without CMN processing.
Abstract: PROBLEM TO BE SOLVED: To create a CMN acoustic model, it has been necessary to train on feature vectors obtained after CMN processing of a large amount of training speech data, which requires a great deal of time. SOLUTION: A post-CMN acoustic model is synthesized by obtaining an approximate cepstral mean (CM) of the training speech data and subtracting the obtained CM from the mean parameter of each cepstrum-related distribution in an acoustic model created without CMN processing, using either that model's parameters or the statistical information gathered while creating it. Speech recognition is then performed by computing likelihoods that collate this post-CMN acoustic model against feature vectors extracted by applying CMN processing to the speech signal to be recognized. COPYRIGHT: (C)2007,JPO&INPIT
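The core identity behind this synthesis is that shifting both the features and the Gaussian means by the same cepstral mean leaves the distances used in likelihood computation unchanged. A toy sketch (hypothetical shapes, single Gaussian mean):

```python
import numpy as np

rng = np.random.default_rng(3)

# cepstral features for one utterance and a Gaussian mean from a non-CMN model
features = rng.standard_normal((50, 13)) + 2.0    # (frames, cepstral dims)
model_mean = rng.standard_normal(13)

# approximate cepstral mean
cm = features.mean(axis=0)

# synthesize the post-CMN model: shift every Gaussian mean by the CM
cmn_model_mean = model_mean - cm

# apply CMN to the features themselves
cmn_features = features - cm

# squared distances entering the Gaussian likelihood are identical,
# since (x - cm) - (mu - cm) = x - mu
d_raw = np.sum((features - model_mean) ** 2, axis=1)
d_cmn = np.sum((cmn_features - cmn_model_mean) ** 2, axis=1)
```

When the approximate CM tracks the true utterance mean, the synthesized model therefore behaves like one trained directly on CMN-processed features, without the retraining cost.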

Proceedings Article
01 Jan 2005
TL;DR: The implementation presented reduces the run time required by the algorithm, depending on the cepstral order of the estimation, by a factor of 2 to 9, such that real-time processing becomes feasible.
Abstract: The following article presents a new real-time implementation of an iterative cepstrum-based spectral envelope estimation technique that was originally published under the name "true envelope". Because the original algorithm is hardly known outside Japan, we first describe the algorithm and compare it to the standard techniques, i.e., LPC and the discrete cepstrum. The estimation properties are compared, and it is shown that the true envelope estimator achieves convincing envelope estimates even for problematic, high-pitched signals. The algorithm is analyzed with the objective of finding an efficient implementation that reduces the computational complexity sufficiently for the algorithm to be used in real time within the phase vocoder. The implementation presented reduces the run time required by the algorithm, depending on the cepstral order of the estimation, by a factor of 2 to 9, such that real-time processing becomes feasible.
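The true-envelope iteration alternates cepstral smoothing with a pointwise maximum against the observed log spectrum, so the smoothed curve is driven up onto the spectral peaks. A compact sketch (synthetic comb spectrum, illustrative parameters, not the paper's optimized implementation):

```python
import numpy as np

def cepstral_smooth(log_spec, order):
    """Low-pass lifter: keep cepstral coefficients at quefrencies |q| < order."""
    c = np.fft.fft(log_spec)
    lifter = np.zeros(len(c))
    lifter[:order] = 1.0
    lifter[-(order - 1):] = 1.0          # symmetric counterpart of bins 1..order-1
    return np.fft.ifft(c * lifter).real

def true_envelope(log_spec, order, n_iter=30):
    """Iterate: smooth, then take the pointwise max of spectrum and envelope."""
    target = log_spec.copy()
    env = cepstral_smooth(target, order)
    for _ in range(n_iter):
        target = np.maximum(target, env)
        env = cepstral_smooth(target, order)
    return env

# synthetic harmonic log spectrum: comb of peaks over a flat floor
n = 256
log_spec = np.full(n, -2.0)
log_spec[::16] = 0.0                     # "harmonics" every 16 bins

env0 = cepstral_smooth(log_spec, 20)     # plain cepstral smoothing undershoots peaks
env = true_envelope(log_spec, 20)        # true-envelope refinement rides the peaks
```

Plain cepstral smoothing averages across peaks and valleys; the iteration removes the bias toward the valleys, which is what makes the estimator reliable for high-pitched signals with widely spaced harmonics.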

Journal ArticleDOI
TL;DR: Experimental results on speaker-independent male and female speech show that accurate voicing classification and fundamental frequency prediction are attained when compared to hand-corrected reference fundamental frequency measurements.
Abstract: This work proposes a method to reconstruct an acoustic speech signal solely from a stream of mel-frequency cepstral coefficients (MFCCs) as may be encountered in a distributed speech recognition (DSR) system. Previous methods for speech reconstruction have required, in addition to the MFCC vectors, fundamental frequency and voicing components. In this work the voicing classification and fundamental frequency are predicted from the MFCC vectors themselves using two maximum a posteriori (MAP) methods. The first method enables fundamental frequency prediction by modeling the joint density of MFCCs and fundamental frequency using a single Gaussian mixture model (GMM). The second scheme uses a set of hidden Markov models (HMMs) to link together a set of state-dependent GMMs, which enables a more localized modeling of the joint density of MFCCs and fundamental frequency. Experimental results on speaker-independent male and female speech show that accurate voicing classification and fundamental frequency prediction are attained when compared to hand-corrected reference fundamental frequency measurements. The use of the predicted fundamental frequency and voicing for speech reconstruction is shown to give very similar speech quality to that obtained using the reference fundamental frequency and voicing.
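For a single joint Gaussian, the MAP estimate of the fundamental frequency given the cepstral features is simply the conditional mean. A scalar toy version of that prediction step (synthetic data, not the paper's full GMM/HMM setup):

```python
import numpy as np

rng = np.random.default_rng(4)

# toy joint data: scalar "cepstral" feature c and fundamental frequency f
c = rng.standard_normal(2000)
f = 40.0 * c + 120.0 + 0.1 * rng.standard_normal(2000)   # near-linear relation

# fit one joint Gaussian to (c, f)
data = np.vstack([c, f])
mu = data.mean(axis=1)
cov = np.cov(data)

def predict_f0(c_obs):
    """Conditional mean E[f | c] of the joint Gaussian (the MAP estimate,
    since the Gaussian conditional mode and mean coincide)."""
    return mu[1] + cov[1, 0] / cov[0, 0] * (c_obs - mu[0])

f0_hat = predict_f0(1.0)
```

A GMM generalizes this to a weighted sum of per-component conditional means, with weights given by each component's posterior responsibility for the observed MFCC vector; the paper's HMM scheme additionally localizes the mixtures per state.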