
Showing papers on "Spectrogram published in 2004"


Journal ArticleDOI
TL;DR: Two missing-feature algorithms are presented that reconstruct complete spectrograms from incomplete noisy ones; they yield better overall recognition performance, are computationally cheaper, and do not require modification of the recognizer.

242 citations


Journal ArticleDOI
TL;DR: A unified theoretical picture of the time-corrected instantaneous frequency spectrogram is presented, together with detailed, implementable algorithms comparing three published techniques for its computation.
Abstract: A modification of the spectrogram (log magnitude of the short-time Fourier transform) to more accurately show the instantaneous frequencies of signal components was first proposed in 1976 [Kodera et al., Phys. Earth Planet. Inter. 12, 142-150 (1976)], and has been considered or reinvented a few times since but never widely adopted. This paper presents a unified theoretical picture of this time-frequency analysis method, the time-corrected instantaneous frequency spectrogram, together with detailed implementable algorithms comparing three published techniques for its computation. The new representation is evaluated against the conventional spectrogram for its superior ability to track signal components. The lack of a uniform framework for either mathematics or implementation details which has characterized the disparate literature on the schemes has been remedied here. Fruitful application of the method is shown in the realms of speech phonation analysis, whale song pitch tracking, and additive sound modeling.

198 citations
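The time correction at the heart of this representation can be sketched from STFT phase differences between successive hops: each bin's phase advance, rewrapped around the bin's nominal advance, points at the true frequency of the component dominating that bin. A minimal Python sketch with a pure test tone (all parameters are invented for illustration):

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 1000 * t)  # test tone at 1 kHz

hop = 64
f, tt, Z = signal.stft(x, fs, nperseg=256, noverlap=256 - hop)

# Channelized instantaneous frequency: phase advance between successive
# frames, rewrapped around each bin's nominal advance. Bins that leak
# energy from a nearby tone are "corrected" toward its true frequency.
dphi = np.angle(Z[:, 1:] * np.conj(Z[:, :-1]))     # wrapped phase step
nominal = 2 * np.pi * f[:, None] * hop / fs        # bin-centre phase step
corr = np.angle(np.exp(1j * (dphi - nominal)))     # wrapped deviation
cif = f[:, None] + corr * fs / (2 * np.pi * hop)   # corrected frequency
```

In the bin nearest 1 kHz, `cif` sits at the tone's true frequency rather than the bin centre; a full time-corrected spectrogram would also relocate energy along the time axis.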


Journal ArticleDOI
TL;DR: A new mask estimation technique is presented that uses a Bayesian classifier to determine the reliability of spectrographic elements and resulted in significantly better recognition accuracy than conventional mask estimation approaches.

170 citations


Proceedings Article
01 Dec 2004
TL;DR: This work formulates speech separation as the problem of segmenting the spectrogram of the signal into two or more disjoint sets, and develops an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures.
Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures.

103 citations


Journal Article
TL;DR: The detection system is capable of picking out a high proportion of right whale calls logged by a human operator, while at the same time working at a false alarm rate of only one or two calls per day, even in the presence of background noise from humpback whales and seismic exploration.
Abstract: A detector has been developed which can reliably detect right whale calls and distinguish them from those of other marine mammals and industrial noise. Detection is a two stage process. In the first, the spectrogram is smoothed by convolving it with a Gaussian kernel and the 'outlines' of sounds are extracted using an edge detection algorithm. This allows a number of parameters to be measured for each sound, including duration, bandwidth and details of the frequency contour such as the positions of maximum and minimum frequency. In the second stage, these parameters are used in a classification function in order to determine which sounds are from right whales. The classifier has been tuned by comparing data from a period when large numbers of right whales were known to be in the vicinity of bottom mounted recorders with data collected on days when it was believed, based on ship and aerial surveys, that no right whales were present. Overall, the detection system is capable of picking out a high proportion of right whale calls logged by a human operator, while at the same time working at a false alarm rate of only one or two calls per day, even in the presence of background noise from humpback whales and seismic exploration. Although it is impossible to reduce the false alarm rate for individual calls to zero whilst still maintaining adequate efficiency, by requiring the detection of several calls within a set waiting time, it is possible to reduce false alarm rate to a negligible level.

95 citations
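The first detection stage described above (Gaussian smoothing of the spectrogram, outline extraction, then measurement of parameters such as duration) can be sketched as follows; the synthetic "call" and every threshold here are hypothetical stand-ins, not the paper's values:

```python
import numpy as np
from scipy import ndimage, signal

fs = 1000
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(0)

# Synthetic tonal upsweep between 0.5 s and 1.5 s standing in for a call.
call = np.where((t > 0.5) & (t < 1.5),
                np.sin(2 * np.pi * (100 * t + 25 * (t - 0.5) ** 2)), 0.0)
x = call + 0.1 * rng.standard_normal(t.size)

# Stage 1a: spectrogram smoothed by convolution with a Gaussian kernel.
f, tt, S = signal.spectrogram(x, fs, nperseg=128, noverlap=96)
S_smooth = ndimage.gaussian_filter(S, sigma=1.5)

# Stage 1b: extract the sound's outline and measure its duration.
outline = np.abs(ndimage.sobel(S_smooth, axis=1)) > 5 * S_smooth.mean()
active = S_smooth.max(axis=0) > 0.1 * S_smooth.max()
duration = tt[active].max() - tt[active].min()
```

The measured duration comes out near the 1 s of the synthetic call; the paper's second stage would feed such parameters into a classifier.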


Journal ArticleDOI
TL;DR: Two techniques are proposed for handling convolutional distortion in 'missing data' speech recognition using spectral features, together with a method for handling reverberated speech that attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy.

81 citations


Journal Article
TL;DR: In this article, two methods for detecting the up call were compared: spectrogram correlation and a neural network. The neural network performed best, achieving an error rate of less than 6%, and is thus the preferred detection method here.
Abstract: North Atlantic, North Pacific, and southern right whales all produce the up call, a frequency-modulated upsweep in the 50-200 Hz range. This call is one of the most common sounds, and frequently the most common sound, received from right whales, and as such is a useful indicator of the presence of right whales for acoustic surveys. A data set was prepared of 1857 calls and 6359 non-call sounds recorded from North Atlantic right whales (Eubalaena glacialis) near Georgia and Massachusetts. Two methods for the detection of the calls were compared: spectrogram correlation and a neural network. Spectrogram correlation parameters were chosen two ways, by manual choice using a sample of 20 calls, and by an optimization procedure that used all available calls. Neural network weights were trained via backpropagation on 9/10 of the test data set. Performance was measured separately for calls of different signal-to-noise ratio, as SNR heavily influences the performance of any detector. Results showed that the neural network performed best at this task, achieving an error rate of less than 6%, and is thus the preferred detection method here. Spectrogram correlation may be useful in situations in which a large set of training data is not available, as manual training on a small set of examples achieved an error rate (26%) that may be acceptable for many applications.

78 citations
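Spectrogram correlation, the simpler of the two detectors, slides a time-frequency template of the expected upsweep over the spectrogram and reports where it scores highest. A rough sketch with entirely invented call parameters (the real templates were tuned on whale recordings):

```python
import numpy as np
from scipy import signal

fs = 1000
t = np.arange(0, 10.0, 1 / fs)
rng = np.random.default_rng(1)

# Noise with one synthetic up call (100 -> 200 Hz over 1 s) at t = 4 s.
x = 0.2 * rng.standard_normal(t.size)
m = (t >= 4.0) & (t < 5.0)
x[m] += np.sin(2 * np.pi * (100 * (t[m] - 4.0) + 50 * (t[m] - 4.0) ** 2))

f, tt, S = signal.spectrogram(x, fs, nperseg=256, noverlap=192)

# Template: unit ridge along the expected upsweep, made zero-mean so that
# flat noise regions score near zero.
kt = tt[tt < 1.0] - tt[0]
kernel = np.zeros((f.size, kt.size))
for j, tau in enumerate(kt):
    kernel[np.argmin(np.abs(f - (100 + 100 * tau))), j] = 1.0
kernel -= kernel.mean()

score = signal.correlate2d(S, kernel, mode="same")
det_time = tt[score.max(axis=0).argmax()]
```

`det_time` lands inside the synthetic call; thresholding `score`, and requiring several detections within a waiting time as the right-whale paper above does, would turn this into a detector.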


Journal ArticleDOI
TL;DR: It is shown that the windowing operation imposes a lower bound on the local uncertainty product of the spectrogram, whereas the uncertainty product for the average local standard deviations is always less than or equal to the standard uncertainty product and can be arbitrarily small.
Abstract: We address the issue of the relation between local quantities and the uncertainty principle. We approach the problem by defining local quantities as conditional standard deviations, and we relate these to the uncertainty product appearing in the standard uncertainty principle. We show that the uncertainty product for the average local standard deviations is always less than or equal to the standard uncertainty product and that it can be arbitrarily small. We apply these results to the short-time Fourier transform/spectrogram to explore the commonly held notion that the uncertainty principle somehow limits local quantities. We show that, indeed, for the spectrogram, there is a lower bound on the local uncertainty product of the spectrogram due to the windowing operation of this method. This limitation is an inherent property of the spectrogram and is not a property of the signal or a fundamental limit. We also examine the local uncertainty product for a large class of time-frequency distributions that satisfy the usual uncertainty principle, including the Wigner distribution, the Choi-Williams distribution, and many other commonly used distributions. We obtain an expression for the local uncertainty product in terms of the signal and show that for these distributions, the local uncertainty product is less than that of the spectrogram and can be arbitrarily small. Extension of our approach to an entropy formulation of the uncertainty principle is also considered.

72 citations
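In the paper's setting (notation reconstructed here, not quoted), the local quantities are conditional standard deviations of a time-frequency distribution, compared against the global uncertainty product:

```latex
\sigma_{\omega|t}^2 = \int \bigl(\omega - \langle\omega\rangle_t\bigr)^2 \, P(\omega|t)\, d\omega,
\qquad
\langle\omega\rangle_t = \int \omega \, P(\omega|t)\, d\omega ,
```

and for a unit-energy signal the standard (global) uncertainty principle reads \(\sigma_t \, \sigma_\omega \ge 1/2\). The paper's point is that the time average of \(\sigma_{\omega|t}\) is not constrained by this global bound for distributions such as the Wigner or Choi-Williams distribution, while for the spectrogram the analysis window enforces a nonzero lower bound on the local product.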


Journal ArticleDOI
TL;DR: In this paper, frequency-domain analysis of the genomes of various organisms was performed using tricolor spectrograms, identifying several types of distinct visual patterns that characterize specific DNA regions.
Abstract: We perform frequency-domain analysis of the genomes of various organisms using tricolor spectrograms, identifying several types of distinct visual patterns characterizing specific DNA regions. We relate patterns and their frequency characteristics to the sequence characteristics of the DNA. At times, the spectrogram patterns can be related to the structure of the corresponding protein region by using various public databases such as GenBank. Some patterns are explained from the biological nature of the corresponding regions, which relate to chromosome structure and protein coding, and some patterns have as yet unknown biological significance. We found biologically meaningful patterns on scales ranging from millions of base pairs down to a few hundred base pairs. Chromosome-wide patterns include periodicities ranging from 2 to 300. The color of the spectrogram depends on the nucleotide content at specific frequencies, and therefore can be used as a local indicator of CG content and other measures of relative base content. Several smaller-scale patterns are found to represent different types of domains made up of various tandem repeats.

69 citations
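The frequency-domain view of DNA rests on mapping the symbolic sequence to numeric indicator sequences and examining their DFTs; the classic signature of protein-coding regions is a period-3 peak. A toy sketch (the sequence below is synthetic, not real genomic data):

```python
import numpy as np

rng = np.random.default_rng(2)
# 600 bp of a repeating codon (strong period-3 structure) followed by
# 600 bp of random sequence; real coding regions show a weaker peak.
seq = "ATG" * 200 + "".join(rng.choice(list("ACGT"), 600))
N = len(seq)

# One binary indicator sequence per nucleotide; sum their power spectra.
indicators = {b: np.array([c == b for c in seq], float) for b in "ACGT"}
spectrum = sum(np.abs(np.fft.rfft(u)) ** 2 for u in indicators.values())
spectrum[0] = 0.0                      # drop the DC (base-composition) term
peak_period = N / (spectrum[1:].argmax() + 1)
```

`peak_period` comes out at 3, the codon periodicity; in the tricolor display, three such indicator spectra are mapped to the red, green and blue channels of the spectrogram image.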


Journal ArticleDOI
TL;DR: In this paper, the application of four time-dependent parameters (the instantaneous energy, mean and median frequencies, and bandwidth) to the detection and diagnosis of localised gear faults and gear wear is discussed.
Abstract: Time–frequency methods, which can lead to the clear identification of the nature of faults, are widely used to describe machine condition. The capability of time–frequency distributions to detect abnormalities can be further improved when their low-order frequency moments (or time-dependent parameters), which characterise the dynamic behaviour of the observed signal with a few parameters, are considered. This paper presents the application of four time-dependent parameters (the instantaneous energy, mean and median frequencies, and bandwidth) based upon the spectrogram and the scalogram, and compares their abilities in the detection and diagnosis of localised gear faults and gear wear. It has been found that scalogram-based parameters are superior to spectrogram-based ones in detecting and locating a local tooth defect, even when the gear load is small, while both yield equally useful parameters for revealing gear wear. Moreover, the global values of these time-dependent parameters are found to be very useful and provide a very good basis for reflecting not only the presence of gear damage, but also any change in operating gear load.

55 citations
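The low-order frequency moments used above reduce each spectrogram column to a few numbers. A minimal sketch of the time-dependent mean frequency and bandwidth (the test signal and all parameters are invented for illustration):

```python
import numpy as np
from scipy import signal

fs = 2000
t = np.arange(0, 2.0, 1 / fs)
x = np.sin(2 * np.pi * (200 * t + 50 * t ** 2))   # stand-in vibration signal

f, tt, S = signal.spectrogram(x, fs, nperseg=256, noverlap=192)

# Treat each column as a distribution over frequency and take its first
# two moments: time-dependent mean frequency and bandwidth.
p = S / S.sum(axis=0, keepdims=True)
mean_f = (f[:, None] * p).sum(axis=0)
bandwidth = np.sqrt(((f[:, None] - mean_f) ** 2 * p).sum(axis=0))
```

For the chirp above, `mean_f` tracks the instantaneous frequency 200 + 100 t; in the gear-diagnosis setting, a localized fault shows up as transient excursions in such curves.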


Proceedings Article
01 Nov 2004
TL;DR: The introduction of SDIF as a data format for analysis data and the ability to use long and multichannel sound files take the application to a new level of usability.
Abstract: AudioSculpt is an application for the musical analysis and processing of sound files. The program allows very detailed study of a sound's spectrum, waveform, fundamental frequency and partial contents. Multiple algorithms provide automatic segmentation of sounds. All analyses can be edited, stored and used to guide processing within the application, such as spectral filtering, time-stretching and noise removal, or serve as input for compositional environments. The current version is a complete revision, introducing many new features, new analysis, processing and segmentation algorithms plus a significantly enhanced user interface. The introduction of SDIF as a data format for analysis data and the ability to use long and multichannel sound files take the application to a new level of usability.

Journal ArticleDOI
Christophe Dorrer1, Inuk Kang1
TL;DR: In this paper, the authors demonstrate the first real-time implementation of linear spectrograms for the complete characterization of a train of optical pulses. The spectrogram is composed of the spectra of the train of pulses, measured with a fast-scanning microelectromechanical Fabry-Pérot etalon after gating with an electroabsorption modulator at various relative temporal delays programmed with a voltage-controlled phase shifter.
Abstract: We demonstrate the first real-time implementation of linear spectrograms for the complete characterization of a train of optical pulses. The spectrogram is composed of the spectra of the train of pulses, measured with a fast scanning microelectromechanical Fabry-Pérot etalon, after gating with an electroabsorption modulator at various relative temporal delays programmed with a voltage-controlled phase shifter. The temporal intensity and phase of the train of pulses are retrieved using an iterative deconvolution algorithm with an update rate of 9 Hz allowing real-time optimization and feedback. This diagnostic is validated on 40-Gb/s pulses generated by a LiNbO3 Mach-Zehnder modulator.

Proceedings ArticleDOI
01 Jan 2004
TL;DR: The applicability of advanced digital signal processing algorithms to the analysis of heart sound signals is demonstrated and the development of a PDA-based biomedical instrument capable of acquisition, processing, and analysis of heart sounds is described.
Abstract: In this paper we demonstrate the applicability of advanced digital signal processing algorithms to the analysis of heart sound signals and describe the development of a PDA-based biomedical instrument capable of acquisition, processing, and analysis of heart sounds. Fourier transform-based spectral analysis of heart sounds was carried out first to show the differences in the frequency contents of normal and abnormal heart sounds. As the time-varying nature of heart sounds calls for better techniques capable of analyzing such signals, short-time Fourier transform (STFT) or spectrogram analysis was performed next. This method performed remarkably well in displaying frequency, magnitude, and time information of the heart sounds, providing robust parameters to make accurate diagnosis. With continuous technological advancements in computing and biomedical instrumentation, and the concurrent popularity of handheld instruments in the medical community, we introduce the concept of PDA-based digital phonocardiography. A prototype system is comprised of a digital stethoscope and a pocket PC. Heart sounds are recorded and displayed in the pocket PC screen. Advanced signal processing algorithms are implemented using the combined capabilities of software tools such as LabVIEW and embedded Visual C++.

Patent
15 Dec 2004
TL;DR: Melody extraction or automatic transcription can be implemented much more stably, and potentially at lower cost, by exploiting the assumption that the main melody is the portion of a piece of music that a listener perceives as loudest and most distinct.
Abstract: The finding of the present invention is that melody extraction or automatic transcription can be implemented much more stably, and potentially at lower cost, when sufficient weight is given to the assumption that the main melody is the portion of a piece of music that a listener perceives as loudest and most distinct. Accordingly, the time/spectral representation, or spectrogram, of the audio signal of interest is scaled using the equal-loudness curves that reflect human volume perception, in order to determine the melody of the audio signal on the basis of the resulting perception-related time/spectral representation.

Journal ArticleDOI
TL;DR: A computer-based system has been designed for easy measurement and analysis of lung sounds using the software package DasyLAB; it is able to digitally record lung sounds captured with an electronic stethoscope plugged into a sound card on a portable computer.
Abstract: Listening to various lung sounds has proven to be an important diagnostic tool for detecting and monitoring certain types of lung diseases. In this study a computer-based system has been designed for easy measurement and analysis of lung sounds using the software package DasyLAB. The designed system presents the following features: it is able to digitally record the lung sounds, which are captured with an electronic stethoscope plugged into a sound card on a portable computer; display the lung sound waveform for auscultation sites; record the lung sound into ASCII format; acoustically reproduce the lung sound; edit and print the sound waveforms; display its time-expanded waveform; compute the Fast Fourier Transform (FFT); and display the power spectrum and spectrogram.

PatentDOI
Thanasis Loupas1
TL;DR: In this paper, an ultrasonic diagnostic imaging system and method are described by which a user can delineate a region of interest (122, 128) in a colorflow Doppler image.
Abstract: An ultrasonic diagnostic imaging system and method are described by which a user can delineate a region of interest (122, 128) in a colorflow Doppler image. The ultrasound system processes the Doppler pixel information of the region of interest (122, 128) to produce a spectrogram illustrating motion at the region of interest (122, 128) as a function of time. In a preferred embodiment the Doppler pixel information is processed by histograms to produce the spectrogram data.

Journal ArticleDOI
TL;DR: This model can determine delay times for three or more closely spaced objects with an accuracy of about 1 µs when all the objects are located within 30 µs of delay separation, while the cross-correlation method is hard to apply to these problems.
Abstract: Using frequency-modulated echolocation, bats can discriminate the range of objects with an accuracy of less than a millimeter. However, bats' echolocation mechanism is not well understood. The delay separation of three or more closely spaced objects can be determined through analysis of the echo spectrum. However, delay times cannot be properly correlated with objects using only the echo spectrum because the sequence of delay separations cannot be determined without information on temporal changes in the interference pattern of the echoes. To illustrate this, Gaussian chirplets with a carrier frequency compatible with bat emission sweep rates were used. The delay time for object 1, T1, can be estimated from the echo spectrum around the onset time. The delay time for object 2 is obtained by adding T1 to the delay separation between objects 1 and 2 (extracted from the first appearance of interference effects). Further objects can be located in sequence by this same procedure. This model can determine delay times for three or more closely spaced objects with an accuracy of about 1 µs, when all the objects are located within 30 µs of delay separation. This model is applicable for the range discrimination of objects having different reflected intensities and in a noisy environment (0-dB signal-to-noise ratio) while the cross-correlation method is hard to apply to these problems.
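The interference pattern that encodes delay separation is visible in the power spectrum of overlapping echoes: two arrivals Δτ apart produce spectral ripple with period 1/Δτ, which a cepstrum-style analysis recovers. A toy sketch using a broadband click instead of an FM chirplet, purely to keep the illustration short (all numbers invented):

```python
import numpy as np

fs = 1_000_000                       # 1 MHz sampling
rng = np.random.default_rng(6)
pulse = rng.standard_normal(200) * np.hanning(200)   # broadband click

d = int(20e-6 * fs)                  # 20 us delay separation
echo = np.zeros(pulse.size + d)
echo[:pulse.size] += pulse
echo[d:] += 0.8 * pulse              # weaker second reflection

# Ripple with period 1/dt in the power spectrum -> cepstral peak at dt.
spec = np.abs(np.fft.rfft(echo)) ** 2
ceps = np.abs(np.fft.irfft(np.log(spec + 1e-12)))
q = np.arange(ceps.size) / fs        # quefrency axis in seconds
half = ceps.size // 2
est = q[5:half][ceps[5:half].argmax()]
```

`est` recovers the 20 µs separation; the paper's model goes further by using the *time of first appearance* of such interference to order the reflectors.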

Proceedings ArticleDOI
21 Jul 2004
TL;DR: In this paper, a differential feature-based classifier was proposed to address the problem of determining if a structural change has actually occurred in the ultrasonic wave field, and the results showed that both types of classifiers were successful in discriminating between environmental and structural changes.
Abstract: Diffuse ultrasonic signals received from ultrasonic sensors which are permanently mounted near, on or in critical structures of complex geometry are very difficult to interpret because of multiple modes and reflections constructively and destructively interfering. Both changing environmental and structural conditions affect the ultrasonic wave field, and the resulting changes in the received signals are similar and of the same magnitude. This paper describes a differential feature-based classifier approach to address the problem of determining if a structural change has actually occurred. Classifiers utilizing time and frequency domain features are compared to classifiers based upon time-frequency representations. Experimental data are shown from a metallic specimen subjected to both environmental changes and the introduction of artificial damage. Results show that both types of classifiers are successful in discriminating between environmental and structural changes. Furthermore, classifiers developed for one particular structure were successfully applied to a second one that was created by modifying the first structure. Best results were obtained using a classifier based upon features calculated from time-frequency regions of the spectrogram.

Proceedings ArticleDOI
17 May 2004
TL;DR: The proposed hidden Markov model based frequency bandwidth extension algorithm using line spectral frequencies (HMM-LSF-FBE) outperforms the traditional method by completely eliminating undesired whistling sounds, and the bandwidth-extended signals are significantly more pleasant to the human ear than the original narrowband speech signals from which they are derived.
Abstract: A new hidden Markov model (HMM) based frequency bandwidth extension algorithm using line spectral frequencies (HMM-LSF-FBE) is proposed. The proposed algorithm improves the performance of the traditional LSF-based extension algorithm by exploiting an HMM to indicate the proper representatives of different speech frames, and by applying a minimum mean square criterion to estimate the high-band LSF values. The proposed algorithm has been tested and compared to the traditional LSF-based algorithm in terms of the perceptual evaluation of speech quality (PESQ) objective measure and speech spectrograms. Simulation results show that the proposed algorithm outperforms the traditional method by eliminating undesired whistling sounds completely. In addition, the bandwidth extended speech signals created by the proposed algorithm are significantly more pleasant to the human ear than the original narrowband speech signals from which they are derived.

Proceedings ArticleDOI
17 May 2004
TL;DR: Informal listening tests and analysis of spectrograms reveal that speech reconstructed solely from the MFCC vectors is almost indistinguishable from that using the reference pitch.
Abstract: The paper proposes a technique for reconstructing an acoustic speech signal solely from a stream of Mel-frequency cepstral coefficients (MFCCs). Previous speech reconstruction methods have required an additional pitch element, but this work proposes two maximum a posteriori (MAP) methods for predicting pitch from the MFCC vectors themselves. The first method is based on a Gaussian mixture model (GMM) while the second scheme utilises the temporal correlation available from a hidden Markov model (HMM) framework. A formal measurement of both frame classification accuracy and RMS pitch error shows that an HMM-based scheme with 5 clusters per state is able to classify correctly over 94% of frames and has an RMS pitch error of 3.1 Hz in comparison to a reference pitch. Informal listening tests and analysis of spectrograms reveal that speech reconstructed solely from the MFCC vectors is almost indistinguishable from that using the reference pitch.

Journal ArticleDOI
TL;DR: Simulation results show that the adaptive window zero-crossing-based IF estimation method is superior to fixed window methods and is also better than adaptive spectrogram and adaptive Wigner-Ville distribution (WVD)-based IF estimators for different signal-to-noise ratios (SNRs).
Abstract: We address the problem of estimating the instantaneous frequency (IF) of a real-valued constant amplitude time-varying sinusoid. Estimation of polynomial IF is formulated using the zero-crossings of the signal. We propose an algorithm to estimate nonpolynomial IF by local approximation using a low-order polynomial, over a short segment of the signal. This involves the choice of window length to minimize the mean square error (MSE). The optimal window length found by directly minimizing the MSE is a function of the higher-order derivatives of the IF which are not available a priori. However, an optimum solution is formulated using an adaptive window technique based on the concept of intersection of confidence intervals. The adaptive algorithm enables minimum MSE-IF (MMSE-IF) estimation without requiring a priori information about the IF. Simulation results show that the adaptive window zero-crossing-based IF estimation method is superior to fixed window methods and is also better than adaptive spectrogram and adaptive Wigner-Ville distribution (WVD)-based IF estimators for different signal-to-noise ratios (SNRs).
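The starting point of the zero-crossing approach can be sketched directly: each interval between successive zero crossings is half a period, yielding one local IF sample, which the algorithm then fits with a low-order polynomial over an adaptively sized window. A minimal sketch of that first step (the test chirp is invented):

```python
import numpy as np

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * (400 * t + 50 * t ** 2))   # IF = 400 + 100 t Hz

# Zero-crossing instants, refined by linear interpolation between samples.
i = np.where(np.diff(np.signbit(x)))[0]
zc = (i + x[i] / (x[i] - x[i + 1])) / fs

# Each half-period between crossings yields one IF sample.
inst_t = 0.5 * (zc[1:] + zc[:-1])
inst_f = 0.5 / np.diff(zc)
```

A noise-robust estimator would now fit a low-order polynomial to `inst_f` over a window whose length is chosen adaptively (the paper uses intersection of confidence intervals for that choice).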

Journal ArticleDOI
TL;DR: A comparison is made with real fusion plasma signals that shows the advantages of the Choi–Williams distribution over wavelets as a complementary tool to the spectrogram.
Abstract: The continuous wavelet transform scalogram, and recently the Choi–Williams distribution, have both been used to improve upon the short-time Fourier transform spectrogram in the analysis of some nonstationary phenomena in fusion plasmas. Here, a comparison is made with real fusion plasma signals that shows the advantages of the Choi–Williams distribution over wavelets as a complementary tool to the spectrogram.

Patent
Jeffrey D. Earls1
29 Jul 2004
TL;DR: In this article, a sequence of frequency masks over a period of time is generated according to a frequency trajectory, frequency hops or other complex frequency events expected in the signal to form a spectrogram mask.
Abstract: A spectrogram mask trigger is generated in response to multiple or complex frequency events within a signal being monitored. A sequence of frequency masks over a period of time is generated according to a frequency trajectory, frequency hops or other complex frequency events expected in the signal to form a spectrogram mask. The spectrogram mask is then applied to multiple spectra or spectrogram of the signal to determine whether an anomalous frequency event has occurred within the time period or to identify a particular frequency pattern within the signal. Depending upon the results of the spectrogram mask application, the spectrogram mask trigger is generated for storing a block of data from the signal surrounding the triggering event for further analysis.
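The idea can be mimicked in software: build a time-frequency mask of allowed behaviour, then trigger when spectrogram energy falls outside it. A toy sketch with an invented frequency-hop anomaly and arbitrary thresholds (the patent's mask can additionally follow a trajectory or hop sequence over time):

```python
import numpy as np
from scipy import signal

fs = 10_000
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(3)

# Expected carrier at 1 kHz; an anomalous second tone appears at t = 1.2 s.
x = np.sin(2 * np.pi * 1000 * t) + 0.05 * rng.standard_normal(t.size)
x[t > 1.2] += np.sin(2 * np.pi * 3000 * t[t > 1.2])

f, tt, S = signal.spectrogram(x, fs, nperseg=256, noverlap=128)

# Spectrogram mask: the band the signal is allowed to occupy (static here).
allowed = (f > 900) & (f < 1100)
violation = S[~allowed].sum(axis=0) / S.sum(axis=0)
trigger_time = tt[np.argmax(violation > 0.3)]   # first offending column
```

The trigger fires at the first spectrogram column whose out-of-mask energy ratio exceeds the threshold; a capture system would then store the surrounding block of raw data for analysis.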

Journal Article
TL;DR: The results obtained show that STFT analysis with different window scales provides a clear comprehension of the cardiac events in both the time and frequency domains.
Abstract: Heart sound is a highly nonstationary signal, and the Short-Time Fourier Transform (STFT) is an effective method for analysing this kind of signal. Because of the nonstationarity of the phonocardiogram, it is important to keep the analysis window as short as possible to guarantee the stationarity hypothesis over the small analysed segments; this, however, reduces the frequency resolution of the resulting spectrogram. By adjusting the sliding time window, an acceptable compromise can be reached. The spectrogram is calculated by first using a short sliding window to generate a temporal representation of the PCG, and then a longer sliding window to generate a spectral representation of the PCG power. The resolution of such representations depends directly on the sliding window length. The temporal representation allows heart sounds and cardiac cycle durations to be measured, whereas the spectrum, assuming a good frequency resolution, allows spectral characterization of each heart sound. The results we obtained on normal PCG signals show that STFT analysis with different window scales provides a clear comprehension of the cardiac events in both the time and frequency domains.
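The window-length trade-off described above is easy to demonstrate: a short window resolves *when* each sound occurs, a long one resolves its spectral content. A sketch with two synthetic tone bursts standing in for S1 and S2 (all values invented):

```python
import numpy as np
from scipy import signal

fs = 4000
t = np.arange(0, 1.0, 1 / fs)
x = np.zeros_like(t)
for onset, freq in [(0.10, 50.0), (0.45, 120.0)]:   # mock S1 and S2 bursts
    m = (t >= onset) & (t < onset + 0.08)
    x[m] = np.sin(2 * np.pi * freq * (t[m] - onset)) * np.hanning(m.sum())

# Short window: fine time resolution, coarse frequency bins.
f1, t1, S1 = signal.spectrogram(x, fs, nperseg=64, noverlap=48)
# Long window: fine frequency bins, smeared onsets.
f2, t2, S2 = signal.spectrogram(x, fs, nperseg=512, noverlap=384)
```

Bin spacing is fs/nperseg (62.5 Hz versus 7.8 Hz here) while the time step is nperseg − noverlap samples, so neither window answers both questions; the paper's compromise is to compute both representations.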

Proceedings ArticleDOI
04 Oct 2004
TL;DR: Results of experiments with good and intentionally bad pronunciations of a single speaker showed that all the students are acoustically located between the two pronunciations, indicating that all the students are judged to be acoustically closer to the speaker than the speaker himself is.
Abstract: The speech representation provided by acoustic phonetics, the spectrogram, is a very noisy representation in that it shows every acoustic aspect of speech. Age, gender, size, shape, microphone, room and line are completely irrelevant to speech recognition, pronunciation assessment, and so on, but the spectrogram is easily affected by these factors. This is the essential reason why speech systems are sometimes unreliable, and the author argues that education should not have to endure this characteristic. The author proposes a novel acoustic representation of speech in which none of the above factors is represented. The method was derived by implementing structural phonology on physics. This paper examines whether the new representation of speech can provide a good tool for pronunciation assessment. Results of experiments with good and intentionally bad pronunciations of a single speaker showed that all the students are acoustically located between the two pronunciations, indicating that all the students are judged to be acoustically closer to the speaker than the speaker himself is. This result shows that the proposed method can remove the irrelevant factors and is extremely reliable and effective in CALL.

Proceedings Article
01 Jan 2004
TL;DR: A technique is presented to estimate a soft mask that weights the frequency sub-bands of the mixed signal, so that the speech signal can be reconstructed from the estimated power spectrum of the speaker of interest.
Abstract: Single-channel speaker separation attempts to extract a speech signal uttered by the speaker of interest from a signal containing a mixture of auditory signals. Most algorithms that deal with this problem are based on masking, where reliable components of the mixed-signal spectrogram are inverted to obtain the speech signal of the speaker of interest. To date, most techniques estimate this mask in a binary fashion, resulting in a hard mask. We present a technique to estimate a soft mask that weights the frequency sub-bands of the mixed signal. The speech signal can then be reconstructed from the estimated power spectrum of the speaker of interest. Experimental results shown in this paper demonstrate that the results are better than those obtained by estimating the hard mask.
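In the oracle case, the soft-mask idea reduces to a Wiener-style ratio of power spectra. A toy sketch with two sinusoids standing in for the two speakers; a real system must of course estimate these powers, e.g. from speaker models, rather than use the true sources:

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
s1 = np.sin(2 * np.pi * 440 * t)      # "speaker of interest"
s2 = np.sin(2 * np.pi * 1320 * t)     # interfering "speaker"
mix = s1 + s2

f, tt, Z = signal.stft(mix, fs, nperseg=256)
_, _, Z1 = signal.stft(s1, fs, nperseg=256)
_, _, Z2 = signal.stft(s2, fs, nperseg=256)

# Soft (ratio) mask: weights in [0, 1] rather than a binary hard mask.
mask = np.abs(Z1) ** 2 / (np.abs(Z1) ** 2 + np.abs(Z2) ** 2 + 1e-12)
_, s1_hat = signal.istft(mask * Z, fs, nperseg=256)
s1_hat = s1_hat[:s1.size]
```

A hard mask would round each weight to 0 or 1; the soft mask degrades more gracefully in time-frequency cells where both sources carry comparable energy.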

Patent
16 Jun 2004
TL;DR: In this article, a two-dimensional spectrogram of the audio portion of a multimedia signal is computed, and one or more morphological operators are applied to the spectrogram to create a spectral peak track image.
Abstract: A method and system for analyzing an audio signal through the use of a spectrogram image of the audio signal. A two-dimensional spectrogram of the audio portion of a multimedia signal is computed, and one or more morphological operators are applied to the spectrogram to create a spectral peak track image of the audio signal. Application of the morphological operators can extract the spectral peak tracks from background noise of the audio signal to show temporal patterns and spectral distribution of speech and music components of the audio signal. The spectral peak track image is analyzed to distinguish the speech and/or music content of the audio signal.
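Morphological operators on a spectrogram work exactly as in ordinary image processing. A sketch using greyscale opening with a time-oriented structuring element, which keeps persistent spectral peak tracks and drops short-lived noise speckle (the signal and the element size are invented; the patent combines several such operators before classifying speech versus music):

```python
import numpy as np
from scipy import ndimage, signal

fs = 8000
t = np.arange(0, 1.0, 1 / fs)
rng = np.random.default_rng(5)
x = np.sin(2 * np.pi * 500 * t) + 0.5 * rng.standard_normal(t.size)

f, tt, S = signal.spectrogram(x, fs, nperseg=256, noverlap=192)

# Greyscale opening (erosion then dilation) along the time axis: a value
# survives only if it persists across 9 consecutive frames, so noise
# speckle is removed while the steady 500 Hz track remains.
opened = ndimage.grey_opening(S, size=(1, 9))
track = opened > 0.5 * opened.max()
rows = np.where(track.any(axis=1))[0]
```

Only the rows around the persistent 500 Hz tone survive the opening and threshold; on real audio, the resulting peak-track image is what gets classified.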

Journal ArticleDOI
TL;DR: It is shown that this representation outperforms the robust Wigner-Ville distribution (r-WVD) and the robust spectrogram in terms of artifacts suppression and high time-frequency resolution for this class of signals.

DOI
01 Jan 2004
TL;DR: This thesis proposes and evaluates new feature-based approaches for improving ASR noise robustness, suggests a new approach that uses a soft-masking procedure instead of discarding the non-peak spectral components completely, and analyzes an existing data-driven approach called TANDEM.
Abstract: Robustness against external noise is an important requirement for automatic speech recognition (ASR) systems when it comes to deploying them for practical applications. This thesis proposes and evaluates new feature-based approaches for improving ASR noise robustness. These approaches are based on nonlinear transformations that, when applied to the spectrum or feature, aim to emphasize the part of the speech that is relatively more invariant to noise and/or deemphasize the part that is more sensitive to noise. Spectral peaks constitute the high signal-to-noise-ratio part of the speech, so an efficient parameterization of the components only at the peak locations is expected to improve noise robustness. Evaluating this requires estimating the peak locations. Two methods proposed in this thesis for the peak estimation task are: 1) a frequency-based dynamic programming (DP) algorithm, which uses the spectral slope values of a single time frame, and 2) an HMM/ANN based algorithm, which uses distinct time-frequency (TF) patterns in the spectrogram (thus imposing temporal constraints during peak estimation). The unsupervised learning of the distinct TF patterns makes the HMM/ANN based algorithm sensitive to energy fluctuations in the TF patterns, which is not the case with the frequency-based DP algorithm. For an efficient parameterization of the spectral components around the peak locations, parameters describing the activity pattern (energy surface) within local TF patterns around the spectral peaks are computed and used as features. These features, referred to as spectro-temporal activity pattern (STAP) features, show improved noise robustness, but they are inferior to the standard features on clean speech. The main reason for this is the complete masking of the non-peak regions of the spectrum, which also carry significant information required for clean speech recognition.
This leads to the development of a new approach that uses a soft-masking procedure instead of discarding the non-peak spectral components completely. In this approach, referred to as the phase autocorrelation (PAC) approach, noise robustness is addressed in the autocorrelation domain (the time-domain Fourier equivalent of the power spectral domain). It uses the phase (i.e., angle) variation of the signal vector over time as a measure of correlation, as opposed to regular autocorrelation, which uses the dot product. This alternative measure of autocorrelation, referred to as PAC, is motivated by the fact that the angle is less disturbed by additive disturbances than the dot product is. Interestingly, using PAC has the effect of emphasizing the peaks and smoothing out the valleys in the spectral domain without explicitly estimating the peak locations. PAC features exhibit improved noise robustness; however, even the soft-masking strategy tends to degrade clean speech recognition performance. This points to the fact that externally designed transformations, which do not take full account of the underlying complexity of the speech signal, may be unable to improve robustness without hurting clean speech recognition. A better approach in this case is to learn the transformation from the speech data itself in a data-driven manner, balancing improved noise robustness against keeping clean performance intact. An existing data-driven approach called TANDEM is analyzed to validate this. In the TANDEM approach, a multi-layer perceptron (MLP), used to perform a data-driven transformation of the input features, learns the transformation by being trained in a supervised, discriminative mode with phoneme labels as output classes.
Such training makes the MLP perform a nonlinear discriminant analysis in the input feature space, and thus makes it learn a transformation that projects the input features onto a sub-space of maximum class-discriminatory information. This projection suppresses noise-related variability while keeping the speech-discriminatory information intact. An experimental evaluation of the TANDEM approach shows that it is effective in improving noise robustness. Interestingly, the TANDEM approach further improves the noise robustness of the STAP and PAC features, and also improves their clean speech recognition performance. The analysis of the noise robustness of TANDEM has also led to another interesting aspect of it, namely its use as an integration tool for adaptively combining multiple feature streams. The validity of the various noise-robust approaches developed in this thesis is shown by evaluating them on the OGI Numbers95 database with noises added from Noisex92, and also on the Aurora-2 database. A combination of the robust features developed in this thesis with standard features, in a TANDEM framework, results in a system that is reasonably robust in all conditions.

Proceedings ArticleDOI
17 May 2004
TL;DR: Three structure patterns, an energy envelope pattern, a sub-band spectral shape pattern, and a harmonicity prominence pattern, are proposed or refined as a successive development of the authors' previous work to improve audio representation.
Abstract: Although statistical characteristics of audio features are widely used for audio representation in most current audio analysis systems and have proved effective, they utilize only the average feature variations over time and thus lead to ambiguities in some cases. Structure patterns, which describe the representative structural characteristics of both temporal and spectral features, are proposed to improve audio representation. In this paper, three structure patterns, an energy envelope pattern, a sub-band spectral shape pattern, and a harmonicity prominence pattern, are proposed or refined as a successive development of our previous work. Evaluations on a content-based audio retrieval system with more than 1500 clips showed very encouraging results.
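The difference between a statistical summary and a structure pattern is that the latter preserves temporal shape (attack, plateau, decay) instead of averaging it away. A minimal sketch of an energy envelope pattern, where the quantization into a coarse level sequence is an assumption of this illustration rather than the paper's scheme:

```python
import numpy as np

def energy_envelope_pattern(power_spec, n_levels=4):
    """Quantize the frame-level energy contour into a few levels so the
    envelope's temporal shape survives, unlike a plain mean/variance
    summary. power_spec: (n_bins, n_frames) power spectrogram."""
    env = power_spec.sum(axis=0)           # energy per frame
    env = env / (env.max() + 1e-12)        # normalize to [0, 1]
    edges = np.linspace(0.0, 1.0, n_levels + 1)[1:-1]
    return np.digitize(env, edges)         # coarse level sequence
```

Two clips with identical average energy but different envelopes (e.g. a sustained tone vs. a percussive hit) then receive different patterns, which is exactly the ambiguity the abstract attributes to purely statistical features.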