
Showing papers on "Spectrogram published in 2003"


Proceedings Article
01 Jan 2003
TL;DR: Reviews the max approximation to log spectrograms of mixtures, shows why it motivates a “refiltering” approach to separation and denoising, and describes how inference in factorial probabilistic models performs a computation useful for deriving the masking signals needed in refiltering.
Abstract: This paper proposes the combination of several ideas, some old and some new, from machine learning and speech processing. We review the max approximation to log spectrograms of mixtures, show why this motivates a “refiltering” approach to separation and denoising, and then describe how the process of inference in factorial probabilistic models performs a computation useful for deriving the masking signals needed in refiltering. A particularly simple model, factorial-max vector quantization (MAXVQ), is introduced along with a branch-and-bound technique for efficient exact inference and applied to both denoising and monaural separation. Our approach represents a return to the ideas of Ephraim, Varga and Moore but applied to auditory scene analysis rather than to speech recognition. 1. Sparsity & Redundancy in Spectrograms 1.1. The Log-Max Approximation When two clean speech signals are mixed additively in the time domain, what is the relationship between the individual log spectrograms of the sources and the log spectrogram of the mixture? Unless the sources are highly dependent (synchronized), the spectrogram of the mixture is almost exactly the maximum of the individual spectrograms, with the maximum operating over small time-frequency regions (fig. 2). This amazing fact, first noted by Roger Moore in 1983, comes from the fact that unless e1 and e2 are both large and almost equal, log(e1 + e2) ≈ max(log e1, log e2) (fig. 1a). The sparse nature of the speech code across time and frequency is the key to the practical usefulness of this approximation: most narrow frequency bands carry substantial energy only a small fraction of the time and thus it is rare that two independent sources inject large amounts of energy into the same subband at the same time. (Figure 1b shows a plot of the relative energy of two simultaneous speakers in a narrow subband; most of the time at least one of the two sources shows negligible power.) 1.2. Masking and Refiltering Fortunately, the speech code is also redundant across time-frequency. Different frequency bands carry, to a certain extent, independent information and so if information in some bands is suppressed or masked, even for significant durations, other bands can fill in. (A similar effect occurs over time: if brief sections of the signal are obscured, even across all bands, the speech is still intelligible; while also useful, we do not exploit this here.) This is partly why humans perform so well on many monaural speech separation and denoising tasks. When we solve the cocktail party problem or recognize degraded speech, we are doing structural analysis, or a kind of “perceptual grouping” on the incoming sound. There is substantial evidence that the appropriate subparts of an audio signal for use in grouping may be narrow frequency bands over short times. To generate these parts computationally, we can perform multiband analysis – break the original speech signal y(t) into many subband signals bi(t).
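The log-max approximation described in this abstract can be checked numerically: the identity log(e1 + e2) = max(log e1, log e2) + log(1 + min/max) bounds the error by log 2, and the error is small whenever one source dominates. A minimal sketch with illustrative random energies (not data from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Energies of two independent sources in one time-frequency cell (illustrative)
e1 = rng.exponential(scale=1.0, size=10_000)
e2 = rng.exponential(scale=1.0, size=10_000)

exact = np.log(e1 + e2)                      # log energy of the additive mixture
approx = np.maximum(np.log(e1), np.log(e2))  # the log-max approximation

# Error identity: log(e1+e2) - max(...) = log(1 + min/max), always in (0, log 2]
err = exact - approx
print(f"mean error {err.mean():.3f} nats, worst case {err.max():.3f} <= log 2 = {np.log(2):.3f}")
```

With sparse, speech-like energies the typical error is far below the log 2 worst case, which is what makes the approximation practical.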

209 citations


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A system architecture and a set of lightweight collaborative signal processing algorithms that achieve real-time behavior while minimizing inter-node communication to extend the system lifetime are proposed.
Abstract: We are developing an acoustic habitat-monitoring sensor network that recognizes and locates specific animal calls in real time. We investigate the system requirements of such a real-time acoustic monitoring network. We propose a system architecture and a set of lightweight collaborative signal processing algorithms that achieve real-time behavior while minimizing inter-node communication to extend the system lifetime. In particular, the target classification is based on spectrogram pattern matching while the target localization is based on beamforming using time difference of arrival (TDOA). We describe our preliminary implementation on a commercial off the shelf (COTS) testbed and present its performance based on testbed measurements.
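The TDOA step in a localization pipeline like this one can be sketched as a cross-correlation peak pick between two sensor channels. The sample rate, signal, and delay below are illustrative assumptions, not values from the paper:

```python
import numpy as np

fs = 8000                              # sample rate in Hz (illustrative)
rng = np.random.default_rng(1)
call = rng.standard_normal(2000)       # 0.25 s of a broadband "animal call"

true_delay = 25                        # samples: the second sensor hears it later
x1 = np.concatenate([call, np.zeros(100)])
x2 = np.concatenate([np.zeros(true_delay), call, np.zeros(100 - true_delay)])

# TDOA from the lag of the cross-correlation peak
corr = np.correlate(x2, x1, mode="full")
lag = int(corr.argmax()) - (len(x1) - 1)
print(f"estimated TDOA: {lag} samples = {1e3 * lag / fs:.3f} ms")
```

Beamforming then turns TDOAs from several sensor pairs into a bearing estimate; on a real node the correlation would be restricted to physically plausible lags to save computation.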

160 citations


Patent
22 Apr 2003
TL;DR: In this article, a real-time spectrum analysis engine (SAGE) is proposed, consisting of a spectrum analyzer, a signal detector, a universal signal synchronizer and a snapshot buffer component.
Abstract: A real-time spectrum analysis engine (SAGE) that comprises a spectrum analyzer component, a signal detector component, a universal signal synchronizer component and a snapshot buffer component. The spectrum analyzer component generates data representing a real-time spectrogram of a bandwidth of radio frequency (RF) spectrum. The signal detector detects signal pulses in the frequency band and outputs pulse event information entries, which include the start time, duration, power, center frequency and bandwidth of each detected pulse. The signal detector also provides pulse trigger outputs which may be used to enable/disable the collection of information by the spectrum analyzer and the snapshot buffer components. The snapshot buffer collects a set of raw digital signal samples useful for signal classification and other purposes. The universal signal synchronizer synchronizes to periodic signal sources, useful for instituting schemes to avoid interference with those signals.

131 citations


Journal ArticleDOI
TL;DR: Comparison via computer simulations of AR models between the proposed method and one of the well-known iterative methods, recursive least squares, shows the greater capability of the new method to track TV parameters.
Abstract: We extend a recently developed time-invariant (TIV) model order search criterion named the optimal parameter search algorithm (OPS) for identification of time-varying (TV) autoregressive (AR) and autoregressive moving average (ARMA) models. Using the TV algorithm is facilitated by the fact that expanding each TV coefficient onto a finite set of basis sequences permits TV parameters to become TIV. Taking advantage of this TIV feature of expansion parameters exploits the features of the OPS, which has been shown to provide accurate model order selection as well as extraction of only the significant model terms. Another advantage of the new algorithm is its ability to discriminate insignificant basis sequences, thereby reducing the number of expansion parameters to be estimated. Due to these features, the resulting algorithm can accurately estimate TV AR or ARMA models and determine their orders. Indeed, comparison via computer simulations of AR models between the proposed method and one of the well-known iterative methods, recursive least squares, shows the greater capability of the new method to track TV parameters. Furthermore, application of the new method to experimentally obtained renal blood flow signals shows that the new method provides higher-resolution time-varying spectral capability than does the short-time Fourier transform (STFT), concomitant with fewer spurious frequency peaks than obtained with the STFT spectrogram. © 2003 Biomedical Engineering Society. PACS 2003: 87.10.+e, 87.19.Uv, 87.80.Tq
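The basis-expansion idea at the heart of this approach — each time-varying coefficient becomes a small set of time-invariant expansion weights, which ordinary least squares can then estimate — can be sketched as follows. The TV-AR(1) model and sinusoidal basis are illustrative choices, and no OPS-style model-order search is performed:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000
t = np.arange(N) / N
a_true = 0.5 + 0.3 * np.sin(2 * np.pi * t)       # slowly varying AR(1) coefficient

x = np.zeros(N)
for n in range(1, N):
    x[n] = a_true[n] * x[n - 1] + 0.1 * rng.standard_normal()

# Expand a(n) on a small basis; the expansion weights are time-invariant,
# so ordinary least squares recovers them.
basis = np.stack([np.ones(N), np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
A = basis[1:] * x[:-1, None]                     # regressors f_m(n) * x[n-1]
c, *_ = np.linalg.lstsq(A, x[1:], rcond=None)
a_est = basis @ c

print("max |a_est - a_true| =", float(np.abs(a_est - a_true).max()))
```

The OPS contribution in the paper is deciding which basis sequences and model terms are significant; the sketch above assumes the basis is known.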

107 citations


Journal ArticleDOI
TL;DR: General performance analysis of the shift covariant class of quadratic time-frequency distributions as instantaneous frequency (IF) estimators, for an arbitrary frequency-modulated (FM) signal, is presented, and expressions for the estimation bias and variance are derived.
Abstract: General performance analysis of the shift covariant class of quadratic time-frequency distributions (TFDs) as instantaneous frequency (IF) estimators, for an arbitrary frequency-modulated (FM) signal, is presented. Expressions for the estimation bias and variance are derived. This class of distributions behaves as an unbiased estimator in the case of monocomponent signals with a linear IF. However, when the IF is not a linear function of time, then the estimate is biased. Cases of white stationary and white nonstationary additive noises are considered. The well-known results for the Wigner distribution (WD) and linear FM signal, and the spectrogram of signals whose IF may be considered as a constant within the lag window, are presented as special cases. In addition, we have derived the variance expression for the spectrogram of a linear FM signal that is quite simple but highly signal dependent. This signal is considered in the cases of other commonly used distributions, such as the Born-Jordan and the Choi-Williams distributions. It has been shown that the reduced interference distributions outperform the WD but only in the case when the IF is constant or its variations are small. Analysis is extended to the IF estimation of signal components in the case of multicomponent signals. All theoretical results are statistically confirmed.
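A spectrogram-based IF estimator of the kind analyzed here simply picks the peak frequency of each short-time spectrum. A rough sketch on a linear FM signal (window length, hop, and signal parameters are arbitrary choices, not the paper's):

```python
import numpy as np

fs = 1000
t = np.arange(fs) / fs
# Linear FM signal whose IF sweeps 100 -> 300 Hz over one second
x = np.cos(2 * np.pi * (100 * t + 100 * t**2))   # IF(t) = 100 + 200 t

win, hop = 64, 32
est_t, est_f = [], []
for start in range(0, len(x) - win, hop):
    seg = np.hanning(win) * x[start:start + win]
    spec = np.abs(np.fft.rfft(seg))
    est_t.append((start + win / 2) / fs)          # window-center time
    est_f.append(spec.argmax() * fs / win)        # peak-frequency IF estimate

est_t, est_f = np.array(est_t), np.array(est_f)
err = np.abs(est_f - (100 + 200 * est_t))
print(f"mean |IF error|: {err.mean():.1f} Hz (bin width {fs / win:.2f} Hz)")
```

The error here is dominated by the frequency-bin quantization; the paper's analysis characterizes the bias and variance of such estimators analytically, including the nonlinear-IF and noisy cases.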

84 citations


Proceedings ArticleDOI
17 Sep 2003
TL;DR: The proposed HN cancellation method uses an image processing technique to detect HN segments in the spectrogram of the recorded lung sound signal and estimates the missing data employing a 2D interpolation in the time-frequency domain.
Abstract: During lung sound recordings, an incessant noise source occurs due to heart sounds. The heart sound interference on lung sounds is significant especially at low flow rates. In this paper a new heart noise (HN) cancellation method is presented. This algorithm uses an image processing technique to detect HN segments in the spectrogram of the recorded lung sound signal. Afterwards the algorithm removes those segments and estimates the missing data employing a 2D interpolation in the time-frequency domain and finally reconstructs the signal in the time domain. The results show that the proposed method successfully cancels HN from lung sound signals while preserving the original fundamental components of the lung sound signal. The computational load and the speed of the proposed method were found to be much more efficient than other HN cancellation methods such as adaptive filtering.
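The interpolation step can be sketched in miniature: mark the spectrogram frames flagged as heart noise as missing, then fill each frequency bin from its uncorrupted neighbors. This uses a 1-D interpolation along time as a simple stand-in for the paper's 2-D time-frequency interpolation, on a synthetic spectrogram:

```python
import numpy as np

# Toy log-spectrogram: smooth in time, so interpolation can reconstruct it
freqs, frames = 32, 100
f = np.arange(freqs)[:, None]
n = np.arange(frames)[None, :]
S = np.sin(2 * np.pi * n / 50) + 0.1 * f       # slowly varying pattern per bin

# Pretend frames 40-44 are corrupted by heart noise: treat them as missing
bad = np.arange(40, 45)
good = np.setdiff1d(np.arange(frames), bad)
S_est = S.copy()
for k in range(freqs):
    S_est[k, bad] = np.interp(bad, good, S[k, good])

err = float(np.abs(S_est[:, bad] - S[:, bad]).max())
print(f"max reconstruction error over the masked frames: {err:.4f}")
```

The paper additionally detects the heart-noise segments automatically via image processing and reconstructs a time-domain signal afterwards; both steps are omitted here.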

47 citations


Journal ArticleDOI
TL;DR: In this article, a smoothed instantaneous power spectrum (SIPS) distribution is proposed for detecting and locating local tooth defects in gears, which has the added advantage of providing a considerable reduction in the ringing effect of the IPS transform, which results in a smoother and clearer timefrequency representation.
Abstract: Time–frequency methods are effective tools for analysing diagnostic signals and have been widely used to describe machine condition. This paper introduces a time–frequency distribution, called the smoothed instantaneous power spectrum (SIPS) distribution, and demonstrates its use in the detection and location of local tooth defects in gears. The SIPS distribution is derived from the frequency domain definition of the instantaneous power spectrum (IPS) distribution, but has the added advantage that it provides a considerable reduction in the ringing effect of the IPS transform, which results in a smoother and clearer time–frequency representation. A simulated gear vibration signal is used to show the capabilities of the proposed method over the IPS distribution and spectrogram. Healthy and faulty vibration signals monitored from a gear test rig are analysed, the results of which show that a local gear tooth defect can clearly be detected by the SIPS distribution.

39 citations


Proceedings ArticleDOI
17 Sep 2003
TL;DR: In this article, a series of tests were performed with the goal of determining the AR order most suitable for the calculation of the AR spectrogram, and ranges of optimal orders for different interpolation rates of the HRV signal are presented.
Abstract: Time-frequency analysis of heart rate variability (HRV) makes it easier to evaluate how the balance between the sympathetic and parasympathetic influences on heart rhythm varies with time. The auto-regressive (AR) model can be used to calculate the power spectral density of HRV and to create an auto-regressive spectrogram. This work presents these techniques and describes a series of tests performed with the goal of determining the AR order most suitable for the calculation of the AR spectrogram. As a result, ranges of optimal orders for different interpolation rates of the HRV signal are presented.
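An AR power spectrum of the kind stacked frame-by-frame to form an AR spectrogram can be computed from the Yule-Walker equations. A sketch on a synthetic HRV-like series (the sampling rate, model order, and test signal are illustrative, and the order-selection question the paper studies is not addressed):

```python
import numpy as np

def ar_psd(x, order, nfft=512, fs=4.0):
    """AR power spectral density from the Yule-Walker equations."""
    x = x - x.mean()
    r = np.correlate(x, x, "full")[len(x) - 1:] / len(x)   # biased autocorrelation
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                 # AR coefficients
    sigma2 = r[0] - a @ r[1:order + 1]                     # driving-noise variance
    freqs = np.arange(nfft // 2) * fs / nfft
    z = np.exp(-2j * np.pi * freqs / fs)
    denom = np.abs(1 - sum(a[k] * z ** (k + 1) for k in range(order))) ** 2
    return freqs, sigma2 / denom

# Synthetic HRV-like series resampled at 4 Hz with a 0.25 Hz respiratory peak
rng = np.random.default_rng(3)
n = np.arange(1024)
x = np.sin(2 * np.pi * 0.25 * n / 4.0) + 0.5 * rng.standard_normal(1024)
freqs, psd = ar_psd(x, order=8)
peak = freqs[psd.argmax()]
print(f"AR spectrum peaks at {peak:.3f} Hz")
```

Sliding this estimator over short, overlapping segments of the interpolated HRV series yields the AR spectrogram the abstract describes.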

36 citations


Proceedings Article
09 Dec 2003
TL;DR: A generative model of time-domain speech signals and their spectrograms is described, and it is shown how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram.
Abstract: Many techniques for complex speech processing such as denoising and deconvolution, time/frequency warping, multiple speaker separation, and multiple microphone analysis operate on sequences of short-time power spectra (spectrograms), a representation which is often well-suited to these tasks. However, a significant problem with algorithms that manipulate spectrograms is that the output spectrogram does not include a phase component, which is needed to create a time-domain signal that has good perceptual quality. Here we describe a generative model of time-domain speech signals and their spectrograms, and show how an efficient optimizer can be used to find the maximum a posteriori speech signal, given the spectrogram. In contrast to techniques that alternate between estimating the phase and a spectrally-consistent signal, our technique directly infers the speech signal, thus jointly optimizing the phase and a spectrally-consistent signal. We compare our technique with a standard method using signal-to-noise ratios, but we also provide audio files on the web for the purpose of demonstrating the improvement in perceptual quality that our technique offers.
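The alternating baseline the authors contrast their method with — iterate between imposing the target magnitude and resynthesizing a spectrally-consistent signal, i.e. the Griffin-Lim procedure — can be sketched as follows (window, hop, and iteration count are arbitrary; this is the baseline, not the paper's generative-model approach):

```python
import numpy as np

def stft(x, win, hop):
    w = np.hanning(win)
    return np.array([np.fft.rfft(w * x[i:i + win])
                     for i in range(0, len(x) - win + 1, hop)])

def istft(X, win, hop, length):
    w = np.hanning(win)
    x, norm = np.zeros(length), np.zeros(length)
    for m, frame in enumerate(X):
        i = m * hop
        x[i:i + win] += w * np.fft.irfft(frame, win)
        norm[i:i + win] += w ** 2
    return x / np.maximum(norm, 1e-8)

# Target magnitude spectrogram taken from a known signal, phase thrown away
target = np.sin(2 * np.pi * 440 * np.arange(4096) / 8000)
mag = np.abs(stft(target, 256, 64))

rng = np.random.default_rng(7)
x = rng.standard_normal(4096)                    # random initial guess
for _ in range(50):
    X = stft(x, 256, 64)
    X = mag * np.exp(1j * np.angle(X))           # impose the target magnitude
    x = istft(X, 256, 64, 4096)                  # resynthesize a consistent signal

final_err = np.abs(np.abs(stft(x, 256, 64)) - mag).mean() / mag.mean()
print(f"relative magnitude mismatch after 50 iterations: {final_err:.3f}")
```

The paper's contribution is to avoid this alternation by directly inferring the MAP time-domain signal given the spectrogram.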

36 citations


Journal ArticleDOI
TL;DR: An approach to the estimation of motion parameters of moving objects in a video-sequence, by using the SLIDE (subspace-based line detection algorithm) algorithm, is considered, and a tradeoff between concentration of the TF representation and reduction of the cross-terms is achieved.
Abstract: An approach to the estimation of motion parameters of moving objects in a video-sequence, by using the SLIDE (subspace-based line detection algorithm) algorithm, is considered. The proposed procedure projects video-frames to the coordinate axes, in order to obtain synthetic images containing information about the motion parameters. These synthetic images are mapped to the FM signals by using constant μ-propagation. The problem of velocity estimation is reduced to the instantaneous frequency (IF) estimation. IF estimators, based on time-frequency (TF) representations, are used. Three TF representations: spectrogram (SPEC), Wigner distribution (WD), and S-method (SM), are used and compared to this aim. A tradeoff between concentration of the TF representation (velocity estimation accuracy) and reduction of the cross-terms (possibility for estimation of the multiple objects parameters) is achieved by the SM. A performance analysis of the algorithm is done. Theoretical results are illustrated on several numerical examples.

35 citations


Journal ArticleDOI
TL;DR: Windows 95-based software, which processes the real-time heart sound signal, has been developed and allows both a time-varying amplitude graph and a power spectral plot (based on a 512-point fast Fourier transform (FFT)) to be shown simultaneously in a channel's view.
Abstract: A simple, low-cost and non-invasive PC-based system capable of processing a real-time fetal phonocardiographic signal has been built. The hardware of the system mainly consists of two modules: the front-end module and the data acquisition & control module. The front-end module is mainly used for heart sound signal capturing and conditioning. A new electronic stethoscope with enhanced performance that is non-intrusive, cost friendly and simple to implement has been built. The audio output unit enables the system to provide simultaneous listening and visual representation of the heart sound. The data acquisition & control module offers a four-channel analog multiplexer, a programmable gain amplifier, and a 12-bit resolution ADC. Various sampling rates can be provided through the programmable timer. Windows 95-based software, which processes the real-time heart sound signal, has been developed. The software written for the PCG allows both a time-varying amplitude graph and a power spectral plot (based on a 512-point fast Fourier transform (FFT)) to be shown simultaneously in a channel's view. The simultaneous spectrograms give much better insight into the heart sound characteristics than the time-amplitude plot alone, as in conventional PCG software. Using digital signal processing techniques, the power spectral plot is used to extract useful information about the heart sound characteristics even when the heart sounds are mixed with considerably loud background noise.

Journal ArticleDOI
TL;DR: The correlogram is a new method of displaying periodicity based on the waveform-matching techniques often used in F0 extraction programs, but with no mechanism to select an actual F0 value, and useful for analysis of pathological voices since it illustrates the full complexity of the periodicity in the voice signal.
Abstract: Fundamental frequency (F0) extraction is often used in voice quality analysis. In pathological voices with a high degree of instability in F0, it is common for F0 extraction algorithms to fail. In such cases, the faulty F0 values might spoil the possibilities for further data analysis. This paper presents the correlogram, a new method of displaying periodicity. The correlogram is based on the waveform-matching techniques often used in F0 extraction programs, but with no mechanism to select an actual F0 value. Instead, several candidates for F0 are shown as dark bands. The result is presented as a 3D plot with time on the x axis, correlation delay inverted to frequency on the y axis, and correlation on the z axis. The z axis is represented in a gray scale as in a spectrogram. Delays corresponding to integer multiples of the period time will receive high correlation, thus resulting in candidates at F0, F0/2, F0/3, etc. While the correlogram adds little to F0 analysis of normal voices, it is useful for analysis of pathological voices since it illustrates the full complexity of the periodicity in the voice signal. Also, in combination with manual tracing, the correlogram can be used for semimanual F0 extraction. If so, F0 extraction can be performed on many voices that cause problems for conventional F0 extractors. To demonstrate the properties of the method it is applied to synthetic and natural voices, among them six pathological voices, which are characterized by roughness, vocal fry, gratings/scrape, hypofunctional breathiness and voice breaks, or combinations of these.
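The correlogram's core computation — normalized waveform matching at every candidate lag, keeping all high-correlation bands rather than committing to one F0 — can be sketched for a single frame. The signal is a synthetic, exactly periodic waveform and the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
period = 40                        # samples; at fs = 8000 Hz this is F0 = 200 Hz
pattern = rng.standard_normal(period)
x = np.tile(pattern, 50)           # exactly periodic "voiced" waveform

def correlogram_frame(x, min_lag, max_lag):
    """Normalized waveform-matching correlation for each candidate lag."""
    out = []
    for lag in range(min_lag, max_lag):
        a, b = x[:-lag], x[lag:]
        out.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return np.array(out)

lags = np.arange(20, 200)
corr = correlogram_frame(x, 20, 200)
# Dark bands appear at every integer multiple of the period (F0, F0/2, F0/3, ...)
candidates = lags[corr > 0.99]
print("high-correlation lags:", candidates.tolist())
```

Stacking these frames over time, with correlation rendered as gray level, gives the 3D display the paper describes; for an unstable pathological voice the bands themselves wander and split, which is exactly the information a single-valued F0 track destroys.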

Journal ArticleDOI
TL;DR: In this paper, a method of interpolating binaural impulse responses and an algorithm to simulate a moving sound image were evaluated objectively; the results showed that the method interpolated the responses more independently of the azimuth of the sound source than the simple method.
Abstract: A previously introduced method of interpolating binaural impulse responses and an algorithm to simulate a moving sound image were evaluated objectively. The method interpolates the responses taking into account the arrival time difference due to changes in the direction of a moving sound source. For an angular interval of 15°, the average SDR value of our method, 23 dB, was larger than that of the simple method, 9.9 dB. The variances of the SDR values showed that our method interpolated the responses more independently of the azimuths of the sound source than the simple method. The responses interpolated using our method changed smoothly as the source direction changed. We have evaluated the algorithm by comparing a moving sound image simulated using the algorithm with an actual moving sound image recorded using a rotating dummy head and with a moving sound image simulated using a conventional method. In the spectrogram of the binaural signal of the moving sound image, no ripples were seen.

Proceedings ArticleDOI
17 Sep 2003
TL;DR: An enhanced method for the detection of wheezes, based on the spectrogram of the breath sound recordings is proposed, which could be used for long-term wheezing screening in sleep-laboratories, resulting in significant data-volume reduction.
Abstract: An enhanced method for the detection of wheezes, based on the spectrogram of the breath sound recordings is proposed. The identification of wheezes in the total breath cycle would contribute to the diagnosis of pathologies related to patients with obstructive airway diseases. Fast and quite simple techniques are applied to automatically locate and identify wheezing-episodes. Amplitude criteria are applied to the peaks of the spectrogram in order to discriminate the wheezing from the breath sound, whereas frequency and time continuity criteria are used to improve the results. The proposed detector could be used for long-term wheezing screening in sleep-laboratories, resulting in significant data-volume reduction.
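The amplitude and time-continuity criteria can be sketched on a synthetic recording. All thresholds and signal parameters below are illustrative guesses, not the paper's values:

```python
import numpy as np

fs = 4000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(6)
breath = 0.3 * rng.standard_normal(len(t))                      # broadband breath sound
wheeze = np.sin(2 * np.pi * 400 * t) * ((t > 0.8) & (t < 1.4))  # 400 Hz wheeze episode
x = breath + wheeze

win, hop = 256, 128
frames = []
for i in range(0, len(x) - win, hop):
    spec = np.abs(np.fft.rfft(np.hanning(win) * x[i:i + win]))
    k = spec.argmax()
    # amplitude criterion: the spectral peak must dominate the frame's mean level
    frames.append((i / fs, k * fs / win, spec[k] > 5 * spec.mean()))

# time-continuity criterion: require at least 4 consecutive flagged frames
flags = [dominant for (_, _, dominant) in frames]
runs = [j for j in range(len(flags) - 3) if all(flags[j:j + 4])]
onset = frames[runs[0]][0] if runs else None
if onset is not None:
    print(f"wheeze episode detected starting near t = {onset:.2f} s")
```

The continuity requirement is what suppresses isolated noisy peaks; the paper adds frequency-continuity checks and tuned amplitude criteria on real breath sound recordings.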

Proceedings ArticleDOI
Yu Shi1, Eric Chang1
06 Apr 2003
TL;DR: The experimental results show that the formants estimated by the proposed particle-filtering method are quite reliable and the trajectories are more accurate than LPC.
Abstract: The paper presents a particle-filtering method for estimating formant frequencies of speech signals from spectrograms. First, frequency bands corresponding to the analyzed formants are extracted via a two-step dynamic programming based algorithm. A particle-filtering method is then used to accurately locate formants in every formant area based on the posterior PDF described by a set of support points with associated weights. Formant trajectories of voiced frames of a group of 81 utterances were manually tracked and labeled, partly for model training and partly for algorithm evaluation. In the experiments, the proposed method obtains average estimation errors of 72, 115, and 113 Hz for the first three formants, respectively, whereas the LPC-based method yields deviations of 118, 172, and 250 Hz. The experimental results show that the formants estimated by the proposed method are quite reliable and the trajectories are more accurate than those obtained with LPC.

Journal ArticleDOI
TL;DR: A new, bilinear, cross-term suppressed and alias-free time-frequency representation (TFR) that has a higher resolution than the spectrogram with the same window width and is applied to interference excision in direct-sequence spread-spectrum communications.

Proceedings ArticleDOI
01 Jul 2003
TL;DR: It has been shown that, when the components are separated in time-frequency (TF) plane, the results obtained for each of them separately can be the same as in the case when all components exist.
Abstract: The performance analysis of the S-method (SM) as an instantaneous frequency (IF) estimator, for the case of multicomponent signals, is derived. It has been shown that, when the components are separated in time-frequency (TF) plane, the results obtained for each of them separately can be the same as in the case when all components exist. Also, these results can outperform the ones obtained by the spectrogram (and, consequently, by the reduced interference distributions (RID)) and pseudo Wigner distribution (WD) in the IF estimation of multicomponent signals. Theoretical results are statistically confirmed.

Journal ArticleDOI
TL;DR: Results from both works open up the possibility of using MPEG-compression at high bitrates to store or transmit high-quality speech recordings, without altering their acoustic properties.

Proceedings ArticleDOI
TL;DR: In this article, the authors present the Java software module for the spectrogram implementation together with the associated programming environment to introduce to students the advanced concepts of TFRs at an early stage in their education without requiring a rigorous theoretical background.
Abstract: Time-frequency representations (TFRs) such as the spectrogram are important two-dimensional tools for processing time-varying signals. In this paper, we present the Java software module we developed for the spectrogram implementation together with the associated programming environment. Our aim is to introduce to students the advanced concepts of TFRs at an early stage in their education without requiring a rigorous theoretical background. We developed two sets of exercises using the spectrogram based on signal analysis and speech processing together with on-line evaluation forms to assess student learning experiences. In the paper, we also provide the positive statistical and qualitative feedback we obtained when the Java software and corresponding exercises were used in a signal processing course.

Journal ArticleDOI
TL;DR: Both analytically and experimentally, adaptive spectrogram was found to be more robust than adaptive Ps.WVD and its performance in the presence of multiplicative and additive noise is studied.


Proceedings ArticleDOI
06 Apr 2003
TL;DR: A novel high-resolution time-frequency representation method is proposed for source detection and classification in over-the-horizon radar (OTHR) systems that can reveal important target maneuvering information, whereas other linear and bilinear time- frequencies representation methods fail.
Abstract: A novel high-resolution time-frequency representation method is proposed for source detection and classification in over-the-horizon radar (OTHR) systems. A data-dependent kernel is applied in the ambiguity domain to capture the target signal components, which are then resolved using the root-MUSIC based coherent spectrum estimation. This method is particularly effective to analyze a multi-component signal with time-varying time-Doppler signatures. By using the different time-Doppler signatures embedded in the multipath signals, this proposed method can reveal important target maneuvering information, whereas other linear and bilinear time-frequency representation methods fail.

Proceedings ArticleDOI
02 Jul 2003
TL;DR: A novel method for the automatic recognition of acoustic utterances is presented using acoustic images as the basis for the feature extraction that effectively employs the spectrogram, the Wigner-Ville distribution and co-occurrence matrices.
Abstract: With the increasing use of audio-visual databases, the need for automatic content-based classification has grown in importance. In this paper, a novel method for the automatic recognition of acoustic utterances is presented using acoustic images as the basis for the feature extraction. This method effectively employs the spectrogram, the Wigner-Ville distribution and co-occurrence matrices. The images are then compressed, using statistical methods, before being combined into a single feature matrix to be presented to a classifier. Initial results obtained from the classification of a database of sport sounds and gunshots indicate that the method is capable of accurate discrimination for coarse and fine classification respectively.

Journal ArticleDOI
TL;DR: A simplified real-time FROG device based on a single-shot geometry that no longer requires DSPs is developed and the principal component generalized projections algorithm is applied to invert polarization gate FROG traces at rates as high as 20 Hz.
Abstract: Frequency-resolved optical gating (FROG) is a technique used to measure the intensity and phase of ultrashort laser pulses through the optical construction of a spectrogram of the pulse. To obtain quantitative information about the pulse from its spectrogram, an iterative two-dimensional phase retrieval algorithm must be used. Current algorithms are quite robust but retrieval of all the pulse information can be slow. Previous real-time FROG trace inversion work focused on second-harmonic-generation FROG, which has an ambiguity in the direction of time, and required digital signal processors (DSPs). We develop a simplified real-time FROG device based on a single-shot geometry that no longer requires DSPs. We use it and apply the principal component generalized projections algorithm to invert polarization gate FROG traces at rates as high as 20 Hz.

Proceedings ArticleDOI
TL;DR: The stego obtained by altering the amplitude at perceptually masked points showed barely noticeable differences and excellent data recovery, compared to the stegos obtained by modifying the phase or amplitude of perceptually masks or significant regions of the host.
Abstract: This paper presents the results of embedding short covert message utterances on a host, or cover, utterance by modifying the phase or amplitude of perceptually masked or significant regions of the host. In the first method, the absolute phase at selected, perceptually masked frequency indices was changed to fixed, covert data-dependent values. Embedded bits were retrieved at the receiver from the phase at the selected frequency indices. Tests on embedding a GSM-coded covert utterance on clean and noisy host utterances showed no noticeable difference in the stego compared to the hosts in speech quality or spectrogram. A bit error rate of 2 out of 2800 was observed for a clean host utterance while no error occurred for a noisy host. In the second method, the absolute phase of 10 or fewer perceptually significant points in the host was set in accordance with covert data. This resulted in a stego with successful data retrieval and a slightly noticeable degradation in speech quality. Modifying the amplitude of perceptually significant points caused perceptible differences in the stego even with small changes of amplitude made at five points per frame. Finally, the stego obtained by altering the amplitude at perceptually masked points showed barely noticeable differences and excellent data recovery.© (2003) COPYRIGHT SPIE--The International Society for Optical Engineering.
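The first method's embedding rule — force the phase at agreed frequency positions to fixed, bit-dependent values and read it back at the receiver — can be sketched on a single frame. The bin choice and payload are illustrative, and no perceptual masking model is used to select the bins:

```python
import numpy as np

rng = np.random.default_rng(4)
host = rng.standard_normal(1024)                 # one frame of a host signal
bits = [1, 0, 1, 1, 0, 0, 1, 0]                  # covert payload
bins = np.arange(100, 100 + len(bits))           # hypothetical masked bin indices

# Embed: keep each bin's magnitude, force its phase to 0 (bit 0) or pi (bit 1)
X = np.fft.rfft(host)
X[bins] = np.abs(X[bins]) * np.exp(1j * np.pi * np.array(bits))
stego = np.fft.irfft(X, len(host))

# Extract: read the phase back at the agreed bins
phases = np.angle(np.fft.rfft(stego)[bins])
recovered = (np.abs(phases) > np.pi / 2).astype(int).tolist()
print("recovered bits:", recovered)
```

In the paper the embedding positions are chosen by a perceptual masking analysis, which is what keeps the stego transparent; here they are simply fixed indices.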

Proceedings ArticleDOI
17 Sep 2003
TL;DR: The investigation of techniques for lung sounds analysis using the spectrogram image processing of respiratory cycles as a parameter source for automatic wheezing recognition and visual user feedback achieves 83,93% match in the wheazing detection for isolated respiratory cycle and 96,43% match for detection in sounds from the same person.
Abstract: This paper describes the investigation of techniques for lung sound analysis using spectrogram image processing of respiratory cycles as a parameter source for automatic wheezing recognition and visual user feedback. The spectrogram is generated from a lung sound recorded in a wave file. The spectrogram image is passed through a bidimensional convolution filter and a limiter in order to increase the contrast and isolate the highest components. The spectral average of the treated spectrogram is computed and stored as an array. The array's peaks are located and used as inputs to a multi-layer perceptron artificial neural network. The presented results show that this technique achieves an 83.93% match in wheezing detection for isolated respiratory cycles and a 96.43% match for detection in sounds from the same person. The system also returns the original recorded sound and the post-processed spectrogram image so that users can draw their own conclusions.
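The front half of that pipeline (spectrogram, 2-D convolution, limiter, spectral average, peak locations) can be sketched roughly as below; the STFT parameters, the sharpening kernel, and the 95th-percentile limiter threshold are assumptions for illustration, and the MLP classification stage is omitted.

```python
import numpy as np

def stft_mag(x, nperseg=256, hop=128):
    # Hann-windowed magnitude spectrogram, shape (freq, time).
    win = np.hanning(nperseg)
    n_frames = 1 + (len(x) - nperseg) // hop
    frames = np.stack([x[i * hop:i * hop + nperseg] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def conv2d_same(img, k):
    # Naive zero-padded 'same' 2-D convolution.
    kh, kw = k.shape
    pad = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty_like(img)
    kf = k[::-1, ::-1]
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(pad[i:i + kh, j:j + kw] * kf)
    return out

def wheeze_feature_freqs(x, fs, n_peaks=5):
    S = stft_mag(x)
    # Sharpening kernel (assumed) to raise the contrast of narrow ridges.
    kernel = np.array([[-1., -1., -1.], [-1., 9., -1.], [-1., -1., -1.]])
    S_sharp = conv2d_same(S, kernel)
    # Limiter: keep only the strongest components.
    S_lim = np.where(S_sharp > np.percentile(S_sharp, 95), S_sharp, 0.0)
    avg = S_lim.mean(axis=1)                   # spectral average over time
    top = np.sort(np.argsort(avg)[-n_peaks:])  # the array's peaks ("tops")
    return np.fft.rfftfreq(256, d=1.0 / fs)[top]
```

A sustained tone (a crude stand-in for a wheeze) should surface among the returned peak frequencies.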

Patent
03 Feb 2003
TL;DR: In this paper, a sampled digital audio signal is displayed on a spectrogram, in terms of frequency vs. time; unwanted noise in the signal is visible in the spectrogram, and the portion of the signal containing the noise can be selected using time and frequency constraints.
Abstract: A sampled digital audio signal is displayed on a spectrogram, in terms of frequency vs. time. An unwanted noise in the signal is visible in the spectrogram and the portion of the signal containing the unwanted noise can be selected using time and frequency constraints. An estimate for the signal within the selected portion is then interpolated on the basis of desired portions of the signal outside the time constraints defining the selected portion. The interpolated estimate can then be used to attenuate or remove the unwanted sound.
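The idea can be sketched on an STFT matrix: interpolate the magnitudes inside the selected time-frequency box from the frames just outside it, keeping the original phase. Linear interpolation across time is an assumption here; the patent covers more general estimation from the surrounding signal.

```python
import numpy as np

def interpolate_region(stft, t0, t1, f0, f1):
    # Replace bins in the box [f0:f1, t0:t1) by linearly interpolating
    # each bin's magnitude between the frames at t0-1 and t1,
    # keeping the original phase. Requires t0 >= 1 and t1 < n_frames.
    out = stft.copy()
    left = np.abs(stft[f0:f1, t0 - 1])
    right = np.abs(stft[f0:f1, t1])
    n = t1 - t0
    for j in range(n):
        w = (j + 1) / (n + 1)
        mag = (1.0 - w) * left + w * right
        out[f0:f1, t0 + j] = mag * np.exp(1j * np.angle(stft[f0:f1, t0 + j]))
    return out
```

The attenuated STFT would then be inverted (e.g. by overlap-add) to obtain the cleaned audio.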

Proceedings ArticleDOI
Xu Shao1, Ben Milner1
06 Apr 2003
TL;DR: This work applies spectral subtraction to the mel-filterbank vector (derived from the noisy MFCCs) to provide a clean speech spectral estimate, and uses a robust extraction technique to obtain a reliable estimate of the pitch.
Abstract: This paper extends the technique of speech reconstruction from MFCCs by considering the effect of noisy speech. To reconstruct a clean speech signal from noise-contaminated MFCCs, an estimate of the clean mel-filterbank vector is required together with a robust estimate of the pitch. This work applies spectral subtraction to the mel-filterbank vector (derived from the noisy MFCCs) to provide a clean speech spectral estimate. To obtain a reliable estimate of the pitch, a robust extraction technique is used. Spectrograms and informal listening tests reveal that a clean speech signal can be successfully reconstructed from the noisy MFCCs. Pitch errors are shown to manifest themselves as artificial-sounding bursts in the reconstructed speech signal, and incorrect estimates of the spectral envelope introduce periods of noise into the reconstructed speech.
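The two key steps can be sketched as follows: invert the DCT to recover a log mel-filterbank vector from the MFCCs (an orthonormal, untruncated DCT is assumed here for simplicity), then apply spectral subtraction with a floor in the linear mel domain. The subtraction factor and floor values are illustrative assumptions.

```python
import numpy as np

def mfcc_to_logmel(mfcc, n_mel):
    # Invert an orthonormal DCT-II: logmel = C.T @ mfcc, where
    # C[n, k] = s(n) * cos(pi * n * (2k + 1) / (2 * n_mel)).
    n = np.arange(len(mfcc))[:, None]
    k = np.arange(n_mel)[None, :]
    basis = np.cos(np.pi * n * (2 * k + 1) / (2 * n_mel))
    scale = np.full(len(mfcc), np.sqrt(2.0 / n_mel))
    scale[0] = np.sqrt(1.0 / n_mel)
    return (scale[:, None] * basis).T @ mfcc

def spectral_subtract(noisy_mel, noise_mel, alpha=1.0, floor=0.01):
    # Subtract the noise estimate in the linear mel domain, flooring
    # the result to avoid negative filterbank energies.
    return np.maximum(noisy_mel - alpha * noise_mel, floor * noisy_mel)
```

In the paper the noise mel-spectrum estimate would come from non-speech frames; here it is simply passed in.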

Proceedings ArticleDOI
27 Dec 2003
TL;DR: An environmental sound recognition system based on MPEG-7 audio LLDs (low-level descriptors) is proposed; it avoids the high-dimensional feature vectors of spectrum-based HMM recognizers by applying basis extraction.
Abstract: In this paper, an environmental sound recognition system based on MPEG-7 audio LLDs (low-level descriptors) is proposed. Traditional sound recognizers use decision-tree-based methods, whose parameters do not generalize well. HMM-based sound recognizers have been introduced to resolve this drawback; however, they adopt spectrum parameters and thus produce high-dimensional feature vectors. This paper overcomes that shortcoming through basis extraction. The recognition rate is about 82% when only the spectrogram is adopted as the parameter, and improves to about 95% when the three MPEG-7 audio LLDs are used as parameters in our environmental sound recognizer. These three MPEG-7 audio LLDs are the audio spectrum centroid, audio spectrum spread, and audio spectrum flatness descriptors.
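Simplified versions of the three descriptors can be computed from a power spectrum as below. Note that MPEG-7 actually defines these over log-frequency (octave) bands; the single-band, linear-frequency forms here are a simplification for illustration.

```python
import numpy as np

def spectrum_centroid_spread(power, freqs):
    # Power-weighted mean frequency and its standard deviation
    # (simplified AudioSpectrumCentroid / AudioSpectrumSpread).
    p = power / power.sum()
    centroid = np.sum(p * freqs)
    spread = np.sqrt(np.sum(p * (freqs - centroid) ** 2))
    return centroid, spread

def spectrum_flatness(power, eps=1e-12):
    # Geometric over arithmetic mean: near 1 for noise-like spectra,
    # near 0 for tonal ones (simplified AudioSpectrumFlatness).
    gm = np.exp(np.mean(np.log(power + eps)))
    return gm / np.mean(power)
```

A flat spectrum yields a flatness near 1, while a single spectral peak yields a small flatness and a centroid at the peak frequency.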

Proceedings Article
01 Sep 2003
TL;DR: An integrated speech front-end for both speech recognition and speech reconstruction applications is proposed; analysis shows the system to be more robust than both comb-function and LPC-based pitch extraction.
Abstract: This paper proposes an integrated speech front-end for both speech recognition and speech reconstruction applications. Speech is first decomposed into a set of frequency bands by an auditory model, whose output is then used to extract both robust pitch estimates and MFCC vectors. Initial tests used a 128-channel auditory model, but results show that this can be reduced significantly, to between 23 and 32 channels. A detailed analysis of the pitch classification accuracy and the RMS pitch error shows the system to be more robust than both comb-function and LPC-based pitch extraction. Speech recognition results show that the auditory-based cepstral coefficients give very similar performance to conventional MFCCs. Spectrograms and informal listening tests also reveal that speech reconstructed from the auditory-based cepstral coefficients and pitch has similar quality to that reconstructed from conventional MFCCs and pitch.
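For reference on the pitch-extraction comparison, a minimal autocorrelation-based pitch estimator looks like this. It is a generic baseline of the kind such front-ends are measured against, not the paper's auditory-model method.

```python
import numpy as np

def pitch_autocorr(x, fs, fmin=60.0, fmax=400.0):
    # Pick the autocorrelation peak within the plausible lag range
    # and convert the winning lag back to a frequency.
    x = x - np.mean(x)
    r = np.correlate(x, x, mode='full')[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin) + 1
    lag = lo + np.argmax(r[lo:hi])
    return fs / lag
```

On a clean periodic signal the strongest autocorrelation peak falls at the pitch period, so the estimate is exact; robustness in noise is where methods like the paper's differ.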