
Showing papers on "Spectrogram" published in 2013


Proceedings ArticleDOI
26 May 2013
TL;DR: The proposed feature enhancement algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask.
Abstract: We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the proposed enhancement obtains a relative improvement of over 38% in terms of word error rates using ASR models trained in clean conditions, and an improvement of over 14% when the models are trained using the multi-condition training data. In terms of instantaneous SNR estimation performance, the proposed system obtains a mean absolute error of less than 4 dB in most frequency channels.
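
To make the masking step concrete, here is a minimal numpy sketch (not the authors' code) of how an estimated ratio mask filters a noisy Mel spectrogram before cepstral feature extraction; the DNN mask estimator itself is omitted, and the function and parameter names are illustrative.

```python
import numpy as np
from scipy.fft import dct

def mask_and_extract(noisy_mel, irm, floor=1e-10):
    """noisy_mel: (mel_bands, frames) power spectrogram; irm: mask in [0, 1]."""
    enhanced = noisy_mel * np.clip(irm, 0.0, 1.0)   # soft-mask out the noise energy
    # Cepstral features: log compression followed by a DCT across Mel bands.
    return dct(np.log(np.maximum(enhanced, floor)), type=2, axis=0, norm='ortho')
```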

557 citations


01 Jan 2013
TL;DR: Improvements in speech recognition are suggested without increasing the number of training epochs, and it is suggested that data transformations should be an important component of training neural networks for speech, especially for data limited projects.
Abstract: Augmenting datasets by transforming inputs in a way that does not change the label is a crucial ingredient of state-of-the-art methods for object recognition using neural networks. However, this approach has (to our knowledge) not been exploited successfully in speech recognition (with or without neural networks). In this paper we lay the foundation for this approach, and show one way of augmenting speech datasets by transforming spectrograms, using a random linear warping along the frequency dimension. In practice this can be achieved by using warping techniques that are used for vocal tract length normalization (VTLN) - with the difference that a warp factor is generated randomly each time, during training, rather than fitting a single warp factor to each training and test speaker (or utterance). At test time, a prediction is made by averaging the predictions over multiple warp factors. When this technique is applied to TIMIT using Deep Neural Networks (DNN) of different depths, the Phone Error Rate (PER) improved by an average of 0.65% on the test set. For a Convolutional Neural Network (CNN) with a convolutional layer at the bottom, a gain of 1.0% was observed. These improvements were achieved without increasing the number of training epochs, and suggest that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
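
A minimal sketch of the augmentation idea, assuming a simple linear warp implemented by interpolation along the frequency axis; the warp-factor range is an illustrative assumption, not the paper's setting:

```python
import numpy as np

def random_freq_warp(spec, rng, low=0.9, high=1.1):
    """spec: (freq_bins, frames). Returns a randomly warped copy."""
    n_bins = spec.shape[0]
    alpha = rng.uniform(low, high)                 # fresh warp factor per example
    src = np.clip(np.arange(n_bins) * alpha, 0, n_bins - 1)
    warped = np.empty_like(spec)
    for t in range(spec.shape[1]):                 # linear interpolation per frame
        warped[:, t] = np.interp(src, np.arange(n_bins), spec[:, t])
    return warped

rng = np.random.default_rng(0)
augmented = random_freq_warp(np.abs(np.random.randn(128, 50)), rng)
```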

351 citations


Posted Content
TL;DR: This paper generalizes one of the most important signal processing tools - windowed Fourier analysis - to the graph setting and designs dictionaries and transform methods to identify and exploit structure in signals on weighted graphs.
Abstract: One of the key challenges in the area of signal processing on graphs is to design dictionaries and transform methods to identify and exploit structure in signals on weighted graphs. To do so, we need to account for the intrinsic geometric structure of the underlying graph data domain. In this paper, we generalize one of the most important signal processing tools - windowed Fourier analysis - to the graph setting. Our approach is to first define generalized convolution, translation, and modulation operators for signals on graphs, and explore related properties such as the localization of translated and modulated graph kernels. We then use these operators to define a windowed graph Fourier transform, enabling vertex-frequency analysis. When we apply this transform to a signal with frequency components that vary along a path graph, the resulting spectrogram matches our intuition from classical discrete-time signal processing. Yet, our construction is fully generalized and can be applied to analyze signals on any undirected, connected, weighted graph.
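
For concreteness, the transform's core definitions can be written in the usual graph signal processing notation, with Laplacian eigenvectors u_l and eigenvalues lambda_l on an N-vertex graph and a window kernel g (a sketch of the paper's construction, not a substitute for its full treatment):

```latex
% Generalized translation and modulation of a window g on a graph
(T_i g)(n) := \sqrt{N} \sum_{\ell=0}^{N-1} \hat{g}(\lambda_\ell)\, u_\ell^*(i)\, u_\ell(n),
\qquad
(M_k g)(n) := \sqrt{N}\, u_k(n)\, g(n)

% Windowed graph Fourier transform of a signal f; |Sf(i,k)|^2 plays the
% role of a vertex-frequency spectrogram
Sf(i,k) := \langle f,\; M_k T_i g \rangle
```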

260 citations


Journal ArticleDOI
TL;DR: This paper expands T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP), and proposes to use a group Lasso approach to select complementary features in a principled way.
Abstract: Monaural speech segregation has been a very challenging problem for decades. By casting speech segregation as a binary classification problem, recent advances have been made in computational auditory scene analysis on segregation of both voiced and unvoiced speech. So far, pitch and amplitude modulation spectrogram have been used as two main kinds of time-frequency (T-F) unit level features in classification. In this paper, we expand T-F unit features to include gammatone frequency cepstral coefficients (GFCC), mel-frequency cepstral coefficients, relative spectral transform (RASTA) and perceptual linear prediction (PLP). Comprehensive comparisons are performed in order to identify effective features for classification-based speech segregation. Our experiments in matched and unmatched test conditions show that these newly included features significantly improve speech segregation performance. Specifically, GFCC and RASTA-PLP are the best single features in matched-noise and unmatched-noise test conditions, respectively. We also find that pitch-based features are crucial for good generalization to unseen environments. To further explore complementarity in terms of discriminative power, we propose to use a group Lasso approach to select complementary features in a principled way. The final combined feature set yields promising results in both matched and unmatched test conditions.
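
As a rough illustration of the selection mechanism (not the paper's implementation), a proximal-gradient group Lasso keeps or discards each feature family as a block via block soft-thresholding:

```python
import numpy as np

def group_lasso(X, y, groups, lam=0.1, iters=500):
    """X: (n, d) features; groups: list of index arrays, one per feature family."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1/L for the quadratic fit term
    for _ in range(iters):
        z = w - step * (X.T @ (X @ w - y))        # gradient step on the fit term
        for g in groups:                          # block soft-thresholding (prox)
            norm_g = np.linalg.norm(z[g])
            z[g] *= 0.0 if norm_g == 0 else max(0.0, 1.0 - step * lam / norm_g)
        w = z
    return w                                      # zeroed blocks = dropped feature families
```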

192 citations


Proceedings ArticleDOI
26 May 2013
TL;DR: A novel deep convolutional neural network architecture is developed, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance.
Abstract: We develop and present a novel deep convolutional neural network architecture, where heterogeneous pooling is used to provide constrained frequency-shift invariance in the speech spectrogram while minimizing speech-class confusion induced by such invariance. The design of the pooling layer is guided by domain knowledge about how speech classes would change when formant frequencies are modified. The convolution and heterogeneous-pooling layers are followed by a fully connected multi-layer neural network to form a deep architecture interfaced to an HMM for continuous speech recognition. During training, all layers of this entire deep net are regularized using a variant of the “dropout” technique. Experimental evaluation demonstrates the effectiveness of both heterogeneous pooling and dropout regularization. On the TIMIT phonetic recognition task, we have achieved an 18.7% phone error rate, lowest on this standard task reported in the literature with a single system and with no use of information about speaker identity. Preliminary experiments on large vocabulary speech recognition in a voice search task also show error rate reduction using heterogeneous pooling in the deep convolutional neural network.
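
A toy numpy sketch of the pooling idea, with pool sizes and band boundaries invented for illustration: different frequency ranges of a convolutional feature map are max-pooled with different pooling sizes, so the tolerated formant shift varies across bands.

```python
import numpy as np

def heterogeneous_pool(fmap, band_pools=((0, 20, 1), (20, 60, 3), (60, 128, 5))):
    """fmap: (freq, time). band_pools: (start, stop, pool_size) per frequency band."""
    pooled = []
    for start, stop, p in band_pools:
        band = fmap[start:stop]
        for i in range(0, band.shape[0] - p + 1, p):
            pooled.append(band[i:i + p].max(axis=0))   # max over a frequency window
    return np.stack(pooled)
```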

185 citations


Proceedings ArticleDOI
01 Oct 2013
TL;DR: It is demonstrated that the proposed spectro-temporal features achieve a better recognition accuracy than MFCCs.
Abstract: In this contribution, an acoustic event detection system based on spectro-temporal features with a two-layer hidden Markov model as back-end is proposed within the framework of the IEEE AASP challenge 'Detection and Classification of Acoustic Scenes and Events' (D-CASE). Noise reduction based on the log-spectral amplitude estimator of [1] and the noise power density estimation of [2] is used for signal enhancement. Performance is compared for three different kinds of features, all known from automatic speech recognition (ASR): the amplitude modulation spectrogram, Gabor filterbank features, and conventional Mel-frequency cepstral coefficients (MFCCs). The evaluation is based on the office live recordings provided within the D-CASE challenge. The influence of the signal enhancement is investigated, and the proposed spectro-temporal features are shown to achieve better recognition accuracy than MFCCs.

114 citations


Journal ArticleDOI
TL;DR: A novel method to improve the sound event classification performance in severe mismatched noise conditions is proposed, based on the subband power distribution (SPD) image - a novel two-dimensional representation that characterizes the spectral power distribution over time in each frequency subband.
Abstract: The ability to automatically recognize a wide range of sound events in real-world conditions is an important part of applications such as acoustic surveillance and machine hearing. Our approach takes inspiration from both audio and image processing fields, and is based on transforming the sound into a two-dimensional representation, then extracting an image feature for classification. This provided the motivation for our previous work on the spectrogram image feature (SIF). In this paper, we propose a novel method to improve the sound event classification performance in severe mismatched noise conditions. This is based on the subband power distribution (SPD) image - a novel two-dimensional representation that characterizes the spectral power distribution over time in each frequency subband. Here, the high-powered reliable elements of the spectrogram are transformed to a localized region of the SPD, hence can be easily separated from the noise. We then extract an image feature from the SPD, using the same approach as for the SIF, and develop a novel missing feature classification approach based on a k-nearest neighbor (kNN) classifier. We carry out comprehensive experiments on a database of 50 environmental sound classes over a range of challenging noise conditions. The results demonstrate that the SPD-IF is both discriminative over the broad range of sound classes, and robust in severe non-stationary noise.
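
A rough numpy sketch of the SPD representation under stated assumptions (the bin count and normalization are illustrative choices): for each subband, histogram the spectrogram power across time.

```python
import numpy as np

def spd_image(spec, n_bins=50):
    """spec: (subbands, frames) power spectrogram -> (subbands, n_bins) SPD image."""
    s = spec / (spec.max() + 1e-12)               # normalize power to [0, 1]
    spd = np.empty((s.shape[0], n_bins))
    for b in range(s.shape[0]):
        hist, _ = np.histogram(s[b], bins=n_bins, range=(0.0, 1.0))
        spd[b] = hist / s.shape[1]                # fraction of frames per power bin
    return spd
```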

84 citations


Journal ArticleDOI
TL;DR: It is demonstrated using a simplified model of early auditory processing that both neural E and TFS encode the speech spectrogram with constant and comparable relative effectiveness regardless of the vocoder manipulations.
Abstract: There is much debate on how the spectrotemporal modulations of speech (or its spectrogram) are encoded in the responses of the auditory nerve, and whether speech intelligibility is best conveyed via the “envelope” (E) or “temporal fine-structure” (TFS) of the neural responses. Wide use of vocoders to resolve this question has commonly assumed that manipulating the amplitude-modulation and frequency-modulation components of the vocoded signal alters the relative importance of E or TFS encoding on the nerve, thus facilitating assessment of their relative importance to intelligibility. Here we argue that this assumption is incorrect, and that the vocoder approach is ineffective in differentially altering the neural E and TFS. In fact, we demonstrate using a simplified model of early auditory processing that both neural E and TFS encode the speech spectrogram with constant and comparable relative effectiveness regardless of the vocoder manipulations. However, we also show that neural TFS cues are less vulnerable than their E counterparts under severe noisy conditions, and hence should play a more prominent role in cochlear stimulation strategies.
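
The E/TFS split the argument turns on is conventionally made concrete via the analytic signal; a minimal sketch of that standard construction (not the paper's auditory model):

```python
import numpy as np
from scipy.signal import hilbert

def envelope_tfs(x):
    analytic = hilbert(x)
    envelope = np.abs(analytic)            # E: slowly varying amplitude
    tfs = np.cos(np.angle(analytic))       # TFS: rapid fine-structure carrier
    return envelope, tfs
```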

81 citations


Journal ArticleDOI
TL;DR: This letter presents a novel algorithm to compute the instantaneous frequency (IF) of a multicomponent nonstationary signal using a combination of fractional spectrograms (FS).
Abstract: This letter presents a novel algorithm to compute the instantaneous frequency (IF) of a multicomponent nonstationary signal using a combination of fractional spectrograms (FS). A high-resolution time-frequency distribution (TFD) is defined by combining FS computed using windows of varying lengths and chirp rates. The IF of individual signal components is then computed by applying a peak detection and component extraction procedure. The mean square error (MSE) of IF estimates computed with the combined fractional spectrograms is lower than the MSE of IF estimates obtained from other TFDs for SNRs varying from -5 dB to 16 dB.
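
A stripped-down illustration of peak-based IF estimation on an ordinary spectrogram (the paper's method additionally combines fractional spectrograms of varying window lengths and chirp rates and extracts individual components):

```python
import numpy as np
from scipy.signal import spectrogram

def if_estimate(x, fs):
    f, t, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
    return t, f[np.argmax(S, axis=0)]      # one IF estimate per frame (peak frequency)

fs = 8000
t = np.arange(fs) / fs
chirp = np.cos(2 * np.pi * (500 * t + 300 * t ** 2))   # true IF = 500 + 600 t Hz
times, inst_freq = if_estimate(chirp, fs)
```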

74 citations


Journal ArticleDOI
TL;DR: An environmental sound classification algorithm using spectrogram pattern matching along with neural network and k-nearest neighbor (k-NN) classifiers is proposed, based on the observation that local features are more important than global features.

73 citations


Journal ArticleDOI
TL;DR: It is shown that the monaural mixed audio signal is considerably more separable in this nonuniform TF domain, and the analysis of signal separability is provided to verify this finding.
Abstract: A new unsupervised single-channel source separation method is presented. The proposed method does not require training knowledge and the separation system is based on nonuniform time-frequency (TF) analysis and feature extraction. Unlike conventional researches that concentrate on the use of spectrogram or its variants, we develop our separation algorithms using an alternative TF representation based on the gammatone filterbank. In particular, we show that the monaural mixed audio signal is considerably more separable in this nonuniform TF domain. We also provide the analysis of signal separability to verify this finding. In addition, we derive two new algorithms that extend the recently published Itakura-Saito nonnegative matrix factorization to the case of convolutive model for the nonstationary source signals. These formulations are based on the Quasi-EM framework and the multiplicative gradient descent (MGD) rule, respectively. Experimental tests have been conducted which show that the proposed method is efficient in extracting the sources' spectral-temporal features that are characterized by large dynamic range of energy, and thus leading to significant improvement in source separation performance.
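
For background, the (non-convolutive) Itakura-Saito NMF multiplicative updates that the paper extends look roughly like this sketch; the initialization and iteration count are illustrative assumptions:

```python
import numpy as np

def is_nmf(V, K=10, iters=200, eps=1e-12, seed=0):
    """V: nonnegative power spectrogram (F x T); returns factors W (F x K), H (K x T)."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W, H = rng.random((F, K)) + eps, rng.random((K, T)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        W *= ((WH ** -2 * V) @ H.T) / (WH ** -1 @ H.T)   # IS multiplicative update for W
        WH = W @ H + eps
        H *= (W.T @ (WH ** -2 * V)) / (W.T @ WH ** -1)   # IS multiplicative update for H
    return W, H
```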

Journal ArticleDOI
TL;DR: In this article, the authors proposed an approach based on Local Spectrogram Features (LSFs) which represent local spectral information that is extracted from the two-dimensional region surrounding "keypoints" detected in the spectrogram.

Journal ArticleDOI
TL;DR: In this paper, a 2D deconvolution operation on the short-time Fourier transform (STFT) spectrogram was proposed to reduce the computation burden caused by the 2D decomposition operation in the DSTFT.
Abstract: The spectral decomposition technique plays an important role in reservoir characterization, for which the time-frequency distribution method is essential. The deconvolutive short-time Fourier transform (DSTFT) method achieves a superior time-frequency resolution by applying a 2D deconvolution operation on the short-time Fourier transform (STFT) spectrogram. For seismic spectral decomposition, to reduce the computation burden caused by the 2D deconvolution operation in the DSTFT, the 2D STFT spectrogram is cropped into a smaller area, which includes only the positive frequencies falling within the seismic signal bandwidth. In general, because the low-frequency components of a seismic signal are dominant, the removal of the negative frequencies may introduce a sharp edge at the zero frequency, which would produce artifacts in the DSTFT spectrogram. To avoid this problem, we used the analytic signal, which is obtained by applying the Hilbert transform on the original real seismic signal, to calculate the STFT spectrogram in our method. Synthetic and real seismic data examples were evaluated to demonstrate the performance of the proposed method.
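
A minimal sketch of the analytic-signal step only (the 2D deconvolution stage of the DSTFT is omitted): taking the STFT of hilbert(x) suppresses the negative-frequency half without creating a hard edge at zero frequency.

```python
import numpy as np
from scipy.signal import hilbert, stft

def analytic_stft(x, fs, nperseg=128):
    xa = hilbert(x)                               # analytic version of the seismic trace
    f, t, Z = stft(xa, fs=fs, nperseg=nperseg)    # two-sided STFT of a complex signal
    return f, t, np.abs(Z) ** 2                   # STFT power spectrogram
```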

Proceedings ArticleDOI
26 May 2013
TL;DR: A novel approach based on the temporal coding of Local Spectrogram Features (LSFs), which generate spikes that are used to train a Spiking Neural Network (SNN) with temporal learning, able to outperform the conventional frame-based baseline methods.
Abstract: There is much evidence to suggest that the human auditory system uses localised time-frequency information for the robust recognition of sounds. Despite this, conventional systems typically rely on features extracted from short windowed frames over time, covering the whole frequency spectrum. Such approaches are not inherently robust to noise, as each frame will contain a mixture of the spectral information from noise and signal. Here, we propose a novel approach based on the temporal coding of Local Spectrogram Features (LSFs), which generate spikes that are used to train a Spiking Neural Network (SNN) with temporal learning. LSFs represent robust location information in the spectrogram surrounding keypoints, which are detected in a signal-driven manner such that the effect of noise on the temporal coding is reduced. Our experiments demonstrate the robust performance of our approach across a variety of noise conditions, such that it is able to outperform the conventional frame-based baseline methods.

Proceedings Article
Yi-Hsuan Yang
01 Jan 2013
TL;DR: This work uses online dictionary learning to learn the subspaces of vocal and instrumental sounds from a collection of clean signals first, and proposes a new algorithm called multiple low-rank representation (MLRR) to decompose a magnitude spectrogram into two low- rank matrices.
Abstract: Recent research work has shown that the magnitude spectrogram of a song can be considered as a superposition of a low-rank component and a sparse component, which appear to correspond to the instrumental part and the vocal part of the song, respectively. Based on this observation, one can separate singing voice from the background music. However, the quality of such separation might be limited, because the vocal part of a song can sometimes be low-rank as well. Therefore, we propose to learn the subspace structures of vocal and instrumental sounds from a collection of clean signals first, and then compute the low-rank representations of both the vocal and instrumental parts of a song based on the learned subspaces. Specifically, we use online dictionary learning to learn the subspaces, and propose a new algorithm called multiple low-rank representation (MLRR) to decompose a magnitude spectrogram into two low-rank matrices. Our approach is flexible in that the subspaces of singing voice and music accompaniment are both learned from data. Evaluation on the MIR-1K dataset shows that the approach improves the source-to-distortion ratio (SDR) and the source-to-interference ratio (SIR), but not the source-to-artifact ratio (SAR).

Journal ArticleDOI
TL;DR: A new feature descriptor that uses image shape features is proposed to identify bird species based on the recognition of fixed-duration birdsong segments where their corresponding spectrograms are viewed as gray-level images, better than traditional descriptors such as LPCC, MFCC, and TDMFCC.
Abstract: Traditional birdsong recognition approaches used acoustic features based on the acoustic model of speech production or the perceptual model of the human auditory system to identify the associated bird species. In this paper, a new feature descriptor that uses image shape features is proposed to identify bird species based on the recognition of fixed-duration birdsong segments where their corresponding spectrograms are viewed as gray-level images. The MPEG-7 angular radial transform (ART) descriptor, which can compactly and efficiently describe the gray-level variations within an image region in both angular and radial directions, will be employed to extract the shape features from the spectrogram image. To effectively capture both frequency and temporal variations within a birdsong segment using ART, a sector expansion algorithm is proposed to transform its spectrogram image into a corresponding sector image such that the frequency and temporal axes of the spectrogram image will align with the radial and angular directions of the ART basis functions, respectively. For the classification of 28 bird species using Gaussian mixture models (GMM), the best classification accuracy is 86.30% and 94.62% for 3-second and 5-second birdsong segments using the proposed ART descriptor, which is better than traditional descriptors such as LPCC, MFCC, and TDMFCC.

01 Jan 2013
TL;DR: Experiments indicate that the proposed voice conversion system based on non-negative spectrogram deconvolution outperforms the conventional joint density Gaussian mixture model by a wide margin in terms of both objective and subjective evaluations.
Abstract: In traditional voice conversion, converted speech is generated using statistical parametric models (for example, a Gaussian mixture model) whose parameters are estimated from parallel training utterances. A well-known problem of statistical parametric methods is that statistical averaging in parameter estimation results in over-smoothing of the speech parameter trajectories, and thus leads to low conversion quality. Inspired by the recent success of so-called exemplar-based methods in robust speech recognition, we propose a voice conversion system based on non-negative spectrogram deconvolution built on similar ideas. Exemplars, which are able to capture temporal context, are employed to convolutively generate the converted speech spectrogram. The exemplar-based approach is a data-driven, non-parametric alternative to the traditional parametric approaches to voice conversion. Experiments on the VOICES database indicate that the proposed method outperforms the conventional joint density Gaussian mixture model by a wide margin in terms of both objective and subjective evaluations.

Proceedings ArticleDOI
01 Oct 2013
TL;DR: A novel speech enhancement system based on decomposing the spectrogram into sparse activation of a dictionary of target speech templates, and a low-rank background model, which makes few assumptions about the noise other than its limited spectral variation is proposed.
Abstract: Speech enhancement requires some principle by which to distinguish speech and noise, and the most successful separation requires strong models for both speech and noise. If, however, the noise encountered differs significantly from the system's assumptions, performance will suffer. In this work, we propose a novel speech enhancement system based on decomposing the spectrogram into sparse activation of a dictionary of target speech templates, and a low-rank background model, which makes few assumptions about the noise other than its limited spectral variation. A variation of this model specifically designed to handle transient noise intrusions is also proposed. Evaluation via BSS EVAL and PESQ show that the new approaches improve signal-to-distortion ratio in most cases and PESQ in high-noise conditions when compared to several traditional speech enhancement algorithms including log-MMSE.

Proceedings ArticleDOI
01 Sep 2013
TL;DR: This paper uses micro-Doppler features in the radar signal corresponding to human body motions and gait to detect falls using a narrowband pulse-Doppler radar, achieving fast and accurate fall detection.
Abstract: Falls are one of the greatest threats to the health of the elderly as they carry out their daily living routines and activities. Therefore, it is very important to detect falls of an elderly person in a timely and accurate manner, so that immediate response and proper care can be rendered. Radar is an effective non-intrusive sensing modality which is well suited for this purpose. It can detect human motions in all types of environments, penetrate walls and fabrics, preserve privacy, and is insensitive to lighting conditions. In this paper, we use micro-Doppler features in the radar signal corresponding to human body motions and gait to detect falls using a narrowband pulse-Doppler radar. Human motions cause time-varying Doppler signatures, which are analyzed using time-frequency representations and matching pursuit decomposition for feature extraction and fall detection. The extracted features include the principal components of the time-frequency signal representations. To analyze the sequential characteristics of typical falls, we use the extracted signal features for training and testing hidden Markov models and support vector machines in different falling scenarios. Experimental results demonstrate that the proposed algorithm and method achieve fast and accurate fall detection.

Journal ArticleDOI
TL;DR: This paper presents an open-source implementation of the Hilbert-Huang transform (HHT), an alternative spectral method designed to avoid the linearity and stationarity constraints of Fourier analysis.
Abstract: Online Material: Color versions of spectrogram figures; R and hht code installation instructions with examples. The Fourier transform remains one of the most popular spectral methods in time-series analysis, so much so that the word "spectrum" is virtually equivalent to "Fourier spectrum" (Huang et al., 2001). This method assumes that a time series extends from positive to negative infinity (stationarity) and consists of a linear superposition of sinusoids (linearity). However, geophysical signals are never stationary and are not necessarily linear. This results in a trade-off between time and frequency resolution for nonstationary signals and the creation of spurious harmonics for nonlinear signals. We present an open-source implementation of the Hilbert-Huang transform (HHT), an alternative spectral method designed to avoid the linearity and stationarity constraints of Fourier analysis. The HHT defines instantaneous frequency as the time derivative of phase, illuminating previously inaccessible spectral details in transient signals. Nonlinear signals become frequency modulations rather than a series of fitted sinusoids, eliminating artificial harmonics in the resulting spectrogram. In this paper, we describe the HHT algorithm and present our recently developed hht package for the R programming language. This package includes routines for empirical mode decomposition (EMD), ensemble empirical mode decomposition (EEMD) and Hilbert spectral analysis. It also comes with high-level plotting functions for easy and accurate visualization of the resulting waveforms and spectra. We demonstrate this code by applying it to three signals: a synthetic nonlinear waveform, a transient signal recorded at Deception Island volcano, Antarctica, and quasi-harmonic tremor from Reventador volcano, Ecuador. The synthetic signal shows how the EMD method breaks complex time series into simpler modes. It also illustrates how the Hilbert transforms of nonlinear signals produce frequency oscillations rather than harmonics. The transient signal demonstrates the high time-frequency resolution of the HHT method. The volcanic-tremor signal has high-frequency harmonics in the …
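
The HHT's definition of instantaneous frequency is easy to state in code; a minimal numpy sketch applied to a single intrinsic mode function (the EMD step that produces IMFs is omitted, and the R package's own API is not reproduced here):

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(imf, fs):
    phase = np.unwrap(np.angle(hilbert(imf)))     # analytic phase of one IMF
    return np.diff(phase) * fs / (2.0 * np.pi)    # Hz: time derivative of phase
```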

Proceedings ArticleDOI
25 Aug 2013
TL;DR: The method overcomes the limitation of conventional non-negative matrix factorisation algorithms to utilise the redundancy of sounds in frequency and synthesise sounds separated by filtering the mixture signal with a Wiener-like filter generated from the estimated tensor factors.
Abstract: This paper proposes an algorithm for separating monaural audio signals by non-negative tensor factorisation of modulation spectrograms. The modulation spectrogram is able to represent redundant patterns across frequency with similar features, and the tensor factorisation is able to isolate these patterns in an unsupervised way. The method overcomes the limitation of conventional non-negative matrix factorisation algorithms to utilise the redundancy of sounds in frequency. In the proposed method, separated sounds are synthesised by filtering the mixture signal with a Wiener-like filter generated from the estimated tensor factors. The proposed method was compared to conventional algorithms in unsupervised separation of mixtures of speech and music. Improved signal to distortion ratios were obtained compared to standard non-negative matrix factorisation and non-negative matrix deconvolution.

Book ChapterDOI
20 Nov 2013
TL;DR: A novel approach for automatic music genre recognition in the visual domain that uses two texture descriptors and it is shown that the SVM classifier trained with LPQ is able to achieve a recognition rate above 80%.
Abstract: This paper presents a novel approach for automatic music genre recognition in the visual domain that uses two texture descriptors. For this, the audio signal is converted into spectrograms and then textural features are extracted from this visual representation. Gabor filters and LPQ texture descriptors were used to capture the spectrogram content. In order to evaluate the performance of local feature extraction, several different zoning mechanisms were taken into account. The experiments were performed on the Latin Music Database. In the end, we show that the SVM classifier trained with LPQ is able to achieve a recognition rate above 80%. This rate is among the best results ever reported in the literature.

Journal ArticleDOI
TL;DR: This paper evaluates the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music, and explores two different learning strategies.
Abstract: This paper addresses the design of feature functions for the matching of a musical recording to the symbolic representation of the piece (the score). These feature functions are defined as dissimilarity measures between the audio observations and template vectors corresponding to the score. By expressing the template construction as a linear mapping from the symbolic to the audio representation, one can learn the feature functions by optimizing the linear transformation. In this paper, we explore two different learning strategies. The first one uses a best-fit criterion (minimum divergence), while the second one exploits a discriminative framework based on a Conditional Random Fields model (maximum likelihood criterion). We evaluate the influence of the feature functions in an audio-to-score alignment task, on a large database of popular and classical polyphonic music. The results show that with several types of models, using different temporal constraints, the learned mappings have the potential to outperform the classic heuristic mappings. Several representations of the audio observations, along with several distance functions, are compared in this alignment task. Our experiments favor the symmetric Kullback-Leibler divergence. Moreover, both the spectrogram and a CQT-based representation turn out to provide very accurate alignments, detecting more than 97% of the onsets with a precision of 100 ms with our most complex system.

Journal ArticleDOI
TL;DR: The usage of coprime sampling for calculating the ambiguity function of the matched filter in a radar system is investigated, its effect is examined, and several useful guidelines for choosing configurations that enable sparse sensing while retaining detection quality are given.
Abstract: Estimating the spectrogram of a non-stationary signal relates to many important applications in radar signal processing. In recent years, coprime sampling and arrays have attracted attention for their potential for sparse sensing, with the ability to estimate autocorrelation coefficients at all lags, which can in turn be used to calculate the power spectral density. But this theoretical merit is based on the premise that the input signals are wide-sense stationary. In this article, we discuss how to implement coprime sampling for non-stationary signals, and especially how to attain the benefits of coprime sampling while limiting the disadvantages due to the lack of observations for estimation. Furthermore, we investigate the usage of coprime sampling for calculating the ambiguity function of the matched filter in a radar system. We also examine its effect and conclude with several useful guidelines for choosing configurations that enable sparse sensing while retaining detection quality.

Journal ArticleDOI
TL;DR: This paper uses a single ultrasonic sensor on an electronic cane to detect staircases; using a multiclass SVM approach, a recognition rate of 82.4% is achieved.
Abstract: Blind people need aids to interact with their environment more safely. A new device is therefore proposed to enable them to see the world with their ears. Considering not only system requirements but also technology cost, we used ultrasonic sensors and one monocular camera in the design of our tool, to make the user aware of the presence and nature of potential obstacles. In this paper, we focus on using only one ultrasonic sensor to detect staircases with an electronic cane. In this context, no previous work has considered such a challenge. Aware that the performance of an object recognition system depends on both object representation and classification algorithms, our system uses three frequency-domain representations of the ultrasonic signal: the spectrogram, which shows how the spectral density of the signal varies with time; the spectrum, which shows amplitude as a function of frequency; and the periodogram, which estimates the spectral density of the signal. Several features extracted from each representation contribute to the classification process. Our system was evaluated on a set of ultrasonic signals in which staircases occur with different shapes. Using a multiclass SVM approach, a recognition rate of 82.4% has been achieved.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a technique for Informed Source Separation (ISS) of a single channel mixture, based on the Multiple Input Spectrogram Inversion (MISI) phase estimation method.
Abstract: This paper presents a technique for Informed Source Separation (ISS) of a single channel mixture, based on the Multiple Input Spectrogram Inversion (MISI) phase estimation method. The reconstruction of the source signals is iterative, alternating between a time-frequency consistency enforcement and a re-mixing constraint. A dual resolution technique is also proposed, for sharper transients reconstruction. The two algorithms are compared to a state-of-the-art Wiener-based ISS technique, on a database of fourteen monophonic mixtures, with standard source separation objective measures. Experimental results show that the proposed algorithms outperform both this reference technique and the oracle Wiener filter by up to 3 dB in distortion, at the cost of a significantly heavier computation.
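
A hedged sketch of a MISI-style loop of the kind the paper builds on: alternate a spectrogram-consistency projection with a re-mixing constraint that redistributes the mixture residual across sources. Parameters and helper names are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def _resynth(Z, fs, nperseg, length):
    """Inverse STFT, padded/trimmed to a fixed length for safe re-mixing."""
    x = istft(Z, fs=fs, nperseg=nperseg)[1]
    return np.pad(x, (0, max(0, length - len(x))))[:length]

def misi(mix, mags, fs, iters=50, nperseg=1024):
    """mags: one target magnitude spectrogram per source (same STFT grid as mix)."""
    _, _, X = stft(mix, fs=fs, nperseg=nperseg)
    phases = [np.angle(X)] * len(mags)            # initialize with the mixture phase
    for _ in range(iters):
        srcs = [_resynth(m * np.exp(1j * p), fs, nperseg, len(mix))
                for m, p in zip(mags, phases)]
        err = (mix - np.sum(srcs, axis=0)) / len(mags)   # re-mixing constraint
        phases = [np.angle(stft(s + err, fs=fs, nperseg=nperseg)[2]) for s in srcs]
    return srcs
```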

Proceedings Article
01 Jan 2013
TL;DR: Positive semidefinite tensor factorization (PSDTF) is used for directly estimating source signals from the mixture signal in the time domain and an efficient multiplicative update algorithm for PSDTF can be derived.
Abstract: This paper presents a new fundamental technique for source separation of single-channel audio signals. Although nonnegative matrix factorization (NMF) has recently become very popular for music source separation, it deals only with the amplitude or power of the spectrogram of a given mixture signal and completely discards the phase. The component spectrograms are typically estimated using a Wiener filter that reuses the phase of the mixture spectrogram, but such rough phase reconstruction makes it hard to recover high-quality source signals because the estimated spectrograms are inconsistent, i.e., they do not correspond to any real time-domain signals. To avoid the frequency-domain phase reconstruction, we use positive semidefinite tensor factorization (PSDTF) for directly estimating source signals from the mixture signal in the time domain. Since PSDTF is a natural extension of NMF, an efficient multiplicative update algorithm for PSDTF can be derived. Experimental results show that PSDTF outperforms conventional NMF variants in terms of source separation quality.

Journal ArticleDOI
TL;DR: An automated infant cry analyzer with high accuracy to detect important acoustic features of cry is described and validated, which has implications for basic and applied research on infant cry development.
Abstract: Purpose: In this article, the authors describe and validate the performance of a modern acoustic analyzer specifically designed for infant cry analysis. Method: Utilizing known algorithms, the authors developed a method to extract acoustic parameters describing infant cries from standard digital audio files. They used a frame rate of 25 ms with a frame advance of 12.5 ms. Cepstral-based acoustic analysis proceeded in 2 phases, computing frame-level data and then organizing and summarizing this information within cry utterances. Using signal detection methods, the authors evaluated the accuracy of the automated system to determine voicing and to detect fundamental frequency (F0) as compared to voiced segments and pitch periods manually coded from spectrogram displays. Results: The system detected F0 with 88% to 95% accuracy, depending on tolerances set at 10 to 20 Hz. Receiver operating characteristic analyses demonstrated very high accuracy at detecting voicing characteristics in the cry samples. Conclusions: ...

Journal ArticleDOI
TL;DR: In this article, an explicit form for the reassigned Gabor spectrogram of an Hermite function of arbitrary order is given, and it is shown that the energy concentration sharply localizes outside the border of a clearance area limited by the "classical" circle where the spectrogram attains its maximum value.
Abstract: An explicit form is given for the reassigned Gabor spectrogram of an Hermite function of arbitrary order. It is shown that the energy concentration sharply localizes outside the border of a clearance area limited by the “classical” circle where the Gabor spectrogram attains its maximum value, with a perfect localization that can only be achieved in the limit of infinite order.

Proceedings ArticleDOI
26 May 2013
TL;DR: The refinement method allows end-users to provide feedback to the separation process by painting on spectrogram displays of intermediate output results and is able to perform high-quality separation with minimal user-interaction.
Abstract: We propose an interactive refinement method for supervised and semi-supervised single-channel source separation. The refinement method allows end-users to provide feedback to the separation process by painting on spectrogram displays of intermediate output results. The time-frequency annotations are then used to update the separation estimates and iteratively refine the results. The initial separation is performed using probabilistic latent component analysis and is then extended to incorporate the painting annotations using linear grouping expectation constraints via the framework of posterior regularization. Using a prototype user-interface, we show that the method is able to perform high-quality separation with minimal user-interaction.