
Showing papers on "Spectrogram published in 2016"


Proceedings ArticleDOI
20 Mar 2016
TL;DR: In this paper, a deep network is trained to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures.
Abstract: We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary source classes and number, "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pair-wise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.
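
A minimal sketch of the test-time decoding step, assuming a (time, frequency, D) embedding tensor from an already-trained network (the shapes and the decode_masks helper are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def decode_masks(embeddings, n_sources=2):
    """Cluster per-bin embeddings (T, F, D) into binary time-frequency masks."""
    T, F, D = embeddings.shape
    flat = embeddings.reshape(-1, D)
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(flat)
    return [(labels == k).reshape(T, F) for k in range(n_sources)]

# Toy usage: random embeddings stand in for the network output.
emb = np.random.randn(100, 129, 40).astype(np.float32)
masks = decode_masks(emb, n_sources=2)  # apply each mask to the mixture STFT
```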

1,216 citations


Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because the phase spectrum was long believed to be unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach at a rate of at least 69%.
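
The training target here is the complex ratio of the clean to the noisy STFT, split into real and imaginary parts. A hedged numpy sketch (window settings and the flooring constant are assumptions; the published method additionally compresses the mask before regression):

```python
import numpy as np
from scipy.signal import stft

def complex_irm(clean, noisy, fs=16000, nperseg=512):
    """Ideal ratio mask in the complex domain: M = S / Y per T-F bin."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    M = S / (Y + 1e-8)       # complex division yields real and imaginary parts
    return M.real, M.imag    # regression targets for the DNN
```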

699 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: The intuition is that, because different people walk with different gaits, the WiFi signal reflected by a walking human generates unique variations in the Channel State Information at the WiFi receiver; WifiU therefore uses commercial WiFi devices to capture fine-grained gait patterns to recognize humans.
Abstract: In this paper, we propose WifiU, which uses commercial WiFi devices to capture fine-grained gait patterns to recognize humans. The intuition is that, due to the differences in the gaits of different people, the WiFi signal reflected by a walking human generates unique variations in the Channel State Information (CSI) at the WiFi receiver. To profile human movement using CSI, we use signal processing techniques to generate spectrograms from CSI measurements so that the resulting spectrograms are similar to those generated by specifically designed Doppler radars. To extract features from spectrograms that best characterize the walking pattern, we perform autocorrelation on the torso reflection to remove imperfections in the spectrograms. We evaluated WifiU on a dataset with 2,800 gait instances collected from 50 human subjects walking in a room with an area of 50 square meters. Experimental results show that WifiU achieves top-1, top-2, and top-3 recognition accuracies of 79.28%, 89.52%, and 93.05%, respectively.

447 citations


Journal ArticleDOI
TL;DR: It is shown that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility.
Abstract: Intelligibility listening tests are necessary during development and evaluation of speech processing algorithms, despite the fact that they are expensive and time consuming. In this paper, we propose a monaural intelligibility prediction algorithm, which has the potential of replacing some of these listening tests. The proposed algorithm shows similarities to the short-time objective intelligibility (STOI) algorithm, but works for a larger range of input signals. In contrast to STOI, extended STOI (ESTOI) does not assume mutual independence between frequency bands. ESTOI also incorporates spectral correlation by comparing complete 400-ms spectrograms of the noisy/processed speech and the clean speech signals. As a consequence, ESTOI is also able to accurately predict the intelligibility of speech contaminated by temporally highly modulated noise sources in addition to noisy signals processed with time-frequency weighting. We show that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility. A free MATLAB implementation of the algorithm is available for noncommercial use at http://kom.aau.dk/~jje/.
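
The intermediate measure at ESTOI's core correlates mean- and variance-normalized spectrogram segments. The sketch below illustrates that idea only and is not the published algorithm (which uses 1/3-octave bands, 400-ms segments, and a specific normalization order):

```python
import numpy as np

def segment_correlation(X, Y, seg_len=30):
    """Average correlation of normalized segments of two spectrograms
    (bands x frames): X clean, Y noisy/processed."""
    scores = []
    for s in range(0, X.shape[1] - seg_len + 1, seg_len):
        x = X[:, s:s + seg_len].copy()
        y = Y[:, s:s + seg_len].copy()
        for a in (x, y):  # normalize rows, then columns
            a -= a.mean(axis=1, keepdims=True)
            a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-12
            a -= a.mean(axis=0, keepdims=True)
            a /= np.linalg.norm(a, axis=0, keepdims=True) + 1e-12
        scores.append(np.sum(x * y) / seg_len)
    return float(np.mean(scores))
```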

404 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: In this paper, an end-to-end signal approximation objective was proposed to improve the performance of a speaker-independent multi-speaker separation system using deep clustering, which achieved a 10.3 dB improvement in the SDR.
Abstract: Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.

354 citations


Journal ArticleDOI
TL;DR: A new rolling bearing fault diagnosis method based on the short-time Fourier transform and a stacked sparse autoencoder is proposed; the method analyzes sound signals and is compared with empirical mode decomposition, the Teager energy operator, and a stacked sparse autoencoder applied to vibration signals to verify its performance and effectiveness.
Abstract: The main challenge of fault diagnosis lies in finding good fault features. A deep learning network has the ability to automatically learn good characteristics from input data in an unsupervised fashion, and its unique layer-wise pretraining and fine-tuning using the backpropagation strategy can solve the difficulties of training deep multilayer networks. Stacked sparse autoencoders or other deep architectures have shown excellent performance in speech recognition, face recognition, text classification, image recognition, and other application domains. Thus far, however, there have been very few research studies on deep learning in fault diagnosis. In this paper, a new rolling bearing fault diagnosis method that is based on short-time Fourier transform and stacked sparse autoencoder is first proposed; this method analyzes sound signals. After spectrograms are obtained by short-time Fourier transform, stacked sparse autoencoder is employed to automatically extract the fault features, and softmax regression is adopted as the method for classifying the fault modes. The proposed method, when applied to sound signals that are obtained from a rolling bearing test rig, is compared with empirical mode decomposition, Teager energy operator, and stacked sparse autoencoder when using vibration signals to verify the performance and effectiveness of the proposed method.
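
The first stage of such a pipeline amounts to turning each sound clip into a normalized log-magnitude spectrogram that feeds the autoencoder. A rough sketch (FFT length, overlap, and normalization are assumptions, not the paper's settings):

```python
import numpy as np
from scipy.signal import stft

def spectrogram_features(sound, fs, nperseg=1024, noverlap=512):
    """Flattened, normalized log-magnitude spectrogram as autoencoder input."""
    _, _, Z = stft(sound, fs=fs, nperseg=nperseg, noverlap=noverlap)
    logmag = np.log1p(np.abs(Z))
    logmag = (logmag - logmag.min()) / (np.ptp(logmag) + 1e-12)
    return logmag.ravel()  # one training vector per clip
```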

157 citations


Posted Content
TL;DR: In this paper, a large-scale music dataset, MusicNet, is introduced to serve as a source of supervision and evaluation of machine learning methods for music research, which consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments.
Abstract: This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.

135 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: The results show that AudioGest can detect six hand gestures with an accuracy of up to 96%, and by distinguishing gesture attributes, it can provide up to 162 control commands for various applications.
Abstract: Hand gesture is becoming an increasingly popular means of interacting with consumer electronic devices, such as mobile phones, tablets and laptops. In this paper, we present AudioGest, a device-free gesture recognition system that can accurately sense the hand in-air movement around user's devices. Compared to the state-of-the-art, AudioGest is superior in using only one pair of built-in speaker and microphone, without any extra hardware or infrastructure support and with no training, to achieve fine-grained hand detection. Our system is able to accurately recognize various hand gestures, estimate the hand in-air time, as well as the average moving speed and waving range. We achieve this by transforming the device into an active sonar system that transmits an inaudible audio signal and decodes the hand's echoes at its microphone. We address various challenges, including cleaning the noisy reflected sound signal, interpreting the echo spectrogram into hand gestures, decoding the Doppler frequency shifts into the hand waving speed and range, and being robust to environmental motion and signal drifting. We implement the proof-of-concept prototype on three different electronic devices and extensively evaluate the system in four real-world scenarios using 3,900 hand gestures collected by five users over more than two weeks. Our results show that AudioGest can detect six hand gestures with an accuracy of up to 96%, and by distinguishing gesture attributes, it can provide up to 162 control commands for various applications.
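
Decoding waving speed from the echo rests on the standard two-way Doppler relation. A tiny worked example; the 20 kHz carrier is an assumption, since the exact tone AudioGest emits is not stated in the abstract:

```python
C = 343.0      # speed of sound in air, m/s
F0 = 20_000.0  # assumed inaudible carrier frequency, Hz

def hand_speed(doppler_shift_hz: float) -> float:
    """Radial hand speed from the echo's Doppler shift.
    Reflection doubles the shift: df = 2 * v * f0 / c, so v = df * c / (2 * f0)."""
    return doppler_shift_hz * C / (2.0 * F0)

print(hand_speed(100.0))  # a 100 Hz shift corresponds to ~0.86 m/s
```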

120 citations


Journal ArticleDOI
TL;DR: This work proposes an ensemble of heterogeneous classifiers for maximizing the performance that could be obtained starting from the acoustic features, and shows for the first time that a bag of feature approach can be effective in this problem.
Abstract: Coupling texture descriptors and acoustic features. A bag-of-features approach can be effectively used in this problem. A heterogeneous ensemble of different classifiers improves performance. Since musical genre is one of the most common ways used by people for managing digital music databases, music genre recognition is a crucial task, studied in depth by the Music Information Retrieval (MIR) research community since 2002. In this work we present a novel and effective approach for automated musical genre recognition based on the fusion of different sets of features. Both acoustic and visual features are considered, evaluated, compared and fused in a final ensemble which shows classification accuracy comparable to or even better than other state-of-the-art approaches. The visual features are locally extracted from sub-windows of the spectrogram taken by Mel-scale zoning: the input signal is represented by its spectrogram, which is divided into sub-windows in order to extract local features; feature extraction is performed by calculating texture descriptors and bag-of-features projections from each sub-window; the final decision is taken using an ensemble of SVM classifiers. In this work we show for the first time that a bag-of-features approach can be effective in this problem. As far as the acoustic features are concerned, we propose an ensemble of heterogeneous classifiers for maximizing the performance that can be obtained starting from the acoustic features. First, timbre features are obtained from the audio signal; second, some statistical measures are calculated from the texture window and the modulation spectrum; third, a feature selection is executed to increase the recognition performance and decrease the computational complexity. Finally, the resulting descriptors are classified by fusing the scores of heterogeneous classifiers (SVM and random subspace of AdaBoost). The experimental evaluation is performed on three well-known databases: the Latin Music Database (LMD), the ISMIR 2004 database and the GTZAN genre collection. The reported performance of the proposed approach is very encouraging, since it outperforms other state-of-the-art approaches, without any ad hoc parameter optimization (i.e. using the same ensemble of classifiers and parameter settings on all three datasets). The advantage of using both visual and audio features is also proved by means of the Q-statistic, which confirms that the two sets of features are partially independent and suitable to be fused together in a heterogeneous system. The MATLAB code of the ensemble of classifiers and of the visual feature extraction will be made publicly available (see footnote 1) to other researchers for future comparisons. The code for the acoustic features is not available since it is used in a commercial system.
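
As a rough illustration of the visual branch, the sketch below zones a Mel spectrogram and computes a local binary pattern (LBP) histogram per zone; the zone count, LBP settings, and quantization are assumptions rather than the paper's exact descriptors:

```python
import numpy as np
import librosa
from skimage.feature import local_binary_pattern

def texture_features(y, sr, n_zones=4):
    """Concatenated LBP histograms from Mel-zoned spectrogram sub-windows."""
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    feats = []
    for zone in np.array_split(S, n_zones, axis=0):   # Mel-scale zoning
        z = np.uint8(255 * (zone - zone.min()) / (np.ptp(zone) + 1e-12))
        lbp = local_binary_pattern(z, P=8, R=1, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        feats.append(hist)
    return np.concatenate(feats)  # input to the SVM ensemble
```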

102 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed micro-Doppler features can achieve a good discriminative ability and a satisfactory classification performance.
Abstract: A novel feature extraction method based on micro-Doppler signature is proposed to categorize ground moving targets into three kinds, i.e., single walking person, two people walking, and a moving wheeled vehicle. Signal models and measured data from a low-resolution radar are first analyzed to find the differences between the micro-Doppler signatures from the three kinds of considered targets. Then, such discriminative micro-Doppler signatures are represented by a 3-D feature vector extracted from the time-frequency spectrograms. In the experiments based on the measured data, the ratio of the between-class distance to the within-class distance, which is defined based on Fisher discriminant analysis, is exploited to assess the discriminative ability of the 3-D feature vector. Moreover, support vector machine classifier is utilized to evaluate the classification performance. Experimental results show that the proposed micro-Doppler features can achieve a good discriminative ability and a satisfactory classification performance.
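
The discriminability criterion is the classic Fisher ratio of between-class to within-class scatter. A generic numpy version (the authors' exact distance definition may differ):

```python
import numpy as np

def fisher_ratio(features, labels):
    """Between-class to within-class scatter ratio for an
    (n_samples x n_features) matrix; larger means more discriminative."""
    overall = features.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        Xc = features[labels == c]
        between += len(Xc) * np.sum((Xc.mean(axis=0) - overall) ** 2)
        within += np.sum((Xc - Xc.mean(axis=0)) ** 2)
    return between / (within + 1e-12)
```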

84 citations


Proceedings ArticleDOI
01 Aug 2016
TL;DR: This paper presents a new automatic and intelligent fault diagnosis method based on a convolutional neural network; the enhanced convolutional neural network achieves the best classification performance, 5% and 4% higher than the ReLU and Dropout networks, respectively.
Abstract: Feature extraction is an important step in conventional vibration-based fault diagnosis methods. However, the features are usually empirically extracted, leading to inconsistent performance. This paper presents a new automatic and intelligent fault diagnosis method based on a convolutional neural network. Firstly, the vibration signal is processed by wavelet transform into a multi-scale spectrogram image to manifest the fault characteristics. Next, the spectrogram image is fed directly into the convolutional neural network, which learns an invariant representation of the vibration signal and recognizes the fault status for fault diagnosis. During model construction, the rectified linear unit (ReLU) activation function and a dropout layer are incorporated into the convolutional neural network to improve computational efficiency and model generalization. Four networks are trained and tested on the same data: a traditional convolutional neural network, a ReLU network, a Dropout network, and the enhanced convolutional neural network. Comparing their results shows that the enhanced convolutional neural network reaches a classification accuracy of 96%, 8% higher than the traditional convolutional neural network. By adjusting p, the keep probability of dropout, three sparse networks are trained and their classification results compared; with p = 0.4, the enhanced convolutional neural network achieves the best classification performance, 5% and 4% higher than the ReLU and Dropout networks, respectively.
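
A minimal PyTorch sketch of a CNN with ReLU activations and a dropout layer, in the spirit of the enhanced network; all layer sizes are assumptions. Note that the paper's p is a keep probability, whereas PyTorch's Dropout takes a drop probability:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=4, p_keep=0.4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=1.0 - p_keep),  # torch drops with probability p
            nn.LazyLinear(128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, H, W) wavelet spectrogram images
        return self.classifier(self.features(x))

model = SpectrogramCNN()
logits = model(torch.randn(8, 1, 64, 64))  # toy batch
```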

Journal ArticleDOI
TL;DR: An automated fall detection system based on smartphone audio features is developed; the best performance is achieved using spectrogram features with the ANN classifier, with sensitivity, specificity, and accuracy all above 98%.
Abstract: An automated fall detection system based on smartphone audio features is developed. The spectrogram, mel frequency cepstral coefficient (MFCC), linear predictive coding (LPC), and matching pursuit (MP) features of different fall and no-fall sound events are extracted from experimental data. Based on the extracted audio features, four different machine learning classifiers: k-nearest neighbor (k-NN), support vector machine (SVM), least squares method (LSM), and artificial neural network (ANN) are investigated for distinguishing between fall and no-fall events. For each audio feature, the performance of each classifier in terms of sensitivity, specificity, accuracy, and computational complexity is evaluated. The best performance is achieved using spectrogram features with the ANN classifier, with sensitivity, specificity, and accuracy all above 98%. The classifier also has acceptable computational requirements for training and testing. The system is applicable in home environments where the phone is placed in the vicinity of the user.
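
Two of the four feature types (spectrogram statistics and MFCCs) are easy to sketch with librosa; LPC and matching-pursuit features are omitted, and the mean-pooling choice is an assumption:

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier  # stands in for the ANN

def audio_features(path, sr=16000, n_mfcc=13):
    """Mean-pooled spectrogram and MFCC features for one sound clip."""
    y, sr = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=512))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([spec.mean(axis=1), mfcc.mean(axis=1)])

# clf = MLPClassifier(hidden_layer_sizes=(64,)).fit(X_train, y_train)
# where X_train stacks audio_features(...) over labeled fall/no-fall clips
```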

Journal ArticleDOI
TL;DR: In this paper, a ptychographic reconstruction algorithm is proposed for frequency-resolved optical gating (FROG), demonstrating robust and complete characterization of two unknown pulses from a single measured spectrogram and the power spectrum of only one of the pulses.
Abstract: Frequency-resolved optical gating (FROG) is probably the most popular technique for complete characterization of ultrashort laser pulses. In FROG, a reconstruction algorithm retrieves the pulse from a measured spectrogram, yet current FROG reconstruction algorithms require and exhibit several restricting features that weaken FROG performance. For example, the delay step must correspond to the spectral bandwidth measured with large enough SNR, a condition that limits the temporal resolution of the reconstructed pulse, obscures measurements of weak broadband pulses, and makes measurement of broadband mid-IR pulses hard and slow because the spectrograms become huge. We develop a new approach for FROG reconstruction, based on ptychography (a scanning coherent diffraction imaging technique), that removes many of the algorithmic restrictions. The ptychographic reconstruction algorithm is significantly faster and more robust to noise than current FROG algorithms, which are based on generalized projections (GP). We demonstrate, numerically and experimentally, that ptychographic reconstruction works well with very partial spectrograms, e.g. spectrograms with a reduced number of measured delays and spectrograms that have been substantially spectrally filtered. In addition, we apply the ptychographic approach to blind second harmonic generation (SHG) FROG and demonstrate robust and complete characterization of two unknown pulses from a single measured spectrogram and the power spectrum of only one of the pulses. We believe that the ptychography-based approach will become the standard reconstruction procedure in FROG and related diagnostic methods, allowing successful reconstructions from so far unreconstructable spectrograms.

Posted Content
TL;DR: This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Abstract: Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.

Proceedings Article
01 Oct 2016
TL;DR: The DCNN outperforms several conventional classifiers, showing the possible benefits of deep learning approaches in human gait classification.
Abstract: This paper presents the use of a deep convolutional neural network (DCNN) in distinguishing between the absence of human gait and the presence of single or multiple instances of human gait by applying the DCNN to micro-Doppler spectrograms. The approach is evaluated for various radar frequencies and SNR levels using model data, while final validation is performed using X-band CW radar measurements. Satisfactory results are obtained by the DCNN, which outperforms several conventional classifiers, showing the possible benefits of deep learning approaches in human gait classification.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This study improves the acoustic model by proposing a 2-D, time-frequency (TF) LSTM, which jointly scans the input over the time and frequency axes to model spectro-temporal warping, and uses the output activations as the input to a time L STM (T-LSTM).
Abstract: Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks. A key aspect of these models is the use of time recurrence, combined with a gating architecture that allows them to track the long-term dynamics of speech. Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM) that performs 1-D recurrence over the frequency axis and then performs 1-D recurrence over the time axis. In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM. The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to a time LSTM (T-LSTM). The joint time-frequency modeling better normalizes the features for the upper layer T-LSTMs. Evaluated on a 375-hour short message dictation task, the proposed TF-LSTM obtained a 3.4% relative WER reduction over the best T-LSTM. The invariance property achieved by joint time-frequency analysis is demonstrated on a mismatched test set, where the TF-LSTM achieves a 14.2% relative WER reduction over the best T-LSTM.
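
The 1-D cascade that the TF-LSTM builds on, a frequency LSTM whose per-frame summary feeds a time LSTM, can be sketched in a few lines of PyTorch (sizes are illustrative; the actual TF-LSTM scans time and frequency jointly rather than sequentially):

```python
import torch
import torch.nn as nn

class FTLSTM(nn.Module):
    def __init__(self, f_hidden=64, t_hidden=128, n_out=1000):
        super().__init__()
        self.f_lstm = nn.LSTM(1, f_hidden, batch_first=True)
        self.t_lstm = nn.LSTM(f_hidden, t_hidden, batch_first=True)
        self.out = nn.Linear(t_hidden, n_out)

    def forward(self, x):              # x: (batch, time, freq) log-mel input
        B, T, F = x.shape
        z = x.reshape(B * T, F, 1)     # scan along frequency within each frame
        _, (h, _) = self.f_lstm(z)     # final state summarizes the spectrum
        z = h.squeeze(0).reshape(B, T, -1)
        z, _ = self.t_lstm(z)          # recurrence over time
        return self.out(z)             # per-frame acoustic-state logits

y = FTLSTM()(torch.randn(4, 50, 40))   # toy batch: 4 utterances, 50 frames
```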

Journal ArticleDOI
TL;DR: This letter investigates the use of micro-Doppler signatures experimentally recorded by a multistatic radar system to perform recognition of people walking and reports high classification accuracy above 98% for the most favorable aspect angle.
Abstract: In this letter, we investigate the use of micro-Doppler signatures experimentally recorded by a multistatic radar system to perform recognition of people walking. Three different sets of features are tested, taking into account the impact on the overall classification performance of parameters, such as aspect angle, types of classifier, different values of signal-to-noise ratio, and different ways of exploiting multistatic information. High classification accuracy of above 98% is reported for the most favorable aspect angle, and the benefit of using multistatic data at less favorable angles is discussed.

Journal ArticleDOI
TL;DR: The results indicate the efficiency of the proposed procedure in impulsive noise cancellation and its ability to detect damage.
Abstract: In this paper, we deal with a problem of local damage detection in bearings in the presence of high-energy impulsive noise. Such a problem was identified during diagnostics of bearings in a raw materials crusher. Unfortunately, classical approaches cannot be applied due to the impulsive character of the noise. In this paper we propose a procedure that cancels out the impulsive noise rather than extracting the signal of interest. The methodology is based on a regime switching model with two regimes: the first corresponding to high-energy noncyclic impulses and the second to the rest of the signal. We apply the proposed technique to a simulated signal as well as to a real one. The effectiveness of the method is presented graphically using time series, time-frequency spectrograms, and classical envelope analysis. The obtained results indicate the efficiency of the method in impulsive noise cancellation and an improved ability to detect damage.
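
A crude stand-in for the regime split is an energy threshold that flags and cancels high-energy windows. The paper fits a proper regime-switching model instead; this sketch only conveys the cancel-the-noise (rather than extract-the-signal) idea, and the window and threshold are arbitrary assumptions:

```python
import numpy as np

def cancel_impulsive_noise(x, win=64, k=3.0):
    """Zero out samples whose local energy exceeds k times the median energy."""
    energy = np.convolve(x ** 2, np.ones(win) / win, mode="same")
    impulsive = energy > k * np.median(energy)
    y = x.copy()
    y[impulsive] = 0.0        # cancel the noise regime, keep the rest
    return y, impulsive
```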

Journal ArticleDOI
TL;DR: A novel empirical model is proposed that adaptively adjusts the window size for a narrowband signal using a spectrum sensing technique; the results not only improve the spectrogram visualization but also reduce the computational cost.
Abstract: The Short Time Fourier Transform (STFT) is an important technique for the time-frequency analysis of a time varying signal. The basic approach behind it involves the application of a Fast Fourier Transform (FFT) to a signal multiplied with an appropriate window function of fixed resolution. The selection of an appropriate window size is difficult when no background information about the input signal is known. In this paper, a novel empirical model is proposed that adaptively adjusts the window size for a narrowband signal using a spectrum sensing technique. For wide-band signals, where a fixed time-frequency resolution is undesirable, the approach adopts the constant Q transform (CQT). Unlike the STFT, the CQT provides a varying time-frequency resolution. This results in high spectral resolution at low frequencies and high temporal resolution at high frequencies. In this paper, a simple but effective switching framework is provided between the STFT and the CQT. The proposed method also allows for the dynamic construction of a filter bank according to user-defined parameters, which helps in reducing redundant entries in the filter bank. The proposed method not only improves the spectrogram visualization but also reduces the computational cost, and it selects the appropriate window length 87.71% of the time.

Journal ArticleDOI
TL;DR: This study proposes a speech enhancement method based on compressive sensing that is experimentally compared with the baseline methods and demonstrates its superiority.
Abstract: This study proposes a speech enhancement method based on compressive sensing. The main procedures involved in the proposed method are performed in the frequency domain. First, an overcomplete dictionary is constructed from the trained speech frames. The atoms of this redundant dictionary are spectrum vectors that are trained by the K-SVD algorithm to ensure the sparsity of the dictionary. For a noisy speech spectrum, formant detection and a quasi-SNR criterion are first utilized to determine whether a frequency bin in the spectrogram is reliable, and a corresponding mask is designed. The mask-extracted reliable components in a speech spectrum are regarded as partial observations and a measurement matrix is constructed. The problem can therefore be treated as a compressive sensing problem. The K atoms of a K-sparsity speech spectrum are found using an orthogonal matching pursuit algorithm. Because the K atoms form the speech signal subspace, the removal of the noise projected onto these K atoms is achieved by multiplying the noisy spectrum with the optimized gain that corresponds to each selected atom. The proposed method is experimentally compared with the baseline methods and demonstrates its superiority.
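
With a trained dictionary and a reliability mask in hand, recovering one frame is a standard OMP problem restricted to the reliable bins. A hedged sklearn sketch; the dictionary D, the mask, and the sparsity level K are placeholders:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def reconstruct_spectrum(D, observed, reliable, k=10):
    """D: (n_bins x n_atoms) K-SVD dictionary; observed: noisy magnitude
    spectrum; reliable: boolean mask of trustworthy bins (the measurement)."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(D[reliable], observed[reliable])  # solve on reliable bins only
    return D @ omp.coef_                      # K-sparse full-spectrum estimate
```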

Journal ArticleDOI
TL;DR: Fast, round-trip-resolved spectral dynamics of cavity-based systems in real-time are obtained, with temporal resolution of one cavity round trip and frequency resolution defined by its inverse (85 ns and 24 MHz respectively are demonstrated).
Abstract: Conventional tools for measurement of laser spectra (e.g. optical spectrum analysers) capture data averaged over a considerable time period. However, the generation spectrum of many laser types may involve spectral dynamics whose relatively fast time scale is determined by their cavity round trip period, calling for instrumentation featuring both high temporal and spectral resolution. Such real-time spectral characterisation becomes particularly challenging if the laser pulses are long, or if they have continuous or quasi-continuous wave radiation components. Here we combine optical heterodyning with a technique of spatio-temporal intensity measurements that allows the characterisation of such complex sources. Fast, round-trip-resolved spectral dynamics of cavity-based systems are obtained in real time, with a temporal resolution of one cavity round trip and a frequency resolution defined by its inverse (85 ns and 24 MHz, respectively, are demonstrated). We also show how, under certain conditions for quasi-continuous wave sources, the spectral resolution can be further increased by a factor of 100 by direct extraction of phase information from the heterodyned dynamics or by using double time scales within the spectrogram approach.

Journal ArticleDOI
TL;DR: Two new approaches to mode reconstruction are discussed; the first determines the ridge associated with a mode by considering the location where the direction of the reassignment vector sharply changes, the technique used to determine the basin of attraction being directly derived from that used for ridge extraction.
Abstract: This paper discusses methods for the adaptive reconstruction of the modes of multicomponent AM-FM signals from their time-frequency (TF) representation derived from their short-time Fourier transform (STFT). The STFT of an AM-FM component or mode spreads the information relative to that mode in the TF plane around curves commonly called ridges. An alternative view is to consider a mode as a particular TF domain termed a basin of attraction. Here we discuss two new approaches to mode reconstruction. The first determines the ridge associated with a mode by considering the location where the direction of the reassignment vector sharply changes, the technique used to determine the basin of attraction being directly derived from that used for ridge extraction. The second uses the fact that the STFT of a signal is fully characterized by its zeros (and the particular distribution of these zeros for Gaussian noise) to deduce an algorithm that computes the mode domains. For both techniques, mode reconstruction is then carried out by simply integrating the information inside these basins of attraction or domains.
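
For contrast with the reassignment-based ridge definition, a basic dynamic-programming ridge extractor over the STFT magnitude looks as follows (the jump penalty is an arbitrary assumption):

```python
import numpy as np

def extract_ridge(S, penalty=2.0):
    """Max-energy ridge through |S| (freq x time) with a frequency-jump cost."""
    logS = np.log(np.abs(S) + 1e-12)
    F, T = logS.shape
    cost = np.empty((F, T)); prev = np.zeros((F, T), dtype=int)
    cost[:, 0] = logS[:, 0]
    freqs = np.arange(F)
    for t in range(1, T):
        for f in range(F):
            step = cost[:, t - 1] - penalty * np.abs(freqs - f)
            prev[f, t] = np.argmax(step)
            cost[f, t] = logS[f, t] + step[prev[f, t]]
    ridge = np.zeros(T, dtype=int)
    ridge[-1] = np.argmax(cost[:, -1])
    for t in range(T - 2, -1, -1):       # backtrack the optimal path
        ridge[t] = prev[ridge[t + 1], t + 1]
    return ridge                          # frequency-bin index per frame
```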

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.
Abstract: In this paper, a system for overlapping acoustic event detection is proposed, which models the temporal evolution of sound events. The system is based on probabilistic latent component analysis, supporting the use of a sound event dictionary where each exemplar consists of a succession of spectral templates. The temporal succession of the templates is controlled through event class-wise Hidden Markov Models (HMMs). As input time/frequency representation, the Equivalent Rectangular Bandwidth (ERB) spectrogram is used. Experiments are carried out on polyphonic datasets of office sounds generated using an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.

Proceedings Article
04 Nov 2016
TL;DR: A multi-label classification task to predict notes in musical recordings is defined, along with an evaluation protocol, and several machine learning architectures for this task are benchmarked.
Abstract: This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.

Journal ArticleDOI
TL;DR: In this article, a multichannel vibration data processing method for local damage detection in gearboxes is presented. The method is a combination of time-frequency representation and principal component analysis (PCA) applied not to the raw time series but to each slice (along the time axis) of its spectrogram.
Abstract: A multichannel vibration data processing method in the context of local damage detection in gearboxes is presented in this paper. The purpose of the approach is to obtain more reliable information about local damage by using several channels, in comparison to results obtained by single-channel vibration analysis. The method is a combination of time-frequency representation and Principal Component Analysis (PCA) applied not to the raw time series but to each slice (along the time axis) of its spectrogram. Finally, we create a new, aggregated time-frequency map which clearly indicates the presence of the damage. Details and properties of this procedure are described in this paper, along with a comparison to single-channel results. We examine the autocorrelation function of the new aggregated time-frequency map (a 1-D signal) or its simple spectrum (which might be linked to classical envelope analysis). The results are very convincing: cyclic impulses associated with local damage can be clearly detected. In order to validate our method, we used a model of vibration data from a heavy-duty gearbox operated in the mining industry.
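
The per-slice PCA step can be sketched directly: stack the channel spectrograms, then keep the first principal component of each time slice as the aggregated map (windowing parameters and library choices are assumptions):

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import PCA

def aggregated_tf_map(channels, fs, nperseg=256):
    """First principal component across channels for every spectrogram slice."""
    specs = [spectrogram(x, fs=fs, nperseg=nperseg)[2] for x in channels]
    S = np.stack(specs)                    # (n_channels, n_freq, n_time)
    agg = np.zeros(S.shape[1:])
    pca = PCA(n_components=1)
    for t in range(S.shape[2]):
        agg[:, t] = pca.fit_transform(S[:, :, t].T)[:, 0]
    return agg                             # aggregated time-frequency map
```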

Journal Article
TL;DR: A novel procedure for data-driven enhancement of the informative signal that models each sub-signal of the time-frequency representation with an α-stable distribution, a generalization of the standard Gaussian distribution that allows modeling of sub-signals related to both informative and non-informative frequencies.
Abstract: A novel procedure for data-driven enhancement of the informative signal is presented in this paper. The introduced methodology covers decomposition of the signal, via a time-frequency spectrogram, into a set of narrowband sub-signals. Furthermore, each of the sub-signals is considered as a sample of independent identically distributed random variables, and we model the distribution of the sample, in contrast to the classical methodology where a simple statistic, for example kurtosis, is calculated for each sub-signal. This approach provides a new perspective in signal processing techniques for local damage detection. Using our methodology one can eliminate the potential risk related to high sensitivity towards a single outlier. In the proposed procedure we model each sub-signal of the time-frequency representation by an α-stable distribution. This distribution is a generalization of the standard Gaussian one and allows us to model sub-signals related to both informative and non-informative frequencies. As a result, we obtain a distribution of the stability parameter vs. frequency, which is analogous to the spectral kurtosis approach well known in the literature. Such a characteristic is the basis for the design of a filter used for raw signal enhancement. To evaluate the efficiency of our method we compare the raw and filtered signals in the time, time-frequency and frequency (envelope spectrum) domains. Moreover, we present a comparison to the spectral kurtosis approach. We applied the presented methodology to a simulated signal and to a real vibration signal from a two-stage heavy-duty gearbox used in the mining industry.
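
The stability-parameter characteristic can be approximated by fitting an α-stable law to each spectrogram sub-band. A straightforward (if slow) scipy sketch; the STFT settings are assumptions, and levy_stable.fit is generic maximum likelihood rather than the authors' estimator:

```python
import numpy as np
from scipy.signal import stft
from scipy.stats import levy_stable

def stability_spectrum(x, fs, nperseg=256):
    """Stability parameter alpha per frequency band, analogous to a
    spectral-kurtosis curve; low alpha flags impulsive (informative) bands."""
    f, _, Z = stft(x, fs=fs, nperseg=nperseg)
    alphas = []
    for band in np.real(Z):               # each row: one narrowband sub-signal
        alpha, beta, loc, scale = levy_stable.fit(band)
        alphas.append(alpha)
    return f, np.array(alphas)            # basis for the enhancement filter
```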

Proceedings ArticleDOI
01 Aug 2016
TL;DR: It is demonstrated that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time, and that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved.
Abstract: Most current Brain-Computer Interfaces (BCIs) achieve high information transfer rates using spelling paradigms based on stimulus-evoked potentials. Despite the success of these interfaces, this mode of communication can be cumbersome and unnatural. Direct synthesis of speech from neural activity represents a more natural mode of communication that would enable users to convey verbal messages in real-time. In this pilot study with one participant, we demonstrate that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time. This is accomplished by reconstructing the audio magnitude spectrogram from neural activity and subsequently creating the audio waveform from these reconstructed spectrograms. We show that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved. While this pilot study uses audibly spoken speech for the models, it represents a first step towards speech synthesis from speech imagery.
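
The second stage, creating a waveform from a reconstructed magnitude spectrogram, requires estimating the missing phase; Griffin-Lim is one common choice for that step, though not necessarily the one used in this study:

```python
import librosa

def waveform_from_magnitude(mag_spec, n_iter=32, hop_length=256):
    """Iterative Griffin-Lim phase estimation from a magnitude spectrogram."""
    return librosa.griffinlim(mag_spec, n_iter=n_iter, hop_length=hop_length)
```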

Journal ArticleDOI
TL;DR: In this paper, an accelerogram of the instantaneous phase of signal components, referred to as an instantaneous frequency rate spectrogram (IFRS), is presented as a joint time-frequency distribution.

Proceedings ArticleDOI
21 Mar 2016
TL;DR: A recursive implementation of a recently proposed reassignment process called the Levenberg-Marquardt reassignment, which allows a user to adjust the slimness of the signal components' localization in the time-frequency plane, and a generalization of the signal reconstruction formula that paves the way for real-time computation of a reversible and adjustable almost-ideal time-frequency representation.
Abstract: In this paper, we first present a recursive implementation of a recently proposed reassignment process called the Levenberg-Marquardt reassignment, which allows a user to adjust the slimness of the signal components' localization in the time-frequency plane. Thanks to a generalization of the signal reconstruction formula, we also present a recursive implementation of the synchrosqueezed short-time Fourier transform. This approach paves the way for real-time computation of a reversible and adjustable almost-ideal time-frequency representation.

Journal ArticleDOI
TL;DR: A new method of singing voice analysis that performs mutually-dependent singing voice separation and vocal fundamental frequency (F0) estimation; it outperformed all the other methods of singing voice separation submitted to an international music analysis competition called MIREX 2014.
Abstract: This paper presents a new method of singing voice analysis that performs mutually-dependent singing voice separation and vocal fundamental frequency (F0) estimation. Vocal F0 estimation is considered to become easier if singing voices can be separated from a music audio signal, and vocal F0 contours are useful for singing voice separation. This calls for an approach that improves the performance of each of these tasks by using the results of the other. The proposed method first performs robust principal component analysis (RPCA) for roughly extracting singing voices from a target music audio signal. The F0 contour of the main melody is then estimated from the separated singing voices by finding the optimal temporal path over an F0 saliency spectrogram. Finally, the singing voices are separated again more accurately by combining a conventional time-frequency mask given by RPCA with another mask that passes only the harmonic structures of the estimated F0s. Experimental results showed that the proposed method significantly improved the performances of both singing voice separation and vocal F0 estimation. The proposed method also outperformed all the other methods of singing voice separation submitted to an international music analysis competition called MIREX 2014.
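
The rough-separation stage, RPCA, splits the magnitude spectrogram into a low-rank accompaniment part and a sparse vocal part. A compact inexact-ALM sketch with conventional defaults (λ, μ, and the iteration count are assumptions):

```python
import numpy as np

def rpca(M, n_iter=100):
    """Decompose M into low-rank L (accompaniment) + sparse S (vocals)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt   # singular-value thresholding
        S = shrink(M - L + Y / mu, lam / mu)   # sparse (vocal) component
        Y += mu * (M - L - S)                  # dual update
    return L, S  # soft or binary masks can then be built from |S| vs. |L|
```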