
Showing papers on "Spectrogram published in 2016"


Proceedings ArticleDOI
20 Mar 2016
TL;DR: In this paper, a deep network is trained to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures.
Abstract: We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary source classes and number, "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pair-wise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.
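
A minimal sketch of the test-time decoding step, assuming a (time, frequency, D) embedding tensor from an already-trained network (the shapes and the decode_masks helper are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def decode_masks(embeddings, n_sources=2):
    """Cluster per-bin embeddings (T, F, D) into binary time-frequency masks."""
    T, F, D = embeddings.shape
    flat = embeddings.reshape(-1, D)
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(flat)
    return [(labels == k).reshape(T, F) for k in range(n_sources)]

# Toy usage: random embeddings stand in for the network output.
emb = np.random.randn(100, 129, 40).astype(np.float32)
masks = decode_masks(emb, n_sources=2)  # apply each mask to the mixture STFT
```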

1,216 citations


Journal ArticleDOI
TL;DR: The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate.
Abstract: Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because the phase spectrum was long believed to be unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach at a rate of at least 69%.
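
The training target here is the complex ratio of the clean to the noisy STFT, split into real and imaginary parts. A hedged numpy sketch (window settings and the flooring constant are assumptions; the published method additionally compresses the mask before regression):

```python
import numpy as np
from scipy.signal import stft

def complex_irm(clean, noisy, fs=16000, nperseg=512):
    """Ideal ratio mask in the complex domain: M = S / Y per T-F bin."""
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    M = S / (Y + 1e-8)       # complex division yields real and imaginary parts
    return M.real, M.imag    # regression targets for the DNN
```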

699 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: The intuition is that, because different people walk with different gaits, the WiFi signal reflected by a walking human generates unique variations in the Channel State Information at the WiFi receiver; WifiU therefore uses commercial WiFi devices to capture fine-grained gait patterns to recognize humans.
Abstract: In this paper, we propose WifiU, which uses commercial WiFi devices to capture fine-grained gait patterns to recognize humans. The intuition is that, due to the differences in the gaits of different people, the WiFi signal reflected by a walking human generates unique variations in the Channel State Information (CSI) at the WiFi receiver. To profile human movement using CSI, we use signal processing techniques to generate spectrograms from CSI measurements so that the resulting spectrograms are similar to those generated by specifically designed Doppler radars. To extract features from spectrograms that best characterize the walking pattern, we perform autocorrelation on the torso reflection to remove imperfections in the spectrograms. We evaluated WifiU on a dataset with 2,800 gait instances collected from 50 human subjects walking in a room with an area of 50 square meters. Experimental results show that WifiU achieves top-1, top-2, and top-3 recognition accuracies of 79.28%, 89.52%, and 93.05%, respectively.

447 citations


Journal ArticleDOI
TL;DR: It is shown that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility.
Abstract: Intelligibility listening tests are necessary during development and evaluation of speech processing algorithms, despite the fact that they are expensive and time consuming. In this paper, we propose a monaural intelligibility prediction algorithm, which has the potential of replacing some of these listening tests. The proposed algorithm shows similarities to the short-time objective intelligibility (STOI) algorithm, but works for a larger range of input signals. In contrast to STOI, extended STOI (ESTOI) does not assume mutual independence between frequency bands. ESTOI also incorporates spectral correlation by comparing complete 400-ms spectrograms of the noisy/processed speech and the clean speech signals. As a consequence, ESTOI is also able to accurately predict the intelligibility of speech contaminated by temporally highly modulated noise sources in addition to noisy signals processed with time-frequency weighting. We show that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility. A free MATLAB implementation of the algorithm is available for noncommercial use at http://kom.aau.dk/~jje/.
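
The intermediate measure at ESTOI's core correlates mean- and variance-normalized spectrogram segments. The sketch below illustrates that idea only and is not the published algorithm (which uses 1/3-octave bands, 400-ms segments, and a specific normalization order):

```python
import numpy as np

def segment_correlation(X, Y, seg_len=30):
    """Average correlation of normalized segments of two spectrograms
    (bands x frames): X clean, Y noisy/processed."""
    scores = []
    for s in range(0, X.shape[1] - seg_len + 1, seg_len):
        x = X[:, s:s + seg_len].copy()
        y = Y[:, s:s + seg_len].copy()
        for a in (x, y):  # normalize rows, then columns
            a -= a.mean(axis=1, keepdims=True)
            a /= np.linalg.norm(a, axis=1, keepdims=True) + 1e-12
            a -= a.mean(axis=0, keepdims=True)
            a /= np.linalg.norm(a, axis=0, keepdims=True) + 1e-12
        scores.append(np.sum(x * y) / seg_len)
    return float(np.mean(scores))
```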

404 citations


Proceedings ArticleDOI
08 Sep 2016
TL;DR: In this paper, an end-to-end signal approximation objective was proposed to improve the performance of a speaker-independent multi-speaker separation system using deep clustering, which achieved a 10.3 dB improvement in the SDR.
Abstract: Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.

354 citations


Journal ArticleDOI
TL;DR: A new rolling bearing fault diagnosis method based on the short-time Fourier transform and a stacked sparse autoencoder is proposed; the method analyzes sound signals and is compared with empirical mode decomposition, the Teager energy operator, and a stacked sparse autoencoder applied to vibration signals to verify its performance and effectiveness.
Abstract: The main challenge of fault diagnosis lies in finding good fault features. A deep learning network has the ability to automatically learn good characteristics from input data in an unsupervised fashion, and its unique layer-wise pretraining and fine-tuning using the backpropagation strategy can solve the difficulties of training deep multilayer networks. Stacked sparse autoencoders or other deep architectures have shown excellent performance in speech recognition, face recognition, text classification, image recognition, and other application domains. Thus far, however, there have been very few research studies on deep learning in fault diagnosis. In this paper, a new rolling bearing fault diagnosis method that is based on short-time Fourier transform and stacked sparse autoencoder is first proposed; this method analyzes sound signals. After spectrograms are obtained by short-time Fourier transform, stacked sparse autoencoder is employed to automatically extract the fault features, and softmax regression is adopted as the method for classifying the fault modes. The proposed method, when applied to sound signals that are obtained from a rolling bearing test rig, is compared with empirical mode decomposition, Teager energy operator, and stacked sparse autoencoder when using vibration signals to verify the performance and effectiveness of the proposed method.
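
The first stage of such a pipeline amounts to turning each sound clip into a normalized log-magnitude spectrogram that feeds the autoencoder. A rough sketch (FFT length, overlap, and normalization are assumptions, not the paper's settings):

```python
import numpy as np
from scipy.signal import stft

def spectrogram_features(sound, fs, nperseg=1024, noverlap=512):
    """Flattened, normalized log-magnitude spectrogram as autoencoder input."""
    _, _, Z = stft(sound, fs=fs, nperseg=nperseg, noverlap=noverlap)
    logmag = np.log1p(np.abs(Z))
    logmag = (logmag - logmag.min()) / (np.ptp(logmag) + 1e-12)
    return logmag.ravel()  # one training vector per clip
```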

157 citations


Posted Content
TL;DR: In this paper, a large-scale music dataset, MusicNet, is introduced to serve as a source of supervision and evaluation of machine learning methods for music research, which consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments.
Abstract: This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.

135 citations


Proceedings ArticleDOI
12 Sep 2016
TL;DR: The results show that AudioGest can detect six hand gestures with an accuracy of up to 96%, and by distinguishing gesture attributes, it can provide up to 162 control commands for various applications.
Abstract: Hand gesture is becoming an increasingly popular means of interacting with consumer electronic devices, such as mobile phones, tablets and laptops. In this paper, we present AudioGest, a device-free gesture recognition system that can accurately sense the hand in-air movement around user's devices. Compared to the state-of-the-art, AudioGest is superior in using only one pair of built-in speaker and microphone, without any extra hardware or infrastructure support and with no training, to achieve fine-grained hand detection. Our system is able to accurately recognize various hand gestures, estimate the hand in-air time, as well as the average moving speed and waving range. We achieve this by transforming the device into an active sonar system that transmits an inaudible audio signal and decodes the hand's echoes at its microphone. We address various challenges, including cleaning the noisy reflected sound signal, interpreting the echo spectrogram into hand gestures, decoding the Doppler frequency shifts into the hand waving speed and range, and being robust to environmental motion and signal drifting. We implement the proof-of-concept prototype on three different electronic devices and extensively evaluate the system in four real-world scenarios using 3,900 hand gestures collected by five users over more than two weeks. Our results show that AudioGest can detect six hand gestures with an accuracy of up to 96%, and by distinguishing gesture attributes, it can provide up to 162 control commands for various applications.
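
Decoding waving speed from the echo rests on the standard two-way Doppler relation. A tiny worked example; the 20 kHz carrier is an assumption, since the exact tone AudioGest emits is not stated in the abstract:

```python
C = 343.0      # speed of sound in air, m/s
F0 = 20_000.0  # assumed inaudible carrier frequency, Hz

def hand_speed(doppler_shift_hz: float) -> float:
    """Radial hand speed from the echo's Doppler shift.
    Reflection doubles the shift: df = 2 * v * f0 / c, so v = df * c / (2 * f0)."""
    return doppler_shift_hz * C / (2.0 * F0)

print(hand_speed(100.0))  # a 100 Hz shift corresponds to ~0.86 m/s
```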

120 citations


Journal ArticleDOI
TL;DR: This work proposes an ensemble of heterogeneous classifiers for maximizing the performance that could be obtained starting from the acoustic features, and shows for the first time that a bag of feature approach can be effective in this problem.
Abstract: Coupling texture descriptors and acoustic features. A bag-of-features approach can be effectively used in this problem. A heterogeneous ensemble of different classifiers improves performance. Since musical genre is one of the most common ways used by people for managing digital music databases, music genre recognition is a crucial task, studied in depth by the Music Information Retrieval (MIR) research community since 2002. In this work we present a novel and effective approach for automated musical genre recognition based on the fusion of different sets of features. Both acoustic and visual features are considered, evaluated, compared and fused in a final ensemble which shows classification accuracy comparable to or even better than other state-of-the-art approaches. The visual features are locally extracted from sub-windows of the spectrogram taken by Mel-scale zoning: the input signal is represented by its spectrogram, which is divided into sub-windows in order to extract local features; feature extraction is performed by calculating texture descriptors and bag-of-features projections from each sub-window; the final decision is taken using an ensemble of SVM classifiers. In this work we show for the first time that a bag-of-features approach can be effective in this problem. As far as the acoustic features are concerned, we propose an ensemble of heterogeneous classifiers for maximizing the performance that can be obtained starting from the acoustic features. First, timbre features are obtained from the audio signal; second, some statistical measures are calculated from the texture window and the modulation spectrum; third, a feature selection is executed to increase the recognition performance and decrease the computational complexity. Finally, the resulting descriptors are classified by fusing the scores of heterogeneous classifiers (SVM and random subspace of AdaBoost). The experimental evaluation is performed on three well-known databases: the Latin Music Database (LMD), the ISMIR 2004 database and the GTZAN genre collection. The reported performance of the proposed approach is very encouraging, since it outperforms other state-of-the-art approaches, without any ad hoc parameter optimization (i.e. using the same ensemble of classifiers and parameter settings on all three datasets). The advantage of using both visual and audio features is also proved by means of the Q-statistic, which confirms that the two sets of features are partially independent and suitable to be fused together in a heterogeneous system. The MATLAB code of the ensemble of classifiers and of the visual feature extraction will be made publicly available (see footnote 1) to other researchers for future comparisons. The code for the acoustic features is not available since it is used in a commercial system.
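
As a rough illustration of the visual branch, the sketch below zones a Mel spectrogram and computes a local binary pattern (LBP) histogram per zone; the zone count, LBP settings, and quantization are assumptions rather than the paper's exact descriptors:

```python
import numpy as np
import librosa
from skimage.feature import local_binary_pattern

def texture_features(y, sr, n_zones=4):
    """Concatenated LBP histograms from Mel-zoned spectrogram sub-windows."""
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    feats = []
    for zone in np.array_split(S, n_zones, axis=0):   # Mel-scale zoning
        z = np.uint8(255 * (zone - zone.min()) / (np.ptp(zone) + 1e-12))
        lbp = local_binary_pattern(z, P=8, R=1, method="uniform")
        hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
        feats.append(hist)
    return np.concatenate(feats)  # input to the SVM ensemble
```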

102 citations


Journal ArticleDOI
TL;DR: Experimental results show that the proposed micro-Doppler features can achieve a good discriminative ability and a satisfactory classification performance.
Abstract: A novel feature extraction method based on micro-Doppler signature is proposed to categorize ground moving targets into three kinds, i.e., single walking person, two people walking, and a moving wheeled vehicle. Signal models and measured data from a low-resolution radar are first analyzed to find the differences between the micro-Doppler signatures from the three kinds of considered targets. Then, such discriminative micro-Doppler signatures are represented by a 3-D feature vector extracted from the time-frequency spectrograms. In the experiments based on the measured data, the ratio of the between-class distance to the within-class distance, which is defined based on Fisher discriminant analysis, is exploited to assess the discriminative ability of the 3-D feature vector. Moreover, support vector machine classifier is utilized to evaluate the classification performance. Experimental results show that the proposed micro-Doppler features can achieve a good discriminative ability and a satisfactory classification performance.
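
The discriminability criterion is the classic Fisher ratio of between-class to within-class scatter. A generic numpy version (the authors' exact distance definition may differ):

```python
import numpy as np

def fisher_ratio(features, labels):
    """Between-class to within-class scatter ratio for an
    (n_samples x n_features) matrix; larger means more discriminative."""
    overall = features.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        Xc = features[labels == c]
        between += len(Xc) * np.sum((Xc.mean(axis=0) - overall) ** 2)
        within += np.sum((Xc - Xc.mean(axis=0)) ** 2)
    return between / (within + 1e-12)
```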

84 citations


Proceedings ArticleDOI
01 Aug 2016
TL;DR: This paper presents a new automatic and intelligent fault diagnosis method based on a convolutional neural network; the enhanced convolutional neural network achieves the best classification performance, 5% and 4% higher than the ReLU and Dropout networks, respectively.
Abstract: Feature extraction is an important step in conventional vibration-based fault diagnosis methods. However, the features are usually empirically extracted, leading to inconsistent performance. This paper presents a new automatic and intelligent fault diagnosis method based on a convolutional neural network. Firstly, the vibration signal is processed by wavelet transform into a multi-scale spectrogram image to manifest the fault characteristics. Next, the spectrogram image is fed directly into the convolutional neural network, which learns an invariant representation of the vibration signal and recognizes the fault status for fault diagnosis. During model construction, the rectified linear unit (ReLU) activation function and a dropout layer are incorporated into the convolutional neural network to improve computational efficiency and model generalization. Four networks are trained and tested on the same data: a traditional convolutional neural network, a ReLU network, a Dropout network, and the enhanced convolutional neural network. Comparing their results shows that the enhanced convolutional neural network reaches a classification accuracy of 96%, 8% higher than the traditional convolutional neural network. By adjusting p, the keep probability of dropout, three sparse networks are trained and their classification results compared; with p = 0.4, the enhanced convolutional neural network achieves the best classification performance, 5% and 4% higher than the ReLU and Dropout networks, respectively.
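
A minimal PyTorch sketch of a CNN with ReLU activations and a dropout layer, in the spirit of the enhanced network; all layer sizes are assumptions. Note that the paper's p is a keep probability, whereas PyTorch's Dropout takes a drop probability:

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_classes=4, p_keep=0.4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=1.0 - p_keep),  # torch drops with probability p
            nn.LazyLinear(128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, H, W) wavelet spectrogram images
        return self.classifier(self.features(x))

model = SpectrogramCNN()
logits = model(torch.randn(8, 1, 64, 64))  # toy batch
```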

Journal ArticleDOI
TL;DR: An automated fall detection system based on smartphone audio features is developed; the best performance is achieved using spectrogram features with the ANN classifier, with sensitivity, specificity, and accuracy all above 98%.
Abstract: An automated fall detection system based on smartphone audio features is developed. The spectrogram, mel frequency cepstral coefficient (MFCC), linear predictive coding (LPC), and matching pursuit (MP) features of different fall and no-fall sound events are extracted from experimental data. Based on the extracted audio features, four different machine learning classifiers: k-nearest neighbor (k-NN), support vector machine (SVM), least squares method (LSM), and artificial neural network (ANN) are investigated for distinguishing between fall and no-fall events. For each audio feature, the performance of each classifier in terms of sensitivity, specificity, accuracy, and computational complexity is evaluated. The best performance is achieved using spectrogram features with the ANN classifier, with sensitivity, specificity, and accuracy all above 98%. The classifier also has acceptable computational requirements for training and testing. The system is applicable in home environments where the phone is placed in the vicinity of the user.
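
Two of the four feature types (spectrogram statistics and MFCCs) are easy to sketch with librosa; LPC and matching-pursuit features are omitted, and the mean-pooling choice is an assumption:

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier  # stands in for the ANN

def audio_features(path, sr=16000, n_mfcc=13):
    """Mean-pooled spectrogram and MFCC features for one sound clip."""
    y, sr = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=512))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([spec.mean(axis=1), mfcc.mean(axis=1)])

# clf = MLPClassifier(hidden_layer_sizes=(64,)).fit(X_train, y_train)
# where X_train stacks audio_features(...) over labeled fall/no-fall clips
```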

Journal ArticleDOI
TL;DR: In this paper, a ptychographic reconstruction algorithm is proposed for frequency-resolved optical gating (FROG), demonstrating robust and complete characterization of two unknown pulses from a single measured spectrogram and the power spectrum of only one of the pulses.
Abstract: Frequency-resolved optical gating (FROG) is probably the most popular technique for complete characterization of ultrashort laser pulses. In FROG, a reconstruction algorithm retrieves the pulse from a measured spectrogram, yet current FROG reconstruction algorithms require and exhibit several restricting features that weaken FROG performance. For example, the delay step must correspond to the spectral bandwidth measured with large enough SNR, a condition that limits the temporal resolution of the reconstructed pulse, obscures measurements of weak broadband pulses, and makes measurement of broadband mid-IR pulses hard and slow because the spectrograms become huge. We develop a new approach for FROG reconstruction, based on ptychography (a scanning coherent diffraction imaging technique), that removes many of the algorithmic restrictions. The ptychographic reconstruction algorithm is significantly faster and more robust to noise than current FROG algorithms, which are based on generalized projections (GP). We demonstrate, numerically and experimentally, that ptychographic reconstruction works well with very partial spectrograms, e.g. spectrograms with a reduced number of measured delays and spectrograms that have been substantially spectrally filtered. In addition, we apply the ptychographic approach to blind second harmonic generation (SHG) FROG and demonstrate robust and complete characterization of two unknown pulses from a single measured spectrogram and the power spectrum of only one of the pulses. We believe that the ptychography-based approach will become the standard reconstruction procedure in FROG and related diagnostic methods, allowing successful reconstructions from so far unreconstructable spectrograms.

Posted Content
TL;DR: This paper significantly improves upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline, and produces unprecedented performance on a challenging speech separation task.
Abstract: Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.

Proceedings Article
01 Oct 2016
TL;DR: The DCNN outperforms several conventional classifiers, showing the possible benefits of deep learning approaches in human gait classification.
Abstract: This paper presents the use of a deep convolutional neural network (DCNN) in distinguishing between the absence of human gait and the presence of single or multiple instances of human gait by applying the DCNN to micro-Doppler spectrograms. The approach is evaluated for various radar frequencies and SNR levels using model data, while final validation is performed using X-band CW radar measurements. Satisfactory results are obtained by the DCNN, which outperforms several conventional classifiers, showing the possible benefits of deep learning approaches in human gait classification.

Proceedings ArticleDOI
20 Mar 2016
TL;DR: This study improves the acoustic model by proposing a 2-D, time-frequency (TF) LSTM, which jointly scans the input over the time and frequency axes to model spectro-temporal warping, and uses the output activations as the input to a time L STM (T-LSTM).
Abstract: Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks. A key aspect of these models is the use of time recurrence, combined with a gating architecture that allows them to track the long-term dynamics of speech. Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM) that performs 1-D recurrence over the frequency axis and then performs 1-D recurrence over the time axis. In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM. The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to a time LSTM (T-LSTM). The joint time-frequency modeling better normalizes the features for the upper layer T-LSTMs. Evaluated on a 375-hour short message dictation task, the proposed TF-LSTM obtained a 3.4% relative WER reduction over the best T-LSTM. The invariance property achieved by joint time-frequency analysis is demonstrated on a mismatched test set, where the TF-LSTM achieves a 14.2% relative WER reduction over the best T-LSTM.
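
The 1-D cascade that the TF-LSTM builds on, a frequency LSTM whose per-frame summary feeds a time LSTM, can be sketched in a few lines of PyTorch (sizes are illustrative; the actual TF-LSTM scans time and frequency jointly rather than sequentially):

```python
import torch
import torch.nn as nn

class FTLSTM(nn.Module):
    def __init__(self, f_hidden=64, t_hidden=128, n_out=1000):
        super().__init__()
        self.f_lstm = nn.LSTM(1, f_hidden, batch_first=True)
        self.t_lstm = nn.LSTM(f_hidden, t_hidden, batch_first=True)
        self.out = nn.Linear(t_hidden, n_out)

    def forward(self, x):              # x: (batch, time, freq) log-mel input
        B, T, F = x.shape
        z = x.reshape(B * T, F, 1)     # scan along frequency within each frame
        _, (h, _) = self.f_lstm(z)     # final state summarizes the spectrum
        z = h.squeeze(0).reshape(B, T, -1)
        z, _ = self.t_lstm(z)          # recurrence over time
        return self.out(z)             # per-frame acoustic-state logits

y = FTLSTM()(torch.randn(4, 50, 40))   # toy batch: 4 utterances, 50 frames
```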

Journal ArticleDOI
TL;DR: This letter investigates the use of micro-Doppler signatures experimentally recorded by a multistatic radar system to perform recognition of people walking and reports high classification accuracy above 98% for the most favorable aspect angle.
Abstract: In this letter, we investigate the use of micro-Doppler signatures experimentally recorded by a multistatic radar system to perform recognition of people walking. Three different sets of features are tested, taking into account the impact on the overall classification performance of parameters, such as aspect angle, types of classifier, different values of signal-to-noise ratio, and different ways of exploiting multistatic information. High classification accuracy of above 98% is reported for the most favorable aspect angle, and the benefit of using multistatic data at less favorable angles is discussed.

Journal ArticleDOI
TL;DR: The results indicate the efficiency of the proposed procedure in impulsive noise cancellation and its ability to detect damage.
Abstract: In this paper, we deal with a problem of local damage detection in bearings in the presence of high-energy impulsive noise. Such a problem was identified during diagnostics of bearings in a raw materials crusher. Unfortunately, classical approaches cannot be applied due to the impulsive character of the noise. In this paper we propose a procedure that cancels out the impulsive noise rather than extracting the signal of interest. The methodology is based on a regime switching model with two regimes: the first corresponding to high-energy noncyclic impulses and the second to the rest of the signal. We apply the proposed technique to a simulated signal as well as to a real one. The effectiveness of the method is presented graphically using time series, time-frequency spectrograms, and classical envelope analysis. The obtained results indicate the efficiency of the method in impulsive noise cancellation and an improved ability to detect damage.
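
A crude stand-in for the regime split is an energy threshold that flags and cancels high-energy windows. The paper fits a proper regime-switching model instead; this sketch only conveys the cancel-the-noise (rather than extract-the-signal) idea, and the window and threshold are arbitrary assumptions:

```python
import numpy as np

def cancel_impulsive_noise(x, win=64, k=3.0):
    """Zero out samples whose local energy exceeds k times the median energy."""
    energy = np.convolve(x ** 2, np.ones(win) / win, mode="same")
    impulsive = energy > k * np.median(energy)
    y = x.copy()
    y[impulsive] = 0.0        # cancel the noise regime, keep the rest
    return y, impulsive
```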

Journal ArticleDOI
TL;DR: A novel empirical model is proposed that adaptively adjusts the window size for a narrowband signal using a spectrum sensing technique; the results not only improve the spectrogram visualization but also reduce the computational cost.
Abstract: The Short Time Fourier Transform (STFT) is an important technique for the time-frequency analysis of a time varying signal. The basic approach behind it involves the application of a Fast Fourier Transform (FFT) to a signal multiplied with an appropriate window function of fixed resolution. The selection of an appropriate window size is difficult when no background information about the input signal is known. In this paper, a novel empirical model is proposed that adaptively adjusts the window size for a narrowband signal using a spectrum sensing technique. For wide-band signals, where a fixed time-frequency resolution is undesirable, the approach adopts the constant Q transform (CQT). Unlike the STFT, the CQT provides a varying time-frequency resolution. This results in high spectral resolution at low frequencies and high temporal resolution at high frequencies. In this paper, a simple but effective switching framework is provided between the STFT and the CQT. The proposed method also allows for the dynamic construction of a filter bank according to user-defined parameters, which helps in reducing redundant entries in the filter bank. The proposed method not only improves the spectrogram visualization but also reduces the computational cost, and it selects the appropriate window length 87.71% of the time.

Journal ArticleDOI
TL;DR: This study proposes a speech enhancement method based on compressive sensing that is experimentally compared with the baseline methods and demonstrates its superiority.
Abstract: This study proposes a speech enhancement method based on compressive sensing. The main procedures involved in the proposed method are performed in the frequency domain. First, an overcomplete dictionary is constructed from the trained speech frames. The atoms of this redundant dictionary are spectrum vectors that are trained by the K-SVD algorithm to ensure the sparsity of the dictionary. For a noisy speech spectrum, formant detection and a quasi-SNR criterion are first utilized to determine whether a frequency bin in the spectrogram is reliable, and a corresponding mask is designed. The mask-extracted reliable components in a speech spectrum are regarded as partial observations and a measurement matrix is constructed. The problem can therefore be treated as a compressive sensing problem. The K atoms of a K-sparsity speech spectrum are found using an orthogonal matching pursuit algorithm. Because the K atoms form the speech signal subspace, the removal of the noise projected onto these K atoms is achieved by multiplying the noisy spectrum with the optimized gain that corresponds to each selected atom. The proposed method is experimentally compared with the baseline methods and demonstrates its superiority.
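
With a trained dictionary and a reliability mask in hand, recovering one frame is a standard OMP problem restricted to the reliable bins. A hedged sklearn sketch; the dictionary D, the mask, and the sparsity level K are placeholders:

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def reconstruct_spectrum(D, observed, reliable, k=10):
    """D: (n_bins x n_atoms) K-SVD dictionary; observed: noisy magnitude
    spectrum; reliable: boolean mask of trustworthy bins (the measurement)."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    omp.fit(D[reliable], observed[reliable])  # solve on reliable bins only
    return D @ omp.coef_                      # K-sparse full-spectrum estimate
```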

Journal ArticleDOI
TL;DR: Fast, round-trip-resolved spectral dynamics of cavity-based systems in real-time are obtained, with temporal resolution of one cavity round trip and frequency resolution defined by its inverse (85 ns and 24 MHz respectively are demonstrated).
Abstract: Conventional tools for measurement of laser spectra (e.g. optical spectrum analysers) capture data averaged over a considerable time period. However, the generation spectrum of many laser types may involve spectral dynamics whose relatively fast time scale is determined by their cavity round trip period, calling for instrumentation featuring both high temporal and spectral resolution. Such real-time spectral characterisation becomes particularly challenging if the laser pulses are long, or if they have continuous or quasi-continuous wave radiation components. Here we combine optical heterodyning with a technique of spatio-temporal intensity measurements that allows the characterisation of such complex sources. Fast, round-trip-resolved spectral dynamics of cavity-based systems are obtained in real time, with a temporal resolution of one cavity round trip and a frequency resolution defined by its inverse (85 ns and 24 MHz, respectively, are demonstrated). We also show how, under certain conditions for quasi-continuous wave sources, the spectral resolution can be further increased by a factor of 100 by direct extraction of phase information from the heterodyned dynamics or by using double time scales within the spectrogram approach.

Journal ArticleDOI
TL;DR: Two new approaches to mode reconstruction are discussed; the first determines the ridge associated with a mode by considering the location where the direction of the reassignment vector sharply changes, the technique used to determine the basin of attraction being directly derived from that used for ridge extraction.
Abstract: This paper discusses methods for the adaptive reconstruction of the modes of multicomponent AM-FM signals from their time-frequency (TF) representation derived from their short-time Fourier transform (STFT). The STFT of an AM-FM component or mode spreads the information relative to that mode in the TF plane around curves commonly called ridges. An alternative view is to consider a mode as a particular TF domain termed a basin of attraction. Here we discuss two new approaches to mode reconstruction. The first determines the ridge associated with a mode by considering the location where the direction of the reassignment vector sharply changes, the technique used to determine the basin of attraction being directly derived from that used for ridge extraction. The second uses the fact that the STFT of a signal is fully characterized by its zeros (and the particular distribution of these zeros for Gaussian noise) to deduce an algorithm that computes the mode domains. For both techniques, mode reconstruction is then carried out by simply integrating the information inside these basins of attraction or domains.
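
For contrast with the reassignment-based ridge definition, a basic dynamic-programming ridge extractor over the STFT magnitude looks as follows (the jump penalty is an arbitrary assumption):

```python
import numpy as np

def extract_ridge(S, penalty=2.0):
    """Max-energy ridge through |S| (freq x time) with a frequency-jump cost."""
    logS = np.log(np.abs(S) + 1e-12)
    F, T = logS.shape
    cost = np.empty((F, T)); prev = np.zeros((F, T), dtype=int)
    cost[:, 0] = logS[:, 0]
    freqs = np.arange(F)
    for t in range(1, T):
        for f in range(F):
            step = cost[:, t - 1] - penalty * np.abs(freqs - f)
            prev[f, t] = np.argmax(step)
            cost[f, t] = logS[f, t] + step[prev[f, t]]
    ridge = np.zeros(T, dtype=int)
    ridge[-1] = np.argmax(cost[:, -1])
    for t in range(T - 2, -1, -1):       # backtrack the optimal path
        ridge[t] = prev[ridge[t + 1], t + 1]
    return ridge                          # frequency-bin index per frame
```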

Proceedings ArticleDOI
20 Mar 2016
TL;DR: Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.
Abstract: In this paper, a system for overlapping acoustic event detection is proposed, which models the temporal evolution of sound events. The system is based on probabilistic latent component analysis, supporting the use of a sound event dictionary where each exemplar consists of a succession of spectral templates. The temporal succession of the templates is controlled through event class-wise Hidden Markov Models (HMMs). As input time/frequency representation, the Equivalent Rectangular Bandwidth (ERB) spectrogram is used. Experiments are carried out on polyphonic datasets of office sounds generated using an acoustic scene simulator, as well as real and synthesized monophonic datasets for comparative purposes. Results show that the proposed system outperforms several state-of-the-art methods for overlapping acoustic event detection on the same task, using both frame-based and event-based metrics, and is robust to varying event density and noise levels.

Proceedings Article
04 Nov 2016
TL;DR: A multi-label classification task to predict notes in musical recordings is defined, along with an evaluation protocol, and several machine learning architectures for this task are benchmarked.
Abstract: This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions. The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.

Journal ArticleDOI
TL;DR: In this article, a multichannel vibration data processing method for local damage detection in gearboxes is presented. The method is a combination of time-frequency representation and principal component analysis (PCA) applied not to the raw time series but to each slice (along the time axis) of its spectrogram.
Abstract: A multichannel vibration data processing method in the context of local damage detection in gearboxes is presented in this paper. The purpose of the approach is to obtain more reliable information about local damage by using several channels, in comparison to results obtained by single-channel vibration analysis. The method is a combination of time-frequency representation and Principal Component Analysis (PCA) applied not to the raw time series but to each slice (along the time axis) of its spectrogram. Finally, we create a new, aggregated time-frequency map which clearly indicates the presence of the damage. Details and properties of this procedure are described in this paper, along with a comparison to single-channel results. We examine the autocorrelation function of the new aggregated time-frequency map (a 1-D signal) or its simple spectrum (which might be linked to classical envelope analysis). The results are very convincing: cyclic impulses associated with local damage can be clearly detected. In order to validate our method, we used a model of vibration data from a heavy-duty gearbox operated in the mining industry.
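
The per-slice PCA step can be sketched directly: stack the channel spectrograms, then keep the first principal component of each time slice as the aggregated map (windowing parameters and library choices are assumptions):

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import PCA

def aggregated_tf_map(channels, fs, nperseg=256):
    """First principal component across channels for every spectrogram slice."""
    specs = [spectrogram(x, fs=fs, nperseg=nperseg)[2] for x in channels]
    S = np.stack(specs)                    # (n_channels, n_freq, n_time)
    agg = np.zeros(S.shape[1:])
    pca = PCA(n_components=1)
    for t in range(S.shape[2]):
        agg[:, t] = pca.fit_transform(S[:, :, t].T)[:, 0]
    return agg                             # aggregated time-frequency map
```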

Journal Article
TL;DR: A novel procedure for data-driven enhancement of the informative signal that models each sub-signal of the time-frequency representation with an α-stable distribution, a generalization of the standard Gaussian distribution that allows modeling of sub-signals related to both informative and non-informative frequencies.
Abstract: A novel procedure for data-driven enhancement of the informative signal is presented in this paper. The introduced methodology covers decomposition of the signal, via a time-frequency spectrogram, into a set of narrowband sub-signals. Furthermore, each of the sub-signals is considered as a sample of independent identically distributed random variables, and we model the distribution of the sample, in contrast to the classical methodology where a simple statistic, for example kurtosis, is calculated for each sub-signal. This approach provides a new perspective in signal processing techniques for local damage detection. Using our methodology one can eliminate the potential risk related to high sensitivity towards a single outlier. In the proposed procedure we model each sub-signal of the time-frequency representation by an α-stable distribution. This distribution is a generalization of the standard Gaussian one and allows us to model sub-signals related to both informative and non-informative frequencies. As a result, we obtain a distribution of the stability parameter vs. frequency, which is analogous to the spectral kurtosis approach well known in the literature. Such a characteristic is the basis for the design of a filter used for raw signal enhancement. To evaluate the efficiency of our method we compare the raw and filtered signals in the time, time-frequency and frequency (envelope spectrum) domains. Moreover, we present a comparison to the spectral kurtosis approach. We applied the presented methodology to a simulated signal and to a real vibration signal from a two-stage heavy-duty gearbox used in the mining industry.
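
The stability-parameter characteristic can be approximated by fitting an α-stable law to each spectrogram sub-band. A straightforward (if slow) scipy sketch; the STFT settings are assumptions, and levy_stable.fit is generic maximum likelihood rather than the authors' estimator:

```python
import numpy as np
from scipy.signal import stft
from scipy.stats import levy_stable

def stability_spectrum(x, fs, nperseg=256):
    """Stability parameter alpha per frequency band, analogous to a
    spectral-kurtosis curve; low alpha flags impulsive (informative) bands."""
    f, _, Z = stft(x, fs=fs, nperseg=nperseg)
    alphas = []
    for band in np.real(Z):               # each row: one narrowband sub-signal
        alpha, beta, loc, scale = levy_stable.fit(band)
        alphas.append(alpha)
    return f, np.array(alphas)            # basis for the enhancement filter
```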

Proceedings ArticleDOI
01 Aug 2016
TL;DR: It is demonstrated that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time, and that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved.
Abstract: Most current Brain-Computer Interfaces (BCIs) achieve high information transfer rates using spelling paradigms based on stimulus-evoked potentials. Despite the success of these interfaces, this mode of communication can be cumbersome and unnatural. Direct synthesis of speech from neural activity represents a more natural mode of communication that would enable users to convey verbal messages in real-time. In this pilot study with one participant, we demonstrate that electrocorticography (ECoG) intracranial activity from temporal areas can be used to resynthesize speech in real-time. This is accomplished by reconstructing the audio magnitude spectrogram from neural activity and subsequently creating the audio waveform from these reconstructed spectrograms. We show that significant correlations between the original and reconstructed spectrograms and temporal waveforms can be achieved. While this pilot study uses audibly spoken speech for the models, it represents a first step towards speech synthesis from speech imagery.
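
The second stage, creating a waveform from a reconstructed magnitude spectrogram, requires estimating the missing phase; Griffin-Lim is one common choice for that step, though not necessarily the one used in this study:

```python
import librosa

def waveform_from_magnitude(mag_spec, n_iter=32, hop_length=256):
    """Iterative Griffin-Lim phase estimation from a magnitude spectrogram."""
    return librosa.griffinlim(mag_spec, n_iter=n_iter, hop_length=hop_length)
```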

Journal ArticleDOI
TL;DR: In this paper, an accelerogram of the instantaneous phase of signal components, referred to as an instantaneous frequency rate spectrogram (IFRS), is presented as a joint time-frequency distribution.

Proceedings ArticleDOI
21 Mar 2016
TL;DR: A recursive implementation of a recently proposed reassignment process called the Levenberg-Marquardt reassignment, which allows a user to adjust the slimness of the signal components' localization in the time-frequency plane, and a generalization of the signal reconstruction formula that paves the way for real-time computation of a reversible and adjustable almost-ideal time-frequency representation.
Abstract: In this paper, we first present a recursive implementation of a recently proposed reassignment process called the Levenberg-Marquardt reassignment, which allows a user to adjust the slimness of the signal components' localization in the time-frequency plane. Thanks to a generalization of the signal reconstruction formula, we also present a recursive implementation of the synchrosqueezed short-time Fourier transform. This approach paves the way for real-time computation of a reversible and adjustable almost-ideal time-frequency representation.

Journal ArticleDOI
TL;DR: A new method of singing voice analysis that performs mutually-dependent singing voice separation and vocal fundamental frequency (F0) estimation; it outperformed all the other methods of singing voice separation submitted to an international music analysis competition called MIREX 2014.
Abstract: This paper presents a new method of singing voice analysis that performs mutually-dependent singing voice separation and vocal fundamental frequency (F0) estimation. Vocal F0 estimation is considered to become easier if singing voices can be separated from a music audio signal, and vocal F0 contours are useful for singing voice separation. This calls for an approach that improves the performance of each of these tasks by using the results of the other. The proposed method first performs robust principal component analysis (RPCA) for roughly extracting singing voices from a target music audio signal. The F0 contour of the main melody is then estimated from the separated singing voices by finding the optimal temporal path over an F0 saliency spectrogram. Finally, the singing voices are separated again more accurately by combining a conventional time-frequency mask given by RPCA with another mask that passes only the harmonic structures of the estimated F0s. Experimental results showed that the proposed method significantly improved the performances of both singing voice separation and vocal F0 estimation. The proposed method also outperformed all the other methods of singing voice separation submitted to an international music analysis competition called MIREX 2014.
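
The rough-separation stage, RPCA, splits the magnitude spectrogram into a low-rank accompaniment part and a sparse vocal part. A compact inexact-ALM sketch with conventional defaults (λ, μ, and the iteration count are assumptions):

```python
import numpy as np

def rpca(M, n_iter=100):
    """Decompose M into low-rank L (accompaniment) + sparse S (vocals)."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    shrink = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    L = np.zeros_like(M); S = np.zeros_like(M); Y = np.zeros_like(M)
    for _ in range(n_iter):
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt   # singular-value thresholding
        S = shrink(M - L + Y / mu, lam / mu)   # sparse (vocal) component
        Y += mu * (M - L - S)                  # dual update
    return L, S  # soft or binary masks can then be built from |S| vs. |L|
```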