
Showing papers on "Spectrogram published in 2017"


Posted Content
TL;DR: Tacotron 2 as mentioned in this paper uses a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms.
Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$, comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
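For context, the mel-scale spectrogram used here as the compact intermediate representation can be computed from a waveform with standard tooling. The following is a minimal sketch using librosa; the file path, frame sizes, and number of mel bands are illustrative assumptions, not necessarily the settings used by Tacotron 2.

```python
import librosa
import numpy as np

# Load a waveform and compute a log mel-scale spectrogram (illustrative parameters).
y, sr = librosa.load("speech.wav", sr=22050)
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))  # log compression
print(log_mel.shape)  # (n_mels, n_frames) -- the feature network predicts frames like these
```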

733 citations


Journal Article
TL;DR: The main idea of SET is to retain only the TF information of the STFT results most related to the time-varying features of the signal and to remove most of the smeared TF energy, so that the energy concentration of the novel TF representation can be greatly enhanced.
Abstract: In this paper, we introduce a new time-frequency (TF) analysis (TFA) method to study the trend and instantaneous frequency (IF) of nonlinear and nonstationary data. Our proposed method is termed the synchroextracting transform (SET), which is a postprocessing procedure of the short-time Fourier transform (STFT). Compared with classical TFA methods, the proposed method can generate a more energy-concentrated TF representation and allows for signal reconstruction. The proposed SET method is inspired by the recently proposed synchrosqueezing transform (SST) and the theory of the ideal TFA. To analyze a signal, it is important to obtain the time-varying information, such as the IF and instantaneous amplitude. The idea of the SST is to squeeze all TF coefficients into the IF trajectory. Differing from the squeezing manner of the SST, the main idea of the SET is to retain only the TF information of the STFT results most related to the time-varying features of the signal and to remove most of the smeared TF energy, so that the energy concentration of the novel TF representation can be greatly enhanced. Numerical and real-world signals are employed to validate the effectiveness of the SET method.
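As a rough illustration of the extraction idea only (not the paper's exact estimator, which is derived from an STFT with a derivative window), the sketch below estimates the instantaneous frequency of each STFT bin from the phase increment between frames and keeps only those coefficients whose IF estimate falls back into their own bin. All parameters and the chirp test signal are assumptions.

```python
import numpy as np
from scipy.signal import stft

def synchroextract(x, fs, nperseg=256, hop=32):
    """Minimal SET-style sketch: keep only STFT coefficients whose
    instantaneous-frequency estimate lands in their own frequency bin."""
    noverlap = nperseg - hop
    f, t, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    k = np.arange(Z.shape[0])[:, None]                 # bin indices
    # Phase increment between consecutive frames, minus the advance expected
    # for a sinusoid sitting exactly at the bin centre.
    dphi = np.angle(Z[:, 1:] * np.conj(Z[:, :-1]))
    expected = 2 * np.pi * k * hop / nperseg
    dev = np.angle(np.exp(1j * (dphi - expected)))     # wrap to [-pi, pi]
    inst_freq = k * fs / nperseg + dev * fs / (2 * np.pi * hop)
    # Synchroextraction step: zero out everything whose IF does not round to its bin.
    keep = np.round(inst_freq / (fs / nperseg)) == k
    S = np.zeros_like(Z)
    S[:, 1:][keep] = Z[:, 1:][keep]
    return f, t[1:], S[:, 1:]

# Example: a linear chirp -- the extracted representation concentrates
# around the instantaneous frequency trajectory.
fs = 4000
tt = np.arange(0, 1, 1 / fs)
x = np.cos(2 * np.pi * (200 * tt + 300 * tt ** 2))
f, t, S = synchroextract(x, fs)
```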

310 citations


Proceedings ArticleDOI
01 Feb 2017
TL;DR: Preliminary results indicate that the proposed approach based on the freshly trained model is better than the fine-tuned model, and is capable of predicting emotions accurately and efficiently.
Abstract: This paper presents a method for speech emotion recognition using spectrograms and a deep convolutional neural network (CNN). Spectrograms generated from the speech signals are input to the deep CNN. The proposed model, consisting of three convolutional layers and three fully connected layers, extracts discriminative features from spectrogram images and outputs predictions for the seven emotions. In this study, we trained the proposed model on spectrograms obtained from the Berlin emotions dataset. Furthermore, we also investigated the effectiveness of transfer learning for emotion recognition using a pre-trained AlexNet model. Preliminary results indicate that the proposed approach based on the freshly trained model is better than the fine-tuned model, and is capable of predicting emotions accurately and efficiently.
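A minimal sketch of a network with this overall shape (three convolutional layers followed by three fully connected layers over spectrogram images, with seven emotion outputs) might look as follows in PyTorch; the layer widths and input size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpectrogramEmotionCNN(nn.Module):
    """Three conv layers + three fully connected layers over spectrogram images."""
    def __init__(self, n_emotions=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_emotions),
        )

    def forward(self, x):          # x: (batch, 1, 128, 128) spectrogram images
        return self.classifier(self.features(x))

model = SpectrogramEmotionCNN()
logits = model(torch.randn(4, 1, 128, 128))   # -> (4, 7) emotion scores
```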

230 citations


Journal ArticleDOI
TL;DR: In this paper, the Fast Spectral Correlation (FSC) estimator, a fast estimator of the spectral correlation of cyclostationary signals based on the short-time Fourier transform (STFT), is proposed.

228 citations


Proceedings ArticleDOI
20 Aug 2017
TL;DR: In this article, a conditional generative adversarial network (cGAN) was proposed to improve speech enhancement in noisy environments by learning a mapping from the spectrogram of noisy speech to an enhanced counterpart; the networks are trained in an adversarial manner, with a discriminator that distinguishes between enhanced spectrograms provided by the generator and clean ones from the database, using the noisy spectrogram as a condition.
Abstract: Improving speech system performance in noisy environments remains a challenging task, and speech enhancement (SE) is one of the effective techniques to solve the problem. Motivated by the promising results of generative adversarial networks (GANs) in a variety of image processing tasks, we explore the potential of conditional GANs (cGANs) for SE, and in particular, we make use of the image processing framework proposed by Isola et al. [1] to learn a mapping from the spectrogram of noisy speech to an enhanced counterpart. The SE cGAN consists of two networks, trained in an adversarial manner: a generator that tries to enhance the input noisy spectrogram, and a discriminator that tries to distinguish between enhanced spectrograms provided by the generator and clean ones from the database using the noisy spectrogram as a condition. We evaluate the performance of the cGAN method in terms of perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and equal error rate (EER) of speaker verification (an example application). Experimental results show that the cGAN method overall outperforms the classical short-time spectral amplitude minimum mean square error (STSA-MMSE) SE algorithm, and is comparable to a deep neural network-based SE approach (DNN-SE).
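To make the conditioning concrete, a stripped-down sketch of the discriminator side is shown below: the noisy spectrogram (the condition) is concatenated channel-wise with either a clean or an enhanced spectrogram before being scored, in the spirit of the Isola et al. image-to-image framework. The architecture details are placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    """Scores (condition, candidate) spectrogram pairs, pix2pix-style."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),   # patch-wise real/fake scores
        )

    def forward(self, noisy_spec, candidate_spec):
        # Channel-wise concatenation implements the conditioning on the noisy input.
        return self.net(torch.cat([noisy_spec, candidate_spec], dim=1))

D = ConditionalDiscriminator()
noisy = torch.randn(8, 1, 128, 128)      # noisy magnitude spectrograms (condition)
clean = torch.randn(8, 1, 128, 128)      # clean targets from the database
enhanced = torch.randn(8, 1, 128, 128)   # stand-in generator output G(noisy)
real_scores = D(noisy, clean)            # trained towards "real"
fake_scores = D(noisy, enhanced)         # trained towards "fake"
```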

191 citations


Journal ArticleDOI
TL;DR: It is found that spectrogram image classification with the CNN algorithm works as well as the SVM algorithm, and that, given a large amount of data, the CNN and SVM machine learning algorithms can accurately classify and pre-diagnose respiratory audio.
Abstract: In the field of medicine, with the introduction of computer systems that can collect and analyze massive amounts of data, many non-invasive diagnostic methods are being developed for a variety of conditions. In this study, our aim is to develop a non-invasive method of classifying respiratory sounds that are recorded by an electronic stethoscope and audio recording software, using various machine learning algorithms. In order to store respiratory sounds on a computer, we developed a cost-effective and easy-to-use electronic stethoscope that can be used with any device. Using this device, we recorded 17,930 lung sounds from 1630 subjects. We employed two types of machine learning algorithms: mel frequency cepstral coefficient (MFCC) features in a support vector machine (SVM) and spectrogram images in a convolutional neural network (CNN). Since using MFCC features with an SVM algorithm is a generally accepted classification method for audio, we utilized its results to benchmark the CNN algorithm. We prepared four data sets for each of the CNN and SVM algorithms to classify respiratory audio: (1) healthy versus pathological classification; (2) rale, rhonchus, and normal sound classification; (3) singular respiratory sound type classification; and (4) audio type classification with all sound types. Accuracy results of the experiments were: (1) CNN 86%, SVM 86%; (2) CNN 76%, SVM 75%; (3) CNN 80%, SVM 80%; and (4) CNN 62%, SVM 62%, respectively. As a result, we found that spectrogram image classification with the CNN algorithm works as well as the SVM algorithm, and that, given the large amount of data, the CNN and SVM machine learning algorithms can accurately classify and pre-diagnose respiratory audio.
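The MFCC-plus-SVM baseline described here follows a standard recipe; a hedged sketch with librosa and scikit-learn (tools assumed for illustration, not necessarily those used in the paper) is given below, using mean- and std-pooled MFCCs per recording. The file paths, labels, sampling rate, and pooling choice are placeholders.

```python
import librosa
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def mfcc_features(path, sr=8000, n_mfcc=13):
    """Mean and std of MFCCs over time -- one fixed-length vector per recording."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder paths/labels standing in for the lung-sound recordings and their classes
# (e.g. healthy vs pathological).
paths, labels = ["sound_001.wav", "sound_002.wav"], ["healthy", "pathological"]
X = np.stack([mfcc_features(p) for p in paths])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
print(clf.predict(X[:1]))
```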

165 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this paper, a CNN model was proposed to estimate clean real and imaginary (RI) spectrograms from noisy ones, which are then used to synthesize enhanced speech waveforms.
Abstract: This paper aims to address two issues existing in current speech enhancement methods: 1) the difficulty of phase estimation; 2) a single objective function cannot consider multiple metrics simultaneously. To solve the first problem, we propose a novel convolutional neural network (CNN) model for complex spectrogram enhancement, namely estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstructed RI spectrograms are directly used to synthesize enhanced speech waveforms. In addition, since the log-power spectrogram (LPS) can be represented as a function of the RI spectrograms, its reconstruction is also considered as another target. Thus a unified objective function, which combines these two targets (reconstruction of RI spectrograms and LPS), is equivalent to simultaneously optimizing two commonly used objective metrics: segmental signal-to-noise ratio (SSNR) and log-spectral distortion (LSD). Therefore, the learning process is called multi-metrics learning (MML). Experimental results confirm the effectiveness of the proposed CNN with RI spectrograms and MML in terms of improved standardized evaluation metrics on a speech enhancement task.
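To illustrate the targets involved, the sketch below derives real/imaginary (RI) spectrograms and the log-power spectrogram (LPS) from a waveform with torch's STFT and combines the two reconstruction errors into one loss. The STFT settings, weighting, and stand-in network output are assumptions, not the paper's setup.

```python
import torch

def ri_and_lps(wave, n_fft=512, hop=128, eps=1e-8):
    """Real/imaginary spectrograms and the log-power spectrogram of a waveform."""
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    ri = torch.stack([spec.real, spec.imag], dim=1)       # (batch, 2, freq, time)
    lps = torch.log(spec.real ** 2 + spec.imag ** 2 + eps)
    return ri, lps

def multi_metric_loss(ri_est, ri_ref, alpha=0.5, eps=1e-8):
    """Combine the RI reconstruction error with the LPS error it implies."""
    lps_est = torch.log(ri_est[:, 0] ** 2 + ri_est[:, 1] ** 2 + eps)
    lps_ref = torch.log(ri_ref[:, 0] ** 2 + ri_ref[:, 1] ** 2 + eps)
    return alpha * torch.mean((ri_est - ri_ref) ** 2) + \
           (1 - alpha) * torch.mean((lps_est - lps_ref) ** 2)

clean = torch.randn(4, 16000)                       # stand-in clean waveforms
ri_ref, _ = ri_and_lps(clean)
ri_est = ri_ref + 0.1 * torch.randn_like(ri_ref)    # stand-in network output
loss = multi_metric_loss(ri_est, ri_ref)
```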

157 citations


Proceedings ArticleDOI
19 Oct 2017
TL;DR: Key results indicate that deep-spectrum features are comparable in performance with the other tested acoustic feature representations in matched noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
Abstract: The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep-spectrum features are comparable in performance with the other tested acoustic feature representations in matched noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.
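The idea of deep spectrum features, pushing a spectrogram rendered as an image through a pre-trained image CNN and reading off the last fully connected layer's activations, can be sketched as below. VGG16 from torchvision is used here as the assumed image CNN, and the preprocessing is illustrative rather than the paper's.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pre-trained image CNN; drop its final classification layer so the forward
# pass ends at the last fully connected (4096-d) activations.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

def deep_spectrum_features(spectrogram_img):
    """spectrogram_img: (3, 224, 224) tensor -- a spectrogram rendered as an RGB image."""
    with torch.no_grad():
        return vgg(spectrogram_img.unsqueeze(0)).squeeze(0)   # (4096,) feature vector

feat = deep_spectrum_features(torch.rand(3, 224, 224))
print(feat.shape)   # torch.Size([4096])
```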

141 citations


Posted Content
TL;DR: The experiments show how deep architectures with sample-level filters improve the accuracy in music auto-tagging and they provide results comparable to previous state-of-the-art performances for the Magnatagatune dataset and Million Song Dataset.
Abstract: Recently, the end-to-end approach that learns hierarchical representations from raw data using deep convolutional neural networks has been successfully explored in the image, text and speech domains. This approach has been applied to musical signals as well but has not yet been fully explored. To this end, we propose sample-level deep convolutional neural networks which learn representations from very small grains of waveforms (e.g. 2 or 3 samples) beyond typical frame-level input representations. Our experiments show how deep architectures with sample-level filters improve the accuracy in music auto-tagging and provide results comparable to previous state-of-the-art performances on the MagnaTagATune dataset and the Million Song Dataset. In addition, we visualize the filters learned in a sample-level DCNN at each layer to identify hierarchically learned features, and show that they are sensitive to log-scaled frequency along the layers, similar to the mel-frequency spectrogram that is widely used in music classification systems.
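A compressed sketch of the sample-level idea, stacks of 1D convolutions with very small filters (e.g. length 3 with stride-3 pooling) applied directly to the raw waveform, is given below in PyTorch; the depth, widths, number of tags, and input length are illustrative assumptions.

```python
import torch
import torch.nn as nn

def sample_block(cin, cout):
    # Length-3 filters with stride-3 pooling: each block sees three "samples"
    # of the previous layer, analogous to sample-level filters.
    return nn.Sequential(nn.Conv1d(cin, cout, kernel_size=3, padding=1),
                         nn.BatchNorm1d(cout), nn.ReLU(), nn.MaxPool1d(3))

class SampleLevelCNN(nn.Module):
    def __init__(self, n_tags=50, width=64, depth=9):
        super().__init__()
        blocks = [sample_block(1, width)] + [sample_block(width, width) for _ in range(depth - 1)]
        self.net = nn.Sequential(*blocks, nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(width, n_tags))

    def forward(self, wave):          # wave: (batch, 1, n_samples) raw audio
        return self.net(wave)

model = SampleLevelCNN()
tags = model(torch.randn(2, 1, 59049))   # 59049 = 3^10 raw samples (assumed input length)
print(tags.shape)                        # (2, 50) tag scores
```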

131 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: Experimental results of hand gesture recognition with a low power FMCW radar and a deep convolutional neural network (CNN) are presented, yielding excellent recognition performance in a simple test case consisting of 3 different gestures.
Abstract: Gesture recognition with radar enables remote control of consumer devices such as audio equipment, television sets and gaming consoles. In this paper, experimental results of hand gesture recognition with a low power FMCW radar and a deep convolutional neural network (CNN) are presented. The FMCW radar operates in the 24 GHz ISM frequency band and has an effective isotropic radiated power level of 0 dBm. Since low power consumption is a key aspect for application in consumer devices, the FMCW radar has only one receive channel which is different from other FMCW radars with multiple receive channels that have been described in literature. The recognition of gestures is performed with a deep convolutional neural network that is trained and tested with micro-Doppler spectrograms yielding excellent recognition performance in a simple test case consisting of 3 different gestures. A comparison of the training and test results for an amplitude spectrogram and a complex-valued spectrogram as the CNN input shows that in this test case there is no major benefit of using the phase information in the spectrogram.

120 citations


Posted Content
TL;DR: This study supports the hypothesis that time-frequency representations are valuable in learning useful features for sound classification, and observes that the optimal window size during transformation is dependent on the characteristics of the audio signal and that, architecturally, 2D convolution yielded better results than 1D in most cases.
Abstract: Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. Visual displays of an audio signal, through various time-frequency representations such as spectrograms offer a rich representation of the temporal and spectral structure of the original signal. In this letter, we compare various popular signal processing methods to obtain this representation, such as short-time Fourier transform (STFT) with linear and Mel scales, constant-Q transform (CQT) and continuous Wavelet transform (CWT), and assess their impact on the classification performance of two environmental sound datasets using CNNs. This study supports the hypothesis that time-frequency representations are valuable in learning useful features for sound classification. Moreover, the actual transformation used is shown to impact the classification accuracy, with Mel-scaled STFT outperforming the other discussed methods slightly and baseline MFCC features to a large degree. Additionally, we observe that the optimal window size during transformation is dependent on the characteristics of the audio signal and architecturally, 2D convolution yielded better results in most cases compared to 1D.
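Most of the representations compared in the letter can be produced directly with librosa; a brief sketch is shown below (the CWT is omitted, and the file path and parameters are assumptions).

```python
import librosa
import numpy as np

y, sr = librosa.load("environmental_sound.wav", sr=22050)

# Linear-frequency STFT magnitude, mel-scaled STFT, and constant-Q transform.
stft_lin = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))
stft_mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                          hop_length=512, n_mels=128)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512))

# Baseline MFCC features, as compared against in the letter.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

for name, rep in [("STFT", stft_lin), ("mel-STFT", stft_mel),
                  ("CQT", cqt), ("MFCC", mfcc)]:
    print(name, rep.shape)   # each is a (bins x frames) image that a CNN can consume
```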

Journal ArticleDOI
TL;DR: A scaled spectrogram and tensor decomposition based method to extract more discriminative features for heart sound classification is proposed and is evaluated on three public datasets offered by the PASCAL classifying heart sounds challenge and 2016 PhysioNet challenge.
Abstract: First, the spectrograms of heart cycles are scaled for comparison. Second, tensor decomposition is applied to the scaled spectrograms. Third, the intrinsic structure information of the scaled spectrograms is extracted. Fourth, more useful physiological and pathological information is preserved. Fifth, the extracted features are more discriminative. Heart sound signal analysis is an effective and convenient method for the preliminary diagnosis of heart disease. However, automatic heart sound classification is still a challenging problem, mainly reflected in heart sound segmentation and feature extraction from the corresponding segmentation results. In order to extract more discriminative features for heart sound classification, a scaled spectrogram and tensor decomposition based method was proposed in this study. In the proposed method, the spectrograms of the detected heart cycles are first scaled to a fixed size. Then a dimension reduction process of the scaled spectrograms is performed to extract the most discriminative features. During the dimension reduction process, the intrinsic structure of the scaled spectrograms, which contains important physiological and pathological information of the heart sound signals, is extracted using a tensor decomposition method. As a result, the extracted features are more discriminative. Finally, the classification task is completed by a support vector machine (SVM). Moreover, the proposed method is evaluated on three public datasets offered by the PASCAL classifying heart sounds challenge and the 2016 PhysioNet challenge. The results show that the proposed method is competitive.
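As a rough illustration of the pipeline (scale each heart-cycle spectrogram to a fixed size, stack the cycles into a 3-way tensor, and reduce dimension via the tensor's mode subspaces), the sketch below uses a simple HOSVD-style reduction in NumPy as a stand-in for the paper's exact tensor decomposition; the spectrogram settings, image size, ranks, and random stand-in cycles are all assumptions.

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.signal import spectrogram

def scaled_spectrogram(cycle, fs, size=(64, 64)):
    """Spectrogram of one heart cycle, rescaled to a fixed image size."""
    _, _, S = spectrogram(cycle, fs=fs, nperseg=128, noverlap=96)
    return zoom(S, (size[0] / S.shape[0], size[1] / S.shape[1]))

def hosvd_features(tensor, ranks=(8, 8)):
    """HOSVD-style reduction of a (cycles x freq x time) tensor: project every
    spectrogram onto the leading frequency and time subspaces."""
    def unfold(T, mode):
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    U_f = np.linalg.svd(unfold(tensor, 1), full_matrices=False)[0][:, :ranks[0]]
    U_t = np.linalg.svd(unfold(tensor, 2), full_matrices=False)[0][:, :ranks[1]]
    core = np.einsum("nft,fi,tj->nij", tensor, U_f, U_t)
    return core.reshape(tensor.shape[0], -1)     # one feature vector per cycle

fs = 2000
cycles = [np.random.randn(fs) for _ in range(10)]           # stand-in heart cycles
T = np.stack([scaled_spectrogram(c, fs) for c in cycles])   # (10, 64, 64)
features = hosvd_features(T)                                # (10, 64) feature vectors for an SVM
```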

Posted Content
TL;DR: In this article, the authors present a review of various representations and issues that arise when using neural networks for style transfer in audio applications, focusing particularly on spectrograms for generating audio using NNs.
Abstract: One of the decisions that arise when designing a neural network for any application is how the data should be represented in order to be presented to, and possibly generated by, a neural network. For audio, the choice is less obvious than it seems to be for visual images, and a variety of representations have been used for different applications including the raw digitized sample stream, hand-crafted features, machine discovered features, MFCCs and variants that include deltas, and a variety of spectral representations. This paper reviews some of these representations and issues that arise, focusing particularly on spectrograms for generating audio using neural networks for style transfer.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a convolutional neural networks (CNN)-based architecture that embraces multi-level and multi-scaled features for music auto-tagging, which outperformed previous state-of-the-art on the MagnaTagATune dataset and the Million Song Dataset.
Abstract: Music auto-tagging is often handled in a similar manner to image classification by regarding the 2D audio spectrogram as image data. However, music auto-tagging is distinguished from image classification in that the tags are highly diverse and have different levels of abstraction. Considering this issue, we propose a convolutional neural network (CNN)-based architecture that embraces multi-level and multi-scaled features. The architecture is trained in three steps. First, we conduct supervised feature learning to capture local audio features using a set of CNNs with different input sizes. Second, we extract audio features from each layer of the pre-trained convolutional networks separately and aggregate them altogether given a long audio clip. Finally, we put them into fully-connected networks and make final predictions of the tags. Our experiments show that using the combination of multi-level and multi-scale features is highly effective in music auto-tagging, and the proposed method outperforms the previous state of the art on the MagnaTagATune dataset and the Million Song Dataset. We further show that the proposed architecture is useful in transfer learning.

Proceedings ArticleDOI
01 Dec 2017
TL;DR: A deep Convolutional Neural Network model is proposed that enables Measurement Capable Devices (MCDs) to identify the presence of radar signals in the radio spectrum, even when these signals are overlapped with other sources of interference, such as commercial Long-Term Evolution (LTE) and Wireless Local Area Network (WLAN).
Abstract: In this paper, we present a spectrum monitoring framework for the detection of radar signals in spectrum sharing scenarios. The core of our framework is a deep Convolutional Neural Network (CNN) model that enables Measurement Capable Devices (MCDs) to identify the presence of radar signals in the radio spectrum, even when these signals are overlapped with other sources of interference, such as commercial Long-Term Evolution (LTE) and Wireless Local Area Network (WLAN). We collected a large dataset of RF measurements, which include the transmissions of multiple radar pulse waveforms, downlink LTE, WLAN, and thermal noise. We propose a pre-processing data representation that leverages the amplitude and phase shifts of the collected samples. This representation allows our CNN model to achieve a classification accuracy of 99.6% on our testing dataset. The trained CNN model is then tested under various SNR values, outperforming other models, such as spectrogram-based CNN models.
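An amplitude-and-phase input of this kind can be formed directly from complex IQ samples; a minimal sketch follows, with the frame length, shapes, and stand-in signal being assumptions rather than the paper's exact representation.

```python
import numpy as np

def amplitude_phase_input(iq_frame):
    """Stack amplitude and phase of complex IQ samples into a two-channel input.
    iq_frame: complex array of shape (n_samples,)."""
    amplitude = np.abs(iq_frame)
    phase = np.angle(iq_frame)                 # radians in [-pi, pi]
    return np.stack([amplitude, phase])        # (2, n_samples) -> fed to the CNN

# Stand-in RF capture: complex noise plus a short pulsed tone.
n = 4096
iq = (np.random.randn(n) + 1j * np.random.randn(n)) / np.sqrt(2)
iq[1000:1200] += np.exp(2j * np.pi * 0.1 * np.arange(200))
x = amplitude_phase_input(iq)
print(x.shape)   # (2, 4096)
```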

Journal ArticleDOI
TL;DR: A scaled spectrogram and partial least squares regression (PLSR) based method was proposed for the classification of PCG signals and the results are compared to those obtained using the best methods in the challenge, thereby proving the effectiveness of the method.

Proceedings ArticleDOI
01 Oct 2017
TL;DR: DeepBreath as mentioned in this paper is a deep learning model which automatically detects people's psychological stress level (mental overload) from their breathing patterns using a low-cost thermal camera, tracking a person's breathing patterns as temperature changes around his/her nostril.
Abstract: We propose DeepBreath, a deep learning model which automatically recognises people's psychological stress level (mental overload) from their breathing patterns. Using a low cost thermal camera, we track a person's breathing patterns as temperature changes around his/her nostril. The paper's technical contribution is threefold. First of all, instead of creating handcrafted features to capture aspects of the breathing patterns, we transform the uni-dimensional breathing signals into two dimensional respiration variability spectrogram (RVS) sequences. The spectrograms easily capture the complexity of the breathing dynamics. Second, a spatial pattern analysis based on a deep Convolutional Neural Network (CNN) is directly applied to the spectrogram sequences without the need of hand-crafting features. Finally, a data augmentation technique, inspired from solutions for over-fitting problems in deep learning, is applied to allow the CNN to learn with a small-scale dataset from short-term measurements (e.g., up to a few hours). The model is trained and tested with data collected from people exposed to two types of cognitive tasks (Stroop Colour Word Test, Mental Computation test) with sessions of different difficulty levels. Using normalised self-report as ground truth, the CNN reaches 84.59% accuracy in discriminating between two levels of stress and 56.52% in discriminating between three levels. In addition, the CNN outperformed powerful shallow learning methods based on a single layer neural network. Finally, the dataset of labelled thermal images will be open to the community.

Journal ArticleDOI
TL;DR: This work proposes a new method for automated field recording analysis with improved automated segmentation and robust bird species classification, and provides comparable identification performance with respect to the eleven species of interest.

Posted Content
TL;DR: This work studies how waveform-based and spectrogram-based models perform when datasets of variable size are available for training, showing that waveform-based models outperform spectrogram-based ones in large-scale data scenarios and suggesting that music domain assumptions are relevant when not enough training data are available.
Abstract: The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study, 1.2M tracks annotated with musical labels are available to train our end-to-end models. This large amount of data allows us to unrestrictedly explore two different design paradigms for music auto-tagging: assumption-free models - using waveforms as input with very small convolutional filters; and models that rely on domain knowledge - log-mel spectrograms with a convolutional neural network designed to learn timbral and temporal features. Our work focuses on studying how these two types of deep architectures perform when datasets of variable size are available for training: the MagnaTagATune (25k songs), the Million Song Dataset (240k songs), and a private dataset of 1.2M songs. Our experiments suggest that music domain assumptions are relevant when not enough training data are available, thus showing how waveform-based models outperform spectrogram-based ones in large-scale data scenarios.

Proceedings ArticleDOI
20 Aug 2017
TL;DR: The proposed postfilter can be used to reduce the gap between synthesized and target spectra, even in the high-dimensional STFT domain, and is applied to a DNN-based speech-synthesis task.
Abstract: We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-term Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, voice conversion, and speech enhancement, STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict the representations from inputs; however, generated spectra typically lack the fine structures close to those of the true data. To overcome these limitations and reconstruct spectra having finer structures, we propose a generative adversarial network (GAN)-based postfilter that is implicitly optimized to match the true feature distribution in adversarial learning. The challenge with this postfilter is that a GAN cannot be easily trained for very high-dimensional data such as the STFT. Therefore, we introduce a divide-and-concatenate strategy. We first divide the spectrograms into multiple frequency bands with overlap, train the GAN-based postfilter for the individual bands, and finally connect the bands with overlap. We applied our proposed postfilter to a DNN-based speech-synthesis task. The results show that our proposed postfilter can be used to reduce the gap between synthesized and target spectra, even in the high-dimensional STFT domain.
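The divide-and-concatenate strategy itself is easy to picture: split the STFT spectrogram into overlapping frequency bands, process each band separately, and cross-fade the overlaps when reassembling. A hedged NumPy sketch is shown below; the band size, overlap, and linear cross-fade are assumptions, not the paper's exact scheme.

```python
import numpy as np

def split_bands(spec, band=64, overlap=16):
    """Split a (freq x time) spectrogram into overlapping frequency bands."""
    hop = band - overlap
    starts = list(range(0, max(spec.shape[0] - band, 0) + 1, hop))
    if starts[-1] + band < spec.shape[0]:
        starts.append(spec.shape[0] - band)
    return [(s, spec[s:s + band]) for s in starts]

def merge_bands(bands, n_freq, overlap=16):
    """Reassemble processed bands, cross-fading linearly in the overlap regions."""
    out = np.zeros((n_freq, bands[0][1].shape[1]))
    weight = np.zeros((n_freq, 1))
    for start, b in bands:
        w = np.ones((b.shape[0], 1))
        if start > 0:
            w[:overlap, 0] = np.linspace(0, 1, overlap)      # fade in
        if start + b.shape[0] < n_freq:
            w[-overlap:, 0] = np.linspace(1, 0, overlap)     # fade out
        out[start:start + b.shape[0]] += w * b
        weight[start:start + b.shape[0]] += w
    return out / np.maximum(weight, 1e-8)

spec = np.random.rand(257, 200)                  # stand-in STFT magnitude spectrogram
bands = split_bands(spec)
# ...each band would be passed through its own GAN-based postfilter here...
merged = merge_bands(bands, spec.shape[0])
print(merged.shape)   # (257, 200)
```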

Journal ArticleDOI
TL;DR: New chirp rate and instantaneous frequency estimators designed for frequency-modulated signals are introduced and paves the way to the real-time computation of a time-frequency representation, which is both invertible and sharply localized in frequency.
Abstract: This letter introduces new chirp rate and instantaneous frequency estimators designed for frequency-modulated signals. These estimators are first investigated from a deterministic point of view, then compared together in terms of statistical efficiency. They are also used to design new recursive versions of the vertically synchrosqueezed short-time Fourier transform, using a previously published method (D. Fourer, F. Auger, and P. Flandrin, “Recursive versions of the Levenberg-Marquardt reassigned spectrogram and of the synchrosqueezed STFT,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. , Mar. 2016, pp. 4880–4884). This study paves the way to the real-time computation of a time-frequency representation, which is both invertible and sharply localized in frequency.

Journal ArticleDOI
TL;DR: An extensive set of acoustic–phonetic features extracted in adverse conditions is investigated; feature combination sets constructed using a sequential floating forward selection algorithm outperform individual features, and optimal feature sets in anechoic conditions are found to differ from those in reverberant conditions.
Abstract: Monaural speech separation is a fundamental problem in speech and signal processing. This problem can be approached from a supervised learning perspective by predicting an ideal time–frequency mask from features of noisy speech. In reverberant conditions at low signal-to-noise ratios (SNRs), accurate mask prediction is challenging and can benefit from effective features. In this paper, we investigate an extensive set of acoustic–phonetic features extracted in adverse conditions. Deep neural networks are used as the learning machine, and separation performance is evaluated using standard objective speech intelligibility metrics. Separation performance is systematically evaluated in both nonspeech and speech interference, in a variety of SNRs, reverberation times, and direct-to-reverberant energy ratios. Considerable performance improvement is observed by using contextual information, likely due to temporal effects of room reverberation. In addition, we construct feature combination sets using a sequential floating forward selection algorithm, and combined features outperform individual ones. We also find that optimal feature sets in anechoic conditions are different from those in reverberant conditions.

Book ChapterDOI
14 Nov 2017
TL;DR: This article proposes a hybrid Convolutional Recurrent Neural Network (CRNN) that operates on spectrogram images of the provided audio snippets, is applicable to a range of noisy scenarios, and can easily be extended to previously unknown languages.
Abstract: Language Identification (LID) systems are used to classify the spoken language from a given audio sample and are typically the first step for many spoken language processing tasks, such as Automatic Speech Recognition (ASR) systems. Without automatic language detection, speech utterances cannot be parsed correctly and grammar rules cannot be applied, causing subsequent speech recognition steps to fail. We propose a LID system that solves the problem in the image domain, rather than the audio domain. We use a hybrid Convolutional Recurrent Neural Network (CRNN) that operates on spectrogram images of the provided audio snippets. In extensive experiments we show that our model is applicable to a range of noisy scenarios and can easily be extended to previously unknown languages, while maintaining its classification accuracy. We release our code and a large scale training set for LID systems to the community.
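A hedged sketch of such a hybrid CRNN, a convolutional front end over the spectrogram image followed by a recurrent layer across the time axis and a language classifier, is shown below; the layer sizes, input shape, and number of languages are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SpectrogramCRNN(nn.Module):
    """Conv layers over the spectrogram, an LSTM across time frames, then a classifier."""
    def __init__(self, n_languages=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=64 * 32, hidden_size=128, batch_first=True)
        self.fc = nn.Linear(128, n_languages)

    def forward(self, spec):                  # spec: (batch, 1, 128 freq, T time)
        h = self.conv(spec)                   # (batch, 64, 32, T/4)
        h = h.permute(0, 3, 1, 2).flatten(2)  # (batch, T/4, 64*32) -- one vector per frame
        _, (hn, _) = self.rnn(h)
        return self.fc(hn[-1])                # language logits from the last hidden state

model = SpectrogramCRNN()
logits = model(torch.randn(2, 1, 128, 400))   # -> (2, 6)
```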

Journal ArticleDOI
TL;DR: The authors propose an object-oriented dimension-reduction technique, subspace reliability analysis, which directly removes the unreliable feature dimensions of two class-conditional covariance matrices in two separate subspaces and demonstrates better performance than state-of-the-art approaches.

Posted Content
TL;DR: A deep neural network, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word recognition accuracy.
Abstract: In this study, we propose a deep neural network for reconstructing intelligible speech from silent lip movement videos. We use the auditory spectrogram as the spectral representation of speech, together with its corresponding sound generation method, resulting in more natural sounding reconstructed speech. Our proposed network consists of an autoencoder to extract bottleneck features from the auditory spectrogram, which are then used as the target of our main lip reading network comprising CNN, LSTM and fully connected layers. Our experiments show that the autoencoder is able to reconstruct the original auditory spectrogram with a 98% correlation and also improves the quality of the speech reconstructed by the main lip reading network. Our model, trained jointly on different speakers, is able to extract individual speaker characteristics and gives promising results in reconstructing intelligible speech with superior word recognition accuracy.

Proceedings ArticleDOI
19 Jun 2017
TL;DR: With CNNs trained in such a way that filter dimensions are interpretable in time and frequency, results show that only eight music features are more efficient than 513 frequency bins of a spectrogram and that late score fusion between systems based on both feature types reaches 91% accuracy on the GTZAN database.
Abstract: Nowadays, deep learning is more and more used for Music Genre Classification: particularly Convolutional Neural Networks (CNNs) that take as input a spectrogram considered as an image in which different types of structure are sought. But, facing the criticism relating to the difficulty of understanding the underlying relationships that neural networks learn in the presence of a spectrogram, we propose to use, as inputs to a CNN, a small set of eight music features chosen along three main music dimensions: dynamics, timbre and tonality. With CNNs trained in such a way that filter dimensions are interpretable in time and frequency, results show that only eight music features are more efficient than the 513 frequency bins of a spectrogram and that late score fusion between systems based on both feature types reaches 91% accuracy on the GTZAN database.

Proceedings ArticleDOI
01 Nov 2017
TL;DR: A dual-band radar classification scheme is proposed to enhance the robustness of micro-Doppler based classification of drones; the results show that the classification accuracy obtained by the fusion of dual-band radar sensors is higher than that obtained using only a single radar sensor.
Abstract: Drone classification has become of great importance due to the increasing popularity and potential threats of drones. The micro-Doppler signatures depending on the rotation of rotor blades allow us to differentiate various types of drones. To enhance the robustness of micro-Doppler based classification of drones, a dual-band radar classification scheme is proposed in this paper. Firstly, the time-frequency spectrograms are obtained by performing the short-time Fourier transform (STFT) on the radar data collected by K-band and X-band radar sensors respectively. Then principal component analysis (PCA) is utilized to extract features from the time-frequency spectrograms, and the features obtained by the two radar sensors are fused together. Finally, the classification results are obtained by using a Support Vector Machine (SVM). The experimental results show that the classification accuracy obtained by the fusion of the dual-band radar sensors is higher than that obtained by using only a single radar sensor.
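A compact sketch of the fusion pipeline (PCA on each band's micro-Doppler spectrograms, concatenation of the two feature sets, then an SVM) is given below with scikit-learn as assumed tooling; the spectrogram dimensions, component counts, and random stand-in data are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 60
# Stand-ins for flattened micro-Doppler spectrograms (freq x time unrolled).
X_kband = rng.standard_normal((n, 64 * 64))
X_xband = rng.standard_normal((n, 64 * 64))
y = rng.integers(0, 3, size=n)             # three drone classes (placeholder labels)

# PCA per band, then feature-level fusion by concatenation.
pca_k = PCA(n_components=20).fit(X_kband)
pca_x = PCA(n_components=20).fit(X_xband)
X_fused = np.hstack([pca_k.transform(X_kband), pca_x.transform(X_xband)])

clf = SVC(kernel="rbf").fit(X_fused, y)
print(clf.score(X_fused, y))
```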

Journal ArticleDOI
TL;DR: This work proposes to extract the local binary pattern (LBP) from the logarithm of the Gammatone-like spectrogram, and proposes two projection-based LBP features to better capture the texture information of the spectrogram.
Abstract: Sound-event classification often utilizes time–frequency analysis, which produces an image-like spectrogram. Recent approaches such as spectrogram image features and subband power distribution image features extract the image local statistics such as mean and variance from the spectrogram. They have demonstrated good performance. However, we argue that such simple image statistics cannot well capture the complex texture details of the spectrogram. Thus, we propose to extract the local binary pattern (LBP) from the logarithm of the Gammatone-like spectrogram. However, the LBP feature is sensitive to noise. After analyzing the spectrograms of sound events and the audio noise, we find that the magnitude of pixel differences, which is discarded by the LBP feature, carries important information for sound-event classification. We thus propose a multichannel LBP feature via pixel difference quantization to improve the robustness to the audio noise. In view of the differences between spectrograms and natural images, and the reliability issues of LBP features, we propose two projection-based LBP features to better capture the texture information of the spectrogram. To validate the proposed multichannel projection-based LBP features for robot hearing, we have built a new sound-event classification database, the NTU-SEC database, in the context of social interaction between human and robot. It is publicly available to promote research on sound-event classification in a social context. The proposed approaches are compared with the state of the art on the RWCP database and the NTU-SEC database. They consistently demonstrate superior performance under various noise conditions.
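The basic LBP-on-spectrogram step (before the paper's multichannel and projection-based extensions) can be sketched with scikit-image as below; the spectrogram here is a plain log-STFT rather than the Gammatone-like spectrogram used in the paper, and all parameters and the stand-in signal are assumptions.

```python
import numpy as np
from scipy.signal import spectrogram
from skimage.feature import local_binary_pattern

def lbp_spectrogram_features(x, fs, P=8, R=1):
    """Histogram of local binary patterns computed on a log spectrogram."""
    _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    log_spec = np.log(S + 1e-10)
    lbp = local_binary_pattern(log_spec, P=P, R=R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist     # fixed-length texture descriptor for the sound event

fs = 16000
x = np.random.randn(fs)                   # stand-in one-second sound event
print(lbp_spectrogram_features(x, fs))
```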

Journal ArticleDOI
TL;DR: The outcomes of this research enhance the understanding of knowledge-based classifiers for authentication as well as the Gauss-Newton based optimization for vectorial inputs of spectrogram analysis.
Abstract: This paper deals with a novel frequency-based authentication method and a Gauss-Newton based Neural Network classifier. The purpose of this research is to provide the foundations of frequency-based authentication to enhance keystroke authentication protocols. We present the short-time Fourier transform to analyze the training signal of keystrokes. We also analyze the spectrograms to discriminate various signals. The EER of the proposed feature extraction and classification method is found to be 4.1%. Keystroke recognition is a branch of biometrics that is designed to strengthen regular passwords through inter-key times to protect the password owner from fraud attacks. The signals of keystrokes are usually evaluated only in the time domain, since the applied systems collect and analyze only the time values. In addition to these kinds of algorithms, we introduce the extraction of a novel frequency feature and a keystroke authentication system whose classifier operates in the frequency domain. The frequency extraction is a new approach that will enhance authentication protocols and shed light on keystroke authentication by providing a hidden security level. Above all, instead of inter-key times, the exact key press times are extracted and binarized in the time domain. Subsequently, the spectrograms are generated by a regular short-time Fourier transform with an optimized window size. Since the spectrograms include both frequency and time data, represented as images, low frequencies under a threshold are erased and the high frequencies are collected in bins after digitization. Consequently, the average bin values are used as inputs to train the Gauss-Newton based Neural Network classifier to validate the attempts. The results are highly promising: we obtained a 4.1% Equal Error Rate (EER) after 60 real attempts by the password owner and 60 fraud attacks from 12 different users. The outcomes of this research enhance our understanding of knowledge-based classifiers for authentication as well as of Gauss-Newton based optimization for vectorial inputs of spectrogram analysis.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: In this article, the authors employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source to estimate time-frequency masks from an observed mixture magnitude spectrum.
Abstract: The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides results comparable to those of deep learning based methods which operate over more complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method improves the signal-to-distortion ratio by an average of 3.8 dB.
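A minimal sketch of direct mask learning, a recurrent network that maps the mixture magnitude spectrogram to a sigmoid time-frequency mask and is trained against the target source's magnitude spectrogram only, is given below in PyTorch. The GRU architecture, sizes, and mean-squared-error loss are assumptions, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """GRU that outputs a time-frequency mask in [0, 1] for each mixture frame."""
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag):              # (batch, time, freq) mixture magnitudes
        h, _ = self.rnn(mix_mag)
        return self.out(h)                   # (batch, time, freq) mask

model = MaskEstimator()
mix = torch.rand(2, 100, 513)                # stand-in mixture magnitude spectrogram
voice = torch.rand(2, 100, 513)              # target (singing voice) magnitudes
mask = model(mix)
# Train the mask so that mask * mixture approximates the target source magnitudes,
# i.e. prior knowledge is used only for the target's magnitude spectrum.
loss = torch.mean((mask * mix - voice) ** 2)
loss.backward()
```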