
Showing papers on "Spectrogram published in 2018"


Proceedings ArticleDOI
15 Apr 2018
TL;DR: Tacotron 2, a neural network architecture for speech synthesis directly from text, is described: it is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.
Abstract: This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and $F_{0}$ features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
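
As a rough illustration of the intermediate representation described above, the sketch below computes a log mel-scale spectrogram from a waveform with librosa; the frame, hop, and mel-band settings (and the input file name) are illustrative assumptions, not Tacotron 2's exact configuration.

```python
import librosa
import numpy as np

# "speech.wav" is a placeholder path for any mono recording
wav, sr = librosa.load("speech.wav", sr=22050)

# Assumed analysis settings; the paper's exact frame/hop/mel choices may differ
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)   # log-compressed, shape (80, frames)
print(log_mel.shape)
```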

2,039 citations


Journal ArticleDOI
TL;DR: A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation that significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures.
Abstract: Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications.
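
The sketch below is a minimal PyTorch rendering of the kind of stacked 1-D dilated, depthwise-separable convolutional block used in the TCN described above; channel sizes, kernel width, and normalisation are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """One residual block: 1x1 expansion -> depthwise dilated conv -> 1x1 projection."""
    def __init__(self, channels=128, hidden=256, kernel=3, dilation=1):
        super().__init__()
        pad = (kernel - 1) // 2 * dilation          # keep the time length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),          # 1x1 bottleneck expansion
            nn.PReLU(),
            nn.GroupNorm(1, hidden),
            nn.Conv1d(hidden, hidden, kernel, padding=pad,
                      dilation=dilation, groups=hidden),  # depthwise dilated conv
            nn.PReLU(),
            nn.GroupNorm(1, hidden),
            nn.Conv1d(hidden, channels, 1),          # 1x1 projection back
        )

    def forward(self, x):                            # x: (batch, channels, time)
        return x + self.net(x)                       # residual connection

# Stacking blocks with exponentially growing dilation widens the receptive field
tcn = nn.Sequential(*[DilatedBlock(dilation=2 ** d) for d in range(8)])
print(tcn(torch.randn(1, 128, 1000)).shape)          # torch.Size([1, 128, 1000])
```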

1,061 citations


Proceedings ArticleDOI
15 Apr 2018
TL;DR: The proposed model adaptation retains WaveNet's powerful acoustic modeling capabilities, while significantly reducing its time complexity by eliminating its autoregressive nature.
Abstract: Most speech processing techniques use magnitude spectrograms as a front-end and therefore discard part of the signal by default: the phase. In order to overcome this limitation, we propose an end-to-end learning method for speech denoising based on WaveNet. The proposed model adaptation retains WaveNet's powerful acoustic modeling capabilities, while significantly reducing its time complexity by eliminating its autoregressive nature. Specifically, the model makes use of non-causal, dilated convolutions and predicts target fields instead of a single target sample. The discriminative adaptation of the model we propose learns in a supervised fashion by minimizing a regression loss. These modifications make the model highly parallelizable during both training and inference. Both quantitative and qualitative evaluations indicate that the proposed method is preferred over Wiener filtering, a common method based on processing the magnitude spectrogram.
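
The sketch below, assuming PyTorch, only contrasts causal and non-causal dilated 1-D convolutions to make the abstract's point concrete: the non-causal variant pads symmetrically so each output sample also sees future context. It is not the authors' denoising model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 1, 16000)                       # (batch, channels, samples)
conv = nn.Conv1d(1, 1, kernel_size=3, dilation=4, bias=False)

# Causal (WaveNet-style): pad (kernel-1)*dilation = 8 samples on the left only,
# so each output depends on past samples
causal = conv(F.pad(x, (8, 0)))

# Non-causal: pad both sides equally, so outputs also depend on future samples
noncausal = conv(F.pad(x, (4, 4)))

print(causal.shape, noncausal.shape)               # both keep the 16000-sample length
```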

387 citations


Proceedings Article
01 Jun 2018
TL;DR: In this article, a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training, is presented.
Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
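
A minimal numpy sketch of the conditioning step described above: the fixed-dimensional speaker embedding is broadcast along time and concatenated with the synthesizer's encoder outputs. The dimensions here are assumptions for illustration.

```python
import numpy as np

encoder_out = np.random.randn(120, 512)   # (text encoder time steps, encoder dim), assumed sizes
speaker_emb = np.random.randn(256)        # fixed-dimensional embedding from the speaker encoder

tiled = np.tile(speaker_emb, (encoder_out.shape[0], 1))       # repeat along time: (120, 256)
conditioned = np.concatenate([encoder_out, tiled], axis=1)    # (120, 768) fed to the decoder
print(conditioned.shape)
```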

258 citations


Posted Content
TL;DR: In this paper, a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training, is presented.
Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

224 citations


Proceedings ArticleDOI
08 Jun 2018
TL;DR: The Wave-U-Net, as discussed by the authors, is an adaptation of the U-Net to the one-dimensional time domain that repeatedly resamples feature maps to compute and combine features at different time scales.
Abstract: Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
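
A minimal PyTorch sketch of the idea behind the output layer that enforces source additivity, as mentioned above: only K-1 sources are predicted and the last one is defined as the mixture minus their sum, so the estimates always add up to the input. Shapes are illustrative assumptions.

```python
import torch

def additive_output(mixture, estimates):
    """mixture: (batch, 1, time); estimates: (batch, K-1, time) from the network."""
    last = mixture - estimates.sum(dim=1, keepdim=True)    # implied remaining source
    return torch.cat([estimates, last], dim=1)             # (batch, K, time)

mix = torch.randn(2, 1, 16384)
vocals = torch.tanh(torch.randn(2, 1, 16384))              # stand-in for a predicted source
sources = additive_output(mix, vocals)                     # vocals + implied accompaniment
assert torch.allclose(sources.sum(dim=1, keepdim=True), mix, atol=1e-6)
```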

202 citations


Posted Content
TL;DR: In this paper, the voice of a target speaker is separated from multi-speaker signals by training two networks: a speaker recognition network that produces speaker-discriminative embeddings, and a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask.
Abstract: In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals, by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) A speaker recognition network that produces speaker-discriminative embeddings; (2) A spectrogram masking network that takes both noisy spectrogram and speaker embedding as input, and produces a mask. Our system significantly reduces the speech recognition WER on multi-speaker signals, with minimal WER degradation on single-speaker signals.
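
A minimal scipy/numpy sketch of the masking step described above: a mask (here a random placeholder standing in for the network output) is applied to the noisy magnitude spectrogram, and the noisy phase is reused to resynthesise the waveform. STFT settings are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
noisy = np.random.randn(fs * 3)                        # stand-in for 3 s of noisy audio
_, _, spec = stft(noisy, fs=fs, nperseg=512, noverlap=384)

mask = np.random.rand(*spec.shape)                     # placeholder for the masking network
enhanced_mag = mask * np.abs(spec)                     # mask the magnitude only
enhanced_spec = enhanced_mag * np.exp(1j * np.angle(spec))   # keep the noisy phase

_, enhanced = istft(enhanced_spec, fs=fs, nperseg=512, noverlap=384)
```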

197 citations


Proceedings ArticleDOI
01 Sep 2018
TL;DR: In this paper, a deep neural network was proposed to estimate the directions of arrival (DOA) of multiple sound sources in anechoic, matched and unmatched reverberant conditions.
Abstract: This paper proposes a deep neural network for estimating the directions of arrival (DOA) of multiple sound sources. The proposed stacked convolutional and recurrent neural network (DOAnet) generates a spatial pseudo-spectrum (SPS) along with the DOA estimates in both azimuth and elevation. We avoid any explicit feature extraction step by using the magnitudes and phases of the spectrograms of all the channels as input to the network. The proposed DOAnet is evaluated by estimating the DOAs of multiple concurrently present sources in anechoic, matched and unmatched reverberant conditions. The results show that the proposed DOAnet is capable of estimating the number of sources and their respective DOAs with good precision, and generates SPS with a high signal-to-noise ratio.
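
A minimal scipy/numpy sketch of the input feature described above: the magnitude and phase spectrograms of every channel are stacked into a single tensor, with no further hand-crafted features. Channel count and STFT settings are assumptions.

```python
import numpy as np
from scipy.signal import stft

fs, channels, seconds = 44100, 4, 2
audio = np.random.randn(channels, fs * seconds)        # stand-in for a 4-channel recording

_, _, specs = stft(audio, fs=fs, nperseg=2048, noverlap=1024)        # (4, freq, frames)
features = np.concatenate([np.abs(specs), np.angle(specs)], axis=0)  # (8, freq, frames)
print(features.shape)
```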

191 citations


Posted Content
TL;DR: The Wave-U-Net is proposed, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales and indicates that its architecture yields a performance comparable to a state-of-the-art spectrogram-based U- net architecture, given the same data.
Abstract: Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.

164 citations


Proceedings ArticleDOI
02 Sep 2018
TL;DR: An attention pooling based representation learning method for speech emotion recognition (SER) that applies a deep convolutional neural network directly to spectrograms extracted from speech utterances and outperforms the state-of-the-art method.
Abstract: This paper proposes an attention pooling based representation learning method for speech emotion recognition (SER). The emotional representation is learned in an end-to-end fashion by applying a deep convolutional neural network (CNN) directly to spectrograms extracted from speech utterances. Motivated by the success of GoogLeNet, two groups of filters with different shapes are designed to capture both temporal and frequency domain context information from the input spectrogram. The learned features are concatenated and fed into the subsequent convolutional layers. To learn the final emotional representation, a novel attention pooling method is further proposed. Compared with existing pooling methods, such as max-pooling and average-pooling, the proposed attention pooling can effectively incorporate class-agnostic bottom-up, and class-specific top-down, attention maps. We conduct extensive evaluations on the benchmark IEMOCAP data to assess the effectiveness of the proposed representation. Results demonstrate a recognition performance of 71.8% weighted accuracy (WA) and 68% unweighted accuracy (UA) over four emotions, which outperforms the state-of-the-art method by about 3% absolute for WA and 4% for UA.
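
A minimal PyTorch sketch of attention pooling over a CNN feature map: a learned score per time-frequency position is softmax-normalised and used to weight the features before summing. This is a simplified single-map version; the paper combines class-agnostic bottom-up and class-specific top-down maps, and all sizes here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)     # one score per position

    def forward(self, feats):                                  # feats: (batch, C, freq, time)
        b, c, f, t = feats.shape
        scores = self.attn(feats).view(b, 1, f * t)
        weights = F.softmax(scores, dim=-1)                    # attention over all positions
        pooled = (feats.view(b, c, f * t) * weights).sum(dim=-1)
        return pooled                                          # (batch, C) utterance-level vector

pooled = AttentionPool()(torch.randn(8, 128, 32, 150))
print(pooled.shape)                                            # torch.Size([8, 128])
```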

142 citations


Proceedings ArticleDOI
02 Sep 2018
TL;DR: A phoneme and spectrogram combined CNN model proved to be most accurate in recognizing emotions on IEMOCAP data and achieved more than 4% increase in overall accuracy and average class accuracy as compared to the existing state-of-the-art methods.
Abstract: This paper proposes a speech emotion recognition method based on phoneme sequence and spectrogram. Both phoneme sequence and spectrogram retain emotion content of speech that is lost if the speech is converted into text. We performed various experiments with different kinds of deep neural networks with phoneme and spectrogram as inputs. Three of those network architectures are presented here that helped to achieve better accuracy when compared to the state-of-the-art methods on the benchmark dataset. A phoneme and spectrogram combined CNN model proved to be most accurate in recognizing emotions on IEMOCAP data. We achieved more than a 4% increase in overall accuracy and average class accuracy as compared to the existing state-of-the-art methods.

Journal ArticleDOI
TL;DR: This paper proposes a combination of hand-crafted and deep-learned features which can effectively measure the severity of depression from speech and proposes joint fine-tuning layers to combine the raw and spectrogram DCNN to boost the depression recognition performance.

Proceedings ArticleDOI
15 Apr 2018
TL;DR: This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech that outperformed the conventional DNN-based method in unseen noisy environments.
Abstract: This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech. A standard approach to speech enhancement is to train a deep neural network (DNN) to take noisy speech as input and output clean speech. Although this supervised approach requires a very large amount of paired data for training, it is still not robust against unknown environments. Another approach is to use non-negative matrix factorization (NMF) based on basis spectra trained on clean speech in advance and those adapted to noise on the fly. This semi-supervised approach, however, causes considerable signal distortion in enhanced speech due to the unrealistic assumption that speech spectrograms are linear combinations of the basis spectra. Replacing the poor linear generative model of clean speech in NMF with a VAE (a powerful nonlinear deep generative model) trained on clean speech, we formulate a unified probabilistic generative model of noisy speech. Given noisy speech as observed data, we can sample clean speech from its posterior distribution. The proposed method outperformed the conventional DNN-based method in unseen noisy environments.

Journal ArticleDOI
TL;DR: A CNN architecture which learns representations using sample-level filters beyond typical frame-level input representations is proposed and extended using multi-level and multi-scale feature aggregation technique and subsequently conduct transfer learning for several music classification tasks.
Abstract: Convolutional Neural Networks (CNN) have been applied to diverse machine learning tasks for different modalities of raw data in an end-to-end fashion. In the audio domain, a raw waveform-based approach has been explored to directly learn hierarchical characteristics of audio. However, the majority of previous studies have limited their model capacity by taking a frame-level structure similar to short-time Fourier transforms. We previously proposed a CNN architecture which learns representations using sample-level filters beyond typical frame-level input representations. The architecture showed comparable performance to the spectrogram-based CNN model in music auto-tagging. In this paper, we extend the previous work in three ways. First, considering the sample-level model requires much longer training time, we progressively downsample the input signals and examine how it affects the performance. Second, we extend the model using multi-level and multi-scale feature aggregation technique and subsequently conduct transfer learning for several music classification tasks. Finally, we visualize filters learned by the sample-level CNN in each layer to identify hierarchically learned features and show that they are sensitive to log-scaled frequency.
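
A minimal scipy sketch of the progressive downsampling experiment mentioned above: the raw waveform is resampled to successively lower rates before being fed to the sample-level model. The rates are illustrative assumptions.

```python
import numpy as np
from scipy.signal import resample_poly

sr = 22050
wav = np.random.randn(sr * 5)                    # stand-in for 5 s of raw audio

for divisor in (1, 2, 4, 8):                     # 22050, 11025, ~5512, ~2756 Hz
    down = resample_poly(wav, up=1, down=divisor)
    print(sr // divisor, len(down))
```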

Journal ArticleDOI
TL;DR: In this paper, the authors presented an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream, by transforming the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms.
Abstract: Spectrogram representations of acoustic scenes have achieved competitive performance for acoustic scene classification. Yet, the spectrogram alone does not take into account a substantial amount of time-frequency information. In this study, we present an approach for exploring the benefits of deep scalogram representations, extracted in segments from an audio stream. The approach presented firstly transforms the segmented acoustic scenes into bump and Morse scalograms, as well as spectrograms; secondly, the spectrograms or scalograms are sent into pre-trained convolutional neural networks; thirdly, the features extracted from a subsequent fully connected layer are fed into (bidirectional) gated recurrent neural networks, which are followed by a single highway layer and a softmax layer; finally, predictions from these three systems are fused by a margin sampling value strategy. We then evaluate the proposed approach using the acoustic scene classification data set of the 2017 IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE). On the evaluation set, an accuracy of 64.0% from bidirectional gated recurrent neural networks is obtained when fusing the spectrogram and the bump scalogram, which is an improvement on the 61.0% baseline result provided by the DCASE 2017 organisers. This result shows that extracted bump scalograms are capable of improving the classification accuracy when fused with a spectrogram-based system.
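
The sketch below (numpy) shows one plausible reading of the margin sampling value fusion mentioned above: each system's confidence margin is the gap between its two largest class probabilities, and the prediction of the most confident system is kept per segment. This is an interpretation for illustration, not the authors' exact rule.

```python
import numpy as np

def msv_fuse(prob_list):
    """prob_list: list of (n_segments, n_classes) probability arrays, one per system."""
    probs = np.stack(prob_list)                          # (n_systems, n_segments, n_classes)
    top2 = np.sort(probs, axis=-1)[..., -2:]             # two largest probabilities
    margins = top2[..., 1] - top2[..., 0]                # (n_systems, n_segments)
    best_system = margins.argmax(axis=0)                 # most confident system per segment
    seg_idx = np.arange(probs.shape[1])
    return probs[best_system, seg_idx].argmax(axis=-1)   # fused class labels

fused = msv_fuse([np.random.dirichlet(np.ones(15), size=10) for _ in range(3)])
print(fused)
```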

Journal ArticleDOI
TL;DR: A fast implementation of radar algorithms for detection, tracking, and micro-Doppler extraction is proposed in conjunction with the automotive radar transceiver TEF810X and microcontroller unit SR32R274 manufactured by NXP Semiconductors.
Abstract: In this work, the authors present results for classification of different classes of targets (car, single and multiple people, bicycle) using automotive radar data and different neural networks. A fast implementation of radar algorithms for detection, tracking, and micro-Doppler extraction is proposed in conjunction with the automotive radar transceiver TEF810X and microcontroller unit SR32R274 manufactured by NXP Semiconductors. Three different types of neural networks are considered, namely a classic convolutional network, a residual network, and a combination of convolutional and recurrent network, for different classification problems across the four classes of targets recorded. Considerable accuracy (close to 100% in some cases) and low latency of the radar pre-processing prior to classification (∼0.55 s to produce a 0.5 s long spectrogram) are demonstrated in this study, and possible shortcomings and outstanding issues are discussed.

Posted Content
20 Sep 2018
TL;DR: TasNet, as discussed by the authors, uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers; extraction is achieved by applying a weighting function (mask) to the encoder output.
Abstract: Robust speech processing in multitalker acoustic environments requires automatic speech separation. While single-channel, speaker-independent speech separation methods have recently seen great progress, the accuracy, latency, and computational cost of speech separation remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of spectrogram representations for speech separation, and the long latency in calculating the spectrogram. To address these shortcomings, we propose the time-domain audio separation network (TasNet), which is a deep learning autoencoder framework for time-domain speech separation. TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. The masks are found using a temporal convolutional network consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward actualizing speech separation for real-world speech processing technologies.
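
A minimal PyTorch sketch of the encoder / mask / decoder structure described above: a learned 1-D convolutional encoder replaces the STFT, masks weight its output per speaker, and a transposed convolution maps the masked representations back to waveforms. The mask network is replaced by a random placeholder and all sizes are assumptions.

```python
import torch
import torch.nn as nn

N, L, speakers = 256, 20, 2                       # basis signals, filter length, sources
encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

mixture = torch.randn(1, 1, 16000)
rep = torch.relu(encoder(mixture))                # (1, N, frames)

# Placeholder masks; in TasNet they come from a separation (TCN) network
masks = torch.softmax(torch.randn(speakers, 1, N, rep.shape[-1]), dim=0)

separated = [decoder(rep * masks[s]) for s in range(speakers)]   # back to waveforms
print(separated[0].shape)                         # torch.Size([1, 1, 16000])
```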

Proceedings ArticleDOI
02 Sep 2018
TL;DR: This work proposes a specially designed neural network structure that accepts variable-length speech sentences directly as input and outperforms the fixed-length neural network on both weighted accuracy (WA) and unweighted accuracy (UA).
Abstract: In this work, an approach to emotion recognition is proposed for variable-length speech segments by applying a deep neural network to spectrograms directly. The spectrogram carries comprehensive paralinguistic information that is useful for emotion recognition. We extract such information from spectrograms and accomplish the emotion recognition task by combining Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs). To handle the variable-length speech segments, we propose a specially designed neural network structure that accepts variable-length speech sentences directly as input. Compared to the traditional methods that split the sentence into smaller fixed-length segments, our method avoids the accuracy degradation introduced by the speech segmentation process. We evaluated the emotion recognition model on the IEMOCAP dataset over four emotions. Experimental results demonstrate that the proposed method outperforms the fixed-length neural network on both weighted accuracy (WA) and unweighted accuracy (UA).
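
A minimal PyTorch sketch of one way to accept variable-length spectrograms directly, in the spirit of the abstract above: convolutional features are turned into a time sequence, an RNN processes it, and mean-pooling over time yields a fixed-size vector regardless of utterance length. Layer sizes and the four-class output are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

conv = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)))
rnn = nn.GRU(input_size=16 * 64, hidden_size=128, batch_first=True)
classifier = nn.Linear(128, 4)                       # e.g. four emotion classes

def predict(spectrogram):                            # (1, 1, 128 mel bins, n_frames)
    feats = conv(spectrogram)                        # (1, 16, 64, n_frames // 2)
    b, c, f, t = feats.shape
    seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)   # time-major sequence
    out, _ = rnn(seq)
    return classifier(out.mean(dim=1))               # pool over time -> fixed-size logits

for frames in (120, 300, 751):                       # different utterance lengths
    print(predict(torch.randn(1, 1, 128, frames)).shape)   # always torch.Size([1, 4])
```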

Proceedings ArticleDOI
01 Nov 2018
TL;DR: In this article, a novel attention-based fully convolutional network for speech emotion recognition was proposed, which is able to handle variable-length speech, free of the demand of segmentation to keep critical information not lost.
Abstract: Speech emotion recognition is a challenging task for three main reasons: 1) human emotion is abstract, which means it is hard to distinguish; 2) in general, human emotion can only be detected in some specific moments during a long utterance; 3) speech data with emotional labeling is usually limited. In this paper, we present a novel attention-based fully convolutional network for speech emotion recognition. We employ a fully convolutional network because it can handle variable-length speech without requiring segmentation, so that critical information is not lost. The proposed attention mechanism makes our model aware of which time-frequency regions of the speech spectrogram are more emotion-relevant. Considering the limited data, transfer learning is also adopted to improve the accuracy; notably, an obvious improvement is obtained with a pre-trained model based on natural scene images. Validated on the publicly available IEMOCAP corpus, the proposed model outperformed the state-of-the-art methods with a weighted accuracy of 70.4% and an unweighted accuracy of 63.9%, respectively.

Proceedings ArticleDOI
22 Mar 2018
TL;DR: In this article, a generative adversarial network (GAN) was used to enhance the speech signal represented by short-time Fourier transform (STFT) images, which yields a more intuitive cost function for training.
Abstract: Speech dereverberation using a single microphone is addressed in this paper. Motivated by the recent success of fully convolutional networks (FCN) in many image processing applications, we investigate their applicability to enhance the speech signal represented by short-time Fourier transform (STFT) images. We present two variations: a “U-Net”, which is an encoder-decoder network with skip connections, and a generative adversarial network (GAN) with the U-Net as generator, which yields a more intuitive cost function for training. To evaluate our method we used the data from the REVERB challenge, and compared our results to other methods under the same conditions. We have found that our method outperforms the competing methods in most cases.

Journal ArticleDOI
TL;DR: It is demonstrated that MCNN constitutes a very promising approach for high-quality speech synthesis, without any iterative algorithms or autoregression in computations.
Abstract: We propose the multi-head convolutional neural network (MCNN) architecture for waveform synthesis from spectrograms. Nonlinear interpolation in MCNN is employed with transposed convolution layers in parallel heads. MCNN achieves more than an order of magnitude higher compute intensity than commonly-used iterative algorithms like Griffin-Lim, yielding efficient utilization for modern multi-core processors, and very fast (more than 300x real-time) waveform synthesis. For training of MCNN, we use a large-scale speech recognition dataset and losses defined on waveforms that are related to perceptual audio quality. We demonstrate that MCNN constitutes a very promising approach for high-quality speech synthesis, without any iterative algorithms or autoregression in computations.
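
For context, the sketch below (librosa) runs the Griffin-Lim baseline that MCNN is compared against: an iterative phase-reconstruction algorithm applied to a magnitude spectrogram. It illustrates the baseline only, not MCNN itself; the STFT settings and test signal are assumptions.

```python
import numpy as np
import librosa

sr = 22050
wav = librosa.chirp(fmin=200, fmax=4000, sr=sr, duration=2.0)   # simple test signal

mag = np.abs(librosa.stft(wav, n_fft=1024, hop_length=256))     # magnitude only, phase discarded
rebuilt = librosa.griffinlim(mag, n_iter=60, hop_length=256)    # iterative reconstruction
print(rebuilt.shape)
```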

Journal ArticleDOI
TL;DR: After converting highly overlapped spectrograms into linear quantized images and reducing dimensions by applying various image resizing methods, feature extraction and classification are performed with convolutional neural networks (CNN), which have very high performance in image classification.

Journal ArticleDOI
TL;DR: The proposed multi-scale SR spectrogram (MSSRS) is able to well deal with the non-stationary transient signal, and can highlight the defect-induced frequency component corresponding to the impulse information.

Proceedings ArticleDOI
TL;DR: In this paper, a novel attention-based fully convolutional network for speech emotion recognition was proposed, which is able to handle variable-length speech, free of the demand of segmentation to keep critical information not lost.
Abstract: Speech emotion recognition is a challenging task for three main reasons: 1) human emotion is abstract, which means it is hard to distinguish; 2) in general, human emotion can only be detected in some specific moments during a long utterance; 3) speech data with emotional labeling is usually limited. In this paper, we present a novel attention-based fully convolutional network for speech emotion recognition. We employ a fully convolutional network because it can handle variable-length speech without requiring segmentation, so that critical information is not lost. The proposed attention mechanism makes our model aware of which time-frequency regions of the speech spectrogram are more emotion-relevant. Considering the limited data, transfer learning is also adopted to improve the accuracy; notably, an obvious improvement is obtained with a pre-trained model based on natural scene images. Validated on the publicly available IEMOCAP corpus, the proposed model outperformed the state-of-the-art methods with a weighted accuracy of 70.4% and an unweighted accuracy of 63.9%, respectively.

Journal ArticleDOI
Chengjin Xu1, Junjun Guan, Ming Bao, Jiangang Lu1, Wei Ye1 
TL;DR: Experiments show that after using this implementation of time-frequency analysis and a convolutional neural network to process 4000 vibration signal samples generated by four different vibration events, the recognition rates of vibration events are over 90%.
Abstract: Based on vibration signals detected by a phase-sensitive optical time-domain reflectometer distributed optical fiber sensing system, this paper presents an implementation of time-frequency analysis and a convolutional neural network (CNN) used to classify different types of vibration events. First, spectral subtraction and the short-time Fourier transform are used to enhance time-frequency features of vibration signals and transform different types of vibration signals into spectrograms, which are input to the CNN for automatic feature extraction and classification. Finally, by replacing the soft-max layer in the CNN with a multiclass support vector machine, the performance of the classifier is enhanced. Experiments show that after using this method to process 4000 vibration signal samples generated by four different vibration events, namely, digging, walking, vehicles passing, and damaging, the recognition rates of vibration events are over 90%. The experimental results prove that this method can automatically make an effective feature selection and greatly improve the classification accuracy of vibration events in distributed optical fiber sensing systems.
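
A minimal scipy/numpy sketch of the pre-processing described above: a noise magnitude estimate is subtracted from the short-time spectrum and the clipped result becomes the spectrogram image fed to the CNN. The sampling rate, STFT settings, and the noise-only assumption for the first frames are all illustrative.

```python
import numpy as np
from scipy.signal import stft

fs = 10000
signal = np.random.randn(fs * 2)                       # stand-in for a vibration record

_, _, spec = stft(signal, fs=fs, nperseg=256, noverlap=192)
mag = np.abs(spec)

noise_floor = mag[:, :10].mean(axis=1, keepdims=True)  # assume the first frames are noise-only
clean_mag = np.maximum(mag - noise_floor, 0.0)          # spectral subtraction, clipped at zero

spectrogram_image = 20 * np.log10(clean_mag + 1e-8)     # log-magnitude image for the CNN
```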

Posted Content
TL;DR: This study compares the performance of two classes of models: a deep learning approach wherein a CNN model is trained end-to-end to predict the genre label of an audio signal solely from its spectrogram, and an approach based on hand-crafted time- and frequency-domain features used to train traditional machine learning classifiers.
Abstract: Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end to predict the genre label of an audio signal solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this multi-class classification task are identified. The experiments are conducted on the Audio Set dataset and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.
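
A minimal librosa sketch of the kind of hand-crafted time- and frequency-domain features mentioned above; the exact feature set used in the paper is not reproduced here, and the input path is a placeholder.

```python
import numpy as np
import librosa

wav, sr = librosa.load("track.wav", sr=22050)            # placeholder audio file

features = np.hstack([
    librosa.feature.zero_crossing_rate(wav).mean(),             # time domain
    librosa.feature.rms(y=wav).mean(),                          # time domain
    librosa.feature.spectral_centroid(y=wav, sr=sr).mean(),     # frequency domain
    librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).mean(axis=1)  # frequency domain
])
print(features.shape)                                           # one feature vector per track
```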

Journal ArticleDOI
TL;DR: A novel prognostic model is proposed for predicting fetal hypoxia from CTG traces based on an innovative approach called image-based time-frequency (IBTF) analysis comprised of a combination of short time Fourier transform (STFT) and gray level co-occurrence matrix (GLCM).

Proceedings ArticleDOI
15 Apr 2018
TL;DR: In this paper, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model.
Abstract: Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice, by passing the silent video frames through a video-to-speech neural network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of such possible mixtures is huge, and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our method attains significant SDR and PESQ improvements over the raw video-to-speech predictions and a well-known audio-only method.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: An auto-encoder neural network is developed that can act as an equivalent to short-time front-end transforms, demonstrating the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal.
Abstract: Source separation and other audio applications have traditionally relied on the use of short-time Fourier transforms as a front-end frequency domain representation step. The unavailability of a neural network equivalent to forward and inverse transforms hinders the implementation of end-to-end learning systems for these applications. We develop an auto-encoder neural network that can act as an equivalent to short-time front-end transforms. We demonstrate the ability of the network to learn optimal, real-valued basis functions directly from the raw waveform of a signal and further show how it can be used as an adaptive front-end for supervised source separation. In terms of separation performance, these transforms significantly outperform their Fourier counterparts. Finally, we also propose and interpret a novel source to distortion ratio based cost function for end-to-end source separation.

Journal ArticleDOI
TL;DR: The results have shown that texture analysis methods can be used for speech emotion recognition, and the combined use of both methods increased the success rate.