
Showing papers on "Spectrogram published in 2020"


Posted Content
TL;DR: DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity across various automatic and human evaluations.
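To make the synthesis procedure concrete, here is a minimal, generic sketch of a DDPM-style reverse (denoising) Markov chain over a waveform in PyTorch. The step count, noise schedule, and placeholder noise-prediction network are illustrative assumptions, not DiffWave's actual configuration (which conditions the network on a mel spectrogram).

# Minimal sketch of the reverse (denoising) Markov chain used by diffusion
# waveform models. The noise-prediction network is a placeholder; the real
# DiffWave model conditions on a mel spectrogram.
import torch

T = 50                                   # assumed constant number of synthesis steps
betas = torch.linspace(1e-4, 0.05, T)    # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class DummyEpsNet(torch.nn.Module):
    """Stand-in for the conditional noise-prediction network (ignores t)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)
    def forward(self, x, t):
        return self.net(x)

@torch.no_grad()
def sample(eps_net, length=16000):
    x = torch.randn(1, 1, length)                     # start from white noise
    for t in reversed(range(T)):
        eps = eps_net(x, t)                           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise       # one Markov-chain step
    return x.squeeze()

waveform = sample(DummyEpsNet())
print(waveform.shape)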

459 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: Parallel WaveGAN as discussed by the authors is a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network, which can effectively capture the time-frequency distribution of the realistic speech waveform.
Abstract: We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require the density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real-time in a single-GPU environment. Perceptual listening test results verify that our proposed method achieves a 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparable to the best distillation-based Parallel WaveNet system.
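For illustration, a minimal PyTorch sketch of the multi-resolution spectrogram loss idea mentioned above: STFT magnitudes of the generated and real waveforms are compared at several FFT resolutions using spectral convergence and log-magnitude distances. The resolutions and equal weighting are assumptions, not the paper's exact settings.

# Multi-resolution STFT loss sketch (spectral convergence + log-magnitude).
import torch
import torch.nn.functional as F

RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]  # (n_fft, hop, win), assumed

def stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(fake, real):
    loss = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        mag_f = stft_mag(fake, n_fft, hop, win)
        mag_r = stft_mag(real, n_fft, hop, win)
        sc = torch.norm(mag_r - mag_f, p="fro") / torch.norm(mag_r, p="fro")  # spectral convergence
        log_mag = F.l1_loss(torch.log(mag_f), torch.log(mag_r))               # log-magnitude distance
        loss = loss + sc + log_mag
    return loss / len(RESOLUTIONS)

fake = torch.randn(2, 24000)   # placeholder generated waveforms
real = torch.randn(2, 24000)   # placeholder ground-truth waveforms
print(multi_resolution_stft_loss(fake, real).item())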

409 citations


Journal ArticleDOI
TL;DR: A new architecture is introduced, which extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and uses them as inputs for the one-dimensional Convolutional Neural Network for the identification of emotions using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song, Berlin, and EMO-DB datasets.
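As a rough sketch of the five-feature front end named in the summary above, the following uses librosa to extract MFCCs, chromagram, mel spectrogram, spectral contrast, and Tonnetz features. Averaging each feature over time to build a single 1-D vector is an assumption about how the inputs are arranged for the 1-D CNN; the file name is hypothetical.

# Extract the five audio features with librosa and stack them into one vector.
import numpy as np
import librosa

def extract_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))
    mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])  # 1-D input vector for the CNN

# features = extract_features("03-01-01-01-01-01-01.wav")  # hypothetical RAVDESS file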

251 citations


Posted Content
TL;DR: A new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operations.
Abstract: Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or the speech spectrum via a naive convolutional neural network (CNN) or recurrent neural network (RNN). Some recent studies use the complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or the real and imaginary part, respectively. In particular, the convolutional recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper we design a new network structure simulating complex-valued operations, called the Deep Complex Convolution Recurrent Network (DCCRN), in which both the CNN and RNN structures can handle complex-valued operations. The proposed DCCRN models are very competitive with other previous networks on both objective and subjective metrics. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first in the real-time track and second in the non-real-time track in terms of Mean Opinion Score (MOS).
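A minimal sketch of the "complex-valued operation" idea: a convolution over a complex spectrogram implemented with two real-valued convolutions, so the real and imaginary parts interact as in complex multiplication. This is a generic illustration, not the DCCRN architecture itself.

# Complex 2-D convolution over a spectrogram split into real/imaginary parts.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x_r, x_i):
        # (a + ib) * (c + id) = (ac - bd) + i(ad + bc)
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_r(x_i) + self.conv_i(x_r)
        return out_r, out_i

# Complex spectrogram as two tensors of shape (batch, channels, freq, time).
x_r, x_i = torch.randn(1, 1, 257, 100), torch.randn(1, 1, 257, 100)
y_r, y_i = ComplexConv2d(1, 16)(x_r, x_i)
print(y_r.shape, y_i.shape)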

237 citations


Journal ArticleDOI
TL;DR: A gated convolutional recurrent network (GCRN) for complex spectral mapping is proposed, which amounts to a causal system for monaural speech enhancement and yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking.
Abstract: Phase is important for the perceptual quality of speech. However, it seems intractable to directly estimate phase spectra through supervised learning due to their lack of spectrotemporal structure. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances the magnitude and phase responses of speech. Inspired by multi-task learning, we propose a gated convolutional recurrent network (GCRN) for complex spectral mapping, which amounts to a causal system for monaural speech enhancement. Our experimental results suggest that the proposed GCRN substantially outperforms an existing convolutional neural network (CNN) for complex spectral mapping in terms of both objective speech intelligibility and quality. Moreover, the proposed approach yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking. We also find that complex spectral mapping with the proposed GCRN provides an effective phase estimate.

237 citations


Proceedings ArticleDOI
01 Aug 2020
TL;DR: Deep Complex Convolution Recurrent Network (DCCRN) as mentioned in this paper is a new network structure simulating the complex-valued operation, where both convolutional encoder-decoder (CED) and long short-term memory (LSTM) structures can handle complex-valued operations.
Abstract: Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or the speech spectrum via a naive convolutional neural network (CNN) or recurrent neural network (RNN). Some recent studies use the complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or the real and imaginary part, respectively. In particular, the convolutional recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper we design a new network structure simulating complex-valued operations, called the Deep Complex Convolution Recurrent Network (DCCRN), in which both the CNN and RNN structures can handle complex-valued operations. The proposed DCCRN models are very competitive with other previous networks on both objective and subjective metrics. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first in the real-time track and second in the non-real-time track in terms of Mean Opinion Score (MOS).

225 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This paper proposes a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, which has the ability to handle detailed phase patterns and to utilize harmonic patterns, and outperforms previous methods by a large margin on four metrics.
Abstract: Time-frequency (T-F) domain masking is a mainstream approach for single-channel speech enhancement. Recently, attention has also been paid to phase prediction in addition to amplitude prediction. In this paper, we propose a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, for this task. Unlike previous methods which directly use a complex ideal ratio mask to supervise the DNN learning, we design a two-stream network, where the amplitude stream and phase stream are dedicated to amplitude and phase prediction. We discover that the two streams should communicate with each other, and this is crucial to phase prediction. In addition, we propose frequency transformation blocks to capture long-range correlations along the frequency axis. Visualization shows that the learned transformation matrix implicitly captures the harmonic correlation, which has been proven to be helpful for T-F spectrogram reconstruction. With these two innovations, PHASEN acquires the ability to handle detailed phase patterns and to utilize harmonic patterns, achieving a 1.76 dB SDR improvement on the AVSpeech + AudioSet dataset. It also achieves significant gains over Google's network on this dataset. On the Voice Bank + DEMAND dataset, PHASEN outperforms previous methods by a large margin on four metrics.

195 citations


Journal ArticleDOI
TL;DR: A novel framework for SER is introduced using key sequence segment selection based on radial basis function network (RBFN) similarity measurement in clusters to reduce the computational complexity of the overall model, and the CNN features are normalized before their actual processing so that the model can easily recognize the spatio-temporal information.
Abstract: Emotional state recognition of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER plays a significant role in many real-time applications such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers that analyze the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models used to extract high-level features from speech spectrograms to increase the recognition accuracy, at the cost of overall model complexity. In contrast, we introduce a novel framework for SER using key sequence segment selection based on radial basis function network (RBFN) similarity measurement in clusters. The selected sequence is converted into a spectrogram by applying the STFT algorithm and passed to the CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bi-directional long short-term memory (BiLSTM) network to learn the temporal information for recognizing the final state of emotion. In the proposed technique, we process the key segments instead of the whole utterance to reduce the computational complexity of the overall model, and we normalize the CNN features before their actual processing so that the model can easily recognize the spatio-temporal information. The proposed system is evaluated on different standard datasets including IEMOCAP, EMO-DB, and RAVDESS to assess the recognition accuracy and processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated by the experiments, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively, when compared to state-of-the-art SER methods.

190 citations


Posted Content
Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan
TL;DR: This work takes on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.
Abstract: Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision.

111 citations


Journal ArticleDOI
01 Dec 2020
TL;DR: The ICBHI 2017 database, which includes different sampling frequencies, noise, and background sounds, was used for the classification of lung sounds, and the obtained scores were better than previously reported results.
Abstract: Treatment of lung diseases, which are the third most common cause of death in the world, is of great importance in the medical field. Many studies using lung sounds recorded with a stethoscope have been conducted in the literature in order to diagnose lung diseases with artificial intelligence-compatible devices and to assist experts in their diagnosis. In this paper, the ICBHI 2017 database, which includes different sampling frequencies, noise, and background sounds, was used for the classification of lung sounds. The lung sound signals were initially converted to spectrogram images by using a time-frequency method. The short-time Fourier transform (STFT) was used as the time-frequency transformation. Two deep learning based approaches were used for lung sound classification. In the first approach, a pre-trained deep convolutional neural network (CNN) model was used for feature extraction and a support vector machine (SVM) classifier was used for classification of the lung sounds. In the second approach, the pre-trained deep CNN model was fine-tuned (transfer learning) via spectrogram images for lung sound classification. The accuracies of the proposed methods were tested using ten-fold cross-validation. The accuracies for the first and second proposed methods were 65.5% and 63.09%, respectively. The obtained accuracies were then compared with some of the existing results, and the obtained scores were better than those results.
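A rough sketch of the first approach described above: a pre-trained CNN used as a fixed feature extractor on spectrogram images, followed by an SVM. The choice of ResNet-18 as the backbone, the input shape, and the placeholder data are assumptions for illustration; the paper does not commit to this exact setup here.

# Pre-trained CNN features + SVM classifier on spectrogram images.
import torch
import torchvision
import numpy as np
from sklearn.svm import SVC

backbone = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # drop the classifier head, keep 512-d features
backbone.eval()

@torch.no_grad()
def cnn_features(spectrogram_images):    # (N, 3, 224, 224) tensor of spectrogram images
    return backbone(spectrogram_images).numpy()

# Placeholder tensors standing in for ICBHI spectrogram images and labels.
X = cnn_features(torch.randn(8, 3, 224, 224))
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:2]))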

102 citations


Journal ArticleDOI
TL;DR: The performance evaluation illustrates that the proposed DCNN without the max-pooling function (Model-2), using Log-Mel audio feature extraction on the augmented datasets, attains the best accuracy on environmental sound classification problems.

Proceedings ArticleDOI
04 May 2020
TL;DR: Experiments on LJSpeech show that the speech quality of Flow-TTS closely approaches that of human speech and is even better than that of the autoregressive model Tacotron 2.
Abstract: In this work, we propose Flow-TTS, a non-autoregressive end-to-end neural TTS model based on generative flow. Unlike other non-autoregressive models, Flow-TTS can achieve high-quality speech generation by using a single feed-forward network. To our knowledge, Flow-TTS is the first TTS model utilizing flow in the spectrogram generation network and the first non-autoregressive model which jointly learns the alignment and spectrogram generation through a single network. Experiments on LJSpeech show that the speech quality of Flow-TTS closely approaches that of human speech and is even better than that of the autoregressive model Tacotron 2 (outperforming Tacotron 2 by a gap of 0.09 in MOS). Meanwhile, the inference speed of Flow-TTS is about 23 times faster than Tacotron 2, which is comparable to FastSpeech.

Journal ArticleDOI
TL;DR: For the first time, the radar spectrogram is treated as a time sequence with multiple channels and a DL model composed of 1-D convolutional neural networks and long short-term memory (LSTM) is proposed that achieves the best recognition accuracy and relatively low complexity compared to the existing 2D-CNN methods.
Abstract: Many deep learning (DL) models have shown exceptional promise in the radar-based human activity recognition (HAR) area. For radar-based HAR, the raw data is generally converted into a 2-D spectrogram by using the short-time Fourier transform (STFT). All the existing DL methods treat the spectrogram as an optical image, and thus the corresponding architectures such as 2-D convolutional neural networks (2D-CNNs) are adopted in those methods. These 2-D methods, which ignore temporal characteristics, ordinarily lead to a complex network with a huge number of parameters but limited recognition accuracy. In this paper, for the first time, the radar spectrogram is treated as a time sequence with multiple channels. Hence, we propose a DL model composed of 1-D convolutional neural networks (1D-CNNs) and long short-term memory (LSTM). The experimental results show that the proposed model can extract spatio-temporal characteristics of the radar data and thus achieves the best recognition accuracy and relatively low complexity compared to the existing 2D-CNN methods.
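A minimal sketch of the idea described above: the radar spectrogram is treated as a time sequence whose "channels" are frequency bins, 1-D convolutions are applied along time, and an LSTM follows. Layer sizes, bin count, and the number of activity classes are illustrative assumptions.

# 1D-CNN + LSTM over a radar spectrogram viewed as a multi-channel time series.
import torch
import torch.nn as nn

class CNN1DLSTM(nn.Module):
    def __init__(self, n_freq_bins=128, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq_bins, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, spec):                 # spec: (batch, freq_bins, time)
        h = self.conv(spec)                  # (batch, 64, time)
        h, _ = self.lstm(h.transpose(1, 2))  # LSTM over the time axis
        return self.fc(h[:, -1])             # classify from the last time step

logits = CNN1DLSTM()(torch.randn(2, 128, 200))
print(logits.shape)                          # (2, 6)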

Proceedings ArticleDOI
04 May 2020
TL;DR: The performance of the models is significantly enhanced by the use of log-mel deltas, and overall the approach is capable of training strong single models, without the use of any supplementary data from outside the official challenge dataset, with excellent generalization to unknown devices.
Abstract: We investigate the problem of acoustic scene classification, using a deep residual network applied to log-mel spectrograms complemented by log-mel deltas and delta-deltas. We design the network to take into account that the temporal and frequency axes in spectrograms represent fundamentally different information. In particular, we use two pathways in the residual network, one for high frequencies and one for low frequencies, that are fused just two convolutional layers prior to the network output. We conduct experiments using two public 2019 DCASE datasets for acoustic scene classification: the first with binaural audio inputs recorded by a single device, and the second with single-channel audio inputs recorded through various devices. We show that the performance of our models is significantly enhanced by the use of log-mel deltas, and that overall our approach is capable of training strong single models, without the use of any supplementary data from outside the official challenge dataset, with excellent generalization to unknown devices. In particular, our approach achieved second place in 2019 DCASE Task 1B (0.4% behind the winning entry), and the best Task 1B evaluation results (by a large margin of over 5%) on test data from a device not used to record any training data.
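A short sketch of building the three-channel input mentioned above: the log-mel spectrogram plus its deltas and delta-deltas, stacked as channels with librosa. The parameter values (sampling rate, mel-bin count) are assumptions, not the paper's exact settings.

# Log-mel spectrogram with delta and delta-delta channels.
import numpy as np
import librosa

def logmel_with_deltas(y, sr, n_mels=128):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    delta = librosa.feature.delta(logmel, order=1)
    delta2 = librosa.feature.delta(logmel, order=2)
    return np.stack([logmel, delta, delta2], axis=0)   # (3, n_mels, frames)

y = np.random.randn(10 * 48000).astype(np.float32)     # placeholder 10 s clip
print(logmel_with_deltas(y, sr=48000).shape)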

Journal ArticleDOI
Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu
TL;DR: A general multi-modal framework for target speech separation is proposed by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements, and a factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi- modalities at embedding level.
Abstract: Target speech separation refers to extracting a target speaker's voice from overlapped audio of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that utilizes all the available information about the target speaker, including his/her spatial location, voice characteristics, and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid, or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that is spatialized by simulated room impulse responses (RIRs). Experimental results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.

Proceedings ArticleDOI
04 May 2020
TL;DR: This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
Abstract: Supervised deep learning has gained significant attention for speech enhancement recently. The state-of-the-art deep learning methods perform the task by learning a ratio/binary mask that is applied to the mixture in the time-frequency domain to produce the clean speech. Despite the great performance in the single-channel setting, these frameworks lag in performance in the multichannel setting as the majority of these methods a) fail to exploit the available spatial information fully, and b) still treat the deep architecture as a black box which may not be well-suited for multichannel audio processing. This paper addresses these drawbacks, a) by utilizing complex ratio masking instead of masking on the magnitude of the spectrogram, and more importantly, b) by introducing a channel-attention mechanism inside the deep architecture to mimic beamforming. We propose Channel-Attention Dense U-Net, in which we apply the channel-attention unit recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming. We demonstrate the superior performance of the network against the state-of-the-art approaches on the CHiME-3 dataset.
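To illustrate the channel-attention mechanism referred to above, here is a generic squeeze-and-excitation style channel-attention unit applied to a feature map. It is a sketch of the general mechanism under assumed layer sizes, not the paper's exact Channel-Attention Dense U-Net block.

# Generic channel-attention unit re-weighting feature-map channels.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))                  # global average pool per channel
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # re-weight channels

x = torch.randn(2, 16, 64, 100)
print(ChannelAttention(16)(x).shape)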

Journal ArticleDOI
TL;DR: A new method that uses spectrogram features extracted from speech data to identify Alzheimer's disease, which can help families understand a patient's disease progression at an earlier stage, so that they can take measures in advance to delay its development.

Journal ArticleDOI
TL;DR: Deep features are used for the environmental sound classification (ESC) problem by using a newly developed Convolutional Neural Network (CNN) model, which is trained in an end-to-end fashion on spectrogram images.
Abstract: Cognitive prediction in complicated and active environments is of great importance in artificial learning. The classification accuracy of sound events has a strong relation to the feature extraction. In this paper, deep features are used for the environmental sound classification (ESC) problem. The deep features are extracted by using the fully connected layers of a newly developed Convolutional Neural Network (CNN) model, which is trained in an end-to-end fashion on spectrogram images. The feature vector is constituted by concatenating the outputs of the fully connected layers of the proposed CNN model. For testing the performance of the proposed method, the feature set is fed as input to a random subspace K-Nearest Neighbor (KNN) ensemble classifier. The experimental studies, which are carried out on the DCASE-2017 ASC and UrbanSound8K datasets, show that the proposed CNN model achieves classification accuracies of 96.23% and 86.70%, respectively.

Journal ArticleDOI
TL;DR: This work is the first in the relevant literature to use 2D time-frequency features, obtained by transforming the raw EEG with the Short-time Fourier Transform (STFT), for the purpose of automatic diagnosis of SZ patients.
Abstract: This study presents a method that aims to automatically diagnose Schizophrenia (SZ) patients by using EEG recordings. Unlike many studies in the literature, the proposed method does not manually extract features from EEG recordings; instead, it transforms the raw EEG into 2D by using the Short-time Fourier Transform (STFT) in order to have a useful representation of time-frequency features. This work is the first in the relevant literature to use 2D time-frequency features for the purpose of automatic diagnosis of SZ patients. In order to extract the most useful features out of all those present in the 2D space and classify samples with high accuracy, a state-of-the-art Convolutional Neural Network architecture, namely VGG-16, is trained. The experimental results show that the method presented in the paper is successful in the task of classifying SZ patients and healthy controls, with classification accuracies of 95% and 97% on two datasets of different age groups. With this performance, the proposed method outperforms most of the literature methods. The experiments of the study also reveal that there is a relationship between the frequency components of an EEG recording and the SZ disease. Moreover, the Grad-CAM images presented in the paper clearly show that mid-level frequency components matter more when discriminating an SZ patient from a healthy control.
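A brief sketch of the time-frequency front end described above: a raw EEG channel converted to a 2-D spectrogram via the STFT before being fed to a CNN such as VGG-16. The sampling rate, recording length, and STFT parameters are assumptions.

# Convert a raw EEG channel to a 2-D time-frequency image with the STFT.
import numpy as np
from scipy.signal import stft

fs = 250                                   # assumed EEG sampling rate (Hz)
eeg = np.random.randn(fs * 60)             # placeholder one-minute recording

f, t, Z = stft(eeg, fs=fs, nperseg=256, noverlap=128)
spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-10)   # 2-D time-frequency image
print(spectrogram_db.shape)                # (freq_bins, time_frames), resized before VGG-16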

Journal ArticleDOI
TL;DR: A new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time domain to frequency domain conversion, which allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on the disk.
Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on disk. Moreover, this approach also allows back-propagation through the waveform-to-spectrogram transformation layer; hence, the transformation process can be made trainable, further optimizing the waveform-to-spectrogram transformation for the specific task that the neural network is trained on. All spectrogram implementations scale linearly with respect to the input length. nnAudio, however, leverages the compute unified device architecture (CUDA) 1D convolutional neural network from PyTorch, and its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than other implementations that use only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and our framework significantly reduces the spectrogram extraction time from the order of seconds (using a popular Python library, librosa) to the order of milliseconds, given that the audio recordings are of the same length. When applying nnAudio to variable-length audio inputs, an average of 11.5 hours is required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa. An average of 2.8 hours is required for nnAudio, which is still four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.
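A condensed sketch of the core idea behind this framework: the STFT expressed as a 1-D convolution whose kernels are windowed Fourier basis functions, so spectrogram extraction runs on the GPU and is differentiable (and could even be made trainable). This illustrates the principle only; it is not nnAudio's actual implementation or API.

# STFT implemented as a single 1-D convolution with Fourier-basis kernels.
import torch
import torch.nn.functional as F
import numpy as np

def conv_stft(x, n_fft=512, hop=256):
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = np.hanning(n_fft)
    cos_kernel = torch.tensor(np.cos(2 * np.pi * k * n / n_fft) * window, dtype=torch.float32)
    sin_kernel = torch.tensor(-np.sin(2 * np.pi * k * n / n_fft) * window, dtype=torch.float32)
    kernels = torch.cat([cos_kernel, sin_kernel]).unsqueeze(1)      # (2*(n_fft//2+1), 1, n_fft)
    out = F.conv1d(x.unsqueeze(1), kernels, stride=hop)             # one conv = whole STFT
    real, imag = out.chunk(2, dim=1)
    return torch.sqrt(real ** 2 + imag ** 2)                        # magnitude spectrogram

waveform = torch.randn(1, 22050)           # placeholder one-second clip
print(conv_stft(waveform).shape)           # (1, n_fft//2 + 1, frames)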

Posted Content
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu
TL;DR: A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.

Journal ArticleDOI
TL;DR: The main objective of this work is to present a methodology combining two methods, spectrograms and a 1D CNN, as one possible approach for seizure detection.

Journal ArticleDOI
TL;DR: In this paper, a convolutional neural network transformer (CNN-Transformer) was proposed for audio tagging and sound event detection, which outperformed the previous state-of-the-art.
Abstract: Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled. That is, there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming that without threshold optimization of 0.629, and a sound event detection F1 of 0.584, outperforming that without threshold optimization of 0.564.

Proceedings ArticleDOI
30 Jul 2020
TL;DR: VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform, and exhibits significantly improved quality in multiple evaluation metrics including mean opinion score (MOS) with minimal additional overhead.
Abstract: We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster on a GTX 1080Ti GPU and 3.24x faster on a CPU than real-time. Compared with MelGAN, it also exhibits significantly improved quality in multiple evaluation metrics including mean opinion score (MOS) with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a CPU and exhibits higher MOS.

Journal ArticleDOI
11 Aug 2020 - Sensors
TL;DR: The results showed that the deep convolutional generative adversarial network (DCGAN) provided better augmentation performance than traditional DA methods: geometric transformation (GT), autoencoder (AE), and variational autoenCoder (VAE) (p < 0.01).
Abstract: As an important paradigm of spontaneous brain-computer interfaces (BCIs), motor imagery (MI) has been widely used in the fields of neurological rehabilitation and robot control. Recently, researchers have proposed various methods for feature extraction and classification based on MI signals. Decoding models based on deep neural networks (DNNs) have attracted significant attention in the field of MI signal processing. Due to the strict requirements for subjects and experimental environments, it is difficult to collect large-scale and high-quality electroencephalogram (EEG) data. However, the performance of a deep learning model depends directly on the size of the datasets. Therefore, decoding MI-EEG signals with a DNN has proven highly challenging in practice. Based on this, we investigated the performance of different data augmentation (DA) methods for the classification of MI data using a DNN. First, we transformed the time-series signals into spectrogram images using a short-time Fourier transform (STFT). Then, we evaluated and compared the performance of different DA methods on this spectrogram data. Next, we developed a convolutional neural network (CNN) to classify the MI signals and compared the classification performance after DA. The Fréchet inception distance (FID) was used to evaluate the quality of the generated data (GD), and the classification accuracy and mean kappa values were used to explore the best CNN-DA method. In addition, analysis of variance (ANOVA) and paired t-tests were used to assess the significance of the results. The results showed that the deep convolutional generative adversarial network (DCGAN) provided better augmentation performance than traditional DA methods: geometric transformation (GT), autoencoder (AE), and variational autoencoder (VAE) (p < 0.01). Public datasets from BCI competition IV (datasets 1 and 2b) were used to verify the classification performance. Improvements in the classification accuracies of 17% and 21% (p < 0.01) were observed after DA for the two datasets. In addition, the hybrid network CNN-DCGAN outperformed the other classification methods, with average kappa values of 0.564 and 0.677 for the two datasets.

Journal ArticleDOI
TL;DR: A novel method for continuous hand gesture detection and recognition is proposed based on a frequency modulated continuous wave (FMCW) radar and the Fusion Dynamic Time Warping (FDTW) algorithm is presented to recognize the hand gestures.
Abstract: In this article, a novel method for continuous hand gesture detection and recognition is proposed based on a frequency modulated continuous wave (FMCW) radar. Firstly, we adopt the 2-Dimensional Fast Fourier Transform (2D-FFT) to estimate the range and Doppler parameters of the hand gesture raw data, and construct the range-time map (RTM) and Doppler-time map (DTM). Meanwhile, we apply the Multiple Signal Classification (MUSIC) algorithm to calculate the angle and construct the angle-time map (ATM). Secondly, a hand gesture detection method is proposed to segment the continuous hand gestures using a decision threshold. Thirdly, the central time-frequency trajectory of each hand gesture spectrogram is clustered using the k-means algorithm, and then the Fusion Dynamic Time Warping (FDTW) algorithm is presented to recognize the hand gestures. Finally, experiments show that the accuracy of the proposed hand gesture detection method can reach 96.17%. The hand gesture average recognition accuracy of the proposed FDTW algorithm is 95.83%, while its time complexity is reduced by more than 50%.
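A minimal sketch of the range-Doppler step described above: a 2-D FFT over an FMCW radar frame (fast-time samples by slow-time chirps), which yields the range and Doppler information used to build the RTM and DTM. The frame dimensions and the random placeholder data are illustrative assumptions.

# 2-D FFT of an FMCW radar frame into a range-Doppler map.
import numpy as np

n_samples, n_chirps = 256, 128                       # fast-time x slow-time (assumed)
frame = np.random.randn(n_samples, n_chirps) + 1j * np.random.randn(n_samples, n_chirps)

range_fft = np.fft.fft(frame, axis=0)                # fast-time FFT -> range bins
range_doppler = np.fft.fftshift(np.fft.fft(range_fft, axis=1), axes=1)  # slow-time FFT -> Doppler bins
rd_map_db = 20 * np.log10(np.abs(range_doppler) + 1e-12)
print(rd_map_db.shape)                               # (range bins, Doppler bins)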

Journal ArticleDOI
TL;DR: This work presents a technique based on the generation of sound spectrograms from fragments of fixed length, extracted from original audio clips recorded in high-attendance events, where the crowd acts as a collective individual.
Abstract: Crowds express emotions as a collective individual, which is evident from the sounds that a crowd produces in particular events, e.g., collective booing, laughing or cheering in sports matches, movies, theaters, concerts, political demonstrations, and riots. A critical question concerning the innovative concept of crowd emotions is whether the emotional content of crowd sounds can be characterized by frequency-amplitude features, using analysis techniques similar to those applied to individual voices, where deep learning classification is applied to spectrogram images derived by sound transformations. In this work, we present a technique based on the generation of sound spectrograms from fragments of fixed length, extracted from original audio clips recorded in high-attendance events, where the crowd acts as a collective individual. Transfer learning techniques are used on a convolutional neural network, pre-trained on low-level features using the well-known, extensive ImageNet dataset of visual knowledge. The original sound clips are filtered and normalized in amplitude for correct spectrogram generation, on which we fine-tune the domain-specific features. Experiments held on the final trained Convolutional Neural Network show the promising performance of the proposed model in classifying the emotions of the crowd.

Journal ArticleDOI
29 Apr 2020
TL;DR: This letter proposes a novel integrated human localization and activity classification using unscented Kalman filter and demonstrates the results using a short-range 60-GHz frequency modulated continuous wave radar.
Abstract: Short-range compact radar systems offer an attractive modality for localization and tracking of human targets in indoor and outdoor environments for industrial and consumer applications. Micro-Doppler radar reflections from human targets can be sensed and used for human activity classification, which has applications in human–computer interaction and health assessment, among others. Traditionally, the detected human targets' locations are tracked and their micro-Doppler spectrograms extracted for further activity classification of the human target. In this letter, we propose a novel integrated human localization and activity classification approach using an unscented Kalman filter and demonstrate our results using a short-range 60-GHz frequency modulated continuous wave radar. The proposed solution is shown to result in improved classification accuracy with the capability of providing uncertainty with associated classification probabilities and, thus, is a simple mechanism to achieve Bayesian classification.

Journal ArticleDOI
TL;DR: This paper combines different signal processing techniques and a deep learning method to denoise, compress, segment, and classify PCG signals effectively and accurately, achieving an overall testing accuracy of around 97.10%.
Abstract: Phonocardiography (PCG) is the graphical representation of heart sounds. The PCG signal contains useful information about the functionality and condition of the heart. It also provides an early indication of potential cardiac abnormalities. Extracting cardiac information from heart sounds and detecting abnormal heart sounds to diagnose heart diseases using the PCG signal can play a vital role in remote patient monitoring. In this paper, we have combined different signal processing techniques and a deep learning method to denoise, compress, segment, and classify PCG signals effectively and accurately. First, the PCG signal is denoised and compressed by using a multi-resolution analysis based on the Discrete Wavelet Transform (DWT). Then, a segmentation algorithm, based on the Shannon energy envelope and zero-crossing, is applied to segment the PCG signal into four major parts: the first heart sound (S1), the systole interval, the second heart sound (S2), and the diastole interval. Finally, the Mel-scaled power spectrogram and Mel-frequency cepstral coefficients (MFCC) are employed to extract informative features from the PCG signal, which are then fed into a classifier to classify each PCG signal as normal or abnormal using a deep learning approach. For the classification, a 5-layer feed-forward Deep Neural Network (DNN) model is used, and an overall testing accuracy of around 97.10% is achieved. Besides providing valuable information regarding heart condition, this signal processing approach can help cardiologists take appropriate and reliable steps toward diagnosis if any cardiovascular disorder is found at an initial stage.
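A short sketch of the Shannon energy envelope used in the segmentation step described above: the PCG is amplitude-normalized, the average Shannon energy is computed per frame, and the envelope is standardized. The frame length, hop, and sampling rate are assumptions.

# Shannon energy envelope of a PCG signal for heart-sound segmentation.
import numpy as np

def shannon_energy_envelope(pcg, frame_len=400, hop=200):
    pcg = pcg / (np.max(np.abs(pcg)) + 1e-12)            # amplitude-normalize
    envelope = []
    for start in range(0, len(pcg) - frame_len, hop):
        x = pcg[start:start + frame_len]
        se = -np.mean(x ** 2 * np.log(x ** 2 + 1e-12))   # average Shannon energy of the frame
        envelope.append(se)
    envelope = np.array(envelope)
    return (envelope - envelope.mean()) / (envelope.std() + 1e-12)  # standardized envelope

pcg = np.random.randn(8000)                              # placeholder 4 s PCG at 2 kHz
print(shannon_energy_envelope(pcg).shape)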

Journal ArticleDOI
TL;DR: In this article, a hybrid architecture based on acoustic and deep features was proposed to increase the classification accuracy in the problem of speech emotion recognition, which consists of feature extraction, feature selection and classification stages.
Abstract: The problem of recognition and classification of emotions in speech is one of the most prominent research topics, that has gained popularity, in human-computer interaction in the last decades. Having recognized the feelings or emotions in human conversations might have a deep impact on understanding a human’s physical and psychological situation. This study proposes a novel hybrid architecture based on acoustic and deep features to increase the classification accuracy in the problem of speech emotion recognition. The proposed method consists of feature extraction, feature selection and classification stages. At first, acoustic features such as Root Mean Square energy (RMS), Mel-Frequency Cepstral Coefficients (MFCC) and Zero-crossing Rate are obtained from voice records. Subsequently, spectrogram images of the original sound signals are given as input to the pre-trained deep network architecture, which is VGG16, ResNet18, ResNet50, ResNet101, SqueezeNet and DenseNet201 and deep features are extracted. Thereafter, a hybrid feature vector is created by combining acoustic and deep features. Also, the ReliefF algorithm is used to select more efficient features from the hybrid feature vector. Finally, in order for the completion of the classification task, Support vector machine (SVM) is used. Experiments are made using three popular datasets used in the literature so as to evaluate the effect of various techniques. These datasets are Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (EMO-DB) and Interactive Emotional Dyadic Motion Capture (IEMOCAP). As a consequence, we reach to 79.41%, 90.21% and 85.37% accuracy rates for RAVDESS, EMO-DB, and IEMOCAP datasets, respectively. The Final results obtained in experiments, clearly, show that the proposed technique might be utilized to accomplish the task of speech emotion recognition efficiently. Moreover, when our technique is compared with those of methods used in the context, it is obvious that our method outperforms others in terms of classification accuracy rates.