
Showing papers on "Spectrogram published in 2020"


Posted Content
TL;DR: DiffWave significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts a white noise signal into a structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of the variational bound on the data likelihood. DiffWave produces high-fidelity audio in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity across various automatic and human evaluations.
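To make the synthesis procedure concrete, here is a minimal, generic sketch of a DDPM-style reverse (denoising) Markov chain over a waveform in PyTorch. The step count, noise schedule, and placeholder noise-prediction network are illustrative assumptions, not DiffWave's actual configuration (which conditions the network on a mel spectrogram).

# Minimal sketch of the reverse (denoising) Markov chain used by diffusion
# waveform models. The noise-prediction network is a placeholder; the real
# DiffWave model conditions on a mel spectrogram.
import torch

T = 50                                   # assumed constant number of synthesis steps
betas = torch.linspace(1e-4, 0.05, T)    # assumed noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class DummyEpsNet(torch.nn.Module):
    """Stand-in for the conditional noise-prediction network (ignores t)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv1d(1, 1, kernel_size=3, padding=1)
    def forward(self, x, t):
        return self.net(x)

@torch.no_grad()
def sample(eps_net, length=16000):
    x = torch.randn(1, 1, length)                     # start from white noise
    for t in reversed(range(T)):
        eps = eps_net(x, t)                           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise       # one Markov-chain step
    return x.squeeze()

waveform = sample(DummyEpsNet())
print(waveform.shape)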

459 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: Parallel WaveGAN as discussed by the authors is a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network, which can effectively capture the time-frequency distribution of the realistic speech waveform.
Abstract: We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require the density distillation used in the conventional teacher-student framework, the entire model can be easily trained. Furthermore, our model is able to generate high-fidelity speech even with its compact architecture. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real-time in a single-GPU environment. Perceptual listening test results verify that our proposed method achieves a 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparable to the best distillation-based Parallel WaveNet system.
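For illustration, a minimal PyTorch sketch of the multi-resolution spectrogram loss idea mentioned above: STFT magnitudes of the generated and real waveforms are compared at several FFT resolutions using spectral convergence and log-magnitude distances. The resolutions and equal weighting are assumptions, not the paper's exact settings.

# Multi-resolution STFT loss sketch (spectral convergence + log-magnitude).
import torch
import torch.nn.functional as F

RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]  # (n_fft, hop, win), assumed

def stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def multi_resolution_stft_loss(fake, real):
    loss = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        mag_f = stft_mag(fake, n_fft, hop, win)
        mag_r = stft_mag(real, n_fft, hop, win)
        sc = torch.norm(mag_r - mag_f, p="fro") / torch.norm(mag_r, p="fro")  # spectral convergence
        log_mag = F.l1_loss(torch.log(mag_f), torch.log(mag_r))               # log-magnitude distance
        loss = loss + sc + log_mag
    return loss / len(RESOLUTIONS)

fake = torch.randn(2, 24000)   # placeholder generated waveforms
real = torch.randn(2, 24000)   # placeholder ground-truth waveforms
print(multi_resolution_stft_loss(fake, real).item())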

409 citations


Journal ArticleDOI
TL;DR: A new architecture is introduced, which extracts mel-frequency cepstral coefficients, chromagram, mel-scale spectrogram, Tonnetz representation, and spectral contrast features from sound files and uses them as inputs for the one-dimensional Convolutional Neural Network for the identification of emotions using samples from the Ryerson Audio-Visual Database of Emotional Speech and Song, Berlin, and EMO-DB datasets.
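As a rough sketch of the five-feature front end named in the summary above, the following uses librosa to extract MFCCs, chromagram, mel spectrogram, spectral contrast, and Tonnetz features. Averaging each feature over time to build a single 1-D vector is an assumption about how the inputs are arranged for the 1-D CNN; the file name is hypothetical.

# Extract the five audio features with librosa and stack them into one vector.
import numpy as np
import librosa

def extract_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    stft = np.abs(librosa.stft(y))
    mfcc     = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    chroma   = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    mel      = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz  = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    return np.concatenate([mfcc, chroma, mel, contrast, tonnetz])  # 1-D input vector for the CNN

# features = extract_features("03-01-01-01-01-01-01.wav")  # hypothetical RAVDESS file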

251 citations


Posted Content
TL;DR: A new network structure simulating the complex-valued operation, called Deep Complex Convolution Recurrent Network (DCCRN), where both CNN and RNN structures can handle complex-valued operations.
Abstract: Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or the speech spectrum via a naive convolutional neural network (CNN) or recurrent neural network (RNN). Some recent studies use the complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or the real and imaginary part, respectively. In particular, the convolutional recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper we design a new network structure simulating complex-valued operations, called the Deep Complex Convolution Recurrent Network (DCCRN), in which both the CNN and RNN structures can handle complex-valued operations. The proposed DCCRN models are very competitive with other previous networks on both objective and subjective metrics. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first in the real-time track and second in the non-real-time track in terms of Mean Opinion Score (MOS).
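A minimal sketch of the "complex-valued operation" idea: a convolution over a complex spectrogram implemented with two real-valued convolutions, so the real and imaginary parts interact as in complex multiplication. This is a generic illustration, not the DCCRN architecture itself.

# Complex 2-D convolution over a spectrogram split into real/imaginary parts.
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)
        self.conv_i = nn.Conv2d(in_ch, out_ch, kernel_size, padding=padding)

    def forward(self, x_r, x_i):
        # (a + ib) * (c + id) = (ac - bd) + i(ad + bc)
        out_r = self.conv_r(x_r) - self.conv_i(x_i)
        out_i = self.conv_r(x_i) + self.conv_i(x_r)
        return out_r, out_i

# Complex spectrogram as two tensors of shape (batch, channels, freq, time).
x_r, x_i = torch.randn(1, 1, 257, 100), torch.randn(1, 1, 257, 100)
y_r, y_i = ComplexConv2d(1, 16)(x_r, x_i)
print(y_r.shape, y_i.shape)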

237 citations


Journal ArticleDOI
TL;DR: A gated convolutional recurrent network (GCRN) for complex spectral mapping is proposed, which amounts to a causal system for monaural speech enhancement and yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking.
Abstract: Phase is important for the perceptual quality of speech. However, it seems intractable to directly estimate phase spectra through supervised learning due to their lack of spectrotemporal structure. Complex spectral mapping aims to estimate the real and imaginary spectrograms of clean speech from those of noisy speech, which simultaneously enhances the magnitude and phase responses of speech. Inspired by multi-task learning, we propose a gated convolutional recurrent network (GCRN) for complex spectral mapping, which amounts to a causal system for monaural speech enhancement. Our experimental results suggest that the proposed GCRN substantially outperforms an existing convolutional neural network (CNN) for complex spectral mapping in terms of both objective speech intelligibility and quality. Moreover, the proposed approach yields significantly higher STOI and PESQ than magnitude spectral mapping and complex ratio masking. We also find that complex spectral mapping with the proposed GCRN provides an effective phase estimate.

237 citations


Proceedings ArticleDOI
01 Aug 2020
TL;DR: Deep Complex Convolution Recurrent Network (DCCRN) as mentioned in this paper is a new network structure simulating the complex-valued operation, where both convolutional encoder-decoder (CED) and long short-term memory (LSTM) structures can handle complex-valued operations.
Abstract: Speech enhancement has benefited from the success of deep learning in terms of intelligibility and perceptual quality. Conventional time-frequency (TF) domain methods focus on predicting TF-masks or the speech spectrum via a naive convolutional neural network (CNN) or recurrent neural network (RNN). Some recent studies use the complex-valued spectrogram as a training target but train in a real-valued network, predicting the magnitude and phase component or the real and imaginary part, respectively. In particular, the convolutional recurrent network (CRN) integrates a convolutional encoder-decoder (CED) structure and long short-term memory (LSTM), which has been proven to be helpful for complex targets. In order to train the complex target more effectively, in this paper we design a new network structure simulating complex-valued operations, called the Deep Complex Convolution Recurrent Network (DCCRN), in which both the CNN and RNN structures can handle complex-valued operations. The proposed DCCRN models are very competitive with other previous networks on both objective and subjective metrics. With only 3.7M parameters, our DCCRN models submitted to the Interspeech 2020 Deep Noise Suppression (DNS) challenge ranked first in the real-time track and second in the non-real-time track in terms of Mean Opinion Score (MOS).

225 citations


Journal ArticleDOI
03 Apr 2020
TL;DR: This paper proposes a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, which has the ability to handle detailed phase patterns and to utilize harmonic patterns, and outperforms previous methods by a large margin on four metrics.
Abstract: Time-frequency (T-F) domain masking is a mainstream approach for single-channel speech enhancement. Recently, attention has also been paid to phase prediction in addition to amplitude prediction. In this paper, we propose a phase-and-harmonics-aware deep neural network (DNN), named PHASEN, for this task. Unlike previous methods which directly use a complex ideal ratio mask to supervise the DNN learning, we design a two-stream network, where the amplitude stream and phase stream are dedicated to amplitude and phase prediction. We discover that the two streams should communicate with each other, and this is crucial to phase prediction. In addition, we propose frequency transformation blocks to capture long-range correlations along the frequency axis. Visualization shows that the learned transformation matrix implicitly captures the harmonic correlation, which has been proven to be helpful for T-F spectrogram reconstruction. With these two innovations, PHASEN acquires the ability to handle detailed phase patterns and to utilize harmonic patterns, achieving a 1.76 dB SDR improvement on the AVSpeech + AudioSet dataset. It also achieves significant gains over Google's network on this dataset. On the Voice Bank + DEMAND dataset, PHASEN outperforms previous methods by a large margin on four metrics.

195 citations


Journal ArticleDOI
TL;DR: A novel framework for SER is introduced using key sequence segment selection based on radial basis function network (RBFN) similarity measurement in clusters to reduce the computational complexity of the overall model, and the CNN features are normalized before their actual processing so that the model can easily recognize the spatio-temporal information.
Abstract: Emotional state recognition of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER plays a significant role in many real-time applications such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers that analyze the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models used to extract high-level features from speech spectrograms to increase the recognition accuracy, at the cost of overall model complexity. In contrast, we introduce a novel framework for SER using key sequence segment selection based on radial basis function network (RBFN) similarity measurement in clusters. The selected sequence is converted into a spectrogram by applying the STFT algorithm and passed to the CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bi-directional long short-term memory (BiLSTM) network to learn the temporal information for recognizing the final state of emotion. In the proposed technique, we process the key segments instead of the whole utterance to reduce the computational complexity of the overall model, and we normalize the CNN features before their actual processing so that the model can easily recognize the spatio-temporal information. The proposed system is evaluated on different standard datasets including IEMOCAP, EMO-DB, and RAVDESS to assess the recognition accuracy and processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated by the experiments, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively, when compared to state-of-the-art SER methods.

190 citations


Posted Content
Jeff Donahue, Sander Dieleman, Mikołaj Bińkowski, Erich Elsen, Karen Simonyan
TL;DR: This work takes on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs.
Abstract: Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference, using a differentiable alignment scheme based on token length prediction. It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses constraining the generated audio to roughly match the ground truth in terms of its total duration and mel-spectrogram. To allow the model to capture temporal variation in the generated audio, we employ soft dynamic time warping in the spectrogram-based prediction loss. The resulting model achieves a mean opinion score exceeding 4 on a 5 point scale, which is comparable to the state-of-the-art models relying on multi-stage training and additional supervision.

111 citations


Journal ArticleDOI
01 Dec 2020
TL;DR: The ICBHI 2017 database, which includes different sampling frequencies, noise, and background sounds, was used for the classification of lung sounds, and the obtained scores were better than previously reported results.
Abstract: Treatment of lung diseases, which are the third most common cause of death in the world, is of great importance in the medical field. Many studies using lung sounds recorded with a stethoscope have been conducted in the literature in order to diagnose lung diseases with artificial intelligence-compatible devices and to assist experts in their diagnosis. In this paper, the ICBHI 2017 database, which includes different sampling frequencies, noise, and background sounds, was used for the classification of lung sounds. The lung sound signals were initially converted to spectrogram images by using a time-frequency method. The short-time Fourier transform (STFT) was used as the time-frequency transformation. Two deep learning based approaches were used for lung sound classification. In the first approach, a pre-trained deep convolutional neural network (CNN) model was used for feature extraction and a support vector machine (SVM) classifier was used for classification of the lung sounds. In the second approach, the pre-trained deep CNN model was fine-tuned (transfer learning) via spectrogram images for lung sound classification. The accuracies of the proposed methods were tested using ten-fold cross-validation. The accuracies for the first and second proposed methods were 65.5% and 63.09%, respectively. The obtained accuracies were then compared with some of the existing results, and the obtained scores were better than those results.
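A rough sketch of the first approach described above: a pre-trained CNN used as a fixed feature extractor on spectrogram images, followed by an SVM. The choice of ResNet-18 as the backbone, the input shape, and the placeholder data are assumptions for illustration; the paper does not commit to this exact setup here.

# Pre-trained CNN features + SVM classifier on spectrogram images.
import torch
import torchvision
import numpy as np
from sklearn.svm import SVC

backbone = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # drop the classifier head, keep 512-d features
backbone.eval()

@torch.no_grad()
def cnn_features(spectrogram_images):    # (N, 3, 224, 224) tensor of spectrogram images
    return backbone(spectrogram_images).numpy()

# Placeholder tensors standing in for ICBHI spectrogram images and labels.
X = cnn_features(torch.randn(8, 3, 224, 224))
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:2]))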

102 citations


Journal ArticleDOI
TL;DR: The performance evaluation illustrates that the proposed DCNN without the max-pooling function (Model-2), using Log-Mel audio feature extraction on the augmented datasets, attains the best accuracy on environmental sound classification problems.

Proceedings ArticleDOI
04 May 2020
TL;DR: Experiments on LJSpeech show that the speech quality of Flow-TTS closely approaches that of human speech and is even better than that of the autoregressive model Tacotron 2.
Abstract: In this work, we propose Flow-TTS, a non-autoregressive end-to-end neural TTS model based on generative flow. Unlike other non-autoregressive models, Flow-TTS can achieve high-quality speech generation by using a single feed-forward network. To our knowledge, Flow-TTS is the first TTS model utilizing flow in the spectrogram generation network and the first non-autoregressive model which jointly learns the alignment and spectrogram generation through a single network. Experiments on LJSpeech show that the speech quality of Flow-TTS closely approaches that of human speech and is even better than that of the autoregressive model Tacotron 2 (outperforming Tacotron 2 by a gap of 0.09 in MOS). Meanwhile, the inference speed of Flow-TTS is about 23 times faster than Tacotron 2, which is comparable to FastSpeech.

Journal ArticleDOI
TL;DR: For the first time, the radar spectrogram is treated as a time sequence with multiple channels and a DL model composed of 1-D convolutional neural networks and long short-term memory (LSTM) is proposed that achieves the best recognition accuracy and relatively low complexity compared to the existing 2D-CNN methods.
Abstract: Many deep learning (DL) models have shown exceptional promise in the radar-based human activity recognition (HAR) area. For radar-based HAR, the raw data is generally converted into a 2-D spectrogram by using the short-time Fourier transform (STFT). All the existing DL methods treat the spectrogram as an optical image, and thus the corresponding architectures such as 2-D convolutional neural networks (2D-CNNs) are adopted in those methods. These 2-D methods, which ignore temporal characteristics, ordinarily lead to a complex network with a huge number of parameters but limited recognition accuracy. In this paper, for the first time, the radar spectrogram is treated as a time sequence with multiple channels. Hence, we propose a DL model composed of 1-D convolutional neural networks (1D-CNNs) and long short-term memory (LSTM). The experimental results show that the proposed model can extract spatio-temporal characteristics of the radar data and thus achieves the best recognition accuracy and relatively low complexity compared to the existing 2D-CNN methods.
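A minimal sketch of the idea described above: the radar spectrogram is treated as a time sequence whose "channels" are frequency bins, 1-D convolutions are applied along time, and an LSTM follows. Layer sizes, bin count, and the number of activity classes are illustrative assumptions.

# 1D-CNN + LSTM over a radar spectrogram viewed as a multi-channel time series.
import torch
import torch.nn as nn

class CNN1DLSTM(nn.Module):
    def __init__(self, n_freq_bins=128, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_freq_bins, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, spec):                 # spec: (batch, freq_bins, time)
        h = self.conv(spec)                  # (batch, 64, time)
        h, _ = self.lstm(h.transpose(1, 2))  # LSTM over the time axis
        return self.fc(h[:, -1])             # classify from the last time step

logits = CNN1DLSTM()(torch.randn(2, 128, 200))
print(logits.shape)                          # (2, 6)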

Proceedings ArticleDOI
04 May 2020
TL;DR: The performance of the models is significantly enhanced by the use of log-mel deltas, and overall the approach is capable of training strong single models, without the use of any supplementary data from outside the official challenge dataset, with excellent generalization to unknown devices.
Abstract: We investigate the problem of acoustic scene classification, using a deep residual network applied to log-mel spectrograms complemented by log-mel deltas and delta-deltas. We design the network to take into account that the temporal and frequency axes in spectrograms represent fundamentally different information. In particular, we use two pathways in the residual network, one for high frequencies and one for low frequencies, that are fused just two convolutional layers prior to the network output. We conduct experiments using two public 2019 DCASE datasets for acoustic scene classification: the first with binaural audio inputs recorded by a single device, and the second with single-channel audio inputs recorded through various devices. We show that the performance of our models is significantly enhanced by the use of log-mel deltas, and that overall our approach is capable of training strong single models, without the use of any supplementary data from outside the official challenge dataset, with excellent generalization to unknown devices. In particular, our approach achieved second place in 2019 DCASE Task 1B (0.4% behind the winning entry), and the best Task 1B evaluation results (by a large margin of over 5%) on test data from a device not used to record any training data.
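A short sketch of building the three-channel input mentioned above: the log-mel spectrogram plus its deltas and delta-deltas, stacked as channels with librosa. The parameter values (sampling rate, mel-bin count) are assumptions, not the paper's exact settings.

# Log-mel spectrogram with delta and delta-delta channels.
import numpy as np
import librosa

def logmel_with_deltas(y, sr, n_mels=128):
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)
    delta = librosa.feature.delta(logmel, order=1)
    delta2 = librosa.feature.delta(logmel, order=2)
    return np.stack([logmel, delta, delta2], axis=0)   # (3, n_mels, frames)

y = np.random.randn(10 * 48000).astype(np.float32)     # placeholder 10 s clip
print(logmel_with_deltas(y, sr=48000).shape)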

Journal ArticleDOI
Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Lianwu Chen, Yuexian Zou, Dong Yu
TL;DR: A general multi-modal framework for target speech separation is proposed by utilizing all the available information of the target speaker, including his/her spatial location, voice characteristics and lip movements, and a factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multi- modalities at embedding level.
Abstract: Target speech separation refers to extracting a target speaker's voice from overlapped audio of simultaneous talkers. Previously, the use of the visual modality for target speech separation has demonstrated great potential. This work proposes a general multi-modal framework for target speech separation that utilizes all the available information about the target speaker, including his/her spatial location, voice characteristics, and lip movements. Under this framework, we also investigate fusion methods for multi-modal joint modeling. A factorized attention-based fusion method is proposed to aggregate the high-level semantic information of multiple modalities at the embedding level. This method first factorizes the mixture audio into a set of acoustic subspaces, then leverages the target's information from other modalities to enhance these subspace acoustic embeddings with a learnable attention scheme. To validate the robustness of the proposed multi-modal separation model in practical scenarios, the system was evaluated under the condition that one of the modalities is temporarily missing, invalid, or corrupted. Experiments are conducted on a large-scale audio-visual dataset collected from YouTube (to be released) that is spatialized by simulated room impulse responses (RIRs). Experimental results illustrate that our proposed multi-modal framework significantly outperforms single-modal and bi-modal speech separation approaches, while still supporting real-time processing.

Proceedings ArticleDOI
04 May 2020
TL;DR: This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
Abstract: Supervised deep learning has gained significant attention for speech enhancement recently. The state-of-the-art deep learning methods perform the task by learning a ratio/binary mask that is applied to the mixture in the time-frequency domain to produce the clean speech. Despite the great performance in the single-channel setting, these frameworks lag in performance in the multichannel setting as the majority of these methods a) fail to exploit the available spatial information fully, and b) still treat the deep architecture as a black box which may not be well-suited for multichannel audio processing. This paper addresses these drawbacks, a) by utilizing complex ratio masking instead of masking on the magnitude of the spectrogram, and more importantly, b) by introducing a channel-attention mechanism inside the deep architecture to mimic beamforming. We propose Channel-Attention Dense U-Net, in which we apply the channel-attention unit recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming. We demonstrate the superior performance of the network against the state-of-the-art approaches on the CHiME-3 dataset.
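To illustrate the channel-attention mechanism referred to above, here is a generic squeeze-and-excitation style channel-attention unit applied to a feature map. It is a sketch of the general mechanism under assumed layer sizes, not the paper's exact Channel-Attention Dense U-Net block.

# Generic channel-attention unit re-weighting feature-map channels.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))                  # global average pool per channel
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                            # re-weight channels

x = torch.randn(2, 16, 64, 100)
print(ChannelAttention(16)(x).shape)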

Journal ArticleDOI
TL;DR: A new method that uses spectrogram features extracted from speech data to identify Alzheimer's disease, which can help families understand a patient's disease progression at an earlier stage, so that they can take measures in advance to delay its development.

Journal ArticleDOI
TL;DR: Deep features are used for the environmental sound classification (ESC) problem by using a newly developed Convolutional Neural Network (CNN) model, which is trained in an end-to-end fashion on spectrogram images.
Abstract: Cognitive prediction in complicated and active environments is of great importance in artificial learning. The classification accuracy of sound events has a strong relation to the feature extraction. In this paper, deep features are used for the environmental sound classification (ESC) problem. The deep features are extracted by using the fully connected layers of a newly developed Convolutional Neural Network (CNN) model, which is trained in an end-to-end fashion on spectrogram images. The feature vector is constituted by concatenating the outputs of the fully connected layers of the proposed CNN model. For testing the performance of the proposed method, the feature set is fed as input to a random subspace K-Nearest Neighbor (KNN) ensemble classifier. The experimental studies, which are carried out on the DCASE-2017 ASC and UrbanSound8K datasets, show that the proposed CNN model achieves classification accuracies of 96.23% and 86.70%, respectively.

Journal ArticleDOI
TL;DR: This work is the first in the relevant literature to use 2D time-frequency features, obtained by transforming the raw EEG with the Short-time Fourier Transform (STFT), for the purpose of automatic diagnosis of SZ patients.
Abstract: This study presents a method that aims to automatically diagnose Schizophrenia (SZ) patients by using EEG recordings. Unlike many studies in the literature, the proposed method does not manually extract features from EEG recordings; instead, it transforms the raw EEG into 2D by using the Short-time Fourier Transform (STFT) in order to have a useful representation of time-frequency features. This work is the first in the relevant literature to use 2D time-frequency features for the purpose of automatic diagnosis of SZ patients. In order to extract the most useful features out of all those present in the 2D space and classify samples with high accuracy, a state-of-the-art Convolutional Neural Network architecture, namely VGG-16, is trained. The experimental results show that the method presented in the paper is successful in the task of classifying SZ patients and healthy controls, with classification accuracies of 95% and 97% on two datasets of different age groups. With this performance, the proposed method outperforms most of the literature methods. The experiments of the study also reveal that there is a relationship between the frequency components of an EEG recording and the SZ disease. Moreover, the Grad-CAM images presented in the paper clearly show that mid-level frequency components matter more when discriminating an SZ patient from a healthy control.
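A brief sketch of the time-frequency front end described above: a raw EEG channel converted to a 2-D spectrogram via the STFT before being fed to a CNN such as VGG-16. The sampling rate, recording length, and STFT parameters are assumptions.

# Convert a raw EEG channel to a 2-D time-frequency image with the STFT.
import numpy as np
from scipy.signal import stft

fs = 250                                   # assumed EEG sampling rate (Hz)
eeg = np.random.randn(fs * 60)             # placeholder one-minute recording

f, t, Z = stft(eeg, fs=fs, nperseg=256, noverlap=128)
spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-10)   # 2-D time-frequency image
print(spectrogram_db.shape)                # (freq_bins, time_frames), resized before VGG-16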

Journal ArticleDOI
TL;DR: A new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time domain to frequency domain conversion, which allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on the disk.
Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on disk. Moreover, this approach also allows back-propagation through the waveform-to-spectrogram transformation layer; hence, the transformation process can be made trainable, further optimizing the waveform-to-spectrogram transformation for the specific task that the neural network is trained on. All spectrogram implementations scale linearly with respect to the input length. nnAudio, however, leverages the compute unified device architecture (CUDA) 1D convolutional neural network from PyTorch, and its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than other implementations that use only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and our framework significantly reduces the spectrogram extraction time from the order of seconds (using a popular Python library, librosa) to the order of milliseconds, given that the audio recordings are of the same length. When applying nnAudio to variable-length audio inputs, an average of 11.5 hours is required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa. An average of 2.8 hours is required for nnAudio, which is still four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.
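A condensed sketch of the core idea behind this framework: the STFT expressed as a 1-D convolution whose kernels are windowed Fourier basis functions, so spectrogram extraction runs on the GPU and is differentiable (and could even be made trainable). This illustrates the principle only; it is not nnAudio's actual implementation or API.

# STFT implemented as a single 1-D convolution with Fourier-basis kernels.
import torch
import torch.nn.functional as F
import numpy as np

def conv_stft(x, n_fft=512, hop=256):
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    window = np.hanning(n_fft)
    cos_kernel = torch.tensor(np.cos(2 * np.pi * k * n / n_fft) * window, dtype=torch.float32)
    sin_kernel = torch.tensor(-np.sin(2 * np.pi * k * n / n_fft) * window, dtype=torch.float32)
    kernels = torch.cat([cos_kernel, sin_kernel]).unsqueeze(1)      # (2*(n_fft//2+1), 1, n_fft)
    out = F.conv1d(x.unsqueeze(1), kernels, stride=hop)             # one conv = whole STFT
    real, imag = out.chunk(2, dim=1)
    return torch.sqrt(real ** 2 + imag ** 2)                        # magnitude spectrogram

waveform = torch.randn(1, 22050)           # placeholder one-second clip
print(conv_stft(waveform).shape)           # (1, n_fft//2 + 1, frames)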

Posted Content
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu
TL;DR: A non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, called Parallel Tacotron, which is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.

Journal ArticleDOI
TL;DR: The main objective of this work is to present a methodology combining two methods, spectrograms and a 1D CNN, as one possible approach for seizure detection.

Journal ArticleDOI
TL;DR: In this paper, a convolutional neural network transformer (CNN-Transformer) was proposed for audio tagging and sound event detection, which outperformed the previous state-of-the-art.
Abstract: Sound event detection (SED) is a task to detect sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled. That is, there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous works. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming that without threshold optimization of 0.629, and a sound event detection F1 of 0.584, outperforming that without threshold optimization of 0.564.

Proceedings ArticleDOI
30 Jul 2020
TL;DR: VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform, and exhibits significantly improved quality in multiple evaluation metrics including mean opinion score (MOS) with minimal additional overhead.
Abstract: We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster on a GTX 1080Ti GPU and 3.24x faster on a CPU than real-time. Compared with MelGAN, it also exhibits significantly improved quality in multiple evaluation metrics including mean opinion score (MOS) with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a CPU and exhibits higher MOS.

Journal ArticleDOI
11 Aug 2020 - Sensors
TL;DR: The results showed that the deep convolutional generative adversarial network (DCGAN) provided better augmentation performance than traditional DA methods: geometric transformation (GT), autoencoder (AE), and variational autoenCoder (VAE) (p < 0.01).
Abstract: As an important paradigm of spontaneous brain-computer interfaces (BCIs), motor imagery (MI) has been widely used in the fields of neurological rehabilitation and robot control. Recently, researchers have proposed various methods for feature extraction and classification based on MI signals. Decoding models based on deep neural networks (DNNs) have attracted significant attention in the field of MI signal processing. Due to the strict requirements for subjects and experimental environments, it is difficult to collect large-scale and high-quality electroencephalogram (EEG) data. However, the performance of a deep learning model depends directly on the size of the datasets. Therefore, decoding MI-EEG signals with a DNN has proven highly challenging in practice. Based on this, we investigated the performance of different data augmentation (DA) methods for the classification of MI data using a DNN. First, we transformed the time-series signals into spectrogram images using a short-time Fourier transform (STFT). Then, we evaluated and compared the performance of different DA methods on this spectrogram data. Next, we developed a convolutional neural network (CNN) to classify the MI signals and compared the classification performance after DA. The Fréchet inception distance (FID) was used to evaluate the quality of the generated data (GD), and the classification accuracy and mean kappa values were used to explore the best CNN-DA method. In addition, analysis of variance (ANOVA) and paired t-tests were used to assess the significance of the results. The results showed that the deep convolutional generative adversarial network (DCGAN) provided better augmentation performance than traditional DA methods: geometric transformation (GT), autoencoder (AE), and variational autoencoder (VAE) (p < 0.01). Public datasets from BCI competition IV (datasets 1 and 2b) were used to verify the classification performance. Improvements in the classification accuracies of 17% and 21% (p < 0.01) were observed after DA for the two datasets. In addition, the hybrid network CNN-DCGAN outperformed the other classification methods, with average kappa values of 0.564 and 0.677 for the two datasets.

Journal ArticleDOI
TL;DR: A novel method for continuous hand gesture detection and recognition is proposed based on a frequency modulated continuous wave (FMCW) radar and the Fusion Dynamic Time Warping (FDTW) algorithm is presented to recognize the hand gestures.
Abstract: In this article, a novel method for continuous hand gesture detection and recognition is proposed based on a frequency modulated continuous wave (FMCW) radar. Firstly, we adopt the 2-Dimensional Fast Fourier Transform (2D-FFT) to estimate the range and Doppler parameters of the hand gesture raw data, and construct the range-time map (RTM) and Doppler-time map (DTM). Meanwhile, we apply the Multiple Signal Classification (MUSIC) algorithm to calculate the angle and construct the angle-time map (ATM). Secondly, a hand gesture detection method is proposed to segment the continuous hand gestures using a decision threshold. Thirdly, the central time-frequency trajectory of each hand gesture spectrogram is clustered using the k-means algorithm, and then the Fusion Dynamic Time Warping (FDTW) algorithm is presented to recognize the hand gestures. Finally, experiments show that the accuracy of the proposed hand gesture detection method can reach 96.17%. The hand gesture average recognition accuracy of the proposed FDTW algorithm is 95.83%, while its time complexity is reduced by more than 50%.
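A minimal sketch of the range-Doppler step described above: a 2-D FFT over an FMCW radar frame (fast-time samples by slow-time chirps), which yields the range and Doppler information used to build the RTM and DTM. The frame dimensions and the random placeholder data are illustrative assumptions.

# 2-D FFT of an FMCW radar frame into a range-Doppler map.
import numpy as np

n_samples, n_chirps = 256, 128                       # fast-time x slow-time (assumed)
frame = np.random.randn(n_samples, n_chirps) + 1j * np.random.randn(n_samples, n_chirps)

range_fft = np.fft.fft(frame, axis=0)                # fast-time FFT -> range bins
range_doppler = np.fft.fftshift(np.fft.fft(range_fft, axis=1), axes=1)  # slow-time FFT -> Doppler bins
rd_map_db = 20 * np.log10(np.abs(range_doppler) + 1e-12)
print(rd_map_db.shape)                               # (range bins, Doppler bins)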

Journal ArticleDOI
TL;DR: This work presents a technique based on the generation of sound spectrograms from fragments of fixed length, extracted from original audio clips recorded in high-attendance events, where the crowd acts as a collective individual.
Abstract: Crowds express emotions as a collective individual, which is evident from the sounds that a crowd produces in particular events, e.g., collective booing, laughing or cheering in sports matches, movies, theaters, concerts, political demonstrations, and riots. A critical question concerning the innovative concept of crowd emotions is whether the emotional content of crowd sounds can be characterized by frequency-amplitude features, using analysis techniques similar to those applied to individual voices, where deep learning classification is applied to spectrogram images derived by sound transformations. In this work, we present a technique based on the generation of sound spectrograms from fragments of fixed length, extracted from original audio clips recorded in high-attendance events, where the crowd acts as a collective individual. Transfer learning techniques are used on a convolutional neural network, pre-trained on low-level features using the well-known, extensive ImageNet dataset of visual knowledge. The original sound clips are filtered and normalized in amplitude for correct spectrogram generation, on which we fine-tune the domain-specific features. Experiments held on the final trained Convolutional Neural Network show the promising performance of the proposed model in classifying the emotions of the crowd.

Journal ArticleDOI
29 Apr 2020
TL;DR: This letter proposes a novel integrated human localization and activity classification using unscented Kalman filter and demonstrates the results using a short-range 60-GHz frequency modulated continuous wave radar.
Abstract: Short-range compact radar systems offer an attractive modality for localization and tracking of human targets in indoor and outdoor environments for industrial and consumer applications. Micro-Doppler radar reflections from human targets can be sensed and used for human activity classification, which has applications in human–computer interaction and health assessment, among others. Traditionally, the detected human targets' locations are tracked and their micro-Doppler spectrograms extracted for further activity classification of the human target. In this letter, we propose a novel integrated human localization and activity classification approach using an unscented Kalman filter and demonstrate our results using a short-range 60-GHz frequency modulated continuous wave radar. The proposed solution is shown to result in improved classification accuracy with the capability of providing uncertainty with associated classification probabilities and, thus, is a simple mechanism to achieve Bayesian classification.

Journal ArticleDOI
TL;DR: This paper combines different signal processing techniques and a deep learning method to denoise, compress, segment, and classify PCG signals effectively and accurately, achieving an overall testing accuracy of around 97.10%.
Abstract: Phonocardiography (PCG) is the graphical representation of heart sounds. The PCG signal contains useful information about the functionality and condition of the heart. It also provides an early indication of potential cardiac abnormalities. Extracting cardiac information from heart sounds and detecting abnormal heart sounds to diagnose heart diseases using the PCG signal can play a vital role in remote patient monitoring. In this paper, we have combined different signal processing techniques and a deep learning method to denoise, compress, segment, and classify PCG signals effectively and accurately. First, the PCG signal is denoised and compressed by using a multi-resolution analysis based on the Discrete Wavelet Transform (DWT). Then, a segmentation algorithm, based on the Shannon energy envelope and zero-crossing, is applied to segment the PCG signal into four major parts: the first heart sound (S1), the systole interval, the second heart sound (S2), and the diastole interval. Finally, the Mel-scaled power spectrogram and Mel-frequency cepstral coefficients (MFCC) are employed to extract informative features from the PCG signal, which are then fed into a classifier to classify each PCG signal as normal or abnormal using a deep learning approach. For the classification, a 5-layer feed-forward Deep Neural Network (DNN) model is used, and an overall testing accuracy of around 97.10% is achieved. Besides providing valuable information regarding heart condition, this signal processing approach can help cardiologists take appropriate and reliable steps toward diagnosis if any cardiovascular disorder is found at an initial stage.
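A short sketch of the Shannon energy envelope used in the segmentation step described above: the PCG is amplitude-normalized, the average Shannon energy is computed per frame, and the envelope is standardized. The frame length, hop, and sampling rate are assumptions.

# Shannon energy envelope of a PCG signal for heart-sound segmentation.
import numpy as np

def shannon_energy_envelope(pcg, frame_len=400, hop=200):
    pcg = pcg / (np.max(np.abs(pcg)) + 1e-12)            # amplitude-normalize
    envelope = []
    for start in range(0, len(pcg) - frame_len, hop):
        x = pcg[start:start + frame_len]
        se = -np.mean(x ** 2 * np.log(x ** 2 + 1e-12))   # average Shannon energy of the frame
        envelope.append(se)
    envelope = np.array(envelope)
    return (envelope - envelope.mean()) / (envelope.std() + 1e-12)  # standardized envelope

pcg = np.random.randn(8000)                              # placeholder 4 s PCG at 2 kHz
print(shannon_energy_envelope(pcg).shape)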

Journal ArticleDOI
TL;DR: In this article, a hybrid architecture based on acoustic and deep features was proposed to increase the classification accuracy in the problem of speech emotion recognition, which consists of feature extraction, feature selection and classification stages.
Abstract: The problem of recognition and classification of emotions in speech is one of the most prominent research topics, that has gained popularity, in human-computer interaction in the last decades. Having recognized the feelings or emotions in human conversations might have a deep impact on understanding a human’s physical and psychological situation. This study proposes a novel hybrid architecture based on acoustic and deep features to increase the classification accuracy in the problem of speech emotion recognition. The proposed method consists of feature extraction, feature selection and classification stages. At first, acoustic features such as Root Mean Square energy (RMS), Mel-Frequency Cepstral Coefficients (MFCC) and Zero-crossing Rate are obtained from voice records. Subsequently, spectrogram images of the original sound signals are given as input to the pre-trained deep network architecture, which is VGG16, ResNet18, ResNet50, ResNet101, SqueezeNet and DenseNet201 and deep features are extracted. Thereafter, a hybrid feature vector is created by combining acoustic and deep features. Also, the ReliefF algorithm is used to select more efficient features from the hybrid feature vector. Finally, in order for the completion of the classification task, Support vector machine (SVM) is used. Experiments are made using three popular datasets used in the literature so as to evaluate the effect of various techniques. These datasets are Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), Berlin (EMO-DB) and Interactive Emotional Dyadic Motion Capture (IEMOCAP). As a consequence, we reach to 79.41%, 90.21% and 85.37% accuracy rates for RAVDESS, EMO-DB, and IEMOCAP datasets, respectively. The Final results obtained in experiments, clearly, show that the proposed technique might be utilized to accomplish the task of speech emotion recognition efficiently. Moreover, when our technique is compared with those of methods used in the context, it is obvious that our method outperforms others in terms of classification accuracy rates.