
Showing papers on "Spectrogram published in 2021"


Proceedings ArticleDOI
05 Apr 2021
TL;DR: The Audio Spectrogram Transformer (AST) as mentioned in this paper is the first convolution-free, purely attention-based model for audio classification, which achieves state-of-the-art results on various audio classification benchmarks.
Abstract: In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
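As a rough illustration of the input pipeline described above, the sketch below (not the authors' code) turns a waveform into a log-Mel spectrogram and cuts it into fixed-size patches, the kind of token sequence a convolution-free spectrogram transformer consumes; the 16 kHz sample rate, Mel settings, and non-overlapping 16x16 patching are illustrative assumptions (AST itself uses overlapping patches).

```python
# Sketch: log-Mel spectrogram -> flattened 16x16 patches (illustrative settings).
import numpy as np
import librosa

def log_mel_patches(wav_path, n_mels=128, patch=16):
    y, sr = librosa.load(wav_path, sr=16000)                  # mono, 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                         # (n_mels, frames)
    # trim so both axes divide evenly by the patch size
    f = (logmel.shape[0] // patch) * patch
    t = (logmel.shape[1] // patch) * patch
    logmel = logmel[:f, :t]
    # split into non-overlapping patch x patch tiles and flatten each tile
    tiles = (logmel.reshape(f // patch, patch, t // patch, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, patch * patch))
    return tiles                                              # (num_patches, 256)
```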

103 citations


Journal ArticleDOI
TL;DR: The results show the effectiveness, robustness, and high accuracy of the proposed approach, which achieves meaningful data augmentation by applying variations directly to the audio clips.

75 citations


Journal ArticleDOI
TL;DR: In this paper, the implicit compensation between estimated magnitude and phase was analyzed for monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions.
Abstract: Deep neural network (DNN) based end-to-end optimization in the complex time-frequency (T-F) domain or time domain has shown considerable potential in monaural speech separation. Many recent studies optimize loss functions defined solely in the time or complex domain, without including a loss on magnitude. Although such loss functions typically produce better scores if the evaluation metrics are objective time-domain metrics, they produce worse scores on speech quality and intelligibility metrics and usually lead to worse speech recognition performance, compared with including a loss on magnitude. While this phenomenon has been experimentally observed by many studies, it is often not accurately explained, and a thorough understanding of its fundamental cause is still lacking. This letter provides a novel view from the perspective of the implicit compensation between estimated magnitude and phase. Analytical results based on monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions support the validity of our view.
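To make the distinction concrete, here is a minimal sketch of the two loss families the letter contrasts: a loss computed on complex STFT values versus one that also constrains the magnitude. The STFT settings and L1 formulation are assumptions for illustration, not the authors' training setup.

```python
import numpy as np
from scipy.signal import stft

def separation_losses(est, ref, fs=16000):
    """est, ref: time-domain estimate and reference signals of equal length."""
    _, _, E = stft(est, fs=fs, nperseg=512, noverlap=256)
    _, _, R = stft(ref, fs=fs, nperseg=512, noverlap=256)
    complex_l1 = np.mean(np.abs(E - R))                        # complex-domain loss
    magnitude_l1 = np.mean(np.abs(np.abs(E) - np.abs(R)))      # magnitude loss
    # Training only on complex_l1 lets magnitude errors be traded against phase
    # errors (the "implicit compensation"); adding magnitude_l1 penalizes that.
    return complex_l1, magnitude_l1
```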

60 citations


Journal ArticleDOI
TL;DR: A hybrid classifier that can adjust the prediction of deep learning models with the estimated CFO is designed to further increase the classification accuracy of the deep learning-based RFFI scheme for Long Range (LoRa) systems.
Abstract: Radio frequency fingerprint identification (RFFI) is an emerging device authentication technique that relies on the intrinsic hardware characteristics of wireless devices. This paper designs a deep learning-based RFFI scheme for Long Range (LoRa) systems. Firstly, the instantaneous carrier frequency offset (CFO) is found to drift, which could result in misclassification and significantly compromise the stability of the deep learning-based RFFI system. CFO compensation is demonstrated to be effective mitigation. Secondly, three signal representations for deep learning-based RFFI are investigated in time, frequency, and time-frequency domains, namely in-phase and quadrature (IQ) samples, fast Fourier transform (FFT) results and spectrograms, respectively. For these signal representations, three deep learning models are implemented, i.e., multilayer perceptron (MLP), long short-term memory (LSTM) network and convolutional neural network (CNN), in order to explore an optimal framework. Finally, a hybrid classifier that can adjust the prediction of deep learning models with the estimated CFO is designed to further increase the classification accuracy. The CFO will not change dramatically over several continuous days, hence it can be used to correct predictions when the estimated CFO is much different from the reference one. Experimental evaluation is performed in real wireless environments involving 25 LoRa devices and a Universal Software Radio Peripheral (USRP) N210 platform. The spectrogram-CNN model is found to be optimal for classifying LoRa devices which can reach an accuracy of 96.40% with the least complexity and training time.
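A minimal sketch of the three signal representations compared above (IQ samples, FFT results, and spectrograms), computed from complex baseband samples with NumPy/SciPy; frame lengths and overlaps are placeholder values, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import stft

def rffi_representations(iq, fs):
    """iq: complex baseband samples of one LoRa packet."""
    iq_repr = np.stack([iq.real, iq.imag])                     # (2, N) IQ channels
    fft_repr = np.abs(np.fft.fftshift(np.fft.fft(iq)))         # frequency domain
    _, _, Z = stft(iq, fs=fs, nperseg=256, noverlap=128,
                   return_onesided=False)
    spec_repr = 20 * np.log10(np.abs(Z) + 1e-12)               # spectrogram in dB
    return iq_repr, fft_repr, spec_repr
```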

60 citations


Journal ArticleDOI
TL;DR: The results strongly indicate that the proposed hybrid deep lightweight feature extractor is suitable for autism detection using EEG signals and is ready to serve as part of an adjunct tool that aids neurologists during autism diagnosis in medical centers.

58 citations


Journal ArticleDOI
TL;DR: By combining vibration-signal and image-processing techniques, the evaluation time and computational resource requirements are decreased, enabling more efficient and accurate analysis and opening up the possibility of real-time condition monitoring based on a basic vibration measurement.

48 citations


Proceedings ArticleDOI
15 Nov 2021
TL;DR: NELoRa as mentioned in this paper is a neural-enhanced LoRa demodulation method, exploiting the feature abstraction ability of deep learning to support ultra-low SNR LoRa communication.
Abstract: Low-Power Wide-Area Networks (LPWANs) are an emerging Internet-of-Things (IoT) paradigm marked by low-power and long-distance communication. Among them, LoRa is widely deployed for its unique characteristics and open-source technology. By adopting the Chirp Spread Spectrum (CSS) modulation, LoRa enables low signal-to-noise ratio (SNR) communication. However, the standard demodulation method does not fully exploit the properties of chirp signals, thus yielding a sub-optimal SNR threshold below which decoding fails. Consequently, the communication range and energy consumption have to be compromised for robust transmission. This paper presents NELoRa, a neural-enhanced LoRa demodulation method, exploiting the feature abstraction ability of deep learning to support ultra-low SNR LoRa communication. Taking the spectrogram of both amplitude and phase as input, we first design a mask-enabled Deep Neural Network (DNN) filter that extracts multi-dimension features to capture clean chirp symbols. Second, we develop a spectrogram-based DNN decoder to decode these chirp symbols accurately. Finally, we propose a generic packet demodulation system by incorporating a method that generates high-quality chirp symbols from received signals. We implement and evaluate NELoRa on both indoor and campus-scale outdoor testbeds. The results show that NELoRa achieves 1.84-2.35 dB SNR gains and extends the battery life by up to 272% (~0.38-1.51 years) on average for various LoRa configurations.

48 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, a two-stream convolutional network for audio recognition is proposed, which operates on time-frequency spectrogram inputs and achieves state-of-the-art results on both VGG-Sound and EPIC-KITCHENS-100 datasets.
Abstract: We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state- of-the-art results on both.

48 citations


Journal ArticleDOI
Liu Feng, Shen Tongsheng, Luo Zailei, Zhao Dexin, Guo Shaojun
TL;DR: The proposed model handles the recognition of underwater targets in three steps: feature extraction, data augmentation, and a deep neural network, using a convolutional recurrent neural network for acoustic target recognition.

47 citations


Proceedings ArticleDOI
19 Jan 2021
TL;DR: In this article, the effectiveness of log-Mel spectrogram and MFCC features for Alzheimer's dementia (AD) recognition on ADReSS challenge dataset was explored using three different deep neural networks (DNN) for AD recognition and mini-mental state examination (MMSE) score prediction.
Abstract: In this work, we explore the effectiveness of log-Mel spectrogram and MFCC features for Alzheimer’s dementia (AD) recognition on the ADReSS challenge dataset. We use three different deep neural networks (DNN) for AD recognition and mini-mental state examination (MMSE) score prediction: (i) convolutional neural network followed by a long-short term memory network (CNN-LSTM), (ii) pre-trained ResNet18 network followed by LSTM (ResNet-LSTM), and (iii) pyramidal bidirectional LSTM followed by a CNN (pBLSTM-CNN). CNN-LSTM achieves an accuracy of 64.58% with MFCC features and ResNet-LSTM achieves an accuracy of 62.5% using log-Mel spectrograms. pBLSTM-CNN and ResNet-LSTM models achieve root mean square errors (RMSE) of 5.9 and 5.98 in the MMSE score prediction, using the log-Mel spectrograms. Our results beat the baseline accuracy (62.5%) and RMSE (6.14) reported for acoustic features on the ADReSS challenge dataset. The results suggest that log-Mel spectrograms and MFCCs are effective features for the AD recognition problem when used with DNN models.
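For reference, a short sketch of the two acoustic front-ends compared above, log-Mel spectrograms and MFCCs, using librosa; the window, hop, and bin counts are illustrative defaults rather than the settings used on the ADReSS recordings.

```python
import librosa

def ad_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64,
                                         n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                         # (64, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # (13, frames)
    return log_mel, mfcc
```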

45 citations


Proceedings ArticleDOI
10 May 2021
TL;DR: In this article, the authors designed an RFFI scheme for Long Range (LoRa) systems based on spectrogram and convolutional neural network (CNN) to represent the fine-grained time-frequency characteristics of LoRa signals.
Abstract: Radio frequency fingerprint identification (RFFI) is an emerging device authentication technique that relies on intrinsic hardware characteristics of wireless devices. We designed an RFFI scheme for Long Range (LoRa) systems based on spectrograms and a convolutional neural network (CNN). Specifically, we used spectrograms to represent the fine-grained time-frequency characteristics of LoRa signals. In addition, we revealed that the instantaneous carrier frequency offset (CFO) is drifting, which will result in misclassification and significantly compromise the system stability; we demonstrated that CFO compensation is an effective mitigation. Finally, we designed a hybrid classifier that can adjust CNN outputs with the estimated CFO. The mean value of the CFO remains relatively stable, hence it can be used to rule out CNN predictions whose estimated CFO falls out of range. We performed experiments in real wireless environments using 20 LoRa devices under test (DUTs) and a Universal Software Radio Peripheral (USRP) N210 receiver. Compared with the IQ-based and FFT-based RFFI schemes, our spectrogram-based scheme reaches the best classification accuracy, i.e., 97.61% for 20 LoRa DUTs.
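The hybrid decision rule described above can be sketched as follows: CNN class probabilities are combined with a per-device CFO reference, and candidate devices whose reference CFO is far from the estimated CFO are ruled out. The tolerance value and the reference table are hypothetical.

```python
import numpy as np

def hybrid_predict(cnn_probs, estimated_cfo, cfo_reference, tol_hz=200.0):
    """cnn_probs: (num_devices,) softmax output for one packet.
    cfo_reference: (num_devices,) mean CFO previously measured per device."""
    plausible = np.abs(cfo_reference - estimated_cfo) <= tol_hz
    if not plausible.any():            # all devices implausible: fall back to CNN
        return int(np.argmax(cnn_probs))
    adjusted = np.where(plausible, cnn_probs, 0.0)   # rule out implausible devices
    return int(np.argmax(adjusted))
```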

Posted ContentDOI
TL;DR: This work proposes to estimate phases by estimating complex ideal ratio masks (cIRMs) where it decouple the estimation of cIRMs into magnitude and phase estimations, and extends the separation method to effectively allow the magnitude of the mask to be larger than 1.
Abstract: Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks between 0 and 1 while we observe that 22% of time-frequency bins have ideal ratio mask values of over 1 in a popular dataset, MUSDB18, 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs), where we decouple the estimation of cIRMs into magnitude and phase estimations. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our proposed system achieves a state-of-the-art MSS result on the MUSDB18 dataset; in particular, an SDR of 8.98 dB on vocals, outperforming the previous best performance of 7.24 dB. The source code is available at: this https URL
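A minimal sketch of the decoupled complex-mask idea: the network is assumed to predict a magnitude mask (allowed to exceed 1) and a phase term separately, and the source STFT is obtained by scaling and rotating the mixture STFT. Variable names are illustrative, not the authors' code.

```python
import numpy as np

def apply_decoupled_cirm(mix_stft, mask_mag, mask_phase):
    """mix_stft: complex (F, T) mixture STFT; mask_mag: non-negative (F, T),
    may exceed 1; mask_phase: phase correction in radians, shape (F, T)."""
    est_stft = mix_stft * mask_mag * np.exp(1j * mask_phase)   # scale and rotate
    return est_stft                   # inverse STFT of this gives the waveform
```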

Journal ArticleDOI
TL;DR: This paper proposes a methodology to combine three different time–frequency representations of the signals by computing continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into a 3D-channel spectrogram to analyze speech in two different applications: automatic detection of speech deficits in cochlear implant users and phoneme class recognition to extract phone-attribute features.
Abstract: Time–frequency representations of the speech signals provide dynamic information about how the frequency component changes with time. In order to process this information, deep learning models with convolution layers can be used to obtain feature maps. In many speech processing applications, the time–frequency representations are obtained by applying the short-time Fourier transform and using single-channel input tensors to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology to combine three different time–frequency representations of the signals by computing the continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into 3D-channel spectrograms to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features. For this, two different deep learning-based models are considered: convolutional neural networks and recurrent neural networks with convolution layers.
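A minimal sketch of the 3D-channel construction, assuming the three time-frequency maps (CWT scalogram, Mel-spectrogram, Gammatone spectrogram) have already been computed: they are resized to a common grid and stacked as the channels of a single CNN input tensor. The nearest-neighbour resize is a simplification.

```python
import numpy as np

def resize_tf(tf_map, shape):
    """Nearest-neighbour resize of a 2-D time-frequency map."""
    rows = np.linspace(0, tf_map.shape[0] - 1, shape[0]).astype(int)
    cols = np.linspace(0, tf_map.shape[1] - 1, shape[1]).astype(int)
    return tf_map[np.ix_(rows, cols)]

def make_3d_channel_spectrogram(cwt_map, mel_map, gamma_map, shape=(128, 128)):
    """Stack three time-frequency maps as channels of one CNN input."""
    channels = [resize_tf(m, shape) for m in (cwt_map, mel_map, gamma_map)]
    return np.stack(channels, axis=0)                          # (3, 128, 128)
```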

Journal ArticleDOI
TL;DR: A deep neural network (DNN) model based on a two-dimensional convolutional neural network and gated recurrent unit (GRU) for speaker identification is proposed and the experimental results showed that the proposed DNN model, which is called deep GRU, achieved a high recognition accuracy of 98.96%.
Abstract: Speaker identification is a classification task which aims to identify a subject from a given time-series sequential data. Since the speech signal is a continuous one-dimensional time series, most of the current research methods are based on convolutional neural network (CNN) or recurrent neural network (RNN). Indeed, these methods perform well in many tasks, but there is no attempt to combine these two network models to study the speaker identification task. Due to the spectrogram that a speech signal contains, the spatial features of voiceprint (which corresponds to the voice spectrum) and CNN are effective for spatial feature extraction (which corresponds to modeling spectral correlations in acoustic features). At the same time, the speech signal is in a time series, and deep RNN can better represent long utterances than shallow networks. Considering the advantage of gated recurrent unit (GRU) (compared with traditional RNN) in the segmentation of sequence data, we decide to use stacked GRU layers in our model for frame-level feature extraction. In this paper, we propose a deep neural network (DNN) model based on a two-dimensional convolutional neural network (2-D CNN) and gated recurrent unit (GRU) for speaker identification. In the network model design, the convolutional layer is used for voiceprint feature extraction and reduces dimensionality in both the time and frequency domains, allowing for faster GRU layer computation. In addition, the stacked GRU recurrent network layers can learn a speaker’s acoustic features. During this research, we tried to use various neural network structures, including 2-D CNN, deep RNN, and deep LSTM. The above network models were evaluated on the Aishell-1 speech dataset. The experimental results showed that our proposed DNN model, which we call deep GRU, achieved a high recognition accuracy of 98.96%. At the same time, the results also demonstrate the effectiveness of the proposed deep GRU network model versus other models for speaker identification. Through further optimization, this method could be applied to other research similar to the study of speaker identification.
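A hedged PyTorch sketch of the 2-D CNN plus stacked-GRU layout described above: convolutions reduce the spectrogram in both time and frequency, and GRU layers then model the resulting frame sequence. Layer sizes, the speaker count, and the final pooling choice are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CnnGruSpeakerId(nn.Module):
    def __init__(self, n_mels=64, n_speakers=340):
        super().__init__()
        # 2-D convolutions halve time and frequency twice (factor 4 overall)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=256,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(256, n_speakers)

    def forward(self, spec):                 # spec: (B, 1, n_mels, T)
        x = self.conv(spec)                  # (B, 64, F', T')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # frame sequence
        out, _ = self.gru(x)                 # stacked GRU over frames
        return self.fc(out[:, -1])           # last frame state -> speaker logits
```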

Journal ArticleDOI
TL;DR: In this paper, a deep neural network architecture incorporating Connectionist Temporal Classification (CTC) loss for discrete speech emotion recognition (SER) is presented, which uses parallel convolutional layers (PCN) integrated with Squeeze-and-Excitation Network (SEnet) to extract relationships from 3D spectrograms across timesteps and frequencies; here, they use the log-Mel spectrogram with deltas and delta-deltas as input.

Journal ArticleDOI
TL;DR: A novel approach, based on attention guided 3D convolutional neural networks (CNN)-long short-term memory (LSTM) model, is proposed for speech based emotion recognition and it is seen that the proposed method outperforms the compared methods.

Journal ArticleDOI
TL;DR: The proposed radar-based fall detection technique based on time-frequency analysis and convolutional neural networks employs high-level feature learning, which distinguishes it from previously studied methods that use heuristic feature extraction.
Abstract: Automatic detection of a falling person based on noncontact sensing is a challenging problem with applications in smart homes for elderly care. In this article, we propose a radar-based fall detection technique based on time-frequency analysis and convolutional neural networks. The time-frequency analysis is performed by applying the short-time Fourier transform to each radar return signal. The resulting spectrograms are converted into binary images, which are fed into the convolutional neural network. The network is trained using labeled examples of fall and nonfall activities. Our method employs high-level feature learning, which distinguishes it from previously studied methods that use heuristic feature extraction. The performance of the proposed method is evaluated by conducting several experiments on a set of radar return signals. We show that our method distinguishes falls from nonfalls with 98.37% precision and 97.82% specificity, while maintaining a low false-alarm rate, which is superior to existing methods. We also show that our proposed method is robust in that it successfully distinguishes falls from nonfalls when trained on subjects in one room, but tested on different subjects in a different room. In the proposed convolutional neural network, the hierarchical features extracted from the radar return signals are the key to understand the fundamental composition of human activities and determine whether or not a fall has occurred during human daily activities. Our method may be extended to other radar-based applications such as apnea detection and gesture detection.
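The preprocessing described above can be sketched in a few lines: the radar return is converted to a spectrogram with the STFT and thresholded into a binary image before being fed to the CNN. The normalization and threshold rule are assumptions.

```python
import numpy as np
from scipy.signal import stft

def radar_binary_spectrogram(radar_return, fs, db_threshold=-40.0):
    """STFT spectrogram of a radar return, thresholded into a binary image."""
    _, _, Z = stft(radar_return, fs=fs, nperseg=256, noverlap=192)
    spec_db = 20 * np.log10(np.abs(Z) + 1e-12)
    spec_db -= spec_db.max()                    # normalize so the peak is 0 dB
    return (spec_db > db_threshold).astype(np.uint8)   # binary CNN input
```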

Journal ArticleDOI
25 Jun 2021-PLOS ONE
TL;DR: The findings of this study suggest that the DL based structure could discover important biomarkers for efficient and automatic diagnosis of ASD from EEG and may assist to develop computer-aided diagnosis system.
Abstract: Autism spectrum disorder (ASD) is a developmental disability characterized by persistent impairments in social interaction, speech and nonverbal communication, and restricted or repetitive behaviors. Currently, electroencephalography (EEG) is the most popular tool to inspect the existence of neurological disorders like autism biomarkers due to its low setup cost, high temporal resolution and wide availability. Generally, EEG recordings produce vast amounts of data with dynamic behavior, which are visually analyzed by professional clinicians to detect autism. This is laborious, expensive, subjective, error prone and has reliability issues. Therefore, this study intends to develop an efficient diagnostic framework based on time-frequency spectrogram images of EEG signals to automatically identify ASD. In the proposed system, primarily, the raw EEG signals are pre-processed using re-referencing, filtering and normalization. Then, the Short-Time Fourier Transform is used to transform the pre-processed signals into two-dimensional spectrogram images. Afterward, those images are evaluated by machine learning (ML) and deep learning (DL) models, separately. In the ML process, textural features are extracted, significant features are selected using principal component analysis, and these are fed to six different ML classifiers for classification. In the DL process, three different convolutional neural network models are tested. The proposed DL based model achieves higher accuracy (99.15%) compared to the ML based model (95.25%) on an ASD EEG dataset and also outperforms existing methods. The findings of this study suggest that the DL based structure could discover important biomarkers for efficient and automatic diagnosis of ASD from EEG and may assist in developing a computer-aided diagnosis system.

Proceedings ArticleDOI
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu
06 Jun 2021
TL;DR: Parallel Tacotron as mentioned in this paper uses a variational autoencoder-based residual encoder for text-to-speech models, which is highly parallelizable during both training and inference.
Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.

Journal ArticleDOI
TL;DR: A novel spatiotemporal and frequential cascaded attention network with large-margin learning is proposed that achieves a promising performance in speech emotion recognition.

Journal ArticleDOI
TL;DR: In this article, a robust deep learning framework for auscultation analysis is proposed, which aims to classify anomalies in respiratory cycles and detect diseases, from respiratory sound recordings, by using a back-end deep learning network to classify the spectrogram features into categories of respiratory anomaly cycles or diseases.
Abstract: This paper presents and explores a robust deep learning framework for auscultation analysis. This aims to classify anomalies in respiratory cycles and detect diseases from respiratory sound recordings. The framework begins with front-end feature extraction that transforms input sound into a spectrogram representation. Then, a back-end deep learning network is used to classify the spectrogram features into categories of respiratory anomaly cycles or diseases. Experiments, conducted over the ICBHI benchmark dataset of respiratory sounds, confirm three main contributions towards respiratory-sound analysis. Firstly, we carry out an extensive exploration of the effect of spectrogram types, spectral-time resolution, overlapping/non-overlapping windows, and data augmentation on final prediction accuracy. Secondly, this leads us to propose a novel deep learning system, built on the proposed framework, which outperforms current state-of-the-art methods. Finally, we apply a Teacher-Student scheme to achieve a trade-off between model performance and model complexity, which holds promise for building real-time applications.
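A minimal sketch of the Teacher-Student trade-off mentioned above (standard knowledge distillation, which may differ in detail from the authors' scheme): a smaller student is trained to match the teacher's softened predictions as well as the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    """Blend a soft teacher-matching term with the usual cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```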

Proceedings ArticleDOI
01 Oct 2021
TL;DR: In this article, the authors use spiking neurons to compute the Short Time Fourier Transform (STFT) with similar computational complexity but 47x less output bandwidth than the conventional STFT.
Abstract: The biologically inspired spiking neurons used in neuromorphic computing are nonlinear filters with dynamic state variables—very different from the stateless neuron models used in deep learning. The next version of Intel's neuromorphic research processor, Loihi 2, supports a wide range of stateful spiking neuron models with fully programmable dynamics. Here we showcase advanced spiking neuron models that can be used to efficiently process streaming data in simulation experiments on emulated Loihi 2 hardware. In one example, Resonate-and-Fire (RF) neurons are used to compute the Short Time Fourier Transform (STFT) with similar computational complexity but 47x less output bandwidth than the conventional STFT. In another example, we describe an algorithm for optical flow estimation using spatiotemporal RF neurons that requires over 90x fewer operations than a conventional DNN-based solution. We also demonstrate promising preliminary results using backpropagation to train RF neurons for audio classification tasks. Finally, we show that a cascade of Hopf resonators—a variant of the RF neuron—replicates novel properties of the cochlea and motivates an efficient spike-based spectrogram encoder.
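A plain NumPy illustration (not Loihi 2 code) of the Resonate-and-Fire idea mentioned above: each neuron is modeled as a damped complex oscillator, and a bank of them tuned to different frequencies accumulates an STFT-like time-frequency map; spiking and thresholding are omitted.

```python
import numpy as np

def rf_bank_response(x, freqs_hz, fs, decay=0.99):
    """x: real input samples; freqs_hz: resonant frequencies of the neuron bank."""
    omegas = 2 * np.pi * np.asarray(freqs_hz) / fs
    kernel = decay * np.exp(1j * omegas)           # per-neuron complex recurrence
    z = np.zeros(len(freqs_hz), dtype=complex)     # neuron state variables
    out = np.empty((len(x), len(freqs_hz)))
    for t, sample in enumerate(x):
        z = kernel * z + sample                    # resonate step (damped rotation)
        out[t] = np.abs(z)                         # magnitude ~ one spectrogram row
    return out                                     # (time, frequency) map
```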

Journal ArticleDOI
TL;DR: This work presents Wi-Sense—a human activity recognition system that uses a convolutional neural network (CNN) to recognize human activities based on the environment-independent fingerprints extracted from the Wi-Fi channel state information (CSI).
Abstract: A human activity recognition (HAR) system acts as the backbone of many human-centric applications, such as active assisted living and in-home monitoring for elderly and physically impaired people. Although existing Wi-Fi-based human activity recognition methods report good results, their performance is affected by the changes in the ambient environment. In this work, we present Wi-Sense—a human activity recognition system that uses a convolutional neural network (CNN) to recognize human activities based on the environment-independent fingerprints extracted from the Wi-Fi channel state information (CSI). First, Wi-Sense captures the CSI by using a standard Wi-Fi network interface card. Wi-Sense applies the CSI ratio method to reduce the noise and the impact of the phase offset. In addition, it applies the principal component analysis to remove redundant information. This step not only reduces the data dimension but also removes the environmental impact. Thereafter, we compute the processed data spectrogram which reveals environment-independent time-variant micro-Doppler fingerprints of the performed activity. We use these spectrogram images to train a CNN. We evaluate our approach by using a human activity data set collected from nine volunteers in an indoor environment. Our results show that Wi-Sense can recognize these activities with an overall accuracy of 97.78%. To stress on the applicability of the proposed Wi-Sense system, we provide an overview of the standards involved in the health information systems and systematically describe how Wi-Sense HAR system can be integrated into the eHealth infrastructure.

Journal ArticleDOI
TL;DR: This work proposes to use Nonnegative Matrix Factorization of the spectrogram to separate cyclic and non-cyclic impulsive components in the presence of non-Gaussian impulsive noise, allowing the impulsive signal of interest (bearing damage) to be detected and extracted in the presence of a high-amplitude non-cyclic impulsive signal.

Proceedings Article
03 May 2021
TL;DR: DiffWave as mentioned in this paper is a diffusion probabilistic model for conditional and unconditional waveform generation, which is non-autoregressive and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis.
Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
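A hedged sketch of a diffusion training step of the kind DiffWave builds on: a clean waveform is noised with the closed-form forward process and a network is trained to predict the added noise. The linear noise schedule and the eps_model(x_t, t, mel_cond) interface are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0, mel_cond, T=50):
    """x0: (B, N) clean waveforms; eps_model is a hypothetical noise predictor."""
    betas = torch.linspace(1e-4, 0.05, T)                  # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.shape[0],))                # random step per clip
    a = alpha_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1)
    noise = torch.randn_like(x0)
    x_t = a * x0 + s * noise                               # sample q(x_t | x_0)
    pred = eps_model(x_t, t, mel_cond)                     # condition on mel spec
    return F.mse_loss(pred, noise)                         # train to predict noise
```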

Journal ArticleDOI
TL;DR: A novel (spectrogram) representation called the Quarter-spectrogram (Q-Spectrogram) that squeezes temporal and frequency information for input to CNN models and a simple WiFi classification scheme that buffers several WiFi Q-spectrograms and then makes a decision about WiFi’s presence and also gives a quantified measure of WiFi traffic density.
Abstract: Shared spectrum usage is inevitable due to the ongoing increase in wireless services and bandwidth requirements. Spectrum monitoring is a key enabler for efficient spectrum sharing by multiple radio access technologies (RATs). In this paper, we present signal classification using deep neural networks to identify various radio technologies and their associated interferences. We use Convolutional Neural Networks (CNN) to perform signal classification and employ six well-known CNN models to train for ten signal classes. These classes include LTE, Radar, WiFi and FBMC (Filter Bank Multicarrier) and their interference combinations, which include LTE+Radar, LTE+WiFi, FBMC+Radar, FBMC+WiFi, WiFi+Radar and Noise. The CNN models include AlexNet, VGG16, ResNet18, SqueezeNet, InceptionV3 and ResNet50. The radio signal data sets for training and testing of the CNN-based classifiers are acquired using a USRP-based experimental setup. Extensive measurements of these radio technologies (LTE, WiFi, Radar and FBMC) are done over different locations and times to generate a robust dataset. We propose a novel (spectrogram) representation called the Quarter-spectrogram (Q-spectrogram) that squeezes temporal and frequency information for input to CNN models. Considering classification accuracy, model complexity and prediction time for a single input Q-spectrogram (image), ResNet18 gives the best overall performance with 98% classification accuracy, while SqueezeNet offers the lowest model complexity, which makes it very suitable for resource-constrained radio monitoring devices, as well as the shortest prediction time of 110 ms. Moreover, we also propose a simple WiFi classification scheme that buffers several WiFi Q-spectrograms and then makes a decision about WiFi’s presence and also gives a quantified measure of WiFi traffic density.

Journal ArticleDOI
TL;DR: The proposed Adaptive Multi-Trace Carving (AMTC) algorithm is a unified approach for detecting and tracking one or more subtle frequency components under very low signal-to-noise ratio (SNR) conditions and in near real time.
Abstract: In the field of information forensics, many emerging problems involve a critical step that estimates and tracks weak frequency components in noisy signals. It is often challenging for the prior art of frequency tracking to i) achieve a high accuracy under noisy conditions, ii) detect and track multiple frequency components efficiently, or iii) strike a good trade-off of the processing delay versus the resilience and the accuracy of tracking. To address these issues, we propose Adaptive Multi-Trace Carving (AMTC), a unified approach for detecting and tracking one or more subtle frequency components under very low signal-to-noise ratio (SNR) conditions and in near real time. AMTC takes as input a time-frequency representation of the system’s preprocessing results (such as the spectrogram), and identifies frequency components through iterative dynamic programming and adaptive trace compensation. The proposed algorithm considers relatively high energy traces sustaining over a certain duration as an indicator of the presence of frequency/oscillation components of interest and track their time-varying trend. Extensive experiments using both synthetic data and real-world forensic data of power signatures and physiological monitoring reveal that the proposed method outperforms representative prior art under low SNR conditions, and can be implemented in near real-time settings. The proposed AMTC algorithm can empower the development of new information forensic technologies that harness very small signals.
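The dynamic-programming core suggested by the abstract can be sketched as a single-trace simplification: find the frequency trace through a spectrogram that accumulates the most energy while moving at most a few bins between frames. This is not the full AMTC algorithm (no adaptive trace compensation or multi-trace iteration).

```python
import numpy as np

def max_energy_trace(spec, max_jump=2):
    """spec: (freq_bins, frames) non-negative energy map (e.g., a spectrogram)."""
    F_, T = spec.shape
    score = np.full((F_, T), -np.inf)
    back = np.zeros((F_, T), dtype=int)
    score[:, 0] = spec[:, 0]
    for t in range(1, T):
        for f in range(F_):
            lo, hi = max(0, f - max_jump), min(F_, f + max_jump + 1)
            prev = lo + int(np.argmax(score[lo:hi, t - 1]))
            score[f, t] = spec[f, t] + score[prev, t - 1]   # best reachable path
            back[f, t] = prev
    trace = np.empty(T, dtype=int)
    trace[-1] = int(np.argmax(score[:, -1]))
    for t in range(T - 1, 0, -1):                           # backtrack the trace
        trace[t - 1] = back[trace[t], t]
    return trace                                            # frequency bin per frame
```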

Proceedings ArticleDOI
09 Sep 2021
TL;DR: In this paper, a multi-modal, multi-domain deep learning framework is proposed to fuse the ultrasonic Doppler features and the audible speech spectrogram, and an adversarially trained discriminator is employed to learn the correlation between the two heterogeneous feature modalities.
Abstract: Robust speech enhancement is considered as the holy grail of audio processing and a key requirement for human-human and human-machine interaction. Solving this task with single-channel, audio-only methods remains an open challenge, especially for practical scenarios involving a mixture of competing speakers and background noise. In this paper, we propose UltraSE, which uses ultrasound sensing as a complementary modality to separate the desired speaker's voice from interferences and noise. UltraSE uses a commodity mobile device (e.g., smartphone) to emit ultrasound and capture the reflections from the speaker's articulatory gestures. It introduces a multi-modal, multi-domain deep learning framework to fuse the ultrasonic Doppler features and the audible speech spectrogram. Furthermore, it employs an adversarially trained discriminator, based on a cross-modal similarity measurement network, to learn the correlation between the two heterogeneous feature modalities. Our experiments verify that UltraSE simultaneously improves speech intelligibility and quality, and outperforms state-of-the-art solutions by a large margin.


Journal ArticleDOI
TL;DR: In this study, a multi-device operation monitoring system based on sound analysis is developed and applied successfully in monitoring experiments in two different environments: a workshop in which a hand-operated device was used and a factory with a computer numerical control machine.