
Showing papers on "Spectrogram published in 2021"


Proceedings ArticleDOI
05 Apr 2021
TL;DR: The Audio Spectrogram Transformer (AST) as mentioned in this paper is the first convolution-free, purely attention-based model for audio classification, which achieves state-of-the-art results on various audio classification benchmarks.
Abstract: In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
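As a rough illustration of the input pipeline described above, the sketch below (not the authors' code) turns a waveform into a log-Mel spectrogram and cuts it into fixed-size patches, the kind of token sequence a convolution-free spectrogram transformer consumes; the 16 kHz sample rate, Mel settings, and non-overlapping 16x16 patching are illustrative assumptions (AST itself uses overlapping patches).

```python
# Sketch: log-Mel spectrogram -> flattened 16x16 patches (illustrative settings).
import numpy as np
import librosa

def log_mel_patches(wav_path, n_mels=128, patch=16):
    y, sr = librosa.load(wav_path, sr=16000)                  # mono, 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                         # (n_mels, frames)
    # trim so both axes divide evenly by the patch size
    f = (logmel.shape[0] // patch) * patch
    t = (logmel.shape[1] // patch) * patch
    logmel = logmel[:f, :t]
    # split into non-overlapping patch x patch tiles and flatten each tile
    tiles = (logmel.reshape(f // patch, patch, t // patch, patch)
                   .transpose(0, 2, 1, 3)
                   .reshape(-1, patch * patch))
    return tiles                                              # (num_patches, 256)
```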

103 citations


Journal ArticleDOI
TL;DR: The results show the effectiveness, robustness, and high accuracy of the proposed approach, which achieves meaningful data augmentation by applying variations directly to the audio clips.

75 citations


Journal ArticleDOI
TL;DR: In this paper, the implicit compensation between estimated magnitude and phase was analyzed for monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions.
Abstract: Deep neural network (DNN) based end-to-end optimization in the complex time-frequency (T-F) domain or time domain has shown considerable potential in monaural speech separation. Many recent studies optimize loss functions defined solely in the time or complex domain, without including a loss on magnitude. Although such loss functions typically produce better scores if the evaluation metrics are objective time-domain metrics, they produce worse scores on speech quality and intelligibility metrics and usually lead to worse speech recognition performance, compared with including a loss on magnitude. While this phenomenon has been experimentally observed by many studies, it is often not accurately explained, and a thorough understanding of its fundamental cause is still lacking. This letter provides a novel view from the perspective of the implicit compensation between estimated magnitude and phase. Analytical results based on monaural speech separation and robust automatic speech recognition (ASR) tasks in noisy-reverberant conditions support the validity of our view.
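To make the distinction concrete, here is a minimal sketch of the two loss families the letter contrasts: a loss computed on complex STFT values versus one that also constrains the magnitude. The STFT settings and L1 formulation are assumptions for illustration, not the authors' training setup.

```python
import numpy as np
from scipy.signal import stft

def separation_losses(est, ref, fs=16000):
    """est, ref: time-domain estimate and reference signals of equal length."""
    _, _, E = stft(est, fs=fs, nperseg=512, noverlap=256)
    _, _, R = stft(ref, fs=fs, nperseg=512, noverlap=256)
    complex_l1 = np.mean(np.abs(E - R))                        # complex-domain loss
    magnitude_l1 = np.mean(np.abs(np.abs(E) - np.abs(R)))      # magnitude loss
    # Training only on complex_l1 lets magnitude errors be traded against phase
    # errors (the "implicit compensation"); adding magnitude_l1 penalizes that.
    return complex_l1, magnitude_l1
```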

60 citations


Journal ArticleDOI
TL;DR: A hybrid classifier that can adjust the prediction of deep learning models with the estimated CFO is designed to further increase the classification accuracy of the deep learning-based RFFI scheme for Long Range (LoRa) systems.
Abstract: Radio frequency fingerprint identification (RFFI) is an emerging device authentication technique that relies on the intrinsic hardware characteristics of wireless devices. This paper designs a deep learning-based RFFI scheme for Long Range (LoRa) systems. Firstly, the instantaneous carrier frequency offset (CFO) is found to drift, which could result in misclassification and significantly compromise the stability of the deep learning-based RFFI system. CFO compensation is demonstrated to be effective mitigation. Secondly, three signal representations for deep learning-based RFFI are investigated in time, frequency, and time-frequency domains, namely in-phase and quadrature (IQ) samples, fast Fourier transform (FFT) results and spectrograms, respectively. For these signal representations, three deep learning models are implemented, i.e., multilayer perceptron (MLP), long short-term memory (LSTM) network and convolutional neural network (CNN), in order to explore an optimal framework. Finally, a hybrid classifier that can adjust the prediction of deep learning models with the estimated CFO is designed to further increase the classification accuracy. The CFO will not change dramatically over several continuous days, hence it can be used to correct predictions when the estimated CFO is much different from the reference one. Experimental evaluation is performed in real wireless environments involving 25 LoRa devices and a Universal Software Radio Peripheral (USRP) N210 platform. The spectrogram-CNN model is found to be optimal for classifying LoRa devices which can reach an accuracy of 96.40% with the least complexity and training time.
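A minimal sketch of the three signal representations compared above (IQ samples, FFT results, and spectrograms), computed from complex baseband samples with NumPy/SciPy; frame lengths and overlaps are placeholder values, not the paper's exact settings.

```python
import numpy as np
from scipy.signal import stft

def rffi_representations(iq, fs):
    """iq: complex baseband samples of one LoRa packet."""
    iq_repr = np.stack([iq.real, iq.imag])                     # (2, N) IQ channels
    fft_repr = np.abs(np.fft.fftshift(np.fft.fft(iq)))         # frequency domain
    _, _, Z = stft(iq, fs=fs, nperseg=256, noverlap=128,
                   return_onesided=False)
    spec_repr = 20 * np.log10(np.abs(Z) + 1e-12)               # spectrogram in dB
    return iq_repr, fft_repr, spec_repr
```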

60 citations


Journal ArticleDOI
TL;DR: The results strongly indicate that the proposed hybrid deep lightweight feature extractor is suitable for autism detection using EEG signals and is ready to serve as part of an adjunct tool that aids neurologists during autism diagnosis in medical centers.

58 citations


Journal ArticleDOI
TL;DR: By combining vibration-signal and image-processing techniques, the evaluation time and computational resource requirements are decreased, enabling more efficient and accurate analysis and opening up the possibility of real-time condition monitoring based on a basic vibration measurement.

48 citations


Proceedings ArticleDOI
15 Nov 2021
TL;DR: NELoRa as mentioned in this paper is a neural-enhanced LoRa demodulation method, exploiting the feature abstraction ability of deep learning to support ultra-low SNR LoRa communication.
Abstract: Low-Power Wide-Area Networks (LPWANs) are an emerging Internet-of-Things (IoT) paradigm marked by low-power and long-distance communication. Among them, LoRa is widely deployed for its unique characteristics and open-source technology. By adopting the Chirp Spread Spectrum (CSS) modulation, LoRa enables low signal-to-noise ratio (SNR) communication. However, the standard demodulation method does not fully exploit the properties of chirp signals, thus yielding a sub-optimal SNR threshold below which decoding fails. Consequently, the communication range and energy consumption have to be compromised for robust transmission. This paper presents NELoRa, a neural-enhanced LoRa demodulation method, exploiting the feature abstraction ability of deep learning to support ultra-low SNR LoRa communication. Taking the spectrogram of both amplitude and phase as input, we first design a mask-enabled Deep Neural Network (DNN) filter that extracts multi-dimension features to capture clean chirp symbols. Second, we develop a spectrogram-based DNN decoder to decode these chirp symbols accurately. Finally, we propose a generic packet demodulation system by incorporating a method that generates high-quality chirp symbols from received signals. We implement and evaluate NELoRa on both indoor and campus-scale outdoor testbeds. The results show that NELoRa achieves 1.84-2.35 dB SNR gains and extends the battery life by up to 272% (~0.38-1.51 years) on average for various LoRa configurations.

48 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, a two-stream convolutional network for audio recognition is proposed, which operates on time-frequency spectrogram inputs and achieves state-of-the-art results on both VGG-Sound and EPIC-KITCHENS-100 datasets.
Abstract: We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state- of-the-art results on both.

48 citations


Journal ArticleDOI
Liu Feng, Shen Tongsheng, Luo Zailei, Zhao Dexin, Guo Shaojun
TL;DR: The proposed model handles the recognition of underwater targets in three steps: feature extraction, data augmentation, and a deep neural network, using a convolutional recurrent neural network for acoustic target recognition.

47 citations


Proceedings ArticleDOI
19 Jan 2021
TL;DR: In this article, the effectiveness of log-Mel spectrogram and MFCC features for Alzheimer's dementia (AD) recognition on ADReSS challenge dataset was explored using three different deep neural networks (DNN) for AD recognition and mini-mental state examination (MMSE) score prediction.
Abstract: In this work, we explore the effectiveness of log-Mel spectrogram and MFCC features for Alzheimer’s dementia (AD) recognition on the ADReSS challenge dataset. We use three different deep neural networks (DNN) for AD recognition and mini-mental state examination (MMSE) score prediction: (i) convolutional neural network followed by a long-short term memory network (CNN-LSTM), (ii) pre-trained ResNet18 network followed by LSTM (ResNet-LSTM), and (iii) pyramidal bidirectional LSTM followed by a CNN (pBLSTM-CNN). CNN-LSTM achieves an accuracy of 64.58% with MFCC features and ResNet-LSTM achieves an accuracy of 62.5% using log-Mel spectrograms. pBLSTM-CNN and ResNet-LSTM models achieve root mean square errors (RMSE) of 5.9 and 5.98 in the MMSE score prediction, using the log-Mel spectrograms. Our results beat the baseline accuracy (62.5%) and RMSE (6.14) reported for acoustic features on the ADReSS challenge dataset. The results suggest that log-Mel spectrograms and MFCCs are effective features for the AD recognition problem when used with DNN models.
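For reference, a short sketch of the two acoustic front-ends compared above, log-Mel spectrograms and MFCCs, using librosa; the window, hop, and bin counts are illustrative defaults rather than the settings used on the ADReSS recordings.

```python
import librosa

def ad_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64,
                                         n_fft=400, hop_length=160)
    log_mel = librosa.power_to_db(mel)                         # (64, frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # (13, frames)
    return log_mel, mfcc
```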

45 citations


Proceedings ArticleDOI
10 May 2021
TL;DR: In this article, the authors designed an RFFI scheme for Long Range (LoRa) systems based on spectrogram and convolutional neural network (CNN) to represent the fine-grained time-frequency characteristics of LoRa signals.
Abstract: Radio frequency fingerprint identification (RFFI) is an emerging device authentication technique that relies on intrinsic hardware characteristics of wireless devices. We designed an RFFI scheme for Long Range (LoRa) systems based on spectrograms and a convolutional neural network (CNN). Specifically, we used spectrograms to represent the fine-grained time-frequency characteristics of LoRa signals. In addition, we revealed that the instantaneous carrier frequency offset (CFO) is drifting, which will result in misclassification and significantly compromise the system stability; we demonstrated that CFO compensation is an effective mitigation. Finally, we designed a hybrid classifier that can adjust CNN outputs with the estimated CFO. The mean value of the CFO remains relatively stable, hence it can be used to rule out CNN predictions whose estimated CFO falls out of range. We performed experiments in real wireless environments using 20 LoRa devices under test (DUTs) and a Universal Software Radio Peripheral (USRP) N210 receiver. Compared with the IQ-based and FFT-based RFFI schemes, our spectrogram-based scheme reaches the best classification accuracy, i.e., 97.61% for 20 LoRa DUTs.
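The hybrid decision rule described above can be sketched as follows: CNN class probabilities are combined with a per-device CFO reference, and candidate devices whose reference CFO is far from the estimated CFO are ruled out. The tolerance value and the reference table are hypothetical.

```python
import numpy as np

def hybrid_predict(cnn_probs, estimated_cfo, cfo_reference, tol_hz=200.0):
    """cnn_probs: (num_devices,) softmax output for one packet.
    cfo_reference: (num_devices,) mean CFO previously measured per device."""
    plausible = np.abs(cfo_reference - estimated_cfo) <= tol_hz
    if not plausible.any():            # all devices implausible: fall back to CNN
        return int(np.argmax(cnn_probs))
    adjusted = np.where(plausible, cnn_probs, 0.0)   # rule out implausible devices
    return int(np.argmax(adjusted))
```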

Posted ContentDOI
TL;DR: This work proposes to estimate phases by estimating complex ideal ratio masks (cIRMs) where it decouple the estimation of cIRMs into magnitude and phase estimations, and extends the separation method to effectively allow the magnitude of the mask to be larger than 1.
Abstract: Deep neural network based methods have been successfully applied to music source separation. They typically learn a mapping from a mixture spectrogram to a set of source spectrograms, all with magnitudes only. This approach has several limitations: 1) its incorrect phase reconstruction degrades the performance, 2) it limits the magnitude of masks between 0 and 1 while we observe that 22% of time-frequency bins have ideal ratio mask values of over 1 in a popular dataset, MUSDB18, 3) its potential on very deep architectures is under-explored. Our proposed system is designed to overcome these. First, we propose to estimate phases by estimating complex ideal ratio masks (cIRMs), where we decouple the estimation of cIRMs into magnitude and phase estimations. Second, we extend the separation method to effectively allow the magnitude of the mask to be larger than 1. Finally, we propose a residual UNet architecture with up to 143 layers. Our proposed system achieves a state-of-the-art MSS result on the MUSDB18 dataset; in particular, an SDR of 8.98 dB on vocals, outperforming the previous best performance of 7.24 dB. The source code is available at: this https URL
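A minimal sketch of the decoupled complex-mask idea: the network is assumed to predict a magnitude mask (allowed to exceed 1) and a phase term separately, and the source STFT is obtained by scaling and rotating the mixture STFT. Variable names are illustrative, not the authors' code.

```python
import numpy as np

def apply_decoupled_cirm(mix_stft, mask_mag, mask_phase):
    """mix_stft: complex (F, T) mixture STFT; mask_mag: non-negative (F, T),
    may exceed 1; mask_phase: phase correction in radians, shape (F, T)."""
    est_stft = mix_stft * mask_mag * np.exp(1j * mask_phase)   # scale and rotate
    return est_stft                   # inverse STFT of this gives the waveform
```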

Journal ArticleDOI
TL;DR: This paper proposes a methodology to combine three different time–frequency representations of the signals by computing continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into a 3D-channel spectrogram to analyze speech in two different applications: automatic detection of speech deficits in cochlear implant users and phoneme class recognition to extract phone-attribute features.
Abstract: Time–frequency representations of the speech signals provide dynamic information about how the frequency component changes with time. In order to process this information, deep learning models with convolution layers can be used to obtain feature maps. In many speech processing applications, the time–frequency representations are obtained by applying the short-time Fourier transform and using single-channel input tensors to feed the models. However, this may limit the potential of convolutional networks to learn different representations of the audio signal. In this paper, we propose a methodology to combine three different time–frequency representations of the signals by computing the continuous wavelet transform, Mel-spectrograms, and Gammatone spectrograms and combining them into 3D-channel spectrograms to analyze speech in two different applications: (1) automatic detection of speech deficits in cochlear implant users and (2) phoneme class recognition to extract phone-attribute features. For this, two different deep learning-based models are considered: convolutional neural networks and recurrent neural networks with convolution layers.
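A minimal sketch of the 3D-channel construction, assuming the three time-frequency maps (CWT scalogram, Mel-spectrogram, Gammatone spectrogram) have already been computed: they are resized to a common grid and stacked as the channels of a single CNN input tensor. The nearest-neighbour resize is a simplification.

```python
import numpy as np

def resize_tf(tf_map, shape):
    """Nearest-neighbour resize of a 2-D time-frequency map."""
    rows = np.linspace(0, tf_map.shape[0] - 1, shape[0]).astype(int)
    cols = np.linspace(0, tf_map.shape[1] - 1, shape[1]).astype(int)
    return tf_map[np.ix_(rows, cols)]

def make_3d_channel_spectrogram(cwt_map, mel_map, gamma_map, shape=(128, 128)):
    """Stack three time-frequency maps as channels of one CNN input."""
    channels = [resize_tf(m, shape) for m in (cwt_map, mel_map, gamma_map)]
    return np.stack(channels, axis=0)                          # (3, 128, 128)
```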

Journal ArticleDOI
TL;DR: A deep neural network (DNN) model based on a two-dimensional convolutional neural network and gated recurrent unit (GRU) for speaker identification is proposed and the experimental results showed that the proposed DNN model, which is called deep GRU, achieved a high recognition accuracy of 98.96%.
Abstract: Speaker identification is a classification task which aims to identify a subject from a given time-series sequential data. Since the speech signal is a continuous one-dimensional time series, most of the current research methods are based on convolutional neural network (CNN) or recurrent neural network (RNN). Indeed, these methods perform well in many tasks, but there is no attempt to combine these two network models to study the speaker identification task. Due to the spectrogram that a speech signal contains, the spatial features of voiceprint (which corresponds to the voice spectrum) and CNN are effective for spatial feature extraction (which corresponds to modeling spectral correlations in acoustic features). At the same time, the speech signal is in a time series, and deep RNN can better represent long utterances than shallow networks. Considering the advantage of gated recurrent unit (GRU) (compared with traditional RNN) in the segmentation of sequence data, we decide to use stacked GRU layers in our model for frame-level feature extraction. In this paper, we propose a deep neural network (DNN) model based on a two-dimensional convolutional neural network (2-D CNN) and gated recurrent unit (GRU) for speaker identification. In the network model design, the convolutional layer is used for voiceprint feature extraction and reduces dimensionality in both the time and frequency domains, allowing for faster GRU layer computation. In addition, the stacked GRU recurrent network layers can learn a speaker’s acoustic features. During this research, we tried to use various neural network structures, including 2-D CNN, deep RNN, and deep LSTM. The above network models were evaluated on the Aishell-1 speech dataset. The experimental results showed that our proposed DNN model, which we call deep GRU, achieved a high recognition accuracy of 98.96%. At the same time, the results also demonstrate the effectiveness of the proposed deep GRU network model versus other models for speaker identification. Through further optimization, this method could be applied to other research similar to the study of speaker identification.
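A hedged PyTorch sketch of the 2-D CNN plus stacked-GRU layout described above: convolutions reduce the spectrogram in both time and frequency, and GRU layers then model the resulting frame sequence. Layer sizes, the speaker count, and the final pooling choice are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CnnGruSpeakerId(nn.Module):
    def __init__(self, n_mels=64, n_speakers=340):
        super().__init__()
        # 2-D convolutions halve time and frequency twice (factor 4 overall)
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=64 * (n_mels // 4), hidden_size=256,
                          num_layers=2, batch_first=True)
        self.fc = nn.Linear(256, n_speakers)

    def forward(self, spec):                 # spec: (B, 1, n_mels, T)
        x = self.conv(spec)                  # (B, 64, F', T')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # frame sequence
        out, _ = self.gru(x)                 # stacked GRU over frames
        return self.fc(out[:, -1])           # last frame state -> speaker logits
```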

Journal ArticleDOI
TL;DR: In this paper, a deep neural network architecture incorporating Connectionist Temporal Classification (CTC) loss for discrete speech emotion recognition (SER) is presented, which uses parallel convolutional layers (PCN) integrated with Squeeze-and-Excitation Network (SEnet) to extract relationships from 3D spectrograms across timesteps and frequencies; here, they use the log-Mel spectrogram with deltas and delta-deltas as input.

Journal ArticleDOI
TL;DR: A novel approach, based on attention guided 3D convolutional neural networks (CNN)-long short-term memory (LSTM) model, is proposed for speech based emotion recognition and it is seen that the proposed method outperforms the compared methods.

Journal ArticleDOI
TL;DR: The proposed radar-based fall detection technique based on time-frequency analysis and convolutional neural networks employs high-level feature learning, which distinguishes it from previously studied methods that use heuristic feature extraction.
Abstract: Automatic detection of a falling person based on noncontact sensing is a challenging problem with applications in smart homes for elderly care. In this article, we propose a radar-based fall detection technique based on time-frequency analysis and convolutional neural networks. The time-frequency analysis is performed by applying the short-time Fourier transform to each radar return signal. The resulting spectrograms are converted into binary images, which are fed into the convolutional neural network. The network is trained using labeled examples of fall and nonfall activities. Our method employs high-level feature learning, which distinguishes it from previously studied methods that use heuristic feature extraction. The performance of the proposed method is evaluated by conducting several experiments on a set of radar return signals. We show that our method distinguishes falls from nonfalls with 98.37% precision and 97.82% specificity, while maintaining a low false-alarm rate, which is superior to existing methods. We also show that our proposed method is robust in that it successfully distinguishes falls from nonfalls when trained on subjects in one room, but tested on different subjects in a different room. In the proposed convolutional neural network, the hierarchical features extracted from the radar return signals are the key to understand the fundamental composition of human activities and determine whether or not a fall has occurred during human daily activities. Our method may be extended to other radar-based applications such as apnea detection and gesture detection.
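The preprocessing described above can be sketched in a few lines: the radar return is converted to a spectrogram with the STFT and thresholded into a binary image before being fed to the CNN. The normalization and threshold rule are assumptions.

```python
import numpy as np
from scipy.signal import stft

def radar_binary_spectrogram(radar_return, fs, db_threshold=-40.0):
    """STFT spectrogram of a radar return, thresholded into a binary image."""
    _, _, Z = stft(radar_return, fs=fs, nperseg=256, noverlap=192)
    spec_db = 20 * np.log10(np.abs(Z) + 1e-12)
    spec_db -= spec_db.max()                    # normalize so the peak is 0 dB
    return (spec_db > db_threshold).astype(np.uint8)   # binary CNN input
```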

Journal ArticleDOI
25 Jun 2021-PLOS ONE
TL;DR: The findings of this study suggest that the DL based structure could discover important biomarkers for efficient and automatic diagnosis of ASD from EEG and may assist to develop computer-aided diagnosis system.
Abstract: Autism spectrum disorder (ASD) is a developmental disability characterized by persistent impairments in social interaction, speech and nonverbal communication, and restricted or repetitive behaviors. Currently, electroencephalography (EEG) is the most popular tool to inspect the existence of neurological disorders like autism biomarkers due to its low setup cost, high temporal resolution and wide availability. Generally, EEG recordings produce vast amounts of data with dynamic behavior, which are visually analyzed by professional clinicians to detect autism. This is laborious, expensive, subjective, error prone and has reliability issues. Therefore, this study intends to develop an efficient diagnostic framework based on time-frequency spectrogram images of EEG signals to automatically identify ASD. In the proposed system, primarily, the raw EEG signals are pre-processed using re-referencing, filtering and normalization. Then, the Short-Time Fourier Transform is used to transform the pre-processed signals into two-dimensional spectrogram images. Afterward, those images are evaluated by machine learning (ML) and deep learning (DL) models, separately. In the ML process, textural features are extracted, significant features are selected using principal component analysis, and these are fed to six different ML classifiers for classification. In the DL process, three different convolutional neural network models are tested. The proposed DL based model achieves higher accuracy (99.15%) compared to the ML based model (95.25%) on an ASD EEG dataset and also outperforms existing methods. The findings of this study suggest that the DL based structure could discover important biomarkers for efficient and automatic diagnosis of ASD from EEG and may assist in developing a computer-aided diagnosis system.

Proceedings ArticleDOI
Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron Weiss, Yonghui Wu
06 Jun 2021
TL;DR: Parallel Tacotron as mentioned in this paper uses a variational autoencoder-based residual encoder for text-to-speech models, which is highly parallelizable during both training and inference.
Abstract: Although neural end-to-end text-to-speech models can synthesize highly natural speech, there is still room for improvements to its efficiency and naturalness. This paper proposes a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder. This model, called Parallel Tacotron, is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware. The use of the variational autoencoder relaxes the one-to-many mapping nature of the text-to-speech problem and improves naturalness. To further improve the naturalness, we use lightweight convolutions, which can efficiently capture local contexts, and introduce an iterative spectrogram loss inspired by iterative refinement. Experimental results show that Parallel Tacotron matches a strong autoregressive baseline in subjective evaluations with significantly decreased inference time.

Journal ArticleDOI
TL;DR: A novel spatiotemporal and frequential cascaded attention network with large-margin learning is proposed that achieves a promising performance in speech emotion recognition.

Journal ArticleDOI
TL;DR: In this article, a robust deep learning framework for auscultation analysis is proposed, which aims to classify anomalies in respiratory cycles and detect diseases, from respiratory sound recordings, by using a back-end deep learning network to classify the spectrogram features into categories of respiratory anomaly cycles or diseases.
Abstract: This paper presents and explores a robust deep learning framework for auscultation analysis. This aims to classify anomalies in respiratory cycles and detect diseases from respiratory sound recordings. The framework begins with front-end feature extraction that transforms input sound into a spectrogram representation. Then, a back-end deep learning network is used to classify the spectrogram features into categories of respiratory anomaly cycles or diseases. Experiments, conducted over the ICBHI benchmark dataset of respiratory sounds, confirm three main contributions towards respiratory-sound analysis. Firstly, we carry out an extensive exploration of the effect of spectrogram types, spectral-time resolution, overlapping/non-overlapping windows, and data augmentation on final prediction accuracy. Secondly, this leads us to propose a novel deep learning system, built on the proposed framework, which outperforms current state-of-the-art methods. Finally, we apply a Teacher-Student scheme to achieve a trade-off between model performance and model complexity, which holds promise for building real-time applications.
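A minimal sketch of the Teacher-Student trade-off mentioned above (standard knowledge distillation, which may differ in detail from the authors' scheme): a smaller student is trained to match the teacher's softened predictions as well as the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=3.0, alpha=0.5):
    """Blend a soft teacher-matching term with the usual cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```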

Proceedings ArticleDOI
01 Oct 2021
TL;DR: In this article, the authors use spiking neurons to compute the Short Time Fourier Transform (STFT) with similar computational complexity but 47x less output bandwidth than the conventional STFT.
Abstract: The biologically inspired spiking neurons used in neuromorphic computing are nonlinear filters with dynamic state variables—very different from the stateless neuron models used in deep learning. The next version of Intel's neuromorphic research processor, Loihi 2, supports a wide range of stateful spiking neuron models with fully programmable dynamics. Here we showcase advanced spiking neuron models that can be used to efficiently process streaming data in simulation experiments on emulated Loihi 2 hardware. In one example, Resonate-and-Fire (RF) neurons are used to compute the Short Time Fourier Transform (STFT) with similar computational complexity but 47x less output bandwidth than the conventional STFT. In another example, we describe an algorithm for optical flow estimation using spatiotemporal RF neurons that requires over 90x fewer operations than a conventional DNN-based solution. We also demonstrate promising preliminary results using backpropagation to train RF neurons for audio classification tasks. Finally, we show that a cascade of Hopf resonators—a variant of the RF neuron—replicates novel properties of the cochlea and motivates an efficient spike-based spectrogram encoder.
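A plain NumPy illustration (not Loihi 2 code) of the Resonate-and-Fire idea mentioned above: each neuron is modeled as a damped complex oscillator, and a bank of them tuned to different frequencies accumulates an STFT-like time-frequency map; spiking and thresholding are omitted.

```python
import numpy as np

def rf_bank_response(x, freqs_hz, fs, decay=0.99):
    """x: real input samples; freqs_hz: resonant frequencies of the neuron bank."""
    omegas = 2 * np.pi * np.asarray(freqs_hz) / fs
    kernel = decay * np.exp(1j * omegas)           # per-neuron complex recurrence
    z = np.zeros(len(freqs_hz), dtype=complex)     # neuron state variables
    out = np.empty((len(x), len(freqs_hz)))
    for t, sample in enumerate(x):
        z = kernel * z + sample                    # resonate step (damped rotation)
        out[t] = np.abs(z)                         # magnitude ~ one spectrogram row
    return out                                     # (time, frequency) map
```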

Journal ArticleDOI
TL;DR: This work presents Wi-Sense—a human activity recognition system that uses a convolutional neural network (CNN) to recognize human activities based on the environment-independent fingerprints extracted from the Wi-Fi channel state information (CSI).
Abstract: A human activity recognition (HAR) system acts as the backbone of many human-centric applications, such as active assisted living and in-home monitoring for elderly and physically impaired people. Although existing Wi-Fi-based human activity recognition methods report good results, their performance is affected by the changes in the ambient environment. In this work, we present Wi-Sense—a human activity recognition system that uses a convolutional neural network (CNN) to recognize human activities based on the environment-independent fingerprints extracted from the Wi-Fi channel state information (CSI). First, Wi-Sense captures the CSI by using a standard Wi-Fi network interface card. Wi-Sense applies the CSI ratio method to reduce the noise and the impact of the phase offset. In addition, it applies the principal component analysis to remove redundant information. This step not only reduces the data dimension but also removes the environmental impact. Thereafter, we compute the processed data spectrogram which reveals environment-independent time-variant micro-Doppler fingerprints of the performed activity. We use these spectrogram images to train a CNN. We evaluate our approach by using a human activity data set collected from nine volunteers in an indoor environment. Our results show that Wi-Sense can recognize these activities with an overall accuracy of 97.78%. To stress on the applicability of the proposed Wi-Sense system, we provide an overview of the standards involved in the health information systems and systematically describe how Wi-Sense HAR system can be integrated into the eHealth infrastructure.

Journal ArticleDOI
TL;DR: This work proposes to use Nonnegative Matrix Factorization of the spectrogram to separate cyclic and non-cyclic impulsive components in the presence of non-Gaussian impulsive noise, allowing the impulsive signal of interest (bearing damage) to be detected and extracted in the presence of a high-amplitude non-cyclic impulsive signal.

Proceedings Article
03 May 2021
TL;DR: DiffWave as mentioned in this paper is a diffusion probabilistic model for conditional and unconditional waveform generation, which is non-autoregressive and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis.
Abstract: In this work, we propose DiffWave, a versatile diffusion probabilistic model for conditional and unconditional waveform generation. The model is non-autoregressive, and converts the white noise signal into structured waveform through a Markov chain with a constant number of steps at synthesis. It is efficiently trained by optimizing a variant of variational bound on the data likelihood. DiffWave produces high-fidelity audios in different waveform generation tasks, including neural vocoding conditioned on mel spectrogram, class-conditional generation, and unconditional generation. We demonstrate that DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44 versus 4.43), while synthesizing orders of magnitude faster. In particular, it significantly outperforms autoregressive and GAN-based waveform models in the challenging unconditional generation task in terms of audio quality and sample diversity from various automatic and human evaluations.
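A hedged sketch of a diffusion training step of the kind DiffWave builds on: a clean waveform is noised with the closed-form forward process and a network is trained to predict the added noise. The linear noise schedule and the eps_model(x_t, t, mel_cond) interface are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(eps_model, x0, mel_cond, T=50):
    """x0: (B, N) clean waveforms; eps_model is a hypothetical noise predictor."""
    betas = torch.linspace(1e-4, 0.05, T)                  # assumed linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, T, (x0.shape[0],))                # random step per clip
    a = alpha_bar[t].sqrt().view(-1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1)
    noise = torch.randn_like(x0)
    x_t = a * x0 + s * noise                               # sample q(x_t | x_0)
    pred = eps_model(x_t, t, mel_cond)                     # condition on mel spec
    return F.mse_loss(pred, noise)                         # train to predict noise
```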

Journal ArticleDOI
TL;DR: A novel (spectrogram) representation called the Quarter-spectrogram (Q-Spectrogram) that squeezes temporal and frequency information for input to CNN models and a simple WiFi classification scheme that buffers several WiFi Q-spectrograms and then makes a decision about WiFi’s presence and also gives a quantified measure of WiFi traffic density.
Abstract: Shared spectrum usage is inevitable due to the ongoing increase in wireless services and bandwidth requirements. Spectrum monitoring is a key enabler for efficient spectrum sharing by multiple radio access technologies (RATs). In this paper, we present signal classification using deep neural networks to identify various radio technologies and their associated interferences. We use Convolutional Neural Networks (CNN) to perform signal classification and employ six well-known CNN models to train for ten signal classes. These classes include LTE, Radar, WiFi and FBMC (Filter Bank Multicarrier) and their interference combinations, which include LTE+Radar, LTE+WiFi, FBMC+Radar, FBMC+WiFi, WiFi+Radar and Noise. The CNN models include AlexNet, VGG16, ResNet18, SqueezeNet, InceptionV3 and ResNet50. The radio signal data sets for training and testing of the CNN-based classifiers are acquired using a USRP-based experimental setup. Extensive measurements of these radio technologies (LTE, WiFi, Radar and FBMC) are done over different locations and times to generate a robust dataset. We propose a novel (spectrogram) representation called the Quarter-spectrogram (Q-spectrogram) that squeezes temporal and frequency information for input to CNN models. Considering classification accuracy, model complexity and prediction time for a single input Q-spectrogram (image), ResNet18 gives the best overall performance with 98% classification accuracy, while SqueezeNet offers the lowest model complexity, which makes it very suitable for resource-constrained radio monitoring devices, as well as the shortest prediction time of 110 ms. Moreover, we also propose a simple WiFi classification scheme that buffers several WiFi Q-spectrograms and then makes a decision about WiFi’s presence and also gives a quantified measure of WiFi traffic density.

Journal ArticleDOI
TL;DR: The proposed Adaptive Multi-Trace Carving (AMTC) algorithm is a unified approach for detecting and tracking one or more subtle frequency components under very low signal-to-noise ratio (SNR) conditions and in near real time.
Abstract: In the field of information forensics, many emerging problems involve a critical step that estimates and tracks weak frequency components in noisy signals. It is often challenging for the prior art of frequency tracking to i) achieve a high accuracy under noisy conditions, ii) detect and track multiple frequency components efficiently, or iii) strike a good trade-off of the processing delay versus the resilience and the accuracy of tracking. To address these issues, we propose Adaptive Multi-Trace Carving (AMTC), a unified approach for detecting and tracking one or more subtle frequency components under very low signal-to-noise ratio (SNR) conditions and in near real time. AMTC takes as input a time-frequency representation of the system’s preprocessing results (such as the spectrogram), and identifies frequency components through iterative dynamic programming and adaptive trace compensation. The proposed algorithm considers relatively high energy traces sustaining over a certain duration as an indicator of the presence of frequency/oscillation components of interest and track their time-varying trend. Extensive experiments using both synthetic data and real-world forensic data of power signatures and physiological monitoring reveal that the proposed method outperforms representative prior art under low SNR conditions, and can be implemented in near real-time settings. The proposed AMTC algorithm can empower the development of new information forensic technologies that harness very small signals.
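The dynamic-programming core suggested by the abstract can be sketched as a single-trace simplification: find the frequency trace through a spectrogram that accumulates the most energy while moving at most a few bins between frames. This is not the full AMTC algorithm (no adaptive trace compensation or multi-trace iteration).

```python
import numpy as np

def max_energy_trace(spec, max_jump=2):
    """spec: (freq_bins, frames) non-negative energy map (e.g., a spectrogram)."""
    F_, T = spec.shape
    score = np.full((F_, T), -np.inf)
    back = np.zeros((F_, T), dtype=int)
    score[:, 0] = spec[:, 0]
    for t in range(1, T):
        for f in range(F_):
            lo, hi = max(0, f - max_jump), min(F_, f + max_jump + 1)
            prev = lo + int(np.argmax(score[lo:hi, t - 1]))
            score[f, t] = spec[f, t] + score[prev, t - 1]   # best reachable path
            back[f, t] = prev
    trace = np.empty(T, dtype=int)
    trace[-1] = int(np.argmax(score[:, -1]))
    for t in range(T - 1, 0, -1):                           # backtrack the trace
        trace[t - 1] = back[trace[t], t]
    return trace                                            # frequency bin per frame
```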

Proceedings ArticleDOI
09 Sep 2021
TL;DR: In this paper, a multi-modal, multi-domain deep learning framework is proposed to fuse the ultrasonic Doppler features and the audible speech spectrogram, and an adversarially trained discriminator is employed to learn the correlation between the two heterogeneous feature modalities.
Abstract: Robust speech enhancement is considered as the holy grail of audio processing and a key requirement for human-human and human-machine interaction. Solving this task with single-channel, audio-only methods remains an open challenge, especially for practical scenarios involving a mixture of competing speakers and background noise. In this paper, we propose UltraSE, which uses ultrasound sensing as a complementary modality to separate the desired speaker's voice from interferences and noise. UltraSE uses a commodity mobile device (e.g., smartphone) to emit ultrasound and capture the reflections from the speaker's articulatory gestures. It introduces a multi-modal, multi-domain deep learning framework to fuse the ultrasonic Doppler features and the audible speech spectrogram. Furthermore, it employs an adversarially trained discriminator, based on a cross-modal similarity measurement network, to learn the correlation between the two heterogeneous feature modalities. Our experiments verify that UltraSE simultaneously improves speech intelligibility and quality, and outperforms state-of-the-art solutions by a large margin.


Journal ArticleDOI
TL;DR: In this study, a multi-device operation monitoring system based on sound analysis is developed and applied successfully in monitoring experiments in two different environments: a workshop in which a hand-operated device was used and a factory with a computer numerical control machine.