
Showing papers on "Audio signal processing published in 2020"


Journal ArticleDOI
TL;DR: The aim of this study is to summarize the literature on audio signal processing, with a particular focus on feature extraction techniques; temporal-domain, frequency-domain, cepstral-domain, wavelet-domain and time-frequency-domain features are discussed in detail.

179 citations


Journal ArticleDOI
TL;DR: This work advances the music emotion recognition state of the art by proposing novel emotionally-relevant audio features related to musical texture and expressive techniques; analysing feature relevance and results uncovered interesting relations.
Abstract: This work advances the music emotion recognition state-of-the-art by proposing novel emotionally-relevant audio features. We reviewed the existing audio features implemented in well-known frameworks and their relationships with the eight commonly defined musical concepts. This knowledge helped uncover musical concepts lacking computational extractors, for which we propose algorithms, namely related to musical texture and expressive techniques. To evaluate our work, we created a public dataset of 900 audio clips, with subjective annotations following Russell's emotion quadrants. The existing audio features (baseline) and the proposed features (novel) were tested using 20 repetitions of 10-fold cross-validation. Adding the proposed features improved the F1-score to 76.4 percent (by 9 percent), when compared to a similar number of baseline-only features. Moreover, analysing feature relevance and results uncovered interesting relations, namely the weight of specific features and musical concepts for each emotion quadrant, and suggests promising new directions for future research in music emotion recognition, interactive media, and novel music interfaces.

98 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
Abstract: Supervised deep learning has gained significant attention for speech enhancement recently. The state-of-the-art deep learning methods perform the task by learning a ratio/binary mask that is applied to the mixture in the time-frequency domain to produce the clean speech. Despite the great performance in the single-channel setting, these frameworks lag in performance in the multichannel setting as the majority of these methods a) fail to exploit the available spatial information fully, and b) still treat the deep architecture as a black box which may not be well-suited for multichannel audio processing. This paper addresses these drawbacks, a) by utilizing complex ratio masking instead of masking on the magnitude of the spectrogram, and more importantly, b) by introducing a channel-attention mechanism inside the deep architecture to mimic beamforming. We propose Channel-Attention Dense U-Net, in which we apply the channel-attention unit recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming. We demonstrate the superior performance of the network against the state-of-the-art approaches on the CHiME-3 dataset.
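As a hedged illustration of the channel-attention idea, and not the authors' exact unit (whose design the abstract does not fully specify), the following PyTorch sketch applies a squeeze-and-excitation-style reweighting to the channel dimension of time-frequency feature maps:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention; a simplified stand-in
    for the channel-attention unit described in the paper."""
    def __init__(self, n_channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(n_channels // reduction, n_channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))             # global average pool -> (batch, channels)
        w = self.fc(w)                     # per-channel weights in [0, 1]
        return x * w[:, :, None, None]     # reweight each feature map

# Example: reweighting 64 feature maps produced by a multichannel spectrogram encoder.
attn = ChannelAttention(64)
features = torch.randn(2, 64, 257, 100)
out = attn(features)                       # same shape, channels rescaled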

69 citations


Journal ArticleDOI
TL;DR: A new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion, allowing on-the-fly spectrogram extraction thanks to its speed, without the need to store any spectrograms on disk.
Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on disk. Moreover, this approach also allows back-propagation through the waveform-to-spectrogram transformation layer; hence, the transformation process can be made trainable, further optimizing the waveform-to-spectrogram transformation for the specific task that the neural network is trained on. All spectrogram implementations scale linearly with respect to the input length. nnAudio, however, leverages PyTorch's compute unified device architecture (CUDA) support for 1D convolutional neural networks, so its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than implementations that use only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and our framework significantly reduces the spectrogram extraction time from the order of seconds (using a popular Python library, librosa) to the order of milliseconds for audio recordings of the same length. When applying nnAudio to variable-length audio inputs, an average of 11.5 hours is required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa, whereas nnAudio requires an average of 2.8 hours, roughly four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.
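To make the core idea concrete, here is a minimal sketch (not nnAudio's actual code) of a magnitude STFT expressed as a 1D convolution with fixed Fourier-basis kernels; this is what lets the transform run on the GPU and, if the kernels are left trainable, be optimised end to end:

import numpy as np
import torch
import torch.nn as nn

class ConvSTFT(nn.Module):
    """Magnitude STFT as a Conv1d with fixed Fourier kernels, illustrating the
    waveform-to-spectrogram idea behind nnAudio (not its implementation).
    Leaving requires_grad=True on the kernels would make the transform trainable."""
    def __init__(self, n_fft=512, hop=256):
        super().__init__()
        n = np.arange(n_fft)
        k = np.arange(n_fft // 2 + 1)[:, None]
        window = np.hanning(n_fft)
        cos_k = np.cos(2 * np.pi * k * n / n_fft) * window
        sin_k = -np.sin(2 * np.pi * k * n / n_fft) * window
        kernels = np.concatenate([cos_k, sin_k])[:, None, :]       # (2*bins, 1, n_fft)
        self.conv = nn.Conv1d(1, kernels.shape[0], n_fft, stride=hop, bias=False)
        self.conv.weight.data = torch.tensor(kernels, dtype=torch.float32)
        self.conv.weight.requires_grad = False
        self.n_bins = n_fft // 2 + 1

    def forward(self, wav):                        # wav: (batch, samples)
        y = self.conv(wav.unsqueeze(1))            # (batch, 2*bins, frames)
        real, imag = y[:, :self.n_bins], y[:, self.n_bins:]
        return torch.sqrt(real ** 2 + imag ** 2)   # magnitude spectrogram

spec = ConvSTFT()(torch.randn(1, 16000))           # GPU-friendly, no spectrograms stored on disk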

55 citations


Proceedings ArticleDOI
13 Jan 2020
TL;DR: In this paper, the authors construct a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments, where subjects are prompted to answer a straightforward, objective question: are two recordings identical or not?
Abstract: Many audio processing tasks require perceptual assessment. The ``gold standard`` of obtaining human judgments is time-consuming, expensive, and cannot be used as an optimization criterion. On the other hand, automated metrics are efficient to compute but often correlate poorly with human judgment, particularly for audio differences at the threshold of human detection. In this work, we construct a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments. Subjects are prompted to answer a straightforward, objective question: are two recordings identical or not? These pairs are algorithmically generated under a variety of perturbations, including noise, reverb, and compression artifacts; the perturbation space is probed with the goal of efficiently identifying the just-noticeable difference (JND) level of the subject. We show that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. Thus, simply replacing an existing loss (e.g., deep feature loss) with our metric yields significant improvement in a denoising network, as measured by subjective pairwise comparison.

49 citations


Journal ArticleDOI
TL;DR: A new audio encryption scheme that provides a high degree of security, is secure enough to withstand many common attacks, and can be recommended for multi-channel audio processing.
Abstract: Transferring multimedia files such as audio is a common problem in information security. Therefore, various encryption technologies are needed to protect this content. This paper proposes a new audio encryption scheme that provides a high degree of security. The novelty of this scheme is the use of chaotic systems and DNA coding to confuse and diffuse the audio data. The initial value of the chaotic system is controlled by the hash value of the audio, making the chaotic trajectory unpredictable. Comparison experiments using different types of audio show that the algorithm works well, is secure enough to withstand many common attacks, and can be recommended for multi-channel audio processing.
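A toy Python sketch of the scheme's central idea follows; it is an assumption-laden simplification in which a logistic map stands in for the paper's (unspecified) chaotic system, a SHA-256 digest of the audio seeds the map, and the DNA-coding stage is omitted:

import hashlib
import numpy as np

def encrypt_audio(samples: np.ndarray) -> np.ndarray:
    """Toy version of the key idea: seed a chaotic (logistic) map with the hash
    of the audio, then diffuse the samples with the chaotic keystream.
    The DNA-coding stage of the paper is omitted here for brevity."""
    data = samples.astype(np.int16).tobytes()
    digest = hashlib.sha256(data).digest()
    # Map the hash to an initial condition in (0, 1) for the logistic map.
    x = (int.from_bytes(digest[:8], "big") % (10 ** 8)) / 10 ** 8 or 0.5
    r = 3.99                                   # chaotic regime of the logistic map
    keystream = np.empty(len(data), dtype=np.uint8)
    for i in range(len(data)):
        x = r * x * (1.0 - x)
        keystream[i] = int(x * 256) % 256
    cipher = np.frombuffer(data, dtype=np.uint8) ^ keystream
    return cipher                              # the hash-derived seed acts as part of the key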

36 citations


Journal ArticleDOI
TL;DR: The ConditionaL Neural Network (CLNN) and its extension, the Masked ConditionaL Neural Network (MCLNN), are designed to exploit the nature of sound in a time–frequency representation and surpass neural-network-based architectures, including state-of-the-art Convolutional Neural Networks, as well as several hand-crafted attempts.

32 citations


Proceedings ArticleDOI
25 Oct 2020
TL;DR: In this article, the authors proposed a model for the Environment Sound Classification (ESC) that consists of multiple feature channels given as input to a DeepConvolutional Neural Network (CNN) with Attention mechanism.
Abstract: In this paper, we propose a model for the Environment Sound Classification Task (ESC) that consists of multiple feature channels given as input to a Deep Convolutional Neural Network (CNN) with Attention mechanism. The novelty of the paper lies in using multiple feature channels consisting of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-transform (CQT) and Chromagram. Such multiple features have never been used before for signal or audio processing. And, we employ a deeper CNN (DCNN) compared to previous models, consisting of spatially separable convolutions working on time and feature domain separately. Alongside, we use attention modules that perform channel and spatial attention together. We use some data augmentation techniques to further boost performance. Our model is able to achieve state-of-the-art performance on all three benchmark environment sound classification datasets, i.e. the UrbanSound8K (97.52%), ESC-10 (95.75%) and ESC-50 (88.50%). To the best of our knowledge, this is the first time that a single environment sound classification model is able to achieve state-of-the-art results on all three datasets. For ESC-10 and ESC-50 datasets, the accuracy achieved by the proposed model is beyond human accuracy of 95.7% and 81.3% respectively.
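The multiple-feature-channel front end can be sketched as follows; this is an illustrative assumption rather than the authors' pipeline (librosa provides no GFCC implementation, so only MFCC, CQT and chromagram are stacked, and the resize-to-common-grid step is a simplification):

import librosa
import numpy as np
from scipy.ndimage import zoom

def feature_channels(path, shape=(128, 128)):
    """Stack several time-frequency representations as input channels for a CNN.
    Each map is resized to a common grid before stacking (a simplification)."""
    y, sr = librosa.load(path, sr=22050)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
        np.abs(librosa.cqt(y, sr=sr)),
        librosa.feature.chroma_stft(y=y, sr=sr),
    ]
    channels = [zoom(f, (shape[0] / f.shape[0], shape[1] / f.shape[1])) for f in feats]
    return np.stack(channels)     # (3, 128, 128), ready for a Conv2d front end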

31 citations


Posted Content
TL;DR: This work shows that nearest neighbor interpolation upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts.
Abstract: A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. We first show that the main sources of upsampling artifacts are: (i) the tonal and filtering artifacts introduced by problematic upsampling operators, and (ii) the spectral replicas that emerge while upsampling. We then compare different upsampling layers, showing that nearest neighbor upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts.
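The comparison at the heart of the paper can be reproduced in a few lines of PyTorch; the layer sizes below are arbitrary choices for illustration:

import torch
import torch.nn as nn

# Two ways to upsample a feature map by 4x in a neural synthesizer decoder.
# Transposed convolutions are prone to the tonal (checkerboard-like) artifacts
# discussed in the paper; nearest-neighbour interpolation followed by a regular
# convolution is the alternative the authors advocate.
x = torch.randn(1, 64, 250)                      # (batch, channels, frames)

transposed = nn.ConvTranspose1d(64, 32, kernel_size=16, stride=4, padding=6)
nearest = nn.Sequential(
    nn.Upsample(scale_factor=4, mode="nearest"),
    nn.Conv1d(64, 32, kernel_size=15, padding=7),
)

print(transposed(x).shape)   # torch.Size([1, 32, 1000])
print(nearest(x).shape)      # torch.Size([1, 32, 1000])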

30 citations


Journal ArticleDOI
TL;DR: The development and evolution of audio effect technology is discussed, highlighting major technical breakthroughs and the impact of available audio effects.
Abstract: Audio effects are an essential tool that the field of music production relies upon. The ability to intentionally manipulate and modify a piece of sound has opened up considerable opportunities for music making. The evolution of technology has often driven new audio tools and effects, from early architectural acoustics through electromechanical and electronic devices to the digitisation of music production studios. Throughout history, music has constantly borrowed ideas and technological advancements from other fields and has contributed innovations back in return. This is defined as transsectorial innovation and fundamentally underpins the technological development of audio effects. The development and evolution of audio effect technology are discussed, highlighting major technical breakthroughs and the impact of available audio effects.

Journal ArticleDOI
TL;DR: Different state-of-the-art deep learning models based on convolutional and recurrent neural networks and feedforward WaveNet architectures are analysed, a new model based on the combination of the aforementioned models is introduced, and the performance of these models when modeling various analog audio effects is explored.
Abstract: Virtual analog modeling of audio effects consists of emulating the sound of an audio processor reference device. This digital simulation is normally done by designing mathematical models of these systems. It is often difficult because it seeks to accurately model all components within the effect unit, which usually contains various nonlinearities and time-varying components. Most existing methods for audio effects modeling are either simplified or optimized for a very specific circuit or type of audio effect and cannot be efficiently translated to other types of audio effects. Recently, deep neural networks have been explored as black-box modeling strategies to solve this task, i.e., by using only input–output measurements. We analyse different state-of-the-art deep learning models based on convolutional and recurrent neural networks and feedforward WaveNet architectures, and we also introduce a new model based on the combination of the aforementioned models. Through objective perceptual-based metrics and subjective listening tests we explore the performance of these models when modeling various analog audio effects. Thus, we show virtual analog models of nonlinear effects, such as a tube preamplifier; nonlinear effects with memory, such as a transistor-based limiter; and nonlinear time-varying effects, such as the rotating horn and rotating woofer of a Leslie speaker cabinet.
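A minimal black-box setup of the kind compared in the paper can be sketched as below; the tanh "device" and the single-layer LSTM are illustrative stand-ins, not any of the authors' architectures:

import torch
import torch.nn as nn

class BlackBoxEffect(nn.Module):
    """Minimal recurrent black-box model of an analog effect: it learns a mapping
    from clean (dry) input audio to the processed (wet) output audio using only
    input-output measurements."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, samples, 1)
        h, _ = self.rnn(x)
        return self.out(h)                  # predicted wet signal, same shape

model = BlackBoxEffect()
dry = torch.randn(4, 4096, 1)               # clean input segments
wet = torch.tanh(3 * dry)                   # stand-in for the measured device output
loss = nn.functional.mse_loss(model(dry), wet)   # trained on input-output pairs only
loss.backward()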

Journal ArticleDOI
TL;DR: This work combines statistical models with a machine learning algorithm to improve results, thereby enabling robust communication; experiments on the AURORA 2 database show improved results over the state-of-the-art methods discussed in the literature.

Journal ArticleDOI
TL;DR: In this paper, a neural encoding and decoding scheme called Biologically plausible Auditory Encoding (BAE) was proposed for audio processing, which emulates the functions of the perceptual components of the human auditory system, including the cochlear filter bank, inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve.
Abstract: The auditory front-end is an integral part of a spiking neural network (SNN) when performing auditory cognitive tasks. It encodes a temporally dynamic stimulus, such as speech and audio, into an efficient, effective and reconstructable spike pattern to facilitate subsequent processing. However, most of the auditory front-ends in current studies have not made use of recent findings in psychoacoustics and physiology concerning human listening. In this paper, we propose a neural encoding and decoding scheme that is optimized for audio processing. The neural encoding scheme, which we call Biologically plausible Auditory Encoding (BAE), emulates the functions of the perceptual components of the human auditory system: the cochlear filter bank, the inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve. We evaluate the perceptual quality of the BAE scheme using PESQ, and its performance through sound classification and speech recognition experiments. Finally, we also built and published two spike versions of speech datasets, Spike-TIDIGITS and Spike-TIMIT, for researchers to use and for benchmarking future SNN research.

Proceedings Article
30 Apr 2020
TL;DR: Harmonic Convolution is proposed, an operation that helps deep networks distill priors in audio signals by explicitly utilizing the harmonic structure within by engineering the kernel to be supported by sets of harmonic series, instead of local neighborhoods for convolutional kernels.
Abstract: Convolutional neural networks (CNNs) excel in image recognition and generation. Among many efforts to explain their effectiveness, experiments show that CNNs carry strong inductive biases that capture natural image priors. Do deep networks also have inductive biases for audio signals? In this paper, we empirically show that current network architectures for audio processing do not show strong evidence in capturing such priors. We propose Harmonic Convolution, an operation that helps deep networks distill priors in audio signals by explicitly utilizing the harmonic structure within. This is done by engineering the kernel to be supported by sets of harmonic series, instead of local neighborhoods for convolutional kernels. We show that networks using Harmonic Convolution can reliably model audio priors and achieve high performance in unsupervised audio restoration tasks. With Harmonic Convolution, they also achieve better generalization performance for sound source separation.
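The following is a deliberately simplified sketch of the harmonic-support idea, not the published Harmonic Convolution operator: each output bin mixes spectrogram values taken at integer multiples of its own frequency using a small set of learnable weights, instead of a local frequency neighbourhood:

import torch
import torch.nn as nn

class HarmonicGather(nn.Module):
    """Simplified sketch: combine spectrogram values at harmonic multiples of each
    frequency bin with learnable per-harmonic weights."""
    def __init__(self, n_bins, n_harmonics=4):
        super().__init__()
        f = torch.arange(n_bins)
        idx = torch.stack([(h * f).clamp(max=n_bins - 1) for h in range(1, n_harmonics + 1)])
        self.register_buffer("idx", idx)                 # (harmonics, bins)
        self.weight = nn.Parameter(torch.randn(n_harmonics))

    def forward(self, spec):                             # spec: (batch, bins, frames)
        harm = spec[:, self.idx, :]                      # (batch, harmonics, bins, frames)
        return torch.einsum("h,bhft->bft", self.weight, harm)

out = HarmonicGather(257)(torch.randn(2, 257, 100))      # (2, 257, 100)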

Journal ArticleDOI
TL;DR: Three types of infinite impulse response filter, i.e. Butterworth, Chebyshev type I and elliptic, have been designed in this paper as low-pass, high-pass, band-pass and band-stop filters using MATLAB software.
Abstract: In the field of digital signal processing, the function of a filter is to remove unwanted parts of a signal, such as random noise, or to extract useful parts, such as the components lying within a certain frequency range; filters are therefore necessary for removing noise from speech signals during transmission. Filters are broadly used in signal processing and communication systems in applications such as channel equalization, noise reduction, radar, audio processing, speech signal processing, video processing, biomedical signal processing (filtering noisy ECG, EEG and EMG signals), electrical circuit analysis, and the analysis of economic and financial data. In this paper, three types of infinite impulse response filter, i.e. Butterworth, Chebyshev type I and elliptic, are discussed theoretically and experimentally. Butterworth, Chebyshev type I and elliptic low-pass, high-pass, band-pass and band-stop filters are designed using MATLAB software. The impulse responses, magnitude responses and phase responses of the Butterworth, Chebyshev type I and elliptic filters for filtering the speech signal are observed. The speech signal is also analysed, and its sampling rate and spectral response are reported.
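For readers working in Python rather than MATLAB, an equivalent set of designs can be sketched with scipy.signal; the filter order, ripple and cut-off values below are illustrative assumptions, not the paper's settings:

import numpy as np
from scipy import signal

fs = 8000                       # speech sampling rate (Hz)
fc = 1000                       # cut-off frequency (Hz)
wn = fc / (fs / 2)              # normalised cut-off for scipy

# Fourth-order low-pass designs of the three IIR families compared in the paper.
b_butter, a_butter = signal.butter(4, wn, btype="low")
b_cheby1, a_cheby1 = signal.cheby1(4, 1, wn, btype="low")      # 1 dB passband ripple
b_ellip,  a_ellip  = signal.ellip(4, 1, 40, wn, btype="low")   # 1 dB ripple, 40 dB stopband

# Magnitude responses for comparison; a band-pass design only needs
# btype="bandpass" and a pair of normalised edge frequencies.
for name, (b, a) in {"Butterworth": (b_butter, a_butter),
                     "Chebyshev I": (b_cheby1, a_cheby1),
                     "Elliptic": (b_ellip, a_ellip)}.items():
    w, h = signal.freqz(b, a, worN=512, fs=fs)
    gain_2k = 20 * np.log10(abs(h[np.argmin(np.abs(w - 2000))]))
    print(name, "gain at 2 kHz ~", round(gain_2k, 1), "dB")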

Proceedings ArticleDOI
23 Jul 2020
TL;DR: DA-based and offset-binary-coded DA-based MAC cores are developed that offer greater speed than different conventional MACs using various multipliers, and achieve better area and delay results than previous approximate adder designs.
Abstract: The MAC is an essential core used in every digital signal processor. The primary focus of this article is to introduce a high-performance Distributed Arithmetic (DA) based MAC and an offset-binary-coded DA-based MAC for real-time signal processing applications. Addition and multiplication are the two hardware resources widely used to design arithmetic blocks in fields such as video processing, audio processing, speech processing and medical image processing applications. In this article, a literature survey is conducted on different MAC [2] units with different multipliers for generating partial products and performing accumulation. DA-based and offset-binary-coded DA-based MAC cores are developed that offer greater speed than different conventional MACs using various multipliers. The DA-based and offset-based architectures are coded in Verilog, and simulation and synthesis are performed in the Xilinx 14.7 Integrated Simulation Environment. The design achieves better area and delay results than previous approximate adder designs. The DA-based MAC cores are much more efficient in terms of delay, whereas the offset-binary-coded DA offers both speed and area optimization.
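Although the paper's cores are written in Verilog, the distributed-arithmetic trick itself is easy to illustrate in software; the sketch below (unsigned inputs and an arbitrary word length, chosen for clarity) shows how a lookup table of coefficient partial sums replaces the multipliers:

def da_mac(coeffs, xs, n_bits=8):
    """Distributed-arithmetic MAC illustrated in software: multipliers are replaced
    by a lookup table of coefficient partial sums, indexed each cycle by one bit of
    every input (unsigned inputs for simplicity)."""
    k = len(coeffs)
    # LUT[p] = sum of the coefficients whose corresponding bit in pattern p is set.
    lut = [sum(c for j, c in enumerate(coeffs) if (p >> j) & 1) for p in range(2 ** k)]
    acc = 0
    for b in range(n_bits):                      # one "clock cycle" per input bit
        pattern = sum(((x >> b) & 1) << j for j, x in enumerate(xs))
        acc += lut[pattern] << b                 # shift-accumulate, no multiplier
    return acc

coeffs, xs = [3, 5, 7, 2], [10, 20, 30, 40]
assert da_mac(coeffs, xs) == sum(c * x for c, x in zip(coeffs, xs))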

Proceedings ArticleDOI
01 Dec 2020
TL;DR: In this article, a BiGRU-based encoder-decoder architecture is proposed for audio captioning that combines audio embeddings with subject-verb semantic embeddings extracted from the audio captions.
Abstract: Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture using audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. We use a Multilayer Perceptron classifier to predict subject-verb embeddings of test audio clips at the testing stage. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, used for the first time in audio captioning, to explore the usability of audio embeddings for this task. We combine the audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. Following this, we evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that the inclusion of semantic information enhances captioning performance.

Journal ArticleDOI
TL;DR: Recently, percussion-based methods have attracted increasing attention for bolt looseness detection because they eliminate the need for contact sensors.
Abstract: Recently, for bolt looseness detection, percussion-based methods have attracted more attention due to their advantages of eliminating contact sensors. The core issue of percussion-based methods is ...

Journal ArticleDOI
TL;DR: A human search system is described that uses an unmanned aerial vehicle to detect the human voice as a means of finding people that cameras cannot, and a search method is proposed that combines voice-based and camera-based human detection to compensate for their respective shortcomings.
Abstract: This paper describes a human search system that uses an unmanned aerial vehicle (UAV). The use of robots to search for people is expected to become an auxiliary tool for saving lives during a disaster. In particular, because UAVs can collect information from the air, there has been much research into human search using UAVs equipped with cameras. However, the disadvantage of cameras is that they struggle to detect people who are hidden in shadows. To solve this problem, we mounted an array microphone on a UAV to detect the human voice as a means of finding people that cameras cannot. A search method is also proposed that combines voice-based and camera-based human detection to compensate for their respective shortcomings. The rate and accuracy of human detection by the proposed method are assessed experimentally.

Journal ArticleDOI
12 Nov 2020-PLOS ONE
TL;DR: A new method is presented for extracting features related to the rhythmic activity of music signals using the topological properties of a graph constructed from the audio signal; in an attribute selection test, the four network properties ranked among the top positions.
Abstract: Most feature extraction algorithms for music audio signals use Fourier transforms to obtain coefficients that describe specific aspects of music information within the sound spectrum, such as the timbral texture, tonal texture and rhythmic activity. In this paper, we introduce a new method for extracting features related to the rhythmic activity of music signals using the topological properties of a graph constructed from an audio signal. We map the local standard deviation of a music signal to a visibility graph and calculate the modularity (Q), the number of communities (Nc), the average degree (〈k〉), and the density (Δ) of this graph. By applying this procedure to each signal in a database of various musical genres, we detected the existence of a hierarchy of rhythmic self-similarities between musical styles given by these four network properties. Using Q, Nc, 〈k〉 and Δ as input attributes in a classification experiment based on supervised artificial neural networks, we obtained an accuracy higher than or equal to the beat histogram in 70% of the musical genre pairs, using only four features from the networks. Finally, when performing the attribute selection test with Q, Nc, 〈k〉 and Δ, along with the main signal processing field descriptors, we found that the four network properties were among the top-ranking positions given by this test.
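A rough sketch of the pipeline using networkx is given below; the frame length and the brute-force natural-visibility construction are illustrative assumptions rather than the authors' exact settings:

import numpy as np
import networkx as nx
from networkx.algorithms import community

def rhythm_graph_features(y, frame=2048):
    """Sketch of the paper's pipeline: map the local standard deviation of a 1-D
    signal y to a natural visibility graph and return (Q, Nc, <k>, density)."""
    series = np.array([y[i:i + frame].std() for i in range(0, len(y) - frame, frame)])
    n = len(series)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            # (i, j) are connected if no intermediate sample blocks the line of sight.
            visible = all(series[k] < series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                          for k in range(i + 1, j))
            if visible:
                g.add_edge(i, j)
    communities = community.greedy_modularity_communities(g)
    q = community.modularity(g, communities)
    avg_degree = 2 * g.number_of_edges() / n
    return q, len(communities), avg_degree, nx.density(g)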

Posted Content
TL;DR: This work constructs a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments and shows that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods.
Abstract: Many audio processing tasks require perceptual assessment. The ``gold standard`` of obtaining human judgments is time-consuming, expensive, and cannot be used as an optimization criterion. On the other hand, automated metrics are efficient to compute but often correlate poorly with human judgment, particularly for audio differences at the threshold of human detection. In this work, we construct a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments. Subjects are prompted to answer a straightforward, objective question: are two recordings identical or not? These pairs are algorithmically generated under a variety of perturbations, including noise, reverb, and compression artifacts; the perturbation space is probed with the goal of efficiently identifying the just-noticeable difference (JND) level of the subject. We show that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. Thus, simply replacing an existing loss (e.g., deep feature loss) with our metric yields significant improvement in a denoising network, as measured by subjective pairwise comparison.

Journal ArticleDOI
TL;DR: It is concluded that the front-end processing of the SONNET provides users with better hearing than does the OPUS 2, and this was offset by the SONNET's superiority in real-life listening situations.
Abstract: Objectives: Speech understanding in noise remains a challenge for many cochlear implant users. To improve this, the SONNET audio processor features three microphone directionality (MD) settings and...

Journal ArticleDOI
TL;DR: This work unfolds an automatic audio signal processing framework that classifies normal versus abnormal respiratory sounds, surpassing the current state of the art by identifying respiratory sound patterns with 66.7% accuracy.
Abstract: There are several diseases (e.g. asthma, pneumonia, etc.) affecting the human respiratory apparatus that alter its airway path substantially, thus changing its acoustic properties. This work unfolds an automatic audio signal processing framework achieving classification between normal and abnormal respiratory sounds. Thanks to a recent challenge, a real-world dataset specifically designed to address the needs of this problem is available to the scientific community. Unlike previous works in the literature, the authors take advantage of information provided by several stethoscopes simultaneously, i.e. elaborating at the acoustic sensor network level. To this end, they employ two feature sets extracted from different domains, i.e. spectral and wavelet. These are modelled by convolutional neural networks, hidden Markov models and Gaussian mixture models. Subsequently, a synergistic scheme is designed operating at the decision level of the best-performing classifier with respect to each stethoscope. Interestingly, such a scheme was able to boost the classification accuracy, surpassing the current state of the art, as it is able to identify respiratory sound patterns with 66.7% accuracy.

Journal ArticleDOI
TL;DR: A novel approach to the DWT is presented in which conventional adders and multipliers are replaced with XOR-MUX adders and truncated multipliers, thereby reducing the 2n logic size to n-size logic.


Proceedings ArticleDOI
05 Oct 2020
TL;DR: This paper investigates how intentional degradation of audio frames affects recognition of the target classes while maintaining effective privacy mitigation; the results indicate that degradation of audio frames has minimal effect on audio recognition using frame-level features.
Abstract: Audio has been increasingly adopted as a sensing modality in a variety of human-centered mobile applications and in smart assistants in the home. Although acoustic features can capture complex semantic information about human activities and context, continuous audio recording often poses significant privacy concerns. An intuitive way to reduce privacy concerns is to degrade audio quality such that speech and other relevant acoustic markers become unintelligible, but this often comes at the cost of activity recognition performance. In this paper, we employ a mixed-methods approach to characterize this balance. We first conduct an online survey with 266 participants to capture their perception of privacy qualitatively and quantitatively with degraded audio. Given our findings that privacy concerns can be significantly reduced at high levels of audio degradation, we then investigate how intentional degradation of audio frames affects the recognition results of the target classes while maintaining effective privacy mitigation. Our results indicate that degradation of audio frames has minimal effect on audio recognition using frame-level features. Furthermore, degradation of audio frames can hurt performance to some extent for audio recognition using segment-level features, though the use of such features may still yield superior recognition performance. Given the different requirements on privacy mitigation and recognition performance for different sensing purposes, such trade-offs need to be balanced in actual implementations.

Proceedings ArticleDOI
15 May 2020
TL;DR: Experimental evaluation using two public datasets, medley-solos-db and gtzan, of monophonic and polyphonic music respectively, demonstrates that the proposed architecture achieves state-of-the-art performance.
Abstract: Bandwidth extension has a long history in audio processing. While speech processing tools do not rely on side information, production-ready bandwidth extension tools for general audio signals rely on side information that has to be transmitted alongside the bitstream of the low-frequency part, mostly because polyphonic music has a more complex and less predictable spectral structure than speech. This paper studies the benefit of using a dilated fully convolutional neural network to perform bandwidth extension of musical audio signals on the magnitude spectra, with no side information. Experimental evaluation using two public datasets, medley-solos-db and gtzan, of monophonic and polyphonic music respectively, demonstrates that the proposed architecture achieves state-of-the-art performance.
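A minimal sketch of such a dilated, fully convolutional magnitude-spectrogram mapper is given below; the layer count, channel widths and number of frequency bins are illustrative assumptions, not the paper's configuration:

import torch
import torch.nn as nn

class DilatedBWE(nn.Module):
    """Minimal dilated, fully convolutional network in the spirit of the paper:
    it maps the magnitude spectrogram of band-limited audio to an estimate of the
    full-band magnitude spectrogram, with no side information."""
    def __init__(self, n_bins=513, channels=64, n_layers=6):
        super().__init__()
        layers, in_ch = [], n_bins
        for i in range(n_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=2 ** i, padding=2 ** i),
                       nn.ReLU()]
            in_ch = channels
        layers += [nn.Conv1d(channels, n_bins, kernel_size=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, low_band_mag):           # (batch, bins, frames)
        return self.net(low_band_mag)          # predicted full-band magnitudes

out = DilatedBWE()(torch.randn(2, 513, 128))   # (2, 513, 128)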

Posted Content
TL;DR: This work shows that two recent, state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes, and that these can be detected with an AUC of up to 0.98 with no degradation in performance on benign samples.
Abstract: Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these models, we show that 2 of the recent and current state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes and we can detect these with up to an AUC of 0.98 with no degradation in performance on benign samples.

Proceedings ArticleDOI
23 Mar 2020
TL;DR: This paper proposes a deep-learning-based system able to estimate the number of people in crowded scenarios from speech sound by directly counting concurrent speakers in overlapping speech and by clustering single-speaker sound by speaker identity over time.
Abstract: People counting techniques have been widely researched recently, and many different types of sensors can be used in this context. In this paper, we propose a system based on a deep-learning model able to identify the number of people in crowded scenarios through speech sound. In a nutshell, the system relies on two components: counting concurrent speakers directly in overlapping speech, and clustering single-speaker sound by speaker identity over time. Compared to previously proposed speaker-counting models that only cluster single-speaker sound, this system is more accurate and less vulnerable to overlapping sound in crowded environments. In addition, counting speakers in overlapping sound also gives the minimum number of speakers, which improves counting accuracy in a quiet environment. Our methodology is inspired by the recently proposed SincNet deep neural network framework, which proves to be outstanding and highly efficient in sound processing with raw signals. By using the bottleneck layer of the SincNet model as features fed to our speaker clustering model, we reach noticeably better performance than previous models that rely on MFCCs and other engineered features.