
Showing papers on "Audio signal processing published in 2021"


Proceedings ArticleDOI
09 Sep 2021
TL;DR: In this paper, a multi-modal, multi-domain deep learning framework is proposed to fuse the ultrasonic Doppler features and the audible speech spectrogram, and an adversarially trained discriminator is employed to learn the correlation between the two heterogeneous feature modalities.
Abstract: Robust speech enhancement is considered as the holy grail of audio processing and a key requirement for human-human and human-machine interaction. Solving this task with single-channel, audio-only methods remains an open challenge, especially for practical scenarios involving a mixture of competing speakers and background noise. In this paper, we propose UltraSE, which uses ultrasound sensing as a complementary modality to separate the desired speaker's voice from interferences and noise. UltraSE uses a commodity mobile device (e.g., smartphone) to emit ultrasound and capture the reflections from the speaker's articulatory gestures. It introduces a multi-modal, multi-domain deep learning framework to fuse the ultrasonic Doppler features and the audible speech spectrogram. Furthermore, it employs an adversarially trained discriminator, based on a cross-modal similarity measurement network, to learn the correlation between the two heterogeneous feature modalities. Our experiments verify that UltraSE simultaneously improves speech intelligibility and quality, and outperforms state-of-the-art solutions by a large margin.
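
As a rough illustration of this kind of audio-ultrasound fusion (not the authors' UltraSE architecture; the feature sizes, layer choices, and file-free setup below are placeholder assumptions), a two-stream model can encode each modality separately and predict a time-frequency mask for the target speaker:

import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, n_freq=257, n_doppler=64, hidden=256):
        super().__init__()
        self.audio_enc = nn.GRU(n_freq, hidden, batch_first=True)
        self.ultra_enc = nn.GRU(n_doppler, hidden, batch_first=True)
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, spec, doppler):
        # spec: (batch, time, n_freq) speech magnitude spectrogram
        # doppler: (batch, time, n_doppler) ultrasonic Doppler features
        a, _ = self.audio_enc(spec)
        u, _ = self.ultra_enc(doppler)
        mask = self.mask_head(torch.cat([a, u], dim=-1))
        return mask * spec  # enhanced magnitude spectrogram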

25 citations


Journal ArticleDOI
TL;DR: In this paper, a two-stream network structure which handles each modality with attention mechanism is developed for sound source localization, and the network naturally reveals the localized response in the scene without human annotation.
Abstract: Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene and sound, and localize the sound source, only by observing them as humans do? To investigate its empirical learnability, in this work we first present a novel unsupervised algorithm to address the problem of localizing sound sources in visual scenes. In order to achieve this goal, a two-stream network structure which handles each modality with an attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method generates false conclusions in some cases. We further show that these false conclusions cannot be fixed without human prior knowledge, owing to the well-known mismatch between correlation and causality. To fix this issue, we extend our network to supervised and semi-supervised settings via a simple modification enabled by the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., a semi-supervised setup. Furthermore, we present the versatility of the learned audio and visual embeddings for cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360-degree videos.
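
A minimal sketch of the attention idea, assuming a global audio embedding and a visual feature map with a matching channel dimension (an illustration, not the paper's exact network): the localization map is the cosine similarity between the audio embedding and each spatial position of the visual features.

import torch
import torch.nn.functional as F

def localization_map(visual_feat, audio_emb):
    # visual_feat: (batch, C, H, W), audio_emb: (batch, C)
    v = F.normalize(visual_feat, dim=1)
    a = F.normalize(audio_emb, dim=1)[:, :, None, None]
    return (v * a).sum(dim=1)  # (batch, H, W), higher = likely source region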

24 citations


Proceedings ArticleDOI
25 Oct 2021
TL;DR: The L3DAS21 challenge, as discussed by the authors, releases a 65-hour 3D audio corpus accompanied by a Python API that facilitates data usage and the results submission stage, and is intended to encourage and foster collaborative research on machine learning for 3D audio signal processing, with particular focus on 3D speech enhancement and 3D sound localization and detection.
Abstract: The L3DAS21 Challenge (www.l3das.com/mlsp2021) is aimed at encouraging and fostering collaborative research on machine learning for 3D audio signal processing, with particular focus on 3D speech enhancement (SE) and 3D sound localization and detection (SELD). Alongside the challenge, we release the L3DAS21 dataset, a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results submission stage. Usually, machine learning approaches to 3D audio tasks are based on single-perspective Ambisonics recordings or on arrays of single-capsule microphones. We propose, instead, a novel multichannel audio configuration based on multiple-source and multiple-perspective Ambisonics recordings, performed with an array of two first-order Ambisonics microphones. To the best of our knowledge, this is the first time that a dual-mic Ambisonics configuration is used for these tasks. We provide baseline models and results for both tasks, obtained with state-of-the-art architectures: FaSNet for SE and SELDnet for SELD.

21 citations


Book ChapterDOI
01 Jan 2021
TL;DR: This work developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers, demonstrating its effectiveness in environmental sound classification (ESC) by achieving high accuracy.
Abstract: Convolutional Neural Networks (CNNs) have been widely used in the field of audio recognition and classification, since they often provide positive results. Motivated by the success of this kind of approach and the lack of practical methodologies for monitoring construction sites using audio data, we developed an application for the classification of different types and brands of construction vehicles and tools, which operates on the emitted audio through a stack of convolutional layers. The proposed architecture works on the mel-spectrogram representation of the input audio frames and demonstrates its effectiveness in environmental sound classification (ESC), achieving high accuracy. In summary, our contribution shows that techniques employed for general ESC can also be successfully adapted to a more specific environmental sound classification task, such as event recognition in construction sites.
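
A hedged sketch of the general pipeline described above, with illustrative layer sizes, class count, and a hypothetical input file rather than the paper's exact configuration: a log-mel spectrogram computed with librosa and fed to a small stack of convolutional layers.

import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("excavator.wav", sr=22050)    # hypothetical recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
logmel = librosa.power_to_db(mel)                  # (64, frames)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 10))               # 10 vehicle/tool classes (assumed)
x = torch.tensor(logmel, dtype=torch.float32)[None, None]
scores = cnn(x)                                    # (1, 10) class scores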

18 citations


Journal ArticleDOI
TL;DR: In this paper, a multi-resolution analysis for feature extraction in sound event detection is proposed, which finds that different resolutions can be more adequate for the detection of different sound event categories, and that combining the information provided by multiple resolutions could improve the performance of Sound Event Detection systems.
Abstract: Sound Event Detection is a task whose relevance has risen in recent years in the field of audio signal processing, due to the creation of specific datasets such as Google AudioSet or DESED (Domestic Environment Sound Event Detection) and the introduction of competitive evaluations like the DCASE Challenge (Detection and Classification of Acoustic Scenes and Events). The different categories of acoustic events can present diverse temporal and spectral characteristics. However, most approaches use a fixed time-frequency resolution to represent the audio segments. This work proposes a multi-resolution analysis for feature extraction in Sound Event Detection, hypothesizing that different resolutions can be more adequate for the detection of different sound event categories, and that combining the information provided by multiple resolutions could improve the performance of Sound Event Detection systems. Experiments are carried out over the DESED dataset in the context of the DCASE 2020 Challenge, concluding that the combination of up to 5 resolutions allows a neural network-based system to obtain better results than single-resolution models in terms of event-based F1-score in every event category and in terms of PSDS (Polyphonic Sound Detection Score). Furthermore, we analyze the impact of score thresholding on the computed F1-scores, finding that the standard value of 0.5 is suboptimal and proposing an alternative strategy based on a specific threshold for each event category, which yields further improvements in performance.
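
The two central ideas, multi-resolution feature extraction and per-class score thresholds, can be sketched as follows; the FFT sizes, hop lengths, and input file are assumptions, not the paper's exact settings.

import librosa
import numpy as np

y, sr = librosa.load("domestic_scene.wav", sr=16000)   # hypothetical file
resolutions = [512, 1024, 2048, 4096, 8192]            # assumed FFT sizes
feats = [librosa.power_to_db(
            librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n,
                                           hop_length=n // 4, n_mels=64))
         for n in resolutions]                          # one feature map per resolution

def detect(frame_scores, class_thresholds):
    # frame_scores: (classes, frames) in [0, 1]; one threshold per class
    # instead of a single global threshold of 0.5
    return frame_scores >= class_thresholds[:, None]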

17 citations


Journal ArticleDOI
TL;DR: In this article, the authors formulate phase retrieval as a new minimization problem involving Bregman divergences, and derive two algorithms based on accelerated gradient descent and alternating direction method of multipliers.
Abstract: Phase retrieval (PR) aims to recover a signal from the magnitudes of a set of inner products. This problem arises in many audio signal processing applications which operate on a short-time Fourier transform magnitude or power spectrogram, and discard the phase information. Recovering the missing phase from the resulting modified spectrogram is indeed necessary in order to synthesize time-domain signals. PR is commonly addressed by considering a minimization problem involving a quadratic loss function. In this paper, we adopt a different standpoint. Indeed, the quadratic loss does not properly account for some perceptual properties of audio, and alternative discrepancy measures such as beta-divergences have been preferred in many settings. Therefore, we formulate PR as a new minimization problem involving Bregman divergences. Since these divergences are not symmetric with respect to their two input arguments in general, they lead to two different formulations of the problem. To optimize the resulting objective, we derive two algorithms based on accelerated gradient descent and the alternating direction method of multipliers. Experiments conducted on audio signal recovery from spectrograms that are either exact or estimated from noisy observations highlight the potential of our proposed methods for audio restoration. In particular, leveraging some of these Bregman divergences induces better performance than the quadratic loss when performing PR from spectrograms under very noisy conditions.
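
For intuition, a minimal gradient-descent phase retrieval sketch is given below with the standard quadratic loss; the paper replaces this loss with Bregman divergences and derives accelerated gradient and ADMM algorithms rather than the generic Adam optimizer used here.

import torch

def retrieve(target_mag, n_fft=1024, hop=256, n_iter=500, lr=1e-2):
    # target_mag: torch tensor of shape (n_fft // 2 + 1, frames)
    n_samples = hop * (target_mag.shape[1] - 1)
    x = torch.randn(n_samples) * 1e-3          # random initial signal
    x.requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    win = torch.hann_window(n_fft)
    for _ in range(n_iter):
        opt.zero_grad()
        mag = torch.stft(x, n_fft, hop_length=hop, window=win,
                         return_complex=True).abs()
        loss = ((mag - target_mag) ** 2).mean()  # quadratic loss for illustration
        loss.backward()
        opt.step()
    return x.detach()                            # recovered time-domain signal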

16 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this paper, the main sources of upsampling artifacts are: (i) the tonal and filtering artifacts introduced by problematic upsampling operators, and (ii) the spectral replicas that emerge while upsampling.
Abstract: A number of recent advances in neural audio synthesis rely on up-sampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. We first show that the main sources of upsampling artifacts are: (i) the tonal and filtering artifacts introduced by problematic upsampling operators, and (ii) the spectral replicas that emerge while upsampling. We then compare different upsampling layers, showing that nearest neighbor upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts.
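
The comparison can be sketched as follows, with illustrative channel counts and kernel sizes: a transposed convolution (prone to tonal artifacts) versus nearest neighbor interpolation followed by a regular convolution.

import torch
import torch.nn as nn

x = torch.randn(1, 16, 1000)                       # (batch, channels, time)
transposed = nn.ConvTranspose1d(16, 16, kernel_size=4, stride=2, padding=1)
nearest = nn.Sequential(nn.Upsample(scale_factor=2, mode="nearest"),
                        nn.Conv1d(16, 16, kernel_size=3, padding=1))
y1 = transposed(x)   # length 2000; can introduce periodic (tonal) artifacts
y2 = nearest(x)      # length 2000; interpolation first, then learned filtering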

16 citations


Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, a data-driven approach to automate audio signal processing by incorporating stateful third-party audio effects as layers within a deep neural network is presented, where a deep encoder is trained to analyze input audio and control effect parameters to perform the desired signal manipulation, requiring only input-target paired audio data as supervision.
Abstract: We present a data-driven approach to automate audio signal processing by incorporating stateful, third-party audio effects as layers within a deep neural network. We then train a deep encoder to analyze input audio and control effect parameters to perform the desired signal manipulation, requiring only input-target paired audio data as supervision. To train our network with non-differentiable black-box effects layers, we use a fast, parallel stochastic gradient approximation scheme within a standard automatic differentiation graph, yielding efficient end-to-end backpropagation. We demonstrate the power of our approach with three separate automatic audio production applications: tube amplifier emulation, automatic removal of breaths and pops from voice recordings, and automatic music mastering. We validate our results with a subjective listening test, showing that our approach not only enables new automatic audio effects tasks, but can yield results comparable to a specialized, state-of-the-art commercial solution for music mastering.
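
A minimal SPSA-style gradient estimate in the spirit of the stochastic approximation scheme described above; the toy "black-box effect", step sizes, and function names are placeholders, not the paper's implementation.

import numpy as np

def spsa_grad(loss_fn, params, eps=1e-3):
    # loss_fn: black-box scalar loss of the effect parameters (non-differentiable)
    delta = np.random.choice([-1.0, 1.0], size=params.shape)
    l_plus = loss_fn(params + eps * delta)
    l_minus = loss_fn(params - eps * delta)
    # with +/-1 perturbations, dividing by delta equals multiplying by delta
    return (l_plus - l_minus) / (2 * eps) * delta

# toy usage: match a target gain through a "black-box" effect y = g * x
x = np.random.randn(1000)
target = 0.5 * x
params = np.array([1.0])
for _ in range(200):
    grad = spsa_grad(lambda p: np.mean((p[0] * x - target) ** 2), params)
    params -= 0.1 * grad    # params converges toward 0.5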

15 citations


Journal ArticleDOI
TL;DR: In this article, a combination of a convolutional neural network and a recurrent neural network (RNN) was used to detect snoring from audio recordings of 38 patients referred to a clinical center for a sleep study.

14 citations


Posted Content
TL;DR: In this article, the authors conduct a comprehensive survey on the progress of DRL in the audio domain by bringing together the research studies across different speech and music-related areas and conclude by presenting challenges faced by audio-based DRL agents and highlighting open areas for future research and investigation.
Abstract: Deep reinforcement learning (DRL) is poised to revolutionise the field of artificial intelligence (AI) by endowing autonomous systems with high levels of understanding of the real world. Currently, deep learning (DL) is enabling DRL to effectively solve various intractable problems in various fields. Most importantly, DRL algorithms are also being employed in audio signal processing to learn directly from speech, music and other sound signals in order to create audio-based autonomous systems that have many promising applications in the real world. In this article, we conduct a comprehensive survey on the progress of DRL in the audio domain by bringing together the research studies across different speech and music-related areas. We begin with an introduction to the general field of DL and reinforcement learning (RL), then progress to the main DRL methods and their applications in the audio domain. We conclude by presenting challenges faced by audio-based DRL agents and highlighting open areas for future research and investigation.

14 citations


Journal ArticleDOI
TL;DR: In this article, a multiresolution deep learning approach is proposed to encode relevant information contained in unprocessed time-domain acoustic signals captured by microphone arrays for real-time sound source two-dimensional localization tasks.
Abstract: Sound source localization using multichannel signal processing has been a subject of active research for decades. In recent years, the use of deep learning in audio signal processing has significantly improved the performances for machine hearing. This has motivated the scientific community to also develop machine learning strategies for source localization applications. This paper presents BeamLearning, a multiresolution deep learning approach that allows the encoding of relevant information contained in unprocessed time-domain acoustic signals captured by microphone arrays. The use of raw data aims at avoiding the simplifying hypothesis that most traditional model-based localization methods rely on. Benefits of its use are shown for real-time sound source two-dimensional localization tasks in reverberating and noisy environments. Since supervised machine learning approaches require large-sized, physically realistic, precisely labelled datasets, a fast graphics processing unit-based computation of room impulse responses was developed using fractional delays for image source models. A thorough analysis of the network representation and extensive performance tests are carried out using the BeamLearning network with synthetic and experimental datasets. Obtained results demonstrate that the BeamLearning approach significantly outperforms the wideband MUSIC and steered response power-phase transform methods in terms of localization accuracy and computational efficiency in the presence of heavy measurement noise and reverberation.
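
One building block mentioned above, a fractional delay for image-source room impulse responses, can be sketched with sinc interpolation; the window, kernel length, and the image-source delays and amplitudes below are illustrative assumptions, not the paper's GPU implementation.

import numpy as np

def fractional_delay_kernel(delay, length=64):
    # delay: fractional part of the delay, in samples (0 <= delay < 1)
    n = np.arange(length) - length // 2
    h = np.sinc(n - delay)                 # ideal fractional-delay filter
    h *= np.hamming(length)                # taper to limit truncation error
    return h / h.sum()

rir = np.zeros(4800)
for tau, amp in [(123.4, 1.0), (401.7, 0.6), (977.2, 0.35)]:  # toy image sources
    k = int(tau)
    h = fractional_delay_kernel(tau - k)
    rir[k:k + len(h)] += amp * h           # accumulate each reflection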

Proceedings ArticleDOI
24 Jan 2021
TL;DR: The proposed ESN-based approach serves as a baseline for further investigations of ESN in audio signal processing in the future and is presented as a first exploration of ESNs for the challenging task of multipitch tracking in music signals.
Abstract: Currently, convolutional neural networks (CNNs) define the state of the art for multipitch tracking in music signals. Echo State Networks (ESNs), a recently introduced recurrent neural network architecture, achieved similar results as CNNs for various tasks, such as phoneme or digit recognition. However, they have not yet received much attention in the community of Music Information Retrieval. The core of ESNs is a group of unordered, randomly connected neurons, i.e., the reservoir, by which the low-dimensional input space is non-linearly transformed into a high-dimensional feature space. Because only the weights of the connections between the reservoir and the output are trained using linear regression, ESNs are easier to train than deep neural networks. This paper presents a first exploration of ESNs for the challenging task of multipitch tracking in music signals. The best results presented in this paper were achieved with a bidirectional two-layer ESN with 20 000 neurons in each layer. Although the final F-score of 0.7198 still falls below the state of the art (0.7370), the proposed ESN-based approach serves as a baseline for further investigations of ESNs in audio signal processing in the future.
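
A tiny echo state network sketch, far smaller than the paper's bidirectional two-layer model with 20 000 neurons per layer: a fixed random reservoir followed by a linear readout trained with ridge regression. Dimensions and hyperparameters are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, n_out = 100, 500, 72          # e.g. spectrogram bins -> pitch classes (assumed)
W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

def run_reservoir(X):                      # X: (time, n_in) input features
    states, x = [], np.zeros(n_res)
    for u in X:
        x = np.tanh(W_in @ u + W @ x)      # non-linear recurrent update
        states.append(x.copy())
    return np.array(states)                # (time, n_res)

def train_readout(S, Y, ridge=1e-2):       # Y: (time, n_out) pitch targets
    # only the readout is trained, by ridge regression
    return np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ Y)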

Posted Content
TL;DR: This work extends trainable infinite impulse response (IIR) filters to the hyperconditioned case, in which a transformation is learned to directly map external parameters of the distortion effect to its internal filter and gain parameters, along with activations necessary to ensure filter stability.
Abstract: In this work, we propose using differentiable cascaded biquads to model an audio distortion effect. We extend trainable infinite impulse response (IIR) filters to the hyperconditioned case, in which a transformation is learned to directly map external parameters of the distortion effect to its internal filter and gain parameters, along with activations necessary to ensure filter stability. We propose a novel, efficient training scheme of IIR filters by means of a Fourier transform. Our models have significantly fewer parameters and reduced complexity relative to more traditional black-box neural audio effect modeling methodologies using finite impulse response filters. Our smallest, best-performing model adequately models a BOSS MT-2 pedal at 44.1 kHz, using a total of 40 biquads and only 210 parameters. Its model parameters are interpretable, can be related back to the original analog audio circuit, and can even be intuitively altered by machine learning non-specialists after model training. Quantitative and qualitative results illustrate the effectiveness of the proposed method.
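
The frequency-domain view that the training scheme relies on can be illustrated by evaluating a cascade of biquads with scipy; the coefficients below are arbitrary stable examples, not learned MT-2 parameters.

import numpy as np
from scipy.signal import sosfreqz

sos = np.array([[1.0, 0.5, 0.25, 1.0, -0.9, 0.2],    # b0 b1 b2 a0 a1 a2
                [1.0, -0.3, 0.1, 1.0, -0.5, 0.06]])  # second biquad section
w, h = sosfreqz(sos, worN=1024, fs=44100)            # cascade frequency response
magnitude_db = 20 * np.log10(np.abs(h) + 1e-12)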

Proceedings ArticleDOI
30 Aug 2021
TL;DR: This article proposed CADNet, a contextualized attention-based distillation approach, which applies both cross-attention and self-attention to obtain ASR-robust contextualized embedding representations of the passage and dialogue history for performance improvements.
Abstract: Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora. Different from traditional text question answering (QA) tasks, SCQA involves audio signal processing, passage comprehension, and contextual understanding. However, ASR systems introduce unexpected noisy signals to the transcriptions, which result in performance degradation on SCQA. To overcome the problem, we propose CADNet, a novel contextualized attention-based distillation approach, which applies both cross-attention and self-attention to obtain ASR-robust contextualized embedding representations of the passage and dialogue history for performance improvements. We also introduce the spoken conventional knowledge distillation framework to distill the ASR-robust knowledge from the estimated probabilities of the teacher model to the student. We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance in this task.
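
A generic knowledge distillation loss of the kind described above (temperature and weighting are illustrative, not CADNet's settings): the student matches the teacher's softened output distribution while also fitting the hard labels.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # soft target term: KL divergence between softened distributions
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    # hard target term: standard cross-entropy with ground-truth labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard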

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, text-to-audio grounding (TAG) is proposed, which interactively considers the relationship between audio processing and language understanding, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.
Abstract: Automated Audio Captioning is a cross-modal task, generating natural language descriptions to summarize the audio clips’ sound events. However, grounding the actual sound events in the given audio based on its corresponding caption has not been investigated. This paper contributes an Audio-Grounding dataset, which provides the correspondence between sound events and the captions provided in Audiocaps, along with the location (timestamps) of each present sound event. Based on this, we propose the text-to-audio grounding (TAG) task, which interactively considers the relationship between audio processing and language understanding. A baseline approach is provided, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.

Book ChapterDOI
01 Jan 2021
TL;DR: In this article, the authors provide an overview of how deep learning techniques can be used for audio signals and discuss how the first layers of a DNN can be set to take into account these specificity's.
Abstract: This chapter provides an overview of how deep learning techniques can be used for audio signals. We first review the main DNN architectures, meta-architectures and training paradigms used for audio processing. By highlighting the specificities of the audio signal, we discuss the various possible audio representations to be used as input of a DNN—time and frequency representations, waveform representations and knowledge-driven representations—and discuss how the first layers of a DNN can be set to take these specificities into account. We then review a set of applications for three main classes of problems: audio recognition, audio processing and audio generation. We do this considering two types of audio content which are less commonly addressed in the literature: music and environmental sounds.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: The Melon Playlist Dataset as discussed by the authors is a large dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags.
Abstract: One of the main limitations in the field of audio signal processing is the lack of large public datasets with audio representations and high-quality annotations due to restrictions of copyrighted commercial music. We present Melon Playlist Dataset, a public dataset of mel-spectrograms for 649,091 tracks and 148,826 associated playlists annotated by 30,652 different tags. All the data is gathered from Melon, a popular Korean streaming service. The dataset is suitable for music information retrieval tasks, in particular, auto-tagging and automatic playlist continuation. Even though the latter can be addressed by collaborative filtering approaches, audio provides opportunities for research on track suggestions and building systems resistant to the cold-start problem, for which we provide a baseline. Moreover, the playlists and the annotations included in the Melon Playlist Dataset make it suitable for metric learning and representation learning.

Journal ArticleDOI
TL;DR: In this paper, some songs are used to create the training data with which the neural network is trained, and the efficiency of detecting a new and unknown singer is evaluated.
Abstract: Communication is one of the most common human activities, and people can usually recognize the voices of those they know. The same holds when recognizing a voice in music: if the artist's voice is familiar, recognition is easy, but if the voice is not, identifying it within the music is a difficult task. Singer recognition is therefore a demanding area of research in audio signal processing, addressed by applying suitable algorithms. Different approaches can be taken toward this objective, such as isolating the vocal frequency range from the audio signal or detecting the peaks of the voice within the music. Since music is polyphonic, an analysis of the frequency components is essential, and detecting the peaks of the voice signal is a comparatively simple approach to such detection. In this paper, some songs are used to create the training data with which a neural network is trained, and a separate set of data is prepared for testing. Beyond the supervised learning procedure, hyperparameter tuning is applied, and the efficiency of detecting a new and unknown singer is evaluated. The neural network performs this task well, with an accuracy of about 99.29%, so the detection reaches a satisfactory level.

Journal ArticleDOI
TL;DR: Signal processing algorithms are the hidden components in the audio processor that convert the received acoustic signal into electrical impulses while maintaining as much relevant information as possible.
Abstract: Signal processing algorithms are the hidden components in the audio processor that convert the received acoustic signal into electrical impulses while maintaining as much relevant information as possible.

Proceedings ArticleDOI
17 Oct 2021
TL;DR: In this paper, a self-supervised, pretrained audio embedding method for depression detection is proposed, where an encoder-decoder network is used to extract DEPA on in-domain depressed datasets (DAIC and MDD).
Abstract: Depression detection research has increased over the last few decades, one major bottleneck of which is the limited data availability and representation learning. Recently, self-supervised learning has seen success in pretraining text embeddings and has been applied broadly on related tasks with sparse data, while pretrained audio embeddings based on self-supervised learning are rarely investigated. This paper proposes DEPA, a self-supervised, pretrained depression audio embedding method for depression detection. An encoder-decoder network is used to extract DEPA on in-domain depressed datasets (DAIC and MDD) and out-domain (Switchboard, Alzheimer's) datasets. With DEPA as the audio embedding extracted at response-level, a significant performance gain is achieved on downstream tasks, evaluated on both sparse datasets like DAIC and the large major depression disorder dataset (MDD). This paper not only exhibits a novel embedding extraction method capturing response-level representation for depression detection but, more significantly, is an exploration of self-supervised learning in a specific task within audio processing.

Posted ContentDOI
18 Feb 2021-bioRxiv
TL;DR: This paper found that auditory cortex in older adults is hyperresponsive to sound onsets, but that sustained neural activity in auditory cortex, indexing the processing of a sound pattern, is reduced.
Abstract: Sensitivity to repetitions in sound amplitude and frequency is crucial for sound perception. As with other aspects of sound processing, sensitivity to such patterns may change with age, and may help explain some age-related changes in hearing such as segregating speech from background sound. We recorded magnetoencephalography to characterize differences in the processing of sound patterns between younger and older adults. We presented tone sequences that either contained a pattern (made of a repeated set of tones) or did not contain a pattern. We show that auditory cortex in older, compared to younger, adults is hyperresponsive to sound onsets, but that sustained neural activity in auditory cortex, indexing the processing of a sound pattern, is reduced. Hence, the sensitivity of neural populations in auditory cortex fundamentally differs between younger and older individuals, overresponding to sound onsets, while underresponding to patterns in sounds. This may help to explain some age-related changes in hearing such as increased sensitivity to distracting sounds and difficulties tracking speech in the presence of other sound.

Journal ArticleDOI
TL;DR: In this paper, the authors proposed a model consisting of two modules, a reservoir module and a decision-making module, which projects complex spatio-temporal patterns into spatially separated neural representations via its recurrent dynamics.

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, an iterative estimator based on maximum likelihood and alternating least squares is proposed for blind estimation of reflection amplitudes in early room reflections, with connections between the proposed method and raking beamformers.
Abstract: Estimation of the properties of early room reflections is an important task in audio signal processing, with applications in beamforming, source separation, room geometry inference, and spatial audio. While methods exist to blindly estimate the direction of arrival and delay of the early reflections, the blind estimation of reflection amplitudes remains an open problem. This work presents a preliminary attempt to blindly estimate reflection amplitudes. An iterative estimator is suggested, based on maximum likelihood and alternating least squares. We discuss some fundamental scaling ambiguities of the problem, and show connections between the proposed method and raking beamformers. A simulation study demonstrates the effectiveness of the proposed method.

Journal ArticleDOI
TL;DR: The main intent of this paper is to develop an intelligent deep learning model for recognizing a hungry stomach using audio signals synthetically collected through mobile phones.
Abstract: The process of transmitting signals to the body regarding an empty stomach is referred to as the migrating motor complex (MMC) process. The intestines and stomach sense the unavailability of food in the body, and the receptors present in the stomach wall generate electrical activity waves that activate hunger. In general, audio signal processing algorithms cover signal analysis, property extraction, and behavior prediction: identifying the patterns present in a signal and predicting how a specific signal correlates with similar signals. The major challenge here is to analyze the audio signals produced by the stomach in order to identify the growling sound that indicates hunger. The main intent of this paper is to develop an intelligent deep learning model for recognizing a hungry stomach using audio signals synthetically collected through mobile phones, bringing intelligent technology to the detection of hunger. The proposed detection model covers several phases of automated hungry-stomach detection. Data acquisition is performed by gathering audio with mobile phones. The signals are then pre-processed with median filtering and smoothing. For classification, spectral features such as spectral centroid, spectral roll-off, spectral skewness, spectral kurtosis, spectral slope, spectral crest factor, and spectral flux, and cepstral-domain features such as mel-frequency cepstral coefficients (MFCCs), linear prediction cepstral coefficients (LPCCs), perceptual linear prediction (PLP) cepstral coefficients, Greenwood function cepstral coefficients (GFCCs), and gammatone cepstral coefficients (GTCCs) are extracted. Optimal feature selection is then performed with an improved meta-heuristic algorithm called the best and worst position updated deer hunting optimization algorithm (BWP-DHOA). An improved deep learning model, an optimized recurrent neural network (RNN), classifies the selected features of the audio signal into growling sounds and burp sounds. Finally, a performance comparison with existing models demonstrates the efficiency and reliability of the proposed model.
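
A sketch of extracting a few of the listed spectral and cepstral features with librosa; the input file is hypothetical, and features such as GFCC, GTCC, and PLP would require other toolboxes.

import librosa
import numpy as np

y, sr = librosa.load("stomach_recording.wav", sr=16000)  # hypothetical file
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # (1, frames)
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)    # (1, frames)
flux = librosa.onset.onset_strength(y=y, sr=sr)           # flux-like measure, (frames,)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, frames)

# trim to a common frame count and stack into one feature matrix
t = min(centroid.shape[1], rolloff.shape[1], len(flux), mfcc.shape[1])
features = np.concatenate(
    [centroid[:, :t], rolloff[:, :t], flux[None, :t], mfcc[:, :t]], axis=0)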

Proceedings ArticleDOI
06 Jun 2021
TL;DR: In this article, the authors proposed and evaluated DNN-based approximations of a state-of-the-art auditory model, which yield accurate neurogram predictions for previously unseen speech signals with processing times shorter than signal duration.
Abstract: Computational models of the auditory periphery are important tools for understanding mechanisms of normal and impaired hearing and for developing advanced speech and audio processing algorithms. However, the simulation of accurate neural representations entails a high computational effort. This prevents the use of auditory models in applications with real-time requirements and the design of speech enhancement algorithms based on efficient bio-inspired optimization criteria. Hence, in this work we propose and evaluate DNN-based approximations of a state-of-the-art auditory model. The DNN models yield accurate neurogram predictions for previously unseen speech signals with processing times shorter than signal duration, thus indicating their potential to be applied in real-time.


Journal ArticleDOI
TL;DR: In this article, an audio scene recognition method based on optimized audio processing and convolutional neural network is proposed, which can make use of the spatial features of the scene and then improve the recognition accuracy.
Abstract: Audio scene recognition is a task that enables devices to understand their environment through digital audio analysis. It belongs to the field of computational auditory scene analysis. At present, this technology has been widely used in intelligent wearable devices, robot sensing services, and other application scenarios. In order to explore the applicability of machine learning technology in the field of digital audio scene recognition, an audio scene recognition method based on optimized audio processing and a convolutional neural network is proposed. Firstly, unlike traditional audio feature extraction based on mel-frequency cepstral coefficients, the proposed method uses a binaural representation and harmonic-percussive source separation to optimize the original audio and extract the corresponding features, so that the system can make use of the spatial features of the scene and thereby improve recognition accuracy. Then, an audio scene recognition system with a two-layer convolution module is designed and implemented. In terms of network structure, we draw on the VGGNet structure from the field of image recognition to increase the network depth and improve system flexibility. Experimental data analysis shows that, compared with traditional machine learning methods, the proposed method can greatly improve the recognition accuracy of each scene and achieve a better generalization effect on different data.
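
The pre-processing idea, harmonic-percussive source separation applied per channel of a stereo recording, can be sketched with librosa as follows; the input file is hypothetical, and the subsequent log-mel feature extraction and CNN are omitted.

import librosa
import numpy as np

y, sr = librosa.load("scene.wav", sr=44100, mono=False)   # hypothetical stereo file
channels = np.atleast_2d(y)                                # handles mono fallback
harmonic, percussive = [], []
for ch in channels:
    h, p = librosa.effects.hpss(ch)                        # harmonic/percussive split
    harmonic.append(h)
    percussive.append(p)
# stack into a multi-channel representation, e.g. for log-mel extraction per channel
stacked = np.stack(harmonic + percussive)                  # (2 * channels, samples)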

Journal ArticleDOI
TL;DR: A novel digital implementation of a model called the Time Difference Encoder performs temporal encoding of event-based signals, translating the time difference between two consecutive input events into a burst of output events.
Abstract: Neuromorphic systems are a viable alternative to conventional systems for real-time tasks with constrained resources. Their low power consumption, compact hardware realization, and low-latency response characteristics are the key ingredients of such systems. Furthermore, the event-based signal processing approach can be exploited for reducing the computational load and avoiding data loss, thanks to its inherently sparse representation of sensed data and adaptive sampling time. In event-based systems, the information is commonly coded by the number of spikes within a specific temporal window. However, event-based signals may contain temporal information which is complex to extract when using rate coding. In this work, we present a novel digital implementation of the model, called Time Difference Encoder, for temporal encoding on event-based signals, which translates the time difference between two consecutive input events into a burst of output events. The number of output events along with the time between them encodes the temporal information. The proposed model has been implemented as a digital circuit with a configurable time constant, allowing it to be used in a wide range of sensing tasks which require the encoding of the time difference between events, such as optical flow based obstacle avoidance, sound source localization and gas source localization. This proposed bio-inspired model offers an alternative to the Jeffress model for the Interaural Time Difference estimation, validated with a sound source lateralization proof-of-concept. The model has been simulated and implemented on an FPGA, requiring 122 slice registers of hardware resources and less than 1 mW of power consumption.
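
A behavioural sketch of the encoding idea, in software rather than the paper's digital circuit: the interval between consecutive input events is mapped to a burst size, with shorter intervals producing more output events. The mapping constants are illustrative, not the hardware's configured values.

def time_difference_encoder(event_times, t_max=0.05, max_burst=8):
    # event_times: sorted timestamps of input events (seconds)
    # t_max: assumed time constant beyond which no output events are emitted
    bursts = []
    for prev, curr in zip(event_times, event_times[1:]):
        dt = curr - prev
        # shorter intervals -> larger bursts, zero beyond the time constant
        n = round(max_burst * max(0.0, 1.0 - dt / t_max))
        if n > 0:
            bursts.append((curr, n))   # (timestamp, number of output events)
    return bursts

print(time_difference_encoder([0.000, 0.004, 0.030, 0.100]))
# -> [(0.004, 7), (0.03, 4)]: the short interval yields a longer burst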