
Showing papers on "Audio signal processing published in 2022"


Journal ArticleDOI
TL;DR: This paper presents an overview of research that applies artificial intelligence techniques to human audio signals to screen for, diagnose, monitor, and raise awareness of COVID-19, using non-obtrusive, easy-to-capture bio-signals conveyed in human speech and non-speech audio.

24 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: This work presents unsupervised pretraining models used to build fake audio detection systems for the ADD2022 challenge, which aims to spot various kinds of fake audio.
Abstract: This work presents our systems for the ADD2022 challenge. The ADD2022 challenge is the first audio deep synthesis detection challenge, which aims to spot various kinds of fake audio. We explored using unsupervised pretraining models to build fake audio detection systems. Results indicate that unsupervised pretraining models can achieve excellent performance for fake audio detection. Our final EER results for low-quality fake audio detection and partially fake audio detection are 32.80% and 4.80%, respectively. For partially fake audio detection, our results ranked first in the competition. Even when trained on totally mismatched data, our method still generalizes well for partially fake audio detection.
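
Equal error rate (EER), reported above, is the operating point at which the false acceptance and false rejection rates coincide. A minimal sketch of computing it from detection scores with scikit-learn is shown below; the score and label arrays are placeholders, not the challenge data.

    import numpy as np
    from sklearn.metrics import roc_curve

    def compute_eer(labels, scores):
        """EER: point where the false positive rate equals the false negative rate."""
        fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = fake, 0 = genuine
        fnr = 1.0 - tpr
        idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where the two rates cross
        return (fpr[idx] + fnr[idx]) / 2.0

    # Placeholder detector scores (higher = more likely fake)
    labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
    scores = np.array([0.9, 0.8, 0.4, 0.3, 0.1, 0.35, 0.7, 0.2])
    print(f"EER = {100 * compute_eer(labels, scores):.2f}%")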

15 citations


Journal ArticleDOI
TL;DR: Data-based methods are those that operate directly on the spatial information carried by audio signals, as mentioned in this paper, in contrast to model-based methods, which impose spatial information from, for example, metadata such as the intended position of a source onto signals that are otherwise free of spatial information.
Abstract: The domain of spatial audio comprises methods for capturing, processing, and reproducing audio content that contains spatial information. Data-based methods are those that operate directly on the spatial information carried by audio signals. This is in contrast to model-based methods, which impose spatial information from, for example, metadata like the intended position of a source onto signals that are otherwise free of spatial information. Signal processing has traditionally been at the core of spatial audio systems, and it continues to play a very important role. The irruption of deep learning in many closely related fields has put the focus on the potential of learning-based approaches for the development of data-based spatial audio applications. This article reviews the most important application domains of data-based spatial audio, including well-established methods that employ conventional signal processing, while paying special attention to the most recent achievements that make use of machine learning. Our review is organized based on the topology of the spatial audio pipeline, which consists of capture, processing/manipulation, and reproduction. The literature on the three stages of the pipeline is discussed, as well as on the spatial audio representations that are used to transmit the content between them, highlighting the key references and elaborating on the underlying concepts. We reflect on the literature based on a juxtaposition of the prerequisites that made machine learning successful in domains other than spatial audio with those that are found in the domain of spatial audio as of today. Based on this, we identify routes that may facilitate future advancement.

14 citations


Journal ArticleDOI
TL;DR: This paper found that the auditory cortex in older adults is hyperresponsive to sound onsets, but that sustained neural activity in the auditory cortex, which indexes the processing of a sound pattern, is reduced.

10 citations


Journal ArticleDOI
TL;DR: In this article, the authors propose an efficient technique for anomaly detection and classification of rare events in audio data by extracting mel-frequency cepstral coefficient (MFCC) features from the audio signals of a newly created dataset and selecting the minimum number of best-performing features for optimum performance using principal component analysis (PCA).
Abstract: With the emergence of new digital technologies, a significant surge has been seen in the volume of multimedia data generated from various smart devices. Several challenges have emerged for extracting useful information from multimedia data. One such challenge is the early and accurate detection of anomalies in multimedia data. This study proposes an efficient technique for anomaly detection and classification of rare events in audio data. In this paper, we develop a vast audio dataset containing seven different rare events (anomalies) with 15 different background environmental settings (e.g., beach, restaurant, and train) to focus on both detection of anomalous audio and classification of rare sound events (e.g., baby cry, gunshot, broken glass, footsteps) for audio forensics. The proposed approach extracts mel-frequency cepstral coefficient (MFCC) features from the audio signals of the newly created dataset and selects the minimum number of best-performing features for optimum performance using principal component analysis (PCA). These features are input to state-of-the-art machine learning algorithms for performance analysis. We also apply machine learning algorithms to a state-of-the-art dataset and obtain good results. Experimental results reveal that the proposed approach effectively detects all anomalies and achieves superior performance to existing approaches in all environments and cases.
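
The abstract describes the pipeline only at a high level. A minimal sketch of an MFCC-plus-PCA baseline of that kind, assuming librosa and scikit-learn, is given below; the clips, labels, and component counts are placeholders, not the authors' settings.

    import numpy as np
    import librosa
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline

    def mfcc_vector(y, sr=22050, n_mfcc=40):
        """Summarize a clip's MFCCs as a fixed-length vector (mean and std per coefficient)."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, frames)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Placeholder clips: two synthetic "classes"; the paper uses recorded rare-event audio.
    sr = 22050
    t = np.linspace(0, 1, sr, endpoint=False)
    clips = [np.sin(2 * np.pi * f * t) for f in (300, 320, 2000, 2100)]
    labels = [0, 0, 1, 1]

    X = np.stack([mfcc_vector(c, sr) for c in clips])
    clf = make_pipeline(PCA(n_components=3),                     # keep best-performing components
                        RandomForestClassifier(n_estimators=50, random_state=0))
    clf.fit(X, labels)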

7 citations


Journal ArticleDOI
TL;DR: A generalized framework for describing spatial audio signal processing for the binaural reproduction of recorded sound is proposed, and specific methods for signal transformations such as rotation, translation, and enhancement are presented, enabling additional flexibility in reproduction and improvement in the quality of the binaural signal.
Abstract: Spatial audio has been studied for several decades, but has seen much renewed interest recently due to advances in both software and hardware for capture and playback, and the emergence of applications such as virtual reality and augmented reality. This renewed interest has led to the investment of increasing efforts in developing signal processing algorithms for spatial audio, both for capture and for playback. In particular, due to the popularity of headphones and earphones, many spatial audio signal processing methods have dealt with binaural reproduction based on headphone listening. Among these new developments, processing spatial audio signals recorded in real environments using microphone arrays plays an important role. Following this emerging activity, this paper aims to provide a scientific review of recent developments and an outlook for future challenges. This review also proposes a generalized framework for describing spatial audio signal processing for the binaural reproduction of recorded sound. This framework helps to understand the collective progress of the research community, and to identify gaps for future research. It is composed of five main blocks, namely: the acoustic scene, recording, processing, reproduction, and perception and evaluation. First, each block is briefly presented, and then, a comprehensive review of the processing block is provided. This includes topics from simple binaural recording to Ambisonics and perceptually motivated approaches, which focus on careful array configuration and design. Beamforming and parametric-based processing afford more flexible designs and shift the focus to processing and modeling of the sound field. Then, emerging machine- and deep-learning approaches, which take a further step towards flexibility in design, are described. Finally, specific methods for signal transformations such as rotation, translation and enhancement, enabling additional flexibility in reproduction and improvement in the quality of the binaural signal, are presented. The review concludes by highlighting directions for future research.
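
As a small illustration of the "rotation" transformation mentioned in the abstract, a first-order Ambisonics scene can be rotated about the vertical axis by linearly mixing its X and Y channels. This is a generic sketch, not the paper's method, and it ignores channel-ordering and normalization conventions.

    import numpy as np

    def rotate_foa_yaw(w, x, y, z, yaw_rad):
        """Rotate a first-order Ambisonics scene by yaw_rad about the vertical (z) axis.
        W and Z are unaffected; X and Y mix according to the rotation angle."""
        c, s = np.cos(yaw_rad), np.sin(yaw_rad)
        return w, c * x - s * y, s * x + c * y, z

    # Placeholder FOA channel signals; rotate the scene by 90 degrees.
    n = 48000
    w, x, y, z = (np.random.randn(n) for _ in range(4))
    w_r, x_r, y_r, z_r = rotate_foa_yaw(w, x, y, z, np.pi / 2)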

7 citations



Journal ArticleDOI
TL;DR: In this paper, a multi-modal perception attention network is introduced to derive perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise, and a unique cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between the modalities.
Abstract: Multi-modal fusion is proven to be an effective method to improve the accuracy and robustness of speaker tracking, especially in complex scenarios. However, how to combine the heterogeneous information and exploit the complementarity of multi-modal signals remains a challenging issue. In this paper, we propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities. Specifically, a novel acoustic map based on spatial-temporal Global Coherence Field (stGCF) is first constructed for heterogeneous signal fusion, which employs a camera model to map audio cues to the localization space consistent with the visual cues. Then a multi-modal perception attention network is introduced to derive the perception weights that measure the reliability and effectiveness of intermittent audio and video streams disturbed by noise. Moreover, a unique cross-modal self-supervised learning method is presented to model the confidence of audio and visual observations by leveraging the complementarity and consistency between different modalities. Experimental results show that the proposed MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively, which demonstrates its robustness under adverse conditions and outperforms the current state-of-the-art methods.

4 citations



Journal ArticleDOI
TL;DR: Various application areas, citation records, year-wise publication counts, and source-wise analyses are computed using the Scopus and Web of Science databases, and the analysis establishes the significance of deep learning techniques in audio segmentation.
Abstract: Audio processing has become an inseparable part of modern applications in domains ranging from health care to speech-controlled devices. In automated audio segmentation, deep learning plays a vital role. In this article, we discuss audio segmentation based on deep learning. Audio segmentation divides the digital audio signal into a sequence of segments or frames and then classifies these into classes such as speech, music, or noise. Segmentation plays an important role in audio signal processing. The most important aspect is to secure a large amount of high-quality data when training a deep learning network. In this study, various application areas, citation records, year-wise publication counts, and source-wise analyses are computed using the Scopus and Web of Science (WoS) databases. The analysis presented in this paper supports and establishes the significance of deep learning techniques in audio segmentation.

3 citations



Journal ArticleDOI
TL;DR: In this article, a semi-supervised method is proposed for generating pseudo-labels from unlabeled data using a student-teacher scheme that balances self-training and cross-training.
Abstract: Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and the time boundaries of each sound event in a continuous recording. With advances in deep neural networks, there has been tremendous improvement in the performance of sound event detection systems, although at the expense of costly data collection and labeling efforts. In fact, current state-of-the-art methods employ supervised training that leverages large amounts of data samples and corresponding labels in order to facilitate identification of sound categories and the time stamps of events. As an alternative, the current study proposes a semi-supervised method for generating pseudo-labels from unlabeled data using a student-teacher scheme that balances self-training and cross-training. Additionally, this paper explores post-processing that extracts sound intervals from network predictions, for further improvement in sound event detection performance. The proposed approach is evaluated on the sound event detection task of the DCASE 2020 challenge. The results of these methods on both the "validation" and "public evaluation" sets of the DESED database show significant improvement compared to state-of-the-art systems in semi-supervised learning.
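
The paper's exact training recipe is not reproduced here, but the core student-teacher idea, keeping the teacher as an exponential moving average of the student and using its confident predictions as pseudo-labels, can be sketched as follows. Plain arrays stand in for network parameters, and the confidence threshold is an assumption.

    import numpy as np

    def ema_update(teacher_params, student_params, decay=0.999):
        """Teacher weights track the student as an exponential moving average."""
        return [decay * t + (1.0 - decay) * s
                for t, s in zip(teacher_params, student_params)]

    def make_pseudo_labels(teacher_probs, threshold=0.9):
        """Keep frames where the teacher is confident; mask the rest out of the student loss."""
        labels = (teacher_probs >= 0.5).astype(np.float32)                  # binary event labels
        mask = np.maximum(teacher_probs, 1.0 - teacher_probs) >= threshold  # confident frames only
        return labels, mask

    teacher_probs = np.array([0.97, 0.55, 0.02, 0.40])   # placeholder per-frame probabilities
    labels, mask = make_pseudo_labels(teacher_probs)     # labels [1,1,0,0], mask [T,F,T,F]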

Proceedings ArticleDOI
23 May 2022
TL;DR: In this paper , the authors propose a method to find the temporal location of the splices based on transformer networks, which identifies which temporal portions of a audio signal have undergone single or multiple compression at the temporal frame level, which is the smallest temporal unit of MP3 compression.
Abstract: Audio signals are often stored and transmitted in compressed formats. Among the many available audio compression schemes, MPEG-1 Audio Layer III (MP3) is very popular and widely used. Since MP3 is lossy, it leaves characteristic traces in the compressed audio that can be used forensically to expose the past history of an audio file. In this paper, we consider the scenario of audio signal manipulation done by temporal splicing of compressed and uncompressed audio signals. We propose a method to find the temporal locations of the splices based on transformer networks. Our method identifies which temporal portions of an audio signal have undergone single or multiple compression at the temporal frame level, which is the smallest temporal unit of MP3 compression. We tested our method on a dataset of 486,743 MP3 audio clips. Our method achieved higher performance and demonstrated robustness with respect to different MP3 data when compared with existing methods.

Proceedings ArticleDOI
25 Jun 2022
TL;DR: This work models perforation, demonstrates how it affects classification accuracy, proposes two approaches to deal with the problem, and quantifies the loss of accuracy of a standard classifier when the input audio is perforated.
Abstract: Missing samples are common in many practical audio acquisition systems. These "perforated" audio clips are routinely discarded by today's audio classification systems, even though they may contain information that could have been used to make accurate inferences. In this paper, we study the perforated audio classification problem on an intermittently powered, batteryless system. We model perforation, demonstrate how it affects classification accuracy, and propose two approaches to deal with the problem. We conduct extensive experiments using over 115,000 audio clips from three popular audio datasets and quantify the loss of accuracy of a standard classifier when the input audio is perforated. We also empirically demonstrate how much of the lost accuracy can be regained by the two proposed approaches.
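
The abstract does not spell out the perforation model. One simple, purely illustrative way to simulate perforation, dropping random contiguous chunks of samples as an intermittently powered device might, is sketched below.

    import numpy as np

    def perforate(audio, drop_fraction=0.3, chunk_len=1024, seed=0):
        """Zero out random contiguous chunks until roughly drop_fraction of samples are lost.
        An illustrative model only, not the paper's exact formulation."""
        rng = np.random.default_rng(seed)
        out = audio.copy()
        dropped = 0
        while dropped < int(drop_fraction * len(audio)):
            start = rng.integers(0, max(1, len(audio) - chunk_len))
            out[start:start + chunk_len] = 0.0
            dropped += chunk_len
        return out

    clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # placeholder 1-second clip
    perforated = perforate(clip, drop_fraction=0.3)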

Proceedings ArticleDOI
26 Feb 2022
TL;DR: In this article, various classifiers for detecting speech in audio signals and extracting the data through modules are discussed; speech detection accuracy is highest with stochastic gradient descent (SGD), at 93%.
Abstract: Human-machine interaction is everywhere as technologies affecting audio, natural language processing, and machine vision evolve for artificial intelligence (AI). Speech detection based on AI techniques can be used in voice-driven devices or systems and in automatic speech recognition for security purposes, or for detecting specific sounds, such as instrumental or animal sounds, in audio. This paper discusses various classifiers for detecting speech in the audio signal and extracting the data through modules. An input audio signal of 3 seconds and roughly 60 kB is given to the classifiers, and the performance metrics of the machine learning classifiers (MLC) for extracting speech from the audio signal are compared. Speech detection accuracy is highest with stochastic gradient descent (SGD), at 93%. Specificity, sensitivity, and F1 scores were also calculated for speech detection, and the receiver operating characteristic (ROC) curves of the machine learning classifiers were calculated and compared. MATLAB is used to calculate and analyze the performance metrics for detection from the audio signal.
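
A rough sketch of this kind of classifier comparison, training an SGD classifier alongside another scikit-learn model and comparing accuracy and ROC AUC, is shown below; the feature matrix and labels are synthetic placeholders, and the paper itself works in MATLAB.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, roc_auc_score

    # Placeholder features: in the paper these would come from 3-second audio clips.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 13))
    y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)   # 1 = speech, 0 = non-speech

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    for name, clf in [("SGD", SGDClassifier(loss="log_loss", max_iter=1000)),
                      ("SVM", SVC(probability=True))]:
        clf.fit(X_tr, y_tr)
        proba = clf.predict_proba(X_te)[:, 1]
        print(name, accuracy_score(y_te, clf.predict(X_te)), roc_auc_score(y_te, proba))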

Journal ArticleDOI
TL;DR: A robust approach to dangerous sound event detection (e.g., gunshots) is proposed to improve recent surveillance systems, along with a newly proposed Independent Channel Residual Convolutional Network architecture based on standard residual blocks.
Abstract: The main purpose of this work is to propose a robust approach for dangerous sound event detection (e.g., gunshots) to improve recent surveillance systems. Despite the fact that the detection and classification of different sound events has a long history in signal processing, the analysis of environmental sounds is still challenging. The most recent works prefer a time-frequency 2-D representation of sound as input to feed convolutional neural networks. This paper includes an analysis of known architectures as well as a newly proposed Independent Channel Residual Convolutional Network architecture based on standard residual blocks. Our approach consists of processing three different types of features in the individual channels. The UrbanSound8k and Free Firearm Sound Library audio datasets are used for training and testing data generation, achieving a 98% F1 score. The model was also evaluated in the wild using a manually annotated movie audio track, achieving a 44% F1 score, which is not very high but still better than other state-of-the-art techniques.

Book ChapterDOI
01 Jan 2022
TL;DR: In this paper, the authors present an approach to model a patient's behavior by processing recordings made during therapy, using machine learning to obtain a customized model that makes it possible to evaluate the individual's performance during interaction with other people.
Abstract: In severe degrees of ASD (Autistic Spectrum Disorder), patients are not able to produce or understand natural language, and they also have social disorders that make communication with other people difficult. Their natural language presents different degrees of alteration, in some cases reaching a complete inability to speak. This chapter presents an approach to model the patient's behavior by processing recordings made during therapy. Video and audio data reveal certain hidden patterns, as we presented in previous work. By using machine learning, it is possible to obtain a customized model that makes it possible to evaluate the individual's performance during interaction with other people. The model takes as input a specific set of stereotyped responses collected in a systematic way and labeled as patterns. These movements and sounds represent how patterns in audio and video relate to stimuli from the environment. The findings make it possible to discriminate when and how a reaction, an autistic verbal behavior, occurs.

Journal ArticleDOI
TL;DR: In this paper, a case study of a mesh-network cicada monitoring system for coffee plantations is presented; the system manages the sending, receiving, control, and caching of data traveling between nodes deployed in the field.
Abstract: The Internet of Things (IoT) is increasingly present in people's daily lives and in many projects involving smart farm monitoring. Digital processing of audio signals enables the detection and monitoring of species that emit sounds in crop fields. This paper presents a case study of a mesh-network cicada monitoring system for coffee plantations. The system manages the sending, receiving, control, and caching of data traveling between nodes deployed in the field. Laboratory tests have shown promising results for the intended application.


Proceedings ArticleDOI
15 Jun 2022
TL;DR: Non-negative matrix factorization (NMF) is adopted to enhance the audio data, which makes the differentiation between different audio data more significant; the accuracy of the MLP model built on the reconstructed audio data finally reaches 89.12%.
Abstract: This article draws on deep learning theory and big data technology to build a model for analysing massive amounts of audio data and using it to provide better services. First, spectrograms and waveforms are visualised to initially analyse the audio features. Then, the MFCC and chroma features of the audio are extracted, and MLP models are built and trained separately to classify each feature set. To make the audio recognition technique more efficient, this paper also adopts non-negative matrix factorization (NMF) to enhance the audio data, which makes the differentiation between different audio data more significant; the accuracy of the MLP model built on the reconstructed audio data finally reaches 89.12%.
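
A minimal sketch of NMF-based reconstruction of a magnitude spectrogram prior to feature extraction, assuming librosa and scikit-learn, is given below; the synthetic signal and the rank are placeholders, not the authors' settings.

    import numpy as np
    import librosa
    from sklearn.decomposition import NMF

    # Placeholder audio: a two-tone signal stands in for a real clip.
    sr = 22050
    t = np.linspace(0, 2, 2 * sr, endpoint=False)
    y = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)

    S = np.abs(librosa.stft(y))                           # magnitude spectrogram
    model = NMF(n_components=8, init="nndsvda", max_iter=400)
    W = model.fit_transform(S)                            # spectral templates
    H = model.components_                                 # time activations
    S_enhanced = W @ H                                    # low-rank "enhanced" reconstruction

    mel = librosa.feature.melspectrogram(S=S_enhanced**2, sr=sr)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), sr=sr)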

Posted ContentDOI
08 Jan 2022
TL;DR: In this article, the authors propose to map one-dimensional audio waveforms to two-dimensional images using space-filling curves (SFCs), mappings that do not compress the input signal and preserve its local structure.
Abstract: Since convolutional neural networks (CNNs) have revolutionized the image processing field, they have been widely applied in the audio context. A common approach is to convert the one-dimensional audio signal time series to two-dimensional images using a time-frequency decomposition method; it is also common to discard the phase information. In this paper, we propose to map one-dimensional audio waveforms to two-dimensional images using space-filling curves (SFCs). These mappings do not compress the input signal, while preserving its local structure. Moreover, the mappings benefit from progress made in deep learning and the large collection of existing computer vision networks. We test eight SFCs on two keyword spotting problems. We show that the Z curve yields the best results due to its shift equivariance under convolution operations. Additionally, the Z curve produces comparable results to the widely used mel frequency cepstral coefficients across multiple CNNs.
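
A minimal sketch of the Z-curve (Morton-order) mapping from a one-dimensional signal to a two-dimensional image is given below, assuming the clip contains at least 4**k samples; padding and normalization details are omitted.

    import numpy as np

    def z_curve_image(signal, k):
        """Map the first 4**k samples onto a 2**k x 2**k image along the Z (Morton) curve."""
        side = 2 ** k
        img = np.zeros((side, side), dtype=float)
        for i in range(side * side):
            x = y = 0
            for b in range(k):                        # de-interleave the bits of the 1-D index
                x |= ((i >> (2 * b)) & 1) << b        # even bits -> column
                y |= ((i >> (2 * b + 1)) & 1) << b    # odd bits  -> row
            img[y, x] = signal[i]
        return img

    waveform = np.random.randn(64 * 64)               # placeholder clip of 4**6 samples
    image = z_curve_image(waveform, k=6)              # 64 x 64 image for a CNN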

DatasetDOI
01 Jan 2022
TL;DR: The pyAudioProcessing as mentioned in this paper is a Python based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models.
Abstract: pyAudioProcessing is a Python based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models. This library contains features built in Python that were originally published in MATLAB. pyAudioProcessing allows the user to compute various features from audio files including Gammatone Frequency Cepstral Coefficients (GFCC), Mel Frequency Cepstral Coefficients (MFCC), spectral features, chroma features, and others such as beat-based and cepstrum-based features from audio. One can use these features along with one’s own classification backend or any of the popular scikit-learn classifiers that have been integrated into pyAudioProcessing. Cleaning functions to strip unwanted portions from the audio are another offering of the library. It further contains integrations with other audio functionalities such as frequency and time-series visualizations and audio format conversions. This software aims to provide machine learning engineers, data scientists, researchers, and students with a set of baseline models to classify audio. The library is available at https://github.com/jsingh811/pyAudioProcessing and is under GPL-3.0 license.


Journal ArticleDOI
TL;DR: In this article, a MATLAB and C++ library that performs automated circuit solving for modeling audio effects is presented, alongside a survey of modeling techniques including wave digital filters, state-space modeling, and modified nodal analysis.
Abstract: In music production, many recording and mixing engineers prefer to use analog equipment as a matter of perceptual preference. Digital models of analog circuits have the potential to achieve similar perceptual qualities as hardware without the drawbacks of cost, maintenance, and availability. Many different techniques of system modeling are used in music production software, ranging on a spectrum from "black-box modeling" to "white-box modeling." In black-box modeling, the analog system is modeled as a processing block which maps an input signal to an output signal. Examples include the linear impulse response, adaptive filters, Volterra series, and Wiener-Hammerstein models. In white-box modeling, each individual component of the analog circuit is modeled as part of the overall system. Examples include wave digital filters, state-space modeling, and modified nodal analysis. Various other techniques exist on the spectrum between these two types, using some combination of each. One example is Virtual Analog Filtering based on the Topology Preserving Transform. Machine learning techniques have also had an important role in advancing the accuracy of digital modeling. Lastly, the "Point to Point Library," developed by the author, will be demonstrated. This MATLAB and C++ library performs automated circuit solving for modeling audio effects.
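
To make the black-box end of that spectrum concrete, a minimal Wiener-Hammerstein-style sketch (linear filter, static nonlinearity, linear filter) using SciPy is shown below; the coefficients and tanh nonlinearity are illustrative choices, not a model of any particular circuit or of the Point to Point Library.

    import numpy as np
    from scipy.signal import lfilter

    def wiener_hammerstein(x, b1, a1, b2, a2, drive=2.0):
        """Black-box model: input filter -> static saturating nonlinearity -> output filter."""
        u = lfilter(b1, a1, x)      # first linear block (e.g., pre-emphasis)
        v = np.tanh(drive * u)      # memoryless nonlinearity (soft clipping)
        return lfilter(b2, a2, v)   # second linear block (e.g., tone shaping)

    # Illustrative test signal; real models would be fit to circuit measurements.
    fs = 48000
    x = 0.5 * np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
    y = wiener_hammerstein(x, b1=[1.0, -0.95], a1=[1.0], b2=[0.2], a2=[1.0, -0.8], drive=3.0)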

Journal ArticleDOI
TL;DR: The theory of signal processing and its application to audio was largely developed at Bell Labs in the mid-20th century; Claude Shannon's and Harry Nyquist's early work on communication theory and pulse-code modulation (PCM) laid the foundations for the field.
Abstract: Audio signal processing is also known as digital-analog conversion (DAC). Sound waves are the most common example of longitudinal waves. The speed of sound in a particular medium depends on the properties of that medium and its temperature. Sound waves travel through air when the air elements vibrate, producing changes in pressure and density along the direction of the wave's motion. The analog signal is transformed into digital signals, and the converted digital signals are then sent to devices. This can be used in various applications, such as audio signals, RADAR, speech processing, voice recognition, the entertainment industry, and finding defects in machines using audio signals or frequencies. Signals play an important role in our day-to-day communication, perception of the environment, and entertainment. A joint time-frequency (TF) approach is a better choice for effectively processing these signals. The theory of signal processing and its application to audio was largely developed at Bell Labs in the mid-20th century; Claude Shannon's and Harry Nyquist's early work on communication theory and pulse-code modulation (PCM) laid the foundations for the field.
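
A small worked example of the PCM idea described above: sampling a 1 kHz tone at 8 kHz (comfortably above the 2 kHz Nyquist rate for that tone) and quantizing it to 8 bits. Purely illustrative.

    import numpy as np

    fs = 8000                                   # sampling rate, Hz (Nyquist limit = fs/2 = 4 kHz)
    t = np.arange(0, 0.01, 1 / fs)              # 10 ms of samples
    analog = np.sin(2 * np.pi * 1000 * t)       # 1 kHz tone, well below 4 kHz

    bits = 8
    levels = 2 ** bits
    pcm = np.round((analog + 1.0) / 2.0 * (levels - 1)).astype(np.uint8)   # 8-bit PCM codes
    reconstructed = pcm / (levels - 1) * 2.0 - 1.0                         # back to [-1, 1]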

Proceedings ArticleDOI
01 Jan 2022
TL;DR: The pyAudioProcessing as mentioned in this paper is a Python based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models.
Abstract: pyAudioProcessing is a Python based library for processing audio data, constructing and extracting numerical features from audio, building and testing machine learning models, and classifying data with existing pre-trained audio classification models or custom user-built models. MATLAB is a popular language of choice for a vast amount of research in the audio and speech processing domain. On the contrary, Python remains the language of choice for a vast majority of machine learning research and functionality. This library contains features built in Python that were originally published in MATLAB. pyAudioProcessing allows the user to compute various features from audio files including Gammatone Frequency Cepstral Coefficients (GFCC), Mel Frequency Cepstral Coefficients (MFCC), spectral features, chroma features, and others such as beat-based and cepstrum-based features from audio. One can use these features along with one’s own classification backend or any of the popular scikit-learn classifiers that have been integrated into pyAudioProcessing. Cleaning functions to strip unwanted portions from the audio are another offering of the library. It further contains integrations with other audio functionalities such as frequency and time-series visualizations and audio format conversions. This software aims to provide machine learning engineers, data scientists, researchers, and students with a set of baseline models to classify audio. The library is available at https://github.com/jsingh811/pyAudioProcessing and is under GPL-3.0 license.
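
A short usage sketch in the spirit of this description is given below; the import path, function name, and arguments are assumptions inferred from the abstract and may not match the actual API, so check the linked repository before relying on them.

    # Hypothetical usage; names and arguments are assumptions, see the project README.
    from pyAudioProcessing.extract_features import get_features

    # Compute GFCC and MFCC features for audio files organized in per-class folders.
    features = get_features(folder_path="data", feature_names=["gfcc", "mfcc"])
    print(type(features))   # inspect the returned structure before building a classifier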

Journal ArticleDOI
TL;DR: The possibility of processing sound from noise using a neural network that works with image recognition is discussed, and it is shown that the neural network will be able to recognize the differences in the images in which the noise is visible.
Abstract: This article discusses the possibility of processing sound from noise using a neural network that works with image recognition. To verify this, spectrograms of a 10-second recording of a speaker's voice were considered, together with spectrograms of the same audio track with white noise superimposed. After analyzing the noisy audio track by a subjective method (listening to the audio track) and analyzing the spectrograms of the noisy audio track, it was found that the neural network will be able to recognize the differences in the images in which the noise is visible. This is necessary in order to further train the neural network to recognize the noise intensity of the audio track.
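
A minimal sketch of producing the two spectrograms described, one from the clean recording and one with white noise superimposed, which could then serve as training images, is shown below (librosa and matplotlib assumed; the synthetic signal and noise level are placeholders).

    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    # Placeholder for the 10-second speaker recording described in the article.
    sr = 16000
    t = np.linspace(0, 10, 10 * sr, endpoint=False)
    clean = np.sin(2 * np.pi * 300 * t)
    noisy = clean + 0.05 * np.random.randn(len(clean))           # superimposed white noise

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, sig, title in [(axes[0], clean, "clean"), (axes[1], noisy, "white noise added")]:
        S_db = librosa.amplitude_to_db(np.abs(librosa.stft(sig)), ref=np.max)
        librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
        ax.set_title(title)
    fig.savefig("spectrograms.png")                              # images a CNN could be trained on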


Posted ContentDOI
31 Oct 2022
TL;DR: In this paper, a TF-based audio coding scheme with a novel psychoacoustic model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking are presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.
Abstract: Audio signals are information-rich nonstationary signals that play an important role in our day-to-day communication, perception of the environment, and entertainment. Due to their non-stationary nature, time-only or frequency-only approaches are inadequate for analyzing these signals. A joint time-frequency (TF) approach is a better choice for efficiently processing these signals. In this digital era, compression, intelligent indexing for content-based retrieval, classification, and protection of digital audio content are a few of the areas that encapsulate the majority of audio signal processing applications. In this paper, we present a comprehensive array of TF methodologies that successfully address applications in all of the above-mentioned areas. A TF-based audio coding scheme with a novel psychoacoustic model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking are presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.
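
As a small illustration of the joint time-frequency representation underlying these applications, here is a minimal STFT sketch with SciPy; the window and hop sizes are arbitrary choices.

    import numpy as np
    from scipy.signal import stft

    fs = 16000
    t = np.arange(0, 1.0, 1 / fs)
    x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)   # toy signal

    # Short-time Fourier transform: each column is the spectrum of one windowed frame.
    f, times, Z = stft(x, fs=fs, nperseg=512, noverlap=384)
    magnitude_db = 20 * np.log10(np.abs(Z) + 1e-10)
    print(magnitude_db.shape)    # (frequency bins, time frames)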