
Showing papers on "Audio signal processing published in 2020"


Journal ArticleDOI
TL;DR: The aim of this study is to summarize the literature on audio signal processing, with a particular focus on feature extraction techniques; temporal-domain, frequency-domain, cepstral-domain, wavelet-domain and time-frequency-domain features are discussed in detail.

179 citations


Journal ArticleDOI
TL;DR: This work advances the music emotion recognition state of the art by proposing novel emotionally-relevant audio features related to musical texture and expressive techniques; analysing feature relevance and results uncovered interesting relations.
Abstract: This work advances the music emotion recognition state-of-the-art by proposing novel emotionally-relevant audio features. We reviewed the existing audio features implemented in well-known frameworks and their relationships with the eight commonly defined musical concepts. This knowledge helped uncover musical concepts lacking computational extractors, for which we propose algorithms, namely related to musical texture and expressive techniques. To evaluate our work, we created a public dataset of 900 audio clips, with subjective annotations following Russell's emotion quadrants. The existing audio features (baseline) and the proposed features (novel) were tested using 20 repetitions of 10-fold cross-validation. Adding the proposed features improved the F1-score to 76.4 percent (by 9 percent), when compared to a similar number of baseline-only features. Moreover, analysing feature relevance and results uncovered interesting relations, namely the weight of specific features and musical concepts for each emotion quadrant, and suggests promising new directions for future research in music emotion recognition, interactive media, and novel music interfaces.

98 citations


Proceedings ArticleDOI
04 May 2020
TL;DR: This paper proposes Channel-Attention Dense U-Net, in which the channel-attention unit is applied recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming.
Abstract: Supervised deep learning has gained significant attention for speech enhancement recently. The state-of-the-art deep learning methods perform the task by learning a ratio/binary mask that is applied to the mixture in the time-frequency domain to produce the clean speech. Despite the great performance in the single-channel setting, these frameworks lag in performance in the multichannel setting as the majority of these methods a) fail to exploit the available spatial information fully, and b) still treat the deep architecture as a black box which may not be well-suited for multichannel audio processing. This paper addresses these drawbacks, a) by utilizing complex ratio masking instead of masking on the magnitude of the spectrogram, and more importantly, b) by introducing a channel-attention mechanism inside the deep architecture to mimic beamforming. We propose Channel-Attention Dense U-Net, in which we apply the channel-attention unit recursively on feature maps at every layer of the network, enabling the network to perform non-linear beamforming. We demonstrate the superior performance of the network against the state-of-the-art approaches on the CHiME-3 dataset.
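As a hedged illustration of the channel-attention idea, and not the authors' exact unit (whose design the abstract does not fully specify), the following PyTorch sketch applies a squeeze-and-excitation-style reweighting to the channel dimension of time-frequency feature maps:

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention; a simplified stand-in
    for the channel-attention unit described in the paper."""
    def __init__(self, n_channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(n_channels, n_channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(n_channels // reduction, n_channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (batch, channels, freq, time)
        w = x.mean(dim=(2, 3))             # global average pool -> (batch, channels)
        w = self.fc(w)                     # per-channel weights in [0, 1]
        return x * w[:, :, None, None]     # reweight each feature map

# Example: reweighting 64 feature maps produced by a multichannel spectrogram encoder.
attn = ChannelAttention(64)
features = torch.randn(2, 64, 257, 100)
out = attn(features)                       # same shape, channels rescaled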

69 citations


Journal ArticleDOI
TL;DR: A new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion, allowing on-the-fly spectrogram extraction thanks to its speed, without the need to store any spectrograms on disk.
Abstract: In this paper, we present nnAudio, a new neural network-based audio processing framework with graphics processing unit (GPU) support that leverages 1D convolutional neural networks to perform time-domain to frequency-domain conversion. It allows on-the-fly spectrogram extraction due to its fast speed, without the need to store any spectrograms on disk. Moreover, this approach also allows back-propagation through the waveform-to-spectrogram transformation layer; hence, the transformation process can be made trainable, further optimizing the waveform-to-spectrogram transformation for the specific task that the neural network is trained on. All spectrogram implementations scale linearly with respect to the input length. nnAudio, however, leverages PyTorch's compute unified device architecture (CUDA) support for 1D convolutional neural networks, so its short-time Fourier transform (STFT), Mel spectrogram, and constant-Q transform (CQT) implementations are an order of magnitude faster than implementations that use only the central processing unit (CPU). We tested our framework on three different machines with NVIDIA GPUs, and our framework significantly reduces the spectrogram extraction time from the order of seconds (using a popular Python library, librosa) to the order of milliseconds for audio recordings of the same length. When applying nnAudio to variable-length audio inputs, an average of 11.5 hours is required to extract 34 spectrogram types with different parameters from the MusicNet dataset using librosa, whereas nnAudio requires an average of 2.8 hours, roughly four times faster than librosa. Our proposed framework also outperforms existing GPU processing libraries such as Kapre and torchaudio in terms of processing speed.
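To make the core idea concrete, here is a minimal sketch (not nnAudio's actual code) of a magnitude STFT expressed as a 1D convolution with fixed Fourier-basis kernels; this is what lets the transform run on the GPU and, if the kernels are left trainable, be optimised end to end:

import numpy as np
import torch
import torch.nn as nn

class ConvSTFT(nn.Module):
    """Magnitude STFT as a Conv1d with fixed Fourier kernels, illustrating the
    waveform-to-spectrogram idea behind nnAudio (not its implementation).
    Leaving requires_grad=True on the kernels would make the transform trainable."""
    def __init__(self, n_fft=512, hop=256):
        super().__init__()
        n = np.arange(n_fft)
        k = np.arange(n_fft // 2 + 1)[:, None]
        window = np.hanning(n_fft)
        cos_k = np.cos(2 * np.pi * k * n / n_fft) * window
        sin_k = -np.sin(2 * np.pi * k * n / n_fft) * window
        kernels = np.concatenate([cos_k, sin_k])[:, None, :]       # (2*bins, 1, n_fft)
        self.conv = nn.Conv1d(1, kernels.shape[0], n_fft, stride=hop, bias=False)
        self.conv.weight.data = torch.tensor(kernels, dtype=torch.float32)
        self.conv.weight.requires_grad = False
        self.n_bins = n_fft // 2 + 1

    def forward(self, wav):                        # wav: (batch, samples)
        y = self.conv(wav.unsqueeze(1))            # (batch, 2*bins, frames)
        real, imag = y[:, :self.n_bins], y[:, self.n_bins:]
        return torch.sqrt(real ** 2 + imag ** 2)   # magnitude spectrogram

spec = ConvSTFT()(torch.randn(1, 16000))           # GPU-friendly, no spectrograms stored on disk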

55 citations


Proceedings ArticleDOI
13 Jan 2020
TL;DR: In this paper, the authors construct a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments, where subjects are prompted to answer a straightforward, objective question: are two recordings identical or not?
Abstract: Many audio processing tasks require perceptual assessment. The ``gold standard`` of obtaining human judgments is time-consuming, expensive, and cannot be used as an optimization criterion. On the other hand, automated metrics are efficient to compute but often correlate poorly with human judgment, particularly for audio differences at the threshold of human detection. In this work, we construct a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments. Subjects are prompted to answer a straightforward, objective question: are two recordings identical or not? These pairs are algorithmically generated under a variety of perturbations, including noise, reverb, and compression artifacts; the perturbation space is probed with the goal of efficiently identifying the just-noticeable difference (JND) level of the subject. We show that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. Thus, simply replacing an existing loss (e.g., deep feature loss) with our metric yields significant improvement in a denoising network, as measured by subjective pairwise comparison.

49 citations


Journal ArticleDOI
TL;DR: A new audio encryption scheme that provides a high degree of security, is secure enough to withstand many common attacks, and can be recommended for multi-channel audio processing.
Abstract: Transferring multimedia files such as audio is a common problem in information security. Therefore, various encryption technologies are needed to protect this content. This paper proposes a new audio encryption scheme that provides a high degree of security. The novelty of this scheme is the use of chaotic systems and DNA coding to confuse and diffuse the audio data. The initial value of the chaotic system is controlled by the hash value of the audio, making the chaotic trajectory unpredictable. Comparison experiments using different types of audio show that the algorithm works well, is secure enough to withstand many common attacks, and can be recommended for multi-channel audio processing.
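A toy Python sketch of the scheme's central idea follows; it is an assumption-laden simplification in which a logistic map stands in for the paper's (unspecified) chaotic system, a SHA-256 digest of the audio seeds the map, and the DNA-coding stage is omitted:

import hashlib
import numpy as np

def encrypt_audio(samples: np.ndarray) -> np.ndarray:
    """Toy version of the key idea: seed a chaotic (logistic) map with the hash
    of the audio, then diffuse the samples with the chaotic keystream.
    The DNA-coding stage of the paper is omitted here for brevity."""
    data = samples.astype(np.int16).tobytes()
    digest = hashlib.sha256(data).digest()
    # Map the hash to an initial condition in (0, 1) for the logistic map.
    x = (int.from_bytes(digest[:8], "big") % (10 ** 8)) / 10 ** 8 or 0.5
    r = 3.99                                   # chaotic regime of the logistic map
    keystream = np.empty(len(data), dtype=np.uint8)
    for i in range(len(data)):
        x = r * x * (1.0 - x)
        keystream[i] = int(x * 256) % 256
    cipher = np.frombuffer(data, dtype=np.uint8) ^ keystream
    return cipher                              # the hash-derived seed acts as part of the key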

36 citations


Journal ArticleDOI
TL;DR: The ConditionaL Neural Network (CLNN) and its extension, the Masked ConditionaL Neural Network (MCLNN), are designed to exploit the nature of sound in a time–frequency representation and surpass neural-network-based architectures, including state-of-the-art Convolutional Neural Networks, as well as several hand-crafted attempts.

32 citations


Proceedings ArticleDOI
25 Oct 2020
TL;DR: In this article, the authors proposed a model for the Environment Sound Classification (ESC) that consists of multiple feature channels given as input to a DeepConvolutional Neural Network (CNN) with Attention mechanism.
Abstract: In this paper, we propose a model for the Environment Sound Classification Task (ESC) that consists of multiple feature channels given as input to a Deep Convolutional Neural Network (CNN) with Attention mechanism. The novelty of the paper lies in using multiple feature channels consisting of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-transform (CQT) and Chromagram. Such multiple features have never been used before for signal or audio processing. And, we employ a deeper CNN (DCNN) compared to previous models, consisting of spatially separable convolutions working on time and feature domain separately. Alongside, we use attention modules that perform channel and spatial attention together. We use some data augmentation techniques to further boost performance. Our model is able to achieve state-of-the-art performance on all three benchmark environment sound classification datasets, i.e. the UrbanSound8K (97.52%), ESC-10 (95.75%) and ESC-50 (88.50%). To the best of our knowledge, this is the first time that a single environment sound classification model is able to achieve state-of-the-art results on all three datasets. For ESC-10 and ESC-50 datasets, the accuracy achieved by the proposed model is beyond human accuracy of 95.7% and 81.3% respectively.
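The multiple-feature-channel front end can be sketched as follows; this is an illustrative assumption rather than the authors' pipeline (librosa provides no GFCC implementation, so only MFCC, CQT and chromagram are stacked, and the resize-to-common-grid step is a simplification):

import librosa
import numpy as np
from scipy.ndimage import zoom

def feature_channels(path, shape=(128, 128)):
    """Stack several time-frequency representations as input channels for a CNN.
    Each map is resized to a common grid before stacking (a simplification)."""
    y, sr = librosa.load(path, sr=22050)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40),
        np.abs(librosa.cqt(y, sr=sr)),
        librosa.feature.chroma_stft(y=y, sr=sr),
    ]
    channels = [zoom(f, (shape[0] / f.shape[0], shape[1] / f.shape[1])) for f in feats]
    return np.stack(channels)     # (3, 128, 128), ready for a Conv2d front end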

31 citations


Posted Content
TL;DR: This work shows that nearest neighbor interpolation upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts.
Abstract: A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. We first show that the main sources of upsampling artifacts are: (i) the tonal and filtering artifacts introduced by problematic upsampling operators, and (ii) the spectral replicas that emerge while upsampling. We then compare different upsampling layers, showing that nearest neighbor upsamplers can be an alternative to the problematic (but state-of-the-art) transposed and subpixel convolutions which are prone to introduce tonal artifacts.
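The comparison at the heart of the paper can be reproduced in a few lines of PyTorch; the layer sizes below are arbitrary choices for illustration:

import torch
import torch.nn as nn

# Two ways to upsample a feature map by 4x in a neural synthesizer decoder.
# Transposed convolutions are prone to the tonal (checkerboard-like) artifacts
# discussed in the paper; nearest-neighbour interpolation followed by a regular
# convolution is the alternative the authors advocate.
x = torch.randn(1, 64, 250)                      # (batch, channels, frames)

transposed = nn.ConvTranspose1d(64, 32, kernel_size=16, stride=4, padding=6)
nearest = nn.Sequential(
    nn.Upsample(scale_factor=4, mode="nearest"),
    nn.Conv1d(64, 32, kernel_size=15, padding=7),
)

print(transposed(x).shape)   # torch.Size([1, 32, 1000])
print(nearest(x).shape)      # torch.Size([1, 32, 1000])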

30 citations


Journal ArticleDOI
TL;DR: The development and evolution of audio effect technology is discussed, highlighting major technical breakthroughs and the impact of available audio effects.
Abstract: Audio effects are an essential tool that the field of music production relies upon. The ability to intentionally manipulate and modify a piece of sound has opened up considerable opportunities for music making. The evolution of technology has often driven new audio tools and effects, from early architectural acoustics through electromechanical and electronic devices to the digitisation of music production studios. Throughout history, music has constantly borrowed ideas and technological advancements from other fields and has contributed innovations back in return. This is defined as transsectorial innovation and fundamentally underpins the technological development of audio effects. The development and evolution of audio effect technology are discussed, highlighting major technical breakthroughs and the impact of available audio effects.

Journal ArticleDOI
TL;DR: Different state-of-the-art deep learning models based on convolutional and recurrent neural networks and feedforward WaveNet architectures are analysed, a new model based on the combination of the aforementioned models is introduced, and the performance of these models when modeling various analog audio effects is explored.
Abstract: Virtual analog modeling of audio effects consists of emulating the sound of an audio processor reference device. This digital simulation is normally done by designing mathematical models of these systems. It is often difficult because it seeks to accurately model all components within the effect unit, which usually contains various nonlinearities and time-varying components. Most existing methods for audio effects modeling are either simplified or optimized for a very specific circuit or type of audio effect and cannot be efficiently translated to other types of audio effects. Recently, deep neural networks have been explored as black-box modeling strategies to solve this task, i.e., by using only input–output measurements. We analyse different state-of-the-art deep learning models based on convolutional and recurrent neural networks and feedforward WaveNet architectures, and we also introduce a new model based on the combination of the aforementioned models. Through objective perceptual-based metrics and subjective listening tests we explore the performance of these models when modeling various analog audio effects. Thus, we show virtual analog models of nonlinear effects, such as a tube preamplifier; nonlinear effects with memory, such as a transistor-based limiter; and nonlinear time-varying effects, such as the rotating horn and rotating woofer of a Leslie speaker cabinet.
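A minimal black-box setup of the kind compared in the paper can be sketched as below; the tanh "device" and the single-layer LSTM are illustrative stand-ins, not any of the authors' architectures:

import torch
import torch.nn as nn

class BlackBoxEffect(nn.Module):
    """Minimal recurrent black-box model of an analog effect: it learns a mapping
    from clean (dry) input audio to the processed (wet) output audio using only
    input-output measurements."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, samples, 1)
        h, _ = self.rnn(x)
        return self.out(h)                  # predicted wet signal, same shape

model = BlackBoxEffect()
dry = torch.randn(4, 4096, 1)               # clean input segments
wet = torch.tanh(3 * dry)                   # stand-in for the measured device output
loss = nn.functional.mse_loss(model(dry), wet)   # trained on input-output pairs only
loss.backward()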

Journal ArticleDOI
TL;DR: This work combines statistical models with a machine learning algorithm to improve results, thereby enabling robust communication; experiments on the AURORA 2 database show improved results over the state-of-the-art methods discussed in the literature.

Journal ArticleDOI
TL;DR: In this paper, a neural encoding and decoding scheme called Biologically plausible Auditory Encoding (BAE) was proposed for audio processing, which emulates the functions of the perceptual components of the human auditory system, including the cochlear filter bank, inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve.
Abstract: The auditory front-end is an integral part of a spiking neural network (SNN) when performing auditory cognitive tasks. It encodes a temporally dynamic stimulus, such as speech and audio, into an efficient, effective and reconstructable spike pattern to facilitate subsequent processing. However, most of the auditory front-ends in current studies have not made use of recent findings in psychoacoustics and physiology concerning human listening. In this paper, we propose a neural encoding and decoding scheme that is optimized for audio processing. The neural encoding scheme, which we call Biologically plausible Auditory Encoding (BAE), emulates the functions of the perceptual components of the human auditory system: the cochlear filter bank, the inner hair cells, auditory masking effects from psychoacoustic models, and the spike neural encoding by the auditory nerve. We evaluate the perceptual quality of the BAE scheme using PESQ, and its performance through sound classification and speech recognition experiments. Finally, we also built and published two spike versions of speech datasets, Spike-TIDIGITS and Spike-TIMIT, for researchers to use and for benchmarking future SNN research.

Proceedings Article
30 Apr 2020
TL;DR: Harmonic Convolution is proposed, an operation that helps deep networks distill priors in audio signals by explicitly utilizing the harmonic structure within by engineering the kernel to be supported by sets of harmonic series, instead of local neighborhoods for convolutional kernels.
Abstract: Convolutional neural networks (CNNs) excel in image recognition and generation. Among many efforts to explain their effectiveness, experiments show that CNNs carry strong inductive biases that capture natural image priors. Do deep networks also have inductive biases for audio signals? In this paper, we empirically show that current network architectures for audio processing do not show strong evidence in capturing such priors. We propose Harmonic Convolution, an operation that helps deep networks distill priors in audio signals by explicitly utilizing the harmonic structure within. This is done by engineering the kernel to be supported by sets of harmonic series, instead of local neighborhoods for convolutional kernels. We show that networks using Harmonic Convolution can reliably model audio priors and achieve high performance in unsupervised audio restoration tasks. With Harmonic Convolution, they also achieve better generalization performance for sound source separation.
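The following is a deliberately simplified sketch of the harmonic-support idea, not the published Harmonic Convolution operator: each output bin mixes spectrogram values taken at integer multiples of its own frequency using a small set of learnable weights, instead of a local frequency neighbourhood:

import torch
import torch.nn as nn

class HarmonicGather(nn.Module):
    """Simplified sketch: combine spectrogram values at harmonic multiples of each
    frequency bin with learnable per-harmonic weights."""
    def __init__(self, n_bins, n_harmonics=4):
        super().__init__()
        f = torch.arange(n_bins)
        idx = torch.stack([(h * f).clamp(max=n_bins - 1) for h in range(1, n_harmonics + 1)])
        self.register_buffer("idx", idx)                 # (harmonics, bins)
        self.weight = nn.Parameter(torch.randn(n_harmonics))

    def forward(self, spec):                             # spec: (batch, bins, frames)
        harm = spec[:, self.idx, :]                      # (batch, harmonics, bins, frames)
        return torch.einsum("h,bhft->bft", self.weight, harm)

out = HarmonicGather(257)(torch.randn(2, 257, 100))      # (2, 257, 100)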

Journal ArticleDOI
TL;DR: Three types of infinite impulse response filter, i.e. Butterworth, Chebyshev type I and elliptic, have been designed in this paper as low-pass, high-pass, band-pass and band-stop filters using MATLAB software.
Abstract: In the field of digital signal processing, the function of a filter is to remove unwanted parts of a signal, such as random noise, or to extract useful parts, such as the components lying within a certain frequency range; filters are therefore necessary for removing noise from speech signals during transmission. Filters are broadly used in signal processing and communication systems in applications such as channel equalization, noise reduction, radar, audio processing, speech signal processing, video processing, biomedical signal processing (filtering noisy ECG, EEG and EMG signals), electrical circuit analysis, and the analysis of economic and financial data. In this paper, three types of infinite impulse response filter, i.e. Butterworth, Chebyshev type I and elliptic, are discussed theoretically and experimentally. Butterworth, Chebyshev type I and elliptic low-pass, high-pass, band-pass and band-stop filters are designed using MATLAB software. The impulse responses, magnitude responses and phase responses of the Butterworth, Chebyshev type I and elliptic filters for filtering the speech signal are observed. The speech signal is also analysed, and its sampling rate and spectral response are reported.
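For readers working in Python rather than MATLAB, an equivalent set of designs can be sketched with scipy.signal; the filter order, ripple and cut-off values below are illustrative assumptions, not the paper's settings:

import numpy as np
from scipy import signal

fs = 8000                       # speech sampling rate (Hz)
fc = 1000                       # cut-off frequency (Hz)
wn = fc / (fs / 2)              # normalised cut-off for scipy

# Fourth-order low-pass designs of the three IIR families compared in the paper.
b_butter, a_butter = signal.butter(4, wn, btype="low")
b_cheby1, a_cheby1 = signal.cheby1(4, 1, wn, btype="low")      # 1 dB passband ripple
b_ellip,  a_ellip  = signal.ellip(4, 1, 40, wn, btype="low")   # 1 dB ripple, 40 dB stopband

# Magnitude responses for comparison; a band-pass design only needs
# btype="bandpass" and a pair of normalised edge frequencies.
for name, (b, a) in {"Butterworth": (b_butter, a_butter),
                     "Chebyshev I": (b_cheby1, a_cheby1),
                     "Elliptic": (b_ellip, a_ellip)}.items():
    w, h = signal.freqz(b, a, worN=512, fs=fs)
    gain_2k = 20 * np.log10(abs(h[np.argmin(np.abs(w - 2000))]))
    print(name, "gain at 2 kHz ~", round(gain_2k, 1), "dB")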

Proceedings ArticleDOI
23 Jul 2020
TL;DR: DA-based and offset-binary-coded DA-based MAC cores are developed that offer greater speed than different conventional MACs using various multipliers, and achieve better area and delay results than previous approximate adder designs.
Abstract: The MAC is an essential core used in every digital signal processor. The primary focus of this article is to introduce a high-performance Distributed Arithmetic (DA) based MAC and an offset-binary-coded DA-based MAC for real-time signal processing applications. Addition and multiplication are the two hardware resources widely used to design arithmetic blocks in fields such as video processing, audio processing, speech processing and medical image processing applications. In this article, a literature survey is conducted on different MAC [2] units with different multipliers for generating partial products and performing accumulation. DA-based and offset-binary-coded DA-based MAC cores are developed that offer greater speed than different conventional MACs using various multipliers. The DA-based and offset-based architectures are coded in Verilog, and simulation and synthesis are performed in the Xilinx 14.7 Integrated Simulation Environment. The design achieves better area and delay results than previous approximate adder designs. The DA-based MAC cores are much more efficient in terms of delay, whereas the offset-binary-coded DA offers both speed and area optimization.
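Although the paper's cores are written in Verilog, the distributed-arithmetic trick itself is easy to illustrate in software; the sketch below (unsigned inputs and an arbitrary word length, chosen for clarity) shows how a lookup table of coefficient partial sums replaces the multipliers:

def da_mac(coeffs, xs, n_bits=8):
    """Distributed-arithmetic MAC illustrated in software: multipliers are replaced
    by a lookup table of coefficient partial sums, indexed each cycle by one bit of
    every input (unsigned inputs for simplicity)."""
    k = len(coeffs)
    # LUT[p] = sum of the coefficients whose corresponding bit in pattern p is set.
    lut = [sum(c for j, c in enumerate(coeffs) if (p >> j) & 1) for p in range(2 ** k)]
    acc = 0
    for b in range(n_bits):                      # one "clock cycle" per input bit
        pattern = sum(((x >> b) & 1) << j for j, x in enumerate(xs))
        acc += lut[pattern] << b                 # shift-accumulate, no multiplier
    return acc

coeffs, xs = [3, 5, 7, 2], [10, 20, 30, 40]
assert da_mac(coeffs, xs) == sum(c * x for c, x in zip(coeffs, xs))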

Proceedings ArticleDOI
01 Dec 2020
TL;DR: In this article, a BiGRU-based encoder-decoder architecture is proposed for audio captioning that combines audio embeddings with subject-verb semantic embeddings extracted from the audio captions.
Abstract: Audio captioning is a recently proposed task for automatically generating a textual description of a given audio clip. Most existing approaches use the encoder-decoder model without semantic information. In this study, we propose a bi-directional Gated Recurrent Unit (BiGRU) model based on an encoder-decoder architecture using audio and semantic embeddings. To obtain semantic embeddings, we extract subject-verb embeddings using the subjects and verbs from the audio captions. We use a Multilayer Perceptron classifier to predict subject-verb embeddings of test audio clips at the testing stage. To extract audio features, in addition to log Mel energies, we use a pretrained audio neural network (PANN) as a feature extractor, used for the first time in audio captioning, to explore the usability of audio embeddings for this task. We combine the audio embeddings and semantic embeddings to feed the BiGRU-based encoder-decoder model. Following this, we evaluate our model on two audio captioning datasets: Clotho and AudioCaps. Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that the inclusion of semantic information enhances captioning performance.

Journal ArticleDOI
TL;DR: Recently, percussion-based methods have attracted increasing attention for bolt looseness detection because they eliminate the need for contact sensors.
Abstract: Recently, for bolt looseness detection, percussion-based methods have attracted more attention due to their advantages of eliminating contact sensors. The core issue of percussion-based methods is ...

Journal ArticleDOI
TL;DR: A human search system is described that uses an unmanned aerial vehicle to detect the human voice as a means of finding people that cameras cannot, and a search method is proposed that combines voice-based and camera-based human detection to compensate for their respective shortcomings.
Abstract: This paper describes a human search system that uses an unmanned aerial vehicle (UAV). The use of robots to search for people is expected to become an auxiliary tool for saving lives during a disaster. In particular, because UAVs can collect information from the air, there has been much research into human search using UAVs equipped with cameras. However, the disadvantage of cameras is that they struggle to detect people who are hidden in shadows. To solve this problem, we mounted an array microphone on a UAV to detect the human voice as a means of finding people that cameras cannot. A search method is also proposed that combines voice-based and camera-based human detection to compensate for their respective shortcomings. The rate and accuracy of human detection by the proposed method are assessed experimentally.

Journal ArticleDOI
12 Nov 2020-PLOS ONE
TL;DR: A new method is presented for extracting features related to the rhythmic activity of music signals using the topological properties of a graph constructed from the audio signal; in an attribute selection test, the four network properties ranked among the top positions.
Abstract: Most feature extraction algorithms for music audio signals use Fourier transforms to obtain coefficients that describe specific aspects of music information within the sound spectrum, such as the timbral texture, tonal texture and rhythmic activity. In this paper, we introduce a new method for extracting features related to the rhythmic activity of music signals using the topological properties of a graph constructed from an audio signal. We map the local standard deviation of a music signal to a visibility graph and calculate the modularity (Q), the number of communities (Nc), the average degree (〈k〉), and the density (Δ) of this graph. By applying this procedure to each signal in a database of various musical genres, we detected the existence of a hierarchy of rhythmic self-similarities between musical styles given by these four network properties. Using Q, Nc, 〈k〉 and Δ as input attributes in a classification experiment based on supervised artificial neural networks, we obtained an accuracy higher than or equal to the beat histogram in 70% of the musical genre pairs, using only four features from the networks. Finally, when performing the attribute selection test with Q, Nc, 〈k〉 and Δ, along with the main signal processing field descriptors, we found that the four network properties were among the top-ranking positions given by this test.
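A rough sketch of the pipeline using networkx is given below; the frame length and the brute-force natural-visibility construction are illustrative assumptions rather than the authors' exact settings:

import numpy as np
import networkx as nx
from networkx.algorithms import community

def rhythm_graph_features(y, frame=2048):
    """Sketch of the paper's pipeline: map the local standard deviation of a 1-D
    signal y to a natural visibility graph and return (Q, Nc, <k>, density)."""
    series = np.array([y[i:i + frame].std() for i in range(0, len(y) - frame, frame)])
    n = len(series)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            # (i, j) are connected if no intermediate sample blocks the line of sight.
            visible = all(series[k] < series[j] + (series[i] - series[j]) * (j - k) / (j - i)
                          for k in range(i + 1, j))
            if visible:
                g.add_edge(i, j)
    communities = community.greedy_modularity_communities(g)
    q = community.modularity(g, communities)
    avg_degree = 2 * g.number_of_edges() / n
    return q, len(communities), avg_degree, nx.density(g)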

Posted Content
TL;DR: This work constructs a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments and shows that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods.
Abstract: Many audio processing tasks require perceptual assessment. The ``gold standard`` of obtaining human judgments is time-consuming, expensive, and cannot be used as an optimization criterion. On the other hand, automated metrics are efficient to compute but often correlate poorly with human judgment, particularly for audio differences at the threshold of human detection. In this work, we construct a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments. Subjects are prompted to answer a straightforward, objective question: are two recordings identical or not? These pairs are algorithmically generated under a variety of perturbations, including noise, reverb, and compression artifacts; the perturbation space is probed with the goal of efficiently identifying the just-noticeable difference (JND) level of the subject. We show that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods. Since it is a deep network, the metric is differentiable, making it suitable as a loss function for other tasks. Thus, simply replacing an existing loss (e.g., deep feature loss) with our metric yields significant improvement in a denoising network, as measured by subjective pairwise comparison.

Journal ArticleDOI
TL;DR: It is concluded that the front-end processing of the SONNET provides users with better hearing than does the OPUS 2, and this was offset by the SONNET's superiority in real-life listening situations.
Abstract: Objectives: Speech understanding in noise remains a challenge for many cochlear implant users. To improve this, the SONNET audio processor features three microphone directionality (MD) settings and...

Journal ArticleDOI
TL;DR: This work unfolds an automatic audio signal processing framework that classifies normal versus abnormal respiratory sounds, surpassing the current state of the art by identifying respiratory sound patterns with 66.7% accuracy.
Abstract: There are several diseases (e.g. asthma, pneumonia, etc.) affecting the human respiratory apparatus that alter its airway path substantially, thus changing its acoustic properties. This work unfolds an automatic audio signal processing framework achieving classification between normal and abnormal respiratory sounds. Thanks to a recent challenge, a real-world dataset specifically designed to address the needs of this problem is available to the scientific community. Unlike previous works in the literature, the authors take advantage of information provided by several stethoscopes simultaneously, i.e. elaborating at the acoustic sensor network level. To this end, they employ two feature sets extracted from different domains, i.e. spectral and wavelet. These are modelled by convolutional neural networks, hidden Markov models and Gaussian mixture models. Subsequently, a synergistic scheme is designed operating at the decision level of the best-performing classifier with respect to each stethoscope. Interestingly, such a scheme was able to boost the classification accuracy, surpassing the current state of the art, as it is able to identify respiratory sound patterns with 66.7% accuracy.

Journal ArticleDOI
TL;DR: A novel approach to the DWT is presented in which conventional adders and multipliers are replaced with XOR-MUX adders and truncated multipliers, thereby reducing the 2n logic size to n-size logic.


Proceedings ArticleDOI
05 Oct 2020
TL;DR: This paper investigates how intentional degradation of audio frames affects recognition of the target classes while maintaining effective privacy mitigation; the results indicate that degradation of audio frames has minimal effect on audio recognition using frame-level features.
Abstract: Audio has been increasingly adopted as a sensing modality in a variety of human-centered mobile applications and in smart assistants in the home. Although acoustic features can capture complex semantic information about human activities and context, continuous audio recording often poses significant privacy concerns. An intuitive way to reduce privacy concerns is to degrade audio quality such that speech and other relevant acoustic markers become unintelligible, but this often comes at the cost of activity recognition performance. In this paper, we employ a mixed-methods approach to characterize this balance. We first conduct an online survey with 266 participants to capture their perception of privacy qualitatively and quantitatively with degraded audio. Given our findings that privacy concerns can be significantly reduced at high levels of audio degradation, we then investigate how intentional degradation of audio frames affects the recognition results of the target classes while maintaining effective privacy mitigation. Our results indicate that degradation of audio frames has minimal effect on audio recognition using frame-level features. Furthermore, degradation of audio frames can hurt performance to some extent for audio recognition using segment-level features, though the use of such features may still yield superior recognition performance. Given the different requirements on privacy mitigation and recognition performance for different sensing purposes, such trade-offs need to be balanced in actual implementations.

Proceedings ArticleDOI
15 May 2020
TL;DR: Experimental evaluation using two public datasets, medley-solos-db and gtzan, of monophonic and polyphonic music respectively, demonstrates that the proposed architecture achieves state-of-the-art performance.
Abstract: Bandwidth extension has a long history in audio processing. While speech processing tools do not rely on side information, production-ready bandwidth extension tools for general audio signals rely on side information that has to be transmitted alongside the bitstream of the low-frequency part, mostly because polyphonic music has a more complex and less predictable spectral structure than speech. This paper studies the benefit of using a dilated fully convolutional neural network to perform bandwidth extension of musical audio signals on the magnitude spectra, with no side information. Experimental evaluation using two public datasets, medley-solos-db and gtzan, of monophonic and polyphonic music respectively, demonstrates that the proposed architecture achieves state-of-the-art performance.
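A minimal sketch of such a dilated, fully convolutional magnitude-spectrogram mapper is given below; the layer count, channel widths and number of frequency bins are illustrative assumptions, not the paper's configuration:

import torch
import torch.nn as nn

class DilatedBWE(nn.Module):
    """Minimal dilated, fully convolutional network in the spirit of the paper:
    it maps the magnitude spectrogram of band-limited audio to an estimate of the
    full-band magnitude spectrogram, with no side information."""
    def __init__(self, n_bins=513, channels=64, n_layers=6):
        super().__init__()
        layers, in_ch = [], n_bins
        for i in range(n_layers):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=2 ** i, padding=2 ** i),
                       nn.ReLU()]
            in_ch = channels
        layers += [nn.Conv1d(channels, n_bins, kernel_size=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, low_band_mag):           # (batch, bins, frames)
        return self.net(low_band_mag)          # predicted full-band magnitudes

out = DilatedBWE()(torch.randn(2, 513, 128))   # (2, 513, 128)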

Posted Content
TL;DR: This work shows that two recent, state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes, and that these can be detected with an AUC of up to 0.98 with no degradation in performance on benign samples.
Abstract: Audio processing models based on deep neural networks are susceptible to adversarial attacks even when the adversarial audio waveform is 99.9% similar to a benign sample. Given the wide application of DNN-based audio recognition systems, detecting the presence of adversarial examples is of high practical relevance. By applying anomalous pattern detection techniques in the activation space of these models, we show that 2 of the recent and current state-of-the-art adversarial attacks on audio processing systems systematically lead to higher-than-expected activation at some subset of nodes and we can detect these with up to an AUC of 0.98 with no degradation in performance on benign samples.

Proceedings ArticleDOI
23 Mar 2020
TL;DR: This paper proposes a deep-learning-based system able to estimate the number of people in crowded scenarios from speech sound by directly counting concurrent speakers in overlapping speech and by clustering single-speaker sound by speaker identity over time.
Abstract: People counting techniques have been widely researched recently, and many different types of sensors can be used in this context. In this paper, we propose a system based on a deep-learning model able to identify the number of people in crowded scenarios through speech sound. In a nutshell, the system relies on two components: counting concurrent speakers directly in overlapping speech, and clustering single-speaker sound by speaker identity over time. Compared to previously proposed speaker-counting models that only cluster single-speaker sound, this system is more accurate and less vulnerable to overlapping sound in crowded environments. In addition, counting speakers in overlapping sound also gives the minimum number of speakers, which improves counting accuracy in a quiet environment. Our methodology is inspired by the recently proposed SincNet deep neural network framework, which proves to be outstanding and highly efficient in sound processing with raw signals. By using the bottleneck layer of the SincNet model as features fed to our speaker clustering model, we reach noticeably better performance than previous models that rely on MFCCs and other engineered features.