
Showing papers on "Audio signal processing published in 2019"


Journal ArticleDOI
TL;DR: Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas.
Abstract: Given the recent surge in developments of deep learning, this paper provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
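
As a point of reference for the dominant feature representations discussed above, a log-mel spectrogram can be computed in a few lines; the sketch below uses librosa with illustrative window, hop and mel-band settings and a placeholder filename, none of which are prescribed by the review.

```python
# Minimal log-mel feature extraction (the representation the review identifies
# as dominant). Parameter values and the input path are illustrative only.
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=16000)            # placeholder mono file
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64     # assumed settings
)
log_mel = librosa.power_to_db(mel, ref=np.max)             # shape: (64, n_frames)
```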

445 citations


Journal ArticleDOI
28 Dec 2019-Sensors
TL;DR: An artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better.
Abstract: Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) by using a microphone sensor. Quantifiable emotion recognition using these sensors from speech signals is an emerging area of research in HCI, which applies to multiple applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers to determine the speaker's emotional state from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to perform better. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than a pooling layer, and global discriminative features are learned in fully connected layers. A SoftMax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, with the model size reduced by 34.5 MB. This demonstrates the effectiveness and significance of the proposed SER technique and reveals its applicability in real-world applications.
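
The central architectural choice above, strided convolutions in place of pooling layers for down-sampling, can be sketched roughly as follows; this is a hypothetical PyTorch illustration with made-up channel counts, kernel sizes and number of emotion classes, not the paper's exact DSCNN configuration.

```python
# Sketch of a "stride instead of pooling" CNN for spectrogram input. All layer
# sizes and the default class count are illustrative assumptions.
import torch
import torch.nn as nn

class StridedConvSER(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # stride does the down-sampling
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),   # softmax is applied inside the cross-entropy loss
        )

    def forward(self, x):                # x: (batch, 1, freq, time) spectrogram
        return self.classifier(self.features(x))

logits = StridedConvSER()(torch.randn(8, 1, 128, 250))   # dummy batch of spectrograms
```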

205 citations


Proceedings ArticleDOI
01 Jun 2019
TL;DR: A deep convolutional neural network is devised that learns to decode the monaural soundtrack into its binaural counterpart by injecting visual information about object and scene configurations; the resulting output, called 2.5D visual sound, helps "lift" the flat single-channel audio into spatialized sound.
Abstract: Binaural audio provides a listener with 3D sound sensation, allowing a rich perceptual experience of the scene. However, binaural recordings are scarcely available and require nontrivial expertise and equipment to obtain. We propose to convert common monaural audio into binaural audio by leveraging video. The key idea is that visual frames reveal significant spatial cues that, while explicitly lacking in the accompanying single-channel audio, are strongly linked to it. Our multi-modal approach recovers this link from unlabeled video. We devise a deep convolutional neural network that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart by injecting visual information about object and scene configurations. We call the resulting output 2.5D visual sound---the visual stream helps "lift" the flat single channel audio into spatialized sound. In addition to sound generation, we show the self-supervised representation learned by our network benefits audio-visual source separation. Our video results: http://vision.cs.utexas.edu/projects/2.5D_visual_sound/

114 citations


Proceedings ArticleDOI
29 Sep 2019
TL;DR: FaSNet as discussed by the authors is a two-stage system design that first learns frame-level time-domain adaptive beamforming filters for a selected reference channel, and then calculates the filters for all remaining channels.
Abstract: Beamforming has been extensively investigated for multi-channel audio processing tasks. Recently, learning-based beamforming methods, sometimes called neural beamformers, have achieved significant improvements in both signal quality (e.g. signal-to-noise ratio (SNR)) and speech recognition (e.g. word error rate (WER)). Such systems are generally non-causal and require a large context for robust estimation of inter-channel features, which is impractical in applications requiring low-latency responses. In this paper, we propose the filter-and-sum network (FaSNet), a time-domain, filter-based beamforming approach suitable for low-latency scenarios. FaSNet has a two-stage system design that first learns frame-level time-domain adaptive beamforming filters for a selected reference channel, and then calculates the filters for all remaining channels. The filtered outputs at all channels are summed to generate the final output. Experiments show that despite its small model size, FaSNet is able to outperform several traditional oracle beamformers with respect to scale-invariant signal-to-noise ratio (SI-SNR) in reverberant speech enhancement and separation tasks. Moreover, when trained with a frequency-domain objective function on the CHiME-3 dataset, FaSNet achieves a 14.3% relative word error rate reduction (RWERR) compared with the baseline model. These results show the efficacy of FaSNet particularly in reverberant and noisy signal conditions.
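
The filter-and-sum operation at the heart of the approach is simply a per-channel FIR filter followed by a sum across channels; the NumPy sketch below shows only that fixed operation, with placeholder filters rather than the frame-level filters FaSNet learns.

```python
# Filter-and-sum beamforming primitive: filter each channel with its own FIR
# filter and sum the results. FaSNet learns these filters per frame; the random
# data and filters below are placeholders for illustration.
import numpy as np

def filter_and_sum(x, filters):
    """x: (n_channels, n_samples), filters: (n_channels, filter_len)."""
    out = np.zeros(x.shape[1] + filters.shape[1] - 1)
    for ch in range(x.shape[0]):
        out += np.convolve(x[ch], filters[ch])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16000))            # 4 channels, 1 s at 16 kHz (dummy)
filters = 0.1 * rng.standard_normal((4, 16))   # placeholder beamforming filters
y = filter_and_sum(x, filters)
```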

109 citations


Proceedings ArticleDOI
18 Mar 2019
TL;DR: This paper exploits knowledge of the signal processing algorithms commonly used by VPSes to generate the data fed into machine learning systems, in particular the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms.
Abstract: Voice Processing Systems (VPSes), now widely deployed, have been made significantly more accurate through the application of recent advances in machine learning. However, adversarial machine learning has similarly advanced and has been used to demonstrate that VPSes are vulnerable to the injection of hidden commands - audio obscured by noise that is correctly recognized by a VPS but not by human beings. Such attacks, though, are often highly dependent on white-box knowledge of a specific machine learning model and limited to specific microphones and speakers, making their use across different acoustic hardware platforms (and thus their practicality) limited. In this paper, we break these dependencies and make hidden command attacks more practical through model-agnostic (black-box) attacks, which exploit knowledge of the signal processing algorithms commonly used by VPSes to generate the data fed into machine learning systems. Specifically, we exploit the fact that multiple source audio samples have similar feature vectors when transformed by acoustic feature extraction algorithms (e.g., FFTs). We develop four classes of perturbations that create unintelligible audio and test them against 12 machine learning models, including 7 proprietary models (e.g., Google Speech API, Bing Speech API, IBM Speech API, Azure Speaker API), and demonstrate successful attacks against all targets. Moreover, we successfully use our maliciously generated audio samples in multiple hardware configurations, demonstrating effectiveness across both models and real systems. In so doing, we demonstrate that domain-specific knowledge of audio signal processing represents a practical means of generating successful hidden voice command attacks.
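
The property being exploited, that very different waveforms can yield the same acoustic features, is easy to demonstrate in isolation: randomizing the phase of a signal's full-length FFT leaves that magnitude spectrum untouched while audibly rearranging the signal in time. The snippet below is a generic illustration of this property with a synthetic tone burst, not a reproduction of any of the paper's four perturbation classes.

```python
# A tone burst and a phase-randomized copy share the same full-length magnitude
# spectrum, so any feature derived from that magnitude spectrum is identical,
# although the burst is smeared across the whole second and sounds different.
import numpy as np

rng = np.random.default_rng(1)
sr = 16000
t = np.arange(sr) / sr
x = np.zeros(sr)
x[: sr // 10] = np.sin(2 * np.pi * 440 * t[: sr // 10])   # 100 ms tone burst

X = np.fft.rfft(x)
phase = rng.uniform(0, 2 * np.pi, size=X.shape)
phase[0] = phase[-1] = 0.0                                  # keep DC/Nyquist bins real
x_perturbed = np.fft.irfft(np.abs(X) * np.exp(1j * phase), n=len(x))

print(np.allclose(np.abs(np.fft.rfft(x)),
                  np.abs(np.fft.rfft(x_perturbed))))        # True: identical magnitudes
```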

108 citations


Journal ArticleDOI
TL;DR: In this paper, a blind source separation method is used to separate source signals from noise and an extended principal component analysis is used for dimensionality reduction, which can be used to classify tool wear conditions with high accuracy.

85 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: In this article, three neural network-based approaches for mean opinion score (MOS) estimation were proposed, with a fully connected deep neural network using Mel-frequency features providing the best correlation and lowest mean squared error.
Abstract: Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to three instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563, and the speech-to-reverberation energy ratio. Our evaluation uses a speech dataset contaminated with convolutive and additive noise, labeled using a crowd-based QoE evaluation, and measures performance with the Pearson correlation against the MOS labels and the mean squared error of the estimated MOS. Our proposed approaches outperform the aforementioned instrumental measures, with a fully connected deep neural network using Mel-frequency features providing the best correlation (0.87) and the lowest mean squared error (0.15).
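
The two evaluation criteria mentioned, Pearson correlation with the MOS labels and mean squared error of the estimated MOS, are straightforward to compute; the values below are made-up numbers used only to show the calculation.

```python
# Evaluation of a MOS estimator: Pearson correlation and MSE between predicted
# and subjective MOS. The two arrays are dummy values for illustration.
import numpy as np

mos_true = np.array([4.2, 3.1, 2.5, 4.8, 1.9, 3.6])   # crowd-sourced labels (dummy)
mos_pred = np.array([4.0, 3.3, 2.2, 4.5, 2.1, 3.4])   # model estimates (dummy)

r = np.corrcoef(mos_true, mos_pred)[0, 1]              # Pearson correlation
mse = np.mean((mos_true - mos_pred) ** 2)              # mean squared error
print(f"Pearson r = {r:.2f}, MSE = {mse:.2f}")
```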

61 citations


Posted Content
TL;DR: This work presents an investigation of the applicability of neural networks for non-intrusive audio quality assessment, and proposes three neural network-based approaches for mean opinion score (MOS) estimation.
Abstract: Estimating the perceived quality of an audio signal is critical for many multimedia and audio processing systems. Providers strive to offer optimal and reliable services in order to increase the user quality of experience (QoE). In this work, we present an investigation of the applicability of neural networks for non-intrusive audio quality assessment. We propose three neural network-based approaches for mean opinion score (MOS) estimation. We compare our results to three instrumental measures: the perceptual evaluation of speech quality (PESQ), the ITU-T Recommendation P.563, and the speech-to-reverberation energy ratio. Our evaluation uses a speech dataset contaminated with convolutive and additive noise, labeled using a crowd-based QoE evaluation, and measures performance with the Pearson correlation against the MOS labels and the mean squared error of the estimated MOS. Our proposed approaches outperform the aforementioned instrumental measures, with a fully connected deep neural network using Mel-frequency features providing the best correlation (0.87) and the lowest mean squared error (0.15).

58 citations


Proceedings ArticleDOI
03 Jul 2019
TL;DR: In this article, the authors investigate the reasons why CNN architectures perform worse in acoustic scene classification compared to simpler models (e.g., VGG) and demonstrate the importance of the receptive field (RF) to the generalization capability of the models.
Abstract: Convolutional Neural Networks (CNNs) have had great success in many machine vision as well as machine audition tasks. Many image recognition network architectures have consequently been adapted for audio processing tasks. However, despite some successes, the performance of many of these did not translate from the image to the audio domain. For example, very deep architectures such as ResNet [1] and DenseNet [2], which significantly outperform VGG [3] in image recognition, do not perform better in audio processing tasks such as Acoustic Scene Classification (ASC). In this paper, we investigate the reasons why such powerful architectures perform worse in ASC compared to simpler models (e.g., VGG). To this end, we analyse the receptive field (RF) of these CNNs and demonstrate the importance of the RF to the generalization capability of the models. Using our receptive field analysis, we adapt both ResNet and DenseNet, achieving state-of-the-art performance and eventually outperforming the VGG-based models. We introduce systematic ways of adapting the RF in CNNs, and present results on three data sets that show how changing the RF over the time and frequency dimensions affects a model’s performance. Our experimental results show that very small or very large RFs can cause performance degradation, but deep models can be made to generalize well by carefully choosing an appropriate RF size within a certain range.
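
The receptive field the authors analyse grows with kernel size and accumulated stride; the helper below is the generic textbook recurrence for one dimension (time or frequency), not the authors' own analysis code, and the example layer configuration is made up.

```python
# Receptive field of a stack of conv layers along one axis: each layer adds
# (kernel - 1) times the product of the strides of all preceding layers.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, first layer first."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# e.g. five 3x3 convolutions with stride 2 in the second and fourth layers
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]))  # -> 21
```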

55 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: The proposed attentional similarity module can be plugged into any metric-based learning method for few-shot learning, allowing the resulting model to especially match related short sound events.
Abstract: In this paper, we introduce a novel attentional similarity module for the problem of few-shot sound recognition. Given a few examples of an unseen sound event, a classifier must be quickly adapted to recognize the new sound event without much fine-tuning. The proposed attentional similarity module can be plugged into any metric-based learning method for few-shot learning, allowing the resulting model to especially match related short sound events. Extensive experiments on two datasets show that the proposed module consistently improves the performance of five different metric-based learning methods for few-shot sound recognition. The relative improvement ranges from +4.1% to +7.7% for 5-shot 5-way accuracy for the ESC-50 dataset, and from +2.1% to +6.5% for noiseESC-50. Qualitative results demonstrate that our method contributes in particular to the recognition of transient sound events.

53 citations


Journal ArticleDOI
29 Mar 2019
TL;DR: A framework for audio-based activity recognition is proposed that can make use of millions of embedding features from public online video sound clips; based on the combination of oversampling and deep learning approaches, it does not require further feature processing or outlier filtering.
Abstract: Over the years, activity sensing and recognition has been shown to play a key enabling role in a wide range of applications, from sustainability and human-computer interaction to health care. While many recognition tasks have traditionally employed inertial sensors, acoustic-based methods offer the benefit of capturing rich contextual information, which can be useful when discriminating complex activities. Given the emergence of deep learning techniques and leveraging new, large-scale multimedia datasets, this paper revisits the opportunity of training audio-based classifiers without the onerous and time-consuming task of annotating audio data. We propose a framework for audio-based activity recognition that can make use of millions of embedding features from public online video sound clips. Based on the combination of oversampling and deep learning approaches, our framework does not require further feature processing or outliers filtering as in prior work. We evaluated our approach in the context of Activities of Daily Living (ADL) by recognizing 15 everyday activities with 14 participants in their own homes, achieving 64.2% and 83.6% averaged within-subject accuracy in terms of top-1 and top-3 classification respectively. Individual class performance was also examined in the paper to further study the co-occurrence characteristics of the activities and the robustness of the framework.

Journal ArticleDOI
TL;DR: An end-to-end feedforward convolutional neural network that is able to reliably classify the source and type of animal calls in a noisy environment using two streams of audio data after being trained on a dataset of modest size and imperfect labels is introduced.
Abstract: This paper introduces an end-to-end feedforward convolutional neural network that is able to reliably classify the source and type of animal calls in a noisy environment using two streams of audio data after being trained on a dataset of modest size and imperfect labels. The data consists of audio recordings from captive marmoset monkeys housed in pairs, with several other cages nearby. The network in this paper can classify both the call type and which animal made it with a single pass through a single network using raw spectrogram images as input. The network vastly increases data analysis capacity for researchers interested in studying marmoset vocalizations, and allows data collection in the home cage, in group housed animals.

Posted Content
TL;DR: Experiments show that despite its small model size, FaSNet is able to outperform several traditional oracle beamformers with respect to scale-invariant signal-to-noise ratio (SI-SNR) in reverberant speech enhancement and separation tasks.
Abstract: Beamforming has been extensively investigated for multi-channel audio processing tasks. Recently, learning-based beamforming methods, sometimes called neural beamformers, have achieved significant improvements in both signal quality (e.g. signal-to-noise ratio (SNR)) and speech recognition (e.g. word error rate (WER)). Such systems are generally non-causal and require a large context for robust estimation of inter-channel features, which is impractical in applications requiring low-latency responses. In this paper, we propose the filter-and-sum network (FaSNet), a time-domain, filter-based beamforming approach suitable for low-latency scenarios. FaSNet has a two-stage system design that first learns frame-level time-domain adaptive beamforming filters for a selected reference channel, and then calculates the filters for all remaining channels. The filtered outputs at all channels are summed to generate the final output. Experiments show that despite its small model size, FaSNet is able to outperform several traditional oracle beamformers with respect to scale-invariant signal-to-noise ratio (SI-SNR) in reverberant speech enhancement and separation tasks. Moreover, when trained with a frequency-domain objective function on the CHiME-3 dataset, FaSNet achieves a 14.3% relative word error rate reduction (RWERR) compared with the baseline model. These results show the efficacy of FaSNet particularly in reverberant and noisy signal conditions.

Journal ArticleDOI
TL;DR: It is found that precise English vowel perception and accurate English grammatical judgment were linked to lower psychoacoustic thresholds, better auditory‐motor integration, and more consistent frequency‐following responses to sound, suggesting that they are dissociable indexes of sound processing.

Journal ArticleDOI
11 Mar 2019-PLOS ONE
TL;DR: The technical details of the 3D Tune-In Toolkit are presented, outlining its architecture and describing the processes implemented in each of its components, followed by a comparison between the features offered by the 3DTI Toolkit and those found in other currently available open- and closed-source binaural renderers.
Abstract: The 3D Tune-In Toolkit (3DTI Toolkit) is an open-source standard C++ library which includes a binaural spatialiser. This paper presents the technical details of this renderer, outlining its architecture and describing the processes implemented in each of its components. In order to put this description into context, the basic concepts behind binaural spatialisation are reviewed through a chronology of research milestones in the field in the last 40 years. The 3DTI Toolkit renders the anechoic signal path by convolving sound sources with Head Related Impulse Responses (HRIRs), obtained by interpolating those extracted from a set that can be loaded from any file in a standard audio format. Interaural time differences are managed separately, in order to be able to customise the rendering according to the head size of the listener, and to reduce comb-filtering when interpolating between different HRIRs. In addition, geometrical and frequency-dependent corrections for simulating near-field sources are included. Reverberation is computed separately using a virtual-loudspeaker Ambisonic approach and convolution with Binaural Room Impulse Responses (BRIRs). In all these processes, special care has been taken to avoid audible artefacts produced by changes in gains and audio filters due to the movements of sources and of the listener. The 3DTI Toolkit performance, as well as some other relevant metrics such as non-linear distortion, are assessed and presented, followed by a comparison between the features offered by the 3DTI Toolkit and those found in other currently available open- and closed-source binaural renderers.
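
The anechoic path described above reduces to convolving the mono source with the left- and right-ear HRIRs for its direction (with ITD handled separately, as noted); the sketch below shows only that convolution step, using random placeholder HRIRs rather than a measured set loaded by the toolkit.

```python
# Core of binaural spatialisation: convolve a mono source with the left/right
# HRIRs for its direction. A full renderer such as the 3DTI Toolkit also
# interpolates HRIRs, manages ITD separately and adds reverberation; the
# impulse responses below are random placeholders.
import numpy as np

def binauralize(mono, hrir_left, hrir_right):
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])          # (2, n_samples + hrir_len - 1)

rng = np.random.default_rng(2)
source = rng.standard_normal(16000)             # 1 s of dummy source audio
hrir_l = 0.05 * rng.standard_normal(256)        # placeholder HRIRs (normally
hrir_r = 0.05 * rng.standard_normal(256)        # measured or loaded from file)
binaural = binauralize(source, hrir_l, hrir_r)
```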

Patent
12 Apr 2019
TL;DR: In this article, an audio processing device is described that comprises an audio classifier for classifying audio signals into at least one audio type in real time, an audio improving device for improving the experience of audiences, and an adjusting unit for adjusting at least one parameter of the audio improving device based on a confidence value of at least one audio type.
Abstract: The invention discloses a device and a method for audio classification and audio processing. In one implementation mode, the audio processing device comprises an audio classifier for classifying audio signals to at least one audio type in real time, an audio improving device for improving the experience of audiences and an adjusting unit for adjusting at least one parameter of the audio improving device based on a confidence value of at least one audio type in a continuous mode.

Journal ArticleDOI
TL;DR: The structure of the spatial correlation matrix is comprehensively studied, showing that under some well-defined conditions, the DOA of the direct sound can be correctly extracted from its dominant eigenvector, even when contaminated by reflections.
Abstract: Direction of arrival (DOA) estimation for speech sources is an important task in audio signal processing. This task becomes a challenge in reverberant environments, which are typical to real scenarios. Several methods of DOA estimation for speech sources have been developed recently, in an attempt to overcome the effect of reverberation. One effective approach aims to identify time-frequency bins in the short time Fourier transform domain that are dominated by the direct sound. This approach was shown to be particularly adequate for spherical arrays, with processing in the spherical harmonics domain. The direct-path dominance (DPD) test, and a method which is based on the directivity of the sound field are recent examples. While these methods seem to perform well, high reverberation conditions may degrade their performance. In this paper, the structure of the spatial correlation matrix is comprehensively studied, showing that under some well-defined conditions, the DOA of the direct sound can be correctly extracted from its dominant eigenvector, even when contaminated by reflections. This new insight leads to the development of a new test, performing an enhanced decomposition of the direct sound (EDS), denoted the DPD-EDS test. The proposed test is compared to previous DPD tests, and to other recently proposed reverberation-robust methods, using computer simulations and an experimental study, demonstrating its potential advantage. The studies include multiple speakers in highly reverberant environments, therefore representing challenging real-life acoustics scenes.
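
The basic quantity involved, the spatial correlation matrix of a local time-frequency region and its dominant eigenvector, can be sketched as follows; the conditions under which the DPD-EDS test trusts that eigenvector to carry the direct-path DOA are not reproduced here, and the STFT data below is random placeholder data.

```python
# Spatial correlation matrix over a small time-frequency region and its
# dominant eigenvector, the quantity from which DPD-style tests extract the
# direct-sound DOA. Random snapshots stand in for real array STFT data.
import numpy as np

rng = np.random.default_rng(3)
n_mics, n_bins = 8, 12
snapshots = (rng.standard_normal((n_bins, n_mics))
             + 1j * rng.standard_normal((n_bins, n_mics)))   # one STFT vector per TF bin

R = snapshots.conj().T @ snapshots / n_bins        # (n_mics, n_mics) correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)               # Hermitian eigendecomposition (ascending)
dominant = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
dominance = eigvals[-1] / eigvals.sum()            # crude single-source dominance measure
```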

Journal ArticleDOI
TL;DR: A novel byte-level method for detecting malware by audio signal processing techniques is presented, where program’s bytes are converted to a meaningful audio signal, then Music Information Retrieval techniques are employed to construct a machine learning music classification model from audio signals to detect new and unseen instances.
Abstract: Each year, a huge number of malicious programs are released, which makes malware detection a critical task in computer security. Antiviruses use various methods for detecting malware, such as signature-based and heuristic-based techniques. Polymorphic and metamorphic malware employ obfuscation techniques to bypass the traditional detection methods used by antiviruses. Recently, the number of these malware has increased dramatically. Most of the previously proposed methods to detect malware are based on high-level features such as opcodes, function calls or a program’s control flow graph (CFG). Due to new obfuscation techniques, extracting high-level features is tough, fallible and time-consuming; hence approaches using a program’s raw bytes are quicker and more accurate. In this paper, a novel byte-level method for detecting malware by audio signal processing techniques is presented. In our proposed method, a program’s bytes are converted to a meaningful audio signal, then Music Information Retrieval (MIR) techniques are employed to construct a machine learning music classification model from audio signals to detect new and unseen instances. Experiments evaluate the influence of different strategies for converting bytes to audio signals and the effectiveness of the method.
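
As a rough illustration of the byte-to-signal idea, the sketch below maps a file's raw bytes to a waveform and extracts MFCCs; the sample rate, feature settings and file path are arbitrary assumptions, and a plain MFCC front end merely stands in for the paper's broader MIR feature set.

```python
# Treat a program's raw bytes as an audio waveform and extract MFCC features.
# The file path, assumed sample rate and MFCC settings are illustrative only.
import numpy as np
import librosa

with open("sample.bin", "rb") as f:                          # placeholder binary
    raw = np.frombuffer(f.read(), dtype=np.uint8)

signal = (raw.astype(np.float32) - 128.0) / 128.0            # map bytes to [-1, 1)
mfcc = librosa.feature.mfcc(y=signal, sr=8000, n_mfcc=20)    # (20, n_frames)
```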

Journal ArticleDOI
TL;DR: It is shown that even if recent improvements in digital video and signal processing allow for increased automation of processing, the context of the NICU makes a fully automated analysis of long recordings problematic.
Abstract: Objective - Video and sound acquisition and processing technologies have seen great improvements in recent decades, with many applications in the biomedical area. The aim of this paper is to review the overall state of the art of advances within these topics in paediatrics and to evaluate their potential application for monitoring in the neonatal intensive care unit (NICU). Approach - For this purpose, more than 150 papers dealing with video and audio processing were reviewed. For both topics, clinical applications are described according to the considered cohorts: full-term newborns, infants and toddlers, or preterm newborns. Then, processing methods are presented, in terms of data acquisition, feature extraction and characterization. Main results - The paper first focuses on the exploitation of video recordings; these began to be automatically processed in the 2000s and we show that they have mainly been used to characterize infant motion. Other applications, including respiration and heart rate estimation and facial analysis, are also presented. Audio processing is then reviewed, with a focus on the analysis of crying. The first studies in this field focused on induced-pain cries and the newest ones deal with spontaneous cries; the analyses are mainly based on frequency features. Then, some papers dealing with non-cry signals are also discussed. Significance - Finally, we show that even if recent improvements in digital video and signal processing allow for increased automation of processing, the context of the NICU makes a fully automated analysis of long recordings problematic. A few proposals for overcoming some of the limitations are given.

Proceedings ArticleDOI
12 May 2019
TL;DR: This work investigates deep learning architectures for audio processing and aims to find a general purpose end-to-end deep neural network to perform modeling of nonlinear audio effects.
Abstract: In the context of music production, distortion effects are mainly used for aesthetic reasons and are usually applied to electric musical instruments. Most existing methods for nonlinear modeling are often either simplified or optimized to a very specific circuit. In this work, we investigate deep learning architectures for audio processing and we aim to find a general purpose end-to-end deep neural network to perform modeling of nonlinear audio effects. We show the network modeling various nonlinearities and we discuss the generalization capabilities among different instruments.

Posted Content
TL;DR: The receptive field (RF) of CNNs is analysed and the importance of the RF to the generalization capability of the models is demonstrated, showing that very small or very large RFs can cause performance degradation, but deep models can be made to generalize well by carefully choosing an appropriate RF size within a certain range.
Abstract: Convolutional Neural Networks (CNNs) have had great success in many machine vision as well as machine audition tasks. Many image recognition network architectures have consequently been adapted for audio processing tasks. However, despite some successes, the performance of many of these did not translate from the image to the audio domain. For example, very deep architectures such as ResNet and DenseNet, which significantly outperform VGG in image recognition, do not perform better in audio processing tasks such as Acoustic Scene Classification (ASC). In this paper, we investigate the reasons why such powerful architectures perform worse in ASC compared to simpler models (e.g., VGG). To this end, we analyse the receptive field (RF) of these CNNs and demonstrate the importance of the RF to the generalization capability of the models. Using our receptive field analysis, we adapt both ResNet and DenseNet, achieving state-of-the-art performance and eventually outperforming the VGG-based models. We introduce systematic ways of adapting the RF in CNNs, and present results on three data sets that show how changing the RF over the time and frequency dimensions affects a model's performance. Our experimental results show that very small or very large RFs can cause performance degradation, but deep models can be made to generalize well by carefully choosing an appropriate RF size within a certain range.

Journal ArticleDOI
TL;DR: In this study, an automatic method for key-events detection and summarisation based on audio-visual features is presented for cricket videos and achieves an average accuracy of 95.5%, which signifies its effectiveness.
Abstract: Sports broadcasters generate an enormous amount of video content in cyberspace due to massive viewership all over the world. Analysing and consuming this huge repository motivates broadcasters to apply video summarisation, extracting the exciting segments from the entire video to capture users' interest and reap storage and transmission benefits. Therefore, in this study, an automatic method for key-event detection and summarisation based on audio-visual features is presented for cricket videos. Acoustic local binary pattern features are used to capture the excitement level in the audio stream, which is used to train a binary support vector machine (SVM) classifier. The trained SVM classifier is used to label each audio frame as excited or non-excited. Excited audio frames are used to select candidate key video frames. A decision tree-based classifier is trained to detect key events in the input cricket videos, which are then used for video summarisation. The performance of the proposed framework has been evaluated on a diverse dataset of cricket videos belonging to different tournaments and broadcasters. Experimental results indicate that the proposed method achieves an average accuracy of 95.5%, which signifies its effectiveness.

Journal ArticleDOI
TL;DR: This paper proposes a unified approach for modeling capacitors and inductors in the WD domain using generic linear multi-step discretization methods with variable time-step size, and provides generalized adaptation conditions.
Abstract: There is a growing interest in Virtual Analog modeling algorithms for musical audio processing designed in the Wave Digital (WD) domain. Such algorithms typically employ a discretization strategy based on the trapezoidal rule with a fixed sampling step, though this is not the only option. In fact, alternative discretization strategies, possibly with an adaptive sampling step, can be quite advantageous, particularly when dealing with nonlinear systems characterized by stiff equations. In this paper, we propose a unified approach for modeling capacitors and inductors in the WD domain using generic linear multi-step discretization methods with variable time-step size, and provide generalized adaptation conditions. We also show that the proposed approach for implementing dynamic energy-storing elements in the WD domain is particularly suitable to be combined with a recently developed technique for efficiently solving a class of circuits with multiple one-port nonlinearities, called the Scattering Iterative Method. Finally, as examples of application, we develop WD models for a Van der Pol oscillator and a dynamic diode-based ring modulator, which use different discretization methods.
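
For orientation, the familiar fixed-step trapezoidal case that the paper generalizes yields the standard wave digital one-port models of the reactive elements; these are textbook results with sampling period T, not the paper's variable-step multi-step formulas.

```latex
% Trapezoidal-rule WD models of a capacitor and an inductor (textbook results),
% where a and b are the incident and reflected waves at the element port and
% R is the corresponding port resistance.
\begin{align}
  \text{capacitor } C:\quad & b[n] = a[n-1],  & R_C &= \frac{T}{2C},\\
  \text{inductor } L:\quad  & b[n] = -a[n-1], & R_L &= \frac{2L}{T}.
\end{align}
```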

Proceedings ArticleDOI
01 Jan 2019
TL;DR: This study addresses acoustic scene classification from raw audio signals and proposes a cascaded CNN architecture that uses a spatial pyramid pooling (SPP) method to aggregate local features coming from the convolutional layers of the CNN.
Abstract: Automatic understanding of audio events and acoustic scenes has been an active research topic for researchers from the signal processing and machine learning communities. Recognition of acoustic scenes in real-life scenarios is a challenging task due to the diversity of environmental sounds and uncontrolled environments. Efficient methods and feature representations are needed to cope with these challenges. In this study, we address acoustic scene classification from raw audio signals and propose a cascaded CNN architecture that uses the spatial pyramid pooling (SPP, also referred to as spatial pyramid matching) method to aggregate local features coming from the convolutional layers of the CNN. We use three well-known audio features, namely MFCC, Mel energy, and the spectrogram, to represent audio content and evaluate the effectiveness of our proposed CNN-SPP architecture on the DCASE 2018 acoustic scene performance dataset. Our results show that the proposed CNN-SPP architecture with the spectrogram feature improves the classification accuracy.
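
Spatial pyramid pooling turns convolutional feature maps of arbitrary size into a fixed-length vector by pooling over several grid resolutions and concatenating the results; the sketch below is a generic PyTorch illustration with assumed pyramid levels, not the paper's exact CNN-SPP configuration.

```python
# Generic spatial pyramid pooling: pool the feature map at several grid sizes
# and concatenate, giving a fixed-length vector regardless of input size.
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """feature_map: (batch, channels, H, W) output of the convolutional layers."""
    pooled = [
        F.adaptive_max_pool2d(feature_map, output_size=(lv, lv)).flatten(start_dim=1)
        for lv in levels
    ]
    return torch.cat(pooled, dim=1)           # (batch, channels * sum(lv * lv))

x = torch.randn(8, 64, 13, 42)                # dummy conv features, arbitrary H and W
print(spatial_pyramid_pool(x).shape)          # torch.Size([8, 1344]) = 64 * (1 + 4 + 16)
```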

Journal ArticleDOI
TL;DR: A novel binaural SSL method based on time–frequency convolutional neural network (TF-CNN) with multitask learning is proposed to simultaneously localize azimuth and elevation under unknown acoustic conditions and achieves preferable localization performance compared with other popular methods.
Abstract: Sound source localization (SSL) is an important technique for many audio processing systems, such as speech enhancement/recognition and human-robot interaction. Although many methods have been proposed for SSL, it still remains a challenging task to achieve accurate localization under adverse acoustic scenarios. In this paper, a novel binaural SSL method based on time-frequency convolutional neural network (TF-CNN) with multitask learning is proposed to simultaneously localize azimuth and elevation under unknown acoustic conditions. First, the interaural phase difference and interaural level difference are extracted from the received binaural signals, which are taken as the input of the proposed SSL neural network. Then, an SSL neural network is designed to map the interaural cues to sound direction, which consists of TF-CNN module and multitask neural network. The TF-CNN module learns and combines the time-frequency information of extracted interaural cues to generate the shared feature for multitask SSL. With the shared feature, a multitask neural network is designed to simultaneously estimate azimuth and elevation through multitask learning, which generates the posterior probability for candidate directions. Finally, the candidate direction with the highest probability is taken as the final direction estimation. The experiments based on public head-related transfer function (HRTF) database demonstrate that the proposed method achieves preferable localization performance compared with other popular methods.
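
The interaural cues that feed the network are simple to extract from a binaural STFT pair; the sketch below uses random placeholder signals and assumed STFT settings.

```python
# Interaural phase difference (IPD) and interaural level difference (ILD) per
# time-frequency bin of a binaural pair. Signals and STFT settings are placeholders.
import numpy as np
import librosa

rng = np.random.default_rng(4)
left = rng.standard_normal(16000)
right = rng.standard_normal(16000)

L = librosa.stft(left, n_fft=512, hop_length=256)
R = librosa.stft(right, n_fft=512, hop_length=256)

ipd = np.angle(L * np.conj(R))                          # radians, per TF bin
ild = 20.0 * np.log10(np.abs(L) / (np.abs(R) + 1e-8))   # dB, per TF bin
```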

Patent
04 Jun 2019
TL;DR: In this article, the authors describe an audio output device that includes a communication module that communicates with an external electronic device, a speaker that outputs sound, and a mounting detection sensor that detects whether the audio output devices is mounted on a user of the device.
Abstract: Systems, methods, and audio output devices are described. In one aspect, an audio output device includes a communication module that communicates with an external electronic device, a speaker that outputs sound, a mounting detection sensor that detects whether the audio output device is mounted on a user of the audio output device, and a control circuit that is electrically connected with the communication module, the speaker, and the mounting detection sensor. The control circuit wirelessly connects the external electronic device with the audio output device using the communication module if the mounting of the audio output device is detected by the mounting detection sensor, receives audio data from the external electronic device through the wireless connection, and outputs the audio data using the speaker.

Posted Content
TL;DR: This is the first time that a single environment sound classification model is able to achieve state-of-the-art results on all three datasets, and the accuracy achieved by the proposed model is beyond human accuracy.
Abstract: In this paper, we propose a model for the Environment Sound Classification (ESC) task that consists of multiple feature channels given as input to a Deep Convolutional Neural Network (CNN) with an attention mechanism. The novelty of the paper lies in using multiple feature channels consisting of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the Constant Q-transform (CQT) and the Chromagram; such a combination of features has not been used before for signal or audio processing. In addition, we employ a deeper CNN (DCNN) compared to previous models, consisting of spatially separable convolutions working on the time and feature domains separately, alongside attention modules that perform channel and spatial attention together. We use data augmentation techniques to further boost performance. Our model is able to achieve state-of-the-art performance on all three benchmark environment sound classification datasets, i.e. UrbanSound8K (97.52%), ESC-10 (95.75%) and ESC-50 (88.50%). To the best of our knowledge, this is the first time that a single environment sound classification model has achieved state-of-the-art results on all three datasets. For the ESC-10 and ESC-50 datasets, the accuracy achieved by the proposed model is beyond the reported human accuracy of 95.7% and 81.3%, respectively.
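
Three of the four feature channels can be extracted directly with librosa (GFCC requires a separate gammatone filterbank implementation, not shown); how the channels are resized to a common shape and stacked as network input is specific to the paper and omitted here, and the parameter values and file path below are assumptions.

```python
# Extraction of three of the four feature channels (MFCC, CQT, chromagram);
# GFCC needs a separate gammatone implementation. Settings are illustrative.
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050)    # placeholder audio clip
hop = 512
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, hop_length=hop)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=hop))
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
print(mfcc.shape, cqt.shape, chroma.shape)       # time axes align; bin counts differ
```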

Proceedings ArticleDOI
01 May 2019
TL;DR: In this article, a complex-valued spectrogram of a sum of sinusoids can be reduced to a low-rank representation by modifying the phase of the spectrogram, and a convex prior emphasizing harmonic signals is proposed for audio denoising.
Abstract: The low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment, where the phase of the observed signal is utilized for resynthesizing the estimated signal. In order to address this limitation, we directly treat a complex-valued spectrogram and show that a complex-valued spectrogram of a sum of sinusoids can be made approximately low-rank by modifying its phase. For evaluating the applicability of the proposed low-rank representation, we further propose a convex prior emphasizing harmonic signals, and it is applied to audio denoising.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: A data-mining framework to learn the synchronous pattern between different channels from large recorded audio/text dataset and visual dataset, and apply it to generate realistic talking face animations is presented.
Abstract: Providing methods to support audio-visual interaction with growing volumes of video data is an increasingly important challenge for data mining. To this end, there has been some success in speech-driven lip motion generation or talking face generation. Among them, talking face generation aims to generate realistic talking heads synchronized with the audio or text input. This task requires mining the relationship between audio signal/text and lip-sync video frames and ensures the temporal continuity between frames. Due to the issues such as polysemy, ambiguity, and fuzziness of sentences, creating visual images with lip synchronization is still challenging. To overcome the problems above, we present a data-mining framework to learn the synchronous pattern between different channels from large recorded audio/text dataset and visual dataset, and apply it to generate realistic talking face animations. Specifically, we decompose this task into two steps: mouth landmarks prediction and video synthesis. First, a multimodal learning method is proposed to generate accurate mouth landmarks with multimedia inputs (both text and audio). Second, a network named Face2Vid is proposed to generate video frames conditioned on the predicted mouth landmarks. In Face2Vid, optical flow is employed to model the temporal dependency between frames, meanwhile, a self-attention mechanism is introduced to model the spatial dependency across image regions. Extensive experiments demonstrate that our method can generate realistic videos with background, and exhibit the superiorities on accurate synchronization of lip movements and smooth transition of facial movements.

Journal ArticleDOI
TL;DR: In this article, a directional feedback delay network (DFDN) was proposed to produce a non-uniform, direction-dependent decay time, suitable for anisotropic decay reproduction on a loudspeaker array or in binaural playback through the use of ambisonics.
Abstract: Artificial reverberation algorithms are used to enhance dry audio signals. Delay-based reverberators can produce a realistic effect at a reasonable computational cost. While the recent popularity of spatial audio algorithms is mainly related to the reproduction of the perceived direction of sound sources, there is also a need to spatialize the reverberant sound field. Usually multichannel reverberation algorithms output a series of decorrelated signals yielding an isotropic energy decay. This means that the reverberation time is uniform in all directions. However, the acoustics of physical spaces can exhibit more complex direction-dependent characteristics. This paper proposes a new method to control the directional distribution of energy over time, within a delay-based reverberator, capable of producing a directional impulse response with anisotropic energy decay. We present a method using multichannel delay lines in conjunction with a direction-dependent transform in the spherical harmonic domain to control the direction-dependent decay of the late reverberation. The new reverberator extends the feedback delay network, retaining its time-frequency domain characteristics. The proposed directional feedback delay network reverberator can produce non-uniform direction-dependent decay time, suitable for anisotropic decay reproduction on a loudspeaker array or in binaural playback through the use of ambisonics.
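
For context, a plain isotropic feedback delay network, the structure the paper extends with a direction-dependent transform in the spherical harmonic domain, can be sketched in a few lines; delay lengths, the feedback gain and the orthogonal feedback matrix below are illustrative, and the directional extension itself is not reproduced.

```python
# Minimal isotropic feedback delay network (FDN): parallel delay lines coupled
# through an orthogonal feedback matrix. The directional FDN described above
# additionally shapes the decay per direction; that part is not shown.
import numpy as np

delays = [149, 211, 263, 293]                     # delay-line lengths in samples (illustrative)
g = 0.97                                          # broadband feedback gain (sets decay time)
A = np.linalg.qr(np.random.default_rng(5).standard_normal((4, 4)))[0]  # orthogonal feedback matrix

def fdn(x, n_out):
    buffers = [np.zeros(d) for d in delays]       # circular delay-line buffers
    idx = [0] * len(delays)
    y = np.zeros(n_out)
    for n in range(n_out):
        taps = np.array([buf[i] for buf, i in zip(buffers, idx)])  # delay-line outputs
        y[n] = taps.sum()
        feedback = g * (A @ taps)
        sample_in = x[n] if n < len(x) else 0.0
        for k in range(len(delays)):
            buffers[k][idx[k]] = sample_in + feedback[k]   # write after reading = full delay
            idx[k] = (idx[k] + 1) % delays[k]
    return y

ir = fdn(np.array([1.0]), 48000)                  # 1 s impulse response at 48 kHz
```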