Recognition of Urban Sound Events Using Deep Context-Aware Feature Extractors and Handcrafted Features

doi:10.1007/978-3-030-19909-8_16

Home
/
Papers
/
Recognition of Urban Sound Events Using Deep Context-Aware Feature Extractors and Handcrafted Features

Book Chapter•DOI•

Recognition of Urban Sound Events Using Deep Context-Aware Feature Extractors and Handcrafted Features

Theodore Giannakopoulos, Evaggelos Spyrou, Stavros Perantonis

24 May 2019-pp 184-195

TL;DR: The main contribution of this work is the demonstration that transferring audio contextual knowledge using CNNs as feature extractors can significantly improve the performance of the audio classifier, without need for CNN training.

read less

Abstract: This paper proposes a method for recognizing audio events in urban environments that combines handcrafted audio features with a deep learning architectural scheme (Convolutional Neural Networks, CNNs), which has been trained to distinguish between different audio context classes. The core idea is to use the CNNs as a method to extract context-aware deep audio features that can offer supplementary feature representations to any soundscape analysis classification task. Towards this end, the CNN is trained on a database of audio samples which are annotated in terms of their respective “scene” (e.g. train, street, park), and then it is combined with handcrafted audio features in an early fusion approach, in order to recognize the audio event of an unknown audio recording. Detailed experimentation proves that the proposed context-aware deep learning scheme, when combined with the typical handcrafted features, leads to a significant performance boosting in terms of classification accuracy. The main contribution of this work is the demonstration that transferring audio contextual knowledge using CNNs as feature extractors can significantly improve the performance of the audio classifier, without need for CNN training (a rather demanding process that requires huge datasets and complex data augmentation procedures).

...read moreread less

Citations

PDF

Open Access

More filters

Journal Article•DOI•

AUCO ResNet: an end-to-end network for Covid-19 pre-screening from cough and breath

[...]

Vincenzo Dentamaro, Paolo Giglio, Donato Impedovo, Luigi Moretti, Giuseppe Pirlo - Show less +1 more

01 Mar 2022-Pattern Recognition

TL;DR: AUCO ResNet as discussed by the authors is a biologically inspired deep neural network especially designed for sound classification and more specifically for Covid-19 recognition from audio tracks of coughs and breaths, which can be trained end-to-end thus optimizing (with gradient descent) all the modules of the learning algorithm: mel-like filter design, feature extraction, feature selection, dimensionality reduction and prediction.

...read moreread less

20 citations

Journal Article•DOI•

Ensemble of handcrafted and deep features for urban sound classification

[...]

Jederson S. Luz¹, Myllena C. Oliveira¹, Flávio H. D. Araújo¹, Deborah M.V. Magalhães¹•Institutions (1)

Federal University of Piauí¹

01 Apr 2021-Applied Acoustics

TL;DR: A small parameter space CNN model to extract deep features that are combined with handcrafted features extracted from audio signals is proposed, outperforming most of the state-of-the-art CNN models for urban sound classification.

...read moreread less

17 citations

Journal Article•DOI•

Enabling Real-Time Computation of Psycho-Acoustic Parameters in Acoustic Sensors Using Convolutional Neural Networks

[...]

Jesus Lopez-Ballester¹, Adolfo Pastor-Aparicio¹, Santiago Felici-Castell¹, Jaume Segura-Garcia¹, Maximo Cobos¹ - Show less +1 more•Institutions (1)

University of Valencia¹

01 Oct 2020-IEEE Sensors Journal

TL;DR: This paper describes the design and analysis of a deep convolutional neural network (CNN) trained with a big dataset of typical sounds occurrying in a city that allows to predict the psycho-acoustic parameters considered by the well-known Zwicker’s psycho-ACoustic nuisance model with great accuracy.

...read moreread less

Abstract: Sensor networks have become an extremely useful tool for monitoring and analysing many aspects of our daily lives. Noise pollution levels are very important today, especially in cities where the number of inhabitants and disturbing sounds are constantly increasing. Psycho-acoustic parameters are a fundamental tool for assessing the degree of discomfort produced by different sounds and, combined with wireless acoustic sensor networks (WASNs), could enable, for example, the efficient implementation of acoustic discomfort maps within smart cities. However, the continuous monitoring of psycho-acoustic parameters to create time-dependent discomfort maps requires a high computational demand that prevents real-time computations within the nodes. Moreover, sending audio streams outside of the WASN for their further computation, would require extra communication and computational efforts without warranting a real-time monitoring, with the added problem of violating some privacy laws. As a result, most existing systems for nuisance assessment are usually based on less accurate indicators that require lower computational cost. In this paper, we describe the design and analysis of a deep convolutional neural network (CNN) trained with a big dataset of typical sounds occurrying in a city. The CNN allows to predict the psycho-acoustic parameters considered by the well-known Zwicker’s psycho-acoustic nuisance model with great accuracy, directly from the raw recorded audio signal. The proposed CNN-based system has been tested on both desktop computers and typical WASN devices (such as Raspberry Pi), achieving very fast calculation times that allow real-time operation and a continuous monitoring of psycho-acoustic parameters.

...read moreread less

16 citations

Proceedings Article•DOI•

Fusing Handcrafted and Contextual Features for Human Activity Recognition

[...]

Ioannis Vernikos¹, Eirini Mathe², Evaggelos Spyrou¹, Alexandros Mitsou³, Theodore Giannakopoulos³, Phivos Mylonas² - Show less +2 more•Institutions (3)

University of Thessaly¹, Ionian University², National Centre of Scientific Research "Demokritos"³

01 Jun 2019

TL;DR: Experimental results prove that the proposed method significantly improves the recognition accuracy in an arm gesture recognition problem, compared to the use of handcrafted features only.

...read moreread less

Abstract: In this paper we present an approach for the recognition of human activity that combines handcrafted features from 3D skeletal data and contextual features learnt by a trained deep Convolutional Neural Network (CNN). Our approach is based on the idea that contextual features, i.e., features learnt in a similar problem are able to provide a diverse representation, which, when combined with the handcrafted features is able to boost performance. To validate our idea, we train a CNN using a dataset for action recognition and use the output of the last fully-connected layer as a contextual feature representation. Then, a Support Vector Machine is trained upon an early fusion step of both representations. Experimental results prove that the proposed method significantly improves the recognition accuracy in an arm gesture recognition problem, compared to the use of handcrafted features only.

...read moreread less

5 citations

Journal Article•DOI•

Sound Classification and Processing of Urban Environments: A Systematic Literature Review

[...]

Ana Filipa Rodrigues Nogueira, Hugo S. Oliveira, J. Machado, João Manuel R. S. Tavares

01 Nov 2022-Sensors

TL;DR: In this article , the authors summarize the most recent works on this subject to understand the current approaches and identify their limitations, and they conclude that Deep Learning (DL) architectures, attention mechanisms, data augmentation techniques, and pretraining are the most crucial factors to consider while creating an efficient sound classification model.

...read moreread less

Abstract: Audio recognition can be used in smart cities for security, surveillance, manufacturing, autonomous vehicles, and noise mitigation, just to name a few. However, urban sounds are everyday audio events that occur daily, presenting unstructured characteristics containing different genres of noise and sounds unrelated to the sound event under study, making it a challenging problem. Therefore, the main objective of this literature review is to summarize the most recent works on this subject to understand the current approaches and identify their limitations. Based on the reviewed articles, it can be realized that Deep Learning (DL) architectures, attention mechanisms, data augmentation techniques, and pretraining are the most crucial factors to consider while creating an efficient sound classification model. The best-found results were obtained by Mushtaq and Su, in 2020, using a DenseNet-161 with pretrained weights from ImageNet, and NA-1 and NA-2 as augmentation techniques, which were of 97.98%, 98.52%, and 99.22% for UrbanSound8K, ESC-50, and ESC-10 datasets, respectively. Nonetheless, the use of these models in real-world scenarios has not been properly addressed, so their effectiveness is still questionable in such situations.

...read moreread less

2 citations

References

PDF

Open Access

More filters

Proceedings Article•

ImageNet Classification with Deep Convolutional Neural Networks

[...]

Alex Krizhevsky¹, Ilya Sutskever¹, Geoffrey E. Hinton¹•Institutions (1)

University of Toronto¹

03 Dec 2012

TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

...read moreread less

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overriding in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

...read moreread less

73,978 citations

Journal Article•

Dropout: a simple way to prevent neural networks from overfitting

[...]

Nitish Srivastava¹, Geoffrey E. Hinton¹, Alex Krizhevsky¹, Ilya Sutskever¹, Ruslan Salakhutdinov¹ - Show less +1 more•Institutions (1)

University of Toronto¹

01 Jan 2014-Journal of Machine Learning Research

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

...read moreread less

Abstract: Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

...read moreread less

33,597 citations

Journal Article•DOI•

Deep learning in neural networks

[...]

Jürgen Schmidhuber¹•Institutions (1)

University of Lugano¹

01 Jan 2015-Neural Networks

TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, review deep supervised learning, unsupervised learning, reinforcement learning & evolutionary computation, and indirect search for short programs encoding deep and large networks.

...read moreread less

14,635 citations

Posted Content•

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

[...]

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek G. Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay K. Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, Xiaoqiang Zheng - Show less +36 more

01 Jan 2015-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: The TensorFlow interface and an implementation of that interface that is built at Google are described, which has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.

...read moreread less

Abstract: TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

...read moreread less

10,447 citations

Journal Article•DOI•

Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

[...]

Geoffrey E. Hinton¹, Li Deng², Dong Yu², George E. Dahl¹, Abdelrahman Mohamed¹, Navdeep Jaitly¹, Andrew W. Senior³, Vincent Vanhoucke³, Patrick Nguyen³, Tara N. Sainath⁴, Brian Kingsbury⁴ - Show less +7 more•Institutions (4)

University of Toronto¹, Microsoft², Google³, IBM⁴

18 Oct 2012-IEEE Signal Processing Magazine

TL;DR: This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

Abstract: Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

...read moreread less

9,091 citations