A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks

doi:10.1109/ICASSP.2015.7178320

Proceedings Article

Erik Marchi¹, Fabio Vesperini², Florian Eyben¹, Stefano Squartini², Björn Schuller³

Technische Universität München¹, Marche Polytechnic University², University of Passau³

19 Apr 2015, pp. 1996-2000

TL;DR: This paper presents a novel unsupervised approach based on a denoising autoencoder which significantly outperforms existing methods by achieving up to 93.4% F-Measure.

Abstract: Acoustic novelty detection aims at identifying abnormal/novel acoustic signals which differ from the reference/normal data that the system was trained with. In this paper we present a novel unsupervised approach based on a denoising autoencoder. In our approach auditory spectral features are processed by a denoising autoencoder with bidirectional Long Short-Term Memory recurrent neural networks. We use the reconstruction error between the input and the output of the autoencoder as activation signal to detect novel events. The autoencoder is trained on a public database which contains recordings of typical in-home situations such as talking, watching television, playing and eating. The evaluation was performed on more than 260 different abnormal events. We compare results with state-of-the-art methods and we conclude that our novel approach significantly outperforms existing methods by achieving up to 93.4% F-Measure.
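
The detection step described in the abstract (reconstruction error used as an activation signal) can be illustrated with a minimal Python sketch. The per-frame Euclidean distance, the median-based threshold, the `scale` value, and the toy data are all assumptions made for illustration; the abstract does not state how the error is computed or thresholded, and the autoencoder itself is replaced here by hand-written output frames.

```python
import math
from statistics import median


def reconstruction_errors(inputs, outputs):
    """Per-frame Euclidean distance between input and reconstructed features."""
    return [
        math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_in, frame_out)))
        for frame_in, frame_out in zip(inputs, outputs)
    ]


def detect_novel_frames(errors, scale=3.0):
    """Flag frames whose error exceeds scale * median(errors).

    The median-based threshold and the default scale are illustrative
    assumptions, not values taken from the paper.
    """
    threshold = scale * median(errors)
    return [e > threshold for e in errors]


# Toy data: the hypothetical autoencoder output tracks the input closely
# everywhere except at frame 3, which stands in for a novel event.
inputs = [[0.1, 0.2], [0.2, 0.1], [0.1, 0.1], [0.9, 0.8], [0.2, 0.2]]
outputs = [[0.1, 0.25], [0.2, 0.15], [0.15, 0.1], [0.2, 0.1], [0.2, 0.25]]

errors = reconstruction_errors(inputs, outputs)
flags = detect_novel_frames(errors)
print(flags)
```

In this sketch only the frame that the stand-in autoencoder fails to reconstruct is flagged; a real system would tune the threshold on held-out data.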

Citations

Journal Article

A Unifying Review of Deep and Shallow Anomaly Detection

Lukas Ruff¹, Jacob R. Kauffmann¹, Robert A. Vandermeulen¹, Grégoire Montavon¹, Wojciech Samek², Marius Kloft³, Thomas G. Dietterich⁴, Klaus-Robert Müller¹

Technical University of Berlin¹, Heinrich Hertz Institute², Kaiserslautern University of Technology³, Oregon State University⁴

24 Sep 2020, arXiv: Learning

TL;DR: This review aims to identify the common underlying principles and the assumptions that are often made implicitly by various methods in deep learning, and draws connections between classic “shallow” and novel deep approaches and shows how this relation might cross-fertilize or extend both directions.

Abstract: Deep learning approaches to anomaly detection have recently improved the state of the art in detection performance on complex datasets such as large collections of images or text. These results have sparked a renewed interest in the anomaly detection problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review we aim to identify the common underlying principles as well as the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic 'shallow' and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that is enriched by the use of recent explainability techniques, and present specific worked-through examples together with practical advice. Finally, we outline critical open challenges and identify specific paths for future research in anomaly detection.

310 citations

Cites background from "A novel approach for automatic acou..."

...DAEs, thus, provide a way to specify a noise model for ε (see Section II-C2), which has been applied for noise-robust acoustic novelty detection [42], for instance....

Journal Article

Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge

Annamaria Mesaros¹, Toni Heittola¹, Emmanouil Benetos², Peter Foster², Mathieu Lagrange³, Tuomas Virtanen¹, Mark D. Plumbley³

Tampere University of Technology¹, Queen Mary University of London², École centrale de Nantes³

01 Feb 2018, IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: The emergence of deep learning as the most popular classification method is observed, replacing the traditional approaches based on Gaussian mixture models and support vector machines.

Abstract: Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on detection and classification of acoustic scenes and events (DCASE 2016) has offered such an opportunity for development of the state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyze the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.

276 citations

Journal Article

A Unifying Review of Deep and Shallow Anomaly Detection

Technical University of Berlin¹, Heinrich Hertz Institute², Kaiserslautern University of Technology³, Oregon State University⁴

04 Feb 2021

TL;DR: Deep learning approaches to anomaly detection (AD) have recently improved the state of the art in detection performance on complex data sets, such as large collections of images or text, and have led to the introduction of a great variety of new methods.

Abstract: Deep learning approaches to anomaly detection (AD) have recently improved the state of the art in detection performance on complex data sets, such as large collections of images or text. These results have sparked a renewed interest in the AD problem and led to the introduction of a great variety of new methods. With the emergence of numerous such methods, including approaches based on generative models, one-class classification, and reconstruction, there is a growing need to bring methods of this field into a systematic and unified perspective. In this review, we aim to identify the common underlying principles and the assumptions that are often made implicitly by various methods. In particular, we draw connections between classic “shallow” and novel deep approaches and show how this relation might cross-fertilize or extend both directions. We further provide an empirical assessment of major existing methods that are enriched by the use of recent explainability techniques and present specific worked-through examples together with practical advice. Finally, we outline critical open challenges and identify specific paths for future research in AD.

257 citations

Proceedings Article

Unsupervised Extraction of Video Highlights via Robust Recurrent Auto-Encoders

Huan Yang¹, Baoyuan Wang², Stephen Lin², David Wipf², Minyi Guo¹, Baining Guo²

Shanghai Jiao Tong University¹, Microsoft²

07 Dec 2015

TL;DR: This work presents an unsupervised learning approach that takes advantage of the abundance of user-edited videos on social media websites such as YouTube to infer highlights using only a set of downloaded edited videos, without also needing their pre-edited counterparts which are rarely available online.

Abstract: With the growing popularity of short-form video sharing platforms such as Instagram and Vine, there has been an increasing need for techniques that automatically extract highlights from video. Whereas prior works have approached this problem with heuristic rules or supervised learning, we present an unsupervised learning approach that takes advantage of the abundance of user-edited videos on social media websites such as YouTube. Based on the idea that the most significant sub-events within a video class are commonly present among edited videos while less interesting ones appear less frequently, we identify the significant sub-events via a robust recurrent auto-encoder trained on a collection of user-edited videos queried for each particular class of interest. The auto-encoder is trained using a proposed shrinking exponential loss function that makes it robust to noise in the web-crawled training data, and is configured with bidirectional long short term memory (LSTM) [5] cells to better model the temporal structure of highlight segments. Different from supervised techniques, our method can infer highlights using only a set of downloaded edited videos, without also needing their pre-edited counterparts which are rarely available online. Extensive experiments indicate the promise of our proposed solution in this challenging unsupervised setting.

217 citations

Cites methods from "A novel approach for automatic acou..."

...In [15], novelty detection is performed for audio features using an auto-encoder with LSTM....

Proceedings Article

Safe Visual Navigation via Deep Learning and Novelty Detection

Charles Richter¹, Nicholas Roy¹

Massachusetts Institute of Technology¹

12 Jul 2017

TL;DR: This work uses an autoencoder to recognize when a query is novel and to revert to a safe prior behavior, so that an autonomous deep learning system can be deployed in arbitrary environments without concern for whether it has received the appropriate training.

Abstract: Robots that use learned perceptual models in the real world must be able to safely handle cases where they are forced to make decisions in scenarios that are unlike any of their training examples. However, state-of-the-art deep learning methods are known to produce erratic or unsafe predictions when faced with novel inputs. Furthermore, recent ensemble, bootstrap and dropout methods for quantifying neural network uncertainty may not efficiently provide accurate uncertainty estimates when queried with inputs that are very different from their training data. Rather than unconditionally trusting the predictions of a neural network for unpredictable real-world data, we use an autoencoder to recognize when a query is novel, and revert to a safe prior behavior. With this capability, we can deploy an autonomous deep learning system in arbitrary environments, without concern for whether it has received the appropriate training. We demonstrate our method with a vision-guided robot that can leverage its deep neural network to navigate 50% faster than a safe baseline policy in familiar types of environments, while reverting to the prior behavior in novel environments so that it can safely collect additional training data and continually improve. A video illustrating our approach is available at: http://groups.csail.mit.edu/rrg/videos/safe visual navigation.
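
The fallback logic described in this abstract (trust the learned model on familiar inputs, revert to a safe prior on novel ones) might be sketched as follows. The function names, the threshold value, and the two placeholder policies are hypothetical; the abstract does not specify how novelty is scored or how the policies are represented.

```python
def select_action(reconstruction_error, threshold, learned_policy, safe_policy, observation):
    """Use the learned policy only when the observation looks familiar.

    A reconstruction error above the threshold is treated as a sign of a
    novel input, in which case the safe prior behavior is used instead.
    """
    if reconstruction_error > threshold:
        return safe_policy(observation)
    return learned_policy(observation)


# Hypothetical placeholder policies for illustration only.
def learned_policy(observation):
    return "fast"


def safe_policy(observation):
    return "slow"


familiar = select_action(0.1, 0.5, learned_policy, safe_policy, observation=None)
novel = select_action(0.9, 0.5, learned_policy, safe_policy, observation=None)
print(familiar, novel)
```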

196 citations

Cites background from "A novel approach for automatic acou..."

...including acoustic signals [22], network server anomalies [33], data mining [14], document classification [21] and others....

References

Journal Article

Bidirectional recurrent neural networks

Mike Schuster, Kuldip K. Paliwal

01 Nov 1997, IEEE Transactions on Signal Processing

TL;DR: It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.

Abstract: In the first part of this paper, a regular recurrent neural network (RNN) is extended to a bidirectional recurrent neural network (BRNN). The BRNN can be trained without the limitation of using input information just up to a preset future frame. This is accomplished by training it simultaneously in positive and negative time direction. Structure and training procedure of the proposed network are explained. In regression and classification experiments on artificial data, the proposed structure gives better results than other approaches. For real data, classification experiments for phonemes from the TIMIT database show the same tendency. In the second part of this paper, it is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution. For this part, experiments on real data are reported.
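
The bidirectional idea in this abstract (processing the sequence in both the positive and the negative time direction) can be sketched with a toy one-unit tanh recurrence in Python. This is only an illustration of the forward pass: the scalar weights, the single hidden unit, and the function names are assumptions, and training is not shown.

```python
import math


def rnn_pass(sequence, w_in, w_rec):
    """Run a one-unit tanh RNN over the sequence; return the state at each step."""
    states, h = [], 0.0
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(sequence, fwd_weights, bwd_weights):
    """Pair each time step's forward state with its backward state."""
    forward = rnn_pass(sequence, *fwd_weights)
    # Process the reversed sequence, then flip the states back into time order
    # so that backward[t] summarizes the inputs from step t to the end.
    backward = rnn_pass(sequence[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))


sequence = [1.0, 0.0, -1.0]
states = bidirectional_states(sequence, fwd_weights=(0.5, 0.3), bwd_weights=(0.5, 0.3))
for t, (fwd, bwd) in enumerate(states):
    print(t, round(fwd, 4), round(bwd, 4))
```

At the first time step the forward state has seen only the first input, while the backward state already reflects the whole sequence, which is the future context a unidirectional network lacks.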

7,290 citations

"A novel approach for automatic acou..." refers methods in this paper

...In addition to LSTM memory blocks, we use bidirectional RNNs [23]....
...Suitable types of networks for our purpose are RNNs and Bidirectional RNNs with LSTM units instead of ‘usual’ non-linear ones....
...The best network layout for our BRNNs has six hidden layers (three for each direction) with 216 LSTM units, each....
...The combination of bidirectional RNNs and LSTM memory blocks leads to bidirectional LSTM networks [24], where context from both temporal directions is exploited....
...The best network layout for our RNNs has three hidden layers with 156, 256, and 156 LSTM units, respectively....

Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

P. Vincent

01 Jan 2010

TL;DR: This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.

Abstract: We explore an original strategy for building deep networks, based on stacking layers of denoising autoencoders which are trained locally to denoise corrupted versions of their inputs. The resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders. It is however shown on a benchmark of classification problems to yield significantly lower classification error, thus bridging the performance gap with deep belief networks (DBN), and in several cases surpassing it. Higher level representations learnt in this purely unsupervised fashion also help boost the performance of subsequent SVM classifiers. Qualitative experiments show that, contrary to ordinary autoencoders, denoising autoencoders are able to learn Gabor-like edge detectors from natural image patches and larger stroke detectors from digit images. This work clearly establishes the value of using a denoising criterion as a tractable unsupervised objective to guide the learning of useful higher level representations.
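
The denoising criterion mentioned in this abstract (reconstruct the clean input from a corrupted version of it) can be sketched in Python as follows. Masking noise is used here as one example of a corruption process; the `drop_prob` values, the squared-error loss, and the placeholder models are assumptions for illustration, and no training loop is shown.

```python
import random


def mask_corrupt(x, drop_prob, rng):
    """Masking noise: set each component to zero with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else v for v in x]


def squared_error(target, prediction):
    """Sum of squared differences between two equal-length vectors."""
    return sum((t - p) ** 2 for t, p in zip(target, prediction))


def denoising_loss(x_clean, model, drop_prob, rng):
    """Loss for reconstructing the clean input from its corrupted version.

    The model only ever sees the corrupted vector, but the loss is measured
    against the clean one; that asymmetry is the denoising criterion.
    """
    x_noisy = mask_corrupt(x_clean, drop_prob, rng)
    return squared_error(x_clean, model(x_noisy))


rng = random.Random(0)
x = [0.2, 0.4, 0.6, 0.8]


def identity_model(v):
    """Placeholder model that copies its input and therefore cannot denoise."""
    return v


def perfect_model(v):
    """Placeholder model that always returns the clean vector."""
    return x


loss_identity = denoising_loss(x, identity_model, drop_prob=1.0, rng=rng)
loss_perfect = denoising_loss(x, perfect_model, drop_prob=1.0, rng=rng)
print(loss_identity, loss_perfect)
```

The identity mapping, which would be an ideal solution for an ordinary autoencoder, is penalized here whenever a component is masked, so the model has to learn structure in the data to do well.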

5,303 citations

"A novel approach for automatic acou..." refers background in this paper

...The idea of denoising autoencoders [20] is quite intuitive....

Journal Article

Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol

01 Mar 2010, Journal of Machine Learning Research

TL;DR: This paper builds deep networks by stacking layers of denoising autoencoders, which are trained locally to denoise corrupted versions of their inputs; the resulting algorithm is a straightforward variation on the stacking of ordinary autoencoders.

4,814 citations

Proceedings Article

Greedy Layer-Wise Training of Deep Networks

Yoshua Bengio¹, Pascal Lamblin¹, Dan Popovici¹, Hugo Larochelle¹

Université de Montréal¹

04 Dec 2006

TL;DR: These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

4,385 citations

"A novel approach for automatic acou..." refers background in this paper

...Deep neural networks use it during training of hidden layers to find common data representation from the input [18, 19]....

Proceedings Article

Framewise phoneme classification with bidirectional LSTM and other neural network architectures

Alex Graves, Jürgen Schmidhuber

01 Jan 2005

TL;DR: In this article, a modified, full gradient version of the LSTM learning algorithm was used for framewise phoneme classification, using the TIMIT database, and the results support the view that contextual information is crucial to speech processing, and suggest that bidirectional networks outperform unidirectional ones.

Abstract: In this paper, we present bidirectional Long Short Term Memory (LSTM) networks, and a modified, full gradient version of the LSTM learning algorithm. We evaluate Bidirectional LSTM (BLSTM) and several other network architectures on the benchmark task of framewise phoneme classification, using the TIMIT database. Our main findings are that bidirectional networks outperform unidirectional ones, and Long Short Term Memory (LSTM) is much faster and also more accurate than both standard Recurrent Neural Nets (RNNs) and time-windowed Multilayer Perceptrons (MLPs). Our results support the view that contextual information is crucial to speech processing, and suggest that BLSTM is an effective architecture with which to exploit it.

3,028 citations