Audio Concept Classification with Hierarchical Deep Neural Networks

Home
/
Papers
/
Audio Concept Classification with Hierarchical Deep Neural Networks

Posted Content•

Audio Concept Classification with Hierarchical Deep Neural Networks

Mirco Ravanelli¹, Benjamin Elizalde², Karl Ni³, Gerald Friedland²•Institutions (3)

fondazione bruno kessler¹, International Computer Science Institute², Lawrence Livermore National Laboratory³

11 Oct 2017-arXiv: Audio and Speech Processing-

TL;DR: In this paper, the authors explored the potential of deep learning in classifying audio concepts on user-Generated Content videos, using two cascaded neural networks in a hierarchical configuration to analyze the short and long-term context information.

read less

Abstract: Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Although deep learning has shown promise in various applications such as speech and object recognition, it has not yet met the expectations for other fields such as audio concept classification. This paper explores, for the first time, the potential of deep learning in classifying audio concepts on User-Generated Content videos. The proposed system is comprised of two cascaded neural networks in a hierarchical configuration to analyze the short- and long-term context information. Our system outperforms a GMM approach by a relative 54%, a Neural Network by 33%, and a Deep Neural Network by 12% on the TRECVID-MED database

...read moreread less

Citations

PDF

Open Access

More filters

Proceedings Article•DOI•

Environmental sound classification with convolutional neural networks

[...]

Karol J. Piczak¹•Institutions (1)

Warsaw University of Technology¹

12 Nov 2015

TL;DR: The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.

...read moreread less

Abstract: This paper evaluates the potential of convolutional neural networks in classifying short audio clips of environmental sounds. A deep model consisting of 2 convolutional layers with max-pooling and 2 fully connected layers is trained on a low level representation of audio data (segmented spectrograms) with deltas. The accuracy of the network is evaluated on 3 public datasets of environmental and urban recordings. The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.

...read moreread less

742 citations

Proceedings Article•DOI•

Enhancing Micro-video Understanding by Harnessing External Sounds

[...]

Liqiang Nie¹, Xiang Wang², Jianglong Zhang³, Xiangnan He², Hanwang Zhang⁴, Richang Hong⁵, Qi Tian⁶ - Show less +3 more•Institutions (6)

Shandong University¹, National University of Singapore², Communication University of China³, Columbia University⁴, Hefei University of Technology⁵, University of Texas at San Antonio⁶

19 Oct 2017

TL;DR: A deep transfer model is developed which can jointly enhance the concept-level representation of micro-videos and the venue category prediction and alleviate the sparsity problem of unpopular categories by leveraging the external acoustic knowledge.

...read moreread less

Abstract: Different from traditional long videos, micro-videos are much shorter and usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video is recorded, for example, in a concert or on a beach. However, according to our statistics over two million micro-videos, only $1.22%$ of them were labeled with location information. For the remaining large number of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual and, acoustic content), and more importantly, the quality of each modality varies greatly for different micro-videos. In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that although the acoustic signal can well complement the visual and textual signal in reflecting a micro-video's venue, its quality is usually relatively lower. As such, simply integrating acoustic features with visual and textual features only leads to suboptimal results, or even adversely degrades the overall performance (cf the barrel theory). To address this, we propose to compensate the shortest board --- the acoustic modality --- via harnessing the external sound knowledge. We develop a deep transfer model which can jointly enhance the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores by leveraging the external acoustic knowledge.

...read moreread less

93 citations

Proceedings Article•DOI•

Audio-based multimedia event detection using deep recurrent neural networks

[...]

Yun Wang¹, Leonardo Neves¹, Florian Metze¹•Institutions (1)

Carnegie Mellon University¹

17 Mar 2016

TL;DR: This paper introduces longer-range temporal information with deep recurrent neural networks (RNNs) for both stages ofimedia event detection, and observes improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.

...read moreread less

Abstract: Multimedia event detection (MED) is the task of detecting given events (e.g. birthday party, making a sandwich) in a large collection of video clips. While visual features and automatic speech recognition typically provide the best features for this task, nonspeech audio can also contribute useful information, such as crowds cheering, engine noises, or animal sounds. MED is typically formulated as a two-stage process: the first stage generates clip-level feature representations, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether a given event occurs in a video clip. Both stages are usually performed "statically", i.e. using only local temporal information, or bag-of-words models. In this paper, we introduce longer-range temporal information with deep recurrent neural networks (RNNs) for both stages. We classify each audio frame among a set of semantic units called "noisemes" the sequence of frame-level confidence distributions is used as a variable-length clip-level representation. Such confidence vector sequences are then fed into long short-term memory (LSTM) networks for clip-level classification. We observe improvements in both frame-level and clip-level performance compared to SVM and feed-forward neural network baselines.

...read moreread less

75 citations

Journal Article•DOI•

Environmental sound classification with dilated convolutions

[...]

Yan Chen¹, Qian Guo¹, Xinyan Liang¹, Jiang Wang¹, Yuhua Qian¹ - Show less +1 more•Institutions (1)

Shanxi University¹

01 May 2019-Applied Acoustics

TL;DR: Dilated CNN, being introduced to ESC problem, achieves better results than that of CNN with max-pooling and other state-of-the-art approaches, and the effect of different dilation rate and the number of layers of dilated convolution to the experimental results is explored.

...read moreread less

72 citations

Posted Content•

A network of deep neural networks for distant speech recognition

[...]

Mirco Ravanelli¹, Philemon Brakel², Maurizio Omologo¹, Yoshua Bengio²•Institutions (2)

fondazione bruno kessler¹, Université de Montréal²

23 Mar 2017-arXiv: Computation and Language

TL;DR: In this paper, the authors proposed a novel architecture based on a network of deep neural networks, where all the components are jointly trained and better cooperate with each other thanks to a full communication scheme between them.

...read moreread less

Abstract: Despite the remarkable progress recently made in distant speech recognition, state-of-the-art technology still suffers from a lack of robustness, especially when adverse acoustic conditions characterized by non-stationary noises and reverberation are met. A prominent limitation of current systems lies in the lack of matching and communication between the various technologies involved in the distant speech recognition process. The speech enhancement and speech recognition modules are, for instance, often trained independently. Moreover, the speech enhancement normally helps the speech recognizer, but the output of the latter is not commonly used, in turn, to improve the speech enhancement. To address both concerns, we propose a novel architecture based on a network of deep neural networks, where all the components are jointly trained and better cooperate with each other thanks to a full communication scheme between them. Experiments, conducted using different datasets, tasks and acoustic conditions, revealed that the proposed framework can overtake other competitive solutions, including recent joint training approaches.

...read moreread less

37 citations

1
2
3
4
…
5

References

PDF

Open Access

More filters

Journal Article•DOI•

A fast learning algorithm for deep belief nets

[...]

Geoffrey E. Hinton¹, Simon Osindero¹, Yee Whye Teh²•Institutions (2)

University of Toronto¹, National University of Singapore²

01 Jul 2006-Neural Computation

TL;DR: A fast, greedy algorithm is derived that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory.

...read moreread less

Abstract: We show how to use "complementary priors" to eliminate the explaining-away effects that make inference difficult in densely connected belief nets that have many hidden layers. Using complementary priors, we derive a fast, greedy algorithm that can learn deep, directed belief networks one layer at a time, provided the top two layers form an undirected associative memory. The fast, greedy algorithm is used to initialize a slower learning procedure that fine-tunes the weights using a contrastive version of the wake-sleep algorithm. After fine-tuning, a network with three hidden layers forms a very good generative model of the joint distribution of handwritten digit images and their labels. This generative model gives better digit classification than the best discriminative learning algorithms. The low-dimensional manifolds on which the digits lie are modeled by long ravines in the free-energy landscape of the top-level associative memory, and it is easy to explore these ravines by using the directed connections to display what the associative memory has in mind.

...read moreread less

15,055 citations

Proceedings Article•

Greedy Layer-Wise Training of Deep Networks

[...]

Yoshua Bengio¹, Pascal Lamblin¹, Dan Popovici¹, Hugo Larochelle¹•Institutions (1)

Université de Montréal¹

04 Dec 2006

TL;DR: These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

...read moreread less

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

...read moreread less

4,385 citations

Journal Article•DOI•

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

[...]

George E. Dahl¹, Dong Yu², Li Deng², Alex Acero²•Institutions (2)

University of Toronto¹, Microsoft²

01 Jan 2012-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: A pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output that can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs.

...read moreread less

Abstract: We propose a novel context-dependent (CD) model for large-vocabulary speech recognition (LVSR) that leverages recent advances in using deep belief networks for phone recognition. We describe a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output. The deep belief network pre-training algorithm is a robust and often helpful way to initialize deep neural networks generatively that can aid in optimization and reduce generalization error. We illustrate the key components of our model, describe the procedure for applying CD-DNN-HMMs to LVSR, and analyze the effects of various modeling choices on performance. Experiments on a challenging business search dataset demonstrate that CD-DNN-HMMs can significantly outperform the conventional context-dependent Gaussian mixture model (GMM)-HMMs, with an absolute sentence accuracy improvement of 5.8% and 9.2% (or relative error reduction of 16.0% and 23.2%) over the CD-GMM-HMMs trained using the minimum phone error rate (MPE) and maximum-likelihood (ML) criteria, respectively.

...read moreread less

3,120 citations

TRECVID 2012 - An overview of the goals, tasks, data, evaluation mechanisms, and metrics

[...]

Paul Over, Jonathan G. Fiscus, G. Sanders, B. Shaw, Martial Michel, George Awad, Alan F. Smeaton, Wessel Kraaij, Georges Quénot - Show less +5 more

01 Jan 2013

TL;DR: The TREC Video Retrieval Evaluation (TRECVID) 2012 was a TREC-style video analysis and retrieval evaluation, the goal of which remains to promote progress in content-based exploitation of digital video via open, metrics-based evaluation as mentioned in this paper.

...read moreread less

Abstract: The TREC Video Retrieval Evaluation (TRECVID) 2012 was a TREC-style video analysis and retrieval evaluation, the goal of which remains to promote progress in content-based exploitation of digital video via open, metrics-based evaluation. Over the last ten years this effort has yielded a better understanding of how systems can effectively accomplish such processing and how one can reliably benchmark their performance. TRECVID is funded by the NIST and other US government agencies. Many organizations and individuals worldwide contribute significant time and effort.

...read moreread less

582 citations

Journal Article•DOI•

Content analysis for audio classification and segmentation

[...]

Lie Lu¹, Hong-Jiang Zhang¹, Hao Jiang¹•Institutions (1)

Microsoft¹

10 Dec 2002-IEEE Transactions on Speech and Audio Processing

TL;DR: A robust approach that is capable of classifying and segmenting an audio stream into speech, music, environment sound, and silence is proposed, and an unsupervised speaker segmentation algorithm using a novel scheme based on quasi-GMM and LSP correlation analysis is developed.

...read moreread less

Abstract: We present our study of audio content analysis for classification and segmentation, in which an audio stream is segmented according to audio type or speaker identity. We propose a robust approach that is capable of classifying and segmenting an audio stream into speech, music, environment sound, and silence. Audio classification is processed in two steps, which makes it suitable for different applications. The first step of the classification is speech and nonspeech discrimination. In this step, a novel algorithm based on K-nearest-neighbor (KNN) and linear spectral pairs-vector quantization (LSP-VQ) is developed. The second step further divides nonspeech class into music, environment sounds, and silence with a rule-based classification scheme. A set of new features such as the noise frame ratio and band periodicity are introduced and discussed in detail. We also develop an unsupervised speaker segmentation algorithm using a novel scheme based on quasi-GMM and LSP correlation analysis. Without a priori knowledge, this algorithm can support the open-set speaker, online speaker modeling and real time segmentation. Experimental results indicate that the proposed algorithms can produce very satisfactory results.

...read moreread less

559 citations