Author

Juan Pablo Bello

Bio: Juan Pablo Bello is an academic researcher from New York University. The author has contributed to research on computer science and music information retrieval, has an h-index of 43, and has co-authored 170 publications that have received 8,162 citations. Previous affiliations of Juan Pablo Bello include Queen Mary University of London and the University of London.


Papers
Journal Article
TL;DR: It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Abstract: The ability of deep convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep CNN architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

996 citations

Proceedings Article
03 Nov 2014
TL;DR: A taxonomy of urban sounds and a new dataset, UrbanSound, containing 27 hours of audio with 18.5 hours of annotated sound event occurrences across 10 sound classes are presented.
Abstract: Automatic urban sound classification is a growing area of research with applications in multimedia retrieval and urban informatics. In this paper we identify two main barriers to research in this area: the lack of a common taxonomy and the scarceness of large, real-world, annotated data. To address these issues we present a taxonomy of urban sounds and a new dataset, UrbanSound, containing 27 hours of audio with 18.5 hours of annotated sound event occurrences across 10 sound classes. The challenges presented by the new dataset are studied through a series of experiments using a baseline classification system.

954 citations

Journal Article
TL;DR: The authors propose a deep convolutional neural network architecture for environmental sound classification, use audio data augmentation to overcome the problem of data scarcity, and explore the influence of different augmentations on the performance of the proposed CNN architecture.
Abstract: The ability of deep convolutional neural networks (CNN) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

864 citations

Journal Article
TL;DR: Methods based on the use of explicitly predefined signal features: the signal's amplitude envelope, spectral magnitudes and phases, time-frequency representations, and methods based on probabilistic signal models are discussed.
Abstract: Note onset detection and localization is useful in a number of analysis and indexing techniques for musical signals. The usual way to detect onsets is to look for "transient" regions in the signal, a notion that leads to many definitions: a sudden burst of energy, a change in the short-time spectrum of the signal or in the statistical properties, etc. The goal of this paper is to review, categorize, and compare some of the most commonly used techniques for onset detection, and to present possible enhancements. We discuss methods based on the use of explicitly predefined signal features: the signal's amplitude envelope, spectral magnitudes and phases, time-frequency representations; and methods based on probabilistic signal models: model-based change point detection, surprise signals, etc. Using a choice of test cases, we provide some guidelines for choosing the appropriate method for a given application.

802 citations

Proceedings Article
11 Sep 2005
TL;DR: It is hoped that by utilizing the notion of a musically-motivated mid-level representation, this work may help bridge the gap between symbolic and audio research.
Abstract: When considering the problem of audio-to-audio matching, determining musical similarity using low-level features such as Fourier transforms and MFCCs is an extremely difficult task, as there is little semantic information available. Full semantic transcription of audio is an unreliable and imperfect task in the best case, an unsolved problem in the worst. To this end we propose a robust mid-level representation that incorporates both harmonic and rhythmic information, without attempting full transcription. We describe a process for creating this representation automatically, directly from multi-timbral and polyphonic music signals, with an emphasis on popular music. We also offer various evaluations of our techniques. More so than most approaches working from raw audio, we incorporate musical knowledge into our assumptions, our models, and our processes. Our hope is that by utilizing this notion of a musically-motivated mid-level representation we may help bridge the gap between symbolic and audio research.

241 citations


Cited by
Journal Article

08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i (the square root of minus one), which seemed an odd beast when I first heard about it at school: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Christopher M. Bishop
01 Jan 2006
TL;DR: This work covers probability distributions, linear models for regression and classification, neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, continuous latent variables, sequential data, and combining models.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Journal Article
TL;DR: This article gives a broad survey of recent advances in convolutional neural networks and discusses improvements to CNNs in several aspects, namely layer design, activation function, loss function, regularization, optimization, and fast computation.

3,125 citations

Book
Li Deng, Dong Yu
12 Jun 2014
TL;DR: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.
Abstract: This monograph provides an overview of general deep learning methodology and its applications to a variety of signal and information processing tasks. The application areas are chosen with the following three criteria in mind: (1) expertise or knowledge of the authors; (2) the application areas that have already been transformed by the successful use of deep learning technology, such as speech recognition and computer vision; and (3) the application areas that have the potential to be impacted significantly by deep learning and that have been experiencing research growth, including natural language and text processing, information retrieval, and multimodal information processing empowered by multi-task deep learning.

2,817 citations

Proceedings Article
05 Mar 2017
TL;DR: The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Abstract: Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets, principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10-second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

2,204 citations