Author

Slim Essid

Bio: Slim Essid is an academic researcher from Télécom ParisTech. The author has contributed to research topics including non-negative matrix factorization and feature learning. The author has an h-index of 25 and has co-authored 128 publications receiving 2,081 citations. Previous affiliations of Slim Essid include Nanyang Technological University and the University of Paris.


Papers
Proceedings ArticleDOI
09 Aug 2010
TL;DR: In this paper, a new audio feature extraction software package, YAAFE, is presented and compared to widely used libraries; its main advantage is a significantly lower complexity due to the appropriate exploitation of redundancy in the feature calculation.
Abstract: Music Information Retrieval systems are commonly built on a feature extraction stage. For applications involving automatic classification (e.g. speech/music discrimination, music genre or mood recognition, ...), traditional approaches will consider a large set of audio features to be extracted on a large dataset. In some cases, this will lead to computationally intensive systems and there is, therefore, a strong need for efficient feature extraction. In this paper, a new audio feature extraction software package, YAAFE, is presented and compared to widely used libraries. The main advantage of YAAFE is a significantly lower complexity due to the appropriate exploitation of redundancy in the feature calculation. YAAFE remains easy to configure and each feature can be parameterized independently. Finally, the YAAFE framework and most of its core feature library are released in source code under the GNU Lesser General Public License.

175 citations
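The efficiency argument here is about sharing intermediate computations across features rather than recomputing them per descriptor. The following is a minimal sketch of that idea in plain NumPy, not the YAAFE API: a single magnitude spectrogram is computed once and three spectral descriptors are derived from it. The feature definitions and parameters are simplified placeholders.

```python
import numpy as np

def magnitude_spectrogram(x, block=1024, step=512):
    """Compute |STFT| once; every feature below reuses it."""
    frames = np.lib.stride_tricks.sliding_window_view(x, block)[::step]
    win = np.hanning(block)
    return np.abs(np.fft.rfft(frames * win, axis=1))

def spectral_centroid(mag, sr):
    freqs = np.fft.rfftfreq((mag.shape[1] - 1) * 2, d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-12)

def spectral_rolloff(mag, sr, pct=0.95):
    cum = np.cumsum(mag, axis=1)
    idx = (cum >= pct * cum[:, -1:]).argmax(axis=1)
    return idx * sr / ((mag.shape[1] - 1) * 2)   # bin index -> frequency in Hz

def spectral_flux(mag):
    d = np.diff(mag, axis=0, prepend=mag[:1])
    return np.sqrt((d ** 2).sum(axis=1))

# One spectrogram pass, three features derived from it.
sr = 22050
x = np.random.randn(sr)            # stand-in for one second of audio
mag = magnitude_spectrogram(x)
features = {
    "centroid": spectral_centroid(mag, sr),
    "rolloff": spectral_rolloff(mag, sr),
    "flux": spectral_flux(mag),
}
```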

Journal ArticleDOI
TL;DR: This study focuses on a single music genre but combines a variety of instruments among which are percussion and singing voice, and obtains a taxonomy of musical ensembles which is used to efficiently classify possible combinations of instruments played simultaneously.
Abstract: We propose a new approach to instrument recognition in the context of real music orchestrations ranging from solos to quartets. The strength of our approach is that it does not require prior musical source separation. Thanks to a hierarchical clustering algorithm exploiting robust probabilistic distances, we obtain a taxonomy of musical ensembles which is used to efficiently classify possible combinations of instruments played simultaneously. Moreover, a wide set of acoustic features is studied including some new proposals. In particular, signal to mask ratios are found to be useful features for audio classification. This study focuses on a single music genre (i.e., jazz) but combines a variety of instruments among which are percussion and singing voice. Using a varied database of sound excerpts from commercial recordings, we show that the segmentation of music with respect to the instruments played can be achieved with an average accuracy of 53%.

133 citations
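As a rough illustration of the clustering step described above, the sketch below models each labelled ensemble with a diagonal-covariance Gaussian over its frame features, measures pairwise dissimilarity with a symmetrised Kullback-Leibler divergence, and builds an agglomerative tree from the resulting distance matrix. The distance, the Gaussian assumption, the average linkage and the toy data are illustrative choices, not the paper's exact probabilistic distances or features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

def sym_kl_diag_gauss(m1, v1, m2, v2):
    """Symmetrised KL divergence between two diagonal-covariance Gaussians."""
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m1 - m2) ** 2) / v1 - 1.0)
    return kl12 + kl21

# Toy feature sets: one (n_frames, n_features) array per ensemble class.
rng = np.random.default_rng(0)
names = ["piano", "piano+bass", "trio", "quartet"]
ensembles = {n: rng.normal(loc=i, size=(200, 8)) for i, n in enumerate(names)}
stats = {n: (X.mean(axis=0), X.var(axis=0) + 1e-6) for n, X in ensembles.items()}

# Pairwise probabilistic distances -> condensed matrix -> agglomerative tree.
D = np.zeros((len(names), len(names)))
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if j > i:
            D[i, j] = D[j, i] = sym_kl_diag_gauss(*stats[a], *stats[b])

tree = linkage(squareform(D), method="average")
print(dendrogram(tree, labels=names, no_plot=True)["ivl"])  # leaf order of the taxonomy
```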

Journal ArticleDOI
C. Joder, Slim Essid, Gael Richard
TL;DR: A number of methods for early and late temporal integration are proposed and an in-depth experimental study on their interest for the task of musical instrument recognition on solo musical phrases is provided.
Abstract: Nowadays, it appears essential to design automatic indexing tools which provide meaningful and efficient means to describe the musical audio content. There is in fact a growing interest in music information retrieval (MIR) applications, amongst which the most popular are related to music similarity retrieval, artist identification, and musical genre or instrument recognition. Current MIR-related classification systems usually do not take into account the mid-term temporal properties of the signal (over several frames) and rely on the assumption that the observations of the features in different frames are statistically independent. The aim of this paper is to demonstrate the usefulness of the information carried by the evolution of these characteristics over time. To that purpose, we propose a number of methods for early and late temporal integration and provide an in-depth experimental study of their interest for the task of musical instrument recognition on solo musical phrases. In particular, the impact of the time horizon over which the temporal integration is performed is assessed both for fixed and variable frame-length analysis. Also, a number of proposed alignment kernels are used for late temporal integration. For all experiments, the results are compared to a state-of-the-art musical instrument recognition system.

129 citations
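To make the early/late distinction concrete: early integration summarises frame-level features over a longer texture window before classification, whereas late integration lets a classifier decide per frame and then fuses the frame-level outputs. The sketch below uses simple stand-ins (mean/variance statistics and posterior averaging) on toy data; the alignment kernels proposed in the paper are not reproduced.

```python
import numpy as np

def early_integration(frame_feats, win=40, hop=20):
    """Stack mean and variance of frame features over each texture window."""
    segs = []
    for start in range(0, len(frame_feats) - win + 1, hop):
        w = frame_feats[start:start + win]
        segs.append(np.concatenate([w.mean(axis=0), w.var(axis=0)]))
    return np.asarray(segs)          # one integrated vector per window

def late_integration(frame_posteriors):
    """Combine per-frame class posteriors by averaging, then decide once."""
    return frame_posteriors.mean(axis=0).argmax()

# Toy data: 200 frames of 13-dimensional features, 5 candidate instruments.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 13))
early_vectors = early_integration(feats)        # feed these to any classifier
posteriors = rng.dirichlet(np.ones(5), size=200)
decision = late_integration(posteriors)         # single label for the phrase
```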

Journal ArticleDOI
TL;DR: It is shown that higher recognition rates can be reached with pairwise optimized subsets of features in association with SVM classification using a radial basis function kernel.
Abstract: Musical instrument recognition is an important aspect of music information retrieval. In this paper, statistical pattern recognition techniques are utilized to tackle the problem in the context of solo musical phrases. Ten instrument classes from different instrument families are considered. A large sound database is collected from excerpts of musical phrases acquired from commercial recordings, covering different instrument instances, performers, and recording conditions. More than 150 signal processing features are studied, including new descriptors. Two feature selection techniques, inertia ratio maximization with feature space projection and genetic algorithms, are considered in a class-pairwise manner whereby the most relevant features are selected for each instrument pair. For the classification task, experimental results are provided using Gaussian mixture models (GMMs) and support vector machines (SVMs). It is shown that higher recognition rates can be reached with pairwise optimized subsets of features in association with SVM classification using a radial basis function kernel.

111 citations
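A hedged sketch of the class-pairwise strategy using scikit-learn stand-ins: for every pair of instrument classes, keep the k features that best separate that pair (here a simple ANOVA-based SelectKBest instead of the paper's inertia ratio maximization or genetic algorithm) and fit an RBF-kernel SVM on them; a majority vote over all pairwise classifiers gives the final label. The feature dimensionality, k, and the random data are placeholders.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 150))                  # 150 candidate descriptors
y = rng.integers(0, 10, size=600)                # 10 instrument classes

# One (feature selector, RBF-SVM) pair per instrument pair.
pairwise = {}
for a, b in combinations(range(10), 2):
    mask = np.isin(y, [a, b])
    selector = SelectKBest(f_classif, k=20).fit(X[mask], y[mask])
    clf = SVC(kernel="rbf", gamma="scale").fit(selector.transform(X[mask]), y[mask])
    pairwise[(a, b)] = (selector, clf)

def predict(x):
    """Majority vote over all one-vs-one classifiers with pair-specific features."""
    votes = np.zeros(10, dtype=int)
    for (a, b), (selector, clf) in pairwise.items():
        votes[clf.predict(selector.transform(x.reshape(1, -1)))[0]] += 1
    return votes.argmax()

print(predict(X[0]))
```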

Journal ArticleDOI
TL;DR: It is shown that the unsupervised learning methods provide better representations of acoustic scenes than the best conventional hand-crafted features on both datasets, and that the introduction of a novel nonnegative supervised matrix factorization model and deep neural networks trained on spectrograms allows for further improvements.
Abstract: In this paper, we study the usefulness of various matrix factorization methods for learning features to be used for the specific acoustic scene classification (ASC) problem. A common way of addressing ASC has been to engineer features capable of capturing the specificities of acoustic environments. Instead, we show that better representations of the scenes can be automatically learned from time-frequency representations using matrix factorization techniques. We mainly focus on extensions including sparse, kernel-based, convolutive and a novel supervised dictionary learning variant of principal component analysis and nonnegative matrix factorization. An experimental evaluation is performed on two of the largest ASC datasets available in order to compare and discuss the usefulness of these methods for the task. We show that the unsupervised learning methods provide better representations of acoustic scenes than the best conventional hand-crafted features on both datasets. Furthermore, the introduction of a novel nonnegative supervised matrix factorization model and deep neural networks trained on spectrograms allows us to reach further improvements.

93 citations
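A minimal sketch of the unsupervised feature-learning pipeline with scikit-learn's NMF, one of the factorization variants the paper surveys: learn a nonnegative dictionary from the pooled time-frequency frames of the training recordings, encode each recording as the averaged activations of that dictionary, and feed the pooled codes to a standard classifier. The dimensions, pooling choice, and toy data are illustrative, and the supervised and convolutive variants are not shown.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
# Toy corpus: 50 recordings, each a (frames x frequency-bins) magnitude spectrogram.
spectrograms = [np.abs(rng.normal(size=(120, 257))) for _ in range(50)]
labels = rng.integers(0, 4, size=50)             # 4 acoustic scene classes

# Learn one nonnegative dictionary from all training frames pooled together.
nmf = NMF(n_components=32, init="nndsvda", max_iter=300)
nmf.fit(np.vstack(spectrograms))

# Encode each recording as the average activation of the learned components.
codes = np.array([nmf.transform(S).mean(axis=0) for S in spectrograms])

clf = LogisticRegression(max_iter=1000).fit(codes, labels)
print(clf.score(codes, labels))                  # training accuracy on toy data
```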


Cited by
Christopher M. Bishop
01 Jan 2006
TL;DR: Probability distributions and linear models for regression and classification are covered in this book, along with neural networks, kernel methods, graphical models, mixture models and EM, approximate inference, sampling methods, sequential data, and combining models.
Abstract: Probability Distributions.- Linear Models for Regression.- Linear Models for Classification.- Neural Networks.- Kernel Methods.- Sparse Kernel Machines.- Graphical Models.- Mixture Models and EM.- Approximate Inference.- Sampling Methods.- Continuous Latent Variables.- Sequential Data.- Combining Models.

10,141 citations

Journal ArticleDOI
TL;DR: It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Abstract: The ability of deep convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep CNN architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

996 citations
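A hedged sketch of the two ingredients combined in this paper, waveform-level augmentation and a convolutional network over log-mel patches, using librosa and PyTorch; the augmentation parameters, mel/patch sizes, and layer sizes are placeholders rather than the architecture or deformation settings from the paper.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def augment(y, sr):
    """Waveform-level augmentations of the kind studied: stretch, pitch, noise."""
    return [
        librosa.effects.time_stretch(y, rate=1.2),
        librosa.effects.pitch_shift(y, sr=sr, n_steps=2),
        y + 0.005 * np.random.randn(len(y)),
    ]

def logmel(y, sr, n_mels=64, n_frames=128):
    """Fixed-size log-mel patch (cropped or zero-padded along time)."""
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    S = S[:, :n_frames]
    if S.shape[1] < n_frames:
        S = np.pad(S, ((0, 0), (0, n_frames - S.shape[1])))
    return torch.tensor(S, dtype=torch.float32)

# Small CNN over log-mel patches (a placeholder, not the paper's exact layers).
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),   # 10 sound classes
)

sr = 22050
y = np.random.randn(2 * sr).astype(np.float32)   # stand-in for one labelled clip
batch = torch.stack([logmel(a, sr).unsqueeze(0) for a in [y] + augment(y, sr)])
logits = cnn(batch)                              # original + augmented copies -> class scores
```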

Book ChapterDOI
01 Jan 1998

885 citations
