Author

Reinhard Sonnleitner

Bio: Reinhard Sonnleitner is an academic researcher from Johannes Kepler University Linz. The author has contributed to research on topics including voice analysis and voice activity detection. The author has an h-index of 7 and has co-authored 11 publications receiving 160 citations.

Papers
Proceedings ArticleDOI
04 May 2014
TL;DR: Three new audio features designed to reduce the number of false vocal detections yield a singing voice detector that appears to be at least on par with more complex state-of-the-art methods.
Abstract: Motivated by the observation that one of the biggest problems in automatic singing voice detection is the confusion of vocals with other pitch-continuous and pitch-varying instruments, we propose a set of three new audio features designed to reduce the number of false vocal detections. This is borne out in comparative experiments with three different musical corpora. The resulting singing voice detector appears to be at least on par with more complex state-of-the-art methods. The new features and classifier are very lightweight and in principle suitable for on-line use.

55 citations

Journal ArticleDOI
TL;DR: Proposes an audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads.
Abstract: We propose an audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads. Based on these, an audio identification algorithm is described that is robust to noise and severe time-frequency scale distortions and accurately identifies the underlying scale transform factors. The low number and compact representation of content features allows for efficient application of exact fixed-radius near-neighbor search methods for fingerprint matching in large audio collections. We demonstrate the practicability of the method on a collection of 100,000 songs, analyze its performance for a diverse set of noise as well as severe speed, tempo and pitch scale modifications, and identify a number of advantages of our method over two state-of-the-art distortion-robust audio identification algorithms.
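As an illustration of the quad idea described above, the sketch below hashes four spectral peaks into a compact, shift- and stretch-invariant code. It is a minimal reading of the abstract, not the paper's implementation: peak picking, quad-grouping heuristics, and the near-neighbor index are omitted, and all names are hypothetical.

```python
import numpy as np

def quad_hash(a, b, c, d):
    """Hash four spectral peaks, each given as (time, frequency).

    Peaks a and b span an axis-aligned box; c and d are re-expressed in
    that box's normalized coordinates. The resulting 4-tuple is unchanged
    by shifts and independent stretches of the time and frequency axes,
    which is what makes quads robust to speed/tempo/pitch modifications.
    """
    a, b, c, d = (np.asarray(p, dtype=float) for p in (a, b, c, d))
    span = b - a                    # box extent in time and frequency
    cx, cy = (c - a) / span         # c in unit-box coordinates
    dx, dy = (d - a) / span         # d in unit-box coordinates
    return (cx, cy, dx, dy)
```

Matching then reduces to fixed-radius near-neighbor search over these low-dimensional hashes, which is what allows the method to scale to large collections.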

34 citations

Proceedings Article
01 Jan 2013
TL;DR: It is shown that singing voice detection – the problem of identifying those parts of a polyphonic audio recording where one or several persons sing(s) – can be realised with substantially fewer features than used in current state-of-the-art methods.
Abstract: We present a study that indicates that singing voice detection – the problem of identifying those parts of a polyphonic audio recording where one or several persons sing(s) – can be realised with substantially fewer (and less expensive) features than used in current state-of-the-art methods. Essentially, we show that MFCCs alone, if appropriately optimised and used with a suitable classifier, are sufficient to achieve detection results that seem on par with the state of the art – at least as far as this can be ascertained by direct, fair comparisons to existing systems. To make this comparison, we select three relevant publications from the literature where publicly accessible training/test data were used, and where the experimental setup is described in enough detail for us to perform fair comparison experiments. The result of the experiments is that with our simple, optimised MFCC-based classifier we achieve at least comparable identification results, but with (in some cases much) less computational effort, and without any need for extensive lookahead, thus paving the way to on-line, real-time voice detection applications.
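A minimal sketch of the paper's premise, assuming librosa and scikit-learn; the study's exact MFCC parameters, optimisation, and classifier are not reproduced here, and all names are illustrative.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def mfcc_frames(path, n_mfcc=30):
    """Per-frame MFCC feature vectors for one recording."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # shape: (n_frames, n_mfcc)

# Train on frames pooled from labelled tracks (0 = no vocals, 1 = vocals),
# then predict a vocal/non-vocal label for every frame of a new track:
# X_train, y_train = ...  # stacked frame features and per-frame labels
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# labels = clf.predict(mfcc_frames("new_song.wav"))
```

Because each frame is classified from its own short-time features, no extensive lookahead is required, which is the property the abstract highlights for on-line use.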

25 citations

01 Jan 2014
TL;DR: A new audio fingerprinting method is proposed that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads, and accurately estimates the scaling factors of the applied time/frequency distortions.
Abstract: We propose a new audio fingerprinting method that adapts findings from the field of blind astrometry to define simple, efficiently representable characteristic feature combinations called quads. Based on these, an audio identification algorithm is described that is robust to large amounts of noise and speed, tempo and pitch-shifting distortions. In addition to reliably identifying audio queries that are modified in this way, it also accurately estimates the scaling factors of the applied time/frequency distortions. We experimentally evaluate the performance of the method for a diverse set of noise, speed and tempo modifications, and identify a number of advantages of the new method over a recently published distortion-invariant audio copy detection algorithm.
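To illustrate the scale-factor estimation mentioned above: once query peaks have been matched to reference peaks via quads, the time and frequency scaling can be read off with a least-squares fit. A hypothetical sketch, not the paper's exact estimator:

```python
import numpy as np

def estimate_scales(ref_peaks, qry_peaks):
    """Estimate time/frequency scale factors from matched peak pairs.

    ref_peaks, qry_peaks: (N, 2) arrays of (time, frequency) coordinates
    of spectral peaks matched between a reference track and a query excerpt.
    """
    ref = np.asarray(ref_peaks, dtype=float)
    qry = np.asarray(qry_peaks, dtype=float)
    # Time: fit qry_t = a * ref_t + b (the query may start mid-track,
    # hence the offset b); the slope a is the speed/tempo factor.
    A = np.column_stack([ref[:, 0], np.ones(len(ref))])
    (time_scale, _), *_ = np.linalg.lstsq(A, qry[:, 0], rcond=None)
    # Frequency: fit qry_f = c * ref_f through the origin; c is the
    # pitch-scale factor.
    freq_scale = (ref[:, 1] @ qry[:, 1]) / (ref[:, 1] @ ref[:, 1])
    return time_scale, freq_scale
```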

18 citations

Book ChapterDOI
24 Oct 2012
TL;DR: This paper presents classification experiments verifying the claim that music similarity measures based on auto-tags profit from improving the quality of the underlying audio tag predictors, and suggests a straightforward way to further improve content-based music similarity measures by improving the underlying auto-taggers.
Abstract: This paper focuses on the relation between automatic tag prediction and music similarity. Intuitively, music similarity measures based on auto-tags should profit from improving the quality of the underlying audio tag predictors. We present classification experiments that verify this claim. Our results suggest a straightforward way to further improve content-based music similarity measures by improving the underlying auto-taggers.
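The underlying relation can be made concrete in a few lines: represent each track by its vector of predicted tag probabilities and compare vectors directly, so that any improvement in the tagger propagates to the similarity estimate. A schematic sketch, not the paper's exact measure:

```python
import numpy as np

def tag_similarity(p, q):
    """Cosine similarity between two tracks' auto-tag probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))
```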

13 citations


Cited by
Proceedings ArticleDOI
26 Oct 2015
TL;DR: A range of label-preserving audio transformations is applied to music data, and pitch shifting is found to be the most helpful augmentation method, reaching the state of the art on two public datasets.
Abstract: In computer vision, state-of-the-art object recognition systems rely on label-preserving image transformations such as scaling and rotation to augment the training datasets. The additional training examples help the system to learn invariances that are difficult to build into the model, and improve generalization to unseen data. To the best of our knowledge, this approach has not been systematically explored for music signals. Using the problem of singing voice detection with neural networks as an example, we apply a range of label-preserving audio transformations to assess their utility for music data augmentation. In line with recent research in speech recognition, we find pitch shifting to be the most helpful augmentation method. Combined with time stretching and random frequency filtering, we achieve a reduction in classification error between 10 and 30%, reaching the state of the art on two public datasets. We expect that audio data augmentation would yield significant gains for several other sequence labelling and event detection tasks in music information retrieval.
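A minimal sketch of such label-preserving augmentations, assuming librosa; the parameter ranges are illustrative, and the paper's random frequency filtering is omitted here.

```python
import numpy as np
import librosa

def augment(y, sr, rng=None):
    """Return a pitch-shifted and time-stretched copy of a training clip.

    Pitch shifting preserves the time axis; time stretching rescales it,
    so frame-level vocal labels must be stretched by the same factor.
    """
    rng = rng or np.random.default_rng()
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    return y
```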

188 citations

Journal ArticleDOI
Bob L. Sturm
TL;DR: This article disproves the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults.
Abstract: The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents.

141 citations

Book ChapterDOI
24 Oct 2012
TL;DR: This paper compiles a bibliography of work in MGR, and analyzes three aspects of evaluation: experimental designs, datasets, and figures of merit.
Abstract: Much work is focused upon music genre recognition (MGR) from audio recordings, symbolic data, and other modalities. While reviews have been written of some of this work before, no survey has been made of the approaches to evaluating approaches to MGR. This paper compiles a bibliography of work in MGR, and analyzes three aspects of evaluation: experimental designs, datasets, and figures of merit.

123 citations

Proceedings ArticleDOI
19 Apr 2015
TL;DR: A new algorithm is proposed for robust principal component analysis with predefined sparsity patterns, applied to separate the singing voice from the instrumental accompaniment using vocal activity information; a new publicly available iKala dataset is constructed for evaluation.
Abstract: A new algorithm is proposed for robust principal component analysis with predefined sparsity patterns. The algorithm is then applied to separate the singing voice from the instrumental accompaniment using vocal activity information. To evaluate its performance, we construct a new publicly available iKala dataset that features longer durations and higher quality than the existing MIR-1K dataset for singing voice separation. Part of it will be used in the MIREX Singing Voice Separation task. Experimental results on both the MIR-1K dataset and the new iKala dataset confirmed that the more informed the algorithm is, the better the separation results are.
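For orientation, the sketch below implements the unconstrained RPCA baseline (an inexact augmented Lagrange multiplier solver) that such separation systems build on, splitting a magnitude spectrogram into a low-rank accompaniment part and a sparse vocal part. The paper's informed variant with predefined sparsity patterns is more involved and is not reproduced here.

```python
import numpy as np

def rpca(M, lam=None, tol=1e-7, max_iter=500):
    """Inexact-ALM solver for  min ||L||_* + lam*||S||_1  s.t.  M = L + S.

    Applied to a magnitude spectrogram M, the low-rank part L tends to
    capture the repetitive accompaniment and the sparse part S the vocals.
    """
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    norm_M = np.linalg.norm(M, 'fro')
    mu = 1.25 / np.linalg.norm(M, 2)
    Y = M / max(np.linalg.norm(M, 2), np.abs(M).max() / lam)
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: elementwise soft thresholding.
        T = M - L + Y / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Dual update and convergence check.
        Z = M - L - S
        Y = Y + mu * Z
        mu = min(mu * 1.5, 1e7)
        if np.linalg.norm(Z, 'fro') / norm_M < tol:
            break
    return L, S
```

Masks derived from S and L can then be applied to the complex spectrogram before inverting the STFT to obtain the separated vocal and accompaniment signals.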

98 citations

Journal ArticleDOI
01 Dec 2013
TL;DR: It is argued that an evaluation of system behavior at the level of the music is required to usefully address the fundamental problems of music genre recognition (MGR), and indeed other tasks of music information retrieval, such as autotagging.
Abstract: We argue that an evaluation of system behavior at the level of the music is required to usefully address the fundamental problems of music genre recognition (MGR), and indeed other tasks of music information retrieval, such as autotagging. A recent review of works in MGR since 1995 shows that most (82 %) measure the capacity of a system to recognize genre by its classification accuracy. After reviewing evaluation in MGR, we show that neither classification accuracy, nor recall and precision, nor confusion tables, necessarily reflect the capacity of a system to recognize genre in musical signals. Hence, such figures of merit cannot be used to reliably rank, promote or discount the genre recognition performance of MGR systems if genre recognition (rather than identification by irrelevant confounding factors) is the objective. This motivates the development of a richer experimental toolbox for evaluating any system designed to intelligently extract information from music signals.

85 citations