Journal ArticleDOI

Musical source clustering and identification in polyphonic audio

TL;DR: This paper proposes novel semi-supervised and unsupervised schemes for source clustering. The theoretical framework is based on auditory perception theory and is implemented using tools such as probabilistic latent component analysis and graph clustering, while taking into account various perceptual cues for characterizing a source.
Abstract: For music transcription or musical source separation, apart from knowing the multi-F0 contours, it is also important to know which F0 has been played by which instrument. This paper focuses on this aspect: given the polyphonic audio along with its multiple F0 contours, the proposed system clusters them so as to decide ‘which instrument played when.’ Many supervised methods exist for identifying the instruments or singers in polyphonic audio, but in many cases individual source audio is not available for training. To address this problem, this paper proposes novel schemes using semi-supervised as well as unsupervised approaches to source clustering. The proposed theoretical framework is based on auditory perception theory and is implemented using tools such as probabilistic latent component analysis and graph clustering, while taking into account various perceptual cues for characterizing a source. Experiments have been carried out over a wide variety of datasets, ranging from vocal to instrumental and from synthetic to real-world music. The proposed scheme significantly outperforms a state-of-the-art unsupervised scheme that does not make use of the given F0 contours. The proposed semi-supervised approach also outperforms another semi-supervised scheme that does use the given F0 information, in terms of both computation and accuracy.
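As a toy illustration of the final source-streaming step the abstract describes, F0 events represented by timbre-like feature vectors can be grouped into per-instrument streams with a plain clustering pass. The 2-D features, synthetic data, and k-means below are stand-in assumptions, not the paper's actual PLCA/graph-clustering pipeline:

```python
import numpy as np

# Stand-in timbre features for F0 events of two synthetic "instruments":
# two well-separated clouds in a 2-D feature space.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal([0.0, 0.0], 0.1, (20, 2)),
                   rng.normal([1.0, 1.0], 0.1, (20, 2))])

def kmeans(X, init_idx, iters=20):
    """Plain Lloyd's k-means with explicit initial centers."""
    centers = X[list(init_idx)].copy()
    for _ in range(iters):
        # Assign each F0-event feature to its nearest center ...
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # ... and move each center to the mean of its assigned events.
        centers = np.array([X[labels == c].mean(axis=0)
                            for c in range(len(centers))])
    return labels

# 'Which instrument played when', as a cluster label per F0 event.
labels = kmeans(feats, [0, len(feats) - 1])
```

With separable timbre features the two event clouds fall cleanly into two streams; the paper's contribution lies in obtaining such source-characterizing features and constraints without per-source training data.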
Citations
Journal Article
TL;DR: In this paper, the authors present a short bibliography on AI and the arts, organized in four sections: General Arguments, Proposals, and Approaches (31 references); Artificial Intelligence in Music (124 references); Artificial Intelligence in Literature and the Performing Arts (13 references); and Artificial Intelligence and Visual Art (57 references).
Abstract: The title of this technical report says almost everything: this is indeed "a short bibliography on AI and the arts". It is presented in four sections: General Arguments, Proposals, and Approaches (31 references); Artificial Intelligence in Music (124 references); Artificial Intelligence in Literature and the Performing Arts (13 references); and Artificial Intelligence and Visual Art (57 references). About a quarter of these have short abstracts. Creating a bibliography can be a monumental task, and this bibliography should be viewed as a good and useful start, though it is by no means complete. For comparison, consider the 4,585-entry bibliography Computer Applications in Music by Deta Davis (A-R Editions). No direct comparison is intended (or possible), but my point is that many more papers are likely to exist. As a rough check, I looked for several pre-1990 AI and Music articles and books (including my own, of course) in the bibliography. Out of five papers from well-known sources, only one was listed. On the other hand, I discovered a number of papers in this report that were unknown to me, so I am grateful to have a new source of references. In their introduction, the authors acknowledge the need for more references and even offer a cup of coffee in reward for each new one. I will be sending a number of contributions, so the next time anyone is in Vienna, the coffee is on me. I hope the authors will continue to collect abstracts and publish an updated report in the future.

356 citations

Journal ArticleDOI
TL;DR: This article proposes a multi-instrument AMT method, with signal processing techniques specifying pitch saliency, novel deep learning techniques, and concepts partly inspired by multi-object recognition, instance segmentation, and image-to-image translation in computer vision.
Abstract: Multi-instrument automatic music transcription (AMT) is a critical but less investigated problem in the field of music information retrieval (MIR). Beyond the difficulties faced by traditional AMT research, multi-instrument AMT requires further investigation into high-level music semantic modeling, efficient training methods for multiple attributes, and a clear problem scenario for system performance evaluation. In this article, we propose a multi-instrument AMT method that combines signal processing techniques specifying pitch saliency, novel deep learning techniques, and concepts partly inspired by multi-object recognition, instance segmentation, and image-to-image translation in computer vision. The proposed method is flexible for all the sub-tasks in multi-instrument AMT, including multi-instrument note tracking, a task that has rarely been investigated before. State-of-the-art performance is also reported in the sub-task of multi-pitch streaming.

30 citations

Journal ArticleDOI
TL;DR: The paper proposes a novel strategy that exploits the trade-off between precision and recall of multiple F0 estimation for better clustering, and the resulting method outperforms a state-of-the-art unsupervised source streaming algorithm in a set of comparative experiments.
Abstract: Source transcription of pitched polyphonic music entails providing the pitch (F0) values corresponding to each source in a separate channel. This is an important step towards many problems in music and speech processing. It involves 1) estimating the multiple F0 values in each short time frame, and 2) clustering the F0 values into streams corresponding to different sources. We address the problem in an unsupervised way, with only the total number of sources given beforehand. The framework of probabilistic latent component analysis (PLCA) is used to decompose the polyphonic short-time magnitude spectra for multiple F0 estimation and source-specific feature extraction. It is further embedded into the structure of hidden Markov random fields (HMRF) for clustering the F0s into different sources. This clustering is constrained by the cognitive grouping of continuous F0 contours as well as the segregation of simultaneous F0s into different source streams. Such constraints are effectively and elegantly modeled by the HMRFs. Simulated annealing varies the degree of the constraints for better clustering. The paper also proposes a novel strategy using the trade-off between precision and recall of multiple F0 estimation for better clustering. Evaluations over a variety of datasets show the efficacy of the proposed algorithm and its robustness to the presence of spurious F0s while clustering. It also outperforms a state-of-the-art unsupervised source streaming algorithm in a set of comparative experiments.
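The precision/recall trade-off mentioned in the TL;DR can be made concrete with standard frame-wise multi-F0 scoring. The 50-cent tolerance and the greedy one-to-one matching below are conventional assumptions, not details taken from the paper:

```python
import numpy as np

def frame_precision_recall(ref_f0s, est_f0s, tol_cents=50.0):
    """Score one frame of multi-F0 estimates against a reference,
    matching each estimate to at most one reference F0 within a
    pitch tolerance given in cents."""
    unmatched = list(ref_f0s)
    hits = 0
    for f in est_f0s:
        for i, r in enumerate(unmatched):
            if abs(1200.0 * np.log2(f / r)) <= tol_cents:
                hits += 1
                del unmatched[i]
                break
    precision = hits / len(est_f0s) if est_f0s else 1.0
    recall = hits / len(ref_f0s) if ref_f0s else 1.0
    return precision, recall

# A conservative detector: high precision, low recall.
p_strict, r_strict = frame_precision_recall([220.0, 330.0], [221.0])
# A lax detector returns more candidates: recall rises, precision drops.
p_lax, r_lax = frame_precision_recall([220.0, 330.0],
                                      [221.0, 329.0, 500.0])
```

Sliding the detector between these two regimes trades spurious F0s (hurting precision) against missed F0s (hurting recall), which is the knob the paper's clustering strategy exploits.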

24 citations


Cites background or methods from "Musical source clustering and identification in polyphonic audio"

  • ...There is comparatively much less work done in unsupervised musical source clustering [19], [20]....

  • ...work [19] also employed constrained graph clustering over all...

  • ...The implementation details are elaborated in our earlier work [19]....

  • ...For source clustering, the framework of three levels of cognitive clustering, as proposed in our previous work [19], is used: 1) Pitched event decomposition: the decomposition of spectra based on multi-F0 values; 2) Group object formation: the clustering of temporally connected F0s across successive frames; 3) Source streaming: the clustering of all F0s over all the frames using source-characterizing timbre features and constraints such as group objects....

  • ...[20] and our previous work [19], the present work proposes a novel Hidden Markov Random Field (HMRF) model to cluster the F0s into source clusters, while taking into account these constraints....

Journal ArticleDOI
TL;DR: A generalized Short Time Fourier Transform (STFT)-based technique, combined with a filter bank, to extract vocals from background music; experiments show that the proposed approach outperforms other state-of-the-art approaches in terms of Signal-to-Interference Ratio (SIR) and Signal-to-Distortion Ratio (SDR).
Abstract: Blind Source Separation techniques have long been widely used in wireless communication to extract signals of interest from a set of multiple signals without training data. In this paper, we investigate the problem of separating the human voice from a mixture of human voice and sounds from different musical instruments. The human voice may be a singing voice in a song or part of a news broadcast with background music. This paper proposes a generalized Short Time Fourier Transform (STFT)-based technique, combined with a filter bank, to extract vocals from background music. The main purpose is to design a filter bank that eliminates background aliasing errors under the best reconstruction conditions, with approximated scaling factors. Stereo signals in the time-frequency domain are used in the experiments. The input stereo signals are processed in frames and passed through the proposed STFT-based technique, whose output is passed through the filter bank to minimize the background aliasing errors. For reconstruction, an inverse STFT is applied first, and then the signals are reconstructed by the overlap-add method to obtain the final output, containing vocals only. The experiments show that the proposed approach performs better than other state-of-the-art approaches in terms of Signal-to-Interference Ratio (SIR) and Signal-to-Distortion Ratio (SDR).
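The analysis-synthesis backbone described above (frame-wise STFT, processing, inverse STFT, overlap-add reconstruction) can be sketched as follows; the Hann window, 50% hop, and per-sample window normalization are illustrative choices, not the paper's exact filter-bank design:

```python
import numpy as np

def stft(x, win, hop):
    """Frame the signal, window each frame, and take its FFT."""
    n = len(win)
    frames = np.array([x[i:i + n] * win
                       for i in range(0, len(x) - n + 1, hop)])
    return np.fft.rfft(frames, axis=1)

def istft(X, win, hop):
    """Inverse-FFT each frame, then overlap-add with per-sample
    window normalization for reconstruction."""
    frames = np.fft.irfft(X, n=len(win), axis=1)
    out = np.zeros(hop * (len(frames) - 1) + len(win))
    norm = np.zeros_like(out)
    for k, f in enumerate(frames):
        sl = slice(k * hop, k * hop + len(win))
        out[sl] += f * win        # overlap-add the windowed frame
        norm[sl] += win ** 2      # accumulate window energy
    norm[norm == 0] = 1.0
    return out / norm

win, hop = np.hanning(512), 256   # Hann window, 50% overlap
fs = 8000
x = np.sin(2 * np.pi * 440.0 * np.arange(4096) / fs)
y = istft(stft(x, win, hop), win, hop)
# Interior samples reconstruct exactly; the very first and last samples
# fall under a zero window value and carry no energy.
err = float(np.max(np.abs(y[1:-1] - x[1:-1])))
```

Any time-frequency masking or filtering (as in the vocal-extraction method) would be applied to the complex frames between the `stft` and `istft` calls.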

20 citations


Cites background from "Musical source clustering and identification in polyphonic audio"

  • ...Another main reason behind using the CM-based approaches is that, in many cases, especially the speech and music signals, various signal bands are very narrow and appear around a certain range of frequencies [29]....

Book Chapter
30 Jul 2015
TL;DR: In this paper, a method of transformation from a gain-space to a mix-space is introduced, using a novel representation of the individual track gains, and an experiment is conducted to obtain time-series data of mix engineers' exploration of this space as they adjust levels within a multitrack session to create their desired mixture.
Abstract: The mixing of audio signals has been at the foundation of audio production since the advent of electrical recording in the 1920s, yet the mathematical and psychological bases for this activity are relatively under-studied. This paper investigates how the process of mixing music is conducted. We introduce a method of transformation from a “gain-space” to a “mix-space”, using a novel representation of the individual track gains. An experiment is conducted in order to obtain time-series data of mix engineers' exploration of this space as they adjust levels within a multitrack session to create their desired mixture. It is observed that, while the exploration of the space is influenced by the initial configuration of track gains, there is agreement between individuals on the appropriate gain settings required to create a balanced mixture. Implications for the design of intelligent music production systems are discussed.
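The chapter's exact gain-space-to-mix-space representation is not reproduced here; one plausible sketch of the underlying idea, in which overall level is factored out so that only the relative balance between tracks (the "mix") remains, could look like this. The function and dB values are hypothetical:

```python
import numpy as np

def to_mix_space(gains_db):
    """Hypothetical map from a vector of per-track gains (dB) to a
    point in 'mix-space': convert to linear amplitude, then normalize
    to unit length so a master-level change moves no distance."""
    g = 10.0 ** (np.asarray(gains_db, dtype=float) / 20.0)
    return g / np.linalg.norm(g)

# Two engineers whose gains differ only by a +10 dB master offset
# produce the same balance, hence the same mix-space point.
m1 = to_mix_space([0.0, -6.0, -12.0])
m2 = to_mix_space([10.0, 4.0, -2.0])
```

Under such a map, the agreement between individuals that the experiment reports would show up as trajectories in mix-space converging on a common region regardless of starting gains.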

10 citations

References
Journal ArticleDOI
TL;DR: A Technometrics review of Pattern Recognition and Machine Learning, Bishop's comprehensive graduate-level text on probabilistic modeling, inference, and learning.
Abstract: (2007). Pattern Recognition and Machine Learning. Technometrics: Vol. 49, No. 3, pp. 366-366.

18,802 citations

Journal ArticleDOI
TL;DR: In this article, the authors present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches, and discuss the advantages and disadvantages of these algorithms.
Abstract: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance, spectral clustering appears slightly mysterious, and it is not obvious why it works at all or what it really does. The goal of this tutorial is to give some intuition on these questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
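The tutorial's basic recipe (build a similarity graph, form its Laplacian, and cluster using the eigenvectors) reduces for two clusters to splitting on the sign of the Fiedler vector. The toy data and Gaussian-similarity bandwidth below are illustrative assumptions:

```python
import numpy as np

# Two well-separated point clouds to cluster.
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.2, (15, 2)),
                 rng.normal(3.0, 0.2, (15, 2))])

# Fully connected similarity graph with a Gaussian kernel.
d2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 0.5 ** 2))
np.fill_diagonal(W, 0.0)

# Unnormalized graph Laplacian L = D - W.
L = np.diag(W.sum(axis=1)) - W

# Eigenvector of the second-smallest eigenvalue (the Fiedler vector);
# its sign pattern gives the two-way partition.
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
```

For k > 2 clusters, the standard algorithms described in the tutorial instead run k-means on the rows of the first k eigenvectors.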

9,141 citations

Book
01 Aug 2006
TL;DR: Pattern Recognition and Machine Learning by Christopher M. Bishop (Springer, 2006), a standard graduate text on probabilistic machine learning.

8,923 citations

Journal ArticleDOI
TL;DR: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds, based on the well-known autocorrelation method with a number of modifications that combine to prevent errors.
Abstract: An algorithm is presented for the estimation of the fundamental frequency (F0) of speech or musical sounds. It is based on the well-known autocorrelation method with a number of modifications that combine to prevent errors. The algorithm has several desirable features. Error rates are about three times lower than the best competing methods, as evaluated over a database of speech recorded together with a laryngograph signal. There is no upper limit on the frequency search range, so the algorithm is suited for high-pitched voices and music. The algorithm is relatively simple and may be implemented efficiently and with low latency, and it involves few parameters that must be tuned. It is based on a signal model (periodic signal) that may be extended in several ways to handle various forms of aperiodicity that occur in particular applications. Finally, interesting parallels may be drawn with models of auditory processing.
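The core autocorrelation idea behind the algorithm can be sketched as below. The published method (YIN) replaces raw autocorrelation with a cumulative-mean-normalized difference function plus several error-prevention steps not shown here; the sampling rate and search range are illustrative:

```python
import numpy as np

def f0_autocorr(x, fs, fmin=50.0, fmax=1000.0):
    """Naive F0 estimate: the lag (within [fs/fmax, fs/fmin]) at which
    the autocorrelation of the frame peaks."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:]   # non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

fs = 8000
x = np.sin(2 * np.pi * 200.0 * np.arange(2048) / fs)   # 200 Hz tone
f0 = f0_autocorr(x, fs)
```

On real speech or music this naive version suffers octave errors, which is precisely what the paper's modifications are designed to prevent.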

1,975 citations

Journal ArticleDOI
TL;DR: This paper covers the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling, and elaborates on advanced computational techniques to address robustness and session variability.

1,433 citations