Author

Tuomas Virtanen

Bio: Tuomas Virtanen is an academic researcher from Tampere University of Technology. The author has contributed to research in topics: Source separation & Spectrogram. The author has an h-index of 52 and has co-authored 322 publications receiving 11,595 citations. Previous affiliations of Tuomas Virtanen include University of Tampere & University of Cambridge.


Papers
Journal ArticleDOI
TL;DR: An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented and enables a better separation quality than the previous algorithms.
Abstract: An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonnegative and favoring components whose gains are slowly varying and sparse. Temporal continuity is favored by using a cost term which is the sum of squared differences between the gains in adjacent frames, and sparseness is favored by penalizing nonzero gains. The proposed iterative estimation algorithm is initialized with random values, and the gains and the spectra are then alternately updated using multiplicative update rules until the values converge. Simulation experiments were carried out using generated mixtures of pitched musical instrument samples and drum sounds. The performance of the proposed method was compared with independent subspace analysis and basic nonnegative matrix factorization, which are based on the same linear model. According to these simulations, the proposed method enables better separation quality than the previous algorithms. In particular, the temporal continuity criterion improved the detection of pitched musical sounds. The sparseness criterion did not produce significant improvements.

1,096 citations
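
The linear model above is compact enough to sketch. Below is a minimal NumPy illustration of the factorization V ≈ WH with plain Euclidean multiplicative updates, plus the two penalties (temporal continuity and sparseness) written as standalone cost functions. This is a sketch of the underlying model only: the paper's exact regularized update rules and divergence choice are not reproduced, and the sizes and iteration counts are illustrative.

```python
import numpy as np

def nmf(V, n_components=10, n_iter=200, eps=1e-9, seed=0):
    """Factorize a magnitude spectrogram V (freq x frames) as V ~ W @ H,
    where W holds fixed magnitude spectra and H holds time-varying gains.
    Plain multiplicative updates; nonnegativity is preserved because
    every factor in the updates is nonnegative."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def temporal_continuity_cost(H):
    # Sum of squared differences between gains in adjacent frames.
    return np.sum(np.diff(H, axis=1) ** 2)

def sparseness_cost(H):
    # Penalize nonzero gains (an L1 stand-in for the paper's penalty).
    return np.sum(np.abs(H))
```

In the paper these penalties are weighted and folded into the update rules; here they are shown separately just to make the cost terms concrete.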

Proceedings ArticleDOI
01 Aug 2016
TL;DR: The recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of baseline systems for supervised acoustic scene classification and sound event detection, based on mel frequency cepstral coefficients and Gaussian mixture models, are presented.
Abstract: We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

519 citations
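
As a rough illustration of the MFCC-plus-GMM baseline described above, the sketch below trains one Gaussian mixture model per acoustic scene on MFCC frames and classifies a recording by total log-likelihood. It assumes librosa and scikit-learn are available; the feature and model sizes are placeholders, not the official baseline configuration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=20):
    # Frame-level MFCCs, shape (frames, n_mfcc).
    y, sr = librosa.load(path, sr=44100, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_scene_models(files_per_scene, n_components=16):
    # One GMM per acoustic scene, fit on pooled MFCC frames of that scene.
    return {scene: GaussianMixture(n_components=n_components, random_state=0)
                   .fit(np.vstack([mfcc_frames(p) for p in paths]))
            for scene, paths in files_per_scene.items()}

def classify_scene(path, models):
    # Pick the scene whose GMM gives the highest average frame log-likelihood.
    X = mfcc_frames(path)
    return max(models, key=lambda scene: models[scene].score(X))
```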

Journal ArticleDOI
TL;DR: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously.
Abstract: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations, where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of the presented metrics.

493 citations
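
A minimal sketch of the segment-based idea from the abstract above: event activity is rolled up into fixed-length segments per class, and an F-score is computed over the resulting binary matrices. This shows instance-based (micro) averaging only, and the helper names are hypothetical; the published toolbox covers the full set of metrics.

```python
import numpy as np

def to_segment_activity(events, n_classes, duration, seg_len=1.0):
    """Roll an event list [(onset_s, offset_s, class_idx), ...] into a
    boolean (segments x classes) activity matrix."""
    n_seg = int(np.ceil(duration / seg_len))
    act = np.zeros((n_seg, n_classes), dtype=bool)
    for onset, offset, c in events:
        act[int(onset // seg_len):int(np.ceil(offset / seg_len)), c] = True
    return act

def segment_f1(ref, est):
    # Micro-averaged (instance-based) F-score over all segment/class cells.
    tp = np.sum(ref & est)
    fp = np.sum(~ref & est)
    fn = np.sum(ref & ~est)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Class-based averaging would instead compute the F-score per class column and average the results, which weights rare classes equally; that difference is exactly the consequence the paper's case study examines.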

Journal ArticleDOI
TL;DR: Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas.
Abstract: Given the recent surge in developments of deep learning, this paper provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.

445 citations
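
Since the review singles out log-mel spectra as the dominant feature representation, here is a small librosa-based example of that front end; the sampling rate, FFT size, hop, and mel-band count are typical values, not ones prescribed by the review.

```python
import numpy as np
import librosa

def log_mel(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    # Log-compressed mel spectrogram, shape (n_mels, frames).
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)
```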

Journal ArticleDOI
TL;DR: In this paper, a convolutional recurrent neural network (CRNN) was proposed for the polyphonic sound event detection task and compared with CNN, RNN, and other established methods, and a considerable improvement was observed on four different datasets consisting of everyday sound events.
Abstract: Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.

432 citations
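
A compact PyTorch sketch of the CRNN idea described above: a CNN front end pools over frequency only (preserving the time axis), a bidirectional GRU models longer-term temporal context, and per-frame sigmoid outputs allow overlapping events. Layer counts and widths are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),            # pool frequency, keep time
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 16), 64,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, frames)
        z = self.cnn(x)                         # (batch, 64, n_mels//16, frames)
        z = z.permute(0, 3, 1, 2).flatten(2)    # (batch, frames, features)
        z, _ = self.rnn(z)                      # (batch, frames, 128)
        return torch.sigmoid(self.out(z))       # per-frame class activities
```

Multiple classes can be active in the same frame because each output unit has an independent sigmoid, which is what makes the model usable for polyphonic detection.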


Cited by
Journal ArticleDOI


08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at the time: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Proceedings ArticleDOI
05 Mar 2017
TL;DR: This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Abstract: Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

2,204 citations

Journal ArticleDOI
01 Oct 1980

1,565 citations

Book
12 Mar 2012
TL;DR: Comprehensive and coherent, this hands-on text develops everything from basic reasoning to advanced techniques within the framework of graphical models, and it builds the analytical and problem-solving skills that equip students for the real world.
Abstract: Machine learning methods extract value from vast data sets quickly and with modest resources. They are established tools in a wide range of industrial applications, including search engines, DNA sequencing, stock market analysis, and robot locomotion, and their use is spreading rapidly. People who know the methods have their choice of rewarding jobs. This hands-on text opens these opportunities to computer science students with modest mathematical backgrounds. It is designed for final-year undergraduates and master's students with limited background in linear algebra and calculus. Comprehensive and coherent, it develops everything from basic reasoning to advanced techniques within the framework of graphical models. Students learn more than a menu of techniques; they develop analytical and problem-solving skills that equip them for the real world. Numerous examples and exercises, both computer based and theoretical, are included in every chapter. Resources for students and instructors, including a MATLAB toolbox, are available online.

1,474 citations

Proceedings ArticleDOI
05 Mar 2017
TL;DR: In this paper, the authors used various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels.
Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

1,470 citations
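
In the spirit of the experiments above, one common way to reuse an image CNN for audio is to treat log-mel spectrograms as one-channel images. The PyTorch sketch below adapts a torchvision ResNet-18 (a stand-in for the architectures the paper examines) for multi-label classification with a sigmoid/binary cross-entropy head; the class count and input shape are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def audio_resnet(n_classes=527):
    net = resnet18(weights=None)
    # Log-mel spectrograms have one channel, not three.
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, n_classes)
    return net

# Multi-label clip-level classification: independent sigmoids per class,
# trained with binary cross-entropy (shapes here are illustrative).
model = audio_resnet()
logits = model(torch.randn(2, 1, 64, 96))   # (batch, 1, n_mels, frames)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 527))
```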