Author

Tuomas Virtanen

Bio: Tuomas Virtanen is an academic researcher from Tampere University of Technology. The author has contributed to research in topics: Source separation & Spectrogram. The author has an h-index of 52 and has co-authored 322 publications receiving 11,595 citations. Previous affiliations of Tuomas Virtanen include University of Tampere & University of Cambridge.


Papers
Journal ArticleDOI
TL;DR: An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented and enables a better separation quality than the previous algorithms.
Abstract: An unsupervised learning algorithm for the separation of sound sources in one-channel music signals is presented. The algorithm is based on factorizing the magnitude spectrogram of an input signal into a sum of components, each of which has a fixed magnitude spectrum and a time-varying gain. Each sound source, in turn, is modeled as a sum of one or more components. The parameters of the components are estimated by minimizing the reconstruction error between the input spectrogram and the model, while restricting the component spectrograms to be nonnegative and favoring components whose gains are slowly varying and sparse. Temporal continuity is favored by using a cost term which is the sum of squared differences between the gains in adjacent frames, and sparseness is favored by penalizing nonzero gains. The proposed iterative estimation algorithm is initialized with random values, and the gains and the spectra are then alternately updated using multiplicative update rules until the values converge. Simulation experiments were carried out using generated mixtures of pitched musical instrument samples and drum sounds. The performance of the proposed method was compared with independent subspace analysis and basic nonnegative matrix factorization, which are based on the same linear model. According to these simulations, the proposed method enables better separation quality than the previous algorithms. In particular, the temporal continuity criterion improved the detection of pitched musical sounds. The sparseness criterion did not produce significant improvements.

1,096 citations
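
The linear model above is compact enough to sketch. Below is a minimal NumPy illustration of the factorization V ≈ WH with plain Euclidean multiplicative updates, plus the two penalties (temporal continuity and sparseness) written as standalone cost functions. This is a sketch of the underlying model only: the paper's exact regularized update rules and divergence choice are not reproduced, and the sizes and iteration counts are illustrative.

```python
import numpy as np

def nmf(V, n_components=10, n_iter=200, eps=1e-9, seed=0):
    """Factorize a magnitude spectrogram V (freq x frames) as V ~ W @ H,
    where W holds fixed magnitude spectra and H holds time-varying gains.
    Plain multiplicative updates; nonnegativity is preserved because
    every factor in the updates is nonnegative."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_components)) + eps
    H = rng.random((n_components, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def temporal_continuity_cost(H):
    # Sum of squared differences between gains in adjacent frames.
    return np.sum(np.diff(H, axis=1) ** 2)

def sparseness_cost(H):
    # Penalize nonzero gains (an L1 stand-in for the paper's penalty).
    return np.sum(np.abs(H))
```

In the paper these penalties are weighted and folded into the update rules; here they are shown separately just to make the cost terms concrete.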

Proceedings ArticleDOI
01 Aug 2016
TL;DR: The recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of baseline systems for supervised acoustic scene classification and sound event detection, based on mel frequency cepstral coefficients and Gaussian mixture models, are presented.
Abstract: We introduce TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark onset, offset and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup and performance of supervised acoustic scene classification system and event detection baseline system using mel frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.

519 citations
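
As a rough illustration of the MFCC-plus-GMM baseline described above, the sketch below trains one Gaussian mixture model per acoustic scene on MFCC frames and classifies a recording by total log-likelihood. It assumes librosa and scikit-learn are available; the feature and model sizes are placeholders, not the official baseline configuration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=20):
    # Frame-level MFCCs, shape (frames, n_mfcc).
    y, sr = librosa.load(path, sr=44100, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_scene_models(files_per_scene, n_components=16):
    # One GMM per acoustic scene, fit on pooled MFCC frames of that scene.
    return {scene: GaussianMixture(n_components=n_components, random_state=0)
                   .fit(np.vstack([mfcc_frames(p) for p in paths]))
            for scene, paths in files_per_scene.items()}

def classify_scene(path, models):
    # Pick the scene whose GMM gives the highest average frame log-likelihood.
    X = mfcc_frames(path)
    return max(models, key=lambda scene: models[scene].score(X))
```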

Journal ArticleDOI
TL;DR: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations where there are typically multiple sound sources active simultaneously.
Abstract: This paper presents and discusses various metrics proposed for evaluation of polyphonic sound event detection systems used in realistic situations, where there are typically multiple sound sources active simultaneously. The system output in this case contains overlapping events, marked as multiple sounds detected as being active at the same time. The polyphonic system output requires a suitable procedure for evaluation against a reference. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they need to be partially redefined to deal with the overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of the presented metrics.

493 citations
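
A minimal sketch of the segment-based idea from the abstract above: event activity is rolled up into fixed-length segments per class, and an F-score is computed over the resulting binary matrices. This shows instance-based (micro) averaging only, and the helper names are hypothetical; the published toolbox covers the full set of metrics.

```python
import numpy as np

def to_segment_activity(events, n_classes, duration, seg_len=1.0):
    """Roll an event list [(onset_s, offset_s, class_idx), ...] into a
    boolean (segments x classes) activity matrix."""
    n_seg = int(np.ceil(duration / seg_len))
    act = np.zeros((n_seg, n_classes), dtype=bool)
    for onset, offset, c in events:
        act[int(onset // seg_len):int(np.ceil(offset / seg_len)), c] = True
    return act

def segment_f1(ref, est):
    # Micro-averaged (instance-based) F-score over all segment/class cells.
    tp = np.sum(ref & est)
    fp = np.sum(~ref & est)
    fn = np.sum(ref & ~est)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Class-based averaging would instead compute the F-score per class column and average the results, which weights rare classes equally; that difference is exactly the consequence the paper's case study examines.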

Journal ArticleDOI
TL;DR: Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas.
Abstract: Given the recent surge in developments of deep learning, this paper provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.

445 citations
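
Since the review singles out log-mel spectra as the dominant feature representation, here is a small librosa-based example of that front end; the sampling rate, FFT size, hop, and mel-band count are typical values, not ones prescribed by the review.

```python
import numpy as np
import librosa

def log_mel(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    # Log-compressed mel spectrogram, shape (n_mels, frames).
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(mel + 1e-6)
```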

Journal ArticleDOI
TL;DR: In this paper, a convolutional recurrent neural network (CRNN) was proposed for the polyphonic sound event detection task and compared with CNN, RNN, and other established methods, and a considerable improvement was observed on four different datasets consisting of everyday sound events.
Abstract: Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks (CNNs) are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks (RNNs) are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural network (CRNN) and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.

432 citations
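
A compact PyTorch sketch of the CRNN idea described above: a CNN front end pools over frequency only (preserving the time axis), a bidirectional GRU models longer-term temporal context, and per-frame sigmoid outputs allow overlapping events. Layer counts and widths are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),            # pool frequency, keep time
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((4, 1)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 16), 64,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, frames)
        z = self.cnn(x)                         # (batch, 64, n_mels//16, frames)
        z = z.permute(0, 3, 1, 2).flatten(2)    # (batch, frames, features)
        z, _ = self.rnn(z)                      # (batch, frames, 128)
        return torch.sigmoid(self.out(z))       # per-frame class activities
```

Multiple classes can be active in the same frame because each output unit has an independent sigmoid, which is what makes the model usable for polyphonic detection.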


Cited by
Journal ArticleDOI


08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at the time: an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Proceedings ArticleDOI
05 Mar 2017
TL;DR: This paper describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Abstract: Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

2,204 citations

Journal ArticleDOI
01 Oct 1980

1,565 citations

Book
12 Mar 2012
TL;DR: Comprehensive and coherent, this hands-on text develops everything from basic reasoning to advanced techniques within the framework of graphical models, and it builds the analytical and problem-solving skills that equip students for the real world.
Abstract: Machine learning methods extract value from vast data sets quickly and with modest resources. They are established tools in a wide range of industrial applications, including search engines, DNA sequencing, stock market analysis, and robot locomotion, and their use is spreading rapidly. People who know the methods have their choice of rewarding jobs. This hands-on text opens these opportunities to computer science students with modest mathematical backgrounds. It is designed for final-year undergraduates and master's students with limited background in linear algebra and calculus. Comprehensive and coherent, it develops everything from basic reasoning to advanced techniques within the framework of graphical models. Students learn more than a menu of techniques; they develop analytical and problem-solving skills that equip them for the real world. Numerous examples and exercises, both computer based and theoretical, are included in every chapter. Resources for students and instructors, including a MATLAB toolbox, are available online.

1,474 citations

Proceedings ArticleDOI
05 Mar 2017
TL;DR: In this paper, the authors used various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels.
Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

1,470 citations
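
In the spirit of the experiments above, one common way to reuse an image CNN for audio is to treat log-mel spectrograms as one-channel images. The PyTorch sketch below adapts a torchvision ResNet-18 (a stand-in for the architectures the paper examines) for multi-label classification with a sigmoid/binary cross-entropy head; the class count and input shape are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def audio_resnet(n_classes=527):
    net = resnet18(weights=None)
    # Log-mel spectrograms have one channel, not three.
    net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, n_classes)
    return net

# Multi-label clip-level classification: independent sigmoids per class,
# trained with binary cross-entropy (shapes here are illustrative).
model = audio_resnet()
logits = model(torch.randn(2, 1, 64, 96))   # (batch, 1, n_mels, frames)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros(2, 527))
```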