Home
/
Authors
/
Emre Cakir

Author

Emre Cakir

Bio: Emre Cakir is an academic researcher from Tampere University of Technology. The author has contributed to research in topics: Recurrent neural network & Artificial neural network. The author has an hindex of 12, co-authored 19 publications receiving 928 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

[...]

Emre Cakir¹, Giambattista Parascandolo¹, Toni Heittola¹, Heikki Huttunen¹, Tuomas Virtanen¹ - Show less +1 more•Institutions (1)

Tampere University of Technology¹

01 Jun 2017-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: In this paper, a convolutional recurrent neural network (CRNN) was proposed for polyphonic sound event detection task and compared with CNN, RNN and other established methods, and observed a considerable improvement for four different datasets consisting of everyday sound events.

...read moreread less

Abstract: Sound events often occur in unstructured environments where they exhibit wide variations in their frequency content and temporal structure. Convolutional neural networks CNNs are able to extract higher level features that are invariant to local spectral and temporal variations. Recurrent neural networks RNNs are powerful in learning the longer term temporal context in the audio signals. CNNs and RNNs as classifiers have recently shown improved performances over established methods in various sound recognition tasks. We combine these two approaches in a convolutional recurrent neural network CRNN and apply it on a polyphonic sound event detection task. We compare the performance of the proposed CRNN method with CNN, RNN, and other established methods, and observe a considerable improvement for four different datasets consisting of everyday sound events.

...read moreread less

432 citations

Proceedings Article•DOI•

Polyphonic sound event detection using multi label deep neural networks

[...]

Emre Cakir¹, Toni Heittola¹, Heikki Huttunen¹, Tuomas Virtanen¹•Institutions (1)

Tampere University of Technology¹

12 Jul 2015

TL;DR: Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work and the proposed method improves the accuracy by 19% percentage points overall.

...read moreread less

Abstract: In this paper, the use of multi label neural networks are proposed for detection of temporally overlapping sound events in realistic environments. Real-life sound recordings typically have many overlapping sound events, making it hard to recognize each event with the standard sound event detection methods. Frame-wise spectral-domain features are used as inputs to train a deep neural network for multi label classification in this work. The model is evaluated with recordings from realistic everyday environments and the obtained overall accuracy is 63.8%. The method is compared against a state-of-the-art method using non-negative matrix factorization as a pre-processing stage and hidden Markov models as a classifier. The proposed method improves the accuracy by 19% percentage points overall.

...read moreread less

255 citations

Proceedings Article•DOI•

Convolutional recurrent neural networks for bird audio detection

[...]

Emre Cakir¹, Sharath Adavanne¹, Giambattista Parascandolo¹, Konstantinos Drossos¹, Tuomas Virtanen¹ - Show less +1 more•Institutions (1)

Tampere University of Technology¹

01 Aug 2017

TL;DR: In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames.

...read moreread less

Abstract: Bird sounds possess distinctive spectral structure which may exhibit small shifts in spectrum depending on the bird species and environmental conditions. In this paper, we propose using convolutional recurrent neural networks on the task of automated bird audio detection in real-life environments. In the proposed method, convolutional layers extract high dimensional, local frequency shift invariant features, while recurrent layers capture longer term dependencies between the features extracted from short time frames. This method achieves 88.5% Area Under ROC Curve (AUC) score on the unseen evaluation data and obtains the second place in the Bird Audio Detection challenge.

...read moreread less

74 citations

Proceedings Article•DOI•

Robust direction estimation with convolutional neural networks based steered response power

[...]

Pasi Pertilä¹, Emre Cakir¹•Institutions (1)

Tampere University of Technology¹

05 Mar 2017

TL;DR: In this paper, the use of convolutional neural networks (CNNs) was proposed for the prediction of a TF mask for emphasizing the direct path speech signal in time-varying interference.

...read moreread less

Abstract: The steered response power (SRP) methods can be used to build a map of sound direction likelihood. In the presence of interference and reverberation, the map will exhibit multiple peaks with heights related to the corresponding sound's spectral content. Often in realistic use cases, the target of interest (such as speech) can exhibit a lower peak compared to an interference source. This will corrupt any direction dependent method, such as beamforming. Regression has been used to predict time-frequency (TF) regions corrupted by reverberation, and static broadband noise can be efficiently estimated for TF points. TF regions dominated by noise or reverberation can then be de-emphasized to obtain more reliable source direction estimates. In this work, we propose the use of convolutional neural networks (CNNs) for the prediction of a TF mask for emphasizing the direct path speech signal in time-varying interference. SRP with phase transform (SRP-PHAT) combined with the CNN-based masking is shown to be capable of reducing the impact of time-varying interference for speaker direction estimation using real speech sources in reverberation.

...read moreread less

53 citations

Proceedings Article•DOI•

Stacked convolutional and recurrent neural networks for bird audio detection

[...]

Sharath Adavanne¹, Konstantinos Drossos¹, Emre Cakir¹, Tuomas Virtanen¹•Institutions (1)

Tampere University of Technology¹

01 Aug 2017

TL;DR: In this paper, the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks was studied and the best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.

...read moreread less

Abstract: This paper studies the detection of bird calls in audio segments using stacked convolutional and recurrent neural networks. Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data. The contributions of two kinds of acoustic features (dominant frequency and log mel-band energy) and their combinations are studied in the context of bird audio detection. Our best achieved AUC measure on five cross-validations of the development data is 95.5% and 88.1% on the unseen evaluation data.

...read moreread less

49 citations

1
2
3
4
…

Cited by

PDF

Open Access

More filters

The Self-Organizing Map

[...]

Teuvo Kohonen¹•Institutions (1)

Helsinki University of Technology¹

01 Jan 1990

TL;DR: An overview of the self-organizing map algorithm, on which the papers in this issue are based, is presented in this article, where the authors present an overview of their work.

...read moreread less

Abstract: An overview of the self-organizing map algorithm, on which the papers in this issue are based, is presented in this article.

...read moreread less

2,933 citations

Journal Article•DOI•

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

[...]

Justin Salamon¹, Juan Pablo Bello¹•Institutions (1)

New York University¹

23 Jan 2017-IEEE Signal Processing Letters

TL;DR: It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.

...read moreread less

Abstract: The ability of deep convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep CNN architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

...read moreread less

996 citations

Journal Article•DOI•

Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification

[...]

Justin Salamon¹, Juan Pablo Bello¹•Institutions (1)

New York University¹

15 Aug 2016-arXiv: Sound

TL;DR: In this paper, the authors proposed a deep convolutional neural network architecture for environmental sound classification and used audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture.

...read moreread less

Abstract: The ability of deep convolutional neural networks (CNN) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification. Second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a "shallow" dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model's classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

...read moreread less

864 citations

Posted Content•

SoundNet: Learning Sound Representations from Unlabeled Video

[...]

Yusuf Aytar¹, Carl Vondrick¹, Antonio Torralba¹•Institutions (1)

Massachusetts Institute of Technology¹

27 Oct 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, the authors leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos and propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabelled video as a bridge.

...read moreread less

Abstract: We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels

...read moreread less

725 citations

Journal Article•DOI•

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

[...]

Qiuqiang Kong¹, Yin Cao¹, Turab Iqbal¹, Yuxuan Wang, Wenwu Wang¹, Mark D. Plumbley¹ - Show less +2 more•Institutions (1)

University of Surrey¹

19 Oct 2020-IEEE Transactions on Audio, Speech, and Language Processing

TL;DR: This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.

...read moreread less

Abstract: Audio pattern recognition is an important research topic in the machine learning area, and includes several tasks such as audio tagging, acoustic scene classification, music classification, speech emotion classification and sound event detection. Recently, neural networks have been applied to tackle audio pattern recognition problems. However, previous systems are built on specific datasets with limited durations. Recently, in computer vision and natural language processing, systems pretrained on large-scale datasets have generalized well to several tasks. However, there is limited research on pretraining systems on large-scale datasets for audio pattern recognition. In this paper, we propose pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset. These PANNs are transferred to other audio related tasks. We investigate the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks. We propose an architecture called Wavegram-Logmel-CNN using both log-mel spectrogram and waveform as input feature. Our best PANN system achieves a state-of-the-art mean average precision (mAP) of 0.439 on AudioSet tagging, outperforming the best previous system of 0.392. We transfer PANNs to six audio pattern recognition tasks, and demonstrate state-of-the-art performance in several of those tasks. We have released the source code and pretrained models of PANNs: https://github.com/qiuqiangkong/audioset_tagging_cnn .

...read moreread less

560 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184

Collapse