Author

Sourish Chaudhuri

Other affiliations: Carnegie Mellon University

Bio: Sourish Chaudhuri is an academic researcher at Google. The author has contributed to research on topics including collaborative learning and speaker diarisation. The author has an h-index of 13 and has co-authored 36 publications, which have received 1,711 citations. Previous affiliations of Sourish Chaudhuri include Carnegie Mellon University.
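For readers unfamiliar with the h-index quoted in the bio, the following is a minimal sketch of how the metric is conventionally defined: the largest h such that h publications each have at least h citations. The function name `h_index` is hypothetical, and the input is only the five citation counts listed on this page. The author's other 31 publications are not shown here, so the sketch does not reproduce the reported value of 13.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Citation counts of the five papers listed on this page (an incomplete sample).
listed_on_page = [1470, 487, 129, 76, 67]
print(h_index(listed_on_page))  # prints 5
```

With only five papers in the sample, the result cannot exceed 5; the full publication list would be needed to arrive at the h-index stated above.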

Papers

Proceedings Article

CNN architectures for large-scale audio classification

Shawn Hershey¹, Sourish Chaudhuri¹, Daniel P. W. Ellis¹, Jort F. Gemmeke¹, Aren Jansen¹, R. Channing Moore¹, Manoj Plakal¹, Devin Platt¹, Rif A. Saurous¹, Bryan Seybold¹, Malcolm Slaney¹, Ron Weiss¹, Kevin W. Wilson¹

Google¹

05 Mar 2017

TL;DR: In this paper, the authors used various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels.

Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.

1,470 citations

Posted Content

CNN Architectures for Large-Scale Audio Classification

Google¹

29 Sep 2016 - arXiv: Sound

TL;DR: This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.

487 citations

Proceedings Article

Non-negative matrix factorization based compensation of music for automatic speech recognition.

Bhiksha Raj¹, Tuomas Virtanen², Sourish Chaudhuri¹, Rita Singh¹

Carnegie Mellon University¹, Tampere University of Technology²

26 Sep 2010

TL;DR: Non-negative matrix factorization based speech enhancement is proposed for robust automatic recognition of mixtures of speech and music, and is shown to produce a consistent, significant improvement in recognition performance over the baseline method.

Abstract: This paper proposes to use non-negative matrix factorization based speech enhancement in robust automatic recognition of mixtures of speech and music. We represent magnitude spectra of noisy speech signals as the non-negative weighted linear combination of speech and noise spectral basis vectors that are obtained from training corpora of speech and music. We use overcomplete dictionaries consisting of random exemplars of the training data. The method is tested on the Wall Street Journal large vocabulary speech corpus, which is artificially corrupted with polyphonic music from the RWC music database. Various music styles and speech-to-music ratios are evaluated. The proposed methods are shown to produce a consistent, significant improvement in recognition performance in comparison with the baseline method. Audio demonstrations of the enhanced signals are available at http://www.cs.tut.fi/~tuomasv.

129 citations
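The decomposition described in the abstract above, in which a noisy magnitude spectrogram is modeled as a non-negative combination of fixed speech and music basis vectors, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names `nmf_activations` and `enhance_speech`, the generalized KL-divergence multiplicative update, the soft-mask reconstruction, and the random toy dictionaries are all choices made here for the example. The paper itself uses exemplar dictionaries drawn from training corpora.

```python
import numpy as np


def nmf_activations(V, W, n_iter=200, eps=1e-12):
    """Estimate non-negative activations H such that V ~= W @ H, with W held fixed.

    Uses the standard multiplicative update for the generalized KL divergence
    (an assumption; the abstract does not state the cost function).
    """
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    col_sums = W.sum(axis=0)[:, None] + eps  # W^T @ ones, one value per basis vector
    for _ in range(n_iter):
        ratio = V / (W @ H + eps)
        H *= (W.T @ ratio) / col_sums
    return H


def enhance_speech(V, W_speech, W_music):
    """Return a speech magnitude estimate from the mixture spectrogram V."""
    W = np.hstack([W_speech, W_music])  # joint dictionary: speech atoms, then music atoms
    H = nmf_activations(V, W)
    k = W_speech.shape[1]
    speech = W_speech @ H[:k]
    music = W_music @ H[k:]
    mask = speech / (speech + music + 1e-12)  # soft mask with values in [0, 1)
    return mask * V


# Toy data: random non-negative matrices stand in for real spectral dictionaries.
rng = np.random.default_rng(1)
W_s = rng.random((64, 20))  # hypothetical speech dictionary (64 frequency bins)
W_m = rng.random((64, 20))  # hypothetical music dictionary
V = W_s @ rng.random((20, 10)) + W_m @ rng.random((20, 10))  # mixture, 10 frames
S_hat = enhance_speech(V, W_s, W_m)
print(S_hat.shape)  # prints (64, 10)
```

Holding the dictionary fixed and updating only the activations mirrors the setup in the abstract, where the basis vectors come from training data rather than being learned from the noisy signal.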

Posted Content

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew C. Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru¹

Google¹

05 Jan 2019 - arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper presents the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison, and introduces a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compares several variants.

Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made comparisons and improvements difficult. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) that will be released publicly to facilitate algorithm development and enable comparisons. The dataset contains temporally labeled face tracks in video, where each face instance is labeled as speaking or not, and whether the speech is audible. This dataset contains about 3.65 million human labeled frames or about 38.5 hours of face tracks, and the corresponding audio. We also present a new audio-visual approach for active speaker detection, and analyze its performance, demonstrating both its strength and the contributions of the dataset.

76 citations

Proceedings Article

Ava Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

Joseph Roth¹, Sourish Chaudhuri¹, Ondrej Klejch¹, Radhika Marvin¹, Andrew C. Gallagher¹, Liat Kaver¹, Sharadh Ramaswamy¹, Arkadiusz Stopczynski¹, Cordelia Schmid¹, Zhonghua Xi¹, Caroline Pantofaru¹

Google¹

04 May 2020

TL;DR: The AVA Active Speaker dataset (AVA-ActiveSpeaker) contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible.

Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual active speaker dataset has limited evaluation in terms of data diversity, environments, and accuracy. In this paper, we present the AVA Active Speaker detection dataset (AVA-ActiveSpeaker) which has been publicly released to facilitate algorithm development and comparison. It contains temporally labeled face tracks in videos, where each face instance is labeled as speaking or not, and whether the speech is audible. The dataset contains about 3.65 million human labeled frames spanning 38.5 hours. We also introduce a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compare several variants. The evaluation clearly demonstrates a significant gain due to audio-visual modeling and temporal integration over multiple frames.

67 citations

Cited by

Proceedings Article

Audio Set: An ontology and human-labeled dataset for audio events

Jort F. Gemmeke¹, Daniel P. W. Ellis¹, Dylan Freedman¹, Aren Jansen¹, Wade Lawrence¹, R. Channing Moore¹, Manoj Plakal¹, Marvin Ritter¹

Google¹

05 Mar 2017

TL;DR: The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Abstract: Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.

2,204 citations

Integrating "Cooperative Learning" into the Classroom: A Brief Discussion of Quality-Oriented English Education

樊希强

01 Jan 2002

TL;DR: This paper describes how the interactions learners have with each other build interpersonal skills, such as listening, politely interrupting, expressing ideas, raising questions, disagreeing, paraphrasing, negotiating, and asking for help.

Abstract: 1. Interaction. The interactions learners have with each other build interpersonal skills, such as listening, politely interrupting, expressing ideas, raising questions, disagreeing, paraphrasing, negotiating, and asking for help. 2. Interdependence. Learners must depend on one another to accomplish a common objective. Each group member has specific tasks to complete, and successful completion of each member’s tasks results in attaining the overall group objective.

2,171 citations

Journal Article

Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning.

Nicolas Coudray¹, Paolo S. Ocampo¹, Theodore Sakellaropoulos², Navneet Narula¹, Matija Snuderl¹, David Fenyö¹, Andre L. Moreira¹, Narges Razavian¹, Aristotelis Tsirigos¹

New York University¹, National Technical University of Athens²

17 Sep 2018 - Nature Medicine

TL;DR: A deep convolutional neural network model is trained on whole-slide images obtained from The Cancer Genome Atlas to accurately and automatically classify them into LUAD, LUSC or normal lung tissue and predicts the ten most commonly mutated genes in LUAD.

Abstract: Visual inspection of histopathology slides is one of the main methods used by pathologists to assess the stage, type and subtype of lung tumors. Adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) are the most prevalent subtypes of lung cancer, and their distinction requires visual inspection by an experienced pathologist. In this study, we trained a deep convolutional neural network (inception v3) on whole-slide images obtained from The Cancer Genome Atlas to accurately and automatically classify them into LUAD, LUSC or normal lung tissue. The performance of our method is comparable to that of pathologists, with an average area under the curve (AUC) of 0.97. Our model was validated on independent datasets of frozen tissues, formalin-fixed paraffin-embedded tissues and biopsies. Furthermore, we trained the network to predict the ten most commonly mutated genes in LUAD. We found that six of them (STK11, EGFR, FAT1, SETBP1, KRAS and TP53) can be predicted from pathology images, with AUCs from 0.733 to 0.856 as measured on a held-out population. These findings suggest that deep-learning models can assist pathologists in the detection of cancer subtype or gene mutations. Our approach can be applied to any cancer type, and the code is available at https://github.com/ncoudray/DeepPATH.

1,682 citations

Proceedings Article

RandAugment: Practical Automated Data Augmentation with a Reduced Search Space

Ekin D. Cubuk¹, Barret Zoph¹, Jonathon Shlens¹, Quoc V. Le¹

Google¹

01 Jan 2020

TL;DR: This work proposes a simplified search space that vastly reduces the computational expense of automated augmentation, and permits the removal of a separate proxy task.

Abstract: Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models. Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images. An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models. In this work, we remove both of these obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes. RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet. On the ImageNet dataset we achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO. Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size. Code is available online.

1,581 citations

Proceedings Article

Implicit Neural Representations with Periodic Activation Functions

Vincent Sitzmann¹, Julien N. P. Martel¹, Alexander W. Bergman¹, David B. Lindell¹, Gordon Wetzstein¹

Stanford University¹

17 Jun 2020

TL;DR: In this paper, the authors propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives.

Abstract: Implicitly defined, continuous, differentiable signal representations parameterized by neural networks have emerged as a powerful paradigm, offering many possible benefits over conventional representations. However, current network architectures for such implicit neural representations are incapable of modeling signals with fine detail, and fail to represent a signal's spatial and temporal derivatives, despite the fact that these are essential to many physical signals defined implicitly as the solution to partial differential equations. We propose to leverage periodic activation functions for implicit neural representations and demonstrate that these networks, dubbed sinusoidal representation networks or Sirens, are ideally suited for representing complex natural signals and their derivatives. We analyze Siren activation statistics to propose a principled initialization scheme and demonstrate the representation of images, wavefields, video, sound, and their derivatives. Further, we show how Sirens can be leveraged to solve challenging boundary value problems, such as particular Eikonal equations (yielding signed distance functions), the Poisson equation, and the Helmholtz and wave equations. Lastly, we combine Sirens with hypernetworks to learn priors over the space of Siren functions.

1,058 citations
