Home
/
Authors
/
Joshua Kravitz

Author

Joshua Kravitz

Bio: Joshua Kravitz is an academic researcher from Stanford University. The author has contributed to research in topics: Question answering & Crowdsourcing. The author has an hindex of 4, co-authored 5 publications receiving 3955 citations.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

[...]

Ranjay Krishna¹, Yuke Zhu¹, Oliver Groth², Justin Johnson¹, Kenji Hata¹, Joshua Kravitz¹, Stephanie Chen¹, Yannis Kalantidis³, Li-Jia Li, David A. Shamma⁴, Michael S. Bernstein¹, Li Fei-Fei¹ - Show less +8 more•Institutions (4)

Stanford University¹, Dresden University of Technology², Yahoo!³, Centrum Wiskunde & Informatica⁴

01 May 2017-International Journal of Computer Vision

TL;DR: The Visual Genome dataset as mentioned in this paper contains over 108k images where each image has an average of $35$35 objects, $26$26 attributes, and $21$21 pairwise relationships between objects.

...read moreread less

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of $$35$$35 objects, $$26$$26 attributes, and $$21$$21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

...read moreread less

3,842 citations

Posted Content•

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

[...]

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, Fei-Fei Li - Show less +8 more

23 Feb 2016-arXiv: Computer Vision and Pattern Recognition

TL;DR: The Visual Genome dataset is presented, which contains over 108K images where each image has an average of $$35$$35 objects, $$26$$26 attributes, and $$21$$21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.

...read moreread less

Abstract: Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers will need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) in order to answer correctly that "the person is riding a horse-drawn carriage". In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 100K images where each image has an average of 21 objects, 18 attributes, and 18 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and questions answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.

...read moreread less

1,663 citations

Proceedings Article•DOI•

Embracing Error to Enable Rapid Crowdsourcing

[...]

Ranjay Krishna¹, Kenji Hata¹, Stephanie Chen¹, Joshua Kravitz¹, David A. Shamma², Li Fei-Fei¹, Michael S. Bernstein¹ - Show less +3 more•Institutions (2)

Stanford University¹, Yahoo!²

07 May 2016

TL;DR: This work presents a technique that produces extremely rapid judgments for binary and categorical labels, and demonstrates that it is possible to rectify errors by randomizing task order and modeling response latency.

...read moreread less

Abstract: Microtask crowdsourcing has enabled dataset advances in social science and machine learning, but existing crowdsourcing schemes are too expensive to scale up with the expanding volume of data. To scale and widen the applicability of crowdsourcing, we present a technique that produces extremely rapid judgments for binary and categorical labels. Rather than punishing all errors, which causes workers to proceed slowly and deliberately, our technique speeds up workers' judgments to the point where errors are acceptable and even expected. We demonstrate that it is possible to rectify these errors by randomizing task order and modeling response latency. We evaluate our technique on a breadth of common labeling tasks such as image verification, word similarity, sentiment analysis and topic classification. Where prior work typically achieves a 0.25x to 1x speedup over fixed majority vote, our approach often achieves an order of magnitude (10x) speedup.

...read moreread less

95 citations

Proceedings Article•DOI•

Embracing Error to Enable Rapid Crowdsourcing

[...]

Ranjay Krishna¹, Kenji Hata¹, Stephanie Chen¹, Joshua Kravitz¹, David A. Shamma², Li Fei-Fei¹, Michael S. Bernstein¹ - Show less +3 more•Institutions (2)

Stanford University¹, Yahoo!²

14 Feb 2016-arXiv: Human-Computer Interaction

TL;DR: In this article, the authors present a technique that produces extremely rapid judgments for binary and categorical labels, rather than punishing all errors, which causes workers to proceed slowly and deliberately, and demonstrate that it is possible to rectify these errors by randomizing task order and modeling response latency.

...read moreread less

15 citations

Journal Article•DOI•

The Emotional Toll of Inflammatory Bowel Disease: Using Machine Learning to Analyze Online Community Forum Discourse

[...]

Robert Lerrigo, Johnny T R Coffey¹, Joshua Kravitz¹, Priyanka Jadhav², Azadeh Nikfarjam¹, Nigam H. Shah¹, Dan Jurafsky¹, Sidhartha R. Sinha - Show less +4 more•Institutions (2)

Stanford University¹, Northwestern University²

01 Jul 2019

TL;DR: High-throughput, machine learning-directed analysis of OCFs may help identify psychosocial impacts of inflammatory bowel disease on patients and their caregivers.

...read moreread less

Abstract: Patients with inflammatory bowel disease are using online community forums (OCFs) to seek emotional support. The impact of OCFs on well-being and their emotional content are unknown. We used an unsupervised machine learning algorithm to identify the thematic content of 51,591 public, online posts from the Crohn’s & Colitis Foundation Community Forum. We identified 10,702 (20.8%) posts expressing: gratitude (40%), anxiety/fear (20.8%), empathy (18.2%), anger/frustration (13.4%), hope (13.2%), happiness (10.0%), sadness/depression (5.8%), shame/guilt (2.5%), and/or loneliness (2.5%). A common subtheme was the importance of fostering social support. High-throughput, machine learning-directed analysis of OCFs may help identify psychosocial impacts of inflammatory bowel disease on patients and their caregivers.

...read moreread less

6 citations

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

[...]

Stanford University¹, Dresden University of Technology², Yahoo!³, Centrum Wiskunde & Informatica⁴

01 May 2017-International Journal of Computer Vision

...read moreread less

3,842 citations

Journal Article•DOI•

Places: A 10 Million Image Database for Scene Recognition

[...]

Bolei Zhou¹, Agata Lapedriza², Aditya Khosla¹, Aude Oliva¹, Antonio Torralba¹ - Show less +1 more•Institutions (2)

Massachusetts Institute of Technology¹, Open University of Catalonia²

01 Jun 2018-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The Places Database is described, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world, using the state-of-the-art Convolutional Neural Networks as baselines, that significantly outperform the previous approaches.

...read moreread less

Abstract: The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using the state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines, that significantly outperform the previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high-coverage and high-diversity of exemplars, the Places Database along with the Places-CNNs offer a novel resource to guide future progress on scene recognition problems.

...read moreread less

3,215 citations

Proceedings Article•DOI•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

[...]

Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang² - Show less +3 more•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

18 Jun 2018

TL;DR: In this paper, a bottom-up and top-down attention mechanism was proposed to enable attention to be calculated at the level of objects and other salient image regions, which achieved state-of-the-art results on the MSCOCO test server.

...read moreread less

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

...read moreread less

2,904 citations

Posted Content•

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering.

[...]

Peter Anderson¹, Xiaodong He, Chris Buehler², Damien Teney³, Mark Johnson, Stephen Gould¹, Lei Zhang² - Show less +3 more•Institutions (3)

Australian National University¹, Microsoft², University of Adelaide³

25 Jul 2017-arXiv: Computer Vision and Pattern Recognition

TL;DR: A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.

...read moreread less

2,248 citations

Proceedings Article•DOI•

ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases

[...]

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, Ronald M. Summers - Show less +2 more

21 Jul 2017

TL;DR: The ChestX-ray dataset as discussed by the authors contains 108,948 frontal-view X-ray images of 32,717 unique patients with the text-mined eight disease image labels from the associated radiological reports using natural language processing.

...read moreread less

Abstract: The chest X-ray is one of the most commonly accessible radiological examinations for screening and diagnosis of many lung diseases. A tremendous number of X-ray imaging studies accompanied by radiological reports are accumulated and stored in many modern hospitals Picture Archiving and Communication Systems (PACS). On the other side, it is still an open question how this type of hospital-size knowledge database containing invaluable imaging informatics (i.e., loosely labeled) can be used to facilitate the data-hungry deep learning paradigms in building truly large-scale high precision computer-aided diagnosis (CAD) systems. In this paper, we present a new chest X-ray database, namely ChestX-ray8, which comprises 108,948 frontal-view X-ray images of 32,717 unique patients with the text-mined eight disease image labels (where each image can have multi-labels), from the associated radiological reports using natural language processing. Importantly, we demonstrate that these commonly occurring thoracic diseases can be detected and even spatially-located via a unified weakly-supervised multi-label image classification and disease localization framework, which is validated using our proposed dataset. Although the initial quantitative results are promising as reported, deep convolutional neural network based reading chest X-rays (i.e., recognizing and locating the common disease patterns trained with only image-level labels) remains a strenuous task for fully-automated high precision CAD systems.

...read moreread less

2,100 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse