Home
/
Authors
/
Siming Li

Author

Siming Li

Bio: Siming Li is an academic researcher from Stony Brook University. The author has contributed to research in topics: Natural language & Service provider. The author has an hindex of 7, co-authored 10 publications receiving 1646 citations. Previous affiliations of Siming Li include Tsinghua University.

Papers

PDF

Open Access

More filters

Journal Article•DOI•

BabyTalk: Understanding and Generating Simple Image Descriptions

[...]

Girish Kulkarni¹, Visruth Premraj¹, Vicente Ordonez¹, Sagnik Dhar¹, Siming Li¹, Yejin Choi¹, Alexander C. Berg¹, Tamara L. Berg¹ - Show less +4 more•Institutions (1)

Stony Brook University¹

01 Dec 2013-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The proposed system to automatically generate natural language descriptions from images is very effective at producing relevant sentences for images and generates descriptions that are notably more true to the specific image content than previous work.

...read moreread less

Abstract: We present a system to automatically generate natural language descriptions from images. This system consists of two parts. The first part, content planning, smooths the output of computer vision-based detection and recognition algorithms with statistics mined from large pools of visually descriptive text to determine the best content words to use to describe an image. The second step, surface realization, chooses words to construct natural language sentences based on the predicted content and general statistics from natural language. We present multiple approaches for the surface realization step and evaluate each using automatic measures of similarity to human generated reference descriptions. We also collect forced choice human evaluations between descriptions from the proposed generation system and descriptions from competing approaches. The proposed system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.

...read moreread less

791 citations

Proceedings Article•DOI•

Baby talk: Understanding and generating simple image descriptions

[...]

Girish Kulkarni¹, Visruth Premraj¹, Sagnik Dhar¹, Siming Li¹, Yejin Choi¹, Alexander C. Berg¹, Tamara L. Berg¹ - Show less +3 more•Institutions (1)

Stony Brook University¹

20 Jun 2011

TL;DR: A system to automatically generate natural language descriptions from images that exploits both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision that is very effective at producing relevant sentences for images.

...read moreread less

Abstract: We posit that visually descriptive language offers computer vision researchers both information about the world, and information about how people describe the world. The potential benefit from this source is made more significant due to the enormous amount of language data easily available today. We present a system to automatically generate natural language descriptions from images that exploits both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision. The system is very effective at producing relevant sentences for images. It also generates descriptions that are notably more true to the specific image content than previous work.

...read moreread less

564 citations

Proceedings Article•

Composing Simple Image Descriptions using Web-scale N-grams

[...]

Siming Li¹, Girish Kulkarni¹, Tamara L. Berg¹, Alexander C. Berg¹, Yejin Choi¹ - Show less +1 more•Institutions (1)

Stony Brook University¹

23 Jun 2011

TL;DR: A simple yet effective approach to automatically compose image descriptions given computer vision based inputs and using web-scale n-grams, which indicates that it is viable to generate simple textual descriptions that are pertinent to the specific content of an image, while permitting creativity in the description -- making for more human-like annotations than previous approaches.

...read moreread less

Abstract: Studying natural language, and especially how people describe the world around them can help us better understand the visual world. In turn, it can also help us in the quest to generate natural language that describes this world in a human manner. We present a simple yet effective approach to automatically compose image descriptions given computer vision based inputs and using web-scale n-grams. Unlike most previous work that summarizes or retrieves pre-existing text relevant to an image, our method composes sentences entirely from scratch. Experimental results indicate that it is viable to generate simple textual descriptions that are pertinent to the specific content of an image, while permitting creativity in the description -- making for more human-like annotations than previous approaches.

...read moreread less

371 citations

Proceedings Article•DOI•

Optimizing Bulk Transfers with Software-Defined Optical WAN

[...]

Xin Jin¹, Yiran Li², Da Wei², Siming Li³, Jie Gao³, Lei Xu, Guangzhi Li⁴, Wei Xu², Jennifer Rexford¹ - Show less +5 more•Institutions (4)

Princeton University¹, Tsinghua University², Stony Brook University³, AT&T Labs⁴

22 Aug 2016

TL;DR: Owan is presented, a novel traffic management system that optimizes wide-area bulk transfers with centralized joint control of the optical and network layers with efficient algorithms to jointly optimize optical circuit setup, routing and rate allocation, and dynamically adapt them to traffic demand changes.

...read moreread less

Abstract: Bulk transfer on the wide-area network (WAN) is a fundamental service to many globally-distributed applications. It is challenging to efficiently utilize expensive WAN bandwidth to achieve short transfer completion time and meet mission-critical deadlines. Advancements in software-defined networking (SDN) and optical hardware make it feasible and beneficial to quickly reconfigure optical devices in the optical layer, which brings a new opportunity for traffic management on the WAN. We present Owan, a novel traffic management system that optimizes wide-area bulk transfers with centralized joint control of the optical and network layers. \sysname can dynamically change the network-layer topology by reconfiguring the optical devices. We develop efficient algorithms to jointly optimize optical circuit setup, routing and rate allocation, and dynamically adapt them to traffic demand changes. We have built a prototype of Owan with commodity optical and electrical hardware. Testbed experiments and large-scale simulations on two ISP topologies and one inter-DC topology show that \sysname completes transfers up to 4.45x faster on average, and up to 1.36x more transfers meet their deadlines, as compared to prior methods that only control the network layer.

...read moreread less

128 citations

Journal Article•DOI•

Compact Conformal Map for Greedy Routing in Wireless Mobile Sensor Networks

[...]

Siming Li¹, Wei Zeng², Dengpan Zhou¹, Xianfeng Gu¹, Jie Gao¹ - Show less +1 more•Institutions (2)

Stony Brook University¹, Florida International University²

01 Jul 2016-IEEE Transactions on Mobile Computing

TL;DR: This work finds an embedding of the network such that greedy routing using the virtual coordinates guarantees delivery, thus eliminating the necessity of any recovery methods and represents the first practical solution for using virtual coordinates for greedy routing in a sensor network.

...read moreread less

Abstract: Motivated by mobile sensor networks as in participatory sensing applications, we are interested in developing a practical, lightweight solution for routing in a mobile network. While greedy routing is robust to mobility, it may get stuck in a local minimum, which then requires non-trivial recovery methods. We find an embedding of the network such that greedy routing using the virtual coordinates guarantees delivery, thus eliminating the necessity of any recovery methods. Our contribution is to replace the in-network computation of the embedding by a preprocessing of the domain before network deployment and encode the map of network domain to virtual coordinate space by using a small number of parameters which can be preloaded to all sensor nodes. As a result, the map is only dependent on the network domain and is independent of the network connectivity. Each node can directly compute or update its virtual coordinates by applying the locally stored map on its geographical coordinates. This represents the first practical solution for using virtual coordinates for greedy routing in a sensor network and could be easily extended to the case of a mobile network. The paper describes algorithmic innovations as well as implementations on a real testbed.

...read moreread less

14 citations

Cited by

PDF

Open Access

More filters

Proceedings Article•

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

[...]

Kelvin Xu¹, Jimmy Ba², Ryan Kiros², Kyunghyun Cho¹, Aaron Courville¹, Ruslan Salakhudinov³, Ruslan Salakhudinov², Rich Zemel³, Rich Zemel², Yoshua Bengio³, Yoshua Bengio¹ - Show less +7 more•Institutions (3)

Université de Montréal¹, University of Toronto², Canadian Institute for Advanced Research³

06 Jul 2015

TL;DR: An attention based model that automatically learns to describe the content of images is introduced that can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.

...read moreread less

Abstract: Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr9k, Flickr30k and MS COCO.

...read moreread less

6,485 citations

Posted Content•

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

[...]

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, Yoshua Bengio - Show less +4 more

10 Feb 2015-arXiv: Learning

TL;DR: This paper proposed an attention-based model that automatically learns to describe the content of images by focusing on salient objects while generating corresponding words in the output sequence, which achieved state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

...read moreread less

5,896 citations

Proceedings Article•DOI•

Show and tell: A neural image caption generator

[...]

Oriol Vinyals¹, Alexander Toshev¹, Samy Bengio¹, Dumitru Erhan¹•Institutions (1)

Google¹

07 Jun 2015

TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to generate natural sentences describing an image, which can be used to automatically describe the content of an image.

...read moreread less

Abstract: Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.

...read moreread less

5,095 citations

Proceedings Article•DOI•

Deep visual-semantic alignments for generating image descriptions

[...]

Andrej Karpathy¹, Li Fei-Fei¹•Institutions (1)

Stanford University¹

07 Jun 2015

TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.

...read moreread less

Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

...read moreread less

3,996 citations

Posted Content•

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

[...]

Jeff Donahue¹, Lisa Anne Hendricks¹, Marcus Rohrbach¹, Subhashini Venugopalan², Sergio Guadarrama¹, Kate Saenko³, Trevor Darrell¹ - Show less +3 more•Institutions (3)

University of California, Berkeley¹, University of Texas at Austin², University of Massachusetts Lowell³

17 Nov 2014-arXiv: Computer Vision and Pattern Recognition

TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep"' in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

...read moreread less

3,935 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse