Author

Dhaval Salvi

Bio: Dhaval Salvi is an academic researcher from the University of South Carolina. The author has contributed to research in topics: Noun phrase & Image rectification. The author has an h-index of 6 and has co-authored 10 publications receiving 209 citations.

Papers
Proceedings Article
14 Aug 2012
TL;DR: In this article, the authors present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
Abstract: We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

106 citations
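As a rough illustration of the rendering step described in the abstract above (verb for the action class, noun phrases for participants, prepositional phrases for spatial relations, adverbs for manner), the following Python sketch composes a sentence from a recognized event structure. The Event fields and example values are hypothetical stand-ins, not the authors' actual representation or API.

```python
# Hypothetical sketch: turning a recognized event structure into a sentence,
# in the spirit of the paper's rendering step (verb for the action class,
# noun phrases for participants, prepositional phrases for spatial relations).
# The Event fields and example values are illustrative, not the authors' API.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    verb: str                      # action class, e.g. "carried"
    agent: str                     # subject noun phrase
    patient: Optional[str] = None  # object noun phrase, if any
    adjectives: List[str] = field(default_factory=list)  # properties of the patient
    spatial_pp: Optional[str] = None  # prepositional phrase, e.g. "towards the door"
    adverb: Optional[str] = None      # manner of the event, e.g. "slowly"


def render_sentence(e: Event) -> str:
    """Compose 'who did what to whom, where and how' from the event structure."""
    parts = [f"The {e.agent}", e.verb]
    if e.patient:
        parts.append(" ".join(["the"] + e.adjectives + [e.patient]))
    if e.spatial_pp:
        parts.append(e.spatial_pp)
    if e.adverb:
        parts.append(e.adverb)
    return " ".join(parts) + "."


if __name__ == "__main__":
    ev = Event(verb="carried", agent="person", patient="box",
               adjectives=["red"], spatial_pp="towards the door", adverb="slowly")
    print(render_sentence(ev))  # The person carried the red box towards the door slowly.
```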

Proceedings ArticleDOI
15 Jan 2013
TL;DR: This paper uses a graph model that describes the possible locations for segmenting neighboring characters, and develops an average longest path algorithm to identify the globally optimal segmentation, i.e., the text segmentation with the maximum average likeliness over the resulting characters.
Abstract: Offline handwritten text recognition is a very challenging problem. Aside from the large variation of different handwriting styles, neighboring characters within a word are usually connected, and we may need to segment a word into individual characters for accurate character recognition. Many existing methods achieve text segmentation by evaluating the local stroke geometry and imposing constraints on the size of each resulting character, such as the character width, height and aspect ratio. These constraints are well suited for printed texts, but may not hold for handwritten texts. Other methods apply holistic approach by using a set of lexicons to guide and correct the segmentation and recognition. This approach may fail when the lexicon domain is insufficient. In this paper, we present a new global non-holistic method for handwritten text segmentation, which does not make any limiting assumptions on the character size and the number of characters in a word. Specifically, the proposed method finds the text segmentation with the maximum average likeliness for the resulting characters. For this purpose, we use a graph model that describes the possible locations for segmenting neighboring characters, and we then develop an average longest path algorithm to identify the globally optimal segmentation. We conduct experiments on real images of handwritten texts taken from the IAM handwriting database and compare the performance of the proposed method against an existing text segmentation algorithm that uses dynamic programming.

30 citations
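The abstract describes finding the segmentation whose resulting characters have the maximum average likeliness by solving an average longest path problem on a graph of candidate cut positions. The sketch below illustrates that objective on a small DAG using a binary-search (parametric) reformulation of the maximum-average-weight path; it is not the paper's own algorithm, and the graph and scores are made-up placeholders.

```python
# Hedged sketch of the "maximum average weight path" idea on a DAG of candidate
# segmentation points: nodes are cut positions, an edge (i, j, score) means the
# image slice between cuts i and j looks like a single character with the given
# recognizer score. We search for the start-to-end path whose *average* edge
# score is maximal via binary search on the average (parametric reformulation).
# Graph construction and scores are illustrative placeholders, not the paper's code.

from typing import Dict, List, Tuple

Edge = Tuple[int, float]  # (target node, character likeliness score)


def best_shifted_path(graph: Dict[int, List[Edge]], start: int, end: int,
                      lam: float) -> float:
    """Max total of (score - lam) over paths start -> end; nodes are cut
    positions numbered left to right, so edges only go to higher indices."""
    NEG = float("-inf")
    best = {n: NEG for n in range(start, end + 1)}
    best[start] = 0.0
    for u in range(start, end + 1):
        if best[u] == NEG:
            continue
        for v, w in graph.get(u, []):
            best[v] = max(best[v], best[u] + (w - lam))
    return best[end]


def max_average_path(graph: Dict[int, List[Edge]], start: int, end: int) -> float:
    lo, hi = 0.0, 1.0  # likeliness scores assumed normalised to [0, 1]
    for _ in range(50):  # binary search on the achievable average
        mid = (lo + hi) / 2
        if best_shifted_path(graph, start, end, mid) >= 0:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # 4 cut positions; edge scores are made-up character likeliness values.
    g = {0: [(1, 0.9), (2, 0.4)], 1: [(2, 0.8), (3, 0.3)], 2: [(3, 0.7)]}
    print(round(max_average_path(g, 0, 3), 3))  # best achievable average, 0.8
```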

Proceedings ArticleDOI
13 Jun 2010
TL;DR: This paper proposes a new graph-theoretic approach for object localization that searches for an optimal subwindow without pre-specifying its shape, instead requiring the resulting subwindow to be well aligned with edge pixels detected from the image.
Abstract: Object localization in an image is usually handled by searching for an optimal subwindow that tightly covers the object of interest. However, the subwindows considered in previous work are limited to rectangles or other specified, simple shapes. With such specified shapes, no subwindow can cover the object of interest tightly. As a result, the desired subwindow around the object of interest may not be optimal in terms of the localization objective function, and cannot be detected by a subwindow search algorithm. In this paper, we propose a new graph-theoretic approach for object localization by searching for an optimal subwindow without pre-specifying its shape. Instead, we require the resulting subwindow to be well aligned with edge pixels that are detected from the image. This requirement is quantified and integrated into the localization objective function based on the widely-used bag of visual words technique. We show that the ratio-contour graph algorithm can be adapted to find the optimal free-shape subwindow in terms of the new localization objective function. In the experiment, we test the proposed approach on the PASCAL VOC 2006 and VOC 2007 databases for localizing several categories of animals. We find that its performance is better than the previous efficient subwindow search algorithm.

30 citations
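For context, the bag-of-visual-words localization objective referenced in the abstract typically scores a candidate region by summing per-feature weights (for example, from a linear SVM over visual-word histograms) for the features falling inside the region. The sketch below computes that score for an arbitrary region mask; the ratio-contour search for the optimal free-shape subwindow is not shown, and all features and weights are invented for illustration.

```python
# Illustrative sketch of the bag-of-visual-words localization objective that
# subwindow-search methods (and the paper's free-shape variant) optimise:
# each local feature contributes a per-visual-word weight, and a candidate
# region's score is the sum of the weights of the features it contains.
# Feature positions, word ids and weights below are made up for demonstration.

import numpy as np


def region_score(points: np.ndarray, word_ids: np.ndarray,
                 word_weights: np.ndarray, region_mask: np.ndarray) -> float:
    """Sum of per-word weights for the features falling inside the region.

    points:       (N, 2) integer (row, col) feature locations
    word_ids:     (N,)   visual-word index of each feature
    word_weights: (V,)   learned weight per visual word (e.g. linear SVM)
    region_mask:  (H, W) boolean mask of the candidate (free-shape) region
    """
    inside = region_mask[points[:, 0], points[:, 1]]
    return float(word_weights[word_ids[inside]].sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, V = 100, 100, 50
    pts = rng.integers(0, 100, size=(200, 2))
    words = rng.integers(0, V, size=200)
    weights = rng.normal(size=V)            # positive words support the object
    mask = np.zeros((H, W), dtype=bool)
    mask[20:60, 30:80] = True               # a rectangular region for simplicity
    print(region_score(pts, words, weights, mask))
```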

Posted Content
TL;DR: A system that produces sentential descriptions of video: who did what to whom, and where and how they did it, with an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
Abstract: We present a system that produces sentential descriptions of video: who did what to whom, and where and how they did it. Action class is rendered as a verb, participant objects as noun phrases, properties of those objects as adjectival modifiers in those noun phrases, spatial relations between those participants as prepositional phrases, and characteristics of the event as prepositional-phrase adjuncts and adverbial modifiers. Extracting the information needed to render these linguistic entities requires an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.

22 citations

Book ChapterDOI
06 Sep 2014
TL;DR: A new approach for video-based action detection where a set of temporally synchronized videos are taken by multiple wearable cameras from different and varying views, and the goal is to accurately localize the starting and ending time of each instance of the actions of interest in such videos.
Abstract: This paper is focused on developing a new approach for video-based action detection where a set of temporally synchronized videos are taken by multiple wearable cameras from different and varying views and our goal is to accurately localize the starting and ending time of each instance of the actions of interest in such videos. Compared with traditional approaches based on fixed-camera videos, this new approach incorporates the visual attention of the camera wearers and allows for the action detection in a larger area, although it brings in new challenges such as unconstrained motion of cameras. In this approach, we leverage the multi-view information and the temporal synchronization of the input videos for more reliable action detection. Specifically, we detect and track the focal character in each video and conduct action recognition only for the focal character in each temporal sliding window. To more accurately localize the starting and ending time of actions, we develop a strategy that may merge temporally adjacent sliding windows when detecting durative actions, and non-maximally suppress temporally adjacent sliding windows when detecting momentary actions. Finally we propose a voting scheme to integrate the detection results from multiple videos for more accurate action detection. For the experiments, we collect a new dataset of multiple wearable-camera videos that reflect the complex scenarios in practice.

8 citations
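The post-processing described in the abstract, merging adjacent positive windows for durative actions and temporally suppressing non-maximal windows for momentary actions, can be sketched roughly as follows, assuming a per-window classifier has already produced scores. Window times, scores, and thresholds are arbitrary illustrations, not the authors' settings.

```python
# Rough sketch of the sliding-window post-processing from the abstract:
# adjacent positive windows are merged into one interval for durative actions,
# while for momentary actions only the locally best window is kept (temporal NMS).
# Window lengths, thresholds and scores here are arbitrary illustrations.

from typing import List, Tuple

Window = Tuple[float, float, float]  # (start time, end time, score)


def merge_durative(windows: List[Window], thresh: float) -> List[Tuple[float, float]]:
    """Merge temporally adjacent/overlapping above-threshold windows."""
    pos = sorted(w for w in windows if w[2] >= thresh)
    merged: List[Tuple[float, float]] = []
    for s, e, _ in pos:
        if merged and s <= merged[-1][1]:           # touches the previous interval
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def nms_momentary(windows: List[Window], thresh: float) -> List[Window]:
    """Keep only above-threshold windows that beat their overlapping neighbours."""
    pos = [w for w in windows if w[2] >= thresh]
    keep = []
    for w in pos:
        overlapping = [o for o in pos if o[0] < w[1] and w[0] < o[1]]
        if all(w[2] >= o[2] for o in overlapping):
            keep.append(w)
    return keep


if __name__ == "__main__":
    wins = [(0, 2, 0.2), (1, 3, 0.8), (2, 4, 0.9), (6, 8, 0.7)]
    print(merge_durative(wins, 0.5))   # [(1, 4), (6, 8)]
    print(nms_momentary(wins, 0.5))    # [(2, 4, 0.9), (6, 8, 0.7)]
```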


Cited by
Journal ArticleDOI

08 Dec 2001-BMJ
TL;DR: There is, I think, something ethereal about i, the square root of minus one, which seemed an odd beast at first, an intruder hovering on the edge of reality.
Abstract: There is, I think, something ethereal about i —the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time—an intruder hovering on the edge of reality. Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

33,785 citations

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable; such models are shown to have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in spatial and temporal “layers”. Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

4,206 citations
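A minimal PyTorch sketch of the "recurrent convolutional" idea, a CNN encoding each frame and an LSTM modeling the temporal sequence, trained end to end, is given below. The tiny encoder and layer sizes are placeholders and do not reproduce the LRCN architecture from the paper.

```python
# Minimal PyTorch sketch of a recurrent convolutional video classifier:
# a CNN encodes each frame and an LSTM models the temporal sequence, with the
# whole pipeline trainable end to end. Layer sizes and the toy CNN are
# placeholders, not the architecture described in the paper.

import torch
import torch.nn as nn


class RecurrentConvNet(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(               # tiny per-frame encoder
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, channels, height, width)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])            # classify from the last time step


if __name__ == "__main__":
    model = RecurrentConvNet(num_classes=10)
    clip = torch.randn(2, 8, 3, 64, 64)         # 2 clips of 8 RGB frames
    print(model(clip).shape)                    # torch.Size([2, 10])
```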

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.

3,996 citations
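A hedged sketch of the kind of image-sentence scoring such alignment models use: region features and word features are projected into a shared embedding, and a pair is scored by summing, over words, the similarity to the best-matching region, trained with a ranking loss. The encoders below are replaced by random features, so this only illustrates the objective, not the paper's model.

```python
# Hedged sketch of an image-sentence alignment score in a shared embedding:
# each word votes for its best-matching image region, and a hinge ranking loss
# pushes the matching sentence to outscore a mismatched one. Dimensions are
# illustrative; the region CNN and bidirectional RNN are replaced by random
# features, so this is not the paper's model.

import torch


def image_sentence_score(regions: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
    """regions: (R, D) region embeddings; words: (W, D) word embeddings."""
    sims = words @ regions.T             # (W, R) word-region similarities
    return sims.max(dim=1).values.sum()  # each word votes for its best region


def ranking_loss(regions: torch.Tensor, pos_words: torch.Tensor,
                 neg_words: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """Hinge loss: the matching sentence should outscore a mismatched one."""
    pos = image_sentence_score(regions, pos_words)
    neg = image_sentence_score(regions, neg_words)
    return torch.clamp(margin - pos + neg, min=0)


if __name__ == "__main__":
    torch.manual_seed(0)
    D = 64
    regions = torch.randn(5, D)     # stand-ins for CNN region embeddings
    good = torch.randn(7, D)        # stand-ins for BRNN word embeddings
    bad = torch.randn(9, D)
    print(float(ranking_loss(regions, good, bad)))
```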

Posted Content
TL;DR: A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable; such models are shown to have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Abstract: Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

3,935 citations

Journal ArticleDOI
TL;DR: A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
Abstract: We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNN) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions outperform retrieval baselines on both full images and on a new dataset of region-level annotations. Finally, we conduct large-scale analysis of our RNN language model on the Visual Genome dataset of 4.1 million captions and highlight the differences between image and region-level caption statistics.

1,953 citations