Showing papers by "Andrew C. Gallagher published in 2013"


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work proposes a new approach for parsing RGB-D images using 3D block units for volumetric reasoning, and incorporates the intuition that a good 3D representation of the scene is the one that fits the data well, and is a stable, self-supporting arrangement of objects.
Abstract: 3D volumetric reasoning is important for truly understanding a scene. Humans are able to both segment each object in an image, and perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other objects to fall. We propose a new approach for parsing RGB-D images using 3D block units for volumetric reasoning. The algorithm fits image segments with 3D blocks, and iteratively evaluates the scene based on block interaction properties. We produce a 3D representation of the scene based on jointly optimizing over segmentations, block fitting, supporting relations, and object stability. Our algorithm incorporates the intuition that a good 3D representation of the scene is the one that fits the data well, and is a stable, self-supporting (i.e., one that does not topple) arrangement of objects. We experiment on several datasets including controlled and real indoor scenarios. Results show that our stability-reasoning framework improves RGB-D segmentation and scene volumetric representation.

133 citations
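The stability reasoning at the core of this paper can be illustrated with a minimal static-stability test: an arrangement of blocks is self-supporting only if each block's center of mass, projected onto the ground plane, falls inside its support region. This is a simplified sketch, not the authors' method (the paper jointly optimizes segmentation, block fitting, support relations, and stability); the function and data layout below are hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay

def is_statically_stable(center_of_mass, support_polygon):
    """Classical static-stability test: the center of mass, projected
    onto the ground plane (x, y), must lie inside the convex support
    polygon formed by the block's contact region."""
    tri = Delaunay(support_polygon)
    return bool(tri.find_simplex(center_of_mass[None, :2])[0] >= 0)

# Toy example: a block resting on a 1x1 contact region.
support = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(is_statically_stable(np.array([0.5, 0.5, 2.0]), support))  # True: stable
print(is_statically_stable(np.array([1.4, 0.5, 2.0]), support))  # False: topples
```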


Proceedings ArticleDOI
01 Sep 2013
TL;DR: A novel framework for recognizing kinship by modeling this problem as that of reconstructing the query face from a mixture of parts from a set of families, and achieves state-of-the-art family classification performance.
Abstract: We propose a new, challenging, problem in kinship classification: recognizing the family that a query person belongs to from a set of families. We propose a novel framework for recognizing kinship by modeling this problem as that of reconstructing the query face from a mixture of parts from a set of families. To accomplish this, we reconstruct the query face from a sparse set of samples among the candidate families. Our sparse group reconstruction roughly models the biological process of inheritance: a child inherits genetic material from two parents, and therefore may not appear completely similar to either parent, but is instead a composite of the parents. The family classification is determined based on the reconstruction error for each family. On our newly collected “Family101” dataset, we discover links between familial traits among family members and achieve state-of-the-art family classification performance.

110 citations
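The reconstruction-error idea translates naturally into code. Below is a minimal sketch in the spirit of sparse-representation classification, using a plain l1 penalty rather than the paper's group-structured sparsity; the feature layout and regularization value are assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import Lasso

def classify_family(query, family_feats):
    """family_feats: dict family -> (n_i, d) array of face features.
    Reconstruct the query face from a sparse combination of all
    families' samples, then score each family by how well its own
    portion of the sparse code reconstructs the query."""
    names = sorted(family_feats)
    D = np.vstack([family_feats[f] for f in names])       # (sum n_i, d)
    w = Lasso(alpha=0.01, max_iter=5000).fit(D.T, query).coef_
    errors, start = {}, 0
    for f in names:
        n = len(family_feats[f])
        recon = family_feats[f].T @ w[start:start + n]    # family-only reconstruction
        errors[f] = float(np.linalg.norm(query - recon))
        start += n
    return min(errors, key=errors.get)                    # lowest residual wins
```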


Proceedings ArticleDOI
23 Jun 2013
TL;DR: It is shown that describing people in terms of similarity to a vector of possible first names is a powerful description of facial appearance that can be used for face naming and building facial attribute classifiers.
Abstract: This paper introduces a new idea in describing people using their first names, i.e., the name assigned at birth. We show that describing people in terms of similarity to a vector of possible first names is a powerful description of facial appearance that can be used for face naming and building facial attribute classifiers. We build models for 100 common first names used in the United States and, for each pair of names, construct a pairwise first-name classifier. These classifiers are built using training images downloaded from the Internet, with no additional user interaction. This gives our approach important advantages in building practical systems that do not require additional human intervention for labeling. We use the scores from each pairwise name classifier as a set of facial attributes. We show several surprising results. Our name attributes predict the correct first names of test faces at rates far greater than chance. The name attributes are applied to gender recognition and to age classification, outperforming state-of-the-art methods with all training images automatically gathered from the Internet.

33 citations
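A rough sketch of how such pairwise name attributes could be built follows; the features, classifier choice, and data layout here are assumptions. With 100 names this yields C(100, 2) = 4950 classifiers, and a face's attribute vector is its score under every one of them.

```python
import itertools
import numpy as np
from sklearn.svm import LinearSVC

def train_pairwise_name_classifiers(faces_by_name):
    """faces_by_name: dict first_name -> (n, d) array of face features,
    e.g., extracted from images retrieved by searching each name online."""
    clfs = []
    for a, b in itertools.combinations(sorted(faces_by_name), 2):
        X = np.vstack([faces_by_name[a], faces_by_name[b]])
        y = np.r_[np.zeros(len(faces_by_name[a])), np.ones(len(faces_by_name[b]))]
        clfs.append(((a, b), LinearSVC(C=1.0).fit(X, y)))
    return clfs

def name_attribute_vector(face, clfs):
    """The face's 'name attributes': its decision score under each pairwise classifier."""
    return np.array([clf.decision_function(face[None])[0] for _, clf in clfs])
```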


Proceedings ArticleDOI
21 Oct 2013
TL;DR: A geotag-based inter- and intra-city travel recommendation system that considers both personal preference and seasonal/temporal popularity is presented, and a combination of two similarity measures among users is proposed.
Abstract: In this paper, a geotag-based inter- and intra-city travel recommendation system that considers both personal preference and seasonal/temporal popularity is presented. For the inter-city recommendation, a combination of two similarity measures among users is proposed. Accurate intra-city recommendation is achieved by incorporating seasonal and temporal information into a Markov model. The effectiveness of the proposed algorithm has been experimentally demonstrated using more than 6 million geotags downloaded from Flickr.

23 citations
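The abstract does not spell out the two similarity measures, so the sketch below assumes two plausible ones, cosine similarity over visited-city histograms and over season-of-travel histograms, blended with a weight alpha. All names and the blending scheme are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-negative histograms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def combined_user_similarity(user_a, user_b, alpha=0.5):
    """Blend two user-similarity measures; the specific histograms used
    here (cities visited, seasons of travel) are assumptions."""
    s_city = cosine(user_a["city_hist"], user_b["city_hist"])
    s_season = cosine(user_a["season_hist"], user_b["season_hist"])
    return alpha * s_city + (1 - alpha) * s_season
```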


Proceedings ArticleDOI
01 Dec 2013
TL;DR: This work proposes a spoken attribute classifier which models a more natural way of using an attribute in a description, and shows that as a result of using this model, it produces descriptions about images of people that are more natural and specific than past systems.
Abstract: In recent years, there has been a great deal of progress in describing objects with attributes. Attributes have proven useful for object recognition, image search, face verification, image description, and zero-shot learning. Typically, attributes are either binary or relative: they describe either the presence or absence of a descriptive characteristic, or the relative magnitude of the characteristic when comparing two exemplars. However, prior work fails to model the actual way in which humans use these attributes in descriptive statements of images. Specifically, it does not address the important interactions between the binary and relative aspects of an attribute. In this work we propose a spoken attribute classifier which models a more natural way of using an attribute in a description. For each attribute we train a classifier which captures the specific way this attribute should be used. We show that as a result of using this model, we produce descriptions about images of people that are more natural and specific than past systems.

22 citations
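One way to read "a classifier which captures the specific way this attribute should be used" is as a per-attribute phrasing decision. The sketch below is an assumption-heavy illustration, not the paper's model: given raw attribute scores for a target and a comparison person, decide whether to state the attribute as binary, as relative, or not at all.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

PHRASINGS = ["binary", "relative", "omit"]  # hypothetical label set

def train_spoken_attribute_model(pair_scores, phrasing_labels):
    """pair_scores: (n, 2) raw attribute scores for (target, other);
    phrasing_labels: (n,) indices into PHRASINGS, as chosen by human
    describers. One such model would be trained per attribute."""
    # Feature both absolute scores and their difference, so the model can
    # learn when a relative statement is warranted.
    feats = np.c_[pair_scores, pair_scores[:, 0] - pair_scores[:, 1]]
    return LogisticRegression(max_iter=1000).fit(feats, phrasing_labels)
```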


Proceedings ArticleDOI
02 Dec 2013
TL;DR: This work shows that human proficiency at object recognition is due to surface normal and depth edges, and suggests that future research should focus on explicitly modeling edge types to increase the likelihood of finding informative edges.
Abstract: In this paper, we investigate the ability of humans to recognize objects using different types of edges. Edges arise in images because of several different physical phenomena, such as shadow boundaries, changes in material albedo or reflectance, changes to surface normals, and occlusion boundaries. By constructing synthetic photorealistic scenes, we control which edges are visible in a rendered image to investigate the relationship between human visual recognition and each edge type. We evaluate the information conveyed by each edge type through human studies on object recognition tasks. We find that edges related to surface normals and depth are the most informative edges, while texture and shadow edges can confuse recognition tasks. This work corroborates recent advances in practical vision systems where active sensors capture depth edges (e.g., Microsoft Kinect) as well as in edge detection where progress is being made towards finding object boundaries instead of just pixel gradients. Further, we evaluate seven standard and state-of-the-art edge detectors based on the types of edges they find by comparing the detected edges with known informative edges in the synthetic scene. We suggest that this evaluation method could lead to more informed metrics for gauging developments in edge detection, without requiring any human labeling. In summary, this work shows that human proficiency at object recognition is due to surface normal and depth edges and suggests that future research should focus on explicitly modeling edge types to increase the likelihood of finding informative edges.

15 citations
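The detector evaluation described here, comparing detected edges against edges of known type in a synthetic scene, can be sketched as a simple per-type profile. The function and data layout below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def edge_type_profile(detected, type_masks, tol=1):
    """detected: boolean edge map from any detector; type_masks: dict
    edge_type -> boolean mask rendered with only that phenomenon enabled.
    Returns, per type, the fraction of detected edge pixels that lie
    within `tol` pixels of an edge of that type."""
    profile = {}
    n_det = max(int(detected.sum()), 1)
    for t, mask in type_masks.items():
        near = binary_dilation(mask, iterations=tol)  # tolerance band
        profile[t] = float((detected & near).sum()) / n_det
    return profile
```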


Journal ArticleDOI
TL;DR: The proposed system employs a trained classifier to detect pairs of video frames that are suitable for constructing pseudo-stereo images and recomposes the frame pairs to ensure consistent 3D perception for objects in such cases.
Abstract: Due to the advances in display technologies and commercial success of 3D motion pictures in recent years, there is renewed interest in enabling consumers to create 3D content. While new 3D content can be created using more advanced capture devices (i.e., stereo cameras), most people still own 2D capture devices. Further, enormously large collections of captured media exist only in 2D. We present a system for producing pseudo-stereo images from captured 2D videos. Our system employs a two-phase procedure where the first phase detects "good" pseudo-stereo image frames from a 2D video, which was captured a priori without any constraints on camera motion or content. We use a trained classifier to detect pairs of video frames that are suitable for constructing pseudo-stereo images. In particular, for a given frame I_t at time t, we determine whether an offset τ exists such that I_{t+τ} and I_t can form an acceptable pseudo-stereo image. Moreover, even if τ is determined, generating a good pseudo-stereo image from 2D captured video frames can be nontrivial since in many videos, professional or amateur, both foreground and background objects may undergo complex motion. Independent foreground motions from different scene objects define different epipolar geometries that cause the conventional method of generating pseudo-stereo images to fail. To address this problem, the second phase of the proposed system further recomposes the frame pairs to ensure consistent 3D perception for objects in such cases. In this phase, final left and right pseudo-stereo images are created by recompositing different regions of the initial frame pairs to ensure a consistent camera geometry. We verify the performance of our method for producing pseudo-stereo media from captured 2D videos in a psychovisual evaluation using both professional movie clips and amateur home videos.

13 citations
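The first-phase search can be sketched as scanning candidate offsets τ and keeping the pair the trained classifier scores highest. Here `pair_features` is a hypothetical stand-in for whatever motion/geometry features the classifier was trained on, and the scikit-learn-style `decision_function` interface is an assumption.

```python
import numpy as np

def best_stereo_offset(frames, t, pair_classifier, pair_features, max_offset=30):
    """Scan offsets tau and return the one for which (I_t, I_{t+tau})
    scores highest as an acceptable pseudo-stereo pair."""
    best_tau, best_score = None, -np.inf
    for tau in range(1, max_offset + 1):
        if t + tau >= len(frames):
            break
        feats = pair_features(frames[t], frames[t + tau])
        score = float(pair_classifier.decision_function(feats[None])[0])
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau, best_score
```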


Proceedings ArticleDOI
01 Sep 2013
TL;DR: This work demonstrates that face arrangement, when combined with attribute (age and gender) correspondence, is a useful cue in capturing an approximate social essence of the group of people, and lets us understand why the group of people gathered for the photo.
Abstract: When people gather for a group photo, they are together for a social reason. Past work has shown that these social relationships affect how people position themselves in a group photograph. We propose classifying the type of group photo based on the spatial arrangement and the predicted attributes of the faces in the image. We propose a matching algorithm for finding images from a training set that have both similar arrangement of faces and attribute correspondence. We formulate the problem as a bipartite matching problem where the faces from each of the pair of images are nodes in the graph. Our work demonstrates that face arrangement, when combined with attribute (age and gender) correspondence, is a useful cue in capturing an approximate social essence of the group of people, and lets us understand why the group of people gathered for the photo.

10 citations
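The bipartite matching step maps directly onto the Hungarian algorithm. A minimal sketch follows; the cost weights and feature layout are assumptions, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def arrangement_distance(pos_a, attr_a, pos_b, attr_b, w_pos=1.0, w_attr=1.0):
    """pos_*: (n, 2) normalized face positions; attr_*: (n, k) predicted
    attributes (e.g., age, gender score). Returns the cost of the best
    bipartite matching between the two images' faces; lower means the
    group photos have more similar arrangement and attributes."""
    C = (w_pos * np.linalg.norm(pos_a[:, None] - pos_b[None], axis=-1)
         + w_attr * np.linalg.norm(attr_a[:, None] - attr_b[None], axis=-1))
    rows, cols = linear_sum_assignment(C)   # handles unequal face counts
    return float(C[rows, cols].sum())
```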


Proceedings ArticleDOI
01 Dec 2013
TL;DR: A geotag-based travel route recommendation algorithm that considers seasonal and temporal popularity is presented, and it is shown that the recommendation accuracy can be improved by 0.9%–10.3% on average.
Abstract: In this paper, a geotag-based travel route recommendation algorithm that considers seasonal and temporal popularity is presented. Travel routes are extracted from geotags attached to Flickr images. Then, landmarks/routes that become particularly popular in a specific time range in a typical season are extracted. Using Bayes' theorem, the transition probability matrix is efficiently calculated. Experiments were conducted using 21 famous sightseeing cities/places around the world. The results show that the recommendation accuracy can be improved by 0.9%–10.3% on average. The proposed algorithm can also be incorporated into state-of-the-art algorithms, offering potential for further improvement in recommendation accuracy.

10 citations
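A minimal sketch of a season-conditioned transition matrix built from geotagged visit sequences: the paper derives it via Bayes' theorem, while plain counting with add-alpha smoothing is used below as a stand-in, so treat the details as assumptions.

```python
import numpy as np

def seasonal_transition_matrix(routes, n_landmarks, season, alpha=1.0):
    """routes: list of (season, [landmark ids in visit order]) extracted
    from geotagged photo streams. Counts transitions observed in the
    given season and row-normalizes with add-alpha smoothing, yielding
    P(next landmark | current landmark, season)."""
    counts = np.full((n_landmarks, n_landmarks), alpha)
    for s, route in routes:
        if s != season:
            continue
        for a, b in zip(route, route[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)
```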


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work introduces an efficient, principled method for choosing which attributes are included in a short description to maximize the likelihood that a third party will correctly guess to which person the description refers.
Abstract: Visual attributes are powerful features for many different applications in computer vision such as object detection and scene recognition. Visual attributes present another application that has not been examined as rigorously: verbal communication from a computer to a human. Since many attributes are nameable, the computer is able to communicate these concepts through language. However, this is not a trivial task. Given a set of attributes, selecting a subset to be communicated is task dependent. Moreover, because attribute classifiers are noisy, it is important to find ways to deal with this uncertainty. We address the issue of communication by examining the task of composing an automatic description of a person in a group photo that distinguishes him from the others. We introduce an efficient, principled method for choosing which attributes are included in a short description to maximize the likelihood that a third party will correctly guess to which person the description refers. We compare our algorithm to computer baselines and human describers, and show the strength of our method in creating effective descriptions.

9 citations
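The selection principle, picking attributes that maximize the chance a third party identifies the target, can be sketched greedily. This ignores the paper's specific treatment of classifier noise and models the guesser naively; all names below are illustrative.

```python
import numpy as np

def describe(target, attr_probs, k=3):
    """attr_probs: (n_people, n_attrs) estimated probability that each
    person has each attribute. Greedily pick k attributes maximizing the
    chance that a guesser, who scores people by the product of their
    stated-attribute probabilities, picks the target."""
    chosen = []
    for _ in range(k):
        best, best_p = None, -1.0
        for a in range(attr_probs.shape[1]):
            if a in chosen:
                continue
            scores = attr_probs[:, chosen + [a]].prod(axis=1)
            p = scores[target] / scores.sum()   # P(guesser picks target)
            if p > best_p:
                best, best_p = a, p
        chosen.append(best)
    return chosen
```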


Proceedings ArticleDOI
23 Jun 2013
TL;DR: This paper presents a method for including neighbors in a referring expression, and shows through experiments that using descriptions with neighbors can significantly improve the probability of conveying the correct information to a user.
Abstract: Referring expression generation is widely considered a basic building block of any natural language generation system. Generating these phrases, which can point out a single object from a group of objects, has been studied extensively in that community. However, to build systems which can discuss images in an intelligent way, it is necessary to consider additional factors unique to the visual domain. In this paper we consider the use of neighbors as anchors to create a referring expression for a person in a group image. We describe a target person using the people around him, when we cannot find a reliable set of attributes to describe the target himself. We first present a method for including neighbors in a referring expression, and discuss several ways of presenting this data to a user. We show through experiments that using descriptions with neighbors can significantly improve the probability of conveying the correct information to a user.
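One simple realization of the neighbor-anchoring idea: if no attribute of the target is confident enough, anchor the expression on a nearby person who is confidently describable. The data layout, thresholds, and phrasing below are assumptions, not the paper's method.

```python
def refer(target, people, conf_threshold=0.8):
    """people: list of dicts with 'attrs' -> {name: (phrase, confidence)}
    and a scalar 'position' (x coordinate in the image)."""
    def confident(p):
        return [ph for ph, c in p["attrs"].values() if c >= conf_threshold]

    own = confident(people[target])
    if own:
        return f"the person who is {own[0]}"
    # Fall back: anchor on the nearest confidently describable neighbor.
    order = sorted((abs(p["position"] - people[target]["position"]), i)
                   for i, p in enumerate(people) if i != target)
    for _, i in order:
        anchor = confident(people[i])
        if anchor:
            side = ("left" if people[target]["position"] < people[i]["position"]
                    else "right")
            return f"the person to the {side} of the person who is {anchor[0]}"
    return "the person"
```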

Proceedings ArticleDOI
23 Jun 2013
TL;DR: This work quantitatively shows that the depth ordering produced by the proposed combination of depth cues from object motion and monocular occlusion cues is superior to using either feature independently, or a naive combination of the features.
Abstract: In this work, we consider images of a scene with a moving object captured by a static camera. As the object (human or otherwise) moves about the scene, it reveals pairwise depth-ordering or occlusion cues. The goal of this work is to use these sparse occlusion cues along with monocular depth occlusion cues to densely segment the scene into depth layers. We cast the problem of depth-layer segmentation as a discrete labeling problem on a spatio-temporal Markov Random Field (MRF) that uses the motion occlusion cues along with monocular cues and a smooth motion prior for the moving object. We quantitatively show that the depth ordering produced by the proposed combination of depth cues from object motion and monocular occlusion cues is superior to using either feature independently, or a naive combination of the features.
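The MRF can be summarized by its energy. The sketch below writes a simplified version with a Potts smoothness term and a penalty on violated motion-occlusion orderings; the paper additionally uses a smooth motion prior and optimizes over space and time, so the weights and structure here are assumptions.

```python
import numpy as np

def depth_layer_energy(labels, unary, edges, occlusion_pairs, lam=1.0, mu=5.0):
    """labels: (n_nodes,) depth-layer index per node (smaller = nearer).
    unary: (n_nodes, n_layers) costs from monocular cues; edges: (i, j)
    pairs with a Potts smoothness term; occlusion_pairs: (i, j) meaning
    node i was observed to occlude node j, so i should be nearer."""
    e = unary[np.arange(len(labels)), labels].sum()
    for i, j in edges:
        e += lam * (labels[i] != labels[j])      # Potts smoothness
    for i, j in occlusion_pairs:
        e += mu * (labels[i] >= labels[j])       # penalize violated ordering
    return float(e)
```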

Proceedings ArticleDOI
01 Sep 2013
TL;DR: Experiments show that the mid-level color and depth features outperform using either depth or color alone, and the method surpasses the performance of baseline boundary detection methods.
Abstract: To enable high-level understanding of a scene, it is important to understand the occlusion and connected boundaries of objects in the image. In this paper, we propose a new framework for inferring boundaries from color and depth information. Even with depth information, it is not a trivial task to find and classify boundaries. Real-world depth images are noisy, especially at object boundaries, where our task is focused. Our approach uses features from both the color image (which are sharp at object boundaries) and the depth image (for providing geometric cues) to detect boundaries and classify them as occlusion or connected boundaries. We propose depth features based on surface fitting from sparse point clouds, and perform inference with a Conditional Random Field. One advantage of our approach is that occlusion and connected boundaries are identified with a single, common model. Experiments show that our mid-level color and depth features outperform using either depth or color alone, and our method surpasses the performance of baseline boundary detection methods.
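The surface-fitting depth features can be illustrated with a least-squares plane fit: near a connected boundary, points on both sides fit one smooth surface well, while an occlusion boundary leaves a large residual on at least one side. A minimal sketch, with the neighborhood construction and feature use assumed rather than taken from the paper:

```python
import numpy as np

def plane_fit_residual(points):
    """points: (n, 3) 3D points back-projected from the depth image
    around a candidate boundary pixel. Fit z = ax + by + c by least
    squares and return the RMS residual; a high residual suggests a
    depth discontinuity (occlusion) rather than a connected boundary."""
    A = np.c_[points[:, :2], np.ones(len(points))]
    coef, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    residual = points[:, 2] - A @ coef
    return float(np.sqrt(np.mean(residual ** 2)))
```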

Proceedings ArticleDOI
01 Sep 2013
TL;DR: This paper estimates the absolute orientation of a planar object with respect to the ground by combining the inertial sensor information with vision algorithms and proposes a novel homography decomposition to extract the rotation matrix.
Abstract: Photography on a mobile camera provides access to additional sensors. In this paper, we estimate the absolute orientation of a planar object with respect to the ground, which can be a valuable prior for many vision tasks. To find the planar object orientation, our novel algorithm combines information from a gravity sensor with a planar homography that matches a region of an image to a training image (e.g., of a company logo). We demonstrate our approach with an iPhone application that records the gravity direction for each captured image. We find a homography that maps the training image to the test image, and propose a novel homography decomposition to extract the rotation matrix. We believe this is the first paper to estimate absolute planar object orientation by combining the inertial sensor information with vision algorithms. Experiments show that our proposed algorithm performs reliably.
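The paper proposes its own homography decomposition; as an illustrative stand-in, OpenCV's standard decomposition yields candidate (R, t, n) solutions, and the measured gravity direction can disambiguate them, here by preferring the candidate whose plane normal is most consistent with gravity under the assumption of a roughly horizontal surface. This is a sketch of the general idea, not the authors' algorithm.

```python
import numpy as np
import cv2

def plane_rotation_from_gravity(H, K, gravity):
    """H: 3x3 homography from training image to test image; K: 3x3
    camera intrinsics; gravity: 3-vector from the phone's inertial
    sensor. Returns the candidate rotation and plane normal whose
    normal best aligns with the gravity direction."""
    _, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
    g = gravity / np.linalg.norm(gravity)
    best = max(range(len(Rs)),
               key=lambda i: abs(float(normals[i].ravel() @ g)))
    return Rs[best], normals[best]
```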