
Showing papers by "Junsong Yuan published in 2015"


Proceedings Article•DOI•
07 Jun 2015
TL;DR: Experimental results on two challenging datasets, MSRII and UCF101, validate the superior performance of the proposed action proposals, as well as competitive results on action detection and search.
Abstract: In this paper we target the generation of generic action proposals in unconstrained videos. Each action proposal corresponds to a temporal series of spatial bounding boxes, i.e., a spatio-temporal video tube, which has good potential to localize one human action. Assuming each action is performed by a human with meaningful motion, both appearance and motion cues are utilized to measure the actionness of the video tubes. After picking those spatio-temporal paths with high actionness scores, our action proposal generation is formulated as a maximum set coverage problem, where greedy search is performed to select a set of action proposals that maximizes the overall actionness score. Compared with existing action proposal approaches, our action proposals do not rely on video segmentation and can be generated in nearly real time. Experimental results on two challenging datasets, MSRII and UCF101, validate the superior performance of our action proposals as well as competitive results on action detection and search.
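
To make the selection step concrete, here is a minimal sketch of greedy selection for the maximum set coverage formulation; the actionness scores and voxel coverage sets are illustrative stand-ins for the paper's exact definitions (for the unweighted maximum coverage objective, the greedy rule carries the classic (1 − 1/e) guarantee).

```python
def greedy_action_proposals(scores, coverages, k):
    """Greedy maximum set coverage over candidate spatio-temporal paths.

    scores[i]    : actionness score of candidate path i
    coverages[i] : set of video voxels covered by candidate i
    Returns indices of up to k selected proposals.
    """
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for i in range(len(scores)):
            if i in selected:
                continue
            # marginal gain: actionness weighted by newly covered volume
            gain = scores[i] * len(coverages[i] - covered)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:          # no remaining candidate adds coverage
            break
        selected.append(best)
        covered |= coverages[best]
    return selected
```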

258 citations


Journal Article•DOI•
TL;DR: An incremental Maximal-Conditional-Mutual-Information scheme for LBP structure learning that handles pixel correlation demonstrates superior performance over state-of-the-art results on classifying both spatial patterns, such as texture classification, scene recognition, and face recognition, and spatial-temporal patterns, such as dynamic texture recognition.

72 citations


Journal Article•DOI•
TL;DR: Qualitative and quantitative evaluations on the benchmark data set containing 51 challenging image sequences demonstrate that the proposed algorithm outperforms the state-of-the-art methods.
Abstract: The appearance of an object can change continuously during tracking, and is therefore not independent and identically distributed. A good discriminative tracker often needs a large number of training samples to fit the underlying data distribution, which is impractical for visual tracking. In this paper, we present a new discriminative tracker via landmark-based label propagation (LLP) that is nonparametric and makes no specific assumption about the sample distribution. With an undirected graph representation of samples, the LLP locally approximates the soft label of each sample by a linear combination of labels on its nearby landmarks. It is able to effectively propagate a limited number of initial labels to a large number of unlabeled samples. To this end, we introduce a local landmark approximation method to compute the cross-similarity matrix between the whole data and landmarks. Moreover, a soft label prediction function incorporating the graph Laplacian regularizer is used to diffuse the known labels to all the unlabeled vertices in the graph, which explicitly considers the local geometrical structure of all samples. Tracking is then carried out within a Bayesian inference framework, where the soft label prediction value is used to construct the observation model. Both qualitative and quantitative evaluations on the benchmark data set containing 51 challenging image sequences demonstrate that the proposed algorithm outperforms the state-of-the-art methods.
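
As a rough illustration of landmark-based propagation, the sketch below builds a landmark cross-similarity matrix and diffuses soft labels through the resulting anchor graph. The Gaussian kernel, bandwidth heuristic, and anchor-graph construction are assumptions made for the sketch, not the paper's exact formulation.

```python
import numpy as np

def landmark_label_propagation(X, landmarks, Y, n_neighbors=5, alpha=0.99):
    """Minimal landmark-based label propagation sketch.

    X         : (n, d) all samples;  landmarks : (m, d) landmark points
    Y         : (n, c) initial soft labels (zero rows for unlabeled samples)
    Returns diffused soft labels F for all samples.
    """
    # cross-similarity Z: each sample as a convex combination of its
    # nearest landmarks (a simple Gaussian-kernel stand-in)
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    Z = np.exp(-d2 / d2.mean())
    # keep only the n_neighbors closest landmarks per sample
    far = np.argsort(d2, axis=1)[:, n_neighbors:]
    np.put_along_axis(Z, far, 0.0, axis=1)
    Z /= Z.sum(axis=1, keepdims=True)                 # row-stochastic
    # low-rank graph affinity S = Z diag(Z^T 1)^-1 Z^T; labels diffuse
    # by solving the Laplacian-regularized system (I - alpha*S) F = Y
    D = np.maximum(Z.sum(axis=0), 1e-12)
    S = Z @ np.diag(1.0 / D) @ Z.T
    return np.linalg.solve(np.eye(len(X)) - alpha * S, Y)
```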

44 citations


Proceedings Article•DOI•
13 Oct 2015
TL;DR: This demo shows the possibility of interacting with 3D content using bare hands on wearable devices via two Augmented Reality applications: virtual teapot manipulation and fountain animation in hand.
Abstract: Wearable devices such as the Microsoft HoloLens and Google Glass have become highly popular in recent years. As traditional input hardware is difficult to use on such platforms, vision-based hand pose tracking and gesture control techniques are more suitable alternatives. This demo shows the possibility of interacting with 3D content using bare hands on wearable devices via two Augmented Reality applications: virtual teapot manipulation and fountain animation in hand. Technically, we use a head-mounted depth camera to capture RGB-D images from an egocentric view, and adopt a random forest to regress the palm pose and classify the hand gesture simultaneously via a spatial-voting framework. The predicted pose and gesture are used to render the 3D virtual objects, which are overlaid onto the hand region in the input RGB images using camera calibration parameters for seamless virtual and real scene synthesis.

43 citations


Journal Article•DOI•
TL;DR: This paper proposes a novel sparse representation method of SPD matrices in the data-dependent manifold kernel space, and designs two different positive definite kernel functions that can be readily transformed to the corresponding manifold kernels.
Abstract: The symmetric positive-definite (SPD) matrix, as a connected Riemannian manifold, has become increasingly popular for encoding image information. Most existing sparse models are still primarily developed in the Euclidean space. They do not consider the non-linear geometrical structure of the data space, and thus are not directly applicable to the Riemannian manifold. In this paper, we propose a novel sparse representation method of SPD matrices in the data-dependent manifold kernel space. The graph Laplacian is incorporated into the kernel space to better reflect the underlying geometry of SPD matrices. Under the proposed framework, we design two different positive definite kernel functions that can be readily transformed to the corresponding manifold kernels. The sparse representation obtained has more discriminating power. Extensive experimental results demonstrate good performance of manifold kernel sparse codes in image classification, face recognition, and visual tracking.
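
For intuition, one standard positive-definite kernel on SPD matrices is the Gaussian kernel under the log-Euclidean metric, sketched below; the paper's data-dependent kernels additionally fold in a graph Laplacian term, which is omitted here.

```python
import numpy as np

def spd_log(S):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_kernel(spd_list, gamma=1.0):
    """k(X, Y) = exp(-gamma * ||log(X) - log(Y)||_F^2): a Gaussian
    kernel under the log-Euclidean metric, positive definite because
    the log map embeds SPD matrices into a Euclidean space.
    """
    logs = [spd_log(S) for S in spd_list]   # map to the tangent space
    n = len(logs)
    K = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = np.sum((logs[i] - logs[j]) ** 2)
            K[i, j] = K[j, i] = np.exp(-gamma * d2)
    return K
```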

33 citations


Journal Article•DOI•
TL;DR: A novel method is presented to predict the 3-D joint positions from depth images and the parsed hand parts obtained with a pretrained classifier, showing that the regressor learned on a synthesized dataset also gives accurate predictions on real-world depth images by enforcing the hand part correlations despite the discrepancies between the two domains.
Abstract: The positions of the hand joints are important high-level features for hand-based human-computer interaction. We present a novel method to predict the 3-D joint positions from the depth images and the parsed hand parts obtained with a pretrained classifier. The hand parts are utilized as an additional cue to resolve the multimodal predictions produced by the previous regression-based method without significantly increasing the computational cost. In addition, we further enforce hand motion constraints to fuse the per-pixel prediction results. The posterior distribution of the joints is formulated as a weighted product-of-experts model based on the individual pixel predictions, which is maximized via the expectation-maximization algorithm on a learned low-dimensional space of the hand joint parameters. The experimental results show that the proposed method improves the prediction accuracy considerably compared with rival methods that also regress the joint locations from depth images. In particular, we show that the regressor learned on a synthesized dataset also gives accurate predictions on real-world depth images by enforcing the hand part correlations, despite the discrepancies between the two domains.

31 citations


Journal Article•DOI•
TL;DR: Propagative generalized Hough voting (HV) is proposed to propagate the label and spatio-temporal configuration information of local features via HV, addressing the case where insufficient training data are provided.
Abstract: Generalized Hough voting (HV) has shown promising results in both object and action detection. However, most existing HV methods suffer when insufficient training data are provided. We propose propagative HV to address this limitation and apply it to human activity analysis. Instead of training a discriminative classifier for local feature voting, we match individual local features to propagate their label and spatio-temporal configuration information via HV. To enable fast local feature matching, we index the local features using random projection trees (RPTs). RPTs can reveal the low-dimensional manifold structure to provide adaptive local feature matching. Moreover, as the RPT index can be built on either labeled or unlabeled datasets, it can be applied to different tasks, such as activity search (limited training) and recognition (sufficient training). The superior performance on benchmark datasets validates that our propagative HV can outperform state-of-the-art techniques in various activity analysis tasks, such as activity search, recognition, and prediction.
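
A random projection tree of the kind used for indexing can be sketched compactly; the median split and leaf size below are common choices, not necessarily those of the paper.

```python
import numpy as np

class RPTree:
    """Minimal random projection tree for approximate feature matching.
    At each node, points are split at the median of their projection
    onto a random direction; a query descends to a leaf of candidates.
    """
    def __init__(self, X, leaf_size=16, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.X = X
        self.root = self._build(np.arange(len(X)), leaf_size)

    def _build(self, idx, leaf_size):
        if len(idx) <= leaf_size:
            return ("leaf", idx)
        w = self.rng.normal(size=self.X.shape[1])      # random direction
        proj = self.X[idx] @ w
        t = np.median(proj)
        left, right = idx[proj <= t], idx[proj > t]
        if len(left) == 0 or len(right) == 0:          # degenerate split
            return ("leaf", idx)
        return ("node", w, t,
                self._build(left, leaf_size), self._build(right, leaf_size))

    def query(self, q):
        """Return the index of the nearest candidate in q's leaf."""
        node = self.root
        while node[0] == "node":
            _, w, t, left, right = node
            node = left if q @ w <= t else right
        cand = node[1]
        return cand[np.argmin(((self.X[cand] - q) ** 2).sum(-1))]
```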

30 citations


Journal Article•DOI•
TL;DR: A chi-squared transformation (CST) is proposed to transform the LBP feature into one that better fits a Gaussian distribution, which leads to the formulation of a two-class classification problem.
Abstract: Local binary pattern (LBP) and its variants have been widely used in many recognition tasks. Subspace approaches are often applied to the LBP feature in order to remove unreliable dimensions or to derive a compact feature representation. It is well known that subspace approaches utilizing up to second-order statistics are optimal only when the underlying distribution is Gaussian. However, due to its nonnegativity and simplex constraints, the LBP feature deviates significantly from a Gaussian distribution. To alleviate this problem, we propose a chi-squared transformation (CST) to transform the LBP feature into one that better fits a Gaussian distribution. The proposed CST leads to the formulation of a two-class classification problem. Due to its asymmetric nature, we apply asymmetric principal component analysis (APCA) to better remove the unreliable dimensions in the CST feature space. The proposed CST-APCA is evaluated extensively on spatial LBP for face recognition and protein cellular classification, and on spatial-temporal LBP for dynamic texture recognition. All experiments show that the proposed feature transformation significantly enhances the recognition accuracy.
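
The paper's exact CST is not reproduced here, but a closely related and widely used trick is sketched below: an element-wise square root maps simplex-constrained histograms so that the squared Euclidean distance tracks the chi-squared distance within a factor of two.

```python
import numpy as np

def cst_sqrt(h, eps=1e-12):
    """Square-root (Hellinger-style) map of an LBP histogram: after the
    transform, squared Euclidean distance between two histograms
    approximates their chi-squared distance within a factor of two.
    This is a stand-in for the paper's CST, not its exact form.
    """
    h = np.asarray(h, dtype=float)
    h = h / (h.sum() + eps)          # enforce the simplex constraint
    return np.sqrt(h)
```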

29 citations


Journal Article•DOI•
TL;DR: A topic model is proposed that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames, and can discover different types of topical objects despite variations in scale, viewpoint, color and lighting changes, or even partial occlusions.
Abstract: A topical video object refers to an object that is frequently highlighted in a video. It could be, e.g., the product logo or the leading actor/actress in a TV commercial. We propose a topic model that incorporates a word co-occurrence prior for efficient discovery of topical video objects from a set of key frames. Previous work using topic models, such as latent Dirichlet allocation (LDA), for video object discovery often takes a bag-of-visual-words representation, which ignores important co-occurrence information among the local features. We show that such data-driven co-occurrence information from the bottom up can conveniently be incorporated into LDA with a Gaussian Markov prior, which combines top-down probabilistic topic modeling with bottom-up priors in a unified model. Our experiments on challenging videos demonstrate that the proposed approach can discover different types of topical objects despite variations in scale, viewpoint, color and lighting changes, or even partial occlusions. The efficacy of the co-occurrence prior is clearly demonstrated when compared with topic models without such priors.

26 citations


Journal Article•DOI•
TL;DR: A randomized approach to deriving spatial context, in the form of spatial random partition, is proposed, which offers three benefits: the aggregation of matching scores over multiple random patches provides robust local matching; the matched objects can be directly identified on the pixelwise confidence map, which results in efficient object localization; and the algorithm lends itself to easy parallelization while allowing a flexible tradeoff between accuracy and speed.
Abstract: Searching visual objects in large image or video data sets is a challenging problem, because it requires efficient matching and accurate localization of query objects that often occupy a small part of an image. Although spatial context has been shown to help produce more reliable detection than methods that match local features individually, how to extract appropriate spatial context remains an open problem. Instead of using fixed-scale spatial context, we propose a randomized approach to deriving spatial context, in the form of spatial random partition. The effect of spatial context is achieved by averaging the matching scores over multiple random patches. Our approach offers three benefits: 1) the aggregation of the matching scores over multiple random patches provides robust local matching; 2) the matched objects can be directly identified on the pixelwise confidence map, which results in efficient object localization; and 3) our algorithm lends itself to easy parallelization and also allows a flexible tradeoff between accuracy and speed through adjusting the number of partition times. Both theoretical studies and experimental comparisons with the state-of-the-art methods validate the advantages of our approach.
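
The core aggregation idea can be sketched as follows; the grid-based partitioning and the match_score callback are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def random_partition_confidence(match_score, H, W, n_partitions=20,
                                grid=(4, 4), rng=None):
    """Pixel-wise confidence via spatial random partition (sketch).

    match_score : function(y0, y1, x0, x1) -> scalar matching score of
                  the query against the image patch [y0:y1, x0:x1]
    The image is randomly partitioned into a grid several times; each
    patch's score is spread over its pixels, and the per-pixel scores
    are averaged over partitions to form the confidence map.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    conf = np.zeros((H, W))
    for _ in range(n_partitions):
        # random grid lines define one partition of the image
        ys = [0, *np.sort(rng.integers(1, H, grid[0] - 1)), H]
        xs = [0, *np.sort(rng.integers(1, W, grid[1] - 1)), W]
        for y0, y1 in zip(ys[:-1], ys[1:]):
            for x0, x1 in zip(xs[:-1], xs[1:]):
                if y1 > y0 and x1 > x0:
                    conf[y0:y1, x0:x1] += match_score(y0, y1, x0, x1)
    return conf / n_partitions
```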

24 citations


Proceedings Article•DOI•
01 Jun 2015
TL;DR: This paper investigates the problem of minimizing the total inter-server traffic among a cluster of OSN servers through joint partitioning and replication optimization and proposes a Traffic-Optimized Partitioning and Replication (TOPR) method based on an analysis of how replica allocation affects the inter- server communication.
Abstract: Distributed storage systems are the key infrastructures for hosting the user data of large-scale Online Social Networks (OSNs). The amount of inter-server communication is an important scalability indicator for these systems. Data partitioning and replication are two inter-related issues affecting the inter-server traffic caused by user-initiated read and write operations. This paper investigates the problem of minimizing the total inter-server traffic among a cluster of OSN servers through joint partitioning and replication optimization. We propose a Traffic-Optimized Partitioning and Replication (TOPR) method based on an analysis of how replica allocation affects the inter-server communication. Lightweight algorithms are developed to adjust partitioning and replication dynamically according to data read and write rates. Evaluations with real Facebook and Twitter social graphs show that TOPR significantly reduces the inter-server communication compared with state-of-the-art methods.
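
The read/write trade-off that drives replica allocation can be illustrated with a toy rule: replicate a user's data on a remote server only when the read traffic saved exceeds the write traffic added. This is the intuition only, not TOPR's actual algorithm.

```python
from collections import defaultdict

def decide_replication(read_rate, write_rate, home):
    """Illustrative read/write trade-off behind replica allocation.

    read_rate[(u, v)] : rate at which user u reads user v's data
    write_rate[v]     : rate at which v's master copy is updated
    home[x]           : server hosting user x's master replica
    A replica of v on server s saves one inter-server read for every
    read of v issued from s, but costs one inter-server write for every
    update of v; replicate when the saving exceeds the cost.
    """
    reads_from = defaultdict(float)        # (server, v) -> read traffic
    for (u, v), r in read_rate.items():
        if home[u] != home[v]:
            reads_from[(home[u], v)] += r
    return {(s, v) for (s, v), r in reads_from.items()
            if r > write_rate[v]}
```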

Proceedings Article•DOI•
10 Dec 2015
TL;DR: A Group Saliency Propagation model is proposed in which a single group saliency map is developed and propagated to segment the entire group, with the added advantage of speed-up.
Abstract: Most existing co-segmentation methods are usually complex and require pre-grouping of images, fine-tuning of a few parameters, initial segmentation masks, etc. These limitations become serious concerns for their application to large-scale datasets. In this paper, a Group Saliency Propagation (GSP) model is proposed in which a single group saliency map is developed, which can be propagated to segment the entire group. In addition, it is also shown how a pool of these group saliency maps can help in quickly segmenting new input images. Experiments demonstrate that the proposed method achieves competitive performance on several benchmark co-segmentation datasets, including ImageNet, with the added advantage of speed-up.

Journal Article•DOI•
TL;DR: Two enhanced NRLBPs are proposed that jointly utilize the sign and the magnitude of the current pixel difference, as well as the information of other LBP bits, and demonstrate superior performance compared with NRLBP and other LBP variants.
Abstract: Local binary pattern (LBP) is sensitive to image noise. Noise-resistant LBP (NRLBP) improves the robustness to noise by incorporating the prior knowledge of images and the information of other LBP bits into the encoding process. However, it encodes the small pixel difference in such a way that its sign and magnitude are ignored. Although the small pixel difference may be easily distorted by noise, some of its information is still useful for LBP encoding. In this letter, we propose two enhanced NRLBPs that jointly utilize the sign and the magnitude of the current pixel difference, and also the information of other LBP bits. The proposed approaches are validated on two benchmark databases and demonstrate a superior performance compared with NRLBP and other LBP variants. The performance gain is significant when the noise level is high.
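
For context, a basic LBP encoder and a noise-resistant variant that expands uncertain bits can be sketched as below; the uniform-pattern filtering step of NRLBP and the paper's enhanced encodings are not reproduced.

```python
import numpy as np
from itertools import product

# clockwise 8-neighbour offsets of the centre of a 3x3 patch
ROWS = [0, 0, 0, 1, 2, 2, 2, 1]
COLS = [0, 1, 2, 2, 2, 1, 0, 0]

def lbp_code(patch):
    """Basic LBP: threshold the 8 neighbours against the centre pixel."""
    bits = (patch[ROWS, COLS] > patch[1, 1]).astype(int)
    return int("".join(map(str, bits)), 2)

def nrlbp_codes(patch, tau=2.0):
    """Noise-resistant flavour (sketch): neighbours whose difference
    from the centre falls within [-tau, tau] are treated as uncertain
    bits and expanded to both 0 and 1, giving a set of candidate codes.
    """
    diff = patch[ROWS, COLS].astype(float) - patch[1, 1]
    choices = [(1,) if d > tau else (0,) if d < -tau else (0, 1)
               for d in diff]
    return {int("".join(map(str, bits)), 2) for bits in product(*choices)}
```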

Proceedings Article•DOI•
01 Dec 2015
TL;DR: The approach here is to iteratively update the saliency maps through co-saliency estimation depending upon quality scores, which indicate the degree of separation of foreground and background likelihoods (the easier the separation, the higher the quality of saliency map).
Abstract: Despite recent advances in the joint processing of images, it may sometimes be less effective than single-image processing for object discovery problems. In this paper, while aiming at common object detection, we address this problem by proposing QCCE, a novel Quality Constrained Co-saliency Estimation method. The approach is to iteratively update the saliency maps through co-saliency estimation depending upon quality scores, which indicate the degree of separation of foreground and background likelihoods (the easier the separation, the higher the quality of the saliency map). In this way, joint processing is automatically constrained by the quality of the saliency maps. Moreover, the proposed method can be applied to both unsupervised and supervised scenarios, unlike other methods that are designed for one scenario only. Experimental results demonstrate the superior performance of the proposed method compared to state-of-the-art methods.
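
One plausible way to score how well a saliency map separates foreground from background is the histogram-overlap measure sketched below; it is a stand-in for the paper's quality score, not its definition.

```python
import numpy as np

def saliency_quality(saliency, threshold=0.5, bins=32):
    """Illustrative quality score: 1 minus the histogram overlap of the
    foreground and background saliency values, so an easily separable
    map scores high.
    """
    s = saliency.ravel()
    fg, bg = s[s >= threshold], s[s < threshold]
    if fg.size == 0 or bg.size == 0:
        return 0.0                 # degenerate map: nothing to separate
    hf, edges = np.histogram(fg, bins=bins, range=(0, 1), density=True)
    hb, _ = np.histogram(bg, bins=bins, range=(0, 1), density=True)
    overlap = np.minimum(hf, hb).sum() * (edges[1] - edges[0])
    return float(1.0 - overlap)
```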

Book•DOI•
26 Sep 2015
TL;DR: This is the first book to describe how Autonomous Virtual Humans and Social Robots can interact with real people, be aware of the environment around them, and react to various situations.
Abstract: This is the first book to describe how Autonomous Virtual Humans and Social Robots can interact with real people, be aware of the environment around them, and react to various situations. Researchers from around the world present the main techniques for tracking and analysing humans and their behaviour, and contemplate the potential for these virtual humans and robots to replace or stand in for their human counterparts, tackling areas such as awareness of and reactions to real-world stimuli, and using the same modalities as humans do: verbal and body gestures, facial expressions and gaze to aid seamless human-computer interaction (HCI). The research presented in this volume is split into three sections:
User Understanding through Multisensory Perception: deals with the analysis and recognition of a given situation or stimuli, addressing issues of facial recognition, body gestures and sound localization.
Facial and Body Modelling Animation: presents the methods used in modelling and animating faces and bodies to generate realistic motion.
Modelling Human Behaviours: presents the behavioural aspects of virtual humans and social robots when interacting and reacting to real humans and each other.
Context Aware Human-Robot and Human-Agent Interaction would be of great use to students, academics and industry specialists in areas like Robotics, HCI, and Computer Graphics.

Proceedings Article•DOI•
19 Apr 2015
TL;DR: This work proposes to determine the fuzzy membership function by the sign of the pixel difference only, shows that this approach is more robust to noise, and demonstrates a superior performance to FLBP and many other LBP variants.
Abstract: Face recognition under large illumination variations is challenging. Local binary pattern (LBP) is robust to illumination variation, but sensitive to noise. Fuzzy LBP (FLBP) partially solves the noise-sensitivity problem by incorporating fuzzy logic into the representation of local binary patterns. The fuzzy membership function is determined by both the sign and the magnitude of the pixel difference. However, the magnitude is easily altered by noise, and hence can be unreliable. Thus, we propose to determine the fuzzy membership function by the sign only. We name the proposed approach Quantized Fuzzy LBP (QFLBP). On two challenging face recognition datasets, it is shown to be more robust to noise, and demonstrates a superior performance to FLBP and many other LBP variants.

Proceedings Article•DOI•
06 Aug 2015
TL;DR: This paper demonstrates the possibility to recover both the articulated hand pose and its distance from the camera with a single RGB camera in egocentric view with good performance on both a synthesized dataset and several real-world color image sequences that are captured in different environments.
Abstract: Articulated hand pose recovery in egocentric vision is useful for in-air interaction with wearable devices, such as Google Glass. Despite the progress obtained with depth cameras, this task is still challenging with ordinary RGB cameras. In this paper we demonstrate the possibility of recovering both the articulated hand pose and its distance from the camera with a single RGB camera in egocentric view. We address this problem by modeling the distance as a hidden variable and use the Conditional Regression Forest to infer the pose and distance jointly. In particular, we find that the pose estimation accuracy can be further enhanced by incorporating the hand part semantics. The experimental results show that the proposed method achieves good performance on both a synthesized dataset and several real-world color image sequences captured in different environments. In addition, our system runs in real time at more than 10 fps.

Journal Article•DOI•
TL;DR: This work proposes a novel transductive learning approach that considers multiple feature types simultaneously to improve the classification performance, and allows all feature types to collaborate simultaneously.
Abstract: Much existing work on multi-feature learning relies on the agreement among different feature types to improve clustering or classification performance. However, as different feature types can have different data characteristics, such a forced agreement among feature types may not bring a satisfactory result. We propose a novel transductive learning approach that considers multiple feature types simultaneously to improve the classification performance. Instead of forcing different feature types to agree with each other, we perform spectral clustering in each feature type separately. Each data sample is then described by a co-occurrence of feature patterns among the different feature types, and we apply these feature co-occurrence representations to perform transductive learning, such that data samples with similar feature co-occurrence patterns share the same label. As the spectral clustering results in different feature types and the formed co-occurrence patterns influence each other under the transductive learning formulation, an iterative optimization approach is proposed to decouple these factors. Unlike co-training, which needs to iteratively update each individual feature type, our method allows all feature types to collaborate simultaneously. It can naturally handle multiple feature types together and is less sensitive to noisy feature types. The experimental results on synthetic, object, and action recognition datasets all validate the advantages of our method compared to state-of-the-art methods.
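
The co-occurrence representation itself is easy to sketch: cluster each feature type separately and describe every sample by its tuple of cluster labels. The spectral-clustering settings below are assumptions for the sketch; the paper's iterative transductive optimization is omitted.

```python
from sklearn.cluster import SpectralClustering

def co_occurrence_codes(feature_views, n_clusters=8):
    """Feature co-occurrence representation (sketch).

    feature_views : list of (n, d_k) arrays, one per feature type
    Returns one co-occurrence pattern (tuple of cluster ids across
    feature types) per sample.
    """
    labels = [
        SpectralClustering(n_clusters=n_clusters,
                           affinity="nearest_neighbors",
                           random_state=0).fit_predict(X)
        for X in feature_views          # cluster each feature type alone
    ]
    return list(zip(*labels))
```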

Journal Article•DOI•
TL;DR: This work proposes a novel branch-and-bound co-occurrence feature mining algorithm that can directly mine both optimal conjunctions and disjunctions of individual features at arbitrary orders simultaneously.
Abstract: Co-occurrence features are compositions of base features that have more discriminative power than individual base features. Although they show promising performance in visual recognition applications such as object, scene, and action recognition, the discovery of optimal co-occurrence features is usually a computationally demanding task. Unlike previous feature mining methods that fix the order of the co-occurrence features or rely on a two-stage frequent pattern mining to select the optimal co-occurrence feature, we propose a novel branch-and-bound search-based co-occurrence feature mining algorithm that can directly mine both optimal conjunctions (AND) and disjunctions (OR) of individual features at arbitrary orders simultaneously. This feature mining process is integrated into the multi-class boosting framework AdaBoost.MH such that the weighted training error is minimized by the discovered co-occurrence features in each boosting step. Experiments on UCI benchmark datasets, a scene recognition dataset, and an action recognition dataset validate both the effectiveness and efficiency of our proposed method.

Proceedings Article•DOI•
05 Jan 2015
TL;DR: A flexible 3D trajectory indexing method for complex 3D motion recognition based on both point level and primitive-level descriptors that is suitable for spatial motion trajectory, which is view-invariant in 3D space.
Abstract: Motion trajectory analysis is important for human motion recognition and human-computer interaction. In this paper, we propose a flexible 3D trajectory indexing method for complex 3D motion recognition. Based on both point-level and primitive-level descriptors, trajectories are represented at the sub-primitive level, the level between the point level and the primitive level. Primitives are flexibly segmented into sub-primitives at various scales, and the sub-primitives retain more detailed information than primitives. The level of detail of the sub-primitives can be adjusted by controlling segmentation scales according to motion complexity. The proposed approach is suitable for spatial motion trajectories, which are view-invariant in 3D space. A cluster model is also proposed to represent motion classes, and motion recognition is performed based on the maximum a posteriori (MAP) criterion. Experiments on benchmark datasets validate the effectiveness of the proposed approach.

Book Chapter•DOI•
11 Mar 2015
TL;DR: An Augmented Reality solution to allow users to manipulate and inspect 3D virtual objects freely with their bare hands on wearable devices is presented and a unified framework to jointly recover the 6D palm pose and recognize the hand gesture from the depth images is proposed.
Abstract: We present an Augmented Reality solution that allows users to manipulate and inspect 3D virtual objects freely with their bare hands on wearable devices. To this end, we use a head-mounted depth camera to capture RGB-D hand images from an egocentric view, and propose a unified framework to jointly recover the 6D palm pose and recognize the hand gesture from the depth images. A random forest is utilized to regress the palm pose and classify the hand gesture simultaneously via a spatial-voting framework. With a real-world annotated training dataset, the proposed method is shown to predict the palm pose and gesture accurately. The output of the forest is used to render the 3D virtual objects, which are overlaid onto the hand region in the input RGB images with camera calibration parameters to provide seamless virtual and real scene synthesis.

Proceedings Article•DOI•
10 Dec 2015
TL;DR: The proposed method significantly improves the localized search accuracy over the baseline, which treats each frame independently, and is able to find the top 100 object trajectories in the 5.5-hour dataset within 30 seconds.
Abstract: We present an efficient approach to search for and locate all occurrences of a specific object in large video volumes, given a single query example. Locations of object occurrences are returned as spatio-temporal trajectories in the 3D video volume. Despite much work on object instance search in image datasets, these methods locate the object independently in each image, therefore do not preserve the spatio-temporal consistency in consecutive video frames. This results in sub-optimal performance if directly applied to videos, as will be shown in our experiments. We propose to locate the object jointly across video frames using spatio-temporal search. The efficiency and effectiveness of the proposed approach is demonstrated on a consumer video dataset consisting of crawled YouTube videos and mobile captured consumer clips. Our method significantly improves the localized search accuracy over the baseline, which treats each frame independently. Moreover, it is able to find the top 100 object trajectories in the 5.5-hour dataset within 30 seconds.

Proceedings Article•DOI•
01 Dec 2015
TL;DR: Experiments demonstrate that the proposed initialization method significantly reduces the iterations and related processing time required for existing online or offline algorithms to achieve the same reconstructed peak signal-to-noise ratio (PSNR), and presents a better subjective reconstruction using the same computation resources.
Abstract: In this paper, we propose a method to optimize a two-layer light field display using depth initialization. In contrast to existing work that trades off between performance and processing time, this paper first models the display principle of the layered light field display, then performs layered initialization with the known depth prior of the 3D objects, and finally optimizes the layered images for light field display. Experiments demonstrate that the proposed initialization method significantly reduces the iterations and related processing time required for existing online or offline algorithms to achieve the same reconstructed peak signal-to-noise ratio (PSNR), and presents a better subjective reconstruction using the same computation resources.

Proceedings Article•DOI•
01 Dec 2015
TL;DR: Comparisons with average and exponential filtering, as well as state-of-the-art methods, validate that the proposed adaptive exponential filtering method can effectively refine the pixel prediction maps, without using the original video again.
Abstract: We propose an efficient online video filtering method, called adaptive exponential filtering (AES) to refine pixel prediction maps. Assuming each pixel is associated with a discriminative prediction score, the proposed AES applies exponentially decreasing weights over time to smooth the prediction score of each pixel, similar to classic exponential smoothing. However, instead of fixing the spatial pixel location to perform temporal filtering, we trace each pixel in the past frames by finding the optimal path that can bring the maximum exponential smoothing score, thus performing adaptive and non-linear filtering. Thanks to the pixel tracing, AES can better address object movements and avoid over-smoothing. To enable real-time filtering, we propose a linear-complexity dynamic programming scheme that can trace all pixels simultaneously. We apply the proposed filtering method to improve both saliency detection maps and scene parsing maps. The comparisons with average and exponential filtering, as well as state-of-the-art methods, validate that our AES can effectively refine the pixel prediction maps, without using the original video again.
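
A simplified version of the per-pixel tracing can be written as a max-filter recurrence over consecutive frames, as sketched below; the recurrence and its wrap-around border handling are illustrative simplifications of the paper's formulation.

```python
import numpy as np

def adaptive_exponential_filter(scores, alpha=0.7, radius=1):
    """Sketch of AES-style filtering by per-pixel dynamic programming.

    scores : (T, H, W) per-frame prediction maps
    Each pixel's smoothed value combines the current score with the best
    accumulated score among spatially nearby pixels in the previous
    frame, weighted by alpha, so every pixel implicitly traces its best
    predecessor path instead of smoothing at a fixed location.
    """
    scores = np.asarray(scores, dtype=float)
    T, _, _ = scores.shape
    acc = scores[0].copy()
    out = np.empty_like(scores)
    out[0] = acc
    for t in range(1, T):
        # max-filter the previous accumulation over a (2r+1)^2 window
        # (np.roll wraps at borders; acceptable for a sketch)
        best = acc.copy()
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                shifted = np.roll(np.roll(acc, dy, 0), dx, 1)
                best = np.maximum(best, shifted)
        acc = (1 - alpha) * scores[t] + alpha * best
        out[t] = acc
    return out
```

The cost per frame is O(H·W·(2r+1)^2), i.e., linear in the number of pixels, consistent with the linear-complexity scheme the abstract describes.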

Journal Article•DOI•
Shizheng Wang, Mingyu Sun, Phil Surman, Junsong Yuan, Xiao Wei Sun
01 Jun 2015
TL;DR: Experiments demonstrate that the proposed method provides a relatively desirable improvement for the whole visual effect of compressive light field display, especially for the performance in the non-target display region.
Abstract: In this paper, we propose a method to extend the viewing field of a compressive light field display by optimizing the maximum viewing angle. The difference from existing work is the improvement in the overall visual effect of the compressive light field display rather than only extending the viewing field region. Sliding-window scanning and viewer detection are also used for determining the target region and optimizing the display. Experiments demonstrate that the proposed method provides a relatively desirable improvement in the whole visual effect of the compressive light field display, especially for the performance in the non-target display region.
Keywords: light field; compressive display; glass-free 3D; face detection.

Proceedings Article•DOI•
13 Oct 2015
TL;DR: A graph-based optimization framework is proposed to leverage category-independent object proposals (candidate object regions) for logo search in a large-scale image database, together with an efficient feature descriptor, EdgeBoW, which can yield promising results, especially for object categories primarily defined by their shape.
Abstract: We propose a graph-based optimization framework to leverage category-independent object proposals (candidate object regions) for logo search in a large-scale image database. The proposed contour-based feature descriptor, EdgeBoW, is robust to view-angle changes and varying illumination conditions, and can implicitly capture significant object shape information. Being equipped with a local descriptor, it can handle a fair amount of the occlusion and deformation frequently present in real-life scenarios. Given a small set of initially retrieved candidate object proposals, a fast graph-based short-listing scheme is designed to exploit the mutual similarities among these proposals to eliminate outliers. In contrast to a coarse image-level pairwise similarity measure, this search focused on a few specific image regions provides a more accurate method for matching. The proposed query expansion strategy assesses each of the remaining better-matched proposals against all its neighbors within the same image for precise localization. Combined with the efficient feature descriptor EdgeBoW, a set of more insightful edge weights and node-utility measures can yield promising results, especially for object categories primarily defined by their shape. An extensive set of experiments performed on a number of benchmark datasets demonstrates its effectiveness and superior generalization ability on both clutter-intensive real-life images and poor-quality binary document images.

Book Chapter•DOI•
01 Jan 2015
TL;DR: This chapter proposes a very fast action retrieval system which can effectively locate the subvolumes similar to the query video and proposes a coarse-to-fine subvolume search scheme, which results in a dramatic speedup over the existing video branch-and-bound method.
Abstract: Action search is an interesting problem for human action analysis, with many potential applications in industry. In this chapter, we propose a very fast action retrieval system that can effectively locate the subvolumes similar to a query video. Random-indexing-trees-based visual vocabularies are introduced for database indexing. By increasing the number of vocabularies, the large intra-class variance problem can be relieved even with only one query sample available. In addition, we use a mutual-information-based formulation, which makes it easy to leverage feedback from the user. A coarse-to-fine subvolume search scheme is also proposed, which results in a dramatic speedup over the existing video branch-and-bound method. Cross-dataset experiments demonstrate that our proposed method is not only fast when searching higher-resolution videos, but also robust to action variations, partial occlusions, and cluttered and dynamic backgrounds. Beyond its superior performance, our system is fast enough for online applications; for example, we can finish an action search in 24 s on a 1 h database and in 37 s on a 5 h database.

Proceedings Article•DOI•
01 Dec 2015
TL;DR: A demo system that realistically displays the glasses-free light field 3D effect with a triple-layer structure by combining multi-layer panels, high refresh rates, and directional backlighting together with a thin form factor is presented.
Abstract: This paper presents a demo system that realistically displays the glasses-free light field 3D effect with a triple-layer structure. By combining multi-layer panels, high refresh rates, and directional backlighting together, we achieve a wide field of view and large depth of field with a thin form factor. Additionally, using some off-the-shelf hardware, this system demonstrates an interesting light field display.

Book Chapter•DOI•
01 Jan 2015
TL;DR: The superior performance on benchmark datasets validates that propagative Hough voting can outperform state-of-the-art techniques in various action analysis tasks, such as action search and recognition.
Abstract: Generalized Hough voting has shown promising results in both object and action detection. However, most existing Hough voting methods suffer when insufficient training data are provided. To address this limitation, we propose propagative Hough voting in this chapter. Instead of training a discriminative classifier for local feature voting, we first match labeled feature points to unlabeled feature points, then propagate the label and spatio-temporal configuration information via Hough voting. To enable fast and robust matching, we index the unlabeled data using random projection trees (RPT). RPT can leverage the low-dimensional manifold structure to provide adaptive local feature matching. Moreover, as the RPT index can be built on either labeled or unlabeled datasets, it can be applied to different tasks such as action search (limited training) and recognition (sufficient training). The superior performance on benchmark datasets validates that our propagative Hough voting can outperform state-of-the-art techniques in various action analysis tasks, such as action search and recognition.

Book Chapter•DOI•
01 Jan 2015
TL;DR: This chapter develops a spatial-temporal implicit shape model (STISM), which characterizes the space-time structure of the sparse local features extracted from a video, and proposes a new random forest structure, called multiclass balanced random forest, which makes a good trade-off between the balance of the trees and the discriminative abilities.
Abstract: Early recognition and prediction of human activities are of great importance in video surveillance. In this chapter, we target this problem by developing a spatial-temporal implicit shape model (STISM), which characterizes the space-time structure of the sparse local features extracted from a video. The recognition of human activities is accomplished by pattern matching through STISM. To enable efficient and robust matching, we propose a new random forest structure, called multiclass balanced random forest, which makes a good trade-off between the balance of the trees and the discriminative abilities. The prediction is done simultaneously for multiple classes, which saves both the memory and computational cost. The experiments show that our algorithm significantly outperforms the state-of-the-art for the human activity prediction problem.