
Showing papers by "Junsong Yuan published in 2016"


Proceedings ArticleDOI
27 Jun 2016
TL;DR: This work proposes to first project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress 2D heat-maps that estimate the joint positions on each plane; the heat-maps are then fused to produce the final 3D hand pose estimate with learned pose priors.
Abstract: Articulated hand pose estimation plays an important role in human-computer interaction. Despite the recent progress, the accuracy of existing methods is still not satisfactory, partially due to the difficulty of the embedded high-dimensional and non-linear regression problem. Different from the existing discriminative methods that regress for the hand pose with a single depth image, we propose to first project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress for 2D heat-maps which estimate the joint positions on each plane. These multi-view heat-maps are then fused to produce the final 3D hand pose estimation with learned pose priors. Experiments show that the proposed method largely outperforms the state of the art on a challenging dataset. Moreover, a cross-dataset experiment also demonstrates the good generalization ability of the proposed method.

266 citations
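As a rough illustration of the projection stage described in the abstract (not the authors' code), the sketch below back-projects a depth image to a 3D point cloud and renders it onto the three orthogonal planes as binary occupancy maps. The camera intrinsics and the map resolution are assumed placeholder values.

    import numpy as np

    def multiview_projections(depth, fx=588.0, fy=587.0, cx=320.0, cy=240.0, res=96):
        """Project a depth image (metres) onto three orthogonal planes."""
        v, u = np.nonzero(depth)                      # pixels with valid depth
        z = depth[v, u]
        x = (u - cx) * z / fx                         # back-project to camera space
        y = (v - cy) * z / fy
        pts = np.stack([x, y, z], axis=1)
        pts -= pts.mean(axis=0)                       # centre the hand point cloud
        views = []
        for a, b in [(0, 1), (1, 2), (2, 0)]:         # x-y, y-z, z-x planes
            hist, _, _ = np.histogram2d(pts[:, a], pts[:, b], bins=res)
            views.append((hist > 0).astype(np.float32))  # binary occupancy map
        return views

In the paper these three views feed CNNs that regress per-joint 2D heat-maps, which are then fused with learned pose priors.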


Proceedings ArticleDOI
01 Jun 2016
TL;DR: Comparisons with existing representative selection approaches such as K-medoid, sparse dictionary selection and density-based selection validate that the new formulation can better capture the key video objects despite appearance variations, cluttered backgrounds and camera motions.
Abstract: We propose to summarize a video into a few key objects by selecting representative object proposals generated from video frames. This representative selection problem is formulated as a sparse dictionary selection problem, i.e., choosing a few representative object proposals to reconstruct the whole proposal pool. Compared with existing sparse dictionary selection based representative selection methods, our new formulation can incorporate object proposal priors and a locality prior in the feature space when selecting representatives. Consequently, it can better locate key objects and suppress outlier proposals. We convert the optimization problem into a proximal gradient problem and solve it by the fast iterative shrinkage thresholding algorithm (FISTA). Experiments on synthetic data and real benchmark datasets show promising results of our key object summarization approach in video content mining and search. Comparisons with existing representative selection approaches such as K-medoid, sparse dictionary selection and density-based selection validate that our formulation can better capture the key video objects despite appearance variations, cluttered backgrounds and camera motions.

122 citations
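The formulation above can be sketched as row-sparse dictionary selection solved by FISTA: columns of X are proposal features, and a row-sparsity penalty forces a few proposals to reconstruct the whole pool. The snippet below is a minimal illustrative version; the lambda value, iteration count, and ranking by row norms are assumptions, not the paper's exact settings (in particular, the proposal and locality priors are omitted).

    import numpy as np

    def select_representatives(X, lam=1.0, n_iter=200):
        """X: (d, n) matrix whose columns are object-proposal features.
        Solve min_C 0.5*||X - XC||_F^2 + lam * sum_i ||C[i, :]||_2 by FISTA."""
        n = X.shape[1]
        G = X.T @ X
        L = np.linalg.eigvalsh(G)[-1]             # Lipschitz constant of the gradient
        C = np.zeros((n, n)); Y = C.copy(); t = 1.0
        for _ in range(n_iter):
            grad = G @ Y - G                      # gradient of the smooth fitting term
            Z = Y - grad / L
            norms = np.linalg.norm(Z, axis=1, keepdims=True)
            scale = np.maximum(0.0, 1.0 - (lam / L) / np.maximum(norms, 1e-12))
            C_new = Z * scale                     # row-wise group soft-thresholding
            t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
            Y = C_new + ((t - 1) / t_new) * (C_new - C)   # FISTA momentum step
            C, t = C_new, t_new
        return np.argsort(-np.linalg.norm(C, axis=1))     # proposals ranked by row norm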


Journal ArticleDOI
TL;DR: Experiments on five benchmark datasets with eight saliency extraction methods show that the proposed saliency co-fusion-based approach achieves competitive performance even without parameter fine-tuning when compared with the state-of-the-art methods.
Abstract: Most existing high-performance co-segmentation algorithms are usually complex due to the way of co-labeling a set of images, as well as the common need to fine-tune a few parameters for effective co-segmentation. In this paper, instead of following the conventional way of co-labeling multiple images, we propose to first exploit inter-image information through co-saliency, and then perform single-image segmentation on each individual image. To make the system robust and to avoid heavy dependence on a single saliency extraction method, we propose to apply multiple existing saliency extraction methods to each image to obtain diverse saliency maps. Our major contribution lies in the proposed method that fuses the obtained diverse saliency maps by exploiting the inter-image information, which we call saliency co-fusion. Experiments on five benchmark datasets with eight saliency extraction methods show that our saliency co-fusion-based approach achieves competitive performance even without parameter fine-tuning when compared with the state-of-the-art methods.

117 citations


Posted Content
TL;DR: In this paper, the authors project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress for 2D heat-maps which estimate the joint positions on each plane.
Abstract: Articulated hand pose estimation plays an important role in human-computer interaction. Despite the recent progress, the accuracy of existing methods is still not satisfactory, partially due to the difficulty of the embedded high-dimensional and non-linear regression problem. Different from the existing discriminative methods that regress for the hand pose with a single depth image, we propose to first project the query depth image onto three orthogonal planes and utilize these multi-view projections to regress for 2D heat-maps which estimate the joint positions on each plane. These multi-view heat-maps are then fused to produce the final 3D hand pose estimation with learned pose priors. Experiments show that the proposed method largely outperforms the state of the art on a challenging dataset. Moreover, a cross-dataset experiment also demonstrates the good generalization ability of the proposed method.

62 citations


Journal ArticleDOI
TL;DR: A novel multi-scale descriptor for representing shape contours is proposed that is invariant to rotation, scale variation, intra-class variation, articulated deformation and partial occlusion, and is robust to noise as well.

59 citations


Book ChapterDOI
08 Oct 2016
TL;DR: This work proposes to leverage co-saliency activated tracklets to address the challenge of jointly localizing common objects across videos and shows that the proposed method outperforms state-of-the-art methods.
Abstract: Video co-localization is the task of jointly localizing common objects across videos. Due to appearance variations both across and within the videos, it is a challenging problem to identify and track the common objects without any supervision. In contrast to previous joint frameworks that use bounding box proposals to attack the problem, we propose to leverage co-saliency activated tracklets to address the challenge. To identify the common visual object, we first explore inter-video commonness, intra-video commonness, and motion saliency to generate the co-saliency maps. Object proposals with high objectness and co-saliency scores are tracked across short video intervals to build tracklets. The best tube for a video is obtained through tracklet selection from these intervals based on tracklet confidence and the smoothness between adjacent tracklets, with the help of dynamic programming. Experimental results on the benchmark YouTube Object dataset show that the proposed method outperforms state-of-the-art methods.

53 citations
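The tracklet selection step lends itself to a standard Viterbi-style dynamic program. Below is a minimal sketch; the array layouts are assumptions, and the paper's actual confidence and smoothness definitions are not reproduced here.

    import numpy as np

    def select_tube(confidence, smooth):
        """confidence: list of arrays, confidence[t][i] = score of tracklet i in
        interval t. smooth: list of matrices, smooth[t][i, j] = consistency between
        tracklet i in interval t and tracklet j in interval t+1.
        Returns the best tracklet index per interval."""
        T = len(confidence)
        dp = [confidence[0]]
        back = []
        for t in range(1, T):
            scores = dp[-1][:, None] + smooth[t - 1] + confidence[t][None, :]
            back.append(np.argmax(scores, axis=0))   # best predecessor per tracklet
            dp.append(np.max(scores, axis=0))
        path = [int(np.argmax(dp[-1]))]
        for t in range(T - 2, -1, -1):               # trace the best tube backwards
            path.append(int(back[t][path[-1]]))
        return path[::-1]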


Proceedings ArticleDOI
01 Nov 2016
TL;DR: A general profit metric is defined that naturally combines the benefit of influence spread with the cost of seed selection in viral marketing, eliminating the need to preset the budget for seed selection; new seed selection algorithms for profit maximization with strong approximation guarantees are developed.
Abstract: Information can be disseminated widely and rapidly through Online Social Networks (OSNs) with “word-of-mouth” effects. Viral marketing is one such typical application, in which new products or commercial activities are advertised by some seed users in OSNs to other users in a cascading manner. The budget allocation for seed selection reflects a tradeoff between the expense and reward of viral marketing. In this paper, we define a general profit metric that naturally combines the benefit of influence spread with the cost of seed selection in viral marketing to eliminate the need for presetting the budget for seed selection. We carry out a comprehensive study on finding a set of seed nodes to maximize the profit of viral marketing. We show that the profit metric is significantly different from the influence metric in that it is no longer monotone. As a result, from the computability perspective, the problem of profit maximization is much more challenging than that of influence maximization. We develop new seed selection algorithms for profit maximization with strong approximation guarantees. Experimental evaluations with real OSN datasets demonstrate the effectiveness of our algorithms.

51 citations
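A bare-bones greedy version of profit-driven seed selection is easy to state. This is only a sketch: the paper's algorithms come with approximation guarantees that plain greedy on a non-monotone objective does not provide, and influence(S) stands for an external spread estimator (e.g. a Monte Carlo simulation), which is an assumption here.

    def greedy_profit(nodes, influence, cost, benefit=1.0):
        """nodes: set of candidate seeds; influence(S): expected spread of seed set S;
        cost: dict node -> seeding cost. Greedily add seeds while the marginal
        profit gain stays positive; no preset budget is needed."""
        S, spread = set(), 0.0
        while True:
            best, best_gain = None, 0.0
            for v in nodes - S:
                gain = benefit * (influence(S | {v}) - spread) - cost[v]
                if gain > best_gain:
                    best, best_gain = v, gain
            if best is None:                 # no positive marginal profit left
                return S
            S.add(best)
            spread = influence(S)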


Journal ArticleDOI
TL;DR: This paper proposes a new method for detecting primary objects in unconstrained videos in a completely automatic setting that integrates the local visual/motion saliency extracted from each frame, global appearance consistency throughout the video, and a spatiotemporal smoothness constraint on object trajectories.
Abstract: In this paper, we propose a new method for detecting primary objects in unconstrained videos in a completely automatic setting. Here, we define the primary object in a video as the object that appears saliently in most of the frames. Unlike previous works considering only local saliency detection or common pattern discovery, the proposed method integrates the local visual/motion saliency extracted from each frame, global appearance consistency throughout the video, and a spatiotemporal smoothness constraint on object trajectories. We first identify a temporally coherent salient region throughout the whole video, and then explicitly learn a global appearance model to distinguish the primary object against the background. In order to obtain high-quality saliency estimations from both appearance and motion cues, we propose a novel self-adaptive saliency map fusion method by learning the reliability of saliency maps from labeled data. As a whole, our method can robustly localize and track primary objects in diverse video content, and handle challenges such as fast object and camera motion, large scale and appearance variation, background clutter, and pose deformation. Moreover, compared with some existing approaches that assume the object is present in all the frames, our approach can naturally handle the case where the object is present only in part of the frames, e.g., the object enters the scene in the middle of the video or leaves the scene before the video ends. We also propose a new video dataset containing 51 videos for primary object detection with per-frame ground-truth labeling. Quantitative experiments on several challenging video datasets demonstrate the superiority of our method compared with recent state-of-the-art methods.

39 citations
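For illustration, the self-adaptive fusion can be approximated by learning non-negative reliability weights for the saliency maps from labeled frames with non-negative least squares. The least-squares objective is a stand-in here, not the paper's actual learning method.

    import numpy as np
    from scipy.optimize import nnls

    def learn_fusion_weights(maps, gt):
        """maps: (m, n) matrix, each row a flattened saliency map from one cue/method;
        gt: (n,) flattened binary ground-truth mask for the same frame(s).
        Fit non-negative weights so the fused map approximates the ground truth."""
        w, _ = nnls(maps.T, gt.astype(float))
        return w / max(w.sum(), 1e-12)       # normalised per-cue reliability weights

    # fused = learn_fusion_weights(M, y) @ M, reshaped back to the image size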


Journal ArticleDOI
TL;DR: This work proposes a novel and efficient appearance modeling technique for automatic primary video object segmentation in the Markov random field (MRF) framework that embeds the appearance constraint as auxiliary nodes and edges in the MRF structure, and can optimize both the segmentation and appearance model parameters simultaneously in one graph cut.
Abstract: Automatic segmentation of the primary object in a video clip is a challenging problem, as there is no prior knowledge of the primary object. Most existing techniques thus adopt an iterative approach for foreground and background appearance modeling, i.e., fix the appearance model while optimizing the segmentation, and fix the segmentation while optimizing the appearance model. However, these approaches may rely on good initialization and can easily be trapped in local optima. In addition, they are usually time-consuming for analyzing videos. To address these limitations, we propose a novel and efficient appearance modeling technique for automatic primary video object segmentation in the Markov random field (MRF) framework. It embeds the appearance constraint as auxiliary nodes and edges in the MRF structure, and can optimize both the segmentation and the appearance model parameters simultaneously in one graph cut. Extensive experimental evaluations validate the superiority of the proposed approach over state-of-the-art methods, in both efficiency and effectiveness.

38 citations
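For intuition only, here is a toy max-flow construction in the spirit of embedding an appearance constraint as an auxiliary node in the MRF. The exact node and edge design of the paper is not reproduced; the 'aux' node, the capacities, and the data layout are all assumptions.

    import networkx as nx

    def segment_with_auxiliary_node(unary_fg, unary_bg, pair, aux_cap):
        """unary_fg/unary_bg: dict pixel -> cost of labelling it fg/bg;
        pair: dict (p, q) -> smoothness weight for neighbouring pixels;
        aux_cap: dict pixel -> strength of the tie to the auxiliary appearance node.
        One min-cut jointly decides pixel labels and the auxiliary node's side."""
        G = nx.DiGraph()
        for p in unary_fg:
            G.add_edge('s', p, capacity=unary_bg[p])   # cut if p is labelled bg
            G.add_edge(p, 't', capacity=unary_fg[p])   # cut if p is labelled fg
            G.add_edge('aux', p, capacity=aux_cap[p])  # appearance-agreement ties
            G.add_edge(p, 'aux', capacity=aux_cap[p])
        for (p, q), w in pair.items():
            G.add_edge(p, q, capacity=w)
            G.add_edge(q, p, capacity=w)
        _, (src_side, _) = nx.minimum_cut(G, 's', 't')
        return {p for p in unary_fg if p in src_side}  # pixels labelled foreground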


Journal ArticleDOI
TL;DR: The use of spatio-temporal cues to improve the quality of object instance search from videos is explored, and the key bottleneck in applying this approach is solved by leveraging a randomized approach to enable fast scoring of any bounding box in the video volume.
Abstract: Given a specific object as query, object instance search aims to not only retrieve the images or frames that contain the query, but also locate all its occurrences. In this work, we explore the use of spatio-temporal cues to improve the quality of object instance search from videos. To this end, we formulate this problem as the spatio-temporal trajectory search problem, where a trajectory is a sequence of bounding boxes that locate the object instance in each frame. The goal is to find the top-K trajectories that are likely to contain the target object. Despite the large number of trajectory candidates, we build on a recent spatio-temporal search algorithm for event detection to efficiently find the optimal spatio-temporal trajectories in large video volumes, with complexity linear in the video volume size. We solve the key bottleneck in applying this approach to object instance search by leveraging a randomized approach to enable fast scoring of any bounding box in the video volume. In addition, we present a new dataset for video object instance search. Experimental results on a 73-hour video dataset demonstrate that our approach improves the performance of video object instance search and localization over the state-of-the-art search and tracking methods.

36 citations
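The fast box scoring that makes such a trajectory search tractable typically rests on summed-area tables: once per-pixel scores are integrated, any bounding box can be scored with four lookups. A minimal sketch (the per-pixel score map itself is assumed given):

    import numpy as np

    def integral_image(score_map):
        """Summed-area table: any box score becomes four lookups, O(1) per box."""
        return np.pad(score_map, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    def box_score(ii, x1, y1, x2, y2):
        """Sum of per-pixel scores inside the half-open box [x1, x2) x [y1, y2)."""
        return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]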


Proceedings ArticleDOI
27 Feb 2016
TL;DR: Preliminary results show that the efficient data-driven approach to track fingertips and detect finger tapping for a virtual piano using an RGB-D camera can recognize most of a beginner's piano-playing gestures in real time for soothing rhythms.
Abstract: This paper presents an efficient data-driven approach to track fingertips and detect finger tapping for a virtual piano using an RGB-D camera. We collect 7200 depth images covering the most common finger articulations for playing piano, and train a random regression forest using depth context features of randomly sampled pixels in training images. In the online tracking stage, we first segment the hand from the plane in contact by fusing information from both color and depth images. Then we use the trained random forest to estimate the 3D positions of the fingertips and wrist in each frame, and predict finger tapping based on the estimated fingertip motion. Finally, we build a kinematic chain and recover the articulation parameters for each finger. In contrast to existing hand tracking algorithms that often require the hands to be in the air, unable to interact with physical objects, our method is designed for hand interaction with planar objects, which is desired for the virtual piano application. Using our prototype system, users can put their hands on a desk, move them sideways and then tap fingers on the desk, like playing a real piano. Preliminary results show that our method can recognize most of a beginner's piano-playing gestures in real time for soothing rhythms.
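A simplified stand-in for the per-pixel regression stage, using scikit-learn's random forest on depth-normalised offset-difference features. The offsets, the normalisation, and the forest size are assumptions; the paper's depth context features are only approximated here.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def depth_context_features(depth, pixels, offsets):
        """For each sampled pixel, compare depths at pairs of offsets scaled by the
        centre depth (a common depth-invariance trick; the offset set is assumed)."""
        h, w = depth.shape
        feats = []
        for (v, u) in pixels:
            d = max(depth[v, u], 1e-3)
            row = []
            for (dv1, du1), (dv2, du2) in offsets:
                p1 = depth[np.clip(v + int(dv1 / d), 0, h - 1),
                           np.clip(u + int(du1 / d), 0, w - 1)]
                p2 = depth[np.clip(v + int(dv2 / d), 0, h - 1),
                           np.clip(u + int(du2 / d), 0, w - 1)]
                row.append(p1 - p2)
            feats.append(row)
        return np.asarray(feats)

    # forest = RandomForestRegressor(n_estimators=20).fit(F_train, targets_train)
    # where targets_train holds 3D fingertip/wrist positions (or offsets) per pixel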

Book ChapterDOI
20 Nov 2016
TL;DR: This work categorizes occlusions into K groups based on how pedestrian examples are occluded, and adopts an L1-norm linear support vector machine (SVM) to select and fuse occlusion-specific detectors for the K classifiers simultaneously.
Abstract: It is a challenging problem to detect partially occluded pedestrians due to the diversity of occlusion patterns. Although training occlusion-specific detectors can help handle various partial occlusions, it is a non-trivial problem to integrate these detectors properly. A direct combination of all occlusion-specific detectors can be affected by unreliable detectors, and usually does not favor heavily occluded pedestrian examples, which can only be recognized by a few detectors. Instead of combining all occlusion-specific detectors into a generic detector for all occlusions, we categorize occlusions into K groups based on how pedestrian examples are occluded. Each occlusion group selects its own occlusion-specific detectors and fuses them linearly to obtain a classifier. An L1-norm linear support vector machine (SVM) is adopted to select and fuse occlusion-specific detectors for the K classifiers simultaneously. Thanks to the L1-norm linear SVM, unreliable and irrelevant detectors are removed for each group. Experiments on the Caltech dataset show promising performance of our approach for detecting heavily occluded pedestrians.
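The fusion step maps directly onto an L1-regularised linear SVM over detector responses; a minimal per-group sketch (variable names and the C value are assumptions):

    from sklearn.svm import LinearSVC

    def fuse_detectors(scores, labels, C=0.1):
        """scores: (n_samples, n_detectors) responses of all occlusion-specific
        detectors on training crops of one occlusion group; labels: 1 = pedestrian,
        0 = background. L1 regularisation zeroes out unreliable detectors."""
        svm = LinearSVC(penalty='l1', dual=False, C=C).fit(scores, labels)
        kept = svm.coef_[0] != 0          # detectors selected for this group
        # fused score at test time: svm.decision_function(new_scores)
        return svm, kept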

Proceedings ArticleDOI
01 Oct 2016
TL;DR: The proposed query-adaptive sketch-based object search is formulated as a sub-graph selection problem, which can be solved by a maximum-flow algorithm, and can accurately locate small target objects in cluttered backgrounds or densely drawn, deformation-intensive cartoon images.
Abstract: Sketch-based object search is a challenging problem mainly due to two difficulties: (1) how to match the binary sketch query with the colorful image, and (2) how to locate a small object in a big image with the sketch query. To address the above challenges, we propose to leverage object proposals for object search and localization. However, instead of purely relying on sketch features, e.g., Sketch-a-Net, to locate the candidate object proposals, we propose to fully utilize the appearance information to resolve the ambiguities among object proposals and refine the search results. Our proposed query-adaptive search is formulated as a sub-graph selection problem, which can be solved by a maximum-flow algorithm. By performing query expansion using a smaller set of more salient matches as the query representatives, it can accurately locate small target objects in cluttered backgrounds or densely drawn, deformation-intensive cartoon (Manga-like) images. Our query-adaptive sketch-based object search on benchmark datasets exhibits superior performance when compared with existing methods, which validates the advantages of utilizing both shape and appearance features for sketch-based search.

Journal ArticleDOI
TL;DR: Adobe Boxes is proposed to efficiently locate potential objects with fewer proposals by searching for object adobes, the salient object parts that are easy to perceive; it generally outperforms the state-of-the-art methods, especially with a relatively small number of proposals.
Abstract: Despite previous efforts on object proposals, the detection rates of the existing approaches are still not satisfactory. To address this, we propose Adobe Boxes to efficiently locate potential objects with fewer proposals, by searching for object adobes, the salient object parts that are easy to perceive. Because of the visual difference between an object and its surroundings, an object adobe obtained from a local region has a high probability of being part of an object, and is thus capable of conveying the locative information of the proto-object. Our approach comprises three main procedures. First, coarse object proposals are acquired by employing randomly sampled windows. Then, based on local-contrast analysis, the object adobes are identified within the enlarged bounding boxes that correspond to the coarse proposals. The final object proposals are obtained by converging the bounding boxes to tightly surround the object adobes. Meanwhile, our object adobes can also refine the detection rate of most state-of-the-art methods as a refinement approach. Extensive experiments on four challenging datasets (PASCAL VOC2007, VOC2010, VOC2012, and ILSVRC2014) demonstrate that the detection rate of our approach generally outperforms the state-of-the-art methods, especially with a relatively small number of proposals. The average time consumed on one image is about 48 ms, which nearly meets the real-time requirement.

Journal ArticleDOI
TL;DR: The proposed hierarchical parsing method for free-form 3D motion trajectory representation is view-invariant in 3D space and is robust to variations of scale and temporal speed as well as partial occlusion.

Journal ArticleDOI
TL;DR: Experiments performed on a number of other benchmark datasets show the powerful and superior generalization ability of this single integrated framework in dealing with both clutter-intensive real-life images and poor-quality binary document images at equal dexterity.
Abstract: While there has been a significant amount of work on object search and image retrieval, the focus has primarily been on establishing effective models for whole images, scenes, and objects occupying a large portion of an image. In this paper, we propose to leverage object proposals to identify small and smooth-structured objects in a large image database. Unlike popular methods exploring a coarse image-level pairwise similarity, the search is designed to exploit similarity measures at the proposal level. An effective graph-based query expansion strategy is designed to assess each of these better-matched proposals against all its neighbors within the same image for a precise localization. Combined with EdgeBoW, a shape-aware feature descriptor, and a set of more insightful edge weights and node-utility measures, the proposed search strategy can handle varying view angles, illumination conditions, deformation, and occlusion efficiently. Experiments performed on a number of benchmark datasets show the powerful and superior generalization ability of this single integrated framework in dealing with both clutter-intensive real-life images and poor-quality binary document images with equal dexterity.

Proceedings ArticleDOI
08 Dec 2016
TL;DR: A novel light field image compression system is proposed, built on an optimized linear prediction design based on L1 minimization of the residuals; K-means clustering is employed on training data to determine the optimized set of predictors.
Abstract: The advent of consumer-level plenoptic cameras has sparked interest in the design of efficient compression techniques for light field images. State-of-the-art compression systems such as HEVC prove to be inefficient when directly applied to this type of data due to the inherent spatial discontinuities among neighboring microlens images. In this paper, a novel light field image compression system is proposed. The disk-shaped pixel clusters corresponding to each microlens in the light field image are efficiently predicted based on the neighboring disks. In this context, an optimized linear prediction design based on L1 minimization of the residuals is proposed. K-means clustering is employed on training data in order to determine the optimized set of predictors. The experimental results on an extensive set of light field images demonstrate that the proposed coding scheme yields an average of 2.93 dB and 3.22 dB gain in PSNR, and 52.67% and 57.27% average rate savings, compared to HEVC and JPEG2000 respectively.
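Two pieces of this pipeline are standard enough to sketch: fitting an L1-optimal linear predictor via its linear-programming reformulation, and K-means for grouping training examples into predictor clusters. The cluster count and the data layout are assumptions, not the paper's settings.

    import numpy as np
    from scipy.optimize import linprog
    from sklearn.cluster import KMeans

    def l1_predictor(X, y):
        """Fit w minimising ||Xw - y||_1 via the standard LP reformulation:
        introduce residual bounds t and minimise sum(t) s.t. -t <= Xw - y <= t.
        X: (n, p) neighbouring-disk samples, y: (n,) target-disk samples."""
        n, p = X.shape
        c = np.concatenate([np.zeros(p), np.ones(n)])
        A = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
        b = np.concatenate([y, -y])
        bounds = [(None, None)] * p + [(0, None)] * n
        res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method='highs')
        return res.x[:p]

    # K-means over training contexts assigns each microlens disk to one of the
    # optimised predictors (cluster count assumed):
    # cluster_ids = KMeans(n_clusters=8).fit_predict(training_contexts)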

Journal ArticleDOI
TL;DR: This work proposes three customized virtual models, which are the USAF-E model, the view angle model, and the concave/convex object model, for accurate measurement of spatial resolution, viewing angle, and depth resolution for MLLFDs.
Abstract: Multi-layer light field displays (MLLFDs) are a promising computational display type that can not only display hologram-like 3D content but is also well compatible with normal 2D applications. However, quality-of-experience measurement for MLLFDs remains an important yet challenging issue. Despite existing research works on MLLFDs, most of them only provide quality-of-experience results with qualitative evaluation, for example, software simulation of a few 3D images/videos, rather than rigorous quantitative evaluation. This work aims at building a unified software and hardware measurement platform for different MLLFD methods, and at comprehensively measuring both objective and subjective performance based on virtual object models. To the best of our knowledge, it is the first time that such performance has been measured for MLLFDs. In addition to using the existing disclosed virtual object sequences, this paper further proposes three customized virtual models, namely the USAF-E model, the view angle model, and the concave/convex object model, for accurate measurement of spatial resolution, viewing angle, and depth resolution. A toolbox for MLLFD measurement with the proposed models is also released in this paper. The experimental results demonstrate that our proposed measurement method, models, and toolbox can effectively measure MLLFDs in different configurations.

Proceedings Article
09 Jul 2016
TL;DR: A novel approach called Minimal Reconstruction Bias Hashing (MRH) is proposed, formulated as a problem of minimizing the reconstruction bias of compressed signals; it can adaptively adjust the projection dimensionality to balance the information loss between projection and quantization.
Abstract: We present a novel approach called Minimal Reconstruction Bias Hashing (MRH) to learn similarity preserving binary codes that jointly optimize both projection and quantization stages. Our work tackles an important problem of how to elegantly connect optimizing projection with optimizing quantization, and to maximize the complementary effects of two stages. Distinct from previous works, MRH can adaptively adjust the projection dimensionality to balance the information loss between projection and quantization. It is formulated as a problem of minimizing reconstruction bias of compressed signals. Extensive experiment results have shown the proposed MRH significantly outperforms a variety of state-of-the-art methods over several widely used benchmarks.
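To make the projection/quantization trade-off concrete, the toy sketch below measures the reconstruction error of sign-quantized PCA codes as a function of the projection dimensionality; sweeping k and picking the minimum is a crude analogue of MRH's adaptive dimensionality, not the paper's actual optimisation.

    import numpy as np

    def reconstruction_bias(X, k):
        """Project X (n, d) to k dims by PCA, binarise, and measure how well the
        sign codes reconstruct the centred input."""
        Xc = X - X.mean(0)
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        Z = Xc @ Vt[:k].T                       # projection stage (k of d dims)
        B = np.sign(Z)                          # quantization stage
        scale = (Z * B).sum() / max((B * B).sum(), 1e-12)  # least-squares scale
        recon = (scale * B) @ Vt[:k]            # back-project the binary codes
        return np.linalg.norm(Xc - recon) ** 2 / len(X)

    # bias = {k: reconstruction_bias(X, k) for k in (8, 16, 32, 64)}
    # small k loses projection information, large k loses quantization fidelity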

Journal ArticleDOI
TL;DR: Experimental results show that the proposed action states discovery method not only outperforms the state-of-the-art methods, but can also make predictions on an ongoing video in real time.
Abstract: In this paper, we provide an approach for online human action recognition, where the videos are represented by frame-level descriptors. To address the large intra-class variations of frame-level descriptors, we propose an action states discovery method that discovers the different distributions of frame-level descriptors while training a classifier. A positive sample set is treated as multiple clusters called action states. The action states model can be effectively learned by clustering the positive samples and optimizing the decision boundary of each state simultaneously. Experimental results show that our method not only outperforms the state-of-the-art methods, but can also make predictions on an ongoing video in real time.
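The action-states idea can be sketched by clustering positive frame descriptors and training one boundary per state; the paper's joint clustering/boundary optimisation is replaced here by a single clustering pass, and k and C are assumed values.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def train_action_states(pos, neg, k=4):
        """pos, neg: (n, d) arrays of frame-level descriptors for positive and
        negative frames. Cluster positives into k 'action states' and train one
        linear boundary per state against the negatives."""
        states = KMeans(n_clusters=k, n_init=10).fit(pos)
        clfs = []
        for s in range(k):
            X = np.vstack([pos[states.labels_ == s], neg])
            y = np.r_[np.ones((states.labels_ == s).sum()), np.zeros(len(neg))]
            clfs.append(LinearSVC(C=1.0).fit(X, y))
        return clfs

    def score_frame(clfs, x):
        """Online scoring: the best-matching state decides the frame's response."""
        return max(c.decision_function([x])[0] for c in clfs)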

Journal ArticleDOI
TL;DR: Inspired by the observation that salient video objects usually appear in consecutive frames, the motion coherence of videos is leveraged in the path discovery to make the salient object detection more robust.

Book ChapterDOI
20 Nov 2016
TL;DR: The experimental results demonstrate that suppressing unreliable leaf nodes not only improves prediction accuracy, but also reduces both the prediction time cost and the model complexity of the random forest.
Abstract: Random forest based Hough-voting techniques have been widely used in a variety of computer vision problems. As an ensemble learning method, the voting weights of the leaf nodes in a random forest play a critical role in generating reliable estimation results. We propose to improve Hough-voting with random forests by simultaneously optimizing the weights of the leaf votes and pruning unreliable leaf nodes in the forest. After constructing the random forest, the weight assignment problem at each tree is formulated as an L0-regularized optimization problem, where unreliable leaf nodes with zero voting weights are suppressed and trees are pruned to ignore sub-trees that contain only suppressed leaves. We apply our proposed techniques to several regression and classification problems such as hand gesture recognition, head pose estimation and articulated pose estimation. The experimental results demonstrate that suppressing unreliable leaf nodes not only improves prediction accuracy, but also reduces both the prediction time cost and the model complexity of the random forest.
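The L0-regularised weight assignment can be illustrated with iterative hard thresholding over a leaf-vote design matrix. This is a generic stand-in for the paper's per-tree optimisation; k, the step size, and the matrix layout are assumptions.

    import numpy as np

    def prune_leaf_votes(V, y, k, n_iter=100):
        """V[i, j] = vote of leaf j for validation sample i (0 if the sample never
        reaches that leaf); y = ground truth. Iterative hard thresholding keeps at
        most k non-zero leaf weights; zero-weight leaves can be pruned away."""
        w = np.zeros(V.shape[1])
        step = 1.0 / np.linalg.norm(V, 2) ** 2       # 1 / Lipschitz constant
        for _ in range(n_iter):
            w = w - step * V.T @ (V @ w - y)         # gradient step on the fit term
            idx = np.argsort(-np.abs(w))[k:]
            w[idx] = 0.0                             # hard-threshold to k live leaves
        return w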

Proceedings ArticleDOI
01 Sep 2016
TL;DR: A novel multi-scale shape descriptor is proposed for shape matching and object recognition that is invariant to translation, rotation and scaling, and can tolerate partial occlusion, articulated variation and intra-class variations.
Abstract: We propose a novel multi-scale shape descriptor for shape matching and object recognition. The descriptor includes three types of invariants at multiple scales to capture discriminative local and semi-global shape features, and a dynamic programming algorithm is employed for shape matching. The experimental results verify that our proposed shape feature is invariant to translation, rotation and scaling, and can tolerate partial occlusion, articulated variation and intra-class variations. The shape matching and retrieval results on benchmark datasets validate the effectiveness of our method. The method is also applied to real-time hand gesture recognition and achieves competitive results compared with the state of the art.
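The matching stage is a classic dynamic program over per-point descriptors along the two contours. A compact sketch follows; descriptor extraction itself is omitted, and trying circular start offsets of one contour (needed for a rotated start point) is left out for brevity.

    import numpy as np

    def match_contours(A, B):
        """A: (n, d) and B: (m, d) per-point multi-scale descriptors along two
        contours. Returns a DTW-style alignment cost, lower = more similar."""
        n, m = len(A), len(B)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c = np.linalg.norm(A[i - 1] - B[j - 1])   # local descriptor cost
                D[i, j] = c + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
        return D[n, m] / (n + m)                          # length-normalised cost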

Book ChapterDOI
01 Jan 2016
TL;DR: An extended noise-resistant LBP (ENRLBP) is proposed to capture line patterns; NRLBP and ENRLBP are validated extensively on face recognition, facial expression analysis and other recognition tasks, and are shown to be more resistant to image noise than LBP, LTP and many other variants.
Abstract: Face recognition and facial expression analysis are essential abilities of humans, which provide the basic visual clues during human-computer interaction. It is important to equip the virtual human/social robot with such capabilities in order to achieve autonomous behavior. Local binary pattern (LBP) has been widely used in face recognition and facial expression analysis. It is popular because of its robustness to illumination variation and alignment error. However, the local binary pattern still has some limitations, e.g., it is sensitive to image noise. Local ternary pattern (LTP), fuzzy LBP and many other LBP variants partially solve this problem. However, these approaches treat the corrupted image patterns as they are, and do not have a mechanism to recover the underlying patterns. In view of this, we develop a noise-resistant LBP to preserve the image micro-structures in the presence of noise. We encode a small pixel difference as an uncertain state first, and then determine its value based on the other bits of the LBP code. Most image micro-structures are represented by uniform codes, while non-uniform codes mainly represent noise patterns. Therefore, we assign the values of the uncertain bits so as to form possible uniform codes. In this way, we develop an error-correction mechanism to recover the distorted image patterns. In addition, we find that some image patterns such as lines are not captured in uniform codes. They represent a set of important local primitives for pattern recognition. We thus define an extended noise-resistant LBP (ENRLBP) to capture line patterns. NRLBP and ENRLBP are validated extensively on face recognition, facial expression analysis and other recognition tasks. They are shown to be more resistant to image noise than LBP, LTP and many other variants. These two approaches greatly enhance the performance of face recognition and facial expression analysis.
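The error-correction idea, encoding small differences as an uncertain state and then assigning those bits so the code becomes uniform when possible, can be sketched per pixel as follows. The threshold t and the bit ordering are assumptions.

    import numpy as np
    from itertools import product

    def is_uniform(bits):
        """A circular code is uniform if it has at most two 0/1 transitions."""
        return sum(bits[i] != bits[(i + 1) % len(bits)]
                   for i in range(len(bits))) <= 2

    def nrlbp_codes(center, neighbours, t=3):
        """Noise-resistant LBP for one pixel: differences within +-t become an
        uncertain state, then uncertain bits are assigned to form uniform codes.
        Returns the possible uniform codes; empty => treated as a noise pattern."""
        tri = [1 if n - center > t else 0 if n - center < -t else None
               for n in neighbours]
        unknown = [i for i, b in enumerate(tri) if b is None]
        codes = []
        for assign in product([0, 1], repeat=len(unknown)):
            bits = list(tri)
            for i, v in zip(unknown, assign):
                bits[i] = v
            if is_uniform(bits):
                codes.append(sum(b << k for k, b in enumerate(bits)))
        return codes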

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This framework is built around an inference engine similar to the probability hypothesis density (PHD) filter, where the state space consists of stochastic bounding boxes with constant velocity dynamics.
Abstract: This paper is concerned with a system for detecting and tracking multiple 3D bounding boxes based on information from multiple sensors. Our framework is built around an inference engine similar to the probability hypothesis density (PHD) filter, where the state space consists of stochastic bounding boxes with constant velocity dynamics. We outline measurement equations for two modalities (vision and radar). The result is a flexible inference system suitable for use on autonomous vehicles.
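The constant-velocity dynamics for a stochastic 3D bounding box amount to a standard Kalman-style prediction step. A minimal sketch; the 9-D state layout and the process-noise model are assumptions, and the PHD filter's full birth/death and update machinery is omitted.

    import numpy as np

    def cv_predict(state, P, dt, q=1.0):
        """state: [cx, cy, cz, w, h, l, vx, vy, vz] (assumed layout), P: 9x9
        covariance. The box centre moves with the velocity; size stays static."""
        F = np.eye(9)
        F[0, 6] = F[1, 7] = F[2, 8] = dt          # centre += velocity * dt
        Q = q * np.eye(9)                         # process noise (placeholder)
        return F @ state, F @ P @ F.T + Q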

Proceedings ArticleDOI
04 Jul 2016
TL;DR: In this paper, the authors present methods that are particularly applicable to multi-layer light field displays using two or more cascaded display layers where the images are produced by computationally intensive algorithms.
Abstract: Light field 3D displays, where a hologram-like image is produced using geometrical optical techniques as opposed to interference of light, require measurement procedures different from those for conventional stereoscopic displays when measuring certain parameters. This paper covers methods that are particularly applicable to multi-layer light field displays using two or more cascaded display layers, where the images are produced by computationally intensive algorithms. As the overall system performance depends on capture and computation as well as the display hardware, we have developed methods that account for all the components in the complete chain. In addition to describing these, we also cover other selected measurement techniques.

Journal ArticleDOI
01 May 2016
TL;DR: Experiments demonstrate that the proposed method provides a promising perceptual optimization in both moiré mitigation and definition enhancement for the visual perception of compressive light field displays.
Abstract: In this paper we propose a rotation angle-based perceptual optimization framework for dual-layer light field 3D displays. The framework flow can be divided into two phases: first, the moiré pattern is simulated based on the rotation angle and a moiré cost for visual perception is calculated; second, the moiré cost is incorporated into the final optimization of the proposed compressive factorization for the light field 3D display. In this way, a moiré-aware compressive factorization with rotation optimization is proposed to generate the final display content for our prototype. Experiments demonstrate that the proposed method provides a promising perceptual optimization in both moiré mitigation and definition enhancement for the visual perception of compressive light field displays.

Proceedings ArticleDOI
11 Jul 2016
TL;DR: This work proposes to extract pseudo-color CENTRIST features from the logarithm of a Gammatone-like spectrogram, and to classify sound events under unknown noise conditions with a classifier-selection scheme that automatically selects the most suitable classifier.
Abstract: Sound-event classification often extracts features from an image-like spectrogram. Recent approaches such as the spectrogram image feature and the subband power distribution image feature extract local statistics such as mean and variance from the spectrogram. We argue that such simple image statistics cannot well capture the complex texture details of the spectrogram. Thus, we propose to extract pseudo-color CENTRIST features from the logarithm of a Gammatone-like spectrogram. To classify sound events well under unknown noise conditions, we propose a classifier-selection scheme, which automatically selects the most suitable classifier. The proposed approach is compared with the state of the art on the RWCP database, and demonstrates superior performance.
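CENTRIST is a census-transform histogram; applied to a spectrogram it reduces to the sketch below. The comparison direction and the plain 256-bin histogram are simplifications of the full descriptor (which also uses spatial pyramids and, here, pseudo-color channels).

    import numpy as np

    def centrist(spec):
        """Census-transform each interior pixel of a (log-Gammatone) spectrogram
        against its 8 neighbours, then histogram the resulting 8-bit codes."""
        s = spec.astype(float)
        c = np.zeros(s.shape, dtype=np.uint8)
        shifts = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
        for k, (dy, dx) in enumerate(shifts):
            nb = np.roll(np.roll(s, dy, 0), dx, 1)         # one of the 8 neighbours
            c[1:-1, 1:-1] |= ((nb[1:-1, 1:-1] <= s[1:-1, 1:-1])
                              .astype(np.uint8) << k)      # set bit k of the code
        hist, _ = np.histogram(c[1:-1, 1:-1], bins=256, range=(0, 256))
        return hist / hist.sum()                           # normalised descriptor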

Proceedings ArticleDOI
11 Jul 2016
TL;DR: The proposed method jointly learns multiple distance metrics under which multiple feature representations are consistent across different views, i.e., the difference between the distance metrics learned in different views is enforced to be as small as possible.
Abstract: Most distance metric learning algorithms learn a single distance metric over single-view data and cannot directly exploit multi-view data. In many visual classification applications, we have access to multi-view feature representations. To exploit more discriminative information for classification, it is desirable to learn several distance metrics from multi-view data. To this end, we propose a collaborative multi-view metric learning (CMML) method for visual classification. The proposed method jointly learns multiple distance metrics under which the multiple feature representations are consistent across different views, i.e., the difference between the distance metrics learned in different views is enforced to be as small as possible. Experimental results on two visual classification tasks, including face recognition and scene classification, show the efficacy of the CMML method.
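One way to realise the cross-view consistency described above is a per-view (sub)gradient step with a coupling penalty toward the mean metric. This is an illustrative simplification, not the paper's CMML optimisation; the pull/push loss, lambda, and learning rate are assumptions.

    import numpy as np

    def cmml_step(M, Xs, pairs, labels, lam=0.1, lr=0.01):
        """M: list of per-view Mahalanobis matrices; Xs: list of per-view feature
        arrays; pairs: list of (i, j) sample index pairs; labels: True if same class.
        Same-class pairs pull the metric in, different-class pairs push it out,
        and lam * ||M[v] - mean(M)||_F^2 keeps the views consistent."""
        M_bar = sum(M) / len(M)
        for v, X in enumerate(Xs):
            G = np.zeros_like(M[v])
            for (i, j), same in zip(pairs, labels):
                d = (X[i] - X[j])[:, None]
                G += (1 if same else -1) * (d @ d.T)   # shrink same, expand diff
            G += 2 * lam * (M[v] - M_bar)              # cross-view consistency term
            M[v] -= lr * G
            w, U = np.linalg.eigh(M[v])                # project back to the PSD cone
            M[v] = (U * np.maximum(w, 0)) @ U.T
        return M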

Book ChapterDOI
01 Jan 2016
TL;DR: In this chapter, a nonverbal way of communication for human–robot interaction by understanding human upper body gestures will be addressed and an effective and real-time human gesture recognition method is proposed.
Abstract: In this chapter, a nonverbal way of communication for human–robot interaction by understanding human upper body gestures will be addressed. A human–robot interaction system based on a novel combination of sensors is proposed. It allows one person to interact with a humanoid social robot through natural body language. The robot can understand the meaning of human upper body gestures and express itself by using a combination of body movements, facial expressions, and verbal language. A set of 12 upper body gestures is involved for communication. Human–object interactions are also included in these gestures. The gestures can be characterized by the head, arm, and hand posture information. A CyberGlove II is employed to capture the hand posture, and this feature is combined with the head and arm posture information captured by a Microsoft Kinect. This is a new sensor solution for human-gesture capture. Based on the body posture data, an effective and real-time human gesture recognition method is proposed. For the experiments, a human body gesture dataset was built. The experimental results demonstrate the effectiveness and efficiency of the proposed approach.