
Showing papers by "Hao Su" published in 2016


Proceedings Article • DOI
Charles R. Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, Leonidas J. Guibas
27 Jun 2016
TL;DR: Two distinct volumetric CNN architectures are introduced, and multi-view CNNs are extended with multi-resolution filtering in 3D; extensive experiments evaluate the underlying design choices.
Abstract: 3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, we have witnessed two types of CNNs being developed: CNNs based upon volumetric representations versus CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs. We provide extensive experiments designed to evaluate underlying design choices, thus providing a better understanding of the space of methods available for object classification on 3D data.

1,488 citations
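
One of the volumetric designs explored in this line of work probes all the way through the voxel grid with elongated kernels, collapsing the 3D volume into a 2D feature map that a standard image CNN can classify. Below is a minimal, hedged sketch of that idea in PyTorch; the layer sizes and the `AnisotropicProbe` name are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of "anisotropic probing": kernels of shape (D, 1, 1) aggregate
# information along the z-axis at each (x, y) location, turning the
# occupancy volume into a 2D feature map for an ordinary 2D CNN.
import torch
import torch.nn as nn

class AnisotropicProbe(nn.Module):
    def __init__(self, grid=30, n_classes=40):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(grid, 1, 1)), nn.ReLU(),
        )
        self.cnn2d = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_classes),
        )

    def forward(self, vox):          # vox: (B, 1, D, H, W) occupancy grid
        feat = self.probe(vox)       # -> (B, 32, 1, H, W)
        return self.cnn2d(feat.squeeze(2))

logits = AnisotropicProbe()(torch.rand(2, 1, 30, 30, 30))
```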


Posted Content
TL;DR: The authors address the problem of 3D reconstruction from a single image, generating a straightforward form of output: point cloud coordinates. Because the ground-truth shape for an input image may be ambiguous, they design an architecture, loss function, and learning paradigm that are novel and effective.
Abstract: Generation of 3D data by deep neural network has been attracting increasing attention in the research community. The majority of extant works resort to regular representations such as volumetric grids or collections of images; however, these representations obscure the natural invariance of 3D shapes under geometric transformations and also suffer from a number of other issues. In this paper we address the problem of 3D reconstruction from a single image, generating a straightforward form of output -- point cloud coordinates. Along with this problem arises a unique and interesting issue: the ground-truth shape for an input image may be ambiguous. Driven by this unorthodox output form and the inherent ambiguity in ground truth, we design an architecture, loss function and learning paradigm that are novel and effective. Our final solution is a conditional shape sampler, capable of predicting multiple plausible 3D point clouds from an input image. In experiments, not only can our system outperform state-of-the-art methods on single-image-based 3D reconstruction benchmarks, but it also shows strong performance for 3D shape completion and a promising ability to make multiple plausible predictions.

1,194 citations
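
Training a network to emit point coordinates requires a loss defined directly on unordered point sets. As a hedged illustration, here is the symmetric Chamfer distance, a standard point-set loss for this kind of system, in plain NumPy; the paper's full loss and its conditional sampler are more involved.

```python
# Symmetric Chamfer distance between two point sets: each point is
# matched to its nearest neighbor in the other set, in both directions.
import numpy as np

def chamfer_distance(P, Q):
    """P: (n, 3) predicted points, Q: (m, 3) ground-truth points."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (n, m)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.random.rand(1024, 3)
gt = np.random.rand(1024, 3)
print(chamfer_distance(pred, gt))
```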


Posted Content
TL;DR: PointNet, a novel type of neural network that directly consumes point clouds and respects the permutation invariance of points in the input, provides a unified architecture for applications ranging from object classification and part segmentation to scene semantic parsing.
Abstract: Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. In this paper, we design a novel type of neural network that directly consumes point clouds and well respects the permutation invariance of points in the input. Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption.

1,156 citations
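
The permutation invariance the abstract emphasizes can be obtained by applying the same MLP to every point and then aggregating with a symmetric function such as max pooling. A minimal sketch of that core idea follows; the input/feature transforms and segmentation branch of the full PointNet are omitted, and layer sizes are illustrative.

```python
# Shared per-point MLP + max pooling: the global feature is identical
# for any reordering of the input points.
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, n_classes=40):
        super().__init__()
        self.mlp = nn.Sequential(      # applied identically to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 1024), nn.ReLU(),
        )
        self.head = nn.Linear(1024, n_classes)

    def forward(self, pts):            # pts: (B, N, 3)
        f = self.mlp(pts)              # (B, N, 1024), order-equivariant
        g = f.max(dim=1).values        # (B, 1024), order-invariant
        return self.head(g)

net = TinyPointNet()
x = torch.rand(4, 2048, 3)
perm = torch.randperm(2048)
assert torch.allclose(net(x), net(x[:, perm]), atol=1e-5)
```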


Journal Article • DOI
11 Nov 2016
TL;DR: This work proposes a novel active learning method capable of enriching massive geometric datasets with accurate semantic region annotations, and demonstrates that incorporating verification of all produced labelings within this unified objective improves both accuracy and efficiency of the active learning procedure.
Abstract: Large repositories of 3D shapes provide valuable input for data-driven analysis and modeling tools. They are especially powerful once annotated with semantic information such as salient regions and functional parts. We propose a novel active learning method capable of enriching massive geometric datasets with accurate semantic region annotations. Given a shape collection and a user-specified region label our goal is to correctly demarcate the corresponding regions with minimal manual work. Our active framework achieves this goal by cycling between manually annotating the regions, automatically propagating these annotations across the rest of the shapes, manually verifying both human and automatic annotations, and learning from the verification results to improve the automatic propagation algorithm. We use a unified utility function that explicitly models the time cost of human input across all steps of our method. This allows us to jointly optimize for the set of models to annotate and for the set of models to verify based on the predicted impact of these actions on the human efficiency. We demonstrate that incorporating verification of all produced labelings within this unified objective improves both accuracy and efficiency of the active learning procedure. We automatically propagate human labels across a dynamic shape network using a conditional random field (CRF) framework, taking advantage of global shape-to-shape similarities, local feature similarities, and point-to-point correspondences. By combining these diverse cues we achieve higher accuracy than existing alternatives. We validate our framework on existing benchmarks demonstrating it to be significantly more efficient at using human input compared to previous techniques. We further validate its efficiency and robustness by annotating a massive shape dataset, labeling over 93,000 shape parts, across multiple model classes, and providing a labeled part collection more than one order of magnitude larger than existing ones.

959 citations
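
The unified utility function described above trades predicted benefit against human time cost when choosing the next action. A hedged toy sketch of that selection rule, with placeholder gain/cost estimators standing in for the paper's CRF-driven predictions:

```python
# Pick the next human action (annotate vs. verify a shape) by predicted
# accuracy gain per second of human effort. Gain and cost functions here
# are illustrative stand-ins.
def next_action(candidates, predicted_gain, predicted_cost):
    """candidates: iterable of (action, shape_id) pairs."""
    return max(candidates,
               key=lambda c: predicted_gain(*c) / predicted_cost(*c))

actions = [("annotate", 7), ("verify", 3), ("verify", 12)]
gain = lambda a, s: {"annotate": 1.0, "verify": 0.4}[a]
cost = lambda a, s: {"annotate": 30.0, "verify": 5.0}[a]   # seconds
print(next_action(actions, gain, cost))   # -> ('verify', 3)
```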


Book Chapter • DOI
08 Oct 2016
TL;DR: ObjectNet3D, a large-scale database for 3D object recognition consisting of 100 categories, 90,127 images, 201,888 objects in these images, and 44,147 3D shapes, is contributed; it is useful for recognizing the 3D pose and 3D shape of objects from 2D images.
Abstract: We contribute a large scale database for 3D object recognition, named ObjectNet3D, that consists of 100 categories, 90,127 images, 201,888 objects in these images and 44,147 3D shapes. Objects in the 2D images in our database are aligned with the 3D shapes, and the alignment provides both accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. Consequently, our database is useful for recognizing the 3D pose and 3D shape of objects from 2D images. We also provide baseline experiments on four tasks: region proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval, which can serve as baselines for future research using our database. Our database is available online at http://cvgl.stanford.edu/projects/objectnet3d.

309 citations


Proceedings Article • DOI
01 Oct 2016
TL;DR: A fully automatic, scalable approach is presented that samples the human pose space to guide the synthesis procedure and extracts clothing textures from real images, addressing the scarcity of annotated training data for 3D pose estimation, where 3D poses are much harder to annotate than 2D ones.
Abstract: Human 3D pose estimation from a single image is a challenging task with numerous applications. Convolutional Neural Networks (CNNs) have recently achieved superior performance on the task of 2D pose estimation from a single image, by training on images with 2D annotations collected by crowd sourcing. This suggests that similar success could be achieved for direct estimation of 3D poses. However, 3D poses are much harder to annotate, and the lack of suitable annotated training images hinders attempts towards end-to-end solutions. To address this issue, we opt to automatically synthesize training images with ground truth pose annotations. Our work is a systematic study along this road. We find that pose space coverage and texture diversity are the key ingredients for the effectiveness of synthetic training data. We present a fully automatic, scalable approach that samples the human pose space for guiding the synthesis procedure and extracts clothing textures from real images. Furthermore, we explore domain adaptation for bridging the gap between our synthetic training images and real testing photos. We demonstrate that CNNs trained with our synthetic images out-perform those trained with real photos on 3D pose estimation tasks.

240 citations


Posted Content
TL;DR: This work represents 3D spaces as volumetric fields and proposes a novel design that employs field probing filters to efficiently extract features from them, showing that field probing is significantly more efficient than 3D CNNs while providing state-of-the-art performance on classification tasks for 3D object recognition benchmark datasets.
Abstract: Building discriminative representations for 3D data has been an important task in computer graphics and computer vision research. Convolutional Neural Networks (CNNs) have been shown to operate on 2D images with great success for a variety of tasks. Lifting convolution operators to 3D (3D CNNs) seems like a plausible and promising next step. Unfortunately, the computational complexity of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since most 3D geometry representations are boundary based, occupied regions do not increase proportionately with the size of the discretization, resulting in wasted computation. In this work, we represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them. Each field probing filter is a set of probing points --- sensors that perceive the space. Our learning algorithm optimizes not only the weights associated with the probing points, but also their locations, which deforms the shape of the probing filters and adaptively distributes them in 3D space. The optimized probing points sense the 3D space "intelligently", rather than operating blindly over the entire domain. We show that field probing is significantly more efficient than 3D CNNs, while providing state-of-the-art performance, on classification tasks for 3D object recognition benchmark datasets.

233 citations
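
The probing points are trainable in both weight and location, which in a modern framework can be expressed by sampling the volumetric field at learnable coordinates via differentiable trilinear interpolation. A minimal sketch under those assumptions; the sizes and the single-linear-head readout are illustrative, not the paper's design.

```python
# Learnable probing points sample a volumetric field; gradients flow to
# both the per-point weights and the point locations themselves.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FieldProbing(nn.Module):
    def __init__(self, n_points=512, n_classes=40):
        super().__init__()
        self.loc = nn.Parameter(torch.rand(n_points, 3) * 2 - 1)  # in [-1, 1]^3
        self.w = nn.Parameter(torch.randn(n_points))
        self.head = nn.Linear(n_points, n_classes)

    def forward(self, field):                 # field: (B, 1, D, H, W)
        B = field.shape[0]
        grid = self.loc.view(1, -1, 1, 1, 3).expand(B, -1, -1, -1, -1)
        # grid_sample trilinearly interpolates the field at each point
        s = F.grid_sample(field, grid, align_corners=True)  # (B, 1, N, 1, 1)
        return self.head(s.view(B, -1) * self.w)

logits = FieldProbing()(torch.rand(2, 1, 32, 32, 32))
```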


Posted Content
TL;DR: Two distinct volumetric CNN architectures are introduced and multi-view CNNs are improved with multi-resolution filtering in 3D, outperforming current state-of-the-art methods of both kinds.
Abstract: 3D shape models are becoming widely available and easier to capture, making available 3D information crucial for progress in object classification. Current state-of-the-art methods rely on CNNs to address this problem. Recently, we have witnessed two types of CNNs being developed: CNNs based upon volumetric representations versus CNNs based upon multi-view representations. Empirical results from these two types of CNNs exhibit a large gap, indicating that existing volumetric CNN architectures and approaches are unable to fully exploit the power of 3D representations. In this paper, we aim to improve both volumetric CNNs and multi-view CNNs according to extensive analysis of existing approaches. To this end, we introduce two distinct network architectures of volumetric CNNs. In addition, we examine multi-view CNNs, where we introduce multi-resolution filtering in 3D. Overall, we are able to outperform current state-of-the-art methods for both volumetric CNNs and multi-view CNNs. We provide extensive experiments designed to evaluate underlying design choices, thus providing a better understanding of the space of methods available for object classification on 3D data.

219 citations


Proceedings Article
20 May 2016
TL;DR: In this article, the authors represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them, where each field probing filter is a set of probing points.
Abstract: Building discriminative representations for 3D data has been an important task in computer graphics and computer vision research. Convolutional Neural Networks (CNNs) have been shown to operate on 2D images with great success for a variety of tasks. Lifting convolution operators to 3D (3D CNNs) seems like a plausible and promising next step. Unfortunately, the computational complexity of 3D CNNs grows cubically with respect to voxel resolution. Moreover, since most 3D geometry representations are boundary based, occupied regions do not increase proportionately with the size of the discretization, resulting in wasted computation. In this work, we represent 3D spaces as volumetric fields, and propose a novel design that employs field probing filters to efficiently extract features from them. Each field probing filter is a set of probing points -- sensors that perceive the space. Our learning algorithm optimizes not only the weights associated with the probing points, but also their locations, which deforms the shape of the probing filters and adaptively distributes them in 3D space. The optimized probing points sense the 3D space "intelligently", rather than operating blindly over the entire domain. We show that field probing is significantly more efficient than 3D CNNs, while providing state-of-the-art performance, on classification tasks for 3D object recognition benchmark datasets.

129 citations


Posted Content
TL;DR: A learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives that allows predicting shape representations which can be leveraged for obtaining a consistent parsing across the instances of a shape collection and constructing an interpretable shape similarity measure.
Abstract: We present a learning framework for abstracting complex shapes by learning to assemble objects using 3D volumetric primitives. In addition to generating simple and geometrically interpretable explanations of 3D objects, our framework also allows us to automatically discover and exploit consistent structure in the data. We demonstrate that using our method allows predicting shape representations which can be leveraged for obtaining a consistent parsing across the instances of a shape collection and constructing an interpretable shape similarity measure. We also examine applications for image-based prediction as well as shape manipulation.

83 citations
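
Abstracting a shape as a small set of volumetric primitives amounts to regressing, per primitive, a size, rotation, and translation, and scoring how well points on the predicted primitives cover the target shape. A hedged sketch of such a parameterization with a plain Chamfer-style coverage term standing in for the paper's actual losses (rotation omitted for brevity):

```python
# Sample points on axis-aligned cuboid surfaces and measure how well
# they cover a target point cloud. Illustrative stand-in, not the
# paper's training objective.
import torch

def cuboid_surface_points(size, trans, n=64):
    """size, trans: (K, 3). Returns (K*n, 3) surface samples."""
    u = torch.rand(size.shape[0], n, 3) * 2 - 1           # in [-1, 1]^3
    # push each sample onto a random face so it lies on the surface
    ax = torch.randint(0, 3, (size.shape[0], n))
    u.scatter_(2, ax.unsqueeze(-1),
               torch.sign(torch.rand(size.shape[0], n, 1) - 0.5))
    return (u * size[:, None] + trans[:, None]).reshape(-1, 3)

def coverage_loss(primitive_pts, shape_pts):
    d = torch.cdist(shape_pts, primitive_pts)             # (M, K*n)
    return d.min(dim=1).values.mean()                     # shape -> primitives

pts = cuboid_surface_points(torch.rand(4, 3), torch.randn(4, 3))
print(coverage_loss(pts, torch.randn(500, 3)))
```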


Journal Article • DOI
TL;DR: A unique synthetic methodology to prepare a library of giant molecules with multiple, precisely arranged nano building blocks and the influence of minute structural differences on their self-assembly behaviors is introduced.
Abstract: Herein we introduce a unique synthetic methodology to prepare a library of giant molecules with multiple, precisely arranged nano building blocks, and illustrate the influence of minute structural differences on their self-assembly behaviors. The T8 polyhedral oligomeric silsesquioxane (POSS) nanoparticles are orthogonally functionalized and sequentially attached onto the end of a hydrophobic polymer chain in either linear or branched configuration. The heterogeneity of primary chemical structure in terms of composition, surface functionality, sequence, and topology can be precisely controlled and is reflected in the self-assembled supramolecular structures of these giant molecules in the condensed state. This strategy offers promising opportunities to manipulate the hierarchical heterogeneities of giant molecules via precise and modular assemblies of various nano building blocks.

Posted Content
TL;DR: It is shown that pose space coverage and texture diversity are the key ingredients for the effectiveness of synthetic training data, and that CNNs trained with the authors' synthetic images out-perform those trained with real photos on 3D pose estimation tasks.
Abstract: Human 3D pose estimation from a single image is a challenging task with numerous applications. Convolutional Neural Networks (CNNs) have recently achieved superior performance on the task of 2D pose estimation from a single image, by training on images with 2D annotations collected by crowd sourcing. This suggests that similar success could be achieved for direct estimation of 3D poses. However, 3D poses are much harder to annotate, and the lack of suitable annotated training images hinders attempts towards end-to-end solutions. To address this issue, we opt to automatically synthesize training images with ground truth pose annotations. Our work is a systematic study along this road. We find that pose space coverage and texture diversity are the key ingredients for the effectiveness of synthetic training data. We present a fully automatic, scalable approach that samples the human pose space for guiding the synthesis procedure and extracts clothing textures from real images. Furthermore, we explore domain adaptation for bridging the gap between our synthetic training images and real testing photos. We demonstrate that CNNs trained with our synthetic images out-perform those trained with real photos on 3D pose estimation tasks.

Proceedings Article • DOI
08 May 2016
TL;DR: This track aims to provide a benchmark to evaluate large-scale shape retrieval based on the ShapeNet dataset, using ShapeNet Core55, which provides more than 50 thousand models over 55 common categories in total for training and evaluating several algorithms.
Abstract: With the advent of commodity 3D capturing devices and better 3D modeling tools, 3D shape content is becoming increasingly prevalent. Therefore, the need for shape retrieval algorithms to handle large-scale shape repositories is increasingly important. This track aims to provide a benchmark to evaluate large-scale shape retrieval based on the ShapeNet dataset. We use ShapeNet Core55, which provides more than 50 thousand models over 55 common categories in total for training and evaluating several algorithms. Five participating teams have submitted a variety of retrieval methods which were evaluated on several standard information retrieval performance metrics. We find the submitted methods work reasonably well on the track benchmark, but we also see significant space for improvement by future algorithms. We release all the data, results, and evaluation code for the benefit of the community.
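
For a sense of the standard information retrieval metrics such a track computes, here is a minimal average-precision routine over a ranked list of retrieved models; the benchmark's actual evaluation also covers NDCG and other measures.

```python
# Average precision over a ranked retrieval list: precision is sampled
# at every rank where a relevant (same-category) model appears.
def average_precision(ranked_relevant):
    """ranked_relevant: list of bools, True where the retrieved model
    shares the query's category, in rank order."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_relevant, 1):
        if rel:
            hits += 1
            score += hits / i
    return score / max(hits, 1)

print(average_precision([True, False, True, True]))  # 0.8055...
```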

Journal Article • DOI
TL;DR: It is found that camptothecin (CPT) prodrugs created by conjugating two CPT molecules onto a hydrophilic segment can associate into filamentous nanostructures in water and exhibit much greater efficacy against primary brain cancer cells relative to that of irinotecan, a clinically used CPT prodrug.
Abstract: Chemical modification of small molecule hydrophobic drugs is a clinically proven strategy to devise prodrugs with enhanced treatment efficacy. While this prodrug strategy improves the parent drug's water solubility and pharmacokinetic profile, it typically compromises the drug's potency against cancer cells due to the retarded drug release rate and reduced cellular uptake efficiency. Here we report on the supramolecular design of self-assembling prodrugs (SAPD) with much improved water solubility while maintaining high potency against cancer cells. We found that camptothecin (CPT) prodrugs created by conjugating two CPT molecules onto a hydrophilic segment can associate into filamentous nanostructures in water. Our results suggest that these SAPD exhibit much greater efficacy against primary brain cancer cells relative to that of irinotecan, a clinically used CPT prodrug. We believe these findings open a new avenue for rational design of supramolecular prodrugs for cancer treatment.

Journal Article • DOI
11 Nov 2016
TL;DR: A 3D Attention Model is developed that selects the best views to scan from, as well as the most informative regions in each view to focus on, to achieve efficient object recognition; the region-level attention leads to focus-driven features which are quite robust against object occlusion.
Abstract: We address the problem of autonomously exploring unknown objects in a scene by consecutive depth acquisitions. The goal is to reconstruct the scene while online identifying the objects from among a large collection of 3D shapes. Fine-grained shape identification demands a meticulous series of observations attending to varying views and parts of the object of interest. Inspired by the recent success of attention-based models for 2D recognition, we develop a 3D Attention Model that selects the best views to scan from, as well as the most informative regions in each view to focus on, to achieve efficient object recognition. The region-level attention leads to focus-driven features which are quite robust against object occlusion. The attention model, trained with the 3D shape collection, encodes the temporal dependencies among consecutive views with deep recurrent networks. This facilitates order-aware view planning accounting for robot movement cost. In achieving instance identification, the shape collection is organized into a hierarchy, associated with pre-trained hierarchical classifiers. The effectiveness of our method is demonstrated on an autonomous robot (PR) that explores a scene and identifies the objects to construct a 3D scene model.

Proceedings Article • DOI
01 Jun 2016
TL;DR: A multilinear hyperplane hashing that generates a hash bit using multiple linear projections with strong locality sensitivity to hyperplane queries is proposed, and an angular quantization based learning framework for compact multilinear hashing is introduced, which considerably boosts the search performance with fewer hash bits.
Abstract: Hashing has become an increasingly popular technique for fast nearest neighbor search. Despite its successful progress in classic point-to-point search, there are few studies regarding point-to-hyperplane search, which has strong practical capabilities of scaling up applications like active learning with SVMs. Existing hyperplane hashing methods enable fast search based on randomly generated hash codes, but still suffer from a low collision probability and thus usually require long codes for satisfying performance. To overcome this problem, this paper proposes a multilinear hyperplane hashing that generates a hash bit using multiple linear projections. Our theoretical analysis shows that with an even number of random linear projections, the multilinear hash function possesses strong locality sensitivity to hyperplane queries. To leverage its sensitivity to the angle distance, we further introduce an angular quantization based learning framework for compact multilinear hashing, which considerably boosts the search performance with fewer hash bits. Experiments with applications to large-scale (up to one million) active learning on two datasets demonstrate the overall superiority of the proposed approach.
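
The multilinear hash bit described above is the sign of a product of several linear projections, h(x) = sgn(∏ᵢ wᵢᵀx), with an even number of projections giving the locality sensitivity to hyperplane queries. A compact NumPy sketch of generating such codes; the learned angular quantization stage is not reproduced here.

```python
# Each of m hash bits is the sign of the product of k linear
# projections of the input (k even for hyperplane queries).
import numpy as np

def multilinear_hash(X, W):
    """X: (n, d) data; W: (m, k, d), m bits each from k projections."""
    proj = np.einsum('mkd,nd->nmk', W, X)        # (n, m, k)
    return np.sign(proj.prod(axis=2))            # (n, m) in {-1, +1}

rng = np.random.default_rng(0)
codes = multilinear_hash(rng.normal(size=(5, 16)),
                         rng.normal(size=(32, 2, 16)))   # k = 2 (even)
print(codes.shape)   # (5, 32)
```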

Journal Article • DOI
11 Nov 2016
TL;DR: The method allows for large-scale unsupervised production of richly textured 3D models directly from image data, providing high quality virtual objects for 3D scene design or photo editing applications, as well as a wealth of data for training machine learning algorithms for various inference tasks in graphics and vision.
Abstract: Large 3D model repositories of common objects are now ubiquitous and are increasingly being used in computer graphics and computer vision for both analysis and synthesis tasks. However, images of objects in the real world have a richness of appearance that these repositories do not capture, largely because most existing 3D models are untextured. In this work we develop an automated pipeline capable of transporting texture information from images of real objects to 3D models of similar objects. This is a challenging problem, as an object's texture as seen in a photograph is distorted by many factors, including pose, geometry, and illumination. These geometric and photometric distortions must be undone in order to transfer the pure underlying texture to a new object --- the 3D model. Instead of using problematic dense correspondences, we factorize the problem into the reconstruction of a set of base textures (materials) and an illumination model for the object in the image. By exploiting the geometry of the similar 3D model, we reconstruct certain reliable texture regions and correct for the illumination, from which a full texture map can be recovered and applied to the model. Our method allows for large-scale unsupervised production of richly textured 3D models directly from image data, providing high quality virtual objects for 3D scene design or photo editing applications, as well as a wealth of data for training machine learning algorithms for various inference tasks in graphics and vision.

Journal Article • DOI
TL;DR: As the first of its kind, MRI-guided targeted concentric tube needle placements with ex vivo porcine liver are demonstrated with 4.64 mm RMS error through closed-loop control of the piezoelectrically-actuated robot.
Abstract: This paper presents the design, modeling and experimental evaluation of a magnetic resonance imaging (MRI)-compatible concentric tube continuum robotic system. This system enables MRI-guided deployment of a precurved and steerable concentric tube continuum mechanism, and is suitable for clinical applications where a curved trajectory is needed. This compact 6 degree-of-freedom (DOF) robotic system is piezoelectrically-actuated, and allows simultaneous robot motion and imaging with no visually observable image artifact. The targeting accuracy is evaluated with optical tracking system and gelatin phantom under live MRI-guidance with Root Mean Square (RMS) errors of 1.94 and 2.17 mm respectively. Furthermore, we demonstrate that the robot has kinematic redundancy to reach the same target through different paths. This was evaluated in both free space and MRI-guided gelatin phantom trails, with RMS errors of 0.48 and 0.59 mm respectively. As the first of its kind, MRI-guided targeted concentric tube needle placements with ex vivo porcine liver are demonstrated with 4.64 mm RMS error through closed-loop control of the piezoelectrically-actuated robot.

Posted Content
TL;DR: SyncSpecCNN, a spectral convolutional neural network for 3D shape part segmentation and keypoint prediction, enables weight sharing by parameterizing kernels in the spectral domain spanned by graph Laplacian eigenbases.
Abstract: In this paper, we study the problem of semantic annotation on 3D models that are represented as shape graphs. A functional view is taken to represent localized information on graphs, so that annotations such as part segments or keypoints are nothing but 0-1 indicator vertex functions. Compared with images that are 2D grids, shape graphs are irregular and non-isomorphic data structures. To enable the prediction of vertex functions on them by convolutional neural networks, we resort to the spectral CNN method that enables weight sharing by parameterizing kernels in the spectral domain spanned by graph Laplacian eigenbases. Under this setting, our network, named SyncSpecCNN, strives to overcome two key challenges: how to share coefficients and conduct multi-scale analysis in different parts of the graph for a single shape, and how to share information across related but different shapes that may be represented by very different graphs. Towards these goals, we introduce a spectral parameterization of dilated convolutional kernels and a spectral transformer network. Experimentally we tested our SyncSpecCNN on various tasks, including 3D shape part segmentation and 3D keypoint prediction. State-of-the-art performance has been achieved on all benchmark datasets.
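
The spectral parameterization rests on filtering a vertex function through the graph Laplacian eigenbasis, y = U diag(g) Uᵀ x, so that learnable kernels live in the spectral domain. A minimal NumPy sketch of that building block; SyncSpecCNN's dilated spectral kernels and spectral transformer network are beyond this illustration.

```python
# Spectral convolution on a graph: project the vertex function onto the
# Laplacian eigenbasis, modulate by a kernel g, and project back.
import numpy as np

def spectral_conv(adj, x, g):
    """adj: (n, n) adjacency; x: (n,) vertex function; g: (n,) kernel."""
    deg = np.diag(adj.sum(axis=1))
    L = deg - adj                          # combinatorial graph Laplacian
    lam, U = np.linalg.eigh(L)             # eigenbasis of L
    return U @ (g * (U.T @ x))             # filter in the spectral domain

adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], float)
print(spectral_conv(adj, np.array([1.0, 0.0, 0.0]), np.ones(3)))
```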

Proceedings Article
01 Jan 2016
TL;DR: This work proposes an adaptive binary quantization method that learns a discriminative hash function with prototypes correspondingly associated with small unique binary codes; it significantly outperforms state-of-the-art hashing methods.
Abstract: Hashing has proved to be an attractive technique for fast nearest neighbor search over big data. Compared to projection based hashing methods, prototype based ones have a stronger capability of generating discriminative binary codes for data with complex inherent structure. However, our observation indicates that they still suffer from insufficient coding, which usually utilizes the complete binary codes in a hypercube. To address this problem, we propose an adaptive binary quantization method that learns a discriminative hash function with prototypes correspondingly associated with small unique binary codes. Our alternating optimization adaptively discovers the prototype set and the code set of a varying size in an efficient way, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes. We believe that our idea serves as a very helpful insight to hashing research. The extensive experiments on four large-scale (up to 80 million) datasets demonstrate that our method significantly outperforms state-of-the-art hashing methods, with relative performance gains of up to 58.84%.
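
At quantization time, the prototype-based scheme framed in the abstract assigns each data point the short binary code of its nearest prototype. A hedged sketch with the prototypes and codes given; learning them jointly is the contribution of the paper.

```python
# Assign each point the binary code of its nearest prototype.
import numpy as np

def quantize(X, prototypes, codes):
    """X: (n, d); prototypes: (p, d); codes: (p, b) in {0, 1}."""
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (n, p)
    return codes[d.argmin(axis=1)]                               # (n, b)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 8))
protos = rng.normal(size=(4, 8))
codes = rng.integers(0, 2, size=(4, 3))
print(quantize(X, protos, codes))
```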

Journal Article • DOI
TL;DR: This Review highlights recent examples of peptide-based targeting ligands that have been exploited to selectively deliver a chemotherapeutic payload to specific tumor-associated sites such as the vasculature, lymphatics, or cell surface.
Abstract: Chemotherapeutic treatment of cancers is a challenging endeavor, hindered by poor selectivity towards tumorous tissues over healthy ones. Preferentially delivering a given drug to tumor sites necessitates the use of targeting elements, of which there are a wide range in development. In this Review, we highlight recent examples of peptide-based targeting ligands that have been exploited to selectively deliver a chemotherapeutic payload to specific tumor-associated sites such as the vasculature, lymphatics, or cell surface. The advantages and limitations of such approaches will be discussed with a view to potential future development. Additionally, we will also examine how peptide-based ligands can be used diagnostically in the detection and characterization of cancers through their incorporation into imaging agents.

Patent
Fei-Fei Li, Jia Li, Hao Su
22 Jan 2016
TL;DR: In this article, an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task.
Abstract: Robust low-level image features have been proven to be effective representations for a variety of visual recognition tasks such as object recognition and scene classification; but pixels, or even local image patches, carry little semantic meaning. For high-level visual tasks, such low-level image representations are potentially not enough. The present invention provides a high-level image representation where an image is represented as a scale-invariant response map of a large number of pre-trained generic object detectors, blind to the testing dataset or visual task. Leveraging this representation, superior performance on high-level visual recognition tasks is achieved with relatively simple classifiers such as logistic regression and linear SVM classifiers.
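
The claimed representation turns an image into the pooled responses of many pre-trained object detectors run at several scales, which then feed a simple classifier. A hedged sketch with stubbed detector callables; the patent's detectors and spatial pooling scheme are richer than this.

```python
# Build a feature vector from the max-pooled response maps of many
# object detectors across scales; detectors are stand-in callables.
import numpy as np

def object_bank_feature(image, detectors, scales=(1.0, 0.5)):
    feats = []
    for det in detectors:
        for s in scales:
            resp = det(image, s)            # 2D response map at scale s
            feats.append(resp.max())        # scale-invariant max pooling
    return np.array(feats)                  # feed to logistic reg / SVM

fake_det = lambda img, s: np.random.rand(int(64 * s), int(64 * s))
x = object_bank_feature(np.zeros((64, 64)), [fake_det] * 10)
print(x.shape)   # (20,)
```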

Posted Content
TL;DR: This paper addresses the problem of inferring rich semantics imparted by an object part in still images by proposing to tokenize the semantic space as a discrete set of part states, and presents a part state dataset which contains rich part-level semantic annotations.
Abstract: Important high-level vision tasks such as human-object interaction, image captioning and robotic manipulation require rich semantic descriptions of objects at part level. Based upon previous work on part localization, in this paper, we address the problem of inferring rich semantics imparted by an object part in still images. We propose to tokenize the semantic space as a discrete set of part states. Our modeling of part state is spatially localized, therefore, we formulate the part state inference problem as a pixel-wise annotation problem. An iterative part-state inference neural network is specifically designed for this task, which is efficient in time and accurate in performance. Extensive experiments demonstrate that the proposed method can effectively predict the semantic states of parts and simultaneously correct localization errors, thus benefiting a few visual understanding applications. The other contribution of this paper is our part state dataset which contains rich part-level semantic annotations.

Ziyan Guo, T.L.T. Lun, Yu Chen, Hao Su, D.T.M. Chan, Ka-Wai Kwok
01 Jan 2016
TL;DR: A novel MR-safe pneumatic stepper motor is proposed whose design can be relatively compact and flexibly customized for various actuation requirements; it has the potential to be incorporated into an MRI-compatible robot for needle manipulation during intra-operative procedures, e.g. prostate surgery.
Abstract: The superiority of magnetic resonance imaging (MRI) is well known: it provides non-ionizing, noninvasive and high-contrast imaging, in particular for soft tissues [1]. These advantages have prompted the use of MRI in various surgical interventions ranging from neurosurgery and cardiac ablation to prostate biopsies. However, MR-safe mechatronics is still confronted by a fundamental challenge, namely to maintain zero interference with imaging during MRI-guided navigation. Currently, several types of MRI-compatible actuation have been explored under different MR-safety conditions [2]: 1) electric actuators, e.g. piezoelectric motors and ultrasonic motors; 2) fluid-power motors; 3) MR-powered actuators. In terms of adaptability to a general hospital setup, image quality and MR-safety, pneumatic actuators are advantageous in both material and energetic considerations. The material of pneumatic actuators can be non-magnetic and non-conducting, minimizing effects on the homogeneity of the magnetic field. Pressurized clean air as a power supply is commonly available in MRI scanner rooms. This ensures zero image artifacts caused by electromagnetic (EM) interference from electricity. Pneumatic stepper actuators capable of generating accurate stepwise motion have been introduced recently. Stoianovici et al. [3] invented the first MR-safe pneumatic stepper motor, which comprises three actuated diaphragms driving a hoop gear. Several pneumatic stepper motors [4-6] have subsequently been developed and their performance (e.g. torque-speed) has been demonstrated. However, many technical challenges remain unaddressed in these motors, e.g. typically large motor size, the high cost of complicated fabrication and sterilization, unsatisfactory signal-to-noise ratio (SNR), and image distortion induced by the proximal electronics and valves of motor drivers. In this paper, we propose a novel MR-safe pneumatic stepper motor whose design can be relatively compact and flexibly customized for various actuation requirements. Such a motor can be made of a homogeneous material for ease of miniaturization and reconfiguration. One set of design parameters is selected for the experimental evaluation. Self-locking and high speed (up to 160 RPM) are achieved in both rotation directions. Steady torque within a wide range of speeds can also be preserved. Low imaging interference has been experimentally demonstrated while operating the motor inside the MRI scanner. Given these specifications, this motor has the potential to be incorporated into an MRI-compatible robot for needle manipulation during intra-operative procedures, e.g. prostate surgery.

Posted Contentā€¢
TL;DR: This work focuses on the non-Lambertian object-level intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object, and shows that feature learning at the encoder stage is more crucial for developing a universal representation across categories.
Abstract: We consider the non-Lambertian object intrinsic problem of recovering diffuse albedo, shading, and specular highlights from a single image of an object. We build a large-scale object intrinsics database based on existing 3D models in the ShapeNet database. Rendered with realistic environment maps, millions of synthetic images of objects and their corresponding albedo, shading, and specular ground-truth images are used to train an encoder-decoder CNN. Once trained, the network can decompose an image into the product of albedo and shading components, along with an additive specular component. Our CNN delivers accurate and sharp results in this classical inverse problem of computer vision, with the sharp details attributed to skip-layer connections at corresponding resolutions from the encoder to the decoder. Benchmarked on our ShapeNet and MIT intrinsics datasets, our model consistently outperforms the state-of-the-art by a large margin. We train and test our CNN on different object categories. Perhaps surprisingly, especially from the CNN classification perspective, our intrinsics CNN generalizes very well across categories. Our analysis shows that feature learning at the encoder stage is more crucial for developing a universal representation across categories. We apply our synthetic-data-trained model to images and videos downloaded from the internet, and observe robust and realistic intrinsics results. Quality non-Lambertian intrinsics could open up many interesting applications such as image-based albedo and specular editing.
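
The decomposition being learned can be written as I = albedo × shading + specular. A hedged sketch of the corresponding reconstruction-plus-supervision loss for an encoder-decoder that predicts the three components; the paper's training objective is richer than this.

```python
# Non-Lambertian intrinsic decomposition loss: the predicted components
# must recompose the input image, and each component is also supervised
# against its ground-truth map.
import torch

def intrinsics_loss(img, albedo, shading, specular, gt):
    recon = albedo * shading + specular          # non-Lambertian model
    return (torch.nn.functional.mse_loss(recon, img)
            + sum(torch.nn.functional.mse_loss(p, g)
                  for p, g in zip((albedo, shading, specular), gt)))

B, H, W = 2, 64, 64
img = torch.rand(B, 3, H, W)
preds = [torch.rand(B, 3, H, W, requires_grad=True) for _ in range(3)]
loss = intrinsics_loss(img, *preds, gt=[torch.rand(B, 3, H, W)] * 3)
loss.backward()
```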