
Showing papers on "Orientation (computer vision) published in 2020"


Posted Content
TL;DR: The framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity, and refines these estimates using additional point features on the object.
Abstract: Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, CenterPoint outperforms all previous single-model methods by a large margin and ranks first among all Lidar-only submissions. The code and pretrained models are available at this https URL.
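
The tracking step described above reduces to associating current detections with previous-frame centers by distance. A minimal sketch of such greedy closest-point matching, assuming hypothetical NumPy arrays of centers and regressed velocities (the distance threshold is illustrative, not the paper's setting):

```python
import numpy as np

def greedy_center_matching(prev_centers, curr_centers, curr_velocities, dt, max_dist=2.0):
    """Greedily match current detections to previous tracks by closest center.

    prev_centers: (M, 3) object centers from the previous frame.
    curr_centers: (N, 3) detected centers in the current frame.
    curr_velocities: (N, 3) regressed velocities, used to back-project current
        centers to the previous timestamp before matching.
    Returns (curr_idx, prev_idx) pairs; unmatched detections would start new tracks.
    """
    projected = curr_centers - curr_velocities * dt          # back-project with velocity
    dists = np.linalg.norm(projected[:, None, :] - prev_centers[None, :, :], axis=-1)

    matches, used_curr, used_prev = [], set(), set()
    # Visit detection/track pairs in order of increasing distance (greedy).
    for curr_idx, prev_idx in zip(*np.unravel_index(np.argsort(dists, axis=None), dists.shape)):
        if curr_idx in used_curr or prev_idx in used_prev:
            continue
        if dists[curr_idx, prev_idx] > max_dist:
            break  # remaining pairs are even farther apart
        matches.append((int(curr_idx), int(prev_idx)))
        used_curr.add(curr_idx)
        used_prev.add(prev_idx)
    return matches
```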

397 citations


Book ChapterDOI
Yi Li1, Gu Wang1, Xiangyang Ji1, Yu Xiang2, Dieter Fox2 
01 Mar 2020
TL;DR: A novel deep neural network for 6D pose matching named DeepIM is proposed that is able to iteratively refine the pose by matching the rendered image against the observed image.
Abstract: Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
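
The core of the approach is a render-and-compare loop: render the object at the current pose estimate, let a network predict a relative transform, apply it, and repeat. A minimal sketch of that loop, with the renderer and the pose-update network left as hypothetical callables and poses represented as 4x4 homogeneous matrices:

```python
import numpy as np

def refine_pose(observed_img, init_pose, render_fn, predict_delta_fn, num_iters=4):
    """Generic render-and-compare refinement in the spirit of DeepIM.

    init_pose: (4, 4) homogeneous object-to-camera transform (initial estimate).
    render_fn(pose) -> rendered image of the object at that pose (placeholder).
    predict_delta_fn(observed, rendered) -> (4, 4) relative transform predicted
        by a network (placeholder); identity once the rendering matches.
    """
    pose = np.array(init_pose, dtype=np.float64)
    for _ in range(num_iters):
        rendered = render_fn(pose)                        # synthesize current hypothesis
        delta = predict_delta_fn(observed_img, rendered)  # relative SE(3) update
        pose = delta @ pose                               # apply update, then iterate
    return pose
```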

340 citations


Posted Content
TL;DR: This work proposes an efficient and accurate monocular 3D detection framework in single shot that achieves state-of-the-art performance on the KITTI benchmark and predicts the nine perspective keypoints of a 3D bounding box in image space, and utilizes the geometric relationship of 3D and 2D perspectives.
Abstract: In this work, we propose an efficient and accurate monocular 3D detection framework in single shot. Most successful 3D detectors take the projection constraint from the 3D bounding box to the 2D box as an important component. Four edges of a 2D box provide only four constraints, and the performance deteriorates dramatically with even a small error of the 2D detector. Different from these approaches, our method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilizes the geometric relationship of 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables us to obtain fast detection speed with a small architecture. Training our method only uses the 3D properties of the object without the need for external networks or supervision data. Our method is the first real-time system for monocular image 3D detection while achieving state-of-the-art performance on the KITTI benchmark. Code will be released at this https URL.

167 citations


Journal ArticleDOI
Kun Fu1, Zhonghan Chang1, Yue Zhang1, Guangluan Xu1, Keshu Zhang1, Xian Sun1 
TL;DR: A unified framework upon the region-based convolutional neural network for arbitrary-oriented and multi-scale object detection in remote sensing images is built and achieves state-of-the-art performance, which demonstrates the effectiveness of the proposed methods.
Abstract: Object detection plays an important role in the field of remote sensing imagery analysis. The most challenging issues in advancing this task are the large variation in object scales and the arbitrary orientation of objects. In this paper, we build a unified framework upon the region-based convolutional neural network for arbitrary-oriented and multi-scale object detection in remote sensing images. To handle the problem of multi-scale object detection, a feature-fusion architecture is proposed to generate a multi-scale feature hierarchy, which augments the features of shallow layers with semantic representations via a top-down pathway and combines the feature maps of top layers with low-level information by a bottom-up pathway. By combining features of different levels, we can form a powerful feature representation for multi-scale objects. Most previous methods locate objects with arbitrary orientations and dense spatial distributions via axis-aligned boxes, which may cover adjacent instances and background areas. We build a rotation-aware object detector that uses oriented boxes to localize objects in remote sensing images. The region proposal network augments the anchors with multiple default angles to cover oriented objects. It utilizes oriented proposal boxes to enclose objects rather than horizontal proposals that coarsely locate oriented objects. The orientation RoI pooling operation is introduced to extract the feature maps of oriented proposals for the following R-CNN subnetwork. We conduct comprehensive experiments on a public dataset for oriented object detection in remote sensing images. Our method achieves state-of-the-art performance, which demonstrates the effectiveness of the proposed methods.

165 citations


Posted Content
TL;DR: This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors, by resorting to deformable convolutional networks to densely estimate and apply local transformation in ASLFeat.
Abstract: This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors. First, the ability to estimate the local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, while shape-awareness is crucial for acquiring stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three light-weight yet effective modifications to mitigate the above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformation. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is extensively conducted across a variety of practical scenarios. State-of-the-art results are reported that demonstrate the superiority of our methods.
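
The "peakiness" idea is to score a location highly when its response stands out both against other channels and against its spatial neighborhood. A simplified NumPy illustration of such a score (not the paper's exact formulation; the window size and softplus choice are assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def peakiness_score(feat, window=3):
    """Simplified keypoint score over a dense feature map.

    feat: (C, H, W) feature responses. A response gets a high score when it
    exceeds both its channel-wise mean and its local spatial average.
    """
    channel_mean = feat.mean(axis=0, keepdims=True)                # (1, H, W)
    spatial_mean = uniform_filter(feat, size=(1, window, window))  # (C, H, W) local average
    alpha = np.logaddexp(0, feat - spatial_mean)   # softplus: spatial peakiness
    beta = np.logaddexp(0, feat - channel_mean)    # softplus: channel-wise peakiness
    return (alpha * beta).max(axis=0)              # (H, W) detection score map
```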

125 citations


Journal ArticleDOI
TL;DR: In this article, a deep neural network architecture for human activity recognition based on multiple sensor data is proposed, which encodes the time series of sensor data as images and leverages these transformed images to retain the necessary features for activity recognition.

123 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: ASLFeat as mentioned in this paper uses deformable convolutional networks to densely estimate and apply local transformations, and takes advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization.
Abstract: This work focuses on mitigating two limitations in the joint learning of local feature detectors and descriptors. First, the ability to estimate the local shape (scale, orientation, etc.) of feature points is often neglected during dense feature extraction, while shape-awareness is crucial for acquiring stronger geometric invariance. Second, the localization accuracy of detected keypoints is not sufficient to reliably recover camera geometry, which has become the bottleneck in tasks such as 3D reconstruction. In this paper, we present ASLFeat, with three light-weight yet effective modifications to mitigate the above issues. First, we resort to deformable convolutional networks to densely estimate and apply local transformation. Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization. Finally, we use a peakiness measurement to relate feature responses and derive more indicative detection scores. The effect of each modification is thoroughly studied, and the evaluation is extensively conducted across a variety of practical scenarios. State-of-the-art results are reported that demonstrate the superiority of our methods.

118 citations


Journal ArticleDOI
TL;DR: This novel 3D orientation estimation is based on a variant of the Denoising Autoencoder that is trained on simulated views of a 3D model using Domain Randomization and achieves state-of-the-art performance on the T-LESS dataset both in the RGB and RGB-D domain.
Abstract: We propose a real-time RGB-based pipeline for object detection and 6D pose estimation. Our novel 3D orientation estimation is based on a variant of the Denoising Autoencoder that is trained on simulated views of a 3D model using Domain Randomization. This so-called Augmented Autoencoder has several advantages over existing methods: It does not require real, pose-annotated training data, generalizes to various test sensors and inherently handles object and view symmetries. Instead of learning an explicit mapping from input images to object poses, it provides an implicit representation of object orientations defined by samples in a latent space. Our pipeline achieves state-of-the-art performance on the T-LESS dataset in both the RGB and RGB-D domains. We also evaluate on the LineMOD dataset where we can compete with other synthetically trained approaches. We further increase performance by correcting 3D orientation estimates to account for perspective errors when the object deviates from the image center and show extended results. Our code is available here https://github.com/DLR-RM/AugmentedAutoencoder.
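
Because orientation is represented implicitly by samples in a latent space, pose lookup amounts to comparing the encoding of the observed crop against a codebook of encodings rendered at known rotations. A minimal sketch of that lookup, assuming hypothetical codebook arrays:

```python
import numpy as np

def estimate_orientation(query_latent, codebook_latents, codebook_rotations):
    """Codebook-style orientation lookup in the spirit of the Augmented Autoencoder.

    query_latent: (D,) encoding of the detected object crop.
    codebook_latents: (K, D) encodings of synthetic views at known rotations.
    codebook_rotations: (K, 3, 3) rotation matrices used to render those views.
    Returns the rotation whose latent code is most similar (cosine similarity).
    """
    q = query_latent / np.linalg.norm(query_latent)
    z = codebook_latents / np.linalg.norm(codebook_latents, axis=1, keepdims=True)
    sims = z @ q                          # cosine similarity to every codebook entry
    return codebook_rotations[np.argmax(sims)]
```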

111 citations


Proceedings ArticleDOI
14 Jun 2020
TL;DR: A Dynamic Similarity Matching network is designed to estimate cross-view orientation alignment during localization and improves state-of-the-art performance on large-scale geo-localization datasets.
Abstract: Cross-view geo-localization is the problem of estimating the position and orientation (latitude, longitude and azimuth angle) of a camera at ground level given a large-scale database of geo-tagged aerial (e.g., satellite) images. Existing approaches treat the task as a pure location estimation problem by learning discriminative feature descriptors, but neglect orientation alignment. It is well-recognized that knowing the orientation between ground and aerial images can significantly reduce matching ambiguity between these two views, especially when the ground-level images have a limited Field of View (FoV) instead of a full field-of-view panorama. Therefore, we design a Dynamic Similarity Matching network to estimate cross-view orientation alignment during localization. In particular, we address the cross-view domain gap by applying a polar transform to the aerial images to approximately align the images up to an unknown azimuth angle. Then, a two-stream convolutional network is used to learn deep features from the ground and polar-transformed aerial images. Finally, we obtain the orientation by computing the correlation between cross-view features, which also provides a more accurate measure of feature similarity, improving location recall. Experiments on standard datasets demonstrate that our method significantly improves state-of-the-art performance. Remarkably, we improve the top-1 location recall rate on the CVUSA dataset by a factor of 1.5x for panoramas with known orientation, by a factor of 3.3x for panoramas with unknown orientation, and by a factor of 6x for 180-degree FoV images with unknown orientation.
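
Once both views share a polar representation, the orientation can be read off as the horizontal shift that maximizes the correlation between the two feature maps. A simplified NumPy analogue of that similarity matching, with hypothetical (C, H, W) feature maps whose width spans 360 degrees:

```python
import numpy as np

def estimate_azimuth_shift(ground_feat, aerial_feat):
    """Find the azimuth shift that best aligns a ground-view feature map with a
    polar-transformed aerial feature map via circular cross-correlation.

    ground_feat: (C, H, Wg) features of the ground image (Wg <= W for limited FoV).
    aerial_feat: (C, H, W) features of the polar-transformed aerial image.
    Returns the estimated azimuth in degrees and the full correlation profile.
    """
    C, H, W = aerial_feat.shape
    scores = np.empty(W)
    for shift in range(W):
        rolled = np.roll(aerial_feat, shift, axis=2)              # rotate aerial features in azimuth
        scores[shift] = np.sum(ground_feat * rolled[:, :, :ground_feat.shape[2]])
    best = int(np.argmax(scores))
    return best * 360.0 / W, scores
```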

104 citations


Journal ArticleDOI
TL;DR: A systematic survey of the state-of-the-art for match pair selection from both ordered and unordered datasets, for outlier removal of initial matches dominated by outliers, and for efficiency improvement of BA is given, and an experimental evaluation of six well-known SfM-based software packages on UAV image orientation is conducted.
Abstract: Unmanned aerial vehicle (UAV) images have gained extensive attention in varying fields, and the Structure from Motion (SfM) technique has become the gold standard for aerial triangulation of UAV images. With increasing data volume caused by the use of multi-view and high-resolution imaging systems and the enhanced endurance of UAV platforms, the capability to orient large-scale UAV image sets is becoming a prominent and necessary feature for SfM-based solutions. A classical SfM pipeline consists of three major steps, i.e., (i) feature extraction for an individual image, (ii) feature matching for each image pair, and (iii) parameter solving based on iterative bundle adjustment. Most of the time costs are consumed in the second and third steps. This can be explained from three main aspects. First, for feature matching, the large number of images and high overlapping degrees cause high combinatorial complexity of match pairs. Second, the efficiency of commonly utilized techniques for outlier removal is seriously degraded by the high outlier ratios of initial matches. Third, for parameter solving of camera poses and scene structures, the iterative execution of bundle adjustment (BA) leads to high computational costs in the incremental SfM workflow. Thus, this paper gives a systematic survey of the state-of-the-art for match pair selection from both ordered and unordered datasets, for outlier removal of initial matches dominated by outliers, and for efficiency improvement of BA, and conducts an experimental evaluation of six well-known SfM-based software packages on UAV image orientation.

102 citations


Journal ArticleDOI
TL;DR: A content-based image retrieval (CBIR) system has been proposed to extract a feature vector from an image and to effectively retrieve images based on their content.
Abstract: The volume of content in digital images keeps expanding, and locating a specific image based on its content in a huge database is often difficult. In this paper, a content-based image retrieval (CBIR) system is proposed to extract a feature vector from an image and to effectively retrieve images based on that content. Two types of image feature descriptor extraction methods are considered, namely Oriented FAST and Rotated BRIEF (ORB) and the scale-invariant feature transform (SIFT). The ORB detector uses FAST keypoints together with a BRIEF descriptor, while SIFT analyzes images across various orientations and scales. A k-means clustering algorithm is applied to both sets of descriptors, from which the mean of every cluster is obtained. A locality-preserving projection dimensionality-reduction algorithm is used to reduce the dimensions of the image feature vector. At retrieval time, the image feature vectors stored in the image database are matched against the feature vector of the test data for CBIR. The performance of the proposed work is assessed using decision tree, random forest, and MLP classifiers. Two public databases, namely the Wang database and the Corel database, are considered for the experiments. The combination of ORB and SIFT feature vectors is tested on images from both databases and achieves the highest precision rates of 99.53% and 86.20% for the Corel database and the Wang database, respectively.
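
A rough sketch of the described pipeline using OpenCV and scikit-learn: extract ORB and SIFT descriptors, cluster each set with k-means, and concatenate the cluster centers into one vector. The cluster count and fusion step are illustrative choices; the LPP reduction and classifier stages described above are omitted here.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def cluster_center_vector(descriptors, k=32):
    """Cluster local descriptors with k-means and flatten the cluster centres
    into one vector (the per-cluster means mentioned in the abstract)."""
    descriptors = np.float32(descriptors)
    km = KMeans(n_clusters=min(k, len(descriptors)), n_init=10).fit(descriptors)
    return km.cluster_centers_.flatten()

def image_feature_vector(image_path, k=32):
    """ORB + SIFT feature vector for one image (hypothetical path argument)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, orb_desc = cv2.ORB_create().detectAndCompute(gray, None)    # binary descriptors
    _, sift_desc = cv2.SIFT_create().detectAndCompute(gray, None)  # 128-D float descriptors
    return np.concatenate([cluster_center_vector(orb_desc, k),
                           cluster_center_vector(sift_desc, k)])
```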

Book ChapterDOI
10 Jan 2020
TL;DR: In this paper, the nine perspective keypoints of a 3D bounding box in image space are predicted and the geometric relationship of 3D and 2D perspectives is utilized to recover the dimension, location, and orientation in 3D space.
Abstract: In this work, we propose an efficient and accurate monocular 3D detection framework in single shot. Most successful 3D detectors take the projection constraint from the 3D bounding box to the 2D box as an important component. Four edges of a 2D box provide only four constraints, and the performance deteriorates dramatically with even a small error of the 2D detector. Different from these approaches, our method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilizes the geometric relationship of 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables us to obtain fast detection speed with a small architecture. Training our method only uses the 3D properties of the object without any extra annotations, category-specific 3D shape priors, or depth maps. Our method is the first real-time system (FPS > 24) for monocular image 3D detection while achieving state-of-the-art performance on the KITTI benchmark.

Posted Content
TL;DR: This paper explores a relatively less-studied methodology based on classification for rotation detection, and proposes new techniques to inherently dismiss the boundary discontinuity issue as encountered by the regression-based detectors.
Abstract: Rotation detection serves as a fundamental building block in many visual applications involving aerial images, scene text, faces, etc. Differing from the dominant regression-based approaches for orientation estimation, this paper explores a relatively less-studied methodology based on classification. The hope is to inherently dismiss the boundary discontinuity issue encountered by regression-based detectors. We propose new techniques to push its frontier in two aspects: i) new encoding mechanism: the design of two Densely Coded Labels (DCL) for angle classification, to replace the Sparsely Coded Label (SCL) in existing classification-based detectors, leading to a three-times training speed increase, as empirically observed across benchmarks, together with a notable improvement in detection accuracy; ii) loss re-weighting: we propose Angle Distance and Aspect Ratio Sensitive Weighting (ADARSW), which improves the detection accuracy, especially for square-like objects, by making DCL-based detectors sensitive to angular distance and the object's aspect ratio. Extensive experiments and visual analysis on large-scale public datasets for aerial images, i.e., DOTA, UCAS-AOD, and HRSC2016, as well as the scene text datasets ICDAR2015 and MLT, show the effectiveness of our approach. The source code is available at this https URL and is also integrated in our open source rotation detection benchmark: this https URL.
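
The classification view replaces a one-hot ("sparsely coded") angle label with a short dense binary code over discretized angle bins. One plausible way to encode and decode such labels, using a Gray code so that neighboring bins differ in few bits; the bin width, bit count, and Gray-code choice are assumptions for illustration, not the paper's exact coding.

```python
import numpy as np

def angle_to_dense_code(angle_deg, omega=1.0, n_bits=8, gray=True):
    """Encode an orientation angle as a dense binary label: the angle range is
    discretized into 2**n_bits bins of width `omega` degrees, and the bin index
    is coded with n_bits outputs instead of a one-hot vector. Illustrative only."""
    idx = int(angle_deg / omega) % (2 ** n_bits)
    if gray:
        idx ^= idx >> 1                               # binary-reflected Gray code
    return np.array([(idx >> b) & 1 for b in reversed(range(n_bits))], dtype=np.float32)

def dense_code_to_angle(code, omega=1.0, gray=True):
    """Decode binarized network outputs back to an angle (inverse of the above)."""
    idx = int("".join(str(int(round(b))) for b in code), 2)
    if gray:
        mask = idx >> 1
        while mask:                                   # invert the Gray code
            idx ^= mask
            mask >>= 1
    return idx * omega
```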

Posted Content
TL;DR: A learning-based system that estimates the camera position and orientation from a single input image relative to a known environment using a deep neural network and fully differentiable pose optimization achieves state-of-the-art accuracy on various public datasets for RGB-based re-localization, and competitive accuracy for RGB-D-based re-localization.
Abstract: We describe a learning-based system that estimates the camera position and orientation from a single input image relative to a known environment. The system is flexible w.r.t. the amount of information available at test and at training time, catering to different applications. Input images can be RGB-D or RGB, and a 3D model of the environment can be utilized for training but is not necessary. In the minimal case, our system requires only RGB images and ground truth poses at training time, and it requires only a single RGB image at test time. The framework consists of a deep neural network and fully differentiable pose optimization. The neural network predicts so-called scene coordinates, i.e. dense correspondences between the input image and 3D scene space of the environment. The pose optimization implements robust fitting of pose parameters using differentiable RANSAC (DSAC) to facilitate end-to-end training. The system, an extension of DSAC++ and referred to as DSAC*, achieves state-of-the-art accuracy on various public datasets for RGB-based re-localization, and competitive accuracy for RGB-D-based re-localization.
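
At test time, dense scene-coordinate predictions plus camera intrinsics are enough to recover the pose with a robust PnP solver. The sketch below uses OpenCV's standard (non-differentiable) PnP + RANSAC as a conventional stand-in for the differentiable DSAC optimization described above; arrays and parameters are hypothetical.

```python
import cv2
import numpy as np

def pose_from_scene_coordinates(scene_coords, camera_K, reproj_err=3.0):
    """Recover the camera pose from per-pixel scene-coordinate predictions.

    scene_coords: (H, W, 3) predicted 3D scene point for every pixel.
    camera_K: (3, 3) intrinsic matrix.
    In practice a subsample of pixels would be used; here all pixels are passed.
    """
    H, W, _ = scene_coords.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    img_pts = np.stack([u, v], axis=-1).reshape(-1, 2).astype(np.float64)
    obj_pts = scene_coords.reshape(-1, 3).astype(np.float64)

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, camera_K.astype(np.float64), None,
        reprojectionError=reproj_err, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)            # world-to-camera rotation (orientation)
    return R, tvec, inliers
```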

Journal ArticleDOI
TL;DR: A novel rotation detector for remote sensing images, mainly inspired by Mask R-CNN, namely RADet, is proposed, which can obtain the rotation bounding box of objects with a shape mask predicted by the mask branch, and is shown to outperform existing leading object detectors in the remote sensing field.
Abstract: Object detection has made significant progress in many real-world scenes. Despite this remarkable progress, the common use case of detection in remote sensing images remains challenging even for leading object detectors, due to the complex background, objects with arbitrary orientation, and large difference in scale of objects. In this paper, we propose a novel rotation detector for remote sensing images, mainly inspired by Mask R-CNN, namely RADet. RADet can obtain the rotation bounding box of objects with the shape mask predicted by the mask branch, which is a novel, simple and effective way to get the rotation bounding box of objects. Specifically, a refine feature pyramid network is devised with an improved building block constructing top-down feature maps, to solve the problem of large difference in scales. Meanwhile, the position attention network and the channel attention network are jointly explored by modeling the spatial position dependence between global pixels and highlighting the object feature, for detecting small objects surrounded by complex background. Extensive experiments on two remote sensing public datasets, DOTA and NWPU VHR-10, show our method to outperform existing leading object detectors in the remote sensing field.
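
The step of turning a predicted instance mask into an oriented box can be illustrated with OpenCV's minimum-area rotated rectangle; this is a generic stand-in for the mask-to-rotated-box conversion described above, not the paper's exact implementation.

```python
import cv2
import numpy as np

def rotated_box_from_mask(mask):
    """Fit an oriented bounding box to the foreground pixels of a binary mask.

    mask: (H, W) boolean or 0/1 array from a mask branch.
    Returns (cx, cy, w, h, angle) in OpenCV's minAreaRect convention, or None
    if the mask is empty.
    """
    mask_u8 = mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    pts = np.concatenate(contours).reshape(-1, 2)     # all contour points
    (cx, cy), (w, h), angle = cv2.minAreaRect(pts)    # oriented box parameters
    return cx, cy, w, h, angle
```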

Book ChapterDOI
Xu Chen1, Zijian Dong1, Jie Song1, Andreas Geiger2, Otmar Hilliges1 
23 Aug 2020
TL;DR: This paper combines a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering the need for explicit CAD models per object instance unnecessary.
Abstract: Many object pose estimation algorithms rely on the analysis-by-synthesis framework which requires explicit representations of individual object instances. In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering the need for explicit CAD models per object instance unnecessary. The image synthesis network is designed to efficiently span the pose configuration space so that model capacity can be used to capture the shape and local appearance (i.e., texture) variations jointly. At inference time the synthesized images are compared to the target via an appearance based loss and the error signal is backpropagated through the network to the input parameters. Keeping the network parameters fixed, this allows for iterative optimization of the object pose, shape and appearance in a joint manner and we experimentally show that the method can recover orientation of objects with high accuracy from 2D images alone. When provided with depth measurements, to overcome scale ambiguities, the method can accurately recover the full 6DOF pose successfully.

Journal ArticleDOI
06 May 2020-Sensors
TL;DR: An overview of the computer vision based indoor localization domain is offered, presenting application areas, commercial tools, existing benchmarks, and other reviews, and proposing a new classification based on the configuration stage (use of known environment data), sensing devices, type of detected elements, and localization method.
Abstract: Computer vision based indoor localization methods use either an infrastructure of static cameras to track mobile entities (e.g., people, robots) or cameras attached to the mobile entities. Methods in the first category employ object tracking, while the others map images from mobile cameras with images acquired during a configuration stage or extracted from 3D reconstructed models of the space. This paper offers an overview of the computer vision based indoor localization domain, presenting application areas, commercial tools, existing benchmarks, and other reviews. It provides a survey of indoor localization research solutions, proposing a new classification based on the configuration stage (use of known environment data), sensing devices, type of detected elements, and localization method. It groups 70 of the most recent and relevant image based indoor localization methods according to the proposed classification and discusses their advantages and drawbacks. It highlights localization methods that also offer orientation information, as this is required by an increasing number of applications of indoor localization (e.g., augmented reality).

Journal ArticleDOI
TL;DR: A novel method is proposed for measuring the full-field displacement response of a vibrating continuous edge of a structural component from its video; the method shows high correspondence between the actual motion of the cable and the traced motion of its edge over time.

Book ChapterDOI
Fangyun Wei1, Xiao Sun1, Hongyang Li2, Jingdong Wang1, Stephen Lin1 
23 Aug 2020
TL;DR: Point-Set Anchors as discussed by the authors proposes to regress from a set of points placed at more advantageous positions, which are arranged to reflect a good initialization for the given task such as modes in the training data for pose estimation, which lie closer to the ground truth than the central point and provide more informative features for regression.
Abstract: A recent approach for object detection and human pose estimation is to regress bounding boxes or human keypoints from a central point on the object or person. While this center-point regression is simple and efficient, we argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation. To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions. This point set is arranged to reflect a good initialization for the given task, such as modes in the training data for pose estimation, which lie closer to the ground truth than the central point and provide more informative features for regression. As the utility of a point set depends on how well its scale, aspect ratio and rotation matches the target, we adopt the anchor box technique of sampling these transformations to generate additional point-set candidates. We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation. Our results show that this general-purpose approach can achieve performance competitive with state-of-the-art methods for each of these tasks.

Journal ArticleDOI
TL;DR: This paper proposes fully convolutional neural networks to perform automatic background subtraction for leaf images captured in mobile applications and reports state-of-the-art leaf image segmentation performance.

Journal ArticleDOI
TL;DR: A new panorama-oriented model, to predict head and eye movements, is proposed, and the spherical harmonics are employed to extract features at different frequency bands and orientations to estimate the saliency.
Abstract: By recording the whole scene around the capturer, virtual reality (VR) techniques can provide viewers the sense of presence. To provide a satisfactory quality of experience, there should be at least 60 pixels per degree, so the resolution of panoramas should reach 21600 × 10800. The huge amount of data will put great demands on data processing and transmission. However, when exploring in the virtual environment, viewers only perceive the content in the current field of view (FOV). Therefore if we can predict the head and eye movements which are important behaviors of viewer, more processing resources can be allocated to the active FOV. But conventional saliency prediction methods are not fully adequate for panoramic images. In this paper, a new panorama-oriented model, to predict head and eye movements, is proposed. Due to the superiority of computation in the spherical domain, the spherical harmonics are employed to extract features at different frequency bands and orientations. Related low- and high-level features including the rare components in the frequency domain and color domain, the difference between center vision and peripheral vision, visual equilibrium, person and car detection, and equator bias are extracted to estimate the saliency. To predict head movements, visual mechanisms including visual uncertainty and equilibrium are incorporated, and the graphical model and functional representation for the switch of head orientation are established. Extensive experimental results on the publicly available database demonstrate the effectiveness of our methods.

Journal ArticleDOI
TL;DR: A new one-stage anchor-free method to detect orientated objects in per-pixel prediction fashion with less computational complexity is proposed and a new aspect-ratio-aware orientation centerness method is proposed to better weigh positive pixel points, in order to guide the network to learn discriminative features from a complex background, which brings improvements for large aspect ratio object detection.
Abstract: Orientated object detection in aerial images is still a challenging task due to the bird’s eye view and the various scales and arbitrary angles of objects in aerial images. Most current methods for orientated object detection are anchor-based, which require considerable pre-defined anchors and are time consuming. In this article, we propose a new one-stage anchor-free method to detect orientated objects in per-pixel prediction fashion with less computational complexity. Arbitrary orientated objects are detected by predicting the axis of the object, which is the line connecting the head and tail of the object, and the width of the object is vertical to the axis. By predicting objects at the pixel level of feature maps directly, the method avoids setting a number of hyperparameters related to anchor and is computationally efficient. Besides, a new aspect-ratio-aware orientation centerness method is proposed to better weigh positive pixel points, in order to guide the network to learn discriminative features from a complex background, which brings improvements for large aspect ratio object detection. The method is tested on two common aerial image datasets, achieving better performance compared with most one-stage orientated methods and many two-stage anchor-based methods with a simpler procedure and lower computational complexity.

Journal ArticleDOI
TL;DR: DSF-CNN as mentioned in this paper uses group convolutions with multiple rotated copies of each filter in a densely connected framework, enabling exact rotation and decreasing the number of trainable parameters compared to standard filters.
Abstract: Histology images are inherently symmetric under rotation, where each orientation is equally as likely to appear. However, this rotational symmetry is not widely utilised as prior knowledge in modern Convolutional Neural Networks (CNNs), resulting in data hungry models that learn independent features at each orientation. Allowing CNNs to be rotation-equivariant removes the necessity to learn this set of transformations from the data and instead frees up model capacity, allowing more discriminative features to be learned. This reduction in the number of required parameters also reduces the risk of overfitting. In this paper, we propose Dense Steerable Filter CNNs (DSF-CNNs) that use group convolutions with multiple rotated copies of each filter in a densely connected framework. Each filter is defined as a linear combination of steerable basis filters, enabling exact rotation and decreasing the number of trainable parameters compared to standard filters. We also provide the first in-depth comparison of different rotation-equivariant CNNs for histology image analysis and demonstrate the advantage of encoding rotational symmetry into modern architectures. We show that DSF-CNNs achieve state-of-the-art performance, with significantly fewer parameters, when applied to three different tasks in the area of computational pathology: breast tumour classification, colon gland segmentation and multi-tissue nuclear segmentation.
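
The group-convolution idea can be illustrated by correlating an image with several rotated copies of one filter, producing one response per orientation; pooling over those orientations at the end gives a rotation-invariant output. A bare-bones SciPy sketch (the paper's steerable-basis parameterization and dense connectivity are not reproduced here):

```python
import numpy as np
from scipy.ndimage import correlate, rotate

def orientation_pooled_response(image, base_filter, n_orientations=8):
    """Correlate an image with rotated copies of one 2D filter and pool the
    resulting orientation channels.

    image: (H, W) array; base_filter: (k, k) array; n_orientations: number of
    rotated filter copies (illustrative value).
    """
    responses = []
    for i in range(n_orientations):
        angle = i * 360.0 / n_orientations
        f = rotate(base_filter, angle=angle, reshape=False)       # rotated filter copy
        responses.append(correlate(image, f, mode="nearest"))     # one orientation channel
    return np.max(np.stack(responses), axis=0)                    # orientation pooling
```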

Posted Content
Fangyun Wei1, Xiao Sun1, Hongyang Li2, Jingdong Wang1, Stephen Lin1 
TL;DR: This work argues that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation, and proposes to perform regression from a set of points placed at more advantageous positions to facilitate inference.
Abstract: A recent approach for object detection and human pose estimation is to regress bounding boxes or human keypoints from a central point on the object or person. While this center-point regression is simple and efficient, we argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation. To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions. This point set is arranged to reflect a good initialization for the given task, such as modes in the training data for pose estimation, which lie closer to the ground truth than the central point and provide more informative features for regression. As the utility of a point set depends on how well its scale, aspect ratio and rotation matches the target, we adopt the anchor box technique of sampling these transformations to generate additional point-set candidates. We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation. Our results show that this general-purpose approach can achieve performance competitive with state-of-the-art methods for each of these tasks. Code is available at this https URL.

Posted Content
TL;DR: This work proposes a data-driven optimization approach for long-term, 6D pose tracking, which aims to identify the optimal relative pose given the current RGB-D observation and a synthetic image conditioned on the previous best estimate and the object’s model.
Abstract: Tracking the 6D pose of objects in video sequences is important for robot manipulation. This task, however, introduces multiple challenges: (i) robot manipulation involves significant occlusions; (ii) data and annotations are troublesome and difficult to collect for 6D poses, which complicates machine learning solutions, and (iii) incremental error drift often accumulates in long-term tracking, necessitating re-initialization of the object's pose. This work proposes a data-driven optimization approach for long-term, 6D pose tracking. It aims to identify the optimal relative pose given the current RGB-D observation and a synthetic image conditioned on the previous best estimate and the object's model. The key contribution in this context is a novel neural network architecture, which appropriately disentangles the feature encoding to help reduce domain shift, and an effective 3D orientation representation via Lie Algebra. Consequently, even when the network is trained only with synthetic data, it can work effectively over real images. Comprehensive experiments over benchmarks - existing ones as well as a new dataset with significant occlusions related to object manipulation - show that the proposed approach achieves consistently robust estimates and outperforms alternatives, even though they have been trained with real images. The approach is also the most computationally efficient among the alternatives and achieves a tracking frequency of 90.9Hz.
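
Representing orientation updates in the Lie algebra so(3) means regressing a 3-vector and mapping it to a rotation matrix with the exponential map (Rodrigues' formula), which avoids Euler-angle discontinuities. A minimal sketch of that map; the paper's network and training details are not reproduced here.

```python
import numpy as np

def so3_exp(omega):
    """Map an axis-angle vector omega in so(3) to a rotation matrix (Rodrigues' formula).

    omega: (3,) vector whose direction is the rotation axis and whose norm is the
    rotation angle in radians.
    """
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)                      # near-zero rotation
    k = omega / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])          # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
```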

Journal Article
Linhao Li1, Zhiqiang Zhou1, Bo Wang1, Lingjuan Miao1, Hua Zong 
TL;DR: A novel CNN-based ship detection method that is able to predict the orientation and other variables independently, and yet more effectively, with a novel dual-branch regression network, based on the observation that the ship targets are nearly rotation-invariant in remote sensing images.
Abstract: Currently, reliable and accurate ship detection in optical remote sensing images is still challenging. Even the state-of-the-art convolutional neural network (CNN) based methods cannot obtain very satisfactory results. To more accurately locate the ships in diverse orientations, some recent methods conduct the detection via the rotated bounding box. However, it further increases the difficulty of detection, because an additional variable of ship orientation must be accurately predicted in the algorithm. In this paper, a novel CNN-based ship detection method is proposed, by overcoming some common deficiencies of current CNN-based methods in ship detection. Specifically, to generate rotated region proposals, current methods have to predefine multi-oriented anchors, and predict all unknown variables together in one regression process, limiting the quality of overall prediction. By contrast, we are able to predict the orientation and other variables independently, and yet more effectively, with a novel dual-branch regression network, based on the observation that the ship targets are nearly rotation-invariant in remote sensing images. Next, a shape-adaptive pooling method is proposed, to overcome the limitation of typical regular ROI-pooling in extracting the features of the ships with various aspect ratios. Furthermore, we propose to incorporate multilevel features via the spatially-variant adaptive pooling. This novel approach, called multilevel adaptive pooling, leads to a compact feature representation more qualified for the simultaneous ship classification and localization. Finally, detailed ablation study performed on the proposed approaches is provided, along with some useful insights. Experimental results demonstrate the great superiority of the proposed method in ship detection.

Journal ArticleDOI
05 Nov 2020-Sensors
TL;DR: A framework of detection, quantification, and localization of damage on a civil infrastructure using the proposed framework can directly be used in the prognosis of the structure’s ability to withstand service loads.
Abstract: Immediate assessment of structural integrity of important civil infrastructures, like bridges, hospitals, or dams, is of utmost importance after natural disasters. Currently, inspection is performed manually by engineers who look for local damages and their extent on significant locations of the structure to understand its implication on its global stability. However, the whole process is time-consuming and prone to human errors. Due to their size and extent, some regions of civil structures are hard to access for manual inspection. In such situations, a vision-based system of Unmanned Aerial Vehicles (UAVs) programmed with Artificial Intelligence algorithms may be an effective alternative to carry out a health assessment of civil infrastructures in a timely manner. This paper proposes a framework for achieving the above-mentioned goal using computer vision and deep learning algorithms for detection of cracks on the concrete surface from its image by carrying out image segmentation, i.e., classification of the pixels in an image of the concrete surface according to whether they belong to cracks or not. The image segmentation, or dense pixel-level classification, is carried out using a deep neural network architecture named U-Net. Further, morphological operations on the segmented images result in dense measurements of crack geometry, like length, width, area, and crack orientation for individual cracks present in the image. The efficacy and robustness of the proposed method as a viable real-life application were validated by carrying out a laboratory experiment of a four-point bending test on an 8-foot-long concrete beam, the video of which was recorded using both a UAV-mounted camera and a still ground-based video camera. Detection, quantification, and localization of damage on a civil infrastructure using the proposed framework can directly be used in the prognosis of the structure's ability to withstand service loads.
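
The morphological post-processing stage, measuring length, width, area, and orientation per crack from the binary segmentation, can be sketched with scikit-image as below; the pixel-to-millimeter scale is a hypothetical calibration value, and the exact operations used in the paper may differ.

```python
import numpy as np
from skimage.measure import label, regionprops
from skimage.morphology import skeletonize

def crack_metrics(mask, mm_per_px=1.0):
    """Per-crack geometry from a binary segmentation mask.

    Length is taken from the skeleton (centreline), mean width as area / length,
    and orientation from the region's principal axis.
    """
    results = []
    for region in regionprops(label(mask.astype(np.uint8))):
        crack = np.zeros_like(mask, dtype=bool)
        crack[tuple(region.coords.T)] = True           # isolate this connected crack
        length_px = skeletonize(crack).sum()           # centreline length in pixels
        width_px = region.area / max(length_px, 1)     # mean width = area / length
        results.append({
            "length_mm": length_px * mm_per_px,
            "mean_width_mm": width_px * mm_per_px,
            "area_mm2": region.area * mm_per_px ** 2,
            "orientation_deg": np.degrees(region.orientation),  # major-axis angle
        })
    return results
```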

Journal ArticleDOI
TL;DR: A novel IPN method based on a shoe-mounted micro-electro-mechanical systems inertial measurement unit and ultra-wideband is proposed; it is able to obtain high-precision position and orientation estimates at low cost while reducing system complexity.
Abstract: In the field of indoor pedestrian navigation (IPN), the orientation information of a pedestrian is often obtained by means of a strap-down inertial navigation system (SINS). To deal with the problem of divergence in SINS-based orientation estimates, additional orientation sensors, such as a camera, are needed to provide external orientation observations, resulting in increased cost and complexity of the system. Although a low-cost magnetometer (or compass) can be used, it is significantly affected by geomagnetic disturbances indoors. Besides, the magnetometer can only give the heading observation, which is insufficient to correct orientation errors in all three directions. In this paper, we propose a novel IPN method based on a shoe-mounted micro-electro-mechanical systems inertial measurement unit and ultra-wideband. The biggest advantage of this method is that it is able to obtain high-precision position and orientation estimates at low cost. In addition, in the proposed method, the data fusion is implemented by a quaternion Kalman filter which does not involve any complex linearization, and hence the system complexity is reduced. Experimental results show that a decimeter-level position accuracy is achieved and the orientation drifts can be limited to 0.066 radians in indoor environments.
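
A quaternion filter avoids Euler-angle linearization by propagating the orientation quaternion directly from gyroscope rates. A minimal sketch of that prediction step (the full Kalman update with UWB measurements is omitted; first-order integration and the [w, x, y, z] convention are assumptions for illustration):

```python
import numpy as np

def propagate_quaternion(q, gyro, dt):
    """Propagate an orientation quaternion with one gyroscope reading.

    q: (4,) unit quaternion [w, x, y, z].
    gyro: (3,) angular rate in rad/s in the body frame.
    Implements first-order integration of q_dot = 0.5 * Omega(gyro) * q.
    """
    wx, wy, wz = gyro
    omega = np.array([[0, -wx, -wy, -wz],
                      [wx,  0,  wz, -wy],
                      [wy, -wz,  0,  wx],
                      [wz,  wy, -wx,  0]])
    q = q + 0.5 * dt * omega @ q
    return q / np.linalg.norm(q)          # renormalize to keep a unit quaternion
```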

Book ChapterDOI
23 Aug 2020
TL;DR: The AIM 2020 challenge on virtual image relighting and illumination estimation as discussed by the authors focused on one-to-one relighting, where the objective was to relight an input photo of a scene with a different color temperature and illuminant orientation.
Abstract: We review the AIM 2020 challenge on virtual image relighting and illumination estimation. This paper presents the novel VIDIT dataset used in the challenge and the different proposed solutions and final evaluation results over the 3 challenge tracks. The first track considered one-to-one relighting; the objective was to relight an input photo of a scene with a different color temperature and illuminant orientation (i.e., light source position). The goal of the second track was to estimate illumination settings, namely the color temperature and orientation, from a given image. Lastly, the third track dealt with any-to-any relighting, thus a generalization of the first track. The target color temperature and orientation, rather than being pre-determined, are instead given by a guide image. Participants were allowed to make use of their track 1 and 2 solutions for track 3. The tracks had 94, 52, and 56 registered participants, respectively, leading to 20 confirmed submissions in the final competition stage.

Journal ArticleDOI
TL;DR: Two structure from motion strategies are provided, which use trajectory information provided by an onboard survey-grade global navigation satellite system/inertial navigation system (GNSS/INS) and system calibration parameters to generate denser and more accurate 3D point clouds as well as orthophotos without any gaps.
Abstract: Imagery acquired by unmanned aerial vehicles (UAVs) has been widely used for three-dimensional (3D) reconstruction/modeling in various digital agriculture applications, such as phenotyping, crop monitoring, and yield prediction. 3D reconstruction from well-textured UAV-based images has matured and the user community has access to several commercial and open-source tools that provide accurate products at a high level of automation. However, in some applications, such as digital agriculture, due to repetitive image patterns, these approaches are not always able to produce reliable/complete products. The main limitation of these techniques is their inability to establish a sufficient number of correctly matched features among overlapping images, causing incomplete and/or inaccurate 3D reconstruction. This paper provides two structure from motion (SfM) strategies, which use trajectory information provided by an onboard survey-grade global navigation satellite system/inertial navigation system (GNSS/INS) and system calibration parameters. The main difference between the proposed strategies is that the first one—denoted as partially GNSS/INS-assisted SfM—implements the four stages of an automated triangulation procedure, namely, image matching, relative orientation parameters (ROPs) estimation, exterior orientation parameters (EOPs) recovery, and bundle adjustment (BA). The second strategy—denoted as fully GNSS/INS-assisted SfM—removes the EOPs estimation step while introducing a random sample consensus (RANSAC)-based strategy for removing matching outliers before the BA stage. Both strategies modify the image matching by restricting the search space for conjugate points. They also implement a linear procedure for ROP refinement. Finally, they use the GNSS/INS information in modified collinearity equations for a simpler BA procedure that could be used for refining system calibration parameters. Eight datasets over six agricultural fields are used to evaluate the performance of the developed strategies. In comparison with a traditional SfM framework and Pix4D Mapper Pro, the proposed strategies are able to generate denser and more accurate 3D point clouds as well as orthophotos without any gaps.