
Showing papers on "Orientation (computer vision)" published in 2021


Journal ArticleDOI
TL;DR: A single-shot alignment network (S2A-Net) consisting of two modules: a feature alignment module (FAM) and an oriented detection module (ODM) that can achieve the state-of-the-art performance on two commonly used aerial objects’ data sets while keeping high efficiency.
Abstract: The past decade has witnessed significant progress on detecting objects in aerial images that are often distributed with large-scale variations and arbitrary orientations. However, most of existing methods rely on heuristically defined anchors with different scales, angles, and aspect ratios, and usually suffer from severe misalignment between anchor boxes (ABs) and axis-aligned convolutional features, which lead to the common inconsistency between the classification score and localization accuracy. To address this issue, we propose a single-shot alignment network (S²A-Net) consisting of two modules: a feature alignment module (FAM) and an oriented detection module (ODM). The FAM can generate high-quality anchors with an anchor refinement network and adaptively align the convolutional features according to the ABs with a novel alignment convolution. The ODM first adopts active rotating filters to encode the orientation information and then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy. Besides, we further explore the approach to detect objects in large-size images, which leads to a better trade-off between speed and accuracy. Extensive experiments demonstrate that our method can achieve the state-of-the-art performance on two commonly used aerial objects' data sets (i.e., DOTA and HRSC2016) while keeping high efficiency.
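
To make the orientation-encoding idea above concrete, here is a minimal PyTorch sketch (not the authors' S²A-Net/ARF implementation) of how convolving with rotated copies of one kernel yields orientation-sensitive responses, and how pooling over the orientation axis yields orientation-invariant features. The 90° rotation steps and all tensor shapes are illustrative assumptions; real active rotating filters use finer angular steps.

```python
# Illustrative sketch: orientation-sensitive vs. orientation-invariant features
# via rotated copies of one convolution kernel (90-degree steps for simplicity).
import torch
import torch.nn.functional as F

def oriented_responses(x, weight, bias=None):
    """x: (N, C_in, H, W); weight: (C_out, C_in, k, k).
    Returns (N, 4, C_out, H, W): one response map per 90-degree kernel rotation."""
    outs = []
    for k in range(4):
        w_rot = torch.rot90(weight, k, dims=(2, 3))   # rotate the kernel in-plane
        outs.append(F.conv2d(x, w_rot, bias=bias, padding=weight.shape[-1] // 2))
    return torch.stack(outs, dim=1)

x = torch.randn(1, 16, 64, 64)
w = torch.randn(32, 16, 3, 3)
sensitive = oriented_responses(x, w)       # orientation-sensitive feature stack
invariant = sensitive.max(dim=1).values    # pool over orientations -> orientation-invariant
print(sensitive.shape, invariant.shape)    # (1, 4, 32, 64, 64) and (1, 32, 64, 64)
```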

288 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: CenterPoint as mentioned in this paper proposes to represent, detect, and track 3D objects as points and achieves state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA.
Abstract: Three-dimensional objects are commonly represented as 3D boxes in a point-cloud. This representation mimics the well-studied image-based 2D bounding-box detection but comes with additional challenges. Objects in a 3D world do not follow any particular orientation, and box-based detectors have difficulties enumerating all orientations or fitting an axis-aligned bounding box to rotated objects. In this paper, we instead propose to represent, detect, and track 3D objects as points. Our framework, CenterPoint, first detects centers of objects using a keypoint detector and regresses to other attributes, including 3D size, 3D orientation, and velocity. In a second stage, it refines these estimates using additional point features on the object. In CenterPoint, 3D object tracking simplifies to greedy closest-point matching. The resulting detection and tracking algorithm is simple, efficient, and effective. CenterPoint achieved state-of-the-art performance on the nuScenes benchmark for both 3D detection and tracking, with 65.5 NDS and 63.8 AMOTA for a single model. On the Waymo Open Dataset, Center-Point outperforms all previous single model methods by a large margin and ranks first among all Lidar-only submissions. The code and pretrained models are available at https://github.com/tianweiy/CenterPoint.
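
The greedy closest-point matching used for tracking is simple enough to sketch directly; below is an illustrative NumPy version (not the CenterPoint codebase) that links detections in consecutive frames by repeatedly taking the closest unmatched pair of predicted centers, with `max_dist` as an assumed gating threshold.

```python
# Illustrative greedy closest-point matching between object centers in two frames.
import numpy as np

def greedy_match(prev_centers, curr_centers, max_dist=2.0):
    """prev_centers: (N, 2), curr_centers: (M, 2) in meters.
    Returns a list of (prev_idx, curr_idx) matches."""
    if len(prev_centers) == 0 or len(curr_centers) == 0:
        return []
    d = np.linalg.norm(prev_centers[:, None, :] - curr_centers[None, :, :], axis=-1)
    matches, used_prev, used_curr = [], set(), set()
    # Greedily take the globally closest unmatched pair until nothing usable is left.
    for idx in np.argsort(d, axis=None):
        i, j = np.unravel_index(idx, d.shape)
        if i in used_prev or j in used_curr or d[i, j] > max_dist:
            continue
        matches.append((i, j))
        used_prev.add(i)
        used_curr.add(j)
    return matches

prev = np.array([[0.0, 0.0], [5.0, 5.0]])
curr = np.array([[0.3, -0.1], [5.2, 5.1], [10.0, 0.0]])
print(greedy_match(prev, curr))   # [(1, 1), (0, 0)]; the third detection would start a new track
```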

246 citations


Posted Content
TL;DR: A Rotation-equivariant Detector (ReDet) is proposed, which explicitly encodes rotation equivariance and rotation invariance and incorporates rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction of model size.
Abstract: Recently, object detection in aerial images has gained much attention in computer vision. Different from objects in natural images, aerial objects are often distributed with arbitrary orientation. Therefore, the detector requires more parameters to encode the orientation information, which are often highly redundant and inefficient. Moreover, as ordinary CNNs do not explicitly model the orientation variation, large amounts of rotation augmented data is needed to train an accurate object detector. In this paper, we propose a Rotation-equivariant Detector (ReDet) to address these issues, which explicitly encodes rotation equivariance and rotation invariance. More precisely, we incorporate rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction of model size. Based on the rotation-equivariant features, we also present Rotation-invariant RoI Align (RiRoI Align), which adaptively extracts rotation-invariant features from equivariant features according to the orientation of RoI. Extensive experiments on several challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and HRSC2016, show that our method can achieve state-of-the-art performance on the task of aerial object detection. Compared with previous best results, our ReDet gains 1.2, 3.5 and 2.6 mAP on DOTA-v1.0, DOTA-v1.5 and HRSC2016 respectively while reducing the number of parameters by 60% (313 Mb vs. 121 Mb). The code is available at: this https URL.

153 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: This paper proposes a Rotation-equivariant Detector (ReDet) that explicitly encodes rotation equivariance and rotation invariance, incorporating rotation-equivariant networks into the detector so that orientation can be predicted accurately without the large amounts of rotation-augmented data that ordinary CNNs require.
Abstract: Recently, object detection in aerial images has gained much attention in computer vision. Different from objects in natural images, aerial objects are often distributed with arbitrary orientation. Therefore, the detector requires more parameters to encode the orientation information, which are often highly redundant and inefficient. Moreover, as ordinary CNNs do not explicitly model the orientation variation, large amounts of rotation augmented data is needed to train an accurate object detector. In this paper, we propose a Rotation-equivariant Detector (ReDet) to address these issues, which explicitly encodes rotation equivariance and rotation invariance. More precisely, we incorporate rotation-equivariant networks into the detector to extract rotation-equivariant features, which can accurately predict the orientation and lead to a huge reduction of model size. Based on the rotation-equivariant features, we also present Rotation-invariant RoI Align (RiRoI Align), which adaptively extracts rotation-invariant features from equivariant features according to the orientation of RoI. Extensive experiments on several challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and HRSC2016, show that our method can achieve state-of-the-art performance on the task of aerial object detection. Compared with previous best results, our ReDet gains 1.2, 3.5 and 2.6 mAP on DOTA-v1.0, DOTA-v1.5 and HRSC2016 respectively while reducing the number of parameters by 60% (313 Mb vs. 121 Mb). The code is available at: https://github.com/csuhan/ReDet.
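
As a rough illustration of what extracting features "according to the orientation of RoI" involves, the sketch below samples a feature map on a grid rotated to an oriented RoI's angle, so the pooled feature is expressed in the object's own frame. It is a simplified stand-in for RiRoI Align, and the function name and sampling details are assumptions, not the paper's implementation.

```python
# Hedged sketch of rotated RoI feature extraction via a rotated sampling grid.
import math
import torch
import torch.nn.functional as F

def rotated_roi_align(feat, cx, cy, w, h, theta, out_size=7):
    """feat: (1, C, H, W); (cx, cy, w, h) in pixels; theta in radians.
    Returns (1, C, out_size, out_size) features sampled in the RoI's rotated frame."""
    _, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-0.5, 0.5, out_size),
        torch.linspace(-0.5, 0.5, out_size),
        indexing="ij",
    )
    c, s = math.cos(theta), math.sin(theta)
    # Rotate the unit sampling grid to the RoI orientation and scale it to the RoI size.
    gx = cx + xs * w * c - ys * h * s
    gy = cy + xs * w * s + ys * h * c
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample (x first, then y).
    grid = torch.stack([2 * gx / (W - 1) - 1, 2 * gy / (H - 1) - 1], dim=-1).unsqueeze(0)
    return F.grid_sample(feat, grid, align_corners=True)

feat = torch.randn(1, 256, 64, 64)
roi_feat = rotated_roi_align(feat, cx=30.0, cy=20.0, w=24.0, h=10.0, theta=0.6)
print(roi_feat.shape)   # torch.Size([1, 256, 7, 7])
```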

138 citations


Journal ArticleDOI
TL;DR: A novel feature descriptor named the Histogram of Orientated Phase Congruency (HOPC), which is based on the structural properties of images, is proposed for the registration of multimodal remote sensing data.
Abstract: Automatic registration of multimodal remote sensing data (e.g., optical, LiDAR, SAR) is a challenging task due to the significant non-linear radiometric differences between these data. To address this problem, this paper proposes a novel feature descriptor named the Histogram of Orientated Phase Congruency (HOPC), which is based on the structural properties of images. Furthermore, a similarity metric named HOPCncc is defined, which uses the normalized correlation coefficient (NCC) of the HOPC descriptors for multimodal registration. In the definition of the proposed similarity metric, we first extend the phase congruency model to generate its orientation representation, and use the extended model to build HOPCncc. Then a fast template matching scheme for this metric is designed to detect the control points between images. The proposed HOPCncc aims to capture the structural similarity between images, and has been tested with a variety of optical, LiDAR, SAR and map data. The results show that HOPCncc is robust against complex non-linear radiometric differences and outperforms the state-of-the-art similarity metrics (i.e., NCC and mutual information) in matching performance. Moreover, a robust registration method is also proposed in this paper based on HOPCncc, which is evaluated using six pairs of multimodal remote sensing images. The experimental results demonstrate the effectiveness of the proposed method for multimodal image registration.
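
The similarity metric itself is just a normalized correlation coefficient computed over descriptor windows; a small NumPy sketch of NCC-based template matching is shown below. The HOPC descriptor extraction is omitted, so the inputs stand in for descriptor maps rather than reproducing the paper's method.

```python
# Illustrative NCC-based template matching over (stand-in) descriptor maps.
import numpy as np

def ncc(tmpl, patch):
    """Normalized correlation coefficient of two equally sized arrays."""
    a = tmpl.ravel().astype(np.float64)
    b = patch.ravel().astype(np.float64)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def best_match(tmpl, search, stride=1):
    """Slide `tmpl` over `search` and return the offset with maximal NCC."""
    th, tw = tmpl.shape[:2]
    best, best_xy = -np.inf, (0, 0)
    for y in range(0, search.shape[0] - th + 1, stride):
        for x in range(0, search.shape[1] - tw + 1, stride):
            score = ncc(tmpl, search[y:y + th, x:x + tw])
            if score > best:
                best, best_xy = score, (x, y)
    return best_xy, best

rng = np.random.default_rng(0)
search = rng.random((60, 60))
tmpl = search[20:36, 10:26].copy()
print(best_match(tmpl, search))   # ((10, 20), 1.0)
```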

95 citations


Journal ArticleDOI
TL;DR: A Critical Feature Capturing Network (CFC-Net) is proposed to improve detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment.
Abstract: Object detection in optical remote sensing images is an important and challenging task. In recent years, the methods based on convolutional neural networks have made good progress. However, due to the large variation in object scale, aspect ratio, and arbitrary orientation, the detection performance is difficult to be further improved. In this paper, we discuss the role of discriminative features in object detection, and then propose a Critical Feature Capturing Network (CFC-Net) to improve detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment. Specifically, we first decouple the classification and regression features, and then construct robust critical features adapted to the respective tasks through the Polarization Attention Module (PAM). With the extracted discriminative regression features, the Rotation Anchor Refinement Module (R-ARM) performs localization refinement on preset horizontal anchors to obtain superior rotation anchors. Next, the Dynamic Anchor Learning (DAL) strategy is given to adaptively select high-quality anchors based on their ability to capture critical features. The proposed framework creates more powerful semantic representations for objects in remote sensing images and achieves high-performance real-time object detection. Experimental results on three remote sensing datasets including HRSC2016, DOTA, and UCAS-AOD show that our method achieves superior detection performance compared with many state-of-the-art approaches. Code and models are available at this https URL.

94 citations


Journal ArticleDOI
TL;DR: An approach, SMORE, based on convolutional neural networks (CNNs) that restores image quality by improving resolution and reducing aliasing in MR images is presented and is shown to be visually and quantitatively superior to previously reported methods.
Abstract: High resolution magnetic resonance (MR) images are desired in many clinical and research applications. Acquiring such images with high signal-to-noise (SNR), however, can require a long scan duration, which is difficult for patient comfort, is more costly, and makes the images susceptible to motion artifacts. A very common practical compromise for both 2D and 3D MR imaging protocols is to acquire volumetric MR images with high in-plane resolution, but lower through-plane resolution. In addition to having poor resolution in one orientation, 2D MRI acquisitions will also have aliasing artifacts, which further degrade the appearance of these images. This paper presents an approach, SMORE, based on convolutional neural networks (CNNs) that restores image quality by improving resolution and reducing aliasing in MR images. This approach is self-supervised, which requires no external training data because the high-resolution and low-resolution data that are present in the image itself are used for training. For 3D MRI, the method consists of only one self-supervised super-resolution (SSR) deep CNN that is trained from the volumetric image data. For 2D MRI, there is a self-supervised anti-aliasing (SAA) deep CNN that precedes the SSR CNN, also trained from the volumetric image data. Both methods were evaluated on a broad collection of MR data, including filtered and downsampled images so that quantitative metrics could be computed and compared, and actual acquired low resolution images for which visual and sharpness measures could be computed and compared. The super-resolution method is shown to be visually and quantitatively superior to previously reported methods.
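
The self-supervision idea, building training pairs from the anisotropic volume itself, can be sketched in a few lines. The following is an assumed, simplified version of the data preparation only (blur and downsample one high-resolution in-plane axis to mimic the through-plane resolution), not the SMORE networks.

```python
# Simplified sketch: build (low-res, high-res) training pairs from a single
# anisotropic MR volume by degrading the high-resolution in-plane slices.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def make_training_pairs(volume, factor=4):
    """volume: (Z, Y, X) array with high in-plane (Y, X) resolution.
    Returns a list of (low_res_slice, high_res_slice) pairs."""
    pairs = []
    for z in range(volume.shape[0]):
        hr = volume[z]                                    # high-resolution in-plane slice
        # Simulate the through-plane acquisition blur along one axis ...
        lr = gaussian_filter1d(hr, sigma=factor / 2.0, axis=0)
        lr = lr[::factor]                                 # ... then downsample that axis
        # Upsample back to the original grid (nearest) so shapes match for training.
        lr = np.repeat(lr, factor, axis=0)[: hr.shape[0]]
        pairs.append((lr, hr))
    return pairs

vol = np.random.rand(32, 128, 128).astype(np.float32)
lr, hr = make_training_pairs(vol, factor=4)[0]
print(lr.shape, hr.shape)   # (128, 128) (128, 128)
```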

77 citations


Journal ArticleDOI
TL;DR: CFC-Net as mentioned in this paper proposes a critical feature capturing network (CFC-Net) to improve detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment.
Abstract: Object detection in optical remote-sensing images is an important and challenging task. In recent years, the methods based on convolutional neural networks (CNNs) have made good progress. However, due to the large variation in object scale, aspect ratio, as well as the arbitrary orientation, the detection performance is difficult to be further improved. In this article, we discuss the role of discriminative features in object detection, and then propose a critical feature capturing network (CFC-Net) to improve detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment. Specifically, we first decouple the classification and regression features, and then construct robust critical features adapted to the respective tasks of classification and regression through the polarization attention module (PAM). With the extracted discriminative regression features, the rotation anchor refinement module (R-ARM) performs localization refinement on preset horizontal anchors to obtain superior rotation anchors. Next, the dynamic anchor learning (DAL) strategy is given to adaptively select high-quality anchors based on their ability to capture critical features. The proposed framework creates more powerful semantic representations for objects in remote-sensing images and achieves high-performance real-time object detection. Experimental results on three remote-sensing datasets including HRSC2016, DOTA, and UCAS-AOD show that our method achieves superior detection performance compared with many state-of-the-art approaches. Code and models are available at https://github.com/ming71/CFC-Net.

61 citations


Journal ArticleDOI
23 Feb 2021
TL;DR: This work designs a fully convolutional model to predict object keypoints, dimensions, and orientation, combines these with perspective geometry constraints to compute position attributes, and proposes an effective semi-supervised training strategy for settings where labeled training data are scarce.
Abstract: In this work, we propose a novel one-stage and keypoint-based framework for monocular 3D object detection using only RGB images, called KM3D-Net. 2D detection only requires a deep neural network to predict 2D properties of objects, as it is a semanticity-aware task. For image-based 3D detection, we argue that the combination of a deep neural network and geometric constraints are needed to synergistically estimate appearance-related and spatial-related information. Here, we design a fully convolutional model to predict object keypoints, dimension, and orientation, and combine these with perspective geometry constraints to compute position attributes. Further, we reformulate the geometric constraints as a differentiable version and embed this in the network to reduce running time while maintaining the consistency of model outputs in an end-to-end fashion. Benefiting from this simple structure, we propose an effective semi-supervised training strategy for settings where labeled training data are scarce. In this strategy, we enforce a consensus prediction of two shared-weights KM3D-Net for the same unlabeled image under different input augmentation conditions and network regularization. In particular, we unify the coordinate-dependent augmentations as the affine transformation for the differential recovering position of objects, and propose a keypoint-dropout module for network regularization. Our model only requires RGB images, without synthetic data, instance segmentation, CAD model, or depth generator. Extensive experiments on the popular KITTI 3D detection dataset indicate that the KM3D-Net surpasses state-of-the-art methods by a large margin in both efficiency and accuracy. And also, to the best of our knowledge, this is the first application of semi-supervised learning in monocular 3D object detection. We surpass most of the previous fully supervised methods with only 13% labeled data on KITTI.
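
A small piece of the geometric reasoning can be made concrete: once a 2D keypoint and a depth estimate are available, the 3D position follows from the pinhole camera model. The sketch below shows only that back-projection step with made-up intrinsics; it is not the KM3D-Net constraint formulation.

```python
# Back-project a 2D keypoint through the pinhole model given an estimated depth.
import numpy as np

def backproject(u, v, depth, K):
    """(u, v): keypoint in pixels; depth: distance along the optical axis (meters);
    K: 3x3 camera intrinsic matrix. Returns (X, Y, Z) in the camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return np.array([X, Y, depth])

K = np.array([[720.0, 0.0, 620.0],     # illustrative intrinsics, not KITTI calibration
              [0.0, 720.0, 190.0],
              [0.0, 0.0, 1.0]])
print(backproject(700.0, 210.0, depth=15.0, K=K))   # approx [1.67, 0.42, 15.0]
```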

52 citations


Proceedings ArticleDOI
30 May 2021
TL;DR: RGBD-Grasp as discussed by the authors decouples grasp detection into two sub-tasks in which RGB and depth information are processed separately, and achieves state-of-the-art results on the GraspNet-1Billion dataset compared with several baselines.
Abstract: General object grasping is an important yet unsolved problem in the field of robotics. Most of the current methods either generate grasp poses with few DoF that fail to cover most of the success grasps, or only take the unstable depth image or point cloud as input which may lead to poor results in some cases. In this paper, we propose RGBD-Grasp, a pipeline that solves this problem by decoupling 7-DoF grasp detection into two sub-tasks where RGB and depth information are processed separately. In the first stage, an encoder-decoder like convolutional neural network Angle-View Net(AVN) is proposed to predict the SO(3) orientation of the gripper at every location of the image. Consequently, a Fast Analytic Searching(FAS) module calculates the opening width and the distance of the gripper to the grasp point. By decoupling the grasp detection problem and introducing the stable RGB modality, our pipeline alleviates the requirement for the high-quality depth image and is robust to depth sensor noise. We achieve state-of-the-art results on GraspNet-1Billion dataset compared with several baselines. Real robot experiments on a UR5 robot with an Intel Realsense camera and a Robotiq two-finger gripper show high success rates for both single object scenes and cluttered scenes. Our code and trained model are available at graspnet.net.

52 citations


Journal ArticleDOI
TL;DR: A novel architecture, a point-based estimator, is proposed that can be easily embedded into region-based detectors and leads to significant improvement on oriented object detection in aerial images.
Abstract: Object detection in aerial images is important for a wide range of applications. The most challenging dilemma in this task is the arbitrary orientation of objects, and many deep-learning-based methods are proposed to address this issue. In previous works on oriented object detection, the regression-based method for object localization has limited performance due to the shortage of spatial information. And the models suffer from the divergence of feature construction for object recognition and localization. In this article, we propose a novel architecture, i.e., point-based estimator to remedy these problems. To utilize the spatial information explicitly, the detector encodes an oriented object with a point-based representation and operates a fully convolutional network for point localization. To improve localization accuracy, the detector takes the manner of coarse-to-fine to lessen the quantization error in point localization. To avoid the discrepancy of feature construction, the detector decouples localization and recognition with individual pathways. In the pathway of object recognition, the instance-alignment block is involved to ensure the alignment between the feature map and oriented region. Overall, the point-based estimator can be easily embedded into the region-based detector and leads to significant improvement on oriented object detection. Extensive experiments have demonstrated the effectiveness of our point-based estimator. Compared with existing works, our method shows state-of-the-art performance on oriented object detection in aerial images.

Journal ArticleDOI
TL;DR: A one-stage, anchor-free detection approach to detect arbitrarily oriented vehicles in high-resolution aerial images by directly predicting high-level vehicle features via a fully convolutional network is proposed.
Abstract: Vehicle detection in aerial images is an important and challenging task in the field of remote sensing. Recently, deep learning technologies have yielded superior performance for object detection in remote sensing images. However, the detection results of the existing methods are horizontal bounding boxes that ignore vehicle orientations, thereby having limited applicability in scenes with dense vehicles or clutter backgrounds. In this article, we propose a one-stage, anchor-free detection approach to detect arbitrarily oriented vehicles in high-resolution aerial images. The vehicle detection task is transformed into a multitask learning problem by directly predicting high-level vehicle features via a fully convolutional network. That is, a classification subtask is created to look for vehicle central points and three regression subtasks are created to predict vehicle orientations, scales, and offsets of vehicle central points. First, coarse and fine feature maps outputted from different stages of a residual network are concatenated together by a feature pyramid fusion strategy. Upon the concatenated features, four convolutional layers are attached in parallel to predict high-level vehicle features. During training, task uncertainty learned from the training data is used to weight loss function in the multitask learning setting. For inferencing, oriented bounding boxes are generated using the predicted vehicle features, and oriented nonmaximum suppression (NMS) postprocessing is used to reduce redundant results. Experiments on two public aerial image data sets have shown the effectiveness of the proposed approach.
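
Decoding the predicted center, size, and orientation into an oriented bounding box (the input to oriented NMS) is a short geometric step. The NumPy sketch below is illustrative, and its variable names are not the paper's output heads.

```python
# Decode (center, width, length, angle) predictions into oriented box corners.
import numpy as np

def oriented_box_corners(cx, cy, w, l, theta):
    """theta in radians, measured from the image x-axis. Returns a (4, 2) corner array."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    # Corners in the vehicle's own frame (length along x, width along y).
    local = np.array([[ l / 2,  w / 2],
                      [ l / 2, -w / 2],
                      [-l / 2, -w / 2],
                      [-l / 2,  w / 2]])
    return local @ R.T + np.array([cx, cy])

corners = oriented_box_corners(cx=100.0, cy=50.0, w=8.0, l=20.0, theta=np.deg2rad(30))
print(np.round(corners, 1))   # four (x, y) corners of the rotated box
```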

Journal ArticleDOI
TL;DR: This work presents a novel error-state Kalman filter that yields highly accurate estimates of IMU orientation that are robust to poor measurement updates caused by fluctuations in the local magnetic field and/or highly dynamic movements.
Abstract: Inertial measurement units (IMUs) are increasingly utilized as motion capture devices in human movement studies. Given their high portability, IMUs can be deployed in any environment, importantly those outside of the laboratory. However, a significant challenge limits the adoption of this technology; namely estimating the orientation of the IMUs to a common world frame, which is essential to estimating the rotations across skeletal joints. Common (probabilistic) methods for estimating IMU orientation rely on the ability to update the current orientation estimate using data provided by the IMU. The objective of this work is to present a novel error-state Kalman filter that yields highly accurate estimates of IMU orientation that are robust to poor measurement updates from fluctuations in the local magnetic field and/or highly dynamic movements. The method is validated with ground truth data collected with highly accurate orientation measurements provided by a coordinate measurement machine. As an example, the method yields IMU-estimated orientations that remain within 3.7 degrees (RMS error) over relatively long (25 cumulative minutes) trials even in the presence of large fluctuations in the local magnetic field. For comparison, ignoring the magnetic interference increases the RMS error to 12.8 degrees, more than a threefold increase.
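
For readers unfamiliar with orientation filtering, the sketch below shows the generic predict/correct structure (gyroscope integration plus an accelerometer-based correction) in the style of a Mahony complementary filter. It is deliberately much simpler than, and should not be confused with, the error-state Kalman filter proposed in the paper.

```python
# Mahony-style complementary filter sketch: gyro integration + gravity correction.
import numpy as np

def quat_mult(q, r):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def step(q, gyro, accel, dt, kp=0.5):
    """q: unit quaternion (w, x, y, z); gyro in rad/s; accel in m/s^2."""
    # Estimated gravity direction in the sensor frame from the current quaternion.
    w, x, y, z = q
    v = np.array([2*(x*z - w*y), 2*(w*x + y*z), w*w - x*x - y*y + z*z])
    # Error between the measured and estimated gravity directions.
    a = accel / np.linalg.norm(accel)
    e = np.cross(a, v)
    # Feed the error back into the angular rate, then integrate the quaternion.
    omega = gyro + kp * e
    q = q + 0.5 * quat_mult(q, np.array([0.0, *omega])) * dt
    return q / np.linalg.norm(q)

q = np.array([1.0, 0.0, 0.0, 0.0])
q = step(q, gyro=np.array([0.0, 0.0, 0.1]), accel=np.array([0.0, 0.0, 9.81]), dt=0.01)
print(q)   # slightly rotated about the z-axis
```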

Journal ArticleDOI
TL;DR: A method is proposed to automatically position a US probe orthogonally to the tissue surface, thereby improving sound propagation and enabling RUSS to reach predefined orientations relative to the surface normal at the contact point.
Abstract: The correct orientation of an ultrasound (US) probe is one of the main parameters governing the US image quality. With the rise of robotic ultrasound systems (RUSS), methods that can automatically compute the orientation promise repeatable, automatic acquisition from predefined angles resulting in high-quality US imaging. In this article, we propose a method to automatically position a US probe orthogonally to the tissue surface, thereby improving sound propagation and enabling RUSS to reach predefined orientations relative to the surface normal at the contact point. The method relies on the derivation of the underlying mechanical model. Two rotations around orthogonal axes are carried out, while the contact force is being recorded. Then, the force data are fed into the model to estimate the normal direction. Accordingly, the probe orientation can be computed without requiring visual features. The method is applicable to convex and linear probes. It has been evaluated on a phantom with varying tilt angles and on multiple human tissues (forearm, upper arm, lower back, and leg). As a result, it has outperformed existing methods in terms of accuracy. The mean (± SD) absolute angular difference on the in-vivo tissues, averaged over all anatomies and probe types, is 2.9 ± 1.6°, and 2.2 ± 1.5° on the phantom.

Journal ArticleDOI
Qiuze Yu, Dawen Ni, Yuxuan Jiang, Yuxuan Yan, Jiachun An, Tao Sun
TL;DR: An improved nonlinear scale-invariant feature transform (SIFT)-framework-based algorithm that combines spatial feature detection with local frequency-domain description for the registration of SAR and optical images is proposed and can achieve better results than other state-of-the-art methods in terms of registration accuracy.
Abstract: Due to severe speckle noise in synthetic aperture radar (SAR) images and the large nonlinear intensity differences between SAR and optical images, the registration of SAR and optical images is a challenging problem that remains to be solved. In this paper, an improved nonlinear scale-invariant feature transform (SIFT)-framework-based algorithm that combines spatial feature detection with local frequency-domain description for the registration of SAR and optical images is proposed. First, multiscale representations of the SAR and optical images are constructed based on nonlinear diffusion to better preserve edges and obtain consistent edge information. The ratio of exponentially weighted averages (ROEWA) operator and the Sobel operator are utilized in the process of scale space construction to calculate consistent gradient information. Then, a new feature detection strategy based on the Harris–Laplace ROEWA and Harris–Laplace Sobel techniques is proposed to detect stable and repeatable keypoints in the scale space. Finally, a novel descriptor, called the rotation-invariant amplitudes of log-Gabor orientation histograms (RI-ALGH), and a simplified version, ALGH, are proposed. The proposed descriptors are built based on the amplitudes of multiscale and multiorientation log-Gabor responses and utilize an improved spatial structure of the gradient location and orientation histogram (GLOH) descriptor, which is robust to local distortions. The experimental results on both simulated and real images demonstrate that the proposed method can achieve better results than other state-of-the-art methods in terms of registration accuracy.
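
One ingredient above, the ROEWA operator, is easy to illustrate: it compares exponentially weighted means on the two sides of each pixel and takes a ratio, which tolerates the multiplicative speckle of SAR imagery better than difference gradients. The NumPy sketch below is a didactic one-axis version with assumed parameters, not the authors' implementation.

```python
# Didactic ROEWA-style edge response: ratio of one-sided exponentially weighted means.
import numpy as np
from scipy.ndimage import convolve1d

def roewa_response(img, alpha=0.5, size=9, axis=1, eps=1e-6):
    """Edge strength along `axis` as max(ratio, 1/ratio) of the two one-sided means."""
    t = np.arange(1, size + 1)
    k = np.exp(-alpha * t)
    k /= k.sum()
    side_a = np.concatenate([k[::-1], np.zeros(size + 1)])   # one-sided exponential window
    side_b = np.concatenate([np.zeros(size + 1), k])         # mirrored window on the other side
    m_a = convolve1d(img.astype(np.float64), side_a, axis=axis, mode="nearest")
    m_b = convolve1d(img.astype(np.float64), side_b, axis=axis, mode="nearest")
    ratio = (m_a + eps) / (m_b + eps)
    return np.maximum(ratio, 1.0 / ratio)    # symmetric, so the side labelling does not matter

img = np.ones((64, 64)) * 50.0
img[:, 32:] = 150.0                          # a vertical step edge
resp = roewa_response(img, axis=1)
print(resp[32, 30:35].round(2))              # large responses near the step edge, ~1 elsewhere
```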

Journal ArticleDOI
TL;DR: SynthSR as discussed by the authors is a method to train a CNN that receives one or more scans with spaced slices, acquired with different contrast, resolution and orientation, and produces an isotropic scan of canonical contrast (typically a 1mm MP-RAGE).

Journal ArticleDOI
TL;DR: Receiver operating characteristic (ROC) results demonstrate that the proposed method is superior to state-of-the-art algorithms, and comparisons show that the hashing provides more accurate predictions than other metrics.
Abstract: Numerous screen content images (SCIs) have been produced to meet the needs of virtual desktop and remote display, which put forward a very urgent requirement for security and management of SCIs. Perceptual hashing is an effective way to deal with this issue. However, since SCIs are generally composed of pictures, graphics and texts, their intrinsic characteristics are different from those of natural images. Thus the previous hashing methods for natural images are not suitable for SCIs. In this article, we propose a perceptual hashing method for SCIs from the perspective of visual content understanding. Specifically, considering that the visual content understanding of SCIs mainly comes from textual regions, while the contours of text always have thinner width and higher contrast, it is decided to generate the hash in the gradient field. An input screen image is first processed with some joint preprocessing operations. Then the maximum gradient magnitude and corresponding orientation information are extracted from the three color channels R, G and B. Normalized histogram and local frequency coefficient features are further obtained from the maximum gradient magnitude. Finally, a hash sequence is constructed from statistics derived from the extracted features. Experiments were conducted on three SCI databases to evaluate the classification trade-off between robustness and discrimination. Receiver operating characteristic (ROC) results demonstrate that the proposed method is superior to the state-of-the-art algorithms. In addition, the SIQAD and SCID databases were leveraged to present the application in reduced-reference screen content image quality assessment, and comparisons show that our hashing provides more accurate predictions than other metrics.
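
A toy version of the gradient-domain hashing pipeline described above might look like the following: per-channel gradients, the maximum magnitude and its orientation across R, G and B, normalized histograms, and binarization. Bin counts and the median-based binarization rule are assumptions, not the paper's configuration.

```python
# Toy gradient-domain perceptual hash: max-over-channel gradient statistics -> bits.
import numpy as np

def max_gradient_hash(img, bins=64):
    """img: (H, W, 3) float array in [0, 1]. Returns a {0, 1} array of length `bins`."""
    gx = np.zeros(img.shape[:2])
    gy = np.zeros(img.shape[:2])
    mag = np.zeros(img.shape[:2])
    for c in range(3):
        cgx = np.gradient(img[:, :, c], axis=1)
        cgy = np.gradient(img[:, :, c], axis=0)
        cmag = np.hypot(cgx, cgy)
        update = cmag > mag                      # keep the strongest channel response per pixel
        mag = np.where(update, cmag, mag)
        gx = np.where(update, cgx, gx)
        gy = np.where(update, cgy, gy)
    ori = np.arctan2(gy, gx)                     # orientation of the strongest channel
    h_mag, _ = np.histogram(mag, bins=bins // 2, range=(0.0, mag.max() + 1e-9))
    h_ori, _ = np.histogram(ori, bins=bins // 2, range=(-np.pi, np.pi))
    feat = np.concatenate([h_mag / h_mag.sum(), h_ori / h_ori.sum()])
    return (feat > np.median(feat)).astype(np.uint8)   # binarize against the median

img = np.random.rand(128, 128, 3)
print(max_gradient_hash(img))                    # a 64-bit binary hash
```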

Journal ArticleDOI
TL;DR: The proposed deep-seated features histogram (DSFH) provides efficient CBIR performance and not only has the power to discriminate low-level features, including color, texture, and shape, but can also match scenes of similar style.

Journal ArticleDOI
TL;DR: A tile-based image processing method is proposed that applies a localized thresholding technique on each tile and detects cracked tiles (tiles containing cracks) based on the spatial distribution of crack pixels; test results were found to be promising.
Abstract: Nowadays, there is a massive necessity to develop fully automated and efficient distress assessment systems to evaluate pavement conditions with the minimum cost. Due to having complex training processes, most of the current supervised learning-based practices in this area are not suitable for smaller, local-level projects with limited resources. This paper aims to develop an automatic crack assessment method to detect and classify cracks from 2-D and 3-D pavement images. A tile-based image processing method was proposed to apply a localized thresholding technique on each tile and detect the cracked ones (tiles containing cracks) based on crack pixels’ spatial distribution. For longitudinal and transverse cracking, a curve is then fitted on the cracked tiles to connect them. Next, cracks are classified, and their lengths are measured based on the orientation axes and length of the crack curves. This method is not limited to the pavement texture type, and it is cost-efficient as it takes less than 20 s per image for a commodity computer to generate results. The method was tested on 130 images of Portland Cement Concrete (PCC) and Asphalt Concrete (AC) surfaces; test results were found to be promising (Precision = 0.89, Recall = 0.83, F1 score = 0.86, and Crack length measurement accuracy = 80%).
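
The tile-based strategy lends itself to a compact sketch: threshold each tile locally and flag it as cracked when the dark pixels are numerous and spatially elongated. The version below is illustrative; its thresholds and the covariance-based elongation test are assumptions rather than the paper's exact rules.

```python
# Tile-based crack-candidate detection: local thresholding + elongation check per tile.
import numpy as np

def cracked_tiles(gray, tile=64, k=1.5, min_frac=0.005, min_elong=2.0):
    """gray: (H, W) float image. Returns a boolean map over the tile grid."""
    th, tw = gray.shape[0] // tile, gray.shape[1] // tile
    out = np.zeros((th, tw), dtype=bool)
    for i in range(th):
        for j in range(tw):
            t = gray[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile]
            mask = t < t.mean() - k * t.std()            # localized threshold (cracks are darker)
            ys, xs = np.nonzero(mask)
            if len(ys) < min_frac * tile * tile:
                continue
            # Crude spatial-distribution check: crack pixels form an elongated point set.
            cov = np.cov(np.stack([xs, ys]))
            evals = np.sort(np.linalg.eigvalsh(cov))
            if evals[1] > min_elong * (evals[0] + 1e-6):
                out[i, j] = True
    return out

img = np.ones((256, 256))
rr = np.arange(256)
img[rr, rr] = 0.0                                        # a synthetic diagonal crack
print(np.argwhere(cracked_tiles(img)))                   # the four tiles along the diagonal
```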

Journal ArticleDOI
TL;DR: A large-scale object counting data set of remote sensing images is constructed in this paper, containing four important geographic object categories: buildings, crowded ships in harbors, and large and small vehicles in parking lots.
Abstract: Object counting, whose aim is to estimate the number of objects from a given image, is an important and challenging computation task. Significant efforts have been devoted to addressing this problem and achieved great progress, yet counting the number of ground objects from remote sensing images is barely studied. In this article, we are interested in counting dense objects from remote sensing images. Compared with object counting in a natural scene, this task is challenging in the following factors: large-scale variation, complex cluttered background, and orientation arbitrariness. More importantly, the scarcity of data severely limits the development of research in this field. To address these issues, we first construct a large-scale object counting data set with remote sensing images, which contains four important geographic objects: buildings, crowded ships in harbors, and large vehicles and small vehicles in parking lots. We then benchmark the data set by designing a novel neural network that can generate a density map of an input image. The proposed network consists of three parts, namely attention module, scale pyramid module, and deformable convolution module (DCM) to attack the aforementioned challenging factors. Extensive experiments are performed on the proposed data set and one crowd counting data set, which demonstrates the challenges of the proposed data set and the superiority and effectiveness of our method compared with state-of-the-art methods.
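
The counting formulation rests on density maps whose integral equals the object count; the short sketch below shows only how such a ground-truth density map is typically built from point annotations (the network that regresses it is omitted), with an assumed Gaussian bandwidth.

```python
# Build a ground-truth density map from point annotations; its sum equals the count.
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, shape, sigma=4.0):
    """points: iterable of (row, col) object centers; shape: (H, W)."""
    dmap = np.zeros(shape, dtype=np.float64)
    for r, c in points:
        dmap[int(r), int(c)] += 1.0
    return gaussian_filter(dmap, sigma=sigma)   # smoothing preserves the total mass

centers = [(10, 12), (40, 80), (41, 83), (90, 30)]
dmap = density_map(centers, shape=(128, 128))
print(round(dmap.sum(), 3))                     # ~4.0 = number of annotated objects
```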

Journal ArticleDOI
TL;DR: In this article, a kernelized treatment is introduced to avoid explicit basis functions when learning orientation trajectories associated with high-dimensional inputs, which allows learned orientation skills to be adapted to arbitrary desired points that comprise orientations and angular velocities.
Abstract: As a promising branch of robotics, imitation learning emerges as an important way to transfer human skills to robots, where human demonstrations represented in Cartesian or joint spaces are utilized to estimate task/skill models that can be subsequently generalized to new situations. While learning Cartesian positions suffices for many applications, the end-effector orientation is required in many others. Despite recent advances in learning orientations from demonstrations, several crucial issues have not been adequately addressed yet. For instance, how can demonstrated orientations be adapted to pass through arbitrary desired points that comprise orientations and angular velocities? In this article, we propose an approach that is capable of learning multiple orientation trajectories and adapting learned orientation skills to new situations (e.g., via-points and end-points), where both orientation and angular velocity are considered. Specifically, we introduce a kernelized treatment to alleviate explicit basis functions when learning orientations, which allows for learning orientation trajectories associated with high-dimensional inputs. In addition, we extend our approach to the learning of quaternions with angular acceleration or jerk constraints, which allows for generating smoother orientation profiles for robots. Several examples including experiments with real 7-DoF robot arms are provided to verify the effectiveness of our method.

Journal ArticleDOI
TL;DR: This Account discusses the illumination conditions and experimental parameters for 4D-STEM experiments aimed at producing images of structural features in beam-sensitive materials; notably, the data acquisition does not require an aberration-corrected TEM and can be performed on a variety of instruments with the right attention to experimental parameters.
Abstract: Conspectus: Scanning electron nanobeam diffraction, or 4D-STEM (four-dimensional scanning transmission electron microscopy), is a flexible and powerful approach to elucidate structure from "soft" materials that are challenging to image in the transmission electron microscope because their structure is easily damaged by the electron beam. In a 4D-STEM experiment, a converged electron beam is scanned across the sample, and a pixelated camera records a diffraction pattern at each scan position. This four-dimensional data set can be mined for various analyses, producing maps of local crystal orientation, structural distortions, crystallinity, or different structural classes. Holding the sample at cryogenic temperatures minimizes diffusion of radicals and the resulting damage and disorder caused by the electron beam. The total fluence of incident electrons can easily be controlled during 4D-STEM experiments by careful use of the beam blanker, steering of the localized electron dose, and by minimizing the fluence in the convergent beam thus minimizing beam damage. This technique can be applied to both organic and inorganic materials that are known to be beam-sensitive; they can be highly crystalline, semicrystalline, mixed phase, or amorphous. One common example is the case for many organic materials that have a π-π stacking of polymer chains or rings on the order of 3.4-4.2 Å separation. If these chains or rings are aligned in some regions, they will produce distinct diffraction spots (as would other crystalline spacings in this range), though they may be weak or diffuse for disordered or weakly scattering materials. We can reconstruct the orientation of the π-π stacking, the degree of π-π stacking in the sample, and the domain size of the aligned regions. This Account summarizes illumination conditions and experimental parameters for 4D-STEM experiments with the goal of producing images of structural features for materials that are beam-sensitive. We will discuss experimental parameters including sample cooling, probe size and shape, fluence, and cameras. 4D-STEM has been applied to a variety of materials, not only as an advanced technique for model systems, but as a technique for the beginning microscopist to answer materials science questions. It is noteworthy that the experimental data acquisition does not require an aberration-corrected TEM but can be produced on a variety of instruments with the right attention to experimental parameters.

Journal ArticleDOI
TL;DR: A simple yet effective method named prototype-CNN (P-CNN) is proposed, which mainly consists of a prototype learning network converting support images to class-aware prototypes, a prototype-guided region proposal network for better generation of region proposals, and a detector head extending the head of the Faster region-based CNN (R-CNN) to further boost performance.
Abstract: Recently, due to the excellent representation ability of convolutional neural networks (CNNs), object detection in remote sensing images has undergone remarkable development. However, when trained with a small number of samples, the performance of the object detectors drops sharply. In this article, we focus on the following three main challenges of few-shot object detection in remote sensing images: 1) since the sample number of novel classes is far less than base classes, object detectors would fail to quickly adapt to the features of novel classes, which would result in overfitting; 2) the scarcity of samples in novel classes leads to a sparse orientation space, while the objects in remote sensing images usually have arbitrary orientations; and 3) the distribution of object instances in remote sensing images is scattered and, therefore, it is hard to identify foreground objects from the complex background. To tackle these problems, we propose a simple yet effective method named prototype-CNN (P-CNN), which mainly consists of three parts: a prototype learning network (PLN) converting support images to class-aware prototypes, a prototype-guided region proposal network (P-G RPN) for better generation of region proposals, and a detector head extending the head of Faster region-based CNN (R-CNN) to further boost the performance. Comprehensive evaluations on the large-scale DIOR dataset demonstrate the effectiveness of our P-CNN. The source code is available at https://github.com/Ybowei/P-CNN.
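
The prototype idea at the core of P-CNN can be sketched independently of the detector: average the support features of each class into a class-aware prototype and score query features by similarity. The PyTorch snippet below uses random tensors as stand-ins for extracted features and is not the paper's network.

```python
# Class-aware prototypes from support features + similarity scoring of queries.
import torch
import torch.nn.functional as F

def build_prototypes(support_feats, support_labels, num_classes):
    """support_feats: (N, D) pooled features of support objects;
    support_labels: (N,) class ids. Returns (num_classes, D) prototypes."""
    protos = torch.zeros(num_classes, support_feats.shape[1])
    for c in range(num_classes):
        protos[c] = support_feats[support_labels == c].mean(dim=0)   # per-class average
    return protos

def score_queries(query_feats, protos):
    """Cosine similarity between query features (M, D) and prototypes (C, D)."""
    return F.normalize(query_feats, dim=1) @ F.normalize(protos, dim=1).T

support = torch.randn(20, 256)
labels = torch.arange(20) % 5            # every class gets some support samples
protos = build_prototypes(support, labels, num_classes=5)
print(score_queries(torch.randn(8, 256), protos).shape)   # torch.Size([8, 5])
```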

Journal ArticleDOI
TL;DR: In this paper, a robotic grasping system with multi-view depth image acquisition is presented, where RANSAC and an outlier filter are adopted for noise removal and multi-object segmentation.
Abstract: Due to recent advances on hardware and software technologies, industrial automation has been significantly improved in the past few decades. For random bin picking applications, it is a future trend to use machine vision based approaches to estimate the 3D poses of workpieces. In this work, we present a robotic grasping system with multi-view depth image acquisition. First, RANSAC and an outlier filter are adopted for noise removal and multi-object segmentation. A voting scheme is then used for preliminary pose computation, followed by the ICP algorithm to derive a more precise target orientation. A model-based registration approach using a genetic algorithm with parameter minimization is proposed for 6-DOF pose estimation. Finally, the grasping efficiency is increased by disturbance detection, which reduces the number of 3D data scanning for multiple operations. The experiments are carried out in the real scene environment, and the performance evaluation has demonstrated the feasibility of the proposed technique.

Journal ArticleDOI
TL;DR: Registration experiments show that the proposed CAO-C2F method accurately aligns images and outperforms other state-of-the-art methods in terms of precision, recall, and root-mean-square error.
Abstract: Automatic registration of infrared and visible images of power equipment has become a challenging task in intelligent diagnosis systems for the power grid. Existing registration methods usually fail to accurately align infrared and visible images of power equipment because of differences in resolution, spectrum, and viewpoint. To solve this problem, we propose a novel main orientation for feature points named contour angle orientation (CAO), and describe an automatic infrared and visible image registration method named CAO-Coarse to Fine (CAO-C2F). CAO is based on the contour features of images and is invariant to viewpoint and scale differences between images. C2F is a feature matching method to obtain correct matches. Our proposed CAO-C2F method includes four steps. First, feature points on contours are extracted by the curvature scale space (CSS) corner detector based on local and global curvature. Second, the CAO of each feature point is computed as the main orientation. Third, modified scale-invariant feature transform (SIFT) descriptors on the main orientations are extracted and matched by bilateral matching. Finally, accurate matches are obtained by applying the C2F method. Registration experiments on a self-established image dataset show that our proposed CAO-C2F method accurately aligns images and outperforms other state-of-the-art methods in terms of precision, recall, and root-mean-square error.
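
Of the four steps, the bilateral matching step is the most self-contained: a pair of descriptors is kept only if each is the other's nearest neighbor. A minimal NumPy sketch of that mutual-nearest-neighbor check follows; it assumes plain Euclidean distances and synthetic descriptors.

```python
# Bilateral (mutual nearest-neighbor) descriptor matching.
import numpy as np

def bilateral_match(desc_a, desc_b):
    """desc_a: (N, D), desc_b: (M, D). Returns a (K, 2) array of index pairs."""
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn_ab = d.argmin(axis=1)             # best match in B for each descriptor in A
    nn_ba = d.argmin(axis=0)             # best match in A for each descriptor in B
    pairs = [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]   # keep mutual matches only
    return np.array(pairs)

a = np.random.rand(50, 128)
b = np.vstack([a[:30] + 0.01 * np.random.rand(30, 128), np.random.rand(20, 128)])
print(len(bilateral_match(a, b)))        # roughly 30 (the perturbed copies) plus a few chance matches
```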

Journal ArticleDOI
TL;DR: A new automated and robust algorithm termed clustering on local point descriptors (CLPD) is developed for more accurate discontinuity identification, and results indicate that the proposed CLPD algorithm outperforms existing approaches in terms of accuracy in discontinuity orientation estimation.

Journal ArticleDOI
TL;DR: A deep neural network-based approach for view classification and content-based image retrieval is proposed and its application to efficient medical image retrieval is demonstrated; an approach for body-part orientation view classification labels is also designed, intending to reduce the variance that occurs in different types of scans.
Abstract: In medical applications, retrieving similar images from repositories is most essential for supporting diagnostic imaging-based clinical analysis and decision support systems. However, this is a challenging task, due to the multi-modal and multi-dimensional nature of medical images. In practical scenarios, the availability of large and balanced datasets that can be used for developing intelligent systems for efficient medical image management is quite limited. Traditional models often fail to capture the latent characteristics of images and have achieved limited accuracy when applied to medical images. For addressing these issues, a deep neural network-based approach for view classification and content-based image retrieval is proposed and its application for efficient medical image retrieval is demonstrated. We also designed an approach for body part orientation view classification labels, intending to reduce the variance that occurs in different types of scans. The learned features are used first to predict class labels and later used to model the feature space for similarity computation for the retrieval task. The outcome of this approach is measured in terms of error score. When benchmarked against 12 state-of-the-art works, the model achieved the lowest error score of 132.45, with 9.62–63.14% improvement over other works, thus highlighting its suitability for real-world applications.

Journal ArticleDOI
TL;DR: Liu et al. as discussed by the authors proposed Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN) to enhance action recognition in the vision-sensor modality (videos) by adaptively transferring and distilling the knowledge from multiple wearable sensors.
Abstract: Existing vision-based action recognition is susceptible to occlusion and appearance variations, while wearable sensors can alleviate these challenges by capturing human motion with one-dimensional time-series signals (e.g. acceleration, gyroscope, and orientation). For the same action, the knowledge learned from vision sensors (videos or images) and wearable sensors, may be related and complementary. However, there exists a significantly large modality difference between action data captured by wearable-sensor and vision-sensor in data dimension, data distribution, and inherent information content. In this paper, we propose a novel framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in vision-sensor modality (videos) by adaptively transferring and distilling the knowledge from multiple wearable sensors. The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modalities. To preserve the local temporal relationship and facilitate employing visual deep learning models, we transform one-dimensional time-series signals of wearable sensors to two-dimensional images by designing a gramian angular field based virtual image generation model. Then, we introduce a novel Similarity-Preserving Adaptive Multi-modal Fusion Module (SPAMFM) to adaptively fuse intermediate representation knowledge from different teacher networks. Finally, to fully exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel Graph-guided Semantically Discriminative Mapping (GSDM) module, which utilizes graph-guided ablation analysis to produce a good visual explanation to highlight the important regions across modalities and concurrently preserve the interrelations of original data. Experimental results on Berkeley-MHAD, UTD-MHAD, and MMAct datasets well demonstrate the effectiveness of our proposed SAKDN for adaptive knowledge transfer from wearable-sensors modalities to vision-sensors modalities. The code is publicly available at https://github.com/YangLiu9208/SAKDN .
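
The virtual-image generation mentioned above builds on the Gramian Angular Field, which is compact enough to sketch: rescale a 1-D sensor signal to [-1, 1], map samples to angles, and form the pairwise cosine-sum matrix. The snippet below shows the standard GASF transform on a toy signal, not the paper's full virtual image generation model.

```python
# Gramian Angular Summation Field (GASF): turn a 1-D sensor signal into a 2-D image.
import numpy as np

def gramian_angular_field(x):
    """x: 1-D array. Returns an (L, L) GASF image."""
    x = np.asarray(x, dtype=np.float64)
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1.0   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))                      # per-sample angle
    return np.cos(phi[:, None] + phi[None, :])                  # GASF_ij = cos(phi_i + phi_j)

t = np.linspace(0, 4 * np.pi, 128)
accel_x = np.sin(t) + 0.1 * np.random.randn(128)                # a toy acceleration channel
img = gramian_angular_field(accel_x)
print(img.shape, img.min().round(2), img.max().round(2))        # (128, 128), values in [-1, 1]
```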

Book ChapterDOI
05 Sep 2021
TL;DR: In this article, a Neural Network based Handwritten Text Recognition (HTR) model is proposed to recognize full pages of handwritten or printed text without image segmentation, which can extract text present in an image and then sequence it correctly without imposing any constraints regarding orientation, layout and size of text and nontext.
Abstract: We present a Neural Network based Handwritten Text Recognition (HTR) model architecture that can be trained to recognize full pages of handwritten or printed text without image segmentation. Being based on an Image to Sequence architecture, it can extract text present in an image and then sequence it correctly without imposing any constraints regarding orientation, layout and size of text and non-text. Further, it can also be trained to generate auxiliary markup related to formatting, layout and content. We use a character level vocabulary, thereby enabling language and terminology of any subject. The model achieves a new state of the art in paragraph-level recognition on the IAM dataset. When evaluated on scans of real world handwritten free form test answers - beset with curved and slanted lines, drawings, tables, math, chemistry and other symbols - it performs better than all commercially available HTR cloud APIs. It is deployed in production as part of a commercial web application.

Journal ArticleDOI
TL;DR: In this paper, a learning-based system that estimates the camera position and orientation from a single input image relative to a known environment is proposed, based on a deep neural network and fully differentiable pose optimization.
Abstract: We describe a learning-based system that estimates the camera position and orientation from a single input image relative to a known environment. The system is flexible w.r.t. the amount of information available at test and at training time, catering to different applications. Input images can be RGB-D or RGB, and a 3D model of the environment can be utilized for training but is not necessary. In the minimal case, our system requires only RGB images and ground truth poses at training time, and it requires only a single RGB image at test time. The framework consists of a deep neural network and fully differentiable pose optimization. The neural network predicts so-called scene coordinates, i.e., dense correspondences between the input image and 3D scene space of the environment. The pose optimization implements robust fitting of pose parameters using differentiable RANSAC (DSAC) to facilitate end-to-end training. The system, an extension of DSAC++ and referred to as DSAC*, achieves state-of-the-art accuracy on various public datasets for RGB-based re-localization, and competitive accuracy for RGB-D based re-localization.
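
To ground the pipeline, the sketch below shows the non-differentiable analogue of the final step: given network-predicted scene coordinates (dense 2D-3D correspondences), recover the camera pose by robust PnP fitting with OpenCV's solvePnPRansac. It is a stand-in for illustration only; DSAC* uses a differentiable RANSAC instead, and the synthetic data and intrinsics here are assumptions.

```python
# Pose from predicted scene coordinates via (non-differentiable) PnP + RANSAC.
import numpy as np
import cv2

def pose_from_scene_coords(pixels, scene_coords, K):
    """pixels: (N, 2) pixel locations; scene_coords: (N, 3) predicted 3D world points;
    K: 3x3 intrinsics. Returns rotation, translation, and RANSAC inlier indices."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        scene_coords.astype(np.float64),
        pixels.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,
        reprojectionError=3.0,
        iterationsCount=200,
    )
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec, inliers

# Synthetic sanity check: project random world points with an identity pose, then recover it.
K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
pts_w = np.random.rand(200, 3) * 2 + np.array([0.0, 0.0, 4.0])   # points in front of the camera
proj = (K @ pts_w.T).T
pixels = proj[:, :2] / proj[:, 2:3]
R, t, inliers = pose_from_scene_coords(pixels, pts_w, K)
print(np.round(R, 2), np.round(t.ravel(), 2))                    # ~identity rotation, ~zero translation
```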