
Showing papers on "Bounding overwatch" published in 2021


Posted Content
TL;DR: Wang et al. as discussed by the authors proposed an Efficient Intersection over Union (EIOU) loss, which explicitly measures the discrepancies of three geometric factors in BBR, i.e., the overlap area, the central point and the side length.
Abstract: In object detection, bounding box regression (BBR) is a crucial step that determines the object localization performance. However, we find that most previous loss functions for BBR have two main drawbacks: (i) Both $\ell_n$-norm and IOU-based loss functions are inefficient to depict the objective of BBR, which leads to slow convergence and inaccurate regression results. (ii) Most of the loss functions ignore the imbalance problem in BBR that the large number of anchor boxes which have small overlaps with the target boxes contribute most to the optimization of BBR. To mitigate the adverse effects caused thereby, we perform thorough studies to exploit the potential of BBR losses in this paper. Firstly, an Efficient Intersection over Union (EIOU) loss is proposed, which explicitly measures the discrepancies of three geometric factors in BBR, i.e., the overlap area, the central point and the side length. After that, we state the Effective Example Mining (EEM) problem and propose a regression version of focal loss to make the regression process focus on high-quality anchor boxes. Finally, the above two parts are combined to obtain a new loss function, namely Focal-EIOU loss. Extensive experiments on both synthetic and real datasets are performed. Notable superiorities on both the convergence speed and the localization accuracy can be achieved over other BBR losses.

134 citations
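
As an illustration of the loss described above, here is a minimal PyTorch-style sketch of EIOU with the focal weighting, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the function and parameter names are ours, not the authors' reference implementation.

```python
import torch

def focal_eiou_loss(pred, target, gamma=0.5, eps=1e-7):
    """Sketch of the EIOU loss with focal weighting (names assumed).

    pred, target: (N, 4) boxes in (x1, y1, x2, y2) format.
    gamma: focusing parameter that up-weights high-IoU anchors.
    """
    # IoU term
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box, used to normalize the three penalties
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Discrepancies of the three geometric factors: center, width, height
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    dw = (pred[:, 2] - pred[:, 0]) - (target[:, 2] - target[:, 0])
    dh = (pred[:, 3] - pred[:, 1]) - (target[:, 3] - target[:, 1])

    eiou = (1 - iou
            + (dx ** 2 + dy ** 2) / (cw ** 2 + ch ** 2 + eps)
            + dw ** 2 / (cw ** 2 + eps)
            + dh ** 2 / (ch ** 2 + eps))
    # Focal-EIOU: weight each box's loss by IoU^gamma so high-quality
    # (large-overlap) anchors dominate the regression signal.
    return (iou.detach() ** gamma * eiou).mean()
```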


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed complete IoU (CIoU) loss and cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency.
Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted $\ell_n$-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes, and usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT) and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains of +1.7 AP and +6.2 AR100 for object detection, and +1.1 AP and +3.5 AR100 for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU.

118 citations
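
The Cluster-NMS step above lends itself to a compact matrix formulation. Below is a minimal sketch of our reading of the iteration (score-sorted pairwise IoU matrix, repeated column-wise max thresholding until a fixed point); it is not the authors' released code, and the geometric enhancements are omitted.

```python
import torch

def cluster_nms(boxes, scores, iou_thr=0.5, max_iter=200):
    """Minimal sketch of Cluster-NMS as described above (names assumed).

    boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,).
    Returns indices of kept boxes.
    """
    order = scores.argsort(descending=True)
    boxes = boxes[order]
    n = boxes.size(0)

    # Pairwise IoU matrix, kept upper-triangular so a box is only
    # compared against higher-scored boxes.
    tl = torch.max(boxes[:, None, :2], boxes[None, :, :2])
    br = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])
    inter = (br - tl).clamp(min=0).prod(dim=2)
    areas = (boxes[:, 2:] - boxes[:, :2]).prod(dim=1)
    iou = inter / (areas[:, None] + areas[None, :] - inter + 1e-7)
    iou = iou.triu(diagonal=1)

    keep = torch.ones(n, dtype=torch.bool, device=boxes.device)
    for _ in range(max_iter):
        # Only currently-kept boxes may suppress others.
        max_iou, _ = (iou * keep[:, None].float()).max(dim=0)
        new_keep = max_iou < iou_thr
        if bool((new_keep == keep).all()):
            break  # fixed point reached, usually after very few iterations
        keep = new_keep
    return order[keep]
```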


Proceedings ArticleDOI
20 Jun 2021
TL;DR: Zhang et al. as mentioned in this paper proposed a distribution-guided quality estimator (DGQP) based on the learned distributions of the four parameters of the bounding box, which can provide accurate ranking scores that benefit the NMS processing and improve detection performance.
Abstract: Localization Quality Estimation (LQE) is crucial and popular in the recent advancement of dense object detectors, since it can provide accurate ranking scores that benefit the Non-Maximum Suppression processing and improve detection performance. As a common practice, most existing methods predict LQE scores through vanilla convolutional features shared with object classification or bounding box regression. In this paper, we explore a completely novel and different perspective to perform LQE: based on the learned distributions of the four parameters of the bounding box. The bounding box distributions are inspired and introduced as the "General Distribution" in GFLV1, which describes the uncertainty of the predicted bounding boxes well. Such a property makes the distribution statistics of a bounding box highly correlated to its real localization quality. Specifically, a bounding box distribution with a sharp peak usually corresponds to high localization quality, and vice versa. By leveraging the close correlation between distribution statistics and the real localization quality, we develop a considerably lightweight Distribution-Guided Quality Predictor (DGQP) for reliable LQE based on GFLV1, thus producing GFLV2. To the best of our knowledge, it is the first attempt in object detection to use a highly relevant, statistical representation to facilitate LQE. Extensive experiments demonstrate the effectiveness of our method. Notably, GFLV2 (ResNet-101) achieves 46.2 AP at 14.6 FPS, surpassing the previous state-of-the-art ATSS baseline (43.6 AP at 14.6 FPS) by an absolute 2.6 AP on COCO test-dev, without sacrificing efficiency in either training or inference.

84 citations
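
A rough sketch of how such a distribution-guided predictor could be wired up, assuming GFLV1-style discrete edge distributions; the module shape (top-k statistics plus mean, two fully connected layers) follows our reading of the abstract, and all names and sizes are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class DGQP(nn.Module):
    """Sketch of a Distribution-Guided Quality Predictor (names assumed).

    Each box edge is predicted as a discrete distribution over bins (the
    "General Distribution" of GFLV1). This module maps cheap statistics of
    the four edge distributions (top-k values plus the mean) to a scalar
    quality score: a sharp distribution yields large top-k values, hence a
    high predicted localization quality.
    """
    def __init__(self, topk=4, hidden=64):
        super().__init__()
        self.topk = topk
        # 4 edges x (topk + 1 mean) statistics -> scalar in (0, 1)
        self.fc = nn.Sequential(
            nn.Linear(4 * (topk + 1), hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),
        )

    def forward(self, reg_logits):
        # reg_logits: (N, 4, num_bins) raw edge-distribution logits
        prob = reg_logits.softmax(dim=-1)
        topk, _ = prob.topk(self.topk, dim=-1)              # sharpness statistics
        stat = torch.cat([topk, prob.mean(dim=-1, keepdim=True)], dim=-1)
        return self.fc(stat.flatten(start_dim=1))           # (N, 1) quality score
```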


Journal ArticleDOI
TL;DR: In this paper, a Representation Invariance Loss (RIL) is proposed to optimize the bounding box regression for the rotating objects, which treats multiple representations of an oriented object as multiple equivalent local minima, and hence transforms bounding boxes regression into an adaptive matching process with these local minimima.
Abstract: Arbitrary-oriented objects exist widely in natural scenes, and thus oriented object detection has received extensive attention in recent years. The mainstream rotation detectors use oriented bounding boxes (OBB) or quadrilateral bounding boxes (QBB) to represent rotating objects. However, these methods suffer from representation ambiguity in the oriented object definition, which leads to suboptimal regression optimization and inconsistency between the loss metric and the localization accuracy of the predictions. In this paper, we propose a Representation Invariance Loss (RIL) to optimize the bounding box regression for rotating objects. Specifically, RIL treats multiple representations of an oriented object as multiple equivalent local minima, and hence transforms bounding box regression into an adaptive matching process with these local minima. Then, the Hungarian matching algorithm is adopted to obtain the optimal regression strategy. We also propose a normalized rotation loss to alleviate the weak correlation between different variables and their unbalanced loss contribution in OBB representation. Extensive experiments on remote sensing datasets and scene text datasets show that our method achieves consistent and substantial improvement. The source code and trained models are available at this https URL.

57 citations
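
The adaptive matching idea can be pictured with a toy example on quadrilateral (QBB) corners, where the equivalent representations are the different vertex orderings and the Hungarian algorithm picks the cheapest assignment. This is only a sketch of the concept, not the authors' loss; all names are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_representation_loss(pred_quad, gt_quad):
    """Toy sketch of the RIL matching idea (concept only, names assumed).

    A quadrilateral admits several equivalent vertex orderings. Treating
    each as a local minimum, we match predicted vertices to ground-truth
    vertices with the Hungarian algorithm and take the cost of the optimal
    assignment, so the regression target adapts to whichever representation
    the prediction is already closest to.

    pred_quad, gt_quad: (4, 2) arrays of corner coordinates.
    """
    # Pairwise L1 cost between every predicted and ground-truth vertex
    diff = pred_quad[:, None, :] - gt_quad[None, :, :]
    cost = np.abs(diff).sum(axis=-1)          # (4, 4) cost matrix
    rows, cols = linear_sum_assignment(cost)  # optimal vertex assignment
    return cost[rows, cols].mean()
```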


Journal ArticleDOI
TL;DR: A simple yet effective anchor-free tracker (named Siamese corner networks, SiamCorners) is proposed, which is trained end-to-end offline on large-scale image pairs and achieves experimental results comparable to state-of-the-art trackers while maintaining a high running speed.
Abstract: The current Siamese network based on the region proposal network (RPN) has attracted great attention in visual tracking due to its excellent accuracy and high efficiency. However, the design of the RPN involves the selection of the number, scale, and aspect ratios of anchor boxes, which will affect the applicability and convenience of the model. Furthermore, these anchor boxes require complicated calculations, such as calculating their intersection-over-union (IoU) with ground truth bounding boxes. Due to the problems related to anchor boxes, we propose a simple yet effective anchor-free tracker (named Siamese corner networks, SiamCorners), which is trained end-to-end offline on large-scale image pairs. Specifically, we introduce a modified corner pooling layer to convert the bounding box estimate of the target into a pair of corner predictions (the bottom-right and the top-left corners). By tracking a target as a pair of corners, we avoid the need to design anchor boxes. This makes the entire tracking algorithm more flexible and simple than anchor-based trackers. In our network design, we further introduce a layer-wise feature aggregation strategy that enables the corner pooling module to predict multiple corners for a tracking target in deep networks. We then introduce a new penalty term that is used to select an optimal tracking box among these candidate corners. Finally, SiamCorners achieves experimental results that are comparable to the state-of-the-art trackers while maintaining a high running speed. In particular, SiamCorners achieves a 53.7% AUC on NFS30 and a 61.4% AUC on UAV123, while still running at 42 frames per second (FPS).

55 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an affinity attention graph neural network (A2GNN) to propagate semantic labels from the confident seeds to the unlabeled pixels, which can acquire the short-and long-distance information from soft graph edges.
Abstract: Weakly supervised semantic segmentation is receiving great attention due to its low human annotation cost. In this paper, we aim to tackle bounding box supervised semantic segmentation, i.e., training accurate semantic segmentation models using bounding box annotations as supervision. To this end, we propose the Affinity Attention Graph Neural Network (A2GNN). Following previous practices, we first generate pseudo semantic-aware seeds, which are then formed into semantic graphs based on our newly proposed affinity Convolutional Neural Network (CNN). Then the built graphs are input to our A2GNN, in which an affinity attention layer is designed to acquire the short- and long-distance information from soft graph edges to accurately propagate semantic labels from the confident seeds to the unlabeled pixels. However, to guarantee the precision of the seeds, we only adopt a limited number of confident pixel seed labels for A2GNN, which may lead to insufficient supervision for training. To alleviate this issue, we further introduce a new loss function and a consistency-checking mechanism to leverage the bounding box constraint, so that more reliable guidance can be included for model optimization. Experiments show that our approach achieves new state-of-the-art or comparable performance on the Pascal VOC 2012 dataset (val: 76.5%, test: 75.2%).

51 citations


Proceedings ArticleDOI
26 May 2021
TL;DR: Zhang et al. as mentioned in this paper collected a radar dataset that contains radar data in the form of Range-Azimuth-Doppler tensors along with bounding boxes on the tensor for dynamic road users, category labels, and 2D bounding boxes on the Cartesian Bird-Eye-View range map.
Abstract: Object detection using automotive radars has not been explored with deep learning models in comparison to the camera-based approaches. This can be attributed to the lack of public radar datasets. In this paper, we collect a novel radar dataset that contains radar data in the form of Range-Azimuth-Doppler tensors along with the bounding boxes on the tensor for dynamic road users, category labels, and 2D bounding boxes on the Cartesian Bird-Eye-View range map. To build the dataset, we propose an instance-wise auto-annotation method. Furthermore, a novel Range-Azimuth-Doppler based multi-class object detection deep learning model is proposed. The algorithm is a one-stage anchor-based detector that generates both 3D bounding boxes and 2D bounding boxes on the Range-Azimuth-Doppler and Cartesian domains, respectively. Our proposed algorithm achieves 56.3% AP with an IOU of 0.3 on 3D bounding box predictions, and 51.6% AP with an IOU of 0.5 on 2D bounding box predictions. Our dataset and the code can be found at https://github.com/ZhangAoCanada/RADDet.git.

34 citations


Journal ArticleDOI
TL;DR: In this article, a representation invariance loss (RIL) is proposed to optimize the bounding box regression for the rotating objects in remote sensing images, which treats multiple representations of an oriented object as multiple equivalent local minima.
Abstract: Arbitrary-oriented objects exist widely in remote sensing images. The mainstream rotation detectors use oriented bounding boxes (OBBs) or quadrilateral bounding boxes (QBBs) to represent the rotating objects. However, these methods suffer from the representation ambiguity for oriented object definition, which leads to suboptimal regression optimization and the inconsistency between the loss metric and the localization accuracy of the predictions. In this letter, we propose a representation invariance loss (RIL) to optimize the bounding box regression for the rotating objects in the remote sensing images. RIL treats multiple representations of an oriented object as multiple equivalent local minima and hence transforms bounding box regression into an adaptive matching process with these local minima. Next, the Hungarian matching algorithm is adopted to obtain the optimal regression strategy. Besides, we propose a normalized rotation loss to alleviate the weak correlation between different variables and their unbalanced loss contribution in OBB representation. Extensive experiments on remote sensing datasets show that our method achieves consistent and substantial improvement. The code and models are available at https://github.com/ming71/RIDet to facilitate future research.

31 citations


Journal ArticleDOI
TL;DR: A novel object detection method is presented that handles freely rotated objects of arbitrary sizes, including tiny objects as small as $2 \times 2$ pixels, and learns all required information by classification, which has the added benefit of enabling oriented bounding box detection without any extra computation.
Abstract: A novel object detection method is presented that handles freely rotated objects of arbitrary sizes, including tiny objects as small as 2 × 2 pixels. Such tiny objects appear frequently in remotely sensed images, and present a challenge to recent object detection algorithms. More importantly, current object detection methods have been designed originally to accommodate axis-aligned bounding box detection, and therefore fail to accurately localize oriented boxes that best describe freely rotated objects. In contrast, the proposed convolutional neural network (CNN)-based approach uses potential pixel information at multiple scale levels without the need for any external resources, such as anchor boxes. The method encodes the precise location and orientation of features of the target objects at grid cell locations. Unlike existing methods that regress the bounding box location and dimension, the proposed method learns all the required information by classification, which has the added benefit of enabling oriented bounding box detection without any extra computation. It thus infers the bounding boxes only at inference time by finding the minimum surrounding box for every set of the same predicted class labels. Moreover, a rotation-invariant feature representation is applied at each scale, which imposes a regularization constraint to enforce covering the 360° range of in-plane rotation of the training samples to share similar features. Evaluations on the xView and DOTA (Dataset for Object Detection in Aerial Images) data sets show that the proposed method uniformly improves performance over existing state-of-the-art methods.

28 citations
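
The inference step described above (grouping same-class pixels and fitting a minimum surrounding box) might look roughly like the following OpenCV-based sketch; the connected-component grouping and the background index are our assumptions, not details from the paper.

```python
import cv2
import numpy as np

def boxes_from_class_map(class_map, num_classes):
    """Sketch of classification-only oriented box inference (assumed details).

    Given a per-pixel class-label map, group the pixels of each
    non-background class into connected components and fit the minimum-area
    rotated rectangle around each component, yielding oriented boxes with
    no box regression at all.
    """
    boxes = []
    for c in range(1, num_classes):          # class 0 assumed background
        mask = (class_map == c).astype(np.uint8)
        n, labels = cv2.connectedComponents(mask)
        for k in range(1, n):
            pts = np.argwhere(labels == k)[:, ::-1].astype(np.float32)  # (x, y)
            if len(pts) < 2:
                continue
            (cx, cy), (w, h), angle = cv2.minAreaRect(pts)
            boxes.append((c, cx, cy, w, h, angle))
    return boxes
```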


Posted Content
TL;DR: The ROad event Awareness Dataset for Autonomous Driving is introduced, to the authors' knowledge the first of its kind, designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs, and the corresponding scene locations.
Abstract: Humans approach driving in a holistic fashion which entails, in particular, understanding road events and their evolution. Injecting these capabilities in an autonomous vehicle has thus the potential to take situational awareness and decision making closer to human-level performance. To this purpose, we introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of a moving agent, the action(s) it performs and the corresponding scene locations. ROAD comprises 22 videos, originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event. We also provide as a baseline a new incremental algorithm for online road event awareness, based on inflating RetinaNet along time, which achieves a mean average precision of 16.8% and 6.1% for frame-level and video-level event detection, respectively, at 50% overlap. Though promising, these figures highlight the challenges faced by situation awareness in autonomous driving. Finally, ROAD allows scholars to investigate exciting tasks such as complex (road) activity detection, future road event anticipation and the modelling of sentient road agents in terms of mental states. The dataset can be obtained from this https URL and the baseline code from this https URL.

27 citations


Journal ArticleDOI
TL;DR: Guided upsampling and background suppression not only improve counting performance but also enable explainable output visualization, and the TasselNetV3 series is introduced.
Abstract: Fast and accurate plant counting tools are driving a revolution in modern agriculture. Agricultural practitioners, however, expect the output of the tools to be not only accurate but also explainable. Such explainability often refers to the ability to infer which instance is counted. One intuitive way is to generate a bounding box for each instance. Nevertheless, compared with counting by detection, plant counts can be inferred more directly in the local count framework, while one criticism of this paradigm is the poor explainability of its output visualization. In particular, we find that the poor explainability becomes a bottleneck limiting the counting performance. To address this, we explore the idea of guided upsampling and background suppression, where a novel upsampling operator is proposed to allow count redistribution, and segmentation decoders with different fusion strategies are investigated to suppress background. By integrating them into our previous counting model TasselNetV2, we introduce the TasselNetV3 series: TasselNetV3-Lite and TasselNetV3-Seg. We validate the TasselNetV3 series on three public plant counting data sets and a new unmanned aerial vehicle (UAV)-based data set, covering maize tassel counting, wheat ear counting, and rice plant counting. Extensive results show that guided upsampling and background suppression not only improve counting performance but also enable explainable visualization. Aside from state-of-the-art performance, we have several interesting observations: 1) a limited-receptive-field counter in most cases outperforms a large-receptive-field one; 2) it is sufficient to generate empirical segmentation masks from dotted annotations; 3) middle fusion is a good choice to integrate foreground-background a priori knowledge; and 4) decoupling the learning of counting and segmentation matters.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors published a public ship detection dataset, namely ShipRSImageNet, which contributes an accurately labeled dataset in different scenes with variant categories and image sources.
Abstract: Ship detection in optical remote sensing images has potential applications in national maritime security, fishing, and defense. Many detectors, including computer vision and geoscience-based methods, have been proposed in the past decade. Recently, deep-learning-based algorithms have also achieved great success in the field of ship detection. However, most of the existing detectors face difficulties in complex environments, small ship detection, and fine-grained ship classification. One reason is that existing datasets have shortcomings in terms of the inadequate number of images, few ship categories, image diversity, and insufficient variations. This article publishes a public ship detection dataset, namely ShipRSImageNet, which contributes an accurately labeled dataset in different scenes with variant categories and image sources. The proposed ShipRSImageNet contains over 3435 images with 17,573 ship instances in 50 categories, elaborately annotated with both horizontal and oriented bounding boxes by experts. To our knowledge, the proposed ShipRSImageNet is currently the largest remote sensing dataset for ship detection. Moreover, several state-of-the-art detection algorithms are evaluated on our proposed ShipRSImageNet dataset to provide a benchmark for deep-learning-based ship detection methods, which is valuable for assessing algorithm improvement.

Posted Content
TL;DR: Li et al. as discussed by the authors presented the first large-scale open simulated dataset for V2V perception, which contains over 70 interesting scenes, 11,464 frames, and 232,913 annotated 3D vehicle bounding boxes, collected from 8 towns in CARLA and a digital town of Culver City, Los Angeles.
Abstract: Employing Vehicle-to-Vehicle communication to enhance perception performance in self-driving technology has attracted considerable attention recently; however, the absence of a suitable open dataset for benchmarking algorithms has made it difficult to develop and assess cooperative perception technologies. To this end, we present the first large-scale open simulated dataset for Vehicle-to-Vehicle perception. It contains over 70 interesting scenes, 11,464 frames, and 232,913 annotated 3D vehicle bounding boxes, collected from 8 towns in CARLA and a digital town of Culver City, Los Angeles. We then construct a comprehensive benchmark with a total of 16 implemented models to evaluate several information fusion strategies (i.e., early, late, and intermediate fusion) with state-of-the-art LiDAR detection algorithms. Moreover, we propose a new Attentive Intermediate Fusion pipeline to aggregate information from multiple connected vehicles. Our experiments show that the proposed pipeline can be easily integrated with existing 3D LiDAR detectors and achieve outstanding performance even with large compression rates. To encourage more researchers to investigate Vehicle-to-Vehicle perception, we will release the dataset, benchmark methods, and all related codes in this https URL.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, a new design of sub-graphs is introduced to represent and encode the discriminative patterns of each action in the videos, which novelly builds space-time graphs and clusters the graphs into compact subgraphs on each scale with respect to the number of nodes.
Abstract: Human actions are typically of combinatorial structures or patterns, i.e., subjects, objects, plus spatio-temporal interactions in between. Discovering such structures is therefore a rewarding way to reason about the dynamics of interactions and recognize the actions. In this paper, we introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos. Specifically, we present the MUlti-scale Sub-graph LEarning (MUSLE) framework that novelly builds space-time graphs and clusters the graphs into compact sub-graphs on each scale with respect to the number of nodes. Technically, MUSLE produces 3D bounding boxes, i.e., tubelets, in each video clip, as graph nodes and takes dense connectivity as graph edges between tubelets. For each action category, we execute online clustering to decompose the graph into sub-graphs on each scale through learning a Gaussian Mixture Layer and select the discriminative sub-graphs as action prototypes for recognition. Extensive experiments are conducted on both Something-Something V1 & V2 and Kinetics-400 datasets, and superior results are reported compared to state-of-the-art methods. More remarkably, our MUSLE achieves the best accuracy reported to date, 65.0%, on the Something-Something V2 validation set.

Journal ArticleDOI
TL;DR: This paper revisits the problem of estimating the set of states of a positive system that are reachable from the origin by constrained exogenous inputs, and gives an estimate for the reachable set using the convex hull of hyper-rectangles, as well as a bounding polyhedron whose outward face normals are not necessarily positive vectors.

Journal ArticleDOI
TL;DR: This paper explores error bounds for data-driven models under all possible training and testing scenarios drawn from an underlying distribution, and proposes an evaluation implementation based on Rademacher complexity theory that focuses on regression problems and can provide a tighter bound.
Abstract: Data-driven models analyze power grids under incomplete physical information, and their accuracy has been mostly validated empirically using certain training and testing datasets. This paper explores error bounds for data-driven models under all possible training and testing scenarios drawn from an underlying distribution, and proposes an evaluation implementation based on Rademacher complexity theory. We answer critical questions for data-driven models: how much training data is required to guarantee a certain error bound, and how partial physical knowledge can be utilized to reduce the required amount of data. Different from traditional Rademacher complexity that mainly addresses classification problems, our method focuses on regression problems and can provide a tighter bound. Our results are crucial for the evaluation and application of data-driven models in power grid analysis. We demonstrate the proposed method by finding generalization error bounds for two applications, i.e., branch flow linearization and external network equivalent under different degrees of physical knowledge. Results identify how the bounds decrease with additional power grid physical knowledge or more training data.

Posted Content
TL;DR: In this article, a re-check network propagates previous tracklets to the current frame by exploring the relation between cross-frame temporal cues and current candidates using the modified cross-correlation layer.
Abstract: The one-shot multi-object tracking, which integrates object detection and ID embedding extraction into a unified network, has achieved groundbreaking results in recent years. However, current one-shot trackers solely rely on single-frame detections to predict candidate bounding boxes, which may be unreliable when facing disastrous visual degradation, e.g., motion blur or occlusions. Once a target bounding box is mistakenly classified as background by the detector, the temporal consistency of its corresponding tracklet will no longer be maintained, as shown in Fig. 1. In this paper, we set out to restore the misclassified bounding boxes, i.e., fake background, by proposing a re-check network. The re-check network propagates previous tracklets to the current frame by exploring the relation between cross-frame temporal cues and current candidates using the modified cross-correlation layer. The propagation results help to reload the "fake background" and eventually repair the broken tracklets. By inserting the re-check network into the strong baseline tracker CSTrack (a variant of JDE), our model achieves favorable gains of $70.7 \rightarrow 76.7$ and $70.6 \rightarrow 76.3$ MOTA on MOT16 and MOT17, respectively. Code is publicly available at this https URL.
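
The propagation step can be pictured as correlating each tracklet's ID embedding with the current frame's embedding map. Below is a minimal sketch with all tensor shapes and names assumed; the paper's modified cross-correlation layer adds further machinery that is omitted here.

```python
import torch
import torch.nn.functional as F

def recheck_similarity(prev_id_embeds, curr_feat_map):
    """Sketch of the cross-correlation idea (names and shapes assumed).

    Each previous tracklet's ID embedding is used as a 1x1 kernel and
    correlated with the current frame's embedding map, producing one
    response map per tracklet whose peaks suggest where that target
    likely moved, even if the detector missed it.
    """
    # prev_id_embeds: (M, C) one embedding per tracklet
    # curr_feat_map:  (1, C, H, W) current frame ID-embedding map
    kernels = prev_id_embeds[:, :, None, None]       # (M, C, 1, 1)
    response = F.conv2d(curr_feat_map, kernels)      # (1, M, H, W)
    return response
```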

Journal ArticleDOI
TL;DR: This work develops an exact mathematical model based on a novel multi-level network representation that yields solutions with disjoint paths and introduces two heuristic methods and a lower bounding procedure to minimize the total latency of reaching the critical locations.

Journal ArticleDOI
TL;DR: In this paper, a centroid-centric vector regression method was proposed for text detection in the wild to generate a quadrilateral bounding box for detecting text in the scene image.
Abstract: Scene text appears with a wide range of sizes and arbitrary orientations. For detecting such text in a scene image, a quadrilateral bounding box provides a much tighter fit than a rotated rectangle. In this work, a vector regression method is proposed for text detection in the wild that generates a quadrilateral bounding box. Bounding box prediction using direct regression requires predicting vectors from each position inside the quadrilateral. It needs to predict four vectors, each of which varies drastically in length and orientation, which makes vector prediction a difficult problem. To overcome this, we propose a centroid-centric vector regression that utilizes the geometry of the quadrilateral. In this work, we add the philosophy of indirect regression to direct regression by shifting all points within the quadrilateral to the centroid and afterward performing vector regression from the shifted points. The experimental results show the improvement of the quadrilateral approach over the existing direct regression approach. The proposed method shows good performance on many existing public datasets. It also demonstrates good results on an unseen dataset without being trained on it, which validates the approach's generalization ability.
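
A tiny sketch of the centroid-centric encoding as we understand it from the abstract: corner vectors are expressed relative to the quadrilateral's centroid rather than to arbitrary interior points, making their lengths and orientations far more homogeneous. Function and variable names are ours.

```python
import numpy as np

def centroid_vectors(quad):
    """Sketch of the centroid-centric target encoding (names assumed).

    Instead of regressing four corner vectors of very different lengths
    and orientations from every interior pixel, points are first shifted
    to the quadrilateral's centroid, and the four shorter, more uniform
    centroid-to-corner vectors become the regression targets.
    """
    quad = np.asarray(quad, dtype=np.float32)   # (4, 2) corner coordinates
    centroid = quad.mean(axis=0)
    return centroid, quad - centroid            # vectors from centroid to corners
```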

Book ChapterDOI
10 Jan 2021
TL;DR: In this paper, Mask R-CNN is extended with Monte-Carlo dropout layers to estimate the epistemic uncertainty of bounding box, class, mask, and score predictions.
Abstract: State-of-the-art instance segmentation techniques currently provide a bounding box, class, mask, and scores for each instance. What they do not provide is an epistemic uncertainty estimate of these predictions. With our approach, we want to identify corner cases by considering the epistemic uncertainty. Corner cases are data/situations that are underrepresented or not covered in our data set. Our work is based on Mask R-CNN. We estimate the epistemic uncertainty by extending the architecture with Monte-Carlo dropout layers. By repeatedly executing the forward pass, we create a large number of predictions per instance. Afterward, we cluster the predictions of an instance based on the bounding box coordinates. It becomes possible to determine the epistemic position uncertainty for the bounding boxes and the classifier’s epistemic class uncertainty. For the epistemic uncertainty regarding the bounding box position and the class assignment, we provide a criterion for detecting corner cases utilizing the model’s epistemic uncertainty.
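
A minimal sketch of the Monte-Carlo dropout sampling loop described above, assuming for illustration a detector that returns a fixed-size (N, 4) box tensor; in the actual approach the sampled detections are first clustered per instance by bounding box coordinates, which is omitted here.

```python
import torch

def mc_dropout_boxes(model, image, passes=20):
    """Sketch of Monte-Carlo dropout for epistemic box uncertainty.

    Dropout layers are kept active at inference time while the rest of
    the network stays in eval mode; the spread of the sampled box
    coordinates across repeated forward passes estimates the epistemic
    position uncertainty.
    """
    model.eval()
    # Re-enable dropout layers only (BatchNorm etc. stay in eval mode)
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()

    samples = []
    with torch.no_grad():
        for _ in range(passes):
            # Assumed, for this sketch, to return an (N, 4) box tensor
            # with a stable N across passes.
            samples.append(model(image))
    boxes = torch.stack(samples)                 # (passes, N, 4)
    return boxes.mean(dim=0), boxes.std(dim=0)   # positions + their uncertainty
```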

Posted Content
TL;DR: Zhang et al. as mentioned in this paper defend the problem setting of improving localization performance by leveraging bounding box regression knowledge from a well-annotated auxiliary dataset, which is more convenient and economical to implement while avoiding leakage of the auxiliary well-annotated dataset.
Abstract: Weakly-supervised object detection (WSOD) has emerged as an inspiring recent topic to avoid expensive instance-level object annotations. However, the bounding boxes of most existing WSOD methods are mainly determined by precomputed proposals, thereby being limited in precise object localization. In this paper, we defend the problem setting of improving localization performance by leveraging the bounding box regression knowledge from a well-annotated auxiliary dataset. First, we use the well-annotated auxiliary dataset to explore a series of learnable bounding box adjusters (LBBAs) in a multi-stage training manner, which is class-agnostic. Then, only the LBBAs and a weakly-annotated dataset with non-overlapping classes are used for training LBBA-boosted WSOD. As such, our LBBAs are practically more convenient and economical to implement while avoiding the leakage of the auxiliary well-annotated dataset. In particular, we formulate learning bounding box adjusters as a bi-level optimization problem and suggest an EM-like multi-stage training algorithm. Then, a multi-stage scheme is further presented for LBBA-boosted WSOD. Additionally, a masking strategy is adopted to improve proposal classification. Experimental results verify the effectiveness of our method. Our method performs favorably against state-of-the-art WSOD methods and knowledge transfer models with a similar problem setting. Code is publicly available at \url{this https URL}.

Journal ArticleDOI
TL;DR: A Meta-Refine-Net is proposed to train object detectors from noisy category labels and imprecise bounding boxes; it is model-agnostic and capable of learning from noisy object detection data with only a few clean examples.
Abstract: Object detection has gained great improvements with the advances of convolutional neural networks and the availability of large amounts of accurate training data. Though the amount of data is increasing significantly, the quality of data annotations is not guaranteed by the existing crowd-sourcing labeling platforms. In addition to noisy category labels, imprecise bounding box annotations commonly exist in object detection data. When the quality of training data degenerates, the performance of typical object detectors is severely impaired. In this paper, we propose a Meta-Refine-Net (MRNet) to train object detectors from noisy category labels and imprecise bounding boxes. First, MRNet learns to adaptively assign lower weights to proposals with incorrect labels so as to suppress large loss values generated by these proposals on the classification branch. Second, MRNet learns to dynamically generate more accurate bounding box annotations to overcome the misleading effect of imprecisely annotated bounding boxes. Thus, the imprecise bounding boxes can impose positive impacts on the regression branch rather than simply being ignored. Third, we propose to refine the imprecise bounding box annotations by jointly learning from both the category and the localization information. By doing this, the approximation of ground-truth bounding boxes is more accurate while the misleading effect is further alleviated. Our MRNet is model-agnostic and is capable of learning from noisy object detection data with only a few clean examples (less than 2%). Extensive experiments on PASCAL VOC 2012 and MS COCO 2017 demonstrate the effectiveness and efficiency of our method.

Journal ArticleDOI
TL;DR: In this paper, an adversarial example attack that triggers malfunctioning of NMS in OD models is proposed, which compresses the dimensions of detection boxes to evade NMS, and the final detection output contains extremely dense false positives.
Abstract: This article demonstrates that nonmaximum suppression (NMS), which is commonly used in object detection (OD) tasks to filter redundant detection results, is no longer secure. Considering that NMS has been an integral part of OD systems, thwarting the functionality of NMS can result in unexpected or even lethal consequences for such systems. In this article, an adversarial example attack that triggers malfunctioning of NMS in OD models is proposed. The attack, namely, Daedalus, compresses the dimensions of detection boxes to evade NMS. As a result, the final detection output contains extremely dense false positives. This can be fatal for many OD applications, such as autonomous vehicles and surveillance systems. The attack can be generalized to different OD models, such that the attack cripples various OD applications. Furthermore, a way of crafting robust adversarial examples is developed by using an ensemble of popular detection models as the substitutes. Considering the pervasive nature of model reuse in real-world OD scenarios, Daedalus examples crafted based on an ensemble of substitutes can launch attacks without knowing the parameters of the victim models. The experimental results demonstrate that the attack effectively stops NMS from filtering redundant bounding boxes. As the evaluation results suggest, Daedalus increases the false positive rate in detection results to 99.9% and reduces the mean average precision scores to 0, while maintaining a low cost of distortion on the original inputs. It also demonstrates that the attack can be practically launched against real-world OD systems via printed posters.
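
To see the surface being attacked, consider plain greedy NMS: a detection is discarded only when its IoU with a higher-scored kept box exceeds the threshold. If an adversarial input compresses all predicted boxes so that pairwise IoUs fall below the threshold, nothing is suppressed and the output floods with false positives. A minimal NumPy sketch of such an NMS loop (the standard algorithm, not the attack itself):

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thr=0.5):
    """Standard greedy NMS over (x1, y1, x2, y2) boxes.

    Note the suppression condition: shrinking all boxes drives pairwise
    IoUs toward zero, so every detection survives, which is exactly the
    failure mode Daedalus induces.
    """
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # IoU of the current top box against all remaining boxes
        tl = np.maximum(boxes[i, :2], boxes[rest, :2])
        br = np.minimum(boxes[i, 2:], boxes[rest, 2:])
        inter = np.prod(np.clip(br - tl, 0, None), axis=1)
        area_i = np.prod(boxes[i, 2:] - boxes[i, :2])
        area_r = np.prod(boxes[rest, 2:] - boxes[rest, :2], axis=1)
        iou = inter / (area_i + area_r - inter + 1e-7)
        order = rest[iou <= iou_thr]   # keep only boxes with low overlap
    return keep
```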

Journal ArticleDOI
TL;DR: A new global solver, named Global-TEP, is presented that can solve the ACTEP problem efficiently with a guaranteed optimality gap and is more scalable, more flexible, and much faster than the available global solvers.
Abstract: To design a reliable and secure power system, it is necessary to have enough transmission capacity. The solution of transmission expansion planning (TEP) problem determines cost-optimal investment in future transmission equipment. In this paper, we propose a new global solver, named Global-TEP, for TEP problem with an AC network representation (ACTEP), which is a mixed-integer nonlinear programming problem. The proposed solver is based on second-order cone relaxation, enhanced relaxation tightening constraints, and optimization-based/feasibility-based bound tightening techniques. Multiple enhanced relaxation tightening constraints are incorporated into the mixed-integer second-order cone relaxation of TEP in order to obtain a very strong relaxation as the lower bounding problem. In addition, a novel feasibility-based bound tightening technique is proposed to tighten bounds of decision variables in a considerably short runtime. Finally, introducing a novel application of optimization-based bound tightening technique, Global-TEP is constructed that can solve the ACTEP problem efficiently with a guaranteed optimality gap. As illustrated by numerical case studies, Global-TEP is more scalable, more flexible, and much faster than the available global solvers.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a Scale-Sensitive IoU (SIOU) loss for object detection on multi-scale targets, especially in remote sensing images, to solve the problem that the gradients of current loss functions tend to be smooth and cannot distinguish some special bounding boxes during training, which may cause unreasonable loss value calculation and impact the convergence speed.
Abstract: The regression loss function in an object detection model is an important factor during the training procedure. IoU-based loss functions, such as the CIOU loss, achieve remarkable performance but still have some inherent shortcomings that may cause slow convergence. This paper proposes a Scale-Sensitive IoU (SIOU) loss for object detection on multi-scale targets, especially in remote sensing images, to solve the problem that the gradients of current loss functions tend to be smooth and cannot distinguish some special bounding boxes during training in multi-scale object detection, which may cause unreasonable loss value calculation and impact the convergence speed. A new geometric factor affecting the loss value calculation, namely the area difference, is introduced to extend the existing three factors in the CIOU loss; by introducing an area regulatory factor $\gamma$ to the loss function, it can adjust the loss values of the bounding boxes and distinguish different boxes quantitatively. Furthermore, we also apply our SIOU loss to oriented bounding box detection and obtain better optimization. Through extensive experiments, the detection accuracies of YOLOv4, Faster R-CNN and SSD with the SIOU loss improve much more than with previous loss functions on two horizontal bounding box datasets, i.e., NWPU VHR-10 and DIOR, and on the oriented bounding box dataset DOTA, all of which are remote sensing datasets. Therefore, the proposed loss function achieves state-of-the-art performance on multi-scale object detection.

Book ChapterDOI
TL;DR: Wang et al. as mentioned in this paper proposed generalized multiple instance learning and smooth maximum approximation to integrate the bounding box tightness prior into the deep neural network in an end-to-end manner.
Abstract: This paper presents a weakly supervised image segmentation method that adopts tight bounding box annotations. It proposes generalized multiple instance learning (MIL) and smooth maximum approximation to integrate the bounding box tightness prior into the deep neural network in an end-to-end manner. In generalized MIL, positive bags are defined by parallel crossing lines with a set of different angles, and negative bags are defined as individual pixels outside of any bounding boxes. Two variants of smooth maximum approximation, i.e., the $\alpha$-softmax function and the $\alpha$-quasimax function, are exploited to overcome the numerical instability introduced by the maximum function in bag prediction. The proposed approach was evaluated on two public medical datasets using the Dice coefficient. The results demonstrate that it outperforms the state-of-the-art methods. The codes are available at \url{https://github.com/wangjuan313/wsis-boundingbox}.
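
The two smooth maximum approximations, as we understand them, replace the non-differentiable max over the pixel predictions in a bag with smooth surrogates; both approach the true maximum as $\alpha$ grows. A minimal sketch under that reading:

```python
import math
import torch

def alpha_softmax(x, alpha=4.0):
    """Smooth max via the alpha-softmax function: a softmax-weighted
    average of x that concentrates on the largest entry as alpha grows."""
    w = torch.softmax(alpha * x, dim=-1)
    return (w * x).sum(dim=-1)

def alpha_quasimax(x, alpha=4.0):
    """Smooth max via the alpha-quasimax function: a log-sum-exp bound
    shifted down by log(n)/alpha so it never exceeds max(x)."""
    n = x.size(-1)
    return (torch.logsumexp(alpha * x, dim=-1) - math.log(n)) / alpha
```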

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper designed a reID-based search space as well as a search objective to fit NAS for the reID tasks, which is achieved via automatically searching attention-based network architectures from scratch.
Abstract: Recent years have witnessed significant progress in person reidentification (reID) driven by expert-designed deep neural network architectures. Despite the remarkable success, such architectures often suffer from high model complexity and a time-consuming pretraining process, as well as mismatches between the image classification-driven backbones and the reID task. To address these issues, we introduce neural architecture search (NAS) into automatically designing person reID backbones, i.e., reID-NAS, which is achieved via automatically searching attention-based network architectures from scratch. Different from traditional NAS approaches that originated for image classification, we design a reID-based search space as well as a search objective to fit NAS for the reID tasks. In terms of the search space, reID-NAS includes a lightweight attention module to precisely locate arbitrary pedestrian bounding boxes, which is automatically added as attention to the reID architectures. In terms of the search objective, reID-NAS introduces a new retrieval objective to search and train reID architectures from scratch. Finally, we propose a hybrid optimization strategy to improve the search stability in reID-NAS. In our experiments, we validate the effectiveness of different parts of reID-NAS, and show that the architecture searched by reID-NAS achieves a new state of the art, with one order of magnitude fewer parameters, on three person reID datasets. As a concomitant benefit, the reliance on the pretraining process is vastly reduced by reID-NAS, which makes it feasible to directly search and train a lightweight reID model from scratch.

Journal ArticleDOI
TL;DR: A UAV object tracking algorithm is proposed that optimizes the network's spatial semantic information and channel feature information and strengthens the selection of bounding boxes, with a convolutional attention module designed to enhance the weighting of feature spatial locations and feature channels.
Abstract: In the process of unmanned aerial vehicle (UAV) object tracking, the tracked object is often lost due to problems such as occlusion and fast motion. In this paper, based on the SiamRPN algorithm, we propose a UAV object tracking algorithm that optimizes the network's spatial semantic information and channel feature information and strengthens the selection of bounding boxes. Since the traditional SiamRPN method does not consider remote context information, the calculation and selection of bounding boxes need to be improved. Therefore, (1) we design a convolutional attention module to enhance the weighting of feature spatial locations and feature channels. (2) We also add a multi-spectral channel attention module to the search branch of the network to further solve the remote dependency problems of the network and effectively understand different UAV tracking scenes. Finally, we use the distance intersection over union to predict the bounding box, and an accurate prediction bounding box is regressed. The experimental results show that the algorithm has strong robustness and accuracy in many scenes.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a new representation of the probabilistic bounding box through a spatial uncertainty distribution and proposed Jaccard IoU (JIoU) as a new evaluation metric that extends IoU by incorporating label uncertainty.
Abstract: The availability of many real-world driving datasets is a key reason behind the recent progress of object detection algorithms in autonomous driving. However, there exist ambiguity or even failures in object labels due to error-prone annotation process or sensor observation noise. Current public object detection datasets only provide deterministic object labels without considering their inherent uncertainty, as does the common training process or evaluation metrics for object detectors. As a result, an in-depth evaluation among different object detection methods remains challenging, and the training process of object detectors is sub-optimal, especially in probabilistic object detection. In this work, we infer the uncertainty in bounding box labels from LiDAR point clouds based on a generative model, and define a new representation of the probabilistic bounding box through a spatial uncertainty distribution. Comprehensive experiments show that the proposed model reflects complex environmental noises in LiDAR perception and the label quality. Furthermore, we propose Jaccard IoU (JIoU) as a new evaluation metric that extends IoU by incorporating label uncertainty. We conduct an in-depth comparison among several LiDAR-based object detectors using the JIoU metric. Finally, we incorporate the proposed label uncertainty in a loss function to train a probabilistic object detector and to improve its detection accuracy. We verify our proposed methods on two public datasets (KITTI, Waymo), as well as on simulation data. Code is released at https://github.com/ZiningWang/Inferring-Spatial-Uncertainty-in-Object-Detection.

Journal ArticleDOI
TL;DR: CenterNet3D as mentioned in this paper uses keypoint estimation to find center points and directly regresses 3D bounding boxes, which can achieve real-time 3D object detection from point clouds.
Abstract: Accurate and fast 3D object detection from point clouds is a key task in autonomous driving. Existing one-stage 3D object detection methods can achieve real-time performance; however, they are dominated by anchor-based detectors, which are inefficient and require additional post-processing. In this paper, we eliminate anchors and model an object as a single point: the center point of its bounding box. Based on the center point, we propose an anchor-free CenterNet3D network that performs 3D object detection without anchors. Our CenterNet3D uses keypoint estimation to find center points and directly regresses 3D bounding boxes. However, because of the inherent sparsity of point clouds, 3D object center points are likely to lie in empty space, which makes it difficult to estimate accurate boundaries. To solve this issue, we propose an extra corner attention module to enforce the CNN backbone to pay more attention to object boundaries. Besides, considering that one-stage detectors suffer from the discordance between the predicted bounding boxes and corresponding classification confidences, we develop an efficient keypoint-sensitive warping operation to align the confidences with the predicted bounding boxes. Our proposed CenterNet3D is non-maximum suppression free, which makes it more efficient and simpler. We evaluate CenterNet3D on the widely used KITTI dataset and the more challenging nuScenes dataset. Our method outperforms all state-of-the-art anchor-based one-stage methods and has comparable performance to two-stage methods as well. It has an inference speed of 20 FPS and achieves the best speed and accuracy trade-off. Our source code will be released at https://github.com/wangguojun2018/CenterNet3d.