
Showing papers on "Object detection published in 2022"


Journal ArticleDOI
01 Jan 2022-Sensors
TL;DR: This paper investigates different versions of the YOLO object detection method and compares their performances for the specific application of detecting a safe landing location for a UAV that has suffered an in-flight failure, confirming the feasibility of utilizing these algorithms for effective emergency landing spot detection.
Abstract: In-flight system failure is one of the major safety concerns in the operation of unmanned aerial vehicles (UAVs) in urban environments. To address this concern, a safety framework consisting of the following three main tasks can be utilized: (1) monitoring the health of the UAV and detecting failures, (2) finding potential safe landing spots in case a critical failure is detected in step 1, and (3) steering the UAV to a safe landing spot found in step 2. In this paper, we specifically look at the second task, where we investigate the feasibility of utilizing object detection methods to spot safe landing locations in case the UAV suffers an in-flight failure. In particular, we investigate different versions of the YOLO object detection method and compare their performances for this specific application. We compare the performance of YOLOv3, YOLOv4, and YOLOv5l while training them on a large aerial image dataset called DOTA, on both a personal computer (PC) and a companion computer (CC). We plan to use the chosen algorithm on a CC that can be attached to a UAV, and the PC is used to verify the trends that we see between the algorithms on the CC. We confirm the feasibility of utilizing these algorithms for effective emergency landing spot detection and report their accuracy and speed for that specific application. Our investigation also shows that the YOLOv5l algorithm outperforms YOLOv4 and YOLOv3 in terms of detection accuracy while maintaining a slightly slower inference speed.

195 citations


Journal ArticleDOI
TL;DR: In this paper, a survey of recent developments in deep-learning-based object detectors is presented, along with an overview of the prominent backbone architectures used in recognition tasks and a comparison of their performances on multiple metrics.
Abstract: Object detection is the task of classification and localization of objects in an image or video. It has gained prominence in recent years due to its widespread applications. This article surveys recent developments in deep-learning-based object detectors. A concise overview of benchmark datasets and evaluation metrics used in detection is also provided, along with some of the prominent backbone architectures used in recognition tasks. It also covers contemporary lightweight classification models used on edge devices. Lastly, we compare the performances of these architectures on multiple metrics.

169 citations


Journal ArticleDOI
TL;DR: In this article, a brief overview of the You Only Look Once (YOLO) algorithm and its subsequent advanced versions is given, and the results show the differences and similarities among the YOLO versions and between YOLO and other CNNs.

166 citations


Book ChapterDOI
TL;DR: Detic as mentioned in this paper trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts; because it needs no complex scheme for assigning image labels to boxes, it is much easier to implement and compatible with a range of detection architectures and backbones.
Abstract: Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not need complex assignment schemes to assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP when evaluated on all classes, or only rare classes, hence closing the gap in performance for object categories with few samples. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning. Code is available at https://github.com/facebookresearch/Detic .
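The mechanism the abstract describes, supervising a detector's classifier directly with image-level labels, reduces to a few lines. Below is a minimal sketch of the max-size-proposal idea described in the Detic paper; the function name, tensor shapes, and loss call are illustrative assumptions, not the released facebookresearch/Detic code.

# Illustrative sketch: on image-labeled data, skip prediction-based
# label-to-box assignment and train the classifier on the largest proposal.
import torch
import torch.nn.functional as F

def detic_image_label_loss(proposals, class_logits, image_labels):
    """proposals: (N, 4) boxes; class_logits: (N, C); image_labels: list of class ids."""
    areas = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    biggest = areas.argmax()  # the max-size proposal likely covers the labeled object
    target = torch.zeros_like(class_logits[biggest])
    target[image_labels] = 1.0
    # Multi-label classification loss on that single proposal.
    return F.binary_cross_entropy_with_logits(class_logits[biggest], target)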

160 citations


Journal ArticleDOI
TL;DR: In this article, a novel enhanced multiscale feature fusion method is proposed, namely, the atrous spatial pyramid pooling-balanced-feature pyramid network (ABFPN), which uses atrous convolution operators with different dilation rates to make full use of context information.
Abstract: Object detection is a well-known task in the field of computer vision, and the small-target detection problem in particular has aroused great academic attention. In order to improve the detection performance on small objects, in this article a novel enhanced multiscale feature fusion method is proposed, namely, the atrous spatial pyramid pooling-balanced-feature pyramid network (ABFPN). In particular, atrous convolution operators with different dilation rates are employed to make full use of context information, where the skip connection is applied to achieve sufficient feature fusion. In addition, a balanced module integrates and enhances features at different levels. The performance of the proposed ABFPN is evaluated on three public benchmark datasets, and experimental results demonstrate that it is a reliable and efficient feature fusion method. Furthermore, in order to validate its application potential on small objects, the developed ABFPN is utilized to detect tiny surface defects of printed circuit boards (PCBs), where it acts as the neck part of an improved PCB defect detection (IPDD) framework. While designing the IPDD, several powerful strategies are also employed to further improve the overall performance, which is evaluated via extensive ablation studies. Experiments on a public PCB defect detection database have demonstrated the superiority of the designed IPDD framework against seven other state-of-the-art methods, which further validates the practicality of the proposed ABFPN.
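For readers unfamiliar with atrous (dilated) convolution, the sketch below shows the general pattern the abstract describes: parallel 3x3 convolutions with different dilation rates gather context at several receptive fields, and a skip connection preserves the input features. It is a minimal PyTorch illustration; the dilation rates, channel sizes, and module name are assumptions, not the ABFPN authors' exact design.

import torch
import torch.nn as nn

class AtrousFusionBlock(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        # One 3x3 conv per dilation rate; padding=d keeps the spatial size fixed.
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.project = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        ctx = torch.cat([branch(x) for branch in self.branches], dim=1)
        # Skip connection keeps the original features alongside the fused context.
        return x + self.project(ctx)

feat = torch.randn(1, 256, 32, 32)
print(AtrousFusionBlock(256)(feat).shape)  # torch.Size([1, 256, 32, 32])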

114 citations


Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR) without sacrificing inference efficiency.
Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to the widely adopted ℓn-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes, and it usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT) and object detection (e.g., YOLOv3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains of +1.7 AP and +6.2 AR100 for object detection, and +1.1 AP and +3.5 AR100 for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU .
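The three geometric factors map one-to-one onto terms of the published CIoU definition, which is compact enough to write out. The following is a scalar reference sketch for two axis-aligned boxes, not the authors' batched GPU implementation:

import math

def ciou_loss(box1, box2):
    """Complete-IoU loss between two valid boxes in (x1, y1, x2, y2) format."""
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2
    # 1) Overlap area -> IoU.
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / union
    # 2) Normalized central-point distance: squared center distance over
    #    squared diagonal of the smallest enclosing box.
    rho2 = ((x1 + x2) - (X1 + X2)) ** 2 / 4 + ((y1 + y2) - (Y1 + Y2)) ** 2 / 4
    c2 = (max(x2, X2) - min(x1, X1)) ** 2 + (max(y2, Y2) - min(y1, Y1)) ** 2
    # 3) Aspect-ratio consistency term v, weighted by the trade-off alpha.
    v = (4 / math.pi ** 2) * (math.atan((X2 - X1) / (Y2 - Y1))
                              - math.atan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v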

113 citations


Journal ArticleDOI
TL;DR: A hybrid deep neural network model based on the integration of MobileNetV2, YOLOv4, and OpenPose is constructed to identify real-time status from the physical manufacturing environment to virtual space, and it can achieve a higher detection accuracy for digital twinning in smart manufacturing.
Abstract: Recently, along with several technological advancements in cyber-physical systems, the revolution of Industry 4.0 has brought in an emerging concept named digital twin (DT), which shows its potential to break the barrier between physical and cyber space in smart manufacturing. However, it is still difficult to analyze and estimate the real-time structural and environmental parameters in terms of their dynamic changes in digital twinning, especially when facing detection tasks of multiple small objects from a large-scale scene with complex contexts in modern manufacturing environments. In this article, we focus on a small object detection model for DT, aiming to realize the dynamic synchronization between a physical manufacturing system and its virtual representation. Three significant elements, including equipment, product, and operator, are considered as the basic environmental parameters to represent and estimate the dynamic characteristics and real-time changes in building a generic DT system of a smart manufacturing workshop. A hybrid deep neural network model, based on the integration of MobileNetV2, YOLOv4, and OpenPose, is constructed to identify the real-time status from the physical manufacturing environment to virtual space. A learning algorithm is then developed to realize efficient multitype small object detection based on the integration and fusion of features from both shallow and deep layers, in order to facilitate the modeling, monitoring, and optimizing of the whole manufacturing process in the DT system. Experiments and evaluations conducted in three different use cases demonstrate the effectiveness and usefulness of our proposed method, which can achieve a higher detection accuracy for DT in smart manufacturing.

106 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed SNUNet-CD (the combination of Siamese network and NestedUNet), which alleviated the loss of localization information in the deep layers of neural network through compact information transmission between encoder and decoder.
Abstract: Change detection is an important task in remote sensing (RS) image analysis. It is widely used in natural disaster monitoring and assessment, land resource planning, and other fields. As a pixel-to-pixel prediction task, change detection is sensitive to the utilization of the original position information. Recent change detection methods always focus on the extraction of deep change semantic features but ignore the importance of shallow-layer information containing high-resolution and fine-grained features; this often leads to uncertainty in the pixels at the edges of changed targets and to missed detection of small targets. In this letter, we propose a densely connected siamese network for change detection, namely SNUNet-CD (the combination of Siamese network and NestedUNet). SNUNet-CD alleviates the loss of localization information in the deep layers of the neural network through compact information transmission between encoder and decoder, and between decoder and decoder. In addition, an Ensemble Channel Attention Module (ECAM) is proposed for deep supervision. Through ECAM, the most representative features of different semantic levels can be refined and used for the final classification. Experimental results show that our method improves greatly on many evaluation criteria and has a better tradeoff between accuracy and calculation amount than other state-of-the-art (SOTA) change detection methods.
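As a generic illustration of how channel attention can refine and select the most representative features across semantic levels, here is a squeeze-and-excitation-style gate in PyTorch. This is only a sketch of the general technique; it is not the paper's exact ECAM design, and the channel counts are arbitrary.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gate (illustrative, not ECAM)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))   # global average pool -> per-channel gate
        return x * w[:, :, None, None]    # reweight channels

# Fuse decoder outputs from several semantic levels before the final classification.
levels = [torch.randn(1, 32, 64, 64) for _ in range(4)]
fused = ChannelAttention(32 * 4)(torch.cat(levels, dim=1))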

100 citations


Journal ArticleDOI
01 Feb 2022
TL;DR: The FAIR1M dataset as discussed by the authors is a large-scale dataset with more than 1 million instances and more than 40,000 images for fine-grained object detection in high-resolution remote sensing imagery.
Abstract: With the rapid development of deep learning, many deep learning-based approaches have made great achievements in object detection tasks. It is generally known that deep learning is a data-driven approach; data directly impact the performance of object detectors to some extent. Although existing datasets include common objects in remote sensing images, they still have some scale, category, and image limitations. Therefore, there is a strong requirement for establishing a large-scale object detection benchmark for high-resolution remote sensing images. In this paper, we propose a novel benchmark dataset with more than 1 million instances and more than 40,000 images for Fine-grAined object recognItion in high-Resolution remote sensing imagery, named FAIR1M. We collected remote sensing images with a resolution of 0.3 m to 0.8 m from different platforms, which are spread across many countries and regions. All objects in the FAIR1M dataset are annotated with respect to 5 categories and 37 subcategories by oriented bounding boxes. Compared with existing detection datasets, the FAIR1M dataset has 4 particular characteristics: (1) it is much larger than other existing object detection datasets in terms of both the number of instances and the number of images, (2) it provides richer fine-grained category information for objects in remote sensing images, (3) it contains geographic information such as latitude, longitude, and resolution attributes, and (4) it provides better image quality owing to a careful data cleaning procedure. Based on the FAIR1M dataset, we propose three fine-grained object detection and recognition tasks. Moreover, we evaluate several state-of-the-art approaches to establish baselines for future research. Experimental results indicate that the FAIR1M dataset effectively represents real remote sensing applications and is quite challenging for existing methods. Considering the fine-grained characteristics, we improve the evaluation metric and introduce the idea of hierarchical detection into the algorithms. We believe that the FAIR1M dataset will contribute to the earth observation community via fine-grained object detection in large-scale real-world scenes. FAIR1M website: http://gaofen-challenge.com/.

96 citations


Book ChapterDOI
01 Jan 2022
TL;DR: ByteTrack as discussed by the authors proposes a simple, effective and generic association method, tracking by associating almost every detection box instead of only the high score ones, and utilizes their similarities with tracklets to recover true objects and filter out the background detections.
Abstract: Multi-object tracking (MOT) aims at estimating the bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. The objects with low detection scores, e.g., occluded objects, are simply thrown away, which causes non-negligible missed true objects and fragmented trajectories. To solve this problem, we present a simple, effective, and generic association method that tracks by associating almost every detection box instead of only the high-score ones. For the low-score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. When applied to 9 different state-of-the-art trackers, our method achieves consistent improvement on the IDF1 score, ranging from 1 to 10 points. To push forward the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1, and 63.1 HOTA on the test set of MOT17 with a 30 FPS running speed on a single V100 GPU. ByteTrack also achieves state-of-the-art performance on the MOT20, HiEve, and BDD100K tracking benchmarks. The source code, pre-trained models with deploy versions, and tutorials for applying the method to other trackers are released at https://github.com/ifzhang/ByteTrack .
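The association logic described above is short enough to sketch. The snippet below captures BYTE's two-stage matching in simplified Python; the greedy IoU matcher stands in for the Kalman-filter-plus-Hungarian association the real tracker uses, and the thresholds are illustrative, not ByteTrack's tuned values.

from dataclasses import dataclass

@dataclass
class Det:
    box: tuple   # (x1, y1, x2, y2)
    score: float

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def match(tracks, dets, thresh=0.3):
    """Greedy IoU matcher; a stand-in for Kalman + Hungarian association."""
    matched, used, leftover = [], set(), []
    for t in tracks:
        j = max((j for j in range(len(dets)) if j not in used),
                key=lambda j: iou(t, dets[j].box), default=None)
        if j is not None and iou(t, dets[j].box) >= thresh:
            matched.append((t, dets[j])); used.add(j)
        else:
            leftover.append(t)
    return matched, leftover, [d for j, d in enumerate(dets) if j not in used]

def byte_associate(tracklets, detections, high=0.6, low=0.1):
    first = [d for d in detections if d.score >= high]
    second = [d for d in detections if low <= d.score < high]
    # 1) Associate high-score boxes with all tracklets first.
    m1, left_tracks, unmatched_high = match(tracklets, first)
    # 2) Give low-score boxes (often occluded objects) a chance to match the
    #    leftover tracklets; low-score boxes matching nothing are background.
    m2, still_left, _background = match(left_tracks, second)
    return m1 + m2, unmatched_high, still_left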

89 citations



Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection, which is used to super-resolve features and extract credible regional details simultaneously.
Abstract: Small object detection remains an unsolved challenge because it is hard to extract the information of small objects covering only a few pixels. While scale-level corresponding detection in the feature pyramid network alleviates this problem, we find that feature coupling across scales still impairs the performance on small objects. In this paper, we propose an extended feature pyramid network (EFPN) with an extra high-resolution pyramid level specialized for small object detection. Specifically, we design a novel module, named feature texture transfer (FTT), which is used to super-resolve features and extract credible regional details simultaneously. Moreover, we introduce a cross-resolution distillation mechanism to transfer the ability of perceiving details across the scales of the network, where a foreground-background-balanced loss function is designed to alleviate the area imbalance of foreground and background. In our experiments, the proposed EFPN is efficient in both computation and memory, and yields state-of-the-art results on the small traffic-sign dataset Tsinghua-Tencent 100K and the small-object category of the general object detection dataset MS COCO.

Journal ArticleDOI
26 Jan 2022-Agronomy
TL;DR: In this paper, six versions of the You Only Look Once (YOLO) object detection algorithm were evaluated for real-time bunch detection and counting in grapes, and the best combination of accuracy and speed was achieved by YOLOv4-tiny.
Abstract: Over the last few years, several convolutional neural networks for object detection have been proposed, characterised by different accuracy and speed. In viticulture, yield estimation and prediction are used for efficient crop management, taking advantage of precision viticulture techniques. Convolutional neural networks for object detection represent an alternative methodology for grape yield estimation, which usually relies on manual harvesting of sample plants. In this paper, six versions of the You Only Look Once (YOLO) object detection algorithm (YOLOv3, YOLOv3-tiny, YOLOv4, YOLOv4-tiny, YOLOv5x, and YOLOv5s) were evaluated for real-time bunch detection and counting in grapes. White grape varieties were chosen for this study, as the identification of white berries on a leaf background is trickier than that of red berries. YOLO models were trained using a heterogeneous dataset populated by images retrieved from open datasets and acquired in the field under several illumination conditions, backgrounds, and growth stages. Results have shown that YOLOv5x and YOLOv4 achieved an F1-score of 0.76 and 0.77, respectively, with a detection speed of 31 and 32 FPS. By contrast, YOLOv5s and YOLOv4-tiny achieved an F1-score of 0.76 and 0.69, respectively, with a detection speed of 61 and 196 FPS. The final YOLOv5x model for bunch number, obtained considering bunch occlusion, was able to estimate the number of bunches per plant with an average error of 13.3% per vine. The best combination of accuracy and speed was achieved by YOLOv4-tiny, which should be considered for real-time grape yield estimation, while YOLOv3 was affected by a False Positive-False Negative compensation, which decreased the RMSE.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an effective and consistent feature fusion network (ECFFNet) for RGB-T salient object detection, which fused features of corresponding sizes from RGB and thermal modalities.
Abstract: Under ideal environmental conditions, RGB-based deep convolutional neural networks can achieve high performance for salient object detection (SOD). In scenes with cluttered backgrounds and many objects, depth maps have been combined with RGB images to better distinguish spatial positions and structures during SOD, achieving high accuracy. However, under low-light and uneven lighting conditions, RGB and depth information may be insufficient for detection. Thermal images are insensitive to lighting and weather conditions, being able to capture important objects even during nighttime. By combining thermal images and RGB images, we propose an effective and consistent feature fusion network (ECFFNet) for RGB-T SOD. In ECFFNet, an effective cross-modality fusion module fully fuses features of corresponding sizes from the RGB and thermal modalities. Then, a bilateral reversal fusion module performs bilateral fusion of foreground and background information, enabling the full extraction of salient object boundaries. Finally, a multilevel consistent fusion module combines features across different levels to obtain complementary information. Comprehensive experiments on three RGB-T SOD datasets show that the proposed ECFFNet outperforms 12 state-of-the-art methods under different evaluation indicators.

Proceedings ArticleDOI
28 Apr 2022
TL;DR: MMRotate is an open-source toolbox that provides a coherent framework for training, inference, and evaluation of popular deep-learning-based rotated object detection algorithms, implementing 18 state-of-the-art algorithms.
Abstract: We present an open-source toolbox, named MMRotate, which provides a coherent framework for training, inference, and evaluation of popular deep-learning-based rotated object detection algorithms. MMRotate implements 18 state-of-the-art algorithms and supports the three most frequently used angle definition methods. To facilitate future research and industrial applications of rotated object detection-related problems, we also provide a large number of trained models and detailed benchmarks to give insights into the performance of rotated object detection. MMRotate is publicly released at https://github.com/open-mmlab/mmrotate.
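As a usage sketch, MMRotate follows the OpenMMLab conventions, and its 0.x-series demo reuses the MMDetection inference API; the snippet below assumes such an install, and the config and checkpoint paths are placeholders to be replaced with files actually shipped in the repository and model zoo.

# Inference sketch assuming MMRotate 0.x, whose demo reuses the MMDetection API;
# config/checkpoint paths are placeholders, not guaranteed file names.
from mmdet.apis import init_detector, inference_detector

config = 'configs/rotated_retinanet/rotated_retinanet_obb_r50_fpn_1x_dota_le90.py'
checkpoint = 'rotated_retinanet_obb_r50_fpn_1x_dota_le90.pth'  # from the model zoo
model = init_detector(config, checkpoint, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')  # per-class rotated boxes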

Journal ArticleDOI
TL;DR: A large-scale dataset of object detection in aerial images (DOTA) is presented in this article, which contains 1,793,658 object instances of 18 categories with oriented-bounding-box annotations collected from 11,268 aerial images.
Abstract: In the past decade, object detection has achieved significant progress in natural images but not in aerial images, due to the massive variations in the scale and orientation of objects caused by the bird's-eye view of aerial images. More importantly, the lack of large-scale benchmarks has become a major obstacle to the development of object detection in aerial images (ODAI). In this paper, we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI. The proposed DOTA dataset contains 1,793,658 object instances of 18 categories with oriented-bounding-box annotations collected from 11,268 aerial images. Based on this large-scale and well-annotated dataset, we build baselines covering 10 state-of-the-art algorithms with over 70 configurations, where the speed and accuracy performance of each model has been evaluated. Furthermore, we provide a code library for ODAI and build a website for evaluating different algorithms. Previous challenges run on DOTA have attracted more than 1300 teams worldwide. We believe that the expanded large-scale DOTA dataset, the extensive baselines, the code library, and the challenges can facilitate the design of robust algorithms and reproducible research on the problem of object detection in aerial images.

Journal ArticleDOI
TL;DR: VisDrone as discussed by the authors is a large-scale drone-captured dataset, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking.
Abstract: Drones, or general UAVs, equipped with cameras have been deployed quickly across a wide range of applications, including agriculture, aerial photography, and surveillance. Consequently, automatic understanding of visual data collected from drones has become highly demanded, bringing computer vision and drones closer and closer together. To promote and track the developments of object detection and tracking algorithms, we have organized three challenge workshops in conjunction with ECCV 2018, ICCV 2019, and ECCV 2020, attracting more than 100 teams around the world. We provide a large-scale drone-captured dataset, VisDrone, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. In this paper, we first present a thorough review of object detection and tracking datasets and benchmarks, and discuss the challenges of collecting large-scale drone-based object detection and tracking datasets with fully manual annotations. After that, we describe our VisDrone dataset, which was captured over various urban/suburban areas of 14 different cities across China from north to south. Being the largest such dataset ever published, VisDrone enables extensive evaluation and investigation of visual analysis algorithms for the drone platform. We provide a detailed analysis of the current state of the field of large-scale object detection and tracking on drones, conclude the challenge, and propose future directions. We expect the benchmark to largely boost research and development in video analysis on drone platforms. All the datasets and experimental results can be downloaded from https://github.com/VisDrone/VisDrone-Dataset .

Journal ArticleDOI
TL;DR: In this article, a comprehensive review of single-stage object detectors, especially YOLOs, their regression formulation, architectural advancements, and performance statistics is presented, together with comparisons between two-stage and single-stage object detectors and among different versions of YOLO.
Abstract: Object detection is one of the predominant and challenging problems in computer vision. Over the decade, with the expeditious evolution of deep learning, researchers have extensively experimented and contributed to the performance enhancement of object detection and related tasks such as object classification, localization, and segmentation using underlying deep models. Broadly, object detectors are classified into two categories, viz. two-stage and single-stage object detectors. Two-stage detectors mainly focus on a selective region-proposal strategy via complex architectures; single-stage detectors, however, consider all spatial region proposals for the possible detection of objects in one shot via relatively simpler architectures. The performance of any object detector is evaluated through detection accuracy and inference time. Generally, the detection accuracy of two-stage detectors outperforms that of single-stage object detectors; however, the inference time of single-stage detectors is better than that of their counterparts. Moreover, with the advent of YOLO (You Only Look Once) and its architectural successors, detection accuracy has improved significantly and is sometimes better than that of two-stage detectors. YOLOs are adopted in various applications mainly due to their faster inference rather than their detection accuracy. For example, detection accuracies are 63.4 and 70 for YOLO and Fast R-CNN, respectively, whereas inference is around 300 times faster in the case of YOLO. In this paper, we present a comprehensive review of single-stage object detectors, especially YOLOs, their regression formulation, architectural advancements, and performance statistics. Moreover, we summarize the comparative illustrations between two-stage and single-stage object detectors, among different versions of YOLO, and across applications based on both detector families, along with future research directions.

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an edge-computing- and multi-task-driven framework to complete image enhancement and object detection with fast response, which consists of two stages, namely a cloud-based enhancement stage and an edge-based detection stage.
Abstract: We focus on performing object detection in images taken under low-light conditions, which is a critical task that often occurs in mobile multimedia computing environments. Unlike former methods, which obtain enhanced images before detection using various manually designed filters, we propose an edge-computing- and multi-task-driven framework to complete the tasks of image enhancement and object detection with fast response. The proposed framework consists of two stages, namely a cloud-based enhancement stage and an edge-based detection stage. In the cloud-based enhancement stage, we establish a connection between mobile users and cloud servers to input rescaled, small-size illumination parts of low-light images, where enhancement subnetworks are dynamically combined to output several enhanced illumination parts and corresponding weights based on the low-light context of the input images. During the edge-based detection stage, the cloud-computed weights offer informative cues on the extracted feature maps to enhance their representation abilities, which results in accurate predictions of labels and positions for objects. By applying the proposed framework in a cloud computing system, experimental results show that it significantly improves detection performance in mobile multimedia and low-light environments.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper applied a convolutional block attention module (CBAM) to select the information critical to the vehicle detection task and suppress uncritical information, thus improving the detection accuracy of the algorithm.

Journal ArticleDOI
TL;DR: Although the detection speed of the improved SSD algorithm decreases, it remains faster than Faster R-CNN, thus better achieving the balance between target detection accuracy and speed.
Abstract: The development of object detection technology makes it possible for robots to interact with people and the environment, but changeable application scenarios keep the detection accuracy for small and medium objects low in practical applications of object detection technology. In this paper, an indoor small-target detection method based on multi-scale feature fusion is presented: indoor images are collected under different angle, lighting, and shading conditions; image enhancement techniques are used to build and augment an indoor-scene dataset; and in the SSD algorithm each target detection layer is fused with its adjacent feature layers. The Faster R-CNN, YOLOv5, SSD, and multi-scale-feature-fusion SSD object detection models were trained on the indoor scene dataset using transfer learning. The experimental results show that multi-scale feature fusion can improve the detection accuracy of all kinds of objects, especially objects with a relatively small scale. In addition, although the detection speed of the improved SSD algorithm decreases, it is still faster than Faster R-CNN, which better achieves the balance between target detection accuracy and speed.

Journal ArticleDOI
TL;DR: An attempt is made to systematically analyze trends in the development of detection approaches and methods, the reasons behind these developments, and the metrics designed to assess the quality and reliability of object detection.
Abstract: The relevance of the tasks of detecting and recognizing objects in images and their sequences has only increased over the years. Over the past few decades, a huge number of approaches and methods have been proposed for detecting both anomalies, that is, image areas whose characteristics differ from the predicted ones, and objects of interest, about whose properties a priori information is available, up to a library of reference templates. In this work, an attempt is made to systematically analyze trends in the development of detection approaches and methods, the reasons behind these developments, and the metrics designed to assess the quality and reliability of object detection. Detection techniques based on mathematical models of images are considered, with special attention paid to approaches based on models of random fields and likelihood ratios. The development of convolutional neural networks intended for solving recognition problems is analyzed, including a number of pre-trained architectures that provide high efficiency on this problem; rather than using mathematical models, such architectures are trained on libraries of real images. Among the characteristics for assessing detection quality, probabilities of type I and type II errors, precision and recall of detection, intersection over union (IoU), and interpolated average precision are considered. The paper also presents typical tests that are used to compare various neural network algorithms.
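Two of the listed characteristics are easy to pin down with code: intersection over union and interpolated average precision. A minimal sketch follows; the box format and the 11-point recall grid are conventional choices (per PASCAL VOC), not anything specific to this survey.

import numpy as np

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def interpolated_ap(recalls, precisions):
    """11-point interpolated average precision: at each recall threshold t,
    take the maximum precision achieved at recall >= t, then average."""
    return float(np.mean([
        max([p for r, p in zip(recalls, precisions) if r >= t], default=0.0)
        for t in np.linspace(0.0, 1.0, 11)
    ]))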

Journal ArticleDOI
TL;DR: In this paper, a bidirectional feature pyramid network is adopted for the path-aggregation neck to effectively fuse cross-scale features, and an additional detection head is introduced to improve small-size tassel detection based on the original YOLOv5.
Abstract: Unmanned aerial vehicles (UAVs) equipped with lightweight sensors, such as RGB cameras and LiDAR, have significant potential in precision agriculture, including object detection. Tassel detection in maize is essential, as tasseling marks the beginning of the reproductive stage of plant growth and development. However, compared with general object detection, tassel detection based on RGB imagery acquired by UAVs is more challenging due to the small size, time-dependent variable shape, and complexity of the objects of interest. A novel algorithm referred to as YOLOv5-tassel is proposed to detect tassels in UAV-based RGB imagery. A bidirectional feature pyramid network is adopted for the path-aggregation neck to effectively fuse cross-scale features. The robust attention module SimAM is introduced to extract the features of interest before each detection head. An additional detection head is also introduced to improve small-size tassel detection relative to the original YOLOv5. Annotation is performed with guidance from center points derived from CenterNet to improve the selection of bounding boxes for tassels. Finally, to address the issue of limited reference data, transfer learning based on the VisDrone dataset is adopted. Testing results for our proposed YOLOv5-tassel method achieved an mAP of 44.7%, which is better than well-known object detection approaches such as FCOS, RetinaNet, and YOLOv5.
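SimAM, the attention module adopted here, is parameter-free and fits in a few lines. The sketch below follows the published SimAM formulation (Yang et al., 2021), with the regularizer e_lambda as its usual small constant; it illustrates the module the abstract refers to, not the YOLOv5-tassel authors' integration code.

import torch

def simam(x, e_lambda=1e-4):
    """Parameter-free SimAM attention over a (B, C, H, W) feature map."""
    b, c, h, w = x.shape
    n = h * w - 1
    # Squared deviation of each position from its channel mean.
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    # Inverse energy: positions that stand out from their channel get high weight.
    e_inv = d / (4 * (v + e_lambda)) + 0.5
    return x * torch.sigmoid(e_inv)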

Journal ArticleDOI
01 May 2022-Sensors
TL;DR: Wang et al. as discussed by the authors proposed the MSFT-YOLO model for industrial scenarios in which background interference is strong, defect categories are easily confused, defect scales vary greatly, and detection results for small defects are poor.
Abstract: With the development of artificial intelligence technology and the popularity of intelligent production projects, intelligent inspection systems have gradually become a hot topic in the industrial field. As a fundamental problem in the field of computer vision, achieving object detection in industry while taking into account both accuracy and real-time performance is an important challenge in the development of intelligent detection systems. The detection of defects on steel surfaces is an important application of object detection in industry; correct and fast detection of surface defects can greatly improve productivity and product quality. To this end, this paper introduces the MSFT-YOLO model, which is improved based on a one-stage detector. The MSFT-YOLO model is proposed for industrial scenarios in which background interference is strong, defect categories are easily confused, defect scales vary greatly, and detection results for small defects are poor. By adding the TRANS module, which is designed based on the Transformer, to the backbone and detection heads, features can be combined with global information. The fusion of features at different scales by combining multi-scale feature fusion structures enhances the dynamic adjustment of the detector to objects at different scales. To further improve the performance of MSFT-YOLO, we also introduce several effective strategies, such as data augmentation and multi-step training methods. Test results on the NEU-DET dataset show that MSFT-YOLO can achieve real-time detection with an average detection accuracy of 75.2, improving by about 7% over the baseline model (YOLOv5) and by 18% over Faster R-CNN, which is advantageous and inspiring.

Journal ArticleDOI
TL;DR: CFC-Net as mentioned in this paper is a Critical Feature Capturing Network that improves detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment.
Abstract: Object detection in optical remote sensing images is an important and challenging task. In recent years, the methods based on convolutional neural networks have made good progress. However, due to the large variation in object scale, aspect ratio, and arbitrary orientation, the detection performance is difficult to be further improved. In this paper, we discuss the role of discriminative features in object detection, and then propose a Critical Feature Capturing Network (CFC-Net) to improve detection accuracy from three aspects: building powerful feature representation, refining preset anchors, and optimizing label assignment. Specifically, we first decouple the classification and regression features, and then construct robust critical features adapted to the respective tasks through the Polarization Attention Module (PAM). With the extracted discriminative regression features, the Rotation Anchor Refinement Module (R-ARM) performs localization refinement on preset horizontal anchors to obtain superior rotation anchors. Next, the Dynamic Anchor Learning (DAL) strategy is given to adaptively select high-quality anchors based on their ability to capture critical features. The proposed framework creates more powerful semantic representations for objects in remote sensing images and achieves high-performance real-time object detection. Experimental results on three remote sensing datasets including HRSC2016, DOTA, and UCAS-AOD show that our method achieves superior detection performance compared with many state-of-the-art approaches. Code and models are available at https://github.com/ming71/CFC-Net.

Journal ArticleDOI
TL;DR: In this paper, a comprehensive survey of 3D object detection for autonomous driving is presented, encompassing all the main concerns including sensors, datasets, performance metrics, and the recent state-of-the-art detection methods, together with their pros and cons.

Journal ArticleDOI
TL;DR: This study proposes a model with improved performance over the original YOLOv5, applies the collected data to both models to calculate the key indicators, and draws a conclusion on the best object detection model under various conditions.
Abstract: With the recent development of drone technology, object detection technology is emerging, and these technologies can also be applied to searching for illegal immigrants, responding to industrial and natural disasters, and finding missing people and objects. In this paper, we explore ways to increase object detection performance in such situations. Photography was conducted in environments where detecting an object is challenging. The experimental data were based on photographs taken under various environmental conditions, such as changes in drone altitude and the absence of light. All the data used in the experiment were taken with an F11 4K PRO drone together with the VisDrone dataset. In this study, we propose a model with improved performance over the original YOLOv5. We applied the obtained data to each model, the original YOLOv5 and the improved YOLOv5_Ours, to calculate the key indicators. The main indicators are precision, recall, F1-score, and mAP (0.5); YOLOv5_Ours improved the mAP (0.5) and loss function values compared with the original YOLOv5 model. Finally, a conclusion was drawn from the data comparing the original YOLOv5 model and the improved YOLOv5_Ours model. As a result of the analysis, we were able to arrive at a conclusion on the best model for object detection under various conditions.

Journal ArticleDOI
TL;DR: In this paper, deep learning-based object detection and tracking approaches are reviewed as applied to various UAV-related tasks, such as environmental monitoring, precision agriculture, and traffic management.
Abstract: Owing to effective and flexible data acquisition, unmanned aerial vehicles (UAVs) have recently become a hotspot across the fields of computer vision (CV) and remote sensing (RS). Inspired by the recent success of deep learning (DL), many advanced object detection and tracking approaches have been widely applied to various UAV-related tasks, such as environmental monitoring, precision agriculture, and traffic management.

Journal ArticleDOI
TL;DR: Experiments show that Yolo V4_1 (with SPP) outperforms state-of-the-art schemes, achieving 99.4% accuracy in the authors' experiments along with the best total BFLOPS and mAP (99.32%), and that SPP enhances the performance of all models tested.

Journal ArticleDOI
TL;DR: Novel flexible shape-adaptive selection (SA-S) and shape-adaptive measurement (SA-M) strategies for oriented object detection are proposed, which comprise an SA-S strategy for sample selection and an SA-M strategy for the quality estimation of positive samples.
Abstract: The development of detection methods for oriented object detection remains a challenging task. A considerable obstacle is the wide variation in the shape (e.g., aspect ratio) of objects. Sample selection in general object detection has been widely studied as it plays a crucial role in the performance of the detection method and has achieved great progress. However, existing sample selection strategies still overlook some issues: (1) most of them ignore the object shape information; (2) they do not make a potential distinction between selected positive samples; and (3) some of them can only be applied to either anchor-free or anchor-based methods and cannot be used for both of them simultaneously. In this paper, we propose novel flexible shape-adaptive selection (SA-S) and shape-adaptive measurement (SA-M) strategies for oriented object detection, which comprise an SA-S strategy for sample selection and SA-M strategy for the quality estimation of positive samples. Specifically, the SA-S strategy dynamically selects samples according to the shape information and characteristics distribution of objects. The SA-M strategy measures the localization potential and adds quality information on the selected positive samples. The experimental results on both anchor-free and anchor-based baselines and four publicly available oriented datasets (DOTA, HRSC2016, UCAS-AOD, and ICDAR2015) demonstrate the effectiveness of the proposed method.