Author
Jingkai Zhou
Other affiliations: Alibaba Group
Bio: Jingkai Zhou is an academic researcher from South China University of Technology. The author has contributed to research in topics: Object detection & Drone. The author has an h-index of 5 and has co-authored 13 publications receiving 160 citations. Previous affiliations of Jingkai Zhou include Alibaba Group.
Papers
Affiliations: Tianjin University; University at Albany, SUNY; General Electric; Temple University; Texas A&M University; Shanghai Jiao Tong University; Nanjing University of Science and Technology; National University of Defense Technology; Xidian University; Peking University; University of Kansas; Jiangnan University; Shandong University; University of Electronic Science and Technology of China; Chinese Academy of Sciences; University of Maryland, College Park; South China University of Technology; Tsinghua University; Indian Institute of Technology, Hyderabad; Sun Yat-sen University; Chongqing University; Tencent; Xiamen University; National Institute of Technology, Tiruchirappalli; University of Ottawa; Ocean University of China; Northwestern Polytechnical University; University of Illinois at Urbana–Champaign
TL;DR: A large-scale drone-based dataset of 8,599 images with rich annotations (object bounding boxes, object categories, occlusion, truncation ratios, etc.) is released to narrow the gap between current object detection performance and real-world requirements.
Abstract: Object detection is a hot topic with various applications in computer vision, e.g., image understanding, autonomous driving, and video surveillance. Much of this progress has been driven by the availability of object detection benchmark datasets, including PASCAL VOC, ImageNet, and MS COCO. However, object detection on the drone platform is still a challenging task, due to various factors such as viewpoint change, occlusion, and scale. To narrow the gap between current object detection performance and real-world requirements, we organized the Vision Meets Drone (VisDrone2018) Object Detection in Image challenge in conjunction with the 15th European Conference on Computer Vision (ECCV 2018). Specifically, we release a large-scale drone-based dataset of 8,599 images (6,471 for training, 548 for validation, and 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion, truncation ratios, etc. Featuring diverse real-world scenarios, the dataset was collected using various drone models, in different scenarios across 14 different cities spanning thousands of kilometres, and under various weather and lighting conditions. We mainly focus on ten object categories in object detection, i.e., pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Some rarely occurring special vehicles (e.g., machine-shop truck, forklift truck, and tanker) are ignored in evaluation. The dataset is extremely challenging due to various factors, including large scale and pose variations, occlusion, and cluttered backgrounds. We present the evaluation protocol of the VisDrone-DET2018 challenge and the comparison results of 38 detectors on the released dataset, which are publicly available on the challenge website: http://www.aiskyeye.com/. We expect the challenge to largely boost research and development in object detection in images on drone platforms.
146 citations
20 Jun 2021
Abstract: Convolution is one of the basic building blocks of CNN architectures. Despite its common use, standard convolution has two main shortcomings: it is content-agnostic and computation-heavy. Dynamic filters are content-adaptive but further increase the computational overhead. Depth-wise convolution is a lightweight variant, but it usually leads to a drop in CNN performance or requires a larger number of channels. In this work, we propose the Decoupled Dynamic Filter (DDF), which can simultaneously tackle both of these shortcomings. Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters. This decomposition considerably reduces the number of parameters and limits computational costs to the same level as depth-wise convolution. Meanwhile, we observe a significant boost in performance when replacing standard convolution with DDF in classification networks: ResNet-50/101 improve by 1.9% and 1.3% in top-1 accuracy, while their computational costs are reduced by nearly half. Experiments on detection and joint upsampling networks also demonstrate the superior performance of the DDF upsampling variant (DDF-Up) in comparison with standard convolution and specialized content-adaptive layers. The project page with code is available.
60 citations
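The decoupling described in the abstract above can be sketched in a few lines. This is a toy NumPy illustration under my own simplifications, not the authors' implementation: the filter values are passed in directly, whereas the paper predicts them from the input features with light attention-style branches (which is what makes the operation content-adaptive).

```python
import numpy as np

def ddf_depthwise(x, spatial_filters, channel_filters):
    """Naive sketch of a decoupled dynamic depth-wise filter.

    x:               (C, H, W) input feature map
    spatial_filters: (H, W, k, k) one kernel per spatial position
    channel_filters: (C, k, k)    one kernel per channel
    The effective filter at channel c, position (i, j) is the element-wise
    product spatial_filters[i, j] * channel_filters[c], so only
    H*W*k*k + C*k*k filter values are needed instead of the H*W*C*k*k
    of a fully dynamic depth-wise filter.
    """
    C, H, W = x.shape
    k = channel_filters.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x, dtype=float)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                # combine the two decoupled filters, then apply depth-wise
                filt = spatial_filters[i, j] * channel_filters[c]
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * filt)
    return out
```

The triple loop is only for clarity; an efficient version would vectorize the patch extraction, which is the part the paper keeps at depth-wise-convolution cost.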
Affiliations: University at Albany, SUNY; Xidian University; University of Electronic Science and Technology of China; Beijing Institute of Technology; Samsung; Chinese Academy of Sciences; Tianjin University; SK Telecom; Dalian University of Technology; Ocean University of China; Harbin Institute of Technology; Nanjing University of Posts and Telecommunications; Siemens; Sun Yat-sen University; Xi'an Jiaotong University; South China University of Technology; General Electric; Hanyang University; Stony Brook University; ShanghaiTech University; Chongqing University; Queen Mary University of London; Huazhong University of Science and Technology
TL;DR: The Vision Meets Drone Object Detection in Image Challenge (VisDrone-DET2019), held in conjunction with the 17th International Conference on Computer Vision (ICCV 2019), focuses on image object detection on drones.
Abstract: Recently, automatic visual data understanding from drone platforms has become highly demanding. To facilitate the study, the Vision Meets Drone Object Detection in Image Challenge was held for the second time in conjunction with the 17th International Conference on Computer Vision (ICCV 2019), focusing on image object detection on drones. Results of 33 object detection algorithms are presented. For each participating detector, a short description is provided in the appendix. Our goal is to advance the state-of-the-art detection algorithms and provide a comprehensive evaluation platform for them. The evaluation protocol of the VisDrone-DET2019 Challenge and the comparison results of all the submitted detectors on the released dataset are publicly available at the website: http://www.aiskyeye.com/. The results demonstrate that there remains large room for improvement for object detection algorithms on drones.
57 citations
Affiliations: Tianjin University; Siemens; Beijing Institute of Technology; Xi'an Jiaotong University; South China University of Technology; University at Albany, SUNY; Sun Yat-sen University; Samsung; Karlsruhe Institute of Technology; Xidian University; Northwestern Polytechnical University; General Electric; Tsinghua University; Stony Brook University
TL;DR: The goal is to advance the state-of-the-art detection algorithms and provide a comprehensive evaluation platform for them; the results demonstrate that there remains large room for improvement for object detection algorithms on drones.
Abstract: Video object detection has drawn great attention recently. The Vision Meets Drone Object Detection in Video Challenge 2019 (VisDrone-VID2019) was held to advance the state of the art in video object detection for videos captured by drones. Specifically, 13 teams participated in the challenge. We also report the results of 6 state-of-the-art detectors on the collected dataset. A short description is provided in the appendix for each participating detector. We present the analysis and discussion of the challenge results. Both the dataset and the challenge results are publicly available at the challenge website: http://www.aiskyeye.com/.
51 citations
TL;DR: A nighttime FIR pedestrian dataset, the largest of its kind at present, called the SCUT (South China University of Technology) dataset, is introduced; experiments show that convolutional neural network (CNN) based detectors obtain good performance on FIR images.
41 citations
Cited by
TL;DR: The superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, is shown, suggesting that the HRNet is a stronger backbone for computer vision problems.
Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel; (ii) Repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at this https URL.
1,278 citations
TL;DR: The High-Resolution Network (HRNet) as mentioned in this paper maintains high-resolution representations through the whole process by connecting the high-to-low resolution convolution streams in parallel and repeatedly exchanging the information across resolutions.
Abstract: High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named as High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) Connect the high-to-low resolution convolution streams in parallel and (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the codes are available at https://github.com/HRNet.
1,162 citations
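The two HRNet characteristics described above (parallel multi-resolution streams plus repeated cross-resolution exchange) can be illustrated schematically. This is a toy NumPy sketch under my own simplifications: two streams with equal channel counts, and nearest-neighbour resizing standing in for the strided convolutions and bilinear upsampling the actual network uses for the exchange.

```python
import numpy as np

def downsample2x(x):
    # nearest-neighbour 2x downsampling (stand-in for a strided convolution)
    return x[:, ::2, ::2]

def upsample2x(x):
    # nearest-neighbour 2x upsampling (stand-in for bilinear upsampling)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def exchange(high, low):
    """One cross-resolution exchange: each stream adds the other stream's
    features, resized to its own resolution, so both stay semantically
    rich (from low) and spatially precise (from high)."""
    return high + upsample2x(low), low + downsample2x(high)

# Two parallel streams (C, H, W) kept at their own resolutions throughout;
# exchange is applied repeatedly instead of encoding down and decoding up.
high, low = np.ones((8, 16, 16)), np.ones((8, 8, 8))
for _ in range(3):
    high, low = exchange(high, low)
```

The key design point the sketch preserves is that the high-resolution stream is never discarded: both resolutions exist at every stage, unlike a series encoder-decoder that must reconstruct spatial detail at the end.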
01 Oct 2019
TL;DR: A Clustered Detection (ClusDet) network is proposed, unifying a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet) in an end-to-end framework to detect small objects in aerial images.
Abstract: Detecting objects in aerial images is challenging for at least two reasons: (1) target objects like pedestrians are very small in pixels, making them hardly distinguished from surrounding background; and (2) targets are in general sparsely and non-uniformly distributed, making the detection very inefficient. In this paper, we address both issues inspired by observing that these targets are often clustered. In particular, we propose a Clustered Detection (ClusDet) network that unifies object clustering and detection in an end-to-end framework. The key components in ClusDet include a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet). Given an input image, CPNet produces object cluster regions and ScaleNet estimates object scales for these regions. Then, each scale-normalized cluster region is fed into DetecNet for object detection. ClusDet has several advantages over previous solutions: (1) it greatly reduces the number of chips for final object detection and hence achieves high running time efficiency, (2) the cluster-based scale estimation is more accurate than previously used single-object based ones, hence effectively improves the detection for small objects, and (3) the final DetecNet is dedicated for clustered regions and implicitly models the prior context information so as to boost detection accuracy. The proposed method is tested on three popular aerial image datasets including VisDrone, UAVDT and DOTA. In all experiments, ClusDet achieves promising performance in comparison with state-of-the-art detectors.
161 citations
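The three-stage data flow described in the abstract above (CPNet → ScaleNet → DetecNet) can be sketched as plain control flow. The sub-networks are passed in as callables and stubbed out in the test, since only the pipeline shape is being illustrated; this is an interpretation of the abstract, not the authors' code.

```python
import numpy as np

def clusdet_pipeline(image, cpnet, scalenet, detecnet):
    """Sketch of the ClusDet data flow.

    1. CPNet proposes cluster regions (x, y, w, h) likely to contain objects,
       so only a few chips are cut from the image instead of a uniform grid.
    2. ScaleNet estimates a scale-normalization factor for each region.
    3. DetecNet detects objects in each scale-normalized chip; detections
       are mapped back to original image coordinates.
    """
    detections = []
    for (x, y, w, h) in cpnet(image):
        chip = image[y:y + h, x:x + w]   # crop the proposed cluster region
        zoom = scalenet(chip)            # factor used to normalize object scale
        for (dx, dy, dw, dh, score) in detecnet(chip, zoom):
            # shift chip-local boxes back into image coordinates
            detections.append((x + dx, y + dy, dw, dh, score))
    return detections
```

The efficiency argument in the abstract falls out of this structure: the number of DetecNet invocations equals the number of cluster proposals, which is small when targets are clustered.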
TL;DR: The VisDrone dataset, captured over various urban/suburban areas of 14 different cities across China from north to south, is described; being the largest such dataset ever published, it enables extensive evaluation and investigation of visual analysis algorithms on the drone platform.
Abstract: Drones, or general UAVs, equipped with cameras have been deployed rapidly in a wide range of applications, including agriculture, aerial photography, and surveillance. Consequently, automatic understanding of visual data collected from drones becomes highly demanding, bringing computer vision and drones closer and closer. To promote and track the developments of object detection and tracking algorithms, we have organized two challenge workshops in conjunction with ECCV 2018 and ICCV 2019, attracting more than 100 teams around the world. We provide a large-scale drone-captured dataset, VisDrone, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. In this paper, we first present a thorough review of object detection and tracking datasets and benchmarks, and discuss the challenges of collecting large-scale drone-based object detection and tracking datasets with fully manual annotations. After that, we describe our VisDrone dataset, which is captured over various urban/suburban areas of 14 different cities across China from north to south. Being the largest such dataset ever published, VisDrone enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. We provide a detailed analysis of the current state of the field of large-scale object detection and tracking on drones, conclude the challenge, and propose future directions. We expect the benchmark to largely boost research and development in video analysis on drone platforms. All the datasets and experimental results can be downloaded from the website: this https URL.
129 citations
TL;DR: A comprehensive review of state-of-the-art deep-learning-based object detection algorithms is provided, along with an analysis of these algorithms' recent contributions to low-altitude UAV datasets.
116 citations