Author
Yingjie Yin
Other affiliations: Hong Kong Polytechnic University
Bio: Yingjie Yin is an academic researcher from the Chinese Academy of Sciences. The author has contributed to research on topics including monocular vision and object detection, has an h-index of 10, and has co-authored 29 publications receiving 320 citations. Previous affiliations of Yingjie Yin include Hong Kong Polytechnic University.
Papers
TL;DR: A different scales face detector (DSFD) based on Faster R-CNN is proposed that achieves promising performance on popular benchmarks including FDDB, AFW, PASCAL faces, and WIDER FACE.
Abstract: In recent years, the application of deep learning based on deep convolutional neural networks has gained great success in face detection. However, one of the remaining open challenges is the detection of small-scale faces. The depth of the convolutional network can cause the projected feature map for small faces to shrink quickly, and most scale-invariant detection approaches can hardly handle faces smaller than $15\times 15$ pixels. To solve this problem, we propose a different scales face detector (DSFD) based on Faster R-CNN. The new network improves the precision of face detection while performing in real time, like Faster R-CNN. First, an efficient multitask region proposal network (RPN), combined with boosting face detection, is developed to obtain the human face ROI. With the ROI as a constraint, anchors are inhomogeneously produced on the top feature map by the multitask RPN. A human face proposal is extracted through the anchor combined with facial landmarks. Then, a parallel-type Fast R-CNN network is proposed based on the proposal scale. According to the different percentages of the image they cover, the proposals are assigned to three corresponding Fast R-CNN networks. The three networks are separated by proposal scale and differ from each other in the weight of feature map concatenation. A variety of strategies are introduced in our face detection network, including multitask learning, feature pyramid, and feature concatenation. Compared to state-of-the-art face detection methods such as UnitBox, HyperFace, and FastCNN, the proposed DSFD method achieves promising performance on popular benchmarks including FDDB, AFW, PASCAL faces, and WIDER FACE.
91 citations
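The scale-based routing of proposals to the three parallel Fast R-CNN branches can be sketched as follows. This is a minimal illustration: the function name, thresholds, and branch indices are assumptions for exposition, not values taken from the paper.

```python
# Hypothetical sketch of DSFD-style proposal routing: a proposal is sent to
# one of three detection branches according to the fraction of the image it
# covers. Thresholds here are illustrative, not the paper's.

def route_proposal(box, image_w, image_h, small_frac=0.01, large_frac=0.1):
    """Return a branch index (0=small, 1=medium, 2=large) for a proposal.

    box is (x1, y1, x2, y2) in pixels.
    """
    x1, y1, x2, y2 = box
    frac = ((x2 - x1) * (y2 - y1)) / float(image_w * image_h)
    if frac < small_frac:
        return 0          # small-face branch
    elif frac < large_frac:
        return 1          # medium-face branch
    return 2              # large-face branch
```

Routing by covered-area fraction rather than absolute pixel size keeps the assignment resolution-independent, which matches the abstract's description of assigning proposals "according to the different percentages they cover on the images".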
TL;DR: The experimental results on several challenging video sequences validate the effectiveness and robustness of the proposed robust visual detection-learning-tracking framework for autonomous aerial refueling of unmanned aerial vehicles.
Abstract: In this paper, we propose a robust visual detection–learning–tracking framework for autonomous aerial refueling of unmanned aerial vehicles. Two classifiers (D-classifier and T-classifier) are defined in the proposed framework. The D-classifier is a robust linear support vector machine (SVM) classifier trained offline for detecting the drogue object of aerial refueling, and a low-dimensional normalized robust local binary pattern feature is proposed to describe the drogue object in the D-classifier. The T-classifier is a state-based structured SVM classifier trained online for tracking the drogue object. A combination strategy between the D-classifier and the T-classifier is proposed in the framework: the D-classifier is used to assess whether some positive support vectors in the T-classifier should be replaced by positive examples with density peaks. The experimental results on several challenging video sequences validate the effectiveness and robustness of our proposed framework.
65 citations
TL;DR: A robust state-based structured support vector machine (SVM) tracking algorithm combined with incremental principal component analysis (PCA) that directly learns and predicts the object's states and not the 2-D translation transformation during tracking.
Abstract: In this paper, we propose a robust state-based structured support vector machine (SVM) tracking algorithm combined with incremental principal component analysis (PCA). Different from the current structured SVM for tracking, our method directly learns and predicts the object’s states and not the 2-D translation transformation during tracking. We define the object’s virtual state to combine the state-based structured SVM and incremental PCA. The virtual state is considered as the most confident state of the object in every frame. The incremental PCA is used to update the virtual feature vector corresponding to the virtual state and the principal subspace of the object’s feature vectors. In order to improve the accuracy of the prediction, all the feature vectors are projected onto the principal subspace in the learning and prediction process of the state-based structured SVM. Experimental results on several challenging video sequences validate the effectiveness and robustness of our approach.
44 citations
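The projection step described above, where feature vectors are mapped into a principal subspace before the structured SVM learns and predicts, can be sketched as follows. A one-shot SVD stands in for the paper's incremental PCA update, and all names and array sizes are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's implementation): feature vectors are
# projected onto a low-dimensional principal subspace before learning and
# prediction. The paper updates this subspace incrementally per frame; here
# a plain SVD on a batch of features plays that role.

def principal_subspace(features, k):
    """Top-k principal directions of an (n_samples, d) feature matrix."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]                        # (k, d) orthonormal basis

def project(features, basis, mean):
    """Project feature vectors into the principal subspace."""
    return (features - mean) @ basis.T   # (n_samples, k)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))            # 50 toy feature vectors of dim 16
basis = principal_subspace(X, k=4)
Z = project(X, basis, X.mean(axis=0))    # low-dimensional features for the SVM
```

Projecting both training and candidate features into the same subspace, as the abstract notes, keeps the structured SVM's scores comparable while reducing dimensionality.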
TL;DR: Experimental results on a platform with two KUKA robots verify the effectiveness and robustness of the proposed position measurement system, including drogue landmark detection and position computation, for aerial refueling of unmanned aerial vehicles.
Abstract: In this paper, a position measurement system for autonomous aerial refueling of unmanned aerial vehicles, including drogue landmark detection and position computation, is proposed. A multitask parallel deep convolution neural network (MPDCNN) is designed to detect the landmarks of the drogue target. In MPDCNN, two parallel convolution networks are used, and a fusion mechanism is proposed to accomplish effective fusion of the landmark detections for the drogue's two salient parts. Considering the drogue target's geometric constraints, a position measurement method based on monocular vision is proposed. An effective fusion strategy, which fuses the measurement results of the drogue's different parts, is proposed to achieve robust position measurement. The error of landmark detection with the proposed method is 3.9%, which is clearly lower than the errors of other methods. Experimental results on a platform with two KUKA robots verify the effectiveness and robustness of the proposed position measurement system for aerial refueling.
38 citations
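The monocular position computation can be illustrated with a back-of-the-envelope pinhole-camera sketch that assumes a known physical drogue diameter as the geometric constraint. The function, parameter names, and numbers are illustrative assumptions; the paper's method additionally fuses measurements from the drogue's different parts.

```python
# Pinhole-model sketch of monocular range estimation from a target of known
# physical size: depth follows from the ratio of real to apparent diameter,
# and the lateral offsets follow from back-projecting the image centre.

def estimate_position(u, v, d_pixels, d_metres, f, cx, cy):
    """Estimate target 3-D position from its image centre (u, v) and
    apparent diameter d_pixels, given focal length f (pixels) and
    principal point (cx, cy)."""
    z = f * d_metres / d_pixels          # depth from known physical size
    x = (u - cx) * z / f                 # lateral offset
    y = (v - cy) * z / f                 # vertical offset
    return x, y, z
```

For example, a 0.6 m drogue imaged at 100 px diameter by a camera with an 800 px focal length sits about 4.8 m away; averaging such estimates over several detected parts is one plausible reading of the fusion strategy mentioned in the abstract.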
TL;DR: The experimental results validate the effectiveness and robustness of the proposed framework and show the precision of drogue object tracking is 98.7%, which is obviously higher than the other comparison methods.
Abstract: In this paper, we propose a robust detection and tracking strategy for autonomous aerial refueling of unmanned aerial vehicles. The proposed framework includes two modules: a faster deep-learning-based detector (DLD) and a more accurate reinforcement-learning-based tracker (RLT). In the detection stage, the DLD achieves faster speed by combining the efficient MobileNet with the You Only Look Once (YOLO) framework. In the tracking stage, the RLT obtains the target's position accurately and quickly by hierarchically positioning and adjusting the target bounding box according to reinforcement learning. The precision of drogue object tracking is 98.7%, which is clearly higher than that of the other comparison methods. The speed of our network reaches 15 frames/s on a GPU Titan X. The experimental results validate the effectiveness and robustness of the proposed framework.
30 citations
Cited by
TL;DR: This special issue aims at gathering the recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis and addressing interesting real-world computer Vision and multimedia applications.
Abstract: In the real world, a realistic setting for computer vision or multimedia recognition problems is that some classes contain lots of training data while many classes contain only a small amount. Therefore, how to use frequent classes to help learn rare classes, for which it is harder to collect training data, is an open question. Learning with shared information is an emerging topic in machine learning, computer vision, and multimedia analysis. Components at different levels can be shared during the concept modeling and machine learning stages, such as generic object parts, attributes, transformations, regularization parameters, and training examples. Regarding specific methods, multi-task learning, transfer learning, and deep learning can be seen as different strategies for sharing information. These learning-with-shared-information methods are very effective in solving real-world large-scale problems. This special issue aims at gathering recent advances in learning-with-shared-information methods and their applications in computer vision and multimedia analysis. Both state-of-the-art work and literature reviews are welcome for submission. Papers addressing interesting real-world computer vision and multimedia applications are especially encouraged.
Topics of interest include, but are not limited to:
• Multi-task learning or transfer learning for large-scale computer vision and multimedia analysis
• Deep learning for large-scale computer vision and multimedia analysis
• Multi-modal approaches for large-scale computer vision and multimedia analysis
• Different sharing strategies, e.g., sharing generic object parts, attributes, transformations, regularization parameters, and training examples
• Real-world computer vision and multimedia applications based on learning with shared information, e.g., event detection, object recognition, object detection, action recognition, human head pose estimation, object tracking, location-based services, semantic indexing
• New datasets and metrics to evaluate the benefit of the proposed sharing ability for the specific computer vision or multimedia problem
• Survey papers regarding the topic of learning with shared information
Authors who are unsure whether their planned submission is in scope may contact the guest editors prior to the submission deadline with an abstract, in order to receive feedback.
1,758 citations
TL;DR: A comprehensive survey of algorithms proposed for binary neural networks, mainly categorized into the native solutions directly conducting binarization, and the optimized ones using techniques like minimizing the quantization error, improving the network loss function, and reducing the gradient error are presented.
Abstract: The binary neural network, largely saving the storage and computation, serves as a promising technique for deploying deep models on resource-limited devices. However, the binarization inevitably causes severe information loss, and even worse, its discontinuity brings difficulty to the optimization of the deep network. To address these issues, a variety of algorithms have been proposed, and achieved satisfying progress in recent years. In this paper, we present a comprehensive survey of these algorithms, mainly categorized into the native solutions directly conducting binarization, and the optimized ones using techniques like minimizing the quantization error, improving the network loss function, and reducing the gradient error. We also investigate other practical aspects of binary neural networks such as the hardware-friendly design and the training tricks. Then, we give the evaluation and discussions on different tasks, including image classification, object detection and semantic segmentation. Finally, the challenges that may be faced in future research are prospected.
346 citations
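The basic binarization step surveyed above can be shown with a toy sketch. The per-tensor scaling by the mean absolute value is one well-known choice (XNOR-Net-style) for minimizing quantization error, used here as an illustrative assumption rather than a claim about any particular algorithm in the survey.

```python
import numpy as np

# Toy sketch of weight binarization in binary neural networks: full-precision
# weights are replaced by their sign, with an optional per-tensor scaling
# factor alpha that reduces the quantization error.

def binarize(weights):
    """Return {-1, +1} weights and a scalar scaling factor alpha."""
    alpha = np.abs(weights).mean()   # mean |w| minimizes L2 quantization error
    return np.sign(weights), alpha

W = np.array([[0.4, -0.2], [-0.9, 0.1]])
Wb, alpha = binarize(W)              # Wb holds only +1/-1; alpha rescales it
```

In a forward pass the layer then uses `alpha * Wb` in place of `W`, which is what enables the storage and compute savings the survey describes; the gradient-side tricks (e.g., straight-through estimators) are separate and not shown here.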
TL;DR: A novel ensemble convolutional neural network (CNN) based architecture for effective detection of both packed and unpacked malware, named Image-based Malware Classification using Ensemble of CNNs (IMCEC), is proposed.
Abstract: Both researchers and malware authors have demonstrated that malware scanners are unfortunately limited and are easily evaded by simple obfuscation techniques. This paper proposes a novel ensemble convolutional neural network (CNN) based architecture for effective detection of both packed and unpacked malware, named Image-based Malware Classification using Ensemble of CNNs (IMCEC). Our main assumption is that, owing to their different deep architectures, different CNNs provide different semantic representations of the image; therefore, a set of CNN architectures makes it possible to extract features of higher quality than traditional methods. Experimental results show that IMCEC is particularly suitable for malware detection: it achieves high detection accuracy with low false-alarm rates using raw malware input. Results demonstrate more than 99% accuracy for unpacked malware and over 98% accuracy for packed malware. IMCEC is flexible, practical, and efficient, as it takes only 1.18 s on average to identify a new malware sample.
221 citations
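The ensemble idea behind IMCEC can be sketched minimally. Simple averaging of per-class probabilities is used here as an illustrative assumption; the paper's exact fusion rule may differ.

```python
import numpy as np

# Minimal sketch of ensembling several CNN classifiers: each model outputs
# per-class probabilities for a batch of samples, the probabilities are
# averaged across models, and the argmax gives the final label.

def ensemble_predict(prob_list):
    """prob_list: list of (n_samples, n_classes) probability arrays,
    one per model. Returns the fused class index per sample."""
    avg = np.mean(np.stack(prob_list), axis=0)   # (n_samples, n_classes)
    return avg.argmax(axis=1)
```

Averaging tends to cancel the idiosyncratic errors of individual models, which is consistent with the abstract's rationale that architecturally different CNNs capture different semantic representations of the malware image.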
01 Oct 2019
TL;DR: By enforcing restriction to the rate of alteration in response maps generated in the detection phase, the ARCF tracker can evidently suppress aberrances and is thus more robust and accurate to track objects.
Abstract: The traditional framework of discriminative correlation filters (DCF) is often subject to undesired boundary effects. Several approaches that enlarge search regions have been proposed in past years to make up for this shortcoming. However, with excessive background information, more background noise is also introduced, and the discriminative filter is prone to learning from the ambiance rather than the object. This situation, along with appearance changes of objects caused by full/partial occlusion, illumination variation, and other factors, makes aberrances in the detection process more likely, which can substantially degrade the credibility of the result. Therefore, in this work, a novel approach to repress the aberrances occurring during the detection process is proposed: the aberrance repressed correlation filter (ARCF). By restricting the rate of alteration in response maps generated in the detection phase, the ARCF tracker can evidently suppress aberrances and is thus more robust and accurate in tracking objects. Extensive experiments are conducted on different UAV datasets for object tracking from an aerial view, i.e., UAV123, UAVDT, and DTB70, with 243 challenging image sequences containing over 90K frames. The ARCF tracker outperforms 20 other state-of-the-art trackers based on DCF and deep frameworks, with sufficient speed for real-time applications.
208 citations
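The restriction on the rate of alteration between consecutive response maps can be illustrated with a minimal sketch of the penalty term itself. In the actual tracker this term is folded into the DCF learning objective after aligning the previous map by the estimated shift; that machinery is omitted here, so this is an illustration of the idea, not the ARCF formulation.

```python
import numpy as np

# Illustrative sketch of ARCF's core idea: an "aberrance" is an abrupt change
# between consecutive detection response maps, so a squared-difference penalty
# between them discourages such changes when added to the filter's objective.

def aberrance_penalty(resp_prev, resp_curr):
    """Squared L2 distance between consecutive response maps."""
    return float(np.sum((resp_curr - resp_prev) ** 2))
```

A stable track yields a near-zero penalty, while a response map that suddenly changes shape (e.g., due to occlusion or background clutter) is penalized, which is the suppression behaviour the abstract describes.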
Posted Content
TL;DR: This paper proposes Complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding box regression and Non-Maximum Suppression, leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency.
Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this paper, we propose Complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding box regression and Non-Maximum Suppression (NMS), leading to notable gains in average precision (AP) and average recall (AR) without sacrificing inference efficiency. In particular, we consider three geometric factors, i.e., overlap area, normalized central point distance, and aspect ratio, which are crucial for measuring bounding box regression in object detection and instance segmentation. The three geometric factors are incorporated into the CIoU loss for better distinguishing difficult regression cases. Training deep models with the CIoU loss results in consistent AP and AR improvements over the widely adopted $\ell_n$-norm and IoU-based losses. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes, usually requiring fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, the CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT) and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains of +1.7 AP and +6.2 AR$_{100}$ for object detection, and +0.9 AP and +3.5 AR$_{100}$ for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at this https URL
185 citations
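The three geometric factors named above combine into a closed-form loss that can be sketched directly for axis-aligned boxes: IoU, penalized by the normalized centre distance and an aspect-ratio consistency term. This is a plain-Python illustration of that formula, not the authors' released implementation.

```python
import math

# Sketch of the Complete-IoU (CIoU) loss for boxes given as (x1, y1, x2, y2):
# loss = 1 - IoU + rho^2 / c^2 + alpha * v, where rho is the distance between
# box centres, c the diagonal of the smallest enclosing box, and v measures
# aspect-ratio inconsistency.

def ciou_loss(box, gt):
    x1, y1, x2, y2 = box
    g1, h1, g2, h2 = gt
    # overlap area -> IoU
    iw = max(0.0, min(x2, g2) - max(x1, g1))
    ih = max(0.0, min(y2, h2) - max(y1, h1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (g2 - g1) * (h2 - h1) - inter
    iou = inter / union
    # squared centre distance over squared enclosing-box diagonal
    rho2 = ((x1 + x2 - g1 - g2) ** 2 + (y1 + y2 - h1 - h2) ** 2) / 4.0
    cw = max(x2, g2) - min(x1, g1)
    ch = max(y2, h2) - min(y1, h1)
    c2 = cw ** 2 + ch ** 2
    # aspect-ratio consistency term and its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan((g2 - g1) / (h2 - h1))
                              - math.atan((x2 - x1) / (y2 - y1))) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, the centre-distance term still provides a useful gradient when the boxes do not overlap at all, which is what makes the loss helpful for the "difficult regression cases" mentioned in the abstract.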