
Showing papers on "Object-class detection" published in 2018


Journal ArticleDOI
TL;DR: In this article, a convolutional neural network (CNN) was employed to jointly regress to 3D bounding box coordinates and object pose for object detection and orientation estimation tasks.
Abstract: The goal of this paper is to perform 3D object detection in the context of autonomous driving. Our method aims at generating a set of high-quality 3D object proposals by exploiting stereo imagery. We formulate the problem as minimizing an energy function that encodes object size priors, placement of objects on the ground plane, as well as several depth-informed features that reason about free space, point cloud densities, and distance to the ground. We then exploit a convolutional neural network (CNN) on top of these proposals to perform object detection. In particular, the CNN exploits context and depth information to jointly regress to 3D bounding box coordinates and object pose. Our experiments show significant performance gains over existing RGB and RGB-D object proposal methods on the challenging KITTI benchmark. When combined with the CNN, our approach outperforms all existing results in object detection and orientation estimation tasks for all three KITTI object classes. Furthermore, we also experiment with the setting where LIDAR information is available, and show that using both LIDAR and stereo leads to the best result.

319 citations
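As a toy illustration of the proposal-scoring idea (not the paper's actual energy, whose features and learned weights are computed from real stereo point clouds), a linear energy over hypothetical depth-informed features might look like this:

```python
import numpy as np

def proposal_energy(features, weights):
    """Lower energy = better 3D proposal; features is (n_proposals, n_features)."""
    return features @ weights

# hypothetical columns: [point-cloud density in box, free-space violation, height-prior error]
features = np.array([
    [0.9, 0.1, 0.05],   # dense box resting on the ground plane -> good
    [0.2, 0.7, 0.60],   # sparse box floating above the ground  -> bad
])
weights = np.array([-1.0, 2.0, 1.5])   # reward density, penalize violations

energies = proposal_energy(features, weights)
ranked = np.argsort(energies)          # keep the lowest-energy proposals
print(energies, ranked)
```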


Journal ArticleDOI
TL;DR: This paper proposes a gated bi-directional CNN (GBD-Net) to pass messages among features from different support regions during both feature learning and feature extraction.
Abstract: The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. Effective integration of local and contextual visual cues from these regions has become a fundamental problem in object detection. In this paper, we propose a gated bi-directional CNN (GBD-Net) to pass messages among features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution between neighboring support regions in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate each other's existence by learning their nonlinear relationships, and their close interactions are modeled in a richer way. It is also shown that message passing is not always helpful but depends on individual samples. Gated functions are therefore needed to control message transmission, whose on-or-off states are controlled by extra visual evidence from the input sample. The effectiveness of GBD-Net is shown through experiments on three object detection datasets: ImageNet, Pascal VOC2007, and Microsoft COCO. Besides the GBD-Net, this paper also details our approach in winning the ImageNet object detection challenge of 2016, with source code provided at https://github.com/craftGBD/craftGBD. In this winning system, the modified GBD-Net, a new pretraining scheme, and better region proposal designs are provided. We also show the effectiveness of different network structures and existing techniques for object detection, such as multi-scale testing, left-right flipping, bounding box voting, NMS, and context.

136 citations
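Of the standard techniques listed at the end of the abstract, non-maximum suppression (NMS) is the easiest to make concrete. A minimal greedy NMS sketch in NumPy (the box format and IoU threshold are conventional choices, not the paper's exact settings):

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression; boxes are (N, 4) as [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thr]   # drop heavily overlapping boxes
    return keep
```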


Journal ArticleDOI
TL;DR: Strong evidence is found that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.
Abstract: Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations. We improve on this previous work by incorporating knowledge about object similarities from visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should exhibit more common transferable properties than dissimilar categories; e.g., a better cat detector would result from transferring the differences between a dog classifier and a dog detector than from transferring those of the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object similarity based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.

66 citations
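A hypothetical sketch of the transfer step, assuming classifier and detector weights live in the same feature space; the similarity values, dimensions, and weight vectors below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                     # feature dimension (illustrative)
src_cls = rng.normal(size=(5, d))           # source-class classifier weights
src_det = src_cls + rng.normal(scale=0.1, size=(5, d))  # source-class detector weights

visual_sim = np.array([0.9, 0.7, 0.2, 0.1, 0.05])    # e.g., from appearance features
semantic_sim = np.array([0.8, 0.6, 0.3, 0.2, 0.10])  # e.g., from word embeddings
sim = 0.5 * visual_sim + 0.5 * semantic_sim          # simple complementary mix

tgt_cls = rng.normal(size=d)                # target class: classifier only, no boxes
delta = (sim[:, None] * (src_det - src_cls)).sum(0) / sim.sum()
tgt_det = tgt_cls + delta                   # similarity-weighted transferred detector
```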


Journal ArticleDOI
TL;DR: A material-based salient object detection method is proposed that can effectively distinguish objects with similar perceived color but different spectral responses; it outperforms several existing hyperspectral salient object detection approaches as well as state-of-the-art methods proposed for RGB images.

64 citations


Journal ArticleDOI
TL;DR: In this paper, the authors present the design details of a deep learning system for unconstrained face recognition, including modules for face detection, association, alignment, and face verification.
Abstract: Over the last five years, methods based on Deep Convolutional Neural Networks (DCNNs) have shown impressive performance improvements for object detection and recognition problems. This has been made possible due to the availability of large annotated datasets, a better understanding of the non-linear mapping between input images and class labels, as well as the affordability of GPUs. In this paper, we present the design details of a deep learning system for unconstrained face recognition, including modules for face detection, association, alignment, and face verification. The quantitative performance evaluation is conducted using the IARPA Janus Benchmark A (IJB-A), the JANUS Challenge Set 2 (JANUS CS2), and the Labeled Faces in the Wild (LFW) dataset. The IJB-A dataset includes real-world unconstrained faces of 500 subjects with significant pose and illumination variations, which make it much harder than the LFW and YouTube Faces datasets. JANUS CS2 is the extended version of IJB-A, which contains not only all the images/frames of IJB-A but also the original videos. Some open issues regarding DCNNs for face verification problems are then discussed.

46 citations
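A minimal sketch of the verification module alone, assuming face embeddings produced by some DCNN; the embedding size and threshold are placeholders, not values from the paper:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.randn(256)    # embedding of face A (stand-in for a DCNN output)
emb_b = np.random.randn(256)    # embedding of face B
same_person = cosine_similarity(emb_a, emb_b) > 0.55   # hypothetical threshold
```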


Journal ArticleDOI
TL;DR: This paper proposes to incorporate a saliency map into an incremental subspace analysis framework in which the saliency map makes the estimated background have less of a chance than the foreground to contain salient objects.
Abstract: Moving object detection is key to intelligent video analysis. On the one hand, what moves includes not only interesting objects but also noise and cluttered background. On the other hand, moving objects without rich texture are prone to being missed. Therefore, the results of many moving object detection algorithms contain undesirable false alarms and missed alarms. To reduce the false alarms and missed alarms, in this paper we propose to incorporate a saliency map into an incremental subspace analysis framework in which the saliency map makes the estimated background have less of a chance than the foreground (i.e., moving objects) to contain salient objects. The proposed objective function systematically takes into account the properties of sparsity, low rank, connectivity, and saliency. An alternating minimization algorithm is proposed to seek the optimal solutions. The experimental results on both the Perception Test Images Sequences dataset and the Wallflower dataset demonstrate that the proposed method is effective in reducing false alarms and missed alarms.

36 citations


Journal ArticleDOI
TL;DR: This work proposes a novel spatiotemporal RGB-D video segmentation framework that automatically segments and tracks objects with continuity and consistency over time and leverages scale-invariant feature transform (SIFT) flow and bilateral representation to solve inconsistency under occlusion.
Abstract: RGB-D video segmentation is important for many applications, including scene understanding, object tracking, and robotic grasping. However, segmenting RGB-D frames over a long video sequence into a globally consistent segmentation is still a challenging problem. Current methods often lose pixel correspondences between frames under occlusion and, thus, fail to generate consistent and continuous segmentation results. To address this problem, we propose a novel spatiotemporal RGB-D video segmentation framework that automatically segments and tracks objects with continuity and consistency over time. Our approach first produces consistent segments in some keyframes by region clustering, and then propagates the segmentation result to the whole video sequence via a mask propagation scheme in bilateral space. Instead of exploiting local optical flow information to establish correspondences between adjacent frames, we leverage scale-invariant feature transform (SIFT) flow and bilateral representation to resolve inconsistency under occlusion. Moreover, our method automatically extracts multiple objects of interest and tracks them without any user input hint. A variety of experiments demonstrate the effectiveness and robustness of our proposed method.

33 citations


Patent
Mohamed N. Ahmed
23 Mar 2018
TL;DR: In an approach to face recognition in an image, one or more computer processors receive an image that includes at least one face and one or more face parts, and the computer processors extract, from the clustered images, face descriptors.
Abstract: In an approach to face recognition in an image, one or more computer processors receive an image that includes at least one face and one or more face parts. The one or more computer processors detect the one or more face parts in the image with a face component model. The one or more computer processors cluster the detected one or more face parts with one or more stored images. The one or more computer processors extract, from the clustered images, one or more face descriptors. The one or more computer processors determine a recognition score of the at least one face, based, at least in part, on the extracted one or more face descriptors.

27 citations


Journal ArticleDOI
Mai Xu, Yun Ren, Zulin Wang, Jingxian Liu, Xiaoming Tao
TL;DR: This paper proposes adopting a particle filter (PF) to model a dynamic Gaussian mixture model (DGMM) for saliency detection in face videos; the experimental results show that the resulting PF-DGMM approach significantly outperforms other state-of-the-art approaches in saliency detection on face videos.
Abstract: Recently, videoconferencing has become popular in multimedia systems, such as FaceTime and Skype. In videoconferencing, almost every frame contains a human face. Therefore, it is important to predict human visual attention on face videos by saliency detection, as saliency may be used as a guide to the region of interest for content-based applications of face videos. In this paper, we propose a data-driven approach for saliency detection in face videos. From the data-driven perspective, we first establish an eye-tracking database that contains fixations of 76 face videos viewed by 40 subjects. Upon analysis of our database, we find that visual attention is significantly attracted by faces in videos. More importantly, the attention distribution within face regions varies with mouth movement. Since previous work has shown that face saliency in still images can be efficiently modeled using a Gaussian mixture model (GMM), the variation of visual attention in videos can be modeled by a dynamic GMM (DGMM). Accordingly, we propose adopting the particle filter (PF) to model the DGMM for saliency detection in face videos, which we call PF-DGMM. Finally, the experimental results show that our PF-DGMM approach significantly outperforms other state-of-the-art approaches in saliency detection on face videos.

17 citations
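A generic bootstrap particle filter sketch for tracking a single time-varying parameter (say, the mean of one DGMM component) under a random-walk motion model; this is the textbook PF, not the paper's exact PF-DGMM formulation, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
particles = rng.normal(0.0, 1.0, N)         # initial hypotheses for the mean
weights = np.full(N, 1.0 / N)

def pf_step(particles, weights, z, proc_std=0.1, obs_std=0.3):
    particles = particles + rng.normal(0, proc_std, N)     # predict (random walk)
    lik = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)  # Gaussian likelihood
    weights = weights * lik
    weights /= weights.sum()                               # update
    idx = rng.choice(N, N, p=weights)                      # resample
    return particles[idx], np.full(N, 1.0 / N)

for z in [0.2, 0.35, 0.5, 0.45]:             # fake fixation-derived observations
    particles, weights = pf_step(particles, weights, z)
print(particles.mean())                       # posterior estimate of the mean
```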


Journal ArticleDOI
TL;DR: Boosted Random Ferns are introduced to rapidly build discriminative classifiers for learning and detecting object categories; they can be trained very efficiently, densely evaluated for all image locations in about 0.1 seconds, and provide detection rates similar to competing approaches that require expensive and significantly slower processing.
Abstract: In this paper we introduce Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from an instance level to a category level and still retain efficiency. First, we define binary features in the histogram-of-oriented-gradients domain (as opposed to the intensity domain), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window and the locations of the binary features for each fern are not chosen completely at random; instead, we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, which adapts the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. And finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be trained very efficiently, densely evaluated for all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing. We demonstrate the effectiveness of our approach through thorough experimentation on publicly available datasets, in which we compare against the state of the art, for tasks of both 2D detection and 3D multi-view estimation.

15 citations
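A bare-bones single random fern, assuming a precomputed feature vector such as a flattened HOG descriptor of a window; the boosted test selection, feature sharing, and sliding-window machinery of the paper are omitted:

```python
import numpy as np

rng = np.random.default_rng(2)
k, dim = 6, 288                               # fern size and feature length (toy)
pairs = rng.integers(0, dim, size=(k, 2))     # the fern's k binary comparisons

def fern_index(feat):
    bits = feat[pairs[:, 0]] > feat[pairs[:, 1]]
    return int(bits @ (1 << np.arange(k)))    # k-bit code in [0, 2^k)

def train_table(pos_feats, neg_feats):
    counts = np.ones((2, 2 ** k))             # [neg, pos] with Laplace smoothing
    for f in neg_feats: counts[0, fern_index(f)] += 1
    for f in pos_feats: counts[1, fern_index(f)] += 1
    probs = counts / counts.sum(1, keepdims=True)
    return np.log(probs[1] / probs[0])        # per-code log-likelihood ratio

score = train_table(rng.random((100, dim)), rng.random((100, dim)))
print(score[fern_index(rng.random(dim))])     # score of one test window
```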


Journal ArticleDOI
TL;DR: A preprocessing method for camera nodes named Preprocessing-based Multi-Face Detection (PMFD) is proposed, which extracts the bounding box of each object's face using a boosting-based face detection algorithm and sends only the faces' information to the base station.
Abstract: Recently, advances in hardware such as CMOS camera nodes have led to the development of Visual Sensor Networks (VSNs) that process sensed data and transmit the useful information to the base station for completing subsequent tasks. Today, detecting objects and sending useful information to the base station for object recognition has emerged as an important and challenging issue in VSNs. Our investigations show that the face's information is adequate for completing object recognition. According to the literature, many approaches have been proposed for object detection and sending useful information to the base station so that subsequent tasks like object recognition can be completed. However, in most of them, the lack of preprocessing methods in camera nodes causes the network to face a large volume of data. For example, when there is more than one object within each camera node's field-of-view, conventional works deliver the empty spaces among objects to the base station. Also, most of them send the whole information about each object to the base station, while sending only each object's face information is adequate for completing object recognition. Therefore, in this paper, a preprocessing method for camera nodes named Preprocessing-based Multi-Face Detection (PMFD) is proposed. Our method extracts the bounding box of each object's face using a boosting-based face detection algorithm and sends only the faces' information to the base station. The simulation results show that the PMFD method has acceptable preprocessing time complexity and injects a low volume of traffic into the network. Consequently, the PMFD method prolongs the network lifetime in comparison with state-of-the-art algorithms.
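The camera-node preprocessing can be approximated with OpenCV's boosting-based (Viola-Jones) cascade detector: detect faces, crop them, and transmit only the crops. The cascade file ships with OpenCV; the transmission call is a hypothetical stand-in for the network layer:

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

frame = cv2.imread("frame.jpg")                 # one captured frame (placeholder)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

crops = [frame[y:y + h, x:x + w] for (x, y, w, h) in faces]
# for crop in crops: send_to_base_station(crop)   # hypothetical transport call
```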

Journal ArticleDOI
TL;DR: Two virtual 'axis-symmetrical' face images are generated from an original face image, and the collaborative representation based classification (CRC) method is adopted to perform classification.
Abstract: Research on automatic face recognition has attracted much attention from many researchers because of human faces' uniqueness and usability. However, in real-world applications, the acquisition of face images is affected by illumination changes, facial expression variations, different postures, and other environmental factors, resulting in a limited number of collected face images. This situation has become an obstacle to the development of face recognition technology. Therefore, in this paper, we utilize the information of the left-half face and the right-half face to generate two virtual 'axis-symmetrical' face images from an original face image and adopt the collaborative representation based classification (CRC) method to perform classification. The first and second virtual face images convey more information of the right-half face and left-half face, respectively. Experiments have been performed on the Extended Yale_B, ORL, AR, and FERET face databases, and the experimental results show that our method can improve recognition accuracy effectively.
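A minimal sketch of the two virtual 'axis-symmetrical' faces, followed by a ridge-regression form of collaborative representation scored by class-wise reconstruction residual; shapes and the regularization value are illustrative:

```python
import numpy as np

def virtual_faces(img):                       # img: (h, w) grayscale face
    h, w = img.shape
    left, right = img[:, : w // 2], img[:, w - w // 2 :]
    v1 = np.hstack([left, left[:, ::-1]])     # left half + its mirror
    v2 = np.hstack([right[:, ::-1], right])   # mirrored right half + right half
    return v1, v2

def crc_classify(X, labels, y, lam=0.01):
    """X: (d, n) training faces as columns; labels: (n,) ints; y: (d,) test face."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    alpha = np.linalg.solve(A, X.T @ y)       # collaborative coding over all classes
    residuals = {c: np.linalg.norm(y - X[:, labels == c] @ alpha[labels == c])
                 for c in np.unique(labels)}
    return min(residuals, key=residuals.get)  # class with the smallest residual
```

One natural way to use the virtual images, though the paper's exact pipeline may differ, is to append them to X as extra training columns for their class.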

Journal ArticleDOI
01 Jan 2018
TL;DR: A unified algorithm framework called group object detection and tracking is presented, which detects moving objects by robust principal component analysis (RPCA) and a Graph Cut algorithm while simultaneously tracking objects via fractal analysis.
Abstract: Automatic video analysis is a hot research topic in the field of computer vision and has broad application prospects. It usually consists of three key steps: object detection, object tracking, and behavior recognition. Usually, object detection is considered merely a precondition of object tracking, and the interaction between them is very limited. So, existing video analysis solutions treat them as independent procedures and execute them separately. Actually, object detection and tracking are related, and an effective combination of them can improve the performance of video analysis. This paper mainly studies object detection and tracking, and tries to utilize the outputs of each to optimize the performance of the other. For this purpose, a unified algorithm framework called group object detection and tracking is presented, which detects moving objects by robust principal component analysis (RPCA) and a Graph Cut algorithm and tracks objects via fractal analysis simultaneously. The multi-fractal spectrum (MFS) constraint and Graph Cut improve the completeness of object detection, which yields more exact tracking features. At the same time, successful tracking results can in turn provide optimal constraints for object detection. Therefore, object detection and tracking are grouped and can be improved through an iterative RPCA algorithm. The experimental results on both simulated and real sequences demonstrate that the proposed algorithm is more robust and outperforms state-of-the-art algorithms in object detection and tracking.
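A minimal RPCA sketch via the classic inexact augmented Lagrange multiplier scheme: a matrix D whose columns are vectorized frames is split into a low-rank background L and a sparse foreground S. The Graph Cut refinement and the fractal-spectrum tracking constraints of the paper are omitted, and the step-size heuristic is a common default, not necessarily the paper's:

```python
import numpy as np

def shrink(x, tau):                       # soft-thresholding operator
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0)

def rpca(D, n_iter=100):
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n))
    mu = 0.25 * m * n / np.abs(D).sum()   # common step-size heuristic
    L = np.zeros_like(D); S = np.zeros_like(D); Y = np.zeros_like(D)
    for _ in range(n_iter):
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * shrink(sig, 1.0 / mu)) @ Vt       # singular value thresholding
        S = shrink(D - L + Y / mu, lam / mu)       # sparse foreground update
        Y = Y + mu * (D - L - S)                   # dual ascent on the residual
    return L, S

D = np.random.rand(1000, 40)              # 40 vectorized frames (toy data)
background, foreground = rpca(D)
```

Thresholding the foreground S then yields a moving-object mask of the kind the paper refines with Graph Cut.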

Journal ArticleDOI
TL;DR: A face deduplication system is proposed that combines face detection with face quality evaluation to obtain the highest-quality face image of a person.
Abstract: Video surveillance systems based on face analysis have played an increasingly important role in the security industry. Compared with identification methods based on other physical characteristics, face verification is easy for people to accept. In video surveillance scenes, it is common to capture multiple faces belonging to the same person. We cannot get good face recognition results if we use all the images without considering image quality. In order to solve this problem, we propose a face deduplication system that combines face detection with face quality evaluation to obtain the highest-quality face image of a person. The experimental results in this paper also show that our method can effectively detect faces and select high-quality face images, thereby improving the accuracy of face recognition.
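A sketch of the quality-selection step only, scoring sharpness by the variance of the Laplacian (a common blur measure; the paper's actual quality model may differ):

```python
import cv2

def quality(face_bgr):
    """Higher = sharper; variance of the Laplacian as a simple quality proxy."""
    gray = cv2.cvtColor(face_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def best_face(face_crops):                # crops of one person from the video
    return max(face_crops, key=quality)   # keep only the highest-quality face
```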

Journal ArticleDOI
TL;DR: A novel blob reconstruction method is introduced that overcomes the mentioned limitation through optical flow based nullification, bifurcation, and unification of detected blobs.
Abstract: Significant research has been devoted to the detection of moving objects in image sequences. Detected moving objects usually contain some errors (some pixels belonging to the object are marked as non-object and vice versa). To achieve refined detection of moving objects in video, post-processing of the binary blobs detected as objects in every frame is needed. This article introduces a novel blob reconstruction method that overcomes the mentioned limitation through optical flow based nullification, bifurcation, and unification of detected blobs. To support the claimed performance of the proposed method, a comparison is made with ten widely used object detection methods on twenty-four standard moving-object scene videos, based on standard measures such as accuracy, precision, recall, and F-measure. The results clearly indicate the efficacy of the proposed method. Following this, a preliminary case study on placodal cell migration during the early development of the ectodermal organs of humans and mice was carried out using the proposed model, which tracks the cell migration promisingly.
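A sketch of the nullification step alone, assuming a binary detection mask and dense Farneback optical flow; bifurcation and unification are not shown, and the motion threshold is illustrative:

```python
import cv2
import numpy as np

def nullify_static_blobs(prev_gray, gray, mask, min_flow=0.5):
    """Remove blobs whose region shows almost no optical flow (likely noise)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)             # per-pixel motion magnitude
    n, labels = cv2.connectedComponents(mask.astype(np.uint8))
    keep = np.zeros(mask.shape, dtype=bool)
    for blob in range(1, n):                       # label 0 is the background
        region = labels == blob
        if mag[region].mean() >= min_flow:         # enough motion -> keep blob
            keep |= region
    return keep
```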

Journal ArticleDOI
TL;DR: This paper proposes a cryptographic algorithm for the kernel method that processes encrypted images without decrypting them, so the owner of the images can have them processed by classifiers belonging to other people without leaking the content of the images, while the owner also learns nothing about the classifier.
Abstract: With the advance of computer vision, technologies such as face detection and human detection have been used widely. However, when processing photos through computer vision technologies, we have to face a privacy-related problem: people do not want their photos to be distributed to others, even to take advantage of computer vision. Since the kernel method has been widely used in object classifiers, we propose a cryptographic algorithm for the kernel method that processes encrypted images without decrypting them. The owner of the images can thus have them processed by classifiers belonging to other people without leaking the content of the images to those people, and the owner also learns nothing about the classifier. In this paper, we analyze the security, correctness, and efficiency of our proposed cryptographic algorithms, and then validate their effectiveness through face detection experiments.
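As a much-simplified sketch of the setting, restricted to a linear kernel: with an additively homomorphic cryptosystem (here the python-paillier library, which is an assumption for illustration, not the paper's construction), a classifier owner can compute a score on encrypted features without seeing them:

```python
# pip install phe   (python-paillier, additively homomorphic encryption)
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=2048)

x = [0.12, -0.50, 0.33]                  # image owner's private feature vector
enc_x = [pub.encrypt(v) for v in x]      # encrypted features sent to classifier owner

w, b = [0.70, 0.10, -0.40], 0.05         # classifier owner's private parameters
enc_score = sum(xe * wi for xe, wi in zip(enc_x, w)) + b   # homomorphic dot product

print(priv.decrypt(enc_score))           # only the image owner can decrypt the score
```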

Journal ArticleDOI
TL;DR: In this article, the random finite set based multi-Bernoulli filter with a detectionless likelihood function was applied to frame-to-frame tracking of space objects observed in electro-optical imagery.
Abstract: This paper applies the random finite set based multi-Bernoulli filter with a detectionless likelihood function to frame-to-frame tracking of space objects observed in electro-optical imagery for sp...

Dissertation
02 Jul 2018
TL;DR: An end-to-end multitask objective that jointly learns object-action relationships is introduced, and an action tubelet detector is proposed that leverages the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do.
Abstract: The rise of deep learning has facilitated remarkable progress in video understanding. This thesis addresses three important tasks of video understanding: video object detection, joint object and action detection, and spatio-temporal action localization. Object class detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding boxes from still images. Recently, video has been used as an alternative source of data. Yet, training an object detector on one domain (either still images or videos) and testing on the other results in a significant performance gap compared to training and testing on the same domain. In the first part of this thesis, we examine the reasons behind this performance gap. We define and evaluate several domain shift factors: spatial location accuracy, appearance diversity, image quality, aspect distribution, and object size and camera framing. We examine the impact of these factors by comparing detection performance before and after cancelling them out. The results show that all five factors affect the performance of the detectors and that their combined effect explains the performance gap. While most existing approaches for detection in videos focus on objects or human actions separately, in the second part of this thesis we aim at detecting non-human-centric actions, i.e., objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both tasks of object and action detection benefit from this joint learning. In experiments on the A2D dataset, we obtain state-of-the-art results on segmentation of object-action pairs. In the third part, we are the first to propose an action tubelet detector that leverages the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do. In the same way that modern detectors rely on anchor boxes, our tubelet detector is based on anchor cuboids: it takes a sequence of frames as input and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Our tubelet detector outperforms all state-of-the-art methods on the UCF-Sports, J-HMDB, and UCF-101 action localization datasets, especially at high overlap thresholds. The improvement in detection performance is explained by both more accurate scores and more precise localization.

Dissertation
02 Jul 2018
TL;DR: This thesis proposes an active search strategy for efficient object class detection and a part detection approach that exploits object context, and complements part appearance with the object appearance, its class, and the expected relative location of the parts inside it.
Abstract: Objects and parts are crucial elements for achieving automatic image understanding. The goal of the object detection task is to recognize and localize all the objects in an image. Similarly, semantic part detection attempts to recognize and localize the object parts. This thesis proposes four contributions. The first two make object detection more efficient by using active search strategies guided by image context. The last two involve parts. One of them explores the emergence of parts in neural networks trained for object detection, whereas the other improves on part detection by adding object context. First, we present an active search strategy for efficient object class detection. Modern object detectors evaluate a large set of windows using a window classifier. Instead, our search sequentially chooses which window to evaluate next based on all the information gathered before. This results in a significant reduction in the number of window evaluations necessary to detect the objects in the image. We guide our search strategy using image context and the score of the classifier. In our second contribution, we extend this active search to jointly detect pairs of object classes that appear close in the image, exploiting the valuable information that one class can provide about the location of the other. This leads to an even further reduction in the number of necessary evaluations for the smaller, more challenging classes. In the third contribution of this thesis, we study whether semantic parts emerge in Convolutional Neural Networks trained for different visual recognition tasks, especially object detection. We perform two quantitative analyses that provide a deeper understanding of their internal representation by investigating the responses of the network filters. Moreover, we explore several connections between discriminative power and semantics, which provides further insights on the role of semantic parts in the network. Finally, the last contribution is a part detection approach that exploits object context. We complement part appearance with the object appearance, its class, and the expected relative location of the parts inside it. We significantly outperform approaches that use part appearance alone in this challenging task.

Book ChapterDOI
01 Jan 2018
TL;DR: The purpose of this paper is to survey methods by which objects can be efficiently detected from any given video sequence, along with the preferred deep learning libraries for doing so.
Abstract: One of the challenging topics in the field of computer vision is the detection of stationary/non-stationary objects in a video sequence. The outcome of detection, tracking, and learning must be free from ambiguity. To effectively detect a moving object, the background information should first be subtracted from the video. However, for high-definition video, modeling techniques suffer from high computation and memory costs, which may decrease performance measures such as accuracy and efficiency in identifying the object. Identifying definite structure in a large amount of unstructured data is a prerequisite problem to be solved. The task of finding structure in large amounts of data is achieved using Deep Learning, 'which is about learning multiple levels of representation and abstraction that help to make sense of data such as images, sound, and text'. The purpose of this paper is to survey methods by which objects can be efficiently detected from any given video sequence, along with the preferred deep learning libraries.
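As a runnable starting point for the classical background-subtraction stage the chapter discusses, OpenCV's Gaussian-mixture subtractor works out of the box; the file name is a placeholder:

```python
import cv2

cap = cv2.VideoCapture("video.mp4")
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)         # 255 = foreground, 127 = shadow
    fg_mask = cv2.medianBlur(fg_mask, 5)      # light cleanup of speckle noise
    # downstream: find contours / hand candidate regions to a deep detector
cap.release()
```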

Journal ArticleDOI
TL;DR: This paper presents a structured approach for efficiently exploiting the perspective information of a scene to enhance the detection of objects in monocular systems; it defines a finite grid of 3D positions on the dominant ground plane and computes occupancy maps from which object location estimates are extracted.
Abstract: This paper presents a structured approach for efficiently exploiting the perspective information of a scene to enhance the detection of objects in monocular systems. It defines a finite grid of 3D positions on the dominant ground plane and computes occupancy maps from which object location estimates are extracted. This method works on top of any detection method, either pixel-wise (e.g., background subtraction) or region-wise (e.g., detection-by-classification), which can be linked to the proposed scheme with minimal fine tuning. Its flexibility thus allows for applying this approach in a wide variety of applications and sectors, such as surveillance applications (e.g., person detection) or driver assistance systems (e.g., vehicle or pedestrian detection). Extensive results provide evidence of its excellent performance and its ease of use in combination with different image processing techniques.
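A sketch of the grid idea, assuming a calibrated image-to-ground homography H: project each detection's foot point (bottom-center of its box) onto the ground plane and accumulate an occupancy map. H, the boxes, the grid extent, and the cell size below are placeholders:

```python
import cv2
import numpy as np

H = np.eye(3)                                   # image-to-ground homography (placeholder)
grid = np.zeros((50, 50))                       # finite grid on the dominant ground plane
cell = 0.2                                      # cell size in meters (illustrative)

boxes = np.array([[100, 80, 140, 220],          # detections as [x1, y1, x2, y2]
                  [300, 90, 360, 260]], dtype=np.float32)
feet = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2, boxes[:, 3]], axis=1)

ground = cv2.perspectiveTransform(feet[None], H.astype(np.float32))[0]
for gx, gy in ground:                           # accumulate the occupancy map
    i, j = int(gy / cell), int(gx / cell)
    if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]:
        grid[i, j] += 1                         # peaks = likely object locations
```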

Journal ArticleDOI
TL;DR: A computer vision method is presented for a mobile robot to find humans in a scene, achieving high detection accuracy and fast detection speed on both standard test datasets and real-life images.
Abstract: A computer vision method is presented for a mobile robot to find humans in a scene. Face detection is used to confirm humans. In order to reduce the search region, an optical flow algorithm is used to segment the image in advance. Asymmetry problems in face detection are explained, and corresponding solutions are put forward using a bootstrapping strategy and an asymmetric AdaBoost algorithm. In addition, Fisher discriminant analysis further improves the performance of face detection. Multi-view face models are trained to accommodate practical face detection applications. Finally, experiments demonstrate that our multi-view face detector achieves high detection accuracy and fast detection speed on both standard test datasets and real-life images.

Journal ArticleDOI
01 Feb 2018
TL;DR: The experimental results show that the corresponding face images can be retrieved according to the input face sketch, and that super-resolution can effectively enhance the image quality and detail information of the pseudo-photo.
Abstract: Considering the crucial role of face image in modern intelligent system, face image retrieval has attracted more attention for authentication, surveillance, law enforcement, and security control. In most cases, we cannot obtain the suspect’s face image directly and the best substitute is a face sketch of criminal suspect drawn by artist according to eyewitness description. It is a key step in the criminal investigation process to narrow down criminal suspect using the face sketch. At first, the face sketch is transformed to a pseudo-photo for subsequent utilization. Transformation is performed according to the classic eigenface algorithm and enhanced by super-resolution. Matching between reconstructed pseudo-photo and real face photographs is performed by Hash encoding and iterative quantization. We carried out our ideas on two public face databases, and the sketch face images are generated by photo-shopping software program. The experimental results show that the corresponding face images can be retrieved according to the input face sketch and super-resolution can effectively enhance the image quality and detail information of the pseudo-photo. Hash encoding and iterative quantization achieve the quick search of approximate face images.