Author

Haizhou Ai

Bio: Haizhou Ai is an academic researcher from Tsinghua University. The author has contributed to research in topics including face detection and video tracking, has an h-index of 31, and has co-authored 144 publications receiving 4,857 citations.


Papers
Proceedings ArticleDOI
17 May 2004
TL;DR: A rotation invariant multi-view face detection method based on the Real Adaboost algorithm is proposed; a pose estimation method is also introduced, yielding a processing speed of four frames per second on 320×240 images.
Abstract: In this paper, we propose a rotation invariant multi-view face detection method based on Real Adaboost algorithm. Human faces are divided into several categories according to the variant appearance from different viewpoints. For each view category, weak classifiers are configured as confidence-rated look-up-table (LUT) of Haar feature. Real Adaboost algorithm is used to boost these weak classifiers and construct a nesting-structured face detector. To make it rotation invariant, we divide the whole 360-degree range into 12 sub-ranges and construct their corresponding view based detectors separately. To improve performance, a pose estimation method is introduced and results in a processing speed of four frames per second on 320×240 images. Experiments on faces with 360-degree in-plane rotation and ±90-degree out-of-plane rotation are reported, of which the frontal face detector subsystem retrieves 94.5% of the faces with 57 false alarms on the CMU+MIT frontal face test set and the multi-view face detector subsystem retrieves 89.8% of the faces with 221 false alarms on the CMU profile face test set.
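A minimal Python sketch of the confidence-rated look-up-table (LUT) weak classifiers boosted with Real Adaboost, as summarized in the abstract above. The quantile binning, bin count, and toy training loop are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of Real Adaboost with confidence-rated look-up-table (LUT)
# weak classifiers. Feature binning and bin count are illustrative assumptions.
import numpy as np

def train_lut_weak(feat, y, w, n_bins=8, eps=1e-6):
    """Fit one LUT weak classifier on a 1-D feature (e.g. a Haar feature value).
    Each bin stores a real-valued confidence 0.5*ln(W+/W-)."""
    edges = np.quantile(feat, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges[1:-1], feat), 0, n_bins - 1)
    conf = np.zeros(n_bins)
    for b in range(n_bins):
        wp = w[(bins == b) & (y == 1)].sum()   # positive weight in this bin
        wn = w[(bins == b) & (y == -1)].sum()  # negative weight in this bin
        conf[b] = 0.5 * np.log((wp + eps) / (wn + eps))
    return edges, conf

def predict_lut(edges, conf, feat):
    bins = np.clip(np.searchsorted(edges[1:-1], feat), 0, len(conf) - 1)
    return conf[bins]

def real_adaboost(features, y, n_rounds=10):
    """features: (n_samples, n_features) matrix; y in {-1, +1}.
    Each round picks the LUT weak classifier with the smallest normalizer Z."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        for j in range(features.shape[1]):
            edges, conf = train_lut_weak(features[:, j], y, w)
            h = predict_lut(edges, conf, features[:, j])
            z = np.sum(w * np.exp(-y * h))     # Real Adaboost selection criterion
            if best is None or z < best[0]:
                best = (z, j, edges, conf, h)
        _, j, edges, conf, h = best
        ensemble.append((j, edges, conf))
        w = w * np.exp(-y * h)                 # reweight samples
        w /= w.sum()
    return ensemble
```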

414 citations

Journal ArticleDOI
TL;DR: A series of innovative methods are proposed to construct a high-performance rotation invariant multiview face detector, including the width-first-search (WFS) tree detector structure, the vector boosting algorithm for learning vector-output strong classifiers, the domain-partition-based weak learning method, the sparse feature in granular space, and the heuristic search for sparse feature selection.
Abstract: Rotation invariant multiview face detection (MVFD) aims to detect faces with arbitrary rotation-in-plane (RIP) and rotation-off-plane (ROP) angles in still images or video sequences. MVFD is crucial as the first step in automatic face processing for general applications since face images are seldom upright and frontal unless they are taken cooperatively. In this paper, we propose a series of innovative methods to construct a high-performance rotation invariant multiview face detector, including the width-first-search (WFS) tree detector structure, the vector boosting algorithm for learning vector-output strong classifiers, the domain-partition-based weak learning method, the sparse feature in granular space, and the heuristic search for sparse feature selection. As a result, our multiview face detector achieves low computational complexity, broad detection scope, and high detection accuracy on both standard test sets and real-life images.
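An illustrative sketch of the width-first-search (WFS) tree idea from this abstract: each branching node's vector-output strong classifier can activate several children, and all active branches are explored breadth-first. The node interface and threshold below are assumptions for illustration, not the paper's code.

```python
# Illustrative WFS tree traversal: a node's vector-output classifier activates
# any number of child branches, and active branches are searched breadth-first.
from collections import deque

class WFSNode:
    def __init__(self, classifier, children=None, label=None):
        self.classifier = classifier      # window -> sequence of branch scores
        self.children = children or []    # one child per output component
        self.label = label                # view label at a leaf, e.g. a RIP/ROP range

def wfs_detect(root, window, threshold=0.0):
    """Return view labels of all leaves reached by width-first search."""
    results, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if not node.children:             # leaf: window accepted for this view
            results.append(node.label)
            continue
        scores = node.classifier(window)  # vector output, one score per child
        for child, s in zip(node.children, scores):
            if s > threshold:             # branch is active; keep searching it
                queue.append(child)
    return results
```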

411 citations

Proceedings ArticleDOI
23 Jul 2018
TL;DR: Unreliable detection is handled by collecting candidates from the outputs of both detection and tracking; optimal selection from a considerable number of candidates is applied in real time using a novel scoring function based on a fully convolutional neural network.
Abstract: Online multi-object tracking is a fundamental problem in time-critical video analysis applications. A major challenge in the popular tracking-by-detection framework is how to associate unreliable detection results with existing tracks. In this paper, we propose to handle unreliable detection by collecting candidates from the outputs of both detection and tracking. The intuition behind generating redundant candidates is that detection and tracks can complement each other in different scenarios. Detection results of high confidence prevent tracking drifts in the long term, while predictions of tracks can handle noisy detection caused by occlusion. In order to apply optimal selection from a considerable number of candidates in real time, we present a novel scoring function based on a fully convolutional neural network that shares most computations on the entire image. Moreover, we adopt a deeply learned appearance representation, trained on large-scale person re-identification datasets, to improve the identification ability of our tracker. Extensive experiments show that our tracker achieves real-time and state-of-the-art performance on a widely used people tracking benchmark.
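A minimal sketch of the candidate-selection idea described above: candidates are pooled from detector outputs and track predictions, scored, and kept greedily. The simple scoring callback and greedy non-maximum suppression below stand in for the paper's fully convolutional scoring network.

```python
# Pool candidates from detections and track predictions, score them, and keep
# the best non-overlapping ones. Boxes are (x1, y1, x2, y2); score_fn is a
# stand-in for the paper's FCN-based scoring function.
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def select_candidates(detections, track_predictions, score_fn, iou_thr=0.5):
    """detections / track_predictions: lists of boxes; score_fn: box -> float."""
    candidates = list(detections) + list(track_predictions)
    scored = sorted(candidates, key=score_fn, reverse=True)
    kept = []
    for box in scored:                       # greedy non-maximum suppression
        if all(iou(box, k) < iou_thr for k in kept):
            kept.append(box)
    return kept
```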

327 citations

Journal ArticleDOI
TL;DR: Experiments show significantly improved accuracy of the proposed approach in comparison with existing tracking methods, under the condition of low frame rate data and abrupt motion of both target and camera.
Abstract: Tracking an object in low-frame-rate video or with abrupt motion poses two main difficulties that most conventional tracking methods can hardly handle: 1) poor motion continuity and an increased search space; 2) fast appearance variation of the target and more background clutter due to the increased search space. In this paper, we address the problem from a view that integrates conventional tracking and detection, and present a temporal probabilistic combination of discriminative observers of different lifespans. Each observer is learned from a different range of samples, with a different subset of features, to achieve a varying level of discriminative power at varying cost. Efficient fusion and temporal inference are then performed by a cascade particle filter consisting of multiple stages of importance sampling. Experiments show significantly improved accuracy of the proposed approach in comparison with existing tracking methods, under the condition of low-frame-rate data and abrupt motion of both target and camera.
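An illustrative cascade particle filter sketch in the spirit of this abstract: each discriminative observer contributes one importance-sampling stage that reweights and resamples the particle set. The random-walk motion model and observer interface are assumptions for illustration, not the paper's design.

```python
# Cascade particle filter sketch: successive observers each perform one
# importance-sampling stage over the particle set.
import numpy as np

def resample(particles, weights, rng):
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

def cascade_particle_filter(particles, observers, frame, motion_std=5.0, seed=0):
    """particles: (N, 2) array of candidate positions.
    observers: list of functions (frame, positions) -> per-particle likelihood."""
    rng = np.random.default_rng(seed)
    # Prediction: diffuse particles with a simple random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    for observe in observers:                 # one importance-sampling stage each
        likelihood = observe(frame, particles)
        weights = likelihood + 1e-12          # avoid an all-zero weight vector
        weights /= weights.sum()
        particles = resample(particles, weights, rng)
    return particles.mean(axis=0), particles  # state estimate and particle set
```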

312 citations

Proceedings ArticleDOI
17 Oct 2005
TL;DR: A novel tree-structured multiview face detector (MVFD) is proposed, which adopts the coarse-to-fine strategy to divide the entire face space into smaller and smaller subspaces and achieves high accuracy and amazing speed.
Abstract: In this paper, we propose a novel tree-structured multiview face detector (MVFD), which adopts the coarse-to-fine strategy to divide the entire face space into smaller and smaller subspaces. For this purpose, a newly extended boosting algorithm named vector boosting is developed to train the predictors for the branching nodes of the tree, which have multi-component outputs as vectors. Our MVFD covers a large range of the face space, namely ±45° rotation in plane (RIP) and ±90° rotation off plane (ROP), and achieves high accuracy and remarkable speed (about 40 ms per frame on a 320×240 video sequence) compared with previously published works. As a result, by simply rotating the detector by 90°, 180°, and 270°, a rotation invariant (360° RIP) MVFD is implemented that achieves real-time performance (11 fps on a 320×240 video sequence) with high accuracy.
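A small sketch of the rotation trick highlighted in this abstract: a ±45° RIP detector is run on the image rotated by 0°, 90°, 180°, and 270° to cover the full 360° in-plane range. The detector interface is an assumption, and detected boxes are left in the rotated frames for simplicity.

```python
# Run one ±45° RIP detector on four 90°-rotated copies of the image to obtain
# full 360° in-plane coverage. `detector(image)` is assumed to return
# (box, rip_angle_deg) pairs; boxes stay in the rotated frame (k is recorded).
import numpy as np

def rotation_invariant_detect(image, detector):
    all_faces = []
    for k in range(4):                               # 0°, 90°, 180°, 270°
        rotated = np.rot90(image, k)
        for box, rip in detector(rotated):
            # The detected RIP angle is relative to the rotated frame; add the
            # rotation offset to express it in the original image frame.
            all_faces.append((box, (rip + 90 * k) % 360, k))
    return all_faces
```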

257 citations


Cited by
Proceedings ArticleDOI
23 Jun 2013
TL;DR: Large scale experiments are carried out with various evaluation criteria to identify effective approaches for robust tracking and provide potential future research directions in this field.
Abstract: Object tracking is one of the most important components in numerous applications of computer vision. While much progress has been made in recent years with efforts on sharing code and datasets, it is of great importance to develop a library and benchmark to gauge the state of the art. After briefly reviewing recent advances of online object tracking, we carry out large scale experiments with various evaluation criteria to understand how these algorithms perform. The test image sequences are annotated with different attributes for performance evaluation and analysis. By analyzing quantitative results, we identify effective approaches for robust tracking and provide potential future research directions in this field.
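A compact sketch of the two evaluation criteria popularized by this benchmark, the precision plot (center-location error) and the success plot (bounding-box overlap), computed for one tracker on one annotated sequence. The box format and threshold grids are conventional choices assumed here, not taken verbatim from the paper.

```python
# Precision and success curves for a single tracker on a single sequence.
# pred, gt: (N, 4) per-frame boxes in (x, y, w, h) format.
import numpy as np

def center_error(pred, gt):
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)           # per-frame center distance

def overlap(pred, gt):
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)            # per-frame IoU

def precision_plot(pred, gt, thresholds=np.arange(0, 51)):
    err = center_error(pred, gt)
    return np.array([(err <= t).mean() for t in thresholds])

def success_plot(pred, gt, thresholds=np.linspace(0, 1, 21)):
    ov = overlap(pred, gt)
    return np.array([(ov > t).mean() for t in thresholds])
```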

3,828 citations

01 Jan 2004
TL;DR: Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance and describes numerous important application areas such as image based rendering and digital libraries.
Abstract: From the Publisher: The accessible presentation of this book gives both a general view of the entire computer vision enterprise and also offers sufficient detail to be able to build useful applications. Users learn techniques that have proven to be useful through first-hand experience and a wide range of mathematical methods. A CD-ROM included with every copy of the text contains source code for programming practice, color images, and illustrative movies. Comprehensive and up-to-date, this book includes essential topics that either reflect practical significance or are of theoretical importance. Topics are discussed in substantial and increasing depth. Application surveys describe numerous important application areas such as image-based rendering and digital libraries. Many important algorithms are broken down and illustrated in pseudocode. Appropriate for use by engineers as a comprehensive reference to the computer vision enterprise.

3,627 citations

Journal ArticleDOI
TL;DR: A novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection, and develops a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: P-expert estimates missed detections, and N-expert estimates false alarms.
Abstract: This paper investigates long-term tracking of unknown objects in a video stream. The object is defined by its location and extent in a single frame. In every frame that follows, the task is to determine the object's location and extent or indicate that the object is not present. We propose a novel tracking framework (TLD) that explicitly decomposes the long-term tracking task into tracking, learning, and detection. The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary. The learning estimates the detector's errors and updates it to avoid these errors in the future. We study how to identify the detector's errors and learn from them. We develop a novel learning method (P-N learning) which estimates the errors by a pair of “experts”: (1) P-expert estimates missed detections, and (2) N-expert estimates false alarms. The learning process is modeled as a discrete dynamical system and the conditions under which the learning guarantees improvement are found. We describe our real-time implementation of the TLD framework and the P-N learning. We carry out an extensive quantitative evaluation which shows a significant improvement over state-of-the-art approaches.
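A schematic sketch of one P-N learning update as described in the abstract: the P-expert promotes the tracker's confident location to a positive sample when the detector misses it, and the N-expert demotes detections that contradict the single-object trajectory to negatives. The thresholds and the `detector.retrain` interface are assumptions for illustration, not the TLD implementation.

```python
# One schematic P-N learning step over the current frame's detections.
def pn_learning_step(detector, tracker_box, detections, iou_fn,
                     pos_set, neg_set, pos_thr=0.6, neg_thr=0.2):
    """detections: candidate boxes from the detector on the current frame.
    iou_fn(a, b) -> overlap in [0, 1]; pos_set / neg_set are growing sample lists."""
    if tracker_box is not None:
        detected_near_track = any(iou_fn(d, tracker_box) > pos_thr
                                  for d in detections)
        # P-expert: the tracker is confident here but the detector missed it.
        if not detected_near_track:
            pos_set.append(tracker_box)
        # N-expert: detections that contradict the single-object trajectory.
        for d in detections:
            if iou_fn(d, tracker_box) < neg_thr:
                neg_set.append(d)
    detector.retrain(pos_set, neg_set)      # assumed retraining interface
    return detector
```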

3,137 citations

01 Jan 2006

3,012 citations

Posted Content
TL;DR: A novel deep learning framework for attribute prediction in the wild is proposed, cascading two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags but pre-trained differently.
Abstract: Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags, but pre-trained differently. LNet is pre-trained by massive general object categories for face localization, while ANet is pre-trained by massive face identities for attribute prediction. This framework not only outperforms the state-of-the-art by a large margin, but also reveals valuable facts on learning face representation. (1) It shows how the performances of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images give a strong indication of face locations. This fact enables training LNet for face localization with only image-level annotations, but without face bounding boxes or landmarks, which are required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and such concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained with a sparse linear combination of these concepts.
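An illustrative sketch of the two-stage cascade described above: a localization net (LNet) produces a face response map, the peak-response region is cropped, and an attribute net (ANet) predicts attribute scores on the crop. The callable network interfaces, crop size, and peak-to-crop mapping are assumptions for illustration, not the paper's pipeline.

```python
# LNet localizes the face via its response map; ANet scores attributes on the crop.
import numpy as np

def predict_attributes(image, lnet, anet, crop_size=128):
    """image: HxW(xC) array. lnet(image) -> 2-D face response map;
    anet(crop) -> vector of attribute scores."""
    response = lnet(image)                          # coarse face localization
    cy, cx = np.unravel_index(np.argmax(response), response.shape)
    # Map the response-map peak back to image coordinates (assumes same aspect).
    sy = image.shape[0] / response.shape[0]
    sx = image.shape[1] / response.shape[1]
    y, x = int(cy * sy), int(cx * sx)
    half = crop_size // 2
    y0, x0 = max(0, y - half), max(0, x - half)
    crop = image[y0:y0 + crop_size, x0:x0 + crop_size]
    return anet(crop)                               # per-attribute scores
```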

2,822 citations