Quality Aware Network for Set to Set Recognition

doi:10.1109/CVPR.2017.499

Proceedings Article

Yu Liu¹, Junjie Yan¹, Wanli Ouyang²

SenseTime¹, University of Sydney²

01 Jul 2017, pp. 4694-4703

TL;DR: A quality aware network (QAN) automatically learns the quality of each sample even though no quality information is provided during training. The network has two branches: the first extracts an appearance feature embedding and the other predicts a quality score for each sample.

Abstract: This paper targets the problem of set-to-set recognition, which learns the metric between two image sets. Images in each set belong to the same identity. Since images in a set can be complementary, they can lead to higher accuracy in practical applications. However, the quality of each sample cannot be guaranteed, and samples with poor quality will hurt the metric. In this paper, the quality aware network (QAN) is proposed to confront this problem, where the quality of each sample can be automatically learned although such information is not explicitly provided in the training stage. The network has two branches, where the first branch extracts an appearance feature embedding for each sample and the other branch predicts a quality score for each sample. Features and quality scores of all samples in a set are then aggregated to generate the final feature embedding. We show that the two branches can be trained in an end-to-end manner given only the set-level identity annotation. Analysis of the gradient spread of this mechanism indicates that the quality learned by the network is beneficial to set-to-set recognition and simplifies the distribution that the network needs to fit. Experiments on both face verification and person re-identification show the advantages of the proposed QAN. The source code and network structure can be downloaded at GitHub.
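
The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes the set embedding is a quality-weighted average of the per-image features; the abstract does not spell out the exact rule, so the normalization, the function name `aggregate_set`, and the sample values are illustrative assumptions rather than the authors' implementation.

```python
def aggregate_set(features, quality_scores):
    """Combine per-image feature vectors into one set-level embedding.

    Assumption: a weighted average in which each image's weight is its
    predicted quality score, normalized by the sum of all scores.
    """
    if not features or len(features) != len(quality_scores):
        raise ValueError("need one quality score per feature vector")
    total = sum(quality_scores)
    if total <= 0:
        raise ValueError("quality scores must sum to a positive value")
    dim = len(features[0])
    return [
        sum(q * f[d] for f, q in zip(features, quality_scores)) / total
        for d in range(dim)
    ]


# Hypothetical set of three 2-D embeddings; the second image has low quality.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = [0.9, 0.1, 0.9]
set_embedding = aggregate_set(features, scores)
```

With these made-up values the low-quality second image contributes little, so the set embedding stays close to the two high-quality samples.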

Citations

Book Chapter

Part-Aligned Bilinear Representations for Person Re-Identification

Yumin Suh¹, Jingdong Wang², Siyu Tang³, Tao Mei, Kyoung Mu Lee¹

Seoul National University¹, Microsoft², Max Planck Society³

08 Sep 2018

TL;DR: A network that learns a part-aligned representation for person re-identification, addressing the problem that body parts are misaligned across human detections because of pose or viewpoint changes and unreliable detection.

Abstract: Comparing the appearance of corresponding body parts is essential for person re-identification. As body parts are frequently misaligned between the detected human boxes, an image representation that can handle this misalignment is required. In this paper, we propose a network that learns a part-aligned representation for person re-identification. Our model consists of a two-stream network, which generates appearance and body part feature maps respectively, and a bilinear-pooling layer that fuses two feature maps to an image descriptor. We show that it results in a compact descriptor, where the image matching similarity is equivalent to an aggregation of the local appearance similarities of the corresponding body parts. Since the image similarity does not depend on the relative positions of parts, our approach significantly reduces the part misalignment problem. Training the network does not require any part annotation on the person re-identification dataset. Instead, we simply initialize the part sub-stream using a pre-trained sub-network of an existing pose estimation network and train the whole network to minimize the re-identification loss. We validate the effectiveness of our approach by demonstrating its superiority over the state-of-the-art methods on the standard benchmark datasets including Market-1501, CUHK03, CUHK01 and DukeMTMC, and standard video dataset MARS.
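
The claim that the matching similarity equals an aggregation of local similarities can be checked numerically with a small sketch. It assumes plain sum pooling of outer products with no normalization, which may differ from the pooling the authors use; the function names and the toy feature maps are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def bilinear_pool(appearance, part):
    """Sum over locations of the flattened outer product appearance x part."""
    pooled = [0.0] * (len(appearance[0]) * len(part[0]))
    for a, p in zip(appearance, part):
        k = 0
        for ai in a:
            for pj in p:
                pooled[k] += ai * pj
                k += 1
    return pooled


def pairwise_similarity(app1, part1, app2, part2):
    """Sum over all location pairs of appearance similarity times part similarity."""
    return sum(
        dot(a1, a2) * dot(p1, p2)
        for a1, p1 in zip(app1, part1)
        for a2, p2 in zip(app2, part2)
    )


# Toy images with two locations each and 2-D appearance and part features.
app1, part1 = [[1.0, 2.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
app2, part2 = [[2.0, 1.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]]

descriptor_similarity = dot(bilinear_pool(app1, part1), bilinear_pool(app2, part2))
local_similarity = pairwise_similarity(app1, part1, app2, part2)
```

Both quantities agree in this example, and the location indices never enter the comparison, which is why the similarity does not depend on where a part appears in the box.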

371 citations

Proceedings Article

Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning

Yu Wu¹, Yutian Lin¹, Xuanyi Dong¹, Yan Yan¹, Wanli Ouyang², Yi Yang¹

University of Technology, Sydney¹, University of Sydney²

18 Jun 2018

TL;DR: This paper proposes an approach to exploiting unlabeled tracklets by gradually but steadily improving the discriminative capability of the Convolutional Neural Network feature representation via stepwise learning.

Abstract: We focus on one-shot learning for video-based person re-identification (re-ID). Unlabeled tracklets for person re-ID tasks can be easily obtained by preprocessing, such as pedestrian detection and tracking. In this paper, we propose an approach to exploiting unlabeled tracklets by gradually but steadily improving the discriminative capability of the Convolutional Neural Network (CNN) feature representation via stepwise learning. We first initialize a CNN model using one labeled tracklet for each identity. Then we update the CNN model by the following two steps iteratively: 1. sample a few candidates with the most reliable pseudo labels from unlabeled tracklets; 2. update the CNN model according to the selected data. Instead of the static sampling strategy applied in existing works, we propose a progressive sampling method to increase the number of selected pseudo-labeled candidates step by step. We systematically investigate how pseudo-labeled tracklets should be selected into the training set to make the best use of them. Notably, the rank-1 accuracy of our method outperforms the state-of-the-art method by 21.46 points (absolute, i.e., 62.67% vs. 41.21%) on the MARS dataset, and 16.53 points on the DukeMTMC-VideoReID dataset.
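
The progressive sampling idea can be sketched as follows. This is a minimal illustration that assumes a fixed number of extra candidates is admitted at every step and that reliability is given as one confidence value per unlabeled tracklet; the function name `select_pseudo_labeled`, the `per_step` parameter, and the confidence values are hypothetical and not taken from the paper.

```python
def select_pseudo_labeled(confidences, step, per_step):
    """Return indices of the most reliable unlabeled tracklets for this step.

    Assumption: the selected set grows linearly, by `per_step` tracklets
    per iteration, until every unlabeled tracklet is included.
    """
    n = len(confidences)
    k = min(n, step * per_step)
    # Rank tracklets from most to least confident pseudo label.
    ranked = sorted(range(n), key=lambda i: confidences[i], reverse=True)
    return ranked[:k]


# Hypothetical pseudo-label confidences for five unlabeled tracklets.
confidences = [0.9, 0.2, 0.7, 0.4, 0.6]
selected_per_step = [
    select_pseudo_labeled(confidences, step, per_step=2) for step in (1, 2, 3)
]
```

In a full training loop the model would be retrained on the selected tracklets after each step and the confidences recomputed before the next selection; that part is omitted here.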

348 citations

Proceedings Article

Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification

Shuang Li¹, Slawomir Bak, Peter W. Carr, Xiaogang Wang¹

The Chinese University of Hong Kong¹

18 Jun 2018

TL;DR: A new spatiotemporal attention model is proposed that automatically discovers a diverse set of distinctive body parts in video clips of people across non-overlapping cameras and outperforms the state-of-the-art approaches by large margins on multiple metrics.

Abstract: Video-based person re-identification matches video clips of people across non-overlapping cameras. Most existing methods tackle this problem by encoding each video frame in its entirety and computing an aggregate representation across all frames. In practice, people are often partially occluded, which can corrupt the extracted features. Instead, we propose a new spatiotemporal attention model that automatically discovers a diverse set of distinctive body parts. This allows useful information to be extracted from all frames without succumbing to occlusions and misalignments. The network learns multiple spatial attention models and employs a diversity regularization term to ensure multiple models do not discover the same body part. Features extracted from local image regions are organized by spatial attention model and are combined using temporal attention. As a result, the network learns latent representations of the face, torso and other body parts using the best available image patches from the entire video sequence. Extensive evaluations on three datasets show that our framework outperforms the state-of-the-art approaches by large margins on multiple metrics.

342 citations

Proceedings Article

Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-identification

Jianlou Si¹, Honggang Zhang¹, Chunguang Li¹, Jason Kuen², Xiangfei Kong², Alex C. Kot², Gang Wang³

Beijing University of Posts and Telecommunications¹, Nanyang Technological University², Alibaba Group³

18 Jun 2018

TL;DR: A novel end-to-end trainable framework, called Dual ATtention Matching network (DuATM), that learns context-aware feature sequences and performs attentive sequence comparison simultaneously, using intra-sequence and inter-sequence attention strategies for feature refinement and feature-pair alignment, respectively.

Abstract: Typical person re-identification (ReID) methods usually describe each pedestrian with a single feature vector and match them in a task-specific metric space. However, methods based on a single feature vector are not sufficient to overcome visual ambiguity, which frequently occurs in real scenarios. In this paper, we propose a novel end-to-end trainable framework, called Dual ATtention Matching network (DuATM), to learn context-aware feature sequences and perform attentive sequence comparison simultaneously. The core component of our DuATM framework is a dual attention mechanism, in which both intra-sequence and inter-sequence attention strategies are used for feature refinement and feature-pair alignment, respectively. Thus, detailed visual cues contained in the intermediate feature sequences can be automatically exploited and properly compared. We train the proposed DuATM network as a siamese network via a triplet loss assisted with a decorrelation loss and a cross-entropy loss. We conduct extensive experiments on both image and video based ReID benchmark datasets. Experimental results demonstrate the significant advantages of our approach compared to the state-of-the-art methods.

341 citations

Posted Content

Towards Real-Time Multi-Object Tracking

Zhongdao Wang¹, Liang Zheng², Yixuan Liu¹, Yali Li¹, Shengjin Wang¹

Tsinghua University¹, Australian National University²

27 Sep 2019, arXiv: Computer Vision and Pattern Recognition

TL;DR: This work incorporates the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings, and is formulated as a multi-task learning problem.

Abstract: Modern multiple object tracking (MOT) systems usually follow the tracking-by-detection paradigm. It has 1) a detection model for target localization and 2) an appearance embedding model for data association. Having the two models separately executed might lead to efficiency problems, as the running time is simply a sum of the two steps without investigating potential structures that can be shared between them. Existing research efforts on real-time MOT usually focus on the association step, so they are essentially real-time association methods rather than real-time MOT systems. In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. Specifically, we incorporate the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings. We further propose a simple and fast association method that works in conjunction with the joint model. In both components the computation cost is significantly reduced compared with former MOT systems, resulting in a neat and fast baseline for future follow-ups on real-time MOT algorithm design. To our knowledge, this work reports the first (near) real-time MOT system, with a running speed of 22 to 40 FPS depending on the input resolution. Meanwhile, its tracking accuracy is comparable to the state-of-the-art trackers embodying separate detection and embedding (SDE) learning (64.4% MOTA vs. 66.1% MOTA on the MOT-16 challenge). Code and models are available at this https URL.

296 citations

References

Proceedings Article

Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories

Svetlana Lazebnik¹, Cordelia Schmid², Jean Ponce³

University of Illinois at Urbana–Champaign¹, French Institute for Research in Computer Science and Automation², École Normale Supérieure³

17 Jun 2006

TL;DR: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence that exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories.

Abstract: This paper presents a method for recognizing scene categories based on approximate global geometric correspondence. This technique works by partitioning the image into increasingly fine sub-regions and computing histograms of local features found inside each sub-region. The resulting "spatial pyramid" is a simple and computationally efficient extension of an orderless bag-of-features image representation, and it shows significantly improved performance on challenging scene categorization tasks. Specifically, our proposed method exceeds the state of the art on the Caltech-101 database and achieves high accuracy on a large database of fifteen natural scene categories. The spatial pyramid framework also offers insights into the success of several recently proposed image descriptions, including Torralba's "gist" and Lowe's SIFT descriptors.
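
The partition-and-histogram construction can be sketched in a few lines. The example assumes each local feature is already quantized to a visual-word index and has coordinates normalized to the range 0 to 1; it concatenates raw per-cell counts and leaves out any per-level weighting or matching kernel. The function name `spatial_pyramid` and the sample features are hypothetical.

```python
def spatial_pyramid(features, vocab_size, levels):
    """Concatenate per-cell visual-word histograms over grids of 1x1, 2x2, 4x4, ...

    `features` is a list of (x, y, word) tuples with x and y in [0, 1]
    and word in range(vocab_size).
    """
    pyramid = []
    for level in range(levels + 1):
        cells = 2 ** level  # grid is cells x cells at this level
        hist = [0] * (cells * cells * vocab_size)
        for x, y, word in features:
            col = min(int(x * cells), cells - 1)  # clamp x == 1.0 into the last cell
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * vocab_size + word] += 1
        pyramid.extend(hist)
    return pyramid


# Three hypothetical local features, a two-word vocabulary, levels 0 and 1.
features = [(0.1, 0.1, 0), (0.9, 0.9, 1), (0.2, 0.8, 0)]
descriptor = spatial_pyramid(features, vocab_size=2, levels=1)
```

Level 0 is the orderless bag-of-features histogram; each finer level adds coarse position information by counting the same features separately per cell.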

8,736 citations

Proceedings Article

FaceNet: A unified embedding for face recognition and clustering

Florian Schroff¹, Dmitry Kalenichenko¹, James Philbin¹

Google¹

07 Jun 2015

TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, and achieves state-of-the-art face recognition performance using only 128 bytes per face.

Abstract: Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
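
Once embeddings live in a Euclidean space where distance tracks face similarity, verification reduces to thresholding a distance. The sketch below assumes a squared Euclidean distance and a caller-chosen threshold; the function names, the 2-D toy embeddings, and the threshold value are illustrative assumptions, not values from the paper.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embeddings of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def same_identity(a, b, threshold):
    """Declare a match when the embeddings are closer than the threshold."""
    return squared_distance(a, b) < threshold


# Hypothetical 2-D embeddings; real embeddings would have many more dimensions.
anchor = [0.0, 1.0]
other = [1.0, 0.0]
is_match = same_identity(anchor, other, threshold=1.0)
```

In practice the threshold would be tuned on held-out pairs; recognition and clustering can reuse the same distance with nearest-neighbor search or a standard clustering algorithm.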

8,289 citations

Proceedings Article

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Yaniv Taigman¹, Ming Yang¹, Marc'Aurelio Ranzato¹, Lior Wolf²

Facebook¹, Tel Aviv University²

23 Jun 2014

TL;DR: This work revisits both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network.

Abstract: In modern face recognition, the conventional pipeline consists of four stages: detect => align => represent => classify. We revisit both the alignment step and the representation step by employing explicit 3D face modeling in order to apply a piecewise affine transformation, and derive a face representation from a nine-layer deep neural network. This deep network involves more than 120 million parameters using several locally connected layers without weight sharing, rather than the standard convolutional layers. Thus we trained it on the largest facial dataset to date, an identity labeled dataset of four million facial images belonging to more than 4,000 identities. The learned representations coupling the accurate model-based alignment with the large facial database generalize remarkably well to faces in unconstrained environments, even with a simple classifier. Our method reaches an accuracy of 97.35% on the Labeled Faces in the Wild (LFW) dataset, reducing the error of the current state of the art by more than 27%, closely approaching human-level performance.

6,132 citations

Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments

Gary B. Huang¹, Marwan Mattar¹, Tamara L. Berg², Eric Learned-Miller¹

University of Massachusetts Amherst¹, Stony Brook University²

01 Oct 2008

TL;DR: The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, and exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background.

Abstract: Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life. The database exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background. In addition to describing the details of the database, we provide specific experimental paradigms for which the database is suitable. This is done in an effort to make research performed with the database as consistent and comparable as possible. We provide baseline results, including results of a state of the art face recognition system combined with a face alignment system. To facilitate experimentation on the database, we provide several parallel databases, including an aligned version.

5,742 citations

Proceedings Article

Deep face recognition

Omkar M. Parkhi¹, Andrea Vedaldi¹, Andrew Zisserman¹

University of Oxford¹

01 Jan 2015

TL;DR: It is shown how a very large scale dataset can be assembled by a combination of automation and human in the loop, and the trade off between data purity and time is discussed.

Abstract: The goal of this paper is face recognition – from either a single photograph or from a set of faces tracked in a video. Recent progress in this area has been due to two factors: (i) end to end learning for the task using a convolutional neural network (CNN), and (ii) the availability of very large scale training datasets. We make two contributions: first, we show how a very large scale dataset (2.6M images, over 2.6K people) can be assembled by a combination of automation and human in the loop, and discuss the trade off between data purity and time; second, we traverse through the complexities of deep network training and face recognition to present methods and procedures to achieve comparable state of the art results on the standard LFW and YTF face benchmarks.

5,308 citations