Proceedings ArticleDOI

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-identification

TL;DR: Both the orientation invariant feature embedding and the spatio-temporal regularization achieve considerable improvements in the vehicle Re-identification problem.
Abstract: In this paper, we tackle the vehicle Re-identification (ReID) problem, which is of great importance in urban surveillance and can be used for multiple applications. In our vehicle ReID framework, an orientation invariant feature embedding module and a spatial-temporal regularization module are proposed. With orientation invariant feature embedding, local region features of different orientations can be extracted based on 20 key point locations and can be well aligned and combined. With spatial-temporal regularization, the log-normal distribution is adopted to model the spatial-temporal constraints and the retrieval results can be refined. Experiments are conducted on public vehicle ReID datasets and our proposed method achieves state-of-the-art performance. Investigations of the proposed framework are conducted, including the landmark regressor and comparisons with an attention mechanism. Both the orientation invariant feature embedding and the spatio-temporal regularization achieve considerable improvements.
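The spatial-temporal regularization described above can be made concrete with a small sketch: transfer times between a camera pair are fitted with a log-normal distribution, whose density then re-weights the appearance distance at retrieval time. This is a minimal illustration assuming per-camera-pair transfer times are available; the fusion rule and the `alpha` weight are assumptions, not the paper's exact equations.

```python
import numpy as np
from scipy.stats import lognorm

def fit_transfer_model(transfer_times):
    # Fit a log-normal to observed transfer times (seconds) between two cameras.
    sigma, _, scale = lognorm.fit(np.asarray(transfer_times), floc=0)
    return sigma, scale  # shape sigma and scale = exp(mu)

def refined_distance(appearance_dist, delta_t, sigma, scale, alpha=1.0):
    # Re-weight the appearance distance by the spatio-temporal likelihood;
    # this fusion form is illustrative, not the paper's exact formulation.
    st_likelihood = lognorm.pdf(delta_t, s=sigma, scale=scale)
    return appearance_dist / (1.0 + alpha * st_likelihood)
```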
Citations
Proceedings ArticleDOI
18 Jun 2018
TL;DR: A Viewpoint-aware Attentive Multi-view Inference (VAMI) model that only requires visual information to solve the multi-view vehicle re-ID problem and achieves consistent improvements over state-of-the-art vehicle re-ID methods on two public datasets: VeRi and VehicleID.
Abstract: Vehicle re-identification (re-ID) has huge potential to contribute to intelligent video surveillance. However, it suffers from challenges that different vehicle identities with a similar appearance have little inter-instance discrepancy while one vehicle usually has large intra-instance differences under viewpoint and illumination variations. Previous methods address vehicle re-ID by simply using visual features from originally captured views and usually exploit the spatial-temporal information of the vehicles to refine the results. In this paper, we propose a Viewpoint-aware Attentive Multi-view Inference (VAMI) model that only requires visual information to solve the multi-view vehicle re-ID problem. Given vehicle images of arbitrary viewpoints, the VAMI extracts the single-view feature for each input image and aims to transform the features into a global multi-view feature representation so that pairwise distance metric learning can be better optimized in such a viewpoint-invariant feature space. The VAMI adopts a viewpoint-aware attention model to select core regions at different viewpoints and implement effective multi-view feature inference by an adversarial training architecture. Extensive experiments validate the effectiveness of each proposed component and illustrate that our approach achieves consistent improvements over state-of-the-art vehicle re-ID methods on two public datasets: VeRi and VehicleID.
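The viewpoint-aware attention step can be sketched as gating spatial feature locations with a viewpoint embedding before pooling. A minimal PyTorch sketch under stated assumptions: a discrete viewpoint label is available, and the module name, `n_views`, and layer shapes are illustrative rather than VAMI's actual architecture (which also involves adversarial multi-view inference).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewpointAttention(nn.Module):
    # Hypothetical module: a viewpoint embedding modulates the feature map,
    # a 1x1 conv scores locations, and softmax-weighted pooling follows.
    def __init__(self, feat_dim=256, n_views=5):
        super().__init__()
        self.view_emb = nn.Embedding(n_views, feat_dim)
        self.score = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, fmap, view_id):
        # fmap: (B, C, H, W); view_id: (B,) long tensor of viewpoint labels
        v = self.view_emb(view_id)[:, :, None, None]          # (B, C, 1, 1)
        attn = self.score(fmap * v)                           # (B, 1, H, W)
        attn = F.softmax(attn.flatten(2), dim=-1).view_as(attn)
        return (fmap * attn).sum(dim=(2, 3))                  # (B, C)
```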

252 citations


Cites background from "Orientation Invariant Feature Embed..."

  • ...Moreover, OIFE [26] aims to align local region features of different viewpoints based on key points....

  • ...Many vehicle re-ID researchers also noticed the challenges, thus preferred to make use of license plate or spatial-temporal information [15, 26, 23] to...

  • ...The key point alignment of OIFE does not work well for large viewpoint variations....

  • ...[26] proposed the visual-spatiotemporal path proposals and orientation invariant feature embedding as well as spatial-temporal regularization, respectively, to focus on exploiting vehicles’ spatial and temporal information to address the vehicle re-ID task....

Proceedings ArticleDOI
15 Jun 2019
TL;DR: A new method for vehicle ReID is proposed, in which the ReID model is coupled into a Feature Distance Adversarial Network (FDA-Net), and a novel feature distance adversary scheme is designed to generate hard negative samples in feature space to facilitate ReID model training.
Abstract: Vehicle Re-identification (ReID) is of great significance to intelligent transportation and public security. However, many challenging issues of vehicle ReID in real-world scenarios have not been fully investigated, e.g., high viewpoint variations, extreme illumination conditions, complex backgrounds, and different camera sources. To promote the research of vehicle ReID in the wild, we collect a new dataset called VERI-Wild with the following distinct features: 1) The vehicle images are captured by a large surveillance system containing 174 cameras covering a large urban district (more than 200 km^2). 2) The camera network continuously captures vehicles for 24 hours each day over one month. 3) It is the first vehicle ReID dataset collected under unconstrained conditions. It is also a large dataset containing more than 400 thousand images of 40 thousand vehicle IDs. In this paper, we also propose a new method for vehicle ReID, in which the ReID model is coupled into a Feature Distance Adversarial Network (FDA-Net), and a novel feature distance adversary scheme is designed to generate hard negative samples in feature space to facilitate ReID model training. The comprehensive results show the effectiveness of our method on the proposed dataset and two other existing datasets.
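The idea of generating hard negatives in feature space can be sketched as a small residual generator that perturbs a real negative's embedding toward the anchor; the network shape and residual form below are assumptions for illustration, not FDA-Net's actual design.

```python
import torch
import torch.nn as nn

class HardNegativeGenerator(nn.Module):
    # Hypothetical sketch: conditioned on (anchor, negative), emit a residual
    # that makes the negative harder while staying near the feature manifold.
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, anchor, negative):
        # anchor, negative: (B, dim) embeddings from the ReID model
        return negative + self.net(torch.cat([anchor, negative], dim=1))
```

In an adversarial setup of this kind, the generator is trained to reduce the anchor-negative distance while the ReID model is trained to keep the generated negatives separable.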

223 citations


Cites methods from "Orientation Invariant Feature Embed..."

  • ...Compared with OIFE [21], which used key-point alignment in vehicle feature representation, the proposed FDA-Net achieves much better performance....

Proceedings ArticleDOI
15 Jun 2019
TL;DR: This paper proposes a simple but efficient part-regularized discriminative feature preserving method which enhances the perception of subtle discrepancies in vehicle re-identification and develops a novel framework to integrate part constraints with the global Re-ID modules by introducing a detection branch.
Abstract: Vehicle re-identification (Re-ID) has been attracting increasing interest in computer vision owing to its great contributions in urban surveillance and intelligent transportation. With the development of deep learning approaches, vehicle Re-ID still faces a near-duplicate challenge, which is to distinguish different instances with nearly identical appearances. Previous methods simply rely on global visual features to handle this problem. In this paper, we propose a simple but efficient part-regularized discriminative feature preserving method which enhances the perception of subtle discrepancies. We further develop a novel framework to integrate part constraints with the global Re-ID modules by introducing a detection branch. Our framework is trained end-to-end with combined local and global constraints. Specifically, even without the part-regularized local constraints in the inference step, our Re-ID network outperforms the state-of-the-art method by a large margin on the large benchmark datasets VehicleID and VeRi-776.
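The core fusion step can be sketched as pooling one feature per detected part region and concatenating it with a global descriptor. A minimal sketch assuming a fixed number of parts per image and boxes given in feature-map coordinates; the heads and fusion are illustrative, not the paper's exact framework.

```python
import torch
from torchvision.ops import roi_align

def fuse_global_and_parts(fmap, boxes):
    # fmap: (B, C, H, W); boxes: (B*P, 5) rows [batch_idx, x1, y1, x2, y2],
    # ordered by batch index, produced by a part-detection branch.
    B, C, _, _ = fmap.shape
    P = boxes.shape[0] // B
    global_feat = fmap.mean(dim=(2, 3))                          # (B, C)
    part_feat = roi_align(fmap, boxes, output_size=1).view(B, P * C)
    return torch.cat([global_feat, part_feat], dim=1)            # (B, C + P*C)
```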

221 citations


Cites background or methods from "Orientation Invariant Feature Embed..."

  • ...With the proposals of large datasets [14, 12, 27] and the development of deep learning algorithms [24, 36], recent models have gained remarkable success in the past decade....

  • ...[24] explored vehicle viewpoint attribute and proposed orientation invariant feature embedding module....

  • ...Fact+Plate+STR [14], Siamese+Path [21] and OIFE+ST [24] rely on the spatial-temporal information in the VeRi-776 dataset....

  • ...Besides, some other methods [21, 24] rely on extra spatial-temporal information to explore the final retrieval results....

  • ...OIFE [24] and VAMI [36] exploit the vehicle view information and use the view-invariant feature to roughly align the vehicle image....

Posted Content
TL;DR: This work introduces CityFlow, a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km.
Abstract: Urban traffic optimization using traffic cameras as sensors is driving the need to advance state-of-the-art multi-target multi-camera (MTMC) tracking. This work introduces CityFlow, a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the longest distance between two simultaneous cameras being 2.5 km. To the best of our knowledge, CityFlow is the largest-scale dataset in terms of spatial coverage and the number of cameras/videos in an urban environment. The dataset contains more than 200K annotated bounding boxes covering a wide range of scenes, viewing angles, vehicle models, and urban traffic flow conditions. Camera geometry and calibration information are provided to aid spatio-temporal analysis. In addition, a subset of the benchmark is made available for the task of image-based vehicle re-identification (ReID). We conducted an extensive experimental evaluation of baselines/state-of-the-art approaches in MTMC tracking, multi-target single-camera (MTSC) tracking, object detection, and image-based ReID on this dataset, analyzing the impact of different network architectures, loss functions, spatio-temporal models and their combinations on task effectiveness. An evaluation server is launched with the release of our benchmark at the 2019 AI City Challenge (this https URL) that allows researchers to compare the performance of their newest techniques. We expect this dataset to catalyze research in this field, propel the state-of-the-art forward, and lead to deployed traffic optimization(s) in the real world.

207 citations

Proceedings ArticleDOI
01 Oct 2019
TL;DR: In this paper, a dual-path adaptive attention model for vehicle re-identification (AAVER) is proposed, where the global appearance path captures macroscopic vehicle features and the orientation conditioned part appearance path learns to capture localized discriminative features by focusing attention on the most informative key-points.
Abstract: In recent years, attention models have been extensively used for person and vehicle re-identification. Most re-identification methods are designed to focus attention on key-point locations. However, depending on the orientation, the contribution of each key-point varies. In this paper, we present a novel dual-path adaptive attention model for vehicle re-identification (AAVER). The global appearance path captures macroscopic vehicle features while the orientation conditioned part appearance path learns to capture localized discriminative features by focusing attention on the most informative key-points. Through extensive experimentation, we show that the proposed AAVER method is able to accurately re-identify vehicles in unconstrained scenarios, yielding state-of-the-art results on the challenging VeRi-776 dataset. As a byproduct, the proposed system is also able to accurately predict vehicle key-points and shows an improvement of more than 7% over the state of the art. The code for the key-point estimation model is available at https://github.com/Pirazh/Vehicle_Key_Point_Orientation_Estimation
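The orientation-conditioned part path can be sketched as pooling features only at the key-points judged informative for the predicted orientation. A minimal sketch assuming key-point heatmaps and an orientation-dependent index set are already predicted; names and shapes are illustrative, not AAVER's exact heads.

```python
import torch

def keypoint_conditioned_feature(fmap, heatmaps, visible_idx):
    # fmap: (B, C, H, W); heatmaps: (B, K, H, W);
    # visible_idx: (B, M) long indices of orientation-relevant key-points.
    sel = torch.gather(
        heatmaps, 1,
        visible_idx[:, :, None, None].expand(-1, -1, *heatmaps.shape[2:]))
    attn = sel.sum(dim=1, keepdim=True)                        # (B, 1, H, W)
    attn = attn / attn.sum(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    return (fmap * attn).sum(dim=(2, 3))                       # (B, C)
```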

134 citations

References
Proceedings ArticleDOI
07 Jun 2015
TL;DR: Inception as mentioned in this paper is a deep convolutional neural network architecture that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).
Abstract: We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
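The hallmark of the Inception design, parallel convolutions of several receptive-field sizes whose outputs are concatenated along channels, can be shown in a few lines; the channel counts here are illustrative, not GoogLeNet's exact configuration.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Parallel 1x1, 3x3, 5x5 and pooled branches, concatenated on channels.
    def __init__(self, c_in, c1, c3, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, 1)
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3, 1), nn.ReLU(),
                                nn.Conv2d(c3, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5, 1), nn.ReLU(),
                                nn.Conv2d(c5, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, cp, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)
```

The 1x1 convolutions before the larger kernels are what keep the computational budget roughly constant as depth and width grow.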

40,257 citations

Journal Article
TL;DR: A new technique called t-SNE that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map, a variation of Stochastic Neighbor Embedding that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large datasets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of datasets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the datasets.
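The OIFE paper uses t-SNE exactly this way, projecting VeRi-776 test features to 2-D for visualization. A minimal, runnable example with scikit-learn; the feature matrix is a random placeholder standing in for real re-ID embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

feats = np.random.randn(500, 256).astype(np.float32)  # placeholder embeddings
xy = TSNE(n_components=2, perplexity=30, init="pca",
          random_state=0).fit_transform(feats)
print(xy.shape)  # (500, 2), ready for a scatter plot colored by identity
```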

30,124 citations


"Orientation Invariant Feature Embed..." refers background or methods in this paper

  • ...Illustration of the orientation invariant features with t-SNE [10]....

  • ...Features of selected vehicle images in the VeRi-776 test set are projected to 2-dimensional space using t-SNE [10] and are visualized in Fig....

  • ..., C1 = [5, 6, 7, 8, 9, 10, 13, 14], C2 = [15, 16, 17, 18, 19, 20], C3 = [1, 2, 6, 8, 11, 14, 15, 17], and C4 = [3, 4, 5, 7, 12, 13, 16, 18], corresponding to the key points belonging to the vehicle’s front face, back face, left...

Proceedings ArticleDOI
07 Jun 2015
TL;DR: A system that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity, and achieves state-of-the-art face recognition performance using only 128 bytes per face.
Abstract: Despite significant recent advances in the field of face recognition [10, 14, 15, 17], implementing face verification and recognition efficiently at scale presents serious challenges to current approaches. In this paper we present a system, called FaceNet, that directly learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. Once this space has been produced, tasks such as face recognition, verification and clustering can be easily implemented using standard techniques with FaceNet embeddings as feature vectors.
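FaceNet's training objective is the triplet loss over L2-normalized embeddings, pulling same-identity pairs together and pushing different identities apart by a margin. A minimal sketch of that loss (margin value as in the paper; batch construction and hard-triplet mining are omitted):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor/positive/negative: (B, D) L2-normalized embeddings
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared distance, same ID
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared distance, diff ID
    return F.relu(d_ap - d_an + margin).mean()
```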

8,289 citations


"Orientation Invariant Feature Embed..." refers background in this paper

  • ...Key point-based face alignment is conducted in most face recognition frameworks [15, 18]....

  • ..., C1 = [5, 6, 7, 8, 9, 10, 13, 14], C2 = [15, 16, 17, 18, 19, 20], C3 = [1, 2, 6, 8, 11, 14, 15, 17], and C4 = [3, 4, 5, 7, 12, 13, 16, 18], corresponding to the key points belonging to the vehicle’s front face, back face, left...

Book ChapterDOI
08 Oct 2016
TL;DR: This work introduces a novel convolutional network architecture for the task of human pose estimation that is described as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions.
Abstract: This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks outcompeting all recent methods.
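The repeated bottom-up, top-down processing with per-scale skip connections can be written as a short recursive module; the conv block and depth below are simplified stand-ins for the paper's residual modules.

```python
import torch.nn as nn

def conv_block(ch=64):
    # simplified channel-preserving block (the paper uses residual modules)
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())

class Hourglass(nn.Module):
    # One recursive stage: downsample, process the coarser scale, upsample,
    # and add a same-resolution skip branch.
    def __init__(self, depth=4, ch=64):
        super().__init__()
        self.skip = conv_block(ch)
        self.down = nn.MaxPool2d(2)
        self.inner = Hourglass(depth - 1, ch) if depth > 1 else conv_block(ch)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):  # x: (B, ch, H, W) with H, W divisible by 2**depth
        return self.skip(x) + self.up(self.inner(self.down(x)))
```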

3,865 citations


"Orientation Invariant Feature Embed..." refers background or methods in this paper

  • ..., face alignment [14] and human pose estimation [13, 20]....

  • ..., C1 = [5, 6, 7, 8, 9, 10, 13, 14], C2 = [15, 16, 17, 18, 19, 20], C3 = [1, 2, 6, 8, 11, 14, 15, 17], and C4 = [3, 4, 5, 7, 12, 13, 16, 18], corresponding to the key points belonging to the vehicle’s front face, back face, left...

  • ...Inspired by the Stacked Hourglass Networks which generate response maps of human joints in a stacked coarse-to-fine manner for human pose estimation [13], an hourglass-like fully convolutional network is adopted to generate vehicle key point response maps....

  • ...However, the Hourglass model [13] is computationally expensive....

Proceedings ArticleDOI
07 Dec 2015
TL;DR: As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.
Abstract: This paper contributes a new high quality dataset for person re-identification, named "Market-1501". Generally, current datasets: 1) are limited in scale; 2) consist of hand-drawn bboxes, which are unavailable under realistic settings; 3) have only one ground truth and one query image for each identity (closed environment). To tackle these problems, the proposed Market-1501 dataset is featured in three aspects. First, it contains over 32,000 annotated bboxes, plus a distractor set of over 500K images, making it the largest person re-id dataset to date. Second, images in the Market-1501 dataset are produced using the Deformable Part Model (DPM) as the pedestrian detector. Third, our dataset is collected in an open system, where each identity has multiple images under each camera. As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor. We view person re-identification as a special task of image search. In experiments, we show that the proposed descriptor yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.
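The unsupervised Bag-of-Words descriptor can be sketched as quantizing local patch features against a learned codebook and describing each image by its normalized visual-word histogram. The codebook size and plain k-means quantization here are illustrative simplifications of the paper's Color Name based pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_descriptors(local_feats_per_image, n_words=350):
    # local_feats_per_image: list of (N_i, D) arrays of local patch features
    codebook = KMeans(n_clusters=n_words, n_init=4).fit(
        np.vstack(local_feats_per_image))
    hists = []
    for feats in local_feats_per_image:
        words = codebook.predict(feats)                  # visual-word labels
        h = np.bincount(words, minlength=n_words).astype(np.float32)
        hists.append(h / max(h.sum(), 1.0))              # L1-normalized histogram
    return np.stack(hists)                               # (n_images, n_words)
```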

3,564 citations


"Orientation Invariant Feature Embed..." refers background or methods in this paper

  • ...Bag of Words with Color Name Descriptor (BOW-CN) [28], the LOMO feature [6], and the KEPLER method [11], which learns salient regions for constructing discriminative features....

  • ...Many hand-crafted features are proposed to capture visual features for pedestrians [1,5,6,12,16,28]....

  • ...The proposed framework is compared with two stateof-the-art vehicle ReID approaches, i.e. PROVID [9] and DRDL [8], together with several conventional person ReID methods, i.e. Bag of Words with Color Name Descriptor (BOW-CN) [28], the LOMO feature [6], and the KEPLER method [11], which learns salient regions for constructing discriminative features....
