Showing papers in "International Journal of Computer Vision in 2019"

PDF

Open Access

Journal Article•DOI•

Semantic Understanding of Scenes Through the ADE20K Dataset

[...]

Bolei Zhou¹, Hang Zhao², Xavier Puig², Tete Xiao³, Sanja Fidler⁴, Adela Barriuso², Antonio Torralba² - Show less +3 more•Institutions (4)

The Chinese University of Hong Kong¹, Massachusetts Institute of Technology², Peking University³, University of Toronto⁴

01 Mar 2019-International Journal of Computer Vision

TL;DR: The ADE20K dataset as discussed by the authors contains 25k images of complex everyday scenes containing a variety of objects in their natural spatial context, on average there are 19.5 instances and 10.5 object classes per image.

...read moreread less

Abstract: Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. Totally there are 25k images of the complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both of the benchmarks and re-implement state-of-the-art models for open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for the semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.

...read moreread less

961 citations

Journal Article•DOI•

Video Enhancement with Task-Oriented Flow

[...]

Tianfan Xue¹, Baian Chen², Jiajun Wu², Donglai Wei³, William T. Freeman¹, William T. Freeman² - Show less +2 more•Institutions (3)

Google¹, Massachusetts Institute of Technology², Harvard University³

01 Aug 2019-International Journal of Computer Vision

TL;DR: Task-Oriented Flow (TOFlow) as mentioned in this paper is a self-supervised, task-specific representation for low-level video processing, which is trained in a supervised manner.

...read moreread less

Abstract: Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is however intractable; and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution.

...read moreread less

570 citations

Journal Article•DOI•

Locality Preserving Matching

[...]

Jiayi Ma¹, Ji Zhao, Junjun Jiang², Huabing Zhou³, Xiaojie Guo⁴ - Show less +1 more•Institutions (4)

Wuhan University¹, Harbin Institute of Technology², Wuhan Institute of Technology³, Tianjin University⁴

01 May 2019-International Journal of Computer Vision

TL;DR: The authors' method can accomplish the mismatch removal from thousands of putative correspondences in only a few milliseconds, and achieves better or favorably competitive performance in accuracy while intensively cutting time cost by more than two orders of magnitude.

...read moreread less

Abstract: Seeking reliable correspondences between two feature sets is a fundamental and important task in computer vision. This paper attempts to remove mismatches from given putative image feature correspondences. To achieve the goal, an efficient approach, termed as locality preserving matching (LPM), is designed, the principle of which is to maintain the local neighborhood structures of those potential true matches. We formulate the problem into a mathematical model, and derive a closed-form solution with linearithmic time and linear space complexities. Our method can accomplish the mismatch removal from thousands of putative correspondences in only a few milliseconds. To demonstrate the generality of our strategy for handling image matching problems, extensive experiments on various real image pairs for general feature matching, as well as for point set registration, visual homing and near-duplicate image retrieval are conducted. Compared with other state-of-the-art alternatives, our LPM achieves better or favorably competitive performance in accuracy while intensively cutting time cost by more than two orders of magnitude.

...read moreread less

416 citations

Journal Article•DOI•

Deep Supervised Hashing for Fast Image Retrieval

[...]

Haomiao Liu¹, Ruiping Wang¹, Shiguang Shan¹, Xilin Chen¹•Institutions (1)

Chinese Academy of Sciences¹

01 Sep 2019-International Journal of Computer Vision

TL;DR: A novel Deep Supervised Hashing method to learn compact similarity-preserving binary code for the huge body of image data using pairs/triplets of images as training inputs and encouraging the output of each image to approximate discrete values.

...read moreread less

Abstract: In this paper, we present a new hashing method to learn compact binary codes for highly efficient image retrieval on large-scale datasets. While the complex image appearance variations still pose a great challenge to reliable retrieval, in light of the recent progress of Convolutional Neural Networks (CNNs) in learning robust image representation on various vision tasks, this paper proposes a novel Deep Supervised Hashing method to learn compact similarity-preserving binary code for the huge body of image data. Specifically, we devise a CNN architecture that takes pairs/triplets of images as training inputs and encourages the output of each image to approximate discrete values (e.g. $$+\,1$$/$$-\,1$$). To this end, the loss functions are elaborately designed to maximize the discriminability of the output space by encoding the supervised information from the input image pairs/triplets, and simultaneously imposing regularization on the real-valued outputs to approximate the desired discrete values. For image retrieval, new-coming query images can be easily encoded by forward propagating through the network and then quantizing the network outputs to binary codes representation. Extensive experiments on three large scale datasets CIFAR-10, NUS-WIDE, and SVHN show the promising performance of our method compared with the state-of-the-arts.

...read moreread less

287 citations

Journal Article•DOI•

From BoW to CNN: Two Decades of Texture Representation for Texture Classification

[...]

Li Liu¹, Li Liu², Jie Chen¹, Paul Fieguth³, Guoying Zhao¹, Rama Chellappa⁴, Matti Pietikäinen¹ - Show less +3 more•Institutions (4)

University of Oulu¹, National University of Defense Technology², University of Waterloo³, University of Maryland, College Park⁴

01 Jan 2019-International Journal of Computer Vision

TL;DR: More than 250 major publications are cited in this survey covering different aspects of the research, including benchmark datasets and state-of-the-art results as discussed by the authors, in retrospect of what has been achieved so far and open challenges and directions for future research.

...read moreread less

Abstract: Texture is a fundamental characteristic of many types of images, and texture representation is one of the essential and challenging problems in computer vision and pattern recognition which has attracted extensive research attention over several decades. Since 2000, texture representations based on Bag of Words and on Convolutional Neural Networks have been extensively studied with impressive performance. Given this period of remarkable evolution, this paper aims to present a comprehensive survey of advances in texture representation over the last two decades. More than 250 major publications are cited in this survey covering different aspects of the research, including benchmark datasets and state of the art results. In retrospect of what has been achieved so far, the survey discusses open challenges and directions for future research.

...read moreread less

284 citations

Journal Article•DOI•

Deep Affect Prediction in-the-Wild: Aff-Wild Database and Challenge, Deep Architectures, and Beyond

[...]

Dimitrios Kollias, Panagiotis Tzirakis, Mihalis A. Nicolaou¹, Athanasios Papaioannou, Guoying Zhao², Björn Schuller, Irene Kotsia³, Stefanos Zafeiriou² - Show less +4 more•Institutions (3)

Goldsmiths, University of London¹, University of Oulu², University of London³

13 Feb 2019-International Journal of Computer Vision

TL;DR: In this article, an end-to-end deep neural architecture was proposed for predicting continuous emotion dimensions based on visual cues. But the performance of the proposed architecture was not as good as the state-of-the-art.

...read moreread less

Abstract: Automatic understanding of human affect using visual signals is of great importance in everyday human–machine interactions. Appraising human emotional states, behaviors and reactions displayed in real-world settings, can be accomplished using latent continuous dimensions (e.g., the circumplex model of affect). Valence (i.e., how positive or negative is an emotion) and arousal (i.e., power of the activation of the emotion) constitute popular and effective representations for affect. Nevertheless, the majority of collected datasets this far, although containing naturalistic emotional states, have been captured in highly controlled recording conditions. In this paper, we introduce the Aff-Wild benchmark for training and evaluating affect recognition algorithms. We also report on the results of the First Affect-in-the-wild Challenge (Aff-Wild Challenge) that was recently organized in conjunction with CVPR 2017 on the Aff-Wild database, and was the first ever challenge on the estimation of valence and arousal in-the-wild. Furthermore, we design and extensively train an end-to-end deep neural architecture which performs prediction of continuous emotion dimensions based on visual cues. The proposed deep learning architecture, AffWildNet, includes convolutional and recurrent neural network layers, exploiting the invariant properties of convolutional features, while also modeling temporal dynamics that arise in human behavior via the recurrent layers. The AffWildNet produced state-of-the-art results on the Aff-Wild Challenge. We then exploit the AffWild database for learning features, which can be used as priors for achieving best performances both for dimensional, as well as categorical emotion recognition, using the RECOLA, AFEW-VA and EmotiW 2017 datasets, compared to all other methods designed for the same goal. The database and emotion recognition models are available at http://ibug.doc.ic.ac.uk/resources/first-affect-wild-challenge .

...read moreread less

283 citations

Journal Article•DOI•

Facial Landmark Detection: A Literature Survey

[...]

Yue Wu¹, Qiang Ji¹•Institutions (1)

Rensselaer Polytechnic Institute¹

01 Feb 2019-International Journal of Computer Vision

TL;DR: This paper performs an extensive review of the facial landmark detection algorithms and identifies future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection “in-the-wild”.

...read moreread less

Abstract: The locations of the fiducial facial landmark points around facial components and facial contour capture the rigid and non-rigid facial deformations due to head movements and facial expressions. They are hence important for various facial analysis tasks. Many facial landmark detection algorithms have been developed to automatically detect those key points over the years, and in this paper, we perform an extensive review of them. We classify the facial landmark detection algorithms into three major categories: holistic methods, Constrained Local Model (CLM) methods, and the regression-based methods. They differ in the ways to utilize the facial appearance and shape information. The holistic methods explicitly build models to represent the global facial appearance and shape information. The CLMs explicitly leverage the global shape model but build the local appearance models. The regression based methods implicitly capture facial shape and appearance information. For algorithms within each category, we discuss their underlying theories as well as their differences. We also compare their performances on both controlled and in the wild benchmark datasets, under varying facial expressions, head poses, and occlusion. Based on the evaluations, we point out their respective strengths and weaknesses. There is also a separate section to review the latest deep learning based algorithms. The survey also includes a listing of the benchmark databases and existing software. Finally, we identify future research directions, including combining methods in different categories to leverage their respective strengths to solve landmark detection "in-the-wild".

...read moreread less

212 citations

Journal Article•DOI•

Leveraging Prior-Knowledge for Weakly Supervised Object Detection Under a Collaborative Self-Paced Curriculum Learning Framework

[...]

Dingwen Zhang¹, Junwei Han¹, Long Zhao¹, Deyu Meng²•Institutions (2)

Northwestern Polytechnical University¹, Xi'an Jiaotong University²

01 Apr 2019-International Journal of Computer Vision

TL;DR: Comprehensive experiments on benchmark datasets demonstrate the superior capacity of the proposed C-SPCL regime and the proposed whole framework as compared with state-of-the-art methods along this research line.

...read moreread less

Abstract: Weakly supervised object detection is an interesting yet challenging research topic in computer vision community, which aims at learning object models to localize and detect the corresponding objects of interest only under the supervision of image-level annotation. For addressing this problem, this paper establishes a novel weakly supervised learning framework to leverage both the instance-level prior-knowledge and the image-level prior-knowledge based on a novel collaborative self-paced curriculum learning (C-SPCL) regime. Under the weak supervision, C-SPCL can leverage helpful prior-knowledge throughout the whole learning process and collaborate the instance-level confidence inference with the image-level confidence inference in a robust way. Comprehensive experiments on benchmark datasets demonstrate the superior capacity of the proposed C-SPCL regime and the proposed whole framework as compared with state-of-the-art methods along this research line.

...read moreread less

161 citations

Journal Article•DOI•

You Said That?: Synthesising Talking Faces from Audio

[...]

Amir Jamaludin¹, Joon Son Chung¹, Andrew Zisserman¹•Institutions (1)

University of Oxford¹

01 Dec 2019-International Journal of Computer Vision

TL;DR: An encoder–decoder convolutional neural network model is developed that uses a joint embedding of the face and audio to generate synthesised talking face video frames and proposed methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.

...read moreread less

Abstract: We describe a method for generating a video of a talking face. The method takes still images of the target face and an audio speech segment as inputs, and generates a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we develop an encoder–decoder convolutional neural network (CNN) model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on unlabelled videos using cross-modal self-supervision. We also propose methods to re-dub videos by visually blending the generated face into the source video frame using a multi-stream CNN model.

...read moreread less

139 citations

Journal Article•DOI•

Context-Based Path Prediction for Targets with Switching Dynamics

[...]

Julian F. P. Kooij¹, Fabian Flohr², Ewoud A. I. Pool³, Dariu M. Gavrila¹, Dariu M. Gavrila³ - Show less +1 more•Institutions (3)

Delft University of Technology¹, Daimler AG², University of Amsterdam³

01 Mar 2019-International Journal of Computer Vision

TL;DR: This work proposes to extract various types of cues with computer vision to provide context on the target’s behavior, and incorporate these in a Dynamic Bayesian Network (DBN), which extends the SLDS by conditioning the mode transition probabilities on additional context states.

...read moreread less

Abstract: Anticipating future situations from streaming sensor data is a key perception challenge for mobile robotics and automated vehicles. We address the problem of predicting the path of objects with multiple dynamic modes. The dynamics of such targets can be described by a Switching Linear Dynamical System (SLDS). However, predictions from this probabilistic model cannot anticipate when a change in dynamic mode will occur. We propose to extract various types of cues with computer vision to provide context on the target’s behavior, and incorporate these in a Dynamic Bayesian Network (DBN). The DBN extends the SLDS by conditioning the mode transition probabilities on additional context states. We describe efficient online inference in this DBN for probabilistic path prediction, accounting for uncertainty in both measurements and target behavior. Our approach is illustrated on two scenarios in the Intelligent Vehicles domain concerning pedestrians and cyclists, so-called Vulnerable Road Users (VRUs). Here, context cues include the static environment of the VRU, its dynamic environment, and its observed actions. Experiments using stereo vision data from a moving vehicle demonstrate that the proposed approach results in more accurate path prediction than SLDS at the relevant short time horizon (1 s). It slightly outperforms a computationally more demanding state-of-the-art method.

...read moreread less

132 citations

Journal Article•DOI•

Lucid Data Dreaming for Video Object Segmentation

[...]

Anna Khoreva¹, Rodrigo Benenson², Eddy Ilg³, Thomas Brox³, Bernt Schiele¹ - Show less +1 more•Institutions (3)

Max Planck Society¹, Google², University of Freiburg³

15 Mar 2019-International Journal of Computer Vision

TL;DR: The results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective.

...read moreread less

Abstract: Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k–100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using $$20\,\times $$–$$1000\,\times $$ less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize—“lucid dream” (in a lucid dream the sleeper is aware that he or she is dreaming and is sometimes able to control the course of the dream)—plausible future video frames. In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general “objectness” knowledge are required for the video object segmentation task.

...read moreread less

Journal Article•DOI•

Detecting and Mitigating Adversarial Perturbations for Robust Face Recognition

[...]

Gaurav Goswami¹, Akshay Agarwal¹, Nalini K. Ratha², Richa Singh¹, Mayank Vatsa¹ - Show less +1 more•Institutions (2)

Indraprastha Institute of Information Technology¹, IBM²

01 Jun 2019-International Journal of Computer Vision

TL;DR: This paper attempts to unravel three aspects related to the robustness of DNNs for face recognition in terms of vulnerabilities to attacks, detecting the singularities by characterizing abnormal filter response behavior in the hidden layers of deep networks; and making corrections to the processing pipeline to alleviate the problem.

...read moreread less

Abstract: Deep neural network (DNN) architecture based models have high expressive power and learning capacity. However, they are essentially a black box method since it is not easy to mathematically formulate the functions that are learned within its many layers of representation. Realizing this, many researchers have started to design methods to exploit the drawbacks of deep learning based algorithms questioning their robustness and exposing their singularities. In this paper, we attempt to unravel three aspects related to the robustness of DNNs for face recognition: (i) assessing the impact of deep architectures for face recognition in terms of vulnerabilities to attacks, (ii) detecting the singularities by characterizing abnormal filter response behavior in the hidden layers of deep networks; and (iii) making corrections to the processing pipeline to alleviate the problem. Our experimental evaluation using multiple open-source DNN-based face recognition networks, and three publicly available face databases demonstrates that the performance of deep learning based face recognition algorithms can suffer greatly in the presence of such distortions. We also evaluate the proposed approaches on four existing quasi-imperceptible distortions: DeepFool, Universal adversarial perturbations, $$l_2$$ , and Elastic-Net (EAD). The proposed method is able to detect both types of attacks with very high accuracy by suitably designing a classifier using the response of the hidden layers in the network. Finally, we present effective countermeasures to mitigate the impact of adversarial attacks and improve the overall robustness of DNN-based face recognition.

...read moreread less

Journal Article•DOI•

Blind Image Deblurring via Deep Discriminative Priors

[...]

Lerenhan Li¹, Lerenhan Li², Jinshan Pan³, Wei-Sheng Lai², Changxin Gao¹, Nong Sang¹, Ming-Hsuan Yang² - Show less +3 more•Institutions (3)

Huazhong University of Science and Technology¹, University of California, Merced², Nanjing University of Science and Technology³

01 Aug 2019-International Journal of Computer Vision

TL;DR: This work forms the image prior as a binary classifier using a deep convolutional neural network to handle image dehazing and develops an efficient numerical approach based on the half-quadratic splitting method and gradient descent algorithm to optimize the proposed model.

...read moreread less

Abstract: We present an effective blind image deblurring method based on a data-driven discriminative prior. Our work is motivated by the fact that a good image prior should favor sharp images over blurred ones. In this work, we formulate the image prior as a binary classifier using a deep convolutional neural network. The learned prior is able to distinguish whether an input image is sharp or not. Embedded into the maximum a posterior framework, it helps blind deblurring in various scenarios, including natural, face, text, and low-illumination images, as well as non-uniform deblurring. However, it is difficult to optimize the deblurring method with the learned image prior as it involves a non-linear neural network. In this work, we develop an efficient numerical approach based on the half-quadratic splitting method and gradient descent algorithm to optimize the proposed model. Furthermore, we extend the proposed model to handle image dehazing. Both qualitative and quantitative experimental results show that our method performs favorably against the state-of-the-art algorithms as well as domain-specific image deblurring approaches.

...read moreread less

Journal Article•DOI•

Zoom Out-and-In Network with Map Attention Decision for Region Proposal and Object Detection

[...]

Hongyang Li¹, Yu Liu¹, Wanli Ouyang², Xiaogang Wang¹•Institutions (2)

The Chinese University of Hong Kong¹, University of Sydney²

01 Mar 2019-International Journal of Computer Vision

TL;DR: Experimental results on three datasets demonstrate the effectiveness of the proposed zoom-out-and-in network over other state-of-the-arts, in terms of average recall for region proposal and average precision for object detection.

...read moreread less

Abstract: In this paper, we propose a zoom-out-and-in network for generating object proposals. A key observation is that it is difficult to classify anchors of different sizes with the same set of features. Anchors of different sizes should be placed accordingly based on different depth within a network: smaller boxes on high-resolution layers with a smaller stride while larger boxes on low-resolution counterparts with a larger stride. Inspired by the conv/deconv structure, we fully leverage the low-level local details and high-level regional semantics from two feature map streams, which are complimentary to each other, to identify the objectness in an image. A map attention decision (MAD) unit is further proposed to aggressively search for neuron activations among two streams and attend the most contributive ones on the feature learning of the final loss. The unit serves as a decision-maker to adaptively activate maps along certain channels with the solely purpose of optimizing the overall training loss. One advantage of MAD is that the learned weights enforced on each feature channel is predicted on-the-fly based on the input context, which is more suitable than the fixed enforcement of a convolutional kernel. Experimental results on three datasets demonstrate the effectiveness of our proposed algorithm over other state-of-the-arts, in terms of average recall for region proposal and average precision for object detection.

...read moreread less

Journal Article•DOI•

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

[...]

Yash Goyal¹, Tejas Khot², Aishwarya Agrawal¹, Douglas Summers-Stay³, Dhruv Batra¹, Dhruv Batra⁴, Devi Parikh¹, Devi Parikh⁴ - Show less +4 more•Institutions (4)

Georgia Institute of Technology¹, Carnegie Mellon University², United States Army Research Laboratory³, Facebook⁴

01 Apr 2019-International Journal of Computer Vision

TL;DR: This work balances the popular VQA dataset by collecting complementary images such that every question in the authors' balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question.

...read moreread less

Abstract: The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. In this context, however, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in VQA models that ignore visual information, leading to an inflated sense of their capability. We propose to counter these language priors for the task of VQA and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset (Antol et al., in: ICCV, 2015) by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the VQA Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. We also present interesting insights from analysis of the participant entries in VQA Challenge 2017, organized by us on the proposed VQA v2.0 dataset. The results of the challenge were announced in the 2nd VQA Challenge Workshop at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.

...read moreread less

Journal Article•DOI•

The Menpo Benchmark for Multi-pose 2D and 3D Facial Landmark Localisation and Tracking

[...]

Jiankang Deng¹, Anastasios Roussos², Grigorios Chrysos¹, Evangelos Ververas¹, Irene Kotsia³, Jie Shen¹, Stefanos Zafeiriou⁴, Stefanos Zafeiriou¹ - Show less +4 more•Institutions (4)

Imperial College London¹, University of Exeter², Middlesex University³, University of Oulu⁴

01 Jun 2019-International Journal of Computer Vision

TL;DR: An elaborate semi-automatic methodology is introduced for providing high-quality annotations for both the Menpo 2D and Menpo 3D benchmarks, two new datasets for multi-pose 2d and 3D facial landmark localisation and tracking.

...read moreread less

Abstract: In this article, we present the Menpo 2D and Menpo 3D benchmarks, two new datasets for multi-pose 2D and 3D facial landmark localisation and tracking. In contrast to the previous benchmarks such as 300W and 300VW, the proposed benchmarks contain facial images in both semi-frontal and profile pose. We introduce an elaborate semi-automatic methodology for providing high-quality annotations for both the Menpo 2D and Menpo 3D benchmarks. In Menpo 2D benchmark, different visible landmark configurations are designed for semi-frontal and profile faces, thus making the 2D face alignment full-pose. In Menpo 3D benchmark, a united landmark configuration is designed for both semi-frontal and profile faces based on the correspondence with a 3D face model, thus making face alignment not only full-pose but also corresponding to the real-world 3D space. Based on the considerable number of annotated images, we organised Menpo 2D Challenge and Menpo 3D Challenge for face alignment under large pose variations in conjunction with CVPR 2017 and ICCV 2017, respectively. The results of these challenges demonstrate that recent deep learning architectures, when trained with the abundant data, lead to excellent results. We also provide a very simple, yet effective solution, named Cascade Multi-view Hourglass Model, to 2D and 3D face alignment. In our method, we take advantage of all 2D and 3D facial landmark annotations in a joint way. We not only capitalise on the correspondences between the semi-frontal and profile 2D facial landmarks but also employ joint supervision from both 2D and 3D facial landmarks. Finally, we discuss future directions on the topic of face alignment.

...read moreread less

Journal Article•DOI•

Synthesis of High-Quality Visible Faces from Polarimetric Thermal Faces using Generative Adversarial Networks

[...]

He Zhang¹, Benjamin S. Riggan², Shuowen Hu², Nathaniel J. Short³, Vishal M. Patel⁴ - Show less +1 more•Institutions (4)

Rutgers University¹, United States Army Research Laboratory², Booz Allen Hamilton³, Johns Hopkins University⁴

01 Jun 2019-International Journal of Computer Vision

TL;DR: Zhang et al. as discussed by the authors proposed a generative adversarial networks based multi-stream feature-level fusion technique to synthesize high-quality visible images from polarimetric thermal images.

...read moreread less

Abstract: The large domain discrepancy between faces captured in polarimetric (or conventional) thermal and visible domains makes cross-domain face verification a highly challenging problem for human examiners as well as computer vision algorithms. Previous approaches utilize either a two-step procedure (visible feature estimation and visible image reconstruction) or an input-level fusion technique, where different Stokes images are concatenated and used as a multi-channel input to synthesize the visible image given the corresponding polarimetric signatures. Although these methods have yielded improvements, we argue that input-level fusion alone may not be sufficient to realize the full potential of the available Stokes images. We propose a generative adversarial networks based multi-stream feature-level fusion technique to synthesize high-quality visible images from polarimetric thermal images. The proposed network consists of a generator sub-network, constructed using an encoder–decoder network based on dense residual blocks, and a multi-scale discriminator sub-network. The generator network is trained by optimizing an adversarial loss in addition to a perceptual loss and an identity preserving loss to enable photo realistic generation of visible images while preserving discriminative characteristics. An extended dataset consisting of polarimetric thermal facial signatures of 111 subjects is also introduced. Multiple experiments evaluated on different experimental protocols demonstrate that the proposed method achieves state-of-the-art performance. Code will be made available at https://github.com/hezhangsprinter .

...read moreread less

Journal Article•DOI•

Learning to Segment Moving Objects

[...]

Pavel Tokmakov¹, Cordelia Schmid², Karteek Alahari²•Institutions (2)

Carnegie Mellon University¹, University of Grenoble²

01 Mar 2019-International Journal of Computer Vision

TL;DR: In this article, a two-stream neural network with an explicit memory module is proposed to segment moving objects in unconstrained videos, where appearance and motion cues are encoded in a video sequence respectively, while the memory module captures the evolution of objects over time, exploiting the temporal consistency.

...read moreread less

Abstract: We study the problem of segmenting moving objects in unconstrained videos. Given a video, the task is to segment all the objects that exhibit independent motion in at least one frame. We formulate this as a learning problem and design our framework with three cues: (i) independent object motion between a pair of frames, which complements object recognition, (ii) object appearance, which helps to correct errors in motion estimation, and (iii) temporal consistency, which imposes additional constraints on the segmentation. The framework is a two-stream neural network with an explicit memory module. The two streams encode appearance and motion cues in a video sequence respectively , while the memory module captures the evolution of objects over time, exploiting the temporal consistency. The motion stream is a convolutional neural network trained on synthetic videos to segment independently moving objects in the optical flow field. The module to build a 'visual memory' in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. For every pixel in a frame of a test video, our approach assigns an object or background label based on the learned spatio-temporal features as well as the 'visual memory' specific to the video. We evaluate our method extensively on three benchmarks, DAVIS, Freiburg-Berkeley motion seg-mentation dataset and SegTrack. In addition, we provide an extensive ablation study to investigate both the choice of the training data and the influence of each component in the proposed framework.

...read moreread less

Journal Article•DOI•

Blended Emotion in-the-Wild: Multi-label Facial Expression Recognition Using Crowdsourced Annotations and Deep Locality Feature Learning

[...]

Shan Li¹, Weihong Deng¹•Institutions (1)

Beijing University of Posts and Telecommunications¹

01 Jun 2019-International Journal of Computer Vision

TL;DR: A new deep manifold learning network is proposed, called Deep Bi-Manifold CNN, to learn the discriminative feature for multi-label expressions by jointly preserving the local affinity of deep features and the manifold structures of emotion labels.

...read moreread less

Abstract: Comprehending different categories of facial expressions plays a great role in the design of computational model analyzing human perceived and affective state. Authoritative studies have revealed that facial expressions in human daily life are in multiple or co-occurring mental states. However, due to the lack of valid datasets, most previous studies are still restricted to basic emotions with single label. In this paper, we present a novel multi-label facial expression database, RAF-ML, along with a new deep learning algorithm, to address this problem. Specifically, a crowdsourcing annotation of 1.2 million labels from 315 participants was implemented to identify the multi-label expressions collected from social network, then EM algorithm was designed to filter out unreliable labels. For all we know, RAF-ML is the first database in the wild that provides with crowdsourced cognition for multi-label expressions. Focusing on the ambiguity and continuity of blended expressions, we propose a new deep manifold learning network, called Deep Bi-Manifold CNN, to learn the discriminative feature for multi-label expressions by jointly preserving the local affinity of deep features and the manifold structures of emotion labels. Furthermore, a deep domain adaption method is leveraged to extend the deep manifold features learned from RAF-ML to other expression databases under various imaging conditions and cultures. Extensive experiments on the RAF-ML and other diverse databases (JAFFE, CK$$+$$+, SFEW and MMI) show that the deep manifold feature is not only superior in multi-label expression recognition in the wild, but also captures the elemental and generic components that are effective for a wide range of expression recognition tasks.

...read moreread less

Journal Article•DOI•

The Devil is in the Decoder: Classification, Regression and GANs

[...]

Zbigniew Wojna¹, Vittorio Ferrari², Sergio Guadarrama², Nathan Silberman², Liang-Chieh Chen², Alireza Fathi², Jasper Uijlings² - Show less +3 more•Institutions (2)

University College London¹, Google²

01 Dec 2019-International Journal of Computer Vision

TL;DR: This paper presents an extensive comparison of a variety of decoders for a range of pixel-wise tasks ranging from classification, regression to synthesis and introduces new residual-like connections for decoder.

...read moreread less

Abstract: Many machine vision applications, such as semantic segmentation and depth prediction, require predictions for every pixel of the input image. Models for such problems usually consist of encoders which decrease spatial resolution while learning a high-dimensional representation, followed by decoders who recover the original input resolution and result in low-dimensional predictions. While encoders have been studied rigorously, relatively few studies address the decoder side. This paper presents an extensive comparison of a variety of decoders for a variety of pixel-wise tasks ranging from classification, regression to synthesis. Our contributions are: (1) decoders matter: we observe significant variance in results between different types of decoders on various problems. (2) We introduce new residual-like connections for decoders. (3) We introduce a novel decoder: bilinear additive upsampling. (4) We explore prediction artifacts.

...read moreread less

Journal Article•DOI•

A Comprehensive Study on Center Loss for Deep Face Recognition

[...]

Yandong Wen¹, Kaipeng Zhang¹, Zhifeng Li², Yu Qiao¹•Institutions (2)

Chinese Academy of Sciences¹, Tencent²

01 Jun 2019-International Journal of Computer Vision

TL;DR: This paper addresses the open-set property of face recognition by developing the center loss, which simultaneously learns a center for each class, and penalizes the distances between the deep features of the face images and their corresponding class centers.

...read moreread less

Abstract: Deep convolutional neural networks (CNNs) trained with the softmax loss have achieved remarkable successes in a number of close-set recognition problems, e.g. object recognition, action recognition, etc. Unlike these close-set tasks, face recognition is an open-set problem where the testing classes (persons) are usually different from those in training. This paper addresses the open-set property of face recognition by developing the center loss. Specifically, the center loss simultaneously learns a center for each class, and penalizes the distances between the deep features of the face images and their corresponding class centers. Training with the center loss enables CNNs to extract the deep features with two desirable properties: inter-class separability and intra-class compactness. In addition, we extend the center loss in two aspects. First, we adopt parameter sharing between the softmax loss and the center loss, to reduce the extra parameters introduced by centers. Second, we generalize the concept of center from a single point to a region in embedding space, which further allows us to account for intra-class variations. The advanced center loss significantly enhances the discriminative power of deep features. Experimental results show that our method achieves high accuracies on several important face recognition benchmarks, including Labeled Faces in the Wild, YouTube Faces, IJB-A Janus, and MegaFace Challenging 1.

...read moreread less

Journal Article•DOI•

Face-Specific Data Augmentation for Unconstrained Face Recognition

[...]

Iacopo Masi¹, Anh Tuấn Trần, Tal Hassner², Gozde Sahin, Gerard Medioni - Show less +1 more•Institutions (2)

Information Sciences Institute¹, Open University of Israel²

01 Jun 2019-International Journal of Computer Vision

TL;DR: A highly effective face recognition pipeline which, at the time of submission, obtains state-of-the-art results across multiple benchmarks is described.

...read moreread less

Abstract: We identify two issues as key to developing effective face recognition systems: maximizing the appearance variations of training images and minimizing appearance variations in test images. The former is required to train the system for whatever appearance variations it will ultimately encounter and is often addressed by collecting massive training sets with millions of face images. The latter involves various forms of appearance normalization for removing distracting nuisance factors at test time and making test faces easier to compare. We describe novel, efficient face-specific data augmentation techniques and show them to be ideally suited for both purposes. By using knowledge of faces, their 3D shapes, and appearances, we show the following: (a) We can artificially enrich training data for face recognition with face-specific appearance variations. (b) This synthetic training data can be efficiently produced online, thereby reducing the massive storage requirements of large-scale training sets and simplifying training for many appearance variations. Finally, (c) The same, fast data augmentation techniques can be applied at test time to reduce appearance variations and improve face representations. Together, with additional technical novelties, we describe a highly effective face recognition pipeline which, at the time of submission, obtains state-of-the-art results across multiple benchmarks. Portions of this paper were previously published by Masi et al. (European conference on computer vision, Springer, pp 579–596, 2016b, International conference on automatic face and gesture recognition, 2017).

...read moreread less

Journal Article•DOI•

Wavelet Domain Generative Adversarial Network for Multi-scale Face Hallucination

[...]

Huaibo Huang, Ran He, Zhenan Sun, Tieniu Tan

01 Jun 2019-International Journal of Computer Vision

TL;DR: A wavelet-domain generative adversarial method that can ultra-resolve a very low-resolution face image to its larger version of multiple upscaling factors in a unified framework and achieves more appealing results both quantitatively and qualitatively than state-of-the-art face hallucination methods.

...read moreread less

Abstract: Most modern face hallucination methods resort to convolutional neural networks (CNN) to infer high-resolution (HR) face images. However, when dealing with very low-resolution (LR) images, these CNN based methods tend to produce over-smoothed outputs. To address this challenge, this paper proposes a wavelet-domain generative adversarial method that can ultra-resolve a very low-resolution (like $$16\times 16$$ or even $$8\times 8$$ ) face image to its larger version of multiple upscaling factors ( $$2\times $$ to $$16\times $$ ) in a unified framework. Different from the most existing studies that hallucinate faces in image pixel domain, our method firstly learns to predict the wavelet information of HR face images from its corresponding LR inputs before image-level super-resolution. To capture both global topology information and local texture details of human faces, a flexible and extensible generative adversarial network is designed with three types of losses: (1) wavelet reconstruction loss aims to push wavelets closer with the ground-truth; (2) wavelet adversarial loss aims to generate realistic wavelets; (3) identity preserving loss aims to help identity information recovery. Extensive experiments demonstrate that the presented approach not only achieves more appealing results both quantitatively and qualitatively than state-of-the-art face hallucination methods, but also can significantly improve identification accuracy for low-resolution face images captured in the wild.

...read moreread less

Journal Article•DOI•

Which and How Many Regions to Gaze: Focus Discriminative Regions for Fine-Grained Visual Categorization

[...]

Xiangteng He¹, Yuxin Peng¹, Junjie Zhao¹•Institutions (1)

Peking University¹

01 Sep 2019-International Journal of Computer Vision

TL;DR: Compared with state-of-the-art methods on two widely-used fine-grained visual categorization datasets, the M2DRL approach achieves the best categorization accuracy.

...read moreread less

Abstract: Fine-grained visual categorization (FGVC) aims to discriminate similar subcategories that belong to the same superclass. Since the distinctions among similar subcategories are quite subtle and local, it is highly challenging to distinguish them from each other even for humans. So the localization of distinctions is essential for fine-grained visual categorization, and there are two pivotal problems: (1) Which regions are discriminative and representative to distinguish from other subcategories? (2) How many discriminative regions are necessary to achieve the best categorization performance? It is still difficult to address these two problems adaptively and intelligently. Artificial prior and experimental validation are widely used in existing mainstream methods to discover which and how many regions to gaze. However, their applications extremely restrict the usability and scalability of the methods. To address the above two problems, this paper proposes a multi-scale and multi-granularity deep reinforcement learning approach (M2DRL), which learns multi-granularity discriminative region attention and multi-scale region-based feature representation. Its main contributions are as follows: (1) Multi-granularity discriminative localization is proposed to localize the distinctions via a two-stage deep reinforcement learning approach, which discovers the discriminative regions with multiple granularities in a hierarchical manner (“which problem”), and determines the number of discriminative regions in an automatic and adaptive manner (“how many problem”). (2) Multi-scale representation learning helps to localize regions in different scales as well as encode images in different scales, boosting the fine-grained visual categorization performance. (3) Semantic reward function is proposed to drive M2DRL to fully capture the salient and conceptual visual information, via jointly considering attention and category information in the reward function. It allows the deep reinforcement learning to localize the distinctions in a weakly supervised manner or even an unsupervised manner. (4) Unsupervised discriminative localization is further explored to avoid the heavy labor consumption of annotating, and extremely strengthen the usability and scalability of our M2DRL approach. Compared with state-of-the-art methods on two widely-used fine-grained visual categorization datasets, our M2DRL approach achieves the best categorization accuracy.

...read moreread less

Journal Article•DOI•

Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation

[...]

Andrew Gilbert¹, Matthew Trumble¹, Charles Malleson¹, Adrian Hilton¹, John Collomosse¹ - Show less +1 more•Institutions (1)

University of Surrey¹

01 Apr 2019-International Journal of Computer Vision

TL;DR: A multi-channel 3D convolutional neural network is used to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull, yielding improved accuracy over prior methods.

...read moreread less

Abstract: We propose an approach to accurately esti- mate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely we use a multi-channel 3D convolutional neural network to learn a pose em- bedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data and a temporal model (LSTM) exploits the rich spatial and temporal long range dependencies among the solved joints, the two streams are then fused in a final fully connected layer. The two complemen- tary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accu- racy over prior methods. Extensive evaluation is per- formed with state of the art performance reported on the popular Human 3.6M dataset [26], the newly re- leased TotalCapture dataset and a challenging set of outdoor videos TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture) comprising of multi- viewpoint video, IMU and accurate 3D skele- tal joint ground truth derived from a commercial mo- tion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.

...read moreread less

Journal Article•DOI•

Joint Face Hallucination and Deblurring via Structure Generation and Detail Enhancement

[...]

Yibing Song¹, Jiawei Zhang², Lijun Gong¹, Shengfeng He³, Linchao Bao¹, Jinshan Pan⁴, Qingxiong Yang⁵, Ming-Hsuan Yang⁶ - Show less +4 more•Institutions (6)

Tencent¹, SenseTime², South China University of Technology³, Nanjing University of Science and Technology⁴, University of Science and Technology of China⁵, University of California, Merced⁶

17 Jan 2019-International Journal of Computer Vision

TL;DR: This paper proposes a facial component guided deep Convolutional Neural Network to restore a coarse face image, which is denoted as the base image where the facial component is automatically generated from the input face image.

...read moreread less

Abstract: We address the problem of restoring a high-resolution face image from a blurry low-resolution input. This problem is difficult as super-resolution and deblurring need to be tackled simultaneously. Moreover, existing algorithms cannot handle face images well as low-resolution face images do not have much texture which is especially critical for deblurring. In this paper, we propose an effective algorithm by utilizing the domain-specific knowledge of human faces to recover high-quality faces. We first propose a facial component guided deep Convolutional Neural Network (CNN) to restore a coarse face image, which is denoted as the base image where the facial component is automatically generated from the input face image. However, the CNN based method cannot handle image details well. We further develop a novel exemplar-based detail enhancement algorithm via facial component matching. Extensive experiments show that the proposed method outperforms the state-of-the-art algorithms both quantitatively and qualitatively.

...read moreread less

Journal Article•DOI•

Deep, Landmark-Free FAME: Face Alignment, Modeling, and Expression Estimation

[...]

Feng-Ju Chang, Anh Tuan Tran, Tal Hassner¹, Iacopo Masi², Ram Nevatia, Gerard Medioni - Show less +2 more•Institutions (2)

Open University of Israel¹, Information Sciences Institute²

13 Feb 2019-International Journal of Computer Vision

TL;DR: A novel method for modeling 3D face shape, viewpoint, and expression from a single, unconstrained photo is presented and it is shown how accurate landmarks can be obtained as a by-product of the modeling process.

...read moreread less

Abstract: We present a novel method for modeling 3D face shape, viewpoint, and expression from a single, unconstrained photo. Our method uses three deep convolutional neural networks to estimate each of these components separately. Importantly, unlike others, our method does not use facial landmark detection at test time; instead, it estimates these properties directly from image intensities. In fact, rather than using detectors, we show how accurate landmarks can be obtained as a by-product of our modeling process. We rigorously test our proposed method. To this end, we raise a number of concerns with existing practices used in evaluating face landmark detection methods. In response to these concerns, we propose novel paradigms for testing the effectiveness of rigid and non-rigid face alignment methods without relying on landmark detection benchmarks. We evaluate rigid face alignment by measuring its effects on face recognition accuracy on the challenging IJB-A and IJB-B benchmarks. Non-rigid, expression estimation is tested on the CK+ and EmotiW’17 benchmarks for emotion classification. We do, however, report the accuracy of our approach as a landmark detector for 3D landmarks on AFLW2000-3D and 2D landmarks on 300W and AFLW-PIFA. A surprising conclusion of these results is that better landmark detection accuracy does not necessarily translate to better face processing. Parts of this paper were previously published by Tran et al. (2017) and Chang et al. (2017, 2018).

...read moreread less

Journal Article•DOI•

Understanding Image Representations by Measuring Their Equivariance and Equivalence

[...]

Karel Lenc¹, Andrea Vedaldi¹•Institutions (1)

University of Oxford¹

01 Jan 2019-International Journal of Computer Vision

TL;DR: This work investigates two key mathematical properties of representations: equivariance and equivalence and identifies several predictors of geometric and architectural compatibility, including the spatial resolution of the representation and the complexity and depth of the models.

...read moreread less

Abstract: Despite the importance of image representations such as histograms of oriented gradients and deep Convolutional Neural Networks (CNN), our theoretical understanding of them remains limited. Aimed at filling this gap, we investigate two key mathematical properties of representations: equivariance and equivalence. Equivariance studies how transformations of the input image are encoded by the representation, invariance being a special case where a transformation has no effect. Equivalence studies whether two representations, for example two different parameterizations of a CNN, two different layers, or two different CNN architectures, share the same visual information or not. A number of methods to establish these properties empirically are proposed, including introducing transformation and stitching layers in CNNs. These methods are then applied to popular representations to reveal insightful aspects of their structure, including clarifying at which layers in a CNN certain geometric invariances are achieved and how various CNN architectures differ. We identify several predictors of geometric and architectural compatibility, including the spatial resolution of the representation and the complexity and depth of the models. While the focus of the paper is theoretical, direct applications to structured-output regression are demonstrated too.

...read moreread less

Journal Article•DOI•

Large-Scale Bisample Learning on ID Versus Spot Face Recognition

[...]

Xiangyu Zhu¹, Hao Liu¹, Zhen Lei¹, Hailin Shi¹, Fan Yang², Dong Yi³, Guo-Jun Qi⁴, Stan Z. Li¹ - Show less +4 more•Institutions (4)

Chinese Academy of Sciences¹, Beihang University², Alibaba Group³, Huawei⁴

01 Jun 2019-International Journal of Computer Vision

TL;DR: Wang et al. as mentioned in this paper proposed a deep learning based large-scale bisample learning (LBL) method for ID versus Spot (IvS) face recognition, where a classification-verification-classification training strategy is proposed to progressively enhance the IvS performance.

...read moreread less

Abstract: In real-world face recognition applications, there is a tremendous amount of data with two images for each person. One is an ID photo for face enrollment, and the other is a probe photo captured on spot. Most existing methods are designed for training data with limited breadth (a relatively small number of classes) and sufficient depth (many samples for each class). They would meet great challenges on ID versus Spot (IvS) data, including the under-represented intra-class variations and an excessive demand on computing devices. In this paper, we propose a deep learning based large-scale bisample learning (LBL) method for IvS face recognition. To tackle the bisample problem with only two samples for each class, a classification–verification–classification training strategy is proposed to progressively enhance the IvS performance. Besides, a dominant prototype softmax is incorporated to make the deep learning scalable on large-scale classes. We conduct LBL on a IvS face dataset with more than two million identities. Experimental results show the proposed method achieves superior performance to previous ones, validating the effectiveness of LBL on IvS face recognition.

...read moreread less

Journal Article•DOI•

Unsupervised Binary Representation Learning with Deep Variational Networks

[...]

Yuming Shen, Li Liu, Ling Shao

01 Dec 2019-International Journal of Computer Vision

TL;DR: The conditional auto-encoding variational Bayesian networks are introduced in this work to exploit the feature space structure of the training data using the latent variables, and the proposed DVB model estimates the statistics of data representations, and thus produces compact binary codes.

...read moreread less

Abstract: Learning to hash is regarded as an efficient approach for image retrieval and many other big-data applications. Recently, deep learning frameworks are adopted for image hashing, suggesting an alternative way to formulate the encoding function other than the conventional projections. Although deep learning has been proved to be successful in supervised hashing, existing unsupervised deep hashing techniques still cannot produce leading performance compared with the non-deep methods, as it is hard to unveil the intrinsic structure of the whole sample space by simply regularizing the output codes within each single training batch. To tackle this problem, in this paper, we propose a novel unsupervised deep hashing model, named deep variational binaries (DVB). The conditional auto-encoding variational Bayesian networks are introduced in this work to exploit the feature space structure of the training data using the latent variables. Integrating the probabilistic inference process with hashing objectives, the proposed DVB model estimates the statistics of data representations, and thus produces compact binary codes. Experimental results on three benchmark datasets, i.e., CIFAR-10, SUN-397 and NUS-WIDE, demonstrate that DVB outperforms state-of-the-art unsupervised hashing methods with significant margins.

...read moreread less