
Showing papers by "Stan Z. Li" published in 2021


Journal ArticleDOI
TL;DR: This work proposes the first FAS method based on neural architecture search (NAS), called NAS-FAS, to discover well-suited task-aware networks, and develops a novel search space consisting of central difference convolution and pooling operators.
Abstract: Face anti-spoofing (FAS) plays a vital role in securing face recognition systems. Existing methods rely heavily on expert-designed networks, which may lead to sub-optimal solutions for the FAS task. Here we propose the first FAS method based on neural architecture search (NAS), called NAS-FAS, to discover well-suited task-aware networks. Unlike previous NAS works, which mainly focus on developing efficient search strategies for generic object classification, we pay more attention to the search spaces for the FAS task. The challenges of utilizing NAS for FAS are twofold: networks searched on 1) a specific acquisition condition might perform poorly under unseen conditions, and 2) particular spoofing attacks might generalize badly to unseen attacks. To overcome these two issues, we develop a novel search space consisting of central difference convolution and pooling operators. Moreover, an efficient static-dynamic representation is exploited to fully mine the FAS-aware spatio-temporal discrepancy. Besides, we propose Domain/Type-aware Meta-NAS, which leverages cross-domain/type knowledge for robust searching. Finally, in order to evaluate NAS transferability across datasets and unknown attack types, we release a large-scale 3D mask dataset, namely CASIA-SURF 3DMask, to support the new 'cross-dataset cross-type' testing protocol. Experiments demonstrate that the proposed NAS-FAS achieves state-of-the-art performance on nine FAS benchmark datasets with four testing protocols.

109 citations
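For readers unfamiliar with the central difference convolution (CDC) operator at the heart of the search space, here is a minimal PyTorch sketch of the idea, based on the formulation popularized by the authors' earlier CDCN work; the module name and the default theta are illustrative, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv2d(nn.Module):
    """Sketch of central difference convolution (CDC): blends a vanilla
    convolution with a central-difference term, y = conv(x) - theta * s(x),
    where s(x) convolves x with the spatial sum of the kernel weights.
    This is algebraically equivalent to convolving the local differences
    x(p_n) - x(p_0) and mixing them with the vanilla response."""

    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=padding, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)  # vanilla convolution response
        if self.theta == 0.0:
            return out
        # Summing the kernel over its spatial extent yields a 1x1 kernel;
        # convolving with it computes sum_n w(p_n) * x(p_0) at each location.
        kernel_diff = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_diff = F.conv2d(x, kernel_diff, stride=self.conv.stride, padding=0)
        return out - self.theta * out_diff
```

With theta = 0 the operator reduces to a vanilla convolution, which makes it a natural drop-in candidate operator in a convolutional search space.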


Journal ArticleDOI
TL;DR: A novel single-shot detector, namely RefineDet++, is proposed, which achieves better accuracy than two-stage methods while maintaining efficiency comparable to one-stage methods.
Abstract: Convolutional neural network based methods have dominated object detection in recent years; they can be divided into the one-stage approach and the two-stage approach. In general, the two-stage approach (e.g., Faster R-CNN) achieves high accuracy, while the one-stage approach (e.g., SSD) has the advantage of high efficiency. To inherit the merits of both while overcoming their disadvantages, we propose a novel single-shot based detector, namely RefineDet++, which achieves better accuracy than two-stage methods and maintains efficiency comparable to one-stage methods. The proposed RefineDet++ consists of two inter-connected modules: the anchor refinement module and the alignment detection module. Specifically, the former module aims to (1) filter out negative anchors to reduce the search space for the subsequent classifier, and (2) coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor. The latter module takes (1) the refined anchors as input from the former module and applies (2) a newly designed alignment convolution operation to further improve regression accuracy and predict multi-class labels. Meanwhile, we design a transfer connection block to transfer the features in the anchor refinement module to the detection module for predicting the locations, sizes, and class labels of objects. The multi-task loss function enables us to train the whole network in an end-to-end way. Extensive experiments on PASCAL VOC and MS COCO demonstrate that RefineDet++ achieves state-of-the-art detection accuracy with high efficiency.

60 citations
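As a rough illustration of the anchor refinement module's two roles, the hedged sketch below first applies the standard box-delta parameterization to coarsely adjust anchors and then discards anchors whose predicted background confidence is high; the threshold value and tensor layout are assumptions for illustration, not the paper's exact settings.

```python
import torch

def refine_anchors(anchors, deltas):
    """Coarsely adjust anchor centers/sizes with predicted offsets
    (standard box-delta parameterization, assumed here).
    anchors, deltas: (N, 4) tensors in (cx, cy, w, h) layout."""
    ax, ay, aw, ah = anchors.unbind(dim=1)
    dx, dy, dw, dh = deltas.unbind(dim=1)
    cx, cy = ax + dx * aw, ay + dy * ah
    w, h = aw * dw.exp(), ah * dh.exp()
    return torch.stack([cx, cy, w, h], dim=1)

def filter_negative_anchors(refined_anchors, neg_conf, thresh=0.99):
    """Drop anchors the refinement module is confident are background,
    shrinking the search space for the second-stage classifier."""
    keep = neg_conf < thresh
    return refined_anchors[keep], keep
```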


Journal ArticleDOI
TL;DR: In this article, a Cross-modal Auxiliary (CMA) framework is proposed for face anti-spoofing, which consists of a Modality Translation Network (MT-Net) that closes the visible gap between modalities via a generative model mapping inputs from one modality (i.e., RGB) to another (i.e., NIR), and a Modality Assistance Network (MA-Net).
Abstract: Face Presentation Attack Detection (PAD) approaches based on multi-modal data have attracted increasing attention from the research community. However, they require multi-modal face data to be consistently available in both the training and testing phases. This severely limits their applicability, since most Face Anti-spoofing (FAS) systems are equipped only with Visible (VIS) imaging devices, i.e., RGB cameras. Therefore, how to use another modality (i.e., Near-Infrared (NIR)) to assist the performance improvement of VIS-based PAD is significant for FAS. In this work, we first discuss the large performance gap among different modalities even when the same backbone network is applied. Then, we propose a novel Cross-modal Auxiliary (CMA) framework for the VIS-based FAS task. The main trait of CMA is that performance can be greatly improved with the help of another modality while no additional modality is required in the testing stage. The proposed CMA consists of a Modality Translation Network (MT-Net) and a Modality Assistance Network (MA-Net). The former aims to close the visible gap between different modalities via a generative model that maps inputs from one modality (i.e., RGB) to another (i.e., NIR). The latter focuses on how to use the translated modality (i.e., target modality) and the RGB modality (i.e., source modality) together to train a discriminative PAD model. Extensive experiments demonstrate that the proposed framework pushes the state-of-the-art (SOTA) performance on both multi-modal datasets (i.e., CASIA-SURF, CeFA, and WMCA) and RGB-based datasets (i.e., OULU-NPU and SiW).

50 citations


Journal ArticleDOI
TL;DR: RefineFace as mentioned in this paper is a single-shot refinement face detector consisting of five modules: selective two-step regression, selective two-step classification, scale-aware margin loss, a feature supervision module, and receptive field enhancement.
Abstract: Face detection has achieved significant progress in recent years. However, high-performance face detection remains a very challenging problem, especially when there exist many tiny faces. In this paper, we present a single-shot refinement face detector, namely RefineFace, to achieve high performance. Specifically, it consists of five modules: selective two-step regression (STR), selective two-step classification (STC), scale-aware margin loss (SML), a feature supervision module (FSM) and receptive field enhancement (RFE). To enhance the regression ability for high location accuracy, STR coarsely adjusts the locations and sizes of anchors from high-level detection layers to provide better initialization for the subsequent regressor. To improve the classification ability for high recall efficiency, STC first filters out most simple negatives from low-level detection layers to reduce the search space for the subsequent classifier; then SML is applied to better distinguish faces from background at various scales, and FSM is introduced to let the backbone learn more discriminative features for classification. Besides, RFE is presented to provide more diverse receptive fields to better capture faces in extreme poses. Extensive experiments conducted on WIDER FACE, AFW, PASCAL Face, FDDB and MAFA demonstrate that our method achieves state-of-the-art results and runs at 37.3 FPS with ResNet-18 for VGA-resolution images.

46 citations


Proceedings ArticleDOI
17 Oct 2021
TL;DR: In this paper, the authors propose a co-learning method for learning with noisy labels, where constraints of intrinsic similarity from the self-supervised module and structural similarity from the noisily-supervised module are imposed on a shared feature encoder to regularize the network to maximize agreement between the two.
Abstract: Noisy labels, resulting from mistakes in manual labeling or web data collection for supervised learning, can cause neural networks to overfit the misleading information and degrade generalization performance. Self-supervised learning works in the absence of labels and thus eliminates the negative impact of noisy labels. Motivated by co-training with both a supervised learning view and a self-supervised learning view, we propose a simple yet effective method called Co-learning for learning with noisy labels. Co-learning performs supervised learning and self-supervised learning in a cooperative way. The constraints of intrinsic similarity from the self-supervised module and structural similarity from the noisily-supervised module are imposed on a shared common feature encoder to regularize the network to maximize the agreement between the two constraints. Co-learning is fairly compared with peer methods on corrupted data from benchmark datasets, and extensive results demonstrate that Co-learning is superior to many state-of-the-art approaches.

39 citations
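A hedged sketch of the co-learning setup as the abstract describes it: a shared encoder feeds a noisily-supervised classifier head and a self-supervised projection head, and an agreement term between two augmented views regularizes the shared representation. Head sizes, the agreement formulation, and the loss weight below are illustrative assumptions, not the paper's exact design.

```python
import torch.nn as nn
import torch.nn.functional as F

class CoLearningNet(nn.Module):
    def __init__(self, encoder, feat_dim=512, num_classes=10, proj_dim=128):
        super().__init__()
        self.encoder = encoder                            # shared feature encoder
        self.cls_head = nn.Linear(feat_dim, num_classes)  # noisily-supervised view
        self.proj_head = nn.Linear(feat_dim, proj_dim)    # self-supervised view

    def forward(self, x):
        h = self.encoder(x)
        return self.cls_head(h), F.normalize(self.proj_head(h), dim=1)

def co_learning_loss(logits, z1, z2, noisy_labels, alpha=1.0):
    """logits: classifier output on view 1; z1, z2: projections of two
    augmented views. The second term maximizes agreement between views."""
    sup = F.cross_entropy(logits, noisy_labels)         # structural similarity
    agree = -F.cosine_similarity(z1, z2, dim=1).mean()  # intrinsic similarity
    return sup + alpha * agree
```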


Proceedings ArticleDOI
Ajian Liu, Zichang Tan, Jun Wan, Sergio Escalera, Guodong Guo, Stan Z. Li
01 Jan 2021
TL;DR: The CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) dataset as discussed by the authors is the first dataset with explicit ethnic labels among currently released datasets; the accompanying baseline employs a partially shared fusion strategy to learn complementary information from multiple modalities.
Abstract: The issue of ethnic bias has been shown to affect the performance of face recognition in previous works, yet it remains unstudied in face anti-spoofing. Therefore, in order to study ethnic bias for face anti-spoofing, we introduce the largest CASIA-SURF Cross-ethnicity Face Anti-spoofing (CeFA) dataset, covering 3 ethnicities, 3 modalities, 1,607 subjects, and 2D plus 3D attack types. Five protocols are introduced to measure the effect under varied evaluation conditions, such as cross-ethnicity, unknown spoofs, or both. To the best of our knowledge, CASIA-SURF CeFA is the first dataset with explicit ethnic labels among currently released datasets. We then propose a novel multi-modal fusion method as a strong baseline to alleviate the ethnic bias, which employs a partially shared fusion strategy to learn complementary information from multiple modalities. Extensive experiments have been conducted on the proposed dataset to verify its significance and its generalization capability to other existing datasets, i.e., the CASIA-SURF, OULU-NPU and SiW datasets. The dataset is available at https://sites.google.com/qq.com/face-anti-spoofing/welcome/challengecvpr2020?authuser=0.

39 citations



Journal ArticleDOI
TL;DR: The ChaLearn Face Anti-spoofing Attack Detection Challenge as mentioned in this paper was organized around the CASIA-SURF CeFA dataset to measure ethnic bias in face anti-spoofing.
Abstract: Face anti-spoofing is critical to prevent face recognition systems from a security breach. The biometrics community has achieved impressive progress recently due to the excellent performance of deep neural networks and the availability of large datasets. Although ethnic bias has been verified to severely affect the performance of face recognition systems, it still remains an open research problem in face anti-spoofing. Recently, a multi-ethnic face anti-spoofing dataset, CASIA-SURF CeFA, has been released with the goal of measuring ethnic bias. It is the largest cross-ethnicity face anti-spoofing dataset to date, covering 3 ethnicities, 3 modalities, 1,607 subjects, and 2D plus 3D attack types, and is the first dataset with explicit ethnic labels among recently released face anti-spoofing datasets. We organized the ChaLearn Face Anti-spoofing Attack Detection Challenge, which consists of single-modal (e.g., RGB) and multi-modal (e.g., RGB, Depth, Infrared (IR)) tracks around this novel resource to boost research aiming to alleviate ethnic bias. The two tracks attracted 340 teams in the development stage, and finally 11 and 8 teams submitted their code in the single-modal and multi-modal face anti-spoofing challenges, respectively. All results were verified and re-run by the organizing team and were used for the final ranking. This paper presents an overview of the challenge, including its design, its evaluation protocol and a summary of results. We analyze the top-ranked solutions and draw conclusions derived from the competition. In addition, we outline future work directions.

29 citations


Journal ArticleDOI
01 Dec 2021-PhotoniX
TL;DR: In this article, a systematic view on recent advancements in nanophotonic components designed by intelligence algorithms is presented, manifesting a development trend from performance optimization towards inverse creation of novel designs.
Abstract: Applying intelligence algorithms to conceive nanoscale meta-devices has become a flourishing and extremely active scientific topic over the past few years. Inverse design of functional nanostructures is at the heart of this topic, in which artificial intelligence (AI) furnishes various optimization toolboxes to speed up prototyping of photonic layouts with enhanced performance. In this review, we offer a systematic view on recent advancements in nanophotonic components designed by intelligence algorithms, manifesting a development trend from performance optimization towards inverse creation of novel designs. To illustrate the interplay between the two fields, AI and photonics, we take meta-atom spectral manipulation as a case study to introduce algorithm operational principles, and subsequently review their manifold usages among a set of popular meta-elements. Arranged from the level of individually optimized pieces to that of practical systems, we discuss algorithm-assisted nanophotonic designs and examine their mutual benefits. We further comment on a set of open questions, including reasonable applications of advanced algorithms, the expensive-data issue, and algorithm benchmarking. Overall, we envision that mounting photonic-targeted methodologies will substantially push forward functional artificial meta-devices, to the profit of both fields.

23 citations


Journal ArticleDOI
TL;DR: A practical model for timely severity prediction for COVID-19 is presented, which is freely available at the webserver https://guomics.shinyapps.io/covidAI/.
Abstract: Severity prediction of COVID-19 remains one of the major clinical challenges of the ongoing pandemic. Here, we recruited a cohort of 144 COVID-19 patients, resulting in a data matrix containing 3,065 readings for 124 types of measurements over 52 days. A machine learning model was established to predict disease progression based on the cohort, which consists of training, validation, and internal test sets. A panel of eleven routine clinical factors was used to construct a classifier for COVID-19 severity prediction, achieving an accuracy of over 98% in the discovery set. Validation of the model in an independent cohort containing 25 patients achieved an accuracy of 80%. The overall sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were 0.70, 0.99, 0.93, and 0.93, respectively. Our model captured the predictive dynamics of lactate dehydrogenase (LDH) and creatine kinase (CK) even while their levels were in the normal range. The model is accessible at https://www.guomics.com/covidAI/ for research purposes.

22 citations
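For reference, the four reported metrics follow from a confusion matrix in the standard way; the sketch below uses toy counts chosen only so the formulas roughly reproduce the reported values, not the study's actual data.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value
    npv = tn / (tn + fn)          # negative predictive value
    return sensitivity, specificity, ppv, npv

# Toy counts (illustrative only): sens=0.70, spec=0.99, ppv~0.93, npv~0.94
print(diagnostic_metrics(tp=14, fp=1, tn=99, fn=6))
```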


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Wang et al. as discussed by the authors proposed to utilize facial detail, the combination of direct light and identity texture, as the clue for detecting subtle forgery patterns, and introduced a two-stream structure to exploit both the face image and the facial detail together as a multi-modality task.
Abstract: Detecting digital face manipulation has attracted extensive attention due to fake media's potential harm to the public. However, recent advances have been able to reduce the forgery signals to a low magnitude. Decomposition, which reversibly decomposes an image into several constituent elements, is a promising way to highlight hidden forgery details. In this paper, we consider a face image as the product of the interaction between the underlying 3D geometry and the lighting environment, and decompose it from a computer graphics view. Specifically, by disentangling the face image into 3D shape, common texture, identity texture, ambient light, and direct light, we find the devil lies in the direct light and the identity texture. Based on this observation, we propose to utilize facial detail, the combination of direct light and identity texture, as the clue to detect subtle forgery patterns. Besides, we highlight the manipulated region with a supervised attention mechanism and introduce a two-stream structure to exploit both the face image and the facial detail together as a multi-modality task. Extensive experiments indicate the effectiveness of the extra features extracted from the facial detail, and our method achieves state-of-the-art performance.

Journal ArticleDOI
TL;DR: The Fast Adapting without Forgetting (FAwF) method with three components, margin-based exemplar selection, prototype-based class extension and hard & soft knowledge distillation, is proposed, which maintains source-domain performance with only one sample per source-domain class, greatly reducing fine-tuning time and data storage.
Abstract: Although face recognition has made dramatic improvements in recent years, there are still many challenges in real-world applications, such as face recognition for the elderly and children, for surveillance scenes, and for Near-Infrared vs. Visible light (NIR-VIS) heterogeneous scenes. Due to these challenges, there are usually domain gaps between training (source domain) and test (target domain) data. A common way to improve performance on the target domain is to fine-tune the base model trained on the source domain using target data; however, this severely degrades performance on the source domain. Another way, jointly training models on both source and target data, suffers from heavy computation and large data storage, especially when new domains are continually encountered. In response to these problems, we introduce a new challenging task: Single Exemplar Domain Incremental Learning (SE-DIL), which utilizes the target domain data and just one exemplar per identity from the source domain data to quickly improve performance on the target domain while keeping performance on the source domain. To deal with SE-DIL, we propose our Fast Adapting without Forgetting (FAwF) method with three components: margin-based exemplar selection, prototype-based class extension and hard & soft knowledge distillation. Through FAwF, we can well maintain source-domain performance with only one sample per source-domain class, greatly reducing the fine-tuning time cost and data storage. Besides, we collected a large-scale children's face dataset, KidsFace, with 12K identities for studying SE-DIL in face recognition. Extensive analysis and experiments on our KidsFace-Test protocol and other challenging face test sets show that our method performs better than state-of-the-art methods on both the target and source domains.
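The hard & soft knowledge distillation component can be sketched as the classic combination of a cross-entropy term on ground-truth labels (hard) and a temperature-scaled KL term against the frozen source model (soft); the temperature and weighting below are conventional choices, assumed here rather than taken from the paper.

```python
import torch.nn.functional as F

def hard_soft_kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    """Hard term: fit the labels. Soft term: match the (frozen) source
    model's softened output distribution, scaled by T^2 as is standard."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    return lam * hard + (1.0 - lam) * soft
```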

Journal ArticleDOI
TL;DR: A unified framework, named CAScaded Split-and-Aggregate Learning with Feature Recombination (CAS-SAL-FR), learns the proposed modules jointly and concurrently, achieving new state-of-the-art performance.
Abstract: Multi-label pedestrian attribute recognition in surveillance is inherently a challenging task due to poor imaging quality, large pose variations, and so on. In this paper, we improve its performance from the following two aspects: (1) We propose a cascaded Split-and-Aggregate Learning (SAL) to capture both the individuality and commonality of all attributes, with one branch at the feature-map level and the other at the feature-vector level. For the former, we split the features of each attribute using a designed attribute-specific attention module (ASAM). For the latter, the split features for each attribute are learned using constrained losses. In both modules, the split features are aggregated by several convolutional or fully connected layers. (2) We propose a Feature Recombination (FR) that conducts a random shuffle of the split features over a batch of samples to synthesize more training samples, which expands the potential variability of samples. Finally, we formulate a unified framework, named CAScaded Split-and-Aggregate Learning with Feature Recombination (CAS-SAL-FR), to learn the above modules jointly and concurrently. Experiments on five popular benchmarks, including the RAP, PA-100K, PETA, Market-1501 and Duke attribute datasets, show that the proposed CAS-SAL-FR achieves new state-of-the-art performance.
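Feature Recombination, as described, shuffles each attribute's split features independently across the batch so that new attribute combinations are synthesized; the minimal sketch below assumes a (batch, attributes, dim) feature layout, which is an illustrative choice.

```python
import torch

def feature_recombination(attr_feats, labels):
    """attr_feats: (B, A, D) split features, one D-dim vector per attribute;
    labels: (B, A) attribute labels. Each attribute's features are permuted
    independently across the batch, together with their labels, yielding
    synthetic samples with novel attribute combinations."""
    B, A, _ = attr_feats.shape
    new_feats, new_labels = attr_feats.clone(), labels.clone()
    for a in range(A):
        perm = torch.randperm(B)
        new_feats[:, a] = attr_feats[perm, a]
        new_labels[:, a] = labels[perm, a]
    return new_feats, new_labels
```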

Posted Content
TL;DR: Li et al. as mentioned in this paper proposed a Contrastive Context-aware Learning (CCL) framework for face presentation attack detection, which learns by accurately leveraging rich contexts (e.g., subjects, mask material and lighting) among pairs of live faces and high-fidelity mask attacks.
Abstract: Face presentation attack detection (PAD) is essential to secure face recognition systems, primarily against high-fidelity mask attacks. Most existing 3D mask PAD benchmarks suffer from several drawbacks: 1) a limited number of mask identities, types of sensors, and total number of videos; 2) low-fidelity facial masks. Basic deep models and remote photoplethysmography (rPPG) methods achieve acceptable performance on these benchmarks but still fall far short of the needs of practical scenarios. To bridge the gap to real-world applications, we introduce a large-scale High-Fidelity Mask dataset, namely CASIA-SURF HiFiMask (briefly, HiFiMask). Specifically, a total of 54,600 videos are recorded from 75 subjects with 225 realistic masks using 7 new kinds of sensors. Together with the dataset, we propose a novel Contrastive Context-aware Learning framework, namely CCL. CCL is a new training methodology for supervised PAD tasks, which is able to learn by accurately leveraging rich contexts (e.g., subjects, mask material and lighting) among pairs of live faces and high-fidelity mask attacks. Extensive experimental evaluations on HiFiMask and three additional 3D mask datasets demonstrate the effectiveness of our method.

Posted Content
16 May 2021
TL;DR: Self-supervised learning (SSL) is emerging as a new paradigm for extracting informative knowledge through well-designed pretext tasks without relying on manual labels, which are generally very expensive and time-consuming to obtain; this paper surveys SSL techniques for graph data.
Abstract: Deep learning on graphs has recently achieved remarkable success on a variety of tasks, but such success relies heavily on massive, carefully labeled data. However, precise annotations are generally very expensive and time-consuming. To address this problem, self-supervised learning (SSL) is emerging as a new paradigm for extracting informative knowledge through well-designed pretext tasks without relying on manual labels. In this survey, we extend the concept of SSL, which first emerged in the fields of computer vision and natural language processing, to present a timely and comprehensive review of existing SSL techniques for graph data. Specifically, we divide existing graph SSL methods into three categories: contrastive, generative, and predictive. More importantly, unlike many other surveys that only provide a high-level description of published research, we present an additional mathematical summary of existing works in a unified framework. Furthermore, to facilitate methodological development and empirical comparison, we also summarize the commonly used datasets, evaluation metrics, downstream tasks, and open-source implementations of various algorithms. Finally, we discuss the technical challenges and potential future directions for improving graph self-supervised learning.

Posted Content
TL;DR: In this paper, a conditional local convolution, whose shared kernel over each node's local space is approximated by feedforward networks taking local coordinate representations obtained by horizon maps into the cylindrical-tangent space as input, is proposed to capture local spatial patterns.
Abstract: Spatio-temporal forecasting is challenging owing to the high nonlinearity of temporal dynamics as well as complex location-characterized patterns in spatial domains, especially in fields like weather forecasting. Graph convolutions are usually used to model spatial dependencies in meteorology in order to handle the irregular distribution of sensors' spatial locations. In this work, a novel graph-based convolution imitating meteorological flows is proposed to capture local spatial patterns. Based on the assumption of smoothness of location-characterized patterns, we propose a conditional local convolution whose shared kernel over each node's local space is approximated by feedforward networks, with local coordinate representations obtained by horizon maps into the cylindrical-tangent space as input. The established unified standard for the local coordinate system preserves geographic orientation. We further propose distance and orientation scaling terms to reduce the impact of irregular spatial distribution. The convolution is embedded in a Recurrent Neural Network architecture to model the temporal dynamics, leading to the Conditional Local Convolution Recurrent Network (CLCRN). Our model is evaluated on real-world weather benchmark datasets, achieving state-of-the-art performance with obvious improvements. We conduct further analysis of local pattern visualization, the model's framework choice, the advantages of horizon maps, etc.
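The core conditional local convolution can be sketched as a small feedforward network that maps each neighbor's local coordinates to convolution weights, which then aggregate the neighbors' features; the dimensions and MLP design below are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class ConditionalLocalConv(nn.Module):
    """Kernel weights are generated by an MLP conditioned on each neighbor's
    coordinates in the node's local frame (e.g., a cylindrical-tangent plane)."""

    def __init__(self, in_dim, out_dim, coord_dim=2, hidden=64):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.kernel_net = nn.Sequential(
            nn.Linear(coord_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, in_dim * out_dim))

    def forward(self, neighbor_feats, local_coords):
        # neighbor_feats: (N, K, in_dim)    features of K neighbors per node
        # local_coords:   (N, K, coord_dim) neighbor positions in local frame
        N, K, _ = neighbor_feats.shape
        W = self.kernel_net(local_coords).view(N, K, self.out_dim, self.in_dim)
        # Aggregate neighbors with coordinate-conditioned weights: mean_k W_k x_k
        return torch.einsum('nkoi,nki->no', W, neighbor_feats) / K
```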

Journal ArticleDOI
TL;DR: Yu et al. as mentioned in this paper proposed a 3D Central Difference Convolution (3D-CDC) family to capture rich temporal context by aggregating temporal difference information, together with optimized backbones for multi-sampling-rate branches and lateral connections among varied modalities.
Abstract: Gesture recognition has attracted considerable attention owing to its great potential in applications. Although great progress has been made recently in multi-modal learning methods, existing methods still lack effective integration to fully explore the synergies among spatio-temporal modalities for gesture recognition. The problem is partially due to the fact that existing manually designed network architectures have low efficiency in the joint learning of multiple modalities. In this paper, we propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition. The proposed method includes two key components: 1) enhanced temporal representation via the proposed 3D Central Difference Convolution (3D-CDC) family, which is able to capture rich temporal context by aggregating temporal difference information; and 2) optimized backbones for multi-sampling-rate branches and lateral connections among varied modalities. The resultant multi-modal multi-rate network provides a new perspective on the relationship between RGB and depth modalities and their temporal dynamics. Comprehensive experiments are performed on three benchmark datasets (IsoGD, NvGesture, and EgoGesture), demonstrating state-of-the-art performance in both single- and multi-modality settings. The code is available at https://github.com/ZitongYu/3DCDC-NAS .

Journal ArticleDOI
TL;DR: In this paper, the authors propose a decomposition of the model into the weight parameters and the BN statistics in the training phase to solve the problem of domain adaptation with limited unlabeled samples.
Abstract: Face recognition systems are sometimes deployed to a target domain with only limited unlabeled samples available. For instance, a model trained on large-scale web faces may be required to adapt to a NIR-VIS scenario via very limited unlabeled faces. This situation poses a great challenge to Unsupervised Domain Adaptation with Limited samples for Face Recognition (UDAL-FR), which is little studied in previous works. In this paper, using deep learning methods, we propose a novel training remedy that decomposes the model into the weight parameters and the BN statistics in the training phase. Based on this decomposition, we design a novel framework via meta-learning, called Decomposed Meta Batch Normalization (DMBN), for fast domain adaptation in face recognition. DMBN trains the network such that domain-invariant information tends to be stored in the weight parameters while domain-specific knowledge tends to be represented by the BN statistics. Specifically, DMBN constructs distribution-shifted tasks via domain-aware sampling, on which several meta-gradients are obtained by optimizing discriminative representations across different BNs. Finally, the weight parameters are updated with these meta-gradients for better consistency across different BNs. With the learned weight parameters, adaptation is very fast since only the BN update on limited data is needed. We propose two UDAL-FR benchmarks to evaluate the domain-adaptive ability of a model with limited unlabeled samples. Extensive experiments validate the efficacy of our proposed DMBN.
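The deployment-time step the decomposition enables, refreshing only the BN statistics on the limited unlabeled target data while freezing all weight parameters, can be sketched as the AdaBN-style update below; the loader name and the cumulative-averaging choice are assumptions for illustration.

```python
import torch

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, device="cuda"):
    """Freeze weights; refresh only BatchNorm running statistics with
    forward passes over unlabeled target-domain batches."""
    model.train()  # BN layers update running stats only in train mode
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # None => cumulative moving average in PyTorch
    for x in target_loader:   # unlabeled target images
        model(x.to(device))
    model.eval()
    return model
```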

Journal Article
Lirong Wu, Zicheng Liu, Zelin Zang, Jun Xia, Siyuan Li, Stan Z. Li
TL;DR: Experimental results on various datasets show that the proposed DCRL framework achieves performance comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance on downstream tasks.
Abstract: In this paper, we propose a novel framework for Deep Clustering and multi-manifold Representation Learning (DCRL) that preserves the geometric structure of data. In the proposed DCRL framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that clustering-oriented losses may deteriorate the geometric structure of embeddings in the latent space, an isometric loss is proposed for preserving intra-manifold structure locally and a ranking loss for preserving inter-manifold structure globally. Experimental results on various datasets show that the DCRL framework achieves performance comparable to current state-of-the-art deep clustering algorithms, yet exhibits superior performance for manifold representation. Our results also demonstrate the importance and effectiveness of the proposed losses in preserving geometric structure, in terms of both visualization and performance metrics. The code is provided in the Supplementary Material.
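The two structure-preserving terms can be sketched as a local isometric loss that matches pairwise distances between input and latent space over neighbor pairs, and a ranking-style loss that pushes the centroids of different manifolds apart globally; the margin, distance choices, and tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def isometric_loss(x, z, pair_idx):
    """Preserve intra-manifold structure locally: match input-space and
    latent-space distances over neighbor pairs.
    x: (B, D_in) inputs, z: (B, D_z) embeddings, pair_idx: (P, 2) indices."""
    i, j = pair_idx[:, 0], pair_idx[:, 1]
    d_in = (x[i] - x[j]).norm(dim=1)
    d_z = (z[i] - z[j]).norm(dim=1)
    return F.mse_loss(d_z, d_in)

def ranking_loss(centroids, margin=1.0):
    """Preserve inter-manifold structure globally: keep cluster centroids
    at least `margin` apart in the latent space."""
    d = torch.cdist(centroids, centroids)                # (C, C) distances
    off_diag = d[~torch.eye(len(d), dtype=torch.bool)]
    return F.relu(margin - off_diag).mean()
```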

Posted Content
TL;DR: Wang et al. as discussed by the authors proposed Surrogate Representation Learning with Isometric Mapping (SRLIM) to constrain the topological structure of nodes from the input layer to the embedding space, that is, to maintain the similarity of nodes throughout the propagation process.
Abstract: Gray-box graph attacks aim to disrupt the performance of the victim model through inconspicuous attacks with limited knowledge of the victim model. The parameters of the victim model and the labels of the test nodes are invisible to the attacker. To obtain gradients on node attributes or the graph structure, the attacker constructs an imaginary surrogate model trained under supervision. However, there has been little discussion of the training of surrogate models or of the robustness of the gradient information they provide. A general node classification model loses the topology of the nodes on the graph, which is, in fact, an exploitable prior for the attacker. This paper investigates the effect of the surrogate model's representation learning on the transferability of gray-box graph adversarial attacks. To preserve the topology in the surrogate embedding, we propose Surrogate Representation Learning with Isometric Mapping (SRLIM). Using an isometric mapping method, SRLIM constrains the topological structure of nodes from the input layer to the embedding space, that is, it maintains the similarity of nodes throughout the propagation process. Experiments demonstrate the effectiveness of our approach through improvements in the performance of adversarial attacks generated by gradient-based attackers in untargeted poisoning gray-box setups.

Posted Content
24 Mar 2021
TL;DR: In this article, the authors regard mixup as a pretext task and split it into two sub-problems, mixed-sample generation and mixup classification, and propose a lightweight mix block to generate synthetic samples based on feature maps and mixed labels.
Abstract: Mixup-based data augmentation has achieved great success as a regularizer for deep neural networks. However, existing mixup methods require explicitly designed mixup policies. In this paper, we present a flexible, general Automatic Mixup (AutoMix) framework which utilizes discriminative features to learn a sample-mixing policy adaptively. We regard mixup as a pretext task and split it into two sub-problems: mixed-sample generation and mixup classification. To this end, we design a lightweight mix block to generate synthetic samples based on feature maps and mixed labels. Since the two sub-problems are of an Expectation-Maximization (EM) nature, we also propose a momentum training pipeline to optimize the mixup process and the mixup classification process alternately in an end-to-end fashion. Extensive experiments on six popular classification benchmarks show that AutoMix consistently outperforms other leading mixup methods and improves generalization to downstream tasks. We hope AutoMix will motivate the community to rethink the role of mixup in representation learning. The code will be released soon.
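For contrast with the learned mix block, the hand-crafted policy that mixup-based methods traditionally use reduces to a single random linear interpolation; a minimal sketch follows (the Beta-distribution sampling is the conventional choice, an assumption here rather than AutoMix's learned policy).

```python
import numpy as np
import torch

def vanilla_mixup(x, y_onehot, alpha=1.0):
    """Hand-crafted mixup baseline: one fixed linear policy for all samples.
    AutoMix instead learns the mixing policy from feature maps via a mix block."""
    lam = np.random.beta(alpha, alpha)    # mixing ratio
    perm = torch.randperm(x.size(0))      # random pairing within the batch
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```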

Proceedings ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a series of systematic optimization strategies for the detection pipeline of a one-stage detector, forming a single-shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Abstract: Although anchor-based detectors have taken a big step forward in pedestrian detection, the overall performance of such algorithms still needs further improvement for practical applications, e.g., a good trade-off between accuracy and efficiency. To this end, this paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector, forming a single-shot anchor-based detector (SADet) for efficient and accurate pedestrian detection, which includes three main improvements. Firstly, we optimize the sample generation process by assigning soft labels to outlier samples, generating semi-positive samples with continuous tag values between 0 and 1. Secondly, a novel Center-IoU loss is applied as a new regression loss for bounding box regression, which not only retains the good characteristics of the IoU loss but also remedies some of its defects. Thirdly, we design Cosine-NMS for the post-processing of predicted bounding boxes, and further propose adaptive anchor matching to enable the model to adaptively match anchor boxes to full or visible bounding boxes according to the degree of occlusion. Though structurally simple, it presents state-of-the-art results and a real-time speed of 20 FPS for VGA-resolution images (640×480), tested on one GeForce GTX 1080Ti GPU on the challenging pedestrian detection benchmarks CityPersons and Caltech and the human detection benchmark CrowdHuman, making it an attractive new pedestrian detector.

Posted Content
TL;DR: Wang et al. as discussed by the authors proposed a novel Generalized Clustering and Multi-manifold Learning (GCML) framework with geometric structure preservation for generalized data, i.e., not limited to 2D image data, with a wide range of applications in the speech, text, and biology domains.
Abstract: Though manifold-based clustering has become a popular research topic, we observe that one important factor has been omitted by these works: the defined clustering loss may corrupt the local and global structure of the latent space. In this paper, we propose a novel Generalized Clustering and Multi-manifold Learning (GCML) framework with geometric structure preservation for generalized data, i.e., not limited to 2-D image data, with a wide range of applications in the speech, text, and biology domains. In the proposed framework, manifold clustering is done in the latent space guided by a clustering loss. To overcome the problem that the clustering-oriented loss may deteriorate the geometric structure of the latent space, an isometric loss is proposed for preserving intra-manifold structure locally and a ranking loss for preserving inter-manifold structure globally. Extensive experimental results show that GCML exhibits superior performance to counterparts in terms of qualitative visualizations and quantitative metrics, which demonstrates the effectiveness of preserving geometric structure.

Posted Content
TL;DR: In this paper, a cross-view alignment loss is proposed to learn to extract discriminative texture information and localization for self-supervised pre-training under fine-grained scenarios.
Abstract: Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success on various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. In this paper, we first point out that current contrastive methods are prone to memorizing background/foreground texture and therefore have difficulty localizing the foreground object. Analysis suggests that learning to extract discriminative texture information and learning localization are equally crucial for self-supervised pre-training under fine-grained scenarios. Based on our findings, we introduce Cross-view Saliency Alignment (CVSA), a contrastive learning framework that first crops and swaps saliency regions of images as a novel view-generation step and then guides the model to localize the foreground object via a cross-view alignment loss. Extensive experiments on four popular fine-grained classification benchmarks show that CVSA significantly improves the learned representation.

Jun Xia, Haitao Lin, Yongjie Xu, Lirong Wu, Zhangyang Gao, Siyuan Li, Stan Z. Li
04 May 2021
TL;DR: In this article, a pseudo label is computed from the neighboring labels for each node in the training set using LP; meta learning is utilized to learn a proper aggregation of the original and pseudo labels as the final label.
Abstract: Massive labeled data have been used to train deep neural networks, so label noise has become an important issue. Although learning with noisy labels has made great progress on image datasets in recent years, it has not yet been studied in connection with using GNNs to classify graph nodes. In this paper, we propose a method, named LPM, to address the problem using Label Propagation (LP) and Meta learning. Different from previous methods designed for image datasets, our method is based on a special attribute of graph-structured data, label smoothness, i.e., neighboring nodes in a graph tend to have the same label. A pseudo label is computed from the neighboring labels for each node in the training set using LP; meta learning is utilized to learn a proper aggregation of the original and pseudo labels as the final label. Experimental results demonstrate that LPM outperforms state-of-the-art methods on the graph node classification task with both synthetic and real-world label noise. Source code to reproduce all results will be released.
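The label propagation step that produces each training node's pseudo label can be sketched with the classic normalized-adjacency iteration; the teleport weight alpha and the iteration count below are conventional choices, not necessarily the paper's settings.

```python
import torch

def label_propagation(adj, y_onehot, alpha=0.9, iters=10):
    """Classic LP: Y^{t+1} = alpha * S @ Y^t + (1 - alpha) * Y^0,
    with S the symmetrically normalized adjacency matrix.
    adj: (N, N) dense adjacency; y_onehot: (N, C) known labels (zero rows
    for unlabeled nodes). Returns per-class scores; argmax per row
    gives the pseudo label."""
    deg = adj.sum(dim=1).clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    S = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    y0 = y_onehot.float()
    y = y0.clone()
    for _ in range(iters):
        y = alpha * (S @ y) + (1 - alpha) * y0
    return y
```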

Posted Content
Siyuan Li, Zicheng Liu, Di Wu, Zihan Liu, Stan Z. Li 
TL;DR: Zhang et al. as discussed by the authors proposed a scenario-agnostic mixup for both supervised learning and self-supervised learning (SSL) scenarios, which utilizes an attention mechanism to generate mixed samples without label dependency.
Abstract: Mixup is a popular data-dependent augmentation technique for deep neural networks, which contains two sub-tasks, mixup generation and classification. The community typically confines mixup to supervised learning (SL), and the objective of the generation sub-task is fixed to the sampled pairs instead of considering the whole data manifold. To overcome such limitations, we systematically study the objectives of the two sub-tasks and propose Scenario-Agnostic Mixup for both SL and Self-supervised Learning (SSL) scenarios, named SAMix. Specifically, we hypothesize and verify the core objective of mixup generation as optimizing local smoothness between two classes subject to global discrimination from other classes. Based on this discovery, an η-balanced mixup loss is proposed for complementary training of the two sub-tasks. Meanwhile, the generation sub-task is parameterized as an optimizable module, Mixer, which utilizes an attention mechanism to generate mixed samples without label dependency. Extensive experiments on SL and SSL tasks demonstrate that SAMix consistently outperforms leading methods by a large margin.

Posted ContentDOI
04 Mar 2021-bioRxiv
TL;DR: In this article, the authors show that during the growth of multi-layered tissues, the morphogenetic process can be self-organized by the progression of a compression gradient stemming from the interfacial mechanical interactions between layers.
Abstract: Morphogenesis is a spatially and temporally regulated process involved in various physiological and pathological transformations. In addition to the associated biochemical factors, the physical regulation of morphogenesis has attracted increasing attention. However, the driving force behind morphogenesis initiation remains elusive. Here, we show that during the growth of multi-layered tissues, the morphogenetic process can be self-organized by the progression of a compression gradient stemming from the interfacial mechanical interactions between layers. In tissues with low fluidity, the compression gradient is progressively strengthened during differential growth between layers and induces stratification by triggering symmetric-to-asymmetric cell division reorientation at a critical tissue size. In tissues with higher fluidity, the compression gradient is dynamic and induces 2D in-plane morphogenesis instead of 3D deformation, accompanied by cell rearrangement regulated by cell junction remodeling. Morphogenesis can be tuned by manipulating tissue fluidity, cell adhesion forces, and mechanical properties to influence the progression of the compression gradient during the development of cultured cell sheets and chicken embryos. Together, the progression of a compression gradient regulated by interfacial mechanical interaction provides a conserved mechanism underlying morphogenesis initiation and size control during tissue growth.

04 May 2021
TL;DR: A novel method, called elastic locally isometric smoothness (ELIS), is proposed to empower deep neural networks with the ability to preserve the local geometry of highly nonlinear manifolds in high-dimensional spaces and properly unfold them into lower-dimensional hyperplanes.
Abstract: The ability to preserve the local geometry of highly nonlinear manifolds in high-dimensional spaces and properly unfold them into lower-dimensional hyperplanes is key to the success of manifold computing, nonlinear dimensionality reduction (NLDR) and visualization. This paper proposes a novel method, called elastic locally isometric smoothness (ELIS), to empower deep neural networks with such an ability. ELIS requires that a desired metric between points be preserved across layers in order to preserve local geometry; such a smoothness constraint effectively regularizes vector-based transformations to become well-behaved local metric-preserving homeomorphisms. Moreover, ELIS requires that the smoothness be imposed in a way that renders sufficient flexibility for tackling complicated nonlinearity and non-Euclideanity; this is achieved layer-wise via nonlinearity in both the similarity and activation functions. The ELIS method incorporates a class of suitable nonlinear similarity functions into a two-way divergence loss and uses hyperparameter continuation to find optimal solutions. Extensive experiments, comparisons, and an ablation study demonstrate that ELIS delivers results not only superior to UMAP and t-SNE for visualization but also better than other leading counterparts of manifold and autoencoder learning for NLDR and manifold data generation.

Posted Content
TL;DR: Li et al. as discussed by the authors reformulate mixup for supervised classification as two sub-tasks, mixup sample generation and classification, and propose Automatic Mixup (AutoMix), a revolutionary mixup framework.
Abstract: Mixup-based data augmentations have achieved great success as regularizers for deep neural networks. However, existing methods rely on deliberately handcrafted mixup policies, which ignore or oversell the semantic matching between mixed samples and labels. Driven by their prior assumptions, early methods attempt to smooth decision boundaries by random linear interpolation while others focus on maximizing class-related information via offline saliency optimization. As a result, the issue of label mismatch has not been well addressed. Additionally, the optimization stability of mixup training is constantly troubled by the label mismatch. To address these challenges, we first reformulate mixup for supervised classification as two sub-tasks, mixup sample generation and classification, then propose Automatic Mixup (AutoMix), a revolutionary mixup framework. Specifically, a learnable lightweight Mix Block (MB) with a cross-attention mechanism is proposed to generate a mixed sample by modeling a fair relationship between the pair of samples under direct supervision of the corresponding mixed label. Moreover, the proposed Momentum Pipeline (MP) enhances training stability and accelerates convergence on top of making the Mix Block fully trained end-to-end. Extensive experiments on five popular classification benchmarks show that the proposed approach consistently outperforms leading methods by a large margin.

Book ChapterDOI
Siyuan Li, Haitao Lin, Zelin Zang, Lirong Wu, Jun Xia, Stan Z. Li
04 May 2021
TL;DR: Li et al. as mentioned in this paper proposed a two-stage dimension reduction method, called invertible manifold learning (inv-ML), to preserve the topological and geometric properties of data manifolds, which together constitute the entire information of the data manifolds.
Abstract: It is widely believed that a dimension reduction (DR) process inevitably drops information in most practical scenarios. Thus, most methods, including manifold-based DR methods, try to preserve some essential information of the data after DR. However, they usually fail to yield satisfying results, especially in high-dimensional cases. In the context of manifold learning, we argue that a good low-dimensional representation should preserve the topological and geometric properties of the data manifolds, which together constitute the entire information of the data manifolds. In this paper, we define the problem of information-lossless NLDR under the manifold assumption and propose a novel two-stage NLDR method, called invertible manifold learning (inv-ML), to tackle it. A local isometry constraint for preserving local geometry is applied under this assumption in inv-ML. Firstly, a homeomorphic sparse coordinate transformation is learned to find a low-dimensional representation without losing topological information. Secondly, a linear compression is performed on the learned sparse coding, trading off the target dimension against the incurred information loss. Experiments are conducted on seven datasets with a neural network implementation of inv-ML, called i-ML-Enc, demonstrating that the proposed inv-ML not only achieves invertible NLDR in comparison with typical existing methods but also reveals the characteristics of the learned manifolds through linear interpolation in the latent space. Moreover, we find that the reliability of the tangent space approximated by the local neighborhood on real-world datasets is key to the success of manifold-based DR algorithms. The code will be made available soon.