
Showing papers by "Liqing Zhang" published in 2019


Posted Content
TL;DR: This work contributes an image harmonization dataset iHarmony4 by generating synthesized composite images based on the COCO (resp., Adobe5k, Flickr, day2night) dataset, leading to the HCOCO (resp., HAdobe5k, HFlickr, Hday2night) sub-datasets, and proposes a new deep image harmonization method DoveNet using a novel domain verification discriminator.
Abstract: Image composition is an important operation in image processing, but the inconsistency between foreground and background significantly degrades the quality of the composite image. Image harmonization, aiming to make the foreground compatible with the background, is a promising yet challenging task. However, the lack of a high-quality publicly available dataset for image harmonization greatly hinders the development of image harmonization techniques. In this work, we contribute an image harmonization dataset iHarmony4 by generating synthesized composite images based on the COCO (resp., Adobe5k, Flickr, day2night) dataset, leading to our HCOCO (resp., HAdobe5k, HFlickr, Hday2night) sub-datasets. Moreover, we propose a new deep image harmonization method DoveNet using a novel domain verification discriminator, with the insight that the foreground needs to be translated to the same domain as the background. Extensive experiments on our constructed dataset demonstrate the effectiveness of our proposed method. Our dataset and code are available at this https URL.

74 citations
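
To make the domain-verification insight concrete, below is a minimal PyTorch sketch of a discriminator that pools foreground and background features using the composite mask and scores whether the two regions belong to the same domain. The layer sizes, pooling scheme, and class name are illustrative assumptions, not the authors' DoveNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainVerificationDiscriminator(nn.Module):
    """Scores whether the foreground and background of a composite share one domain."""
    def __init__(self, in_channels=3, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.score = nn.Linear(2 * feat_dim, 1)

    def forward(self, image, mask):
        feat = self.encoder(image)                                    # B x C x h x w
        mask = F.interpolate(mask, size=feat.shape[2:], mode="nearest")
        fg = (feat * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1)
        bg = (feat * (1 - mask)).sum(dim=(2, 3)) / (1 - mask).sum(dim=(2, 3)).clamp(min=1)
        return self.score(torch.cat([fg, bg], dim=1))                 # high => same domain

# Real images should score high and unharmonized composites low; the harmonization
# generator would then be trained to fool this discriminator.
img, m = torch.randn(2, 3, 256, 256), torch.rand(2, 1, 256, 256).round()
print(DomainVerificationDiscriminator()(img, m).shape)                # torch.Size([2, 1])
```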


Journal ArticleDOI
TL;DR: Overall, canopy cover showed the strongest associations with mental health at most scales, and there appear to be minimum and maximum threshold levels of spatial scale at which UGS and health have significant associations, with the strongest associations consistently shown between 400 m and 1600 m in different buffer types.
Abstract: Although the benefits from exposure to urban green spaces (UGS) are increasingly reported, there are important knowledge gaps in the nature of UGS-health relationships. One such unknown area is the dependence of UGS-health associations on the types of UGS studied, the way they are quantified, and the spatial scale used in the analysis. These knowledge gaps have important ramifications on our ability to develop generalizations to promote implementation and facilitate comparative studies across different socio-cultural and socio-economic contexts. We conducted a study in Singapore to examine the dependence of UGS-health associations on the metrics for quantifying UGS (vegetation cover, canopy cover and park area) in different types of buffer area (circular, nested and network) at different spatial scales. A population-based household survey (n = 1000) was used to collect information on self-reported health and the perception and usage patterns of UGS. The results showed that although all three UGS metrics were positively related to mental health at certain scales, overall, canopy cover showed the strongest associations with mental health at most scales. There also appear to be minimum and maximum threshold levels of spatial scale at which UGS and health have significant associations, with the strongest associations consistently shown between 400 m and 1600 m in different buffer types. We discuss the significance of these results for UGS-health studies and applications in UGS planning for improved health of urban dwellers.

47 citations


Journal ArticleDOI
TL;DR: This work proposes a method named adaptive embedding ZSL (AEZSL) to learn an adaptive visual-semantic mapping for each unseen category, followed by progressive label refinement, which achieves state-of-the-art results for image classification on three small-scale benchmark datasets and one large-scale benchmark dataset.
Abstract: Zero-shot learning (ZSL) aims to classify a test instance from an unseen category based on the training instances from seen categories, in which the gap between seen categories and unseen categories is generally bridged via a visual-semantic mapping between the low-level visual feature space and the intermediate semantic space. However, the visual-semantic mapping (i.e., projection) learnt based on seen categories may not generalize well to unseen categories, which is known as the projection domain shift in ZSL. To address this projection domain shift issue, we propose a method named adaptive embedding ZSL (AEZSL) to learn an adaptive visual-semantic mapping for each unseen category, followed by progressive label refinement. Moreover, to avoid learning a visual-semantic mapping for each unseen category in the large-scale classification task, we additionally propose a deep adaptive embedding model named deep AEZSL, which shares a similar idea (i.e., the visual-semantic mapping should be category-specific and related to the semantic space) with AEZSL, only needs to be trained once, but can be applied to an arbitrary number of unseen categories. Extensive experiments demonstrate that our proposed methods achieve state-of-the-art results for image classification on three small-scale benchmark datasets and one large-scale benchmark dataset.

38 citations
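
As a rough illustration of the category-adaptive visual-semantic mapping idea, the NumPy sketch below re-weights seen-category samples by their semantic similarity to a given unseen category and solves a weighted ridge regression from visual features to the semantic space. The weighting scheme, dimensions, and nearest-attribute classifier are assumptions for illustration, not the exact AEZSL formulation.

```python
import numpy as np

def adaptive_mapping(X, S_seen, y, s_unseen, lam=1.0):
    """X: n x d visual features, S_seen: k x a seen-class attributes,
    y: n seen labels in [0, k), s_unseen: a-dim unseen-class attribute vector."""
    # Weight each seen-class sample by the cosine similarity between its class
    # attributes and the unseen-class attributes (clipped at zero).
    sim = S_seen @ s_unseen / (np.linalg.norm(S_seen, axis=1) * np.linalg.norm(s_unseen) + 1e-8)
    w = np.maximum(sim[y], 0.0)
    A = X.T @ (X * w[:, None]) + lam * np.eye(X.shape[1])   # weighted ridge regression
    B = X.T @ (S_seen[y] * w[:, None])
    return np.linalg.solve(A, B)                             # d x a projection

# Toy usage: classify one test sample against three unseen classes, each scored
# with its own category-specific projection.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 32)), rng.integers(0, 5, size=100)
S_seen, S_unseen = rng.normal(size=(5, 10)), rng.normal(size=(3, 10))
scores = [S_unseen[c] @ (X[0] @ adaptive_mapping(X, S_seen, y, S_unseen[c]))
          for c in range(len(S_unseen))]
print(int(np.argmax(scores)))
```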


Proceedings ArticleDOI
01 Aug 2019
TL;DR: A high-order graph attention representation method (HGAR) is proposed to learn the embedding of guarantee networks; its complexity is proved to be near-linear in the number of edges, so it can scale to large datasets.
Abstract: Assessing and predicting the default risk of networked-guarantee loans is critical for commercial banks and financial regulatory authorities. The guarantee relationships between the loan companies are usually modeled as directed networks. Learning the informative low-dimensional representation of the networks is important for the default risk prediction of loan companies, and even for the assessment of the systematic financial risk level. In this paper, we propose a high-order graph attention representation method (HGAR) to learn the embedding of guarantee networks. Because this financial network is different from other complex networks, such as social, language, or citation networks, we set the binary roles of vertices and define high-order adjacent measures based on financial domain characteristics. We design objective functions in addition to a graph attention layer to capture the importance of nodes. We implement a productive learning strategy and prove that the complexity is near-linear with the number of edges, which allows the method to scale to large datasets. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods. We also evaluate the model in a real-world loan risk control system, and the results validate the effectiveness of our proposed approaches.

38 citations
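
For readers unfamiliar with graph attention, here is a minimal single-head graph attention layer in PyTorch applied to a toy guarantee network. The binary vertex roles and high-order adjacency measures specific to HGAR are omitted, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Generic single-head GAT-style layer: attends over adjacent vertices."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h, adj):
        z = self.W(h)                                            # N x out_dim
        n = z.size(0)
        pairs = torch.cat([z.repeat_interleave(n, 0), z.repeat(n, 1)], dim=1)
        e = F.leaky_relu(self.a(pairs)).view(n, n)               # attention logits e_ij
        e = e.masked_fill(adj == 0, float("-inf"))               # attend only along edges
        return F.elu(torch.softmax(e, dim=1) @ z)                # aggregate neighbors

# Toy guarantee network: 4 companies, directed guarantee edges plus self-loops.
adj = torch.tensor([[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]], dtype=torch.float)
out = GraphAttentionLayer(8, 16)(torch.randn(4, 8), adj)
print(out.shape)   # torch.Size([4, 16])
```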


Journal ArticleDOI
17 Jul 2019
TL;DR: A novel model formulates disentangled representations by projecting images to latent units (grouped feature channels of a convolutional neural network) to disassemble the information between different attributes, retaining the diversity beyond the labels inside each image.
Abstract: Recent studies show significant progress in the image-to-image translation task, especially facilitated by Generative Adversarial Networks. They can synthesize highly realistic images and alter the attribute labels of the images. However, these works employ attribute vectors to specify the target domain, which diminishes image-level attribute diversity. In this paper, we propose a novel model formulating disentangled representations by projecting images to latent units, grouped feature channels of a Convolutional Neural Network, to disassemble the information between different attributes. Thanks to the disentangled representation, we can transfer attributes according to the attribute labels and moreover retain the diversity beyond the labels, namely, the styles inside each image. This is achieved by specifying some attributes and swapping the corresponding latent units to “swap” the attributes' appearance, or applying channel-wise interpolation to blend different attributes. To verify the motivation of our proposed model, we train and evaluate our model on the face dataset CelebA. Furthermore, the evaluation on another facial expression dataset, RaFD, demonstrates the generalizability of our proposed model.

29 citations
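
The swapping and channel-wise interpolation operations can be sketched directly on latent feature tensors, as below. The unit size (16 channels) and the contiguous grouping of channels into units are illustrative assumptions, not the paper's learned configuration.

```python
import torch

def swap_unit(latent_a, latent_b, unit, channels_per_unit=16):
    """Swap one attribute unit (a contiguous channel group) between two latents."""
    s = slice(unit * channels_per_unit, (unit + 1) * channels_per_unit)
    out_a, out_b = latent_a.clone(), latent_b.clone()
    out_a[:, s], out_b[:, s] = latent_b[:, s], latent_a[:, s]
    return out_a, out_b

def blend_unit(latent_a, latent_b, unit, alpha=0.5, channels_per_unit=16):
    """Channel-wise interpolation inside one unit blends two attribute styles."""
    s = slice(unit * channels_per_unit, (unit + 1) * channels_per_unit)
    out = latent_a.clone()
    out[:, s] = alpha * latent_a[:, s] + (1 - alpha) * latent_b[:, s]
    return out

# Toy usage on two latent tensors of shape B x C x H x W.
za, zb = torch.randn(1, 128, 8, 8), torch.randn(1, 128, 8, 8)
swapped_a, swapped_b = swap_unit(za, zb, unit=2)        # exchange one attribute
blended = blend_unit(za, zb, unit=2, alpha=0.3)         # interpolate its style
```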


Proceedings ArticleDOI
03 Nov 2019
TL;DR: A dynamic default prediction framework (DDPF) is proposed, which preserves temporal network structures and loan behavior sequences in an end-to-end model and designs a gated recursive and attention mechanism to integrate both the loan behavior and network information.
Abstract: Commercial banks normally require Small and Medium Enterprises (SMEs) to provide warranties when applying for a loan. If the borrower defaults, the guarantor is obligated to repay its loan. Such a guarantee system is designed to reduce delinquent risks, but it may introduce a new dimension of risk if more and more SMEs become involved and subsequently form complex temporal networks. Monitoring the financial status of SMEs in these networks, and preventing or reducing systematic loan risk, is an area of great concern for both the regulatory commission and the banks. To allow possible actions to be taken in advance, this paper studies the problem of predicting repayment delinquency in networked-guarantee loans. We propose a dynamic default prediction framework (DDPF), which preserves temporal network structures and loan behavior sequences in an end-to-end model. In particular, we design a gated recursive and attention mechanism to integrate both the loan behavior and network information. Then, we uncover risky warrant patterns from the learned weights, which effectively accelerates the risk evaluation process. Finally, we conduct extensive experiments in a real-world loan risk control system to evaluate its performance; the results demonstrate the effectiveness of our proposed approach compared with state-of-the-art baselines.

24 citations
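
A minimal sketch of the "gated recursive plus attention" idea: a GRU encodes a company's loan behavior sequence and soft attention pools over time steps before a default-probability head. The layer sizes are assumptions, and the fusion with guarantee-network features described in the abstract is omitted for brevity.

```python
import torch
import torch.nn as nn

class BehaviorEncoder(nn.Module):
    """GRU over a loan behavior sequence with attention pooling over time steps."""
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, 1)

    def forward(self, seq):                        # seq: B x T x in_dim
        h, _ = self.gru(seq)                       # B x T x hidden
        w = torch.softmax(self.attn(h), dim=1)     # attention weights over time
        ctx = (w * h).sum(dim=1)                   # B x hidden
        return torch.sigmoid(self.out(ctx))        # predicted default probability

# Toy usage: 4 companies, 24 monthly behavior vectors of 12 features each.
prob = BehaviorEncoder(in_dim=12)(torch.randn(4, 24, 12))
print(prob.shape)   # torch.Size([4, 1])
```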


Proceedings ArticleDOI
15 Oct 2019
TL;DR: The proposed Gradient Augmented Inpainting Network (GAIN) uses image gradient information instead of edge information to facilitate image inpainting, and outperforms state-of-the-art methods quantitatively and qualitatively.
Abstract: Image inpainting, which aims to fill the missing holes of images, is a challenging task because the holes may contain complicated structures or different possible layouts. Deep learning methods have shown promising performance in image inpainting but still suffer from generating poorly structured artifacts when the holes are large and irregular. Some existing methods use edge inpainting to help image inpainting, with a binary edge map obtained from the image gradient. However, by only using the binary edge map, these methods discard the rich information in the image gradient and thus leave some critical issues (e.g., color discrepancy) unattended. In this paper, we propose the Gradient Augmented Inpainting Network (GAIN), which uses image gradient information instead of edge information to facilitate image inpainting. Specifically, we formulate a multi-task learning framework which performs image inpainting and gradient inpainting simultaneously. A novel GAI-Block is designed to encourage information fusion between the image feature map and the gradient feature map. Moreover, gradient information is also used to determine the filling priority, which can guide the network to construct more plausible semantic structures for the holes. Experimental results on the public datasets CelebA-HQ and Places2 show that our proposed method outperforms state-of-the-art methods quantitatively and qualitatively.

18 citations
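
To illustrate the gradient supervision that accompanies RGB inpainting, the sketch below extracts an image-gradient map with Sobel filters, which could serve as the target of a gradient-inpainting branch. The choice of Sobel and the single-channel input are assumptions, not details fixed by the paper.

```python
import torch
import torch.nn.functional as F

def sobel_gradient(gray):
    """gray: B x 1 x H x W grayscale image, returns B x 2 x H x W (dx, dy)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                      # transposed Sobel kernel for dy
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.cat([gx, gy], dim=1)

# Multi-task target: the network would predict both the missing RGB content and
# the missing gradient, with the gradient loss computed against this map.
img = torch.rand(2, 1, 64, 64)
grad = sobel_gradient(img)
print(grad.shape)   # torch.Size([2, 2, 64, 64])
```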


Posted Content
TL;DR: This paper proposes an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach, which incorporates multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue, and proposes a geometry-aware triplet loss based on point-to-subspace distance to preserve the structural information of activity proposals.
Abstract: The activity image-to-video retrieval task aims to retrieve videos containing an activity similar to that in the query image, which is challenging because videos generally have many background segments irrelevant to the activity. In this paper, we utilize the R-C3D model to represent a video by a bag of activity proposals, which can filter out background segments to some extent. However, there are still noisy proposals in each bag. Thus, we propose an Activity Proposal-based Image-to-Video Retrieval (APIVR) approach, which incorporates multi-instance learning into a cross-modal retrieval framework to address the proposal noise issue. Specifically, we propose a Graph Multi-Instance Learning (GMIL) module with a graph convolutional layer, and integrate this module with classification loss, adversarial loss, and triplet loss in our cross-modal retrieval framework. Moreover, we propose a geometry-aware triplet loss based on point-to-subspace distance to preserve the structural information of activity proposals. Extensive experiments on three widely-used datasets verify the effectiveness of our approach.

7 citations
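
The geometry-aware triplet loss is built on a point-to-subspace distance; a hedged sketch of that distance (via orthogonal projection onto the span of a bag of proposal features) and the resulting hinge loss is shown below. The margin and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def point_to_subspace_dist(q, P):
    """q: d-dim query embedding; P: k x d proposal features spanning a subspace.
    Distance is the norm of the component of q orthogonal to span(P)."""
    Q, _ = torch.linalg.qr(P.t())        # orthonormal basis of the proposal subspace
    proj = Q @ (Q.t() @ q)               # projection of q onto the subspace
    return torch.norm(q - proj)

def geometry_triplet_loss(q, P_pos, P_neg, margin=0.5):
    """Hinge loss pulling q toward the positive bag's subspace, away from the negative."""
    d_pos = point_to_subspace_dist(q, P_pos)
    d_neg = point_to_subspace_dist(q, P_neg)
    return F.relu(d_pos - d_neg + margin)

# Toy usage: a 128-dim query against two bags of 8 proposal features each.
q = torch.randn(128)
loss = geometry_triplet_loss(q, torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```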


Posted Content
27 Nov 2019
TL;DR: A new deep image harmonization method with a novel domain verification discriminator, enlightened by the following insight: incompatible foreground and background belong to two different domains, so the foreground needs to be translated to the same domain as the background.
Abstract: Image composition is an important operation in image processing, but the inconsistency between foreground and background significantly degrades the quality of the composite image. Image harmonization, aiming to make the foreground compatible with the background, is a promising yet challenging task. However, the lack of a high-quality publicly available dataset for image harmonization greatly hinders the development of image harmonization techniques. In this work, we contribute an image harmonization dataset by generating synthesized composite images based on the COCO (resp., Adobe5k, Flickr, day2night) dataset, leading to our HCOCO (resp., HAdobe5k, HFlickr, Hday2night) sub-datasets. Moreover, we propose a new deep image harmonization method with a novel domain verification discriminator, enlightened by the following insight: incompatible foreground and background belong to two different domains, so we need to translate the domain of the foreground to the same domain as the background. Our proposed domain verification discriminator can play such a role by pulling close the domains of foreground and background. Extensive experiments on our constructed dataset demonstrate the effectiveness of our proposed method. Our dataset is released at https://github.com/bcmi/Image_Harmonization_Datasets.

6 citations


Posted Content
TL;DR: This work proposes the STRucture-aware Asymmetric Disentanglement (STRAD) method, in which image features are disentangled into structure features and appearance features while sketch features are only projected to structure space.
Abstract: The goal of Sketch-Based Image Retrieval (SBIR) is to use free-hand sketches to retrieve images of the same category from a natural image gallery. However, SBIR requires all test categories to be seen during training, which cannot be guaranteed in real-world applications. So we investigate the more challenging Zero-Shot SBIR (ZS-SBIR), in which test categories do not appear in the training stage. After realizing that sketches mainly contain structure information while images contain additional appearance information, we attempt to achieve structure-aware retrieval via asymmetric disentanglement. For this purpose, we propose our STRucture-aware Asymmetric Disentanglement (STRAD) method, in which image features are disentangled into structure features and appearance features while sketch features are only projected to the structure space. Through disentangling structure and appearance space, bi-directional domain translation is performed between the sketch domain and the image domain. Extensive experiments demonstrate that our STRAD method remarkably outperforms state-of-the-art methods on three large-scale benchmark datasets.

5 citations


Posted Content
28 Jun 2019
TL;DR: The novel ProtoNet is proposed, which is capable of handling label noise and background noise together, without the supervision of clean images in the training stage, and can be easily integrated into an arbitrary CNN model.
Abstract: Learning from web data has attracted lots of research interest in recent years. However, crawled web images usually contain two types of noise, label noise and background noise, which induce extra difficulties in utilizing them effectively. Most existing methods either rely on human supervision or ignore the background noise. In this paper, we propose the novel ProtoNet, which is capable of handling these two types of noise together, without the supervision of clean images in the training stage. Particularly, we use a memory module to identify the representative and discriminative prototypes for each category. Then, we remove noisy images and noisy region proposals from the web dataset with the aid of the memory module. Our approach is efficient and can be easily integrated into an arbitrary CNN model. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.

Proceedings ArticleDOI
08 Jul 2019
TL;DR: This work proposes Detection+, a model using a prior symmetry constraint to refine the keypoints located by any backbone detection network, and introduces a new loss to deal with uncertainty in clothing labelling by utilizing all available data that contain "maybe" labels.
Abstract: Rich clothes datasets and high-quality annotations have driven recent advances in fashion clothes recognition. However, the existing approaches treat clothes as common images, ignoring prior clothing knowledge such as spatial relations, symmetry, proportions, and key characteristics of clothes. In order to combine this semantic information with the advantages of deep learning, we propose Detection+, a model using a prior symmetry constraint to refine the keypoints located by any backbone detection network. To deal with uncertainty in labelling clothing, we introduce a new loss to utilize all available data which contain "maybe" labels. Detection+ reduces Normalized Error by about 2.54% on the FashionAI dataset and improves AP by 3.2% on the human keypoints dataset COCO 2017 compared to the Mask R-CNN baseline. Extensive experimental results show that the proposed approach achieves better results on different recognition datasets (resp., FashionAI and DeepFashion), with improvements of about 2.57% mAP and 10% recall, respectively.

Posted Content
TL;DR: A Bi-directional Domain Translation (BDT) framework is proposed for ZS-SBIR, in which the image domain and sketch domain can be translated to each other through disentangled structure and appearance features to facilitate structure-based retrieval.
Abstract: The goal of Sketch-Based Image Retrieval (SBIR) is to use free-hand sketches to retrieve images of the same category from a natural image gallery. However, SBIR requires all categories to be seen during training, which cannot be guaranteed in real-world applications. So we investigate the more challenging Zero-Shot SBIR (ZS-SBIR), in which test categories do not appear in the training stage. Traditional SBIR methods are prone to category-based retrieval and cannot generalize well from seen categories to unseen ones. In contrast, we disentangle image features into structure features and appearance features to facilitate structure-based retrieval. To assist feature disentanglement and take full advantage of the disentangled information, we propose a Bi-directional Domain Translation (BDT) framework for ZS-SBIR, in which the image domain and sketch domain can be translated to each other through disentangled structure and appearance features. Finally, we perform retrieval in both the structure feature space and the image feature space. Extensive experiments demonstrate that our proposed approach remarkably outperforms state-of-the-art approaches by about 8% on the Sketchy dataset and over 5% on the TU-Berlin dataset.

Journal ArticleDOI
TL;DR: A binary higher-order network embedding method is proposed to learn the low-dimensional representations of a guarantee network; it outperforms other state-of-the-art algorithms in both classification accuracy and robustness, especially on the guarantee network.
Abstract: Networked-guarantee loans may cause systemic-risk-related concerns for the government and banks in China. The prediction of the default of enterprise loans is a typical machine-learning-based classification problem, and the networked guarantee makes this problem very difficult to solve. As we know, a complex network is usually stored and represented by an adjacency matrix, which is high-dimensional and sparse, whereas machine-learning methods usually need low-dimensional dense feature representations. Therefore, in this paper, we propose a binary higher-order network embedding method to learn the low-dimensional representations of a guarantee network. We first label the vertices of this heterogeneous economic network with binary roles (guarantor and guarantee), and then define high-order adjacent measures based on their roles and economic domain knowledge. Afterwards, we design a penalty parameter in the objective function to balance the importance of network structure and adjacency. We optimize it with negative-sampling-based gradient descent algorithms, which solve the limitation of stochastic gradient descent on weighted edges without compromising efficiency. Finally, we test our proposed method on three real-world network datasets. The results show that this method outperforms other state-of-the-art algorithms in both classification accuracy and robustness, especially on the guarantee network.

Posted Content
TL;DR: This work leverages only the depth of training images as privileged information to mine the hard pixels in semantic segmentation, where depth information is available for training images but not for test images.
Abstract: Semantic segmentation has achieved remarkable progress but remains challenging due to complex scenes, object occlusion, and so on. Some research works have attempted to use extra information, such as a depth map, to help RGB-based semantic segmentation, because the depth map can provide complementary geometric cues. However, due to the inaccessibility of depth sensors, depth information is usually unavailable for the test images. In this paper, we leverage only the depth of training images as the privileged information to mine the hard pixels in semantic segmentation, in which depth information is only available for training images but not available for test images. Specifically, we propose a novel Loss Weight Module, which outputs a loss weight map by employing two depth-related measurements of hard pixels: Depth Prediction Error and Depth-aware Segmentation Error. The loss weight map is then applied to the segmentation loss, with the goal of learning a more robust model by paying more attention to the hard pixels. Besides, we also explore a curriculum learning strategy based on the loss weight map. Meanwhile, to fully mine the hard pixels on different scales, we apply our loss weight module to multi-scale side outputs. Our hard pixel mining method achieves state-of-the-art results on two benchmark datasets, and even outperforms methods which need depth input during testing.
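
As a sketch of the weighting idea, the snippet below builds a loss weight map from the two error measurements and applies it to a per-pixel cross-entropy loss; the additive combination rule and the alpha coefficient are assumptions, not the paper's exact Loss Weight Module.

```python
import torch
import torch.nn.functional as F

def weighted_seg_loss(logits, target, depth_pred_err, depth_seg_err, alpha=0.5):
    """logits: B x C x H x W, target: B x H x W class indices,
    depth_pred_err / depth_seg_err: B x H x W non-negative error maps."""
    weight = 1.0 + alpha * (depth_pred_err + depth_seg_err)   # hard pixels weigh more
    ce = F.cross_entropy(logits, target, reduction="none")    # per-pixel loss, B x H x W
    return (weight * ce).mean()

# Toy usage with 19 classes and random error maps standing in for DPE and DSE.
loss = weighted_seg_loss(torch.randn(2, 19, 64, 64),
                         torch.randint(0, 19, (2, 64, 64)),
                         torch.rand(2, 64, 64), torch.rand(2, 64, 64))
print(loss.item())
```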

Posted Content
Yi Tu, Li Niu, Junjie Chen, Dawei Cheng, Liqing Zhang
TL;DR: In this article, a multi-instance learning method was proposed to handle label noise and background noise in crawled web images without the supervision of clean images in the training stage, where ROIs in each bag were assigned different weights based on the representative/discriminative scores of their nearest clusters.
Abstract: Learning from web data has attracted lots of research interest in recent years. However, crawled web images usually contain two types of noise, label noise and background noise, which induce extra difficulties in utilizing them effectively. Most existing methods either rely on human supervision or ignore the background noise. In this paper, we propose a novel method which is capable of handling these two types of noise together, without the supervision of clean images in the training stage. Particularly, we formulate our method under the framework of multi-instance learning by grouping ROIs (i.e., images and their region proposals) from the same category into bags. ROIs in each bag are assigned different weights based on the representative/discriminative scores of their nearest clusters, in which the clusters and their scores are obtained via our designed memory module. Our memory module can be naturally integrated with the classification module, leading to an end-to-end trainable system. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method.

Posted Content
TL;DR: This work contributes an image harmonization dataset iHarmony4 by generating synthesized composite images based on the existing COCO (resp., Adobe5k, day2night) datasets, leading to the HCOCO (resp., HAdobe5k, Hday2night) sub-datasets.
Abstract: Image composition is an important operation in image processing, but the inconsistency between foreground and background significantly degrades the quality of the composite image. Image harmonization, which aims to make the foreground compatible with the background, is a promising yet challenging task. However, the lack of a high-quality public dataset for image harmonization significantly hinders the development of image harmonization techniques. Therefore, we contribute an image harmonization dataset iHarmony4 by generating synthesized composite images based on the existing COCO (resp., Adobe5k, day2night) datasets, leading to our HCOCO (resp., HAdobe5k, Hday2night) sub-datasets. To enrich the diversity of our dataset, we also generate synthesized composite images based on our collected Flickr images, leading to our HFlickr sub-dataset. The image harmonization dataset iHarmony4 is released at this https URL.

Posted Content
Yi Tu, Li Niu, Weijie Zhao, Dawei Cheng, Liqing Zhang
TL;DR: This work proposes an interpretable image cropping model to reveal the intrinsic mechanism of aesthetic evaluation, using a fully convolutional network to produce an aesthetic score map that is shared among all candidate crops during crop-level aesthetic evaluation.
Abstract: Aesthetic image cropping is a practical but challenging task which aims at finding the best crops with the highest aesthetic quality in an image. Recently, many deep learning methods have been proposed to address this problem, but they did not reveal the intrinsic mechanism of aesthetic evaluation. In this paper, we propose an interpretable image cropping model to unveil the mystery. For each image, we use a fully convolutional network to produce an aesthetic score map, which is shared among all candidate crops during crop-level aesthetic evaluation. Then, we require the aesthetic score map to be both composition-aware and saliency-aware. In particular, the same region is assigned with different aesthetic scores based on its relative positions in different crops. Moreover, a visually salient region is supposed to have more sensitive aesthetic scores so that our network can learn to place salient objects at more proper positions. Such an aesthetic score map can be used to localize aesthetically important regions in an image, which sheds light on the composition rules learned by our model. We show the competitive performance of our model in the image cropping task on several benchmark datasets, and also demonstrate its generality in real-world applications.
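
A toy sketch of reusing one shared aesthetic score map across candidate crops: each crop's score is aggregated from the map inside its rectangle. Plain averaging is a simplification of the paper's composition-aware scoring, in which a region's score also depends on its relative position within each crop.

```python
import torch

def crop_score(score_map, crop):
    """score_map: H x W aesthetic score map; crop: (top, left, height, width)."""
    t, l, h, w = crop
    return score_map[t:t + h, l:l + w].mean().item()

# The score map would come from a fully convolutional network run once per image;
# here a random map stands in so the ranking step can be demonstrated.
score_map = torch.rand(32, 32)
candidates = [(0, 0, 24, 24), (4, 4, 24, 24), (8, 8, 20, 20)]
best = max(candidates, key=lambda c: crop_score(score_map, c))
print(best)
```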

Posted Content
27 Jun 2019
TL;DR: This paper relies on depth information to identify the hard pixels that are difficult to classify, using the proposed Depth Prediction Error (DPE) and Depth-aware Segmentation Error (DSE), and learns a more robust model by paying more attention to the identified hard pixels.
Abstract: It has been shown that incorporating depth features into RGB features helps improve semantic segmentation. However, depth information is usually unavailable for the test images. In this paper, we leverage only the depth of training images as the privileged information to mine the hard pixels in semantic segmentation. Specifically, we propose a novel Loss Weight Module (LWM), which outputs a loss weight map by employing two depth-related measurements of hard pixels: Depth Prediction Error (DPE) and Depth-aware Segmentation Error (DSE). The loss weight map is then applied to the segmentation loss, aimed at learning a more robust model by paying more attention to the hard pixels. Besides, we also explore a curriculum learning strategy based on the loss weight map. Meanwhile, to fully mine the hard pixels on different scales, we apply our loss weight module to multi-scale side outputs. Our hard pixel mining method achieves state-of-the-art results on two benchmark datasets, and even outperforms methods which need depth input during testing.

Posted Content
TL;DR: This work proposes a unified framework consisting of a Visual Representation Enhancement (VRE) module and a Motion Representation Augmentation (MRA) module, where the VRE module includes a proxy task that imposes a pseudo motion label constraint and a temporal coherence constraint on unlabeled videos.
Abstract: Static image action recognition, which aims to recognize an action based on a single image, usually relies on expensive human labeling effort such as adequate labeled action images and large-scale labeled image datasets. In contrast, abundant unlabeled videos can be economically obtained. Therefore, several works have explored using unlabeled videos to facilitate image action recognition, which can be categorized into the following two groups: (a) enhance visual representations of action images with a designed proxy task on unlabeled videos, which falls into the scope of self-supervised learning; (b) generate auxiliary representations for action images with a generator learned from unlabeled videos. In this paper, we integrate the above two strategies in a unified framework, which consists of a Visual Representation Enhancement (VRE) module and a Motion Representation Augmentation (MRA) module. Specifically, the VRE module includes a proxy task which imposes a pseudo motion label constraint and a temporal coherence constraint on unlabeled videos, while the MRA module predicts the motion information of a static action image by exploiting unlabeled videos. We demonstrate the superiority of our framework on four benchmark human action datasets with limited labeled data.

Book ChapterDOI
12 Dec 2019
TL;DR: The proposed three-step framework reconstructs the 3D human skeleton for each person from the detected 2D human joints by using pre-learned base poses and considering temporal smoothness.
Abstract: This article tackles the problem of multi-person 3D human pose estimation based on a monocular image sequence in a three-step framework: (1) we detect 2D human skeletons in each frame across the image sequence; (2) we track each person through the image sequence and identify the sequence of 2D skeletons for each person; (3) we reconstruct the 3D human skeleton for each person from the detected 2D human joints, by using pre-learned base poses and considering temporal smoothness. We evaluate our framework on the Human3.6M dataset and a multi-person image sequence captured by us. The quantitative results on the Human3.6M dataset and the qualitative results on our constructed test data demonstrate the effectiveness of our proposed method.
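
A toy sketch of step (3): each frame's 3D skeleton is expressed as a linear combination of pre-learned base poses, fit so that its projection matches the detected 2D joints while a smoothness penalty couples consecutive frames. The orthographic projection, optimizer, and penalty weights are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def reconstruct(joints_2d, bases, smooth=0.1, steps=200):
    """joints_2d: T x J x 2 detected joints, bases: K x J x 3 pre-learned base poses."""
    T, K = joints_2d.size(0), bases.size(0)
    coeff = torch.zeros(T, K, requires_grad=True)        # per-frame base-pose coefficients
    opt = torch.optim.Adam([coeff], lr=0.1)
    for _ in range(steps):
        poses = torch.einsum("tk,kjd->tjd", coeff, bases)  # T x J x 3 reconstructed poses
        proj = poses[..., :2]                               # orthographic 2D projection
        loss = ((proj - joints_2d) ** 2).mean()             # reprojection error
        loss = loss + smooth * ((coeff[1:] - coeff[:-1]) ** 2).mean()  # temporal smoothness
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.einsum("tk,kjd->tjd", coeff.detach(), bases)

# Toy usage: 5 frames, 17 joints, 10 base poses.
poses_3d = reconstruct(torch.randn(5, 17, 2), torch.randn(10, 17, 3))
print(poses_3d.shape)   # torch.Size([5, 17, 3])
```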

Posted Content
28 Aug 2019
TL;DR: This work creates synthesized composite images based on the existing COCO (resp., Adobe5k, day2night) datasets, leading to the HCOCO (resp., HAdobe5k, Hday2night) sub-datasets, and generates synthesized composites based on collected Flickr images, leading to the HFlickr sub-dataset.
Abstract: Image composition is an important operation in image processing, but the inconsistency between foreground and background significantly degrades the quality of the composite image. Image harmonization, which aims to make the foreground compatible with the background, is a promising yet challenging task. However, the lack of a high-quality public dataset for image harmonization significantly hinders the development of image harmonization techniques. Therefore, we create synthesized composite images based on the existing COCO (resp., Adobe5k, day2night) datasets, leading to our HCOCO (resp., HAdobe5k, Hday2night) sub-datasets. To enrich the diversity of our datasets, we also generate synthesized composite images based on our collected Flickr images, leading to our HFlickr sub-dataset. All four sub-datasets are released at this https URL.