
Showing papers presented at "British Machine Vision Conference in 2018"


Proceedings Article
17 Jul 2018
TL;DR: Bottleneck Attention Module (BAM), as discussed by the authors, infers an attention map along two separate pathways, channel and spatial, constructs hierarchical attention at the bottlenecks of a model, and is trainable end-to-end jointly with any feed-forward model.
Abstract: Recent advances in deep neural networks have been developed via architecture search for stronger representational power. In this work, we focus on the effect of attention in general deep neural networks. We propose a simple and effective attention module, named Bottleneck Attention Module (BAM), that can be integrated with any feed-forward convolutional neural networks. Our module infers an attention map along two separate pathways, channel and spatial. We place our module at each bottleneck of models where the downsampling of feature maps occurs. Our module constructs a hierarchical attention at bottlenecks with a number of parameters and it is trainable in an end-to-end manner jointly with any feed-forward models. We validate our BAM through extensive experiments on CIFAR-100, ImageNet-1K, VOC 2007 and MS COCO benchmarks. Our experiments show consistent improvement in classification and detection performances with various models, demonstrating the wide applicability of BAM. The code and models will be publicly available.

463 citations
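As a concrete illustration of the two-pathway design described in the abstract above, the following is a minimal PyTorch sketch of a BAM-style block; the reduction ratio of 16 and dilation of 4 are illustrative assumptions, and this is not the authors' reference implementation.

import torch
import torch.nn as nn

class BAMSketch(nn.Module):
    def __init__(self, channels, reduction=16, dilation=4):
        super().__init__()
        hidden = channels // reduction
        # Channel pathway: global average pool followed by a small bottleneck MLP.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
        )
        # Spatial pathway: 1x1 reduction, two dilated 3x3 convs, 1x1 to a single map.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.Conv2d(hidden, hidden, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, x):
        # Broadcast-sum the (B,C,1,1) channel map and the (B,1,H,W) spatial map.
        att = torch.sigmoid(self.channel_att(x) + self.spatial_att(x))
        return x * (1.0 + att)  # residual-style refinement of the features

feat = torch.randn(2, 64, 32, 32)
print(BAMSketch(64)(feat).shape)  # torch.Size([2, 64, 32, 32])

Placed at each bottleneck where downsampling occurs, such a block refines features with only a small number of extra layers, in the spirit of the abstract.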


Proceedings Article
01 Jan 2018
TL;DR: The problem of Explainable AI for deep neural networks that take images as input and output a class probability is addressed and an approach called RISE that generates an importance map indicating how salient each pixel is for the model's prediction is proposed.
Abstract: Deep neural networks are being used increasingly to automate data analysis and decision making, yet their decision-making process is largely unclear and is difficult to explain to the end users. In this paper, we address the problem of Explainable AI for deep neural networks that take images as input and output a class probability. We propose an approach called RISE that generates an importance map indicating how salient each pixel is for the model's prediction. In contrast to white-box approaches that estimate pixel importance using gradients or other internal network state, RISE works on black-box models. It estimates importance empirically by probing the model with randomly masked versions of the input image and obtaining the corresponding outputs. We compare our approach to state-of-the-art importance extraction methods using both an automatic deletion/insertion metric and a pointing metric based on human-annotated object segments. Extensive experiments on several benchmark datasets show that our approach matches or exceeds the performance of other methods, including white-box approaches. Project page: this http URL

436 citations
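To make the black-box probing procedure more concrete, here is a rough PyTorch sketch of RISE-style saliency: the model is queried with randomly masked inputs and the masks are averaged, weighted by the class score. The mask count, grid resolution and keep-probability below are illustrative assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def rise_saliency(model, image, target_class, n_masks=500, grid=7, p_keep=0.5):
    # image: (1, 3, H, W) tensor; model: any classifier returning (1, num_classes) logits.
    _, _, h, w = image.shape
    # Low-resolution random binary grids, upsampled to image size for smooth masks.
    grids = (torch.rand(n_masks, 1, grid, grid) < p_keep).float()
    masks = F.interpolate(grids, size=(h, w), mode="bilinear", align_corners=False)
    saliency = torch.zeros(h, w)
    with torch.no_grad():
        for m in masks:
            score = torch.softmax(model(image * m), dim=1)[0, target_class]
            saliency += score * m[0]          # accumulate mask weighted by output score
    return saliency / (n_masks * p_keep)      # per-pixel importance map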


Proceedings Article
01 Jan 2018
TL;DR: A generalized framework for few-shot semantic segmentation with an alternative training scheme based on prototype learning and metric learning is proposed, which outperforms the baselines by a large margin and shows comparable performance for 1-way few-shot semantic segmentation on the PASCAL VOC 2012 dataset.
Abstract: Semantic segmentation assigns a class label to each image pixel. This dense prediction problem requires large amounts of manually annotated data, which is often unavailable. Few-shot learning aims to learn the pattern of a new category with only a few annotated examples. In this paper, we formulate the few-shot semantic segmentation problem from 1-way (class) to N-way (classes). Inspired by few-shot classification, we propose a generalized framework for few-shot semantic segmentation with an alternative training scheme. The framework is based on prototype learning and metric learning. Our approach outperforms the baselines by a large margin and shows comparable performance for 1-way few-shot semantic segmentation on PASCAL VOC 2012 dataset.

316 citations
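The prototype-learning idea can be illustrated with a short sketch: a class prototype is obtained by masked average pooling of the support features, and query pixels are scored by similarity to it. The feature shapes and cosine-similarity scoring below are assumptions for illustration, not the paper's exact formulation.

import torch.nn.functional as F

def segment_query(support_feat, support_mask, query_feat):
    # support_feat, query_feat: (C, H, W) features; support_mask: (H, W) binary mask.
    mask = support_mask.unsqueeze(0)                                          # (1, H, W)
    # Masked average pooling -> one prototype vector for the support class.
    proto = (support_feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1)   # (C,)
    # Score every query pixel by cosine similarity to the prototype.
    sim = F.cosine_similarity(query_feat, proto[:, None, None], dim=0)        # (H, W)
    return sim  # threshold, or compare against a background prototype, downstream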


Proceedings Article
01 Jan 2018
TL;DR: The proposed multi-branch low-light enhancement network (MBLLEN) is found to outperform the state-of-the-art techniques by a large margin and can be directly extended to handle low-light videos.
Abstract: We present a deep learning based method for low-light image enhancement. This problem is challenging due to the difficulty in handling various factors simultaneously including brightness, contrast, artifacts and noise. To address this task, we propose the multi-branch low-light enhancement network (MBLLEN). The key idea is to extract rich features up to different levels, so that we can apply enhancement via multiple subnets and finally produce the output image via multi-branch fusion. In this manner, image quality is improved from different aspects. Through extensive experiments, our proposed MBLLEN is found to outperform the state-of-the-art techniques by a large margin. We additionally show that our method can be directly extended to handle low-light videos.

277 citations


Proceedings Article
01 Jan 2018
TL;DR: In this paper, a new technique for learning visual-semantic embeddings for cross-modal retrieval is proposed, inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions.
Abstract: We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).

257 citations
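The "simple change to common loss functions" is the key technical idea here; below is a sketch of a max-of-hinges ranking loss that keeps only the hardest in-batch negative, with the margin value chosen as an illustrative assumption.

import torch

def mh_ranking_loss(im_emb, cap_emb, margin=0.2):
    # im_emb, cap_emb: (B, D), assumed L2-normalized; matching pairs lie on the diagonal.
    scores = im_emb @ cap_emb.t()                      # (B, B) similarity matrix
    pos = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - pos).clamp(min=0)    # caption retrieval direction
    cost_im = (margin + scores - pos.t()).clamp(min=0) # image retrieval direction
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    # Keep only the hardest negative per positive pair instead of summing over all.
    return cost_cap.max(dim=1)[0].mean() + cost_im.max(dim=0)[0].mean()

Swapping a sum-of-hinges for this max-of-hinges term is the kind of drop-in change the abstract refers to; everything else in a standard embedding pipeline can stay the same.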


Proceedings Article
25 May 2018
TL;DR: A Pyramid Attention Network (PAN) is proposed to exploit global contextual information in semantic segmentation, achieving state-of-the-art performance on the PASCAL VOC 2012 and Cityscapes benchmarks with a new record mIoU of 84.0% on PASCAL VOC 2012.
Abstract: A Pyramid Attention Network (PAN) is proposed to exploit the impact of global contextual information in semantic segmentation. Different from most existing works, we combine attention mechanism and spatial pyramid to extract precise dense features for pixel labeling instead of complicated dilated convolution and artificially designed decoder networks. Specifically, we introduce a Feature Pyramid Attention module to perform spatial pyramid attention structure on high-level output and combine global pooling to learn a better feature representation, and a Global Attention Upsample module on each decoder layer to provide global context as guidance for low-level features to select category localization details. The proposed approach achieves state-of-the-art performance on PASCAL VOC 2012 and Cityscapes benchmarks with a new record of mIoU accuracy 84.0% on PASCAL VOC 2012, while training without the COCO dataset.

235 citations
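As a rough sketch of the Global Attention Upsample idea, the module below gates low-level features with globally pooled high-level context before fusing them; the channel sizes and the exact convolution/normalization layout are assumptions, not the paper's implementation.

import torch.nn as nn
import torch.nn.functional as F

class GAUSketch(nn.Module):
    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.low_conv = nn.Conv2d(low_ch, high_ch, 3, padding=1)
        # Global context from the high-level branch becomes per-channel gates.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(high_ch, high_ch, 1),
            nn.BatchNorm2d(high_ch),
            nn.Sigmoid(),
        )

    def forward(self, low, high):
        low = self.low_conv(low)
        weighted = low * self.gate(high)              # global context guides low-level features
        high_up = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                                align_corners=False)  # upsample high-level features
        return high_up + weighted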


Proceedings Article
14 Aug 2018
TL;DR: Extensive experiments demonstrate that the proposed deep Retinex-Net learned on this LOw-Light dataset not only achieves visually pleasing quality for low-light enhancement but also provides a good representation of image decomposition.
Abstract: Retinex model is an effective tool for low-light image enhancement. It assumes that observed images can be decomposed into the reflectance and illumination. Most existing Retinex-based methods have carefully designed hand-crafted constraints and parameters for this highly ill-posed decomposition, which may be limited by model capacity when applied in various scenes. In this paper, we collect a LOw-Light dataset (LOL) containing low/normal-light image pairs and propose a deep Retinex-Net learned on this dataset, including a Decom-Net for decomposition and an Enhance-Net for illumination adjustment. In the training process for Decom-Net, there is no ground truth of decomposed reflectance and illumination. The network is learned with only key constraints including the consistent reflectance shared by paired low/normal-light images, and the smoothness of illumination. Based on the decomposition, subsequent lightness enhancement is conducted on illumination by an enhancement network called Enhance-Net, and for joint denoising there is a denoising operation on reflectance. The Retinex-Net is end-to-end trainable, so that the learned decomposition is by nature good for lightness adjustment. Extensive experiments demonstrate that our method not only achieves visually pleasing quality for low-light enhancement but also provides a good representation of image decomposition.

213 citations
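The key training constraints described above (reconstruction from each decomposition, a reflectance shared across the low/normal-light pair, and illumination smoothness) can be sketched as a single loss function; the loss weights and the total-variation smoothness term below are illustrative assumptions, not the paper's exact values.

import torch.nn.functional as F

def decom_losses(r_low, i_low, r_high, i_high, low_img, high_img,
                 w_consist=0.01, w_smooth=0.1):
    # r_*: (B, 3, H, W) reflectance; i_*: (B, 1, H, W) illumination; *_img: (B, 3, H, W) inputs.
    # Each input should be reconstructed by its own reflectance x illumination.
    recon = F.l1_loss(r_low * i_low, low_img) + F.l1_loss(r_high * i_high, high_img)
    # Paired low/normal-light images are constrained to share the same reflectance.
    consist = F.l1_loss(r_low, r_high)
    def tv(x):  # simple total-variation smoothness penalty
        return (x[..., :, 1:] - x[..., :, :-1]).abs().mean() + \
               (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    smooth = tv(i_low) + tv(i_high)
    return recon + w_consist * consist + w_smooth * smooth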


Proceedings Article
01 Jan 2018
TL;DR: An instance-centric attention module is proposed that learns to dynamically highlight regions in an image conditioned on the appearance of each instance for detecting human-object interactions.
Abstract: Recent years have witnessed rapid progress in detecting and recognizing individual object instances. To understand the situation in a scene, however, computers need to recognize how humans interact with surrounding objects. In this paper, we tackle the challenging task of detecting human-object interactions (HOI). Our core idea is that the appearance of a person or an object instance contains informative cues on which relevant parts of an image to attend to for facilitating interaction prediction. To exploit these cues, we propose an instance-centric attention module that learns to dynamically highlight regions in an image conditioned on the appearance of each instance. Such an attention-based network allows us to selectively aggregate features relevant for recognizing HOIs. We validate the efficacy of the proposed network on the Verbs in COCO and HICO-DET datasets and show that our approach compares favorably with the state of the art.

205 citations


Proceedings Article
01 Jan 2018
TL;DR: A novel two-step convolutional neural network is proposed to estimate a heart rate from a sequence of facial images to test the robustness of heart rate estimation methods to illumination changes and subject’s motion.
Abstract: We propose a novel two-step convolutional neural network to estimate a heart rate from a sequence of facial images. The network is trained end-to-end by alternating optimization and validated on three publicly available datasets, yielding state-of-the-art results against three baseline methods. The network outperforms the state-of-the-art method by a 40% margin on a newly collected dataset. A challenging dataset of 204 fitness-themed videos is introduced. The dataset is designed to test the robustness of heart rate estimation methods to illumination changes and subject motion. 17 subjects perform 4 activities (talking, rowing, exercising on a stationary bike and an elliptical trainer) in 3 lighting setups. Each activity is captured by two RGB web-cameras; one is placed on a tripod, the other is attached to the fitness machine, which vibrates significantly. Subjects' ages range from 20 to 53 years, the mean heart rate is ≈ 110, and the standard deviation is ≈ 25.

133 citations


Proceedings Article
01 Feb 2018
TL;DR: An online optimization framework builds the association of cross-frame poses and forms pose flows (PF-Builder), and a novel pose flow non-maximum suppression (PF-NMS) is designed to robustly reduce redundant pose flows and re-link temporally disjoint ones.
Abstract: Multi-person articulated pose tracking in unconstrained videos is an important yet challenging problem. In this paper, going along the road of top-down approaches, we propose a decent and efficient pose tracker based on pose flows. First, we design an online optimization framework to build the association of cross-frame poses and form pose flows (PF-Builder). Second, a novel pose flow non-maximum suppression (PF-NMS) is designed to robustly reduce redundant pose flows and re-link temporally disjoint ones. Extensive experiments show that our method significantly outperforms the best-reported results on two standard Pose Tracking datasets, by 13 mAP / 25 MOTA and 6 mAP / 3 MOTA respectively. Moreover, when working on detected poses in individual frames, the extra computation of the pose tracker is very minor, guaranteeing online 10 FPS tracking. Our source codes are made publicly available (this https URL).

127 citations


Proceedings Article
01 Jan 2018
TL;DR: An unsupervised Multi-task Mid-level Feature Alignment (MMFA) network is developed for the cross-dataset person re-identification task.
Abstract: Most existing person re-identification (Re-ID) approaches follow a supervised learning framework, in which a large number of labelled matching pairs are required for training. Such a setting severely limits their scalability in real-world applications where no labelled samples are available during the training phase. To overcome this limitation, we develop a novel unsupervised Multi-task Mid-level Feature Alignment (MMFA) network for the unsupervised cross-dataset person re-identification task. Under the assumption that the source and target datasets share the same set of mid-level semantic attributes, our proposed model can be jointly optimised under the person's identity classification and the attribute learning task with a cross-dataset mid-level feature alignment regularisation term. In this way, the learned feature representation can be better generalised from one dataset to another, which further improves the person re-identification accuracy. Experimental results on four benchmark datasets demonstrate that our proposed method outperforms the state-of-the-art baselines.

Proceedings Article
01 Jan 2018
TL;DR: A novel active learning method is developed which poses the layered architecture used in object detection as a ‘query by committee’ paradigm to choose the set of images to be queried and these methods outperform classical uncertainty-based active learning algorithms like maximum entropy.
Abstract: Object detection methods like Single Shot Multibox Detector (SSD) provide highly accurate object detection that run in real-time. However, these approaches require a large number of annotated training images. Evidently, not all of these images are equally useful for training the algorithms. Moreover, obtaining annotations in terms of bounding boxes for each image is costly and tedious. In this paper, we aim to obtain a highly accurate object detector using only a fraction of the training images. We do this by adopting active learning that uses ‘human in the loop’ paradigm to select the set of images that would be useful if annotated. Towards this goal, we make the following contributions: 1. We develop a novel active learning method which poses the layered architecture used in object detection as a ‘query by committee’ paradigm to choose the set of images to be queried. 2. We introduce a framework to use the exploration/exploitation trade-off in our methods. 3. We analyze the results on standard object detection datasets which show that with only a third of the training data, we can obtain more than 95% of the localization accuracy of full supervision. Further our methods outperform classical uncertainty-based active learning algorithms like maximum entropy.
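A generic sketch of query-by-committee scoring for active learning is shown below: unlabeled images are ranked by how much a committee of predictors disagrees about them. Treating the detector's layers as the committee members, as the paper does, is abstracted away here, and the KL-based disagreement measure is an assumption for illustration.

import torch

def committee_disagreement(member_probs):
    # member_probs: (M, N, C) class probabilities from M committee members for N samples.
    consensus = member_probs.mean(dim=0)                               # (N, C)
    # Average KL divergence of each member from the consensus (a common QBC measure).
    kl = (member_probs * (member_probs.clamp_min(1e-8).log()
                          - consensus.clamp_min(1e-8).log())).sum(-1)  # (M, N)
    return kl.mean(dim=0)                                              # higher = more informative

def select_to_label(member_probs, budget):
    # Pick the 'budget' most-disagreed-upon samples to send for annotation.
    scores = committee_disagreement(member_probs)
    return torch.topk(scores, k=budget).indices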

Proceedings Article
01 May 2018
TL;DR: QuaterNet as discussed by the authors represents rotations with quaternions and performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors for short-term pose prediction.
Abstract: Deep learning for predicting or generating 3D human pose sequences is an active research area. Previous work regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angle or exponential map parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configurations. This work addresses both limitations. Our recurrent network, QuaterNet, represents rotations with quaternions and our loss function performs forward kinematics on a skeleton to penalize absolute position errors instead of angle errors. On short-term predictions, QuaterNet improves the state-of-the-art quantitatively. For long-term generation, our approach is qualitatively judged as realistic as recent neural strategies from the graphics literature.
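The core idea of penalizing positions rather than angles can be sketched as follows: predicted unit quaternions are chained along a skeleton by forward kinematics and the loss is taken on the resulting joint positions. The simple chain topology, bone offsets and (w, x, y, z) quaternion convention are illustrative assumptions, not the paper's full model.

import torch

def qmul(q, r):
    # Hamilton product of quaternions stored as (..., 4) in (w, x, y, z) order.
    w1, x1, y1, z1 = q.unbind(-1)
    w2, x2, y2, z2 = r.unbind(-1)
    return torch.stack([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2], dim=-1)

def qrot(q, v):
    # Rotate 3-vectors v by unit quaternions q via q * (0, v) * conj(q).
    qv = torch.cat([torch.zeros_like(v[..., :1]), v], dim=-1)
    qc = q * torch.tensor([1.0, -1.0, -1.0, -1.0])
    return qmul(qmul(q, qv), qc)[..., 1:]

def fk_position_loss(pred_q, target_pos, offsets):
    # pred_q: (J, 4) local rotations along a simple chain; offsets: (J, 3) bone vectors.
    pos, world_q = torch.zeros(3), torch.tensor([1.0, 0.0, 0.0, 0.0])
    positions = []
    for j in range(pred_q.shape[0]):
        world_q = qmul(world_q, pred_q[j] / pred_q[j].norm())  # accumulate rotation
        pos = pos + qrot(world_q, offsets[j])                  # accumulate joint position
        positions.append(pos)
    return torch.nn.functional.mse_loss(torch.stack(positions), target_pos)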

Proceedings Article
01 Jan 2018
TL;DR: In this paper, a large-scale ASL data set was proposed, which covers over 200 signers, signer independent sets, challenging and unconstrained recording conditions and a large class count of 1000 signs.
Abstract: Computer Vision has improved significantly in the past few decades. It has enabled machines to do many human tasks. However, the real challenge is in enabling machines to carry out tasks that an average human does not have the skills for. One such challenge that we have tackled in this paper is providing accessibility for deaf individuals by providing a means of communication with others with the aid of computer vision. Unlike other frequent works focusing on multiple cameras, depth cameras, electrical gloves or visual gloves, we focus on the sole use of RGB, which allows everybody to communicate with a deaf individual through their personal devices. This is not a new approach, but the lack of a realistic large-scale data set prevented recent computer vision trends on video classification in this field. In this paper, we propose the first large-scale ASL data set that covers over 200 signers, signer-independent sets, challenging and unconstrained recording conditions and a large class count of 1000 signs. We evaluate baselines from action recognition techniques on the data set. We propose I3D, known from video classification, as a powerful and suitable architecture for sign language recognition. We also propose a new pre-trained model more appropriate for sign language recognition. Finally, we estimate the effect of the number of classes and the number of training samples on the recognition accuracy.

Proceedings Article
20 Nov 2018
TL;DR: The orthographic feature transform as discussed by the authors maps image-based features into an orthographic 3D space to reason holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful.
Abstract: 3D object detection from monocular images has proven to be an enormously challenging task, with the performance of leading systems not yet achieving even 10% of that of LiDAR-based counterparts. One explanation for this performance gap is that existing systems are entirely at the mercy of the perspective image-based representation, in which the appearance and scale of objects varies drastically with depth and meaningful distances are difficult to infer. In this work we argue that the ability to reason about the world in 3D is an essential element of the 3D object detection task. To this end, we introduce the orthographic feature transform, which enables us to escape the image domain by mapping image-based features into an orthographic 3D space. This allows us to reason holistically about the spatial configuration of the scene in a domain where scale is consistent and distances between objects are meaningful. We apply this transformation as part of an end-to-end deep learning architecture and achieve state-of-the-art performance on the KITTI 3D object benchmark. We will release full source code and pretrained models upon acceptance of this manuscript for publication.
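An illustrative sketch of mapping image features into an orthographic (bird's-eye-view) grid follows: each ground-plane cell is projected into the image with a pinhole model and the feature at that pixel is sampled. The grid extent, ground height and intrinsics handling are assumptions, and the paper's aggregation over full voxel regions is omitted.

import torch

def orthographic_transform(feat, K, grid_x, grid_z, y_ground=1.6):
    # feat: (C, H, W) image features; K: (3, 3) camera intrinsics;
    # grid_x / grid_z: 1D tensors of lateral / forward coordinates (metres) of the BEV grid.
    C, H, W = feat.shape
    zs, xs = torch.meshgrid(grid_z, grid_x, indexing="ij")
    pts = torch.stack([xs, torch.full_like(xs, y_ground), zs], dim=-1)   # (Z, X, 3) world points
    uvw = pts @ K.t()                                                    # pinhole projection
    u = (uvw[..., 0] / uvw[..., 2]).round().long().clamp(0, W - 1)
    v = (uvw[..., 1] / uvw[..., 2]).round().long().clamp(0, H - 1)
    return feat[:, v, u]    # (C, Z, X) bird's-eye-view feature map sampled from the image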

Proceedings Article
01 Jan 2018
TL;DR: In most computer vision applications, convolutional neural networks (CNNs) operate on dense image data generated by ordinary cameras as discussed by the authors, and designing CNNs for sparse and irregularly spaced input data is...
Abstract: In most computer vision applications, convolutional neural networks (CNNs) operate on dense image data generated by ordinary cameras. Designing CNNs for sparse and irregularly spaced input data is ...

Proceedings Article
09 May 2018
TL;DR: This paper proposes to introduce semantic segmentation information, which disentangles the inter-class difference and intra-class variation for image inpainting, leading to much clearer recovered boundaries between semantically different regions and better texture within semantically consistent segments.
Abstract: In this paper, we focus on image inpainting task, aiming at recovering the missing area of an incomplete image given the context information. Recent development in deep generative models enables an efficient end-to-end framework for image synthesis and inpainting tasks, but existing methods based on generative models don't exploit the segmentation information to constrain the object shapes, which usually lead to blurry results on the boundary. To tackle this problem, we propose to introduce the semantic segmentation information, which disentangles the inter-class difference and intra-class variation for image inpainting. This leads to much clearer recovered boundary between semantically different regions and better texture within semantically consistent segments. Our model factorizes the image inpainting process into segmentation prediction (SP-Net) and segmentation guidance (SG-Net) as two steps, which predict the segmentation labels in the missing area first, and then generate segmentation guided inpainting results. Experiments on multiple public datasets show that our approach outperforms existing methods in optimizing the image inpainting quality, and the interactive segmentation guidance provides possibilities for multi-modal predictions of image inpainting.

Proceedings Article
22 Feb 2018
TL;DR: It is shown that the proposed discriminator can be used to improve semantic segmentation accuracy by coupling the adversarial loss with the standard cross entropy loss of the proposed model.
Abstract: We propose a method for semi-supervised semantic segmentation using an adversarial network. While most existing discriminators are trained to classify input images as real or fake on the image level, we design a discriminator in a fully convolutional manner to differentiate the predicted probability maps from the ground truth segmentation distribution with the consideration of the spatial resolution. We show that the proposed discriminator can be used to improve semantic segmentation accuracy by coupling the adversarial loss with the standard cross entropy loss of the proposed model. In addition, the fully convolutional discriminator enables semi-supervised learning through discovering the trustworthy regions in predicted results of unlabeled images, thereby providing additional supervisory signals. In contrast to existing methods that utilize weakly-labeled images, our method leverages unlabeled images to enhance the segmentation model. Experimental results on the PASCAL VOC 2012 and Cityscapes datasets demonstrate the effectiveness of the proposed algorithm.
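The coupling of the standard cross-entropy loss with the adversarial term from a fully convolutional discriminator can be sketched roughly as below; the networks and the adversarial weight are placeholders, not the paper's configuration.

import torch
import torch.nn.functional as F

def segmentation_loss(seg_net, disc_net, image, label, adv_weight=0.01):
    # image: (B, 3, H, W); label: (B, H, W) long tensor of class indices.
    logits = seg_net(image)                              # (B, C, H, W) class scores
    ce = F.cross_entropy(logits, label)                  # standard supervised term
    prob = torch.softmax(logits, dim=1)
    d_out = disc_net(prob)                               # (B, 1, H', W') per-region confidence map
    # Encourage predictions the discriminator cannot tell apart from ground-truth maps.
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return ce + adv_weight * adv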

Proceedings Article
29 Jul 2018
TL;DR: Tiny-DSOD as discussed by the authors introduces two innovative and ultra-efficient architecture blocks: depthwise dense block (DDB) based backbone and depthwise feature pyramid network (D-FPN) based front-end.
Abstract: Object detection has made great progress in the past few years along with the development of deep learning. However, most current object detection methods are resource hungry, which hinders their wide deployment to many resource-restricted usages such as always-on devices, battery-powered low-end devices, etc. This paper considers the resource and accuracy trade-off for resource-restricted usages when designing the whole object detection framework. Based on the deeply supervised object detection (DSOD) framework, we propose Tiny-DSOD, dedicated to resource-restricted usages. Tiny-DSOD introduces two innovative and ultra-efficient architecture blocks: a depthwise dense block (DDB) based backbone and a depthwise feature-pyramid-network (D-FPN) based front-end. We conduct extensive experiments on three famous benchmarks (PASCAL VOC 2007, KITTI, and COCO), and compare Tiny-DSOD to the state-of-the-art ultra-efficient object detection solutions such as Tiny-YOLO, MobileNet-SSD (v1 & v2), SqueezeDet, Pelee, etc. Results show that Tiny-DSOD outperforms these solutions in all three metrics (parameter size, FLOPs, accuracy) in each comparison. For instance, Tiny-DSOD achieves 72.1% mAP with only 0.95M parameters and 1.06B FLOPs, which is by far the state-of-the-art result with such a low resource requirement.
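A rough sketch of a depthwise dense block in the spirit described above is given here: each layer produces a few cheap feature maps with a pointwise + depthwise pair and concatenates them to the running feature stack. The growth rate and depth are illustrative assumptions.

import torch
import torch.nn as nn

class DepthwiseDenseBlockSketch(nn.Module):
    def __init__(self, in_ch, growth=32, layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 1, bias=False),                   # cheap pointwise projection
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
                nn.Conv2d(growth, growth, 3, padding=1, groups=growth,  # depthwise 3x3
                          bias=False),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True),
            ))
            ch += growth

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # dense connectivity: concatenate new features
        return x

print(DepthwiseDenseBlockSketch(64)(torch.randn(1, 64, 56, 56)).shape)  # (1, 192, 56, 56)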

Proceedings Article
01 Jan 2018
TL;DR: This paper proposes a novel perspective to establish the connection between the heuristic unmasking procedure and multiple classifier two sample tests (MC2ST) in statistical machine learning, and presents a history sampling method to increase the testing power as well as to improve the performance on video anomaly detection.
Abstract: In this paper, we study challenging anomaly detection in streaming videos under fully unsupervised settings. Unsupervised unmasking methods [12] have recently been applied to anomaly detection; however, the theoretical understanding of them is still limited. Aiming to understand and improve this method, we propose a novel perspective to establish the connection between the heuristic unmasking procedure and multiple classifier two sample tests (MC2ST) in statistical machine learning. Based on our analysis of the testing power of MC2ST, we present a history sampling method to increase the testing power as well as to improve the performance on video anomaly detection. We also offer a new frame-level motion feature that has better representation and generalization ability, and obtain improvements on several video benchmark datasets. The code can be found at https://github.com/MYusha/Video-Anomaly-Detection.

Proceedings Article
01 Aug 2018
TL;DR: A data augmentation approach based on generating artificial images with conditional Generative Adversarial Networks (cGANs) is proposed that is capable of generating realistic images by first producing a mask and subsequently using the mask as input to the cGANs.
Abstract: Deep Learning models are being applied to address plant phenotyping problems such as leaf segmentation and leaf counting. Training these models requires large annotated datasets of plant images, which, in many cases, are not readily available. We address the problem of data scarcity by proposing a data augmentation approach based on generating artificial images using conditional Generative Adversarial Networks (cGANs). Our model is trained by conditioning on the leaf segmentation mask of plants with the aim to generate corresponding, realistic, plant images. We also provide a novel method to create the input masks. The proposed system is thus capable of generating realistic images by first producing a mask, and subsequently using the mask as input to the cGANs. We evaluated the impact of the data augmentation on the leaf counting performance of the Mask R-CNN model. The average leaf counting error is reduced by 16.67% when we augment the training set with the generated data.

Proceedings Article
01 Jun 2018
TL;DR: It is empirically demonstrated that the combination of low-rank and sparse kernels boosts performance, and that the proposed approach is superior to the state-of-the-art IGCV2 and MobileNetV2 on image classification on CIFAR and ImageNet and on object detection on COCO.
Abstract: In this paper, we are interested in building lightweight and efficient convolutional neural networks. Inspired by the success of two design patterns, composition of structured sparse kernels, e.g., interleaved group convolutions (IGC), and composition of low-rank kernels, e.g., bottle-neck modules, we study the combination of such two design patterns, using the composition of structured sparse low-rank kernels, to form a convolutional kernel. Rather than introducing a complementary condition over channels, we introduce a loose complementary condition, which is formulated by imposing the complementary condition over super-channels, to guide the design for generating a dense convolutional kernel. The resulting network is called IGCV3. We empirically demonstrate that the combination of low-rank and sparse kernels boosts the performance and the superiority of our proposed approach to the state-of-the-arts, IGCV2 and MobileNetV2 over image classification on CIFAR and ImageNet and object detection on COCO. Code and models are available at https://github.com/homles11/IGCV3.
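A loose sketch of composing structured sparse (grouped) and low-rank (expand/reduce) kernels follows, with a channel permutation between the grouped 1x1 convolutions; the group counts, expansion ratio and residual connection are illustrative assumptions rather than the exact IGCV3 block.

import torch
import torch.nn as nn

def channel_shuffle(x, groups):
    # Permute channels so that subsequent grouped convolutions mix across groups.
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class IGCV3LikeBlock(nn.Module):
    def __init__(self, ch, expand=6, g1=2, g2=2):
        super().__init__()
        mid = ch * expand
        self.expand = nn.Conv2d(ch, mid, 1, groups=g1, bias=False)   # grouped low-rank expansion
        self.dw = nn.Conv2d(mid, mid, 3, padding=1, groups=mid, bias=False)  # depthwise spatial conv
        self.reduce = nn.Conv2d(mid, ch, 1, groups=g2, bias=False)   # grouped low-rank reduction
        self.g1 = g1

    def forward(self, x):
        out = self.expand(x)
        out = channel_shuffle(out, self.g1)   # loose complementary mixing across groups
        out = self.dw(out)
        out = self.reduce(out)
        return x + out                        # residual connection

print(IGCV3LikeBlock(32)(torch.randn(1, 32, 28, 28)).shape)  # (1, 32, 28, 28)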

Proceedings Article
24 Jul 2018
TL;DR: In this paper, a unified framework for class-specific 3D reconstruction from a single image and generation of new 3D shape samples is presented; unlike previous works, it does not rely on 3D supervision, annotation of 2D images with keypoints or poses, or training with multiple views of each object instance.
Abstract: We present a unified framework tackling two problems: class-specific 3D reconstruction from a single image, and generation of new 3D shape samples. These tasks have received considerable attention recently; however, existing approaches rely on 3D supervision, annotation of 2D images with keypoints or poses, and/or training with multiple views of each object instance. Our framework is very general: it can be trained in similar settings to these existing approaches, while also supporting weaker supervision scenarios. Importantly, it can be trained purely from 2D images, without ground-truth pose annotations, and with a single view per instance. We employ meshes as an output representation, instead of voxels used in most prior work. This allows us to exploit shading information during training, which previous 2D-supervised methods cannot. Thus, our method can learn to generate and reconstruct concave object classes. We evaluate our approach on synthetic data in various settings, showing that (i) it learns to disentangle shape from pose; (ii) using shading in the loss improves performance; (iii) our model is comparable or superior to state-of-the-art voxel-based approaches on quantitative metrics, while producing results that are visually more pleasing; (iv) it still performs well when given supervision weaker than in prior works.

Proceedings Article
01 Jan 2018
TL;DR: Rather than using heuristic click sampling strategies to emulate user clicks during training, an iterative training strategy is proposed in which clicks are iteratively added based on the errors of the currently predicted segmentation.
Abstract: Deep learning requires large amounts of training data to be effective. For the task of object segmentation, manually labeling data is very expensive, and hence interactive methods are needed. Following recent approaches, we develop an interactive object segmentation system which uses user input in the form of clicks as the input to a convolutional network. While previous methods use heuristic click sampling strategies to emulate user clicks during training, we propose a new iterative training strategy. During training, we iteratively add clicks based on the errors of the currently predicted segmentation. We show that our iterative training strategy, together with additional improvements to the network architecture, leads to improved results over the state-of-the-art.
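The iterative strategy can be sketched schematically: at each round, one additional click is sampled from the currently mislabeled region and fed back to the network. The click selection (a random error pixel) and the model interface below are placeholders for illustration, not the paper's exact procedure.

import torch

def sample_click_from_errors(pred_mask, gt_mask):
    # Return the (y, x) coordinates of one mislabeled pixel (here: a random one), or None.
    errors = (pred_mask != gt_mask)
    idx = errors.nonzero(as_tuple=False)
    if idx.numel() == 0:
        return None
    return idx[torch.randint(len(idx), (1,))].squeeze(0)

def iterative_clicks(model, image, gt_mask, clicks, rounds=3, threshold=0.5):
    # model(image, clicks) is assumed to return a single-channel logit map; gt_mask: (H, W).
    for _ in range(rounds):
        pred = model(image, clicks)
        pred_mask = (torch.sigmoid(pred).squeeze() > threshold).long()
        click = sample_click_from_errors(pred_mask, gt_mask)
        if click is None:
            break
        is_fg = bool(gt_mask[tuple(click)])             # positive or negative click
        clicks = clicks + [(click.tolist(), is_fg)]     # extend the simulated interaction
    return clicks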

Proceedings Article
14 Aug 2018
TL;DR: A network fusion architecture is proposed, which consists of a multispectral proposal network to generate pedestrian proposals, and a subsequent multispectral classification network to distinguish pedestrian instances from hard negatives and significantly outperforms state-of-the-art methods on the KAIST dataset.
Abstract: Multispectral pedestrian detection has attracted increasing attention from the research community due to its crucial competence for many around-the-clock applications (e.g., video surveillance and autonomous driving), especially under insufficient illumination conditions. We create a human baseline over the KAIST dataset and reveal that there is still a large gap between current top detectors and human performance. To narrow this gap, we propose a network fusion architecture, which consists of a multispectral proposal network to generate pedestrian proposals, and a subsequent multispectral classification network to distinguish pedestrian instances from hard negatives. The unified network is learned by jointly optimizing pedestrian detection and semantic segmentation tasks. The final detections are obtained by integrating the outputs from different modalities as well as the two stages. The approach significantly outperforms state-of-the-art methods on the KAIST dataset while remaining fast. Additionally, we contribute a sanitized version of training annotations for the KAIST dataset, and examine the effects caused by different kinds of annotation errors. Future research on this problem will benefit from the sanitized version, which eliminates the interference of annotation errors.

Proceedings Article
26 Jul 2018
TL;DR: In this paper, a generative adversarial network (GAN) is proposed to predict visually appealing alphas with the addition of the adversarial loss from the discriminator that is trained to classify well-composited images.
Abstract: We present the first generative adversarial network (GAN) for natural image matting. Our novel generator network is trained to predict visually appealing alphas with the addition of the adversarial loss from the discriminator that is trained to classify well-composited images. Further, we improve existing encoder-decoder architectures to better deal with the spatial localization issues inherited in convolutional neural networks (CNN) by using dilated convolutions to capture global context information without downscaling feature maps and losing spatial information. We present state-of-the-art results on the alphamatting online benchmark for the gradient error and give comparable results in others. Our method is particularly well suited for fine structures like hair, which is of great importance in practical matting applications, e.g. in film/TV production.

Proceedings Article
01 Jan 2018
TL;DR: This paper investigates the problem of Domain Shift in action videos, an area that has remained under-explored, and proposes two new approaches named Action Modeling on Latent Subspace (AMLS) and Deep Adversarial Action Adaptation (DAAA).
Abstract: In the general settings of supervised learning, human action recognition has been a widely studied topic. The classifiers learned in this setting assume that the training and test data have been sampled from the same underlying probability distribution. However, in most of the practical scenarios, this assumption is not true, resulting in a suboptimal performance of the classifiers. This problem, referred to as Domain Shift, has been extensively studied, but mostly for image/object classification task. In this paper, we investigate the problem of Domain Shift in action videos, an area that has remained under-explored, and propose two new approaches named Action Modeling on Latent Subspace (AMLS) and Deep Adversarial Action Adaptation (DAAA). In the AMLS approach, the action videos in the target domain are modeled as a sequence of points on a latent subspace and adaptive kernels are successively learned between the source domain point and the sequence of target domain points on the manifold. In the DAAA approach, an end-to-end adversarial learning framework is proposed to align the two domains. The action adaptation experiments were conducted using various combinations of multi-domain action datasets, including six common classes of Olympic Sports and UCF50 datasets and all classes of KTH, MSR and our own SonyCam datasets. In this paper, we have achieved consistent improvements over chosen baselines and obtained some state-of-the-art results for the datasets.

Proceedings Article
03 Sep 2018
TL;DR: A novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques, capable of producing sign videos from spoken language sentences.
Abstract: We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our system is capable of producing sign videos from spoken language sentences. Contrary to current approaches that are dependent on heavily annotated data, our approach requires minimal gloss and skeletal level annotations for training. We achieve this by breaking down the task into dedicated sub-processes. We first translate spoken language sentences into sign gloss sequences using an encoder-decoder network. We then find a data driven mapping between glosses and skeletal sequences. We use the resulting pose information to condition a generative model that produces sign language video sequences. We evaluate our approach on the recently released PHOENIX14T Sign Language Translation dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities of our approach by sharing qualitative results of generated sign sequences given their skeletal correspondence.

Proceedings ArticleDOI
03 Sep 2018
TL;DR: This paper proposes an event-driven OF algorithm called adaptive block-matching optical flow (ABMOF), which uses time slices of accumulated DVS events; both ABMOF and Lucas-Kanade (LK) algorithms are developed using the authors' adapted slices.
Abstract: Dynamic Vision Sensors (DVS) output asynchronous log intensity change events. They have potential applications in high-speed robotics, autonomous cars and drones. The precise event timing, sparse output, and wide dynamic range of the events are well suited for optical flow, but conventional optical flow (OF) algorithms are not well matched to the event stream data. This paper proposes an event-driven OF algorithm called adaptive block-matching optical flow (ABMOF). ABMOF uses time slices of accumulated DVS events. The time slices are adaptively rotated based on the input events and OF results. Compared with other methods such as gradient-based OF, ABMOF can efficiently be implemented in compact logic circuits. We developed both ABMOF and Lucas-Kanade (LK) algorithms using our adapted slices. Results show that ABMOF accuracy is comparable with LK accuracy on natural scene data including sparse and dense texture, high dynamic range, and fast motion exceeding 30,000 pixels per second.
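A very simplified sketch of block matching between two accumulated event time slices follows: for a block around a pixel in the older slice, a small window in the newer slice is searched for the best sum-of-absolute-differences match. Block and search sizes are illustrative, and the paper's adaptive slice rotation and logic-circuit-oriented design are not modeled.

import numpy as np

def block_match(slice_old, slice_new, y, x, block=9, search=7):
    # slice_old, slice_new: 2D arrays of accumulated event counts from two time slices.
    # (y, x) is assumed to be far enough from the borders for all windows to fit.
    h = block // 2
    ref = slice_old[y - h:y + h + 1, x - h:x + h + 1].astype(np.float32)
    best, best_flow = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = slice_new[y + dy - h:y + dy + h + 1,
                             x + dx - h:x + dx + h + 1].astype(np.float32)
            sad = np.abs(ref - cand).sum()     # sum of absolute differences
            if sad < best:
                best, best_flow = sad, (dx, dy)
    return best_flow   # pixel displacement between the two slices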

Proceedings Article
01 Jan 2018
TL;DR: This paper uses Generative Adversarial Networks (GANs) to model the underlying distributions of old classes and selects additional real exemplars as anchors to support the learned distribution, and proves that the method has superior performance against state-of-the-art approaches.
Abstract: Incremental learning with deep neural networks often suffers from catastrophic forgetting, where newly learned patterns may completely erase the previous knowledge. A remedy is to review the old data (i.e. rehearsal) occasionally, like humans do, to prevent forgetting. While recent approaches focus on storing historical data or the generator of old classes for rehearsal, we argue that they cannot fully and reliably represent old classes. In this paper, we propose a novel class incremental learning method called Exemplar-Supported Generative Reproduction (ESGR) that can better reconstruct memory of old classes and mitigate catastrophic forgetting. Specifically, we use Generative Adversarial Networks (GANs) to model the underlying distributions of old classes and select additional real exemplars as anchors to support the learned distribution. When learning from new class samples, synthesized data generated by GANs and real exemplars stored in the memory for old classes can be jointly reviewed to mitigate catastrophic forgetting. By conducting experiments on CIFAR-100 and ImageNet-Dogs, we prove that our method has superior performance against state-of-the-art approaches.
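A high-level sketch of the rehearsal step suggested by the abstract is given below: new-class samples are mixed with stored real exemplars and GAN-synthesized samples of old classes in a joint classification update. The generator interface (including the latent_dim attribute), batch sizes, integer class keys and classifier are placeholders, not the paper's implementation.

import torch
import torch.nn.functional as F

def rehearsal_step(classifier, optimizer, new_x, new_y, exemplar_x, exemplar_y,
                   old_generators):
    # old_generators: dict mapping old class id (int) -> per-class generator network.
    gen_x, gen_y = [], []
    for cls, g in old_generators.items():
        z = torch.randn(8, g.latent_dim)                       # sample latent codes
        gen_x.append(g(z))                                     # synthesized old-class images
        gen_y.append(torch.full((8,), cls, dtype=torch.long))
    # Jointly review new samples, stored real exemplars, and generated old-class data.
    x = torch.cat([new_x, exemplar_x] + gen_x)
    y = torch.cat([new_y, exemplar_y] + gen_y)
    loss = F.cross_entropy(classifier(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()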