
Showing papers by "Andrea Cavallaro published in 2019"


Proceedings ArticleDOI
01 Oct 2019
TL;DR: Zhou et al. as mentioned in this paper designed a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale, and introduced a novel unified aggregation gate to dynamically fuse multi-scale features with input-dependent channel-wise weights.
Abstract: As an instance-level recognition problem, person re-identification (ReID) relies on discriminative features, which not only capture different spatial scales but also encapsulate an arbitrary combination of multiple scales. We call features of both homogeneous and heterogeneous scales omni-scale features. In this paper, a novel deep ReID CNN is designed, termed Omni-Scale Network (OSNet), for omni-scale feature learning. This is achieved by designing a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale. Importantly, a novel unified aggregation gate is introduced to dynamically fuse multi-scale features with input-dependent channel-wise weights. To efficiently learn spatial-channel correlations and avoid overfitting, the building block uses both pointwise and depthwise convolutions. By stacking such blocks layer-by-layer, our OSNet is extremely lightweight and can be trained from scratch on existing ReID benchmarks. Despite its small model size, our OSNet achieves state-of-the-art performance on six person-ReID datasets. Code and models are available at: https://github.com/KaiyangZhou/deep-person-reid.
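For illustration, a minimal PyTorch sketch of the kind of omni-scale residual block the abstract describes is given below: several factorised (pointwise + depthwise) convolutional streams with growing receptive fields are fused by a unified aggregation gate that produces input-dependent channel-wise weights. Stream depths, channel sizes and the gate's bottleneck width are assumptions for the sketch, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class LiteConv(nn.Module):
    """Pointwise + depthwise 3x3 convolution (factorised convolution)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)           # pointwise
        self.dw = nn.Conv2d(out_ch, out_ch, 3, padding=1,
                            groups=out_ch, bias=False)               # depthwise
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.dw(self.pw(x))))

class OmniScaleBlock(nn.Module):
    """Residual block with parallel streams of increasing receptive field,
    fused by a unified aggregation gate shared across streams."""
    def __init__(self, channels, num_streams=4, reduction=16):
        super().__init__()
        # Stream t stacks t LiteConv layers, so its receptive field grows with t.
        self.streams = nn.ModuleList([
            nn.Sequential(*[LiteConv(channels, channels) for _ in range(t)])
            for t in range(1, num_streams + 1)
        ])
        # Unified aggregation gate: input-dependent channel-wise weights.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        fused = 0
        for stream in self.streams:
            feat = stream(x)
            fused = fused + self.gate(feat) * feat   # dynamic channel-wise fusion
        return self.relu(x + fused)                  # residual connection

block = OmniScaleBlock(channels=64)
print(block(torch.randn(2, 64, 32, 32)).shape)       # torch.Size([2, 64, 32, 32])
```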

390 citations


Posted Content
TL;DR: A novel deep ReID CNN is designed, termed Omni-Scale Network (OSNet), for omni-scale feature learning by designing a residual block composed of multiple convolutional feature streams, each detecting features at a certain scale.
Abstract: As an instance-level recognition problem, person re-identification (ReID) relies on discriminative features, which not only capture different spatial scales but also encapsulate an arbitrary combination of multiple scales. We call features of both homogeneous and heterogeneous scales omni-scale features. In this paper, a novel deep ReID CNN is designed, termed Omni-Scale Network (OSNet), for omni-scale feature learning. This is achieved by designing a residual block composed of multiple convolutional streams, each detecting features at a certain scale. Importantly, a novel unified aggregation gate is introduced to dynamically fuse multi-scale features with input-dependent channel-wise weights. To efficiently learn spatial-channel correlations and avoid overfitting, the building block uses pointwise and depthwise convolutions. By stacking such blocks layer-by-layer, our OSNet is extremely lightweight and can be trained from scratch on existing ReID benchmarks. Despite its small model size, OSNet achieves state-of-the-art performance on six person ReID datasets, outperforming most large-sized models, often by a clear margin. Code and models are available at: \url{this https URL}.

371 citations


Posted Content
TL;DR: Novel CNN architectures to address person re-identification and cross-dataset discrepancies are developed, including a re-ID CNN termed omni-scale network (OSNet) to learn features that not only capture different spatial scales but also encapsulate a synergistic combination of multiple scales, namely omni-scale features.
Abstract: An effective person re-identification (re-ID) model should learn feature representations that are both discriminative, for distinguishing similar-looking people, and generalisable, for deployment across datasets without any adaptation. In this paper, we develop novel CNN architectures to address both challenges. First, we present a re-ID CNN termed omni-scale network (OSNet) to learn features that not only capture different spatial scales but also encapsulate a synergistic combination of multiple scales, namely omni-scale features. The basic building block consists of multiple convolutional streams, each detecting features at a certain scale. For omni-scale feature learning, a unified aggregation gate is introduced to dynamically fuse multi-scale features with channel-wise weights. OSNet is lightweight as its building blocks comprise factorised convolutions. Second, to improve generalisable feature learning, we introduce instance normalisation (IN) layers into OSNet to cope with cross-dataset discrepancies. Further, to determine the optimal placements of these IN layers in the architecture, we formulate an efficient differentiable architecture search algorithm. Extensive experiments show that, in the conventional same-dataset setting, OSNet achieves state-of-the-art performance, despite being much smaller than existing re-ID models. In the more challenging yet practical cross-dataset setting, OSNet beats most recent unsupervised domain adaptation methods without using any target data. Our code and models are released at \texttt{this https URL}.

105 citations


Proceedings ArticleDOI
15 Apr 2019
TL;DR: This paper proposes an on-device transformation of sensor data to be shared for specific applications, such as monitoring selected daily activities, without revealing information that enables user identification; the trained autoencoder can be deployed on a mobile or wearable device to anonymize sensor data even for users who are not included in the training dataset.
Abstract: Motion sensors such as accelerometers and gyroscopes measure the instant acceleration and rotation of a device, in three dimensions. Raw data streams from motion sensors embedded in portable and wearable devices may reveal private information about users without their awareness. For example, motion data might disclose the weight or gender of a user, or enable their re-identification. To address this problem, we propose an on-device transformation of sensor data to be shared for specific applications, such as monitoring selected daily activities, without revealing information that enables user identification. We formulate the anonymization problem using an information-theoretic approach and propose a new multi-objective loss function for training deep autoencoders. This loss function helps minimize user-identity information as well as data distortion, preserving the application-specific utility. The training process regulates the encoder to disregard user-identifiable patterns and tunes the decoder to shape the output independently of users in the training set. The trained autoencoder can be deployed on a mobile or wearable device to anonymize sensor data even for users who are not included in the training dataset. Data from 24 users transformed by the proposed anonymizing autoencoder lead to a promising trade-off between utility and privacy, with an accuracy for activity recognition above 92% and an accuracy for user identification below 7%.
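A minimal sketch of the multi-objective training idea described above, assuming a 1D convolutional autoencoder over windows of motion-sensor samples and a separately trained identity classifier; the exact loss terms, architectures and weights in the paper differ, so this is only illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Autoencoder(nn.Module):
    """1D convolutional autoencoder over windows of accelerometer/gyroscope samples."""
    def __init__(self, channels=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(channels, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(64, 32, 5, stride=2, padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(32, channels, 5, stride=2, padding=2, output_padding=1),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

def anonymization_loss(x, x_hat, id_logits, lambda_dist=1.0, lambda_priv=1.0):
    """Trade off data distortion against identity leakage.

    - distortion: keep the transformed signal close to the original so the
      target task (e.g. activity recognition) still works;
    - privacy: push the identity classifier's output towards a uniform
      distribution so no user in the training set stands out.
    """
    distortion = F.mse_loss(x_hat, x)
    uniform = torch.full_like(id_logits, 1.0 / id_logits.size(1))
    privacy = F.kl_div(F.log_softmax(id_logits, dim=1), uniform, reduction='batchmean')
    return lambda_dist * distortion + lambda_priv * privacy

# usage sketch: x_hat = ae(x); id_logits = identity_classifier(x_hat)
#               loss = anonymization_loss(x, x_hat, id_logits)
```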

92 citations


Journal ArticleDOI
TL;DR: A novel 3-D audio-visual people tracker that exploits visual observations to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker.
Abstract: Compact multi-sensor platforms are portable and thus desirable for robotics and personal-assistance tasks. However, compared to physically distributed sensors, the size of these platforms makes person tracking more difficult. To address this challenge, we propose a novel 3-D audio-visual people tracker that exploits visual observations (object detections) to guide the acoustic processing by constraining the acoustic likelihood on the horizontal plane defined by the predicted height of a speaker. This solution allows the tracker to estimate, with a small microphone array, the distance of a sound source. Moreover, we apply a color-based visual likelihood on the image plane to compensate for misdetections. Finally, we use a 3-D particle filter and greedy data association to combine the visual observations with the color-based and acoustic likelihoods to track the position of multiple simultaneous speakers. We compare the proposed multimodal 3-D tracker against two state-of-the-art methods on the AV16.3 dataset and on a newly collected dataset with co-located sensors, which we make available to the research community. Experimental results show that our multimodal approach outperforms the other methods both in 3-D and on the image plane.
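To make the tracking loop concrete, here is a generic predict-update-resample step of a particle filter in the spirit of the abstract; the fused multimodal likelihood, the random-walk motion model and the resampling threshold are placeholders and assumptions, not the paper's exact design:

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood_fn, motion_std=0.05, rng=None):
    """One predict-update-resample step of a 3-D particle filter.
    `likelihood_fn(p)` should combine the visual, colour-based and acoustic
    likelihoods for a candidate 3-D position p (that fusion is the paper's
    contribution and is only a placeholder here)."""
    rng = np.random.default_rng() if rng is None else rng
    # Predict: random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: re-weight by the fused multimodal likelihood.
    weights = weights * np.array([likelihood_fn(p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 0.5 * len(particles):
        idx = rng.choice(len(particles), size=len(particles), p=weights)
        particles = particles[idx]
        weights = np.full(len(particles), 1.0 / len(particles))
    return particles, weights
```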

37 citations


Proceedings ArticleDOI
12 May 2019
TL;DR: The proposed method, private FGSM, achieves a desirable trade-off between the drop in classification accuracy and the distortion on the private classes of the Places365-Standard dataset using ResNet50.
Abstract: Images shared on social media are routinely analysed by classifiers for content annotation and user profiling. These automatic inferences reveal to the service provider sensitive information that a naive user might want to keep private. To address this problem, we present a method designed to distort the image data so as to hinder the inference of a classifier without affecting the utility for social media users. The proposed approach is based on the Fast Gradient Sign Method (FGSM) and limits the likelihood that automatic inference can expose the true class of a distorted image. Experimental results on a scene classification task show that the proposed method, private FGSM, achieves a desirable trade-off between the drop in classification accuracy and the distortion on the private classes of the Places365-Standard dataset using ResNet50. The classifier is misled 94.40% of the time in the top-5 classes with only a small average reduction of three image quality measures (SSIM, PSNR, BRISQUE).
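The basic FGSM step that private FGSM builds on can be sketched as follows; the step size epsilon and the clamping range are illustrative assumptions, and the paper's variant additionally constrains the likelihood of the true class:

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    """One FGSM step: move the image in the direction that increases the loss
    for its true class, hindering the classifier's inference.
    (Sketch of the FGSM building block only, not the full private FGSM.)"""
    image = image.clone().detach().requires_grad_(True)
    logits = model(image)
    loss = F.cross_entropy(logits, true_label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()
```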

27 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: This paper presents an accurate model to forecast the position of moving objects by disentangling global and object motion without the need for camera calibration or planarity assumptions, and shows that it can forecast up to 60% more accurately than state-of-the-art predictors while being resilient to noisy observations.
Abstract: Predicting the motion of objects captured by a moving camera is important for first-person vision tasks. In this paper, we present an accurate model to forecast the position of moving objects by disentangling global and object motion without the need for camera calibration or planarity assumptions. Our predictor uses past observations to model online the motion of objects by selectively tracking a spatially balanced set of keypoints and estimating scene transformations between pairs of frames. We show that we can forecast up to 60% more accurately than state-of-the-art predictors while being resilient to noisy observations. Moreover, the proposed predictor is robust to frame-rate reduction and outperforms alternative approaches while processing only 33% of the frames with moving cameras. We also show the benefit of integrating the proposed predictor in a multi-object tracker.
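One ingredient of such a predictor, estimating the global (camera) motion between consecutive frames from sparse keypoints so that object motion can be separated from it, can be sketched with OpenCV as below; the keypoint budget, the similarity-transform model and the RANSAC choice are assumptions for the sketch, not the paper's exact design:

```python
import cv2
import numpy as np

def scene_transform(prev_gray, curr_gray, max_corners=200):
    """Estimate the global (camera) motion between two grayscale frames from
    sparse keypoints, so object motion can be measured on top of it."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                 qualityLevel=0.01, minDistance=10)
    if p0 is None:
        return np.eye(2, 3, dtype=np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, p0, None)
    good = status.ravel() == 1
    if good.sum() < 4:
        return np.eye(2, 3, dtype=np.float32)
    # Similarity transform (rotation, scale, translation), robust to outliers.
    M, _ = cv2.estimateAffinePartial2D(p0[good], p1[good], method=cv2.RANSAC)
    return M if M is not None else np.eye(2, 3, dtype=np.float32)

def compensate(point_xy, M):
    """Map a past object position into the current frame's coordinates."""
    x, y = point_xy
    return (M[0, 0] * x + M[0, 1] * y + M[0, 2],
            M[1, 0] * x + M[1, 1] * y + M[1, 2])
```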

21 citations


Posted Content
TL;DR: This work shows that it can prevent inference of potentially sensitive activities while keeping the reduction in recognition accuracy of non-sensitive activities to less than 5 percentage points, and reduce the accuracy of user re-identification and of the potential inference of gender to the level of a random guess.
Abstract: Sensitive inferences and user re-identification are major threats to privacy when raw sensor data from wearable or portable devices are shared with cloud-assisted applications. To mitigate these threats, we propose mechanisms to transform sensor data before sharing them with applications running on users' devices. These transformations aim at eliminating patterns that can be used for user re-identification or for inferring potentially sensitive activities, while introducing a minor utility loss for the target application (or task). We show that, on gesture and activity recognition tasks, we can prevent inference of potentially sensitive activities while keeping the reduction in recognition accuracy of non-sensitive activities to less than 5 percentage points. We also show that we can reduce the accuracy of user re-identification and of the potential inference of gender to the level of a random guess, while keeping the accuracy of activity recognition comparable to that obtained on the original data.

16 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: An audio-visual dataset recorded outdoors from a quadcopter is presented and baseline results for multiple applications are discussed, including a scenario for source localization and sound enhancement with up to two static sources, and a scenario with a moving sound source.
Abstract: We present an audio-visual dataset recorded outdoors from a quadcopter and discuss baseline results for multiple applications. The dataset includes a scenario for source localization and sound enhancement with up to two static sources, and a scenario for source localization and tracking with a moving sound source. These sensing tasks are made challenging by the strong and time-varying ego-noise generated by the rotating motors and propellers. The dataset was collected using a small circular array with 8 microphones and a camera mounted on the quadcopter. The camera view was used to facilitate the annotation of the sound-source positions and can also be used for multi-modal sensing tasks. We discuss the audio-visual calibration procedure that is needed to generate the annotation for the dataset, which we make available to the research community: http://cis.eecs.qmul.ac.uk/projects/avq/

15 citations


Proceedings ArticleDOI
01 Oct 2019
TL;DR: The proposed approach, View-LSTM, is a recurrent neural network structure that accounts for the temporal consistency and target feature approximation constraints and is validated by designing an end-to-end generator for novel-view video synthesis.
Abstract: We tackle the problem of synthesizing a video of multiple moving people as seen from a novel view, given only an input video and depth information or human poses of the novel view as prior. This problem requires a model that learns to transform input features into target features while maintaining temporal consistency. To this end, we learn an invariant feature from the input video that is shared across all viewpoints of the same scene and a view-dependent feature obtained using the target priors. The proposed approach, View-LSTM, is a recurrent neural network structure that accounts for the temporal consistency and target feature approximation constraints. We validate View-LSTM by designing an end-to-end generator for novel-view video synthesis. Experiments on a large multi-view action recognition dataset validate the proposed model.

12 citations


Journal ArticleDOI
TL;DR: This work proposes the first end-to-end convolutional-recurrent neural network architecture that learns conflict-specific features directly from raw speech waveforms, without using explicit domain knowledge or metadata.
Abstract: Computational paralinguistics aims to infer human emotions, personality traits and behavioural patterns from speech signals. In particular, verbal conflict is an important example of human-interaction behaviour, whose detection would enable monitoring and feedback in a variety of applications. The majority of methods for detection and intensity estimation of verbal conflict apply off-the-shelf classifiers/regressors to generic hand-crafted acoustic features. Generating conflict-specific features requires refinement steps and the availability of metadata, such as the number of speakers and their speech overlap duration. Moreover, most techniques treat feature extraction and regression as independent modules, which require separate training and parameter tuning. To address these limitations, we propose the first end-to-end convolutional-recurrent neural network architecture that learns conflict-specific features directly from raw speech waveforms, without using explicit domain knowledge or metadata. Additionally, to selectively focus the model on portions of speech containing verbal conflict instances, we include a global attention interface that learns the alignment between layers of the recurrent network. Experimental results on the SSPNet Conflict Corpus show that our end-to-end architecture achieves state-of-the-art performance in terms of Pearson Correlation Coefficient.
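A minimal sketch of an end-to-end convolutional-recurrent regressor with attention pooling over time, in the spirit of the architecture described above; the layer sizes, the GRU choice and the single-layer attention are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ConflictNet(nn.Module):
    """Sketch: 1D convolutions over the raw waveform, a recurrent layer,
    and an attention pooling that weights time steps before regression."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
        )
        self.gru = nn.GRU(64, 64, batch_first=True)
        self.attn = nn.Linear(64, 1)     # one attention score per time step
        self.out = nn.Linear(64, 1)      # conflict intensity

    def forward(self, wave):                          # wave: (batch, samples)
        h = self.conv(wave.unsqueeze(1))              # (batch, 64, T)
        h, _ = self.gru(h.transpose(1, 2))            # (batch, T, 64)
        w = torch.softmax(self.attn(h), dim=1)        # (batch, T, 1)
        pooled = (w * h).sum(dim=1)                   # attention-weighted summary
        return self.out(pooled).squeeze(-1)

net = ConflictNet()
print(net(torch.randn(2, 16000)).shape)               # torch.Size([2])
```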

Proceedings ArticleDOI
20 May 2019
TL;DR: A self-supervised prediction network is presented to train the agent with intrinsic rewards that relate to achieving the desired final goal by learning action representations, and it is shown that, despite the sparse extrinsic rewards, this network achieves a faster training convergence than state-of-the-art methods.
Abstract: Learning to efficiently navigate an environment using only an on-board camera is a difficult task for an agent when the final goal is far from the initial state and extrinsic rewards are sparse. To address this problem, we present a self-supervised prediction network to train the agent with intrinsic rewards that relate to achieving the desired final goal. The network learns to predict its future camera view (the future state) from a current state-action pair through an Action Representation Module that decodes input actions as higher dimensional representations. To increase the representational power of the network during exploration we fuse the responses from the Action Representation Module in the transition network, which predicts the future state. Moreover, to enhance the discrimination capability between predictions from different input actions we introduce joint regression and triplet ranking loss functions. We show that, despite the sparse extrinsic rewards, by learning action representations we achieve a faster training convergence than state-of-the-art methods with only a small increase in the number of model parameters.

Posted Content
TL;DR: In this paper, a content-based black-box adversarial attack that generates unrestricted perturbations by exploiting image semantics to selectively modify colors within chosen ranges that are perceived as natural by humans is proposed.
Abstract: Adversarial attacks that generate small L_p-norm perturbations to mislead classifiers have limited success in black-box settings and with unseen classifiers. These attacks are also not robust to defenses that use denoising filters and to adversarial training procedures. Instead, adversarial attacks that generate unrestricted perturbations are more robust to defenses, are generally more successful in black-box settings and are more transferable to unseen classifiers. However, unrestricted perturbations may be noticeable to humans. In this paper, we propose a content-based black-box adversarial attack that generates unrestricted perturbations by exploiting image semantics to selectively modify colors within chosen ranges that are perceived as natural by humans. We show that the proposed approach, ColorFool, outperforms in terms of success rate, robustness to defense frameworks and transferability, five state-of-the-art adversarial attacks on two different tasks, scene and object classification, when attacking three state-of-the-art deep neural networks using three standard datasets. The source code is available at this https URL.
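The core colour-manipulation idea can be sketched as follows, assuming a semantic mask of colour-sensitive regions is available and working in the Lab colour space; the per-region natural colour ranges and the iterative search against the target classifier used by ColorFool are omitted, so this is only illustrative:

```python
import numpy as np
from skimage import color

def colorfool_like_shift(image_rgb, sensitive_mask, max_shift=40, rng=None):
    """Sketch of an unrestricted, natural-looking colour perturbation:
    randomly shift the a/b chrominance channels (Lab space) everywhere except
    in semantically sensitive regions (e.g. skin, sky), which keep their
    natural colours. `image_rgb` is float RGB in [0, 1]; `sensitive_mask` is
    an HxW boolean mask. The mask source and shift range are assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    lab = color.rgb2lab(image_rgb)          # L in [0, 100], a/b roughly [-128, 127]
    shift_a, shift_b = rng.uniform(-max_shift, max_shift, size=2)
    keep = sensitive_mask.astype(bool)
    lab[..., 1] = np.where(keep, lab[..., 1], lab[..., 1] + shift_a)
    lab[..., 2] = np.where(keep, lab[..., 2], lab[..., 2] + shift_b)
    return np.clip(color.lab2rgb(lab), 0.0, 1.0)
```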

Book ChapterDOI
01 Jan 2019
TL;DR: This chapter presents the main challenges for audio-visual classification using signals from wearable cameras, shows how multi-modality can help address those challenges, and discusses person re-identification as a specific application example.
Abstract: Wearable cameras capture user-centered data that can be used to analyze a scene, to recognize interactions and to classify physical activities. A wearable camera is equipped with multiple sensors such as microphone(s) and inertial measurement units, in addition to the imager. However, despite this richness in sensing modalities, the analysis of data from a wearable camera is particularly challenging due to unconventional mounting and capturing conditions, rapid changes in camera pose, self-occlusions, background noise and motion blur. In this chapter we present the main challenges for audio-visual classification using signals from wearable cameras, we show how multi-modality can help address those challenges, and we discuss person re-identification as a specific application example.

Proceedings ArticleDOI
06 Nov 2019
TL;DR: This work proposes a framework to measure the amount of sensitive information memorized in each layer of a DNN, and shows that the last layers encode a larger amount of information from the training data compared to the first layers.
Abstract: Pre-trained Deep Neural Network (DNN) models are increasingly used in smartphones and other user devices to enable prediction services, leading to potential disclosures of (sensitive) information from training data captured inside these models. Based on the concept of generalization error, we propose a framework to measure the amount of sensitive information memorized in each layer of a DNN. Our results show that, when considered individually, the last layers encode a larger amount of information from the training data compared to the first layers. We find that the same DNN architecture trained with different datasets has similar exposure per layer. We evaluate an architecture to protect the most sensitive layers within an on-device Trusted Execution Environment (TEE) against potential white-box membership inference attacks without significant computational overhead.

Journal ArticleDOI
TL;DR: Experimental results show that the proposed privacy filter protects privacy while reducing distortion, and is also robust against attacks.
Abstract: Photographs taken in public places often contain faces of bystanders thus leading to a perceived or actual violation of privacy. To address this issue, we propose to pseudo-randomly modify the appearance of face regions in the images using a privacy filter that prevents a human or a face recogniser from inferring the identity of people. The filter, which is applied only when the resolution is high enough for a face to be recognisable, adaptively distorts the face appearance as a function of its resolution. Moreover, the proposed filter locally changes the values of its parameters to counter attacks that attempt to estimate them. The filter exploits both global adaptiveness to reduce distortion and local parameter hopping to make their estimation difficult for an attacker. In order to evaluate the efficiency of the proposed approach, we consider an important scenario of oblique face images: photographs taken with low altitude Micro Aerial Vehicles (MAVs). We use a state-of-the-art face recognition algorithm and synthetically generated face data with 3D geometric image transformations that mimic faces captured from an MAV at different heights and pitch angles. Experimental results show that the proposed filter protects privacy while reducing distortion, and is also robust against attacks.

Posted Content
TL;DR: In this paper, the authors proposed a framework to measure the amount of sensitive information memorized in each layer of a DNN and evaluated an architecture to protect the most sensitive layers within the memory limits of a Trusted Execution Environment (TEE) against potential white-box membership inference attacks without significant computational overhead.
Abstract: Pre-trained Deep Neural Network (DNN) models are increasingly used in smartphones and other user devices to enable prediction services, leading to potential disclosures of (sensitive) information from training data captured inside these models. Based on the concept of generalization error, we propose a framework to measure the amount of sensitive information memorized in each layer of a DNN. Our results show that, when considered individually, the last layers encode a larger amount of information from the training data compared to the first layers. We find that, while the neurons of convolutional layers can expose more (sensitive) information than those of fully connected layers, the same DNN architecture trained with different datasets has similar exposure per layer. We evaluate an architecture to protect the most sensitive layers within the memory limits of a Trusted Execution Environment (TEE) against potential white-box membership inference attacks without significant computational overhead.

Posted Content
TL;DR: The proposed method for jointly localising container-like objects and estimating their dimensions using two wide-baseline, calibrated RGB cameras outperforms in terms of localisation success and dimension estimation accuracy a deep-learning based approach that uses depth maps.
Abstract: The 3D localisation of an object and the estimation of its properties, such as shape and dimensions, are challenging under varying degrees of transparency and lighting conditions. In this paper, we propose a method for jointly localising container-like objects and estimating their dimensions using two wide-baseline, calibrated RGB cameras. Under the assumption of circular symmetry along the vertical axis, we estimate the dimensions of an object with a generative 3D sampling model of sparse circumferences, iterative shape fitting and image re-projection to verify the sampling hypotheses in each camera using semantic segmentation masks. We evaluate the proposed method on a novel dataset of objects with different degrees of transparency and captured under different backgrounds and illumination conditions. Our method, which is based on RGB images only, outperforms in terms of localisation success and dimension estimation accuracy a deep-learning based approach that uses depth maps.

Posted Content
TL;DR: This document describes the submission to the 2018 LOCalization And TrAcking (LOCATA) challenge (Tasks 1, 3, 5) and employs a Particle Filter that favors the spatio-temporal continuity of the localization results.
Abstract: This document describes our submission to the 2018 LOCalization And TrAcking (LOCATA) challenge (Tasks 1, 3, 5). We estimate the 3D position of a speaker using the Global Coherence Field (GCF) computed from multiple microphone pairs of a DICIT planar array. One of the main challenges when using such an array with omnidirectional microphones is the front-back ambiguity, which is particularly evident in Task 5. We address this challenge by post-processing the peaks of the GCF and exploiting the attenuation introduced by the frame of the array. Moreover, the intermittent nature of speech and the changing orientation of the speaker make localization difficult. For Tasks 3 and 5, we also employ a Particle Filter (PF) that favors the spatio-temporal continuity of the localization results.
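The Global Coherence Field mentioned above is built from GCC-PHAT correlations of microphone pairs, accumulated at the time differences of arrival implied by each candidate source position. A minimal GCC-PHAT sketch is given below; the grid search over candidate positions, the array geometry and the peak post-processing are omitted:

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """GCC-PHAT between two microphone signals: returns the estimated time
    difference of arrival (TDOA) and the correlation curve. This is the
    building block of the Global Coherence Field, which sums such
    correlations over candidate source positions for all microphone pairs."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)     # PHAT weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / float(fs)
    return tau, cc
```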

Posted Content
TL;DR: EdgeFool is proposed, an adversarial image enhancement filter that learns structure-aware adversarial perturbations and generates adversarial images whose perturbations enhance image details, by training a fully convolutional neural network end-to-end with a multi-task loss function.
Abstract: Adversarial examples are intentionally perturbed images that mislead classifiers. These images can, however, be easily detected using denoising algorithms, when high-frequency spatial perturbations are used, or can be noticed by humans, when perturbations are large. In this paper, we propose EdgeFool, an adversarial image enhancement filter that learns structure-aware adversarial perturbations. EdgeFool generates adversarial images with perturbations that enhance image details via training a fully convolutional neural network end-to-end with a multi-task loss function. This loss function accounts for both image detail enhancement and class misleading objectives. We evaluate EdgeFool on three classifiers (ResNet-50, ResNet-18 and AlexNet) using two datasets (ImageNet and Private-Places365) and compare it with six adversarial methods (DeepFool, SparseFool, Carlini-Wagner, SemanticAdv, Non-targeted and Private Fast Gradient Sign Methods). Code is available at https://github.com/smartcameras/EdgeFool.git.
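A sketch of the kind of multi-task loss described above, assuming the detail layer is obtained by subtracting a smoothed copy of the image; the smoothing operator, the enhancement factor and the term weights are assumptions, not EdgeFool's released implementation (see the linked repository for that):

```python
import torch
import torch.nn.functional as F

def edgefool_like_loss(output_img, original_img, smoothed_img, logits, true_label,
                       alpha=1.0, beta=1.0):
    """Multi-task loss sketch: (i) make the generated image resemble a
    detail-enhanced version of the original (details = original minus a
    smoothed copy), and (ii) lower the classifier's score for the true class."""
    detail = original_img - smoothed_img
    enhanced_target = smoothed_img + 1.5 * detail        # boosted detail layer
    enhancement = F.mse_loss(output_img, enhanced_target)
    # Misleading term: maximise the loss of the true class (minimise its log-probability).
    misleading = -F.cross_entropy(logits, true_label)
    return alpha * enhancement + beta * misleading
```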

Journal ArticleDOI
TL;DR: In this paper, the authors focus on trustworthiness in multimedia communications and provide a forum for both academic and industrial researchers to discuss recent results and provide solutions to the above-mentioned challenges.
Abstract: The papers in this special issue focus on trustworthiness in multimedia communications. With the advance of multimedia technologies and social networks, social multimedia content is now delivered to users with a high quality of experience (QoE). However, as a huge number of social users have various demands to exchange and share multimedia content with each other, it becomes a new challenge for current social multimedia analytics and delivery to deal with the various attacks perpetrated by malicious users or through spam content. Therefore, trust and risk management for social multimedia content, based on the social ties of users, becomes of prime importance to face unpredicted threats and subsequent damage. This Special Section aims to provide a premier forum for researchers working on trust-based social multimedia analytics and delivery. It also provides the opportunity for both academic and industrial researchers to discuss recent results and provide solutions to the above-mentioned challenges.

Proceedings ArticleDOI
12 May 2019
TL;DR: A multi-modal approach that leverages annotations from reference streams and measurements from unannotated additional streams to infer 3D trajectories through an optimization based on a multi-modal extension of Bundle Adjustment, with a cross-modal correspondence detection that selectively uses measurements in the optimization.
Abstract: Accurate annotation is fundamental to quantify the performance of multi-sensor and multi-modal object detectors and trackers. However, invasive or expensive instrumentation is needed to automatically generate these annotations. To mitigate this problem, we present a multi-modal approach that leverages annotations from reference streams (e.g. individual camera views) and measurements from unannotated additional streams (e.g. audio) to infer 3D trajectories through an optimization. The core of our approach is a multi-modal extension of Bundle Adjustment with a cross-modal correspondence detection that selectively uses measurements in the optimization. We apply the proposed approach to fully annotate a new multi-modal and multi-view dataset for multi-speaker 3D tracking.

Proceedings ArticleDOI
20 May 2019
TL;DR: Simulations show prominent enhancement of detection performance and lower iteration counts when employing priors in the proposed CSS scheme, which is aided by priors and robust to imperfections in the priors.
Abstract: Compressive sensing has been applied in wideband spectrum sensing to achieve sub-Nyquist sampling. Prior information on the multiband spectrum occupancy, e.g. from a geo-location database, can be utilized by compressive spectrum sensing (CSS) to enhance the sensing performance. However, these priors are prone to be partially missing and may also contain incorrect information. We hereby propose a CSS scheme that is aided by priors and robust to imperfections in the priors, and moreover, a novel and practical algorithm to provide the robust channel sparsity estimation needed by the CSS scheme. Simulations show prominent enhancement of detection performance and lower iteration counts when employing priors in the proposed CSS scheme.

Book ChapterDOI
01 Jul 2019
TL;DR: This work presents a non-iterative layered vector field estimation process that yields sparse vector field abstractions of activity patterns from groups of trajectories, and proposes a trajectory labeling algorithm that labels trajectories according to their activity patterns using the vector field Abstractions.
Abstract: Far-field activities represented as time series or trajectories can be summarized in compact representations of frequent patterns. Popular representations such as clustering or probabilistic modeling of trajectories often do not inform about both velocity and direction of motion, which are by definition visually and quantitatively embedded in vector fields. However, a common use of vector fields may dismiss information about forbidden areas, or regions with concurrent activity patterns. To address this problem we present a non-iterative layered vector field estimation process that yields sparse vector field abstractions of activity patterns from groups of trajectories. The key feature of our approach is the estimate of the probability density function (PDF) of target positions: it automatically tunes the cost function parameter, and serves as weights in the sparse estimation problem. We also propose a trajectory labeling algorithm that labels trajectories according to their activity patterns using the vector field abstractions. Experiments on synthetic and real trajectory data show that the proposed estimation approach yields correctly sparse vector fields, which are similar to known generating vector fields, and 5–12% higher labeling accuracy on test trajectories when compared to other generative models. Outlier trajectories are also detected.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: It is shown that adding the proposed mask to UNet architectures improves the performance of view synthesis with only a slight increase in inference time.
Abstract: Pose-guided human view synthesis uses a target pose to generate the appearance of a new view of a person. The input view and the target pose can be processed separately with UNet architectures that combine the results in a late fusion stage. UNet architectures link their encoder and decoder with skip connections that preserve the location of spatial features by injecting input information in the decoding process. However, direct skip connections may transfer irrelevant information to the decoder. We overcome this limitation with learnable masks for skip connections that encourage the decoder to use only relevant information from the encoder. We show that adding the proposed mask to UNet architectures improves the performance of view synthesis with only a slight increase in inference time.
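A minimal sketch of a learnable mask on a UNet skip connection in the spirit of the abstract: the encoder feature is gated element-wise by a predicted sigmoid mask before the usual concatenation with the decoder feature. The mask predictor's design here is an assumption, not necessarily the one used in the paper:

```python
import torch
import torch.nn as nn

class MaskedSkip(nn.Module):
    """Learnable mask on a UNet skip connection: the decoder receives only the
    encoder features that the mask lets through."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, encoder_feat, decoder_feat):
        gated = self.mask(encoder_feat) * encoder_feat      # suppress irrelevant features
        return torch.cat([gated, decoder_feat], dim=1)      # usual UNet concatenation

skip = MaskedSkip(channels=64)
out = skip(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```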