Showing papers in "Computer Vision and Image Understanding in 2022"
••
TL;DR: This article provides a comprehensive overview of state-of-the-art tracking frameworks, including both deep and non-deep trackers, and presents quantitative and qualitative tracking results of various trackers on five benchmark datasets.
39 citations
••
TL;DR: Borji et al. as discussed by the authors describe new dimensions that are becoming important in assessing models (e.g. bias and fairness) and discuss the connection between GAN evaluation and deepfakes.
34 citations
••
TL;DR: Wang et al. as mentioned in this paper proposed a novel method for visible and infrared image fusion by decomposing feature information, which adopts two pairs of encoder-decoder networks to implement feature map extraction and decomposition, respectively.
28 citations
••
TL;DR: Tan et al. as discussed by the authors proposed a temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods, namely the local-local and global-local contrastive losses.
26 citations
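The local-local and global-local losses in such frameworks typically build on the standard InfoNCE contrastive objective. Below is a minimal numpy sketch of InfoNCE with hypothetical embeddings — an illustration of the general objective, not the paper's implementation:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Minimal InfoNCE: the positive for anchor i is row i of `positives`;
    every other row in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
aligned = info_nce(x, x)                           # perfect positives
mismatched = info_nce(x, rng.normal(size=(8, 16))) # unrelated "positives"
assert aligned < mismatched
```

The loss is low when each anchor is closest to its own positive and high when the pairing carries no signal; local-local and global-local variants differ in which clips or crops form the anchor/positive pairs.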
••
TL;DR: In this article, a survey of algorithms used to create deepfakes and methods proposed to detect them in the literature to date is presented, along with extensive discussions on challenges, research trends and directions related to deepfake technologies.
19 citations
••
TL;DR: Despite its simplicity, the proposed weakly supervised object detection method shows competitive results on a range of publicly available datasets, including paintings, watercolors, cliparts and comics, and enables quick learning of unseen visual categories.
16 citations
••
TL;DR: Zhang et al. as mentioned in this paper proposed an adaptive temporal modelling block (ATB), which is able to flexibly capture temporal structure for skeleton-based action recognition, and fused the adaptive feature map into the graph convolutional layer to improve its capability of learning better representations.
15 citations
••
TL;DR: The performance of face recognition systems can be negatively impacted by masks and other facial coverings that have become prevalent due to the COVID-19 pandemic, making the periocular region of the human face an important biometric cue.
14 citations
••
TL;DR: Zhang et al. as mentioned in this paper proposed an uncertainty-aware consistency regularization method for cross-domain semantic segmentation, which introduces an uncertainty guided consistency loss with a dynamic weighting scheme by exploiting the latent uncertainty information of the target samples.
12 citations
••
TL;DR: This paper decomposes the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head, and yields an astounding RGB-only baseline very close to the state-of-the-art methods with two-stream inputs.
9 citations
••
TL;DR: In this paper, the authors present MC-Calib, a toolbox dedicated to the calibration of complex synchronized multi-camera systems using an arbitrary number of fiducial marker-based patterns.
••
TL;DR: Zhang et al. as discussed by the authors proposed a frame-level refinement network to adaptively learn specific topology in different frames and capture long-range dependencies between frames through transformer self-attention.
••
TL;DR: This work revisits the self-supervised multi-task learning framework for video anomaly detection, proposing several updates to the original method, and modernizes the 3D convolutional backbone by introducing multi-head self-attention modules, inspired by the recent success of vision transformers.
••
TL;DR: In this paper, the authors present a checklist to spot different types of bias during visual dataset collection and discuss existing attempts to collect visual datasets in a bias-aware manner. They conclude that the problem of bias discovery and quantification in visual datasets is still open, and that there is room for improvement in terms of both methods and the range of biases that can be addressed.
••
TL;DR: Wang et al. as discussed by the authors proposed a 3D-2D structural information fusion (SIF) for 3D object detection on a LiDAR-camera system, which is based on hand-crafted 3D and 2D descriptors, generates primary structure features, and has stable performance in outdoor scenes.
••
TL;DR: In this paper, the authors review and categorize image datasets that include depth information, grouping them into three categories: scene/objects, body, and medical. They provide an overview of the different types of sensors and depth applications, and examine trends and future directions in the usage and creation of datasets containing depth data.
••
TL;DR: Wang et al. as discussed by the authors proposed a two-stage region-based convolutional neural network for thighbone fracture detection, which achieved an AP of 88.9% and outperformed all existing methods.
••
TL;DR: Widafeng et al. as mentioned in this paper proposed a cross-modality dual attention fusion module named CMDA to explicitly exchange spatial-temporal information between two pathways in two-stream SlowFast networks.
••
TL;DR: In this article, the authors summarize recent work on animal pose estimation from a computer vision perspective, highlight the challenges faced in this field, and provide an in-depth analysis of the persisting obstacles.
••
TL;DR: Zhang et al. as mentioned in this paper proposed a saliency guided backlit image enhancement network, namely BacklitNet, for robust and natural restoration of backlit images; it combines a nested U-structure with bilateral grids, enabling full extraction of multi-scale saliency information and rapid enhancement of images of arbitrary resolution.
••
TL;DR: In this paper, a global filter importance based adaptive pruning (GFI-AP) method is proposed that assigns each filter an importance score based on how the network learns the input-output mapping of a dataset, so that importance can be compared across all convolutional filters in the network.
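GFI-AP derives its scores from how the network learns the dataset's input-output mapping; as a simpler, commonly used stand-in, the sketch below ranks hypothetical convolutional filters by L1 norm and prunes the globally smallest — illustrative of global filter ranking only, not the paper's scoring rule:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical conv weights: {layer: array of shape (out_ch, in_ch, kH, kW)}
layers = {
    "conv1": rng.normal(size=(16, 3, 3, 3)),
    "conv2": rng.normal(size=(32, 16, 3, 3)),
}

# Score each filter by its L1 norm, then pool scores globally so filters
# from different layers compete for the same pruning budget.
scores = [
    (name, i, float(np.abs(w[i]).sum()))
    for name, w in layers.items()
    for i in range(w.shape[0])
]
scores.sort(key=lambda t: t[2])

prune_ratio = 0.25
n_prune = int(len(scores) * prune_ratio)
pruned = {(name, i) for name, i, _ in scores[:n_prune]}  # least important
print(f"pruning {len(pruned)} of {len(scores)} filters")
```

The key point global methods share is that scores are comparable across layers, so the pruning budget is spent wherever filters matter least, rather than a fixed fraction per layer.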
••
TL;DR: Zhang et al. as mentioned in this paper proposed a light-weight network for real-time shadow detection that uses graph convolutional networks to provide extra training pairs, obtaining a complete shadow mask from only a few annotation scribbles.
••
TL;DR: In this article, a robust encoder-decoder structured deep learning network is proposed to detect local changes in video by combining a feature pooling module (FPM) with a ResNet-50 encoder-decoder network.
••
TL;DR: Zhang et al. as discussed by the authors proposed a root regression module to estimate absolute root locations in the camera coordinate system, and a fisheye re-projection module that connects the two branches without using ground-truth camera parameters.
••
TL;DR: In this article, an anti-jamming network is proposed to improve robustness in less-constrained scenarios, and a new spatial-temporal map generation mechanism is designed to enhance spatial and temporal feature representation by equivalent padding for low-quality video frame fragments.
••
TL;DR: In this paper, a split-inpaint-fuse network (SIFNet) is proposed that separates the corrupted luma and chroma images using two decoupled branches in the coarse stage, and fuses the inpainted luma and chroma images into a refined image with a fusion sub-network in the refinement stage.
••
TL;DR: In this article, the authors proposed a cross-modal attention mechanism where the gating signal from one modality can dynamically activate the most discriminant CNN filters of the other modality.
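A minimal numpy sketch of cross-modal channel gating of this general kind, with hypothetical feature maps and gating weights (not the authors' architecture): one modality is pooled into a vector that produces a sigmoid gate over the other modality's channels.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_gate(feat_a, feat_b, w_gate):
    """Gate modality B's channels with a signal computed from modality A.
    Channels of B that modality A deems uninformative are suppressed."""
    pooled = feat_a.mean(axis=(1, 2))     # global-average-pool modality A: (C_a,)
    gate = sigmoid(w_gate @ pooled)       # per-channel gate for B: (C_b,)
    return feat_b * gate[:, None, None]   # broadcast over spatial dims

rng = np.random.default_rng(2)
rgb = rng.normal(size=(8, 4, 4))          # hypothetical RGB feature map
depth = rng.normal(size=(6, 4, 4))        # hypothetical depth feature map
w = rng.normal(size=(6, 8))               # hypothetical gating weights
gated = cross_modal_gate(rgb, depth, w)
assert gated.shape == depth.shape
```

Because the gate lies in (0, 1), each output channel of modality B is attenuated, never amplified — the mechanism selects which filters of the other modality to keep active.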
••
TL;DR: HSGAN as discussed by the authors alleviates the mode collapse problem by maintaining a certain distance between the latent code of the generated data and the real data, where the objective function is designed to minimize the f-divergence between the distributions of generated and real data.
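The f-divergence family underlying such GAN objectives includes the KL divergence as the special case f(t) = t log t. A minimal numpy check of that identity for discrete distributions (an illustration of the divergence family, not HSGAN itself):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions."""
    return float(np.sum(q * f(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

# KL divergence is the f-divergence with f(t) = t * log(t)
kl_via_f = f_divergence(p, q, lambda t: t * np.log(t))
kl_direct = float(np.sum(p * np.log(p / q)))
assert abs(kl_via_f - kl_direct) < 1e-9
assert f_divergence(p, p, lambda t: t * np.log(t)) == 0.0  # D_f(P||P) = 0
```

Different convex choices of f (total variation, Jensen-Shannon, Pearson chi-squared, ...) give different divergences, which is what lets f-GAN-style objectives swap the distance being minimized between generated and real distributions.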
••
TL;DR: A benchmark aimed at studying human behavior in the considered industrial-like scenario is proposed, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms.