SciSpace (formerly Typeset)

Showing papers in "Computer Vision and Image Understanding" in 2017


Journal Article
TL;DR: The THUMOS benchmark is described in detail, together with an overview of its data collection and annotation procedures and a comprehensive empirical study of how action recognition differs between trimmed and untrimmed videos and how well methods trained on trimmed videos generalize to untrimmed ones.

415 citations


Journal Article
TL;DR: A novel double fusion framework is introduced that combines the benefits of the traditional early fusion and late fusion strategies. It is extensively evaluated on publicly available video surveillance datasets, including UCSD Pedestrian, Subway, and Train, and shows competitive performance with respect to state-of-the-art approaches.

385 citations
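The two classic strategies that a double fusion framework combines can be contrasted in a small sketch. It assumes each modality is already a feature vector and uses placeholder scoring functions; the `alpha` blend in `double_fusion` is a hypothetical way of combining both paths, not the authors' framework.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Scorer = Callable[[Vector], float]


def early_fusion(features: Sequence[Vector], joint_scorer: Scorer) -> float:
    """Early fusion: concatenate the per-modality features, then score them once."""
    fused: Vector = [x for modality in features for x in modality]
    return joint_scorer(fused)


def late_fusion(features: Sequence[Vector], scorers: Sequence[Scorer],
                weights: Sequence[float]) -> float:
    """Late fusion: score each modality separately, then take a weighted mean of the scores."""
    scores = [scorer(f) for scorer, f in zip(scorers, features)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def double_fusion(features: Sequence[Vector], joint_scorer: Scorer,
                  scorers: Sequence[Scorer], weights: Sequence[float],
                  alpha: float = 0.5) -> float:
    """Hypothetical blend of both paths: alpha * early score + (1 - alpha) * late score."""
    early = early_fusion(features, joint_scorer)
    late = late_fusion(features, scorers, weights)
    return alpha * early + (1 - alpha) * late
```

Early fusion lets a single model see cross-modal correlations, while late fusion keeps modality-specific models independent; blending them is one simple way to keep both benefits.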


Journal Article
TL;DR: The authors design three types of low-level statistical features in both the spatial and frequency domains to quantify super-resolution artifacts, and learn a two-stage regression model that predicts the quality scores of super-resolved images without referring to ground-truth images.

338 citations


Journal Article
TL;DR: Skeleton-based human representations have been studied intensively and continue to attract increasing attention, owing to their robustness to variations in viewpoint, human body scale, and motion speed, as well as their real-time, online performance.

279 citations


Journal Article
TL;DR: The paper shows that 128 × 128 pixel images are sufficient to draw qualitative conclusions about optimal network structure that hold for the full-size Caffe and VGG nets, while running an order of magnitude faster than with the standard 224-pixel images.

266 citations


Journal Article
TL;DR: A novel segmentation approach is proposed that leverages the abstraction capabilities of convolutional neural networks (CNNs) together with Hough voting. It is robust, multi-region, and flexible, and can easily be adapted to different modalities.

263 citations
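Hough voting itself can be sketched as follows: each local patch predicts an offset to the object centroid, and the accumulator cell with the most votes is taken as the estimate. In the approach summarized above the offsets are derived from CNN-based analysis of patches; here they are supplied directly as hypothetical input, and vote weighting and accumulator smoothing are omitted.

```python
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[int, int]


def hough_vote(predictions: Iterable[Tuple[Point, Point]]) -> Tuple[Point, int]:
    """Accumulate centroid votes and return (best location, number of votes).

    Each prediction is (patch_position, predicted_offset); the patch votes for
    patch_position + predicted_offset as the object centroid.
    """
    accumulator: Counter = Counter()
    for (x, y), (dx, dy) in predictions:
        accumulator[(x + dx, y + dy)] += 1  # one unweighted vote per patch
    if not accumulator:
        raise ValueError("no votes were cast")
    center, votes = accumulator.most_common(1)[0]  # peak of the accumulator
    return center, votes
```

Because the estimate is the consensus of many patches, a few patches with wrong offsets (the last one in the test below) do not move the peak, which is the source of the robustness usually attributed to voting schemes.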


Journal Article
TL;DR: The VQA-HAT (Human ATtention) dataset is introduced to evaluate the attention maps generated by state-of-the-art visual question answering models against human attention. The results show that current attention models do not seem to look at the same regions as humans.

256 citations


Journal Article
TL;DR: Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and natural language processing communities; it requires reasoning over the visual elements of an image and general knowledge to infer the correct answer.

255 citations


Journal Article
TL;DR: This review critically examines the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms, and exhaustively reviews existing VQA algorithms.

203 citations


Journal Article
TL;DR: SOMAnet, a framework based on a deep convolutional neural network that departs from the usual siamese framework and additionally models other discriminative aspects of the human figure, matches subjects even when they wear different apparel.

196 citations


Journal Article
TL;DR: An original "task-oriented" way to categorize state-of-the-art assistive technology (AT) works is introduced: the final assistive goals are split into tasks, which then serve as pointers to the works in the literature where each task has been used as a component.

Journal Article
TL;DR: A selection of current commercial applications that use computer vision for sports analysis is discussed, and some of the topics currently being addressed by the research community are highlighted.

Journal Article
TL;DR: A comprehensive survey of visibility enhancement for images taken in hazy or foggy scenes, including a discussion of optical models of atmospheric scattering media and image formation.
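A widely used optical model of this kind is the atmospheric scattering model I = J * t + A * (1 - t), where J is the scene radiance, A the airlight, and t = exp(-beta * d) the transmission for scene depth d. The sketch below evaluates it for a single pixel and inverts it, assuming A, beta, and the depth are known; real dehazing methods must estimate them instead. The `t_min` floor is a common practical safeguard, not something taken from the survey.

```python
import math


def transmission(depth: float, beta: float) -> float:
    """Fraction of scene radiance that survives scattering over distance `depth`."""
    return math.exp(-beta * depth)


def hazy_intensity(radiance: float, depth: float, airlight: float, beta: float) -> float:
    """Observed intensity I = J * t + A * (1 - t) for one pixel and one colour channel."""
    t = transmission(depth, beta)
    return radiance * t + airlight * (1.0 - t)


def recover_radiance(observed: float, depth: float, airlight: float, beta: float,
                     t_min: float = 0.1) -> float:
    """Invert the model for J.

    t is floored at t_min to avoid amplifying noise in dense haze, so the
    inverse is exact only when the true transmission exceeds t_min.
    """
    t = max(transmission(depth, beta), t_min)
    return (observed - airlight * (1.0 - t)) / t
```

As depth grows, t tends to 0 and every pixel converges to the airlight, which is why distant regions of hazy images lose contrast.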

Journal Article
TL;DR: A specialized deep convolutional neural network architecture for gait recognition that is less sensitive to several common variations and occlusions that degrade gait recognition performance.

Journal Article
TL;DR: This paper presents the state of the art in preprocessing and processing methods for soccer player tracking, categorizes the different approaches, analyzes their strengths and weaknesses, reviews evaluation criteria, and outlines future research directions.

Journal Article
TL;DR: A robust and efficient line-based Multi-View Stereo algorithm is introduced that uses geometric line matching, which makes it invariant to illumination changes, and generates accurate 3D models at low computational cost, which is especially useful for large-scale urban datasets.

Journal Article
TL;DR: This survey presents an up-to-date critical review of the existing literature on face alignment, focusing on methods that address the overall difficulties and challenges of the topic under uncontrolled conditions.

Journal Article
TL;DR: This paper presents an efficient multi-scale correlated wavelet approach that solves the image dehazing and denoising problem in the frequency domain, based on a generic regularity found in natural images: haze is typically distributed in the low-frequency spectrum of their multi-scale wavelet decomposition.
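Why a smooth haze layer ends up in the low-frequency part of a wavelet decomposition can be seen with one level of a 2-D Haar transform. This sketch assumes, for simplicity, that haze is a spatially uniform additive veil (real haze varies with depth), and it is not the correlated wavelet method of the paper: adding the veil leaves the three detail sub-bands unchanged and only shifts the approximation band.

```python
from typing import List, Tuple

Image = List[List[float]]


def haar2d_level(img: Image) -> Tuple[Image, Image, Image, Image]:
    """One level of an orthonormal 2-D Haar transform (height and width must be even).

    Returns (approximation, left-right detail, top-bottom detail, diagonal detail),
    each at half the input resolution.
    """
    bands: Tuple[Image, Image, Image, Image] = ([], [], [], [])
    for r in range(0, len(img), 2):
        rows: Tuple[List[float], ...] = ([], [], [], [])
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]          # top-left, top-right
            d, e = img[r + 1][c], img[r + 1][c + 1]  # bottom-left, bottom-right
            rows[0].append((a + b + d + e) / 2)  # low frequency: scaled local average
            rows[1].append((a - b + d - e) / 2)  # left minus right
            rows[2].append((a + b - d - e) / 2)  # top minus bottom
            rows[3].append((a - b - d + e) / 2)  # diagonal difference
        for band, row in zip(bands, rows):
            band.append(row)
    return bands
```

Every detail coefficient is a difference of neighbouring pixels, so a constant offset cancels out, while the approximation coefficient of each 2 × 2 block grows by twice the veil value.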

Journal Article
TL;DR: A weakly supervised approach is proposed for learning human actions from video transcripts. It is based on the idea that, given a sequence of input data and a transcript (i.e., a list of the order in which the actions occur in the video), it is possible to infer the actions within the video stream and to learn the corresponding action models without any frame-based annotation.

Journal Article
TL;DR: A novel deep hashing framework with convolutional neural networks (CNNs) for fast person re-identification that simultaneously learns CNN features and hash functions, yielding robust yet discriminative features and similarity-preserving hash codes.
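Once hash codes are available, retrieval reduces to ranking gallery images by Hamming distance to the query code, which is what makes hashing fast. In this sketch the hash function is plain sign thresholding of fixed, hypothetical feature vectors; the framework described above learns the features and the hash functions jointly instead.

```python
from typing import List, Sequence


def binarize(features: Sequence[float]) -> int:
    """Sign thresholding: bit i of the code is 1 when feature i is positive."""
    code = 0
    for i, value in enumerate(features):
        if value > 0:
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Gallery indices sorted from nearest to farthest code (ties keep gallery order)."""
    q = binarize(query)
    codes = [binarize(g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: hamming(q, codes[i]))
```

XOR plus a bit count is far cheaper than a floating-point distance, which is why compact codes scale to large galleries; the quality of the ranking then depends entirely on how well the codes preserve similarity.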

Journal Article
TL;DR: A novel dehazing method is presented that improves visibility in images and videos by detecting and segmenting image regions that contain only water. A semantic white balancing approach for illuminant estimation is also proposed, which uses the dominant colour of the water to address the spectral distortion present in underwater scenes.

Journal Article
TL;DR: The learned image descriptor is shown to generalise beyond the object categories present in the authors' training data, forming a basis for general cross-category sketch-based image retrieval (SBIR).

Journal Article
TL;DR: This paper improves on a recent state-of-the-art camera calibration method for traffic surveillance that is based on two detected vanishing points, and proposes a novel automatic scene scale inference method based on matching bounding boxes of rendered 3D vehicle models with detected bounding boxes in the image.
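Calibration from two vanishing points rests on a standard geometric relation: if u and v are the vanishing points of two orthogonal scene directions and p is the principal point, the back-projected rays (u - p, f) and (v - p, f) are orthogonal, so (u - p)·(v - p) + f^2 = 0, which gives the focal length f in pixels. The sketch assumes square pixels, zero skew, and a known principal point (for example the image centre), and covers only this one step, not the full calibration or scale-inference pipeline.

```python
import math
from typing import Tuple

Point2 = Tuple[float, float]


def focal_from_vanishing_points(u: Point2, v: Point2, p: Point2) -> float:
    """Focal length in pixels from vanishing points u, v of two orthogonal directions.

    Assumes square pixels, zero skew and principal point p, under which
    (u - p) . (v - p) + f**2 = 0.
    """
    dot = (u[0] - p[0]) * (v[0] - p[0]) + (u[1] - p[1]) * (v[1] - p[1])
    if dot >= 0:
        # A non-negative dot product cannot come from orthogonal directions.
        raise ValueError("vanishing points are inconsistent with orthogonal directions")
    return math.sqrt(-dot)
```

In traffic scenes the two directions are typically the direction of vehicle motion and the perpendicular direction within the road plane, both of which can be observed from vehicle trajectories and edges.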

Journal Article
TL;DR: A general architecture is proposed in which a system represents both the content and the underlying concepts of an image using an SDG, and the extracted graphs are argued to capture the syntactic and semantic content of images with reasonable accuracy.

Journal Article
TL;DR: Self-paced learning with diversity is proposed to learn an optimal multi-modal embedding space based on non-linear mapping functions. Training the model gradually, from easy rankings by diverse queries to more complex ones, enhances its robustness to outliers and improves generalization.
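The easy-to-hard schedule with diversity can be illustrated with the sample-selection step of self-paced learning with diversity as it is commonly formulated: within each group, samples are ranked by ascending loss and the sample of rank r is selected if its loss is below lam + gamma / (sqrt(r) + sqrt(r - 1)). Treat this as a generic sketch under that assumption; the grouping, the losses, and the values of `lam` and `gamma` are hypothetical, and the ranking-specific formulation in the paper may differ.

```python
import math
from typing import Dict, List, Sequence


def spld_select(losses: Sequence[float], groups: Sequence[int],
                lam: float, gamma: float) -> List[int]:
    """Return 0/1 sample weights chosen by the self-paced-with-diversity rule."""
    by_group: Dict[int, List[int]] = {}
    for i, g in enumerate(groups):
        by_group.setdefault(g, []).append(i)
    weights = [0] * len(losses)
    for members in by_group.values():
        members.sort(key=lambda i: losses[i])  # easiest samples of the group first
        for rank, i in enumerate(members, start=1):
            # The diversity bonus shrinks with rank, so each group's easiest
            # samples are favoured over yet more samples from a single group.
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            if losses[i] < threshold:
                weights[i] = 1
    return weights
```

With lam = 0.3 alone (gamma = 0), only the sample with loss 0.2 would be selected in the test below; the diversity term also admits the easiest sample of the second group. Raising lam and gamma between training rounds gradually admits harder samples.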

Journal Article
TL;DR: Results for 3D-2D face recognition on the UHDB11 3D/2D database, whose 2D images exhibit large illumination and pose variations, support the hypothesis that, on challenging datasets, 3D+2D outperforms 2D-2D and narrows the performance gap relative to 3D.

Journal Article
TL;DR: This paper presents an approach for reconstructing large-scale outdoor scenes through monocular motion stereo at interactive frame rates on a modern mobile device; it is the first method to enable live reconstruction of large outdoor scenes on a mobile device.

Journal Article
TL;DR: This paper addresses the problem of organizing egocentric photo streams acquired by a wearable camera into semantically meaningful segments, an important step towards automatically annotating these photos for browsing and retrieval.

Journal Article
TL;DR: A novel algorithm is proposed that separates background images to minimize the influence of noise, non-uniform illumination, and lesions, together with two different strategies for segmenting thin and thick blood vessels.

Journal Article
TL;DR: Experimental results show that the proposed method outperforms the state of the art and captures what visually characterizes a given trait: using a deconvolution strategy, a clear distinction in features, patterns, and content is found between low and high values of that trait.