Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year-old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques. Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/.
Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

Computer Vision: Algorithms and Applications

Study report on “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”

The PASCAL Visual Object Classes Challenge

A scheme is developed for classifying the types of motion perceived by a humanlike robot. It is assumed that the robot receives visual images of the scene using a perspective system model. Equations, theorems, concepts, clues, etc., relating the objects, their positions, and their motion to their images on the focal plane are presented.

http://ieeexplore.ieee.org/iel2/815/3148/00100651.pdf

Robot vision

In the real world, a realistic setting for computer vision or multimedia recognition problems is that some classes contain lots of training data while many classes contain only a small amount. Therefore, how to use frequent classes to help learn rare classes, for which it is harder to collect training data, is an open question. Learning with Shared Information is an emerging topic in machine learning, computer vision and multimedia analysis. There are different levels of components that can be shared during the concept modeling and machine learning stages, such as generic object parts, attributes, transformations, regularization parameters and training examples. Regarding the specific methods, multi-task learning, transfer learning and deep learning can be seen as using different strategies to share information. These methods for learning with shared information are very effective in solving real-world large-scale problems. This special issue aims at gathering the recent advances in learning with shared information methods and their applications in computer vision and multimedia analysis. Both state-of-the-art works and literature reviews are welcome for submission. Papers addressing interesting real-world computer vision and multimedia applications are especially encouraged.
Topics of interest include, but are not limited to:
• Multi-task learning or transfer learning for large-scale computer vision and multimedia analysis
• Deep learning for large-scale computer vision and multimedia analysis
• Multi-modal approach for large-scale computer vision and multimedia analysis
• Different sharing strategies, e.g., sharing generic object parts, sharing attributes, sharing transformations, sharing regularization parameters and sharing training examples
• Real-world computer vision and multimedia applications based on learning with shared information, e.g., event detection, object recognition, object detection, action recognition, human head pose estimation, object tracking, location-based services, semantic indexing
• New datasets and metrics to evaluate the benefit of the proposed sharing ability for the specific computer vision or multimedia problem
• Survey papers regarding the topic of learning with shared information
Authors who are unsure whether their planned submission is in scope may contact the guest editors prior to the submission deadline with an abstract, in order to receive feedback.

IEEE transactions on pattern analysis and machine intelligence

Compositing a scene from multiple images is of considerable interest to graphics professionals. Typical compositing techniques involve estimation or explicit preparation of matte by an artist. In this article, we address the problem of automatic compositing of a scene from images obtained through variable exposure photography. We consider the High Dynamic Range Imaging (HDRI) problem and review some of the existing approaches for directly generating a Low Dynamic Range (LDR) image from multi-exposure images. We propose a computationally efficient method of scene compositing using edge-preserving filters such as bilateral filters. The key challenge is to composite the multi-exposure images in such a way so as to preserve details in both brightly and poorly illuminated regions of the scene within the limited dynamic range.
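
The edge-preserving behaviour of a bilateral filter, mentioned in the abstract above, can be illustrated with a minimal brute-force sketch. This is a generic textbook bilateral filter for a single grayscale image, not the compositing method proposed in the paper; the function name bilateral_filter and the parameters radius, sigma_space and sigma_range are assumptions chosen for illustration.

```python
import numpy as np

def bilateral_filter(image, radius=2, sigma_space=2.0, sigma_range=0.1):
    """Brute-force bilateral filter for a 2D grayscale image (illustrative sketch)."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    # Replicate border pixels so every pixel has a full neighbourhood.
    padded = np.pad(image, radius, mode="edge")
    weighted_sum = np.zeros_like(image)
    weight_total = np.zeros_like(image)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbour image shifted by (dy, dx), aligned with the input.
            neighbor = padded[radius + dy : radius + dy + h,
                              radius + dx : radius + dx + w]
            # Spatial weight: falls off with distance from the centre pixel.
            spatial = np.exp(-(dy * dy + dx * dx) / (2 * sigma_space ** 2))
            # Range weight: falls off with intensity difference, which is
            # what keeps strong edges from being averaged away.
            tonal = np.exp(-((neighbor - image) ** 2) / (2 * sigma_range ** 2))
            weight = spatial * tonal
            weighted_sum += weight * neighbor
            weight_total += weight
    # The centre pixel always contributes weight 1, so no division by zero.
    return weighted_sum / weight_total

# Usage: a sharp step edge survives filtering almost unchanged.
step = np.zeros((8, 8))
step[:, 4:] = 1.0
smoothed = bilateral_filter(step)
```

With a small sigma_range, pixels on the far side of an edge receive almost no weight, so smoothing happens only within regions of similar intensity.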

Bilateral Filter Based Compositing for Variable Exposure Photography

Removing blur caused by camera shake in images has always been a challenging problem in the computer vision literature due to its ill-posed nature. Motion blur caused by the relative motion between the camera and the object in 3D space induces a spatially varying blurring effect over the entire image. In this paper, we propose a novel deep filter based on a Generative Adversarial Network (GAN) architecture integrated with a global skip connection and dense architecture in order to tackle this problem. Our model, by bypassing the process of blur kernel estimation, significantly reduces the test time, which is necessary for practical applications. The experiments on the benchmark datasets demonstrate the effectiveness of the proposed method, which outperforms the state-of-the-art blind deblurring algorithms both quantitatively and qualitatively.

Deep Generative Filter for Motion Deblurring

High Dynamic Range (HDR) imaging requires one to composite multiple, differently exposed images of a scene in the irradiance domain and perform tone mapping of the generated HDR image for displaying on Low Dynamic Range (LDR) devices. In the case of dynamic scenes, standard techniques may introduce artifacts called ghosts if the scene changes are not accounted for. In this paper, we consider the blind HDR problem for dynamic scenes. We develop a novel bottom-up segmentation algorithm through superpixel grouping which enables us to detect scene changes. We then employ a piecewise patch-based compositing methodology in the gradient domain to directly generate the ghost-free LDR image of the dynamic scene. Being a blind method, the primary advantage of our approach is that we do not assume any knowledge of the camera response function or exposure settings while preserving the contrast even in the non-stationary regions of the scene. We compare the results of our approach for both static and dynamic scenes with those of state-of-the-art techniques.

Reconstruction of high contrast images for dynamic scenes

We have developed a convolutional neural network for the purpose of recognizing facial expressions in human beings. We have fine-tuned an existing convolutional neural network model, trained on the visual recognition dataset used in the ILSVRC2012, on two widely used facial expression datasets, CFEE and RaFD, which when trained and tested independently yielded test accuracies of 74.79% and 95.71%, respectively. Generalization of results was evident by training on one dataset and testing on the other. Further, the image product of the cropped faces and their visual saliency maps was computed using a Deep Multi-Layer Network for saliency prediction and was fed to the facial expression recognition CNN. In the most generalized experiment, we observed the top-1 accuracy in the test set to be 65.39%. General confusion trends between different facial expressions as exhibited by humans were also observed.
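
The "image product" of a cropped face and its saliency map can be sketched as follows, assuming it means an element-wise multiplication. The abstract does not specify this detail, so the function name apply_saliency, the rescaling of the saliency map to [0, 1], and the broadcasting over colour channels are all assumptions made for illustration.

```python
import numpy as np

def apply_saliency(face, saliency):
    """Weight each pixel of a face crop by its saliency (illustrative sketch)."""
    face = np.asarray(face, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    # Assumption: rescale the saliency map to [0, 1] before multiplying.
    span = saliency.max() - saliency.min()
    if span == 0:
        norm = np.ones_like(saliency)
    else:
        norm = (saliency - saliency.min()) / span
    # Broadcast a 2D saliency map across the colour channels of an H x W x C image.
    if face.ndim == 3:
        norm = norm[..., None]
    return face * norm

# Usage: salient pixels keep their intensity, non-salient pixels are suppressed.
face = np.full((2, 2, 3), 10.0)
saliency = np.array([[0.0, 1.0], [0.5, 1.0]])
weighted = apply_saliency(face, saliency)
```

The weighted image would then be passed to the expression classifier in place of the raw crop.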

Facial Expression Recognition Using Visual Saliency and Deep Learning

Traditional 3D Convolutional Neural Networks (CNNs) are computationally expensive, memory intensive, prone to overfitting, and most importantly, there is a need to improve their feature learning capabilities. To address these issues, we propose the Rectified Local Phase Volume (ReLPV) block, an efficient alternative to the standard 3D convolutional layer. The ReLPV block extracts the phase in a 3D local neighborhood (e.g., 3x3x3) of each position of the input map to obtain the feature maps. The phase is extracted by computing the 3D Short-Term Fourier Transform (STFT) at multiple fixed low frequency points in the 3D local neighborhood of each position. These feature maps at different frequency points are then linearly combined after passing them through an activation function. The ReLPV block provides significant parameter savings of at least 3^3 to 13^3 times compared to the standard 3D convolutional layer with filter sizes 3x3x3 to 13x13x13, respectively. We show that the feature learning capabilities of the ReLPV block are significantly better than those of the standard 3D convolutional layer. Furthermore, it produces consistently better results across different 3D data representations. We achieve state-of-the-art accuracy on the volumetric ModelNet10 and ModelNet40 datasets while utilizing only 11% of the parameters of the current state-of-the-art. We also improve the state-of-the-art on the UCF-101 split-1 action recognition dataset by 5.68% (when trained from scratch) while using only 15% of the parameters of the state-of-the-art.
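
The local STFT step described above can be sketched in NumPy. This is only an illustration of computing a short-term Fourier transform over a 3x3x3 neighbourhood at fixed low frequency points, not the authors' ReLPV implementation. The choice of frequency points (one from each +/- pair of nonzero vectors in {-1, 0, 1}^3, scaled by 1/n), the edge padding, and the names low_frequency_points and local_stft_3d are assumptions.

```python
import itertools
import numpy as np

def low_frequency_points(n=3):
    """One frequency vector per +/- pair of nonzero points in {-1, 0, 1}^3, scaled by 1/n."""
    points = []
    for k in itertools.product((-1, 0, 1), repeat=3):
        if k == (0, 0, 0):
            continue
        # For real input, v and -v give conjugate responses, so keep only one.
        if tuple(-c for c in k) in points:
            continue
        points.append(k)
    return np.array(points, dtype=float) / n

def local_stft_3d(volume, n=3):
    """Complex STFT response at each voxel over an n x n x n neighbourhood."""
    volume = np.asarray(volume, dtype=float)
    r = n // 2
    freqs = low_frequency_points(n)
    padded = np.pad(volume, r, mode="edge")
    d, h, w = volume.shape
    out = np.zeros((len(freqs), d, h, w), dtype=complex)
    for dz, dy, dx in itertools.product(range(-r, r + 1), repeat=3):
        # Volume sampled at position x - y for the offset y = (dz, dy, dx).
        shifted = padded[r - dz : r - dz + d,
                         r - dy : r - dy + h,
                         r - dx : r - dx + w]
        # Complex exponential exp(-j 2 pi v . y) for every frequency point v.
        basis = np.exp(-2j * np.pi * (freqs @ np.array([dz, dy, dx])))
        out += basis[:, None, None, None] * shifted
    return out

# Usage: the local phase is the angle of the complex response.
volume = np.random.default_rng(0).random((4, 5, 6))
response = local_stft_3d(volume)
local_phase = np.angle(response)
```

Because the transform uses fixed complex exponentials rather than learned kernels, this stage has no trainable parameters, which is consistent with the parameter savings the abstract reports for the block as a whole.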

Shanmuganathan Raman

Papers

Bilateral Filter Based Compositing for Variable Exposure Photography

Deep Generative Filter for Motion Deblurring

Reconstruction of high contrast images for dynamic scenes

Facial Expression Recognition Using Visual Saliency and Deep Learning

LP-3DCNN: Unveiling Local Phase in 3D Convolutional Neural Networks