Feature matching is at the base of many computer vision problems, such as object recognition or structure from motion. Current methods rely on costly descriptors for detection and matching. In this paper, we propose a very fast binary descriptor based on BRIEF, called ORB, which is rotation invariant and resistant to noise. We demonstrate through experiments how ORB is at two orders of magnitude faster than SIFT, while performing as well in many situations. The efficiency is tested on several real-world applications, including object detection and patch-tracking on a smart phone.

ORB: An efficient alternative to SIFT or SURF

Humans perceive the three-dimensional structure of the world with apparent ease. However, despite all of the recent advances in computer vision research, the dream of having a computer interpret an image at the same level as a two-year old remains elusive. Why is computer vision such a challenging problem and what is the current state of the art? Computer Vision: Algorithms and Applications explores the variety of techniques commonly used to analyze and interpret images. It also describes challenging real-world applications where vision is being successfully used, both for specialized applications such as medical imaging, and for fun, consumer-level tasks such as image editing and stitching, which students can apply to their own personal photos and videos. More than just a source of recipes, this exceptionally authoritative and comprehensive textbook/reference also takes a scientific approach to basic vision problems, formulating physical models of the imaging process before inverting them to produce descriptions of a scene. These problems are also analyzed using statistical models and solved using rigorous engineering techniques Topics and features: structured to support active curricula and project-oriented courses, with tips in the Introduction for using the book in a variety of customized courses; presents exercises at the end of each chapter with a heavy emphasis on testing algorithms and containing numerous suggestions for small mid-term projects; provides additional material and more detailed mathematical topics in the Appendices, which cover linear algebra, numerical techniques, and Bayesian estimation theory; suggests additional reading at the end of each chapter, including the latest research in each sub-field, in addition to a full Bibliography at the end of the book; supplies supplementary course material for students at the associated website, http://szeliski.org/Book/. Suitable for an upper-level undergraduate or graduate-level course in computer science or engineering, this textbook focuses on basic techniques that work under real-world conditions and encourages students to push their creative boundaries. Its design and exposition also make it eminently suitable as a unique reference to the fundamental techniques and current research literature in computer vision.

/pdf/computer-vision-algorithms-and-applications-25dn6wu83j.pdf

Computer Vision: Algorithms and Applications

“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”の学習報告

We propose a direct (feature-less) monocular SLAM algorithm which, in contrast to current state-of-the-art regarding direct methods, allows to build large-scale, consistent maps of the environment Along with highly accurate pose estimation based on direct image alignment, the 3D environment is reconstructed in real-time as pose-graph of keyframes with associated semi-dense depth maps These are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons The explicitly scale-drift aware formulation allows the approach to operate on challenging sequences including large variations in scene scale Major enablers are two key novelties: (1) a novel direct tracking method which operates on \(\mathfrak{sim}(3)\), thereby explicitly detecting scale-drift, and (2) an elegant probabilistic solution to include the effect of noisy depth values into tracking The resulting direct monocular SLAM system runs in real-time on a CPU

/pdf/lsd-slam-large-scale-direct-monocular-slam-4eqdolimo8.pdf

LSD-SLAM: Large-Scale Direct Monocular SLAM

Incremental Structure-from-Motion is a prevalent strategy for 3D reconstruction from unordered image collections. While incremental reconstruction systems have tremendously advanced in all regards, robustness, accuracy, completeness, and scalability remain the key problems towards building a truly general-purpose pipeline. We propose a new SfM technique that improves upon the state of the art to make a further step towards this ultimate goal. The full reconstruction pipeline is released to the public as an open-source implementation.

/pdf/structure-from-motion-revisited-10382wusrz.pdf

Structure-from-Motion Revisited

The paper presents a system for automatic, geo-registered, real-time 3D reconstruction from video of urban scenes. The system collects video streams, as well as GPS and inertia measurements in order to place the reconstructed models in geo-registered coordinates. It is designed using current state of the art real-time modules for all processing steps. It employs commodity graphics hardware and standard CPU's to achieve real-time performance. We present the main considerations in designing the system and the steps of the processing pipeline. Our system extends existing algorithms to meet the robustness and variability necessary to operate out of the lab. To account for the large dynamic range of outdoor videos the processing pipeline estimates global camera gain changes in the feature tracking stage and efficiently compensates for these in stereo estimation without impacting the real-time performance. The required accuracy for many applications is achieved with a two-step stereo reconstruction process exploiting the redundancy across frames. We show results on real video sequences comprising hundreds of thousands of frames.

/pdf/detailed-real-time-urban-3d-reconstruction-from-video-67719pruej.pdf

Detailed Real-Time Urban 3D Reconstruction from Video

We propose a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose without requiring multiple stages or having to examine multiple hypotheses. Unlike a recently proposed single-shot technique for this task [10] that only predicts an approximate 6D pose that must then be refined, ours is accurate enough not to require additional post-processing. As a result, it is much faster - 50 fps on a Titan X (Pascal) GPU - and more suitable for real-time processing. The key component of our method is a new CNN architecture inspired by [27, 28] that directly predicts the 2D image locations of the projected vertices of the object's 3D bounding box. The object's 6D pose is then estimated using a PnP algorithm. For single object and multiple object pose estimation on the LINEMOD and OCCLUSION datasets, our approach substantially outperforms other recent CNN-based approaches [10, 25] when they are all used without postprocessing. During post-processing, a pose refinement step can be used to boost the accuracy of these two methods, but at 10 fps or less, they are much slower than our method.

/pdf/real-time-seamless-single-shot-6d-object-pose-prediction-275s2uhots.pdf

Real-Time Seamless Single Shot 6D Object Pose Prediction

Video analytics will drive a wide range of applications with great potential to impact society. A geographically distributed architecture of public clouds and edges that extend down to the cameras is the only feasible approach to meeting the strict real-time requirements of large-scale live video analytics.

/pdf/real-time-video-analytics-the-killer-app-for-edge-computing-pl3yw37p07.pdf

Real-Time Video Analytics: The Killer App for Edge Computing

This paper describes novel implementations of the KLT feature track- ing and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and is suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by ex- ploiting parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT im- plementation tracks about a thousand features in real-time at 30 Hz on 1024 £ 768 resolution video which is a 20 times improvement over the CPU. It works on both ATI and NVIDIA graphics cards. The GPU-based SIFT implementation works on NVIDIA cards and extracts about 800 features from 640 £ 480 video at 10Hz which is approximately 10 times faster than an optimized CPU implementation.

/pdf/gpu-based-video-feature-tracking-and-matching-o3by4wugzu.pdf

GPU-based Video Feature Tracking And Matching

Data-driven techniques help boost agricultural productivity by increasing yields, reducing losses and cutting down input costs. However, these techniques have seen sparse adoption owing to high costs of manual data collection and limited connectivity solutions. In this paper, we present FarmBeats, an end-to-end IoT platform for agriculture that enables seamless data collection from various sensors, cameras and drones. FarmBeats's system design that explicitly accounts for weather-related power and Internet outages has enabled six month long deployments in two US farms.

/pdf/farmbeats-an-iot-platform-for-data-driven-agriculture-lyqra6g6f7.pdf

Sudipta N. Sinha

Papers

Detailed Real-Time Urban 3D Reconstruction from Video

Real-Time Seamless Single Shot 6D Object Pose Prediction

Real-Time Video Analytics: The Killer App for Edge Computing

GPU-based Video Feature Tracking And Matching

Farmbeats: an IoT platform for data-driven agriculture