scispace - formally typeset
Search or ask a question
DOI

Acquisition, compression and rendering of depth and texture for multi-view video

01 Jan 2009-
TL;DR: A new multi-view depth-estimation technique is proposed, employing a one-dimensional optimization strategy that reduces the noise level in the estimated depth images and enforces consistent depth images across the views, and is suitable for execution on a standard Graphics Processor Unit (GPU).
Abstract: Three-dimensional (3D) video and imaging technologies is an emerging trend in the development of digital video systems, as we presently witness the appearance of 3D displays, coding systems, and 3D camera setups. Three-dimensional multi-view video is typically obtained from a set of synchronized cameras, which are capturing the same scene from different viewpoints. This technique especially enables applications such as freeviewpoint video or 3D-TV. Free-viewpoint video applications provide the feature to interactively select and render a virtual viewpoint of the scene. A 3D experience such as for example in 3D-TV is obtained if the data representation and display enable to distinguish the relief of the scene, i.e., the depth within the scene. With 3D-TV, the depth of the scene can be perceived using a multi-view display that renders simultaneously several views of the same scene. To render these multiple views on a remote display, an efficient transmission, and thus compression of the multi-view video is necessary. However, a major problem when dealing with multiview video is the intrinsically large amount of data to be compressed, decompressed and rendered. We aim at an efficient and flexible multi-view video system, and explore three different aspects. First, we develop an algorithm for acquiring a depth signal from a multi-view setup. Second, we present efficient 3D rendering algorithms for a multi-view signal. Third, we propose coding techniques for 3D multi-view signals, based on the use of an explicit depth signal. This motivates that the thesis is divided in three parts. The first part (Chapter 3) addresses the problem of 3D multi-view video acquisition. Multi-view video acquisition refers to the task of estimating and recording a 3D geometric description of the scene. A 3D description of the scene can be represented by a so-called depth image, which can be estimated by triangulation of the corresponding pixels in the multiple views. Initially, we focus on the problem of depth estimation using two views, and present the basic geometric model that enables the triangulation of corresponding pixels across the views. Next, we review two calculation/optimization strategies for determining corresponding pixels: a local and a one-dimensional optimization strategy. Second, to generalize from the two-view case, we introduce a simple geometric model for estimating the depth using multiple views simultaneously. Based on this geometric model, we propose a new multi-view depth-estimation technique, employing a one-dimensional optimization strategy that (1) reduces the noise level in the estimated depth images and (2) enforces consistent depth images across the views. The second part (Chapter 4) details the problem of multi-view image rendering. Multi-view image rendering refers to the process of generating synthetic images using multiple views. Two different rendering techniques are initially explored: a 3D image warping and a mesh-based rendering technique. Each of these methods has its limitations and suffers from either high computational complexity or low image rendering quality. As a consequence, we present two image-based rendering algorithms that improves the balance on the aforementioned issues. First, we derive an alternative formulation of the relief texture algorithm which was extented to the geometry of multiple views. The proposed technique features two advantages: it avoids rendering artifacts ("holes") in the synthetic image and it is suitable for execution on a standard Graphics Processor Unit (GPU). Second, we propose an inverse mapping rendering technique that allows a simple and accurate re-sampling of synthetic pixels. Experimental comparisons with 3D image warping show an improvement of rendering quality of 3.8 dB for the relief texture mapping and 3.0 dB for the inverse mapping rendering technique. The third part concentrates on the compression problem of multi-view texture and depth video (Chapters 5–7). In Chapter 5, we extend the standard H.264/MPEG-4 AVC video compression algorithm for handling the compression of multi-view video. As opposed to the Multi-view Video Coding (MVC) standard that encodes only the multi-view texture data, the proposed encoder peforms the compression of both the texture and the depth multi-view sequences. The proposed extension is based on exploiting the correlation between the multiple camera views. To this end, two different approaches for predictive coding of views have been investigated: a block-based disparity-compensated prediction technique and a View Synthesis Prediction (VSP) scheme. Whereas VSP relies on an accurate depth image, the block-based disparity-compensated prediction scheme can be performed without any geometry information. Our encoder adaptively selects the most appropriate prediction scheme using a rate-distortion criterion for an optimal prediction-mode selection. We present experimental results for several texture and depth multi-view sequences, yielding a quality improvement of up to 0.6 dB for the texture and 3.2 dB for the depth, when compared to solely performing H.264/MPEG-4AVC disparitycompensated prediction. Additionally, we discuss the trade-off between the random-access to a user-selected view and the coding efficiency. Experimental results illustrating and quantifying this trade-off are provided. In Chapter 6, we focus on the compression of a depth signal. We present a novel depth image coding algorithm which concentrates on the special characteristics of depth images: smooth regions delineated by sharp edges. The algorithm models these smooth regions using parameterized piecewiselinear functions and sharp edges by a straight line, so that it is more efficient than a conventional transform-based encoder. To optimize the quality of the coding system for a given bit rate, a special global rate-distortion optimization balances the rate against the accuracy of the signal representation. For typical bit rates, i.e., between 0.01 and 0.25 bit/pixel, experiments have revealed that the coder outperforms a standard JPEG-2000 encoder by 0.6-3.0 dB. Preliminary results were published in the Proceedings of 26th Symposium on Information Theory in the Benelux. In Chapter 7, we propose a novel joint depth-texture bit-allocation algorithm for the joint compression of texture and depth images. The described algorithm combines the depth and texture Rate-Distortion (R-D) curves, to obtain a single R-D surface that allows the optimization of the joint bit-allocation in relation to the obtained rendering quality. Experimental results show an estimated gain of 1 dB compared to a compression performed without joint bit-allocation optimization. Besides this, our joint R-D model can be readily integrated into an multi-view H.264/MPEG-4 AVC coder because it yields the optimal compression setting with a limited computation effort.
Citations
More filters
Journal ArticleDOI
TL;DR: A new rendering algorithm is explored that enables to compute a free-viewpoint between two reference views from existing cameras that performs forward warping for both texture and depth simultaneously.

136 citations


Cites background from "Acquisition, compression and render..."

  • ...Obviously, new digital image and video processing algorithms are needed that support the representation of a 3D signal and enable the processing of it....

    [...]

Proceedings ArticleDOI
10 May 2017
TL;DR: In this article, a Siamese viewpoint factorization network is proposed to align different videos together without explicitly comparing 3D shapes, and a 3D shape completion network is used to extract the full shape of an object from partial observations.
Abstract: Traditional approaches for learning 3D object categories use either synthetic data or manual supervision. In this paper, we propose a method which does not require manual annotations and is instead cued by observing objects from a moving vantage point. Our system builds on two innovations: a Siamese viewpoint factorization network that robustly aligns different videos together without explicitly comparing 3D shapes; and a 3D shape completion network that can extract the full shape of an object from partial observations. We also demonstrate the benefits of configuring networks to perform probabilistic predictions as well as of geometry-aware data augmentation schemes. We obtain state-of-the-art results on publicly-available benchmarks.

104 citations

Journal ArticleDOI
TL;DR: In this article, a new calibration procedure is proposed for a stereovision setup, which uses the object of interest as the calibration target, provided the observed surface has a known definition (e.g., its CAD model).
Abstract: A new calibration procedure is proposed for a stereovision setup It uses the object of interest as the calibration target, provided the observed surface has a known definition (eg, its CAD model) In a first step, the transformation matrices needed for the calibration of the setup are determined assuming that the object conforms to its CAD model Then the 3D shape of the surface of interest is evaluated by deforming the a priori given freeform surface These two steps are performed via an integrated approach to stereoDIC The measured 3D shape of a machined Bezier patch is validated against data obtained by a coordinate measuring machine The feasibility of the calibration method’s application to large surfaces is shown with the analysis of a 2-m2 automotive roof panel

58 citations

Proceedings ArticleDOI
TL;DR: A new rendering algorithm that computes a free-viewpoint based on depth image warping between two reference views from existing cameras is explored and three quality enhancing techniques that specifically aim at solving the major artifacts are developed.
Abstract: Interactive free-viewpoint selection applied to a 3D multi-view signal is a possible attractive feature of the rapidly developing 3D TV media. This paper explores a new rendering algorithm that computes a free-viewpoint based on depth image warping between two reference views from existing cameras. We have developed three quality enhancing techniques that specifically aim at solving the major artifacts. First, resampling artifacts are filled in by a combination of median filtering and inverse warping. Second, contour artifacts are processed while omitting warping of edges at high discontinuities. Third, we employ a depth signal for more accurate disocclusion inpainting. We obtain an average PSNR gain of 3 dB and 4.5 dB for the 'Breakdancers' and 'Ballet' sequences, respectively, compared to recently published results. While experimenting with synthetic data, we observe that the rendering quality is highly dependent on the complexity of the scene. Moreover, experiments are performed using compressed video from surrounding cameras. The overall system quality is dominated by the rendering quality and not by coding.

45 citations

Proceedings ArticleDOI
04 May 2009
TL;DR: The key feature of the approach is warping texture and depth in the first stage simultaneously and postpone blending the new view to a later stage, thereby avoiding errors in the virtual depth map.
Abstract: This paper evaluates our 3D view interpolation rendering algorithm and proposes a few performance improving techniques. We aim at developing a rendering method for free-viewpoint 3DTV, based on depth image warping from surrounding cameras. The key feature of our approach is warping texture and depth in the first stage simultaneously and postpone blending the new view to a later stage, thereby avoiding errors in the virtual depth map. We evaluate the rendering quality in two ways. Firstly, it is measured by varying the distance between the two nearest cameras. We have obtained a PSNR gain of 3 dB and 4.5 dB for the ‘Breakdancers’ and ‘Ballet’ sequences, respectively, compared to the performance of a recent algorithm. A second series of tests in measuring the rendering quality were performed using compressed video or images from surrounding cameras. The overall quality of the system is dominated by rendering quality and not by coding.

41 citations


Cites background from "Acquisition, compression and render..."

  • ...Freeviewpoint video will be an important and innovative feature of 3DTV and an interesting extension [1]....

    [...]

References
More filters
Journal ArticleDOI
TL;DR: New results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form that provide the basis for an automatic system that can solve the Location Determination Problem under difficult viewing.
Abstract: A new paradigm, Random Sample Consensus (RANSAC), for fitting a model to experimental data is introduced. RANSAC is capable of interpreting/smoothing data containing a significant percentage of gross errors, and is thus ideally suited for applications in automated image analysis where interpretation is based on the data provided by error-prone feature detectors. A major portion of this paper describes the application of RANSAC to the Location Determination Problem (LDP): Given an image depicting a set of landmarks with known locations, determine that point in space from which the image was obtained. In response to a RANSAC requirement, new results are derived on the minimum number of landmarks needed to obtain a solution, and algorithms are presented for computing these minimum-landmark solutions in closed form. These results provide the basis for an automatic system that can solve the LDP under difficult viewing

23,396 citations


"Acquisition, compression and render..." refers methods in this paper

  • ..., the RANdom SAmple Consensus (RANSAC) algorithm [58]....

    [...]

01 Jan 2001
TL;DR: This book is referred to read because it is an inspiring book to give you more chance to get experiences and also thoughts and it will show the best book collections and completed collections.
Abstract: Downloading the book in this website lists can give you more advantages. It will show you the best book collections and completed collections. So many books can be found in this website. So, this is not only this multiple view geometry in computer vision. However, this book is referred to read because it is an inspiring book to give you more chance to get experiences and also thoughts. This is simple, read the soft file of the book and you get it.

14,282 citations


"Acquisition, compression and render..." refers background or methods in this paper

  • ...This technique is known as the normalized Direct Linear Transformation (DLT) algorithm [35]....

    [...]

  • ...First, the acquisition is flexible because the epipolar geometry can be estimated without any calibration device [35]....

    [...]

  • ...A planar homography transform is a non-singular linear relation between two planes [35]....

    [...]

  • ...2For simplicity, we denote (h) = h , according to the notation used in [35]....

    [...]

Journal ArticleDOI
ZhenQiu Zhang1
TL;DR: A flexible technique to easily calibrate a camera that only requires the camera to observe a planar pattern shown at a few (at least two) different orientations is proposed and advances 3D computer vision one more step from laboratory environments to real world use.
Abstract: We propose a flexible technique to easily calibrate a camera. It only requires the camera to observe a planar pattern shown at a few (at least two) different orientations. Either the camera or the planar pattern can be freely moved. The motion need not be known. Radial lens distortion is modeled. The proposed procedure consists of a closed-form solution, followed by a nonlinear refinement based on the maximum likelihood criterion. Both computer simulation and real data have been used to test the proposed technique and very good results have been obtained. Compared with classical techniques which use expensive equipment such as two or three orthogonal planes, the proposed technique is easy to use and flexible. It advances 3D computer vision one more step from laboratory environments to real world use.

13,200 citations


"Acquisition, compression and render..." refers methods in this paper

  • ...Using the computed solution vector b, the intrinsic parameters can be derived [33] as follows....

    [...]

  • ...As opposed to the standard technique [33], we perform the correction of the non-linear radial lens distortion prior to the camera calibration procedure....

    [...]

  • ...To compute the camera parameters, a practical technique [33] based on a calibration rig is employed....

    [...]

  • ...This description corresponds to the notation adopted in the literature [33]....

    [...]

Proceedings ArticleDOI
01 Aug 1996
TL;DR: This paper describes a sampled representation for light fields that allows for both efficient creation and display of inward and outward looking views, and describes a compression system that is able to compress the light fields generated by more than a factor of 100:1 with very little loss of fidelity.
Abstract: A number of techniques have been proposed for flying through scenes by redisplaying previously rendered or digitized views. Techniques have also been proposed for interpolating between views by warping input images, using depth information or correspondences between multiple images. In this paper, we describe a simple and robust method for generating new views from arbitrary camera positions without depth information or feature matching, simply by combining and resampling the available images. The key to this technique lies in interpreting the input images as 2D slices of a 4D function the light field. This function completely characterizes the flow of light through unobstructed space in a static scene with fixed illumination. We describe a sampled representation for light fields that allows for both efficient creation and display of inward and outward looking views. We hav e created light fields from large arrays of both rendered and digitized images. The latter are acquired using a video camera mounted on a computer-controlled gantry. Once a light field has been created, new views may be constructed in real time by extracting slices in appropriate directions. Since the success of the method depends on having a high sample rate, we describe a compression system that is able to compress the light fields we have generated by more than a factor of 100:1 with very little loss of fidelity. We also address the issues of antialiasing during creation, and resampling during slice extraction. CR Categories: I.3.2 [Computer Graphics]: Picture/Image Generation — Digitizing and scanning, Viewing algorithms; I.4.2 [Computer Graphics]: Compression — Approximate methods Additional keywords: image-based rendering, light field, holographic stereogram, vector quantization, epipolar analysis

4,426 citations


"Acquisition, compression and render..." refers background in this paper

  • ...However, this is the only simple alternative image quality metric that has been widely accepted and used by the research community (also adopted in key literature [59] on rendering)....

    [...]

  • ...For example, Light Field [59] and Lumigraph [61] record a 3D scene by capturing all rays traversing a 3D volume using a large array of cameras....

    [...]

Journal ArticleDOI
TL;DR: A "true" two-dimensional transform that can capture the intrinsic geometrical structure that is key in visual information is pursued and it is shown that with parabolic scaling and sufficient directional vanishing moments, contourlets achieve the optimal approximation rate for piecewise smooth functions with discontinuities along twice continuously differentiable curves.
Abstract: The limitations of commonly used separable extensions of one-dimensional transforms, such as the Fourier and wavelet transforms, in capturing the geometry of image edges are well known. In this paper, we pursue a "true" two-dimensional transform that can capture the intrinsic geometrical structure that is key in visual information. The main challenge in exploring geometry in images comes from the discrete nature of the data. Thus, unlike other approaches, such as curvelets, that first develop a transform in the continuous domain and then discretize for sampled data, our approach starts with a discrete-domain construction and then studies its convergence to an expansion in the continuous domain. Specifically, we construct a discrete-domain multiresolution and multidirection expansion using nonseparable filter banks, in much the same way that wavelets were derived from filter banks. This construction results in a flexible multiresolution, local, and directional image expansion using contour segments, and, thus, it is named the contourlet transform. The discrete contourlet transform has a fast iterated filter bank algorithm that requires an order N operations for N-pixel images. Furthermore, we establish a precise link between the developed filter bank and the associated continuous-domain contourlet expansion via a directional multiresolution analysis framework. We show that with parabolic scaling and sufficient directional vanishing moments, contourlets achieve the optimal approximation rate for piecewise smooth functions with discontinuities along twice continuously differentiable curves. Finally, we show some numerical experiments demonstrating the potential of contourlets in several image processing applications.

3,948 citations


"Acquisition, compression and render..." refers methods in this paper

  • ..., the Bandelet encoder [103, 104], the Contourlet encoder [105] or the Ridgelet encoder [106]....

    [...]