Proceedings ArticleDOI

A Novel 3D-Unet Deep Learning Framework Based on High-Dimensional Bilateral Grid for Edge Consistent Single Image Depth Estimation

TL;DR: Wang et al. as discussed by the authors proposed a novel bilateral-grid-based 3D convolutional neural network, dubbed 3DBG-UNet, that parameterizes a high-dimensional feature space by encoding compact 3D bilateral grids with UNets and infers the sharp geometric layout of the scene.
Abstract: The task of predicting smooth and edge-consistent depth maps is notoriously difficult for single image depth estimation. This paper proposes a novel bilateral-grid-based 3D convolutional neural network, dubbed 3DBG-UNet, that parameterizes a high-dimensional feature space by encoding compact 3D bilateral grids with UNets and infers the sharp geometric layout of the scene. Further, another novel model, 3DBGES-UNet, is introduced that integrates 3DBG-UNet for inferring an accurate depth map given a single color view. The 3DBGES-UNet concatenates the 3DBG-UNet geometry map with an Inception-network edge-accentuation map and a spatial object-boundary map obtained by leveraging semantic segmentation, and trains a UNet model with a ResNet backbone. Both models are designed with particular attention to explicitly accounting for edges and minute details. Preserving sharp discontinuities at depth edges is critical for many applications, such as realistic integration of virtual objects in AR video or occlusion-aware view synthesis for 3D display applications. The proposed depth prediction network achieves state-of-the-art performance in both qualitative and quantitative evaluations on the challenging NYUv2-Depth data. The code and corresponding pre-trained weights will be made publicly available.
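As a minimal illustration of the fusion step the abstract describes (my own hypothetical sketch, not the authors' released code), the geometry map, edge-accentuation map, and boundary map can be concatenated channel-wise and refined by a small convolutional head standing in for the ResNet-backed UNet. All names and shapes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hypothetical sketch: fuse a geometry map, an edge map, and a
    boundary map into a single depth prediction, as the abstract describes."""
    def __init__(self):
        super().__init__()
        # stand-in for the UNet with a ResNet backbone; two convs here
        self.refine = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, geometry, edges, boundaries):
        # channel-wise concatenation of the three single-channel cues
        x = torch.cat([geometry, edges, boundaries], dim=1)
        return self.refine(x)

geometry = torch.rand(1, 1, 240, 320)    # 3DBG-UNet geometry map
edges = torch.rand(1, 1, 240, 320)       # Inception edge-accentuation map
boundaries = torch.rand(1, 1, 240, 320)  # semantic boundary map
depth = FusionHead()(geometry, edges, boundaries)
print(depth.shape)  # torch.Size([1, 1, 240, 320])
```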
Citations
Journal ArticleDOI
TL;DR: In this paper, the authors develop a deep learning framework, DL-ROM (deep learning reduced-order modeling), to create a neural network capable of non-linear projections to reduced-order states.
Abstract: Reduced order modeling (ROM) has been widely used to create lower-order, computationally inexpensive representations of higher-order dynamical systems. Using these representations, ROMs can efficiently model flow fields while using significantly fewer parameters. Conventional ROMs accomplish this by linearly projecting higher-order manifolds to a lower-dimensional space using dimensionality reduction techniques such as proper orthogonal decomposition (POD). In this work, we develop a novel deep learning framework, DL-ROM (deep learning reduced-order modeling), to create a neural network capable of non-linear projections to reduced-order states. We then use the learned reduced state to efficiently predict future time steps of the simulation using 3D Autoencoder and 3D U-Net-based architectures. Our model DL-ROM can create highly accurate reconstructions from the learned ROM and is thus able to efficiently predict future time steps by temporally traversing the learned reduced state. All of this is achieved without ground-truth supervision or needing to iteratively solve the expensive Navier–Stokes (NS) equations, thereby resulting in massive computational savings. To test the effectiveness and performance of our approach, we evaluate our implementation on five different computational fluid dynamics (CFD) datasets using reconstruction performance and computational runtime metrics. DL-ROM can reduce the computational run times of iterative solvers by nearly two orders of magnitude while maintaining an acceptable error threshold.
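A toy sketch of the DL-ROM idea under my own assumptions (this is not the paper's architecture): a 3D autoencoder compresses a flow snapshot to a reduced state, a stand-in latent model advances one time step, and the decoder reconstructs the predicted field, so the expensive solver is never called.

```python
import torch
import torch.nn as nn

class TinyROM(nn.Module):
    """Hypothetical sketch of the DL-ROM idea: a 3D autoencoder compresses a
    flow-field snapshot to a reduced state; future steps are advanced in that
    latent space instead of re-solving the Navier-Stokes equations."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1),
        )
        # stand-in latent dynamics model (one conv step in reduced space)
        self.step = nn.Conv3d(16, 16, 3, padding=1)

    def forward(self, snapshot):
        z = self.encoder(snapshot)   # non-linear projection to reduced state
        z_next = self.step(z)        # advance one time step in latent space
        return self.decoder(z_next)  # reconstruct the predicted flow field

field = torch.rand(1, 1, 32, 32, 32)  # toy 3D flow snapshot
print(TinyROM()(field).shape)         # torch.Size([1, 1, 32, 32, 32])
```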

39 citations

Journal ArticleDOI
TL;DR: In this article, a dual-encoder single-decoder CNN with different weights for feature fusion is proposed for depth estimation of multi-exposure stereo image sequences in 3D HDR video content.
Abstract: Display technologies have evolved over the years. It is critical to develop practical HDR capturing, processing, and display solutions to bring 3D technologies to the next level. Depth estimation of multi-exposure stereo image sequences is an essential task in the development of cost-effective 3D HDR video content. In this paper, we develop a novel deep architecture for multi-exposure stereo depth estimation. The proposed architecture has two novel components. First, the stereo matching technique used in traditional stereo depth estimation is revamped. For the stereo depth estimation component of our architecture, a mono-to-stereo transfer learning approach is deployed. The proposed formulation circumvents the cost volume construction requirement, which is replaced by a ResNet-based dual-encoder single-decoder CNN with different weights for feature fusion. EfficientNet-based blocks are used to learn the disparity. Second, we combine disparity maps obtained from the stereo images at different exposure levels using a robust disparity feature fusion approach. The disparity maps obtained at different exposures are merged using weight maps calculated for different quality measures. The final predicted disparity map is more robust and retains the best features that preserve the depth discontinuities. The proposed CNN offers the flexibility to train using standard dynamic range stereo data or with multi-exposure low dynamic range stereo sequences. In terms of performance, the proposed model surpasses state-of-the-art monocular and stereo depth estimation methods, both quantitatively and qualitatively, on the challenging Scene Flow and differently exposed Middlebury stereo datasets. The architecture performs exceedingly well on complex natural scenes, demonstrating its usefulness for diverse 3D HDR applications.
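The exposure-fusion step above can be illustrated with a short hedged sketch (my own simplification; the paper's actual quality measures and weighting are richer): per-pixel weight maps blend the disparity maps estimated at different exposures into one robust map.

```python
import numpy as np

def fuse_disparities(disparities, weights):
    """Hypothetical sketch of the fusion step described above: per-pixel
    weight maps (e.g. from quality measures such as contrast or
    well-exposedness) blend disparity maps from different exposures."""
    w = np.stack(weights, axis=0)
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)  # normalize per pixel
    d = np.stack(disparities, axis=0)
    return (w * d).sum(axis=0)  # weighted per-pixel average

# toy example: three exposures of a 4x4 scene
disps = [np.full((4, 4), v) for v in (10.0, 12.0, 11.0)]
wts = [np.random.rand(4, 4) for _ in disps]
print(fuse_disparities(disps, wts).shape)  # (4, 4)
```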
Journal ArticleDOI
TL;DR: The depth estimation problem is revisited, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network; the proposed algorithm, entitled 2T-UNet, surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene Flow dataset.
Abstract: Stereo correspondence matching is an essential part of the multi-step stereo depth estimation process. This paper revisits the depth estimation problem, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network. The proposed algorithm is entitled 2T-UNet. The idea behind 2T-UNet is to replace cost volume construction with twin convolution towers. These towers have an allowance for different weights between them. Additionally, the inputs to the twin encoders in 2T-UNet differ from those of existing stereo methods. Generally, a stereo network takes a right and left image pair as input to determine the scene geometry. However, in the 2T-UNet model, the right stereo image is taken as one input, and the left stereo image, along with its monocular depth clue information, is taken as the other input. Depth clues provide complementary suggestions that help enhance the quality of the predicted scene geometry. The 2T-UNet surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene Flow dataset, both quantitatively and qualitatively. The architecture performs incredibly well on complex natural scenes, highlighting its usefulness for various real-time applications. Pretrained weights and code will be made readily available.
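A hypothetical sketch of the input arrangement described above (illustrative only, not the authors' 2T-UNet): one tower receives the right image, the other the left image stacked with its monocular depth clue, with no weight sharing between towers and no cost volume.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Hypothetical sketch of the 2T-UNet input arrangement: one tower sees
    the right image, the other sees the left image stacked with its monocular
    depth clue; the towers do not share weights, replacing the cost volume."""
    def __init__(self):
        super().__init__()
        self.right_tower = nn.Conv2d(3, 16, 3, padding=1)  # RGB right view
        self.left_tower = nn.Conv2d(4, 16, 3, padding=1)   # RGB left + depth clue
        self.head = nn.Conv2d(32, 1, 3, padding=1)         # stand-in decoder

    def forward(self, right_rgb, left_rgb, left_depth_clue):
        left_in = torch.cat([left_rgb, left_depth_clue], dim=1)
        feats = torch.cat([self.right_tower(right_rgb),
                           self.left_tower(left_in)], dim=1)
        return self.head(feats)

net = TwoTower()
out = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
          torch.rand(1, 1, 64, 64))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```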
Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis to take advantage of the complementary properties of small-scale and large-scale images.
Abstract: Aiming at the problems of high cost and low accuracy of scene details during depth map generation in 3D reconstruction, we propose an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis. Generally speaking, small-scale images can better display depth details, while large-scale images can more reliably display the depth distribution of the entire image. In order to take advantage of these complementary properties, we crop the input image with different cropping ratios to generate multiple disparity map candidates, and then use Fourier frequency domain analysis to fuse the disparity map candidates into left and right disparity maps. At the same time, we propose a loss function based on MSSIM to compensate for the difference between left and right views and enable unsupervised training of the monocular depth prediction model. Experimental results show that our method performs well on the KITTI dataset.
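One plausible reading of the fusion step, sketched under my own assumptions (the paper's exact frequency-band weighting may differ): keep the low-frequency content (reliable global depth distribution) from the large-scale candidate and the high-frequency content (fine details) from the small-scale candidate.

```python
import numpy as np

def fourier_fuse(d_large, d_small, cutoff=8):
    """Hypothetical sketch of Fourier-domain fusion: low frequencies come
    from the large-scale disparity candidate, high frequencies from the
    small-scale candidate; then invert back to the spatial domain."""
    F_large = np.fft.fftshift(np.fft.fft2(d_large))
    F_small = np.fft.fftshift(np.fft.fft2(d_small))
    h, w = d_large.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    fused = np.where(low, F_large, F_small)  # low band from large scale
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))

a = np.random.rand(64, 64)  # large-scale disparity candidate
b = np.random.rand(64, 64)  # small-scale disparity candidate
print(fourier_fuse(a, b).shape)  # (64, 64)
```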
Journal ArticleDOI
01 Mar 2023-Sensors
TL;DR: NDWTN, as presented in this paper, is a moderately dense encoder-decoder network based on discrete wavelet decomposition and trainable coefficients (LL, LH, HL, HH) that preserves the high-frequency information otherwise lost during the downsampling process in the encoder.
Abstract: Applications such as medical diagnosis, navigation, robotics, etc., require 3D images. Recently, deep learning networks have been extensively applied to estimate depth. Depth prediction from 2D images is a problem that is both ill-posed and non-linear. Such networks are computationally expensive and time-consuming, as they have dense configurations. Further, network performance depends on the trained model configuration, the loss functions used, and the dataset applied for training. We propose a moderately dense encoder-decoder network based on discrete wavelet decomposition and trainable coefficients (LL, LH, HL, HH). Our Nested Wavelet-Net (NDWTN) preserves the high-frequency information that is otherwise lost during the downsampling process in the encoder. Furthermore, we study the effect of activation functions, batch normalization, convolution layers, skip connections, etc., in our models. The network is trained with the NYU dataset. Our network trains faster with good results.
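To make the LL/LH/HL/HH idea concrete, here is a minimal one-level Haar decomposition (a standard construction, not the paper's trainable variant): the feature map is split into four half-resolution sub-bands, so downsampling loses no information.

```python
import torch

def haar_dwt(x):
    """One-level Haar wavelet decomposition of an (N, C, H, W) tensor into
    LL, LH, HL, HH sub-bands at half resolution. Unlike pooling, the
    high-frequency bands (LH, HL, HH) can be kept and reused."""
    a = x[..., ::2, ::2]    # even rows, even cols
    b = x[..., ::2, 1::2]   # even rows, odd cols
    c = x[..., 1::2, ::2]   # odd rows, even cols
    d = x[..., 1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

x = torch.rand(1, 8, 32, 32)
ll, lh, hl, hh = haar_dwt(x)
print(ll.shape)  # torch.Size([1, 8, 16, 16]) -- half resolution, lossless
```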
References
Book ChapterDOI
01 Dec 2014
TL;DR: A stereo RGB-D camera system which uses the pros of RGB-D cameras and combines them with the pros of stereo camera systems to generate a depth map, showing that the density of depth information is increased, especially for transparent, shiny, or matte objects.
Abstract: RGB-D sensors such as the Microsoft Kinect or the Asus Xtion are inexpensive 3D sensors. A depth image is computed by calculating the distortion of a known infrared (IR) light pattern which is projected into the scene. While these sensors are great devices, they have some limitations. The distance they can measure is limited, and they suffer from reflection problems on transparent, shiny, or very matte and absorbing objects. If more than one RGB-D camera is used, the IR patterns interfere with each other, resulting in a massive loss of depth information. In this paper, we present a simple and powerful method to overcome these problems. We propose a stereo RGB-D camera system which combines the pros of RGB-D cameras with the pros of stereo camera systems. The idea is to utilize the IR images of each pair of sensors as a stereo pair to generate a depth map. The IR patterns emitted by the IR projectors are exploited here to enhance dense stereo matching even if the observed objects or surfaces are texture-less or transparent. The resulting disparity map is then fused with the depth map offered by the RGB-D sensor to fill the regions and holes that appear because of interference, or due to transparent or reflective objects. Our results show that the density of depth information is increased, especially for transparent, shiny, or matte objects.
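The hole-filling fusion described above reduces, in its simplest form, to falling back on the IR-stereo estimate wherever the sensor reports no depth. A hedged toy sketch (my own simplification; the paper's fusion is more involved):

```python
import numpy as np

def fill_depth_holes(sensor_depth, stereo_depth):
    """Hypothetical sketch of the fusion step described above: where the
    RGB-D sensor returns no measurement (holes from interference, transparent
    or reflective surfaces), fall back to the depth recovered by matching the
    two IR images as a stereo pair."""
    holes = sensor_depth <= 0  # 0 marks invalid pixels in many RGB-D streams
    fused = sensor_depth.copy()
    fused[holes] = stereo_depth[holes]
    return fused

sensor = np.array([[1.2, 0.0], [0.0, 2.5]])  # 0.0 = missing depth
stereo = np.array([[1.3, 1.9], [2.1, 2.4]])  # IR-stereo estimate
print(fill_depth_holes(sensor, stereo))
```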

54 citations

Posted Content
TL;DR: This work establishes that the degenerate camera motions exhibited in handheld settings are a critical obstacle for unsupervised depth learning and proposes a novel data pre-processing method for effective training, i.e., searching for image pairs with modest translation and removing their rotation via the proposed weak image rectification.
Abstract: Single-view depth estimation using CNNs trained from unlabelled videos has shown significant promise. However, the excellent results have mostly been obtained in street-scene driving scenarios, and such methods often fail in other settings, particularly indoor videos taken by handheld devices, in which case the ego-motion is often degenerate, i.e., the rotation dominates the translation. In this work, we establish that the degenerate camera motions exhibited in handheld settings are a critical obstacle for unsupervised depth learning. A main contribution of our work is a fundamental analysis which shows that the rotation behaves as noise during training, as opposed to the translation (baseline), which provides supervision signals. To capitalise on our findings, we propose a novel data pre-processing method for effective training, i.e., we search for image pairs with modest translation and remove their rotation via the proposed weak image rectification. With our pre-processing, existing unsupervised models can be trained well in challenging scenarios (e.g., the NYUv2 dataset), and the results outperform the unsupervised SOTA by a large margin (0.147 vs. 0.189 in AbsRel error).
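The pair-selection half of that pre-processing can be sketched as a simple filter (illustrative thresholds of my own; the weak rectification step itself, which removes the residual rotation, is omitted here):

```python
def select_training_pairs(rotations_deg, translations_m,
                          max_rot=1.0, min_trans=0.05):
    """Hypothetical sketch of the pre-processing rule described above: keep
    image pairs whose relative rotation is small (rotation acts as noise
    during training) and whose translation (the supervising baseline) is
    large enough. Thresholds are illustrative, not the paper's values."""
    keep = []
    for i, (r, t) in enumerate(zip(rotations_deg, translations_m)):
        if r <= max_rot and t >= min_trans:
            keep.append(i)
    return keep

# toy relative-pose magnitudes for five candidate pairs
rots = [0.3, 4.0, 0.8, 0.2, 2.5]    # degrees
trans = [0.10, 0.20, 0.01, 0.08, 0.15]  # metres
print(select_training_pairs(rots, trans))  # [0, 3]
```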

32 citations

Proceedings ArticleDOI
01 Jan 2020
TL;DR: A novel Serial U-Net (NU-Net) architecture is introduced as a modular ensembling technique for combining the learned features from N-many U-Nets into a single pixel-by-pixel output for improved depth estimation accuracy.
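A loose sketch of that serial-ensembling idea under my own assumptions (each stage below is a one-layer stand-in for a full U-Net, not the NU-Net architecture itself): N stages are chained and their outputs merged into one per-pixel map.

```python
import torch
import torch.nn as nn

class SerialUNets(nn.Module):
    """Hypothetical sketch of the serial-ensembling idea in the TL;DR: N
    lightweight stand-in 'U-Nets' are chained, and their outputs are merged
    into one per-pixel depth map by a 1x1 convolution."""
    def __init__(self, n=3):
        super().__init__()
        # each stage is a stand-in for a full U-Net
        self.stages = nn.ModuleList(
            nn.Conv2d(1, 1, 3, padding=1) for _ in range(n))
        self.merge = nn.Conv2d(n, 1, 1)  # pixel-by-pixel combination

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = torch.relu(stage(x))  # feed each stage the previous output
            outs.append(x)
        return self.merge(torch.cat(outs, dim=1))

print(SerialUNets()(torch.rand(1, 1, 64, 64)).shape)  # (1, 1, 64, 64)
```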

9 citations

Proceedings ArticleDOI
16 Jun 2020
TL;DR: For the first time, a fully differentiable ordinal regression is formulated and the network is trained in an end-to-end fashion, leading to smooth and edge-consistent depth maps in single image depth estimation.
Abstract: Single image depth estimation is a challenging problem. The current state-of-the-art method formulates the problem as one of ordinal regression. However, the formulation is not fully differentiable, and depth maps are not generated in an end-to-end fashion. The method uses a naive threshold strategy to determine per-pixel depth labels, which results in significant discretization errors. For the first time, we formulate a fully differentiable ordinal regression and train the network in an end-to-end fashion. This enables us to include boundary and smoothness constraints in the optimization function, leading to smooth and edge-consistent depth maps. A novel per-pixel confidence map computation for depth refinement is also proposed. Extensive evaluation of the proposed model on challenging benchmarks reveals its superiority over recent state-of-the-art methods, both quantitatively and qualitatively. Additionally, we demonstrate the practical utility of the proposed method for a single-camera bokeh solution using an in-house dataset of challenging real-life images.
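One common way to make an ordinal-regression readout differentiable, sketched here as an assumption rather than the paper's exact formulation: replace the hard per-pixel label selection with a softmax-weighted expectation over depth-bin centers, so gradients flow end-to-end and discretization error shrinks.

```python
import torch

def soft_ordinal_depth(logits, bin_centers):
    """Hypothetical sketch of a differentiable ordinal-regression readout:
    instead of a hard per-pixel argmax over depth bins (which blocks
    gradients and causes discretization error), take a softmax-weighted
    expectation over the bin centers."""
    probs = torch.softmax(logits, dim=1)               # (B, K, H, W)
    centers = bin_centers.view(1, -1, 1, 1)            # (1, K, 1, 1)
    return (probs * centers).sum(dim=1, keepdim=True)  # soft, differentiable

logits = torch.randn(1, 10, 8, 8, requires_grad=True)
centers = torch.linspace(0.5, 10.0, 10)  # K=10 depth bins in metres
depth = soft_ordinal_depth(logits, centers)
depth.sum().backward()  # gradients flow end-to-end
print(depth.shape)      # torch.Size([1, 1, 8, 8])
```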

8 citations