Proceedings ArticleDOI

A Novel 3D-Unet Deep Learning Framework Based on High-Dimensional Bilateral Grid for Edge Consistent Single Image Depth Estimation

TL;DR: Wang et al. as discussed by the authors proposed a novel bilateral-grid-based 3D convolutional neural network, dubbed 3DBG-UNet, that parameterizes a high-dimensional feature space by encoding compact 3D bilateral grids with UNets and infers the sharp geometric layout of the scene.
Abstract: The task of predicting smooth and edge-consistent depth maps is notoriously difficult for single image depth estimation. This paper proposes a novel bilateral-grid-based 3D convolutional neural network, dubbed 3DBG-UNet, that parameterizes a high-dimensional feature space by encoding compact 3D bilateral grids with UNets and infers the sharp geometric layout of the scene. Further, another novel model, 3DBGES-UNet, is introduced that integrates 3DBG-UNet for inferring an accurate depth map given a single color view. The 3DBGES-UNet concatenates the 3DBG-UNet geometry map with an Inception-network edge-accentuation map and a spatial object-boundary map obtained by leveraging semantic segmentation, and trains a UNet model with a ResNet backbone. Both models are designed with particular attention to explicitly accounting for edges and minute details. Preserving sharp discontinuities at depth edges is critical for many applications, such as realistic integration of virtual objects in AR video or occlusion-aware view synthesis for 3D display applications. The proposed depth prediction network achieves state-of-the-art performance in both qualitative and quantitative evaluations on the challenging NYUv2-Depth data. The code and corresponding pre-trained weights will be made publicly available.
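As a minimal illustration of the fusion step the abstract describes (my own hypothetical sketch, not the authors' released code), the geometry map, edge-accentuation map, and boundary map can be concatenated channel-wise and refined by a small convolutional head standing in for the ResNet-backed UNet. All names and shapes here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hypothetical sketch: fuse a geometry map, an edge map, and a
    boundary map into a single depth prediction, as the abstract describes."""
    def __init__(self):
        super().__init__()
        # stand-in for the UNet with a ResNet backbone; two convs here
        self.refine = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, geometry, edges, boundaries):
        # channel-wise concatenation of the three single-channel cues
        x = torch.cat([geometry, edges, boundaries], dim=1)
        return self.refine(x)

geometry = torch.rand(1, 1, 240, 320)    # 3DBG-UNet geometry map
edges = torch.rand(1, 1, 240, 320)       # Inception edge-accentuation map
boundaries = torch.rand(1, 1, 240, 320)  # semantic boundary map
depth = FusionHead()(geometry, edges, boundaries)
print(depth.shape)  # torch.Size([1, 1, 240, 320])
```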
Citations
Journal ArticleDOI
TL;DR: In this paper, the authors develop a deep learning framework, DL-ROM (deep learning reduced-order modeling), to create a neural network capable of non-linear projections to reduced-order states.
Abstract: Reduced order modeling (ROM) has been widely used to create lower-order, computationally inexpensive representations of higher-order dynamical systems. Using these representations, ROMs can efficiently model flow fields while using significantly fewer parameters. Conventional ROMs accomplish this by linearly projecting higher-order manifolds to a lower-dimensional space using dimensionality reduction techniques such as proper orthogonal decomposition (POD). In this work, we develop a novel deep learning framework, DL-ROM (deep learning reduced-order modeling), to create a neural network capable of non-linear projections to reduced-order states. We then use the learned reduced state to efficiently predict future time steps of the simulation using 3D Autoencoder and 3D U-Net-based architectures. Our model DL-ROM can create highly accurate reconstructions from the learned ROM and is thus able to efficiently predict future time steps by temporally traversing the learned reduced state. All of this is achieved without ground-truth supervision or needing to iteratively solve the expensive Navier–Stokes (NS) equations, thereby resulting in massive computational savings. To test the effectiveness and performance of our approach, we evaluate our implementation on five different computational fluid dynamics (CFD) datasets using reconstruction performance and computational runtime metrics. DL-ROM can reduce the computational run times of iterative solvers by nearly two orders of magnitude while maintaining an acceptable error threshold.
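A toy sketch of the DL-ROM idea under my own assumptions (this is not the paper's architecture): a 3D autoencoder compresses a flow snapshot to a reduced state, a stand-in latent model advances one time step, and the decoder reconstructs the predicted field, so the expensive solver is never called.

```python
import torch
import torch.nn as nn

class TinyROM(nn.Module):
    """Hypothetical sketch of the DL-ROM idea: a 3D autoencoder compresses a
    flow-field snapshot to a reduced state; future steps are advanced in that
    latent space instead of re-solving the Navier-Stokes equations."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1),
        )
        # stand-in latent dynamics model (one conv step in reduced space)
        self.step = nn.Conv3d(16, 16, 3, padding=1)

    def forward(self, snapshot):
        z = self.encoder(snapshot)   # non-linear projection to reduced state
        z_next = self.step(z)        # advance one time step in latent space
        return self.decoder(z_next)  # reconstruct the predicted flow field

field = torch.rand(1, 1, 32, 32, 32)  # toy 3D flow snapshot
print(TinyROM()(field).shape)         # torch.Size([1, 1, 32, 32, 32])
```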

39 citations

Journal ArticleDOI
TL;DR: In this article, a dual-encoder single-decoder CNN with different weights for feature fusion is proposed for depth estimation of multi-exposure stereo image sequences in 3D HDR video content.
Abstract: Display technologies have evolved over the years. It is critical to develop practical HDR capturing, processing, and display solutions to bring 3D technologies to the next level. Depth estimation of multi-exposure stereo image sequences is an essential task in the development of cost-effective 3D HDR video content. In this paper, we develop a novel deep architecture for multi-exposure stereo depth estimation. The proposed architecture has two novel components. First, the stereo matching technique used in traditional stereo depth estimation is revamped. For the stereo depth estimation component of our architecture, a mono-to-stereo transfer learning approach is deployed. The proposed formulation circumvents the cost volume construction requirement, which is replaced by a ResNet-based dual-encoder single-decoder CNN with different weights for feature fusion. EfficientNet-based blocks are used to learn the disparity. Second, we combine disparity maps obtained from the stereo images at different exposure levels using a robust disparity feature fusion approach. The disparity maps obtained at different exposures are merged using weight maps calculated for different quality measures. The final predicted disparity map is more robust and retains the best features that preserve the depth discontinuities. The proposed CNN offers the flexibility to train using standard dynamic range stereo data or with multi-exposure low dynamic range stereo sequences. In terms of performance, the proposed model surpasses state-of-the-art monocular and stereo depth estimation methods, both quantitatively and qualitatively, on the challenging Scene Flow and differently exposed Middlebury stereo datasets. The architecture performs exceedingly well on complex natural scenes, demonstrating its usefulness for diverse 3D HDR applications.
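The exposure-fusion step above can be illustrated with a short hedged sketch (my own simplification; the paper's actual quality measures and weighting are richer): per-pixel weight maps blend the disparity maps estimated at different exposures into one robust map.

```python
import numpy as np

def fuse_disparities(disparities, weights):
    """Hypothetical sketch of the fusion step described above: per-pixel
    weight maps (e.g. from quality measures such as contrast or
    well-exposedness) blend disparity maps from different exposures."""
    w = np.stack(weights, axis=0)
    w = w / (w.sum(axis=0, keepdims=True) + 1e-8)  # normalize per pixel
    d = np.stack(disparities, axis=0)
    return (w * d).sum(axis=0)  # weighted per-pixel average

# toy example: three exposures of a 4x4 scene
disps = [np.full((4, 4), v) for v in (10.0, 12.0, 11.0)]
wts = [np.random.rand(4, 4) for _ in disps]
print(fuse_disparities(disps, wts).shape)  # (4, 4)
```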
Journal ArticleDOI
TL;DR: The depth estimation problem is revisited, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network; the proposed algorithm, entitled 2T-UNet, surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene Flow dataset.
Abstract: Stereo correspondence matching is an essential part of the multi-step stereo depth estimation process. This paper revisits the depth estimation problem, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network. The proposed algorithm is entitled 2T-UNet. The idea behind 2T-UNet is to replace cost volume construction with twin convolution towers. These towers have an allowance for different weights between them. Additionally, the inputs to the twin encoders in 2T-UNet differ from those of existing stereo methods. Generally, a stereo network takes a right and left image pair as input to determine the scene geometry. However, in the 2T-UNet model, the right stereo image is taken as one input, and the left stereo image, along with its monocular depth clue information, is taken as the other input. Depth clues provide complementary suggestions that help enhance the quality of the predicted scene geometry. The 2T-UNet surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene Flow dataset, both quantitatively and qualitatively. The architecture performs incredibly well on complex natural scenes, highlighting its usefulness for various real-time applications. Pretrained weights and code will be made readily available.
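A hypothetical sketch of the input arrangement described above (illustrative only, not the authors' 2T-UNet): one tower receives the right image, the other the left image stacked with its monocular depth clue, with no weight sharing between towers and no cost volume.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Hypothetical sketch of the 2T-UNet input arrangement: one tower sees
    the right image, the other sees the left image stacked with its monocular
    depth clue; the towers do not share weights, replacing the cost volume."""
    def __init__(self):
        super().__init__()
        self.right_tower = nn.Conv2d(3, 16, 3, padding=1)  # RGB right view
        self.left_tower = nn.Conv2d(4, 16, 3, padding=1)   # RGB left + depth clue
        self.head = nn.Conv2d(32, 1, 3, padding=1)         # stand-in decoder

    def forward(self, right_rgb, left_rgb, left_depth_clue):
        left_in = torch.cat([left_rgb, left_depth_clue], dim=1)
        feats = torch.cat([self.right_tower(right_rgb),
                           self.left_tower(left_in)], dim=1)
        return self.head(feats)

net = TwoTower()
out = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
          torch.rand(1, 1, 64, 64))
print(out.shape)  # torch.Size([1, 1, 64, 64])
```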
Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis to take advantage of the complementary properties of small-scale and large-scale images.
Abstract: Aiming at the problems of high cost and low accuracy of scene details during depth map generation in 3D reconstruction, we propose an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis. Generally speaking, small-scale images can better display depth details, while large-scale images can more reliably display the depth distribution of the entire image. In order to take advantage of these complementary properties, we crop the input image with different cropping ratios to generate multiple disparity map candidates, and then use Fourier frequency domain analysis to fuse the disparity map candidates into left and right disparity maps. At the same time, we propose a loss function based on MSSIM to compensate for the difference between left and right views and enable unsupervised training of the monocular depth prediction model. Experimental results show that our method performs well on the KITTI dataset.
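One plausible reading of the fusion step, sketched under my own assumptions (the paper's exact frequency-band weighting may differ): keep the low-frequency content (reliable global depth distribution) from the large-scale candidate and the high-frequency content (fine details) from the small-scale candidate.

```python
import numpy as np

def fourier_fuse(d_large, d_small, cutoff=8):
    """Hypothetical sketch of Fourier-domain fusion: low frequencies come
    from the large-scale disparity candidate, high frequencies from the
    small-scale candidate; then invert back to the spatial domain."""
    F_large = np.fft.fftshift(np.fft.fft2(d_large))
    F_small = np.fft.fftshift(np.fft.fft2(d_small))
    h, w = d_large.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= cutoff ** 2
    fused = np.where(low, F_large, F_small)  # low band from large scale
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))

a = np.random.rand(64, 64)  # large-scale disparity candidate
b = np.random.rand(64, 64)  # small-scale disparity candidate
print(fourier_fuse(a, b).shape)  # (64, 64)
```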
Journal ArticleDOI
01 Mar 2023-Sensors
TL;DR: NDWTN, as presented in this paper, is a moderately dense encoder-decoder network based on discrete wavelet decomposition and trainable coefficients (LL, LH, HL, HH) that preserves the high-frequency information otherwise lost during the downsampling process in the encoder.
Abstract: Applications such as medical diagnosis, navigation, robotics, etc., require 3D images. Recently, deep learning networks have been extensively applied to estimate depth. Depth prediction from 2D images is a problem that is both ill-posed and non-linear. Such networks are computationally expensive and time-consuming, as they have dense configurations. Further, network performance depends on the trained model configuration, the loss functions used, and the dataset applied for training. We propose a moderately dense encoder-decoder network based on discrete wavelet decomposition and trainable coefficients (LL, LH, HL, HH). Our Nested Wavelet-Net (NDWTN) preserves the high-frequency information that is otherwise lost during the downsampling process in the encoder. Furthermore, we study the effect of activation functions, batch normalization, convolution layers, skip connections, etc., in our models. The network is trained with the NYU dataset. Our network trains faster with good results.
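To make the LL/LH/HL/HH idea concrete, here is a minimal one-level Haar decomposition (a standard construction, not the paper's trainable variant): the feature map is split into four half-resolution sub-bands, so downsampling loses no information.

```python
import torch

def haar_dwt(x):
    """One-level Haar wavelet decomposition of an (N, C, H, W) tensor into
    LL, LH, HL, HH sub-bands at half resolution. Unlike pooling, the
    high-frequency bands (LH, HL, HH) can be kept and reused."""
    a = x[..., ::2, ::2]    # even rows, even cols
    b = x[..., ::2, 1::2]   # even rows, odd cols
    c = x[..., 1::2, ::2]   # odd rows, even cols
    d = x[..., 1::2, 1::2]  # odd rows, odd cols
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

x = torch.rand(1, 8, 32, 32)
ll, lh, hl, hh = haar_dwt(x)
print(ll.shape)  # torch.Size([1, 8, 16, 16]) -- half resolution, lossless
```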
References
Book ChapterDOI
01 Dec 2014
TL;DR: A stereo RGB-D camera system which uses the pros of RGB-D cameras and combines them with the pros of stereo camera systems to generate a depth map, showing that the density of depth information is increased, especially for transparent, shiny, or matte objects.
Abstract: RGB-D sensors such as the Microsoft Kinect or the Asus Xtion are inexpensive 3D sensors. A depth image is computed by calculating the distortion of a known infrared (IR) light pattern which is projected into the scene. While these sensors are great devices, they have some limitations. The distance they can measure is limited, and they suffer from reflection problems on transparent, shiny, or very matte and absorbing objects. If more than one RGB-D camera is used, the IR patterns interfere with each other, resulting in a massive loss of depth information. In this paper, we present a simple and powerful method to overcome these problems. We propose a stereo RGB-D camera system which combines the pros of RGB-D cameras with the pros of stereo camera systems. The idea is to utilize the IR images of each pair of sensors as a stereo pair to generate a depth map. The IR patterns emitted by the IR projectors are exploited here to enhance dense stereo matching even if the observed objects or surfaces are texture-less or transparent. The resulting disparity map is then fused with the depth map offered by the RGB-D sensor to fill the regions and holes that appear because of interference, or due to transparent or reflective objects. Our results show that the density of depth information is increased, especially for transparent, shiny, or matte objects.
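The hole-filling fusion described above reduces, in its simplest form, to falling back on the IR-stereo estimate wherever the sensor reports no depth. A hedged toy sketch (my own simplification; the paper's fusion is more involved):

```python
import numpy as np

def fill_depth_holes(sensor_depth, stereo_depth):
    """Hypothetical sketch of the fusion step described above: where the
    RGB-D sensor returns no measurement (holes from interference, transparent
    or reflective surfaces), fall back to the depth recovered by matching the
    two IR images as a stereo pair."""
    holes = sensor_depth <= 0  # 0 marks invalid pixels in many RGB-D streams
    fused = sensor_depth.copy()
    fused[holes] = stereo_depth[holes]
    return fused

sensor = np.array([[1.2, 0.0], [0.0, 2.5]])  # 0.0 = missing depth
stereo = np.array([[1.3, 1.9], [2.1, 2.4]])  # IR-stereo estimate
print(fill_depth_holes(sensor, stereo))
```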

54 citations

Posted Content
TL;DR: This work establishes that the degenerate camera motions exhibited in handheld settings are a critical obstacle for unsupervised depth learning and proposes a novel data pre-processing method for effective training, i.e., searching for image pairs with modest translation and removing their rotation via the proposed weak image rectification.
Abstract: Single-view depth estimation using CNNs trained from unlabelled videos has shown significant promise. However, the excellent results have mostly been obtained in street-scene driving scenarios, and such methods often fail in other settings, particularly indoor videos taken by handheld devices, in which case the ego-motion is often degenerate, i.e., the rotation dominates the translation. In this work, we establish that the degenerate camera motions exhibited in handheld settings are a critical obstacle for unsupervised depth learning. A main contribution of our work is a fundamental analysis which shows that the rotation behaves as noise during training, as opposed to the translation (baseline), which provides supervision signals. To capitalise on our findings, we propose a novel data pre-processing method for effective training, i.e., we search for image pairs with modest translation and remove their rotation via the proposed weak image rectification. With our pre-processing, existing unsupervised models can be trained well in challenging scenarios (e.g., the NYUv2 dataset), and the results outperform the unsupervised SOTA by a large margin (0.147 vs. 0.189 in AbsRel error).
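The pair-selection half of that pre-processing can be sketched as a simple filter (illustrative thresholds of my own; the weak rectification step itself, which removes the residual rotation, is omitted here):

```python
def select_training_pairs(rotations_deg, translations_m,
                          max_rot=1.0, min_trans=0.05):
    """Hypothetical sketch of the pre-processing rule described above: keep
    image pairs whose relative rotation is small (rotation acts as noise
    during training) and whose translation (the supervising baseline) is
    large enough. Thresholds are illustrative, not the paper's values."""
    keep = []
    for i, (r, t) in enumerate(zip(rotations_deg, translations_m)):
        if r <= max_rot and t >= min_trans:
            keep.append(i)
    return keep

# toy relative-pose magnitudes for five candidate pairs
rots = [0.3, 4.0, 0.8, 0.2, 2.5]    # degrees
trans = [0.10, 0.20, 0.01, 0.08, 0.15]  # metres
print(select_training_pairs(rots, trans))  # [0, 3]
```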

32 citations

Proceedings ArticleDOI
01 Jan 2020
TL;DR: A novel Serial U-Net (NU-Net) architecture is introduced as a modular ensembling technique for combining the learned features from N-many U-Nets into a single pixel-by-pixel output for improved depth estimation accuracy.
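A loose sketch of that serial-ensembling idea under my own assumptions (each stage below is a one-layer stand-in for a full U-Net, not the NU-Net architecture itself): N stages are chained and their outputs merged into one per-pixel map.

```python
import torch
import torch.nn as nn

class SerialUNets(nn.Module):
    """Hypothetical sketch of the serial-ensembling idea in the TL;DR: N
    lightweight stand-in 'U-Nets' are chained, and their outputs are merged
    into one per-pixel depth map by a 1x1 convolution."""
    def __init__(self, n=3):
        super().__init__()
        # each stage is a stand-in for a full U-Net
        self.stages = nn.ModuleList(
            nn.Conv2d(1, 1, 3, padding=1) for _ in range(n))
        self.merge = nn.Conv2d(n, 1, 1)  # pixel-by-pixel combination

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = torch.relu(stage(x))  # feed each stage the previous output
            outs.append(x)
        return self.merge(torch.cat(outs, dim=1))

print(SerialUNets()(torch.rand(1, 1, 64, 64)).shape)  # (1, 1, 64, 64)
```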

9 citations

Proceedings ArticleDOI
16 Jun 2020
TL;DR: For the first time, a fully differentiable ordinal regression is formulated and the network is trained in an end-to-end fashion, leading to smooth and edge-consistent depth maps in single image depth estimation.
Abstract: Single image depth estimation is a challenging problem. The current state-of-the-art method formulates the problem as one of ordinal regression. However, the formulation is not fully differentiable, and depth maps are not generated in an end-to-end fashion. The method uses a naive threshold strategy to determine per-pixel depth labels, which results in significant discretization errors. For the first time, we formulate a fully differentiable ordinal regression and train the network in an end-to-end fashion. This enables us to include boundary and smoothness constraints in the optimization function, leading to smooth and edge-consistent depth maps. A novel per-pixel confidence map computation for depth refinement is also proposed. Extensive evaluation of the proposed model on challenging benchmarks reveals its superiority over recent state-of-the-art methods, both quantitatively and qualitatively. Additionally, we demonstrate the practical utility of the proposed method for a single-camera bokeh solution using an in-house dataset of challenging real-life images.
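One common way to make an ordinal-regression readout differentiable, sketched here as an assumption rather than the paper's exact formulation: replace the hard per-pixel label selection with a softmax-weighted expectation over depth-bin centers, so gradients flow end-to-end and discretization error shrinks.

```python
import torch

def soft_ordinal_depth(logits, bin_centers):
    """Hypothetical sketch of a differentiable ordinal-regression readout:
    instead of a hard per-pixel argmax over depth bins (which blocks
    gradients and causes discretization error), take a softmax-weighted
    expectation over the bin centers."""
    probs = torch.softmax(logits, dim=1)               # (B, K, H, W)
    centers = bin_centers.view(1, -1, 1, 1)            # (1, K, 1, 1)
    return (probs * centers).sum(dim=1, keepdim=True)  # soft, differentiable

logits = torch.randn(1, 10, 8, 8, requires_grad=True)
centers = torch.linspace(0.5, 10.0, 10)  # K=10 depth bins in metres
depth = soft_ordinal_depth(logits, centers)
depth.sum().backward()  # gradients flow end-to-end
print(depth.shape)      # torch.Size([1, 1, 8, 8])
```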

8 citations