Proceedings ArticleDOI

A Novel 3D-Unet Deep Learning Framework Based on High-Dimensional Bilateral Grid for Edge Consistent Single Image Depth Estimation

TL;DR: Wang et al. as discussed by the authors proposed a novel Bilateral Grid based 3D convolutional neural network, dubbed 3DBG-UNet, that parameterizes a high-dimensional feature space by encoding compact 3D bilateral grids with UNets and infers the sharp geometric layout of the scene.
Abstract: The task of predicting smooth and edge-consistent depth maps is notoriously difficult for single image depth estimation. This paper proposes a novel Bilateral Grid based 3D convolutional neural network, dubbed 3DBG-UNet, that parameterizes a high-dimensional feature space by encoding compact 3D bilateral grids with UNets and infers the sharp geometric layout of the scene. Further, another novel model, 3DBGES-UNet, is introduced that integrates 3DBG-UNet for inferring an accurate depth map given a single color view. The 3DBGES-UNet concatenates the 3DBG-UNet geometry map with an inception-network edge-accentuation map and a spatial object-boundary map obtained by leveraging semantic segmentation, and trains a UNet model with a ResNet backbone. Both models are designed with particular attention to explicitly account for edges and minute details. Preserving sharp discontinuities at depth edges is critical for many applications, such as realistic integration of virtual objects in AR video or occlusion-aware view synthesis for 3D display applications. The proposed depth prediction network achieves state-of-the-art performance in both qualitative and quantitative evaluations on the challenging NYUv2-Depth data. The code and corresponding pre-trained weights will be made publicly available.
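To make the fusion step concrete, here is a minimal, hypothetical sketch (not the authors' released code) of how the three cues named in the abstract -- the 3DBG-UNet geometry map, the edge-accentuation map, and the semantic boundary map -- could be concatenated channel-wise and regressed to a depth map. The module name, channel counts, and layer sizes are assumptions for illustration only.

```python
# Hypothetical fusion head: concatenate single-channel cue maps and regress depth.
import torch
import torch.nn as nn

class DepthFusionHead(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=3, padding=1),  # 3 cue channels in
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),             # 1-channel depth out
        )

    def forward(self, geometry_map, edge_map, boundary_map):
        x = torch.cat([geometry_map, edge_map, boundary_map], dim=1)
        return self.net(x)

# Example: three H x W cue maps produce one H x W depth prediction.
head = DepthFusionHead()
g = torch.rand(1, 1, 240, 320)
depth = head(g, torch.rand_like(g), torch.rand_like(g))
print(depth.shape)  # torch.Size([1, 1, 240, 320])
```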
Citations
Journal ArticleDOI
TL;DR: In this paper, the authors develop a deep learning framework DL-ROM (deep learning-reduced order modeling) to create a neural network capable of non-linear projections to reduced order states.
Abstract: Reduced order modeling (ROM) has been widely used to create lower order, computationally inexpensive representations of higher-order dynamical systems. Using these representations, ROMs can efficiently model flow fields while using significantly fewer parameters. Conventional ROMs accomplish this by linearly projecting higher-order manifolds to lower-dimensional space using dimensionality reduction techniques such as proper orthogonal decomposition (POD). In this work, we develop a novel deep learning framework DL-ROM (deep learning—reduced order modeling) to create a neural network capable of non-linear projections to reduced order states. We then use the learned reduced state to efficiently predict future time steps of the simulation using 3D Autoencoder and 3D U-Net-based architectures. Our model DL-ROM can create highly accurate reconstructions from the learned ROM and is thus able to efficiently predict future time steps by temporally traversing the learned reduced state. All of this is achieved without ground truth supervision or the need to iteratively solve the expensive Navier–Stokes (NS) equations, thereby resulting in massive computational savings. To test the effectiveness and performance of our approach, we evaluate our implementation on five different computational fluid dynamics (CFD) datasets using reconstruction performance and computational runtime metrics. DL-ROM can reduce the computational run times of iterative solvers by nearly two orders of magnitude while maintaining an acceptable error threshold.
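The core idea above is a non-linear projection to a reduced-order state followed by reconstruction. Below is an illustrative sketch under assumed shapes and channel counts (not the DL-ROM reference implementation): a small 3D convolutional autoencoder that compresses a flow-field snapshot into a latent ROM state and decodes it back.

```python
# Toy 3D convolutional autoencoder for reduced-order modeling (assumed sizes).
import torch
import torch.nn as nn

class Tiny3DAutoencoder(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, 8, 3, stride=2, padding=1),   # halve each dimension
            nn.ReLU(inplace=True),
            nn.Conv3d(8, 16, 3, stride=2, padding=1),         # reduced-order state
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(16, 8, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(8, channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent ROM state
        return self.decoder(z), z

model = Tiny3DAutoencoder()
snapshot = torch.rand(1, 1, 32, 32, 32)   # toy velocity-magnitude volume
recon, latent = model(snapshot)
print(recon.shape, latent.shape)  # reconstruction and compressed state
```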

39 citations

Journal ArticleDOI
TL;DR: In this article, a dual-encoder single-decoder CNN with different weights for feature fusion is proposed for depth estimation of multi-exposure stereo image sequences in 3D HDR video content.
Abstract: Display technologies have evolved over the years. It is critical to develop practical HDR capturing, processing, and display solutions to bring 3D technologies to the next level. Depth estimation of multi-exposure stereo image sequences is an essential task in the development of cost-effective 3D HDR video content. In this paper, we develop a novel deep architecture for multi-exposure stereo depth estimation. The proposed architecture has two novel components. First, the stereo matching technique used in traditional stereo depth estimation is revamped. For the stereo depth estimation component of our architecture, a mono-to-stereo transfer learning approach is deployed. The proposed formulation circumvents the cost volume construction requirement, which is replaced by a ResNet based dual-encoder single-decoder CNN with different weights for feature fusion. EfficientNet based blocks are used to learn the disparity. Secondly, we combine disparity maps obtained from the stereo images at different exposure levels using a robust disparity feature fusion approach. The disparity maps obtained at different exposures are merged using weight maps calculated for different quality measures. The final predicted disparity map obtained is more robust and retains best features that preserve the depth discontinuities. The proposed CNN offers flexibility to train using standard dynamic range stereo data or with multi-exposure low dynamic range stereo sequences. In terms of performance, the proposed model surpasses state-of-the-art monocular and stereo depth estimation methods, both quantitatively and qualitatively, on challenging Scene flow and differently exposed Middlebury stereo datasets. The architecture performs exceedingly well on complex natural scenes, demonstrating its usefulness for diverse 3D HDR applications.
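The second component above merges per-exposure disparity maps with per-pixel weight maps derived from quality measures. The sketch below is my own simplified assumption of how such a fusion could look (it is not the paper's algorithm): local gradient magnitude stands in for the quality measure, and the weights are normalised across exposures before a weighted merge.

```python
# Simplified weight-map fusion of disparity maps from different exposures.
import numpy as np

def fuse_disparities(disparities, exposures, eps=1e-6):
    """disparities: list of HxW arrays; exposures: list of HxW luminance images."""
    weights = []
    for lum in exposures:
        gy, gx = np.gradient(lum.astype(np.float64))
        weights.append(np.hypot(gx, gy) + eps)       # crude per-pixel quality measure
    weights = np.stack(weights)
    weights /= weights.sum(axis=0, keepdims=True)    # normalise across exposures
    return (weights * np.stack(disparities)).sum(axis=0)  # per-pixel weighted merge

# Toy usage with two random exposures of a 4x4 scene.
d1, d2 = np.random.rand(4, 4), np.random.rand(4, 4)
l1, l2 = np.random.rand(4, 4), np.random.rand(4, 4)
print(fuse_disparities([d1, d2], [l1, l2]).shape)  # (4, 4) fused disparity
```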
Journal ArticleDOI
TL;DR: The depth estimation problem is revisited, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network; the proposed algorithm, entitled 2T-UNet, surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene flow dataset.
Abstract: Stereo correspondence matching is an essential part of the multi-step stereo depth estimation process. This paper revisits the depth estimation problem, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network. The proposed algorithm is entitled 2T-UNet. The idea behind 2T-UNet is to replace cost volume construction with twin convolution towers. These towers have an allowance for different weights between them. Additionally, the inputs to the twin encoders in 2T-UNet differ from those of existing stereo methods. Generally, a stereo network takes a right and left image pair as input to determine the scene geometry. However, in the 2T-UNet model, the right stereo image is taken as one input, and the left stereo image, along with its monocular depth clue information, is taken as the other input. Depth clues provide complementary suggestions that help enhance the quality of the predicted scene geometry. 2T-UNet surpasses state-of-the-art monocular and stereo depth estimation methods on the challenging Scene flow dataset, both quantitatively and qualitatively. The architecture performs incredibly well on complex natural scenes, highlighting its usefulness for various real-time applications. Pretrained weights and code will be made readily available.
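A hedged sketch of the two-tower input scheme described above follows; the layer sizes and channel counts are assumptions for illustration, not the published 2T-UNet. The right image feeds one encoder, while the left image concatenated with its monocular depth clue feeds the other, and the towers do not share weights.

```python
# Assumed two-tower stereo network with unshared encoder weights.
import torch
import torch.nn as nn

def tower(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    )

class TwoTowerStereo(nn.Module):
    def __init__(self):
        super().__init__()
        self.right_tower = tower(3)       # RGB right view
        self.left_tower = tower(4)        # RGB left view + 1-channel depth clue
        self.head = nn.Conv2d(128, 1, 1)  # fused features -> disparity

    def forward(self, right, left, left_depth_clue):
        fr = self.right_tower(right)
        fl = self.left_tower(torch.cat([left, left_depth_clue], dim=1))
        return self.head(torch.cat([fr, fl], dim=1))

net = TwoTowerStereo()
r = torch.rand(1, 3, 128, 256)
l = torch.rand(1, 3, 128, 256)
clue = torch.rand(1, 1, 128, 256)
print(net(r, l, clue).shape)  # torch.Size([1, 1, 32, 64]) coarse disparity
```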
Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors proposed an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis to take advantage of the complementary properties of small-scale and large-scale images.
Abstract: Aiming at the problems of high cost and low accuracy of scene details during the depth map generation in 3D reconstruction, we propose an unsupervised monocular image depth prediction algorithm based on Fourier domain analysis. Generally speaking, small-scale images can better display depth details, while large-scale images can more reliably display the depth distribution value of the entire image. In order to take advantage of these complementary properties, we crop the input image with different cropped image ratios to generate multiple disparity map candidates, and then use Fourier frequency domain analysis algorithms to fuse disparity mapping candidates into left and right disparity maps. At the same time, we propose a loss function based on MSSIM to compensate the difference between left and right views and realize unsupervised monocular image depth prediction model training. Experimental results show that our method has good performance on the KITTI dataset.
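Since the abstract leans on a frequency-domain fusion of disparity candidates, here is a simplified sketch of one plausible reading of that idea (my own assumption, not the paper's algorithm): keep low-frequency content from the large-scale candidate, which the abstract says is more reliable globally, and high-frequency content from the small-scale candidate, which better preserves depth details.

```python
# Toy frequency-domain fusion of two disparity candidates via FFT masking.
import numpy as np

def fourier_fuse(disp_large_scale, disp_small_scale, cutoff=0.1):
    h, w = disp_large_scale.shape
    F_large = np.fft.fftshift(np.fft.fft2(disp_large_scale))
    F_small = np.fft.fftshift(np.fft.fft2(disp_small_scale))
    yy, xx = np.mgrid[-h // 2:(h + 1) // 2, -w // 2:(w + 1) // 2]
    radius = np.sqrt((yy / h) ** 2 + (xx / w) ** 2)
    low_pass = radius <= cutoff                    # boolean low-frequency mask
    fused = np.where(low_pass, F_large, F_small)   # low freqs: large scale, rest: small
    return np.real(np.fft.ifft2(np.fft.ifftshift(fused)))

fused = fourier_fuse(np.random.rand(64, 64), np.random.rand(64, 64))
print(fused.shape)  # (64, 64) fused disparity map
```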
Journal ArticleDOI
01 Mar 2023-Sensors
TL;DR: NDWTN as mentioned in this paper proposes a moderately dense encoder-decoder network based on discrete wavelet decomposition and trainable coefficients (LL, LH, HL, HH), which preserves the high-frequency information that is otherwise lost during the downsampling process in the encoder.
Abstract: Applications such as medical diagnosis, navigation, robotics, etc., require 3D images. Recently, deep learning networks have been extensively applied to estimate depth. Depth prediction from 2D images poses a problem that is both ill-posed and non-linear. Such networks are computationally and time-wise expensive, as they have dense configurations. Further, the network performance depends on the trained model configuration, the loss functions used, and the dataset applied for training. We propose a moderately dense encoder-decoder network based on discrete wavelet decomposition and trainable coefficients (LL, LH, HL, HH). Our Nested Wavelet-Net (NDWTN) preserves the high-frequency information that is otherwise lost during the downsampling process in the encoder. Furthermore, we study the effect of activation functions, batch normalization, convolution layers, skip connections, etc., in our models. The network is trained with NYU datasets. Our network trains faster with good results.
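To show what the LL/LH/HL/HH sub-bands above refer to, here is a minimal, self-contained Haar illustration (an assumption chosen for simplicity, not the NDWTN code): one level of 2D discrete wavelet decomposition splits a feature map into four sub-bands, so the high-frequency bands can be carried forward instead of being discarded by plain strided downsampling.

```python
# One-level 2D Haar wavelet decomposition into LL, LH, HL, HH sub-bands.
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar DWT of an even-sized array; returns (LL, LH, HL, HH)."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    LL = (a + b + c + d) / 2.0   # low-pass in both directions
    LH = (a - b + c - d) / 2.0   # detail along one axis
    HL = (a + b - c - d) / 2.0   # detail along the other axis
    HH = (a - b - c + d) / 2.0   # diagonal detail
    return LL, LH, HL, HH

x = np.random.rand(8, 8)
LL, LH, HL, HH = haar_dwt2(x)
print(LL.shape, LH.shape, HL.shape, HH.shape)  # four 4x4 sub-bands
```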
References
Book ChapterDOI
05 Oct 2015
TL;DR: Ronneberger et al. as discussed by the authors proposed a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently; it can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.
Abstract: There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .
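For reference, below is a deliberately tiny sketch of the contracting/expanding architecture the abstract describes, reduced to one level with assumed channel counts (the original U-Net is much deeper and wider). It shows the pooling-based contracting path, the transposed-convolution expanding path, and the skip connection that enables precise localization.

```python
# Minimal one-level U-Net-style encoder-decoder with a skip connection.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=2):
        super().__init__()
        self.down = double_conv(in_ch, 16)           # contracting path
        self.pool = nn.MaxPool2d(2)
        self.bottom = double_conv(16, 32)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.decode = double_conv(32, 16)            # 16 (skip) + 16 (upsampled)
        self.out = nn.Conv2d(16, out_ch, 1)

    def forward(self, x):
        d = self.down(x)
        b = self.bottom(self.pool(d))
        u = self.up(b)
        u = torch.cat([d, u], dim=1)                 # skip connection
        return self.out(self.decode(u))

print(TinyUNet()(torch.rand(1, 1, 64, 64)).shape)    # torch.Size([1, 2, 64, 64])
```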

49,590 citations

Posted Content
Sergey Ioffe1, Christian Szegedy1
TL;DR: Batch Normalization as mentioned in this paper normalizes layer inputs for each training mini-batch to reduce the internal covariate shift in deep neural networks, and achieves state-of-the-art performance on ImageNet.
Abstract: Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
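The normalization step described above follows the standard batch-norm formula: normalize each feature over the mini-batch, then apply a learned scale and shift. The sketch below implements the training-time computation on a plain 2D activation matrix (running statistics and the convolutional case are omitted for brevity).

```python
# Training-time batch normalization over a (batch, features) activation matrix.
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learned affine transform

x = np.random.randn(32, 4) * 5.0 + 3.0      # badly scaled activations
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```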

17,184 citations

Proceedings ArticleDOI
01 Oct 2016
TL;DR: A fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps is proposed and a novel way to efficiently learn feature map up-sampling within the network is presented.
Abstract: This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss that is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. As a result, it runs in real-time on images or videos. In the evaluation, we show that the proposed model contains fewer parameters and requires fewer training data than the current state of the art, while outperforming all approaches on depth estimation. Code and models are publicly available.
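The reverse Huber (berHu) loss mentioned above behaves like L1 for small residuals and quadratically beyond a threshold c. The sketch below follows that usual definition; the common choice c = 0.2 times the maximum residual in the batch is an assumption here, so treat the exact setting as illustrative.

```python
# Reverse Huber (berHu) loss: L1 for small residuals, quadratic beyond threshold c.
import torch

def berhu_loss(pred, target):
    diff = (pred - target).abs()
    c = 0.2 * diff.max().clamp(min=1e-6)          # batch-dependent threshold (assumed)
    l1_part = diff
    l2_part = (diff ** 2 + c ** 2) / (2 * c)
    return torch.where(diff <= c, l1_part, l2_part).mean()

pred, target = torch.rand(2, 1, 8, 8), torch.rand(2, 1, 8, 8)
print(berhu_loss(pred, target).item())
```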

1,677 citations

Proceedings ArticleDOI
01 Jul 2002
TL;DR: A new technique for the display of high-dynamic-range images, which reduces the contrast while preserving detail, is presented, based on a two-scale decomposition of the image into a base layer, encoding large-scale variations, and a detail layer.
Abstract: We present a new technique for the display of high-dynamic-range images, which reduces the contrast while preserving detail. It is based on a two-scale decomposition of the image into a base layer, encoding large-scale variations, and a detail layer. Only the base layer has its contrast reduced, thereby preserving detail. The base layer is obtained using an edge-preserving filter called the bilateral filter. This is a non-linear filter, where the weight of each pixel is computed using a Gaussian in the spatial domain multiplied by an influence function in the intensity domain that decreases the weight of pixels with large intensity differences. We express bilateral filtering in the framework of robust statistics and show how it relates to anisotropic diffusion. We then accelerate bilateral filtering by using a piecewise-linear approximation in the intensity domain and appropriate subsampling. This results in a speed-up of two orders of magnitude. The method is fast and requires no parameter setting.
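The weighting scheme described above is easy to state in code: each output pixel is a normalised sum of its neighbours, weighted by a spatial Gaussian multiplied by an intensity-difference Gaussian. The brute-force sketch below illustrates exactly that (window size and sigmas are illustrative choices; the paper's piecewise-linear acceleration is not shown).

```python
# Brute-force bilateral filter: spatial Gaussian times intensity influence function.
import numpy as np

def bilateral_filter(img, sigma_s=2.0, sigma_r=0.1, radius=3):
    h, w = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    padded = np.pad(img, radius, mode='edge')
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            influence = np.exp(-((patch - img[i, j]) ** 2) / (2 * sigma_r ** 2))
            weights = spatial * influence
            out[i, j] = (weights * patch).sum() / weights.sum()
    return out

base = bilateral_filter(np.random.rand(32, 32))
print(base.shape)  # (32, 32) base layer; the detail layer is the input minus this
```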

1,612 citations

Journal ArticleDOI
01 Aug 2004
TL;DR: This paper presents a simple colorization method that requires neither precise image segmentation, nor accurate region tracking, and demonstrates that high quality colorizations of stills and movie clips may be obtained from a relatively modest amount of user input.
Abstract: Colorization is a computer-assisted process of adding color to a monochrome image or movie. The process typically involves segmenting images into regions and tracking these regions across image sequences. Neither of these tasks can be performed reliably in practice; consequently, colorization requires considerable user intervention and remains a tedious, time-consuming, and expensive task. In this paper we present a simple colorization method that requires neither precise image segmentation, nor accurate region tracking. Our method is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors. We formalize this premise using a quadratic cost function and obtain an optimization problem that can be solved efficiently using standard techniques. In our approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence. We demonstrate that high quality colorizations of stills and movie clips may be obtained from a relatively modest amount of user input.
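A toy sketch of the quadratic-cost idea above follows (a simplified restatement for a single colour channel on a small image, not the authors' released code): each unscribbled pixel's colour is constrained to equal an intensity-weighted average of its neighbours' colours, scribbled pixels are fixed, and the result is a sparse linear system. The weight function and sigma are assumptions for illustration.

```python
# Scribble-based colour propagation as a sparse linear system (single channel).
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def colorize_channel(intensity, scribble, mask, sigma=0.05):
    """intensity, scribble: HxW arrays; mask: True where a colour scribble is given."""
    h, w = intensity.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)
    A = lil_matrix((n, n))
    b = np.zeros(n)
    for i in range(h):
        for j in range(w):
            r = idx[i, j]
            if mask[i, j]:
                A[r, r] = 1.0
                b[r] = scribble[i, j]      # keep the scribbled colour value
                continue
            nbrs = [(i + di, j + dj) for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                    if 0 <= i + di < h and 0 <= j + dj < w]
            wts = np.array([np.exp(-(intensity[i, j] - intensity[y, x]) ** 2 /
                                   (2 * sigma ** 2)) for y, x in nbrs])
            wts /= wts.sum()
            A[r, r] = 1.0
            for (y, x), wt in zip(nbrs, wts):
                A[r, idx[y, x]] = -wt      # colour = weighted neighbour average
    return spsolve(A.tocsr(), b).reshape(h, w)

gray = np.random.rand(8, 8)
scrib = np.zeros((8, 8))
scrib[0, 0], scrib[7, 7] = 0.2, 0.9
mask = np.zeros((8, 8), bool)
mask[0, 0] = mask[7, 7] = True
print(colorize_channel(gray, scrib, mask).shape)  # (8, 8) propagated colour channel
```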

1,505 citations