Book Chapter

Space-Time Super-Resolution using Deep Learning-based Framework

TL;DR: The experimental results show that the proposed H.264/AVC-compatible framework outperforms state-of-the-art space-time SR techniques in terms of quality and time complexity.
Abstract: This paper introduces a novel end-to-end deep learning framework to learn the space-time super-resolution (SR) process. We propose a coupled deep convolutional auto-encoder (CDCA) which learns the non-linear mapping between convolutional features of up-sampled low-resolution (LR) video sequence patches and convolutional features of high-resolution (HR) video sequence patches. Up-sampling of the LR video refers to tricubic interpolation in both space and time. We also propose an H.264/AVC-compatible video space-time SR framework that uses the learned CDCA, which enables super-resolving compressed LR video with lower computational complexity. The experimental results show that the proposed H.264/AVC-compatible framework outperforms state-of-the-art space-time SR techniques in terms of quality and time complexity.
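As a rough illustration of the CDCA idea (not the paper's exact architecture), the sketch below couples two convolutional auto-encoders and learns a mapping from features of the up-sampled LR clip to features of the HR clip; layer widths, kernel sizes, and the trilinear stand-in for tricubic interpolation are assumptions.

# A minimal sketch of a coupled deep convolutional auto-encoder (CDCA) for
# space-time SR. Layer widths, kernel sizes, and the trilinear stand-in for
# tricubic interpolation are illustrative assumptions, not the paper's exact
# configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAutoEncoder(nn.Module):
    """Encoder/decoder pair over video patches shaped (N, C, T, H, W)."""
    def __init__(self, channels=1, feats=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(channels, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(feats, feats, 3, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.Conv3d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(feats, channels, 3, padding=1))

    def forward(self, x):                      # used when pretraining each AE
        return self.decoder(self.encoder(x))

class CDCA(nn.Module):
    """Couples an LR and an HR auto-encoder through a feature-mapping net."""
    def __init__(self, channels=1, feats=64):
        super().__init__()
        self.lr_ae = ConvAutoEncoder(channels, feats)
        self.hr_ae = ConvAutoEncoder(channels, feats)
        self.mapping = nn.Sequential(          # LR features -> HR features
            nn.Conv3d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(feats, feats, 3, padding=1))

    def forward(self, lr_video, scale=2):
        # Up-sample in space and time; trilinear interpolation stands in for
        # tricubic, which PyTorch does not provide for 5-D tensors.
        up = F.interpolate(lr_video, scale_factor=scale, mode='trilinear',
                           align_corners=False)
        lr_feat = self.lr_ae.encoder(up)
        hr_feat = self.mapping(lr_feat)        # learned LR -> HR feature map
        return self.hr_ae.decoder(hr_feat)     # reconstruct HR video patches

# Usage: an 8-frame 32x32 LR clip super-resolved 2x in both space and time.
clip = torch.rand(1, 1, 8, 32, 32)
print(CDCA()(clip).shape)                      # torch.Size([1, 1, 16, 64, 64])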
Citations
Proceedings Article
14 Jun 2020
TL;DR: In this paper, the authors propose a model called STARnet that super-resolves jointly in space and time to leverage mutually informative relationships between the two: higher resolution can provide more detailed information about motion, and a higher frame rate can provide better pixel alignment.
Abstract: We consider the problem of space-time super-resolution (ST-SR): increasing the spatial resolution of video frames and simultaneously interpolating frames to increase the frame rate. Modern approaches handle these axes one at a time. In contrast, our proposed model, called STARnet, super-resolves jointly in space and time. This allows us to leverage mutually informative relationships between time and space: higher resolution can provide more detailed information about motion, and a higher frame rate can provide better pixel alignment. The components of our model that generate latent low- and high-resolution representations during ST-SR can be used to finetune a specialized mechanism for just spatial or just temporal super-resolution. Experimental results demonstrate that STARnet improves the performance of space-time, spatial, and temporal video super-resolution by substantial margins on publicly available datasets.

77 citations

Book Chapter
Jaeyeon Kang, Younghyun Jo, Seoung Wug Oh, Peter Vajda, Seon Joo Kim
23 Aug 2020
TL;DR: An end-to-end DNN framework for space-time video upsampling is proposed that efficiently merges VSR and FI into a joint framework, together with a novel weighting scheme that fuses input frames effectively without explicit motion compensation for efficient processing of videos.
Abstract: Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems, and their performance has recently been improving by incorporating deep learning. In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. One solution is to run VSR and FI one by one, independently; this is highly inefficient, as heavy deep neural networks (DNN) are involved in each solution. To this end, we propose an end-to-end DNN framework for space-time video upsampling that efficiently merges VSR and FI into a joint framework. In our framework, a novel weighting scheme is proposed to fuse all input frames effectively without explicit motion compensation for efficient processing of videos. Our method shows better results both quantitatively and qualitatively, while reducing the computation time (7× faster) and the number of parameters (by 30%) compared to baselines. Our source code is available at https://github.com/JaeYeonKang/STVUN-Pytorch.
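The paper's exact weighting scheme is not reproduced here; the following sketch only illustrates the general idea of fusing several input frames with learned per-pixel weights and no explicit motion compensation, with the layer sizes and the softmax fusion being assumptions.

# Sketch of fusing multiple input frames with learned per-pixel weights and
# no explicit motion compensation. All layer sizes are illustrative.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_frames=4, channels=3, feats=32):
        super().__init__()
        # Predict one weight map per input frame from the stacked frames.
        self.weight_net = nn.Sequential(
            nn.Conv2d(num_frames * channels, feats, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feats, num_frames, 3, padding=1))

    def forward(self, frames):
        # frames: (N, T, C, H, W)
        n, t, c, h, w = frames.shape
        stacked = frames.reshape(n, t * c, h, w)
        weights = self.weight_net(stacked).softmax(dim=1)   # (N, T, H, W)
        # Weighted sum over the temporal axis -> one fused frame (N, C, H, W).
        return (frames * weights.unsqueeze(2)).sum(dim=1)

fused = WeightedFusion()(torch.rand(2, 4, 3, 64, 64))
print(fused.shape)                                           # torch.Size([2, 3, 64, 64])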

15 citations


Cites methods from "Space-Time Super-Resolution using Deep Learning-based Framework"

  • ...[25] first used a DNN architecture for the joint space-time upsampling....


Proceedings Article
01 Jan 2021
TL;DR: Wang et al. propose a dual-stream fusion network to adaptively fuse the intermediate results produced by two spatiotemporal up-sampling streams, where the first stream applies spatial super-resolution followed by temporal super-resolution, while the second cascades them in the reverse order.
Abstract: Visual data upsampling has been an important research topic for improving perceptual quality and benefiting various computer vision applications. In recent years, we have witnessed remarkable progress brought by the renaissance of deep learning techniques for video and image super-resolution. However, most existing methods focus on advancing super-resolution in either the spatial or the temporal direction, i.e., increasing the spatial resolution or the video frame rate. In this paper, we instead discuss both directions jointly and tackle the spatiotemporal upsampling problem. Our method is based on an important observation: even a direct cascade of prior work in spatial and temporal super-resolution can achieve spatiotemporal upsampling, and changing the order in which they are combined leads to results with complementary properties. Thus, we propose a dual-stream fusion network to adaptively fuse the intermediate results produced by two spatiotemporal up-sampling streams, where the first stream applies spatial super-resolution followed by temporal super-resolution, while the second cascades them in the reverse order. Extensive experiments verify the efficacy of the proposed method against several baselines. Moreover, we investigate various spatial and temporal upsampling methods as the basis of our two-stream model and demonstrate the flexibility and wide applicability of the proposed framework.
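As a sketch of the fusion step alone (the two spatiotemporal up-sampling streams are treated as black boxes, and the per-pixel soft mask is only one plausible way to realize "adaptive fusion", not the paper's exact network):

# Sketch of adaptively fusing the outputs of two spatiotemporal upsampling
# streams (space-then-time vs. time-then-space). The per-pixel soft mask is
# an illustrative choice.
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    def __init__(self, channels=3, feats=32):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, out_s_then_t, out_t_then_s):
        # Both inputs: (N, C, H, W) frames at the target resolution and rate.
        mask = self.mask_net(torch.cat([out_s_then_t, out_t_then_s], dim=1))
        return mask * out_s_then_t + (1.0 - mask) * out_t_then_s

a, b = torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128)
print(DualStreamFusion()(a, b).shape)          # torch.Size([1, 3, 128, 128])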

2 citations

Proceedings Article
09 Jul 2020
TL;DR: The proposed cross-stream consistency does not consume labeled training data and can guide network training in an unsupervised manner, so an effective model can be derived from few high-resolution, high-frame-rate videos, achieving state-of-the-art performance.
Abstract: Spatiotemporal super-resolution (SR) aims to upscale both the spatial and temporal dimensions of input videos, producing videos with higher frame resolutions and rates. It involves two essential sub-tasks: spatial SR and temporal SR. We design a two-stream network for spatiotemporal SR in this work. One stream contains a temporal SR module followed by a spatial SR module, while the other stream has the same two modules in the reverse order. Based on the interchangeability of performing the two sub-tasks, the two network streams are supposed to produce consistent spatiotemporal SR results. Thus, we present a cross-stream consistency to enforce the similarity between the outputs of the two streams. In this way, the training of the two streams is correlated, which allows the two SR modules to share their supervisory signals and improve each other. In addition, the proposed cross-stream consistency does not consume labeled training data and can guide network training in an unsupervised manner. We leverage this property to carry out semi-supervised spatiotemporal SR. It turns out that our method makes the most of the training data and can derive an effective model with few high-resolution, high-frame-rate videos, achieving state-of-the-art performance. The source code of this work is available at https://hankweb.github.io/STSRwithCrossTask/.
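A minimal sketch of what such a cross-stream consistency term could look like, assuming an L1 penalty between the two streams' outputs and a simple weighting for the semi-supervised case (both are illustrative choices, not the paper's exact formulation):

# Sketch of a cross-stream consistency loss: the two streams (temporal-then-
# spatial SR and spatial-then-temporal SR) should agree on unlabeled clips,
# so their outputs are pulled together with an L1 penalty.
import torch
import torch.nn.functional as F

def cross_stream_consistency(out_ts, out_st):
    """L1 distance between the two streams' spatiotemporal SR outputs."""
    return F.l1_loss(out_ts, out_st)

def semi_supervised_loss(out_ts, out_st, target=None, weight=0.1):
    """Supervised reconstruction when an HR, high-frame-rate target exists,
    consistency-only otherwise. The weight value is illustrative."""
    consistency = cross_stream_consistency(out_ts, out_st)
    if target is None:                       # unlabeled clip
        return weight * consistency
    recon = F.l1_loss(out_ts, target) + F.l1_loss(out_st, target)
    return recon + weight * consistency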

1 citation


Cites background or methods from "Space-Time Super-Resolution using Deep Learning-based Framework"

  • ...We compare our method with two existing methods: one is an example-based method [Shahar et al., 2011] and the other is a deep-learning-based method [Sharma et al., 2017], called coupled deep convolutional auto-encoder (CDCA)....


  • ...[Sharma et al., 2017] propose the first deep-learning-based method, called coupled deep convolutional auto-encoder (CDCA), for spatiotemporal SR....


  • ...Sharma et al. [Sharma et al., 2017] propose the first deep-learning-based method, called coupled deep convolutional auto-encoder (CDCA), for spatiotemporal SR. CDCA generates the convolutional feature maps of the spatial patches in up-sampled LR and HR video frames using convolutional auto-encoder…...


  • ...However, few research advancements [Shahar et al., 2011; Sharma et al., 2017] have been made on spatiotemporal SR, which is more practical for low-quality video processing and understanding....



Posted Content
TL;DR: The components of the model that generate latent low- and high-resolution representations during ST-SR can be used to finetune a specialized mechanism for just spatial or just temporal super-resolution.
Abstract: We consider the problem of space-time super-resolution (ST-SR): increasing the spatial resolution of video frames and simultaneously interpolating frames to increase the frame rate. Modern approaches handle these axes one at a time. In contrast, our proposed model, called STARnet, super-resolves jointly in space and time. This allows us to leverage mutually informative relationships between time and space: higher resolution can provide more detailed information about motion, and a higher frame rate can provide better pixel alignment. The components of our model that generate latent low- and high-resolution representations during ST-SR can be used to finetune a specialized mechanism for just spatial or just temporal super-resolution. Experimental results demonstrate that STARnet improves the performance of space-time, spatial, and temporal video super-resolution by substantial margins on publicly available datasets.

1 citation


Cites methods from "Space-Time Super-Resolution using Deep Learning-based Framework"

  • ...[48] proposed STSR method to learn LR-HR non-linear mapping....


References
Proceedings Article
27 Jun 2016
TL;DR: This paper presents the first convolutional neural network capable of real-time SR of 1080p videos on a single K2 GPU and introduces an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output.
Abstract: Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a single filter, commonly bicubic interpolation, before reconstruction. This means that the super-resolution (SR) operation is performed in HR space. We demonstrate that this is sub-optimal and adds computational complexity. In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation. We evaluate the proposed approach using images and videos from publicly available datasets and show that it performs significantly better (+0.15dB on Images and +0.39dB on Videos) and is an order of magnitude faster than previous CNN-based methods.
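The sub-pixel convolution layer described above corresponds to what PyTorch exposes as PixelShuffle; below is a minimal ESPCN-style sketch, where the filter counts and kernel sizes follow common choices and are assumptions here rather than the paper's exact settings.

# Minimal ESPCN-style network: features are extracted in LR space, and a
# final convolution outputs r^2 channels per output channel that a
# pixel-shuffle rearranges into the HR image.
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    def __init__(self, channels=1, upscale=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, 3, padding=1), nn.Tanh(),
            nn.Conv2d(32, channels * upscale ** 2, 3, padding=1))
        self.shuffle = nn.PixelShuffle(upscale)    # sub-pixel convolution

    def forward(self, lr):
        return self.shuffle(self.body(lr))         # all convs run in LR space

print(ESPCN(upscale=3)(torch.rand(1, 1, 60, 60)).shape)   # torch.Size([1, 1, 180, 180])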

4,770 citations

Book Chapter
06 Sep 2014
TL;DR: This work proposes a deep learning method for single image super-resolution (SR) that directly learns an end-to-end mapping between the low/high-resolution images and shows that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network.
Abstract: We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) [15] that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage.
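A minimal sketch of this end-to-end mapping, assuming the widely used three-layer 9-1-5 SRCNN layout with bicubic pre-upsampling; the exact filter counts are assumptions here.

# Minimal SRCNN-style sketch: the LR image is first up-scaled with bicubic
# interpolation, then a three-layer CNN (patch extraction, non-linear
# mapping, reconstruction) refines it into the HR estimate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    def __init__(self, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 1), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 5, padding=2))

    def forward(self, lr, scale=2):
        up = F.interpolate(lr, scale_factor=scale, mode='bicubic',
                           align_corners=False)
        return self.net(up)                    # SR performed in HR space

print(SRCNN()(torch.rand(1, 1, 32, 32)).shape)   # torch.Size([1, 1, 64, 64])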

4,445 citations

Proceedings Article
27 Jun 2016
TL;DR: In this article, a very deep convolutional network inspired by VGG-net is used for image super-resolution, achieving state-of-the-art accuracy.
Abstract: We present a highly accurate single-image superresolution (SR) method. Our method uses a very deep convolutional network inspired by VGG-net used for ImageNet classification [19]. We find increasing our network depth shows a significant improvement in accuracy. Our final model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. With very deep networks, however, convergence speed becomes a critical issue during training. We propose a simple yet effective training procedure. We learn residuals only and use extremely high learning rates (10^4 times higher than SRCNN [6]) enabled by adjustable gradient clipping. Our proposed method performs better than existing methods in accuracy and visual improvements in our results are easily noticeable.
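A short sketch of the residual-learning idea with gradient clipping during training follows; the depth, learning rate, and use of norm-based clipping in place of the paper's adjustable per-value clipping are assumptions.

# Sketch of VDSR-style residual learning: a deep stack of 3x3 convolutions
# predicts the residual between the bicubic-up-scaled input and the HR
# target, and gradients are clipped so high learning rates stay stable.
import torch
import torch.nn as nn

class VDSR(nn.Module):
    def __init__(self, channels=1, feats=64, depth=20):
        super().__init__()
        layers = [nn.Conv2d(channels, feats, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(feats, channels, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, upscaled_lr):
        return upscaled_lr + self.body(upscaled_lr)   # learn the residual only

# One illustrative training step with a high learning rate and clipping.
model = VDSR()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
x, y = torch.rand(4, 1, 41, 41), torch.rand(4, 1, 41, 41)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.4)  # stands in for adjustable clipping
opt.step()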

4,136 citations

Posted Content
TL;DR: This work presents a highly accurate single-image super-resolution (SR) method using a very deep convolutional network inspired by the VGG-net used for ImageNet classification, trained with extremely high learning rates enabled by adjustable gradient clipping.
Abstract: We present a highly accurate single-image super-resolution (SR) method. Our method uses a very deep convolutional network inspired by VGG-net used for ImageNet classification [Simonyan and Zisserman, 2015]. We find increasing our network depth shows a significant improvement in accuracy. Our final model uses 20 weight layers. By cascading small filters many times in a deep network structure, contextual information over large image regions is exploited in an efficient way. With very deep networks, however, convergence speed becomes a critical issue during training. We propose a simple yet effective training procedure. We learn residuals only and use extremely high learning rates (10^4 times higher than SRCNN [Dong et al., 2015]) enabled by adjustable gradient clipping. Our proposed method performs better than existing methods in accuracy and visual improvements in our results are easily noticeable.

3,628 citations

Journal Article
TL;DR: This paper proposes a CNN that is trained on both the spatial and the temporal dimensions of videos to enhance their spatial resolution, and shows that by pretraining the model on images, a relatively small video database is sufficient to train the model to match and even improve upon the current state-of-the-art.
Abstract: Convolutional neural networks (CNN) are a special type of deep neural networks (DNN). They have so far been successfully applied to image super-resolution (SR) as well as other image restoration tasks. In this paper, we consider the problem of video super-resolution. We propose a CNN that is trained on both the spatial and the temporal dimensions of videos to enhance their spatial resolution. Consecutive frames are motion compensated and used as input to a CNN that provides super-resolved video frames as output. We investigate different options of combining the video frames within one CNN architecture. While large image databases are available to train deep neural networks, it is more challenging to create a large video database of sufficient quality to train neural nets for video restoration. We show that by using images to pretrain our model, a relatively small video database is sufficient for the training of our model to achieve and even improve upon the current state-of-the-art. We compare our proposed approach to current video as well as image SR algorithms.
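A sketch of this kind of pipeline, assuming motion compensation is done by backward-warping neighboring frames toward the center frame with a precomputed optical flow; the warping utility, frame count, and layer sizes are illustrative, not the paper's configuration.

# Sketch of a video SR input pipeline: consecutive frames are motion
# compensated toward the center frame (here by warping with a precomputed
# flow), concatenated channel-wise, and fed to a CNN that outputs the
# super-resolved center frame.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a frame (N, C, H, W) with an optical flow (N, 2, H, W)."""
    _, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)

class VideoSRCNN(nn.Module):
    def __init__(self, num_frames=3, channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_frames * channels, 64, 9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, 5, padding=2))

    def forward(self, compensated_frames):      # list of (N, C, H, W) frames
        return self.net(torch.cat(compensated_frames, dim=1))

# Usage with zero flow (identity warp) just to show the shapes.
frames = [torch.rand(1, 1, 64, 64) for _ in range(3)]
flow = torch.zeros(1, 2, 64, 64)
inputs = [warp(frames[0], flow), frames[1], warp(frames[2], flow)]
print(VideoSRCNN()(inputs).shape)               # torch.Size([1, 1, 64, 64])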

541 citations