NTIRE 2019 Challenge on Video Super-Resolution: Methods and Results
Seungjun Nah Radu Timofte Shuhang Gu Sungyong Baik Seokil Hong
Gyeongsik Moon Sanghyun Son Kyoung Mu Lee Xintao Wang Kelvin C.K. Chan
Ke Yu Chao Dong Chen Change Loy Yuchen Fan Jiahui Yu Ding Liu
Thomas S. Huang Xiao Liu Chao Li Dongliang He Yukang Ding Shilei Wen
Fatih Porikli Ratheesh Kalarot Muhammad Haris Greg Shakhnarovich
Norimichi Ukita Peng Yi Zhongyuan Wang Kui Jiang Junjun Jiang Jiayi Ma
Hang Dong Xinyi Zhang Zhe Hu Kwanyoung Kim Dong Un Kang
Se Young Chun Kuldeep Purohit A. N. Rajagopalan Yapeng Tian Yulun Zhang
Yun Fu Chenliang Xu A. Murat Tekalp M. Akin Yilmaz Cansu Korkmaz
Manoj Sharma Megh Makwana Anuj Badhwar Ajay Pratap Singh
Avinash Upadhyay Rudrabha Mukhopadhyay Ankit Shukla Dheeraj Khanna
A. S. Mandal Santanu Chaudhury Si Miao Yongxin Zhu Xiao Huo
Abstract
This paper reviews the first NTIRE challenge on video super-resolution (restoration of rich details in low-resolution video frames) with a focus on the proposed solutions and results. A new REalistic and Dynamic Scenes (REDS) dataset was employed. The challenge was divided into 2 tracks: Track 1 employed the standard bicubic downscaling setup, while Track 2 added realistic dynamic motion blurs. The two tracks had 124 and 104 registered participants, respectively, and a total of 14 teams competed in the final testing phase. Their results gauge the state-of-the-art in video super-resolution.
1. Introduction
Example-based video super-resolution (SR) aims at restoring rich details (high frequencies) in low-resolution video frames based on a set of prior examples of low-resolution and high-resolution videos. The loss of content can be caused by various factors, such as quantization error, limitations of the capturing camera's sensor, the presence of defocus, motion blur, or other degrading operators, and the use of downsampling operators to reduce the video resolution for storage purposes. Just like conventional single image SR, video SR is an ill-posed problem because, for each low-resolution (LR) frame, the space of corresponding high-resolution (HR) frames can be very large.

S. Nah (seungjun.nah@gmail.com, Seoul National University), R. Timofte, S. Gu, S. Baik, S. Hong, G. Moon, S. Son, and K. M. Lee are the NTIRE 2019 challenge organizers, while the other authors participated in the challenge. Appendix A contains the authors' teams and affiliations. NTIRE webpage: http://www.vision.ee.ethz.ch/ntire19
In recent years, a significant amount of literature has focused on video super-resolution research. The performance of the top methods has continuously improved [28, 22, 15, 2, 24, 9, 21, 35], showing that the field is maturing. However, compared with single image super-resolution [1, 27, 29], video super-resolution lacks standardized benchmarks that allow for an assessment based on identical datasets and criteria. Recently, most video SR publications have used the Vid4 [14] dataset for evaluation and comparison. The Vid4 dataset contains 4 sequences, each consisting of 30 to 45 frames, with a frame resolution of 480 × 704 or 576 × 704. In some works, other datasets such as YT10 [21], Val4 [9], SPMCS [24], and CDVL [2] have also been proposed for evaluation; however, they are not yet widely used for comparison. While these video super-resolution datasets have brought substantial improvements to the domain, they still have significant shortcomings: (1) they lack a standard training set, and recent video SR works are trained on sets chosen rather arbitrarily; (2) the test sets are small and of low resolution (often below HD); (3) the downsampling methods used for LR data generation (Gaussian blurs and bicubic kernels) are mixed and not standardized, and they are not consistent with the single image SR literature, where bicubic interpolation is usually employed.
The NTIRE 2019 video super-resolution challenge is a step forward in benchmarking and training of video super-resolution algorithms. It uses the REalistic and Dynamic Scenes (REDS) dataset [16], consisting of 30000 reference frames with two types of degradation: standard bicubic downscaling and additional dynamic motion blurs that are locally variant. Fig. 1 shows example images from the REDS dataset. The REDS dataset is introduced in [16] along with a study of the challenge results. In the following, we describe the challenge, present and discuss the results, and describe the proposed methods.

Figure 1: Visualization of a video frame and its corresponding low-resolution frames from the REDS dataset. (a) Sharp HR; (b) LR (bicubic ×4↓); (c) LR (bicubic ×4↓ + blur).
2. NTIRE 2019 Challenge
The objectives of the NTIRE 2019 challenge on video super-resolution are: (i) to gauge and push the state-of-the-art in video super-resolution; (ii) to compare different solutions; (iii) to promote a novel large dataset (REDS); and (iv) to promote more challenging video super-resolution settings.
2.1. REDS Dataset
As a step forward from previously proposed super-resolution and deblurring datasets, a novel dataset, REDS [16], is promoted. It consists of 300 video sequences, each containing 100 frames of 720 × 1280 resolution. 240 sequences are used for training, 30 for validation, and the remaining 30 for testing. The frames are of high quality in terms of reference sharpness, diverse scenes and locations, and realistic approximations of motion blur. REDS covers a large diversity of content: people, handmade objects, and environments (cities).
All the videos used to create the REDS dataset were manually recorded with a GoPro HERO6 Black camera. They were originally recorded at 1920 × 1080 resolution and 120 fps. We calibrated the camera response function using [20] with regularization. The frames are then interpolated [19] to virtually increase the frame rate up to 1920 fps, so that averaged frames exhibit smooth and realistic blurs without step artifacts. The virtual frames are then averaged in the signal space to mimic the camera imaging pipeline [17]. To suppress noise and compression artifacts, we downscale the reference sharp frames and the corresponding blurry frames to 720 × 1280 resolution. This preprocessing also increases the effective pixel density. The resulting blurry video frames resemble 24 fps video captured with a duty cycle of τ = 0.8. Finally, the sharp and blurry frames are ×4 downscaled with the bicubic kernel to generate the low-resolution videos.
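As a rough illustration of this degradation pipeline, the sketch below averages interpolated high-frame-rate frames in an approximately linear signal space and then applies bicubic ×4 downscaling. It is only a simplified approximation under our own assumptions: a plain gamma curve stands in for the calibrated camera response function, frame interpolation is assumed to have been done beforehand, and the 1080p-to-720p resizing step is omitted.

```python
# Minimal sketch of a REDS-style blur synthesis, under simplifying assumptions.
# `virtual_frames` is a list of already-interpolated high-fps HR frames
# (H x W x 3, float32 in [0, 1]); a gamma of 2.2 approximates the CRF.
import numpy as np
import cv2

GAMMA = 2.2  # assumption: simple gamma in place of the calibrated CRF

def synthesize_blurry_lr(virtual_frames, scale=4):
    # Map frames to an (approximately) linear signal space before averaging,
    # so that the temporal average mimics sensor integration over the exposure.
    linear = [np.power(f, GAMMA) for f in virtual_frames]
    blurry_linear = np.mean(linear, axis=0)
    # Back to the display (sRGB-like) domain.
    blurry = np.power(blurry_linear, 1.0 / GAMMA)
    # Bicubic x4 downscaling to produce the low-resolution blurry frame.
    h, w = blurry.shape[:2]
    lr = cv2.resize(blurry, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    return lr
```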
2.2. Tracks and competitions
Track 1: Clean facilitates easy deployment of many video super-resolution methods. It assumes that the degradation comes only from downscaling. We generate each LR frame from the HR REDS frame by using the MATLAB function imresize with bicubic interpolation and a downscaling factor of 4.
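For reference, a rough Python analogue of this degradation is sketched below. It is not the challenge's official generation code, and exact pixel values from MATLAB's imresize (which applies antialiasing when shrinking) are not guaranteed to match.

```python
# Rough Python analogue of the Track 1 degradation (bicubic x4 downscaling).
from PIL import Image

def make_clean_lr(hr_path, lr_path, scale=4):
    hr = Image.open(hr_path)                      # e.g. a 1280 x 720 HR REDS frame
    lr_size = (hr.width // scale, hr.height // scale)
    hr.resize(lr_size, resample=Image.BICUBIC).save(lr_path)
```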
Track 2: Blur goes one step further and also considers motion blur from fast-moving objects or shaking cameras. No Gaussian or other type of noise is added to the frames; only motion blur from dynamic scenes is incorporated. We obtain each blurry LR frame following the procedure described in Section 2.1; more details are provided in [16]. The blur is locally variant, and no further information such as blur strength or kernel shape was provided. Each ground-truth HR RGB frame from REDS is bicubically downscaled to the corresponding LR frame and used for training, validation, or testing of the methods.
Competitions Both video super-resolution challenge tracks are hosted as CodaLab competitions. The CodaLab platform was used for all of the NTIRE 2019 challenge competitions. Each participant is required to register for the CodaLab challenge tracks to access the data and submit their super-resolved results.
Challenge phases (1) Development (training) phase: the participants got both the LR and HR training video frames and the LR frames of the validation set. The participants had the opportunity to test their solutions on the LR validation frames and to receive feedback by uploading their results to the server. Due to the large scale of the validation dataset, every 10th frame was used in the evaluation. A validation leaderboard is available. (2) Final evaluation (test) phase: the participants got the sharp HR validation frames together with the LR test frames. They had to submit both the super-resolved frames and a description of their methods before the challenge deadline. One week later, the final results were made available to the participants. The final results reflect the performance on every frame of the test set.
Evaluation protocol The Peak Signal-to-Noise Ratio (PSNR) measured in decibels (dB) and the Structural Similarity Index (SSIM) [34], computed between a result frame and the ground truth, are the quantitative measures. The higher the scores, the better the restoration fidelity to the ground-truth frame. Because of boundary effects that may appear with particular methods, we ignore a rim of 1 pixel during the evaluation.
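A minimal sketch of this protocol, assuming 8-bit RGB frames as NumPy arrays and the scikit-image metric functions (keyword names vary slightly across scikit-image versions), could look as follows; it is illustrative rather than the official scoring script.

```python
# PSNR and SSIM between a restored frame and its ground truth,
# ignoring a 1-pixel rim at the border, as described above.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(result, gt, rim=1):
    # result, gt: uint8 arrays of shape (H, W, 3)
    result = result[rim:-rim, rim:-rim]
    gt = gt[rim:-rim, rim:-rim]
    psnr = peak_signal_noise_ratio(gt, result, data_range=255)
    ssim = structural_similarity(gt, result, data_range=255, channel_axis=2)
    return psnr, ssim
```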
3. Challenge Results
From 124 and 104 registered participants for the compe-
titions, 14 teams entered in the final phase and submitted
results, codes, executables, and factsheets. Table
1 reports
the final scoring results of the challenge and Table
2 shows
the runtimes and the major details for each entry as pro-
vided by the authors in their factsheets. Section
4 describes
the method of each team briefly while in the Appendix
A
are the team members and affiliations.
Use of temporal information All the proposed methods use end-to-end deep learning and employ GPU(s) for both training and testing. Interestingly, in contrast to recent RNN-based video super-resolution methods, most teams (HelloVSR, UIUC-IFP, SuperRior, CyberverseSanDiego, XJTU-IAIR, BMIPL_UNIST, IPCV_IITM, Lucky Bird, mvgl) aggregated several video frames in the channel dimension and let a CNN learn the temporal relation needed to restore the target frame (a minimal sketch of this aggregation strategy is given below). None of the submitted methods employed external optical flow estimation or warping. TTI used a recurrent model inspired by DBPN [6]. CristianoRonaldo used a single image super-resolution method.
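The sketch below illustrates this channel-wise aggregation in a generic form; it is not any team's actual architecture, and the layer sizes are arbitrary placeholders.

```python
# Minimal sketch: 2N+1 consecutive LR frames are concatenated along the channel
# axis and a single CNN predicts the HR center frame via sub-pixel upsampling.
import torch
import torch.nn as nn

class NaiveMultiFrameSR(nn.Module):
    def __init__(self, num_frames=5, scale=4, feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 * num_frames, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, 3 * scale * scale, 3, padding=1),
        )
        self.upsample = nn.PixelShuffle(scale)  # sub-pixel (pixel-shuffle) upsampling

    def forward(self, frames):                  # frames: (B, 2N+1, 3, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b, t * c, h, w)      # concatenate frames along channels
        return self.upsample(self.body(x))      # (B, 3, scale*H, scale*W)
```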
Restoration fidelity HelloVSR, UIUC-IFP, and SuperRior are the best-scoring teams. HelloVSR is the winner of the NTIRE 2019 Video Super-Resolution Challenge: it achieves 31.79 dB for Track 1 and 30.17 dB for Track 2, improving by +5.31 dB and +6.12 dB over the bicubic baseline, respectively. The HelloVSR team achieves the best results in both competition tracks. Their solution shows consistent performance across the tracks and is also effective in the NTIRE 2019 Video Deblurring Challenge [18].
Runtime / efficiency In Fig. 2 and 3, we plot the running time per frame versus the achieved PSNR for both tracks. UIUC-IFP's solution showed a good trade-off between restoration quality in terms of PSNR and running time: it runs in 0.98 s per frame for both tracks on a Tesla V100, whereas most other methods take more than 1 second per frame, at a gap of 0.71 dB to HelloVSR's method in Track 2. Lucky Bird team's method was the fastest, taking only 0.013 seconds to process a frame.
Figure 2: Runtime vs. performance for Track 1: Clean (running time per frame in seconds, log scale, vs. PSNR in dB, for the HelloVSR, SuperRior, CyberverseSanDiego, TTI, NERCMS, UIUC-IFP, BMIPL_UNIST, IPCV_IITM, Lucky Bird, mvgl, and Team_India entries).
Figure 3: Runtime vs. performance for Track 2: Blur (running time per frame in seconds, log scale, vs. PSNR in dB, for the HelloVSR, CyberverseSanDiego, TTI, NERCMS, UIUC-IFP, XJTU-IAIR, BMIPL_UNIST, IPCV_IITM, and CristianoRonaldo entries).
Team                 Author               Track 1: Clean           Track 2: Blur
                                          PSNR        SSIM         PSNR        SSIM
HelloVSR             xixihaha             31.79 (1)   0.8962       30.17 (1)   0.8647
UIUC-IFP             fyc0624              30.81 (6)   0.8748       29.46 (2)   0.8430
SuperRior            lchkou               31.13 (2)   0.8811       -           -
CyberverseSanDiego   CyberverseSanDiego   31.00 (3)   0.8822       27.71 (7)   0.8067
TTI                  iim lab              30.97 (4)   0.8804       28.92 (4)   0.8333
NERCMS               Mrobot0              30.91 (5)   0.8782       28.98 (3)   0.8307
XJTU-IAIR            Hang                 -           -            28.86 (5)   0.8301
BMIPL_UNIST          UNIST BMIPL          30.43 (7)   0.8666       28.68 (6)   0.8252
IPCV_IITM            kuldeeppurohit3      29.99 (8)   0.8570       26.39 (9)   0.7699
Lucky Bird           NEU SMILE Lab        29.39 (9)   0.8419       -           -
mvgl                 akinyilmaz           28.81 (10)  0.8249       -           -
Team_India           Manoj                28.81 (10)  0.8241       -           -
withdrawn team       -                    28.54 (11)  0.8179       26.54 (8)   0.7587
CristianoRonaldo     ChristianoRonaldo    -           -            26.34 (10)  0.7549
Bicubic baseline     -                    26.48       0.7799       24.05       0.6809

Table 1: NTIRE 2019 Video Super-Resolution Challenge results on the REDS test data (PSNR in dB; parenthesized numbers are ranks). The HelloVSR team is the winner of the challenge, with consistent performance in both tracks.

Team                 Runtime (s)            Platform             GPU (at runtime)   Ensemble / Fusion (at runtime)
                     Track 1    Track 2
HelloVSR             2.788      3.562       PyTorch              TITAN Xp           Flip (x4)
UIUC-IFP             0.980      0.980       PyTorch              Tesla V100         Flip/Rotation (x8)
SuperRior            120.000    -           PyTorch              Tesla V100         Flip/Rotation/Temporal flip (x16), adaptive model ensemble
CyberverseSanDiego   3.000      3.000       TensorFlow           RTX 2080 Ti        -
TTI                  1.390      1.390       PyTorch              TITAN X            -
NERCMS               6.020      6.020       PyTorch              GTX 1080 Ti        Flip/Rotation (x8)
XJTU-IAIR            -          13.000      PyTorch              GTX 1080 Ti        Flip/Rotation (x8)
BMIPL_UNIST          45.300     54.200      PyTorch              TITAN V            -
IPCV_IITM            3.300      4.600       PyTorch              TITAN X            Flip/Rotation (x8)
Lucky Bird           0.013      -           PyTorch              TITAN Xp           -
mvgl                 3.500      -           PyTorch              GTX 1080 Ti        -
Team_India           0.050      -           PyTorch/TensorFlow   Tesla V100         -
withdrawn team       398.000    398.000     -                    -                  -
CristianoRonaldo     -          0.600       TensorFlow           Tesla K80          -

Table 2: Reported runtimes per frame (in seconds) on the REDS test data and details from the factsheets.

Ensembles Many solutions used self-ensemble [30], which averages the results from flipped and rotated inputs at test time; a minimal sketch of a flip-only variant is given below. HelloVSR did not use rotation, in order to reduce computation. The SuperRior team focused on the fusion of multiple architectures: RDN [38], RCAN [37], and DUF [9] were modified to take channel-concatenated frames as input, and score maps estimated for each output were used to build a spatially adaptive model ensemble. In addition to spatial flips and rotations, they also adopted temporal flips of the inputs at test time as a further ensemble.
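A minimal sketch of the flip-only self-ensemble (four forward passes), assuming a generic PyTorch model that maps LR inputs to HR outputs; it is illustrative, not any team's actual test-time code.

```python
# Flip-only self-ensemble: average the model outputs obtained from
# horizontally/vertically flipped inputs, un-flipping each output first.
import torch

def flip_self_ensemble(model, x):
    # x: (B, C, H, W) or (B, T, C, H, W); flips act on the last two (spatial) dims.
    outputs = []
    for flip_h in (False, True):
        for flip_w in (False, True):
            dims = [d for d, f in zip((-2, -1), (flip_h, flip_w)) if f]
            xi = torch.flip(x, dims) if dims else x
            yi = model(xi)
            outputs.append(torch.flip(yi, dims) if dims else yi)
    return torch.stack(outputs).mean(dim=0)
```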
Train data The REDS dataset [16] has 24000 training frames, and all participants found this amount of data sufficient for training their models. Training data augmentation strategies [30], such as flips and rotations by 90 degrees, were employed by most participants (see the sketch below).
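As an illustration only (not any team's pipeline), a paired random flip/rotation augmentation might look like the following; `lr` and `hr` are assumed to be CHW tensors of the same scene at different resolutions.

```python
# Train-time augmentation: random horizontal/vertical flips and rotations by
# multiples of 90 degrees, applied identically to the LR input and HR target.
import random
import torch

def augment_pair(lr, hr):
    if random.random() < 0.5:
        lr, hr = torch.flip(lr, [-1]), torch.flip(hr, [-1])   # horizontal flip
    if random.random() < 0.5:
        lr, hr = torch.flip(lr, [-2]), torch.flip(hr, [-2])   # vertical flip
    k = random.randint(0, 3)                                   # rotation by k * 90 degrees
    return torch.rot90(lr, k, [-2, -1]), torch.rot90(hr, k, [-2, -1])
```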
Conclusions From the analysis of the presented results, we conclude that the proposed methods gauge the state-of-the-art performance in video super-resolution. The methods proposed by the best-ranking team (HelloVSR) exhibit consistent superiority in both tracks in terms of PSNR and SSIM.

Figure 4: HelloVSR team: the proposed EDVR framework. The input frames (t−1, t, t+1) pass through a PreDeblur module, the PCD alignment module, the TSA fusion module, and a reconstruction module, followed by upsampling; the upsampled input is added to the network output.
4. Challenge Methods and Teams
4.1. HelloVSR team
The HelloVSR team proposes the EDVR framework [31], which takes 2N + 1 low-resolution frames as input and generates a high-resolution output, as shown in Fig. 4. First, to alleviate the effect of blurry frames on alignment, a PreDeblur module pre-processes the blurry inputs before alignment (it is not included in the model for the clean SR track). Then, each neighboring frame is aligned to the reference frame by the PCD alignment module at the feature level. The TSA fusion module is used to fuse the aligned features effectively. The fused features then pass through a reconstruction module, which consists of several residual blocks [13] in EDVR and can easily be replaced by other advanced single-image SR modules [11, 38, 6, 37, 33]. The upsampling operation is performed at the end of the network to increase the spatial size. Finally, the high-resolution reference frame is obtained by adding the predicted image residual to a directly upsampled image [10]. Note that EDVR is a generic architecture that is also suitable for other video restoration tasks, such as deblurring.
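The final residual step can be written compactly as below; this is a generic sketch (bilinear upsampling is an assumption here, and the exact interpolation used in EDVR may differ), with `predicted_residual` standing in for the output of the reconstruction and upsampling branch.

```python
# Residual output: add the predicted residual to a directly upsampled LR frame.
import torch.nn.functional as F

def residual_output(lr_center, predicted_residual, scale=4):
    # lr_center: (B, 3, H, W); predicted_residual: (B, 3, scale*H, scale*W)
    base = F.interpolate(lr_center, scale_factor=scale,
                         mode='bilinear', align_corners=False)
    return base + predicted_residual
```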
To address the large and complex motions between frames, which are common in the REDS dataset, they propose a Pyramid, Cascading and Deformable convolution (PCD) alignment module. In this module, deformable convolutions [3, 26] are adopted to align frames at the feature level. They use a pyramid structure that first aligns features at lower scales with coarse estimations, and then propagates the offsets and aligned features to higher scales to facilitate precise motion compensation, similar to the notion adopted in optical flow estimation [8, 23]. Moreover, an additional deformable convolution is cascaded after the pyramidal alignment, which further improves the robustness of the alignment. An overview of the PCD module is shown in Fig. 5.
Since different frames and locations are not equally informative, due to imperfect alignment and imbalanced blur among frames, a Temporal and Spatial Attention (TSA) fusion module is designed to dynamically aggregate neighboring frames at the pixel level, as shown in Fig. 5. Temporal attention is introduced by computing the element-wise correlation between the reference frame and each neighboring frame in an embedding space. The correlation coefficients then weight each neighboring feature at each location, and the weighted features from all frames are convolved and fused together. After the fusion, they further apply spatial attention [35, 32, 37] to assign weights to each location in each channel, exploiting cross-channel and spatial information more effectively.
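The sketch below shows only the temporal-attention half of this idea in a simplified form (spatial attention and any pyramid refinement are omitted); the layer shapes are arbitrary placeholders, not EDVR's actual configuration.

```python
# Simplified temporal attention: per-pixel similarity between each aligned
# neighbor and the reference (dot product in an embedding space, squashed by
# a sigmoid) weights the neighbor before a fusion convolution.
import torch
import torch.nn as nn

class SimpleTemporalAttentionFusion(nn.Module):
    def __init__(self, feats=64, num_frames=5):
        super().__init__()
        self.embed_ref = nn.Conv2d(feats, feats, 3, padding=1)
        self.embed_nbr = nn.Conv2d(feats, feats, 3, padding=1)
        self.fuse = nn.Conv2d(feats * num_frames, feats, 1)

    def forward(self, aligned_feats):            # (B, T, C, H, W), center = reference
        b, t, c, h, w = aligned_feats.shape
        ref = self.embed_ref(aligned_feats[:, t // 2])
        weighted = []
        for i in range(t):
            emb = self.embed_nbr(aligned_feats[:, i])
            corr = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))  # (B, 1, H, W)
            weighted.append(aligned_feats[:, i] * corr)
        return self.fuse(torch.cat(weighted, dim=1))
```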
Figure 5: PCD alignment module and TSA fusion module in EDVR. The PCD module predicts offsets and applies deformable convolutions (DConv) over a three-level feature pyramid (L1-L3) with a cascaded refinement to produce aligned features; the TSA module embeds the aligned features, computes dot-product similarities with the reference, applies sigmoid weighting and element-wise multiplication, and fuses the result before upsampling.
They also use a two-stage strategy to boost performance
further. Specifically, a similar but shallower EDVR network
is cascaded to refine the output frames of the first stage. The
cascaded network can further remove the severe motion blur
that cannot be handled by the preceding model and alleviate
the inconsistency among output frames.