NTIRE 2019 Challenge on Video Super-Resolution: Methods and Results
Seungjun Nah Radu Timofte Shuhang Gu Sungyong Baik Seokil Hong
Gyeongsik Moon Sanghyun Son Kyoung Mu Lee Xintao Wang Kelvin C.K. Chan
Ke Yu Chao Dong Chen Change Loy Yuchen Fan Jiahui Yu Ding Liu
Thomas S. Huang Xiao Liu Chao Li Dongliang He Yukang Ding Shilei Wen
Fatih Porikli Ratheesh Kalarot Muhammad Haris Greg Shakhnarovich
Norimichi Ukita Peng Yi Zhongyuan Wang Kui Jiang Junjun Jiang Jiayi Ma
Hang Dong Xinyi Zhang Zhe Hu Kwanyoung Kim Dong Un Kang
Se Young Chun Kuldeep Purohit A. N. Rajagopalan Yapeng Tian Yulun Zhang
Yun Fu Chenliang Xu A. Murat Tekalp M. Akin Yilmaz Cansu Korkmaz
Manoj Sharma Megh Makwana Anuj Badhwar Ajay Pratap Singh
Avinash Upadhyay Rudrabha Mukhopadhyay Ankit Shukla Dheeraj Khanna
A. S. Mandal Santanu Chaudhury Si Miao Yongxin Zhu Xiao Huo
Abstract
This paper reviews the first NTIRE challenge on video super-resolution (restoration of rich details in low-resolution video frames) with a focus on the proposed solutions and results. A new REalistic and Dynamic Scenes (REDS) dataset was employed. The challenge was divided into 2 tracks: Track 1 employed the standard bicubic downscaling setup, while Track 2 added realistic dynamic motion blurs. The two tracks had 124 and 104 registered participants, respectively, and a total of 14 teams competed in the final testing phase. Their results gauge the state-of-the-art in video super-resolution.
1. Introduction
Example-based video super-resolution (SR) aims at restoring rich details (high frequencies) in low-resolution video frames based on a set of prior examples of low-resolution and high-resolution videos. The loss of content can be caused by various factors, such as quantization error, limitations of the capturing camera's sensor, the presence of defocus, motion blur, or other degrading operators, and the use of downsampling operators to reduce the video resolution for storage purposes. Just like conventional single image SR, video SR is an ill-posed problem because, for each low-resolution (LR) frame, the space of corresponding high-resolution (HR) frames can be very large.

S. Nah (seungjun.nah@gmail.com, Seoul National University), R. Timofte, S. Gu, S. Baik, S. Hong, G. Moon, S. Son, and K. M. Lee are the NTIRE 2019 challenge organizers, while the other authors participated in the challenge. Appendix A contains the authors' teams and affiliations. NTIRE webpage: http://www.vision.ee.ethz.ch/ntire19
In recent years, a significant amount of literature has focused on video super-resolution research. The performance of the top methods has continuously improved [28, 22, 15, 2, 24, 9, 21, 35], showing that the field is maturing. However, compared with single image super-resolution [1, 27, 29], video super-resolution lacks standardized benchmarks that allow for an assessment based on identical datasets and criteria. Recently, most video SR publications have used the Vid4 [14] dataset for evaluation and comparison. The Vid4 dataset contains 4 sequences, each consisting of 30 to 45 frames, with a frame resolution of 480 × 704 or 576 × 704. In some works, other datasets such as YT10 [21], Val4 [9], SPMCS [24], and CDVL [2] have also been proposed for evaluation; however, they are not yet widely used for comparison. While these video super-resolution datasets have brought substantial improvements to the domain, they still have significant shortcomings: (1) they lack a standard training set, and recent video SR works are trained on sets chosen rather arbitrarily; (2) the test sets are small and of low resolution (often below HD); (3) the downsampling methods used for LR data generation (Gaussian blurs and bicubic kernels) are mixed and not standardized, and they are not consistent with the single image SR literature, where bicubic interpolation is usually employed.
The NTIRE 2019 video super-resolution challenge is a step forward in benchmarking and training of video super-resolution algorithms. It uses the REalistic and Dynamic Scenes (REDS) dataset [16], consisting of 30000 reference frames with two types of degradation: standard bicubic downscaling and additional dynamic motion blurs that are locally variant. Fig. 1 shows example images from the REDS dataset. The REDS dataset is introduced in [16] along with a study of the challenge results. In the following, we describe the challenge, present and discuss the results, and describe the proposed methods.

Figure 1: Visualization of a video frame and its corresponding low-resolution frames from the REDS dataset. (a) Sharp HR; (b) LR (bicubic ×4↓); (c) LR (bicubic ×4↓ + blur).
2. NTIRE 2019 Challenge
The objectives of the NTIRE 2019 challenge on video super-resolution are: (i) to gauge and push the state-of-the-art in video super-resolution; (ii) to compare different solutions; (iii) to promote a novel large dataset (REDS); and (iv) to promote more challenging video super-resolution settings.
2.1. REDS Dataset
As a step forward from previously proposed super-resolution and deblurring datasets, a novel dataset, REDS [16], is promoted. It consists of 300 video sequences, each containing 100 frames of 720 × 1280 resolution. 240 sequences are used for training, 30 for validation, and the remaining 30 for testing. The frames are of high quality in terms of reference sharpness, diverse scenes and locations, and realistic approximations of motion blur. REDS covers a large diversity of content: people, handmade objects, and environments (cities).
All the videos used to create the REDS dataset were manually recorded with a GoPro HERO6 Black camera. They were originally recorded at 1920 × 1080 resolution and 120 fps. We calibrated the camera response function using [20] with regularization. The frames are then interpolated [19] to virtually increase the frame rate up to 1920 fps, so that averaged frames exhibit smooth and realistic blurs without step artifacts. The virtual frames are then averaged in the signal space to mimic the camera imaging pipeline [17]. To suppress noise and compression artifacts, we downscale the reference sharp frames and the corresponding blurry frames to 720 × 1280 resolution. This preprocessing also increases the effective pixel density. The resulting blurry video frames resemble 24 fps video captured with a duty cycle of τ = 0.8. Finally, the sharp and blurry frames are ×4 downscaled with the bicubic kernel to generate the low-resolution videos.
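As a rough illustration of this degradation pipeline, the sketch below averages interpolated high-frame-rate frames in an approximately linear signal space and then applies bicubic ×4 downscaling. It is only a simplified approximation under our own assumptions: a plain gamma curve stands in for the calibrated camera response function, frame interpolation is assumed to have been done beforehand, and the 1080p-to-720p resizing step is omitted.

```python
# Minimal sketch of a REDS-style blur synthesis, under simplifying assumptions.
# `virtual_frames` is a list of already-interpolated high-fps HR frames
# (H x W x 3, float32 in [0, 1]); a gamma of 2.2 approximates the CRF.
import numpy as np
import cv2

GAMMA = 2.2  # assumption: simple gamma in place of the calibrated CRF

def synthesize_blurry_lr(virtual_frames, scale=4):
    # Map frames to an (approximately) linear signal space before averaging,
    # so that the temporal average mimics sensor integration over the exposure.
    linear = [np.power(f, GAMMA) for f in virtual_frames]
    blurry_linear = np.mean(linear, axis=0)
    # Back to the display (sRGB-like) domain.
    blurry = np.power(blurry_linear, 1.0 / GAMMA)
    # Bicubic x4 downscaling to produce the low-resolution blurry frame.
    h, w = blurry.shape[:2]
    lr = cv2.resize(blurry, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)
    return lr
```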
2.2. Tracks and competitions
Track 1: Clean facilitates easy deployment of many video super-resolution methods. It assumes that the degradation comes only from downscaling. We generate each LR frame from the HR REDS frame by using the MATLAB function imresize with bicubic interpolation and a downscaling factor of 4.
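For reference, a rough Python analogue of this degradation is sketched below. It is not the challenge's official generation code, and exact pixel values from MATLAB's imresize (which applies antialiasing when shrinking) are not guaranteed to match.

```python
# Rough Python analogue of the Track 1 degradation (bicubic x4 downscaling).
from PIL import Image

def make_clean_lr(hr_path, lr_path, scale=4):
    hr = Image.open(hr_path)                      # e.g. a 1280 x 720 HR REDS frame
    lr_size = (hr.width // scale, hr.height // scale)
    hr.resize(lr_size, resample=Image.BICUBIC).save(lr_path)
```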
Track 2: Blur goes one step further and also considers motion blur from fast-moving objects or shaking cameras. No Gaussian or other type of noise is added to the frames; only motion blur from dynamic scenes is incorporated. We obtain each blurry LR frame following the procedure described in Section 2.1; more details are provided in [16]. The blur is locally variant, and no further information such as blur strength or kernel shape was provided. Each ground-truth HR RGB frame from REDS is bicubically downscaled to the corresponding LR frame and used for training, validation, or testing of the methods.
Competitions Both video super-resolution challenge tracks are hosted as CodaLab competitions. The CodaLab platform was used for all of the NTIRE 2019 challenge competitions. Each participant is required to register for the CodaLab challenge tracks to access the data and submit their super-resolved results.
Challenge phases (1) Development (training) phase: the participants got both the LR and HR training video frames and the LR frames of the validation set. The participants had the opportunity to test their solutions on the LR validation frames and to receive feedback by uploading their results to the server. Due to the large scale of the validation dataset, every 10th frame was used in the evaluation. A validation leaderboard is available. (2) Final evaluation (test) phase: the participants got the sharp HR validation frames together with the LR test frames. They had to submit both the super-resolved frames and a description of their methods before the challenge deadline. One week later, the final results were made available to the participants. The final results reflect the performance on every frame of the test set.
Evaluation protocol The Peak Signal-to-Noise Ratio (PSNR) measured in decibels (dB) and the Structural Similarity Index (SSIM) [34], computed between a result frame and the ground truth, are the quantitative measures. The higher the scores, the better the restoration fidelity to the ground-truth frame. Because of boundary effects that may appear with particular methods, we ignore a rim of 1 pixel during the evaluation.
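A minimal sketch of this protocol, assuming 8-bit RGB frames as NumPy arrays and the scikit-image metric functions (keyword names vary slightly across scikit-image versions), could look as follows; it is illustrative rather than the official scoring script.

```python
# PSNR and SSIM between a restored frame and its ground truth,
# ignoring a 1-pixel rim at the border, as described above.
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_frame(result, gt, rim=1):
    # result, gt: uint8 arrays of shape (H, W, 3)
    result = result[rim:-rim, rim:-rim]
    gt = gt[rim:-rim, rim:-rim]
    psnr = peak_signal_noise_ratio(gt, result, data_range=255)
    ssim = structural_similarity(gt, result, data_range=255, channel_axis=2)
    return psnr, ssim
```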
3. Challenge Results
From 124 and 104 registered participants for the compe-
titions, 14 teams entered in the final phase and submitted
results, codes, executables, and factsheets. Table
1 reports
the final scoring results of the challenge and Table
2 shows
the runtimes and the major details for each entry as pro-
vided by the authors in their factsheets. Section
4 describes
the method of each team briefly while in the Appendix
A
are the team members and affiliations.
Use of temporal information All the proposed methods use end-to-end deep learning and employ GPU(s) for both training and testing. Interestingly, in contrast to recent RNN-based video super-resolution methods, most teams (HelloVSR, UIUC-IFP, SuperRior, CyberverseSanDiego, XJTU-IAIR, BMIPL_UNIST, IPCV_IITM, Lucky Bird, mvgl) aggregated several video frames in the channel dimension and let a CNN learn the temporal relation needed to restore the target frame (a minimal sketch of this aggregation strategy is given below). None of the submitted methods employed external optical flow estimation or warping. TTI used a recurrent model inspired by DBPN [6]. CristianoRonaldo used a single image super-resolution method.
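The sketch below illustrates this channel-wise aggregation in a generic form; it is not any team's actual architecture, and the layer sizes are arbitrary placeholders.

```python
# Minimal sketch: 2N+1 consecutive LR frames are concatenated along the channel
# axis and a single CNN predicts the HR center frame via sub-pixel upsampling.
import torch
import torch.nn as nn

class NaiveMultiFrameSR(nn.Module):
    def __init__(self, num_frames=5, scale=4, feats=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 * num_frames, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feats, 3 * scale * scale, 3, padding=1),
        )
        self.upsample = nn.PixelShuffle(scale)  # sub-pixel (pixel-shuffle) upsampling

    def forward(self, frames):                  # frames: (B, 2N+1, 3, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b, t * c, h, w)      # concatenate frames along channels
        return self.upsample(self.body(x))      # (B, 3, scale*H, scale*W)
```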
Restoration fidelity HelloVSR, UIUC-IFP, and SuperRior are the best-scoring teams. HelloVSR is the winner of the NTIRE 2019 Video Super-Resolution Challenge: it achieves 31.79 dB for Track 1 and 30.17 dB for Track 2, improving by +5.31 dB and +6.12 dB over the bicubic baseline, respectively. The HelloVSR team achieves the best results in both competition tracks. Their solution shows consistent performance across the tracks and is also effective in the NTIRE 2019 Video Deblurring Challenge [18].
Runtime / efficiency In Fig. 2 and 3, we plot the running time per frame versus the achieved PSNR for both tracks. UIUC-IFP's solution showed a good trade-off between restoration quality in terms of PSNR and running time: it runs in 0.98 s per frame for both tracks on a Tesla V100, whereas most other methods take more than 1 second per frame, at a gap of 0.71 dB to HelloVSR's method in Track 2. Lucky Bird team's method was the fastest, taking only 0.013 seconds to process a frame.
Figure 2: Runtime vs. performance for Track 1: Clean (running time per frame in seconds, log scale, vs. PSNR in dB, for the HelloVSR, SuperRior, CyberverseSanDiego, TTI, NERCMS, UIUC-IFP, BMIPL_UNIST, IPCV_IITM, Lucky Bird, mvgl, and Team_India entries).
Figure 3: Runtime vs. performance for Track 2: Blur (running time per frame in seconds, log scale, vs. PSNR in dB, for the HelloVSR, CyberverseSanDiego, TTI, NERCMS, UIUC-IFP, XJTU-IAIR, BMIPL_UNIST, IPCV_IITM, and CristianoRonaldo entries).
Team                 Author               Track 1: Clean           Track 2: Blur
                                          PSNR        SSIM         PSNR        SSIM
HelloVSR             xixihaha             31.79 (1)   0.8962       30.17 (1)   0.8647
UIUC-IFP             fyc0624              30.81 (6)   0.8748       29.46 (2)   0.8430
SuperRior            lchkou               31.13 (2)   0.8811       -           -
CyberverseSanDiego   CyberverseSanDiego   31.00 (3)   0.8822       27.71 (7)   0.8067
TTI                  iim lab              30.97 (4)   0.8804       28.92 (4)   0.8333
NERCMS               Mrobot0              30.91 (5)   0.8782       28.98 (3)   0.8307
XJTU-IAIR            Hang                 -           -            28.86 (5)   0.8301
BMIPL_UNIST          UNIST BMIPL          30.43 (7)   0.8666       28.68 (6)   0.8252
IPCV_IITM            kuldeeppurohit3      29.99 (8)   0.8570       26.39 (9)   0.7699
Lucky Bird           NEU SMILE Lab        29.39 (9)   0.8419       -           -
mvgl                 akinyilmaz           28.81 (10)  0.8249       -           -
Team_India           Manoj                28.81 (10)  0.8241       -           -
withdrawn team       -                    28.54 (11)  0.8179       26.54 (8)   0.7587
CristianoRonaldo     ChristianoRonaldo    -           -            26.34 (10)  0.7549
Bicubic baseline     -                    26.48       0.7799       24.05       0.6809

Table 1: NTIRE 2019 Video Super-Resolution Challenge results on the REDS test data (PSNR in dB; parenthesized numbers are ranks). The HelloVSR team is the winner of the challenge, with consistent performance in both tracks.

Team                 Runtime (s)            Platform             GPU (at runtime)   Ensemble / Fusion (at runtime)
                     Track 1    Track 2
HelloVSR             2.788      3.562       PyTorch              TITAN Xp           Flip (x4)
UIUC-IFP             0.980      0.980       PyTorch              Tesla V100         Flip/Rotation (x8)
SuperRior            120.000    -           PyTorch              Tesla V100         Flip/Rotation/Temporal flip (x16), adaptive model ensemble
CyberverseSanDiego   3.000      3.000       TensorFlow           RTX 2080 Ti        -
TTI                  1.390      1.390       PyTorch              TITAN X            -
NERCMS               6.020      6.020       PyTorch              GTX 1080 Ti        Flip/Rotation (x8)
XJTU-IAIR            -          13.000      PyTorch              GTX 1080 Ti        Flip/Rotation (x8)
BMIPL_UNIST          45.300     54.200      PyTorch              TITAN V            -
IPCV_IITM            3.300      4.600       PyTorch              TITAN X            Flip/Rotation (x8)
Lucky Bird           0.013      -           PyTorch              TITAN Xp           -
mvgl                 3.500      -           PyTorch              GTX 1080 Ti        -
Team_India           0.050      -           PyTorch/TensorFlow   Tesla V100         -
withdrawn team       398.000    398.000     -                    -                  -
CristianoRonaldo     -          0.600       TensorFlow           Tesla K80          -

Table 2: Reported runtimes per frame (in seconds) on the REDS test data and details from the factsheets.

Ensembles Many solutions used self-ensemble [30], which averages the results from flipped and rotated inputs at test time; a minimal sketch of a flip-only variant is given below. HelloVSR did not use rotation, in order to reduce computation. The SuperRior team focused on the fusion of multiple architectures: RDN [38], RCAN [37], and DUF [9] were modified to take channel-concatenated frames as input, and score maps estimated for each output were used to build a spatially adaptive model ensemble. In addition to spatial flips and rotations, they also adopted temporal flips of the inputs at test time as a further ensemble.
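A minimal sketch of the flip-only self-ensemble (four forward passes), assuming a generic PyTorch model that maps LR inputs to HR outputs; it is illustrative, not any team's actual test-time code.

```python
# Flip-only self-ensemble: average the model outputs obtained from
# horizontally/vertically flipped inputs, un-flipping each output first.
import torch

def flip_self_ensemble(model, x):
    # x: (B, C, H, W) or (B, T, C, H, W); flips act on the last two (spatial) dims.
    outputs = []
    for flip_h in (False, True):
        for flip_w in (False, True):
            dims = [d for d, f in zip((-2, -1), (flip_h, flip_w)) if f]
            xi = torch.flip(x, dims) if dims else x
            yi = model(xi)
            outputs.append(torch.flip(yi, dims) if dims else yi)
    return torch.stack(outputs).mean(dim=0)
```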
Train data The REDS dataset [16] has 24000 training frames, and all participants found this amount of data sufficient for training their models. Training data augmentation strategies [30], such as flips and rotations by 90 degrees, were employed by most participants (see the sketch below).
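As an illustration only (not any team's pipeline), a paired random flip/rotation augmentation might look like the following; `lr` and `hr` are assumed to be CHW tensors of the same scene at different resolutions.

```python
# Train-time augmentation: random horizontal/vertical flips and rotations by
# multiples of 90 degrees, applied identically to the LR input and HR target.
import random
import torch

def augment_pair(lr, hr):
    if random.random() < 0.5:
        lr, hr = torch.flip(lr, [-1]), torch.flip(hr, [-1])   # horizontal flip
    if random.random() < 0.5:
        lr, hr = torch.flip(lr, [-2]), torch.flip(hr, [-2])   # vertical flip
    k = random.randint(0, 3)                                   # rotation by k * 90 degrees
    return torch.rot90(lr, k, [-2, -1]), torch.rot90(hr, k, [-2, -1])
```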
Conclusions From the analysis of the presented results, we conclude that the proposed methods gauge the state-of-the-art performance in video super-resolution. The methods proposed by the best-ranking team (HelloVSR) exhibit consistent superiority in both tracks in terms of PSNR and SSIM.

Figure 4: HelloVSR team: the proposed EDVR framework. The input frames (t−1, t, t+1) pass through a PreDeblur module, the PCD alignment module, the TSA fusion module, and a reconstruction module, followed by upsampling; the upsampled input is added to the network output.
4. Challenge Methods and Teams
4.1. HelloVSR team
The HelloVSR team proposes the EDVR framework [31], which takes 2N + 1 low-resolution frames as input and generates a high-resolution output, as shown in Fig. 4. First, to alleviate the effect of blurry frames on alignment, a PreDeblur module pre-processes the blurry inputs before alignment (it is not included in the model for the clean SR track). Then, each neighboring frame is aligned to the reference frame by the PCD alignment module at the feature level. The TSA fusion module is used to fuse the aligned features effectively. The fused features then pass through a reconstruction module, which consists of several residual blocks [13] in EDVR and can easily be replaced by other advanced single-image SR modules [11, 38, 6, 37, 33]. The upsampling operation is performed at the end of the network to increase the spatial size. Finally, the high-resolution reference frame is obtained by adding the predicted image residual to a directly upsampled image [10]. Note that EDVR is a generic architecture that is also suitable for other video restoration tasks, such as deblurring.
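The final residual step can be written compactly as below; this is a generic sketch (bilinear upsampling is an assumption here, and the exact interpolation used in EDVR may differ), with `predicted_residual` standing in for the output of the reconstruction and upsampling branch.

```python
# Residual output: add the predicted residual to a directly upsampled LR frame.
import torch.nn.functional as F

def residual_output(lr_center, predicted_residual, scale=4):
    # lr_center: (B, 3, H, W); predicted_residual: (B, 3, scale*H, scale*W)
    base = F.interpolate(lr_center, scale_factor=scale,
                         mode='bilinear', align_corners=False)
    return base + predicted_residual
```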
To address the large and complex motions between frames, which are common in the REDS dataset, they propose a Pyramid, Cascading and Deformable convolution (PCD) alignment module. In this module, deformable convolutions [3, 26] are adopted to align frames at the feature level. They use a pyramid structure that first aligns features at lower scales with coarse estimations, and then propagates the offsets and aligned features to higher scales to facilitate precise motion compensation, similar to the notion adopted in optical flow estimation [8, 23]. Moreover, an additional deformable convolution is cascaded after the pyramidal alignment, which further improves the robustness of the alignment. An overview of the PCD module is shown in Fig. 5.
Since different frames and locations are not equally informative, due to imperfect alignment and imbalanced blur among frames, a Temporal and Spatial Attention (TSA) fusion module is designed to dynamically aggregate neighboring frames at the pixel level, as shown in Fig. 5. Temporal attention is introduced by computing the element-wise correlation between the reference frame and each neighboring frame in an embedding space. The correlation coefficients then weight each neighboring feature at each location, and the weighted features from all frames are convolved and fused together. After the fusion, they further apply spatial attention [35, 32, 37] to assign weights to each location in each channel, exploiting cross-channel and spatial information more effectively.
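The sketch below shows only the temporal-attention half of this idea in a simplified form (spatial attention and any pyramid refinement are omitted); the layer shapes are arbitrary placeholders, not EDVR's actual configuration.

```python
# Simplified temporal attention: per-pixel similarity between each aligned
# neighbor and the reference (dot product in an embedding space, squashed by
# a sigmoid) weights the neighbor before a fusion convolution.
import torch
import torch.nn as nn

class SimpleTemporalAttentionFusion(nn.Module):
    def __init__(self, feats=64, num_frames=5):
        super().__init__()
        self.embed_ref = nn.Conv2d(feats, feats, 3, padding=1)
        self.embed_nbr = nn.Conv2d(feats, feats, 3, padding=1)
        self.fuse = nn.Conv2d(feats * num_frames, feats, 1)

    def forward(self, aligned_feats):            # (B, T, C, H, W), center = reference
        b, t, c, h, w = aligned_feats.shape
        ref = self.embed_ref(aligned_feats[:, t // 2])
        weighted = []
        for i in range(t):
            emb = self.embed_nbr(aligned_feats[:, i])
            corr = torch.sigmoid((emb * ref).sum(dim=1, keepdim=True))  # (B, 1, H, W)
            weighted.append(aligned_feats[:, i] * corr)
        return self.fuse(torch.cat(weighted, dim=1))
```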
Figure 5: PCD alignment module and TSA fusion module in EDVR. The PCD module predicts offsets and applies deformable convolutions (DConv) over a three-level feature pyramid (L1-L3) with a cascaded refinement to produce aligned features; the TSA module embeds the aligned features, computes dot-product similarities with the reference, applies sigmoid weighting and element-wise multiplication, and fuses the result before upsampling.
They also use a two-stage strategy to boost performance
further. Specifically, a similar but shallower EDVR network
is cascaded to refine the output frames of the first stage. The
cascaded network can further remove the severe motion blur
that cannot be handled by the preceding model and alleviate
the inconsistency among output frames.