In recent years, image and video coding technologies have advanced by leaps and bounds. However, due to the popularization of image and video acquisition devices, the growth rate of image and video data far outpaces the improvement of the compression ratio. In particular, it has been widely recognized that pursuing further coding performance improvement within the traditional hybrid coding framework is increasingly challenging. The deep convolutional neural network, which has driven the resurgence of neural networks in recent years and has achieved great success in both artificial intelligence and signal processing, also provides a novel and promising solution for image and video compression. In this paper, we provide a systematic, comprehensive and up-to-date review of neural network-based image and video compression techniques. The evolution and development of neural network-based compression methodologies are introduced for images and video respectively. More specifically, the cutting-edge video coding techniques that leverage deep learning and the HEVC framework are presented and discussed, which substantially advance state-of-the-art video coding performance. Moreover, the end-to-end image and video coding frameworks based on neural networks are also reviewed, revealing interesting explorations of next-generation image and video coding frameworks/standards. The most significant research works on image and video coding related topics using neural networks are highlighted, and future trends are also envisioned. In particular, the joint compression of semantic and visual information is tentatively explored to formulate a high-efficiency signal representation structure for both human vision and machine vision, which are the two dominant signal receptors in the age of artificial intelligence.

Image and Video Compression With Neural Networks: A Review

Nowadays, 360° video/image has become increasingly popular and has drawn great attention. The spherical viewing range of 360° video/image entails a huge amount of data, which poses challenges to 360° video/image processing in terms of storage, transmission, etc. Accordingly, recent years have witnessed an explosion of works on 360° video/image processing. In this article, we review the state-of-the-art works on 360° video/image processing from the aspects of perception, assessment and compression. First, this article reviews both datasets and visual attention modelling approaches for 360° video/image. Second, we survey the related works on both subjective and objective visual quality assessment (VQA) of 360° video/image. Third, we overview the compression approaches for 360° video/image, which utilize either the spherical characteristics or visual attention models. Finally, we summarize this overview article and outline future research trends in 360° video/image processing.

State-of-the-Art in 360° Video/Image Processing: Perception, Assessment and Compression

This paper proposes a deep learning method for intra prediction. Different from traditional methods that rely on fixed rules, we propose using a fully connected network to learn an end-to-end mapping from neighboring reconstructed pixels to the current block. In the proposed method, the network is fed with multiple reference lines. Compared with traditional single line-based methods, more contextual information of the current block is utilized. For this reason, the proposed network has the potential to generate a better prediction. In addition, the proposed network generalizes well across different bitrate settings: a model trained at one specified bitrate setting also works well at other bitrate settings. Experimental results demonstrate the effectiveness of the proposed method. When compared with the High Efficiency Video Coding (HEVC) reference software HM-16.9, our network achieves an average bitrate saving of 3.4%. In particular, the average bitrate saving on 4K sequences is 4.5%, with a maximum of 7.4%.
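The mapping described in this abstract, from several lines of reconstructed neighbors to a predicted block, can be illustrated with a minimal sketch. It is not the authors' network: the reference layout (only rows above and columns to the left), the layer sizes, the random untrained weights, and the helper names `gather_reference`, `dense`, `predict_block` and `random_layers` are all assumptions chosen for illustration.

```python
import random


def gather_reference(recon, bx, by, n, lines):
    """Flatten `lines` rows above and `lines` columns left of the n x n block at (bx, by).

    Assumed layout: the rows above also cover the top-left corner region.
    The result has lines * (lines + 2 * n) samples.
    """
    ref = []
    for y in range(by - lines, by):          # rows above the block
        ref.extend(recon[y][bx - lines:bx + n])
    for y in range(by, by + n):              # columns to the left of the block
        ref.extend(recon[y][bx - lines:bx])
    return ref


def dense(x, weights, bias, relu=True):
    """One fully connected layer: out = W x + b, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def predict_block(recon, bx, by, n, lines, layers):
    """Run the reference vector through the layers and reshape to n x n."""
    x = gather_reference(recon, bx, by, n, lines)
    for i, (weights, bias) in enumerate(layers):
        x = dense(x, weights, bias, relu=(i < len(layers) - 1))  # no ReLU on output
    return [x[r * n:(r + 1) * n] for r in range(n)]


def random_layers(sizes, seed=0):
    """Untrained placeholder weights; a real model would be learned from data."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.1, 0.1) for _ in range(i)] for _ in range(o)], [0.0] * o)
            for i, o in zip(sizes, sizes[1:])]


# Example: a 4x4 block at (4, 4) in an 8x8 reconstructed area, two reference lines.
recon = [[128] * 8 for _ in range(8)]
layers = random_layers([20, 32, 16])         # 20 inputs = 2 * (2 + 2 * 4)
pred = predict_block(recon, 4, 4, 4, 2, layers)
```

With `lines=1` the input reduces to the single reference line used by conventional intra prediction, which makes the extra context supplied by multiple lines easy to see.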

Fully Connected Network-Based Intra Prediction for Image Coding

One key challenge to learning-based video compression is that motion predictive coding, a very effective tool for video compression, can hardly be trained into a neural network. In this paper, we propose the concept of PixelMotionCNN (PMCNN), which includes motion extension and hybrid prediction networks. PMCNN can model spatiotemporal coherence to effectively perform predictive coding inside the learning network. On the basis of PMCNN, we further explore a learning-based framework for video compression with additional components of iterative analysis/synthesis and binarization. The experimental results demonstrate the effectiveness of the proposed scheme. Although entropy coding and complex configurations are not employed in this paper, we still demonstrate superior performance compared with MPEG-2 and achieve results comparable to the H.264 codec. The proposed learning-based scheme provides a possible new direction to further improve the compression efficiency and functionalities of future video coding.

Learning for Video Compression

We study the dual problem of image super-resolution (SR), which we term image compact-resolution (CR). In contrast to image SR, which hallucinates a visually plausible high-resolution image given a low-resolution input, image CR provides a low-resolution version of a high-resolution image, such that the low-resolution version is both visually pleasing and as informative as possible compared to the high-resolution image. We propose a convolutional neural network (CNN) for image CR, namely, CNN-CR, inspired by the great success of CNN for image SR. Specifically, we translate the requirements of image CR into operable optimization targets for training CNN-CR: the visual quality of the compact-resolved image is ensured by constraining its difference from a naively downsampled version, and the information loss of image CR is measured by upsampling/super-resolving the compact-resolved image and comparing that to the original image. Accordingly, CNN-CR can be trained either separately or jointly with a CNN for image SR. We explore different training strategies as well as different network structures for CNN-CR. Our experimental results show that the proposed CNN-CR clearly outperforms simple bicubic downsampling and achieves on average 2.25 dB improvement in terms of the reconstruction quality on a large collection of natural images. We further investigate two applications of image CR, i.e., low-bit-rate image compression and image retargeting. Experimental results show that the proposed CNN-CR helps achieve significant bit savings over High Efficiency Video Coding when applied to image compression and produces visually pleasing results when applied to image retargeting.
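The two optimization targets named in this abstract can be sketched as a single loss. This is a minimal sketch, not the paper's formulation. The assumptions are: 2x2 averaging stands in for the naive downsampler, nearest-neighbor upsampling stands in for the SR step, mean squared error is the distance, and `lam` is a hypothetical weight between the two terms.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2-D lists."""
    n = len(a) * len(a[0])
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb)) / n


def downsample_2x(img):
    """Naive 2x downsampling: average each 2x2 block (assumed stand-in)."""
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, len(img[0]), 2)]
            for y in range(0, len(img), 2)]


def upsample_2x(img):
    """Nearest-neighbor 2x upsampling (stand-in for a trained SR network)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def cr_loss(hr, cr, lam=0.5):
    """Information-loss term plus weighted visual-quality term.

    reconstruction: distance between the upsampled compact image and the original.
    regularization: distance between the compact image and a naive downsampling.
    """
    reconstruction = mse(upsample_2x(cr), hr)
    regularization = mse(cr, downsample_2x(hr))
    return reconstruction + lam * regularization
```

In joint training the upsampler would itself be a trainable SR network, so the reconstruction term would drive both networks while the regularization term keeps the compact image close to a conventional thumbnail.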

Learning a Convolutional Neural Network for Image Compact-Resolution

Inspired by the recent advances of image super-resolution using convolutional neural network (CNN), we propose a CNN-based block up-sampling scheme for intra frame coding. A block can be down-sampled before being compressed by normal intra coding, and then up-sampled to its original resolution. Different from previous studies on down/up-sampling-based coding, the up-sampling methods in our scheme are designed by training CNNs rather than being hand-crafted. We explore a new CNN structure for up-sampling, which features deconvolution of feature maps, multi-scale fusion, and residue learning, making the network both compact and efficient. We also design different networks for the up-sampling of luma and chroma components, respectively, where the chroma up-sampling CNN utilizes the luma information to boost its performance. In addition, we design a two-stage up-sampling process, the first stage being within the block-by-block coding loop, and the second stage being performed on the entire frame, so as to refine block boundaries. We also empirically study how to set the coding parameters of down-sampled blocks for pursuing the frame-level rate-distortion optimization. Our proposed scheme is implemented in the high-efficiency video coding (HEVC) reference software, and a comprehensive set of experiments has been performed to evaluate our methods. Experimental results show that our scheme achieves significant bit savings compared with the HEVC anchor, especially at low bit rates, leading to on average 5.5% BD-rate reduction on common test sequences and on average 9.0% BD-rate reduction on ultrahigh definition test sequences.

Convolutional Neural Network-Based Block Up-Sampling for Intra Frame Coding

In this paper, we study a simplified affine motion model-based coding framework to overcome the limitation of a translational motion model while maintaining low computational complexity. The proposed framework mainly has three key contributions. First, we propose to reduce the number of affine motion parameters from 6 to 4. The proposed four-parameter affine motion model can not only handle most of the complex motions in natural videos, but also save the bits for two parameters. Second, to efficiently encode the affine motion parameters, we propose two motion prediction modes, i.e., an advanced affine motion vector prediction scheme combined with a gradient-based fast affine motion estimation algorithm and an affine model merge scheme, where the latter attempts to reuse the affine motion parameters (instead of the motion vectors) of neighboring blocks. Third, we propose two fast affine motion compensation algorithms. One is the one-step sub-pixel interpolation that reduces the computations of each interpolation. The other is the interpolation-precision-based adaptive block size motion compensation that performs motion compensation at the block level rather than the pixel level to reduce the number of interpolations. Our proposed techniques have been implemented based on the state-of-the-art high-efficiency video coding standard, and the experimental results show that the proposed techniques altogether achieve, on average, 11.1% and 19.3% bit savings for random access and low-delay configurations, respectively, on typical video sequences that have rich rotation or zooming motions. Meanwhile, the computational complexity increases of both the encoder and the decoder are within an acceptable range.
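The abstract does not spell out the model equations, so the following is a sketch under an assumption: it uses the widely used control-point form of a four-parameter affine model, in which the motion vectors at the top-left and top-right corners of a block determine a combined rotation, zoom and translation. The function name `affine_mv` and its argument layout are hypothetical.

```python
def affine_mv(x, y, v0, v1, width):
    """Motion vector at position (x, y) inside a block of the given width.

    v0: (vx, vy) at the top-left corner (0, 0).
    v1: (vx, vy) at the top-right corner (width, 0).
    The four free parameters are a, b and the two components of v0.
    """
    a = (v1[0] - v0[0]) / width   # zoom/rotation term shared by both axes
    b = (v1[1] - v0[1]) / width   # rotation term, with opposite signs per axis
    vx = a * x - b * y + v0[0]
    vy = b * x + a * y + v0[1]
    return vx, vy


def block_mv_field(v0, v1, width, height, sub=4):
    """One MV per sub x sub sub-block, sampled at each sub-block centre."""
    return [[affine_mv(x + sub / 2, y + sub / 2, v0, v1, width)
             for x in range(0, width, sub)]
            for y in range(0, height, sub)]
```

When `v0 == v1` both `a` and `b` vanish and every position receives the same vector, so the translational model is the special case. Sampling one vector per sub-block, as `block_mv_field` does, is one way to run motion compensation at the block level rather than per pixel.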

An Efficient Four-Parameter Affine Motion Model for Video Coding

Recently, convolutional neural network (CNN)-based methods have achieved remarkable progress in image and video super-resolution, which inspires research on down-/up-sampling-based image and video coding using CNN. Instead of hand-crafted filters for up-sampling, trained CNN models are believed to be more capable of improving image quality, thus leading to coding gain. However, previous studies either concentrated on intra-frame coding or performed down- and up-sampling of the entire frame. In this paper, we introduce block-level down- and up-sampling into inter-frame coding with the help of CNN. Specifically, each block in the P or B frame can either be compressed at the original resolution or down-sampled and compressed at low resolution and then up-sampled by the trained CNN models. Such block-level adaptivity is flexible enough to cope with the spatially variant texture and motion characteristics. We further investigate how to enhance the capability of CNN-based up-sampling by utilizing reference frames and study how to train the CNN models by using encoded video sequences. We implement the proposed scheme in the high efficiency video coding (HEVC) reference software and perform a comprehensive set of experiments to evaluate our methods. The experimental results show that our scheme achieves superior performance to the HEVC anchor, especially at low bit rates, leading to an average 3.8%, 2.6%, and 3.5% BD-rate reduction on the HEVC common test sequences under random-access, low-delay B, and low-delay P configurations, respectively. When tested on high-definition and ultrahigh-definition sequences, the average BD-rate reduction exceeds 5%.

Convolutional Neural Network-Based Block Up-Sampling for HEVC

360° video compression faces two main challenges due to projection distortions, namely, the geometry distortion and the face boundary discontinuity. There are tradeoffs between selecting equi-rectangular projection (ERP) and polyhedron projection. In ERP, the geometry distortion is more severe than the face boundary discontinuity, while for the polyhedron projections, the face boundary discontinuity is more severe than the geometry distortion. These two distortions adversely affect motion compensation and undermine the compression efficiency of 360° video. In this paper, an integrated framework is developed to handle these two problems to improve coding efficiency. The proposed framework mainly has two key contributions. First, we derive a unified advanced spherical motion model to handle the geometry distortion of different projection formats for 360° video. When fitting the projection between the various projection formats and the sphere into the unified framework, a specific solution can be obtained for each projection format. Second, we propose a local 3D padding method to handle the face boundary discontinuity between the neighboring faces in various projection formats of 360° video. The local 3D padding method can be applied to different projection formats by setting different angles between neighboring faces. These two methods are independent of each other and can also be combined into an integrated framework to achieve a better rate-distortion performance. The proposed framework can be seamlessly integrated into the latest video coding standard, High Efficiency Video Coding (HEVC). The experimental results demonstrate that introducing the proposed coding tools can achieve significant bitrate savings compared with the current state-of-the-art method.
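To make the ERP geometry distortion concrete, the sketch below maps an ERP pixel to a point on the unit sphere and computes how much each row is stretched horizontally relative to the equator. It illustrates only the standard equirectangular mapping, not the paper's spherical motion model. The axis convention and the function names `erp_to_sphere` and `horizontal_stretch` are assumptions.

```python
import math


def erp_to_sphere(u, v, width, height):
    """Map the centre of ERP pixel (u, v) to a unit vector (x, y, z).

    Assumed convention: longitude spans [-pi, pi) left to right,
    latitude spans (pi/2, -pi/2) top to bottom, y points up.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.sin(lat)
    z = -math.cos(lat) * math.sin(lon)
    return x, y, z


def horizontal_stretch(v, height):
    """Horizontal oversampling of ERP row v relative to the equator: 1 / cos(latitude)."""
    lat = (0.5 - (v + 0.5) / height) * math.pi
    return 1.0 / math.cos(lat)
```

Because the stretch grows as 1 / cos(latitude), an object moving at constant speed on the sphere produces different pixel displacements in different rows, which is why a purely translational block model fits ERP content poorly near the poles.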

Advanced Spherical Motion Model and Local Padding for 360° Video Compression

A picture prediction method and a related apparatus are disclosed. The picture prediction method includes: determining motion vector predictors of K pixel samples in a current picture block, where K is an integer greater than 1, the K pixel samples include a first vertex angle pixel sample in the current picture block, a motion vector predictor of the first vertex angle pixel sample is obtained based on a motion vector of a preset first spatially adjacent picture block of the current picture block, and the first spatially adjacent picture block is spatially adjacent to the first vertex angle pixel sample; and performing, based on a non-translational motion model and the motion vector predictors of the K pixel samples, pixel value prediction on the current picture block. Solutions in the embodiments of the present application are helpful in reducing calculation complexity of picture prediction based on a non-translational motion model.

Yang Haitao

Papers

Convolutional Neural Network-Based Block Up-Sampling for Intra Frame Coding

An Efficient Four-Parameter Affine Motion Model for Video Coding

Convolutional Neural Network-Based Block Up-Sampling for HEVC

Advanced Spherical Motion Model and Local Padding for 360° Video Compression

Picture prediction method and related apparatus