
Showing papers on "Residual frame" published in 2022


Proceedings ArticleDOI
01 Jun 2022
TL;DR: In this article, the authors propose a many-to-many splatting framework that estimates multiple bidirectional flows to directly forward-warp the pixels to the desired time step and then fuses any overlapping pixels.
Abstract: Motion-based video frame interpolation commonly relies on optical flow to warp pixels from the inputs to the desired interpolation instant. Yet due to the inherent challenges of motion estimation (e.g. occlusions and discontinuities), most state-of-the-art interpolation approaches require subsequent refinement of the warped result to generate satisfying outputs, which drastically decreases the efficiency for multi-frame interpolation. In this work, we propose a fully differentiable Many-to-Many (M2M) splatting framework to interpolate frames efficiently. Specifically, given a frame pair, we estimate multiple bidirectional flows to directly forward warp the pixels to the desired time step, and then fuse any overlapping pixels. In doing so, each source pixel renders multiple target pixels and each target pixel can be synthesized from a larger area of visual context. This establishes a many-to-many splatting scheme with robustness to artifacts like holes. Moreover, for each input frame pair, M2M only performs motion estimation once and has a minuscule computational overhead when interpolating an arbitrary number of in-between frames, hence achieving fast multi-frame interpolation. We conducted extensive experiments to analyze M2M, and found that it significantly improves the efficiency while maintaining high effectiveness.

7 citations
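To make the many-to-many splatting idea concrete, here is a minimal NumPy sketch of forward splatting with weight accumulation: each source pixel scatters its value to the four target pixels around its warped position, and overlapping contributions are fused by normalizing the accumulated weights. For brevity it uses a single flow per pixel, whereas M2M estimates multiple bidirectional flows; all names are illustrative.

```python
import numpy as np

def forward_splat(frame, flow, t=0.5):
    """Forward-warp `frame` (H, W, C) by t * flow (H, W, 2), fusing
    overlapping pixels by accumulating bilinear splatting weights."""
    H, W, C = frame.shape
    out = np.zeros((H, W, C))
    wsum = np.zeros((H, W, 1))
    ys, xs = np.mgrid[0:H, 0:W]
    # Target (sub-pixel) coordinates for every source pixel.
    tx = xs + t * flow[..., 0]
    ty = ys + t * flow[..., 1]
    x0, y0 = np.floor(tx).astype(int), np.floor(ty).astype(int)
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            # Bilinear weight of this corner for each splatted pixel.
            w = (1 - np.abs(tx - xi)) * (1 - np.abs(ty - yi))
            valid = (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            np.add.at(out, (yi[valid], xi[valid]),
                      frame[ys[valid], xs[valid]] * w[valid][..., None])
            np.add.at(wsum, (yi[valid], xi[valid]), w[valid][..., None])
    return out / np.maximum(wsum, 1e-8)
```

Because motion is estimated once, interpolating another time step only reruns this cheap splatting with a different t, which is where the fast multi-frame interpolation comes from.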


Proceedings ArticleDOI
01 Jan 2022
TL;DR: Cho et al. propose a bijective matching mechanism that finds the best matches from the query frame to the reference frame and vice versa, which can effectively eliminate background distractors.
Abstract: Semi-supervised video object segmentation (VOS) aims to track the designated objects present in the initial frame of a video at the pixel level. To fully exploit the appearance information of an object, pixel-level feature matching is widely used in VOS. Conventional feature matching runs in a surjective manner, i.e., only the best matches from the query frame to the reference frame are considered. Each location in the query frame refers to the optimal location in the reference frame regardless of how often each reference frame location is referenced. This works well in most cases and is robust against rapid appearance variations, but may cause critical errors when the query frame contains background distractors that look similar to the target object. To mitigate this concern, we introduce a bijective matching mechanism to find the best matches from the query frame to the reference frame and vice versa. Before finding the best matches for the query frame pixels, the optimal matches for the reference frame pixels are first considered to prevent each reference frame pixel from being overly referenced. As this mechanism operates in a strict manner, i.e., pixels are connected if and only if they are the sure matches for each other, it can effectively eliminate background distractors. In addition, we propose a mask embedding module to improve the existing mask propagation method. By embedding multiple historic masks with coordinate information, it can effectively capture the position information of a target object. Code and models are available at https://github.com/suhwan-cho/BMVOS.

6 citations
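A hedged PyTorch sketch of the bijective matching idea, reduced to strict mutual argmax between pixel features: a query pixel and a reference pixel are connected only if each is the other's best match. The tensor shapes are assumptions, and the full BMVOS matching is richer than this.

```python
import torch

def bijective_matches(query_feat, ref_feat):
    """query_feat: (Nq, C), ref_feat: (Nr, C) L2-normalized pixel features.
    Returns a boolean (Nq, Nr) mask of mutual best matches."""
    sim = query_feat @ ref_feat.t()        # (Nq, Nr) affinity matrix
    best_ref = sim.argmax(dim=1)           # best ref pixel per query pixel
    best_query = sim.argmax(dim=0)         # best query pixel per ref pixel
    # Keep a pair only if the two argmax decisions agree (mutual match).
    mutual = best_query[best_ref] == torch.arange(sim.size(0))
    mask = torch.zeros_like(sim, dtype=torch.bool)
    idx = torch.arange(sim.size(0))[mutual]
    mask[idx, best_ref[idx]] = True
    return mask
```

In contrast, surjective matching would keep every row's argmax regardless of how often a reference pixel is claimed, which is exactly what lets background distractors attract many query pixels.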


Journal ArticleDOI
TL;DR: In this article, the authors propose a frame synchronization method for data transmission systems with short packets, specifically those that use nonseparable factorial coding; the method employs correlation processing and majority processing of data fragments transmitted through the communication channel.

4 citations


Journal ArticleDOI
TL;DR: In this paper, the Easy Frame Selector (EFS) is proposed to select an "easy" reference frame that makes the subsequent VOS easier, thereby improving the VOS performance; the authors question why the first frame should be selected as the reference frame, or why the entire video should be used to specify the mask.
Abstract: Unsupervised video object segmentation (UVOS) is a per-pixel binary labeling problem which aims at separating the foreground object from the background in a video without using the ground truth (GT) mask of the foreground object. Most previous UVOS models use the first frame or the entire video as a reference frame to specify the mask of the foreground object. Our question is why the first frame should be selected as a reference frame, or why the entire video should be used to specify the mask. We believe that we can select a better reference frame to achieve better UVOS performance than using only the first frame or the entire video as a reference frame. In this paper, we propose the Easy Frame Selector (EFS). The EFS enables us to select an "easy" reference frame that makes the subsequent VOS easy, thereby improving the VOS performance. Furthermore, we propose a new framework named Iterative Mask Prediction (IMP). In this framework, we repeatedly apply EFS to the given video, selecting an "easier" reference frame than in the previous iteration and incrementally increasing the VOS performance. The IMP consists of EFS, Bi-directional Mask Prediction (BMP), and Temporal Information Updating (TIU). With the proposed framework, we achieve state-of-the-art performance on three UVOS benchmark sets: DAVIS16, FBMS, and SegTrack-V2.

3 citations
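Schematically, the IMP framework iterates as in the sketch below; `efs`, `bmp`, and `tiu` are hypothetical stand-ins for the paper's Easy Frame Selector, Bi-directional Mask Prediction, and Temporal Information Updating modules, not its actual API.

```python
def iterative_mask_prediction(video, efs, bmp, tiu, n_iters=3):
    """Hypothetical skeleton of IMP: each iteration picks an 'easier'
    reference frame and refines the masks from it."""
    masks = None
    ref_idx = efs(video, masks)              # initial "easy" reference frame
    for _ in range(n_iters):
        masks = bmp(video, ref_idx, masks)   # bi-directional mask prediction
        masks = tiu(video, masks)            # temporal information updating
        ref_idx = efs(video, masks)          # select an easier reference
    return masks
```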


Proceedings ArticleDOI
06 Feb 2022
TL;DR: This work estimates optical flows between a pair of input frames, predicts future motions using two schemes (motion doubling and motion mirroring), and develops a synthesis network to generate a future frame from the warped frames.
Abstract: We propose a novel video frame extrapolation algorithm based on future motion estimation. First, we estimate optical flows between a pair of input frames and predict future motions using two schemes: motion doubling and motion mirroring. Then, we forward warp the input frames by employing the two kinds of predicted motion fields, respectively. Finally, we develop a synthesis network to generate a future frame from the warped frames. Experimental results show that the proposed algorithm outperforms recent video frame extrapolation algorithms on various datasets.

2 citations
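The two prediction schemes admit a very compact sketch. Under a constant-velocity reading of the abstract (my interpretation, not the paper's exact formulation), motion doubling extends the frame0-to-frame1 flow, while motion mirroring reflects the backward flow at frame1:

```python
import numpy as np

def predict_future_motions(flow_01, flow_10):
    """flow_01: optical flow frame0 -> frame1; flow_10: frame1 -> frame0.
    Returns two candidate motion fields toward the future frame (t = 2)."""
    doubled = 2.0 * flow_01    # constant velocity: frame0 -> frame2
    mirrored = -flow_10        # reflect the backward flow: frame1 -> frame2
    return doubled, mirrored

# frame0 is then forward-warped with `doubled`, frame1 with `mirrored`,
# and the synthesis network fuses the two warped candidates.
```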


Journal ArticleDOI
TL;DR: Wang et al. propose a disparity-aware reference frame generation network (DAG-Net) that transforms the disparity relationship between different viewpoints and generates a more reliable reference frame.
Abstract: Multiview video coding (MVC) aims to compress the multiview video through the elimination of video redundancies, where the quality of the reference frame directly affects the compression efficiency. In this paper, we propose a deep virtual reference frame generation method based on a disparity-aware reference frame generation network (DAG-Net) to transform the disparity relationship between different viewpoints and generate a more reliable reference frame. The proposed DAG-Net consists of a multi-level receptive field module, a disparity-aware alignment module, and a fusion reconstruction module. First, a multi-level receptive field module is designed to enlarge the receptive field, and extract the multi-scale deep features of the temporal and inter-view reference frames. Then, a disparity-aware alignment module is proposed to learn the disparity relationship, and perform disparity shift on the inter-view reference frame to align it with the temporal reference frame. Finally, a fusion reconstruction module is utilized to fuse the complementary information and generate a more reliable virtual reference frame. Experiments demonstrate that the proposed reference frame generation method achieves superior performance for multiview video coding.

2 citations
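Structurally, the three modules compose as in this hypothetical PyTorch skeleton; module internals are omitted and all names are assumptions rather than the paper's code.

```python
import torch.nn as nn

class DAGNetSketch(nn.Module):
    """Hypothetical skeleton of the three-stage DAG-Net pipeline."""
    def __init__(self, mrf, align, fuse):
        super().__init__()
        self.mrf, self.align, self.fuse = mrf, align, fuse

    def forward(self, temporal_ref, interview_ref):
        ft = self.mrf(temporal_ref)        # multi-scale features, temporal ref
        fv = self.mrf(interview_ref)       # multi-scale features, inter-view ref
        fv_aligned = self.align(fv, ft)    # disparity shift toward temporal ref
        return self.fuse(ft, fv_aligned)   # virtual reference frame
```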


Proceedings ArticleDOI
28 May 2022
TL;DR: Wang et al. propose a deep video compression method for P-frames in sub-sampled color spaces, specifically YUV420, which has been widely adopted in many state-of-the-art hybrid video compression standards, in an effort to achieve high compression performance.
Abstract: In this paper, we propose a deep video compression method for P-frames in sub-sampled color spaces, specifically YUV420, which has been widely adopted in many state-of-the-art hybrid video compression standards, in an effort to achieve high compression performance. We adopt motion estimation and motion compression to facilitate the inter prediction of videos in the YUV420 color format, shrinking the total data volume of the motion information. Moreover, a motion compensation module for YUV420 is incorporated to enhance the quality of the compensated frame, taking the resolution alignment in the sub-sampled color spaces into consideration. To exploit the cross-component correlation, the residual encoder-decoder is equipped with two head branches and color information fusion. Additionally, a weighted loss emphasizing the Y component is utilized to enhance the compression efficiency. Experimental results show that the proposed method realizes 19.82% bit-rate reduction on average compared to the deep video compression (DVC) method in terms of combined PSNR, with predominant gains on the Y component.

1 citation
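The Y-emphasizing loss is straightforward to sketch; the weights below are illustrative assumptions, not the paper's values, and the tensor layout is the usual 4:2:0 one.

```python
import torch

def yuv420_weighted_distortion(pred, target, w_y=0.75):
    """Hedged sketch: weighted YUV distortion emphasizing the Y component.
    pred/target: dicts with 'y' (N, 1, H, W) and 'u', 'v' (N, 1, H/2, W/2).
    The 0.75 / 0.125 / 0.125 split is an assumption."""
    mse = lambda a, b: torch.mean((a - b) ** 2)
    w_c = (1.0 - w_y) / 2.0
    return (w_y * mse(pred['y'], target['y'])
            + w_c * mse(pred['u'], target['u'])
            + w_c * mse(pred['v'], target['v']))
```

A related convention is the 6:1:1 Y:U:V weighting commonly used when reporting combined YUV PSNR.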


Journal ArticleDOI
12 Feb 2022-Symmetry
TL;DR: A novel object-based frame identification network uses symmetrically overlapped motion residuals to enhance the discernment of video frames; its identification accuracy is compared with that of existing methods.
Abstract: Image and video manipulation has been actively used in recent years with the development of multimedia editing technologies. However, object-based video tampering, which adds or removes objects within a video frame, is posing challenges because it is difficult to verify the authenticity of videos. In this paper, we present a novel object-based frame identification network. The proposed method uses symmetrically overlapped motion residuals to enhance the discernment of video frames. Since the proposed motion residual features are generated on the basis of overlapped temporal windows, temporal variations in the video sequence can be exploited in the deep neural network. In addition, this paper introduces an asymmetric network structure for training and testing a single basic convolutional neural network. In the training process, two networks with an identical structure are used, each of which has a different input pair. In the testing step, two types of testing methods corresponding to two- and three-class frame identifications are proposed. We compare the identification accuracy of the proposed method with that of the existing methods. The experimental results demonstrate that the proposed method generates reasonable identification results for both two- and three-class forged frame identifications.

1 citation
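One plausible reading of the symmetrically overlapped motion residuals, sketched in NumPy: absolute frame differences are accumulated over two temporal windows that overlap at the frame under test. The paper's exact windowing and motion handling are not reproduced here.

```python
import numpy as np

def overlapped_motion_residuals(frames, t, k=2):
    """Hedged sketch: residual maps from two windows overlapping at frame t.
    frames: list of grayscale float arrays (H, W); requires k <= t and
    t + k < len(frames)."""
    past = sum(np.abs(frames[i + 1] - frames[i]) for i in range(t - k, t))
    future = sum(np.abs(frames[i + 1] - frames[i]) for i in range(t, t + k))
    # The two residual maps form an input pair for the identification network.
    return past, future
```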


Journal ArticleDOI
TL;DR: In this article, the authors propose a commonality modeling approach that provides a seamless blending of global and local homogeneity information by recursively partitioning the frame into rectangular regions based on the homogeneity of the entire frame.
Abstract: Video coding algorithms attempt to minimize the significant commonality that exists within a video sequence. Each new video coding standard contains tools that can perform this task more efficiently than its predecessors. Modern video coding systems are block-based, wherein commonality modeling is carried out only from the perspective of the block that needs to be coded next. In this work, we argue for a commonality modeling approach that can provide a seamless blending of global and local homogeneity information. For this purpose, the frame to be coded is first recursively partitioned into rectangular regions based on the homogeneity information of the entire frame. Each obtained rectangular region's feature descriptor is then taken to be the average intensity of all the pixels within the region. In this way, the proposed approach generates a coarse representation of the current frame by minimizing both global and local commonality. This coarse frame is computationally simple and has a compact representation. It attempts to preserve important structural properties of the current frame, which can be observed subjectively as well as through the improved rate-distortion performance of a reference scalable HEVC coder that employs the coarse frame as a reference frame for encoding the current frame.
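As a minimal sketch, such a coarse frame can be built by a recursive homogeneity-driven split. The quadtree-style split and the variance test below are my assumptions; the paper only specifies rectangular regions described by their mean intensity.

```python
import numpy as np

def coarse_frame(frame, std_thr=12.0, min_size=4):
    """Hedged sketch: recursively split a grayscale frame (H, W) into
    rectangles until each is homogeneous, then represent each rectangle
    by its mean intensity (its compact feature descriptor)."""
    out = np.empty(frame.shape, dtype=float)

    def split(y0, y1, x0, x1):
        region = frame[y0:y1, x0:x1]
        if region.std() <= std_thr or min(y1 - y0, x1 - x0) <= min_size:
            out[y0:y1, x0:x1] = region.mean()   # region feature descriptor
            return
        ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
        split(y0, ym, x0, xm)
        split(y0, ym, xm, x1)
        split(ym, y1, x0, xm)
        split(ym, y1, xm, x1)

    split(0, frame.shape[0], 0, frame.shape[1])
    return out
```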

Proceedings ArticleDOI
21 Oct 2022
TL;DR: In this paper, an effective packet loss concealment (PLC) method based on a simplified residual network is proposed, comprising two stages: residual network training (RNT) and PLC.
Abstract: In this paper, an effective packet loss concealment (PLC) method based on a simplified residual network is proposed, which comprises two stages: residual network training (RNT) and PLC. In the RNT stage, the input feature of the residual network comes from the decoded speech signals of the previous N frames, whereas the output feature is the decoded speech signal of the current frame. The residual network is thus trained to learn the speech waveform in the time domain. In the PLC stage, if no packet loss occurs for a frame, the characteristic parameters of that frame are decoded normally and sent to the buffer for standby. If packet loss occurs, the speech signals of the previous N frames contained in the buffer and the well-trained residual network are used to predict the lost speech signal of that frame. Experimental results show that the proposed PLC method outperforms the state-of-the-art method.
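The two-branch decode-or-conceal logic reads naturally as a loop; `decoder` and `plc_net` below are hypothetical callables standing in for the speech codec and the trained residual network.

```python
from collections import deque

def conceal(decoder, plc_net, packets, n_ctx=4):
    """Hedged sketch of the PLC control flow: decode normally when a
    packet arrives; otherwise predict the lost frame's waveform from
    the previous N decoded frames held in the buffer."""
    buffer, output = deque(maxlen=n_ctx), []
    for pkt in packets:
        if pkt is not None:                  # frame received: decode normally
            frame = decoder(pkt)
        else:                                # packet lost: residual-net prediction
            frame = plc_net(list(buffer))
        buffer.append(frame)                 # keep the buffer up to date
        output.append(frame)
    return output
```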


Proceedings ArticleDOI
16 Oct 2022
TL;DR: Wang et al. propose a novel future frame extrapolation algorithm using a future cost volume, which significantly outperforms state-of-the-art extrapolators on various datasets.
Abstract: A novel future frame extrapolation algorithm using the future cost volume is proposed in this work. First, we develop the future cost volume to estimate motion vectors from a future frame to the input frames. Second, we generate two future frame candidates by backward warping the input frames using the future motion vectors. Finally, we develop a synthesis network, which aggregates the two candidates to reconstruct the future frame faithfully. Experimental results demonstrate that the proposed algorithm significantly outperforms state-of-the-art extrapolators on various datasets.
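The backward-warping step that produces the two candidates is standard; here is a minimal PyTorch version using grid_sample. The shapes and conventions are the usual ones, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Sample `frame` (N, C, H, W, float) at positions displaced by
    `flow` (N, 2, H, W): out(x) = frame(x + flow(x))."""
    N, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid_x = xs[None] + flow[:, 0]           # where to fetch from, per pixel
    grid_y = ys[None] + flow[:, 1]
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid = torch.stack((2 * grid_x / (W - 1) - 1,
                        2 * grid_y / (H - 1) - 1), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True)
```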

Proceedings ArticleDOI
28 Dec 2022
TL;DR: In this paper, a novel frame-difference-based method for moving object detection is proposed: it determines the number of frames used to generate the background by comparing the binary frame difference of the current frame with the initial frame difference, establishes the background model from the determined frames, and then applies background subtraction to detect the moving object.
Abstract: Aiming at the problem that the frame difference method handles a moving target incorrectly when the target moves slowly or stops for some time, a novel frame-difference-based method for moving object detection is proposed. First, the algorithm determines the number of frames used to generate the background by comparing the binary frame difference of the current frame with the initial binary frame difference, and establishes the background model using the determined frames. Then, background subtraction is used to detect the moving object against the established background. Experiments show that the algorithm handles the detection of slow-moving targets better.
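A simplified OpenCV/NumPy sketch of that pipeline. The thresholds, the 5% agreement test, and the median background are my assumptions; grayscale uint8 frames are assumed throughout.

```python
import cv2
import numpy as np

def detect_moving_objects(frames, diff_thr=25, area_thr=200):
    """Hedged sketch: build a background from frames whose binary
    difference stays close to the initial one, then detect motion by
    background subtraction. frames: list of grayscale uint8 arrays."""
    init_diff = cv2.absdiff(frames[1], frames[0]) > diff_thr
    bg_frames = [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        diff = cv2.absdiff(cur, prev) > diff_thr
        if np.count_nonzero(diff ^ init_diff) < 0.05 * diff.size:
            bg_frames.append(cur)            # static enough: use for background
    background = np.median(np.stack(bg_frames), axis=0).astype(np.uint8)
    mask = (cv2.absdiff(frames[-1], background) > diff_thr).astype(np.uint8)
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    # Return (x, y, w, h) boxes of sufficiently large moving components.
    return [stats[i, :4] for i in range(1, n) if stats[i, 4] >= area_thr]
```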

Book ChapterDOI
01 Jan 2022
TL;DR: In this article, the modified inter frame difference method (MIFD) is proposed to detect moving objects in video under various environmental conditions with little data loss in a short period of time.
Abstract: Due to its lack of performance and flexibility, the traditional frame difference method does not handle precise moving object detection from the motion of the object region in each frame of various video sequences, so the object cannot be seen accurately in the foreground. This remains a serious concern. The computation time for object detection using the three-frame-difference and five-frame-difference approaches is longer, and frame information is lost. To address these flaws, a new method called the modified inter frame difference method (MIFD) is suggested. It detects moving objects in video under various environmental conditions with little data loss in a short period of time. MIFD involves constructing a reference frame, computing the inter-frame difference and a motion frame, and detecting moving object(s) in a frame by drawing rectangular blobs using connected components in the video sequence. The performance of the proposed algorithm is compared with previously reported results of the codebook model (CB), self-organizing background subtraction (SOBS), local binary pattern histogram (LBPH), robust background subtraction for network surveillance in H.264, GMM, ViBe, frame difference, three frame difference, improved three frame difference, and the combined three frame difference & background subtraction model. The experimental results demonstrate that the proposed method performs better than the other methods in accurately detecting moving object(s) in video under challenging environmental conditions.

Keywords: Video surveillance, Modified inter frame difference method, Motion frame, Object detection, Connected components
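The core MIFD steps (difference against a reference frame, a binary motion frame, rectangle blobs from connected components) sketch directly in OpenCV; the parameter values are illustrative, not the paper's.

```python
import cv2

def mifd_detect(reference, current, thr=25, area_thr=150):
    """Hedged sketch: inter-frame difference against a reference frame,
    binary motion frame, then bounding rectangles of connected blobs.
    Grayscale uint8 inputs assumed."""
    diff = cv2.absdiff(current, reference)            # inter-frame difference
    _, motion = cv2.threshold(diff, thr, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(motion, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Rectangle blobs around sufficiently large connected components.
    return [cv2.boundingRect(c) for c in contours
            if cv2.contourArea(c) >= area_thr]
```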

Journal ArticleDOI
TL;DR: In this paper, a video compressed sensing reconstruction algorithm based on multidimensional reference frames is proposed, using the sparse characteristics of video signals in different sparse representation domains; it can effectively improve the quality of slow-motion video reconstruction.
Abstract: In this paper, a video compressed sensing reconstruction algorithm based on multidimensional reference frames is proposed, exploiting the sparse characteristics of video signals in different sparse representation domains. First, the overall structure of the proposed algorithm is introduced, which adopts a multi-reference-frame bidirectional prediction hypothesis optimization scheme. Then, a reconstruction method for CS frames at the decoding end is proposed: in addition to using the reconstructed key frames of each GOP in the time domain as reference frames for CS frames, half-pixel reference frames and scaled reference frames in the pixel domain are also used as reference frames to obtain higher-quality hypotheses. The method of obtaining reference frames in the pixel domain is also discussed in detail. Finally, the proposed reconstruction algorithm is compared with the video compressed sensing algorithms in the literature that have the best reconstruction results. Experiments show that the algorithm outperforms the best multi-reference-frame video compressed sensing algorithm and can effectively improve the quality of slow-motion video reconstruction.

Journal ArticleDOI
01 Oct 2022-Sensors
TL;DR: The experimental results show that the proposed frame selection strategy ensures maximum safe frame removal while keeping the video content continuous at different vehicle speeds in various halation scenes.
Abstract: In order to address the discontinuity caused by directly applying the infrared and visible image fusion anti-halation method to video, an efficient night-vision anti-halation method based on video fusion is proposed. The designed frame selection based on inter-frame difference determines the optimal cosine angle threshold by analyzing the relation of the cosine angle threshold to nonlinear correlation information entropy and the frame removal rate. The proposed time-mark-based adaptive motion compensation constructs the same number of interpolated frames as the removed redundant frames by taking the retained frame number as a time stamp. Taking the motion vector between two adjacent retained frames as the benchmark, adaptive weights are constructed according to the inter-frame differences between the interpolated frame and the last retained frame, and the motion vector of the interpolated frame is estimated. The experimental results show that the proposed frame selection strategy ensures maximum safe frame removal while keeping the video content continuous at different vehicle speeds in various halation scenes. The frame count and playing duration of the fused video are consistent with those of the original video, and the content of the interpolated frames is highly synchronized with that of the corresponding original frames. The average FPS of video fusion in this work is about six times that of frame-by-frame fusion, which effectively improves the anti-halation processing efficiency of video fusion.
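A minimal sketch of the cosine-angle frame selection, assuming the cosine is computed between whole flattened frames and the threshold is tuned offline; the 0.998 value below is illustrative, not the paper's optimum.

```python
import numpy as np

def select_frames(frames, cos_thr=0.998):
    """Hedged sketch: keep a frame only if its cosine similarity to the
    last retained frame drops below the threshold, i.e., the content
    changed enough; near-duplicate frames are removed."""
    kept = [0]                               # always keep the first frame
    for i in range(1, len(frames)):
        a = frames[kept[-1]].ravel().astype(float)
        b = frames[i].ravel().astype(float)
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if cos < cos_thr:                    # redundant frames are dropped
            kept.append(i)
    return kept
```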

Posted ContentDOI
25 Jul 2022
TL;DR: In this article, a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression is proposed, which utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame.
Abstract: Efficient point cloud compression is essential for applications like virtual and mixed reality, autonomous driving, and cultural heritage. In this paper, we propose a deep learning-based inter-frame encoding scheme for dynamic point cloud geometry compression. We propose a lossy geometry compression scheme that predicts the latent representation of the current frame using the previous frame by employing a novel prediction network. Our proposed network utilizes sparse convolutions with hierarchical multiscale 3D feature learning to encode the current frame using the previous frame. We employ convolution on target coordinates to map the latent representation of the previous frame to the downsampled coordinates of the current frame to predict the current frame's feature embedding. Our framework transmits the residual of the predicted features and the actual features by compressing them using a learned probabilistic factorized entropy model. At the receiver, the decoder hierarchically reconstructs the current frame by progressively rescaling the feature embedding. We compared our model to the state-of-the-art Video-based Point Cloud Compression (V-PCC) and Geometry-based Point Cloud Compression (G-PCC) schemes standardized by the Moving Picture Experts Group (MPEG). Our method achieves more than 91% BD-Rate (Bjontegaard Delta Rate) reduction against G-PCC, more than 62% BD-Rate reduction against V-PCC intra-frame encoding mode, and more than 52% BD-Rate savings against V-PCC P-frame-based inter-frame encoding mode using HEVC.

Journal ArticleDOI
TL;DR: Wang et al. propose a frame interpolation method based on residual blocks and feature pyramids, which can capture multi-layer information, segment objects from the background, and obtain parameters with motion information.
Abstract: Various deep learning-based video frame interpolation methods have been proposed in the past few years, but how to generate high-quality interpolated frames in videos with large motions, complex backgrounds and rich textures is still a challenging issue. To deal with this limitation, a frame interpolation method based on residual blocks and feature pyramids is proposed. U-Net is the main architecture of our method, which can capture multi-layer information, segment objects from the background and obtain parameters with motion information to guide frame interpolation. However, the upsampling and downsampling in U-Net lose important information. In order to acquire more detailed contextual information, shortcut connections are used in the encoder's basic module. At the same time, a feature pyramid network is employed to capture features at different scales of the decoder to improve the representation of inter-frame spatial-temporal features. The experimental results show that the proposed method outperforms the baseline methods in both objective and subjective evaluations on different datasets. In particular, the method has obvious advantages on datasets that contain complex backgrounds.
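The encoder basic module with a shortcut connection is the familiar residual block; a minimal PyTorch sketch follows, where the channel counts and layer choices are assumptions rather than the paper's configuration.

```python
import torch.nn as nn

class ResidualEncoderBlock(nn.Module):
    """Hedged sketch of an encoder basic module with a shortcut
    connection: the input is added back to the convolutional branch,
    preserving detail that plain downsampling paths would lose."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))   # shortcut connection
```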

Journal ArticleDOI
TL;DR: In this article, a blind forensics method is proposed to identify the adopted MCFI method by considering the irregularities of the optical flow produced by various MCFIs; a set of compact features is constructed from the motion-aligned frame-difference-weighted histogram of local binary patterns on the basis of the optical flow (MAFD-WHLBP).
Abstract: Motion-compensated frame interpolation (MCFI), which synthesizes intermediate frames between input frames guided by estimated motion, can be employed to falsify high bit-rate videos or high frame-rate videos with different frame rates. Although existing MCFI identification methods have obtained satisfactory results, they degrade seriously under stronger compression. To address this issue, a blind forensics method is proposed to identify the adopted MCFI method by considering the irregularities of the optical flow produced by various MCFIs. In this paper, a set of compact features is constructed from the motion-aligned frame-difference-weighted histogram of local binary patterns on the basis of the optical flow (MAFD-WHLBP). Experimental results show that the proposed approach outperforms existing MCFI detectors under stronger compression.
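A hedged sketch of the feature construction, reduced to a frame-difference-weighted histogram of uniform LBP codes; the motion alignment step is omitted, and scikit-image is used for the LBP operator.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def weighted_lbp_histogram(frame_diff, gray, P=8, R=1):
    """Hedged sketch: each pixel's LBP code contributes to the histogram
    with a weight given by the frame difference at that pixel, so bins
    dominated by interpolation residue stand out.
    frame_diff, gray: (H, W) float/uint8 arrays."""
    lbp = local_binary_pattern(gray, P, R, method='uniform').astype(int)
    n_bins = P + 2                           # uniform LBP has P + 2 patterns
    hist = np.bincount(lbp.ravel(),
                       weights=np.abs(frame_diff).ravel(),
                       minlength=n_bins)
    return hist / (hist.sum() + 1e-12)       # normalized compact feature
```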

Posted ContentDOI
05 Sep 2022
TL;DR: B-CANF exploits conditional augmented normalizing flows for B-frame coding and achieves state-of-the-art compression performance, comparable to HM-16.23 under the random access configuration.
Abstract: This work introduces a B-frame coding framework, termed B-CANF, that exploits conditional augmented normalizing flows for B-frame coding. Learned B-frame coding is less explored and more challenging. Motivated by recent advances in conditional P-frame coding, B-CANF is the first attempt at applying flow-based models to both conditional motion and inter-frame coding. B-CANF features frame-type adaptive coding that learns better bit allocation for hierarchical B-frame coding. B-CANF also introduces a special type of B-frame, called B*-frame, to mimic P-frame coding. On commonly used datasets, B-CANF achieves the state-of-the-art compression performance, showing comparable BD-rate results (in terms of PSNR-RGB) to HM-16.23 under the random access configuration.


Journal ArticleDOI
TL;DR: In this article, a Progressive Motion Context Refine Network (PMCRNet) is proposed to predict motion fields and image context jointly for higher efficiency, reducing model size and inference delay.
Abstract: Recently, flow-based frame interpolation methods have achieved great success by first modeling the optical flow between target and input frames and then building a synthesis network for target frame generation. However, this cascaded architecture can lead to a large model size and inference delay, hindering mobile and real-time applications. To solve this problem, we propose a novel Progressive Motion Context Refine Network (PMCRNet) to predict motion fields and image context jointly for higher efficiency. Different from others that directly synthesize the target frame from deep features, we explore simplifying the frame interpolation task by borrowing existing texture from adjacent input frames, which means that the decoder in each pyramid level of our PMCRNet only needs to update an easier intermediate optical flow, occlusion merge mask and image residual. Moreover, we introduce a new annealed multi-scale reconstruction loss to better guide the learning process of this efficient PMCRNet. Experiments on multiple benchmarks show that the proposed approach not only achieves favorable quantitative and qualitative results but also reduces the model size and running time significantly.
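Schematically, the per-level updates might look like the loop below; the decoder modules and feature lists are hypothetical, and the upsample-then-refine pattern is the generic coarse-to-fine one rather than PMCRNet's exact design.

```python
import torch.nn.functional as F

def coarse_to_fine_refine(decoders, feats0, feats1, flow, mask, resid):
    """Hypothetical sketch: each pyramid decoder only updates the
    intermediate optical flow, occlusion merge mask and image residual,
    rather than synthesizing the frame from scratch."""
    for dec, f0, f1 in zip(decoders, feats0, feats1):   # coarse -> fine
        flow = 2.0 * F.interpolate(flow, scale_factor=2, mode='bilinear')
        mask = F.interpolate(mask, scale_factor=2, mode='bilinear')
        resid = F.interpolate(resid, scale_factor=2, mode='bilinear')
        d_flow, d_mask, d_resid = dec(f0, f1, flow, mask, resid)
        flow, mask, resid = flow + d_flow, mask + d_mask, resid + d_resid
    return flow, mask, resid
```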

Journal ArticleDOI
01 Nov 2022-Sensors
TL;DR: In this article, the authors propose an algorithm to consistently create the meshes of 4D volumetric data using dynamic reconstruction, comprising remeshing, correspondence searching, and target frame reconstruction by key frame deformation.
Abstract: A sequence of 3D models generated using volumetric capture has the advantage of retaining the characteristics of dynamic objects and scenes. However, in volumetric data, since the 3D mesh and texture are synthesized for every frame, the mesh of every frame has a different shape, and the brightness and color quality of the texture vary. This paper proposes an algorithm to consistently create the mesh of 4D volumetric data using dynamic reconstruction. The proposed algorithm comprises remeshing, correspondence searching, and target frame reconstruction by key frame deformation. We make non-rigid deformation possible by applying the surface deformation method to the key frame. Finally, we propose a method of compressing the target frame using the frame reconstructed by key frame deformation, with error rates of up to 98.88% and at least 20.39% compared to previous studies. The experimental results show the proposed method's effectiveness by measuring the geometric error between the deformed key frame and the target frame. Further, by calculating the residual between the two frames, the ratio of transmitted data is measured, showing a compression performance of 18.48%.