
Showing papers on "Inter frame published in 2022"


Journal ArticleDOI
TL;DR: In this paper, a spatial-temporal feature-based detection framework is proposed to enhance the target's energy, suppress the strong spatially nonstationary clutter, and detect dim small targets.
Abstract: The detection of infrared small targets under low signal-to-clutter ratio (SCR) and complex background conditions has been a challenging and popular research topic. In this article, a spatial-temporal feature-based detection framework is proposed. First, several factors, such as the infrared target’s small sample, the sensitive size, and the usual sample selection strategy, that affect the detection of small targets are analyzed. In addition, the small intersection over union (IOU) strategy, which helps to solve the false convergence and sample misjudgment problem, is proposed. Second, aiming at the difficulties due to the target’s dim information and complex background, the interframe energy accumulation (IFEA) enhancement mechanism-based end-to-end spatial-temporal feature extraction and target detection framework is proposed. This framework helps to enhance the target’s energy, suppress the strong spatially nonstationary clutter, and detect dim small targets. Experimental results show that using the small IOU strategy and IFEA mechanism, the proposed multiple frame-based detection framework performs better than some popular deep learning (DL)-based detection algorithms.
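To make the interframe energy accumulation (IFEA) idea above concrete, here is a minimal NumPy sketch of the generic mechanism: co-registered consecutive frames are averaged over a sliding temporal window so that a dim, slowly moving target gains energy while zero-mean noise partially cancels. The function name and window length are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def interframe_energy_accumulation(frames, window=5):
    """Toy sketch of interframe energy accumulation.

    frames: array of co-registered grayscale frames, shape (T, H, W).
    Returns one accumulated frame per time step, obtained by averaging a
    sliding temporal window; a dim target that stays roughly in place gains
    SNR (roughly by sqrt(window)) while uncorrelated noise averages out.
    """
    frames = np.asarray(frames, dtype=np.float32)
    T = frames.shape[0]
    out = np.empty_like(frames)
    for t in range(T):
        lo = max(0, t - window + 1)        # causal window ending at frame t
        out[t] = frames[lo:t + 1].mean(axis=0)
    return out
```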

14 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed a total variation (TV)-based interframe infrared patch-image model that regards the long-distance IR small target detection task as an optimization problem.
Abstract: Infrared (IR) small target detection is one of the most fundamental techniques in the infrared search and track (IRST) system. Due to the interferences caused by background clutter and image noise, conventional IR small target detection algorithms always suffer from a high false alarm rate and are unable to achieve robust performance in complex scenes. To accurately distinguish the IR small target from the background, we propose a total variation (TV)-based interframe infrared patch-image model that regards the long-distance IR small target detection task as an optimization problem. First, the input IR image is converted to a patch-image that consists of a sparse target matrix and a low-rank background matrix. Then, the interframe similarity of target appearance is utilized to impose a temporal consistency constraint on the target matrix. Next, a TV regularization term is proposed to further alleviate the false alarms generated by noise. Finally, an alternating optimization algorithm using singular value decomposition (SVD) and accelerated proximal gradient (APG) is designed to mathematically solve the proposed model. Both qualitative and quantitative experiments implemented on real IR sequences demonstrate that our model outperforms other traditional IR small target detection methods in terms of the signal-to-clutter ratio gain (SCRG) and the background suppression factor (BSF).
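The patch-image decomposition described above is, at its core, a low-rank (background) plus sparse (target) split. The sketch below shows only that generic backbone, solved with singular-value thresholding and entrywise soft-thresholding; the TV regularizer, the interframe temporal constraint and the APG schedule of the paper are omitted, and all parameter choices are assumptions.

```python
import numpy as np

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def low_rank_sparse_decompose(D, lam=None, n_iter=100, mu=None):
    """Generic low-rank (background) + sparse (target) split of a patch-image D.

    Simplified inexact-ALM style loop: the low-rank part L is updated by
    singular-value thresholding (via SVD), the sparse part S by entrywise
    soft-thresholding. Not the paper's full TV/interframe model.
    """
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 1.25 / (np.linalg.norm(D, 2) + 1e-8)
    Y = np.zeros_like(D)   # Lagrange multiplier
    S = np.zeros_like(D)
    for _ in range(n_iter):
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        S = soft_threshold(D - L + Y / mu, lam / mu)
        Y = Y + mu * (D - L - S)
    return L, S
```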

7 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: In this article, an adaptive style fusion network (ASFN) is proposed to align and aggregate flow features from consecutive flow sequences based on the inertia prior, and the corrupted flows are completed under the supervision of customized losses on reconstruction, flow smoothness, and consistent ternary census transform.
Abstract: Physical objects have inertia, which resists changes in the velocity and motion direction. Inspired by this, we introduce an inertia prior: optical flow, which reflects object motion in a local temporal window, remains unchanged in the adjacent preceding or subsequent frame. We propose a flow completion network to align and aggregate flow features from the consecutive flow sequences based on the inertia prior. The corrupted flows are completed under the supervision of customized losses on reconstruction, flow smoothness, and consistent ternary census transform. The completed flows with high fidelity give rise to a significant improvement in video inpainting quality. Nevertheless, the existing flow-guided cross-frame warping methods fail to consider the lightening and sharpness variation across video frames, which leads to spatial incoherence after warping from other frames. To alleviate this problem, we propose the Adaptive Style Fusion Network (ASFN), which utilizes the style information extracted from the valid regions to guide the gradient refinement in the warped regions. Moreover, we design a data simulation pipeline to reduce the training difficulty of ASFN. Extensive experiments show the superiority of our method against the state-of-the-art methods quantitatively and qualitatively. The project page is at https://github.com/hitachinsk/ISVI.

5 citations


Journal ArticleDOI
TL;DR: In this paper, a multiframe-to-multiframe (MM) denoising scheme is proposed to simultaneously recover multiple clean frames from consecutive noisy frames, enabling better temporal consistency in the denoised video.
Abstract: Most existing studies performed video denoising by using multiple adjacent noisy frames to recover one clean frame; however, despite achieving relatively good quality for each individual frame, these approaches may result in visual flickering when the denoised frames are considered in sequence. In this paper, instead of separately restoring each clean frame, we propose a multiframe-to-multiframe (MM) denoising scheme that simultaneously recovers multiple clean frames from consecutive noisy frames. The proposed MM denoising scheme uses a training strategy that optimizes the denoised video from both the spatial and temporal dimensions, enabling better temporal consistency in the denoised video. Furthermore, we present an MM network (MMNet), which adopts a spatiotemporal convolutional architecture that considers both the interframe similarity and single-frame characteristics. Benefiting from the underlying parallel mechanism of the MM denoising scheme, MMNet achieves a highly competitive denoising efficiency. Extensive analyses and experiments demonstrate that MMNet outperforms the state-of-the-art video denoising methods, yielding temporal consistency improvements of at least 13.3% and running more than 2 times faster than the other methods.
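A sketch of the kind of spatio-temporal training objective an MM scheme implies: a per-frame reconstruction term plus a consistency term on temporal differences of the output sequence. The function and weighting are illustrative assumptions, not MMNet's actual loss.

```python
import torch
import torch.nn.functional as F

def mm_denoise_loss(pred, clean, temporal_weight=0.5):
    """Multiframe-to-multiframe style loss (illustrative).

    pred, clean: tensors of shape (B, T, C, H, W) holding T denoised /
    ground-truth frames. Spatial term: per-frame L1. Temporal term: L1 on
    inter-frame differences, discouraging flicker in the output sequence.
    """
    spatial = F.l1_loss(pred, clean)
    temporal = F.l1_loss(pred[:, 1:] - pred[:, :-1],
                         clean[:, 1:] - clean[:, :-1])
    return spatial + temporal_weight * temporal
```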

5 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: In this paper, multi-scale feature-level fusion and one-shot non-linear inter-frame motion estimation are proposed for image warping from both events and images.
Abstract: Recently, video frame interpolation using a combination of frame- and event-based cameras has surpassed traditional image-based methods both in terms of performance and memory efficiency. However, current methods still suffer from (i) brittle image-level fusion of complementary interpolation results, that fails in the presence of artifacts in the fused image, (ii) potentially temporally inconsistent and inefficient motion estimation procedures, that run for every inserted frame and (iii) low contrast regions that do not trigger events, and thus cause events-only motion estimation to generate artifacts. Moreover, previous methods were only tested on datasets consisting of planar and far-away scenes, which do not capture the full complexity of the real world. In this work, we address the above problems by introducing multi-scale feature-level fusion and computing one-shot non-linear inter-frame motion, which can be efficiently sampled for image warping, from events and images. We also collect the first large-scale events and frames dataset consisting of more than 100 challenging scenes with depth variations, captured with a new experimental setup based on a beamsplitter. We show that our method improves the reconstruction quality by up to 0.2 dB in terms of PSNR and up to 15% in LPIPS score.
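As an illustration of "one-shot non-linear inter-frame motion that can be sampled for warping", the snippet below assumes a simple per-pixel constant-acceleration model, fitted once and then evaluated at any normalized time t; this parametrization is an assumption for illustration, not the model used in the paper.

```python
import numpy as np

def sample_nonlinear_motion(velocity, accel, t):
    """Sample an assumed quadratic (constant-acceleration) motion model.

    velocity, accel: (H, W, 2) per-pixel velocity and acceleration fields,
    e.g., inferred once from events and images.
    t: normalized time in [0, 1] between the two key frames.
    Returns the displacement field from frame 0 to time t, usable for warping.
    """
    return velocity * t + 0.5 * accel * t ** 2
```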

5 citations



Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a simple yet effective feature optimization method for video object segmentation based on motion information, which constructs a two-branch deep network and uses computed motion cues (i.e., optical flow) to jointly optimize global and local interframe correlation information.

4 citations


Proceedings ArticleDOI
01 Jun 2022
TL;DR: Wang et al. as discussed by the authors introduced a novel Recurrent Vision Transformer (RViT) framework based on spatial-temporal representation learning to achieve the video action recognition task.
Abstract: Existing video understanding approaches, such as 3D convolutional neural networks and Transformer-based methods, usually process the videos in a clip-wise manner; hence huge GPU memory is needed and fixed-length video clips are usually required. To alleviate those issues, we introduce a novel Recurrent Vision Transformer (RViT) framework based on spatial-temporal representation learning to achieve the video action recognition task. Specifically, the proposed RViT is equipped with an attention gate to build interaction between the current frame input and the previous hidden state, thus aggregating the global-level interframe features through the hidden state temporally. RViT is executed recurrently to process a video, given the current frame and previous hidden state. The RViT can capture both spatial and temporal features because of the attention gate and recurrent execution. Besides, the proposed RViT can work on variable-length video clips properly without requiring large GPU memory thanks to the frame-by-frame processing flow. Our experimental results demonstrate that RViT can achieve state-of-the-art performance on various datasets for the video recognition task. Specifically, RViT can achieve a top-1 accuracy of 81.5% on Kinetics-400, 92.31% on Jester, 67.9% on Something-Something-V2, and an mAP accuracy of 66.1% on Charades.
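A toy version of the recurrent, attention-gated update described above: features of the current frame are blended with the previous hidden state through a learned gate, and the result becomes the new hidden state carried to the next frame. This is a schematic gate for illustration, not the RViT architecture.

```python
import torch
import torch.nn as nn

class AttentionGatedRecurrentCell(nn.Module):
    """Toy recurrent cell: a gate blends current-frame features with the
    previous hidden state, loosely mirroring the attention-gate idea."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, frame_feat, hidden):
        # frame_feat, hidden: (B, N_tokens, dim)
        x = torch.cat([frame_feat, hidden], dim=-1)
        g = torch.sigmoid(self.gate(x))           # how much new information to admit
        h_new = torch.tanh(self.update(x))        # candidate state from the current frame
        return g * h_new + (1.0 - g) * hidden     # aggregated inter-frame state
```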

4 citations


Journal ArticleDOI
21 Sep 2022-Sensors
TL;DR: This work proposes an efficient end-to-end VCS network, which integrates the measurement and reconstruction into one whole framework, and has higher reconstruction accuracy than existing video compression sensing networks and even performs well at measurement ratios as low as 0.01.
Abstract: Video compression sensing can use a few measurements to obtain the original video by reconstruction algorithms. There is a natural correlation between video frames, and how to exploit this feature becomes the key to improving the reconstruction quality. More and more deep learning-based video compression sensing (VCS) methods have been proposed. Some methods overlook interframe information, so they fail to achieve satisfactory reconstruction quality. Some use complex network structures to exploit the interframe information, but this increases the number of parameters and makes the training process more complicated. To overcome the limitations of existing VCS methods, we propose an efficient end-to-end VCS network, which integrates the measurement and reconstruction into one whole framework. In the measurement part, we train a measurement matrix rather than a pre-prepared random matrix, which fits the video reconstruction task better. An unfolded LSTM network is utilized in the reconstruction part, deeply fusing the intra- and interframe spatial–temporal information. The proposed method has higher reconstruction accuracy than existing video compression sensing networks and even performs well at measurement ratios as low as 0.01.
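The measurement part can be pictured as a single trainable linear operator applied to each vectorized block, with the number of measurements set by the measurement ratio. The block size, ratio and class name below are illustrative assumptions, and the unfolded-LSTM reconstruction network is not reproduced.

```python
import torch
import torch.nn as nn

class LearnedMeasurement(nn.Module):
    """Trainable measurement matrix for video compressive sensing.

    block_dim: number of pixels per block (e.g., 32*32).
    ratio:     measurement ratio, e.g., 0.01 keeps 1% of the coefficients.
    """

    def __init__(self, block_dim=32 * 32, ratio=0.01):
        super().__init__()
        m = max(1, int(round(block_dim * ratio)))
        # Learned jointly with the decoder, instead of a fixed random matrix.
        self.phi = nn.Linear(block_dim, m, bias=False)

    def forward(self, blocks):          # blocks: (B, block_dim)
        return self.phi(blocks)         # measurements y = Phi x
```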

3 citations


Journal ArticleDOI
Pinqing Yang, Lili Dong, He Xu, Hao Dai, Wenhai Xu 
TL;DR: A robust anti-jitter spatial–temporal trajectory consistency (ASTTC) method is proposed, the main idea of which is to improve detection accuracy using single-frame detection followed by multi-frame decision; it performs more satisfactorily than the existing methods.
Abstract: When detecting diverse infrared (IR) small maritime targets in complicated scenes, the existing methods run into trouble and deliver unsatisfactory performance. The main reasons are as follows: 1) affected by target characteristics, ambient temperature and so on, both bright and dark targets may exist in IR maritime images in practical applications, and 2) the spatial information and temporal correlation of targets are not fully exploited. To address these problems, we propose a robust anti-jitter spatial–temporal trajectory consistency (ASTTC) method, the main idea of which is to improve detection accuracy using single-frame detection followed by multi-frame decision. First, we innovatively design an adaptive local gradient variation (ALGV) descriptor, which combines a local dissimilarity measure with the gradient magnitude distribution to enhance the local contrast for both bright and dark targets so that the suspected targets can be robustly extracted. For the multi-frame decision, interframe displacement correction is necessary to eliminate the interference of IR imager motion and vibration on the target codeword. We use pyramid optical flow to track feature points extracted by the Shi-Tomasi detector to capture interframe registration coefficients. Then, the target position is corrected in the spatial–temporal domain. Finally, a robust spatial–temporal trajectory descriptor (STTD), which achieves target encoding and target trajectory consistency measurement, is designed to further confirm real targets and eliminate false targets. Experiments conducted on various real IR maritime image sequences demonstrate the applicability and robustness of ASTTC, which performs more satisfactorily than the existing methods in detection accuracy.
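The interframe displacement correction step can be sketched with standard OpenCV calls: Shi-Tomasi corners are tracked by pyramidal Lucas-Kanade optical flow, and a partial-affine fit yields registration coefficients that can be used to correct target positions. The parameter values are placeholders rather than the authors' settings.

```python
import cv2
import numpy as np

def estimate_interframe_registration(prev_gray, curr_gray):
    """Estimate a 2x3 affine transform correcting camera jitter between frames.

    prev_gray, curr_gray: uint8 grayscale frames of equal size.
    """
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                       qualityLevel=0.01, minDistance=10)
    if pts_prev is None:
        return np.eye(2, 3, dtype=np.float32)
    # Pyramidal Lucas-Kanade tracking of the Shi-Tomasi corners.
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   pts_prev, None)
    good = status.ravel() == 1
    if good.sum() < 3:
        return np.eye(2, 3, dtype=np.float32)
    M, _ = cv2.estimateAffinePartial2D(pts_prev[good], pts_curr[good])
    return M if M is not None else np.eye(2, 3, dtype=np.float32)
```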

2 citations


Proceedings ArticleDOI
06 Feb 2022
TL;DR: This work estimates optical flows between a pair of input frames and predicts future motions using two schemes: motion doubling and motion mirroring, and develops a synthesis network to generate a future frame from the warped frames.
Abstract: We propose a novel video frame extrapolation algorithm based on future motion estimation. First, we estimate optical flows between a pair of input frames and predict future motions using two schemes: motion doubling and motion mirroring. Then, we forward warp the input frames by employing the two kinds of predicted motion fields, respectively. Finally, we develop a synthesis network to generate a future frame from the warped frames. Experimental results show that the proposed algorithm outperforms recent video frame extrapolation algorithms on various datasets.
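One plausible reading of the two motion-prediction schemes, under a constant-velocity assumption, is shown below: "doubling" reuses the observed forward flow as the future motion, while "mirroring" negates the backward flow. This is an illustrative interpretation, not the paper's exact definition, and the forward warping and synthesis network are not included.

```python
import numpy as np

def predict_future_flows(flow_prev_to_curr, flow_curr_to_prev):
    """Predict candidate motion fields for extrapolating frame t+1.

    flow_prev_to_curr: (H, W, 2) optical flow from frame t-1 to frame t.
    flow_curr_to_prev: (H, W, 2) optical flow from frame t to frame t-1.
    Returns two candidate flows from frame t to the (unseen) frame t+1.
    """
    doubled = flow_prev_to_curr.copy()   # assume constant velocity: repeat the last motion
    mirrored = -flow_curr_to_prev        # reflect the backward motion forward in time
    return doubled, mirrored
```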

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors designed a single motion decoder to strengthen the efficiency of the motion encoder in the original motion-content network (MCnet), and used both edges with and without semantic meanings from the holistically-nested edge detection (HED) module as content details.

Proceedings ArticleDOI
23 May 2022
TL;DR: This work presents a deep neural network (DNN) based method to estimate the interframe correlation coefficients and the estimated coefficients are subsequently fed into multiframe filters to achieve noise reduction.
Abstract: While multiframe noise reduction filters, e.g., the multiframe Wiener and minimum variance distortionless response (MVDR) ones, have demonstrated great potential to improve both the subband and full-band signal-to-noise ratios (SNRs) by exploiting explicitly the interframe speech correlation, the implementation of such filters requires the knowledge of the interframe correlation coefficients for every subband, which are challenging to estimate in practice. In this work, we present a deep neural network (DNN) based method to estimate the interframe correlation coefficients, and the estimated coefficients are subsequently fed into multiframe filters to achieve noise reduction. Unlike existing DNN based methods, which output the enhanced speech directly, the presented method combines deep learning and traditional methods, which gives more flexibility to optimize or tune noise reduction performance. Experimental results are presented to justify the properties of the presented method.
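For one STFT subband, the estimated interframe correlation vector plugs into a multiframe MVDR filter as sketched below; the DNN estimator and the covariance estimation are outside the sketch, and variable names are assumptions.

```python
import numpy as np

def multiframe_mvdr_weights(gamma_x, phi_in):
    """Multiframe MVDR weights for one STFT subband.

    gamma_x: (L,) complex interframe speech correlation vector
             (normalized so that gamma_x[0] == 1), e.g., predicted by a DNN.
    phi_in:  (L, L) interference-plus-noise covariance matrix.
    Returns the (L,) filter: h = Phi^-1 gamma / (gamma^H Phi^-1 gamma).
    """
    num = np.linalg.solve(phi_in, gamma_x)
    return num / (np.conj(gamma_x) @ num)

def apply_multiframe_filter(h, noisy_frames):
    # noisy_frames: (L,) stacked current and past noisy STFT coefficients
    return np.conj(h) @ noisy_frames
```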

DOI
01 Jan 2022
TL;DR: In this article, a deep learning based intra prediction method using a CNN is proposed, which trains the network to predict the depth of the CTU with reduced computation and less time; the experimental results show a reduction in encoding time of about 71.3% compared to the existing method.
Abstract: Video and its compression have become prominent with the emergence of digital video technology and the common use of video acquisition devices. Traditional video compression needs to be upgraded with artificial intelligence, machine learning, neural networks, and deep learning. Apart from conventional signal processing, deep learning technologies are advantageous because they can perform content analysis rather than dealing only with neighboring pixels. The initial steps in video compression, intra/inter frame prediction, contribute a large share of the overall compression. The computational complexity of the existing intra prediction method is high. This paper proposes a deep learning based intra prediction method using a CNN. This deep depth prediction algorithm trains the network to predict the depth of the CTU with reduced computation and in less time. The experimental results show a reduction in encoding time of about 71.3% compared to the existing method.

Journal ArticleDOI
TL;DR: The developed motion correction method is insensitive to any specific tracer distribution pattern, thus enabling improved correction of motion artefacts in a variety of clinical applications of extended PET imaging of the brain without the need for fiducial markers.
Abstract: Aim/Introduction: Patient head motion poses a significant challenge when performing dynamic PET brain studies. In response, we developed a fast, robust, easily implementable and tracer-independent brain motion correction technique that facilitates accurate alignment of dynamic PET images. Materials and methods: Correction of head motion was performed using motion vectors derived by the application of Gaussian scale-space theory. A multiscale pyramid consisting of three different resolution levels (1/4x: coarse, 1/2x: medium, and 1x: fine) was applied to all image frames (37 frames, framing of 12 × 10s, 15 × 30s, 10 × 300s) of the dynamic PET sequence. Frame image alignment was initially performed at the coarse scale, which was subsequently used to initialise coregistration at the next finer scale, a process repeated until the finest possible scale, that is, the original resolution, was reached. In addition, as tracer distribution changes during the dynamic frame sequence, a mutual information (MI) score was used to identify the starting frame for motion correction that is characterised by a sufficiently similar tracer distribution with the reference (last) frame. Validation of the approach was performed based on a simulated F18-fluoro-deoxy-glucose (FDG) dynamic sequence synthesised from the digital Zubal phantom. Inter-frame motion was added to each dynamic frame (except the reference frame). Total brain voxel displacement based on the added motion was constrained to 25 mm, which included both translation (0–15 mm in x, y and z) and rotation (0–0.3 rad for each Euler angle). Twenty repetitions were performed for each dataset with arbitrarily simulated motion, resulting in 20 synthetic datasets, each consisting of 36 dynamic frames (frame 37 was the reference frame). Assessment of motion correction accuracy across the dynamic sequence was performed based on the uncorrected/residual displacement remaining after the application of our algorithm. To investigate the clinical utility of the developed algorithm, three clinical cases that underwent list-mode PET imaging utilising different tracers ([18F]-fluoro-deoxy-glucose [18F]FDG, [18F]-fluoroethyl-l-tyrosine [18F]FET, [11C]-alpha-methyl-tryptophan [11C]AMT), each characterised by a different temporal tracer distribution, were included in this study. Improvements in the Dice score coefficient (DSC) following frame alignment were evaluated as the correlation significance between the identified displacement for each frame of the clinical FDG, FET and AMT dynamic sequences. Results: Sub-millimetre accuracy (0.4 ± 0.2 mm) was achieved in the Zubal phantom for all frames after 5 min p.i., with early frames (30 s–180 s) displaying a higher residual displacement of ∼3 mm (3.2 ± 0.6 mm) due to differences in tracer distribution relative to the reference frame. The effect of these differences was also seen in MI scores; the MI plateau phase was reached at 35 s p.i., 2.0 min p.i. and 2.5 min p.i. at the coarse, medium and fine resolution levels, respectively. For the clinical images, a significant correlation between the identified (and corrected) displacement and the improvement in DSC score was seen in all dynamic studies (FET: R = 0.49, p < 0.001; FDG: R = 0.82, p < 0.001; AMT: R = 0.92, p < 0.001).
Conclusion: The developed motion correction method is insensitive to any specific tracer distribution pattern, thus enabling improved correction of motion artefacts in a variety of clinical applications of extended PET imaging of the brain without the need for fiducial markers.
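The MI criterion used to pick the starting frame can be illustrated with a histogram-based mutual information score between each early frame and the reference (last) frame; the bin count and threshold below are illustrative, not the values used in the study.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=64):
    """Histogram-based mutual information between two image frames/volumes."""
    hist_2d, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = hist_2d / hist_2d.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def first_correctable_frame(frames, reference, threshold):
    """Index of the first frame whose MI with the reference exceeds the
    threshold, i.e., where motion correction would be started."""
    for i, f in enumerate(frames):
        if mutual_information(f, reference) >= threshold:
            return i
    return len(frames) - 1
```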

Proceedings ArticleDOI
28 May 2022
TL;DR: Wang et al. as mentioned in this paper proposed a deep video compression method for P-frame in sub-sampled color spaces regarding the YUV420, which has been widely adopted in many state-of-the-art hybrid video compression standards, in an effort to achieve high compression performance.
Abstract: In this paper, we propose a deep video compression method for P-frame in sub-sampled color spaces regarding the YUV420, which has been widely adopted in many state-of-the-art hybrid video compression standards, in an effort to achieve high compression performance. We adopt motion estimation and motion compression to facilitate the inter prediction of the videos with YUV420 color format, shrinking the total data volume of motion information. Moreover, the motion compensation module on YUV420 is incorporated to enhance the quality of the compensated frame with the consideration of the resolution alignment in the sub-sampled color spaces. To explore the cross-component correlation, the residual encoder-decoder is accompanied by two head-branches and color information fusion. Additionally, a weighted loss emphasizing more on the Y component is utilized to enhance the compression efficiency. Experimental results show that the proposed method can realize 19.82% bit rate reductions on average compared to the deep video compression (DVC) method in terms of the combined PSNR and predominant gains on the Y component.
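The weighted loss emphasizing the Y component can be written as a simple weighted sum of per-plane distortions; the 6:1:1 weighting below is a common convention in video coding and is only an assumption here, as is the use of MSE.

```python
import torch
import torch.nn.functional as F

def yuv420_weighted_distortion(rec_y, rec_u, rec_v, org_y, org_u, org_v,
                               w_y=6.0, w_u=1.0, w_v=1.0):
    """Distortion term emphasizing the Y component for YUV420 coding.

    rec_*/org_*: reconstructed / original planes; in YUV420 the U and V
    planes are at half resolution in both dimensions, so they are kept as
    separate tensors rather than stacked with Y.
    """
    d_y = F.mse_loss(rec_y, org_y)
    d_u = F.mse_loss(rec_u, org_u)
    d_v = F.mse_loss(rec_v, org_v)
    return (w_y * d_y + w_u * d_u + w_v * d_v) / (w_y + w_u + w_v)
```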


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an adaptive cropping method, which utilizes the relative displacement between frames as a constraint when determining the optimal retargeting window, helping to preserve the temporal coherence in the retargeted videos.
Abstract: Content-aware cropping makes it easy to preserve the temporal coherence and guarantee the quality of visual objects during retargeting since the retargeting window is often rigid and single. We propose an adaptive cropping method for video retargeting, which utilizes the relative displacement between frames as a constraint when determining the optimal retargeting window, helping to preserve the temporal coherence in the retargeted videos. Moreover, we design a feedback mechanism to detect the jumping frames and smooth their importance, which effectively reduces the jitter of the resized video. Experiments show that the proposed method can achieve retargeted results of more desirable quality compared with existing related video retargeting approaches.

Journal ArticleDOI
TL;DR: In this article, an edge-based Joint Video Coding (eJVC) scheme was proposed to save up to 84.04% of the encoding complexity of the UAV video collector, where an attention network was used to distinguish foreground blocks from background blocks in the frame.
Abstract: The ultra-low latency transmission, high speed broadband and ubiquitous access points in modern cities have greatly alleviated the last mile problem of live streaming services. Yet, the sophisticated video compression of high definition videos is still inevitable before transmission, which leads to overwhelming workloads in lightweight video collectors. In this paper, we are motivated to explore the computation offloading of live streaming video compression for UAVs, which are characterized by limited capacity and fixed trajectories. In particular, a global motion model is utilized for inter frame residual coding instead of motion vector estimation, which usually accounts for more than 50% of the computational complexity in traditional H.264/H.265. We propose an edge-based Joint Video Coding (eJVC) scheme, which can save up to 84.04% of the encoding complexity of the UAV video collector. Specifically, an attention network is used to distinguish foreground blocks from background blocks in the frame. An LSTM neural network is used on the edge server to predict the auxiliary data needed by the UAV video collector for video coding. In addition, our proposed solution can also accommodate changes in movement direction when the control signal is notified in advance. Finally, a prototype system is implemented with a real world data set, and the experimental results show that our proposed solution can significantly reduce the computation time with little rate-distortion performance loss.

Journal ArticleDOI
Lue Chen, Xin Chen, Peng Rao, Lan Guo, Maotong Huang 
TL;DR: Zhang et al. as mentioned in this paper proposed a two-stage Interframe Registration and Spatial Local Contrast (IFR-SLC) based method for space-based infrared aerial target detection.


Journal ArticleDOI
TL;DR: In this article, a commonality modeling approach is proposed to provide a seamless blending between global and local homogeneity information in a video sequence by recursively partitioning the frame into rectangular regions based on the homogeneity of the entire frame.
Abstract: Video coding algorithms attempt to minimize the significant commonality that exists within a video sequence. Each new video coding standard contains tools that can perform this task more efficiently compared to its predecessors. Modern video coding systems are block-based, wherein commonality modeling is carried out only from the perspective of the block that needs to be coded next. In this work, we argue for a commonality modeling approach that can provide a seamless blending between global and local homogeneity information. For this purpose, at first the frame that needs to be coded is recursively partitioned into rectangular regions based on the homogeneity information of the entire frame. After that, each obtained rectangular region's feature descriptor is taken to be the average intensity of all the pixels within the region. In this way, the proposed approach generates a coarse representation of the current frame by minimizing both global and local commonality. This coarse frame is computationally simple and has a compact representation. It attempts to preserve important structural properties of the current frame, which can be observed subjectively as well as from the improved rate-distortion performance of a reference scalable HEVC coder that employs the coarse frame as a reference frame for encoding the current frame.
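The coarse-frame construction can be pictured as a homogeneity-driven recursive split: a rectangle is divided while it is inhomogeneous, and each final rectangle is filled with its mean intensity. The variance threshold, minimum size and split rule below are placeholders, not the paper's exact criterion.

```python
import numpy as np

def coarse_frame(frame, var_thresh=100.0, min_size=8):
    """Build a coarse representation by recursive homogeneity-based splitting.

    frame: 2-D float array. Rectangles whose intensity variance exceeds
    var_thresh are split in half along their longer side; homogeneous
    rectangles are replaced by their mean intensity (the region descriptor).
    """
    out = np.empty_like(frame, dtype=np.float32)

    def split(y0, y1, x0, x1):
        region = frame[y0:y1, x0:x1]
        h, w = y1 - y0, x1 - x0
        if region.var() <= var_thresh or (h <= min_size and w <= min_size):
            out[y0:y1, x0:x1] = region.mean()   # region descriptor = average intensity
            return
        if h >= w:
            mid = y0 + h // 2
            split(y0, mid, x0, x1); split(mid, y1, x0, x1)
        else:
            mid = x0 + w // 2
            split(y0, y1, x0, mid); split(y0, y1, mid, x1)

    split(0, frame.shape[0], 0, frame.shape[1])
    return out
```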

Proceedings ArticleDOI
21 Aug 2022
TL;DR: Wang et al. as discussed by the authors proposed a motion approximation scheme to utilize the motion vector between the reference frames, which is able to generate additional compensated frames to further refine the missing details in the target frame.
Abstract: In recent years, various methods have been proposed to tackle the compressed video quality enhancement problem. This task aims at restoring the distorted information in low-quality target frames from high-quality reference frames in the compressed video. Most methods for video quality enhancement contain two key stages, i.e., the synchronization and the fusion stages. The synchronization stage synchronizes the input frames by compensating the reference frames with the estimated motion vectors. The fusion stage reconstructs each frame with the compensated frames. However, the synchronization stage in previous works merely estimates the motion vector between the reference frame and the target frame. Due to the quality fluctuation of frames and region occlusion of objects, the missing detail information cannot be adequately replenished. To make full use of the temporal motion between input frames, we propose a motion approximation scheme to utilize the motion vector between the reference frames. It is able to generate additional compensated frames to further refine the missing details in the target frame. In the fusion stage, we propose a deep neural network to extract frame features with blended attention to the texture details and the quality discrepancy at different times. The experimental results show the effectiveness and robustness of our method.

Posted ContentDOI
31 Mar 2022
TL;DR: In this article, multi-scale feature-level fusion and one-shot non-linear inter-frame motion computation from events and images are proposed to improve the reconstruction quality.
Abstract: Recently, video frame interpolation using a combination of frame- and event-based cameras has surpassed traditional image-based methods both in terms of performance and memory efficiency. However, current methods still suffer from (i) brittle image-level fusion of complementary interpolation results, that fails in the presence of artifacts in the fused image, (ii) potentially temporally inconsistent and inefficient motion estimation procedures, that run for every inserted frame and (iii) low contrast regions that do not trigger events, and thus cause events-only motion estimation to generate artifacts. Moreover, previous methods were only tested on datasets consisting of planar and faraway scenes, which do not capture the full complexity of the real world. In this work, we address the above problems by introducing multi-scale feature-level fusion and computing one-shot non-linear inter-frame motion from events and images, which can be efficiently sampled for image warping. We also collect the first large-scale events and frames dataset consisting of more than 100 challenging scenes with depth variations, captured with a new experimental setup based on a beamsplitter. We show that our method improves the reconstruction quality by up to 0.2 dB in terms of PSNR and up to 15% in LPIPS score.

Proceedings ArticleDOI
13 Dec 2022
TL;DR: In this paper , a new way of video coding by modeling human pose from the already-encoded frames and using the generated frame at the current time as an additional forward-referencing frame is proposed.
Abstract: To exploit high temporal correlations in video frames of the same scene, the current frame is predicted from the already-encoded reference frames using block-based motion estimation and compensation techniques. While this approach can efficiently exploit the translation motion of the moving objects, it is susceptible to other types of affine motion and object occlusion/deocclusion. Recently, deep learning has been used to model the high-level structure of human pose in specific actions from short videos and then generate virtual frames in future time by predicting the pose using a generative adversarial network (GAN). Therefore, modelling the high-level structure of human pose makes it possible to exploit semantic correlation by predicting human actions and determining their trajectories. Video surveillance applications will benefit as stored “big” surveillance data can be compressed by estimating human pose trajectories and generating future frames through semantic correlation. This paper explores a new way of video coding by modelling human pose from the already-encoded frames and using the generated frame at the current time as an additional forward-referencing frame. It is expected that the proposed approach can overcome the limitations of the traditional backward-referencing frames by predicting the blocks containing the moving objects with lower residuals. Our experimental results show that the proposed approach can achieve on average up to 2.83 dB PSNR gain and 25.93% bitrate savings for high motion video sequences compared to standard video coding.


Proceedings ArticleDOI
16 Oct 2022
TL;DR: In this article , two advanced motion vector differences (MVD) coding methods are proposed to reduce the overhead for signaling MVD, and the precision of the MVD is implicitly determined based on the associated MV class and MVD magnitude.
Abstract: In AV1, for inter coded blocks with compound reference mode, motion vector differences (MVDs) are signaled for reference frame list 0 or list 1 separately with the same MVD precision regardless of the motion vector magnitude. In this paper, two advanced MVD coding methods are proposed. Firstly, to reduce the overhead for signaling MVD, the precision of the MVD is implicitly determined based on the associated MV class and MVD magnitude. Secondly, a new inter prediction mode is added to explore the correlation of MVDs between two reference frames, wherein one joint MVD is signaled for two reference frames. Experimental results demonstrate that, in the random-access common test condition, luma coding gains of around 1.1% in terms of BD-rate can be achieved on top of a recent release of the AOMedia Video Model (AVM).
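The first idea can be illustrated by a small rule that maps the MV class and MVD magnitude to a precision step, so that no precision flag needs to be signaled; the class boundaries and precision values below are invented for illustration and are not the AVM rule.

```python
def implicit_mvd_precision(mv_class, mvd_magnitude):
    """Derive MVD precision implicitly (illustrative rule, not the AVM spec).

    mv_class:      integer class of the associated motion vector magnitude.
    mvd_magnitude: magnitude of the motion vector difference in 1/8-pel units.
    Returns the precision step in 1/8-pel units (1 = 1/8 pel, 8 = full pel).
    """
    if mv_class <= 1 and mvd_magnitude < 32:
        return 1      # small MVs / small MVDs keep the finest precision
    if mv_class <= 3:
        return 2      # medium range: quarter-pel
    return 8          # large motion: full-pel precision saves MVD bits
```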

Proceedings ArticleDOI
01 Dec 2022
TL;DR: In this article, the authors proposed a next-frame prediction model that predicts the next frame using the previous five frames, which achieves a peak signal-to-noise ratio (PSNR) of 29.35 dB with a mean square error loss of 0.003 for the UCF101 dataset video sequence.
Abstract: High Efficiency Video Coding (HEVC) is a video compression standard that compresses video sequences with 50% less bit rate compared to its ancestor, the H.264 standard. In HEVC, the motion compensation block utilizes the motion vectors to generate the motion compensated frame. The motion vectors are generated using the motion estimation process that improves the efficiency of HEVC at the expense of high computational complexity. The next-frame prediction technique can be used to predict the motion compensated frame. This paper proposes a next-frame prediction model that predicts the next frame using the previous five frames. The experimental results show that the proposed method achieves a Peak Signal-to-Noise Ratio (PSNR) of 29.35 dB with a mean square error loss of 0.003 for the UCF101 dataset video sequence, which is better than state-of-the-art methods.


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper proposed a motion-appearance-aware network (MAAN) for learning robust feature representations by mining information at multiple time scales, which can adaptively adjust information elements to refine the motion feature.
Abstract: Object change detection (OCD), which aims to segment moving objects from an input frame, has attracted growing attention. Most OCD algorithms rely on scene diversity or ignore the interframe spatiotemporal structural dependence, which limits their applicability. In this paper, we propose a motion-appearance-aware network (MAAN) for learning robust feature representations. Specifically, a module for mining information at multiple time scales, which can adaptively adjust information elements, is designed to refine the motion feature. Meanwhile, salient object knowledge is obtained with the help of the extracted appearance features. To enhance the semantic consistency and trim redundant connections, we construct a fusion module called multi-view feature evolution, which effectively fuses motion and appearance information by global communication and local guidance, respectively. Moreover, we develop two strategies for obtaining uniform and consistent moving objects during information propagation. One is to feed the predicted mask of the previous frame into the decoder, and the other is to match different levels of motion cues at multiple time scales to the decoder. Finally, extensive experiments on four public datasets (i.e., LASIESTA, CDnet2014, INO, and AICD) indicate that the proposed approach outperforms other methods.