
Showing papers on "Residual frame published in 2023"



Proceedings ArticleDOI
04 Jun 2023
TL;DR: Wang et al. proposed a frame-level detection method based on a de-blocking filtering feature to detect RI frames in double-compressed HEVC videos with a shifted GOP structure.
Abstract: Instead of detecting whether a whole video sequence is double compressed, a frame-level detection result provides more precise information for video forensic tasks, such as locating tamper points and restoring compression history. But research on frame-level double-compression detection is still in its infancy. Therefore, we aim to provide a frame-level detection method for HEVC videos in this paper. The relocated I (RI) frame belongs to a different GOP group than its reference frame at the first compression and may cause more severe blocking effects than other types of P frames. Hence, this paper proposes an algorithm based on a de-blocking filtering feature to detect RI frames in double-compressed HEVC videos with a shifted GOP structure. Firstly, the abnormal traces of the de-blocking filtering parameters in the RI frame, namely boundary strength, filtering switch and filtering mode, are analyzed. Then, the de-blocking filtering feature is constructed by mapping the different combinations of the three parameters into a single numerical value. Finally, the de-blocking filtering feature of the video clips is fed to the proposed mini_MobileViT network, a combination of a Convolutional Neural Network (CNN) and a Transformer, to learn spatial and temporal representations that identify RI frames. Experimental results demonstrate the advantages of the proposed algorithm in detecting RI frames in double-compressed HEVC videos. Compared with He's state-of-the-art method, the proposed method improves RI-frame detection accuracy by 1.72%; compared with other traditional methods, the improvement exceeds 10%.
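
The feature-construction step, mapping each combination of the three de-blocking parameters to a single number, can be sketched as below. This is a minimal illustration assuming HEVC's boundary strength takes values 0-2 and that the filtering switch and filtering mode are binary flags; the paper's exact value ranges and mapping are not given in the abstract.

```python
import numpy as np

# Assumed value ranges: HEVC boundary strength is 0-2; we treat the
# filtering switch and filtering mode (normal/strong) as binary flags.
BS_LEVELS, SWITCH_LEVELS, MODE_LEVELS = 3, 2, 2

def deblocking_feature(bs, fsw, fmode):
    """Map each (boundary strength, filter switch, filter mode) triple
    to a single numerical value via mixed-radix encoding, so every
    parameter combination receives a distinct feature code."""
    return (bs * SWITCH_LEVELS + fsw) * MODE_LEVELS + fmode

# Per-block parameter maps for one frame (e.g. one value per filtered edge).
bs    = np.random.randint(0, BS_LEVELS,     (16, 16))
fsw   = np.random.randint(0, SWITCH_LEVELS, (16, 16))
fmode = np.random.randint(0, MODE_LEVELS,   (16, 16))

feature_map = deblocking_feature(bs, fsw, fmode)   # values in [0, 11]
print(feature_map.shape, feature_map.max())
```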


Journal ArticleDOI
TL;DR: In this paper, the authors propose three interpolation-like methods to combat error accumulation and show that all three reduce tracking errors in frame-to-frame trackers; they also show that the CNN-based DeepLabCut (DLC) tracker outperforms the frame-to-frame trackers.
Abstract: Tracking points in ultrasound (US) videos can be especially useful to characterize tissues in motion. Tracking algorithms that analyze successive video frames, such as variations of Optical Flow and Lucas-Kanade (LK), exploit frame-to-frame temporal information to track regions of interest. In contrast, convolutional neural-network (CNN) models process each video frame independently of neighboring frames. In this paper, we show that frame-to-frame trackers accumulate error over time. We propose three interpolation-like methods to combat error accumulation and show that all three methods reduce tracking errors in frame-to-frame trackers. On the neural-network end, we show that a CNN-based tracker, DeepLabCut (DLC), outperforms all four frame-to-frame trackers when tracking tissues in motion. DLC is more accurate than the frame-to-frame trackers and less sensitive to variations in types of tissue movement. The only caveat found with DLC comes from its non-temporal tracking strategy, leading to jitter between consecutive frames. Overall, when tracking points in videos of moving tissue, we recommend using DLC when prioritizing accuracy and robustness across movements in videos, and using LK with the proposed error-correction methods for small movements when tracking jitter is unacceptable.
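
The abstract does not spell out the three interpolation-like corrections, but one representative drift-correction idea, linearly redistributing the closure error between trusted anchor positions along a frame-to-frame track, can be sketched as follows. The anchoring strategy and names here are assumptions, not the authors' exact methods.

```python
import numpy as np

def correct_drift(track, anchor_start, anchor_end):
    """Interpolation-like drift correction: given a frame-to-frame track
    of one point, shape (T, 2), and trusted positions at the first and
    last frame (e.g. re-detected landmarks), linearly redistribute the
    closure error across the track so accumulated drift is removed."""
    T = len(track)
    start_err = anchor_start - track[0]
    end_err = anchor_end - track[-1]
    w = np.linspace(0.0, 1.0, T)[:, None]      # weight: 0 at start, 1 at end
    return track + (1 - w) * start_err + w * end_err

# Toy example: a drifting track of one point over 5 frames.
track = np.array([[0.0, 0.0], [1.2, 0.1], [2.5, 0.3], [3.9, 0.6], [5.4, 1.0]])
fixed = correct_drift(track, anchor_start=np.array([0.0, 0.0]),
                      anchor_end=np.array([4.0, 0.0]))
print(fixed)   # endpoints now match the anchors; interior points shifted smoothly
```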

Journal ArticleDOI
TL;DR: Wang et al. proposed a transformer-based video summarization method named the spatiotemporal vision transformer (STVT), composed of three dominant components, an embedded sequence module, a temporal inter-frame attention (TIA) encoder and a spatial intra-frame attention (SIA) encoder, with a multi-frame loss driving the network in an end-to-end trainable manner.
Abstract: Video summarization aims to generate a compact summary of the original video for efficient video browsing. To provide video summaries that are consistent with human perception and contain important content, supervised learning-based video summarization methods have been proposed. These methods aim to learn important content from the continuous frame information of human-created summaries. However, simultaneously considering both inter-frame correlations among non-adjacent frames and the intra-frame attention that draws human interest is rarely discussed in recent methods for frame-importance representation. To address these issues, we propose a novel transformer-based method named the spatiotemporal vision transformer (STVT). The STVT is composed of three dominant components: the embedded sequence module, the temporal inter-frame attention (TIA) encoder, and the spatial intra-frame attention (SIA) encoder. The embedded sequence module generates the embedded sequence by fusing the frame embedding, index embedding and segment class embedding to represent the frames. The temporal inter-frame correlations among non-adjacent frames are learned by the TIA encoder with a multi-head self-attention scheme. Then, the spatial intra-frame attention of each frame is learned by the SIA encoder. Finally, a multi-frame loss is computed to drive the learning of the network in an end-to-end trainable manner. By simultaneously using both inter-frame and intra-frame information, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.
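
The TIA encoder's core operation, multi-head self-attention over per-frame embeddings augmented with index embeddings, can be sketched in PyTorch as below. Dimensions, layer choices, and the omission of the segment class embedding are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TemporalInterFrameAttention(nn.Module):
    """Sketch of a TIA-style encoder block: multi-head self-attention over
    per-frame embeddings, letting non-adjacent frames attend to each other."""
    def __init__(self, dim=256, heads=8, max_frames=512):
        super().__init__()
        self.index_emb = nn.Embedding(max_frames, dim)   # frame index embedding
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_emb):                        # (B, T, dim)
        B, T, _ = frame_emb.shape
        idx = torch.arange(T, device=frame_emb.device)
        x = frame_emb + self.index_emb(idx)              # add temporal position info
        out, _ = self.attn(x, x, x)                      # inter-frame correlations
        return self.norm(x + out)                        # residual + layer norm

tia = TemporalInterFrameAttention()
feats = tia(torch.randn(2, 64, 256))                     # 2 videos, 64 frames each
print(feats.shape)                                       # torch.Size([2, 64, 256])
```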

Journal ArticleDOI
TL;DR: In this paper, a fast video-instance lane detection network called MT-Net, based on space-time memory and template matching, is proposed to mitigate jitter from scene changes and other disturbances.

Posted ContentDOI
25 Apr 2023
TL;DR: In this paper, a pixel-wise adaptive depth sampling module guided by single-frame depth is introduced to train the multi-frame model, and a minimum-reprojection-based distillation loss is used to transfer knowledge from the multi-frame depth network to the single-frame network.
Abstract: Although both self-supervised single-frame and multi-frame depth estimation methods only require unlabeled monocular videos for training, the information they leverage varies because single-frame methods mainly rely on appearance-based features while multi-frame methods focus on geometric cues. Considering the complementary information of single-frame and multi-frame methods, some works attempt to leverage single-frame depth to improve multi-frame depth. However, these methods can neither exploit the difference between single-frame depth and multi-frame depth to improve multi-frame depth nor leverage multi-frame depth to optimize single-frame depth models. To fully utilize the mutual influence between single-frame and multi-frame methods, we propose a novel self-supervised training framework. Specifically, we first introduce a pixel-wise adaptive depth sampling module guided by single-frame depth to train the multi-frame model. Then, we leverage the minimum reprojection based distillation loss to transfer the knowledge from the multi-frame depth network to the single-frame network to improve single-frame depth. Finally, we regard the improved single-frame depth as a prior to further boost the performance of multi-frame depth estimation. Experimental results on the KITTI and Cityscapes datasets show that our method outperforms existing approaches in the self-supervised monocular setting.
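
The distillation step described in the abstract can be sketched as a per-pixel loss in which the single-frame depth is pulled toward the multi-frame depth only where the multi-frame prediction reprojects better. Tensor names and shapes below are illustrative assumptions; the paper's exact loss formulation is not given in the abstract.

```python
import torch

def min_reproj_distillation(single_depth, multi_depth,
                            reproj_err_single, reproj_err_multi):
    """Minimum-reprojection distillation sketch: distill multi-frame depth
    into the single-frame network only at pixels where the multi-frame
    (teacher) prediction has the lower reprojection error.
    All tensors are assumed to have shape (B, 1, H, W)."""
    teacher_better = (reproj_err_multi < reproj_err_single).float()
    l1 = torch.abs(single_depth - multi_depth.detach())   # no grads into teacher
    return (teacher_better * l1).sum() / teacher_better.sum().clamp(min=1.0)

B, H, W = 2, 96, 320
loss = min_reproj_distillation(torch.rand(B, 1, H, W, requires_grad=True),
                               torch.rand(B, 1, H, W),
                               torch.rand(B, 1, H, W),
                               torch.rand(B, 1, H, W))
print(loss.item())
```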

Posted ContentDOI
17 May 2023
TL;DR: Shi et al. proposed an event-and-frame-based video frame interpolation method named IDO-VFI that assigns varying amounts of computation to different sub-regions via optical-flow guidance.
Abstract: Video frame interpolation aims to generate high-quality intermediate frames from boundary frames and increase the frame rate. While existing linear, symmetric and nonlinear models are used to compensate for the missing inter-frame motion, they cannot reconstruct real motions. Event cameras, however, are ideal for capturing inter-frame dynamics with their extremely high temporal resolution. In this paper, we propose an event-and-frame-based video frame interpolation method named IDO-VFI that assigns varying amounts of computation to different sub-regions via optical-flow guidance. The proposed method first estimates the optical flow based on frames and events, and then decides whether to further calculate the residual optical flow in each sub-region via a Gumbel gating module according to the optical-flow amplitude. Intermediate frames are eventually generated through a concise Transformer-based fusion network. Our proposed method maintains high-quality performance while reducing computation time and computational effort by 10% and 17%, respectively, on the Vimeo90K dataset, compared with a unified process over the whole region. Moreover, our method outperforms state-of-the-art frame-only and frames-plus-events methods on multiple video frame interpolation benchmarks. Codes and models are available at https://github.com/shicy17/IDO-VFI.
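
The Gumbel gating idea, a hard but differentiable per-region decision driven by optical-flow amplitude, can be sketched as follows. The mapping from flow magnitude to gate logits is an assumption for illustration; in the paper the gating module is learned.

```python
import torch
import torch.nn.functional as F

def region_gates(flow, patch=32, tau=1.0):
    """Sketch of Gumbel gating guided by optical-flow amplitude: each
    (patch x patch) sub-region gets a hard, differentiable decision on
    whether residual optical flow should be computed there."""
    mag = flow.norm(dim=1, keepdim=True)                  # (B,1,H,W) flow magnitude
    region_mag = F.avg_pool2d(mag, patch)                 # mean magnitude per region
    logits = torch.cat([region_mag, -region_mag], dim=1)  # [refine, skip] scores
    hard = F.gumbel_softmax(logits.permute(0, 2, 3, 1),   # sample one-hot decisions
                            tau=tau, hard=True)
    return hard[..., 0]                                   # 1 = compute residual flow

flow = torch.randn(1, 2, 128, 128)                        # dense flow field (dx, dy)
gates = region_gates(flow)
print(gates.shape, int(gates.sum().item()), "regions selected for refinement")
```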

Proceedings ArticleDOI
21 Apr 2023
TL;DR: In this paper, a synchronous decryption scheme for real-time encrypted voice communication based on frame IDs is proposed; it reduces the bit error rate (BER) of the voice signal after waveform demodulation without reducing the transmission rate.
Abstract: In the process of data decryption, real-time encrypted voice communication based on the waveform-symbol-mapping method is subject to bit errors and data loss. To address this problem, this paper proposes a synchronous decryption scheme for real-time encrypted voice communication based on frame IDs. On top of frame-by-frame encryption and decryption of the voice data, a voice frame format with good error-correction capability that satisfies the coding-length constraint is designed according to the error characteristics of the waveform-symbol-mapping method, and the frame ID and frame pattern are obtained and confirmed by a voice-frame determination algorithm for synchronized decryption. The experimental results show that, compared with other schemes, the proposed scheme reduces the bit error rate (BER) of the voice signal after waveform demodulation without reducing the transmission rate.
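
One way to see why a frame ID enables synchronous, loss-tolerant decryption is that each frame's keystream can be derived independently from the shared key and the ID, so a lost frame does not desynchronize its neighbors. The sketch below uses an HMAC-based keystream with XOR as a generic stand-in; the paper's actual cipher and frame format are not specified in the abstract.

```python
import hmac
import hashlib

def frame_keystream(key: bytes, frame_id: int, length: int) -> bytes:
    """Derive a per-frame keystream from the shared key and the frame ID,
    so each voice frame decrypts independently of its neighbors."""
    stream, counter = b"", 0
    while len(stream) < length:
        msg = frame_id.to_bytes(4, "big") + counter.to_bytes(4, "big")
        stream += hmac.new(key, msg, hashlib.sha256).digest()
        counter += 1
    return stream[:length]

def crypt_frame(key: bytes, frame_id: int, payload: bytes) -> bytes:
    """XOR with the frame keystream; the same call encrypts and decrypts."""
    ks = frame_keystream(key, frame_id, len(payload))
    return bytes(a ^ b for a, b in zip(payload, ks))

key = b"shared-session-key"
ct = crypt_frame(key, frame_id=42, payload=b"voice frame payload")
print(crypt_frame(key, 42, ct))    # round-trips to the original payload
```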

Proceedings ArticleDOI
04 Jun 2023
TL;DR: In this article, a dynamic mixture of explicit and implicit motion compensation is proposed to avoid carrying the false edges/details, caused by inaccurate optical flow in the predicted frame, into the residual.
Abstract: Learned video coding employs explicit motion compensation (MC) with neural networks to predict the original frame from its reference frame and to compress its residual from the predicted frame, where neural networks are optimized with rate-distortion trade-offs. However, good predictions are hard to find or even do not exist due to fast motions, dis-occlusions, and coding errors of the reference frame. To avoid the problem of carrying false edges/details caused by inaccurate optical flow in the predicted frame to the residual, we propose a dynamic mixture of explicit and implicit motion compensations, where implicitness means that the encoding and decoding of the original frame are conditioned on the predicted frame in pixel and latent domains, respectively. The proposed mixture model saves up to 30% bitrate over the baseline and achieves state-of-the-art performance.
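
The mixture idea can be sketched as a learned per-pixel mask that controls how much the codec relies on the flow-warped prediction (explicit residual coding) versus features conditioned on that prediction (implicit conditional coding). Layer sizes and the exact blending rule below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HybridMC(nn.Module):
    """Sketch of mixing explicit and implicit motion compensation with a
    learned per-pixel confidence mask over the warped prediction."""
    def __init__(self, ch=64):
        super().__init__()
        self.cond = nn.Conv2d(3, ch, 3, padding=1)        # implicit conditioning path
        self.mask = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1), nn.Sigmoid()) # per-pixel confidence in [0,1]

    def forward(self, frame, warped_pred):
        m = self.mask(torch.cat([frame, warped_pred], dim=1))
        residual = frame - m * warped_pred                # explicit part, gated by mask
        cond_feat = self.cond(warped_pred)                # implicit part: features
        return residual, cond_feat, m                     # conditioned on the prediction

net = HybridMC()
res, feat, m = net(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(res.shape, feat.shape, round(m.mean().item(), 3))
```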

Journal ArticleDOI
TL;DR: In this paper, a deep neural network, the Frame Prediction Network (FPNet-OF), is proposed to predict future video frames by adaptively fusing future object motion with a future-frame generator.
Abstract: Video prediction is the problem of generating future frames by exploiting the spatiotemporal correlation in the past frame sequence. It is one of the crucial issues in computer vision and has many real-world applications, mainly focused on predicting future scenarios to avoid undesirable outcomes. However, modeling future image content and objects is challenging due to the dynamic evolution and complexity of the scene, such as occlusions, camera movements, delay and illumination changes. Direct frame synthesis and optical-flow estimation are the common approaches, but prior work has mainly pursued one or the other. Both have limitations: direct frame synthesis usually yields blurry predictions due to complex pixel distributions in the scene, while optical-flow estimation usually produces artifacts due to large object displacements or obstructions in the clip. In this paper, we construct a deep neural network, the Frame Prediction Network (FPNet-OF), with multiple-branch inputs (optical flow and original frame) to predict the future video frame by adaptively fusing the future object motion with the future-frame generator. The key idea is to jointly optimize direct RGB frame synthesis and dense optical-flow estimation to obtain a superior video prediction network. Using various real-world datasets, we experimentally verify that our proposed framework produces higher-quality video frames than other state-of-the-art frameworks.
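
The fusion of the two branches can be sketched as below: the flow branch backward-warps the last frame, the synthesis branch predicts RGB directly, and a per-pixel mask blends them. The mask here is random purely for illustration (in the paper it would be produced by the network), and FPNet-OF's internals are not detailed in the abstract.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a frame (B,3,H,W) with dense flow (B,2,H,W) via grid_sample."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float().to(frame.device)   # (H,W,2), x first
    grid = grid + flow.permute(0, 2, 3, 1)                          # add (dx, dy)
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1                   # normalize to [-1,1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1
    return F.grid_sample(frame, grid, align_corners=True)

# Adaptive fusion: blend the flow-warped frame with a directly synthesized one.
B, H, W = 1, 64, 64
warped = warp(torch.rand(B, 3, H, W), torch.randn(B, 2, H, W))  # motion branch
synthesized = torch.rand(B, 3, H, W)                            # synthesis branch
mask = torch.sigmoid(torch.randn(B, 1, H, W))                   # stand-in for a learned mask
prediction = mask * warped + (1 - mask) * synthesized
print(prediction.shape)
```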

Journal ArticleDOI
TL;DR: In this paper, a Hierarchical Video Compression Scheme (HVCS) with three hierarchical quality layers and a Recurrent Quality Enhancement (RQEN) network is proposed; hierarchical quality benefits frame-coding efficiency, with high-quality information improving frame compression at the encoding stage and enhancing low-quality frames at the decoding stage.
Abstract: Background: This paper proposes a Hierarchical Video Compression Scheme (HVCS) with three hierarchical quality layers and a Recurrent Quality Enhancement (RQEN) network. Image compression techniques are used to compress the frames in the first layer, which have the highest quality. Using a high-quality frame as reference, a Bi-Directional Deep Compression (BDC) network is proposed for frame compression in the second layer with considerable quality. In the third layer, low quality is used for frame compression with a Single Motion Compression (SMC) network, which proposes a single motion map for motion estimation across multiple frames; as a result, SMC provides motion information using fewer bits. At the decoding stage, a weighted RQEN network is developed that takes both the bit stream and the compressed frames as inputs. In the RQEN cell, the update signal and memory are weighted using quality features so that multi-frame information positively influences enhancement. HVCS thus adopts hierarchical quality to benefit frame-coding efficiency: high-quality information improves frame compression at the encoding stage and enhances the low-quality frames at the decoding stage. Results: Tables 1 and 2 present the rate-distortion values on both video datasets. PSNR and MS-SSIM are used for quality evaluation, and bit-rates are measured in bits per pixel (bpp). Table 1 shows better PSNR performance for the proposed compression model than other methods such as Chao et al. [7] or optimized methods [1], outperforming H.265 on the standard JCT-VC dataset; the proposed scheme also yields better bit-rate performance than H.265 on UVG. As shown in Table 2, the MS-SSIM evaluation finds the proposed scheme better than all other learned approaches, as well as better than H.264 and H.265. In terms of bit-rate performance on UVG, Lee et al. [11] is comparable, while Guo et al. [10] performs below H.265; on JCT-VC, DVC [10] is merely comparable with H.265, whereas the rate-distortion performance of HVCS is clearly better than H.265. Furthermore, the Bjøntegaard Delta Bit-Rate (BDBR) [47] is computed with H.265 as the anchor: BDBR measures the average bit-rate difference relative to the anchor, with lower values indicating better performance [48]. Table 3 reports BDBR in terms of PSNR and MS-SSIM, where negative numbers indicate bit-rate reduction relative to the anchor and bold numbers mark the best results achieved by learned methods; these results outperform H.265 and provide a fair comparison with the (MS-SSIM & PSNR) optimized DVC [10] against the H.265 anchor. Conclusion: This work proposes a learned video compression scheme utilizing hierarchical frame quality with recurrent enhancement. Specifically, frames are divided into hierarchical levels 1, 2 and 3 of decreasing quality: image compression methods are used for the first layer, while BDC and SMC are proposed for layers 2 and 3, respectively. The RQEN network takes the compressed frames, their quality features and bit-rate information as inputs for multi-frame enhancement. Experimental results validated the efficiency of the proposed HVCS compression scheme. As in other compression techniques, the frame structure is manually set in this scheme; a promising direction for future work is to develop DNN networks that automatically learn the prediction and the hierarchy.
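
The three-layer hierarchy can be illustrated with a simple frame-to-layer assignment within each GOP, sketched below. The GOP length and which positions map to layers 2 and 3 are assumptions; the paper states only that the frame structure is set manually.

```python
def assign_quality_layers(num_frames, gop=8):
    """Sketch of the three-layer hierarchy: the first frame of each GOP is
    layer 1 (image-coded, highest quality), the middle frame layer 2 (BDC,
    considerable quality), and the remaining frames layer 3 (SMC, low quality)."""
    layers = []
    for i in range(num_frames):
        pos = i % gop
        if pos == 0:
            layers.append(1)          # image compression, highest quality
        elif pos == gop // 2:
            layers.append(2)          # bi-directional deep compression
        else:
            layers.append(3)          # single-motion compression
    return layers

print(assign_quality_layers(16))      # [1, 3, 3, 3, 2, 3, 3, 3, 1, 3, 3, 3, 2, 3, 3, 3]
```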

Proceedings ArticleDOI
01 Feb 2023
TL;DR: In this article, the authors propose a scheme to reduce frame overhead in short-frame burst orthogonal frequency division multiplexing (OFDM) communication systems, where both the guard sequence and the preamble sequence are used for automatic gain control (AGC) adjustment to shorten the total frame length.
Abstract: In a short-frame burst orthogonal frequency division multiplexing (OFDM) communication system, time resources are scarce. In this paper, we propose a scheme to reduce the frame overhead: both the guard sequence and the preamble sequence are used for automatic gain control (AGC) adjustment, shortening the total frame length. The proposed frame structure design increases the average data transmission rate by up to 14%. In addition, we compare the proposed and traditional frame structures in timing synchronization and frequency-offset estimation via simulation, and the results confirm the feasibility and effectiveness of the scheme.
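
A back-of-the-envelope check shows how shortening the frame raises the average data rate: if part of the AGC settling time moves from a dedicated preamble segment into the existing guard sequence, the total frame shrinks while the payload stays fixed. The symbol counts below are illustrative assumptions chosen to land near the paper's reported 14% best case.

```python
def avg_rate_gain(payload, preamble, guard, agc_saved):
    """Relative gain in average data rate when `agc_saved` preamble symbols
    are eliminated by reusing the guard sequence for AGC adjustment.
    Rate = payload / frame_length, so the gain is old_len / new_len - 1."""
    old_len = guard + preamble + payload
    new_len = guard + (preamble - agc_saved) + payload
    return old_len / new_len - 1.0

# Assumed example: 64-symbol payload, 48-symbol preamble, 16-symbol guard,
# with 16 preamble symbols saved by reusing the guard for AGC.
print(f"{avg_rate_gain(64, 48, 16, 16):.1%}")   # ~14.3% higher average data rate
```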