
Showing papers on "Inter frame" published in 2021


Journal ArticleDOI
Xiaoqin Zhang, Runhua Jiang, Tao Wang, Pengcheng Huang, Li Zhao
TL;DR: An attention-based interframe compensation scheme that replaces frames in blurry sequences with newly restored frames and estimates temporal patterns among the replaced sequence to restore the whole sequence, together with an adaptive residual block that dynamically fuses multi-level features via learning location-specific weights.

24 citations


Journal ArticleDOI
TL;DR: A novel video CS framework based on a convolutional neural network (dubbed VCSNet) to explore both intraframe and interframe correlations and provides better performance over state-of-the-art video CS methods and deep learning-based image CS methods in both objective and subjective reconstruction quality.
Abstract: Recently, a few image compressed sensing (CS) methods based on deep learning have been developed, which achieve remarkable reconstruction quality with low computational complexity. However, these existing deep learning-based image CS methods focus on exploring intraframe correlation while ignoring interframe cues, resulting in inefficiency when directly applied to video CS. In this paper, we propose a novel video CS framework based on a convolutional neural network (dubbed VCSNet) to explore both intraframe and interframe correlations. Specifically, VCSNet divides the video sequence into multiple groups of pictures (GOPs), of which the first frame is a keyframe that is sampled at a higher sampling ratio than the other nonkeyframes. In a GOP, the block-based framewise sampling by a convolution layer is proposed, which leads to the sampling matrix being automatically optimized. In the reconstruction process, the framewise initial reconstruction by using a linear convolutional neural network is first presented, which effectively utilizes the intraframe correlation. Then, the deep reconstruction with multilevel feature compensation is proposed, which compensates the nonkeyframes with the keyframe in a multilevel feature compensation manner. Such multilevel feature compensation allows the network to better explore both intraframe and interframe correlations. Extensive experiments on six benchmark videos show that VCSNet provides better performance over state-of-the-art video CS methods and deep learning-based image CS methods in both objective and subjective reconstruction quality.
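
As a rough illustration of the block-based framewise sampling described above, the sketch below (not the authors' code; the block size, sampling ratios, and grayscale input are assumptions) shows how a convolution whose kernel size and stride equal the block size acts as a learnable, block-wise sampling matrix.

```python
import torch
import torch.nn as nn

B = 32                       # block size (assumption)
r_key, r_nonkey = 0.5, 0.1   # sampling ratios for key / non-key frames (assumptions)

def sampling_layer(ratio, block=B):
    # One output channel per measurement: a conv with kernel = stride = block
    # takes round(ratio * block * block) measurements from every BxB block,
    # and its weights play the role of the automatically optimized sampling matrix.
    m = max(1, round(ratio * block * block))
    return nn.Conv2d(1, m, kernel_size=block, stride=block, bias=False)

key_sampler, nonkey_sampler = sampling_layer(r_key), sampling_layer(r_nonkey)
frame = torch.randn(1, 1, 256, 256)      # one grayscale frame (assumption)
print(key_sampler(frame).shape, nonkey_sampler(frame).shape)
# torch.Size([1, 512, 8, 8]) torch.Size([1, 102, 8, 8])
```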

23 citations


Journal ArticleDOI
TL;DR: A digital forensic technique is proposed to detect and localize inter-frame forgeries in surveillance videos using compressed-domain video footprints, i.e., prediction footprint variation and variation of motion vectors.

17 citations


Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper combined autoencoding neural network and AdaBoost to construct a fast pedestrian detection algorithm aiming at the problem that a single high-level output feature map has insufficient ability to express pedestrian features and existing methods cannot effectively select appropriate multilevel features.
Abstract: In order to solve the problem of low accuracy of pedestrian detection from real traffic cameras and the high missed-detection rate for small-target pedestrians, this paper combines an autoencoding neural network and AdaBoost to construct a fast pedestrian detection algorithm. Aiming at the problem that a single high-level output feature map has insufficient ability to express pedestrian features and that existing methods cannot effectively select appropriate multilevel features, this paper improves the traditional AdaBoost algorithm structure, that is, the sample weight update formula and the strong classifier output formula are reset, and a two-input AdaBoost-DBN classification algorithm is proposed. Moreover, in view of the problem that the fused video is not played smoothly, this paper considers the motion information of the video object, performs pixel interpolation by motion compensation, and restores the frame rate of the original video by reconstructing the dropped interframe images. Through experimental research, we can see that the algorithm constructed in this paper is effective.
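
The frame-rate restoration step lends itself to a small sketch. The code below is a generic bidirectional motion-compensated interpolation (the block size, integer-pixel sampling, and halving of motion vectors are simplifying assumptions, not the paper's exact procedure).

```python
import numpy as np

def interpolate_middle_frame(prev, nxt, mvs, block=16):
    """Reconstruct the dropped frame between prev and nxt.

    mvs[i, j] = (dy, dx): motion of block (i, j) from prev to nxt (assumed layout).
    """
    h, w = prev.shape
    mid = np.zeros_like(prev, dtype=np.float32)
    for by, i in enumerate(range(0, h, block)):
        for bx, j in enumerate(range(0, w, block)):
            dy, dx = mvs[by, bx]
            # content of the middle frame sits half a vector back in prev
            # and half a vector forward in nxt
            y0 = int(np.clip(i - dy / 2, 0, h - block)); x0 = int(np.clip(j - dx / 2, 0, w - block))
            y1 = int(np.clip(i + dy / 2, 0, h - block)); x1 = int(np.clip(j + dx / 2, 0, w - block))
            mid[i:i+block, j:j+block] = 0.5 * (prev[y0:y0+block, x0:x0+block].astype(np.float32)
                                               + nxt[y1:y1+block, x1:x1+block].astype(np.float32))
    return mid.astype(prev.dtype)
```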

14 citations


Journal ArticleDOI
TL;DR: In this paper, a bidirectional-3D LSTM network was developed to fully utilize both local and nonlocal temporal information in the 4D dynamic image data for motion detection.
Abstract: Patient motion during dynamic PET imaging can induce errors in myocardial blood flow (MBF) estimation. Motion correction for dynamic cardiac PET is challenging because the rapid tracer kinetics of ⁸²Rb leads to substantial tracer distribution change across different dynamic frames over time, which can cause difficulties for image registration-based motion correction, particularly for early dynamic frames. In this paper, we developed an automatic deep learning-based motion correction (DeepMC) method for dynamic cardiac PET. In this study we focused on the detection and correction of inter-frame rigid translational motion caused by voluntary body movement and pattern change of respiratory motion. A bidirectional-3D LSTM network was developed to fully utilize both local and nonlocal temporal information in the 4D dynamic image data for motion detection. The network was trained and evaluated over motion-free patient scans with simulated motion so that the motion ground-truths are available, where one million samples based on 65 patient scans were used in training, and 600 samples based on 20 patient scans were used in evaluation. The proposed method was also evaluated using an additional 10 patient datasets with real motion. We demonstrated that the proposed DeepMC obtained superior performance compared to conventional registration-based methods and other convolutional neural networks (CNN), in terms of motion estimation and MBF quantification accuracy. Once trained, DeepMC is much faster than the registration-based methods and can be easily integrated into the clinical workflow. In future work, additional investigation is needed to evaluate this approach in a clinical context with realistic patient motion.
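
The temporal modelling idea can be sketched with a much-simplified stand-in: the snippet below regresses one rigid translation per dynamic frame from both earlier and later frames. It is an assumption-laden simplification (pooled per-frame feature vectors and a plain bidirectional LSTM), whereas the paper uses a bidirectional 3D LSTM over the 4D image data itself.

```python
import torch
import torch.nn as nn

class BiLSTMMotionEstimator(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # bidirectional recurrence lets each frame's estimate use past and future frames
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 3)   # (tx, ty, tz) translation per frame

    def forward(self, frame_feats):            # (batch, n_frames, feat_dim)
        out, _ = self.lstm(frame_feats)
        return self.head(out)                  # (batch, n_frames, 3)

est = BiLSTMMotionEstimator()
shifts = est(torch.randn(2, 27, 256))          # e.g. 27 dynamic frames (assumption)
```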

11 citations


Book ChapterDOI
17 Mar 2021
TL;DR: Wang et al. as discussed by the authors proposed to analyze information from a sequence of multiple consecutive frames to detect deepfakes in video content by processing the video using the sliding window approach, taking into account not only spatial intraframe dependencies but also interframe temporal dependencies.
Abstract: Deepfakes generated by generative adversarial neural networks may threaten not only individuals but also pose a public threat. In this regard, detecting video content manipulations is an urgent task, and many researchers propose various methods to solve it. Nevertheless, the problem remains. In this paper, the existing approaches are evaluated, and a new method for detecting deepfakes in videos is proposed. Considering that deepfakes are inserted into the video frame by frame, when viewing it, even with the naked eye, fluctuations and temporal distortions are noticeable, which are not taken into account by many deepfake detection algorithms that use information from a single frame to search for forgeries out of context with neighboring frames. It is proposed to analyze information from a sequence of multiple consecutive frames to detect deepfakes in video content by processing the video using the sliding window approach, taking into account not only spatial intraframe dependencies but also interframe temporal dependencies. Experiments have shown the advantage and potential for further development of the proposed approach over simple intraframe recognition.
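
A minimal sketch of the sliding-window framing follows; the window length, stride, and max-score aggregation are assumptions, not the paper's settings.

```python
def sliding_windows(frames, length=16, stride=4):
    """Split a frame list into overlapping clips of consecutive frames."""
    return [frames[i:i + length] for i in range(0, len(frames) - length + 1, stride)]

# clip_scores = [detector(clip) for clip in sliding_windows(video_frames)]
# video_is_fake = max(clip_scores) > threshold   # one possible aggregation (assumption)
```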

10 citations


Journal ArticleDOI
TL;DR: A 54-dimensional feature set exploiting spatio-temporal features of motion vectors to blindly detect MV-based stego videos; results show that the features' performance far exceeds that of state-of-the-art steganalysis methods.
Abstract: Despite all its irrefutable benefits, the development of steganography methods has sparked ever-increasing concerns over steganography abuse in recent decades. To prevent the inimical usage of steganography, steganalysis approaches have been introduced. Since motion vector manipulation leads to random and indirect changes in the statistics of videos, MV-based video steganography has been the center of attention in recent years. In this paper, we propose a 54-dimensional feature set exploiting spatio-temporal features of motion vectors to blindly detect MV-based stego videos. The idea behind the proposed features originates from two facts. First, there are strong dependencies among neighboring MVs due to utilizing rate-distortion optimization techniques and belonging to the same rigid object or static background. Accordingly, MV manipulation can leave important clues on the differences between each MV and the MVs belonging to the neighboring blocks. Second, a majority of MVs in original videos are locally optimal after decoding concerning the Lagrangian multiplier, notwithstanding the information loss during compression. Motion vector alteration during information embedding can affect these statistics that can be utilized for steganalysis. Experimental results have shown that our features’ performance far exceeds that of state-of-the-art steganalysis methods. This outstanding performance lies in the utilization of complementary spatio-temporal statistics affected by MV manipulation as well as feature dimensionality reduction applied to prevent overfitting. Moreover, unlike other existing MV-based steganalysis methods, our proposed features can be adjusted to various settings of the state-of-the-art video codec standards such as sub-pixel motion estimation and variable-block-size motion estimation.
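
To make the first observation concrete, here is an illustrative numpy sketch (not the actual 54-dimensional feature set) of statistics computed from differences between each motion vector and its four spatial neighbours.

```python
import numpy as np

def neighbour_mv_stats(mv_field):
    """mv_field: (H, W, 2) array of block motion vectors (assumed layout)."""
    diffs = []
    for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        shifted = np.roll(mv_field, shift=(dy, dx), axis=(0, 1))
        diffs.append(np.abs(mv_field - shifted))
    diffs = np.stack(diffs)                          # (4, H, W, 2)
    # crude summary statistics standing in for the paper's spatio-temporal features
    return np.array([diffs.mean(), diffs.std(), (diffs == 0).mean()])
```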

10 citations


Journal ArticleDOI
TL;DR: A multiframe-to-multiframe (MM) denoising scheme that simultaneously recovers multiple clean frames from consecutive noisy frames, and an MM network (MMNet), which adopts a spatiotemporal convolutional architecture that considers both the interframe similarity and single-frame characteristics.
Abstract: Most existing studies performed video denoising by using multiple adjacent noisy frames to recover one clean frame; however, despite achieving relatively good quality for each individual frame, these approaches may result in visual flickering when the denoised frames are considered in sequence. In this paper, instead of separately restoring each clean frame, we propose a multiframe-to-multiframe (MM) denoising scheme that simultaneously recovers multiple clean frames from consecutive noisy frames. The proposed MM denoising scheme uses a training strategy that optimizes the denoised video from both the spatial and temporal dimensions, enabling better temporal consistency in the denoised video. Furthermore, we present an MM network (MMNet), which adopts a spatiotemporal convolutional architecture that considers both the interframe similarity and single-frame characteristics. Benefiting from the underlying parallel mechanism of the MM denoising scheme, MMNet achieves a highly competitive denoising efficiency. Extensive analyses and experiments demonstrate that MMNet outperforms the state-of-the-art video denoising methods, yielding temporal consistency improvements of at least 13.3% and running more than 2 times faster than the other methods.

9 citations


Journal ArticleDOI
TL;DR: In this paper, a moving object detection and tracking algorithm based on computer vision technology is presented, which has the longest running time per frame when tracking a moving target, which is about 2.3 times that of the single frame running time of the CamShift algorithm.
Abstract: In order to improve the video image processing technology, this paper presents a moving object detection and tracking algorithm based on computer vision technology. Firstly, the detection performance of the interframe difference method and the background difference model method is compared comprehensively from both theoretical and experimental aspects, and then the Robert edge detection operator is selected to carry out edge detection of the vehicle. The research results show that the algorithm proposed in this paper has the longest running time per frame when tracking a moving target, which is about 2.3 times that of the single frame running time of the CamShift algorithm. The algorithm has high running efficiency and can meet the requirements of real-time tracking of a foreground target. The algorithm has the highest tracking accuracy, the time consumption is reduced, and the error of the tracking frame deviating from the real position of the target is the least.
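
The two detectors being compared, plus the Roberts-style edge step, can be sketched in a few lines; threshold values and grayscale inputs are assumptions.

```python
import cv2
import numpy as np

def interframe_diff(prev_gray, curr_gray, thresh=25):
    # interframe difference: moving pixels differ between consecutive frames
    d = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(d, thresh, 255, cv2.THRESH_BINARY)
    return mask

def background_diff(background_gray, curr_gray, thresh=25):
    # background difference: foreground pixels differ from a background model
    d = cv2.absdiff(curr_gray, background_gray)
    _, mask = cv2.threshold(d, thresh, 255, cv2.THRESH_BINARY)
    return mask

def roberts_edges(gray):
    # Roberts cross operator for vehicle edge detection
    kx = np.array([[1, 0], [0, -1]], dtype=np.float32)
    ky = np.array([[0, 1], [-1, 0]], dtype=np.float32)
    gx = cv2.filter2D(gray.astype(np.float32), -1, kx)
    gy = cv2.filter2D(gray.astype(np.float32), -1, ky)
    return np.sqrt(gx ** 2 + gy ** 2)
```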

8 citations


Journal ArticleDOI
TL;DR: A novel occlusion-invariant term is proposed to make the part features close to their center, which can relieve several uncontrolled complicated factors, such as occlusions and pose variations, in the video-based person Re-ID task.

8 citations


Journal ArticleDOI
TL;DR: In this article, a dynamic warping network (DWNet) is proposed to adaptively warp the interframe features for improving the accuracy of warping-based models, which can achieve consistent improvement over various strong baselines and achieves state-of-the-art accuracy on the Cityscapes and CamVid benchmark datasets.
Abstract: A major challenge for semantic video segmentation is how to exploit the spatiotemporal information and produce consistent results for a video sequence. Many previous works utilize the precomputed optical flow to warp the feature maps across adjacent frames. However, the imprecise optical flow and the warping operation without any learnable parameters may not achieve accurate feature warping and only bring a slight improvement. In this paper, we propose a novel framework named Dynamic Warping Network (DWNet) to adaptively warp the interframe features for improving the accuracy of warping-based models. Firstly, we design a flow refinement module (FRM) to optimize the precomputed optical flow. Then, we propose a flow-guided convolution (FG-Conv) to achieve the adaptive feature warping based on the refined optical flow. Furthermore, we introduce the temporal consistency loss, including the feature consistency loss and prediction consistency loss, to explicitly supervise the warped features instead of simple feature propagation and fusion, which guarantees the temporal consistency of video segmentation. Note that our DWNet adopts extra constraints to improve the temporal consistency in the training phase, while no additional calculation and postprocessing are required during inference. Extensive experiments show that our DWNet can achieve consistent improvement over various strong baselines and achieves state-of-the-art accuracy on the Cityscapes and CamVid benchmark datasets.
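
For reference, the plain (non-learnable) flow-based feature warping that DWNet improves upon is typically implemented with a bilinear sampler; a minimal PyTorch sketch, with assumed tensor shapes and flow convention, is given below.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """feat: (N, C, H, W) features of frame t-1; flow: (N, 2, H, W) flow t -> t-1 in pixels."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(feat.device)   # (H, W, 2), xy order
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)            # add per-pixel displacement
    # normalize coordinates to [-1, 1] for grid_sample
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(feat, grid, align_corners=True)
```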

Journal ArticleDOI
TL;DR: In this article, a smart key frame extraction algorithm is proposed by combining the background difference method and SIFT feature matching algorithm, at the same time, the criterion factor K is introduced.

Journal ArticleDOI
TL;DR: This work presents the cross and self-attention network (CSANet), which not only propagates temporal features from adjacent frames, but is also designed to aggregate spatial context within the current frame, which is shown to effectively improve the consistency and robustness of the extracted deep features.
Abstract: Video semantic segmentation aims at generating temporally consistent segmentation results and is still a very challenging task in the deep learning era. In this work, we improve prior approaches from two aspects. On the network architecture level, we present the cross and self-attention network (CSANet). As opposed to prior methods, CSANet not only propagates temporal features from adjacent frames, but is also designed to aggregate spatial context within the current frame, which is shown to effectively improve the consistency and robustness of the extracted deep features. On the loss function level, we further propose the inter-frame mutual learning strategy, which encourages the cross-attention module to focus on semantically correlated context regions, allowing the segmentation results at different frames to be collaboratively improved. By combining the above two novel designs, we show that our proposed method is able to deliver state-of-the-art performance on the Cityscapes and CamVid benchmarks.

Book ChapterDOI
28 Jan 2021
TL;DR: Wang et al. as discussed by the authors proposed a coverless video steganography method based on inter frame combination, where the hash sequence of a frame is generated by the CNNs and hash generator.
Abstract: In most coverless image steganography methods, the number of images increases exponentially with the increase of hidden message bits, which is difficult to construct such a dataset. And several images in semantic irrelevance are usually needed to represent more secret message bits, which are easy to cause the attacker’s attention and bring some insecurity. To solve these two problems, a coverless video steganography method based on inter frame combination is proposed in this manuscript. In the proposed method, the hash sequence of a frame is generated by the CNNs and hash generator. To hide more information bits in one video, a special mapping rule is proposed. Through this mapping rule, some key frames in one video are selected. In the selected frames, one or several frames are used to represent a piece of information with equal length. To quickly index out the corresponding frames, a three-level index structure is proposed in this manuscript. Since the proposed coverless video steganography method does not embed one bit in video, it can effectively resist steganalysis algorithms. The experimental results and analysis show that the proposed method has a large capacity, good robustness and high security.

Journal ArticleDOI
TL;DR: This is the first such CNN approach in the literature to perform motion-based multiframe SR by fusing multiple input frames in a single network and it is demonstrated that this subpixel registration information is critical to network performance.

Journal ArticleDOI
TL;DR: A novel camera motion classification framework based on modeling the compressed domain block motion vectors using the HSI color model and demonstrating accuracies of over 98 % in recognizing eleven camera patterns for the proposed method.
Abstract: This paper presents a novel camera motion classification framework based on modeling the compressed domain block motion vectors using the HSI color model. The input to the proposed method is the interframe block motion vectors decoded from the compressed bitstream. The block motion vector’s magnitude and orientation are estimated, followed by assigning motion vector orientation to Hue, motion vector magnitude to Saturation, and keeping Intensity at a fixed value. The HSI assignment is then converted into an RGB image followed by supervised learning utilizing a convolutional neural network to recognize eleven camera motion patterns comprising seven pure camera motion patterns and four mixed camera patterns. The proposed method’s premise is based on posing the camera motion classification problem as a color recognition task. Detailed experimental analysis that includes a comparison with state-of-the-art methods, ablation study, and robustness analysis is carried out utilizing block motion vectors obtained from H.264/AVC encoded videos. Results demonstrate accuracies of over 98 % in recognizing eleven camera patterns for the proposed method.
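
The colour mapping at the heart of the method can be sketched directly (a rough illustration, not the authors' code; the fixed intensity value and the min-max normalization are assumptions).

```python
import cv2
import numpy as np

def mv_field_to_rgb(mv_field):
    """mv_field: (H, W, 2) block motion vectors (dx, dy), assumed layout."""
    mv = mv_field.astype(np.float32)
    mag, ang = cv2.cartToPolar(np.ascontiguousarray(mv[..., 0]),
                               np.ascontiguousarray(mv[..., 1]))   # magnitude, angle in radians
    hsv = np.zeros((mv.shape[0], mv.shape[1], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)         # orientation -> Hue (OpenCV hue 0..179)
    hsv[..., 1] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)  # magnitude -> Saturation
    hsv[..., 2] = 255                                              # fixed intensity (assumption)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2RGB)                    # RGB image fed to the CNN classifier
```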

Posted Content
TL;DR: In this paper, a hybrid motion compensation (HMC) method was proposed to adaptively combine the predictions generated by adaptive kernel-based resampling (e.g., adaptive convolutions and deformable convolutions) in video prediction for uncompressed videos.
Abstract: Recent years have witnessed rapid advances in learnt video coding. Most algorithms have solely relied on the vector-based motion representation and resampling (e.g., optical flow based bilinear sampling) for exploiting the inter frame redundancy. In spite of the great success of adaptive kernel-based resampling (e.g., adaptive convolutions and deformable convolutions) in video prediction for uncompressed videos, integrating such approaches with rate-distortion optimization for inter frame coding has been less successful. Recognizing that each resampling solution offers unique advantages in regions with different motion and texture characteristics, we propose a hybrid motion compensation (HMC) method that adaptively combines the predictions generated by these two approaches. Specifically, we generate a compound spatiotemporal representation (CSTR) through a recurrent information aggregation (RIA) module using information from the current and multiple past frames. We further design a one-to-many decoder pipeline to generate multiple predictions from the CSTR, including vector-based resampling, adaptive kernel-based resampling, compensation mode selection maps and texture enhancements, and combines them adaptively to achieve more accurate inter prediction. Experiments show that our proposed inter coding system can provide better motion-compensated prediction and is more robust to occlusions and complex motions. Together with jointly trained intra coder and residual coder, the overall learnt hybrid coder yields the state-of-the-art coding efficiency in low-delay scenario, compared to the traditional H.264/AVC and H.265/HEVC, as well as recently published learning-based methods, in terms of both PSNR and MS-SSIM metrics.

Journal ArticleDOI
TL;DR: In this article, a fusion algorithm combining the four-frame difference method and the background averaging method is proposed to address the shortcomings of the interframe difference method and the background difference method; it combines morphological processing to correct the foreground and can effectively cope with slow changes in the background.
Abstract: In this paper, we track the motion of multiple targets in sports videos by a machine learning algorithm and study its tracking technique in depth. In terms of moving target detection, the traditional detection algorithms are analysed theoretically as well as implemented algorithmically, and on this basis a fusion of the four-frame difference method and the background averaging method is proposed to address the shortcomings of the interframe difference method and the background difference method. The fusion algorithm uses a learning rate to update the background in real time and combines morphological processing to correct the foreground, which allows it to cope effectively with slow changes in the background. According to the requirements of real-time performance, accuracy, and low video memory usage in intelligent video surveillance systems, this paper improves the streamlined version of the algorithm. The experimental results show that the improved multitarget tracking algorithm effectively improves on the Kalman filter-based algorithm and meets the real-time and accuracy requirements of intelligent video surveillance scenarios.
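
The real-time background update mentioned above is an exponential running average; a short sketch (the learning rate is an assumption) is:

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    # learning-rate update: B_t = (1 - alpha) * B_{t-1} + alpha * F_t,
    # so slow changes in the scene are gradually absorbed into the background model
    return (1.0 - alpha) * background + alpha * frame.astype(np.float32)

# The foreground obtained from frame/background differencing is then corrected with
# morphological operations, e.g. cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel).
```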

Journal ArticleDOI
TL;DR: In this paper, the point cloud geometry is decomposed in silhouettes, and context adaptive arithmetic coding is used to exploit redundancies within the point clouds and also using a reference point cloud (inter frame coding).
Abstract: Recently we have proposed a coding algorithm of point cloud geometry based on a rather different approach from the popular octree representation. In our algorithm, the point cloud is decomposed in silhouettes, hence the name Silhouette Coder, and context adaptive arithmetic coding is used to exploit redundancies within the point cloud (intra frame coding), and also using a reference point cloud (inter frame coding). In this letter we build on our previous work and propose a context selection algorithm as a pre-processing stage. With this algorithm, the point cloud is first parsed testing a large number of candidate context locations. The algorithm selects a small number of these contexts that better reflect the current point cloud, and then encodes it with this choice. The proposed method further improves the results of our previous coder, Silhouette 4D, by 10%, on average, on a dynamic point cloud dataset of the JPEG Pleno, and achieves bitrates competitive with some high quality lossy coders such as the MPEG G-PCC.

Journal ArticleDOI
TL;DR: In this article, the authors used image-based registration of reconstructions of very short frames for data-driven motion estimation, and optimized a number of reconstruction and registration parameters (frame duration, MLEM iterations, image pixel size, post-smoothing filter, reference image creation, and registration metric) to ensure accurate registrations while maximizing temporal resolution and minimizing total computation time.
Abstract: PURPOSE Data-driven rigid motion estimation for PET brain imaging is usually performed using data frames sampled at low temporal resolution to reduce the overall computation time and to provide adequate signal-to-noise ratio in the frames. In recent work it has been demonstrated that list-mode reconstructions of ultrashort frames are sufficient for motion estimation and can be performed very quickly. In this work we take the approach of using image-based registration of reconstructions of very short frames for data-driven motion estimation, and optimize a number of reconstruction and registration parameters (frame duration, MLEM iterations, image pixel size, post-smoothing filter, reference image creation, and registration metric) to ensure accurate registrations while maximizing temporal resolution and minimizing total computation time. METHODS Data from ¹⁸F-fluorodeoxyglucose (FDG) and ¹⁸F-florbetaben (FBB) tracer studies with varying count rates are analyzed, for PET/MR and PET/CT scanners. For framed reconstructions using various parameter combinations interframe motion is simulated and image-based registrations are performed to estimate that motion. RESULTS For FDG and FBB tracers using 4 × 10⁵ true and scattered coincidence events per frame ensures that 95% of the registrations will be accurate to within 1 mm of the ground truth. This corresponds to a frame duration of 0.5-1 sec for typical clinical PET activity levels. Using four MLEM iterations with no subsets, a transaxial pixel size of 4 mm, a post-smoothing filter with 4-6 mm full width at half maximum, and averaging two or more frames to create the reference image provides an optimal set of parameters to produce accurate registrations while keeping the reconstruction and processing time low. CONCLUSIONS It is shown that very short frames (≤1 sec) can be used to provide accurate and quick data-driven rigid motion estimates for use in an event-by-event motion corrected reconstruction.

Journal ArticleDOI
TL;DR: A progressive approach to train and incorporate the CNN-based in-loop filters to work seamlessly with video encoders is proposed and experimental results show that the proposed method outperforms the RDO method that utilizes only local model.
Abstract: Convolutional Neural Network (CNN) structures have been designed for in-loop filtering to improve video coding performance. These CNN models are usually trained through learning the correlations between the reconstructed and the original frames, which are then applied to every single reconstructed frame to improve the overall video quality. This direct model training and deployment strategy is effective for intra coding since a locally optimal model is sufficient. However, when applied to inter coding, it causes over-filtering because the intertwined reference dependencies across inter frames are not taken into consideration. To address this issue, existing methods usually resort to the Rate–Distortion Optimization (RDO) to selectively apply the CNN model, but fail to address the limitation of using a local CNN model. In this paper, we propose a progressive approach to train and incorporate the CNN-based in-loop filters to work seamlessly with video encoders. First, we develop a progressive training method to obtain the inter model. Using transfer learning, reconstructed frames using the CNN model are progressively involved back into the training of the CNN model itself, to simulate the reference dependencies in inter coding. Next, we design a frame-level model selection strategy for the high-bitrate coding where the over-filtering effect is diluted. Experimental results show that the proposed method outperforms the RDO method that utilizes only local model. Proposed approach also achieves comparable coding performance but with less computational complexity when integrating our progressive model into the RDO scheme.

Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors proposed an efficient detection algorithm for HEVC double compression with non-aligned Group of Pictures (GOP) structures, which is based on inter-frame quality degradation process analysis.

Journal ArticleDOI
08 Jun 2021-Sensors
TL;DR: Wang et al. as mentioned in this paper proposed a detection method that adapts to brightness and jitter for video inter-frame forgery, which is more accurate and robust for videos with significant brightness variance or videos with heavy jitter on public benchmark datasets.
Abstract: Digital video forensics plays a vital role in judicial forensics, media reports, e-commerce, finance, and public security. Although many methods have been developed, there is currently no efficient solution to real-life videos with illumination noises and jitter noises. To solve this issue, we propose a detection method that adapts to brightness and jitter for video inter-frame forgery. For videos with severe brightness changes, we relax the brightness constancy constraint and adopt intensity normalization to propose a new optical flow algorithm. For videos with large jitter noises, we introduce motion entropy to detect the jitter and extract the stable feature of texture changes fraction for double-checking. Experimental results show that, compared with previous algorithms, the proposed method is more accurate and robust for videos with significant brightness variance or videos with heavy jitter on public benchmark datasets.
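
A small sketch of the brightness-adaptation step follows; zero-mean/unit-variance scaling is an assumption, and the exact normalization used in the paper may differ.

```python
import numpy as np

def normalize_intensity(gray):
    # rescale each frame before optical flow is computed, relaxing the
    # brightness-constancy assumption under illumination changes
    g = gray.astype(np.float32)
    return (g - g.mean()) / (g.std() + 1e-6)
```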

Journal ArticleDOI
TL;DR: Attentive correlated temporal feature (ACTF) as mentioned in this paper exploits both bilinear and linear correlations between successive frames on the regional level, which has the advantage of achieving performance comparable to or better than optical flow-based methods while avoiding the introduction of optical flow.
Abstract: Temporal feature extraction is an important issue in video-based action recognition. Optical flow is a popular method to extract temporal feature, which produces excellent performance thanks to its capacity of capturing pixel-level correlation information between consecutive frames. However, such a pixel-level correlation is extracted at the cost of high computational complexity and large storage resource. In this paper, we propose a novel temporal feature extraction method, Attentive Correlated Temporal Feature (ACTF), by exploring inter-frame correlation within a certain region. The proposed ACTF exploits both bilinear and linear correlations between successive frames on the regional level. Our method has the advantage of achieving performance comparable to or better than optical flow-based methods while avoiding the introduction of optical flow. Experimental results demonstrate our proposed method achieves the competitive performances of 96.3 % on UCF101 and 76.3 % on HMDB51 benchmark datasets.
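
The regional inter-frame correlations can be illustrated with a short PyTorch sketch; the pooled feature shapes and the exact form of the linear/bilinear terms are assumptions rather than the paper's definition.

```python
import torch

def frame_correlations(feat_t, feat_t1):
    """feat_t, feat_t1: (N, C) pooled features of a region in frames t and t+1."""
    linear = feat_t * feat_t1                                # element-wise (linear) correlation
    bilinear = torch.einsum("nc,nd->ncd", feat_t, feat_t1)   # outer-product (bilinear) correlation
    return linear, bilinear.flatten(1)                       # (N, C), (N, C*C)

lin, bil = frame_correlations(torch.randn(4, 64), torch.randn(4, 64))
```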

Journal ArticleDOI
TL;DR: In this paper, a new block-based motion estimation (BME) algorithm is proposed to reduce the coding process's computational complexity, which is based on the primary search point prediction and advance ending search point strategies.
Abstract: Digital video technology has been increasingly needed in various fields, such as telecommunications, entertainment, medicine. Therefore, video compression is required. Motion estimation methods help in improving video compression efficiency by effectively removing the temporal redundancy between successive frames. Several block-based motion estimation (BME) algorithms are being suggested to reduce the coding process’s computational complexity. This paper proposes a new rapid hybrid (BME) algorithm established on the primary search point prediction and advance ending search point strategies. It combines rough adaptive search and effective local search. The coarse search introduces a new motion vector (MV) prediction technique that utilizes the macro-blocks (MBs) Spatio-temporal correlations to optimize the traditional adaptive-rood-pattern search algorithm (ARPS) and speeding up the whole process without affecting the accuracy. In the accurate local search, the cross-formed search pattern using a one-step search (OSS) block matching algorithm is employed, to estimate the actual (MV) with less computation time and further speed up the search efficiency. Exhaustive experiments are performed to demonstrate the present algorithm’s performance over the benchmark schemes concerning specific assessment criteria for results, including the peak signal-to-noise ratio (PSNR), computational complexity and computational gain. The results show that the proposed algorithm is efficient and reliable; it can always give better performance over diamond search (DS) and (ARPS). The conducted test shows an increased performance of search speed while preserving the visual quality of the motion-compensated images, and it achieves 59.76–88.03 speed improvement over (DS) and 20.98–72.06 over (ARPS) for different video sequences. Besides, the suggested method (ARP-OSS) provides the best result compared to (DS) and (ARPS) in terms of time complexity for analyzing all the video samples.
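
For context, the basic operation that ARPS/OSS-style algorithms accelerate is SAD block matching; a brute-force reference sketch (block size and search range are assumptions) is shown below. ARPS and the cross-pattern OSS search visit only a handful of these candidate positions instead of the full window.

```python
import numpy as np

def best_mv_full_search(ref, cur, y, x, block=16, search=7):
    """Find the motion vector for the block of cur at (y, x) by exhaustive SAD search in ref."""
    target = cur[y:y+block, x:x+block].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > ref.shape[0] or xx + block > ref.shape[1]:
                continue
            sad = np.abs(ref[yy:yy+block, xx:xx+block].astype(np.int32) - target).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv, best
```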

Journal ArticleDOI
TL;DR: In this article, the spatial-temporal context information correlation filtering target tracking algorithm is proposed to construct a fast target tracking method, which enables the ground-based telescope to effectively detect spatial targets in dense stellar backgrounds in both modes.
Abstract: In this paper, we simulate the estimation of motion through an interframe difference detection function model and investigate the spatial-temporal context information correlation filtering target tracking algorithm, which is complex and computationally intensive. The basic theory of spatiotemporal context information and correlation filtering is studied to construct a fast target tracking method. Different computational schemes are designed for the flow of multiframe target detection, from background removal to noise reduction, to single-frame detection, and finally to multiframe detection. This enables the ground-based telescope to effectively detect spatial targets against dense stellar backgrounds in both modes. The method is validated by simulations and experiments and can meet the requirements of real projects. The interframe pose estimation is optimized by using bundle adjustment to reduce the interframe estimation noise; a global optimization strategy based on the pose graph is used in the back end to reduce the computational load and make the global pose estimation more accurate; and loop closure detection based on the bag-of-words model is added to the system to reduce the cumulative error.

Proceedings ArticleDOI
Bohan Li, Jingning Han, Yaowu Xu
06 Jun 2021
TL;DR: In this paper, a novel method to determine the size of each group of picture (GOP) using the multi-pass information is presented, which categorizes frames into regions with different natures, including stationary, high variance, blending, and scene cut, through analyzing the frame statistics generated from the previous passes using a hidden Markov model.
Abstract: Multi-pass coding is a widely utilized technique to improve the compression efficiency in video coding, where frame statistics are collected from the previous passes and then analyzed to provide better encoder decisions, such as rate control parameters, prediction mode selection, motion estimation, etc. In this paper, a novel method to determine the size of each group of picture (GOP) using the multi-pass information is presented. In particular, we propose to categorize frames into regions with different natures, including stationary, high-variance, blending, and scene cut, through analyzing the frame statistics generated from the previous passes using a hidden Markov model. The GOP size is then determined based on the region types and the inter frame correlations. It is experimentally shown that the proposed adaptive GOP size decision provides considerable coding performance improvements over conventional fixed GOP length.

Posted Content
TL;DR: In this article, a variable filter size multi-scale CNN (MSCNN) was introduced to improve the denoising operation and incorporated strided deconvolution for further computation improvement.
Abstract: To achieve higher coding efficiency, Versatile Video Coding (VVC) includes several novel components, but at the expense of increasing decoder computational complexity. These technologies at a low bit rate often create contouring and ringing effects on the reconstructed frames and introduce various blocking artifacts at block boundaries. To suppress those visual artifacts, the VVC framework supports four post-processing filter operations. The interoperation of these filters introduces extra signaling bits and eventually becomes overhead at higher resolution video processing. In this paper, a novel deep learning-based model is proposed for sample adaptive offset (SAO) nonlinear filtering operation and substantiated the merits of intra-inter frame quality enhancement. We introduced a variable filter size multi-scale CNN (MSCNN) to improve the denoising operation and incorporated strided deconvolution for further computation improvement. We demonstrated that our deconvolution model can effectively be trained by leveraging the high-frequency edge features learned in a parallel fashion using feature fusion and residual learning. The simulation results demonstrate that the proposed method outperforms the baseline VVC method in BD-BR, BD-PSNR measurements and achieves an average of 3.762 % bit rate saving on the standard video test sequences.

Posted Content
TL;DR: In this paper, an end-to-end solution for video instance segmentation based on transformers is proposed, which significantly reduces the overhead for information-passing between frames by efficiently encoding the context within the input clip.
Abstract: We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline shows superior performance over per-frame methods leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications, limiting practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead for information-passing between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a means of conveying information as well as summarizing each frame scene. The features of each frame are enriched and correlated with other frames through exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmark sets and achieve state-of-the-art performance (AP 44.6 on YouTube-VIS 2019 val set using the offline inference) while having a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference for processing a video in real-time with only a small delay. The code will be made available.

Journal ArticleDOI
TL;DR: IoTorch, a fast and reliable LED-to-camera communication that efficiently prevents the packet losses, is presented and a remote wake-up function by using the smartphone's flash as the data transmission trigger is realized.
Abstract: LED-to-camera communication for smartphones promises to enable low-cost, small-size, and intuitive data access to Internet-of-things (IoT) devices. However, for fast and reliable communication under the rolling shutter effect, the packet losses caused by inter-frame gaps and frame drops must be dealt with. In this paper, we introduce LED-to-camera communication between a commercial smartphone and an IoT device with LED flashlight, where the user can intuitively acquire the desired data transmitted from the IoT device to the smartphone. Specifically, we present IoTorch, a fast and reliable LED-to-camera communication that efficiently prevents the packet losses. IoTorch consists of two core mechanisms: 1) a minimum-repetition one-way reliable transmission focusing on the periodicity of inter-frame gaps and 2) an acknowledgement mechanism to overcome frame drops by using a smartphone's built-in flash focusing on its delay characteristics. Additionally, we propose an optimization method to increase the throughput up to 2.92 kbps (i.e., 2.43 times faster than that of the state-of-the-art method that overcomes the inter-frame gaps). Furthermore, we realize a remote wake-up function by using the smartphone's flash as the data transmission trigger. To demonstrate the benefit of IoTorch, we present a sensor data viewer using Arduino and a smartphone console system on TelosB.