
Showing papers on "Video compression picture types published in 2017"


Proceedings ArticleDOI
01 Oct 2017
TL;DR: Deep voxel flow as mentioned in this paper combines the advantages of optical flow and neural network-based methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which can be applied at any video resolution.
Abstract: We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.
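
As an illustration of the core "flow pixel values from existing frames" step, the sketch below backward-warps a color frame with a given flow field using bilinear sampling. This is only a minimal numpy reconstruction of the sampling idea; the deep network that predicts the voxel flow is assumed to exist elsewhere, and the function name `warp_frame` is illustrative, not the authors' code.

```python
import numpy as np

def warp_frame(frame, flow):
    """Synthesize a new frame by sampling pixel values from a color `frame`
    (H x W x 3) at positions displaced by `flow` (H x W x 2, in pixels),
    using bilinear interpolation. The paper's network predicts the flow;
    here it is assumed to be given."""
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    sx = np.clip(xs + flow[..., 0], 0, w - 1.001)   # source x per output pixel
    sy = np.clip(ys + flow[..., 1], 0, h - 1.001)   # source y per output pixel
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = sx - x0, sy - y0
    # Bilinear blend of the four neighbouring source pixels.
    top = (1 - wx)[..., None] * frame[y0, x0] + wx[..., None] * frame[y0, x1]
    bot = (1 - wx)[..., None] * frame[y1, x0] + wx[..., None] * frame[y1, x1]
    return (1 - wy)[..., None] * top + wy[..., None] * bot

# Interpolation example: synthesize a mid-frame by warping with half of a
# (here all-zero, i.e. dummy) flow field.
frame0 = np.random.rand(64, 64, 3)
flow = np.zeros((64, 64, 2))
mid_frame = warp_frame(frame0, 0.5 * flow)
```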

601 citations


Journal ArticleDOI
01 May 2017
TL;DR: This report sets out to summarize and categorize the research in tone‐mapping as of today, distilling the most important trends and characteristics of the tone reproduction pipeline and specifically focuses on tone-mapping of HDR video and the problems this medium entails.
Abstract: Tone-mapping constitutes a key component within the field of high dynamic range (HDR) imaging. Its importance is manifested in the vast number of tone-mapping methods that can be found in the literature, which are the result of active development in the area for more than two decades. Although these can accommodate most requirements for display of HDR images, new challenges arose with the advent of HDR video, calling for additional considerations in the design of tone-mapping operators (TMOs). Today, a range of TMOs exist that do support video material. We are now reaching a point where most camera-captured HDR videos can be prepared in high quality without visible artifacts, within the constraints of a standard display device. In this report, we set out to summarize and categorize the research in tone-mapping as of today, distilling the most important trends and characteristics of the tone reproduction pipeline. While this gives a wide overview of the area, we then specifically focus on tone-mapping of HDR video and the problems this medium entails. First, we formulate the major challenges a video TMO needs to address. Then, we provide a description and categorization of each of the existing video TMOs. Finally, by constructing a set of quantitative measures, we evaluate the performance of a number of the operators, in order to indicate which can be expected to produce the fewest artifacts. This serves as a comprehensive reference, categorization and comparative assessment of the state-of-the-art in tone-mapping for HDR video.
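
As a concrete illustration of why video tone-mapping needs more than a per-frame operator, the sketch below applies a simple global TMO (log-average luminance scaling followed by L/(1+L) compression) and temporally smooths the adaptation value to avoid flicker. This is a minimal sketch under those assumptions, not one of the surveyed operators.

```python
import numpy as np

def tonemap_video(frames, key=0.18, alpha=0.1, eps=1e-6):
    """Per-frame global tone mapping, with an exponential moving average over
    the log-average luminance to suppress the temporal flicker a purely
    per-frame TMO would produce when scene brightness changes."""
    out, smoothed = [], None
    for hdr in frames:
        lum = 0.2126 * hdr[..., 0] + 0.7152 * hdr[..., 1] + 0.0722 * hdr[..., 2]
        log_avg = np.exp(np.mean(np.log(lum + eps)))
        smoothed = log_avg if smoothed is None else (1 - alpha) * smoothed + alpha * log_avg
        scaled = key * lum / (smoothed + eps)
        ldr_lum = scaled / (1.0 + scaled)                 # compress to [0, 1)
        out.append(np.clip(hdr * (ldr_lum / (lum + eps))[..., None], 0.0, 1.0))
    return out

# Three synthetic HDR frames with slowly rising exposure.
frames = [np.random.rand(48, 64, 3) * (10.0 * (i + 1)) for i in range(3)]
ldr_frames = tonemap_video(frames)
```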

90 citations


Journal ArticleDOI
TL;DR: This paper aims at the evaluation of perceived visual quality of light field images and at comparing the performance of a few state-of-the-art algorithms for light field image compression, by means of a set of objective and subjective quality assessments.
Abstract: The recent advances in light field imaging, supported among others by the introduction of commercially available cameras, e.g., Lytro or Raytrix, are changing the ways in which visual content is captured and processed. Efficient storage and delivery systems for light field images must rely on compression algorithms. Several methods to compress light field images have been proposed recently. However, in-depth evaluations of compression algorithms have rarely been reported. This paper aims at the evaluation of perceived visual quality of light field images and at comparing the performance of a few state-of-the-art algorithms for light field image compression. First, a processing chain for light field image compression and decompression is defined for two typical use cases, professional and consumer. Then, five light field compression algorithms are compared by means of a set of objective and subjective quality assessments. An interactive methodology recently introduced by authors, as well as a passive methodology is used to perform these evaluations. The results provide a useful benchmark for future development of compression solutions for light field images.

86 citations


Proceedings ArticleDOI
18 Mar 2017
TL;DR: This work proposes using layered encoding for 360-degree video to improve QoE by reducing the probability of video freezes and the latency of response to the user head movements, which reduces the storage requirements significantly and improves in-network cache performance.
Abstract: Virtual reality and 360-degree video streaming are growing rapidly; however, streaming 360-degree video is very challenging due to high bandwidth requirements. To address this problem, the video quality is adjusted according to the user viewport prediction. High quality video is only streamed for the user viewport, reducing the overall bandwidth consumption. Existing solutions use shallow buffers limited by the accuracy of viewport prediction. Therefore, playback is prone to video freezes which are very destructive for the Quality of Experience(QoE). We propose using layered encoding for 360-degree video to improve QoE by reducing the probability of video freezes and the latency of response to the user head movements. Moreover, this scheme reduces the storage requirements significantly and improves in-network cache performance.

73 citations


Journal ArticleDOI
TL;DR: A multi-modal visual features-based SBD framework is employed that aims to analyze the behaviors of visual representation in terms of the discontinuity signal and can achieve good accuracy in both types of video data set compared with other proposed SBD methods.
Abstract: One of the essential pre-processing steps of semantic video analysis is video shot boundary detection (SBD). It is the primary step to segment the sequence of video frames into shots. Many SBD systems using supervised learning have been proposed over the years; however, the training process remains their principal limitation. In this paper, a multi-modal visual-features-based SBD framework is employed that analyzes the behavior of the visual representation in terms of a discontinuity signal. We adopt a candidate segment selection step that requires no threshold calculation, instead using the cumulative moving average of the discontinuity signal to identify the positions of shot boundaries and discard non-boundary video frames. Transition detection is then performed to classify each candidate segment as either a cut transition or a gradual transition, including fade in/out and logo occurrence. Experiments are conducted on golf video clips and the TREC2001 documentary video data set. Results show that the proposed SBD framework can achieve good accuracy on both types of video data set compared with other proposed SBD methods.
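
A minimal sketch of the candidate-selection idea described above: compute a histogram-difference discontinuity signal between consecutive frames and flag frames that exceed the cumulative moving average by a margin, instead of using a fixed global threshold. The feature choice, margin value and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def discontinuity_signal(frames, bins=64):
    """Per-frame discontinuity: L1 distance between normalized gray-level
    histograms of consecutive frames (one common choice of visual feature)."""
    hists = [np.histogram(f, bins=bins, range=(0, 255))[0] / f.size for f in frames]
    return np.array([np.abs(hists[i + 1] - hists[i]).sum() for i in range(len(hists) - 1)])

def candidate_boundaries(signal, margin=2.0):
    """Flag frames whose discontinuity exceeds the cumulative moving average
    by a margin, instead of a single global threshold."""
    cma, candidates = 0.0, []
    for i, d in enumerate(signal):
        cma = (cma * i + d) / (i + 1)          # cumulative moving average so far
        if i > 0 and d > margin * cma:
            candidates.append(i + 1)           # boundary between frame i and i+1
    return candidates

# Usage with dummy grayscale frames (H x W uint8 arrays):
frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8) for _ in range(50)]
print(candidate_boundaries(discontinuity_signal(frames)))
```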

57 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: K-means clustering, 2D-DWT and fuzzy logic based image compression are discussed, which are considered to be good techniques for reducing data redundancy in image processing.
Abstract: The growing demand for multimedia contributes to insufficient network bandwidth and memory storage. Data compression is therefore increasingly needed to reduce data redundancy, saving hardware space and transmission bandwidth. Image compression is one of the main research areas in the field of image processing, and many techniques have been proposed for it, some of which are discussed in this paper. Specifically, this paper discusses k-means clustering, 2D-DWT, and fuzzy-logic-based image compression.
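
A hedged sketch of one of the discussed ideas, k-means-based image compression by color quantization: pixels are clustered into a small palette, and the image is stored as the palette plus one index per pixel. This numpy-only example is illustrative; the paper's exact formulation (and its 2D-DWT and fuzzy-logic variants) may differ.

```python
import numpy as np

def kmeans_compress(img, k=16, iters=10, seed=0):
    """Compress an H x W x 3 image by clustering its pixels into k colors
    (a palette) and storing only the palette plus one index per pixel."""
    rng = np.random.default_rng(seed)
    pixels = img.reshape(-1, 3).astype(float)
    centers = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to the nearest palette color.
        d = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute each palette color as the mean of its assigned pixels.
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean(0)
    return centers.astype(np.uint8), labels.reshape(img.shape[:2]).astype(np.uint8)

def kmeans_decompress(palette, indices):
    return palette[indices]

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
palette, idx = kmeans_compress(img)       # one byte per pixel plus a tiny palette
recon = kmeans_decompress(palette, idx)
```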

56 citations


Journal ArticleDOI
TL;DR: A hierarchical temporally dependent RDO scheme is developed specifically for the LD-HCS based on a source distortion propagation model and, coupled with QP adaptation, can achieve higher coding gains.
Abstract: Low-delay hierarchical coding structure (LD-HCS), as one of the most important components in the latest High Efficiency Video Coding (HEVC) standard, greatly improves coding performance. It groups consecutive P/B frames into different layers and encodes them with different quantization parameters (QPs) and reference mechanisms in such a way that temporal dependency among frames can be exploited. However, due to the varying characteristics of video content, temporal dependency among coding units differs significantly across units in the same or different layers, while a fixed LD-HCS scheme cannot take full advantage of this dependency, leading to a substantial loss in coding performance. This paper addresses the temporally dependent rate distortion optimization (RDO) problem by attempting to exploit the varying temporal dependency of different units. First, the temporal relationship of different frames under the LD-HCS is examined, and hierarchical temporal propagation chains are constructed to represent the temporal dependency among coding units in different frames. Then, a hierarchical temporally dependent RDO scheme is developed specifically for the LD-HCS based on a source distortion propagation model. Experimental results show that our proposed scheme can achieve 2.5% and 2.3% BD-rate gains on average compared with the HEVC codec under the same configuration of P and B frames, respectively, with a negligible increase in encoding time. Furthermore, coupled with QP adaptation, our proposed method can achieve higher coding gains, e.g., with multi-QP optimization, about 5.4% and 5.0% BD-rate savings on average over the HEVC codec under the same setting of P and B frames, respectively.
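
A schematic sketch of the temporally dependent RD cost idea: the distortion of a coding unit is weighted by an estimated propagation factor, so units that many later frames reference are encoded more carefully. The specific form below is an assumption for illustration, not the paper's source distortion propagation model.

```python
def td_rd_cost(distortion, rate, lmbda, propagation_factor):
    """Schematic temporally dependent RD cost: distortion of a unit is scaled by
    (1 + propagation factor), i.e. how strongly its errors are expected to
    propagate to dependent frames, which effectively lowers lambda for heavily
    referenced units. This mirrors the idea only, not the paper's exact model."""
    return (1.0 + propagation_factor) * distortion + lmbda * rate

# A unit referenced by many later frames (large propagation factor) prefers a
# lower-distortion, higher-rate mode than an isolated unit would.
print(td_rd_cost(distortion=100.0, rate=40.0, lmbda=2.0, propagation_factor=1.5))
print(td_rd_cost(distortion=100.0, rate=40.0, lmbda=2.0, propagation_factor=0.0))
```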

54 citations


Proceedings ArticleDOI
Fanyi Duanmu1, Eymen Kurdoglu1, S. Amir Hosseini1, Yong Liu1, Yao Wang1 
11 Aug 2017
TL;DR: A two-tier 360 video streaming framework with prioritized buffer control is proposed to effectively accommodate the dynamics in both network bandwidth and viewing direction, and it is demonstrated that the proposed framework can significantly outperform the conventional360 video streaming solutions.
Abstract: 360 degree video compression and streaming is one of the key components of Virtual Reality (VR) applications. In 360 video streaming, a user may freely navigate through the captured 3D environment by changing her desired viewing direction. Only a small portion of the entire 360 degree video is watched at any time. Streaming the entire 360 degree raw video is therefore unnecessary and bandwidth-consuming. On the other hand, only streaming the video in the predicted user's view direction will introduce streaming discontinuity whenever the prediction is wrong. In this work, a two-tier 360 video streaming framework with prioritized buffer control is proposed to effectively accommodate the dynamics in both network bandwidth and viewing direction. Through simulations driven by real network bandwidth and viewing direction traces, we demonstrate that the proposed framework can significantly outperform the conventional 360 video streaming solutions.
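
An illustrative sketch of prioritized buffer control in a two-tier scheme: the base tier (covering the full 360-degree view at low quality) is buffered far ahead to prevent freezes, and the remaining bandwidth goes to short enhancement-tier segments for the predicted viewport. The target values and the decision rule are assumptions, not the paper's exact policy.

```python
def next_download(base_buffer_s, enh_buffer_s, base_target_s=10.0, enh_target_s=2.0):
    """Prioritized buffer control (illustrative only): the base tier is filled to
    a long target so playback never freezes; only then are short enhancement-tier
    segments for the predicted viewport fetched, kept close to the playhead so
    they are still likely to match the user's head direction."""
    if base_buffer_s < base_target_s:
        return "base"            # protect playback continuity first
    if enh_buffer_s < enh_target_s:
        return "enhancement"     # then improve quality in the predicted viewport
    return "idle"                # both buffers at target; wait

print(next_download(base_buffer_s=4.0, enh_buffer_s=0.0))   # -> "base"
print(next_download(base_buffer_s=12.0, enh_buffer_s=0.5))  # -> "enhancement"
```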

52 citations


Journal ArticleDOI
TL;DR: This paper presents a comprehensive study and analysis of numerous cutting-edge video steganography methods and their performance evaluations from the literature, and suggests current research directions and recommendations to improve on existing video steganography techniques.
Abstract: In the last two decades, the science of covertly concealing and communicating data has acquired tremendous significance due to the technological advancement in communication and digital content. Steganography is the art of concealing secret data in a particular interactive media transporter, e.g., text, audio, image, and video data, in order to build a covert communication between authorized parties. Nowadays, video steganography techniques have become important in many video-sharing and social networking applications such as Livestreaming, YouTube, Twitter, and Facebook because of the noteworthy development of advanced video over the Internet. The performance of any steganographic method ultimately relies on the imperceptibility, hiding capacity, and robustness. In the past decade, many video steganography methods have been proposed; however, the literature lacks sufficient survey articles that discuss all techniques. This paper presents a comprehensive study and analysis of numerous cutting-edge video steganography methods and their performance evaluations from the literature. Both compressed and raw video steganography methods are surveyed. In the compressed domain, video steganography techniques are categorized according to the video compression stages used as venues for data hiding, such as intra frame prediction, inter frame prediction, motion vectors, transformed and quantized coefficients, and entropy coding. On the other hand, raw video steganography methods are classified into spatial and transform domains. This survey suggests current research directions and recommendations to improve on existing video steganography techniques.

51 citations


Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper proposes a new approach for forensic analysis by exploiting the local spatio-temporal relationships within a portion of a video to robustly detect frame removals and produces a refined video-level confidence score that is superior to the raw output scores from the network.
Abstract: Frame dropping is a type of video manipulation where consecutive frames are deleted to omit content from the original video. Automatically detecting dropped frames across a large archive of videos while maintaining a low false alarm rate is a challenging task in digital video forensics. We propose a new approach for forensic analysis by exploiting the local spatio-temporal relationships within a portion of a video to robustly detect frame removals. In this paper, we propose to adapt the Convolutional 3D Neural Network (C3D) for frame drop detection. In order to further suppress errors produced by the network, we produce a refined video-level confidence score and demonstrate that it is superior to the raw output scores from the network. We conduct experiments on two challenging video datasets containing rapid camera motion and zoom changes. The experimental results clearly demonstrate the efficacy of the proposed approach.
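
A minimal sketch of turning per-window detector outputs into a video-level confidence: median-filter the raw scores to suppress isolated spikes, then take the maximum. The exact refinement applied to the C3D outputs in the paper may differ; this only illustrates the idea.

```python
import numpy as np

def video_confidence(window_scores, k=5):
    """Suppress isolated spikes in raw per-window frame-drop scores with a
    temporal median filter, then take the maximum as the video-level confidence.
    (Illustrative refinement; the paper's exact scheme may differ.)"""
    s = np.asarray(window_scores, dtype=float)
    pad = k // 2
    padded = np.pad(s, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + k]) for i in range(len(s))])
    return float(smoothed.max())

# A lone spike (0.9 at index 1) is damped, while the sustained run near the end
# survives the median filter and drives the video-level score.
print(video_confidence([0.1, 0.9, 0.1, 0.2, 0.8, 0.85, 0.9, 0.2]))
```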

38 citations


Journal ArticleDOI
TL;DR: A binocular rivalry inspired model is applied to account for the prediction bias, leading to a significantly improved full reference quality prediction model of stereoscopic videos that allows us to quantitatively predict the coding gain of different variations of asymmetric video compression, and provides new insight on the development of high efficiency 3D video coding schemes.
Abstract: Objective quality assessment of stereoscopic 3D video is challenging but highly desirable, especially in the application of stereoscopic video compression and transmission, where useful quality models that can guide the critical decision-making steps in the selection of mixed-resolution coding, asymmetric quantization, and pre- and post-processing schemes are still missing. Here we first carry out subjective quality assessment experiments on two databases that contain various asymmetrically compressed stereoscopic 3D videos obtained from mixed-resolution coding, asymmetric transform-domain quantization coding, their combinations, and multiple choices of postprocessing techniques. We compare these asymmetric stereoscopic video coding schemes with symmetric coding methods and verify their potential coding gains. We observe a strong systematic bias when using direct averaging of the 2D video quality of both views to predict 3D video quality. We then apply a binocular rivalry inspired model to account for the prediction bias, leading to a significantly improved full-reference quality prediction model for stereoscopic videos. The model allows us to quantitatively predict the coding gain of different variations of asymmetric video compression, and provides new insight on the development of high efficiency 3D video coding schemes.
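
A hedged sketch of the pooling idea: rather than directly averaging the two views' 2D quality scores (which the paper shows is biased for asymmetric compression), weight each view by its relative signal energy so the dominant view contributes more, in the spirit of binocular rivalry. The actual model in the paper is more elaborate; the weighting below is an assumption.

```python
def rivalry_weighted_quality(q_left, q_right, e_left, e_right):
    """Illustrative binocular-rivalry-inspired pooling: each view's 2D quality
    score is weighted by its relative signal energy, so the stronger (dominant)
    view has more influence on the predicted 3D quality than a plain average
    would give it. Energies and the linear form are assumptions."""
    w_left = e_left / (e_left + e_right + 1e-12)
    return w_left * q_left + (1.0 - w_left) * q_right

# Asymmetric coding example: a heavily blurred right view has low energy and
# therefore a reduced influence on the predicted 3D score.
print(rivalry_weighted_quality(q_left=0.92, q_right=0.55, e_left=1.0, e_right=0.4))
```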

Journal ArticleDOI
TL;DR: This article proposes a robust watermarking framework for HEVC-encoded video using an informed detector and shows that the proposed work effectively limits the increase in video bitrate and the degradation in perceptual quality.
Abstract: Digital watermarking has received much attention in recent years as a promising solution to copyright protection. Video watermarking in the compressed domain has gained importance since videos are stored and transmitted in a compressed format. This decreases the overhead of fully decoding and re-encoding the video for embedding and extraction of the watermark. High Efficiency Video Coding (HEVC/H.265) is the latest and most efficient video compression standard and a successor to H.264 Advanced Video Coding. In this article, we propose a robust watermarking framework for HEVC-encoded video using an informed detector. A readable watermark is embedded invisibly in P frames for better perceptual quality. Our framework ensures security and robustness by selecting appropriate blocks using a random key and the spatio-temporal characteristics of the compressed video. A detailed analysis of the strengths of different compressed-domain features is performed for implementing the watermarking framework. We experimentally demonstrate the utility of the proposed work. The results show that the proposed work effectively limits the increase in video bitrate and the degradation in perceptual quality. The proposed framework is robust against re-encoding and image processing attacks.

Proceedings ArticleDOI
01 Jul 2017
TL;DR: This paper derives the optimal rate-distortion relationship in spherical domain and presents its optimal solution based on HEVC/H.265 anchor for 360-degree video coding.
Abstract: Emerging virtual reality (VR) applications bring many challenges to video coding for 360-degree videos. To compress this kind of video, each picture should first be projected to a 2D plane (e.g., an equirectangular projection map), adapting to the input of existing video coding systems. At the display side, an inverse projection is performed before viewport rendering. However, such a projection introduces very different levels of distortion depending on location, which makes the rate-distortion optimization process in video coding inefficient. In this paper, we consider the distortion in the spherical domain and analyse its influence on the rate-distortion optimization process. Then we derive the optimal rate-distortion relationship in the spherical domain and present its optimal solution based on HEVC/H.265. Experimental results show that the proposed method can bring up to 11.5% bit savings compared with the current HEVC/H.265 anchor for 360-degree video coding.
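
A small sketch of measuring distortion in the spherical domain for an equirectangular (ERP) frame: each row's squared error is weighted by the cosine of its latitude, since rows near the poles cover far less sphere area than rows at the equator. This WS-PSNR-style weighting illustrates the observation the paper builds on; the paper's rate-distortion derivation is not reproduced here.

```python
import numpy as np

def spherical_weighted_mse(ref, rec):
    """Weight squared errors of an ERP frame by cos(latitude): plain per-pixel
    MSE over-counts the stretched polar rows, whereas this weighting measures
    error roughly per unit of sphere area."""
    h, w = ref.shape[:2]
    lat = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2     # latitude per row
    weights = np.cos(lat)[:, None] * np.ones((1, w))
    se = (ref.astype(float) - rec.astype(float)) ** 2
    if se.ndim == 3:
        se = se.mean(-1)                                   # average over channels
    return float((weights * se).sum() / weights.sum())

ref = np.random.randint(0, 256, (180, 360), dtype=np.uint8)
rec = np.clip(ref + np.random.randint(-5, 6, ref.shape), 0, 255)
print(spherical_weighted_mse(ref, rec))
```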

Journal ArticleDOI
01 Oct 2017
TL;DR: Experimental results with videos from the Open Video Project show that the proposed new computational visual attention model represents an effective solution to the problem of automatic video summarization, producing video summaries with similar quality to the ground-truth manually created by a group of 50 users.
Abstract: This work addresses the development of a computational model of visual attention to perform the automatic summarization of digital videos from television archives. Although the television system represents one of the most fascinating media phenomena ever created, we still observe the absence of effective solutions for content-based information retrieval from video recordings of programs produced by this media universe. This fact relates to the high complexity of the content-based video retrieval problem, which involves several challenges, among which we may highlight the usual demand for video summaries to facilitate indexing, browsing and retrieval operations. To achieve this goal, we propose a new computational visual attention model, inspired by the human visual system and based on computer vision methods (face detection, motion estimation and saliency map computation), to estimate static video abstracts, that is, collections of salient images or key frames extracted from the original videos. Experimental results with videos from the Open Video Project show that our approach represents an effective solution to the problem of automatic video summarization, producing video summaries with similar quality to ground-truth summaries manually created by a group of 50 users.
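
An illustrative sketch of attention-driven key-frame selection: per-frame saliency, motion and face scores are combined into one attention curve, and temporally spread peaks are kept as the static summary. The weights, suppression rule and names are assumptions; the paper's model is richer.

```python
import numpy as np

def select_keyframes(saliency, motion, faces, weights=(0.4, 0.3, 0.3), n_keyframes=5):
    """Combine per-frame saliency, motion and face-presence scores into a single
    attention curve and keep temporally spread peaks as key frames.
    (Illustrative; not the paper's attention model.)"""
    def norm(x):
        x = np.asarray(x, dtype=float)
        return (x - x.min()) / (x.max() - x.min() + 1e-12)
    attention = (weights[0] * norm(saliency)
                 + weights[1] * norm(motion)
                 + weights[2] * norm(faces))
    min_gap = len(attention) // (2 * n_keyframes)     # simple temporal suppression
    chosen = []
    for idx in np.argsort(-attention):                # highest attention first
        if all(abs(int(idx) - c) > min_gap for c in chosen):
            chosen.append(int(idx))
        if len(chosen) == n_keyframes:
            break
    return sorted(chosen)

print(select_keyframes(np.random.rand(200), np.random.rand(200), np.random.rand(200)))
```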

Patent
Ye Jing1, Shan Liu1, Shaw-Min Lei1
23 Feb 2017
TL;DR: In this article, the authors proposed a method for video coding, which includes receiving input data associated with a current block in an image frame, generating an inter predictor of the current block, and generating an intra predictor based on samples of neighboring pixels and an intra prediction mode that locates the samples of neighbouring pixels.
Abstract: Aspects of the disclosure include a method for video coding. The method includes receiving input data associated with a current block in an image frame, generating an inter predictor of the current block, and generating an intra predictor of the current block based on samples of neighboring pixels and an intra prediction mode that locates the samples of neighboring pixels. The method further includes generating a final predictor of the current block by combining the inter predictor and the intra predictor according to one or more intra weight coefficients associated with the intra prediction mode, and encoding or decoding the current block based on the final predictor to output encoded video data or a decoded block. The one or more intra weight coefficients indicate one or more ratios that corresponding one or more portions of the intra predictor are combined with the inter predictor, respectively.
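
A minimal sketch of the claimed predictor combination: the final predictor is a per-sample weighted blend of the intra and inter predictors, final = w * intra + (1 - w) * inter. The weight pattern below is made up for illustration; in the claim the weights depend on the signalled intra prediction mode.

```python
import numpy as np

def combined_prediction(inter_pred, intra_pred, intra_weights):
    """Blend an inter predictor and an intra predictor of the same block using
    per-sample intra weight coefficients, as the claim describes. The weights
    used here are illustrative assumptions, not values from the disclosure."""
    w = np.asarray(intra_weights, dtype=float)
    return w * intra_pred + (1.0 - w) * inter_pred

inter_pred = np.full((8, 8), 100.0)
intra_pred = np.full((8, 8), 140.0)
# e.g. heavier intra weight near the block's top row, fading towards the bottom,
# since intra reference samples sit above/left of the block.
w = np.linspace(0.75, 0.25, 8)[:, None] * np.ones((1, 8))
print(combined_prediction(inter_pred, intra_pred, w).round(1))
```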

Journal ArticleDOI
TL;DR: This paper enables video coding for video stabilization by constructing the camera motions from the motion vectors employed in video coding, designing a grid-based 2D method, named CodingFlow, which is optimized for spatially-variant motion compensation.
Abstract: Video coding focuses on reducing the data size of videos. Video stabilization aims to remove shaky camera motion. In this paper, we enable video coding for video stabilization by constructing the camera motions based on the motion vectors employed in the video coding. The existing stabilization methods rely heavily on image features for the recovery of camera motions. However, feature tracking is time-consuming and prone to errors. On the other hand, nearly all captured videos have been compressed before any further processing, and such compression has produced a rich set of block-based motion vectors that can be utilized for estimating the camera motion. More specifically, video stabilization requires camera motions between two adjacent frames. However, motion vectors extracted from video coding may refer to non-adjacent frames. We first show that these non-adjacent motions can be transformed into adjacent motions such that each coding block within a frame contains a motion vector referring to its adjacent previous frame. Then, we regularize these motion vectors to yield a spatially-smoothed motion field at each frame, named CodingFlow, which is optimized for a spatially-variant motion compensation. Based on CodingFlow, we finally design a grid-based 2D method to accomplish the video stabilization. Our method is evaluated in terms of efficiency and stabilization quality, both quantitatively and qualitatively, which shows that our method can achieve high-quality results compared with the state-of-the-art (feature-based) methods.
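
A crude sketch of two of the steps described above, under stated assumptions: scaling a motion vector that references a non-adjacent frame down to per-adjacent-frame motion, and spatially smoothing the block-level motion field with a simple box filter. The paper's CodingFlow optimization is more principled; this only illustrates the data flow.

```python
import numpy as np

def to_adjacent_motion(mv, ref_distance):
    """Crude approximation: convert a motion vector that references a frame
    `ref_distance` frames away into per-adjacent-frame motion by linear scaling.
    (The paper derives this transformation more carefully.)"""
    return np.asarray(mv, dtype=float) / float(ref_distance)

def smooth_motion_field(mv_field, ksize=5):
    """Spatially smooth a block-level motion-vector field (Hb x Wb x 2) with a
    box filter, standing in for the spatially-smoothed field the paper optimizes."""
    pad = ksize // 2
    padded = np.pad(mv_field, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(mv_field, dtype=float)
    for dy in range(ksize):
        for dx in range(ksize):
            out += padded[dy:dy + mv_field.shape[0], dx:dx + mv_field.shape[1]]
    return out / (ksize * ksize)

field = np.random.randn(30, 40, 2)                 # dummy decoded motion vectors
field = to_adjacent_motion(field, ref_distance=2)  # refer to the adjacent frame
print(smooth_motion_field(field).shape)            # (30, 40, 2)
```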

Proceedings ArticleDOI
28 May 2017
TL;DR: A novel two-tier 360 degree video streaming scheme is proposed to accommodate the dynamics in both network bandwidth and viewing direction and it is demonstrated that the proposed framework can significantly outperform conventional 360 video streaming schemes.
Abstract: 360 degree video compression and delivery is one of the key components of virtual reality (VR) applications. In such applications, the users may freely control and navigate the captured 3D environment from any viewing direction. Given that only a small portion of the entire video is watched at any time, fetching the entire 360 degree raw video is therefore unnecessary and bandwidth-consuming. In this work, a novel two-tier 360 degree video streaming scheme is proposed to accommodate the dynamics in both network bandwidth and viewing direction. Based on the real-trace driven simulations, we demonstrate that the proposed framework can significantly outperform conventional 360 video streaming schemes.

Journal ArticleDOI
TL;DR: An enhanced model of objective VQA based on the estimation of jerkiness is proposed, which performs better, in terms of estimating the impact of multiple frame freezing impairments, and has more affinity with the subjective test results.
Abstract: In wireless networks, due to limited bandwidth and packet losses, seamless and ubiquitous delivery of high-quality video streaming services is a major challenge for operators. In order to improve the process of online video quality monitoring, no-reference (NR) objective video quality assessment (VQA) methods are required. In some networks, the video decoder on the receiving side adopts a mechanism in which the last correctly received frame is frozen and displayed on the video display terminal until the next correct frame is received. This phenomenon, employed as an error concealment technique, can cause perceptual jerkiness on the video display terminal. In this paper, we propose an enhanced model of objective VQA based on the estimation of jerkiness. A study of three contemporary NR methods, used for objective VQA and online monitoring of videos, has been included along with subjective VQA tests. The subjective tests were performed for a set of video sequences with specific spat...
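
A minimal sketch of the measurement step behind jerkiness estimation: consecutive frames that are nearly identical are counted as freezes, and the resulting freeze lengths could feed a quality model. The threshold and names are assumptions; the paper's model is not reproduced.

```python
import numpy as np

def freeze_events(frames, diff_threshold=0.5):
    """No-reference detection of frame freezes: consecutive frames with a mean
    absolute difference below a small threshold are treated as frozen. A
    jerkiness-based quality score would then decrease with the number and
    length of freezes; only the measurement step is shown here."""
    freezes, run = [], 0
    for prev, cur in zip(frames, frames[1:]):
        mad = np.mean(np.abs(cur.astype(float) - prev.astype(float)))
        if mad < diff_threshold:
            run += 1
        elif run:
            freezes.append(run)
            run = 0
    if run:
        freezes.append(run)
    return freezes   # list of freeze lengths, in frames

frames = [np.random.randint(0, 256, (72, 96), dtype=np.uint8) for _ in range(10)]
frames[4] = frames[3].copy()   # simulate one repeated (frozen) frame
print(freeze_events(frames))   # -> [1]
```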

Proceedings ArticleDOI
05 May 2017
TL;DR: This paper presents a comprehensive study and analysis of numerous cutting-edge video steganography methods and their performance evaluations from the literature, and suggests current research directions and recommendations to improve on existing video steganographic techniques.
Abstract: Nowadays, video steganography has become important in many security applications. The performance of any steganographic method ultimately relies on the imperceptibility, hiding capacity, and robustness. In the past decade, many video steganography methods have been proposed; however, the literature lacks sufficient survey articles that discuss all techniques. This paper presents a comprehensive study and analysis of numerous cutting-edge video steganography methods and their performance evaluations from the literature. Both compressed and raw video steganographic methods are surveyed. In the compressed domain, video steganographic techniques are categorized according to the video compression stages used as venues for data hiding, such as intra frame prediction, inter frame prediction, motion vectors, transformed and quantized coefficients, and entropy coding. On the other hand, raw video steganographic methods are classified into spatial and transform domains. This survey suggests current research directions and recommendations to improve on existing video steganographic techniques.

Journal ArticleDOI
TL;DR: Experimental results on video captioning tasks show that the proposed method, utilizing only RGB frames as input without extra video or text training data, could achieve competitive performance with state-of-the-art methods.

Patent
18 May 2017
TL;DR: In this article, a video capture device captures 360-degree video in a first projection format, and an encoding device encodes the captured 360-degree video into a 360-degree video bitstream.
Abstract: In a system for 360 degree video capture and playback, 360 degree video may be captured, stitched, encoded, decoded, rendered, and played back. In one or more implementations, a video capture device captures 360 degree video in a first projection format, and an encoding device encodes the captured 360 degree video into a 360 degree video bitstream. In some aspects, the 360 degree video bitstream is encoded with an indication of the first projection format. In one or more implementations, a rendering device converts the decoded 360 degree video bitstream from the first projection format to a second projection format based on the indication. In one or more implementations, a processing device generates projection maps where each is respectively associated with a different projection format, and a rendering device renders the decoded 360 degree video bitstream using one of the projection maps.

Journal ArticleDOI
TL;DR: A fast H.264/advanced video coding (AVC) to HEVC transcoding method is proposed, with early termination strategies for the prediction unit (PU) modes based on the CU size and corresponding prior statistical knowledge.
Abstract: With the popularity of the high-efficiency video coding (HEVC) standard, a video server usually transcodes a video stream to HEVC for its higher compression ratio. In this paper, a fast H.264/advanced video coding (AVC) to HEVC transcoding method is proposed. In the HEVC encoding procedure, each coding unit (CU) is first checked for motion homogeneity based on an analysis of the decoded information from the H.264/AVC bit stream. Then, for motion-homogeneous blocks, early termination strategies for the CU depth and the corresponding prediction unit (PU) modes are applied based on the CU size and corresponding prior statistical knowledge. For non-motion-homogeneous blocks, a corresponding PU mode early termination strategy is also proposed. Experimental results demonstrate the effectiveness of the proposed method.
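
An illustrative sketch of the early-termination structure: if the co-located H.264 macroblocks inside a CU indicate a motion-homogeneous area, CU splitting is stopped and only a few PU modes are tested. The mode labels, size rule and candidate lists are assumptions, not the paper's statistics-derived rules.

```python
def early_terminate_cu(h264_mb_modes, cu_size):
    """Illustrative early-termination rule for H.264 -> HEVC transcoding: if all
    co-located H.264 macroblocks inside the HEVC CU were coded as SKIP or as
    large inter partitions (a motion-homogeneous area), stop splitting and test
    only merge/large-PU modes; otherwise allow further CU splitting. Only the
    decision structure is shown, not the paper's refined statistics."""
    homogeneous = all(m in ("SKIP", "16x16") for m in h264_mb_modes)
    if homogeneous and cu_size >= 32:
        return {"split": False, "pu_candidates": ["MERGE", "2Nx2N"]}
    return {"split": True, "pu_candidates": ["2Nx2N", "2NxN", "Nx2N", "NxN"]}

print(early_terminate_cu(["SKIP", "SKIP", "16x16", "SKIP"], cu_size=32))
print(early_terminate_cu(["8x8", "SKIP", "16x16", "SKIP"], cu_size=32))
```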

Journal ArticleDOI
TL;DR: A standards-compliant video-encoding scheme that can suppress unnecessary temporal fluctuation in stable background areas of a raw video, and improves object-detection performance and results in lower bit rates with comparable quality.
Abstract: Many distributed wireless surveillance applications use compressed videos for automatic video analysis tasks. However, the accuracy of object detection, which is essential for video analysis, can be reduced because lossy compression degrades video quality. Current standardized video-encoding schemes can cause temporal fluctuation for encoded blocks in stable background areas of a raw video, which strongly affects object-detection accuracy. To obtain better object-detection performance on compressed videos, the authors introduce a standards-compliant video-encoding scheme that can suppress unnecessary temporal fluctuation in stable background areas. New mode-decision strategies, designed for both intra- and interframes, reduce the temporal fluctuation while maintaining acceptable rate-distortion performance. Experimental results show that, compared with traditional encoding schemes, the proposed scheme improves object-detection performance and results in lower bit rates with comparable quality.

Journal ArticleDOI
TL;DR: This work represents the hierarchical structure of video with multiple granularities including, from short to long, single frame, consecutive frames (motion), short clip, and the entire video and proposes a novel deep learning framework to model each granularity individually.
Abstract: Video analysis is an important branch of computer vision due to its wide applications, ranging from video surveillance, video indexing, and retrieval to human computer interaction. All of the applications are based on a good video representation, which encodes video content into a feature vector with fixed length. Most existing methods treat video as a flat image sequence, but from our observations we argue that video is an information-intensive media with intrinsic hierarchical structure, which is largely ignored by previous approaches. Therefore, in this work, we represent the hierarchical structure of video with multiple granularities including, from short to long, single frame, consecutive frames (motion), short clip, and the entire video. Furthermore, we propose a novel deep learning framework to model each granularity individually. Specifically, we model the frame and motion granularities with 2D convolutional neural networks and model the clip and video granularities with 3D convolutional neural networks. Long Short-Term Memory networks are applied on the frame, motion, and clip to further exploit the long-term temporal clues. Consequently, the whole framework utilizes multi-stream CNNs to learn a hierarchical representation that captures spatial and temporal information of video. To validate its effectiveness in video analysis, we apply this video representation to action recognition task. We adopt a distribution-based fusion strategy to combine the decision scores from all the granularities, which are obtained by using a softmax layer on the top of each stream. We conduct extensive experiments on three action benchmarks (UCF101, HMDB51, and CCV) and achieve competitive performance against several state-of-the-art methods.

Proceedings Article
04 Feb 2017
TL;DR: A joint feature projection matrix and heterogeneous dictionary pair learning (PHDL) approach for IVPR is proposed; to ensure that the obtained coding coefficients have favorable discriminability, PHDL designs a point-to-set coefficient discriminant term.
Abstract: Person re-identification (re-id) plays an important role in video surveillance and forensics applications. In many cases, person re-id needs to be conducted between image and video clip, e.g., re-identifying a suspect from large quantities of pedestrian videos given a single image of him. We call re-id in this scenario as image to video person re-id (IVPR). In practice, image and video are usually represented with different features, and there usually exist large variations between frames within each video. These factors make matching between image and video become a very challenging task. In this paper, we propose a joint feature projection matrix and heterogeneous dictionary pair learning (PHDL) approach for IVPR. Specifically, PHDL jointly learns an intra-video projection matrix and a pair of heterogeneous image and video dictionaries. With the learned projection matrix, the influence of variations within each video to the matching can be reduced. With the learned dictionary pair, the heterogeneous image and video features can be transformed into coding coefficients with the same dimension, such that the matching can be conducted using coding coefficients. Furthermore, to ensure that the obtained coding coefficients have favorable discriminability, PHDL designs a point-to-set coefficient discriminant term. Experiments on the public iLIDS-VID and PRID 2011 datasets demonstrate the effectiveness of the proposed approach.

Patent
14 Feb 2017
TL;DR: In this article, a multi-pass non-separable inverse transformation is performed on the plurality of values to derive residual data that represents pixel differences between the current block of video data and a predictive block of the video data.
Abstract: An example method of decoding video data includes determining, by a video decoder and based on syntax elements in an encoded video bitstream, a plurality of values for a current block of the video data; performing, by the video decoder, a multi-pass non-separable inverse transformation on the plurality of values to derive residual data that represents pixel differences between the current block of the video data and a predictive block of the video data; and reconstructing, by the video decoder, the current block of the video data based on the residual data and the predictive block of the video data. In some examples, performing a pass of the multi-pass non-separable inverse transformation includes performing a plurality of Givens orthogonal transformations.
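
A small sketch of the building block named in the claim, a pass of Givens rotations applied to a coefficient vector: each rotation mixes one pair of elements by an angle. The pairs and angles below are purely illustrative; a real codec would signal or predefine them as part of the multi-pass non-separable transform.

```python
import numpy as np

def givens_rotation_pass(coeffs, pairs, angles):
    """Apply one pass of Givens rotations to a coefficient vector: each rotation
    mixes one pair of elements (i, j) by an angle theta. Chaining several such
    passes yields a multi-pass non-separable transform; the pairs and angles
    here are made-up illustrations, not values from the disclosure."""
    out = np.asarray(coeffs, dtype=float).copy()
    for (i, j), theta in zip(pairs, angles):
        c, s = np.cos(theta), np.sin(theta)
        xi, xj = out[i], out[j]
        out[i] = c * xi - s * xj
        out[j] = s * xi + c * xj
    return out

coeffs = np.array([4.0, 1.0, -2.0, 0.5])
print(givens_rotation_pass(coeffs, pairs=[(0, 1), (2, 3)], angles=[np.pi / 6, np.pi / 4]))
```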

Journal ArticleDOI
TL;DR: A new method is proposed in which video summarization is performed as a simultaneous sparse dictionary training and selection problem, and it is shown that the proposed method improves the summarization of large amounts of video data compared to other methods.
Abstract: Every day, a huge amount of video data is generated worldwide, and processing this kind of data requires powerful resources in terms of time, manpower, and hardware. Therefore, to help quickly understand the content of video data, video summarization methods have been proposed. Recently, sparse formulation-based methods have been found to be able to summarize large amounts of video compared to other methods. In this paper, we propose a new method in which video summarization is performed as a simultaneous sparse dictionary training and selection problem. It is shown that the proposed method is able to improve the summarization of large amounts of video data compared to other methods. Finally, the performance of the proposed method is compared to state-of-the-art methods using standard data sets, in which the key frames are manually tagged. The obtained results demonstrate that the proposed method improves video summarization compared to other methods.

Posted Content
TL;DR: This work presents a classification framework for the joint use of text, visual and audio features, and conducts an extensive set of experiments to quantify the benefit that this additional mode brings.
Abstract: The YouTube-8M video classification challenge requires teams to classify 0.7 million videos into one or more of 4,716 classes. In this Kaggle competition, we placed in the top 3% out of 650 participants using released video and audio features. Beyond that, we extend the original competition by including text information in the classification, making this a truly multi-modal approach with vision, audio and text. The newly introduced text data is termed as YouTube-8M-Text. We present a classification framework for the joint use of text, visual and audio features, and conduct an extensive set of experiments to quantify the benefit that this additional mode brings. The inclusion of text yields state-of-the-art results, e.g. 86.7% GAP on the YouTube-8M-Text validation dataset.

Proceedings ArticleDOI
Ziwei Yang1, Youjiang Xu1, Huiyun Wang1, Bo Wang1, Yahong Han1 
23 Oct 2017
TL;DR: A Multirate Multimodal Approach for video captioning is proposed that utilizes a Multirate GRU to capture the temporal structure of videos and achieves strong performance on the 2nd MSR Video to Language Challenge.
Abstract: Automatically describing videos with natural language is a crucial challenge of video understanding. Compared to images, videos have a specific spatial-temporal structure and various modality information. In this paper, we propose a Multirate Multimodal Approach for video captioning. Considering that the speed of motion in videos varies constantly, we utilize a Multirate GRU to capture the temporal structure of videos. It encodes video frames with different intervals and has a strong ability to deal with motion speed variance. As videos contain different modality cues, we design a particular multimodal fusion method. By incorporating visual, motion, and topic information together, we construct a well-designed video representation. Then the video representation is fed into an RNN-based language model for generating natural language descriptions. We evaluate our approach for video captioning on "Microsoft Research - Video to Text" (MSR-VTT), a large-scale video benchmark for video understanding. Our approach achieves strong performance on the 2nd MSR Video to Language Challenge.

Journal ArticleDOI
Jooseung Lee1, In-Cheol Park1
TL;DR: Experimental results show that the proposed architecture provides the best visual quality at the cost of reasonable hardware resources.
Abstract: A new algorithm and its hardware architecture are presented to up-scale high-definition (HD) and full-HD video streams to 4-K ultra-HD video streams in real time. The Lagrange interpolation is employed, as it provides high estimation accuracy and hardware-friendly properties. To enhance the accuracy further, the pixels at the edge regions are specially processed by employing an image-sharpening technique. Experimental results show that the proposed architecture provides the best visual quality at the cost of reasonable hardware resources.
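
A minimal sketch of the interpolation idea: 4-tap cubic Lagrange weights are derived from the node positions and used to create the in-between samples for 2x upscaling of one row. The edge-region sharpening the paper adds and the hardware mapping are omitted; names are illustrative.

```python
import numpy as np

def lagrange_weights(t):
    """4-tap cubic Lagrange interpolation weights for a fractional position t in
    [0, 1) between the two middle samples of a 4-sample window at x = -1, 0, 1, 2."""
    x = np.array([-1.0, 0.0, 1.0, 2.0])
    w = np.ones(4)
    for k in range(4):
        for m in range(4):
            if m != k:
                w[k] *= (t - x[m]) / (x[k] - x[m])
    return w

def upscale_row_2x(row):
    """Illustrative 2x horizontal upscaling of one row with Lagrange interpolation
    (edge-aware sharpening from the paper is omitted)."""
    padded = np.pad(row.astype(float), 2, mode="edge")
    out = np.empty(2 * len(row))
    out[0::2] = row                                  # keep original samples
    w = lagrange_weights(0.5)                        # halfway positions
    for i in range(len(row)):
        out[2 * i + 1] = np.dot(w, padded[i + 1:i + 5])
    return out

print(upscale_row_2x(np.array([0, 10, 20, 30, 20, 10], dtype=float)))
```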