scispace - formally typeset
Search or ask a question
Author

Hongfei Fan

Bio: Hongfei Fan is an academic researcher. The author has contributed to research in topics: Video quality & Feature (computer vision). The author has an hindex of 1, co-authored 6 publications receiving 3 citations.

Papers
More filters
Journal ArticleDOI
TL;DR: This work proposes a no-reference video quality assessment method, aiming to achieve high-generalization capability in cross-content, -resolution and -frame rate quality prediction, and proposes a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality.
Abstract: In this work, we propose a no-reference video quality assessment method, aiming to achieve high-generalization capability in cross-content, -resolution and -frame rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in spatial-temporal domain. In the spatial domain, to tackle the resolution and content variations, we impose the Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality. Experiments show that our method outperforms the state-of-the-art methods on cross-dataset settings, and achieves comparable performance on intra-dataset configurations, demonstrating the high-generalization capability of the proposed method. The codes are released at.

36 citations

Journal ArticleDOI
TL;DR: This paper develops the first unsupervised domain adaptation based no reference quality assessment method for SCIs, leveraging rich subjective ratings of the natural images (NIs) and introduces three types of losses which complementarily and explicitly regularize the feature space of ranking in a progressive manner.
Abstract: In this paper, we quest the capability of transferring the quality of natural scene images to the images that are not acquired by optical cameras (e.g., screen content images, SCIs), rooted in the widely accepted view that the human visual system has adapted and evolved through the perception of natural environment. Here, we develop the first unsupervised domain adaptation based no reference quality assessment method for SCIs, leveraging rich subjective ratings of the natural images (NIs). In general, it is a non-trivial task to directly transfer the quality prediction model from NIs to a new type of content (i.e., SCIs) that holds dramatically different statistical characteristics. Inspired by the transferability of pair-wise relationship, the proposed quality measure operates based on the philosophy of improving the transferability and discriminability simultaneously. In particular, we introduce three types of losses which complementarily and explicitly regularize the feature space of ranking in a progressive manner. Regarding feature discriminatory capability enhancement, we propose a center based loss to rectify the classifier and improve its prediction capability not only for source domain (NI) but also the target domain (SCI). For feature discrepancy minimization, the maximum mean discrepancy (MMD) is imposed on the extracted ranking features of NIs and SCIs. Furthermore, to further enhance the feature diversity, we introduce the correlation penalization between different feature dimensions, leading to the features with lower rank and higher diversity. Experiments show that our method can achieve higher performance on different source-target settings based on a light-weight convolution neural network. The proposed method also sheds light on learning quality assessment measures for unseen application-specific content without the cumbersome and costing subjective evaluations.

16 citations

Proceedings ArticleDOI
17 Oct 2021
TL;DR: Wu et al. as mentioned in this paper studied the perceptual quality of professional user-generated content (PUGC) based video services and introduced a database consisting of 10,000 PUGC videos with subjective ratings.
Abstract: Recent years have witnessed a surge of professional user-generated content (PUGC) based video services, coinciding with the accelerated proliferation of video acquisition devices such as mobile phones, wearable cameras, and unmanned aerial vehicles. Different from traditional UGC videos by impromptu shooting, PUGC videos produced by professional users tend to be carefully designed and edited, receiving high popularity with a relatively satisfactory playing count. In this paper, we systematically conduct the comprehensive study on the perceptual quality of PUGC videos and introduce a database consisting of 10,000 PUGC videos with subjective ratings. In particular, during the subjective testing, we collect the human opinions based upon not only the MOS, but also the attributes that could potentially influence the visual quality including face, noise, blur, brightness, and color. We make the attempt to analyze the large-scale PUGC database with a series of video quality assessment (VQA) algorithms and a dedicated baseline model based on pretrained deep neural network is further presented. The cross-dataset experiments reveal a large domain gap between the PUGC and the traditional user-generated videos, which are critical in learning based VQA. These results shed light on developing next-generation PUGC quality assessment algorithms with desired properties including promising generalization capability, high accuracy, and effectiveness in perceptual optimization. The dataset and the codes are released at https://github.com/wlkdb/pugcq_create.

4 citations

Journal ArticleDOI
TL;DR: Zhang et al. as discussed by the authors developed the first unsupervised domain adaptation based no reference quality assessment method for SCIs, leveraging rich subjective ratings of the natural images (NIs).
Abstract: In this paper, we quest the capability of transferring the quality of natural scene images to the images that are not acquired by optical cameras (e.g., screen content images, SCIs), rooted in the widely accepted view that the human visual system has adapted and evolved through the perception of natural environment. Here, we develop the first unsupervised domain adaptation based no reference quality assessment method for SCIs, leveraging rich subjective ratings of the natural images (NIs). In general, it is a non-trivial task to directly transfer the quality prediction model from NIs to a new type of content (i.e., SCIs) that holds dramatically different statistical characteristics. Inspired by the transferability of pair-wise relationship, the proposed quality measure operates based on the philosophy of improving the transferability and discriminability simultaneously. In particular, we introduce three types of losses which complementarily and explicitly regularize the feature space of ranking in a progressive manner. Regarding feature discriminatory capability enhancement, we propose a center based loss to rectify the classifier and improve its prediction capability not only for source domain (NI) but also the target domain (SCI). For feature discrepancy minimization, the maximum mean discrepancy (MMD) is imposed on the extracted ranking features of NIs and SCIs. Furthermore, to further enhance the feature diversity, we introduce the correlation penalization between different feature dimensions, leading to the features with lower rank and higher diversity. Experiments show that our method can achieve higher performance on different source-target settings based on a light-weight convolution neural network. The proposed method also sheds light on learning quality assessment measures for unseen application-specific content without the cumbersome and costing subjective evaluations.

2 citations

Posted Content
TL;DR: Wang et al. as discussed by the authors proposed a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality, which can reduce the domain gap between different video samples, resulting in a more generalized quality feature representation.
Abstract: In this work, we propose a no-reference video quality assessment method, aiming to achieve high-generalization capability in cross-content, -resolution and -frame rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in spatial-temporal domain. In the spatial domain, to tackle the resolution and content variations, we impose the Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in a more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality. Experiments show that our method outperforms the state-of-the-art methods on cross-dataset settings, and achieves comparable performance on intra-dataset configurations, demonstrating the high-generalization capability of the proposed method.

Cited by
More filters
Proceedings ArticleDOI
08 Jul 2022
TL;DR: Wang et al. as mentioned in this paper proposed a temporal perceptual quality index (TPQI) to measure the temporal distortion by describing the graphic morphology of the representation, which can be applied to any dataset without parameter tuning.
Abstract: With the rapid growth of in-the-wild videos taken by non-specialists, blind video quality assessment (VQA) has become a challenging and demanding problem. Although lots of efforts have been made to solve this problem, it remains unclear how the human visual system (HVS) relates to the temporal quality of videos. Meanwhile, recent work has found that the frames of natural video transformed into the perceptual domain of the HVS tend to form a straight trajectory of the representations. With the obtained insight that distortion impairs the perceived video quality and results in a curved trajectory of the perceptual representation, we propose a temporal perceptual quality index (TPQI) to measure the temporal distortion by describing the graphic morphology of the representation. Specifically, we first extract the video perceptual representations from the lateral geniculate nucleus (LGN) and primary visual area (V1) of the HVS, and then measure the straightness and compactness of their trajectories to quantify the degradation in naturalness and content continuity of video. Experiments show that the perceptual representation in the HVS is an effective way of predicting subjective temporal quality, and thus TPQI can, for the first time, achieve comparable performance to the spatial quality metric and be even more effective in assessing videos with large temporal variations. We further demonstrate that by combining with NIQE, a spatial quality metric, TPQI can achieve top performance over popular in-the-wild video datasets. More importantly, TPQI does not require any additional information beyond the video being evaluated and thus can be applied to any datasets without parameter tuning. Source code is available at https://github.com/UoLMM/TPQI-VQA.

13 citations

Journal ArticleDOI
20 Jun 2022
TL;DR: The proposed Temporal Distortion-Content Transformers for Video Quality Assessment reaches state-of-the-art performance on several VQA benchmarks without any extra pre-training datasets and up to 10% better generalization ability than existing methods.
Abstract: The temporal relationships between frames and their influences on video quality assessment (VQA) are still under-studied in existing works. These relationships lead to two important types of effects for video quality. Firstly, some temporal variations (such as shaking, flicker, and abrupt scene transitions) are causing temporal distortions and lead to extra quality degradations, while other variations (e.g. those related to meaningful happenings) do not. Secondly, the human visual system often has different attention to frames with different contents, resulting in their different importance to the overall video quality. Based on prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues. To better differentiate temporal variations and thus capture the temporal distortions, we design a transformer-based Spatial-Temporal Distortion Extraction (STDE) module. To tackle with temporal quality attention, we propose the encoder-decoder-like temporal content transformer (TCT). We also introduce the temporal sampling on features to reduce the input length for the TCT, so as to improve the learning effectiveness and efficiency of this module. Consisting of the STDE and the TCT, the proposed Temporal Distortion-Content Transformers for Video Quality Assessment (DisCoVQA) reaches state-of-the-art performance on several VQA benchmarks without any extra pre-training datasets and up to 10% better generalization ability than existing methods. We also conduct extensive ablation experiments to prove the effectiveness of each part in our proposed model, and provide visualizations to prove that the proposed modules achieve our intention on modeling these temporal issues. We will publish our codes and pretrained weights later.

10 citations

Journal ArticleDOI
TL;DR: In this article , a self-supervised pre-training for video quality assessment (VQA) task is proposed, which exploits the plentiful unlabeled video data to learn feature representation in a simple-yet-effective way.
Abstract: Video quality assessment (VQA) task is an ongoing small sample learning problem due to the costly effort required for manual annotation. Since existing VQA datasets are of limited scale, prior research tries to leverage models pre-trained on ImageNet to mitigate this kind of shortage. Nonetheless, these well-trained models targeting on image classification task can be sub-optimal when applied on VQA data from a significantly different domain. In this paper, we make the first attempt to perform self-supervised pre-training for VQA task built upon contrastive learning method, targeting at exploiting the plentiful unlabeled video data to learn feature representation in a simple-yet-effective way. Specifically, we implement this idea by first generating distorted video samples with diverse distortion characteristics and visual contents based on the proposed distortion augmentation strategy. Afterwards, we conduct contrastive learning to capture quality-aware information by maximizing the agreement on feature representations of future frames and their corresponding predictions in the embedding space. In addition, we further introduce distortion prediction task as an additional learning objective to push the model towards discriminating different distortion categories of the input video. Solving these prediction tasks jointly with the contrastive learning not only provides stronger surrogate supervision signals, but also learns the shared knowledge among the prediction tasks. Extensive experiments demonstrate that our approach sets a new state-of-the-art in self-supervised learning for VQA task. Our results also underscore that the learned pre-trained model can significantly benefit the existing learning based VQA models. Source code is available at https://github.com/cpf0079/CSPT.

9 citations

Journal ArticleDOI
TL;DR: This work proposes a novel scheme, spatial-temporal grid mini-cube sampling (St-GMS) to get a novel type of sample, named fragments, and designs the Fragment Attention Network (FANet), a network architecture tailored specifically for fragments.
Abstract: —The increased resolution of real-world videos presents a dilemma between efficiency and accuracy for deep Video Quality Assessment (VQA). On the one hand, keeping the original resolution will lead to unacceptable computational costs. On the other hand, existing practices, such as resizing and cropping, will change the quality of original videos due to the loss of details and contents, and are therefore harmful to quality assessment. With the obtained insight from the study of spatial-temporal redundancy in the human visual system and visual coding theory, we observe that quality information around a neighbourhood is typically similar, motivating us to investigate an effective quality-sensitive neighbourhood representatives scheme for VQA. In this work, we propose a unified scheme, spatial-temporal grid mini-cube sampling (St-GMS) to get a novel type of sample, named fragments . Full-resolution videos are first divided into mini-cubes with preset spatial-temporal grids, then the temporal-aligned quality representatives are sampled to compose the fragments that serve as inputs for VQA. In addition, we design the Fragment Attention Network (FANet), a network architecture tailored specifically for fragments. With fragments and FANet, the proposed efficient end-to-end FAST-VQA and FasterVQA achieve significantly better performance than existing approaches on all VQA benchmarks while requiring only 1/1612 FLOPs compared to the current state-of-the-art. Codes, models and demos are available at https://github.com/timothyhtimothy/FAST-VQA-and-FasterVQA.

8 citations

Journal ArticleDOI
TL;DR: Li et al. as mentioned in this paper proposed the Disentangled Objective Video Quality Evaluator (DOVER) to disentangle the effects of aesthetic quality issues and technical quality issues risen by the complicated video generation processes in UGC-VQA problem.
Abstract: User-generated-content (UGC) videos have dominated the Internet during recent years. While it is well-recognized that the perceptual quality of these videos can be affected by diverse factors, few existing methods explicitly explore the effects of different factors in video quality assessment (VQA) for UGC videos, i.e. the UGC-VQA problem . In this work, we make the first attempt to disentangle the effects of aesthetic quality issues and technical quality issues risen by the complicated video generation processes in the UGC-VQA problem. To overcome the absence of respective supervisions during disentanglement, we propose the Limited View Biased Supervisions (LVBS) scheme where two separate evaluators are trained with decomposed views specifically designed for each issue. Composed of an Aesthetic Quality Evaluator (AQE) and a Technical Quality Evaluator (TQE) under the LVBS scheme, the proposed Disentangled Objective Video Quality Evaluator ( DOVER ) reach excellent performance (0.91 SRCC for KoNViD-1k, 0.89 SRCC for LSVQ, 0.88 SRCC for YouTube-UGC) in the UGC-VQA problem. More importantly, our blind subjective studies prove that the separate evaluators in DOVER can effectively match human perception on respective disentangled quality issues. Codes and demos are released in https://github.com/teowu/dover .

7 citations