
Showing papers in "IEEE Transactions on Circuits and Systems for Video Technology in 2021"


Journal ArticleDOI
TL;DR: Versatile Video Coding (VVC) was developed by the Joint Video Experts Team (JVET) of ITU-T VCEG and ISO/IEC MPEG to serve an ever-growing need for improved video compression as well as to support a wider variety of today's media content and emerging applications.
Abstract: Versatile Video Coding (VVC) was finalized in July 2020 as the most recent international video coding standard. It was developed by the Joint Video Experts Team (JVET) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) to serve an ever-growing need for improved video compression as well as to support a wider variety of today’s media content and emerging applications. This paper provides an overview of the novel technical features for new applications and the core compression technologies for achieving significant bit rate reductions in the neighborhood of 50% over its predecessor for equal video quality, the High Efficiency Video Coding (HEVC) standard, and 75% over the currently most-used format, the Advanced Video Coding (AVC) standard. It is explained how these new features in VVC provide greater versatility for applications. Highlighted applications include video with resolutions beyond standard- and high-definition, video with high dynamic range and wide color gamut, adaptive streaming with resolution changes, computer-generated and screen-captured video, ultralow-delay streaming, 360° immersive video, and multilayer coding e.g., for scalability. Furthermore, early implementations are presented to show that the new VVC standard is implementable and ready for real-world deployment.

250 citations


Journal ArticleDOI
TL;DR: A feature refinement and filter network is proposed to address person re-identification from three aspects; first, by weakening the high-response features, it aims to identify other highly valuable features and extract the complete features of persons, thereby enhancing the robustness of the model.
Abstract: In the task of person re-identification, the attention mechanism and fine-grained information have been proved to be effective. However, it has been observed that models often focus on the extraction of features with strong discrimination, and neglect other valuable features. The extracted fine-grained information may include redundancies. In addition, current methods lack an effective scheme to remove background interference. Therefore, this paper proposes the feature refinement and filter network to solve the above problems from three aspects: first, by weakening the high response features, we aim to identify highly valuable features and extract the complete features of persons, thereby enhancing the robustness of the model; second, by positioning and intercepting the high response areas of persons, we eliminate the interference arising from background information and strengthen the response of the model to the complete features of persons; finally, valuable fine-grained features are selected using a multi-branch attention network for person re-identification to enhance the performance of the model. Our extensive experiments on the benchmark Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 person re-identification datasets demonstrate that the performance of our method is comparable to that of state-of-the-art approaches.

154 citations


Journal ArticleDOI
TL;DR: A novel end-to-end Learned Point Cloud Geometry Compression framework that efficiently compresses point cloud geometry using deep neural network (DNN) based variational autoencoders (VAE), and exceeds the geometry-based point cloud compression (G-PCC) algorithm standardized by the Moving Picture Experts Group (MPEG).
Abstract: This paper presents a novel end-to-end Learned Point Cloud Geometry Compression (a.k.a., Learned-PCGC) system, leveraging stacked Deep Neural Networks (DNN) based Variational AutoEncoder (VAE) to efficiently compress the Point Cloud Geometry (PCG). In this systematic exploration, PCG is first voxelized and partitioned into non-overlapped 3D cubes, which are then fed into stacked 3D convolutions for compact latent feature and hyperprior generation. Hyperpriors are used to improve the conditional probability modeling of entropy-coded latent features. A Weighted Binary Cross-Entropy (WBCE) loss is applied in training while adaptive thresholding is used in inference to remove false voxels and reduce the distortion. Objectively, our method exceeds the Geometry-based Point Cloud Compression (G-PCC) algorithm standardized by the Moving Picture Experts Group (MPEG) with a significant performance margin, e.g., at least 60% BD-Rate (Bjontegaard Delta Rate) savings, using common test datasets and other public datasets. Subjectively, our method has presented better visual quality with smoother surface reconstruction and appealing details, in comparison to all existing MPEG standard compliant PCC methods. Our method requires about 2.5 MB of parameters in total, which is a fairly small size for practical implementation, even on embedded platforms. Additional ablation studies analyze a variety of aspects (e.g., thresholding, kernels) to examine the generalization and application capacity of our Learned-PCGC. We would like to make all materials publicly accessible at https://njuvision.github.io/PCGCv1/ for reproducible research.

122 citations
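
The abstract above describes training with a weighted binary cross-entropy (WBCE) loss over voxel occupancy and an adaptive threshold at inference to remove false voxels. Below is a minimal PyTorch-style sketch of those two steps under assumed tensor shapes and an assumed positive-class weight; the function names and the top-k thresholding rule are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(pred_logits, occupancy, pos_weight=3.0):
    """Weighted binary cross-entropy over voxel occupancy.

    pred_logits: (N, 1, D, H, W) raw decoder outputs
    occupancy:   (N, 1, D, H, W) ground-truth {0, 1} voxels
    pos_weight:  up-weights occupied voxels, which are sparse (assumed value).
    """
    return F.binary_cross_entropy_with_logits(
        pred_logits, occupancy,
        pos_weight=torch.tensor(pos_weight, device=pred_logits.device))

def select_voxels(pred_logits, num_points):
    """Adaptive thresholding at inference: keep the top-k most probable
    voxels so the reconstruction has roughly `num_points` occupied voxels."""
    probs = torch.sigmoid(pred_logits)
    threshold = torch.topk(probs.flatten(), num_points).values[-1]
    return (probs >= threshold).float()
```

Up-weighting the occupied voxels compensates for the extreme sparsity of point-cloud occupancy grids, where empty voxels vastly outnumber occupied ones.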


Journal ArticleDOI
TL;DR: A novel few-shot learning method named multi-scale metric learning (MSML) is proposed to extract multi-scale features and learn the multi-scale relations between samples for the classification of few-shot learning.
Abstract: Few-shot learning in image classification is developed to learn a model that aims to identify unseen classes with only a few training samples for each class. Fewer training samples and new tasks of classification make many traditional classification models no longer applicable. In this paper, a novel few-shot learning method named multi-scale metric learning (MSML) is proposed to extract multi-scale features and learn the multi-scale relations between samples for the classification of few-shot learning. In the proposed method, a feature pyramid structure is introduced for multi-scale feature embedding, which aims to combine high-level strong semantic features with low-level but abundant visual features. Then a multi-scale relation generation network (MRGN) is developed for hierarchical metric learning, in which high-level features correspond to deeper metric learning while low-level features correspond to lighter metric learning. Moreover, a novel loss function named intra-class and inter-class relation loss (IIRL) is proposed to optimize the proposed deep network, which aims to strengthen the correlation between homogeneous groups of samples and weaken the correlation between heterogeneous groups of samples. Experimental results on mini ImageNet and tiered ImageNet demonstrate that the proposed method achieves superior performance on the few-shot learning problem.

122 citations
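
The IIRL objective above strengthens correlations within homogeneous groups of samples and weakens them across heterogeneous groups. The exact formulation is not given in the abstract, so the PyTorch sketch below is only a generic stand-in for that idea applied to relation scores between query and support samples; the MSE-to-target form and all names are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def relation_loss(relation_scores, labels_q, labels_s):
    """Generic intra-/inter-class relation objective: push relation scores
    toward 1 for same-class (query, support) pairs and toward 0 for
    different-class pairs.

    relation_scores: (Q, S) predicted relation between each query and support
    labels_q: (Q,) query labels;  labels_s: (S,) support labels
    """
    same = (labels_q.unsqueeze(1) == labels_s.unsqueeze(0)).float()
    return F.mse_loss(relation_scores, same)
```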


Journal ArticleDOI
TL;DR: Extensive experiments on the MSCOCO captioning dataset demonstrate that by plugging the Task-Adaptive Attention module into a vanilla Transformer-based image captioning model, performance improvement can be achieved.
Abstract: Attention mechanisms are now widely used in image captioning models. However, most attention models only focus on visual features. When generating syntax related words, little visual information is needed. In this case, these attention models could mislead the word generation. In this paper, we propose Task-Adaptive Attention module for image captioning, which can alleviate this misleading problem and learn implicit non-visual clues which can be helpful for the generation of non-visual words. We further introduce a diversity regularization to enhance the expression ability of the Task-Adaptive Attention module. Extensive experiments on the MSCOCO captioning dataset demonstrate that by plugging our Task-Adaptive Attention module into a vanilla Transformer-based image captioning model, performance improvement can be achieved.

118 citations


Journal ArticleDOI
TL;DR: By combining thermal images and RGB images, an effective and consistent feature fusion network (ECFFNet) for RGB-T SOD is proposed that outperforms 12 state-of-the-art methods under different evaluation indicators.
Abstract: Under ideal environmental conditions, RGB-based deep convolutional neural networks can achieve high performance for salient object detection (SOD). In scenes with cluttered backgrounds and many objects, depth maps have been combined with RGB images to better distinguish spatial positions and structures during SOD, achieving high accuracy. However, under low-light and uneven lighting conditions, RGB and depth information may be insufficient for detection. Thermal images are insensitive to lighting and weather conditions, being able to capture important objects even during nighttime. By combining thermal images and RGB images, we propose an effective and consistent feature fusion network (ECFFNet) for RGB-T SOD. In ECFFNet, an effective cross-modality fusion module fully fuses features of corresponding sizes from the RGB and thermal modalities. Then, a bilateral reversal fusion module performs bilateral fusion of foreground and background information, enabling the full extraction of salient object boundaries. Finally, a multilevel consistent fusion module combines features across different levels to obtain complementary information. Comprehensive experiments on three RGB-T SOD datasets show that the proposed ECFFNet outperforms 12 state-of-the-art methods under different evaluation indicators.

114 citations


Journal ArticleDOI
TL;DR: Weight standardization is applied to pre-activation convolution blocks of the decoder architecture to improve the flow of gradients, thus making optimization easier; the proposed method is effective for monocular depth estimation compared to state-of-the-art models.
Abstract: With the great success of generative models based on deep neural networks, monocular depth estimation has been actively studied by exploiting various encoder-decoder architectures. However, the decoding process in most previous methods, which repeats simple up-sampling operations, probably fails to fully utilize underlying properties of well-encoded features for monocular depth estimation. To resolve this problem, we propose a simple but effective scheme by incorporating the Laplacian pyramid into the decoder architecture. Specifically, encoded features are fed into different streams for decoding depth residuals, which are defined by decomposition of the Laplacian pyramid, and corresponding outputs are progressively combined to reconstruct the final depth map from coarse to fine scales. This is fairly desirable to precisely estimate the depth boundary as well as the global layout. We also propose to apply weight standardization to pre-activation convolution blocks of the decoder architecture, which greatly helps to improve the flow of gradients and thus makes optimization easier. Experimental results on benchmark datasets constructed under various indoor and outdoor environments demonstrate that the proposed method is effective for monocular depth estimation compared to state-of-the-art models. The code and model are publicly available at: https://github.com/tjqansthd/LapDepth-release.

105 citations
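
Weight standardization, mentioned above, normalizes each convolution filter to zero mean and unit variance before it is applied. A minimal PyTorch sketch is given below; the channel sizes in the usage comment are illustrative, and this is a generic weight-standardized convolution rather than the authors' released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: each filter is normalized to zero
    mean and unit variance over its (in_channels, kH, kW) dimensions before
    the convolution, which smooths optimization."""

    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Usage: drop-in replacement inside a pre-activation block, e.g.
# block = nn.Sequential(nn.BatchNorm2d(64), nn.ReLU(), WSConv2d(64, 64, 3, padding=1))
```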


Journal ArticleDOI
TL;DR: This paper presents a comprehensive survey of deep learning based developments in the past decade for content based image retrieval, covering different types of supervision, networks, descriptors, and retrieval.
Abstract: Content based image retrieval aims to find images similar to a query image in a large-scale dataset. Generally, the similarity between the representative features of the query image and dataset images is used to rank the images for retrieval. In the early days, various hand-designed feature descriptors were investigated based on visual cues, such as color, texture, and shape, that represent the images. However, over the past decade, deep learning has emerged as a dominant alternative to hand-designed feature engineering. It learns the features automatically from the data. This paper presents a comprehensive survey of deep learning based developments in the past decade for content based image retrieval. The categorization of existing state-of-the-art methods from different perspectives is also performed for greater understanding of the progress. The taxonomy used in this survey covers different types of supervision, different networks, different descriptor types and different retrieval types. A performance analysis is also performed using the state-of-the-art methods. Insights are also presented for the benefit of researchers, to observe the progress and to make the best choices. The survey presented in this paper will help in further research progress in image retrieval using deep learning.

101 citations


Journal ArticleDOI
Zunjin Zhao1, Bang-Shu Xiong1, Lei Wang1, Qiaofeng Ou1, Lei Yu1, Fa Kuang1 
TL;DR: This paper proposes a novel “generative” strategy for Retinex decomposition, by which the decomposition is cast as a generative problem, and a unified deep framework is proposed to estimate the latent components and perform low-light image enhancement.
Abstract: Low-light images suffer from low contrast and unclear details, which not only reduces the available information for humans but also limits the application of computer vision algorithms. Among the existing enhancement techniques, Retinex-based and learning-based methods are under the spotlight today. In this paper, we bridge the gap between the two methods. First, we propose a novel "generative" strategy for Retinex decomposition, by which the decomposition is cast as a generative problem. Second, based on the strategy, a unified deep framework is proposed to estimate the latent components and perform low-light image enhancement. Third, our method can weaken the coupling relationship between the two components while performing Retinex decomposition. Finally, the resulting method, RetinexDIP, performs Retinex decomposition without any external images, and the estimated illumination can be easily adjusted and is used to perform enhancement. The proposed method is compared with ten state-of-the-art algorithms on seven public datasets, and the experimental results demonstrate the superiority of our method. Code is available at: https://github.com/zhaozunjin/RetinexDIP.

97 citations
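
Retinex decomposition splits an image S into reflectance R and illumination L with S = R * L; the abstract notes that the estimated illumination can then be adjusted to perform the enhancement. The NumPy sketch below recomposes an enhanced image by gamma-correcting the estimated illumination; the gamma rule and default value are common choices assumed for illustration, not necessarily the adjustment used in RetinexDIP.

```python
import numpy as np

def retinex_enhance(reflectance, illumination, gamma=0.4):
    """Recompose an enhanced image from a Retinex decomposition S = R * L
    by gamma-adjusting the estimated illumination map (values in [0, 1])."""
    adjusted = np.power(np.clip(illumination, 1e-3, 1.0), gamma)  # brighten dark regions
    return np.clip(reflectance * adjusted, 0.0, 1.0)
```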


Journal ArticleDOI
TL;DR: This article revisits feature fusion for mining intrinsic RGB-T saliency patterns and proposes a novel deep feature fusion network, which consists of the multi-scale, multi-modality, and multi-level feature fusion modules.
Abstract: While many RGB-based saliency detection algorithms have recently shown the capability of segmenting salient objects from an image, they still suffer from unsatisfactory performance when dealing with complex scenarios, insufficient illumination or occluded appearances. To overcome this problem, this article studies RGB-T saliency detection, where we take advantage of the thermal modality’s robustness against illumination and occlusion. To achieve this goal, we revisit feature fusion for mining intrinsic RGB-T saliency patterns and propose a novel deep feature fusion network, which consists of the multi-scale, multi-modality, and multi-level feature fusion modules. Specifically, the multi-scale feature fusion module captures rich contextual features from each modality, while the multi-modality and multi-level feature fusion modules integrate complementary features from different modalities and different levels of features, respectively. To demonstrate the effectiveness of the proposed approach, we conduct comprehensive experiments on the RGB-T saliency detection benchmark. The experimental results demonstrate that our approach outperforms other state-of-the-art methods and the conventional feature fusion modules by a large margin.

90 citations


Journal ArticleDOI
TL;DR: This work proposes a new solution to 3D human pose estimation in videos by drawing inspiration from human skeleton anatomy and decomposing the task into bone direction prediction and bone length prediction, from which the 3D joint locations can be completely derived.
Abstract: In this work, we propose a new solution to 3D human pose estimation in videos. Instead of directly regressing the 3D joint locations, we draw inspiration from the human skeleton anatomy and decompose the task into bone direction prediction and bone length prediction, from which the 3D joint locations can be completely derived. Our motivation is the fact that the bone lengths of a human skeleton remain consistent across time. This prompts us to develop effective techniques to utilize global information across all the frames in a video for high-accuracy bone length prediction. Moreover, for the bone direction prediction network, we propose a fully-convolutional propagating architecture with long skip connections. Essentially, it predicts the directions of different bones hierarchically without using any time-consuming memory units (e.g., LSTM). A novel joint shift loss is further introduced to bridge the training of the bone length and bone direction prediction networks. Finally, we employ an implicit attention mechanism to feed the 2D keypoint visibility scores into the model as extra guidance, which significantly mitigates the depth ambiguity in many challenging poses. Our full model outperforms the previous best results on the Human3.6M and MPI-INF-3DHP datasets, where comprehensive evaluation validates the effectiveness of our model.
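
Since the bone lengths and bone directions fully determine the pose, the 3D joints can be recovered by walking the kinematic tree from the root. A minimal NumPy sketch follows; the 5-joint parent array is a toy skeleton used only for illustration (Human3.6M, for instance, uses 17 joints), and the function name is hypothetical.

```python
import numpy as np

# Parent index of each joint in a toy kinematic tree; joint 0 is the root.
PARENTS = [-1, 0, 1, 1, 3]

def joints_from_bones(root, directions, lengths, parents=PARENTS):
    """Recover 3D joint positions from per-bone unit directions and lengths.

    root:       (3,) root joint position
    directions: (J, 3) unit direction of the bone ending at each joint
    lengths:    (J,) bone lengths (the root entry is ignored)
    """
    joints = np.zeros((len(parents), 3))
    joints[0] = root
    for j, p in enumerate(parents):   # parents precede children in this ordering
        if p < 0:
            continue
        joints[j] = joints[p] + lengths[j] * directions[j]
    return joints
```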

Journal ArticleDOI
TL;DR: This study proposes a neural network—a progressive-recursive image enhancement network (PRIEN)—to enhance low-light images and demonstrates the advantages of the method compared with other methods, from both qualitative and quantitative perspectives.
Abstract: Low-light images have low brightness and contrast, which presents a huge obstacle to computer vision tasks. Low-light image enhancement is challenging because multiple factors (such as brightness, contrast, artifacts, and noise) must be considered simultaneously. In this study, we propose a neural network—a progressive-recursive image enhancement network (PRIEN)—to enhance low-light images. The main idea is to use a recursive unit, composed of a recursive layer and a residual block, to repeatedly unfold the input image for feature extraction. Unlike in previous methods, in the proposed study, we directly input low-light images into the dual attention model for global feature extraction. Next, we use a combination of recurrent layers and residual blocks for local feature extraction. Finally, we output the enhanced image. Furthermore, we input the global feature map of dual attention into each stage in a progressive way. In the local feature extraction module, a recurrent layer shares depth features across stages. In addition, we perform recursive operations on a single residual block, significantly reducing the number of parameters while ensuring good network performance. Although the network structure is simple, it can produce good results for a range of low-light conditions. We conducted experiments on widely adopted datasets. The results demonstrate the advantages of our method compared with other methods, from both qualitative and quantitative perspectives.
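
The abstract notes that PRIEN performs recursive operations on a single residual block, so effective depth grows while the parameter count stays fixed. The PyTorch block below sketches that weight-sharing recursion under assumed channel and step counts; it illustrates the idea, not the authors' exact PRIEN architecture.

```python
import torch.nn as nn

class RecursiveResidualBlock(nn.Module):
    """A single residual block whose weights are reused for several recursive
    steps, adding depth without adding parameters."""

    def __init__(self, channels=32, steps=4):
        super().__init__()
        self.steps = steps
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        out = x
        for _ in range(self.steps):    # the same weights are applied at every step
            out = out + self.body(out)  # residual refinement
        return out
```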

Journal ArticleDOI
TL;DR: This paper proposes a novel method called Attention DL based JSCC (ADJSCC) that can successfully operate with different SNR levels during transmission, and compares ADJSCC with the state-of-the-art DL based JSCC method through extensive experiments to demonstrate its adaptability, robustness and versatility.
Abstract: Recent research on joint source channel coding (JSCC) for wireless communications has achieved great success owing to the employment of deep learning (DL). However, the existing work on DL based JSCC usually trains the designed network to operate under a specific signal-to-noise ratio (SNR) regime, without taking into account that the SNR level during the deployment stage may differ from that during the training stage. A number of networks are required to cover a broad range of SNRs, which is computationally inefficient (in the training stage) and requires large storage. To overcome these drawbacks, our paper proposes a novel method called Attention DL based JSCC (ADJSCC) that can successfully operate with different SNR levels during transmission. This design is inspired by the resource assignment strategy in traditional JSCC, which dynamically adjusts the compression ratio in source coding and the channel coding rate according to the channel SNR. This is achieved by resorting to attention mechanisms because these are able to allocate computing resources to more critical tasks. Instead of applying the resource allocation strategy of traditional JSCC, ADJSCC uses channel-wise soft attention to scale features according to SNR conditions. We compare the ADJSCC method with the state-of-the-art DL based JSCC method through extensive experiments to demonstrate its adaptability, robustness and versatility. Compared with the existing methods, the proposed method requires less storage and is more robust in the presence of channel mismatch.
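
The channel-wise soft attention described above scales feature channels according to the channel SNR. The PyTorch sketch below conditions an SE-style attention block on the SNR value; the layer sizes and the way the SNR is injected are assumptions for illustration rather than the paper's exact ADJSCC module.

```python
import torch
import torch.nn as nn

class SNRChannelAttention(nn.Module):
    """Channel-wise soft attention conditioned on the channel SNR: pooled
    feature statistics are concatenated with the SNR value and mapped to
    per-channel scaling factors."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels + 1, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, channels),
            nn.Sigmoid())

    def forward(self, feat, snr_db):
        # feat: (N, C, H, W); snr_db: (N, 1) channel SNR in dB
        pooled = feat.mean(dim=(2, 3))                    # (N, C) global statistics
        scale = self.fc(torch.cat([pooled, snr_db], 1))   # (N, C) per-channel gains
        return feat * scale.unsqueeze(-1).unsqueeze(-1)
```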

Journal ArticleDOI
TL;DR: A novel deep spatial-spectral subspace clustering network (DS3C-Net) is proposed, which explores spatial-spectral information via the multi-scale auto-encoder and collaborative constraint, and outperforms state-of-the-art methods.
Abstract: Hyperspectral image (HSI) clustering is a challenging task due to the complex characteristics in HSI data, such as spatial-spectral structure, high-dimension, and large spectral variability. In this paper, we propose a novel deep spatial-spectral subspace clustering network (DS3C-Net), which explores spatial-spectral information via the multi-scale auto-encoder and collaborative constraint. Considering the structure correlations of HSI, the multi-scale auto-encoder is first designed to extract spatial-spectral features with different-scale pixel blocks which are selected as the inputs. Then, the collaborative constrained self-expressive layers are introduced between the encoder and decoder, to capture the self-expressive subspace structures. By designing a self-expressiveness similarity constraint, the proposed network is trained collaboratively, and the affinity matrices of the feature representation are learned in an end-to-end manner. Based on the affinity matrices, the spectral clustering algorithm is utilized to obtain the final HSI clustering result. Experimental results on three widely used hyperspectral image datasets demonstrate that the proposed method outperforms state-of-the-art methods.
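
The self-expressive layers above reconstruct each latent sample as a linear combination of the others, and the learned coefficients yield the affinity matrix passed to spectral clustering. Below is a generic PyTorch sketch of a self-expressive layer and the usual symmetric-affinity construction; the initialization scale and the affinity formula are common conventions assumed here, not necessarily DS3C-Net's exact choices.

```python
import torch
import torch.nn as nn

class SelfExpressiveLayer(nn.Module):
    """Self-expressiveness: each latent sample is reconstructed as a linear
    combination of the others, Z_hat = C Z, with a learnable coefficient
    matrix C whose diagonal is forced to zero."""

    def __init__(self, num_samples):
        super().__init__()
        self.coef = nn.Parameter(1e-4 * torch.randn(num_samples, num_samples))

    def forward(self, z):                                   # z: (num_samples, dim)
        c = self.coef - torch.diag(torch.diag(self.coef))   # zero the diagonal
        return c @ z, c

def affinity_from_coef(c):
    """Symmetric affinity matrix fed to spectral clustering."""
    a = (c.abs() + c.abs().t()) / 2
    return a.detach().cpu().numpy()
```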

Journal ArticleDOI
Jinyuan Liu1, Xin Fan1, Ji Jiang1, Risheng Liu1, Zhongxuan Luo1 
TL;DR: A deep network for infrared and visible image fusion that cascades a feature learning module with a fusion learning mechanism; a coarse-to-fine deep architecture learns multi-scale features for multi-modal images, enabling the discovery of prominent common structures for later fusion operations.
Abstract: Image fusion integrates a series of images acquired from different sensors, e.g., infrared and visible, outputting an image with richer information than either one. Traditional and recent deep-based methods have difficulties in preserving prominent structures and recovering vital textural details for practical applications. In this paper, we propose a deep network for infrared and visible image fusion cascading a feature learning module with a fusion learning mechanism. Firstly, we apply a coarse-to-fine deep architecture to learn multi-scale features for multi-modal images, which enables discovering prominent common structures for later fusion operations. The proposed feature learning module requires no well-aligned image pairs for training. Compared with the existing learning-based methods, the proposed feature learning module can ensemble numerous examples from the respective modalities for training, increasing the ability of feature representation. Secondly, we design an edge-guided attention mechanism upon the multi-scale features to guide the fusion focusing on common structures, thus recovering details while attenuating noise. Moreover, we provide a new aligned infrared and visible image fusion dataset, RealStreet, collected in various practical scenarios for comprehensive evaluation. Extensive experiments on two benchmarks, TNO and RealStreet, demonstrate the superiority of the proposed method over the state-of-the-art in terms of both visual inspection and objective analysis on six evaluation metrics. We also conduct experiments on the FLIR and NIR datasets, containing foggy weather and poor light conditions, to verify the generalization and robustness of the proposed method.

Journal ArticleDOI
TL;DR: A non-local block is developed to model temporal information by estimating inter-frame similarity and inter-frame difference, and a recursive block is proposed that iteratively refines the feature maps generated at the last iteration.
Abstract: Video deblurring is still a challenging low-level vision task since spatio-temporal characteristics across both the spatial and temporal domains are difficult to model. In this article, to model the temporal information, we develop a non-local block which estimates inter-frame similarity and inter-frame difference. Specifically, for modeling the spatial characteristics and restoring sharp frame details, we propose a recursive block that iteratively refines feature maps generated at the last iteration. In addition, a novel temporal loss function is introduced to ensure the temporal consistency of generated frames. Experimental results on public datasets demonstrate that our method achieves state-of-the-art performance both quantitatively and qualitatively.

Journal ArticleDOI
TL;DR: The definition and related proofs of the double parameters fractal sorting matrix (DPFSM) are presented, an image encryption algorithm based on DPFSM is proposed, and the security analysis demonstrates its security.
Abstract: In the field of frontier research, information security has received a lot of interest, but in the field of information security algorithms, the introduction of decimals makes the issue of calculation accuracy impossible to bypass. This article proposes the definition and related proofs of the double parameters fractal sorting matrix (DPFSM). As a new matrix classification with fractal properties, DPFSM contains self-similar structures in the ordering of both elements and sub-blocks in the matrix. These two self-similar structures are determined by two different parameters. To verify the theory, this paper presents a type of 2×2 DPFSM iterative generation method, as well as the theory, steps, and examples of the iteration. DPFSM is a space position transformation matrix, which has a better periodic law than a single parameter fractal sorting matrix (FSM). The proposal of DPFSM expands fractal theory and addresses the limitation of calculation accuracy on information security. An image encryption algorithm based on DPFSM is proposed, and the security analysis demonstrates its security. DPFSM has good application value in the field of information security.

Journal ArticleDOI
TL;DR: A novel hierarchical graph neural network (HGNN) for FSL is proposed, which consists of three parts, i.e., bottom-up reasoning, top-down reasoning, and skip connections, to enable the efficient learning of multi-level relationships.
Abstract: Recent graph neural network (GNN) based methods for few-shot learning (FSL) represent the samples of interest as a fully-connected graph and conduct reasoning on the nodes flatly, which ignores the hierarchical correlations among nodes. However, real-world categories may have hierarchical structures, and for FSL, it is important to extract the distinguishing features of the categories from individual samples. To explore this, we propose a novel hierarchical graph neural network (HGNN) for FSL, which consists of three parts, i.e., bottom-up reasoning, top-down reasoning, and skip connections, to enable the efficient learning of multi-level relationships. For the bottom-up reasoning, we design intra-class k-nearest neighbor pooling (intra-class knnPool) and inter-class knnPool layers, to conduct hierarchical learning for both the intra- and inter-class nodes. For the top-down reasoning, we propose to utilize graph unpooling (gUnpool) layers to restore the down-sampled graph into its original size. Skip connections are proposed to fuse multi-level features for the final node classification. The parameters of HGNN are learned by episodic training with the signal of node losses, which aims to train a well-generalizable model for recognizing unseen classes with few labeled data. Experimental results on benchmark datasets have demonstrated that HGNN outperforms other state-of-the-art GNN based methods significantly, for both transductive and non-transductive FSL tasks. The dataset as well as the source code can be downloaded online.

Journal ArticleDOI
TL;DR: A novel multimodal emotion recognition model for conversational videos based on reinforcement learning and domain knowledge (ERLDK) is proposed in this paper and achieves the state-of-the-art results on weighted average and most of the specific emotion categories.
Abstract: Multimodal emotion recognition in conversational videos (ERC) has developed rapidly in recent years. To fully extract the relative context from video clips, most studies build their models on entire dialogues, which makes them lack real-time ERC ability. Different from related research, a novel multimodal emotion recognition model for conversational videos based on reinforcement learning and domain knowledge (ERLDK) is proposed in this paper. In ERLDK, the reinforcement learning algorithm is introduced to conduct real-time ERC as conversations occur. The collection of history utterances is composed as an emotion-pair which represents the multimodal context of the following utterance to be recognized. A dueling deep-Q-network (DDQN) based on gated recurrent unit (GRU) layers is designed to learn the correct action from the alternative emotion categories. Domain knowledge is extracted from a public dataset based on the former information of emotion-pairs. The extracted domain knowledge is used to revise the results from the RL module and is transferred to another dataset to examine its rationality. Experimental results show that ERLDK achieves state-of-the-art results on the weighted average and most of the specific emotion categories.

Journal ArticleDOI
TL;DR: This work proposes a novel multi-view clustering method via learning a LRTG model, which simultaneously learns the representation and affinity matrix in a single step to preserve their correlation.
Abstract: Graph and subspace clustering methods have become the mainstream of multi-view clustering due to their promising performance. However, (1) since graph clustering methods learn graphs directly from the raw data, when the raw data is distorted by noise and outliers, their performance may seriously decrease; (2) subspace clustering methods use a “two-step” strategy to learn the representation and affinity matrix independently, and thus may fail to explore their high correlation. To address these issues, we propose a novel multi-view clustering method via learning a Low-Rank Tensor Graph (LRTG). Different from subspace clustering methods, LRTG simultaneously learns the representation and affinity matrix in a single step to preserve their correlation. We apply Tucker decomposition and the ℓ2,1-norm to the LRTG model to alleviate noise and outliers for learning a “clean” representation. LRTG then learns the affinity matrix from this “clean” representation. Additionally, an adaptive neighbor scheme is proposed to find the K largest entries of the affinity matrix to form a flexible graph for clustering. An effective optimization algorithm is designed to solve the LRTG model based on the alternating direction method of multipliers. Extensive experiments on different clustering tasks demonstrate the effectiveness and superiority of LRTG over seventeen state-of-the-art clustering methods.
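
The adaptive neighbor scheme above keeps the K largest entries of the learned affinity matrix to form the graph used for clustering. A small NumPy sketch of that step follows; the symmetrization and the default K are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def knn_graph(affinity, k=10):
    """Keep only the K largest entries in each row of the affinity matrix
    and symmetrize, yielding a sparse graph for spectral clustering."""
    n = affinity.shape[0]
    graph = np.zeros_like(affinity)
    for i in range(n):
        idx = np.argsort(affinity[i])[-k:]   # indices of the K largest entries
        graph[i, idx] = affinity[i, idx]
    return (graph + graph.T) / 2             # symmetric graph
```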

Journal ArticleDOI
Zhengzheng Tu1, Yan Ma1, Chenglong Li1, Jin Tang1, Bin Luo1 
TL;DR: An Edge-Guided Non-local FCN (ENFNet) is proposed to perform edge-guided feature learning for accurate salient object detection, extracting hierarchical global and local information in the FCN and incorporating non-local features for effective feature representations.
Abstract: The Fully Convolutional Neural Network (FCN) has been widely applied to salient object detection recently by virtue of high-level semantic feature extraction, but existing FCN-based methods still suffer from continuous striding and pooling operations leading to loss of spatial structure and blurred edges. To maintain the clear edge structure of salient objects, we propose a novel Edge-guided Non-local FCN (ENFNet) to perform edge-guided feature learning for accurate salient object detection. Specifically, we extract hierarchical global and local information in the FCN to incorporate non-local features for effective feature representations. To preserve good boundaries of salient objects, we propose a guidance block to embed edge prior knowledge into hierarchical feature maps. The guidance block performs not only feature-wise manipulation but also spatial-wise transformation for effective edge embeddings. Our model is trained on the MSRA-B dataset and tested on five popular benchmark datasets. Compared with state-of-the-art methods, the proposed method performs well on all five datasets.

Journal ArticleDOI
TL;DR: Compared to state-of-the-art (SOTA) methods, the proposed richly activated GCN achieves comparable performance on the standard NTU RGB+D 60 and 120 datasets; on the synthetic occlusion and jittering datasets, the performance deterioration due to occluded and disturbed joints can be significantly alleviated by utilizing the proposed RA-GCN.
Abstract: Current methods for skeleton-based human action recognition usually work with complete skeletons. However, in real scenarios, it is inevitable to capture incomplete or noisy skeletons, which could significantly deteriorate the performance of current methods when some informative joints are occluded or disturbed. To improve the robustness of action recognition models, a multi-stream graph convolutional network (GCN) is proposed to explore sufficient discriminative features spreading over all skeleton joints, so that the distributed redundant representation reduces the sensitivity of the action models to non-standard skeletons. Concretely, the backbone GCN is extended by a series of ordered streams which are responsible for learning discriminative features from the joints less activated by preceding streams. Here, the activation degrees of skeleton joints of each GCN stream are measured by the class activation maps (CAM), and only the information from the unactivated joints will be passed to the next stream, by which rich features over all active joints are obtained. Thus, the proposed method is termed richly activated GCN (RA-GCN). Compared to the state-of-the-art (SOTA) methods, the RA-GCN achieves comparable performance on the standard NTU RGB+D 60 and 120 datasets. More crucially, on the synthetic occlusion and jittering datasets, the performance deterioration due to the occluded and disturbed joints can be significantly alleviated by utilizing the proposed RA-GCN.
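
In RA-GCN, each stream's class activation maps (CAM) determine which joints are considered activated, and only the remaining joints are passed to the next stream. The NumPy sketch below illustrates one way such a mask could be derived from per-joint CAM values; the quantile-based threshold and the keep ratio are assumptions for illustration, not the paper's exact rule.

```python
import numpy as np

def next_stream_mask(cam, prev_mask, keep_ratio=0.7):
    """Mask out joints already activated by preceding streams.

    cam:       (J,) class activation values per skeleton joint
    prev_mask: (J,) binary mask of joints still available to this stream
    Returns a mask that drops the most activated joints, so the next stream
    must learn from the remaining, less activated joints.
    """
    cam = cam * prev_mask                                   # ignore removed joints
    threshold = np.quantile(cam[prev_mask > 0], keep_ratio) # assumed threshold rule
    new_mask = prev_mask.copy()
    new_mask[cam >= threshold] = 0                          # drop highly activated joints
    return new_mask
```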

Journal ArticleDOI
TL;DR: In this article, a causal context model is proposed that separates the latent space across channels and makes use of channel-wise relationships to generate highly informative adjacent contexts, and a causal global prediction model is used to find global reference points for accurate predictions of undecoded points.
Abstract: Over the past several years, we have witnessed impressive progress in the field of learned image compression. Recent learned image codecs are commonly based on autoencoders that first encode an image into low-dimensional latent representations and then decode them for reconstruction purposes. To capture spatial dependencies in the latent space, prior works exploit a hyperprior and a spatial context model to build an entropy model, which estimates the bit-rate for end-to-end rate-distortion optimization. However, such an entropy model is suboptimal from two aspects: (1) It fails to capture global-scope spatial correlations among the latents. (2) Cross-channel relationships of the latents remain unexplored. In this paper, we propose the concept of separate entropy coding to leverage a serial decoding process for causal contextual entropy prediction in the latent space. A causal context model is proposed that separates the latents across channels and makes use of channel-wise relationships to generate highly informative adjacent contexts. Furthermore, we propose a causal global prediction model to find global reference points for accurate predictions of undecoded points. Both these models facilitate entropy estimation without the transmission of overhead. In addition, we further adopt a new group-separated attention module to build more powerful transform networks. Experimental results demonstrate that our full image compression model outperforms the standard VVC/H.266 codec on the Kodak dataset in terms of both PSNR and MS-SSIM, yielding state-of-the-art rate-distortion performance.

Journal ArticleDOI
TL;DR: A simplified sparsity prior based on patch-wise minimal pixels (PMP) is proposed: the PMP of clear images is much sparser than that of blurred images and hence is very effective in discriminating between clear and blurred images, and a novel algorithm is designed to efficiently exploit this sparsity in deblurring.
Abstract: Blind image deblurring is a long-standing challenging problem in image processing and low-level vision. Recently, sophisticated priors such as the dark channel prior, the extreme channel prior, and the local maximum gradient prior have shown promising effectiveness. However, these methods are computationally expensive. Meanwhile, since the subproblems involved by these priors cannot be solved explicitly, approximate solutions are commonly used, which limits the best exploitation of their capability. To address these problems, this work first proposes a simplified sparsity prior of local minimal pixels, namely patch-wise minimal pixels (PMP). The PMP of clear images is much sparser than that of blurred ones, and hence is very effective in discriminating between clear and blurred images. Then, a novel algorithm is designed to efficiently exploit the sparsity of PMP in deblurring. The new algorithm flexibly imposes a sparsity-inducing regularization on the PMP under the maximum a posteriori (MAP) framework rather than directly using the half quadratic splitting algorithm. By this, it avoids the non-rigorous approximate solutions of existing algorithms, while being much more computationally efficient. Extensive experiments demonstrate that the proposed algorithm achieves better practical stability compared with the state of the art. In terms of deblurring quality, robustness and computational efficiency, the new algorithm is superior to the state of the art. Code for reproducing the results of the new method is available at https://github.com/FWen/deblur-pmp.git.
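
PMP, as described above, collects the local minimal pixels patch by patch; for clear images this map is much sparser than for blurred ones. The NumPy sketch below computes PMP over non-overlapping patches; taking the channel-wise minimum for color images and the patch size of 8 are assumptions made for illustration, not the paper's exact settings.

```python
import numpy as np

def patch_wise_minimal_pixels(img, patch=8):
    """Patch-wise minimal pixels (PMP): the minimum intensity inside each
    non-overlapping patch of the image."""
    h, w = img.shape[:2]
    gray = img if img.ndim == 2 else img.min(axis=2)   # channel-wise minimum for color
    pmp = np.zeros((h // patch, w // patch))
    for i in range(h // patch):
        for j in range(w // patch):
            pmp[i, j] = gray[i * patch:(i + 1) * patch,
                             j * patch:(j + 1) * patch].min()
    return pmp
```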

Journal ArticleDOI
TL;DR: A lightweight single image super-resolution network with an expectation-maximization attention mechanism (EMASRN) is proposed for better balancing performance and applicability, and the experimental results demonstrate the superiority of EMASRN over state-of-the-art lightweight SISR methods in terms of both quantitative metrics and visual quality.
Abstract: In recent years, with the rapid development of deep learning, super-resolution methods based on convolutional neural networks (CNNs) have made great progress. However, the parameters and the required consumption of computing resources of these methods are also increasing to the point that such methods are difficult to implement on devices with low computing power. To address this issue, we propose a lightweight single image super-resolution network with an expectation-maximization attention mechanism (EMASRN) for better balancing performance and applicability. Specifically, a progressive multi-scale feature extraction block (PMSFE) is proposed to extract feature maps of different sizes. Furthermore, we propose an HR-size expectation-maximization attention block (HREMAB) that directly captures the long-range dependencies of HR-size feature maps. We also utilize a feedback network to feed the high-level features of each generation into the next generation’s shallow network. Compared with the existing lightweight single image super-resolution (SISR) methods, our EMASRN reduces the number of parameters by almost one-third. The experimental results demonstrate the superiority of our EMASRN over state-of-the-art lightweight SISR methods in terms of both quantitative metrics and visual quality. The source code can be downloaded at https://github.com/xyzhu1/EMASRN.

Journal ArticleDOI
TL;DR: Fuzzy fusion through the Choquet integral leverages the degree of uncertainty of the decision scores obtained from four CNNs to adaptively generate the final decision score based upon the confidence of each information source.
Abstract: Action recognition based on skeleton key joints has gained popularity due to its cost effectiveness and low complexity. Existing Convolutional Neural Network (CNN) based models mostly fail to capture various aspects of the skeleton sequence. To this end, four feature representations, which capture complementary characteristics of the sequence of key joints, are extracted, with the novel contribution of features estimated from angular information and the kinematics of the human actions. Single channel grayscale images are used to encode these features for classification using four CNNs, with the complementary nature verified through Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences. As opposed to the straightforward classifier combination generally used in the existing literature, fuzzy fusion through the Choquet integral leverages the degree of uncertainty of the decision scores obtained from the four CNNs. Experimental results support the efficacy of the fuzzy combination of CNNs to adaptively generate the final decision score based upon the confidence of each information source. Impressive results on the challenging UTD-MHAD, HDM05, G3D, and NTU RGB+D 60 and 120 datasets demonstrate the effectiveness of the proposed method. The source code for our method is available at https://github.com/theavicaster/fuzzy-integral-cnn-fusion-3d-har
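
The fusion step above combines the four CNNs' decision scores with a Choquet integral, which weights sources through a fuzzy measure defined on subsets of sources. Below is a small NumPy implementation of the discrete Choquet integral with a hand-picked two-source measure for illustration; in the paper the measure is derived from the networks' confidences, which is not reproduced here.

```python
import numpy as np

def choquet_integral(scores, measure):
    """Discrete Choquet integral of classifier confidence scores.

    scores:  (n,) decision scores from the n sources for one class
    measure: dict mapping frozensets of source indices to fuzzy-measure
             values, with measure[frozenset()] = 0 and
             measure[frozenset(range(n))] = 1.
    """
    order = np.argsort(scores)                     # ascending scores
    total, prev = 0.0, 0.0
    for k, idx in enumerate(order):
        coalition = frozenset(order[k:].tolist())  # sources scoring at least scores[idx]
        total += (scores[idx] - prev) * measure[coalition]
        prev = scores[idx]
    return total

# Example with two sources and an assumed fuzzy measure:
mu = {frozenset(): 0.0, frozenset({0}): 0.4, frozenset({1}): 0.5, frozenset({0, 1}): 1.0}
print(choquet_integral(np.array([0.7, 0.2]), mu))   # 0.2*1.0 + 0.5*0.4 = 0.4
```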

Journal ArticleDOI
TL;DR: This paper proposes a novel Viewport oriented Graph Convolution Network (VGCN) for blind omnidirectional image quality assessment (IQA) and demonstrates that the proposed model outperforms state-of-the-art full-reference and no-reference IQA metrics on two public omnidirectional IQA databases.
Abstract: Quality assessment of omnidirectional images has become increasingly urgent due to the rapid growth of virtual reality applications. Different from traditional 2D images and videos, omnidirectional contents can provide consumers with freely changeable viewports and a larger field of view covering the 360°×180° spherical surface, which makes the objective quality assessment of omnidirectional images more challenging. In this paper, motivated by the characteristics of the human vision system (HVS) and the viewing process of omnidirectional contents, we propose a novel Viewport oriented Graph Convolution Network (VGCN) for blind omnidirectional image quality assessment (IQA). Generally, observers tend to give the subjective rating of a 360-degree image after passing and aggregating different viewports information when browsing the spherical scenery. Therefore, in order to model the mutual dependency of viewports in the omnidirectional image, we build a spatial viewport graph. Specifically, the graph nodes are first defined with selected viewports with higher probabilities to be seen, which is inspired by the HVS that human beings are more sensitive to structural information. Then, these nodes are connected by spatial relations to capture interactions among them. Finally, reasoning on the proposed graph is performed via graph convolutional networks. Moreover, we simultaneously obtain global quality using the entire omnidirectional image without viewport sampling to boost the performance according to the viewing experience. Experimental results demonstrate that our proposed model outperforms state-of-the-art full-reference and no-reference IQA metrics on two public omnidirectional IQA databases.
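
Reasoning on the viewport graph is performed with graph convolutional networks. The PyTorch sketch below shows one standard graph-convolution layer over viewport node features with a symmetrically normalized adjacency; it is the textbook GCN propagation rule used as a generic illustration, not the specific VGCN layer.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One standard graph-convolution layer, H' = relu(D^-1/2 (A+I) D^-1/2 H W),
    applied to viewport node features."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h: (num_viewports, in_dim); adj: (num_viewports, num_viewports)
        a = adj + torch.eye(adj.shape[0], device=adj.device)   # add self-loops
        d_inv_sqrt = a.sum(dim=1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_norm @ self.linear(h))
```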

Journal ArticleDOI
TL;DR: This paper considers tracking as a linear regression problem: an ensemble of correlation filters is trained online to distinguish the foreground target from the background, and a channel regularization is applied to the correlation filter learning.
Abstract: In this paper, we propose an adaptive region proposal scheme with feature channel regularization to facilitate robust object tracking. We consider tracking as a linear regression problem, and an ensemble of correlation filters is trained online to distinguish the foreground target from the background. Further, we integrate adaptively learned region proposals into an enhanced two-stream tracking framework based on correlation filters. For the tracking stream, we learn two-stage cascade correlation filters on deep convolutional features to ensure competitive tracking performance. For the detection stream, we employ adaptive region proposals, which are effective in recovering target objects from tracking failures caused by heavy occlusion or out-of-view movement. In contrast to traditional tracking-by-detection methods using random samples or sliding windows, we perform target re-detection over adaptively learned region proposals. Since region proposals naturally take the objectness information into account, we show that the proposed adaptive region proposals can handle the challenging scale estimation problem as well. In addition, we observe channel redundancy and noise in the feature representation, especially for convolutional features. Thus, we apply a channel regularization to the correlation filter learning. Extensive experimental validations on the OTB, VOT and UAV-123 datasets demonstrate that the proposed method performs favorably against state-of-the-art tracking algorithms.
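
Each correlation filter in the ensemble solves a linear (ridge) regression between a feature patch and a desired Gaussian-shaped response. The NumPy sketch below shows the classic single-channel closed-form solution in the Fourier domain and the corresponding detection step; the paper's two-stage cascade and channel regularization are not reproduced, and the regularization weight is an assumed default.

```python
import numpy as np

def train_correlation_filter(feature, target, lam=1e-2):
    """Closed-form ridge-regression correlation filter in the Fourier domain.

    feature: (H, W) feature channel of the training patch
    target:  (H, W) desired Gaussian-shaped response
    """
    F = np.fft.fft2(feature)
    G = np.fft.fft2(target)
    return np.conj(F) * G / (np.conj(F) * F + lam)   # filter in the Fourier domain

def detect(H, feature):
    """Correlate the learned filter with a search-region feature channel and
    take the location of the maximum response as the new target position."""
    response = np.real(np.fft.ifft2(H * np.fft.fft2(feature)))
    return np.unravel_index(response.argmax(), response.shape)
```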

Journal ArticleDOI
TL;DR: A two-stream method is proposed that analyzes the frame-level and temporality-level characteristics of compressed Deepfake videos; the frame-level stream gradually prunes the network to prevent the model from fitting the compression noise.
Abstract: The development of technologies that can generate Deepfake videos is expanding rapidly. These videos are easily synthesized without leaving obvious traces of manipulation. Though forensic detection on high-definition video datasets has achieved remarkable results, the forensics of compressed videos is worth further exploring. In fact, compressed videos are common in social networks, such as videos from Instagram, Wechat, and Tiktok. Therefore, how to identify compressed Deepfake videos becomes a fundamental issue. In this paper, we propose a two-stream method by analyzing the frame-level and temporality-level of compressed Deepfake videos. Since the video compression brings lots of redundant information to frames, the proposed frame-level stream gradually prunes the network to prevent the model from fitting the compression noise. Aiming at the problem that the temporal consistency in Deepfake videos might be ignored, we apply a temporality-level stream to extract temporal correlation features. When combined with scores from the two streams, our proposed method performs better than the state-of-the-art methods in compressed Deepfake videos detection.

Journal ArticleDOI
TL;DR: This survey systematically analyzes the performance of existing TBD-based algorithms on MOT challenge datasets and discusses the factors that affect tracking performance.
Abstract: Multiple pedestrian tracking (MPT) has gained significant attention due to its huge potential in commercial applications. It aims to predict multiple pedestrian trajectories and maintain their identities, given a video sequence. In the past decade, due to the advancement in pedestrian detection algorithms, Tracking-by-Detection (TBD) based algorithms have achieved tremendous successes. TBD has become the most popular MPT framework, and it has been actively studied in the past decade. In this paper, we give a comprehensive survey of recent advances in TBD-based MPT algorithms. We systematically analyze the existing TBD-based algorithms and organize the survey into four major parts. First, this survey draws a timeline introducing the milestones of TBD-based works, briefly reviewing the development of existing TBD-based methods. Second, the main procedures of the TBD framework are summarized, and each stage in the procedure is described in detail. Afterward, this survey analyzes the performance of existing TBD-based algorithms on MOT challenge datasets and discusses the factors that affect tracking performance. Finally, open issues and future directions in the TBD framework are discussed.