
Showing papers by "Guangming Shi published in 2020"


Proceedings Article•DOI•
14 Jun 2020
TL;DR: Zhang et al. propose a no-reference IQA metric based on deep meta-learning that learns the meta-knowledge humans share when evaluating the quality of images with various distortions; this knowledge can then be adapted easily to unknown distortions.
Abstract: Recently, increasing interest has been drawn in exploiting deep convolutional neural networks (DCNNs) for no-reference image quality assessment (NR-IQA). Despite the notable success achieved, there is a broad consensus that training DCNNs heavily relies on massive annotated data. Unfortunately, IQA is a typical small sample problem. Therefore, most of the existing DCNN-based IQA metrics operate based on pre-trained networks. However, these pre-trained networks are not designed for the IQA task, leading to generalization problems when evaluating different types of distortions. With this motivation, this paper presents a no-reference IQA metric based on deep meta-learning. The underlying idea is to learn the meta-knowledge shared by humans when evaluating the quality of images with various distortions, which can then be adapted to unknown distortions easily. Specifically, we first collect a number of NR-IQA tasks for different distortions. Then meta-learning is adopted to learn the prior knowledge shared by diversified distortions. Finally, the quality prior model is fine-tuned on a target NR-IQA task to quickly obtain the quality model. Extensive experiments demonstrate that the proposed metric outperforms state-of-the-art methods by a large margin. Furthermore, the meta-model learned from synthetic distortions can also be easily generalized to authentic distortions, which is highly desired in real-world applications of IQA metrics.
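The two-stage procedure in the abstract (meta-train a quality prior across distortion-specific tasks, then fine-tune on a target task) follows the general pattern of optimization-based meta-learning. The sketch below is illustrative only, not the authors' model: it meta-trains a single scalar "quality predictor" over a family of synthetic regression tasks with first-order MAML-style updates, then adapts it to an unseen task in a few steps. All task definitions and hyperparameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(w, a, x):
    # Toy "quality prediction" task: fit y = a * x with a scalar model w.
    y = a * x
    pred = w * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)
    return loss, grad

def meta_train(tasks, w0=0.0, inner_lr=0.01, meta_lr=0.1, steps=200):
    # First-order MAML: adapt per task with one inner gradient step on a
    # support batch, then move the meta-parameter along the post-adaptation
    # gradient computed on a fresh query batch.
    w = w0
    for _ in range(steps):
        meta_grad = 0.0
        for a in tasks:
            x_support = rng.normal(size=32)
            _, g = task_loss_grad(w, a, x_support)
            w_adapted = w - inner_lr * g                  # inner update
            x_query = rng.normal(size=32)
            _, g_adapted = task_loss_grad(w_adapted, a, x_query)
            meta_grad += g_adapted
        w -= meta_lr * meta_grad / len(tasks)
    return w

tasks = [1.0, 2.0, 3.0]      # each slope plays the role of a distortion type
w_meta = meta_train(tasks)

# Fast adaptation to an unseen "distortion" (a = 2.5) with a few steps.
w = w_meta
for _ in range(5):
    _, g = task_loss_grad(w, 2.5, rng.normal(size=32))
    w -= 0.05 * g
```

After meta-training, `w_meta` sits near the shared structure of the task family, so a handful of gradient steps suffice for the unseen task, which is the behavior the abstract claims at a much larger scale.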

158 citations


Journal Article•DOI•
TL;DR: A novel IQA-oriented CNN method is developed for blind IQA (BIQA), which can efficiently represent the quality degradation, and the Cascaded CNN with HDC (named CaHDC) is introduced, demonstrating the superiority of CaHDC compared with existing BIQA methods.
Abstract: The deep convolutional neural network (CNN) has achieved great success in image recognition. Many image quality assessment (IQA) methods directly use recognition-oriented CNNs for quality prediction. However, the properties of the IQA task differ from those of the image recognition task: image recognition should be sensitive to visual content and robust to distortion, while IQA should be sensitive to both distortion and visual content. In this paper, an IQA-oriented CNN method is developed for blind IQA (BIQA), which can efficiently represent the quality degradation. CNNs are large-data driven, while the sizes of existing IQA databases are too small for CNN optimization. Thus, a large IQA dataset is first established, which includes more than one million distorted images (each image is assigned a quality score as a substitute for the Mean Opinion Score (MOS), abbreviated as pseudo-MOS). Next, inspired by the hierarchical perception mechanism (from local structure to global semantics) in the human visual system, a novel IQA-oriented CNN method is designed, in which the hierarchical degradation is considered. Finally, by jointly optimizing the multilevel feature extraction, hierarchical degradation concatenation (HDC) and quality prediction in an end-to-end framework, the Cascaded CNN with HDC (named CaHDC) is introduced. Experiments on the benchmark IQA databases demonstrate the superiority of CaHDC compared with existing BIQA methods. Meanwhile, CaHDC (with about 0.73M parameters) is lightweight compared to other CNN-based BIQA models and can be easily realized in microprocessing systems. The dataset and source code of the proposed method are available at https://web.xidian.edu.cn/wjj/paper.html .

113 citations


Journal Article•DOI•
TL;DR: In this paper, a self-learning architecture based on self-supervised generative adversarial nets is introduced to demonstrate the potential performance improvement that can be achieved by automatic data learning and synthesizing at the edge of the network.
Abstract: Edge intelligence, also called edge-native artificial intelligence (AI), is an emerging technological framework focusing on seamless integration of AI, communication networks, and mobile edge computing. It has been considered to be one of the key missing components in the existing 5G network and is widely recognized to be one of the most sought-after functions for tomorrow's wireless 6G cellular systems. In this article, we identify the key requirements and challenges of edge-native AI in 6G. A self-learning architecture based on self-supervised generative adversarial nets is introduced to demonstrate the potential performance improvement that can be achieved by automatic data learning and synthesizing at the edge of the network. We evaluate the performance of our proposed self-learning architecture in a university campus shuttle system connected via a 5G network. Our result shows that the proposed architecture has the potential to identify and classify unknown services that emerge in edge computing networks. Future trends and key research problems for self-learning-enabled 6G edge intelligence are also discussed.

58 citations


Proceedings Article•DOI•
12 Oct 2020
TL;DR: This work proposes a novel no-reference VQA framework named Recurrent-In-Recurrent Network (RIRNet), which integrates concepts from motion perception in the human visual system (HVS), manifested in a designed network structure composed of low- and high-level processing.
Abstract: Video quality assessment (VQA), which is capable of automatically predicting the perceptual quality of source videos especially when reference information is not available, has become a major concern for video service providers due to the growing demand for video quality of experience (QoE) by end users. While significant advances have been achieved with recent deep learning techniques, they often lead to misleading results in VQA tasks given their limitation of describing 3D spatio-temporal regularities using only a fixed temporal frequency. Partially inspired by psychophysical and vision science studies revealing the speed tuning property of neurons in the visual cortex when performing motion perception (i.e., sensitivity to different temporal frequencies), we propose a novel no-reference (NR) VQA framework named Recurrent-In-Recurrent Network (RIRNet) that incorporates this characteristic to promote an accurate representation of motion perception in the VQA task. By fusing motion information derived from different temporal frequencies in a more efficient way, the resulting temporal modeling scheme quantifies the temporal motion effect via a hierarchical distortion description. The proposed framework is in closer agreement with quality perception of the distorted videos since it integrates concepts from motion perception in the human visual system (HVS), manifested in a designed network structure composed of low- and high-level processing. A holistic validation of our method on four challenging video quality databases demonstrates superior performance over state-of-the-art methods.
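The core idea above, pooling motion cues at several temporal frequencies, can be illustrated with a deliberately simple stand-in (this is not RIRNet itself): mean absolute frame differences computed at multiple temporal strides, which act like crude speed-tuned channels. The synthetic video and stride choices below are invented for illustration.

```python
import numpy as np

def multi_rate_motion_energy(frames, strides=(1, 2, 4)):
    """Toy analogue of speed-tuned channels: mean absolute frame
    difference at several temporal strides (temporal frequencies)."""
    feats = []
    for s in strides:
        diff = np.abs(frames[s:] - frames[:-s])   # motion at stride s
        feats.append(diff.mean())
    return np.array(feats)

# A slowly drifting synthetic "video": 16 frames of an 8x8 gradient whose
# intensity rises by 0.1 per frame.
T = 16
base = np.linspace(0, 1, 64).reshape(8, 8)
frames = np.stack([base + 0.1 * t for t in range(T)])
energy = multi_rate_motion_energy(frames)
```

For this smooth drift, the energy grows with the stride (larger strides accumulate more displacement), which is the kind of frequency-dependent response a speed-tuned temporal model can exploit.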

51 citations


Journal Article•DOI•
Zhenyu Wang, Fu Li, Guangming Shi, Xuemei Xie, Wang Fangyu
TL;DR: A scheme of network channel pruning based on sparse learning and the genetic algorithm is proposed to achieve a better balance between pruning ratio and accuracy, and is evaluated on several state-of-the-art CNNs.
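Since only the TL;DR is shown, the sketch below is a generic illustration of the genetic-algorithm side of such channel pruning, not the paper's method: binary masks over channels evolve under a fitness that trades retained channel importance (e.g., scores obtained from sparse learning) against the number of kept channels. The importance scores and penalty here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-channel importance scores (e.g. from sparse learning);
# keeping a channel costs a fixed `penalty` (a proxy for compute/pruning ratio).
importance = np.array([0.9, 0.05, 0.6, 0.02, 0.8, 0.1, 0.7, 0.04])
penalty = 0.2

def fitness(mask):
    # Reward retained importance, penalize the number of kept channels.
    return float(importance @ mask - penalty * mask.sum())

def evolve(pop_size=30, n_gen=60, p_mut=0.1):
    n = len(importance)
    pop = rng.integers(0, 2, size=(pop_size, n))
    best_mask, best_score = None, -np.inf
    for _ in range(n_gen):
        scores = np.array([fitness(m) for m in pop])
        i = int(scores.argmax())
        if scores[i] > best_score:                 # track the best ever seen
            best_score, best_mask = scores[i], pop[i].copy()
        # Tournament selection: the better of two random individuals survives.
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        winners = np.where(scores[idx[:, 0]] >= scores[idx[:, 1]],
                           idx[:, 0], idx[:, 1])
        parents = pop[winners]
        # One-point crossover between consecutive parents.
        children = parents.copy()
        for j in range(0, pop_size - 1, 2):
            c = int(rng.integers(1, n))
            children[j, c:], children[j + 1, c:] = (
                parents[j + 1, c:].copy(), parents[j, c:].copy())
        # Bit-flip mutation.
        flip = rng.random(children.shape) < p_mut
        pop = np.where(flip, 1 - children, children)
    return best_mask

best = evolve()   # mask of channels to keep
```

With these toy numbers the GA reliably finds masks that keep only the channels whose importance exceeds the per-channel penalty, which is the balance the TL;DR alludes to.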

48 citations


Posted Content•
TL;DR: A self-learning architecture based on self-supervised generative adversarial nets is introduced to demonstrate the potential performance improvement that can be achieved by automatic data learning and synthesizing at the edge of the network.
Abstract: Edge intelligence, also called edge-native artificial intelligence (AI), is an emerging technological framework focusing on seamless integration of AI, communication networks, and mobile edge computing. It has been considered to be one of the key missing components in the existing 5G network and is widely recognized to be one of the most sought-after functions for tomorrow's wireless 6G cellular systems. In this article, we identify the key requirements and challenges of edge-native AI in 6G. A self-learning architecture based on self-supervised Generative Adversarial Nets (GANs) is introduced to demonstrate the potential performance improvement that can be achieved by automatic data learning and synthesizing at the edge of the network. We evaluate the performance of our proposed self-learning architecture in a university campus shuttle system connected via a 5G network. Our result shows that the proposed architecture has the potential to identify and classify unknown services that emerge in edge computing networks. Future trends and key research problems for self-learning-enabled 6G edge intelligence are also discussed.

46 citations


Journal Article•DOI•
TL;DR: The proposed Skeleton-Guided Multimodal Network (SGM-Net) makes full use of the complementarity of RGB and skeleton modalities at the semantic feature level, and achieves state-of-the-art performance over existing methods on the NTU and Sub-JHMDB datasets.

36 citations


Journal Article•DOI•
TL;DR: A novel event stream denoising method based on a probabilistic undirected graph model (PUGM) that can effectively remove noise events; with the preprocessing of the proposed algorithm, the recognition accuracy on AER data can be markedly improved.
Abstract: The Dynamic Vision Sensor (DVS) is a new type of neuromorphic event-based sensor, which has an innate advantage in capturing fast-moving objects. Due to interference from the DVS hardware itself and many external factors, noise is unavoidable in the output of the DVS. Unlike frames/images with structured data, the output of the DVS is in the form of address-event representation (AER), which means that traditional denoising methods cannot be used on the output (i.e., event stream) of the DVS. In this paper, we propose a novel event stream denoising method based on a probabilistic undirected graph model (PUGM). The motion of objects always shows a certain regularity/trajectory in space and time, which reflects the spatio-temporal correlation between effective events in the stream. Meanwhile, the event stream of the DVS is composed of effective events and random noise. Thus, a probabilistic undirected graph model is constructed to describe this prior knowledge (i.e., spatio-temporal correlation). The undirected graph model is factorized into the product of clique energy functions, and the energy function is defined to obtain the complete expression of the joint probability distribution. A better denoising result corresponds to a higher probability (lower energy), so the denoising problem can be transformed into an energy optimization problem. Thus, the iterated conditional modes (ICM) algorithm is used to optimize the model to remove the noise. Experimental results on denoising show that the proposed algorithm can effectively remove noise events. Moreover, with the preprocessing of the proposed algorithm, the recognition accuracy on AER data can be markedly improved.
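The energy-minimization step can be illustrated with a toy Ising-style model (a generic stand-in, not the paper's exact clique potentials): events on a 2D grid take values ±1, unary terms tie the estimate to the observation, pairwise terms encode the spatio-temporal smoothness prior, and ICM greedily updates each site to its locally optimal value.

```python
import numpy as np

rng = np.random.default_rng(2)

def icm_denoise(obs, beta=2.0, lam=1.0, n_iter=5):
    """Iterated conditional modes on an Ising-style undirected graph:
    unary terms (weight lam) tie x to the observation, pairwise terms
    (weight beta) encode the smoothness prior over neighboring events."""
    x = obs.copy()
    H, W = x.shape
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                nb = 0
                if i > 0: nb += x[i - 1, j]
                if i < H - 1: nb += x[i + 1, j]
                if j > 0: nb += x[i, j - 1]
                if j < W - 1: nb += x[i, j + 1]
                # Pick the spin that minimizes the local energy, i.e.
                # maximizes lam*obs + beta*(sum of neighbor spins).
                x[i, j] = 1 if lam * obs[i, j] + beta * nb >= 0 else -1
    return x

# Clean pattern: a solid "event trajectory" block in a -1 background.
clean = -np.ones((16, 16), dtype=int)
clean[4:12, 4:12] = 1
noisy = np.where(rng.random(clean.shape) < 0.1, -clean, clean)  # flip 10%
denoised = icm_denoise(noisy)
```

Isolated flips disagree with all four neighbors, so the pairwise prior overrules the unary term and ICM restores them, which is the same mechanism the PUGM uses on spatio-temporal event neighborhoods.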

20 citations


Journal Article•DOI•
TL;DR: This work proposes a PIAA method based on meta-learning with bilevel gradient optimization (BLG-PIAA), which is trained using individual aesthetic data directly, generalizes to unknown users quickly, and outperforms the state-of-the-art PIAA metrics.
Abstract: Typical image aesthetics assessment (IAA) is modeled for the generic aesthetics perceived by an ``average'' user. However, such generic aesthetics models neglect the fact that users' aesthetic preferences vary significantly from person to person. Therefore, it is essential to tackle this issue with personalized IAA (PIAA). Since PIAA is a typical small sample learning (SSL) problem, existing PIAA models are usually built by fine-tuning well-established generic IAA (GIAA) models, which are regarded as prior knowledge. Nevertheless, this kind of prior knowledge based on ``average aesthetics'' fails to capture the aesthetic diversity of different people. In order to learn the prior knowledge shared when different people judge aesthetics, that is, to learn how people judge image aesthetics, we propose a PIAA method based on meta-learning with bilevel gradient optimization (BLG-PIAA), which is trained using individual aesthetic data directly and generalizes to unknown users quickly. The proposed approach consists of two phases: 1) meta-training and 2) meta-testing. In meta-training, the aesthetics assessment of each user is regarded as a task, and the training set of each task is divided into two sets: 1) a support set and 2) a query set. Unlike traditional methods that train a GIAA model based on average aesthetics, we train an aesthetic meta-learner model by bilevel gradient updating from the support set to the query set using many users' PIAA tasks. In meta-testing, the aesthetic meta-learner model is fine-tuned using a small amount of aesthetic data of a target user to obtain the PIAA model. The experimental results show that the proposed method outperforms the state-of-the-art PIAA metrics, and the learned prior model of BLG-PIAA can be quickly adapted to unseen PIAA tasks.

18 citations


Proceedings Article•DOI•
Xiaotong Lu, Han Huang, Weisheng Dong, Xin Li, Guangming Shi
09 Jul 2020
TL;DR: It is possible to expand the search space of network pruning by associating each filter with a learnable weight, and joint search-and-training can be conducted iteratively to maximize the learning efficiency.
Abstract: Network pruning has been proposed as a remedy for alleviating the over-parameterization problem of deep neural networks. However, its value has recently been challenged, especially from the perspective of neural architecture search (NAS). We challenge the conventional wisdom of pruning-after-training by proposing a joint search-and-training approach that directly learns a compact network from scratch. By treating pruning as a search strategy, we present two new insights in this paper: 1) it is possible to expand the search space of network pruning by associating each filter with a learnable weight; 2) joint search-and-training can be conducted iteratively to maximize the learning efficiency. More specifically, we propose a coarse-to-fine tuning strategy to iteratively sample and update compact sub-networks to approximate the target network. The weights associated with network filters are accordingly updated by joint search-and-training to reflect learned knowledge in the NAS space. Moreover, we introduce strategies of random perturbation (inspired by Monte Carlo methods) and flexible thresholding (inspired by reinforcement learning) to adjust the weight and size of each layer. Extensive experiments on ResNet and VGGNet demonstrate the superior performance of our proposed method on popular datasets including CIFAR10, CIFAR100 and ImageNet.
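The two insights above, a learnable weight per filter and iterative search by random sampling plus thresholding, can be caricatured in a few lines. The sketch below is a generic REINFORCE-style stand-in, not the paper's algorithm: each filter carries a learnable logit, sub-networks are sampled Monte Carlo style, logits are pushed toward samples with higher reward under a hypothetical utility/cost trade-off, and the final logits are thresholded.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical proxy: each filter contributes a fixed utility, and every
# kept filter pays a uniform compute cost. Real systems would measure this
# by training/evaluating the sampled sub-network.
utility = np.array([0.8, 0.05, 0.7, 0.1, 0.9, 0.02])
cost = 0.3
n = len(utility)

def reward(mask):
    return float(utility @ mask - cost * mask.sum())

def joint_search(n_iter=400, n_samples=16, lr=0.5):
    w = np.zeros(n)                         # learnable logit per filter
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-w))
        # Monte Carlo perturbation: sample candidate sub-networks.
        masks = (rng.random((n_samples, n)) < p).astype(float)
        rs = masks @ utility - cost * masks.sum(axis=1)
        adv = rs - rs.mean()                # baseline reduces gradient variance
        # REINFORCE-style update: favor filters appearing in good samples.
        w += lr * (adv[:, None] * (masks - p)).mean(axis=0)
    return w

w = joint_search()
keep = w > 0.0                              # flexible thresholding at zero
```

The learned logits separate filters whose utility exceeds the per-filter cost from those that should be pruned, so the thresholded mask beats keeping everything under the same proxy reward.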

13 citations


Posted Content•
TL;DR: A no-reference IQA metric based on deep meta-learning that outperforms state-of-the-art methods by a large margin and can be easily generalized to authentic distortions, which is highly desired in real-world applications of IQA metrics.
Abstract: Recently, increasing interest has been drawn in exploiting deep convolutional neural networks (DCNNs) for no-reference image quality assessment (NR-IQA). Despite the notable success achieved, there is a broad consensus that training DCNNs heavily relies on massive annotated data. Unfortunately, IQA is a typical small sample problem. Therefore, most of the existing DCNN-based IQA metrics operate based on pre-trained networks. However, these pre-trained networks are not designed for the IQA task, leading to generalization problems when evaluating different types of distortions. With this motivation, this paper presents a no-reference IQA metric based on deep meta-learning. The underlying idea is to learn the meta-knowledge shared by humans when evaluating the quality of images with various distortions, which can then be adapted to unknown distortions easily. Specifically, we first collect a number of NR-IQA tasks for different distortions. Then meta-learning is adopted to learn the prior knowledge shared by diversified distortions. Finally, the quality prior model is fine-tuned on a target NR-IQA task to quickly obtain the quality model. Extensive experiments demonstrate that the proposed metric outperforms state-of-the-art methods by a large margin. Furthermore, the meta-model learned from synthetic distortions can also be easily generalized to authentic distortions, which is highly desired in real-world applications of IQA metrics.

Journal Article•DOI•
Qian Ning, Weisheng Dong, Fangfang Wu, Jinjian Wu, Jie Lin, Guangming Shi
03 Apr 2020
TL;DR: A novel spatial-temporal Gaussian scale mixture (STGSM) model for foreground estimation is proposed, and optical flow is used to model the correspondences between foreground pixels in adjacent frames to better characterize the temporal correlations.
Abstract: Subtracting the backgrounds from video frames is an important step for many video analysis applications. Assuming that the backgrounds are low-rank and the foregrounds are sparse, robust principal component analysis (RPCA)-based methods have shown promising results. However, RPCA-based methods suffer from the scale issue, i.e., the l1-sparsity regularizer fails to model the varying sparsity of the moving objects. While several efforts have been made to address this issue with advanced sparse models, previous methods cannot fully exploit the spatial-temporal correlations among the foregrounds. In this paper, we propose a novel spatial-temporal Gaussian scale mixture (STGSM) model for foreground estimation. In the proposed STGSM model, a temporal consistency constraint is imposed over the estimated foregrounds through nonzero-mean Gaussian models. Specifically, the estimates of the foregrounds obtained in the previous frame are used as the prior for those of the current frame, and nonzero-mean Gaussian scale mixture (GSM) models are developed. To better characterize the temporal correlations, optical flow is used to model the correspondences between foreground pixels in adjacent frames. The spatial correlations are also exploited by considering that locally correlated pixels should be characterized by the same STGSM model, leading to further performance improvements. Experimental results on real video datasets show that the proposed method performs comparably to or even better than current state-of-the-art background subtraction methods.

Posted Content•
TL;DR: This paper addresses the challenging unsupervised scene flow estimation problem by jointly learning four low-level vision sub-tasks: optical flow F, stereo-depth D, camera pose P and motion segmentation S by designing a novel Rigidity From Motion (RfM) layer with three principal components.
Abstract: This paper addresses the challenging unsupervised scene flow estimation problem by jointly learning four low-level vision sub-tasks: optical flow $\textbf{F}$, stereo-depth $\textbf{D}$, camera pose $\textbf{P}$ and motion segmentation $\textbf{S}$. Our key insight is that the rigidity of the scene shares the same inherent geometrical structure with object movements and scene depth. Hence, rigidity from $\textbf{S}$ can be inferred by jointly coupling $\textbf{F}$, $\textbf{D}$ and $\textbf{P}$ to achieve more robust estimation. To this end, we propose a novel scene flow framework named EffiScene with efficient joint rigidity learning, going beyond the existing pipeline with independent auxiliary structures. In EffiScene, we first estimate optical flow and depth at the coarse level and then compute camera pose by the Perspective-$n$-Points method. To jointly learn local rigidity, we design a novel Rigidity From Motion (RfM) layer with three principal components: (i) correlation extraction; (ii) boundary learning; and (iii) outlier exclusion. Final outputs are fused based on the rigid map $M_R$ from RfM at finer levels. To efficiently train EffiScene, two new losses $\mathcal{L}_{bnd}$ and $\mathcal{L}_{unc}$ are designed to prevent trivial solutions and to regularize the flow boundary discontinuity. Extensive experiments on the scene flow benchmark KITTI show that our method is effective and significantly improves the state-of-the-art approaches for all sub-tasks, i.e. optical flow ($5.19 \rightarrow 4.20$), depth estimation ($3.78 \rightarrow 3.46$), visual odometry ($0.012 \rightarrow 0.011$) and motion segmentation ($0.57 \rightarrow 0.62$).

Proceedings Article•DOI•
21 Oct 2020
TL;DR: In this article, the authors proposed a federated edge intelligence (FEI) framework that allows edge servers to evaluate the required number of data samples according to the energy cost of the IoT network as well as their local data processing capacity and only request the amount of data that is sufficient for training a satisfactory model.
Abstract: This paper studies an edge intelligence-based IoT network in which a set of edge servers learn a shared model using federated learning (FL) based on the datasets uploaded from a multi-technology-supported IoT network. The data uploading performance of the IoT network and the computational capacity of the edge servers are entangled with each other in influencing the FL model training process. We propose a novel framework, called federated edge intelligence (FEI), that allows edge servers to evaluate the required number of data samples according to the energy cost of the IoT network as well as their local data processing capacity, and to request only the amount of data that is sufficient for training a satisfactory model. We evaluate the energy cost for data uploading when two widely used IoT solutions, licensed-band IoT (e.g., 5G NB-IoT) and unlicensed-band IoT (e.g., Wi-Fi, ZigBee, and 5G NR-U), are available to each IoT device. We prove that the cost minimization problem of the entire IoT network is separable and can be divided into a set of subproblems, each of which can be solved by an individual edge server. We also introduce a mapping function to quantify the computational load of edge servers under different combinations of three key parameters: size of the dataset, local batch size, and number of local training passes. Finally, we adopt an Alternating Direction Method of Multipliers (ADMM)-based approach to jointly optimize the energy cost of the IoT network and the average computing resource utilization of the edge servers. We prove that our proposed algorithm neither causes any data leakage nor discloses any topological information of the IoT network. Simulation results show that our proposed framework significantly improves the resource efficiency of the IoT network and edge servers with only a limited sacrifice in model convergence performance.
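The claimed separability, where the global cost splits into per-server subproblems coordinated by ADMM, can be illustrated with a toy consensus problem (not the paper's actual cost model): each "server" holds a private quadratic cost, and standard consensus ADMM alternates local minimizations with a global averaging step. The per-server optima below are hypothetical.

```python
import numpy as np

# Each edge server i holds a private cost (x - a_i)^2, standing in for its
# own energy/computation trade-off. Consensus ADMM solves
#     min_x  sum_i (x - a_i)^2
# using only local updates plus a shared average, mirroring how a separable
# problem is split into subproblems solved by individual servers.
a = np.array([1.0, 3.0, 5.0, 7.0])        # hypothetical per-server optima
rho = 1.0
n = len(a)
x = np.zeros(n)                            # local (per-server) variables
u = np.zeros(n)                            # scaled dual variables
z = 0.0                                    # shared consensus variable

for _ in range(50):
    # Local x-update: argmin_x (x - a_i)^2 + (rho/2)(x - z + u_i)^2,
    # solved in closed form independently at each server.
    x = (2 * a + rho * (z - u)) / (2 + rho)
    z = (x + u).mean()                     # global consensus (averaging) step
    u = u + x - z                          # dual ascent
```

At convergence every local variable agrees with `z`, which equals the minimizer of the summed cost (here the mean of `a`); only local solves and one scalar average are ever exchanged.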

Posted Content•
TL;DR: The proposed Temporal Enhanced Graph Convolutional Network (TE-GCN) constructs a temporal relation graph to capture complex temporal dynamics, and achieves state-of-the-art performance through its contribution to temporal modeling for action recognition.
Abstract: Graph Convolutional Networks (GCNs), which model skeleton data as graphs, have obtained remarkable performance for skeleton-based action recognition. In particular, the temporal dynamics of a skeleton sequence convey significant information for the recognition task. For temporal dynamics modeling, GCN-based methods only stack multi-layer 1D local convolutions to extract temporal relations between adjacent time steps. As many local convolutions are repeated, key temporal information between non-adjacent time steps may be ignored due to information dilution. Therefore, it remains unclear how these methods can fully explore the temporal dynamics of a skeleton sequence. In this paper, we propose a Temporal Enhanced Graph Convolutional Network (TE-GCN) to tackle this limitation. The proposed TE-GCN constructs a temporal relation graph to capture complex temporal dynamics. Specifically, the constructed temporal relation graph explicitly builds connections between semantically related temporal features to model temporal relations between both adjacent and non-adjacent time steps. Meanwhile, to further explore the temporal dynamics, a multi-head mechanism is designed to investigate multiple kinds of temporal relations. Extensive experiments are performed on two widely used large-scale datasets, NTU-60 RGB+D and NTU-120 RGB+D. Experimental results show that the proposed model achieves state-of-the-art performance through its contribution to temporal modeling for action recognition.
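The construction of a temporal relation graph from semantic similarity can be sketched generically (this is not the TE-GCN code): per-head adjacency matrices are built from scaled dot-product similarity between time-step features, so semantically related but temporally distant steps become directly connected. The toy features and head count below are invented.

```python
import numpy as np

def temporal_relation_graph(feats, n_heads=2):
    """Build per-head temporal adjacency from feature similarity, so
    semantically related time steps connect even when far apart.
    feats: (T, C) array with C divisible by n_heads."""
    heads = np.split(feats, n_heads, axis=1)
    adjs = []
    for h in heads:
        sim = h @ h.T / np.sqrt(h.shape[1])         # scaled dot-product
        sim = np.exp(sim - sim.max(axis=1, keepdims=True))
        adjs.append(sim / sim.sum(axis=1, keepdims=True))  # row softmax
    return adjs

# Toy skeleton features: steps 0 and 5 repeat the same pose; steps 1-4 differ.
feats = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
])
adjs = temporal_relation_graph(feats)
A = sum(adjs) / len(adjs)   # fuse the heads by averaging
```

Here the non-adjacent but semantically similar pair (steps 0 and 5) ends up with a stronger edge than the adjacent but dissimilar pair (steps 0 and 1), which is exactly the relation a stack of local 1D convolutions would dilute.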

Proceedings Article•DOI•
04 May 2020
TL;DR: A probabilistic undirected graph model is built to describe this difference, with which the denoising problem is converted into a probability maximization problem; the method can effectively remove noise events directly from the event stream and significantly improve the event recognition rate.
Abstract: As novel asynchronously driven cameras, event-based sensors offer high sensitivity, fast response, low power consumption and low data volume, but their output contains abundant noise. Since the output of event-based sensors is in the form of address-event representation (AER), traditional frame-based denoising methods cannot be used. In this paper, we introduce a novel event stream denoising method for such sensors. Effective events tend to show temporal and spatial regularity, while noise events show a kind of randomness. Thus, we build a probabilistic undirected graph model to describe this difference, with which the denoising problem is converted into a probability maximization problem. Then, the model is decomposed into the product of energy functions on the maximal cliques, and the iterated conditional modes (ICM) algorithm is used for energy minimization to obtain the denoised event stream. Experiments show that our method can effectively remove noise events directly from the event stream and significantly improve the event recognition rate.

Book Chapter•DOI•
16 Oct 2020
TL;DR: In this article, the authors propose the equilibrium criterion, which provides a suitable measurement of dissimilarity between samples around one center or between samples from different centers, for unsupervised person re-ID.
Abstract: Unsupervised person re-identification (re-ID) has not achieved the desired results because learning a discriminative feature embedding without annotation is difficult. Fortunately, the special distribution of samples in this task provides critical prior information for addressing this problem. On the one hand, the distribution of samples belonging to the same identity is multi-centered. On the other hand, the distribution is distinct for samples of different levels that are cropped from the images. According to the first property, we propose the equilibrium criterion, which provides a suitable measurement of dissimilarity between samples around one center or between samples from different centers. According to the second property, we introduce multi-level labels guided learning to mine and utilize the complementary information among different levels. Extensive experiments demonstrate that our method is superior to the state-of-the-art unsupervised re-ID approaches by significant margins.

Journal Article•DOI•
TL;DR: A no-reference image quality index for depth maps is presented by modeling the statistics of edge profiles (SEP) in a multi-scale framework and demonstrates that the proposed metric outperforms the relevant state-of-the-art quality metrics by a large margin and has better generalization ability.

Posted Content•
TL;DR: This paper proposes a novel framework, called federated edge intelligence (FEI), that allows edge servers to evaluate the required number of data samples according to the energy cost of the IoT network as well as their local data processing capacity and only request the amount of data that is sufficient for training a satisfactory model.
Abstract: This paper studies an edge intelligence-based IoT network in which a set of edge servers learn a shared model using federated learning (FL) based on the datasets uploaded from a multi-technology-supported IoT network. The data uploading performance of the IoT network and the computational capacity of the edge servers are entangled with each other in influencing the FL model training process. We propose a novel framework, called federated edge intelligence (FEI), that allows edge servers to evaluate the required number of data samples according to the energy cost of the IoT network as well as their local data processing capacity, and to request only the amount of data that is sufficient for training a satisfactory model. We evaluate the energy cost for data uploading when two widely used IoT solutions, licensed-band IoT (e.g., 5G NB-IoT) and unlicensed-band IoT (e.g., Wi-Fi, ZigBee, and 5G NR-U), are available to each IoT device. We prove that the cost minimization problem of the entire IoT network is separable and can be divided into a set of subproblems, each of which can be solved by an individual edge server. We also introduce a mapping function to quantify the computational load of edge servers under different combinations of three key parameters: size of the dataset, local batch size, and number of local training passes. Finally, we adopt an Alternating Direction Method of Multipliers (ADMM)-based approach to jointly optimize the energy cost of the IoT network and the average computing resource utilization of the edge servers. We prove that our proposed algorithm neither causes any data leakage nor discloses any topological information of the IoT network. Simulation results show that our proposed framework significantly improves the resource efficiency of the IoT network and edge servers with only a limited sacrifice in model convergence performance.

Posted Content•
TL;DR: A novel map generation technique, developed from the viewpoint of information theory, boosts the slight 3D expression differences against strong personality variations and outperforms the state-of-the-art 2D+3D FER methods in both FER accuracy and the output entropy of the generated maps.
Abstract: In 2D+3D facial expression recognition (FER), existing methods generate multi-view geometry maps to enhance the depth feature representation. However, this may introduce false estimations due to local plane fitting from incomplete point clouds. In this paper, we propose a novel Map Generation technique from the viewpoint of information theory, to boost the slight 3D expression differences from strong personality variations. First, we examine the HDR depth data to extract the discriminative dynamic range $r_{dis}$, and maximize the entropy of $r_{dis}$ to a global optimum. Then, to prevent the large deformation caused by over-enhancement, we introduce a depth distortion constraint and reduce the complexity from $O(KN^2)$ to $O(KN\tau)$. Furthermore, the constrained optimization is modeled as a $K$-edges maximum weight path problem in a directed acyclic graph, and we solve it efficiently via dynamic programming. Finally, we also design an efficient Facial Attention structure to automatically locate subtle discriminative facial parts for multi-scale learning, and train it with a proposed loss function $\mathcal{L}_{FA}$ without any facial landmarks. Experimental results on different datasets show that the proposed method is effective and outperforms the state-of-the-art 2D+3D FER methods in both FER accuracy and the output entropy of the generated maps.
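The constrained optimization above is cast as a K-edges maximum-weight path problem in a DAG and solved by dynamic programming. The following is a generic DP for that graph problem; the graph, weights, and K are toy values, not the paper's depth-mapping formulation.

```python
import math

def k_edge_max_weight_path(n_nodes, edges, k):
    """Maximum-weight path that uses exactly k edges in a DAG.
    edges: list of (u, v, weight) with u < v (i.e., topologically ordered).
    dp[step][v] = best weight of a path with `step` edges ending at v.
    Runs in O(k * |E|)."""
    NEG = -math.inf
    dp = [[NEG] * n_nodes for _ in range(k + 1)]
    for v in range(n_nodes):
        dp[0][v] = 0.0                       # empty path may end anywhere
    for step in range(1, k + 1):
        for u, v, w in edges:
            if dp[step - 1][u] > NEG:
                dp[step][v] = max(dp[step][v], dp[step - 1][u] + w)
    return max(dp[k])

# Toy DAG: best 3-edge path is 0 -> 1 -> 2 -> 3 with weight 2 + 2 + 3.
edges = [(0, 1, 2.0), (0, 2, 1.0), (1, 2, 2.0), (1, 3, 1.0), (2, 3, 3.0)]
best = k_edge_max_weight_path(4, edges, k=3)
```

Because the DAG's edges are processed in topological order, each table entry is finalized before it is read, giving the linear-in-K complexity the abstract exploits for its constrained map generation.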

Posted Content•
TL;DR: In this article, the authors proposed a federated edge intelligence based architecture for supporting resource-efficient semantic-aware networking, which allows each user to offload the computationally intensive semantic encoding and decoding tasks to the edge servers and protect its proprietary model-related information by coordinating via intermediate results.
Abstract: Existing communication systems are mainly built on Shannon's information theory, which deliberately ignores the semantic aspects of communication. The recent iteration of wireless technology, the so-called 5G and beyond, promises to support a plethora of services enabled by carefully tailored network capabilities based on contents, requirements, as well as semantics. This has sparked significant interest in semantic communication, a novel paradigm that incorporates the meaning of messages into communication. In this article, we first review the classic semantic communication framework and then summarize key challenges that hinder its popularity. We observe that some semantic communication processes, such as semantic detection, knowledge modeling, and coordination, can be resource-consuming and inefficient, especially for communication between a single source and a destination. We therefore propose a novel architecture based on federated edge intelligence for supporting resource-efficient semantic-aware networking. Our architecture allows each user to offload the computationally intensive semantic encoding and decoding tasks to edge servers and to protect its proprietary model-related information by coordinating via intermediate results. Our simulation results show that the proposed architecture can reduce resource consumption and significantly improve communication efficiency.

Journal Article•DOI•
TL;DR: A novel architecture for UAV vehicle detection is proposed that uses an anchor-free mechanism to eliminate predefined anchors, together with a multi-scale semantic enhancement block (MSEB) and an effective 49-layer backbone based on DetNet59.
Abstract: Vehicle detection based on unmanned aerial vehicle (UAV) images is a challenging task. One reason is that the objects are small, low-resolution, and exhibit large scale variations, resulting in weak feature representation. Another reason is the imbalance between positive and negative examples. In this paper, we propose a novel architecture for UAV vehicle detection to solve the above problems. In detail, we use an anchor-free mechanism to eliminate predefined anchors, which reduces complicated computation and relieves the imbalance between positive and negative samples. Meanwhile, to enhance the features for vehicles, we design a multi-scale semantic enhancement block (MSEB) and an effective 49-layer backbone based on DetNet59. The proposed network offers receptive fields that match the small-sized vehicles, and incorporates precise localization information provided by high-resolution contexts. The MSEB strengthens discriminative feature representation at various scales without reducing the spatial resolution of the prediction layers. Experiments show that the proposed method achieves state-of-the-art performance. In particular, for the main category of vehicles, which are much smaller, the accuracy is about 2% higher than that of other existing methods.
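An anchor-free detector predicts a per-pixel center score map and decodes objects from its local maxima instead of matching predefined anchor boxes. The sketch below shows only that generic decoding step (the 3x3 non-maximum suppression is a common convention, not the paper's exact head; all names and values are illustrative):

```python
def decode_center_heatmap(heatmap, threshold=0.5):
    """Extract object centers from a per-pixel score map by local-maximum
    filtering -- the basic decoding step of anchor-free detection.
    `heatmap` is a list of rows of scores in [0, 1]; a cell is kept if it
    exceeds `threshold` and is >= all of its 8 neighbours (3x3 NMS).
    Returns (x, y, score) tuples in scan order.
    """
    h, w = len(heatmap), len(heatmap[0])
    centers = []
    for y in range(h):
        for x in range(w):
            s = heatmap[y][x]
            if s < threshold:
                continue
            neigh = [heatmap[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))]
            if s >= max(neigh):  # local peak -> one detection, no anchors
                centers.append((x, y, s))
    return centers

hm = [[0.1, 0.2, 0.1, 0.1, 0.1],
      [0.2, 0.9, 0.2, 0.1, 0.1],
      [0.1, 0.2, 0.1, 0.2, 0.8]]
print(decode_center_heatmap(hm))  # [(1, 1, 0.9), (4, 2, 0.8)]
```

Because every pixel is its own candidate, there is no anchor-to-ground-truth matching step, which is what relieves the positive/negative imbalance the abstract mentions.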

Proceedings Article•DOI•
Qingzhe Pan1, Xuemei Xie1, Zhifu Zhao1, Mao Siying1, Jianan Li1, Guangming Shi1 •
25 Dec 2020
TL;DR: In this paper, a two-stage recurrent neural network (RNN) is proposed for high-resolution ultrasonic echo detection; it performs well under severe overlapping while greatly improving the detection speed.
Abstract: Ultrasonic echo methods have been widely researched for the application of flaw detection, where flaw locations are identified by the arrival time of each echo. The main difficulty is that the received echoes from consecutive flaws overlap in time when the flaws are close. Over the last decades, sparse approximation and neural-network-based methods have been used to address this issue. However, these methods cannot achieve satisfactory performance in high-noise and severely overlapping scenarios. In this paper, we propose a high-resolution ultrasonic echo detection method with a two-stage recurrent neural network, comprising the localization of echoes and the regression of echo amplitudes. The proposed method adopts a two-stage strategy that filters interfering sequences through localization and then predicts the amplitude of each echo. It realizes high-resolution ultrasonic echo detection and performs well under severe overlapping, while greatly improving the detection speed.
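The two-stage idea (localize first, regress amplitude only at the kept positions) can be sketched with crude stand-ins for both RNN stages. Here a thresholded local-peak test plays the localization stage and a simple read-off plays the regression stage; the function, signal, and threshold are all illustrative, not the paper's model:

```python
def detect_echoes(signal, loc_threshold=0.5):
    """Two-stage echo detection sketch. Stage 1 (localization) keeps only
    samples that stand out from their neighbours, filtering interfering
    sequences; stage 2 (regression) estimates the echo amplitude at each
    kept location. Both stages are crude stand-ins for the paper's RNNs.
    Returns (index, amplitude) pairs.
    """
    # Stage 1: localize candidate echo positions (local peaks above threshold).
    candidates = [i for i in range(1, len(signal) - 1)
                  if abs(signal[i]) >= loc_threshold
                  and abs(signal[i]) >= abs(signal[i - 1])
                  and abs(signal[i]) >= abs(signal[i + 1])]
    # Stage 2: regress the amplitude at each localized position.
    return [(i, signal[i]) for i in candidates]

sig = [0.0, 0.1, 0.9, 0.2, 0.05, -0.7, 0.1, 0.0]
print(detect_echoes(sig))  # [(2, 0.9), (5, -0.7)]
```

The point of the split is that the second stage only ever sees positions the first stage accepted, which is what makes the pipeline fast under heavy overlap.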

Journal Article•DOI•
TL;DR: A hybrid network compression technique that exploits the prior knowledge of network parameters with Gaussian scale mixture (GSM) models, formulating network pruning as a maximum a posteriori (MAP) estimation problem with a sparsity prior.
Abstract: Despite the great success of deep convolutional neural networks (DCNNs), their heavy computational complexity remains a key obstacle to their wide use in practical applications. To meet this challenge, DCNN pruning has recently been developed as a technique for compressing DCNNs to facilitate their applications in the real world. In this paper, we propose a hybrid network compression technique for exploiting the prior knowledge of network parameters by Gaussian scale mixture (GSM) models. Specifically, the collection of network parameters is characterized by GSM models, and network pruning is formulated as a maximum a posteriori (MAP) estimation problem with a sparsity prior. The key novel insight brought by this work is that groups of parameters associated with the same channel are similar, which is analogous to the grouping of similar patches in natural images. Such observation inspires us to leverage powerful structured sparsity priors from image restoration for network compression, i.e., to develop a flexible filter-grouping strategy that not only promotes structured sparsity but also can be seamlessly integrated with the existing network pruning framework. Extensive experimental results on several popular DCNN models including VGGNet, ResNet and DenseNet have shown that the proposed GSM-based joint grouping and pruning method convincingly outperforms other competing approaches (including both pruning and non-pruning based methods).
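The structured (channel-level) sparsity idea can be illustrated by the simplest possible stand-in: score each output channel by the L2 norm of its whole parameter group and drop the weakest groups. This is only a minimal sketch of channel pruning under a group-sparsity criterion; the paper's GSM/MAP formulation is considerably richer, and all names below are hypothetical:

```python
import math

def prune_channels_by_group_norm(filters, keep_ratio=0.5):
    """Rank each output channel by the L2 norm of its whole filter group
    (all weights belonging to that channel) and keep the strongest ones.
    `filters` maps channel name -> flat list of weights.
    Returns the set of channel names to keep.
    """
    norms = {c: math.sqrt(sum(w * w for w in ws)) for c, ws in filters.items()}
    k = max(1, int(len(filters) * keep_ratio))
    ranked = sorted(norms, key=norms.get, reverse=True)
    return set(ranked[:k])

filters = {
    "conv1_ch0": [0.5, -0.4, 0.3],
    "conv1_ch1": [0.01, 0.02, -0.01],   # near-zero group -> pruned
    "conv1_ch2": [0.7, 0.1, -0.6],
    "conv1_ch3": [0.05, -0.03, 0.02],   # near-zero group -> pruned
}
print(prune_channels_by_group_norm(filters, keep_ratio=0.5))
# {'conv1_ch0', 'conv1_ch2'}
```

Scoring whole groups rather than individual weights is what makes the resulting sparsity structured: an entire channel disappears, so the pruned network stays a dense, regular DCNN.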

Proceedings Article•DOI•
Shambel Ferede1, Xuemei Xie1, Chen Zhang1, Jiang Du1, Guangming Shi1 •
23 Oct 2020
TL;DR: Small ball tracking with trajectory prediction is proposed to track balls while athletes play with them on the sports field, a challenging task that is important for intelligent physical education, especially for coaches assessing the accuracy of players' actions against pre-defined rules.
Abstract: We propose small ball tracking with trajectory prediction to track balls while athletes play with them on the sports field, as shown in Figure 1. This is a challenging task that is important for intelligent physical education, especially for coaches to assess the accuracy of the players' actions based on pre-defined rules. The proposed method achieves good performance on small ball tracking, since the designed algorithm incorporates motion, temporal, and directional information to predict the trajectory. Experimental results show that the proposed method effectively reduces the number of identity switches and decreases track fragmentation. With the integration of motion and directional information and frame storage, this framework efficiently tracks small balls while athletes play with them.
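The simplest form of motion-plus-direction trajectory prediction is a constant-velocity extrapolation from the last observed positions, used to bridge frames where the small ball is occluded or missed. The sketch below illustrates only that basic idea (function name, track values, and the constant-velocity assumption are ours, not the paper's model):

```python
def predict_next(track, horizon=1):
    """Constant-velocity trajectory prediction: estimate the next position
    of a small, fast-moving ball from its last two observed positions.
    `track` is a list of (x, y) points in temporal order; `horizon` is the
    number of frames to extrapolate forward.
    """
    (x0, y0), (x1, y1) = track[-2], track[-1]
    vx, vy = x1 - x0, y1 - y0  # per-frame velocity (encodes direction + speed)
    return (x1 + vx * horizon, y1 + vy * horizon)

track = [(10, 5), (14, 8), (18, 11)]  # ball moving right and down
print(predict_next(track))       # (22, 14)
print(predict_next(track, 3))    # (30, 20)
```

Matching new detections against the predicted position rather than the last seen one is what suppresses identity switches when the ball briefly disappears.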

Proceedings Article•DOI•
Yan Zhu1, Yi Niu1, Fu Li1, Chunbo Zou, Guangming Shi1 •
01 Oct 2020
TL;DR: A channel-grouping based patch swap technique is proposed to group the style feature maps into surface and texture channels, and the new features are created by the combination of these two groups, which can be regarded as a semantic-level fusion of the raw style features.
Abstract: The basic principle of the patch-matching based style transfer is to substitute the patches of the content image feature maps by the closest patches from the style image feature maps. Since the finite features harvested from one single aesthetic style image are inadequate to represent the rich textures of the content natural image, existing techniques treat the full-channel style feature patches as simple signal tensors and create new style feature patches via signal-level fusion. In this paper, we propose a channel-grouping based patch swap technique to group the style feature maps into surface and texture channels, and the new features are created by the combination of these two groups, which can be regarded as a semantic-level fusion of the raw style features. Experimental results demonstrate that the proposed method outperforms the existing techniques in providing more style-consistent textures while keeping the content fidelity.
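The basic patch-swap operation described above (substitute each content feature patch by its closest style patch) can be shown with flat patch vectors and normalized cross-correlation as the similarity measure; real implementations run this as a convolution over full feature maps, and the channel-grouping refinement of this paper is not modeled here:

```python
import math

def patch_swap(content_patches, style_patches):
    """Replace each content feature patch with the most similar style patch
    under normalized cross-correlation (cosine similarity) -- the basic
    patch-matching operation behind patch-swap style transfer.
    Patches are flat lists of floats.
    """
    def cosine(a, b):
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nb = math.sqrt(sum(x * x for x in b)) or 1.0
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    swapped = []
    for cp in content_patches:
        best = max(style_patches, key=lambda sp: cosine(cp, sp))
        swapped.append(best)
    return swapped

content = [[1.0, 0.0], [0.0, 1.0]]
style = [[2.0, 0.1], [0.1, 2.0], [1.0, 1.0]]
print(patch_swap(content, style))  # [[2.0, 0.1], [0.1, 2.0]]
```

The limitation the paper targets is visible even here: the output can only ever contain patches that literally exist in the style set, which is why grouping channels and recombining them enlarges the effective style vocabulary.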

Book Chapter•DOI•
Yilei Chen1, Xuemei Xie1, Lihua Ma1, Jiang Du1, Guangming Shi1 •
16 Oct 2020
TL;DR: This paper proposes reasoning-based multi-level prediction with a graphical model for single-person human pose estimation to obtain accurate locations of body joints, and shows it can achieve highly accurate results and outperform state-of-the-art methods.
Abstract: More and more complex Deep Neural Networks (DNNs) are designed to improve the human pose estimation task. However, it is still hard to handle the inherent ambiguities due to the diversity of postures and occlusions, and it is difficult to meet the requirements for high accuracy of human pose estimation in practical applications. In this paper, reasoning-based multi-level prediction with a graphical model is proposed for single-person human pose estimation to obtain accurate locations of body joints. Specifically, a multi-level prediction using a cascaded network is designed, with recursive prediction over three levels of joints ordered from easy to hard. At each stage, multi-scale fusion and channel-wise feature enhancement are employed to obtain stronger contextual information and improve the capacity of feature extraction. Heatmaps with rich spatial and semantic information are refined by explicitly constructing a graphical model to learn the structural information for inference, which implements the interactions between joints. The proposed method is evaluated on the LSP dataset. The experiments show that it achieves highly accurate results and outperforms state-of-the-art methods.
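Graphical-model reasoning over joints means choosing joint locations so that per-joint (unary) heatmap scores and pairwise joint compatibilities are maximized together. A minimal sketch, assuming a chain-structured model solved by max-product (Viterbi) dynamic programming; the paper's graph, potentials, and learning are not reproduced here:

```python
def chain_viterbi(unary, pairwise):
    """Max-product inference on a chain-structured model of body joints:
    pick one candidate location per joint so that unary scores plus
    pairwise compatibilities are maximized. `unary[j]` lists candidate
    scores for joint j; `pairwise(j, a, b)` scores choosing candidate a
    at joint j and candidate b at joint j+1. Returns chosen indices.
    """
    n = len(unary)
    dp, back = [list(unary[0])], []
    for j in range(1, n):
        row, bk = [], []
        for b in range(len(unary[j])):
            best_a = max(range(len(unary[j - 1])),
                         key=lambda a: dp[-1][a] + pairwise(j - 1, a, b))
            row.append(dp[-1][best_a] + pairwise(j - 1, best_a, b) + unary[j][b])
            bk.append(best_a)
        dp.append(row)
        back.append(bk)
    b = max(range(len(unary[-1])), key=lambda i: dp[-1][i])
    path = [b]
    for j in range(n - 2, -1, -1):
        b = back[j][b]
        path.append(b)
    return path[::-1]

# Two joints, two candidates each; pairwise term rewards consistent choices.
unary = [[1.0, 0.9], [0.0, 0.2]]
pairwise = lambda j, a, b: 0.5 if a == b else 0.0
print(chain_viterbi(unary, pairwise))  # [1, 1]
```

Note how the jointly best assignment `[1, 1]` overrides the unary-best choice at joint 0 (candidate 0): that override is exactly the "interaction between joints" that heatmap-only prediction lacks.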

Posted Content•
Yang Li1, Boxun Fu1, Fu Li1, Guangming Shi1, Wenming Zheng2 •
TL;DR: A transferable attention neural network (TANN) is proposed for EEG emotion recognition, which learns emotional discriminative information by adaptively highlighting transferable EEG brain-region data and samples through local and global attention mechanisms.
Abstract: Existing methods for electroencephalograph (EEG) emotion recognition train models on all EEG samples indiscriminately. However, some of the source (training) samples may exert a negative influence because they are significantly dissimilar from the target (test) samples. It is therefore necessary to give more attention to EEG samples with strong transferability rather than forcefully training a classification model on all samples. Furthermore, from the aspect of neuroscience, not all brain regions of an EEG sample contain emotional information that can be transferred to the test data effectively; some brain-region data may even have a strong negative effect on learning the emotional classification model. Considering these two issues, in this paper we propose a transferable attention neural network (TANN) for EEG emotion recognition, which learns emotional discriminative information by adaptively highlighting transferable EEG brain-region data and samples through local and global attention mechanisms. This is implemented by measuring the outputs of multiple brain-region-level discriminators and one single sample-level discriminator. We conduct extensive experiments on three public EEG emotion datasets. The results validate that the proposed model achieves state-of-the-art performance.
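Attention-based highlighting of brain regions amounts to softmax-weighting each region's contribution by how transferable a discriminator judges it to be. The sketch below shows only that generic weighting step with scalar stand-in features; the region names, scores, and pooling are illustrative and not TANN's actual architecture:

```python
import math

def attention_pool(region_scores):
    """Softmax attention over brain-region scores: regions judged more
    transferable by their discriminator get larger weights in the pooled
    representation. `region_scores` maps region name ->
    (transferability_logit, scalar_feature). Returns (weights, pooled).
    """
    names = list(region_scores)
    logits = [region_scores[n][0] for n in names]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    weights = {n: e / z for n, e in zip(names, exps)}
    pooled = sum(weights[n] * region_scores[n][1] for n in names)
    return weights, pooled

regions = {"frontal": (2.0, 1.0), "occipital": (0.0, -1.0)}
weights, pooled = attention_pool(regions)
print(round(weights["frontal"], 3), round(pooled, 3))  # 0.881 0.762
```

The highly transferable "frontal" region dominates the pooled feature, while the weakly transferable one is down-weighted rather than discarded outright.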

Book Chapter•DOI•
14 Jun 2020
TL;DR: In this article, a series of subjective experiments is constructed to investigate the visual impact of video compression by the H.264/AVC standard on 3D depth perception. In particular, different frequency components of the compressed videos were extracted to examine their impact on depth perception.
Abstract: Compared to two single-view videos, stereoscopic three-dimensional (S3D) videos provide one most significant feature and major difference, i.e., depth perception. However, the compression, transmission, and storage of 3D videos will inevitably introduce spatiotemporal and stereoscopic distortions, which may cause loss and/or variations of depth perception, resulting in visual discomfort to viewers. Nevertheless, it remains poorly understood how these distortions affect depth perception and how the human visual system (HVS) perceives such loss and variations of depth perception in compressed 3D videos. In this paper, a series of subjective experiments has been conducted to investigate the visual impact of video compression by the H.264/AVC standard on 3D depth perception. In particular, different frequency components of the compressed videos were extracted to examine their impact on depth perception. The subjective experiments reveal that the degradation of video quality as a result of compression causes the loss and reduction of 3D depth perception. Moreover, the subjective data show that the HVS response in depth perception varies with different frequency components of 3D videos, which may bring about a better understanding of human stereoscopic vision, as well as the coding and quality assessment of 3D videos.

Proceedings Article•DOI•
03 Jan 2020
TL;DR: The proposed spatial-temporal VCS network achieves better visual quality with less recovery time than the state-of-the-art, and the refined perceptual loss guides the spatial-temporal network to retain more textures and structures.
Abstract: Deep neural networks have recently been applied to the video compressive sensing (VCS) task. Existing DNN-based VCS methods compress and reconstruct the scene video only in the space or time dimension, which ignores the spatial-temporal correlation of the video. Moreover, they generally use a pixel-wise loss as the loss function, which causes the results to be over-smoothed. In this paper, we propose a perceptual spatial-temporal VCS network. The spatial-temporal VCS network, which compresses and recovers the video in both space and time dimensions, preserves the spatial-temporal correlation of the video. Besides, we refine the perceptual loss by selecting specific feature-wise loss terms and adding a pixel-wise loss term. The refined perceptual loss guides the spatial-temporal network to retain more textures and structures. Experimental results show that the proposed method achieves better visual quality with less recovery time than the state-of-the-art.
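The "refined perceptual loss" structure (a weighted sum of selected feature-wise terms plus one pixel-wise term) can be sketched with 1-D signals and toy feature extractors standing in for pretrained CNN layers. Everything below is illustrative: the extractors, weights, and signals are ours, not the paper's configuration:

```python
def refined_perceptual_loss(pred, target, feature_fns, weights, pixel_weight=1.0):
    """Weighted sum of selected feature-wise MSE terms plus a pixel-wise
    MSE term. `feature_fns` are stand-in feature extractors; in the real
    method these would be specific layers of a pretrained network.
    """
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    loss = pixel_weight * mse(pred, target)          # pixel-wise term
    for fn, w in zip(feature_fns, weights):          # selected feature terms
        loss += w * mse(fn(pred), fn(target))
    return loss

# Toy "features": local differences (edge-like) and local sums (smooth-like).
edges = lambda v: [v[i + 1] - v[i] for i in range(len(v) - 1)]
sums = lambda v: [v[i + 1] + v[i] for i in range(len(v) - 1)]

pred, target = [0.0, 0.5, 1.0], [0.0, 1.0, 1.0]
print(refined_perceptual_loss(pred, target, [edges, sums], [0.5, 0.5]))
```

The pixel-wise term anchors overall fidelity, while the feature-wise terms penalize differences in structure (here, local gradients) that a pure pixel loss smooths away.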