
Showing papers by "Yuan Xie" published in 2021


Proceedings ArticleDOI
01 Jun 2021
TL;DR: Wu et al. propose a contrastive regularization (CR) built upon contrastive learning that exploits the information of hazy images and clear images as negative and positive samples, respectively.
Abstract: Single image dehazing is a challenging ill-posed problem due to the severe information degeneration. However, existing deep learning based dehazing methods only adopt clear images as positive samples to guide the training of the dehazing network, while negative information is unexploited. Moreover, most of them focus on strengthening the dehazing network with an increase of depth and width, leading to a significant requirement of computation and memory. In this paper, we propose a novel contrastive regularization (CR) built upon contrastive learning to exploit both the information of hazy images and clear images as negative and positive samples, respectively. CR ensures that the restored image is pulled closer to the clear image and pushed far away from the hazy image in the representation space. Furthermore, considering the trade-off between performance and memory storage, we develop a compact dehazing network based on an autoencoder-like (AE) framework. It involves an adaptive mixup operation and a dynamic feature enhancement module, which benefit from preserving information flow adaptively and expanding the receptive field to improve the network’s transformation capability, respectively. We term our dehazing network with autoencoder and contrastive regularization AECR-Net. Extensive experiments on synthetic and real-world datasets demonstrate that our AECR-Net surpasses the state-of-the-art approaches. The code is released at https://github.com/GlassyWu/AECR-Net.
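
As an illustration of the contrastive regularization described above, here is a minimal sketch assuming a fixed feature extractor `phi` (e.g., a frozen VGG) stands in for the paper's representation space; the names and loss weighting are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def contrastive_regularization(phi, restored, clear, hazy, eps=1e-7):
    """Pull the restored image toward the clear positive and push it away
    from the hazy negative in the feature space defined by `phi`."""
    f_r, f_p, f_n = phi(restored), phi(clear), phi(hazy)
    d_pos = F.l1_loss(f_r, f_p)   # distance to the positive (clear) sample
    d_neg = F.l1_loss(f_r, f_n)   # distance to the negative (hazy) sample
    return d_pos / (d_neg + eps)  # minimizing shrinks d_pos and grows d_neg

# a training step might combine it with a reconstruction term, e.g.
# loss = F.l1_loss(restored, clear) + beta * contrastive_regularization(vgg_feat, restored, clear, hazy)
```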

311 citations


Proceedings ArticleDOI
20 Jun 2021
TL;DR: A new strategy, Variational Self-Distillation (VSD), is presented, which provides a scalable, flexible, and analytic solution that essentially fits the mutual information without explicitly estimating it, and highlights the need to rethink how mutual information is estimated.
Abstract: The Information Bottleneck (IB) provides an information theoretic principle for representation learning, by retaining all information relevant for predicting the label while minimizing the redundancy. Though the IB principle has been applied to a wide range of applications, its optimization remains a challenging problem which heavily relies on the accurate estimation of mutual information. In this paper, we present a new strategy, Variational Self-Distillation (VSD), which provides a scalable, flexible and analytic solution to essentially fitting the mutual information but without explicitly estimating it. Under a rigorous theoretical guarantee, VSD enables the IB to grasp the intrinsic correlation between representation and label for supervised training. Furthermore, by extending VSD to multi-view learning, we introduce two other strategies, Variational Cross-Distillation (VCD) and Variational Mutual-Learning (VML), which significantly improve the robustness of representation to view-changes by eliminating view-specific and task-irrelevant information. To verify our theoretically grounded strategies, we apply our approaches to cross-modal person Re-ID and conduct extensive experiments, where superior performance against state-of-the-art methods is demonstrated. Our intriguing findings highlight the need to rethink the way to estimate mutual information.

58 citations


Journal ArticleDOI
TL;DR: This is the first work to model multi-view clustering in a deep joint framework, providing meaningful insight into unsupervised multi-view learning.
Abstract: In this paper, a novel Deep Multi-view Joint Clustering (DMJC) framework is proposed, where multiple deep embedded features, a multi-view fusion mechanism, and clustering assignments can be learned simultaneously. Through the joint learning strategy, clustering-friendly multi-view features and useful multi-view complementary information can be exploited effectively to improve the clustering performance. Under the proposed joint learning framework, we design two ingenious variants of deep multi-view joint clustering models, whose multi-view fusion is implemented by two kinds of simple yet effective schemes. The first model, called DMJC-S, performs multi-view fusion in an implicit way via a novel multi-view soft assignment distribution. The second model, termed DMJC-T, defines a novel multi-view auxiliary target distribution to conduct the multi-view fusion explicitly. Both DMJC-S and DMJC-T are optimized under a KL divergence objective. Experiments on eight challenging image datasets demonstrate the superiority of both DMJC-S and DMJC-T over single/multi-view baselines and the state-of-the-art multi-view clustering methods, which proves the effectiveness of the proposed DMJC framework. To the best of our knowledge, this is the first work to model multi-view clustering in a deep joint framework, providing meaningful insight into unsupervised multi-view learning.
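
The KL-divergence objective mentioned above builds on DEC-style soft assignments and a sharpened auxiliary target distribution; the sketch below shows those building blocks, with the per-view fusion weights as an illustrative assumption rather than the exact DMJC-S/DMJC-T schemes.

```python
import torch
import torch.nn.functional as F

def soft_assignment(z, centers, alpha=1.0):
    """Student's-t soft assignment of embedded points z (n, d) to cluster centers (k, d)."""
    dist2 = torch.cdist(z, centers).pow(2)
    q = (1.0 + dist2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened auxiliary target that emphasizes confident assignments."""
    w = q.pow(2) / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

def joint_clustering_loss(view_embeddings, centers, view_weights):
    """Fuse per-view soft assignments (weights assumed to sum to 1) and
    minimize KL(P || Q) between the target and the fused assignment."""
    q = sum(w * soft_assignment(z, c)
            for w, (z, c) in zip(view_weights, zip(view_embeddings, centers)))
    p = target_distribution(q).detach()
    return F.kl_div(q.log(), p, reduction="batchmean")
```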

45 citations


Journal ArticleDOI
TL;DR: Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment, and introduces bidirectional speculation and runtime reconfiguration techniques into the architecture.
Abstract: When deploying deep neural networks (DNNs) onto deep learning processors, we usually exploit mixed-precision quantization and voltage–frequency scaling to make tradeoffs among accuracy, latency, and energy. Conventional methods usually determine the quantization–voltage–frequency (QVF) policy before DNNs are deployed onto local devices. However, it is difficult for them to make optimal customizations for local user scenarios. In this article, we solve the problem by enabling on-device QVF tuning with a new deep learning processor architecture, Evolver. Evolver has a QVF tuning mode to deploy DNNs with local customizations before normal execution. In this mode, Evolver uses reinforcement learning to search for the optimal QVF policy based on direct hardware feedback from the chip itself. After that, Evolver runs the newly quantized DNN inference under the searched voltage and frequency. To improve the performance and energy efficiency of both training and inference, we introduce bidirectional speculation and runtime reconfiguration techniques into the architecture. To the best of our knowledge, Evolver is the first deep learning processor that utilizes on-device QVF tuning to achieve both customized and optimal DNN deployment.
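
A hedged sketch of on-device QVF search: the paper uses a reinforcement-learning agent driven by direct hardware feedback, while the loop below substitutes a simple random search over a hypothetical QVF space; the `measure_on_chip` callback and the reward weighting are assumptions.

```python
import random

# Hypothetical design space; a real processor exposes its own choices.
BITWIDTHS = [4, 6, 8, 16]                         # per-layer quantization candidates
VF_PAIRS = [(0.6, 200), (0.8, 400), (1.0, 600)]   # (voltage V, frequency MHz)

def search_qvf(num_layers, measure_on_chip, episodes=100, lam=0.1):
    """Random-search stand-in for the RL agent: sample a QVF policy, score it
    with direct hardware feedback, and keep the best one."""
    best_policy, best_reward = None, float("-inf")
    for _ in range(episodes):
        policy = {
            "bits": [random.choice(BITWIDTHS) for _ in range(num_layers)],
            "vf": random.choice(VF_PAIRS),
        }
        # measure_on_chip returns (accuracy, latency_ms, energy_mj) measured on the chip
        acc, latency, energy = measure_on_chip(policy)
        reward = acc - lam * (latency + energy)   # illustrative trade-off
        if reward > best_reward:
            best_policy, best_reward = policy, reward
    return best_policy
```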

42 citations


Journal ArticleDOI
TL;DR: A kernelized version of tensor-based multiview subspace clustering, referred to as Kt-SVD-MSC, is proposed to jointly learn self-representation coefficients in mapped high-dimensional spaces and the correlation among multiple views in a unified tensor space.
Abstract: In this article, we propose a multiview self-representation model for nonlinear subspace clustering. By assuming that the heterogeneous features lie within the union of multiple linear subspaces, recent multiview subspace learning methods aim to capture the complementary and consensus information from multiple views to boost the performance. However, in real-world applications, data features usually reside in multiple nonlinear subspaces, leading to undesirable results. To this end, we propose a kernelized version of tensor-based multiview subspace clustering, referred to as Kt-SVD-MSC, to jointly learn self-representation coefficients in mapped high-dimensional spaces and the correlation among multiple views in a unified tensor space. In each view-specific feature space, a kernel-induced mapping is introduced to ensure the separability of self-representation coefficients. In the unified tensor space, a new kind of tensor low-rank regularizer is employed on the rotated self-representation coefficient tensor to preserve the global consistency across different views. We also derive an algorithm to efficiently solve the optimization problem, with all the subproblems having closed-form solutions. Furthermore, by incorporating nonnegative and sparsity constraints, the proposed method can be easily extended to several useful variants in a similar way. The proposed method is evaluated with extensive experiments on eight challenging datasets, in which a significant (even a breakthrough) advance over state-of-the-art multiview clustering is achieved.
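
Two ingredients named in the abstract can be sketched concretely: a kernel-induced similarity per view and a t-SVD-style tensor nuclear norm on the stacked coefficient tensor; the exact rotation and optimization steps of Kt-SVD-MSC are not reproduced here.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Kernel-induced similarity for one view; X is (n_samples, n_features)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def tensor_nuclear_norm(C):
    """t-SVD-based tensor nuclear norm of a stacked coefficient tensor C of
    shape (n, n, n_views): FFT along the view mode, then average the nuclear
    norms of the frontal slices in the Fourier domain."""
    Cf = np.fft.fft(C, axis=2)
    n_views = C.shape[2]
    return sum(np.linalg.svd(Cf[:, :, v], compute_uv=False).sum()
               for v in range(n_views)) / n_views
```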

39 citations


Journal ArticleDOI
TL;DR: This paper presents a salient instance segmentation method that produces a saliency map with distinct object instance labels for an input image, achieving satisfactory performance on six public benchmarks for salient region detection as well as on a new dataset for salient instance segmentation.

39 citations


Journal ArticleDOI
TL;DR: This work proposes a lightweight graph reordering methodology, incorporated with a GCN accelerator architecture that is equipped with a customized cache design to fully utilize graph-level data reuse, and proposes a mapping methodology aware of data reuse and task-level parallelism to handle various graph inputs effectively.
Abstract: Graph convolutional networks (GCNs) emerge as a promising direction to learn inductive representations of graph data, which are commonly used in widespread applications such as E-commerce, social networks, and knowledge graphs. However, learning from graphs is non-trivial because of its mixed computation model involving both graph analytics and neural network computing. To this end, we decompose GCN learning into two hierarchical paradigms: graph-level and node-level computing. Such a hierarchical paradigm facilitates software and hardware accelerations for GCN learning. We propose a lightweight graph reordering methodology, incorporated with a GCN accelerator architecture that is equipped with a customized cache design to fully utilize graph-level data reuse. We also propose a mapping methodology aware of data reuse and task-level parallelism to handle various graph inputs effectively. Results show that our Rubik accelerator design improves energy efficiency by 26.3x to 1375.2x over GPU platforms across different datasets and GCN models.
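
A minimal sketch of the reordering idea, assuming a simple degree-based renumbering that clusters heavily reused (high-degree) vertices so a small on-chip cache can capture graph-level reuse; this is an illustration of the concept, not necessarily the exact Rubik scheme.

```python
import numpy as np

def degree_reorder(edges, num_nodes):
    """Renumber vertices so high-degree (heavily reused) vertices get
    contiguous low IDs, improving the hit rate of a small cache that holds
    their features during neighbor aggregation."""
    degree = np.zeros(num_nodes, dtype=np.int64)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    order = np.argsort(-degree)                 # high degree first
    new_id = np.empty(num_nodes, dtype=np.int64)
    new_id[order] = np.arange(num_nodes)
    remapped = [(new_id[s], new_id[d]) for s, d in edges]
    return remapped, new_id
```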

34 citations


Proceedings ArticleDOI
01 Jun 2021
TL;DR: In this article, the authors proposed a gradual receptive field component reasoning (RFCR) method, where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder.
Abstract: Hidden features in neural networks usually fail to learn informative representations for 3D segmentation, as supervision is only given on the output prediction, while this can be solved by omni-scale supervision on intermediate layers. In this paper, we bring the first omni-scale supervision method to point cloud segmentation via the proposed gradual Receptive Field Component Reasoning (RFCR), where target Receptive Field Component Codes (RFCCs) are designed to record categories within receptive fields for hidden units in the encoder. Then, target RFCCs supervise the decoder to gradually infer the RFCCs in a coarse-to-fine category reasoning manner, and finally obtain the semantic labels. Because many hidden features are inactive with tiny magnitude and make minor contributions to RFCC prediction, we propose Feature Densification with a centrifugal potential to obtain more unambiguous features, which is in effect equivalent to entropy regularization over features. More active features can further unleash the potential of our omni-supervision method. We embed our method into four prevailing backbones and test it on three challenging benchmarks. Our method significantly improves the backbones on all three datasets. Specifically, it brings new state-of-the-art performance for S3DIS as well as Semantic3D and ranks 1st on the ScanNet benchmark among all point-based methods. Code is publicly available at https://github.com/azuki-miho/RFCR.
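
The target RFCCs can be sketched as follows: a point's code is its one-hot label, and a coarser hidden unit's code is the logical OR of the codes of the points it covers; the grouping indices are assumed to come from the backbone's downsampling.

```python
import torch

def point_level_rfcc(labels, num_classes):
    """One-hot code per point: which category is present at the finest scale."""
    return torch.nn.functional.one_hot(labels, num_classes).float()

def pool_rfcc(child_codes, groups):
    """RFCC of a coarser hidden unit = OR over the codes of the child units
    inside its receptive field. `groups` is a (num_coarse, k) LongTensor of
    gather indices produced by the backbone's downsampling (assumed given)."""
    return child_codes[groups].max(dim=1).values

# usage sketch: codes at successive encoder scales
# c0 = point_level_rfcc(labels, num_classes)
# c1 = pool_rfcc(c0, groups_level1)
# c2 = pool_rfcc(c1, groups_level2)
```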

34 citations


Proceedings ArticleDOI
01 Feb 2021
TL;DR: SpaceA integrates compute logic near memory banks to exploit bank-level bandwidth for sparse matrix-vector multiplication (SpMV) on PIM architectures; SpMV is an important primitive across a wide range of application domains such as scientific computing and graph analytics.
Abstract: Sparse matrix-vector multiplication (SpMV) is an important primitive across a wide range of application domains such as scientific computing and graph analytics. Due to its intrinsic memory-bound characteristics, the performance of SpMV on throughput-oriented architectures such as GPUs is bounded by the limited bandwidth between processors and memory. Processing-in-memory (PIM) architectures, made feasible by advances in 3D stacking, provide new opportunities to utilize ultra-high bandwidth by integrating compute logic into memory. In this paper, we develop an SpMV accelerator, named SpaceA, based on PIM architectures. SpaceA integrates compute logic near memory banks to exploit bank-level bandwidth. SpaceA contains both hardware and data-mapping design features to alleviate the irregular memory access patterns that hinder full utilization of high memory bandwidth. In terms of hardware design features, SpaceA consists of two unique features: (1) it utilizes the capability of outstanding memory requests to hide the memory access latency to data located in non-local memory banks; (2) it integrates Content Addressable Memory (CAM) at the bank level to exploit data reuse of the input vectors. In addition, we develop a mapping scheme that partitions the sparse matrix into different memory banks to maximize the data locality of the input vector and to achieve workload balance among the processing elements (PEs) near each bank. Overall, SpaceA together with the proposed mapping method achieves 13.54x speedup and 87.49% energy saving on average over the GPU baseline on SpMV computation. In addition to SpMV primitives, we conduct a case study on graph analytics to demonstrate the benefits of SpaceA for applications built on SpMV. Compared to Tesseract and GraphP, state-of-the-art graph accelerators, SpaceA obtains better performance due to the higher effective bandwidth provided by near-bank integration.
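
For reference, a sketch of the underlying computation and of a simplified mapping step: CSR SpMV plus a greedy row-to-bank partition that balances nonzeros; SpaceA's actual mapping additionally optimizes input-vector locality, which is omitted here.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """Reference CSR sparse matrix-vector product y = A @ x."""
    y = np.zeros(len(indptr) - 1, dtype=data.dtype)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

def partition_rows_by_nnz(indptr, num_banks):
    """Greedy mapping of rows to memory banks so the PE near each bank sees a
    similar number of nonzeros (workload balance)."""
    nnz_per_row = np.diff(indptr)
    banks = [[] for _ in range(num_banks)]
    load = np.zeros(num_banks, dtype=np.int64)
    for row in np.argsort(-nnz_per_row):      # heaviest rows first
        b = int(np.argmin(load))
        banks[b].append(int(row))
        load[b] += nnz_per_row[row]
    return banks
```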

32 citations


Proceedings ArticleDOI
22 Jun 2021
TL;DR: IronMan is an end-to-end framework for flexible and automated design space exploration (DSE), which can provide either optimized solutions under user-specified constraints or Pareto trade-offs among different objectives (e.g., resource types, area, and latency).
Abstract: Despite the great success of High-Level Synthesis (HLS) tools, we observe several unresolved challenges: 1) the high-level abstraction of programming styles in HLS conceals optimization opportunities; 2) existing HLS tools do not provide flexible trade-offs among different objectives and constraints; 3) the actual quality of the resulting RTL designs is hard to predict. To this end, we propose an end-to-end framework, IronMan. The primary goal is to enable a flexible and automated design space exploration (DSE), which can provide either optimized solutions under user-specified constraints, or Pareto trade-offs among different objectives (e.g., resource types, area, and latency). IronMan consists of three components: GPP (a highly accurate graph-neural-network-based performance predictor), RLMD (a reinforcement-learning-based DSE engine that explores the optimized resource allocation strategy), and CT (a code transformer that assists RLMD and GPP by extracting data flow graphs from original HLS C/C++). Experimental results show that: 1) GPP achieves high prediction accuracy, reducing prediction errors of HLS tools by 10.9x in resource usage and 5.7x in timing; 2) RLMD obtains optimized or Pareto solutions outperforming the genetic algorithm and simulated annealing by 12.7% and 12.9%, respectively; 3) IronMan can find optimized solutions perfectly matching various DSP constraints, with 2.54x fewer DSPs and up to 6x shorter latency than those of HLS tools. IronMan is also up to 400x faster than meta-heuristic techniques and HLS tools.
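
A small sketch of the Pareto-filtering step such a DSE can use once a predictor (GPP in the paper) has estimated area and latency for candidate configurations; the `predict_area_latency` call in the usage note is a hypothetical placeholder.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (area, latency); each candidate is a
    tuple (config, area, latency) where smaller is better for both metrics."""
    front = []
    for cfg, area, lat in candidates:
        dominated = any(a <= area and l <= lat and (a < area or l < lat)
                        for _, a, l in candidates)
        if not dominated:
            front.append((cfg, area, lat))
    return front

# usage sketch with a hypothetical predictor:
# candidates = [(cfg, *predict_area_latency(cfg)) for cfg in design_space]
# trade_offs = pareto_front(candidates)
```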

31 citations


Proceedings ArticleDOI
17 Feb 2021
TL;DR: EGEMM-TC employs an extendable workflow of hardware profiling and operation design to generate a lightweight extended-precision emulation algorithm on Tensor Cores, including highly-efficient tensorization to exploit the Tensor Core memory architecture and instruction-level optimizations to coordinate the emulation computation and memory access.
Abstract: Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision requirements. In this paper, we build Emulated GEMM on Tensor Cores (EGEMM-TC) to extend the usage of Tensor Cores to accelerate scientific computing applications without compromising the precision requirements. First, EGEMM-TC employs an extendable workflow of hardware profiling and operation design to generate a lightweight emulation algorithm on Tensor Cores with extended-precision. Second, EGEMM-TC exploits a set of Tensor Core kernel optimizations to achieve high performance, including the highly-efficient tensorization to exploit the Tensor Core memory architecture and the instruction-level optimizations to coordinate the emulation computation and memory access. Third, EGEMM-TC incorporates a hardware-aware analytic model to offer large flexibility for automatic performance tuning across various scientific computing workloads and input datasets. Extensive evaluations show that EGEMM-TC can achieve on average 3.13× and 11.18× speedup over the cuBLAS kernels and the CUDA-SDK kernels on CUDA Cores, respectively. Our case study on several scientific computing applications further confirms that EGEMM-TC can generalize the usage of Tensor Cores and achieve about 1.8× speedup compared to the hand-tuned, highly-optimized implementations running on CUDA Cores.
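
This kind of emulation is commonly built on operand splitting: each fp32 operand is represented as a sum of two fp16 terms and the cross products are accumulated in fp32. Whether EGEMM-TC uses exactly this decomposition is an assumption, and the Tensor Core passes are mimicked with numpy below.

```python
import numpy as np

def split_fp32_to_fp16(A):
    """A ~= hi + lo, with both parts representable in fp16."""
    hi = A.astype(np.float16)
    lo = (A - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def emulated_gemm(A, B):
    """Extended-precision GEMM from half-precision operands with fp32
    accumulation; each hi/lo product stands in for one Tensor Core pass."""
    a_hi, a_lo = split_fp32_to_fp16(A)
    b_hi, b_lo = split_fp32_to_fp16(B)
    f32 = np.float32
    return (a_hi.astype(f32) @ b_hi.astype(f32)
            + a_hi.astype(f32) @ b_lo.astype(f32)
            + a_lo.astype(f32) @ b_hi.astype(f32))  # lo*lo term is negligible

# usage sketch:
# A = np.random.rand(64, 64).astype(np.float32); B = np.random.rand(64, 64).astype(np.float32)
# emulated_gemm(A, B) is much closer to A @ B than a plain fp16-rounded product
```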

Journal ArticleDOI
TL;DR: Zhang et al. propose a Part-Guided Graph Convolution Network (PGCN) for person Re-ID, which simultaneously learns inter-local and intra-local relationships for feature representations.

Proceedings ArticleDOI
01 Jun 2021
TL;DR: The authors propose a novel attention scheme which projects the image and text embeddings into a common space and optimises the attention weights directly towards the evaluation metrics; the proposed scheme can be considered a kind of supervised attention while requiring no additional annotations.
Abstract: Image-text matching is an important multi-modal task with massive applications. It tries to match the image and the text with similar semantic information. Existing approaches do not explicitly transform the different modalities into a common space. Meanwhile, the attention mechanism which is widely used in image-text matching models does not have supervision. We propose a novel attention scheme which projects the image and text embeddings into a common space and optimises the attention weights directly towards the evaluation metrics. The proposed attention scheme can be considered a kind of supervised attention and requires no additional annotations. It is trained via a novel discrete-continuous action space policy gradient algorithm, which is more effective in modelling complex action spaces than previous continuous action space policy gradient methods. We evaluate the proposed methods on two widely-used benchmark datasets, Flickr30k and MS-COCO, outperforming the previous approaches by a large margin.

Proceedings ArticleDOI
09 Aug 2021
TL;DR: In this article, the authors propose a Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve generalization.
Abstract: Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness for unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process, and endeavor to extract a common feature space to improve the generalization. However, due to complex and biased data distributions, directly treating them equally will corrupt the generalization ability. To settle this issue, we propose a novel Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve the generalization. Concretely, a Sample Reweighting Module is first proposed to identify samples with relatively large domain bias and reduce their impact on the overall optimization. Afterwards, a Feature Reweighting Module is introduced to focus on these samples and extract more domain-irrelevant features via a self-distilling mechanism. Combined with the domain discriminator, the iteration of the two modules promotes the extraction of generalized features. Extensive experiments and visualizations are presented to demonstrate the effectiveness and interpretability of our method against the state-of-the-art competitors.


Journal ArticleDOI
TL;DR: DLUX, a high-performance and energy-efficient 3D-PIM accelerator for DNN training using a near-bank architecture, is proposed; a small scratchpad buffer together with a lightweight transformation engine exploits locality and enables a flexible data layout without an expensive cache.
Abstract: The frequent data movement between the processor and the memory has become a severe performance bottleneck for deep neural network (DNN) training workloads in data centers. To solve this off-chip memory access challenge, the 3-D stacking processing-in-memory (3D-PIM) architecture provides a viable solution. However, existing 3D-PIM designs for DNN training suffer from the limited memory bandwidth in the base logic die. To overcome this obstacle, integrating the DNN-related logic near each memory bank becomes a promising yet challenging solution, since naively implementing the floating-point (FP) unit and the cache in the memory die incurs a large area overhead. To address these problems, we propose DLUX, a high-performance and energy-efficient 3D-PIM accelerator for DNN training using the near-bank architecture. From the hardware perspective, to support the FP multiplier with low area overhead, an in-DRAM lookup table (LUT) mechanism is invented. Then, we propose to use a small scratchpad buffer together with a lightweight transformation engine to exploit the locality and enable a flexible data layout without the expensive cache. From the software aspect, we split the mapping/scheduling tasks during DNN training into intralayer and interlayer phases. During the intralayer phase, to maximize data reuse in the LUT buffer and the scratchpad buffer, achieve high concurrency, and reduce data movement among banks, a 3D-PIM customized loop tiling technique is adopted. During the interlayer phase, efficient techniques are invented to ensure input–output data layout consistency and realize the forward–backward layout transposition. Experimental results show that DLUX can reduce the FP32 multiplier area overhead by 60% against the direct implementation. Compared with a Tesla V100 GPU, end-to-end evaluations show that DLUX can provide on average 6.3x speedup and 42x energy efficiency improvement.

Journal ArticleDOI
TL;DR: BBR and M1 may function as new, natural, and intestinal-specific FXR agonists with a potential clinical application to treat hyperglycemia and obesity.
Abstract: Our previous study suggests that berberine (BBR) lowers lipids by modulating bile acids and activating intestinal farnesoid X receptor (FXR). However, to what extent this pathway contributes to the hypoglycemic effect of BBR has not been determined. In this study, the glucose-lowering effects of BBR and its primary metabolites, berberrubine (M1) and demethyleneberberine, in a high-fat diet–induced obese mouse model were studied, and their modulation of the global metabolic profile of mouse livers and systemic bile acids was determined. The results revealed that BBR (150 mg/kg) and M1 (50 mg/kg) decreased mouse serum glucose levels by 23.15% and 48.14%, respectively. Both BBR and M1 markedly modulated the hepatic expression of genes involved in gluconeogenesis and metabolism of amino acids, fatty acids, and purine. BBR showed a stronger modulatory effect on systemic bile acids than its metabolites. Moreover, molecular docking and gene expression analysis in vivo and in vitro suggest that BBR and M1 are FXR agonists. The mRNA levels of gluconeogenesis genes in the liver, glucose-6-phosphatase and phosphoenolpyruvate carboxykinase, were significantly decreased by BBR and M1. In summary, BBR and M1 modulate systemic bile acids and activate the intestinal FXR signaling pathway, which reduces hepatic gluconeogenesis by inhibiting the gene expression of gluconeogenesis genes, achieving a hypoglycemic effect. BBR and M1 may function as new, natural, and intestinal-specific FXR agonists with a potential clinical application to treat hyperglycemia and obesity. SIGNIFICANCE STATEMENT This investigation revealed that BBR and its metabolite, berberrubine, significantly lowered blood glucose, mainly through activating intestinal farnesoid X receptor signaling pathway, either directly by themselves or indirectly by modulating the composition of systemic bile acids, thus inhibiting the expression of gluconeogenic genes in the liver and, finally, reducing hepatic gluconeogenesis and lowering blood glucose. The results will help elucidate the mechanism of BBR and provide a reference for mechanism interpretation of other natural products with low bioavailability.

Journal ArticleDOI
TL;DR: A memory Trojan methodology is proposed that implants malicious logic merely into the memory controllers of DNN systems, without the necessity of toolchain manipulation or access to the victim model, and thus is feasible for practical use.
Abstract: Deep neural network (DNN) accelerators are widely deployed in computer vision, speech recognition, and machine translation applications, in which attacks on DNNs have become a growing concern. This article focuses on exploring the implications of hardware Trojan attacks on DNNs. Trojans are one of the most challenging threat models in hardware security, where adversaries insert malicious modifications into the original integrated circuits (ICs), leading to malfunction once triggered. Such attacks can be conducted by adversaries because modern ICs commonly include third-party intellectual property (IP) blocks. Previous studies design hardware Trojans to attack DNNs with the assumption that adversaries have full knowledge or manipulation of the DNN system’s victim model and toolchain in addition to the hardware platforms, yet such a threat model is strict, limiting their practical adoption. In this article, we propose a memory Trojan methodology that implants the malicious logic merely into the memory controllers of DNN systems, without the necessity of toolchain manipulation or access to the victim model, and thus is feasible for practical use. Specifically, we locate the input image data among the massive volume of memory traffic based on memory access patterns and propose a Trojan trigger mechanism based on detecting geometric features in input images. Extensive experiments show that the proposed trigger mechanism is effective even in the presence of environmental noise and preprocessing operations. Furthermore, we design and implement the payload and verify that the proposed Trojan technique can effectively conduct both untargeted and targeted attacks on DNNs.

Journal ArticleDOI
TL;DR: This paper implicitly maps the raw time series space into multiple kernel spaces via elastic distance measure functions and resorts to a tensor-constraint-based self-representation subspace clustering approach to explore the essential low-dimensional structure of the data, as well as the high-order complementary information from different elastic kernels.
Abstract: Time series clustering has attracted growing attention due to the abundant data accessible and its extensive value in various applications. The unique characteristics of time series, including high dimensionality, warping, and the integration of multiple elastic measures, pose challenges for present clustering algorithms, most of which take into account only part of these difficulties. In this paper, we make an effort to simultaneously address all the aforementioned issues in time series clustering under a unified multiple kernels clustering (MKC) framework. Specifically, we first implicitly map the raw time series space into multiple kernel spaces via elastic distance measure functions. In such high-dimensional spaces, we resort to a tensor-constraint-based self-representation subspace clustering approach, which involves the self-paced learning paradigm, to explore the essential low-dimensional structure of the data, as well as the high-order complementary information from different elastic kernels. The proposed approach can be extended to the more challenging multivariate time series clustering scenario in a direct but elegant way. Extensive experiments on 85 univariate and 10 multivariate time series datasets demonstrate the significant superiority of the proposed approach beyond the baseline and several state-of-the-art MKC methods.
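
The first step described in the abstract, turning elastic distances into multiple kernel matrices, can be sketched as below with DTW as one example elastic measure; the exponential conversion and the bandwidths are illustrative choices.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def elastic_kernels(series, sigmas=(0.5, 1.0, 2.0)):
    """Build one kernel matrix per bandwidth from pairwise DTW distances;
    other elastic measures (e.g., ERP, TWE) would contribute further kernels."""
    n = len(series)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(series[i], series[j])
    return [np.exp(-dist ** 2 / (2.0 * s ** 2)) for s in sigmas]
```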

Proceedings ArticleDOI
07 Sep 2021
TL;DR: In this paper, the authors point out that the quantum software and hardware systems should be designed collaboratively to fully exploit the potential of quantum computing, and discuss some potential future directions following the co-design principle.
Abstract: A quantum computing system naturally consists of two components: the software system and the hardware system. Quantum applications are programmed using the quantum software and then executed on the quantum hardware. However, the performance of existing quantum computing systems is still limited, and solving a practical problem that is beyond the capability of classical computers on a quantum computer has not yet been demonstrated. In this review, we point out that the quantum software and hardware systems should be designed collaboratively to fully exploit the potential of quantum computing. We first review three related works, including a hardware-aware quantum compiler optimization, an application-aware quantum hardware architecture design flow, and a co-design approach for the emerging quantum computational chemistry. Then we discuss some potential future directions following the co-design principle.

Proceedings Article
18 May 2021
TL;DR: Zhang et al. propose a Boundary Prediction Module (BPM) to predict boundary points and a boundary-aware Geometric Encoding Module (GEM) to encode geometric information and aggregate features with discrimination in a neighborhood, so that local features belonging to different categories will not be polluted by each other.
Abstract: Boundary information plays a significant role in 2D image segmentation, while it is usually ignored in 3D point cloud segmentation, where ambiguous features might be generated during feature extraction, leading to misclassification in the transition area between two objects. In this paper, we first propose a Boundary Prediction Module (BPM) to predict boundary points. Based on the predicted boundary, a boundary-aware Geometric Encoding Module (GEM) is designed to encode geometric information and aggregate features with discrimination in a neighborhood, so that local features belonging to different categories will not be polluted by each other. To provide extra geometric information for the boundary-aware GEM, we also propose a lightweight Geometric Convolution Operation (GCO), making the extracted features more discriminative. Built upon the boundary-aware GEM, we construct our network and test it on benchmarks including ScanNet v2 and S3DIS. Results show our method can significantly improve the baseline and achieve state-of-the-art performance.
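
Ground truth to train a module like the BPM can be derived as sketched below: a point is marked as a boundary point if its k nearest neighbors contain a different semantic label; the k-NN rule and the brute-force search are illustrative assumptions.

```python
import numpy as np

def boundary_labels(points, labels, k=16):
    """Mark a point as boundary (1) if any of its k nearest neighbors carries
    a different semantic label; brute-force k-NN for clarity."""
    n = len(points)
    is_boundary = np.zeros(n, dtype=np.int64)
    for i in range(n):
        d2 = np.sum((points - points[i]) ** 2, axis=1)
        neighbors = np.argsort(d2)[1:k + 1]        # skip the point itself
        is_boundary[i] = int(np.any(labels[neighbors] != labels[i]))
    return is_boundary
```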

Proceedings ArticleDOI
Zhaodong Chen, Zheng Qu, Liu Liu, Yufei Ding, Yuan Xie
14 Nov 2021
TL;DR: In this article, column-vector-sparse-encoding was proposed to exploit reduced precision and sparsity jointly, achieving 1.71-7.19x speedup over cuSPARSE.
Abstract: The success of DNNs comes at the expense of excessive memory/computation cost, which can be addressed by exploiting reduced precision and sparsity jointly. Existing sparse GPU kernels, however, fail to achieve practical speedup over cuBLASHgemm under half-precision. Those for fine-grained sparsity suffer from low data reuse, and others for coarse-grained sparsity are limited by the tension between kernel performance and model quality under different grain sizes. We propose column-vector-sparse-encoding, which has a smaller grain size under the same reuse rate compared with block sparsity. Column-vector-sparse-encoding can be applied to both SpMM and SDDMM, two major sparse DNN operations. We also introduce the Tensor-Core-based 1D Octet Tiling that has efficient memory access and computation patterns under small grain size. Based on these, we design SpMM and SDDMM kernels and achieve 1.71-7.19x speedup over cuSPARSE. Practical speedup is achieved over cuBLASHgemm under >70% and >90% sparsity with 4x1 grain size and half-precision.
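
A hedged illustration of the encoding idea: the matrix is tiled into v x 1 column vectors (4x1 below) and only vectors containing at least one nonzero are stored with their column index; the exact field layout used by the kernels is an assumption for illustration.

```python
import numpy as np

def encode_column_vectors(A, v=4):
    """Split A (row count divisible by v) into v x 1 column vectors and keep
    the nonzero ones as (row_block, col, values): a coarser grain than
    element-wise sparsity, a finer one than 2-D blocks."""
    rows, cols = A.shape
    assert rows % v == 0
    blocks = []
    for rb in range(rows // v):
        tile = A[rb * v:(rb + 1) * v, :]
        for c in range(cols):
            vec = tile[:, c]
            if np.any(vec != 0):
                blocks.append((rb, c, vec.copy()))
    return blocks

def spmm_from_blocks(blocks, B, out_rows, v=4):
    """Reference SpMM C = A @ B reconstructed from the encoded column vectors."""
    C = np.zeros((out_rows, B.shape[1]), dtype=B.dtype)
    for rb, c, vec in blocks:
        C[rb * v:(rb + 1) * v, :] += np.outer(vec, B[c, :])
    return C
```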

Journal ArticleDOI
TL;DR: P-RT is the first robotic vision runtime framework to efficiently manage dynamic task executions on mobile systems with multiple accelerators as well as on the cloud, to achieve better performance and energy savings.
Abstract: We propose P-RT, the first robotic vision runtime framework to efficiently manage dynamic task executions on mobile systems with multiple accelerators as well as on the cloud to achieve better performance and energy savings. With P-RT, we enable a robot to simultaneously perform autonomous navigation.

Journal ArticleDOI
TL;DR: Wang et al. propose a joint skinny tensor learning and latent clustering (JSTC) model, which can learn high-order skinny tensor representations and the corresponding clustering assignments simultaneously; an alternating direction minimization algorithm is carefully designed to optimize the JSTC model.
Abstract: Multiview subspace clustering (MSC) has attracted growing attention due to its extensive value in various applications, such as natural language processing, face recognition, and time-series analysis. In this article, we are devoted to addressing two crucial issues in MSC: 1) high computational cost and 2) cumbersome multistage clustering. Existing MSC approaches, including tensor singular value decomposition (t-SVD)-MSC that has achieved promising performance, generally utilize the dataset itself as the dictionary and regard the representation learning and clustering process as two separate parts, thus leading to high computational overhead and unsatisfactory clustering performance. To remedy these two issues, we propose a novel MSC model called joint skinny tensor learning and latent clustering (JSTC), which can learn high-order skinny tensor representations and the corresponding latent clustering assignments simultaneously. Through such a joint optimization strategy, the multiview complementary information and latent clustering structure can be exploited thoroughly to improve the clustering performance. An alternating direction minimization algorithm, which has low computational complexity and can be run in parallel when solving several key subproblems, is carefully designed to optimize the JSTC model. Such a nice property makes our JSTC an appealing solution for large-scale MSC problems. We conduct extensive experiments on ten popular datasets and compare our JSTC with 12 competitors. Five commonly used metrics, including four external measures (NMI, ACC, F-score, and RI) and one internal metric (SI), are adopted to evaluate the clustering quality. The experimental results with the Wilcoxon statistical test demonstrate the superiority of the proposed method in both clustering performance and operational efficiency.

Proceedings Article
18 May 2021
TL;DR: In this paper, a weakly supervised method for large-scale point cloud semantic segmentation is proposed, which uses self-supervised training to transfer prior knowledge learned from a large amount of unlabeled point cloud data to a weakly supervised network.
Abstract: Existing methods for large-scale point cloud semantic segmentation require expensive, tedious and error-prone manual point-wise annotation. Intuitively, weakly supervised training is a direct solution to reduce the labeling costs. However, for weakly supervised large-scale point cloud semantic segmentation, too few annotations will inevitably lead to ineffective learning of the network. We propose an effective weakly supervised method containing two components to solve this problem. Firstly, we construct a pretext task, i.e., point cloud colorization, with a self-supervised training manner to transfer the learned prior knowledge from a large amount of unlabeled point cloud data to a weakly supervised network. In this way, the representation capability of the weakly supervised network can be improved by knowledge from a heterogeneous task. Besides, to generate pseudo labels for unlabeled data, a sparse label propagation mechanism is proposed with the help of generated class prototypes, which are used to measure the classification confidence of unlabeled points. Our method is evaluated on large-scale point cloud datasets with different scenarios including indoor and outdoor. The experimental results show a large gain over existing weakly supervised methods and comparable results to fully supervised methods.
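
The prototype-based pseudo-labeling mentioned above can be sketched as follows: class prototypes are the mean features of the few labeled points, and an unlabeled point receives a pseudo label only when its similarity to the nearest prototype is confident enough; the cosine similarity and the threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def class_prototypes(features, labels, num_classes):
    """Prototype per class = mean feature of the (few) labeled points;
    assumes each class has at least one labeled point."""
    protos = torch.stack([features[labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def propagate_pseudo_labels(unlabeled_feats, protos, threshold=0.8):
    """Assign a pseudo label when the best cosine similarity to a prototype
    exceeds the confidence threshold; return -1 (ignored) otherwise."""
    sim = F.normalize(unlabeled_feats, dim=1) @ protos.t()   # (n, num_classes)
    conf, pseudo = sim.max(dim=1)
    pseudo[conf < threshold] = -1
    return pseudo
```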

Journal ArticleDOI
TL;DR: A methodology that alleviates BN's cost by using only a few sampled or generated data points for mean and variance estimation at each iteration, realized by two categories of approach: sampling or creating a few uncorrelated data for statistics estimation under certain strategy constraints.
Abstract: Deep neural networks (DNNs) thrive in recent years, wherein batch normalization (BN) plays an indispensable role. However, it has been observed that BN is costly due to the huge reduction and elementwise operations that are hard to execute in parallel, which heavily reduces the training speed. To address this issue, in this article, we propose a methodology to alleviate BN's cost by using only a few sampled or generated data for mean and variance estimation at each iteration. The key challenge to reach this goal is how to achieve a satisfactory balance between normalization effectiveness and execution efficiency. We identify that effectiveness expects less data correlation in sampling while efficiency expects more regular execution patterns. To this end, we design two categories of approach: sampling or creating a few uncorrelated data for statistics estimation with certain strategy constraints. The former includes “batch sampling (BS),” which randomly selects a few samples from each batch, and “feature sampling (FS),” which randomly selects a small patch from each feature map of all samples; the latter is “virtual dataset normalization (VDN),” which generates a few synthetic random samples to directly create uncorrelated data for statistics estimation. Accordingly, multiway strategies are designed to reduce the data correlation for accurate estimation and optimize the execution pattern for running acceleration in the meantime. The proposed methods are comprehensively evaluated on various DNN models, where the loss in model accuracy and convergence rate is negligible. Without the support of any specialized libraries, 1.98x BN layer acceleration and 23.2% overall training speedup can be practically achieved on modern GPUs. Furthermore, our methods demonstrate powerful performance when solving the well-known “micro-BN” problem in the case of a tiny batch size. This article provides a promising solution for the efficient training of high-performance DNNs.
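
A minimal sketch of the “batch sampling (BS)” variant, assuming a PyTorch-style NCHW layout: statistics estimated from a few randomly chosen samples normalize the entire batch; the sample count is an illustrative parameter.

```python
import torch

def batch_sampling_bn(x, gamma, beta, num_samples=4, eps=1e-5):
    """Normalize the whole batch with mean/variance estimated from only
    `num_samples` randomly chosen samples (the BS strategy); x is (N, C, H, W),
    gamma and beta are per-channel affine parameters of shape (C,)."""
    idx = torch.randperm(x.size(0))[:num_samples]
    subset = x[idx]
    mean = subset.mean(dim=(0, 2, 3), keepdim=True)
    var = subset.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma.view(1, -1, 1, 1) * x_hat + beta.view(1, -1, 1, 1)
```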

Proceedings ArticleDOI
01 Feb 2021
TL;DR: NeMeter as mentioned in this paper is an integrated power, area, and timing modeling framework for ML accelerators and enables the runtime analysis of system-level performance and efficiency when the runtime activity factors are provided.
Abstract: As Machine Learning (ML) becomes pervasive in the era of artificial intelligence, ML-specific tools and frameworks are required for architectural research. This paper introduces NeuroMeter, an integrated power, area, and timing modeling framework for ML accelerators. NeuroMeter models the detailed architecture of ML accelerators and generates fast and accurate estimates of power, area, and chip timing. Meanwhile, it also enables runtime analysis of system-level performance and efficiency when runtime activity factors are provided. NeuroMeter’s micro-architecture model includes fundamental components of ML accelerators, including systolic-array-based tensor units (TU), reduction trees (RT), and 1D vector units (VU). NeuroMeter has accurate modeling results, with average power and area estimation errors below 10% and 17%, respectively, when validated against TPU-v1, TPU-v2, and Eyeriss. Leveraging NeuroMeter’s new capabilities for architecting manycore ML accelerators, this paper presents the first in-depth study on the design space and tradeoffs of “brawny and wimpy” inference accelerators in datacenter scenarios, with insights that are otherwise difficult to discover without NeuroMeter. Our study shows that brawny designs with 64x64 systolic arrays are the most performant and efficient for inference tasks in the 28nm datacenter architectural space with a 500mm2 die area budget. Our study also reveals important tradeoffs between performance and efficiency. For datacenter accelerators with low-batch inference, a small (~16%) sacrifice of system performance (in achieved Tera OPerations per Second, aka TOPS) can lead to more than a 2x efficiency improvement (in achieved TOPS/TCO). To showcase NeuroMeter’s capability to model a wide range of diverse ML accelerator architectures, we also conduct a follow-on mini-case study on the implications of sparsity for different ML accelerators, demonstrating that wimpier accelerator architectures benefit more readily from sparsity processing despite their lower achievable raw energy efficiency.

Journal ArticleDOI
TL;DR: In this paper, an adjacency matrix-based data structure was designed to accelerate the search for the optimal contraction sequence, and an outer product pruning method with acceptable overhead was proposed to reduce the search space.
Abstract: Tensor networks and tensor computation are widely applied in scientific and engineering domains like quantum physics, electronic design automation, and machine learning. As one of the most fundamental operations for tensor networks, a tensor contraction eliminates the shared orders among tensors and produces a compact sub-network. Different contraction sequences usually yield distinct storage and compute costs, and searching for the optimal sequence is known to be a hard problem. Prior works have designed heuristic and fast algorithms to solve this problem; however, several issues still remain unsolved. For example, the data format and data structure are not efficient, the constraints during modeling are impractical, the search for the optimal solution might fail, and the search cost is very high. In this paper, we first introduce a log_k order representation and design an adjacency matrix-based data structure to efficiently accelerate the search for the optimal contraction sequence. Then, we propose an outer product pruning method with acceptable overhead to reduce the search space. Finally, we use a multithreaded optimization in our implementation to further improve the execution performance. We also present an in-depth analysis of the factors that influence the search time. This work provides a full-stack solution for optimal contraction sequence search, from the high-level data structure and search algorithm to low-level execution parallelism, and it will benefit a broad range of tensor-related applications.
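
A hedged sketch of the adjacency-matrix bookkeeping with a greedy baseline search: shared orders between tensors are tracked in a matrix, each pairwise contraction merges one tensor into another, and a greedy cost rule picks the next pair; the paper's log_k representation and optimal-search algorithm are not reproduced here.

```python
import numpy as np

def greedy_contraction_sequence(adj, dims):
    """Greedy baseline on an adjacency-matrix representation of the network.
    adj[i, j] = product of sizes of the orders shared by tensors i and j
    (1 if none; diagonal entries are 1; each order shared by at most two
    tensors). dims[i] = product of all order sizes of tensor i."""
    adj = adj.astype(float)
    dims = np.array(dims, dtype=float)
    alive = list(range(len(dims)))
    sequence = []
    while len(alive) > 1:
        # cost of contracting (i, j) = product of all distinct orders involved
        i, j = min(((a, b) for a in alive for b in alive if a < b),
                   key=lambda p: dims[p[0]] * dims[p[1]] / adj[p[0], p[1]])
        sequence.append((i, j))
        dims[i] = dims[i] * dims[j] / adj[i, j] ** 2   # shared orders eliminated
        adj[i, :] *= adj[j, :]                         # tensor i absorbs j's connections
        adj[:, i] *= adj[:, j]
        adj[i, i] = 1.0
        alive.remove(j)
    return sequence
```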

Journal ArticleDOI
TL;DR: Silybin is widely used as a hepatoprotective agent in various liver disease therapies and has previously been identified as a CYP3A inhibitor; however, little is known about the effect of silybin on CYP3A and the regulatory mechanism during high-fat-diet (HFD)-induced liver inflammation.
Abstract: Silybin is widely used as a hepatoprotective agent in various liver disease therapies and has been previously identified as a CYP3A inhibitor. However, little is known about the effect of silybin on CYP3A and the regulatory mechanism during high-fat-diet (HFD)-induced liver inflammation. In our study, we found that silybin restored CYP3A expression and activity that were decreased by HFD and conditioned medium (CM) from palmitate (PA)-treated Kupffer cells. Moreover, silybin suppressed liver inflammation in HFD-fed mice and inhibited NF-κB translocation into the nucleus through elevation of SIRT2 expression and promotion of p65 deacetylation. This effect was confirmed by overexpression of SIRT2, which suppressed p65 nuclear translocation and restored CYP3A transcription affected by CM. The hepatic NAD+ concentration markedly decreased in HFD-fed mice and CM-treated hepatocytes/HepG2 cells but increased after silybin treatment. Supplementing NMN as an NAD+ donor inhibited p65 acetylation, decreased p65 nuclear translocation, and restored cyp3a transcription in both HepG2 cells and mouse hepatocytes. These results suggest that silybin regulates metabolic enzymes during liver inflammation by a mechanism related to the increase in NAD+ and SIRT2 levels. In addition, silybin enhanced the intracellular NAD+ concentration by decreasing PARP1 expression. In summary, silybin increased NAD+ concentration, promoted SIRT2 expression and lowered p65 acetylation both in vivo and in vitro, which supported the recovery of CYP3A expression. These findings indicate that the NAD+/SIRT2 pathway plays an important role in CYP3A regulation during NAFLD. Significance Statement This research revealed the differential regulation of CYP3A by silybin under physiological and fatty liver pathological conditions. In the treatment of NAFLD, silybin restored, not inhibited, CYP3A expression and activity through the NAD+/SIRT2 pathway in accordance with its anti-inflammatory effect.

Journal ArticleDOI
TL;DR: In this article, the authors performed metabolomics analysis to characterize the metabolic patterns of sensitive and resistant A549 non-small cell lung cancer cells (A549/DTX cells) and found that the resistant cells were characterized by an altered microenvironment of redox homeostasis with reduced glutathione and elevated reactive oxygen species.
Abstract: Continuous docetaxel (DTX) treatment of non-small cell lung cancer induces development of drug resistance, but the mechanism is poorly understood. In this study we performed metabolomics analysis to characterize the metabolic patterns of sensitive and resistant A549 non-small cell lung cancer cells (A549/DTX cells). We showed that the sensitive and resistant A549 cells exhibited distinct metabolic phenotypes: the resistant cells were characterized by an altered microenvironment of redox homeostasis with reduced glutathione and elevated reactive oxygen species (ROS). DTX induction reprogrammed the metabolic phenotype of the sensitive cells, which acquired a phenotype similar to that of the resistant cells: it reduced cystine influx, inhibited glutathione biosynthesis, increased ROS and decreased glutathione/glutathione disulfide (GSH/GSSG); the genes involved in glutathione biosynthesis were dramatically depressed. Addition of the ROS-inducing agent Rosup (25, 50 μg/mL) significantly increased P-glycoprotein expression and reduced intracellular DTX in the sensitive A549 cells, which ultimately acquired a phenotype similar to that of the resistant cells. Supplementation of cystine (1.0 mM) significantly increased GSH synthesis, rebalanced the redox homeostasis of A549/DTX cells, and reversed DTX-induced upregulation of P-glycoprotein, and it markedly improved the effects of DTX and inhibited the growth of A549/DTX in vitro and in vivo. These results suggest that microenvironmental redox homeostasis plays a key role in the acquired resistance of A549 cancer cells to DTX. The enhancement of GSH synthesis by supplementary cystine is a promising strategy to reverse the resistance of tumor cells and has potential for translation in the clinic.