Author
Zongwei Zhu
Bio: Zongwei Zhu is an academic researcher from the University of Science and Technology of China. The author has contributed to research on topics including Cache and Computer science, has an h-index of 7, and has co-authored 39 publications receiving 147 citations.
Papers
TL;DR: Experimental results demonstrate that, compared with the original HDFS, PHDFS can dramatically decrease the latency when accessing small files and improve the FPS of typical deep learning models by 40%.
Abstract: For deep learning cloud computing platforms, the file system is a fundamental and critical component. The Hadoop Distributed File System (HDFS) is widely used in large-scale clusters due to its high performance and high availability. However, deep learning datasets typically consist of a huge number of small files, which imposes a severe performance penalty on HDFS. Although many optimizations have been proposed for the small-file problem, none of them takes the file correlation within deep learning datasets into consideration. To address this problem, this paper proposes Pile-HDFS (PHDFS), which is based on a new file aggregation approach. A pile is designed as the I/O unit that merges a group of small files according to their correlation. To access small files efficiently, we design a two-layer manager and add inner organization information to data blocks. Experimental results demonstrate that, compared with the original HDFS, PHDFS dramatically decreases the latency of accessing small files and improves the FPS (Frames Per Second) of typical deep learning models by 40%.
72 citations
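A minimal sketch of the pile idea described in the PHDFS abstract above: correlated small files are merged into larger pile units, and a two-level index records which pile each file lives in and at what offset. The Pile abstraction, the 128 MB capacity, and the use of a "correlation key" (e.g., a dataset subdirectory) are illustrative assumptions, not the paper's actual design.

# Illustrative sketch of pile-based aggregation of correlated small files.
from collections import defaultdict

PILE_CAPACITY = 128 * 1024 * 1024  # assumed pile size, roughly one HDFS block

def build_piles(files):
    """files: iterable of (path, size, correlation_key).
    Returns piles plus a two-layer index: path -> (pile_id, offset)."""
    piles = []                      # each pile: list of (path, offset, size)
    index = {}                      # outer layer: path -> (pile_id, offset)
    open_piles = {}                 # one partially filled pile per key
    used = defaultdict(int)         # bytes already placed in each pile

    for path, size, key in files:
        pile_id = open_piles.get(key)
        if pile_id is None or used[pile_id] + size > PILE_CAPACITY:
            piles.append([])        # start a new pile for this correlation key
            pile_id = len(piles) - 1
            open_piles[key] = pile_id
        offset = used[pile_id]
        piles[pile_id].append((path, offset, size))  # inner organization info
        index[path] = (pile_id, offset)
        used[pile_id] += size
    return piles, index

# Usage: correlated samples from the same class directory land in one pile.
piles, index = build_piles([("cat/1.jpg", 40_000, "cat"),
                            ("cat/2.jpg", 52_000, "cat"),
                            ("dog/1.jpg", 47_000, "dog")])
print(index["cat/2.jpg"])  # -> (0, 40000)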
TL;DR: A deep-learning-based fabric defect detection method for edge computing scenarios is proposed; compared with a conventional convolutional neural network (CNN), the proposed optimized model attains an average improvement of 18% in the area under the curve (AUC) metric across 11 defect types.
Abstract: As an essential step in quality control, fabric defect detection plays an important role in the textile manufacturing industry. The traditional manual detection method is inaccurate and incurs a high cost; as a result, it is gradually being replaced by deep learning algorithms based on cloud computing. However, the high data transmission latency between end devices and the cloud has a significant impact on textile production efficiency. In contrast, edge computing, which provides services near end devices by deploying network, computing and storage facilities at the edge of the Internet, can effectively solve this problem. In this article, we propose a deep-learning-based fabric defect detection method for edge computing scenarios. First, this article modifies the structure of DenseNet to better suit a resource-constrained edge computing scenario. To better train the proposed model, an optimized cross-entropy loss function is also formulated. Afterward, six feasible expansion schemes are utilized to enhance the data set according to the characteristics of the various defects in fabric samples. To balance the distribution of samples, the proportions of the various defect types are used to determine the number of enhanced samples per type. Finally, a fabric defect detection system is established to test the performance of the optimized model on edge devices in a real-world textile industry scenario. Experimental results demonstrate that, compared with a conventional convolutional neural network (CNN), the proposed optimized model attains an average improvement of 18% in the area under the curve (AUC) metric for 11 defects. Compared with a cloud platform, data transmission is reduced by approximately 50% and latency is reduced by 32% on the Cambricon 1H8 platform.
42 citations
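The abstract above mentions an optimized cross-entropy loss and the use of defect-type proportions to balance samples. A minimal sketch of one common way to realize that idea, class-frequency-weighted cross-entropy, is shown below; weighting each class by its inverse frequency is an illustrative assumption, not necessarily the paper's exact formulation.

# Sketch of a class-weighted cross-entropy loss; inverse-frequency weights
# are an illustrative assumption, not the paper's exact optimized loss.
import math

def class_weights(samples_per_class):
    total = sum(samples_per_class)
    return [total / (len(samples_per_class) * n) for n in samples_per_class]

def weighted_cross_entropy(probs, label, weights):
    """probs: predicted class probabilities; label: true class index."""
    return -weights[label] * math.log(probs[label] + 1e-12)

# Usage: a rare defect class (index 1) contributes a larger loss.
w = class_weights([900, 100])                      # 90% normal, 10% defective
print(weighted_cross_entropy([0.2, 0.8], 1, w))    # rare class, well predicted
print(weighted_cross_entropy([0.8, 0.2], 0, w))    # common class, well predicted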
TL;DR: This paper proposes a task scheduling framework for the Dynamic Partial Reconfiguration (DPR) platform that takes full account of the task-switching overhead and the predictable execution time of hardware tasks in DPR, and reduces the number of task switches and active tasks in the system, thus improving scheduling efficiency.
Abstract: Real-time performance is the primary requirement for edge computing systems. However, with the surge in data volume and the growing demand for computing power, a computing framework consisting solely of CPUs is no longer sufficient. As a result, heterogeneous CPU-plus-accelerator architectures that improve the computing capacity of edge computing systems have received great attention. The type of accelerator largely determines the performance of the edge computing system. Common accelerators include the Graphics Processing Unit (GPU), the Application Specific Integrated Circuit (ASIC) and the Field Programmable Gate Array (FPGA). FPGAs, with their reconfigurability and high energy efficiency, are widely used in many edge computing scenarios. Nonetheless, performance also depends on the scheduling efficiency between software tasks on CPUs and hardware tasks on FPGAs. Unfortunately, existing strategies have not fully exploited the differences between hardware and software tasks, resulting in low scheduling efficiency. This paper proposes a task scheduling framework for the Dynamic Partial Reconfiguration (DPR) platform. We take full account of the task-switching overhead and the predictable execution time of hardware tasks in DPR, and reduce the number of task switches and active tasks in the system, thus improving scheduling efficiency. We conduct a set of experiments on the Zynq platform to verify the proposed framework. Experimental results demonstrate that when the execution time of the accelerator exceeds the reconfiguration cost by an order of magnitude, the efficiency in all cases is more than 98%, and it reaches 90%-98% when the two are of the same order of magnitude.
30 citations
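A small sketch of the efficiency trade-off the abstract above describes: on a DPR platform, reconfiguration time is pure overhead, so a hardware task's efficiency is bounded by execution time over execution-plus-reconfiguration time, and a scheduler that avoids redundant reloads (keeping a task resident when it is reused) raises that bound. The single-region model and the timing values are illustrative assumptions, not the paper's framework.

# Illustrative model of DPR scheduling efficiency: reusing an already-loaded
# accelerator avoids the reconfiguration overhead entirely.
def run_tasks(tasks, reconfig_time):
    """tasks: list of (task_id, exec_time) executed on one reconfigurable
    region; the bitstream is reloaded only when the task type changes."""
    loaded = None
    busy = total = 0.0
    for task_id, exec_time in tasks:
        if task_id != loaded:
            total += reconfig_time       # partial reconfiguration overhead
            loaded = task_id
        busy += exec_time                # useful accelerator work
        total += exec_time
    return busy, total

# Usage: execution time ~10x the reconfiguration cost, tasks batched by type,
# which keeps efficiency around 98% with only two reconfigurations.
busy, total = run_tasks([("fft", 10.0)] * 5 + [("aes", 10.0)] * 5, 1.0)
print(f"efficiency = {busy / total:.1%}")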
TL;DR: A deep inspection of file usage by mobile applications shows that short-lived cache files, sparse and random access to large files, and over-aggressive prefetching waste storage space and memory, and two preliminary optimizations demonstrate a high potential to eliminate this waste with minimal impact on user experience.
Abstract: While the computing power of mobile devices has been evolving quickly in recent years, the growth of mobile storage capacity is relatively slower. A common problem shared by budget-phone users is that they frequently run out of storage space. This article conducts a deep inspection of the file usage of mobile applications and its potential implications for user experience. Our major findings are as follows. First, mobile applications can rapidly consume storage space by creating temporary cache files, but these cache files quickly become obsolete after being re-used for only a short period of time. Second, the access patterns of large files, especially executable files, appear highly sparse and random, and therefore large portions of file space are never visited. Third, file prefetching brings an excessive amount of file data into the page cache, but only a small fraction of the prefetched data is actually used. The unnecessary memory pressure causes premature memory reclamation and prolongs application launch time. Through a feasibility study of two preliminary optimizations, we demonstrate a high potential to eliminate unnecessary storage and memory space consumption with minimal impact on user experience.
17 citations
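One of the findings above is that application cache files quickly become obsolete after a short reuse window. A minimal sketch of a cleanup pass based on that observation, reporting cache files not accessed within a staleness threshold, follows; the cache directory path and the 7-day threshold are assumptions for illustration, not the article's optimization.

# Sketch of finding obsolete per-app cache files by last access time.
import os
import time

STALE_AFTER = 7 * 24 * 3600   # assumed staleness threshold in seconds

def stale_cache_files(cache_dir, now=None):
    now = now or time.time()
    stale = []
    for root, _dirs, names in os.walk(cache_dir):
        for name in names:
            path = os.path.join(root, name)
            try:
                if now - os.stat(path).st_atime > STALE_AFTER:
                    stale.append((path, os.path.getsize(path)))
            except OSError:
                pass   # file disappeared while scanning
    return stale

# Usage: report how much space obsolete cache files occupy (path is hypothetical).
candidates = stale_cache_files("/data/data/example.app/cache")
print(sum(size for _path, size in candidates), "bytes reclaimable")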
TL;DR: A scalable spatial-temporal graph generative adversarial network architecture (STG-GAN) is introduced, which comprehensively considers the influence of human-environment interaction and generates reasonable multi-modal predicted trajectories.
Abstract: Edge agents, represented by socially-aware robots and autonomous vehicles, have gradually been integrated into human society. A safe navigation system for interactive scenes is of great importance to them. The key to such a system is the edge agent's ability to predict pedestrian trajectories in dynamic scenes so as to avoid collisions. However, predicting pedestrian trajectories in dynamic scenes is not an easy task, because it requires comprehensively considering the spatial-temporal structure of human-environment interaction, visual attention, and the multi-modal nature of human walking behavior. In this paper, a scalable spatial-temporal graph generative adversarial network architecture (STG-GAN) is introduced, which comprehensively considers the influence of human-environment interaction and generates reasonable multi-modal predicted trajectories. First, we use LSTM nodes to flexibly transform the spatial-temporal graph of human-environment interactions into feed-forward differentiable feature encodings, and propose a global node to integrate scene context information. Then, we capture the relative importance of global interactions on pedestrian trajectories through scaled dot-product attention, and jointly train recurrent sequence models with a generative adversarial network architecture, so as to generate reasonable distributions of future pedestrian trajectories based on rich mixed features. Experiments on public data sets show that STG-GAN is superior to previous work in terms of accuracy, inference speed and the rationality of trajectory prediction.
13 citations
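The abstract above uses scaled dot-product attention to weigh the relative importance of global interactions. A minimal, self-contained sketch of that standard operation is given below; the shapes and toy inputs are illustrative and this is not the STG-GAN implementation.

# Standard scaled dot-product attention over per-agent feature vectors.
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (num_agents, d) arrays. Returns attended features (num_agents, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # pairwise interaction strengths
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over interactions
    return weights @ v

# Usage: three pedestrian nodes with 4-dimensional features (assumed toy sizes).
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(feats, feats, feats).shape)   # (3, 4)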
Cited by
TL;DR: This work presents some important edge computing architectures, classifies previous works on computation offloading into different categories, and discusses basic models such as the channel model, the computation and communication model, and the energy harvesting model that have been proposed for offloading modeling.
Abstract: As a promising technology, edge computing extends computation, communication, and storage facilities toward the edge of a network. This new computing paradigm opens up new challenges, among which computation offloading is considered to be the most important one. Computation offloading enables end devices to offload computation tasks to edge servers and receive the results after the servers execute the tasks. In computation offloading, offloading modeling plays a crucial role in determining the overall edge computing performance. We present a comprehensive overview of past developments as well as recent advances in research areas related to offloading modeling in edge computing. First, we present some important edge computing architectures and classify the previous works on computation offloading into different categories. Second, we discuss some basic models, such as the channel model, the computation and communication model, and the energy harvesting model, that have been proposed in offloading modeling. Next, we elaborate on different offloading modeling methods which are based on (non-)convex optimization, Markov decision processes, game theory, Lyapunov optimization, or machine learning. Finally, we highlight and discuss some research directions and challenges in the area of offloading modeling in edge computing.
159 citations
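The survey above discusses computation and communication models that underlie offloading decisions. A minimal sketch of the textbook form of that trade-off, comparing local execution latency against transmission-plus-remote-execution latency, is shown below; the parameter values and the latency-only objective are illustrative assumptions, not any specific model from the survey.

# Textbook-style offloading decision: offload when sending the input and
# executing remotely beats executing locally.
def local_latency(cycles, local_freq_hz):
    return cycles / local_freq_hz

def offload_latency(input_bits, bandwidth_bps, cycles, edge_freq_hz):
    return input_bits / bandwidth_bps + cycles / edge_freq_hz

def should_offload(input_bits, cycles, local_freq_hz, edge_freq_hz, bandwidth_bps):
    return offload_latency(input_bits, bandwidth_bps, cycles, edge_freq_hz) \
           < local_latency(cycles, local_freq_hz)

# Usage (assumed numbers): 2 Mb input, 10^9 cycles, 1 GHz device,
# 8 GHz edge server, 10 Mb/s uplink -> offloading wins.
print(should_offload(2e6, 1e9, 1e9, 8e9, 10e6))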
17 Dec 2019
TL;DR: This work identifies important families of workloads, as well as prevalent types of DRAM chips, rigorously analyzes the combined DRAM-workload behavior, and draws 12 key observations from the characterization.
Abstract: It has become increasingly difficult to understand the complex interaction between modern applications and main memory, composed of Dynamic Random Access Memory (DRAM) chips. Manufacturers and researchers are developing many different types of DRAM, with each DRAM type catering to different needs (e.g., high throughput, low power, high memory density). At the same time, the memory access patterns of prevalent and emerging applications are rapidly diverging, as these applications manipulate larger data sets in very different ways. As a result, the combined DRAM-workload behavior is often difficult to intuitively determine today, which can hinder memory optimizations in both hardware and software. In this work, we identify important families of workloads, as well as prevalent types of DRAM chips, and rigorously analyze the combined DRAM-workload behavior. To this end, we perform a comprehensive experimental study of the interaction between nine different DRAM types (DDR3/4, LPDDR3/4, GDDR5, Wide I/O, Wide I/O 2, HBM, HMC) and 115 modern applications and multiprogrammed workloads from six diverse application families (desktop/scientific, server/cloud, multimedia acceleration, network acceleration, GPGPU, OS routines). We draw 12 key observations from our characterization, enabled in part by our development of new metrics that quantify the effect of memory access patterns on hardware utilization. We highlight our five most significant observations here: (1) Despite having 50% higher memory bandwidth than DDR3, the newer DDR4 rarely outperforms DDR3 on the applications we evaluate, as DDR4's access latency is 11-14% higher. (2) The high-bandwidth HMC does not outperform DDR3 for most single-thread workloads and many multithreaded applications. This is because HMC's design trade-offs (e.g., a row width that is 97% smaller than DDR3) fundamentally limit opportunities for exploiting spatial locality. For example, single-thread desktop and scientific applications actually perform 5.8% worse with HMC than with DDR3, on average, even though HMC offers 87.4% more memory bandwidth. HMC provides significant performance improvements over other DRAM types in cases where application spatial locality is low (or is destroyed), such as highly-memory-intensive multiprogrammed workloads. (3) While low-power DRAM types typically perform worse than standard-power DRAM for most memory-intensive applications, some low-power DRAM types perform well when bandwidth demand is very high. For example, on average, LPDDR4 performs only 7.0% worse than DDR3 for our multiprogrammed desktop workloads, while consuming 68.2% less energy, and Wide I/O 2 performs 2.3% better than DDR3 for multimedia acceleration. (4) The best DRAM for a heterogeneous system depends heavily on the predominant function(s) performed by the system. We study three types of applications for heterogeneous systems. First, multimedia acceleration benefits most from high-throughput memories that exploit a high amount of spatial locality, running up to 21.6% faster with GDDR5 and 14.7% faster with HBM than DDR3, but only 5.0% faster with HMC. Second, a network accelerator's memory requests are highly bursty and do not exhibit significant spatial locality, and are thus a good fit for the high bank-level parallelism of HMC (88.4% faster on average over DDR3).
Third, GPGPU applications exhibit a wide range of memory intensity, but memory-intensive GPGPU applications typically also take advantage of spatial locality due to memory coalescing, and perform more effectively with HBM (26.9% higher on average over DDR3) and GDDR5 (39.7%) than with DDR3 or HMC. (5) Several common OS routines (e.g., file I/O, process forking) exhibit extremely high spatial locality, and do not benefit from high amounts of bank-level parallelism. As a result, they perform better with memories such as DDR3 and GDDR5, which have lower access latencies than the other memory types that we study. Since OS routines are used across most computer systems in a widespread manner, we believe DRAM designers must provide low-latency access, instead of following the current trend of increasing latency in order to deliver greater throughput. For more information on our extensive experimental characterization, we refer the reader to the full version of our paper. We hope that the trends we identify can drive optimizations in both hardware and software design. To aid further study, we open-source our extensively-modified simulators, as well as MemBen, a benchmark suite containing our applications.
73 citations
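The study above develops metrics that relate memory access patterns to hardware utilization; one widely used indicator of spatial locality is the row-buffer hit rate. A minimal sketch of computing it from a trace of (bank, row) accesses follows; the open-row model here is a deliberate simplification for illustration, not the paper's simulation methodology.

# Row-buffer hit rate as a simple spatial-locality indicator: with an open-row
# policy, consecutive accesses to the same row in a bank hit the row buffer.
def row_buffer_hit_rate(trace):
    """trace: iterable of (bank, row) accesses in program order."""
    open_rows = {}
    hits = total = 0
    for bank, row in trace:
        total += 1
        if open_rows.get(bank) == row:
            hits += 1
        open_rows[bank] = row
    return hits / total if total else 0.0

# Usage: a streaming pattern reuses each row; a row-per-access pattern does not.
streaming = [(0, r // 8) for r in range(64)]              # 8 accesses per row
print(row_buffer_hit_rate(streaming))                     # 0.875
print(row_buffer_hit_rate([(0, r) for r in range(64)]))   # 0.0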
TL;DR: A dynamic adaptive replacement policy (DARP) in the shared last-level cache for DRAM/PCM hybrid main memory is proposed; experimental results show that DARP improves memory access efficiency by 25.4%.
Abstract: The increasing demand on main memory capacity is one of the main big data challenges. Dynamic random access memory (DRAM) is not the best choice for main memory, due to its high power consumption and low density. Nonvolatile memories such as phase-change memory (PCM) are an additional option because of their low power consumption and high density. Nevertheless, high access latency and limited write endurance currently prevent PCM from replacing DRAM outright. Therefore, a hybrid memory that combines DRAM and PCM has become a good alternative to traditional DRAM-only main memory, although the disadvantages of both DRAM and PCM remain challenges for such a hybrid design. In this paper, a dynamic adaptive replacement policy (DARP) in the shared last-level cache for DRAM/PCM hybrid main memory is proposed. DARP distinguishes cached data into PCM-backed data and DRAM-backed data, and then adopts a different replacement policy for each data type. Specifically, for PCM-backed data, the least recently used (LRU) replacement policy is adopted, and for DRAM-backed data, an adaptive policy driven by process behavior is employed. Experimental results show that DARP improves memory access efficiency by 25.4%.
55 citations
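A minimal sketch of the general idea described above, treating cache lines backed by PCM differently from those backed by DRAM at eviction time, is given below. Preferring to evict DRAM-backed lines among the least recently used candidates is an illustrative policy choice made for this sketch; it is not the exact DARP algorithm.

# Sketch of a hybrid-memory-aware eviction choice: PCM misses/writebacks are
# far more expensive than DRAM ones, so among the LRU candidates prefer to
# evict DRAM-backed lines. Illustration only, not the paper's DARP.
from collections import OrderedDict

class HybridAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # addr -> "DRAM" or "PCM", LRU order

    def access(self, addr, backing):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # refresh recency on a hit
        else:
            if len(self.lines) >= self.capacity:
                self.lines.pop(self._victim())
            self.lines[addr] = backing

    def _victim(self):
        candidates = list(self.lines)[: max(1, self.capacity // 2)]  # LRU end
        for addr in candidates:
            if self.lines[addr] == "DRAM":  # cheap to refetch, evict first
                return addr
        return candidates[0]                # otherwise plain LRU

# Usage: the DRAM-backed line 2 is evicted instead of the older PCM line 1.
cache = HybridAwareCache(capacity=4)
for addr, mem in [(1, "PCM"), (2, "DRAM"), (3, "PCM"), (4, "PCM"), (5, "DRAM")]:
    cache.access(addr, mem)
print(list(cache.lines))   # -> [1, 3, 4, 5]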
Journal Article
TL;DR: This paper designs the Deep Learning Accelerator Unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype and employs three pipelined processing units to improve throughput.
Abstract: As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks has become increasingly large due to the demands of practical applications, which poses a significant challenge to constructing high-performance implementations of deep learning neural networks. In order to improve performance while maintaining low power cost, in this paper we design the Deep Learning Accelerator Unit (DLAU), a scalable accelerator architecture for large-scale deep learning networks that uses a field-programmable gate array (FPGA) as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve throughput and utilizes tiling techniques to exploit locality in deep learning applications. Experimental results on a state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator achieves up to 36.1× speedup compared to the Intel Core2 processors, with a power consumption of 234 mW.
53 citations
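The DLAU abstract above mentions tiling to exploit locality in large layers. A minimal sketch of that technique, computing a fully connected layer tile by tile so that only a small slice of the weight matrix needs to be resident at a time, follows; the tile size and layer dimensions are illustrative assumptions, not DLAU's hardware design.

# Tiled matrix-vector product for a fully connected layer: the weight matrix
# is processed in small tiles to improve locality.
import numpy as np

def tiled_fc(weights, x, tile=64):
    """weights: (out_dim, in_dim), x: (in_dim,). Returns weights @ x."""
    out = np.zeros(weights.shape[0])
    for i in range(0, weights.shape[0], tile):          # tile output neurons
        for j in range(0, weights.shape[1], tile):      # tile input features
            out[i:i + tile] += weights[i:i + tile, j:j + tile] @ x[j:j + tile]
    return out

# Usage: the tiled result matches the untiled product.
rng = np.random.default_rng(1)
W, x = rng.normal(size=(256, 300)), rng.normal(size=300)
print(np.allclose(tiled_fc(W, x), W @ x))   # True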