
Showing papers on "Performance per watt" published in 2020


Proceedings ArticleDOI
09 Mar 2020
TL;DR: Fleet is presented, a framework that offers a massively parallel streaming model for FPGAs and is effective in a number of domains well-suited to FPGA acceleration, including parsing, compression, and machine learning.
Abstract: We present Fleet, a framework that offers a massively parallel streaming model for FPGAs and is effective in a number of domains well-suited for FPGA acceleration, including parsing, compression, and machine learning. Fleet requires the user to specify RTL for a processing unit that serially processes every input token in a stream, a far simpler task than writing a parallel processing unit. It then takes the user's processing unit and generates a hardware design with many copies of the unit as well as memory controllers to feed the units with separate streams and drain their outputs. Fleet includes a Chisel-based processing unit language. The language maintains Chisel's low-level performance control while adding a few productivity features, including automatic handling of ready-valid signaling and a native and automatically pipelined BRAM type. We evaluate Fleet on six different applications, including JSON parsing and integer compression, fitting hundreds of Fleet processing units on the Amazon F1 FPGA and outperforming CPU implementations by over 400x and GPU implementations by over 9x in performance per watt while requiring a similar number of lines of code.
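A minimal Python model of the programming contract Fleet offers may help: the user supplies only serial per-token logic, and the framework replicates it across independent streams. All names here (HistogramUnit, run_fleet) are hypothetical illustrations, not Fleet's actual Chisel-based API.

```python
# Minimal Python model of Fleet's programming contract (illustrative only;
# real Fleet units are written in a Chisel-based RTL language).

class HistogramUnit:
    """A token-serial processing unit: consumes one input token per step."""
    def __init__(self):
        self.counts = {}

    def consume(self, token):
        # Serial logic only: no intra-stream parallelism is required of the user.
        self.counts[token] = self.counts.get(token, 0) + 1

    def drain(self):
        return self.counts

def run_fleet(unit_factory, streams):
    """Fleet-style replication: one unit instance per independent stream."""
    units = [unit_factory() for _ in streams]
    for unit, stream in zip(units, streams):
        for token in stream:      # hardware would overlap these loops
            unit.consume(token)
    return [u.drain() for u in units]

print(run_fleet(HistogramUnit, [b"abracadabra", b"fpga"]))
```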

25 citations


Proceedings ArticleDOI
01 Dec 2020
TL;DR: Forest as mentioned in this paper is an open-source tool that automatically generates ROS2 nodes for high-level synthesis-based FPGA modules, greatly facilitating the integration of FPGAs with other robot components.
Abstract: Integrating FPGAs into robot systems can be a demanding task. In this paper, we present Forest, an open-source tool that automatically generates ROS2 nodes for high-level synthesis-based FPGA modules, greatly facilitating the integration of FPGAs with other robot components. Forest runs on the PYNQ version 2.5 environment with ROS2 Eloquent and can be used with Xilinx SoCs, such as the Xilinx Zynq-7000. The ROS2-FPGA node generated by Forest is evaluated on an image processing task, where the FPGA logic performs a linear contrast stretch on images of three different sizes; an average speed-up of 36.3x and a performance per watt improvement of 432.2x are observed when compared to a ROS2 node running on a modern CPU.
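For reference, the operation offloaded to the FPGA in the evaluation, a linear contrast stretch, is simple to state in software. A minimal NumPy sketch of the pixel transform (not the generated HLS module or ROS2 node) follows:

```python
import numpy as np

def linear_contrast_stretch(img: np.ndarray, out_min=0, out_max=255) -> np.ndarray:
    """Reference (CPU) version of the linear contrast stretch the paper
    offloads to the FPGA: remap [img.min(), img.max()] onto [out_min, out_max]."""
    lo, hi = img.min(), img.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.full_like(img, out_min)
    stretched = (img.astype(np.float32) - lo) * (out_max - out_min) / (hi - lo) + out_min
    return stretched.astype(np.uint8)

frame = np.random.randint(60, 180, size=(480, 640), dtype=np.uint8)
out = linear_contrast_stretch(frame)
print(out.min(), out.max())           # full [0, 255] range after stretching
```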

10 citations


Posted Content
TL;DR: This paper presents a fast NPU-aware NAS methodology, called S3NAS, to find a CNN architecture with higher accuracy than the existing ones under a given latency constraint, and applies a modified Single-Path NAS technique to the proposed supernet structure.
Abstract: As the application area of convolutional neural networks (CNN) grows in embedded devices, it has become popular to use a hardware CNN accelerator, called a neural processing unit (NPU), to achieve higher performance per watt than CPUs or GPUs. Recently, automated neural architecture search (NAS) has emerged as the default technique to find a state-of-the-art CNN architecture with higher accuracy than manually designed architectures for image classification. In this paper, we present a fast NPU-aware NAS methodology, called S3NAS, to find a CNN architecture with higher accuracy than the existing ones under a given latency constraint. It consists of three steps: supernet design, Single-Path NAS for fast architecture exploration, and scaling. To widen the search space of the supernet structure, which consists of stages, we allow stages to have different numbers of blocks and blocks to have parallel layers of different kernel sizes. For fast neural architecture search, we apply a modified Single-Path NAS technique to the proposed supernet structure. In this step, we assume a shorter latency constraint than required, to reduce the search space and the search time. The last step is to scale up the network maximally within the latency constraint. For accurate latency estimation, an analytical latency estimator is devised, based on a cycle-level NPU simulator that runs an entire CNN while accurately accounting for memory access overhead. With the proposed methodology, we are able to find a network in 3 hours using TPUv3, which shows 82.72% top-1 accuracy on ImageNet with 11.66 ms latency. Code is released at this https URL
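The final scaling step can be pictured as a search for the largest variant that still meets the latency budget. The sketch below is a hedged simplification; estimate_latency_ms and scale_model stand in for the paper's cycle-level NPU latency estimator and its scaling rule.

```python
# Hedged sketch of S3NAS's final step: scale a candidate network up as far
# as the latency budget allows. The helpers passed in are hypothetical
# stand-ins, not the paper's actual estimator or scaling rule.

def scale_within_budget(base_model, budget_ms, estimate_latency_ms, scale_model,
                        factors=(1.0, 1.1, 1.2, 1.3, 1.4, 1.5)):
    best = base_model
    for f in factors:                      # try increasingly large variants
        candidate = scale_model(base_model, f)
        if estimate_latency_ms(candidate) <= budget_ms:
            best = candidate               # keep the largest one that fits
        else:
            break
    return best

# Toy stand-ins for demonstration only.
toy = {"width": 16}
scale = lambda m, f: {"width": int(m["width"] * f)}
latency = lambda m: m["width"] * 0.9       # pretend latency grows with width
print(scale_within_budget(toy, budget_ms=18,
                          estimate_latency_ms=latency, scale_model=scale))
```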

8 citations


Proceedings ArticleDOI
24 Sep 2020
TL;DR: This paper presents the PROTEUS framework, which employs rule-based self-adaptation in PNoCs and can achieve up to 24.5% less laser power consumption, up to 31% less average packet latency, and up to 20% less energy-per-bit compared to another laser power management technique from prior work.
Abstract: The performance of on-chip communication in the state-of-the-art multi-core processors that use traditional electronic NoCs has already become severely energy-constrained. To that end, emerging photonic NoCs (PNoCs) are seen as a potential solution to improve the energy-efficiency (performance per watt) of on-chip communication. However, existing PNoC designs cannot realize their full potential due to their excessive laser power consumption. Prior works that attempt to improve laser power efficiency in PNoCs do not consider all key factors that affect the laser power requirement of PNoCs. Therefore, they cannot yield the desired balance between laser power reduction, achieved performance, and energy-efficiency in PNoCs. In this paper, we present the PROTEUS framework, which employs rule-based self-adaptation in PNoCs. Our approach not only reduces the laser power consumption, but also minimizes the average packet latency by opportunistically increasing the communication data rate in PNoCs, and thus yields the desired balance between laser power reduction, performance, and energy-efficiency in PNoCs. Our evaluation with PARSEC benchmarks shows that our PROTEUS framework can achieve up to 24.5% less laser power consumption, up to 31% less average packet latency, and up to 20% less energy-per-bit, compared to another laser power management technique from prior work.
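The flavor of rule-based self-adaptation can be illustrated with a toy control loop; the rules, thresholds, and knob names below are invented for the example and are not PROTEUS's actual rule set.

```python
# Illustrative rule-based adaptation loop in the spirit of PROTEUS.
# Every field name and threshold here is a hypothetical simplification.

def adapt(link):
    """Apply simple rules each epoch: trim laser power when the link has
    slack, raise the data rate when buffers indicate congestion."""
    if link["ber"] < 1e-12 and link["utilization"] < 0.4:
        link["laser_mw"] *= 0.9          # slack: spend less laser power
    elif link["buffer_occupancy"] > 0.8:
        link["data_rate_gbps"] *= 1.25   # congestion: opportunistically speed up
        link["laser_mw"] *= 1.1          # a higher rate needs more optical power
    return link

print(adapt({"ber": 1e-13, "utilization": 0.3,
             "buffer_occupancy": 0.2, "laser_mw": 5.0, "data_rate_gbps": 10.0}))
```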

4 citations


Proceedings ArticleDOI
01 Aug 2020
TL;DR: A hardware accelerator is proposed that fully exploits the available parallelism in the Local Laplacian Filtering algorithm while minimizing the utilization of on-chip FPGA resources.
Abstract: Images processed using various enhancement techniques often suffer edge degradation and other unwanted artifacts such as halos. These artifacts pose a major problem for photographic applications, where they can degrade the quality of an image. A plethora of edge-aware techniques has been proposed in the field of image processing. However, these require the application of complex optimization or post-processing methods. Local Laplacian Filtering is an edge-aware image processing technique that involves the construction of simple Gaussian and Laplacian pyramids. This technique can be successfully applied to detail smoothing, detail enhancement, tone mapping, and inverse tone mapping of an image while keeping it artifact-free. The problem with this approach, though, is that it is computationally expensive. Hence, parallelization schemes using multi-core CPUs and GPUs have been proposed. As is well known, these are not power-efficient, and a well-designed hardware architecture on an FPGA can do better on the performance per watt metric. In this paper, we propose a hardware accelerator that fully exploits the available parallelism in the Local Laplacian Filtering algorithm while minimizing the utilization of on-chip FPGA resources. On a Virtex-7 FPGA, we obtain a 7.5x speed-up when processing a 1 MB image compared to an optimized baseline CPU implementation. To the best of our knowledge, no other hardware accelerator for the Local Laplacian Filtering problem has been proposed in the research literature.
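As background, the Gaussian and Laplacian pyramids at the heart of the algorithm are straightforward to build in software. The sketch below (a 2x2 box filter standing in for the usual 5-tap Gaussian kernel, assuming power-of-two image dimensions) shows the structure the accelerator parallelizes:

```python
import numpy as np

def downsample(img):
    """Blur-and-decimate; a 2x2 box blur stands in for a Gaussian kernel."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w].astype(np.float32)
    return (img[0::2, 0::2] + img[1::2, 0::2] +
            img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def upsample(img, shape):
    """Nearest-neighbor 2x upsampling, cropped to the target shape."""
    return np.kron(img, np.ones((2, 2), np.float32))[:shape[0], :shape[1]]

def pyramids(img, levels=4):
    gaussian = [img.astype(np.float32)]
    for _ in range(levels - 1):
        gaussian.append(downsample(gaussian[-1]))
    laplacian = [g - upsample(g_next, g.shape)           # band-pass residuals
                 for g, g_next in zip(gaussian[:-1], gaussian[1:])]
    laplacian.append(gaussian[-1])                        # coarsest residual
    return gaussian, laplacian

img = np.random.rand(256, 256).astype(np.float32)
g, l = pyramids(img)
print([lev.shape for lev in l])   # (256,256), (128,128), (64,64), (32,32)
```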

3 citations


Book ChapterDOI
27 May 2020
TL;DR: The world's first vector-friendly BFS implementation of the direction-optimizing algorithm for the NEC SX-Aurora TSUBASA architecture, which significantly outperforms the existing state-of-the-art implementations for both modern CPUs (Intel Skylake) and NVIDIA V100 GPUs.
Abstract: Breadth-First Search (BFS) is an important computational kernel used as a building block for many other graph algorithms. Different algorithms and implementation approaches aimed at solving the BFS problem have been proposed so far for various computational platforms, with the direction-optimizing algorithm being the fastest and the most computationally efficient for many real-world graph types. However, a straightforward implementation of direction-optimizing BFS for vector computers can be extremely challenging and inefficient due to the high irregularity of the graph data structure and of the algorithm itself. This paper describes the world's first attempt to create an efficient vector-friendly BFS implementation of the direction-optimizing algorithm for the NEC SX-Aurora TSUBASA architecture. SX-Aurora TSUBASA vector processors provide high computational power together with the world's highest-bandwidth memory, making the platform very interesting for solving various graph-processing problems. The implementation proposed in this paper significantly outperforms the existing state-of-the-art implementations for both modern CPUs (Intel Skylake) and NVIDIA V100 GPUs. In addition, the proposed implementation achieves significantly higher energy efficiency than other platforms and implementations, both in terms of average power consumption and achieved performance per watt.
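For readers unfamiliar with the direction-optimizing algorithm, the scalar logic being vectorized looks roughly like the following sketch (Beamer-style top-down/bottom-up switching; alpha and beta are the conventional heuristic parameters, and the values here are illustrative defaults, not the paper's tuning):

```python
def direction_optimizing_bfs(adj, source, alpha=15, beta=18):
    """Scalar reference of direction-optimizing BFS; the paper's
    contribution is making this pattern vector-friendly."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = [source]
    bottom_up = False
    while frontier:
        m_frontier = sum(len(adj[v]) for v in frontier)
        m_unvisited = sum(len(adj[v]) for v in range(n) if parent[v] == -1)
        if not bottom_up and m_frontier > m_unvisited / alpha:
            bottom_up = True                  # frontier is "heavy": flip direction
        elif bottom_up and len(frontier) < n / beta:
            bottom_up = False                 # frontier shrank again: flip back
        next_frontier = []
        if bottom_up:
            frontier_set = set(frontier)
            for v in range(n):                # unvisited vertices look for a parent
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier_set:
                            parent[v] = u
                            next_frontier.append(v)
                            break
        else:
            for u in frontier:                # classic top-down edge expansion
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        next_frontier.append(v)
        frontier = next_frontier
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2]]        # a 4-cycle
print(direction_optimizing_bfs(adj, 0))       # [0, 0, 0, 1]
```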

3 citations


Journal ArticleDOI
TL;DR: In this article, a distributed system that exploits FPGAs to accelerate compute-intensive tasks in fog computing applications is presented, which is able to efficiently run distributed fog applications thanks to a well-defined application structure, a per-application isolated network overlay, and the acceleration of tasks.
Abstract: In the last few years, Internet of Things (IoT) applications have been moving from the cloud-sensor paradigm to a more variegated structure where IoT nodes interact with an intermediate fog computing layer. To enable compute-intensive tasks to be executed near the source of the data, fog computing nodes should provide enough performance and be sufficiently energy-efficient to run in the field. Within this context, embedded Field Programmable Gate Arrays (FPGAs) can be used to improve the performance per watt ratio of fog computing nodes. In this paper, we present Fog Acceleration through Reconfigurable Devices (FARD), a distributed system that exploits FPGAs to accelerate compute-intensive tasks in fog computing applications. FARD is able to efficiently run distributed fog applications thanks to a well-defined application structure, a per-application isolated network overlay, and the acceleration of tasks. Results show energy efficiency improvements while efficiently enabling cooperation across fog nodes.

2 citations


Posted Content
TL;DR: The Universal Number Library as mentioned in this paper is a high-performance number systems library that includes arbitrary integer, decimal, fixed-point, and floating-point types, and introduces two tapered floating-point types, posit and valid, that support reproducible arithmetic computation in arbitrary concurrency environments.
Abstract: With the proliferation of embedded systems requiring intelligent behavior, custom number systems that optimize the performance per watt of the entire system become essential components of successful commercial products. We present the Universal Number Library, a high-performance number systems library that includes arbitrary integer, decimal, fixed-point, and floating-point types, and introduces two tapered floating-point types, posit and valid, that support reproducible arithmetic computation in arbitrary concurrency environments. We discuss the design of the Universal library as a run-time for application development and as a platform for application-driven hardware validation. The library implementation is described, and educational examples are provided to elucidate the number system properties and to show how specialization is used to yield very high-performance emulation on existing x86, ARM, and POWER processors. We highlight the integration of the library into larger application environments in computational science and engineering to enable multi-precision and adaptive-precision algorithms that improve the performance and efficiency of large-scale and real-time applications. We demonstrate the integration of the Universal library into a high-performance reproducible linear algebra run-time, and conclude with the roadmap of additional library functionality as we target new application domains, such as Software Defined Radio, instrumentation, sensor fusion, and model-predictive control.
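To give a feel for the tapered floating-point types the library introduces, the sketch below decodes a posit bit pattern in Python. This follows the published posit format (sign, regime, exponent, fraction), but it is a didactic stand-in, not the Universal library's C++ API:

```python
def decode_posit(bits: int, nbits: int = 16, es: int = 1) -> float:
    """Decode a posit<nbits,es> bit pattern into a float -- a didactic
    sketch of the tapered format the Universal library implements in C++."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (nbits - 1):
        return float("nan")                     # NaR: Not a Real
    sign = -1.0 if bits >> (nbits - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask                   # two's complement
    # Regime: run of identical bits after the sign bit; its length is tapered.
    r = (bits >> (nbits - 2)) & 1
    run, i = 0, nbits - 2
    while i >= 0 and ((bits >> i) & 1) == r:
        run += 1
        i -= 1
    k = run - 1 if r else -run
    i -= 1                                      # skip the regime terminator
    # Exponent: up to es bits, zero-padded on the right if truncated.
    e = 0
    for _ in range(es):
        e <<= 1
        if i >= 0:
            e |= (bits >> i) & 1
            i -= 1
    # Fraction: whatever bits remain.
    fbits = i + 1
    frac = bits & ((1 << fbits) - 1) if fbits > 0 else 0
    significand = 1.0 + (frac / (1 << fbits) if fbits > 0 else 0.0)
    return sign * significand * 2.0 ** (k * (1 << es) + e)

print(decode_posit(0b0100000000000000))   # 1.0 for posit<16,1>
print(decode_posit(0b0111111111111111))   # maxpos = 2**28
```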

1 citation


Proceedings ArticleDOI
10 Jan 2020
TL;DR: A comparative study of the techniques, parameters, and performance improvements of contention-aware scheduling algorithms, so that future computer scientists can develop contention-aware solutions more precisely.
Abstract: For the last few decades, multitasking has been in high demand. To achieve multitasking, symmetric and asymmetric multi-core processor systems are a popular technology. Asymmetric multi-core processors (AMPs) use the same instruction set architecture (ISA) but different clock frequencies. It has been shown that AMPs deliver better performance per watt compared to their symmetric counterparts. Future multi-core systems will combine a few fast cores with many slow cores. A fast core means high power consumption, with complex pipelines and a high clock frequency, whereas a slow core has low power consumption, with simple pipelines and a low clock frequency. To get the best performance from asymmetric multi-core processors, the scheduling policy plays an important role. Scheduling co-running applications on the most suitable core types is vital for AMPs to achieve their best performance. Various policies, such as contention-aware, parallelism-aware, and asymmetry-aware, need to be considered when developing a scheduling algorithm. For AMPs, contention for shared resources is a key performance-limiting factor. Despite noteworthy research efforts, contention for resource sharing in multi-core processors remains unsolved. In this paper, we discuss the latest five contention-aware scheduling algorithms for AMPs. We present a comparative study of their techniques, parameters, and performance improvements so that future computer scientists can develop contention-aware solutions more precisely.
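The common core of the surveyed algorithms can be caricatured as: profile each task's contention behavior, then map tasks to fast or slow cores accordingly. The metric and numbers below are invented for illustration and do not come from any of the five surveyed schedulers:

```python
# Hedged illustration of the general contention-aware idea: estimate each
# task's sensitivity to shared-resource contention (e.g., via cache-miss
# rate) and give fast cores to the tasks that can actually exploit them.

def assign(tasks, fast_slots, slow_slots):
    """tasks: list of (name, miss_rate, ipc) tuples.
    High-IPC, low-miss tasks profit most from fast cores; memory-bound
    tasks gain little and are relegated to slow cores."""
    ranked = sorted(tasks, key=lambda t: t[2] / (1 + t[1]), reverse=True)
    fast = ranked[:fast_slots]
    slow = ranked[fast_slots:fast_slots + slow_slots]
    return {"fast": [t[0] for t in fast], "slow": [t[0] for t in slow]}

tasks = [("mcf", 0.30, 0.6), ("gcc", 0.05, 1.8),
         ("lbm", 0.40, 0.5), ("povray", 0.02, 2.1)]
print(assign(tasks, fast_slots=2, slow_slots=2))
```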

1 citation


Book ChapterDOI
15 Sep 2020
TL;DR: NuPow as discussed by the authors is a hierarchical scheduling and power management framework for architectures with multiple cores per voltage and frequency domain and non-uniform memory access (NUMA) properties.
Abstract: Power management and task placement pose two of the greatest challenges for future many-core processors in data centers. With hundreds of cores on a single die, cores experience varying memory latencies and cannot individually regulate voltage and frequency, therefore calling for new approaches to scheduling and power management. This work presents NuPow, a hierarchical scheduling and power management framework for architectures with multiple cores per voltage and frequency domain and non-uniform memory access (NUMA) properties. NuPow considers the conflicting goals of grouping virtual machines (VMs) with similar load patterns while also placing them as close as possible to the accessed data. Implemented and evaluated on existing hardware, NuPow achieves significantly better performance per watt compared to competing approaches.
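The two conflicting placement goals can be expressed as a cost function; the sketch below is a hypothetical simplification of NuPow's model, scoring an assignment by intra-domain load dissimilarity (dissimilar loads frustrate shared DVFS) plus a NUMA-distance penalty:

```python
import numpy as np

def placement_score(assignment, load, numa_dist, data_node, w=0.5):
    """assignment[v] = frequency domain of VM v; load[v] = its load trace.
    Lower is better. All structures are invented for the illustration."""
    cost = 0.0
    for d in set(assignment):
        members = [v for v, dom in enumerate(assignment) if dom == d]
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                a, b = members[i], members[j]
                # Dissimilar co-located loads force a compromise V/f setting.
                cost += float(np.abs(load[a] - load[b]).mean())
    for v, dom in enumerate(assignment):
        cost += w * numa_dist[dom][data_node[v]]   # far data hurts latency
    return cost

load = [np.array([0.9, 0.8, 0.9]), np.array([0.2, 0.3, 0.2]),
        np.array([0.85, 0.9, 0.8]), np.array([0.25, 0.2, 0.3])]
numa_dist = [[0, 2], [2, 0]]          # distance between domain and memory node
data_node = [0, 1, 0, 1]              # where each VM's data lives
print(placement_score([0, 1, 0, 1], load, numa_dist, data_node))
```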

Proceedings ArticleDOI
10 Dec 2020
TL;DR: In this article, the authors present a schedule management framework for aperiodic soft-real-time jobs that may be used by a CPU-GPU system designer/integrator to select, configure, and deploy a suitable architectural platform and to perform concurrent scheduling of these jobs.
Abstract: The Graphics Processing Unit (GPU) was originally designed for the rapid creation and manipulation of images. Since then, it has evolved from being just an application-specific processing unit to supporting more general-purpose computing (GPGPU). GPU-based architectures are optimized for throughput and performance per watt, which provides huge computational gains at a fraction of the power compared to traditional CPU-based architectures. As real-time systems begin to integrate more and more functionality, GPU-based architectures are becoming an attractive option for them. However, in a real-time system, predictability and temporal requirements are much more important than raw performance. While some real-time jobs may benefit from the performance that all cores of the GPU can provide, most jobs may require only a subset of cores to successfully meet their temporal requirements. In this paper, we present a schedule management framework for aperiodic soft-real-time jobs that may be used by a CPU-GPU system designer/integrator to select, configure, and deploy a suitable architectural platform and to perform concurrent scheduling of these jobs. An open-source implementation of our framework is made available on GitHub. Experimental results demonstrate the utility and robustness of our framework.
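One way to picture such a framework is an admission loop that grants each job the smallest core subset that meets its deadline. The job model and idealized linear speedup below are invented for illustration and are not the paper's scheduler:

```python
# Toy admission/placement loop in the spirit of the paper's framework:
# each aperiodic soft-real-time job asks for just enough of the GPU's
# streaming multiprocessors (SMs) to meet its deadline.

def sms_needed(job, total_sms):
    """Smallest SM count whose estimated runtime meets the deadline,
    assuming runtime scales as work / sms (an idealized speedup model)."""
    for sms in range(1, total_sms + 1):
        if job["work_ms"] / sms <= job["deadline_ms"]:
            return sms
    return None                       # cannot meet the deadline even with all SMs

def schedule(jobs, total_sms=16):
    free = total_sms
    placed, rejected = [], []
    for job in sorted(jobs, key=lambda j: j["deadline_ms"]):  # earliest deadline first
        need = sms_needed(job, total_sms)
        if need is not None and need <= free:
            free -= need
            placed.append((job["name"], need))
        else:
            rejected.append(job["name"])
    return placed, rejected

jobs = [{"name": "lidar", "work_ms": 40, "deadline_ms": 10},
        {"name": "video", "work_ms": 30, "deadline_ms": 15},
        {"name": "logml", "work_ms": 80, "deadline_ms": 12}]
print(schedule(jobs))
```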

Proceedings ArticleDOI
23 Nov 2020
TL;DR: In this paper, a scalable, dynamic and growing hardware self-organizing map (SOM) architecture is presented for real-time color quantization and pattern distribution recognition in embedded systems.
Abstract: In the era of the Internet of Things (IoT) and Big Data (BD), a significant amount of data is generated every day. The size of collected data streams is now reaching zettabytes (i.e., 10^21 bytes), and their processing and analysis become more and more challenging, especially in embedded systems, where the overall goal is to maximize performance per watt while meeting real-time requirements and trying to keep overall power consumption within very limited power budgets. The collected data are often reduced by means of clustering, vector quantization, or compression before further processing. Unsupervised learning techniques such as Self-Organizing Maps (SOMs), which need no prior knowledge of the processed data, are perfect candidates for this task. However, real-time vector quantization with SOMs requires high performance and dynamic online configurability. The software counterparts of SOMs are highly flexible but offer limited performance per watt, whereas hardware SOMs generally lack flexibility. In this paper, a novel scalable, dynamic, and growing hardware self-organizing map (SOM) is presented. The presented hardware SOM architecture is dynamically configurable and adaptable in terms of neurons, map size, and vector dimension, depending on application-specific needs. The proposed architecture is validated on different map sizes (up to 16×16) with different vector widths, applied to real-time color quantization and pattern distribution recognition.
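As a software reference for the operation the hardware implements, a single SOM training step (best-matching-unit search plus neighborhood update) looks like this; map size and parameters are example values, not the paper's configuration:

```python
import numpy as np

def som_step(weights, x, lr=0.3, sigma=1.5):
    """One SOM update: find the best-matching unit (BMU), then pull its
    Gaussian neighborhood toward the input vector.
    weights: (rows, cols, dim) codebook; x: (dim,) input vector."""
    dist = np.linalg.norm(weights - x, axis=2)           # distance to every neuron
    bmu = np.unravel_index(np.argmin(dist), dist.shape)  # best-matching unit
    rows, cols = np.indices(dist.shape)
    grid_d2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    h = np.exp(-grid_d2 / (2 * sigma ** 2))              # Gaussian neighborhood
    weights += lr * h[..., None] * (x - weights)
    return bmu

rng = np.random.default_rng(0)
som = rng.random((16, 16, 3))                            # 16x16 map, RGB vectors
bmu = som_step(som, np.array([0.9, 0.1, 0.1]))           # train on one red pixel
print("BMU:", bmu)
```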

Proceedings ArticleDOI
01 Aug 2020
TL;DR: This work proposes a Hybrid Multi-target Binary Translator (HMTBT), capable of transparently translating code to different accelerators, a CGRA and a NEON engine, and automatically dispatching the translation to the most well-suited one according to the type of parallelism (ILP or DLP) available at the moment.
Abstract: Embedded systems comprise multiple accelerators to exploit both instruction- and data-level parallelism, maximizing performance per watt. However, the use of accelerators usually involves changes in the source code, breaking binary compatibility and increasing time-to-market. Therefore, Binary Translation (BT) mechanisms emerge as an alternative, since they dynamically detect and transform parts of the application for optimization without needing any prior modification of the code. Nevertheless, the available BT approaches are limited to a single accelerator, which may not always result in the optimal energy-performance trade-off, since different parts of an application may benefit most from one accelerator or another depending on their available intrinsic parallelism. Given that, this work proposes a Hybrid Multi-target Binary Translator (HMTBT). Our HMTBT is capable of transparently translating code to different accelerators, a CGRA (Coarse-Grained Reconfigurable Architecture) and a NEON engine, and automatically dispatching the translation to the most well-suited one, according to the type of parallelism (ILP or DLP) available at the moment. HMTBT improves performance by 54% and 76% and saves energy by 15% and 25% compared to a BT targeting only a CGRA and one targeting only a NEON engine, respectively. We also compare HMTBT to a system that features both CGRA and NEON BT mechanisms, showing 12% energy savings and 14% performance improvement, on average.
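The dispatch decision can be sketched as a simple classifier over per-region profile features; the feature names and thresholds below are hypothetical, standing in for HMTBT's actual ILP/DLP detection:

```python
# Hypothetical sketch of the per-region dispatch decision HMTBT makes:
# wide, regular loops (DLP) go to the SIMD engine, regions with many
# independent scalar operations (ILP) go to the spatial CGRA fabric.

def dispatch(region):
    """region: dict of simple profile features for a translated hot region."""
    if region["vectorizable_loop"] and region["trip_count"] >= 16:
        return "NEON"        # data-level parallelism: SIMD engine wins
    if region["independent_ops"] / max(region["total_ops"], 1) > 0.5:
        return "CGRA"        # instruction-level parallelism: spatial fabric wins
    return "CPU"             # too little parallelism to pay translation overhead

print(dispatch({"vectorizable_loop": True, "trip_count": 64,
                "independent_ops": 10, "total_ops": 40}))   # NEON
print(dispatch({"vectorizable_loop": False, "trip_count": 0,
                "independent_ops": 30, "total_ops": 40}))   # CGRA
```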

Posted Content
TL;DR: In this paper, a rule-based self-adaptation scheme for photonic NoCs (PNoCs) is proposed to reduce the laser power consumption and minimize the average packet latency by opportunistically increasing the communication data rate.
Abstract: The performance of on-chip communication in the state-of-the-art multi-core processors that use traditional electronic NoCs has already become severely energy-constrained. To that end, emerging photonic NoCs (PNoCs) are seen as a potential solution to improve the energy-efficiency (performance per watt) of on-chip communication. However, existing PNoC designs cannot realize their full potential due to their excessive laser power consumption. Prior works that attempt to improve laser power efficiency in PNoCs do not consider all key factors that affect the laser power requirement of PNoCs. Therefore, they cannot yield the desired balance between the reduction in laser power, achieved performance, and energy-efficiency in PNoCs. In this paper, we present the PROTEUS framework, which employs rule-based self-adaptation in PNoCs. Our approach not only reduces the laser power consumption, but also minimizes the average packet latency by opportunistically increasing the communication data rate in PNoCs, and thus yields the desired balance between laser power reduction, performance, and energy-efficiency in PNoCs. Our evaluation with PARSEC benchmarks shows that our PROTEUS framework can achieve up to 24.5% less laser power consumption, up to 31% less average packet latency, and up to 20% less energy-per-bit, compared to another laser power management technique from prior work.

Proceedings ArticleDOI
01 Feb 2020
TL;DR: The proposed near memory computing system provides a considerable improvement in the computational performance of graph analytics algorithms, with an average improvement in Instructions Per Cycle (IPC) of 5x and in performance per watt of 7x.
Abstract: Big data graph analytics is the future of high performance computing and key to many current and future applications. There is a growing demand for high performance graph computing on real-world social network graphs. Real-world graph algorithms are memory-intensive and generate a high percentage of accesses to the memory subsystem due to low cache locality. Near memory, or 3D die-stacked memory, known for its low-latency, high-bandwidth communication, has the potential to improve the performance of big data graph analytics. In this paper, we analyse, evaluate, and compare the performance of a near memory system for big data graph analytics. Real-world graphs associated with social networks and the web are processed with graph analytics algorithms in a simulated near memory system. The performance advantage of near memory with a large number of simple in-order processor cores for graph analysis is presented. The proposed system provides a performance per watt improvement of 3.55-8.55x for the Breadth-First Search algorithm on big data graphs over computing systems with fat cores and traditional Double Data Rate (DDR) memory. The proposed near memory computing system provides a considerable improvement in the computational performance of graph analytics algorithms, with an average improvement in Instructions Per Cycle (IPC) of 5x and in performance per watt of 7x.
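For concreteness, the performance-per-watt metric used in comparisons like this is simply delivered throughput divided by power; a worked toy example (with invented numbers, not the paper's measurements):

```python
# Worked example of a performance-per-watt comparison for a graph kernel.
# All figures below are invented for illustration.

baseline = {"teps": 2.0e9, "watts": 95.0}    # fat cores + DDR memory
near_mem = {"teps": 3.5e9, "watts": 24.0}    # simple in-order cores + 3D-stacked memory

ppw = lambda s: s["teps"] / s["watts"]        # traversed edges per second, per watt
print(f"speedup            : {near_mem['teps'] / baseline['teps']:.2f}x")
print(f"perf/watt advantage: {ppw(near_mem) / ppw(baseline):.2f}x")
```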