Showing papers in "ACM SIGARCH Computer Architecture News" in 2017


Journal ArticleDOI
TL;DR: PARSEC 3.0 is introduced, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain, and that integrates splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently.
Abstract: Benchmarks play a very important role in accelerating the development and research of CMPs. As one of them, the PARSEC suite continues to be updated and revised so that it can offer better support for researchers. The former versions of PARSEC have enough workloads to evaluate CMP properties related to CPU, cache, and memory, but they lack applications built on a network stack to assess the network performance of CMPs. In this work, we introduce PARSEC 3.0, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain. We explore the input sets of splash-2 and expand them to multiple scales, a.k.a. splash-2x. We integrate splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently. Finally, we evaluate the u-TCP/IP stack and the new network workloads, and analyze the characteristics of splash-2 and splash-2x.

50 citations


Journal ArticleDOI
TL;DR: This work proposes an FPGA acceleration system design for Neural Network Q-learning (NNQL) that has high flexibility due to its support for run-time network parameterization, which allows neuroevolution algorithms to dynamically restructure the network to achieve better learning results.
Abstract: Deep Q-learning (DQN) is a recently proposed reinforcement learning algorithm where a neural network is applied as a non-linear approximator to its value function. The exploration-exploitation mechanism allows the training and prediction of the NN to execute simultaneously in an agent during its interaction with the environment. Agents often act independently on battery power, so the training and prediction must occur within the agent and on a limited power budget. In this work, we propose an FPGA acceleration system design for Neural Network Q-learning (NNQL). Our proposed system has high flexibility due to its support for run-time network parameterization, which allows neuroevolution algorithms to dynamically restructure the network to achieve better learning results. Additionally, the power consumption of our proposed system adapts to the network size thanks to a new processing element design. Based on our test cases on networks with hidden layer sizes ranging from 32 to 16384, our proposed system achieves a 7x to 346x speedup over a GPU implementation and a 22x to 77x speedup over a hand-coded CPU counterpart.
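
The Q-learning primitives this accelerator implements in hardware can be sketched in a few lines of Python (names illustrative, not from the paper): the epsilon-greedy exploration-exploitation choice, and the temporal-difference target the neural network is trained towards.

    import random
    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        # Exploration: a random action with probability epsilon;
        # exploitation: the action with the highest predicted Q-value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return int(np.argmax(q_values))

    def td_target(reward, q_next, gamma=0.99, terminal=False):
        # Bellman backup the network is regressed towards:
        # r + gamma * max_a' Q(s', a') for non-terminal transitions.
        return reward if terminal else reward + gamma * float(np.max(q_next))

In the paper's setting, both the forward pass producing q_values and the training step towards td_target run on the FPGA, within the agent's power budget.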

39 citations


Journal ArticleDOI
TL;DR: A large number of FPGA-based accelerators have been proposed to improve the performance of convolutional neural networks.
Abstract: Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improv...

30 citations


Journal ArticleDOI
TL;DR: This paper presents an approach inspired by paravirtualized machines for the integration of reconfigurable hardware into cloud services, and uses partial reconfiguration to virtualize a single physical FPGA to enable multiple independent user designs.
Abstract: Computing performance and scalability are essential basics in modern data centres. Field Programmable Gate Arrays (FPGAs) provide a promising opportunity to improve performance, security and energy efficiency. Background acceleration of computationally complex and long-running tasks is an especially important field of application. A flexible use of reconfigurable devices within a cloud context requires an abstraction of the actual hardware through virtualization. In this paper we present an approach inspired by paravirtualized machines for the integration of reconfigurable hardware into cloud services. Using partial reconfiguration, our hardware and software framework virtualizes a single physical FPGA to enable multiple independent user designs. Essential components are the management of those virtual user-defined accelerators (vFPGAs) and their migration between physical FPGAs to achieve higher system-wide utilization. The migration requires saving and restoring the internal state, or context, of the vFPGA. We demonstrate the application possibilities and the resource trade-offs of our approach by transferring a running design from one physical FPGA to another. Moreover, we present future perspectives for the use of FPGAs in cloud-based environments.
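
The migration flow described above can be pictured as a small management routine. The sketch below is hypothetical (the mock Fpga class and every method name are illustrative, not the paper's framework API), but it captures the save/restore ordering the abstract describes.

    class Fpga:
        # In-memory stand-in for a physical FPGA's management interface.
        def __init__(self):
            self.slots = {}                       # vFPGA id -> (bitstream, context)
        def halt(self, vid): pass                 # quiesce the slot's clocks and I/O
        def resume(self, vid): pass
        def read_context(self, vid):              # registers and BRAM contents
            return self.slots[vid][1]
        def program_partial(self, vid, bitstream):
            self.slots[vid] = (bitstream, None)
        def write_context(self, vid, context):
            self.slots[vid] = (self.slots[vid][0], context)

    def migrate_vfpga(vid, bitstream, src, dst):
        # Quiesce, capture the context, re-instantiate the partial
        # bitstream on the target device, restore the context, resume.
        src.halt(vid)
        context = src.read_context(vid)
        dst.program_partial(vid, bitstream)
        dst.write_context(vid, context)
        dst.resume(vid)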

20 citations


Journal ArticleDOI
TL;DR: This study explores applying the method of offline/static routing to collective operations, in particular, multicast and reduction, and believes that this is one of the few general offline/ static routing solutions for real HPC clusters, and FPGA-centric clusters in particular.
Abstract: FPGA-centric clouds and clusters provide direct and programmable interconnects (DPI) with obvious benefits for communication latency and bandwidth. One rarely studied aspect of DPI is that it facilitates application-aware routing: if communication patterns are static and known a priori, as is usually the case, then judicious routing can reduce congestion, latency, and the hardware required. In this study we explore applying the method of offline/static routing to collective operations, in particular multicast and reduction. An entirely new communication infrastructure is proposed and implemented, including the switch design and routing algorithm. A substantial improvement in performance is obtained, especially for multicast. We believe that this is one of the few general offline/static routing solutions for real HPC clusters, and FPGA-centric clusters in particular.
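
As a flavor of what offline/static routing means here, the sketch below (a generic shortest-path construction, not the paper's switch design or routing algorithm) computes a multicast tree ahead of time from a known topology and communication pattern; the resulting edge set is what static routing tables would be configured with.

    from collections import deque

    def bfs_parents(adj, src):
        # Offline shortest-path computation over the cluster topology
        # (assumed connected), done once at configuration time.
        parent = {src: None}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    q.append(v)
        return parent

    def multicast_tree(adj, src, dests):
        # Union of shortest paths from the source to each destination:
        # a static multicast tree that avoids duplicate packets on
        # shared links.
        parent = bfs_parents(adj, src)
        edges = set()
        for d in dests:
            v = d
            while parent[v] is not None:
                edges.add((parent[v], v))
                v = parent[v]
        return edges

    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(multicast_tree(adj, 0, [1, 3]))   # {(0, 1), (1, 3)}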

19 citations


Journal ArticleDOI
TL;DR: Intel's SGX secure execution technology allows running computations on secret data using untrusted servers; recent work has shown how to port applications and large-scale computations to run under it.
Abstract: Intel's SGX secure execution technology allows running computations on secret data using untrusted servers. While recent work showed how to port applications and large-scale computations to run und...

18 citations


Journal ArticleDOI
TL;DR: Modern DRAM-based systems suffer from significant energy and latency penalties due to conservative DRAM refresh standards.
Abstract: Modern DRAM-based systems suffer from significant energy and latency penalties due to conservative DRAM refresh standards. Volatile DRAM cells can retain information across a wide distribution of t...

10 citations


Journal ArticleDOI
TL;DR: Demand for low-power data processing hardware continues to rise inexorably, and existing programmable and "general purpose" solutions are insufficient.
Abstract: Demand for low-power data processing hardware continues to rise inexorably. Existing programmable and "general purpose" solutions (e.g., SIMD, GPGPUs) are insufficient, as evidenced by the order-of-m...

10 citations


Journal ArticleDOI
TL;DR: A number of suggestions are made to improve GPU architecture, resulting in potentially greatly increased performance for bioinformatics-class algorithms, including BWA-MEM.
Abstract: Next Generation Sequencing techniques have resulted in an exponential growth in the generation of genetics data, the amount of which will soon rival, if not overtake, that of other Big Data fields, such as astronomy and streaming video services. To become useful, this data requires processing by a complex pipeline of algorithms, taking multiple days even on large clusters. The mapping stage of such genomics pipelines, which maps the short reads onto a reference genome, takes up a significant portion of execution time. BWA-MEM is the de facto industry standard for the mapping stage. Here, a GPU-accelerated implementation of BWA-MEM is proposed. The Seed Extension phase, one of the three main BWA-MEM algorithm phases, which requires between 30% and 50% of overall processing time, is offloaded onto the GPU. A thorough design space analysis is presented for an optimized mapping of this phase onto the GPU. The resulting systolic-array-based implementation obtains a twofold overall application-level speedup, which is the maximum theoretically achievable speedup. Moreover, this speedup is sustained for systems with up to twenty-two logical cores. Based on the findings, a number of suggestions are made to improve GPU architecture, resulting in potentially greatly increased performance for bioinformatics-class algorithms.
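
The twofold ceiling quoted above is just Amdahl's law applied to the offloaded phase; a quick check in plain Python, using the fractions from the abstract:

    def amdahl_speedup(fraction_accelerated, phase_speedup):
        # Overall speedup when only a fraction of the runtime is accelerated.
        return 1.0 / ((1.0 - fraction_accelerated)
                      + fraction_accelerated / phase_speedup)

    # Seed Extension takes 30%-50% of BWA-MEM runtime; even an infinitely
    # fast GPU phase caps the application-level speedup at 1.43x-2.0x.
    for f in (0.3, 0.5):
        print(f, amdahl_speedup(f, float("inf")))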

10 citations


Journal ArticleDOI
TL;DR: The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling.
Abstract: The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance densi...

10 citations


Journal ArticleDOI
TL;DR: A first prototype system with the Hop-count filtering and Ingress/Egress filtering techniques is implemented using the Xilinx Virtex 5 xc5vtx240t FPGA device.
Abstract: This paper proposes an FPGA-based multicore architecture that integrates multiple DDoS defense mechanisms for DDoS protection. The architecture allows multiple cooperating DDoS mitigation techniques to classify incoming network packets. The proposed architecture consists of two separate partitions: static and dynamic. The static partition includes the packet pre-processing and post-processing modules, while the DDoS filtering techniques are implemented within the dynamic partition. These filtering techniques can be implemented by hardware custom computing cores, general-purpose soft processors, or both. In all cases, these DDoS filtering computing cores can be updated or changed at runtime or at design time. We implement our first prototype system with the Hop-count filtering and Ingress/Egress filtering techniques using the Xilinx Virtex 5 xc5vtx240t FPGA device. The synthesis results show that the system can work at up to 116.782 MHz while utilizing about 41% of the LUTs, 47% of the Registers, and 53% of the Block Memory of the available hardware resources. Experimental results show that our system achieves a 100% detection rate (true positives) with a 0% false negative rate and a maximum 0.74% false positive rate. Moreover, the prototype system obtains a packet processing throughput of up to 9.869 Gbps in half-duplex mode and 19.738 Gbps in full-duplex mode.
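
For readers unfamiliar with Hop-count filtering, a minimal software sketch of the idea follows (simplified; the paper implements this in FPGA logic): infer the hop count from a packet's TTL by assuming the nearest common initial TTL, and flag packets whose inferred hop count disagrees with the value previously learned for that source address.

    COMMON_INITIAL_TTLS = (32, 64, 128, 255)

    def infer_hop_count(observed_ttl):
        # The initial TTL is assumed to be the smallest common default
        # >= the observed TTL; the difference is the hop count.
        for init in COMMON_INITIAL_TTLS:
            if observed_ttl <= init:
                return init - observed_ttl
        return 0

    def is_spoofed(src_ip, observed_ttl, hop_table):
        # A mismatch against the learned per-source hop count suggests
        # a spoofed source address, typical of DDoS traffic.
        expected = hop_table.get(src_ip)
        return expected is not None and infer_hop_count(observed_ttl) != expected

    table = {"198.51.100.7": 14}
    print(is_spoofed("198.51.100.7", 50, table))    # False: 64 - 50 = 14 hops
    print(is_spoofed("198.51.100.7", 120, table))   # True: 128 - 120 = 8 != 14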

Journal ArticleDOI
TL;DR: High-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization in floating-point matrix multiplication on FPGAs.
Abstract: In the last decade, floating-point matrix multiplication on FPGAs has been studied extensively, and efficient architectures as well as detailed performance models have been developed. By design, these IP cores take a fixed footprint which does not necessarily optimize the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper, high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices.
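
A software model makes the Gray code idea concrete: if consecutive block products differ in only one block index, two of the three blocks involved stay resident between steps, so only one block must be streamed per product. The snake-style schedule below is one simple ordering with that property (an illustration, not the paper's exact scheme); in the paper the block product runs on the FPGA and the accumulation on the CPU.

    import numpy as np

    def gray_block_schedule(nb):
        # Reflected ("snake") walk over (i, k, j): consecutive triples
        # differ in exactly one index, maximizing operand block reuse.
        order, j_fwd, k_fwd = [], True, True
        for i in range(nb):
            for k in (range(nb) if k_fwd else range(nb - 1, -1, -1)):
                for j in (range(nb) if j_fwd else range(nb - 1, -1, -1)):
                    order.append((i, k, j))
                j_fwd = not j_fwd
            k_fwd = not k_fwd
        return order

    def blocked_matmul(A, B, bs):
        # Block products are independent and summation is commutative,
        # so any schedule covering all (i, k, j) triples yields A @ B.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i, k, j in gray_block_schedule(n // bs):
            C[i*bs:(i+1)*bs, j*bs:(j+1)*bs] += (
                A[i*bs:(i+1)*bs, k*bs:(k+1)*bs] @ B[k*bs:(k+1)*bs, j*bs:(j+1)*bs])
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(blocked_matmul(A, B, 2), A @ B)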

Journal ArticleDOI
TL;DR: This paper proposes a heterogeneous computing platform based on virtualization technology, namely hCODE, which brings multiple benefits such as accelerating a program without modifying or recompiling it, and enabling high portability and scalability across different hardware and operating systems.
Abstract: One challenge for heterogeneous computing with FPGAs is how to bridge the development gap between SW and HW designs. The high-level synthesis (HLS) technique allows producing hardware with high-level languages like C. Design tools based on HLS, like Xilinx SDSoC and SDAccel, have been developed to speed up SW/HW co-design. However, developers still require considerable circuit-design skill to use these tools efficiently. In this paper, we propose a heterogeneous computing platform based on virtualization technology, namely hCODE. With the help of virtualization, the HW and SW designs can be completely separated. This brings multiple benefits, such as accelerating a program without modifying or recompiling it, and enabling high portability and scalability across different hardware and operating systems.

Journal ArticleDOI
TL;DR: A comparison of a state-of-the-art FPGA HLS tool, Vivado HLS, and an FPGA overlay tool, ArchSyn, on two computation-intensive kernels, matrix-matrix multiplication and fast Fourier transform, shows an overwhelming superiority of the overlay in computation performance: it is 8X to 39X faster than FPGA HLS.
Abstract: To promote FPGAs to a wider user community and to increase design productivity, two design methodologies, namely FPGA high-level synthesis (HLS) and FPGA overlay, have been presented that use a high-level design abstraction. To clarify the distinguishing features of each design methodology, we compare a state-of-the-art FPGA HLS tool, Vivado HLS, and an FPGA overlay tool, ArchSyn, on two computation-intensive kernels: matrix-matrix multiplication and fast Fourier transform. In the comparison, the FPGA overlay shows an overwhelming superiority in computation performance, being 8X to 39X faster than FPGA HLS. However, FPGA HLS exhibits its advantage in the dynamic power consumption metric, achieving up to 17X lower power consumption than the FPGA overlay. Power and energy efficiency are two further essential metrics for evaluating trade-offs between performance and power consumption. As demonstrated by the evaluation results, the FPGA overlay is on average 3.5X better in power efficiency for the FFT kernel, and achieves up to two orders of magnitude better energy efficiency than FPGA HLS.

Journal ArticleDOI
TL;DR: Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward.
Abstract: Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiproc...

Journal ArticleDOI
TL;DR: A cost-effective and high-throughput merge network is proposed for the fastest FPGA sorting accelerator, achieving a throughput of 8 data elements per 200 MHz clock cycle.
Abstract: High-performance sorting is used in various areas such as database transactions and genomic feature operations. To improve sorting performance, in addition to the conventional approach of using general-purpose processors or GPUs, the approach of using FPGAs is becoming a promising solution. Casper and Olukotun have recently proposed the fastest FPGA sorting accelerator known so far. In their study, they proposed a merge network which can merge two sorted data series at a throughput of 6 data elements per 200 MHz clock cycle. If an FPGA sorting accelerator is constructed using merge networks, the overall throughput will be mainly determined by the throughputs of the merge networks. This motivates us to design a merge network which outputs more than 6 data elements per 200 MHz clock cycle. In this paper, we propose a cost-effective and high-throughput merge network for the fastest FPGA sorting accelerator. The evaluation shows that our proposal achieves a throughput of 8 data elements per 200 MHz clock cycle.
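
Functionally, a merge network that outputs k elements per cycle behaves like the following software model (a behavioral sketch, not the circuit): each simulated cycle removes the k smallest remaining elements from the heads of two sorted inputs.

    from collections import deque

    def merge_k_per_cycle(xs, ys, k=8):
        # Behavioral model of a k-element-per-cycle hardware merge network.
        a, b = deque(xs), deque(ys)
        out, cycles = [], 0
        while a or b:
            for _ in range(k):
                if not a and not b:
                    break
                src = a if (a and (not b or a[0] <= b[0])) else b
                out.append(src.popleft())
            cycles += 1
        return out, cycles

    out, cycles = merge_k_per_cycle(range(0, 1600, 2), range(1, 1600, 2))
    assert out == sorted(out) and cycles == 200   # 1600 elements / 8 per cycle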

Journal ArticleDOI
TL;DR: GPUs have been widely adopted in data centers to provide acceleration services to many applications; sharing a GPU is increasingly important for better processing throughput and energy efficiency.
Abstract: GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. ...

Journal ArticleDOI
TL;DR: An FFT circuit based on the nested residue number system (NRNS), which recursively decomposes the RNS, satisfies the required size and speed specifications on an available FPGA, where the excessive number of LUTs had been the bottleneck of the binary FFT.
Abstract: A radio telescope analyzes radio frequency (RF) signals received from celestial objects. It consists of an antenna, a receiver, and a spectrometer. The spectrometer converts the time domain into the frequency domain by an FFT operation. This paper applies an FFT circuit based on the nested residue number system (NRNS), which recursively decomposes the RNS. It can decompose the MAC unit into small circuits. In the FFT using the NRNS, a MAC unit is decomposed into 4-bit units realized by the look-up tables of the FPGA. Also, to realize the scaling (truncation) circuit, we propose a constant division algorithm on the FPGA. The truncation is realized by the division of the dynamic range for a subset of the moduli. We implemented the proposed NRNS FFT on the Xilinx Inc. Virtex 6 FPGA. Compared with a Xilinx Inc. binary FFT library, although the number of block RAMs (BRAMs) increased by 38%, in the RNS FFT the number of LUTs decreased by 42-45% and the maximum clock frequency increased by 38-74%. With this technique, we successfully implemented an FFT that satisfied the required size and speed specifications on an available FPGA, since the excessive number of LUTs was the bottleneck of the binary FFT.
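
The appeal of residue arithmetic for MAC units is easy to demonstrate in software: with pairwise-coprime moduli, multiply-accumulate splits into independent small-width channels, and the result is recovered by the Chinese Remainder Theorem. The sketch below uses three illustrative 4-bit moduli (the paper's nested decomposition and scaling circuit are not modeled).

    from math import prod

    MODULI = (13, 15, 16)            # pairwise coprime; range 13*15*16 = 3120

    def to_rns(x):
        # An integer is represented by its residue in each channel.
        return tuple(x % m for m in MODULI)

    def rns_mac(acc, a, b):
        # Multiply-accumulate is independent per channel, so each channel
        # needs only 4-bit arithmetic -- small enough for FPGA look-up tables.
        return tuple((r + x * y) % m for r, x, y, m in zip(acc, a, b, MODULI))

    def from_rns(res):
        # Chinese Remainder Theorem reconstruction.
        M = prod(MODULI)
        return sum(r * (M // m) * pow(M // m, -1, m)
                   for r, m in zip(res, MODULI)) % M

    acc = rns_mac(to_rns(0), to_rns(21), to_rns(33))
    assert from_rns(acc) == 21 * 33   # 693, well within the dynamic range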

Journal ArticleDOI
TL;DR: This work presents a pre-synthesized overlay fabric and an algorithm to enable rapid triggering, and evaluates the techniques using VPR, showing that the overlay and mapping algorithm together are at least an order of magnitude faster than previous work, resulting in a significant reduction in debug turn-around times.
Abstract: Embedded system designers can benefit from FPGA accelerators to achieve higher performance and efficiency. However, there are challenges that do not exist in software development; using software simulators to validate large and complex hardware designs can be extremely slow and impractical. Debugging designs implemented on an FPGA enables running the design at speed for long runs and more exhaustive test cases. However, limited observability is the primary challenge in hardware debug. To enhance hardware observability, trace buffers and trigger circuitry are inserted into the design. During device operation, a history of signals of interest is recorded into the trace buffers for off-line debug and validation. Recompiling the design every time the designer wishes to modify the trigger condition results in long debug turn-around times and reduced productivity. In this work, we present a pre-synthesized overlay fabric and an algorithm to enable rapid triggering; during debug turn-around, TriggerPlus, a greedy algorithm, is used to implement a trigger circuit on the overlay. TriggerPlus is fast and simple, yet still capable of mapping the trigger circuit to the overlay fabric. We evaluate our techniques using VPR, showing that using our overlay and mapping algorithm together is at least an order of magnitude faster than the previous work, resulting in a significant reduction in debug turn-around times.

Journal ArticleDOI
TL;DR: Energy efficiency is one of the most important design considerations in running modern datacenters; datacenter operating systems rely on software techniques such as execution migration to achieve it.
Abstract: Energy efficiency is one of the most important design considerations in running modern datacenters. Datacenter operating systems rely on software techniques such as execution migration to achieve e...

Journal ArticleDOI
TL;DR: With increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention.
Abstract: With increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention. In the rad...

Journal ArticleDOI
TL;DR: For accelerating image recognition and object tracking, a one-dimensional data pipeline architecture on a field-programmable gate array (FPGA) is proposed that satisfies both high-speed streaming computation and small circuit size by exploiting spatiotemporal data dependence.
Abstract: A significant challenge facing sport science is how to grasp the flow of a game and analyze the situation of a match. The use of information technology can help achieve this goal. The technical issues from a practical application perspective can be classified into three main points: computation speed, system size, and complex data analysis with sufficient accuracy. In this paper, for accelerating image recognition and object tracking, we propose a one-dimensional data pipeline architecture on a field-programmable gate array (FPGA). It satisfies both high-speed streaming computation and small circuit size by considering spatiotemporal data dependence. Volleyball games have been chosen as the target application. The proposed system identifies the positions of six volleyball players in real time. The design on the FPGA includes pre-processing, color filtering, digitalization, noise reduction, template matching, and so on. The design was implemented and evaluated on an Atlys Spartan-6 FPGA Trainer Board with one Xilinx Spartan-6 LX45 FPGA. The computational performance achieves 100 frames per second at SVGA 800 by 600 pixel resolution. Our design also has good scalability; the performance can easily be enhanced when a larger FPGA is used. The proposed system is also compact, composed of one Atlys board and one Atlys VmodCAM stereo-camera board. The average accuracy rates for pregame situations and during a match are 87.1% and 65.7%, respectively. Since the input is streaming data, we can improve the accuracy by considering the previous and next frames: the rates improve to 90.4% and 72.2%, respectively, when we adopt template matching with a moving average filter.
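
The moving-average refinement mentioned at the end can be illustrated in a few lines of Python (illustrative only; the paper implements the pipeline in FPGA logic): averaging each player's template-matching position over a short window of frames suppresses single-frame mismatches.

    from collections import deque

    class MovingAverageTracker:
        # Smooths per-frame detections over the last `window` frames.
        def __init__(self, window=5):
            self.history = deque(maxlen=window)

        def update(self, xy):
            self.history.append(xy)
            n = len(self.history)
            return (sum(p[0] for p in self.history) / n,
                    sum(p[1] for p in self.history) / n)

    tracker = MovingAverageTracker(window=3)
    for xy in [(10, 20), (11, 21), (30, 90), (12, 22)]:   # one outlier frame
        print(tracker.update(xy))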

Journal ArticleDOI
TL;DR: Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary, and thus frees researchers to work on truly decentralised systems.
Abstract: Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary. Direct I/O thus frees researchers f...

Journal ArticleDOI
TL;DR: To mitigate excessive TLB misses in large memory applications, techniques such as large pages, variable length segments, and HW coalescing are used to increase the coverage of the limited hardware translation entries.
Abstract: To mitigate excessive TLB misses in large memory applications, techniques such as large pages, variable length segments, and HW coalescing increase the coverage of limited hardware translation ent...

Journal ArticleDOI
TL;DR: Processors and operating systems (OSes) support multiple memory page sizes, and superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained memory protection.
Abstract: Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained memory protection. Id...

Journal ArticleDOI
TL;DR: This paper proposes an FPGA solver for partial maximum satisfiability (PMS) problems based on the Dist algorithm, which is one of the best performing stochastic local search algorithms for PMS problems.
Abstract: In this paper, we propose an FPGA solver for partial maximum satisfiability (PMS) problems based on the Dist algorithm, which is one of the best-performing stochastic local search algorithms for PMS problems. The Dist algorithm searches for a truth assignment for the variables that satisfies all of the hard clauses and as many soft clauses as possible by iteratively selecting a variable using a heuristic and flipping its truth value. During each iteration, new candidate variables for flipping are generated and existing ones may disappear. In our solver, the variables that may become new candidates for flipping are evaluated by parallel and pipeline processing, and then only the variables that actually become candidates for flipping are extracted and gathered up concurrently with the pipeline processing. The extraction process is not influenced by the number of new candidates or their random generation, which minimizes the disturbance of the parallel and pipeline processing. Our FPGA solver can solve large PMS problems up to 7.74 times faster than running Dist on a CPU.
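
To make the search loop concrete, here is a deliberately simplified Dist-style local search in Python (the real Dist heuristic maintains clause weights and variable scores, which the FPGA solver evaluates in parallel; those are elided here): satisfy every hard clause first, then minimize the number of unsatisfied soft clauses by flipping one variable per iteration.

    import random

    def unsat(clauses, assign):
        # A clause is a tuple of signed ints: +v requires variable v to be
        # True, -v requires it to be False.
        return [c for c in clauses if not any((l > 0) == assign[abs(l)] for l in c)]

    def local_search_pms(n_vars, hard, soft, steps=10_000):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        best, best_cost = dict(assign), float("inf")
        for _ in range(steps):
            h, s = unsat(hard, assign), unsat(soft, assign)
            cost = float("inf") if h else len(s)
            if cost < best_cost:
                best, best_cost = dict(assign), cost
            if cost == 0:
                break
            clause = random.choice(h or s)               # hard clauses first
            assign[abs(random.choice(clause))] ^= True   # flip one variable
        return best, best_cost

    hard = [(1, 2), (-1, 2)]   # together these force x2 = True
    soft = [(-2,), (1,)]       # (-2,) then becomes unsatisfiable
    print(local_search_pms(2, hard, soft))   # best leaves 1 soft clause unsat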

Journal ArticleDOI
TL;DR: CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability.
Abstract: CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To...

Journal ArticleDOI
TL;DR: Byte-addressable non-volatile memory technology is emerging as an alternative for DRAM for main memory; this new Non-Volatile Main Memory (NVMM) allows programmers to store important data in persistent data structures.
Abstract: Byte-addressable non-volatile memory technology is emerging as an alternative for DRAM for main memory. This new Non-Volatile Main Memory (NVMM) allows programmers to store important data in data s...

Journal ArticleDOI
TL;DR: High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance.
Abstract: High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance. Memory "cubes" with high per-package capacity (fro...

Journal ArticleDOI
TL;DR: The Do-It-Yourself virtual memory translation (DVMT) architecture is introduced as a flexible complement for current hardware-fixed translation flows, decoupling the virtual-to-physical translation.
Abstract: In this paper, we introduce the Do-It-Yourself virtual memory translation (DVMT) architecture as a flexible complement for current hardware-fixed translation flows. DVMT decouples the virtual-to-ph...