Showing papers in "ACM SIGARCH Computer Architecture News" in 2017


Journal ArticleDOI
TL;DR: PARSEC 3.0 is introduced, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain, and that integrates splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently.
Abstract: Benchmarks play a very important role in accelerating the development and research of CMPs. As one of them, the PARSEC suite continues to be updated and revised so that it can offer better support for researchers. The former versions of PARSEC have enough workloads to evaluate CMP properties related to CPU, cache, and memory, but they lack applications built on a network stack to assess the network performance of CMPs. In this work, we introduce PARSEC 3.0, a new version of the PARSEC suite that implements a user-level network stack and generates three network workloads with this stack to cover the network domain. We explore the input sets of splash-2 and expand them to multiple scales, a.k.a. splash-2x. We integrate splash-2 and splash-2x into the PARSEC framework so that researchers can use these benchmark suites conveniently. Finally, we evaluate the u-TCP/IP stack and the new network workloads, and analyze the characteristics of splash-2 and splash-2x.

50 citations


Journal ArticleDOI
TL;DR: This work proposes an FPGA acceleration system design for Neural Network Q-learning (NNQL) that has high flexibility due to its support for run-time network parameterization, which allows neuroevolution algorithms to dynamically restructure the network to achieve better learning results.
Abstract: Deep Q-learning (DQN) is a recently proposed reinforcement learning algorithm where a neural network is applied as a non-linear approximator to its value function. The exploration-exploitation mechanism allows the training and prediction of the NN to execute simultaneously in an agent during its interaction with the environment. Agents often act independently on battery power, so the training and prediction must occur within the agent and on a limited power budget. In this work, we propose an FPGA acceleration system design for Neural Network Q-learning (NNQL). Our proposed system has high flexibility due to its support for run-time network parameterization, which allows neuroevolution algorithms to dynamically restructure the network to achieve better learning results. Additionally, the power consumption of our proposed system adapts to the network size thanks to a new processing element design. Based on our test cases on networks with hidden layer sizes ranging from 32 to 16384, our proposed system achieves a 7x to 346x speedup over a GPU implementation and a 22x to 77x speedup over a hand-coded CPU counterpart.
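
The Q-learning primitives this accelerator implements in hardware can be sketched in a few lines of Python (names illustrative, not from the paper): the epsilon-greedy exploration-exploitation choice, and the temporal-difference target the neural network is trained towards.

    import random
    import numpy as np

    def epsilon_greedy(q_values, epsilon):
        # Exploration: a random action with probability epsilon;
        # exploitation: the action with the highest predicted Q-value.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return int(np.argmax(q_values))

    def td_target(reward, q_next, gamma=0.99, terminal=False):
        # Bellman backup the network is regressed towards:
        # r + gamma * max_a' Q(s', a') for non-terminal transitions.
        return reward if terminal else reward + gamma * float(np.max(q_next))

In the paper's setting, both the forward pass producing q_values and the training step towards td_target run on the FPGA, within the agent's power budget.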

39 citations


Journal ArticleDOI
TL;DR: A large number of FPGA-based accelerators have been proposed to improve the performance of convolutional neural networks.
Abstract: Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improv...

30 citations


Journal ArticleDOI
TL;DR: This paper presents an approach inspired by paravirtualized machines for the integration of reconfigurable hardware into cloud services, and uses partial reconfiguration to virtualize a single physical FPGA to enable multiple independent user designs.
Abstract: Computing performance and scalability are essential basics in modern data centres. Field Programmable Gate Arrays (FPGAs) provide a promising opportunity to improve performance, security and energy efficiency. Background acceleration of computationally complex and long-running tasks is an especially important field of application. A flexible use of reconfigurable devices within a cloud context requires an abstraction of the actual hardware through virtualization. In this paper we present an approach inspired by paravirtualized machines for the integration of reconfigurable hardware into cloud services. Using partial reconfiguration, our hardware and software framework virtualizes a single physical FPGA to enable multiple independent user designs. Essential components are the management of those virtual user-defined accelerators (vFPGAs) and their migration between physical FPGAs to achieve higher system-wide utilization. The migration requires saving and restoring the internal state, or context, of the vFPGA. We demonstrate the application possibilities and the resource trade-offs of our approach by transferring a running design from one physical FPGA to another. Moreover, we present future perspectives for the use of FPGAs in cloud-based environments.
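
The migration flow described above can be pictured as a small management routine. The sketch below is hypothetical (the mock Fpga class and every method name are illustrative, not the paper's framework API), but it captures the save/restore ordering the abstract describes.

    class Fpga:
        # In-memory stand-in for a physical FPGA's management interface.
        def __init__(self):
            self.slots = {}                       # vFPGA id -> (bitstream, context)
        def halt(self, vid): pass                 # quiesce the slot's clocks and I/O
        def resume(self, vid): pass
        def read_context(self, vid):              # registers and BRAM contents
            return self.slots[vid][1]
        def program_partial(self, vid, bitstream):
            self.slots[vid] = (bitstream, None)
        def write_context(self, vid, context):
            self.slots[vid] = (self.slots[vid][0], context)

    def migrate_vfpga(vid, bitstream, src, dst):
        # Quiesce, capture the context, re-instantiate the partial
        # bitstream on the target device, restore the context, resume.
        src.halt(vid)
        context = src.read_context(vid)
        dst.program_partial(vid, bitstream)
        dst.write_context(vid, context)
        dst.resume(vid)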

20 citations


Journal ArticleDOI
TL;DR: This study explores applying the method of offline/static routing to collective operations, in particular, multicast and reduction, and believes that this is one of the few general offline/ static routing solutions for real HPC clusters, and FPGA-centric clusters in particular.
Abstract: FPGA-centric clouds and clusters provide direct and programmable interconnects (DPI) with obvious benefits for communication latency and bandwidth. One rarely studied aspect of DPI is that it facilitates application-aware routing: if communication patterns are static and known a priori, as is usually the case, then judicious routing can reduce congestion, latency, and the hardware required. In this study we explore applying the method of offline/static routing to collective operations, in particular multicast and reduction. An entirely new communication infrastructure is proposed and implemented, including the switch design and routing algorithm. A substantial improvement in performance is obtained, especially for multicast. We believe that this is one of the few general offline/static routing solutions for real HPC clusters, and FPGA-centric clusters in particular.
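
As a flavor of what offline/static routing means here, the sketch below (a generic shortest-path construction, not the paper's switch design or routing algorithm) computes a multicast tree ahead of time from a known topology and communication pattern; the resulting edge set is what static routing tables would be configured with.

    from collections import deque

    def bfs_parents(adj, src):
        # Offline shortest-path computation over the cluster topology
        # (assumed connected), done once at configuration time.
        parent = {src: None}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    q.append(v)
        return parent

    def multicast_tree(adj, src, dests):
        # Union of shortest paths from the source to each destination:
        # a static multicast tree that avoids duplicate packets on
        # shared links.
        parent = bfs_parents(adj, src)
        edges = set()
        for d in dests:
            v = d
            while parent[v] is not None:
                edges.add((parent[v], v))
                v = parent[v]
        return edges

    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(multicast_tree(adj, 0, [1, 3]))   # {(0, 1), (1, 3)}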

19 citations


Journal ArticleDOI
TL;DR: Intel's SGX secure execution technology allows running computations on secret data using untrusted servers; recent work has shown how to port applications and large-scale computations to run under it.
Abstract: Intel's SGX secure execution technology allows running computations on secret data using untrusted servers. While recent work showed how to port applications and large-scale computations to run und...

18 citations


Journal ArticleDOI
TL;DR: Modern DRAM-based systems suffer from significant energy and latency penalties due to conservative DRAM refresh standards.
Abstract: Modern DRAM-based systems suffer from significant energy and latency penalties due to conservative DRAM refresh standards. Volatile DRAM cells can retain information across a wide distribution of t...

10 citations


Journal ArticleDOI
TL;DR: Demand for low-power data processing hardware continues to rise inexorably, and existing programmable and "general purpose" solutions are insufficient.
Abstract: Demand for low-power data processing hardware continues to rise inexorably. Existing programmable and "general purpose" solutions (e.g., SIMD, GPGPUs) are insufficient, as evidenced by the order-of-m...

10 citations


Journal ArticleDOI
TL;DR: A number of suggestions are made to improve GPU architecture, resulting in potentially greatly increased performance for bioinformatics-class algorithms, including BWA-MEM.
Abstract: Next Generation Sequencing techniques have resulted in an exponential growth in the generation of genetics data, the amount of which will soon rival, if not overtake, that of other Big Data fields, such as astronomy and streaming video services. To become useful, this data requires processing by a complex pipeline of algorithms, taking multiple days even on large clusters. The mapping stage of such genomics pipelines, which maps the short reads onto a reference genome, takes up a significant portion of execution time. BWA-MEM is the de facto industry standard for the mapping stage. Here, a GPU-accelerated implementation of BWA-MEM is proposed. The Seed Extension phase, one of the three main BWA-MEM algorithm phases, which requires between 30% and 50% of overall processing time, is offloaded onto the GPU. A thorough design space analysis is presented for an optimized mapping of this phase onto the GPU. The resulting systolic-array-based implementation obtains a twofold overall application-level speedup, which is the maximum theoretically achievable speedup. Moreover, this speedup is sustained for systems with up to twenty-two logical cores. Based on the findings, a number of suggestions are made to improve GPU architecture, resulting in potentially greatly increased performance for bioinformatics-class algorithms.
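
The twofold ceiling quoted above is just Amdahl's law applied to the offloaded phase; a quick check in plain Python, using the fractions from the abstract:

    def amdahl_speedup(fraction_accelerated, phase_speedup):
        # Overall speedup when only a fraction of the runtime is accelerated.
        return 1.0 / ((1.0 - fraction_accelerated)
                      + fraction_accelerated / phase_speedup)

    # Seed Extension takes 30%-50% of BWA-MEM runtime; even an infinitely
    # fast GPU phase caps the application-level speedup at 1.43x-2.0x.
    for f in (0.3, 0.5):
        print(f, amdahl_speedup(f, float("inf")))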

10 citations


Journal ArticleDOI
TL;DR: The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling.
Abstract: The increasing demand for extracting value out of ever-growing data poses an ongoing challenge to system designers, a task only made trickier by the end of Dennard scaling. As the performance densi...

10 citations


Journal ArticleDOI
TL;DR: A first prototype system with the Hop-count filtering and Ingress/Egress filtering techniques is implemented using the Xilinx Virtex 5 xc5vtx240t FPGA device.
Abstract: This paper proposes an FPGA-based multicore architecture that integrates multiple DDoS defense mechanisms for DDoS protection. The architecture allows multiple cooperating DDoS mitigation techniques to classify incoming network packets. The proposed architecture consists of two separate partitions: static and dynamic. The static partition includes the packet pre-processing and post-processing modules, while the DDoS filtering techniques are implemented within the dynamic partition. These filtering techniques can be implemented by hardware custom computing cores, general-purpose soft processors, or both. In all cases, these DDoS filtering computing cores can be updated or changed at runtime or at design time. We implement our first prototype system with the Hop-count filtering and Ingress/Egress filtering techniques using the Xilinx Virtex 5 xc5vtx240t FPGA device. The synthesis results show that the system can work at up to 116.782 MHz while utilizing about 41% of the LUTs, 47% of the Registers, and 53% of the Block Memory of the available hardware resources. Experimental results show that our system achieves a 100% detection rate (true positives) with a 0% false negative rate and a maximum 0.74% false positive rate. Moreover, the prototype system obtains a packet processing throughput of up to 9.869 Gbps in half-duplex mode and 19.738 Gbps in full-duplex mode.
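
For readers unfamiliar with Hop-count filtering, a minimal software sketch of the idea follows (simplified; the paper implements this in FPGA logic): infer the hop count from a packet's TTL by assuming the nearest common initial TTL, and flag packets whose inferred hop count disagrees with the value previously learned for that source address.

    COMMON_INITIAL_TTLS = (32, 64, 128, 255)

    def infer_hop_count(observed_ttl):
        # The initial TTL is assumed to be the smallest common default
        # >= the observed TTL; the difference is the hop count.
        for init in COMMON_INITIAL_TTLS:
            if observed_ttl <= init:
                return init - observed_ttl
        return 0

    def is_spoofed(src_ip, observed_ttl, hop_table):
        # A mismatch against the learned per-source hop count suggests
        # a spoofed source address, typical of DDoS traffic.
        expected = hop_table.get(src_ip)
        return expected is not None and infer_hop_count(observed_ttl) != expected

    table = {"198.51.100.7": 14}
    print(is_spoofed("198.51.100.7", 50, table))    # False: 64 - 50 = 14 hops
    print(is_spoofed("198.51.100.7", 120, table))   # True: 128 - 120 = 8 != 14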

Journal ArticleDOI
TL;DR: High-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization in floating-point matrix multiplication on FPGAs.
Abstract: In the last decade, floating-point matrix multiplication on FPGAs has been studied extensively, and efficient architectures as well as detailed performance models have been developed. By design, these IP cores take a fixed footprint which does not necessarily optimize the use of all available resources. Moreover, the low-level architectures are not easily amenable to a parameterized synthesis. In this paper, high-level synthesis is used to fine-tune the configuration parameters in order to achieve the highest performance with maximal resource utilization. An exploration strategy is presented to optimize the use of critical resources (DSPs, memory) for any given FPGA. To account for the limited memory size on the FPGA, a block-oriented matrix multiplication is organized such that the block summation is done on the CPU while the block multiplication occurs on the logic fabric simultaneously. The communication overhead between the CPU and the FPGA is minimized by streaming the blocks in a Gray code ordering scheme which maximizes the data reuse for consecutive block matrix product calculations. Using high-level synthesis optimization, the programmable logic operates at 93% of the theoretical peak performance and the combined CPU-FPGA design achieves 76% of the available hardware processing speed for the floating-point multiplication of 2K by 2K matrices.
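
A software model makes the Gray code idea concrete: if consecutive block products differ in only one block index, two of the three blocks involved stay resident between steps, so only one block must be streamed per product. The snake-style schedule below is one simple ordering with that property (an illustration, not the paper's exact scheme); in the paper the block product runs on the FPGA and the accumulation on the CPU.

    import numpy as np

    def gray_block_schedule(nb):
        # Reflected ("snake") walk over (i, k, j): consecutive triples
        # differ in exactly one index, maximizing operand block reuse.
        order, j_fwd, k_fwd = [], True, True
        for i in range(nb):
            for k in (range(nb) if k_fwd else range(nb - 1, -1, -1)):
                for j in (range(nb) if j_fwd else range(nb - 1, -1, -1)):
                    order.append((i, k, j))
                j_fwd = not j_fwd
            k_fwd = not k_fwd
        return order

    def blocked_matmul(A, B, bs):
        # Block products are independent and summation is commutative,
        # so any schedule covering all (i, k, j) triples yields A @ B.
        n = A.shape[0]
        C = np.zeros((n, n))
        for i, k, j in gray_block_schedule(n // bs):
            C[i*bs:(i+1)*bs, j*bs:(j+1)*bs] += (
                A[i*bs:(i+1)*bs, k*bs:(k+1)*bs] @ B[k*bs:(k+1)*bs, j*bs:(j+1)*bs])
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(blocked_matmul(A, B, 2), A @ B)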

Journal ArticleDOI
TL;DR: This paper proposes a heterogeneous computing platform based on virtualization technology, namely hCODE, which brings multiple benefits such as accelerating a program without modifying or recompiling it, and enabling high portability and scalability across different hardware and operating systems.
Abstract: One challenge for heterogeneous computing with FPGAs is how to bridge the development gap between SW and HW designs. The high-level synthesis (HLS) technique allows producing hardware with high-level languages like C. Design tools based on HLS, like Xilinx SDSoC and SDAccel, have been developed to speed up SW/HW co-design. However, developers still require considerable circuit-design skill to use these tools efficiently. In this paper, we propose a heterogeneous computing platform based on virtualization technology, namely hCODE. With the help of virtualization, the HW and SW designs can be completely separated. This brings multiple benefits, such as accelerating a program without modifying or recompiling it, and enabling high portability and scalability across different hardware and operating systems.

Journal ArticleDOI
TL;DR: A comparison of a state-of-the-art FPGA HLS tool, Vivado HLS, and an FPGA overlay tool, ArchSyn, on two computation-intensive kernels, matrix-matrix multiplication and fast Fourier transform, shows an overwhelming superiority of the overlay in computation performance: it is 8X to 39X faster than FPGA HLS.
Abstract: To promote FPGAs to a wider user community and to increase design productivity, two design methodologies, namely FPGA high-level synthesis (HLS) and FPGA overlay, have been presented that use a high-level design abstraction. To clarify the distinguishing features of each design methodology, we compare a state-of-the-art FPGA HLS tool, Vivado HLS, and an FPGA overlay tool, ArchSyn, on two computation-intensive kernels: matrix-matrix multiplication and fast Fourier transform. In the comparison, the FPGA overlay shows an overwhelming superiority in computation performance, being 8X to 39X faster than FPGA HLS. However, FPGA HLS exhibits its advantage in the dynamic power consumption metric, achieving up to 17X lower power consumption than the FPGA overlay. Power and energy efficiency are two further essential metrics for evaluating trade-offs between performance and power consumption. As demonstrated by the evaluation results, the FPGA overlay is on average 3.5X better in power efficiency for the FFT kernel, and achieves up to two orders of magnitude better energy efficiency than FPGA HLS.

Journal ArticleDOI
TL;DR: Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward.
Abstract: Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiproc...

Journal ArticleDOI
TL;DR: A cost-effective and high-throughput merge network is proposed for the fastest FPGA sorting accelerator, achieving a throughput of 8 data elements per 200 MHz clock cycle.
Abstract: High-performance sorting is used in various areas such as database transactions and genomic feature operations. To improve sorting performance, in addition to the conventional approach of using general-purpose processors or GPUs, the approach of using FPGAs is becoming a promising solution. Casper and Olukotun have recently proposed the fastest FPGA sorting accelerator known so far. In their study, they proposed a merge network which can merge two sorted data series at a throughput of 6 data elements per 200 MHz clock cycle. If an FPGA sorting accelerator is constructed using merge networks, the overall throughput will be mainly determined by the throughputs of the merge networks. This motivates us to design a merge network which outputs more than 6 data elements per 200 MHz clock cycle. In this paper, we propose a cost-effective and high-throughput merge network for the fastest FPGA sorting accelerator. The evaluation shows that our proposal achieves a throughput of 8 data elements per 200 MHz clock cycle.
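
Functionally, a merge network that outputs k elements per cycle behaves like the following software model (a behavioral sketch, not the circuit): each simulated cycle removes the k smallest remaining elements from the heads of two sorted inputs.

    from collections import deque

    def merge_k_per_cycle(xs, ys, k=8):
        # Behavioral model of a k-element-per-cycle hardware merge network.
        a, b = deque(xs), deque(ys)
        out, cycles = [], 0
        while a or b:
            for _ in range(k):
                if not a and not b:
                    break
                src = a if (a and (not b or a[0] <= b[0])) else b
                out.append(src.popleft())
            cycles += 1
        return out, cycles

    out, cycles = merge_k_per_cycle(range(0, 1600, 2), range(1, 1600, 2))
    assert out == sorted(out) and cycles == 200   # 1600 elements / 8 per cycle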

Journal ArticleDOI
TL;DR: GPUs have been widely adopted in data centers to provide acceleration services to many applications; sharing a GPU is increasingly important for better processing throughput and energy efficiency.
Abstract: GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. ...

Journal ArticleDOI
TL;DR: An FFT circuit based on the nested residue number system (NRNS), which recursively decomposes the RNS, satisfies the required size and speed specifications on an available FPGA, where the excessive number of LUTs had been the bottleneck of the binary FFT.
Abstract: A radio telescope analyzes radio frequency (RF) signals received from celestial objects. It consists of an antenna, a receiver, and a spectrometer. The spectrometer converts the time domain into the frequency domain by an FFT operation. This paper applies an FFT circuit based on the nested residue number system (NRNS), which recursively decomposes the RNS. It can decompose the MAC unit into small circuits. In the FFT using the NRNS, a MAC unit is decomposed into 4-bit units realized by the look-up tables of the FPGA. Also, to realize the scaling (truncation) circuit, we propose a constant division algorithm on the FPGA. The truncation is realized by the division of the dynamic range for a subset of the moduli. We implemented the proposed NRNS FFT on the Xilinx Inc. Virtex 6 FPGA. Compared with a Xilinx Inc. binary FFT library, although the number of block RAMs (BRAMs) increased by 38%, in the RNS FFT the number of LUTs decreased by 42-45% and the maximum clock frequency increased by 38-74%. With this technique, we successfully implemented an FFT that satisfied the required size and speed specifications on an available FPGA, since the excessive number of LUTs was the bottleneck of the binary FFT.
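
The appeal of residue arithmetic for MAC units is easy to demonstrate in software: with pairwise-coprime moduli, multiply-accumulate splits into independent small-width channels, and the result is recovered by the Chinese Remainder Theorem. The sketch below uses three illustrative 4-bit moduli (the paper's nested decomposition and scaling circuit are not modeled).

    from math import prod

    MODULI = (13, 15, 16)            # pairwise coprime; range 13*15*16 = 3120

    def to_rns(x):
        # An integer is represented by its residue in each channel.
        return tuple(x % m for m in MODULI)

    def rns_mac(acc, a, b):
        # Multiply-accumulate is independent per channel, so each channel
        # needs only 4-bit arithmetic -- small enough for FPGA look-up tables.
        return tuple((r + x * y) % m for r, x, y, m in zip(acc, a, b, MODULI))

    def from_rns(res):
        # Chinese Remainder Theorem reconstruction.
        M = prod(MODULI)
        return sum(r * (M // m) * pow(M // m, -1, m)
                   for r, m in zip(res, MODULI)) % M

    acc = rns_mac(to_rns(0), to_rns(21), to_rns(33))
    assert from_rns(acc) == 21 * 33   # 693, well within the dynamic range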

Journal ArticleDOI
TL;DR: This work presents a pre-synthesized overlay fabric and an algorithm to enable rapid triggering, and evaluates the techniques using VPR, showing that the overlay and mapping algorithm together are at least an order of magnitude faster than previous work, resulting in a significant reduction in debug turn-around times.
Abstract: Embedded system designers can benefit from FPGA accelerators to achieve higher performance and efficiency. However, there are challenges that do not exist in software development; using software simulators to validate large and complex hardware designs can be extremely slow and impractical. Debugging designs implemented on an FPGA enables running the design at speed for long runs and more exhaustive test cases. However, limited observability is the primary challenge in hardware debug. To enhance hardware observability, trace buffers and trigger circuitry are inserted into the design. During device operation, a history of signals of interest is recorded into the trace buffers for off-line debug and validation. Recompiling the design every time the designer wishes to modify the trigger condition results in long debug turn-around times and reduced productivity. In this work, we present a pre-synthesized overlay fabric and an algorithm to enable rapid triggering; during debug turn-around, TriggerPlus, a greedy algorithm, is used to implement a trigger circuit on the overlay. TriggerPlus is fast and simple, yet still capable of mapping the trigger circuit to the overlay fabric. We evaluate our techniques using VPR, showing that using our overlay and mapping algorithm together is at least an order of magnitude faster than the previous work, resulting in a significant reduction in debug turn-around times.

Journal ArticleDOI
TL;DR: Energy efficiency is one of the most important design considerations in running modern datacenters; datacenter operating systems rely on software techniques such as execution migration to achieve it.
Abstract: Energy efficiency is one of the most important design considerations in running modern datacenters. Datacenter operating systems rely on software techniques such as execution migration to achieve e...

Journal ArticleDOI
TL;DR: With increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention.
Abstract: With increasing deployment of virtual machines for cloud services and server applications, memory address translation overheads in virtualized environments have received great attention. In the rad...

Journal ArticleDOI
TL;DR: For accelerating image recognition and object tracking, a one-dimensional data pipeline architecture on a field-programmable gate array (FPGA) is proposed that satisfies both high-speed streaming computation and small circuit size by exploiting spatiotemporal data dependence.
Abstract: A significant challenge facing sport science is how to grasp the flow of a game and analyze the situation of a match. The use of information technology can help achieve this goal. The technical issues from a practical application perspective can be classified into three main points: computation speed, system size, and complex data analysis with sufficient accuracy. In this paper, for accelerating image recognition and object tracking, we propose a one-dimensional data pipeline architecture on a field-programmable gate array (FPGA). It satisfies both high-speed streaming computation and small circuit size by considering spatiotemporal data dependence. Volleyball games have been chosen as the target application. The proposed system identifies the positions of six volleyball players in real time. The design on the FPGA includes pre-processing, color filtering, digitalization, noise reduction, template matching, and so on. The design was implemented and evaluated on an Atlys Spartan-6 FPGA Trainer Board with one Xilinx Spartan-6 LX45 FPGA. The computational performance achieves 100 frames per second at SVGA 800 by 600 pixel resolution. Our design also has good scalability; the performance can easily be enhanced when a larger FPGA is used. The proposed system is also compact, composed of one Atlys board and one Atlys VmodCAM stereo-camera board. The average accuracy rates for pregame situations and during a match are 87.1% and 65.7%, respectively. Since the input is streaming data, we can improve the accuracy by considering the previous and next frames: the rates improve to 90.4% and 72.2%, respectively, when we adopt template matching with a moving average filter.
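
The moving-average refinement mentioned at the end can be illustrated in a few lines of Python (illustrative only; the paper implements the pipeline in FPGA logic): averaging each player's template-matching position over a short window of frames suppresses single-frame mismatches.

    from collections import deque

    class MovingAverageTracker:
        # Smooths per-frame detections over the last `window` frames.
        def __init__(self, window=5):
            self.history = deque(maxlen=window)

        def update(self, xy):
            self.history.append(xy)
            n = len(self.history)
            return (sum(p[0] for p in self.history) / n,
                    sum(p[1] for p in self.history) / n)

    tracker = MovingAverageTracker(window=3)
    for xy in [(10, 20), (11, 21), (30, 90), (12, 22)]:   # one outlier frame
        print(tracker.update(xy))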

Journal ArticleDOI
TL;DR: Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary, and thus frees researchers to work on truly decentralised systems.
Abstract: Direct network I/O allows network controllers (NICs) to expose multiple instances of themselves, to be used by untrusted software without a trusted intermediary. Direct I/O thus frees researchers f...

Journal ArticleDOI
TL;DR: To mitigate excessive TLB misses in large memory applications, techniques such as large pages, variable length segments, and HW coalescing are used to increase the coverage of the limited hardware translation entries.
Abstract: To mitigate excessive TLB misses in large memory applications, techniques such as large pages, variable length segments, and HW coalescing increase the coverage of limited hardware translation ent...

Journal ArticleDOI
TL;DR: Processors and operating systems (OSes) support multiple memory page sizes, and superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained memory protection.
Abstract: Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained memory protection. Id...

Journal ArticleDOI
TL;DR: This paper proposes an FPGA solver for partial maximum satisfiability (PMS) problems based on the Dist algorithm, which is one of the best performing stochastic local search algorithms for PMS problems.
Abstract: In this paper, we propose an FPGA solver for partial maximum satisfiability (PMS) problems based on the Dist algorithm, which is one of the best-performing stochastic local search algorithms for PMS problems. The Dist algorithm searches for a truth assignment for the variables that satisfies all of the hard clauses and as many soft clauses as possible by iteratively selecting a variable using a heuristic and flipping its truth value. During each iteration, new candidate variables for flipping are generated and existing ones may disappear. In our solver, the variables that may become new candidates for flipping are evaluated by parallel and pipeline processing, and then only the variables that actually become candidates for flipping are extracted and gathered up concurrently with the pipeline processing. The extraction process is not influenced by the number of new candidates or their random generation, which minimizes the disturbance of the parallel and pipeline processing. Our FPGA solver can solve large PMS problems up to 7.74 times faster than running Dist on a CPU.
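
To make the search loop concrete, here is a deliberately simplified Dist-style local search in Python (the real Dist heuristic maintains clause weights and variable scores, which the FPGA solver evaluates in parallel; those are elided here): satisfy every hard clause first, then minimize the number of unsatisfied soft clauses by flipping one variable per iteration.

    import random

    def unsat(clauses, assign):
        # A clause is a tuple of signed ints: +v requires variable v to be
        # True, -v requires it to be False.
        return [c for c in clauses if not any((l > 0) == assign[abs(l)] for l in c)]

    def local_search_pms(n_vars, hard, soft, steps=10_000):
        assign = {v: random.random() < 0.5 for v in range(1, n_vars + 1)}
        best, best_cost = dict(assign), float("inf")
        for _ in range(steps):
            h, s = unsat(hard, assign), unsat(soft, assign)
            cost = float("inf") if h else len(s)
            if cost < best_cost:
                best, best_cost = dict(assign), cost
            if cost == 0:
                break
            clause = random.choice(h or s)               # hard clauses first
            assign[abs(random.choice(clause))] ^= True   # flip one variable
        return best, best_cost

    hard = [(1, 2), (-1, 2)]   # together these force x2 = True
    soft = [(-2,), (1,)]       # (-2,) then becomes unsatisfiable
    print(local_search_pms(2, hard, soft))   # best leaves 1 soft clause unsat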

Journal ArticleDOI
TL;DR: CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability.
Abstract: CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To...

Journal ArticleDOI
TL;DR: Byte-addressable non-volatile memory technology is emerging as an alternative for DRAM for main memory; this new Non-Volatile Main Memory (NVMM) allows programmers to store important data in persistent data structures.
Abstract: Byte-addressable non-volatile memory technology is emerging as an alternative for DRAM for main memory. This new Non-Volatile Main Memory (NVMM) allows programmers to store important data in data s...

Journal ArticleDOI
TL;DR: High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance.
Abstract: High-performance computing, enterprise, and datacenter servers are driving demands for higher total memory capacity as well as memory performance. Memory "cubes" with high per-package capacity (fro...

Journal ArticleDOI
TL;DR: The Do-It-Yourself virtual memory translation (DVMT) architecture is introduced as a flexible complement for current hardware-fixed translation flows, decoupling the virtual-to-physical translation.
Abstract: In this paper, we introduce the Do-It-Yourself virtual memory translation (DVMT) architecture as a flexible complement for current hardware-fixed translation flows. DVMT decouples the virtual-to-ph...