Author

Hsuan Hsiao

Bio: Hsuan Hsiao is an academic researcher from the University of Toronto. The author has contributed to research in topics including high-level synthesis and design space exploration. The author has an h-index of 5 and has co-authored 13 publications receiving 376 citations.

Papers
Journal ArticleDOI
TL;DR: This work presents a first-published methodology for evaluating HLS tools and uses it to compare one commercial and three academic tools on a common set of C benchmarks, performing an in-depth evaluation in terms of performance and resource usage.
Abstract: High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today’s system complexity. HLS allows designers to work at a higher level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as an overview of the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming to perform an in-depth evaluation in terms of performance and resource usage.
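As a concrete illustration of the input style described above, the sketch below shows the kind of plain C function an HLS tool can compile into a hardware datapath. It is a minimal, hypothetical example, not taken from the paper's benchmark set; tool-specific pragmas for pipelining, unrolling, or array partitioning are omitted because their syntax differs between tools.

```c
#include <stdint.h>

#define N 64

/* A simple fixed-size dot product: the kind of C function an HLS tool
 * can turn into a hardware datapath. Tool-specific directives for
 * pipelining, loop unrolling, or memory partitioning would normally be
 * added here, but their exact syntax varies between tools, so none are
 * shown. */
int32_t dot_product(const int16_t a[N], const int16_t b[N])
{
    int32_t acc = 0;
    for (int i = 0; i < N; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}
```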

433 citations

Proceedings ArticleDOI
30 May 2020
TL;DR: In this article, an area- and energy-efficient unary general matrix multiplication (GEMM) architecture is proposed, which relaxes previously imposed constraints on input bit streams, such as low correlation and long stream length.
Abstract: General matrix multiplication (GEMM) is universal in various applications, such as signal processing, machine learning, and computer vision. Conventional GEMM hardware architectures based on binary computing exhibit low area and energy efficiency as they scale due to the spatial nature of number representation and computing. Unary computing, on the other hand, can be performed with extremely simple processing units, often just with a single logic gate. But currently there exist no efficient architectures for unary GEMM. In this paper, we present uGEMM, an area- and energy-efficient unary GEMM architecture enabled by novel arithmetic units. The proposed design relaxes previously-imposed constraints on input bit streams---low correlation and long stream length---and achieves superior area and energy efficiency over existing unary systems. Furthermore, uGEMM's output bit streams exhibit higher accuracy and faster convergence, enabling dynamic energy-accuracy scaling on resource-constrained systems.
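To make the unary representation concrete, the following sketch shows how multiplication of two unipolar (rate-coded) bit streams reduces to a single AND gate per cycle. It is a generic illustration, not the uGEMM architecture itself, and the stream length is a hypothetical value; it also exposes the long-stream and low-correlation assumptions that uGEMM is designed to relax.

```c
#include <stdio.h>
#include <stdlib.h>

/* Unipolar unary (rate-coded) representation: a value p in [0,1] is
 * encoded as a bit stream in which the probability of a 1 equals p.
 * With independent input streams, multiplication reduces to a bitwise
 * AND per cycle, i.e. a single logic gate. This naive scheme relies on
 * low correlation and long streams, the constraints uGEMM relaxes. */

#define STREAM_LEN 1024  /* hypothetical stream length for the demo */

static void encode(double p, unsigned char *stream, int len)
{
    for (int i = 0; i < len; i++)
        stream[i] = ((double)rand() / RAND_MAX) < p;  /* Bernoulli(p) */
}

static double decode(const unsigned char *stream, int len)
{
    int ones = 0;
    for (int i = 0; i < len; i++)
        ones += stream[i];
    return (double)ones / len;  /* fraction of 1s recovers the value */
}

int main(void)
{
    unsigned char a[STREAM_LEN], b[STREAM_LEN], prod[STREAM_LEN];
    encode(0.5, a, STREAM_LEN);
    encode(0.6, b, STREAM_LEN);
    for (int i = 0; i < STREAM_LEN; i++)
        prod[i] = a[i] & b[i];            /* one AND gate per cycle */
    printf("0.5 * 0.6 ~= %.3f\n", decode(prod, STREAM_LEN));
    return 0;
}
```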

36 citations

Proceedings ArticleDOI
20 Oct 2018
TL;DR: This paper proposes the EH model, which characterizes an intermittent system's ability to maximize how much of its available energy is spent on useful processor execution, and parametrizes the energy costs associated with intermittent execution to give an intuitive understanding of how forward progress can change.
Abstract: Energy-harvesting devices, which operate solely on energy collected from their environment, have brought forth a new paradigm of intermittent computing. These devices succumb to frequent power outages that would cause conventional systems to be stuck in a perpetual loop of restarting computation and never making progress. Ensuring forward progress in an intermittent execution model requires saving state in nonvolatile memory (backup) and potentially re-executing from the last saved state upon a power loss (restore). The interplay between spending energy on useful processing and spending energy on these necessary overheads yields unexpected trade-offs. To facilitate early design space exploration, the field of intermittent computing requires better models for 1) generalizing and reasoning about these trade-offs and 2) helping architects and programmers make early-stage design decisions. We propose the EH model, which characterizes an intermittent system's ability to maximize how much of its available energy is spent on useful processor execution. The model parametrizes the energy costs associated with intermittent execution to allow an intuitive understanding of how forward progress can change. We use the EH model to explore how forward progress is impacted by the frequency of backups and the energy cost of backups and restores. We validate the EH model with hardware measurements on an MSP430 and characterize its parameters via simulation. We also demonstrate how architects and programmers can use the model to explore the design space of intermittent processors, derive insights, and model new optimizations that are unique to intermittent processor architectures.
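The abstract does not spell out the model's parameters, so the sketch below is only a back-of-the-envelope illustration in the same spirit, not the paper's actual EH model: forward progress is taken as the fraction of each power-on interval's energy that goes to useful execution after paying for a restore, periodic backups, and work lost since the last backup. All names and numbers are hypothetical.

```c
#include <stdio.h>

/* Hypothetical parameters for an intermittent execution interval. */
typedef struct {
    double e_interval;     /* energy harvested per power-on interval (uJ) */
    double e_restore;      /* cost of restoring state after an outage (uJ) */
    double e_backup;       /* cost of one backup (uJ)                      */
    double e_per_op;       /* energy per unit of useful work (uJ)          */
    double ops_per_backup; /* useful work executed between backups         */
} eh_params_t;

static double forward_progress(const eh_params_t *p)
{
    /* Energy left after the mandatory restore on power-up. */
    double budget = p->e_interval - p->e_restore;
    if (budget <= 0.0)
        return 0.0;  /* never recovers enough energy to make progress */

    /* Each backup period costs the useful work plus one backup. */
    double period_cost = p->ops_per_backup * p->e_per_op + p->e_backup;
    double full_periods = budget / period_cost;

    /* Work in a partial final period is lost and re-executed after the
     * next outage, so only completed periods count as progress. */
    double useful = (double)(long)full_periods * p->ops_per_backup * p->e_per_op;
    return useful / p->e_interval;
}

int main(void)
{
    eh_params_t p = { 100.0, 5.0, 2.0, 0.5, 20.0 };
    printf("forward progress fraction: %.2f\n", forward_progress(&p));
    return 0;
}
```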

23 citations

Book ChapterDOI
01 Jan 2016
TL;DR: This section overviews LegUp, its programming model, and unique aspects of the tool versus other HLS offerings, and concludes with a case study.
Abstract: LegUp is a high-level synthesis (HLS) tool that has been under active development at the University of Toronto since 2011. The tool is on its fourth public release, is open source, and is freely downloadable. LegUp has been the subject of over 15 publications and has been downloaded by over 1500 groups from around the world. In this section, we overview LegUp, its programming model, and unique aspects of the tool versus other HLS offerings, and conclude with a case study.

17 citations

Proceedings ArticleDOI
07 Jul 2021
TL;DR: This paper describes CGRA-ME, a software framework that enables the modelling and exploration of coarse-grained reconfigurable array (CGRA) architectures, as well as research on CGRA CAD algorithms.
Abstract: Coarse-grained reconfigurable arrays (CGRAs) are programmable hardware platforms that can be used to realize application-specific accelerators for higher performance and energy efficiency. A CGRA is a 2D array of configurable logic blocks & interconnect, where the logic blocks are typically large & ALU-like, and the interconnect is word-wide. CGRA-ME is a software framework that enables the modelling and exploration of CGRA architectures, as well as research on CGRA CAD algorithms. With CGRA-ME, an architect can specify a CGRA architecture at a high level of abstraction. A set of applications can be mapped onto the architecture to assess the mappability, power, performance and cost. CGRA-ME also allows one to generate synthesizable Verilog RTL for the modelled CGRA, permitting its implementation as an ASIC or FPGA overlay. In this paper, we describe the CGRA-ME framework [5] and overview its capabilities and current limitations. We discuss ongoing and prior research conducted with the framework, as well as outline future plans. We believe CGRA-ME will be a valuable contribution to the community, enabling new research on CGRA CAD & architectures.

10 citations


Cited by
Proceedings ArticleDOI
11 Jun 2018
TL;DR: This work describes a new domain-specific language and compiler called Spatial for higher level descriptions of application accelerators, and summarizes the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning.
Abstract: Industry is increasingly turning to reconfigurable architectures like FPGAs and CGRAs for improved performance and energy efficiency. Unfortunately, adoption of these architectures has been limited by their programming models. HDLs lack abstractions for productivity and are difficult to target from higher level languages. HLS tools are more productive, but offer an ad-hoc mix of software and hardware abstractions which make performance optimizations difficult. In this work, we describe a new domain-specific language and compiler called Spatial for higher level descriptions of application accelerators. We describe Spatial's hardware-centric abstractions for both programmer productivity and design performance, and summarize the compiler passes required to support these abstractions, including pipeline scheduling, automatic memory banking, and automated design tuning driven by active machine learning. We demonstrate the language's ability to target FPGAs and CGRAs from common source code. We show that applications written in Spatial are, on average, 42% shorter and achieve a mean speedup of 2.9x over SDAccel HLS when targeting a Xilinx UltraScale+ VU9P FPGA on an Amazon EC2 F1 instance.

154 citations

Journal ArticleDOI
TL;DR: This survey concludes that HLS is currently a viable option for fast prototyping and for designs with short time to market, and, to help close the QoR gap, it also surveys literature focused on improving HLS.
Abstract: To increase productivity in designing digital hardware components, high-level synthesis (HLS) is seen as the next step in raising the design abstraction level. However, the quality of results (QoR) of HLS tools has tended to be behind that of manual register-transfer level (RTL) flows. In this paper, we survey the scientific literature published since 2010 about the QoR and productivity differences between the HLS and RTL design flows. Altogether, our survey spans 46 papers and 118 associated applications. Our results show that on average, the QoR of the RTL flow is still better than that of state-of-the-art HLS tools. However, the average development time with HLS tools is only a third of that of the RTL flow, and a designer obtains over four times higher productivity with HLS. Based on our findings, we also present a model case study to sum up the best practices in comparative studies between HLS and RTL. The outcome of our case study is also in line with the survey results, as using an HLS tool is seen to increase productivity by a factor of six. In addition, to help close the QoR gap, we present a survey of literature focused on improving HLS. Our results let us conclude that HLS is currently a viable option for fast prototyping and for designs with short time to market.

99 citations

Journal ArticleDOI
TL;DR: In this article, a survey of the state-of-the-art software-defined radio (SDR) platforms in the context of wireless communication protocols is presented, with a focus on programmability, flexibility, portability, and energy efficiency.

91 citations

Journal ArticleDOI
TL;DR: A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.
Abstract: Spatial computing architectures promise a major stride in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target spatial computing architectures, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes, due to fundamentally distinct aspects of hardware design, such as programming for deep pipelines, distributed memory resources, and scalable routing. To alleviate this, we present a collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. We systematically identify classes of transformations (pipelining, scalability, and memory), the characteristics of their effect on the HLS code and the resulting hardware (e.g., increasing data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip dataflow, allowing for massively parallel architectures. To quantify the effect of various transformations, we cover the optimization process of a sample set of HPC kernels, provided as open source reference codes. We aim to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.
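As one generic example of the pipelining class of transformations discussed above (a sketch under assumed conditions, not code taken from the paper's reference kernels): a floating-point reduction carries a dependency across loop iterations, and interleaving the accumulation over several partial sums removes that dependency so an HLS tool can pipeline the loop.

```c
#include <stddef.h>

/* Naive reduction: the accumulation into acc creates a loop-carried
 * dependency, which prevents pipelining with an initiation interval
 * of 1 when the floating-point adder has multi-cycle latency. */
float sum_naive(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];          /* dependency on acc every iteration */
    return acc;
}

/* Transformed version: interleave the accumulation across PARTIAL
 * independent partial sums so consecutive iterations no longer depend
 * on each other, then reduce the partial sums at the end. PARTIAL is
 * a hypothetical constant that would be sized to the adder latency. */
#define PARTIAL 8

float sum_interleaved(const float *x, size_t n)
{
    float partial[PARTIAL] = {0.0f};
    for (size_t i = 0; i < n; i++)
        partial[i % PARTIAL] += x[i];   /* dependency distance is PARTIAL */
    float acc = 0.0f;
    for (int j = 0; j < PARTIAL; j++)
        acc += partial[j];
    return acc;
}
```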

83 citations