Author

Yong Dou

Bio: Yong Dou is an academic researcher. The author has contributed to research in topics including multiplication and matrix multiplication, has an h-index of 1, and has co-authored 1 publication receiving 214 citations.

Papers
Proceedings ArticleDOI
20 Feb 2005
TL;DR: A 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementations, implemented as a scalable linear array of processing elements (PEs) supporting the proposed block algorithm in Xilinx Virtex II Pro technology.
Abstract: We introduce a 64-bit ANSI/IEEE Std 754-1985 floating-point design of a hardware matrix multiplier optimized for FPGA implementations. A general block matrix multiplication algorithm, applicable to arbitrary matrix sizes, is proposed. The algorithm potentially enables optimum performance by exploiting the data locality and reusability incurred by the general matrix multiplication scheme while respecting the limitations of the I/O bandwidth and the local storage volume. We implement a scalable linear array of processing elements (PEs) supporting the proposed algorithm in Xilinx Virtex II Pro technology. Synthesis results confirm a superior performance-area ratio compared to related recent works. Assuming the same FPGA chip, the same amount of local memory, and the same I/O bandwidth, our design outperforms related proposals by at least 1.7X and up to 18X while consuming the fewest reconfigurable resources. A total of 39 PEs can be integrated into the xc2vp125-7 FPGA, reaching, for example, 15.6 GFLOPS with 1600 KB of local memory and 400 MB/s of external memory bandwidth.
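The heart of the paper is a blocking scheme: tiles of A and B sized to the on-chip storage are fetched once and reused across an entire tile of C, which is what keeps the PE array busy under a fixed I/O bandwidth. A minimal software sketch of that blocking idea follows; it is not the authors' hardware algorithm, and the tile size and NumPy formulation are illustrative assumptions.

```python
import numpy as np

def block_matmul(A, B, tile=64):
    """Blocked C = A @ B. Each tile-sized piece of A and B is read once
    per tile of C it contributes to, mirroring the data locality and
    reuse the paper's hardware scheme exploits with limited local
    storage. `tile` stands in for the local-memory capacity."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # one burst from "external memory" per operand tile;
                # partial products accumulate in "local storage"
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

With this loop order each tile of A is fetched roughly m/tile times and each tile of B roughly n/tile times, so larger tiles trade local memory for external bandwidth, which is the trade-off the abstract describes.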

224 citations


Cited by
Book
02 Nov 2007
TL;DR: This book is intended as an introduction to the entire range of issues important to reconfigurable computing, using FPGAs as the context, or "computing vehicles" to implement this powerful technology.
Abstract: The main characteristic of Reconfigurable Computing is the presence of hardware that can be reconfigured to implement specific functionality better suited to specially tailored hardware than to a simple uniprocessor. Reconfigurable computing systems join microprocessors and programmable hardware in order to take advantage of the combined strengths of hardware and software, and have been used in applications ranging from embedded systems to high-performance computing. Many of the fundamental theories have been identified and used by the Hardware/Software Co-Design research field. Although the same background ideas are shared in both areas, they have different goals and use different approaches. This book is intended as an introduction to the entire range of issues important to reconfigurable computing, using FPGAs as the context, or "computing vehicles", to implement this powerful technology. It will take a reader with a background in the basics of digital design and software programming and provide them with the knowledge needed to be an effective designer or researcher in this rapidly evolving field.
· Treatment of FPGAs as computing vehicles rather than glue-logic or ASIC substitutes
· Views of FPGA programming beyond Verilog/VHDL
· Broad set of case studies demonstrating how to use FPGAs in novel and efficient ways

531 citations

Proceedings ArticleDOI
27 Feb 2011
TL;DR: A new FPGA memory architecture called Connected RAM (CoRAM) is proposed to serve as a portable bridge between distributed computation kernels and external memory interfaces, improving performance and efficiency while also simplifying development and improving an application's portability and scalability.
Abstract: FPGAs have been used in many applications to achieve orders-of-magnitude improvement in absolute performance and energy efficiency relative to conventional microprocessors. Despite their promise in both processing performance and efficiency, FPGAs have not yet gained widespread acceptance as mainstream computing devices. A fundamental obstacle to FPGA-based computing today is the FPGA's lack of a common, scalable memory architecture. When developing applications for FPGAs, designers are often directly responsible for crafting the application-specific infrastructure logic that manages and transports data to and from the processing kernels. This infrastructure not only increases design time and effort but will frequently lock a design to a particular FPGA product line, hindering scalability and portability. We propose a new FPGA memory architecture called Connected RAM (CoRAM) to serve as a portable bridge between the distributed computation kernels and the external memory interfaces. In addition to improving performance and efficiency, the CoRAM architecture provides a virtualized memory environment as seen by the hardware kernels to simplify development and to improve an application's portability and scalability.
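The "infrastructure logic" problem the abstract describes can be pictured with a generic staging pattern: a control thread moves tiles between external memory and a channel, and the compute kernel only ever reads the channel. The sketch below is a software analogy of that decoupling, not the CoRAM API; the tile contents and the queue depth are illustrative.

```python
from queue import Queue
from threading import Thread

def control_thread(tiles, chan):
    """Stages data tiles from 'external memory' into a bounded channel.
    Only this thread touches the memory interface, so the kernel stays
    portable across memory systems (the idea CoRAM makes architectural)."""
    for t in tiles:
        chan.put(t)
    chan.put(None)  # end-of-stream marker

def compute_kernel(chan, results):
    """Consumes tiles from the channel, oblivious to where they came from."""
    while (t := chan.get()) is not None:
        results.append(sum(t))  # placeholder per-tile computation

tiles = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
chan, results = Queue(maxsize=2), []
Thread(target=control_thread, args=(tiles, chan)).start()
compute_kernel(chan, results)
print(results)  # [3.0, 7.0, 11.0]
```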

159 citations

Journal ArticleDOI
TL;DR: This work presents the Photonic Recurrent Ising Sampler (PRIS), a heuristic method tailored for parallel architectures allowing fast and efficient sampling from distributions of arbitrary Ising problems, and suggests speedups in heuristic methods via photonic implementations of the PRIS.
Abstract: The inability of conventional electronic architectures to efficiently solve large combinatorial problems motivates the development of novel computational hardware. There has been much effort toward developing application-specific hardware across many different fields of engineering, such as integrated circuits, memristors, and photonics. However, unleashing the potential of such architectures requires the development of algorithms which optimally exploit their fundamental properties. Here, we present the Photonic Recurrent Ising Sampler (PRIS), a heuristic method tailored for parallel architectures that allows fast and efficient sampling from the distributions of arbitrary Ising problems. Since the PRIS relies on vector-to-fixed-matrix multiplications, we suggest the implementation of the PRIS in photonic parallel networks, which realize these operations at unprecedented speed. The PRIS provides sample solutions to the ground state of Ising models by converging in probability to their associated Gibbs distribution. The PRIS also relies on intrinsic dynamic noise and eigenvalue dropout to find ground states more efficiently. Our work suggests speedups in heuristic methods via photonic implementations of the PRIS.
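The sampling loop sketched in the abstract reduces to a simple recurrence: multiply the spin vector by the fixed coupling matrix, inject noise, and threshold back to spins. Below is a minimal software sketch of such a recurrent sampler, assuming Gaussian dynamic noise; the photonic implementation and the eigenvalue-dropout scheme from the paper are not modeled, and `noise_std` and `steps` are illustrative knobs.

```python
import numpy as np

def pris_step(s, J, noise_std, rng):
    """One recurrence: vector-to-fixed-matrix multiply (the operation a
    photonic network performs natively), dynamic noise, thresholding."""
    field = J @ s + rng.normal(0.0, noise_std, size=s.shape)
    return np.where(field >= 0.0, 1, -1)

def sample_ising(J, steps=2000, noise_std=0.5, seed=0):
    """Heuristically searches for a low-energy spin configuration of the
    Ising model with coupling matrix J by iterating the recurrence."""
    rng = np.random.default_rng(seed)
    s = rng.choice([-1, 1], size=J.shape[0])
    for _ in range(steps):
        s = pris_step(s, J, noise_std, rng)
    return s

# Tiny demo on a random symmetric coupling matrix.
rng = np.random.default_rng(1)
J = rng.normal(size=(16, 16)); J = (J + J.T) / 2; np.fill_diagonal(J, 0)
s = sample_ising(J)
print("energy:", -0.5 * s @ J @ s)  # lower is better
```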

91 citations

Proceedings ArticleDOI
12 Nov 2005
TL;DR: This paper proposes a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems, including a design that employs a linear array of FPGAs for the matrix multiply operation.
Abstract: Field-Programmable Gate Arrays (FPGAs) have become an attractive option for scientific computing. Several vendors have developed high-performance reconfigurable systems which employ FPGAs for application acceleration. In this paper, we propose a BLAS (Basic Linear Algebra Subprograms) library for state-of-the-art reconfigurable systems. We study three data-intensive operations: dot product, matrix-vector multiply, and dense matrix multiply. The first two operations are I/O bound, and our designs efficiently utilize the available memory bandwidth in the systems. As these operations require accumulation of sequentially delivered floating-point values, we develop a high-performance reduction circuit. This circuit uses only one floating-point adder and buffers of moderate size. For the matrix multiply operation, we propose a design which employs a linear array of FPGAs. This design exploits the memory hierarchy in reconfigurable systems and has very low memory bandwidth requirements. To illustrate our ideas, we have implemented our designs for Level 2 and Level 3 BLAS on the Cray XD1.
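The reduction circuit mentioned above tackles a classic hazard: a pipelined floating-point adder cannot fold a new input into a running sum until the previous addition has drained from the pipeline. One standard workaround, modeled in software below, is to rotate among as many partial sums as the pipeline is deep and combine them once the stream ends. The pipeline depth here is an illustrative assumption, and the paper's circuit (its buffering and scheduling) is not reproduced.

```python
def streaming_reduce(values, adder_latency=8):
    """Accumulates a stream with one (modeled) pipelined FP adder.
    Input i goes to partial sum i % adder_latency, so no partial sum
    is reused before its previous addition would have left the pipeline."""
    partials = [0.0] * adder_latency
    for i, v in enumerate(values):
        partials[i % adder_latency] += v
    total = 0.0
    for p in partials:  # final combine once the stream has ended
        total += p
    return total

print(streaming_reduce([0.5] * 1000))  # 500.0
```

Note that rotating the accumulation order changes the floating-point rounding relative to a strictly sequential sum, a standard consideration for such reduction schemes.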

87 citations

Journal ArticleDOI
TL;DR: This paper studies designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems, and proposes three parameterized algorithms which can be tuned according to the problem size and the available hardware resources.
Abstract: The abundant hardware resources on current reconfigurable computing systems provide new opportunities for high-performance parallel implementations of scientific computations. In this paper, we study designs for floating-point matrix multiplication, a fundamental kernel in a number of scientific applications, on reconfigurable computing systems. We first analyze design trade-offs in implementing this kernel. These trade-offs are caused by the inherent parallelism of matrix multiplication and the resource constraints, including the number of configurable slices, the size of on-chip memory, and the available memory bandwidth. We propose three parameterized algorithms which can be tuned according to the problem size and the available hardware resources. Our algorithms employ a linear array architecture with simple control logic. This architecture effectively utilizes the available resources and reduces routing complexity. The processing elements (PEs) used in our algorithms are modular, so it is easy to embed floating-point units into them. Experimental results on a Xilinx Virtex-II Pro XC2VP100 show that our algorithms achieve good scalability and high sustained GFLOPS performance. We also implement our algorithms on the Cray XD1, a high-end reconfigurable computing system that employs both general-purpose processors and reconfigurable devices. Our algorithms achieve a sustained performance of 2.06 GFLOPS on a single node of the XD1.
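Both this paper and the matrix multiplier above use a linear array in which each PE owns a slice of the problem in its local memory while the remaining operand streams past the whole array. A minimal software model of that partitioning follows; the PE count and the column-block split are illustrative assumptions, and the control logic and floating-point pipelining of the actual designs are not modeled.

```python
import numpy as np

def linear_array_matmul(A, B, num_pe=4):
    """Models a linear PE array computing C = A @ B: PE p keeps a column
    block of B in its 'local memory' and produces the matching column
    block of C as rows of A stream past every PE in turn."""
    n, _ = A.shape
    m = B.shape[1]
    C = np.zeros((n, m))
    col_blocks = np.array_split(np.arange(m), num_pe)  # one block per PE
    for i in range(n):                 # row i of A streams down the array
        for cols in col_blocks:        # each PE sees the row exactly once
            C[i, cols] = A[i, :] @ B[:, cols]
    return C

A = np.random.rand(8, 5); B = np.random.rand(5, 6)
assert np.allclose(linear_array_matmul(A, B), A @ B)
```

Because B never moves after the initial load and each row of A is streamed once, external traffic scales with the sizes of A, B, and C rather than with the operation count, which is what gives these linear-array designs their low memory bandwidth requirements.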

86 citations