scispace - formally typeset
Search or ask a question
Author

Bingfeng Mei

Other affiliations: IMEC, Samsung
Bio: Bingfeng Mei is an academic researcher from Katholieke Universiteit Leuven. The author has contributed to research in topics: Compiler & Very long instruction word. The author has an hindex of 12, co-authored 25 publications receiving 1696 citations. Previous affiliations of Bingfeng Mei include IMEC & Samsung.

Papers
More filters
Book ChapterDOI
01 Sep 2003
TL;DR: A novel architecture with tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix is proposed, which has good performance and is very compiler-friendly.
Abstract: The coarse-grained reconfigurable architectures have advantages over the traditional FPGAs in terms of delay, area and configuration time. To execute entire applications, most of them combine an instruction set processor(ISP) and a reconfigurable matrix. However, not much attention is paid to the integration of these two parts, which results in high communication overhead and programming difficulty. To address this problem, we propose a novel architecture with tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix. The advantages include simplified programming model, shared resource costs, and reduced communication overhead. To exploit this architecture, our previously developed compiler framework is adapted to the new architecture. The results show that the new architecture has good performance and is very compiler-friendly.

600 citations

Proceedings ArticleDOI
03 Mar 2003
TL;DR: A modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures and is a key part of theDRESC Dynamically Reconfigurable Embedded Systems Compiler.
Abstract: Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compilation tools are essential to their success. In this paper, we present a modulo scheduling algorithm to exploit loop-level parallelism for coarse-grained reconfigurable architectures. This algorithm is a key part of our dynamically reconfigurable embedded systems compiler (DRESC). It is capable of solving placement, scheduling and routing of operations simultaneously in a modulo-constrained 3D space and uses an abstract architecture representation to model a wide class of coarse-grained architectures. The experimental results show high performance and efficient resource utilization on tested kernels.

225 citations

Proceedings ArticleDOI
01 Jan 2002
TL;DR: This paper presents a retargetable compiler for a family of coarse-grained reconfigurable architectures and presents experimental results, showing up to 28.7 instructions per cycle (IPC) over tested kernels.
Abstract: Coarse-grained reconfigurable architectures have become increasingly important in recent years. Automatic design or compiling tools are essential to their success. In this paper, we present a retargetable compiler for a family of coarse-grained reconfigurable architectures. Several key issues are addressed. Program analysis and transformation prepare dataflow for scheduling. Architecture abstraction generates an internal graph representation from a concrete architecture description. A modulo scheduling algorithm is key to exploit parallelism and achieve high performance. The experimental results show up to 28.7 instructions per cycle (IPC) over tested kernels.

206 citations

Journal ArticleDOI
TL;DR: An environment and an architecture exploration for a novel CGRA template and a retargetable simulator and compiler enable systematic architecture exploration that can lead to more efficient domain-specific architecture design.
Abstract: Coarse-grained architectures (CGRAs) can be tailored and optimized for different application domains. The vast design space of coarse-grained reconfigurable architectures complicates the design of optimized processors. The goal is to design a domain-specific processor that provides just enough-flexibility for that domain while minimizing the energy consumption for a given level of performance. However, a flexible architecture template and a retargetable simulator and compiler enable systematic architecture exploration that can lead to more efficient domain-specific architecture design. This article presents such an environment and an architecture exploration for a novel CGRA template.

182 citations

Proceedings ArticleDOI
16 Feb 2004
TL;DR: A C-based design flow using an MPEG-2 decoder as a design example shows that the methodology and architecture can deliver a competitive package in terms of design efforts and performance over other programmable architectures.
Abstract: Coarse-grained reconfigurable architectures have seen growing importance recently. Design tools and methodology are essential to their success. Based on our previous work on modulo scheduling algorithms and a novel architecture with tightly coupled VLIW/reconfigurable matrix, we present a C-based design flow using an MPEG-2 decoder as a design example. The application is mapped to the architecture in less than one person-week starting from a software implementation. The kernel and overall speedup over the reference VLIW are 4.84 and 3.05 respectively. The case study shows that our methodology and architecture can deliver a competitive package in terms of design efforts and performance over other programmable architectures.

155 citations


Cited by
More filters
Journal ArticleDOI
18 Jun 2016
TL;DR: A novel dataflow, called row-stationary (RS), is presented, that minimizes data movement energy consumption on a spatial architecture and can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine local storage, direct inter-PE communication and spatial parallelism.
Abstract: Deep convolutional neural networks (CNNs) are widely used in modern AI systems for their superior accuracy but at the cost of high computational complexity. The complexity comes from the need to simultaneously process hundreds of filters and channels in the high-dimensional convolutions, which involve a significant amount of data movement. Although highly-parallel compute paradigms, such as SIMD/SIMT, effectively address the computation requirement to achieve high throughput, energy consumption still remains high as data movement can be more expensive than computation. Accordingly, finding a dataflow that supports parallel processing with minimal data movement cost is crucial to achieving energy-efficient CNN processing without compromising accuracy.In this paper, we present a novel dataflow, called row-stationary (RS), that minimizes data movement energy consumption on a spatial architecture. This is realized by exploiting local data reuse of filter weights and feature map pixels, i.e., activations, in the high-dimensional convolutions, and minimizing data movement of partial sum accumulations. Unlike dataflows used in existing designs, which only reduce certain types of data movement, the proposed RS dataflow can adapt to different CNN shape configurations and reduces all types of data movement through maximally utilizing the processing engine (PE) local storage, direct inter-PE communication and spatial parallelism. To evaluate the energy efficiency of the different dataflows, we propose an analysis framework that compares energy cost under the same hardware area and processing parallelism constraints. Experiments using the CNN configurations of AlexNet show that the proposed RS dataflow is more energy efficient than existing dataflows in both convolutional (1.4× to 2.5×) and fully-connected layers (at least 1.3× for batch size larger than 16). The RS dataflow has also been demonstrated on a fabricated chip, which verifies our energy analysis.

1,126 citations

Book ChapterDOI
01 Sep 2003
TL;DR: A novel architecture with tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix is proposed, which has good performance and is very compiler-friendly.
Abstract: The coarse-grained reconfigurable architectures have advantages over the traditional FPGAs in terms of delay, area and configuration time. To execute entire applications, most of them combine an instruction set processor(ISP) and a reconfigurable matrix. However, not much attention is paid to the integration of these two parts, which results in high communication overhead and programming difficulty. To address this problem, we propose a novel architecture with tightly coupled very long instruction word (VLIW) processor and coarse-grained reconfigurable matrix. The advantages include simplified programming model, shared resource costs, and reduced communication overhead. To exploit this architecture, our previously developed compiler framework is adapted to the new architecture. The results show that the new architecture has good performance and is very compiler-friendly.

600 citations

Book
02 Nov 2007
TL;DR: This book is intended as an introduction to the entire range of issues important to reconfigurable computing, using FPGAs as the context, or "computing vehicles" to implement this powerful technology.
Abstract: The main characteristic of Reconfigurable Computing is the presence of hardware that can be reconfigured to implement specific functionality more suitable for specially tailored hardware than on a simple uniprocessor. Reconfigurable computing systems join microprocessors and programmable hardware in order to take advantage of the combined strengths of hardware and software and have been used in applications ranging from embedded systems to high performance computing. Many of the fundamental theories have been identified and used by the Hardware/Software Co-Design research field. Although the same background ideas are shared in both areas, they have different goals and use different approaches.This book is intended as an introduction to the entire range of issues important to reconfigurable computing, using FPGAs as the context, or "computing vehicles" to implement this powerful technology. It will take a reader with a background in the basics of digital design and software programming and provide them with the knowledge needed to be an effective designer or researcher in this rapidly evolving field. · Treatment of FPGAs as computing vehicles rather than glue-logic or ASIC substitutes · Views of FPGA programming beyond Verilog/VHDL · Broad set of case studies demonstrating how to use FPGAs in novel and efficient ways

531 citations

Proceedings ArticleDOI
Mingyu Gao1, Jing Pu1, Xuan Yang1, Mark Horowitz1, Christos Kozyrakis1 
04 Apr 2017
TL;DR: The hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory, are presented and it is shown that despite the use of small SRAM buffers, the presence of3D memory simplifies dataflow scheduling for NN computations.
Abstract: The high accuracy of deep neural networks (NNs) has led to the development of NN accelerators that improve performance by two orders of magnitude. However, scaling these accelerators for higher performance with increasingly larger NNs exacerbates the cost and energy overheads of their memory systems, including the on-chip SRAM buffers and the off-chip DRAM channels.This paper presents the hardware architecture and software scheduling and partitioning techniques for TETRIS, a scalable NN accelerator using 3D memory. First, we show that the high throughput and low energy characteristics of 3D memory allow us to rebalance the NN accelerator design, using more area for processing elements and less area for SRAM buffers. Second, we move portions of the NN computations close to the DRAM banks to decrease bandwidth pressure and increase performance and energy efficiency. Third, we show that despite the use of small SRAM buffers, the presence of 3D memory simplifies dataflow scheduling for NN computations. We present an analytical scheduling scheme that matches the efficiency of schedules derived through exhaustive search. Finally, we develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators. Overall, we show that TETRIS improves mthe performance by 4.1x and reduces the energy by 1.5x over NN accelerators with conventional, low-power DRAM memory systems.

453 citations

Journal ArticleDOI
25 Jul 2005
TL;DR: It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.
Abstract: Reconfigurable computing is becoming increasingly attractive for many applications. This survey covers two aspects of reconfigurable computing: architectures and design methods. The paper includes recent advances in reconfigurable architectures, such as the Alters Stratix II and Xilinx Virtex 4 FPGA devices. The authors identify major trends in general-purpose and special-purpose design methods. It is shown that reconfigurable computing designs are capable of achieving up to 500 times speedup and 70% energy savings over microprocessor implementations for specific applications.

414 citations