Author

Mary Kiemb

Bio: Mary Kiemb is an academic researcher from Seoul National University. The author has contributed to research on the topics of design space exploration and microarchitecture. The author has an h-index of 4 and has co-authored 6 publications receiving 237 citations.

Papers
Posted Content
Yoonjin Kim, Mary Kiemb, Chulsoo Park, Jinyong Jung, Kiyoung Choi
TL;DR: The authors propose a reconfigurable array architecture template and a design space exploration flow for domain-specific optimization, which can reduce hardware cost and delay without any performance degradation for some application domains.
Abstract: Coarse-grained reconfigurable architectures aim to achieve both high performance and flexibility. However, existing reconfigurable array architectures require many resources without considering the specific application domain. Functional resources that take long latency and/or large area can be pipelined and/or shared among the processing elements. Therefore, the hardware cost and the delay can be effectively reduced without any performance degradation for some application domains. We suggest such a reconfigurable array architecture template and a design space exploration flow for domain-specific optimization. Experimental results show that our approach is much more efficient, in both performance and area, compared to existing reconfigurable architectures.

91 citations

Proceedings Article
Yoonjin Kim, Mary Kiemb, Chulsoo Park, Jinyong Jung, Kiyoung Choi
07 Mar 2005
TL;DR: A reconfigurable array architecture template and a design space exploration flow for domain-specific optimization are suggested. Experimental results show that this approach is much more efficient, in both performance and area, compared to existing reconfigurable array architectures.
Abstract: Coarse-grained reconfigurable architectures aim to achieve goals of both high performance and flexibility. However, existing reconfigurable array architectures require many resources without considering the specific application domain. Functional resources that take long latency and/or large area can be pipelined and/or shared among the processing elements. Therefore, the hardware cost and the delay can be effectively reduced without any performance degradation for some application domains. We suggest such a reconfigurable array architecture template and a design space exploration flow for domain-specific optimization. Experimental results show that our approach is much more efficient, in both performance and area, compared to existing reconfigurable architectures.

86 citations

Proceedings Article
06 Mar 2006
TL;DR: This work investigates the problem of automatically mapping applications onto a coarse-grained reconfigurable architecture, formalizes the mapping problem, shows that it is NP-complete, and proposes an efficient algorithm to solve it.
Abstract: In this work, we investigate the problem of automatically mapping applications onto a coarse-grained reconfigurable architecture and propose an efficient algorithm to solve the problem. We formalize the mapping problem and show that it is NP-complete. To solve the problem within a reasonable amount of time, we divide it into three subproblems: covering, partitioning and layout. Our empirical results demonstrate that our technique produces nearly as good performance as hand-optimized outputs for many kernels.

50 citations

Proceedings Article
22 Sep 2004
TL;DR: A design space exploration algorithm that considers both memory configuration and multithreaded architecture is suggested, along with a thread shifting technique that shifts threads at compile time to minimize cache conflicts.
Abstract: In embedded multithreaded architectures, the performance enhancement relative to the base single-threaded architecture is highly dependent on the characteristics of the application and the memory configuration. When the application is well parallelized, the multithreading performance may be good even with a small cache, since the memory access latency can be hidden. However, if there are complicated dependencies between threads, they cause frequent cache conflicts, so the performance may not be improved. For that reason, not only the processor architecture but also the memory configuration should be customized to get an optimal solution for an embedded multithreaded system. We suggest a design space exploration algorithm that considers both memory configuration and multithreaded architecture, and a thread shifting technique that shifts threads at compile time to minimize cache conflicts.

2 citations


Cited by
Proceedings Article
25 Oct 2008
TL;DR: Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.
Abstract: Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files often organized as a two dimensional grid. The most difficult challenge in deploying CGRAs is compiler scheduling technology that can efficiently map software implementations of compute intensive loops onto the array. Traditional schedulers focus on the placement of operations in time and space. With CGRAs, the challenge of placement is compounded by the need to explicitly route operands from producers to consumers. To systematically attack this problem, we take an edge-centric approach to modulo scheduling that focuses on the routing problem as its primary objective. With edge-centric modulo scheduling (EMS), placement is a by-product of the routing process, and the schedule is developed by routing each edge in the dataflow graph. Routing cost metrics provide the scheduler with a global perspective to guide selection. Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.

196 citations

Proceedings Article
03 Jun 2012
TL;DR: Experimental results on 14 important kernels extracted from well-known benchmark programs show that using EPIMap can improve the performance of the kernels on CGRA by more than 2.8X on average, as compared to one of the best existing mapping algorithms, EMS.
Abstract: Coarse-Grained Reconfigurable Architectures (CGRAs) are an attractive platform that promises both high performance and high power efficiency. One of the primary challenges in using CGRAs is to develop efficient compilers that can automatically and efficiently map applications to the CGRA. To this end, this paper makes several contributions: i) Using Re-computation for Resource Limitations: For the first time in CGRA compilers, we propose the use of re-computation as a solution for the resource limitation problem. This extends the solution space and enables better mappings. ii) General Problem Formulation: A precise and general formulation of the application mapping problem on a CGRA is presented, and its computational complexity is established. iii) Extracting an Efficient Heuristic: Using the insights from the problem formulation, we design an effective global heuristic called EPIMap. EPIMap transforms the input specification (a directed graph) to an epimorphic equivalent graph that satisfies the necessary conditions for mapping onto a CGRA, reducing the search space. Experimental results on 14 important kernels extracted from well-known benchmark programs show that using EPIMap can improve the performance of the kernels on CGRA by more than 2.8X on average, as compared to one of the best existing mapping algorithms, EMS. EPIMap was able to achieve the theoretical best performance for 9 out of 14 benchmarks, while EMS could not achieve the theoretical best performance for any of the benchmarks. EPIMap achieves better mappings at an acceptable increase in compilation time.

125 citations

Book Chapter
01 Jan 2013
TL;DR: The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.
Abstract: Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.

67 citations

Proceedings Article
19 Jun 2009
TL;DR: A recurrence cycle-aware scheduling technique for CGRAs is introduced and it is shown that the technique achieves better quality schedules than schedulers based on simulated annealing at a 170-fold speed increase.
Abstract: In high-end embedded systems, coarse-grained reconfigurable architectures (CGRAs) continue to replace traditional ASIC designs. CGRAs offer high performance at low power consumption, yet provide flexibility through programmability. In this paper we introduce a recurrence cycle-aware scheduling technique for CGRAs. Our modulo scheduler groups operations belonging to a recurrence cycle into a clustered node and then computes a scheduling order for those clustered nodes. Deadlocks that arise when two or more recurrence cycles depend on each other are resolved by using heuristics that favor recurrence cycles with long recurrence delays. Whereas previous work had to sacrifice either fast compilation speed or good quality results, the proposed recurrence cycle-aware scheduling technique requires neither sacrifice. We have implemented the proposed method in our in-house CGRA chip and compiler solution and show that the technique achieves better quality schedules than schedulers based on simulated annealing at a 170-fold speed increase.

54 citations

Proceedings Article
04 Oct 2006
TL;DR: This paper shows how power is consumed in a typical coarse-grained reconfigurable architecture and suggests a power-conscious configuration cache structure and code mapping technique, which reduce power consumption without performance degradation.
Abstract: Coarse-grained reconfigurable architecture aims to achieve both performance and flexibility. However, power consumption is no less important for the reconfigurable architecture to be used as a competitive processing core in embedded systems. In this paper, we show how power is consumed in a typical coarse-grained reconfigurable architecture. Based on the power breakdown data, we suggest a power-conscious configuration cache structure and code mapping technique, which reduce power consumption without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size.

54 citations