Author

A. Parikh

Bio: A. Parikh is an academic researcher from Pennsylvania State University. The author has contributed to research in the topics of Dynamic priority scheduling and Two-level scheduling, has an h-index of 6, and has co-authored 7 publications receiving 447 citations.

Papers
Proceedings ArticleDOI
22 Jun 2001
TL;DR: A compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations is proposed; experimental results indicate significant reductions in data transfer activity between the SPM and off-chip memory.
Abstract: Optimizations aimed at improving the efficiency of on-chip memories are extremely important. We propose a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations. Experimental results obtained using a generic cost model indicate significant reductions in data transfer activity between SPM and off-chip memory.
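To make the transfer activity being measured concrete, the following is a minimal sketch of the kind of code such a compiler framework might emit for one loop nest. The buffer spm_buf, the helper spm_copy, and all sizes are hypothetical stand-ins (on real hardware the copy would typically be a DMA transfer), not the paper's actual output:

```c
/* Hypothetical sketch of compiler-generated SPM tiling: a large
 * off-chip array is processed in tiles that fit the scratch-pad.
 * spm_copy() stands in for whatever DMA/memcpy primitive the
 * target provides; names and sizes are illustrative only. */
#include <string.h>

#define N    4096               /* elements in off-chip array    */
#define TILE 256                /* elements per scratch-pad tile */

static float spm_buf[TILE];     /* would live in on-chip SPM     */

static void spm_copy(float *dst, const float *src, int n)
{
    memcpy(dst, src, (size_t)n * sizeof *dst);  /* DMA in real HW */
}

void scale(float *a /* off-chip */, float k)
{
    for (int t = 0; t < N; t += TILE) {
        spm_copy(spm_buf, &a[t], TILE);      /* copy-in          */
        for (int i = 0; i < TILE; i++)       /* compute from SPM */
            spm_buf[i] *= k;
        spm_copy(&a[t], spm_buf, TILE);      /* copy-out         */
    }
}
```

Loop transformations (such as tiling, shown here) shape the iteration space so each tile's working set fits the SPM, while data transformations decide how the tile is laid out in the on-chip buffer.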

296 citations

Journal ArticleDOI
TL;DR: A compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that includes an optimization suite that uses loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM.
Abstract: Optimizations aimed at improving the efficiency of on-chip memories in embedded systems are extremely important. Using a suitable combination of program transformations and memory design space exploration aimed at enhancing data locality enables significant reductions in effective memory access latencies. While numerous compiler optimizations have been proposed to improve cache performance, there are relatively few techniques that focus on software-managed on-chip memories. It is well-known that software-managed memories are important in real-time embedded environments with hard deadlines as they allow one to accurately predict the amount of time a given code segment will take. In this paper, we propose and evaluate a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework. Our framework includes an optimization suite that uses loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM. Compared with previous work, the proposed scheme is dynamic, and allows the contents of the SPM to change during the course of execution, depending on the changes in the data access pattern. Experimental results from our implementation using a source-to-source translator and a generic cost model indicate significant reductions in data transfer activity between the SPM and off-chip memory.
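The dynamic aspect, i.e. SPM contents changing with the data access pattern, can be illustrated with a hedged sketch in which one on-chip region is re-purposed across two program phases; spm_region, spm_load, and spm_flush are invented names, and the real framework derives such code automatically through its transformation and code-rewriting phases:

```c
/* Illustrative only: the same scratch-pad region is re-purposed for
 * different data in successive program phases, as a dynamic SPM
 * manager might arrange. The region and load/flush helpers are
 * assumptions, not the paper's interface. */
#include <string.h>

#define SPM_WORDS 1024
static int spm_region[SPM_WORDS];          /* one on-chip region */

static void spm_load(int *dst, const int *src, int n)  { memcpy(dst, src, (size_t)n * sizeof *dst); }
static void spm_flush(int *dst, const int *src, int n) { memcpy(dst, src, (size_t)n * sizeof *dst); }

void phases(int *a, int *b, int n)         /* assumes n <= SPM_WORDS */
{
    /* Phase 1: a[] is hot, so it occupies the SPM. */
    spm_load(spm_region, a, n);
    for (int i = 0; i < n; i++) spm_region[i] += 1;
    spm_flush(a, spm_region, n);

    /* Phase 2: the access pattern shifts to b[]; the manager
     * evicts a[] and brings b[] into the same region. */
    spm_load(spm_region, b, n);
    for (int i = 0; i < n; i++) spm_region[i] *= 2;
    spm_flush(b, spm_region, n);
}
```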

75 citations

Journal ArticleDOI
01 May 2004
TL;DR: This paper compares a performance-oriented scheduling technique with three energy-oriented instruction scheduling algorithms from both the performance (execution cycles of the resulting schedules) and energy consumption points of view, and proposes three scheduling algorithms that consider energy and performance at the same time.
Abstract: Reducing energy consumption has become an important issue in designing hardware and software systems in recent years. Although low power hardware components are critical for reducing energy consumption, the switching activity, which is the main source of dynamic power dissipation in electronic systems, is largely determined by the software running on these systems. In this paper, we present and evaluate several instruction scheduling algorithms that reorder a given sequence of instructions taking energy considerations into account. We first compare a performance-oriented scheduling technique with three energy-oriented instruction scheduling algorithms from both performance (execution cycles of the resulting schedules) and energy consumption points of view. Then, we propose three scheduling algorithms that consider energy and performance at the same time. Our experimentation with these scheduling techniques shows that the best scheduling from the performance perspective is not necessarily the best scheduling from the energy perspective. Further, scheduling techniques that consider both energy and performance simultaneously are found to be desirable; that is, these techniques are quite successful in reducing energy consumption and their performance (in terms of execution cycles) is comparable to that of a pure performance-oriented scheduling. We also illuminate the inherent approximations and difficulties in building energy models for enabling energy-aware instruction scheduling and explore alternative options using a cycle-accurate energy simulator. The simulation results show that the energy-oriented scheduling reduces energy consumption by up to 30% compared to the performance-oriented scheduling.
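As a rough illustration of the energy/performance trade-off (not the authors' algorithms), the toy scheduler below greedily picks the next instruction by a weighted sum of latency and the Hamming distance between consecutive instruction encodings, a common proxy for switching activity; dependences are ignored for brevity, and all encodings, latencies, and weights are invented:

```c
/* Toy greedy scheduler illustrating the energy/performance trade-off
 * the paper studies. Instructions are reduced to an encoding word and
 * a latency; dependences are ignored for brevity (real schedulers
 * respect a dependence DAG). All data below is illustrative. */
#include <stdio.h>
#include <stdint.h>

#define NINSN 5
#define ALPHA 1.0   /* weight on latency (performance)        */
#define BETA  0.5   /* weight on bit switching (energy proxy) */

static int hamming(uint32_t a, uint32_t b)
{
    int d = 0;
    for (uint32_t x = a ^ b; x; x >>= 1) d += (int)(x & 1);
    return d;
}

int main(void)
{
    uint32_t code[NINSN] = {0x8C220004, 0x00432020, 0xAC220008,
                            0x00A63022, 0x10400002};
    int latency[NINSN]   = {2, 1, 2, 1, 1};
    int done[NINSN] = {0};
    uint32_t prev = 0;

    for (int step = 0; step < NINSN; step++) {
        int best = -1; double bestc = 1e30;
        for (int i = 0; i < NINSN; i++) {   /* pick cheapest ready insn */
            if (done[i]) continue;
            double c = ALPHA * latency[i] + BETA * hamming(prev, code[i]);
            if (c < bestc) { bestc = c; best = i; }
        }
        done[best] = 1;
        prev = code[best];
        printf("slot %d: insn %d (cost %.1f)\n", step, best, bestc);
    }
    return 0;
}
```

Raising BETA relative to ALPHA biases the schedule toward lower switching activity at a possible cost in cycles, which mirrors the trade-off the paper quantifies.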

42 citations

Proceedings ArticleDOI
27 Apr 2000
TL;DR: This paper presents and evaluates several instruction scheduling algorithms that reorder a given sequence of instructions taking into account the energy considerations and shows that these techniques are quite successful in reducing energy consumption and their performance is comparable to that of a pure performance-oriented scheduling.
Abstract: Reducing energy consumption has become an important issue in designing hardware and software systems in recent years. Although low power hardware components are critical for reducing energy consumption, the switching activity, which is the main source of dynamic power dissipation in electronic systems, is largely determined by the software running on these systems. In this paper, we present and evaluate several instruction scheduling algorithms that reorder a given sequence of instructions taking energy considerations into account. We first compare a performance-oriented scheduling technique with three energy-oriented instruction scheduling algorithms from both performance (execution cycles of the resulting schedules) and energy consumption points of view. Then, we propose three scheduling algorithms that consider energy and performance at the same time. The results obtained show that these techniques are quite successful in reducing energy consumption and their performance is comparable to that of a pure performance-oriented scheduling.

23 citations

Proceedings ArticleDOI
19 Apr 2001
TL;DR: The results obtained using randomly generated directed acyclic graphs show that these techniques are quite successful in reducing energy consumption and their performance is comparable to that of a pure performance-oriented scheduling.
Abstract: We present and evaluate several instruction scheduling algorithms that reorder a given sequence of instructions taking energy considerations into account. We first compare a performance-oriented scheduling technique with three energy-oriented instruction scheduling algorithms from both performance (execution cycles of the resulting schedules) and energy consumption points of view. Then, we propose scheduling algorithms that consider energy and performance at the same time. The results obtained using randomly generated directed acyclic graphs show that these techniques are quite successful in reducing energy consumption and their performance (in terms of execution cycles) is comparable to that of a pure performance-oriented scheduling.
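The evaluation setup, randomly generated DAGs, is easy to reproduce in spirit; the sketch below builds a random DAG by adding each forward edge (i, j) with i < j at a fixed probability, which guarantees acyclicity by construction. Node count, edge probability, and seed are arbitrary choices for illustration:

```c
/* Sketch of a random-DAG generator in the spirit of the evaluation
 * above: edges only run from lower- to higher-numbered nodes, so the
 * graph is acyclic by construction. Parameters are illustrative. */
#include <stdio.h>
#include <stdlib.h>

#define NODES 8
#define EDGE_PROB 30   /* percent chance of an edge i -> j */

int main(void)
{
    int adj[NODES][NODES] = {0};
    srand(42);                       /* fixed seed for repeatability */
    for (int i = 0; i < NODES; i++)
        for (int j = i + 1; j < NODES; j++)
            if (rand() % 100 < EDGE_PROB)
                adj[i][j] = 1;       /* i must be scheduled before j */

    for (int i = 0; i < NODES; i++)
        for (int j = 0; j < NODES; j++)
            if (adj[i][j]) printf("%d -> %d\n", i, j);
    return 0;
}
```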

8 citations


Cited by
Book
10 Sep 2007
TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Abstract: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem. The book tells you everything you need to know about the logical design and operation, physical design and operation, performance characteristics and resulting design trade-offs, and the energy consumption of modern memory hierarchies. You learn how to tackle the challenging optimization problems that result from the side-effects that can appear at any point in the entire hierarchy. As a result you will be able to design and emulate the entire memory hierarchy.
- Understand all levels of the system hierarchy: cache, DRAM, and disk.
- Evaluate the system-level effects of all design choices.
- Model performance and energy consumption for each component in the memory hierarchy.
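In the spirit of the modeling the book teaches, a back-of-envelope sketch of average access time and energy across a three-level hierarchy might look as follows; every hit rate, latency, and energy figure below is made up purely for illustration:

```c
/* Back-of-envelope hierarchy model: average access time and energy
 * fall out of per-level hit rates, latencies, and per-access
 * energies. All numbers are invented for illustration. */
#include <stdio.h>

int main(void)
{
    /* Cache, DRAM, disk: hit rate at each level, latency (ns),
     * energy per access (nJ). The last level always "hits". */
    double hit[3]     = {0.95, 0.999, 1.0};
    double latency[3] = {1.0, 60.0, 5e6};
    double energy[3]  = {0.5, 20.0, 1e5};

    double p_reach = 1.0, t = 0.0, e = 0.0;
    for (int lvl = 0; lvl < 3; lvl++) {
        t += p_reach * latency[lvl];   /* every access reaching this level pays it */
        e += p_reach * energy[lvl];
        p_reach *= (1.0 - hit[lvl]);   /* misses continue downward */
    }
    printf("avg access time: %.2f ns, avg energy: %.2f nJ\n", t, e);
    return 0;
}
```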

659 citations

Proceedings ArticleDOI
04 Mar 2002
TL;DR: An algorithm integrated into a compiler is presented which analyses the application and selects program and data parts which are placed into the scratchpad; comparisons against a cache solution show remarkable advantages between 12% and 43% in energy consumption for designs of the same memory size.
Abstract: The number of embedded systems is increasing and a remarkable percentage is designed as mobile applications. For the latter, energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a high amount of energy. The use of additional less power hungry memories like caches or scratchpads is thus common. Caches incorporate the hardware control logic for moving data in and out automatically. On the other hand, this logic requires chip area and energy. A scratchpad memory is much more energy efficient, but there is a need for software control of its content. In this paper, an algorithm integrated into a compiler is presented which analyses the application and selects program and data parts which are placed into the scratchpad. Comparisons against a cache solution show remarkable advantages between 12% and 43% in energy consumption for designs of the same memory size.
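Such compiler selection steps are commonly formulated as a 0/1 knapsack problem: each program or data object has a size and an estimated energy saving if placed in the scratchpad, and the compiler maximizes the total saving within the SPM capacity. The sketch below shows that formulation with invented numbers; it is not claimed to be this paper's exact algorithm:

```c
/* Sketch of scratchpad object selection as a 0/1 knapsack: maximize
 * estimated energy saving subject to SPM capacity. Object sizes and
 * savings are invented; a real pass derives them from profiling or
 * static analysis. */
#include <stdio.h>

#define NOBJ 4
#define SPM_BYTES 1024

int main(void)
{
    int size[NOBJ]   = {400, 300, 500, 200};  /* object sizes (bytes)   */
    int saving[NOBJ] = {90,  60,  110, 35};   /* energy saved if placed */
    long best[SPM_BYTES + 1] = {0};

    for (int i = 0; i < NOBJ; i++)            /* classic 0/1 knapsack DP */
        for (int cap = SPM_BYTES; cap >= size[i]; cap--)
            if (best[cap - size[i]] + saving[i] > best[cap])
                best[cap] = best[cap - size[i]] + saving[i];

    printf("max energy saving within %d bytes: %ld\n",
           SPM_BYTES, best[SPM_BYTES]);
    return 0;
}
```

A production pass would also recover which objects were chosen, and may use an ILP solver instead of this dynamic program, but the objective is the same.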

374 citations

14 Oct 2005
TL;DR: In this article, the authors examined the potential of using the STI Cell processor as a building block for future high-end computing systems and proposed modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Abstract: The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. We are the first to present quantitative Cell performance data on scientific kernels and show direct comparisons against leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1) architectures. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop both analytical models and simulators to predict kernel performance. Our work also explores the complexity of mapping several important scientific algorithms onto the Cell's unique architecture. Additionally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

346 citations

Proceedings ArticleDOI
03 May 2006
TL;DR: This work introduces a performance model for Cell and applies it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs, and proposes modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Abstract: The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
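A deliberately simplified bound-and-bottleneck estimate conveys the flavor of such analytical models (this is not the authors' model): a kernel's time is taken as the larger of its compute time and its DMA transfer time into local store. The throughput figures below are rough ballpark values used only for illustration:

```c
/* Simplified bound-and-bottleneck kernel estimate: a kernel is
 * limited either by compute throughput or by DMA bandwidth into
 * local store. All figures are rough illustrations, not measurements
 * or the paper's model. */
#include <stdio.h>

int main(void)
{
    double flops = 2.0 * 1024 * 1024;   /* work in the kernel     */
    double bytes = 16.0 * 1024 * 1024;  /* data moved via DMA     */
    double peak  = 25.6e9;              /* assumed peak flop/s    */
    double bw    = 25.6e9;              /* assumed DMA bytes/s    */

    double t_compute = flops / peak;
    double t_memory  = bytes / bw;
    double t = t_compute > t_memory ? t_compute : t_memory;

    printf("compute: %.3e s, memory: %.3e s -> est. %.3e s\n",
           t_compute, t_memory, t);
    return 0;
}
```

Comparing the two terms immediately tells you whether a kernel such as sparse matrix-vector multiply (typically memory-bound) or dense matrix multiply (typically compute-bound) will benefit from better data movement or from more arithmetic throughput.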

317 citations

Journal ArticleDOI
TL;DR: This research proposes a dynamic allocation methodology for global and stack data and program code that accounts for changing program requirements at runtime, has no software-caching tags, requires no runtime checks, has extremely low overheads, and yields 100% predictable memory access times.
Abstract: In this research, we propose a highly predictable, low-overhead, and yet dynamic memory-allocation strategy for embedded systems with scratch pad memory. A scratch pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus cache and by its significantly lower overheads in energy consumption, area, and overall runtime, even with a simple allocation scheme. Scratch pad allocation methods are primarily of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption, and SRAM space for tags and deliver poor real-time guarantees just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. It is easy to see why a data allocation that never changes at runtime cannot achieve the full locality benefits of a cache. We propose a dynamic allocation methodology for global and stack data and program code that: (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no runtime checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the scratch pad using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to a provably optimal static allocation, results show that our scheme reduces runtime by up to 39.8% and energy by up to 31.3%, on average, for our benchmarks, depending on the SRAM size used. The actual gain depends on the SRAM size, but our results show that close to the maximum benefit in runtime and energy is achieved for a substantial range of small SRAM sizes commonly found in embedded systems. Our comparison with a direct mapped cache shows that our method performs roughly as well as a cached architecture.
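The key property, fixed compiler-chosen copy points with no runtime checks, can be sketched as follows; the names, the shape of the hot region, and the assumption that g is read-only inside it are all invented for illustration:

```c
/* Sketch of compiler-inserted SPM copies at fixed program points:
 * one predictable block transfer before the hot region, after which
 * every access hits the scratch pad with a known latency and no
 * software tag checks. Names and region shape are invented. */
#include <string.h>

#define N 512
static int spm_copy_of_g[N];      /* resides in scratch-pad SRAM */
int g[N];                         /* global in off-chip memory   */

long hot_region(void)
{
    /* Compiler-inserted copy at a fixed, infrequent point. */
    memcpy(spm_copy_of_g, g, sizeof g);

    long sum = 0;
    for (int r = 0; r < 1000; r++)          /* frequent accesses... */
        for (int i = 0; i < N; i++)
            sum += spm_copy_of_g[i];        /* ...all served by SPM */

    /* g[] is read-only here, so no write-back; otherwise the
     * compiler would insert the reverse copy (eviction) now. */
    return sum;
}
```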

240 citations