
Compilers, Architecture, and Synthesis for Embedded Systems 

About: Compilers, Architecture, and Synthesis for Embedded Systems is an academic conference. The conference publishes mainly in the areas of compilers and caches. Over its lifetime, 549 publications have appeared at the conference, receiving 14,593 citations in total.


Papers
Proceedings ArticleDOI
Seon-Yeong Park, Dawoon Jung, Jeong-Uk Kang, Jin-Soo Kim, Joonwon Lee
22 Oct 2006
TL;DR: The Clean-First LRU (CFLRU) replacement algorithm is proposed, which exploits the characteristics of flash memory and reduces the average replacement cost by 28.4% in the swap system and by 26.2% in the buffer cache compared with the LRU algorithm.
Abstract: In most operating systems, which are customized for disk-based storage systems, the replacement algorithm considers only the number of memory hits. However, flash memory has different read and write costs in terms of both time and energy, so a replacement algorithm for flash memory should consider not only the hit count but also the replacement cost incurred by selecting dirty victims. The replacement cost of a dirty page is higher than that of a clean page with regard to both access time and energy consumption. In this paper, we propose the Clean-First LRU (CFLRU) replacement algorithm, which exploits the characteristics of flash memory. CFLRU splits the LRU list into a working region and a clean-first region and adopts a policy that preferentially evicts clean pages from the clean-first region, as long as the number of page hits in the working region is kept at a suitable level. In trace-driven simulation, the proposed algorithm reduces the average replacement cost by 28.4% in the swap system and by 26.2% in the buffer cache, compared with the LRU algorithm. We also implement the CFLRU algorithm in the Linux kernel and discuss some optimization issues.
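The eviction rule is easy to sketch. Below is a minimal Python illustration of the clean-first idea, not the authors' kernel implementation: the tail of the LRU list forms a clean-first window whose clean pages are evicted before any dirty page. The class name and window parameter are hypothetical.

```python
from collections import OrderedDict

class CFLRUCache:
    """Minimal sketch of Clean-First LRU: within a clean-first window at the
    LRU end of the list, clean pages are evicted before dirty ones, because
    writing a dirty victim back to flash is far more expensive."""

    def __init__(self, capacity, window):
        self.capacity = capacity     # total number of page frames
        self.window = window         # size of the clean-first region
        self.pages = OrderedDict()   # page -> dirty flag, MRU entry at the end

    def access(self, page, is_write=False):
        if page in self.pages:
            dirty = self.pages.pop(page) or is_write
            self.pages[page] = dirty             # move to the MRU position
            return
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page] = is_write

    def _evict(self):
        lru_order = list(self.pages.items())     # LRU first, MRU last
        for page, dirty in lru_order[:self.window]:
            if not dirty:                        # prefer the LRU clean page
                del self.pages[page]
                return
        victim, _ = lru_order[0]                 # no clean page in the window:
        del self.pages[victim]                   # fall back to plain LRU
                                                 # (a real system flushes it first)
```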

434 citations

Proceedings ArticleDOI
16 Nov 2001
TL;DR: Based on profiling information on computation time and data sharing at the level of procedure calls, a cost graph is constructed for a given application program and a partition scheme is applied to statically divide the program into server tasks and client tasks such that the energy consumed by the program is minimized.
Abstract: We consider handheld computing devices which are connected to a server (or a powerful desktop machine) via a wireless LAN. On such devices, it is often possible to save energy on the handheld by offloading its computation to the server. In this work, based on profiling information on computation time and data sharing at the level of procedure calls, we construct a cost graph for a given application program. We then apply a partition scheme to statically divide the program into server tasks and client tasks such that the energy consumed by the program is minimized. Experiments are performed on a suite of multimedia benchmarks. Results show considerable energy savings for several programs through offloading.
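As a rough illustration of the trade-off the partitioner evaluates, the Python sketch below assigns each profiled procedure to whichever side has the lower estimated energy. The procedure names and numbers are hypothetical, and, unlike the paper's cost-graph formulation, this toy version ignores data sharing between calls.

```python
def partition(procedures):
    """Toy sketch (not the paper's algorithm): place each profiled procedure
    on the handheld or the server, whichever costs less energy.

    Each entry: (name, local_energy, remote_compute_energy, transfer_energy),
    where transfer_energy models shipping data over the wireless LAN."""
    placement = {}
    for name, local_e, remote_e, xfer_e in procedures:
        offload_cost = remote_e + xfer_e
        placement[name] = "server" if offload_cost < local_e else "client"
    return placement

# Hypothetical profiling numbers (millijoules) for a media decoder:
procs = [("parse_header", 0.4, 0.1, 1.2),
         ("idct",         9.0, 0.5, 2.0),
         ("render",       1.5, 0.2, 6.0)]
print(partition(procs))   # idct is cheaper to offload; the others stay local
```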

309 citations

Proceedings ArticleDOI
30 Oct 2003
TL;DR: A dynamic allocation method for global and stack data that accounts for changing program requirements at runtime, has no software-caching tags, requires no run-time checks, has extremely low overheads, and yields 100% predictable memory access times is presented.
Abstract: This paper presents a highly predictable, low-overhead and yet dynamic memory allocation strategy for embedded systems with scratch-pad memory. A scratch-pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus a cache and by its significantly lower overheads in energy consumption, area and overall runtime, even with a simple allocation scheme [4]. Existing scratch-pad allocation methods are of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption and SRAM space for tags, and deliver poor real-time guarantees just like hardware caches. A second category of algorithms partitions variables at compile time into the two banks. For example, our previous work in [3] derives a provably optimal static allocation for global and stack variables and achieves a speedup over all earlier methods. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. It is easy to see why a data allocation that never changes at runtime cannot achieve the full locality benefits of a cache. In this paper we present a dynamic allocation method for global and stack data that, for the first time, (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no run-time checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method, data that is about to be accessed frequently is copied into the SRAM using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to a provably optimal static allocation, our results show runtime reductions ranging from 11% to 38%, averaging 31.2%, using no additional hardware support. With hardware support for pseudo-DMA and full DMA, which is already provided in some commercial systems, the runtime reductions increase to 33.4% and 34.2%, respectively.
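The runtime side of the method amounts to copy and evict operations that the compiler inserts at chosen program points. The Python sketch below is only a conceptual model of that behavior under an assumed scratch-pad size and hypothetical block names; the real method selects what to copy at compile time and emits the copies as native code or DMA transfers.

```python
class ScratchPad:
    """Conceptual model of compiler-managed scratch-pad (SPM) allocation:
    at fixed program points, data about to be accessed frequently is copied
    into fast SRAM, and older data is evicted (written back) to DRAM."""

    def __init__(self, size_bytes):
        self.size = size_bytes
        self.used = 0
        self.resident = {}      # block name -> size in bytes

    def bring_in(self, name, size):
        # Stands in for the compiler-inserted copy code before a hot region.
        while self.used + size > self.size and self.resident:
            victim = next(iter(self.resident))        # evict the oldest block
            self.used -= self.resident.pop(victim)    # (write back to DRAM)
        self.resident[name] = size
        self.used += size

spm = ScratchPad(4096)
spm.bring_in("coeff_table", 1024)   # before the filter loop
spm.bring_in("frame_buf", 3584)     # next hot region; coeff_table gets evicted
```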

236 citations

Proceedings ArticleDOI
08 Oct 2002
TL;DR: An energy-aware scheduling policy for non-real-time operating systems that benefits from event counters is proposed and energy measurements of the target architecture under variable load show the advantage of the proposed approach.
Abstract: Scalability of the core frequency is a common feature of low-power processor architectures. Many heuristics for frequency scaling have been proposed in the past to find the best trade-off between energy efficiency and computational performance. With complex applications exhibiting unpredictable behavior, these heuristics cannot reliably adjust the operating point of the hardware because they do not know where the energy is spent and why performance is lost. Embedded hardware monitors in the form of event counters have proven to offer valuable information in the field of performance analysis. We demonstrate that counter values can also reveal the power-specific characteristics of a thread. In this paper we propose an energy-aware scheduling policy for non-real-time operating systems that benefits from event counters. By exploiting the information from these counters, the scheduler determines the appropriate clock frequency for each individual thread running in a time-sharing environment. A recurrent analysis of the thread-specific energy and performance profile allows the frequency to be adjusted to behavioral changes of the application. While the clock frequency may vary over a wide range, application performance should only suffer slightly (e.g. a 10% performance loss compared with execution at the highest clock speed). Because of the similarity to a car's cruise control, we call our scheduling policy Process Cruise Control. This adaptive clock scaling is accomplished by the operating system without any application support. Process Cruise Control has been implemented on the Intel XScale architecture, which offers a variety of frequencies and a set of configurable event counters. Energy measurements of the target architecture under variable load show the advantage of the proposed approach.
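The core decision can be sketched as picking, from a discrete set of frequencies, the lowest one whose predicted slowdown for a thread stays within the tolerated performance loss. The Python sketch below uses an assumed frequency table and a simple counter-derived memory-boundedness estimate; it illustrates the idea, not the published policy.

```python
FREQUENCIES_MHZ = [200, 300, 400, 600]   # assumed XScale-like frequency steps

def pick_frequency(mem_stall_cycles, total_cycles, max_loss=0.10):
    """Pick the lowest clock whose predicted slowdown stays within max_loss.

    mem_stall_cycles and total_cycles come from per-thread event counters;
    time spent waiting on memory does not scale with the core clock, so a
    memory-bound thread can be slowed down with little performance loss."""
    mem_fraction = mem_stall_cycles / total_cycles
    f_max = FREQUENCIES_MHZ[-1]
    for f in FREQUENCIES_MHZ:                        # try the slowest first
        slowdown = (1 - mem_fraction) * (f_max / f - 1)
        if slowdown <= max_loss:
            return f
    return f_max

# A thread stalling on memory 90% of the time fits a 10% loss budget at 300 MHz:
print(pick_frequency(mem_stall_cycles=90_000, total_cycles=100_000))
```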

231 citations

Proceedings ArticleDOI
22 Sep 2004
TL;DR: This paper presents an efficient algorithm for exact enumeration of all possible candidate instructions given the dataflow graph (DFG) corresponding to a code fragment, and achieves orders of magnitude speedup in enumerating these candidate custom instructions for very large DFGs.
Abstract: Extensible processors allow addition of application-specific custom instructions to the core instruction set architecture. However, it is computationally expensive to automatically select the optimal set of custom instructions. Therefore, heuristic techniques are often employed to quickly search the design space. In this paper, we present an efficient algorithm for exact enumeration of all possible candidate instructions given the dataflow graph (DFG) corresponding to a code fragment. Even though this is similar to the "subgraph enumeration" problem (which is exponential), we find that most subgraphs are not feasible candidates for various reasons. In fact, the number of candidates is quite small compared to the size of the DFG. Compared to previous approaches, our technique achieves orders of magnitude speedup in enumerating these candidate custom instructions for very large DFGs.
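To make the pruning concrete, the Python sketch below brute-forces small subgraphs of a toy DFG and keeps only those that are convex and fit assumed input/output limits (e.g. register-file ports). The paper's contribution is an exact enumeration that avoids this brute force; the sketch only illustrates the feasibility conditions that eliminate most subgraphs.

```python
from itertools import combinations

def enumerate_candidates(nodes, edges, max_in=4, max_out=2, max_size=4):
    """Toy enumeration: keep subgraphs that are convex and obey assumed
    limits on external inputs and outputs."""
    succ = {n: {v for u, v in edges if u == n} for n in nodes}

    def io_ok(sub):
        ins = {u for u, v in edges if v in sub and u not in sub}
        outs = {u for u, v in edges if u in sub and v not in sub}
        return len(ins) <= max_in and len(outs) <= max_out

    def convex(sub):
        # Not convex if some path leaves the subgraph and re-enters it.
        frontier = {v for u in sub for v in succ[u] if v not in sub}
        seen = set()
        while frontier:
            x = frontier.pop()
            if x in seen:
                continue
            seen.add(x)
            if succ[x] & sub:
                return False
            frontier |= {v for v in succ[x] if v not in sub}
        return True

    return [sub for k in range(1, max_size + 1)
            for sub in map(frozenset, combinations(nodes, k))
            if io_ok(sub) and convex(sub)]

# Toy DFG: a -> b -> d and a -> c -> d ({a, d} alone is not convex).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(enumerate_candidates(nodes, edges))
```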

176 citations

Performance Metrics
No. of papers from the Conference in previous years
Year    Papers
2021    2
2020    8
2018    15
2017    22
2016    21
2015    26