
Showing papers on "Cache coloring published in 2000"


Journal ArticleDOI
TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).

2,174 citations
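
A minimal sketch of the summary-cache idea, assuming a Bloom-filter-style summary as the compact directory representation and hypothetical Proxy/BloomSummary classes; each proxy consults its neighbors' periodically rebuilt summaries before sending any inter-cache query:

```python
import hashlib

class BloomSummary:
    """Compact, periodically rebuilt summary of a proxy's cache directory."""
    def __init__(self, num_bits=8 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # False positives are possible; false negatives only until the next
        # periodic summary update (the protocol tolerates both).
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

class Proxy:
    def __init__(self, name):
        self.name = name
        self.cache = {}                 # url -> cached object
        self.neighbor_summaries = {}    # neighbor name -> BloomSummary

    def build_summary(self):
        summary = BloomSummary()
        for url in self.cache:
            summary.add(url)
        return summary

    def lookup(self, url, neighbors):
        if url in self.cache:
            return self.cache[url]
        # Query only neighbors whose summary suggests a potential hit,
        # instead of broadcasting a query to every peer.
        for peer in neighbors:
            summary = self.neighbor_summaries.get(peer.name)
            if summary and summary.might_contain(url) and url in peer.cache:
                return peer.cache[url]
        return None  # caller fetches from the origin server

a, b = Proxy("A"), Proxy("B")
b.cache["http://example.com/x"] = "<page>"
a.neighbor_summaries["B"] = b.build_summary()     # periodic summary exchange
print(a.lookup("http://example.com/x", [b]) is not None)   # True
```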


Patent
24 Nov 2000
TL;DR: In this paper, a preloader uses a cache manager to manage requests for retrievals, insertions, and removals of web page components in a component cache, a cache replacement manager to manage replacement of cached components, and a profile server to predict a user's next content request.
Abstract: A preloader works in conjunction with a web/app server and optionally a profile server to cache web page content elements or components for faster on-demand and anticipatory dynamic web page delivery. The preloader uses a cache manager to manage requests for retrievals, insertions, and removal of web page components in a component cache. The preloader uses a cache replacement manager to manage the replacement of components in the cache. While the cache replacement manager may utilize any cache replacement policy, a particularly effective replacement policy utilizes predictive information to make replacement decisions. Such a policy uses a profile server, which predicts a user's next content request. The components that can be cached are identified by tagging them within the dynamic scripts that generate them. The preloader caches components that are likely to be accessed next, thus improving a web site's scalability.

366 citations
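
A rough behavioral sketch, not the patent's implementation, of a component cache whose replacement decisions use predictions from a profile server; the predict_next callable and the component names are hypothetical:

```python
class PredictiveComponentCache:
    """Component cache whose eviction prefers items the profile server
    does not expect to be requested soon (a hypothetical policy sketch)."""

    def __init__(self, capacity, predict_next):
        self.capacity = capacity
        self.predict_next = predict_next   # callable: user -> set of component ids
        self.store = {}                    # component id -> rendered fragment

    def get(self, key):
        return self.store.get(key)

    def put(self, key, value, user):
        if key not in self.store and len(self.store) >= self.capacity:
            predicted = self.predict_next(user)
            # Evict a component that is not predicted to be needed next;
            # fall back to an arbitrary victim if everything is predicted.
            victims = [k for k in self.store if k not in predicted]
            victim = victims[0] if victims else next(iter(self.store))
            del self.store[victim]
        self.store[key] = value

# Example: a toy profile server that always predicts the "header" component.
cache = PredictiveComponentCache(capacity=2, predict_next=lambda user: {"header"})
cache.put("header", "<div>...</div>", user="alice")
cache.put("news", "<ul>...</ul>", user="alice")
cache.put("weather", "<p>...</p>", user="alice")   # evicts "news", keeps "header"
```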


Patent
23 Mar 2000
TL;DR: In this paper, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network. The cache servers are managed by receiving a message indicating at least the connected location of a mobile computer in the wireless network, selecting one or more cache servers located near the mobile computer according to the message, and controlling these cache servers to cache WWW information selected for the mobile computer, so as to enable faster access to the selected WWW information by the mobile computer.
Abstract: In the disclosed information delivery scheme for delivering WWW information provided by information servers on the Internet to mobile computers connected to the Internet through a wireless network, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network. The cache servers can be managed by receiving a message indicating at least a connected location of a mobile computer in the wireless network from the mobile computer, selecting one or more cache servers located nearby the mobile computer according to the message, and controlling these one or more cache servers to cache selected WWW information selected for the mobile computer, so as to enable faster accesses to the selected WWW information by the mobile computer. Also, the cache servers can be managed by selecting one or more cache servers located within a geographic range defined for an information provider who provides WWW information from an information server, and controlling these one or more cache servers to cache selected WWW information selected for the information provider, so as to enable faster accesses to the selected WWW information by the mobile computer.

300 citations


Proceedings ArticleDOI
01 Aug 2000
TL;DR: This paper focuses on the features of the M340 cache sub-system and illustrates the effect on power and performance through benchmark analysis and actual silicon measurements.
Abstract: Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the device's components. The M·CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with programmable features was added to the M3 core. These features allow the architecture to be optimized based on the application's requirements. In this paper, we focus on the features of the M340 cache sub-system and illustrate the effect on power and performance through benchmark analysis and actual silicon measurements.

253 citations


Proceedings ArticleDOI
01 May 2000
TL;DR: A practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement is presented.
Abstract: As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.

224 citations
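
A deliberately simplified software model of the paper's two ideas, full associativity via tag lookup plus generational replacement; the pool structure and promotion rule below are an approximation, not the IIC hardware design:

```python
from collections import OrderedDict

class GenerationalCache:
    """Simplified model of a fully associative cache with generational
    replacement: lines enter the lowest generation, move up when
    re-referenced, and victims come from the lowest non-empty generation."""

    def __init__(self, capacity, num_generations=4):
        self.capacity = capacity
        self.generations = [OrderedDict() for _ in range(num_generations)]

    def _find(self, tag):
        for level, gen in enumerate(self.generations):
            if tag in gen:
                return level
        return None

    def access(self, tag):
        level = self._find(tag)
        if level is not None:                       # hit: promote one generation
            self.generations[level].pop(tag)
            dst = min(level + 1, len(self.generations) - 1)
            self.generations[dst][tag] = None
            return True
        if sum(len(g) for g in self.generations) >= self.capacity:
            for gen in self.generations:            # evict from the least-referenced pool
                if gen:
                    gen.popitem(last=False)         # oldest entry in that pool
                    break
        self.generations[0][tag] = None             # new lines start in generation 0
        return False
```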


Patent
16 Feb 2000
TL;DR: In this article, a field programmable gate array (FPGA) is described which includes first and second arrays of configurable logic blocks, and first and second configuration cache memories coupled to the first and second arrays, respectively.
Abstract: A field programmable gate array (FPGA) which includes first and second arrays of configurable logic blocks, and first and second configuration cache memories coupled to the first and second arrays of configurable logic blocks, respectively. The first configuration cache memory array can either store values for reconfiguring the first array of configurable logic blocks, or operate as a RAM. Similarly, the second configuration cache array can either store values for reconfiguring the second array of configurable logic blocks, or operate as a RAM. The first configuration cache memory array and the second configuration cache memory array are independently controlled, such that partial reconfiguration of the FPGA can be accomplished. In addition, the second configuration cache memory array can store values for reconfiguring the first (rather than the second) array of configurable logic blocks, thereby providing a second-level reconfiguration cache memory.

222 citations


Patent
25 Jan 2000
TL;DR: In this paper, a relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache, where data is preloaded and responsively cached in the cache memory based on user preferences.
Abstract: An apparatus and method for caching data in a storage device (26) of a computer system (10). A relatively high-speed, intermediate-volume storage device (25) is operated as a user-configurable cache. Requests to access a mass storage device (46) such as a disk or tape (26, 28) are intercepted by a device driver (32) that compares the access request against a directory (51) of the contents of the user-configurable cache (25). If the user-configurable cache contains the data sought to be accessed, the access request is carried out in the user-configurable cache instead of being forwarded to the device driver for the target mass storage device (46). Because the user-cache is implemented using memory having a dramatically shorter access time than most mechanical mass storage devices, the access request is fulfilled much more quickly than if the originally intended mass storage device was accessed. Data is preloaded and responsively cached in the user-configurable cache memory based on user preferences.

205 citations


Proceedings ArticleDOI
01 Dec 2000
TL;DR: The design and evaluation of the compression cache (CC) is presented which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths.
Abstract: Since the area occupied by cache memories on processor chips continues to grow, an increasing percentage of power is consumed by memory. We present the design and evaluation of the compression cache (CC) which is a first level cache that has been designed so that each cache line can either hold one uncompressed line or two cache lines which have been compressed to at least half their lengths. We use a novel data compression scheme based upon encoding of a small number of values that appear frequently during memory accesses. This compression scheme preserves the ability to randomly access individual data items. We observed that the contents of 40%, 52% and 51% of the memory blocks of size 4, 8, and 16 words respectively in SPECint95 benchmarks can be compressed to at least half their sizes by encoding the top 2, 4, and 8 frequent values respectively. Compression allows greater amounts of data to be stored leading to substantial reductions in miss rates (0-36.4%), off-chip traffic (3.9-48.1%), and energy consumed (1-27%). Traffic and energy reductions are in part derived by transferring data over external buses in compressed form.

195 citations
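
A small sketch of the frequent-value observation behind the compression cache, using a rough encoding model (one flag bit per word, a short index for frequent values, full words otherwise) rather than the paper's exact format:

```python
from collections import Counter
from math import ceil, log2

WORD_BITS = 32

def compressed_bits(block, frequent):
    """Size of a block (list of 32-bit words) under a simple frequent-value
    encoding: 1 flag bit per word, a short index for frequent values, and
    the full word otherwise."""
    index_bits = max(1, ceil(log2(len(frequent))))
    total = 0
    for word in block:
        total += 1  # flag bit: frequent or not
        total += index_bits if word in frequent else WORD_BITS
    return total

def fraction_compressible(blocks, top_k):
    """Fraction of blocks that fit in half their uncompressed size when the
    top_k globally most frequent words are encoded compactly."""
    counts = Counter(w for b in blocks for w in b)
    frequent = {w for w, _ in counts.most_common(top_k)}
    return sum(compressed_bits(b, frequent) <= len(b) * WORD_BITS // 2
               for b in blocks) / len(blocks)

# Toy example: zero-dominated 8-word blocks compress easily with top_k = 4.
blocks = [[0, 0, 0, 1, 0, 0xDEADBEEF, 0, 0] for _ in range(10)]
print(fraction_compressible(blocks, top_k=4))   # 1.0 for this toy input
```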


Proceedings ArticleDOI
01 Dec 2000
TL;DR: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte and an instruction recoding technique is described that increases instruction cache energy savings to 18%.
Abstract: Dynamic zero compression reduces the energy required for cache accesses by only writing and reading a single bit for every zero-valued byte. This energy-conscious compression is invisible to software and is handled with additional circuitry embedded inside the cache RAM arrays and the CPU. The additional circuitry imposes a cache area overhead of 9% and a read latency overhead of around two FO4 gate delays. Simulation results show that we can reduce total data cache energy by around 26% and instruction cache energy by around 10% for SPECint95 and MediaBench benchmarks. We also describe the use of an instruction recoding technique that increases instruction cache energy savings to 18%.

181 citations
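
A sketch of the zero-indicator encoding the paper describes, assuming one indicator bit per byte; the hardware operates inside the RAM arrays, so this only models what gets stored and read:

```python
def dzc_encode(data: bytes):
    """Encode a cache line with a zero-indicator bit per byte: zero bytes
    are represented only by their indicator bit, non-zero bytes are kept."""
    indicators = 0
    payload = bytearray()
    for i, byte in enumerate(data):
        if byte == 0:
            indicators |= 1 << i        # 1 means "this byte is zero"
        else:
            payload.append(byte)
    return indicators, bytes(payload)

def dzc_decode(indicators, payload, length):
    out = bytearray()
    it = iter(payload)
    for i in range(length):
        out.append(0 if indicators & (1 << i) else next(it))
    return bytes(out)

line = bytes([0, 0, 7, 0, 0, 0, 42, 0])          # mostly-zero cache line
ind, packed = dzc_encode(line)
assert dzc_decode(ind, packed, len(line)) == line
print(len(line), "bytes ->", len(packed), "data bytes plus 1 indicator bit per byte")
```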


Patent
12 Jun 2000
TL;DR: In this paper, the authors propose a Web page cache that stores Web pages such that servers will be able to retrieve valid dynamic pages without going to a dynamic content server or origin Web server for the page every time a user requests that dynamic page.
Abstract: A Web page cache that stores Web pages such that servers will be able to retrieve valid dynamic pages without going to a dynamic content server or origin Web server for the page every time a user requests that dynamic page. The dynamic content cache receives information that defines data upon which each dynamic page is dependent, such that when the value of any dependency data item changes, the associated dynamic page is marked as invalid or deleted. The dynamic page cache stores dependency data, receives change event information, and indicates when pages in the cache are invalidated or need to be refreshed.

165 citations
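
A minimal sketch of the dependency-tracking invalidation described above; the URLs and dependency keys are illustrative:

```python
from collections import defaultdict

class DynamicPageCache:
    """Cache of rendered dynamic pages that tracks which data items each
    page depends on; a change event invalidates every dependent page."""

    def __init__(self):
        self.pages = {}                       # url -> rendered HTML
        self.dependents = defaultdict(set)    # data item -> set of urls

    def put(self, url, html, depends_on):
        self.pages[url] = html
        for item in depends_on:
            self.dependents[item].add(url)

    def get(self, url):
        return self.pages.get(url)            # None means regenerate

    def on_change(self, item):
        """Called when a dependency data item changes value."""
        for url in self.dependents.pop(item, set()):
            self.pages.pop(url, None)         # mark invalid by deletion

cache = DynamicPageCache()
cache.put("/account/42", "<html>balance: 10</html>", depends_on={"balance:42"})
cache.on_change("balance:42")
assert cache.get("/account/42") is None       # must be regenerated
```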


Patent
19 Apr 2000
TL;DR: In this article, a disk drive consisting of a cache memory and a cache control system with a tag memory having a plurality of tag records, and means for allocating a tag record for responding to a host command is described.
Abstract: The present invention relates to a disk drive comprising a cache memory and a cache control system having a tag memory with a plurality of tag records, and means for allocating a tag record for responding to a host command. The cache memory has a plurality of sequentially-ordered memory clusters for caching disk data stored in sectors on disks of a disk assembly. Conventionally, the disk sectors are identified by logical block addresses (LBAs). The tag memory and the means for allocating tag records are embedded within the cache control system and are configured for use in defining variable-length segments of the memory clusters. The segments are defined without regard to the sequential order of the memory clusters.

Proceedings ArticleDOI
01 Jun 2000
TL;DR: This approach is the first one to measure and optimize the power consumption of a complete SOC comprising a CPU, instruction cache, data cache, main memory, data buses and address bus through code compression.
Abstract: We propose instruction code compression as an efficient method for reducing power on an embedded system. Our approach is the first one to measure and optimize the power consumption of a complete SOC (System-on-a-Chip) comprising a CPU, instruction cache, data cache, main memory, data buses and address bus through code compression. We compare the pre-cache architecture (decompressor between main memory and cache) to a novel post-cache architecture (decompressor between cache and CPU). Our simulations and synthesis results show that our methodology results in large energy savings between 22% and 82% compared to the same system without code compression. Furthermore, we demonstrate that power savings come with reduced chip area and the same or even improved performance.

Patent
31 Jul 2000
TL;DR: In this paper, the authors present a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host, where a lossy state record is provided for memory segments in a cache memory.
Abstract: The present invention may be embodied in a cache control method for caching disk data in a disk drive configured to receive commands for both streaming and non-streaming data from a host. A lossy state record is provided for memory segments in a cache memory. The lossy state record allows host commands to be mixed for streaming and non-streaming data without flushing of cache data for a command mode change.

Patent
19 Apr 2000
TL;DR: In this paper, a disk drive including a cache memory having a plurality of sequentially-ordered memory clusters for caching disk data stored in sectors on disks of a disk assembly is described.
Abstract: The present invention relates to a disk drive including a cache memory having a plurality of sequentially-ordered memory clusters for caching disk data stored in sectors on disks of a disk assembly. The disk sectors are identified by logical block addresses (LBAs). A cache control system of the disk drive comprises a cluster control block memory, having a plurality of cluster control blocks (CCBs), and a tag memory, having a plurality of tag records, that are embedded within the cache control system. Each CCB includes a cluster segment record with an entry for associating the CCB with a particular memory cluster and for forming variable-length segments of the memory clusters without regard to the sequential order of the memory clusters. Each tag record assigns a segment to a continuous range of LBAs and defines the CCBs forming the segment. Each segment of the memory clusters is for caching data from a contiguous range of the logical block addresses.

Proceedings ArticleDOI
01 May 2000
TL;DR: A generalization of time skewing for multiprocessor architectures is given, and techniques for using multilevel caches reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
Abstract: Time skewing is a compile-time optimization that can provide arbitrarily high cache hit rates for a class of iterative calculations, given a sufficient number of time steps and sufficient cache memory. Thus, it can eliminate processor idle time caused by inadequate main memory bandwidth. In this article, we give a generalization of time skewing for multiprocessor architectures, and discuss time skewing for multilevel caches. Our generalization for multiprocessors lets us eliminate processor idle time caused by any combination of inadequate main memory bandwidth, limited network bandwidth, and high network latency, given a sufficiently large problem and sufficient cache. As in the uniprocessor case, the cache requirement grows with the machine balance rather than the problem size. Our techniques for using multilevel caches reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
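
An illustrative time-skewed loop nest for a 1-D three-point stencil; for easy verification it stores every time step, whereas a real time-skewed code keeps only a cache-sized slanted working set, so this shows the reordering rather than the memory savings:

```python
import numpy as np

def jacobi_naive(a0, steps):
    N = len(a0)
    A = np.empty((steps + 1, N))
    A[0] = a0
    for t in range(1, steps + 1):
        A[t, 0], A[t, N - 1] = A[t - 1, 0], A[t - 1, N - 1]   # fixed boundary
        for i in range(1, N - 1):
            A[t, i] = (A[t - 1, i - 1] + A[t - 1, i] + A[t - 1, i + 1]) / 3.0
    return A[steps]

def jacobi_time_skewed(a0, steps, tile=32):
    N = len(a0)
    A = np.empty((steps + 1, N))
    A[0] = a0
    A[1:, 0], A[1:, N - 1] = a0[0], a0[N - 1]                  # fixed boundary
    # Skewed coordinate j = i + t: tiling over j lets one tile sweep all
    # time steps while touching only a small, cache-sized slice of data.
    for jj in range(2, N - 1 + steps, tile):
        for t in range(1, steps + 1):
            lo = max(1, jj - t)
            hi = min(N - 2, jj + tile - 1 - t)
            for i in range(lo, hi + 1):
                A[t, i] = (A[t - 1, i - 1] + A[t - 1, i] + A[t - 1, i + 1]) / 3.0
    return A[steps]

a0 = np.random.rand(200)
assert np.allclose(jacobi_naive(a0, 50), jacobi_time_skewed(a0, 50))
```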

Proceedings ArticleDOI
01 Jun 2000
TL;DR: A way to improve the performance of embedded processors running data-intensive applications is proposed, allowing software to allocate on-chip memory on an application-specific basis via a novel hardware mechanism called column caching.
Abstract: We propose a way to improve the performance of embedded processors running data-intensive applications by allowing software to allocate on-chip memory on an application-specific basis. On-chip memory in the form of cache can be made to act like scratch-pad memory via a novel hardware mechanism, which we call column caching. Column caching enables dynamic cache partitioning in software, by mapping data regions to specified sets of cache “columns” or “ways.” When a region of memory is exclusively mapped to an equivalent sized partition of cache, column caching provides the same functionality and predictability as a dedicated scratchpad memory for time-critical parts of a real-time application. The ratio between scratchpad size and cache size can be easily and quickly varied for each application, or each task within an application. Thus, software has much finer control of on-chip memory, providing the ability to dynamically trade off performance for on-chip memory.
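
A behavioral sketch of one set of a column-cached, 4-way cache: hits may occur in any way, but a region's misses may only replace within the ways it has been mapped to; the region-to-way mapping shown is hypothetical:

```python
class ColumnCachedSet:
    """One set of a 4-way cache in which software reserves specific ways
    ("columns") for an address region, giving that region scratchpad-like
    isolation from other traffic (a behavioral sketch, not the hardware)."""

    def __init__(self, num_ways=4):
        self.ways = [None] * num_ways          # each entry: tag or None
        self.lru = list(range(num_ways))       # least recently used first

    def access(self, tag, allowed_ways):
        for w, stored in enumerate(self.ways): # hits are allowed in any way
            if stored == tag:
                self.lru.remove(w)
                self.lru.append(w)
                return True
        # Miss: replace only within the ways this region is mapped to.
        victim = next(w for w in self.lru if w in allowed_ways)
        self.ways[victim] = tag
        self.lru.remove(victim)
        self.lru.append(victim)
        return False

cache_set = ColumnCachedSet()
REALTIME_WAYS = {0, 1}        # region A gets a private, scratchpad-like half
OTHER_WAYS = {2, 3}
cache_set.access("A:0x100", REALTIME_WAYS)
cache_set.access("B:0x900", OTHER_WAYS)       # cannot evict region A's lines
```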

Journal ArticleDOI
TL;DR: The importance of different Web proxy workload characteristics in making good cache replacement decisions is analyzed, and results indicate that higher cache hit rates are achieved using size-based replacement policies.
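
One common form of a size-based policy is to evict the largest cached object first so that many small objects survive; a minimal sketch (not the paper's exact policies):

```python
class SizeBasedProxyCache:
    """Web proxy cache that evicts the largest object first; keeping many
    small objects tends to raise the request hit rate, at some cost in
    byte hit rate (one simple form of a size-based policy)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.objects = {}          # url -> size in bytes

    def get(self, url):
        return url in self.objects

    def put(self, url, size):
        if size > self.capacity:
            return                               # never cacheable
        self.used -= self.objects.pop(url, 0)    # replace any existing copy
        while self.used + size > self.capacity:
            victim = max(self.objects, key=self.objects.get)
            self.used -= self.objects.pop(victim)
        self.objects[url] = size
        self.used += size
```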

Patent
26 Apr 2000
TL;DR: In this paper, the cache employs one or more prefetch ways for storing prefetched cache lines and one or more non-prefetch ways for storing accessed cache lines; cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into the non-prefetch ways.
Abstract: A cache employs one or more prefetch ways for storing prefetch cache lines and one or more ways for storing accessed cache lines. Prefetch cache lines are stored into the prefetch way, while cache lines fetched in response to cache misses for requests initiated by a microprocessor connected to the cache are stored into the non-prefetch ways. Accessed cache lines are thereby maintained within the cache separately from prefetch cache lines. When a prefetch cache line is presented to the cache for storage, the prefetch cache line may displace another prefetch cache line but does not displace an accessed cache line. A cache hit in either the prefetch way or the non-prefetch ways causes the cache line to be delivered to the requesting microprocessor in a cache hit fashion. The cache is further configured to move prefetch cache lines from the prefetch way to the non-prefetch way if the prefetch cache lines are requested (i.e. they become accessed cache lines). Instruction cache lines may be moved immediately upon access, while data cache line accesses may be counted and a number of accesses greater than a predetermined threshold value may occur prior to moving the data cache line from the prefetch way to the non-prefetch way. Additionally, movement of an accessed cache line from the prefetch way to the non-prefetch way may be delayed until the accessed cache line is to be replaced by a prefetch cache line.
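
A behavioral sketch of the prefetch-way scheme, using a single promotion threshold for simplicity (the patent promotes instruction lines immediately and data lines after a threshold); the capacities and threshold below are illustrative:

```python
from collections import OrderedDict

class PrefetchAwareCache:
    """Cache with a separate prefetch area: prefetched lines may only
    displace other prefetched lines, and a prefetched line is promoted to
    the main (non-prefetch) area after enough demand accesses."""

    def __init__(self, main_capacity, prefetch_capacity, promote_after=2):
        self.main = OrderedDict()        # accessed lines, LRU order
        self.prefetch = OrderedDict()    # prefetched lines, FIFO order
        self.main_capacity = main_capacity
        self.prefetch_capacity = prefetch_capacity
        self.promote_after = promote_after
        self.demand_hits = {}            # prefetched line -> demand-access count

    def insert_prefetch(self, line):
        if line in self.main or line in self.prefetch:
            return
        if len(self.prefetch) >= self.prefetch_capacity:
            victim, _ = self.prefetch.popitem(last=False)  # displaces a prefetch line only
            self.demand_hits.pop(victim, None)
        self.prefetch[line] = None
        self.demand_hits[line] = 0

    def _insert_main(self, line):
        if len(self.main) >= self.main_capacity:
            self.main.popitem(last=False)                  # evict LRU accessed line
        self.main[line] = None

    def access(self, line):
        if line in self.main:
            self.main.move_to_end(line)
            return True
        if line in self.prefetch:
            self.demand_hits[line] += 1
            if self.demand_hits[line] >= self.promote_after:
                del self.prefetch[line]                    # promote: now an accessed line
                del self.demand_hits[line]
                self._insert_main(line)
            return True                                    # hit, delivered as usual
        self._insert_main(line)                            # demand miss fills non-prefetch ways
        return False
```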

Patent
07 Dec 2000
TL;DR: In this paper, the cache memory is partitioned among a set of threads of a multi-threaded processor, and when a cache miss occurs, a replacement line is selected in a partition of the cache space which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.
Abstract: A method and apparatus which provides a cache management policy for use with a cache memory for a multi-threaded processor. The cache memory is partitioned among a set of threads of the multi-threaded processor. When a cache miss occurs, a replacement line is selected in a partition of the cache memory which is allocated to the particular thread from which the access causing the cache miss originated, thereby preventing pollution to partitions belonging to other threads.
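
A minimal sketch of per-thread victim selection as described above, using random replacement within the owning thread's partition for brevity:

```python
import random

class ThreadPartitionedCache:
    """Cache lines partitioned among hardware threads: on a miss, the
    replacement victim is chosen only from the partition owned by the
    thread that caused the miss, so one thread cannot pollute another's lines."""

    def __init__(self, partitions):
        # partitions: thread id -> number of lines reserved for that thread
        self.lines = {tid: [None] * n for tid, n in partitions.items()}

    def access(self, tid, tag):
        for part in self.lines.values():          # hits may occur in any partition
            if tag in part:
                return True
        part = self.lines[tid]                    # miss: replace within own partition
        part[random.randrange(len(part))] = tag
        return False

cache = ThreadPartitionedCache({0: 2, 1: 2})
cache.access(0, "a"); cache.access(0, "b")
cache.access(1, "x"); cache.access(1, "y")        # thread 1 never evicts "a" or "b"
```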

Patent
30 Jun 2000
TL;DR: In this article, a cache system for looking up one or more elements of an external memory includes a set of cache memory elements coupled to the external memory, a cache cache cache memory cells (CAMs) containing an address and a pointer to a cache memory element, and a matching circuit having an input such that the CAM asserts a match output when the input is the same as the address in the CAM cell.
Abstract: A cache system for looking up one or more elements of an external memory includes a set of cache memory elements coupled to the external memory, a set of content addressable memory cells (CAMs) containing an address and a pointer to one of the cache memory elements, and a matching circuit having an input such that the CAM asserts a match output when the input is the same as the address in the CAM cell. The cache memory element which a particular CAM points to changes over time. In the preferred implementation, the CAMs are connected in an order from top to bottom, and the bottom CAM points to the least recently used cache memory element.
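
A behavioral model of the ordered CAM chain: a match moves the entry to the top, and the bottom entry always identifies the least recently used cache memory element; the stored data here is a stand-in string:

```python
class CamLruCache:
    """Model of the CAM-ordered cache: an ordered chain of (address, pointer)
    entries in which a matching entry moves to the top and the bottom entry
    points at the least recently used cache memory element."""

    def __init__(self, num_entries):
        self.chain = []                   # index 0 = top (MRU), last = bottom (LRU)
        self.num_entries = num_entries
        self.data = [None] * num_entries  # the cache memory elements

    def lookup(self, address):
        for pos, (addr, slot) in enumerate(self.chain):
            if addr == address:                          # CAM match
                self.chain.insert(0, self.chain.pop(pos))
                return self.data[slot]
        # Miss: reuse the bottom entry's memory element (the LRU one),
        # or claim a free element while the chain is not yet full.
        if len(self.chain) < self.num_entries:
            slot = len(self.chain)
        else:
            _, slot = self.chain.pop()                   # bottom CAM -> LRU element
        self.data[slot] = f"<data for {address}>"        # stand-in for a memory fill
        self.chain.insert(0, (address, slot))
        return self.data[slot]
```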

Patent
Zohar Bogin, Steven J. Clohset
21 Sep 2000
Abstract: A write cache that reduces the number of memory accesses required to write data to main memory. When a memory write request is executed, the request not only updates the relevant location in cache memory, but the request is also directed to updating the corresponding location in main memory. A separate write cache is dedicated to temporarily holding multiple write requests so that they can be organized for more efficient transmission to memory in burst transfers. In one embodiment, all writes within a predefined range of addresses can be written to memory as a group. In another embodiment, entries are held in the write cache until a minimum number of entries are available for writing to memory, and a least-recently-used mechanism can be used to decide which entries to transmit first. In yet another embodiment, partial writes are merged into a single cache line, to be written to memory in a single burst transmission.
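
A behavioral sketch of the write-combining idea: partial writes to the same aligned line are merged in a small write cache and sent to memory as one burst, with the least recently written line flushed when space runs out; the line size and capacity are assumptions, and a write is assumed not to cross a line boundary:

```python
class WriteCombiningCache:
    """Small write cache that merges partial writes to the same aligned
    line so they can go to memory as one burst (a behavioral sketch)."""

    LINE = 64                                     # assumed burst/line size in bytes

    def __init__(self, max_lines=4):
        self.max_lines = max_lines
        self.lines = {}                           # line base address -> {offset: byte}
        self.order = []                           # least recently written first

    def write(self, addr, data: bytes):
        base, off = addr - addr % self.LINE, addr % self.LINE
        if base not in self.lines:
            if len(self.lines) >= self.max_lines:
                self.flush(self.order[0])         # write back the LRU line first
            self.lines[base] = {}
        for i, b in enumerate(data):
            self.lines[base][off + i] = b         # merge into the pending line
        if base in self.order:
            self.order.remove(base)
        self.order.append(base)
        if len(self.lines[base]) == self.LINE:    # fully merged: burst it out now
            self.flush(base)

    def flush(self, base):
        pending = self.lines.pop(base)
        self.order.remove(base)
        print(f"burst write of {len(pending)} bytes at 0x{base:x}")
```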

Patent
14 Dec 2000
TL;DR: In this paper, a cache system and method in accordance with the invention includes a cache near the target devices and another cache at the requesting host side so that the data traffic across the computer network is reduced.
Abstract: A cache system and method in accordance with the invention includes a cache near the target devices and another cache at the requesting host side so that the data traffic across the computer network is reduced. Cache updating and invalidation methods are described.

Patent
09 May 2000
TL;DR: In this article, a technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits, which can be inserted at the most significant bits of the cache index.
Abstract: A processor includes logic (612) for tagging a thread identifier (TID) for usage with processor blocks that are not stalled. Pertinent non-stalling blocks include caches, translation look-aside buffers (TLB) (1258, 1220), a load buffer asynchronous interface, an external memory management unit (MMU) interface (320, 330), and others. A processor (300) includes a cache that is segregated into a plurality of N cache parts. Cache segregation avoids interference, 'pollution', or 'cross-talk' between threads. One technique for cache segregation utilizes logic for storing and communicating thread identification (TID) bits. The cache utilizes cache indexing logic. For example, the TID bits can be inserted at the most significant bits of the cache index.
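
A sketch of the index computation implied by placing TID bits at the most significant end of the cache index, so each thread indexes a disjoint group of sets; the set counts and line size are illustrative, not the patent's:

```python
# TID-based cache segregation: the thread ID occupies the most significant
# bits of the cache index, so each thread indexes a disjoint group of sets.
CACHE_SETS = 256          # total sets (illustrative)
NUM_THREADS = 4           # segregate into 4 equal parts
SETS_PER_THREAD = CACHE_SETS // NUM_THREADS
LINE_BITS = 6             # 64-byte lines (illustrative)

def cache_index(tid, address):
    set_bits = (address >> LINE_BITS) % SETS_PER_THREAD   # low index bits from the address
    return tid * SETS_PER_THREAD + set_bits               # TID supplies the high index bits

# Two threads touching the same address map to different sets:
assert cache_index(0, 0x1234) != cache_index(1, 0x1234)
```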

Journal ArticleDOI
TL;DR: Another, more incremental approach of cache-conscious data layout is proposed, which uses techniques such as clustering, coloring, and compression to enhance data locality by placing structure elements more carefully in the cache.
Abstract: To narrow the widening gap between processor and memory performance, the authors propose improving the cache locality of pointer-manipulating programs and bolstering performance by careful placement of structure elements. It is concluded that considering past trends and future technology, it seems clear that the processor-memory performance gap will continue to increase and software will continue to grow larger and more complex. Although cache-conscious algorithms and data structures are the first and perhaps best place to attack this performance problem, the complexity of software design and an increasing tendency to build large software systems by assembling smaller components does not favor a focused, integrated approach. We propose another, more incremental approach of cache-conscious data layout, which uses techniques such as clustering, coloring, and compression to enhance data locality by placing structure elements more carefully in the cache.

Patent
01 Dec 2000
TL;DR: In this article, the data access layer determines whether a data item required by an application program is in the cache; if it is, the data access layer obtains the item from the cache, and otherwise it obtains the item from the data source.
Abstract: A middle-tier Web server (230) with a queryable cache (219) that contains items from one or more data sources (241). Items are included in the cache (223) on the basis of the probability of future hits on the items. When the data source (241) determines that an item that has been included in the cache (223) has changed, it sends an update message to the server (230), which updates the item if it is still included in the cache. In a preferred embodiment, the data source is a database system and triggers in the database system are used to generate update messages. In a preferred embodiment, the data access layer determines whether a data item required by an application program is in the cache. If it is, the data access layer obtains the item from the cache; otherwise, it obtains the item from the data source. The queryable cache includes a miss table that accelerates the determination of whether a data item is in the cache. The miss table is made up of miss table entries that relate the status of a data item to the query used to access the data item. There are three statuses: miss, indicating that the item is not in the cache, hit, indicating that it is, and unknown, indicating that it is not known whether the item is in the cache. When an item is referenced, the query used to access it is presented to the table. If the entry for the query has the status miss, the data access layer obtains the item from the data source instead of attempting to obtain it from the cache. If the entry has the status unknown, the data access layer attempts to obtain it from the cache and the miss table entry for the item is updated in accordance with the result. When a copy of an item is added to the cache, miss table entries with the status miss are set to indicate unknown.
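
A minimal sketch of the miss table behavior described above (miss short-circuits to the data source, unknown probes the cache and records the outcome, and adding an item turns miss entries back to unknown); the fetch_from_source callable is hypothetical:

```python
MISS, HIT, UNKNOWN = "miss", "hit", "unknown"

class QueryableCache:
    """Middle-tier cache with a miss table keyed by query text."""

    def __init__(self, fetch_from_source):
        self.items = {}                      # query -> cached result
        self.miss_table = {}                 # query -> MISS / HIT / UNKNOWN
        self.fetch_from_source = fetch_from_source

    def get(self, query):
        status = self.miss_table.get(query, UNKNOWN)
        if status == MISS:                   # known absent: skip the cache probe
            return self.fetch_from_source(query)
        if query in self.items:
            self.miss_table[query] = HIT
            return self.items[query]
        self.miss_table[query] = MISS
        return self.fetch_from_source(query)

    def put(self, query, value):
        self.items[query] = value
        # A newly cached copy may make 'miss' entries stale: reset them to unknown.
        for q, status in self.miss_table.items():
            if status == MISS:
                self.miss_table[q] = UNKNOWN

    def invalidate(self, query):             # driven by data-source update messages
        self.items.pop(query, None)
        self.miss_table[query] = UNKNOWN
```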

Patent
20 Oct 2000
TL;DR: In this paper, a cluster server apparatus is considered that continues data distribution to terminals even if one of its cache servers becomes unable to distribute, while optimally distributing the load across the plurality of cache servers.
Abstract: A cluster server apparatus operable to continuously carry out data distribution to terminals even if a failure occurs in any one of a plurality of cache servers of the cluster server apparatus, while optimally distributing loads on the plurality of cache servers. A cluster control unit of the cluster server apparatus distributes requests from the terminals based on the load of each of the plurality of cache servers. A cache server among the plurality of cache servers distributes the requested data (streaming data) to a terminal if the requested data is stored in its streaming data storage unit, while obtaining the requested data from a content server if it is not stored in the streaming data storage unit. The data distributed from the content server is redundantly stored in the respective streaming data storage units of two or more cache servers. One cache server detects the state of distribution of another cache server that stores the same data as that stored in the one cache server. If the one cache server becomes unable to carry out distribution, the other cache server continues data distribution instead.

Journal ArticleDOI
TL;DR: This paper defines a new memory consistency model, called Location Consistency (LC), in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write and synchronization operations.
Abstract: Existing memory models and cache consistency protocols assume the memory coherence property, which requires that all processors observe the same ordering of write operations to the same location. In this paper, we address the problem of defining a memory model that does not rely on the memory coherence assumption, and also the problem of designing a cache consistency protocol based on such a memory model. We define a new memory consistency model, called Location Consistency (LC), in which the state of a memory location is modeled as a partially ordered multiset (pomset) of write and synchronization operations. We prove that LC is strictly weaker than existing memory models, but is still equivalent to stronger models for the common case of parallel programs that have no data races. We also describe a new multiprocessor cache consistency protocol based on the LC memory model. We prove that this LC protocol obeys the LC memory model. The LC protocol does not need to enforce single write ownership of memory blocks. As a result, the LC protocol is simpler and more scalable than existing snooping and directory-based cache consistency protocols.

Journal ArticleDOI
TL;DR: This work proposes sacrificing some performance in exchange for energy efficiency by filtering cache references through an unusually small first level cache, which results in a 51 percent reduction in the energy-delay product when compared to a conventional design.
Abstract: Most modern microprocessors employ one or two levels of on-chip caches in order to improve performance. Caches typically are implemented with static RAM cells and often occupy a large portion of the chip area. Not surprisingly, these caches can consume a significant amount of power. In many applications, such as portable devices, energy efficiency is more important than performance. We propose sacrificing some performance in exchange for energy efficiency by filtering cache references through an unusually small first level cache. We refer to this structure as the filter cache. A second level cache, similar in size and structure to a conventional first level cache, is positioned behind the filter cache and serves to mitigate the performance loss. Extensive experiments indicate that a small filter cache still can achieve a high hit rate and good performance. This approach allows the second level cache to be in a low power mode most of the time, thus resulting in power savings. The filter cache is particularly attractive in low power applications, such as the embedded processors used for communication and multimedia applications. For example, experimental results across a wide range of embedded applications show that a direct mapped 256-byte filter cache achieves a 58 percent power reduction while reducing performance by 21 percent. This trade-off results in a 51 percent reduction in the energy-delay product when compared to a conventional design.
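
A behavioral model of a filter-cache hierarchy: a tiny direct-mapped L0 in front of a conventional cache, with counters showing how many references the small, low-energy structure absorbs; the sizes are illustrative, not the paper's configuration:

```python
class FilterCacheHierarchy:
    """Tiny direct-mapped filter cache (L0) in front of a larger cache;
    most hits are served by the low-power L0 and only L0 misses wake up
    the larger structure (a behavioral sketch)."""

    def __init__(self, l0_lines=8, l1_lines=512, line_bytes=32):
        self.line_bytes = line_bytes
        self.l0 = [None] * l0_lines          # direct-mapped tags
        self.l1 = [None] * l1_lines
        self.l0_hits = self.l1_hits = self.misses = 0

    def access(self, address):
        block = address // self.line_bytes
        i0, i1 = block % len(self.l0), block % len(self.l1)
        if self.l0[i0] == block:
            self.l0_hits += 1                # cheap, low-energy access
            return
        if self.l1[i1] == block:
            self.l1_hits += 1                # filter-cache miss, larger cache hit
        else:
            self.misses += 1                 # goes to the next memory level
            self.l1[i1] = block
        self.l0[i0] = block                  # refill the filter cache

h = FilterCacheHierarchy()
for addr in [0, 4, 8, 64, 0, 4]:             # small loop-like reference stream
    h.access(addr)
print(h.l0_hits, h.l1_hits, h.misses)
```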

Proceedings ArticleDOI
01 Jun 2000
TL;DR: This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler and finds that the overall cache miss ratio is increased due to garbage collection, which suffers from higher cache misses compared to the application.
Abstract: This paper studies the memory system behavior of Java programs by analyzing memory reference traces of several SPECjvm98 applications running with a Just-In-Time (JIT) compiler. Trace information is collected by an exception-based tracing tool called JTRACE, without any instrumentation to the Java programs or the JIT compiler. First, we find that the overall cache miss ratio is increased due to garbage collection, which suffers from higher cache misses compared to the application. We also note that going beyond 2-way cache associativity improves the cache miss ratio marginally. Second, we observe that Java programs generate a substantial amount of short-lived objects. However, the size of frequently-referenced long-lived objects is more important to the cache performance, because it tends to determine the application's working set size. Finally, we note that the default heap configuration which starts from a small initial heap size is very inefficient since it invokes a garbage collector frequently. Although the direct costs of garbage collection decrease as we increase the available heap size, there exists an optimal heap size which minimizes the total execution time due to the interaction with the virtual memory performance.

Journal ArticleDOI
Peter Sanders
TL;DR: In this article, a fast priority queue for external memory and cached memory that is based on k-way merging is proposed, which is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.
Abstract: The cache hierarchy prevalent in todays high performance processors has to be taken into account in order to design algorithms that perform well in practice. This paper advocates the adaption of external memory algorithms to this purpose. This idea and the practical issues involved are exemplified by engineering a fast priority queue suited to external memory and cached memory that is based on k-way merging. It improves previous external memory algorithms by constant factors crucial for transferring it to cached memory. Running in the cache hierarchy of a workstation the algorithm is at least two times faster than an optimized implementation of binary heaps and 4-ary heaps for large inputs.