
Showing papers on "Smart Cache published in 1999"


Proceedings ArticleDOI
16 Nov 1999
TL;DR: In this paper, the authors propose selective cache ways, which disable a subset of the ways in a set-associative cache during periods of modest cache activity; trading a small performance degradation for energy savings produces a significant reduction in cache energy dissipation.
Abstract: Increasing levels of microprocessor power dissipation call for new approaches at the architectural level that save energy by better matching of on-chip resources to application requirements. Selective cache ways provides the ability to disable a subset of the ways in a set associative cache during periods of modest cache activity, while the full cache may remain operational for more cache-intensive periods. Because this approach leverages the subarray partitioning that is already present for performance reasons, only minor changes to a conventional cache are required, and therefore, full-speed cache operation can be maintained. Furthermore, the tradeoff between performance and energy is flexible, and can be dynamically tailored to meet changing application and machine environmental conditions. We show that trading off a small performance degradation for energy savings can produce a significant reduction in cache energy dissipation using this approach.
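A minimal sketch of the mechanism described above, as a toy Python cache model: a software-settable mask enables only a subset of the ways, and both lookups and fills probe only the enabled ways. The class, parameters, and flush-on-disable policy are illustrative assumptions, not the paper's hardware design.

```python
class SelectiveWaysCache:
    """Toy set-associative cache where a subset of ways can be disabled."""
    def __init__(self, num_sets=64, num_ways=4, line_bytes=32):
        self.num_sets, self.num_ways, self.line = num_sets, num_ways, line_bytes
        self.enabled = list(range(num_ways))      # all ways on at full activity
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.lru = [list(range(num_ways)) for _ in range(num_sets)]  # LRU -> MRU

    def set_enabled_ways(self, n):
        """Disable ways n..num_ways-1 during periods of modest cache activity."""
        assert 1 <= n <= self.num_ways
        self.enabled = list(range(n))
        for s in range(self.num_sets):
            for w in range(n, self.num_ways):
                self.tags[s][w] = None            # model: disabled lines are lost

    def access(self, addr):
        block = addr // self.line
        s, tag = block % self.num_sets, block // self.num_sets
        for w in self.enabled:                    # probe only the enabled subset
            if self.tags[s][w] == tag:
                self.lru[s].remove(w); self.lru[s].append(w)
                return True                       # hit
        victim = next(w for w in self.lru[s] if w in self.enabled)
        self.tags[s][victim] = tag
        self.lru[s].remove(victim); self.lru[s].append(victim)
        return False                              # miss

cache = SelectiveWaysCache()
cache.set_enabled_ways(2)   # trade a little hit rate for lower energy per access
```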

733 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: It is demonstrated that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and consequently, their performance.
Abstract: Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and, consequently, their performance. It explores two placement techniques---clustering and coloring---that improve cache performance by increasing a pointer structure's spatial and temporal locality, and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies---cache-conscious reorganization and cache-conscious allocation---and describes two semi-automatic tools---ccmorph and ccmalloc---that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits---in most cases, significantly outperforming state-of-the-art prefetching.
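The co-location idea behind ccmalloc can be pictured with a toy bump allocator. The alloc(size, near=...) interface below is a hypothetical Python stand-in for the paper's C allocator, which takes a pointer to an existing, likely contemporaneously accessed element as a placement hint.

```python
CACHE_BLOCK = 64   # bytes; assumed cache block size

class CacheConsciousHeap:
    """Toy bump allocator that honors a co-location hint."""
    def __init__(self):
        self.next_block = 0
        self.fill = {}                  # block number -> bytes used in block

    def alloc(self, size, near=None):
        assert 0 < size <= CACHE_BLOCK
        if near is not None:            # try the hint's cache block first
            blk = near // CACHE_BLOCK
            used = self.fill.get(blk, CACHE_BLOCK)
            if used + size <= CACHE_BLOCK:
                self.fill[blk] = used + size
                return blk * CACHE_BLOCK + used
        blk = self.next_block           # otherwise open a fresh block
        self.next_block += 1
        self.fill[blk] = size
        return blk * CACHE_BLOCK

heap = CacheConsciousHeap()
parent = heap.alloc(16)
child = heap.alloc(16, near=parent)     # lands in the same cache block
assert child // CACHE_BLOCK == parent // CACHE_BLOCK
```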

382 citations


Proceedings ArticleDOI
17 Aug 1999
TL;DR: In this paper, a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches is proposed, where only a single cache way is accessed, instead of accessing all the ways in a set.
Abstract: This paper proposes a new approach using way prediction for achieving high performance and low energy consumption in set-associative caches. By accessing only the single predicted cache way, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
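The mechanism is straightforward to model: probe one predicted way first, and fall back to the remaining ways only on a misprediction. The MRU-style predictor and counters below are illustrative, not the paper's exact design.

```python
class WayPredictingCache:
    """Toy model: probe the predicted (MRU) way first; fall back to the rest."""
    def __init__(self, num_sets=128, num_ways=4, line=32):
        self.sets, self.ways, self.line = num_sets, num_ways, line
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        self.pred = [0] * num_sets                 # predicted way per set
        self.first_hit = self.slow_hit = self.miss = 0

    def access(self, addr):
        blk = addr // self.line
        s, tag = blk % self.sets, blk // self.sets
        p = self.pred[s]
        if self.tags[s][p] == tag:                 # one way read: fast, low energy
            self.first_hit += 1
            return
        for w in range(self.ways):                 # mispredict: probe the rest
            if w != p and self.tags[s][w] == tag:
                self.slow_hit += 1
                self.pred[s] = w                   # retrain to the MRU way
                return
        self.miss += 1
        self.tags[s][p] = tag                      # simplified fill policy
```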

295 citations


Journal ArticleDOI
TL;DR: This paper proposes the Active Cache scheme for caching dynamic contents at Web proxies, describes the scheme's protocol, interface and security mechanisms, and shows it to be a feasible scheme that can result in significant network bandwidth savings at the expense of moderate CPU costs.
Abstract: Dynamic documents constitute an increasing percentage of contents on the Web, and caching dynamic documents becomes an increasingly important issue that affects the scalability of the Web. In this paper, we propose the Active Cache scheme to support caching of dynamic contents at Web proxies. The scheme allows servers to supply cache applets to be attached with documents, and requires proxies to invoke cache applets upon cache hits to furnish the necessary processing without contacting the server. We describe the protocol, interface and security mechanisms of the Active Cache scheme, and illustrate its use via several examples. Through prototype implementation and performance measurements, we show that Active Cache is a feasible scheme that can result in significant network bandwidth savings at the expense of moderate CPU costs.

283 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: In this article, the authors describe two techniques, structure splitting and field reordering, that improve the cache behavior of data structures larger than a cache block by increasing the number of hot fields that can be placed in the cache block.
Abstract: A program's cache performance can be improved by changing the organization and layout of its data---even complex, pointer-based data structures. Previous techniques improved the cache performance of these structures by arranging distinct instances to increase reference locality. These techniques produced significant performance improvements, but worked best for small structures that could be packed into a cache block. This paper extends that work by concentrating on the internal organization of fields in a data structure. It describes two techniques---structure splitting and field reordering---that improve the cache behavior of structures larger than a cache block. For structures comparable in size to a cache block, structure splitting can increase the number of hot fields that can be placed in a cache block. In five Java programs, structure splitting reduced cache miss rates 10--27% and improved performance 6--18% beyond the benefits of previously described cache-conscious reorganization techniques. For large structures, which span many cache blocks, reordering fields to place those with high temporal affinity in the same cache block can also improve cache utilization. This paper describes bbcache, a tool that recommends C structure field reorderings. Preliminary measurements indicate that reordering fields in 5 active structures improves the performance of Microsoft SQL Server 7.0 by 2--3%.
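A schematic Python stand-in for structure splitting (the paper transforms Java and C programs; the fields here are invented): hot fields stay in the node the traversal touches, while cold fields move behind one extra reference, so more hot nodes fit per cache block.

```python
class ColdFields:                # rarely touched: labels, bookkeeping, stats
    __slots__ = ("label", "created_at", "stats")
    def __init__(self, label):
        self.label, self.created_at, self.stats = label, 0, {}

class TreeNode:                  # hot: everything the search loop reads
    __slots__ = ("key", "left", "right", "cold")
    def __init__(self, key, label=""):
        self.key, self.left, self.right = key, None, None
        self.cold = ColdFields(label)   # one indirection to the cold half

def lookup(root, key):           # the hot loop never touches node.cold
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

root = TreeNode(10); root.left = TreeNode(5); root.right = TreeNode(15)
assert lookup(root, 15).key == 15
```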

278 citations


Journal ArticleDOI
TL;DR: This paper describes an approach for bounding the worst- and best-case performance of large code segments on machines that exploit both pipelining and instruction caching, and shows that the timing analyzer efficiently produces tight predictions of worst- and best-case performance for pipelining and instruction caching.
Abstract: Predicting the execution time of code segments in real-time systems is challenging. Most recently designed machines contain pipelines and caches. Pipeline hazards may result in multicycle delays. Instruction or data memory references may not be found in cache and these misses typically require several cycles to resolve. Whether an instruction will stall due to a pipeline hazard or a cache miss depends on the dynamic sequence of previous instructions executed and memory references performed. Furthermore, these penalties are not independent since delays due to pipeline stalls and cache miss penalties may overlap. This paper describes an approach for bounding the worst and best case performance of large code segments on machines that exploit both pipelining and instruction caching. First, a method is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. Next, these categorizations are used in the pipeline analysis of sequences of instructions representing paths within the program. A timing analyzer uses the pipeline path analysis to estimate the worst and best-case execution performance of each loop and function in the program. Finally, a graphical user interface is invoked that allows a user to request timing predictions on portions of the program. The results indicate that the timing analyzer efficiently produces tight predictions of worst and best-case performance for pipelining and instruction caching.

223 citations


Journal ArticleDOI
TL;DR: A unified cache maintenance algorithm, LNC-R-W3-U, is described, which integrates both cache replacement and consistency algorithms and factors into the eviction decision the validation rate of each document, as provided by the cache consistency component of LNC-R-W3-U.
Abstract: Caching at proxy servers is one of the ways to reduce the response time perceived by World Wide Web users. Cache replacement algorithms play a central role in the response time reduction by selecting a subset of documents for caching, so that a given performance metric is maximized. At the same time, the cache must take extra steps to guarantee some form of consistency of the cached documents. Cache consistency algorithms enforce appropriate guarantees about the staleness of the cached documents. We describe a unified cache maintenance algorithm, LNC-R-W3-U, which integrates both cache replacement and consistency algorithms. The LNC-R-W3-U algorithm evicts documents from the cache based on the delay to fetch each document into the cache. Consequently, the documents that took a long time to fetch are preferentially kept in the cache. The LNC-R-W3-U algorithm also factors into the eviction decision the validation rate of each document, as provided by the cache consistency component of LNC-R-W3-U. Consequently, documents that are infrequently updated and thus seldom require validations are preferentially retained in the cache. We describe the implementation of LNC-R-W3-U and its integration with the Apache 1.2.6 code base. Finally, we present a trace-driven experimental study of LNC-R-W3-U performance and its comparison with other previously published algorithms for cache maintenance.
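A schematic eviction decision in the spirit of LNC-R-W3-U; the exact cost function in the paper differs, but it weighs the same ingredients named in the abstract: reference rate, fetch delay, document size, and validation rate.

```python
def profit(doc):
    # keep documents that are hot, slow to refetch, small, and rarely revalidated
    return (doc["ref_rate"] * doc["fetch_delay_s"]) / (
        doc["size_bytes"] * (1.0 + doc["validation_rate"]))

def pick_victim(cache_docs):
    return min(cache_docs, key=profit)     # evict the least profitable document

docs = [
    {"url": "/a", "ref_rate": 0.5, "fetch_delay_s": 2.0,
     "size_bytes": 4096, "validation_rate": 0.01},
    {"url": "/b", "ref_rate": 0.5, "fetch_delay_s": 0.1,
     "size_bytes": 4096, "validation_rate": 0.50},
]
assert pick_victim(docs)["url"] == "/b"    # fast to refetch and often stale
```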

211 citations


Patent
22 Mar 1999
TL;DR: In this article, a cache system is described that includes a storage that is partitioned into a plurality of storage areas, each for storing one kind of object received from remote sites and to be directed to target devices.
Abstract: A cache system is described that includes a storage that is partitioned into a plurality of storage areas, each for storing one kind of object received from remote sites and to be directed to target devices. The cache system further includes a cache manager coupled to the storage to cause objects to be stored in the corresponding storage areas of the storage. The cache manager causes cached objects in each of the storage areas to be replaced in accordance with one of a plurality of replacement policies, each being optimized for one kind of object.
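A minimal sketch of the partitioned design: one storage area per kind of object, each managed by its own replacement policy. The LRU/LFU pairing and capacities below are illustrative choices, not those claimed in the patent.

```python
from collections import OrderedDict

class LRUArea:
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
    def put(self, key, obj):
        self.data.pop(key, None)
        self.data[key] = obj
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)          # evict least recently used
    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)
            return self.data[key]
        return None

class LFUArea:
    def __init__(self, capacity):
        self.capacity, self.data, self.freq = capacity, {}, {}
    def put(self, key, obj):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = min(self.data, key=lambda k: self.freq[k])
            del self.data[victim]; del self.freq[victim]
        self.data[key] = obj
        self.freq[key] = self.freq.get(key, 0) + 1
    def get(self, key):
        if key in self.data:
            self.freq[key] += 1
            return self.data[key]
        return None

class PartitionedCache:
    def __init__(self):
        # one area per object kind, each with its own policy and size
        self.areas = {"image": LRUArea(1000), "html": LFUArea(500)}
    def put(self, kind, key, obj):
        self.areas[kind].put(key, obj)

cache = PartitionedCache()
cache.put("image", "/logo.gif", b"...")   # replaced under the image area's policy
```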

197 citations


Proceedings ArticleDOI
01 Jun 1999
TL;DR: This work presents a memory exploration strategy based on three performance metrics, namely, cache size, the number of processor cycles and the energy consumption, and shows how the performance is affected by cache parameters such as cache size, line size, set associativity and tiling, and the off-chip data organization.
Abstract: In embedded system design, the designer has to choose an on-chip memory configuration that is suitable for a specific application. To aid in this design choice, we present a memory exploration strategy based on three performance metrics, namely, cache size, the number of processor cycles and the energy consumption. We show how the performance is affected by cache parameters such as cache size, line size, set associativity and tiling, and the off-chip data organization. We show the importance of including energy in the performance metrics, since an increase in the cache line size, cache size, tiling and set associativity reduces the number of cycles but does not necessarily reduce the energy consumption. These performance metrics help us find the minimum energy cache configuration if time is the hard constraint, or the minimum time cache configuration if energy is the hard constraint.
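The exploration loop itself is simple to sketch; simulate() below is a stand-in for whatever estimates cycles and energy for a configuration, and the parameter ranges and cost model are arbitrary examples.

```python
import itertools

def explore(simulate, cycle_budget):
    """Sweep cache parameters; keep the minimum-energy point within the budget."""
    best = None
    for size_kb, line, assoc in itertools.product(
            [1, 2, 4, 8, 16], [16, 32, 64], [1, 2, 4]):
        cycles, energy = simulate(size_kb, line, assoc)
        if cycles <= cycle_budget and (best is None or energy < best[0]):
            best = (energy, size_kb, line, assoc)
    return best   # minimum-energy configuration under the time constraint

def fake_simulate(size_kb, line, assoc):   # placeholder model for illustration
    cycles = 1e6 / (size_kb * assoc) + line
    energy = size_kb * assoc * 0.1 + line * 0.01
    return cycles, energy

print(explore(fake_simulate, cycle_budget=2e5))
```

Swapping the roles of the two metrics (minimum time under an energy budget) is the same loop with the comparison and constraint exchanged.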

193 citations


Proceedings ArticleDOI
17 Aug 1999
TL;DR: This paper proposes using a small instruction buffer, also called a loop cache, to save power in caches; the loop cache has no address tag store, and its controller knows precisely, well ahead of time, whether the next instruction request will hit in the loop cache, so there is no performance degradation.
Abstract: A fair amount of work has been done in recent years on reducing power consumption in caches by using a small instruction buffer placed between the execution pipe and a larger main cache. These techniques, however, often degrade the overall system performance. In this paper, we propose using a small instruction buffer, also called a loop cache, to save power. A loop cache has no address tag store. It consists of a direct-mapped data array and a loop cache controller. The loop cache controller knows precisely whether the next instruction request will hit in the loop cache, well ahead of time. As a result, there is no performance degradation.
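A behavioral sketch of a tagless loop cache controller, heavily simplified from the paper: a short backward branch triggers capture of the loop body, after which the controller itself decides, ahead of time, whether a fetch hits. The triggering condition and capacity below are illustrative assumptions.

```python
LOOP_CACHE_CAPACITY = 32             # instructions; illustrative

class LoopCacheController:
    def __init__(self):
        self.start = self.end = None     # PC range of the captured loop body
        self.active = False

    def fetch(self, pc, backward_branch_target=None):
        """Return True if this fetch is served by the loop cache."""
        hit = self.active and self.start <= pc <= self.end
        if backward_branch_target is not None:       # a taken backward branch
            if 0 <= pc - backward_branch_target < LOOP_CACHE_CAPACITY:
                self.start, self.end = backward_branch_target, pc
                self.active = True       # later iterations fetch from loop cache
        elif self.active and not (self.start <= pc <= self.end):
            self.active = False          # control left the loop: back to I-cache
        return hit

lc = LoopCacheController()
for pc in (100, 101, 102):                   # first iteration: normal fetches
    lc.fetch(pc)
lc.fetch(103, backward_branch_target=100)    # short backward branch detected
assert lc.fetch(100)                         # second iteration hits
```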

190 citations


Patent
03 Mar 1999
TL;DR: In this article, the authors propose a technique for automatic, transparent, distributed, scalable and robust replication of document copies in a computer network where request messages for a particular document follow paths from the clients to a home server that form a routing graph.
Abstract: A technique for automatic, transparent, distributed, scalable and robust replication of document copies in a computer network wherein request messages for a particular document follow paths from the clients to a home server that form a routing graph. Client request messages are routed up the graph towards the home server as would normally occur in the absence of caching. However, cache servers are located along the route, and may intercept requests if they can be serviced. In order to be able to service requests in this manner without departing from standard network protocols, the cache server needs to be able to insert a packet filter into the router associated with it, and needs also to proxy for the home server from the perspective of the client. Cache servers cooperate to update cache content by communicating with neighboring caches whenever information is received about invalid cache copies.

Patent
25 Jan 1999
TL;DR: A caching system and method for web pages that have dynamic content are described, built around a cacheability analyzer that analyzes responses based on time, content, user identification, and macro hierarchy.
Abstract: A caching system and method are disclosed that allow for the caching of web pages that have dynamic content. The caching system and method utilize a cacheability analyzer that analyzes responses based on time, content, user identification, and macro hierarchy. The caching system only caches those responses having dynamic content that are deemed cacheable. Further, the automatic caching system can be overridden by the information author, the page creator or the system designer.

Journal ArticleDOI
TL;DR: This paper describes an algorithm for procedure placement, one type of code placement, that significantly differs from previous approaches in the type of information used to drive the placement algorithm: it gathers temporal-ordering information that summarizes the interleaving of procedures in a program trace.
Abstract: Instruction cache performance is important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate and the instruction working set size during execution. This means that the performance of an executable can be improved by applying a code-placement algorithm that minimizes instruction cache conflicts and improves spatial locality. We describe an algorithm for procedure placement, one type of code placement, that significantly differs from previous approaches in the type of information used to drive the placement algorithm. In particular, we gather temporal-ordering information that summarizes the interleaving of procedures in a program trace. Our algorithm uses this information along with cache configuration and procedure size information to better estimate the conflict cost of a potential procedure ordering. It optimizes the procedure placement for single-level and multilevel caches. In addition to reducing instruction cache conflicts, the algorithm simultaneously minimizes the instruction working set size of the program. We compare the performance of our algorithm with a particularly successful procedure-placement algorithm and show noticeable improvements in the instruction cache behavior, while maintaining the same instruction working set size.

Journal ArticleDOI
TL;DR: This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity, and proposes a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior.
Abstract: The growing disparity between processor and memory performance has made cache misses increasingly expensive. Additionally, data and instruction caches are not always used efficiently, resulting in large numbers of cache misses. Therefore, the importance of cache performance improvements at each level of the memory hierarchy will continue to grow. In numeric programs, there are several known compiler techniques for optimizing data cache performance. However, integer (nonnumeric) programs often have irregular access patterns that are more difficult for the compiler to optimize. In the past, cache management techniques such as cache bypassing were implemented manually at the machine-language-programming level. As the available chip area grows, it makes sense to spend more resources to allow intelligent control over the cache management. In this paper, we present an approach to improving cache effectiveness, taking advantage of the growing chip area, utilizing run-time adaptive cache management techniques, optimizing both performance and cost of implementation. Specifically, we are aiming to increase data cache effectiveness for integer programs. We propose a microarchitecture scheme where the hardware determines data placement within the cache hierarchy based on dynamic referencing behavior. This scheme is fully compatible with existing instruction set architectures. This paper examines the theoretical upper bounds on the cache hit ratio that cache bypassing can provide for integer applications, including several Windows applications with OS activity. Then, detailed trace-driven simulations of the integer applications are used to show that the implementation described in this paper can achieve performance close to that of the upper bound.
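One way to picture run-time-adaptive bypassing (a schematic heuristic, not the paper's exact microarchitecture): a small table of saturating reuse counters decides, on each miss, whether to allocate a line or bypass it to the next level to avoid polluting the set.

```python
class BypassPredictor:
    def __init__(self, entries=256):
        self.counters = [1] * entries      # 2-bit saturating reuse counters

    def _idx(self, block):
        return block % len(self.counters)

    def on_hit(self, block):               # reuse observed: learn to cache it
        i = self._idx(block)
        self.counters[i] = min(3, self.counters[i] + 1)

    def on_evict_unused(self, block):      # line left the cache without reuse
        i = self._idx(block)
        self.counters[i] = max(0, self.counters[i] - 1)

    def should_allocate(self, block):
        # low-confidence blocks bypass the cache and go straight to the core
        return self.counters[self._idx(block)] >= 1
```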

Proceedings ArticleDOI
01 May 1999
TL;DR: This paper examines the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses and shows that software page placement provided a 28% speedup and hardware page placement provided a 21% speedup on average for a superscalar processor.
Abstract: As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this paper we examine the performance of compiler and hardware approaches for reordering pages in physically addressed caches to eliminate cache misses. The software approach provides a color mapping at compile-time for code and data pages, which can then be used by the operating system to guide its allocation of physical pages. The hardware approach works by adding a page remap field to the TLB, which is used to allow a page to be remapped to a different color in the physically indexed cache while keeping the same physical page in memory. The results show that software page placement provided a 28% speedup and hardware page placement provided a 21% speedup on average for a superscalar processor. For a 4 processor single-chip multiprocessor, the miss rate was reduced from 8.7% down to 7.2% on average.
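The software half of the approach can be sketched as a color-aware physical page allocator; the cache geometry and fallback policy below are assumptions for illustration.

```python
PAGE = 4096
CACHE_BYTES = 512 * 1024
NUM_COLORS = CACHE_BYTES // PAGE      # 128 colors, assuming a direct-mapped cache

class ColorAwareAllocator:
    """OS-side page coloring: hand out physical pages of a requested color."""
    def __init__(self, num_phys_pages):
        self.free = {c: [] for c in range(NUM_COLORS)}
        for p in range(num_phys_pages):
            self.free[p % NUM_COLORS].append(p)   # a page's color = low index bits

    def alloc(self, wanted_color):
        if self.free[wanted_color]:
            return self.free[wanted_color].pop()  # page of the compiler's color
        for pool in self.free.values():           # fallback: any available color
            if pool:
                return pool.pop()
        raise MemoryError("out of physical pages")
```

The hardware alternative in the paper keeps the physical page fixed and adds a remap field to the TLB, so only the index used by the physically indexed cache changes.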

Proceedings ArticleDOI
01 May 1999
TL;DR: This work focuses on transient fault tolerance in primary cache memories and develops new architectural solutions, to maximize fault coverage when the budgeted silicon area is not sufficient for the conventional configuration of an error checking code.
Abstract: Information integrity in cache memories is a fundamental requirement for dependable computing. Conventional architectures for enhancing cache reliability using check codes make it difficult to trade between the level of data integrity and the chip area requirement. We focus on transient fault tolerance in primary cache memories and develop new architectural solutions, to maximize fault coverage when the budgeted silicon area is not sufficient for the conventional configuration of an error checking code. The underlying idea is to exploit the corollary of reference locality in the organization and management of the code. A higher protection priority is dynamically assigned to the portions of the cache that are more error-prone and have a higher probability of access. The error-prone likelihood prediction is based on the access frequency. We evaluate the effectiveness of the proposed schemes using a trace-driven simulation combined with software error injection using four different fault manifestation models. From the simulation results, we show that for most benchmarks the proposed architectures are effective and area efficient for increasing the cache integrity under all four models.

Patent
Robert Drew Major1
22 Jun 1999
TL;DR: In this article, a cache object store is organized to provide fast and efficient storage of data as cache objects organized into cache object groups, and a multi-level hierarchical storage architecture comprising a primary memory-level cache store and, optionally, a secondary disk level cache store, each of which is configured to optimize access to the cache objects groups.
Abstract: A cache object store is organized to provide fast and efficient storage of data as cache objects organized into cache object groups. The cache object store preferably embodies a multi-level hierarchical storage architecture comprising a primary memory-level cache store and, optionally, a secondary disk-level cache store, each of which is configured to optimize access to the cache object groups. These levels of the cache object store further exploit persistent and non-persistent storage characteristics of the inventive architecture.

Journal ArticleDOI
TL;DR: DAT, a technique that augments loop tiling with data alignment, is presented, achieving improved efficiency (by ensuring that the cache is never under-utilized) as well as improved flexibility (by eliminating self-interference cache conflicts independent of the tile size), resulting in more stable and better cache performance.
Abstract: Loop blocking (tiling) is a well-known compiler optimization that helps improve cache performance by dividing the loop iteration space into smaller blocks (tiles); reuse of array elements within each tile is maximized by ensuring that the working set for the tile fits into the data cache. Padding is a data alignment technique that involves the insertion of dummy elements into a data structure for improving cache performance. In this work, we present DAT, a technique that augments loop tiling with data alignment, achieving improved efficiency (by ensuring that the cache is never under-utilized) as well as improved flexibility (by eliminating self-interference cache conflicts independent of the tile size). This results in a more stable and better cache performance than existing approaches, in addition to maximizing cache utilization, eliminating self-interference, and minimizing cross-interference conflicts. Further, while all previous efforts are targeted at programs characterized by the reuse of a single array, we also address the issue of minimizing conflict misses when several tiled arrays are involved. To validate our technique, we ran extensive experiments using both simulations as well as actual measurements on SUN Sparc5 and Sparc10 workstations. The results on benchmarks exhibiting varying memory access patterns demonstrate the effectiveness of our technique through consistently high hit ratios and improved performance across varying problem sizes.
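A small illustration of tiling combined with padding. DAT's machinery for choosing tile and pad sizes is more involved; here the pad is a fixed example value whose only job is to break the power-of-two stride that causes self-interference.

```python
N, TILE, PAD = 128, 32, 8
LD = N + PAD                          # padded leading dimension breaks 2^k strides

A = [1.0] * (N * LD)                  # row-major arrays with padded rows
B = [1.0] * (N * LD)
C = [0.0] * (N * LD)

for ii in range(0, N, TILE):          # tile the i and k loops for reuse
    for kk in range(0, N, TILE):
        for i in range(ii, min(ii + TILE, N)):
            for k in range(kk, min(kk + TILE, N)):
                a = A[i * LD + k]
                rb, rc = k * LD, i * LD
                for j in range(N):    # innermost loop stays unit-stride
                    C[rc + j] += a * B[rb + j]

assert C[0] == N                      # row 0 of all-ones matrices: sum of N terms
```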

Patent
19 Nov 1999
TL;DR: Curious caching, as described in this patent, improves upon cache snooping by allowing a snooping cache to insert data from snooped bus operations that is not currently in the cache, independent of any prior accesses to the associated memory location.
Abstract: Curious caching improves upon cache snooping by allowing a snooping cache to insert data from snooped bus operations that is not currently in the cache and independent of any prior accesses to the associated memory location. In addition, curious caching allows software to specify which data producing bus operations, e.g., reads and writes, result in data being inserted into the cache. This is implemented by specifying “memory regions of curiosity” and insertion and replacement policy actions for those regions. In column caching, the replacement of data can be restricted to particular regions of the cache. By also making the replacement address-dependent, column caching allows different regions of memory to be mapped to different regions of the cache. In a set-associative cache, a replacement policy specifies the particular column(s) of the set-associative cache in which a page of data can be stored. The column specification is made in page table entries in a TLB that translates between virtual and physical addresses. The TLB includes a bit vector, one bit per column, which indicates the columns of the cache that are available for replacement.


Patent
26 Jan 1999
TL;DR: In this paper, a relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache, where requests to access a mass storage device such as a disk or tape are intercepted by a device driver that compares the access request against a directory of the contents of the user configurable cache.
Abstract: An apparatus and method for accessing data in a computer system. A relatively high-speed, intermediate-volume storage device is operated as a user-configurable cache. Requests to access a mass storage device such as a disk or tape are intercepted by a device driver that compares the access request against a directory of the contents of the user-configurable cache. If the user-configurable cache contains the data sought to be accessed, the access request is carried out in the user-configurable cache instead of being forwarded to the device driver for the target mass storage device. Because the user-cache is implemented using memory having a dramatically shorter access time than most mechanical mass storage devices, the access request is fulfilled much more quickly than if the originally intended mass storage device was accessed. Data is preloaded and responsively cached in the user-configurable cache memory based on user preferences.

Proceedings ArticleDOI
10 Oct 1999
TL;DR: This work extends the work proposed by J. Kin et al. (1997), in which an extra, small cache is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated from the CPU.
Abstract: Energy dissipated in on-chip caches represents a substantial portion in the energy budget of today's processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. We extend the work proposed by J. Kin et al. (1997), in which an extra, small cache (called filter cache) is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated from the CPU. In our scheme, the compiler is used to generate code that exploits the new memory hierarchy and reduces the possibility of a miss in the extra cache. Experimental results across a wide range of SPEC95 benchmarks show that this cache, which we call L-Cache, has a small performance overhead with respect to the scheme without any extra caches, and provides substantial energy savings. The L-Cache is placed between the CPU and the I-Cache. The D-Cache subsystem is not modified. Since the L-Cache is much smaller, and thus, has a smaller access time than the I-Cache, this scheme can also be used for performance improvements provided that the hit rate in the L-Cache is very high. In our experimental results, we show that the L-Cache does indeed improve performance in some cases.

Patent
30 Sep 1999
TL;DR: A multiple-level cache structure and caching method are presented that distribute I/O processing loads, including caching operations, between processors to provide higher-performance I/O processing, especially in a server environment.
Abstract: This invention provides a multiple level cache structure and multiple level caching method that distributes I/O processing loads including caching operations between processors to provide higher performance I/O processing, especially in a server environment. A method of achieving optimal data throughput by taking full advantage of multiple processing resources is disclosed. A method for managing the allocation of the data caches to optimize the host access time and parity generation is disclosed. A cache allocation for RAID stripes guaranteed to provide fast access times for the XOR engine by ensuring that all cache lines are allocated from the same cache level is disclosed. Allocation of cache lines for RAID levels that do not require parity generation, performed in such a manner as to maximize utilization of the memory bandwidth, is disclosed. Parity generation which is optimized for use of the processor least utilized at the time the cache lines are allocated, thereby providing for dynamic load balancing amongst the multiple processing resources, is disclosed. An inventive cache line descriptor for maintaining information about which cache data pool the cache line resides within, and an inventive cache line descriptor which includes enhancements to allow for movement of cache data from one cache level to another, are disclosed. A cache line descriptor with enhancements for tracking the cache within which RAID stripe cache line siblings reside is disclosed. System, apparatus, computer program product, and methods to support these aspects alone and in combination are also provided.

Proceedings ArticleDOI
01 May 1999
TL;DR: This work presents a new block-based trace cache implementation that can achieve higher IPC performance with more efficient storage of traces: instead of explicitly storing the instructions of a trace, pointers to the blocks constituting the trace are stored in a much smaller trace table.
Abstract: The trace cache is a recently proposed solution to achieving high instruction fetch bandwidth by buffering and reusing dynamic instruction traces. This work presents a new block-based trace cache implementation that can achieve higher IPC performance with more efficient storage of traces. Instead of explicitly storing instructions of a trace, pointers to blocks constituting a trace are stored in a much smaller trace table. The block-based trace cache renames fetch addresses at the basic block level and stores aligned blocks in a block cache. Traces are constructed by accessing the replicated block cache using block pointers from the trace table. Performance potential of the block-based trace cache is quantified and compared with perfect branch prediction and perfect fetch schemes. Compared to the conventional trace cache, the block-based design can achieve higher IPC, with less impact on cycle time. Results: Using the SPECint95 benchmarks, a 16-wide realistic design of a block-based trace cache can improve performance 75% over a baseline design and to within 7% of a baseline design with perfect branch prediction. With idealized trace prediction, it is shown that the block-based trace cache with a 1K-entry block cache achieves the same performance as the conventional trace cache with 32K entries.

Patent
14 Apr 1999
TL;DR: A storage subsystem for use in a data processing system having real and extended storage, a vector processor and a store-in cache buffer is described; hard data errors in the cache are corrected with a hardware invert-retry mechanism that operates in response to a machine check and performs the correction as part of the instruction retry.
Abstract: A storage subsystem for use in a data processing system having real and extended storage, a vector processor and a store-in cache buffer. Transfers between real and extended storage are performed with a store buffer external to the cache, but comparable in size to the line size of the cache directly associated with the real storage. Hard data errors in the cache are corrected with a hardware invert-retry mechanism which operates in response to a machine check and does the correction as a part of the instruction retry. Vector processor storage operations bypass the cache and transfer data directly from storage to the vector processor.

Journal ArticleDOI
TL;DR: Web cache replacement policy choice affects network bandwidth demand and object hit rate, which in turn affect page load time; two new policies implemented in the Squid cache server show marked improvement over the standard mechanism.
Abstract: Web cache replacement policy choice affects network bandwidth demand and object hit rate, which affect page load time. Two new policies implemented in the Squid cache server show marked improvement over the standard mechanism.

Journal ArticleDOI
TL;DR: This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where miss ratio depends solely on working set behavior.
Abstract: Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as from cache simulations of individual loop kernels to illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.
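A representative pseudorandom index function (the paper evaluates specific hash choices; this XOR-folding variant is just illustrative): a stride that maps every access to a single set under conventional indexing is spread across many sets.

```python
SET_BITS, LINE_BITS = 7, 5        # 128 sets, 32-byte lines; example geometry

def conventional_index(addr):
    return (addr >> LINE_BITS) & ((1 << SET_BITS) - 1)

def xor_index(addr):
    block = addr >> LINE_BITS     # fold higher address bits into the set index
    return (block ^ (block >> SET_BITS) ^ (block >> (2 * SET_BITS))) \
           & ((1 << SET_BITS) - 1)

# a pathological stride that hits one set conventionally is spread out:
addrs = [i * (1 << (LINE_BITS + SET_BITS)) for i in range(16)]
assert len({conventional_index(a) for a in addrs}) == 1   # all conflict
assert len({xor_index(a) for a in addrs}) > 1             # conflicts dispersed
```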

Patent
17 Sep 1999
TL;DR: A synchronized two-level cache including a Level 1 cache and a Level 2 cache is implemented in a graphics processing system; the Level 2 cache is further partitioned into a number of slots that are dynamically allocated to texture maps as needed.
Abstract: A synchronized two-level cache including a Level 1 cache and a Level 2 cache is implemented in a graphics processing system. The Level 2 cache is further partitioned into a number of slots which are dynamically allocated to texture maps as needed. The reference counter of each of the cache lines in each cache level is tracked so that a cache line is not overwritten with new data prior to transferring old data out to the recipient device. The age status of each cache line is tracked so that the oldest cache line is overwritten first. The use of synchronized two-level cache system conserves system memory bandwidth and reduces memory latency, thereby improving the graphics processing system's performance.

Patent
16 Aug 1999
TL;DR: In this article, the cache management strategy is customized for each semantic type, using different caching policies for different semantic types, and the relationship between semantic content type and caching policy to be associated with the type can be determined in advance, or may be determined directly by the user, or could be based, at least partly, on user-history and profiling of userinteraction with the resources.
Abstract: Resources are cached based on the semantic type of the resource. The cache management strategy is customized for each semantic type, using different caching policies for different semantic types. Semantic types that can be expected to contain dynamic information, such as news and weather, employ an active caching policy wherein the resource in the cache memory is chosen for replacement based on the duration of time that the resource has been in cache memory. Conversely, semantic types that can be expected to contain static resources, such as encyclopedic information, employ a more conservative caching strategy, such as LRU (Least Recently Used) or LFU (Least Frequently Used), which is substantially independent of the time duration that the resource remains in cache memory. Additionally, some semantic types, such as communicated e-mail messages, newsgroup messages, and so on, may employ a caching policy that is a combination of multiple strategies, wherein the resource progresses from an active cache with a dynamic caching policy to more static caches with increasingly less dynamic caching policies. The relationship between semantic content type and caching policy to be associated with the type can be determined in advance, or may be determined directly by the user, or could be based, at least partly, on user-history and profiling of user-interaction with the resources.
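A minimal dispatch table in the spirit of the patent: each semantic type maps to its own caching policy, so news expires by time in cache while reference content is left to a recency-based policy. Types, thresholds, and field names are examples only.

```python
import time

POLICY_BY_TYPE = {
    "news":      {"kind": "ttl", "max_age_s": 300},   # replace by time in cache
    "weather":   {"kind": "ttl", "max_age_s": 600},
    "reference": {"kind": "lru"},                     # time-independent policy
}

def is_expired(entry, now=None):
    policy = POLICY_BY_TYPE.get(entry["semantic_type"], {"kind": "lru"})
    if policy["kind"] == "ttl":
        now = time.time() if now is None else now
        return now - entry["cached_at"] > policy["max_age_s"]
    return False    # LRU/LFU entries are evicted by recency/frequency, not age

entry = {"semantic_type": "news", "cached_at": time.time() - 600}
assert is_expired(entry)    # news older than 300 s is a replacement candidate
```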

Journal ArticleDOI
TL;DR: An automatic tool-based approach is described to bound worst-case data cache performance, and a method to deal with realistic cache-filling approaches, namely wrap-around filling for cache misses, is presented as an extension to pipeline analysis.
Abstract: The contributions of this paper are twofold. First, an automatic tool-based approach is described to bound worst-case data cache performance. The approach works on fully optimized code, performs the analysis over the entire control flow of a program, detects and exploits both spatial and temporal locality within data references, and produces results typically within a few seconds. Results obtained by running the system on representative programs are presented and indicate that timing analysis of data cache behavior usually results in significantly tighter worst-case performance predictions. Second, a method to deal with realistic cache filling approaches, namely wrap-around-filling for cache misses, is presented as an extension to pipeline analysis. Results indicate that worst-case timing predictions become significantly tighter when wrap-around-fill analysis is performed. Overall, the contribution of this paper is a comprehensive report on methods and results of worst-case timing analysis for data caches and wrap-around caches. The approach taken is unique and provides a considerable step toward realistic worst-case execution time prediction of contemporary architectures and its use in schedulability analysis for hard real-time systems.