
Showing papers on "Cache published in 1994"


Proceedings ArticleDOI
14 Nov 1994
TL;DR: In this article, the authors examine four cooperative caching algorithms using a trace-driven simulation study and conclude that cooperative caching can significantly improve file system read response time and that relatively simple cooperative caching methods are sufficient to realize most of the potential performance gain.
Abstract: Emerging high-speed networks will allow machines to access remote data nearly as quickly as they can access local data. This trend motivates the use of cooperative caching: coordinating the file caches of many machines distributed on a LAN to form a more effective overall file cache. In this paper we examine four cooperative caching algorithms using a trace-driven simulation study. These simulations indicate that for the systems studied cooperative caching can halve the number of disk accesses, improving file system read response time by as much as 73%. Based on these simulations we conclude that cooperative caching can significantly improve file system read response time and that relatively simple cooperative caching algorithms are sufficient to realize most of the potential performance gain.
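
The coordination the paper describes can be pictured as a three-level read path: local cache, then the caches of peer clients on the LAN, then the server disk. The C sketch below only illustrates that lookup order, not the paper's four algorithms; the cache_t structure, peer count, and cost figures are invented.

    /* Minimal sketch of a cooperative-cache read path: local cache first,
     * then the caches of peer clients on the LAN, then the server disk.
     * Data structures and costs are illustrative, not from the paper. */
    #include <stdio.h>

    #define CACHE_SLOTS 4
    #define NUM_PEERS   3

    typedef struct { int blocks[CACHE_SLOTS]; } cache_t;

    static int cache_lookup(const cache_t *c, int block)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            if (c->blocks[i] == block)
                return 1;
        return 0;
    }

    /* Returns an assumed cost in microseconds for the level that served the read. */
    static int coop_read(const cache_t *local, const cache_t peers[], int block)
    {
        if (cache_lookup(local, block))
            return 1;                      /* local memory hit       */
        for (int p = 0; p < NUM_PEERS; p++)
            if (cache_lookup(&peers[p], block))
                return 100;                /* remote memory over LAN */
        return 10000;                      /* server disk access     */
    }

    int main(void)
    {
        cache_t local = { { 1, 2, 3, 4 } };
        cache_t peers[NUM_PEERS] = { { { 5, 6, 7, 8 } },
                                     { { 9, 10, 11, 12 } },
                                     { { 13, 14, 15, 16 } } };
        int requests[] = { 2, 7, 42 };

        for (size_t i = 0; i < sizeof requests / sizeof requests[0]; i++)
            printf("block %d served in %d us\n",
                   requests[i], coop_read(&local, peers, requests[i]));
        return 0;
    }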

491 citations


Patent
21 Oct 1994
TL;DR: In this article, a method and apparatus generates and transfers a stream of broadcast information which includes multiple video and audio information segments to a receiving device, once received, the segments are stored in a cache and indexing information associated with the segments is made available to end users of the apparatus.
Abstract: A method and apparatus generates and transfers a stream of broadcast information which includes multiple video and audio information segments to a receiving device. Once received, the segments are stored in a cache and indexing information associated with the segments is made available to end users of the apparatus. The end users are then able to select which segment(s) they would like to receive, and the selected segment(s) is transmitted to the requesting user(s). In one embodiment, when a new segment is received, a determination is made as to whether it is a more recent version of another segment already stored by the apparatus. If it is a more recent version, then the older version is overwritten by the newer version.

460 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches, and as the data-set size of the scientific workload increases the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
Abstract: Today's commodity microprocessors require a low latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches). We evaluate a memory system design that can be both cost-effective as well as provide better performance, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers [10] and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. We use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in our evaluation. We present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement and a scheme that enables stream buffers to prefetch data being accessed in large strides. Our results show that, for the majority of our benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, we find that as the data-set size of the scientific workload increases the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
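
A stream buffer is essentially a small FIFO of prefetched line addresses sitting behind the on-chip cache. The following C sketch shows a single unit-stride buffer in that spirit; the filtering and non-unit-stride extensions the paper adds are omitted, and the stream_buf type and depth are invented.

    /* Simplified single, unit-stride stream buffer: a small FIFO of
     * prefetched line addresses behind the on-chip cache.  On a cache
     * miss we check the FIFO head; a head hit shifts the buffer and
     * prefetches the next sequential line. */
    #include <stdio.h>

    #define DEPTH 4

    typedef struct {
        long line[DEPTH];   /* prefetched cache-line addresses */
        int  valid;         /* number of valid entries         */
    } stream_buf;

    static void refill(stream_buf *sb, long miss_line)
    {
        for (int i = 0; i < DEPTH; i++)
            sb->line[i] = miss_line + 1 + i;   /* start prefetching past the miss */
        sb->valid = DEPTH;
    }

    /* Returns 1 if the missing line is at the head of the stream buffer. */
    static int stream_lookup(stream_buf *sb, long miss_line)
    {
        if (sb->valid > 0 && sb->line[0] == miss_line) {
            for (int i = 0; i < DEPTH - 1; i++)     /* shift the FIFO */
                sb->line[i] = sb->line[i + 1];
            sb->line[DEPTH - 1] = sb->line[DEPTH - 2] + 1;  /* prefetch one more */
            return 1;
        }
        refill(sb, miss_line);                      /* head miss: reallocate the stream */
        return 0;
    }

    int main(void)
    {
        stream_buf sb = { {0}, 0 };
        long misses[] = { 100, 101, 102, 200, 201 };  /* sequential run, then a new stream */
        for (int i = 0; i < 5; i++)
            printf("line %ld: %s\n", misses[i],
                   stream_lookup(&sb, misses[i]) ? "stream hit" : "stream miss");
        return 0;
    }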

368 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: It is found that Stache running on Typhoon performs comparably to an all-hardware DirNNB cache-coherence protocol for five shared-memory programs, and how programmers or compilers can use Tempest's flexibility to exploit an application's sharing patterns with a custom protocol is illustrated.
Abstract: Future parallel computers must efficiently execute not only hand-coded applications but also programs written in high-level, parallel programming languages. Today's machines limit these programs to a single communication paradigm, either message-passing or shared-memory, which results in uneven performance. This paper addresses this problem by defining an interface, Tempest, that exposes low-level communication and memory-system mechanisms so programmers and compilers can customize policies for a given application. Typhoon is a proposed hardware platform that implements these mechanisms with a fully-programmable, user-level processor in the network interface. We demonstrate the utility of Tempest with two examples. First, the Stache protocol uses Tempest's fine-grain access control mechanisms to manage part of a processor's local memory as a large, fully-associative cache for remote data. We simulated Typhoon on the Wisconsin Wind Tunnel and found that Stache running on Typhoon performs comparably (±30%) to an all-hardware DirNNB cache-coherence protocol for five shared-memory programs. Second, we illustrate how programmers or compilers can use Tempest's flexibility to exploit an application's sharing patterns with a custom protocol. For the EM3D application, the custom protocol improves performance up to 35% over the all-hardware protocol.

356 citations


Journal ArticleDOI
TL;DR: This article describes a caching strategy that offers the performance of a cache twice its size and investigates three cache replacement algorithms: random replacement, least recently used, and a frequency-based variation of LRU known as segmented LRU (SLRU).
Abstract: I/O subsystem manufacturers attempt to reduce latency by increasing disk rotation speeds, incorporating more intelligent disk scheduling algorithms, increasing I/O bus speed, using solid-state disks, and implementing caches at various places in the I/O stream. In this article, we examine the use of caching as a means to improve system response time and the data throughput of the disk subsystem. Caching can help to alleviate I/O subsystem bottlenecks caused by mechanical latencies. This article describes a caching strategy that offers the performance of a cache twice its size. After explaining some basic caching issues, we examine some popular caching strategies and cache replacement algorithms, as well as the advantages and disadvantages of caching at different levels of the computer system hierarchy. Finally, we investigate the performance of three cache replacement algorithms: random replacement (RR), least recently used (LRU), and a frequency-based variation of LRU known as segmented LRU (SLRU).
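
Segmented LRU keeps a probationary and a protected segment, promoting a block only when it is referenced again, so one-shot accesses cannot flush frequently used data. The C sketch below is a toy array-based version of that policy, not the article's implementation; segment sizes and the slru_access interface are invented.

    /* Toy segmented-LRU (SLRU) directory: new blocks enter a probationary
     * segment and are promoted to a protected segment only when re-referenced.
     * Index 0 of each segment is the MRU position. */
    #include <stdio.h>

    #define SEG_SIZE 3   /* entries per segment (illustrative) */

    static int prob[SEG_SIZE], prot[SEG_SIZE];
    static int nprob, nprot;

    static int find(int *seg, int n, int key)
    {
        for (int i = 0; i < n; i++)
            if (seg[i] == key)
                return i;
        return -1;
    }

    /* Insert at the MRU end of a segment, returning the evicted key (or -1). */
    static int push_mru(int *seg, int *n, int key)
    {
        int evicted = (*n == SEG_SIZE) ? seg[SEG_SIZE - 1] : -1;
        if (*n < SEG_SIZE)
            (*n)++;
        for (int i = *n - 1; i > 0; i--)
            seg[i] = seg[i - 1];
        seg[0] = key;
        return evicted;
    }

    static void remove_at(int *seg, int *n, int pos)
    {
        for (int i = pos; i < *n - 1; i++)
            seg[i] = seg[i + 1];
        (*n)--;
    }

    static int slru_access(int key)            /* returns 1 on hit */
    {
        int pos = find(prot, nprot, key);
        if (pos >= 0) {                        /* protected hit: move to MRU */
            remove_at(prot, &nprot, pos);
            push_mru(prot, &nprot, key);
            return 1;
        }
        pos = find(prob, nprob, key);
        if (pos >= 0) {                        /* second touch: promote */
            remove_at(prob, &nprob, pos);
            int demoted = push_mru(prot, &nprot, key);
            if (demoted >= 0)                  /* protected overflow falls back */
                push_mru(prob, &nprob, demoted);
            return 1;
        }
        push_mru(prob, &nprob, key);           /* miss: enter probationary */
        return 0;
    }

    int main(void)
    {
        int trace[] = { 1, 2, 1, 3, 4, 5, 1 }, hits = 0;
        for (int i = 0; i < 7; i++)
            hits += slru_access(trace[i]);
        printf("hits: %d of 7\n", hits);
        return 0;
    }

Block 1 survives the burst of one-shot references (3, 4, 5) because its second touch moved it into the protected segment, which is the behavior that lets SLRU approach the hit rate of a larger plain-LRU cache.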

325 citations


Journal ArticleDOI
01 Nov 1994
TL;DR: In this article, the authors describe the design and performance of a caching relay for the World Wide Web and model the distribution of requests for pages from the web and see how this distribution affects the performance of the cache.
Abstract: We describe the design and performance of a caching relay for the World Wide Web. We model the distribution of requests for pages from the web and see how this distribution affects the performance of a cache. We use the data gathered from the relay to make some general characterizations about the web.

319 citations


Proceedings ArticleDOI
01 May 1994
TL;DR: This work examines the impact of complex logical-to-physical mappings and large prefetching caches on scheduling effectiveness and concludes that the cyclical scan algorithm, which always schedules requests in ascending logical order, achieves the highest performance among seek-reducing algorithms for such workloads.
Abstract: Disk subsystem performance can be dramatically improved by dynamically ordering, or scheduling, pending requests. Via strongly validated simulation, we examine the impact of complex logical-to-physical mappings and large prefetching caches on scheduling effectiveness. Using both synthetic workloads and traces captured from six different user environments, we arrive at three main conclusions: (1) Incorporating complex mapping information into the scheduler provides only a marginal (less than 2%) decrease in response times for seek-reducing algorithms. (2) Algorithms which effectively utilize prefetching disk caches provide significant performance improvements for workloads with read sequentiality. The cyclical scan algorithm (C-LOOK), which always schedules requests in ascending logical order, achieves the highest performance among seek-reducing algorithms for such workloads. (3) Algorithms that reduce overall positioning delays produce the highest performance provided that they recognize and exploit a prefetching cache.
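
C-LOOK's selection rule is easy to state: serve the pending request with the smallest logical address at or above the last one served, and wrap to the lowest pending address when none remains in that direction. A small C sketch of that rule follows; the fixed pending array and addresses are illustrative only.

    /* Minimal C-LOOK selection over a set of pending logical block addresses. */
    #include <stdio.h>

    static int next_clook(const long *pending, int n, long last_served)
    {
        int best = -1, lowest = -1;
        for (int i = 0; i < n; i++) {
            if (lowest < 0 || pending[i] < pending[lowest])
                lowest = i;                                   /* wrap target    */
            if (pending[i] >= last_served &&
                (best < 0 || pending[i] < pending[best]))
                best = i;                                     /* ascending pick */
        }
        return best >= 0 ? best : lowest;
    }

    int main(void)
    {
        long pending[] = { 900, 150, 400, 700, 50 };
        int n = 5;
        long head = 500;                      /* logical address last served */

        while (n > 0) {
            int i = next_clook(pending, n, head);
            head = pending[i];
            printf("serve %ld\n", head);
            pending[i] = pending[--n];        /* remove from the pending set */
        }
        return 0;
    }

Starting from address 500, the sketch serves 700, 900, then wraps to 50, 150, 400, which is the one-directional sweep that makes C-LOOK work well with sequential, prefetch-friendly read streams.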

312 citations


Proceedings ArticleDOI
01 Nov 1994
TL;DR: This paper presents compiler optimizations to improve data locality based on a simple yet accurate cost model and demonstrates that these program transformations are useful for optimizing many programs.
Abstract: In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the original and transformed versions and simulated cache hit rates. We collected statistics about the inherent characteristics of these programs and our ability to improve their data locality. To our knowledge, these studies are the first of such breadth and depth. We found performance improvements were difficult to achieve because benchmark programs typically have high hit rates even for small data caches; however, our optimizations significantly improved several programs.
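
The simplest of the transformations mentioned, loop permutation, can be seen in plain C: traversing a row-major array column by column touches a new cache line on nearly every access, while the interchanged loops walk each line sequentially. The sketch below is illustrative; the array size is arbitrary and no timing is shown.

    /* Loop permutation (interchange) for spatial locality on a row-major array.
     * The first loop strides through memory by N doubles per access; the
     * second touches consecutive elements of each cache line. */
    #include <stdio.h>

    #define N 1024
    static double a[N][N];

    static void column_major_walk(void)        /* poor spatial locality */
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;
    }

    static void row_major_walk(void)           /* same result, interchanged loops */
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;
    }

    int main(void)
    {
        column_major_walk();
        row_major_walk();
        printf("a[0][0] = %.1f\n", a[0][0]);   /* keep the work observable */
        return 0;
    }

Loop fusion and distribution follow the same principle: reorder iterations so that reuses of the same cache line occur close together in time.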

310 citations


Journal ArticleDOI
01 Nov 1994
TL;DR: The paper highlights the properties and implementation of a client-based search tool called the “fish-search” algorithm, compares it to other approaches, and shows how combining the fish search with a cache greatly reduces the drawbacks of client-based searching.
Abstract: Finding specific information in the World-Wide Web (WWW, or Web for short) is becoming increasingly difficult, because of the rapid growth of the Web and because of the diversity of the information offered through the Web. Hypertext in general is ill-suited for information retrieval as it is designed for stepwise exploration. To help readers find specific information quickly, specific overview documents are often included into the hypertext. Hypertext systems often provide simple searching tools such as full text search or title search, that mostly ignore the “hyper-structure” formed by the links. In the WWW, finding information is further complicated by its distributed nature. Navigation, often via overview documents, still is the predominant method of finding one's way around the Web. Several searching tools have been developed, basically in two types:
• A gateway, offering (limited) search operations on small or large parts of the WWW, using a pre-compiled database. The database is often built by an automated Web scanner (a “robot”).
• A client-based search tool that does automated navigation, thereby working more or less like a browsing user, but much faster and following an optimized strategy.
This paper highlights the properties and implementation of a client-based search tool called the “fish-search” algorithm, and compares it to other approaches. The fish-search, implemented on top of Mosaic for X, offers an open-ended selection of search criteria. Client-based searching has some definite drawbacks: slow speed and high network resource consumption. The paper shows how combining the fish search with a cache greatly reduces these problems. The “Lagoon” cache program is presented. Caches can call each other, currently only to further reduce network traffic. By moving the algorithm into the cache program, the calculation of the answer to a search request can be distributed among the caching servers.

288 citations


Proceedings Article
01 Jul 1994
TL;DR: It is demonstrated that a simple rule system can be constructed that supports a more powerful view system than available in current commercial systems and that a rule system is a fundamental concept in a next generation DBMS, and it subsumes both views and procedures as special cases.
Abstract: This paper demonstrates that a simple rule system can be constructed that supports a more powerful view system than available in current commercial systems. Not only can views be specified by using rules but also special semantics for resolving ambiguous view updates are simply additional rules. Moreover, procedural data types as proposed in POSTGRES are also efficiently simulated by the same rules system. Lastly, caching of the action part of certain rules is a possible performance enhancement and can be applied to materialize views as well as to cache procedural data items. Hence, we conclude that a rule system is a fundamental concept in a next generation DBMS, and it subsumes both views and procedures as special cases.

277 citations


Proceedings Article
Jeff Bonwick1
06 Jun 1994
TL;DR: This paper presents a comprehensive design overview of the SunOS 5.4 kernel memory allocator, based on a set of object-caching primitives that reduce the cost of allocating complex objects by retaining their state between uses.
Abstract: This paper presents a comprehensive design overview of the SunOS 5.4 kernel memory allocator. This allocator is based on a set of object-caching primitives that reduce the cost of allocating complex objects by retaining their state between uses. These same primitives prove equally effective for managing stateless memory (e.g. data pages and temporary buffers) because they are space-efficient and fast. The allocator's object caches respond dynamically to global memory pressure, and employ an object-coloring scheme that improves the system's overall cache utilization and bus balance. The allocator also has several statistical and debugging features that can detect a wide range of problems throughout the system.
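
The core idea, keeping freed objects constructed so a later allocation can skip initialization, can be caricatured with a simple free list. The C sketch below is only that caricature; slabs, coloring, and reclaim from the real allocator are not shown, and the conn_t object and its constructor are invented.

    /* Caricature of object caching: freed objects stay constructed on a
     * free list so a later allocation can skip initialization. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct conn {
        char buf[256];
        int  refcount;
        struct conn *next_free;    /* links objects on the cache's free list */
    } conn_t;

    typedef struct {
        conn_t *free_list;
        void  (*ctor)(conn_t *);
    } obj_cache_t;

    static void conn_ctor(conn_t *c)           /* expensive one-time setup */
    {
        memset(c->buf, 0, sizeof c->buf);
        c->refcount = 0;
    }

    static conn_t *cache_alloc(obj_cache_t *oc)
    {
        if (oc->free_list) {                   /* reuse a constructed object */
            conn_t *c = oc->free_list;
            oc->free_list = c->next_free;
            return c;
        }
        conn_t *c = malloc(sizeof *c);         /* cold path: construct */
        if (c)
            oc->ctor(c);
        return c;
    }

    static void cache_free(obj_cache_t *oc, conn_t *c)
    {
        c->next_free = oc->free_list;          /* keep state; no destructor run */
        oc->free_list = c;
    }

    int main(void)
    {
        obj_cache_t conn_cache = { NULL, conn_ctor };
        conn_t *a = cache_alloc(&conn_cache);
        cache_free(&conn_cache, a);
        conn_t *b = cache_alloc(&conn_cache);  /* comes back already constructed */
        printf("reused: %s\n", a == b ? "yes" : "no");
        cache_free(&conn_cache, b);
        return 0;
    }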

Journal ArticleDOI
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
Abstract: The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.
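
The layout problem behind false sharing is easy to reproduce by hand: two per-processor counters packed into one cache block force that block to bounce between caches even though the processors never share data. The sketch below shows the kind of padded layout a transparent optimizer might produce; the 64-byte block size is an assumption, not a figure from the paper.

    /* Two per-thread counters packed into one cache block are falsely shared;
     * padding each counter to its own block removes the conflict. */
    #include <stdio.h>

    #define LINE 64   /* assumed cache-block size in bytes */

    struct shared_counters {        /* falsely shared: both fields in one block */
        long hits_cpu0;
        long hits_cpu1;
    };

    struct padded_counters {        /* one block per writer */
        long hits_cpu0;
        char pad0[LINE - sizeof(long)];
        long hits_cpu1;
        char pad1[LINE - sizeof(long)];
    };

    int main(void)
    {
        printf("unpadded layout: %zu bytes (fields share a block)\n",
               sizeof(struct shared_counters));
        printf("padded layout:   %zu bytes (one block per field)\n",
               sizeof(struct padded_counters));
        return 0;
    }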

01 Jun 1994
TL;DR: This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code that attempts to minimize overheads by issuing prefetches only for references that are predicted to suffer cache misses, and investigates the architectural support necessary to make prefetching effective.
Abstract: The large latency of memory accesses in modern computer systems is a key obstacle to achieving high processor utilization. Furthermore, the technology trends indicate that this gap between processor and memory speeds is likely to increase in the future. While increased latency affects all computer systems, the problem is magnified in large-scale shared-memory multiprocessors, where physical dimensions cause latency to be an inherent problem. To cope with the memory latency problem, the basic solution that nearly all computer systems rely on is their cache hierarchy. While caches are useful, they are not a panacea. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing prefetch instructions to move data close to the processor before it is actually needed. This technique is attractive because it can hide both read and write latency within a single thread of execution while requiring relatively little hardware support. Software-controlled prefetching, however, presents two major challenges. First, some sophistication is required on the part of either the programmer, runtime system, or (preferably) the compiler to insert prefetches into the code. Second, care must be taken that the overheads of prefetching, which include additional instructions and increased memory queueing delays, do not outweigh the benefits. This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. It also works for both uniprocessor and large-scale shared-memory multiprocessor architectures. We have implemented our algorithm in the SUIF (Stanford University Intermediate Form) optimizing compiler. The results of our detailed architectural simulations demonstrate that the speed of some applications can be improved by as much as a factor of two, both on uniprocessor and multiprocessor systems. This dissertation also compares software-controlled prefetching with other latency-hiding techniques (e.g., locality optimizations, relaxed consistency models, and multithreading), and investigates the architectural support necessary to make prefetching effective.
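
The technique can be approximated by hand with a compiler builtin: issue a prefetch a fixed distance ahead of the reference expected to miss. The sketch below uses GCC/Clang's __builtin_prefetch as a stand-in for the compiler-inserted prefetch instructions the dissertation describes; the prefetch distance of 16 elements is an arbitrary choice.

    /* Hand-inserted software prefetching for a dense-vector traversal. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N        (1 << 20)
    #define DISTANCE 16          /* elements to prefetch ahead of the use */

    int main(void)
    {
        double *x = malloc(N * sizeof *x);
        if (!x)
            return 1;
        for (long i = 0; i < N; i++)
            x[i] = (double)i;

        double sum = 0.0;
        for (long i = 0; i < N; i++) {
            if (i + DISTANCE < N)
                __builtin_prefetch(&x[i + DISTANCE], 0 /* read */, 1);
            sum += x[i];
        }
        printf("sum = %.0f\n", sum);
        free(x);
        return 0;
    }

The dissertation's point is that the compiler, not the programmer, should decide which references get such prefetches and at what distance, guided by a locality analysis that predicts the misses.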

Journal ArticleDOI
TL;DR: It is shown that cache profiling, using the CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.
Abstract: A vital tool-box component, the CProf cache profiling system lets programmers identify hot spots by providing cache performance information at the source-line and data-structure level. Our purpose is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.

Patent
29 Jul 1994
TL;DR: In this article, a technique called key swizzling is proposed to reduce the volume of queries to the structured database by using explicit relationship pointers between object instances in the object cache.
Abstract: In an object-oriented application being executed in a digital computing system comprising a processor, a method and apparatus are provided for managing information retrieved from a structured database, such as a relational database, wherein the processor is used to construct a plurality of object instances, each of these object instances having its own unique object ID that provides a mapping between the object instance and at least one row in the structured database. The processor is used to construct a single cohesive data structure, called an object cache, that comprises all the object instances and that represents information retrieved from the structured database in a form suitable for use by one or more object-oriented applications. A mechanism for managing the object cache is provided that has these three properties: First, through a technique called key swizzling, it uses explicit relationship pointers between object instances in the object cache to reduce the volume of queries to the structured database. Second, it ensures that only one copy of an object instance is in the cache at any given time, even if several different queries return the same information from the database. Third, the mechanism guarantees the integrity of data in the cache by locking data appropriately in the structured database during a database transaction, flushing cache data at the end of each transaction, and transparently re-reading the data and reacquiring the appropriate locks for an object instance whose data has been flushed.
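
The "one copy per object ID" property can be sketched as a lookup table keyed by OID: a fetch consults the cache first, and rows already materialized come back as the existing in-memory object, so relationship fields can be swizzled from keys into direct pointers. Everything in the C sketch below (the obj_t type, the fake fetch_row(), the fixed-size table) is invented for illustration.

    /* Sketch of an object cache keyed by OID: a second query for the same
     * row returns the already-materialized instance, so relationship fields
     * can hold direct ("swizzled") pointers. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct obj {
        long        oid;          /* maps back to a row in the database */
        char        name[32];
        struct obj *manager;      /* swizzled pointer instead of a key  */
    } obj_t;

    #define SLOTS 64
    static obj_t *cache[SLOTS];   /* trivial OID -> object map          */

    static obj_t *fetch_row(long oid)          /* stands in for a SQL query */
    {
        obj_t *o = calloc(1, sizeof *o);
        o->oid = oid;
        snprintf(o->name, sizeof o->name, "row-%ld", oid);
        return o;
    }

    static obj_t *cache_get(long oid)
    {
        unsigned h = (unsigned)(oid % SLOTS);
        for (unsigned i = 0; i < SLOTS; i++) {           /* linear probing */
            unsigned s = (h + i) % SLOTS;
            if (cache[s] == NULL) {
                cache[s] = fetch_row(oid);               /* first and only copy */
                return cache[s];
            }
            if (cache[s]->oid == oid)
                return cache[s];                         /* same query, same object */
        }
        return NULL;                                     /* cache full (ignored here) */
    }

    int main(void)
    {
        obj_t *a = cache_get(42);
        obj_t *b = cache_get(42);                        /* different query, same row */
        a->manager = cache_get(7);                       /* key swizzled to a pointer */
        printf("single copy: %s, manager oid %ld\n",
               a == b ? "yes" : "no", a->manager->oid);
        return 0;
    }

The patent's locking, flushing, and transparent re-read machinery sits on top of this single-copy invariant and is not modeled here.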

Journal ArticleDOI
01 Nov 1994
TL;DR: It is shown that superpages improve TLB performance only after invasive operating system modifications that introduce considerable overhead, and two subblock TLB designs are proposed as alternate ways to improve TLB performance.
Abstract: Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power of two multiple of the base page size and they must be aligned in both virtual and physical address spaces. Very large superpages (e.g., 1MB) are clearly useful for mapping special structures, such as kernel data or frame buffers. This paper considers the architectural and operating system support required to exploit medium-sized superpages (e.g., 64KB, i.e., sixteen times a 4KB base page size). First, we show that superpages improve TLB performance only after invasive operating system modifications that introduce considerable overhead. We then propose two subblock TLB designs as alternate ways to improve TLB performance. Analogous to a subblock cache, a complete-subblock TLB associates a tag with a superpage-sized region but has valid bits, physical page number, attributes, etc., for each possible base page mapping. A partial-subblock TLB entry is much smaller than a complete-subblock TLB entry, because it shares physical page number and attribute fields across base page mappings. A drawback of a partial-subblock TLB is that base page mappings can share a TLB entry only if they map to consecutive physical pages and have the same attributes. We propose a physical memory allocation algorithm, page reservation, that makes this sharing more likely. When page reservation is used, experimental results show partial-subblock TLBs perform better than superpage TLBs, while requiring simpler operating system changes. If operating system changes are inappropriate, however, complete-subblock TLBs perform best.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: Qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references, and an approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.
Abstract: Prefetching, i.e., exploiting the overlap of processor computations with data accesses, is one of several approaches for tolerating memory latencies. Prefetching can be either hardware-based or software-directed or a combination of both. Hardware-based prefetching, requiring some support unit connected to the cache, can dynamically handle prefetches at run-time without compiler intervention. Software-directed approaches rely on compiler technology to insert explicit prefetch instructions. Mowry et al.'s software scheme [13, 14] and our hardware approach [1] are two representative schemes. In this paper, we evaluate approximations to these two schemes in the context of a shared-memory multiprocessor environment. Our qualitative comparisons indicate that both schemes are able to reduce cache misses in the domain of linear array references. When complex data access patterns are considered, the software approach has compile-time information to perform sophisticated prefetching whereas the hardware scheme has the advantage of manipulating dynamic information. The performance results from an instruction-level simulation of four benchmarks confirm these observations. Our simulations show that the hardware scheme introduces more memory traffic into the network and that the software scheme introduces a non-negligible instruction execution overhead. An approach combining software and hardware schemes is proposed; it shows promise in reducing the memory latency with least overhead.

Proceedings ArticleDOI
07 Dec 1994
TL;DR: This paper describes an approach for bounding the worst-case instruction cache performance of large code segments by using static cache simulation to analyze a program's control flow to statically categorize the caching behavior of each instruction.
Abstract: The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable, since the behavior of a cache reference depends upon the history of the previous references. The use of caches is only suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.

Proceedings Article
12 Sep 1994
TL;DR: It is shown that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of the cache, and new algorithms run 8%-200% faster than the traditional ones.
Abstract: The current main memory (DRAM) access speeds lag far behind CPU speeds. Cache memory, made of static RAM, is being used in today's architectures to bridge this gap. It provides access latencies of 2-4 processor cycles, in contrast to main memory which requires 15-25 cycles. Therefore, the performance of the CPU depends upon how well the cache can be utilized. We show that there are significant benefits in redesigning our traditional query processing algorithms so that they can make better use of the cache. The new algorithms run 8%-200% faster than the traditional ones.
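
One rewrite in the spirit of the paper is blocking: process the inner relation of a nested-loop join in cache-sized chunks so each chunk is reused from the cache across many outer tuples. The C sketch below is illustrative only; the relation sizes, key distribution, and assumed 8KB chunk are not taken from the paper.

    /* Blocked nested-loop join: each chunk of the inner relation stays
     * cache-resident while it is compared against every outer tuple. */
    #include <stdio.h>
    #include <stdlib.h>

    #define OUTER  10000
    #define INNER  10000
    #define BLOCK  (8 * 1024 / sizeof(int))   /* keys per (assumed) 8KB chunk */

    static long blocked_join(const int *outer, const int *inner)
    {
        long matches = 0;
        for (size_t lo = 0; lo < INNER; lo += BLOCK) {        /* cache-sized chunk */
            size_t hi = lo + BLOCK < INNER ? lo + BLOCK : INNER;
            for (size_t i = 0; i < OUTER; i++)                /* reuse the chunk */
                for (size_t j = lo; j < hi; j++)
                    if (outer[i] == inner[j])
                        matches++;
        }
        return matches;
    }

    int main(void)
    {
        int *outer = malloc(OUTER * sizeof *outer);
        int *inner = malloc(INNER * sizeof *inner);
        if (!outer || !inner)
            return 1;
        for (int i = 0; i < OUTER; i++) outer[i] = i % 1000;
        for (int j = 0; j < INNER; j++) inner[j] = j % 1000;

        printf("matches: %ld\n", blocked_join(outer, inner));
        free(outer);
        free(inner);
        return 0;
    }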

Proceedings ArticleDOI
01 Apr 1994
TL;DR: A method is presented to characterize the performance of proposed queue lock algorithms and applied to previously published algorithms; the authors conclude that the M lock is the best overall queue lock for the class of architectures studied.
Abstract: Large-scale shared-memory multiprocessors typically have long latencies for remote data accesses. A key issue for execution performance of many common applications is the synchronization cost. The communication scalability of synchronization has been improved by the introduction of queue-based spin-locks instead of Test&(Test&Set). For architectures with long access latencies for global data, attention should also be paid to the number of global accesses that are involved in synchronization. We present a method to characterize the performance of proposed queue lock algorithms, and apply it to previously published algorithms. We also present two new queue locks, the LH lock and the M lock. We compare the locks in terms of performance, memory requirements, code size and required hardware support. The LH lock is the simplest of all the locks, yet requires only an atomic swap operation. The M lock is superior in terms of global accesses needed to perform synchronization and still competitive in all other criteria. We conclude that the M lock is the best overall queue lock for the class of architectures studied.
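
Queue locks arrange waiters in a list so each processor spins on a location of its own rather than hammering a shared flag. The sketch below is a CLH-style queue lock in that spirit (the paper's LH and M locks differ in detail), written with C11 atomics and POSIX threads; the node-recycling trick keeps each acquire to a single global swap.

    /* CLH-style queue spin lock: each thread spins only on its predecessor's
     * flag, so waiting generates no global traffic beyond the initial swap. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>

    typedef struct clh_node {
        _Atomic int locked;
    } clh_node;

    typedef struct {
        _Atomic(clh_node *) tail;
    } clh_lock;

    static clh_node *clh_acquire(clh_lock *lk, clh_node *me)
    {
        atomic_store(&me->locked, 1);
        clh_node *pred = atomic_exchange(&lk->tail, me);   /* join the queue */
        while (atomic_load(&pred->locked))                 /* spin on predecessor */
            ;
        return pred;                                       /* reuse pred's node later */
    }

    static void clh_release(clh_node *me)
    {
        atomic_store(&me->locked, 0);                      /* hand off to successor */
    }

    static clh_lock lock;
    static long counter;

    static void *worker(void *arg)
    {
        (void)arg;
        clh_node *me = malloc(sizeof *me);
        for (int i = 0; i < 10000; i++) {
            clh_node *pred = clh_acquire(&lock, me);
            counter++;                                     /* critical section */
            clh_release(me);
            me = pred;                                     /* CLH node recycling */
        }
        free(me);
        return NULL;
    }

    int main(void)
    {
        clh_node *dummy = malloc(sizeof *dummy);
        atomic_store(&dummy->locked, 0);
        atomic_store(&lock.tail, dummy);

        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, 4 * 10000);
        free(atomic_load(&lock.tail));   /* the one node left in the queue */
        return 0;
    }

Because each waiter watches a different memory location, a release invalidates only the successor's cached copy, which is the property that makes queue locks scale where Test&(Test&Set) does not.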

Proceedings ArticleDOI
01 Nov 1994
TL;DR: The data presented shows that increasing the associativity of second-level caches can reduce miss rates significantly and should help system designers choose a cache configuration that will perform well in commercial markets.
Abstract: Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user commercial workloads. This paper presents research, using traces of industry-standard commercial benchmarks, which examines the characteristic differences between technical and commercial workloads and illustrates how those differences affect cache performance. Commercial and technical environments differ in their respective branch behavior, operating system activity, I/O, and dispatching characteristics. A wide range of uniprocessor instruction and data cache geometries were studied. The instruction cache results for commercial workloads demonstrate that instruction cache performance can no longer be neglected because these workloads have much larger code working sets than technical applications. For database workloads, a breakdown of kernel and user behavior reveals that the application component can exhibit behavior similar to the operating system and therefore, can experience miss rates equally high. This paper also indicates that “dispatching” or process switching characteristics must be considered when designing level-two caches. The data presented shows that increasing the associativity of second-level caches can reduce miss rates significantly. Overall, the results of this research should help system designers choose a cache configuration that will perform well in commercial markets.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
Abstract: The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
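
Exclusion means a block lives in either the first or the second level, never both: an L2 hit moves the block up and removes it from L2, and the L1 victim is demoted into L2, so the two levels together hold L1+L2 distinct blocks. The C sketch below is a toy fully-associative model with round-robin replacement, invented purely to show the movement of blocks.

    /* Toy two-level exclusive cache: a block is in L1 or L2, never both. */
    #include <stdio.h>

    #define L1_SIZE 2
    #define L2_SIZE 4

    static long l1[L1_SIZE], l2[L2_SIZE];
    static int n1, n2, f1, f2;          /* occupancy and round-robin cursors */

    static int find(long *c, int n, long b)
    {
        for (int i = 0; i < n; i++)
            if (c[i] == b)
                return i;
        return -1;
    }

    static void l2_insert(long b)
    {
        if (n2 < L2_SIZE) l2[n2++] = b;
        else { l2[f2] = b; f2 = (f2 + 1) % L2_SIZE; }   /* overwrite = evict to memory */
    }

    static void l1_insert(long b)
    {
        if (n1 < L1_SIZE) { l1[n1++] = b; return; }
        long victim = l1[f1];                           /* demote the L1 victim to L2 */
        l1[f1] = b;
        f1 = (f1 + 1) % L1_SIZE;
        l2_insert(victim);
    }

    static const char *access_block(long b)
    {
        if (find(l1, n1, b) >= 0)
            return "L1 hit";
        int j = find(l2, n2, b);
        if (j >= 0) {                                   /* exclusive: promote and remove */
            l2[j] = l2[--n2];
            if (f2 > n2) f2 = 0;
            l1_insert(b);
            return "L2 hit (moved to L1)";
        }
        l1_insert(b);                                   /* miss: fill L1 only */
        return "miss";
    }

    int main(void)
    {
        long trace[] = { 1, 2, 3, 1, 2, 3, 4, 5, 6, 1 };
        for (int i = 0; i < 10; i++)
            printf("block %ld: %s\n", trace[i], access_block(trace[i]));
        return 0;
    }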

Proceedings Article
01 Jan 1994
TL;DR: In this paper, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction, and a timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.
Abstract: The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable since the behavior of a cache reference depends upon the history of the previous references. The use of caches will only be suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.

Journal ArticleDOI
01 Nov 1994
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
Abstract: This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.

Patent
Alexander D. Peleg1, Uri Weiser1
30 Mar 1994
TL;DR: In this paper, the authors propose an improved cache and organization particularly suitable for superscalar architectures, where the cache is organized around trace segments of running programs rather than an organization based on memory addresses.
Abstract: An improved cache and organization particularly suitable for superscalar architectures. The cache is organized around trace segments of running programs rather than an organization based on memory addresses. A single access to the cache memory may cross virtual address line boundaries. Branch prediction is integrally incorporated into the cache array permitting the crossing of branch boundaries with a single access.

Journal ArticleDOI
TL;DR: This article demonstrates that, for the best known visualization algorithms, simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management.
Abstract: Recently, a new class of scalable, shared-address-space multiprocessors has emerged. Like message-passing machines, these multiprocessors have a distributed interconnection network and physically distributed main memory. However, they provide hardware support for efficient implicit communication through a shared address space, and they automatically exploit temporal locality by caching both local and remote data in a processor's hardware cache. In this article, we show that these architectural characteristics make it much easier to obtain very good speedups on the best known visualization algorithms. Simple and natural parallelizations work very well, the sequential implementations do not have to be fundamentally restructured, and the high degree of temporal locality obviates the need for explicit data distribution and communication management. We demonstrate our claims through parallel versions of three state-of-the-art algorithms: a recent hierarchical radiosity algorithm by Hanrahan et al. (1991), a parallelized ray-casting volume renderer by Levoy (1992), and an optimized ray-tracer by Spach and Pulleyblank (1992). We also discuss a new shear-warp volume rendering algorithm that provides the first demonstration of interactive frame rates for a 256×256×256 voxel data set on a general-purpose multiprocessor.

Journal ArticleDOI
01 Nov 1994
TL;DR: This paper examines the effects of OS scheduling and page migration policies on the performance of compute servers for multiprogramming and parallel application workloads, and suggests that policies based only on TLB miss information can be quite effective, and useful for addressing the data distribution problems of space-sharing schedulers.
Abstract: Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel application workloads. Process scheduling and memory management, however, remain challenging due to the distributed main memory found on such machines. This paper examines the effects of OS scheduling and page migration policies on the performance of such compute servers. Our experiments are done on the Stanford DASH, a distributed-memory cache-coherent multiprocessor. We show that for our multiprogramming workloads consisting of sequential jobs, the traditional Unix scheduling policy does very poorly. In contrast, a policy incorporating cluster and cache affinity along with a simple page-migration algorithm offers up to two-fold performance improvement. For our workloads consisting of multiple parallel applications, we compare space-sharing policies that divide the processors among the applications to time-slicing policies such as standard Unix or gang scheduling. We show that space-sharing policies can achieve better processor utilization due to the operating point effect, but time-slicing policies benefit strongly from user-level data distribution. Our initial experience with automatic page migration suggests that policies based only on TLB miss information can be quite effective, and useful for addressing the data distribution problems of space-sharing schedulers.

Journal ArticleDOI
TL;DR: This paper develops and evaluates techniques that automatically restructure program loops to achieve high performance on specific target architectures; these methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock.
Abstract: Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Because computations often require more data from cache per floating-point operation than a machine can deliver and because operations are pipelined, idle computational cycles are common when scientific applications are executed. To overcome these bottlenecks, programmers have learned to use a coding style that ensures a better balance between memory references and floating-point operations. In our view, this is a step in the wrong direction because it makes programs more machine-specific. A programmer should not be required to write a new program version for each new machine; instead, the task of specializing a program to a target machine should be left to the compiler. But is our view practical? Can a sophisticated optimizing compiler obviate the need for the myriad of programming tricks that have found their way into practice to improve the performance of the memory hierarchy? In this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they estimate statically the balance between memory operations and floating-point operations for each loop in a particular program and use these estimates to determine whether to apply various loop transformations. Experiments with our automatic techniques show that integer-factor speedups are possible on kernels. Additionally, the estimate of the balance between memory operations and computation, and the application of the estimate, are very accurate; experiments reveal little difference between the balance achieved by our automatic system and that made possible by hand optimization.

Book ChapterDOI
02 May 1994
TL;DR: The design and evaluation of the ADMS optimizer are described and the results showed that pointer caching and dynamic cache update strategies substantially speed up query computations and, thus, increase query throughput under situations with fair query correlation and update load.
Abstract: In this paper, we describe the design and evaluation of the ADMS optimizer. Capitalizing on a structure called Logical Access Path Schema to model the derivation relationship among cached query results, the optimizer is able to perform query matching coincidently with the optimization and generate more efficient query plans using cached results. The optimizer also features data caching and pointer caching, alternative cache replacement strategies, and different cache update strategies. An extensive set of experiments were conducted and the results showed that pointer caching and dynamic cache update strategies substantially speed up query computations and, thus, increase query throughput under situations with fair query correlation and update load. The requirement of the cache space is relatively small and the extra computation overhead introduced by the caching and matching mechanism is more than offset by the time saved in query processing.

Patent
12 Jul 1994
TL;DR: In this paper, the computer memory is partitioned into a plurality of buffer caches, each of which is separately addressable, and one buffer cache is set aside as a default buffer cache, while the other buffer caches are reserved for specific data objects meeting certain predefined criteria.
Abstract: Computer systems and computer implemented methods are provided for managing memory in a database management system. The computer memory is partitioned into a plurality of buffer caches, each of which is separately addressable. One buffer cache is set aside as a default buffer cache, while the other buffer caches are reserved for specific data objects meeting certain predefined criteria. Those objects meeting the predefined criteria are stored in reserved buffer caches where they are likely to remain for a relatively long period of time (in comparison to data objects stored in the default buffer caches). A buffer cache may have a plurality of memory pools, each of which contains multiple storage blocks. The storage blocks in a given memory pool are identically sized, while the storage blocks in one memory pool are sized differently from the storage blocks in another memory pool.