
Showing papers on "Cache pollution published in 1994"


Proceedings ArticleDOI
01 Apr 1994
TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches, and as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
Abstract: Today's commodity microprocessors require a low latency memory system to achieve high sustained performance. The conventional high-performance memory system provides fast data access via a large secondary cache. But large secondary caches can be expensive, particularly in large-scale parallel systems with many processors (and thus many caches). We evaluate a memory system design that can be both cost-effective and higher-performing, particularly for scientific workloads: a single level of (on-chip) cache backed up only by Jouppi's stream buffers [10] and a main memory. This memory system requires very little hardware compared to a large secondary cache and doesn't require modifications to commodity processors. We use trace-driven simulation of fifteen scientific applications from the NAS and PERFECT suites in our evaluation. We present two techniques to enhance the effectiveness of Jouppi's original stream buffers: filtering schemes to reduce their memory bandwidth requirement and a scheme that enables stream buffers to prefetch data being accessed in large strides. Our results show that, for the majority of our benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches. Also, we find that as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.

368 citations
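
The stride-prefetching enhancement described in this abstract can be sketched compactly. The Python toy below is an illustration under assumed parameters (64-byte blocks, a 4-entry buffer; none of the names come from the paper, and it is not the authors' simulator): it watches the miss stream for a repeated stride and, once one is confirmed, fills a small buffer with the predicted next blocks.

```python
# Toy sketch of a stride-detecting stream buffer (not the paper's implementation).
# Assumption: 64-byte cache blocks and a 4-entry buffer depth are illustrative only.
BLOCK = 64
DEPTH = 4

class StreamBuffer:
    def __init__(self):
        self.last_miss = None
        self.stride = None
        self.entries = []          # predicted block addresses held by the buffer

    def access(self, addr):
        block = addr // BLOCK * BLOCK
        if block in self.entries:  # hit in the stream buffer
            self.entries.remove(block)
            self._refill()
            return True
        # miss: update the stride history and (re)allocate the stream when a stride repeats
        if self.last_miss is not None:
            new_stride = block - self.last_miss
            if new_stride != 0 and new_stride == self.stride:
                self._allocate(block + self.stride)
            self.stride = new_stride
        self.last_miss = block
        return False

    def _allocate(self, start):
        self.entries = [start + i * self.stride for i in range(DEPTH)]

    def _refill(self):
        if self.entries and self.stride:
            self.entries.append(self.entries[-1] + self.stride)

buf = StreamBuffer()
trace = range(0, 128 * 20, 128)          # a scan with a constant 2-block stride
hits = sum(buf.access(a) for a in trace)
print(f"stream buffer hits: {hits} of {len(trace)} accesses")
```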


Journal ArticleDOI
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
Abstract: The performance of the data cache in shared-memory multiprocessors has been shown to be different from that in uniprocessors. In particular, cache miss rates in multiprocessors do not show the sharp drop typical of uniprocessors when the size of the cache block increases. The resulting high cache miss rate is a cause of concern, since it can significantly limit the performance of multiprocessors. Some researchers have speculated that this effect is due to false sharing, the coherence transactions that result when different processors update different words of the same cache block in an interleaved fashion. While the analysis of six applications in the paper confirms that false sharing has a significant impact on the miss rate, the measurements also show that poor spatial locality among accesses to shared data has an even larger impact. To mitigate false sharing and to enhance spatial locality, we optimize the layout of shared data in cache blocks in a programmer-transparent manner. We show that this approach can reduce the number of misses on shared data by about 10% on average.

265 citations
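
A minimal illustration of the layout effect the paper measures (my own example, not the paper's methodology or benchmarks): with an interleaved layout, words updated by different processors share cache blocks and invite false sharing, while grouping each processor's words into contiguous blocks removes the sharing entirely.

```python
# Toy layout experiment (my own example, not the paper's measurements):
# count how many cache blocks are updated by more than one processor
# under an interleaved layout versus a per-processor (grouped) layout.
BLOCK_WORDS = 8          # assumed 8 words per cache block

def writers_per_block(word_owner):
    """word_owner[i] = processor that updates word i; return blocks with >1 writer."""
    blocks = {}
    for word, owner in enumerate(word_owner):
        blocks.setdefault(word // BLOCK_WORDS, set()).add(owner)
    return sum(1 for owners in blocks.values() if len(owners) > 1)

N = 64
interleaved = [i % 2 for i in range(N)]            # procs 0 and 1 own alternating words
grouped = [0] * (N // 2) + [1] * (N // 2)          # each proc's words laid out contiguously

print("falsely shared blocks, interleaved layout:", writers_per_block(interleaved))  # 8
print("falsely shared blocks, grouped layout:   ", writers_per_block(grouped))       # 0
```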


Proceedings ArticleDOI
07 Dec 1994
TL;DR: This paper describes an approach for bounding the worst-case instruction cache performance of large code segments by using static cache simulation to analyze a program's control flow to statically categorize the caching behavior of each instruction.
Abstract: The use of caches poses a difficult tradeoff for architects of real-time systems. While caches provide significant performance advantages, they have also been viewed as inherently unpredictable, since the behavior of a cache reference depends upon the history of the previous references. The use of caches is only suitable for real-time systems if a reasonably tight bound on the performance of programs using cache memory can be predicted. This paper describes an approach for bounding the worst-case instruction cache performance of large code segments. First, a new method called static cache simulation is used to analyze a program's control flow to statically categorize the caching behavior of each instruction. A timing analyzer, which uses the categorization information, then estimates the worst-case instruction cache performance for each loop and function in the program.

233 citations
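
As a rough illustration of what "statically categorizing the caching behavior of each instruction" can mean (a simplified sketch, not the paper's static cache simulation algorithm; the cache geometry is assumed), the toy below classifies the instructions of a single loop in a direct-mapped instruction cache as first-miss when their line is conflict-free, and as conflict otherwise.

```python
# Simplified sketch of static categorization for a direct-mapped I-cache
# (not the paper's algorithm; line size and cache size are assumptions).
LINE = 16          # bytes per cache line
LINES = 64         # number of cache lines

def categorize(loop_addrs):
    """Classify each instruction of a single loop as 'first-miss' (its line is
    brought in once and then reused) or 'conflict' (another loop instruction
    maps to the same cache line, so hits cannot be guaranteed)."""
    line_of = {a: (a // LINE) % LINES for a in loop_addrs}
    occupants = {}
    for a, l in line_of.items():
        occupants.setdefault(l, set()).add(a // LINE)   # distinct memory lines per cache line
    return {a: ('first-miss' if len(occupants[line_of[a]]) == 1 else 'conflict')
            for a in loop_addrs}

# a 1 KB direct-mapped cache: instructions at 0x1000 and 0x1400 collide on line 0
loop = list(range(0x1000, 0x1040, 4)) + list(range(0x1400, 0x1410, 4))
cats = categorize(loop)
print(sum(c == 'conflict' for c in cats.values()), "conflicting instructions of", len(cats))
```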


Proceedings Article
12 Sep 1994
TL;DR: It is shown that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of the cache, and new algorithms run 8%-200% faster than the traditional ones.
Abstract: The current main memory (DRAM) access speeds lag far behind CPU speeds. Cache memory, made of static RAM, is being used in today's architectures to bridge this gap. It provides access latencies of 2-4 processor cycles, in contrast to main memory which requires 15-25 cycles. Therefore, the performance of the CPU depends upon how well the cache can be utilized. We show that there are significant benefits in redesigning our traditional query processing algorithms so that they can make better use of the cache. The new algorithms run 8%-200% faster than the traditional ones.

215 citations
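
One flavor of cache-conscious restructuring in this spirit is to partition a hash join's build input into cache-sized chunks so that each partition's hash table stays cache-resident while it is probed. The sketch below is illustrative only; the 256-tuple budget and the code are mine, not the paper's algorithms.

```python
# Illustrative cache-sized partitioning for a hash join (a sketch, not the
# paper's algorithms; the 256-entry "cache budget" is an arbitrary stand-in).
CACHE_BUDGET = 256            # max build tuples whose hash table we keep "cache-resident"

def partitioned_hash_join(build, probe, key=lambda t: t[0]):
    nparts = max(1, (len(build) + CACHE_BUDGET - 1) // CACHE_BUDGET)
    bparts = [[] for _ in range(nparts)]
    pparts = [[] for _ in range(nparts)]
    for t in build:
        bparts[hash(key(t)) % nparts].append(t)
    for t in probe:
        pparts[hash(key(t)) % nparts].append(t)
    out = []
    for bp, pp in zip(bparts, pparts):      # each small hash table fits in cache
        table = {}
        for t in bp:
            table.setdefault(key(t), []).append(t)
        for t in pp:
            out.extend((b, t) for b in table.get(key(t), []))
    return out

build = [(i % 500, 'b') for i in range(1000)]
probe = [(i % 500, 'p') for i in range(2000)]
print(len(partitioned_hash_join(build, probe)), "join results")
```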


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.
Abstract: The performance of two-level on-chip caching is investigated for a range of technology and architecture assumptions. The area and access time of each level of cache is modeled in detail. The results indicate that for most workloads, two-level cache configurations (with a set-associative second level) perform marginally better than single-level cache configurations that require the same chip area once the first-level cache sizes are 64KB or larger. Two-level configurations become even more important in systems with no off-chip cache and in systems in which the memory cells in the first-level caches are multiported and hence larger than those in the second-level cache. Finally, a new replacement policy called two-level exclusive caching is introduced. Two-level exclusive caching improves the performance of two-level caching organizations by increasing the effective associativity and capacity.

195 citations
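
The exclusive property introduced here can be modeled in a few lines (a toy under assumed block counts, not the paper's area and access-time model): on an L1 miss that hits in L2, the block is moved rather than copied into L1, and the L1 victim drops into L2, so the two levels never duplicate a block and their capacities add.

```python
from collections import OrderedDict

# Toy exclusive two-level cache (a sketch; block counts are arbitrary assumptions).
class Level:
    def __init__(self, nblocks):
        self.n = nblocks
        self.blocks = OrderedDict()          # LRU order: oldest first
    def hit(self, b):
        if b in self.blocks:
            self.blocks.move_to_end(b)
            return True
        return False
    def insert(self, b):
        """Insert block b, returning an evicted victim or None."""
        self.blocks[b] = True
        self.blocks.move_to_end(b)
        if len(self.blocks) > self.n:
            return self.blocks.popitem(last=False)[0]
        return None

class ExclusiveCache:
    def __init__(self, l1_blocks, l2_blocks):
        self.l1, self.l2 = Level(l1_blocks), Level(l2_blocks)
    def access(self, b):
        if self.l1.hit(b):
            return 'L1'
        if b in self.l2.blocks:
            del self.l2.blocks[b]            # move (not copy) L2 -> L1: exclusivity
            level = 'L2'
        else:
            level = 'MEM'
        victim = self.l1.insert(b)
        if victim is not None:
            self.l2.insert(victim)           # the L1 victim goes to L2
        return level

c = ExclusiveCache(4, 8)
trace = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4] * 2   # working set of 5 blocks fits the combined capacity
print([c.access(b) for b in trace].count('MEM'), "memory fetches for", len(trace), "accesses")
```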


Journal ArticleDOI
01 Nov 1994
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.
Abstract: This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects conflicts by recording and summarizing a history of cache misses, and a software policy within the operating system's virtual memory system that removes conflicts by dynamically remapping pages whenever large numbers of conflict misses are detected. Using trace-driven simulation of applications and the operating system, we show that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set-associative cache of equivalent size and speed, although with lower hardware cost and complexity.

187 citations
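
The CML idea, recording and summarizing miss history in hardware and letting the virtual memory system remap pages when conflicts pile up, can be sketched as follows (assumed page size, cache size, and threshold; not the authors' hardware/OS design).

```python
from collections import Counter

# Toy Cache Miss Lookaside sketch (assumed parameters, not the paper's design):
# count misses per cache-index "bin"; when a bin's miss count crosses a threshold,
# flag the pages mapping to it as candidates for dynamic remapping by the VM system.
PAGE = 4096
CACHE_SIZE = 1 << 20                # 1 MB direct-mapped cache (assumption)
BIN = PAGE                          # track conflicts at page granularity
THRESHOLD = 8

def find_remap_candidates(miss_addresses):
    bins = Counter((a % CACHE_SIZE) // BIN for a in miss_addresses)
    hot = {b for b, n in bins.items() if n >= THRESHOLD}
    # pages whose cache color falls in a hot bin are remap candidates
    return sorted({a // PAGE for a in miss_addresses
                   if (a % CACHE_SIZE) // BIN in hot})

# two pages 1 MB apart conflict in a 1 MB direct-mapped cache
trace = [0x100000 * i + off for i in range(2) for off in range(0, PAGE, 512)] * 2
print("pages to remap:", find_remap_candidates(trace))
```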


Patent
Alexander D. Peleg1, Uri Weiser1
30 Mar 1994
TL;DR: In this paper, the authors propose an improved cache and organization particularly suitable for superscalar architectures, where the cache is organized around trace segments of running programs rather than an organization based on memory addresses.
Abstract: An improved cache and organization particularly suitable for superscalar architectures. The cache is organized around trace segments of running programs rather than an organization based on memory addresses. A single access to the cache memory may cross virtual address line boundaries. Branch prediction is integrally incorporated into the cache array permitting the crossing of branch boundaries with a single access.

186 citations


Patent
08 Mar 1994
TL;DR: A second level cache memory controller, implemented as an integrated circuit unit, operates in conjunction with a secondary random access cache memory and a main memory (system) bus controller to form a second-level cache memory subsystem.
Abstract: A second level cache memory controller, implemented as an integrated circuit unit, operates in conjunction with a secondary random access cache memory and a main memory (system) bus controller to form a second level cache memory subsystem. The subsystem is interfaced to the local processor (CPU) bus and to the main memory bus, providing independent access by both buses and thereby reducing traffic on the main memory bus when the data required by the CPU is located in the secondary cache. Similarly, CPU bus traffic is minimized when the secondary cache is accessed by the main memory bus for snoops and write-backs to main memory. Snoop latches interfaced with the main memory bus provide snoop access to the cache memory via the cache directory in the secondary cache controller unit. The controller also supports parallel look-up in the controller tag array and the secondary cache using most-recently-used (MRU) main memory write-through and pipelining of memory bus cycle requests.

159 citations


Proceedings Article
06 Jun 1994
TL;DR: The main contribution of this paper is the solution to the allocation problem, which allows processes to manage their own cache blocks, while at the same time maintains the dynamic allocation of cache blocks among processes.
Abstract: We consider how to improve the performance of file caching by allowing user-level control over file cache replacement decisions. We use two-level cache management: the kernel allocates physical pages to individual applications (allocation), and each application is responsible for deciding how to use its physical pages (replacement). Previous work on two-level memory management has focused on replacement, largely ignoring allocation. The main contribution of this paper is our solution to the allocation problem. Our solution allows processes to manage their own cache blocks, while at the same time maintaining the dynamic allocation of cache blocks among processes. Our solution ensures that good user-level policies can improve the file cache hit ratios of the entire system over the existing replacement approach. We evaluate our scheme by trace-based simulation, demonstrating that it leads to significant improvements in hit ratios for a variety of applications.

138 citations
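
A heavily simplified sketch of the two-level split (not the paper's exact allocation algorithm; the policy hook and all names are assumptions): the kernel decides which process must give up a block using global LRU, but that process's own policy chooses which of its blocks to surrender.

```python
from collections import OrderedDict

# Highly simplified two-level cache management sketch (not the paper's policy):
# the kernel picks which *process* must give up a block (global LRU), but lets
# that process choose *which* of its own blocks to evict.
class TwoLevelCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.owner = {}                       # block -> owning process
        self.global_lru = OrderedDict()       # block -> True, oldest first

    def access(self, proc, block, choose_victim=None):
        if block in self.global_lru:
            self.global_lru.move_to_end(block)
            return 'hit'
        if len(self.global_lru) >= self.capacity:
            lru_block = next(iter(self.global_lru))
            victim_proc = self.owner[lru_block]
            mine = [b for b, p in self.owner.items() if p == victim_proc]
            # the chosen process's user-level policy may override global LRU
            victim = choose_victim(mine) if (choose_victim and victim_proc == proc) else lru_block
            del self.global_lru[victim]
            del self.owner[victim]
        self.global_lru[block] = True
        self.owner[block] = proc
        return 'miss'

cache = TwoLevelCache(capacity=3)
# a scanning process tells the kernel to evict its most recently inserted block
mru_policy = lambda blocks: blocks[-1]
for b in ['a', 'b', 'c', 'd']:
    print(b, cache.access('scan', b, choose_victim=mru_policy))
```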


Patent
16 May 1994
TL;DR: In this article, a cache memory is organized in multiple levels, each level having multiple entries, the entries of each level receiving information of a predetermined category, each entry being accessible independently; links are defined between entries of one level of the cache memory and entries at another level, the links corresponding to information relationships specified by a user of information stored in the secondary storage.
Abstract: Information needed by application programs from a secondary storage is cached in a cache memory which is organized in multiple levels, each level having multiple entries, the entries of each level receiving information of a predetermined category, each entry being accessible independently. Links are defined between entries of one level of the cache memory and entries at another level of the cache memory, the links corresponding to information relationships specified by a user of information stored in the secondary storage. In response to a request to a file system from an application for needed information, the needed information is fetched into the cache, and in connection with fetching the needed information, other information that is not immediately needed is prefetched from the system of files. Quotas are established on information which may be fetched from a secondary storage into the cache, the quotas being applicable to file contents within a file and to the number of files within a directory. Upon a request from an application program to open a file, an entry is created in a cache corresponding to the file, the entry including file header information. The file header information is retained in the cache so long as the file remains open, whether or not any file contents of the file remain in cache. The entries of the three levels of the cache respectively receive directory information, file header information, and file contents information.

134 citations


Patent
26 Apr 1994
TL;DR: In this article, a hierarchical memory system is provided which includes a cache and long-term storage, in which an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage.
Abstract: A data processing system (10) has a processor with a processor memory and a mechanism for specifying an address that corresponds to a processor-requested data block located within another memory to be accessed by the processor. A hierarchical memory system is provided which includes a cache (16) and long-term storage (20). In accordance with a mapping and meshing process performed by a memory subsystem (22), an address of a requested data block is translated to a second addressing scheme, and is meshed, so that proximate data blocks are placed on different physical target disks within the long-term storage. In accordance with a cache drain mechanism, the cache drains data to the physical target disks under different specified conditions. A further mechanism is provided for preserving data within the cache that is frequently accessed by the requester processor. A user-configuration mechanism is provided.


Patent
Akihisa Fujimoto1
22 Mar 1994
TL;DR: In this article, a frame buffer cache is arranged to store part of image data in an image memory so that a CPU and a drawing processor can perform image data read/write operations by only accessing the frame buffer.
Abstract: A frame buffer cache is arranged to store part of image data in an image memory so that a CPU and a drawing processor can perform image data read/write operations by only accessing the frame buffer cache. Therefore, the image data read/write operations of the CPU and the drawing processor can be performed simultaneously with the access to a dual port image memory, thus improving the drawing performance of the CPU and the drawing processor.

Patent
20 Jun 1994
TL;DR: In this paper, a cache storage drawer containing a plurality of DASD devices for implementing a RAID parity data protection scheme, and permanently storing data, is coupled with a cache controller.
Abstract: A system and method for reducing device wait time in response to a host initiated write operation modifying a data block. The system includes a host computer channel connected to a storage controller which has cache memory and a nonvolatile storage buffer in a first embodiment. An identical system makes up the second embodiment with the exception that there is no nonvolatile storage buffer in the storage controller of the second embodiment. The controller in either embodiment is coupled to a cache storage drawer containing a plurality of DASD devices for implementing a RAID parity data protection scheme, and for permanently storing data. The drawer has nonvolatile cache memory which is used for accepting data destaged from controller cache. In a first embodiment, no commit reply is sent to the controller to indicate that data has been written to DASD. Instead a status information block is created to indicate that the data has been destaged from controller cache but is not committed. The status information is stored in directory means attached to the controller. The system uses this information to create a list of data which is in the state of Not committed. In this way data can be committed according to a cache management algorithm of least recently used (LRU), rather than requiring synchronous commit which is inefficient because it requires waiting on a commit response and ties up nonvolatile storage space allocated to back-up copies of cache data. In a second embodiment, directory means attached to the controller stores information about status blocks that may be modified or unmodified. The status information is used to eliminate wait times associated with waiting for data to be written to HDAs below.

Proceedings ArticleDOI
28 Feb 1994
TL;DR: A new processor implementing Hewlett-Packard's PA-RISC 1.1 (Precision Architecture) has been designed, which incorporates many improvements over the HP PA7100 CPU, including increased frequency, instruction and data cache prefetching, enhanced superscalar execution, and enhanced multiprocessor support.
Abstract: A new processor implementing Hewlett-Packard's PA-RISC 1.1 (Precision Architecture) has been designed. This latest design incorporates many improvements over the HP PA7100 CPU, including increased frequency, instruction and data cache prefetching, enhanced superscalar execution, and enhanced multiprocessor support. The PA7200 connects directly to a new split transaction, 120 MHz, 64-bit bus capable of supporting multiple processors and multiple outstanding memory reads per processor. A novel fully associative on-chip data cache, which is accessed in parallel with an external data cache, is used to reduce the miss rate and facilitate hardware and software directed prefetching to reduce average memory access time.

Patent
30 Dec 1994
TL;DR: In this article, the cache memory space in a computer system is controlled on a dynamic basis by adjusting the low threshold which triggers the release of more cache free space and the high threshold which ceases the free space.
Abstract: The cache memory space in a computer system is controlled on a dynamic basis by adjusting the low threshold which triggers the release of more cache free space and by adjusting the high threshold which ceases the release of free space. The low and high thresholds are predicted based on the number of allocations which are accomplished in response to I/O requests, and based on the number of blockages which occur when an allocation can not be accomplished. The predictions may be based on weighted values of different historical time periods, and the high and low thresholds may be made equal to one another. In this manner the performance degradation resulting from variations in workload caused by prior art fixed or static high and low thresholds is avoided. Instead, only a predicted amount of cache memory space is freed, and that amount of free space is more likely to accommodate the predicted output requests without releasing so much cache space that an unacceptable number of blockages occur.
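
An illustrative reading of the prediction step (the weighting and formulas below are assumptions made for the sketch, not the patent's claimed method): blend allocation and blockage counts over recent history periods, weighting recent periods more heavily, and derive the free-space thresholds from the blended estimate, with the high threshold allowed to equal the low one.

```python
# Illustrative threshold prediction (not the patent's exact formula): weight
# recent history periods more heavily and size the free-space target from the
# predicted allocation demand, nudged upward when blockages have been occurring.
def predict_thresholds(alloc_history, blockage_history, weights=(0.5, 0.3, 0.2)):
    """Histories are most-recent-first counts per time period."""
    predicted_allocs = sum(w * a for w, a in zip(weights, alloc_history))
    predicted_blocks = sum(w * b for w, b in zip(weights, blockage_history))
    low = predicted_allocs + 2 * predicted_blocks   # start freeing below this much free space
    high = low                                      # the patent allows high == low
    return low, high

low, high = predict_thresholds(alloc_history=[120, 100, 80],
                               blockage_history=[4, 1, 0])
print(f"free-space thresholds: low={low:.0f} blocks, high={high:.0f} blocks")
```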

Patent
12 Dec 1994
TL;DR: In this paper, a cache indexer maintains a current index of data elements which are stored in cache memory and a sequential data access indicator, responsive to the cache index and to a user selectable sequential access threshold, determines that a sequential access is in progress for a given process and provides an indication of the same.
Abstract: A cache management system and method monitors and controls the contents of cache memory coupled to at least one host and at least one data storage device. A cache indexer maintains a current index of data elements which are stored in cache memory. A sequential data access indicator, responsive to the cache index and to a user selectable sequential data access threshold, determines that a sequential data access is in progress for a given process and provides an indication of the same. The system and method allocate a micro-cache memory to any process performing a sequential data access. In response to the indication of a sequential data access in progress and to a user selectable maximum number of data elements to be prefetched, a data retrieval requestor requests retrieval of up to the selected maximum number of data elements from a data storage device. A user selectable number of sequential data elements determines when previously used micro-cache memory locations will be overwritten. A method of dynamically monitoring and adjusting cache management parameters is also presented.
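
The sequential-access detection described here reduces to a small state machine (the parameter names below are mine; the patent's thresholds are user-selectable): once a run of consecutive element accesses reaches the threshold, prefetch up to the selected maximum number of elements into the process's micro-cache.

```python
# Toy sketch of the sequential-access detector (parameter names are mine, not
# the patent's): once a process has touched `seq_threshold` consecutive data
# elements, prefetch up to `max_prefetch` elements ahead into its micro-cache.
class SequentialDetector:
    def __init__(self, seq_threshold=3, max_prefetch=4):
        self.seq_threshold = seq_threshold
        self.max_prefetch = max_prefetch
        self.last = None
        self.run = 0

    def access(self, element):
        """Return the list of elements to prefetch after this access (may be empty)."""
        self.run = self.run + 1 if self.last is not None and element == self.last + 1 else 1
        self.last = element
        if self.run >= self.seq_threshold:
            return [element + i for i in range(1, self.max_prefetch + 1)]
        return []

d = SequentialDetector()
for e in [10, 11, 12, 13, 50]:
    print(e, '-> prefetch', d.access(e))
```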

Patent
23 Dec 1994
TL;DR: In this paper, a column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time is proposed, where the cache lines represent a column of sets.
Abstract: A column-associative cache that reduces conflict misses, increases the hit rate and maintains a minimum hit access time. The column-associative cache indexes data from a main memory into a plurality of cache lines according to a tag and index field through hash and rehash functions. The cache lines represent a column of sets. Each cache line contains a rehash block indicating whether the set is a rehash location. To increase the performance of the column-associative cache, a content addressable memory (CAM) is used to predict future conflict misses.
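
A minimal sketch of the hash/rehash lookup follows (simplified from the patent; the rehash function shown, flipping the high-order index bit, and the swap-on-rehash-hit policy are common formulations of column associativity rather than necessarily the claimed embodiment).

```python
# Simplified column-associative lookup (a sketch, not the patent's full design):
# hash h1 is the usual index, the rehash h2 flips the top index bit; on an h1
# miss the h2 location is probed, and on a rehash hit the two lines are swapped
# so the most recent block sits at its primary location.
SETS = 8

def h1(block):                 # block = address without the block offset
    return block % SETS

def h2(block):
    return h1(block) ^ (SETS >> 1)     # flip the high-order index bit

class ColumnAssociativeCache:
    def __init__(self):
        self.lines = [None] * SETS     # each entry holds a full block id

    def access(self, block):
        a, b = h1(block), h2(block)
        if self.lines[a] == block:
            return 'hit'
        if self.lines[b] == block:     # rehash hit: swap into the primary slot
            self.lines[a], self.lines[b] = self.lines[b], self.lines[a]
            return 'rehash-hit'
        self.lines[b] = self.lines[a]  # displace the primary occupant to the rehash slot
        self.lines[a] = block
        return 'miss'

c = ColumnAssociativeCache()
print([c.access(b) for b in [0, 8, 0, 8, 0]])   # 0 and 8 would thrash a direct-mapped cache
```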

Patent
13 Jul 1994
TL;DR: In this article, an analytical model is presented for calculating the cache hit rate for combinations of data sets and LRU sizes; the algorithms can be directly implemented in software to construct a precise model that predicts cache hit rates using statistics accumulated for each element independently.
Abstract: Method and structure for collecting statistics for quantifying locality of data and thus selecting elements to be cached, and then calculating the overall cache hit rate as a function of cached elements. LRU stack distance has a straight-forward probabilistic interpretation and is part of statistics to quantify locality of data for each element considered for caching. Request rates for additional slots in the LRU are a function of file request rate and LRU size. Cache hit rate is a function of locality of data and the relative request rates for data sets. Specific locality parameters for each data set and arrival rate of requests for data-sets are used to produce an analytical model for calculating cache hit rate for combinations of data sets and LRU sizes. This invention provides algorithms that can be directly implemented in software for constructing a precise model that can be used to predict cache hit rates for a cache, using statistics accumulated for each element independently. The model can rank the elements to find the best candidates for caching. Instead of considering the cache as a whole, the average arrival rates and re-reference statistics for each element are estimated, and then used to consider various combinations of elements and cache sizes in predicting the cache hit rate. Cache hit rate is directly calculated using the to-be-cached files' arrival rates and re-reference statistics and used to rank the elements to find the set that produces the optimal cache hit rate.
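
The LRU stack distance statistic at the core of the model is easy to compute from a reference trace. The sketch below is my own illustration of the statistic, not the patented analytical model that combines per-element arrival rates: it builds the distance list and reads off the hit ratio for any LRU size as the fraction of references whose distance is smaller than that size.

```python
# Compute LRU stack distances from a reference trace and derive hit ratios
# for several cache sizes (an illustration of the statistic only, not the
# patent's analytical model).
def stack_distances(trace):
    stack, dists = [], []
    for x in trace:
        if x in stack:
            d = stack.index(x)          # 0 = re-reference of the most recent item
            stack.remove(x)
        else:
            d = None                    # cold (infinite-distance) reference
        dists.append(d)
        stack.insert(0, x)
    return dists

def hit_ratio(dists, cache_size):
    hits = sum(1 for d in dists if d is not None and d < cache_size)
    return hits / len(dists)

trace = ['a', 'b', 'a', 'c', 'b', 'a', 'd', 'a', 'b', 'c']
dists = stack_distances(trace)
for size in (1, 2, 3, 4):
    print(f"LRU size {size}: predicted hit ratio {hit_ratio(dists, size):.2f}")
```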

Proceedings ArticleDOI
12 Sep 1994
TL;DR: In this article, the cache kernel is proposed to provide a hardware adaptation layer to operating system services rather than just providing a key subset of OS services, as has been the common approach in previous microkernel work.
Abstract: Operating system design has had limited success in providing adequate application functionality and a poor record in avoiding excessive growth in size and complexity, especially with protected operating systems. Applications require far greater control over memory, I/O and processing resources to meet their requirements. For example, database transaction processing systems include their own "kernel" which can much better manage resources for the application than can the application-ignorant general-purpose conventional operating system mechanisms. Large-scale parallel applications have similar requirements. The same requirements arise with servers implemented outside the operating system kernel. In our research, we have been exploring the approach of making the operating system kernel a cache for active operating system objects such as processes, address spaces and communication channels, rather than a complete manager of these objects. The resulting system is smaller than recent so-called microkernels, and also provides greater flexibility for applications, including real-time applications, database management systems and large-scale simulations. As part of this research, we have developed what we call a cache kernel, a new generation of microkernel that supports operating system configurations across these dimensions. The cache kernel can also be regarded as providing a hardware adaptation layer (HAL) to operating system services rather than trying to just provide a key subset of OS services, as has been the common approach in previous microkernel work. However, in contrast to conventional HALs, the cache kernel is fault-tolerant because it is protected from the rest of the operating system (and applications), it is replicated in large-scale configurations and it includes audit and recovery mechanisms. A cache kernel has been implemented on a scalable shared-memory and networked multi-computer [1] hardware which provides architectural support for the cache kernel approach. Fig. 1 illustrates a typical target configuration. There is an instance of the cache kernel per multi-processor module (MPM), each managing the processors, second-level cache and network interface of that MPM. The cache kernel executes out of PROM and local memory of the MPM, making it hardware-independent of the rest of the system except for power. That is, the separate cache kernels and MPMs fail independently. Operating system services are provided by application kernels, server kernels and conventional operating system emulation kernels in conjunction with privileged MPM resource managers (MRM) that execute on top of the cache kernel. These kernels may be in separate protected address spaces or a shared library within a sophisticated application address space. A system bus connects the MPMs to each other and the memory modules. A high-speed network interface per MPM connects this node to file servers and other similarly configured processing nodes. This overall design can be simplified for real-time applications and similar restricted scenarios. For example, with relatively static partitioning of resources, an embedded real-time application could be structured as one or more application spaces incorporating application kernels as shared libraries executing directly on top of the cache kernel.

Patent
Srinivas Raman1
15 Apr 1994
TL;DR: In this paper, the cache coherency is maintained by writing back all modified data in the cache prior to execution of the command that initiates the DMA or Bus Master transfer to or from main memory.
Abstract: A system and method for guaranteeing coherency between a write back cache and main memory in a computer system that does not have the bus level signals for a conventional write back cache memory. Cache coherency can be maintained by writing back all modified data in the cache prior to execution of the command that initiates the DMA or Bus Master transfer to or from main memory. When bus snooping logic detects writes from the CPU, the cache and main memory are synchronized. After synchronization, the bus snooper continues to look for access hits to modified data in the cache. If a hit occurs and it is a DMA cycle, the CPU is prevented from further accesses to the cache until after the DMA transfer, and the modified bytes are then written back to main memory. If it is a bus master device seeking access to main memory, the CPU is prevented from further accesses to the cache until the modified bytes are written back to main memory.

Patent
01 Nov 1994
TL;DR: In this article, a filtered stream buffer coupled with a memory and a processor is proposed to prefetch data from the memory, where the filter controller determines whether a pattern of references has a predetermined relationship, and if so, prefetches stream data into the cache block storage area.
Abstract: Method and apparatus for a filtered stream buffer coupled to a memory and a processor, and operating to prefetch data from the memory. The filtered stream buffer includes a cache block storage area and a filter controller. The filter controller determines whether a pattern of references has a predetermined relationship, and if so, prefetches stream data into the cache block storage area. Such stream data prefetches are particularly useful in vector processing computers, where once the processor starts to fetch a vector, the addresses of future fetches can be predicted based on the pattern of past fetches. According to various aspects of the present invention, the filtered stream buffer further includes a history table and a validity indicator which is associated with the cache block storage area and indicates which cache blocks, if any, are valid. According to yet another aspect of the present invention, the filtered stream buffer controls random access memory (RAM) chips to stream the plurality of consecutive cache blocks from the RAM into the cache block storage area. According to yet another aspect of the present invention, the stream data includes data for a plurality of strided cache blocks, wherein each of these strided cache blocks corresponds to an address determined by adding to the first address an integer multiple of the difference between the second address and the first address. According to yet another aspect of the present invention, the processor generates three addresses of data words in the memory, and the filter controller determines whether a predetermined relationship exists among the three addresses, and if so, prefetches strided stream data into said cache block storage area.

Patent
15 Nov 1994
TL;DR: In this paper, a cache subsystem for a computer system having a processor and a main memory is described, which includes a prefetch buffer coupled to the processor and the main memory.
Abstract: A cache subsystem for a computer system having a processor and a main memory is described. The cache subsystem includes a prefetch buffer coupled to the processor and the main memory. The prefetch buffer stores a first data prefetched from the main memory in accordance with a predicted address for a next memory fetch by the processor. The predicted address is based upon an address for a last memory fetch from the processor. A main cache is coupled to the processor and the main memory. The main cache is not coupled to the prefetch buffer and does not receive data from the prefetch buffer. The main cache stores a second data fetched from the main memory in accordance with the address for the last memory fetch by the processor only if the address for the last memory fetch is an unpredictable address. The address for the last memory fetch is the unpredictable address if both of the prefetch buffer and the main cache do not contain the address and the second data associated with the address.

Patent
30 Sep 1994
TL;DR: In this paper, a data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions, where each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor.
Abstract: The data cache unit includes a separate fill buffer and a separate write-back buffer. The fill buffer stores one or more cache lines for transference into data cache banks of the data cache unit. The write-back buffer stores a single cache line evicted from the data cache banks prior to write-back to main memory. Circuitry is provided for transferring a cache line from the fill buffer into the data cache banks while simultaneously transferring a victim cache line from the data cache banks into the write-back buffer. Such allows the overall replace operation to be performed in only a single clock cycle. In a particular implementation, the data cache unit is employed within a microprocessor capable of speculative and out-of-order processing of memory instructions. Moreover, the microprocessor is incorporated within a multiprocessor computer system wherein each microprocessor is capable of snooping the cache lines of data cache units of each other microprocessor. The data cache unit is also a non-blocking cache.

Proceedings ArticleDOI
18 Apr 1994
TL;DR: A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache introduced in the paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
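
The decoupling can be pictured as giving every cache-line slot a small pointer into a pool of sector tags instead of one fixed tag. The toy set below uses assumed sizes and a naive victim choice (it is not the paper's design or replacement policy); it shows two sectors that would thrash a conventional sectored set coexisting once the tags are decoupled.

```python
# Toy decoupled sectored cache set (a sketch of the idea, not the paper's design):
# a set holds LINES_PER_SECTOR line slots and a small pool of NTAGS sector tags;
# each line slot stores a pointer into the tag pool instead of being tied to one
# fixed tag, so lines of different sectors can coexist in one set.
LINES_PER_SECTOR = 4
NTAGS = 2

class DecoupledSectoredSet:
    def __init__(self):
        self.tags = [None] * NTAGS                       # sector address tags
        self.line_tag = [None] * LINES_PER_SECTOR        # per-line pointer into self.tags
        self.line_valid = [False] * LINES_PER_SECTOR

    def access(self, sector, line):
        if self.line_valid[line] and self.tags[self.line_tag[line]] == sector:
            return 'hit'
        # miss: find (or victimize) a tag entry for this sector, then claim the line slot
        if sector in self.tags:
            t = self.tags.index(sector)
        else:
            t = self.tags.index(None) if None in self.tags else 0   # naive tag victim
            for i in range(LINES_PER_SECTOR):                       # drop the victim's lines
                if self.line_tag[i] == t:
                    self.line_valid[i] = False
            self.tags[t] = sector
        self.line_tag[line] = t
        self.line_valid[line] = True
        return 'miss'

s = DecoupledSectoredSet()
trace = [(100, 0), (200, 1), (100, 0), (200, 1), (100, 2)]   # two sectors share the set
print([s.access(sec, ln) for sec, ln in trace])
```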

Patent
29 Jun 1994
TL;DR: A master-slave cache system as discussed by the authors uses a set-associative master cache and two smaller direct-mapped slave caches, a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for providing data operands to an execution pipeline of the processor.
Abstract: A master-slave cache system has a large, set-associative master cache and two smaller direct-mapped slave caches: a slave instruction cache for supplying instructions to an instruction pipeline of a processor, and a slave data cache for supplying data operands to an execution pipeline of the processor. The master cache and the slave caches are tightly coupled to each other. This tight coupling allows the master cache to perform most cache management operations for the slave caches, freeing the slave caches to supply a high bandwidth of instructions and operands to the processor's pipelines. The master cache contains tags that include valid bits for each slave, allowing the master cache to determine if a line is present and valid in either of the slave caches without interrupting the slave caches. The master cache performs all search operations required by external snooping, cache invalidation, cache data zeroing instructions, and store-to-instruction-stream detection. The master cache interrupts the slave caches only when the search reveals that a line is valid in a slave cache, the master cache causing the slave cache to invalidate the line. A store queue is shared between the master cache and the slave data cache. Store data is written from the store queue directly into both the slave data cache and the master cache, eliminating the need for the slave data cache to write data through to the master cache. The master-slave cache system also eliminates the need for a second set of address tags for snooping and coherency operations. The master cache can be large and designed for a low miss rate, while the slave caches are designed for the high speed required by the processor's pipelines.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The decoupled sectored cache introduced in this paper will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.
Abstract: Sectored caches have been used for many years in order to reconcile low tag array size and small or medium block size. In a sectored cache, a single address tag is associated with a sector consisting of several cache lines, while validity, dirty and coherency tags are associated with each of the inner cache lines. Maintaining a low tag array size is a major issue in many cache designs (e.g. L2 caches). Using a sectored cache is a design trade-off between a low size of the tag array, which is possible with a large line size, and a low memory traffic, which requires a small line size. This technique has been used in many cache designs including small on-chip microprocessor caches and large external second level caches. Unfortunately, as the miss ratio on a sectored cache is significantly higher on some applications than the miss ratio on a non-sectored cache (factors higher than two are commonly observed), a significant part of the potential performance may be wasted in miss penalties. Usually in a cache, a cache line location is statically linked to one and only one address tag word location. In the decoupled sectored cache we introduce in this paper, this monolithic association is broken; the address tag location associated with a cache line location is dynamically chosen at fetch time among several possible locations. The tag volume on a decoupled sectored cache is in the same range as the tag volume in a traditional sectored cache; but the hit ratio on a decoupled sectored cache is very close to the hit ratio on a non-sectored cache. A decoupled sectored cache will allow the same level of performance as a non-sectored cache, but at a significantly lower hardware cost.

Patent
Konrad K. Lai1
23 Mar 1994
TL;DR: In this paper, a multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized; when a memory access request misses in the primary cache but hits in the secondary cache, the secondary cache responds to the request.
Abstract: A multi-level memory system is provided having a primary cache and a secondary cache in which unnecessary swapping operations are minimized. If a memory access request misses in the primary cache, but hits in the secondary cache, then the secondary cache responds to the request. If, however, the request also misses in the secondary cache, but is found in main memory, then main memory responds to the request. In responding to the request, the secondary cache or main memory returns the requested data to the primary cache. If an address tag of a primary cache victim line does not match an address tag in the secondary cache or the primary cache victim line is dirty, then the victim is stored in the secondary cache. The primary cache victim line includes a first bit for indicating whether the address tag of the primary cache victim line matches an address tag of the secondary cache.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: This paper investigates the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors, and shows that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system.
Abstract: In the near future, semiconductor technology will allow the integration of multiple processors on a chip or multichip-module (MCM). In this paper we investigate the architecture and partitioning of resources between processors and cache memory for single chip and MCM-based multiprocessors. We study the performance of a cluster-based multiprocessor architecture in which processors within a cluster are tightly coupled via a shared cluster cache for various processor-cache configurations. Our results show that for parallel applications, clustering via shared caches provides an effective mechanism for increasing the total number of processors in a system, without increasing the number of invalidations. Combining these results with cost estimates for shared cluster cache implementations leads to two conclusions: 1) For a four cluster multiprocessor with single chip clusters, two processors per cluster with a smaller cache provides higher performance and better cost/performance than a single processor with a larger cache and 2) this four cluster configuration can be scaled linearly in performance by adding processors to each cluster using MCM packaging techniques.

Patent
24 Mar 1994
TL;DR: In this article, the authors present a computing system which includes a memory, an input/output adapter and a processor, and the processor includes a write back cache in which dirty data may be stored.
Abstract: A computing system is presented which includes a memory, an input/output adapter and a processor. The processor includes a write back cache in which dirty data may be stored. When performing a coherent write from the input/output adapter to the memory, a block of data is written from the input/output adapter to a memory location within the memory. The block of data contains less data than a full cache line in the write back cache. The write back cache is searched to determine whether the write back cache contains data for the memory location. When the search determines that the write back cache contains data for the memory location a full cache line which contains the data for the memory location is purged.