
Showing papers on "Cache coloring" published in 1997


Patent
25 Aug 1997
TL;DR: In this article, a unified re-map table in a RAM is used to arbitrarily re-map all logical addresses from a host system to physical addresses of flash-memory devices, and wear-leveling is performed on a block being written when both total and incremental counts exceed system-wide total and incremental thresholds.
Abstract: A flash-memory system provides solid-state mass storage as a replacement for a hard disk. A unified re-map table in a RAM is used to arbitrarily re-map all logical addresses from a host system to physical addresses of flash-memory devices. Each entry in the unified re-map table contains a physical block address (PBA) of the flash memory allocated to the logical address, a cache valid bit, and a cache index. When the cache valid bit is set, the data is read or written to a line in the cache pointed to by the cache index. A separate cache tag RAM is not needed. When the cache valid bit is cleared, the data is read from the flash memory block pointed to by the PBA. Two write count values are stored with the PBA in the table entry. A total-write count indicates a total number of writes to the flash block since manufacture. An incremental-write count indicates the number of writes since the last wear-leveling operation that moved the block. Wear-leveling is performed on a block being written when both total and incremental counts exceed system-wide total and incremental thresholds. The incremental-write count is cleared after a block is wear-leveled, but the total-write count is never cleared. The incremental-write count prevents moving a block again immediately after wear-leveling. The thresholds are adjusted as the system ages to provide even wear.

592 citations
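
As a concrete illustration of the two-counter policy this patent describes, the sketch below (Python, with illustrative thresholds and hypothetical names) shows how a re-map table entry might trigger wear-leveling only when both counts exceed their system-wide thresholds.

```python
TOTAL_THRESHOLD = 10_000       # assumed system-wide total-write threshold
INCREMENTAL_THRESHOLD = 1_000  # assumed system-wide incremental-write threshold

class RemapEntry:
    def __init__(self, pba):
        self.pba = pba               # physical block address in flash
        self.total_writes = 0        # never cleared
        self.incremental_writes = 0  # cleared after wear-leveling

def on_write(entry, wear_level_block):
    """Count a write; wear-level only when BOTH thresholds are exceeded."""
    entry.total_writes += 1
    entry.incremental_writes += 1
    if (entry.total_writes > TOTAL_THRESHOLD and
            entry.incremental_writes > INCREMENTAL_THRESHOLD):
        entry.pba = wear_level_block(entry.pba)  # move data to a less-worn block
        entry.incremental_writes = 0  # prevents an immediate second move
```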


Proceedings ArticleDOI
01 Dec 1997
TL;DR: This work proposes to trade performance for power consumption by filtering cache references through an unusually small L1 cache, and experimental results across a wide range of embedded applications show that the filter cache results in improved memory system energy efficiency.
Abstract: Most modern microprocessors employ one or two levels of on-chip caches in order to improve performance. These caches are typically implemented with static RAM cells and often occupy a large portion of the chip area. Not surprisingly, these caches often consume a significant amount of power. In many applications, such as portable devices, low power is more important than performance. We propose to trade performance for power consumption by filtering cache references through an unusually small L1 cache. An L2 cache, which is similar in size and structure to a typical L1 cache, is positioned behind the filter cache and serves to reduce the performance loss. Experimental results across a wide range of embedded applications show that the filter cache results in improved memory system energy efficiency. For example, a direct mapped 256-byte filter cache achieves a 58% power reduction while reducing performance by 21%, corresponding to a 51% reduction in the energy-delay product over conventional design.

544 citations
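
A minimal simulation sketch of the filter-cache idea, assuming direct-mapped caches, 32-byte lines, and a 256-byte filter in front of an 8 KB second level; the sizes follow the paper's example but the code is an illustration, not the authors' simulator.

```python
LINE = 32                     # bytes per cache line (assumed)
FILTER_LINES = 256 // LINE    # 256-byte direct-mapped filter cache
L2_LINES = 8192 // LINE       # conventional 8 KB cache behind it

filter_tags = [None] * FILTER_LINES
l2_tags = [None] * L2_LINES

def access(addr):
    """Return which level served the reference: 'filter', 'l2' or 'memory'."""
    line = addr // LINE
    f_idx, l2_idx = line % FILTER_LINES, line % L2_LINES
    if filter_tags[f_idx] == line:
        return "filter"            # the common, lowest-energy case
    hit = l2_tags[l2_idx] == line
    l2_tags[l2_idx] = line         # fill L2 on a miss
    filter_tags[f_idx] = line      # refill the tiny filter cache
    return "l2" if hit else "memory"
```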


Patent
25 Sep 1997
TL;DR: In this article, a method for storing a plurality of multimedia objects in a cache memory is described, where first ones of the multimedia objects are written into the cache memory sequentially from the beginning of the cache memory in the order in which they are received.
Abstract: A method for storing a plurality of multimedia objects in a cache memory is described. First ones of the multimedia objects are written into the cache memory sequentially from the beginning of the cache memory in the order in which they are received. When a first memory amount from a most recently stored one of the first multimedia objects to the end of the cache memory is insufficient to accommodate a new multimedia object, the new multimedia object is written from the beginning of the cache memory, thereby writing over a previously stored one of the first multimedia objects. Second ones of the multimedia objects are then written into the cache memory sequentially following the new multimedia object in the order in which they are received, thereby writing over the first ones of the multimedia objects. This cycle is repeated, thereby maintaining a substantially full cache memory.

398 citations
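
A rough sketch of the circular write scheme described above, assuming byte-addressed offsets and a hypothetical CircularObjectCache class; a real multimedia cache would track object metadata differently.

```python
class CircularObjectCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.write_pos = 0
        self.objects = []          # (start, size, key) records

    def store(self, key, size):
        assert size <= self.capacity
        if self.capacity - self.write_pos < size:
            self.write_pos = 0     # wrap: start writing over the oldest data
        start, end = self.write_pos, self.write_pos + size
        # Discard any previously stored object the new one overwrites.
        self.objects = [o for o in self.objects
                        if o[0] + o[1] <= start or o[0] >= end]
        self.objects.append((start, size, key))
        self.write_pos = end       # next object continues sequentially
```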



Patent
07 May 1997
TL;DR: In this article, the predicted-to-be-selected page is added to a local cache of predicted pages in the client, and the client can update the appearance of the link to indicate to the user that the page represented by that link is available in the local cache.
Abstract: A computer, e.g. a server or a computer operated by a network provider, sends one or more requesting computers (clients) a most likely predicted-to-be-selected (predicted) page of information by determining a preference factor for this page based on one or more pages that are requested by the client. This page is added to a local cache of predicted-to-be-selected pages in the client. Once the predicted-to-be-selected page is in the cache, the client can update the appearance of the link (i.e. by changing the color or otherwise changing the appearance of the link indicator) to indicate to the user that the page represented by that link is available in the local cache.

250 citations


Patent
Hiroshi Sukegawa
14 Mar 1997
TL;DR: In this article, the storage area of the flash memory unit is logically divided into a permanent storage area and a non-volatile cache area, which are used as cache memory areas of the HDD, and a high-speed access area.
Abstract: In a data storage system using a flash memory unit and an HDD, the storage area of the flash memory unit is logically divided into a permanent storage area and a non-volatile cache area, which are used as cache memory areas of the HDD, and a high-speed access area. These divided areas are individually managed. The permanent storage area stores data which is used frequently for a relatively long time period. The non-volatile cache area is used as an ordinary cache memory area in which data, which is updated relatively frequently, is stored. The high-speed access area is a storage area to be used by, e.g., the operating system (OS) of a host system. For example, a swap file, which needs to be accessed at high speed, is shifted into the high-speed access area.

224 citations


Proceedings ArticleDOI
09 Jun 1997
TL;DR: An OS-controlled, application-transparent cache-partitioning technique is described whose resulting partitions can be transparently assigned to tasks for their exclusive use; a filter algorithm, a matrix-multiplication algorithm, and the interaction of both are analysed with regard to cache-induced worst-case penalties.
Abstract: Cache-partitioning techniques have been invented to make modern processors with an extensive cache structure useful in real-time systems where task switches disrupt cache working sets and hence make execution times unpredictable. This paper describes an OS-controlled application-transparent cache-partitioning technique. The resulting partitions can be transparently assigned to tasks for their exclusive use. The major drawbacks found in other cache-partitioning techniques, namely waste of memory and additions on the critical performance path within CPUs, are avoided using memory coloring techniques that do not require changes within the chips of modern CPUs or on the critical path for performance. A simple filter algorithm commonly used in real-time systems, a matrix-multiplication algorithm and the interaction of both are analysed with regard to cache-induced worst case penalties. Worst-case penalties are determined for different widely-used cache architectures. Some insights regarding the impact of cache architectures on worst-case execution are described.

224 citations
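
The memory-coloring idea underlying such OS-controlled partitioning can be sketched as follows; the cache geometry and the allocator interface are illustrative assumptions, not the paper's implementation.

```python
PAGE_SIZE = 4096
CACHE_SIZE = 64 * 1024
WAYS = 1                                    # direct-mapped for simplicity
COLORS = CACHE_SIZE // (WAYS * PAGE_SIZE)   # 16 page colors here

def color_of(frame_number):
    return frame_number % COLORS

def allocate_frame(free_frames, task_colors):
    """Pick a free physical frame whose color lies in the task's partition."""
    for frame in free_frames:
        if color_of(frame) in task_colors:
            free_frames.remove(frame)
            return frame
    raise MemoryError("no frame of an allowed color is free")

# Example: task A gets colors {0..7}, task B gets {8..15}; their working
# sets then occupy disjoint halves of the cache and cannot evict each other.
```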


Proceedings ArticleDOI
11 Jul 1997
TL;DR: In this article, the authors describe methods for generating and solving cache miss equations that give a detailed representation of the cache misses in loop-oriented scientific code, which can be used to guide code optimizations for improving cache performance.
Abstract: With the widening performance gap between processors and main memory, efficient memory accessing behavior is necessary for good program performance. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve memory performance. Effective transformations require detailed knowledge about the frequency and causes of cache misses in the code. This paper describes methods for generating and solving Cache Miss Equations that give a detailed representation of the cache misses in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. Mathematical techniques for manipulating Diophantine equations allow us to compute the number of possible solutions, where each solution corresponds to a potential cache miss. These equations provide a general framework to guide code optimizations for improving cache performance. The paper gives examples of their use to determine array padding and offset amounts that minimize cache misses, and also to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that is more precise than traditional memory behavior heuristics, and is also potentially faster than simulation.

205 citations
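
The equations themselves require the paper's analytical machinery, but their meaning can be illustrated by brute force: for a toy loop accessing a[i] and b[i], the sketch below counts the direct-mapped conflicts that one class of Cache Miss Equations characterizes, and shows the padding effect the paper derives analytically. All cache parameters are assumptions.

```python
LINE = 32          # bytes per cache line
SETS = 256         # direct-mapped cache: capacity = LINE * SETS
ELEM = 8           # sizeof(double)

def conflicts(base_a, base_b, n):
    """Count iterations where a[i] and b[i] map to the same cache set."""
    count = 0
    for i in range(n):
        set_a = ((base_a + i * ELEM) // LINE) % SETS
        set_b = ((base_b + i * ELEM) // LINE) % SETS
        if set_a == set_b:
            count += 1
    return count

# Padding one array by a single line removes every conflict in this loop:
# conflicts(0, SETS * LINE, 1024) == 1024, but
# conflicts(0, SETS * LINE + LINE, 1024) == 0.
```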


Proceedings Article
Arun Iyengar, Jim Challenger
08 Dec 1997
TL;DR: The DynamicWeb cache is analyzed; it achieves near-optimal performance on systems which invoke server programs via CGI, and near-optimal performance in many cases, with 58% of optimal performance in the worst case, on a system using the lower-overhead ICAPI interface.
Abstract: Dynamic Web pages can seriously reduce the performance of Web servers. One technique for improving performance is to cache dynamic Web pages. We have developed the DynamicWeb cache, which is particularly well-suited for dynamic pages. Our cache has improved performance significantly at several commercial Web sites. This paper analyzes the design and performance of the DynamicWeb cache. It also presents a model for analyzing overall system performance in the presence of caching. Our cache can satisfy several hundred requests per second. On systems which invoke server programs via CGI, the DynamicWeb cache results in near-optimal performance, where optimal performance is that which would be achieved by a hypothetical cache which consumed no CPU cycles. On a system we tested which invoked server programs via ICAPI, which has significantly less overhead than CGI, the DynamicWeb cache resulted in near-optimal performance for many cases and 58% of optimal performance in the worst case. The DynamicWeb cache achieved a hit rate of around 80% when it was deployed to support the official Internet Web site for the 1996 Atlanta Olympic Games.

201 citations


Proceedings ArticleDOI
05 Jan 1997
TL;DR: In this article, the effect of cache misses on the performance of sorting algorithms was investigated both experimentally and analytically, and it was shown that radix sort's relatively poor cache performance results in worse overall performance than the efficient comparison-based sorting algorithms, despite its extremely low instruction count.
Abstract: We investigate the effect that caches have on the performance of sorting algorithms both experimentally and analytically. To address the performance problems that high cache miss penalties introduce, we restructure mergesort, quicksort, and heapsort in order to improve their cache locality. For all three algorithms the improvement in cache performance leads to a reduction in total execution time. We also investigate the performance of radix sort. Despite the extremely low instruction count incurred by this linear time sorting algorithm, its relatively poor cache performance results in worse overall performance than the efficient comparison-based sorting algorithms. For each algorithm we provide an analysis that closely predicts the number of cache misses incurred by the algorithm.

200 citations
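
One way the cache-locality restructuring can be illustrated (not the paper's exact algorithms) is to sort cache-sized runs first and then merge them in a single streaming pass; the run size is an assumed tuning parameter.

```python
import heapq

CACHE_BYTES = 32 * 1024         # assumed cache size
RUN = CACHE_BYTES // 8          # elements per cache-sized run (8-byte keys)

def cache_conscious_sort(data):
    # Phase 1: each run fits in the cache, so sorting it incurs few misses.
    runs = [sorted(data[i:i + RUN]) for i in range(0, len(data), RUN)]
    # Phase 2: a single multiway merge streams every run sequentially.
    return list(heapq.merge(*runs))
```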


Proceedings ArticleDOI
01 May 1997
TL;DR: A technique for dynamic analysis of program data access behavior is presented, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner and is fully compatible with existing Instruction Set Architectures.
Abstract: Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups.
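
A hypothetical sketch of the macroblock and Memory Address Table concepts; the macroblock size, table layout, and placement heuristic are all assumptions for illustration.

```python
MACROBLOCK = 2048                # bytes per macroblock (assumed)

class MemoryAddressTable:
    def __init__(self):
        self.counts = {}         # macroblock id -> reference count

    def record(self, addr):
        mb = addr // MACROBLOCK
        self.counts[mb] = self.counts.get(mb, 0) + 1

    def placement_hint(self, addr, hot_threshold=64):
        """Keep data from hot macroblocks close; let cold data bypass."""
        hot = self.counts.get(addr // MACROBLOCK, 0) >= hot_threshold
        return "L1" if hot else "bypass"
```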

Patent
29 Dec 1997
TL;DR: In this article, the cache directory structure is used for defining the name of each configured central cache system and for providing an index value identifying the particular set of descriptors associated therewith.
Abstract: A host system includes a multicache system configured within the host system's memory which has a plurality of local and central cache systems used for storing information being utilized by a plurality of processes running on the system. Persistent shared memory is used to store control structure information entries required for operating central cache systems for substantially long periods of time in conjunction with the local caches established for the processes. Such entries include a descriptor value for identifying a directory control structure and individual sets of descriptors for identifying a group of control structures defining those components required for operating the configured central cache systems. The cache directory structure is used for defining the name of each configured central cache system and for providing an index value identifying the particular set of descriptors associated therewith. The multicache system also includes a plurality of interfaces for configuring the basic characteristics of both local and central cache systems as a function of the type and performance requirements of application processes being run.

Proceedings ArticleDOI
01 May 1997
TL;DR: The use of texture image caches is proposed to alleviate these bottlenecks, and the results indicate that caching is a promising approach to designing memory systems for texture mapping.
Abstract: The effectiveness of texture mapping in enhancing the realism of computer generated imagery has made support for real-time texture mapping a critical part of graphics pipelines. Despite a recent surge in interest in three-dimensional graphics from computer architects, high-quality high-speed texture mapping has so far been confined to costly hardware systems that use brute-force techniques to achieve high performance. One obstacle faced by designers of texture mapping systems is the requirement of extremely high bandwidth to texture memory. High bandwidth is necessary since there are typically tens to hundreds of millions of accesses to texture memory per second. In addition, to achieve the high clock rates required in graphics pipelines, low-latency access to texture memory is needed. In this paper, we propose the use of texture image caches to alleviate the above bottlenecks, and evaluate various tradeoffs that arise in such designs. We find that the factors important to cache behavior are (i) the representation of texture images in memory, (ii) the rasterization order on screen and (iii) the cache organization. Through a detailed investigation of these issues, we explore the best way to exploit locality of reference and determine whether this technique is robust with respect to different scenes and different amounts of texture. Overall, we observe that there is a significant amount of temporal and spatial locality and that the working set sizes are relatively small (at most 16KB) across all cases that we studied. Consequently, the memory bandwidth requirements of a texture cache system are substantially lower (at least three times and as much as fifteen times) than the memory bandwidth requirements of a system which achieves equivalent performance but does not utilize a cache. These results are very encouraging and indicate that caching is a promising approach to designing memory systems for texture mapping.
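
The first factor, the in-memory representation of textures, can be illustrated with a tiled (blocked) addressing scheme; the tile size below is an assumption, and the functions are illustrative rather than taken from the paper.

```python
TILE = 4  # 4x4 texels per tile; with 1-byte texels, roughly one cache line

def raster_offset(x, y, width):
    return y * width + x

def tiled_offset(x, y, width):
    tiles_per_row = width // TILE          # width assumed divisible by TILE
    tile_id = (y // TILE) * tiles_per_row + (x // TILE)
    return tile_id * TILE * TILE + (y % TILE) * TILE + (x % TILE)

# Bilinear filtering reads texels (x..x+1, y..y+1): in tiled order the four
# texels usually share one tile (one line); in raster order they always span
# at least two lines that are a full row apart and may conflict.
```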

Proceedings ArticleDOI
01 Dec 1997
TL;DR: This work revisits memory hierarchy design viewing memory as an inter-operation communication agent and uses data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access.
Abstract: We revisit memory hierarchy design viewing memory as an inter-operation communication agent. This perspective leads to the development of novel methods of performing inter-operation memory communication. We use data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access. We also use data dependence prediction to convert DEF-store-load-USE chains within the instruction window into DEF-USE chains prior to address calculation and disambiguation. We use true and output data dependence status prediction to introduce and manage a small storage structure called the transient value cache (TVC). The TVC captures memory values that are short-lived. It also captures recently stored values that are likely to be accessed soon. Accesses that are serviced by the TVC do not have to be serviced by other parts of the memory hierarchy, e.g., the data cache. The first two techniques are aimed at reducing the effective communication latency whereas the last technique is aimed at reducing data cache bandwidth requirements. Experimental analysis of the proposed techniques shows that: the proposed speculative communication methods correctly handle a large fraction of memory dependences; and a large number of the loads and stores do not have to ever reach the data cache when the TVC is in place.

Proceedings ArticleDOI
09 Jun 1997
TL;DR: Results of incorporating instruction cache predictions within pipeline simulation show that timing predictions for set-associative caches remain just as tight as predictions for direct-mapped caches.
Abstract: The contributions of this paper are twofold. First, an automatic tool-based approach is described to bound worst-case data cache performance. The given approach works on fully optimized code, performs the analysis over the entire control flow of a program, detects and exploits both spatial and temporal locality within data references, produces results typically within a few seconds, and estimates, on average, 30% tighter WCET bounds than can be predicted without analyzing data cache behavior. Results obtained by running the system on representative programs are presented and indicate that timing analysis of data cache behavior can result in significantly tighter worst-case performance predictions. Second, a framework to bound worst-case instruction cache performance for set-associative caches is formally introduced and operationally described. Results of incorporating instruction cache predictions within pipeline simulation show that timing predictions for set-associative caches remain just as tight as predictions for direct-mapped caches. The cache simulation overhead scales linearly with increasing associativity.

Proceedings ArticleDOI
11 Jul 1997
TL;DR: It is shown that for an 8 Kbyte data cache, XOR-mapping schemes approximately halve the miss ratio for two-way associative and column-associative organizations, and XOR-mapping schemes provide a very significant reduction in the miss ratio for the other cache organizations, including the direct-mapped cache.
Abstract: This paper makes the case for the use of XOR-based placement functions for cache memories. It shows that these XOR-mapping schemes can eliminate many conflict misses for direct-mapped and victim caches and practically all of them for (pseudo) two-way associative organizations. The paper evaluates the performance of XOR-mapping schemes for a number of different cache organizations: direct-mapped, set-associative, victim, hash-rehash, column-associative and skewed-associative. It also proposes novel replacement policies for some of these cache organizations. In particular, it presents a low-cost implementation of a pure LRU replacement policy which demonstrates a significant improvement over the pseudo-LRU replacement previously proposed. The paper shows that for an 8 Kbyte data cache, XOR-mapping schemes approximately halve the miss ratio for two-way associative and column-associative organizations. Skewed-associative caches, which already make use of XOR-mapping functions, can benefit from the LRU replacement and also from the use of more sophisticated mapping functions. For two-way associative, column-associative and two-way skewed-associative organizations, XOR-mapping schemes achieve a miss ratio that is not higher than 1.10 times that of a fully-associative cache. XOR-mapping schemes also provide a very significant reduction in the miss ratio for the other cache organizations, including the direct-mapped cache. Ultimately, the conclusion of this study is that XOR-based placement functions unequivocally provide highly significant performance benefits to most cache organizations.
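
A minimal sketch of an XOR-based placement function of the kind evaluated here, with an assumed cache geometry; it shows why power-of-two strides that pathologically conflict under conventional indexing spread across sets under XOR mapping.

```python
SETS = 128                      # sets per way; a power of two (assumed)
SET_BITS = SETS.bit_length() - 1

def conventional_index(block_addr):
    return block_addr % SETS

def xor_index(block_addr):
    low = block_addr & (SETS - 1)                 # the usual index bits
    high = (block_addr >> SET_BITS) & (SETS - 1)  # low-order tag bits
    return low ^ high

# A stride of SETS blocks maps everything to set 0 conventionally, but
# xor_index spreads it over all sets:
#   {conventional_index(i * SETS) for i in range(SETS)} == {0}
#   len({xor_index(i * SETS) for i in range(SETS)}) == SETS
```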

Patent
31 Jul 1997
TL;DR: In this article, a cache-extension disk region is used to expand the size of a log structured cache whose cache memory region is partitioned into write cache segments and redundancy-data (parity) cache segments.
Abstract: Method and apparatus for accelerating write operations by logging write requests in a log structured cache and by expanding the log structured cache using a cache-extension disk region. The log structured cache includes a cache memory region partitioned into one or more write cache segments and one or more redundancy-data (parity) cache segments. The cache-extension disk region is a portion of a disk array separate from a main disk region. The cache-extension disk region is also partitioned into segments and is used to extend the size of the log structured cache. The main disk region is instead managed in accordance with storage management techniques (e.g., RAID storage management). The write cache is partitioned into multiple write cache segments so that when one is full another can be used to handle new write requests. When one of these multiple write cache segments is filled, it is moved to the cache-extension disk region, thereby freeing the write cache segment for reuse. The redundancy-data (parity) cache segment holds redundancy data for recent write requests, thereby assuring integrity of the logged write request data in the log structured cache.

Proceedings ArticleDOI
01 May 1997
TL;DR: This paper presents a link-time procedure mapping algorithm which can significantly improve the effectiveness of the instruction cache and produces an improved program layout by performing a color mapping of procedures to cache lines, taking into consideration the procedure size, cache size, cache line size, and call graph.
Abstract: As the gap between memory and processor performance continues to widen, it becomes increasingly important to exploit cache memory effectively. Both hardware and software approaches can be explored to optimize cache performance. Hardware designers focus on cache organization issues, including replacement policy, associativity, line size and the resulting cache access time. Software writers use various optimization techniques, including software prefetching, data scheduling and code reordering. Our focus is on improving memory usage through code reordering compiler techniques. In this paper we present a link-time procedure mapping algorithm which can significantly improve the effectiveness of the instruction cache. Our algorithm produces an improved program layout by performing a color mapping of procedures to cache lines, taking into consideration the procedure size, cache size, cache line size, and call graph. We use cache line coloring to guide the procedure mapping, indicating which cache lines to avoid when placing a procedure in the program layout. Our algorithm reduces on average the instruction cache miss rate by 40% over the original mapping and by 17% over the mapping algorithm of Pettis and Hansen [12].
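
A simplified sketch of cache-line coloring for procedure placement; the cache geometry, the greedy offset search, and the gap-insertion strategy are assumptions that only approximate the published algorithm.

```python
LINE = 32
CACHE_LINES = 8192 // LINE      # 8 KB direct-mapped I-cache (assumed)

def colors_used(start, size):
    """Set of cache lines ('colors') a procedure occupies at this offset."""
    first = start // LINE
    nlines = -(-size // LINE)   # ceiling division
    return {(first + i) % CACHE_LINES for i in range(nlines)}

def place(size, avoid_colors, cursor):
    """Greedily find an offset at or after cursor avoiding the given colors."""
    offset = cursor
    for _ in range(CACHE_LINES):
        if not (colors_used(offset, size) & avoid_colors):
            return offset
        offset += LINE          # slide one line forward, leaving a gap
    return cursor               # unavoidable conflict: fall back

# A layout driver would walk the call graph's hottest edges, placing each
# procedure so its colors avoid those of its frequent callers and callees.
```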

Patent
10 Apr 1997
TL;DR: In this paper, a scalable distributed caching system on a network receives a request for a data object from a user and carries out a locator function that locates a directory cache for the object.
Abstract: A scalable distributed caching system on a network receives a request for a data object from a user. The caching system carries out a locator function that locates a directory cache for the object. The directory cache stores a directory list that identifies the locations of object caches that purport to store copies of the object requested by the user. The object caches on the object directory list are polled, and in response send messages to the cache that received the user request indicating if each object cache stores a copy of the requested object. The receiving cache sends a message requesting a copy of the object to the object cache that sent the message first received by the receiving cache indicating that an object cache stores the requested object. The object cache that sent the first received message then sends a copy of the object to the receiving cache, which stores a copy and then sends a copy to the user. The directory list for the object is then updated by adding the network address of the receiving cache. Outdated copies of objects stored on object caches are deleted in a distributed fashion to maintain the coherence of the cached copies. This is further reinforced by the association of time-to-live parameters with each copy and each object cache address on directory lists.

Patent
30 Sep 1997
TL;DR: In this article, a central cache controller performs RAID management functions on behalf of the plurality of storage controllers including redundancy information (parity) generation and checking as well as RAID geometry (striping) management.
Abstract: Apparatus and methods which allow multiple storage controllers sharing access to common data storage devices in a data storage subsystem to access a centralized intelligent cache. The intelligent central cache provides substantial processing for storage management functions. In particular, the central cache of the present invention performs RAID management functions on behalf of the plurality of storage controllers including, for example, redundancy information (parity) generation and checking as well as RAID geometry (striping) management. The plurality of storage controllers (also referred to herein as RAID controllers) transmit cache requests to the central cache controller. The central cache controller performs all operations related to storing supplied data in cache memory as well as posting such cached data to the storage array as required. The storage controllers are significantly simplified because the present invention obviates the need for duplicative local cache memory on each of the plurality of storage controllers. The storage subsystem of the present invention obviates the need for inter-controller communication for purposes of synchronizing local cache contents of the storage controllers. The storage subsystem of the present invention offers improved scalability in that the storage controllers are simplified as compared to those of prior designs. Addition of storage controllers to enhance subsystem performance is less costly than prior designs. The central cache controller may include a mirrored cache controller to enhance redundancy of the central cache controller. Communication between the cache controller and its mirror is performed over a dedicated communication link.

Patent
17 Jan 1997
TL;DR: In this article, the authors proposed allocation circuitry for a set-associative cache memory that allocates entries of a common set in a branch prediction table (BPT) to branch prediction information for related branch instructions.
Abstract: Allocation circuitry for allocating entries within a set-associative cache memory is disclosed. The set-associative cache memory comprises N ways, each way having M entries and corresponding entries in each of the N ways constituting a set of entries. The allocation circuitry has a first circuit which identifies related data units by identifying a probability that the related data units may be successively read from the cache memory. A second circuit within the allocation circuitry allocates the corresponding entries in each of the ways to the related data units, so that related data units are stored in a common set of entries. Accordingly, the related data units will be simultaneously outputted from the set-associative cache memory, and are thus concurrently available for processing. The invention may find application in allocating entries of a common set in a branch prediction table (BPT) to branch prediction information for related branch instructions.

Proceedings Article
08 Dec 1997
TL;DR: Trace-driven simulation of this mechanism on two large, independent data sets shows that PCV both provides stronger cache coherency and reduces the request traffic in comparison to the time-to-live (TTL) based techniques currently used.
Abstract: This paper presents work on piggyback cache validation (PCV), which addresses the problem of maintaining cache coherency for proxy caches. The novel aspect of our approach is to capitalize on requests sent from the proxy cache to the server to improve coherency. In the simplest case, whenever a proxy cache has a reason to communicate with a server it piggybacks a list of cached, but potentially stale, resources from that server for validation. Trace-driven simulation of this mechanism on two large, independent data sets shows that PCV both provides stronger cache coherency and reduces the request traffic in comparison to the time-to-live (TTL) based techniques currently used. Specifically, in comparison to the best TTL-based policy, the best PCV-based policy reduces the number of request messages from a proxy cache to a server by 16-17% and the average cost (considering response latency, request messages and bandwidth) by 6-8%. Moreover, the best PCV policy reduces the staleness ratio by 57-65% in comparison to the best TTL-based policy. Additionally, the PCV policies can easily be implemented within the HTTP 1.1 protocol.
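
A hypothetical sketch of the piggybacking mechanism; the message format (the validate list and the still_valid/invalid reply fields) is invented for illustration and is not the paper's protocol encoding.

```python
import time

class ProxyCache:
    def __init__(self, ttl=300):
        self.entries = {}       # url -> (body, last_validated_time)
        self.ttl = ttl

    def stale_urls(self, server):
        now = time.time()
        return [u for u, (_, v) in self.entries.items()
                if u.startswith(server) and now - v > self.ttl]

    def fetch(self, server, url, send_request):
        # Piggyback the validation list on a request we must send anyway.
        reply = send_request(url, validate=self.stale_urls(server))
        now = time.time()
        for u in reply["still_valid"]:        # revalidate without refetching
            self.entries[u] = (self.entries[u][0], now)
        for u in reply["invalid"]:
            self.entries.pop(u, None)         # drop stale copies
        self.entries[url] = (reply["body"], now)
        return reply["body"]
```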

Journal ArticleDOI
01 Sep 1997
TL;DR: This paper presents a new, delay-conscious cache replacement algorithm LNC-R-W3 which maximizes a performance metric called delay-savings-ratio and compares it with other existing cache replacement algorithms, namely LRU and LRU-MIN.
Abstract: Caching at proxy servers plays an important role in reducing the latency of the user response, the network delays and the load on Web servers. The cache performance depends critically on the design of the cache replacement algorithm. Unfortunately, most cache replacement algorithms ignore the Web's scale. In this paper we argue for the design of delay-conscious cache replacement algorithms which explicitly consider the Web's scale by preferentially caching documents which require a long time to fetch to the cache. We present a new, delay-conscious cache replacement algorithm LNC-R-W3 which maximizes a performance metric called delay-savings-ratio. Subsequently, we test the performance of LNC-R-W3 experimentally and compare it with the performance of other existing cache replacement algorithms, namely LRU and LRU-MIN.
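
A heavily simplified sketch in the spirit of LNC-R-W3: documents are ranked by a profit density that weighs reference rate and fetch delay against size, and the lowest-density documents are evicted first. The actual LNC-R-W3 formula estimates reference rates from a sliding window of past reference times; that refinement is omitted here.

```python
def profit_density(ref_rate, fetch_delay, size):
    # Keep documents that are referenced often and expensive to refetch,
    # normalized per byte of cache space they occupy.
    return (ref_rate * fetch_delay) / size

def evict(cache, need_bytes):
    """cache: url -> (ref_rate, fetch_delay, size). Evict lowest density first."""
    victims = sorted(cache, key=lambda u: profit_density(*cache[u]))
    freed = []
    for url in victims:
        if need_bytes <= 0:
            break
        need_bytes -= cache[url][2]   # size of the evicted document
        freed.append(url)
        del cache[url]
    return freed
```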

Patent
03 Jul 1997
TL;DR: An apparatus for increased data access in a network includes a file/object server computer having a permanent storage memory, and a cache verifying computer operably connected to the file/object server computer in a manner to form a network for rapidly transferring data, as discussed by the authors.
Abstract: An apparatus for increased data access in a network includes: a file/object server computer having a permanent storage memory; a cache verifying computer operably connected to the file/object server computer in a manner to form a network for rapidly transferring data, the cache verifying computer having an operating system, a first memory and a processor capable of performing an operation on data stored in the permanent storage memory of the file/object server computer to produce a signature of the data characteristic of one of a file, an object and a directory; a remote client computer having an operating system, a first memory, a cache memory and a processor capable of performing an operation on data stored in the cache memory to produce a signature of the data; a communication server operably connecting the remote client computer to the cache verifying computer and the file/object server computer; and comparators operably associated with the cache verifying computer and remote client computer for comparing the signatures of data with one another to determine whether the signature of data of the remote client is valid.

Patent
26 Mar 1997
TL;DR: In this paper, the cache coherency attribute information is used to define a limitable cache coherent area to maintain data consistency among caches, and a processor memory interface unit includes a cache-coherency control which identifies whether cache coherency is required only within a particular cluster of processors or is required for every one of the cache memories in every cluster throughout the system.
Abstract: To provide a large scale multiprocessor system capable of executing an area-limited cache coherency control implementing a high speed operation while substantially reducing the amount of processor-to-processor communications, there is provided a translation lookaside buffer which retains cache coherency attribute information defining a limitable cache coherent area to maintain data consistency among caches, and a processor memory interface unit includes a cache coherency control which identifies whether cache coherency is required only within a particular cluster of processors or is required for every one of the cache memories in every one of the clusters throughout the system, on the basis of the contents of the cache coherency attribute information. Further, in another version of the large scale multiprocessor system, each cluster may be provided with an export directory which registers an identifier of data whose copy is cached in cache memories in other clusters. Thereby, latency in cache coherency procedures can be reduced greatly, since a cache coherent area can be limited in dependence on various characteristics of data. Further, it is also possible to greatly reduce inter-cluster communication quantities, since it is no longer necessary to broadcast to all processors in the system upon every occasion of a memory read/write.

Patent
27 Mar 1997
TL;DR: In this paper, sideband signals are used to overlay advanced mechanisms for cache attribute mapping, cache consistency cycles, and dual processor support onto a high speed peripheral bus, and several new signals and an associated protocol for support of dual processors are presented.
Abstract: Memory bus extensions to a high speed peripheral bus are presented. Specifically, sideband signals are used to overlay advanced mechanisms for cache attribute mapping, cache consistency cycles, and dual processor support onto a high speed peripheral bus. In the case of cache attribute mapping, three cache memory attribute signals that have been supported in previous processors and caches are replaced by two cache attribute signals that maintain all the functionality of the three original signals. In the case of cache consistency cycles, advanced modes of operation are presented. These include support of fast writes, the discarding of write back data by a cache for full cache line writes, and read intervention that permits a cache to supply data in response to a memory read. In the case of dual processor support, several new signals and an associated protocol for support of dual processors are presented. Specific support falls into three areas: the extension of snooping to support multiple caches, the support of shared data between the two processors, and the provision of a processor and upgrade arbitration protocol that permits dual processors to share a single grant signal line.

Patent
21 Aug 1997
TL;DR: In this article, the problem of a synonym, which arises when the same physical address is assigned to different logical addresses, is solved in such a manner that the number of TLB accesses is halved as compared with the conventional arrangement.
Abstract: Physical page information PA(a) corresponding to logical page information VA(a) as a cache tag address is retained in a logical cache memory 10 and, in the event of a cache miss when a shared area is accessed, the physical page information PA(a) retained in the cache memory is compared with physical page information PA(b) resulting from the translation of a search address by the TLB. When the comparison results in a match, the cache entry is processed as a cache hit, so that the problem of a synonym arising from a case where the same physical address is assigned to different logical addresses is solved in such a manner that the number of TLB accesses is halved as compared with the conventional arrangement.

Patent
07 Mar 1997
TL;DR: In this paper, cache misses occur simultaneously on two or more ports of a multi-port cache, different replacement sets are selected for different ports through different write ports, and the replacements are performed simultaneously through different read ports.
Abstract: When cache misses occur simultaneously on two or more ports of a multi-port cache, different replacement sets are selected for different ports. The replacements are performed simultaneously through different write ports. In some embodiments, every set has its own write ports. The tag memory of every set has its own write port. In addition, the tag memory of every set has several read ports, one read port for every port of the cache. For every cache entry, a tree data structure is provided to implement a tree replacement policy (for example, a tree LRU replacement policy). If only one cache miss occurred, the search for the replacement set is started from the root of the tree. If multiple cache misses occurred simultaneously, the search starts at a tree level that has at least as many nodes as the number of cache misses. For each cache miss, a separate node is selected at that tree level, and the search for the respective replacement set starts at the selected node.

Patent
22 Jul 1997
TL;DR: In this paper, the cache is partitioned into sections based on class attributes; cache classes are ranked in a hierarchy, and target entries having higher-ranked attributes may be entered into cache sections corresponding to lower-ranked attributes.
Abstract: A cache memory system for a computer. Target entries for the cache memory include a class attribute. The cache may use a different replacement algorithm for each possible class attribute value. The cache may be partitioned into sections based on class attributes. Class attributes may indicate a relative likelihood of future use. Alternatively, class attributes may be used for locking. In one embodiment, each cache section is dedicated to one corresponding class. In alternative embodiments, cache classes are ranked in a hierarchy, and target entries having higher ranked attributes may be entered into cache sections corresponding to lower ranked attributes. With each of the embodiments, entries with a low likelihood of future use or low temporal locality are less likely to flush entries from the cache that have a higher likelihood of future use.

Proceedings ArticleDOI
01 Dec 1997
TL;DR: An algorithm for procedure placement, one type of code-placement algorithm, is described that differs significantly from previous approaches in the type of information used to drive placement: it gathers temporal ordering information that summarizes the interleaving of procedures in a program trace.
Abstract: Instruction cache performance is very important to instruction fetch efficiency and overall processor performance. The layout of an executable has a substantial effect on the cache miss rate during execution. This means that the performance of an executable can be improved significantly by applying a code-placement algorithm that minimizes instruction cache conflicts. We describe an algorithm for procedure placement, one type of code-placement algorithm, that significantly differs from previous approaches in the type of information used to drive the placement algorithm. In particular, we gather temporal ordering information that summarizes the interleaving of procedures in a program trace. Our algorithm uses this information along with cache configuration and procedure size information to better estimate the conflict cost of a potential procedure ordering. We compare the performance of our algorithm with previously published procedure-placement algorithms and show noticeable improvements in the instruction cache behavior.
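
A sketch of what such temporal ordering information might look like in practice: for each pair of procedures, count how often they appear close together in the trace. The sliding-window approximation and the window size are assumptions, not the authors' exact summary.

```python
from collections import Counter, deque

def temporal_affinity(trace, window=8):
    """trace: procedure ids in execution order -> pairwise affinity counts."""
    recent = deque(maxlen=window)
    affinity = Counter()
    for proc in trace:
        for other in set(recent):
            if other != proc:
                affinity[frozenset((proc, other))] += 1
        recent.append(proc)
    return affinity   # high-affinity pairs should get non-conflicting lines
```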