
Showing papers on "Cache published in 1990"


Proceedings ArticleDOI
01 May 1990
TL;DR: In this article, hardware techniques to improve the performance of caches are presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in a stream buffer rather than in the cache.
Abstract: Projections of computer technology forecast processors with peak performance of 1,000 MIPS in the relatively near future. These processors could easily lose half or more of their performance in the memory hierarchy if the hierarchy design is based on conventional caching techniques. This paper presents hardware techniques to improve the performance of caches. Miss caching places a small fully-associative cache between a cache and its refill path. Misses in the cache that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache. Small miss caches of 2 to 5 entries are shown to be very effective in removing mapping conflict misses in first-level direct-mapped caches. Victim caching is an improvement to miss caching that loads the small fully-associative cache with the victim of a miss and not the requested line. Small victim caches of 1 to 5 entries are even more effective at removing conflict misses than miss caching. Stream buffers prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer and not in the cache. Stream buffers are useful in removing capacity and compulsory cache misses, as well as some instruction cache conflict misses. Stream buffers are more effective than previously investigated prefetch techniques at using the next slower level in the memory hierarchy when it is pipelined. An extension to the basic stream buffer, called multi-way stream buffers, is introduced. Multi-way stream buffers are useful for prefetching along multiple intertwined data reference streams. Together, victim caches and stream buffers reduce the miss rate of the first level in the cache hierarchy by a factor of two to three on a set of six large benchmarks.

1,481 citations
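
The victim-caching idea above is small enough to simulate directly. The sketch below models a direct-mapped cache backed by a tiny fully-associative victim cache; the sizes, the LRU policy, and all names are illustrative assumptions rather than details from the paper.

```python
from collections import OrderedDict

class VictimCache:
    """Direct-mapped L1 backed by a small fully-associative victim cache.

    On an L1 miss that hits in the victim cache, the two lines are
    swapped (the one-cycle penalty case); on a full miss, the evicted
    L1 line (the "victim") goes into the victim cache.
    """

    def __init__(self, l1_lines=64, victim_entries=4):
        self.l1 = [None] * l1_lines      # one tag per direct-mapped set
        self.victims = OrderedDict()     # fully associative, LRU order
        self.victim_entries = victim_entries

    def access(self, addr):
        index, tag = addr % len(self.l1), addr // len(self.l1)
        if self.l1[index] == tag:
            return "l1_hit"
        if (tag, index) in self.victims:         # swap with the L1 line
            del self.victims[(tag, index)]
            if self.l1[index] is not None:
                self._insert_victim(self.l1[index], index)
            self.l1[index] = tag
            return "victim_hit"                  # one-cycle miss penalty
        if self.l1[index] is not None:           # full miss: save the victim
            self._insert_victim(self.l1[index], index)
        self.l1[index] = tag
        return "miss"

    def _insert_victim(self, tag, index):
        self.victims[(tag, index)] = True
        if len(self.victims) > self.victim_entries:
            self.victims.popitem(last=False)     # evict the LRU victim

# Two addresses that conflict in the direct-mapped cache would ping-pong
# forever without the victim cache; with it, only the first two accesses miss:
c = VictimCache()
print([c.access(a) for a in (0, 64, 0, 64, 0)])
# ['miss', 'miss', 'victim_hit', 'victim_hit', 'victim_hit']
```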


Patent
30 Mar 1990
TL;DR: In this paper, the authors proposed selective multiple sector erase, in which any combination of Flash sectors may be erased together, and sectors within the selected combination may also be de-selected during the erase operation.
Abstract: A system of Flash EEprom memory chips with controlling circuits serves as non-volatile memory such as that provided by magnetic disk drives. Improvements include selective multiple sector erase, in which any combination of Flash sectors may be erased together. Sectors within the selected combination may also be de-selected during the erase operation. Another improvement is the ability to remap and replace defective cells with substitute cells. The remapping is performed automatically as soon as a defective cell is detected. When the number of defects in a Flash sector becomes large, the whole sector is remapped. Yet another improvement is the use of a write cache to reduce the number of writes to the Flash EEprom memory, thereby minimizing the stress to the device from undergoing too many write/erase cycles.

1,279 citations
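
The write-cache improvement lends itself to a similar miniature. The sketch below coalesces repeated writes to the same Flash sector in RAM and issues a physical write only on eviction or an explicit flush; the capacity and the FIFO policy are assumptions for illustration, not the patent's design.

```python
class FlashWriteCache:
    """Illustrative write-back cache in front of Flash sectors: repeated
    writes to one sector are coalesced in RAM and flushed as a single
    program/erase cycle, reducing wear on the device."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.dirty = {}          # sector -> pending data (insertion-ordered)
        self.flash_writes = 0    # program/erase cycles actually issued

    def write(self, sector, data):
        if sector not in self.dirty and len(self.dirty) >= self.capacity:
            self._flush(next(iter(self.dirty)))  # evict the oldest entry
        self.dirty[sector] = data                # coalesce pending writes

    def _flush(self, sector):
        del self.dirty[sector]
        self.flash_writes += 1                   # one real Flash write

    def flush_all(self):
        for sector in list(self.dirty):
            self._flush(sector)

cache = FlashWriteCache()
for i in range(100):
    cache.write(i % 4, f"rev{i}")   # 100 logical writes to 4 sectors
cache.flush_all()
print(cache.flash_writes)           # 4 physical writes instead of 100
```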


Proceedings ArticleDOI
24 Jun 1990
TL;DR: A package for manipulating Boolean functions based on the reduced, ordered, binary decision diagram (ROBDD) representation is described, built around an efficient implementation of the if-then-else (ITE) operator.
Abstract: Efficient manipulation of Boolean functions is an important component of many computer-aided design tasks. This paper describes a package for manipulating Boolean functions based on the reduced, ordered, binary decision diagram (ROBDD) representation. The package is based on an efficient implementation of the if-then-else (ITE) operator. A hash table is used to maintain a strong canonical form in the ROBDD, and memory use is improved by merging the hash table and the ROBDD into a hybrid data structure. A memory function for the recursive ITE algorithm is implemented using a hash-based cache to decrease memory use. Memory function efficiency is improved by using rules that detect when equivalent functions are computed. The usefulness of the package is enhanced by an automatic and low-cost scheme for recycling memory. Experimental results are given to demonstrate why various implementation trade-offs were made. These results indicate that the package described here is significantly faster and more memory-efficient than other ROBDD implementations described in the literature.

1,252 citations
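
The heart of such a package is the memoized ITE recursion. Below is a minimal ROBDD-style ITE with a unique table enforcing the strong canonical form and a hash-based memo cache standing in for the memory function; the node representation and variable ordering are simplifying assumptions, not the paper's implementation.

```python
ZERO, ONE = "0", "1"
unique = {}   # (var, low, high) -> node: the unique table (canonical form)
memo = {}     # (f, g, h) -> ITE(f, g, h): the hash-based memory function

def mk(var, low, high):
    if low == high:                      # redundant test: reduce away
        return low
    return unique.setdefault((var, low, high), (var, low, high))

def top_var(*nodes):
    return min(n[0] for n in nodes if n not in (ZERO, ONE))

def restrict(f, var, val):
    if f in (ZERO, ONE) or f[0] != var:
        return f
    return f[2] if val else f[1]

def ite(f, g, h):
    if f == ONE: return g                # terminal cases
    if f == ZERO: return h
    if g == h: return g
    key = (f, g, h)
    if key in memo:                      # memo hit: no recursion needed
        return memo[key]
    v = top_var(f, g, h)
    low = ite(restrict(f, v, 0), restrict(g, v, 0), restrict(h, v, 0))
    high = ite(restrict(f, v, 1), restrict(g, v, 1), restrict(h, v, 1))
    memo[key] = node = mk(v, low, high)
    return node

x1, x2 = mk(1, ZERO, ONE), mk(2, ZERO, ONE)
print(ite(x1, x2, ZERO))   # AND(x1, x2) -> (1, '0', (2, '0', '1'))
print(ite(x1, ONE, x2))    # OR(x1, x2)  -> (1, (2, '0', '1'), '1')
```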


Proceedings ArticleDOI
01 May 1990
TL;DR: A new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models is introduced and is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization.
Abstract: Scalable shared-memory multiprocessors distribute memory among the processors and use scalable interconnection networks to provide high bandwidth and low latency communication. In addition, memory accesses are cached, buffered, and pipelined to bridge the gap between the slow shared memory and the fast processors. Unless carefully controlled, such architectural optimizations can cause memory accesses to be executed in an order different from what the programmer expects. The set of allowable memory access orderings forms the memory consistency model or event ordering model for an architecture. This paper introduces a new model of memory consistency, called release consistency, that allows for more buffering and pipelining than previously proposed models. A framework for classifying shared accesses and reasoning about event ordering is developed. The release consistency model is shown to be equivalent to the sequential consistency model for parallel programs with sufficient synchronization. Possible performance gains from the less strict constraints of the release consistency model are explored. Finally, practical implementation issues are discussed, concentrating on issues relevant to scalable architectures.

1,169 citations
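
A toy model makes the permitted buffering concrete. In the sketch below, ordinary writes sit in a write buffer and need not reach shared memory until a release drains the buffer; the acquire/release API is an informal illustration of the ordering rules, not the paper's formal framework (and the spin "lock" is not atomic).

```python
class RCProcessor:
    """Toy illustration of buffering under release consistency:
    ordinary writes may linger in a write buffer, but a release
    must drain them before it becomes visible to other processors."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = []              # pending ordinary writes

    def write(self, addr, value):
        self.write_buffer.append((addr, value))   # may be delayed/reordered

    def acquire(self, lock):
        while self.memory.get(lock, 0):           # spin until free (toy only)
            pass
        self.memory[lock] = 1

    def release(self, lock):
        for addr, value in self.write_buffer:     # drain before releasing
            self.memory[addr] = value
        self.write_buffer.clear()
        self.memory[lock] = 0                     # now safe to pass the lock

memory = {}
p = RCProcessor(memory)
p.acquire("L")
p.write("x", 42)              # buffered: not yet globally visible
assert "x" not in memory
p.release("L")                # the drain makes x visible before the unlock
assert memory["x"] == 42
```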


Proceedings ArticleDOI
01 Jun 1990
TL;DR: The Tera architecture was designed with several major goals in mind; it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors.
Abstract: The Tera architecture was designed with several major goals in mind. First, it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors. This goal will be achieved; a maximum configuration of the first implementation of the architecture will have 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, and 4096 interconnection network nodes and a clock period less than 3 nanoseconds. The abstract architecture is scalable essentially without limit (although a particular implementation is not, of course). The only requirement is that the number of instruction streams increase more rapidly than the number of physical processors. Although this means that speedup is sublinear in the number of instruction streams, it can still increase linearly with the number of physical processors. The price/performance ratio of the system is unmatched, and puts Tera's high performance within economic reach. Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that do not vectorize well, perhaps because of a preponderance of scalar operations or too-frequent conditional branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy. Virtually any parallelism available in the total computational workload can be turned into speed, from operation-level parallelism within program basic blocks to multiuser time- and space-sharing. The architecture…

797 citations


Journal ArticleDOI
TL;DR: A novel kind of language model which reflects short-term patterns of word use by means of a cache component (analogous to cache memory in hardware terminology) is presented; the model also contains a 3-gram component of the traditional type.
Abstract: Speech-recognition systems must often decide between competing ways of breaking up the acoustic input into strings of words. Since the possible strings may be acoustically similar, a language model is required; given a word string, the model returns its linguistic probability. Several Markov language models are discussed. A novel kind of language model which reflects short-term patterns of word use by means of a cache component (analogous to cache memory in hardware terminology) is presented. The model also contains a 3-gram component of the traditional type. The combined model and a pure 3-gram model were tested on samples drawn from the Lancaster-Oslo/Bergen (LOB) corpus of English text. The relative performance of the two models is examined, and suggestions for future improvements are made.

555 citations
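
The cache component amounts to interpolating a short-term word-frequency estimate with the static model. The sketch below is a minimal version of that idea; the mixture weight, cache size, and the constant stand-in for the 3-gram probability are illustrative assumptions.

```python
from collections import deque

class CacheLanguageModel:
    """Interpolates a static 3-gram model with a cache component
    estimated from the last K words, so recently used words get
    boosted probability."""

    def __init__(self, trigram_prob, cache_size=200, lam=0.1):
        self.trigram_prob = trigram_prob    # callable: P(w | w1, w2)
        self.recent = deque(maxlen=cache_size)
        self.lam = lam                      # cache mixture weight

    def observe(self, w):
        self.recent.append(w)

    def prob(self, w, w1, w2):
        cache_p = self.recent.count(w) / len(self.recent) if self.recent else 0.0
        return self.lam * cache_p + (1 - self.lam) * self.trigram_prob(w, w1, w2)

# With a flat stand-in 3-gram model, a word seen in the recent history
# ("cache") scores far higher than an unseen one ("zebra"):
lm = CacheLanguageModel(lambda w, w1, w2: 0.001)
for w in "the cache model boosts recently used words".split():
    lm.observe(w)
print(lm.prob("cache", "the", "a"), lm.prob("zebra", "the", "a"))
```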


Journal ArticleDOI
01 May 1990
TL;DR: The architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization is described and the scalability of a multiprocessor based on APRIL is explored using a performance model.
Abstract: Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.

434 citations


Journal ArticleDOI
TL;DR: Using a user's local storage capabilities to cache data at the user's site would improve the response time of user queries, albeit at the cost of incurring the overhead required in maintaining multiple copies.
Abstract: Currently, a variety of information retrieval systems are available to potential users.… While in many cases these systems are accessed from personal computers, typically no advantage is taken of the computing resources of those machines (such as local processing and storage). In this paper we explore the possibility of using the user's local storage capabilities to cache data at the user's site. This would improve the response time of user queries albeit at the cost of incurring the overhead required in maintaining multiple copies. In order to reduce this overhead it may be appropriate to allow copies to diverge in a controlled fashion.… Thus, we introduce the notion of quasi-copies, which embodies the ideas sketched above. We also define the types of deviations that seem useful, and discuss the available implementation strategies.—From the Authors' Abstract

345 citations
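
A quasi-copy can be sketched as a cache entry allowed to lag the central copy within an explicit bound. The toy below uses a time bound; the paper's notion also admits version and value bounds, and the API here is an assumption for illustration.

```python
import time

class QuasiCopyCache:
    """Client-side cache whose entries are quasi-copies: they may
    diverge from the central copy, but only within a stated
    staleness bound (here, a maximum age in seconds)."""

    def __init__(self, server_read, max_age=60.0):
        self.server_read = server_read   # callable fetching the master copy
        self.max_age = max_age           # allowed divergence window
        self.entries = {}                # key -> (value, fetched_at)

    def read(self, key):
        entry = self.entries.get(key)
        if entry and time.time() - entry[1] <= self.max_age:
            return entry[0]              # possibly stale, but within bounds
        value = self.server_read(key)    # refresh: the copy re-coheres
        self.entries[key] = (value, time.time())
        return value

db = {"widgets": 100}
cache = QuasiCopyCache(db.__getitem__, max_age=60.0)
print(cache.read("widgets"))   # 100, fetched from the server
db["widgets"] = 103            # the master copy moves on
print(cache.read("widgets"))   # still 100: divergence tolerated for 60 s
```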


Journal ArticleDOI
TL;DR: In this article, the usefulness of shared data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency are addressed.
Abstract: The usefulness of shared-data caches in large-scale multiprocessors, the relative merits of different coherence schemes, and system-level methods for improving directory efficiency are addressed. The research presented is part of an effort to build a high-performance, large-scale multiprocessor. The various classes of cache directory schemes are described, and a method of measuring cache coherence is presented. The various directory schemes are analyzed, and ways of improving the performance of directories are considered. It is found that the best solutions to the cache-coherence problem result from a synergy between a multiprocessor's software and hardware components.

291 citations


Proceedings ArticleDOI
01 May 1990
TL;DR: It is demonstrated that a simple rule system can be constructed that supports a more powerful view system than those available in current commercial systems, and that a rule system is a fundamental concept in a next-generation DBMS, subsuming both views and procedures as special cases.
Abstract: This paper demonstrates that a simple rule system can be constructed that supports a more powerful view system than available in current commercial systems. Not only can views be specified by using rules but also special semantics for resolving ambiguous view updates are simply additional rules. Moreover, procedural data types as proposed in POSTGRES are also efficiently simulated by the same rules system. Lastly, caching of the action part of certain rules is a possible performance enhancement and can be applied to materialize views as well as to cache procedural data items. Hence, we conclude that a rule system is a fundamental concept in a next generation DBMS, and it subsumes both views and procedures as special cases.

168 citations
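
Caching the action part of a rule is exactly what materializes a view. The toy below treats a view as a rule whose action is a query, caches the query's result, and invalidates the cache on any base-table update; the coarse invalidation policy and the API are assumptions for illustration.

```python
class RuleSystem:
    """Toy rule system: a view is a rule whose action is a query over
    base tables; caching that action's result materializes the view."""

    def __init__(self):
        self.tables = {}
        self.rules = {}    # view name -> query function over tables
        self.cache = {}    # view name -> materialized result

    def define_view(self, name, query):
        self.rules[name] = query

    def update(self, table, rows):
        self.tables[table] = rows
        self.cache.clear()               # cached actions may now be stale

    def query_view(self, name):
        if name not in self.cache:       # cache the rule's action part
            self.cache[name] = self.rules[name](self.tables)
        return self.cache[name]

rs = RuleSystem()
rs.update("emp", [("alice", 120), ("bob", 80)])
rs.define_view("well_paid", lambda t: [e for e in t["emp"] if e[1] > 100])
print(rs.query_view("well_paid"))   # computed once and materialized
print(rs.query_view("well_paid"))   # served from the cache
```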


Journal ArticleDOI
01 May 1990
TL;DR: This work has examined the sharing and synchronization behavior of a variety of shared memory parallel programs and found that the access patterns of a large percentage of shared data objects fall in a small number of categories for which efficient software coherence mechanisms exist.
Abstract: An adaptive cache coherence mechanism exploits semantic information about the expected or observed access behavior of particular data objects. We contend that, in distributed shared memory systems, adaptive cache coherence mechanisms will outperform static cache coherence mechanisms. We have examined the sharing and synchronization behavior of a variety of shared memory parallel programs. We have found that the access patterns of a large percentage of shared data objects fall in a small number of categories for which efficient software coherence mechanisms exist. In addition, we have performed a simulation study that provides two examples of how an adaptive caching mechanism can take advantage of semantic information.

Book
03 Jan 1990
TL;DR: This book develops a performance-directed approach to cache design, covering the cache design problem and its solution, multi-level cache hierarchies, and the modelling of write strategy effects.
Abstract: 1. Introduction 2. Background Material 3. The Cache Design Problem and Its Solution 4. Performance-Directed Cache Design 5. Multi-Level Cache Hierarchies 6. Summary, Implications and Conclusions Appendix A. Validation of the Empirical Results Appendix B. Modelling Write Strategy Effects

Proceedings ArticleDOI
01 May 1990
TL;DR: An analytical model of multithreaded processor behavior based on a small set of architectural and program parameters is developed; prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.
Abstract: Multithreading has been proposed as an architectural strategy for tolerating latency on multiprocessors and, through limited empirical studies, has been shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parameters. The model gives rise to a large Markov chain, which is solved to obtain a formula for processor efficiency in terms of the number of threads per processor, the remote reference rate, the latency, and the switch cost. The efficiency exhibits three operating regimes: linear (efficiency is proportional to the number of threads), transition, and saturation (efficiency depends only on the remote reference rate and switch cost). Formulas for the regime boundaries are derived. The model is embellished to reflect cache degradation due to multithreading, using an analytical model of cache behavior, demonstrating that returns diminish as the number of threads becomes large. Predictions from the embellished model correlate well with published empirical measurements. Prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.
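
The three regimes can be made concrete with the usual deterministic simplification of such models. The closed form below, with run length R between remote references, latency L, and switch cost C, is an illustrative assumption rather than the paper's Markov-chain solution: efficiency grows roughly linearly with the number of threads and saturates at R/(R+C).

```python
def efficiency(n, R, L, C):
    """Deterministic sketch of multithreaded processor efficiency:
    n threads, R cycles of work between remote references,
    L cycles of remote latency, C cycles of switch cost."""
    linear = n * R / (R + L + C)    # latency only partly hidden
    saturation = R / (R + C)        # latency fully hidden by other threads
    return min(linear, saturation)

R, L, C = 20, 100, 2
for n in (1, 2, 4, 6, 8):
    print(n, round(efficiency(n, R, L, C), 3))
# Efficiency climbs with n and flattens near R/(R+C) = 0.909; past the
# saturation point extra threads buy nothing, and in the embellished
# model they actively hurt by degrading the cache.
```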

Patent
27 Mar 1990
TL;DR: In this article, the authors propose an extension to the basic stream buffer, called multi-way stream buffers (62), which is useful for prefetching along multiple intertwined data reference streams.
Abstract: A memory system (10) utilizes miss caching by incorporating a small fully-associative miss cache (42) between a cache (18 or 20) and second-level cache (26). Misses in the cache (18 or 20) that hit in the miss cache have only a one cycle miss penalty, as opposed to a many cycle miss penalty without the miss cache (42). Victim caching is an improvement to miss caching that loads a small, fully associative cache (52) with the victim of a miss and not the requested line. Small victim caches (52) of 1 to 4 entries are even more effective at removing conflict misses than miss caching. Stream buffers (62) prefetch cache lines starting at a cache miss address. The prefetched data is placed in the buffer (62) and not in the cache (18 or 20). Stream buffers (62) are useful in removing capacity and compulsory cache misses, as well as some instruction cache misses. Stream buffers (62) are more effective than previously investigated prefetch techniques when the next slower level in the memory hierarchy is pipelined. An extension to the basic stream buffer, called multi-way stream buffers (62), is useful for prefetching along multiple intertwined data reference streams.

Patent
24 Oct 1990
TL;DR: In this paper, a method and system for independently resetting primary and secondary processors 20 and 120 respectively under program control in a multiprocessor, cache memory system is presented.
Abstract: A method and system for independently resetting primary and secondary processors 20 and 120 respectively under program control in a multiprocessor, cache memory system. Processors 20 and 120 are reset without causing cache memory controllers 24 and 124 to reset.

Patent
17 Jul 1990
TL;DR: In this article, disk drive access control apparatus for connection between a host computer and a plurality of disk drives to provide an asynchronously operating storage system is presented. The disk drive controller channels are connected to respective ones of the disk drives, and each of the channels includes a cache/buffer memory and a microprocessor unit.
Abstract: This invention provides disk drive access control apparatus for connection between a host computer and a plurality of disk drives to provide an asynchronously operating storage system. It also provides increases in performance over earlier versions thereof. There are a plurality of disk drive controller channels connected to respective ones of the disk drives and controlling transfers of data to and from the disk drives; each of the disk drive controller channels includes a cache/buffer memory and a microprocessor unit. An interface and driver unit interfaces with the host computer, and there is a central cache memory. Cache memory control logic controls transfers of data from the cache/buffer memory of the plurality of disk drive controller channels to the cache memory, from the cache memory to the cache/buffer memory of the plurality of disk drive controller channels, and from the cache memory to the host computer through the interface and driver unit. A central processing unit manages the use of the cache memory by requesting data transfers only of data not presently in the cache memory and by sending high-level commands to the disk drive controller channels. A first (data) bus interconnects the plurality of disk drive cache/buffer memories, the interface and driver unit, and the cache memory for the transfer of information therebetween, and a second (information and commands) bus interconnects the same elements with the central processing unit for the transfer of control and information therebetween.

Journal ArticleDOI
01 May 1990
TL;DR: Simulations show that in terms of access time and network traffic both directory methods provide significant performance improvements over a memory system in which shared-writeable data is not cached.
Abstract: This paper presents an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared memory multiprocessors. Both directory methods are modifications of a scheme proposed by Censier and Feautrier [5] that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes considered here overcome the large amount of memory required for tags in the original scheme in two different ways. In the first scheme each main memory block is sectored into sub-blocks for which the large tag overhead is shared. In the second scheme a limited number of large tags are stored in an associative cache and shared among a much larger number of main memory blocks. Simulations show that in terms of access time and network traffic both directory methods provide significant performance improvements over a memory system in which shared-writeable data is not cached. The large block sizes required by the sectored scheme, however, induce enough false sharing that its performance is markedly worse than using a tag cache.
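
The memory savings of the two schemes can be seen with back-of-the-envelope arithmetic. All parameters below (64 processors, 16-byte blocks, 16 MB of memory, the sectoring factor, and the tag-cache size) are illustrative assumptions, not the paper's configuration.

```python
procs = 64
block_bytes = 16
memory_bytes = 16 * 2**20
blocks = memory_bytes // block_bytes
full_map_bits = blocks * procs        # one presence bit per processor per block

# Scheme 1: sector the blocks so several sub-blocks share one tag.
sub_blocks_per_sector = 4
sectored_bits = full_map_bits // sub_blocks_per_sector

# Scheme 2: keep a limited number of tags in an associative tag cache
# shared among a much larger number of memory blocks.
tag_cache_entries = blocks // 16
tag_cache_bits = tag_cache_entries * (procs + 20)   # presence bits + address tag

for name, bits in (("full map", full_map_bits),
                   ("sectored", sectored_bits),
                   ("tag cache", tag_cache_bits)):
    print(f"{name}: {bits / 8 / 2**20:.2f} MB of directory state")
```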

Patent
01 Jun 1990
TL;DR: In this paper, the authors use statistical information obtained by running the computer code with test data to determine a new ordering for the code blocks, which places code blocks that are often executed after one another close to one another in the computer's memory.
Abstract: The method uses statistical information obtained by running the computer code with test data to determine a new ordering for the code blocks. The new order places code blocks that are often executed after one another close to one another in the computer's memory. The method first generates chains of basic blocks, and then merges the chains. Finally, basic blocks that were not executed by the test data that was used to generate the statistical information are moved to a distant location to allow the blocks that were used to be more closely grouped together.
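
A greedy chain-building pass captures the flavor of the method, as sketched below: profiled branch edges are processed from hottest to coldest, chains are merged so that hot successors become fall-throughs, and blocks never executed by the test data are appended at the end. The details are assumptions for illustration, not the patented procedure.

```python
def layout_blocks(blocks, edge_counts):
    """Profile-guided code layout: build chains of basic blocks by
    merging along the hottest branch edges, then place unexecuted
    blocks at a distant location (the end of the ordering)."""
    chains = {b: [b] for b in blocks}              # block -> its chain
    for (src, dst), count in sorted(edge_counts.items(),
                                    key=lambda kv: -kv[1]):
        if count == 0:
            continue
        a, b = chains[src], chains[dst]
        if a is b or a[-1] != src or b[0] != dst:
            continue                               # chain ends already taken
        a.extend(b)                                # dst now falls through
        for blk in b:
            chains[blk] = a
        b.clear()

    hot = {blk for edge, c in edge_counts.items() if c for blk in edge}
    seen, order = set(), []
    for b in blocks:                               # emit hot chains first
        if b in hot and id(chains[b]) not in seen:
            seen.add(id(chains[b]))
            order.extend(chains[b])
    order.extend(b for b in blocks if b not in hot)  # cold blocks go last
    return order

edges = {("A", "B"): 90, ("A", "C"): 10, ("B", "D"): 90, ("C", "D"): 0}
print(layout_blocks(["A", "B", "C", "D", "E"], edges))
# ['A', 'B', 'D', 'C', 'E'] -- the hot path A->B->D is contiguous and
# the never-executed block E is moved out of the way.
```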

Proceedings Article
13 Aug 1990
TL;DR: A new cache consistency algorithm for client caches is proposed; it is a simple extension to two-phase locking, consists of three additional lock modes that must be supported by the server lock manager, and can significantly improve server performance over basic two-phase locking.
Abstract: This paper addresses the problem of cache consistency in a client-server database environment. We assume the server provides shared database access for multiple client workstations and that client workstations may cache a portion of the database. Our primary goal is to investigate techniques to maintain the consistency of the client cache and to improve server throughput. We propose a new cache consistency algorithm for client caches. The algorithm is a simple extension to two-phase locking and consists of three additional lock modes that must be supported by the server lock manager. For comparison, we devised a second cache consistency algorithm based on notify locks. A simulation model was developed to analyze the performance of the server under the two cache consistency algorithms and under noncaching two-phase locking. The results show that both consistency algorithms can significantly improve server performance over basic two-phase locking. The notify locks algorithm, at times, ou…

Journal ArticleDOI
TL;DR: The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined and a user-transparent checkpointing and recovery scheme using private caches is presented, which prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols.
Abstract: The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.

Patent
13 Aug 1990
TL;DR: In this article, a cache lock, a pending lock, and an out-of-date lock are added to a two-lock concurrency control system to maintain the consistency of cached data in a client-server database system.
Abstract: A method of maintaining the consistency of cached data in a client-server database system. Three new locks--a cache lock, a pending lock and an out-of-date lock--are added to a two-lock concurrency control system. A new long-running envelope transaction holds a cache lock on each object cached by a given client. A working transaction of the client works only with the cached object until commit time. If a second client's working transaction acquires an "X" lock on the object, the cache lock is changed to a pending lock; if the transaction thereafter commits, the pending lock is changed to an out-of-date lock. If the first client's working transaction thereafter attempts to commit, it waits for a pending lock to change; it aborts if it encounters an out-of-date lock; and otherwise it commits.
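
The three lock modes define a small per-object state machine, traced in the sketch below; the names and the API are illustrative assumptions, not the patent's interface.

```python
CACHE, PENDING, OUT_OF_DATE = "cache", "pending", "out-of-date"

class CachedObject:
    """State machine for one cached object under the three new locks."""

    def __init__(self):
        self.mode = CACHE            # held by the long-running envelope txn

    def writer_acquires_x_lock(self):
        if self.mode == CACHE:
            self.mode = PENDING      # another client is updating the object

    def writer_commits(self):
        if self.mode == PENDING:
            self.mode = OUT_OF_DATE  # the cached copy is now known stale

    def reader_commit_outcome(self):
        if self.mode == CACHE:
            return "commit"          # copy was never touched
        if self.mode == PENDING:
            return "wait"            # wait for the pending lock to change
        return "abort"               # out-of-date: the reader must abort

obj = CachedObject()
obj.writer_acquires_x_lock()
print(obj.reader_commit_outcome())   # wait
obj.writer_commits()
print(obj.reader_commit_outcome())   # abort
```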

Journal ArticleDOI
01 May 1990
TL;DR: If the cache is multi-level and references to the TLB slice are “shielded” by hits in a virtually indexed primary cache, the slice can get by with very few entries, once again lowering its cost and increasing its speed.
Abstract: The MIPS R6000 microprocessor relies on a new type of translation lookaside buffer — called a TLB slice — which is less than one-tenth the size of a conventional TLB and as fast as one multiplexer delay, yet has a high enough hit rate to be practical. The fast translation makes it possible to use a physical cache without adding a translation stage to the processor's pipeline. The small size makes it possible to include address translation on-chip, even in a technology with a limited number of devices. The key idea behind the TLB slice is to have both a virtual tag and a physical tag on a physically-indexed cache. Because of the virtual tag, the TLB slice needs to hold only enough physical page number bits — typically 4 to 8 — to complete the physical cache index, in contrast with a conventional TLB, which needs to hold both a virtual page number and a physical page number. The virtual page number is unnecessary because the TLB slice needs to provide only a hint for the translated physical address rather than a guarantee. The full physical page number is unnecessary because the cache hit logic is based on the virtual tag. Furthermore, if the cache is multi-level and references to the TLB slice are "shielded" by hits in a virtually indexed primary cache, the slice can get by with very few entries, once again lowering its cost and increasing its speed. With this mechanism, the simplicity of a physical cache can be combined with the speed of a virtual cache.
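
The bit arithmetic behind the slice is easy to illustrate: when the physical cache index spans more bits than the page offset, only the few low bits of the physical page number are needed to finish the index, and the slice stores just those bits as a hint that the virtual-tag check (not shown) must confirm. The field widths and the tiny table geometry below are assumptions for illustration.

```python
PAGE_BITS = 12                        # 4 KB pages
INDEX_BITS = 16                       # toy 64 KB physically indexed cache
SLICE_BITS = INDEX_BITS - PAGE_BITS   # only 4 physical bits are needed

slice_table = {}                      # vpn slot -> low PPN bits (a hint)

def record_translation(vpn, ppn):
    slice_table[vpn % 64] = ppn & ((1 << SLICE_BITS) - 1)

def cache_index(vaddr):
    """Physical cache index from a virtual address, without a full TLB:
    page offset bits plus the slice's few-bit physical hint."""
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    hint = slice_table.get((vaddr >> PAGE_BITS) % 64, 0)
    return (hint << PAGE_BITS) | offset

record_translation(vpn=0x1234, ppn=0xABCD7)   # low 4 PPN bits are 0x7
print(hex(cache_index((0x1234 << PAGE_BITS) | 0x1F0)))   # 0x71f0
# A wrong hint merely indexes the wrong line; the virtual tag comparison
# then misses, so correctness never depends on the slice being right.
```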

Patent
14 Mar 1990
TL;DR: In this paper, a conditional broadcast or notification facility of a global lock manager is utilized to both serialize access to pages stored in local caches of counterpart processors in a distributed system and to ensure consistency among pages common to the caches.
Abstract: A conditional broadcast or notification facility of a global lock manager is utilized both to serialize access to pages stored in local caches of counterpart processors in a distributed system and to ensure consistency among pages common to the caches. Exclusive-use locks are obtained in advance of all write operations. When a page that is cached in a processor other than that of the requester is to be updated, a delay is imposed on the grant of the exclusive lock: all shared-use lock holders on the same page are notified, local copies are invalidated, the exclusive lock is granted, and the page is updated and written through the cache, after which the lock is demoted to shared use.

Patent
30 Jan 1990
TL;DR: In this article, a method and apparatus for storing an instruction word in a compacted form on a storage media is presented; the instruction word has a plurality of instruction fields, and associated with each instruction word is a mask word having a length in bits at least equal to the number of instruction fields in the instruction word.
Abstract: A method and apparatus for storing an instruction word in a compacted form on a storage media, the instruction word having a plurality of instruction fields, features associating with each instruction word, a mask word having a length in bits at least equal to the number of instruction fields in the instruction word. Each instruction field is associated with a bit of the mask word and accordingly, using the mask word, only non-zero instruction fields need to be stored in memory. The instruction compaction method is advantageously used in a high speed cache miss engine for refilling portions of instruction cache after a cache miss occurs.
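
The mask-word scheme is simply sparse storage of a mostly-zero word: one mask bit per field, with only the non-zero fields stored. The round-trip sketch below assumes a field count for illustration; it is not the patented encoding itself.

```python
NUM_FIELDS = 8   # assumed number of instruction fields per word

def compact(fields):
    """Return (mask, packed): mask bit i set means field i is non-zero
    and appears, in order, in the packed list."""
    mask, packed = 0, []
    for i, f in enumerate(fields):
        if f != 0:
            mask |= 1 << i
            packed.append(f)
    return mask, packed

def expand(mask, packed):
    """Rebuild the full instruction word, e.g. in a cache miss engine
    refilling the instruction cache."""
    it = iter(packed)
    return [next(it) if mask & (1 << i) else 0 for i in range(NUM_FIELDS)]

word = [0, 0, 42, 0, 0, 7, 0, 0]      # mostly-zero wide instruction word
mask, packed = compact(word)
print(bin(mask), packed)              # 0b100100 [42, 7]
assert expand(mask, packed) == word   # lossless round trip
```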

Patent
10 May 1990
TL;DR: In this article, a set associative cache using decoded data element select lines which can be selectively configured to provide different data sets arrangements is presented. But the cache is limited to the maximum possible number of sets.
Abstract: A set associative cache using decoded data element select lines which can be selectively configured to provide different data sets arrangements. The cache includes a tag array, a number of tag comparators corresponding to the maximum possible number of sets, a data element select logic circuit, and a data array. The tag and data arrays each provide, in response to an input address, a number of output tag and data elements, respectively. The number of output tag and data elements depends upon the maximum set size desired for the cache. An input main memory address is used to address both the tag and data arrays. The tag comparators compare a tag field portion of the input main memory address to each element output from the tag array. The select logic then uses the outputs of the tag comparators and one or more of the input main memory address bits to generate decoded data array enable signals. The decoded enable signals are then coupled to enable the desired one of the enabled data elements.

Proceedings ArticleDOI
26 Jun 1990
TL;DR: Three cache-aided error-recovery algorithms for use in shared-memory multiprocessor systems are presented, along with the results of a tradeoff analysis.
Abstract: Three cache-aided error-recovery algorithms for use in shared-memory multiprocessor systems are presented. They rely on hardware and specially designed cache memory for all their soft error management operations and can be easily incorporated into existing cache-coherence protocols. An example illustrating their use in a multiprocessor system employing Dragon as its cache-coherence protocol is given, and the results of a tradeoff analysis are presented.

Journal ArticleDOI
01 May 1990
TL;DR: This paper explores the interactions between a cache's block size, fetch size, and fetch policy from the perspective of maximizing system-level performance, and finds that the most effective fetch strategy improved performance by between 1.7% and 4.5%.
Abstract: This paper explores the interactions between a cache's block size, fetch size and fetch policy from the perspective of maximizing system-level performance. It has been previously noted that given a simple fetch strategy the performance optimal block size is almost always four or eight words [10]. If there is even a small cycle time penalty associated with either longer blocks or fetches, then the performance-optimal size is noticeably reduced. In split cache organizations, where the fetch and block sizes of instruction and data caches are all independent design variables, instruction cache block size and fetch size should be the same. For the workload and write-back write policy used in this trace-driven simulation study, the instruction cache block size should be about a factor of two greater than the data cache fetch size, which in turn should be equal to or double the data cache block size. The simplest fetch strategy of fetching only on a miss and stalling the CPU until the fetch is complete works well. Complicated fetch strategies do not produce the performance improvements indicated by the accompanying reductions in miss ratios because of limited memory resources and a strong temporal clustering of cache misses. For the environments simulated here, the most effective fetch strategy improved performance by between 1.7% and 4.5% over the simplest strategy described above.

Patent
21 Nov 1990
TL;DR: A cache memory system (24) comprising a fast SRAM primary cache (26) and a DRAM secondary cache (28) is presented; on a secondary-cache hit, data is transferred through the primary cache (26) via transfer circuits (96), (98), (100) and (102) to the data bus (32) and simultaneously stored in an appropriate one of the arrays (88)-(94).
Abstract: A central processing unit (10) has a cache memory system (24) associated therewith for interfacing with a main memory system (23). The cache memory system (24) includes a primary cache (26) comprised of SRAMs and a secondary cache (28) comprised of DRAM. The primary cache (26) has a faster access than the secondary cache (28). When it is determined that the requested data is stored in the primary cache (26), it is transferred immediately to the central processing unit (10). When it is determined that the data resides only in the secondary cache (28), the data is accessed therefrom and routed to the central processing unit (10) and simultaneously stored in the primary cache (26). If a hit occurs in the primary cache (26), it is accessed and output to a local data bus (32). If only the secondary cache (28) indicates a hit, data is accessed from the appropriate one of the arrays (80)-(86) and transferred through the primary cache (26) via transfer circuits (96), (98), (100) and (102) to the data bus (32). Simultaneously therewith, the data is stored in an appropriate one of the arrays (88)-(94). When a hit does not occur in either the secondary cache (28) or the primary cache (26), data is retrieved from the main system memory (23) through a buffer/multiplexer circuit on one side of the secondary cache (28) and passed through both the secondary cache (28) and the primary cache (26) and stored therein in a single operation due to the line-for-line transfer provided by the transfer circuits (96)-(102).

Patent
14 Jun 1990
TL;DR: In this article, a dynamic random access memory with a fast serial access mode for use in a simple cache system includes a plurality of memory cell blocks prepared by division of a memory cell array.
Abstract: A dynamic random access memory with a fast serial access mode for use in a simple cache system includes a plurality of memory cell blocks prepared by division of a memory cell array, a plurality of data latches each provided for each column in the memory cell blocks, and a block selector. When a cache miss signal is produced by the cache system, data on the columns in the cell block selected by the block selector are transferred into the data latches provided for those columns. When a cache hit signal is produced by the cache system, the data latches are isolated from the memory cell array. On a cache hit, access is made to at least one of the data latches based on an externally applied column address; on a cache miss, access is made to at least one of the columns in the selected block based on the column address.
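
The data latches behave as a one-row cache, as the toy below shows: hits are served from the latches without touching the array, and a miss first reloads the latches from the selected row. The geometry and API are assumptions for illustration.

```python
class CachedDRAM:
    """Toy DRAM whose per-column data latches act as a one-row cache."""

    def __init__(self, rows=16, cols=8):
        self.array = [[0] * cols for _ in range(rows)]
        self.latches = [0] * cols     # per-column data latches
        self.latched_row = None

    def read(self, row, col):
        if row == self.latched_row:              # cache hit: latches only,
            return self.latches[col], "hit"      # array stays isolated
        self.latches = list(self.array[row])     # miss: reload the latches
        self.latched_row = row
        return self.latches[col], "miss"

d = CachedDRAM()
d.array[3][5] = 99
print(d.read(3, 5))   # (99, 'miss') -- row 3 loaded into the latches
print(d.read(3, 2))   # (0, 'hit')   -- fast serial access from the latches
```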

Patent
29 Jan 1990
TL;DR: In this article, the authors propose a prioritization scheme based on the operational proximity of the request to the instruction currently being executed, which temporarily suspends the higher priority request while the desired data is being retrieved from main memory 14, but continues to operate on a lower priority request so that the overall operation will be enhanced if the lower priority request hits in the cache 28.
Abstract: In a pipelined computer system, memory access functions are simultaneously generated from a plurality of different locations. These multiple requests are passed through a multiplexer 50 according to a prioritization scheme based upon the operational proximity of the request to the instruction currently being executed. In this manner, the complex task of converting virtual-to-physical addresses is accomplished for all memory access requests by a single translation buffer 30. The physical addresses resulting from the translation buffer 30 are passed to a cache 28 of the main memory 14 through a second multiplexer 40 according to a second prioritization scheme based upon the operational proximity of the request to the instruction currently being executed. The first and second prioritization schemes differ in that the memory is capable of handling other requests while a higher priority "miss" is pending. Thus, the prioritization scheme temporarily suspends the higher priority request while the desired data is being retrieved from main memory 14, but continues to operate on a lower priority request so that the overall operation will be enhanced if the lower priority request hits in the cache 28.