Showing papers on "Cache published in 1984"
••
01 Jan 1984
TL;DR: This paper presents a cache coherence solution for multiprocessors organized around a single time-shared bus that aims at reducing bus traffic, and hence bus wait time, which in turn increases overall processor utilization.
Abstract: This paper presents a cache coherence solution for multiprocessors organized around a single time-shared bus. The solution aims at reducing bus traffic and hence bus wait time. This in turn increases the overall processor utilization. Unlike most traditional high-performance coherence solutions, this solution does not use any global tables. Furthermore, this coherence scheme is modular and easily extensible, requiring no modification of cache modules to add more processors to a system. The performance of this scheme is evaluated by using an approximate analysis method. It is shown that the performance of this scheme is closely tied with the miss ratio and the amount of sharing between processors.
531 citations
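The table-free approach above boils down to a per-line state machine in every cache controller, driven by snooped bus traffic. A minimal sketch in C, assuming a generic write-invalidate protocol; the state names and transitions are illustrative, since the abstract does not specify them:

```c
#include <stdio.h>

/* Hypothetical per-line states for a table-free snooping cache. */
typedef enum { INVALID, VALID, DIRTY } line_state;

/* Bus events every cache controller observes by snooping. */
typedef enum { BUS_READ, BUS_WRITE } bus_event;

/* Each cache reacts to snooped traffic using purely local state,
 * which is why no global table is needed. */
line_state snoop(line_state s, bus_event e) {
    switch (e) {
    case BUS_READ:
        return (s == DIRTY) ? VALID : s;  /* supply data, drop to VALID */
    case BUS_WRITE:
        return INVALID;                   /* another cache writes: invalidate */
    }
    return s;
}

int main(void) {
    line_state s = DIRTY;
    s = snoop(s, BUS_READ);   /* -> VALID */
    s = snoop(s, BUS_WRITE);  /* -> INVALID */
    printf("final state: %d\n", s);
    return 0;
}
```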
••
01 Jan 1984
TL;DR: Both schemes feature decentralized consistency control and dynamic type classification of the cached data, and it appears that moderately large parallel processors can be designed by employing the principles presented in this paper.
Abstract: This paper presents two cache schemes for a shared-memory shared bus multiprocessor. Both schemes feature decentralized consistency control and dynamic type classification of the datum cached (i.e. read-only, local, or shared). It is shown how to exploit these features to minimize the shared bus traffic. The broadcasting ability of the shared bus is used not only to signal an event but also to distribute data. In addition, by introducing a new synchronization construct, i.e. the Test-and-Test-and-Set instruction, many of the traditional parallel processing "hot spots" or bottlenecks are eliminated. Sketches of formal correctness proofs for the proposed schemes are also presented. It appears that moderately large parallel processors can be designed by employing the principles presented in this paper.
210 citations
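The Test-and-Test-and-Set construct introduced here is still the standard way to keep a spinlock off the bus: spin on an ordinary cacheable read, and attempt the atomic set only when the lock looks free. A sketch using C11 atomics as modern stand-ins for the instruction:

```c
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock = 0;

void acquire(atomic_int *l) {
    for (;;) {
        /* "Test": spin on a plain read, which hits in the local cache
         * and generates no bus traffic while the lock is held. */
        while (atomic_load_explicit(l, memory_order_relaxed))
            ;
        /* "Test-and-Set": only now attempt the atomic operation,
         * which is the part that costs bus transactions. */
        if (!atomic_exchange_explicit(l, 1, memory_order_acquire))
            return;
    }
}

void release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}

int main(void) {
    acquire(&lock);
    puts("in critical section");
    release(&lock);
    return 0;
}
```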
••
TL;DR: Experimental results show that chickadees do incorporate these kinds of information in memory for cache sites, and that memory for storage sites, if used by black-capped chickadees, is predicted to have four properties.
153 citations
•
IBM1
TL;DR: In this article, a prefetching mechanism for a system having a cache has, in addition to the normal cache directory, a two-level shadow directory, in which a parent identifier derived from the block address is stored in a first level of the shadow directory.
Abstract: A prefetching mechanism for a system having a cache has, in addition to the normal cache directory, a two-level shadow directory. When an information block is accessed, a parent identifier derived from the block address is stored in a first level of the shadow directory. The address of a subsequently accessed block is stored in the second level of the shadow directory, in a position associated with the first-level position of the respective parent identifier. With each access to an information block, a check is made whether the respective parent identifier is already stored in the first level of the shadow directory. If it is found, then a descendant address from the associated second-level position is used to prefetch an information block to the cache if it is not already resident therein. This mechanism avoids, with a high probability, the occurrence of cache misses.
103 citations
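The shadow-directory mechanism is essentially a table from a parent identifier to the block that followed it last time. A toy sketch; the hash function, table size, and update policy are invented for illustration:

```c
#include <stdio.h>

#define SHADOW_SIZE 256

/* First level holds parent identifiers; the second level holds the
 * block address that followed that parent on its last appearance. */
static unsigned parent_id[SHADOW_SIZE];
static unsigned descendant[SHADOW_SIZE];

/* Hypothetical parent identifier derived from the block address. */
static unsigned hash_parent(unsigned block_addr) {
    return (block_addr >> 4) % SHADOW_SIZE;
}

/* Called on each block access with the previously accessed block;
 * returns a prefetch candidate, or 0 if the parent is unknown. */
unsigned access_block(unsigned addr, unsigned prev_addr) {
    unsigned prefetch = 0;
    unsigned h = hash_parent(addr);
    if (parent_id[h] == addr)          /* parent already recorded */
        prefetch = descendant[h];      /* prefetch its last successor */
    unsigned hp = hash_parent(prev_addr);
    parent_id[hp] = prev_addr;         /* record prev_addr -> addr */
    descendant[hp] = addr;
    return prefetch;
}

int main(void) {
    access_block(0x100, 0);
    access_block(0x140, 0x100);              /* learn 0x100 -> 0x140 */
    unsigned p = access_block(0x100, 0x140); /* predicts 0x140 */
    printf("prefetch candidate: 0x%x\n", p);
    return 0;
}
```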
•
IBM1
TL;DR: In this paper, a distributed cache is achieved by the use of communicating random access memory chips of the type incorporating a primary port (10) and a secondary port (14), which can run totally independently of each other.
Abstract: The cache reload time in small computer systems is improved by using a distributed cache located on the memory chips. The large bandwidth between the main memory and cache is the usual on-chip interconnecting lines which avoids pin input/output problems. This distributed cache is achieved by the use of communicating random access memory chips of the type incorporating a primary port (10) and a secondary port (14). Ideally, the primary and secondary ports can run totally independently of each other. The primary port functions as in a typical dynamic random access memory and is the usual input/output path for the memory chips. The secondary port, which provides the distributed cache, makes use of a separate master/slave row buffer (15) which is normally isolated from the sense amplifier/latches. Once this master/slave row buffer is loaded, it can be accessed very fast, and the large bandwidth between the main memory array and the on-chip row buffer provides a very fast reload time for a cache miss.
100 citations
12 Jan 1984
TL;DR: The n+1 architecture, with individual cache memories and a 'bus-ownership protocol,' has overcome the problems which have prevented other designs from realising the increased performance a tightly coupled multiprocessor architecture should provide, and offers linear increases in performance.
Abstract: A tightly coupled architecture of multiple processors automates the functions of expansion, system tuning, load balancing and data-base distribution-a major part of designing and implementing online systems. To reduce memory-access times significantly, the n+1 multiprocessor system for fault-tolerant transaction processing also combines individual cache memories for each processor with a data-sharing scheme. Thus it overcomes the problems which have prevented other designs from realising the increased performance a tightly coupled multiprocessor architecture should provide. The principal problem resulted from the use of a shared memory, because the individual processors contend for that memory and therefore waste much valuable processing time. The n+1 architecture, with individual cache memories and a 'bus-ownership protocol,' has overcome this impediment and offers linear increases of performance, with up to 28 processors sharing a common fault-tolerant memory system. The key to the Synapse expansion architecture is a new look at bus arbitration and caching in tightly coupled systems.
99 citations
••
01 Jan 1984
TL;DR: This paper uses trace driven simulation to study design tradeoffs for small (on-chip) caches, and finds that general purpose caches of 64 bytes (net size) are marginally useful in some cases, while 1024-byte caches perform fairly well.
Abstract: Advances in integrated circuit density are permitting the implementation on a single chip of functions and performance enhancements beyond those of a basic processor. One performance enhancement of proven value is a cache memory; placing a cache on the processor chip can reduce both mean memory access time and bus traffic. In this paper we use trace driven simulation to study design tradeoffs for small (on-chip) caches. Miss ratio and traffic ratio (bus traffic) are the metrics for cache performance. Particular attention is paid to sub-block caches (also known as sector caches), in which address tags are associated with blocks, each of which contains multiple sub-blocks; sub-blocks are the transfer unit. Using traces from two 16-bit architectures (Z8000, PDP-11) and two 32-bit architectures (VAX-11, System/370), we find that general purpose caches of 64 bytes (net size) are marginally useful in some cases, while 1024-byte caches perform fairly well; typical miss and traffic ratios for a 1024 byte (net size) cache, 4-way set associative with 8 byte blocks, are: PDP-11: .039, .156; Z8000: .015, .060; VAX-11: .080, .160; Sys/370: .244, .489. (These figures are based on traces of user programs and the performance obtained in practice is likely to be less good.) The use of sub-blocks allows tradeoffs between miss ratio and traffic ratio for a given cache size. Load forward is quite useful. Extensive simulation results are presented.
99 citations
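The sub-block organization studied here, one address tag per block with a valid bit per sub-block and sub-blocks as the transfer unit, is easy to mimic in a toy simulator. A sketch with illustrative sizes, not the paper's configurations:

```c
#include <stdio.h>
#include <string.h>

/* Toy direct-mapped sub-block (sector) cache: one address tag per
 * block, a valid bit per sub-block, sub-blocks as the transfer unit. */
#define NBLOCKS    16
#define SUBS_PER_B 4
#define SUB_BYTES  8

static unsigned tags[NBLOCKS];
static unsigned char valid[NBLOCKS][SUBS_PER_B];
static long misses, refs, bytes_moved;

void access_byte(unsigned addr) {
    unsigned sub   = (addr / SUB_BYTES) % SUBS_PER_B;
    unsigned block = addr / (SUB_BYTES * SUBS_PER_B);
    unsigned set   = block % NBLOCKS;
    refs++;
    if (tags[set] != block) {          /* tag miss: take over the block */
        tags[set] = block;
        memset(valid[set], 0, SUBS_PER_B);
    }
    if (!valid[set][sub]) {            /* sub-block miss: fetch it */
        valid[set][sub] = 1;
        misses++;
        bytes_moved += SUB_BYTES;      /* only one sub-block moves */
    }
}

int main(void) {
    for (unsigned a = 0; a < 4096; a += 4)  /* sequential toy "trace" */
        access_byte(a);
    printf("miss ratio %.3f, traffic %ld bytes\n",
           (double)misses / refs, bytes_moved);
    return 0;
}
```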
•
27 Sep 1984
TL;DR: In this paper, a cache memory unit is constructed to have a two-stage pipeline shareable by a plurality of sources which include two independently operated central processing units (CPUs).
Abstract: A cache memory unit is constructed to have a two-stage pipeline shareable by a plurality of sources which include two independently operated central processing units (CPUs). Apparatus included within the cache memory unit operates to allocate alternate time slots to the two CPUs which offset their operations by a pipeline stage. This permits one pipeline stage of the cache memory unit to perform a directory search for one CPU while the other pipeline stage performs a data buffer read for the other CPU. Each CPU is programmed to use less than all of the time slots allocated to it. Thus, the processing units operate conflict-free while pipeline stages are freed up for processing requests from other sources, such as replacement data from main memory or cache updates.
80 citations
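The slot scheme can be pictured as a fixed alternation that offsets the two CPUs by one pipeline stage, with skipped slots falling through to other sources. A schematic sketch; the stage labels and the skip pattern are invented:

```c
#include <stdio.h>

/* Alternate time slots offset the two CPUs by one pipeline stage:
 * while one CPU's request is in the directory-search stage, the
 * other's is in the data-buffer-read stage. A slot a CPU does not
 * use is free for other sources (illustrated at cycle 4). */
int main(void) {
    const char *stage2 = "idle";          /* data buffer read stage */
    for (int t = 0; t < 6; t++) {
        const char *stage1;               /* directory search stage */
        if (t == 4)
            stage1 = "replacement data from main memory";
        else
            stage1 = (t & 1) ? "CPU1 request" : "CPU0 request";
        printf("cycle %d: search=%-36s read=%s\n", t, stage1, stage2);
        stage2 = stage1;                  /* request advances a stage */
    }
    return 0;
}
```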
•
05 Mar 1984
TL;DR: A host connected outboard back-up and recovery system has a tape drive connected to a plural port solid-state cache memory which, in turn, is connected to a disk drive, as mentioned in this paper.
Abstract: A host connected outboard back-up and recovery system has a tape drive connected to a plural port solid-state cache memory which, in turn, is connected to a disk drive. Data to be backed-up can first be copied from the disk drive to the cache and then from the cache to the tape drive in back-up operations, so that the relative speeds of the disk drive and the tape need not be matched. An outboard controller controls flow of the data between the disk drive and the tape drive and additionally controls storage of the data on the tape, so that host computer involvement is avoided.
78 citations
•
29 Aug 1984
TL;DR: In this paper, a data processing system including virtual-addressed and real-addressed stores is described, where one store is addressed with real addresses and the other store is addressed with virtual addresses.
Abstract: Disclosed is a data processing system including virtual-addressed and real-addressed stores. One store is addressed with real addresses and the other store is addressed with virtual addresses. Whenever an addressed location is to be accessed in a store addressed by the other type of addresses, the address is translated to the other type of address. If a virtual address cannot access the desired location in the virtual store, the virtual address is translated through a virtual-to-real translator to a real address and the location is addressed in the real store. Whenever a real address needs to access a virtual address location in the virtual-addressed store, the real address is converted through a real-to-virtual translator in order to locate corresponding locations in the virtual-addressed memory.
59 citations
•
11 Apr 1984
TL;DR: In this paper, the cache memory is implemented in two memory parts (301, 302) as a two-way interleaved two-way set-associative memory, where one memory part implements odd words of one cache set and even words of the other cache set.
Abstract: In a processing system (10) comprising a main memory (102) for storing blocks (150) of four contiguous words (160) of information, a cache memory (101) for storing selected ones of the blocks, and a two-word wide bus (110) for transferring words from the main memory to the cache, the cache memory is implemented in two memory parts (301, 302) as a two-way interleaved two-way set-associative memory. One memory part implements odd words of one cache set (0), and even words of the other cache set (1), while the other memory part implements even words of the one cache set and odd words of the other cache set. Storage locations (303) of the memory parts are grouped into at least four levels (204) with each level having a location from each of the memory parts and each of the cache sets. The cache receives a block over the bus in two pairs of contiguous words. The cache memory is updated with both words of a word pair simultaneously. The pairs of words are each stored simultaneously into locations of one of either of the cache sets, each word into a location of a different memory part and of a different level. Cache hit check is performed on all locations of a level simultaneously. Simultaneously with the hit check, all locations of the checked level are accessed simultaneously.
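The word-to-part wiring reduces to a small address computation: the memory part is the XOR of word parity and cache set, so the two words of a contiguous pair always span both parts. A sketch under that reading; the level function is a guess, chosen only so that pair-mates land on different levels as the abstract requires:

```c
#include <stdio.h>

/* Mapping for the two-way interleaved, two-way set-associative cache:
 * the memory part is the XOR of word parity and cache set, so one part
 * holds odd words of set 0 and even words of set 1, and vice versa. */
struct location { int part; int level; };

struct location map(unsigned word_addr, int set) {
    struct location loc;
    int parity = word_addr & 1;
    loc.part  = parity ^ set;
    loc.level = ((word_addr >> 1) + parity) % 4;   /* hypothetical */
    return loc;
}

int main(void) {
    for (unsigned w = 0; w < 4; w++)
        for (int set = 0; set < 2; set++) {
            struct location l = map(w, set);
            printf("word %u, set %d -> part %d, level %d\n",
                   w, set, l.part, l.level);
        }
    return 0;
}
```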
•
[...]
16 Feb 1984
TL;DR: In this paper, a disk cache processor stores file extent information received from a host processor in an extent table and reserves a corresponding area in the disk cache buffer; data is stored in that area in response to a WRITE command, and all not-yet-written data for a file area is written to disk in response to a command designating the logical termination of a program.
Abstract: In response to a command including file extent information defining a file area received from a host processor, a disk cache processor stores the file extent information in an extent table and reserves a corresponding area in a disk cache buffer. In response to a WRITE command, the processor stores the data in the corresponding reserved area of the disk cache buffer and sets a write flag of the corresponding entry of a cache directory. When the processor is idling, it writes to a disk unit the data in the disk cache buffer which has not yet been written to the disk unit, and resets the corresponding write flag. In response to a command received from the host processor which has file extent information and designates a logical termination of a program, the processor writes all of the not-yet-written data which corresponds to the file area indicated by the extent information. When an error is caused during data write to the disk unit, the processor stores corresponding error status information into an error log table such that it corresponds to the extent information in the extent table. After data write into the designated area is terminated, the processor fetches the error status information for this file area alone and transfers the fetched information to the host processor.
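The write-flag bookkeeping is the core of the scheme: a set flag marks cache data newer than the disk copy, and idle-time or end-of-program flushes clear it. A condensed sketch with invented structure names:

```c
#include <stdio.h>

#define NENTRIES 8

/* Toy disk-cache directory: a write flag marks data that is in the
 * cache buffer but not yet written to the disk unit. */
struct entry { int used; int write_flag; unsigned file_area; };
static struct entry dir[NENTRIES];

void cached_write(int i, unsigned area) {
    dir[i].used = 1;
    dir[i].file_area = area;
    dir[i].write_flag = 1;          /* data newer than the disk copy */
}

/* Idle-time or end-of-program flush: write not-yet-written data for
 * one file area (or, with all set, for every area) back to disk. */
void flush(unsigned area, int all) {
    for (int i = 0; i < NENTRIES; i++)
        if (dir[i].used && dir[i].write_flag &&
            (all || dir[i].file_area == area)) {
            printf("writing entry %d (area %u) to disk\n",
                   i, dir[i].file_area);
            dir[i].write_flag = 0;  /* disk copy is current again */
        }
}

int main(void) {
    cached_write(0, 7);
    cached_write(1, 9);
    flush(7, 0);   /* logical program termination for file area 7 */
    flush(0, 1);   /* idle-time flush of everything left */
    return 0;
}
```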
•
20 Jun 1984
TL;DR: In this paper, a physical cache unit (100) is described for a computer whose central processor (156) includes an address translation unit (118), an instruction processing unit (126), an address scalar unit (142), a vector control unit (144), and vector processing units (148, 150).
Abstract: A physical cache unit (100) is used within a computer (20). The computer (20) further includes a main memory (99), a memory control unit (22), input/output processors (54, 68) and a central processor (156). The central processor includes an address translation unit (118), an instruction processing unit (126), an address scalar unit (142), a vector control unit (144) and vector processing units (148, 150). The physical cache unit (100) stores operands in a data cache (180), the operands for delivery to and receipt from the central processor (156). Addresses for requested operands are received from the central processor (156) and are examined concurrently during one clock cycle in tag stores (190 and 192). The tag stores (190 and 192) produce tags which are compared in comparators (198 and 200) to the tag of physical addresses received from the central processor (156). If a comparison succeeds (a hit), both of the requested operands are read, during one clock period, from the data cache (180) and transmitted to the central processor (156). If the requested operands are not in the data cache (180), they are fetched from the main memory (99). The operands requested from the main memory (99) within a block are placed in a buffer (188) and/or transmitted directly through a bypass bus (179) to the central processor (156). Concurrently, the block of operands fetched from main memory (99) may be stored in the data cache (180) for subsequent delivery to the central processor (156) upon request. Further, a block of operands from the central processor (156) can be transmitted directly to the memory control unit (22) and bypass the data cache (180).
•
21 Sep 1984
TL;DR: In this paper, the authors propose to send an invalidate signal over the common communications path (68) when a non-path access of the local memory (54) has been made to a location to which access was previously afforded over the common communications path (68).
Abstract: One of a plurality of devices on a common communications path (68) has a local memory (54) that is accessible by other devices on the common communications path (68). Another device on the common communications path (68) may include a cache memory (190) that keeps copies of certain of the data contained by the local memory (54). If another device on the common communications path (68) accesses the local memory (54), the cache (190) is kept apprised of this fact by monitoring of the common communications path (68), and it sets an internal flag to indicate that the data involved may not be valid. However, the contents of memory 54 may also be accessed by means of a processor (50) without using the common communications path (68). Accordingly, provisions are made to send an invalidate signal over the common communications path (68) when a non-path access of the local memory (54) has been made to a location to which access was previously afforded over the common communications path ( 68). In this way, non-path accesses of a local memory can be permitted, yet proper invalidation of cache memories can be performed in a simple manner.
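Both invalidation paths end in the same cache-side action, clearing a valid flag, triggered either by snooping the common path or by the explicit invalidate signal sent after an off-path write. A schematic sketch:

```c
#include <stdio.h>

#define LINES 4
static int cache_valid[LINES];  /* remote cache's copies of local memory */

/* Any write observed on the common communications path. */
void snoop_path_write(int line) { cache_valid[line] = 0; }

/* The memory's own processor writes without using the path, so an
 * explicit invalidate signal must be sent over the path for any
 * location that was previously accessed over it. */
void non_path_write(int line, int previously_shared) {
    if (previously_shared)
        snoop_path_write(line);  /* invalidate signal over the path */
}

int main(void) {
    cache_valid[2] = 1;          /* remote cache holds line 2 */
    non_path_write(2, 1);        /* local CPU writes line 2 off-path */
    printf("line 2 valid: %d\n", cache_valid[2]);
    return 0;
}
```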
•
IBM1
TL;DR: In this article, a redundant error-detecting addressing code for use in a cache memory is presented, where the blocks are expanded to include redundant addressing information such as the logical data address and the physical cache address.
Abstract: A redundant error-detecting addressing code for use in a cache memory. A directory converts logical data addresses to physical addresses in the cache where the data is stored in blocks. The blocks are expanded to include redundant addressing information such as the logical data address and the physical cache address. When a block is accessed from the cache, the redundant addressing is compared to the directory addressing information to confirm that the correct data has been accessed.
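The check itself is a pair of comparisons at access time. A minimal sketch, assuming the directory supplies the expected logical and physical addresses:

```c
#include <stdio.h>
#include <stdlib.h>

/* Each cache block carries redundant addressing: the logical data
 * address and the physical cache address it was stored under. */
struct block {
    unsigned logical_addr;
    unsigned physical_addr;
    int data;
};

/* On access, compare the block's stored addresses with what the
 * directory said; a mismatch means the wrong data was accessed. */
int verified_read(const struct block *b,
                  unsigned dir_logical, unsigned dir_physical) {
    if (b->logical_addr  != dir_logical ||
        b->physical_addr != dir_physical) {
        fprintf(stderr, "cache addressing error detected\n");
        exit(1);
    }
    return b->data;
}

int main(void) {
    struct block b = { 0x1000, 5, 42 };
    printf("data: %d\n", verified_read(&b, 0x1000, 5));
    return 0;
}
```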
•
NEC1
TL;DR: In this paper, a data processing system for vector processing having a main memory accessible in parallel by a plurality of processors, each processor having a cache memory, where, in response to a storage instruction given to the main memory by a processor, a main block of a given size (BS) and having a give start address (B) and containing element data spaced at an interelement distance (D) being preempted as a result of the storage instruction, a single block address invalidation takes place at each cache memory previously having data stored at that main memory location, the single
Abstract: A data processing system for vector processing having a main memory accessible in parallel by a plurality of processors, each processor having a cache memory, wherein, in response to a storage instruction given to the main memory by a processor, with a main memory block of a given size (BS), having a given start address (B), and containing element data spaced at an interelement distance (D) being preempted as a result of the storage instruction, a single block address invalidation takes place at each cache memory previously having data stored at that main memory location, the single block address invalidation corresponding to (BS/D) cache address invalidations, whereby repeated sequential individual cache address invalidation operations for each address in the preempted block are no longer required.
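The saving is pure arithmetic: one block-address invalidation stands in for BS/D individual cache-address invalidations. For example:

```c
#include <stdio.h>

int main(void) {
    unsigned BS = 64;  /* block size */
    unsigned D  = 8;   /* interelement distance */
    /* A single block-address invalidation at each cache replaces
     * BS/D individual cache-address invalidation operations. */
    printf("one block invalidation == %u element invalidations\n",
           BS / D);
    return 0;
}
```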
••
01 Jan 1984
TL;DR: The Static Column RAM devices recently introduced offer the potential for implementing a direct-mapped cache on-chip with only a small increase in complexity over that needed for a conventional dynamic RAM memory system.
Abstract: The Static Column RAM devices recently introduced offer the potential for implementing a direct-mapped cache on-chip with only a small increase in complexity over that needed for a conventional dynamic RAM memory system. Trace-driven simulation shows that such a cache can only be marginally effective if used in the obvious way. However it can be effective in satisfying the requests from a processor containing an on-chip cache. The SCRAM cache is more effective if the processor cache handles both instructions and data.
•
27 Apr 1984
TL;DR: In this paper, a register unit includes means for storing pertinent data relative to a plurality of cache transactions, identifying the zones of an addressed word block which is the subject of the individual transactions.
Abstract: A register unit includes means for storing pertinent data relative to a plurality of cache transactions, identifying the zones of an addressed word block which is the subject of the individual transactions. These data are selectively extracted from the register to control the merging of the identified zone or zones of the associated word with the remainder of the data in the addressed word block.
••
09 Jul 1984
TL;DR: Using a non-write-through cache and the Synapse Expansion Bus, Synapse has designed a symmetric, tightly coupled multiprocessor system, capable of being expanded on line and under power from two through twenty-eight processors with a linear improvement in system performance.
Abstract: The theoretical merits of a tightly coupled multiple-processor/shared-memory architecture have long been recognized. Two major problems in designing such an architecture are the performance limitations imposed by shared-memory bus contention in cached processors and multiple-processor data coherency. In the Synapse system, memory contention was significantly reduced by designing a processor cache employing a non-write-through algorithm, which minimized bandwidth between cache and shared memory. The multicache coherency problem was solved by a new bussing scheme, the Synapse Expansion Bus, which includes an ownership level protocol between processor caches. Using a non-write-through cache and the Synapse Expansion Bus, Synapse has designed a symmetric, tightly coupled multiprocessor system, capable of being expanded on line and under power from two through twenty-eight processors with a linear improvement in system performance.
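The ownership-level protocol can be reduced to a single-owner rule: a cache must acquire ownership on the bus before modifying a block under non-write-through, which removes every other copy. A simplified sketch; the states and transitions are illustrative, not Synapse's exact protocol:

```c
#include <stdio.h>

#define NCACHES 3
typedef enum { ABSENT, SHARED, OWNED } state;
static state c[NCACHES];

/* To write under non-write-through, a cache must first own the block;
 * the bus transaction removes the block from every other cache. */
void acquire_ownership(int me) {
    for (int i = 0; i < NCACHES; i++)
        if (i != me) c[i] = ABSENT;   /* strip other copies */
    c[me] = OWNED;                    /* now writable locally */
}

int main(void) {
    c[0] = SHARED; c[1] = SHARED;
    acquire_ownership(1);             /* cache 1 wants to write */
    for (int i = 0; i < NCACHES; i++)
        printf("cache %d state %d\n", i, c[i]);
    return 0;
}
```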
••
01 Jan 1984
TL;DR: A simple Markov chain model is used to estimate the effect of the cache flushes, and the estimates obtained are within a 12% error margin of the results obtained by cache simulation.
Abstract: A simple Markov chain model is used to estimate the effect of the cache flushes. The model assumes an LRU stack model of program behaviour and geometrically distributed lengths of task switch intervals. Given the LRU stack depth distribution, one may easily compute estimates of miss ratios for caches of all sizes with any desired average task switch interval. The model is validated with three reference strings recorded from simulations of large B7800 Extended Algol programs. The estimates obtained with the Markov model are within a 12% error margin of the results obtained by cache simulation.
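Given the LRU stack depth distribution, the underlying estimate is the tail mass beyond the cache size; the paper's Markov model then layers task-switch flushes on top. A sketch of the no-flush part with a made-up distribution:

```c
#include <stdio.h>

/* Miss-ratio estimate from an LRU stack depth distribution p[d]:
 * a reference at stack depth d hits in a fully associative cache of
 * C lines iff d <= C, so the miss ratio is the tail mass beyond C.
 * Task-switch flushes (the paper's Markov model) would add further
 * misses on top of this baseline. */
#define MAXD 8

int main(void) {
    double p[MAXD + 1] = { 0, 0.40, 0.20, 0.12, 0.08,
                           0.06, 0.05, 0.05, 0.04 };  /* toy data */
    for (int C = 1; C <= MAXD; C++) {
        double miss = 0.0;
        for (int d = C + 1; d <= MAXD; d++)
            miss += p[d];
        printf("cache size %d: estimated miss ratio %.2f\n", C, miss);
    }
    return 0;
}
```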
••
01 Nov 1984
TL;DR: This paper first defines and describes a highly parallel external data handling system and then shows how the capabilities of the system can be used to implement a high performance relational data base machine.
Abstract: This paper first defines and describes a highly parallel external data handling system and then shows how the capabilities of the system can be used to implement a high performance relational data base machine. The elements of the system architecture are (1) an interconnection network which implements both packet routing and circuit switching and which implements data organization functions such as indexing and sort-merge, and (2) an intelligent memory unit with a self-managing cache which implements associative search and capabilities for application of filtering operations on data streaming to and from storage.
•
IBM1
TL;DR: As discussed by the authors, a digital storage unit includes several sections having different timing characteristics, operated in overlap mode with a common address and control circuit; the fast sections store data which are accessed more frequently or which are addressed first when data blocks are transferred.
Abstract: A digital storage unit includes several sections having different timing characteristics. The sections are operated in the overlap mode and share a common address and control circuit. The fast sections store data which are accessed more frequently or which are addressed first when data blocks are transferred. The digital storage may be used in a storage hierarchy comprising a cache storage, in which only data blocks positioned at main storage address boundaries are transferred to the cache storage.
•
[...]
12 Dec 1984
TL;DR: In this paper, the authors speed up database file access by adopting criteria for reserving cache buffer areas and a cache buffer use mode in a disk cache system.
Abstract: PURPOSE: To speed up database file access by adopting criteria for reserving cache buffer areas and a cache buffer use mode in the disk cache system. CONSTITUTION: A user task 11 generates information by accessing a data base file 3 and issues a processing request to an operating system OS 16. For actual disk input and output, the user's access information and the information of the control block (FCB) peculiar to the file are transmitted to a file server 2. The file server 2 recognizes the block corresponding to the transmitted basic input/output request and retrieves a cache management table 18; if this retrieval results in a hit, no physical input/output is performed, and data is transferred between a cache area 19 and a user buffer 20 indicated by a data base/file managing routine 17. If the retrieval does not result in a hit, an area is secured in the cache area 19, and physical input/output is performed.
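The hit/miss path through the file server is conventional cache-table logic: a hit copies between the cache area and the user buffer with no physical I/O, while a miss secures a slot and performs the physical transfer. A condensed sketch with invented structures:

```c
#include <stdio.h>
#include <string.h>

#define TABLE 8
struct slot { int used; unsigned block; char data[16]; };
static struct slot cache[TABLE];

/* File-server read path: a hit avoids physical I/O entirely. */
void server_read(unsigned block, char *user_buf) {
    for (int i = 0; i < TABLE; i++)
        if (cache[i].used && cache[i].block == block) {
            memcpy(user_buf, cache[i].data, 16);  /* hit: cache -> user */
            printf("block %u: hit, no physical I/O\n", block);
            return;
        }
    /* miss: secure an area in the cache, then do the physical input */
    for (int i = 0; i < TABLE; i++)
        if (!cache[i].used) {
            cache[i].used = 1;
            cache[i].block = block;
            snprintf(cache[i].data, 16, "disk:%u", block); /* disk read */
            memcpy(user_buf, cache[i].data, 16);
            printf("block %u: miss, physical I/O performed\n", block);
            return;
        }
}

int main(void) {
    char buf[16];
    server_read(3, buf);   /* miss */
    server_read(3, buf);   /* hit */
    return 0;
}
```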
01 Jan 1984
TL;DR: This paper looks at each component of the memory hierarchy and addresses two issues: what are likely directions for development, and what are the interesting research problems.
Abstract: The effective and efficient use of the memory hierarchy of the computer system is one of the most important aspects, if not the single most important aspect, of computer system design and use. Cache memory performance is often the limiting factor in CPU performance, and cache memories also serve to cut the memory traffic in multiprocessor systems. Multiprocessor systems are also requiring advances in cache architecture with respect to cache consistency. Similarly, the study of the best means to share main memory is an important research topic. Disk cache is becoming important for performance in high end computer systems and is now widely available commercially; there are many related research problems. The development of mass storage, especially optical disk, will promote research in effective algorithms for file management and migration. In this paper, we look at each component of the memory hierarchy and address two issues: what are likely directions for development, and what are the interesting research problems.
•
IBM1
TL;DR: In this paper, a masking circuit for a multiprocessor system is disclosed that senses the existence and type of commands stored in the command status registers associated with the system processors.
Abstract: A masking circuit for a multiprocessor system is disclosed. The masking circuit senses the existence and type of commands stored in the command status registers associated with the system processors. Masking begins if it is determined that information needed by one processor is located in the cache memory of another processor and is to be flushed to the main memory, which is accessible by the first processor. The masking circuit masks the command present in the command status register associated with the first processor, for the first processor to access the main memory, until after the information has been flushed from the cache to the main memory. The first processor is thus prevented from accessing the main memory until after the information has been flushed thereto.
•
06 Apr 1984
TL;DR: In this article, a cache processor 5 retrieves a directory 100 and keeps the hit count of each data block in a counter 19; a reject-block-determining parameter (RBP) computed from the hit count and the LRU stack depth is stored in a register 11, compared by a comparator 15 with the maximum RBP value held in a register 8, and the value selected by a selector 14 in accordance with the output is stored in the register 8.
Abstract: PURPOSE: To enhance the efficiency of a cache memory by also considering the hit count when the block data to be rejected is determined. CONSTITUTION: A cache processor 5 retrieves a directory 100, and the hit count of each data block is stored in a counter 19. The entry number of the block data, whose counter 19 value gives its depth in an LRU stack, is transferred to a part 12 of a register 11, and the hit count of this block data is transferred to a reject-block-determining parameter (RBP) operator 18. The RBP operator 18 calculates an RBP value on the basis of the hit count of the block and the stack depth from the counter 19 and stores the result in a part 13 of the register 11. The value stored in this register 11 is compared by a comparator 15 with the entry number having the maximum RBP value from a register 8, and a value selected by a selector 14 in accordance with the output is stored in the register 8. Thus, block data having a smaller hit count is rejected preferentially.
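The selection logic folds LRU stack depth and hit count into one reject-block parameter and evicts the entry with the maximum value. The abstract does not give the RBP formula, so the one below (depth over hits-plus-one) is purely illustrative:

```c
#include <stdio.h>

struct entry { unsigned stack_depth; unsigned hit_count; };

/* Hypothetical RBP: deeper in the LRU stack and fewer hits yields a
 * larger value, i.e. a better eviction candidate. The real operator
 * 18's formula is not specified in the abstract. */
double rbp(const struct entry *e) {
    return (double)e->stack_depth / (e->hit_count + 1);
}

int main(void) {
    struct entry set[3] = { {5, 9}, {6, 0}, {2, 1} };
    int victim = 0;
    for (int i = 1; i < 3; i++)
        if (rbp(&set[i]) > rbp(&set[victim]))
            victim = i;
    printf("evict entry %d (depth %u, hits %u)\n",
           victim, set[victim].stack_depth, set[victim].hit_count);
    return 0;
}
```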
•
19 Mar 1984
TL;DR: In this paper, a controller is proposed for communication between the auxiliary processor and a cache mechanism in the system interface, which communication is to be carried on independently of the main memory accesses required to update the cache mechanism in an overlapped manner.
Abstract: A controller for communication between the auxiliary processor and a cache mechanism in the system interface, which communication is to be carried on independently of main memory accesses required to update the cache mechanism in an overlapped manner.
•
07 Jan 1984
TL;DR: In this article, for a data transfer between a CPU and a main memory (MEM), the CPU designates the sub-block containing the necessary data when issuing a transfer request to the MEM.
Abstract: PURPOSE: To raise transfer efficiency without incurring a wait for the start of a transfer, by having the CPU designate the sub-block containing the necessary data and issue a transfer request to the MEM, in the case of a data transfer between the CPU and the MEM. CONSTITUTION: Data from banks 321-328 of a main storage device MEM 30 are transferred into the sub-block store areas 111-118 and 211-218 of cache memories 11 and 21. When a request PQ0 to transfer the data to be stored in sub-block 111 is issued to the MEM 30 by a CPU 10 at time 0, the MEM 30 starts the transfer from bank 321. On the other hand, when a transfer request headed by the data of sub-block 218 is issued by a CPU 20, a control part 31 starts the optimum transfer, namely a transfer from bank 324. In this way, the transfer time is shortened compared with the conventional case, in which the transfer is executed only after waiting for the CPU's transfer.
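The optimization is critical-word-first at sub-block granularity: the transfer starts at the bank holding the requested sub-block and wraps around, so the CPU is never kept waiting by sub-blocks it did not ask for. A sketch:

```c
#include <stdio.h>

#define NBANKS 8

/* Start the block transfer at the bank holding the requested
 * sub-block and wrap around, instead of always starting at bank 0. */
void transfer(int requested_sub) {
    for (int i = 0; i < NBANKS; i++) {
        int bank = (requested_sub + i) % NBANKS;
        printf("cycle %d: bank %d -> sub-block store %d\n",
               i, bank, bank);
    }
}

int main(void) {
    transfer(3);   /* CPU needs sub-block 3 first: bank 3 goes first */
    return 0;
}
```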
•
IBM1
TL;DR: In this article, the authors propose a method for Direct Access Storage Device (DASD) cache management that reduces the volume of data transfer between DASD (27, 29, 53) and cache (101) while avoiding the complexity of managing variable length records in the cache.
Abstract: A method for Direct Access Storage Device (DASD) cache management that reduces the volume of data transfer between DASD (27, 29, 53) and cache (101) while avoiding the complexity of managing variable length records in the cache. This is achieved by always choosing the starting point for staging a record to be at the start of the missing record and, at the same time, allocating and managing cache space in fixed length blocks. The method steps require staging records, starting with the requested record and continuing until either the cache block is full, the end of track is reached, or a record already in the cache is encountered.
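The staging rule is a single loop with the three stop conditions named above. A sketch with invented record and block abstractions:

```c
#include <stdio.h>

#define BLOCK_CAPACITY 4   /* fixed-length cache block, in records */
#define TRACK_RECORDS  10

static int in_cache[TRACK_RECORDS];   /* which records are cached */

/* Stage records starting at the missing one; stop when the cache
 * block is full, the end of track is reached, or a record already
 * in the cache is encountered. */
void stage(int missing_record) {
    int staged = 0;
    for (int r = missing_record; r < TRACK_RECORDS; r++) {
        if (staged == BLOCK_CAPACITY) { puts("stop: block full"); return; }
        if (in_cache[r])              { puts("stop: record cached"); return; }
        in_cache[r] = 1;
        staged++;
        printf("staged record %d\n", r);
    }
    puts("stop: end of track");
}

int main(void) {
    in_cache[6] = 1;       /* record 6 is already resident */
    stage(3);              /* miss on record 3 */
    return 0;
}
```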