
Showing papers on "Memory management" published in 1988



Patent
13 Dec 1988
TL;DR: In this paper, a fault-tolerant configuration employs multiple identical CPUs executing the same instruction stream, with multiple identical memory modules in the address space of the CPUs storing duplicates of the same data.
Abstract: A computer system in a fault-tolerant configuration employs multiple identical CPUs executing the same instruction stream, with multiple identical memory modules in the address space of the CPUs storing duplicates of the same data. The multiple CPUs are loosely synchronized, as by detecting events such as memory references and stalling any CPU ahead of the others until all execute the function simultaneously; interrupts can be synchronized by ensuring that all CPUs implement the interrupt at the same point in their instruction stream. Memory references by the multiple CPUs are voted by each of the memory modules. A private-write area is included in the shared memory space in the memory modules to allow functions such as software voting of state information unique to each CPU. All CPUs write state information to their private-write areas, then all CPUs read all the private-write areas for functions such as detecting differences in interrupt cause or the like.

185 citations
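The private-write mechanism lends itself to a short illustration. Below is a minimal C sketch of the software-voting step the abstract describes: each CPU publishes its state to its own slot, then every CPU majority-votes across all slots. The structure layout, the three-CPU count, and all names are assumptions for illustration, not the patent's implementation.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_CPUS 3  /* assumed CPU count */

/* Hypothetical layout of one private-write slot per CPU in shared memory. */
typedef struct {
    uint32_t interrupt_cause;   /* state unique to this CPU */
} private_slot_t;

/* Each CPU writes its own slot; then every CPU reads all slots and
 * majority-votes the state, flagging any CPU whose copy disagrees. */
static bool vote_interrupt_cause(volatile const private_slot_t *slots,
                                 uint32_t *voted_cause, int *dissenter)
{
    *dissenter = -1;
    for (int i = 0; i < NUM_CPUS; i++) {
        int agree = 0;
        for (int j = 0; j < NUM_CPUS; j++)
            if (slots[j].interrupt_cause == slots[i].interrupt_cause)
                agree++;
        if (2 * agree > NUM_CPUS) {              /* strict majority */
            *voted_cause = slots[i].interrupt_cause;
            for (int j = 0; j < NUM_CPUS; j++)
                if (slots[j].interrupt_cause != *voted_cause)
                    *dissenter = j;              /* CPU whose state differs */
            return true;
        }
    }
    return false;   /* no majority: disagreement is unresolvable */
}
```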


Journal ArticleDOI
Robert Courts1
TL;DR: An adaptive memory management algorithm allows substantial improvement in locality of reference in garbage-collected systems and indicates that page-wait time typically is reduced by a factor of four with constant memory size and disk technology.
Abstract: Modern Lisp systems make heavy use of a garbage-collecting style of memory management. Generally, the locality of reference in garbage-collected systems has been very poor. In virtual memory systems, this poor locality of reference generally causes a large amount of wasted time waiting on page faults or uses excessively large amounts of main memory. An adaptive memory management algorithm, described in this article, allows substantial improvement in locality of reference. Performance measurements indicate that page-wait time typically is reduced by a factor of four with constant memory size and disk technology. Alternatively, the size of memory typically can be reduced by a factor of two with constant performance.

142 citations


Journal ArticleDOI
01 Mar 1988
TL;DR: A methodology is proposed that facilitates analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters to identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration.
Abstract: Linear algebra algorithms based on the BLAS or extended BLAS do not achieve high performance on multivector processors with a hierarchical memory system because of a lack of data locality. For such machines, block linear algebra algorithms must be implemented in terms of matrix-matrix primitives (BLAS3). Designing efficient linear algebra algorithms for these architectures requires analysis of the behavior of the matrix-matrix primitives and the resulting block algorithms as a function of certain system parameters. The analysis must identify the limits of performance improvement possible via blocking and any contradictory trends that require trade-off consideration. We propose a methodology that facilitates such an analysis and use it to analyze the performance of the BLAS3 primitives used in block methods. A similar analysis of the block size-performance relationship is also performed at the algorithm level for block versions of the LU decomposition and the Gram-Schmidt orthogonalization procedures.

138 citations
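Blocking is the core idea behind the BLAS3 primitives the paper analyzes: operate on submatrices small enough that operands stay in the fast memory level and are reused many times. Below is a generic C sketch of a blocked matrix multiply; the block size NB plays the role of the paper's tunable system parameter, and the kernel is an illustration, not the paper's methodology. N is assumed divisible by NB.

```c
#define N  512   /* matrix dimension (assumed divisible by NB) */
#define NB 64    /* block size: three NBxNB blocks should fit in cache */

/* Blocked matrix multiply, C += A*B.  Each (i0,j0,k0) step works on
 * NBxNB submatrices, so every element loaded is reused NB times,
 * which is the locality the BLAS3 primitives exist to exploit. */
void matmul_blocked(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i0 = 0; i0 < N; i0 += NB)
        for (int j0 = 0; j0 < N; j0 += NB)
            for (int k0 = 0; k0 < N; k0 += NB)
                for (int i = i0; i < i0 + NB; i++)
                    for (int j = j0; j < j0 + NB; j++) {
                        double s = C[i][j];
                        for (int k = k0; k < k0 + NB; k++)
                            s += A[i][k] * B[k][j];
                        C[i][j] = s;
                    }
}
```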


Patent
22 Jan 1988
TL;DR: In this paper, a modular, expandable, topologically-distributed-memory multiprocessor computer comprises a plurality of non-directly communicating slave processors under the control of a synchronizer and a master processor.
Abstract: A modular, expandable, topologically-distributed-memory multiprocessor computer comprises a plurality of non-directly communicating slave processors under the control of a synchronizer and a master processor. Memory space is partitioned into a plurality of memory cells. Dynamic variables may be mapped into the memory cells so that they depend upon processing in nearby partitions. Each slave processor is connected in a topologically well-defined way through a dynamic bi-directional switching system (gateway) to different respective ones of the memory cells. Access by the slave processors to their respective topologically similar memory cells occurs concurrently or in parallel in such a way that no data-flow conflicts occur. The topology of data distribution may be chosen to take advantage of symmetries which occur in broad classes of problems. The system may be tied to a host computer used for data storage and analysis of data not efficiently processed by the multiprocessor computer.

133 citations


Proceedings ArticleDOI
01 Oct 1988
TL;DR: A method is presented that uses data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes; the estimates take the form of a family of “reference” windows for each variable, reflecting the current set of elements that should be kept in cache.
Abstract: In this paper we describe a method for using data dependence analysis to estimate cache and local memory demand in highly iterative scientific codes. The estimates take the form of a family of “reference” windows for each variable that reflects the current set of elements that should be kept in cache. It is shown that, in important special cases, we can estimate the size of the window and predict a lower bound on the number of cache hits. If the machine has local memory or cache that can be managed by the compiler, these estimates can be used to guide the management of this resource. It is also shown that these estimates can be used to guide program transformations in an attempt to optimize cache performance.

97 citations
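The window idea can be made concrete with a dependence-distance example: if a loop carries dependences of distance d, each element is reused within d iterations, so a window of the last max-distance elements guarantees those reuses hit. The C sketch below is an assumed illustration of such an estimate, not the authors' algorithm.

```c
#include <stddef.h>

/* If a variable is referenced with loop-carried dependence distances
 * d[0..n-1], a "reference" window holding the most recent max(d[i])
 * elements guarantees that every such reuse hits in cache. */
static size_t window_size(const size_t d[], size_t n)
{
    size_t w = 0;
    for (size_t i = 0; i < n; i++)
        if (d[i] > w)
            w = d[i];
    return w;
}

/* Example: for (i = 2; i < N; i++) A[i] = A[i-1] + A[i-2];
 * distances {1, 2} give a 2-element window for A, so keeping the
 * last two elements resident makes both reads guaranteed hits. */
```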


Patent
16 Mar 1988
TL;DR: In this paper, a multiprocessing system is presented having a plurality of processing nodes interconnected by a communication network, each node including a processor responsive to user software running on the system and an associated memory module; under user control, each memory module can be dynamically partitioned into global storage efficiently accessible by a number of processors connected to the network and local storage efficiently accessible by its associated processor.
Abstract: A multiprocessing system is presented having a plurality of processing nodes interconnected by a communication network, each processing node including a processor, responsive to user software running on the system, and an associated memory module. Under user control, the system can dynamically partition each memory module into global storage efficiently accessible by a number of processors connected to the network and local storage efficiently accessible by its associated processor.

83 citations


Proceedings ArticleDOI
01 Jun 1988
TL;DR: An overview is presented of some of the mathematical issues behind several of the problems associated with restructuring software to take advantage of parallel supercomputer architectures with complex memory hierarchies or distributed memory systems.
Abstract: Parallel supercomputer architectures with complex memory hierarchies or distributed memory systems have become very common. Unfortunately, the problems associated with restructuring software to take advantage of these memory systems are not easily solved. This paper presents an overview of some of the mathematical issues behind several of these problems and attempts to give a brief look at some of the potential solutions.

81 citations


Patent
Alan J. Deerfield1, Sun-Chi Siu1
05 Dec 1988
TL;DR: In this paper, a memory is presented having an address generator in an intelligent port which generates address sequences specified by an array transformation operator in a programmable processor, allowing a controlling processor to proceed immediately to preparation of the next instruction in parallel with memory execution of the present one.
Abstract: A memory having an address generator in an intelligent port which generates address sequences specified by an array transformation operator in a programmable processor, thereby allowing a controlling processor to proceed immediately to the preparation of the next instruction in parallel with memory execution of a present instruction. The intelligent port of the memory creates complex data structures from input data arrays stored in memory and directs the transformation of the data structures into output data streams. The memory comprises a plurality of read-write memory banks and a bank of read-only memory interconnected through intelligent ports and busses to other units of the processor. An arbitration and switching network assigns memory banks to the intelligent ports.

73 citations
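What an address-generating port does can be shown with one common array transformation. The C sketch below emits the address sequence for reading a row-major matrix in transposed order; a hardware generator would stream this same sequence to the memory banks while the controlling processor prepares its next instruction. The function and its interface are hypothetical.

```c
#include <stddef.h>

/* Emit the word addresses that read an R x C row-major array in
 * column-major (transposed) order.  A hardware address generator
 * would produce this sequence itself, leaving the controlling
 * processor free to set up the next instruction. */
static void gen_transpose_addrs(size_t base, size_t rows, size_t cols,
                                size_t *out /* holds rows*cols entries */)
{
    size_t n = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            out[n++] = base + r * cols + c;
}
```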


Patent
19 Feb 1988
TL;DR: In this paper, a uniform memory system for use with symbolic computers is presented that has a very large virtual address space and no separate files outside the address space of the virtual memory.
Abstract: A uniform memory system for use with symbolic computers has a very large virtual address space. There are no separate files that are not directly addressable in the address space of the virtual memory. A special object, the persistent root, defines memory objects which are relatively permanent, such objects being traceable by pointers from the persistent root. A tombstone mechanism is used to prevent objects from referencing deleted objects.

72 citations
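The tombstone mechanism admits a compact sketch: deletion replaces the object behind a handle with a marker, so a stale reference is detected rather than silently dereferencing reclaimed memory. The C below is an assumed illustration of the general technique, not the patent's design.

```c
#include <stdlib.h>
#include <stdbool.h>

/* References go through a handle; deleting the object leaves a
 * tombstone behind, so stale pointers are detected, not followed. */
typedef struct {
    void *object;     /* NULL once the object is deleted */
    bool  tombstone;  /* set when the object is reclaimed */
} handle_t;

static void *deref(handle_t *h)
{
    return h->tombstone ? NULL : h->object;  /* NULL = dangling reference */
}

static void delete_object(handle_t *h)
{
    free(h->object);
    h->object    = NULL;
    h->tombstone = true;   /* later accesses fail safely */
}
```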


Patent
28 Jun 1988
TL;DR: In this paper, a memory system backup for tightly or loosely coupled multiprocessor systems is proposed, where a plurality of primary memory units (13, 14, 15) having substantially the same configuration are backed up by a single memory unit of similar configuration.
Abstract: A memory system backup for use in a tightly or loosely coupled multiprocessor system. A plurality of primary memory units (13, 14, 15) having substantially the same configuration are backed up by a single memory unit (20) of similar configuration. The backup memory unit holds the checksum of all data held in all primary memory units. In the event of the failure of one of the primary memory units, its contents can be recreated from the data in the remaining non-failed memory units and the checksum in the backup unit.
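If the checksum is taken to be a word-wise XOR (an assumption; the patent's checksum need not be XOR), the recreate step works like RAID-style parity: the lost unit's data is the XOR of the surviving units with the backup. A minimal C sketch:

```c
#include <stdint.h>
#include <stddef.h>

#define UNITS 3   /* primary memory units, e.g. items 13, 14, 15 */

/* Rebuild failed primary unit `failed` from the survivors plus the
 * backup unit, which holds the XOR of every unit's corresponding word. */
static void recreate_unit(uint32_t *primary[UNITS], const uint32_t *backup,
                          size_t words, int failed)
{
    for (size_t w = 0; w < words; w++) {
        uint32_t v = backup[w];          /* XOR over all units */
        for (int u = 0; u < UNITS; u++)
            if (u != failed)
                v ^= primary[u][w];      /* cancel the surviving units */
        primary[failed][w] = v;          /* the lost word */
    }
}
```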


01 Jan 1988
TL;DR: A new distributed algorithm is shown to outperform centralized ones and provide unrestricted sharing of read-write memory between tasks running on either strongly coupled or loosely coupled architectures, and any mixture thereof.
Abstract: This report describes the design, implementation and performance evaluation of a virtual shared memory server for the Mach operating system. The server provides unrestricted sharing of read-write memory between tasks running on either strongly coupled or loosely coupled architectures, and any mixture thereof. A number of memory coherency algorithms have been implemented and evaluated, including a new distributed algorithm that is shown to outperform centralized ones. Some of the features of the server include support for machines with multiple page sizes, for heterogeneous shared memory, and for fault tolerance. Extensive performance measures of applications are presented, and the intrinsic costs evaluated.

16 Feb 1988
TL;DR: mprof, as discussed by the authors, is a two-phase tool that records the amount of memory each function allocates, breaks down allocation information by type and size, and displays a program's dynamic call graph so that functions indirectly responsible for memory allocation are easy to identify.
Abstract: This paper describes mprof, a tool used to study the memory allocation behavior of programs. mprof records the amount of memory each function allocates, breaks down allocation information by type and size, and displays a program's dynamic call graph so that functions indirectly responsible for memory allocation are easy to identify. mprof is a two-phase tool. The monitor phase is linked into executing programs and records information each time memory is allocated. The display phase reduces the data generated by the monitor and displays the information to the user in several tables. mprof has been implemented for C and Kyoto Common Lisp. Measurements of these implementations are presented.
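The monitor/display split can be approximated with a wrapper around malloc. The C sketch below bins allocations into power-of-two size classes in the monitor phase and reduces them to a table in the display phase; attributing allocations to functions and the dynamic call graph, as mprof does, is omitted for brevity, and all names here are hypothetical.

```c
#include <stdlib.h>
#include <stdio.h>

#define SIZE_BINS 32

static unsigned long alloc_count[SIZE_BINS];  /* allocations per class */
static unsigned long alloc_bytes[SIZE_BINS];  /* bytes per class */

static int size_bin(size_t n)                 /* power-of-two classes */
{
    int b = 0;
    while (n > 1 && b < SIZE_BINS - 1) { n >>= 1; b++; }
    return b;
}

void *mon_malloc(size_t n)                    /* monitor phase */
{
    int b = size_bin(n);
    alloc_count[b]++;
    alloc_bytes[b] += n;
    return malloc(n);
}

void mon_report(void)                         /* display phase */
{
    for (int b = 0; b < SIZE_BINS; b++)
        if (alloc_count[b])
            printf("size class 2^%-2d: %10lu allocs %12lu bytes\n",
                   b, alloc_count[b], alloc_bytes[b]);
}
```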


Journal ArticleDOI
Per Stenström1
TL;DR: The techniques that can be used to design a memory system that reduces the impact of contention are examined and the implementations and the design decisions taken in each are reviewed.
Abstract: The techniques that can be used to design a memory system that reduces the impact of contention are examined. To exemplify the techniques, the implementations and the design decisions taken in each are reviewed. The discussion covers memory organization, interconnection networks, memory allocation, cache memory, and synchronization and contention. The multiprocessor implementations considered are C.mmp, CM*, RP3, Alliant FX, Cedar, Butterfly, SPUR, Dragon, Multimax, and Balance.

Journal ArticleDOI
17 May 1988
TL;DR: This paper reports the results of a study of VAX 8800 processor performance using a hardware monitor that collects histograms of the processor's micro-PC and memory bus status, which yields a very detailed picture of the amount of time an average VAX instruction spends in various activities on the 8800.
Abstract: This paper reports the results of a study of VAX 8800 processor performance using a hardware monitor that collects histograms of the processor's micro-PC and memory bus status. The monitor keeps a count of all machine cycles executed at each micro-PC location, as well as counting all occurrences of each bus transaction. It can measure a running system without interfering with it, and this paper's results are based on measurements of live timesharing. Because the 8800 is a microcoded machine, a great deal of information can be gleaned from these data. The paper reports opcode and operand specifier frequencies, as well as the amount of time spent in instruction execution and various kinds of overhead, such as memory management and cache-wait stalls. The histogram method yields a very detailed picture of the amount of time an average VAX instruction spends in various activities on the 8800.
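The measurement method reduces to counting: one counter per micro-PC location, incremented every cycle it is observed, plus counters per bus transaction. A tiny C sketch of the micro-PC histogram (the micro-store size is an assumption):

```c
#include <stdint.h>

#define MICRO_PC_SLOTS 16384   /* assumed micro-store size */

static uint64_t cycles_at[MICRO_PC_SLOTS];

/* Conceptually invoked once per machine cycle with the sampled
 * micro-PC; over a long run the histogram apportions total time
 * among instruction execution, memory management, stalls, etc. */
static inline void sample_micro_pc(uint16_t upc)
{
    cycles_at[upc % MICRO_PC_SLOTS]++;
}
```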

Patent
05 Dec 1988
TL;DR: In this article, the process manager assigns processes to processors and satisfies their initial memory requirements through global memory allocations, and deallocates to uncommitted memory both memory that is dynamically requested to be deallocated and memory of terminating processes.
Abstract: In a multiprocessor system (FIG. 1) wherein each adjunct processor has its own, non-shared, memory (22) the non-shared memory of each adjunct processor (11-12) comprises global memory (42) and local memory (41). All global memory of all adjunct processors is managed by a single process manager (30) of a system-wide host processor (10). Each processor's local memory is managed by its operating system kernel (31). Local memory comprises uncommitted memory (45) not allocated to any process and committed memory (46) allocated to processes. The process manager assigns processes to processors and satisfies their initial memory requirements through global memory allocations. Each kernel satisfies processes' dynamic memory allocation requests from uncommitted memory, and deallocates to uncommitted memory both memory that is dynamically requested to be deallocated and memory of terminating processes. Each processor's kernel and the process manager cooperate to transfer memory between global memory and uncommitted memory to keep the amount of uncommitted memory within a predetermined range.
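The kernel/process-manager cooperation described here is essentially a hysteresis loop on the uncommitted pool: fall below a low-water mark and the kernel obtains more global memory; rise above a high-water mark and it returns the excess. A C sketch under assumed names and thresholds (these are not the patent's interfaces):

```c
#include <stddef.h>

#define LOW_WATER  (256 * 1024)    /* assumed thresholds */
#define HIGH_WATER (1024 * 1024)

extern size_t uncommitted_bytes(void);           /* kernel bookkeeping  */
extern void   global_alloc(size_t want);         /* ask process manager */
extern void   global_release(size_t give_back);  /* return to the pool  */

/* Keep the uncommitted pool inside [LOW_WATER, HIGH_WATER] by trading
 * memory with the host's global pool, as the kernel and the process
 * manager do cooperatively. */
void balance_uncommitted(void)
{
    size_t u = uncommitted_bytes();
    if (u < LOW_WATER)
        global_alloc(HIGH_WATER - u);    /* grow toward the high mark */
    else if (u > HIGH_WATER)
        global_release(u - HIGH_WATER);  /* shed the excess */
}
```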

Patent
29 Jul 1988
TL;DR: A photo printer system comprises a magnetic storage for storing print data sent from a host computer, a bit map memory for storing print dot data, and a printer engine for printing the contents of the bit map memory, as discussed by the authors.
Abstract: A photo printer system comprises a magnetic storage for storing print data sent from a host computer, a bit map memory for storing print dot data, and a printer engine for printing the contents of the bit map memory. The system includes a program which operates on the magnetic storage to serve as an external storage for the host computer and on the bit map memory to serve as a cache memory in response to a data read/write command issued by the host computer, and a CPU which controls the execution of the program. At least in a non-print process mode, the system forms a data path so that the host computer can access the bit map memory and magnetic storage for bidirectional data read/write operations. In another mode, the system forms a data path so that image data picked up with an image scanner is saved directly in the bit map memory and the image data is sent to the host computer on request.

Journal ArticleDOI
D.P. Ryan1
TL;DR: The register model, core instruction set, register operations, memory operations, control operations, instruction cache, user-supervisor protection, interrupts, faults, and debug support are presented.
Abstract: Important features and capabilities of the 80960 are briefly examined, and an overview of its architecture is given. A detailed discussion is presented of the register model, core instruction set, register operations, memory operations, control operations, instruction cache, user-supervisor protection, interrupts, faults, and debug support.

Proceedings ArticleDOI
Alok Aggarwal1, Ashok K. Chandra1
01 Jan 1988
TL;DR: Computer programs are usually written with the illusion that they will run on something like a random access machine (RAM) with a large memory, all locations of which are equally fast, but in practice this is far from the truth.
Abstract: Computer programs are usually written with the illusion that they will run on something like a random access machine (RAM) [AHU74], with a large memory, all locations of which are equally fast. In practice, this is far from the truth. In large machines, for example, the range of speeds from the fastest memory (registers, at about 10 ns) to the slowest (disks or mass store, at 10 ms or seconds) can be a factor of a million or even a billion! Machine designers attempt to smooth out this range, to the extent that is technologically feasible, by providing many levels of memory in between. These memory levels may include one or two levels of cache, main memory, expanded store, and drums. The programs, of course, run in virtual memory. The hardware and the operating system …

Journal ArticleDOI
TL;DR: An approximate, closed-form solution is given that is simple and easy to use for any number of processors, buses, or memory modules and for arbitrary memory block size.
Abstract: A simple queueing model is presented for studying the effect of multiple-bus interconnection networks on the performance of asynchronous multiprocessor systems. The proposed model is suitable for systems in which each processor has a local memory and is thus able to continue processing while waiting for a response from the global memory. An approximate, closed-form solution is given that is simple and easy to use for any number of processors, buses, or memory modules and for arbitrary memory block size. The model is used to study the access time of the global memory as a function of the number of buses for different local-memory/global-memory traffic rates.


Proceedings ArticleDOI
27 Mar 1988
TL;DR: The authors consider a videotex system architecture where user requests are processed by a service computer, and the requested information pages are broadcast to all users, and features such as scheduling page broadcasts, memory management, and disk scheduling are represented explicitly in the model.
Abstract: The authors consider a videotex system architecture where user requests are processed by a service computer, and the requested information pages are broadcast to all users. Due to the large volume of information that is typically available, a secondary storage device, such as a disk, is used to hold the database. However, a small fraction of the information pages may be kept in main memory. A detailed simulation model is used to study the performance of this system architecture. Features such as scheduling page broadcasts, memory management, and disk scheduling are represented explicitly in the model. Simulation results are presented to show the response-time performance of various memory-management and disk scheduling strategies.

Journal ArticleDOI
B. Liu1, N. Strother1
TL;DR: Programming techniques necessary for high performance on the 3090 Vector Facilities are illustrated, showing that VS Fortran programs can achieve near-maximum execution rates.
Abstract: Programming techniques necessary for high performance on the 3090 Vector Facilities are illustrated, showing that VS Fortran programs can achieve near-maximum execution rates. Relevant features of the 3090 architecture are reviewed, stressing the need to make efficient use of a hierarchical storage system and take advantage of the compound vector instructions. The key programming techniques for managing the storage hierarchy are loop sectioning, loop distribution, and data compaction. Vector register reuse, cache reuse, and virtual memory storage format and page reuse are shown to lead to efficient use of the vector registers, the high-speed cache, and the virtual memory system, respectively. The multiply-and-add compound instruction is discussed.
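Loop sectioning (strip-mining) is easy to show outside Fortran as well. The C sketch below processes a long vector in fixed-length sections so each section's operands fit the vector registers and cache; the section length is an assumed parameter, and the kernel is a generic illustration rather than the article's code.

```c
#define VL 128   /* assumed section length (vector register capacity) */

/* Sectioned (strip-mined) vector update y += a*x: each section's
 * operands stay in the vector registers and cache, so they can be
 * reused by subsequent statements operating on the same section. */
void axpy_sectioned(int n, double a, const double *x, double *y)
{
    for (int s = 0; s < n; s += VL) {
        int len = (n - s < VL) ? n - s : VL;
        for (int i = 0; i < len; i++)    /* one vector section */
            y[s + i] += a * x[s + i];
    }
}
```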

Journal ArticleDOI
TL;DR: The results of the present experiments support a version of multiple-resource theory applied to working memory in which resource composition depends on internal mediators even when stimulus and response modality are held constant.
Abstract: A frequent assumption in cognitive psychology is that performance in decision making and planning is severely restricted by the limited capacity of short-term working memory. Many predictions of this theory have not been supported, possibly because working memory may be composed of multiple resources rather than a single resource. The present experiments study two tasks, both involving memory for digits. Although these tasks can employ the same modality for input and for responding, they appear to differ in their demands for working memory resources. Specifically, the tasks appear to differ in resources required for processing at input, and they also differ in resources in the sense of storage capacity. The results support a version of multiple-resource theory applied to working memory in which resource composition depends on internal mediators even when stimulus and response modality are held constant.

Journal ArticleDOI
TL;DR: Although it is aimed at graphics systems, the TMS34010's large address reach, bit-field processing, and DRAM (dynamic random-access memory) interface make it suitable for many other embedded processing applications.
Abstract: The authors discuss the TMS34010, a high-performance 32-bit microprocessor with special instructions and hardware for handling the bit-field data and address manipulations often associated with computer graphics. They give a history of embedded microprocessors and examine the wide range of processors and applications covered by that term. They provide an overview of the internal architecture of the TMS34010 and discuss the choice of feature set in its design. Although it is aimed at graphics systems, the processor's large address reach, bit-field processing, and DRAM (dynamic random-access memory) interface make it suitable for many other embedded processing applications.

Journal ArticleDOI
TL;DR: An overview of the issues involved in the design of the control mechanism used by the Myrias parallel computer system to manage the execution of parallel programs is presented.
Abstract: This paper presents an overview of the issues involved in the design of the control mechanism used by the Myrias parallel computer system to manage the execution of parallel programs. The following issues are discussed: initial task distribution, dynamic load leveling, hierarchical caching, task synchronization, memory management, and scalability. Some of the more important points are illustrated using performance measurements obtained by running a test program on the Myrias Research Corporation prototype system.

Journal ArticleDOI
R. Ford1
TL;DR: The conflict between the performance demands of real-time systems and the shared-resource needs of high-level languages (Ada in particular) is examined and it is shown that one system, an optimized optimistic version, does deliver performance that is acceptable for real- time applications.
Abstract: The conflict between the performance demands of real-time systems and the shared-resource needs of high-level languages (Ada in particular) is examined. Shared memory requires carefully designed concurrency control, but the traditional approach, which is to embed the entire allocate-release implementation code in critical sections, is unsuitable for real-time applications because it results in excessively high response time. The design and performance of three memory-management systems for real-time applications are evaluated, and it is shown that one system, an optimized optimistic version, does deliver performance that is acceptable for real-time applications.
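The "optimistic" alternative to one long critical section can be sketched with a compare-and-swap commit: prepare the allocation outside any lock, publish it atomically, and retry on interference, so worst-case blocking stays short. The C11 sketch below shows the pattern on a lock-free free list; it is a generic illustration of optimistic concurrency control, not the paper's evaluated Ada runtime.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct block {
    struct block *next;
} block_t;

static _Atomic(block_t *) free_list;

/* Optimistic allocate: read the head outside any lock, then commit
 * with one compare-and-swap; on interference, retry.  (A production
 * version must also deal with the ABA problem.) */
block_t *opt_alloc(void)
{
    block_t *head = atomic_load(&free_list);
    while (head != NULL &&
           !atomic_compare_exchange_weak(&free_list, &head, head->next))
        ;   /* CAS failure reloads `head`; just retry */
    return head;
}

void opt_release(block_t *b)
{
    b->next = atomic_load(&free_list);
    while (!atomic_compare_exchange_weak(&free_list, &b->next, b))
        ;   /* CAS failure reloads `b->next`; retry */
}
```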

Journal ArticleDOI
17 May 1988
TL;DR: Memory referencing behavior is analyzed via the study of traces for the purpose of developing new local memory structures and management techniques, and the analysis indicates the use of a program-controlled cache to efficiently reduce the traffic from the cache to main memory.
Abstract: Memory referencing behavior is analyzed via the study of traces for the purpose of developing new local memory structures and management techniques. A novel trace processing technique called flattening reduces the dependence of the results on the underlying compiler and architecture on which the trace was generated, and partitions each memory location into its constituent values. The referencing patterns of each value in the resulting trace are described via statistics such as interreference time, lifetime, etc. The referencing patterns of the entire trace are described via histograms showing the distributions of the statistics for the individual values. The results of this analysis indicate the use of a program-controlled cache to efficiently reduce the traffic from the cache to main memory. By using program control, the future knowledge of the compiler can be imparted to the cache, allowing the rejection of dead values and early replacement of values with long interreference times.
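Interreference time, the per-value statistic these histograms summarize, is simple to compute from a flattened trace: for each value, record the gap between successive references; long gaps mark values a program-controlled cache could evict early. A C sketch, assuming the trace is just a sequence of value identifiers:

```c
#include <stdio.h>

#define MAX_VALUES 65536   /* assumed bound on distinct value IDs */

static long last_ref[MAX_VALUES];   /* previous reference index, or -1 */

/* Scan a flattened trace (a sequence of value IDs) and print each
 * interreference time; long gaps identify values a program-controlled
 * cache could replace early. */
void interreference_times(const unsigned *trace, long n)
{
    for (long v = 0; v < MAX_VALUES; v++)
        last_ref[v] = -1;
    for (long t = 0; t < n; t++) {
        unsigned v = trace[t] % MAX_VALUES;
        if (last_ref[v] >= 0)
            printf("value %u: interreference gap %ld\n", v, t - last_ref[v]);
        last_ref[v] = t;
    }
}
```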