
Showing papers on "Memory management published in 1997"


Journal ArticleDOI
TL;DR: The state of microprocessors and DRAMs today is reviewed, some of the opportunities and challenges for IRAMs are explored, and performance and energy efficiency of three IRAM designs are estimated.
Abstract: Two trends call into question the current practice of fabricating microprocessors and DRAMs as different chips on different fabrication lines. The gap between processor and DRAM speed is growing at 50% per year; and the size and organization of memory on a single DRAM chip is becoming awkward to use, yet size is growing at 60% per year. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency. It also allows more flexible selection of memory size and organization, and promises savings in board area. This article reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs.

671 citations


Journal ArticleDOI
TL;DR: A region-based dynamic semantics for a skeletal programming language extracted from Standard ML is defined, and the inference system which specifies where regions can be allocated and de-allocated is presented, together with a detailed proof that the system is sound with respect to a standard semantics.
Abstract: This paper describes a memory management discipline for programs that perform dynamic memory allocation and de-allocation. At runtime, all values are put into regions. The store consists of a stack of regions. All points of region allocation and de-allocation are inferred automatically, using a type and effect based program analysis. The scheme does not assume the presence of a garbage collector. The scheme was first presented in 1994 (M. Tofte and J.-P. Talpin, in "Proceedings of the 21st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages," pp. 188-201); subsequently, it has been tested in The ML Kit with Regions, a region-based, garbage-collection-free implementation of the Standard ML Core language, which includes recursive datatypes, higher-order functions and updatable references (L. Birkedal, M. Tofte, and M. Vejlstrup (1996), in "Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages," pp. 171-183). This paper defines a region-based dynamic semantics for a skeletal programming language extracted from Standard ML. We present the inference system which specifies where regions can be allocated and de-allocated and a detailed proof that the system is sound with respect to a standard semantics. We conclude by giving some advice on how to write programs that run well on a stack of regions, based on practical experience with the ML Kit.

640 citations
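
The region discipline above is easiest to picture with a tiny allocator sketch. The C fragment below is a minimal illustration of stack-allocated regions (allocate into a chosen region, free an entire region at once); it is not the ML Kit's inferred-region implementation, and the names `region_push`, `ralloc` and `region_pop` are invented for this example.

```c
#include <stdlib.h>

/* One region: a bump-allocated chunk (real regions grow by chaining chunks). */
typedef struct region {
    char *base;
    size_t used, cap;
    struct region *older;          /* next region down the stack */
} region;

static region *top = NULL;         /* the stack of regions */

region *region_push(size_t cap) {  /* corresponds to entering "letregion r in ..." */
    region *r = malloc(sizeof *r);
    r->base = malloc(cap);
    r->used = 0;
    r->cap = cap;
    r->older = top;
    return top = r;
}

void *ralloc(region *r, size_t n) {/* "store this value at region r" */
    if (r->used + n > r->cap) return NULL;   /* a real region would add a chunk */
    void *p = r->base + r->used;
    r->used += n;
    return p;
}

void region_pop(void) {            /* end of letregion: deallocate everything at once */
    region *r = top;
    top = r->older;
    free(r->base);
    free(r);
}
```

Every value allocated in a region dies when that region is popped, which is what lets the analysis dispense with a garbage collector.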


Patent
21 Nov 1997
TL;DR: In this article, the authors describe distributed shared memory systems and processes that can connect into each node of a computer network to encapsulate the memory management operations of the connected nodes and to provide thereby an abstraction of a shared virtual memory that can span across each node and that optionally spans across each memory device connected to the computer network.
Abstract: Distributed shared memory systems and processes that can connect into each node of a computer network to encapsulate the memory management operations of the connected nodes and to provide thereby an abstraction of a shared virtual memory that can span across each node of the network and that optionally spans across each memory device connected to the computer network. Accordingly, each node on the network having the distributed shared memory system of the invention can access the shared memory.

343 citations


Patent
21 Nov 1997
TL;DR: In this article, a distributed shared memory (DSM) system is proposed to provide low-level memory device services to the data control program, such as read, write, allocate, flush, or any other service suitable for providing low-level control of a memory storage device.
Abstract: In a network of computer nodes, a structured storage system interfaces to a globally addressable memory system that provides persistent storage of data. The globally addressable memory system may be a distributed shared memory (DSM) system. A control program resident on each network node can direct the memory system to map file and directory data into the shared memory space. The memory system can include functionality to share data, coherently replicate data, and create log-based transaction data to allow for recovery. In one embodiment, the memory system provides memory device services to the data control program. These services can include read, write, allocate, flush, or any other similar or additional service suitable for providing low level control of a memory storage device. The data control program employs these memory system services to allocate and access portions of the shared memory space for creating and manipulating a structured store of data such as a file system, a database system, or a Web page system for storing, retrieving, and delivering objects such as files, database records or information, and Web pages.

315 citations


Journal ArticleDOI
TL;DR: A survey and analysis of trace-driven memory simulation tools can be found in this article, where the authors discuss the strengths and weaknesses of different approaches and show that no single method is best when all criteria, including accuracy, speed, memory, flexibility, portability, expense, and ease of use are considered.
Abstract: As the gap between processor and memory speeds continues to widen, methods for evaluating memory system designs before they are implemented in hardware are becoming increasingly important. One such method, trace-driven memory simulation, has been the subject of intense interest among researchers and has, as a result, enjoyed rapid development and substantial improvements during the past decade. This article surveys and analyzes these developments by establishing criteria for evaluating trace-driven methods, and then applies these criteria to describe, categorize, and compare over 50 trace-driven simulation tools. We discuss the strengths and weaknesses of different approaches and show that no single method is best when all criteria, including accuracy, speed, memory, flexibility, portability, expense, and ease of use are considered. In a concluding section, we examine fundamental limitations to trace-driven simulation, and survey some recent developments in memory simulation that may overcome these bottlenecks.

315 citations


Proceedings ArticleDOI
17 Mar 1997
TL;DR: This work presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.
Abstract: Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on microprocessor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications. We present a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications. Our experiments on code kernels from typical applications show that our technique results in significant performance improvements.

304 citations
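
The partitioning problem can be approximated with a simple greedy heuristic: rank variables by accesses per byte and pack the most profitable ones into the Scratch-Pad until it is full. The paper's actual technique also models array access patterns and cache conflicts, so the sketch below, with its invented `var_t` fields, is only meant to make the idea concrete.

```c
#include <stdlib.h>

typedef struct {
    const char *name;
    unsigned size;        /* bytes occupied by the variable */
    unsigned accesses;    /* estimated number of accesses */
    int in_sram;          /* 1 = placed in on-chip Scratch-Pad SRAM */
} var_t;

static int by_density(const void *a, const void *b) {
    const var_t *x = a, *y = b;                    /* accesses per byte, descending */
    double dx = (double)x->accesses / x->size;
    double dy = (double)y->accesses / y->size;
    return (dy > dx) - (dy < dx);
}

void partition(var_t *v, int n, unsigned sram_bytes) {
    qsort(v, n, sizeof *v, by_density);
    for (int i = 0; i < n; i++) {
        if (v[i].size <= sram_bytes) {             /* most profitable variables first */
            v[i].in_sram = 1;
            sram_bytes -= v[i].size;
        } else {
            v[i].in_sram = 0;                      /* falls back to off-chip DRAM / cache */
        }
    }
}
```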


Proceedings ArticleDOI
01 May 1997
TL;DR: A technique for dynamic analysis of program data access behavior is presented, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner and is fully compatible with existing Instruction Set Architectures.
Abstract: Improvements in main memory speeds have not kept pace with increasing processor clock frequency and improved exploitation of instruction-level parallelism. Consequently, the gap between processor and main memory performance is expected to grow, increasing the number of execution cycles spent waiting for memory accesses to complete. One solution to this growing problem is to reduce the number of cache misses by increasing the effectiveness of the cache hierarchy. In this paper we present a technique for dynamic analysis of program data access behavior, which is then used to proactively guide the placement of data within the cache hierarchy in a location-sensitive manner. We introduce the concept of a macroblock, which allows us to feasibly characterize the memory locations accessed by a program, and a Memory Address Table, which performs the dynamic reference analysis. Our technique is fully compatible with existing Instruction Set Architectures. Results from detailed simulations of several integer programs show significant speedups.

192 citations
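
The macroblock idea groups addresses into large, fixed-size chunks so that per-chunk reference statistics remain cheap to track. A rough software model of such a Memory Address Table might look like the following; the 4 KB macroblock size, table organization and update rule are assumptions for illustration rather than the paper's parameters.

```c
#include <stdint.h>

#define MB_SHIFT 12          /* assumed 4 KB macroblocks */
#define MAT_SIZE 1024        /* assumed direct-mapped table size */

typedef struct { uint64_t tag; uint32_t refs; } mat_entry;
static mat_entry mat[MAT_SIZE];

/* Record one memory reference and return the reference count observed so far
 * for its macroblock; a placement policy could use this count to decide
 * whether the data deserves space at a given level of the cache hierarchy. */
uint32_t mat_access(uint64_t addr) {
    uint64_t mb = addr >> MB_SHIFT;
    mat_entry *e = &mat[mb % MAT_SIZE];
    if (e->tag != mb) {          /* new macroblock evicts the old entry */
        e->tag = mb;
        e->refs = 0;
    }
    return ++e->refs;
}
```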


Proceedings ArticleDOI
01 Oct 1997
TL;DR: This paper describes a two-level software coherent shared memory system, Cashmere-2L, that meets the challenge of using hardware shared memory for sharing within an SMP while ensuring that software overhead is incurred only when actively sharing data across SMPs in the cluster.
Abstract: Low-latency remote-write networks, such as DEC's Memory Channel, provide the possibility of transparent, inexpensive, large-scale shared-memory parallel computing on clusters of shared memory multiprocessors (SMPs). The challenge is to take advantage of hardware shared memory for sharing within an SMP, and to ensure that software overhead is incurred only when actively sharing data across SMPs in the cluster. In this paper, we describe a two-level software coherent shared memory system, Cashmere-2L, that meets this challenge. Cashmere-2L uses hardware to share memory within a node, while exploiting the Memory Channel's remote-write capabilities to implement moderately lazy release consistency with multiple concurrent writers, directories, home nodes, and page-size coherence blocks across nodes. Cashmere-2L employs a novel coherence protocol that allows a high level of asynchrony by eliminating global directory locks and the need for TLB shootdown. Remote interrupts are minimized by exploiting the remote-write capabilities of the Memory Channel network. Cashmere-2L currently runs on an 8-node, 32-processor DEC AlphaServer system. Speedups range from 8 to 31 on 32 processors for our benchmark suite, depending on the application's characteristics. We quantify the importance of our protocol optimizations by comparing performance to that of several alternative protocols that do not share memory in hardware within an SMP, and require more synchronization. In comparison to a one-level protocol that does not share memory in hardware within an SMP, Cashmere-2L improves performance by up to 46%.

185 citations


Proceedings ArticleDOI
01 Dec 1997
TL;DR: This work revisits memory hierarchy design viewing memory as an inter-operation communication agent and uses data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access.
Abstract: We revisit memory hierarchy design viewing memory as an inter-operation communication agent. This perspective leads to the development of novel methods of performing inter-operation memory communication. We use data dependence prediction to identify and link dependent loads and stores so that they can communicate speculatively without incurring the overhead of address calculation, disambiguation and data cache access. We also use data dependence prediction to convert DEF-store-load-USE chains within the instruction window into DEF-USE chains prior to address calculation and disambiguation. We use true and output data dependence status prediction to introduce and manage a small storage structure called the transient value cache (TVC). The TVC captures memory values that are short-lived. It also captures recently stored values that are likely to be accessed soon. Accesses that are serviced by the TVC do not have to be serviced by other parts of the memory hierarchy, e.g., the data cache. The first two techniques are aimed at reducing the effective communication latency whereas the last technique is aimed at reducing data cache bandwidth requirements. Experimental analysis of the proposed techniques shows that: the proposed speculative communication methods correctly handle a large fraction of memory dependences; and a large number of the loads and stores do not have to ever reach the data cache when the TVC is in place.

183 citations


Patent
30 Apr 1997
TL;DR: In this article, the authors present a mobile client system in which provision is made for management of flash memory. Flash memory management is done using variable block length and supports data compression. Blocks are allocated contiguously in each erase unit and each block starts with a header that contains the length of the block. Blocks are tracked using a single-level virtual address map which resides in random access memory (RAM).
Abstract: A computer system, such as a mobile client system, in which provision is made for management of flash memory. Flash memory management is done using variable block length and supports data compression. Blocks are allocated contiguously in each erase unit and each block starts with a header that contains the length of the block. Blocks are tracked using a single-level virtual address map which resides in random access memory (RAM). The mobile computer system may also include a housing, processor, random access memory, display and an input digitizer such as a touchscreen.

182 citations
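
A sketch of the data layout the abstract describes: variable-length blocks packed contiguously in an erase unit, each preceded by a header carrying its length, and a single-level virtual address map kept in RAM that points at the current physical location of each logical block. The structure and field names below are assumptions for illustration.

```c
#include <stdint.h>

/* Header stored in flash at the start of every allocated block. */
typedef struct {
    uint16_t length;        /* length of the (possibly compressed) data that follows */
    uint16_t flags;         /* e.g. valid / obsolete, compression used */
} block_header;

/* Single-level virtual address map kept in RAM: one entry per logical block. */
typedef struct {
    uint16_t erase_unit;    /* which erase unit the block currently lives in */
    uint16_t offset;        /* byte offset of its header within that unit */
} vmap_entry;

/* Because blocks are allocated contiguously, walking an erase unit is just
 * "read header, skip its length". */
uint32_t next_block_offset(uint32_t off, const block_header *h) {
    return off + (uint32_t)sizeof *h + h->length;
}
```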


Patent
05 Sep 1997
TL;DR: In this article, methods for preventing tampering with memory in an electronic device, such as a cellular telephone, are described; a data transfer device requesting access to the memory is authenticated using a public/private key encryption scheme.
Abstract: Methods and apparatus for preventing tampering with memory in an electronic device, such as a cellular telephone, are disclosed. An electronic device having a memory and a processing means contains logic that is used to perform a one-way hash calculation on the device's memory contents whereby an audit hash value, or signature, of such contents is derived. The audit hash value is compared to an authenticated valid hash value derived from authentic memory contents. A difference between the audit and valid hash values can be indicative of memory tampering. In accordance with another aspect of the invention, electronic device memory contents can be updated by a data transfer device that is authenticated before being permitted access to the memory contents. Data transfer device authentication involves the use of a public/private key encryption scheme. When the data transfer device interfaces with an electronic device and requests memory access, a process to authenticate the data transfer device is initiated.
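
The audit step amounts to hashing the memory image and comparing the digest with an authenticated reference value. The fragment below uses a simple FNV-1a hash purely to show the control flow; a real implementation would use a cryptographic one-way hash, and the function names here are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* FNV-1a stand-in for the one-way hash (illustration only, not cryptographic). */
static uint64_t audit_hash(const uint8_t *mem, size_t len) {
    uint64_t h = 0xcbf29ce484222325ull;
    for (size_t i = 0; i < len; i++) {
        h ^= mem[i];
        h *= 0x100000001b3ull;
    }
    return h;
}

/* Returns nonzero if the memory contents no longer match the authenticated
 * valid hash value, i.e. possible tampering. */
int memory_tampered(const uint8_t *mem, size_t len, uint64_t valid_hash) {
    return audit_hash(mem, len) != valid_hash;
}
```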

Proceedings ArticleDOI
01 Jun 1997
TL;DR: The performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW), finds that parallel sorting on a NOW is competitive to sorting on the large-scale SMPs that have traditionally held the performance records.
Abstract: We report the performance of NOW-Sort, a collection of sorting implementations on a Network of Workstations (NOW). We find that parallel sorting on a NOW is competitive to sorting on the large-scale SMPs that have traditionally held the performance records. On a 64-node cluster, we sort 6.0 GB in just under one minute, while a 32-node cluster finishes the Datamation benchmark in 2.41 seconds. Our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. We evaluate the use of commodity operating systems and hardware for parallel sorting. We find existing OS primitives for memory management and file access adequate. Due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation I/O bus.

Proceedings ArticleDOI
06 Feb 1997
TL;DR: IRAM is attractive because the gigabit DRAM chip has enough transistors for both a powerful processor and a memory big enough to contain whole programs and data sets, and it needs more metal layers to accelerate the long lines of 600 mm² chips.
Abstract: It is time to reconsider unifying logic and memory. Since most of the transistors on this merged chip will be devoted to memory, it is called 'intelligent RAM'. IRAM is attractive because the gigabit DRAM chip has enough transistors for both a powerful processor and a memory big enough to contain whole programs and data sets. It contains 1024 memory blocks each 1kb wide. It needs more metal layers to accelerate the long lines of 600 mm² chips. It may require faster transistors for the high-speed interface of synchronous DRAM. Potential advantages of IRAM include lower memory latency, higher memory bandwidth, lower system power, adjustable memory width and size, and less board space. Challenges for IRAM include high chip yield given processors have not been repairable via redundancy, high memory retention rates given processors usually need higher power than DRAMs, and a fast processor given logic is slower in a DRAM process.

Patent
Gordon P. Sorber
12 Nov 1997
TL;DR: In this article, a memory manager requests a large area of memory from an operating system; from the viewpoint of the operating system, that memory is fixed. The fixed memory area is then divided, e.g. by the memory manager, into an integral number of classes.
Abstract: A memory system and management method for optimized dynamic memory allocation are disclosed. A memory manager requests a large area of memory from an operating system, and from the viewpoint of the operating system, that memory is fixed. That fixed memory area is then divided up into an integral number of classes, e.g. by the memory manager. Each memory class includes same-size blocks of memory linked together by pointers. The memory block sizes are different for each class, and the sizes of the different class memory blocks are selected to conform to the CPU and memory access bus hardware (e.g. align with a bus bit width) as well as to accommodate the various sizes of data expected to be processed for a particular application. The memory manager maintains a separate, linked list of unused blocks for each class. Each memory block is zeroed initially and after release by a process previously assigned to it. When a block of memory is assigned to a particular process, a flag is set to indicate that it is in use. Incoming messages of variable length are parsed based upon definitions of message structures expected to be received by a particular application. The parsed message or message segment is then stored in an appropriate size memory block.
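
A minimal sketch of the described scheme: one large area obtained from the operating system is carved into classes of same-size blocks, each class keeps its own free list linked through the blocks themselves, and blocks are zeroed when released. The class sizes, class count and function names below are assumptions chosen for illustration.

```c
#include <stdlib.h>
#include <string.h>

#define NCLASSES 4
static const size_t class_size[NCLASSES] = { 32, 64, 256, 1024 }; /* bus-aligned sizes */

typedef struct block { struct block *next; } block;   /* free-list link lives inside the block */
static block *free_list[NCLASSES];

/* Carve a fixed arena (obtained once from the OS) into per-class free lists;
 * each class gets the same number of bytes here for simplicity. */
void pool_init(char *arena, size_t per_class_bytes) {
    for (int c = 0; c < NCLASSES; c++) {
        for (size_t off = 0; off + class_size[c] <= per_class_bytes; off += class_size[c]) {
            block *b = (block *)(arena + off);
            b->next = free_list[c];
            free_list[c] = b;
        }
        arena += per_class_bytes;
    }
}

void *pool_alloc(size_t n) {                 /* use the smallest class that fits n */
    for (int c = 0; c < NCLASSES; c++)
        if (n <= class_size[c]) {
            block *b = free_list[c];
            if (!b) return NULL;             /* class exhausted */
            free_list[c] = b->next;
            return b;                        /* caller marks the block in use */
        }
    return NULL;                             /* larger than the largest class */
}

void pool_free(void *p, size_t n) {          /* zero on release, then relink */
    for (int c = 0; c < NCLASSES; c++)
        if (n <= class_size[c]) {
            memset(p, 0, class_size[c]);
            ((block *)p)->next = free_list[c];
            free_list[c] = (block *)p;
            return;
        }
}
```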

Patent
Mitchell A. Bauman
31 Dec 1997
TL;DR: In this paper, a modular, expandable, multi-port main memory system that includes multiple point-to-point switch interconnections and a highly parallel data path structure that allows multiple memory operations to occur simultaneously is presented.
Abstract: A modular, expandable, multi-port main memory system that includes multiple point-to-point switch interconnections and a highly-parallel data path structure that allows multiple memory operations to occur simultaneously. The main memory system includes an expandable number of modular Memory Storage Units, each of which are mapped to a portion of the total address space of the main memory system, and may be accessed simultaneously. Each of the Memory Storage Units includes a predetermined number of memory ports, and an expandable number of memory banks, wherein each of the memory banks may be accessed simultaneously. Each of the memory banks is also modular, and includes an expandable number of memory devices each having a selectable memory capacity. All of the memory devices in the system may be performing different memory read or write operations substantially simultaneously and in parallel. Multiple data paths within each of the Memory Storage Units allow data transfer operations to occur to each of the multiple memory ports in parallel. Simultaneously with the transfer operations occurring to the memory ports, unrelated data transfer operations may occur to multiple ones of the memory devices within all memory banks in parallel. The main memory system further incorporates independent storage devices and control logic to implement a directory-based coherency protocol. Thus the main memory system is adapted to providing the flexibility, bandpass, and memory coherency needed to support a high-speed multiprocessor environment.

Journal ArticleDOI
TL;DR: An out-of-core approach for interactive streamline construction on large unstructured tetrahedral meshes containing millions of elements using an octree to partition and restructure the raw data into subsets stored into disk files for fast data retrieval.
Abstract: This paper presents an out-of-core approach for interactive streamline construction on large unstructured tetrahedral meshes containing millions of elements. The out-of-core algorithm uses an octree to partition and restructure the raw data into subsets stored into disk files for fast data retrieval. A memory management policy tailored to the streamline calculations is used such that, during the streamline construction, only a very small amount of data are brought into the main memory on demand. By carefully scheduling computation and data fetching, the overhead of reading data from the disk is significantly reduced and good memory performance results. This out-of-core algorithm makes possible interactive streamline visualization of large unstructured-grid data sets on a single mid-range workstation with relatively low main-memory capacity: 5-15 megabytes. We also demonstrate that this approach is much more efficient than relying on virtual memory and operating system's paging algorithms.
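
The memory-management policy described above boils down to a small, fixed-budget cache of octree-cell files that are read from disk only when a streamline enters them. The sketch below shows such an on-demand loader with least-recently-used eviction; the file naming, cell size and slot count are assumptions made for this example, not details from the paper.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_RESIDENT 8                 /* keep only a handful of cells in main memory */

typedef struct {
    int   cell_id;                     /* octree cell index */
    void *data;                        /* tetrahedra + field values, NULL if slot empty */
    long  last_use;                    /* for least-recently-used eviction */
} cell_slot;

static cell_slot cache[MAX_RESIDENT];
static long clock_tick;

/* Return the requested octree cell, loading it from its disk file on demand
 * and evicting the least recently used resident cell if necessary. */
void *fetch_cell(int cell_id, size_t cell_bytes) {
    int victim = 0;
    for (int i = 0; i < MAX_RESIDENT; i++) {
        if (cache[i].data && cache[i].cell_id == cell_id) {
            cache[i].last_use = ++clock_tick;          /* already resident */
            return cache[i].data;
        }
        if (cache[i].last_use < cache[victim].last_use) victim = i;
    }
    char path[64];
    snprintf(path, sizeof path, "cell_%04d.bin", cell_id);  /* assumed file naming */
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    free(cache[victim].data);
    cache[victim].data = malloc(cell_bytes);
    fread(cache[victim].data, 1, cell_bytes, f);
    fclose(f);
    cache[victim].cell_id = cell_id;
    cache[victim].last_use = ++clock_tick;
    return cache[victim].data;
}
```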

Proceedings ArticleDOI
01 Dec 1997
TL;DR: This work extends previous studies of data value and dependence speculation by introducing a novel modification of the processor pipeline called memory renaming, which allows the processor to speculatively fetch values when the producer of the data can be reliably determined without the need for an effective address.
Abstract: As processors continue to exploit more instruction-level parallelism, a greater demand is placed on reducing the effects of memory access latency. In this paper, we introduce a novel modification of the processor pipeline called memory renaming. Memory renaming applies register access techniques to load instructions, reducing the effect of delays caused by the need to calculate effective addresses for the load and all preceding stores before the data can be fetched. Memory renaming allows the processor to speculatively fetch values when the producer of the data can be reliably determined without the need for an effective address. This work extends previous studies of data value and dependence speculation. When memory renaming is added to the processor pipeline, renaming can be applied to 30% to 50% of all memory references, translating to an overall improvement in execution time of up to 41%. Furthermore, this improvement is seen across all memory segments-including the heap segment, which has often been difficult to manage efficiently.

Patent
12 Sep 1997
TL;DR: In this article, the authors present a system and method for managing memory in a network, where the primary memory on one node may be used to store memory data (pages) from other nodes.
Abstract: A system and method for managing memory in a network. In a computer network in which multiple computers (nodes) are interconnected by a network, the primary memory on one node may be used to store memory data (pages) from other nodes. The transfer of a data page over the network from the memory of a node holding it to the memory of another node requesting that data gives improved performance when compared to the transfer of the same data from disk, either local or remote, to the requesting node. Global information about the disposition of the nodes and their memories in the network is used to determine the nodes in the network that should best be used to hold data pages for other nodes at a particular time. This information is exchanged by the nodes periodically under command of a coordinating node. The system includes distributed data structures that permit one node to locate data pages stored in another node's memory, procedures to determine when global information should be recomputed and redistributed, and procedures to avoid overburdening nodes with remote page traffic.
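
One way to picture the distributed data structures is a per-page directory entry recording which node currently holds the page, consulted by a faulting node before it falls back to disk. The patent does not spell out field names, so the small structure below is only an assumed illustration of that lookup.

```c
#include <stdint.h>

/* Directory entry, replicated or partitioned across nodes: where does page p live? */
typedef struct {
    uint32_t page_no;       /* global page number */
    uint16_t holder_node;   /* node whose idle memory currently stores the page */
    uint16_t flags;         /* bit 0: entry valid, i.e. some node holds the page */
} page_dir_entry;

enum { LOCAL_MEM, REMOTE_MEM, DISK };

/* Decide where to fetch a faulting page from: local memory, a remote node's
 * memory over the network, or disk as the last resort. */
int locate_page(const page_dir_entry *e, uint16_t self) {
    if (!(e->flags & 1)) return DISK;               /* no node holds it */
    return (e->holder_node == self) ? LOCAL_MEM : REMOTE_MEM;
}
```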


Patent
Jan van Lunteren
21 Mar 1997
TL;DR: In this article, an address-mapping method is introduced that applies a table lookup procedure so that arbitrary, non-power-of-two interleave factors and numbers of memory banks are possible for various strides.
Abstract: For optimizing access to system memory having a plurality of memory banks, interleaving can be used when storing data so that data sequences are distributed over memory banks. The invention introduces an address-mapping method applying a table lookup procedure so that arbitrary, non-power-of-two interleave factors and numbers of memory banks are possible for various strides.
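
A toy version of the table-lookup mapping: rather than computing bank = address mod 2^k, a small table indexed by the low address bits yields both a bank number and a slot within that bank, so a non-power-of-two bank count (three, here) still produces a one-to-one mapping without division hardware. The table contents and sizes are illustrative and not taken from the patent.

```c
#include <stdint.h>

#define WIN_BITS 4                          /* index the tables with the low 4 address bits */
#define NBANKS   3                          /* non-power-of-two number of memory banks */

/* For each value r of the low bits: which bank (r mod 3) and which slot inside
 * the current 16-line window of that bank (r / 3). */
static const uint8_t bank_tab[16] = {0,1,2,0,1,2,0,1,2,0,1,2,0,1,2,0};
static const uint8_t slot_tab[16] = {0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5};
static const uint8_t win_rows[NBANKS] = {6, 5, 5};   /* lines per bank per 16-line window */

typedef struct { unsigned bank, row; } bank_addr;

bank_addr map_address(uint32_t line_addr) {
    unsigned r = line_addr & ((1u << WIN_BITS) - 1);
    unsigned q = line_addr >> WIN_BITS;
    bank_addr m;
    m.bank = bank_tab[r];
    m.row  = q * win_rows[m.bank] + slot_tab[r];     /* unique row inside the bank */
    return m;
}
```

Consecutive line addresses hit banks 0, 1, 2, 0, 1, 2, ... giving an interleave factor of three over three banks, which a pure power-of-two modulo scheme cannot express.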

Journal ArticleDOI
TL;DR: This survey introduces the problems and discusses solutions in the context of single-processor systems, to catalog all solutions, past and present, and to identify technology trends and attractive future approaches.
Abstract: This survey exposes the problems related to virtual caches in the context of uniprocessor (Part 1) and multiprocessor (Part 2) systems. We review solutions that have been implemented or proposed in different contexts. The idea is to catalog all solutions, past and present, and to identify technology trends and attractive future approaches. We first overview the relevant properties of virtual memory and of physical caches. To solve the virtual-to-physical address bottleneck, processors may access caches directly with virtual addresses. This survey introduces the problems and discusses solutions in the context of single-processor systems.

Patent
Johan Lodenius
22 May 1997
TL;DR: In this article, a single chip CMOS technology architecture is used to implement all or various combinations of baseband radio transmission, baseband interfaces and filtering, source coding, source interfaces, control and supervision, power and clock management, keyboard and display drivers, memory management and code compaction, digital signal processing (DSP) and DSP memory and radio interface functions.
Abstract: According to the present invention, a single-chip semiconductor device is provided. In one version of the invention, a single chip CMOS technology architecture is used to implement all or various combinations of baseband radio transmission, baseband interfaces and filtering, source coding, source interfaces and filtering, control and supervision, power and clock management, keyboard and display drivers, memory management and code compaction, digital signal processing (DSP) and DSP memory and radio interface functions.

Patent
09 Sep 1997
TL;DR: In this paper, an optimized memory architecture for computer systems is presented. But this architecture is limited to integrated circuits that implement a memory subsystem that is comprised of internal memory and control for external memory.
Abstract: The present invention relates generally to an optimized memory architecture for computer systems and, more particularly, to integrated circuits that implement a memory subsystem that is comprised of internal memory and control for external memory. The invention includes one or more shared high-bandwidth memory subsystems, each coupled over a plurality of buses to a display subsystem, a central processing unit (CPU) subsystem, input/output (I/O) buses and other controllers. Additional buffers and multiplexers are used for the subsystems to further optimize system performance.

Patent
10 Feb 1997
TL;DR: In this paper, a hand-held video game system has a microprocessor controller with address and data buses for providing memory accesses during memory cycles to a plurality of cartridge slots for electrically connecting cartridges containing memory to the data buses.
Abstract: A hand-held video game system having a microprocessor controller with address and data buses for providing memory accesses during memory cycles to a plurality of cartridge slots for electrically connecting cartridges containing memory to the address and data buses. An output terminal of the microprocessor controller provides a cartridge-select signal which identifies a first memory containing cartridge to be accessed during an initial memory cycle, with the microprocessor controller controlling the output terminal to change the cartridge-select signal for transparently accessing a second memory containing cartridge for a subsequent memory cycle. The cartridge slot may also provide a port for transferring and receiving information over a bi-directional communication link in which a communication cartridge allows communication over the internet, and allows for interactive play of a video game.

Journal ArticleDOI
TL;DR: SLDRAM meets the high data bandwidth requirements of emerging processor architectures and retains the low cost of earlier DRAM interface standards, suggesting that SLDRAM will become the mainstream commodity memory of the early 21st century.
Abstract: The primary objective of DRAM-dynamic random access memory-is to offer the largest memory capacity at the lowest possible cost. Designers achieve this by two means. First, they optimize the process and the design to minimize die area. Second, they ensure that the device serves high-volume markets and can be mass-produced to achieve the greatest economies of scale. SLDRAM-synchronous-link DRAM-is a new memory interface specification developed through the cooperative efforts of leading semiconductor memory manufacturers and high-end computer architects and system designers. SLDRAM meets the high data bandwidth requirements of emerging processor architectures and retains the low cost of earlier DRAM interface standards. These and other benefits suggest that SLDRAM will become the mainstream commodity memory of the early 21st century.

Patent
17 Mar 1997
TL;DR: In this paper, a memory management method for a nonvolatile memory having a plurality of blocks each of which can be electrically erased and programmed is provided. But the data file is stored in a first block of the number of blocks if the size of the file is smaller than the total storage space of the first block.
Abstract: A memory management method is provided for a nonvolatile memory having a plurality of blocks each of which can be electrically erased and programmed. A data file to be stored in the nonvolatile memory is received. A number of blocks that are available for data storage are located from the plurality of blocks. The data file is stored in a first block of the number of blocks if the size of the data file is smaller than the total storage space of the first block. The data file is stored in the first block and a second block of the number of blocks if the size of the data file is larger than the total storage space of the first block but smaller than the total storage space of the first and second blocks. Space in the first or second block left unoccupied by the data file is not used to store other data files, so that no file reallocation operation is needed when the first or second block is to be erased. A data storage system having an erasable and programmable memory and a memory management program is also described.
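
A condensed sketch of the stated method: locate available blocks, place the file in the first one if it fits, otherwise span the first and second, and never reuse the leftover space for another file so no reallocation is needed at erase time. The structure and the two-block limit as coded below are illustrative simplifications.

```c
#include <stddef.h>

typedef struct {
    size_t capacity;      /* total storage space of the block */
    size_t used;          /* bytes occupied by the single file stored here */
    int    available;     /* block is erased and free for data storage */
} nv_block;

/* Store a file of `size` bytes into at most two available blocks.
 * Leftover space in a used block is deliberately left unused, so erasing the
 * block later never requires relocating another file. Returns 0 on success. */
int store_file(nv_block *blocks, int nblocks, size_t size) {
    for (int i = 0; i < nblocks; i++) {
        if (!blocks[i].available) continue;
        if (size <= blocks[i].capacity) {                 /* fits in the first block */
            blocks[i].used = size;
            blocks[i].available = 0;
            return 0;
        }
        for (int j = i + 1; j < nblocks; j++) {           /* try spanning two blocks */
            if (!blocks[j].available) continue;
            if (size <= blocks[i].capacity + blocks[j].capacity) {
                blocks[i].used = blocks[i].capacity;      blocks[i].available = 0;
                blocks[j].used = size - blocks[i].capacity; blocks[j].available = 0;
                return 0;
            }
            break;        /* only the first and second available blocks are considered */
        }
        return -1;        /* too large for the first two available blocks */
    }
    return -1;            /* no available block */
}
```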

Patent
13 Oct 1997
TL;DR: An ATM reassembly controller is described in this paper that optimizes the utilization of host memory space and hardware resources such as I/O bus bandwidth, host memory bandwidth, memory system bandwidth and CPU resources.
Abstract: An ATM reassembly controller is disclosed that optimizes the utilization of host memory space and hardware resources such as I/O bus bandwidth, host memory bandwidth, memory system bandwidth and CPU resources. The system combines, whenever possible, the PDU status, PDU data and pointers to the host memory data buffers into a large burst write to the status queue. In addition, multiple status bundles are packed into a host memory buffer for efficient use of memory. An additional benefit of combining and packing information is that CPU resources are conserved by having combined the information the CPU must access into a contiguous memory area.

Patent
28 Feb 1997
TL;DR: In this article, a data processing system consisting of a data processor and a memory control device for controlling the access of information from the memory is described. But the information is stored until required by the data processor.
Abstract: A data processing system is disclosed which comprises a data processor and memory control device for controlling the access of information from the memory. The memory control device includes temporary storage and decision ability for determining what order to execute the memory accesses. The compiler detects the requirements of the data processor and selects the data to stream to the memory control device which determines a memory access order. The order in which to access said information is selected based on the location of information stored in the memory. The information is repeatedly accessed from memory and stored in the temporary storage until all streamed information is accessed. The information is stored until required by the data processor. The selection of the order in which to access information maximizes bandwidth and decreases the retrieval time.

Patent
29 May 1997
TL;DR: In this paper, the authors use two different types of virtual instructions: computational and pattern manipulation instructions to transfer data from one FPGA configuration to another with no external memory access, thereby transferring data to the storage elements in the logic plane at very high speed.
Abstract: A dynamically reconfigurable FPGA includes an array of tiles on a logic plane and a plurality of memory planes. Each tile has associated storage elements on each memory plane, called local memory. This local memory allows large amounts of data to pass from one FPGA configuration (memory plane) to another with no external memory access, thereby transferring data to/from the storage elements in the logic plane at very high speed. Typically, all the local memory can be simultaneously transferred to/from other memory planes in one cycle. Each FPGA configuration provides a virtual instruction. The present invention uses two different types of virtual instructions: computational and pattern manipulation instructions. Computational instructions perform some computation with data stored in some pre-defined local memory pattern. Pattern manipulation instructions move the local data into different memory locations to create the pattern required by the next instruction. A virtual computation may be accomplished by a sequence of instructions.

Patent
19 Sep 1997
TL;DR: In this article, a method, system, and computer program product for allocating physical memory in a distributed shared memory (DSM) network is provided, where global geometry data is stored that defines a global geometry of nodes in the DSM network.
Abstract: A method, system, and computer program product for allocating physical memory in a distributed shared memory (DSM) network is provided. Global geometry data is stored that defines a global geometry of nodes in the DSM network. The global geometry data includes node-node distance data and node-resource affinity data. The node-node distance data defines network distances between the nodes for the global geometry of the DSM network. The node-resource affinity data defines resources associated with the nodes in the global geometry of the DSM network. A physical memory allocator searches for a set of nodes in the DSM network that fulfills a memory configuration request based on the global geometry data. The memory configuration request can have parameters that define at least one of a requested geometry, memory amount, and resource affinity. The physical memory allocator in an operating system searches the global geometry data for a set of the nodes within the DSM network that fulfill the memory configuration request and minimize network latency and/or bandwidth. During the search, each node can be evaluated to ensure that the node has sufficient available memory amount and resource affinity. The physical memory allocator can begin a search at locations which are determined based on CPU load, actual memory usage or pseudo-randomly. Faster search algorithms can be used by approximating the DSM network by Boolean cubes.
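
The allocator's search can be pictured as scoring candidate nodes by network distance from a starting node while filtering on available memory and required resource affinity. The code below is a much simplified stand-in for that search, with invented field names and none of the Boolean-cube approximation, intended only to make the data flow concrete.

```c
#include <stdint.h>

#define MAXNODES 64

typedef struct {
    unsigned free_mem_mb;                 /* memory still available on the node */
    uint32_t resources;                   /* bitmask of devices/CPUs attached to the node */
} node_info;

static unsigned  dist[MAXNODES][MAXNODES]; /* node-node distance data (hops or latency) */
static node_info nodes[MAXNODES];          /* node-resource affinity and free memory */

/* Pick the node closest to `origin` that can satisfy the request:
 * enough free memory and all required resource affinities present. */
int find_node(int origin, int nnodes, unsigned need_mb, uint32_t need_resources) {
    int best = -1;
    for (int n = 0; n < nnodes; n++) {
        if (nodes[n].free_mem_mb < need_mb) continue;
        if ((nodes[n].resources & need_resources) != need_resources) continue;
        if (best < 0 || dist[origin][n] < dist[origin][best]) best = n;
    }
    return best;                          /* -1 if no node fulfills the request */
}
```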