
Showing papers in "IEEE Micro in 1997"


Journal Article•DOI•
TL;DR: The state of microprocessors and DRAMs today is reviewed, some of the opportunities and challenges for IRAMs are explored, and performance and energy efficiency of three IRAM designs are estimated.
Abstract: Two trends call into question the current practice of fabricating microprocessors and DRAMs as different chips on different fabrication lines. The gap between processor and DRAM speed is growing at 50% per year; and the size and organization of memory on a single DRAM chip is becoming awkward to use, yet size is growing at 60% per year. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency. It also allows more flexible selection of memory size and organization, and promises savings in board area. This article reviews the state of microprocessors and DRAMs today, explores some of the opportunities and challenges for IRAMs, and finally estimates performance and energy efficiency of three IRAM designs.

671 citations


Journal Article•DOI•
TL;DR: Because simultaneous multithreading successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.
Abstract: Simultaneous multithreading is a processor design which consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both instruction throughput and speedups are greater.

581 citations


Journal Article•DOI•
TL;DR: Tiny Tera as mentioned in this paper is an input-buffered switch, which makes it the highest bandwidth switch possible given a particular CMOS and memory technology, and it includes efficient support for multicasting.
Abstract: Describes Tiny Tera: a small, high-bandwidth, single-stage switch. Tiny Tera has 32 ports switching fixed-size packets, each operating at over 10 Gbps (approximately the Sonet OC-192e rate, a telecom standard for system interconnects). The switch distinguishes four classes of traffic and includes efficient support for multicasting. We aim to demonstrate that it is possible to use currently available CMOS technology to build this compact switch with an aggregate bandwidth of approximately 1 terabit per second and a central hub no larger than a can of soda. Such a switch could serve as a core for an ATM switch or an Internet router. Tiny Tera is an input-buffered switch, which makes it the highest bandwidth switch possible given a particular CMOS and memory technology. The switch consists of three logical elements: ports, a central crossbar switch, and a central scheduler. It queues packets at a port on entry and optionally prior to exit. The scheduler, which has a map of each port's queue occupancy, determines the crossbar configuration every packet time slot. Input queueing, parallelism, and tight integration are the keys to such a high-bandwidth switch. Input queueing reduces the memory bandwidth requirements: When a switch queues packets at the input, the buffer memories need run no faster than the line rate. Thus, there is no need for the speedup required in output-queued switches.

279 citations
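The input-queueing argument in the Tiny Tera abstract above can be made concrete with a little arithmetic. The sketch below uses the usual textbook accounting rather than anything taken from the paper itself; the port count and line rate mirror Tiny Tera's stated figures.

```python
# Rough accounting of buffer-memory bandwidth for input-queued vs.
# output-queued switches. Port count and line rate mirror Tiny Tera's
# stated figures; the formulas are standard textbook ones, not the
# paper's own analysis.

PORTS = 32
LINE_RATE_GBPS = 10

# Input queueing: each buffer sees at most one write (arriving packet)
# and one read (departing packet) per packet time, so it need run no
# faster than the line rate in each direction.
input_queued_mem = 2 * LINE_RATE_GBPS

# Output queueing: in the worst case all N inputs target one output in
# the same slot, so that buffer must absorb N writes plus one read per
# packet time -- the "speedup" the abstract says input queueing avoids.
output_queued_mem = (PORTS + 1) * LINE_RATE_GBPS

print(f"input-queued buffer:  {input_queued_mem} Gbps")
print(f"output-queued buffer: {output_queued_mem} Gbps")
```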


Journal Article•DOI•
TL;DR: SGI's Spider chip-Scalable, Pipelined Interconnect for Distributed Endpoint Routing-creates a scalable, short-range network delivering hundreds of gigabytes per second of bandwidth to large configurations.
Abstract: SGI's Spider chip-Scalable, Pipelined Interconnect for Distributed Endpoint Routing-creates a scalable, short-range network delivering hundreds of gigabytes per second of bandwidth to large configurations. Individual Spider chips sustain a 4.8-Gbyte/s switching rate, connecting to each other and to endpoints across cables up to 5 meters in length. By delivering very high bandwidth-thousands of times higher than standard Ethernet-at low latencies, Spider is ideal for CPU interconnect applications, high-end network switches, or high-performance graphics interconnects. The Spider chip design drew on the principles of computer communications architecture. Isolation between the physical, data link, and message layers led to a well-structured design that is transportable and more easily verified than a nonlayered solution. Because the chip implements all layers in hardware, latency is very low. Thus, we could realize the benefits of layering without sacrificing performance.

248 citations


Journal Article•DOI•
TL;DR: 0.5-micron CMOS transmitter and receiver circuits that use active equalization to overcome the frequency-dependent attenuation of copper lines are developed.
Abstract: Most digital systems today use full-swing, unterminated signaling methods that are unsuited for data rates over 100 MHz on 1-meter wires. We are currently developing 0.5-micron CMOS transmitter and receiver circuits that use active equalization to overcome the frequency-dependent attenuation of copper lines. The circuits will operate at 4 Gbps over up to 6 meters of 24-AWG twisted pair or up to 1 meter of 5-mil, 0.5-oz. PC trace. In addition to frequency-dependent attenuation, timing uncertainty (skew and jitter) and receiver bandwidth are also major obstacles to high data rates. To address all of these issues, we've given our system the following characteristics: An active transmitter equalizer compensates for the frequency-dependent attenuation of the transmission line. The system performs closed-loop clock recovery independently for each signal line in a manner that cancels all clock and data skew and the low-frequency components of clock jitter. The delay line that generates the transmit and receive clocks (a 400-MHz clock with 10 equally spaced phases) uses several circuit techniques to achieve a total simulated jitter of less than 20 ps in the presence of supply and substrate noise. A clocked receive amplifier with a 50-ps aperture time senses the signal during the center of the eye at the receiver.

194 citations


Journal Article•DOI•
TL;DR: This article describes some of the important issues related to just-in-time, or JIT, compilation techniques for Java and focuses on the JIT compilers developed by Sun for use with the JDK (Java Development Kit) virtual machine running on SPARC and Intel processors.
Abstract: The Java programming language promises portable, secure execution of applications. Early Java implementations relied on interpretation, leading to poor performance compared to compiled programs. Compiling Java programs to the native machine instructions provides much higher performance. Because traditional compilation would defeat Java's portability and security, another approach is necessary. This article describes some of the important issues related to just-in-time, or JIT, compilation techniques for Java. We focus on the JIT compilers developed by Sun for use with the JDK (Java Development Kit) virtual machine running on SPARC and Intel processors. (Access the Web at www.sun.com/workshop/java/jit for these compilers and additional information.) We also discuss performance improvements and limitations of JIT compilers. Future Java implementations may provide even better performance, and we outline some specific techniques that they may use.

185 citations
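To make the interpretation-versus-JIT trade-off in the abstract above concrete, here is a toy sketch. The stack bytecode and the "compiler" are invented for illustration; Sun's JIT compilers translate JVM bytecodes into native SPARC or x86 instructions, not Python. The point it demonstrates is the one the abstract makes: translate once, then every later call skips per-instruction dispatch.

```python
# Toy illustration of interpretation vs. JIT compilation. The bytecode
# format and "compiler" below are invented for this sketch; a real JIT
# emits native machine code.
import time

PROGRAM = [("push", 2), ("push", 3), ("mul",), ("push", 4), ("add",)]

def interpret(program):
    """Dispatch on every opcode, every time the program runs."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b if op == "mul" else a + b)
    return stack.pop()

def jit_compile(program):
    """Translate once to host code; later calls avoid dispatch entirely."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(repr(args[0]))
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(f"({a} {'*' if op == 'mul' else '+'} {b})")
    namespace = {}
    exec(f"def compiled(): return {stack.pop()}", namespace)
    return namespace["compiled"]

compiled = jit_compile(PROGRAM)
assert interpret(PROGRAM) == compiled() == 10   # 2 * 3 + 4

t0 = time.perf_counter()
for _ in range(100_000):
    interpret(PROGRAM)
t1 = time.perf_counter()
for _ in range(100_000):
    compiled()
t2 = time.perf_counter()
print(f"interpreted: {t1 - t0:.3f}s  jit-compiled: {t2 - t1:.3f}s")
```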


Journal Article•DOI•
TL;DR: AMBA defines both a bus specification and a technology-independent methodology for designing, implementing, and testing customized, high-integration embedded controllers that will aid designers in making detailed comparisons with other buses.
Abstract: In 1995, Advanced RISC Machines released its Advanced Microcontroller Bus Architecture in response to input from key semiconductor licensees. AMBA's goal is to help designers of embedded CPU systems meet challenges like design for low power consumption and test access. Because input for AMBA came from designers of ARM-based microprocessors, ARM was also able to develop a solid design rationale and evolve an architectural design that would address the most common problems. This article describes some of AMBA's design methodology and provides a set of specifications that will aid designers in making detailed comparisons with other buses. AMBA defines both a bus specification and a technology-independent methodology for designing, implementing, and testing customized, high-integration embedded controllers.

180 citations


Journal Article•DOI•
TL;DR: This small, flexible microprocessor core provides performance five to 20 times better than other means of Java execution and the microarchitecture trade-offs made for picoJava-I are illustrated.
Abstract: Our goal is to describe the picoJava-I architecture. To do so, we first describe characteristics of the Java Virtual Machine that are of interest to a processor designer. To illustrate the microarchitecture trade-offs made for picoJava-I, we also present statistics on the dynamic distribution of byte codes for various Java applications as well as the impact of the Java runtime. Finally, we present the microarchitecture itself and discuss its performance. This small, flexible microprocessor core provides performance five to 20 times better than other means of Java execution.

164 citations


Journal Article•DOI•
R. Crisp•
TL;DR: Providing three times the memory bandwidth of the 66-MHz SDRAM subsystem, Direct RDRAM modules fit seamlessly into the existing mechanical space and airflow environment of the industry-standard PC chassis.
Abstract: Providing three times the memory bandwidth of the 66-MHz SDRAM subsystem, Direct RDRAM modules fit seamlessly into the existing mechanical space and airflow environment of the industry-standard PC chassis.

163 citations


Journal Article•DOI•
TL;DR: This survey introduces the problems posed by virtual caches and discusses solutions in the context of single-processor systems, cataloging solutions past and present and identifying technology trends and attractive future approaches.
Abstract: This survey exposes the problems related to virtual caches in the context of uniprocessor (Part 1) and multiprocessor (Part 2) systems. We review solutions that have been implemented or proposed in different contexts. The idea is to catalog all solutions, past and present, and to identify technology trends and attractive future approaches. We first overview the relevant properties of virtual memory and of physical caches. To solve the virtual-to-physical address bottleneck, processors may access caches directly with virtual addresses. This part introduces the problems and discusses solutions in the context of single-processor systems.

112 citations
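One of the central problems such a survey catalogs is the synonym (aliasing) problem. The toy model below shows how a cache indexed and tagged by virtual address can return stale data when two virtual pages map to one physical page; the page size, addresses, and write policy are illustrative inventions, not taken from the survey.

```python
# Minimal sketch of the "synonym" problem: in a cache indexed and tagged
# by virtual address, two virtual pages mapped to the same physical page
# occupy separate cache entries, so a write through one alias is
# invisible to reads through the other. All numbers are illustrative.

PAGE = 4096
page_table = {0x0000: 0x7000, 0x1000: 0x7000}  # two VAs -> one PA
memory = {}
vcache = {}  # virtual address -> cached value

def translate(va):
    return page_table[va & ~(PAGE - 1)] | (va & (PAGE - 1))

def read(va):
    if va not in vcache:              # miss: fill from physical memory
        vcache[va] = memory.get(translate(va), 0)
    return vcache[va]

def write(va, value):
    vcache[va] = value                # updates this alias's entry only
    memory[translate(va)] = value

read(0x0010)          # cache the location through the first alias
write(0x1010, 42)     # write it through the second alias
print(read(0x0010))   # prints 0: the first alias is now stale
```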


Journal Article•DOI•
TL;DR: This article introduces Java's existing security features and the way they contribute to its overall usability, simplicity, adequacy, and adaptability in the global computing arena.
Abstract: This article introduces Java's existing security features and the way they contribute to its overall usability, simplicity, adequacy, and adaptability in the global computing arena. It also discusses JavaSoft's plans to make new features available as the technology evolves.

Journal Article•DOI•
TL;DR: SLDRAM meets the high data bandwidth requirements of emerging processor architectures and retains the low cost of earlier DRAM interface standards, suggesting that SLDRAM will become the mainstream commodity memory of the early 21st century.
Abstract: The primary objective of DRAM-dynamic random access memory-is to offer the largest memory capacity at the lowest possible cost. Designers achieve this by two means. First, they optimize the process and the design to minimize die area. Second, they ensure that the device serves high-volume markets and can be mass-produced to achieve the greatest economies of scale. SLDRAM-synchronous-link DRAM-is a new memory interface specification developed through the cooperative efforts of leading semiconductor memory manufacturers and high-end computer architects and system designers. SLDRAM meets the high data bandwidth requirements of emerging processor architectures and retains the low cost of earlier DRAM interface standards. These and other benefits suggest that SLDRAM will become the mainstream commodity memory of the early 21st century.

Journal Article•DOI•
Ashok Kumar•
TL;DR: The PA-8000 RISC CPU is the first of a new generation of Hewlett-Packard microprocessors designed for high-end systems, and features an aggressive, four-way, superscalar implementation, combining speculative execution with on-the-fly instruction reordering.
Abstract: The PA-8000 RISC CPU is the first of a new generation of Hewlett-Packard microprocessors. Designed for high-end systems, it is among the world's most powerful and fastest microprocessors. It features an aggressive, four-way, superscalar implementation, combining speculative execution with on-the-fly instruction reordering. The heart of the machine, the instruction reorder buffer, provides out-of-order execution capability. Our primary design objective for the PA-8000 was to attain industry-leading performance in a broad range of applications. In addition, we wanted to provide full support for 64-bit applications. To make the PA-8000 truly useful, we needed to ensure that the processor would not only achieve high benchmark performance but would sustain such performance in large, real-world applications. To achieve this goal, we designed large, external primary caches with the ability to hide memory latency in hardware. We also implemented dynamic instruction reordering in hardware to maximize instruction-level parallelism available to the execution units. The PA-8000 connects to a high-bandwidth Runway system bus, a 768-Mbyte/s split-transaction bus that allows each processor to generate multiple outstanding memory requests. The processor also provides glueless support for up to four-way multiprocessing via the Runway bus. The PA-8000 implements the new PA (Precision Architecture) 2.0, a binary-compatible extension of the previous PA-RISC architecture. All previous code executes on the PA-8000 without recompilation or translation.

Journal Article•DOI•
TL;DR: Two schemes for implementing associativity greater than two are proposed: the sequential multicolumn cache, an extension of the column-associative cache, and the parallel multicolumn cache; simulations show both can effectively reduce the average access time.
Abstract: In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. The associativity of a cache is the number of places in the cache where a block may reside. In a direct-mapped cache, which has an associativity of 1, there is only one location to search for a match for each reference. In a cache with associativity n-an n-way set-associative cache-there are n locations. Increasing associativity reduces the miss rate by decreasing the number of conflict, or interference, references. The column-associative cache and the predictive sequential associative cache seem to have achieved near-optimal performance for an associativity of two. Increasing associativity beyond two, therefore, is one of the most important ways to further improve cache performance. We propose two schemes for implementing associativity greater than two: the sequential multicolumn cache, which is an extension of the column-associative cache, and the parallel multicolumn cache. For an associativity of four, they achieve the low miss rate of a four-way set-associative cache. Our simulation results show that both schemes can effectively reduce the average access time.
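As a rough illustration of the sequential search such schemes rely on, here is a sketch of probing multiple "columns" of a direct-mapped array one after another. The XOR rehash, sizes, and placement policy are simplifications invented for illustration, not the paper's design.

```python
# Sketch of a sequential multi-probe lookup in the spirit of
# column-associative caches. Hash, sizes, and placement policy are
# invented simplifications; real designs also swap a block toward the
# first column on a secondary hit so hot blocks hit on the first probe.

SETS, COLUMNS = 8, 4

class MulticolumnCache:
    def __init__(self):
        self.tags = [None] * SETS

    def probes(self, addr):
        base = addr % SETS
        # Rehash by XORing the set index with the column number, giving
        # an associativity of COLUMNS without a parallel tag lookup.
        return [(base ^ c) % SETS for c in range(COLUMNS)]

    def access(self, addr):
        """Return the number of sequential probes on a hit, None on a miss."""
        for n, idx in enumerate(self.probes(addr), start=1):
            if self.tags[idx] == addr:
                return n
        self.tags[self.probes(addr)[-1]] = addr   # naive: fill last column
        return None

cache = MulticolumnCache()
assert cache.access(3) is None    # cold miss, line installed
assert cache.access(3) == 4       # hit, but only after 4 sequential probes
```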

Journal Article•DOI•
TL;DR: This article explores the various trade-offs involved and illuminates the consequences of different design choices, thus enabling designers to make informed decisions on how to implement division and square root functions.
Abstract: Floating-point support has become a mandatory feature of new microprocessors due to the prevalence of business, technical, and recreational applications that use these operations. Spreadsheets, CAD tools, and games, for instance, typically feature floating-point-intensive code. Over the past few years, the leading architectures have incorporated several generations of floating-point units (FPUs). However, while addition and multiplication implementations have become increasingly efficient, support for division and square root has remained uneven. The design community has reached no consensus on the type of algorithm to use for these two functions, and quality and performance of the implementations vary widely. This situation originates in skepticism about the importance of division and square root and an insufficient understanding of the design alternatives. Quantifying what constitutes good performance is challenging. One rule of thumb, for example, states that the latency of division should be three times that of multiplication; this figure is based on division frequencies in a selection of typical scientific applications. Even if we accept this doctrine at face value, implementing division-and square root-involves much more than relative latencies. We must also consider area, throughput, complexity, and the interaction with other operations. This article explores the various trade-offs involved and illuminates the consequences of different design choices, thus enabling designers to make informed decisions.
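As one concrete example of the alternatives such an article weighs, here is a sketch of multiplicative division via Newton-Raphson iteration on the reciprocal; the seed and step count are illustrative choices, not from the article. Each step costs two multiplications and a subtraction but roughly doubles the number of correct bits, which is why the latency of a divider like this ends up a small multiple of the multiplier's.

```python
# Division by Newton-Raphson iteration on the reciprocal: a common
# multiplicative scheme of the kind compared against subtractive (SRT)
# dividers. The seed and iteration count here are illustrative; real
# hardware seeds the iteration from a small lookup table.

def divide(a, b, steps=5):
    assert 1.0 <= b < 2.0, "assume a normalized significand, as in FP"
    x = 2.0 / 3.0                  # crude initial guess for 1/b
    for _ in range(steps):
        x = x * (2.0 - b * x)      # error roughly squares each step
    return a * x                   # a / b == a * (1 / b)

print(divide(7.0, 1.75))   # converges to 4.0
print(7.0 / 1.75)          # reference result
```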

Journal Article•DOI•
TL;DR: RMI is designed to support pure-Java distributed objects in a seamless manner, and allows calls to be made between Java objects in different virtual machines, even on different physical machines.
Abstract: The Java language and platform provide a base for distributed computing that changes several conventional assumptions. In particular, the Java Virtual Machine allows a group of Java-enabled machines to be treated as a homogeneous group rather than a heterogeneous group, despite possible differences in the machine architectures and underlying operating systems. Java also makes it possible to safely and dynamically load code in a running Java process. Together, these features allow a system to invoke methods on remote objects, which can move code associated with language-level objects from the calling process to the process called and vice versa. Combining these qualities with a language-centric design not only significantly simplifies traditional RPC systems, it adds functionality that was previously not possible. We designed Java Remote Method Invocation (RMI) to support pure-Java distributed objects in a seamless manner. RMI allows calls to be made between Java objects in different virtual machines, even on different physical machines.

Journal Article•DOI•
TL;DR: It is shown that the implementation of aggressive latency tolerance techniques aggravates stalls due to finite memory bandwidth, which actually become more significant than stalls resulting from uncongested memory latency alone.
Abstract: This paper quantifies and compares the performance impacts of memory latency and finite bandwidth. We show that the implementation of aggressive latency tolerance techniques aggravates stalls due to finite memory bandwidth, which actually become more significant than stalls resulting from uncongested memory latency alone. We expect that memory bandwidth limitations across the processor pins will drive significant architectural change. An execution-driven simulation measures the time that several SPEC95 benchmarks spend stalled on memory latency, stalled on limited memory bandwidth, and computing.
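A back-of-the-envelope model shows why latency tolerance shifts the bottleneck in the way the paper measures. All numbers below are hypothetical, chosen only to illustrate the accounting; they are not the paper's measurements.

```python
# Illustrative stall accounting. With a blocking processor every miss
# stalls for the full memory latency; with perfect latency tolerance,
# misses overlap computation and only the traffic exceeding pin
# bandwidth still stalls the processor. Hypothetical numbers throughout.

cycles_compute  = 1_000_000   # useful computation, in cycles
misses          = 200_000     # cache misses during that computation
miss_latency    = 100         # uncongested miss latency, in cycles
line_bytes      = 32          # bytes fetched per miss
bytes_per_cycle = 4           # pin (memory) bandwidth

stall_blocking = misses * miss_latency                      # no overlap

transfer_cycles = misses * line_bytes / bytes_per_cycle
stall_bandwidth = max(0, transfer_cycles - cycles_compute)  # full overlap

print(f"latency-bound stall (blocking):     {stall_blocking:,} cycles")
print(f"bandwidth-bound stall (overlapped): {stall_bandwidth:,.0f} cycles")
```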

Journal Article•DOI•
TL;DR: This work designs a simultaneous multithreaded vector architecture that achieves performance equivalent to executing 15 to 26 scalar instructions/cycle for numerical applications.
Abstract: Simultaneous multithreaded vector architectures combine the best of data-level and instruction-level parallelism and perform better than either approach could separately. Our design achieves performance equivalent to executing 15 to 26 scalar instructions/cycle for numerical applications.

Journal Article•DOI•
Y. Nunomura, T. Shimizu, O. Tomisawa•
TL;DR: This work combines a DRAM with a high-performance RISC processor on a single silicon chip-what the authors call embedded DRAM, or eRAM-a concept that the new M32R/D processor embodies.
Abstract: Newly emerging portable multimedia systems demand low-energy processing. Until now, however, designers of microprocessors for PCs and engineering workstations have focused mostly on high speed. Microcontroller designers, on the other hand, have taken low power dissipation as a first priority. Today's portable multimedia system design calls for developers to combine these two approaches, lowering power consumption while maintaining reasonable performance. There are several approaches to reducing power dissipation. Using advanced process technology is one of them. Another approach merges a DRAM with a high-performance RISC processor on a single silicon chip-what we call embedded DRAM, or eRAM. This is exactly the concept that this new processor, M32R/D, embodies.

Journal Article•DOI•
TL;DR: The author discusses the reasons for media processors' existence, the implementation of the Mpact media processor, and some examples of the relationship between hardware and software on the Mpact chip.
Abstract: Media processors, a new class of processor architectures that combine hardware and software to accelerate multimedia functions concurrently, owe their existence to several computational needs and enabling technologies. In this article, the author discusses the reasons for media processors' existence, the implementation of the Mpact media processor, and some examples of the relationship between hardware and software on the Mpact chip.

Journal Article•DOI•
TL;DR: This paper discusses the inherent problems, and possible solutions, of multiprocessor systems in which processors access their caches directly using virtual addresses.
Abstract: In the first part of this two-part survey, we discussed the problems caused by virtual address caches in single-processor systems and their possible solutions. In this paper, we continue to explore these topics in the context of multiprocessor systems. Processors may access their caches directly using virtual addresses. We discuss the inherent problems of this approach in multiprocessor systems and possible solutions.

Journal Article•DOI•
TL;DR: The architecture is a 16-bit processor with dedicated instructions and hardware for efficient support of fuzzy logic for medium-range applications that demand computational power combined with low cost for the resulting hardware system (chip and board).
Abstract: We propose an architecture dedicated mainly to medium-range applications that demand computational power combined with low cost for the resulting hardware system (chip and board). Our architecture is a 16-bit processor with dedicated instructions and hardware for efficient support of fuzzy logic. To make the architecture effective for control applications developed with a traditional approach or with fuzzy logic, we equipped the processor with a microcontroller's general features. Our design accounts for application characteristics to provide efficient hardware support for fuzzy logic. To achieve this we first analyzed fuzzy control algorithms and derived a general model for fuzzy computation. In defining the model, we considered the large spectrum of possible inference methods, fuzzification and defuzzification mechanisms, and the operators used in control applications. On this basis, we defined the instruction set that supports this computational model and a proper architectural solution. We tested the system (composed of the software model and its hardware support) by simulating different sets of general-purpose and fuzzy control benchmarks.
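The fuzzification-inference-defuzzification pipeline such a processor accelerates can be sketched in a few lines of software. The membership functions, rule base, and operators below are generic textbook choices invented for illustration; the paper derives its own general model from an analysis of fuzzy control algorithms.

```python
# Minimal software model of a fuzzy controller's pipeline:
# fuzzification -> rule inference -> defuzzification. Everything here
# (terms, rules, operators) is a generic textbook choice, not the
# paper's instruction set or computational model.

def tri(x, a, b, c):
    """Triangular membership function peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_fan_speed(temp):
    # Fuzzification: degree of membership in each linguistic term.
    cold = tri(temp, -10, 0, 15)
    warm = tri(temp, 10, 20, 30)
    hot  = tri(temp, 25, 40, 55)
    # Inference maps each rule's firing strength to an output singleton;
    # defuzzification is a weighted average (centroid of singletons).
    rules = [(cold, 0.0), (warm, 40.0), (hot, 100.0)]  # term -> fan %
    total = sum(strength for strength, _ in rules)
    return sum(s * out for s, out in rules) / total if total else 0.0

for t in (5, 20, 35):
    print(f"{t} C -> fan {fuzzy_fan_speed(t):.0f}%")
```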

Journal Article•DOI•
TL;DR: Digital's first-generation, high-performance cluster network, Memory Channel for PCI, is supported by all SMP AlphaServers running Digital Unix, and standard message-passing APIs benefit greatly from its underlying capability.
Abstract: Digital has announced and shipped its first-generation, high-performance network for clusters, the Memory Channel for PCI network; all SMP AlphaServers running Digital Unix support it. Digital has publicly demonstrated Memory Channel-connected systems running Windows/NT. The Memory Channel network does not require functionality beyond the PCI bus specification and works with any system having a PCI I/O slot. Production Memory Channel clusters can be as large as eight nodes (limited only by first-generation hardware) of 12 processors each (96 processors). One such cluster installed at Supercomputing 95 ran clusterwide applications using High Performance Fortran, PVM, and MPI. A four-node, 48-processor Memory Channel cluster, using Oracle Parallel Server, has held the record for TPC-C benchmarks since its introduction in April 1996. The same Memory Channel network used to connect this high-end database configuration also cost-effectively supports configuration of two-node, single-processor clusters. Latency over Memory Channel for a one-way, user-process-to-user-process message is 2.9 microseconds. The processor overhead is less than 150 ns for a 32-byte message. Standard message-passing APIs benefit greatly from this underlying capability.


Journal Article•DOI•
TL;DR: This work has developed a radio solution at millimeter-wavelength frequencies, where the spectrum is sufficient to accommodate link speeds of hundreds of Mbps, and demonstrated a picocellular approach with a range of approximately 10 meters and link rates up to 185 Mbps.
Abstract: WLAN systems face technical problems similar to those encountered in outdoor wide-area radio-based systems, including the limited available bandwidth and fading noise due to multipath interference and blockage. The goal in WLAN system design is to transmit at the maximum information rate with an acceptable probability of error and minimum equipment complexity, power, and cost. Competing approaches use either infrared radiation or radio waves in the microwave or millimeter-wave bands. We have developed a radio solution at millimeter-wavelength frequencies, where the spectrum is sufficient to accommodate link speeds of hundreds of Mbps. Using a test bed with burst-mode transmission capability and an experimental 40-GHz radio, we have demonstrated a picocellular approach with a range of approximately 10 meters and link rates up to 185 Mbps. In addition, we have built a prototype modem with a raw link rate of 54 Mbps for use in a high-speed indoor WLAN demonstrator.

Journal Article•DOI•
TL;DR: Portable and handheld products require processors that consume less power than those in desktop and other powered applications; as a result, designers must analyze power use early in the design, at both the circuit and system levels.
Abstract: Portable and handheld products require processors that consume less power than those in desktop and other powered applications. As a result, designers must analyze power use early in the design, at both the circuit and system levels. RISC processors, such as our ARM7TDMI, have both strengths and weaknesses as far as power consumption is concerned. From a system perspective, RISC processors should consume more power than CISC processors, since RISCs need to be fed with an instruction virtually every cycle. RISC processors usually have a fixed 32-bit instruction format, which forces a 32-bit memory access every cycle. Thus, the processor consumes power both in accessing the memory and in driving 32 address and 32 data wires across a PCB.

Journal Article•DOI•
TL;DR: ChARM's program locality analysis illustrates the sequentiality, temporality, and loops of a program in easy-to-read three-dimensional graphs to help designers understand how a program works and how it stresses the memory hierarchy.
Abstract: ChARM is a simulation tool for tuning ARM-based embedded systems that include cache memories. ChARM provides a parametric, trace-driven simulation for tuning system configuration. A designer can observe performance while varying the timing, the architectural features, and the management policies of the system components. Designers can therefore evaluate the execution time of the program, the time spent in memory accesses, miss ratio, code miss ratio, and data miss ratio, and the number of burst-read operations. They can also evaluate the number of write operations for write-through cache models and burst-write operations for copy-back cache models. Finally, ChARM's program locality analysis illustrates the sequentiality, temporality, and loops of a program in easy-to-read three-dimensional graphs. These graphs, together with the graphs showing the distribution of replacement conflicts in the cache, help designers understand how a program works and how it stresses the memory hierarchy.
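The core of a trace-driven cache simulation like ChARM's fits in a few lines. The sketch below replays an address trace against a direct-mapped cache model and reports the miss ratio; the sizes and the toy trace are invented for illustration and are not ChARM's own parameters.

```python
# Sketch of trace-driven cache simulation: replay a recorded address
# trace against a parameterized cache model and report the miss ratio.
# Direct-mapped, 16-byte lines, 64 sets -- invented parameters; ChARM
# itself exposes many more configuration options and policies.

LINE, SETS = 16, 64

def miss_ratio(trace):
    tags = [None] * SETS
    misses = 0
    for addr in trace:
        block = addr // LINE
        idx, tag = block % SETS, block // SETS
        if tags[idx] != tag:
            misses += 1          # miss: fill the line
            tags[idx] = tag
        # a hit needs no action in this simple model
    return misses / len(trace)

# A toy trace: a sequential sweep (high spatial locality) followed by a
# tight loop over a few words (high temporal locality).
trace = list(range(0, 1024, 4)) + [0, 4, 8, 12] * 100
print(f"miss ratio: {miss_ratio(trace):.3f}")
```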

Journal Article•DOI•
TL;DR: The article demonstrates a concise way to represent the design space using DS trees, reviews the most frequently used issue schemes, and highlights trends for each design aspect of instruction issue.
Abstract: Clearly, instruction issue and execution are closely related: The more parallel the instruction execution, the higher the requirements for the parallelism of instruction issue. Thus, we see the continuous and harmonized increase of parallelism in instruction issue and execution. This article focuses on superscalar instruction issue, tracing the way parallel instruction execution and issue have increased performance. It also spans the design space of instruction issue, identifying important design aspects and available design choices. The article also demonstrates a concise way to represent the design space using DS trees, reviews the most frequently used issue schemes, and highlights trends for each design aspect of instruction issue.

Journal Article•DOI•
TL;DR: A tool that will allow designers using the codesign approach to partially automate the development of embedded systems by using artificial intelligence techniques to imitate the behavior of a human in defining a system's partitioning.
Abstract: We propose a tool that will allow designers using the codesign approach to partially automate the development of embedded systems. The framework takes advantage of tools already available on the market for VLSI CAD as well as soft computing techniques. We focus our work mainly on evaluation of cost and partitioning, because this is the area in which soft computing seems to have great advantages over traditional approaches. The main novelty of our approach is our use of artificial intelligence techniques to imitate the behavior of a human in defining a system's partitioning. We hope to devote further studies to techniques to optimize the genetic algorithm, in both the representation and processing of data. We are also working on the use of formal techniques to describe our system.
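A minimal sketch of what genetic-algorithm-driven partitioning looks like appears below. The task set, cost weights, and GA parameters are all invented for illustration, and the paper's actual cost model and encoding are far richer: each chromosome assigns tasks to hardware or software, and evolution minimizes a weighted time/area cost.

```python
# Toy genetic algorithm for hardware/software partitioning. Task data,
# cost model, and GA settings are invented for this sketch.
import random

random.seed(1)
N_TASKS = 10
sw_time = [random.randint(2, 9) for _ in range(N_TASKS)]   # cycles on CPU
hw_time = [t // 3 + 1 for t in sw_time]                    # faster in HW
hw_area = [random.randint(3, 8) for _ in range(N_TASKS)]   # silicon cost

def cost(part):            # part[i] == 1 -> task i implemented in hardware
    time = sum(hw_time[i] if p else sw_time[i] for i, p in enumerate(part))
    area = sum(hw_area[i] for i, p in enumerate(part) if p)
    return time + 0.5 * area          # weighted time/area trade-off

def evolve(pop_size=30, generations=60):
    pop = [[random.randint(0, 1) for _ in range(N_TASKS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]          # truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_TASKS)    # one-point crossover
            child = a[:cut] + b[cut:]
            child[random.randrange(N_TASKS)] ^= 1  # point mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

best = evolve()
print("partition:", best, "cost:", cost(best))
```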