Showing papers in "IEEE Micro in 2000"


Journal ArticleDOI
TL;DR: This paper describes the use of energy-enabled performance simulators in early design, examines some emerging paradigms in processor design, and comments on their inherent power-performance characteristics.
Abstract: The ability to estimate power consumption during early-stage definition and trade-off studies is a key new methodology enhancement. Opportunities for saving power can be exposed via microarchitecture-level modeling, particularly through clock-gating and dynamic adaptation. In this paper we describe the approach of using energy-enabled performance simulators in early design. We examine some of the emerging paradigms in processor design and comment on their inherent power-performance characteristics.

495 citations
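
The idea of an energy-enabled performance simulator in the abstract above can be illustrated with a minimal sketch. It assumes a per-cycle activity trace and a fixed per-cycle energy cost for each unit; the unit names, the energy values, the `idle_fraction` parameter, and the function `simulate_energy` are all hypothetical and are not taken from the paper. With clock gating enabled, an idle unit is charged only a small residual cost instead of its full cost.

```python
def simulate_energy(trace, unit_energy, clock_gated, idle_fraction=0.1):
    """Sum energy over a cycle-by-cycle activity trace.

    trace        -- list of sets; each set names the units active in that cycle
    unit_energy  -- dict mapping unit name to energy per active cycle (arbitrary units)
    clock_gated  -- if True, an idle unit costs only idle_fraction of its energy
    """
    total = 0.0
    for active_units in trace:
        for unit, cost in unit_energy.items():
            if unit in active_units:
                total += cost                   # unit does useful work this cycle
            elif clock_gated:
                total += cost * idle_fraction   # gated: residual cost only (assumed)
            else:
                total += cost                   # ungated: clock still toggles the unit
    return total


# Hypothetical two-unit machine and a four-cycle trace.
UNIT_ENERGY = {"fpu": 4.0, "alu": 2.0}
TRACE = [{"alu"}, {"alu", "fpu"}, set(), {"alu"}]

ungated = simulate_energy(TRACE, UNIT_ENERGY, clock_gated=False)
gated = simulate_energy(TRACE, UNIT_ENERGY, clock_gated=True)
```

Comparing `ungated` with `gated` for the same trace shows how much of the energy is spent on idle units, which is the kind of opportunity the abstract says microarchitecture-level modeling can expose.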


Journal ArticleDOI
R.E. Gonzalez
TL;DR: System designers can optimize Xtensa for their embedded application by sizing and selecting features and adding new instructions, which allows easy customization of both hardware and software.
Abstract: System designers can optimize Xtensa for their embedded application by sizing and selecting features and adding new instructions. Xtensa provides an integrated solution that allows easy customization of both hardware and software. This process is simple, fast, and robust.

438 citations


Journal ArticleDOI
TL;DR: The design of a CMP is motivated, the architecture of the Hydra design is described with a focus on its speculative thread support, and the prototype implementation is described.
Abstract: The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation. Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software.

372 citations


Journal ArticleDOI
TL;DR: The approach, which is called HiCuts (hierarchical intelligent cuttings), attempts to partition the search space in each dimension by using heuristics that exploit structure present in classifiers.
Abstract: Increasing demands on Internet router performance and functionality create a need for algorithms that classify packets quickly with minimal storage requirements and allow frequent updates. Unlike previous algorithms, the algorithm proposed here meets this need well by using heuristics that exploit structure present in classifiers. Our approach, which we call HiCuts (hierarchical intelligent cuttings), attempts to partition the search space in each dimension, guided by simple heuristics that exploit the classifier's structure. We discover this structure by preprocessing the classifier. We can tune the algorithm's parameters to trade off query time against storage requirements. In classifying packets based on four header fields, HiCuts performs quickly and requires relatively little storage compared with previously described algorithms.

342 citations
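
The partitioning idea described in the abstract above can be sketched as a small decision tree. This is a simplified illustration, not the published HiCuts algorithm: the cut heuristic (pick the dimension with the most distinct rule ranges), the fixed number of equal-width cuts, and the names `Node`, `build`, `classify`, `binth`, and `cuts` are assumptions made for the example. Rules are lists of inclusive `(lo, hi)` ranges, and a lower rule index means higher priority.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    box: list                                       # inclusive (lo, hi) per dimension
    rules: list = field(default_factory=list)       # rule indices (used at leaves)
    dim: int = -1                                   # dimension cut at this node
    step: int = 0                                   # width of each child interval
    children: list = field(default_factory=list)


def overlaps(rule, box):
    """True if the rule's hyper-rectangle intersects the box."""
    return all(r_lo <= b_hi and b_lo <= r_hi
               for (r_lo, r_hi), (b_lo, b_hi) in zip(rule, box))


def build(rules, idxs, box, binth=2, cuts=4):
    """Recursively cut the box until each leaf holds at most binth rules."""
    if len(idxs) <= binth:
        return Node(box, rules=idxs)

    def distinct(d):
        # Number of distinct rule ranges in dimension d, clipped to the box.
        lo, hi = box[d]
        return len({(max(rules[i][d][0], lo), min(rules[i][d][1], hi)) for i in idxs})

    cuttable = [d for d in range(len(box)) if box[d][1] > box[d][0]]
    if not cuttable:
        return Node(box, rules=idxs)
    dim = max(cuttable, key=distinct)
    if distinct(dim) <= 1:
        return Node(box, rules=idxs)                # no cut can separate these rules

    lo, hi = box[dim]
    width = hi - lo + 1
    step = -(-width // min(cuts, width))            # ceiling division
    node = Node(box, dim=dim, step=step)
    for start in range(lo, hi + 1, step):
        child_box = list(box)
        child_box[dim] = (start, min(start + step - 1, hi))
        child_idxs = [i for i in idxs if overlaps(rules[i], child_box)]
        node.children.append(build(rules, child_idxs, child_box, binth, cuts))
    return node


def classify(node, rules, packet):
    """Walk the tree, then linearly search the few rules left in the leaf."""
    while node.children:
        lo, _ = node.box[node.dim]
        node = node.children[(packet[node.dim] - lo) // node.step]
    for i in node.rules:                            # kept in priority order
        if all(lo <= x <= hi for x, (lo, hi) in zip(packet, rules[i])):
            return i
    return None


# Hypothetical two-field classifier over a 16 x 16 space.
RULES = [
    [(0, 3), (0, 15)],
    [(4, 7), (0, 7)],
    [(4, 15), (8, 15)],
    [(0, 15), (0, 15)],                             # default rule
]
ROOT = build(RULES, list(range(len(RULES))), [(0, 15), (0, 15)])
```

Raising `binth` gives a shallower tree with longer linear searches at the leaves, and raising `cuts` replicates more rules across children; this is one way to picture the query-time versus storage trade-off the abstract mentions.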


Journal ArticleDOI
TL;DR: PowerPC's AltiVec speeds not only media processing but also nearly any application in which data parallelism exists, as demonstrated by a cycle-accurate simulation of Motorola's MPC 7400, the heart of Apple G4 systems.
Abstract: There is a clear trend in personal computing toward multimedia-rich applications. These applications will incorporate a wide variety of multimedia technologies, including audio and video compression, 2D image processing, 3D graphics, speech and handwriting recognition, media mining, and narrow/broadband signal processing for communication. In response to this demand, major microprocessor vendors have announced architectural extensions to their general-purpose processors in an effort to improve their multimedia performance. Intel extended IA-32 with MMX and SSE (alias KNI), Sun enhanced Sparc with VIS, Hewlett-Packard added MAX to its PA-RISC architecture, Silicon Graphics extended the MIPS architecture with MDMX, and Digital (now Compaq) added MVI to Alpha. This article describes the most recent, and what we believe to be the most comprehensive, addition to this list: PowerPC's AltiVec. AltiVec speeds not only media processing but also nearly any application in which data parallelism exists, as demonstrated by a cycle-accurate simulation of Motorola's MPC 7400, the heart of Apple G4 systems.

331 citations


Journal ArticleDOI
TL;DR: The streaming SIMD extensions (SSE) provide a rich set of instructions to meet the requirements of demanding multimedia and Internet applications; in implementing them, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.
Abstract: The streaming SIMD extensions (SSE) provide a rich set of instructions to meet the requirements of demanding multimedia and Internet applications. In implementing the SSE, the Pentium III developers made a number of design trade-offs to satisfy tight die size constraints and attain frequency goals.

201 citations


Journal ArticleDOI
H. Sharangpani, H. Arora
TL;DR: The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA) and employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software.
Abstract: The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms. The processor employs EPIC (explicitly parallel instruction computing) design concepts for a tighter coupling between hardware and software. In this design style the hardware-software interface lets the software exploit all available compilation time information and efficiently deliver this information to the hardware. It addresses several fundamental performance bottlenecks in modern computers, such as memory latency, memory address disambiguation, and control flow dependencies.

188 citations


Journal ArticleDOI
Jerome C Huck, D. Morris, J. Ross, A. Knies, H. Mulder, Rumi Zahir
TL;DR: The motivation, operation, and benefits of the major features of IA-64 are examined and it is found that instruction-level parallelism (ILP) can be exploited for further performance increases.
Abstract: Microprocessors continue on the relentless path to provide more performance. Every new innovation in computing (distributed computing on the Internet, data mining, Java programming, and multimedia data streams) requires more cycles and computing power. Even traditional applications such as databases and numerically intensive codes present increasing problem sizes that drive demand for higher performance. Design innovations, compiler technology, manufacturing process improvements, and integrated circuit advances have been driving exponential performance increases in microprocessors. To continue this growth in the future, Hewlett Packard and Intel architects examined barriers in contemporary designs and found that instruction-level parallelism (ILP) can be exploited for further performance increases. This article examines the motivation, operation, and benefits of the major features of IA-64. Intel's IA-64 manual provides a complete specification of the IA-64 architecture.

131 citations


Journal ArticleDOI
Gene A. Frantz
TL;DR: Developers will be challenged to use digital signal processing power to its utmost, while creating new applications and improving existing ones, for increasingly widespread applications.
Abstract: Advancements in digital signal processing technology are enabling its use for increasingly widespread applications. Developers will be challenged to use this processing power to its utmost, while creating new applications and improving existing ones.

121 citations


Journal ArticleDOI
TL;DR: This highly parallel DSP architecture based on a short-vector memory system incorporates techniques found in general-purpose computing and promises sustained performance close to its peak computational rates.
Abstract: This highly parallel DSP architecture based on a short-vector memory system incorporates techniques found in general-purpose computing. It promises sustained performance close to its peak computational rates of 900 MFLOPS (32-bit floating-point) or 3.6 BOPS (16-bit fixed-point).

114 citations


Journal ArticleDOI
TL;DR: The MAJC architecture enhances application performance by exploiting parallelism at multiple levels (instruction, data, thread, and process) and treats all data types similarly.
Abstract: The MAJC architecture enhances application performance by exploiting parallelism at multiple levels: instruction, data, thread, and process. Supporting vertical multithreading, speculative multithreading, and chip multiprocessors, the scalable VLIW architecture is also capable of advanced speculation and predication and treats all data types similarly.

Journal ArticleDOI
Nhon Quach
TL;DR: The Itanium Processor is the first implementation of the Intel IA-64 architecture, designed for the high-end server market segment, and equipped with many advanced RAS features to maximize system reliability and availability.
Abstract: The Itanium Processor is the first implementation of the Intel IA-64 architecture. Designed for the high-end server market segment, the processor is equipped with many advanced RAS (reliability, availability, and serviceability) features to maximize system reliability and availability.

Journal ArticleDOI
TL;DR: This work presents an architecture for network-authenticated disks that implements distributed file systems without file servers or encryption and provides network clients with direct network access to remote storage.
Abstract: We present an architecture for network-authenticated disks that implements distributed file systems without file servers or encryption. Our system provides network clients with direct network access to remote storage.

Journal ArticleDOI
TL;DR: For the home network to connect to every consumer electronic device, designers must consider robust system requirements, home phone line standards, costs, and implementation of a supporting iline10 chip set.
Abstract: In addition to shared Internet access for PCs, the home network will connect to every consumer electronic device. To make this possible, we must consider robust system requirements, home phone line standards, costs, and implementation of a supporting iline10 chip set.

Journal ArticleDOI
TL;DR: The identification and exploration of the design space of register renaming lead to a comprehensive understanding of this intricate technique; the operand fetch policy doubles the eight basic alternatives to 16 possible implementation schemes.
Abstract: Register renaming is a technique to remove false data dependencies, namely write after read (WAR) and write after write (WAW), that occur in straight-line code between register operands of subsequent instructions. By eliminating related precedence requirements in the execution sequence of the instructions, renaming increases the average number of instructions that are available for parallel execution per cycle. This results in increased IPC (number of instructions executed per cycle). The identification and exploration of the design space of register renaming lead to a comprehensive understanding of this intricate technique. As this article shows, the design space of register renaming is spanned by four main dimensions: the scope of register renaming, the layout of the rename buffers, the method of register mapping, and the rename rate. Relevant aspects of the design space give rise to eight basic alternatives for register renaming. In addition, the kind of operand fetch policy significantly affects how the processor carries out the rename process, which doubles the eight basic alternatives to 16 possible implementation schemes. The article indicates which basic implementation scheme is used in relevant superscalar processors. As register renaming is usually implemented in conjunction with shelving, the underlying microarchitecture is assumed to employ shelving.
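
How renaming removes the WAR and WAW dependencies named in the abstract above can be shown with a minimal sketch. It assumes a simple mapping table from architectural to physical registers and an unlimited supply of physical registers that are never reclaimed; the register names and the function `rename` are hypothetical, and the sketch does not correspond to any particular scheme from the article's design space.

```python
def rename(instructions, num_arch_regs=4):
    """Rename destinations so that only true (RAW) dependencies remain.

    instructions -- list of (dest, (src, ...)) using architectural names r0, r1, ...
    Returns the same program expressed in physical register names p0, p1, ...
    """
    # Initially architectural register rN lives in physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, srcs in instructions:
        # Sources read whichever physical register currently holds the value.
        new_srcs = tuple(mapping[s] for s in srcs)
        # Every write gets a fresh physical register, so a later write can
        # never clobber a value an earlier instruction still needs.
        new_dest = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


PROGRAM = [
    ("r1", ("r2", "r3")),   # i0: r1 = r2 op r3
    ("r2", ("r1", "r0")),   # i1: reads r1 (RAW on i0), writes r2 (WAR with i0)
    ("r1", ("r3", "r0")),   # i2: writes r1 again (WAW with i0, WAR with i1)
]
RENAMED = rename(PROGRAM)
# RENAMED == [("p4", ("p2", "p3")), ("p5", ("p4", "p0")), ("p6", ("p3", "p0"))]
```

After renaming, i2 shares no register with i0 or i1, so it could execute in parallel with either; only the true dependency of i1 on i0 (through `p4`) remains.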

Journal ArticleDOI
TL;DR: The author describes an undergraduate computer engineering curriculum using a rapid prototyping approach to simulate, synthesize, and implement digital system and computer architectures.
Abstract: Traditionally, undergraduates in electrical and computer engineering study the design and implementation of a simple computer and then develop their own designs. In recent years, computer design courses have for the most part taken a simulation-only approach. Rapid prototyping techniques and a new generation of large field-programmable logic devices (FPLDs) enabled an educational approach that combines modeling with hardware description languages (HDLs), extensive simulation, synthesis, and final verification on a hardware prototype. The author describes an undergraduate computer engineering curriculum using a rapid prototyping approach to simulate, synthesize, and implement digital system and computer architectures.

Journal ArticleDOI
TL;DR: This distributed network computer lets users access and run engineering tools anytime, anywhere via web browsers.
Abstract: This distributed network computer lets users access and run engineering tools anytime, anywhere via web browsers. It is the enabling technology for Netcare, an Internet resource that is now freely available for computer architecture education and research.

Journal ArticleDOI
TL;DR: The electron code generator (ECG) is described, the component of Intel's IA-64 production compiler that maximizes the benefits of instruction-level parallelism and control and data speculation.
Abstract: In planning the new EPIC (Explicitly Parallel Instruction Computing) architecture, Intel designers wanted to exploit the high level of instruction-level parallelism (ILP) found in application code. To accomplish this goal, they incorporated a powerful set of features such as control and data speculation, predication, register rotation, loop branches, and a large register file. By using these features, the compiler plays a crucial role in achieving the overall performance of an IA-64 platform. This paper describes the electron code generator (ECG), the component of Intel's IA-64 production compiler that maximizes the benefits of these features. The ECG consists of multiple phases. The first phase, translation, converts the optimizer's intermediate representation (ILO) of the program into the ECG IR. Predicate region formation, if-conversion, and compare generation occur in the predication phase. The ECG contains two schedulers: the software pipeliner for targeted cyclic regions and the global code scheduler for all remaining regions. Both schedulers make use of control and data speculation. The software pipeliner also uses rotating registers, predication, and loop branches to generate efficient schedules for integer as well as floating-point loops.

Journal ArticleDOI
R. Krishnaiyer, D. Kulkami, D. Laven, L. Wei, C.-C. Lim, J. Ng, D. Sehr
TL;DR: The IA-64 architecture's rich set of features enables aggressive high-level and scalar optimizations, supported by the latest analysis techniques, to improve integer and floating-point performance.
Abstract: The IA-64 architecture's rich set of features enables aggressive high-level and scalar optimizations, supported by the latest analysis techniques, to improve integer and floating-point performance.

Journal ArticleDOI
TL;DR: Two vector units embedded in the emotion engine chip support high-quality 3D graphics, emotion synthesis, and 300-MHz, 5.5-GFLOPS operation for the recently introduced PlayStation2 game entertainment system.
Abstract: Two vector units embedded in the emotion engine chip support high-quality 3D graphics, emotion synthesis, and 300-MHz, 5.5-GFLOPS operation for the recently introduced PlayStation2 game entertainment system.

Journal ArticleDOI
A. Suga, K. Matsunami
TL;DR: The FR-V architecture, which includes the variable-length VLIW and instruction set architectures, speculative execution control, and conditional execution control is described, and its performance is evaluated.
Abstract: Because conventional RISC processors have insufficient processing power to support the continuing development of digital consumer products, we need a new high performance processor for multimedia applications. Processing multimedia video images requires more than 10 times the currently available performance. At Fujitsu, we provide this higher performance in software to attain a high degree of flexibility. We developed the FR500 microprocessor with a novel embedded VLIW (very long instruction word) architecture for use in such digital consumer products. The FR500 is the first product in the FR-V line, Fujitsu's generic name for VLIW architecture microprocessors. The FR-V line offers the flexibility to develop new products optimized for a wide variety of digital consumer products. In this paper, we describe the FR-V architecture, which includes our variable-length VLIW and instruction set architectures, speculative execution control, and conditional execution control. We also evaluate its performance.

Journal ArticleDOI
TL;DR: A major problem in teaching computer architecture and organization courses is how to help students make the cognitive leap that connects their theoretical knowledge with practical experience; efforts to tackle it have produced a variety of educational tools for computer system simulation.
Abstract: A major problem in teaching computer architecture and organization courses is how to help students make the cognitive leap that connects their theoretical knowledge with practical experience. Numerous researchers involved in computer architecture and organization education have tackled this problem, resulting in a variety of educational tools for computer system simulation. The tools differ greatly in scope, target architecture complexity, simulation level, and user interface. The available educational systems vary in how they handle digital system simulation. They usually offer tools for creating hardware component libraries, viewing simulation results, and conducting statistical analysis of system performance. Available systems range from sophisticated ones, for complex analysis, to simpler ones that are more readily understood by users, both instructors and students. Beyond system simulation, an educational system should support three key objectives. First, it must cover an extensive range of computer architecture and organization topics. Second, it should graphically depict a computer system, from the block level to the register-transfer level. Third, it must provide the means to follow system functions at the program, instruction, and clock cycle levels.

Journal ArticleDOI
TL;DR: Merlot, the first MP98 architecture prototype, promises 1-GIPS performance at 1 watt for 1.3-V operations in support of smart 21st-century information terminals.
Abstract: Merlot, the first MP98 architecture prototype, promises 1-GIPS performance at 1 watt for 1.3-V operations in support of smart 21st-century information terminals.

Journal ArticleDOI
TL;DR: The case of the CD music publishing industry against Napster has now been argued before the US Court of Appeals in San Francisco and awaits decision, and the district court has issued its opinion explaining why it previously decided to order Napster to shut down operations.
Abstract: The case of the CD music publishing industry against Napster (A&M Records, Inc v Napster, Inc, ND Calif) has now been argued before the US Court of Appeals in San Francisco and awaits decision. In a curious reversal of customary judicial procedure, the district court has now issued its opinion ("sentence first, verdict afterward") explaining why it previously decided to order Napster to shut down operations. However, the court of appeals stayed that order in late July just hours before the order was scheduled to go into effect. Although the formal opinion is in the nature of assault and battery upon a dead horse, the opinion is nonetheless informative because it explains why the district court thought Napster's system shouldn't be permitted to operate.

Journal ArticleDOI
TL;DR: The MAP1000A is a single-chip, programmable mediaprocessor that also makes use of general-purpose RISC processing and a view framework; it provides a new programmable infrastructure suitable for replacing RISCs and ASICs in consumer electronics, communications, and imaging applications while retaining a completely high-level-language programming approach.
Abstract: This article presents the MAP1000A, an alternative to using custom ASICs for each multimedia-processing task. It is a single-chip, programmable mediaprocessor that also makes use of general-purpose RISC processing and a view framework. It provides a new programmable infrastructure with the cost, performance, and power characteristics suitable for replacing RISCs and ASICs in consumer electronics, communications, and imaging applications while retaining a completely high-level-language programming approach. This single-chip mediaprocessor handles all digital functions in high-level-language software with significantly improved performance and without increased system cost or development complexity.

Journal ArticleDOI
TL;DR: This work evaluates a series of three progressively more aggressive routing-table cache designs and demonstrates that the incorporation of hardware caches into Internet processors, combined with efficient caching algorithms can significantly improve overall packet forwarding performance.
Abstract: As a result of the exploding bandwidth demand from the Internet, network router and switch designers are designing and fabricating a growing number of microchips specifically for networking devices rather than traditional computing applications. In particular, a new breed of microprocessors, called Internet processors, has emerged that is designed to efficiently execute network protocols on various types of internetworking devices including switches, routers, and application-level gateways. We evaluate a series of three progressively more aggressive routing-table cache designs and demonstrate that the incorporation of hardware caches into Internet processors, combined with efficient caching algorithms, can significantly improve overall packet forwarding performance.
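
The general idea of placing a cache in front of a routing-table lookup can be sketched as follows. This is a software illustration only and does not reproduce any of the three cache designs evaluated in the article: it assumes a least-recently-used cache keyed by full destination address in front of a linear longest-prefix-match search, and the class `CachedForwarder`, the route table, and the next-hop labels are hypothetical.

```python
import ipaddress
from collections import OrderedDict


class CachedForwarder:
    """Longest-prefix-match lookup with a small LRU cache of recent destinations."""

    def __init__(self, routes, capacity=2):
        # routes: list of (prefix string, next hop label)
        self.routes = [(ipaddress.ip_network(prefix), hop) for prefix, hop in routes]
        self.cache = OrderedDict()      # destination address -> next hop
        self.capacity = capacity
        self.hits = 0
        self.misses = 0

    def _longest_prefix_match(self, addr):
        # Slow path: scan every route and keep the most specific match.
        best = None
        for net, hop in self.routes:
            if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, hop)
        return best[1] if best else None

    def lookup(self, destination):
        addr = ipaddress.ip_address(destination)
        if addr in self.cache:
            self.cache.move_to_end(addr)        # mark as most recently used
            self.hits += 1
            return self.cache[addr]
        self.misses += 1
        hop = self._longest_prefix_match(addr)
        self.cache[addr] = hop
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict the least recently used entry
        return hop


ROUTES = [("10.0.0.0/8", "A"), ("10.1.0.0/16", "B"), ("0.0.0.0/0", "default")]
forwarder = CachedForwarder(ROUTES, capacity=2)
```

Repeated lookups of the same destination are served from the cache and skip the route scan, so the benefit depends on how much temporal locality the packet stream has.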

Journal ArticleDOI
TL;DR: This DLX architecture model offers a variety of visualization mechanisms that can be used as a demonstration tool and as a platform for programming exercises through which students learn the importance of code optimization at the software-hardware interface.
Abstract: This DLX architecture model offers a variety of visualization mechanisms. Instructors use HASE both as a demonstration tool and as a platform for programming exercises through which students learn to appreciate the importance of code optimization at the software-hardware interface.

Journal ArticleDOI
A. Clements
TL;DR: The subject matter of computer architecture has grown in breadth and depth, forcing teachers to decide what to include and what to omit, and even to justify the course itself; the author makes a strong case for including a broad-based course early in every computing student's program.
Abstract: The subject matter of computer architecture has grown in breadth and depth, forcing teachers to decide what to include and what to omit, and even to justify the course itself. The author makes a strong case for including a broad-based course early in every computing student's program.

Journal ArticleDOI
TL;DR: The statistics collector and analyzer (SCA) records, displays, and analyzes performance measurements from an active IEEE-1394 bus in real time; an empirical analysis using SCA exposes the unique, complex arbitration mechanisms used by IEEE-1394 nodes and their effect on the performance of higher level protocols.
Abstract: The statistics collector and analyzer (SCA) records, displays, and analyzes performance measurements from an active IEEE-1394 bus in real time. An empirical analysis using SCA exposes the unique, complex arbitration mechanisms used by IEEE-1394 nodes and their effect on the performance of higher level protocols.