Author(s): Asanovic, K; Bodik, R; Catanzaro, B; Gebis, J; Husbands, P; Keutzer, K; Patterson, D; Plishker, W; Shalf, J; Williams, SW | Abstract: The recent switch to parallel microprocessors is a milestone in the history of computing. Industry has laid out a roadmap for multicore designs that preserves the programming paradigm of the past via binary compatibility and cache coherence. Conventional wisdom is now to double the number of cores on a chip with each silicon generation. A multidisciplinary group of Berkeley researchers met nearly two years to discuss this change. Our view is that this evolutionary approach to parallel hardware and software may work from 2 or 8 processor systems, but is likely to face diminishing returns as 16 and 32 processor systems are realized, just as returns fell with greater instruction-level parallelism. We believe that much can be learned by examining the success of parallelism at the extremes of the computing spectrum, namely embedded computing and high performance computing. This led us to frame the parallel landscape with seven questions, and to recommend the following: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar. • Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures. (A dwarf is an algorithmic method that captures a pattern of computation and communication.) • “Autotuners” should play a larger role than conventional compilers in translating parallel programs. • To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications. • To be successful, programming models should be independent of the number of processors. • To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism. 1 The Landscape of Parallel Computing Research: A View From Berkeley • Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters. • Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines. • To explore the design space rapidly, use system emulators based on Field Programmable Gate Arrays (FPGAs) that are highly scalable and low cost. Since real world applications are naturally parallel and hardware is naturally parallel, what we need is a programming model, system software, and a supporting architecture that are naturally parallel. Researchers have the rare opportunity to re-invent these cornerstones of computing, provided they simplify the efficient programming of highly parallel systems.

/pdf/the-landscape-of-parallel-computing-research-a-view-from-1i5pswggnf.pdf

The Landscape of Parallel Computing Research: A View from Berkeley

Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag significantly in terms of performance and energy efficiency. This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units. The ASIC is 500x more energy efficient than our original four-processor CMP. Broadly applicable optimizations improve performance by 10x and energy by 7x. However, the very low energy costs of actual core ops (100s fJ in 90nm) mean that over 90% of the energy used in these solutions is still "overhead". Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25x and the final customized CMP matches an ASIC solution's performance within 3x of its energy and within comparable area.

/pdf/understanding-sources-of-inefficiency-in-general-purpose-530x98wd4x.pdf

Understanding sources of inefficiency in general-purpose chips

The multiprocessor system-on-chip (MPSoC) uses multiple CPUs along with other hardware subsystems to implement a system. A wide range of MPSoC architectures have been developed over the past decade. This paper surveys the history of MPSoCs to argue that they represent an important and distinct category of computer architecture. We consider some of the technological trends that have driven the design of MPSoCs. We also survey computer-aided design problems relevant to the design of MPSoCs.

/pdf/multiprocessor-system-on-chip-mpsoc-technology-4ijji9oo01.pdf

Multiprocessor System-on-Chip (MPSoC) Technology

A growing number of applications, often with firm or soft real-time requirements, are integrated on the same System on Chip, in the form of either hardware or software intellectual property. The applications are started and stopped at run time, creating different use-cases. Resources, such as interconnects and memories, are shared between different applications, both within and between use-cases, to reduce silicon cost and power consumption.The functional and temporal behaviour of the applications is verified by simulation and formal methods. Traditionally, designers resort to monolithic verification of the system as whole, since the applications interfere in shared resources, and thus affect each other's behaviour. Due to interference between applications, the integration and verification complexity grows exponentially in the number of applications, and the task to verify correct behaviour of concurrent applications is on the system designer rather than the application designers.In this work, we propose a Composable and Predictable Multi-Processor System on Chip (CoMPSoC) platform template. This scalable hardware and software template removes all interference between applications through resource reservations. We demonstrate how this enables a divide-and-conquer design strategy, where all applications, potentially using different programming models and communication paradigms, are developed and verified independently of one another. Performance is analyzed per application, using state-of-the-art dataflow techniques or simulation, depending on the requirements of the application. These results still apply when the applications are integrated onto the platform, thus separating system-level design and application design.

CoMPSoC: A template for composable and predictable multi-processor system on chips

Introduction to Reconfigurable Computing provides a comprehensive study of the field Reconfigurable Computing. It provides an entry point to the novice willing to move in the research field reconfigurable computing, FPGA and system on programmable chip design. The book can also be used as teaching reference for a graduate course in computer engineering, or as reference to advance electrical and computer engineers. It provides a very strong theoretical and practical background to the field of reconfigurable computing, from the early Estrins machine to the very modern architecture like coarse-grained reconfigurable device and the embedded logic devices. Apart from the introduction and the conclusion, the main chapters of the book are Architecture of reconfigurable systems, Design and implementation, High-Level Synthesis for Reconfigurable Devices, Temporal placement, On-line and Dynamic Interconnection, Designing a reconfigurable application on Xilinx Virtex FPGA, System on programmable chip, Applications.

Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications

List of Figures. Foreword by Clayton Christensen. Foreword by John Hennessy. Author's Preface. Acknowledgments. 1. The Case for a New SOC Design Methodology. The Age of Megagate SOCs. The Fundamental Trends of SOC Design. What's Wrong with Today's Approach to SOC Design? Preview: An Improved Design Methodology for SOC Design. Further Reading. 2. SOC Design Today. Hardware System Structure. Software Structure. Current SOC Design Flow. The Impact of Semiconductor Economics. Six Major Issues in SOC Design. Further Reading. 3. A New Look at SOC Design. Accelerating Processors for Traditional Software Tasks. Example: Tensilica Xtensa Processors for EEMBC Benchmarks. System Design with Multiple Processors. New Essentials of SOC Design Methodology. Addressing the Six Problems. Further Reading. 4. System-Level Design of Complex SOCs Complex SOC System Architecture Opportunities. Major Decisions in Processor-Centric SOC Organization. Communication Design = Software Mode + Hardware Interconnect. Hardware Interconnect Mechanisms. Performance-Driven Communication Design. The SOC Design Flow. Non-Processor Building Blocks in Complex SOC. Implications of Processor-Centric SOC Architecture. Further Reading. 5. Configurable Processors: A Software View. Processor Hardware/Software Cogeneration. The Process of Instruction Definition and Application Tuning. The Basics of Instruction Extension. The Programmer's Model. Processor Performance Factors. Example: Tuning a Large Task. Memory-System Tuning. Long Instruction Words. Fully Automatic Instruction-Set Extension. Further Reading. 6. Configurable Processors: A Hardware View. Application Acceleration: A Common Problem. Introduction to Pipelines and Processors. Hardware Blocks to Processors. Moving from Hardwired Engines to Processors. Designing the Processor Interface. A Short Example: ATM Packet Segmentation and Reassembly. Novel Roles for Processors in Hardware Replacement. Processors, Hardware Implementation, and Verification Flow. Progress in Hardware Abstraction. Further Reading. 7. Advanced Topics in SOC Design. Pipelining for Processor Performance. Inside Processor Pipeline Stalls. Optimizing Processors to Match Hardware. Multiple Processor Debug and Trace. Issues in Memory Systems. Optimizing Power Dissipation in Extensible Processors. Essentials of TIE. Further Reading. 8. The Future of SOC Design: The Seaof Processors. Why Is Software Programmability So Central? Looking into the Future of SOC. Processor Scaling Model. Future Applications of Complex SOCs. The Future of the Complex SOC Design Process. The Future of the Industry. The Disruptive-Technology View. The Long View. Further Reading. Index.

Engineering the complex SOC : fast, flexible design with configurable processors

This paper focuses on a particular SOC design technology and methodology, here called the advanced or processor-centric SOC design method, which reduces the risk of SOC design and increases ROI by using configurable processors to implement on-chip functions while increasing the SOC's flexibility through software programmability. The essential enabler for this design methodology is automatic processor generation-the rapid and easy creation of new microprocessor architectures, complete with efficient hardware designs and comprehensive software tools. The high speed of the generation process and the great flexibility of the generated architectures underpin a fundamental shift of the role of processors in system architecture.

Flexible architectures for engineering successful SOCs

Moore's law (double the number of transistors at each new processing node) and classical semiconductor scaling (faster transistors running at lower power at each new processing node) parted company after the 130nm processing node. As a result, on-chip clock rates have stopped rising as fast and transistor power levels have stopped falling as quickly as they did in the past. This change in the trend demands a change to a system-design style that emphasizes the use of multiple processor cores (Leibson, 2006)

The Future of Nanometer SOC Design

Publisher Summary The application-specific instruction-set processor (ASIP) concept is reviewed in this chapter and discusses automated processor configuration Tailoring a processor to an application has been more of an art than an exact science, and the process demands effort when done on a manual ad hoc basis Many existing approaches to ASIP creation require the in-depth knowledge of a processor architect, the software knowledge of applications specialists, and the hardware-implementation skills of a team of experienced digital designers Both structural, coarse-grained configuration parameters (for example, the inclusion or exclusion of functional units, the width of processor-to-memory or bus interfaces, the number and size of local and system memories), and fine-grained instruction extensions (the addition of application-specific tuned instructions that accelerate the processing of major functional application kernels by a factor of 2 ×, 10 ×, and more) are possible in ASIP configuration Deciding the specific configuration parameters and extended instructions can be akin to finding the optimal needle in the proverbial haystack—and requires years of broad experience in a host of design disciplines With this in mind, the design and use of ASIPs on a widespread basis across multiple application domains demand a more automated process for creating these processors from high-level configuration specifications

Automated Processor Configuration and Instruction Extension

The combination of reduced core operating voltage and reduced clock frequency achieved through processor core ISA extension greatly reduces the energy required to execute the task, often by one to two orders of magnitude.

Steve Leibson

Papers

Engineering the complex SOC : fast, flexible design with configurable processors

Flexible architectures for engineering successful SOCs

The Future of Nanometer SOC Design

Automated Processor Configuration and Instruction Extension

Reduce SOC Energy Consumption through Processor ISA Extension