The scaling of microchip technologies has enabled large scale systems-on-chip (SoC). Network-on-chip (NoC) research addresses global communication in SoC, involving (i) a move from computation-centric to communication-centric design and (ii) the implementation of scalable communication structures. This survey presents a perspective on existing NoC research. We define the following abstractions: system, network adapter, network, and link to explain and structure the fundamental concepts. First, research relating to the actual network design is reviewed. Then system level design and modeling are discussed. We also evaluate performance analysis techniques. The research shows that NoC constitutes a unification of current trends of intrachip communication rather than an explicit new alternative.

/pdf/a-survey-of-research-and-practices-of-network-on-chip-2rc0mlsrfq.pdf

A survey of research and practices of Network-on-chip

Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA's CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth than to latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.

/pdf/analyzing-cuda-workloads-using-a-detailed-gpu-simulator-lz9yu5o7va.pdf

Analyzing CUDA workloads using a detailed GPU simulator
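
The abstract above lists memory request coalescing hardware among the design choices studied. As a rough illustration of why coalescing matters, the sketch below counts how many aligned memory segments one group of threads touches. The group size of 32, the 4-byte word, the 64-byte segment, and the function name `count_transactions` are all assumptions chosen for illustration; none of them comes from the paper or its simulator.

```python
def count_transactions(addresses, segment_bytes=64):
    """Count the distinct aligned segments touched by one group of threads.

    Each distinct segment stands in for one memory transaction
    (an illustrative simplification, not the simulator's model).
    """
    return len({addr // segment_bytes for addr in addresses})


WARP_SIZE = 32   # assumed number of threads issuing together
WORD_BYTES = 4   # assumed 32-bit accesses

# Unit stride: thread i reads word i, so neighbours share segments.
unit_stride = [i * WORD_BYTES for i in range(WARP_SIZE)]

# Stride of 16 words: every thread lands in its own 64-byte segment.
strided = [i * 16 * WORD_BYTES for i in range(WARP_SIZE)]

print(count_transactions(unit_stride))  # 2
print(count_transactions(strided))      # 32
```

Under these assumptions the unit-stride pattern needs 2 transactions while the strided pattern needs 32, which is the kind of gap coalescing hardware is meant to exploit.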

Concern about the performance of wires in scaled technologies has led to research exploring other communication methods. This paper examines wire and gate delays as technologies migrate from 0.18-μm to 0.035-μm feature sizes to better understand the magnitude of the wiring problem. Wires that shorten in length as technologies scale have delays that either track gate delays or grow slowly relative to gate delays. This result is good news since these "local" wires dominate chip wiring. Despite this scaling of local wire performance, computer-aided design (CAD) tools must still become more sophisticated in dealing with these wires. Under scaling, the total number of wires grows exponentially, so CAD tools will need to handle an ever-growing percentage of all the wires in order to keep designer workloads constant. Global wires present a more serious problem to designers. These are wires that do not scale in length since they communicate signals across the chip. The delay of these wires will remain constant if repeaters are used, meaning that relative to gate delays, their delays scale upwards. These increased delays for global communication will drive architectures toward modular designs with explicit global latency mechanisms.

The future of wires
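
The abstract above states that repeated global wires keep a constant delay, so their delay grows relative to gate delay. A back-of-the-envelope sketch of that ratio follows. It assumes that gate delay scales in proportion to feature size, which is a common first-order approximation and not a figure taken from the abstract; the function name is hypothetical.

```python
def relative_global_wire_delay(old_feature_um, new_feature_um):
    """Growth factor of (global wire delay / gate delay) between two nodes.

    Assumptions (illustrative only):
      - repeated global wire delay stays constant, as the abstract states;
      - gate delay is proportional to feature size (first-order model).
    """
    if old_feature_um <= 0 or new_feature_um <= 0:
        raise ValueError("feature sizes must be positive")
    return old_feature_um / new_feature_um


# The two feature sizes named in the abstract.
ratio = relative_global_wire_delay(0.18, 0.035)
print(f"{ratio:.2f}x")  # 5.14x
```

Under this simple model, a cross-chip signal would cost roughly five times as many gate delays at 0.035 μm as at 0.18 μm.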

In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor. We present a compiler and runtime system that abstracts and virtualizes many aspects of graphics hardware. In addition, we present an analysis of the effectiveness of the GPU as a compute engine compared to the CPU, to determine when the GPU can outperform the CPU for a particular algorithm. We evaluate our system with five applications: the SAXPY and SGEMV BLAS operators, image segmentation, FFT, and ray tracing. For these applications, we demonstrate that our Brook implementations perform comparably to hand-written GPU code and up to seven times faster than their CPU counterparts.

/pdf/brook-for-gpus-stream-computing-on-graphics-hardware-49gy5u8zyg.pdf

Brook for GPUs: stream computing on graphics hardware
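
SAXPY, one of the benchmarks named in the abstract above, computes a*x + y for each element. The sketch below shows the data-parallel shape of that computation: a kernel that sees one element of each input stream, applied independently across the whole stream. It is plain Python rather than Brook syntax, and the names `saxpy_kernel` and `run_kernel` are hypothetical.

```python
def saxpy_kernel(a, x_elem, y_elem):
    """Per-element kernel: touches only one element of each stream."""
    return a * x_elem + y_elem


def run_kernel(kernel, a, xs, ys):
    """Apply the kernel independently to every pair of stream elements.

    No element depends on another, so a real streaming runtime would be
    free to run these invocations in parallel; here they run in sequence.
    """
    if len(xs) != len(ys):
        raise ValueError("input streams must have the same length")
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


result = run_kernel(saxpy_kernel, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(result)  # [12.0, 24.0, 36.0]
```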

We characterize high-performance streaming applications as a new and distinct domain of programs that is becoming increasingly important. The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain. At the same time, the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations. In this paper, we motivate, describe and justify the language features of StreamIt, which include: a structured model of streams, a messaging system for control, a re-initialization mechanism, and a natural textual syntax.

/pdf/streamit-a-language-for-streaming-applications-il2k3rjwwy.pdf

StreamIt: A Language for Streaming Applications

Trends in VLSI technology scaling demand that future computing devices be narrowly focused to achieve high performance and high efficiency, yet also target the high volumes and low costs of widely applicable general purpose designs. To address these conflicting requirements, we propose a modular reconfigurable architecture called Smart Memories, targeted at computing needs in the 0.1 μm technology generation. A Smart Memories chip is made up of many processing tiles, each containing local memory, local interconnect, and a processor core. For efficient computation under a wide class of possible applications, the memories, the wires, and the computational model can all be altered to match the applications. To show the applicability of this design, two very different machines at opposite ends of the architectural spectrum, the Imagine stream processor and the Hydra speculative multiprocessor, are mapped onto the Smart Memories computing substrate. Simulations of the mappings show that the Smart Memories architecture can successfully map these architectures with only modest performance degradation.

Smart Memories: a modular reconfigurable architecture

Merrimac uses stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes) resulting in greater reliability, and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS supercomputer and present the results of some initial application experiments on this architecture.

/pdf/merrimac-supercomputing-with-streams-3wxuod8j31.pdf

Merrimac: Supercomputing with Streams
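
The figures quoted in the Merrimac abstract above can be checked with simple arithmetic. The sketch below derives the per-node performance implied by a 1-PFLOPS machine with 8,192 nodes, and the cost per GFLOPS at the two quoted price points. The helper name `dollars_per_gflops` is hypothetical; the numbers are the ones in the abstract.

```python
TFLOPS = 1e12
PFLOPS = 1e15

# 1 PFLOPS spread over 8,192 nodes.
per_node_flops = 1 * PFLOPS / 8192
print(f"{per_node_flops / 1e9:.2f} GFLOPS per node")  # 122.07 GFLOPS per node


def dollars_per_gflops(cost_dollars, flops):
    """Cost divided by performance expressed in GFLOPS."""
    return cost_dollars / (flops / 1e9)


workstation = dollars_per_gflops(20_000, 2 * TFLOPS)
supercomputer = dollars_per_gflops(20_000_000, 2 * PFLOPS)
print(workstation, supercomputer)  # 10.0 10.0
```

Both quoted configurations work out to the same $10 per GFLOPS, so the abstract's scaling claim keeps cost per unit of performance constant across three orders of magnitude.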

Many current programmable architectures designed to exploit data parallelism require computation to be structured to operate on sequentially accessed vectors or streams of data. Applications with less regular data access patterns perform sub-optimally on such architectures. We present a register file for streams (SRF) that allows arbitrary, indexed accesses. Compared to sequential SRF access, indexed access captures more temporal locality, reduces data replication in the SRF, and provides efficient support for certain types of complex access patterns. Our simulations show that indexed SRF access provides speedups of 1.03x to 4.1x and memory bandwidth reductions of up to 95% over sequential SRF access for a set of benchmarks representative of data-parallel applications with irregular accesses. Indexed SRF access also provides greater speedups than caches for a number of application classes despite significantly lower hardware costs. The area overhead of our indexed SRF implementation is 11%-22% over a sequentially accessed SRF, which corresponds to a modest 1.5%-3% increase in the total die area of a typical stream processor.

/pdf/stream-register-files-with-indexed-access-ab01kfjglf.pdf

Stream register files with indexed access
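
The two area figures in the abstract above (11%-22% SRF overhead, 1.5%-3% die increase) together imply how much of the die the SRF occupies. The sketch below works that out, assuming the die grows only because the SRF grows. That assumption and the function name are mine; the resulting fraction is an inference from the quoted percentages, not a number reported in the abstract.

```python
def implied_srf_die_fraction(srf_overhead, die_increase):
    """Fraction of the die taken by the SRF, if SRF growth alone
    accounts for the die-area increase (illustrative assumption)."""
    if srf_overhead <= 0:
        raise ValueError("SRF overhead must be positive")
    return die_increase / srf_overhead


low = implied_srf_die_fraction(0.11, 0.015)   # low end of both ranges
high = implied_srf_die_fraction(0.22, 0.03)   # high end of both ranges
print(f"{low:.1%} {high:.1%}")  # 13.6% 13.6%
```

Both ends of the range give the same answer, which suggests the quoted percentages are internally consistent with an SRF that takes up roughly one seventh of the die.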

Several classes of applications with abundant fine-grain parallelism, such as media and signal processing, graphics, and scientific computing, have become increasingly dominant consumers of computing resources. Prior research has shown that stream processors provide an energy-efficient, programmable approach to achieving high performance for these applications. However, given the strong compute capabilities of these processors, efficient utilization of bandwidth, particularly when accessing off-chip memory, is crucial to sustaining high performance. 
This thesis explores tradeoffs in, and techniques for, improving the efficiency of memory and bandwidth hierarchy utilization in stream processors. We first evaluate the appropriate granularity for expressing data-level parallelism—entire records or individual words—and show that record-granularity expression of parallelism leads to reduced intermediate state storage requirements and higher sustained bandwidths in modern memory systems. We also explore the effectiveness of software- and hardware-managed memories, and identify the relative merits of each type of memory within the context of stream computing. Software-managed memories are shown to efficiently support coarse-grain and producer-consumer data reuse, while hardware-managed memories are shown to effectively capture fine-grain and irregular temporal reuse. 
We introduce three new techniques for improving the efficiency of off-chip memory bandwidth utilization. First, we propose a stream register file architecture that enables indexed, arbitrary access patterns, allowing a wider range of data reuse to be captured in on-chip, software-managed memory compared to current stream processors. We then introduce epoch-based cache invalidation—a technique that actively identifies and invalidates dead data—to improve the performance of hardware-managed caches for stream computing. Finally, we propose a hybrid bandwidth hierarchy that incorporates both hardware- and software-managed memory, and allows dynamic reallocation of capacity between these two types of memories to better cater to application requirements. Our analyses and evaluations show that these techniques not only provide performance improvements for existing streaming applications but also broaden the capabilities of stream processors, enabling new classes of applications to be executed efficiently.

Memory hierarchy design for stream computing

As device scales shrink, higher transistor counts are available while soft errors, even in logic, become a major concern. A new class of architectures, such as Merrimac and the IBM Cell, takes advantage of the higher transistor count by exposing control, communication, and a large number of functional units at the architectural level, thus achieving high performance and efficiency. This paper explores soft-error fault tolerance in the context of these compute-intensive architectures, which differ significantly from their control-intensive CPU counterparts. The main goal of the proposed schemes for Merrimac is to conserve the critical and costly off-chip bandwidth and on-chip storage resources, while maintaining high peak and sustained performance. We achieve this by allowing for reconfigurability and relying on programmer input. The processor is either run at full peak performance employing software fault-tolerance methods, or at reduced performance with hardware redundancy. We present several methods, their analysis, and detailed case studies.

/pdf/fault-tolerance-techniques-for-the-merrimac-streaming-3f2imv8m65.pdf

Nuwan Jayasena

Papers

Smart Memories: a modular reconfigurable architecture

Merrimac: Supercomputing with Streams

Stream register files with indexed access

Memory hierarchy design for stream computing

Fault Tolerance Techniques for the Merrimac Streaming Supercomputer