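On the Cleanliness Requirements of the International Technology Roadmap for Semiconductors 2003: Required Cleanliness of Silicon Wafer Surfaces and Ambient Environments, and the Current State of Analysis Methods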

http://user.it.uu.se/~jarst116/slides/week4.pdf

Graph-Based Algorithms for Boolean Function Manipulation

Principles of Asynchronous Circuit Design - A Systems Perspective addresses the need for an introductory text on asynchronous circuit design. Part I is an 8-chapter tutorial which addresses the most important issues for the beginner, including how to think about asynchronous systems. Part II is a 4-chapter introduction to Balsa, a freely-available synthesis system for asynchronous circuits which will enable the reader to get hands-on experience of designing high-level asynchronous systems. Part III offers a number of examples of state-of-the-art asynchronous systems to illustrate what can be built using asynchronous techniques. The examples range from a complete commercial smart card chip to complex microprocessors. The objective in writing this book has been to enable industrial designers with a background in conventional (clocked) design to be able to understand asynchronous design sufficiently to assess what it has to offer and whether it might be advantageous in their next design task.

/pdf/principles-of-asynchronous-circuit-design-a-systems-58fhcojs1t.pdf

Principles of Asynchronous Circuit Design: A Systems Perspective

We consider a fully SAT-based method of unbounded symbolic model checking based on computing Craig interpolants. In benchmark studies using a set of large industrial circuit verification instances, this method is greatly more efficient than BDD-based symbolic model checking, and compares favorably to some recent SAT-based model checking methods on positive instances.

Interpolation and SAT-based model checking

To alleviate the complex communication problems that arise as the number of on-chip components increases, network-on-chip (NoC) architectures have been recently proposed to replace global interconnects. In this paper, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem description, proposed approaches, and open issues are discussed for each problem from system, microarchitecture, and circuit perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective.

Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives

Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz. However, every SFQ gate is clocked, creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by redesigning the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side, which can be fed back earlier for use in subsequent data-dependent operations, increasing their throughput. In particular, we propose to group the bits into 4-bit blocks that are operated on concurrently and create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we develop a block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data-dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs, respectively. We have quantified the benefit of this design on instructions per cycle (IPC) for various RISC-V benchmarks, assuming a range of non-ALU operation latencies from one to ten cycles. Averaging across benchmarks, our experimental results show that, compared to the 32-bit Ladner-Fischer ALU, our proposed architecture provides IPC improvements ranging from 1.37x assuming one-cycle non-ALU latency to 1.2x assuming ten-cycle non-ALU latency. Moreover, our average IPC improvements compared to a 32-bit ALU based on the 4-bit bit-slice range from 2.93x to 4x.

qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit
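
The block-skewed idea described in the abstract above can be illustrated with a small behavioural sketch in Python. This is a toy model under stated assumptions, not the paper's gate-level Verilog design: the function names `block_skewed_add` and `dependent_chain_steps` are hypothetical, and the timing model (one step per 4-bit block, with a dependent operation allowed to start one step after its predecessor) is a simplification that does not reproduce the reported 2x and 1.5x figures.

```python
BLOCK_BITS = 4
NUM_BLOCKS = 8  # 8 blocks x 4 bits = 32 bits
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def block_skewed_add(a, b):
    """Add two 32-bit values one 4-bit block per step, LSB block first.

    Returns (sum modulo 2**32, schedule), where schedule[i] is the
    hypothetical step at which block i of the result becomes available.
    """
    carry = 0
    result = 0
    schedule = []
    for i in range(NUM_BLOCKS):
        shift = i * BLOCK_BITS
        a_blk = (a >> shift) & BLOCK_MASK
        b_blk = (b >> shift) & BLOCK_MASK
        total = a_blk + b_blk + carry
        result |= (total & BLOCK_MASK) << shift
        carry = total >> BLOCK_BITS  # carry into the next, later block
        schedule.append(i + 1)       # block i is ready after step i + 1
    return result, schedule


def dependent_chain_steps(num_ops, skewed=True):
    """Steps needed for num_ops back-to-back dependent adds in this toy model.

    Skewed: an operation starts one step after its predecessor started,
    because the predecessor's LSB block is already available.
    Unskewed: an operation waits until all blocks of its predecessor finish.
    """
    if num_ops <= 0:
        return 0
    if skewed:
        return NUM_BLOCKS + (num_ops - 1)
    return NUM_BLOCKS * num_ops
```

In this simplified model, a chain of four dependent additions takes 11 steps when skewed and 32 steps otherwise; the gap comes entirely from letting the next operation consume the first 4-bit block as soon as it completes.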

We describe a new design technique for efficient harmonic resonant rail drivers. The proposed circuit implementation is coupled to a standard pulse source and uses only discrete passive components and no external dc power supply. It can thus be externally tuned to minimize the consumed power in the target IC. A new design technique based on current-fed voltage pulse-forming network theory is proposed to find the value of each discrete component for a target frequency and a given load capacitance. The proposed circuit topology can be used to generate any desired periodic 50% duty-cycle waveform by superimposing multiple harmonics of the desired waveform; however, this paper focuses on the generation of trapezoidal-wave clock signals. We have tested the driver with a capacitive load between 38.3 and 97.8 pF and a clock frequency ranging between 0.8 and 15 MHz. The overall power dissipation for our second-order harmonic rail driver is 19% of $f C_L V^2$ at 15 MHz and 97.8 pF load.

/pdf/voltage-pulse-driven-harmonic-resonant-rail-drivers-for-low-15han7dix9.pdf

Voltage-pulse driven harmonic resonant rail drivers for low-power applications
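
The power figure quoted in the abstract above (19% of the conventional f times C_L times V squared at 15 MHz and 97.8 pF) can be turned into a quick calculation. The sketch below is only arithmetic on the numbers stated in the abstract; the function name `rail_driver_power` is hypothetical, and the voltage swing is an assumption because the abstract does not give one.

```python
def rail_driver_power(f_hz, c_load_f, v_swing, fraction=0.19):
    """Power in watts as a fraction of the conventional f * C_L * V^2.

    fraction=0.19 is the value reported in the abstract for the
    second-order harmonic rail driver; v_swing is supplied by the caller.
    """
    return fraction * f_hz * c_load_f * v_swing ** 2


# Operating point from the abstract: 15 MHz, 97.8 pF.
# A 1 V swing is an assumed, illustrative value, so the result can be
# read as watts per volt squared.
power_per_v2 = rail_driver_power(15e6, 97.8e-12, 1.0)
```

At that operating point f times C_L is about 1.467 mW per volt squared, so the reported 19% corresponds to roughly 0.279 mW per volt squared of swing.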

Today’s high-resolution, high-frame-rate cameras in autonomous vehicles generate a large volume of data that needs to be transferred and processed by a downstream processor or machine learning (ML) accelerator to enable intelligent computing tasks, such as multi-object detection and tracking. The massive amount of data transfer incurs significant energy, latency, and bandwidth bottlenecks, which hinder real-time processing. To mitigate this problem, we propose an algorithm-hardware co-design framework called Processing-in-Pixel-in-Memory-based object Detection and Tracking (P2M-DeTrack). P2M-DeTrack is based on a custom faster R-CNN-based model that is distributed partly inside the pixel array (front-end) and partly in a separate FPGA/ASIC (back-end). The proposed front-end in-pixel processing down-samples the input feature maps significantly with judiciously optimized strided convolution and pooling. Compared to a conventional baseline design that transfers frames of RGB pixels to the back-end, the resulting P2M-DeTrack designs reduce the data bandwidth between sensor and back-end by up to 24×. The designs also reduce the sensor and total energy (obtained from in-house circuit simulations at Globalfoundries 22nm technology node) per frame by 5.7× and 1.14×, respectively. Lastly, they reduce the sensing and total frame latency by an estimated 1.7× and 3×, respectively. We evaluate our approach on the multi-object detection (tracking) task of the large-scale BDD100K dataset and observe only a 0.5% reduction in the mean average precision (0.8% reduction in the identification F1 score) compared to the state-of-the-art.

P2M-DeTrack: Processing-in-Pixel-in-Memory for Energy-efficient and Real-Time Multi-Object Detection and Tracking

Neural networks have proven to be extremely powerful tools for modern artificial intelligence applications, but computational and storage complexity remain limiting factors. This paper presents two compatible contributions towards reducing the time, energy, computational, and storage complexities associated with multilayer perceptrons. The first contribution is pre-defined sparsity, which is proposed to reduce the complexity during both training and inference, regardless of the implementation platform. Our results show that storage and computational complexity can be reduced by factors greater than 5X without significant performance loss. The second contribution is an architecture for hardware acceleration that is compatible with pre-defined sparsity. This architecture supports both training and inference modes and is flexible in the sense that it is not tied to a specific number of neurons. For example, this flexibility implies that neural networks of various sizes can be supported on field-programmable gate arrays (FPGAs) of various sizes.

Pre-Defined Sparse Neural Networks with Hardware Acceleration
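
Pre-defined sparsity, as described in the abstract above, fixes the connection pattern of a layer before training so that absent weights are never stored or computed. The sketch below is a minimal pure-Python illustration under assumptions: the rotating fixed fan-in pattern and the names `predefined_mask`, `density`, and `sparse_forward` are hypothetical and are not taken from the paper, whose actual connection patterns are not given in the abstract.

```python
def predefined_mask(n_in, n_out, fan_in):
    """Build a connectivity mask that is fixed before training.

    Each output neuron j connects to fan_in consecutive inputs (with
    wrap-around), starting at an offset that rotates with j. This
    particular pattern is an assumption chosen for illustration.
    """
    if not 0 < fan_in <= n_in:
        raise ValueError("fan_in must be between 1 and n_in")
    mask = [[0] * n_in for _ in range(n_out)]
    for j in range(n_out):
        for k in range(fan_in):
            mask[j][(j * fan_in + k) % n_in] = 1
    return mask


def density(mask):
    """Fraction of possible connections that are present in the mask."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def sparse_forward(x, weights, mask):
    """Linear layer output that uses only the pre-defined connections."""
    return [
        sum(w * xi for w, m, xi in zip(w_row, m_row, x) if m)
        for w_row, m_row in zip(weights, mask)
    ]


# A 20-input, 10-output layer with fan-in 4 keeps 40 of 200 weights.
mask = predefined_mask(20, 10, 4)
outputs = sparse_forward([1.0] * 20, [[1.0] * 20 for _ in range(10)], mask)
```

With a fan-in of 4 out of 20 inputs, the layer keeps one fifth of the dense weights, which is the kind of 5X reduction in storage and multiply-accumulate count the abstract refers to; whether accuracy survives at that density is an empirical question this sketch does not address.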

This paper presents comprehensive energy-throughput comparisons of two well-known asynchronous design styles applied to a matrix-vector multiplication core of the discrete cosine transform (DCT). The first design style, bundled-data pipelines, uses a single-rail synchronous datapath with recently proposed true-four-phase controllers integrated with data-dependent delay lines. The design achieves reasonably high average performance and very low energy but requires significant design effort to verify the two-sided timing constraints (set-up and hold) typical of bundled-data pipelines. The second design style, 2D QDI pipelines, consists of a network of small cells communicating through delay-insensitive 1-of-N encoded channels. Compared to the bundled-data counterpart, transistor-level simulations show that all QDI designs achieve higher throughput at the cost of larger area and energy, and in particular have a 22% better $E\tau^2$ metric. In addition, the QDI designs require less design effort than the bundled-data counterpart, because they require virtually no timing verification.
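
The energy-delay-squared metric mentioned in the abstract above (energy per operation multiplied by the square of the cycle time, lower being better) explains how a design can spend more energy and still come out ahead. The sketch below uses made-up, normalized numbers purely for illustration; they are assumptions and are not the paper's measurements, and the function names are hypothetical.

```python
def e_tau_squared(energy, cycle_time):
    """Energy times cycle time squared; lower is better."""
    return energy * cycle_time ** 2


def relative_improvement(baseline, candidate):
    """Fractional reduction of the metric relative to the baseline."""
    return (baseline - candidate) / baseline


# Hypothetical normalized values (not from the paper): the second design
# uses 20% more energy per operation but has a 20% shorter cycle time.
baseline_metric = e_tau_squared(energy=1.0, cycle_time=1.0)
candidate_metric = e_tau_squared(energy=1.2, cycle_time=0.8)
gain = relative_improvement(baseline_metric, candidate_metric)
```

Because the cycle time enters squared, the shorter cycle outweighs the higher energy in this example, and the candidate's metric is about 23% lower than the baseline's.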

Peter A. Beerel

Papers

qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit

Voltage-pulse driven harmonic resonant rail drivers for low-power applications

P2M-DeTrack: Processing-in-Pixel-in-Memory for Energy-efficient and Real-Time Multi-Object Detection and Tracking

Pre-Defined Sparse Neural Networks with Hardware Acceleration

An asynchronous pipeline comparisons with application to DCT matrix-vector multiplication