
Gu-Yeon Wei

Researcher at Harvard University

Publications: 16
Citations: 438

Gu-Yeon Wei is an academic researcher at Harvard University. The author has contributed to research in topics including Hardware acceleration and Multi-core processor. The author has an h-index of 8 and has co-authored 16 publications receiving 296 citations.

Papers
Proceedings ArticleDOI

DeepRecSys: a system for optimizing end-to-end at-scale neural recommendation inference

TL;DR: This paper presents DeepRecSched, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account the characteristics of inference query sizes and arrival patterns, model architectures, and the underlying hardware systems.
Journal ArticleDOI

ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency

TL;DR: ReVIVaL is presented, which combines two fine-grained post-fabrication tuning techniques, voltage interpolation (VI) and variable latency (VL), and shows that the frequency variation between chips, between cores on one chip, and between functional units within cores can be reduced to a very small range.
Journal ArticleDOI

ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency

TL;DR: The ReVIVaL technique combines the post-fabrication tuning techniques voltage interpolation (VI) and variable latency (VL) to reduce frequency variations caused by process variations.
Posted Content

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference.

TL;DR: DeepRecSched is proposed, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, model architectures, and underlying hardware systems.
Journal ArticleDOI

SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads

TL;DR: This paper presents SMAUG, a DNN framework purpose-built for simulating end-to-end deep learning applications, which can be used to evaluate the performance and energy efficiency of deep learning workloads.