
Showing papers presented at the International Conference on Computer Design (ICCD) in 2014


Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper introduces a new mechanism, called Loose-Ordering Consistency (LOC), that satisfies the ordering requirements of persistent memory writes at significantly lower performance degradation than state-of-the-art mechanisms.
Abstract: Emerging non-volatile memory (NVM) technologies enable data persistence at the main memory level at access speeds close to DRAM. In such persistent memories, memory writes need to be performed in strict order to satisfy storage consistency requirements and enable correct recovery from system crashes. Unfortunately, adhering to a strict order for writes to persistent memory significantly degrades system performance, as it requires flushing dirty data blocks from CPU caches and waiting for their completion at the main memory in the order specified by the program. This paper introduces a new mechanism, called Loose-Ordering Consistency (LOC), that satisfies the ordering requirements of persistent memory writes at significantly lower performance degradation than state-of-the-art mechanisms. LOC consists of two key techniques. First, Eager Commit reduces the commit overhead for writes within a transaction by eliminating the need to perform a persistent commit record write at the end of a transaction. We do so by ensuring that we can determine the status of all committed transactions during recovery by storing necessary metadata information statically with blocks of data written to memory. Second, Speculative Persistence relaxes the ordering of writes between transactions by allowing writes to be speculatively written to persistent memory. A speculative write is made visible to software only after its associated transaction commits. To enable this, our mechanism requires the tracking of committed transaction IDs and support for multi-versioning in the CPU cache. Our evaluations show that LOC reduces the average performance overhead of strict write ordering from 66.9% to 34.9% on a variety of workloads.
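
As a loose illustration of the Eager Commit idea described above (all data structures here are invented for the sketch), each block written to persistent memory carries its transaction ID as static metadata, so recovery can decide commit status from a committed-ID table rather than waiting on a per-transaction commit record write:

```python
persistent = {}    # addr -> (txid, value): data block plus its static metadata
committed = set()  # recovery-visible table of committed transaction IDs

def tx_write(txid, addr, value):
    persistent[addr] = (txid, value)   # no per-write ordering point needed

def tx_commit(txid):
    committed.add(txid)                # no separate commit record write

def recover():
    """After a crash, keep only blocks whose transaction committed."""
    return {a: v for a, (t, v) in persistent.items() if t in committed}

tx_write(1, "A", 10); tx_write(1, "B", 20); tx_commit(1)
tx_write(2, "C", 30)       # crash before transaction 2 commits
print(recover())           # {'A': 10, 'B': 20}: speculative write C discarded
```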

145 citations


Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper presents the Blacklisting Memory Scheduler (BLISS), which achieves high system performance and fairness while incurring low hardware cost and complexity, and evaluates it across a wide variety of workloads and system configurations.
Abstract: In a multicore system, applications running on different cores interfere at main memory. This inter-application interference degrades overall system performance and unfairly slows down applications. Prior works have developed application-aware memory request schedulers to tackle this problem. State-of-the-art application-aware memory request schedulers prioritize memory requests of applications that are vulnerable to interference, by ranking individual applications based on their memory access characteristics and enforcing a total rank order.
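
The abstract above is cut off before the scheduler itself is described; the toy below shows the general blacklisting idea in the spirit of the TL;DR, with the streak threshold and clearing interval chosen arbitrarily: an application whose requests are served in long consecutive streaks is temporarily deprioritized instead of being globally ranked.

```python
from collections import deque

STREAK_LIMIT, CLEAR_INTERVAL = 4, 1000   # assumed parameters

def schedule(queue, cycle, state):
    """Serve the oldest request from a non-blacklisted application."""
    if cycle % CLEAR_INTERVAL == 0:
        state["blacklist"].clear()        # periodically forgive everyone
    pick = next((r for r in queue if r not in state["blacklist"]), queue[0])
    if pick == state["last"]:
        state["streak"] += 1
        if state["streak"] >= STREAK_LIMIT:
            state["blacklist"].add(pick)  # served too many in a row: blacklist
    else:
        state["last"], state["streak"] = pick, 1
    queue.remove(pick)
    return pick

state = {"blacklist": set(), "last": None, "streak": 0}
q = deque(["A", "A", "A", "A", "A", "B", "A", "B"])
print([schedule(q, c, state) for c in range(1, 9)])
# ['A', 'A', 'A', 'A', 'B', 'B', 'A', 'A']: streaky app A yields to B
```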

107 citations


Proceedings ArticleDOI
04 Dec 2014
TL;DR: The current ITRS roadmapping process is extended with studies of key requirements from a system-level perspective, based on multiple generations of smartphones and microservers, and the new system-level framing of the roadmap is referred to as ITRS 2.0.
Abstract: The International Technology Roadmap for Semiconductors (ITRS) has roadmapped technology requirements of the semiconductor industry over the past two decades. The roadmap identifies major challenges in advanced technology and leads the investment of research in a cost-effective way. Traditionally, the ITRS identifies major semiconductor IC products as drivers; these set requirements for the state-of-the-art semiconductor technologies. High-performance microprocessor unit (MPU-HP) for servers and consumer portable system-on-chip (SOC-CP) for smartphones are two examples. Throughout the history of the ITRS, Moore’s Law has been the main impetus for these drivers, continuously pushing the transistor density to scale at a rate of 2× per technology generation (aka “node”). However, as new requirements from applications such as data center, mobility, and context-aware computing emerge, the existing roadmapping methodology is unable to capture the entire evolution of the current semiconductor industry. Today, comprehending how key markets and applications drive the process, design and integration technology roadmap requires new system-level studies along with chip-level studies. In this paper, we extend the current ITRS roadmapping process with studies of key requirements from a system-level perspective, based on multiple generations of smartphones and microservers. We describe potential new system drivers and new metrics, and we refer to the new system-level framing of the roadmap as ITRS 2.0.

81 citations


Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper proposes a technique to share random number generators among several SNGs, and discusses the influence of input correlation around a multiplexer (the scaled adder of stochastic computing) so as to avoid over-reducing the input correlation.
Abstract: Stochastic computing, which is an approximate computation with probabilities (called stochastic numbers), draws attention as an alternative to deterministic computing. In this paper, we discuss the design of compact and accurate stochastic circuits. Stochastic circuits are known as a way to stochastically compute complex calculations at low hardware cost, while stochastic number generators (SNGs), which are used for converting deterministic numbers into stochastic numbers, account for a large fraction of the circuits. To reduce such SNGs in stochastic circuits, we propose a technique to share random number generators among several SNGs. This sharing method employs circular shift of the output of LFSRs to reduce the correlation between stochastic numbers. We also discuss the influence of input correlation around a multiplexer, which is the scaled adder of stochastic computing, so as to avoid over-reducing the input correlation. Applying the proposed techniques to two stochastic image-processing applications shows a reduction in the size of SNGs without greatly sacrificing accuracy.
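
A minimal sketch of the sharing idea (LFSR polynomial, width, and shift amount are assumptions of this toy): one 8-bit maximal-length LFSR drives two comparator-based SNGs, with the second SNG reading a circularly shifted copy of the LFSR state to reduce the correlation between the two bit-streams.

```python
def lfsr_seq(seed=0x5A, n=255):
    """8-bit maximal-length Fibonacci LFSR (x^8 + x^6 + x^5 + x^4 + 1)."""
    s, out = seed, []
    for _ in range(n):
        out.append(s)
        fb = ((s >> 7) ^ (s >> 5) ^ (s >> 4) ^ (s >> 3)) & 1
        s = ((s << 1) | fb) & 0xFF
    return out

def rotl8(x, k):
    return ((x << k) | (x >> (8 - k))) & 0xFF

def sng(randoms, p256):
    """Comparator SNG: output 1 whenever the random value is below p256."""
    return [1 if r < p256 else 0 for r in randoms]

states = lfsr_seq()
x = sng(states, 128)                        # X ~= 0.5
y = sng([rotl8(s, 3) for s in states], 96)  # Y ~= 0.375, shared LFSR
z = [a & b for a, b in zip(x, y)]           # AND gate: stochastic multiply
print(sum(z) / len(z))                      # ~0.19, near X*Y = 0.1875
```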

66 citations


Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper proposes a framework that uses a nonparametric regression technique from machine learning to construct a routing congestion model; the model captures multiple factors and enables direct prediction of detailed routing congestion with high accuracy, yielding significant reductions in design rule violations and detailed routing runtime.
Abstract: A routing congestion model is of great importance in the design stages of modern physical synthesis, e.g., global routing and routability estimation during placement. As the technology node becomes smaller, routing congestion is more difficult to estimate during design stages ahead of detailed routing. In this paper, we propose a framework that uses a nonparametric regression technique from machine learning to construct a routing congestion model. The constructed model can capture multiple factors and enables direct prediction of detailed routing congestion with high accuracy. By using this model in global routing, a significant reduction in design rule violations and detailed routing runtime can be achieved compared with the model in previous work, with small overhead in global routing runtime and memory usage.
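
A hedged sketch of the regression idea (the features, training labels, and bandwidth are all invented for illustration; the paper does not prescribe this particular estimator): a Nadaraya-Watson kernel regression mapping placement-stage features to observed detailed-routing congestion.

```python
import numpy as np

def kernel_regress(X_train, y_train, x, h=0.1):
    """Predict y(x) as a Gaussian-kernel weighted average of training labels."""
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2 * h ** 2))
    return float(np.dot(w, y_train) / np.sum(w))

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))   # e.g., [pin density, net crossings]
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(0, 0.02, 200)

print(kernel_regress(X, y, np.array([0.8, 0.4])))  # ~0.61 on this synthetic data
```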

54 citations


Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper develops an algorithm to derive the Pareto-optimal curve (performance versus area) of an application when mapped onto FPGAs using HLS, along with accurate performance and area models to assist the design space exploration process.
Abstract: Real-world applications such as image processing, signal processing, and others often contain a sequence of computation intensive kernels, each represented in the form of a nested loop. High-level synthesis (HLS) enables efficient hardware implementation of these loops using high-level programming languages. HLS tools also allow the designers to evaluate design choices with different trade-offs through pragmas/directives. Prior design space exploration techniques for HLS primarily focus on either a single nested loop or multiple loops without consideration of the data dependencies among them. In this paper, we propose efficient design space exploration techniques for applications that consist of multiple nested loops with or without data dependencies. In particular, we develop an algorithm to derive the Pareto-optimal curve (performance versus area) of the application when mapped onto FPGAs using HLS. Our algorithm is efficient as it effectively prunes the dominated points in the design space. We also develop accurate performance and area models to assist the design space exploration process. Experiments on various scientific kernels and real-world applications demonstrate that our design space exploration technique is accurate and efficient.
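
Since the abstract centers on deriving a Pareto-optimal curve by pruning dominated points, here is a small self-contained sketch of that pruning step (the design points are invented; both axes are minimized, with area standing in for resources and latency for inverse performance):

```python
def pareto_front(points):
    """Keep the points not dominated by any other (minimizing both axes)."""
    return sorted(p for p in points
                  if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                             for q in points))

# (area, latency) candidates from hypothetical pragma combinations:
designs = [(100, 9.0), (120, 7.5), (150, 7.6), (180, 5.0), (200, 5.2)]
print(pareto_front(designs))   # [(100, 9.0), (120, 7.5), (180, 5.0)]
```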

46 citations


Proceedings ArticleDOI
01 Oct 2014
TL;DR: A PID (Proportional Integral Derivative) controller-based dynamic power management method that enforces an upper bound on power consumption (the Thermal Design Power) and provides fine-grained DVFS (Dynamic Voltage and Frequency Scaling), including near-threshold operation.
Abstract: Dark Silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency is decreasing with each technology generation. We propose a PID (Proportional Integral Derivative) controller-based dynamic power management method that considers an upper bound on power consumption, called the Thermal Design Power (TDP). To avoid violation of the TDP constraint for manycore systems running highly dynamic workloads, the method provides fine-grained DVFS (Dynamic Voltage and Frequency Scaling), including near-threshold operation. In addition, the method distinguishes applications with hard real-time, soft real-time and no real-time constraints and treats them with appropriate priorities. In simulations with dynamic workloads and mixed-critical application profiles, we show that the method is effective in honoring the TDP bound and can boost system throughput by over 43% compared to a naive TDP scheduling policy.
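
A minimal sketch of the control loop (gains, the power model, and the TDP value are all assumed, not taken from the paper): a discrete PID controller trims the frequency scaling factor until chip power tracks the TDP bound.

```python
KP, KI, KD = 0.002, 0.0005, 0.001   # assumed controller gains
TDP = 100.0                         # assumed power budget, watts
freq, integ, prev_err = 1.0, 0.0, 0.0

def chip_power(f):
    """Toy plant: dynamic power scales roughly with f^3, plus static power."""
    return 120.0 * f ** 3 + 10.0

for _ in range(200):
    err = TDP - chip_power(freq)     # positive: headroom; negative: over budget
    integ += err
    deriv = err - prev_err
    prev_err = err
    freq += KP * err + KI * integ + KD * deriv
    freq = min(1.0, max(0.1, freq))  # clamp to the feasible DVFS range

print(round(freq, 3), round(chip_power(freq), 1))  # settles near f ~0.91, ~100 W
```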

40 citations


Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper proposes a fine-grained heterogeneous core design, called the heterogeneous block architecture (HBA), that combines heterogeneous execution backends into one core, and shows that HBA provides significantly better energy efficiency than all evaluated baseline designs at similar performance.
Abstract: This paper makes two new observations that lead to a new heterogeneous core design. First, we observe that most serial code exhibits fine-grained heterogeneity: at the scale of tens or hundreds of instructions, regions of code fit different microarchitectures better (at the same point or at different points in time). Second, we observe that by grouping contiguous regions of instructions into blocks that are executed atomically, a core can exploit this heterogeneity: atomicity allows each block to be executed independently on its own execution backend that fits its characteristics best. Based on these observations, we propose a fine-grained heterogeneous design that combines heterogeneous execution backends into one core. Our core design, the heterogeneous block architecture (HBA), breaks the program into blocks of code, determines the best backend for each block, and specializes the block for that backend. As an initial, concrete design, we combine out-of-order, VLIW, and in-order backends, using simple heuristics to choose backends. We compare HBA to multiple baseline core designs (including monolithic out-of-order, clustered out-of-order, in-order and a state-of-the-art heterogeneous core design) and show that HBA can provide significantly better energy efficiency than all designs at similar performance. Averaged across 184 traces from a wide variety of workloads, HBA reduces core power by 36.4% and energy per instruction by 31.9% compared to a 4-wide out-of-order core. We conclude that HBA provides a flexible substrate for exploiting fine-grained heterogeneity, enabling new energy-performance tradeoff points in core design.

39 citations


Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper uses an algebraic framework based on probabilistic transfer matrices (PTMs) to analyze correlation-induced errors and concludes that the isolation method can offer significant cost advantages in reducing correlation errors.
Abstract: Stochastic computing (SC) is an approximate computing technique that represents data by probabilistic bit-streams called stochastic numbers (SNs). Arithmetic operations can be implemented at very low cost by means of SC. To achieve acceptable accuracy, interacting SNs must usually be statistically independent or uncorrelated. Correlation is poorly understood, however, and is a key problem in SC because of its impact on accuracy and the high cost of correlation-reducing logic. In this paper we analyze and quantify the role of correlation in stochastic circuit design. We use an algebraic framework based on probabilistic transfer matrices (PTMs) to analyze correlation-induced errors. We compare two systematic correlation-reducing methods, regeneration and isolation. Regeneration introduces new (pseudo) random sources to re-randomize SNs, while isolation uses delays (D flip-flops) to derive multiple independent SNs from a single random source. We present bounds on accuracy loss due to isolator insertion and compare its hardware cost to that of regeneration. We conclude that the isolation method can offer significant cost advantages in reducing correlation errors.
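
The regeneration-versus-isolation trade-off is easy to see in simulation. This toy (stream length and value are assumed) squares a stochastic number with an AND gate: feeding the same stream to both inputs yields X instead of X^2, while a one-cycle delay (a D flip-flop isolator) approximately restores independence.

```python
import random

random.seed(1)
N, p = 10000, 0.3
x = [1 if random.random() < p else 0 for _ in range(N)]

correlated = [a & b for a, b in zip(x, x)]           # AND(X, X) = X
isolated = [a & b for a, b in zip(x[1:], x[:-1])]    # AND(X, delayed X)

print(sum(correlated) / N)        # ~0.30: full correlation forces X, not X^2
print(sum(isolated) / (N - 1))    # ~0.09: close to X^2 = 0.09
```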

37 citations


Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper examines several different ways to characterize “fairness” for GPGPU spatial multitasking by balancing individual applications' performance and overall system performance, and presents a run-time algorithm to predict and adjust the SM (streaming multiprocessor) allocation at runtime to meet the desired fairness metric.
Abstract: General-purpose computing on the GPU (GPGPU computing) is becoming widely adopted for an increasing variety of applications. However, it has been shown that as the available computing elements in the GPU increase with every generation, some GPGPU applications fail to fully utilize the GPU resources. Spatial multitasking, i.e., subdividing GPU resources among concurrently running applications, has been shown to increase overall system performance and utilization for GPGPU computing. However, dividing the computing resources among multiple applications to maximize system performance often results in one application having “unfair” access to GPU resources. Yet, evenly dividing resources among applications does not guarantee equal speedups to each application; nor does it take into account overall system performance. In this paper we examine several different ways to characterize “fairness” for GPGPU spatial multitasking, by balancing individual applications' performance and overall system performance. We further present a run-time algorithm to predict and adjust the SM allocation at runtime to meet the desired fairness metric.

35 citations


Proceedings ArticleDOI
Seokin Hong, Jongmin Lee, Soontae Kim
04 Dec 2014
TL;DR: In this paper, a novel cache design called Ternary cache is proposed, which achieves the data density benefit of MLC STT-RAM and the reliability benefit of SLC STT-RAM.
Abstract: Spin-transfer torque random access memory (STT-RAM) has become a promising non-volatile memory technology for cache memories. Recently, 2-bit multi-level cell (MLC) STT-RAM has been proposed to enhance data density, but it suffers from low reliability of its read and write operations. In this paper, we propose a novel cache design called Ternary cache. In Ternary cache, a memory cell can store three values (i.e., 0, 1, 2) while MLC STT-RAM can store four values. In this way, Ternary cache achieves much higher read stability than MLC STT-RAM-based caches. To enhance writability, a write operation is performed with high current and terminated as soon as the data is written. Evaluation results show that Ternary cache achieves the data density benefit of MLC STT-RAM and the reliability benefit of SLC STT-RAM.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper presents the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, along with supplementary user-controlled optimization techniques, which together achieve performance gains and energy savings on a variety of divergent applications.
Abstract: The purpose of this research is to find a neural-network-based solution to the well-known problem of branch divergence in Single Instruction Multiple Data (SIMD) architectures. Our approach differs from existing techniques that handle branch (or control-flow) divergence, which use costly hardware modifications, low-utilization masking techniques, or static prediction methods. As we examine divergent applications, we characterize the degree of data-dependent control flow seen in each and isolate the code regions (or “kernels”) that cause the most performance degradation due to branch divergence. We then train neural networks (NNs) offline to approximate these kernels and inject the NN computations directly into the applications as substitutes for the kernels they approximate. This essentially translates control flow into nondivergent computation, trading off precision for performance. As our methodology manipulates application source code directly, it is inherently platform agnostic and can be adopted as a general means for accelerating divergent applications on data-parallel architectures. In this article, we present the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, as well as supplementary user-controlled optimization techniques. Evaluating our approach on a variety of divergent applications run on a Graphics Processing Unit (GPU), we on average achieve performance gains of 13.6× and energy savings of 14.8× with 96% accuracy.

Proceedings ArticleDOI
04 Dec 2014
TL;DR: It is shown that AES is vulnerable in all modes of operation to the Correlation Power Analysis (CPA) attack, one of the strongest power analysis based side channel attacks, and that the Counter mode of operation provides a balance between area and power while maintaining adequate resistance to power analysis attacks.
Abstract: Advanced Encryption Standard (AES) is arguably the most popular symmetric block cipher algorithm. The commonly used mode of operation in AES is the Electronic Codebook (ECB) mode. In the past, side channel attacks (including power analysis based attacks) have been shown to be effective in breaking the secret keys used with AES while AES is operating in the ECB mode. AES defines a number of advanced modes of operation (namely Cipher Block Chaining - CBC, Cipher Feedback - CFB, Output Feedback - OFB, and Counter - CTR) that are built on top of the ECB mode to enhance security via disassociating the encryption function from the plaintext or the secret key used. In this paper, we investigate the vulnerabilities of all such modes of operation, implemented in hardware circuits for low power and high speed embedded systems, to power analysis based side channel attacks. Through such an investigation, we show that AES is vulnerable in all modes of operation to the Correlation Power Analysis (CPA) attack, one of the strongest power analysis based side channel attacks. We also quantify the level of difficulty in breaking AES in different modes by calculating the number of power traces needed to arrive at the complete secret key. We conclude that the Counter mode of operation provides a balance between area and power while maintaining adequate resistance to power analysis attacks relative to the other modes of operation. We show that the previous recommendations for the rate of change of the keys and vectors are grossly inadequate, and suggest that the key must be changed at least every 2^10 encryptions in CBC mode and every 2^12 encryptions in CFB, OFB and CTR modes in order to resist power analysis attacks.
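
An illustrative CPA toy (the leakage model, noise level, and secret are all assumed, and only the first AddRoundKey XOR of a single key byte is modeled, so this is far from a full attack on AES): hypothetical Hamming-weight leakage for each key guess is correlated against simulated traces, and the correct guess yields the highest correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
SECRET, N = 0x3C, 2000
pt = rng.integers(0, 256, N)                       # known plaintext bytes
hw = np.array([bin(v).count("1") for v in range(256)])
traces = hw[pt ^ SECRET] + rng.normal(0, 1.0, N)   # leakage + Gaussian noise

corr = [np.corrcoef(hw[pt ^ g], traces)[0, 1] for g in range(256)]
print(hex(int(np.argmax(corr))))                   # 0x3c: key byte recovered
```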

Proceedings ArticleDOI
01 Oct 2014
TL;DR: This work proposes a scalable backend architecture that exploits column-major traversal and interleaving to achieve high bandwidth utilization and improves the average bandwidth utilization of existing FPGA SpMV accelerators by 15 to 77%.
Abstract: FPGAs are promising candidates for energy efficient acceleration of sparse matrix-vector multiplication (SpMV), a kernel with important applications in scientific computing and engineering. SpMV is characterized by matrix-dependent performance and high external memory bandwidth demands, which makes bandwidth utilization an important performance indicator. Existing FPGA SpMV accelerators focus on datapath optimizations instead of memory behavior, and exhibit matrix-dependent bandwidth utilization. In this work, we propose to decouple the SpMV computation and memory behavior, and focus on the backend which handles the latter. We describe a scalable backend architecture that exploits column-major traversal and interleaving to achieve high bandwidth utilization. Our experiments show that a single backend is able to sustain 96% of its assigned memory port bandwidth on average, and scales well with increased bandwidth by instantiating multiple parallel units. Compared to a baseline scheme, our scheme offers up to 1.5x higher DRAM power efficiency and up to 20% higher aggregate bandwidth. The results indicate that our scheme improves the average bandwidth utilization of existing FPGA SpMV accelerators by 15 to 77%.
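
A small software analogue of the column-major traversal the backend exploits (the data layout is the standard compressed sparse column format, assumed here for illustration): the result vector is accumulated one matrix column at a time, which is the order in which the data streams in.

```python
def spmv_csc(colptr, rowidx, vals, x, nrows):
    """y = A @ x with A stored column-major (CSC)."""
    y = [0.0] * nrows
    for j in range(len(colptr) - 1):         # walk columns left to right
        xj = x[j]
        for k in range(colptr[j], colptr[j + 1]):
            y[rowidx[k]] += vals[k] * xj     # scatter into the result vector
    return y

# A = [[2, 0], [1, 3]] in CSC form:
colptr, rowidx, vals = [0, 2, 3], [0, 1, 1], [2.0, 1.0, 3.0]
print(spmv_csc(colptr, rowidx, vals, [1.0, 2.0], 2))   # [2.0, 7.0]
```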

Proceedings ArticleDOI
04 Dec 2014
TL;DR: A near-zero-cost bit-level wear leveling strategy to improve PCM endurance that can be combined with various coarse-grained wear leveling strategies while incurring significantly lower storage, performance, and energy overheads.
Abstract: Phase change memory (PCM) has demonstrated great potential as an alternative to DRAM for main memory due to its favorable characteristics of non-volatility, scalability and near-zero leakage power. However, the comparatively poor endurance of PCM largely limits its adoption. Wear leveling strategies aiming to even out write distributions have been proposed at different granularities and on various memory hierarchies for PCM endurance enhancement. Write operations are distributed across the memory by migrating data from heavily written locations to less burdened ones, usually guided by counters recording the number of writes. However, evenly distributing writes at a coarse granularity cannot deliver the best endurance results, as write distributions are highly imbalanced even at the bit level. In this work, we propose a near-zero-cost bit-level wear leveling strategy to improve PCM endurance. The proposed technique can be combined with various coarse-grained wear leveling strategies. Experimental results show a 102% endurance enhancement on average, which is 34% higher than the most closely related work, with significantly lower storage, performance and energy overheads.
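
The abstract does not spell out the mechanism, so the toy below illustrates one plausible bit-level scheme (rotation schedule, word width, and workload are all invented): bits within each word are periodically rotated so a heavily written bit position lands on different physical cells over time.

```python
WIDTH = 8

def to_physical(word, shift):
    """Rotate the logical word left by `shift` before storing it."""
    return ((word << shift) | (word >> (WIDTH - shift))) & 0xFF

wear = [0] * WIDTH    # per-cell wear proxy: count the 1s each cell stores
for writes in range(1, 8001):
    shift = (writes // 1000) % WIDTH          # advance rotation every 1000 writes
    stored = to_physical(0b00000001, shift)   # hot workload: always writes bit 0
    for i in range(WIDTH):
        wear[i] += (stored >> i) & 1

print(wear)   # roughly [1000]*8: the hot bit's wear is spread over all cells
```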

Proceedings ArticleDOI
04 Dec 2014
TL;DR: The proposed framework makes it possible to compute quasi-exact results in reasonable computational time, without adopting the typical (and possibly misleading) simplifications that characterize existing tools for computing Mean Time To Failure (MTTF).
Abstract: This paper presents a Monte Carlo-based framework for the estimation of lifetime reliability of multicore systems. Existing mathematical tools either consider only the time to the first failure, or are limited by their intrinsic complexity and high computational time. The proposed framework makes it possible to compute quasi-exact results in reasonable computational time, without adopting the typical (and possibly misleading) simplifications that characterize existing tools for computing Mean Time To Failure (MTTF). The paper describes the framework with all its mathematical details, assumptions and simplifications; it proves the correctness of the obtained results by comparing them against exact ones, and underlines the differences from the simplistic approaches, also discussing improvements in time overhead.
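
A sketch of the Monte Carlo idea (failure distribution and system structure are assumed): sample per-core Weibull failure times for a 4-core system that remains functional until fewer than 2 cores survive. The mean time to first failure, which simpler tools report, is far more pessimistic than the system MTTF.

```python
import numpy as np

rng = np.random.default_rng(42)
TRIALS, CORES, ALIVE_MIN = 100_000, 4, 2
shape, scale = 2.0, 10.0               # assumed Weibull wear-out parameters, years

t = scale * rng.weibull(shape, size=(TRIALS, CORES))   # per-core failure times
t.sort(axis=1)

first_failure = t[:, 0].mean()                 # time to the first core failure
system_mttf = t[:, CORES - ALIVE_MIN].mean()   # the failure that leaves <2 cores

print(round(first_failure, 2), round(system_mttf, 2))  # first failure is much earlier
```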

Proceedings ArticleDOI
04 Dec 2014
TL;DR: Experimental results show that dynamic policy decisions change dramatically when using the proposed detailed phase change model, as prior, simpler PCM models can substantially over- or under-estimate temperature and PCM melting duration.
Abstract: Direct placement of Phase Change Materials (PCMs) on the chip has been recently explored as a passive temperature management solution. PCMs provide the ability to store large amounts of heat at a close-to-constant temperature during the phase change (solid to liquid and vice versa). This latent heat capacity can be used to provide higher performance while reducing hot spots. Detailed modeling of the phase change behavior is essential for the design and evaluation of systems with PCM. This paper proposes an accurate phase change model that is integrated into the commonly used thermal simulation tool, HotSpot. It also provides validation of the proposed model by carrying out computational fluid dynamics (CFD) simulations using COMSOL Multiphysics®. This paper also explores the impact of PCM properties on the thermal profile of a processor, and demonstrates that PCM material choices can affect peak temperatures by up to 20.1°C. Experimental results show that dynamic policy decisions change dramatically when using the proposed detailed phase change model, as prior simpler PCM models can substantially over/under-estimate temperature and PCM melting duration. The proposed model helps design more effective dynamic management policies and enables realistic evaluation of systems with PCM.
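
A lumped-parameter sketch of the latent-heat behavior being modeled (all constants are invented, and this is far simpler than the HotSpot-integrated model): while the PCM is melting, net heat flows into the latent pool and the temperature plateaus near the melting point; on cooling, the pool refreezes before the temperature falls further.

```python
C_TH = 2.0        # sensible heat capacity, J/K (assumed)
R_AMB = 1.5       # thermal resistance to ambient, K/W (assumed)
T_MELT, LATENT = 60.0, 300.0   # melting point (deg C), latent capacity (J)

temp, ambient, stored, dt = 35.0, 35.0, 0.0, 0.1
for step in range(3000):
    power = 25.0 if step < 2000 else 2.0      # compute burst, then idle
    q = power - (temp - ambient) / R_AMB      # net heat into the package
    if q > 0 and temp >= T_MELT and stored < LATENT:
        stored = min(LATENT, stored + q * dt)   # melting: temperature plateaus
    elif q < 0 and temp <= T_MELT and stored > 0:
        stored = max(0.0, stored + q * dt)      # refreezing: heat released
    else:
        temp += (q / C_TH) * dt                 # ordinary sensible heating/cooling
print(round(temp, 1), round(stored, 1))   # ~38.0 0.0: idle temp, pool refrozen
```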

Proceedings ArticleDOI
04 Dec 2014
TL;DR: This work proposes a method to include inter-cycle battery effects in a reference model template in an automated way, using solely data reported by battery manufacturers; the method is demonstrated by modeling a commercial lithium iron phosphate battery whose datasheet provides long-term capacity-fading information.
Abstract: The de facto standard approach in battery modeling consists of the definition of a generic model template in terms of an equivalent electric circuit, which is then populated either using data obtained from direct measurements on actual devices or by some extrapolation of battery characteristics available from datasheets. These models typically describe only intra-cycle effects, that is, those manifesting within a single charge/discharge cycle of a battery. However, basic battery dynamics during a single discharge cannot provide a true estimate of the actual lifetime of the battery, e.g., how its usability decreases due to long-term and irreversible effects, such as the fading of capacity due to aging or to repeated cycling. While some solutions in the literature provide answers to this problem by proposing suitable models for these effects, they do not provide solutions for how to incorporate them into a generic model template.
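
A sketch of what folding an inter-cycle effect into a simple template looks like (the linear fade law and every constant are invented, not taken from the paper or any datasheet): usable capacity shrinks with accumulated full cycles, and the intra-cycle discharge model then draws on the faded capacity.

```python
Q_NOM = 2.5               # nominal capacity, Ah (assumed)
FADE_PER_CYCLE = 0.0002   # fractional capacity loss per full cycle (assumed)

def faded_capacity(cycles):
    """Long-term, irreversible fade: Q(n) = Q_nom * (1 - fade * n)."""
    return Q_NOM * max(0.0, 1.0 - FADE_PER_CYCLE * cycles)

def runtime_hours(load_amps, cycles):
    """Intra-cycle model kept trivial: constant-current discharge."""
    return faded_capacity(cycles) / load_amps

print(runtime_hours(0.5, 0))     # 5.0 h when new
print(runtime_hours(0.5, 500))   # 4.5 h after 500 cycles (10% fade)
```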

Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper proposes an architectural design to dynamically reconfigure the cache block size for an MLC STT-RAM last-level cache, placing certain hot data chunks in smaller blocks so as to benefit from the lower latency and energy while keeping the rest in larger blocks to maintain the overall hit rate.
Abstract: The use of STT-RAM as on-chip caches has been widely studied. However, existing works focused mainly on single-level cell (SLC) design, while the potential of multi-level cell (MLC) STT-RAM has not yet been fully explored. It is expected that MLC STT-RAM can achieve 2× the storage density of SLC and thus improve system performance. Unfortunately, at the device level, the two-step read/write scheme introduces performance and energy overhead. In this paper, we propose an architectural design to dynamically reconfigure the cache block size for an MLC STT-RAM last-level cache. Our approach places certain hot data chunks in smaller blocks so as to benefit from the lower latency and energy, while keeping the rest in larger blocks to maintain the overall hit rate. Experiments show that our strategy reduces the performance and energy penalty of MLC STT-RAM caches with a slightly higher miss rate. On average, IPC is increased by 4.6% while energy consumption is reduced by 23.5% compared to the conventional MLC STT-RAM cache.

Proceedings ArticleDOI
04 Dec 2014
TL;DR: A low-power accuracy-configurable floating point multiplier based on Mitchell's Algorithm that shows significantly better power reduction vs. quality degradation trade-offs than existing bit truncation schemes.
Abstract: Floating point multiplication is one of the most frequently used arithmetic operations in a wide variety of applications, but the high power consumption of the IEEE-754 standard floating point multiplier prohibits its implementation in many low power systems, such as wireless sensors and other battery-powered embedded systems, and limits performance scaling in high performance systems, such as CPUs and GPGPUs for scientific computation. This paper presents a low-power accuracy-configurable floating point multiplier based on Mitchell's Algorithm. Post-layout SPICE simulations in a 45nm process show same-delay power reductions up to 26X for single precision and 49X for double precision compared to their IEEE-754 counterparts. Functional simulations on six CPU and GPU benchmarks show significantly better power reduction vs. quality degradation trade-offs than existing bit truncation schemes.
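
Mitchell's Algorithm itself is compact enough to sketch: log2(1 + f) is approximated by f, so a multiplication becomes an exponent/fraction addition followed by an antilog of the same form (the sign handling and float decomposition below are assumptions of this toy, which ignores the paper's accuracy-configuration knobs).

```python
import math

def split(x):
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
    return e - 1, 2 * m - 1    # x = 2**(e-1) * (1 + f) with 0 <= f < 1

def mitchell_mul(a, b):
    """Approximate a*b using log2(1 + f) ~= f (Mitchell's Algorithm)."""
    if a == 0 or b == 0:
        return 0.0
    ka, fa = split(abs(a))
    kb, fb = split(abs(b))
    k, f = ka + kb, fa + fb
    if f >= 1.0:               # fraction carry: the antilog bumps the exponent
        k, f = k + 1, f - 1.0
    r = math.ldexp(1.0 + f, k)
    return -r if (a < 0) != (b < 0) else r

print(mitchell_mul(1.5, 1.5), 1.5 * 1.5)   # 2.0 vs 2.25: the ~11% worst case
```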

Proceedings ArticleDOI
01 Oct 2014
TL;DR: Hermes is a highly robust, distributed and lightweight fault-tolerant routing algorithm whose performance degrades gracefully with increasing faulty link counts, improving network throughput by up to 3× compared against prior art.
Abstract: Networks-on-Chip (NoCs) are experiencing escalating susceptibility to wear-out and reduced reliability, with the risk of becoming the key point of failure in an entire multicore chip. Aiming towards seamless NoC operation in the presence of faulty communication links, in this paper we propose Hermes, a highly robust, distributed and lightweight fault-tolerant routing algorithm whose performance degrades gracefully with increasing faulty link counts. Hermes is a deadlock-free hybrid routing algorithm, utilizing load-balancing routing on fault-free paths to sustain high performance, while providing pre-reconfigured escape path selection in the vicinity of faults. Additionally, Hermes identifies non-communicating network partitions in scenarios where faulty links are topologically densely distributed. An extensive experimental evaluation, including traffic benchmarks gathered from full-system chip multi-processor simulations, shows that Hermes improves network throughput by up to 3× when compared against prior art.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper proposes the design of an energy-efficient STT-MRAM cell that utilizes a FinFET access transistor; the cell offers denser area and higher energy efficiency compared with the conventional MOS-accessed counterpart.
Abstract: Spin-Transfer Torque Magnetic RAM (STT-MRAM) technology requires a high current in order to write data into memory cells, which gives rise to large access transistors in conventional MOS-accessed cells. On the other hand, FinFET devices offer higher ON current and denser layout compared with planar CMOS transistors. This paper thus proposes the design of an energy-efficient STT-MRAM cell which utilizes a FinFET access transistor. To assess the performance of the new cell, optimal layout-related parameters of the FinFET access transistor and the MTJ are analytically derived in order to minimize the STT-MRAM cell area. Afterwards, detailed cell- and architecture-level comparisons between FinFET- vs. MOS-accessed STT-MRAMs are performed. According to the comparison results, while the area of the MOS-accessed STT-MRAM increases significantly under a 3ns write pulse width (τ_w), the FinFET-based design can effectively function under τ_w = 2ns, at the cost of a slight increase in the memory area. Hence, the FinFET-accessed STT-MRAM offers denser area and higher energy efficiency compared with the conventional MOS-accessed counterpart.

Proceedings ArticleDOI
04 Dec 2014
TL;DR: ProactiveDRAM is proposed, a novel scheme in which the DRAM proactively guides the timing of weak-cell refresh management and reuses the memory controller's existing command scheduling capability; it can be built atop any modern DRAM architecture.
Abstract: DRAM cells are leaky and need periodic refresh, which hurts system performance and consumes additional energy. With DRAM scaling towards sub-20nm process technology, we expect a significant portion of DRAM cells to become weak cells that require a higher refresh rate, resulting in even higher refresh overhead. A possible solution is to selectively refresh those weak cells at a higher frequency while still refreshing the majority at the nominal rate. However, how to provide such a multi-rate DRAM refresh scheme is not straightforward. Previous work on this topic was built on an obsolete refresh framework and was incompatible with modern DRAM standards, making it challenging to adopt in practice. In this work, we propose ProactiveDRAM, a novel scheme in which the DRAM proactively guides the timing of weak-cell refresh management and reuses the memory controller's existing command scheduling capability. ProactiveDRAM offers smart retention-aware refresh at DRAM row granularity, and more importantly, it can be built atop any modern DRAM architecture. Our simulation results show that ProactiveDRAM can handle a 1% (or even 10%) weak row population with negligible performance and energy overhead.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: This paper describes how the industry's density scaling gap can potentially be compensated if the semiconductor industry urgently pursues design-based equivalent scaling (DES), which substantially changes the area and power model trajectories of MPUs and SOCs in the ITRS System Drivers chapter.
Abstract: The system driver models for microprocessor (MPU) and system-on-chip (SOC) in the International Technology Roadmap for Semiconductors (ITRS) (21) determine the roadmap of underlying technology requirements across devices, patterning, interconnect, test, design and other semiconductor supplier industries. In this paper, we describe several fundamental changes in the ITRS MPU and SOC system driver models as of the recently-released 2013 edition of the roadmap. We first present new A-factor (i.e., layout density) models for the logic and memory components of the MPU and SOC drivers; these updated density models comprehend the industry's shift to FinFET devices below the foundry 20nm node. We also describe updated architectural, total chip area, and total chip power models for the MPU and SOC drivers. Notably, we model the growing uncore portion of MPU products, and the growing presence of graphics processing units (GPUs) and other peripheral cores (PEs) in SOC architectures. The updated SOC architectural model enables more realistic scenario-based power modeling for the SOC driver. The 2013 ITRS update of system driver models embodies extensive calibration with foundry data as well as product structural analysis reports from a leading analysis firm (Chipworks). The model calibration reveals that the industry has contended with a "scaling gap" since 2008, whereby traditional Moore's-Law density scaling of 2× per node has failed due to patterning limitations on layout design, as well as manufacturability and performability challenges of Metal-1 half-pitch (M1HP) scaling. Growing design margins due to reliability, yield, variability, etc. have also contributed to the slowdown of density scaling. We describe how this scaling gap can potentially be compensated if the semiconductor industry urgently pursues design-based equivalent scaling (DES), which substantially changes the area and power model trajectories of MPUs and SOCs in the ITRS System Drivers chapter. Finally, we note that as a consequence of the updated A-factor, area and power models in the 2013 ITRS, the industry now faces a 20% more daunting power management challenge than had been predicted in the 2011 roadmap.

Proceedings ArticleDOI
04 Dec 2014
TL;DR: 3D-Wiz integrates sub-bank level 3D partitioning of the data array to enable fine-grained activation and greater memory parallelism, and yields the best per-access latency and energy consumption among well-known 3D DRAM architectures.
Abstract: This paper introduces 3D-Wiz, a high bandwidth, low latency, optically interfaced 3D DRAM architecture with fine-grained data organization and activation. 3D-Wiz integrates sub-bank level 3D partitioning of the data array to enable fine-grained activation and greater memory parallelism. A novel method of routing the internal memory bus using TSVs and fan-out buffers enables 3D-Wiz to use smaller dimension subarrays without significant area overhead. This in turn reduces the random access latency and activation-precharge energy. 3D-Wiz demonstrates an access latency of 19.5ns and a row cycle time of 25ns. It yields per-access activation energy and precharge energy of 0.78nJ and 0.62nJ respectively, with 42.5% area efficiency. 3D-Wiz yields the best per-access latency and energy consumption among well-known 3D DRAM architectures. Experimental results with PARSEC benchmarks indicate that 3D-Wiz achieves 38.8% improvement in performance, 81.1% reduction in power consumption, and 77.1% reduction in energy-delay product (EDP) on average over 3D DRAM architectures from prior work.

Proceedings ArticleDOI
04 Dec 2014
TL;DR: A hybrid-memory-aware cache partitioning technique (HAP) dynamically adjusts the cache space devoted to DRAM and NVM data based on TMPKI, a metric that accurately reflects LLC performance on top of hybrid memories.
Abstract: Data-center servers require large capacity main memory to run multiple workloads simultaneously. However, the scalability and power consumption of DRAM limit its capability of constructing large capacity memory. Emerging non-volatile memories (e.g. PCM and STT-RAM) provide better scalability and lower power leakage than DRAM. In particular, hybrid memory consisting of DRAM and NVM is able to exploit the advantages of the different memory media. However, NVMs have a few drawbacks, such as relatively longer read and write latencies. A miss at the shared last-level cache (LLC) suffers longer latency if the missing data resides in NVM. Current LLC policies manage the cache space without being aware of the underlying heterogeneous media. This results in cache performance degradation if a large number of missing data come from NVM. Taking the asymmetric cache miss cost into account, we first propose a new performance metric, TMPKI, which accurately reflects LLC performance on top of hybrid memories. We then propose a hybrid-memory-aware cache partitioning technique (HAP) to dynamically adjust the cache spaces for DRAM and NVM data based on TMPKI. Experimental results show that HAP improves performance over the traditional LRU policy by up to 54.3% (19.6% on average) while incurring only a small storage overhead (0.2%).
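
The abstract does not expand TMPKI; a plausible reading (assumed here) is a miss-penalty-per-kilo-instruction metric that weights each LLC miss by where it is served, which is exactly the DRAM/NVM asymmetry the paper highlights:

```python
DRAM_MISS_NS, NVM_MISS_NS = 80.0, 240.0   # assumed miss service latencies

def tmpki(dram_misses, nvm_misses, instructions):
    """Total miss penalty (ns) per 1000 instructions, cost-weighted per media."""
    penalty = dram_misses * DRAM_MISS_NS + nvm_misses * NVM_MISS_NS
    return 1000.0 * penalty / instructions

# Same total miss count, very different cost once the NVM asymmetry is counted:
print(tmpki(900, 100, 1_000_000))   # 96.0
print(tmpki(100, 900, 1_000_000))   # 224.0
```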

Proceedings ArticleDOI
04 Dec 2014
TL;DR: It is demonstrated that fault-injection attacks, when coupled with a machine learning (ML) algorithm, can considerably push the limits of prediction accuracy.
Abstract: Physically Unclonable Functions (PUFs) have emerged as a possible candidate to replace traditional cryptography. However, the majority of strong PUFs are vulnerable to modeling attacks. In this work, we take a closer look at the possible attacks on one of the strong PUF architectures known as Current-based PUFs, which exploit irregularities in transistor currents to generate unique signatures. We demonstrate that fault-injection attacks, when coupled with a machine learning (ML) algorithm, can considerably push the limits of prediction accuracy. Based on simulations, we observed that stand-alone ML algorithms suffer from error-prone challenge-response pairs (CRPs), especially for higher-length PUFs. In such scenarios, hybrid attacks exploiting the unreliable responses pushed the prediction accuracy up to 99% for higher-length Current-based PUF circuits.
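
A generic modeling-attack sketch (the PUF here is a synthetic linear-threshold stand-in, not the paper's current-based circuit, and no fault injection is modeled): collect challenge-response pairs and fit a logistic-regression model that predicts unseen responses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
N_STAGES, N_CRPS = 32, 4000
weights = rng.normal(0, 1, N_STAGES)           # hidden device parameters

C = rng.choice([-1.0, 1.0], size=(N_CRPS, N_STAGES))   # random challenges
r = (C @ weights > 0).astype(int)                      # observed responses

model = LogisticRegression(max_iter=1000).fit(C[:3000], r[:3000])
print(model.score(C[3000:], r[3000:]))   # typically >0.95 prediction accuracy
```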

Proceedings ArticleDOI
04 Dec 2014
TL;DR: This paper presents a Ternary Content-Addressable Memory (TCAM) design based on floating-gate (flash) transistors, and shows that it has a significantly lowered area compared to a CMOS-based TCAM block, with a speed that can meet the data rates currently found in the internet core.
Abstract: This paper presents a Ternary Content-Addressable Memory (TCAM) design which is based on the use of floating-gate (flash) transistors. TCAMs are extensively used in high speed IP networking, and are commonly found in routers in the internet core. Traditional TCAM ICs are built using CMOS devices, and a single TCAM cell utilizes 17 transistors. In contrast, our TCAM cell utilizes only 2 flash transistors, thereby significantly reducing circuit area. We cover the chip-level architecture of the TCAM IC briefly, focusing mainly on the TCAM block which does fast parallel IP routing table lookup. Our flash-based TCAM block is simulated in SPICE, and we show that it has a significantly lowered area compared to a CMOS-based TCAM block, with a speed that can meet current (∼400 Gb/s) data rates found in the internet core.
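
A functional software model of what the TCAM block computes (the entries are invented): each stored entry is a (value, mask) pair whose masked-out bits are don't-cares, and a lookup returns the first, i.e., highest-priority, matching entry, as in longest-prefix routing tables.

```python
def tcam_lookup(table, key):
    for idx, (value, mask) in enumerate(table):
        if (key & mask) == (value & mask):   # compare only the cared-about bits
            return idx
    return None

# 8-bit toy routing table, most specific prefixes first (priority = position):
table = [
    (0b10110100, 0b11111100),   # matches 101101xx
    (0b10110000, 0b11110000),   # matches 1011xxxx
    (0b00000000, 0b00000000),   # default route: matches anything
]
print(tcam_lookup(table, 0b10110101))   # 0: the most specific entry wins
print(tcam_lookup(table, 0b10111111))   # 1
print(tcam_lookup(table, 0b01010101))   # 2: falls through to the default
```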

Proceedings ArticleDOI
04 Dec 2014
TL;DR: This work uses sensors to monitor the variable operating conditions and the degradation rate and develops a variability-aware OS scheduling algorithm that sets the power/performance tradeoffs to meet the mobile processor's lifetime constraints while adjusting to variability and improving the overall performance.
Abstract: Variability is a key issue in modern multiprocessors, resulting in performance and lifetime uncertainty, and high design margins. The margins can be reduced by exposing variability to software and then adapting at runtime. In this work we use sensors to monitor the variable operating conditions and the degradation rate. Based on the sensor data, our variability-aware OS scheduling algorithm assigns the workload to the cores and sets the power/performance tradeoffs to meet the mobile processor's lifetime constraints while adjusting to variability and improving the overall performance. We implement our algorithm in Android OS on a mobile phone and show that it achieves up to 160% performance improvement over the state-of-the-art while meeting the lifetime constraints.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A criticality-aware cache design is presented which implements a Least Critical (LC) cache replacement policy, where the least recently used non-critical cache line is replaced during a cache miss.
Abstract: Shared caches in mixed criticality systems are a source of interference for safety critical tasks. Shared memory not only leads to worst-case execution time (WCET) pessimism, but also affects the response time of safety critical tasks. In this paper, we present a criticality-aware cache design which implements a Least Critical (LC) cache replacement policy, where the least recently used non-critical cache line is replaced during a cache miss. The cache acts as a Least Recently Used (LRU) cache if there are no critical lines or if all cache lines in a set are critical. In our design, data within a certain address space is given higher preference in the cache. These critical address spaces are configured using critical address range (CAR) registers. The new cache design was implemented in a Leon3 processor core, a 32-bit processor compliant with the SPARC V8 architecture. Experimental results are presented that illustrate the impact of the Least Critical cache replacement policy on the response time of critical tasks, and on overall application performance, as compared to a conventional LRU cache policy.
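
The replacement policy described above is simple enough to state in a few lines; this behavioral sketch (set layout simplified to a single list ordered from LRU to MRU) picks the victim way on a miss:

```python
def choose_victim(cache_set):
    """cache_set: list of (tag, is_critical) pairs, index 0 = least recently used."""
    for way, (tag, critical) in enumerate(cache_set):
        if not critical:
            return way     # LRU non-critical line is evicted first
    return 0               # all lines critical: fall back to plain LRU

ways = [("t0", True), ("t1", False), ("t2", True), ("t3", False)]
print(choose_victim(ways))                          # 1: evicts non-critical t1
print(choose_victim([(t, True) for t, _ in ways]))  # 0: plain LRU fallback
```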