scispace - formally typeset
Search or ask a question
Author

Emre Kultursay

Other affiliations: Google
Bio: Emre Kultursay is an academic researcher from Pennsylvania State University. The author has contributed to research in topics: Cache & Uniform memory access. The author has an hindex of 11, co-authored 23 publications receiving 883 citations. Previous affiliations of Emre Kultursay include Google.

Papers
More filters
Proceedings ArticleDOI
21 Apr 2013
TL;DR: It is shown that an optimized, equal capacity STT-RAM main memory can provide performance comparable to DRAM main memory, with an average 60% reduction in main memory energy.
Abstract: In this paper, we explore the possibility of using STT-RAM technology to completely replace DRAM in main memory. Our goal is to make STT-RAM performance comparable to DRAM while providing substantial power savings. Towards this goal, we first analyze the performance and energy of STT-RAM, and then identify key optimizations that can be employed to improve its characteristics. Specifically, using partial write and row buffer write bypass, we show that STT-RAM main memory performance and energy can be significantly improved. Our experiments indicate that an optimized, equal capacity STT-RAM main memory can provide performance comparable to DRAM main memory, with an average 60% reduction in main memory energy.

478 citations

Patent
07 Nov 2012
TL;DR: In this paper, a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and the supercomputer system structure generated as a result of applying this method is presented.
Abstract: The invention comprises (i) a compilation method for automatically converting a single-threaded software program into an application-specific supercomputer, and (ii) the supercomputer system structure generated as a result of applying this method. The compilation method comprises: (a) Converting an arbitrary code fragment from the application into customized hardware whose execution is functionally equivalent to the software execution of the code fragment; and (b) Generating interfaces on the hardware and software parts of the application, which (i) Perform a software-to-hardware program state transfer at the entries of the code fragment; (ii) Perform a hardware-to-software program state transfer at the exits of the code fragment; and (iii) Maintain memory coherence between the software and hardware memories. If the resulting hardware design is large, it is divided into partitions such that each partition can fit into a single chip. Then, a single union chip is created which can realize any of the partitions.

147 citations

Proceedings ArticleDOI
20 May 2013
TL;DR: The results show that the Intel® Composer XE 2013 compiler can make effective use of software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations to achieve significant performance improvements.
Abstract: The Intel® Xeon Phi™ coprocessor has software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations. In this work, we provide details on compiler-based generation of these instructions and evaluate their impact on the performance of the Intel® Xeon Phi™ coprocessor using a wide range of parallel applications with different characteristics. Our results show that the Intel® Composer XE 2013 compiler can make effective use of these mechanisms to achieve significant performance improvements.

47 citations

Journal ArticleDOI
TL;DR: In this article, the authors discuss device-level heterogeneous multicores and various resource-management schemes for reaching higher energy efficiency.
Abstract: Although the superior subthreshold characteristics of steep-slope devices can help power up more cores, researchers still need CMOS technology to accelerate sequential applications, because it can reach higher frequencies. Device-level heterogeneous multicores can give the best of both worlds, but they need smart resource management to realize this promise. In this article, the authors discuss device-level heterogeneous multicores and various resource-management schemes for reaching higher energy efficiency.

42 citations

Proceedings ArticleDOI
01 Dec 2012
TL;DR: This work proposes two network prioritization schemes that can cooperatively improve performance by reducing end-to-end memory access latencies in Network-on-Chip (NoC) based multicores and prioritizes the request messages that are destined for idle memory banks over others, which lead to uniform memory access Latencies with a low average value.
Abstract: To achieve high performance in emerging multicores, it is crucial to reduce the number of memory accesses that suffer from very high latencies. However, this should be done with care as improving latency of an access can worsen the latency of another as a result of resource sharing. Therefore, the goal should be to balance latencies of memory accesses issued by an application in an execution phase, while ensuring a low average latency value. Targeting Network-on-Chip (NoC) based multicores, we propose two network prioritization schemes that can cooperatively improve performance by reducing end-to-end memory access latencies. Our first scheme prioritizes memory response messages such that, in a given period of time, messages of an application that experience higher latencies than the average message latency for that application are expedited and a more uniform memory latency pattern is achieved. Our second scheme prioritizes the request messages that are destined for idle memory banks over others, with the goal of improving bank utilization and preventing long queues from being built in front of the memory banks. These two network prioritization-based optimizations together lead to uniform memory access latencies with a low average value. Our experiments with a 4x8 mesh network-based multicore show that, when applied together, our schemes can achieve 15%, 10% and 13% performance improvement on memory intensive, memory non-intensive, and mixed multiprogrammed workloads, respectively.

41 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: The tunnel field effect transistor (TFET) is considered a future transistor option due to its steep-slope prospects and the resulting advantages in operating at low supply voltage as mentioned in this paper.
Abstract: The tunnel field-effect transistor (TFET) is considered a future transistor option due to its steep-slope prospects and the resulting advantages in operating at low supply voltage ( $\mathrm{V}_{\rm DD}$ ). In this paper, using atomistic quantum models that are in agreement with experimental TFET devices, we are reviewing TFETs prospects at $\mathrm{L}_{\rm G}= 13$ nm node together with the main challenges and benefits of its implementation. Significant power savings at iso-performance to CMOS are shown for GaSb/InAs TFET, but only for performance targets which use lower than conventional $\mathrm{V}_{\rm DD}$ . Also, P-TFET current-drive is between $1\times $ to $0.5\times $ of N-TFET, depending on choice of $\mathrm{I}_{\rm OFF}$ and $\mathrm{V}_{\rm DD}$ . There are many challenges to realizing TFETs in products, such as the requirement of high quality III–V materials and oxides with very thin body dimensions, and the TFET’s layout density and reliability issues due to its source/drain asymmetry. Yet, extremely parallelizable products, such as graphics cores, show the prospect of longer battery life at a cost of some chip area.

357 citations

Journal ArticleDOI
01 Aug 2018
TL;DR: This Perspective argues that electronics is poised to enter a new era of scaling – hyper-scaling – driven by advances in beyond-Boltzmann transistors, embedded non-volatile memories, monolithic three-dimensional integration and heterogeneous integration techniques.
Abstract: In the past five decades, the semiconductor industry has gone through two distinct eras of scaling: the geometric (or classical) scaling era and the equivalent (or effective) scaling era. As transistor and memory features approach 10 nanometres, it is apparent that room for further scaling in the horizontal direction is running out. In addition, the rise of data abundant computing is exacerbating the interconnect bottleneck that exists in conventional computing architecture between the compute cores and the memory blocks. Here we argue that electronics is poised to enter a new, third era of scaling — hyper-scaling — in which resources are added when needed to meet the demands of data abundant workloads. This era will be driven by advances in beyond-Boltzmann transistors, embedded non-volatile memories, monolithic three-dimensional integration and heterogeneous integration techniques. This Perspective argues that electronics is poised to enter a new era of scaling – hyper-scaling – driven by advances in beyond-Boltzmann transistors, embedded non-volatile memories, monolithic three-dimensional integration, and heterogeneous integration techniques.

343 citations

Proceedings ArticleDOI
26 May 2013
TL;DR: Three key solution directions are surveyed: enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system, designing a memory system that employs emerging memory technologies and takes advantage of multiple different technologies, and providing predictable performance and QoS to applications sharing the memory system.
Abstract: The memory system is a fundamental performance and energy bottleneck in almost all computing systems Recent system design, application, and technology trends that require more capacity, bandwidth, efficiency, and predictability out of the memory system make it an even more important system bottleneck At the same time, DRAM technology is experiencing difficult technology scaling challenges that make the maintenance and enhancement of its capacity, energy-efficiency, and reliability significantly more costly with conventional techniques In this paper, after describing the demands and challenges faced by the memory system, we examine some promising research and design directions to overcome challenges posed by memory scaling Specifically, we survey three key solution directions: 1) enabling new DRAM architectures, functions, interfaces, and better integration of the DRAM and the rest of the system, 2) designing a memory system that employs emerging memory technologies and takes advantage of multiple different technologies, 3) providing predictable performance and QoS to applications sharing the memory system We also briefly describe our ongoing related work in combating scaling challenges of NAND flash memory

270 citations

Journal ArticleDOI
18 Aug 2017
TL;DR: In this article, the authors provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques.
Abstract: NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: 1) effective process technology scaling; and 2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to 1) fewer electrons in the flash memory cell floating gate to represent the data; and 2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become much more sophisticated: they incorporate many effective techniques to ensure the correct interpretation of noisy data stored in flash memory cells. In this article, we review recent advances in SSD error characterization, mitigation, and data recovery techniques for reliability and lifetime improvement. We provide rigorous experimental data from state-of-the-art MLC and TLC NAND flash devices on various types of flash memory errors, to motivate the need for such techniques. Based on the understanding developed by the experimental characterization, we describe several mitigation and recovery techniques, including 1) cell-to-cell interference mitigation; 2) optimal multi-level cell sensing; 3) error correction using state-of-the-art algorithms and methods; and 4) data recovery when error correction fails. We quantify the reliability improvement provided by each of these techniques. Looking forward, we briefly discuss how flash memory and these techniques could evolve into the future.

251 citations

Proceedings ArticleDOI
01 Jun 2014
TL;DR: New challenges as well as opportunities are described in the context of the interaction of dark silicon with thermal, reliability and variability concerns, and preliminary experimental evidence in their support is provided.
Abstract: Technology scaling has resulted in smaller and faster transistors in successive technology generations. However, transistor power consumption no longer scales commensurately with integration density and, consequently, it is projected that in future technology nodes it will only be possible to simultaneously power on a fraction of cores on a multi-core chip in order to stay within the power budget. The part of the chip that is powered off is referred to as dark silicon and brings new challenges as well as opportunities for the design community, particularly in the context of the interaction of dark silicon with thermal, reliability and variability concerns. In this perspectives paper we describe these new challenges and opportunities, and provide preliminary experimental evidence in their support.

191 citations