Author

Krit Athikulwongse

Bio: Krit Athikulwongse is an academic researcher from the Georgia Institute of Technology. The author has contributed to research in topics: Three-dimensional integrated circuit & Memory bandwidth. The author has an h-index of 14 and has co-authored 25 publications receiving 990 citations. Previous affiliations of Krit Athikulwongse include the Thailand National Science and Technology Development Agency & NECTEC.

Papers
Proceedings ArticleDOI
02 Nov 2009
TL;DR: A new force-directed 3D gate-level placement that efficiently handles TSVs is presented, along with an algorithm that assigns TSVs to nets to complete routing that involves TSVs.
Abstract: Through-Silicon-Via (TSV) is the enabling technology for the fine-grained 3D integration of multiple dies into a single stack. These TSVs occupy non-negligible silicon area because of their sheer size. This significant silicon area occupied by the TSVs and the interconnections made to the TSVs greatly affect the area, power, performance, and reliability of 3D IC layouts. Well-managed TSVs alleviate congestion, reduce wirelength, and improve performance, whereas excessive TSVs not only increase the die area but also have a negative impact on many design objectives. In this paper, we study the impact of TSVs on various aspects of 3D layouts. We use GDSII layouts of 2D and 3D designs and thoroughly compare the pros and cons of TSV usage. We propose a new force-directed 3D gate-level placement that efficiently handles TSVs. In addition, we present an algorithm that assigns TSVs to nets to complete routing that involves TSVs. This algorithm, together with our 3D placer, is integrated into a commercial P&R tool to generate fully validated GDSII layouts. Our experiments on synthesized benchmarks indicate that our algorithms help generate GDSII layouts of 3D designs that are optimized in terms of area, wirelength, and metal layer count.
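As an illustration of the idea, here is a minimal, self-contained sketch of force-directed placement with centroid-based TSV-to-net assignment; the names (Cell, place_3d) and the toy netlist are illustrative, not the paper's implementation.

```python
# Minimal sketch of force-directed 3D placement with TSV assignment.
# Toy model only: real placers add repulsion, density, and legalization.
import random

class Cell:
    def __init__(self, x, y, tier):
        self.x, self.y, self.tier = x, y, tier

def place_3d(cells, nets, iters=100, step=0.1):
    """cells: list[Cell]; nets: list of index lists into `cells`."""
    for _ in range(iters):
        forces = [[0.0, 0.0] for _ in cells]
        for net in nets:
            # Attractive force: pull each pin toward the net centroid.
            cx = sum(cells[i].x for i in net) / len(net)
            cy = sum(cells[i].y for i in net) / len(net)
            for i in net:
                forces[i][0] += cx - cells[i].x
                forces[i][1] += cy - cells[i].y
        for c, (fx, fy) in zip(cells, forces):
            c.x += step * fx
            c.y += step * fy
    # Assign one TSV per tier-crossing net, placed at the net centroid --
    # a simple stand-in for the paper's TSV-to-net assignment step.
    tsvs = []
    for net in nets:
        if len({cells[i].tier for i in net}) > 1:
            cx = sum(cells[i].x for i in net) / len(net)
            cy = sum(cells[i].y for i in net) / len(net)
            tsvs.append((cx, cy))
    return tsvs

random.seed(0)
cells = [Cell(random.random(), random.random(), i % 2) for i in range(8)]
nets = [[0, 1, 4], [2, 3], [5, 6, 7], [1, 5]]
print("TSV sites:", place_3d(cells, nets))
```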

214 citations

Book ChapterDOI
03 Apr 2012
TL;DR: 3D-MAPS (3D Massively Parallel Processor with Stacked Memory) is a two-tier 3D IC, where the logic die consists of 64 general-purpose processor cores running at 277 MHz, and the memory die contains 256 KB of SRAM.
Abstract: Several recent works have demonstrated the benefits of through-silicon-via (TSV) based 3D integration [1–4], but none of them involves a fully functioning multicore processor and memory stacking. 3D-MAPS (3D Massively Parallel Processor with Stacked Memory) is a two-tier 3D IC, where the logic die consists of 64 general-purpose processor cores running at 277 MHz, and the memory die contains 256 KB of SRAM (see Fig. 10.6.1). Fabrication is done using 130 nm GlobalFoundries device technology and Tezzaron TSV and bonding technology. Packaging is done by Amkor. The processor contains 33M transistors, 50K TSVs, and 50K face-to-face connections in a 5×5 mm² footprint. The chip runs at 1.5 V and consumes up to 4 W, resulting in a 16 W/cm² power density. The core architecture was developed from scratch to benefit from single-cycle access to SRAM.
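The quoted power density follows directly from the chip's numbers; a one-line check:

```python
# Back-of-envelope check of the quoted 16 W/cm^2 power density:
# 4 W spread over the 5 mm x 5 mm (0.25 cm^2) footprint.
power_w = 4.0
footprint_cm2 = 0.5 * 0.5          # 5 mm x 5 mm = 0.5 cm x 0.5 cm
print(power_w / footprint_cm2)     # -> 16.0 W/cm^2
```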

181 citations

Proceedings ArticleDOI
13 Jun 2010
TL;DR: A systematic TSV stress-aware timing analysis is proposed, and it is shown that stress-aware perturbation can reduce cell delay by up to 14.0% and critical path delay by 6.5% in a test case.
Abstract: As geometry shrinking faces severe limitations, 3D wafer stacking with through-silicon vias (TSVs) has gained interest for future SOC integration. Since the TSV fill material and silicon have different coefficients of thermal expansion (CTE), TSVs cause silicon deformation owing to the temperature difference between chip manufacturing and operation. The most widely used TSV fill material is copper, which causes tensile stress in the silicon near the TSV. In this paper, we propose a systematic TSV stress-aware timing analysis and show how to optimize the layout for better performance. First, we generate a stress contour map with an analytical radial stress model. Then, the tensile stress is converted to hole and electron mobility variations depending on the geometric relation between TSVs and transistors. A mobility-variation-aware cell library and netlist are generated and incorporated into an industrial timing engine for 3D-IC timing analysis. It is interesting to observe that rise and fall times react differently to stress and to relative locations with respect to TSVs. Overall, TSV stress-induced timing variations can be as much as ±10% for an individual cell. Thus, as an application for layout optimization, we can exploit the stress-induced mobility enhancement to improve timing on critical cells. We show that stress-aware perturbation can reduce cell delay by up to 14.0% and critical path delay by 6.5% in our test case.
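A hedged sketch of the stress-to-mobility chain described above: the 1/r² radial decay is the standard analytical form for a cylindrical inclusion, while the amplitude and piezoresistive coefficient below are placeholders, not the paper's calibrated values.

```python
# Stress -> mobility sketch. The 1/r^2 decay follows the classic Lame
# solution for a cylindrical inclusion; sigma0 and pi_coeff are
# illustrative placeholders, not calibrated values.
def radial_stress(r_um, tsv_radius_um=2.5, sigma0_mpa=250.0):
    """Tensile stress (MPa) at distance r from the TSV center."""
    r = max(r_um, tsv_radius_um)
    return sigma0_mpa * (tsv_radius_um / r) ** 2

def mobility_shift(sigma_mpa, pi_coeff=-3e-4):
    """Fractional mobility change; sign/magnitude differ for holes vs electrons."""
    return pi_coeff * sigma_mpa

for r in (2.5, 5.0, 10.0, 20.0):
    s = radial_stress(r)
    print(f"r={r:5.1f} um  stress={s:6.1f} MPa  dmu/mu={mobility_shift(s):+.2%}")
```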

117 citations

Proceedings ArticleDOI
07 Nov 2010
TL;DR: This paper proposes a new TSV stress-driven force-directed 3D placement that consistently provides placement results with, on average, 21.6% better worst negative slack (WNS) and 28.0% better total negative slack (TNS) than wirelength-driven placement.
Abstract: Through-silicon via (TSV) fabrication causes tensile stress around TSVs, which results in significant carrier mobility variation in nearby devices. A keep-out zone (KOZ) is a conservative way to prevent any devices/cells from being impacted by the TSV-induced stress. However, owing to the already large TSV size, a large KOZ can significantly reduce the placement area available for cells, thus requiring larger dies and negating the wirelength and timing improvements of 3D integration. In this paper, we study the impact of KOZ dimension on stress, carrier mobility variation, area, wirelength, and performance of 3D ICs. We demonstrate that, instead of requiring a large KOZ, 3D-IC placers should exploit TSV stress-induced carrier mobility variation to improve the timing and area objectives during placement. We propose a new TSV stress-driven force-directed 3D placement that consistently provides placement results with, on average, 21.6% better worst negative slack (WNS) and 28.0% better total negative slack (TNS) than wirelength-driven placement.
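To see why a large KOZ inflates die size, here is a rough area model; all dimensions and counts are illustrative assumptions, not the paper's data.

```python
# Rough model of placement area lost to keep-out zones (KOZs).
# All numbers are illustrative assumptions.
def koz_overhead(n_tsvs, tsv_width_um, koz_width_um, die_area_um2):
    # Each TSV blocks a square of (TSV width + KOZ ring on both sides).
    blocked = n_tsvs * (tsv_width_um + 2 * koz_width_um) ** 2
    return blocked / die_area_um2

die = 2500.0 * 2500.0                      # hypothetical 2.5 mm x 2.5 mm die
for koz in (0.0, 5.0, 10.0, 20.0):         # KOZ ring width in um
    frac = koz_overhead(n_tsvs=1000, tsv_width_um=5.0, koz_width_um=koz,
                        die_area_um2=die)
    print(f"KOZ={koz:4.1f} um -> {frac:.1%} of die blocked")
```

Even modest KOZ widths block a sizable fraction of the die once TSV counts reach the thousands, which is the motivation for exploiting the stress rather than fencing it off.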

94 citations

Proceedings ArticleDOI
01 Nov 2010
TL;DR: The design and analysis of 3D-MAPS, a 64-core 3D-stacked memory-on-processor running at 277 MHz with 63 GB/s memory bandwidth, sent for fabrication using Tezzaron's 3D stacking technology, is described.
Abstract: We describe the design and analysis of 3D-MAPS, a 64-core 3D-stacked memory-on-processor running at 277 MHz with 63 GB/s memory bandwidth, sent for fabrication using Tezzaron's 3D stacking technology. We also describe the design flow used to implement it using industrial 2D tools and custom add-ons to handle 3D specifics.
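A hedged sanity check on the bandwidth figure, assuming one 4-byte SRAM access per core per cycle (an assumption, not a quoted spec):

```python
# Raw aggregate bandwidth under the assumed access pattern; the result is
# in the same ballpark as the quoted 63 GB/s sustained figure.
cores, bytes_per_access, freq_hz = 64, 4, 277e6
print(cores * bytes_per_access * freq_hz / 1e9)   # -> ~70.9 GB/s peak
```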

78 citations


Cited by
ReportDOI
08 Dec 1998
TL;DR: In this report, the authors argue that the FCC should take the unique features of UWB technology into account when considering changes to Part 15, given UWB's value for radar and communications uses.
Abstract: In general, Micropower Impulse Radar (MIR) depends on Ultra-Wideband (UWB) transmission systems. UWB technology can supply innovative new systems and products that have an obvious value for radar and communications uses. Important applications include bridge-deck inspection systems, ground penetrating radar, mine detection, and precise distance resolution for such things as liquid level measurement. Most of these UWB inspection and measurement methods have some unique qualities, which need to be pursued. Therefore, in considering changes to Part 15 the FCC needs to take into account the unique features of UWB technology. MIR is applicable to two general types of UWB systems: radar systems and communications systems. Currently LLNL and its licensees are focusing on radar or radar type systems. LLNL is evaluating MIR for specialized communication systems. MIR is a relatively low power technology. Therefore, MIR systems seem to have a low potential for causing harmful interference to other users of the spectrum since the transmitted signal is spread over a wide bandwidth, which results in a relatively low spectral power density.
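The low spectral power density argument can be made concrete with a toy calculation; the transmit power and bandwidths below are hypothetical.

```python
# Spreading a fixed transmit power over a wider band lowers the power
# per unit of spectrum (PSD = P / B). Numbers are hypothetical.
p_watts = 1e-3
for bw_hz in (1e6, 100e6, 2e9):              # narrowband vs UWB-like spreads
    print(f"BW={bw_hz:.0e} Hz -> PSD={p_watts / bw_hz:.1e} W/Hz")
```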

644 citations

Proceedings ArticleDOI
23 Mar 2014
TL;DR: A number of key elements necessary in realizing efficient NDC operation are described and evaluated, including low-EPI cores, long daisy chains of memory devices, and the dynamic activation of cores and SerDes links.
Abstract: While Processing-in-Memory has been investigated for decades, it has not been embraced commercially. A number of emerging technologies have renewed interest in this topic. In particular, the emergence of 3D stacking and the imminent release of Micron's Hybrid Memory Cube device have made it more practical to move computation near memory. However, the literature is missing a detailed analysis of a killer application that can leverage a Near Data Computing (NDC) architecture. This paper focuses on in-memory MapReduce workloads that are commercially important and are especially suitable for NDC because of their embarrassing parallelism and largely localized memory accesses. The NDC architecture incorporates several simple processing cores on a separate, non-memory die in a 3D-stacked memory package; these cores can perform Map operations with efficient memory access and without hitting the bandwidth wall. This paper describes and evaluates a number of key elements necessary in realizing efficient NDC operation: (i) low-EPI cores, (ii) long daisy chains of memory devices, and (iii) the dynamic activation of cores and SerDes links. Compared to a baseline that is heavily optimized for MapReduce execution, the NDC design yields up to 15X reduction in execution time and 18X reduction in system energy.
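A toy roofline-style model of why bandwidth-bound Map phases benefit from in-stack cores; the bandwidth and workload numbers are illustrative assumptions, not the paper's measurements.

```python
# Execution time is bounded by either data movement or compute.
def exec_time(bytes_moved, flops, bw_gbs, gflops):
    return max(bytes_moved / (bw_gbs * 1e9), flops / (gflops * 1e9))

bytes_moved, flops = 64e9, 8e9               # bandwidth-bound Map phase
host = exec_time(bytes_moved, flops, bw_gbs=20, gflops=200)   # off-chip link
ndc  = exec_time(bytes_moved, flops, bw_gbs=320, gflops=50)   # in-stack cores
print(f"host {host:.2f} s, NDC {ndc:.2f} s, speedup {host / ndc:.1f}x")
```

The point is structural: weaker cores win when the phase is limited by bandwidth rather than compute.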

263 citations

Proceedings ArticleDOI
01 Feb 2015
TL;DR: This paper proposes near-DRAM acceleration (NDA) architectures, which process data using accelerators 3D-stacked on DRAM devices comprising off-chip main memory modules, substantially reducing energy consumption and improving performance.
Abstract: Energy consumed for transferring data across the processor memory hierarchy constitutes a large fraction of total system energy consumption, and this fraction has steadily increased with technology scaling. In this paper, we propose near-DRAM acceleration (NDA) architectures, which process data using accelerators 3D-stacked on DRAM devices comprising off-chip main memory modules. NDA transfers most data through high-bandwidth and low-energy 3D interconnects between accelerators and DRAM devices instead of low-bandwidth and high-energy off-chip interconnects between a processor and DRAM devices, substantially reducing energy consumption and improving performance. Unlike previous near-memory processing architectures, NDA is built upon commodity DRAM devices; apart from inserting through-silicon vias (TSVs) to 3D-interconnect DRAM devices and accelerators, NDA requires minimal changes to the commodity DRAM device and standard memory module architectures. This allows NDA to be more easily adopted in both existing and emerging systems. Our experiments demonstrate that, on average, our NDA-based system consumes 46% (68%) lower (data transfer) energy at 1.67× higher performance than a system that integrates the same accelerator logic within the processor itself.
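A sketch of the per-bit energy argument behind NDA; the pJ/bit figures are rough literature ballpark assumptions, not NDA's measured numbers.

```python
# Per-bit energy of off-chip links vs 3D TSV interconnect, applied to a
# hypothetical 1 GB transfer. Energy figures are ballpark assumptions.
off_chip_pj_per_bit = 20.0
tsv_pj_per_bit = 0.5
bits_moved = 8 * 1e9                         # hypothetical 1 GB transferred
for name, e in [("off-chip", off_chip_pj_per_bit), ("TSV", tsv_pj_per_bit)]:
    print(f"{name:8s}: {bits_moved * e / 1e12:.3f} J")
```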

251 citations

Journal ArticleDOI
18 Jun 2016
TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
Abstract: Main memory bandwidth is a critical bottleneck for modern GPU systems due to limited off-chip pin bandwidth. 3D-stacked memory architectures provide a promising opportunity to significantly alleviate this bottleneck by directly connecting a logic layer to the DRAM layers with high bandwidth connections. Recent work has shown promising potential performance benefits from an architecture that connects multiple such 3D-stacked memories and offloads bandwidth-intensive computations to a GPU in each of the logic layers. An unsolved key challenge in such a system is how to enable computation offloading and data mapping to multiple 3D-stacked memories without burdening the programmer such that any application can transparently benefit from near-data processing capabilities in the logic layer. Our paper develops two new mechanisms to address this key challenge. First, a compiler-based technique that automatically identifies code to offload to a logic-layer GPU based on a simple cost-benefit analysis. Second, a software/hardware cooperative mechanism that predicts which memory pages will be accessed by offloaded code, and places those pages in the memory stack closest to the offloaded code, to minimize off-chip bandwidth consumption. We call the combination of these two programmer-transparent mechanisms TOM: Transparent Offloading and Mapping. Our extensive evaluations across a variety of modern memory-intensive GPU workloads show that, without requiring any program modification, TOM significantly improves performance (by 30% on average, and up to 76%) compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
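A minimal sketch of a cost-benefit offload test in the spirit of TOM's compiler analysis; the traffic estimates and threshold are illustrative, not the paper's exact model.

```python
# Offload a code block when it saves more off-chip traffic than it adds.
# Traffic estimates and threshold are illustrative assumptions.
def should_offload(bytes_loaded, bytes_stored, bytes_ctx, threshold=1.0):
    saved = bytes_loaded + bytes_stored      # traffic kept inside the stack
    added = bytes_ctx                        # context/results shipped instead
    return saved > threshold * added

print(should_offload(bytes_loaded=4096, bytes_stored=1024, bytes_ctx=256))  # True
print(should_offload(bytes_loaded=64,   bytes_stored=0,    bytes_ctx=256))  # False
```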

234 citations