Author

Gregory J. Fredeman

Other affiliations: GlobalFoundries
Bio: Gregory J. Fredeman is an academic researcher from IBM. The author has contributed to research on topics including DRAM and eDRAM. The author has an h-index of 14 and has co-authored 35 publications receiving 665 citations. Previous affiliations of Gregory J. Fredeman include GlobalFoundries.
Topics: DRAM, eDRAM, Cache, CPU cache, Sense amplifier

Papers
Proceedings Article
14 Jun 2007
TL;DR: A compact eFUSE programmable array memory configured as a 4 Kb one-time programmable ROM (OTPROM) is presented, demonstrating a >10X density increase over traditional VLSI fuse circuits.
Abstract: Demonstrating a >10X density increase over traditional VLSI fuse circuits, a compact eFUSE programmable array memory configured as a 4 Kb one-time programmable ROM (OTPROM) is presented using a 6.2 μm² NiSix silicide electromigration 1T1R cell in 65 nm SOI CMOS. A 20 μs programming time at 1.5 V is achieved by asymmetrical scaling of the fuse and a shared differential sensing scheme. Having zero process cost adder, eFUSE is fully compatible with standard VLSI manufacturing.

92 citations

Journal Article
TL;DR: A 1.35 ns random-access, 1.7 ns random-cycle SOI embedded-DRAM macro has been developed for the POWER7™ high-performance microprocessor, allowing the embedded DRAM to operate reliably without constraining the microprocessor's voltage supply windows.
Abstract: A 1.35 ns random-access, 1.7 ns random-cycle SOI embedded-DRAM macro has been developed for the POWER7™ high-performance microprocessor. The macro employs a 6-transistor micro sense-amplifier architecture with an extended precharge scheme to enhance the sensing margin for product quality. The detailed study shows a 67% bit-line power reduction with only 1.7% area overhead, while improving the read-zero margin by more than 500 ps. The array voltage window is improved by the programmable BL voltage generator, allowing the embedded DRAM to operate reliably without constraining the microprocessor's voltage supply windows. The 2.5 nm gate oxide transistor cell with deep-trench capacitor is accessed by the 1.7 V wordline high voltage (VPP) with V WL low voltage (VWL), and both are generated internally within the microprocessor. This results in a 32 MB on-chip L3 cache for 8 cores in a 567 mm² POWER7™ die.

63 citations

Proceedings Article
01 Jan 2008
TL;DR: In this article, the authors describe a 500 MHz random-cycle Silicon on Insulator (SOI) embedded DRAM macro which features a three-transistor micro sense amplifier, realizing significant performance gains over traditional array design methods.
Abstract: As microprocessors enter the highly multi-core/multi-threaded era, higher density, lower latency embedded memory will be required to meet cache design needs. This paper describes a 500 MHz random-cycle Silicon on Insulator (SOI) embedded DRAM macro which features a three-transistor micro sense amplifier, realizing significant performance gains over traditional array design methods. To address the realities of process integration, we describe the features and issues associated with integrating this DRAM into SOI technology, including deep trench processing and floating body effects. After a brief description of the macro architecture, details are provided on the three-transistor micro sense amplifier scheme, which is key to achieving a high transfer ratio with minimal area overhead. The paper concludes with hardware results and a summary.

62 citations

Journal Article
13 Sep 2004
TL;DR: An 800-MHz embedded DRAM macro employs a memory cell built on a 2.2-nm gate oxide 1.5 V IO device from the 90-nm high-performance technology menu, and a concurrent refresh mode improves the memory utilization to over 99% for a 64 μs data retention time.
Abstract: An 800-MHz embedded DRAM macro employs a memory cell utilizing a device from the 90-nm high-performance technology menu: a 2.2-nm gate oxide 1.5 V IO device. A concurrent refresh mode is designed to improve the memory utilization to over 99% for a 64 μs data retention time. A concurrent refresh scheduler utilizes up-count and down-count registers to identify at least one array to be refreshed at every clock cycle, emulating a classical distributed refresh mode. A command multiplier employs low-frequency phased clock signals to generate the clock, commands, and addresses at rates up to 4× that of the tester frequency. The macro integrates masked redundancy allocation logic during at-speed multibank test. The hardware results show a 312-MHz random access frequency and an 800-MHz multibank frequency at 1.2 V, respectively.

59 citations

Proceedings Article
18 Jun 2007
TL;DR: A prototype SOI embedded DRAM macro is developed for high-performance microprocessors and introduces a performance-enhancing 3T micro sense amplifier architecture (μSA), which confirms a 1.5 ns random access time with a 1 V supply at 85 °C and low-voltage operation with a 600 mV supply.
Abstract: A prototype SOI embedded DRAM macro is developed for high-performance microprocessors and introduces a performance-enhancing 3T micro sense amplifier architecture (μSA). The macro was characterized via a test chip fabricated in a 65 nm SOI deep-trench DRAM process. Measurements confirm a 1.5 ns random access time with a 1 V supply at 85 °C and low-voltage operation with a 600 mV supply.

59 citations


Cited by
Journal Article
TL;DR: The studies based on the proposed scaling methodology show that in-plane STT-MRAM will outperform SRAM from 15 nm node, while its perpendicular counterpart requires further innovations in MTJ material in order to overcome the poor write performance scaling from 22 nm node onwards.
Abstract: This paper explores the scalability of in-plane and perpendicular MTJ based STT-MRAMs from 65 nm to 8 nm while taking into consideration realistic variability effects. We focus on the read and write performances of a STT-MRAM based cache rather than the obvious advantages such as the denser bit-cell and zero static power. An accurate MTJ macromodel capturing key MTJ properties was adopted for efficient Monte Carlo simulations. For the simulation of access devices and peripheral circuitries, ITRS projected transistor parameters were utilized and calibrated using the MASTAR tool that has been widely used in industry. 6T SRAM and STT-MRAM arrays were implemented with aggressive assist schemes to mimic industrial memory designs. A constant JC0·RA/VDD scaling scenario was used which to the first order gives the optimal balance between read and write margins of STT-MRAMs. The thermal stability factor ensuring a 10 year retention time was obtained by adjusting the free layer thickness as well as assuming improvement in the crystalline anisotropy. Our studies based on the proposed scaling methodology show that in-plane STT-MRAM will outperform SRAM from 15 nm node, while its perpendicular counterpart requires further innovations in MTJ material in order to overcome the poor write performance scaling from 22 nm node onwards.

322 citations

Proceedings Article
14 Oct 2017
TL;DR: DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, is proposed to provide both powerful computing capability and large memory capacity/bandwidth to address the memory wall problem in traditional von Neumann architecture.
Abstract: Data movement between the processing units and the memory in traditional von Neumann architecture is creating the “memory wall” problem. To bridge the gap, two approaches, the memory-rich processor (more on-chip memory) and the compute-capable memory (processing-in-memory), have been studied. However, the first one has strong computing capability but limited memory capacity/bandwidth, whereas the second one is exactly the opposite. To address the challenge, we propose DRISA, a DRAM-based Reconfigurable In-Situ Accelerator architecture, to provide both powerful computing capability and large memory capacity/bandwidth. DRISA is primarily composed of DRAM memory arrays, in which every memory bitline can perform bitwise Boolean logic operations (such as NOR). DRISA can be reconfigured to compute various functions with the combination of the functionally complete Boolean logic operations and the proposed hierarchical internal data movement designs. We further optimize DRISA to achieve high performance by simultaneously activating multiple rows and subarrays to provide massive parallelism, unblocking the internal data movement bottlenecks, and optimizing activation latency and energy. We explore four design options and present a comprehensive case study to demonstrate significant acceleration of convolutional neural networks. The experimental results show that DRISA can achieve 8.8× speedup and 1.2× better energy efficiency compared with ASICs, and 7.7× speedup and 15× better energy efficiency over GPUs with integer operations.
CCS Concepts: Hardware → Dynamic memory; Computer systems organization → Reconfigurable computing; Neural networks.
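The abstract's point that a functionally complete operation such as NOR is enough to compute other Boolean functions can be illustrated in software. The following is a minimal sketch, not a model of DRISA's circuits or data movement: it treats a memory row as a fixed-width Python integer, and `WIDTH`, `MASK`, and every function name are assumptions introduced only for this illustration.

```python
# Illustrative only: a "row" is modeled as a WIDTH-bit integer (an assumption).
WIDTH = 8
MASK = (1 << WIDTH) - 1


def nor(a: int, b: int) -> int:
    """Bitwise NOR of two rows, truncated to WIDTH bits."""
    return ~(a | b) & MASK


def not_(a: int) -> int:
    """NOT built from NOR: NOR(a, a)."""
    return nor(a, a)


def or_(a: int, b: int) -> int:
    """OR built from NOR: NOT(NOR(a, b))."""
    return not_(nor(a, b))


def and_(a: int, b: int) -> int:
    """AND built from NOR via De Morgan: NOR(NOT a, NOT b)."""
    return nor(not_(a), not_(b))


def xor_(a: int, b: int) -> int:
    """XOR built from NOR: NOR(NOR(a, b), AND(a, b))."""
    return nor(nor(a, b), and_(a, b))


if __name__ == "__main__":
    a, b = 0b11001010, 0b10100110
    print(format(xor_(a, b), "08b"))
```

Every derived operation here costs several NOR steps, which is one reason an in-memory design would care about the latency and energy of each activation.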

315 citations

Proceedings Article
01 Dec 2007
TL;DR: The basic concept behind the scheme is that a DRAM row that was recently read or written to by the processor does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy dissipated.
Abstract: DRAMs require periodic refresh for preserving data stored in them. The refresh interval for DRAMs depends on the vendor and the design technology they use. For each refresh in a DRAM row, the stored information in each cell is read out and then written back to itself as each DRAM bit read is self-destructive. The refresh process is inevitable for maintaining data correctness, unfortunately, at the expense of power and bandwidth overhead. The future trend to integrate layers of 3D die-stacked DRAMs on top of a processor further exacerbates the situation as accesses to these DRAMs will be more frequent and hiding refresh cycles in the available slack becomes increasingly difficult. Moreover, due to the implication of temperature increase, the refresh interval of 3D die-stacked DRAMs will become shorter than those of conventional ones. This paper proposes an innovative scheme to alleviate the energy consumed in DRAMs. By employing a time-out counter for each memory row of a DRAM module, all the unnecessary periodic refresh operations can be eliminated. The basic concept behind our scheme is that a DRAM row that was recently read or written to by the processor (or other devices that share the same DRAM) does not need to be refreshed again by the periodic refresh operation, thereby eliminating excessive refreshes and the energy dissipated. Based on this concept, we propose a low-cost technique in the memory controller for DRAM power reduction. The simulation results show that our technique can reduce up to 86% of all refresh operations and 59.3% on the average for a 2GB DRAM. This in turn results in a 52.6% energy savings for refresh operations. The overall energy saving in the DRAM is up to 25.7% with an average of 12.13% obtained for SPLASH-2, SPECint2000, and Biobench benchmark programs simulated on a 2GB DRAM. For a 64MB 3D DRAM, the energy saving is up to 21% and 9.37% on an average when the refresh rate is 64 ms. For a faster 32 ms refresh rate the maximum and average savings are 12% and 6.8% respectively.
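The per-row time-out counter idea described in this abstract can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the paper's controller design: the class name `TimeoutRefreshController`, the tick granularity, and the counter reset policy are all hypothetical choices made for illustration.

```python
class TimeoutRefreshController:
    """Toy model: one down-counter per row; an access resets the counter,
    so recently accessed rows skip their periodic refresh (assumed policy)."""

    def __init__(self, num_rows: int, ticks_per_interval: int) -> None:
        self.ticks_per_interval = ticks_per_interval
        self.counters = [ticks_per_interval] * num_rows
        self.refreshes = 0

    def access(self, row: int) -> None:
        # A read or write restores the row's charge, so restart its time-out.
        self.counters[row] = self.ticks_per_interval

    def tick(self) -> int:
        """Advance one time step; refresh only rows whose counter expired."""
        refreshed = 0
        for row in range(len(self.counters)):
            self.counters[row] -= 1
            if self.counters[row] == 0:
                refreshed += 1
                self.counters[row] = self.ticks_per_interval
        self.refreshes += refreshed
        return refreshed


if __name__ == "__main__":
    ctrl = TimeoutRefreshController(num_rows=4, ticks_per_interval=4)
    for t in range(8):
        if t % 2 == 0:
            ctrl.access(0)  # row 0 is touched often enough to never time out
        ctrl.tick()
    print(ctrl.refreshes)
```

In this toy run a conventional scheme would issue 8 refreshes (4 rows over 2 intervals), while the counter-based scheme skips both refreshes of the frequently accessed row.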

305 citations

Journal Article
TL;DR: Power Systems™ continue strong with the 7th-generation Power chip: a balanced multi-core design with eDRAM technology and SMT4, delivering greater than 4X performance in the same power envelope as the previous generation.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.

259 citations

Proceedings Article
19 Jun 2010
TL;DR: The significant impact of variations on refresh time and cache power consumption for large eDRAM caches is shown and Hi-ECC, a technique that incorporates multi-bit error-correcting codes to significantly reduce refresh rate, is proposed.
Abstract: Technology advancements have enabled the integration of large on-die embedded DRAM (eDRAM) caches. eDRAM is significantly denser than traditional SRAMs, but must be periodically refreshed to retain data. Like SRAM, eDRAM is susceptible to device variations, which play a role in determining refresh time for eDRAM cells. Refresh power potentially represents a large fraction of overall system power, particularly during low-power states when the CPU is idle. Future designs need to reduce cache power without incurring the high cost of flushing cache data when entering low-power states. In this paper, we show the significant impact of variations on refresh time and cache power consumption for large eDRAM caches. We propose Hi-ECC, a technique that incorporates multi-bit error-correcting codes to significantly reduce refresh rate. Multi-bit error-correcting codes usually have a complex decoder design and high storage cost. Hi-ECC avoids the decoder complexity by using strong ECC codes to identify and disable sections of the cache with multi-bit failures, while providing efficient single-bit error correction for the common case. Hi-ECC includes additional optimizations that allow us to amortize the storage cost of the code over large data words, providing the benefit of multi-bit correction at same storage cost as a single-bit error-correcting (SECDED) code (2% overhead). Our proposal achieves a 93% reduction in refresh power vs. a baseline eDRAM cache without error correcting capability, and a 66% reduction in refresh power vs. a system using SECDED codes.
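The abstract's remark about amortizing code storage over large data words can be illustrated with the standard check-bit count for a SECDED (extended Hamming) code. This is a minimal arithmetic sketch, not Hi-ECC's actual code construction; the function names and the example word sizes are assumptions chosen for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming (SECDED) code over data_bits.

    Single-error correction needs the smallest r with 2**r >= data_bits + r + 1;
    one extra overall parity bit adds double-error detection.
    """
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def overhead_percent(data_bits: int) -> float:
    """Storage overhead of the check bits relative to the data bits."""
    return 100.0 * secded_check_bits(data_bits) / data_bits


if __name__ == "__main__":
    for k in (64, 512, 8192):  # example word sizes, assumed for illustration
        print(k, secded_check_bits(k), round(overhead_percent(k), 2))
```

Because the check-bit count grows only logarithmically with the word size, the relative overhead falls quickly as the protected word gets larger, which is the effect the abstract relies on when it spreads the code over large data words.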

231 citations