Author

M. Lanzerotti

Bio: M. Lanzerotti is an academic researcher from IBM. The author has contributed to research in topics: POWER6 & Microprocessor. The author has an h-index of 1 and has co-authored 1 publication receiving 120 citations.

Papers
Proceedings ArticleDOI
18 Jun 2007
TL;DR: The POWER6™ microprocessor combines ultra-high frequency operation, aggressive power reduction, a highly scalable memory subsystem, and mainframe-like reliability, availability, and serviceability.
Abstract: The POWER6™ microprocessor combines ultra-high frequency operation, aggressive power reduction, a highly scalable memory subsystem, and mainframe-like reliability, availability, and serviceability. The 341mm², 700M-transistor dual-core microprocessor is fabricated in a 65nm SOI process with 10 levels of low-k copper interconnect. It operates at clock frequencies over 5GHz in high-performance applications, and consumes under 100W in power-sensitive applications.
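As a quick back-of-envelope illustration of the figures quoted in this abstract (all inputs come from the abstract itself; the arithmetic below is only a sanity check, not part of the paper):

# Back-of-envelope arithmetic from the abstract's figures (illustrative only).
die_area_mm2 = 341            # die area
transistors = 700e6           # transistor count
freq_hz = 5e9                 # "over 5GHz" in high-performance applications

density = transistors / die_area_mm2    # ~2.05 million transistors per mm^2
cycle_time_ps = 1e12 / freq_hz          # 200 ps per clock cycle at 5GHz

print(f"~{density / 1e6:.2f}M transistors/mm^2, {cycle_time_ps:.0f} ps cycle time")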

120 citations


Cited by
01 Jan 2010
TL;DR: The TILE64™ processor, as mentioned in this paper, is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications, with 64 tile processors arranged in an 8x8 array.
Abstract: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.
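The tiled organization lends itself to a simple software model. The sketch below assumes dimension-ordered (XY) routing on the 8x8 grid, which is a common choice for 2D meshes but is not stated in the abstract; the tile numbering is likewise an assumption made for illustration.

# Minimal model of a 64-tile, 8x8 mesh with dimension-ordered (XY) routing.
# The routing policy and tile numbering are assumptions for illustration; the
# abstract only says the tiles connect through a scalable 2D mesh network.
GRID = 8  # 8x8 array, 64 tiles total

def tile_coords(tile_id):
    """Map a tile id (0..63) to its (x, y) position in the mesh."""
    return tile_id % GRID, tile_id // GRID

def xy_route(src, dst):
    """Return the sequence of mesh coordinates visited, X dimension first."""
    (sx, sy), (dx, dy) = tile_coords(src), tile_coords(dst)
    x, y, path = sx, sy, [(sx, sy)]
    while x != dx:                        # walk the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                        # then walk the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

path = xy_route(0, 63)                    # corner-to-corner worst case
print(len(path) - 1, "hops")              # 14 hops across an 8x8 mesh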

634 citations

Proceedings ArticleDOI
01 Feb 2008
TL;DR: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications.
Abstract: The TILE64™ processor is a multicore SoC targeting the high-performance demands of a wide range of embedded applications across networking and digital multimedia applications. A figure shows a block diagram with 64 tile processors arranged in an 8x8 array. These tiles connect through a scalable 2D mesh network with high-speed I/Os on the periphery. Each general-purpose processor is identical and capable of running SMP Linux.

587 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: A range of cache refresh and placement schemes that are sensitive to retention time is proposed, and it is shown that most of the retention time variations can be masked by the microarchitecture when using these schemes.
Abstract: Process variations will greatly impact the stability, leakage power consumption, and performance of future microprocessors. These variations are especially detrimental to 6T SRAM (6-transistor static memory) structures and will become critical with continued technology scaling. In this paper, we propose new on-chip memory architectures based on novel 3T1D DRAM (3-transistor, 1-diode dynamic memory) cells. We provide a detailed comparison between 6T and 3T1D designs in the context of an L1 data cache. The effects of physical device variation on a 3T1D cache can be lumped into variation of data retention times. This paper proposes a range of cache refresh and placement schemes that are sensitive to retention time, and we show that most of the retention time variations can be masked by the microarchitecture when using these schemes. We have performed detailed circuit and architectural simulations assuming different degrees of variability in advanced technology nodes, and we show that the resulting memory architecture can tolerate large process variations with little or even no impact on performance when compared to ideal 6T SRAM designs. Furthermore, these designs are robust to memory cell stability issues and can achieve large power savings. These advantages make the new memory architectures a promising choice for on-chip variation-tolerant cache structures required for next generation microprocessors.
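The retention-aware placement idea can be illustrated with a toy model: within a set, lines expected to live only briefly go to ways whose cells have shorter measured retention, keeping the long-retention ways free for longer-lived data. This is a simplified sketch under invented retention numbers and an assumed "expected lifetime" hint, not the paper's actual refresh or placement schemes.

# Toy sketch of retention-time-aware placement in one cache set. Retention
# values and the expected-lifetime hint are invented for illustration.
from dataclasses import dataclass

@dataclass
class Way:
    retention_cycles: int    # how long this 3T1D cell group holds data unrefreshed
    valid: bool = False
    tag: int = -1

def place(ways, tag, expected_lifetime):
    """Pick the way with the smallest retention that still covers the expected
    lifetime; if none qualifies, fall back to the longest-retention way."""
    candidates = [w for w in ways if not w.valid] or ways     # naive eviction
    ok = [w for w in candidates if w.retention_cycles >= expected_lifetime]
    chosen = (min(ok, key=lambda w: w.retention_cycles) if ok
              else max(candidates, key=lambda w: w.retention_cycles))
    chosen.valid, chosen.tag = True, tag
    return chosen

ways = [Way(1000), Way(4000), Way(16000), Way(64000)]   # variation across cells
print(place(ways, 0xA, expected_lifetime=3000).retention_cycles)    # 4000
print(place(ways, 0xB, expected_lifetime=50000).retention_cycles)   # 64000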

359 citations

Journal ArticleDOI
09 Jun 2012
TL;DR: This work introduces a methodology for designing scalable and efficient scale-out server processors based on a metric of performance-density, and facilitates the design of optimal multi-core configurations, called pods.
Abstract: Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by the on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., inter-pod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scale-out chips improve throughput by 5x-6.5x over conventional and by 1.6x-1.9x over emerging tiled organizations.
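The pod-sizing step can be sketched as a small search: evaluate candidate core/last-level-cache configurations by performance density (throughput per unit area), keep the best one, and replicate it to fill the die. The area and throughput numbers below are invented placeholders, not the paper's data; only the selection logic mirrors the methodology described above.

# Hypothetical performance-density search for a pod configuration. All area
# and throughput figures are made up for illustration.
CORE_AREA_MM2 = 3.0
LLC_AREA_PER_MB_MM2 = 4.0
POD_OVERHEAD_MM2 = 2.0            # per-pod interconnect/uncore area (assumed)
DIE_AREA_MM2 = 300.0

def pod_area(cores, llc_mb):
    return cores * CORE_AREA_MM2 + llc_mb * LLC_AREA_PER_MB_MM2 + POD_OVERHEAD_MM2

def pod_throughput(cores, llc_mb):
    """Toy model: per-core throughput falls off if the shared LLC is too small."""
    per_core = min(1.0, llc_mb / (0.5 * cores))   # assume ~0.5 MB needed per core
    return cores * per_core

def perf_density(cfg):
    cores, llc_mb = cfg
    return pod_throughput(cores, llc_mb) / pod_area(cores, llc_mb)

candidates = [(c, m) for c in (4, 8, 16, 32) for m in (2.0, 4.0, 8.0)]
cores, llc_mb = max(candidates, key=perf_density)
pods = int(DIE_AREA_MM2 // pod_area(cores, llc_mb))
print(f"pod = {cores} cores + {llc_mb} MB LLC; {pods} pods per die; "
      f"chip throughput ~{pods * pod_throughput(cores, llc_mb):.0f}")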

185 citations

Journal ArticleDOI
01 Jun 2008
TL;DR: FlexTM (FLEXible Transactional Memory) is described, along with an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits; its distributed commit protocol is also more efficient than a central hardware manager.
Abstract: A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the threads with which conflicts have occurred; Programmable Data Isolation, which maintains speculative updates in the local cache and employs a thread-private buffer (in virtual memory) in the rare event of overflow; and Alert-On-Update, which selectively notifies threads about coherence events. All mechanisms are software-accessible, to enable virtualization and to support transactions of arbitrary length. FlexTM allows software to determine when to manage conflicts (either eagerly or lazily), and to employ a variety of conflict management and commit protocols. We describe an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits. In experiments with a prototype on Simics/GEMS, FlexTM exhibits 5x speedup over high-quality software TM, with no loss in policy flexibility. Its distributed commit protocol is also more efficient than a central hardware manager. Our results highlight the importance of flexibility in determining when to manage conflicts: lazy maximizes concurrency and helps to ensure forward progress while eager provides better overall utilization in a multi-programmed system.
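The signature-plus-CST bookkeeping can be modelled in a few lines of software. The sketch below uses a single-hash, Bloom-filter-style bit vector per thread and marks conflicts in per-thread tables; it is a simplified illustration of the idea, not the hardware mechanisms or coherence-based detection the paper describes.

# Simplified software model of per-thread read/write signatures and conflict
# summary tables (CSTs). Real FlexTM keeps these in hardware and detects
# conflicts through the coherence protocol; this is only an illustration.
SIG_BITS = 64

class TxThread:
    def __init__(self, tid, nthreads):
        self.tid = tid
        self.rsig = 0                     # read signature (bit vector)
        self.wsig = 0                     # write signature (bit vector)
        self.cst = [False] * nthreads     # which threads we have conflicted with

def sig_bit(addr):
    """Hash an address to one signature bit (false positives are possible)."""
    return 1 << (addr % SIG_BITS)

def record_read(t, addr):
    t.rsig |= sig_bit(addr)

def record_write(writer, addr, threads):
    """Record the write and note read-write / write-write overlaps in the CSTs;
    resolving the conflict can then be done eagerly or deferred (lazily)."""
    writer.wsig |= sig_bit(addr)
    probe = sig_bit(addr)
    for t in threads:
        if t is not writer and ((t.rsig | t.wsig) & probe):
            writer.cst[t.tid] = True
            t.cst[writer.tid] = True

threads = [TxThread(0, 2), TxThread(1, 2)]
record_read(threads[0], 0x1000)            # thread 0 reads address 0x1000
record_write(threads[1], 0x1000, threads)  # thread 1 writes the same address
print(threads[1].cst)                      # [True, False]: conflict with thread 0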

141 citations