Author

Ron Kalla

Bio: Ron Kalla is an academic researcher from IBM. The author has contributed to research in topics: POWER5 & IBM. The author has an h-index of 3 and has co-authored 4 publications receiving 322 citations.

Papers
Journal ArticleDOI
TL;DR: IBM's 7th-generation Power chip continues the Power Systems line with a balanced multi-core design, eDRAM technology, and SMT4, delivering greater than 4X the performance of the previous generation in the same power envelope.
Abstract: The Power7 is IBM's first eight-core processor, with each core capable of four-way simultaneous-multithreading operation. Its key architectural features include an advanced memory hierarchy with three levels of on-chip cache; embedded-DRAM devices used in the highest level of the cache; and a new memory interface. This balanced multicore design scales from 1 to 32 sockets in commercial and scientific environments.
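To put the abstract's figures in concrete terms, here is a minimal back-of-the-envelope sketch in C. The per-chip numbers (8 cores, SMT4) and the 1-32 socket range come from the abstract; the hardware-thread totals are derived arithmetic, not figures quoted from the paper.

```c
/* Thread-count arithmetic implied by the abstract: 8 cores per chip,
 * 4 SMT threads per core, scaling from 1 to 32 sockets. The totals are
 * derived here for illustration, not quoted from the paper. */
#include <stdio.h>

int main(void) {
    const int cores_per_chip = 8;  /* Power7: IBM's first eight-core processor */
    const int smt_per_core   = 4;  /* four-way simultaneous multithreading */

    for (int sockets = 1; sockets <= 32; sockets *= 2)
        printf("%2d socket(s): %4d hardware threads\n",
               sockets, sockets * cores_per_chip * smt_per_core);
    return 0;
}
```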

259 citations

Journal ArticleDOI
TL;DR: With a new core microarchitecture design, along with an innovative I/O fabric to support several accelerated computing requirements, the Power9 processor meets the diverse computing needs of the cognitive era and provides a platform for accelerated computing.
Abstract: The IBM Power9 processor has an enhanced core and chip architecture that provides superior thread performance and higher throughput. The core and chip architectures are optimized for emerging workloads to support the needs of next-generation computing. Multiple variants of silicon target the scale-out and scale-up markets. With a new core microarchitecture design, along with an innovative I/O fabric to support several accelerated computing requirements, the Power9 processor meets the diverse computing needs of the cognitive era and provides a platform for accelerated computing.
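As context for the "multiple variants of silicon" mentioned above, the short sketch below contrasts the two widely documented Power9 configurations. The core counts and SMT widths are taken from IBM's public Power9 material, not from this abstract, so treat them as an assumption.

```c
/* The two widely documented Power9 silicon variants: SMT4 cores for the
 * scale-out market, SMT8 cores for scale-up. Core counts and SMT widths
 * are from public IBM material, not from this paper. */
#include <stdio.h>

struct variant { const char *market; int cores; int smt; };

int main(void) {
    const struct variant v[] = {
        { "scale-out", 24, 4 },  /* up to 24 SMT4 cores */
        { "scale-up",  12, 8 },  /* up to 12 SMT8 cores */
    };
    for (int i = 0; i < 2; i++)
        printf("%-9s: %2d cores x SMT%d = %d hardware threads per chip\n",
               v[i].market, v[i].cores, v[i].smt, v[i].cores * v[i].smt);
    return 0;
}
```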

82 citations


Cited by
Journal ArticleDOI
24 Dec 2015 - Nature
TL;DR: This demonstration could represent the beginning of an era of chip-scale electronic–photonic systems with the potential to transform computing system architectures, enabling more powerful computers, from network infrastructure to data centres and supercomputers.
Abstract: An electronic–photonic microprocessor chip manufactured using a conventional microelectronics foundry process is demonstrated; the chip contains 70 million transistors and 850 photonic components and directly uses light to communicate to other chips. The rapid transfer of data between chips in computer systems and data centres has become one of the bottlenecks in modern information processing. One way of increasing speeds is to use optical connections rather than electrical wires, and the past decade has seen significant efforts to develop silicon-based nanophotonic approaches to integrate such links within silicon chips, but incompatibility between the manufacturing processes used in electronics and photonics has proved a hindrance. Now Chen Sun et al. describe a 'system on a chip' microprocessor that successfully integrates electronics and photonics yet is produced using standard microelectronic chip fabrication techniques. The resulting microprocessor combines 70 million transistors and 850 photonic components and can communicate optically with the outside world. This result promises a way forward for new fast, low-power computing system architectures.

Data transport across short electrical wires is limited by both bandwidth and power density, which creates a performance bottleneck for semiconductor microchips in modern computer systems—from mobile phones to large-scale data centres. These limitations can be overcome [1,2,3] by using optical communications based on chip-scale electronic–photonic systems [4,5,6,7] enabled by silicon-based nanophotonic devices [8]. However, combining electronics and photonics on the same chip has proved challenging, owing to microchip manufacturing conflicts between electronics and photonics. Consequently, current electronic–photonic chips [9,10,11] are limited to niche manufacturing processes and include only a few optical devices alongside simple circuits. Here we report an electronic–photonic system on a single chip integrating over 70 million transistors and 850 photonic components that work together to provide logic, memory, and interconnect functions. This system is a realization of a microprocessor that uses on-chip photonic devices to directly communicate with other chips using light. To integrate electronics and photonics at the scale of a microprocessor chip, we adopt a 'zero-change' approach to the integration of photonics. Instead of developing a custom process to enable the fabrication of photonics [12], which would complicate or eliminate the possibility of integration with state-of-the-art transistors at large scale and at high yield, we design optical devices using a standard microelectronics foundry process that is used for modern microprocessors [13,14,15,16]. This demonstration could represent the beginning of an era of chip-scale electronic–photonic systems with the potential to transform computing system architectures, enabling more powerful computers, from network infrastructure to data centres and supercomputers.

1,058 citations

Journal ArticleDOI
TL;DR: Studies based on the proposed scaling methodology show that in-plane STT-MRAM will outperform SRAM from the 15 nm node, while its perpendicular counterpart requires further innovations in MTJ materials to overcome poor write-performance scaling from the 22 nm node onwards.
Abstract: This paper explores the scalability of in-plane and perpendicular MTJ-based STT-MRAMs from 65 nm to 8 nm while taking realistic variability effects into consideration. We focus on the read and write performance of an STT-MRAM-based cache rather than the obvious advantages such as the denser bit-cell and zero static power. An accurate MTJ macromodel capturing key MTJ properties was adopted for efficient Monte Carlo simulations. For the simulation of access devices and peripheral circuitry, ITRS-projected transistor parameters were utilized and calibrated using the MASTAR tool, which is widely used in industry. 6T SRAM and STT-MRAM arrays were implemented with aggressive assist schemes to mimic industrial memory designs. A constant J_C0·RA/V_DD scaling scenario was used, which to first order gives the optimal balance between the read and write margins of STT-MRAMs. The thermal stability factor ensuring a 10-year retention time was obtained by adjusting the free-layer thickness as well as assuming improvement in the crystalline anisotropy. Our studies based on the proposed scaling methodology show that in-plane STT-MRAM will outperform SRAM from the 15 nm node, while its perpendicular counterpart requires further innovations in MTJ materials in order to overcome the poor write-performance scaling from the 22 nm node onwards.
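The thermal stability factor ensuring a 10-year retention time follows from the standard thermal-activation model t_ret ≈ τ0·exp(Δ), where Δ = E_b/(k_B·T). The quick check below assumes a conventional attempt time of τ0 = 1 ns and a single-bit retention target (an array needs a larger Δ to bound total failures); these are textbook defaults, not values from the paper.

```c
/* Back-of-the-envelope check of the thermal stability factor Delta needed
 * for 10-year retention, using the thermal-activation model
 *   t_ret ~= tau0 * exp(Delta),   Delta = E_b / (k_B * T).
 * Assumptions (not from the paper): tau0 = 1 ns attempt time and a
 * single-bit retention target; array-level yield requires a larger Delta. */
#include <stdio.h>
#include <math.h>

int main(void) {
    const double tau0  = 1e-9;                        /* attempt time, s */
    const double t_ret = 10.0 * 365.25 * 24 * 3600;   /* 10 years, in s */
    printf("Delta >= ln(t_ret / tau0) = %.1f\n", log(t_ret / tau0)); /* ~40 */
    return 0;
}
```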

322 citations

Journal ArticleDOI
04 Jun 2011
TL;DR: Gives an abstract-machine semantics for IBM POWER multiprocessors that abstracts from most of the implementation detail yet explains the behaviour of a range of subtle examples, which should bring new clarity to concurrent systems programming for these architectures.
Abstract: Exploiting today's multiprocessors requires high-performance and correct concurrent systems code (optimising compilers, language runtimes, OS kernels, etc.), which in turn requires a good understanding of the observable processor behaviour that can be relied on. Unfortunately this critical hardware/software interface is not at all clear for several current multiprocessors.

In this paper we characterise the behaviour of IBM POWER multiprocessors, which have a subtle and highly relaxed memory model (ARM multiprocessors have a very similar architecture in this respect). We have conducted extensive experiments on several generations of processors: POWER G5, 5, 6, and 7. Based on these, on published details of the microarchitectures, and on discussions with IBM staff, we give an abstract-machine semantics that abstracts from most of the implementation detail but explains the behaviour of a range of subtle examples. Our semantics is explained in prose but defined in rigorous machine-processed mathematics; we also confirm that it captures the observable processor behaviour, or the architectural intent, for our examples with an executable checker. While not officially sanctioned by the vendor, we believe that this model gives a reasonable basis for reasoning about current POWER multiprocessors.

Our work should bring new clarity to concurrent systems programming for these architectures, and is a necessary precondition for any analysis or verification. It should also inform the design of languages such as C and C++, where the language memory model is constrained by what can be efficiently compiled to such multiprocessors.
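A concrete instance of the "subtle examples" such a semantics must explain is the classic message-passing (MP) litmus test, sketched below with C11 atomics. This is an illustration, not code or notation from the paper: with relaxed atomics, POWER is allowed to reorder the two stores or the two loads so that the reader sees the flag set but stale data; the release/acquire pair, which compilers map to lwsync and a dependency or isync on POWER, forbids that outcome.

```c
/* Message-passing (MP) litmus test in C11 atomics -- an illustration of
 * the kind of example the paper's model explains, not code from the paper.
 * With memory_order_relaxed everywhere, POWER may expose r1==1 && r2==0;
 * the release/acquire pair below forbids that outcome. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int data, flag;

static void *writer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 1, memory_order_relaxed);
    atomic_store_explicit(&flag, 1, memory_order_release); /* lwsync; st */
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    int r1 = atomic_load_explicit(&flag, memory_order_acquire);
    int r2 = atomic_load_explicit(&data, memory_order_relaxed);
    printf("r1=%d r2=%d\n", r1, r2); /* r1==1 && r2==0 is now forbidden */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, NULL);
    pthread_create(&t1, NULL, reader, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```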

249 citations

Proceedings ArticleDOI
04 Dec 2010
TL;DR: Analyzes the relative merits of conventional cores and U-cores in the face of technology constraints; the model's predictive power depends on U-core-specific parameters derived by measuring the performance and power of tuned applications on today's state-of-the-art multicores, GPUs, FPGAs, and ASICs.
Abstract: To extend the exponential performance scaling of future chip multiprocessors, improving energy efficiency has become a first-class priority. Single-chip heterogeneous computing has the potential to achieve greater energy efficiency by combining traditional processors with unconventional cores (U-cores) such as custom logic, FPGAs, or GPGPUs. Although U-cores are effective at increasing performance, their benefits can also diminish given the scarcity of projected bandwidth in the future. To understand the relative merits of different approaches in the face of technology constraints, this work builds on prior modeling of heterogeneous multicores to support U-cores. Unlike prior models that trade performance, power, and area using well-known relationships between simple and complex processors, our model must consider the less-obvious relationships between conventional processors and a diverse set of U-cores. Further, our model supports speculation about future designs from scaling trends predicted by the ITRS road map. The predictive power of our model depends upon U-core-specific parameters derived by measuring the performance and power of tuned applications on today's state-of-the-art multicores, GPUs, FPGAs, and ASICs. Our results reinforce some current-day understandings of the potential and limitations of U-cores and also provide new insights on their relative merits.
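As a rough illustration of the kind of trade-off such models capture, here is a minimal Amdahl-style sketch in the spirit of, but far simpler than, the paper's model. The parameters f (fraction of work a U-core can accelerate) and s_u (the U-core's speedup on that work) are hypothetical values chosen for illustration.

```c
/* Amdahl's-law-style model of a chip pairing a conventional core with a
 * U-core -- in the spirit of, but far simpler than, the paper's model.
 * f and s_u are hypothetical illustration values, not from the paper. */
#include <stdio.h>

static double speedup(double f, double s_u) {
    /* Unaccelerated fraction runs at 1x; accelerated fraction at s_u. */
    return 1.0 / ((1.0 - f) + f / s_u);
}

int main(void) {
    const double s_u = 50.0;   /* e.g., custom logic on its target kernel */
    for (int i = 5; i <= 9; i++) {
        double f = i / 10.0;   /* fraction of work the U-core accelerates */
        printf("f = %.1f -> overall speedup = %5.2fx\n", f, speedup(f, s_u));
    }
    return 0;
}
```

Even with a 50x U-core, the overall speedup is capped by the unaccelerated fraction, which is one reason the bandwidth and technology constraints the abstract mentions matter so much.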

249 citations

Proceedings ArticleDOI
04 Nov 2013
TL;DR: PHANTOM is the first demonstration of a practical oblivious processor; it is efficient in both area and performance and can provide strong confidentiality guarantees when offloading computation to the cloud.
Abstract: We introduce PHANTOM [1], a new secure processor that obfuscates its memory access trace. To an adversary who can observe the processor's output pins, all memory access traces are computationally indistinguishable (a property known as obliviousness). We achieve obliviousness through a cryptographic construct known as Oblivious RAM, or ORAM. We first improve an existing ORAM algorithm and construct an empirical model for its trusted storage requirement. We then present PHANTOM, an oblivious processor whose novel memory controller aggressively exploits DRAM bank parallelism to reduce ORAM access latency and scales well to a large number of memory channels. Finally, we build a complete hardware implementation of PHANTOM on a commercially available FPGA-based server, and through detailed experiments show that PHANTOM is efficient in both area and performance. Accessing 4KB of data from a 1GB ORAM takes 26.2 µs (13.5 µs for the data to be available), a 32x slowdown over accessing 4KB from regular memory, while SQLite queries on a population database see a 1.2-6x slowdown. PHANTOM is the first demonstration of a practical oblivious processor and can provide strong confidentiality guarantees when offloading computation to the cloud.
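For readers unfamiliar with the underlying construct, the sketch below outlines a single Path ORAM access of the kind PHANTOM's memory controller performs in hardware. The tree layout, bucket size, and stash handling are simplified, hypothetical choices, and the write-back/eviction step is only indicated in a comment; none of this is code from the paper.

```c
/* Single Path ORAM access, heavily simplified. Real designs (and PHANTOM)
 * keep the stash across accesses, evict on write-back, and encrypt every
 * bucket; parameters here are hypothetical illustration values. */
#include <stdlib.h>
#include <string.h>

#define L  10                /* tree height: 2^L leaves */
#define Z   4                /* blocks per bucket */
#define N  (1 << L)          /* number of logical blocks (illustrative) */

typedef struct { int id; char data[64]; } Block;  /* id < 0 => dummy */

static Block tree[(2 << L) - 1][Z];  /* complete binary tree of buckets */
static int   pos[N];                 /* position map: block id -> leaf */
static Block stash[(L + 1) * Z];     /* trusted on-chip stash */
static int   stash_n;

/* Read every bucket on the path from `leaf` up to the root into the stash. */
static void read_path(int leaf) {
    int node = leaf + N - 1;         /* leaf's index in the heap layout */
    for (int lvl = 0; lvl <= L; lvl++, node = (node - 1) / 2)
        for (int z = 0; z < Z; z++)
            if (tree[node][z].id >= 0)
                stash[stash_n++] = tree[node][z];
}

/* One oblivious access: path read, remap to a fresh random leaf, copy out. */
static void oram_access(int id, char out[64]) {
    stash_n = 0;                     /* sketch only: a real stash persists */
    int leaf = pos[id];
    pos[id] = rand() % N;            /* remap before the block is reused */
    read_path(leaf);
    for (int i = 0; i < stash_n; i++)
        if (stash[i].id == id)
            memcpy(out, stash[i].data, sizeof stash[i].data);
    /* write_path(leaf): evict stash blocks back along the same path,
     * each as deep as its new leaf allows -- omitted in this sketch. */
}

int main(void) {
    memset(tree, 0xFF, sizeof tree); /* mark every slot dummy (id == -1) */
    for (int i = 0; i < N; i++) pos[i] = rand() % N;
    char buf[64] = {0};
    oram_access(7, buf);             /* a miss on the still-empty tree */
    return 0;
}
```

Every access touches (L+1)·Z blocks plus re-encryption work regardless of what is actually read, which is the kind of fixed overhead behind the 32x slowdown reported above.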

249 citations