scispace - formally typeset
Search or ask a question
Author

Vijaykrishnan Narayanan

Bio: Vijaykrishnan Narayanan is an academic researcher from Pennsylvania State University. The author has contributed to research in topics: Speedup & Neuromorphic engineering. The author has an hindex of 27, co-authored 97 publications receiving 4157 citations. Previous affiliations of Vijaykrishnan Narayanan include Foundation University, Islamabad & Arizona State University.


Papers
More filters
Journal ArticleDOI
TL;DR: The other source of power dissipation in microprocessors, dynamic power, arises from the repeated capacitance charge and discharge on the output of the hundreds of millions of gates in today's chips.
Abstract: Off-state leakage is static power, current that leaks through transistors even when they are turned off. The other source of power dissipation in today's microprocessors, dynamic power, arises from the repeated capacitance charge and discharge on the output of the hundreds of millions of gates in today's chips. Until recently, only dynamic power has been a significant source of power consumption, and Moore's law helped control it. However, power consumption has now become a primary microprocessor design constraint; one that researchers in both industry and academia will struggle to overcome in the next few years. Microprocessor design has traditionally focused on dynamic power consumption as a limiting factor in system integration. As feature sizes shrink below 0.1 micron, static power is posing new low-power design challenges.

1,233 citations

Journal ArticleDOI
01 May 2006
TL;DR: A router architecture and a topology design that makes use of a network architecture embedded into the L2 cache memory are proposed that demonstrate that a 3D L2 memory architecture generates much better results than the conventional two-dimensional designs under different number of layers and vertical connections.
Abstract: Long interconnects are becoming an increasingly important problem from both power and performance perspectives. This motivates designers to adopt on-chip network-based communication infrastructures and three-dimensional (3D) designs where multiple device layers are stacked together. Considering the current trends towards increasing use of chip multiprocessing, it is timely to consider 3D chip multiprocessor design and memory networking issues, especially in the context of data management in large L2 caches. The overall goal of this paper is to study the challenges for L2 design and management in 3D chip multiprocessors. Our first contribution is to propose a router architecture and a topology design that makes use of a network architecture embedded into the L2 cache memory. Our second contribution is to demonstrate, through extensive experiments, that a 3D L2 memory architecture generates much better results than the conventional two-dimensional (2D) designs under different number of layers and vertical (inter-wafer) connections. In particular, our experiments show that a 3D architecture with no dynamic data migration generates better performance than a 2D architecture that employs data migration. This also helps reduce power consumption in L2 due to a reduced number of data movements.

397 citations

Proceedings ArticleDOI
09 Jun 2007
TL;DR: A novel partially-connected 3D crossbar structure, called the 3D Dimensionally-Decomposed (DimDe) Router, is proposed, which provides a good tradeoff between circuit complexity and performance benefits.
Abstract: Much like multi-storey buildings in densely packed metropolises, three-dimensional (3D) chip structures are envisioned as a viable solution to skyrocketing transistor densities and burgeoning die sizes in multi-core architectures. Partitioning a larger die into smaller segments and then stacking them in a 3D fashion can significantly reduce latency and energy consumption. Such benefits emanate from the notion that inter-wafer distances are negligible compared to intra-wafer distances. This attribute substantially reduces global wiring length in 3D chips. The work in this paper integrates the increasingly popular idea of packet-based Networks-on-Chip (NoC) into a 3D setting. While NoCs have been studied extensively in the 2D realm, the microarchitectural ramifications of moving into the third dimension have yet to be fully explored. This paper presents a detailed exploration of inter-strata communication architectures in 3D NoCs. Three design options are investigated; a simple bus-based inter-wafer connection, a hop-by-hop standard 3D design, and a full 3D crossbar implementation. In this context, we propose a novel partially-connected 3D crossbar structure, called the 3D Dimensionally-Decomposed (DimDe) Router, which provides a good tradeoff between circuit complexity and performance benefits. Simulation results using (a) a stand-alone cycle-accurate 3D NoC simulator running synthetic workloads, and (b) a hybrid 3D NoC/cache simulation environment running real commercial and scientific benchmarks, indicate that the proposed DimDe design provides latency and throughput improvements of over 20% on average over the other 3D architectures, while remaining within 5% of the full 3D crossbar performance. Furthermore, based on synthesized hardware implementations in 90 nm technology, the DimDe architecture outperforms all other designs -- including the full 3D crossbar -- by an average of 26% in terms of the Energy-Delay Product (EDP).

285 citations

Proceedings ArticleDOI
03 Jun 2012
TL;DR: This work forms the relationship between retention-time and write-latency, and finds optimal retention- time for architecting an efficient cache hierarchy using STT-RAM to overcome high write latency and energy problems.
Abstract: High density, low leakage and non-volatility are the attractive features of Spin-Transfer-Torque-RAM (STT-RAM), which has made it a strong competitor against SRAM as a universal memory replacement in multi-core systems. However, STT-RAM suffers from high write latency and energy which has impeded its widespread adoption. To this end, we look at trading-off STT-RAM's non-volatility property (data-retention-time) to overcome these problems. We formulate the relationship between retention-time and write-latency, and find optimal retention-time for architecting an efficient cache hierarchy using STT-RAM. Our results show that, compared to SRAM-based design, our proposal can improve performance and energy consumption by 18% and 60%, respectively.

261 citations

Journal ArticleDOI
01 May 2006
TL;DR: This work proposes a novel fine-grained modular router architecture that employs decoupled parallel arbiters and uses smaller crossbars for row and column connections to reduce output port contention probabilities as compared to existing designs.
Abstract: Packet-based on-chip networks are increasingly being adopted in complex System-on-Chip (SoC) designs supporting numerous homogeneous and heterogeneous functional blocks. These Network-on-Chip (NoC) architectures are required to not only provide ultra-low latency, but also occupy a small footprint and consume as little energy as possible. Further, reliability is rapidly becoming a major challenge in deep sub-micron technologies due to the increased prominence of permanent faults resulting from accelerated aging effects and manufacturing/testing challenges. Towards the goal of designing low-latency, energy efficient and reliable on-chip communication networks, we propose a novel fine-grained modular router architecture. The proposed architecture employs decoupled parallel arbiters and uses smaller crossbars for row and column connections to reduce output port contention probabilities as compared to existing designs. Furthermore, the router employs a new switch allocation technique known as "Mirroring Effect" to reduce arbitration depth and increase concurrency. In addition, the modular design permits graceful degradation of the network in the event of permanent faults and also helps to reduce the dynamic power consumption. Our simulation results indicate that in an 8 × 8 mesh network, the proposed architecture reduces packet latency by 4-40% and power consumption by 6-20% as compared to two existing router architectures. Evaluation using a combined performance, energy and fault-tolerance metric indicates that the proposed architecture provides 35-50% overall improvement compared to the two earlier routers.

215 citations


Cited by
More filters
Journal ArticleDOI
17 Nov 2011-Nature
TL;DR: Tunnels based on ultrathin semiconducting films or nanowires could achieve a 100-fold power reduction over complementary metal–oxide–semiconductor transistors, so integrating tunnel FETs with CMOS technology could improve low-power integrated circuits.
Abstract: Power dissipation is a fundamental problem for nanoelectronic circuits. Scaling the supply voltage reduces the energy needed for switching, but the field-effect transistors (FETs) in today's integrated circuits require at least 60 mV of gate voltage to increase the current by one order of magnitude at room temperature. Tunnel FETs avoid this limit by using quantum-mechanical band-to-band tunnelling, rather than thermal injection, to inject charge carriers into the device channel. Tunnel FETs based on ultrathin semiconducting films or nanowires could achieve a 100-fold power reduction over complementary metal-oxide-semiconductor (CMOS) transistors, so integrating tunnel FETs with CMOS technology could improve low-power integrated circuits.

2,390 citations

Proceedings ArticleDOI
24 Feb 2014
TL;DR: This study designs an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy, and shows that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s in a small footprint.
Abstract: Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.

1,582 citations

Proceedings ArticleDOI
01 Dec 2007
TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).
Abstract: A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor perfor- mance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performance/area characteristics of large caches, such as wire mod- els (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model Non-uniform Cache Access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, we also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and present a case study to showcase how the tool can improve architecture research methodologies. Keywords: cache models, non-uniform cache archi- tectures (NUCA), memory hierarchies, on-chip intercon- nects.

778 citations

Journal ArticleDOI
01 Oct 2015-Nature
TL;DR: This paper demonstrates band-to-band tunnel field-effect transistors (tunnel-FETs), based on a two-dimensional semiconductor, that exhibit steep turn-on and is the only planar architecture tunnel-fET to achieve subthermionic subthreshold swing over four decades of drain current, and is also the only tunnel- FET (in any architecture) to achieve this at a low power-supply voltage of 0.1 volts.
Abstract: A new type of device, the band-to-band tunnel transistor, which has atomically thin molybdenum disulfide as the active channel, operates in a fundamentally different way from a conventional silicon (MOSFET) transistor; it has turn-on characteristics and low-power operation that are better than those of state-of-the-art MOSFETs or any tunnelling transistor reported so far. Traditional transistor technology is fast approaching its fundamental limits, and two-dimensional semiconducting materials such as molybdenum disulfide (MoS2) are seen as possible replacements for silicon in a next generation of high-density, lower-power chip electronics. A particularly promising prospect is their potential in band-to-band tunnel transistors, which operate in a fundamentally different way from conventional silicon (MOSFET) transistors. So far, few such devices with overall characteristics better than silicon transistors have been demonstrated. Now Kaustav Banerjee et al. have built a tunnel transistor by making a vertical structure with atomically thin MoS2 as the active channel and germanium as the source electrode. It has turn-on characteristics and low-power operation that are better than those of existing silicon transistors, and the results will be of interest in a range of electronic applications including low-power integrated circuits, as well as ultra-sensitive bio sensors or gas sensors. The fast growth of information technology has been sustained by continuous scaling down of the silicon-based metal–oxide field-effect transistor. However, such technology faces two major challenges to further scaling. First, the device electrostatics (the ability of the transistor’s gate electrode to control its channel potential) are degraded when the channel length is decreased, using conventional bulk materials such as silicon as the channel. Recently, two-dimensional semiconducting materials1,2,3,4,5,6,7 have emerged as promising candidates to replace silicon, as they can maintain excellent device electrostatics even at much reduced channel lengths. The second, more severe, challenge is that the supply voltage can no longer be scaled down by the same factor as the transistor dimensions because of the fundamental thermionic limitation of the steepness of turn-on characteristics, or subthreshold swing8,9. To enable scaling to continue without a power penalty, a different transistor mechanism is required to obtain subthermionic subthreshold swing, such as band-to-band tunnelling10,11,12,13,14,15,16. Here we demonstrate band-to-band tunnel field-effect transistors (tunnel-FETs), based on a two-dimensional semiconductor, that exhibit steep turn-on; subthreshold swing is a minimum of 3.9 millivolts per decade and an average of 31.1 millivolts per decade for four decades of drain current at room temperature. By using highly doped germanium as the source and atomically thin molybdenum disulfide as the channel, a vertical heterostructure is built with excellent electrostatics, a strain-free heterointerface, a low tunnelling barrier, and a large tunnelling area. Our atomically thin and layered semiconducting-channel tunnel-FET (ATLAS-TFET) is the only planar architecture tunnel-FET to achieve subthermionic subthreshold swing over four decades of drain current, as recommended in ref. 17, and is also the only tunnel-FET (in any architecture) to achieve this at a low power-supply voltage of 0.1 volts. Our device is at present the thinnest-channel subthermionic transistor, and has the potential to open up new avenues for ultra-dense and low-power integrated circuits, as well as for ultra-sensitive biosensors and gas sensors18,19,20,21.

774 citations