Showing papers in "ACM Journal on Emerging Technologies in Computing Systems in 2011"


Journal ArticleDOI
TL;DR: It is shown that significant improvements in waveguide propagation and waveguide crossing insertion losses resulting from using other CMOS-compatible silicon materials enables the realization of topologies that were previously not feasible using only the single-layer crystalline silicon approaches.
Abstract: Integrated photonics has been slated as a revolutionary technology with the potential to mitigate the many challenges associated with on- and off-chip electrical interconnection networks. To date, all proposed chip-scale photonic interconnects have been based on the crystalline silicon platform for CMOS-compatible fabrication. However, maintaining CMOS compatibility does not preclude the use of other CMOS-compatible silicon materials such as silicon nitride and polycrystalline silicon. In this work, we investigate utilizing devices based on these deposited materials to design photonic networks with multiple layers of photonic devices. We apply rigorous device optimization and insertion loss analysis on various network architectures, demonstrating that multilayer photonic networks can exhibit dramatically lower total insertion loss, enabling unprecedented bandwidth scalability. We show that significant improvements in waveguide propagation and waveguide crossing insertion losses resulting from using these materials enables the realization of topologies that were previously not feasible using only the single-layer crystalline silicon approaches.

131 citations


Journal ArticleDOI
TL;DR: In this paper, a survey of photonic technologies amenable to large-scale CMOS integration is presented from the perspective of high-performance interconnects operating over distance scales of 1mm to 100m.
Abstract: Moore's Law has set great expectations that the performance of information technology will improve exponentially until at least the end of this decade. Although the physics of silicon transistors alone might allow these expectations to be met, the physics of the long metal wires that cross and connect packages almost certainly will not. Global-level interconnects incorporating large-scale integrated photonics fabricated on the same platform as silicon microelectronics hold the promise of revolutionizing computing by enabling parallel many-core and network switch architectures that combine unprecedented performance and ease of use with affordable power consumption. Over the last decade, remarkable progress has been made in research on low-power silicon photonic devices for interconnect applications, and CMOS-compatible fabrication technologies promise a “Moore's Law for photonics” that could completely change the economics of integrated optics. In this survey, photonic technologies amenable to large-scale CMOS integration are reviewed from the perspective of high-performance interconnects operating over distance scales of 1mm to 100m. An overview of the requirements placed on integrated optical devices by a variety of modern computer applications leads to discussions of active and passive photonic components designed to generate, guide, filter, modulate, and detect light in the telecommunication bands. Critical challenges and prospects for large-scale integration are evaluated with an emphasis on silicon-on-insulator as a platform for photonics.

89 citations


Journal ArticleDOI
TL;DR: This article describes the successful design and implementation of SpiNNaker, a GALS multicore system-on-chip, using a hierarchical methodology to deal with the asynchronous sections of the system, encapsulating and validating timing assumptions at each level.
Abstract: The design and implementation of globally asynchronous locally synchronous systems-on-chip is a challenging activity. The large size and complexity of the systems require the use of computer-aided design (CAD) tools but, unfortunately, most tools do not work adequately with asynchronous circuits. This article describes the successful design and implementation of SpiNNaker, a GALS multicore system-on-chip. The process was completed using commercial CAD tools from synthesis to layout. A hierarchical methodology was devised to deal with the asynchronous sections of the system, encapsulating and validating timing assumptions at each level. The crossbar topology combined with a pipelined asynchronous fabric implementation allows the on-chip network to meet the stringent requirements of the system. The implementation methodology constrains the design in a way that allows the tools to complete their tasks successfully. A first test chip, with reduced resources and complexity was taped-out using the proposed methodology. Test chips were received in December 2009 and were fully functional. The methodology had to be modified to cope with the increased complexity of the SpiNNaker SoC. SpiNNaker chips were delivered in May 2011 and were also fully operational, and the interconnect requirements were met.

51 citations


Journal ArticleDOI
TL;DR: Iris, a CMOS-compatible high-performance low-power nanophotonic on-chip network, is introduced and offers an on-chip communication backplane that is power efficient while demonstrating low latency and high throughput.
Abstract: On-chip communication, including short, often-multicast, latency-critical coherence and synchronization messages, and long, unicast, throughput-sensitive data transfers, limits the power efficiency and performance scalability of many-core chip-multiprocessor systems. This article analyzes on-chip communication challenges and studies the characteristics of existing electrical and emerging nanophotonic interconnect. Iris, a CMOS-compatible high-performance low-power nanophotonic on-chip network, is thus introduced. Iris's circuit-switched subnetwork supports throughput-sensitive data transfer. Iris's optical-antenna-array-based broadcast--multicast subnetwork optimizes latency-critical traffic and supports the path setup of circuit-switched communication. Overall, the proposed nanophotonic network design offers an on-chip communication backplane that is power efficient while demonstrating low latency and high throughput.

43 citations


Journal ArticleDOI
TL;DR: This result, the first application of graph embedding to quantum circuits and devices, provides a new tool for compiler development, emphasizes the impact of quantum computer architecture on performance, and acts as a cautionary note when evaluating the time performance of quantum algorithms.
Abstract: We investigate the theoretical limits of the effect of the quantum interaction distance on the speed of exact quantum addition circuits. For this study, we exploit graph embedding for quantum circuit analysis. We study a logical mapping of qubits and gates of any Ω(log n)-depth quantum adder circuit for two n-qubit registers onto a practical architecture, which limits interaction distance to the nearest neighbors only and supports only one- and two-qubit logical gates. Unfortunately, on the chosen k-dimensional practical architecture, we prove that the depth lower bound of any exact quantum addition circuits is no longer Ω(log n), but Ω(n^(1/k)). This result, the first application of graph embedding to quantum circuits and devices, provides a new tool for compiler development, emphasizes the impact of quantum computer architecture on performance, and acts as a cautionary note when evaluating the time performance of quantum algorithms.

38 citations


Journal ArticleDOI
TL;DR: This article analyzes and compares nanomagnetic circuits based on full NCL, mixed Boolean-NCL, and fully Boolean logic, and discusses the advantages of these logics, but also the issues they raise.
Abstract: In the years to come new solutions will be required to overcome the limitations of scaled CMOS technology. One approach is to adopt Nano-Magnetic Logic Circuits, highly appealing for their extremely reduced power consumption. Despite the interesting nature of this approach, many problems arise when this technology is considered for real designs. The wire is the most critical of these problems from the circuit implementation point of view. It works as a pipelined interconnection, and its delay in terms of clock cycles depends on its length. Serious complications arise at the design phase, both in terms of synthesis and of physical design. One possible solution is the use of a delay insensitive asynchronous logic, Null Convention Logic (NCL™). Nevertheless its use has many negative consequences in terms of area occupation and speed loss with respect to a Boolean version. In this article we analyze and compare different solutions: nanomagnetic circuits based on full NCL, mixed Boolean-NCL, and fully Boolean logic. We discuss the advantages of these logics, but also the issues they raise. In particular we analyze feedback signals, which, due to their intrinsic pipelined nature, cause errors that still have not found a solution in the literature. The innovative arrangement we propose solves most of the problems and thus soundly increases the knowledge of this technology. The analysis is performed using a VHDL behavioral model we developed and a microprocessor we designed based on this model, as a sound and realistic test bench.

30 citations


Journal ArticleDOI
TL;DR: The experimental results show that 6M-RRNS realizes a competitive error correction capability, provides larger data storage capacity, and offers higher decoding performance as compared to C-RRNS and Reed-Solomon (RS) codes.
Abstract: Hybrid memories are envisioned as one of the alternatives to existing semiconductor memories. Although offering enormous data storage capacity, low power consumption, and reduced fabrication complexity (at least for the memory cell array), such memories are subject to a high degree of intermittent and transient faults leading to reliability issues. This article examines the use of Conventional Redundant Residue Number System (C-RRNS) error correction code, which has been extensively used in digital signal processing and communication, to detect and correct intermittent and transient cluster faults in hybrid memories. It introduces a modified version of C-RRNS, referred to as 6M-RRNS, to realize the aims at lower area overhead and performance penalty. The experimental results show that 6M-RRNS realizes a competitive error correction capability, provides larger data storage capacity, and offers higher decoding performance as compared to C-RRNS and Reed-Solomon (RS) codes. For instance, for 64-bit hybrid memories at 10% fault rate, 6M-RRNS has 98.95% error correction capability, which is 0.35% better than RS and 0.40% less than C-RRNS. Moreover, when considering 1Tbit memory, 6M-RRNS offers 4.35% more data storage capacity than RS and 11.41% more than C-RRNS. Additionally, it decodes up to 5.25 times faster than C-RRNS.
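
As a rough illustration of the redundant-residue idea mentioned in the abstract above, the following Python sketch stores a value as its residues modulo several pairwise-coprime moduli and flags a corrupted residue. The moduli (7, 9, 11, 13) and all function names are assumptions chosen for illustration; they are not the C-RRNS or 6M-RRNS parameters from the article, and the sketch only detects a single corrupted residue rather than correcting it.

```python
from math import prod

# Hypothetical moduli for illustration only (not the article's parameters).
INFO_MODULI = (7, 9)            # legitimate values lie in 0 .. 7*9 - 1
REDUNDANT_MODULI = (11, 13)     # extra residues used only for checking
ALL_MODULI = INFO_MODULI + REDUNDANT_MODULI
LEGIT_RANGE = prod(INFO_MODULI)


def encode(value):
    """Return the residues of value modulo every modulus."""
    if not 0 <= value < LEGIT_RANGE:
        raise ValueError("value outside the legitimate range")
    return [value % m for m in ALL_MODULI]


def crt(residues, moduli):
    """Reconstruct the unique integer below prod(moduli) with these residues."""
    total = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        partial = total // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        x += r * partial * pow(partial, -1, m)
    return x % total


def has_error(residues):
    """A decoded value outside the legitimate range signals corruption."""
    return crt(residues, ALL_MODULI) >= LEGIT_RANGE


word = encode(42)
corrupted = list(word)
corrupted[2] = (corrupted[2] + 1) % ALL_MODULI[2]   # flip one stored residue
print(word, has_error(word), has_error(corrupted))
```

With these particular moduli, any change to a single residue pushes the reconstructed value out of the legitimate range, which is why the check works; a correcting decoder would additionally need to locate the faulty residue.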

30 citations


Journal ArticleDOI
TL;DR: This article quantitatively considers the performance of nanomagnetic logic circuits within the context of realistic drive circuitry and demonstrates how one of the five fundamental tenets of digital logic---preventing unwanted feedback---can be satisfied by realistic drive circuitry.
Abstract: This article quantitatively considers the performance of nanomagnetic logic circuits within the context of realistic drive circuitry. We also demonstrate how one of the five fundamental tenets of digital logic---preventing unwanted feedback---can be satisfied by realistic drive circuitry. More specifically, different types of multiphase clocks are investigated and compared. Initial projections suggest that even with drive circuitry overhead, nanomagnet logic can outperform subthreshold CMOS in terms of energy delay product---and paths to lower power exist.

24 citations


Journal ArticleDOI
TL;DR: This article shows that the problem becomes a Bipartite SubGraph Isomorphism (BSGI) problem within sub-nanocrossbars free of stuck-closed faults, and proposes a heuristic KNS-2DS, an iterative rough canonizer with approximately O(N^2) complexity followed by an O(N^3) matching algorithm.
Abstract: Nanocrossbars (i.e., nanowire crossbars) offer extreme logic densities but come with very high defect rates; stuck-open/closed, broken nanowires. Achieving reasonable yield and utilization requires logic mapping that is defect-aware even at the crosspoint level. Such logic mapping works with a defect map per each manufactured chip. The problem can be expressed as matching of two bipartite graphs; one for the logic to be implemented and other for the nanocrossbar. This article shows that the problem becomes a Bipartite SubGraph Isomorphism (BSGI) problem within sub-nanocrossbars free of stuck-closed faults. Our heuristic KNS-2DS is an iterative rough canonizer with approximately O(N^2) complexity followed by an O(N^3) matching algorithm. Canonization brings a partial or full order to graph nodes. It is normally used for solving the regular Graph Isomorphism (GI) problem, while we apply it to BSGI. KNS stands for K-Neighbor Sort and is used for initializing our main contribution 2-Dimensional-Sort (2DS). 2DS operates on the adjacency matrix of a bipartite graph. Radix-2 2DS solves the problem in the absence of stuck-closed faults. With the addition of Radix-3 and our novel Radix-2.5 sort, we solve problems that also have stuck-closed faults. We offer very short runtimes (due to canonization) compared to previous work and have success on all benchmarks. KNS-2DS is also novel from the perspective of BSGI problem as it is based on canonization but not on a search tree with backtracking.

18 citations


Journal ArticleDOI
TL;DR: A completion detection scheme based on wide NOR gates, which results in significant latency and energy savings especially as the number of outputs increases, and three separate pipeline implementations of an 8x8-bit Booth-encoded array multiplier are presented.
Abstract: We present two novel energy-efficient pipeline templates for high throughput asynchronous circuits. The proposed templates, called N-P and N-Inverter pipelines, use a single-track handshake protocol. There are multiple stages of logic within each pipeline. The proposed techniques minimize handshake overheads associated with input tokens and intermediate logic nodes within a pipeline template. Each template can pack a significant amount of logic in a single stage, while still maintaining a fast cycle time of only 18 transitions. Noise and timing robustness constraints of our pipelined circuits are quantified across all process corners. We present a completion detection scheme based on wide NOR gates, which results in significant latency and energy savings especially as the number of outputs increases. To fully quantify all design trade-offs, three separate pipeline implementations of an 8x8-bit Booth-encoded array multiplier are presented. Compared to a standard QDI pipeline implementation, the N-Inverter and N-P pipeline implementations reduced the energy-delay product by 38.5% and 44% respectively. The overall multiplier latency was reduced by 20.2% and 18.7%, while the total transistor width was reduced by 35.6% and 46% with N-Inverter and N-P pipeline templates respectively.

17 citations


Journal ArticleDOI
TL;DR: A novel countermeasure against DPA attacks on smart cards and other digital ICs based on FinFETs, an emerging substitute for bulk CMOS at the 22nm technology node and beyond and can be applied to other encryption algorithms as well.
Abstract: Differential power analysis (DPA) is a side-channel attack that statistically analyzes the power consumption of a cryptographic system to obtain secret information. This type of attack is well known as a major threat to information security. Effective solutions with low energy and area cost for improved DPA resistance are urgently needed, especially for energy-constrained modern devices that are often in the physical proximity of attackers. This article presents a novel countermeasure against DPA attacks on smart cards and other digital ICs based on FinFETs, an emerging substitute for bulk CMOS at the 22nm technology node and beyond. We exploit the adaptive power management characteristic of FinFETs to generate a high level of noise at critical moments in the execution of a cryptosystem to thwart DPA attacks. We demonstrate the effectiveness of the proposed countermeasure by developing a simple power model for estimating DPA spikes. We then validate the model by carrying out DPA attacks on an ASIC implementation of the advanced encryption standard system via gate-level simulation. Both modeling and simulation-based experiment indicate that with the proposed countermeasure, even 8,000,000 power acquisitions are not sufficient to reveal the secret key. As opposed to other countermeasures presented in the literature, the proposed hardware design requires less than 1% increase in area and 15% increase in total energy consumption without any extra delay in the critical path. The proposed method is generic and can be applied to other encryption algorithms as well.

Journal ArticleDOI
TL;DR: A novel power delivery method which employs a capacitor bank for adaptively storing the energy from power harvesters depending on load and source conditions, is developed and its advantages, especially when driving asynchronous loads, are demonstrated through comprehensive comparative analysis.
Abstract: For systems depending on power harvesting, a fundamental contradiction in the power delivery chain has existed between conventional synchronous computational loads requiring relatively stable Vdd and power harvesters unable to supply it. DC/DC conversion has therefore been an integral part of such systems to resolve this contradiction. On the other hand, asynchronous computational loads, in addition to their potential power-saving capabilities, can be made tolerant to a much wider range of Vdd variance. This may open up opportunities for much more energy efficient methods of power delivery. This article presents in-depth investigations into the behavior and performance of different on-chip power delivery methods driving both asynchronous and synchronous loads directly from a harvester source. A novel power delivery method, which employs a capacitor bank for adaptively storing the energy from power harvesters depending on load and source conditions, is developed. Its advantages, especially when driving asynchronous loads, are demonstrated through comprehensive comparative analysis.

Journal ArticleDOI
TL;DR: The architecture taps the flexibility provided by the clocking system of molecular QCA to build a simple tile-based programmable device with the 3-input Majority gate as the fundamental logic element.
Abstract: Quantum-dot cellular automata is an interesting computation fabric with many never-seen-before properties. However, no programmable fabric scheme has utilized all these properties effectively. We propose an architecture for a programmable device using molecular QCA which exploits all the specialities of the fabric. The architecture taps the flexibility provided by the clocking system of molecular QCA to build a simple tile-based programmable device with the 3-input Majority gate as the fundamental logic element. Observing how a QCA structure can behave as either an interconnect or a logic gate depending on clocking, the proposed architecture merges routing and logic elements, thus drastically changing how programmable fabrics have been designed.
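
To make the role of the 3-input Majority gate more concrete, here is a minimal Python sketch of the well-known property that fixing one majority input to a constant yields an AND or an OR gate. The function names are hypothetical and the sketch models only the Boolean behavior; it says nothing about the clocking or tile structure of the architecture proposed in the article.

```python
from itertools import product


def majority(a, b, c):
    """Return 1 when at least two of the three binary inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def and_gate(a, b):
    """Majority with the third input tied to 0 behaves as AND."""
    return majority(a, b, 0)


def or_gate(a, b):
    """Majority with the third input tied to 1 behaves as OR."""
    return majority(a, b, 1)


# Print the truth table for both derived gates.
for a, b in product((0, 1), repeat=2):
    print(a, b, and_gate(a, b), or_gate(a, b))
```

Together with inversion, these two derived gates are enough to express any Boolean function, which is one reason a majority gate can serve as a fundamental logic element.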

Journal ArticleDOI
TL;DR: This article presents a hybrid electrical/optical router for future large scale, cache coherent multicore microprocessors that achieves 2X better network performance than a state-of-the-art electrical baseline in a mesh topology while consuming 30% less network power.
Abstract: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 16nm timeframe, on-chip optical interconnect architectures are typically limited in scalability or are dependent on comparatively slow electrical control networks. In this article, we present a hybrid electrical/optical router for future large scale, cache coherent multicore microprocessors. The heart of the router is a low-latency optical crossbar that uses predecoded source routing and switch state preconfiguration to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. Overall, our optical router achieves 2X better network performance than a state-of-the-art electrical baseline in a mesh topology while consuming 30% less network power.

Journal ArticleDOI
TL;DR: How elasticity can be effectively and practically used to derive pipelined circuits by using correct-by-construction transformations that can be fully automated is revealed.
Abstract: Elasticity is a paradigm that tolerates the variations in computation and communication delays. By applying elastic transformations that allow varying the original timing, circuits can be optimized beyond the conventional rigid transformations that do not modify the external timing. Pipelining is one of the classical techniques to improve the throughput of a circuit. This article reveals how elasticity can be effectively and practically used to derive pipelined circuits by using correct-by-construction transformations that can be fully automated. Two designs, one of them industrial, are used to demonstrate how the area-performance trade-off can be explored using elasticity.

Journal ArticleDOI
TL;DR: This article describes a novel computing architecture organization based on nanoscale logic cells that could improve the scalability of traditional FPGAs by a factor of 8.5 and proposes a method to map functions onto such architectures.
Abstract: This article describes a novel computing architecture organization based on nanoscale logic cells. We propose the use of a cluster of matrix arrangements of cells. In order to interconnect such fine-grained logic cells within a matrix, conventional techniques are not suitable due to a large interconnect overhead. Therefore, we propose the use of static and incomplete interconnect topologies to create matrices of cells. We also propose a method to map functions onto such architectures. We then explore the main parameters of the structure (size of matrices and interconnect topologies) and their impact on the main performance metrics (packing efficiency, speed, and fault tolerance). A cluster packing method also allows the evaluation of the number of matrices used by complex functions and the fill factor for various matrix sizes. The analyses show that this approach is particularly suited for matrices of 16 cells interconnected by modified omega networks. We can conclude that this architecture could improve the scalability of traditional FPGAs by a factor of 8.5.

Journal ArticleDOI
TL;DR: This work rewrites Udding’s rules, which characterize communications between DI components, and introduces relativistic generalizations of traces, called R-traces, which provide a pertinent description of communications and compositions of DI components.
Abstract: Time plays a crucial role in the performance of computing systems. The accurate modelling of logical devices, and of their physical implementations, requires an appropriate representation of time and of all properties that depend on this notion. The need for a proper model, particularly acute in the design of clockless delay-insensitive (DI) circuits, leads one to reconsider the classical descriptions of time and of the resulting order and causal relations satisfied by logical operations. This questioning meets the criticisms of classical spacetime formulated by Einstein when founding relativity theory and is answered by relativistic conceptions of time and causality. Applying this approach to clockless circuits and considering the trace formalism, we rewrite Udding’s rules, which characterize communications between DI components. We exhibit their intrinsic relation with relativistic causality. For that purpose, we introduce relativistic generalizations of traces, called R-traces, which provide a pertinent description of communications and compositions of DI components.

Journal ArticleDOI
TL;DR: This special issue reports some of the recent advances in nanophotonic interconnect research, from material and device innovation, network architecture design, to on-chip system integration and packaging.
Abstract: Over the next decade, advances in integrated circuit (IC) communication systems will encounter a number of interdependent challenges, from system and architecture design to physical technology integration. With increasing system integration and technology scaling, the performance and scalability of electrical interconnect have become the primary system bottlenecks in terms of performance and power efficiency. Emerging nanophotonic communication technology offers great opportunities to address the ever-increasing on-chip and off-chip communication challenges for emerging many-core computing systems. This special issue reports some of the recent advances in nanophotonic interconnect research, from material and device innovation, network architecture design, to on-chip system integration. The first article by Raymond Beausoleil of HP Labs provides a comprehensive survey of the recent progress in nanophotonic device technology and network architecture research. The article first highlights the challenges raised in the exascale computing paradigm and surveys several computer architectural designs employing photonic networks. Next, it reviews the key components of a photonic interconnect system, including waveguides, filters, modulators, detectors and optical sources. It then focuses on discussing the challenges and opportunities in nanophotonic system integration and packaging. The article by Biberman et al. presents a multi-layer photonic design. Different from most existing silicon-on-insulator-based solutions, the proposed design uses deposited materials such as silicon nitride and polycrystalline silicon. The performance of the proposed solution in both wavelength-routing and broadband-switched networks is investigated. This study shows that the multi-layer structures can provide significant improvements in insertion loss and bandwidth scalability over the single-layer crystalline silicon approach. The next article by Li et al. from the University of Colorado at Boulder proposes a hybrid on-chip photonic communication system. This hybrid network design consists of two subnetworks. The first is a broadcast subnetwork using a novel dielectric antenna array that carries latency-critical short messages for coherence and synchronization and global resource arbitration messages. The second is a circuit-switched mesh subnetwork for throughput-bound workload. The proposed design demonstrates both performance and power efficiency advantages over existing electrical and photonic network designs. The final article in this special issue by Cianchetti and Albonesi of Cornell University presents a hybrid electrical/optical router design for emerging multi-core/many-core microprocessors. The proposed design exploits the low latency of nanophotonics and network architectural innovation to allow fast packet forwarding. The authors show that the proposed router design has 2X better performance with 30% less power consumption than a state-of-the-art electrical router. Overall, the four articles presented in this …

Journal ArticleDOI
TL;DR: This special issue of the ACM Journal of Emerging Technologies in Computing Systems (JETC) presents key papers from the IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH’09), which addresses design challenges that will arise in computing with massive numbers of devices as well as the challenges arising from the complexity of managing defects and faults in such systems.
Abstract: This special issue of the ACM Journal of Emerging Technologies in Computing Systems (JETC) presents key papers from the IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH’09). NANOARCH is the IEEE and ACM’s premier annual symposium devoted to the presentation and discussion of novel nanoelectronic system architectures. On 30–31 July 2009, the fifth of these symposia was convened in San Francisco, California, in partnership with the Design Automation Conference (DAC). Important challenges addressed at NANOARCH include the design challenges that will arise in computing with massive numbers of devices as well as the challenges arising from the complexity of managing defects and faults in such systems. The three articles in this special issue confront both sets of challenges. Two articles, by Dingler et al. and Gaillardon et al., respectively, present architectures for computing systems based on novel nanodevices and schemes for manufacturing at post-CMOS sublithographic scales. The third article, by Haron and Hamdioui, confronts reliability issues in ultra-dense nanomemory systems. In “Performance and Energy Impact of Locally Controlled NML Circuits,” Aaron Dingler, Michael Niemier, X. Sharon Hu, and Evan Lent evaluate the performance of a form of magnetic quantum-dot cellular automata known as Nanomagnet Logic (NML). Their evaluation is notable because it incorporates for the first time a detailed consideration of the performance impact of the magnetic clocking scheme that is required to drive NML. The authors find that, even with the overhead due to clocking, the energy-delay product of NML 32-bit full adder circuits compares favorably with that of CMOS. Also, it could be improved even further through the use of new magnetic materials or higher-permeability cladding dielectrics.
In “Matrix Nanodevice-Based Architectures and Associated Functional Mapping Method,” authors Pierre-Emmanuel Gaillardon, Ian O’Connor, Junchen Liu, Maimouna Amadou, Fabien Clermidy, and Gabriela Nicolescu present a novel nanoscale computing architecture based on the interconnection of fine-grained logic cells. The logic cells they consider are individual reconfigurable logic gates based on carbon nanotube transistors (CNTFETs). However, the authors point out that their approach generalizes to any matrix of ultra-fine reconfigurable cells. By mapping various benchmark circuits onto this architecture, Gaillardon et al. show that the architecture may provide up to a 14-fold improvement in functional density when compared with commercial silicon CMOS FPGAs. This occurs due to the great reduction in area provided by CNTFET-based logic cells, which is sufficient to overcome the increased interconnection overhead arising from the much finer-grained logic. In “Redundant Residue Number System Code for Fault Tolerance Hybrid Memories,” Nor Zaidi Haron and Said Hamdioui present a modified redundant residue number system (RRNS) coding scheme to enhance the reliability of nanomemory systems in the presence of transient and intermittent faults. By introducing this code, which uses fewer residues, the authors present a lower-overhead fault tolerance option with a correction capability near to that of conventional RRNS. This trade-off introduces the advantage of improved data storage capacity compared to RRNS, as well as Reed-Solomon codes, due to a shorter codeword length. The authors also show that their modification improves the speed of RRNS decoding.