scispace - formally typeset
Search or ask a question
Author

Adam Teman

Bio: Adam Teman is an academic researcher from Bar-Ilan University. The author has contributed to research in topics: Static random-access memory & CMOS. The author has an hindex of 21, co-authored 96 publications receiving 1116 citations. Previous affiliations of Adam Teman include École Polytechnique Fédérale de Lausanne & Ben-Gurion University of the Negev.


Papers
More filters
Journal ArticleDOI
TL;DR: A novel 9T bitcell is presented, implementing a Supply Feedback concept to internally weaken the pull-up current during write cycles and thus enable low-voltage write operations, achieved without the need for additional peripheral circuits and techniques.
Abstract: Low voltage operation of digital circuits continues to be an attractive option for aggressive power reduction. As standard SRAM bitcells are limited to operation in the strong-inversion regimes due to process variations and local mismatch, the development of specially designed SRAMs for low voltage operation has become popular in recent years. In this paper, we present a novel 9T bitcell, implementing a Supply Feedback concept to internally weaken the pull-up current during write cycles and thus enable low-voltage write operations. As opposed to the majority of existing solutions, this is achieved without the need for additional peripheral circuits and techniques. The proposed bitcell is fully functional under global and local variations at voltages from 250 mV to 1.1 V. In addition, the proposed cell presents a low-leakage state reducing power up to 60%, as compared to an identically supplied 8T bitcell. An 8 kbit SF-SRAM array was implemented and fabricated in a low-power 40 nm process, showing full functionality and ultra-low power.

105 citations

Journal ArticleDOI
TL;DR: An ultrahigh throughput low-density parity-check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput compared to previously reported LDPC decoders in the literature.
Abstract: An ultrahigh throughput low-density parity-check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput compared to previously reported LDPC decoders in the literature. The decoder benefits from a serial message-transfer approach between the decoding stages to alleviate the well-known routing congestion problem in parallel LDPC decoders. Furthermore, a finite-alphabet message passing algorithm is employed to replace the VN update rule of the standard min-sum (MS) decoder with lookup tables, which are designed in a way that maximizes the mutual information between decoding messages. The proposed algorithm results in an architecture with reduced bit-width messages, leading to a significantly higher decoding throughput and to a lower area compared to an MS decoder when serial message transfer is used. The architecture is placed and routed for the standard MS reference decoder and for the proposed finite-alphabet decoder using a custom pseudo-hierarchical backend design strategy to further alleviate routing congestions and to handle the large design. Postlayout results show that the finite-alphabet decoder with the serial message-transfer architecture achieves a throughput as large as 588 Gb/s with an area of 16.2 mm 2 and dissipates an average power of 22.7 pJ per decoded bit in a 28-nm fully depleted silicon on isulator library. Compared to the reference MS decoder, this corresponds to 3.1 times smaller area and 2 times better energy efficiency.

64 citations

Journal ArticleDOI
TL;DR: An ultra-low-power parallel computing platform and its system-on-chip (SoC) embodiment, targeting a wide range of emerging near-sensor processing tasks for Internet of Things (IoT) applications, is presented.
Abstract: This article presents an ultra-low-power parallel computing platform and its system-on-chip (SoC) embodiment, targeting a wide range of emerging near-sensor processing tasks for Internet of Things (IoT) applications. The proposed SoC achieves 193 million operations per second (MOPS) per mW at 162 MOPS (32 bits), improving the first-generation Parallel Ultra-Low-Power (PULP) architecture by 6.4 and 3.2 times in performance and energy efficiency, respectively.

56 citations

Journal ArticleDOI
TL;DR: This article presents a design methodology for optimizing the physical implementation of SCM macros as part of the standard design flow, leading to a structured, noncongested layout with close to 100% placement utilization, resulting in a smaller silicon footprint, reduced wire length, and lower power consumption compared to SCMs without controlled placement.
Abstract: Embedded memory remains a major bottleneck in current integrated circuit design in terms of silicon area, power dissipation, and performance; however, static random access memories (SRAMs) are almost exclusively supplied by a small number of vendors through memory generators, targeted at rather generic design specifications. As an alternative, standard cell memories (SCMs) can be defined, synthesized, and placed and routed as an integral part of a given digital system, providing complete design flexibility, good energy efficiency, low-voltage operation, and even area efficiency for small memory blocks. Yet implementing an SCM block with a standard digital flow often fails to exploit the distinct and regular structure of such an array, leaving room for optimization. In this article, we present a design methodology for optimizing the physical implementation of SCM macros as part of the standard design flow. This methodology introduces controlled placement, leading to a structured, noncongested layout with close to 100p placement utilization, resulting in a smaller silicon footprint, reduced wire length, and lower power consumption compared to SCMs without controlled placement. This methodology is demonstrated on SCM macros of various sizes and aspect ratios in a state-of-the-art 28nm fully depleted silicon-on-insulator technology, and compared with equivalent macros designed with the noncontrolled, standard flow, as well as with foundry-supplied SRAM macros. The controlled SCMs provide an average 25p reduction in area as compared to noncontrolled implementations while achieving a smaller size than SRAM macros of up to 1Kbyte. Power and performance comparisons of controlled SCM blocks of a commonly found 256 × 32 (1 Kbyte) memory with foundry-provided SRAMs show greater than 65p and 10p reduction in read and write power, respectively, while providing faster access than their SRAM counterparts, despite being of an aspect ratio that is typically unfavorable for SCMs. In addition, the SCM blocks function correctly with a supply voltage as low as 0.3V, well below the lower limit of even the SRAM macros optimized for low-voltage operation. The controlled placement methodology is applied within a full-chip physical implementation flow of an OpenRISC-based test chip, providing more than 50p power reduction compared to equivalently sized compiled SRAMs under a benchmark application.

53 citations

Journal ArticleDOI
TL;DR: This paper presents for the first time a fully functional GC-eDRAM array, implemented and fabricated in a 28-nm process node, based on a novel, 2-ported, 4-transistor NMOS-only bitcell with internal feedback to provide efficient operation in the target28-nm FD-SOI technology.
Abstract: Gain-cell embedded DRAM (GC-eDRAM) is a possible alternative to traditional static random access memories (SRAM). While GC-eDRAM provides high-density, low-leakage, low-voltage, and inherent2-ported operation, its limited retention time requires periodic, power-hungry refresh cycles. This drawback is further aggravated at scaled technologies, where increased subthreshold leakage currents and decreased in-cell storage capacitances result in faster data integrity deterioration. Therefore, integration of GC-eDRAM within modern systems is often considered to be limited to mature process technologies, where these phenomena are less detrimental. In this paper, we present for the first time a fully functional GC-eDRAM array, implemented and fabricated in a 28-nm process node. The 8-kb array is based on a novel, 2-ported, 4-transistor NMOS-only bitcell with internal feedback to provide efficient operation in the target 28-nm FD-SOI technology. The fabricated memory macro achieves more than 1.6-ms data retention time at 27 °C, which is $30\times $ longer than conventional gain-cell topologies when applied to this technology. The described 4-transistor dual-port nMOS array utilizes over 70% of the total memory macro area, while retaining almost 30% lower cell area than a single-ported 6T SRAM in the same technology.

43 citations


Cited by
More filters
Journal ArticleDOI
TL;DR: A survey of techniques for approximate computing (AC), which discusses strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units, processor components, memory technologies, and so forth, as well as programming frameworks for AC.
Abstract: Approximate computing trades off computation quality with effort expended, and as rising performance demands confront plateauing resource budgets, approximate computing has become not merely attractive, but even imperative. In this article, we present a survey of techniques for approximate computing (AC). We discuss strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units (e.g., CPU, GPU, and FPGA), processor components, memory technologies, and so forth, as well as programming frameworks for AC. We classify these techniques based on several key characteristics to emphasize their similarities and differences. The aim of this article is to provide insights to researchers into working of AC techniques and inspire more efforts in this area to make AC the mainstream computing approach in future systems.

890 citations

01 Jan 2010
TL;DR: This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification, as well as other topics relevant to the design of parallel CAD algorithms and software tools.
Abstract: High-performance parallel computer architecture and systems have been improved at a phenomenal rate. In the meantime, VLSI computer-aided design (CAD) software for multibillion-transistor IC design has become increasingly complex and requires prohibitively high computational resources. Recent studies have shown that, numerous CAD problems, with their high computational complexity, can greatly benefit from the fast-increasing parallel computation capabilities. However, parallel programming imposes big challenges for CAD applications. Fully exploiting the computational power of emerging general-purpose and domain-specific multicore/many-core processor systems, calls for fundamental research and engineering practice across every stage of parallel CAD design, from algorithm exploration, programming models, design-time and run-time environment, to CAD applications, such as verification, optimization, and simulation. This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification. More specifically, papers with in-depth and extensive coverage of the following topics will be considered, as well as other topics relevant to the design of parallel CAD algorithms and software tools. 1. Parallel algorithm design and specification for CAD applications 2. Parallel programming models and languages of particular use in CAD 3. Runtime support and performance optimization for CAD applications 4. Parallel architecture-specific design and optimization for CAD applications 5. Parallel program debugging and verification techniques particularly relevant for CAD The papers should be submitted via the Manuscript Central website and should adhere to standard ACM TODAES formatting requirements (http://todaes.acm.org/). The page count limit is 25.

459 citations

Journal ArticleDOI
TL;DR: In this paper, the authors describe the design of an open-source RISC-V processor core specifically designed for near-threshold (NT) operation in tightly coupled multicore clusters and introduce instruction extensions and micro-architectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy.
Abstract: Endpoint devices for Internet-of-Things not only need to work under extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and the performance scalability can be gained through parallelism. In this paper, we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multicore clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy. For typical data-intensive sensor processing workloads, the proposed core is, on average, $3.5\times $ faster and $3.2\times $ more energy efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. Single Instruction Multiple Data extensions, such as dot products, and a built-in L0 storage further reduce the shared-memory accesses by $8\times $ reducing contentions by $3.2\times $ . With four NT-optimized cores, the cluster is operational from 0.6 to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65-nm bulk CMOS technology. In a low-power 28-nm FD-SOI process, a peak efficiency of 193 MOPS/mW (40 MHz and 1 mW) can be achieved.

304 citations

Journal ArticleDOI
TL;DR: In this paper, the authors present an accelerator optimized for binary-weight CNNs that achieves 1.5 TOp/s at 1.2 V on a core area of only 1.33 million gate equivalent (MGE) or 1.6 V.
Abstract: Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today’s CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultralow power Internet of Things end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when limiting CNNs to binary (+1/−1) weights during training. These new findings bring major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as reducing I/O bandwidth and storage. In this paper, we present an accelerator optimized for binary-weight CNNs that achieves 1.5 TOp/s at 1.2 V on a core area of only 1.33 million gate equivalent (MGE) or 1.9 mm 2 and with a power dissipation of 895 $\mu$ W in UMC 65-nm technology at 0.6 V. Our accelerator significantly outperforms the state-of-the-art in terms of energy and area efficiency achieving 61.2 TOp/s/W@0.6 V and 1.1 TOp/s/MGE@1.2 V, respectively.

198 citations

Posted Content
TL;DR: In this article, the authors present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V.
Abstract: Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing the computer vision close beyond human accuracy. The required computational effort of CNNs today requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-On-Chip (SoC) integration have achieved very promising results. Unfortunately, even these highly optimized engines are still above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. On the algorithmic side, highly competitive classification accuracy can be achieved by properly training CNNs with binary weights. This novel algorithm approach brings major optimization opportunities in the arithmetic core by removing the need for the expensive multiplications as well as in the weight storage and I/O costs. In this work, we present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V. Our accelerator outperforms state-of-the-art performance in terms of ASIC energy efficiency as well as area efficiency with 61.2 TOp/s/W and 1135 GOp/s/MGE, respectively.

162 citations