Home
/
Authors
/
Adam Teman

Author

Adam Teman

Other affiliations: École Polytechnique Fédérale de Lausanne, Ben-Gurion University of the Negev, École Normale Supérieure

Bio: Adam Teman is an academic researcher from Bar-Ilan University. The author has contributed to research in topics: Static random-access memory & CMOS. The author has an hindex of 21, co-authored 96 publications receiving 1116 citations. Previous affiliations of Adam Teman include École Polytechnique Fédérale de Lausanne & Ben-Gurion University of the Negev.

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008

Papers

PDF

Open Access

More filters

Journal Article•DOI•

A 250 mV 8 kb 40 nm Ultra-Low Power 9T Supply Feedback SRAM (SF-SRAM)

[...]

Adam Teman¹, Lidor Pergament¹, Omer Cohen¹, Alexander Fish¹•Institutions (1)

Ben-Gurion University of the Negev¹

01 Sep 2011-IEEE Journal of Solid-state Circuits

TL;DR: A novel 9T bitcell is presented, implementing a Supply Feedback concept to internally weaken the pull-up current during write cycles and thus enable low-voltage write operations, achieved without the need for additional peripheral circuits and techniques.

...read moreread less

Abstract: Low voltage operation of digital circuits continues to be an attractive option for aggressive power reduction. As standard SRAM bitcells are limited to operation in the strong-inversion regimes due to process variations and local mismatch, the development of specially designed SRAMs for low voltage operation has become popular in recent years. In this paper, we present a novel 9T bitcell, implementing a Supply Feedback concept to internally weaken the pull-up current during write cycles and thus enable low-voltage write operations. As opposed to the majority of existing solutions, this is achieved without the need for additional peripheral circuits and techniques. The proposed bitcell is fully functional under global and local variations at voltages from 250 mV to 1.1 V. In addition, the proposed cell presents a low-leakage state reducing power up to 60%, as compared to an identically supplied 8T bitcell. An 8 kbit SF-SRAM array was implemented and fabricated in a low-power 40 nm process, showing full functionality and ultra-low power.

...read moreread less

105 citations

Journal Article•DOI•

A 588-Gb/s LDPC Decoder Based on Finite-Alphabet Message Passing

[...]

Reza Ghanaatian¹, Alexios Balatsoukas-Stimming¹, Thomas Christoph Muller¹, Michael Meidlinger², Gerald Matz², Adam Teman¹, Andreas Burg¹ - Show less +3 more•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Vienna University of Technology²

16 Nov 2017-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: An ultrahigh throughput low-density parity-check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput compared to previously reported LDPC decoders in the literature.

...read moreread less

Abstract: An ultrahigh throughput low-density parity-check (LDPC) decoder with an unrolled full-parallel architecture is proposed, which achieves the highest decoding throughput compared to previously reported LDPC decoders in the literature. The decoder benefits from a serial message-transfer approach between the decoding stages to alleviate the well-known routing congestion problem in parallel LDPC decoders. Furthermore, a finite-alphabet message passing algorithm is employed to replace the VN update rule of the standard min-sum (MS) decoder with lookup tables, which are designed in a way that maximizes the mutual information between decoding messages. The proposed algorithm results in an architecture with reduced bit-width messages, leading to a significantly higher decoding throughput and to a lower area compared to an MS decoder when serial message transfer is used. The architecture is placed and routed for the standard MS reference decoder and for the proposed finite-alphabet decoder using a custom pseudo-hierarchical backend design strategy to further alleviate routing congestions and to handle the large design. Postlayout results show that the finite-alphabet decoder with the serial message-transfer architecture achieves a throughput as large as 588 Gb/s with an area of 16.2 mm 2 and dissipates an average power of 22.7 pJ per decoded bit in a 28-nm fully depleted silicon on isulator library. Compared to the reference MS decoder, this corresponds to 3.1 times smaller area and 2 times better energy efficiency.

...read moreread less

64 citations

Journal Article•DOI•

Energy-Efficient Near-Threshold Parallel Computing: The PULPv2 Cluster

[...]

Davide Rossi¹, Antonio Pullini², Igor Loi¹, Michael Gautschi², Frank K. Gurkaynak², Adam Teman³, Jeremy Constantin³, Andreas Burg³, Ivan Miro-Panades, Edith Beigne, Fabien Clermidy, Philippe Flatresse⁴, Luca Benini¹ - Show less +9 more•Institutions (4)

University of Bologna¹, ETH Zurich², École Polytechnique Fédérale de Lausanne³, STMicroelectronics⁴

01 Sep 2017-IEEE Micro

TL;DR: An ultra-low-power parallel computing platform and its system-on-chip (SoC) embodiment, targeting a wide range of emerging near-sensor processing tasks for Internet of Things (IoT) applications, is presented.

...read moreread less

Abstract: This article presents an ultra-low-power parallel computing platform and its system-on-chip (SoC) embodiment, targeting a wide range of emerging near-sensor processing tasks for Internet of Things (IoT) applications. The proposed SoC achieves 193 million operations per second (MOPS) per mW at 162 MOPS (32 bits), improving the first-generation Parallel Ultra-Low-Power (PULP) architecture by 6.4 and 3.2 times in performance and energy efficiency, respectively.

...read moreread less

56 citations

Journal Article•DOI•

Power, Area, and Performance Optimization of Standard Cell Memory Arrays Through Controlled Placement

[...]

Adam Teman¹, Davide Rossi², Pascal Meinerzhagen³, Luca Benini², Andreas Burg¹ - Show less +1 more•Institutions (3)

École Polytechnique Fédérale de Lausanne¹, University of Bologna², Bar-Ilan University³

27 May 2016-ACM Transactions on Design Automation of Electronic Systems

TL;DR: This article presents a design methodology for optimizing the physical implementation of SCM macros as part of the standard design flow, leading to a structured, noncongested layout with close to 100% placement utilization, resulting in a smaller silicon footprint, reduced wire length, and lower power consumption compared to SCMs without controlled placement.

...read moreread less

Abstract: Embedded memory remains a major bottleneck in current integrated circuit design in terms of silicon area, power dissipation, and performance; however, static random access memories (SRAMs) are almost exclusively supplied by a small number of vendors through memory generators, targeted at rather generic design specifications. As an alternative, standard cell memories (SCMs) can be defined, synthesized, and placed and routed as an integral part of a given digital system, providing complete design flexibility, good energy efficiency, low-voltage operation, and even area efficiency for small memory blocks. Yet implementing an SCM block with a standard digital flow often fails to exploit the distinct and regular structure of such an array, leaving room for optimization. In this article, we present a design methodology for optimizing the physical implementation of SCM macros as part of the standard design flow. This methodology introduces controlled placement, leading to a structured, noncongested layout with close to 100p placement utilization, resulting in a smaller silicon footprint, reduced wire length, and lower power consumption compared to SCMs without controlled placement. This methodology is demonstrated on SCM macros of various sizes and aspect ratios in a state-of-the-art 28nm fully depleted silicon-on-insulator technology, and compared with equivalent macros designed with the noncontrolled, standard flow, as well as with foundry-supplied SRAM macros. The controlled SCMs provide an average 25p reduction in area as compared to noncontrolled implementations while achieving a smaller size than SRAM macros of up to 1Kbyte. Power and performance comparisons of controlled SCM blocks of a commonly found 256 × 32 (1 Kbyte) memory with foundry-provided SRAMs show greater than 65p and 10p reduction in read and write power, respectively, while providing faster access than their SRAM counterparts, despite being of an aspect ratio that is typically unfavorable for SCMs. In addition, the SCM blocks function correctly with a supply voltage as low as 0.3V, well below the lower limit of even the SRAM macros optimized for low-voltage operation. The controlled placement methodology is applied within a full-chip physical implementation flow of an OpenRISC-based test chip, providing more than 50p power reduction compared to equivalently sized compiled SRAMs under a benchmark application.

...read moreread less

53 citations

Journal Article•DOI•

A 4-Transistor nMOS-Only Logic-Compatible Gain-Cell Embedded DRAM With Over 1.6-ms Retention Time at 700 mV in 28-nm FD-SOI

[...]

Robert Giterman¹, Alexander Fish¹, Andreas Burg², Adam Teman¹•Institutions (2)

Bar-Ilan University¹, École Polytechnique Fédérale de Lausanne²

01 Apr 2018-IEEE Transactions on Circuits and Systems I-regular Papers

TL;DR: This paper presents for the first time a fully functional GC-eDRAM array, implemented and fabricated in a 28-nm process node, based on a novel, 2-ported, 4-transistor NMOS-only bitcell with internal feedback to provide efficient operation in the target28-nm FD-SOI technology.

...read moreread less

Abstract: Gain-cell embedded DRAM (GC-eDRAM) is a possible alternative to traditional static random access memories (SRAM). While GC-eDRAM provides high-density, low-leakage, low-voltage, and inherent2-ported operation, its limited retention time requires periodic, power-hungry refresh cycles. This drawback is further aggravated at scaled technologies, where increased subthreshold leakage currents and decreased in-cell storage capacitances result in faster data integrity deterioration. Therefore, integration of GC-eDRAM within modern systems is often considered to be limited to mature process technologies, where these phenomena are less detrimental. In this paper, we present for the first time a fully functional GC-eDRAM array, implemented and fabricated in a 28-nm process node. The 8-kb array is based on a novel, 2-ported, 4-transistor NMOS-only bitcell with internal feedback to provide efficient operation in the target 28-nm FD-SOI technology. The fabricated memory macro achieves more than 1.6-ms data retention time at 27 °C, which is $30\times $ longer than conventional gain-cell topologies when applied to this technology. The described 4-transistor dual-port nMOS array utilizes over 70% of the total memory macro area, while retaining almost 30% lower cell area than a single-ported 6T SRAM in the same technology.

...read moreread less

43 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

A Survey of Techniques for Approximate Computing

[...]

Sparsh Mittal¹•Institutions (1)

Oak Ridge National Laboratory¹

18 Mar 2016-ACM Computing Surveys

TL;DR: A survey of techniques for approximate computing (AC), which discusses strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units, processor components, memory technologies, and so forth, as well as programming frameworks for AC.

...read moreread less

Abstract: Approximate computing trades off computation quality with effort expended, and as rising performance demands confront plateauing resource budgets, approximate computing has become not merely attractive, but even imperative. In this article, we present a survey of techniques for approximate computing (AC). We discuss strategies for finding approximable program portions and monitoring output quality, techniques for using AC in different processing units (e.g., CPU, GPU, and FPGA), processor components, memory technologies, and so forth, as well as programming frameworks for AC. We classify these techniques based on several key characteristics to emphasize their similarities and differences. The aim of this article is to provide insights to researchers into working of AC techniques and inspire more efforts in this area to make AC the mainstream computing approach in future systems.

...read moreread less

890 citations

Parallel CAD: Algorithm Design and Programming Special Section Call for Papers TODAES: ACM Transactions on Design Automation of Electronic Systems

[...]

Kurt Keutzer, Peng Li, Li Shang, Hai Zhou

01 Jan 2010

TL;DR: This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification, as well as other topics relevant to the design of parallel CAD algorithms and software tools.

...read moreread less

Abstract: High-performance parallel computer architecture and systems have been improved at a phenomenal rate. In the meantime, VLSI computer-aided design (CAD) software for multibillion-transistor IC design has become increasingly complex and requires prohibitively high computational resources. Recent studies have shown that, numerous CAD problems, with their high computational complexity, can greatly benefit from the fast-increasing parallel computation capabilities. However, parallel programming imposes big challenges for CAD applications. Fully exploiting the computational power of emerging general-purpose and domain-specific multicore/many-core processor systems, calls for fundamental research and engineering practice across every stage of parallel CAD design, from algorithm exploration, programming models, design-time and run-time environment, to CAD applications, such as verification, optimization, and simulation. This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architectural-specific optimization, and verification. More specifically, papers with in-depth and extensive coverage of the following topics will be considered, as well as other topics relevant to the design of parallel CAD algorithms and software tools. 1. Parallel algorithm design and specification for CAD applications 2. Parallel programming models and languages of particular use in CAD 3. Runtime support and performance optimization for CAD applications 4. Parallel architecture-specific design and optimization for CAD applications 5. Parallel program debugging and verification techniques particularly relevant for CAD The papers should be submitted via the Manuscript Central website and should adhere to standard ACM TODAES formatting requirements (http://todaes.acm.org/). The page count limit is 25.

...read moreread less

459 citations

Journal Article•DOI•

Near-Threshold RISC-V Core With DSP Extensions for Scalable IoT Endpoint Devices

[...]

Michael Gautschi¹, Pasquale Davide Schiavone¹, Andreas Traber¹, Igor Loi, Antonio Pullini¹, Davide Rossi, Eric Flamand¹, Frank K. Gurkaynak¹, Luca Benini¹ - Show less +5 more•Institutions (1)

ETH Zurich¹

24 Feb 2017-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: In this paper, the authors describe the design of an open-source RISC-V processor core specifically designed for near-threshold (NT) operation in tightly coupled multicore clusters and introduce instruction extensions and micro-architectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy.

...read moreread less

Abstract: Endpoint devices for Internet-of-Things not only need to work under extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and the performance scalability can be gained through parallelism. In this paper, we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multicore clusters. We introduce instruction extensions and microarchitectural optimizations to increase the computational density and to minimize the pressure toward the shared-memory hierarchy. For typical data-intensive sensor processing workloads, the proposed core is, on average, $3.5\times $ faster and $3.2\times $ more energy efficient, thanks to a smart L0 buffer to reduce cache access contentions and support for compressed instructions. Single Instruction Multiple Data extensions, such as dot products, and a built-in L0 storage further reduce the shared-memory accesses by $8\times $ reducing contentions by $3.2\times $ . With four NT-optimized cores, the cluster is operational from 0.6 to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65-nm bulk CMOS technology. In a low-power 28-nm FD-SOI process, a peak efficiency of 193 MOPS/mW (40 MHz and 1 mW) can be achieved.

...read moreread less

304 citations

Journal Article•DOI•

YodaNN: An Architecture for Ultralow Power Binary-Weight CNN Acceleration

[...]

Renzo Andri¹, Lukas Cavigelli¹, Davide Rossi², Luca Benini¹•Institutions (2)

ETH Zurich¹, University of Bologna²

01 Jan 2018-IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

TL;DR: In this paper, the authors present an accelerator optimized for binary-weight CNNs that achieves 1.5 TOp/s at 1.2 V on a core area of only 1.33 million gate equivalent (MGE) or 1.6 V.

...read moreread less

Abstract: Convolutional neural networks (CNNs) have revolutionized the world of computer vision over the last few years, pushing image classification beyond human accuracy. The computational effort of today’s CNNs requires power-hungry parallel processors or GP-GPUs. Recent developments in CNN accelerators for system-on-chip integration have reduced energy consumption significantly. Unfortunately, even these highly optimized devices are above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. This prevents the adoption of CNNs in future ultralow power Internet of Things end-nodes for near-sensor analytics. Recent algorithmic and theoretical advancements enable competitive classification accuracy even when limiting CNNs to binary (+1/−1) weights during training. These new findings bring major optimization opportunities in the arithmetic core by removing the need for expensive multiplications, as well as reducing I/O bandwidth and storage. In this paper, we present an accelerator optimized for binary-weight CNNs that achieves 1.5 TOp/s at 1.2 V on a core area of only 1.33 million gate equivalent (MGE) or 1.9 mm 2 and with a power dissipation of 895 $\mu$ W in UMC 65-nm technology at 0.6 V. Our accelerator significantly outperforms the state-of-the-art in terms of energy and area efficiency achieving 61.2 TOp/s/W@0.6 V and 1.1 TOp/s/MGE@1.2 V, respectively.

...read moreread less

198 citations

Posted Content•

YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights

[...]

Renzo Andri¹, Lukas Cavigelli¹, Davide Rossi², Luca Benini²•Institutions (2)

ETH Zurich¹, University of Bologna²

17 Jun 2016-arXiv: Hardware Architecture

TL;DR: In this article, the authors present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V.

...read moreread less

Abstract: Convolutional Neural Networks (CNNs) have revolutionized the world of image classification over the last few years, pushing the computer vision close beyond human accuracy. The required computational effort of CNNs today requires power-hungry parallel processors and GP-GPUs. Recent efforts in designing CNN Application-Specific Integrated Circuits (ASICs) and accelerators for System-On-Chip (SoC) integration have achieved very promising results. Unfortunately, even these highly optimized engines are still above the power envelope imposed by mobile and deeply embedded applications and face hard limitations caused by CNN weight I/O and storage. On the algorithmic side, highly competitive classification accuracy can be achieved by properly training CNNs with binary weights. This novel algorithm approach brings major optimization opportunities in the arithmetic core by removing the need for the expensive multiplications as well as in the weight storage and I/O costs. In this work, we present a HW accelerator optimized for BinaryConnect CNNs that achieves 1510 GOp/s on a core area of only 1.33 MGE and with a power dissipation of 153 mW in UMC 65 nm technology at 1.2 V. Our accelerator outperforms state-of-the-art performance in terms of ASIC energy efficiency as well as area efficiency with 61.2 TOp/s/W and 1135 GOp/s/MGE, respectively.

...read moreread less

162 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169

Collapse