# Leakage Aware Digital Design Optimization for Minimal Total Power Consumption in Nanometer CMOS Technologies 

by Christian Schuster

A thesis submitted to the Faculty of Science of the University of Neuchâtel, in conformity with the requirements for the degree of Doctor of Science

Institute of Microtechnology
University of Neuchâtel
Switzerland

21 March 2007


## IMPRIMATUR FOR THE THESIS

# Leakage aware digital design optimization for minimal total power consumption in nanometer CMOS technologies

## Christian SCHUSTER

UNIVERSITY OF NEUCHÂTEL

## FACULTY OF SCIENCE

The Faculty of Science of the University of Neuchâtel, on the report of the jury members
Messrs. P.-A. Farine (thesis director),
C. Piguet (thesis co-director, CSEM, Neuchâtel), S. Tanner, J.-L. Nagel (CSEM, Neuchâtel) and M. Belleville (CEA-LETI, Grenoble, France)
authorizes the printing of this thesis.

Neuchâtel, 27 March 2007
The Dean:
T. Ward


To Mario, Silvana and Eliana


#### Abstract

Starting from deep submicron technologies $(<0.13 \mu m)$, and even more so in nanometer technologies, static power consumption, due to leaky "off" transistors, is becoming a non-negligible contributor to the total power dissipation. Under this condition, the total power optimization problem changes considerably. The high-parallelization approach commonly used today to increase performance will soon result in power-inefficient designs. Indeed, the static power consumption of the large number of rarely used transistors will heavily penalize the total power consumption.

The purpose of this thesis is to investigate the influence of static power on the design methodologies for low power. In particular, the effects of architectural as well as technology modifications are explored. The use of technology as an optimization parameter has become possible in recent technologies. In fact, they offer different threshold voltages, each one showing a different trade-off between speed and leakage current.

In this work, two different frameworks are considered. In the first one, both the supply voltage and the transistor threshold voltage are freely tunable parameters. This is the most general case and corresponds to the situation where the designer has the largest freedom. In the second framework, we assume that the designer can change neither the supply voltage nor the transistor threshold voltage, which are hence considered constants. This is the most common case, where the supply voltage and the technology type (and hence the threshold voltage) are fixed by the application and by the devices the circuit has to interface with. In both cases, considerable effort has been put into developing a handy way to rapidly estimate the total power consumption and consequently to easily compare different architectural/technology variants at the early stages of development.

Examples based on multipliers are used extensively throughout the thesis and, at the end, the presented theory is applied to a real circuit implemented in a 90 nm technology by ST Microelectronics. Measurements show a very large variability of the static power over 16 dies manufactured on the same wafer. For instance, at nominal conditions ( $\mathrm{Vdd}=1 \mathrm{~V}, \mathrm{f}=62.5 \mathrm{MHz}$ ), the highest static power consumption exceeds the lowest one by a factor of more than 2.5. Measured data also show multipliers able to work at 210 mV at a frequency of 1 MHz!


## Keywords

Low power digital design, static power, leakage current, dynamic power, very low supply voltage, multiplier, architecture, 90 nm, CMOS nanometer technology.

## Mots clés

Circuit numérique à faible consommation, puissance statique, courant de fuite, puissance dynamique, tension d'alimentation très basse, multiplicateur, architecture, 90 nm , technologie CMOS nanométrique.

## Acknowledgements

During the last four years, a large number of people have contributed to expanding my personal knowledge and have helped me progress in my thesis. I will try to acknowledge most of them, although it is difficult to mention everyone in a few lines. I apologize in advance to those I have missed.

I thank Prof. P.-A. Farine, who welcomed me into his group and gave me the freedom and all the tools that I needed to successfully finish my work. I also thank my thesis co-director Prof. C. Piguet for his endless support and the large number of suggestions he gave me during our weekly meetings. He also made a great effort to promote my work outside the IMT walls. Moreover, I would like to acknowledge Dr. J.-L. Nagel for being my project leader for the first three years and for sharing his vast knowledge with me (besides sharing the office too). During the last year, my new project leader Dr. S. Tanner helped me to finalize my work and supported me in the integration and chip testing part of the project. Many thanks to Dr. M. Belleville too, who kindly accepted to be one of the jury experts.

I am also grateful to all my colleagues, who created a pleasant atmosphere in the group and helped me to master some not-so-easy-to-use tools in this work. In particular, I would like to thank P. Stadelmann and D. Manetti for all kinds of discussions, C. Robert for the help provided in the PCB design, R. Merz for the help in the use of GPIB based instruments with MATLAB, without forgetting J.-L. Nagel, P. Thoppay and M. Moridi for sharing the office with me.

Furthermore, I want to express my complete gratitude to my parents, who enabled me to successfully complete my studies and always motivated me to progress: "You can always fly higher, as long as you want it!".

Finally, I would like to acknowledge my wonderful wife for always being beside me, bearing with me, and accepting me as I am, with my merits and demerits as well as my various moods: Thank you very much!

## Contents

1 Introduction
1.1 Motivations
1.2 Thesis outline
1.3 Contributions
2 Sources of dissipation in CMOS transistors
2.1 Dynamic consumption
2.1.1 Switching energy
2.1.2 Shortcut energy
2.2 Static consumption
2.2.1 Sub-threshold current
2.2.2 Gate leakage current
2.2.3 Reverse bias p-n junction leakage and band to band tunneling
2.2.4 Gate-Induced Drain Leakage (GIDL)
2.2.5 Punchthrough
2.3 Summary
3 Delay and power models
3.1 Current models
3.2 Power models
3.2.1 Dynamic power
3.2.2 Static power
3.2.3 Total power
3.3 Delay models
3.4 Summary
4 Technology characterization
4.1 Parameters extraction methodology
4.1.1 The sub-threshold slope $n$
4.1.2 The DIBL effect factor $\eta$
4.1.3 The $\alpha$ factor and the reference threshold voltage $V t h 0$
4.1.4 The body effect coefficient $\gamma$
4.1.5 Remark on $I_{o}$
4.2 STM 90nm technology
4.2.1 Low Vth Transistors (lvt)
4.2.2 Standard Vth Transistors (svt)
4.2.3 High Vth Transistors (hvt)
4.3 Summary
5 Reference multiplier architectures
5.1 Ripple Carry Array
5.1.1 RCA parallel variations
5.1.2 RCA horizontal pipeline variations
5.1.3 RCA diagonal pipeline variations
5.2 Wallace
5.2.1 Wallace parallel versions
5.3 Sequential
5.3.1 Sequential-wallace
5.3.2 Sequential parallel
5.4 Summary
6 Total power comparison for free Vdd and free Vth
6.1 Existence of a total power consumption optimum
6.2 Pdyn over Pstat ratio
6.2.1 k1 derivation
6.3 Optimal Vdd and Vth formulas
6.3.1 Optimal threshold voltage derivation
6.3.2 Optimal supply voltage derivation
6.4 Optimal total power
6.4.1 Optimal power comparison with $k_{1}$ constant
6.4.2 Absolute optimal total power
6.5 Summary
7 Architectural impact on total power
7.1 Summary
8 Technology impact on total power
8.1 Technology as a free parameter
8.2 Application to technology selection
8.3 Discussion on the modifiability of Vth
8.3.1 Body biasing
8.3.2 Transistor size modification
8.4 Summary
9 Total power comparison for fixed Vdd and fixed Vth
9.1 Total power comparison
9.2 Comparison of two architectures
9.3 Selection of the best architecture
9.4 Designing new circuits
9.5 Case study: 16bit multipliers
9.6 Summary
10 Physical implementation of four 32 bit multipliers
10.1 Circuit description
10.1.1 Pseudo-random code generator
10.1.2 Ring oscillators
10.2 Circuit design and implementation
10.2.1 Nominal values
10.3 Measurements setup
10.3.1 PCB design
10.3.2 FPGA based signal generation
10.3.3 MATLAB based measurements automation
10.4 Measurements
10.4.1 Nominal values
10.4.2 Lowest working supply voltage
10.4.3 Optimal total power
10.4.4 Power and delay variability
10.5 Summary
11 Conclusions
Bibliography
List of Publications
A VHDL source code
A.1 top.vhd
A.2 data_gen.vhd
A.3 mult.vhd
A.4 mult_par4.vhd
A.5 RCA_generic_arch.vhd
A.6 ring_svt.vhd
A.7 top_tb.vhd
B Synopsys compilation scripts
B.1 compile_top.tcl
B.2 read_vhdl.tcl
B.3 power_sdf.do
C SoC Encounter P&R scripts
C.1 main.tcl
C.2 top.conf
C.3 IO_Filler.tcl
C.4 do_power_domains.tcl
C.5 create_global_net.tcl
C.6 pwr.tcl
C.7 followPin.tcl
C.8 place_output_bufs.tcl
C.9 output_nets.tcl
C.10 fix_drc_errors.tcl
C.11 top.ctstch
C.12 ioplace.io
D FPGA source code
D.1 main_FPGA.vhd
E MATLAB based automated test functions
E.1 test_mult.m

## List of Figures

1.1 Ioff vs. Lg and total power vs. technology nodes
2.1 CMOS inverter
2.2 Sources of static power consumption in a NMOS transistor
2.3 Effect of Drain Induced Barrier Lowering (DIBL) on short channel transistors
4.1 Schematic of the NMOS and PMOS transistors used for the extraction of the sub-threshold slope $n$
4.2 Linear fitting of $\ln \left(I_{d s}(V g s)\right)$ for STM 90 nm lvt
4.3 Linear fitting of $\ln \left(I_{o f f}(V d d)\right)$ for 1 inverter
4.4 Fitting of delay vs. Vdd for STM 90nm lvt
4.5 Linear fitting of $\ln \left(I_{o f f}(V b s)\right)$ for 1 inverter
5.1 Full adder symbol
5.2 8bit RCA multiplier
5.3 Critical path in a 8bit RCA multiplier
5.4 2 times parallelized multiplier
5.5 2 stages horizontally pipelined 8bit RCA
5.6 2 stages diagonally pipelined 8bit RCA
5.7 Internal implementation of a Carry Save Adder (CSA)
5.8 Wallace 8bit structure
5.9 Sequential multiplier structure (16bit)
5.10 Sequential multiplier (16bit) with a 4x16 Wallace implementation
6.1 Relationship between Vdd and Vth for $\alpha=1.65$ and $\chi=0.3$
6.2 Total power consumption of a 16 bit Wallace multiplier
6.3 $V d d^{1 / \alpha}$ and its linear approximation
6.4 Linearization coefficients for Vdd in $[0.3 \mathrm{~V} ; 1 \mathrm{~V}]$
6.5 Linearization coefficients for Vdd in $[0.3 \mathrm{~V} ; 0.6 \mathrm{~V}]$
6.6 Optimal $V t h$ vs. activity
6.7 Optimal $V t h$ vs. frequency
6.8 Optimal $V t h$ vs. logical depth
6.9 Optimal $V d d$ vs. activity
6.10 Optimal $V d d$ vs. frequency
6.11 Optimal $V d d$ vs. logical depth
7.1 Optimal Vdd calculated with numerical computation
7.2 Optimal Vth calculated with numerical computation
7.3 Optimal total power calculated with numerical computation
8.1 Technology parameters influence on a RCA 16 multiplier in a SVT STM 90nm technology
8.2 Optimal total power consumption of ten 16 bit multipliers in all STM 90nm technology flavors
8.3 Vth vs. W for a NMOS transistor
8.4 Vth vs. W for a PMOS transistor
8.5 Vth vs. L for a NMOS transistor
8.6 Vth vs. L for a PMOS transistor
9.1 Lines of equal-consumption with $\mathrm{f}=62.5 \mathrm{MHz}$ in a STM SVT 90 nm technology
9.2 Thirteen 16 bit multipliers plotted on the cells vs. transitions space
10.1 Block schematic of the test circuit
10.2 Schematic of the 64 bit linear feedback shift register
10.3 Probability distribution of the pseudo-random generated data for 500 and 10000 generated data
10.4 Final layout of the demonstrator circuit
10.5 Block view of the demonstrator circuit
10.6 Output pad level converter for different core supply voltages
10.7 Schematic of the PCB used to test the demonstrator circuit
10.8 Expected optimal supply voltage
10.9 Measured optimal supply voltage for chip No. 2
10.10 Measured optimal supply voltage for chip No. 3
10.11 Expected optimal total power consumption
10.12 Measured optimal total power consumption for chip No. 2
10.13 Measured optimal total power consumption for chip No. 3
10.14 Nominal static power distribution for 16 chips
10.15 Nominal dynamic power distribution for 16 chips at 62.5 MHz
10.16 Delay distribution of the RCA SVT multiplier for 16 chips

## List of Tables

1.1 The International Technology Roadmap for Semiconductors [1] (ITRS), update 2006 for low operating power, cost effective high volume MPU.
2.1 Manifestation of specific leakage mechanism in a NMOS transistor depending on polarization
2.2 Gate and sub-threshold leakage current for three different TSMC technologies
4.1 Results of the sub-threshold slope extraction for STM 90 nm lvt
4.2 Results of the DIBL effect coefficient extraction for STM 90nm lvt
4.3 Results for the $\alpha$ factor and $V t h 0$ for STM 90 nm lvt
4.4 Results for the body effect coefficient for STM 90 nm lvt
4.5 $I_{o}$ for a NAND2x2 gate from the STM 90 nm lvt technology
4.6 Technology parameters summary for the STM 90 nm lvt
4.7 Technology parameters summary for the STM 90nm svt
4.8 Technology parameters summary for the STM 90 nm hvt
4.9 Technology parameters summary for the STM 90 nm, $\mathrm{Vdd}=1 \mathrm{~V}$
5.1 Number of CSA levels for some typical multiplier width
5.2 Summary of the multipliers delays and cell counts
6.1 Approximation of $k_{1}$ for STM 90nm technology
6.2 SIA ITRS 2004 expected transistors $I_{\text {On }} / I_{\text {off }}$
6.3 Values of $A$ and $B$ for the three types of STM090 transistors
6.4 Parameters of a 16 bit Wallace multiplier
6.5 Effect of parallelization on architectural parameters
6.6 Effect of pipelining on architectural parameters
7.1 Nominal values for thirteen 16 bit multipliers based on the STM 90 nm technology and transistors of the SVT type.
7.2 Optimal Vdd, Vth and Ptot.
8.1 Optimal total power consumption of thirteen 16 bit multipliers in all STM 90nm technology flavors
9.1 Comparison table between two circuits having a difference of $\Delta N=\left(N_{1}-N_{2}\right)$ cells and $\Delta T r=\left(a_{1} N_{1}-a_{2} N_{2}\right)$ transitions.
9.2 Consumption of the thirteen multipliers in $\mu W$ for $\mathrm{Vdd}=1 \mathrm{~V}$, $\mathrm{Vth}=0.4 \mathrm{~V}$ and $\mathrm{f}=62.5 \mathrm{MHz}$.
9.3 Consumption of the thirteen multipliers in $\mu W$ for $\mathrm{Vdd}=1 \mathrm{~V}$, $\mathrm{Vth}=0.12 \mathrm{~V}$ and $\mathrm{f}=62.5 \mathrm{MHz}$.
10.1 Nominal values of the 4 implemented multipliers. Nominal frequency is 62.5 MHz
10.2 Pin assignments for the APEX EP20K600EFC672 FPGA
10.3 Measured nominal ( $1 \mathrm{~V} @ 62.5 \mathrm{MHz}$ ) power consumption and maximal working frequency

## List of Symbols

| Symbol | Description | Unit |
| :--- | :--- | :--- |
| $V d s$ | Transistor drain-to-source voltage | $[\mathrm{V}]$ |
| $V g s$ | Transistor gate-to-source voltage | $[\mathrm{V}]$ |
| $V b s$ | Transistor bulk-to-source voltage | $[\mathrm{V}]$ |
| $V d d$ | Power supply voltage | $[\mathrm{V}]$ |
| $V t h 0$ | Reference transistor threshold voltage | $[\mathrm{V}]$ |
| $V t h=V t h 0-\eta V d s-\gamma V b s$ | Effective threshold voltage | $[\mathrm{V}]$ |
| $\eta$ | DIBL effect coefficient |  |
| $\gamma$ | Body bias effect coefficient |  |
| $n$ | Sub-threshold slope |  |
| $U t=k_{b} T / q$ | Thermal potential | $[\mathrm{V}]$ |
| $k_{b}=1.38 \mathrm{E}-23$ | Boltzmann constant | $[\mathrm{J} / \mathrm{K}]$ |
| $T$ | Temperature | $[\mathrm{K}]$ |
| $q=1.6 \mathrm{E}-19$ | Elementary charge | $[\mathrm{C}]$ |
| $I_{\text {On }}$ | Transistor on current | $[\mathrm{A}]$ |
| $I_{\text {off }}$ | Transistor off current | $[\mathrm{A}]$ |
| $I_{0}$ | Reference current | $[\mathrm{A}]$ |
| $\mu_{0}$ | Low field mobility | $\left[\mathrm{cm}^{2} / \mathrm{V} / \mathrm{s}\right]$ |
| $\mu_{\text {eff }}$ | Effective carriers mobility | $\left[\mathrm{cm}^{2} / \mathrm{V} / \mathrm{s}\right]$ |
| $\alpha$ | Alpha power law coefficient |  |
| $k_{t}$ | Delay proportional constant |  |
| $C_{i}$ | Capacitance of node i on the critical path | $[\mathrm{F}]$ |
| $C$ | Average cell capacitance | $[\mathrm{F}]$ |
| $L D$ | Logical depth |  |
| $a$ | Circuit activity |  |
| $N$ | Number of cells |  |
| $f$ | Circuit working frequency | $[\mathrm{Hz}]$ |
| $\chi$ | Intrinsic design delay relating $V d d$ to $V t h$ |  |
| $k_{1}=P d y n / P s t a t$ | Dynamic power over static power ratio |  |
| $t_{\text {cout }}$ | Full adder carry out delay | $[\mathrm{s}]$ |
| $t_{\text {sum }}$ | Full adder sum delay | $[\mathrm{s}]$ |
| $t_{d f f}$ | Register delay | $[\mathrm{s}]$ |
| $t_{d f f-\text { setup }}$ | Register setup time | $[\mathrm{s}]$ |
| $t_{F A}$ | Worst case full adder delay | $[\mathrm{s}]$ |
| $t_{\text {bk_adder }}$ | Brent-Kung adder delay | $[\mathrm{s}]$ |

## Chapter 1

## Introduction

### 1.1 Motivations

Digital integrated circuits are found everywhere in modern life and many of them are embedded in mobile devices where only limited power resources are available (e.g. mobile phones, watches, mobile computers, personal assistants, ...). To permit a usable battery runtime, such devices must be designed to consume the lowest possible power. Low power is also very important for non-portable devices. Indeed, a reduced power consumption can greatly decrease the packaging costs and greatly increase the circuit reliability, which is tightly related to the circuit working temperature. For these reasons, low power design is now mandatory for all types of digital circuits.

|  | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Technology node $[\mathrm{nm}]$ | 90 | 65 | 65 | 65 | 45 | 45 | 45 | 32 | 32 |
| Printed gate length $[\mathrm{nm}]$ | 48 | 42 | 38 | 34 | 30 | 27 | 24 | 21 | 19 |
| Transistors Number $[\mathrm{M}]$ | 193 | 386 | 386 | 386 | 773 | 773 | 773 | 1546 | 1546 |
| Chip size $\left[\mathrm{mm}^{2}\right]$ | 88 | 140 | 111 | 88 | 140 | 111 | 88 | 140 | 111 |
| Voltage supply $[\mathrm{V}]$ | 0.9 | 0.8 | 0.8 | 0.8 | 0.7 | 0.7 | 0.7 | 0.6 | 0.6 |
| Internal frequency $[\mathrm{GHz}]$ | 6.7 | 9.2 | 10.9 | 12.3 | 15 | 17 | 20 | 22 | 28 |
| Total power $[\mathrm{W}]$ | 98 | 104 | 111 | 116 | 119 | 119 | 125 | 137 | 137 |

Table 1.1: The International Technology Roadmap for Semiconductors [1] (ITRS), update 2006 for low operating power, cost effective high volume MPU.

As shown in Table 1.1, the number of transistors per circuit will continue to increase as predicted by Moore's law [2], whereas the transistor sizes will continue to shrink. Despite a decreased supply voltage, the total power will continue to increase.

The reduction of the supply voltage is dictated by the need to maintain the electric field constant on the ever shrinking gate oxide. Unfortunately, to keep transistor speed (proportional to the transistor "on" current) acceptable, the threshold voltage must be reduced too, which results in an exponential increase of the "off" transistor current, i.e. the current constantly flowing through the transistor even when it should be "non-conducting".


Figure 1.1: The left graph shows the transistor off-state current versus the gate length; squares indicate pre-production transistors and diamonds indicate research devices. The histogram on the right shows the total power as a function of technology node, for a fixed (30 m) total transistor width. Source: Intel [3].

The left part of Fig. 1.1 shows the exponential increase of static power for real transistors of various sizes. By looking at the right part of Fig. 1.1, we can observe that this exponential increase of the static power can reach a point (starting on the 90 nm node on the histogram) where it completely cancels the benefit of a reduced dynamic power (due to reduced capacitances and supply voltage).

Since static consumption is now an important contributor to the total power, the design methodologies used in the past, based on dynamic power considerations only, are no longer effective and need to be reconsidered.

Until recently, static power was only relevant when the circuit was idle. This explains why many static power reduction techniques are only applicable when blocks are unused. A typical example is the Gated-Vdd approach [4] [5] [6], where a transistor is placed between the real supply voltage and a virtual supply voltage, allowing unused blocks to be powered off. However, Fig. 1.1 clearly shows that static power reduction should now be tackled in running mode, too.

Moreover, the large majority of the existing leakage reduction techniques apply at circuit and transistor level. Examples are:

- Multi Vth technology, with fast low Vth transistors on critical paths and slow high $V$ th transistors outside critical paths (MTCMOS) [7] [8] [9] [10]
- Electrical regulation of $V$ th (VTCMOS, SATS) [11] [12] [13]
- DTCMOS (Dynamic Vth) with transistor bodies connected to MOS gates [4] [14]

This thesis considers the reduction of the total power, i.e. dynamic plus static contributions, at a high level and during runtime. In essence, low power consumption is sought through architectural and technology modifications in modern nanometer CMOS processes.

### 1.2 Thesis outline

In Chapter 2, the main sources of power consumption in CMOS technologies are reviewed, with an emphasis on the static ones. This makes it possible to define the delay and power models in Chapter 3. These models are used extensively throughout the thesis and are hence considered the foundation of this work. In Chapter 4, the 90 nm CMOS technology from ST Microelectronics is described in detail and the required model parameters are derived from SPICE-like simulations. Chapter 5 illustrates and describes the different multiplier architectures used in the various examples and case studies. In Chapter 6, the models for a total power consumption comparison in the case where the supply voltage and the threshold voltage are freely modifiable are derived. In particular, this chapter shows that, under such conditions, the total power consumption (for a given delay) presents a minimum. The application of these models to architectural modifications is reported in Chapter 7, followed by a similar analysis for technology modifications in Chapter 8. A different situation is considered in Chapter 9, where total power comparison models and charts are obtained for the case where the supply and threshold voltages are fixed. Finally, Chapter 10 reports the power consumption of a circuit manufactured in a 90 nm technology. This circuit is composed of 4 multipliers presenting different combinations of architecture and technology modifications. The thesis closes with the conclusions in Chapter 11.

### 1.3 Contributions

The main contributions provided by this thesis are:

- Chapter 2-3: Collection and description of existing models for static power, dynamic power, total power and delay.
- Chapter 4: Complete characterization of the ST Microelectronics 90 nm general purpose technology for all three available transistor types (LVT, SVT, HVT).
- Chapter 5: Detailed description and classification of thirteen multiplier architectures.
- Chapter 6: Development and analysis of closed-form equations for optimal total power, optimal supply voltage and optimal threshold voltage in a scenario where supply and threshold voltages are freely tunable.
- Chapter 7: Application of the theory presented in Chapter 6 to architecture modifications.
- Chapter 8: Application of the theory presented in Chapter 6 to technology modifications.
- Chapter 9: Development and application of easy-to-use equations and graphical tools for architecture comparison under fixed supply and threshold voltage conditions.
- Chapter 10: Implementation, testing and analysis of a physical realisation of 4 multipliers representing different combinations of technology flavors and architectures.


## Chapter 2

## Sources of dissipation in CMOS transistors

Circuits designed before 1980 were mainly implemented in NMOS technology. Such devices had the major drawback of a large current constantly flowing through the circuit even when no transitions occurred. To solve this issue, CMOS (Complementary Metal Oxide Semiconductor) technology was introduced. This seemed to be the ultimate solution for avoiding static power consumption. Thus, the only remaining sources of dissipation were the switched capacitance power (due to the charging/discharging of node capacitances) and the shortcut power (due to the current flowing from the supply voltage ( $V d d$ ) to the ground ( $V s s$ ) when switching), both only present during node transitions.

Unfortunately, the constant dimension reduction driven by Moore's law and the corresponding reduction of the supply voltage (needed to maintain the electric field on the transistor gates constant) yielded a huge increase of the static power consumption, making it once again a non-negligible source of consumption. There are two main reasons for this. The first is the reduction of the threshold voltage imposed by the $V d d$ reduction in order to keep the speed acceptable. The second is the set of new electrical effects caused by the reduction of the transistor geometrical dimensions, known as short channel effects.

Starting from the $0.13 \mu m$ technology node (i.e. a technology with a minimal transistor size of $0.13 \mu \mathrm{~m}$ ), the static power consumption cannot be neglected anymore and must be added to the dynamic power to correctly estimate the total power consumption.

In this chapter, the sources of dissipation in CMOS transistors are discussed in detail, with a special focus on those contributing to the static consumption.

### 2.1 Dynamic consumption

Dynamic consumption is considered as the dissipation that occurs only when the circuit is active (i.e. internal circuit nodes are switching).

Two distinct contributions exist. The first is the so called switching energy and corresponds to the energy required to charge (and discharge) the node capacitances during transitions. The second is the energy dissipated during transitions due to the conductive path existing, for a short period of time, between the supply voltage and the ground. This effect is known as shortcut or short-circuit.

### 2.1.1 Switching energy

The energy consumed to charge (and then discharge) a capacitance $C$ to a voltage $V$ is given by [7]:

$$
\begin{equation*}
\text { Capacitance switching energy }=C V^{2} \tag{2.1}
\end{equation*}
$$

This type of consumption can easily be reduced from one technology node to the next by reducing the capacitance $C$ and the supply voltage $V$. Both reductions are effectively obtained in a new scaled technology; in fact, the supply voltage has to be reduced in order to avoid high electric fields on the transistor gates, and the reduction of the transistor physical dimensions automatically results in reduced capacitances. This type of dissipation was the primary source of consumption in active mode for circuits implemented in technology nodes larger than $0.13 \mu m$ [15].
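As a rough numerical illustration of Eq. (2.1), the following Python sketch evaluates $C V^{2}$ for one node at several supply voltages. The node capacitance of 2 fF and the function name `switching_energy` are assumptions chosen only for illustration; they are not values taken from this work.

```python
def switching_energy(capacitance_f: float, voltage_v: float) -> float:
    """Energy in joules to charge and then discharge a capacitance, Eq. (2.1)."""
    return capacitance_f * voltage_v ** 2


# Hypothetical node capacitance of 2 fF (illustrative value only).
C_NODE = 2e-15

for vdd in (1.2, 1.0, 0.6):
    energy = switching_energy(C_NODE, vdd)
    print(f"Vdd = {vdd:.1f} V -> E = {energy * 1e15:.2f} fJ per charge/discharge cycle")
```

Because the dependence on the voltage is quadratic, halving the supply voltage divides the switching energy by four, whereas halving the capacitance only divides it by two.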

### 2.1.2 Shortcut energy

The second source of dynamic consumption arises from shortcut paths. Consider a CMOS inverter (Fig. 2.1) with the input node at zero. In this condition the NMOS transistor is off and the PMOS transistor is conducting. Now, if the input node potential increases from 0 to $V d d$, the NMOS will start to conduct for Vin $>$ Vth_nmos while the PMOS is still on, which results in a current flowing from $V d d$ to $V s s$. Then, when Vin reaches the potential $V d d-V t h \_p m o s$, the PMOS stops conducting and the shortcut current vanishes.

Clearly, this type of conduction only exists if the supply voltage $V d d$ is greater than the sum of the NMOS and PMOS threshold voltages (Vth_nmos $+$ Vth_pmos).


Figure 2.1: CMOS inverter

The energy dissipated during one transition can be expressed as [16]:

$$
\begin{equation*}
\text { Shortcut energy per transition } \propto\left(V d d-V t h \_n m o s-V t h \_p m o s\right)^{3} \cdot \tau \tag{2.2}
\end{equation*}
$$

Here, Vdd is the supply voltage, Vth_nmos and Vth_pmos are the threshold voltages of the NMOS and PMOS transistors, respectively, and $\tau$ is the transition time, i.e. the time needed to sweep the input voltage from 0 to $V d d$. More accurate models can be found in [17] [18] [19].

For well designed cells (i.e. with balanced rising and falling edges), the shortcut energy is in general much smaller than the switching energy. Moreover, for very low supply voltage designs, the value $V d d$ - Vth_nmos - Vth_pmos can be very small. Additionally, the case where $V d d<V$ th_nmos $+V$ th_pmos will not present shortcut dissipation at all. For these reasons, in modern designs, shortcut power is often not considered or is simply included in the switching consumption by increasing the switching capacitance to an equivalent capacitance which incorporates the shortcut effect.
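The behaviour described by Eq. (2.2) can be sketched in a few lines of Python. Since Eq. (2.2) is only a proportionality, the sketch below sets the unknown constant to 1 (an assumption), so only ratios between operating points are meaningful. The function name, the threshold voltages of 0.3 V (taken as magnitudes) and the transition time of 100 ps are hypothetical values used for illustration.

```python
def relative_shortcut_energy(vdd: float, vth_nmos: float, vth_pmos: float, tau: float) -> float:
    """Quantity proportional to the shortcut energy per transition, Eq. (2.2).

    The proportionality constant is assumed to be 1, so only ratios matter.
    Threshold voltages are given as positive magnitudes.
    """
    overdrive = vdd - vth_nmos - vth_pmos
    if overdrive <= 0:
        # Vdd below the sum of the thresholds: no conductive path, no shortcut energy.
        return 0.0
    return overdrive ** 3 * tau


# Hypothetical values, for illustration only.
VTH_N = 0.3   # NMOS threshold voltage [V]
VTH_P = 0.3   # PMOS threshold voltage magnitude [V]
TAU = 100e-12  # input transition time [s]

reference = relative_shortcut_energy(1.0, VTH_N, VTH_P, TAU)
for vdd in (1.0, 0.7, 0.5):
    value = relative_shortcut_energy(vdd, VTH_N, VTH_P, TAU)
    print(f"Vdd = {vdd:.1f} V -> relative shortcut energy = {value / reference:.4f}")
```

With these assumed thresholds, the cubic dependence makes the shortcut contribution collapse quickly as $V d d$ approaches the sum of the threshold voltages, and it is zero below that sum.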

### 2.2 Static consumption

Contrary to the dynamic consumption, static power is defined as the consumption originating from currents constantly flowing from $V d d$ to ground. This means that even when the circuit is in idle mode (no transition occurs), power continues to be dissipated. For long channel transistors with high threshold voltage, this type of dissipation was completely negligible. Unfortunately, present and future technologies will suffer from high static power, which could even exceed the dynamic contribution in active mode. Hence, it is of utmost importance to consider this type of dissipation in present and future design methodologies.

To understand the main sources of static dissipation, let us look at the structure of a transistor in CMOS technology. Fig. 2.2 shows 5 different leakage mechanisms that can be observed in a CMOS transistor (only the NMOS transistor is illustrated, as PMOS behaves exactly in the same way).

These mechanisms are:
(a) Sub-threshold current;
(b) Gate leakage current;
(c) Reverse-bias p-n junction current and band to band tunneling;
(d) Gate-Induced Drain Leakage (GIDL) current;
(e) Punchthrough current.



Figure 2.2: Sources of static power consumption in a NMOS transistor

### 2.2.1 Sub-threshold current

The most important leakage current is the sub-threshold current, which originates from the diffusion of minority carriers in a non-conducting transistor ( $V_{\text {gate }}-V_{\text {source }}<V t h$ ). Under this condition, the transistor is operating in weak inversion. The potential applied between drain and source creates a flow of the minority carriers on the surface of the channel. The equation describing this mechanism is [20] [21]:

$$
\begin{equation*}
I_{\text {Sub-threshold }}=I_{o} \cdot e^{-\frac{V t h}{n U t}}\left(1-e^{-\frac{V d s}{U t}}\right) \approx I_{o} \cdot e^{-\frac{V t h}{n U t}} \tag{2.3}
\end{equation*}
$$

With $I_{o}$ the reference static current, $V t h$ the threshold voltage, $n$ the sub-threshold slope, $U t\left(\equiv k_{b} T / q\right)$ the thermal potential and $V d s$ the Drain-Source voltage.

Eq. (2.3) shows an exponential dependency of the sub-threshold current on the threshold voltage Vth. This is the reason why the low Vth characterizing recent technologies leads to large sub-threshold currents. Moreover, in typical digital designs, $V d s$ is much larger than $n U t$, which leads to the approximation $1-e^{-\frac{V d s}{U t}} \approx 1$.
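The exponential dependency of Eq. (2.3) can be made concrete with a short numerical sketch. It assumes the values $n=1.70$ and $Ut=0.02588$ V reported in Chapter 4 for the STM 90 nm lvt flavor; the function names are purely illustrative. With these values, a threshold voltage reduction of roughly 100 mV multiplies the sub-threshold current by ten.

```python
import math

N_SLOPE = 1.70   # sub-threshold slope n (Table 4.1, STM 90 nm lvt)
UT = 0.02588     # thermal potential Ut [V] at 27 degC (Table 4.1)

def leakage_factor(delta_vth, n=N_SLOPE, ut=UT):
    """Multiplicative change of the Eq. (2.3) current when Vth drops by delta_vth [V]."""
    return math.exp(delta_vth / (n * ut))

def swing_per_decade(n=N_SLOPE, ut=UT):
    """Vth reduction [V] that multiplies the sub-threshold current by ten."""
    return n * ut * math.log(10.0)

print(f"swing = {swing_per_decade() * 1e3:.1f} mV per decade")
print(f"100 mV lower Vth -> leakage x {leakage_factor(0.100):.2f}")
```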

The value of $V t h$ is not fixed for a given technology; in fact, it can be modulated through different effects like:

- Drain Induced Barrier Lowering (DIBL) effect: In short channel transistors, the potential on the drain contact modulates the threshold voltage by lowering the energy barrier at the surface of the channel. A schematic representation of this effect is illustrated in Fig. 2.3. For long channel transistors (L1), the potential in the channel is independent of the drain voltage ( $V d 1$ and $V d 2$ show the same potential profile), whereas for short channels (L2), an increase of the drain voltage also reduces the barrier energy level in the channel, which can be modeled by a reduction of the threshold voltage. Ideally, the DIBL effect does not change the sub-threshold slope $n$. DIBL can be reduced by using high surface and channel doping and shallow source/drain junction depths.
- Body effect: The body effect appears when a potential difference is present between body (bulk) and source. This happens because bulk and source operate as a reverse biased p-n junction. By increasing the body potential in an NMOS or by decreasing it in a PMOS (forward biasing), the junction depletion reduces the channel potential and the sub-threshold leakage current increases. Conversely, moving the body potential below Vss for an NMOS or above $V d d$ for a PMOS (reverse biasing) increases the channel potential, leading to a reduced sub-threshold leakage. It should be noted that for body-source potentials (Vbs) higher than 0.5 V the p-n junction starts to conduct as a forward biased diode, drawing a very large current, which has to be avoided at all costs. The body effect is more pronounced for high bulk doping levels and decreases as the substrate reverse bias increases. At $V b s=0$, the body effect sensitivity is equal to ( $n-1$ ), with $n$ the sub-threshold slope. The body effect can be modeled as a modification of the threshold voltage $V t h$.


Figure 2.3: Effect of Drain Induced Barrier Lowering (DIBL) on short channel transistors

By considering the effects of DIBL and body bias, the threshold voltage can be expressed by [22] [23] [24]:

$$
\begin{equation*}
V t h=V t h 0-\eta V d s-\gamma V b s \tag{2.4}
\end{equation*}
$$

With $V t h 0$ the reference threshold voltage for $V d s=V b s=0, \eta$ (eta) the DIBL effect coefficient and $\gamma$ (gamma, equal to $\mathrm{n}-1$ for $V b s=0$ ) the linearized body effect coefficient.

By considering the described effects, the sub-threshold current can be expressed as:

$$
\begin{equation*}
I_{\text {sub-threshold }}=I_{o} \cdot e^{-\frac{V t h 0-\eta V d s-\gamma V b s}{n U t}} \tag{2.5}
\end{equation*}
$$
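As a minimal sketch of Eqs. (2.4) and (2.5), the following code evaluates how DIBL and body bias scale the sub-threshold current. The numerical values are the STM 90 nm lvt ones from Chapter 4; $\gamma=n-1$ is only the $V b s=0$ value quoted above, $I_{o}$ is normalized to 1, and the function names are illustrative assumptions.

```python
import math

def vth_effective(vth0, eta, vds, gamma, vbs):
    """Eq. (2.4): threshold voltage lowered by DIBL and by body bias."""
    return vth0 - eta * vds - gamma * vbs

def i_subthreshold(i_o, vth0, eta, vds, gamma, vbs, n, ut):
    """Eq. (2.5): sub-threshold current, valid for Vds much larger than Ut."""
    return i_o * math.exp(-vth_effective(vth0, eta, vds, gamma, vbs) / (n * ut))

# lvt values from Chapter 4; gamma = n - 1 is the Vbs = 0 value quoted in the text.
params = dict(i_o=1.0, vth0=0.342, eta=0.087, gamma=0.70, n=1.70, ut=0.02588)

base = i_subthreshold(vds=1.0, vbs=0.0, **params)
no_dibl = i_subthreshold(vds=1.0, vbs=0.0, **{**params, "eta": 0.0})
reverse = i_subthreshold(vds=1.0, vbs=-0.2, **params)

print(f"DIBL multiplies the leakage by {base / no_dibl:.2f}")
print(f"Vbs = -0.2 V multiplies the leakage by {reverse / base:.3f}")
```

Because both effects enter the exponent, even a small $\eta$ or a moderate reverse body bias changes the leakage by a large factor.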

### 2.2.2 Gate leakage current

The transistor gate potential influences the charges in the channel by electrostatic effect: an accumulation of holes in the gate produces an accumulation of electrons at the surface of the channel, obtaining exactly the behavior of a capacitance with gate and channel as poles and the silicon oxide as dielectric. Ideally, no current should occur across the gate oxide, but practically some electrons are able to pass through the oxide, generating a gate current. The mechanisms behind this effect can be divided into two categories: oxide tunneling and hot carrier injection.

## Oxide tunneling current

Tunneling through the gate oxide is primarily due to direct tunneling across very thin oxide layers (less than 3-4 nm). A model for this effect has been reported in [25] [26]:

$$
\begin{equation*}
I_{\text {gate }}=K_{g} \cdot W\left(\frac{V}{t_{o x}}\right)^{2} e^{-\alpha_{g} t_{o x} / V} \tag{2.6}
\end{equation*}
$$

With $K_{g}$ and $\alpha_{g}$ (alpha_gate) experimentally derived constants, $W$ the width of the transistor, $t_{o x}$ the gate oxide thickness and $V$ the potential across the gate oxide. The previous equation clearly shows how the reduction of the oxide thickness exponentially increases the tunneling effect. An efficient way to reduce this source of leakage in future technologies is to use other insulators with a higher dielectric constant, resulting in a higher effective oxide thickness (i.e. the thickness of the silicon oxide that would show the same behavior as this high dielectric insulator). In this way, it should be possible to maintain the gate tunneling current at acceptable (i.e. negligible) levels. The main candidates to substitute the silicon oxide $(\kappa=3.9)$ are hafnium oxide $\left(\mathrm{HfO}_{2}, \kappa=25\right)$ and hafnium silicate $\left(\mathrm{HfSiO}_{4}, \kappa=11\right)$ [27].

## Hot carrier injection

Due to the high electric field at the $\mathrm{Si}-\mathrm{SiO}_{2}$ (channel-oxide) interface, electrons and holes can gain sufficient energy to enter the gate oxide. Because the effective mass of the electrons, as well as their barrier height, is lower than the corresponding ones for holes, electron injection is much more probable [28]. A reduction of the supply voltage reduces the electric field on the gate, and thereby also the hot carrier injection.

[^2]
### 2.2.3 Reverse bias p-n junction leakage and band to band tunneling

In the normal transistor operation mode, the drain/source to well junctions are reverse biased. Under this condition, a small current exists due to the drift of carriers originated by the thermal electron-hole generation. Nevertheless, in advanced short channel MOS (where heavily doped and shallow junctions are used), such effects are masked by the dominating band-to-band tunneling.

Band to band tunneling happens on junctions with high electric field ( $>10^{6} \mathrm{~V} / \mathrm{cm}$ ) and is due to the direct tunneling of electrons from the valence band of the p region to the conduction band of the n region. Closed form equations describing this type of leakage exist [25] [29].

### 2.2.4 Gate-Induced Drain Leakage (GIDL)

In the overlapping zone between gate and drain, a high electric field can exist, leading to the generation of currents from drain to substrate. Consider an NMOS transistor; when a low gate potential is applied ( $V g$ near zero volts or below), holes accumulate at the surface and create a region which is more heavily p doped than the substrate. If this happens while the drain is connected to a high potential (say $V d d$ ), the depletion layer near the drain becomes narrower. If this is important enough to invert the polarity of the $\mathrm{n}+$ drain region under the gate, high field effects like band-to-band tunneling, avalanche multiplication and trap-assisted tunneling take place. As a consequence, minority carriers are emitted in the drain region underneath the gate and pushed to the substrate due to the vertical electric field. All these effects are increased by a reduction of the gate oxide thickness.

This type of leakage is especially important for "relatively high" supply voltage circuits $(V d d>1.1 \mathrm{~V})$. Low power digital designs, with very low supply voltage (i.e. $V d d$ around 0.5 V ), are not heavily affected by this type of leakage. More details on the GIDL effect can be found in [30] [28] [25].

The equivalent of the GIDL effect for a "high" source potential is called GISL (Gate-Induced Source Leakage). This effect is generally not considered because, in normal transistor operations, the source will show a low or zero potential compared to the bulk.

### 2.2.5 Punchthrough

With the reduction of the physical dimensions, the depletion layers of source and drain come nearer and nearer until they touch each other, giving rise to punchthrough currents. In submicron MOS transistors, implants at the substrate surface aimed at Vth adjustment are used, forcing the punchthrough to occur deeper in the substrate. The size of the depletion regions directly depends on the $V d s$ potential. Hence, low voltage design can prevent the generation of punchthrough currents [31] [25].

### 2.3 Summary

In deep sub-micron and nanometer technologies, the dynamic power consumption is no longer the only relevant source of power dissipation. In fact, present and future technologies will be characterized by large static power consumption coming from different leakage sources. In this chapter, the principal ones have been explained. However, it is important to observe that, depending on the transistor polarization, only a part of the described mechanisms occur. All realistic combinations of polarization are shown in Table 2.1 for a NMOS transistor.

| Vg | Vd | Vs | Sub-threshold | Gate leakage | p-n junction | GIDL/GISL | Punchthrough |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | NO | NO | NO | NO | NO |
| 0 | 0 | 1 | YES | NO | YES | GISL | YES |
| 0 | 1 | 0 | YES | NO | YES | GIDL | YES |
| 0 | 1 | 1 | NO | NO | YES | BOTH | NO |
| 1 | 0 | 0 | NO | YES | NO | NO | NO |
| 1 | 1 | 1 | NO | YES | YES | NO | NO |

Table 2.1: Manifestation of specific leakage mechanism in a NMOS transistor depending on polarization

In a typical CMOS digital design, the NMOS transistor will have two modes of operation: $\mathrm{Vg} / \mathrm{Vd} / \mathrm{Vs}=0 / 1 / 0$ for the off transistor and $\mathrm{Vg} / \mathrm{Vd} / \mathrm{Vs}=1 / 0 / 0$ for a conducting transistor. When the transistor is on (conducting), the only mechanism that occurs is the gate leakage, whereas for an off transistor, sub-threshold, p-n junction, GIDL and punchthrough could be present. Nevertheless, the use of very low supply voltage (less than 1 V ) maintains the p-n junction and the punchthrough effects much lower compared to the sub-threshold one. Moreover, for gate potentials no lower than Vss for NMOS and not higher than Vdd for PMOS, the GIDL mechanism can also be neglected.

To summarize, the main sources of static power are the sub-threshold current for off transistors and gate leakage for conducting transistors.
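For use in scripts, Table 2.1 can be transcribed as a simple lookup structure. The sketch below is only one possible encoding; the dictionary name, the mechanism labels and the helper function are illustrative choices, while the entries themselves are copied from the table.

```python
# Keys are (Vg, Vd, Vs) logic levels; values are the mechanisms marked in Table 2.1.
LEAKAGE_BY_BIAS = {
    (0, 0, 0): set(),
    (0, 0, 1): {"sub-threshold", "p-n junction", "GISL", "punchthrough"},
    (0, 1, 0): {"sub-threshold", "p-n junction", "GIDL", "punchthrough"},
    (0, 1, 1): {"p-n junction", "GIDL", "GISL"},
    (1, 0, 0): {"gate"},
    (1, 1, 1): {"gate", "p-n junction"},
}

def active_mechanisms(vg, vd, vs):
    """Return the leakage mechanisms listed in Table 2.1 for one bias point."""
    return LEAKAGE_BY_BIAS[(vg, vd, vs)]

print("off transistor:", sorted(active_mechanisms(0, 1, 0)))
print("on transistor: ", sorted(active_mechanisms(1, 0, 0)))
```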

| Cell | Parameter | CLN90G | CL013GHP | CL013LVHP |
| :--- | :--- | :---: | :---: | :---: |
|  | Transistor size [nm] | 90 | 130 | 130 |
|  | Vdd [V] | 1.0 | 1.2 | 1.0 |
| INVD1 | Sub-threshold current [nW] | 5.14 | 0.56 | 6.93 |
|  | Gate leakage current [nW] | 0.82 | 0.10 | 0.34 |
| NAND2D1 | Sub-threshold current [nW] | 4.91 | 0.61 | 6.63 |
|  | Gate leakage current [nW] | 1.40 | 0.15 | 0.56 |

Table 2.2: Gate and sub-threshold leakage current for three different TSMC technologies

Table 2.2 reports the sub-threshold and gate leakage power dissipation in 3 recent technologies. Values are reported for an inverter (INVD1) and a 2 input NAND gate (NAND2D1)[32]. We can observe that sub-threshold current remains the principal source of static power dissipation in deep sub-micron and nanometer technologies. The next generations could see an exponential increase in the gate leakage if silicon oxide is still used as insulator. Luckily, referring to [33], high dielectric constant oxide should be used starting from 2007. Intel also announced at the end of January 2007 [34] that high-k gate oxide will be used in their 45nm technology for the new generation of the Intel Core 2 Duo, Intel Core 2 Quad and Xeon families of multi-core processors.
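The relative weight of the two contributions in Table 2.2 can be checked with a few lines of code. The values are copied from the table; the helper name `gate_share` is an illustrative assumption. For every listed cell and technology, gate leakage stays below a quarter of the sum of the two contributions.

```python
# Leakage values in nW copied from Table 2.2: (sub-threshold, gate).
TABLE_2_2 = {
    ("CLN90G", "INVD1"): (5.14, 0.82),
    ("CLN90G", "NAND2D1"): (4.91, 1.40),
    ("CL013GHP", "INVD1"): (0.56, 0.10),
    ("CL013GHP", "NAND2D1"): (0.61, 0.15),
    ("CL013LVHP", "INVD1"): (6.93, 0.34),
    ("CL013LVHP", "NAND2D1"): (6.63, 0.56),
}

def gate_share(sub_nw, gate_nw):
    """Fraction of the listed static dissipation that is due to gate leakage."""
    return gate_nw / (sub_nw + gate_nw)

for (tech, cell), (sub_nw, gate_nw) in TABLE_2_2.items():
    print(f"{tech:10s} {cell:8s} gate share = {gate_share(sub_nw, gate_nw):.1%}")
```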

## Chapter 3

## Delay and power models

### 3.1 Current models

As stated in Chapter 2, the main contribution to static power comes from sub-threshold currents flowing from drain to source in off transistors. In short channel transistors ( $\mathrm{L}<1 \mu m$ ), the voltage applied between drain and source also influences the channel conduction by a mechanism known as Drain Induced Barrier Lowering (DIBL). The resulting off current is:

$$
\begin{equation*}
I_{\mathrm{off}}=I_{o} e^{-\frac{V t h 0-\eta V d d-\gamma V b s}{n U t}}=I_{o} e^{-\frac{V t h}{n U t}} \tag{3.1}
\end{equation*}
$$

With $I_{o}$ the reference static current, $V t h 0$ the reference threshold voltage, $V t h$ the modulated threshold voltage, $\eta$ the DIBL effect coefficient, $V d d$ the supply voltage, $\gamma$ the body effect coefficient, $V b s$ the body-source voltage, $n$ the sub-threshold slope and $U t$ the thermal potential $(\equiv k T / q)$.

The "on" current, i.e. the current flowing in a conducting transistor can be approximated by the following formula [35] [36] [37] [38]:

$$
\begin{equation*}
I_{\mathrm{On}}=I_{o}\left(\frac{e}{\alpha n U t}\right)^{\alpha}\left(V_{d d}-V_{t h}\right)^{\alpha} \tag{3.2}
\end{equation*}
$$

With $I_{o}$ the reference static current, $e$ Euler's number, $\alpha$ the alpha power law coefficient, $n$ the sub-threshold slope, $U t$ the thermal potential, $V d d$ the supply voltage and $V t h\left(\equiv V t h 0-\eta V_{d d}-\gamma V_{b s}\right)$ the effective threshold voltage.

This model is an empirical fitting equation that accounts for the carriers mobility reduction. According to [39], the parameter $\alpha$ can be related to mobility by:

$$
\begin{equation*}
\alpha=1+\frac{\mu_{\mathrm{eff}}}{\mu_{0}} \tag{3.3}
\end{equation*}
$$

With $\mu_{\text {eff }}$ the effective carrier mobility and $\mu_{0}$ the low field mobility. Since $0<\mu_{\text {eff }} \leq \mu_{0}$, the parameter $\alpha$ always lies between 1 and 2, with $\alpha=2$ for long channel transistors.
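Eqs. (3.1) and (3.2) can be evaluated side by side with the sketch below. It uses the STM 90 nm lvt parameters extracted in Chapter 4 and a normalized $I_{o}=1$, since $I_{o}$ cancels in the ratio $I_{\mathrm{On}} / I_{\mathrm{off}}$; the function names and the choice $V d d=1.0$ V, $V b s=0$ are assumptions made for illustration only.

```python
import math

def vth_effective(vth0, eta, vdd, gamma=0.0, vbs=0.0):
    """Modulated threshold voltage Vth = Vth0 - eta*Vdd - gamma*Vbs."""
    return vth0 - eta * vdd - gamma * vbs

def i_off(i_o, vth, n, ut):
    """Eq. (3.1): off current of a transistor in weak inversion."""
    return i_o * math.exp(-vth / (n * ut))

def i_on(i_o, vdd, vth, alpha, n, ut):
    """Eq. (3.2): alpha power law on current."""
    return i_o * (math.e / (alpha * n * ut)) ** alpha * (vdd - vth) ** alpha

# STM 90 nm lvt values from Tables 4.1 to 4.3.
N, UT, ETA, ALPHA, VTH0 = 1.70, 0.02588, 0.087, 1.56, 0.342
VDD = 1.0

vth = vth_effective(VTH0, ETA, VDD)
ratio = i_on(1.0, VDD, vth, ALPHA, N, UT) / i_off(1.0, vth, N, UT)
print(f"Vth = {vth:.3f} V, Ion/Ioff = {ratio:.3g}")
```

One can also verify that Eq. (3.2) evaluates to $I_{o} e^{\alpha}$ when $V d d-V t h=\alpha n U t$, which is the value the weak inversion exponential takes at the same gate overdrive.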

Based on these equations, it is now possible to define the dynamic and static power consumption as well as delay models.

### 3.2 Power models

As illustrated in the previous chapter, the total power can be divided into two categories: dynamic and static power.

### 3.2.1 Dynamic power

Dynamic power is due to the dissipation during the capacitances charge/discharge process. The well known equation describing it is:

$$
\begin{equation*}
P_{\mathrm{dyn}}=\left(\sum_{i}^{N} a_{i} C_{i}\right) f \cdot V_{d d}^{2}=a C N f V_{d d}^{2} \tag{3.4}
\end{equation*}
$$

With $a_{i}$ the switching probability per clock period of the node $i, C_{i}$ is the capacitance of node $i$ plus the internal cell capacitance driven by node $i, f$ is the circuit frequency, $V d d$ the supply voltage, $N$ the number of cells, $a$ the average activity per cell better understood as the average number of switching cells over the number of total cells during a clock cycle and $C$ is the equivalent capacitance defined as $\left(\sum_{i} a_{i} C_{i}\right) / a N$. Using the proposed definition of activity, only the transitions from 0 to 1 are considered.

The expression of $a C N$ using average parameters must be treated carefully. First, the average activity on the net is considered the same as the average activity in the cells, moreover the equivalent capacitance $C$ is only equal to the average cell capacitance (net + internal cell) when all cells present the same activity, which is practically never the case. Therefore, $C$ depends on activity distribution. For this reason, every time the parameters $a C N$ are used together in equations, they must be considered as $\sum_{i} a_{i} C_{i}$, rather than average activity times average capacitance times number of cells.
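The difference between the equivalent capacitance $C$ and the plain average cell capacitance can be illustrated with a small sketch. The four activities and capacitances below are invented values chosen only to make the effect visible; the function names are illustrative.

```python
def switched_capacitance(activities, capacitances):
    """sum_i a_i * C_i, the quantity that a*C*N stands for in Eq. (3.4)."""
    return sum(a * c for a, c in zip(activities, capacitances))

def dynamic_power(activities, capacitances, f, vdd):
    """Eq. (3.4) in its exact form: (sum_i a_i C_i) * f * Vdd^2."""
    return switched_capacitance(activities, capacitances) * f * vdd ** 2

def equivalent_capacitance(activities, capacitances):
    """C = (sum_i a_i C_i) / (a * N), with a the average activity."""
    n_cells = len(activities)
    a_avg = sum(activities) / n_cells
    return switched_capacitance(activities, capacitances) / (a_avg * n_cells)

# Hypothetical circuit: one busy, heavily loaded node and three quiet, light ones.
activities = [0.5, 0.1, 0.1, 0.1]
capacitances = [10e-15, 2e-15, 2e-15, 2e-15]   # farads

c_eq = equivalent_capacitance(activities, capacitances)
c_avg = sum(capacitances) / len(capacitances)
print(f"equivalent C = {c_eq * 1e15:.2f} fF, average C = {c_avg * 1e15:.2f} fF")
print(f"Pdyn at 100 MHz, 1.0 V = {dynamic_power(activities, capacitances, 100e6, 1.0):.3e} W")
```

In this example the equivalent capacitance is larger than the average capacitance because the most active node is also the most heavily loaded one; with identical activities on all nodes the two quantities coincide.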

A second contribution to dynamic power comes from the shortcut dissipation due to current flowing from $V d d$ to $V s s$ during node transition. As seen in Chapter 2, this contribution is nonexistent for supply voltage $V d d$ smaller than NMOS plus PMOS threshold voltages, and is very small for $V d d$ near $V t h n+V t h p$. Moreover, the quick transition time, typically present in current technologies, further reduces the shortcut dissipation. Thus, this source of dynamic power can simply be accounted for by lumping this effect into the cell capacitance, which will increase slightly.

### 3.2.2 Static power

This new source of dissipation coming from non-ideal transistor behavior is particularly important in deep submicron technologies and can become the main contributor even in running mode. Moreover, this type of consumption is always present as long as the circuit is supplied. Hence, even when the circuit does nothing (idle mode), static power continues to be dissipated. For simplicity of the model, only the main contributor (i.e. sub-threshold current) is considered. For a detailed discussion on the other existing sources of static power consumption, please refer to Chapter 2.

Static power model is given by:

$$
\begin{equation*}
P_{\mathrm{stat}}=V_{d d} \cdot \sum_{i}^{N} I_{\mathrm{off}}(i)=N \cdot V_{d d} \cdot I_{o} e^{-\frac{V t h}{n U t}}=N \cdot V_{d d} \cdot I_{o} e^{-\frac{V t h 0-\eta V d d-\gamma V_{b s}}{n U t}} \tag{3.5}
\end{equation*}
$$

With $N$ the number of cells, $V d d$ the supply voltage, $I_{o}$ the cell reference current, $n$ the sub-threshold slope, $U t$ the thermal potential, $V t h$ the modulated threshold voltage, $V t h 0$ the reference threshold voltage, $\eta$ the DIBL coefficient and $\gamma$ the body bias coefficient.

It is important to note that $I_{o}$ in Eq. (3.5) is the average reference off-current per cell. This factor is different from the single transistor reference off-current, because complex cells present a modified $I_{o}$ due to stack effect, different transistor sizing, etc. According to [40], the ratio $k_{\text {design }}=\left(I_{o, \text { cell }}\right) /\left(I_{o, \text { transistor }}\right) /(\# \text { of transistors })$ is about 1.4 for flip-flops, 2.0 for latches, 1.2 for 6 T RAM cells and 11 for static logic. We carried out the same calculation for a few cells with a driving force of 2 in the STM 90nm SVT technology and our results show a $k_{\text {design }}$ spanning a slightly narrower range; in fact, we obtain a $k_{\text {design }}$ of 7.3 for a NAND gate, 6.5 for an AND gate, 2.5 for a flip-flop and 3.7 for a full adder. Nevertheless, this shows that the static power consumption per cell can vary from cell to cell. For this reason, power comparison using Eq. (3.5) requires that both circuits present the same type of cells (i.e. static logic) or a similar distribution of different cell types. Otherwise, a compensation factor should be used depending on the type of cells used.

### 3.2.3 Total power

Total power is defined as the sum of dynamic plus static consumption. Referring to the previous sub-chapters, the total power model is given by:

$$
\begin{align*}
P_{\mathrm{tot}} & =P_{\mathrm{dyn}}+P_{\mathrm{stat}} \\
& =a C N f V_{d d}^{2}+N \cdot V_{d d} \cdot I_{o} e^{-\frac{V t h}{n U t}} \\
& =N \cdot V_{d d}\left(a C f V_{d d}+I_{o} e^{-\frac{V t h}{n U t}}\right) \tag{3.6}
\end{align*}
$$

With $N$ the number of cells, $V d d$ the supply voltage, $a$ the circuit activity, $C$ the equivalent capacitance, $f$ the frequency, $I_{o}$ the average off-current per cell, $V t h$ the modulated threshold voltage, $n$ the sub-threshold slope and $U t$ the thermal potential.
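A minimal sketch of Eqs. (3.4) to (3.6) is given below. The technology parameters $n$, $U t$, $\eta$ and $V t h 0$ are the lvt values of Chapter 4, whereas the circuit values ($N$, $a$, $C$, $f$, $I_{o}$) are invented for illustration and do not describe any circuit of this thesis.

```python
import math

def p_dyn(a, c, n_cells, f, vdd):
    """Eq. (3.4): dynamic power a*C*N*f*Vdd^2."""
    return a * c * n_cells * f * vdd ** 2

def p_stat(n_cells, vdd, i_o, vth, n, ut):
    """Eq. (3.5): static power N*Vdd*Io*exp(-Vth/(n*Ut))."""
    return n_cells * vdd * i_o * math.exp(-vth / (n * ut))

def p_tot(a, c, n_cells, f, vdd, i_o, vth, n, ut):
    """Eq. (3.6): factored form of the total power."""
    return n_cells * vdd * (a * c * f * vdd + i_o * math.exp(-vth / (n * ut)))

# Technology values (lvt, Chapter 4) and hypothetical circuit values.
N_SLOPE, UT, ETA, VTH0 = 1.70, 0.02588, 0.087, 0.342
VDD = 1.0
VTH = VTH0 - ETA * VDD                     # Vbs = 0
circuit = dict(a=0.1, c=5e-15, n_cells=10_000, f=100e6)   # assumed values
I_O = 1e-6                                 # assumed average reference current per cell [A]

dyn = p_dyn(vdd=VDD, **circuit)
stat = p_stat(circuit["n_cells"], VDD, I_O, VTH, N_SLOPE, UT)
tot = p_tot(vdd=VDD, i_o=I_O, vth=VTH, n=N_SLOPE, ut=UT, **circuit)
print(f"Pdyn = {dyn:.3e} W, Pstat = {stat:.3e} W, Ptot = {tot:.3e} W")
```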

### 3.3 Delay models

All power related discussions are worthless if the circuit delay (related to performance) is not considered. The model retained here is the very common one, that considers the delay of a cell as the time needed to charge the load capacitance by a driving current. So, to charge a capacitance $C$ to the potential $V$ the number of electric charges needed is $Q=C V$. Considering that these charges are coming at the speed of $I_{\mathrm{On}}[\mathrm{A}=\mathrm{C} / \mathrm{s}]$, it is easy to find that:

$$
\begin{equation*}
t_{\mathrm{gate}}=k_{t} \frac{C V}{I_{\mathrm{on}}} \tag{3.7}
\end{equation*}
$$

With $k_{t}$ a constant accounting for the fact that the driving current is not constant during the capacitance charge (the values of this constant for the technology flavors used in this thesis are 15.1 for LVT, 24.7 for SVT and 30.1 for HVT. These values were obtained by multiplying the delay of a NAND2x2 cell with $I_{\mathrm{On}}$ and then by dividing it by the driven capacitance and by the supply voltage). $I_{\mathrm{On}}$ is the on transistor current and its formulation is given by Eq. (3.2).

In a digital design, the maximal achievable frequency is the inverse of the sum of delays on the critical path. In a mathematical form it appears as:

$$
\begin{equation*}
\left(f_{\max }\right)^{-1}=t_{\text {critical path }}=k_{t} \sum_{i}^{L D} \frac{C_{i} \cdot V_{d d}}{I_{\mathrm{on}}}=k_{t} C \frac{L D \cdot V_{d d}}{I_{\mathrm{On}}} \tag{3.8}
\end{equation*}
$$

With $C_{i}$ the load capacitance $i$ on the critical path, $L D$ the logical depth defined as the number of cells forming the critical path, $C$ the average critical path capacitance defined as $\sum_{i} C_{i} / L D$.

Combining Eq. (3.8) with Eq. (3.2) yields:

$$
\begin{equation*}
f_{\max }=\frac{I_{o} \cdot e^{\alpha}}{k_{t} \cdot C \cdot L D \cdot(\alpha n U t)^{\alpha}} \frac{\left(V_{d d}-V_{t h}\right)^{\alpha}}{V_{d d}} \tag{3.9}
\end{equation*}
$$

In the previous equation, it is interesting to observe that a high $I_{o}$, corresponding to a highly leaky technology, also corresponds to a high maximal frequency, thus underlining the tight relation between high performance and static dissipation.
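The step from Eq. (3.8) to Eq. (3.9) can be checked numerically. In the sketch below, $k_{t}=15.1$ and the lvt parameters come from the text, while $I_{o}$, $C$ and $L D$ are hypothetical values chosen only to obtain a plausible order of magnitude.

```python
import math

def i_on(i_o, vdd, vth, alpha, n, ut):
    """Eq. (3.2): alpha power law on current."""
    return i_o * (math.e / (alpha * n * ut)) ** alpha * (vdd - vth) ** alpha

def f_max_from_delay(k_t, c, ld, vdd, i_on_value):
    """Eq. (3.8): inverse of the critical path delay kt*C*LD*Vdd/Ion."""
    return i_on_value / (k_t * c * ld * vdd)

def f_max_closed_form(i_o, k_t, c, ld, vdd, vth, alpha, n, ut):
    """Eq. (3.9): Eq. (3.8) with Ion replaced by Eq. (3.2)."""
    return (i_o * math.exp(alpha) / (k_t * c * ld * (alpha * n * ut) ** alpha)
            * (vdd - vth) ** alpha / vdd)

# lvt technology values from the text; Io, C and LD are assumed for illustration.
K_T, N, UT, ALPHA = 15.1, 1.70, 0.02588, 1.56
VDD, VTH = 1.0, 0.342 - 0.087 * 1.0
I_O, C, LD = 1e-6, 5e-15, 20

f_a = f_max_from_delay(K_T, C, LD, VDD, i_on(I_O, VDD, VTH, ALPHA, N, UT))
f_b = f_max_closed_form(I_O, K_T, C, LD, VDD, VTH, ALPHA, N, UT)
print(f"fmax via Eq. (3.8) = {f_a:.4e} Hz, via Eq. (3.9) = {f_b:.4e} Hz")
```

Both routes give the same frequency, and doubling $I_{o}$ doubles $f_{\max}$, which is the proportionality discussed above.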

### 3.4 Summary

In this chapter, equations for the dynamic and static power consumption as well as the circuit delay (corresponding to the maximal frequency) have been obtained starting from simple and well known expressions of the on and off currents of a CMOS transistor. These equations are the foundation for the theory presented in this thesis. The use of very simplified equations, as well as the exclusion of secondary effects like gate leakage, is deliberate. This is necessary in order to be able to work with analytical expressions or simple closed form approximations, which makes it possible to understand the influence of each single parameter on the lowest achievable total power consumption.

## Chapter 4

## Technology characterization

The equations in Chapter 3 depend on a number of technology parameters that must be characterized for a given technology before the equations can be exploited. To be sure that they really match the models used in this work, every parameter has been estimated by fitting SPICE simulation curves to our models with the program Graphical Analysis v3.2. The obtained values can differ from the original SPICE parameters, because the models used are different. Actually, our models (explained in previous chapters) are much simpler than the BSIM3.3 ones, which are what the provided SPICE libraries use. In this thesis, the technology of ST Microelectronics with a minimal size of 90 nm has been chosen as reference. The advantage of this technology is that it is available for 3 different transistor types, corresponding to 3 different threshold voltages.

### 4.1 Parameters extraction methodology

The technology parameters required in this work are:

- $n$ : the sub-threshold slope;
- $\eta$ : the DIBL effect coefficient;
- $\alpha$ : the alpha power law coefficient;
- Vth0 : the reference transistor threshold;
- $\gamma$ : the body effect coefficient.

Each one of these parameters will be discussed in detail in the following sections.

### 4.1.1 The sub-threshold slope $n$

The sub-threshold slope $n$ is extracted from the simulation of $I_{d s}(V g s)$. The schematic used to measure the $I_{d s}$ current is reported in Fig. 4.1.



Figure 4.1: Schematic of the NMOS (left) and PMOS (right) transistors used for the extraction of the sub-threshold slope $n$. Transistor sizes are: $W_{\text {nmos }}=0.51 \mu \mathrm{~m}$, $W_{\text {pmos }}=0.88 \mu \mathrm{~m}$ and $L_{n \text { mos }}=L_{\text {pmos }}=0.1 \mu \mathrm{~m}$

The equation of the drain current in weak inversion is given by:

$$
\begin{equation*}
I_{d s}(V g s)=I_{o} e^{\frac{V g s-V t h}{n U t}} \tag{4.1}
\end{equation*}
$$

Consequently, by considering the natural logarithm of the previous equation, the simulated curve should match the corresponding linear function:

$$
\begin{equation*}
\ln \left(I_{d s}(V g s)\right)=\frac{1}{n U t} \cdot V g s+\left[\ln \left(I_{o}\right)-\frac{V_{t h}}{n U t}\right] \equiv m \cdot V g s+b \tag{4.2}
\end{equation*}
$$

Through a linear fitting, it is possible to extract the slope $m$ of Eq. (4.2) to obtain $1 / n U t$. Knowing the temperature used during the simulation, $U t\left(\equiv k_{b} T / q\right)$ is also known ( $k_{b}=1.38 \mathrm{E}-23 \mathrm{~J} / \mathrm{K}, q=1.6 \mathrm{E}-19 \mathrm{~C}$ ).

As the values of $n$ for the NMOS and the PMOS transistors can be different, the retained value will be their average.

The sizes of both the NMOS and PMOS transistors used in the SPICE simulations are the same as the corresponding ones in an inverter cell with a driving force of one.
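The extraction of $n$ through Eq. (4.2) can be sketched with an ordinary least-squares fit. The code below does not reproduce the Graphical Analysis workflow used in this work; it generates synthetic weak inversion data from Eq. (4.1) with placeholder values for $I_{o}$ and $V t h$, and then recovers $n$ from the fitted slope.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = m*x + b; returns (m, b)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x

UT = 0.02588  # thermal potential [V] at 27 degC

# Synthetic data from Eq. (4.1); a real flow would use simulated Ids(Vgs) points.
# N_TRUE, I_O and VTH are placeholder values, not extracted ones.
N_TRUE, I_O, VTH = 1.70, 1e-6, 0.30
vgs = [0.02 * k for k in range(11)]                      # 0 V ... 0.2 V
ids = [I_O * math.exp((v - VTH) / (N_TRUE * UT)) for v in vgs]

m, b = linear_fit(vgs, [math.log(i) for i in ids])       # Eq. (4.2)
n_extracted = 1.0 / (m * UT)
print(f"m = {m:.2f} 1/V, n = {n_extracted:.3f}")
```

The intercept $b$ mixes $\ln \left(I_{o}\right)$ and $V t h / n U t$, so this fit alone does not separate $I_{o}$ from $V t h$.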

### 4.1.2 The DIBL effect factor $\eta$

The extraction of the DIBL effect factor $\eta$ is very similar to how $n$ is obtained. The difference comes from the swept variable during simulation, which is now $V d d$, while $V g s$ is set to $0 V$, thus resulting in an off transistor. The corresponding equations are:

$$
\begin{align*}
I_{o f f}\left(V_{d d}\right) & =I_{o} e^{-\frac{V_{t h 0}-\eta V_{d d}}{n U t}}  \tag{4.3}\\
\ln \left(I_{o f f}\left(V_{d d}\right)\right) & =\frac{\eta}{n U t} \cdot V_{d d}+\left[\ln \left(I_{o}\right)-\frac{V_{t h 0}}{n U t}\right] \equiv m \cdot V_{d d}+b \tag{4.4}
\end{align*}
$$

Once the slope $\eta / n U t$ has been extracted, $\eta$ is easily obtained, since $1 / n U t$ was estimated in the previous section 4.1.1.

The static current $I_{\text {off }}$ is measured as the supply current on a closed chain composed by an even number of inverters (10 in our case). In such a configuration, the circuit is in a stable condition and no node transitions occur. All inverters present a driving force of one.

### 4.1.3 The $\alpha$ factor and the reference threshold voltage $V$ th 0

The parameter $\alpha$ (discussed in Chapter 3) and the reference threshold voltage $V$ th 0 can both be estimated by fitting the delay equation (from Eq. (3.9)):

$$
\begin{equation*}
\operatorname{Delay}\left(V_{d d}\right) \propto \frac{V_{d d}}{\left(V_{d d}-V_{t h}\right)^{\alpha}}=\frac{V_{d d}}{\left(V_{d d}(1+\eta)-V_{t h 0}\right)^{\alpha}} \tag{4.5}
\end{equation*}
$$

As $\eta$ is a known parameter, a non-linear curve fitting on the circuit delay plotted as a function of $V d d$ makes it possible to determine the values of $\alpha$ and $V t h 0$. Because both parameters refer to the circuit delay (and this is the way the parameters will be used later), their values can be quite different from the single NMOS or PMOS ones defined by the manufacturer.

The delays are obtained by measuring the oscillating frequencies of a ring oscillator formed by 9 inverters with a driving force of one.
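One simple way to perform the fit of Eq. (4.5) without a dedicated tool is sketched below. It is not the procedure used in this work: for each candidate $V t h 0$ on a grid, the relation becomes linear in the logarithmic domain, $\ln (\text{Delay} / V_{d d})=\ln K-\alpha \ln \left(V_{d d}(1+\eta)-V_{t h 0}\right)$, so $\alpha$ follows from a linear regression and the candidate with the smallest residual is retained. The delay data are synthetic, and the scale factor $K$, the grid and the function names are assumptions.

```python
import math

def linear_fit(xs, ys):
    """Least-squares line y = m*x + b; returns (m, b, sum of squared residuals)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return m, b, sse

def fit_alpha_vth0(vdd_points, delays, eta, vth0_grid):
    """Return (alpha, Vth0) minimising the log-domain error of Eq. (4.5)."""
    ys = [math.log(d / v) for d, v in zip(delays, vdd_points)]
    best = None
    for vth0 in vth0_grid:
        overdrive = [v * (1.0 + eta) - vth0 for v in vdd_points]
        if min(overdrive) <= 0.0:
            continue  # Eq. (4.5) is not defined for this candidate
        slope, _, sse = linear_fit([math.log(o) for o in overdrive], ys)
        if best is None or sse < best[0]:
            best = (sse, -slope, vth0)
    return best[1], best[2]

# Synthetic delays generated with the lvt values; K is an arbitrary scale factor.
ETA, ALPHA, VTH0, K = 0.087, 1.56, 0.342, 1e-11
vdd_points = [0.5 + 0.05 * k for k in range(11)]
delays = [K * v / (v * (1.0 + ETA) - VTH0) ** ALPHA for v in vdd_points]

alpha_fit, vth0_fit = fit_alpha_vth0(
    vdd_points, delays, ETA, [k / 1000.0 for k in range(200, 500)])
print(f"alpha = {alpha_fit:.3f}, Vth0 = {vth0_fit:.3f} V")
```

The grid resolution (1 mV here) bounds the accuracy on $V t h 0$; with noisy simulation data a finer grid or a proper non-linear optimizer would be preferable.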

### 4.1.4 The body effect coefficient $\gamma$

The body effect coefficient $\gamma$ models the first order influence of the body potential to the reference threshold voltage $V t h 0$ :

$$
\begin{equation*}
V t h(V b s)=V t h 0-\gamma V b s \tag{4.6}
\end{equation*}
$$

The extracting methodology for this parameter is the same as for the DIBL effect coefficient $\eta$, but the measured parameter is $I_{o f f}(V b s)$ :

$$
\begin{align*}
I_{o f f}(V b s) & =I_{o} e^{-\frac{V_{t h 0}-\eta V_{d d}-\gamma V b s}{n U t}}  \tag{4.7}\\
\ln \left(I_{o f f}(V b s)\right) & =\frac{\gamma}{n U t} \cdot V b s+\left[\ln \left(I_{o}\right)-\frac{V_{t h 0}-\eta V_{d d}}{n U t}\right] \equiv m \cdot V b s+b \tag{4.8}
\end{align*}
$$

A simple linear curve fitting on $\ln \left(I_{o f f}(V b s)\right)$ is enough to determine $m=\gamma / n U t$. It is then easy to multiply the previous value by $n U t$ to obtain $\gamma$.

Here too, the static current $I_{o f f}$ is obtained by simulating a looped chain composed of an even number (10) of inverters with a driving force of one.

It is important to note that the body bias potential must be kept below 0.5 V in the forward bias condition $(V b s>0)$. Otherwise, the p-n junction between the body and the source will start to conduct as a forward-biased diode, creating an extremely large leakage current.

### 4.1.5 Remark on $I_{o}$

The parameter $I_{o}$ representing the reference static current is also a technology related parameter, but its value cannot be extracted and used in a universal way as it is done for the other technology parameters. In fact, in this work, $I_{o}$ is considered as the reference static current per cell. This means that the specific value is dependent on the cells used (as discussed in Section 3.2.2) and cannot simply be represented with a unique value. Except when stated differently, the average $I_{o}$ of a circuit is estimated from cell nominal values of the static power in the following way:

$$
\begin{equation*}
I_{o}=\frac{\text { Total Nominal Static Power }}{V d d \_n o m \cdot N} e^{\frac{V t h}{n U t}} \tag{4.9}
\end{equation*}
$$

With $V d d \_n o m$ the nominal supply voltage and $N$ the number of cells.
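Eq. (4.9) is simply Eq. (3.5) solved for $I_{o}$, as the following sketch checks by a round trip. The total static power and the number of cells are invented values; $n$, $U t$, $\eta$ and $V t h 0$ are the lvt values, and $V t h$ is evaluated at the nominal supply voltage with $V b s=0$.

```python
import math

def reference_current(total_static_power, vdd_nom, n_cells, vth, n, ut):
    """Eq. (4.9): average reference current Io per cell."""
    return total_static_power / (vdd_nom * n_cells) * math.exp(vth / (n * ut))

def static_power(n_cells, vdd, i_o, vth, n, ut):
    """Eq. (3.5): static power of N cells."""
    return n_cells * vdd * i_o * math.exp(-vth / (n * ut))

N_SLOPE, UT, ETA, VTH0 = 1.70, 0.02588, 0.087, 0.342   # lvt, Chapter 4
VDD_NOM = 1.0
VTH = VTH0 - ETA * VDD_NOM                              # modulated Vth at Vdd_nom

P_STATIC_NOM = 30e-6    # hypothetical total nominal static power [W]
N_CELLS = 10_000        # hypothetical number of cells

i_o = reference_current(P_STATIC_NOM, VDD_NOM, N_CELLS, VTH, N_SLOPE, UT)
p_back = static_power(N_CELLS, VDD_NOM, i_o, VTH, N_SLOPE, UT)
print(f"Io = {i_o:.3e} A per cell, recomputed Pstat = {p_back:.3e} W")
```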

### 4.2 STM 90nm technology

The STM 90 nm is the most recent technology available at our laboratory and it presents the following main features:

- Designed for $1.0 \mathrm{~V} \pm 10 \%$ applications, with $1.8 \mathrm{~V} / 2.5 \mathrm{~V} / 3.3 \mathrm{~V}$ IO's
- Shallow trench isolation, isolated P-Well (DNW) twin-tub, single poly CMOS process using a type $<100>$ P-substrate
- $16 \AA$ gate oxide
- Cobalt silicide on junctions, polysilicon gates, lines, resistors on active and interconnect poly ( $\mathrm{N}+$ or $\mathrm{P}+$ )
- Dual Vth transistors
- IOs using 2.8 nm or 5.0 nm or 6.5 nm gate oxide for 1.8 V or 2.5 V or 3.3 V respectively
- 6 to 9 metal levels
- Damascene Copper for all metals
- Thick metal layer for power, clock, busses and major interconnect signal distribution, as well as for inductors in Analog/RF applications
- Tight pitch levels for routing on thin copper for lower metal layers
- Low K ( $<3.0$ ) inter-metal dielectric for thin metal layers

To extract the required parameters for each one of the 3 transistor flavors, the program ELDO version 6.1_1.1 from Mentor Graphics (SPICE-like simulator) has been used.

### 4.2.1 Low Vth Transistors (lvt)

The "Low Vth" transistor type is the fastest available flavor in the STM 90 nm general purpose technology, and is used for applications where the speed is of primary importance. The disadvantage of this type of transistors is that, due to the low threshold voltage (Vth), the static power is very high.

## The sub-threshold slope $n$

It is important to note that the linear fitting on Eq. (4.2) must be estimated on a region where the transistor is in the weak inversion mode (i.e. $V d d<V t h$ ). Otherwise Eq. (4.2) is no longer valid and the alpha power law should be used instead to describe the transistor current. In our case the fitting range is $V d d \in[0 \mathrm{~V} ; 0.2 \mathrm{~V}]$. Moreover, the temperature was set to $27^{\circ} \mathrm{C}$, corresponding to $U t=0.02588 \mathrm{~V}$.

The following table summarizes the parameters extraction:


Figure 4.2: Linear fitting of $\ln \left(I_{d s}(V g s)\right)$ for STM 90 nm lvt

|  | $m=1 / n U t\left[V^{-1}\right]$ | unified $1 / n U t\left[V^{-1}\right]$ | $U t[V]$ | $n$ | unified $n$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| NMOS | 23.05 | 22.72 | 0.02588 | 1.68 | 1.70 |
| PMOS | 22.39 | 22.72 | 0.02588 | 1.73 | 1.70 |

Table 4.1: Results of the sub-threshold slope extraction for STM 90nm lvt
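
As a quick numerical check of Table 4.1, the following sketch converts the fitted slopes $m = 1/nUt$ into $n$ and runs a plain least-squares fit on synthetic weak-inversion data. The synthetic current samples, the 1 nA prefactor and the helper names are illustrative assumptions; they are not ELDO output.

```python
import math

UT = 0.02588  # thermal potential at 27 degC [V], as stated in the text

def slope_to_n(m, ut=UT):
    """Convert a fitted slope m = 1/(n*Ut) [1/V] into the sub-threshold slope n."""
    return 1.0 / (m * ut)

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys versus xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Slopes reported in Table 4.1
n_nmos = slope_to_n(23.05)
n_pmos = slope_to_n(22.39)
n_unified = slope_to_n(22.72)

# Synthetic data (assumption): ideal weak-inversion current over [0 V, 0.2 V]
vgs = [0.02 * i for i in range(11)]
ln_ids = [math.log(1e-9) + v / (1.70 * UT) for v in vgs]
n_fitted = slope_to_n(fit_slope(vgs, ln_ids))

print(round(n_nmos, 2), round(n_pmos, 2), round(n_unified, 2), round(n_fitted, 2))
```

On real simulation data the fit should be restricted to the weak-inversion range, as noted above.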

## The DIBL effect factor $\eta$

The DIBL effect factor $\eta$ is extracted from the curve $\ln \left(I_{o f f}(V d d)\right)$. The static power is measured on a chain of 10 inverters, all with a driving force of one.

The results of the curve fitting in Fig. 4.3 are summarized in Table 4.2.

| $m=\eta / n U t$ | unified $1 / n U t$ (from table 4.1) | $\eta$ |
| :---: | :---: | :---: |
| 1.98 | 22.72 | 0.087 |

Table 4.2: Results of the DIBL effect coefficient extraction for STM 90nm lvt

## The $\alpha$ factor and the reference threshold voltage $V t h 0$

The extraction of the $\alpha$ factor and of the reference threshold voltage $V t h 0$ is done jointly by fitting the non-linear equation (4.5) with a known value for $\eta$. The delays are obtained by measuring the oscillating frequency of a ring oscillator formed by 9 inverters with a driving force of one.


Figure 4.3: Linear fitting of $\ln \left(I_{o f f}(V d d)\right)$ for 1 inverter (averaged over 10 inverters)


Figure 4.4: Fitting of delay vs. Vdd for STM 90 nm lvt

Results, based on the fitting in figure 4.4, are presented in Table 4.3.

| $\alpha$ | Vth 0 |
| :---: | :---: |
| 1.56 | 0.342 |

Table 4.3: Results for the $\alpha$ factor and $V t h 0$ for STM 90 nm lvt

## The body effect coefficient $\gamma$

The extraction of the body effect factor is achieved with a linear fitting on the curve $\ln \left(I_{o f f}(V b s)\right)$. The static current is measured over 10 inverters connected in chain and the result has been divided by 10 to average the static current to one inverter.


Figure 4.5: Linear fitting of $\ln \left(I_{o f f}(V b s)\right)$ for 1 inverter (averaged over 10 inverters)

It should be noted that the $\gamma$ is only a first order approximation of the body bias effect, because, as shown in Fig. 4.5, the curve is more like a square root function than a linear one.

Results are summarized by Table 4.4.

| $m=\gamma / n U t$ | unified $1 / \mathrm{nUt}$ (from table 4.1) | $\gamma$ |
| :---: | :---: | :---: |
| 2.72 | 22.72 | 0.12 |

Table 4.4: Results for the body effect coefficient for STM 90 nm lvt
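
The conversions behind Tables 4.2 and 4.4 are the same: the fitted slope is divided by the unified $1/nUt$ of Table 4.1. A minimal sketch of this step is given below; the function name is a hypothetical helper introduced only for illustration.

```python
INV_N_UT = 22.72  # unified 1/(n*Ut) for lvt, from Table 4.1 [1/V]

def coefficient_from_slope(m, inv_n_ut=INV_N_UT):
    """Recover eta (or gamma) from a fitted slope m = eta/(n*Ut) (or gamma/(n*Ut))."""
    return m / inv_n_ut

eta = coefficient_from_slope(1.98)    # slope of ln(Ioff) vs Vdd, Table 4.2
gamma = coefficient_from_slope(2.72)  # slope of ln(Ioff) vs Vbs, Table 4.4
print(f"eta = {eta:.3f}, gamma = {gamma:.2f}")
```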

## $I_{o}$ for a 2 inputs NAND gate

Even if it is not possible to give here a unique $I_{o}$ value for the technology, the value $I_{o}$ for a 2 inputs NAND gate with a driving force of 2 is given as a reference.

| $I_{o}[\mu \mathrm{~A}]$ | 30.9 |
| :--- | :--- |

Table 4.5: $I_{o}$ for a NAND2x2 gate from the STM 90nm lvt technology

## Summary of the lvt technology parameters

All the technology parameters for the lvt flavor are summarized by Table 4.6.

| $V t h 0[\mathrm{~V}]$ | $\alpha$ | $n$ | $\eta$ | $\gamma$ | $I_{o}($ NAND $2 x 2)[\mu \mathrm{A}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 0.342 | 1.56 | 1.70 | 0.087 | 0.12 | 30.9 |

Table 4.6: Technology parameters summary for the STM 90nm lvt

### 4.2.2 Standard Vth Transistors (svt)

The "Standard Vth" transistor type is an all-purpose flavor where delay and static power have been traded off to match typical design requirements. The procedure used to characterize this technology variation is exactly the same as the one used for lvt. For the sake of simplicity, only the summary table is reported.

| $V t h 0[\mathrm{~V}]$ | $\alpha$ | $1 / n U t\left[V^{-1}\right]$ | $n$ | $\eta$ | $\gamma$ | $I_{o}(N A N D 2 x 2)[\mu \mathrm{A}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0.353 | 1.65 | 26.30 | 1.47 | 0.060 | 0.14 | 26.0 |

Table 4.7: Technology parameters summary for the STM 90nm svt

### 4.2.3 High Vth Transistors (hvt)

The "High Vth" transistor type is a flavor especially optimized for extremely low static power consumption. Typical applications for this technology variation are circuits that are idle most of the time and/or where speed is not of utmost importance. The procedure used to characterize this technology variation is exactly the same as the one used for lvt. For the sake of simplicity, only the summary table is reported.

| $V t h 0[\mathrm{~V}]$ | $\alpha$ | $1 / n U t\left[V^{-1}\right]$ | $n$ | $\eta$ | $\gamma$ | $I_{o}(N A N D 2 x 2)[\mu \mathrm{A}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 0.425 | 1.84 | 26.16 | 1.48 | 0.062 | 0.19 | 17.7 |

Table 4.8: Technology parameters summary for the STM 90 nm hvt

### 4.3 Summary

In this chapter, the methodology used to extract the technology parameters has been presented. After an introduction of the general procedure, the parameters $V t h 0, \alpha, n, \eta, \gamma$ and $I_{o}(N A N D 2 x 2)$ have been evaluated for all 3 transistor flavors available in the STM 90 nm general purpose technology. In order to have an easy access to the extracted data, values are summarized in Table 4.9.

|  | $V t h 0[\mathrm{~V}]$ | $\alpha$ | $1 / n U t\left[V^{-1}\right]$ | $n$ | $\eta$ | $\gamma$ | $I_{o}(N A N D 2 x 2)[\mu \mathrm{A}]$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| lvt | 0.342 | 1.56 | 22.72 | 1.70 | 0.087 | 0.12 | 30.9 |
| svt | 0.353 | 1.65 | 26.30 | 1.47 | 0.060 | 0.14 | 26.0 |
| hvt | 0.425 | 1.84 | 26.16 | 1.48 | 0.062 | 0.19 | 17.7 |

Table 4.9: Technology parameters summary for the STM 90nm - Vdd $=1 \mathrm{~V}$

## Chapter 5

## Reference multiplier architectures

This chapter presents a set of 13 reference multipliers widely used in this thesis. Multipliers were chosen as reference circuits because many possible implementations exist, each one with very different characteristics. The architectures proposed in this chapter can be divided into 3 families, each one containing several variations of the basic implementation.

The 3 families are:

1. Ripple Carry Array (RCA): This structure is based on a regular matrix of full adders; considered versions are:

- basic;
- 2 and 4 times parallel;
- 2 and 4 times horizontal pipeline;
- 2 and 4 times diagonal pipeline.

2. Wallace: This type of multiplier is based on a tree of full adders used as 3-to-2 compressors; considered versions are:

- basic;
- 2 and 4 times parallel.

3. Sequential: Here the multiplication is obtained by a sequential add and shift implementation; considered versions are:

- basic;
- sequential-wallace;
- 2 times parallel.


### 5.1 Ripple Carry Array

The Ripple Carry Array multiplier (or RCA) is the most intuitive implementation of a multiplier. Its structure derives from the way multiplications are usually done by hand, that is, as a sum of shifted partial products. A partial product $\left(P_{i}\right)$ is the result of the multiplication of the multiplicand ( $A$, first number to multiply) with one bit of the multiplier ( $B$, second number to multiply). In practice, the multiplication by a bit is obtained with AND gates. The number of partial products is equal to the size of $B$ in bits. Mathematically, this can be written as ( $2^{i}$ represents the bit shift):

$$
\begin{equation*}
M=A * B=\sum_{i=0}^{\operatorname{size}(B)-1} P_{i} \cdot 2^{i}=\sum_{i=0}^{\operatorname{size}(B)-1}\left(A \text { and } B_{i}\right) \cdot 2^{i} \tag{5.1}
\end{equation*}
$$

In a physical implementation, the summation shown in Eq. (5.1) is obtained by a series of full adders (FA), i.e. 1 bit adders defined as:

$$
\begin{aligned}
S & =a \text { xor } b \text { xor } c i n \\
\text { Cout } & =(a \text { and } b) \text { or }(a \text { and } c i n) \text { or }(b \text { and } c i n)
\end{aligned}
$$

A graphical representation of a FA is provided in Fig. 5.1.


Figure 5.1: Full adder symbol

By implementing Eq. (5.1) directly, a multiplier known as the Ripple Carry Array multiplier (or RCA) can be constructed. Fig. 5.2 represents such an implementation for $\mathrm{N}=8$.

The first line of full adders in a RCA does not have to sum the partial products with the result of the preceding line, because no preceding result exists. Hence, only partial products (AND gates) are generated by the synthesis tools. Moreover, the rightmost cell of each line has a fixed carry in of zero. Those cells can be simplified to a Half Adder (HA), i.e. an adder without carry in. The logical expressions of an HA are:

$$
\begin{aligned}
S & =a \text { xor } b \\
\text { Cout } & =a \text { and } b
\end{aligned}
$$



Figure 5.2: 8bit RCA multiplier
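
The construction described above can be checked with a small bit-level model. The sketch below is only a functional illustration, assuming unsigned operands: it sums the shifted partial products line by line with ripple-connected full adders, and does not claim to reproduce the exact cell placement of Fig. 5.2. The function names are hypothetical.

```python
def full_adder(a, b, cin):
    """1-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def rca_multiply(a, b, n=8):
    """Multiply two n-bit unsigned integers by summing shifted partial
    products with lines of full adders (functional sketch only)."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    acc = [0] * (2 * n)  # running sum, least significant bit first
    for i in range(n):
        # Partial product P_i = A and B_i, realized with AND gates
        pp = [a_bits[j] & b_bits[i] for j in range(n)]
        carry = 0  # the rightmost cell of each line has a carry in of zero
        for j in range(n):
            acc[i + j], carry = full_adder(acc[i + j], pp[j], carry)
        acc[i + n] = carry  # carry out of the line lands on the next weight
    return sum(bit << k for k, bit in enumerate(acc))

print(rca_multiply(13, 11))
```

Because the first line adds the partial product to an all-zero accumulator, its adders degenerate to the AND gates alone, which matches the simplification mentioned in the text.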

The FA has two characteristic delays. The first is the time that a signal needs to propagate from the inputs ( $a$, $b$ and cin) to the sum port $(S)$. The second is the propagation delay for a signal going from the inputs ( $a$, $b$ and cin) to the carry out port (Cout). Fig. 5.3 shows one of the possible critical paths that exist in such a multiplier.


Figure 5.3: Critical path in a 8 bit RCA multiplier

It is not surprising that, in Fig. 5.3, the critical path does not include the first line of full adders; indeed, this line reduces to simple AND gates for the generation of partial products (because $a$ and cin are zero), and they are evaluated in parallel with the partial products of the second line (corresponding to the bit B1).

The total delay for a RCA is given by :

$$
\begin{equation*}
t(\text { Basic RCA })=(2 \cdot N-2) \cdot t_{\text {cout }}+(N-2) \cdot t_{\text {sum }} \tag{5.2}
\end{equation*}
$$

With $N$ the size of the multiplier, $t_{\text {cout }}$ the carry out delay and $t_{\text {sum }}$ the sum delay.

The structure presented in Fig. 5.2 and Fig. 5.3 is what we will call the "basic RCA" implementation. Other RCA implementations are explained hereafter.

### 5.1.1 RCA parallel variations

The first transformation of the RCA multiplier is parallelization: the RCA multiplier is instantiated twice (or more, in general) and the data is multiplexed to a different multiplier at each clock period. The advantage of this architecture is that each multiplier has two clock periods (or, in general, as many as the number of instantiated blocks) to complete the computation. So, the throughput is the same as for the non-parallelized version, but the latency is larger (it scales with the number of blocks). Fig. 5.4 shows the structure of a 2 times parallelized multiplier.


Figure 5.4: 2 times parallelized multiplier

The sel signal is used to select which multiplier will calculate the multiplication for the incoming data, and it typically switches every clock cycle. The input registers are required in order to latch the data at the input of the multipliers. In fact, each multiplier now has more than 1 clock cycle (corresponding to the degree of parallelization) to compute one multiplication, and the incoming data needs to be stable over those clock cycles. Considering the throughput frequency as the reference clock, the effective logical depth, defined as the real logical depth divided by the number of clock cycles the signals have for propagating through it, is now reduced by the number of parallelizations.

The major drawback of the parallelization process is that the hardware is more than doubled (or more than multiplied by N for an N times parallel implementation). Consequently, the static power is also more than doubled, while the dynamic power is only slightly increased due to the added registers and multiplexer.

### 5.1.2 RCA horizontal pipeline variations

The goal of pipelining is to reduce the critical path (logical depth) by inserting register banks in the design. This can be done in several ways with considerably different results. The most intuitive and easiest way to realize it is to "cut" the RCA horizontally in the middle of the structure. This can be imagined as two $N \times N / 2$ multipliers separated by a register bank, as shown in Fig. 5.5.


Figure 5.5: 2 stages horizontally pipelined 8 bit RCA

The number of registers needed to divide the multiplier in this way is easily obtained from Fig. 5.5. Actually, all bits of $A$ (N registers) plus all the result bits of the previous stage ( $\mathrm{N}+\mathrm{N} / 2$ registers) must be latched. Moreover, in order to maintain data synchronization, the most significant bits of $B$ must be latched too (N/2 registers). Hence, the total overhead corresponds to 3 N registers.

The critical path after such an architectural transformation is:

$$
\begin{equation*}
t(\text { Horizontal Pipeline })=(3 / 2 N-1) \cdot t_{\text {cout }}+(1 / 2 N-1) \cdot t_{\text {sum }}+t_{d f f} \tag{5.3}
\end{equation*}
$$

With $N$ the size of the multiplier, $t_{\text {cout }}$ the carry out delay, $t_{\text {sum }}$ the sum delay and $t_{d f f}$ the registers delay.

The "vertical delay" (corresponding to $t_{\text {sum }}$ ) is effectively reduced by a factor of two, but the "horizontal delay" (related to $t_{\text {cout }}$ ) is only reduced by a factor of about $4 / 3$. Additionally, the "clk to Q" delay of a register must be added. Hence, the global delay reduction compared to the non-pipelined version is far from the expected (or hoped for) factor of 2.
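
To make the comparison concrete, the following sketch evaluates Eq. (5.2) and Eq. (5.3) side by side. The cell delays are hypothetical values chosen only for illustration; they are not taken from the STM 90 nm library.

```python
def t_basic_rca(n, t_cout, t_sum):
    """Eq. (5.2): critical path of the basic RCA multiplier."""
    return (2 * n - 2) * t_cout + (n - 2) * t_sum

def t_horizontal_pipeline_2(n, t_cout, t_sum, t_dff):
    """Eq. (5.3): critical path of the 2-stage horizontally pipelined RCA."""
    return (1.5 * n - 1) * t_cout + (0.5 * n - 1) * t_sum + t_dff

# Hypothetical cell delays in ns (assumptions for illustration only)
T_COUT, T_SUM, T_DFF = 0.10, 0.15, 0.20
N = 16

basic = t_basic_rca(N, T_COUT, T_SUM)
piped = t_horizontal_pipeline_2(N, T_COUT, T_SUM, T_DFF)
speedup = basic / piped
print(f"basic = {basic:.2f} ns, pipelined = {piped:.2f} ns, speed-up = {speedup:.2f}")
```

With these assumed delays the speed-up stays well below 2, in line with the remark above.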

A similar calculation can be done for a 4 stages pipeline; in this case the critical path delay is $(5 / 4 N-1) t_{\text {cout }}+(1 / 4 N-1) t_{\text {sum }}+t_{d f f}+t_{\text{dff\_setup}}$.

It is important to remark that pipelining remains interesting only for a small number of stages $(2-4)$; in fact, the number of required registers grows rapidly with the number of stages and the overhead quickly becomes non-negligible. For an RCA multiplier of width N with S pipeline stages, the register overhead is $3 \cdot N \cdot(S-1)$. As an example, a 32 bit / 4 stages horizontal pipeline multiplier needs 288 extra flip-flops!

### 5.1.3 RCA diagonal pipeline variations

From a delay point of view, a better way to pipeline an RCA multiplier is to divide it diagonally. This approach is harder to code in a high level language than the horizontal split, because the split parts can no longer be considered as multipliers of reduced size. An example of how to diagonally pipeline an RCA multiplier is illustrated in Fig. 5.6.

The critical path for a 2 stages diagonal pipeline is obtained by:

$$
\begin{equation*}
t(\text { Diagonal Pipeline })=3 / 4 N \cdot t_{\text {cout }}+(3 / 4 N-1) \cdot t_{\text {sum }}+t_{d f f} \tag{5.4}
\end{equation*}
$$

With the diagonal pipeline implementation, the register overhead is slightly greater than for the horizontal pipeline. In fact, for a two stages pipeline, we can count N latches for the $A$ bits, $3 / 4 \mathrm{~N}$ latches for the $B$ bits, $5 / 4 \mathrm{~N}$ registers for the internal sum propagation and $1 / 2 \mathrm{~N}$ registers for the carry propagation. All these contributions account for 3.5 N registers. This value can be compared to the one for the horizontal pipeline case, where 3 N registers were needed.

Figure 5.6: 2 stages diagonally pipelined 8bit RCA

In a 4 stages diagonal pipeline version, the register overhead for each of the two newly added banks is: $3 / 4 \mathrm{~N}$ registers for the $A$ bits, $3 / 8 \mathrm{~N}$ registers for the $B$ bits, $13 / 8 \mathrm{~N}$ registers for the internal sum propagation and $3 / 8 \mathrm{~N}$ for the carry propagation. The total number of registers per added bank is hence $25 / 8 \mathrm{~N}$. Summing all extra registers, the total overhead for a 4 stages diagonal pipeline is $3.5 N+2 \cdot(25 / 8 N)=39 / 4 N$, and the corresponding delay would be $(3 / 4 N-1) t_{s u m}+t_{d f f}+t_{\text{dff\_setup}}$. Just to compare with the horizontal pipeline version, a 32 bit / 4 stages diagonal pipeline multiplier has 312 extra registers.
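
The register counts quoted for the two pipelining styles can be reproduced with a few lines of arithmetic. The sketch below uses exact fractions; the function names are hypothetical, and the diagonal case only covers the 2 and 4 stages versions described in the text.

```python
from fractions import Fraction

def horizontal_overhead(n, stages):
    """Extra registers for an S-stage horizontal pipeline: 3*N*(S-1)."""
    return 3 * n * (stages - 1)

def diagonal_overhead(n, stages):
    """Extra registers for a 2- or 4-stage diagonal pipeline, using the
    per-bank counts given in the text (other stage counts are not covered)."""
    first_bank = Fraction(7, 2) * n    # N + 3/4 N + 5/4 N + 1/2 N
    extra_bank = Fraction(25, 8) * n   # 3/4 N + 3/8 N + 13/8 N + 3/8 N
    if stages == 2:
        return first_bank
    if stages == 4:
        return first_bank + 2 * extra_bank  # 39/4 N
    raise ValueError("only 2 and 4 stages are described in the text")

print(horizontal_overhead(32, 4), diagonal_overhead(32, 4))
```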

### 5.2 Wallace

The Wallace multiplier [41] [42] [43] is a very fast and well balanced architecture. To achieve this efficiency, the partial products (i.e. $A \cdot B_{i}$, called P0-P7 in Fig. 5.8) are summed in parallel by using Carry Save Adders (CSA) [44]. A CSA (Fig. 5.7) is nothing more than a row of full adders arranged as 3-to-2 compressors. In a CSA, no carry propagates between the full adders; consequently the total delay corresponds to the worst case delay of one FA. The main drawback of a CSA is that it does not return a single sum but two vectors, S and C, whose weighted sum (S plus shifted C) equals the sum of the three input vectors $(x+y+z=S+2 C)$.


Figure 5.7: Internal implementation of a Carry Save Adder (CSA)
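
The identity $x+y+z=S+2C$ can be verified with a short bitwise model, in which every bit position acts as an independent full adder. This is a minimal sketch on unsigned integers; the function name and the sample operands are illustrative assumptions.

```python
def carry_save_add(x, y, z):
    """3-to-2 compression of three unsigned integers.
    Each bit position is an independent full adder; no carry ripples."""
    s = x ^ y ^ z                    # per-bit sum outputs
    c = (x & y) | (x & z) | (y & z)  # per-bit carry outputs (not yet shifted)
    return s, c

x, y, z = 0b1011, 0b0110, 0b1101
s, c = carry_save_add(x, y, z)
print(x + y + z, s + 2 * c)
```

The final addition of S and the shifted C is the only step that needs a carry-propagating adder.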

The structure of a Wallace multiplier is shown in Fig. 5.8 for an 8 bit version. The partial products P0-P7 are added 3 by 3 with CSAs until only two bit vectors remain (Sum and Carry). At this point, a fast final adder sums them to obtain the result of the multiplication. The kind of final adder can vary from one implementation to another. In the Wallace tree implementations presented in this thesis, a Brent-Kung [45] adder is used. The advantage of the Brent-Kung (bk) implementation is that it is very fast.


Figure 5.8: Wallace 8bit structure

The worst case delay for the multiplier tree (without the final adder) is equal to the number of levels times the worst case delay of a FA.

To calculate the total delay of the Wallace tree multiplier, the delay of the final adder (Brent-Kung type in this case) needs to be added.

$$
\begin{equation*}
t(\text { Basic Wallace }) \approx \log _{1.5}(N) \cdot t_{F A}+t_{b k \text { adder }} \tag{5.5}
\end{equation*}
$$

With $N$ the bit width of the multiplier, $t_{F A}$ the worst delay for a full adder and $t_{b k \text { adder }}$ the delay of the final bk adder, which is also dependent on the size of the multiplier.

| Data width | Number of levels |
| :---: | :---: |
| 8 | 4 |
| 16 | 6 |
| 32 | 8 |
| 64 | 10 |
| 128 | 12 |
| N | $\approx \log _{1.5}(N)$ |

Table 5.1: Number of CSA levels for some typical multiplier width

### 5.2.1 Wallace parallel versions

The parallelized versions of the Wallace multiplier are obtained exactly in the same way as for the RCA (Fig. 5.4). The description in Section 5.1.1 remains valid for the Wallace multiplier, too.

### 5.3 Sequential

The Sequential multiplier takes its name from the fact that this implementation uses several clock cycles to compute one multiplication by sequentially "adding and shifting" the previous partial result. The structure of such multiplier is illustrated in Fig. 5.9. The main advantage of this implementation is the compactness of the circuit. In fact, to calculate a 16 bit multiplication, only a 17 bit adder with some registers and a bit of control logic are required. On the other hand, the result will not be available until 16 clock cycles have taken place.


Figure 5.9: Sequential multiplier structure (16bit)
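
The add and shift principle can be sketched as a cycle-by-cycle model. The code below assumes unsigned operands and a right-shifting accumulator split into a high and a low part; this register arrangement is an assumption made for illustration and may differ from the exact structure of Fig. 5.9. For 16 bit operands the addition never exceeds 17 bits, which is consistent with the adder size mentioned above.

```python
def sequential_multiply(a, b, n=16):
    """Add-and-shift multiplication of two n-bit unsigned integers.
    Returns (product, number_of_clock_cycles)."""
    hi, lo = 0, 0  # high and low halves of the accumulator
    cycles = 0
    for i in range(n):                      # one iteration models one clock cycle
        pp = a if (b >> i) & 1 else 0       # partial product: A and B_i
        total = hi + pp                     # fits in n+1 bits
        lo = (lo >> 1) | ((total & 1) << (n - 1))  # shift LSB of total into lo
        hi = total >> 1                     # shift the accumulator right by one
        cycles += 1
    return (hi << n) | lo, cycles

print(sequential_multiply(1234, 5678))
```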

In the case of the present thesis, the adder used in the "add and shift" structure is a Brent-Kung type (bk), which is known for being a very fast adder. Considering the best case where the adder uses exactly the whole clock period, the total delay of a Sequential multiplier is given by:

$$
\begin{equation*}
t(\text { Sequential })=N \cdot\left(t_{b k \text { adder }}+t_{A N D}+t_{d f f}+t_{\text{dff\_setup}}\right) \tag{5.6}
\end{equation*}
$$

With $N$ the multiplier bit width, $t_{b k \text { adder }}$ the Brent-Kung adder delay, $t_{A N D}$ the delay of the AND gate used to generate the partial products, $t_{d f f}$ the registers clock-to-Q delay and $t_{\text{dff\_setup}}$ the registers setup time.

In the case where the clock frequency is smaller than the maximal allowed one, the total delay will correspond to $N \cdot t_{\text {clock }}$.

### 5.3.1 Sequential-wallace

A special modification of the Sequential multiplier is what we call the Sequential-wallace (Fig. 5.10). The idea is to reduce the number of clock cycles required to compute one multiplication by adding partial multiplications rather than partial products. In the case of a 16 bit implementation (as reported in Fig. 5.10), a $4 \times 16$ bit Wallace multiplier is used to compute partial multiplications and then the results are summed sequentially. In this way we obtain an intermediate version between the Wallace (large area, small delay) and the Sequential (small area, large delay). For the proposed example, only 4 clock cycles are required per multiplication compared to the 16 cycles necessary for the basic Sequential implementation.


Figure 5.10: Sequential multiplier (16bit) with a $4 x 16$ Wallace implementation

The delay of a Sequential-wallace multiplier is obtained by:

$$
\begin{equation*}
t(\text { Sequential-wallace })=M \cdot\left(t_{b k \text { adder }}+t_{N / M \times N \text { Wallace }}+t_{d f f}+t_{\text{dff\_setup}}\right) \tag{5.7}
\end{equation*}
$$

With $M(<N)$ the number of required cycles, $N$ the bit width of the multiplier, $t_{b k \text { adder }}$ the Brent-Kung adder delay and $t_{N / M \times N \text { Wallace }}$ the delay of the $N / M \times N$ Wallace multiplier.
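
The principle of accumulating partial multiplications can be sketched functionally as follows. The built-in product stands in for the small $N/M \times N$ Wallace multiplier, and the function name and the slicing of $B$ from the least significant bits upwards are assumptions made for illustration.

```python
def sequential_wallace_multiply(a, b, n=16, m=4):
    """Multiply two n-bit unsigned integers in m cycles by accumulating
    (n/m x n) partial multiplications (functional sketch only)."""
    assert n % m == 0, "n must be divisible by the number of cycles"
    chunk = n // m
    mask = (1 << chunk) - 1
    acc = 0
    for cycle in range(m):
        b_slice = (b >> (cycle * chunk)) & mask  # n/m bits of B per cycle
        partial = a * b_slice                    # stands in for the small Wallace multiplier
        acc += partial << (cycle * chunk)        # sequential accumulation
    return acc, m

print(sequential_wallace_multiply(1234, 5678))
```

With `n=16` and `m=4` the model uses 4 cycles per multiplication, as in the example of Fig. 5.10.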

### 5.3.2 Sequential parallel

The parallelized version of the Sequential multiplier is obtained exactly the same way as for the RCA and the Wallace (Fig. 5.4). The only difference is the sel pin that only switches once every N clock cycles, where N is the size of the multiplier.

### 5.4 Summary

In this chapter, 13 multiplier architectures have been discussed. These circuits are divided into 3 families (namely RCA, Wallace and Sequential) and they cover a wide range of delay, area and complexity. For this reason, they are well suited as reference circuits for the discussions presented further in this thesis. For convenience, the periods of the maximal throughput frequency as well as the cell count for each design are summarized in Table 5.2. The equations of the cell count for the Wallace implementations are obtained from [46] [47].


|  | Period of the maximal throughput frequency | Number of combinatorial cells | Number of sequential cells |
| :---: | :---: | :---: | :---: |
| RCA basic | $(2 N-2) \cdot t_{\text {cout }}+(N-2) \cdot t_{\text {sum }}$ | $(N-1)^{2} \cdot \mathbf{FA}+N \cdot \mathbf{HA}+N^{2} \cdot \mathbf{AND}$ | $4 N$ |
| RCA parallel 2 | $(N-1) \cdot t_{\text {cout }}+(N / 2-1) \cdot t_{\text {sum }}+t_{\text {mux }}$ | $2(N-1)^{2} \cdot \mathbf{FA}+2 N \cdot \mathbf{HA}+2 N^{2} \cdot \mathbf{AND}+\mathbf{MUX2}$ | $6 N$ |
| RCA parallel 4 |  |  | $10 N$ |
| RCA horiz. pipeline 2 |  | $(N-1)^{2} \cdot \mathbf{FA}+N \cdot \mathbf{HA}+N^{2} \cdot \mathbf{AND}$ | $7 N$ |
| RCA horiz. pipeline 4 |  | $(N-1)^{2} \cdot \mathbf{FA}+N \cdot \mathbf{HA}+N^{2} \cdot \mathbf{AND}$ | $13 N$ |
| RCA diag. pipeline 2 |  | $(N-1)^{2} \cdot \mathbf{FA}+N \cdot \mathbf{HA}+N^{2} \cdot \mathbf{AND}$ | $7 N$ |
| RCA diag. pipeline 4 |  | $(N-1)^{2} \cdot \mathbf{FA}+N \cdot \mathbf{HA}+N^{2} \cdot \mathbf{AND}$ |  |
| Wallace basic |  |  | $4 N$ |
| Wallace parallel 2 |  |  | $6 N$ |
| Wallace parallel 4 |  |  | $10 N$ |
| Sequential basic |  | $N \cdot \mathbf{FA}+\mathbf{HA}+N \cdot \mathbf{AND}$ |  |
| Sequential-wallace |  |  | $5 N$ |
| Sequential parallel 2 |  | $2 N \cdot \mathbf{FA}+2 \mathbf{HA}+2 N \cdot \mathbf{AND}+\mathbf{MUX2}$ | $8 N$ |

Table 5.2: Period of the maximal throughput frequency and cell count of the reference multipliers

## Chapter 6

## Total power comparison for free Vdd and free Vth

A very effective way to reduce the total power consumption in digital circuits is the reduction of the supply voltage $V d d$. This approach is simple to implement and simultaneously reduces the dynamic power quadratically and the static power linearly. Unfortunately, the speed decreases rapidly at the same time. To avoid this, the original performance can be restored by reducing the transistor threshold voltage Vth. The price for this is an exponential increase of the static power. Consequently, balancing the reduction of the dynamic power against the increase of the static power leads to a point in the $(V d d, V t h)$ space where, for a given delay, the total power presents a minimum. This chapter discusses this minimum of the total power consumption and derives an approximate formula for the total power at the optimal (Vdd, Vth) point.

### 6.1 Existence of a total power consumption optimum

To convince the reader of the existence of the minimum of the total power consumption, it is important to recall the power and delay equations reported in Chapter 3:

$$
\begin{gather*}
\text { Ptot }=P d y n+\text { Pstat }=a C N f V_{d d}^{2}+N V_{d d} I_{0} e^{-\frac{V t h}{n U t}}  \tag{6.1}\\
f_{\max }=\frac{I_{\mathrm{On}}}{k_{t} \cdot C \cdot L D \cdot V_{d d}}=\frac{I_{0} \cdot e^{\alpha}}{k_{t} \cdot C \cdot L D \cdot(\alpha n U t)^{\alpha}} \frac{\left(V_{d d}-V_{t h}\right)^{\alpha}}{V_{d d}} \tag{6.2}
\end{gather*}
$$

With $a$ the activity factor, $C$ the equivalent capacitance per cell, $N$ the number of cells, $f$ the working frequency, $V d d$ the supply voltage, $I_{0}$ the reference transistor current, $V t h$ the transistor threshold voltage, $n$ the sub-threshold slope, $U t$ the thermal potential, $k_{t}$ the delay proportionality constant, $L D$ the logical depth and $\alpha$ the alpha power law coefficient.

If now we consider that the frequency $f_{\max }$ (called $f$ from now on) is fixed and defined by the application, it is possible to rewrite Eq. (6.2) to obtain the formula tying $V d d$ and $V t h$ together:

$$
\begin{equation*}
V_{t h}=V_{d d}-\chi \cdot V_{d d}^{1 / \alpha} \quad \text { with: } \chi^{\alpha}=\frac{k_{t} \cdot C \cdot f \cdot L D}{I_{0}\left(\frac{e}{\alpha n U t}\right)^{\alpha}} \tag{6.3}
\end{equation*}
$$

The parameter $\chi$ in Eq. (6.3) is a very important one. This parameter ties together the supply voltage and the threshold voltage. Its value represents a kind of "global rapidness" accounting for both technology and architectural impacts. Actually, a large $\chi$ means a "slow" design, which can be due to a large logical depth or a slow technology or a combination of architectural and technology parameters. The presence of the working frequency in the equation of $\chi$ shows that the concept of slow or quick design is dependent on the desired working frequency. For instance, a design considered rapid for a working frequency of 1 MHz , could be considered slow for a working frequency of 100 MHz .

A graphical representation of Eq. (6.3) is given in Fig. 6.1. There, we can see that a reduction of the supply voltage also requires a reduction of the threshold voltage in order to maintain speed. Even if there exists an infinite number of pairs ( $V d d, V t h$ ) showing the same performance, they do not present the same power consumption. In fact, while the reduction of the supply voltage $V d d$ reduces the dynamic power quadratically and the static power linearly, the reduction of the threshold voltage $V t h$ causes an exponential increase of the static power. Due to the exponential nature of this last dependency, the static power increase can rapidly cancel the benefit of the reduced supply voltage $V d d$. Therefore, among all the combinations of ( $V d d, V t h$ ) guaranteeing the desired speed, only one pair results in the lowest power consumption for a given architecture (Fig. 6.2). From now on, this working condition will be called the optimal working point or ideal working point.
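
The constant-speed relation of Eq. (6.3) is easy to tabulate. The sketch below uses the values $\alpha=1.65$ and $\chi=0.3$ quoted for Fig. 6.1; the function name and the chosen supply voltages are illustrative assumptions.

```python
def vth_for_constant_speed(vdd, chi=0.3, alpha=1.65):
    """Eq. (6.3): threshold voltage that keeps the target frequency at supply vdd."""
    return vdd - chi * vdd ** (1.0 / alpha)

for vdd in (0.4, 0.6, 0.8, 1.0, 1.2):
    print(f"Vdd = {vdd:.1f} V -> Vth = {vth_for_constant_speed(vdd):.3f} V")
```

Every pair printed by this loop delivers the same speed, but not the same total power.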

The location of this optimal working point and its associated total power consumption are tightly related to architectural and technology parameters. For instance, Fig. 6.2 illustrates the fact that reducing the activity factor allows a reduction of Ptot, whereas it tends to increase the optimal $V d d$ and $V t h$. As architectural modifications simultaneously change several factors (not just the activity), it is necessary to develop a methodology to evaluate the influence of such transformations on the total power consumption (Ptot).

Figure 6.1: Relationship between Vdd and Vth for $\alpha=1.65$ and $\chi=0.3$

In related contributions ([48], [49], [50], [51], [52], [53], [54]), the authors preferred to seek the minimum of the energy rather than the minimum of the total power as done in this work. From a mathematical point of view, looking for the minimum of the energy is slightly easier, and the results are different from what we derive here. Indeed, they found that the minimum of the total energy is most of the time located in the weak-inversion transistor region (optimal $V d d<$ optimal $V t h$ ), which corresponds to very low performance logic.

### 6.2 Pdyn over Pstat ratio

Looking at the ratio of Pdyn over Pstat at the optimal working point in Fig. 6.2, it is possible to observe that the dynamic contribution still remains greater than the static one.

$$
\begin{equation*}
k 1=\left.\frac{P d y n}{P s t a t}\right|_{\text {optimum }} \tag{6.4}
\end{equation*}
$$

This ratio (k1) is a measurement of the circuit usefulness. In fact, rarely used transistors will provide a low k1 due to their high static consumption compared to the dynamic one. For this reason, it is better to have fewer transistors (less static power) working more actively (more dynamic power) than to have lots of idle transistors that just increase the static power. In related works [55] [56], the authors stated that this ratio should be equal to 1, whereas our experience, based on many designs (like multipliers, FIR, shift registers, micro-processors, counters, ...) in deep sub-micron technologies ($0.18 \mu m$, $0.13 \mu m$ and 90 nm), suggests that typical values of k1 are between 3 and 7.

Figure 6.2: Total power consumption of a 16 bit Wallace multiplier in a STM 90 nm technology (CMOS090-SVT, 100 MHz ) with freely modifiable Vdd and Vth. Three different circuit activities (a) are reported. The optimal working points are marked by a cross mark.

### 6.2.1 k1 derivation

A precise calculation of k 1 is possible and easy to obtain. In fact, k 1 can be derived by searching the minimum of $\operatorname{Ptot}(\mathrm{Vdd})$ as:

$$
\begin{equation*}
\frac{\partial P \operatorname{tot}\left(V_{d d}\right)}{\partial V_{d d}}=\frac{\partial P d y n\left(V_{d d}\right)}{\partial V_{d d}}+\frac{\partial P \operatorname{stat}\left(V_{d d}\right)}{\partial V_{d d}}=0 \tag{6.5}
\end{equation*}
$$

The combination of Eq. (6.5) with Eq. (6.1) and Eq. (6.3) leads to:

$$
\begin{equation*}
k 1=\frac{(\alpha-1) V_{d d}^{\text {opt }}+V_{t h}^{\text {opt }}}{2 n U t \alpha}-\frac{1}{2} \tag{6.6}
\end{equation*}
$$

With $\alpha$ the alpha power law coefficient, $n$ the sub-threshold slope and $U t$ the thermal potential.
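
Eq. (6.6) can be checked numerically by locating the minimum of Ptot along the constant-speed curve of Eq. (6.3) and comparing the resulting Pdyn/Pstat ratio with the closed form. In the sketch below, $\alpha$ and $1/nUt$ are the svt values of Table 4.9 and $\chi=0.3$ is the value used in Fig. 6.1; the per-cell product $a \cdot C \cdot f$ and the use of the NAND2x2 current as $I_{0}$ are hypothetical choices made only to obtain a runnable example.

```python
import math

ALPHA = 1.65        # svt alpha (Table 4.9)
N_UT = 1.0 / 26.30  # svt n*Ut in volts (Table 4.9)
CHI = 0.3           # same chi as in Fig. 6.1
I0 = 26.0e-6        # reference current [A]; NAND2x2 svt value used as a stand-in
A_C_F = 1.0e-8      # a*C*f per cell [A/V], hypothetical (a = 0.1, C = 1 fF, f = 100 MHz)

def vth(vdd):
    return vdd - CHI * vdd ** (1.0 / ALPHA)       # Eq. (6.3)

def p_dyn(vdd):
    return A_C_F * vdd ** 2                       # per-cell dynamic power, Eq. (6.1)

def p_stat(vdd):
    return vdd * I0 * math.exp(-vth(vdd) / N_UT)  # per-cell static power, Eq. (6.1)

# Dense grid search over Vdd (10 uV step) for the minimum of Ptot
grid = [0.2 + 1e-5 * i for i in range(100001)]
vdd_opt = min(grid, key=lambda v: p_dyn(v) + p_stat(v))
vth_opt = vth(vdd_opt)

k1_ratio = p_dyn(vdd_opt) / p_stat(vdd_opt)
k1_formula = ((ALPHA - 1) * vdd_opt + vth_opt) / (2 * N_UT * ALPHA) - 0.5  # Eq. (6.6)
print(f"Vdd_opt = {vdd_opt:.3f} V, Vth_opt = {vth_opt:.3f} V")
print(f"k1 (ratio) = {k1_ratio:.2f}, k1 (Eq. 6.6) = {k1_formula:.2f}")
```

Under these assumptions the two values of k1 agree up to the resolution of the grid, and the optimum lies close to a supply voltage of 0.6 V.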

Table 6.1 shows the equivalent of Eq. (6.6) in the case of the STM 90nm technology (used values are obtained from Chapter 4).

| LVT | $k 1 \approx 4.0 V_{d d}^{\text {opt }}+7.3 V_{t h}^{\text {opt }}-0.5$ |
| :--- | :--- |
| SVT | $k 1 \approx 5.2 V_{d d}^{\text {opt }}+8.0 V_{t h}^{\text {opt }}-0.5$ |
| HVT | $k 1 \approx 5.9 V_{d d}^{\text {opt }}+7.1 V_{t h}^{\text {opt }}-0.5$ |

Table 6.1: Approximation of k 1 for STM 90 nm technology
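
The coefficients of Table 6.1 follow from Eq. (6.6) and the parameters of Table 4.9. The short sketch below recomputes them; the dictionary and function names are hypothetical. The recomputed values agree with the table to within about 0.1, the remaining difference being a matter of rounding.

```python
# (alpha, 1/(n*Ut) [1/V]) per flavor, from Table 4.9
FLAVORS = {"LVT": (1.56, 22.72), "SVT": (1.65, 26.30), "HVT": (1.84, 26.16)}

def k1_coefficients(alpha, inv_n_ut):
    """Coefficients of Vdd_opt and Vth_opt in Eq. (6.6)."""
    c_vdd = (alpha - 1.0) * inv_n_ut / (2.0 * alpha)
    c_vth = inv_n_ut / (2.0 * alpha)
    return c_vdd, c_vth

for name, (alpha, inv_n_ut) in FLAVORS.items():
    c_vdd, c_vth = k1_coefficients(alpha, inv_n_ut)
    print(f"{name}: k1 ~ {c_vdd:.2f}*Vdd_opt + {c_vth:.2f}*Vth_opt - 0.5")
```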

From the equations in Table 6.1 it is possible to see that the case $\mathrm{k} 1=1$ is very difficult to reach, as it would correspond to extremely low optimal $V d d$ and $V t h$.

In Eq. (6.6), k1 was expressed in terms of the optimal $V d d$ and the optimal $V t h$, but it can also be related to the on current ( $I_{\mathrm{on}}$ ) and the off current ( $I_{\text {off }}$ ), or better to the ratio of these two. In fact, using Eq. (6.1) and Eq. (6.2):

$$
\begin{align*}
P d y n & =k 1 \cdot P s t a t  \tag{6.7}\\
a \cdot C \cdot N \cdot V_{d d}^{2} \frac{I_{\mathrm{on}}^{o p t}}{k_{t} \cdot C \cdot L D \cdot V_{d d}} & =k 1 \cdot N \cdot V_{d d} \cdot I_{\mathrm{off}}^{\mathrm{opt}}  \tag{6.8}\\
k 1 & =\left.\frac{a}{L D} \frac{1}{k_{t}} \frac{I_{\mathrm{On}}}{I_{\mathrm{off}}}\right|^{\mathrm{opt}} \tag{6.9}
\end{align*}
$$

If now we remember that $k_{t}$ is just a constant, k 1 can easily be expressed by:

$$
\begin{equation*}
\left.k 1 \propto \frac{a}{L D} \frac{I_{\mathrm{on}}}{I_{\mathrm{off}}}\right|^{\mathrm{opt}} \tag{6.10}
\end{equation*}
$$

It is important to note that in Eq. (6.10) the Ion/Ioff also depends on activity (a) and logical depth ( $L D$ ).

Based on the SIA International Technology Roadmap for Semiconductors 2004 [33], the expected ratios of $I_{\mathrm{on}}$ over $I_{\mathrm{off}}$ for present and future technologies are:

| Year | 2006 | 2009 | 2012 | 2015 | 2018 |
| :--- | :---: | :---: | :---: | :---: | :---: |
| HP | 23400 | 22714 | 17900 | 7000 | 4380 |
| LOP | 203333 | 154000 | 118571 | 90000 | 31667 |
| LSTP | $25.5 \times 10^{6}$ | $17.5 \times 10^{6}$ | $13.2 \times 10^{6}$ | $10.9 \times 10^{6}$ | $9.9 \times 10^{6}$ |

Table 6.2: SIA ITRS 2004 expected transistor $I_{\mathrm{on}} / I_{\text {off }}$ ratios for High Performance (HP), Low Operating Power (LOP) and Low Standby Power (LSTP) circuits.

Looking at Table 6.2, we see that the ratio $I_{\text {on }}$ over $I_{\text {off }}$ will decrease with time due to the large increase of the static power consumption. On the other hand, we have previously seen that the variable $k 1$ does not change much. Hence, from Eq. (6.10), we can conclude that an architecture with activity $a$ and logical depth $L D$ working at its optimal condition in a present technology will require a higher ratio $a / L D$ in a future technology. This can be achieved by reducing the logical depth $L D$, but also by increasing the activity $a$, which corresponds to making better use of the implemented hardware. This reasoning, for instance, will tend to favor pipelining over parallelization. Indeed, the ratio $a / L D$ is increased in a pipelined design due to the reduction of $L D$, whereas the same ratio will remain almost unchanged during parallelization (cf. Table 6.5 and Table 6.6).
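Assuming, as the text does, that k1 and $k_{t}$ stay roughly constant, Eq. (6.10) implies that $a / L D$ must grow by the same factor by which $I_{\text {on }} / I_{\text {off }}$ shrinks. A minimal Python sketch of this estimate, using the values of Table 6.2, is given below. The function and variable names are illustrative, and treating the roadmap ratios as the ratios at the optimal working point is a simplifying assumption.

```python
# Ion/Ioff ratios copied from Table 6.2 (SIA ITRS 2004).
ION_IOFF = {
    "HP": {2006: 23400, 2009: 22714, 2012: 17900, 2015: 7000, 2018: 4380},
    "LOP": {2006: 203333, 2009: 154000, 2012: 118571, 2015: 90000, 2018: 31667},
    "LSTP": {2006: 25.5e6, 2009: 17.5e6, 2012: 13.2e6, 2015: 10.9e6, 2018: 9.9e6},
}


def required_a_over_ld_scaling(family, year, ref_year=2006):
    """Factor by which a/LD must grow to keep k1 constant (Eq. (6.10))."""
    ratios = ION_IOFF[family]
    return ratios[ref_year] / ratios[year]


# Scaling needed in 2018 relative to 2006, for each circuit family.
scaling_2018 = {fam: required_a_over_ld_scaling(fam, 2018) for fam in ION_IOFF}
```

Under these assumptions, the HP family would need an $a / L D$ ratio a little more than five times higher in 2018 than in 2006.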

By using Eq. (6.22) (derived later in this chapter), it is possible to write k1 in a much simpler form.

$$
\begin{equation*}
k 1=\left.\frac{P d y n}{P s t a t}\right|^{o p t}=\frac{a C N f V_{d d}^{2}}{I_{0} N V_{d d} e^{-V t h / n U t}} \cong \frac{a C f V_{d d}}{2 n U t a C f /(1-\chi A)}=\frac{V_{d d}}{2 n U t}(1-\chi A) \tag{6.11}
\end{equation*}
$$

In a similar way, Eq. (6.11) can also be expressed in terms of the optimal $V t h$ by applying Eq. (6.15).

$$
\begin{equation*}
k 1=\frac{V_{t h}+\chi B}{2 n U t} \tag{6.12}
\end{equation*}
$$

### 6.3 Optimal Vdd and Vth formulas

In this section the complete derivation of the optimal threshold voltage and supply voltage is presented. The difficulty of the derivation is to express $V t h^{o p t}$ without the use of $V d d^{\text {opt }}$ and vice-versa, i.e. we need to decouple these two variables. To achieve this, we need to linearize the expression $V d d^{1 / \alpha}$ (with $\alpha$ the alpha power law coefficient, its value spanning from 1 to 2 ), which is the origin of the transcendental nature of Eq. (6.3).

Fig. 6.3 shows the expression $V d d^{1 / \alpha}$ and its linear approximation for $V d d$ from 0.3 V to 1 V for $\alpha=1.65$.




Figure 6.3: $V d d^{1 / \alpha}$ [solid line] and its linear approximation [dashed line]

From this figure, we see how well $V d d^{1 / \alpha}$ can be linearized over a relatively large interval, leading to the following approximation:

$$
\begin{equation*}
V_{d d}^{1 / \alpha} \approx A(\alpha) \cdot V_{d d}+B(\alpha) \tag{6.13}
\end{equation*}
$$

With $A$ and $B$ depending on $\alpha$ but also on the interval of $V d d$ where the approximation is done. $A$ and $B$ can be determined numerically (easy) and analytically (more complex, but feasible). For $V d d$ in the interval [0.3V;1V], the graph in Fig. 6.4 can be used to estimate $A$ and $B$.

The lower graph in Fig. 6.4 shows the maximal error in percent obtained with the proposed linear approximation. For the range of $V d d$ restricted to the interval $[0.3 \mathrm{~V} ; 1 \mathrm{~V}]$ the error always remains lower than $5 \%$. It is important to note that newer technologies will tend to have even smaller values of $\alpha$, which makes the approximation of Eq. (6.13) even better. Moreover, in the case where a better approximation is needed, the error can be further reduced by limiting the range of $V d d$.

Figure 6.4: Linearization coefficients for Vdd in $[0.3 \mathrm{~V} ; 1 \mathrm{~V}]$

In Fig. 6.5, the parameters $A$ and $B$ are calculated for $V d d$ between 0.3 V and 0.6 V, which gives a maximal error lower than $1.4 \%$. The values of $A$ and $B$ for the $\alpha$ corresponding to the three different variations of the STM 90nm CMOS technology are reported in Table 6.3.

Using the approximation in Eq. (6.13), it is now possible to rewrite Eq. (6.3) in a simpler way:

$$
\begin{equation*}
V_{t h}^{o p t}\left(V_{d d}^{o p t}\right) \cong V_{d d}^{o p t}-\chi\left(A \cdot V_{d d}^{o p t}+B\right)=V_{d d}^{o p t}(1-\chi \cdot A)-\chi \cdot B \tag{6.14}
\end{equation*}
$$



Figure 6.5: Linearization coefficients for Vdd in $[0.3 \mathrm{~V} ; 0.6 \mathrm{~V}]$

|  | $V d d \in[0.3 V ; 1 V]$ |  |  | $V d d \in[0.3 V ; 0.6 V]$ |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | LVT | SVT | HVT | LVT | SVT | HVT |
| $\alpha$ | 1.56 | 1.65 | 1.84 | 1.56 | 1.65 | 1.84 |
| $A(\alpha)$ | 0.760 | 0.731 | 0.676 | 0.859 | 0.835 | 0.788 |
| $B(\alpha)$ | 0.260 | 0.286 | 0.342 | 0.210 | 0.238 | 0.290 |

Table 6.3: Values of $A$ and $B$ for the three types of STM090 transistors
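The numerical determination of $A$ and $B$ can be sketched as follows. This is a minimal pure-Python example that assumes an ordinary least-squares fit on a uniform grid; the thesis does not state which fitting method was used, and the function name and sample count are illustrative. For the SVT case ($\alpha=1.65$) over $[0.3 \mathrm{~V} ; 1 \mathrm{~V}]$ this choice gives values close to those of Table 6.3.

```python
def linearize_vdd_power(alpha, vdd_min, vdd_max, samples=2001):
    """Least-squares fit of Vdd**(1/alpha) by A*Vdd + B on [vdd_min, vdd_max].

    Returns (A, B, max_rel_error), the last being the largest relative
    deviation between the fitted line and the exact curve on the grid.
    """
    step = (vdd_max - vdd_min) / (samples - 1)
    xs = [vdd_min + i * step for i in range(samples)]
    ys = [x ** (1.0 / alpha) for x in xs]
    mean_x = sum(xs) / samples
    mean_y = sum(ys) / samples
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    max_rel_error = max(abs(slope * x + intercept - y) / y
                        for x, y in zip(xs, ys))
    return slope, intercept, max_rel_error


# SVT transistor of Table 6.3, full interval [0.3 V; 1 V].
A_svt, B_svt, err_svt = linearize_vdd_power(alpha=1.65, vdd_min=0.3, vdd_max=1.0)
```

Restricting the interval, for instance to $[0.3 \mathrm{~V} ; 0.6 \mathrm{~V}]$, only requires changing the `vdd_max` argument.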

Such an approximation is now invertible and allows the optimal $V d d$ to be estimated:

$$
\begin{equation*}
V_{d d}^{o p t}\left(V_{t h}^{\text {opt }}\right) \cong \frac{V_{t h}^{\text {opt }}+\chi \cdot B}{1-\chi \cdot A} \underbrace{=}_{V t h=V t h 0-\eta V d d} \frac{V_{t h 0}^{\text {opt }}+\chi \cdot B}{1-\chi \cdot A+\eta} \tag{6.15}
\end{equation*}
$$

Another useful expression is the first derivative of $V t h$ with respect to $V d d$. This expression will be used in the next sections, but for the sake of simplicity, it will be presented here. From Eq. (6.14) the partial derivative becomes:

$$
\begin{equation*}
\frac{\partial V_{t h}^{\text {opt }}}{\partial V_{d d}^{\text {opt }}} \cong(1-\chi \cdot A) \tag{6.16}
\end{equation*}
$$

### 6.3.1 Optimal threshold voltage derivation

The expression of the optimal threshold voltage can be derived by searching for the $V t h$ that would minimize the total power consumption. Hence:

$$
\begin{equation*}
\frac{\partial P \operatorname{tot}\left(V_{t h}\right)}{\partial V_{t h}}=\frac{\partial P d y n\left(V_{t h}\right)}{\partial V_{t h}}+\frac{\partial P \operatorname{stat}\left(V_{t h}\right)}{\partial V_{t h}}=0 \tag{6.17}
\end{equation*}
$$

Or, better:

$$
\begin{equation*}
\frac{\partial P d y n\left(V_{t h}\right)}{\partial V_{t h}}=-\frac{\partial P \operatorname{stat}\left(V_{t h}\right)}{\partial V_{t h}} \tag{6.18}
\end{equation*}
$$

It is now possible to substitute Eq. (6.1) in Eq. (6.18) to obtain:

$$
\begin{align*}
2 a C N f V_{d d} \frac{\partial V_{d d}}{\partial V_{t h}} & =-I_{0} N e^{-V t h / n U t}\left(\frac{\partial V_{d d}}{\partial V_{t h}}-\frac{V_{d d}}{n U t}\right)  \tag{6.19}\\
e^{V t h / n U t} & =\frac{I_{0}}{2 n U t a C f}\left(\frac{\partial V_{t h}}{\partial V_{d d}}-\frac{n U t}{V_{d d}}\right)  \tag{6.20}\\
e^{V t h / n U t} & \underbrace{\cong}_{\text {Eq.(6.16) }} \frac{I_{0}}{2 n U t a C f}\left(1-\chi A-\frac{n U t}{V_{d d}}\right) \tag{6.21}
\end{align*}
$$

At room temperature, $n U t$ is about 0.04 V (refer to Table 4.9 for the exact value in the case of STM090 technology). So, even if the optimal supply voltage is as low as 0.4 V, the ratio $n U t / V_{d d}$ is only about 0.1, and it is even lower for higher optimal $V d d$. For this reason, we consider this term negligible compared to $1-\chi A$. This approximation is mandatory in order to decouple $V t h$ and $V d d$.

The optimal $V t h$ can finally be calculated:

$$
\begin{align*}
e^{V t h / n U t} & \cong \frac{I_{0}}{2 n U t a C f}(1-\chi A)  \tag{6.22}\\
V_{t h}^{\text {opt }} & \cong n U t \ln \left(\frac{I_{0}}{2 n U t a C f}(1-\chi A)\right) \quad \text { with: } \chi^{\alpha}=\frac{k_{t} \cdot C \cdot f \cdot L D}{I_{0}\left(\frac{e}{\alpha n U t}\right)^{\alpha}} \tag{6.23}
\end{align*}
$$

Eq. (6.23) shows the influence of architectural parameters (like $a, L D$ [included in $\chi], f$ ) and technology parameters (like $I_{0}, n, C, \alpha, k_{t}$ ) on the optimal threshold voltage $V t h$.
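Eq. (6.23) and the definition of $\chi$ translate directly into code. The sketch below is only an illustration: the function names are hypothetical, and the numerical values are not the STM 90 nm parameters of Chapter 4 but arbitrary values chosen to give plausible orders of magnitude.

```python
import math


def chi_from_parameters(k_t, c, f, ld, i0, alpha, n_ut):
    """chi as defined alongside Eq. (6.23)."""
    return (k_t * c * f * ld / i0) ** (1.0 / alpha) * alpha * n_ut / math.e


def vth_opt(i0, n_ut, a, c, f, chi, lin_a):
    """Eq. (6.23): approximate optimal threshold voltage in volts."""
    argument = i0 / (2.0 * n_ut * a * c * f) * (1.0 - chi * lin_a)
    if argument <= 0.0:
        raise ValueError("Eq. (6.23) needs chi * A < 1")
    return n_ut * math.log(argument)


# Hypothetical parameter set (not the STM 90 nm values of Chapter 4).
params = dict(i0=1e-6, n_ut=0.038, c=1e-15, f=100e6, chi=0.3, lin_a=0.731)

vth_low_activity = vth_opt(a=0.05, **params)
vth_high_activity = vth_opt(a=0.40, **params)
```

Because the activity appears only inside the logarithm, multiplying $a$ by a factor of 8 lowers the optimal $V t h$ by exactly $n U t \ln 8$, whatever the other parameters are.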

Consider a 16 bit Wallace multiplier with the following properties:

| Technology | STM090 SVT |
| :--- | ---: |
| Nominal Dynamic Power | $693.28 \mu \mathrm{~W}$ |
| Nominal Static Power | $9.90 \mu \mathrm{~W}$ |
| Nominal Activity | 0.267 |
| Nominal Frequency | 100 MHz |
| Nominal Max Delay | 2.38 ns |
| Nominal Supply voltage | 1 V |
| Nominal Threshold voltage | 0.353 V |

Table 6.4: Parameters of a 16 bit Wallace multiplier

Fig. 6.6 shows the optimal $V t h$ vs. activity for the multiplier described in Table 6.4, while maintaining the other architectural parameters constant.


Figure 6.6: Optimal Vth vs. activity

The optimal $V t h$ has been calculated in two separate ways. The first, called analytical approximation on the plot, is the direct use of Eq. (6.23) with $V d d^{1 / \alpha}$ linearized over the interval $[0.3 \mathrm{~V} ; 1 \mathrm{~V}]$, whereas the second, called numerical computation on the plot, is obtained with a high resolution numerical computation based on the non-approximated Eq. (6.1) and Eq. (6.3).

The first remark on Fig. 6.6 is that the error of the approximation remains lower than $5 \%$ for the proposed range of activities.

Another interesting point is the shape of the curve optimal $V t h$ vs. $a$. In fact, we can observe how $V t h^{\text {opt }}$ increases for low activity, while it decreases for high activities, as already noted on Fig. 6.2. Moreover, it is visible that for high activities, $V t h^{o p t}$ becomes almost constant or varies only very slightly.

A similar graph is found in Fig. 6.7, but this time with the frequency as a variable parameter, while the other parameters are kept constant.


Figure 6.7: Optimal Vth vs. frequency

As expected, the increase of the working frequency results in a reduction of the optimal $V t h$. In fact, in order to achieve the higher frequency, $V t h$ is reduced to obtain a larger $(V d d-V t h)$.

The last optimal $V t h$ graph is Fig. 6.8 and it shows the optimal $V t h$ vs. the logical depth ( $L D$ ).

Figure 6.8: Optimal $V t h$ vs. logical depth

It is important to note that the optimal $V t h$ is almost insensitive to the logical depth. This can be quite surprising, but it is explained by the important change in the optimal $V d d$ (refer to the next section), which "absorbs" almost completely the changes in the logical depth.

In the case where the nominal technology values of $V d d, V t h, P d y n$ and Pstat are known, Eq. (6.23) can be also written as:

$$
\begin{equation*}
V_{t h}^{\text {opt }} \cong n U t \ln \left(\frac{P s t a t^{n o m}}{P d y n^{n o m}} \frac{V_{d d}^{\text {nom }} e^{\left(V t h 0^{n o m}-\eta V d d^{n o m}\right) / n U t}}{2 n U t}(1-\chi A)\right) \tag{6.24}
\end{equation*}
$$
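As an illustration of Eq. (6.24), the sketch below applies it to the nominal values of Table 6.4. Several inputs are assumptions and not data from the thesis: $n U t=0.038 \mathrm{~V}$, $\chi=0.3$ and $A=0.731$ are placeholder values, and the nominal threshold voltage of Table 6.4 is taken to be the effective value $V t h 0^{n o m}-\eta V d d^{n o m}$.

```python
import math


def vth_opt_from_nominal(pdyn_nom, pstat_nom, vdd_nom, vth_nom, n_ut, chi, lin_a):
    """Eq. (6.24); vth_nom stands for Vth0_nom - eta * Vdd_nom."""
    # I0 / (a * C * f), recovered from the nominal power ratio.
    i0_over_acf = (pstat_nom / pdyn_nom) * vdd_nom * math.exp(vth_nom / n_ut)
    return n_ut * math.log(i0_over_acf / (2.0 * n_ut) * (1.0 - chi * lin_a))


# Nominal values from Table 6.4; n_ut, chi and lin_a are assumed values.
vth_example = vth_opt_from_nominal(
    pdyn_nom=693.28e-6, pstat_nom=9.90e-6, vdd_nom=1.0, vth_nom=0.353,
    n_ut=0.038, chi=0.3, lin_a=0.731,
)
```

The advantage of this form is that $I_{0}$, $a$ and $C$ never have to be extracted separately: their combination is read from the nominal power figures.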

### 6.3.2 Optimal supply voltage derivation

Once the optimal $V t h$ has been calculated, the derivation of the optimal $V d d$ is very simple thanks to Eq. (6.15). In fact, by simply replacing the expression of $V t h^{o p t}$, Eq. (6.25) and Eq. (6.26) can be obtained.

$$
\begin{gather*}
V_{d d}^{\text {opt }} \cong \frac{n U t \ln \left(\frac{I_{0}}{2 n U t a C f}(1-\chi A)\right)+\chi B}{1-\chi A} \quad \text { with: } \chi^{\alpha}=\frac{k_{t} \cdot C \cdot f \cdot L D}{I_{0}\left(\frac{e}{\alpha n U t}\right)^{\alpha}}  \tag{6.25}\\
V_{d d}^{\text {opt }} \cong \frac{n U t \ln \left(\frac{P s t a t^{n o m}}{P d y n^{n o m}} \frac{V_{d d}^{\text {nom }} e^{\left(V t h 0^{n o m}-\eta V d d^{n o m}\right) / n U t}}{2 n U t}(1-\chi A)\right)+\chi B}{1-\chi A} \tag{6.26}
\end{gather*}
$$

To discuss the validity of this approximation, we can reconsider the circuit described in Table 6.4. Fig. 6.9 shows the optimal $V d d$ for different activities. The values of $V d d^{\text {opt }}$ are calculated in two ways. The analytical approximation is based on Eq. (6.25), whereas the numerical computation is based on the non-approximated equations (6.1) and (6.3).


Figure 6.9: Optimal $V d d$ vs. activity

From Fig. 6.9, we can see that, for the chosen range of activity, the error remains smaller than $5 \%$. Moreover, by looking at the shape of the $V d d^{\text {opt }}$ curve, we observe a trend very similar to the one for $V t h$. Actually, the increase of activity reduces both $V t h^{o p t}$ and $V d d^{o p t}$ in a similar way. This can be explained by the fact that a change in activity doesn't modify the timing constraints, and hence the difference $V d d-V t h$ (cf. Eq. (6.2)) remains almost unchanged.

A similar graph can be plotted for the frequency as the free variable. This situation is represented by Figure 6.10.

It is interesting to note the shape of the $V d d^{o p t}(f)$ curve. For high frequencies the behavior corresponds to what we would expect: a reduction of the working frequency allows a reduction of the optimal supply voltage (which corresponds to an increase of the optimal threshold voltage). For low frequencies, however, the optimal $V d d$ starts to increase again. This behavior comes from the large increase of the optimal $V t h$ in this zone. In fact, to avoid a weak inversion regime ( $V d d<V t h$ ), $V d d^{o p t}$ needs to increase in order to keep the difference $V d d-V t h$ positive.

Figure 6.10: Optimal $V d d$ vs. frequency

The last graph of the optimal $V d d$ is reported in Fig. 6.11. There, $V d d^{o p t}$ is plotted versus the logical depth. This curve shows an almost linear behavior. In fact, as stated before, the change in the timing requirements resulting from the change in the logical depth affects almost exclusively the optimal $V d d$ whereas the optimal $V t h$ remains quite constant (cf. Fig. 6.8).

Figure 6.11: Optimal $V d d$ vs. logical depth

Finally, we can say that frequency mainly affects the optimal $V t h$, logical depth mainly affects the optimal $V d d$, and activity affects both of them.
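The different sensitivities of $V t h^{o p t}$ and $V d d^{o p t}$ to the logical depth can be reproduced with Eq. (6.23) and Eq. (6.25). In the minimal sketch below, the logical depth enters only through $\chi$; all numerical values are hypothetical and serve only to show the trend, and the function names are illustrative.

```python
import math


def vth_opt(i0, n_ut, a, c, f, chi, lin_a):
    """Eq. (6.23): approximate optimal threshold voltage."""
    return n_ut * math.log(i0 / (2.0 * n_ut * a * c * f) * (1.0 - chi * lin_a))


def vdd_opt(i0, n_ut, a, c, f, chi, lin_a, lin_b):
    """Eq. (6.25): Eq. (6.15) applied to the optimal Vth of Eq. (6.23)."""
    vth = vth_opt(i0, n_ut, a, c, f, chi, lin_a)
    return (vth + chi * lin_b) / (1.0 - chi * lin_a)


# Hypothetical technology and circuit values, for illustration only.
base = dict(i0=1e-6, n_ut=0.038, a=0.267, c=1e-15, f=100e6, lin_a=0.731)

# A larger logical depth enters only through a larger chi.
delta_vth = vth_opt(chi=0.4, **base) - vth_opt(chi=0.2, **base)
delta_vdd = (vdd_opt(chi=0.4, lin_b=0.286, **base)
             - vdd_opt(chi=0.2, lin_b=0.286, **base))
```

With these values, doubling $\chi$ moves the optimal $V t h$ by less than 10 mV, whereas the optimal $V d d$ rises by more than 100 mV.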

### 6.4 Optimal total power

From what has been developed in the previous pages, it is now possible to obtain some approximations of the optimal total power consumption. Unfortunately, due to the transcendental nature of the involved equations, no exact formula exists to determine the optimal Ptot. Nevertheless, with the help of a few basic assumptions, approximated equations can be found. In the next sections, two different approaches are proposed. The former develops a rough way to compare architectures that present similar values of k 1 ( $\equiv$ optimal Pdyn/Pstat), whereas the latter is a much more precise approximation for an absolute optimal total power estimation.

### 6.4.1 Optimal power comparison with $k 1$ constant

For this first derivation, we assume that k1 is constant, or at least varies very little. This rough approach can be used as a quick way to compare the optimal total power consumption of two (or more) circuits having very similar characteristics in the sense of a similar k1 ( $\equiv$ optimal Pdyn/Pstat).

The optimal total power can be expressed with k 1 as:

$$
\begin{equation*}
\text { Ptot }^{o p t}=\text { Pdyn }^{o p t}\left(1+\frac{1}{k 1}\right) \tag{6.27}
\end{equation*}
$$

From our experience, typical values of k1 span from 3 to 7 considering very different architectural blocks like multipliers, adders, counters, shift registers, FIR, microprocessors, etc. In the case of circuits with similar functions and working conditions, k1 can be considered constant, at least for a first rough approximation. Just as an example, ten different 16bit multipliers ( 7 RCA variations and 3 Wallace variations) implemented in a STM 90 nm technology and with a working frequency of 33 MHz have a k 1 included in the range between 4.22 and 4.69.

To fix the ideas, the error introduced by a $\Delta k 1 \neq 0$ can be calculated:

$$
\begin{equation*}
\Delta \text { Ptot }=\frac{\partial}{\partial k 1} \operatorname{Pdyn}\left(1+\frac{1}{k 1}\right) \Delta k 1=-\operatorname{Pdyn} \frac{\Delta k 1}{k 1^{2}}=-\operatorname{Ptot} \frac{\Delta k 1}{k 1(k 1+1)} \tag{6.28}
\end{equation*}
$$

Or:

$$
\begin{equation*}
\frac{\Delta \text { Ptot }}{\text { Ptot }}=-\frac{\Delta k 1 / k 1}{k 1+1} \tag{6.29}
\end{equation*}
$$

Practically, Eq. (6.29) means that the relative error $(\Delta k 1 / k 1)$ introduced by a non constant k1 has an effect divided by $k 1+1$ on the optimal total power Ptot. Hence the worst case $\Delta$ Ptot/Ptot in our example of the ten 16 bit multipliers presents an error of about $2.1 \%$.
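The worst-case figure quoted above can be checked with a few lines of Python. The function name is illustrative; the k1 range is the one given in the text for the ten 16 bit multipliers, with the lower bound taken as the reference value.

```python
def relative_ptot_error(k1, delta_k1):
    """Eq. (6.29): relative change of the optimal Ptot for a change of k1."""
    return -(delta_k1 / k1) / (k1 + 1.0)


# Range of k1 quoted in the text for the ten 16 bit multipliers.
worst_case = relative_ptot_error(k1=4.22, delta_k1=4.69 - 4.22)
```

The magnitude of `worst_case` is about 0.021, in agreement with the $2.1 \%$ stated above.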

Thanks to the constant k1 hypothesis, the optimal total power consumption comparison is now reduced to the comparison of the optimal dynamic power (Pdyn).

$$
\begin{align*}
\text { Ptot }^{\prime} & \stackrel{?}{<} \text { Ptot }  \tag{6.30}\\
\text { Pdyn }^{\prime}\left(1+\frac{1}{k 1}\right) & \stackrel{?}{<} P d y n\left(1+\frac{1}{k 1}\right)  \tag{6.31}\\
P d y n^{\prime} & \stackrel{?}{<} P d y n  \tag{6.32}\\
a^{\prime} C^{\prime} N^{\prime} f^{\prime} V_{d d}^{\prime 2} & \stackrel{?}{<} a C N f V_{d d}^{2}  \tag{6.33}\\
V_{d d}^{\prime} & \stackrel{?}{<} V_{d d} \sqrt{\frac{a C N f}{a^{\prime} C^{\prime} N^{\prime} f^{\prime}}} \tag{6.34}
\end{align*}
$$

The parameters with an apostrophe (') correspond to the new architecture which is compared to a reference design (no apostrophe).
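Eq. (6.34) gives an upper bound on the optimal supply voltage of the new architecture. A minimal sketch of this test is shown below; the dictionary layout, the function name and the numerical values are assumptions introduced for the example.

```python
import math


def max_vdd_for_saving(vdd_ref, ref, new):
    """Eq. (6.34): upper bound on the new optimal Vdd for a power saving.

    ref and new are dicts with keys 'a', 'C', 'N' and 'f'.
    """
    def acnf(design):
        return design["a"] * design["C"] * design["N"] * design["f"]

    return vdd_ref * math.sqrt(acnf(ref) / acnf(new))


# Hypothetical designs: the candidate has 10 % more cells than the reference.
reference = {"a": 0.267, "C": 1e-15, "N": 1000, "f": 100e6}
candidate = {"a": 0.267, "C": 1e-15, "N": 1100, "f": 100e6}
vdd_limit = max_vdd_for_saving(0.40, reference, candidate)
```

In this example the candidate only saves power if its optimal $V d d$ falls below the reference value divided by $\sqrt{1.1}$.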

## Parallelization example

To better understand the usefulness of Eq. (6.34), let us apply it to the case of a circuit parallelization. Table 6.5 reports the typical architectural parameter variations in the case of a $P$ times parallelization.

In a parallelization process, the number of cells is more than $P$ times the original one due to the overhead introduced mainly by the multiplexer and the additional registers required to maintain valid data on all blocks. We can define the Dynamic OverHead ( DOH ) as the relative increment of the dynamic power due to this overhead at nominal conditions (i.e. $\left.P d y n_{\text {nom }}^{\prime}=(1+D O H) P d y n_{\text {nom }}\right)$.

| Symbol | Name | Effect of parallelization |
| :--- | :--- | :--- |
| $a$ | activity | $\approx / \mathrm{P}$ |
| $N$ | number of cells | $\approx$ *P + overhead |
| $L D_{\text {eff }}$ | effective logical depth | $/ \mathrm{P}$ |
| $f$ | frequency | unchanged |

Table 6.5: Effect of parallelization on architectural parameters

From Eq. (6.34) we now know that in order to reduce the optimal power consumption through parallelization, the following expression must be respected:

$$
\begin{equation*}
V_{d d}^{\prime} \stackrel{!}{<} V_{d d} / \sqrt{1+D O H} \tag{6.35}
\end{equation*}
$$

With $V_{d d}^{\prime}$ the optimal supply voltage after the parallelization and $V_{d d}$ the optimal supply voltage before parallelization.

On the other hand, the optimal $V t h$, which depends mainly on activity, can be approximated as (from Eq. (6.23)):

$$
\begin{equation*}
V t h^{\prime} \cong V t h+n U t \ln P \tag{6.36}
\end{equation*}
$$

With $V t h^{\prime}$ the optimal threshold voltage after parallelization and $V t h$ the optimal threshold voltage before parallelization.
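To give an order of magnitude for Eq. (6.36), the shift of the optimal threshold voltage can be evaluated for a few parallelization factors. The value $n U t=0.04 \mathrm{~V}$ is the approximate room-temperature figure quoted earlier, used here as an assumption, and the function name is illustrative.

```python
import math


def vth_shift_parallel(p, n_ut):
    """Eq. (6.36): increase of the optimal Vth for a P-times parallelization."""
    return n_ut * math.log(p)


# Shift in millivolts for P = 2, 4 and 8, with n_ut = 0.04 V (assumed).
shifts_mv = {p: 1e3 * vth_shift_parallel(p, 0.04) for p in (2, 4, 8)}
```

Each doubling of $P$ adds the same increment of roughly 28 mV, since the shift is logarithmic in $P$.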

Moreover, from Eq. (6.3) we can write:

$$
\begin{equation*}
\frac{V_{d d}^{\prime}-V_{t h}^{\prime}}{V_{d d}^{\prime 1 / \alpha}}=\chi^{\prime}=\chi / P^{1 / \alpha}=\frac{V_{d d}-V_{t h}}{P^{1 / \alpha} V_{d d}^{1 / \alpha}} \tag{6.37}
\end{equation*}
$$

The combination of Eq. (6.35), Eq. (6.36) and Eq. (6.37) yields:

$$
\begin{equation*}
\chi>\left(\frac{P \sqrt{1+D O H}}{V_{d d}}\right)^{1 / \alpha}\left(\frac{V_{d d}}{\sqrt{1+D O H}}-V_{t h}-n U t \ln P\right) \tag{6.38}
\end{equation*}
$$

All parameters in Eq. (6.38) refer to the design before parallelization. Hence, to know if a circuit can reach a lower optimal total power through parallelization it is sufficient to check that the previous inequality is respected.

In the same way, it is possible to determine the maximal value of $D O H$ that still allows power savings when parallelization is performed.

## Pipelining example

The same approach can be carried out in the case of a pipelining transformation. The effect of a typical pipelining transformation to the architectural parameters is shown in Table 6.6.

| Symbol | Name | Effect of pipelining |
| :--- | :--- | :--- |
| $a$ | activity | $\approx$ unchanged |
| $N$ | number of cells | + registers overhead |
| $L D_{\text {eff }}$ | effective logical depth | $/ p_{f}$ |
| $f$ | frequency | unchanged |

Table 6.6: Effect of pipelining on architectural parameters

Ideally, the critical path would be divided by two (or by the number of pipelining stages in general) through a register bank insertion. Unfortunately, this ideal factor is practically never achieved because it is rare to be able to split the path exactly in the middle. For the sake of generality, the factor $p_{f}$ (pipeline factor) is introduced. Its value represents the achieved ratio between the logical depth before and after the pipeline transformation.

Unlike in parallelization, the activity in a pipelining transformation remains almost unchanged, even if a small reduction could be observed due to fewer glitches. This also means that the optimal threshold voltage after the transformation is practically the same as before:

$$
\begin{equation*}
V t h^{\prime} \approx V t h \tag{6.39}
\end{equation*}
$$

With $V t h$ and $V t h^{\prime}$ the optimal threshold voltage before and after the transformation respectively.

The overhead in a pipeline structure comes from the register banks inserted in the data path to cut it into different segments. As before, this overhead is considered as a dynamic power overhead and will be represented by the variable $D O H$ (defined before). So, the condition on the optimal supply voltage remains the same as for the parallelization, i.e.:

$$
\begin{equation*}
V_{d d}^{\prime} \stackrel{!}{<} V_{d d} / \sqrt{1+D O H} \tag{6.40}
\end{equation*}
$$

Once more, a third condition can be obtained from Eq. (6.3):

$$
\begin{equation*}
\frac{V_{d d}^{\prime}-V_{t h}^{\prime}}{V_{d d}^{\prime 1 / \alpha}}=\chi^{\prime}=\chi / p_{f}^{1 / \alpha}=\frac{V_{d d}-V_{t h}}{p_{f}^{1 / \alpha} V_{d d}^{1 / \alpha}} \tag{6.41}
\end{equation*}
$$

The combination of Eq. (6.39), Eq. (6.40) and Eq. (6.41) gives:

$$
\begin{equation*}
p_{f} \stackrel{!}{>} \frac{1}{\sqrt{1+D O H}}\left(\frac{V_{d d}-V_{t h}}{V_{d d} / \sqrt{1+D O H}-V_{t h}}\right)^{\alpha} \tag{6.42}
\end{equation*}
$$
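The right-hand side of Eq. (6.42) is the smallest pipeline factor for which pipelining still reduces the optimal total power. A minimal sketch of its evaluation follows; the function name and the working point ($V d d=0.6 \mathrm{~V}$, $V t h=0.25 \mathrm{~V}$, $\alpha=1.65$) are hypothetical values chosen for illustration.

```python
import math


def min_pipeline_factor(vdd, vth, doh, alpha):
    """Right-hand side of Eq. (6.42), from the optimum before pipelining."""
    k = math.sqrt(1.0 + doh)
    if vdd / k <= vth:
        raise ValueError("Vdd / sqrt(1 + DOH) must stay above Vth")
    return ((vdd - vth) / (vdd / k - vth)) ** alpha / k


# Hypothetical optimum before pipelining (Vdd = 0.6 V, Vth = 0.25 V).
pf_no_overhead = min_pipeline_factor(0.6, 0.25, doh=0.0, alpha=1.65)
pf_with_overhead = min_pipeline_factor(0.6, 0.25, doh=0.1, alpha=1.65)
```

Without overhead the bound is exactly 1, so any reduction of the logical depth helps; with a non-zero $D O H$ the required $p_{f}$ rises above 1.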

Or:

$$
\begin{equation*}
\chi>\left(\frac{p_{f} \sqrt{1+D O H}}{V_{d d}}\right)^{1 / \alpha}\left(V_{d d} / \sqrt{1+D O H}-V_{t h}\right) \tag{6.43}
\end{equation*}
$$

Or even:

$$
\begin{equation*}
\chi \stackrel{!}{>}\left(\frac{p_{f} \sqrt{1+D O H}}{V_{d d}}\right)^{1 / \alpha} \frac{V_{d d}-V_{d d} / \sqrt{1+D O H}}{\left(p_{f} \sqrt{1+D O H}\right)^{1 / \alpha}-1} \tag{6.44}
\end{equation*}
$$

If one of the conditions in Eq. (6.42), Eq. (6.43) or Eq. (6.44) is respected, pipelining the design is worthwhile from an optimal total power point of view.

Considering both the results for parallelization and pipelining, we can say that these transformations are more effective for large logical depths or high frequencies. Moreover, new technologies will tend to reduce the value of $\chi$, making pipelining and parallelization less interesting techniques.

If we want to compare parallelization against pipelining, we can use Eq. (6.38) and Eq. (6.43). The two equations are very similar. If we consider that $n U t \ln P$ is much smaller than $V_{d d} / \sqrt{1+D O H}-V_{t h}$, which is in general the case, and we also assume that both transformations have the same $D O H$, we can compare parallelization against pipelining by simply comparing the parameter $P$ against $p_{f}$. As we have seen before, $p_{f}$ is always smaller than the ideal factor which would correspond to the number of stages. So, for the same degree of pipelining and parallelization, $p_{f}$ will always be smaller than the factor $P$. For this reason we can conclude that the condition in Eq. (6.43) will be easier to fulfill compared to Eq. (6.38), making pipelining a preferred transformation against parallelization.

### 6.4.2 Absolute optimal total power

The previous section illustrates a rough approximation to quickly compare architectures with a similar $k 1$. Even if this approach can be useful, we would sometimes prefer to be able to estimate the absolute value of the optimal total power, rather than by comparison with other architectures.

With Eq. (6.23) and Eq. (6.25), we are able to calculate the optimal total power, but it would be useful to express the optimal total power directly from the architectural and technology parameters. This would avoid the need to pre-calculate the optimal threshold and supply voltages and would give a better understanding of the influence of the architectural and technology parameters on the optimal total power.

Let us start by including Eq. (6.23) in the total power equation:

$$
\begin{align*}
\text { Ptot } & =a C N f V_{d d}^{2}+N V_{d d} I_{0} e^{-\frac{V t h}{n U t}}  \tag{6.45}\\
& =a C N f V_{d d}^{2}+2 V_{d d} \frac{n U t a C N f}{1-\chi A}  \tag{6.46}\\
& =a C N f\left(V_{d d}^{2}+2 V_{d d} \frac{n U t}{1-\chi A}\right) \tag{6.47}
\end{align*}
$$

Eq. (6.47) shows a term in $V_{d d}^{2}$ and a term in $2 V_{d d}$. This means that two of the three terms of the square expansion $(x+y)^{2}=x^{2}+2 x y+y^{2}$ are present, with $x=V_{d d}$ and $y=n U t /(1-\chi A)$. Supposing that the missing term $\left(y^{2}\right)$ is very small compared to the sum of the other two, the expansion can be reversed.

$$
\begin{align*}
\text { Ptot } & =a C N f\left(V_{d d}^{2}+2 V_{d d} \frac{n U t}{1-\chi A}\right)  \tag{6.48}\\
& \approx a C N f\left(V_{d d}^{2}+2 V_{d d} \frac{n U t}{1-\chi A}+\left(\frac{n U t}{1-\chi A}\right)^{2}\right)  \tag{6.49}\\
& =a C N f\left(V_{d d}+\frac{n U t}{1-\chi A}\right)^{2} \tag{6.50}
\end{align*}
$$

The approximation that has just been used is the same as the one used to obtain Eq. (6.23), namely that $n U t / V_{d d} \ll(1-\chi A)$. The validity of this approximation can be verified in the practical cases reported in the next chapters.

Finally, the expression of the optimal supply voltage (Eq. (6.25)) can be inserted in Eq. (6.50) to obtain the optimal total power formula.

$$
\begin{equation*}
P_{t o t}{ }^{o p t} \cong \frac{a C N f}{(1-\chi A)^{2}}\left[n U t\left(\ln \left(\frac{I_{0}}{2 n U t a C f}(1-\chi A)\right)+1\right)+\chi B\right]^{2} \tag{6.51}
\end{equation*}
$$

Eq. (6.51) is a fundamental equation: it gives an analytical estimate of the optimal total power directly from architectural parameters like activity $(a)$, number of cells $(N)$, frequency $(f)$, logical depth $(L D$, included in $\chi)$ and technology parameters like transistor reference current $\left(I_{0}\right)$, sub-threshold slope $(n)$, alpha power law coefficient ( $\alpha$, included in $A$ and $B$ ), delay coefficient ( $k_{t}$, included in $\chi)$ and average capacitance $C$. The detailed discussion of the influence of these two families of parameters on the optimal total power consumption will be carried out in the next two chapters.
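The closed form of Eq. (6.51) can be compared with the total power of Eq. (6.45) evaluated at the optimal voltages of Eq. (6.23) and Eq. (6.25). The sketch below does this for one hypothetical parameter set; none of the numbers are the STM 90 nm values, and the function names are illustrative.

```python
import math


def ptot_opt(a, c, n_cells, f, i0, n_ut, chi, lin_a, lin_b):
    """Eq. (6.51): approximate optimal total power in watts."""
    one_minus = 1.0 - chi * lin_a
    log_term = math.log(i0 / (2.0 * n_ut * a * c * f) * one_minus)
    bracket = n_ut * (log_term + 1.0) + chi * lin_b
    return a * c * n_cells * f / one_minus ** 2 * bracket ** 2


def ptot_at_optimum(a, c, n_cells, f, i0, n_ut, chi, lin_a, lin_b):
    """Eq. (6.45) evaluated at the Vth and Vdd of Eq. (6.23) and Eq. (6.25)."""
    one_minus = 1.0 - chi * lin_a
    vth = n_ut * math.log(i0 / (2.0 * n_ut * a * c * f) * one_minus)
    vdd = (vth + chi * lin_b) / one_minus
    dynamic = a * c * n_cells * f * vdd ** 2
    static = n_cells * vdd * i0 * math.exp(-vth / n_ut)
    return dynamic + static


# Hypothetical values, for illustration only.
case = dict(a=0.267, c=1e-15, n_cells=1000, f=100e6, i0=1e-6,
            n_ut=0.038, chi=0.3, lin_a=0.731, lin_b=0.286)
closed_form = ptot_opt(**case)
reference = ptot_at_optimum(**case)
relative_gap = (closed_form - reference) / reference
```

The gap between the two values is exactly the added square term of Eq. (6.49); for this parameter set it stays below $2 \%$ of the total.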

An alternative expression for the optimal total power can also be obtained by combining Eq. (6.50) with Eq. (6.11). The resulting formula illustrates the relationship between the optimal total power and $k 1$ :

$$
\begin{equation*}
\text { Ptot }^{\text {opt }} \cong a C N f\left(\frac{n U t}{1-\chi A}\right)^{2}(2 k 1+1)^{2} \tag{6.52}
\end{equation*}
$$
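The intermediate step behind Eq. (6.52) is to solve Eq. (6.11) for the optimal supply voltage and to substitute the result in Eq. (6.50):

$$
\begin{align*}
V_{d d}^{o p t} & \cong \frac{2 n U t}{1-\chi A} k 1 \\
\text { Ptot }^{o p t} & \cong a C N f\left(\frac{2 n U t \cdot k 1}{1-\chi A}+\frac{n U t}{1-\chi A}\right)^{2}=a C N f\left(\frac{n U t}{1-\chi A}\right)^{2}(2 k 1+1)^{2}
\end{align*}
$$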

### 6.5 Summary

In this chapter, we have discussed the existence of a total power consumption optimum characterized by a trade-off between dynamic and static power contributions. We have also seen that typical values of k1 (optimal Pdyn over optimal Pstat ratio) are between 3 and 7.

After that, we have developed models for the optimal supply voltage and optimal threshold voltage, showing that frequency modifications mainly influence $V t h$, logical depth modifications mainly affect $V d d$, whereas activity modifications have impacts on both of them. Then, a total power comparison based on the rough assumption of a quasi-constant k1 revealed that pipelining and parallelization are more effective for large logical depths and high frequencies and that new technologies (which will tend to have lower $\chi$ ) will make these two transformations less interesting. Finally, we observed that the condition for achieving a power saving through pipelining is more easily fulfilled than the one for parallelization.

In the case where an absolute estimation of the optimal total power is required, the expression of an approximated closed-form equation has been given.

The most important equations provided in this chapter are summarized below to permit a quick access.

Starting from:

$$
\begin{aligned}
& \text { Ptot }=\text { Pdyn }+ \text { Pstat }=a C N f V_{d d}^{2}+N V_{d d} I_{0} e^{-\frac{V t h}{n U t}} \\
& V_{t h}^{\text {opt }}=V_{d d}^{\text {opt }}-\chi \cdot\left(V_{d d}^{\text {opt }}\right)^{1 / \alpha} \quad \text { with: } \chi^{\alpha}=\frac{k_{t} \cdot C \cdot f \cdot L D}{I_{0}\left(\frac{e}{\alpha n U t}\right)^{\alpha}}
\end{aligned}
$$

we obtained:

$$
\begin{aligned}
V_{t h}^{o p t} & \cong n U t \ln \left(\frac{I_{0}}{2 n U t a C f}(1-\chi A)\right) \\
& \cong n U t \ln \left(\frac{P s t a t^{n o m}}{P d y n^{n o m}} \frac{V_{d d}^{\text {nom }} e^{\left(V t h 0^{n o m}-\eta V d d^{n o m}\right) / n U t}}{2 n U t}(1-\chi A)\right) \\
V_{d d}^{\text {opt }} & \cong \frac{n U t \ln \left(\frac{I_{0}}{2 n U t a C f}(1-\chi A)\right)+\chi B}{1-\chi A} \\
& \cong \frac{n U t \ln \left(\frac{P s t a t^{n o m}}{P d y n^{n o m}} \frac{V_{d d}^{\text {nom }} e^{\left(V t h 0^{n o m}-\eta V d d^{n o m}\right) / n U t}}{2 n U t}(1-\chi A)\right)+\chi B}{1-\chi A} \\
\text { Ptot }^{\text {opt }} & \cong \frac{a C N f}{(1-\chi A)^{2}}\left[n U t\left(\ln \left(\frac{I_{0}}{2 n U t a C f}(1-\chi A)\right)+1\right)+\chi B\right]^{2}
\end{aligned}
$$

## Chapter 7

## Architectural impact on total power

Many architectural parameters, e.g. activity $a$, number of cells $N$, logical depth $L D$ (contained in $\chi$ ), influence the optimal total power consumption (Eq. (6.51)). Knowing the effect of an architecture transformation (e.g. pipelining or parallelization) on such parameters makes it possible to determine directly whether a power saving can be obtained, just by using Eq. (6.51).

To discuss the impact of architectural modifications on the optimal total power consumption, a set of thirteen 16 bit multipliers (described in detail in Chapter 5) was designed in VHDL and synthesized using Synopsys Design Compiler (V2004.06). The library used for the synthesis was the 90 nm CMOS090GPSVT from ST Microelectronics.

The data characterizing these thirteen multipliers at their nominal values (the ones provided by Synopsys DC) are reported in Table 7.1. Every multiplier works with a frequency able to generate one completed multiplication every 16 ns . This means, for instance, that the 16 bit sequential architecture requires a local clock period of 1 ns , whereas the 2 times parallelized implementation has 32 ns of time per block.

The definitions of the parameters reported in Table 7.1 are:

- Cells: the number of design cells. One cell can be a very simple one (like an inverter) or a complex one (like a full adder);
- Nets: the number of inter-cells nets in the design;
- Area: the area of the design core; pads and routing spaces are not included;
- Activity: the average number of switching nets over the total number of nets per clock period. These values are obtained by an event-driven simulation under ModelSim (from Mentor Graphics). The results are based on the multiplication of uniformly distributed pseudo-random data during $2 \mu \mathrm{~s}$; standard library delays are used so that glitches are accounted for;
- Delay: the typical combinatorial delay from register output to register input on the critical path;
- LD_eff: the effective logical depth in equivalent NAND2 gates. The term "effective" refers to the fact that the logical depth is measured against the throughput frequency, i.e. the one-complete-multiplication frequency. In the case of a 2 times parallelization, for instance, LD_eff corresponds to half of the real LD because each block has two clock periods to compute one multiplication. Similarly, in the case of the sequential implementation, LD_eff represents 16 times the real LD because 16 clock periods are required to complete one multiplication. The delay of the reference NAND2 gate has been estimated by building a chain of 1000 NAND2 gates, each used as an inverter by tying its two inputs together. The resulting delay per gate is 33.5 ps for the SVT transistor type;
- $\chi$ and $\chi^{\alpha}$ : these two parameters are obtained by using Eq. (6.3) from the nominal $V d d, V t h$ and delay. They are reported here to be easily accessible during the following discussions;
- Nominal Vdd: the nominal technology supply voltage;
- Nominal Vth0: the nominal technology threshold voltage;
- Nominal Pdyn: the nominal dynamic power consumption as reported by Synopsys DC;
- Nominal Pstat: the nominal static power consumption as reported by Synopsys DC;
- Nominal Ptot: the nominal total power consumption obtained by summing the nominal Pdyn and the nominal Pstat.

Table 7.1: Nominal values of the thirteen 16 bit multipliers. Columns: Cells, Nets, Area [$\mu m^{2}$], Activity, Delay [ns], LD_eff, $\chi$, $\chi^{\alpha}$, Vdd [V], Vth0 [V], Pdyn [$\mu W$], Pstat [$\mu W$], Ptot [$\mu W$].
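The definition of LD_eff given above can be written as a one-line rule. The sketch below is only an illustration: the function name is ours and the real logical depth of 100 gates is a hypothetical value, not an entry of Table 7.1.

```python
def effective_logical_depth(real_ld, cycles_per_result=1, parallel_blocks=1):
    """Logical depth measured against the throughput period, in NAND2 gates.

    A block replicated `parallel_blocks` times has that many throughput periods
    per multiplication; a sequential block needs `cycles_per_result` passes.
    """
    return real_ld * cycles_per_result / parallel_blocks


REAL_LD = 100  # hypothetical real logical depth in equivalent NAND2 gates

basic = effective_logical_depth(REAL_LD)
parallel_2 = effective_logical_depth(REAL_LD, parallel_blocks=2)      # half of LD
sequential = effective_logical_depth(REAL_LD, cycles_per_result=16)  # 16 times LD

print(basic, parallel_2, sequential)
```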

With the data reported in Table 7.1, the optimal supply voltage $V d d$ and the optimal threshold voltage $V t h$ can now be calculated. The values of $V d d$ and $V t h$ in Table 7.2 are obtained in two different ways. In the first case, called numerical computation, a high resolution numerical search of the optimal supply and threshold voltages is used. This approach is very time consuming, because it requires the calculation of the total power consumption for a large number of ( $V d d, V t h$ ) pairs ( $100^{\prime} 000$ in our case) using the non-approximated equations described in Chapter 3. Moreover, this type of calculation does not reveal the real effect of each parameter on the final result. However, results calculated in this way are precise (up to the precision of the models used), and for this reason they are taken as the reference for the other approach, which is based on Eq. (6.23) and Eq. (6.25) and is called analytical approximation. In this latter case, the optimal $V d d$ and the optimal $V t h$ can easily be calculated from the values reported in Table 7.1.
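To give an idea of what such a numerical search involves, the sketch below scans a grid of $400 \times 250 = 100'000$ ( $V d d, V t h$ ) pairs and keeps the feasible pair with the lowest total power. It is a simplified illustration and not the computation used for Table 7.2: the power model is the simplified dynamic plus static expression, the delay model is an assumed alpha-power-law form, and every numerical value is hypothetical.

```python
import math


def total_power(vdd, vth, a, c, n_cells, f, i0, n_ut):
    """Simplified total power [W]: dynamic term plus sub-threshold leakage term."""
    p_dyn = a * c * n_cells * f * vdd ** 2
    p_stat = n_cells * vdd * i0 * math.exp(-vth / n_ut)
    return p_dyn + p_stat


def critical_path_delay(vdd, vth, k_delay, alpha):
    """ASSUMED alpha-power-law delay [s]; not the exact Chapter 3 expression."""
    return k_delay * vdd / (vdd - vth) ** alpha


def grid_search(f, k_delay, alpha, power_args, n_vdd=400, n_vth=250):
    """Return (ptot, vdd, vth) of the best grid point meeting the timing constraint."""
    best = None
    for i in range(n_vdd):
        vdd = 0.2 + i * 1.0 / (n_vdd - 1)       # Vdd from 0.2 V to 1.2 V
        for j in range(n_vth):
            vth = j * 0.5 / (n_vth - 1)         # Vth from 0.0 V to 0.5 V
            if vdd <= vth:
                continue                        # delay model undefined here
            if critical_path_delay(vdd, vth, k_delay, alpha) > 1.0 / f:
                continue                        # too slow for the required frequency
            p = total_power(vdd, vth, f=f, **power_args)
            if best is None or p < best[0]:
                best = (p, vdd, vth)
    return best


# Hypothetical design and technology values (assumptions).
F = 62.5e6
K_DELAY, ALPHA = 3.67e-9, 1.5
POWER_ARGS = dict(a=0.5, c=2e-15, n_cells=600, i0=3e-6, n_ut=0.035)

p_opt, vdd_opt, vth_opt = grid_search(F, K_DELAY, ALPHA, POWER_ARGS)
p_nom = total_power(1.0, 0.35, f=F, **POWER_ARGS)
print(f"optimum: Vdd={vdd_opt:.3f} V, Vth={vth_opt:.3f} V, Ptot={p_opt * 1e6:.2f} uW")
print(f"nominal: Ptot={p_nom * 1e6:.2f} uW")
```

The nested loops make the cost of the method visible: every refinement of the grid multiplies the number of evaluations, and the result is a single number that says nothing about which parameter is responsible for it.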

The error between the reference data (numerical computation) and the analytical approximation is also reported in the same table. All the errors remain bounded to a few percent.

In Fig. 7.1 and Fig. 7.2, the same results are reported graphically, which makes them easier to read.


Figure 7.1: Optimal Vdd calculated with numerical computation and with Eq. (6.25) (STM 90 nm, 62.5 MHz)

| Design | Vdd, numerical [V] | Vth, numerical [V] | Vdd, analytical [V] | Vth, analytical [V] | Vdd error [%] | Vth error [%] | Pdyn, numerical [$\mu W$] | Pstat, numerical [$\mu W$] | Ptot, numerical [$\mu W$] | k1 | Ptot, analytical [$\mu W$] | Ptot error [%] |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| RCA basic | 0.437 | 0.201 | 0.444 | 0.206 | 1.5 | 2.8 | 140.08 | 41.62 | 181.70 | 3.4 | 183.16 | 0.8 |
| RCA parallel 2 | 0.368 | 0.227 | 0.376 | 0.233 | 2.2 | 2.9 | 113.10 | 35.17 | 148.27 | 3.2 | 150.40 | 1.4 |
| RCA parallel 4 | 0.344 | 0.254 | 0.351 | 0.260 | 2.1 | 2.4 | 105.83 | 32.01 | 137.84 | 3.3 | 140.05 | 1.6 |
| RCA horiz. pipeline 2 | 0.385 | 0.211 | 0.393 | 0.217 | 2.1 | 2.8 | 100.53 | 31.72 | 132.25 | 3.2 | 133.99 | 1.3 |
| RCA horiz. pipeline 4 | 0.351 | 0.216 | 0.360 | 0.223 | 2.5 | 3.3 | 90.47 | 29.76 | 120.23 | 3.0 | 122.27 | 1.7 |
| RCA diag. pipeline 2 | 0.367 | 0.209 | 0.375 | 0.216 | 2.4 | 3.2 | 97.11 | 31.70 | 128.81 | 3.1 | 130.79 | 1.5 |
| RCA diag. pipeline 4 | 0.325 | 0.216 | 0.335 | 0.223 | 3.0 | 3.4 | 83.15 | 28.67 | 111.82 | 2.9 | 114.37 | 2.3 |
| Wallace basic | 0.339 | 0.222 | 0.348 | 0.228 | 2.7 | 3.0 | 62.44 | 20.67 | 83.11 | 3.0 | 84.64 | 1.8 |
| Wallace parallel 2 | 0.321 | 0.244 | 0.329 | 0.251 | 2.4 | 2.8 | 66.23 | 21.33 | 87.56 | 3.1 | 89.30 | 2.0 |
| Wallace parallel 4 | 0.320 | 0.269 | 0.326 | 0.275 | 2.0 | 2.1 | 73.29 | 22.24 | 95.53 | 3.3 | 97.17 | 1.7 |
| Sequential basic | 0.563 | 0.107 | 0.570 | 0.109 | 1.3 | 2.0 | 777.27 | 237.90 | 1015.17 | 3.3 | 1045.75 | 3.0 |
| Sequential-wallace | 0.600 | 0.157 | 0.608 | 0.157 | 1.3 | 0.0 | 393.74 | 102.14 | 495.88 | 3.9 | 512.09 | 3.3 |
| Sequential parallel 2 | 0.374 | 0.139 | 0.387 | 0.147 | 3.5 | 6.4 | 312.19 | 122.73 | 434.92 | 2.5 | 443.82 | 2.0 |

Table 7.2: Optimal $V d d, V t h$ and Ptot. These values are calculated once with a numerical computation and once using Eq. (6.25) for $V d d$, Eq. (6.23) for $V t h$ and Eq. (6.51) for Ptot. Relative errors are shown in the corresponding columns. The $A$ and $B$ factors used are those for $V d d \in[0.3 V ; 0.6 V]$
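The Ptot error column of Table 7.2 can be recomputed directly from the two Ptot columns. The short sketch below does this with the values copied from the table; the helper name is ours.

```python
# (design, Ptot numerical [uW], Ptot analytical [uW]) copied from Table 7.2
TABLE_7_2_PTOT = [
    ("RCA basic", 181.70, 183.16),
    ("RCA parallel 2", 148.27, 150.40),
    ("RCA parallel 4", 137.84, 140.05),
    ("RCA horiz. pipeline 2", 132.25, 133.99),
    ("RCA horiz. pipeline 4", 120.23, 122.27),
    ("RCA diag. pipeline 2", 128.81, 130.79),
    ("RCA diag. pipeline 4", 111.82, 114.37),
    ("Wallace basic", 83.11, 84.64),
    ("Wallace parallel 2", 87.56, 89.30),
    ("Wallace parallel 4", 95.53, 97.17),
    ("Sequential basic", 1015.17, 1045.75),
    ("Sequential-wallace", 495.88, 512.09),
    ("Sequential parallel 2", 434.92, 443.82),
]


def relative_error_percent(reference, approximation):
    """Relative error of the approximation with respect to the reference, in %."""
    return 100.0 * abs(approximation - reference) / reference


errors = {name: relative_error_percent(num, ana) for name, num, ana in TABLE_7_2_PTOT}
worst_design = max(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name:<22} {err:4.1f} %")
print(f"largest error: {worst_design}")
```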


Figure 7.2: Optimal Vth calculated with numerical computation and with Eq. (6.23) (STM 90 nm, 62.5 MHz)

From the values of $V d d^{\text {opt }}$ and $V t h^{\text {opt }}$ we can observe, for instance, the effect of parallelization. In such a transformation, $V d d$ is reduced and $V t h$ is increased. Both trends favor a lower total power by reducing dynamic and static power at the same time. It is also interesting to note that the reduction of the supply voltage is smaller for Wallace than for RCA. This is easily explained by the lower $\chi$ factor of the Wallace implementation. In fact, since the Wallace is already a fast architecture compared to the required frequency ( 62.5 MHz ), the gain from the reduction of the effective logical depth (LD_eff) is only marginal, whereas it is much more significant for the RCA multiplier.

It is also possible to observe that $V t h$ is almost constant for the pipeline transformation, as was deduced in Chapter 6. Finally, the large delay involved in the sequential architectures (corresponding to a high $\chi$ ) clearly leads to a high $V d d$ and a low $V t h$, both negatively impacting the total power.

Nevertheless, optimal $V d d$ and $V t h$ are not mandatory to compute the optimal total power consumption, thanks to Eq. (6.51). In fact, all required parameters can be obtained from Table 7.1 without needing intermediate steps. Once more, the results of our analytical approximation are compared to the numerical computation, where no approximations are applied. Results are reported in Table 7.2 with the corresponding errors. The same results are also provided in a graphical way in Fig. 7.3.


Figure 7.3: Optimal total power calculated with numerical computation and with Eq. (6.51) (STM 90 nm, 62.5 MHz)

It is interesting to see that the error of Eq. (6.51) over a set of such different implementations is always less than $3.5 \%$. The second, quite evident, observation is the huge optimal total consumption of the three sequential implementations compared to the non-sequential ones. The explanation for this effect can be found by looking at the $\chi$ factor (Eq. (6.3)). This parameter, which establishes the relationship between the optimal $V d d$ and the optimal $V t h$, directly depends on the effective logical depth, which is very large for these three architectures. A large logical depth (i.e. a large $\chi$ ) results in a high optimal $V d d$ (which increases the dynamic power quadratically and the static power linearly), and in a low optimal $V t h$ (which increases the static power exponentially). Moreover, sequential structures also present large activities. Because their activity is defined over a period of the throughput clock, it is not uncommon to observe activities higher than 1. Unfortunately, this high activity ( $a$ ) is not counterbalanced by a small enough number of cells $(N)$, which results in a much higher number of transitions $(a \cdot N)$ compared to the other implementations. As stated in Eq. (6.51), a large number of transitions also penalizes the optimal total power consumption.

The RCA architecture is based on a very regular structure that permits many variations to be implemented. Both parallelization and pipelining transformations shorten the effective logical depth (which corresponds to a reduction of $\chi$, although not proportionally). In this case, the relaxed timing constraints make it possible to further reduce $V d d$ and increase $V t h$, thereby reducing the optimal total power consumption.

The diagonal pipelined versions present a lower $\chi$ and a lower activity compared to the classical horizontal pipeline versions, and hence they feature a lower optimal total power consumption. Nevertheless, the gain in power between the two ways of pipelining is small, and the time spent by the designer to correctly implement a diagonal pipeline may not be worth the resulting gain in power.

Finally, the Wallace family presents the fastest circuits of our set. By applying a parallelization to the basic version, we observe that, similarly to the RCA family, the logical depth is reduced and hence $\chi$ is also reduced. Once more, this results in a lower $V d d$ and a higher $V t h$, which should be synonymous with power saving. However, if we look at the resulting optimal power, we see that the Wallace basic version has a lower optimal total power than the two parallelized versions. The explanation is that, the Wallace architecture being already a fast circuit (compared to the desired clock frequency), the reduction of $\chi$ obtained by parallelization is only marginal, and its benefit is canceled by the increase of the static power due to the doubling of the hardware and the overhead introduced to multiplex data. This is not the case for the RCA because its $\chi$ is higher. This example illustrates very well how the same architectural transformation can yield completely different results. Fortunately, all these cases are well modeled by Eq. (6.51).
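The opposite behavior of the two families can be quantified from the numerically computed Ptot values of Table 7.2. The following sketch expresses each variant as a relative change with respect to the basic version of its family; the helper name is ours.

```python
# Optimal total power [uW], numerical computation, copied from Table 7.2
PTOT_OPT_UW = {
    "RCA basic": 181.70,
    "RCA parallel 2": 148.27,
    "RCA parallel 4": 137.84,
    "RCA horiz. pipeline 4": 120.23,
    "RCA diag. pipeline 4": 111.82,
    "Wallace basic": 83.11,
    "Wallace parallel 2": 87.56,
    "Wallace parallel 4": 95.53,
}


def change_percent(base, variant):
    """Relative change of optimal Ptot of `variant` with respect to `base`, in %."""
    return 100.0 * (PTOT_OPT_UW[variant] - PTOT_OPT_UW[base]) / PTOT_OPT_UW[base]


for base in ("RCA basic", "Wallace basic"):
    family = base.split()[0]
    for variant in PTOT_OPT_UW:
        if variant.startswith(family) and variant != base:
            print(f"{variant:<22} {change_percent(base, variant):+6.1f} %")
```

Parallelizing the RCA reduces the optimal total power, pipelining reduces it further, and parallelizing the Wallace increases it.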

### 7.1 Summary

In this chapter we have shown how architectural parameters like activity $a$, logical depth $L D$ and frequency $f$ can modify the optimal supply voltage $V d d$, the optimal threshold voltage $V t h$ and finally the optimal total power Ptot of a design. In particular, we have pointed out how sequential circuits, characterized by very slow architectures (large $L D$ ), present a huge power consumption compared to the other designs. Hence, unless a circuit working at extremely low frequency is needed, sequential implementations are not well suited for low power when working at the optimal point.

On the other hand, fast circuits (showing a short $L D$ ) like Wallace are poor candidates for parallelization, because the large increase of static consumption caused by the hardware replication easily cancels the small benefit obtained from the reduced critical path.

For an architecture with an average logical depth like the RCA, we can observe that a moderate power gain can be obtained through parallelization, but even in this case, the pipeline transformation gives better results with a much smaller area, which also corresponds to lower production costs.

This leads us to the conclusion that, in designs where the static power consumption is not negligible, parallelization is rarely a good choice and most of the time pipelining should be preferred.

## Chapter 8

## Technology impact on total power

As explained in Chapter 6, the optimal total power depends not only on architectural parameters, but also on technology parameters. In the past, it was in general not possible to change these parameters, because the designer had a given technology to use and was not able to modify it. This may change in the future. Until now, new technology nodes always offered better performance and better power characteristics than the previous ones, but nowadays, with the strong increase in leakage current, a performance gain can correspond to a power loss. For this reason, technologies now start to exist in different "flavors", which are in general characterized by their $V t h$. For instance, the technology used in this thesis offers three different types of transistors, namely Low Vth (LVT), Standard Vth (SVT) and High Vth (HVT). Moreover, two of these three kinds can be implemented together on the same chip. Under such conditions, it is interesting to determine which of the proposed flavors is best suited for a given task. Before that, we will consider the virtual case where the technology parameters can be freely and independently modified. This will permit us to understand the influence of each parameter on the optimal total power.

### 8.1 Technology as a free parameter

In general, technology parameters $\left(I_{0}, n, \alpha, k_{t}, C\right)$ are not independent and the variation of one of them results in a variation of others. Nevertheless, to understand the importance and the effective influence of a specific parameter, it is useful to observe how the total power is modified by single parameter variations. This is shown in Fig. 8.1 for an RCA 16 bit multiplier. The nominal case (no variation of the technology parameters) corresponds to the RCA basic structure reported in Table 7.1.

The abscissa represents the ratio of the new (modified) parameter over the original one, while the ordinate represents the optimal total power consumption.

The most sensitive parameter is $\alpha$. This parameter comes from the alpha power law fitting formula and it represents the velocity saturation of electrons/holes. Typically, switching to a newer (finer) technology corresponds to a lower $\alpha$. From Fig. 8.1, we can see how this penalizes the optimal total power. Actually, a low $\alpha$ corresponds to a reduced $I_{\text {On }}$ current, which also means a slower technology. In practice, the speed reduction caused by $\alpha$ is largely counterbalanced by the reduced capacitances and $k_{t}$.

Moreover, it is interesting to observe that an increase of $I_{0}$ results in a very moderate power saving. The explanation is that a bigger $I_{0}$ not only increases the static power, but also increases the on current by the same factor. Hence, the speed related parameter $\chi$ is reduced, achieving a moderate gain. Conversely, the reduction of $I_{0}$ can highly penalize the total power. Once again, the delay increase easily explains this behavior.

The behavior of the capacitance $C$ or delay parameter $k_{t}$ is not really surprising. In fact, an increase of $C$ means an increase of the delay (as for $k_{t}$ ) and so a worse optimal total power.

Finally, the curve of $n$ shows an important increase of the optimal total power for an increase of the parameter and vice-versa. In fact, an increase in the factor $n$ is equivalent to a reduction of $V t h$, i.e. an increase of the leakage current.

To summarize, the ideal technology would be characterized by a low $C, k_{t}$ and $n$, whereas $I_{0}$ and $\alpha$ should be as high as possible. This may not be the trend in coming technologies, for instance in the case of $\alpha$.

### 8.2 Application to technology selection

The 90 nm technology from ST Microelectronics is available with 3 different transistor types (LVT; SVT; HVT). The optimal total power consumption for the 13 multipliers of Chapter 5 has been calculated for all existing flavors. Table 8.1 shows the results. By looking at the bold values, which represent the best technology choices for a given architecture, we can see that the best transistor type is not always the same. In particular, the HVT is the best for 6 cases, the SVT the best for 5 cases and the LVT is the best for 2 cases.

To better illustrate these results, they have been plotted in Fig. 8.2. Data corresponding to the sequential versions are omitted to make the other cases easier to read.

| Design name | LVT | SVT | HVT |
| :--- | ---: | ---: | ---: |
| RCA basic | 197.43 | **181.70** | 182.11 |
| RCA parallel 2 | 179.39 | **148.27** | 152.53 |
| RCA parallel 4 | 176.46 | 137.84 | **135.16** |
| RCA horiz. pipeline 2 | 151.93 | 132.25 | **128.06** |
| RCA horiz. pipeline 4 | 142.77 | 120.23 | **113.34** |
| RCA diag. pipeline 2 | 143.44 | **128.81** | 129.81 |
| RCA diag. pipeline 4 | 136.82 | **111.82** | 112.03 |
| Wallace basic | **80.26** | 83.11 | 96.95 |
| Wallace parallel 2 | 104.17 | 87.56 | **81.13** |
| Wallace parallel 4 | 121.57 | 95.53 | **85.98** |
| Sequential basic | 1547.98 | 1015.17 | **1007.49** |
| Sequential-wallace | **358.37** | 495.88 | 483.10 |
| Sequential parallel 2 | 620.49 | **434.92** | 486.46 |

Table 8.1: Optimal total power consumption Ptot [$\mu W$] of thirteen 16 bit multipliers in all STM 90 nm technology flavors. The bold values represent the best technology choice for the given architecture.

Looking at the data for the three Wallace implementations, we can observe the effect of parallelization in different technology conditions. If we consider the HVT type (high Vth, hence low static power), we see that the parallelization of the basic implementation is interesting from a power point of view, because doubling the hardware (and thus the static power) costs less than what is gained from the reduction of the supply voltage and the increase of the threshold voltage allowed by the relaxed timing constraints. Nevertheless, if the transformation is iterated once more, leading to the Wallace parallel 4, the power figure starts to degrade, because $V d d$ and $V t h$ are now only slightly modified, whereas the static power is doubled compared to Wallace parallel 2.

In the case of the SVT (standard Vth), the 2 times parallelization is already a bad transformation for low power, and the 4 times parallelized version is even worse. This can be explained by a greater static power compared to the HVT, which penalizes all types of parallelization for the Wallace structure.

Finally, the results for LVT (low Vth, hence high static power) clearly show an important increase of the optimal total power for each parallelized version. Once more, it is the doubling (or quadrupling) of the hardware that cannot be tolerated in a flavor with so much leakage.

Figure 8.2: Optimal total power consumption of ten 16 bit multipliers in all STM 90 nm technology flavors

On the other hand, the parallelization of the RCA family remains interesting for all three transistor types. This can be explained by the fact that the RCA multiplier has a longer logical depth and hence a higher $\chi$ compared to the Wallace. For this reason, the parallelization has a much more important effect on the reduction of $V d d$ and the increase of $V t h$, which can outweigh the increase of hardware and hence of static power.

From Fig. 8.2, it is also possible to note that the pipeline transformations on the RCA multipliers present a better power consumption compared to the parallelized versions. This comes from the fact that pipelining can relax the timing constraints without doubling the static power through hardware replication. It is hence possible to conclude that for technologies characterized by important leakage power, a situation that is probably representative of all future technologies, pipelining should be preferred over parallelization. This also needs to be understood by CAD programmers, so that powerful automated pipelining tools can replace the present algorithms, which rely massively on parallelization.

Considering all the architectures and transistor types, the best choice for a frequency of 62.5 MHz is the Wallace basic implemented with an LVT transistor flavor.
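The selection discussed in this section amounts to taking a minimum per row and a global minimum over Table 8.1. The sketch below reproduces it with the values copied from the table; the two rows listed there as "RCA parallel" and "Wallace parallel" are named "parallel 2" here, an assumption based on the matching SVT values of Table 7.2.

```python
from collections import Counter

FLAVORS = ("LVT", "SVT", "HVT")

# Optimal total power [uW] per flavor, copied from Table 8.1
TABLE_8_1 = {
    "RCA basic": (197.43, 181.70, 182.11),
    "RCA parallel 2": (179.39, 148.27, 152.53),
    "RCA parallel 4": (176.46, 137.84, 135.16),
    "RCA horiz. pipeline 2": (151.93, 132.25, 128.06),
    "RCA horiz. pipeline 4": (142.77, 120.23, 113.34),
    "RCA diag. pipeline 2": (143.44, 128.81, 129.81),
    "RCA diag. pipeline 4": (136.82, 111.82, 112.03),
    "Wallace basic": (80.26, 83.11, 96.95),
    "Wallace parallel 2": (104.17, 87.56, 81.13),
    "Wallace parallel 4": (121.57, 95.53, 85.98),
    "Sequential basic": (1547.98, 1015.17, 1007.49),
    "Sequential-wallace": (358.37, 495.88, 483.10),
    "Sequential parallel 2": (620.49, 434.92, 486.46),
}

# Best flavor for each architecture (lowest optimal Ptot in the row).
best_flavor = {name: FLAVORS[powers.index(min(powers))]
               for name, powers in TABLE_8_1.items()}
wins = Counter(best_flavor.values())

# Best architecture/flavor pair over the whole table.
overall = min((power, name, flavor)
              for name, powers in TABLE_8_1.items()
              for flavor, power in zip(FLAVORS, powers))

print(dict(wins))
print(overall)
```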

### 8.3 Discussion on the modifiability of Vth

All the theory developed in the last chapters considers $V t h$ as a freely modifiable parameter. This is not the way people normally think about the threshold voltage, probably because the modification of $V t h$ is not an easy task. In the previous section, we discussed the possibility of selecting the best technology flavor from a given set. This does not allow a continuous modification of $V t h$, but it still permits a discrete one. An important drawback of such an approach is that $V t h$ cannot be dynamically modified to follow the various runtime needs. In this section, two other possible ways to act on the threshold voltage are presented.

#### 8.3.1 Body biasing

In Chapter 2, we discussed the body effect, showing how a voltage between the body and the source of a transistor ( $V b s$ ) can modify the threshold voltage. The body biasing equation is repeated here:

$$
\begin{equation*}
V t h=V t h 0-\eta V d s-\gamma V b s \tag{8.1}
\end{equation*}
$$

where $\eta$ is the DIBL effect coefficient and $\gamma$ the body bias coefficient.

This is clearly a simplification of the relationship between $V t h$ and $V b s$, but it is useful to understand the principle. More precisely, the body bias can be modeled by [57]:

$$
\begin{equation*}
V t h(V b s)=V t h(V b s=0)-\frac{\sqrt{2 q \epsilon_{S} N_{A}}}{C_{0}}\left(\sqrt{2 \psi_{B}+V b s}-\sqrt{2 \psi_{B}}\right) \tag{8.2}
\end{equation*}
$$

where $q$ is the elementary charge, $\epsilon_{S}$ the silicon permittivity, $N_{A}$ the acceptor impurity density in the channel, $C_{0}$ the gate oxide capacitance per unit area, $\psi_{B}$ the Fermi potential and $V b s$ the voltage between body and source.

From Eq. (8.2), we observe that the ability to modify $V t h$ is more efficient for $V b s$ near zero, whereas it decreases in a typical square root way for larger values of $V b s$. Moreover, the pre-factor $\sqrt{2 q \epsilon_{S} N_{A}} / C_{0}$ tends to be smaller with newer technologies due to the reduction of the oxide thickness and hence the range where $V t h$ can be modified will tend to be reduced on all new technology nodes.
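Both observations can be checked numerically on Eq. (8.2). The sketch below evaluates the $V t h$ shift for a forward bias between 0 and 0.5 V and for two oxide thicknesses. The doping level, the oxide thicknesses and the Fermi potential are hypothetical textbook-like values (assumptions), not parameters of the STM technologies, and $C_{0}$ is assumed to be $\epsilon_{ox} / t_{ox}$.

```python
import math

Q = 1.602e-19            # elementary charge [C]
EPS_0 = 8.854e-12        # vacuum permittivity [F/m]
EPS_SI = 11.7 * EPS_0    # silicon permittivity [F/m]
EPS_OX = 3.9 * EPS_0     # silicon dioxide permittivity [F/m]


def vth_shift(vbs, n_a, t_ox, psi_b):
    """Vth(Vbs) - Vth(Vbs=0) in volts, following the form of Eq. (8.2).

    n_a [1/m^3], t_ox [m] and psi_b [V] are hypothetical inputs (assumptions).
    """
    c_0 = EPS_OX / t_ox  # assumed parallel-plate gate oxide capacitance per area
    prefactor = math.sqrt(2.0 * Q * EPS_SI * n_a) / c_0
    return -prefactor * (math.sqrt(2.0 * psi_b + vbs) - math.sqrt(2.0 * psi_b))


N_A, PSI_B = 1e24, 0.45  # hypothetical doping [1/m^3] and Fermi potential [V]

for t_ox in (2.0e-9, 1.5e-9):
    shifts = [vth_shift(v, N_A, t_ox, PSI_B) for v in (0.0, 0.25, 0.5)]
    print(f"t_ox = {t_ox * 1e9:.1f} nm:",
          ", ".join(f"{s * 1e3:+.1f} mV" for s in shifts))
```

With these assumptions, the second 0.25 V of bias moves $V t h$ less than the first one, and the thinner oxide gives a smaller shift for the same bias.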

Another important point is the sign of $V b s$. In fact, the body can have a potential higher or lower than the source. When the body potential is higher than the source for the NMOS and lower than the source for the PMOS, the polarization is called forward body biasing (FBB) and it corresponds to a reduction of the threshold voltage. The opposite case, i.e. a body potential lower than the source for the NMOS and higher than the source for the PMOS, is called reverse body biasing (RBB) and results in an increase of $V t h$.

Whereas RBB has no limit on the maximal $V b s$ other than the maximum reverse-bias junction potential, this is not the case for FBB. In FBB, if the potential goes over 0.5 V, the p-n junction between body and source starts to conduct, creating a very high current flow. For this reason, FBB always needs to stay below 0.5 V.
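The sign convention and the 0.5 V limit described above can be summarized in a small helper. This is only an illustrative sketch: the function name, the returned labels and the example voltages are ours.

```python
FBB_LIMIT_V = 0.5  # above this forward bias the body-source junction conducts


def classify_body_bias(device, v_body, v_source):
    """Classify a body bias as 'FBB', 'RBB', 'zero bias' or 'FBB beyond junction limit'."""
    if device not in ("nmos", "pmos"):
        raise ValueError("device must be 'nmos' or 'pmos'")
    diff = v_body - v_source
    if diff == 0:
        return "zero bias"
    # Forward bias: body above source for NMOS, body below source for PMOS.
    forward = diff > 0 if device == "nmos" else diff < 0
    if not forward:
        return "RBB"
    if abs(diff) > FBB_LIMIT_V:
        return "FBB beyond junction limit"
    return "FBB"


print(classify_body_bias("nmos", 0.3, 0.0))   # body above source
print(classify_body_bias("pmos", 0.7, 1.0))   # body below source
print(classify_body_bias("pmos", 1.3, 1.0))   # body above source
print(classify_body_bias("nmos", 0.6, 0.0))   # forward bias over the limit
```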

Just as an example, an FBB of 0.5 V (the maximum applicable) on the 90 nm STM SVT technology gives a $V t h$ reduction of only 40 mV, whereas the same FBB corresponds to a $V t h$ variation of 60 mV for the 130 nm STM technology.

Ananthan et al. showed in [58] [59] that FBB has the advantage of reducing the sensitivity of $V t h$ to variations in gate length, oxide thickness and channel doping, and that it is hence preferable to RBB.
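If the simplified linear model of Eq. (8.1) is fitted to the two FBB figures quoted above (40 mV at 90 nm and 60 mV at 130 nm, both for 0.5 V), an effective body bias coefficient $\gamma$ can be estimated for each node. This is only a rough illustration: it assumes that both figures are $V t h$ reductions and that the linear model holds over the whole range, which Eq. (8.2) shows is not strictly the case.

```python
def effective_gamma(delta_vth, vbs):
    """Effective body bias coefficient from Eq. (8.1): delta_Vth = -gamma * Vbs."""
    return -delta_vth / vbs


gamma_90nm = effective_gamma(delta_vth=-0.040, vbs=0.5)   # 90 nm STM SVT
gamma_130nm = effective_gamma(delta_vth=-0.060, vbs=0.5)  # 130 nm STM

print(f"gamma (90 nm)  = {gamma_90nm:.2f} V/V")
print(f"gamma (130 nm) = {gamma_130nm:.2f} V/V")
```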

The principles of body bias have been successfully applied in circuits like the 150 MHz discrete cosine transform core processor of Kuroda et al. [60], the 200 MHz processor of Mizuno et al. [13] and the 1 GHz router of Narendra et al. [61].

#### 8.3.2 Transistor size modification

Another way to modify the threshold voltage of a transistor is by modifying its physical dimensions. The important dimensions of the transistor are the width (W) and the length ( L ) of its channel.

Fig. 8.3 (NMOS) and Fig. 8.4 (PMOS) show the plots of $V t h$ versus $W$ for the 130 nm STM (HCMOS9GP_LL) technology. These graphs are part of the STM documentation and the details on how to generate them are not known. Nevertheless, these plots are very useful to understand the behavior of $V t h$ under a transistor resizing.

From these graphs, we can see that the influence of $W$ on $V t h$ is strongly asymmetric between the NMOS and the PMOS transistors. For instance, the maximal change of $V t h$ due to a modification of $W$ from $0.3 \mu m$ to $10 \mu m$ (which is a very large modification) is about 60 mV for the NMOS, whereas it is only $6-7 \mathrm{mV}$ for the PMOS. This means that any scaling of the device will create completely unbalanced charging/discharging delays that will result in high short-circuit currents, not to mention the capacitance increase due to the bigger channel area. Although the modification of the channel width is probably not the best technique to modify $V t h$, it is reported here for completeness.

Figure 8.3: Vth vs. W for a NMOS transistor. Curves correspond to Slow-Slow(SSA), Typical-Typical(TT) and Fast-Fast(FFA) corners

Figure 8.4: Vth vs. W for a PMOS transistor. Curves correspond to Slow-Slow(SSA), Typical-Typical(TT) and Fast-Fast(FFA) corners

The other modifiable size of the transistor is the channel length. In Fig. 8.5 (NMOS) and Fig. 8.6 (PMOS) the curves of $V t h$ versus $L$ are plotted for the HCMOS9GP_LL 130 nm STM technology. There, we can see that for small increases of the channel length, both NMOS and PMOS behave in a similar way, with a relatively steep slope. This is exactly the idea exploited by Gupta et al. [62]. What they propose is to slightly increase (by less than $10 \%$ ) the length $L$ of the transistors that are not on the critical path, achieving a static power reduction of about $30 \%$ with a delay penalty smaller than $10 \%$ in a 130 nm technology.


Figure 8.5: Vth vs. L for a NMOS transistor. Curves correspond to Slow-Slow(SSA), Typical-Typical(TT) and Fast-Fast(FFA) corners


Figure 8.6: Vth vs. L for a PMOS transistor. Curves correspond to Slow-Slow(SSA), Typical-Typical(TT) and Fast-Fast(FFA) corners

It is also important to note that transistor size modifications influence more parameters than simply the threshold voltage $V t h$, and that the obtained $V t h$ modifications are very moderate. For these reasons, technology flavor selection and body bias are preferable techniques for modifying the threshold voltage $V t h$.

### 8.4 Summary

In this chapter we have discussed the influence of the principal technology parameters on the optimal total power. In particular, we have observed that an ideal technology would be characterized by low $C, k_{t}$ and $n$, whereas $I_{0}$ and $\alpha$ should be as high as possible. Unfortunately, this will probably not be the trend of future technologies.

Then we have analyzed thirteen different 16 bit multipliers synthesized in the three technology flavors offered by the STM 90 nm technology. This illustrates very well how the technology can be used as a design parameter to achieve the lowest possible total power consumption. In the examples considered, the best architecture/technology flavor combination is the Wallace basic in an LVT transistor type.

Finally, two other methods for modifying the threshold voltage have been presented, namely body bias and transistor resizing. For both techniques, advantages and limitations have been discussed.

## Chapter 9

## Total power comparison for fixed Vdd and fixed Vth

This chapter presents a new methodology that makes it possible to compare several architectures performing the same function and to select, among them, the one with the lowest total power consumption under fixed supply voltage ( $V d d$ ), threshold voltage ( $V t h$ ) and frequency $(f)$ constraints. This situation is much more common for designers than the one considered in Chapter 6, because most of the time they cannot choose the technology to use. Moreover, this approach can be applied in parallel to the free $V d d / V t h$ one. Indeed, the best $V t h$ and $V d d$ can be chosen for the main block of the design, and all the other blocks then need to adapt. Thanks to the theory of this chapter, these secondary blocks can be optimized, too.

The lowest total power consumption, which is closely related to the architecture, results clearly from a trade-off between static and dynamic power. Static power reduction leads to the selection of architectures with a small number of cells and not with a small number of transitions, as it was the case when only dynamic power reduction was targeted. As an example, this methodology is applied to the selection of the lowest power consuming architecture among a set of thirteen 16 bit multipliers (described in Chapter 5). Moreover, by understanding the mechanism behind this selection, it is possible to propose and implement new architectures that will consume even less power as reported in Section 9.4.

### 9.1 Total power comparison

To be able to compare the consumption of two architectures under the same supply voltage $V d d$, threshold voltage $V t h$ and frequency $f$, we need a definition of the total power. Once more, the equation used is the one described in Chapter 3.

$$
\begin{equation*}
\text { Ptot }=\text { Pdyn }+\text { Pstat }=a C N f V_{d d}^{2}+N V_{d d} I_{0} e^{-\frac{V t h}{n U t}} \tag{9.1}
\end{equation*}
$$

The equivalent capacitance $C$ is roughly related to the average cell capacitance and can be obtained by dividing the dynamic power consumption by the number of transitions $(a \cdot N)$, the squared supply voltage and the working frequency. Therefore $C$ is not exactly the same for two circuits implementing the same function, because it varies with their respective distribution of activity and capacitance products over the nodes. The same observation holds for the leakage current $I_{0}$, which represents an average static consumption per cell over the entire circuit, although some cells clearly leak more than others. Since the methodology presented here is applied to the comparison of architectures performing the same task, we assume that the equivalent capacitance $C$ and the average leakage current $I_{0}$ remain sufficiently similar across the set of architectures.

All the architectures in the implementation set share the same $V d d, V t h$ and $f$, but present different values for $a$ (activity) and $N$ (number of cells). Two architectures are characterized by $a_{1}$ and $N_{1}$, and $a_{2}$ and $N_{2}$ respectively, and their total power consumption can be compared as follows:

$$
\begin{equation*}
a_{1} N_{1} C f V_{d d}^{2}+N_{1} V_{d d} I_{0} e^{-\frac{V t h}{n U t}} \stackrel{?}{<} a_{2} N_{2} C f V_{d d}^{2}+N_{2} V_{d d} I_{0} e^{-\frac{V t h}{n U t}} \tag{9.2}
\end{equation*}
$$

The inequality (9.2) is true if the first architecture consumes less power than the second one. This inequality can be rewritten in the form:

$$
\begin{equation*}
\left(N_{1}-N_{2}\right) \stackrel{?}{<}-\left(a_{1} N_{1}-a_{2} N_{2}\right) \frac{C V_{d d} f}{I_{0} e^{-\frac{V t h}{n U t}}} \tag{9.3}
\end{equation*}
$$

Then, by defining the difference between the number of cells as $\Delta N=\left(N_{1}-N_{2}\right)$ and the difference between the number of transitions as $\Delta T r=\left(a_{1} N_{1}-a_{2} N_{2}\right)$, we can finally express this comparison as:

$$
\begin{align*}
& \Delta N \stackrel{?}{<}-\Delta \operatorname{Tr} \frac{C V_{d d} f}{I_{0} e^{-\frac{V t h}{n U t}}}  \tag{9.4}\\
& \Delta N \stackrel{?}{<}-\Delta \operatorname{Tr} \cdot R\left(V_{d d}, V_{t h}, f\right) \tag{9.5}
\end{align*}
$$

The expression $R\left(V_{d d}, V_{t h}, f\right)$ in Eq. (9.5) depends on $V d d, V t h, f$ and some technology parameters, which are imposed on the designer and are hence constant. Moreover, the value of $R$ is always positive.

Eq. (9.4) shows that the comparison of the total power consumption between two architectures depends on the difference between the number of cells $(\Delta N)$ and on the difference between the number of transitions $(\Delta T r)$. This is quite different from the conventional approach where only the number of transitions is relevant as only dynamic power consumption is taken into account.
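The comparison can be sketched in a few lines of Python. This is only an illustration of Eqs. (9.1) and (9.5): the function names, the capacitance `C`, the value of `N_UT` (the product $n U t$) and the two example circuits are assumptions chosen for the example, not thesis data. Only the ratio $C / I_{0}=1.36 \times 10^{-9}$ s/V and the 62.5 MHz frequency are taken from the values quoted for Fig. 9.1.

```python
import math

def total_power(a, n_cells, c, f, vdd, vth, i0, n_ut):
    """Eq. (9.1): dynamic plus static power of one architecture [W]."""
    p_dyn = a * n_cells * c * f * vdd ** 2
    p_stat = n_cells * vdd * i0 * math.exp(-vth / n_ut)
    return p_dyn + p_stat

def r_factor(c, f, vdd, vth, i0, n_ut):
    """R(Vdd, Vth, f) of Eq. (9.5); positive for any physical inputs."""
    return c * vdd * f / (i0 * math.exp(-vth / n_ut))

def circuit1_is_better(a1, n1, a2, n2, r):
    """Eq. (9.5): True when circuit 1 has the lower total power."""
    delta_n = n1 - n2
    delta_tr = a1 * n1 - a2 * n2
    return delta_n < -delta_tr * r

# Illustrative values only (assumptions, not thesis data).
C = 1e-15            # equivalent capacitance [F], assumed
I0 = C / 1.36e-9     # chosen so that C/I0 = 1.36e-9 s/V
N_UT = 0.05          # n*Ut [V], assumed
F, VDD = 62.5e6, 1.0

# Circuit 1 has fewer cells and more transitions; circuit 2 the opposite.
A1, N1 = 0.5, 1000   # 500 transitions
A2, N2 = 0.2, 2000   # 400 transitions

for vth in (0.12, 0.4):
    r = r_factor(C, F, VDD, vth, I0, N_UT)
    print(vth, round(r, 3), circuit1_is_better(A1, N1, A2, N2, r))
```

With these assumed numbers the circuit with fewer cells wins at the low threshold voltage and the circuit with fewer transitions wins at the high one, which is the behaviour the inequality is meant to capture.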

### 9.2 Comparison of two architectures

A logical function can be implemented in several ways, using different topologies, for instance by parallelizing, pipelining or performing algorithmic improvements. All these structures can be categorized based on their characteristics: number of cells, logical depth, number of transitions and activity (Table 7.1 is an example of such a classification). Two architectures can lead to positive or negative $\Delta N$ and $\Delta T r$ values, while the value of $R$ (Eq. (9.5)) is always positive. If both designs present the same number of cells and transitions (i.e. $\Delta N=0$ and $\Delta T r=0$ ), the power consumption will clearly be the same. An architecture with more cells and more transitions will always consume more power, because inequality (9.5) becomes trivial, i.e. independent of $R$. Conversely, if one design has more cells but fewer transitions than the other (i.e. $\Delta N>0$ and $\Delta T r<0$, or vice versa), the choice of the architecture consuming less power is more complex and depends on $R$. This means that the selection also depends on the working conditions, i.e. on $V d d, V t h, f$ and the technology parameters. All possible cases are summarized in Table 9.1.

|  | $\Delta T r>0$ | $\Delta T r=0$ | $\Delta T r<0$ |
| :---: | :---: | :---: | :---: |
| $\Delta N>0$ | Circuit 2 | Circuit 2 | Depends on Eq. (9.5) |
| $\Delta N=0$ | Circuit 2 | Same consumption | Circuit 1 |
| $\Delta N<0$ | Depends on Eq. (9.5) | Circuit 1 | Circuit 1 |

Table 9.1: Comparison table between two circuits having a difference of $\Delta N=\left(N_{1}-N_{2}\right)$ cells and $\Delta T r=\left(a_{1} N_{1}-a_{2} N_{2}\right)$ transitions. The circuit indicated is the one presenting the lowest total power consumption.
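The nine cases of Table 9.1 can be written as a small decision function. The sketch below is illustrative: the function name and the optional `r` argument are assumptions. When `r` is omitted, the two non-trivial cells are reported as depending on Eq. (9.5), exactly as in the table.

```python
def lowest_power_circuit(delta_n, delta_tr, r=None):
    """Return the entry of Table 9.1 for given delta_n = N1 - N2 and
    delta_tr = a1*N1 - a2*N2. If r = R(Vdd, Vth, f) is supplied, the
    non-trivial cases are resolved with Eq. (9.5)."""
    if delta_n == 0 and delta_tr == 0:
        return "Same consumption"
    if delta_n >= 0 and delta_tr >= 0:
        return "Circuit 2"          # circuit 1 has more cells and/or transitions
    if delta_n <= 0 and delta_tr <= 0:
        return "Circuit 1"          # circuit 2 has more cells and/or transitions
    # Opposite signs: the answer depends on the working point through R.
    if r is None:
        return "Depends on Eq. (9.5)"
    if delta_n == -delta_tr * r:
        return "Same consumption"   # working point lies on the equal-consumption line
    return "Circuit 1" if delta_n < -delta_tr * r else "Circuit 2"

print(lowest_power_circuit(5, -3))             # non-trivial cell of the table
print(lowest_power_circuit(-1000, 100, r=0.9)) # resolved once R is known
```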

Plotting the lines of equal consumption (i.e. $R(V d d, V t h, f)=-\Delta N / \Delta T r$ ) in the ( $V d d, V t h$ ) space gives a better understanding of the role of $R$ in the architecture selection (Fig. 9.1). Each equal-consumption line marks the points where two designs having the corresponding ratio $-\Delta N / \Delta T r$ present the same power consumption, even though the absolute value varies with $V d d$ and $V t h$. For instance, if two architectures operating at $V d d=1 \mathrm{~V}$ and $V t h=0.33 \mathrm{~V}$ have $-\Delta N / \Delta T r=100$, they will present the same total power consumption. Otherwise, when the design constraints represented by $V d d$ and $V t h$ correspond to a point that is above the equal-consumption line (which would be the case for $V d d=1 \mathrm{~V}$ and $V t h=0.4 \mathrm{~V}$ in our example), the circuit with fewer transitions will dissipate less power. Conversely, if the working point is located below the equal-consumption line (which would be the case for $V d d=1 \mathrm{~V}$ and $V t h=0.2 \mathrm{~V}$ ), the design with fewer cells will consume less power. Indeed, increasing $V t h$ results in a large decrease in static power, which in turn leads to a consumption dominated by the dynamic contribution. The architecture with fewer transitions is then naturally preferred. It is important to remember that the plot of Fig. 9.1 depends on the technology used. Here, the STM 90 nm SVT technology was chosen, which corresponds to an average $C / I_{0}$ of $1.36 \times 10^{-9} \mathrm{~s} / \mathrm{V}$ and a working frequency of 62.5 MHz.

Figure 9.1: Lines of equal-consumption with $f=62.5 \mathrm{MHz}$ in a STM 90 nm SVT technology. The $V d d$ and $V t h$ constraints can be represented with a point on this plot. A pair of architectures to be compared corresponds to one $-\Delta N / \Delta T r$ line in this space. If the working point is located above the $-\Delta N / \Delta T r$ line, then the architecture with fewer transitions is better in terms of power consumption, otherwise the design with fewer cells is preferred.
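As a short worked step, the equal-consumption line can be solved for $V t h$ directly from Eq. (9.4). The shorthand $\rho=-\Delta N / \Delta T r$ is introduced here only for readability, and the result assumes that $n$ and $I_{0}$ are constant, as in Eq. (9.1):

$$
\begin{align*}
\frac{C V_{d d} f}{I_{0} e^{-\frac{V t h}{n U t}}}=\rho \quad \Longrightarrow \quad V t h=n U t \cdot \ln \left(\rho \frac{I_{0}}{C V_{d d} f}\right)
\end{align*}
$$

With the quoted $C / I_{0}=1.36 \times 10^{-9}$ s/V, $f=62.5$ MHz and $V d d=1$ V, the prefactor is $C V_{d d} f / I_{0}=0.085$, so the line for a given $\rho$ sits at $V t h=n U t \cdot \ln (\rho / 0.085)$. Under this simple model, a larger ratio $\rho$ therefore moves the equal-consumption line logarithmically towards higher $V t h$.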

### 9.3 Selection of the best architecture

The methodology illustrated in the previous section to compare two architectures can be iterated over a large number of implementations of the same logical function. By repeating the comparisons on pairs of structures, it is possible to eliminate the worst architectures and quickly converge to the best design for the specified $V d d$, $V t h$ and $f$ constraints. It is important to note that the selected architecture is not always the same, but depends on the values of $V d d, V t h$ and $f$. This methodology can be used to easily select the best architecture under new constraints without re-synthesis. Generally speaking, the approach can be summarized as follows:

1. Delay constraints: Given $V d d, V t h, f$, architectures that are too slow to meet the timing constraints are eliminated. A slow architecture can be parallelized or pipelined to meet the constraints, but this represents a new architecture to be added to the set of structures to compare.
2. Compare a couple of architectures: The comparison of two architectures is achieved using the parameter $-\Delta N / \Delta T r$. If this value is negative the architecture with fewer cells and less transitions is chosen (circuit 1 or 2 in Table 9.1 when $-\Delta N / \Delta T r$ is negative). On the other hand, when $-\Delta N / \Delta T r>0$, the choice depends on Eq. (9.5) and therefore on the position of the working point with respect to the line of equal consumption.
3. Repeat step 2 for all remaining architectures: It can be a good idea to start by eliminating trivial cases $(-\Delta N / \Delta T r<0)$ in order to reduce the number of non-trivial comparisons performed using Fig. 9.1. Elimination of architectures rapidly converges to the design presenting the lowest overall total power consumption for the given working conditions ( $V d d, V t h, f$ ).
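The three steps above can be sketched as a pairwise elimination loop. The data structure (a list of dictionaries with keys `name`, `N`, `Tr` and `meets_timing`), the function names and the example architectures are hypothetical, and `r` stands for the value of $R(V d d, V t h, f)$ at the working point.

```python
def better_of(arch1, arch2, r):
    """Step 2: keep the architecture with lower total power per Eq. (9.5)."""
    delta_n = arch1["N"] - arch2["N"]
    delta_tr = arch1["Tr"] - arch2["Tr"]
    return arch1 if delta_n < -delta_tr * r else arch2

def select_best(archs, r):
    """Steps 1 to 3: drop designs that miss timing, then eliminate pairwise."""
    candidates = [a for a in archs if a["meets_timing"]]   # step 1
    if not candidates:
        raise ValueError("no architecture meets the timing constraints")
    best = candidates[0]
    for challenger in candidates[1:]:                      # steps 2 and 3
        best = better_of(best, challenger, r)
    return best

# Hypothetical set of architectures (not thesis data).
ARCHS = [
    {"name": "A", "N": 1000, "Tr": 500, "meets_timing": True},
    {"name": "B", "N": 2000, "Tr": 400, "meets_timing": True},
    {"name": "C", "N": 2500, "Tr": 600, "meets_timing": True},   # trivially worse than B
    {"name": "D", "N": 500,  "Tr": 300, "meets_timing": False},  # too slow
]

print(select_best(ARCHS, r=253.0)["name"])  # high Vth: dynamic power dominates
print(select_best(ARCHS, r=0.9)["name"])    # low Vth: static power dominates
```

Because Eq. (9.5) is equivalent to comparing $N+T r \cdot R$ for the two circuits, the loop returns the same design regardless of the order in which the pairs are visited (ties aside).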

### 9.4 Designing new circuits

In addition to the above considerations, the same graphical tool can be used to define guidelines for the design of new architectures (i.e. not yet present in the set of available architectures) presenting an even smaller total power consumption. First, the $-\Delta N / \Delta T r$ line that crosses the $(V d d, V t h)$ constraint point can be determined from Fig. 9.1. As a reminder, two architectures having this $-\Delta N / \Delta T r$ share the same power consumption under these constraints, whereas the architecture with fewer cells should be favored when this $-\Delta N / \Delta T r$ ratio is higher.

Starting from an existing design with $N_{1}$ cells and $T r_{1}$ transitions, a new architecture with fewer cells $(N_{2}<N_{1})$ can be searched for, which will usually also present more transitions (the trivial case where $N_{2}<N_{1}$ and $T r_{2}<T r_{1}$ would in fact always be better, but is rarely realizable). This new version with $N_{2}<N_{1}$ cells and $T r_{2}>T r_{1}$ transitions will consume less power if and only if the ratio $-\Delta N / \Delta T r$ is higher than the one extracted from the line crossing the ( $V d d, V t h$ ) constraints. Indeed, in this case the equal-consumption line of the pair passes above the working point in Fig. 9.1 and the new design with fewer cells consumes less power. Conversely, an architecture presenting a reduced number of transitions (which in general will present more cells) can be searched for. In this case, the new structure should present a ratio $-\Delta N / \Delta T r$ smaller than the one that can be read from the line crossing the ( $V d d$, $V t h$ ) constraints in Fig. 9.1.

As an example, consider an existing circuit with $10^{\prime} 000$ cells and 100 transitions working at $V d d=1 \mathrm{~V}, V t h=0.24 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$, for which a new architecture consuming less power is sought. Fig. 9.1 specifies that, in order to consume less power, a new architecture must have a $-\Delta N / \Delta T r$ greater than 10 when reducing the number of cells, or smaller than 10 when reducing the number of transitions. Supposing that the designer can achieve a reduction of 1000 cells ( $N_{2}=9000$ ) by an architectural transformation, the number of transitions of this new design must remain below 200 (i.e. an increase of fewer than 100 transitions, $|\Delta T r|<100$ ), which is necessary in order to have $-\Delta N / \Delta T r$ greater than 10.
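The bound on the number of transitions follows directly from the definitions of $\Delta N$ and $\Delta T r$. Writing $T r_{1}=100$ and $T r_{2}$ for the transitions of the existing and the new design (with $T r_{2}>T r_{1}$, so that the denominator below is positive):

$$
\begin{align*}
-\frac{\Delta N}{\Delta T r}=-\frac{N_{1}-N_{2}}{T r_{1}-T r_{2}}=\frac{1000}{T r_{2}-100}>10 \quad \Longrightarrow \quad T r_{2}-100<100 \quad \Longrightarrow \quad T r_{2}<200
\end{align*}
$$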

When performing a parallelization, the number of cells is more than doubled (due to the multiplexer overhead) and the activity is reduced by a factor of slightly less than two. In general, this results in a small increase in the number of transitions and in a large increase in the number of cells. For this reason, parallelized versions will always consume more power than the original design at the same working conditions. However, when the original architecture does not meet the speed requirements, parallelization can relax the timing constraints to achieve the required performance. This is the only case where a parallelized architecture may be useful when $V d d$ and $V t h$ are fixed.

The same situation arises with pipelining, where the overhead due to the extra registers often largely cancels the activity reduction achieved by suppressing glitches. At the same time, the number of cells increases due to the same overhead and, as a result, pipelining a circuit at the same working conditions is in general not beneficial. Nevertheless, the pipelining technique can be used to reduce the logical depth and hence relax the timing constraints of circuits that do not meet the speed constraints at the required $V d d$ and $V t h$.

### 9.5 Case study: 16bit multipliers

To show how to apply the ideas of this chapter to a practical case, we refer once more to the thirteen 16 bit multipliers described in Chapter 5. The architectural parameters of all the structures are available in Table 7.1.

Knowing that the key parameters for power discrimination are the number of cells $(N)$ and the number of transitions ( $T r$ ), all architectures can be represented as points on a plot of $N$ versus $\operatorname{Tr}$ (Fig. 9.2). The label on the arcs connecting points stands for the value of $-\Delta N / \Delta T r$ for the corresponding couple of architectures. Fig. 9.2 allows a very easy detection of trivial cases characterized by $-\Delta N / \Delta T r<0$, as the slope of their arc is positive. Conversely, non-trivial cases present a negative slope. In Fig. 9.2, only non-trivial arcs are shown.

## A. Example 1: Vdd $=1 \mathrm{~V}, \mathrm{Vth}=0.4 \mathrm{~V}, \mathrm{f}=62.5 \mathrm{MHz}$

Applying the methodology described in section 9.3, we have:

1. Delay constraints: All designs can work at these conditions.
2. Compare a couple of architectures: Architectures connected by a positive slope arc in Fig. 9.2, i.e. trivial cases such as RCA parallel 4 against Wallace parallel 2, are first considered. As RCA parallel 4 presents more cells and more transitions than Wallace parallel 2 , it is eliminated.

3. Repeat step 2 for all remaining architectures:

- By comparing other trivial cases, we can easily eliminate RCA horizontal pipeline 4, RCA diagonal pipeline 4, RCA parallel 2, Wallace parallel 2 and Wallace parallel 4 in favor of Wallace. Moreover the RCA diagonal pipeline 2 is eliminated in favor of RCA horizontal pipeline 2 and Sequential parallel in favor of the basic Sequential.
- The remaining cases are then considered. Looking at RCA and Sequential in Fig. 9.2, it can be seen that the arc connecting the two structures is characterized by $-\Delta N / \Delta T r=0.7$. In Fig. 9.1, the equal-consumption line corresponding to this value splits the space into two regions, with the label "less transition is better" in the upper part and "less cells is better" in the lower part, meaning that at $V d d=1 \mathrm{~V}$ and $V t h=0.14 \mathrm{~V}$ the two designs consume the same amount of power. However, in our example the working point corresponding to $V t h=0.4 \mathrm{~V}$ lies in the upper part of the plot, where the better structure is the one with fewer transitions. Consequently, the RCA design is selected. The same reasoning applies to the Sequential-wallace 4_16 architecture, which is eliminated in favor of the RCA. In fact, if the equal-consumption line is located in the lower part of Fig. 9.1, i.e. at low $V t h$, a working point above this line is dominated by dynamic consumption rather than static power. For this reason, designs with fewer transitions also present less total power dissipation. The remaining architectures are RCA, RCA horizontal pipeline 2 and Wallace. Since the first two both have low values of $-\Delta N / \Delta T r$ with respect to Wallace (1.59 and 10.03 respectively), only the Wallace structure remains.

For $V d d=1 \mathrm{~V}, V t h=0.4 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$, the best architecture from a power point of view is the Wallace. In order to validate the methodology, the total power consumption of all designs was calculated for the given operating conditions and is shown in Table 9.2.

| Architecture | Total power [ $\mu W$ ] |
| :---: | :---: |
| RCA | 735.4 |
| RCA par2 | 839.5 |
| RCA par4 | 905.4 |
| RCA horiz.pipe2 | 681.5 |
| RCA horiz.pipe4 | 738.3 |
| RCA diag.pipe2 | 724.9 |
| RCA diag.pipe4 | 790.1 |
| Wallace | **545.2** |
| Wallace par2 | 650.2 |
| Wallace par4 | 728.1 |
| Sequential | 2457.2 |
| Sequential-wallace 4_16 | 1094.4 |
| Sequential parallel | 2232.5 |

Table 9.2: Consumption of the thirteen multipliers in $\mu W$ for $V d d=1 \mathrm{~V}$, $V t h=0.4 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$.

These values are first obtained at the nominal conditions $(V d d=1 \mathrm{~V}, V t h 0=$ 0.353 V ) and then dynamic and static powers are separately recalculated based on Eq. (9.1) for the proposed working condition (i.e. $V d d=1 \mathrm{~V}, V t h=0.4 \mathrm{~V}$ ).
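The recalculation described above can be sketched as follows. The scaling laws come from Eq. (9.1) at fixed frequency, assuming that $a$, $N$, $C$, $I_{0}$ and $n$ do not change between the nominal and the target conditions. The function name, the value of `n_ut` (the product $n U t$) and the nominal powers in the usage lines are assumptions for illustration only.

```python
import math

def rescale_power(p_dyn0, p_stat0, vdd0, vth0, vdd, vth, n_ut):
    """Rescale nominal dynamic and static power to a new (Vdd, Vth) point.

    From Eq. (9.1) at fixed f: Pdyn scales with Vdd^2 and
    Pstat scales with Vdd * exp(-Vth / (n*Ut)).
    """
    p_dyn = p_dyn0 * (vdd / vdd0) ** 2
    p_stat = p_stat0 * (vdd / vdd0) * math.exp(-(vth - vth0) / n_ut)
    return p_dyn, p_stat, p_dyn + p_stat

# Hypothetical nominal powers in microwatts and an assumed n*Ut of 47 mV.
p_dyn, p_stat, p_tot = rescale_power(
    p_dyn0=500.0, p_stat0=100.0,
    vdd0=1.0, vth0=0.353, vdd=1.0, vth=0.4, n_ut=0.047,
)
print(p_dyn, round(p_stat, 2), round(p_tot, 2))
```

At unchanged $V d d$ only the static term moves, which is why raising $V t h$ from 0.353 V to 0.4 V shifts the balance towards dynamic power.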

## B. Example 2: Vdd $=1 \mathrm{~V}, \mathrm{Vth}=0.12 \mathrm{~V}, \mathrm{f}=62.5 \mathrm{MHz}$

As a second example, we choose a working condition with a very low threshold voltage ( $V t h=0.12 \mathrm{~V}$ ) and the same supply voltage and frequency as in the previous example.

1. Delay constraints: In this case too, all designs meet the timing constraints.
2. Compare a couple of architectures: As in the previous example, trivial cases are detected first. Hence, the RCA parallel 4 is eliminated in favor of the Wallace parallel 2.

3. Repeat step 2 for all remaining architectures:

- By comparing other trivial cases, we can easily eliminate RCA horizontal pipeline 4, RCA diagonal pipeline 4, RCA parallel 2, Wallace parallel 2 and Wallace parallel 4 in favor of Wallace. Moreover, the RCA diagonal pipeline 2 is eliminated in favor of the RCA horizontal pipeline 2 and the Sequential parallel in favor of the basic Sequential.
- The remaining architectures are: RCA, RCA horizontal pipeline 2, Sequential, Sequential-wallace 4_16 and Wallace. As before, the pair RCA and Sequential is characterized by $-\Delta N / \Delta T r=0.7$, which corresponds to an equal-consumption line in Fig. 9.1. For $V d d=1 \mathrm{~V}$ these architectures have the same power consumption if the threshold voltage is equal to 0.14 V. As the imposed $V t h$ is slightly lower $(0.12 \mathrm{~V})$, the working point is located in the region where fewer cells are preferred. Hence, the Sequential architecture is selected. The comparison between the RCA horizontal pipeline 2 and the Wallace is similar: with a $-\Delta N / \Delta T r$ of 10.03, the architecture with fewer cells is preferred (i.e. the RCA horizontal pipeline 2). For the same reason, the Sequential-wallace 4_16 is preferred over the RCA horizontal pipeline 2. Finally, the comparison between the Sequential and the Sequential-wallace 4_16 is characterized by $-\Delta N / \Delta T r=0.27$. From Fig. 9.1 we can see that the equal-consumption line passes under the working point ( $V d d, V t h$ ), meaning that the circuit with fewer transitions presents the best power figure. Hence, the only remaining architecture is the Sequential-wallace 4_16.

The results of the methodology indicate that the Sequential-wallace 4_16 is the circuit presenting the lowest total power consumption for $V d d=1 \mathrm{~V}, V t h=0.12 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$.

| Architecture | Total power [ $\mu W$ ] |
| :---: | :---: |
| RCA | 4618.5 |
| RCA par2 | 8571.8 |
| RCA par4 | 16342.5 |
| RCA horiz.pipe2 | 4987.5 |
| RCA horiz.pipe4 | 5898.4 |
| RCA diag.pipe2 | 5070.3 |
| RCA diag.pipe4 | 6072.1 |
| Wallace | 4788.5 |
| Wallace par2 | 9082.5 |
| Wallace par4 | 17537.0 |
| Sequential | 3939.4 |
| Sequential-wallace 4_16 | **3245.1** |
| Sequential parallel | 4857.9 |

Table 9.3: Consumption of the thirteen multipliers in $\mu W$ for $V d d=1 \mathrm{~V}$, $V t h=0.12 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$.

The actual power consumption under these conditions is shown (after calculation) in Table 9.3, confirming that the Sequential-wallace 4_16 indeed presents the lowest total power consumption.
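As a quick cross-check, the minimum of each table can be found programmatically. The values below are transcribed from Tables 9.2 and 9.3; the dictionary and function names are only illustrative.

```python
# Total power in microwatts, transcribed from Table 9.2 (Vth = 0.4 V)
# and Table 9.3 (Vth = 0.12 V); Vdd = 1 V and f = 62.5 MHz in both cases.
ARCHITECTURES = [
    "RCA", "RCA par2", "RCA par4", "RCA horiz.pipe2", "RCA horiz.pipe4",
    "RCA diag.pipe2", "RCA diag.pipe4", "Wallace", "Wallace par2",
    "Wallace par4", "Sequential", "Sequential-wallace 4_16",
    "Sequential parallel",
]
P_VTH_040 = [735.4, 839.5, 905.4, 681.5, 738.3, 724.9, 790.1,
             545.2, 650.2, 728.1, 2457.2, 1094.4, 2232.5]
P_VTH_012 = [4618.5, 8571.8, 16342.5, 4987.5, 5898.4, 5070.3, 6072.1,
             4788.5, 9082.5, 17537.0, 3939.4, 3245.1, 4857.9]

def lowest_power(names, powers):
    """Return (name, power) of the entry with the smallest total power."""
    return min(zip(names, powers), key=lambda pair: pair[1])

print(lowest_power(ARCHITECTURES, P_VTH_040))
print(lowest_power(ARCHITECTURES, P_VTH_012))
```

The two minima are the Wallace at $V t h=0.4$ V and the Sequential-wallace 4_16 at $V t h=0.12$ V, matching the outcome of the graphical selection in both examples.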

### 9.6 Summary

This chapter presented a new design methodology allowing the selection of the architecture presenting the lowest total power consumption within a set of equivalent designs working at the same (fixed) $V d d, V t h$ and $f$. This methodology considers dynamic power consumption (proportional to the number of transitions), as well as static power consumption (directly related to the number of cells). An example of application was reported for thirteen 16 bit multipliers, showing that, depending on the working condition (i.e. $V d d, V t h$ and $f$ ), the architecture with the lowest total power dissipation is not always the same. Moreover, this technique allows the determination of the architecture presenting the lowest total power consumption for conditions which are different from the one used during synthesis, without the need of re-synthesizing all the circuits.


## Chapter 10

## Physical implementation of four 32 bit multipliers

In the previous chapters, models for the optimal total power consumption have been proposed. In order to validate the reported equations and to reinforce the conclusions drawn, a physical ASIC implementation has been carried out. The circuit was designed to demonstrate both architectural and technology influences on the optimal total power consumption in the case where static power also contributes significantly to the total power. This has been achieved with a state-of-the-art 90 nm technology from ST Microelectronics. The main advantage of this technology is the possibility to integrate, on the same die, 2 different kinds of transistor out of the 3 available. In this way, it is possible to "emulate" the effects of a technology change on the total power consumption with a single chip.

The implemented design is composed of two 32 bit multipliers (RCA basic and RCA parallel 4, these structures being described in detail in Chapter 5), each implemented once with the Standard Vth (SVT) transistors and once with the Low Vth (LVT) transistors, giving a total of 4 multipliers.

After a detailed description of the ASIC structure and functionality, this chapter will present the tools and resources used for the measurements. Then, measured data will be reported and commented. Finally, a discussion on technology parameter variations closes the chapter.

### 10.1 Circuit description

The test ASIC is mainly formed by 4 multipliers corresponding to all possible combinations of two technology flavors (SVT/LVT) with two architectures (RCA basic/RCA parallel 4). The combinations are:

- mult_0: RCA basic with SVT transistor type;
- mult_1: RCA parallel 4 with SVT transistor type;
- mult_2: RCA basic with LVT transistor type;
- mult_3: RCA parallel 4 with LVT transistor type.

The RCA was chosen as the block to be implemented because an architecture that is "slow enough" was needed (the RCA has a larger logical depth than the Wallace) for the expected total power crossings (reported at the end of this chapter) to occur at relatively low frequency (under 20 MHz in this case). This reduced the requirements on the testing tools. Fig. 10.1 illustrates the block diagram of the test circuit. All multipliers have a data size of 32 bit, which corresponds to 64 output bits. Each multiplier also has a separate power supply, so that its power consumption can be measured without including the rest of the circuit. For the same reason, the clock signal is multiplexed to each block: in this way, only the clock tree of the selected multiplier is accounted for during the power measurements. The clock multiplexing, the multiplier register enables and the output demultiplexer are controlled by the external signal sel, which is the binary representation of the number of the multiplier under test.

To verify the correct functioning of the multipliers over many multiplications, each result is added to the previous ones and only the final sum is checked. Mathematically, the content of the shift register after $n$ multiplications can be expressed by:

$$
\begin{equation*}
\text { Sum }=\left[\sum_{i=0}^{n} \text { multiplication(i) }\right] \bmod 2^{64} \tag{10.1}
\end{equation*}
$$

This sum is stored in a 64 bit shift register, which allows the result to be output serially and checked externally after the test.
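A software reference for Eq. (10.1) is straightforward. The sketch below is an illustrative model, not the VHDL of Appendix A: the function name and the operand pairs are assumptions, and the wrap-around of the 64 bit register is modelled with a bit mask.

```python
MASK64 = (1 << 64) - 1  # keeps only the low 64 bits, i.e. arithmetic mod 2^64

def accumulate(operand_pairs):
    """Sum of the products of all operand pairs, modulo 2^64 (Eq. (10.1))."""
    total = 0
    for a, b in operand_pairs:
        total = (total + a * b) & MASK64
    return total

# Two maximal 32 bit operands, multiplied twice: the sum overflows 64 bits.
pairs = [(2**32 - 1, 2**32 - 1)] * 2
print(hex(accumulate(pairs)))
```

Comparing such a pre-computed value with the serially read register content is enough to detect an incorrect multiplication anywhere in the sequence, provided the errors do not cancel modulo $2^{64}$.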

### 10.1.1 Pseudo-random code generator

Since the circuit is designed to work at a maximal frequency of 62.5 MHz (corresponding to a clock period of 16 ns) at nominal conditions (i.e. $V d d=1 \mathrm{~V}$ ), it was not possible to generate the input data for the multipliers externally, due to the high throughput required. Hence, a pseudo-random data generator has been implemented internally. This generator is based on a linear feedback shift register [63] [64] and is constructed as a shift register with some bits logically "xnored" and fed back to the shift register input. The schematic of the data generator is depicted in Fig. 10.2.


Figure 10.2: Schematic of the 64 bit linear feedback shift register

The data is 64 bit wide and provides the two 32 bit vectors used as the two inputs of the multiplier under test.

The particularity of a linear feedback shift register (LFSR) is that all possible codes are generated in an equally distributed way, without repetitions, until all codes have been produced. The only code never generated, and also the one to be avoided, is the "all-ones" code, which is a stable code and always regenerates itself. Another advantage of this implementation is that the generated sequence is always the same given the same starting code. In our circuit, the shift register is reset prior to every multiplication so that knowing the number of executed multiplications $n$ allows us to pre-calculate the result of the cyclic adder expressed in Eq. (10.1) and thus to verify that all the multiplications were executed correctly.
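The behaviour described above can be reproduced with a small software model of an XNOR-feedback LFSR. This is a sketch under stated assumptions: the tap positions of the 64 bit register of Fig. 10.2 are not reproduced here, so the demonstration uses a 4 bit register with taps (4, 3), and the shift direction, the function names and the choice of which half of the code feeds which operand are all hypothetical.

```python
def xnor_lfsr_step(state, width, taps):
    """Advance an XNOR-feedback LFSR by one clock.

    taps are 1-based bit positions; the new bit shifted in at the LSB is
    the XNOR (inverted XOR) of the tapped bits.
    """
    feedback = 1                      # starting at 1 turns the XOR into an XNOR
    for tap in taps:
        feedback ^= (state >> (tap - 1)) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)

def split_operands(code64):
    """Split a 64 bit code into two 32 bit operands (assignment assumed)."""
    return code64 >> 32, code64 & 0xFFFFFFFF

# 4 bit demonstration starting from the reset state 0000.
state, sequence = 0, []
for _ in range(15):
    sequence.append(state)
    state = xnor_lfsr_step(state, width=4, taps=(4, 3))

print([format(s, "04b") for s in sequence])
print(format(xnor_lfsr_step(0b1111, 4, (4, 3)), "04b"))  # all-ones stays all-ones
```

In this small example the register visits every code except all-ones exactly once before returning to the reset state, and the all-ones code maps onto itself, which illustrates why it must be avoided.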

Fig. 10.3 shows the distribution of the generated numbers after 500 and after 10000 clock cycles. With 500 generated numbers, a slightly non-uniform distribution can be observed, due to the small number of generated data. As the amount of generated numbers increases, the distribution becomes more uniform, as shown in Fig. 10.3. It is also interesting to note that, due to the shift nature of the generated data, splitting the 64 bit code into two 32 bit vectors does not change the probability distribution: the derived vectors present the same probability distribution as the original one. Moreover, the product of two uniformly distributed variables follows a distribution proportional to $\ln (1 / x)$, as shown by the last two graphs of Fig. 10.3.


Figure 10.3: Probability distribution of the pseudo-random generated data for 500 and 10000 generated data
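The $\ln (1 / x)$ shape can be justified with a short derivation, using a continuous idealization in which the two operands are treated as independent variables $X$ and $Y$ uniformly distributed on $(0,1)$ (i.e. the 32 bit values normalized by $2^{32}$; this normalization is an assumption made here for the derivation). For the product $Z=X Y$ and $0<z<1$:

$$
\begin{align*}
f_{Z}(z)=\int_{z}^{1} f_{X}(x) f_{Y}\left(\frac{z}{x}\right) \frac{1}{x} d x=\int_{z}^{1} \frac{d x}{x}=\ln \left(\frac{1}{z}\right)
\end{align*}
$$

The lower integration bound is $z$ because $f_{Y}(z / x)$ is non-zero only when $z / x<1$. Small products are therefore much more likely than large ones, even though each operand is uniformly distributed.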

### 10.1.2 Ring oscillators

Besides the design described in Fig. 10.1, two small ring oscillators have been added to the implemented circuit. One is implemented with inverters based on SVT transistors, whereas the other is implemented with inverters based on LVT transistors. Both ring oscillators were designed to have an oscillation frequency of 62.5 MHz at nominal conditions, which corresponds to the expected working frequency of the multipliers under the same conditions. This means:

- ring_lvt: 533 inverters (IVLVTX1)
- ring_svt: 437 inverters (IVSVTX1)
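The inverter counts can be related to an average per-stage delay with the usual first-order ring oscillator relation $f=1 /(2 N t_{d})$. This relation is a textbook approximation and is not stated in the text; the function name is illustrative. The sketch only inverts the relation for the two rings listed above.

```python
def stage_delay(n_inverters, f_osc):
    """Average inverter delay [s] implied by f = 1 / (2 * N * t_d)."""
    return 1.0 / (2 * n_inverters * f_osc)

F_NOMINAL = 62.5e6  # target oscillation frequency [Hz]

for name, n in (("ring_lvt", 533), ("ring_svt", 437)):
    print(name, round(stage_delay(n, F_NOMINAL) * 1e12, 2), "ps per stage")
```

Under this approximation the LVT inverter is the faster of the two, which is consistent with the LVT ring needing more stages to reach the same frequency.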


### 10.2 Circuit design and implementation

The design has been written in the VHDL language and the source code can be found in Appendix A. The synthesis of this code has been done using Synopsys Design Compiler V2004.06-SP1 and the activity annotation for accurate power estimation has been obtained with ModelSIM from MentorGraphics version 5.6f. All the Synopsys scripts can be found in Appendix B.

The technology used for the synthesis is the 90 nm from ST Microelectronics. This technology has been fully described in Chapter 4.

The results of the synthesis are stored in a Verilog netlist ready to be used by the Place&Route (P&R) software. In our case, we used SoC Encounter version 4.10 from Cadence. The scripts used for P&R are reported in Appendix C.

Finally, the design passed the DRC (Design Rule Check) done using Calibre DRC from MentorGraphics. The final layout of the circuit is shown in Fig. 10.4.


Figure 10.4: Final layout of the demonstrator circuit

In Fig. 10.4 we can recognize the two RCA parallel 4 multipliers in the upper part and the two RCA basic multipliers in the lower left part, whereas the control logic and the data generator are located in the middle left part. The square block located in the bottom right corner is a compensation circuit required to stabilize the IO cells. A block view of the design is reported in Fig. 10.5.


Figure 10.5: Block view of the demonstrator circuit

The pin names and their functions are:

1. Z_lvt: Output of the ring oscillator formed by 533 LVT type inverters;
2. S_out: Serial output of the shift registers. This output is used to read the content of the shift registers. From the read value the correct multiplier behavior can be verified;
3. S_in: Serial input of the shift registers. This pin can be used to enter a value to be multiplied or to verify the correct functioning of the shift registers;
4. Vdd_0: Supply voltage for the multiplier 0 (RCA basic with SVT transistors);
5. Clk: Clock of the system;
6. Vdd_2: Supply voltage for the multiplier 2 (RCA basic with LVT transistors);
7. Load_n: When low, data is loaded in parallel from the p_in input into the shift registers (see Fig. 10.1). This is the typical behavior during the sum and accumulation process;
8. Vss_1: System ground;
9. Vss_ref: System ground;
10. Vdd_g: Supply voltage for the IO_REF_COMPENSATION block (1.0V);
11. Vss_g: System ground;
12. Sel0: Bit zero of the sel signal. This signal selects which multiplier is under test;
13. Sel1: Bit one of the sel signal. Sel coding is binary;
14. Sel_reg: Selector for routing data from the pseudo-random number generator and to/from the shift registers;
15. Rst_n: System asynchronous reset signal, active low;
16. Vdd_3: Supply voltage for the multiplier 3 (RCA parallel 4 with LVT transistors);
17. Vss_3: System ground;
18. Vdd_IO: I/O supply voltage (3.3V);
19. Vdd_co: Supply voltage for the pseudo-random generator and serial interface block;
20. Vss_IO: IO ground;
21. Vdd_1: Supply voltage for the multiplier 1 (RCA parallel 4 with SVT transistors);
22. Vss_2: System ground;
23. shift_n: When low (and load_n is high), data in the shift register shifts one bit on each clock rising edge;
24. Z_svt: Output of the ring oscillator formed by 437 SVT type inverters;


Figure 10.6: Output pad level converter for different core supply voltages. The linear ramp represents the core supply voltage, the line marked with triangles and constantly bound to zero is the logical level from the core and the line marked by wide rectangles is the corresponding IO output.

Since this circuit is intended to work at very low supply voltage ( $<0.5 \mathrm{~V}$ ), the level converter included in the standard output cells is not suited to provide a correct level conversion under this condition, as shown in Fig. 10.6.

Indeed, in Fig. 10.6 we can observe that, for a core powered with a voltage lower than about 0.45 V, the output value jumps to 3.3 V whereas 0 V should be reported instead. For this reason the output ports (luckily only 3 ports of the design are outputs) have been assigned as analog pads and the level conversion has been left to an external circuit. This problem does not exist for the input ports, since signals arriving with a voltage higher than the core supply are never mistaken for the "0" logic level.

### 10.2.1 Nominal values

The nominal synthesis values, as well as the architectural parameters, for the four implemented multipliers are reported in Table 10.1.

Table 10.1: Nominal values of the four multipliers (Mult0 to Mult3). Columns: Cells, Nets, Area $[\mu m^{2}]$, Activity, Delay [ns], LD_eff, $\chi$, $\chi^{\alpha}$, Vdd [V], Vth0 [V], Pdyn $[\mu W]$, Pstat $[\mu W]$, Ptot $[\mu W]$.

The definitions of the parameters reported in Table 10.1 are:

- Cells: the number of design cells. Note however that a cell can be a very simple one (like an inverter) or a complex one (like a full adder);

- Nets: the number of inter-cells nets in the design;
- Area: the area of the design core; pads and routing spaces are not included;
- Activity: the average number of switching nets over the total number of nets per clock period. These values are obtained by an event-driven simulation with ModelSim (from Mentor Graphics). The results are based on 500 multiplications of pseudo-random data. Standard library delays are used so that glitches are taken into account;
- Delay: the typical combinatorial delay from register output to register input on the critical path;
- LD_eff: the effective logical depth in equivalent NAND2 gates. The term "effective" emphasizes the fact that the logical depth is considered with respect to the throughput frequency, i.e. the frequency of one complete multiplication. In the case of a 4 times parallelization, for instance, LD_eff corresponds to a fourth of the real LD because each block has four clock periods to compute one multiplication. The delay of the reference NAND2 gate has been estimated by chaining 1000 NAND2 gates like a chain of inverters. The inversion effect has been obtained by tying the two inputs together. The resulting delay per gate is 33.5 ps for the SVT transistor type and 27.4 ps for the LVT;
- $\chi$ and $\chi^{\alpha}$ : these two parameters are obtained by using Eq. (6.3) from the nominal $V d d, V t h$ and delay;
- Nominal Vdd: the nominal technology supply voltage;
- Nominal Vth0: the nominal technology threshold voltage;
- Nominal Pdyn: the nominal dynamic power consumption as reported by Synopsys DC;
- Nominal Pstat: the nominal static power consumption as reported by Synopsys DC;
- Nominal Ptot: the nominal total power consumption obtained by summing the nominal Pdyn and the nominal Pstat.
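The LD_eff definition above can be checked numerically. The following is a minimal sketch: the function name `effective_logical_depth` and its parameters are illustrative and do not come from the thesis tool flow. The inputs are the 33.5 ps NAND2 delay of the SVT library given above and the 13.28 ns expected delay of the RCA basic SVT multiplier quoted in Section 10.4.1.

```python
def effective_logical_depth(delay_ns: float, nand2_delay_ps: float,
                            parallel: int = 1) -> float:
    """Effective logical depth in equivalent NAND2 gates.

    delay_ns       -- register-to-register critical path delay, in ns
    nand2_delay_ps -- delay of the reference NAND2 gate, in ps
    parallel       -- degree of parallelization (1 for a non-parallel design)
    """
    real_ld = delay_ns * 1000.0 / nand2_delay_ps  # real logical depth
    # Each parallel block has `parallel` clock periods per multiplication.
    return real_ld / parallel


# RCA basic SVT: 13.28 ns critical path, 33.5 ps per NAND2 gate.
print(round(effective_logical_depth(13.28, 33.5)))  # 396
```

For a 4 times parallelized version the same call with `parallel=4` returns a fourth of the real logical depth, as stated in the definition.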


### 10.3 Measurements setup

To measure the power consumption of each multiplier at its limit of functionality (i.e. the lowest possible supply voltage guaranteeing correct results for a given frequency), the following steps are required:

- Generate the supply voltages: The circuit requires several different supply voltages in order to work. The multiplier under test needs a separate supply voltage. The core logic, containing the pseudo-random data generator and the cyclic adder, requires a supply voltage at the same potential in order to interface the multipliers internally without problems. Moreover, the IO controller IO_REF_COMPENSATION must always be kept at 1.0 V and, finally, the IO pads must be powered with 3.3 V.
- Generate the control signals: The circuit requires a clock and a reset signal. In addition, other signals must be generated in order to select the multiplier under test and to read/write the shift registers used to check the correct functioning of the multiplier. All these signals are generated by an Altera FPGA based board.
- Convert output pins to 3.3V logic level: As reported previously, the circuit outputs (namely Z_lvt, Z_svt and S_out) are implemented as analog signals and hence need to be converted to a 3.3 V logic level in order to be interfaced with the FPGA. This is achieved by placing discrete comparators on the output signals.
- Measure the consumed multiplier current independently: Finally, once the circuit can run, we must be able to measure the current consumed by the specific multiplier under test. This is accomplished by routing the multiplier power supply to the power pins of the selected multiplier through reed relays. The advantage of reed relays is that, the contact being mechanical, virtually no extra consumption is added to the measurement, which would not be the case if a CMOS multiplexer circuit were used instead.


### 10.3.1 PCB design

Fig. 10.7 shows the schematic of the PCB (Printed Circuit Board, designed with Altium Designer 2004 SP3, formerly Protel) used to interface the demonstrator circuit. The three connectors J5, J7 and J9 are the "bridges" between the PCB and the FPGA board. On the right of the schematic, we can see the 4 reed relays (K1-K4) used to supply the multiplier under test and the corresponding LEDs, which provide visual feedback on which multiplier is currently selected. In the bottom left corner there is a small user interface with 4 buttons and 4 LEDs. Two of these LEDs (OK, KO) show whether the content of the shift register was the expected one, i.e. whether the multiplier worked correctly or not. The other buttons and LEDs are there to extend the functionality if needed.

The comparators (U2, U4, U5), required to convert the output level of the three output pins Z_lvt, Z_svt and S_out to 3.3 V, are visible on the left part of the schematic, together with extra connectors (P1-P3, on top) intended for debugging. The reference voltage defining the separation between logic level 0 and logic level 1 is derived from the VCORE pin with a potentiometer. In this way, the reference voltage is always proportional to the supply voltage used for the core. All the chip input signals are connected directly to the FPGA through J9.

The 3.3 V supply is generated from the 5 V available on the card by a voltage regulator shown in the top right corner. A stabilized 1.0 V source was difficult to obtain from the 5 V, as no voltage regulator was found that can provide such a low voltage. For this reason, this supply voltage has been generated using an operational amplifier configured as a voltage follower. In this configuration, the voltage set at the input through a resistor divider is replicated at the output (almost) independently of the drawn current. This block is shown in the bottom center part of Fig. 10.7. Finally, the multiplier power source is supplied externally through the connector VDDM, and the drawn current is measured by connecting an ammeter to the AMP connector. The voltage for the core (which is all the design but the multipliers) can be obtained from VDDM with the jumper JP1 set, or supplied separately through the VCORE connector.
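The role of the resistor divider in front of the voltage follower can be illustrated with the ideal divider relation. The resistor names $R_{1}$ and $R_{2}$, and the assumption that the divider is fed from the 5 V rail, are illustrative only; the actual component values are those of the schematic in Fig. 10.7.

$$
V_{out} \approx V_{in} \frac{R_{2}}{R_{1}+R_{2}}, \qquad \frac{R_{2}}{R_{1}+R_{2}}=\frac{1.0 \mathrm{~V}}{5 \mathrm{~V}}=0.2 \;\Rightarrow\; R_{1}=4 R_{2}
$$

Under this assumption any resistor pair with $R_{1}=4 R_{2}$ sets the 1.0 V level, and the follower isolates the divider from the load current, which is why the output stays (almost) independent of the drawn current.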

### 10.3.2 FPGA based signal generation

The FPGA development card used in this work was a Nova Constellation 20 KE card [65], which is based on an Altera APEX EP20K600EFC672 FPGA. This card has 150 user programmable IOs working at 3.3 V. It can be programmed through USB and JTAG interfaces. A serial programmer is also present, which permits automatic FPGA reconfiguration at power-up. Moreover, this card supports the SignalTap II technology from Altera, which allows register read-back through JTAG at runtime. This feature is very practical for debugging. The card is powered by 5.0 V and provides an internal 40 MHz clock. In our case, an external oscillator is used instead, in order to be able to measure the power consumption at different frequencies.

The FPGA code has been written in VHDL and compiled with Altera Quartus II v6.0 SP1. The source code is reported in Appendix D.

The FPGA pin assignments are reported in Table 10.2.

| Name | PIN | Name | PIN |
| :--- | :--- | :--- | :--- |
| OK_led | PIN_E13 | CHIP_rst_n | PIN_N19 |
| KO_led | PIN_H15 | CHIP_sel[0] | PIN_T22 |
| Power_mult0 | PIN_F12 | CHIP_sel[1] | PIN_M17 |
| Power_mult1 | PIN_H13 | CHIP_sel_reg | PIN_L20 |
| Power_mult2 | PIN_J16 | CHIP_shift_n | PIN_T23 |
| Power_mult3 | PIN_K15 | CHIP_sin | PIN_R23 |
| Switch1 | PIN_E16 | CHIP_sout | PIN_M21 |
| Switch2 | PIN_G16 | ext_clock | PIN_G15 |
| Switch3 | PIN_H16 | mult_num[0] | PIN_E14 |
| Switch4 | PIN_E15 | mult_num[1] | PIN_F15 |
| CHIP_clock | PIN_N22 | LED2 | PIN_G18 |
| CHIP_load_n | PIN_M18 | LED3 | PIN_F18 |
| Z_svt | PIN_U21 | Z_lvt | PIN_U22 |

Table 10.2: Pin assignments for the APEX EP20K600EFC672 FPGA

The FPGA code performs the following steps:

- Select the desired multiplier;
- Reset the internal registers;
- Execute 10'000'000 multiplications and accumulate the results in the 64 bit register;
- Read back the content of the accumulator register;
- Compare the read data with the expected value and output the decision on the pass/fail pins;
- At the end of this sequence, stop the chip clock to allow static power measurements.

A particularity of this code is the use of two clock frequencies for the circuit under test, depending on the executed task. While the chip clock runs at full speed (the same as the FPGA clock) during the execution of the 10'000'000 multiplications, a clock divided by 4 is used during the data read-back phase. This was required in order to execute tests at frequencies higher than 35 MHz (like the nominal circuit frequency of 62.5 MHz). The limiting factor was the propagation delay of the comparator used to convert the low voltage level of the S_out pin to the 3.3 V level of the FPGA. If the frequency was too high, the read value was latched before it was ready.
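The effect of the divided read-back clock can be illustrated with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and treating the 35 MHz figure quoted above as a hard limit of the comparator path is a simplifying assumption.

```python
COMPARATOR_LIMIT_MHZ = 35.0  # highest read-back clock tolerated (value quoted in the text)
READBACK_DIVIDER = 4         # clock division used during the read-back phase


def readback_clock_mhz(chip_clock_mhz: float,
                       divider: int = READBACK_DIVIDER) -> float:
    """Clock seen by the shift register during the read-back phase."""
    return chip_clock_mhz / divider


def readback_is_safe(chip_clock_mhz: float) -> bool:
    """True if the divided clock stays within the assumed comparator limit."""
    return readback_clock_mhz(chip_clock_mhz) <= COMPARATOR_LIMIT_MHZ


print(readback_clock_mhz(62.5))  # 15.625
print(readback_is_safe(62.5))    # True
```

At the nominal 62.5 MHz the read-back therefore runs at about 15.6 MHz, well below the frequency at which the comparator delay caused wrong values to be latched.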

### 10.3.3 MATLAB based measurements automation

To test the manufactured circuits, a large number of current measurements was required at different frequencies and supply voltages, and this for every multiplier. Moreover, the measurement of the power consumption at runtime needed to be synchronized with the design under test. For these reasons, an automated way to set the parameters (frequency, supply voltage) and to check the results was required.

To perform an automated measurement the following devices have been used:

- Agilent 33250A: Frequency generator, this device can generate a square wave frequency up to 80 MHz ;
- Keithley 213: Power supply and control signal generator, this device is a Quad Voltage Source (QVS) and includes 8 digital inputs and 8 digital outputs.
- Keithley Sourcemeter 2400: Power supply and ammeter with a precision up to 10 pA .

All these devices support the GPIB (General Purpose Interface Bus) protocol, a standard for controlling instruments remotely. The instruments were connected by cable to a computer equipped with a National Instruments acquisition card and controlled by MATLAB. In order to use the GPIB protocol, the Instrument Control Toolbox for MATLAB was required. The MATLAB source code used for the measurements is reported in Appendix E.

To determine whether a multiplier was able to work at a given frequency and supply voltage, the test was performed 10 times in a row under the same conditions. If at least one of these 10 tries was successful, the multiplier was considered capable of working under this condition (even if not every time).

The frequency range for most of the tests spans from 1 to 20 MHz, whereas the chosen supply voltage resolution was 10 mV.

Finally, the core (i.e. all the design but the multipliers) was supplied with 100 mV more than the multiplier under test, in order to avoid, as much as possible, being limited by the minimum working supply voltage of the data generator block.
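The measurement procedure described above can be sketched as follows. This is not the MATLAB code of Appendix E: the callback `run_test`, the function names and the search direction (lowering the supply from a starting value until the first failure) are assumptions made for illustration. Only the 10 tries, the 10 mV step and the 100 mV core offset come from the text.

```python
from typing import Callable, Optional

# run_test(freq_mhz, vdd_mult, vdd_core) -> True if the accumulator matched.
TestFn = Callable[[float, float, float], bool]


def works(run_test: TestFn, freq_mhz: float, vdd_mult: float,
          core_offset: float = 0.100, tries: int = 10) -> bool:
    """A point is accepted if at least one of `tries` runs passes."""
    vdd_core = vdd_mult + core_offset  # core supplied 100 mV above the multiplier
    return any(run_test(freq_mhz, vdd_mult, vdd_core) for _ in range(tries))


def lowest_working_vdd(run_test: TestFn, freq_mhz: float,
                       vdd_start: float = 1.0, vdd_stop: float = 0.1,
                       step_mv: int = 10) -> Optional[float]:
    """Lower the supply in `step_mv` steps; return the last working value."""
    lowest = None
    mv = round(vdd_start * 1000)  # integer millivolts avoid rounding drift
    stop_mv = round(vdd_stop * 1000)
    while mv >= stop_mv:
        vdd = mv / 1000.0
        if not works(run_test, freq_mhz, vdd):
            break
        lowest = vdd
        mv -= step_mv
    return lowest


# Hypothetical stand-in for the real bench, used only to exercise the search.
def fake_bench(freq_mhz: float, vdd_mult: float, vdd_core: float) -> bool:
    return round(vdd_mult * 1000) >= 300


print(lowest_working_vdd(fake_bench, freq_mhz=5.0))  # 0.3
```

On the real bench, `run_test` would set the generator frequency and the two supplies over GPIB, trigger the FPGA sequence and return the state of the pass/fail pins.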

### 10.4 Measurements

Two chips (No. 2 and No. 3) have been chosen (for no particular reason) for a complete power consumption analysis and discussion. First, the power measurements at nominal conditions ( $V d d=1 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$ ) are compared with the values reported by Synopsys DC. Then, detailed power measurements for each multiplier of both chips are presented for frequencies ranging from 1 to 20 MHz. Finally, the power and delay variability is discussed, based on data measured on 16 dies manufactured on the same wafer.

### 10.4.1 Nominal values

The nominal power consumptions and the critical path delay for chip No. 2 and No. 3 are reported in Table 10.3.

|  | Chip No.2 |  |  |  | Chip No.3 |  |  |  |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  | Mult_0 | Mult_1 | Mult_2 | Mult_3 | Mult_0 | Mult_1 | Mult_2 | Mult_3 |
| Pstat $[\mu W]$ | 132 | 501 | 1152 | 4515 | 169 | 631 | 1571 | 5931 |
| Pdyn $[\mu W]$ | 3080 | 3312 | 2957 | 3395 | 3103 | 3350 | 2978 | 3385 |
| Ptot $[\mu W]$ | 3212 | 3813 | 4109 | 7910 | 3272 | 3981 | 4549 | 9316 |
| fmax $[\mathrm{MHz}]$ | 74.5 | - | - | - | 75.4 | - | - | - |
| Delay $[\mathrm{ns}]$ | 13.42 | - | - | - | 13.26 | - | - | - |

Table 10.3: Measured nominal (1V@62.5MHz) power consumption and maximal working frequency

These values can be compared with the ones provided by Synopsys and reported in Table 10.1. The most remarkable difference concerns the static power consumption. Indeed, the measured static power is around 4-5 times larger than expected. This clearly points out a major problem of nanometer CMOS technologies: parameter variability. As explained in more detail in the coming subsections, this large increase of the static power is most likely due to a threshold voltage much lower than expected, probably resulting from poorly controlled effective transistor dimensions and doping profiles. Nevertheless, the ratios of the static power between the 4 multipliers of the same chip remain almost correct: the parallel 4 version, for instance, shows about 4 times the static power of the basic version.
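The statement about the static power ratios can be verified directly on the values of Table 10.3. The short script below is only an illustration; the mapping of Mult_2 to the RCA basic LVT version is inferred by elimination from the pin list and the other multiplier descriptions, and the dictionary and function names are hypothetical.

```python
# Measured nominal static power from Table 10.3, in microwatts.
# Index = multiplier number: 0 = RCA basic SVT, 1 = RCA parallel 4 SVT,
# 2 = RCA basic LVT (inferred), 3 = RCA parallel 4 LVT.
P_STAT_UW = {
    "chip2": [132, 501, 1152, 4515],
    "chip3": [169, 631, 1571, 5931],
}


def parallel_to_basic_ratio(chip: str, basic: int, parallel: int) -> float:
    """Static power of the parallel 4 version over the basic version."""
    values = P_STAT_UW[chip]
    return values[parallel] / values[basic]


for chip in P_STAT_UW:
    svt = parallel_to_basic_ratio(chip, basic=0, parallel=1)
    lvt = parallel_to_basic_ratio(chip, basic=2, parallel=3)
    print(f"{chip}: SVT ratio {svt:.2f}, LVT ratio {lvt:.2f}")
```

All four ratios lie between about 3.7 and 3.9, i.e. close to the factor of 4 expected from the degree of parallelization.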

Regarding the dynamic power, the measurements are less surprising, but the results still show a dynamic consumption lower than expected. The reasons could be lower capacitances, due to variable effective transistor dimensions, and/or a slightly different activity (as a reminder, the activity of all nodes, including internal cell nodes, was estimated from the activity of the nets connecting the cells).

The delay of the critical path was measured by increasing the frequency at the nominal supply voltage of 1 V until the multiplier stopped working. The measurement was only possible for Multiplier 0 (RCA basic SVT) because the frequency generator available in our laboratory only reaches 80 MHz, which was not enough for measuring the other three multipliers. The measured delay is very close to the expected one of 13.28 ns.
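The delays of Table 10.3 follow from the measured maximal working frequencies, assuming that the critical path delay equals one clock period at that frequency. The helper name below is illustrative.

```python
def critical_path_delay_ns(fmax_mhz: float) -> float:
    """Clock period at the highest working frequency, in ns."""
    return 1000.0 / fmax_mhz


for chip, fmax in (("No. 2", 74.5), ("No. 3", 75.4)):
    print(f"chip {chip}: {critical_path_delay_ns(fmax):.2f} ns")

# Shortest delay that can be observed with an 80 MHz generator.
print(critical_path_delay_ns(80.0))  # 12.5
```

The last line shows why only Multiplier 0 could be characterized: any multiplier with a critical path shorter than 12.5 ns at 1 V still works at the highest frequency the generator can deliver.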

### 10.4.2 Lowest working supply voltage

The expected lowest working supply voltage for a given frequency is reported in Fig. 10.8 and is based on Eq. (6.15).


Figure 10.8: Expected optimal supply voltage

As we can observe, the supply voltages decrease with frequency until they reach the corresponding threshold voltage at 0 MHz. The non-parallel versions have a slightly steeper slope than the parallelized versions. This is due to the larger LD_eff of the basic version, which makes it "harder" to reduce the supply voltage until very low frequencies are reached. Mathematically, the larger LD_eff appears as a larger $\chi$.

A similar plot has been obtained, by measurement, for chip No. 2 and No.3. Results are reported in Fig. 10.9 and Fig. 10.10 respectively.


Figure 10.9: Measured optimal supply voltage for chip No. 2

At first sight, Fig. 10.9 and Fig. 10.10 show the same shape as Fig. 10.8, but in reality they present lower values than the theoretical case. In particular, it is interesting to note the converging values at very low frequencies (the missing values are due to non-working conditions resulting from a too low supply voltage). As seen before, these converging values correspond to the threshold voltage of the technology. From these plots, we can infer that the threshold voltages of the measured circuits are around 0.2 V or even lower. This is quite different from the value of around 0.33 V reported in Fig. 10.8 (remember that $V t h=V t h 0-\eta V d d$ ). With a lower $V t h$, it is now understandable why the optimal $V d d$ values are lower than the theoretical ones, while the shape of the plot is maintained.

This much lower threshold voltage can also explain the large factor of 4-5 between the measured static power and the expected one at nominal conditions. In fact, the static power depends exponentially on the threshold voltage, as reported in Eq. (3.5).

It is also worth noting that all multipliers of chip No. 2 and No. 3 worked at 250 mV, and that two multipliers (mult_1 and mult_2) of chip No. 3 worked at a supply voltage as low as 210 mV at a frequency of 1 MHz!
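The exponential sensitivity mentioned above can be illustrated with the usual sub-threshold expression, in which the off current is proportional to $\exp(-V t h /(n U_{t}))$. The sketch below assumes this form; the slope factor `n = 1.4` and the temperature of 300 K are illustrative values and not parameters extracted from the measured technology.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_k: float = 300.0) -> float:
    """Thermal voltage Ut = kT/q, in volts."""
    return BOLTZMANN * temp_k / ELECTRON_CHARGE


def leakage_ratio(delta_vth: float, n: float = 1.4,
                  temp_k: float = 300.0) -> float:
    """Factor by which the sub-threshold current grows when Vth drops by delta_vth (V)."""
    return math.exp(delta_vth / (n * thermal_voltage(temp_k)))


def vth_shift_for_ratio(ratio: float, n: float = 1.4,
                        temp_k: float = 300.0) -> float:
    """Vth reduction (V) that multiplies the sub-threshold current by `ratio`."""
    return n * thermal_voltage(temp_k) * math.log(ratio)


for delta_mv in (25, 50, 100):
    print(f"Vth lowered by {delta_mv} mV -> leakage x{leakage_ratio(delta_mv / 1000):.1f}")
```

With these assumed parameters, a threshold reduction of a few tens of millivolts is already enough to multiply the static power several times, which is why small deviations of the real threshold voltage have such a visible effect on the measurements.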


Figure 10.10: Measured optimal supply voltage for chip No. 3

### 10.4.3 Optimal total power

The total power consumption can now be calculated for the lowest working supply voltage (optimal $V d d$ ) using Eq. (3.6). Fig. 10.11 illustrates it for the theoretical case.

The measured optimal total power for chip No. 2 and No. 3 is reported in Fig. 10.12 and Fig. 10.13 respectively. The missing points correspond to values of the optimal $V d d$ too low to permit correct measurements.

As for the optimal supply voltage, we can observe that the shape of the measured plots is very similar to the theoretical one, but the corresponding optimal power is lower for the real circuits. This can, once more, be explained by the lower real threshold voltage, which permits a lower optimal supply voltage and hence a lower optimal total power.

The measured optimal supply voltages for mult_3 (RCA parallel 4 LVT) were very low and, for this reason, the reported optimal total power should be interpreted with care.

In both chips, the measurements for the multipliers using the SVT transistor type are very similar, whereas chip No. 3 shows a slightly higher consumption for the LVT type compared to chip No. 2. This can be explained by the higher static power consumption of chip No. 3, as reported in Table 10.3, which manifests itself mainly on the LVT multipliers, where static power is predominant.

Figure 10.11: Expected optimal total power consumption

Figure 10.12: Measured optimal total power consumption for chip No. 2

Figure 10.13: Measured optimal total power consumption for chip No. 3

The large variations in the technology parameters, discussed further in the next section, make it very difficult to accurately predict the optimal total power over such a large range of frequencies. Nevertheless, the main shapes of the plots are maintained. In particular, let us consider the crossing points between the RCA basic SVT curve and both the RCA parallel 4 SVT and the RCA basic LVT curves. In the theoretical plot these crossings occur at 7 MHz and 17 MHz respectively.

If we look at the same crossings in the measured data, we observe them at 5 MHz and 13 MHz for chip No. 2 and at 6 MHz and 17 MHz for chip No. 3. These results are very close to the expected ones, considering the large variations of the technology parameters observed.

In practice, we can say that if a design is intended to work at 2 MHz, the RCA basic SVT is the best choice for low power. If it is designed for 10 MHz, the RCA parallel 4 SVT shows a better power profile, and at 20 MHz the RCA basic SVT will consume more than the RCA basic LVT, which in turn will consume more than the RCA parallel 4 SVT.

### 10.4.4 Power and delay variability

In the preceding discussions, it was pointed out many times that technology parameters vary considerably from die to die even when the dies come from the same wafer, as is the case for all the chips investigated in this thesis.

To explore this aspect a little deeper, the static power, dynamic power and critical path delay (obtained from the maximal working frequency) of multiplier 0 (RCA basic SVT) at nominal conditions ( $1 \mathrm{~V} / 62.5 \mathrm{MHz}$ ) have been measured for 16 different dies.


Figure 10.14: Nominal static power distribution for 16 chips


Figure 10.15: Nominal dynamic power distribution for 16 chips at 62.5 MHz

The data corresponding to the nominal static power is reported in Fig. 10.14. Here, we can see that the static power spans from a minimum of $75 \mu W$ to a maximum of $190 \mu W$, which corresponds to a factor larger than 2.5! Moreover, the average value of $117 \mu W$ is more than 3.5 times larger than the value estimated by Synopsys! This variability makes power estimation very problematic for circuits dominated by static power.

The nominal dynamic power consumption presents a much lower variability between dies, as illustrated in Fig. 10.15. In fact, all measured values lie in a range from $3012 \mu W$ to $3121 \mu W$, which corresponds to a variation of $\pm 2 \%$ around the average value of $3062 \mu \mathrm{~W}$. This is "only" $17 \%$ lower than the value provided by Synopsys. Moreover, by comparing the static power distribution with the dynamic one, we can observe a small correlation between the two. Indeed, most of the time a die with a higher static power consumption also shows a relatively high dynamic power. A possible explanation is the short-circuit current (explained in Chapter 2.1.2). In fact, a higher sub-threshold current (lower $V t h$ or higher $I_{0}$ or both) also means a higher "on" current, which increases the short-circuit dissipation. This could also explain why the variations of the dynamic power only amount to a few percent.


Figure 10.16: Delay distribution of the RCA SVT multiplier for 16 chips

Fig. 10.16 reports the measured critical path variability over 16 different dies. As for the dynamic power, the variation is quite limited and corresponds to $\pm 3 \%$ around the average value of 13.55 ns. Moreover, this delay is only $2 \%$ larger than the value reported by Synopsys. It is also worth noting that no correlation was observed between the power consumption and delay distributions.
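The variability figures quoted in this section can be reproduced from the minimum, maximum and average values given in the text. The helper names below are illustrative; the script only restates the arithmetic and uses no data beyond the numbers quoted above.

```python
def spread_factor(lowest: float, highest: float) -> float:
    """Ratio between the highest and the lowest measured value."""
    return highest / lowest


def half_range_percent(lowest: float, highest: float, mean: float) -> float:
    """Half of the measured range, in percent of the mean (the +/- figure)."""
    return 100.0 * (highest - lowest) / (2.0 * mean)


def deviation_percent(measured: float, expected: float) -> float:
    """Relative deviation of a measured value from the expected one, in percent."""
    return 100.0 * (measured - expected) / expected


# Static power: 75 uW to 190 uW over the 16 dies.
print(f"static spread factor: {spread_factor(75, 190):.2f}")
# Dynamic power: 3012 uW to 3121 uW around an average of 3062 uW.
print(f"dynamic half range: {half_range_percent(3012, 3121, 3062):.1f} %")
# Delay: 13.55 ns measured on average against 13.28 ns expected.
print(f"delay deviation: {deviation_percent(13.55, 13.28):.1f} %")
```

The three results (a factor slightly above 2.5, a half range just under 2 % and a deviation of about 2 %) agree with the values stated in the text.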

### 10.5 Summary

This chapter discussed the demonstrator circuit used to investigate the influence of technology and architectural modifications on the optimal total power. The technology used was the 90 nm process from STMicroelectronics, which permitted us to implement two different transistor types on the same chip. Moreover, two different 32 bit multipliers were implemented for each transistor type, yielding a total of 4 multipliers. The first part of the chapter was dedicated to the circuit design, followed by the description of the measurement setup and, at the end, by the presentation and discussion of the measured data. In particular, we observed an average static power 3.5 times higher than the typical values estimated by Synopsys, whereas the dynamic power was only $17 \%$ lower on average. The large difference between simulation and real measurements can be explained by the threshold voltage, which, in reality, appeared to be much lower than the theoretical one. Besides these important differences observable at nominal conditions ( $V d d=1 \mathrm{~V}$ and $f=62.5 \mathrm{MHz}$ ), the total power for multipliers working at the lowest possible supply voltages was discussed. The measured curves showed a shape very similar to the expected one, but with different absolute values. This can also be explained by the lower real $V t h$. At the end of the chapter, the variability of power and delay was reported for the same multiplier in 16 different chips. The results showed a static power varying by as much as a factor of 2.5 between the lowest and highest value for multipliers coming from the same wafer! Without doubt, this large variability of static power will be a major issue in nanometer CMOS technologies, especially for designs where static power is a large contributor.


## Chapter 11

## Conclusions

With the introduction of nanometer CMOS technologies, new sources of power dissipation appeared. The continued shrinking of transistor sizes, dictated by Moore's law, reached a point where new physical phenomena need to be faced. One of the most important problems related to these new phenomena is the huge increase of the static power consumption, which can become even larger than the dynamic power of a running circuit. The static power consumption is the portion of the power dissipation due to the current that constantly flows from $V d d$ to $V s s$, even when the circuit is in the idle state. In today's technologies, the principal contributor to static power is the sub-threshold current flowing through the transistors in the off state. This type of current arises from the diffusion of the minority carriers in the transistor channel. The reason why this current is increasing so much in recent nanometer technologies is that it has an exponential dependency on the transistor threshold voltage, which is constantly reduced with new technologies to keep the speed acceptable.

The goal of this thesis was to investigate the low power methodologies in technologies dominated by a large static power consumption. In particular, we were interested in the architectural as well as in the technology influence on the total power consumption.

The principal theoretical framework presented in this thesis considers a scenario in which both the supply voltage and the threshold voltage can be freely modified. Under this assumption, the total power consumption clearly shows a minimum located at very low supply voltages (examples showed optimal $V d d$ lower than 0.4 V even at a frequency as high as 62.5 MHz). The derivation of the ratio $k 1$ (i.e. optimal dynamic power over optimal static power) showed that, this ratio being quite constant compared to the variation of $I_{o n} / I_{o f f}$ between technology nodes, nanometer technologies will require a growing ratio $a / L D$ (activity over logical depth) to reach this optimum.

This, for instance, will make pipelining preferable to parallelization. After that, we have seen the influence of $a, L D$ and $f$ on the optimal $V d d$ and $V t h$, showing that frequency mainly influences the optimal $V t h$, logical depth mainly influences the optimal $V d d$, while activity influences both of them. By comparing architectures under the rough approximation of a quasi-constant $k 1$, we realized that pipelining and parallelization are more effective for low power when applied to designs with a high logical depth and a high frequency. We also observed that new technologies, characterized by a lower $\chi$ factor compared to older ones, will tend to penalize pipelining and parallelization, whereas the condition for a power saving by pipelining remains easier to fulfill than the one for parallelization.

Going beyond the quasi-constant $k 1$ approach, analytical closed-form equations have been derived for the calculation of the optimal $V d d$, optimal $V t h$ and optimal total power directly from the architectural and technology parameters. Thanks to these equations, we observed that the optimal $V t h$ is almost unchanged by pipelining, while parallelization increases it by a precise amount, which only depends on the degree of parallelization. Moreover, sequential multipliers were clearly shown to be inadequate for low power at the optimal working condition due to their large effective logical depth and high number of transitions $(a \cdot N)$.

From a low power point of view, the best characteristics for an ideal technology would be a capacitance $C$, delay constant $k_{t}$ and sub-threshold slope $n$ as low as possible, whereas the reference current $I_{0}$ and alpha power law coefficient $\alpha$ should be as high as possible.

After the technology influence discussion, a few possibilities for modifying the threshold voltage (like body bias, transistor resizing, technology choice) were also presented.

Under all the investigated architectural and technology modifications, the simple approximated analytical equations developed in Chapter 6 for the optimal $V d d, V t h$ and Ptot showed very good results, reporting errors always lower than a few percent compared to numerical computation based on non-approximated equations.

In a second framework, the opposite case was considered, in which the threshold voltage as well as the supply voltage were assumed constant. This particular case was explored because it corresponds to the most typical case for industrial designers. In fact, they often have a fixed supply voltage and threshold voltage imposed by the technology and/or the devices the circuit has to interface. Under this condition, graphical tools for comparing the total power of different architectures were presented. Examples of application of these tools to the same multipliers used in the preceding framework were reported. In particular, we showed that, depending on the constraints used, the multiplier presenting the lowest total power is not always the same.

At the end of the thesis, a physical implementation of four different 32 bit multipliers was presented. These 4 multipliers represent all the possible combinations of two transistor types (SVT and LVT) and two architectures (RCA basic and RCA parallel 4). After an in-depth description of the circuit design flow and measurement setup, the nominal power consumptions as well as the optimal ones (those corresponding to the lowest working supply voltages) were compared to the theoretical values. The measured data showed, on average, a static power 3.5 times larger than expected. This was attributed to real threshold voltages much lower than the simulated ones. Nevertheless, the shapes of the plots remained very similar to the expected ones. This means that, even if the absolute values were not well estimated by the models (due to the large variability of the technology parameters), the relation between them was respected. This was essential to be able to predict which multiplier presented the lowest total power for a given working frequency. It is also interesting to note that a few multipliers were able to work at a supply voltage of 210 mV at a frequency of 1 MHz. Finally, the variability of power and delay for 16 chips coming from the same wafer was reported. In particular, the static power at nominal conditions ( $V d d=1 \mathrm{~V}, f=62.5 \mathrm{MHz}$ ) fluctuated strongly, with a factor of more than 2.5 between the highest and the lowest measured values. On the other hand, the variations of the dynamic power and delay were within $\pm 3 \%$.

From these observations, we can conclude that the major problem that technologists will have to face in the future will be the difficulty of mastering the variations of the technology parameters. The price to pay for not achieving this would be many circuit instabilities and very low production yields, due to many dies being unable to meet the specifications.

## Bibliography

[1] SIA ITRS roadmap update 2006 - http://www.itrs.net/.
[2] http://en.wikipedia.org/wiki/moore's_law.
[3] Transistor elements for 30nm physical gate length and beyond. Intel Technology Journal, Vol. 06(No. 2):42-54, May 2002.
[4] H. Soeleman, K. Roy, and B. Paul. Robust ultra-low power sub-threshold DTMOS logic. International Symposium on Low Power Electronics and Design, 2000.
[5] T. Enomoto, Y. Oka, H. Shikano, and T. Harada. A self controllable voltage level (SVL) circuit for low power high speed CMOS circuit. European Solid-State Circuits Conference, pages 411-414, 2002.
[6] S. Cserveny, J.-M. Masgonty, and C. Piguet. Stand-by power reduction for storage circuits. PATMOS Conference, September 2003.
[7] S.M. Kang and Y. Leblebici. CMOS Digital Integrated Circuits: Analysis and Design, Third Edition. McGraw-Hill, 2003.
[8] M. Anis and M. Elmasry. Multi-Threshold CMOS Digital Circuits. Kluwer Academic Publisher, 2003.
[9] K. Usami, N. Kawabe, M.Koizumi, K. Seta, and T. Furusawa. Automated selective Multi-Threshold design for ultra-low standby applications. International Symposium on Low Power Electronics and Design, 2002.
[10] J. Kao and A. Chandrakasan. MTCMOS sequential circuits. European SolidState Circuits Conference, 2001.
[11] V.R. von Kaenel, M.D. Pardoen, E. Dijkstra, and E.A. Vittoz. Automatic adjustment of threshold \& supply voltage for minimum power consumption CMOS
digital circuits. IEEE Symposium on Low Power Electronics, pages 78-79, October 1994.
[12] C. H. Kim and K. Roy. Dynamic Vt SRAM: A leakage tolerant cache memory for low voltage microprocessor. International Symposium on Low Power Electronics and Design, 2002.
[13] H. Mizuno, K. Ishibashi, T. Shimura, T. Hattori, S. Narita, K. Shiozawa, S. Ikeda, and K. Uchiyama. An 18 uA standby current $1.8 \mathrm{~V}, 200 \mathrm{MHz}$ microprocessor with self-substrate-biased data-retention mode. IEEE International Solid-State Circuits Conference, pages 280-281, 1999.
[14] F. Assaderaghi, D. Sinitsky, S. Parke, S. Bokor, P.K. Ko, and C.Hu. A dynamic threshold voltage MOSFET (DTMOS) for ultra-low voltage operation. IEEE International Electron Devices Meeting Technical Digest, pages 809-812, 1994.
[15] A. P. Chandrakasan and R.W. Brodersen. Low power CMOS digital design. IEEE Journal of Solid-State Circuits, Vol. 27(No. 4):473-484, April 1992.
[16] H.J.M Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, Vol. 19(No. 4):468-473, August 1984.
[17] D. Auverne, P.Maurine, and N. Azémard. Low Power Electronics Design, Chapter 6:Modeling for Designing in Deep Submicron Technologies. CRC Press, 2005.
[18] K. Nose and T. Sakurai. Closed-form expression for short-circuit power of short channel CMOS gates and its scaling characteristics. International Technical Conference on Circuits/Systems, Computers and Communications, pages 1741-1744, 1998.
[19] S. Turgis, N. Azemard, and D. Auvergne. Short-circuit power dissipation calculation on CMOS inverters using the equivalent short-circuit capacitance concept. PATMOS Conference, 1995.
[20] T. Sakurai and A. R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE Journal of Solid-State Circuits, Vol. 25(No. 2):584-594, April 1990.
[21] T. Sakurai and A.R. Newton. A simple MOSFET model for circuit analysis. IEEE Transactions on Electron Devices, Vol. 38(No. 4):887-894, April 1991.
[22] J.L. Rosselló and J. Segura. Accurate modelling of leakage currents in nanometre CMOS technologies. Electronics Letters, Vol. 41(No. 3):122-124, February 2005.
[23] Z. Chen, M. Johnson, L. Wei, and K. Roy. Estimation of standby leakage power in CMOS circuits considering accurate modeling of transistor stacks. International Symposium on Low Power Electronics and Design, pages 239-244, 1998.
[24] B.J. Sheu, D.L.Scharfetter, P.-K. Ko, and M.-C. Jeng. BSIM: Berkley ShortChannel IGFET model for MOS transistors. IEEE Journal of Solid-Sate Circuits, Vol. 22(No. 4):558-566, August 1987.
[25] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meinand. Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE, Vol. 91(No. 2):pp. 305-327, February 2003.
[26] N. S. Kim, T. Austin, D. Blaauw, T.Mudge, K. Flauter, J.S. Hu, M.J. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. IEEE Computer, pages 68-75, December 2003.
[27] John Robertson. High dielectric constant gate oxides for metal oxide si transistors. Reports on Progress in Physics, Vol. 69:327-396, 2006.
[28] A. Keshavarzi, K. Roy, and C. F. Hawkins. Intrinsic leakage in low power deep submicron CMOS ICs. IEEE International Test Conference, pages 146-155, 1997.
[29] S. Mukhopadhyay and K. Roy. Modeling and estimation of total leakage current in nano-scaled CMOS devices considering the effect of parameter variation. IEEE International Symposium on Circuits and Systems, pages 172-175, 2003.
[30] A. Ferré and J. Figueras. Low Power Electronics Design, Chapter 3: Leakage in CMOS nanometric technologies. CRC Press, 2005.
[31] D. Helms, E. Schmidt, and W. Nebel. Tutorial: Leakage in CMOS circuits - an introduction. PATMOS Conference, pages 17-35, 2004.
[32] TSMC. ANTCBN90G_110A - TSMC technology manual.
[33] SIA ITRS roadmap 2004 - http://www.itrs.net/.
[34] http://www.intel.com/pressroom/archive/releases/20070128comp.htm.
[35] K. Nose and T. Sakurai. Optimization of Vdd and Vth for low-power and highspeed applications. Asia South Pacific Design Automation Conference, pages 469-474, January 2000.
[36] M. H. Fino. A simple submicron MOSFET model and its application to the analyitcal characterization of analog circuits. European Conference on Circuit Theory and Design, August 2005.
[37] K.A. Bowman, B.L. Austin, J.C. Eble, Xinghai Tang, and J.D. Meindl. A physical alpha-power law MOSFET model. IEEE Journal of Solid-State Circuits, Vol. 34(No. 10):1410-1414, October 1999.
[38] T. Sakurai. Alpha power-law MOS model. IEEE Solid-State Circuits Society Newsletter, Vol. 9(No. 4):4-5, October 2004.
[39] J.L Rosselló and J. Segura. Charge-based analytical model for evaluation of power consumption in submicron CMOS buffers. IEEE Transaction on Computer-Aided Design of Integrated Circuits and System, Vol. 21(No. 4), April 2002.
[40] J.A. Butts and G.S. Sohi. A static power model for architects. ACM International Symposium on Microarchitecture, pages 191-201, December 2000.
[41] C.S. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic Computers, Vol. 13:14-17, February 1964.
[42] P. C. H. Meier. Analysis and Design of Low Power Digital Multipliers. PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, 1999.
[43] H. Reza. A new multiplier using Wallace structure and carry select adder with pipelining. IEEE International Symposium on Circuits and Systems, 2002.
[44] R. Zimmermann. Binary Adder Architectures for Cell-Based VLSI and their Synthesis. PhD thesis, Swiss Federal Institute of Technology, Zurich, 1997.
[45] R. P. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, Vol. 31(No. 3):260-264, 1982.
[46] W. J. Townsend, E. E. Swartzlander, Jr., and J. A. Abraham. A comparison of dadda and wallace multiplier delays. In F. T. Luk, editor, Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, pages 552-560, December 2003.
[47] K.A.C. Bickerstaff, M. Schulte, and Jr. Swartzlander, E.E. Reduced area multipliers. International Conference on Application-Specific Array Processors, pages 478-489, October 1993.
[48] A. Wang and A. Chandrakasan. A $180-\mathrm{mV}$ subthrehsold FFT processor using a minimum energy design methodology. IEEE Journal of Solid-State Circuits, Vol. 40(No. 1):310-319, January 2005.
[49] H. Q. Dao, B. R. Zeydel, and V. G. Oklobdzija. Architectural considerations for energy efficiency. International Conference on Computer Design, pages 13-16, 2005.
[50] B. Zhai et. al. A 2.6pJ/Inst subthreshold sensor processor for optimal energy efficiency. VLSI Circuits Symposium, 2006.
[51] B. H. Calhoun and A. Chandrakasan. Characterizing and modeling minimum energy operation for subthrehsold circuits. International Symposium on Low Power Electronics and Design, pages 90-95, 2004.
[52] S. Hanson, B. Zhai, D. Blaauw, D. Sylvester, A. Bryant, and X. Wang. Energy optimality and variability in subthreshold design. International Symposium on Low Power Electronics and Design, pages 363-365, October 2006.
[53] D. Markovic, V. Stojanovic, B. Nikolic, M. A. Horowitz, and R. W. Brodersen. Methods for true energy-performance optimization. IEEE Journal of Solid-State Circuits, Vol. 39(No. 8):1282-1293, August 2004.
[54] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner. Theoretical and practical limits of dynamic voltage scaling. Design Automation Conference, pages 868-873, 2004.
[55] J. Burr and A. Peterson. Ultra low power CMOS technology. NASA VLSI Design Symposium, pages 4.2.1-4.2.13, 1991.
[56] C. Heer and J. Berthold. Designing low power circuits: an industrial point of view. PATMOS Conference, September 2001.
[57] S.M. Sze. Semiconductor Devices - Physics and Technology. John Wiley \& Sons, 1985.
[58] H. Ananthan, C. H. Kim, and K. Roy. Larger-then-Vdd forward body bias in sub0.5 V nanoscale CMOS. IEEE International Symposium Low Power Electronics and Design, pages 8-13, 2004.
[59] H. Ananthan. Evaluation of digital forward body bias for 70nm bulk CMOS. Class Project, EE 695K, Fall 2003.
[60] T. Kuroda et al. A $0.9 \mathrm{~V}, 150 \mathrm{MHz}, 10 \mathrm{~mW}, 4 \mathrm{~mm} 2,2 \mathrm{D}$ discrete cosine transform core processor with variable threshold voltage scheme. IEEE Journal of SolidState Circuits, Vol. 31(No. 11):1770-1779, 1996.
[61] S. Narendra et al. 1.1V 1 GHz communications router with on-chip body bias in 150 nm CMOS. IEEE International Solid-State Circuits Conference, pages 270-271, 2002.
[62] P. Gupta, A.B. Kahng, P. Sharma, and D. Sylvester. Gate-length biasing for runtime-leakage control. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25:1475-1485, 2006.
[63] http://en.wikipedia.org/wiki/linear_feedback_shift_register.
[64] XilinX. Xilinx LogiCORE: Linear Feedback Shift Register V3.0, 28 March 2003.
[65] http://www.nova-eng.com/inside.asp?n=products\&p=constellation.

## List of Publications

## Conferences

- C. Piguet, C. Schuster, J.-L. Nagel. "Static and dynamic power reduction by architecture selection". Proc. Int'l Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS'06, Montpellier, France, September 13-15, 2006.
- C. Piguet, C. Schuster, J.-L. Nagel. "Leakage reduction at architecture level". Proc. Int'l Conference on Integrated Circuit Design \& Technology, ICICDT'06, Padova, Italy, May 24-26, 2006.
- C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. "Architectural and technology influence on the optimal total power consumption". Design, Automation and Test in Europe Conference, DATE06, Munich, Germany, March 06-10, 2006.
- C. Piguet, C. Schuster, J.-L. Nagel. "Réduction des consommations statique et dynamique par sélection des architectures". 5ème journées d'études Faible Tension Faible Consommation, FTFC05, Paris, May 18-19, 2005.
- C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. "Conception d'architectures à Vdd et Vth imposés avec consommation totale minimale". Journées Francophones sur l'Adéquation Algorithme Architecture, JFAAA'05, Dijon, France, January 18-21, 2005.
- C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. "Leakage reduction at the architectural level and its application to 16 bit multiplier architectures". Proc. Int'l Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS’04, Santorini Island, Greece, September 15-17, 2004.
- C. Piguet, C. Schuster, J.-L. Nagel. "Optimizing architecture activity and logic depth for static and dynamic power reduction". Proc. of the 2nd Northeast Workshop on Circuits and Systems, NewCAS'04, Montréal, Canada, June 20-23, 2004.

## Journals

- C. Schuster, J.-L. Nagel, C. Piguet, P.-A. Farine. "An architecture design methodology for minimal total power consumption at fixed Vdd and Vth". Journal of Low Power Electronics, Vol.1(No.1):pp.3-10, April, 2005.


## Appendix A

## VHDL source code

## A. 1 top.vhd

```
-- Title      : Circuit top (32bit)
-- Project    :
-- File       : top.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-17
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-- Description: This is the top of the design.
--              It includes the following blocks:
--              - data_gen
--              - 2 mult32
--              - 2 mult32 parallel 4
--              - one-hot decoder and mux
--              - 2 ring oscillators
-- Copyright (c) 2006
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-17  1.0      schuster  Created

library ieee;
use ieee.std_logic_1164.all;

entity top is
    port (
        clk     : in  std_logic;  -- clock
        rst_n   : in  std_logic;  -- active low async reset
        s_in    : in  std_logic;  -- serial input
        s_out   : out std_logic;  -- serial output
        load_n  : in  std_logic;  -- when low registers are loaded in parallel
        shift_n : in  std_logic;  -- when low data is shifted
        -- select the source for data_out as well as for the data saved in registers
        sel_reg : in  std_logic;
        sel     : in  std_logic_vector(1 downto 0);  -- select the working multiplier
        Z_svt   : out std_logic;   -- out svt ring oscillator
        Z_lvt   : out std_logic);  -- out lvt ring oscillator
end top;

architecture arch of top is

    component data_gen
        port (
            clk        : in  std_logic;
            rst_n      : in  std_logic;
            data_in_v  : in  std_logic_vector(63 downto 0);
            data_out_v : out std_logic_vector(63 downto 0);
            s_in       : in  std_logic;
            s_out      : out std_logic;
            load_n     : in  std_logic;
            shift_n    : in  std_logic;
            sel_reg    : in  std_logic);
    end component;

    component mult
        port (
            clk   : in  std_logic;
            rst_n : in  std_logic;
            en    : in  std_logic;
            a_v   : in  std_logic_vector(31 downto 0);
            b_v   : in  std_logic_vector(31 downto 0);
            m_v   : out std_logic_vector(63 downto 0));
    end component;

    component mult_par4
        port (
            clk   : in  std_logic;
            rst_n : in  std_logic;
            en    : in  std_logic;
            a_v   : in  std_logic_vector(31 downto 0);
            b_v   : in  std_logic_vector(31 downto 0);
            m_v   : out std_logic_vector(63 downto 0));
    end component;

    component ring_svt
        generic (
            length : integer);
        port (
            Z : out std_logic);
    end component;

    component ring_lvt
        generic (
            length : integer);
        port (
            Z : out std_logic);
    end component;

    -- demultiplexed multipliers output
    signal general_m   : std_logic_vector(63 downto 0);
    -- multipliers input data containing both A and B
    signal general_a_b : std_logic_vector(63 downto 0);
    -- multipliers input data separated as A and B
    signal general_a, general_b   : std_logic_vector(31 downto 0);
    signal m0_v, m1_v, m2_v, m3_v : std_logic_vector(63 downto 0);  -- multipliers results
    signal en0, en1, en2, en3     : std_logic;  -- multipliers registers enable
    signal clk0, clk1, clk2, clk3 : std_logic;  -- multipliers registers clock

begin  -- arch

    -- component mapping
    data_gen_1 : data_gen
        port map (
            clk        => clk,
            rst_n      => rst_n,
            data_in_v  => general_m,
            data_out_v => general_a_b,
            s_in       => s_in,
            s_out      => s_out,
            load_n     => load_n,
            shift_n    => shift_n,
            sel_reg    => sel_reg);

    mult_0 : mult
        port map (
            clk   => clk0,
            rst_n => rst_n,
            en    => en0,
            a_v   => general_a,
            b_v   => general_b,
            m_v   => m0_v);

    mult_1 : mult_par4
        port map (
            clk   => clk1,
            rst_n => rst_n,
            en    => en1,
            a_v   => general_a,
            b_v   => general_b,
            m_v   => m1_v);

    mult_2 : mult
        port map (
            clk   => clk2,
            rst_n => rst_n,
            en    => en2,
            a_v   => general_a,
            b_v   => general_b,
            m_v   => m2_v);

    mult_3 : mult_par4
        port map (
            clk   => clk3,
            rst_n => rst_n,
            en    => en3,
            a_v   => general_a,
            b_v   => general_b,
            m_v   => m3_v);

    ring_svt_1 : ring_svt
        generic map (
            length => 437)
        port map (
            Z => Z_svt);

    ring_lvt_1 : ring_lvt
        generic map (
            length => 533)
        port map (
            Z => Z_lvt);

    -- combinatorial part
    general_a <= general_a_b(63 downto 32);
    general_b <= general_a_b(31 downto 0);

    -- one-hot decoder
    en0 <= '1' when sel = "00" else '0';
    en1 <= '1' when sel = "01" else '0';
    en2 <= '1' when sel = "10" else '0';
    en3 <= '1' when sel = "11" else '0';

    -- clock demux
    clk0 <= clk when sel = "00" else '0';
    clk1 <= clk when sel = "01" else '0';
    clk2 <= clk when sel = "10" else '0';
    clk3 <= clk when sel = "11" else '0';

    -- output mux
    with sel select
        general_m <=
            m0_v when "00",
            m1_v when "01",
            m2_v when "10",
            m3_v when "11",
            m0_v when others;

end arch;
```

## A. 2 data_gen.vhd

```
-- Title      : Data generator (32bit)
-- Project    :
-- File       : data_gen.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-15
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-- Description: This block contains the pseudo-random data generator,
--              as well as the cyclic adder and corresponding registers.
--              Both generated random and input data can be output
--              serially by the shift_reg block.
-- Copyright (c) 2006
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-15  1.0      schuster  Created

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;  -- used to add std_logic_vectors

-- designware implementation of the shift register
library DWARE, DW03;
use DWARE.DWpackages.all;
use DW03.DW03_components.all;

entity data_gen is
    port (
        clk        : in  std_logic;  -- clock
        rst_n      : in  std_logic;  -- active low async reset
        -- data coming from the multipliers (result)
        data_in_v  : in  std_logic_vector(63 downto 0);
        -- generated data (external or pseudo random)
        data_out_v : out std_logic_vector(63 downto 0);
        s_in       : in  std_logic;  -- serial input
        s_out      : out std_logic;  -- serial output
        load_n     : in  std_logic;  -- when low registers are loaded in parallel
        shift_n    : in  std_logic;  -- when low data is shifted
        -- select the source for data_out as well as for the data saved in registers
        sel_reg    : in  std_logic
        );
end data_gen;

architecture arch of data_gen is

    -- DesignWare shift register
    -- (generic and port names aligned with the instantiation below)
    component DW03_shftreg
        generic (
            length : integer);
        port (
            clk     : in  std_logic;
            s_in    : in  std_logic;
            p_in    : in  std_logic_vector(length-1 downto 0);
            shift_n : in  std_logic;
            load_n  : in  std_logic;
            p_out   : out std_logic_vector(length-1 downto 0));
    end component;

    -- local signals
    signal sum_v            : std_logic_vector(63 downto 0);  -- result of the adder
    signal p_out_v          : std_logic_vector(63 downto 0);  -- output of the register bank
    signal p_in_v           : std_logic_vector(63 downto 0);  -- input of the register bank
    signal rand_data_v      : std_logic_vector(63 downto 0);  -- output of the pseudo
                                                              -- random generator
    signal next_rand_data_v : std_logic_vector(63 downto 0);  -- next of rand_data

    -- local constants
    constant inst_length : natural := 64;  -- size of the shift register bank

begin  -- arch

    -- recursive cyclic adder
    adder :
    sum_v <= data_in_v + p_out_v;

    -- link the highest bit of p_out_v to s_out
    s_out <= p_out_v(63);

    -- instance of the shift_register
    -- based on a DW model
    shift_register : DW03_shftreg
        generic map (length => inst_length)
        port map (clk     => clk, s_in => s_in, p_in => p_in_v,
                  shift_n => shift_n, load_n => load_n,
                  p_out   => p_out_v);

    -- purpose: instantiation of mux1 and mux2
    -- type   : combinational
    -- inputs : sel_reg, sum_v, rand_data_v, p_out_v
    -- outputs: p_in_v, data_out_v
    muxes : process (sel_reg, sum_v, rand_data_v, p_out_v)
    begin  -- process
        case sel_reg is
            when '0' =>
                p_in_v     <= sum_v;
                data_out_v <= rand_data_v;
            when '1' =>
                p_in_v     <= rand_data_v;
                data_out_v <= p_out_v;
            when others => null;
        end case;
    end process;

    -- pseudo-random code generator
    -- the next bit is based on the
    -- taps 63, 61, 60, 0
    -- the state to avoid is 1...1
    pseudo_rand_logic :
    next_rand_data_v <= ((rand_data_v(63) xnor rand_data_v(61)) xnor
                         (rand_data_v(60) xnor rand_data_v(0))) &
                        rand_data_v(63 downto 1);

    -- purpose: insert shift register bank used for the pseudo code generation
    -- type   : sequential
    -- inputs : clk, rst_n, next_rand_data_v
    -- outputs: rand_data_v
    pseudo_rand_regs : process (clk, rst_n)
    begin  -- process
        if rst_n = '0' then                 -- asynchronous reset (active low)
            rand_data_v <= (others => '0');
        elsif clk'event and clk = '1' then  -- rising clock edge
            rand_data_v <= next_rand_data_v;
        end if;
    end process;

end arch;
```
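
The feedback expression of `pseudo_rand_logic` can be exercised on its own. The following is a minimal, simulation-only sketch: the entity name `lfsr_sketch_tb`, the function `lfsr_next` and the bound of 1000 steps are assumptions chosen for illustration and are not part of the original design. It checks that the all-ones word is a fixed point of the XNOR feedback (the "state to avoid") and that the generator, started from its all-zeros reset value, does not fall into that state during the simulated steps.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity lfsr_sketch_tb is
end lfsr_sketch_tb;

architecture sim of lfsr_sketch_tb is

    subtype state_t is std_logic_vector(63 downto 0);

    -- one step of the generator: same feedback expression as
    -- pseudo_rand_logic (taps 63, 61, 60, 0, shift towards bit 0)
    function lfsr_next (s : state_t) return state_t is
        variable fb : std_logic;
    begin
        fb := (s(63) xnor s(61)) xnor (s(60) xnor s(0));
        return fb & s(63 downto 1);
    end function lfsr_next;

begin

    check : process
        constant ALL_ONES : state_t := (others => '1');
        constant FIRST    : state_t := X"8000000000000000";
        variable s        : state_t := (others => '0');  -- reset value
    begin
        -- the all-ones word maps onto itself: this is the lock-up state
        assert lfsr_next(ALL_ONES) = ALL_ONES
            report "all-ones should be a fixed point" severity failure;

        -- from reset, the XNOR feedback inserts a '1' at the MSB
        s := lfsr_next(s);
        assert s = FIRST
            report "unexpected first state after reset" severity failure;

        -- the lock-up state is not reached from the reset value
        -- (1000 steps is an arbitrary illustration bound)
        for step in 1 to 1000 loop
            s := lfsr_next(s);
            assert s /= ALL_ONES
                report "generator reached the lock-up state" severity failure;
        end loop;

        report "LFSR sketch checks passed";
        wait;
    end process check;

end sim;
```

Because the feedback uses XNOR rather than XOR, the all-zeros word is a valid state, which is why the asynchronous reset of `rand_data_v` to zero is a safe starting point.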

## A. 3 mult.vhd

```
-- Title      : Simple multiplier (32bit)
-- Project    :
-- File       : mult.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-16
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-- Description: simple multiplier block with registers
-- Copyright (c) 2006
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-16  1.0      schuster  Created

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity mult is
    port (
        clk   : in  std_logic;                      -- clock
        rst_n : in  std_logic;                      -- active low async reset
        en    : in  std_logic;                      -- registers enable
        a_v   : in  std_logic_vector(31 downto 0);  -- input A
        b_v   : in  std_logic_vector(31 downto 0);  -- input B
        m_v   : out std_logic_vector(63 downto 0)   -- result
        );
end mult;

architecture arch of mult is

    -- generic RCA multiplier declaration
    component RCA
        generic (
            A_width : integer;   -- size of A
            B_width : integer);  -- size of B
        port (
            S : out std_logic_vector(A_width+B_width-1 downto 0);
            A : in  std_logic_vector(A_width-1 downto 0);
            B : in  std_logic_vector(B_width-1 downto 0));
    end component;

    -- local signals
    signal a_int_v : std_logic_vector(31 downto 0);
    signal b_int_v : std_logic_vector(31 downto 0);
    signal m_int_v : std_logic_vector(63 downto 0);

begin  -- arch

    -- combinatorial part
    -- infer the 32 bit multiplier
    mult_1 : RCA
        generic map (
            A_width => 32,
            B_width => 32)
        port map (
            S => m_int_v,
            A => a_int_v,
            B => b_int_v);

    -- sequential part
    -- purpose: input and output registers
    -- type   : sequential
    -- inputs : clk, rst_n, a_v, b_v, m_int_v
    -- outputs: a_int_v, b_int_v, m_v
    mult1_regs : process (clk, rst_n)
    begin  -- process mult1_regs
        if rst_n = '0' then                 -- asynchronous reset (active low)
            a_int_v <= (others => '0');
            b_int_v <= (others => '0');
            m_v     <= (others => '0');
        elsif clk'event and clk = '1' then  -- rising clock edge
            if en = '1' then
                a_int_v <= a_v;
                b_int_v <= b_v;
                m_v     <= m_int_v;
            end if;
        end if;
    end process mult1_regs;

end arch;
```

## A. 4 mult_par4.vhd

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity mult_par4 is
    port (
        clk   : in  std_logic;                      -- clock
        rst_n : in  std_logic;                      -- active low async reset
        en    : in  std_logic;                      -- registers enable
        a_v   : in  std_logic_vector(31 downto 0);  -- input A
        b_v   : in  std_logic_vector(31 downto 0);  -- input B
        m_v   : out std_logic_vector(63 downto 0)   -- result
        );
end mult_par4;

architecture arch of mult_par4 is

    -- generic RCA multiplier declaration
    component RCA
        generic (
            A_width : integer;   -- size of A
            B_width : integer);  -- size of B
        port (
            S : out std_logic_vector(A_width+B_width-1 downto 0);
            A : in  std_logic_vector(A_width-1 downto 0);
            B : in  std_logic_vector(B_width-1 downto 0));
    end component;

    signal count, next_count              : std_logic_vector(1 downto 0);
    signal A0, A1, A2, A3, B0, B1, B2, B3 : std_logic_vector(31 downto 0);
    signal S, S0, S1, S2, S3              : std_logic_vector(63 downto 0);

begin

    -- combinatorial part
    -- output multiplexer
    with count select
        S <= S0 when "00",
             S1 when "01",
             S2 when "11",
             S3 when "10",
             S3 when others;

    -- multiplexers counter incrementer 00 -> 01 -> 11 -> 10 ->
    next_count <= "01" when count = "00" else
                  "11" when count = "01" else
                  "10" when count = "11" else
                  "00";

    -- implementation of the four 32bit multipliers
    mult_par4_0 : RCA
        generic map (
            A_width => 32,
            B_width => 32)
        port map (
            S => S0,
            A => A0,
            B => B0);

    mult_par4_1 : RCA
        generic map (
            A_width => 32,
            B_width => 32)
        port map (
            S => S1,
            A => A1,
            B => B1);

    mult_par4_2 : RCA
        generic map (
            A_width => 32,
            B_width => 32)
        port map (
            S => S2,
            A => A2,
            B => B2);

    mult_par4_3 : RCA
        generic map (
            A_width => 32,
            B_width => 32)
        port map (
            S => S3,
            A => A3,
            B => B3);

    -- sequential part
    process (clk, rst_n)
    begin
        if rst_n = '0' then
            m_v   <= (others => '0');
            count <= "00";
            A0    <= (others => '0');
            A1    <= (others => '0');
            A2    <= (others => '0');
            A3    <= (others => '0');
            B0    <= (others => '0');
            B1    <= (others => '0');
            B2    <= (others => '0');
            B3    <= (others => '0');
        elsif clk = '1' and clk'event then
            if en = '1' then
                -- output registers
                m_v   <= S;
                -- increment state machine counter
                count <= next_count;
                -- input reg and demultiplexer for A
                if count = "00" then A0 <= a_v; end if;
                if count = "01" then A1 <= a_v; end if;
                if count = "11" then A2 <= a_v; end if;
                if count = "10" then A3 <= a_v; end if;
                -- input reg and demultiplexer for B
                if count = "00" then B0 <= b_v; end if;
                if count = "01" then B1 <= b_v; end if;
                if count = "11" then B2 <= b_v; end if;
                if count = "10" then B3 <= b_v; end if;
            end if;
        end if;
    end process;

end arch;
```
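
The counter that steers the input demultiplexer and the output multiplexer follows the sequence 00, 01, 11, 10. The sketch below, which is simulation-only and uses hypothetical names (`gray_count_sketch_tb`, `next_code`, `distance`), checks the two properties of that sequence: exactly one bit changes at each step, and the sequence returns to its reset value after four steps. Reading the single-bit change as a way to limit switching activity is an interpretation, not a statement taken from the listing.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity gray_count_sketch_tb is
end gray_count_sketch_tb;

architecture sim of gray_count_sketch_tb is

    subtype code_t is std_logic_vector(1 downto 0);

    -- same next-state table as next_count in mult_par4
    function next_code (c : code_t) return code_t is
    begin
        if    c = "00" then return "01";
        elsif c = "01" then return "11";
        elsif c = "11" then return "10";
        else                return "00";
        end if;
    end function next_code;

    -- number of bit positions in which two codes differ
    function distance (a, b : code_t) return natural is
        variable n : natural := 0;
    begin
        for i in code_t'range loop
            if a(i) /= b(i) then
                n := n + 1;
            end if;
        end loop;
        return n;
    end function distance;

begin

    check : process
        variable c : code_t := "00";  -- reset value of count
    begin
        -- walk twice around the sequence
        for step in 1 to 8 loop
            assert distance(c, next_code(c)) = 1
                report "more than one bit toggles" severity failure;
            c := next_code(c);
        end loop;

        -- eight steps are two full periods of four
        assert c = "00"
            report "sequence should have period 4" severity failure;

        report "Gray sequence sketch checks passed";
        wait;
    end process check;

end sim;
```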

## A. 5 RCA_generic_arch.vhd

```
-- Title      : Generic RCA Multiplier
-- Project    :
-- File       : RCA_generic_arch.vhd
-- Author     : <mtschuster@WS-3439>
-- Company    :
-- Created    : 2006-04-27
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-- Description: Simple Ripple Carry Array multiplier implementation
-- Copyright (c) 2006
-- Revisions  :
-- Date        Version  Author      Description
-- 2006-04-27  1.0      mtschuster  Created

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity RCA is
    generic (
        A_width : integer := 32;   -- size of A
        B_width : integer := 32);  -- size of B
    port (
        S : out std_logic_vector(A_width+B_width-1 downto 0);  -- output
        A : in  std_logic_vector(A_width-1 downto 0);          -- input A
        B : in  std_logic_vector(B_width-1 downto 0));         -- input B
end RCA;

architecture arch of RCA is

    type std_logic_array is  -- array of internal nodes
        array (B_width-1 downto 1) of std_logic_vector(A_width-1 downto 0);
    type enlarged_std_logic_array is  -- extended array of internal nodes
        array (B_width-1 downto 0) of std_logic_vector(A_width downto 0);

    -- local signals
    signal AandB     : std_logic_array;                       -- partial products
    signal S_partial : enlarged_std_logic_array;              -- internal sums
    signal Init_val  : std_logic_vector(A_width-1 downto 0);  -- first line values

begin

    -- combinatorial part
    first_cell :  -- implement first line
    Init_val     <= A when B(0) = '1' else (others => '0');
    S_partial(0) <= '0' & Init_val;
    S(0)         <= S_partial(0)(0);

    int_cell :  -- implement internal lines
    for i in 1 to B_width-1 generate
        S(i)         <= S_partial(i)(0);
        AandB(i)     <= A when B(i) = '1' else (others => '0');
        S_partial(i) <= ('0' & S_partial(i-1)(A_width downto 1)) + ('0' & AandB(i));
    end generate;

    last_cell :  -- copy the result to the output
    S(A_width+B_width-1 downto B_width) <= S_partial(B_width-1)(A_width downto 1);

    -- sequential part

end arch;
```
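
The row-by-row recurrence used in `RCA` (shift the previous partial sum right by one bit, add the next partial product, emit the least significant bit) can be checked exhaustively on a small operand width. The following simulation-only sketch is written under assumptions: the names `rca_sketch_tb` and `shift_add`, the use of `ieee.numeric_std` and the width `W = 4` are illustrative choices and do not appear in the original design. It mirrors the roles of `S_partial` (variable `row`) and `AandB` (variable `pp`) and compares the result with the `*` operator.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity rca_sketch_tb is
end rca_sketch_tb;

architecture sim of rca_sketch_tb is

    constant W : natural := 4;  -- small width so that all operands can be tried

    -- shift-and-add product, one row per bit of b
    function shift_add (a, b : unsigned(W-1 downto 0)) return unsigned is
        variable row : unsigned(W downto 0);      -- partial sum with carry bit
        variable pp  : unsigned(W-1 downto 0);    -- partial product a and b(i)
        variable s   : unsigned(2*W-1 downto 0);  -- assembled result
    begin
        -- first line
        if b(0) = '1' then pp := a; else pp := (others => '0'); end if;
        row  := '0' & pp;
        s(0) := row(0);

        -- internal lines
        for i in 1 to W-1 loop
            if b(i) = '1' then pp := a; else pp := (others => '0'); end if;
            row  := ('0' & row(W downto 1)) + ('0' & pp);
            s(i) := row(0);
        end loop;

        -- the upper half of the product is what remains in the last row
        s(2*W-1 downto W) := row(W downto 1);
        return s;
    end function shift_add;

begin

    check : process
        variable a, b : unsigned(W-1 downto 0);
    begin
        for x in 0 to 2**W-1 loop
            for y in 0 to 2**W-1 loop
                a := to_unsigned(x, W);
                b := to_unsigned(y, W);
                assert shift_add(a, b) = a * b
                    report "shift-and-add result differs from a*b"
                    severity failure;
            end loop;
        end loop;

        report "shift-and-add sketch matches a*b for all 4-bit operands";
        wait;
    end process check;

end sim;
```

The extra bit in `row` plays the same role as the additional column of `enlarged_std_logic_array`: it holds the carry out of each row addition, so no information is lost before the next shift.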

## A. 6 ring_svt.vhd

```
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-04-27  1.0      schuster  Created

library ieee;
use ieee.std_logic_1164.all;

entity ring_svt is
    generic (
        length : integer := 100);  -- default inverter chain length
    port (
        Z : out std_logic          -- ring oscillator output
        );
end ring_svt;

architecture arch of ring_svt is

    -- local signal
    signal internal_nets : std_logic_vector(length-1 downto 0);

    -- declare the technology inverter
    component IVSVTX1
        port (
            Z : out STD_LOGIC;  -- out
            A : in  STD_LOGIC   -- in
            );
    end component;

begin  -- arch

    -- connect each inverter with the following one
    invs : for i in 0 to length-2 generate
        IVSVTX1_gen : IVSVTX1
            port map (
                Z => internal_nets(i),
                A => internal_nets(i+1));
    end generate invs;

    -- connect last inverter with the first
    IVSVTX1_last : IVSVTX1
        port map (
            Z => internal_nets(length-1),
            A => internal_nets(0));

    -- output the first node
    Z <= internal_nets(0);

end arch;
```
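
As a reminder of the textbook relation for such a ring (it is not derived in this listing): a closed chain of $N$ inverters oscillates only when $N$ is odd, and if each inverter contributes an average delay $t_d$, one period requires a transition to travel twice around the ring. The symbols $N$, $t_d$, $T_{osc}$ and $f_{osc}$ are introduced here for illustration, with $N$ corresponding to the generic `length`:

$$
T_{osc} = 2\,N\,t_d \qquad\Longrightarrow\qquad t_d = \frac{1}{2\,N\,f_{osc}}
$$

The default value of 100 for `length` is even, whereas the values passed from `top.vhd` (437 for `ring_svt`, 533 for `ring_lvt`) are odd, as this relation requires.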

## A. 7 top_tb.vhd

```
-- Title      : Testbench for design "top"
-- Project    :
-- File       : top_tb.vhd
-- Author     : <schuster@zebra>
-- Company    :
-- Created    : 2006-02-17
-- Last update: 2006-09-22
-- Platform   :
-- Standard   : VHDL'93
-- Description: Testbench for design "top"
-- Copyright (c) 2006
-- Revisions  :
-- Date        Version  Author    Description
-- 2006-02-17  1.0      schuster  Created

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_textio.all;  -- write std_logic signal to line
use ieee.std_logic_unsigned.all;
library std;
use std.textio.all;  -- output data to std_output or text file

entity top_tb is
end top_tb;

architecture top_func_test of top_tb is

    -- component declaration
    component top
        port (
            clk     : in  std_logic;  -- clock
            rst_n   : in  std_logic;  -- active low async reset
            s_in    : in  std_logic;  -- serial data in
            s_out   : out std_logic;  -- serial data out
            load_n  : in  std_logic;  -- registers parallel load when low
            shift_n : in  std_logic;  -- shift data when low
            -- select the source for data_out as well as for data saved in registers
            sel_reg : in  std_logic;
            sel     : in  std_logic_vector(1 downto 0);  -- select multiplier
            Z_svt   : out std_logic;   -- SVT ring oscillator
            Z_lvt   : out std_logic);  -- LVT ring oscillator
    end component;

    -- component ports
    signal rst_n   : std_logic;  -- active low async reset
    signal s_in    : std_logic;  -- serial data in
    signal s_out   : std_logic;  -- serial data out
    signal load_n  : std_logic;  -- registers parallel load when low
    signal shift_n : std_logic;  -- shift data when low
    signal sel_reg : std_logic;  -- registers select
    signal sel     : std_logic_vector(1 downto 0);  -- select multiplier
    signal Z_svt   : std_logic;  -- SVT ring oscillator
    signal Z_lvt   : std_logic;  -- LVT ring oscillator

    -- clock
    signal clk : std_logic := '1';  -- clock

    -- constants
    constant HALFCLOCKPERIOD : time := 8 ns;

begin  -- top_func_test

    -- component instantiation
    DUT : top
        port map (
            clk     => clk,
            rst_n   => rst_n,
            s_in    => s_in,
            s_out   => s_out,
            load_n  => load_n,
            shift_n => shift_n,
            sel_reg => sel_reg,
            sel     => sel,
            Z_svt   => Z_svt,
            Z_lvt   => Z_lvt);

    -- clock generation
    clk <= not clk after HALFCLOCKPERIOD;

    -- main testbench process
    main : process

    -- test status flags used by the procedures below
    variable pass_v        : boolean := true;  -- status of the current test
    variable global_pass_v : boolean := true;  -- status of all tests so far

    -- procedures
    -- check if specified test pass
    procedure check_errors (test_name : in string) is
    begin
        if pass_v then
            report test_name & "_-__ALL PASS__-_";
        else
            report test_name & "_____FAILED_____";
        end if;
        global_pass_v := global_pass_v and pass_v;
    end procedure check_errors;

    -- check if all tests pass
    procedure global_check is
    begin
        if global_pass_v then
            report " _-__ALL TESTS PASS -> OK! _-_--";
        else
            report " -__ONE OR MORE TEST FAILED__-_";
        end if;
    end procedure global_check;
```

    --write and read specific patterns to/from the shift register
    procedure serial_read_write is
    --four different patterns to test
    constant SERIAL_DATA_IN0 : std_logic_vector(63 downto 0) := (others => '0');
    constant SERIAL_DATA_IN1 : std_logic_vector(63 downto 0) := (others => '1');
    constant SERIAL_DATA_IN2 : std_logic_vector(63 downto 0) :=
        X"5555555555555555"; --"0101010101010...1010101010101010101";
    constant SERIAL_DATA_IN3 : std_logic_vector(63 downto 0) :=
            X"AAAAAAAAAAAAAAAA" ; --"1010101010101...0101010101010101010";
    variable serial_data_out : std_logic_vector(63 downto 0); -- read data
                                    -- from the registers
    begin
    -- init
    rst_n <= '0';
    load_n <= '1';
    shift_n <= '1';
    s_in <= '0';
    sel_reg <= '1';
    sel <= "00";
    pass_v := true;
    wait for 9*HALFCLOCKPERIOD;
    wait until clk='0';
    -- write first pattern
    shift_n <= '0';
    for i in 63 downto 0 loop
        s_in <= SERIAL_DATA_IN0(i );
        wait until clk = '0';
    end loop; -- i
    --write second pattern and read first pattern
    for i in 63 downto 0 loop
        s_in <= SERIAL_DATA_IN1(i);
        serial_data_out(i) := s_out;

    wait until clk = '0';
    end loop; -- i
--check first pattern
match_v := serial_data_out = SERIAL_DATA_IN0;
pass_v := pass_v and match_v;
assert serial_data_out = SERIAL_DATA_IN0
report "Error on SERIAL_DATA_IN0" severity error;
--ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
if DEBUGMODE then
report "SERIAL_DATA_IN0 read!" severity note;
end if;
--write third pattern and read second pattern
for i in 63 downto 0 loop
s_in <= SERIAL_DATA_IN2(i);
serial_data_out(i) := s_out;
wait until clk = '0';
end loop; -- i
--check second pattern
match_v := serial_data_out = SERIAL_DATA_IN1;
pass_v := pass_v and match_v;
assert serial_data_out = SERIAL_DATA_IN1
report "Error on SERIAL_DATA_IN1" severity error;
if DEBUGMODE then
report "SERIAL_DATA_IN1 read!" severity note;
end if;
--write fourth pattern and read the third pattern
for i in 63 downto 0 loop
s_in <= SERIAL_DATA_IN3(i);
serial_data_out(i) := s_out;
wait until clk = '0';
end loop; -- i
--check the third pattern
match_v := serial_data_out = SERIAL_DATA_IN2;
pass_v := pass_v and match_v;
assert serial_data_out = SERIAL_DATA_IN2
report "Error on SERIAL_DATA_IN2" severity error;
if DEBUGMODE then
report "SERIAL_DATA_IN2 read!" severity note;
end if;
--read the fourth pattern
for i in 63 downto 0 loop
serial_data_out(i) := s_out;
wait until clk = '0';
end loop; -- i
--check the fourth pattern
match_v := serial_data_out = SERIAL_DATA_IN3;
pass_v := pass_v and match_v;
assert serial_data_out = SERIAL_DATA_IN3


            report "Error on SERIAL_DATA_IN3" severity error;
        if DEBUGMODE then
            report "SERIAL_DATA_IN3 read!" severity note;
        end if;
    end serial_read_write;
    -- check that the shift register can be reset from the pseudo random generator
    procedure check_shift_reg_rst is
        variable serial_data_out : std_logic_vector(63 downto 0);
    begin -- check_shift_reg_rst
        -- init
        rst_n <= '0';
        load_n <= '1';
        shift_n <= '1';
        s_in <= '1';
        sel_reg <= '1'; -- from pseudo random generator
        sel <= "00";
        pass_v := true;
        wait for 9*HALFCLOCKPERIOD;
        wait until clk ='0';
        -- load parallel zeros to shift registers
        load_n <= '0';
        wait until clk = '1';
        wait until clk = '0';
        -- output shift registers data serially
        for i in 63 downto 0 loop
            serial_data_out(i) := s_out;
            wait until clk = '0';
        end loop; -- i
    -- check that data is zeroed
        match_v := serial_data_out = X"0000000000000000";
        pass_v := pass_v and match_v;
        assert match_v report "Error on shift register reset" severity error;
        --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
    end check_shift_reg_rst;
    -- purpose: read the first 100 and 200 pseudo random generated data
    procedure read_rand is
        variable serial_data_out : std_logic_vector(63 downto 0);
    begin -- read_rand
        -- init
        rst_n <= '0';
        load_n <= '1';
        shift_n <= '1';
        s_in <= '1';
        sel_reg <= '1'; -- from pseudo random generator
        sel <= "00";
        pass_v := true;
        wait for 9*HALFCLOCKPERIOD;
        wait until clk = '0';
        sel_reg <= '1'; -- pseudo random data to shift_reg
        load_n <= '0'; -- ready to load the zeroed vector
        wait until clk = '0';
        rst_n <= '1'; -- clear the reset
        wait until clk = '0';
        wait for 200*HALFCLOCKPERIOD;
        load_n <= '1'; -- switch to serial mode to extract data
        shift_n <= '0';
        -- output shift registers data serially
        for i in 63 downto 0 loop
            serial_data_out(i) := s_out;
            wait until clk = '0';
        end loop; -- i
        -- check data
        match_v := serial_data_out = X"2F8D072F8D0BD0BD";
        pass_v := pass_v and match_v;
        assert match_v report "Error on read random data " severity error;
        --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
        load_n <= '0'; -- reselect parallel input to shift_regs
        wait for 72*HALFCLOCKPERIOD;
        load_n <= '1'; -- switch to serial mode to extract data
        shift_n <= '0';
        -- output shift registers data serially
        for i in 63 downto 0 loop
            serial_data_out(i) := s_out;
            wait until clk = '0';
        end loop; -- i
        -- check data
        match_v := serial_data_out = X"7DF14972F14972F1";
        pass_v := pass_v and match_v;
        assert match_v report "Error on read random data " severity error;
        --ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
    end read_rand;

-- purpose: reset, read external data, multiply, output result

    procedure mult_ext_data (
    -- the number corresponding to the tested multiplier
    constant mult_number : in integer) is
    variable serial_data_out : std_logic_vector(63 downto 0);
    -- input multiplier data
    constant DATA_IN : std_logic_vector(63 downto 0) := X"AC0E5F8EAC0E5F8E";
    -- expected result = DATA_IN + DATA_IN(63 downto 32)*DATA_IN(31 downto 0)
    constant EXPECTED_DATA_OUT : std_logic_vector(63 downto 0) :=
        DATA_IN + (DATA_IN(63 downto 32)*DATA_IN(31 downto 0));
    begin -- mult_ext_data
    -- init
    rst_n <= '0';
    load_n <= '1';
    shift_n <= '1';
    s_in <= '1';
    sel_reg <= '1'; -- from regs to mults
    case mult_number is
    when 0 => sel <= "00";
    when 1 => sel <= "01";
    when 2 => sel <= "10";
    when 3 => sel <= "11";
    when others => sel <= "XX";
    end case;
pass_v := true;
wait for 9*HALFCLOCKPERIOD;
wait until clk ='0';
sel_reg <= '1'; -- pseudo random data to shift_reg
load_n <= '0'; -- ready to load the zeroed vector
wait until clk = '0';
rst_n <= '1'; -- clear the reset
wait until clk = '0';
shift_n <= '0'; -- enter serial data
load_n <= '1';
for i in 63 downto 0 loop
s_in <= DATA_IN(i);
wait until clk = '0';
end loop; -- i
shift_n <= '1';
load_n <= '1';
wait until clk = '0';
sel_reg <= '0';
wait until clk = '0';
-- delay if parallel 4 implementation is used
if mult_number = 1 or mult_number = 3 then
wait for 6*HALFCLOCKPERIOD;
end if;
load_n <= '0';
wait until clk = '0';


load_n <= '1'; -- switch to serial mode to extract data
shift_n <= '0';
-- output shift registers data serially
for i in 63 downto 0 loop
serial_data_out(i) := s_out;
wait until clk = '0';
end loop; -- i
-- check data
match_v := serial_data_out = EXPECTED_DATA_OUT;
pass_v := pass_v and match_v;
assert match_v report "Error on multiply external data " severity error;
--ieee.std_logic_textio.write(d_l,serial_data_out); writeline(output, d_l);
end mult_ext_data;
-- purpose: multiply and add pseudo random generated data
procedure random_mac (
-- the number corresponding to the tested multiplier
constant mult_number : in integer) is
variable serial_data_out : std_logic_vector(63 downto 0);
begin -- random_mac
-- init
rst_n <= '0';
load_n <= '0';
-- store incoming data
shift_n <= '1';
s_in <= '0';
sel_reg <= '1'; -- from rand to regs
case mult_number is
when 0 => sel <= "00";
when 1 => sel <= "01";
when 2 => sel <= "10";
when 3 => sel <= "11";
when others => sel <= "XX";
end case;
pass_v := true;
wait for 9*HALFCLOCKPERIOD;
wait until clk ='0';
sel_reg <= '0'; -- pseudo random data to mult
load_n <= '0'; -- ready to load sum to regs
rst_n <= '1'; -- clear the reset
wait until clk = '0';
wait for 4*HALFCLOCKPERIOD;
-- delay if parallel 4 implementation is used
if mult_number = 1 or mult_number = 3 then
wait for 6*HALFCLOCKPERIOD;
end if;
wait for 1000*HALFCLOCKPERIOD;

load_n <= '1'; -- switch to serial mode to extract data
shift_n <= '0';
-- output shift registers data serially
for i in 63 downto 0 loop
serial_data_out(i) := s_out;
wait until clk = '0';
end loop; -- i
-- check data
match_v := serial_data_out = X"14C9836842DEF744";
pass_v := pass_v and match_v;
assert match_v report "Error on multiply random data " severity error;
--ieee.std_logic_textio.write(d_l, serial_data_out); writeline(output, d_l);
end random_mac;
--test sequence

begin
check_shift_reg_rst;
check_errors("Shift Registers Reset: ");
serial_read_write;
check_errors("Serial Read/Write: ");
read_rand;
check_errors("Read Rand: ");
mult_ext_data(0);
check_errors("Multiply external data on mult0: ");
mult_ext_data(1);
check_errors("Multiply external data on mult1: ");
mult_ext_data(2);
check_errors("Multiply external data on mult2: ");
mult_ext_data(3);
check_errors("Multiply external data on mult3: ");
random_mac(0);
check_errors("Add random multiplied data for mult0: ");
random_mac(1);
check_errors("Add random multiplied data for mult1: ");
random_mac(2);
check_errors("Add random multiplied data for mult2: ");
random_mac(3);
check_errors("Add random multiplied data for mult3: ");
global_check;
wait;
end process;
end top_func_test;
configuration top_tb_top_func_test_cfg of top_tb is
for top_func_test
end for;
end top_tb_top_func_test_cfg;

# Appendix B

# Synopsys compilation scripts

## B.1 compile_top.tcl

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Global variables

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
set BIN ./bin/
set DB ./db/
set PAR ./par/src/
set GATE ./gate/
set WORK ./work/
set SRC ./vhdl/
set reports_path ./reports/
set design_name top_stm090
\#ignore case to avoid problem on activity annotation
set find_ignore_case true
set suppress_errors "VHDL-2285 OPT-150 TIM-111 TIM-112"
\#remove the limit for high fanout nets
set high_fanout_net_threshold 0
\#define the working path
define_design_lib work -path \$WORK
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Bus name variables

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
set bus_naming_style %s(%d)
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Remove previous designs

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#remove_constraint -all
\#remove_design -all
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Read design


\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
source read_vhdl.tcl
\#link design with library (i.e. load required libraries)
link
\#Uniquify the multiplier blocks
set uniquify_naming_style %s_%d
\#rename the 4 main multipliers
uniquify -cell {mult_0 mult_1 mult_2 mult_3} -base_name multd
\#rename the remaining design
uniquify
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Create clock 62.5MHz

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#main clock
create_clock clk -period 16
\#generated clock
create_generated_clock -source clk -name dclk0 -divide_by 1 mult_0/clk
create_generated_clock -source clk -name dclk1 -divide_by 1 mult_1/clk
create_generated_clock -source clk -name dclk2 -divide_by 1 mult_2/clk
create_generated_clock -source clk -name dclk3 -divide_by 1 mult_3/clk
\#allow maximum delay at input and output
set_input_delay 0 -clock clk [all_inputs]
set_output_delay 0 -clock clk [all_outputs]
\#next lines are there to avoid TIM-111 warning
set_input_delay 0 -clock dclk0 mult_0/clk
set_input_delay 0 -clock dclk1 mult_1/clk
set_input_delay 0 -clock dclk2 mult_2/clk
set_input_delay 0 -clock dclk3 mult_3/clk
\#set_propagated_clock automatically sets the correct
\#set_clock_latency value for the generated clocks
set_propagated_clock [all_clocks]
\#report_clock -skew
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Set Loads

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
set_drive [drive_of CORE90GPSVT_NomLeak.db:CORE90GPSVT/IVSVTX1/Z] [all_inputs]
\#2.004893
set_load [load_of CORE90GPSVT_NomLeak.db:CORE90GPSVT/IVSVTX1/A] [all_outputs]
\#0.002090
\#put a high load on s_out because it
\#will drive an analog pad
set_load 10 s_out

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#\# Design Constraints \#\#
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
set_max_area 0
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#\# Compile Design \#\#
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\# Write unmapped top
current_design top
write -hierarchy -output \${DB}unmapped_top.db
\# characterize the 4 multipliers
set target_library CORE90GPSVT_NomLeak.db
characterize -constraint {mult_0 mult_1 mult_2 mult_3}
\# compile mult_0 and mult_1 with SVT
set target_library CORE90GPSVT_NomLeak.db
current_design multd_0
compile
current_design multd_1
compile
\# compile mult_2 and mult_3 with LVT
set target_library CORE90GPLVT_NomLeak.db
current_design multd_2
compile
current_design multd_3
compile
\#set dont_touch on the compiled multipliers
current_design top
set_dont_touch {mult_0 mult_1 mult_2 mult_3 ring_svt_1 ring_lvt_1}
\# the rest of the design will be compiled with
\# the SVT technology
set target_library CORE90GPSVT_NomLeak.db
compile -map_effort high
\#show which DW implementation has been selected
report_resources -hier > \${reports_path}\${design_name}.syn_rprh
\#remove unconnected ports in DW designs
set_dont_touch {mult_0 mult_1 mult_2 mult_3} false
remove_unconnected_ports [get_cells -hier *]
remove_unconnected_ports -blast_buses [get_cells -hier *]
set_dont_touch {mult_0 mult_1 mult_2 mult_3} true

\#\# Fix hold violations \#\#
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
set_fix_hold [all_clocks]

\#recompile top only
compile -inc
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#\#Write Mapped design\#\#
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
change_names -rules vhdl -hierarchy
write -hierarchy -format vhdl -output \${GATE}top.vhd
write_sdf \${GATE}top.sdf
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Annotate Activity

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#sh cp msim/modelsim.ini ./modelsim.ini
sh cp \${SRC}top_tb.vhd \${GATE}top_tb.vhd
sh vsim -c -do \${BIN}power_sdf.do
read_saif -unit ns -scale 1 -instance top_tb/dut -input \${GATE}back.saif
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Save reports in the defined directory

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
report_area > ${reports_path}${design_name}.syn_rpa
check_design > ${reports_path}${design_name}.syn_rpd
report_timing > ${reports_path}${design_name}.syn_rpt
report_hierarchy > ${reports_path}${design_name}.syn_rph
report_resources > ${reports_path}${design_name}.syn_rpr
report_cell > ${reports_path}${design_name}.syn_rpc
report_power -net -cell -flat -include_input_nets > ${reports_path}${design_name}.syn_rpp
report_saif -flat > ${reports_path}${design_name}.syn_rps
report_constraint > ${reports_path}${design_name}.syn_rpn
report_reference -nosplit > ${reports_path}${design_name}.syn_rpf
report_clock -skew > ${reports_path}${design_name}.syn_rpk
report_power -include_input_nets -hier -hier_level 1 > ${reports_path}${design_name}.syn_rpph


\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Save design

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
change_names -rules verilog -hierarchy
set bus_naming_style {%s[%d]}

write -hierarchy -output \${DB}top_gate.db
write -hierarchy -format verilog -output \${PAR}top_gate.v
write_sdf \${PAR}top_gate.sdf
write_sdc \${PAR}top_gate.sdc
quit

## B.2 read_vhdl.tcl

analyze -f vhdl vhdl/ring_svt.vhd
analyze -f vhdl vhdl/ring_lvt.vhd
analyze -f vhdl vhdl/RCA_generic_arch.vhd
analyze -f vhdl vhdl/mult.vhd


analyze -f vhdl vhdl/mult_par4.vhd
analyze -f vhdl vhdl/data_gen.vhd
analyze -f vhdl vhdl/top.vhd
elaborate top -update

## B.3 power_sdf.do

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

# Script to compute the switching activity with ModelSim

# Schuster Christian, June 2003, IMT Neuchatel

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#execute this script with vsim -c -do power_sdf.do
\#needed files are: top.sdf, top.vhd, top_tb.vhd

# Testbench path and file names

set work_dir /scratch/schuster/stm090_svt_lvt
set testbench top_tb
set bench_path \$testbench/dut
set dir gate
set sdffname \$dir/top.sdf

# Time and simulation settings

set time_scale ps
set backsaif_basetime 1E-12
set init_time 19592000
\#19592ns
set evaluation_time 36704000
\#36704ns
\#compile design+testbench
vcom -93 \$dir/top.vhd -work \$work_dir
vcom -93 \$dir/top_tb.vhd -work \$work_dir

# Use the same path separator as Synopsys SAIF file

set PathSeparator /
set DatasetSeparator :
vsim +notimingchecks -sdftyp \$bench_path=\$sdffname -foreign "dpfli_init /synopsys/v2004.06/auxx/syn/power/dpfli/lib-sparcOS5/dpfli.so" -lib \$work_dir -t \$time_scale \$testbench
\#+notimingchecks is used to avoid unreal problems on sdf annotation and verilog model
\#initialize ring oscillators
force top_tb/dut/ring_svt_1/z_port 0 0 -c 16
force top_tb/dut/ring_lvt_1/z_port 0 0 -c 16

# Select toggle region

set_toggle_region \$bench_path

# Init the circuit

run \$init_time



# Start switching annotation

toggle_start

# Execute testbench

run \$evaluation_time

# Stop switching annotation

toggle_stop

# Write back annotation SAIF

toggle_report \$dir/back.saif \$backsaif_basetime \$bench_path
quit
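
The time settings in power_sdf.do are given in picoseconds, with the nanosecond values noted in comments. As a rough cross-check, the small Python sketch below converts them and relates them to the clock. It is only an illustration and not part of the original tool flow: the helper names are hypothetical, and the 16 ns clock period is an assumption taken from `create_clock clk -period 16` in compile_top.tcl (twice the 8 ns `HALFCLOCKPERIOD` of the testbench).

```python
# Illustrative helper, not part of the original scripts.
# Assumption: clock period of 16 ns, as set by "create_clock clk -period 16".
CLOCK_PERIOD_NS = 16

# Values copied from power_sdf.do (time_scale is ps).
INIT_TIME_PS = 19592000
EVALUATION_TIME_PS = 36704000


def ps_to_ns(time_ps):
    """Convert a time expressed in picoseconds to nanoseconds."""
    return time_ps / 1000


def clock_cycles(time_ps, period_ns=CLOCK_PERIOD_NS):
    """Number of clock periods covered by a simulation window."""
    return ps_to_ns(time_ps) / period_ns


def clock_frequency_mhz(period_ns=CLOCK_PERIOD_NS):
    """Clock frequency in MHz for a period given in nanoseconds."""
    return 1000 / period_ns


if __name__ == "__main__":
    print("clock frequency [MHz]:", clock_frequency_mhz())
    print("init window [ns]:", ps_to_ns(INIT_TIME_PS))
    print("evaluation window [ns]:", ps_to_ns(EVALUATION_TIME_PS))
    print("evaluation window [clock cycles]:", clock_cycles(EVALUATION_TIME_PS))
```

Under these assumptions the activity annotation window (between `toggle_start` and `toggle_stop`) spans a whole number of clock periods.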

# Appendix C

# SoC Encounter P&R scripts

## C.1 main.tcl

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Main file for ENCOUNTER SOC4.1

## CSch, July 2006, version 1.1

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

# Required Files:

# Scripts:

\#- top.conf

# - IO_Filler.tcl

# - create_global_net.tcl

# - pwr.tcl

# - do_power_domains

# - followPin.tcl

# - top.ctstch

# - output_nets.tcl

# - place_output_bufs.tcl

\#- fix_drc_errors.tcl

# Data:

# - ioplace.io

# - LEF/IO90GPHVT_BASIC_50A_7M2T_PGC.lef (ALL external layers of COREVDD1V0 pin need
#   the line "CLASS CORE ;" in order to be routed by sroute, diff file present)
    
# - LEF/IO90GPHVT_3V3_50A_7M2T_PGC.lef (diff file present)

# Src:

# - top_io.v (from cat src/top_gate.v data/io_wrapper.v>src/top_io.v)

# - top_gate.sdc (change get_pins -> get_pins -hierarchy)

Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Load Design"
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"

## uniquify the netlist (shell to execute before an encounter session)

### uniquifyNetlist -top

\#In my case netlist was already unique!


set CMOS090GP_DIR /designkit/cmos090_50a
set scpts scripts/
set data data/
setRCFactor -cap 1.1
\#set the size of the smallest displayed module -> display all
setPreference MinFPModuleSize 1
\#load the design + io + corners
loadConfig \${scpts}top.conf
\#load footprints used for timing driven analysis
loadfootprint -infile \${CMOS090GP_DIR}/SocEncounter_cmos090gp_2.2/cmos090gp_50a.cfp
setInvFootPrint IVSVTX1
setBufFootPrint BFSVTX1
\#setDelayFootPrint DLY1SVTX2
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Create Floorplan "
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"

### define floorplan

\#floorPlan -r 1 0.7 40 40 40 40
\#Fixed dimensions allow io_corners to be aligned with the 0.56um grid
floorPlan -s 800.28 800.28 50.08 50.08 50.08 50.08

### Add IO filler

source \${scpts}IO_Filler.tcl
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Create PowerDomains and "
Puts "\#\#\# Place Block(s)"
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
\#create the 5 separate power domains
source \${scpts}do_power_domains.tcl
\#place ioref_comp instance needed for the 3V3 IOs
placeInstance ioref_comp 732.88 204.04 R180
addHaloToBlock 31.5 20 20 30 -allBlock
\#connects all global nets
source \${scpts}create_global_net.tcl
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Create power stripes "
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"



### add power ring + stripes

source \${scpts}pwr.tcl

## std-cell follow pin

source \${scpts}followPin.tcl

# save floor-plan

saveFPlan ./fplan.fp

# check floor-plan

verifyGeometry
saveDesign ./top.fp.enc
\#source ./top.fp.enc
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Place Design ..."
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
\#exec mkdir Timing
source \${scpts}/place_output_bufs.tcl
amoebaPlace -timingdriven \
-doCongOpt \
-highEffort \
-ignoreScan \
-ignoreSpare \
-QA \
-slack init_virtual.slk
saveDesign ./top.place.enc
\#source ./top.place.enc
checkPlace
buildTimingGraph
timeDesign -preCTS -outDir ./Timing/PLACE.timing
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Optimization..."
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
setOptMode -highEffort \
-fixFanoutLoad \
-maxDensity 0.8 \
-reclaimArea \
-setupTargetSlack 0.0 \
-holdTargetSlack 0.0
optDesign -preCTS -setup -drv


saveDesign ./top.IPO.enc

# source ./top.IPO.enc

timeDesign -preCTS -outDir ./Timing/IPO.timing
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Run CTS..."
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
\#clock with different libraries (SVT/LVT) depending on power domains
setCTSMode -fence -MSMV
\#specify clocktree file
specifyClockTree -clkfile \${scpts}top.ctstch
\#create report directory
createSaveDir top_cts
\#do clock tree synthesis
ckSynthesis -rguide top_cts/top_cts.guide -report top_cts/top_cts.ctsrpt
saveClockNets -output top_cts/top_cts.ctsntf
saveNetlist top_cts/top_cts.v
savePlace top_cts/top_cts.place
saveDesign ./top.POST_CTS.enc
\#source ./top.POST_CTS.enc
setAnalysisMode -clockTree
buildTimingGraph
timeDesign -postCTS -outDir ./Timing/POST_CTS.timing
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Optimization post CTS..."
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
setOptMode -highEffort \
-fixFanoutLoad \
-maxDensity 0.8 \
-reclaimArea \
-setupTargetSlack 0.0 \
-holdTargetSlack 0.0
optDesign -postCTS
saveDesign ./top.POST_CTS_IPO.enc

# source ./top.POST_CTS_IPO.enc

timeDesign -postCTS -outDir ./Timing/POST_CTS_IPO.timing
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Nanoroute.... "
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"

# Filler Cell between std-cells

addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PDCORE
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD0
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD1
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD2
addFiller -cell FILLERCELL64 FILLERCELL32 FILLERCELL16 FILLERCELL8 FILLERCELL4 FILLERCELL2 FILLERCELL1 -prefix FILLER -powerDomain PD3

# connect all new std-cell instances to vdd/gnd

source \${scpts}create_global_net.tcl
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Route clocks first

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
setAttribute -net @clock -weight 5 -avoid_detour true -bottom_preferred_routing_layer 4 -preferred_extra_space 1
selectNet -allDefClock
setNanoRouteMode -quiet routeWithTimingDriven false
setNanoRouteMode -quiet envNumberProcessor 1
setNanoRouteMode -quiet route_selected_net_only true
globalDetailRoute
saveDesign ./top.POST_CLK_ROUTE.enc
\#source ./top.POST_CLK_ROUTE.enc
\#allow wide routing for s_out Z_lvt Z_svt
convertNetToSNet -nets {s_out Z_lvt Z_svt}
source \${scpts}/output_nets.tcl
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Route All Nets

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
setNanoRouteMode -quiet routeFixPrewire true
setNanoRouteMode -quiet route_selected_net_only false
setNanoRouteMode -quiet routeWithTimingDriven false
setNanoRouteMode -quiet routeTdrEffort 1
setNanoRouteMode -quiet drouteFixAntenna true
setNanoRouteMode -quiet routeWithSiDriven true
setNanoRouteMode -quiet routeSiLengthLimit 200
setNanoRouteMode -quiet routeSiEffort normal
globalDetailRoute
\#fix errors found with calibre DRC check


source \${scpts}/fix_drc_errors.tcl
saveDesign ./top.POST_ROUTE.enc
\#source ./top.POST_ROUTE.enc
\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#

## Check for violations

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
clearDrc
verifygeometry -allowDiffCellViols
verifyConnectivity -type regular -error 1000 -warning 50
verifyProcessAntenna
reportLeakagePower
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
Puts "\#\#\#"
Puts "\#\#\# Create abstract views : verilog / LEF / DEF / GDS / SDF ..."
Puts "\#\#\#"
Puts "\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#"
\#exec mkdir RESULTS
\#\#\#\#\#\#\#\#\#

### verilog

\#\#\#\#\#\#\#\#\#
saveNetlist ./RESULTS/top.v
\#\#\#\#\#\#\#\#\#

### lef

\#\#\#\#\#\#\#\#\#
lefOut ./RESULTS/top.lef -stripePin -PGpinLayers 6 7
\#\#\#\#\#\#\#\#

### def

\#\#\#\#\#\#\#\#
defOut -floorplan -routing ./RESULTS/top.def
\#\#\#\#\#\#\#\#

### gds

\#\#\#\#\#\#\#
streamOut ./RESULTS/top_with_io.gds \
-mapFile \${CMOS090GP_DIR}/SocEncounter_cmos090gp_2.2/gds2_cmos90.map \
-libName DesignLib \
-structureName top_with_io \
-stripes 1 \
-units 2000 \
-mode ALL
\#\#\#\#\#\#\#\#

### sdf

\#\#\#\#\#\#\#
setExtractRCMode -detail
extractRC

## C.2 top.conf


# 

# Input configuration file

# 

\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#\#
\#set designkit path
set CMOS090GP_DIR /designkit/cmos090_50a
global rda_Input
\#set cwd ./work
set rda_Input(import_mode) {-treatUndefinedCellAsBbox 0 -verticalRow 0
-keepEmptyModule 1 }
set rda_Input(ui_netlist) "src/top_io.v"
set rda_Input(ui_netlisttype) {Verilog}
set rda_Input(ui_ilmlist) {}
set rda_Input(ui_settop) {1}
set rda_Input(ui_topcell) {top_with_io}
set rda_Input(ui_celllib) {}
set rda_Input(ui_iolib) {}
set rda_Input(ui_areaiolib) {}
set rda_Input(ui_blklib) {}
set rda_Input(ui_kboxlib) {}
set rda_Input(ui_gds_file) {}
set rda_Input(ui_timelib,min) "
\${CMOS090GP_DIR}/CORE90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPLVT_Best.lib
\${CMOS090GP_DIR}/CORE90GPSVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPSVT_Best.lib
\${CMOS090GP_DIR}/CORX90GPLVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPLVT_Best.lib
\${CMOS090GP_DIR}/CORX90GPSVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPSVT_Best.lib
\${CMOS090GP_DIR}/CLOCK90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPLVT_Best.lib
\${CMOS090GP_DIR}/CLOCK90GPSVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPSVT_Best.lib
\${CMOS090GP_DIR}/PR90M7_SNPS-AVT_3.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/PR90M7_Best.lib
\${CMOS090GP_DIR}/IO90GPHVT_3V3_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_3V3_50A_7M2T_Best.lib
\${CMOS090GP_DIR}/IO90GPHVT_BASIC_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_BASIC_50A_7M2T_Best.lib
\${CMOS090GP_DIR}/IO90GPHVT_REF_COMPENSATION_3V3_50A_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_REF_COMPENSATION_3V3_50A_Best.lib"
set rda_Input(ui_timelib,max) "
\${CMOS090GP_DIR}/CORE90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPLVT_Worst.lib
\${CMOS090GP_DIR}/CORE90GPSVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORE90GPSVT_Worst.lib
\${CMOS090GP_DIR}/CORX90GPLVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPLVT_Worst.lib
\${CMOS090GP_DIR}/CORX90GPSVT_SNPS-AVT_4.2/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CORX90GPSVT_Worst.lib
\${CMOS090GP_DIR}/CLOCK90GPHVT_SNPS-AVT_2.1.a/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPHVT_Worst.lib
\${CMOS090GP_DIR}/CLOCK90GPLVT_SNPS-AVT_2.1/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/CLOCK90GPLVT_Worst.lib
\${CMOS090GP_DIR}/PR90M7_SNPS-AVT_3.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_105C/PT_LIB/PR90M7_Worst.lib
\${CMOS090GP_DIR}/IO90GPHVT_3V3_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_3V3_50A_7M2T_Worst.lib
\${CMOS090GP_DIR}/IO90GPHVT_BASIC_50A_7M2T_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_BASIC_50A_7M2T_Worst.lib
\${CMOS090GP_DIR}/IO90GPHVT_REF_COMPENSATION_3V3_50A_SNPS-AVT_4.0/SIGNOFF/bc_1.10V_m40C_wc_0.90V_125C/PT_LIB/IO90GPHVT_REF_COMPENSATION_3V3_50A_Worst.lib"
set rda_Input(ui_timelib) \{\}
set rda_Input(ui_smodDef) \{\}
set rda_Input(ui_smodData) \{\}
set rda_Input(ui_dpath) $\}$
set rda_Input(ui_tech_file) \{\}
set rda_Input(ui_io_file) \{data/ioplace.io\}
set rda_Input(ui_timingcon_file) \{src/top_gate.sdc \}
set rda_Input(ui_latency_file) $\}$
set rda_Input(ui_scheduling_file) \{\}
set rda_Input(ui_buf_footprint) \{\}
set rda_Input(ui_delay_footprint) \{\}
set rda_Input(ui_inv_footprint) \{\}
set rda_Input(ui_leffile) "
\$ \{CMOS090GP_DIR\} / SocEncounter_cmos090gp_2.2/cmos090gp_soc.lef
\$ \{CMOS090GP_DIR \} / CORE90GPLVT_SNPS—AVT_2.1/SIGNOFF/common/LEF/CORE90GPLVT_ANT.lef
\$ \{CMOS090GP_DIR\}/CORE90GPSVT_SNPS-AVT_2.1/SIGNOFF/common/LEF/CORE90GPSVT_ANT.lef
\$ \{CMOS090GP_DIR $\}$ /CORX90GPLVT_SNPS-AVT_4.2/SIGNOFF/common/LEF/CORX90GPLVT_ANT.lef
\$ \{CMOS090GP_DIR $\}$ /CORX90GPSVT_SNPS-AVT_4.2/SIGNOFF/common/LEF/CORX90GPSVT_ANT.lef
$\$$ \{CMOS090GP_DIR $\}$ /CLOCK90GPLVT_SNPS—AVT_2.1/SIGNOFF/common/LEF /CLOCK90GPLVT_ANT.lef
\$ \{CMOS090GP_DIR \} / CLOCK90GPSVT_SNPS—AVT_2.1/SIGNOFF/common/LEF/CLOCK90GPSVT_ANT.lef
\$ \{CMOS090GP_DIR\}/PR90M7_SNPS-AVT_3.0/SIGNOFF/common/LEF/PR90M7_ANT.lef
data/LEF/IO90GPHVT_3V3_50A_7M2T_PGC.lef
data/LEF/IO90GPHVT_BASIC_50A_7M2T_PGC.lef
\$ \{CMOS090GP_DIR \} /IO90GPHVT_REF_COMPENSATION_3V3_50A_SNPS-AVT_4.0/SIGNOFF/common/LEF /
IO90GPHVT_REF_COMPENSATION_3V3_50A.lef"
set rda_Input(ui_core_cntl) \{aspect $\}$
set rda_Input(ui_aspect_ratio) $\{1.0\}$
set rda_Input(ui_core_util) $\{0.7\}$
set rda_Input(ui_core_height) $\}$
set rda_Input(ui_core_width) $\}$
set rda_Input(ui_core_to_left) \{\}
set rda_Input(ui_core_to_right) \{\}

```
```

set rda_Input(ui_core_to_top) {}
set rda_Input(ui_core_to_bottom) {}
set rda_Input(ui_max_io_height) {0}
set rda_Input(ui_row_height) {3.92}
set rda_Input(ui_isHorTrackHalfPitch) {0}
set rda_Input(ui_isVerTrackHalfPitch) {1}
set rda_Input(ui_ioOri) {R0}
set rda_Input(ui_isOrigCenter) {0}
set rda_Input(ui_exc_net) {}
set rda_Input(ui_delay_limit) {1000}
set rda_Input(ui_net_delay) {1000.0ps}
set rda_Input(ui_net_load) {0.5pf}
set rda_Input(ui_in_tran_delay) {120.0ps}
set rda_Input(ui_captbl_file) {}
set rda_Input(ui_defcap_scale) {1.0}
set rda_Input(ui_detcap_scale) {1.0}
set rda_Input(ui_xcap_scale) {1.0}
set rda_Input(ui_res_scale) {1.0}
set rda_Input(ui_shr_scale) {1.0}
set rda_Input(ui_time_unit) {none}
set rda_Input(ui_cap_unit) {}
set rda_Input(ui_oa_reflib) {}
set rda_Input(ui_oa_abstractname) {}
set rda_Input(ui_sigstormlib) {}
set rda_Input(ui_cdb_file) {}
set rda_Input(ui_echo_file) {}
set rda_Input(ui_xilm_file) {}
set rda_Input(ui_qxtech_file) {}
set rda_Input(ui_qxlib_file) {}
set rda_Input(ui_qxconf_file) {}
set rda_Input(ui_pwrnet) {vdd vdde vdd0 vdd1 vdd2 vdd3 vddcore}
set rda_Input(ui_gndnet) {gnd gnde \
CLKSLEEP TQ DIGA DIGB KOFF REFA REFB REFC REFD REFE REFF \
A13SRC A12SRC A11SRC A10SRC A9SRC A8SRC A7SRC A6SRC A5SRC A4SRC A3SRC A2SRC A1SRC A0SRC \
IO_CLKSLEEP IO_TQ IO_DIGA IO_DIGB IO_KOFF IO_REFA IO_REFB IO_REFC IO_REFD IO_REFE IO_REFF \
IO_A13SRC IO_A12SRC IO_A11SRC IO_A10SRC IO_A9SRC IO_A8SRC IO_A7SRC IO_A6SRC IO_A5SRC IO_A4SRC IO_A3SRC IO_A2SRC IO_A1SRC IO_A0SRC \
}
set rda_Input(flip_first) {1}
set rda_Input(double_back) {1}
set rda_Input(assign_buffer) {1}
set rda_Input(ui_gen_footprint) {0}

```

\section*{C.3 IO_Filler.tcl}
```

#define user grid
setPreference ConstraintUserXGrid 0.56
setPreference ConstraintUserYGrid 0.56
snapFPlanIO -usergrid
redraw
#add IO filler from the bigger to the smaller
addIoFiller -cell IOFILLER64_LIN -prefix io_fillperi
addIoFiller -cell IOFILLER32_LIN -prefix io_fillperi
addIoFiller -cell IOFILLER16_LIN -prefix io_fillperi
addIoFiller -cell IOFILLER8_LIN -prefix io_fillperi
addIoFiller -cell IOFILLER4_LIN -prefix io_fillperi
addIoFiller -cell IOFILLER2_LIN -prefix io_fillperi
addIoFiller -cell IOFILLER1_LIN -prefix io_fillperi
redraw

```
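
The seven `addIoFiller` calls in this script differ only in the filler width, so the same sequence could also be written as a loop. The following is a sketch and not part of the original script: it assumes the cell names follow the `IOFILLER<width>_LIN` pattern seen in the listing, and the stub `proc` is a hypothetical stand-in that exists only so the snippet can be tried in a plain `tclsh` outside Encounter.

```tcl
# Hypothetical stand-in: defined only when the Encounter command is absent,
# so the loop can be dry-run in a plain tclsh.
if {[llength [info commands addIoFiller]] == 0} {
    proc addIoFiller {args} { puts "addIoFiller $args" }
}

# Insert fillers from the widest to the narrowest cell,
# keeping the order stated in the script's own comment.
foreach width {64 32 16 8 4 2 1} {
    addIoFiller -cell IOFILLER${width}_LIN -prefix io_fillperi
}
```

The list order carries the "bigger to smaller" rule, so adding or dropping a filler size only changes the list.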

\section*{C.4 do_power_domains.tcl}
```

#create power domains
deletePowerDomain
createPowerDomain PD0 -timinglibs "CORE90GPSVT"
createPowerDomain PD1 -timinglibs "CORE90GPSVT"
createPowerDomain PD2 -timinglibs "CORE90GPLVT"
createPowerDomain PD3 -timinglibs "CORE90GPLVT"
createPowerDomain PDCORE -timinglibs "CORE90GPSVT"
#include instances
modifyPowerDomainMember PD0 -instance core/mult_0 -power (vdd0:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD0 -instance ioco_vddioco_0 -power (vdd0:VDDCORE1V0) -move
modifyPowerDomainMember PD1 -instance core/mult_1 -power (vdd1:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD1 -instance ioco_vddioco_1 -power (vdd1:VDDCORE1V0) -move
modifyPowerDomainMember PD2 -instance core/mult_2 -power (vdd2:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD2 -instance ioco_vddioco_2 -power (vdd2:VDDCORE1V0) -move
modifyPowerDomainMember PD3 -instance core/mult_3 -power (vdd3:vdd) -ground (gnd:gnd)
modifyPowerDomainMember PD3 -instance ioco_vddioco_3 -power (vdd3:VDDCORE1V0) -move
modifyPowerDomainMember PDCORE -instance ioco_vddioco_core -power (vddcore:VDDCORE1V0) -move
modifyPowerDomainMember PDCORE -instance * -power (vddcore:vdd) -ground (gnd:gnd)
#resize it
modifyPowerDomainAttr PDCORE -box 194.04 381.76 691.76 587.76 -rsExts 10 10 40 10 -minGaps 10 10 10 10
createPowerDomainCut 640.88 469.28 691.76 597.76

    -minGaps 10 10 10 10
    -minGaps 10 10 10 10
    -minGaps 10 10 10 10
modifyPowerDomainAttr PD3 -box 659.48 490.44 994.32 994.32 -rsExts 10 10 10 10 -minGaps 10 10 10 10

```

\section*{C.5 create_global_net.tcl}
```

Puts "##################################"
Puts "###"
Puts "### Power declaration for std-cells and IO PADs"
Puts "###"
Puts "##################################"

###
### WARNING : All the global nets should be declared first in the ".conf" file
###

###
### first, declare vdd/gnd pin's for all std-cells
###
globalNetConnect vdd -type pgpin -pin {vdd } -inst * -module {}
globalNetConnect gnd -type pgpin -pin {gnd } -inst * -module {}

## declare 0/1 vhdl/verilog constants to be on vdd/gnd supplys
globalNetConnect vdd -type tiehi -module {}
globalNetConnect gnd -type tielo -module {}

###
### IO pads
### - All the instance names for the IO pads must have the "io_" prefix
###

### IO's & core supply
globalNetConnect vdd -type pgpin -pin {vdd } -inst io* -module {} -override
globalNetConnect gnd -type pgpin -pin {gnd } -inst io* -module {} -override

### remaining IOs pins
globalNetConnect gnde -type pgpin -pin {gnde } -inst io* -module {} -override
globalNetConnect vdde -type pgpin -pin {vdde } -inst io* -module {} -override
globalNetConnect IO_CLKSLEEP -type pgpin -pin {CLKSLEEP } -inst io* -module {} -override
globalNetConnect IO_TQ -type pgpin -pin {TQ } -inst io* -module {} -override
globalNetConnect IO_DIGA -type pgpin -pin {DIGA } -inst io* -module {} -override
globalNetConnect IO_DIGB -type pgpin -pin {DIGB } -inst io* -module {} -override
globalNetConnect IO_KOFF -type pgpin -pin {KOFF } -inst io* -module {} -override
globalNetConnect IO_REFA -type pgpin -pin {REFA } -inst io* -module {} -override
globalNetConnect IO_REFB -type pgpin -pin {REFB } -inst io* -module {} -override
globalNetConnect IO_REFC -type pgpin -pin {REFC } -inst io* -module {} -override
globalNetConnect IO_REFD -type pgpin -pin {REFD } -inst io* -module {} -override
globalNetConnect IO_REFE -type pgpin -pin {REFE } -inst io* -module {} -override
globalNetConnect IO_REFF -type pgpin -pin {REFF } -inst io* -module {} -override
globalNetConnect IO_A0SRC -type pgpin -pin {A0SRC } -inst io* -module {} -override
globalNetConnect IO_A1SRC -type pgpin -pin {A1SRC } -inst io* -module {} -override
globalNetConnect IO_A2SRC -type pgpin -pin {A2SRC } -inst io* -module {} -override
globalNetConnect IO_A3SRC -type pgpin -pin {A3SRC } -inst io* -module {} -override
globalNetConnect IO_A4SRC -type pgpin -pin {A4SRC } -inst io* -module {} -override
globalNetConnect IO_A5SRC -type pgpin -pin {A5SRC } -inst io* -module {} -override
globalNetConnect IO_A6SRC -type pgpin -pin {A6SRC } -inst io* -module {} -override
globalNetConnect IO_A7SRC -type pgpin -pin {A7SRC } -inst io* -module {} -override
globalNetConnect IO_A8SRC -type pgpin -pin {A8SRC } -inst io* -module {} -override
globalNetConnect IO_A9SRC -type pgpin -pin {A9SRC } -inst io* -module {} -override
globalNetConnect IO_A10SRC -type pgpin -pin {A10SRC } -inst io* -module {} -override
globalNetConnect IO_A11SRC -type pgpin -pin {A11SRC } -inst io* -module {} -override
globalNetConnect IO_A12SRC -type pgpin -pin {A12SRC } -inst io* -module {} -override
globalNetConnect IO_A13SRC -type pgpin -pin {A13SRC } -inst io* -module {} -override

globalNetConnect vdde -type pgpin -pin {vdde3v3 } -inst io* -module {} -override
globalNetConnect IO_CLKSLEEP -type pgpin -pin {CLKSLEEP3V3 } -inst io* -module {} -override
globalNetConnect IO_TQ -type pgpin -pin {TQ3V3 } -inst io* -module {} -override
globalNetConnect IO_DIGA -type pgpin -pin {CHIPSLEEP3V3 } -inst io* -module {} -override
globalNetConnect IO_REFA -type pgpin -pin {REFAPBIAS3V3 } -inst io* -module {} -override
globalNetConnect IO_REFB -type pgpin -pin {REFBAMPL3V3 } -inst io* -module {} -override
globalNetConnect IO_REFC -type pgpin -pin {REFCAMPH3V3 } -inst io* -module {} -override
globalNetConnect IO_REFD -type pgpin -pin {REFDNBIAS3V3 } -inst io* -module {} -override
globalNetConnect IO_REFE -type pgpin -pin {REFEIO3V3 } -inst io* -module {} -override
globalNetConnect IO_A0SRC -type pgpin -pin {A0SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A1SRC -type pgpin -pin {A1SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A2SRC -type pgpin -pin {A2SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A3SRC -type pgpin -pin {A3SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A4SRC -type pgpin -pin {A4SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A5SRC -type pgpin -pin {A5SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A6SRC -type pgpin -pin {A6SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A7SRC -type pgpin -pin {A7SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A8SRC -type pgpin -pin {A8SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A9SRC -type pgpin -pin {A9SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A10SRC -type pgpin -pin {A10SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A11SRC -type pgpin -pin {A11SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A12SRC -type pgpin -pin {A12SRC3V3 } -inst io* -module {} -override
globalNetConnect IO_A13SRC -type pgpin -pin {A13SRC3V3 } -inst io* -module {} -override
###
globalNetConnect CLKSLEEP -type pgpin -pin {CLKSLEEP3V3 } -inst ioref* -module {} -override
globalNetConnect TQ -type pgpin -pin {TQ3V3 } -inst ioref* -module {} -override
globalNetConnect DIGA -type pgpin -pin {CHIPSLEEP3V3 } -inst ioref* -module {} -override
globalNetConnect REFA -type pgpin -pin {REFAPBIAS3V3 } -inst ioref* -module {} -override
globalNetConnect REFB -type pgpin -pin {REFBAMPL3V3 } -inst ioref* -module {} -override
globalNetConnect REFC -type pgpin -pin {REFCAMPH3V3 } -inst ioref* -module {} -override
globalNetConnect REFD -type pgpin -pin {REFDNBIAS3V3 } -inst ioref* -module {} -override
globalNetConnect REFE -type pgpin -pin {REFEIO3V3 } -inst ioref* -module {} -override
globalNetConnect A0SRC -type pgpin -pin {A0SRC3V3 } -inst ioref* -module {} -override
globalNetConnect A1SRC -type pgpin -pin {A1SRC3V3 } -inst ioref* -module {} -override
globalNetConnect A2SRC -type pgpin -pin {A2SRC3V3 } -inst ioref* -module {} -override
globalNetConnect A3SRC -type pgpin -pin {A3SRC3V3 } -inst ioref* -module {} -override
globalNetConnect A4SRC -type pgpin -pin {A4SRC3V3 } -inst ioref* -module {} -override
globalNetConnect A5SRC -type pgpin -pin {A5SRC3V3 } -inst ioref* -module {} -override
globalNetConnect A6SRC -type pgpin -pin {A6SRC3V3 } -inst ioref* -module {} -override
###
globalNetConnect CLKSLEEP -type pgpin -pin {CLKSLEEP3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect TQ -type pgpin -pin {TQ3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect DIGA -type pgpin -pin {CHIPSLEEP3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect REFA -type pgpin -pin {REFAPBIAS3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect REFB -type pgpin -pin {REFBAMPL3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect REFC -type pgpin -pin {REFCAMPH3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect REFD -type pgpin -pin {REFDNBIAS3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect REFE -type pgpin -pin {REFEIO3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A0SRC -type pgpin -pin {A0SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A1SRC -type pgpin -pin {A1SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A2SRC -type pgpin -pin {A2SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A3SRC -type pgpin -pin {A3SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A4SRC -type pgpin -pin {A4SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A5SRC -type pgpin -pin {A5SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override
globalNetConnect A6SRC -type pgpin -pin {A6SRC3V3 } -inst ioco_vssio_ref_asrc -module {} -override

### Mult IO power pad
globalNetConnect vddcore -type pgpin -pin {VDDCORE*} -inst ioco_vddioco_core -module {} -override
globalNetConnect vdd0 -type pgpin -pin {VDDCORE*} -inst ioco_vddioco_0 -module {} -override
globalNetConnect vdd1 -type pgpin -pin {VDDCORE*} -inst ioco_vddioco_1 -module {} -override
globalNetConnect vdd2 -type pgpin -pin {VDDCORE*} -inst ioco_vddioco_2 -module {} -override
globalNetConnect vdd3 -type pgpin -pin {VDDCORE*} -inst ioco_vddioco_3 -module {} -override

### connect cells to the correct io
globalNetConnect vddcore -type pgpin -pin {vdd} -inst * -module core -override
globalNetConnect vdd0 -type pgpin -pin {vdd} -inst * -module core/mult_0 -override
globalNetConnect vdd1 -type pgpin -pin {vdd} -inst * -module core/mult_1 -override
globalNetConnect vdd2 -type pgpin -pin {vdd} -inst * -module core/mult_2 -override
globalNetConnect vdd3 -type pgpin -pin {vdd} -inst * -module core/mult_3 -override

###
### execute command
###
applyGlobalNets

###
### check all design
### (a specific check can also be performed in menu : FloorPlan->Global Net Connection-> check button)
###
#checkdesign -all

```

\section*{C.6 pwr.tcl}
```

#add rings (core + power_domains)
#extern ring
addRing \
-spacing_bottom 3.0 \
-spacing_top 3.0 \
-spacing_right 3.0 \
-spacing_left 3.0 \
-width_bottom 10 \
-width_top 10 \
-width_right 10 \
-width_left 10 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.45 \
-offset_top 0.45 \
-offset_right 0.45 \
-offset_left 0.45 \
-center 1 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-around core \
-jog_distance 0.45 \
-threshold 0.45 \
-nets {gnd vddcore}
#PD0
deselectAll
selectGroup PD0
addRing \
-type block_rings \
-around power_domain \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.45 \
-offset_top 0.45 \
-offset_right 0.45 \
-offset_left 0.45 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-jog_distance 0.45 \
-threshold 0.45 \
-nets {vdd0}
deselectGroup PD0
#PD1
selectGroup PD1
addRing \
-type block_rings \
-around power_domain \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.45 \
-offset_top 0.45 \
-offset_right 0.45 \
-offset_left 0.45 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-jog_distance 0.45 \
-threshold 0.45 \
-nets {vdd1}
deselectGroup PD1
#PD2
selectGroup PD2
addRing \
-type block_rings \
-around power_domain \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.45 \
-offset_top 0.45 \
-offset_right 0.45 \
-offset_left 0.45 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-jog_distance 0.45 \
-threshold 0.45 \
-nets {vdd2}
deselectGroup PD2
#PD3
selectGroup PD3
addRing \
-type block_rings \
-around power_domain \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.45 \
-offset_top 0.45 \
-offset_right 0.45 \
-offset_left 0.45 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-jog_distance 0.45 \
-threshold 0.45 \
-nets {vdd3}
deselectGroup PD3
#PDCORE
selectGroup PDCORE
addRing \
-type block_rings \
-around power_domain \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.45 \
-offset_top 0.45 \
-offset_right 0.45 \
-offset_left 0.45 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-jog_distance 0.45 \
-threshold 0.45 \
-left 0 \
-tl 1 \
-bl 1 \
-nets {vddcore}
deselectGroup PDCORE
#IO_REF_COMPENSATION
addRing \
-type block_rings \
-around each_block \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 0.55 \
-offset_top 0.55 \
-offset_right 0.55 \
-offset_left 0.55 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-jog_distance 0.45 \
-threshold 0.45 \
-nets {vdd vdde}
addRing \
-type block_rings \
-around each_block \
-spacing_bottom 1.5 \
-spacing_top 1.5 \
-spacing_right 1.5 \
-spacing_left 1.5 \
-width_bottom 8 \
-width_top 8 \
-width_right 8 \
-width_left 8 \
-layer_bottom M7 \
-layer_top M7 \
-layer_right M6 \
-layer_left M6 \
-offset_bottom 20.5 \
-offset_top 20.5 \
-offset_right 20.5 \
-offset_left 20.5 \
-stacked_via_top_layer M7 \
-stacked_via_bottom_layer M1 \
-threshold 0.45 \
-bottom 0 \
-right 0 \
-lb 1 \
-tr 1 \
-nets {gnd}

```
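
The four multiplier domains PD0 to PD3 receive identical block rings in this script and differ only in the supply net, so their blocks could be generated by one loop. The following is a sketch under that reading, not the original script: the option values are copied from the listing, the PDCORE ring is left out because it carries extra `-left`, `-tl` and `-bl` options, and the stub procedures are hypothetical stand-ins that only allow a dry run in a plain `tclsh` outside Encounter.

```tcl
# Hypothetical stand-ins: defined only when the Encounter commands are absent,
# so the loop can be dry-run in a plain tclsh.
foreach cmd {selectGroup deselectGroup addRing} {
    if {[llength [info commands $cmd]] == 0} {
        proc $cmd {args} [format {puts "%s $args"} $cmd]
    }
}

# One block ring per multiplier power domain, each on its own supply net.
foreach {domain net} {PD0 vdd0 PD1 vdd1 PD2 vdd2 PD3 vdd3} {
    selectGroup $domain
    addRing \
        -type block_rings \
        -around power_domain \
        -spacing_bottom 1.5 -spacing_top 1.5 \
        -spacing_right 1.5 -spacing_left 1.5 \
        -width_bottom 8 -width_top 8 -width_right 8 -width_left 8 \
        -layer_bottom M7 -layer_top M7 -layer_right M6 -layer_left M6 \
        -offset_bottom 0.45 -offset_top 0.45 \
        -offset_right 0.45 -offset_left 0.45 \
        -stacked_via_top_layer M7 \
        -stacked_via_bottom_layer M1 \
        -jog_distance 0.45 \
        -threshold 0.45 \
        -nets $net
    deselectGroup $domain
}
```

Pairing each domain with its net in a single list keeps the two from drifting apart when a domain is added or renamed.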

\section*{C.7 followPin.tcl}
```

Puts "####################################"
Puts "###"
Puts "### Create std-cell follow pin"
Puts "###"
Puts "####################################"
deselectAll
cutCoreRow
#avoid via6_7 on VDDCO_HDRV_MT_1V0_LIN edge
createRouteBlk -box 233.4700 1042.6900 276.7550 1047.9950 -layer 6
createRouteBlk -box 911.768 1042.69 955.176 1049.237 -layer 6
createRouteBlk -box 504.69 139.071 548.532 146.015 -layer 6
createRouteBlk -box 232.895 140.142 277.333 146.272 -layer 6

# Use editPowerVia to generate stripes-followpins

#-noBlockPins firstAfterRowEnd
sroute -verbose -noPadRings -noStripes \
-corePinMaxViaWidth 30 -corePinMaxViaHeight 70 \
-targetViaTopLayer 7 -crossoverViaTopLayer 7 \
-secondaryStopSCPin firstStripe \
-viaConnectToShape { stripe ring } \
-deleteExistingRoutes \
-padPinWidth 7 \
-nets {gnd vdd vdd0 vdd1 vdd2 vdd3 vddcore vdde}
#avoid routing too close to the pad
createRouteBlk -box 923.796 143.219 927.711 156.959 -layer all
#Route IO_REF special nets
sroute -verbose -noPadRings -padPinToAlignedBlockPin \
-stopStripeSCPin lastPadRing -deleteExistingRoutes -nets {\
CLKSLEEP TQ DIGA DIGB KOFF REFA REFB REFC REFD REFE REFF \
A6SRC A5SRC A4SRC A3SRC A2SRC A1SRC A0SRC }

# A13SRC A12SRC A11SRC A10SRC A9SRC A8SRC A7SRC

#Remove blockages
deleteAllRouteBlks
clearCutRow
deselectAll

```

\section*{C.8 place_output_bufs.tcl}
```

placeInstance Z_svt_buf 194.04 570.385 MY
placeInstance Z_lvt_buf 194.04 535.085 R180
placeInstance s_out_buf 194.04 409.714 R180

```

\section*{C.9 output_nets.tcl}
```

#outputs nets with width of 1 um
setEdit -force_special 1
setEdit -width_horizontal 1
setEdit -width_vertical 1
#s_out
setEdit -nets s_out
setEdit -layer_horizontal M1
setEdit -layer_vertical M2
uiSetTool addWire
editAddRoute 194.847 411.480
editAddRoute 145.233 411.459
editAddRoute 144.821 413.911
editAddRoute 145.115 413.911
editCommitRoute 145.115 413.911
setEdit -layer_horizontal M2
editAddRoute 142.614 413.933
editAddRoute 145.259 413.439
editAddRoute 145.043 413.933
editCommitRoute 145.043 413.933
uiSetTool select
#Z_lvt
setEdit -nets Z_lvt
setEdit -layer_horizontal M1
setEdit -layer_vertical M2
uiSetTool addWire
editAddRoute 195.073 536.971
editAddRoute 145.874 536.559
editAddRoute 145.028 549.990
setEdit -layer_horizontal M2
editAddRoute 142.497 549.872
editCommitRoute 142.497 549.872
uiSetTool select
#Z_svt
setEdit -nets Z_svt
uiSetTool addWire
setEdit -layer_horizontal M1
setEdit -layer_vertical M2
editAddRoute 195.030 572.574
editAddRoute 146.930 572.574
editAddRoute 146.764 685.499
setEdit -layer_horizontal M2
editAddRoute 142.467 685.146
editCommitRoute 142.467 685.146

```

\section*{C.10 fix_drc_errors.tcl}
```

##########################################################
#Fix the DRC errors discovered with calibre DRC
##########################################################
#lower net
#move the first part
selectWire 950.5100 203.6300 952.6100 203.7700 3 gnd
selectWire 952.4700 203.6300 954.2900 203.7700 3 gnd
selectWire 954.1500 203.6300 981.6100 203.7700 3 gnd
editMove y -1.399
deselectAll
#add via at the end
setEdit -layer_horizontal M3 -layer_vertical M4 -nets gnd
setEdit -width_horizontal 0.14 -width_vertical 0.14
editAddRoute 981.604 202.292
editAddRoute 983.905 202.306
editCommitRoute 983.905 202.306
uiSetTool select
deselectAll
##
selectWire 983.9900 203.6300 984.6900 203.7700 3 gnd
editDelete -objects Selected
deleteTiles -selected
deleteBumps -selected
selectWire 984.5500 203.6300 984.6900 244.8500 2 gnd
editStretch y -0.683 low
editAddRoute 984.610 203.019
editAddRoute 981.323 203.076
editCommitRoute 981.323 203.076
uiSetTool select
deselectAll
#######################################
#upper net
#delete existing wire
selectWire 946.4700 450.8700 947.7300 451.0100 5 gnd
selectWire 947.5900 450.8700 947.7300 451.5700 4 gnd
selectWire 947.5900 451.4300 981.6100 451.5700 3 gnd
editDelete -objects Selected
deleteTiles -selected
deleteBumps -selected
#create the new one
editAddRoute 942.920 448.604
editAddRoute 944.305 455.932
editAddRoute 984.069 455.849
editCommitRoute 984.069 455.849
setEdit -layer_vertical M3
#add extra via
editAddRoute 942.905 455.921
editAddRoute 942.913 455.503
editAddRoute 942.961 455.527
editCommitRoute 942.961 455.527
setEdit -layer_vertical M4
uiSetTool select
deselectAll
######################################
#move offending net M1.S.3.1
#uiSetTool moveWire
selectWire 385.6300 383.6700 387.7300 383.8100 3 clk_p
selectWire 383.6700 383.6700 385.7700 383.8100 3 clk_p
editMove y 1.246
deselectAll
selectWire 382.8400 383.4000 385.4200 383.5200 1 core/data_gen_1/clk_p__Fence_N0
editDelete -objects Selected
deleteTiles -selected
deleteBumps -selected
selectWire 380.5900 383.3900 382.9700 383.5300 3 core/data_gen_1/clk_p__Fence_N0
editStretch x 3.292 high
uiSetTool select
deselectAll
#####################
#move an offending net on M1
selectWire 911.2600 528.7200 911.6000 528.8400 1 core/mult_3/mult_par4_0/AandBx15xx17x
editMove y 0.558
uiSetTool select
deselectAll

```

\section*{C.11 top.ctstch}
```

### CLK
# Sample Gated CTS Command
AutoCTSRootPin io_clk/ZI
NoGating NO
Buffer IVSVTX6 BFSVTX1 BFSVTX8 BFSVTX10 BFSVTX12 IVLVTX6 BFLVTX1 BFLVTX8 BFLVTX10 BFLVTX12
MaxDelay 10ps
MinDelay 0ps
MaxSkew 100ps
End

### Reset
# Sample Gated CTS Command
AutoCTSRootPin io_rst_n/ZI
NoGating NO
Buffer IVSVTX6 BFSVTX1 BFSVTX2 BFSVTX4 BFSVTX6 BFSVTX8 BFSVTX12 IVLVTX6 BFLVTX1 BFLVTX2 BFLVTX4 BFLVTX6 BFLVTX8 BFLVTX12
MaxSkew 1ns
End

```

\section*{C.12 ioplace.io}
```

#################################################################
#
# Silicon Perspective, A Cadence Company
# FirstEncounter IO Assignment
#
#################################################################
Version: 2
Pad: io_corner_4 SE CORNER_LIN
Pad: io_corner_3 NE CORNER_LIN
Pad: io_corner_2 NW CORNER_LIN
Pad: io_corner_1 SW CORNER_LIN
Pad: ioco_vddioco_1 N VDDCO_HDRV_MT_1V0_LIN
Pad: io_vssio_2 N VSSIO_3V3_LIN
Pad: ioco_vddioco_core N VDDCO_HDRV_MT_1V0_LIN
Pad: io_vddio_2 N VDDIO_3V3_LIN
Pad: ioco_vssioco_3 N VSSIOCO_LIN
Pad: ioco_vddioco_3 N VDDCO_HDRV_MT_1V0_LIN
Pad: io_s_in W
Pad: io_s_out W
Pad: io_Z_lvt W
Pad: io_Z_svt W
Pad: io_shift_n W
Pad: ioco_vssioco_2 W VSSIOCO_LIN
Pad: ioco_vddioco_0 S VDDCO_HDRV_MT_1V0_LIN
Pad: io_clk S
Pad: ioco_vddioco_2 S VDDCO_HDRV_MT_1V0_LIN
Pad: io_load_n S
Pad: ioco_vssioco_1 S VSSIOCO_LIN
Pad: ioco_vssio_ref_asrc S VSSIO_3V3_REF_ASRC_LIN
Pad: ioco_vddioco_g E VDDIOCO_LIN
Pad: ioco_vssioco_g E VSSIOCO_LIN
Pad: io_sel0 E
Pad: io_sel1 E
Pad: io_sel_reg E
Pad: io_rst_n E

```

\section*{Appendix D}

\section*{FPGA source code}

\section*{D.1 main_FPGA.vhd}
```

-- Title      : FPGA code for demonstrator test board
-- Project    :
-- File       : main_FPGA.vhd
-- Author     : mtschuster@WS-3439
-- Company    :
-- Created    : 2007-02-03
-- Last update: 2007-02-03
-- Platform   :
-- Standard   : VHDL'93
-- Description: This code generates the stimuli needed to:
--              1) Select the desired multiplier;
--              2) Reset internal registers;
--              3) Execute 10'000'000 multiply-and-accumulate operations
--                 on the 64-bit register;
--              4) Read back the content of the accumulator register with a
--                 frequency divided by 4;
--              5) Verify the read data against the expected value and output
--                 the decision on the pass/fail pins;
--              6) At the end of this sequence, the chip clock is stopped to
--                 allow static power measurements.
-- Copyright (c) 2007
-- Revisions  :
-- Date        Version  Author      Description
-- 2007-02-03  1.0      mtschuster  Created
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
entity main is
port(
-- 4 user switches (normally ON)
S1 : in std_logic;
S2 : in std_logic;
S3 : in std_logic;
S4 : in std_logic;
-- relay outputs for the 4 multipliers
-- BEWARE: 0=on (VDDM); 1=off (GND)
P1 : out std_logic; -- mult0: RCA32_SVT
P2 : out std_logic; -- mult1: RCA32_PAR4_SVT
P3 : out std_logic; -- mult2: RCA32_LVT
P4 : out std_logic; -- mult3: RCA32_PAR4_LVT
-- Test result leds
OK_led : out std_logic; -- test passed
KO_led : out std_logic; -- test failed
-- Control FPGA pins
mult_num : in std_logic_vector(1 downto 0); -- multiplier selector
-- Serial interface pins
CHIP_sout : in std_logic; -- serial interface output
CHIP_sin : out std_logic; -- serial interface input
CHIP_shift_n : out std_logic; -- enable bit shifting, active low
CHIP_load_n : out std_logic; -- enable parallel load, active low
CHIP_sel : out std_logic_vector(1 downto 0); -- select the multiplier under test
CHIP_sel_reg : out std_logic; -- route to/from the shift register
CHIP_clock : out std_logic; -- chip clock
CHIP_rst_n : out std_logic; -- chip asynchronous reset, active low
clock : in std_logic -- FPGA clock
);
end main;
architecture arch of main is
-- state machine states
type FSM_states is (INIT, RUN, READBACK, VERIFY);
signal curr_state, next_state : FSM_states;
signal rst_n : std_logic; -- global reset
signal count : integer range 0 to 16777215; -- counter delaying
the next state
signal clock_slow : std_logic; -- clock divided by 4
signal clock_slow_enable : std_logic; -- enable clock_slow, active high
signal clock_div_counter : std_logic_vector(1 downto 0); -- clock divider
counter
signal read : std_logic; -- readback trigger
signal data : std_logic_vector(63 downto 0); --readback data
signal fail : std_logic; -- test passed
signal pass : std_logic; -- test failed
signal mult_sel : std_logic_vector(1 downto 0); -- multiplier
selection
begin
-- COMBINATORIAL LOGIC

  -- reset: through switch 4
  rst_n <= S4;
  -- multiplier selection and chip clock multiplexing
  mult_sel   <= mult_num;
  CHIP_sel   <= mult_sel;
  clock_slow <= clock_div_counter(1);
  CHIP_clock <= clock_slow when clock_slow_enable = '1' else clock;
  -- test result leds
  OK_led <= pass;
  KO_led <= fail;
  -- select multipliers power
  P1 <= '0' when mult_sel = "00" else '1';
  P2 <= '0' when mult_sel = "01" else '1';
  P3 <= '0' when mult_sel = "10" else '1';
  P4 <= '0' when mult_sel = "11" else '1';
  -- Finite state machine definition
  FSM : process(curr_state, count, mult_sel, data, S4)
    -- number of clocks of the init state
    constant INIT_LENGTH : integer := 4;
    -- number of clocks for the running state
    -- it corresponds to the number of multiplications + 2
    -- parallel multipliers require 3 extra clocks due to latency
    constant RUNNING_LENGTH     : integer := 10000002;
    constant RUNNING_LENGTH_PAR : integer := RUNNING_LENGTH + 3;
    -- number of clocks to execute the readback task, based on the full speed clock
    constant READBACK_LENGTH : integer := 254;
    -- expected result after 10'000'000 multiplications and accumulations
    constant EXPECTED_RESULT : std_logic_vector(63 downto 0) := X"0E4DD39EA61421FC";
    -- on low supply voltages (<0.4V) one extra multiplication can occur
    constant EXPECTED_RESULT_LV : std_logic_vector(63 downto 0) := X"1628d37ce47c248c";
  begin
    -- chip default values
    CHIP_sin          <= '0';
    CHIP_shift_n      <= '1';
    CHIP_load_n       <= '1';
    CHIP_sel_reg      <= '0';
    clock_slow_enable <= '0';
    CHIP_rst_n        <= '1';
    read              <= '0';
    pass              <= '0';
    fail              <= '0';
    -- state machine
    case curr_state is
      -- load zeros from random generator to serial interface registers
      when INIT =>
        CHIP_load_n  <= '0';  -- parallel load
        CHIP_shift_n <= '1';  -- no shift
        CHIP_sel_reg <= '1';  -- from rand to regs
        CHIP_rst_n   <= '0';  -- hold the random generator at zero
        -- after the init time has passed, go to the next state
        if count = INIT_LENGTH then
          next_state <= RUN;
        else
          next_state <= INIT;
        end if;
      -- run the multiplications and accumulations
      when RUN =>
        CHIP_load_n  <= '0';  -- parallel load
        CHIP_shift_n <= '1';  -- no shift
        CHIP_sel_reg <= '0';  -- from rand to multiplier
        CHIP_rst_n   <= '1';  -- activate the random generator
        -- after the multiplications and accumulations, go to the next state
        -- due to the parallel nature of mult1 and mult3, a few extra clocks
        -- are required
        if (count = RUNNING_LENGTH and mult_sel(0) = '0')
          or (count = RUNNING_LENGTH_PAR and mult_sel(0) = '1') then
          next_state <= READBACK;
        else
          next_state <= RUN;
        end if;
      -- read back values from the registers through the serial interface
      when READBACK =>
        CHIP_load_n       <= '1';  -- serial behaviour
        CHIP_shift_n      <= '0';  -- activate data shifting
        read              <= '1';  -- activate readback
        clock_slow_enable <= '1';  -- switch to slow clock
        -- trigger data reading and once finished go to the final state
        if count = READBACK_LENGTH then
          next_state <= VERIFY;
        else
          next_state <= READBACK;
        end if;
      -- verify read data with the expected data and output the result to leds
      when VERIFY =>
        next_state        <= VERIFY;  -- looped state until a reset is fired
        clock_slow_enable <= '1';     -- remain with slow clock
        if (data = EXPECTED_RESULT) or (data = EXPECTED_RESULT_LV) then
          pass <= '1';  -- green LED on
        else
          fail <= '1';  -- red LED on
        end if;
      -- if something strange happens, go to the init state
      when others =>
        next_state <= INIT;
    end case;
  end process FSM;
  -- SEQUENTIAL LOGIC
  -- asynchronous reset registers
  FSM_regs : process(clock, rst_n)
  begin
    if rst_n = '0' then
      curr_state <= INIT;
      data       <= (others => '0');
    elsif (clock'event and clock = '1') then
      curr_state <= next_state;
      -- read data back when read = '1'
      if read = '1' and clock_div_counter = "01" then
        data <= data(62 downto 0) & CHIP_sout;
      end if;
    end if;
  end process FSM_regs;
  -- synchronous reset registers
  -- at each new FSM state the counter is reset
  counter : process(clock, rst_n)
  begin
    if clock'event and clock = '1' then
      if rst_n = '0' or (curr_state /= next_state) then
        count <= 0;
      else
        count <= count + 1;
      end if;
    end if;
  end process counter;
  -- generate lower frequency clock
  gen_CHIP_clk : process(clock, read)
  begin
    if clock'event and clock = '0' then
      if read = '0' then
        clock_div_counter <= "10";
      else
        clock_div_counter <= clock_div_counter + "01";
      end if;
    end if;
  end process gen_CHIP_clk;
end arch;

```
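The timing of the controller above can be checked with a small behavioural model. The following Python sketch is not part of the FPGA project; it only mirrors the next-state logic, the state counter and the readback shift register of `main_FPGA.vhd`. The function names (`next_state`, `cycles_per_state`, `shift_in`) are hypothetical, one loop iteration is assumed to stand for one rising edge of the FPGA clock after reset release, and the readback model assumes that exactly 64 bits are sampled.

```python
from enum import Enum


class State(Enum):
    INIT = 0
    RUN = 1
    READBACK = 2
    VERIFY = 3


def next_state(state, count, mult_sel, init_len, run_len, readback_len):
    """Combinational next-state logic, mirroring the FSM process."""
    if state is State.INIT:
        return State.RUN if count == init_len else State.INIT
    if state is State.RUN:
        # mult_sel bit 0 set: parallel multiplier, 3 extra clocks of latency
        limit = run_len + 3 if mult_sel & 1 else run_len
        return State.READBACK if count == limit else State.RUN
    if state is State.READBACK:
        return State.VERIFY if count == readback_len else State.READBACK
    return State.VERIFY  # VERIFY loops until a reset is fired


def cycles_per_state(mult_sel, init_len=4, run_len=10_000_002, readback_len=254):
    """Count the rising clock edges spent in each state before VERIFY."""
    state, count = State.INIT, 0
    cycles = {State.INIT: 0, State.RUN: 0, State.READBACK: 0}
    while state is not State.VERIFY:
        cycles[state] += 1
        nxt = next_state(state, count, mult_sel, init_len, run_len, readback_len)
        # the counter is cleared whenever the state is about to change
        count = 0 if nxt is not state else count + 1
        state = nxt
    return cycles


def shift_in(bits):
    """Rebuild a 64-bit word as data <= data(62 downto 0) & CHIP_sout does."""
    word = 0
    for bit in bits:
        word = ((word << 1) | bit) & (2**64 - 1)
    return word


if __name__ == "__main__":
    # Shortened run length (illustrative value) so the demo finishes instantly.
    print(cycles_per_state(mult_sel=0, run_len=10))
    print(cycles_per_state(mult_sel=1, run_len=10))
    expected = 0x0E4DD39EA61421FC
    msb_first = [(expected >> i) & 1 for i in range(63, -1, -1)]
    print(hex(shift_in(msb_first)))
```

In this model each state lasts its length constant plus one clock edges, because the counter starts at zero and the transition is taken when the counter equals the constant. The first bit shifted in ends up in the most significant position of the rebuilt word.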

\section*{Appendix E}

\section*{MATLAB based automated test functions}

\section*{E.1 test_mult.m}
```

function data = test_mult(mult, freq, volt)
% mult in 0-3
% freq lower than 80 MHz
% volt lower than or equal to 1 V
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Connect devices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Open devices
k2400 = visa('ni', 'GPIB0::13::0::INSTR');
k213 = visa('ni', 'GPIB0::11::0::INSTR');
agilent = visa('ni', 'GPIB0::10::0::INSTR');
fopen(k2400);
fopen(k213);
fopen(agilent);
% Get information about devices
fprintf(k2400, '*IDN?');
current_sense = fscanf(k2400)
voltage_source = fscanf(k213)
fprintf(agilent, '*IDN?');
frequency_generator = fscanf(agilent)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Initialize devices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Reset devices
fprintf(k2400, '*RST');
fprintf(agilent, '*RST');
% Prepare the k2400 for current measurements
fprintf(k2400, ':SOUR:FUNC VOLT'); % set source to voltage

fprintf(k2400, ':SOUR:VOLT:MODE FIXED'); % set source to DC
fprintf(k2400, ':SOUR:VOLT 0'); % reset source to 0
fprintf(k2400, ':SENS:FUNC "CURR"'); % select current measurement
fprintf(k2400, ':CURR:NPLC 0.1'); % set integration time: 1 = 1/50 Hz, 0.1 = 1/500 Hz
fprintf(k2400, ':CURR:PROT 0.02'); % set compliance to 20 mA
fprintf(k2400, ':CURR:RANG 0.01'); % set range to 10 mA
fprintf(k2400, ':FORM:ELEM CURR'); % set current data format
fprintf(k2400, ':TRIG:COUNT 5'); % number of successive readings
fprintf(k2400, ':ARM:SOUR PSTEST'); % enable trigger on positive edge of SOT
fprintf(k2400, ':SOUR:DEL 0.05'); % delay between measurements: 50 ms
fprintf(k2400, ':OUTP ON'); % enable output
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Body of the code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
set_mult(mult, k213); % select multiplier
i = 0;
for f = freq % for each frequency do
    set_freq(f, agilent); % set the frequency
    pause(2); % allow frequency to stabilize
    i = i+1; j = 0;
    for V = volt % for each supply voltage do
        j = j+1;
        % check that the supply voltage never exceeds 1 V
        vdd_core = V + 0.1; % set core voltage 100 mV higher than multiplier
        if vdd_core > 1
            vdd_core = 1;
        end
        if V > 1
            V = 1;
        end
        set_voltage(vdd_core, k213); % set the core supply voltage
        fprintf(k2400, [':SOUR:VOLT ' num2str(V)]); % set multiplier supply voltage
        start_off(k213); % reset the FPGA
        fprintf(k2400, ':INIT'); % arm the current sensing
        start_on(k213); % activate the FPGA and trigger the sensing
        dyn = str2num(get_current(k2400)); % read the current values
        pass = pass_test(k213); % check if the test passed or failed
        fprintf(k2400, ':ARM:SOUR IMM'); % take an immediate measure for static
        fprintf(k2400, ':INIT'); % arm the current sensing
        stat = str2num(get_current(k2400)); % read the current values
        data(i,j,:) = [f V max(dyn) min(stat) pass]; % store results in data
        fprintf(k2400, ':ARM:SOUR PSTEST'); % enable trigger on positive edge of SOT
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Disconnect devices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Disable outputs
fprintf(k2400, 'OUTP OFF');

fprintf(agilent, 'OUTP OFF');
% Close devices
fclose(k2400);
fclose(k213);
fclose(agilent);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Agilent 33250A frequency generator code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function set_freq(freq, dev) % set a square clock on agilent
fprintf(dev, ['APPL:SQU ' num2str(freq) ',3.3,1.65']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Keithley 2400 current sensing code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function curr = get_current_one(dev) % take a one-shot measure
fprintf(dev, ':MEAS:CURR?');
curr = fscanf(dev);
function curr = get_current(dev) % take a current measure
fprintf(dev, ':FETCH?');
curr = fscanf(dev);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Keithley 213 quad voltage source code
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function set_voltage(v, dev) % calibrated voltage on k213
fprintf(dev, ['P1C0A0R1H0J128,143V' num2str(v) 'X']);
function start_on(dev) % CHIP_rst_n high
fprintf(dev, 'P4V3.3X');
function start_off(dev) % CHIP_rst_n low
fprintf(dev, 'P4V0X');
function set_mult(num, dev) % select the multiplier under test
if num == 0
    str1 = 'P2V0X';
    str2 = 'P3V0X';
elseif num == 1
    str1 = 'P2V3.3X';
    str2 = 'P3V0X';
elseif num == 2
    str1 = 'P2V0X';
    str2 = 'P3V3.3X';
else
    str1 = 'P2V3.3X';
    str2 = 'P3V3.3X';
end
fprintf(dev, str1);
fprintf(dev, str2);
function pass = pass_test(dev) % wait for test results and check if the test passed
pass = 0;

fail = 0;
while (pass == 0 && fail == 0)
    fprintf(dev, 'U5X');
    din = str2num(fscanf(dev));
    pass = bitget(din,1);
    fail = bitget(din,3);
end

```
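Each entry stored by `test_mult` has the layout `[f V max(dyn) min(stat) pass]`, i.e. frequency, multiplier supply voltage, current measured while running, current measured with the chip clock stopped, and the pass flag. A possible post-processing step, sketched here in Python rather than MATLAB, converts such rows into power figures using $P = V \cdot I$. The function names and the sample rows are hypothetical, and the split into a dynamic part assumes that the leakage current while running equals the current measured with the clock stopped.

```python
def power_from_row(row):
    """Convert one [f, V, i_run, i_static, passed] row into power figures (W)."""
    freq, vdd, i_run, i_static, passed = row
    p_total = vdd * i_run        # power while the multiplier is clocked
    p_static = vdd * i_static    # power once the chip clock is stopped
    return {
        "freq_hz": freq,
        "vdd": vdd,
        "p_total": p_total,
        "p_static": p_static,
        # assumption: leakage is the same while running and while stopped
        "p_dynamic": p_total - p_static,
        "passed": bool(passed),
    }


def lowest_power_point(rows):
    """Return the passing row with the smallest total power, or None."""
    passing = [power_from_row(r) for r in rows if r[4]]
    return min(passing, key=lambda r: r["p_total"], default=None)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only (not measured data).
    rows = [
        [1e6, 0.5, 2.0e-4, 1.0e-5, 1],
        [1e6, 0.4, 1.5e-4, 0.8e-5, 1],
        [1e6, 0.3, 1.0e-4, 0.5e-5, 0],
    ]
    print(lowest_power_point(rows))
```

Rows whose pass flag is zero are discarded before the minimum is taken, since a supply voltage at which the multiplier fails the functional test is not a valid operating point.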

[^0]:    This work has been supported by CSEM (Neuchâtel, Switzerland) and the Swiss National Science Foundation (SNSF, under grant 105619).

[^1]:    This equation refers to the energy required to charge and discharge the capacitance, each process contributing $\frac{1}{2} C V^{2}$.

[^2]:    This $\alpha_{g}$ parameter is unrelated to the $\alpha$ parameter of the alpha power law model of the transistor on-current, which is used extensively in this thesis.

