Open AccessProceedings ArticleDOI

Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density

Abstract
In 1999, most commercial FPGAs, like the Altera Flex and Xilinx Virtex FPGAs, already had cluster-based logic blocks. However, the modeling and evaluation of these sorts of architectures was still in its infancy. In the previous year, Betz had shown that cluster-based logic blocks led to improved density. The real advantage of cluster-based logic blocks, though, was speed, as this paper demonstrates. In doing so, this paper opened up an entirely new research area, setting the framework for numerous packing algorithms that have become a fundamental part of any FPGA CAD flow.


Using Cluster-Based Logic Blocks and Timing-Driven
Packing to Improve FPGA Speed and Density
Alexander (Sandy) Marquardt, Vaughn Betz, and Jonathan Rose
Department of Electrical and Computer Engineering
University of Toronto
Toronto, ON, Canada M5S 3G4
{arm, vaughn, jayar}@eecg.toronto.edu
Abstract
In this paper, we investigate the speed and area-efficiency of
FPGAs employing “logic clusters” containing multiple LUTs and
registers as their logic block. We introduce a new, timing-driven
tool (T-VPack) to “pack” LUTs and registers into these logic
clusters, and we show that this algorithm is superior to an existing
packing algorithm. Then, using a realistic routing architecture and
sophisticated delay and area models, we empirically evaluate
FPGAs composed of clusters ranging in size from one to twenty
LUTs, and show that clusters of size seven through ten provide the
best area-delay trade-off. Compared to circuits implemented in an
FPGA composed of size one clusters, circuits implemented in an
FPGA with size seven clusters have 30% less delay (a 43% increase
in speed) and require 8% less area, and circuits implemented in an
FPGA with size ten clusters have 34% less delay (a 52% increase in
speed), and require no additional area.
1. Introduction
Much of the speed and area-efficiency of an FPGA is determined by
the logic block it employs. If a very small, or fine-grained, logic
block is used, many connections must be routed between the
numerous logic blocks [Rose93]. Since routing consumes most of
the area and accounts for most of the delay in FPGAs, a small logic
block often results in poor area-efficiency and speed due to the
excessive routing required to connect all the logic blocks. If, on the
other hand, a very large, or coarse-grained, logic block is employed,
the logic block area and delay may become excessive, again resulting
in poor area-efficiency and speed [Rose93]. Choosing the best
size, or granularity, for an FPGA logic block therefore involves bal-
ancing complex trade-offs.
In this work we determine the best size for “cluster-based” logic
blocks, which we refer to as “logic clusters”. This style of logic
block is of interest for several reasons. First, the Altera Flex series
FPGAs [Alte98], the Xilinx 5200 and Virtex FPGAs [Xili97,
Xili98], and the Vantis VF1 FPGAs [Vant98] all employ cluster-based
logic blocks, so research concerning the best size of logic
clusters is of clear commercial interest. Second, prior research
[Betz98a] has shown that the area-efficiency of large logic clusters
is quite competitive with that of FPGAs using single look-up table
(LUT) logic blocks. Third, an FPGA composed of large logic clus-
ters requires fewer logic blocks to implement a circuit than an
FPGA using a more fine-grained block. This reduces the size of the
placement and routing problem, and hence design compile time -
an increasingly important concern as the logic capacity of FPGAs
rises. Finally, we show in this paper that cluster-based logic blocks
can improve FPGA speed compared to single-LUT logic blocks by
reducing the number of connections on the critical path that must be
routed between logic blocks.
Prior research [Betz98a] has focused only on the area-efficiency of
different sizes of logic clusters. In this work, we simultaneously
examine both the area-efficiency and the speed of FPGAs using dif-
ferent logic cluster sizes. Since both speed and density are crucial in
modern FPGAs, only by examining both issues can we determine
the best logic cluster size. As well, we use a more complex and
realistic routing architecture than [Betz98a] in our investigations,
leading to more accurate architectural conclusions. Finally, we
present a new, timing-driven algorithm (T-VPack) to “pack” cir-
cuitry into logic clusters. Relative to prior work [Betz97a], this new
algorithm not only improves circuit speed, but also reduces the total
amount of routing required between logic blocks, resulting in
improved area-efficiency.
This paper is organized as follows. Section 2 introduces the struc-
ture of cluster-based logic blocks. In Section 3 we outline the
experimental methodology used to evaluate the utility of different
cluster sizes. Then, in Section 4 we explain why the area-delay
product is useful for evaluating the quality of each architecture.
Next, Section 5 describes the FPGA architecture and timing models
used in our experiments. Section 6 describes a new timing-driven
logic block packing algorithm (T-VPack) and explains the enhance-
ments it contains relative to an earlier CAD tool, VPack. In
Section 7 we present experimental results comparing VPack and T-
VPack, and the effect of various cluster sizes on FPGA area and
delay. Section 8 discusses potential sources of inaccuracies. Finally,
in Section 9 we present our conclusions.
2. Cluster-Based Logic Blocks
Cluster-based logic blocks, or logic clusters, are a generalized
version of the Logic Array Blocks used in Altera’s FLEX 8K and
FLEX 10K parts [Alte98]. Figure 1-a shows the structure of a basic
logic element or BLE [Betz98a], which consists of a 4-LUT plus a
flip-flop. A logic cluster consists of one or more BLEs, plus the
local routing required to connect them together. Figure 1-b shows
how the BLEs are connected. For clusters of size greater than one,
the architecture used is fully connected: each BLE input can be
connected to any of the cluster inputs or to the output of any of the
BLEs within the cluster. Clusters of size one (i.e. a cluster
containing a single BLE) do not contain local routing, and hence have
neither multiplexors on the BLE inputs nor local feedback paths.

Figure 1. Structure of basic logic element (BLE) and logic cluster:
a) Basic Logic Element (BLE); b) Logic Cluster Structure.

Following the convention of [Betz97a], we use two parameters to
describe a logic cluster, N and I, where N is the number of BLEs
per cluster and I is the number of inputs per cluster. In [Betz97a] it
is shown that setting I = 2 N + 2 is sufficient for complete logic
utilization, so we use this relation for all of our experiments.
3. Experimental Methodology
We use an empirical method to explore different FPGA architec-
tures. This involves technology-mapping, packing, placing, and
routing benchmark circuits¹ into realistic architectures with clusters
of size 1 through 20. We then estimate the area required by
each architecture to implement each benchmark circuit, and mea-
sure the speed of each implementation. At this point we have
enough information to judge the quality of each architecture.
3.1 CAD Flow
Figure 2 illustrates the CAD flow for our experiments. Each circuit
we use is logic-optimized by SIS [Sent92] and then technology-mapped
into 4-LUTs by FlowMap [Cong94]. VPack [Betz98b,
Betz97b, Betz99] or T-VPack is then used to group the LUTs and
registers into logic clusters of the desired size. Finally, we use VPR
[Betz98b, Betz97b, Betz99] to place and route each circuit. VPR’s
timing-driven router extracts the Elmore delay [Elmo48] of each
routed net, and performs a path-based timing analysis to determine
the delay of the circuit critical path. Finally, VPR uses a transistor-
based area model [Betz98b, Betz99] to estimate the total layout
area required by this FPGA.
¹ Our benchmarks consist of 20 of the largest MCNC circuits [Yang91] and
5 University of Toronto benchmark circuits [Leve98, Ye98, Gall98,
Padi98, Hame98]. The circuits range in size from 1047 to 8383 4-LUTs.
The MCNC circuits used are: alu4, apex2, apex4, bigkey, clma, des,
diffeq, dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417,
s38584.1, seq, spla, and tseng. The University of Toronto circuits used
are: des_fm, des_sis, marb, grayscale, and wood.
Figure 2. CAD Flow: Technology Map to 4-LUTs (FlowMap), Pack BLEs
into Logic Clusters (T-VPack) for a given Cluster Size (N), Placement
(VPR) using Cluster Size Dependent Architecture Models (based on
0.35 µm process), then Timing and Area Results.
In FPGA architecture and CAD research, it is convenient to have
tools which can vary the FPGA dimensions (number of columns
and rows) and channel width (number of tracks in each channel).
VPR allows this, and it also allows us to find the minimum channel
width required to successfully route a circuit. By allowing the
channel width to vary, and searching for the minimum routable
width, we can detect small improvements in FPGA architectures or
CAD algorithms that might otherwise go unnoticed. Compare this
to mapping a circuit into a fixed size FPGA - this would only tell
us if it fit or not. It is more difficult to draw architectural conclu-
sions from such a “binary” result.
VPR is capable of performing both high-stress and low-stress rout-
ings [Swar98]. A high-stress routing occurs when VPR routes a
given circuit into an FPGA with the minimum channel width
required for a successful routing. To accomplish this, VPR repeat-
edly routes each circuit with different channel widths, scaling the
architecture accordingly until it finds the minimum number of
tracks in which the circuit will route. A low-stress routing occurs
when an FPGA has significantly more routing resources than the
minimum required to route a given circuit. In our experiments we
define a low-stress routing to occur when there are 30% more
tracks per channel than the minimum required.
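
The search for the minimum channel width, and the low-stress width derived from it, can be sketched as follows. This is a minimal illustration and not VPR's actual search procedure: `is_routable` is a hypothetical stand-in for a complete routing attempt at a given channel width, the sketch assumes that routability is monotonic in channel width, and rounding the 30% margin up to a whole track is also an assumption.

```python
from typing import Callable


def min_channel_width(is_routable: Callable[[int], bool], start: int = 8) -> int:
    """Smallest W for which is_routable(W) is True (assumes monotonicity)."""
    # Double W until the circuit routes, which gives an upper bound.
    high = start
    while not is_routable(high):
        high *= 2
    low = 1
    # Binary search between the bounds for the smallest routable width.
    while low < high:
        mid = (low + high) // 2
        if is_routable(mid):
            high = mid
        else:
            low = mid + 1
    return low


def low_stress_width(w_min: int, extra_percent: int = 30) -> int:
    """Channel width with extra_percent more tracks than w_min, rounded up."""
    return (w_min * (100 + extra_percent) + 99) // 100


# Toy stand-in: pretend the circuit needs at least 23 tracks per channel.
w_min = min_channel_width(lambda w: w >= 23)
w_low = low_stress_width(w_min)
```

With the toy predicate above, `w_min` is 23 and `w_low` is 30. In a real flow each call to `is_routable` is an entire routing run, so the number of probes matters far more than the arithmetic.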
We feel that low-stress routings are indicative of how an FPGA
would generally be used (it is rare that a user will utilize 100% of
the routing and logic resources), so all of the results that we
present are based on low-stress routings. Additionally, the low-
stress and high-stress results are very similar, and both cases result
in the same conclusions.
4. Architecture Evaluation - Area-Delay Product
One metric that we will use to evaluate the quality of different
architectures is the area-delay product. We feel that there are two
reasons that this metric makes sense:
1. Intuitively, we want to find the point at which we are
sacrificing the least amount of area for the most
improvement in speed. Given that we can always trade
area for speed (see below), and speed for area, it makes
sense to combine these two factors into one curve to see
where the best trade-off occurs.
2. Much of the performance gain from using an FPGA is
derived from parallelizing functional units, rather than
raw clock speed. In this case, throughput = number of
functional units · clock rate. Another way of looking at
this is, throughput = (1/area per functional unit) · (1/delay).
Therefore if we minimize the area-delay product,
we will maximize throughput.
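
The second argument can be written out in one line. The following is a sketch under an added assumption: a fixed total area, written here as A_total, is available for functional units. The symbols A_total, A_unit (area per functional unit), and T_delay (critical path delay) are introduced for illustration and do not appear in the original text.

```latex
\begin{align*}
\text{throughput}
  &= (\text{number of functional units}) \cdot (\text{clock rate}) \\
  &= \frac{A_{\text{total}}}{A_{\text{unit}}} \cdot \frac{1}{T_{\text{delay}}}
  \;\propto\; \frac{1}{A_{\text{unit}} \cdot T_{\text{delay}}}
\end{align*}
```

For a fixed A_total, throughput is therefore inversely proportional to the area-delay product of a single functional unit.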
There are two main factors which can affect the area-delay product
of an FPGA: transistor sizing, and the FPGA architecture. In gen-
eral, the speed of an FPGA can be increased (to a point) by sizing
up the buffers and transistors within the FPGA, but this increases
area. Alternatively, the FPGA can be made smaller by sizing down
the buffers and transistors, but this degrades the FPGA perfor-
mance.
Throughout this paper, we will size the transistors in each FPGA
architecture to minimize the FPGA’s area-delay product. Only by
resizing transistors appropriately for each architecture in this way
can we fairly compute the speed and area-efficiency of FPGAs
with different logic block architectures.
5. Architecture Modeling
To evaluate the speed and area of an FPGA we must choose not
only the logic block architecture, but also a routing architecture
and transistor sizes. The following sections detail all of our archi-
tectural choices, which are provided to VPR in an architecture
description file [Betz98b, Betz99].
5.1 Basic Architecture
We investigate island-style FPGAs in which each logic block bor-
ders a routing channel on its four sides. Each circuit is mapped to
the smallest square FPGA with enough logic blocks and pads to
accommodate it. The FPGAs of Xilinx [Xili94], Lucent Technolo-
gies [Luce98], and Vantis [Vant98] employ an island-style archi-
tecture.
Delays, capacitances, and resistances of the FPGA circuitry are
obtained from SPICE [Meta92] simulations of TSMC’s 0.35 µm
CMOS process.
5.2 Routing Architecture
We define the number of logic blocks which a routing segment
spans as the logical length of that segment. [Betz98b, Betz99]
found that an architecture in which routing segments have a logical
length of four, with 50% of the segments connected by tri-state
buffers and 50% connected by pass-transistors, provides good
area-efficiency and speed for FPGAs containing logic clusters of
size four. An example of this routing architecture is shown in
Figure 3. We implicitly assume that this routing architecture is
good for architectures containing logic clusters of all sizes, and we
use this routing architecture in all of our experiments. Ideally, one
would like to find the best routing architecture for each FPGA
employing a different cluster size, but this would require a huge
amount of effort. By basing all of our experiments on this routing
architecture, we may slightly favor architectures with size four
clusters over other architectures.

Figure 3. FPGA Architecture with Length 4 Segments, and
50/50 Unbuffered/Buffered Switches.
5.3 Effect of Varying Cluster Size on FPGA Routing
Segment Length
As we increase the cluster size, both the logic area per cluster and
routing area per cluster grow. The logic cluster and its associated
routing is called a tile. Figure 4 demonstrates how a tile grows as
cluster size is increased. This increased tile size results in routing
segments with the same logical length having physically different
lengths for logic clusters of different sizes.
We define the measured length of a routing segment as its physical
length. There is a linear relation between the physical length of a
routing segment, and the resistance and capacitance of that seg-
ment. We have experimentally determined the average rate at
which the FPGA tiles grow with cluster size, and have used this
knowledge to appropriately scale the routing segment resistance
and capacitance values for the various cluster sizes.
5.4 Scaling Transistor and Buffers to Compensate for
Increased Segment Physical Length
To compensate for differences in the capacitance and resistance of
different length routing segments, we scale the routing pass-tran-
sistors and buffers. All of our transistor and buffer scaling is in
relation to a base architecture that has been area-delay optimized
for clusters of size four [Betz98b, Betz99]. From this base archi-
tecture, we linearly scale our buffers and pass transistors depend-
ing on the relation between the new segment lengths and the base
segment length. For example, in an FPGA with size 16 clusters, the
physical segment length is approximately 2x longer than in an
architecture with size 4 clusters. To maintain roughly the same
routing speed, we increase the size of the routing switches con-
necting to each wire by a factor of 2. In Section 7.2 we verify that
this linear scaling of buffers and pass-transistors with segment
length provides the best results.
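
The linear scaling described in Sections 5.3 and 5.4 can be sketched as below. This is an illustration only: the `RoutingParams` type, the function name, and the base numbers are assumptions made for the example and are not values from the paper. Only the 2x length ratio for size 16 clusters relative to size 4 clusters comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingParams:
    segment_resistance: float   # ohms per routing segment
    segment_capacitance: float  # farads per routing segment
    switch_size: float          # routing switch size, relative to the base size


def scale_for_segment_length(base: RoutingParams, length_ratio: float) -> RoutingParams:
    """Scale wire R and C, and the switch size, linearly with physical segment length."""
    return RoutingParams(
        segment_resistance=base.segment_resistance * length_ratio,
        segment_capacitance=base.segment_capacitance * length_ratio,
        switch_size=base.switch_size * length_ratio,
    )


# Placeholder base values for the size-4 cluster architecture (not from the paper).
base = RoutingParams(segment_resistance=100.0, segment_capacitance=1e-13, switch_size=1.0)
# The text states that segments are roughly 2x longer with size-16 clusters.
n16 = scale_for_segment_length(base, 2.0)
```

Here the wire resistance and capacitance both double, and the switches driving each wire are doubled in size to compensate.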
In our architecture models, we account for variations in delay
caused by resizing buffers and pass-transistors. Also, changes in
area due to the use of different sizes of routing pass-transistors and
inverter chains are automatically calculated by VPR.

Figure 4. Effect of Increased Cluster Size on Segment Length (as
cluster size increases, the area per cluster, the channel width, and
the physical segment length all increase).
5.5 Varying Fc,in and Fc,out with Logic Cluster Size
In [Rose91] it is shown that Fc = W is good for logic clusters of
size one; i.e. each logic block pin can be connected to any routing
track in an adjacent channel. As cluster size increases, setting
Fc = W provides more flexibility than is required, wasting area.
In [Betz98b, Betz99] it is shown that setting Fc on the input pins
(Fc,in) to 2·W/N and Fc on the output pins (Fc,out) to W/N
provides a good level of routing flexibility, so all of our
experiments use these values for clusters of sizes other than one.
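
As a small worked example of these rules, the sketch below computes Fc,in and Fc,out for a given channel width W and cluster size N. The function name is hypothetical, and the sketch returns fractional track counts; how fractional values are rounded to whole tracks is not stated in the text and is left out here.

```python
from typing import Tuple


def fc_values(w: int, n: int) -> Tuple[float, float]:
    """Return (Fc_in, Fc_out), in tracks, for channel width w and cluster size n."""
    if n == 1:
        # Single-BLE clusters: every pin can reach every track (Fc = W).
        return float(w), float(w)
    # Larger clusters: Fc_in = 2W/N and Fc_out = W/N.
    return 2.0 * w / n, w / n


fc_in, fc_out = fc_values(40, 4)
```

For W = 40 and N = 4, each input pin connects to 20 tracks and each output pin to 10 tracks.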
5.6 Detailed Logic Cluster Structure.
In Figure 5 we show the structure of a logic cluster and the cir-
cuitry connecting the logic clusters to the main FPGA routing.
Table 1 shows delay values for selected cluster sizes. The
multiplexor, buffer, LUT, and flip-flop delays were obtained by
modeling the structures in SPICE [Meta92] with TSMC’s 0.35 µm
process parameters.
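Figure 5. Detailed Logic Cluster Structure

Table 1: Selected Logic Cluster Delay Values (in picoseconds),
0.35 µm CMOS
[Table body not recoverable from the source. The surviving entries
are the path labels "A to B" and "B to D", with an A to B delay of
761 in every surviving row.]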
6. Packing Algorithms
The packing step (in Figure 2) takes a netlist consisting of LUTs
and flip-flops and produces a netlist consisting of logic clusters.
This involves combining the LUTs and flip-flops into BLEs, and
then grouping the BLEs into logic clusters.
There are two main constraints that packing algorithms must meet:
1. The number of BLEs must be less than or equal to the
cluster size, N.
2. The number of distinct inputs generated outside the
cluster and used as inputs to BLEs within the cluster
must be less than or equal to the number of cluster
inputs, I.
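
The two constraints can be checked with a few lines of code. The following is a minimal sketch, assuming each BLE is described only by the nets on its inputs and the net on its output. The `BLE` type and function names are illustrative, and clock nets are ignored.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class BLE:
    name: str
    inputs: FrozenSet[str]  # nets feeding the LUT inputs
    output: str             # net driven by the BLE


def cluster_inputs(n: int) -> int:
    """Number of cluster inputs used in the experiments, I = 2N + 2."""
    return 2 * n + 2


def external_inputs(cluster: List[BLE]) -> Set[str]:
    """Distinct nets used inside the cluster but generated outside it."""
    produced = {b.output for b in cluster}
    used = set().union(*(b.inputs for b in cluster))
    return used - produced


def is_feasible(cluster: List[BLE], n: int, i: int) -> bool:
    """Constraint 1: at most n BLEs. Constraint 2: at most i external inputs."""
    return len(cluster) <= n and len(external_inputs(cluster)) <= i


a = BLE("a", frozenset({"x", "y"}), "p")
b = BLE("b", frozenset({"p", "z"}), "q")
ok = is_feasible([a, b], n=2, i=cluster_inputs(2))
```

Net `p` is generated by `a` inside the cluster, so it does not count against I; the cluster `[a, b]` needs only the three external inputs `x`, `y`, and `z`.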
In this section, we present two packing algorithms, VPack
[Betz97b, Betz98b, Betz99], and T-VPack. Then we show that our
new T-VPack algorithm outperforms the original VPack algorithm
in both area and critical path delay.
6.1 Input-Sharing VPack Algorithm
The original VPack algorithm has two optimization goals. The first
is to pack each logic cluster to its capacity in order to minimize the
number of clusters needed. The second goal is to minimize the
number of inputs to each cluster in order to reduce the number of
connections required between clusters.
VPack uses a greedy algorithm to construct each cluster sequentially.
At the start of each cluster operation, VPack selects as a
“seed” an unclustered BLE with the most used inputs, and then
places this “seed” into a cluster C. Then VPack selects a new BLE,
B, to pack into C based on the attraction that B has to C. Attraction
is determined by the number of inputs and outputs that B and C
have in common:
$$\mathrm{Attraction}(B) = |\mathrm{Nets}(B) \cap \mathrm{Nets}(C)| \tag{1.1}$$
After each cluster reaches capacity, packing begins on a new cluster.
The process terminates when there are no more unclustered
BLEs left. The time complexity of this algorithm is O(k_max · n)
(where n is the number of BLEs in the circuit and k_max is the
fanout of the highest fanout net), which results in an execution time
of about four seconds to pack the largest circuit (clma) on a 296
MHz UltraSPARC-II processor.
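
The greedy loop described in this section can be sketched as follows. This is a simplified illustration, not the VPack source: the `BLE` type is an assumption made for the example, ties are broken by list order, and the sketch rescans all unclustered BLEs at every step, so it does not achieve the O(k_max · n) complexity quoted above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class BLE:
    name: str
    inputs: FrozenSet[str]
    output: str

    @property
    def nets(self) -> FrozenSet[str]:
        return self.inputs | {self.output}


def external_inputs(cluster: List[BLE]) -> Set[str]:
    produced = {b.output for b in cluster}
    return set().union(*(b.inputs for b in cluster)) - produced


def vpack(bles: List[BLE], n: int, i: int) -> List[List[BLE]]:
    """Greedy packing into clusters of at most n BLEs and i external inputs."""
    unclustered = list(bles)
    clusters: List[List[BLE]] = []
    while unclustered:
        # Seed: an unclustered BLE with the most used inputs.
        seed = max(unclustered, key=lambda b: len(b.inputs))
        unclustered.remove(seed)
        cluster = [seed]
        while len(cluster) < n:
            cluster_nets = set().union(*(b.nets for b in cluster))
            # Only consider BLEs that keep the cluster within its input limit.
            candidates = [b for b in unclustered
                          if len(external_inputs(cluster + [b])) <= i]
            if not candidates:
                break
            # Attraction(B) = |Nets(B) intersect Nets(C)|, Equation (1.1).
            best = max(candidates, key=lambda b: len(b.nets & cluster_nets))
            unclustered.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    return clusters


netlist = [
    BLE("a", frozenset({"i1", "i2", "i3"}), "n1"),
    BLE("b", frozenset({"n1", "i4"}), "n2"),
    BLE("c", frozenset({"i5"}), "n3"),
    BLE("d", frozenset({"n3", "i5"}), "n4"),
]
packed = [[b.name for b in c] for c in vpack(netlist, n=2, i=6)]
```

On this toy netlist the result is `[["a", "b"], ["d", "c"]]`: `a` is the first seed because it has the most inputs, and `b` joins it because it shares net `n1`.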
6.2 Timing-Driven T-VPack Algorithm
Our new packing algorithm is based on the original VPack algo-
rithm, but its optimization goal is minimizing the number of exter-
nal connections (connections between clusters) on the critical path.
The reasoning behind this is that external connections have higher
delay than internal connections (connections within a cluster), so
by reducing the number of external nets on the critical path, we
will reduce the circuit delay. The first stage of this algorithm
involves computing which connections are on the critical path. We
then sequentially pack BLEs along the critical path into logic clus-
ters and recompute which BLEs are critical.
6.2.1 An Overview of Slack and Criticality Calculation
The first step in determining which nets are critical is to determine
the slack of each connection [Hitc83, Fran92]. Slack is defined as
the amount of delay which can be added to a connection without
increasing the delay of the entire circuit.
Calculating slack involves computing the arrival time, T_arrival, and
the required arrival time, T_required, at all BLE input pins. This is
accomplished using two breadth-first traversals of the circuit; the
first traversal propagates T_arrival forward from input pins and
register outputs (Sources), and the second propagates T_required back
from output pins and register inputs (Sinks). The slack of a
connection driving a BLE input pin, i, is defined as:
$$\mathrm{slack}(i) = T_{required}(i) - T_{arrival}(i) \tag{1.2}$$
Finally, we define the criticality of the connection driving input i
as:
$$\mathrm{Connection\_Criticality}(i) = 1 - \frac{\mathrm{slack}(i)}{\mathrm{MaxSlack}} \tag{1.3}$$
where MaxSlack is the largest slack amongst all point-to-point
connections in the entire circuit.
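
A minimal sketch of the slack and criticality calculation is given below, under several simplifying assumptions: the circuit is an acyclic graph whose nodes are blocks, all delay is lumped onto the connections, sources arrive at time zero, and every node must meet the critical path delay. The sketch orders nodes with Python's `graphlib.TopologicalSorter` rather than the breadth-first traversal described in the text, and the function name is illustrative. Treating every connection as fully critical when MaxSlack is zero is also an assumption.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (driver, sink)


def connection_criticalities(delays: Dict[Edge, float]) -> Dict[Edge, float]:
    """Map each connection to 1 - slack / MaxSlack."""
    preds: Dict[str, Set[str]] = {}
    for u, v in delays:
        preds.setdefault(u, set())
        preds.setdefault(v, set()).add(u)
    order = list(TopologicalSorter(preds).static_order())  # drivers first

    # Forward pass: latest arrival time at each node.
    arrival = {node: 0.0 for node in order}
    for v in order:
        for u in preds[v]:
            arrival[v] = max(arrival[v], arrival[u] + delays[(u, v)])

    # Backward pass: earliest required time at each node.
    d_max = max(arrival.values())
    required = {node: d_max for node in order}
    for v in reversed(order):
        for u in preds[v]:
            required[u] = min(required[u], required[v] - delays[(u, v)])

    # Slack of a connection: required time at its sink minus its arrival time.
    slack = {(u, v): required[v] - (arrival[u] + d)
             for (u, v), d in delays.items()}
    max_slack = max(slack.values())
    if max_slack == 0:
        return {edge: 1.0 for edge in slack}
    return {edge: 1.0 - s / max_slack for edge, s in slack.items()}


crit = connection_criticalities({("a", "c"): 1.0, ("b", "c"): 3.0, ("c", "d"): 1.0})
```

In this example the path b, c, d sets the critical delay of 4, so connections (b, c) and (c, d) have criticality 1, while (a, c) has the largest slack (2) and therefore criticality 0.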
6.2.2 Delay Estimates of an Unplaced and Unrouted Circuit
To obtain a good packing solution¹, the T-VPack algorithm models
three types of delay: the delay through a BLE, or logic_delay; the
connection delay between blocks within the same cluster, or
intra_cluster_connection_delay; and the connection delay between
blocks that are in different clusters, or
inter_cluster_connection_delay. We experimentally determined
that setting logic_delay = 0.1, intra_cluster_connection_delay = 0.1,
and inter_cluster_connection_delay = 1.0 results in the clustered
circuits having the smallest delay after placement and routing by
VPR².
6.2.3 The Attraction Function
We extend the attraction function from the original VPack algo-
rithm to include timing information. The first BLE that is placed
into a cluster is the unclustered BLE that is driven by the most
critical connection in the circuit. Then, based on our attraction
function (Equation 1.8, below) we add the most attractive BLEs to the
cluster. We repeat this absorption until either no more BLEs will fit
into the cluster, or all of the cluster inputs are used. Once a cluster
is full, we start a new cluster with a new seed, and repeat the
process until there are no unclustered BLEs left in the circuit. We next
describe how blocks are selected for absorption.
We define the base criticality of each unclustered BLE, B, or
Base_BLE_Criticality(B), to be the maximum
Connection_Criticality value of all connections joining B to BLEs
within the cluster currently being packed, C. If B does not have any
connections to C then the base criticality score is zero. In Figure 6
we illustrate how the Base_BLE_Criticality values are assigned.
We have labelled each connection between unclustered BLEs and
BLEs within the cluster with a criticality value. Notice how the
base criticality of each BLE is assigned the highest criticality value
of all its connections to the cluster.

Figure 6. BLE Base Criticality Assignment

¹ A good packing solution is one that results in the smallest delay
after being placed and routed by VPR.
² Note that these delay values are only used in the packing process.
After packing is complete, VPR places and routes the circuits and
extracts the real (Elmore) delay of each routed net. All of the delay
results that we present in this paper are computed by VPR.
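
The base criticality rule can be sketched in a few lines. This is an illustration under assumed data structures: connections are (driver, sink) pairs of BLE names mapped to their Connection_Criticality, and the example values are made up rather than taken from Figure 6.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # (driver BLE, sink BLE)


def base_ble_criticality(ble: str, cluster: Set[str],
                         connection_criticality: Dict[Edge, float]) -> float:
    """Maximum criticality over connections joining `ble` to BLEs in `cluster`.

    Returns 0.0 when `ble` has no connection to the cluster.
    """
    values = [crit for (u, v), crit in connection_criticality.items()
              if (u == ble and v in cluster) or (v == ble and u in cluster)]
    return max(values, default=0.0)


# Hypothetical criticalities; X and Y are already in the cluster being packed.
crits = {("X", "A"): 0.8, ("A", "Y"): 0.5, ("B", "Y"): 0.3, ("B", "Z"): 0.9}
cluster = {"X", "Y"}
scores = {name: base_ble_criticality(name, cluster, crits) for name in ("A", "B", "D")}
```

BLE `A` scores 0.8 (its most critical link to the cluster), `B` scores 0.3 because its 0.9 connection goes to `Z`, which is outside the cluster, and `D` scores 0.0 because it has no connection to the cluster at all.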
When selecting which BLE to absorb into a cluster there is a high
potential for multiple BLEs to have the same base criticality value.
We use a tie-breaker mechanism to select which BLEs are the most
beneficial to pack. This mechanism is based on the desire to pack
BLEs together in a manner that most effectively reduces the
number of BLEs remaining on the critical paths. This is best illus-
trated by an example.
In Figure 7 we have darkened connections and BLEs on the critical
paths. Notice that when selecting which BLEs to place into a cluster,
it is more beneficial to absorb certain critical BLEs over other
critical BLEs. In this case, absorbing BLEs H, I, and J would be
much more beneficial than absorbing BLEs A, D, and F. We can
see that absorbing H, I, and J affects the criticality of seven BLEs
(A, B, C, D, E, F, and G), while absorbing A, D, and F would only
affect the criticality of three BLEs (H, I, and J). Clearly it is best to
cluster BLEs that reduce the criticalities of the most other BLEs.
We define three variables that keep track of the number of critical
paths that each BLE in the circuit affects. First we define
input_paths_affected as the number of critical paths between
sources in the circuit and the BLE currently being labelled. Next
we define output_paths_affected as the number of critical paths
between the sinks in the circuit and the BLE currently being
labelled. Finally, we define total_paths_affected as the sum of the
previous two variables. The calculation of these variables is
explained below.
The BLE labels in Figure 7 demonstrate the input_paths_affected
value for each BLE. We assign any sources that are on the critical
paths an input_paths_affected value of one, and all other
sources are set to zero. Then we perform a breadth-first traversal of
the circuit starting at the sources, and define the
input_paths_affected value as in (1.4).

Figure 7. Criticality Tie-Breakers

References
Journal ArticleDOI

The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers

TL;DR: It is found possible to define delay time and rise time in such a way that these quantities can be computed very simply from the Laplace system function of the network.
Book

Architecture and CAD for Deep-Submicron FPGAs

TL;DR: From the Publisher: Architecture and CAD for Deep-Submicron FPGAs addresses several key issues in the design of high-performance FPGA architectures and CAD tools, with particular emphasis on issues that are important for FPGAs implemented in deep-submicron processes.
Book

Principles of CMOS VLSI Design: A Systems Perspective

TL;DR: CMOS Circuit and Logic Design: The Complementary CMOS Inverter-DC Characteristics and Design Strategies.

Book ChapterDOI

VPR: A new packing, placement and routing tool for FPGA research

TL;DR: In terms of minimizing routing area, VPR outperforms all published FPGA place and route tools to which the authors can compare and presents placement and routing results on a new set of circuits more typical of today's industrial designs.