What is the reason for the improvement in circuit speed at larger cluster sizes?

Circuit speed improves at larger cluster sizes partly because more critical connections become local within clusters, and partly because the “logical” Manhattan distance between BLEs is reduced.

What metric is used to evaluate an FPGA?

Using the area-delay product evaluation metric, the authors demonstrated that clusters of size seven to ten are the best size to use when constructing an FPGA.

What is the name of the book?

Xilinx Inc., “Virtex 2.5 V Field Programmable Gate Arrays”, Advance Product Data Sheet, 1998. S. Yang, “Logic Synthesis and Optimization Benchmarks, Version 3.0,” Tech.

What size is used to determine the critical path delay?

The authors have repeated the experiments described in Section 7.1 using transistor and buffer sizes of one-half and double the sizes used in Section 7.1.

(Open Access) Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density (1999) | Alexander Marquardt

Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density

Alexander (Sandy) Marquardt, Vaughn Betz, and Jonathan Rose

Department of Electrical and Computer Engineering

University of Toronto

Toronto, ON, Canada M5S 3G4

{arm, vaughn, jayar}@eecg.toronto.edu

Abstract

In this paper, we investigate the speed and area-efficiency of

FPGAs employing “logic clusters” containing multiple LUTs and

registers as their logic block. We introduce a new, timing-driven

tool (T-VPack) to “pack” LUTs and registers into these logic

clusters, and we show that this algorithm is superior to an existing

packing algorithm. Then, using a realistic routing architecture and

sophisticated delay and area models, we empirically evaluate

FPGAs composed of clusters ranging in size from one to twenty

LUTs, and show that clusters of size seven through ten provide the

best area-delay trade-off. Compared to circuits implemented in an

FPGA composed of size one clusters, circuits implemented in an

FPGA with size seven clusters have 30% less delay (a 43% increase

in speed) and require 8% less area, and circuits implemented in an

FPGA with size ten clusters have 34% less delay (a 52% increase in

speed), and require no additional area.

1. Introduction

Much of the speed and area-efficiency of an FPGA is determined by

the logic block it employs. If a very small, or fine-grained, logic

block is used, many connections must be routed between the

numerous logic blocks [Rose93]. Since routing consumes most of

the area and accounts for most of the delay in FPGAs, a small logic

block often results in poor area-efficiency and speed due to the

excessive routing required to connect all the logic blocks. If, on the

other hand, a very large, or coarse-grained, logic block is employed,

the logic block area and delay may become excessive, again resulting

in poor area-efficiency and speed [Rose93]. Choosing the best

size, or granularity, for an FPGA logic block therefore involves bal-

ancing complex trade-offs.

In this work we determine the best size for “cluster-based” logic

blocks, which we refer to as “logic clusters”. This style of logic

block is of interest for several reasons. First, the Altera Flex series

FPGAs [Alte98], the Xilinx 5200 and Virtex FPGAs [Xili97,

Xili98], and the Vantis VF1 FPGAs [Vant98] all employ cluster-based

logic blocks, so research concerning the best size of logic

clusters is of clear commercial interest. Second, prior research

[Betz98a] has shown that the area-efficiency of large logic clusters

is quite competitive with that of FPGAs using single look-up table

(LUT) logic blocks. Third, an FPGA composed of large logic clus-

ters requires fewer logic blocks to implement a circuit than an

FPGA using a more fine-grained block. This reduces the size of the

placement and routing problem, and hence design compile time -

an increasingly important concern as the logic capacity of FPGAs

rises. Finally, we show in this paper that cluster-based logic blocks

can improve FPGA speed compared to single-LUT logic blocks by

reducing the number of connections on the critical path that must be

routed between logic blocks.

Prior research [Betz98a] has focused only on the area-efficiency of

different sizes of logic clusters. In this work, we simultaneously

examine both the area-efficiency and the speed of FPGAs using dif-

ferent logic cluster sizes. Since both speed and density are crucial in

modern FPGAs, only by examining both issues can we determine

the best logic cluster size. As well, we use a more complex and

realistic routing architecture than [Betz98a] in our investigations,

leading to more accurate architectural conclusions. Finally, we

present a new, timing-driven algorithm (T-VPack) to “pack” cir-

cuitry into logic clusters. Relative to prior work [Betz97a], this new

algorithm not only improves circuit speed, but also reduces the total

amount of routing required between logic blocks, resulting in

improved area-efficiency.

This paper is organized as follows. Section 2 introduces the struc-

ture of cluster-based logic blocks. In Section 3 we outline the

experimental methodology used to evaluate the utility of different

cluster sizes. Then, in Section 4 we explain why the area-delay

product is useful for evaluating the quality of each architecture.

Next, Section 5 describes the FPGA architecture and timing models

used in our experiments. Section 6 describes a new timing-driven

logic block packing algorithm (T-VPack) and explains the enhance-

ments it contains relative to an earlier CAD tool, VPack. In

Section 7 we present experimental results comparing VPack and T-

VPack, and the effect of various cluster sizes on FPGA area and

delay. Section 8 discusses potential sources of inaccuracies. Finally,

in Section 9 we present our conclusions.

2. Cluster-Based Logic Blocks

Cluster-based logic blocks, or logic clusters, are a generalized version

of the Logic Array Blocks used in Altera’s FLEX 8K and

FLEX 10K parts [Alte98]. Figure 1-a shows the structure of a basic

logic element or BLE [Betz98a], which consists of a 4-LUT plus a

flip-flop. A logic cluster consists of one or more BLEs, plus the

local routing required to connect them together. Figure 1-b shows

how the BLEs are connected. For clusters of size greater than one,

the architecture used is fully connected: each BLE input can be

connected to any of the cluster inputs or to the output of any of the

BLEs within the cluster. Clusters of size one (i.e. a cluster containing

a single BLE) do not contain local routing, and hence have neither

multiplexors on the BLE inputs nor local feedback paths.

Figure 1. Structure of basic logic element (BLE) and logic cluster: a) Basic Logic Element (BLE); b) Logic Cluster Structure.

Following the convention of [Betz97a], we use two parameters to

describe a logic cluster, N and I, where N is the number of BLEs

per cluster and I is the number of inputs per cluster. In [Betz97a] it

is shown that setting I = 2 N + 2 is sufficient for complete logic

utilization, so we use this relation for all of our experiments.
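As a small illustration of the relation I = 2N + 2, the sketch below tabulates the number of cluster inputs for a few cluster sizes. The function name `cluster_inputs` is hypothetical and is not taken from VPack or VPR.

```python
def cluster_inputs(n_bles: int) -> int:
    """Cluster inputs I for a cluster of N BLEs, using the relation I = 2N + 2."""
    if n_bles < 1:
        raise ValueError("a cluster holds at least one BLE")
    return 2 * n_bles + 2


if __name__ == "__main__":
    # Print I for a few of the cluster sizes explored in the experiments.
    for n in (1, 4, 7, 10, 20):
        print(f"N = {n:2d}  ->  I = {cluster_inputs(n)}")
```

For N = 1 the relation gives I = 4, which matches the four inputs of a single 4-LUT BLE.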

3. Experimental Methodology

We use an empirical method to explore different FPGA architec-

tures. This involves technology-mapping, packing, placing, and

routing benchmark circuits¹ into realistic architectures with clusters

of size 1 through 20. We then estimate the area required by

each architecture to implement each benchmark circuit, and mea-

sure the speed of each implementation. At this point we have

enough information to judge the quality of each architecture.

3.1 CAD Flow

Figure 2 illustrates the CAD flow for our experiments. Each circuit

we use is logic-optimized by SIS [Sent92] and then technology-mapped

into 4-LUTs by FlowMap [Cong94]. VPack [Betz98b,

Betz97b, Betz99] or T-VPack is then used to group the LUTs and

registers into logic clusters of the desired size. Finally, we use VPR

[Betz98b, Betz97b, Betz99] to place and route each circuit. VPR’s

timing-driven router extracts the Elmore delay [Elmo48] of each

routed net, and performs a path-based timing analysis to determine

the delay of the circuit critical path. Finally, VPR uses a transistor-

based area model [Betz98b, Betz99] to estimate the total layout

area required by this FPGA.

¹ Our benchmarks consist of 20 of the largest MCNC circuits [Yang91] and

5 University of Toronto benchmark circuits [Leve98, Ye98, Gall98,

Padi98, Hame98]. The circuits range in size from 1047 to 8383 4-LUTs.

The MCNC circuits used are: alu4, apex2, apex4, bigkey, clma, des, diffeq,

dsip, elliptic, ex1010, ex5p, frisc, misex3, pdc, s298, s38417,

s38584.1, seq, spla, and tseng. The University of Toronto circuits used

are: des_fm, des_sis, marb, grayscale, and wood.

Figure 2. CAD Flow. Stages shown: Technology Map to 4-LUTs (FlowMap); Pack BLEs into Logic Clusters (T-VPack), with Cluster Size (N) as an input; Placement (VPR); Cluster Size Dependent Architecture Models (based on 0.35 µm process); Timing and Area Results.

In FPGA architecture and CAD research, it is convenient to have

tools which can vary the FPGA dimensions (number of columns

and rows) and channel width (number of tracks in each channel).

VPR allows this, and it also allows us to find the minimum channel

width required to successfully route a circuit. By allowing the

channel width to vary, and searching for the minimum routable

width, we can detect small improvements in FPGA architectures or

CAD algorithms that might otherwise go unnoticed. Compare this

to mapping a circuit into a fixed size FPGA - this would only tell

us if it fit or not. It is more difficult to draw architectural conclu-

sions from such a “binary” result.

VPR is capable of performing both high-stress and low-stress rout-

ings [Swar98]. A high-stress routing occurs when VPR routes a

given circuit into an FPGA with the minimum channel width

required for a successful routing. To accomplish this, VPR repeat-

edly routes each circuit with different channel widths, scaling the

architecture accordingly until it finds the minimum number of

tracks in which the circuit will route. A low-stress routing occurs

when an FPGA has significantly more routing resources than the

minimum required to route a given circuit. In our experiments we

define a low-stress routing to occur when there are 30% more

tracks per channel than the minimum required.
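The two routing regimes can be sketched as follows. This is only an illustration: `routes` is a hypothetical predicate standing in for a full routing attempt at a given channel width, the binary search assumes that routability is monotonic in the channel width (the text above does not say how VPR actually searches), and rounding the 30% margin up to a whole track is also an assumption.

```python
from typing import Callable


def min_channel_width(routes: Callable[[int], bool], upper: int) -> int:
    """Smallest width W in [1, upper] with routes(W) True (high-stress routing).

    Assumes that if a circuit routes at width W it also routes at any larger width.
    """
    if not routes(upper):
        raise ValueError("circuit does not route even at the upper bound")
    lo, hi = 1, upper
    while lo < hi:
        mid = (lo + hi) // 2
        if routes(mid):
            hi = mid        # mid works, so the minimum is at or below mid
        else:
            lo = mid + 1    # mid fails, so the minimum is above mid
    return lo


def low_stress_width(w_min: int) -> int:
    """Channel width with 30% more tracks than the minimum, rounded up (assumption)."""
    return (w_min * 13 + 9) // 10  # integer ceiling of 1.3 * w_min


if __name__ == "__main__":
    # Toy stand-in for a router: pretend the circuit needs at least 24 tracks.
    w_min = min_channel_width(lambda w: w >= 24, upper=128)
    print(w_min, low_stress_width(w_min))
```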

We feel that low-stress routings are indicative of how an FPGA

would generally be used (it is rare that a user will utilize 100% of

the routing and logic resources), so all of the results that we

present are based on low-stress routings. Additionally, the low-

stress and high-stress results are very similar, and both cases result

in the same conclusions.

4. Architecture Evaluation - Area-Delay Product

One metric that we will use to evaluate the quality of different

architectures is the area-delay product. We feel that there are two

reasons that this metric makes sense:

1. Intuitively, we want to find the point at which we are

sacrificing the least amount of area for the most

improvement in speed. Given that we can always trade

area for speed (see below), and speed for area, it makes

sense to combine these two factors into one curve to see

where the best trade-off occurs.

2. Much of the performance gain from using an FPGA is

derived from parallelizing functional units, rather than

raw clock speed. In this case, $\text{throughput} = \text{number of functional units} \cdot \text{clock rate}$.

Another way of looking at this is,

$\text{throughput} = (1 / \text{area per functional unit}) \cdot (1 / \text{delay})$.

Therefore if we minimize the area-delay product,

we will maximize throughput.
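As a worked check of this metric, the sketch below applies it to the relative figures quoted in the abstract (size seven clusters: 30% less delay and 8% less area; size ten clusters: 34% less delay and no additional area, both relative to size one clusters). The function names are hypothetical; only the percentages come from the paper.

```python
def speed_increase(delay_reduction: float) -> float:
    """Fractional speed increase implied by a fractional delay reduction."""
    return 1.0 / (1.0 - delay_reduction) - 1.0


def relative_area_delay(area_change: float, delay_reduction: float) -> float:
    """Area-delay product relative to the baseline (baseline = 1.0)."""
    return (1.0 + area_change) * (1.0 - delay_reduction)


if __name__ == "__main__":
    for name, area_change, delay_reduction in (
        ("size 7", -0.08, 0.30),
        ("size 10", 0.00, 0.34),
    ):
        print(
            name,
            f"speed increase = {speed_increase(delay_reduction):.0%}",
            f"relative area-delay = {relative_area_delay(area_change, delay_reduction):.3f}",
        )
```

A 30% delay reduction corresponds to a speed factor of 1/0.70, which is the 43% speed increase quoted in the abstract, and 1/0.66 gives the quoted 52%.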

There are two main factors which can affect the area-delay product

of an FPGA: transistor sizing, and the FPGA architecture. In gen-

eral, the speed of an FPGA can be increased (to a point) by sizing

up the buffers and transistors within the FPGA, but this increases

area. Alternatively, the FPGA can be made smaller by sizing down

the buffers and transistors, but this degrades the FPGA perfor-

mance.

Throughout this paper, we will size the transistors in each FPGA

architecture to minimize the FPGA’s area-delay product. Only by

resizing transistors appropriately for each architecture in this way

can we fairly compute the speed and area-efficiency of FPGAs

with different logic block architectures.

5. Architecture Modeling

To evaluate the speed and area of an FPGA we must choose not

only the logic block architecture, but also a routing architecture

and transistor sizes. The following sections detail all of our architectural

choices, which are provided to VPR in an architecture

description file [Betz98b, Betz99].

5.1 Basic Architecture

We investigate island-style FPGAs in which each logic block bor-

ders a routing channel on its four sides. Each circuit is mapped to

the smallest square FPGA with enough logic blocks and pads to

accommodate it. The FPGAs of Xilinx [Xili94], Lucent Technolo-

gies [Luce98], and Vantis [Vant98] employ an island-style archi-

tecture.

Delays, capacitances, and resistances of the FPGA circuitry are

obtained from SPICE [Meta92] simulations of TSMC’s 0.35 µm

CMOS process.

5.2 Routing Architecture

We define the number of logic blocks which a routing segment

spans as the logical length of that segment. [Betz98b, Betz99]

found that an architecture in which routing segments have a logical

length of four, with 50% of the segments connected by tri-state

buffers and 50% connected by pass-transistors, provides good

area-efficiency and speed for FPGAs containing logic clusters of

size four. An example of this routing architecture is shown in

Figure 3. We implicitly assume that this routing architecture is

good for architectures containing logic clusters of all sizes, and we

use this routing architecture in all of our experiments. Ideally, one

would like to find the best routing architecture for each FPGA

employing a different cluster size, but this would require a huge

amount of effort. By basing all of our experiments on this routing

architecture, we may slightly favor architectures with size four

clusters over other architectures.

Figure 3. FPGA Architecture with Length 4 Segments, and 50/50 Unbuffered/Buffered Switches.

5.3 Effect of Varying Cluster Size on FPGA Routing

Segment Length

As we increase the cluster size, both the logic area per cluster and

routing area per cluster grow. The logic cluster and its associated

routing is called a tile. Figure 4 demonstrates how a tile grows as

cluster size is increased. This increased tile size results in routing

segments with the same logical length having physically different

lengths for logic clusters of different sizes.

We define the measured length of a routing segment as its physical

length. There is a linear relation between the physical length of a

routing segment, and the resistance and capacitance of that seg-

ment. We have experimentally determined the average rate at

which the FPGA tiles grow with cluster size, and have used this

knowledge to appropriately scale the routing segment resistance

and capacitance values for the various cluster sizes.

5.4 Scaling Transistor and Buffers to Compensate for

Increased Segment Physical Length

To compensate for differences in the capacitance and resistance of

different length routing segments, we scale the routing pass-tran-

sistors and buffers. All of our transistor and buffer scaling is in

relation to a base architecture that has been area-delay optimized

for clusters of size four [Betz98b, Betz99]. From this base archi-

tecture, we linearly scale our buffers and pass transistors depend-

ing on the relation between the new segment lengths and the base

segment length. For example, in an FPGA with size 16 clusters, the

physical segment length is approximately 2x longer than in an

architecture with size 4 clusters. To maintain roughly the same

routing speed, we increase the size of the routing switches con-

necting to each wire by a factor of 2. In Section 7.2 we verify that

this linear scaling of buffers and pass-transistors with segment

length provides the best results.
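The linear scaling described in Sections 5.3 and 5.4 can be sketched as below. The `RoutingParams` container and the normalized base values are hypothetical; the paper's measured tile growth rates and absolute resistance, capacitance, and switch sizes are not reproduced here. The only relation taken from the text is that wire resistance, wire capacitance, and routing switch size all scale linearly with the physical segment length relative to the size four base architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingParams:
    """Normalized routing parameters for one architecture (hypothetical container)."""
    wire_resistance: float
    wire_capacitance: float
    switch_size: float


def scale_for_segment_length(base: RoutingParams, length_ratio: float) -> RoutingParams:
    """Scale wire R, wire C, and switch size linearly with physical segment length.

    length_ratio is the new physical segment length divided by the base
    (size four cluster) segment length.
    """
    if length_ratio <= 0:
        raise ValueError("length_ratio must be positive")
    return RoutingParams(
        wire_resistance=base.wire_resistance * length_ratio,
        wire_capacitance=base.wire_capacitance * length_ratio,
        switch_size=base.switch_size * length_ratio,
    )


if __name__ == "__main__":
    base = RoutingParams(wire_resistance=1.0, wire_capacitance=1.0, switch_size=1.0)
    # Size 16 clusters: segments are roughly 2x longer than with size 4 clusters.
    print(scale_for_segment_length(base, 2.0))
```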

In our architecture models, we account for variations in delay

caused by resizing buffers and pass-transistors. Also, changes in

area due to the use of different sizes of routing pass-transistors and

inverter chains are automatically calculated by VPR.

Figure 4. Effect of Increased Cluster Size on Segment Length

5.5 Varying $F_{c,in}$ and $F_{c,out}$ with Logic Cluster Size

In [Rose91] it is shown that $F_c = W$, where W is the channel width, is good for logic clusters of

size one; i.e. each logic block pin can be connected to any routing

track in an adjacent channel. As cluster size increases, setting

$F_c = W$ provides more flexibility than is required, wasting area.

In [Betz98b, Betz99] it is shown that setting $F_c$ on the input pins

($F_{c,in}$) to $2W/N$ and $F_c$ on the output pins ($F_{c,out}$) to $W/N$

provides a good level of routing flexibility, so all of our experiments

use these values for clusters of sizes other than one.
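The pin flexibility rule above can be written as a small helper. This is a sketch only: the function name is hypothetical, the values are left as unrounded fractions of tracks because the text does not say how they are rounded, and using Fc = W on both input and output pins for clusters of size one is an assumption based on the [Rose91] statement.

```python
from typing import Tuple


def fc_values(channel_width: int, cluster_size: int) -> Tuple[float, float]:
    """Return (Fc_in, Fc_out) in tracks for channel width W and cluster size N."""
    if channel_width < 1 or cluster_size < 1:
        raise ValueError("W and N must be positive")
    if cluster_size == 1:
        # Size one clusters: every pin can reach every track (assumed for both pin types).
        return float(channel_width), float(channel_width)
    # Larger clusters: Fc_in = 2W/N and Fc_out = W/N.
    return 2.0 * channel_width / cluster_size, channel_width / cluster_size


if __name__ == "__main__":
    for n in (1, 4, 8, 10):
        print(n, fc_values(channel_width=40, cluster_size=n))
```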

5.6 Detailed Logic Cluster Structure

In Figure 5 we show the structure of a logic cluster and the cir-

cuitry connecting the logic clusters to the main FPGA routing.

Table 1 shows delay values for selected cluster sizes. The multiplexor,

buffer, LUT, and flip-flop delays were obtained by modeling

the structures in SPICE [Meta92] with TSMC’s 0.35 µm

process parameters.

Figure 5. Detailed Logic Cluster Structure

Table 1: Selected Logic Cluster Delay Values (in picoseconds), 0.35 µm CMOS

A to B

761

B to D

6. Packing Algorithms

The packing step (in Figure 2) takes a netlist consisting of LUTs

and flip-flops and produces a netlist consisting of logic clusters.

This involves combining the LUTs and flip-flops into BLEs, and

then grouping the BLEs into logic clusters.

There are two main constraints that packing algorithms must meet:

1. The number of BLEs in a cluster must be less than or equal to the cluster size, N.

2. The number of distinct inputs generated outside the

cluster and used as inputs to BLEs within the cluster

must be less than or equal to the number of cluster

inputs, I.
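The two constraints can be checked with a few lines of code. The netlist representation below (a dict of input nets per BLE and a dict of the single output net per BLE) is a hypothetical one chosen for illustration, not VPack's data structure, and a cluster is treated as legal when it holds at most N BLEs.

```python
from typing import Dict, Iterable, Set


def external_inputs(
    cluster: Iterable[str],
    ble_inputs: Dict[str, Set[str]],
    ble_output: Dict[str, str],
) -> Set[str]:
    """Distinct nets used as inputs inside the cluster but generated outside it."""
    members = list(cluster)
    used = set().union(*(ble_inputs[b] for b in members)) if members else set()
    produced_inside = {ble_output[b] for b in members}
    return used - produced_inside


def is_legal_cluster(
    cluster: Iterable[str],
    ble_inputs: Dict[str, Set[str]],
    ble_output: Dict[str, str],
    n: int,
    i: int,
) -> bool:
    """True if the cluster has at most n BLEs and at most i external inputs."""
    members = list(cluster)
    return (
        len(members) <= n
        and len(external_inputs(members, ble_inputs, ble_output)) <= i
    )


if __name__ == "__main__":
    ins = {"a": {"i1", "i2"}, "b": {"na", "i3"}}
    outs = {"a": "na", "b": "nb"}
    # Net "na" is produced by BLE a inside the cluster, so it is not an external input.
    print(sorted(external_inputs(["a", "b"], ins, outs)))
    print(is_legal_cluster(["a", "b"], ins, outs, n=2, i=6))
```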

In this section, we present two packing algorithms, VPack

[Betz97b, Betz98b, Betz99], and T-VPack. Then we show that our

new T-VPack algorithm outperforms the original VPack algorithm

in both area and critical path delay.

6.1 Input-Sharing VPack Algorithm

The original VPack algorithm has two optimization goals. The first

is to pack each logic cluster to its capacity in order to minimize the

number of clusters needed. The second goal is to minimize the

number of inputs to each cluster in order to reduce the number of

connections required between clusters.

VPack uses a greedy algorithm to construct each cluster sequentially.

At the start of each cluster operation, VPack selects as a

“seed” an unclustered BLE with the most used inputs, and then

places this “seed” into a cluster C. Then VPack selects a new BLE,

B to pack into C based on the attraction that B has to C. Attraction

is determined by the number of inputs and outputs that B and C

have in common:

$$\text{Attraction}(B) = |\text{Nets}(B) \cap \text{Nets}(C)| \qquad (1.1)$$

After each cluster reaches capacity, packing begins on a new clus-

ter. The process terminates when there are no more unclustered

BLEs left. The time complexity of this algorithm is O(k,,,n)

(where n is the number of BLEs in the circuit and k,,, is the

fanout of the highest fanout net) which results in an execution time

of about four seconds to pack the largest circuit (clma) on a 296

MHz UltraSPARC-II processor.
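The greedy loop described in this section can be sketched as follows. This is a simplified illustration, not the VPack source: the netlist representation (input nets and one output net per BLE) is hypothetical, ties are broken alphabetically for determinism, and the candidate scan is a plain quadratic loop rather than the O(k_max · n) bookkeeping reported above. Attraction follows Equation (1.1), with Nets(C) taken as the union of the nets of the BLEs already in C.

```python
from typing import Dict, List, Set


def vpack_sketch(
    ble_inputs: Dict[str, Set[str]],
    ble_output: Dict[str, str],
    n: int,
    i: int,
) -> List[List[str]]:
    """Greedy input-sharing packing: at most n BLEs and i external inputs per cluster."""
    unclustered = set(ble_inputs)
    clusters: List[List[str]] = []
    while unclustered:
        # Seed: the unclustered BLE with the most used inputs.
        seed = max(sorted(unclustered), key=lambda b: len(ble_inputs[b]))
        cluster = [seed]
        unclustered.remove(seed)
        while len(cluster) < n:
            cluster_nets: Set[str] = set()
            for b in cluster:
                cluster_nets |= ble_inputs[b] | {ble_output[b]}
            best, best_attraction = None, -1
            for b in sorted(unclustered):
                trial = cluster + [b]
                produced = {ble_output[x] for x in trial}
                external = set().union(*(ble_inputs[x] for x in trial)) - produced
                if len(external) > i:
                    continue  # adding b would exceed the cluster input limit
                attraction = len((ble_inputs[b] | {ble_output[b]}) & cluster_nets)
                if attraction > best_attraction:
                    best, best_attraction = b, attraction
            if best is None:
                break  # no remaining BLE fits in this cluster
            cluster.append(best)
            unclustered.remove(best)
        clusters.append(cluster)
    return clusters


if __name__ == "__main__":
    ins = {"a": {"i1", "i2", "i3"}, "b": {"na", "i4"}, "c": {"nb", "i5"}, "d": {"i6", "i7"}}
    outs = {"a": "na", "b": "nb", "c": "nc", "d": "nd"}
    print(vpack_sketch(ins, outs, n=2, i=6))
```

In the toy netlist, BLE a is the seed because it uses the most inputs, and BLE b is absorbed next because it shares net na with the cluster.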

6.2 Timing-Driven T-VPack Algorithm

Our new packing algorithm is based on the original VPack algo-

rithm, but its optimization goal is minimizing the number of exter-

nal connections (connections between clusters) on the critical path.

The reasoning behind this is that external connections have higher

delay than internal connections (connections within a cluster), so

by reducing the number of external nets on the critical path, we

will reduce the circuit delay. The first stage of this algorithm

involves computing which connections are on the critical path. We

then sequentially pack BLEs along the critical path into logic clus-

ters and recompute which BLEs are critical.

6.2.1 An Overview of Slack and Criticality Calculation

The first step in determining which nets are critical is to determine

the slack of each connection [Hitc83, Fran92]. Slack is defined as

the amount of delay which can be added to a connection without

increasing the delay of the entire circuit.

Calculating slack involves computing the arrival time, $T_{arrival}$, and

the required arrival time, $T_{required}$, at all BLE input pins. This is

accomplished using two breadth-first traversals of the circuit; the

first traversal propagates $T_{arrival}$ forward from input pins and register

outputs (Sources), and the second propagates $T_{required}$ back

from output pins and register inputs (Sinks). The slack of a connection

driving a BLE input pin, i, is defined as:

$$\text{slack}(i) = T_{required}(i) - T_{arrival}(i) \qquad (1.2)$$

Finally, we define the criticality of the connection driving input i

as:

$$\text{Connection\_Criticality}(i) = 1 - \frac{\text{slack}(i)}{\text{MaxSlack}} \qquad (1.3)$$

where MaxSlack is the largest slack amongst all point-to-point con-

nections in the entire circuit.
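A minimal sketch of the slack and criticality calculation is given below. It makes several assumptions that are not in the text: the circuit is supplied as a combinational graph in which register outputs have already been split off as sources and register inputs as sinks, every connection has the same delay, slack is taken as required time minus arrival time at each BLE input pin, and when MaxSlack is zero every connection is given criticality one.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def connection_criticalities(
    edges: List[Edge],
    node_delay: Dict[str, float],
    edge_delay: float,
) -> Dict[Edge, float]:
    """Criticality 1 - slack/MaxSlack for every connection (driver, load)."""
    fanin, fanout = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fanin[v].append(u)
        fanout[u].append(v)

    # Forward breadth-first pass: arrival time at each node's output.
    indegree = {node: len(fanin[node]) for node in node_delay}
    queue = deque(sorted(node for node in node_delay if indegree[node] == 0))
    order, arrival = [], {}
    while queue:
        node = queue.popleft()
        order.append(node)
        latest_input = max((arrival[u] + edge_delay for u in fanin[node]), default=0.0)
        arrival[node] = latest_input + node_delay[node]
        for v in fanout[node]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)

    # Backward pass: required time at each node's output.
    d_max = max(arrival.values())
    required = {}
    for node in reversed(order):
        required[node] = min(
            (required[v] - node_delay[v] - edge_delay for v in fanout[node]),
            default=d_max,
        )

    # Slack at the input pin of v driven by u, then criticality.
    slack = {
        (u, v): (required[v] - node_delay[v]) - (arrival[u] + edge_delay)
        for u, v in edges
    }
    max_slack = max(slack.values())
    if max_slack <= 0:
        return {edge: 1.0 for edge in slack}
    return {edge: 1.0 - s / max_slack for edge, s in slack.items()}


if __name__ == "__main__":
    delays = {"a": 0.0, "b": 0.0, "c": 0.1, "d": 0.1}
    crit = connection_criticalities([("a", "c"), ("c", "d"), ("b", "d")], delays, 1.0)
    for edge, value in sorted(crit.items()):
        print(edge, round(value, 3))
```

In the toy circuit the path a, c, d is the longest, so both of its connections have zero slack, while the connection from b to d has the largest slack in the circuit.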

6.2.2 Delay Estimates of an Unplaced and Unrouted Circuit

To obtain a good packing solution¹ the T-VPack algorithm models

three types of delay: the delay through a BLE, or logic_delay, the

connection delay between blocks within the same cluster, or

intra_cluster_connection_delay, and the connection delay between

blocks that are in different clusters, or

inter_cluster_connection_delay. We experimentally determined

that setting logic_delay=0.1, intra_cluster_connection_delay=0.1,

and inter_cluster_connection_delay=1.0 results in the clustered

circuits having the smallest delay after placement and routing by

VPR².
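The three delay estimates can be captured in a small lookup, sketched below. The constants are the values quoted above; the function name, the `cluster_of` mapping, and the choice to treat a connection touching a not-yet-clustered BLE as an inter-cluster connection are assumptions made for illustration.

```python
from typing import Dict, Optional

LOGIC_DELAY = 0.1
INTRA_CLUSTER_CONNECTION_DELAY = 0.1
INTER_CLUSTER_CONNECTION_DELAY = 1.0


def connection_delay(driver: str, load: str, cluster_of: Dict[str, Optional[int]]) -> float:
    """Estimated delay of the connection from driver to load during packing."""
    driver_cluster = cluster_of.get(driver)
    load_cluster = cluster_of.get(load)
    if driver_cluster is not None and driver_cluster == load_cluster:
        return INTRA_CLUSTER_CONNECTION_DELAY
    # Different clusters, or at least one BLE not yet packed (assumed external).
    return INTER_CLUSTER_CONNECTION_DELAY


if __name__ == "__main__":
    placement = {"a": 0, "b": 0, "c": 1, "d": None}
    print(connection_delay("a", "b", placement))
    print(connection_delay("a", "c", placement))
    print(connection_delay("c", "d", placement))
```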

6.2.3 The Attraction Function

We extend the attraction function from the original VPack algo-

rithm to include timing information. The first BLE that is placed

into a cluster is the unclustered BLE that is driven by the most crit-

ical connection in the circuit. Then, based on our attraction func-

tion (Equation 1.8, below) we add the most attractive BLEs to the

cluster. We repeat this absorption until either no more BLEs will fit

into the cluster, or all of the cluster inputs are used. Once a cluster

is full, we start a new cluster with a new seed, and repeat the process

until there are no unclustered BLEs left in the circuit. We next

describe how blocks are selected for absorption.

We define the base criticality of each unclustered BLE, B, or

Base_BLE_Criticality(B), to be the maximum

Connection_Criticality value of all connections joining B to BLEs

within the cluster currently being packed, C. If B does not have any

connections to C then the base criticality score is zero. In Figure 6

we illustrate how the Base_BLE_Criticality values are assigned.

We have labelled each connection between unclustered BLEs and

BLEs within the cluster with a criticality value. Notice how the

base criticality of each BLE is assigned the highest criticality value

of all its connections to the cluster.

Figure 6. BLE Base Criticality Assignment

¹ A good packing solution is one that results in the smallest delay after being

placed and routed by VPR.

² Note that these delay values are only used in the packing process. After

packing is complete, VPR places and routes the circuits and extracts the

real (Elmore) delay of each routed net. All of the delay results that we

present in this paper are computed by VPR.
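The base criticality rule can be sketched in a few lines. The representation is hypothetical: `criticality` maps each connection, written as a (driver, load) pair of BLE names, to its Connection_Criticality value, and `cluster` is the set of BLEs already packed into C.

```python
from typing import Dict, Set, Tuple


def base_ble_criticality(
    ble: str,
    cluster: Set[str],
    criticality: Dict[Tuple[str, str], float],
) -> float:
    """Maximum criticality over connections joining ble to the cluster, or 0 if none."""
    return max(
        (
            value
            for (driver, load), value in criticality.items()
            if (driver == ble and load in cluster) or (load == ble and driver in cluster)
        ),
        default=0.0,
    )


if __name__ == "__main__":
    crit = {("x", "c1"): 0.4, ("c2", "x"): 0.9, ("y", "z"): 1.0}
    cluster = {"c1", "c2"}
    print(base_ble_criticality("x", cluster, crit))  # two connections to the cluster
    print(base_ble_criticality("y", cluster, crit))  # no connection to the cluster
```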

When selecting which BLE to absorb into a cluster there is a high

potential for multiple BLEs to have the same base criticality value.

We use a tie-breaker mechanism to select which BLEs are the most

beneficial to pack. This mechanism is based on the desire to pack

BLEs together in a manner that most effectively reduces the

number of BLEs remaining on the critical paths. This is best illus-

trated by an example.

In Figure 7 we have darkened connections and BLEs on the critical

paths. Notice that when selecting which BLEs to place into a cluster,

it is more beneficial to absorb certain critical BLEs over other

critical BLEs. In this case, absorbing BLEs H, I, and J would be

much more beneficial than absorbing BLEs A, D, and F. We can

see that absorbing H, I, and J affects the criticality of seven BLEs

(A, B, C, D, E, F, and G), while absorbing A, D, and F would only

affect the criticality of three BLEs (H, I, and J). Clearly it is best to

cluster BLEs that reduce the criticalities of the most other BLEs.

We define three variables that keep track of the number of critical

paths that each BLE in the circuit affects. First we define

input_paths_affected as the number of critical paths between

sources in the circuit and the BLE currently being labelled. Next

we define output_paths_affected as the number of critical paths

between the sinks in the circuit and the BLE currently being

labelled. Finally, we define total_paths_affected as the sum of the

previous two variables. The calculation of these variables is

explained below.

The BLE labels in Figure 7 demonstrate the input_paths_affected

value for each BLE. We assign any sources that are on the critical

paths an input_paths_affected value of one, and all other

sources are set to zero. Then we perform a breadth-first traversal of

the circuit starting at the sources, and define the

input_paths_affected value as in (1.4).

Figure 7. Criticality Tie-Breakers
