The Routability of Multiprocessor Network Topologies in FPGAs
Manuel Saldaña, Lesley Shannon and Paul Chow
Dept. of Electrical and Computer Engineering
University of Toronto
Toronto, Ontario, Canada, M5S 3G4
{msaldana,lesley,pc}@eecg.toronto.edu
ABSTRACT
A fundamental difference between ASICs and FPGAs is that wires in ASICs are designed such that they match the requirements of a particular design. Wire parameters such as length, width, layout and the number of wires can be varied to implement a desired circuit. Conversely, in an FPGA, area is fixed and routing resources exist whether or not they are used, so the goal becomes implementing a circuit within the limits of available resources. The architecture for existing routing structures in FPGAs has evolved over time to suit the requirements of large, localized digital circuits. However, FPGAs now have the capacity to implement networks of such circuits, and system-level interconnection becomes a key element of the design process.
Following a standard design flow and using commercial tools, we investigate how this fundamental difference in resource usage affects the mapping of various network topologies to a modern FPGA routing structure. By exploring the routability of different multiprocessor network topologies with 8, 16 and 32 nodes on a single FPGA, we show that the difference in resource utilization between the ring, star, hypercube and mesh topologies is not significant up to 32 nodes. We also show that a fully-connected network can be implemented with at least 16 nodes, but with 32 nodes it exceeds the routing resources available on the FPGA. We also derive a cost metric that helps to estimate the impact of the topology selection based on the number of nodes.
Categories and Subject Descriptors
C.1.2 [Processor Architectures]: Multiple Data Stream
Architectures (Multiprocessors); D.0 [Computer Systems
Organization]: General
General Terms
Design
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SLIP’06, March 4–5, 2006, Munich, Germany.
Copyright 2006 ACM 1-59593-255-0/06/0003 ...$5.00.
Keywords
Multiprocessor, FPGA, Network-on-Chip, Topology, Inter-
connect
1. INTRODUCTION
With the growing complexity of System-on-Chip (SoC)
circuits, more sophisticated communication schemes are re-
quired to connect the increasing number and variety of in-
tellectual property (IP) blocks. Approaches like AMBA [1],
CoreConnect [2], WISHBONE [3] and SiliconBackplane [4]
follow a shared bus scheme that works well for Master-
Slave communication patterns, where there are peripherals
(slaves) that wait for data to be received or requested from
a more complex processing IP (master). When there are
several masters (e.g., processors) in the system, synchro-
nization, data interchange and I/O may saturate the bus,
and contention will slow down data transfers.
The Network-on-Chip (NoC) [5, 6] provides a possible so-
lution for this problem by creating a scalable interconnec-
tion scheme. The concept uses a set of buses connected to
routers or switches that interchange packets, much in the
same way as traditional computer networks or multiproces-
sor machines do. Consequently, NoC approaches have design
parameters and properties similar to traditional networks.
One of these parameters is the topology, which defines the
interconnection pattern between the routers and switches.
Multiple topologies have been studied for NoCs on ASICs
[7, 8]. A popular choice is the mesh [6, 9] because it provides structure, better control over electrical characteristics, and a simple packet routing algorithm. These advantages are clear for ASICs, but not necessarily for FPGAs [10]. The electrical characteristics of the FPGA are handled by the chip vendor, not by the user. As for structure, it is perhaps in-
tuitive to use a mesh topology in FPGAs since the recon-
figurable fabric layout is in the form of a mesh. However,
the placement and routing of components on an FPGA will
not typically result in a symmetric, well-organized struc-
tured layout that resembles a mesh. Furthermore, manually
restricting the placement of components or routing of nets
may lead to inefficient resource utilization for the logic that
is not part of the network. Finally, there are other topolo-
gies like hypercube or torus networks, or even tree topologies
that also have simple routing algorithms.
In this paper, we compare the routability of point-to-point
network topologies on FPGAs by measuring the impact of
each topology on a soft multiprocessor system implemented
on modern commercial FPGAs. We do this by measur-

ing the logic utilization, logic distribution (area), maximum
clock frequency, number of nets, and the place and route
time for five different network topologies. We also derive a
cost metric to try to extract trends for larger systems.
The rest of this paper is organized as follows. Section
2 provides some background about research on NoCs. Sec-
tion 3 describes the topologies implemented and gives a brief
description of the block used as the network nodes, which
we call the computing node. Section 4 describes the im-
plementation platform, and how the systems are generated.
Section 5 presents the results obtained for the baseline sys-
tem and Section 6 explores the chip area required for each
topology. Section 7 shows the highest frequency that each
system could achieve. Section 8 presents a metric we pro-
pose to evaluate the topologies and Section 9 provides some
conclusions.
2. RELATED WORK
In this section we present examples of typical research
on NoCs, and how it relates to this study. Brebner
and Levi [10] discuss NoC implementations on FPGAs, but
their focus is on the issues of using packet switching on a
mesh topology in the FPGA and on implementing crossbar
switches in the routing structure of the FPGA. Most NoC
work assumes ASIC implementations and there are numer-
ous studies including work on mesh topologies [6][9] and fat
trees [7]. Other studies on NoCs are done using register-
transfer-level simulations [7] and simulation models [11], but
they do not show the implementation side of the NoC. In-
stead, we focus on the interaction between the network to-
pologies and how well they can be mapped to a fixed FPGA
routing fabric. We create actual implementations by per-
forming synthesis, mapping, placement and routing for real
FPGAs using commercial tools.
Research has been done on synthesizing application-specific
network topologies [12]. A more general study on the routabil-
ity of different topologies would require the ability to gen-
erate arbitrary interconnection patterns. In our work, we
created a design flow and tools to automatically generate
multiprocessor systems using a set of well known topologies.
Based on the philosophy of routing packets, not wires [9,
13], NoC architectures have been proposed as packet-
switching networks, with the network interface itself being
the focus of much of the research. In this paper, we use a
simple network interface, more similar to a network hub than
a switch as it does not provide packet forwarding. Pack-
ets can only be sent to, and received from nearest-neighbor
nodes. This makes the network interface extremely simple,
but it is sufficient for our purposes as the focus of this work is
on the routability of various topologies, not on the switching
element architecture.
3. EXPERIMENTAL ENVIRONMENT
The actual processor and network interface used are not
the critical elements in this study. What is required is to
create circuits that force particular routing patterns between
the computing nodes to see how the implementation re-
sources of these circuits on the FPGAs vary as the patterns,
i.e., topologies, are changed. We try five different topologies
and three different system sizes (8, 16 and 32 nodes) on five
different FPGAs with enough resources to implement such
systems.
Table 1: Characteristics of the topologies studied

Topology          Diameter         Link Complexity   Degree     Regular   Bisection Width
ring              n/2              n                 2          yes       2
star              2                n-1               1, n-1     no        1
square mesh       2(n^(1/2) - 1)   2(n - n^(1/2))    2, 3, 4    no        n^(1/2)
hypercube         log2(n)          n*log2(n)/2       log2(n)    yes       n/2
fully connected   1                n(n-1)/2          n-1        yes       n^2/4
In this section we describe the Network-on-Chip we used
to perform the experiments, which are explained later in this
paper.
3.1 Network Topologies
Networks can be classified into two categories: static networks, which consist of fixed point-to-point connections between processors, and dynamic networks, which have active elements, such as switches, that can change the connectivity pattern in the system according to a protocol. In an
FPGA, the network can be dynamically reconfigured to
adapt to communication patterns by utilizing the reconfig-
urability [14] of the FPGA.
In this paper, we focus on static message passing networks.
The ring, star, mesh, hypercube and fully-connected topol-
ogies are selected as a representative sample, ranging from
the simplest ring topology to the routing-intensive fully-
connected system.
Network topologies can be characterized by a number of
properties: node degree, diameter, link complexity, bisection
width and regularity [15]. Node degree is the number of
links from a node to its nearest neighbors. Diameter is the
maximum distance between two nodes. Link complexity is
the number of links the topology requires. A network is
deemed to be regular when all the nodes have the same
degree. Bisection width is the number of links that must be
cut when the network is divided into two equal sets of nodes.
Table 1 shows a summary of these characteristics for each
of the topologies used in this paper.
The characteristics of the network topology define the net-
work interface of a node. For example, the four-dimensional
hypercube is a regular topology, with all nodes having a
degree of four. This means that this topology requires a
single network interface type, each with four ports, i.e., four
communication links. The network interface is used to com-
municate with other nodes in the network. The maximum
distance (diameter) is four, which means that data going
through the network may require redirection or routing at
intermediate nodes and travel on up to four links. The link
complexity is 32, which is the total number of point-to-point
links that the overall system will have. In contrast, a 16-
node mesh has a total of 24 links in the system, but it is
not a regular topology, requiring three different versions of
the network interface. Inner nodes require an interface with
four ports, perimeter nodes require one with three ports and
corner nodes use a two-port interface.
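The entries in Table 1 can be checked mechanically. The short Python sketch below is our own illustration (not part of the authors' flow); it evaluates the link count and node degrees for the two 16-node examples above, giving 32 links with a uniform degree of four for the hypercube, and 24 links spread over degrees two, three and four for the 4 x 4 mesh.

    import math

    def hypercube_stats(n):
        # n must be a power of two; every node has log2(n) links and each
        # link is shared by two nodes.
        d = int(math.log2(n))
        return {"links": n * d // 2, "degrees": {d: n}}

    def square_mesh_stats(n):
        # n must be a perfect square; count nearest-neighbour links of a
        # sqrt(n) x sqrt(n) grid and classify corner/perimeter/inner nodes.
        s = math.isqrt(n)
        return {"links": 2 * (n - s),
                "degrees": {2: 4, 3: 4 * (s - 2), 4: (s - 2) ** 2}}

    print(hypercube_stats(16))    # {'links': 32, 'degrees': {4: 16}}
    print(square_mesh_stats(16))  # {'links': 24, 'degrees': {2: 4, 3: 8, 4: 4}}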
Figure 1 shows examples of systems with different num-
bers of nodes and topologies that are implemented to carry
out our experiments. Every topology can be seen as a
graph that is made of edges (links) and vertices (computing
nodes). In our implementations, the links are 64 bits wide

(i.e., a channel width of 64 bits), with 32 bits used for transmis-
sion and 32 bits used for reception, making it a full-duplex
communication system. The links also include control lines
used by the network interface.
Figure 1: A) 8-node ring, B) 8-node star, C) 32-
node mesh, D) 16-node hypercube, and E) 8-node
fully-connected topology
3.2 Computing Node
The computing nodes in Figure 1 consist of a computing
element and a network interface module. Figure 2 shows
the structure of a computing node. The master computing
node of the system is configured to communicate with the
external world using a UART attached to the peripheral bus
shown inside the dashed box of Figure 2. The rest of the
nodes have no peripheral bus.
Figure 2: The computing node
We use a Harvard architecture soft core processor as the
computing element so that data memory and program mem-
ory are accessed by independent memory buses. The com-
munication between the computing element and the network
interface is achieved by using two 32-bit wide FIFOs: one
for transmission and one for reception.
The network interface module is an extremely simple
block that has two sides. It interfaces to the network with
several links (channels) according to the degree of the node.
On the other side, two FIFOs are used as message buffers
to the processor.
The network interface is basically a hub that broadcasts
the data to the neighbors on transmission, and it filters out
the data from the neighbors on reception. It is effectively
a FIFO multiplexer that is controlled by the destination
field in the packet header. If the destination value matches
the processor’s ID number, then the packet passes through
the hub to the processor attached to the hub. Again, this
interface is simple, but is enough for the purpose of this
research, since we are interested in the connectivity pattern.
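As a behavioural sketch only (the real interface is a VHDL block; the field and function names below are our own), the following Python fragment captures the hub's filtering rule: broadcast on transmit, and keep only packets whose destination field matches the local processor ID on receive.

    from collections import deque

    def hub_transmit(packet, outgoing_links):
        # Broadcast the packet taken from the processor's TX FIFO onto every
        # attached link; the number of links equals the node's degree.
        for link in outgoing_links:
            link.append(packet)

    def hub_receive(node_id, incoming_links, rx_fifo):
        # Keep only packets addressed to this node; no forwarding is done,
        # so traffic for other destinations is simply discarded.
        for link in incoming_links:
            while link:
                packet = link.popleft()
                if packet["dest"] == node_id:
                    rx_fifo.append(packet["payload"])

    # Example: node 3 sees two packets but keeps only the one addressed to it.
    rx_fifo = deque()
    links = [deque([{"dest": 3, "payload": "hello"}]),
             deque([{"dest": 5, "payload": "other"}])]
    hub_receive(3, links, rx_fifo)
    print(list(rx_fifo))   # ['hello']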
Implementing a single version of the network interface
would not provide a good measure of the difference in logic
utilization between the various topologies because the topol-
ogies requiring nodes of lesser degree should use less logic.
It is likely that the optimizer in the synthesis tool would
remove the unused ports, still allowing the study to be per-
formed, but we chose to actually implement the different
node degrees required to be certain that only the necessary
logic was included.
The size of the remaining logic in the computing node is
independent of the node degree, i.e., the logic in the proces-
sors, the FIFOs, the memory controllers and the UART are
independent of the topology selection.
4. IMPLEMENTATION PLATFORM
To build the netlist, map the design, place it and route it,
we use the Xilinx [16] EDK tools version 7.1i in combination
with the Xilinx XST synthesis tool. To visualize the place-
ment of the systems, we use the Xilinx FPGA Floorplanning
tool. The network interface is developed in VHDL and sim-
ulated using ModelSim version 6.0b [17]. The routed nets
are counted with the help of the Xilinx FPGA Editor. For
Section 7, we use the Xilinx Xplorer utility to try to meet
the timing constraints. All the experiments are executed on
an IBM workstation with a Pentium 4 processor running at
2.8 GHz with Hyperthreading enabled and 2 GB of memory.
Our multiprocessor systems use the Xilinx MicroBlaze
soft-processor core [16] as the computing element. The com-
puting element connects to the network interface module
through two Fast Simplex Links (FSL), a Xilinx core that is
a unidirectional, point-to-point communication bus imple-
mented as a FIFO.
We use a variety of Xilinx chips to implement the de-
signs: the Virtex2 XC2V2000, and the Virtex4 XC4VLX25,
XC4VLX40, XC4VLX60 and XC4VLX200. The LX devices of the Virtex4 family have only Block RAM (BRAM) and DSP hard cores in addition to the FPGA fabric; they do not have PowerPC processors or Multi-Gigabit Transceivers (MGTs). This provides a more homogeneous architecture
that facilitates area comparisons.
The hard multiplier option for the MicroBlaze is disabled
to minimize the impact of hard core blocks that may influ-
ence or limit the placement and routing. The BRAMs are hard core blocks that also affect placement and routing, but they are essential for the MicroBlaze system to synthesize, so they have not been eliminated.
A 32-node, fully-connected system requires 1056 links to
be specified, and doing this manually is time consuming and
error prone. Instead, we developed a set of tools that take a
high-level description of the system that specifies the topol-
ogy type, the number of nodes, the number of total links
and the number of links per node, and they generate the

files required by EDK.
The number of nodes, for all the topologies, is chosen based on the limitation of the hypercube to 2^d nodes, where d is the dimension. For d = 3, 4 and 5 we have 8, 16 and 32 nodes, respectively.
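To give a flavour of what such a generator has to produce, the Python sketch below (our own illustration, not the authors' EDK tool) builds point-to-point edge lists for three of the topologies; the resulting counts match the link-complexity column of Table 1, and each undirected pair then expands into the separate transmit, receive and control connections that the generated EDK files must enumerate.

    import itertools, math

    def ring_edges(n):
        return [(i, (i + 1) % n) for i in range(n)]

    def hypercube_edges(n):
        # Connect nodes whose binary IDs differ in exactly one bit.
        d = int(math.log2(n))
        return [(i, i ^ (1 << b)) for i in range(n) for b in range(d)
                if i < i ^ (1 << b)]

    def fully_connected_edges(n):
        return list(itertools.combinations(range(n), 2))

    print(len(ring_edges(32)))              # 32   = n
    print(len(hypercube_edges(32)))         # 80   = n*log2(n)/2
    print(len(fully_connected_edges(32)))   # 496  = n*(n-1)/2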
5. BASELINE SYSTEM
The main objective of this experiment is to measure the
logic and routing resources required for each of the topolo-
gies. The timing constraints are chosen to be realistic, but
not aggressive, so that the place and route times are not
excessive. The 8 and 16-node systems are specified to run
at 150 MHz and the 32-node systems are specified to run
at 133 MHz to account for the slower speed grade of the
XC4VLX200 chip that is used for those systems.
The logic resource usage is measured in terms of the to-
tal number of LUTs required for a design and the number of
LUTs related to only the interconnection network, i.e., those
used to implement the network interface modules. The logic
resources needed to implement the network are estimated by
first synthesizing the network interface modules as stand-
alone blocks to determine the number of LUTs required.
These numbers are then used to estimate the usage of the
entire network. For example, the 8-node star topology re-
quires one 7-port network interface, which uses 345 LUTs,
and seven 2-port network interfaces, which need 111 LUTs
each. The total number of LUTs required by the network is
345 + (7 ×111) = 1122 LUTs. Note that this is only an esti-
mate, as the values reported for the stand-alone blocks may change at the system level due to optimizations that may occur. The register (flip flop) utilization is
found by using the same method as used for finding the logic
resource utilization.
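As a worked example of this estimate (using only the figures quoted above), the overhead of the 8-node star follows directly:

    # 8-node star: one 7-port hub at the centre plus seven 2-port hubs.
    network_luts = 345 + 7 * 111      # 1122 LUTs for all network interfaces
    total_luts = 10393                # complete 8-node star system (Table 2)
    overhead = 100.0 * network_luts / total_luts
    print(network_luts, round(overhead, 1))   # 1122 10.8  -> the Logic Ovrhd. entry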
The routing resource utilization is measured in terms of
the total number of nets in the design and the number of
nets used to implement only the network links and network
interfaces. The counting of nets is done by using the Xilinx
FPGA Editor, which allows the user to filter out net names.
The number of nets attributed to the network is found by
counting the number of nets related to all the network inter-
face modules in the design. This includes all nets that are
used in the network interface module as well as the nets in
the network topology itself. Including the nets in the net-
work interface module is reasonable because more complex
topologies use more complex network interfaces that also
consume FPGA routing resources.
5.1 Results
Figure 3 shows a histogram of the number of LUTs needed
to implement the complete systems, including the MicroB-
laze, FSLs, memory interface controllers, switches, UART,
and OPB bus. As expected, the system with the fully-
connected network has the highest logic utilization, and as
the system size increases, the difference with respect to the
other topologies gets more pronounced because of the O(n^2) growth in size. The difference is most significant with the 32-node system, which requires over twice the logic of the other systems. For the other topologies, the maximum difference in LUT usage amongst the topologies at the same node size ranges from about 5% in the 8-node systems to about 11% in the 32-node systems.
A more detailed view of the logic resources can be seen
in Table 2. The Logic Utiliz. column is the total number of
[Figure 3 plots LUTs (0 to 90,000) against the number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]
Figure 3: Logic utilization of systems
Table 2: Logic and register resources used by each system

Topology    Nodes  Logic Utiliz.  Logic Incr.  Logic Ovrhd.  Total Reg.  Reg. Incr.  Reg. Ovrhd.
                   (LUTs)         (%)          (%)                       (%)         (%)
ring        8      10197          0.0          8.7           2637        0.0         11.2
star        8      10393          1.9          10.8          2642        0.2         11.6
mesh        8      10470          2.7          10.7          2641        0.2         11.4
hypercube   8      10701          4.9          12.6          2645        0.3         11.5
fully con.  8      12376          21.4         22.3          2762        4.7         13.9
ring        16     20448          0.0          8.7           5186        0.0         11.4
star        16     20936          2.4          9.6           5190        0.1         11.8
mesh        16     21360          4.5          12.6          5202        0.3         11.7
hypercube   16     22272          8.9          16.2          5218        0.6         12.0
fully con.  16     30176          47.6         38.1          5490        5.9         16.3
ring        32     40648          0.0          8.7           10209       0.0         11.6
star        32     41880          3.0          9.0           10214       0.1         11.9
mesh        32     42936          5.6          13.6          10250       0.4         11.9
hypercube   32     45104          11.0         17.9          10306       0.9         12.4
fully con.  32     87760          115.9        57.8          11330       11.0        20.3
LUTs used for each design and these are the values shown in
Figure 3. Since the ring has the simplest routing topology,
it is used as the baseline for comparisons with the rest of
the topologies.
Column Logic Incr. shows the increase in the number of
LUTs for each topology relative to the ring topology. For
example, the fully-connected topology requires 21.4% more
LUTs than the ring for the 8-node system. In contrast, the
Logic Ovrhd column shows the number of LUTs used for the
network interfaces as a fraction of the total LUTs required
for the complete system. It is calculated as (total number of
LUTs for network interfaces)/(Total LUTs in the system).
As expected, the ring topology has the lowest overhead for
all node sizes and the fully-connected system overhead in-
creases very quickly as the number of nodes increases.
Table 2 also shows the corresponding results for the reg-
ister (flip flop) utilization of the various topologies. The
trends mimic the logic utilization data, but the variation is
smaller because the number of registers in the network in-
terface module is small and because it is the only component
that is changing in size.
The routing resource usage of each system is presented in
Table 3. The Routing Utiliz. column is the total number

of nets used in the design. In general, the routing resource
utilization follows a similar pattern to the logic resource uti-
lization across the systems. The fully-connected system re-
quires the most nets, as expected. It should also be noted
that the 32-node, fully-connected topology design could be
placed but not completely routed, leaving 56 unrouted nets.
Column Routing Increase presents the difference in rout-
ing resources relative to the ring topology. It can be seen
that the greatest increase in routing for the ring, star, mesh
and hypercube topologies occurs for the 32-node hypercube
system with only a 10.6% increase relative to the 32-node
ring system. This reflects the O(n log n) link complexity of
the hypercube as compared to the O(n) link complexity for
the ring, star, and mesh topologies.
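For instance, taking the net counts from Table 3, the 32-node hypercube's increase over the 32-node ring works out as follows:

    ring_nets = 42618        # 32-node ring (Table 3)
    hypercube_nets = 47136   # 32-node hypercube (Table 3)
    increase = 100.0 * (hypercube_nets - ring_nets) / ring_nets
    print(round(increase, 1))   # 10.6 (%)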
The Routing Ovrhd column is calculated as the total number of nets for all the network interfaces divided by the total number of nets in the entire system. A visual representation of how each network topology contributes to the global number of nets can be seen in Figure 4. From this figure it can be seen that the ring topology overhead is practically independent of the system size at about 6% of the total nets for the 8, 16 and 32-node systems. The star and mesh topologies increase slowly to a maximum of about 11% of the total nets for the 32-node system. The hypercube adds about 15% overhead to the global routing in the 32-node system. The fully-connected topology starts at 20% overhead for an 8-node system, and grows to around 55% of the total routing for the 32-node topology, which actually fails to completely route. The other topologies have much lower routing overhead and will likely be able to expand to 64-node or 128-node systems, assuming large enough FPGAs
exist.
[Figure 4 plots the percentage of nets in the system attributable to the network (10% to 100%) against the number of nodes (8, 16, 32) for the ring, star, mesh, hypercube and fully-connected topologies.]
Figure 4: Topology impact on global routing
The place and route time data varies considerably because
of how the place and route algorithms work and factors that
impact the workstation performance. For the ring, star,
mesh and hypercube topologies, the average times to place
and route are approximately the same for a fixed number of nodes.
For the 8, 16 and 32-node systems the average times are 12 min., 30 min. and 4 hours 48 min., respectively. The fully-connected topology exhibits an exponentially growing time of 15 min. for the 8-node system, 12 hours for the 16-node system, and remained unroutable after 3 days for the 32-node system.
Table 3 also shows the clock frequency (freq.) achieved
Table 3: Routing resources used by each system

Topology    Nodes  Routing Utiliz.  Routing Increase  Routing Ovrhd.  freq.   Target Clock
                   (nets)           (%)               (%)             (MHz)   (MHz)
ring        8      10744            0.0               5.7             150     150
star        8      10956            2.0               7.6             151     150
mesh        8      11021            2.6               7.7             151     150
hyp.cube    8      11256            4.8               9.5             152     150
fully con.  8      13045            21.4              20.4            150     150
ring        16     21501            0.0               5.7             152     150
star        16     22013            2.4               8.0             150     150
mesh        16     22429            4.3               9.6             151     150
hyp.cube    16     23357            8.6               13.1            151     150
fully con.  16     31373            45.9              34.7            128     150
ring        32     42618            0.0               5.8             133     133
star        32     43888            3.0               8.6             100     133
mesh        32     44945            5.5               10.6            132     133
hyp.cube    32     47136            10.6              14.7            133     133
fully con.  32     90016            111.2             54.4            Fail    133
for each of the systems. Of the 8 and 16-node systems,
only the fully-connected, 16-node system is not able to meet
the 150 MHz requirement, achieving only 128 MHz. With
the 32-node systems, the target is 133 MHz, but this is not
achieved by the star or the fully-connected network. The
star incurs congestion at the central node, which affects the
timing, and the fully-connected system requires too many
wires. The placement and routing efforts were set to high,
but no time was spent to try and push the tools to improve
the results that did not meet the targets.
6. AREA REQUIREMENTS
For the previous experiments, LUT and flip flop counts
are used as the reference metrics for logic resource utiliza-
tion. However, for this experiment the number of slices is
used to measure area usage. In the Xilinx architecture, each
slice contains two LUTs. A design requires a certain num-
ber of LUTs and flip flops, and depending on how well the
packing algorithm performs, the design will require more or fewer slices. Moreover, the place and route tools may not be
able to utilize the two LUTs in every slice because of routing
constraints and timing requirements. The number of slices
better reflects the actual chip area required to implement
the design. Also, the area constraints used by the Xilinx
tools are specified in terms of slices.
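Since each slice here contains two LUTs, a quick optimistic lower bound on slice count is ceil(LUTs/2); the sketch below applies it to the 32-node fully-connected design from Table 2 (arithmetic only, since packing and routing constraints usually leave some slices half used).

    import math

    def min_slices(luts, luts_per_slice=2):
        # Optimistic bound: assumes every slice can be fully packed, which
        # the place and route tools often cannot achieve in practice.
        return math.ceil(luts / luts_per_slice)

    print(min_slices(87760))   # 43880 slices at minimum for the 32-node fully-connected system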
For this experiment, the Minimum Area Required is de-
fined as the smallest number of slices needed for the design
to place and route successfully. It is determined by reduc-
ing, or compressing, the area used by the design until just
before it fails to place and route and counting the number
of slices in the compressed region at that point. This gives
a measure of how efficiently the design can use the resources
when the resources are close to being fully utilized, which
models the effect of trying to implement a design on a chip
that is close to full capacity.
The area compression is done using area constraints in
the User Constraints File, i.e., the .ucf file. The constrained
area is described by giving the coordinates of the bottom-
left and top-right slice positions that define a rectangular
area in the FPGA. The origin is fixed to X0Y0 and the

References
Route packets, not wires: on-chip interconnection networks (conference paper).
Interconnection Networks: An Engineering Approach (book).
A network on chip architecture and design methodology (conference paper).
Performance evaluation and design trade-offs for network-on-chip interconnect architectures (journal article).