Journal ArticleDOI

Parallelism versus Memory Allocation in Pipelined Router Forwarding Engines

TL;DR: It is shown that perfect sharing of memory can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories.
Abstract: A crucial problem that needs to be solved is the allocation of memory to processors in a pipeline. Ideally, the processor memories should be totally separate (i.e., one-port memories) in order to minimize contention; however, this minimizes memory sharing. Idealized sharing occurs by using a single shared memory for all processors but this maximizes contention. Instead, in this paper we show that perfect memory sharing of shared memory can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories. We show that the problem of allocation is NP-complete in general, but has a fast approximation algorithm that comes within a factor of $\frac 32$ asymptotically. The proof utilizes a new bin packing model, which is interesting in its own right. Further, for important special cases that arise in practice a more sophisticated modification of this approximation algorithm is in fact optimal. We also discuss the online memory allocation problem and present fast online algorithms that provide good memory utilization while allowing fast updates.

Summary (2 min read)

1. Introduction

  • Parallel processors are often used to solve time-consuming problems.
  • To the best of their knowledge, this problem was first raised and left as an open problem in [14].
  • Given that minimizing memory is required to minimize cost and that pipelining is required for speed, one way out of the dilemma is to change the underlying model.
  • It is difficult today to imagine a very high speed design with more than say b = 100 banks of memory connected via the crossbar.
  • The authors say that an allocation is feasible if every processor’s request is satisfied and no more than two processors are allocated to any one memory.

3. Our Bin Packing Problem Is NP-Complete

  • The authors will prove the NP-completeness of the bin packing problem with the constraint that each bin can have at most two types.
  • In fact, they showed that the 3-PARTITION problem is NP-complete in the strong sense (see [8]).
  • Determine if W can be packed into 2m bins such that no bin contains more than two types.
  • First, the authors observe that a weight type cannot be partitioned into more than two parts.
  • Hence, the authors have shown the following: Theorem 1.

4. A Graph Representation

  • Before the authors discuss approximation algorithms for their bin packing problem and their worst-case analysis, they consider a graph representation of a packing.
  • (3) If the bin is partially filled with only one type, the authors say the corresponding loop is weak.
  • If the bin is completely filled with two types, the authors say the corresponding edge is strong.
  • There are, of course, different ways to pack W into two bins so that each bin contains at most two types.
  • In this paper the authors use the convention that a cycle must have at least two vertices.

5. Approximation Algorithms

  • The authors now describe a simple algorithm for bin packing subject to the constraint that no bin contains weights of more than two types.
  • (3) Each connected component except for the last one has at most one weak edge which can only appear at the end of the component.
  • Let OPT denote the number of bins needed in the optimum packing.
  • It is worth noting that if the associated graph of the resulting packing does not have any weak loop, Algorithm A is within a factor of 3/2 of optimal.

6. Some Properties of the Associated Graphs

  • These properties provide the foundation for the reduction steps in the approximation algorithm to be discussed in the next section.
  • During the moving process, the authors might split the original component into two, but the total number of bins will never increase.
  • If the authors successfully carry this on until the two weak edges become adjacent, they then use Operation 2 to eliminate one weak edge or split the component into two.
  • In the latter case, the authors need to check whether there are two weak loops in the entire graph.
  • If the associated graph G_P is stable and contains a strong loop in one connected component X and a weak edge in another connected component Y, the authors can find another packing P′ that uses no more than b bins, whose associated graph is also stable and has one fewer strong loop than that of P.

7. An Improved Algorithm

  • The authors will show that the modified algorithm gives an optimal solution when the total weight is greater than or equal to the number of types.
  • Every time the authors eliminate one strong loop, at most a linear number of atomic operations are involved, each taking constant time.
  • Now the authors have their linear time Algorithm B: Algorithm B.
  • While there exists a component X containing a strong loop and another component Y containing a weak edge, the authors use only the first step as described in the proof of Lemma 2 to merge these two components into one, without taking care of the possible multiple edges in any one connected component.

8. Dynamic Memory Allocation

  • So far the authors have only dealt with approximation and exact algorithms for static memory allocation.
  • In this situation the authors have a tradeoff between memory utilization and cost of repacking or compaction [14].
  • Call any memory piece that has not been swapped clean and otherwise dirty.
  • The authors then allocate one more weight of size α, and follow this with a deallocation of one of the paired weights.
  • Observing the allocation assignment made by the online Algorithm D, the authors are then given a list of k/2 deallocation requests which remove exactly one weight from every shared bin.

9. Conclusions

  • In practice, one would simply choose the parameters such that the number of memories is larger than the number of processor stages.
  • In that case, the approximation algorithm the authors presented will provide 100% efficiency.
  • The authors know at least one implementation of one of their models that scales to multiple OC-768 speeds.
  • On the theoretical front, their paper also poses an interesting open problem for the general case of packing bins so that each bin contains at most r types for some fixed integer r.


DOI: 10.1007/s00224-006-1249-3
Theory of Computing Systems 39, 829–849 (2006)
© 2006 Springer Science+Business Media, Inc.

Parallelism versus Memory Allocation in Pipelined Router
Forwarding Engines*

Fan Chung,¹ Ronald Graham,² Jia Mao,² and George Varghese²

¹ Department of Mathematics, University of California, San Diego,
La Jolla, CA 92093, USA
fan@ucsd.edu

² Department of Computer Science and Engineering, University of California, San Diego,
La Jolla, CA 92093, USA
graham@ucsd.edu; {jiamao,varghese}@cs.ucsd.edu

* The research of Fan Chung was supported in part by NSF Grants DMS 0100472 and ITR 0205061.
The research of Ronald Graham was supported in part by NSF Grant CCR 0310991.

Abstract. A crucial problem that needs to be solved is the allocation of memory to
processors in a pipeline. Ideally, the processor memories should be totally separate
(i.e., one-port memories) in order to minimize contention; however, this minimizes
memory sharing. Idealized sharing occurs by using a single shared memory for all
processors but this maximizes contention. Instead, in this paper we show that perfect
memory sharing of shared memory can be achieved with a collection of two-port
memories, as long as the number of processors is less than the number of memories.
We show that the problem of allocation is NP-complete in general, but has a fast
approximation algorithm that comes within a factor of 3/2 asymptotically. The proof
utilizes a new bin packing model, which is interesting in its own right. Further, for
important special cases that arise in practice a more sophisticated modification of
this approximation algorithm is in fact optimal. We also discuss the online memory
allocation problem and present fast online algorithms that provide good memory
utilization while allowing fast updates.
1. Introduction
Parallel processors are often used to solve time-consuming problems. Typically, each
processor has some memory where it stores computation data. To minimize contention
and maximize speed, each memory should be read by exactly one process. Unfortunately,
if the tasks assigned to processors vary widely in memory usage, this is not an efficient
use of memory, since for some tasks memory of one processor may be unused while
memory of another processor is exhausted.
The interaction between parallelism (the desire to minimize contention) and memory
allocation (the desire to maximize memory sharing) is a general phenomenon that has
been largely unexplored in the literature. We encountered this problem in the context of
networking while trying to design fast IP lookup schemes [6], [11]. In IP lookup, the
time-consuming task at hand is prefix lookup, and the processors are arranged (often
within a custom chip) as a pipeline.
Almost all known IP lookup schemes [13] traverse some form of tree (e.g., trie,
binary tree) using the destination 32-bit IP address in a received packet as a key. The
leaves provide information required to forward the packet. Lookup time is proportional
to tree height, and storage required is the sum of the storage required for each node.
Observe that any tree can easily be pipelined by height: all nodes at height i are
placed in memory i which is accessible only to processor i. Such a design is simple
because there is no memory contention. However, it is extremely wasteful of memory.
Since the shape of the tree can vary from database to database, and they are in general
unbalanced, trees can change their memory needs from database to database. More
precisely, the number of nodes at height i can vary for different databases by large
factors.
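To make the imbalance concrete, the following minimal Python sketch (the function name and the prefix set are hypothetical, used only for illustration) counts the trie nodes at each height for a set of binary prefixes; in a height-pipelined design, level i must fit entirely in memory i, so the largest level dictates the size of every stage.

    from collections import defaultdict

    def level_sizes(prefixes):
        """Count trie nodes at each height for a set of binary prefixes."""
        nodes = {""}                       # the root
        for p in prefixes:
            for i in range(1, len(p) + 1):
                nodes.add(p[:i])           # every ancestor of the prefix is a node
        sizes = defaultdict(int)
        for node in nodes:
            sizes[len(node)] += 1
        return [sizes[d] for d in range(max(sizes) + 1)]

    # A made-up prefix set; the per-level counts are far from uniform.
    print(level_sizes(["00", "01", "10", "110", "1110", "11110"]))
    # -> [1, 2, 4, 2, 2, 1]: a static per-stage budget must be sized for the largest level.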
Thus, statically deciding the size of each memory is a bad idea because there will
be at least some databases where the total amount of memory required is less than the
sum of the sizes of all memories, but the database still cannot fit because memory i
is underutilized while say memory j is full. How then should memory be allocated to
processors? To the best of our knowledge, this problem was first raised and left as an open
problem in [14].
An approximate solution to the problem of trie memory allocation across pipeline
stages is described in [1]. Basu and Narlikar try to choose the tree to minimize memory
imbalance. Their results show a reduction in the maximum allocation by approximately
one-half. Unfortunately, these results do not help worst-case designs. Their worst-case
bound is close to the naive bound of requiring each stage's memory to equal the total
required memory.
Given that minimizing memory is required to minimize cost and that pipelining is
required for speed, one way out of the dilemma is to change the underlying model. In
some sense, the rest of this paper can be considered to be the proposal of a new memory
model for pipelined engines and its implications. To motivate our final model (multiple
two-port memories connected by a partial crossbar), we first consider a series of simpler
models, which however have drawbacks.
Our second model (the first is partitioned memory) is shared memory which is ideal
for memory sharing. Unfortunately, large, fast shared memories are currently infeasible
to build. In practice, most large n-port memories are (underneath the covers) time-
multiplexed. Every processor is given one memory access for every n memory accesses
done to the memory (in the worst case). Unfortunately, multiplexing n-ways causes the
effective memory access time to grow by a factor of n. The tradeoff between these two
extremes is shown in Figure 1.

[Figure 1 contrasts the two extremes: partitioned per-processor memories ("zero contention, poor memory sharing") versus a single shared memory serving all processors p_1, ..., p_n ("perfect memory sharing, maximal contention").]

Fig. 1. Models 1 and 2 have problems: strictly partitioned memories have poor memory sharing while a
single shared memory has poor contention.
When faced with two unacceptable extremes, it is natural to consider intermediate
forms. Thus, strictly partitioned one-port memories have good access speeds and memory
densities but have poor memory utilization. On the other hand, n-port memories have
the opposite problem. Hence, it is natural to consider a collection of Y-port memories,
where Y < n. A natural starting point is to consider Y = 1 memories. Thus, imagine
for our third model that we have a collection of b one-port memories that are shared
among the n processors (see Figure 2).
This can be modeled by a set of n processors (shown on the bottom of Figure 2)
and a set of b memories (shown on the top of Figure 2) that are connected by an inter-
connection network. The interconnection network allows parallel connections to be made
between processors and memories, and allows each processor to be connected to multiple
memories, but allows at most one processor to be connected to a single memory (because
the memories have only one port). Such interconnection networks are commonly used
in parallel computers [5] and are called crossbar switches.
Figure 2 shows processor p_1 connected to two memories M_1 and M_2. Suppose that
is all that has been allocated to p_1, and p_1 wants more memory. The idea is that the
memory allocation system keeps track of the free memories, realizes that, say, M_3 is
free and (see the dashed line in Figure 2) reconfigures the crossbar to allocate M_3 to p_1.
[Figure 2 shows processors p_1, ..., p_n connected through a partial crossbar to one-port banks M_1, ..., M_b, with a dashed connection indicating that M_3 is added when p_1 needs more memory.]

Fig. 2. Model 3: allowing memory sharing by connecting a large number of one-ported memory banks to
the set of n processors via a partial crossbar.
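As a concrete illustration of this one-port model (a toy Python sketch with invented names such as OnePortCrossbar, not code from the paper), the fragment below tracks which banks are free and wires a free bank to a requesting processor, which is exactly the reconfiguration step described for Figure 2:

    class OnePortCrossbar:
        """Toy model of b one-port banks behind a partial crossbar.

        Each bank holds max_words and can be wired to at most one processor,
        so sharing happens only at the granularity of whole banks.
        """
        def __init__(self, b, max_words):
            self.free_words = [max_words] * b   # remaining words per bank
            self.owner = [None] * b             # processor holding the bank's single port

        def allocate(self, proc, words):
            """Satisfy a request from banks already owned by proc, then from free banks.
            Returns False (without rollback) if the request cannot be met."""
            for bank in range(len(self.owner)):
                if words == 0:
                    break
                if self.owner[bank] in (proc, None) and self.free_words[bank] > 0:
                    take = min(words, self.free_words[bank])
                    self.owner[bank] = proc     # reconfigure the crossbar if needed
                    self.free_words[bank] -= take
                    words -= take
            return words == 0

    xbar = OnePortCrossbar(b=4, max_words=10)
    print(xbar.allocate("p1", 12))   # True: p1 fills bank 0 and part of bank 1
    print(xbar.allocate("p2", 25))   # False: only banks 2-3 (20 words) are open to p2

Note the eight words stranded in p1's second bank: p2 cannot touch them because that bank's single port is already taken, which is precisely the kind of wastage bounded by (n − 1)/b below.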

[Figure 3 shows processors p_1, ..., p_n connected through a partial crossbar to two-port banks M_1, ..., M_b; M_1 is shared by p_1 and p_2, M_2 and M_b each have one port in use, and M_3 has both ports free.]

Fig. 3. Our final model: allowing memory sharing by connecting a small number of two-ported memory
banks to the set of n processors via a partial crossbar.
Notice that the crossbar need only be reconfigured at allocation time, which is generally
orders of magnitude less stringent than lookup times.
At first glance, this looks very attractive, because if b is large, then each processor
can waste at most one memory, which is negligible for large b. Thus the percentage of
wasted memory is at most (n − 1)/b. For example, for n = 16, if b = 32 this can incur a
worst-case memory wastage of around 50%. While this is quite large, it can be reduced
to essentially zero by increasing b.
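For concreteness, with the numbers quoted above the bound works out to

    (n − 1)/b = (16 − 1)/32 = 15/32 ≈ 47%,

which is the "around 50%" figure; doubling b to 64 would already cut the worst case to roughly 23%.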
While this looks superficially attractive, in practice one does not want to waste even
10% of an expensive SRAM memory system, especially if it is on chip. This implies the
use of even higher values of b. Unfortunately, practical constraints limit the values of b
that can be used. The larger the number of memory banks, the larger the load that must
be driven on the data busses that make up the interconnection network, and hence the
larger the delay. It is difficult today to imagine a very high speed design with more than
say b = 100 banks of memory connected via the crossbar. It would be far simpler and
faster (important for higher speeds) to use a smaller number of banks, such as b = 32,
and still get good memory utilization.
Because of the bus capacitance issues caused by using a large number of shared one-port
memories, we consider the next natural progression in our model (Figure 3). Thus we
consider increasing the number of
ports on the memories to Y = 2 from Y = 1. A collection of two-port memories will
only slow down access speeds (using say time multiplexing) by a factor of at most 2.
However, what kind of memory utilization would such two-port memories provide?
To understand the model, imagine a collection of n processors that have access to
a network (e.g., a crossbar switch) that allows them access to a collection of b two-port
memories. Each memory has two ports that can be allocated to any two processors. Thus
each memory can be read by at most two processors at a time. Of course, a processor that
needs a large amount of memory could be assigned a port on X > 1 memories. Each of
the b memories has a fixed amount of memory, say Max memory words.
Notice in Figure 3 that memory M_1 is not completely full and is allocated partially
to processor p_1 and partially to processor p_2. Notice also that of the two memory ports
on each memory in Figure 3, M_1 has both ports allocated, M_2 and M_b have
one port allocated and one port free, and M_3 has two ports free. Thus, if say processor
p_3 wants even one word of memory, p_3 cannot use M_1 (both of M_1's ports are already
allocated to other processors even though it has free memory). However, if p_2 wants
more memory it can get more allocation in M_1.
Thus, it should be clear that besides allocating memory, the allocator has to be frugal
in allocating ports in order not to waste memory. Consider, for example, a scenario where
processors p_1 and p_2 are allocated one word of memory each in all of the b memories. If
Max ≫ 1, then no other processor can get any memory because all ports are allocated,
and the resulting utilization (measured when some processor cannot satisfy a memory
allocation request) is nearly zero. Of course, the memory allocator could finesse this
particular issue by compacting all of p_1's and p_2's requests to fit in as few memory banks
as possible. However, this example should indicate that it is unclear whether perfect
memory allocation is possible while respecting the two-port constraint at every memory.
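The following toy simulation (an illustrative Python sketch with made-up names and a made-up bank size, not an algorithm from the paper) reproduces this pathological scenario: two processors take one word in every two-port bank, after which a third processor is shut out even though almost all of the memory is still free.

    class TwoPortBank:
        """A bank with max_words of storage and two ports, each bindable to one processor."""
        def __init__(self, max_words):
            self.free_words = max_words
            self.ports = []                     # processors holding a port (at most 2)

        def grant(self, proc, words):
            """All-or-nothing request for words by proc; may claim one of the two ports."""
            if words > self.free_words:
                return False
            if proc not in self.ports:
                if len(self.ports) == 2:
                    return False                # both ports taken by other processors
                self.ports.append(proc)
            self.free_words -= words
            return True

    b, max_words = 32, 1000                     # hypothetical sizes
    banks = [TwoPortBank(max_words) for _ in range(b)]

    for bank in banks:                          # p1 and p2 grab one word in every bank...
        bank.grant("p1", 1)
        bank.grant("p2", 1)

    # ...so p3 cannot get even a single word, although ~99.8% of the memory is free.
    print(any(bank.grant("p3", 1) for bank in banks))                 # False
    print(sum(bank.free_words for bank in banks) / (b * max_words))   # 0.998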
Now consider the offline problem of memory allocation. Imagine that the input is
a collection of memory requests per processor (e.g., five words for processor 1, ten for
processor 2, etc.). We say that an allocation is feasible if every processor’s request is
satisfied and no more than two processors are allocated to any one memory. Ideally, we
want a fast algorithm that will guarantee a feasible allocation as long as the input is
feasible (i.e., the sum of processor requests is less than the total memory size).
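Checking that a proposed allocation is feasible in this sense is straightforward. A minimal sketch, assuming bank capacities normalized to 1 (as the paper does later) and a hypothetical alloc map from (processor, bank) pairs to allocated amounts:

    def is_feasible(requests, alloc, b, eps=1e-9):
        """requests[i]: demand of processor i; alloc: dict (i, j) -> amount placed in bank j."""
        granted = [0.0] * len(requests)
        load = [0.0] * b
        users = [set() for _ in range(b)]
        for (i, j), amount in alloc.items():
            granted[i] += amount
            load[j] += amount
            users[j].add(i)
        return (all(granted[i] >= requests[i] - eps for i in range(len(requests)))
                and all(load[j] <= 1.0 + eps for j in range(b))        # bank capacity
                and all(len(users[j]) <= 2 for j in range(b)))         # two-port constraint

    # Two processors share bank 0; processor 1 spills into bank 1.
    print(is_feasible([0.6, 0.7], {(0, 0): 0.6, (1, 0): 0.4, (1, 1): 0.3}, b=2))  # True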
We will show that a very fast O(n) algorithm exists for optimal memory allocation
for feasible inputs as long as b > n. This algorithm is sufficient for practical imple-
mentations because one can constrain the design to use more, smaller memories (often
called memory banks) than the number of processors. (As n grows, there is an increased
interconnect cost as b grows, but this is not a problem for n < 64.) While the speed
of allocation is usually not as important as reads and writes to memory, fast allocation
algorithms allow faster reconfiguration of data structures in this memory structure and
are important in their own right.
As often happens, practical problems give rise to theoretical problems that have a
life of their own. The practical problem can be abstracted as a theoretical problem of bin
packing with an additional constraint. We show that for the general case of arbitrary b
and n, the problem of finding a feasible allocation is NP-complete (it should not surprise
the reader that an NP-complete problem is efficiently solvable in a special case; consider
the case of computing a Hamiltonian cycle, which is trivial if the graph has only a small
number of cycles).
We deal with the NP-completeness by presenting an approximate algorithm that
produces memory utilization that is within a factor of 3/2 of optimal asymptotically.
Practically, this means that if the designer wishes to use a smaller number of memory
banks than the number of processors, he or she should overdesign the total memory
capacity by a factor of 3/2. Fortunately, the approximation algorithm is exactly optimal in
the case of b > n, so we describe only one algorithm for both cases.
In the rest of this paper we abstract the problem as a bin packing problem with the
two-port constraint abstracted as a “two type” constraint. We also normalize the memory
sizes to 1 (instead of Max) without loss of generality by allowing fractional inputs (called
weights) for each processor.
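In this bin-packing language, a simple greedy pass already respects the two-type constraint: walk through the weights in order, split a weight across bins when it does not fit, and open a fresh bin whenever the current one is full or already holds two types. The Python sketch below is meant only to make the constraint tangible; it is not the paper's Algorithm A or B, and it is not guaranteed to use the fewest possible bins.

    def greedy_two_type_packing(weights, b, eps=1e-12):
        """Pack splittable weights into bins of size 1, at most two types per bin.

        Returns {bin_index: [(type, amount), ...]} or None if more than b bins are used.
        Illustrative heuristic only; it can exceed the optimal number of bins.
        """
        bins, space = [[]], [1.0]
        for t, w in enumerate(weights):
            while w > eps:
                if len(bins[-1]) == 2 or space[-1] <= eps:
                    bins.append([])            # current bin is closed to a new type
                    space.append(1.0)
                put = min(w, space[-1])
                bins[-1].append((t, put))
                space[-1] -= put
                w -= put
        return dict(enumerate(bins)) if len(bins) <= b else None

    # Three weights of 0.6 fit into two bins, each holding parts of at most two types.
    print(greedy_two_type_packing([0.6, 0.6, 0.6], b=3))

How far such simple packings can be from the optimum, and how to do better, is what the 3/2 approximation analysis and the exact algorithm for b > n address in the sections that follow.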
While the former part of this paper mostly focuses on the offline problem, in practice
the set of processors will keep getting new memory requests. When a new memory
request occurs that causes the assignment of processors to memories to change, one has
to reconfigure the crossbar and possibly move data around between memories. Thus

Citations
Proceedings ArticleDOI
27 Aug 2013
TL;DR: The RMT (reconfigurable match tables) model is proposed, a new RISC-inspired pipelined architecture for switching chips, and the essential minimal set of action primitives to specify how headers are processed in hardware are identified.
Abstract: In Software Defined Networking (SDN) the control plane is physically separate from the forwarding plane. Control software programs the forwarding plane (e.g., switches and routers) using an open interface, such as OpenFlow. This paper aims to overcomes two limitations in current switching chips and the OpenFlow protocol: i) current hardware switches are quite rigid, allowing ``Match-Action'' processing on only a fixed set of fields, and ii) the OpenFlow specification only defines a limited repertoire of packet processing actions. We propose the RMT (reconfigurable match tables) model, a new RISC-inspired pipelined architecture for switching chips, and we identify the essential minimal set of action primitives to specify how headers are processed in hardware. RMT allows the forwarding plane to be changed in the field without modifying hardware. As in OpenFlow, the programmer can specify multiple match tables of arbitrary width and depth, subject only to an overall resource limit, with each table configurable for matching on arbitrary fields. However, RMT allows the programmer to modify all header fields much more comprehensively than in OpenFlow. Our paper describes the design of a 64 port by 10 Gb/s switch chip implementing the RMT model. Our concrete design demonstrates, contrary to concerns within the community, that flexible OpenFlow hardware switch implementations are feasible at almost no additional cost or power.

929 citations


Cites background from "Parallelism versus Memory Allocatio..."

  • ...An alternate design is to assign each logical stage to a decoupled set of memories via a crossbar [4]....


Proceedings ArticleDOI
03 Dec 2006
TL;DR: It is shown that fairly straightforward techniques can ensure nearly full utilization of the pipeline, and are coupled with an adaptive mapping of trie nodes to the circular pipeline, create a pipelined architecture which can operate at high rates irrespective of the trie size.
Abstract: A large body of research literature has focused on improving the performance of longest prefix match IP-lookup. More recently, embedded memory based architectures have been proposed, which delivers very high lookup and update throughput. These architectures often use a pipeline of embedded memories, where each stage stores a single or set of levels of the lookup trie. A stream of lookup requests are issued into the pipeline, one every cycle, in order to achieve high throughput. Most recently, Baboescu et al. [21] have proposed a novel architecture, which uses circular memory pipeline and dynamically maps parts of the lookup trie to different stages.In this paper we extend this approach with an architecture called Circular, Adaptive and Monotonic Pipeline (CAMP), which is based upon the key observation that circular pipeline allows decoupling the number of pipeline stages from the number of levels in the trie. This provides much more flexibility in mapping nodes of the lookup trie to the stages. The flexibility, in turn, improves the memory utilization and also reduces the total memory and power consumption. The flexibility comes at a cost however; since the requests are issued at an arbitrary stage, they may get blocked if their entry stage is busy. In an extreme case, a request may block for a time equal to the pipeline depth, which may severely affect the pipeline utilization. We show that fairly straightforward techniques can ensure nearly full utilization of the pipeline. These techniques, coupled with an adaptive mapping of trie nodes to the circular pipeline, create a pipelined architecture which can operate at high rates irrespective of the trie size.

65 citations

Journal ArticleDOI
TL;DR: This paper shows that the degree constraint on the maximal number of clients that a server can handle is realistic in many contexts and proves that a very small additive resource augmentation on the servers degree is enough to find in polynomial time a solution that achieves at least the optimal throughput.
Abstract: In this paper, we consider the problem of assigning a set of clients with demands to a set of servers with capacities and degree constraints The goal is to find an allocation such that the number of clients assigned to a server is smaller than the server's degree and their overall demand is smaller than the server's capacity, while maximizing the overall throughput This problem has several natural applications in the context of independent tasks scheduling or virtual machines allocation We consider both the offline (when clients are known beforehand) and the online (when clients can join and leave the system at any time) versions of the problem We first show that the degree constraint on the maximal number of clients that a server can handle is realistic in many contexts Then, our main contribution is to prove that even if it makes the allocation problem more difficult (NP-Complete), a very small additive resource augmentation on the servers degree is enough to find in polynomial time a solution that achieves at least the optimal throughput After a set of theoretical results on the complexity of the offline and online versions of the problem, we propose several other greedy heuristics to solve the online problem and we compare the performance (in terms of throughput) and the cost (in terms of disconnections and reconnections) of all proposed algorithms through a set of extensive simulation results

48 citations

Proceedings ArticleDOI
16 Aug 2009
TL;DR: A flexible lookup module, PLUG (Pipelined Lookup Grid), which can achieve generality without loosing efficiency because various custom lookup modules have the same fundamental features the authors retain: area dominated by memories, simple processing, and strict access patterns defined by the data structure.
Abstract: New protocols for the data link and network layer are being proposed to address limitations of current protocols in terms of scalability, security, and manageability. High-speed routers and switches that implement these protocols traditionally perform packet processing using ASICs which offer high speed, low chip area, and low power. But with inflexible custom hardware, the deployment of new protocols could happen only through equipment upgrades. While newer routers use more flexible network processors for data plane processing, due to power and area constraints lookups in forwarding tables are done with custom lookup modules. Thus most of the proposed protocols can only be deployed with equipment upgrades. To speed up the deployment of new protocols, we propose a flexible lookup module, PLUG (Pipelined Lookup Grid). We can achieve generality without loosing efficiency because various custom lookup modules have the same fundamental features we retain: area dominated by memories, simple processing, and strict access patterns defined by the data structure. We implemented IPv4, Ethernet, Ethane, and SEATTLE in our dataflow-based programming model for the PLUG and mapped them to the PLUG hardware which consists of a grid of tiles. Throughput, area, power, and latency of PLUGs are close to those of specialized lookup modules.

46 citations


Cites methods from "Parallelism versus Memory Allocatio..."

  • ...Pipelined tries [7, 17, 5, 32, 27] are used by algorithmic lookup modules....


Journal ArticleDOI
TL;DR: It is shown that this general case is strongly NP-hard for any k≥3 and an efficient approximation algorithm is designed, for which the approximation ratio can be made arbitrarily close to $\frac{7}{5}$ .
Abstract: We consider a memory allocation problem. This problem can be modeled as a version of bin packing where items may be split, but each bin may contain at most two (parts of) items. This problem was recently introduced by Chung et al. (Theory Comput. Syst. 39(6):829–849, 2006). We give a simple $\frac{3}{2}$-approximation algorithm for this problem which is in fact an online algorithm. This algorithm also has good performance for the more general case where each bin may contain at most k parts of items. We show that this general case is strongly NP-hard for any k≥3. Additionally, we design an efficient approximation algorithm, for which the approximation ratio can be made arbitrarily close to $\frac{7}{5}$.

30 citations


Cites background or methods from "Parallelism versus Memory Allocatio..."

  • ...Using the fact (shown in [3]), that an optimal packing can be represented by a forest with loops, a pattern is defined as a tree with at most 1/δ² nodes....

  • ...In [3], the authors show that the problem which they study is NP-hard in the strong sense for k = 2....

  • ...The paper [3] showed that for any given packing, it is possible to modify the packing such that there are no cycles in the associated graph, apart from the loops....

  • ...[3] studied this problem and described the drawbacks of the methods stated above....

  • ...Thus, this simple algorithm performs as well as the algorithm from [3] for k = 2....

References
Book
01 Jan 1979
TL;DR: The second edition of a quarterly column as discussed by the authors provides a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book "Computers and Intractability: A Guide to the Theory of NP-Completeness,” W. H. Freeman & Co., San Francisco, 1979.
Abstract: This is the second edition of a quarterly column the purpose of which is to provide a continuing update to the list of problems (NP-complete and harder) presented by M. R. Garey and myself in our book ‘‘Computers and Intractability: A Guide to the Theory of NP-Completeness,’’ W. H. Freeman & Co., San Francisco, 1979 (hereinafter referred to as ‘‘[G&J]’’; previous columns will be referred to by their dates). A background equivalent to that provided by [G&J] is assumed. Readers having results they would like mentioned (NP-hardness, PSPACE-hardness, polynomial-time-solvability, etc.), or open problems they would like publicized, should send them to David S. Johnson, Room 2C355, Bell Laboratories, Murray Hill, NJ 07974, including details, or at least sketches, of any new proofs (full papers are preferred). In the case of unpublished results, please state explicitly that you would like the results mentioned in the column. Comments and corrections are also welcome. For more details on the nature of the column and the form of desired submissions, see the December 1981 issue of this journal.

40,020 citations

Journal ArticleDOI
TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this role, and results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.
Abstract: The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled on to this model; yet it can be effeciently implemented in hardware. The author argues that an analogous bridge between software and hardware in required for parallel computation if that is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware.

3,885 citations


"Parallelism versus Memory Allocatio..." refers background in this paper

  • ...[13] L. Valiant, A bridging model for parallel computation....


  • ...Similar notions of randomizing accesses to memory date back to Valiant [13] and Ranade [8], as well as some recent work [3]....


Book
01 Jan 1973

3,076 citations

Journal ArticleDOI
Ron Graham
TL;DR: P can be chosen to be the centroid of the triangle formed by x, y, and z; express each s_i ∈ S in polar coordinates with origin P and θ = 0 in the direction of an arbitrary fixed half-line L from P.

1,741 citations

Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "Parallelism versus memory allocation in pipelined router forwarding engines∗" ?

In this paper the authors show that perfect sharing of memory can be achieved with a collection of two-port memories, as long as the number of processors is less than the number of memories. The authors show that the problem of allocation is NP-complete in general, but has a fast approximation algorithm that comes within a factor of 3/2 asymptotically. The authors also discuss the online memory allocation problem and present fast online algorithms that provide good memory utilization while allowing fast updates. Further, for important special cases that arise in practice a more sophisticated modification of this approximation algorithm is in fact optimal.