The iSLIP scheduling algorithm for input-queued switches
Summary (5 min read)
Introduction
- This idea is already being carried one step further, with cell switches forming the core, or backplane, of high-performance IP routers [26], [31], [6], [4].
- Before using a crossbar switch as a switching fabric, it is important to consider some of the potential drawbacks; the authors consider three here.
- But HOL blocking can be eliminated by using a simple buffering strategy at each input port.
- Rather than maintain a single FIFO queue for all cells, each input maintains a separate queue for each output, as shown in the figure; this scheme is known as virtual output queueing (VOQ).
A. Maximum Size Matching
- These algorithms attempt to maximize the number of connections made in each cell time, and hence, maximize the instantaneous allocation of bandwidth.
- There exist many maximum-size bipartite matching algorithms; the most efficient currently known converges in O(N^{5/2}) time [12].
- The algorithm should not allow a nonempty VOQ to remain unserved indefinitely.
- The authors then consider some small modifications to iSLIP for various applications, and finally consider its implementation complexity.
B. Parallel Iterative Matching
- In some literature, the maximum size matching is called the maximum cardinality matching or just the maximum bipartite matching.
- Because PIM is the basis of the iSLIP algorithm described later, the authors describe the scheme in detail and consider some of its performance characteristics.
- PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps.
- If an unmatched output receives any requests, it grants to one of them, selected uniformly at random over all requests.
- Second, when the switch is oversubscribed, PIM can lead to unfairness between connections.
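The three-step PIM iteration described above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation; the function name `pim_iteration` and its data structures are invented for the example.

```python
import random

def pim_iteration(requests, in_match, out_match):
    """One iteration of (a sketch of) PIM's request/grant/accept steps.

    requests[i] is the set of outputs input i has queued cells for.
    in_match / out_match map matched inputs/outputs to their partners
    (None while unmatched).
    """
    # Step 1: Request. Every unmatched input requests all outputs it has
    # cells for; only unmatched outputs collect requests.
    reqs = {o: [] for o in out_match if out_match[o] is None}
    for i, outs in requests.items():
        if in_match[i] is None:
            for o in outs:
                if o in reqs:
                    reqs[o].append(i)
    # Step 2: Grant. Each unmatched output that received requests grants
    # to one of them, chosen uniformly at random.
    grants = {}  # input -> list of outputs that granted to it
    for o, inputs in reqs.items():
        if inputs:
            grants.setdefault(random.choice(inputs), []).append(o)
    # Step 3: Accept. Each input accepts one grant, chosen uniformly at
    # random; the pair becomes part of the match for this cell time.
    for i, outs in grants.items():
        o = random.choice(outs)
        in_match[i], out_match[o] = o, i
```

With every input requesting every output, each iteration adds at least one connection, so a fully loaded N-port switch is fully matched within N iterations (in expectation PIM needs far fewer).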
II. THE iSLIP ALGORITHM WITH A SINGLE ITERATION
- In this section the authors describe and evaluate the iSLIP algorithm.
- This section concentrates on the behavior of iSLIP with just a single iteration per cell time.
- The iSLIP algorithm uses rotating priority ("round-robin") arbitration to schedule each active input and output in turn.
- The authors find that the performance of iSLIP for uniform traffic is high; for uniform i.i.d. Bernoulli arrivals, the throughput approaches 100%.
- This is the result of a phenomenon that the authors encounter repeatedly: the arbiters in iSLIP have a tendency to desynchronize with respect to one another.
A. Basic Round-Robin Matching Algorithm
- iSLIP is a variation of the simple basic round-robin matching (RRM) algorithm.
- The RRM algorithm, like PIM, consists of three steps.
- The three steps of arbitration are: Step 1: Request.
- Each input sends a request to every output for which it has a queued cell.
- The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input.
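The RRM steps above can be sketched as follows; this is an illustrative reconstruction under stated assumptions, with invented names (`rrm_schedule`, `next_in_rr`). Running it on a fully loaded switch shows the synchronization problem: every output grants to the same input, and all grant pointers move in lockstep.

```python
def next_in_rr(candidates, pointer, n):
    """Pick the candidate appearing next in round-robin order from pointer."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None

def rrm_schedule(requests, grant_ptr, accept_ptr, n):
    """One cell time of (a sketch of) basic RRM: request, grant, accept.

    requests[i] is the set of outputs input i has queued cells for;
    grant_ptr[o] / accept_ptr[i] are the round-robin pointers.
    """
    # Step 2: Grant. Each output grants to the requesting input appearing
    # next in its round-robin schedule; the pointer then moves one beyond
    # the granted input whether or not the grant is accepted (unlike iSLIP).
    grants = {}  # input -> set of outputs that granted to it
    for o in range(n):
        inputs = {i for i in range(n) if o in requests[i]}
        g = next_in_rr(inputs, grant_ptr[o], n)
        if g is not None:
            grants.setdefault(g, set()).add(o)
            grant_ptr[o] = (g + 1) % n
    # Step 3: Accept. Each input accepts the granting output appearing next
    # in its own round-robin schedule, then advances its pointer.
    match = {}
    for i, outs in grants.items():
        a = next_in_rr(outs, accept_ptr[i], n)
        match[i] = a
        accept_ptr[i] = (a + 1) % n
    return match
```

With all pointers starting at zero and every input requesting every output, only one connection is made in the first cell time, and all four grant pointers advance to the same position.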
B. Performance of RRM for Bernoulli Arrivals
- As an introduction to the performance of the RRM algorithm, Fig. 5 shows the average delay as a function of offered load for uniform i.i.d. Bernoulli arrivals.
- This synchronization phenomenon leads to a maximum throughput of just 50% for this traffic pattern.
- Fig. 8 shows the number of synchronized output arbiters as a function of offered load.
- Under heavy offered load, cells arriving for an output find its arbiter in an effectively random position, equally likely to grant to any input. The probability that an input remains ungranted is therefore ((N − 1)/N)^N; hence, as N increases, the throughput tends to 1 − 1/e ≈ 63%.
- This agrees well with the simulation results for RRM shown in the figure.
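The 63% limit can be checked numerically: the ungranted probability ((N − 1)/N)^N converges to 1/e as N grows, so the throughput converges to 1 − 1/e.

```python
import math

# Throughput of a randomly positioned arbiter: 1 - P(input ungranted),
# where P(ungranted) = ((N - 1) / N) ** N.
for n in (4, 16, 64, 256):
    p_ungranted = ((n - 1) / n) ** n
    throughput = 1 - p_ungranted
    print(n, round(throughput, 4))

print(round(1 - 1 / math.e, 4))  # limiting value, about 0.6321
```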
III. THE iSLIP ALGORITHM
- iSLIP improves upon RRM by reducing the synchronization of the output arbiters.
- iSLIP is identical to RRM except for a condition placed on updating the grant pointers: a grant pointer is updated only if the grant is accepted.
- This is because when the arbiters move their pointers, the most recently granted input becomes the lowest priority at that output.
- The output will serve at most N − 1 other inputs first, waiting at most N cell times to be accepted by each input.
- Under heavy load, all queues with a common output have the same throughput (Property 3).
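A minimal sketch of single-iteration iSLIP, under the same setup as the RRM sketch and with invented names (`islip_schedule`, `next_in_rr`). The only change is that a grant pointer advances solely when its grant is accepted; on a fully loaded 4-port switch this lets the arbiters desynchronize, and the match grows to a full 4-connection match within four cell times.

```python
def next_in_rr(candidates, pointer, n):
    """Pick the candidate appearing next in round-robin order from pointer."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None

def islip_schedule(requests, grant_ptr, accept_ptr, n):
    """One cell time of single-iteration iSLIP (illustrative sketch)."""
    # Step 2: Grant. Each output grants to the requesting input appearing
    # next in round-robin order from its grant pointer.
    grants, granted_by = {}, {}
    for o in range(n):
        inputs = {i for i in range(n) if o in requests[i]}
        g = next_in_rr(inputs, grant_ptr[o], n)
        if g is not None:
            grants.setdefault(g, set()).add(o)
            granted_by[o] = g
    # Step 3: Accept. Each input accepts the granting output appearing next
    # in round-robin order from its accept pointer, then advances it.
    match = {}
    for i, outs in grants.items():
        a = next_in_rr(outs, accept_ptr[i], n)
        match[i] = a
        accept_ptr[i] = (a + 1) % n
    # Key difference from RRM: a grant pointer moves one beyond the granted
    # input only if that grant was accepted.
    for o, g in granted_by.items():
        if match.get(g) == o:
            grant_ptr[o] = (g + 1) % n
    return match
```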
C. As a Function of Switch Size
- Fig. 11 shows the average latency imposed by a scheduler as a function of offered load for switches with 4, 8, 16, and 32 ports.
- As the authors might expect, the performance degrades with the number of ports.
- The performance degrades differently under low and heavy loads.
- For a fixed low offered load, the queueing delay converges to a constant value.
- Ignoring the queueing delay under low offered load, the number of cells contending for each output converges to a constant as the switch size N increases.
D. Burstiness Reduction
- Intuitively, if a switch decreases the average burst length of traffic that it forwards, then the authors can expect it to improve the performance of its downstream neighbor.
- The authors expect any scheduling policy that uses round-robin arbiters to be burst-reducing; this includes iSLIP, which under heavy load becomes a deterministic algorithm serving each connection in strict rotation.
- The authors use the same measure of burstiness that they use when generating traffic: the average burst length.
- The authors define a burst of cells at the output of a switch as the number of consecutive cells that entered the switch at the same input.
- This indicates that the output arbiters have become desynchronized and are operating as time-division multiplexers, serving each input in turn.
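The burstiness measure defined above (average run of consecutive cells from the same input) is easy to compute; this helper is an invented illustration, not the authors' code.

```python
def avg_burst_length(output_stream):
    """Average burst length of a cell stream leaving one output.

    output_stream is the sequence of input-port IDs of departing cells; a
    burst is a maximal run of consecutive cells from the same input.
    """
    if not output_stream:
        return 0.0
    bursts = 1
    for prev, cur in zip(output_stream, output_stream[1:]):
        if cur != prev:
            bursts += 1
    return len(output_stream) / bursts
```

A stream like `[1, 1, 1, 2, 2, 3]` has three bursts and an average burst length of 2.0, while a perfectly interleaved (time-division multiplexed) stream such as `[0, 1, 2, 3, 0, 1, 2, 3, ...]` has an average burst length of 1.0, matching the desynchronized behavior described above.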
V. ANALYSIS OF SLIP PERFORMANCE
- In general, it is difficult to accurately analyze the performance of a switch, even for the simplest traffic models.
- Under uniform load and either very low or very high offered load, the authors can readily approximate and understand the way in which iSLIP operates.
- When arrivals are infrequent, the authors can assume that the arbiters act independently and that arriving cells are successfully scheduled with very low delay.
- At the other extreme, when the switch becomes uniformly backlogged, the authors can see that desynchronization will lead the arbiters to find an efficient time division multiplexing scheme and operate without contention.
- But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe.
A. Convergence to Time-Division Multiplexing Under Heavy Load
- Under heavy load, iSLIP will behave similarly to an M/D/1 queue with the corresponding arrival rate and a deterministic service time of N cell times.
- So, under a heavy load of Bernoulli arrivals, the delay will be approximated by (2).
- This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service.
- This adaptation takes place over many cell times while the arbiters desynchronize again.
- During this time, the throughput will be worse than for the M/D/1 queue and the queue length will increase.
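The paper's equation (2) is not reproduced in this summary; as a reference point, the standard Pollaczek–Khinchine result for an M/D/1 queue gives the mean waiting time used in this kind of heavy-load approximation. The helper name and parameter choices below are illustrative.

```python
def md1_wait(rho, service_time):
    """Mean waiting time of an M/D/1 queue.

    Pollaczek-Khinchine for deterministic service: W = rho * D / (2 * (1 - rho)),
    where rho is the utilization and D the fixed service time -- here,
    roughly N cell times once iSLIP settles into its rotation under heavy load.
    """
    assert 0 <= rho < 1, "queue is unstable at rho >= 1"
    return rho * service_time / (2 * (1 - rho))
```

For example, at 90% utilization with a 16-cell-time service period the mean wait is 72 cell times, and the delay grows without bound as the utilization approaches 1, consistent with the heavy-load behavior described above.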
VI. THE SLIP ALGORITHM WITH MULTIPLE ITERATIONS
- Until now, the authors have only considered the operation of iSLIP with a single iteration.
- Once again, the authors shall see that desynchronization of the output arbiters plays an important role in achieving low latency.
- When multiple iterations are used, it is necessary to modify the algorithm.
- If an unmatched output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element.
- The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration.
A. Updating Pointers
- Note that the grant and accept pointers are only updated for matches found in the first iteration.
- Connections made in subsequent iterations do not cause the pointers to be updated.
- To understand how starvation can occur, the authors refer to the example of a 3 × 3 switch with five active and heavily loaded connections, shown in Fig. 15.
- The switch is scheduled using two iterations of the algorithm, except in this case, the pointers are updated after both iterations.
- Each time the round-robin arbiter at output 2 grants to input 1, input 1 chooses to accept output 1 instead.
B. Properties
- With multiple iterations, the algorithm has the following properties: Property 1: Connections matched in the first iteration become the lowest priority in the next cell time.
- Because pointers are not updated after the first iteration, an output will continue to grant to the highest priority requesting input until it is successful.
- For iSLIP with more than one iteration, and under heavy load, queues with a common output may each have a different throughput (Property 3).
- If zero connections are scheduled in an iteration, then the algorithm has converged; no more connections can be added with more iterations.
- The algorithm will not necessarily converge to a maximum sized match (Property 5).
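The multi-iteration loop and the pointer-update rule behind Properties 1 and 2 can be sketched as follows; this is an illustrative reconstruction with invented names (`islip_multi`, `next_in_rr`), not the authors' implementation.

```python
def next_in_rr(candidates, pointer, n):
    """Pick the candidate appearing next in round-robin order from pointer."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None

def islip_multi(requests, grant_ptr, accept_ptr, n, iterations):
    """Multi-iteration iSLIP sketch: pointers move only on first-iteration matches."""
    in_match = {i: None for i in range(n)}
    out_match = {o: None for o in range(n)}
    for it in range(iterations):
        added = 0
        # Grant: each unmatched output grants the next requesting, still
        # unmatched input in round-robin order from its pointer.
        grants = {}
        for o in range(n):
            if out_match[o] is not None:
                continue
            inputs = {i for i in range(n)
                      if o in requests[i] and in_match[i] is None}
            g = next_in_rr(inputs, grant_ptr[o], n)
            if g is not None:
                grants.setdefault(g, set()).add(o)
        # Accept: each input accepts the next granting output in round-robin
        # order; pointers are updated only in the first iteration.
        for i, outs in grants.items():
            a = next_in_rr(outs, accept_ptr[i], n)
            in_match[i], out_match[a] = a, i
            added += 1
            if it == 0:
                accept_ptr[i] = (a + 1) % n
                grant_ptr[a] = (i + 1) % n
        if added == 0:  # converged: no more connections can be added
            break
    return in_match
```

Each iteration only adds connections among still-unmatched ports, so once an iteration adds nothing the loop can stop early (Property: convergence in at most N iterations).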
A. How Many Iterations?
- When implementing iSLIP with multiple iterations, the authors need to decide how many iterations to perform during each cell time.
- Ideally, from Property 4 above, the authors would like to perform N iterations.
- After a number of cell times, the arbiters become totally desynchronized and the algorithm will converge in a single iteration.
- In some applications this may be acceptable.
- This relation held for all the stationary arrival processes the authors tried; however, they have not been able to prove that it holds in general.
A. Prioritized iSLIP
- Many applications use multiple classes of traffic with different priority levels.
- The Prioritized iSLIP algorithm gives strict priority to the highest priority request in each cell time.
- The pointer is incremented (modulo N) to one location beyond the granted input if and only if the input accepts the output in Step 3 of the first iteration.
- The input then chooses one output among only those that have requested at that priority level.
- The input arbiter maintains a separate pointer for each priority level.
B. Threshold iSLIP
- Scheduling algorithms that find a maximum weight match outperform those that find a maximum sized match.
- In particular, if the weight of the edge between an input and an output is the occupancy of the corresponding input queue, then the authors conjecture that the algorithm can achieve 100% throughput for all i.i.d. arrival processes.
- In the Threshold iSLIP algorithm, the authors make a compromise between the maximum-sized match and the maximum-weight match by quantizing the queue occupancy according to a set of threshold levels.
- The threshold level is then used to determine the priority level in the Prioritized iSLIP algorithm.
- If the queue occupancy exceeds the threshold for level l (but not the next higher threshold), then the input makes a request of level l.
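The quantization step can be sketched in a few lines; the function name `request_level` and the particular thresholds are invented for the example.

```python
def request_level(queue_occupancy, thresholds):
    """Map a VOQ occupancy to a request priority level (Threshold iSLIP sketch).

    thresholds is an ascending list; each threshold the occupancy exceeds
    raises the request by one level, so longer queues request at higher
    priority, approximating a weight without comparing exact occupancies.
    """
    level = 0
    for t in thresholds:
        if queue_occupancy > t:
            level += 1
    return level
```

For instance, with thresholds `[0, 4, 16]`, an empty queue requests at level 0, a queue of 3 cells at level 1, and a queue of 100 cells at level 3, which the Prioritized iSLIP arbitration then serves first.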
C. Weighted iSLIP
- In some applications, the strict priority scheme of Prioritized iSLIP may be undesirable, leading to starvation of low-priority traffic.
- As illustrated in Fig. 20, each arbiter consists of a priority encoder with a programmable highest priority, a register to hold the highest priority value, and an incrementer to move the pointer after it has been updated.
- The grant decision from each grant arbiter is then passed to the accept arbiters, where each arbiter selects at most one output on behalf of an input, implementing Step 3.
- Finally, the authors have observed that the complexity of the implementation is independent of the number of iterations.
- These values were obtained from a VHDL design that was synthesized using the Synopsys design tools and compiled for the Texas Instruments TSC5000 0.25-µm CMOS ASIC process.
X. CONCLUSION
- The Internet requires fast switches and routers to handle the increasing congestion.
- The authors believe that these switches will use virtual output queueing, and hence will need fast, simple, fair, and efficient scheduling algorithms to arbitrate access to the switching fabric.
- To this end, the authors have introduced the iSLIP algorithm, an iterative algorithm that achieves high throughput, yet is simple to implement in hardware and operate at high speed.
- When the traffic is nonuniform, the algorithm quickly adapts to an efficient round-robin policy among the busy queues.
- The simplicity of the algorithm allows the arbiter for a 32-port switch to be placed on a single chip, and to make close to 100 million arbitration decisions per second.
Frequently Asked Questions (13)
Q2. How does the synchronization effect affect the output?
As shown in Fig. 8, as the offered load increases, synchronized output arbiters tend to move in lockstep and the degree of synchronization changes only slightly.
Q3. What is the effect of bursty arrivals on the performance of a switch?
With bursty arrivals, the performance of an input-queued switch becomes more and more like an output-queued switch under the same arrival conditions [9].
Q4. How many gates are needed to implement a 32-port scheduler?
The number of gates for a 32-port scheduler is less than 100 000, making it readily implementable in current CMOS technologies, and the total number of gates grows approximately with N².
Q5. What is the effect of burst size on the queueing delay?
As the authors would expect, the increased burst size leads to a higher queueing delay whereas an increased number of iterations leads to a lower queueing delay.
Q6. Why is the service policy not constant?
This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service.
Q7. What is the approximation for the expected number of unmatched inputs at time?
The approximation is based on two assumptions: 1) inputs that are unmatched at a given time are uniformly distributed over all inputs; 2) the number of unmatched inputs at that time has zero variance.
Q8. How does the algorithm calculate the maximum weight of the input queue?
In particular, if the weight of the edge between an input and an output is the occupancy of the corresponding input queue, then the authors conjecture that the algorithm can achieve 100% throughput for all i.i.d. arrival processes.
Q9. How many iterations does it take to converge?
In practice there may be insufficient time for N iterations, so the authors consider the penalty of performing only i iterations, where i < N. In fact, because of the desynchronization of the arbiters, iSLIP will usually converge in fewer than N iterations.
Q10. How can the basic algorithm be extended to include requests at multiple priority levels?
The basic algorithm can be extended to include requests at multiple priority levels with only a small performance and complexity penalty.
Q11. What is the problem with the interaction between the arbiters?
But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe.
Q12. What is the pointer to the highest priority element of the round-robin schedule?
The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration.
Q13. How does PIM achieve a conflict-free maximal match?
PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps.