Author
Adisak Mekkittikul
Bio: Adisak Mekkittikul is an academic researcher from Stanford University. The author has contributed to research in topics: Optical switch & Wavelength-division multiplexing. The author has an hindex of 4, co-authored 8 publications receiving 726 citations.
Papers
More filters
•
TL;DR: The Tiny Tera is a CMOS-based input-queued, fixed-size packet switch suitable for a wide range of applications such as a highperformance ATM switch, the core of an Internet router or as a fast multiprocessor interconnect.
Abstract: In this paper, we present the Tiny Tera: a small packet switch with an aggregate bandwidth of 320Gb/s. The Tiny Tera is a CMOS-based input-queued, fixed-size packet switch suitable for a wide range of applications such as a highperformance ATM switch, the core of an Internet router or as a fast multiprocessor interconnect. Using off-the-shelf technology, we plan to demonstrate that a very highbandwidth switch can be built without the need for esoteric optical switching technology. By employing novel scheduling algorithms for both unicast and multicast traffic, the switch will have a maximum throughput close to 100%. Using novel highspeed chip-to-chip serial link technology, we plan to reduce the physical size and complexity of the switch, as well as the system pin-count.
345 citations
••
29 Mar 1998TL;DR: This work introduces a new algorithm called longest port first (LPF), which is designed to overcome the complexity problems of LQF, and can be implemented in hardware at high speed.
Abstract: Input queueing is becoming increasingly used for high-bandwidth switches and routers. In previous work, it was proved that it is possible to achieve 100% throughput for input-queued switches using a combination of virtual output queueing and a scheduling algorithm called LQF However, this is only a theoretical result: LQF is too complex to implement in hardware. We introduce a new algorithm called longest port first (LPF), which is designed to overcome the complexity problems of LQF, and can be implemented in hardware at high speed. By giving preferential service based on queue lengths, we prove that LPF can achieve 100% throughput.
342 citations
01 Jan 1999
TL;DR: This thesis presents two new algorithms that can achieve 100% throughput for all traffic patterns with independent arrivals and shows that the heuristics can make a scheduling decision within 10– 20 nanoseconds when implemented using 0.25 μm CMOS technology.
Abstract: Until recently, Internet routers and ATM switches were generally built around a central pool of shared memory buffers and a fast, shared-bus backplane. However, limitations in both memory and bus bandwidth have led to the use of input queues and switched backplanes. Input queues relieve the bottleneck by distributing the memory over each switch input; and a switched backplane allows packet transfers to take place simultaneously.
This thesis focuses on the design of switched backplanes with input queues. In particular, we focus on the design of schedulers for switched backplanes. The scheduler decides the order in which packets, or cells, may traverse the backplane. Studies have shown that existing scheduling algorithms are either too complex to operate at high speed or lack the intelligence to perform well over a wide range of traffic patterns.
In this thesis, we present two new algorithms that are fast, simple and efficient. Using the methods of Lyapunov, we prove that both algorithms can achieve 100% throughput for all traffic patterns with independent arrivals. We also demonstrate heuristics that can be implemented in fast and relatively simple hardware.
Our exploratory design work shows that the heuristics can make a scheduling decision within 10– 20 nanoseconds when implemented using 0.25 μm CMOS technology. At this scheduling speed, it is possible to design switches or routers with more than one terabit per second of aggregate bandwidth.
33 citations
••
25 Feb 1996TL;DR: A flexible, distributed digital slot synchronization technique for CORD which is robust and scalable is described, which helps achieve time-slot synchronization over the whole network.
Abstract: Recently, a number of projects aimed at the implementation of optical packet-switched network prototypes have been announced. A common and critical element of these networks is the need to achieve time-slot synchronization over the whole network. Stanford University is currently implementing CORD, a two-node optical wavelength-division multiplexing (WDM) packet-switched network testbed with a star topology, featuring all-optical receiver contention resolution by means of optical switches and delay lines. This paper describes a flexible, distributed digital slot synchronization technique for CORD which is robust and scalable.
6 citations
••
TL;DR: The tiny tera achieves a high switching-bandwidth by using less expensive and proven CMOS technology and novel scheduling algorithms, and will achieve a maximum throughput close to 100% without the need for internal speedup.
Abstract: The tiny tera is an all-CMOS 320 Gbps, input-queued ATM switch suitable for non-ATM applications such as the core of an Internet router. The tiny tera efficiently supports both unicast and multicast traffic. Instead of using optical switching technology, we achieve a high switching-bandwidth by using less expensive and proven CMOS technology. Because of limitations in memory and interconnection bandwidths, we believe that to achieve such a high-bandwidth switch requires an innovative architecture. By using virtual output queuing (VOQ) and novel scheduling algorithms, the tiny tera will achieve a maximum throughput close to 100% without the need for internal speedup.
1 citations
Cited by
More filters
•
01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years.
This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't.
·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network.
·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision.
Table of Contents
Chapter 1 Introduction to Interconnection Networks
1.1 Three Questions About Interconnection Networks
1.2 Uses of Interconnection Networks
1.3 Network Basics
1.4 History
1.5 Organization of this Book
Chapter 2 A Simple Interconnection Network
2.1 Network Specifications and Constraints
2.2 Topology
2.3 Routing
2.4 Flow Control
2.5 Router Design
2.6 Performance Analysis
2.7 Exercises
Chapter 3 Topology Basics
3.1 Nomenclature
3.2 Traffic Patterns
3.3 Performance
3.4 Packaging Cost
3.5 Case Study: The SGI Origin 2000
3.6 Bibliographic Notes
3.7 Exercises
Chapter 4 Butterfly Networks
4.1 The Structure of Butterfly Networks
4.2 Isomorphic Butterflies
4.3 Performance and Packaging Cost
4.4 Path Diversity and Extra Stages
4.5 Case Study: The BBN Butterfly
4.6 Bibliographic Notes
4.7 Exercises
Chapter 5 Torus Networks
5.1 The Structure of Torus Networks
5.2 Performance
5.3 Building Mesh and Torus Networks
5.4 Express Cubes
5.5 Case Study: The MIT J-Machine
5.6 Bibliographic Notes
5.7 Exercises
Chapter 6 Non-Blocking Networks
6.1 Non-Blocking vs. Non-Interfering Networks
6.2 Crossbar Networks
6.3 Clos Networks
6.4 Benes Networks
6.5 Sorting Networks
6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch
6.7 Bibliographic Notes
6.8 Exercises
Chapter 7 Slicing and Dicing
7.1 Concentrators and Distributors
7.2 Slicing and Dicing
7.3 Slicing Multistage Networks
7.4 Case Study: Bit Slicing in the Tiny Tera
7.5 Bibliographic Notes
7.6 Exercises
Chapter 8 Routing Basics
8.1 A Routing Example
8.2 Taxonomy of Routing Algorithms
8.3 The Routing Relation
8.4 Deterministic Routing
8.5 Case Study: Dimension-Order Routing in the Cray T3D
8.6 Bibliographic Notes
8.7 Exercises
Chapter 9 Oblivious Routing
9.1 Valiant's Randomized Routing Algorithm
9.2 Minimal Oblivious Routing
9.3 Load-Balanced Oblivious Routing
9.4 Analysis of Oblivious Routing
9.5 Case Study: Oblivious Routing in the
Avici Terabit Switch Router(TSR)
9.6 Bibliographic Notes
9.7 Exercises
Chapter 10 Adaptive Routing
10.1 Adaptive Routing Basics
10.2 Minimal Adaptive Routing
10.3 Fully Adaptive Routing
10.4 Load-Balanced Adaptive Routing
10.5 Search-Based Routing
10.6 Case Study: Adaptive Routing in the
Thinking Machines CM-5
10.7 Bibliographic Notes
10.8 Exercises
Chapter 11 Routing Mechanics
11.1 Table-Based Routing
11.2 Algorithmic Routing
11.3 Case Study: Oblivious Source Routing in the
IBM Vulcan Network
11.4 Bibliographic Notes
11.5 Exercises
Chapter 12 Flow Control Basics
12.1 Resources and Allocation Units
12.2 Bufferless Flow Control
12.3 Circuit Switching
12.4 Bibliographic Notes
12.5 Exercises
Chapter 13 Buffered Flow Control
13.1 Packet-Buffer Flow Control
13.2 Flit-Buffer Flow Control
13.3 Buffer Management and Backpressure
13.4 Flit-Reservation Flow Control
13.5 Bibliographic Notes
13.6 Exercises
Chapter 14 Deadlock and Livelock
14.1 Deadlock
14.2 Deadlock Avoidance
14.3 Adaptive Routing
14.4 Deadlock Recovery
14.5 Livelock
14.6 Case Study: Deadlock Avoidance in the Cray T3E
14.7 Bibliographic Notes
14.8 Exercises
Chapter 15 Quality of Service
15.1 Service Classes and Service Contracts
15.2 Burstiness and Network Delays
15.3 Implementation of Guaranteed Services
15.4 Implementation of Best-Effort Services
15.5 Separation of Resources
15.6 Case Study: ATM Service Classes
15.7 Case Study: Virtual Networks in the Avici TSR
15.8 Bibliographic Notes
15.9 Exercises
Chapter 16 Router Architecture
16.1 Basic Router Architecture
16.2 Stalls
16.3 Closing the Loop with Credits
16.4 Reallocating a Channel
16.5 Speculation and Lookahead
16.6 Flit and Credit Encoding
16.7 Case Study: The Alpha 21364 Router
16.8 Bibliographic Notes
16.9 Exercises
Chapter 17 Router Datapath Components
17.1 Input Buffer Organization
17.2 Switches
17.3 Output Organization
17.4 Case Study: The Datapath of the IBM Colony
Router
17.5 Bibliographic Notes
17.6 Exercises
Chapter 18 Arbitration
18.1 Arbitration Timing
18.2 Fairness
18.3 Fixed Priority Arbiter
18.4 Variable Priority Iterative Arbiters
18.5 Matrix Arbiter
18.6 Queuing Arbiter
18.7 Exercises
Chapter 19 Allocation
19.1 Representations
19.2 Exact Algorithms
19.3 Separable Allocators
19.4 Wavefront Allocator
19.5 Incremental vs. Batch Allocation
19.6 Multistage Allocation
19.7 Performance of Allocators
19.8 Case Study: The Tiny Tera Allocator
19.9 Bibliographic Notes
19.10 Exercises
Chapter 20 Network Interfaces
20.1 Processor-Network Interface
20.2 Shared-Memory Interface
20.3 Line-Fabric Interface
20.4 Case Study: The MIT M-Machine Network Interface
20.5 Bibliographic Notes
20.6 Exercises
Chapter 21 Error Control 411
21.1 Know Thy Enemy: Failure Modes and Fault Models
21.2 The Error Control Process: Detection, Containment,
and Recovery
21.3 Link Level Error Control
21.4 Router Error Control
21.5 Network-Level Error Control
21.6 End-to-end Error Control
21.7 Bibliographic Notes
21.8 Exercises
Chapter 22 Buses
22.1 Bus Basics
22.2 Bus Arbitration
22.3 High Performance Bus Protocol
22.4 From Buses to Networks
22.5 Case Study: The PCI Bus
22.6 Bibliographic Notes
22.7 Exercises
Chapter 23 Performance Analysis
23.1 Measures of Interconnection Network Performance
23.2 Analysis
23.3 Validation
23.4 Case Study: Efficiency and Loss in the
BBN Monarch Network
23.5 Bibliographic Notes
23.6 Exercises
Chapter 24 Simulation
24.1 Levels of Detail
24.2 Network Workloads
24.3 Simulation Measurements
24.4 Simulator Design
24.5 Bibliographic Notes
24.6 Exercises
Chapter 25 Simulation Examples 495
25.1 Routing
25.2 Flow Control Performance
25.3 Fault Tolerance
Appendix A Nomenclature
Appendix B Glossary
Appendix C Network Simulator
3,233 citations
••
TL;DR: This paper presents a scheduling algorithm called iSLIP, an iterative, round-robin algorithm that can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware, and describes the implementation complexity of the algorithm.
Abstract: An increasing number of high performance internetworking protocol routers, LAN and asynchronous transfer mode (ATM) switches use a switched backplane based on a crossbar switch. Most often, these systems use input queues to hold packets waiting to traverse the switching fabric. It is well known that if simple first in first out (FIFO) input queues are used to hold packets then, even under benign conditions, head-of-line (HOL) blocking limits the achievable bandwidth to approximately 58.6% of the maximum. HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. A scheduling algorithm is used to configure the crossbar switch, deciding the order in which packets will be served. Previous results have shown that with a suitable scheduling algorithm, 100% throughput can be achieved. In this paper, we present a scheduling algorithm called iSLIP. An iterative, round-robin algorithm, iSLIP can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware. Iterative and noniterative versions of the algorithms are presented, along with modified versions for prioritized traffic. Simulation results are presented to indicate the performance of iSLIP under benign and bursty traffic conditions. Prototype and commercial implementations of iSLIP exist in systems with aggregate bandwidths ranging from 50 to 500 Gb/s. When the traffic is nonuniform, iSLIP quickly adapts to a fair scheduling policy that is guaranteed never to starve an input queue. Finally, we describe the implementation complexity of iSLIP. Based on a two-dimensional (2-D) array of priority encoders, single-chip schedulers have been built supporting up to 32 ports, and making approximately 100 million scheduling decisions per second.
1,277 citations
•
05 Mar 2012
TL;DR: Computer Networking: A Top-Down Approach Featuring the Internet explains the engineering problems that are inherent in communicating digital information from point to point, and presents the mathematics that determine the best path, show some code that implements those algorithms, and illustrate the logic by using excellent conceptual diagrams.
Abstract: Certain data-communication protocols hog the spotlight, but all of them have a lot in common. Computer Networking: A Top-Down Approach Featuring the Internet explains the engineering problems that are inherent in communicating digital information from point to point. The top-down approach mentioned in the subtitle means that the book starts at the top of the protocol stack--at the application layer--and works its way down through the other layers, until it reaches bare wire. The authors, for the most part, shun the well-known seven-layer Open Systems Interconnection (OSI) protocol stack in favor of their own five-layer (application, transport, network, link, and physical) model. It's an effective approach that helps clear away some of the hand waving traditionally associated with the more obtuse layers in the OSI model. The approach is definitely theoretical--don't look here for instructions on configuring Windows 2000 or a Cisco router--but it's relevant to reality, and should help anyone who needs to understand networking as a programmer, system architect, or even administration guru.The treatment of the network layer, at which routing takes place, is typical of the overall style. In discussing routing, authors James Kurose and Keith Ross explain (by way of lots of clear, definition-packed text) what routing protocols need to do: find the best route to a destination. Then they present the mathematics that determine the best path, show some code that implements those algorithms, and illustrate the logic by using excellent conceptual diagrams. Real-life implementations of the algorithms--including Internet Protocol (both IPv4 and IPv6) and several popular IP routing protocols--help you to make the transition from pure theory to networking technologies. --David WallTopics covered: The theory behind data networks, with thorough discussion of the problems that are posed at each level (the application layer gets plenty of attention). For each layer, there's academic coverage of networking problems and solutions, followed by discussion of real technologies. Special sections deal with network security and transmission of digital multimedia.
1,079 citations
••
01 Oct 1998TL;DR: Two new algorithms for solving the least cost matching filter problem at high speeds are described, based on a grid-of-tries construction and works optimally for processing filters consisting of two prefix fields using linear space.
Abstract: In Layer Four switching, the route and resources allocated to a packet are determined by the destination address as well as other header fields of the packet such as source address, TCP and UDP port numbers. Layer Four switching unifies firewall processing, RSVP style resource reservation filters, QoS Routing, and normal unicast and multicast forwarding into a single framework. In this framework, the forwarding database of a router consists of a potentially large number of filters on key header fields. A given packet header can match multiple filters, so each filter is given a cost, and the packet is forwarded using the least cost matching filter.In this paper, we describe two new algorithms for solving the least cost matching filter problem at high speeds. Our first algorithm is based on a grid-of-tries construction and works optimally for processing filters consisting of two prefix fields (such as destination-source filters) using linear space. Our second algorithm, cross-producting, provides fast lookup times for arbitrary filters but potentially requires large storage. We describe a combination scheme that combines the advantages of both schemes. The combination scheme can be optimized to handle pure destination prefix filters in 4 memory accesses, destination-source filters in 8 memory accesses worst case, and all other filters in 11 memory accesses in the typical case.
625 citations
••
TL;DR: The main technique, controlled prefix expansion, transforms a set of prefixes into an equivalent set with fewer prefix lengths, and optimization techniques based on dynamic programming, and local transformations of data structures to improve cache behavior are used.
Abstract: Internet (IP) address lookup is a major bottleneck in high-performance routers. IP address lookup is challenging because it requires a longest matching prefix lookup. It is compounded by increasing routing table sizes, increased traffic, higher-speed links, and the migration to 128-bit IPv6 addresses. We describe how IP lookups and updates can be made faster using a set of of transformation techniques. Our main technique, controlled prefix expansion, transforms a set of prefixes into an equivalent set with fewer prefix lengths. In addition, we use optimization techniques based on dynamic programming, and local transformations of data structures to improve cache behavior. When applied to trie search, our techniques provide a range of algorithms (Expanded Tries) whose performance can be tuned. For example, using a processor with 1MB of L2 cache, search of the MaeEast database containing 38000 prefixes can be done in 3 L2 cache accesses. On a 300MHz Pentium II which takes 4 cycles for accessing the first word of the L2 cacheline, this algorithm has a worst-case search time of 180 nsec., a worst-case insert/delete time of 2.5 msec., and an average insert/delete time of 4 usec. Expanded tries provide faster search and faster insert/delete times than earlier lookup algirthms. When applied to Binary Search on Levels, our techniques improve worst-case search times by nearly a factor of 2 (using twice as much storage) for the MaeEast database. Our approach to algorithm design is based on measurements using the VTune tool on a Pentium to obtain dynamic clock cycle counts. Our techniques also apply to similar address lookup problems in other network protocols.
514 citations