Journal Article
The Combined Input-Output Queued Crossbar Architecture for High-Radix On-Chip Switches
TLDR
This article proposes a novel microarchitecture that allows flat crossbar switches to scale to 128 ports, supporting 32 Gbits per second per port (Gbps/port) while occupying 4.9 mm² and consuming 4.2 W. It also compares CIOQ with Swizzle Switch prototypes and demonstrates the potential of high-radix crossbars for system-on-chip interconnects.
Abstract:
High-radix, single-chip routers have emerged as efficient building blocks for interconnection networks. Researchers believe that hierarchical switch architectures are needed at high radices because crossbar cost grows with the square of the router radix. This article proposes a novel microarchitecture that allows flat crossbar switches to scale to 128 ports, supporting 32 Gbits per second per port (Gbps/port) while occupying 4.9 mm² and consuming 4.2 W, or supporting 64 Gbps/port at 7.5 mm² and 7.5 W, in 45-nm CMOS. Key features include deep crossbar pipelining to cope with wire delay, a novel cross-scheduler architecture to reduce wiring complexity, and catalytic custom gate placement within standard electronic design automation (EDA) flows. Furthermore, on a chip, crossbar speedup combined with input-output queuing (CIOQ) is better than hierarchical queuing, providing top performance at orders-of-magnitude lower memory cost. Finally, the authors compare CIOQ with Swizzle Switch prototypes and demonstrate the potential of high-radix crossbars for system-on-chip interconnects.
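The figures quoted in the abstract can be turned into a few derived numbers: the quadratic growth of a flat crossbar, the aggregate switching capacity, and the energy per switched bit. The sketch below is a back-of-the-envelope calculation using only the abstract's values; it assumes that crossbar cost is approximated by the N x N crosspoint count (real area also depends on wiring, datapath width, and pipelining), and the function names are illustrative.

```python
# Back-of-the-envelope figures derived only from the numbers quoted in the abstract.
# Assumption: "crossbar cost" is approximated by the N*N crosspoint count of a flat
# crossbar; actual area also depends on wiring, datapath width and pipelining.

PORTS = 128

# (label, per-port rate in Gb/s, area in mm^2, power in W), as stated for 45-nm CMOS
CONFIGS = [
    ("32 Gbps/port", 32, 4.9, 4.2),
    ("64 Gbps/port", 64, 7.5, 7.5),
]


def crosspoints(n: int) -> int:
    """A flat n x n crossbar has one crosspoint per (input, output) pair."""
    return n * n


def summarize(ports: int, rate_gbps: float, area_mm2: float, power_w: float):
    """Return (aggregate Tb/s, energy in pJ/bit, Tb/s per mm^2)."""
    aggregate_gbps = ports * rate_gbps                       # total switching capacity
    tbps = aggregate_gbps / 1000
    pj_per_bit = power_w / (aggregate_gbps * 1e9) * 1e12     # W / (bit/s) = J/bit -> pJ/bit
    tbps_per_mm2 = tbps / area_mm2
    return tbps, pj_per_bit, tbps_per_mm2


if __name__ == "__main__":
    print(f"{PORTS}-port flat crossbar: {crosspoints(PORTS)} crosspoints")
    print(f"doubling to {2 * PORTS} ports: {crosspoints(2 * PORTS)} crosspoints (4x)")
    for label, rate, area, power in CONFIGS:
        tbps, pj, density = summarize(PORTS, rate, area, power)
        print(f"{label}: {tbps:.3f} Tb/s aggregate, {pj:.2f} pJ/bit, "
              f"{density:.2f} Tb/s per mm^2")
```

With these inputs, the 32 Gbps/port configuration works out to about 4.1 Tb/s at roughly 1.03 pJ/bit, and the 64 Gbps/port configuration to about 8.2 Tb/s at roughly 0.92 pJ/bit. Doubling the port rate therefore does not double the energy per bit in the quoted design points.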
Citations
Proceedings Article
Enhanced Overloaded CDMA Interconnect (OCI) Bus Architecture for On-Chip Communication
TL;DR: This work proposes the Difference-Overloaded CDMA Interconnect (D-OCI) bus, which leverages the balancing property of Walsh codes to increase the number of interconnected elements by 50%, and motivates Code Division Multiple Access (CDMA) as a bus-sharing strategy that offers many advantages over other topologies.
Journal Article
Ping-lock round robin arbiter
TL;DR: PLA, an improved IPPA, offers fair arbitration under any distribution of active requests and has the advantage of low execution delay; its FPGA and ASIC implementations show up to 18% and 12% improvement, respectively, in average delay compared with existing round-robin arbiters (RRAs) in the literature.
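To make the notion of fair round-robin arbitration concrete, here is a minimal sketch of a generic pointer-based round-robin arbiter. It is not the PLA or IPPA design (whose internals are not described here); the class and method names are hypothetical. The pointer marks the highest-priority requester, and after a grant to requester g the priority moves to g + 1, so every continuously active requester is served within n grants regardless of how requests are distributed.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Generic pointer-based round-robin arbiter (illustrative sketch, not PLA/IPPA)."""

    def __init__(self, n: int) -> None:
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.pointer = 0  # index of the requester with the highest priority

    def arbitrate(self, requests: Sequence[bool]) -> Optional[int]:
        """Grant the first active requester at or after the pointer, cyclically."""
        if len(requests) != self.n:
            raise ValueError("expected one request bit per requester")
        for offset in range(self.n):
            candidate = (self.pointer + offset) % self.n
            if requests[candidate]:
                # Move priority just past the winner so it becomes lowest priority next time.
                self.pointer = (candidate + 1) % self.n
                return candidate
        return None  # no active requests; pointer is left unchanged


if __name__ == "__main__":
    arb = RoundRobinArbiter(4)
    reqs = [True, False, True, True]
    print([arb.arbitrate(reqs) for _ in range(6)])  # grants cycle over requesters 0, 2, 3
```

A hardware arbiter would evaluate all candidates in parallel (for example with a priority encoder over a rotated request vector) rather than loop; the loop here only models the grant decision, not its delay.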
Journal Article
Low power hardware implementations for network packet processing elements
TL;DR: This paper proposes a new crossbar-switch-based packet classification scheme and a 2-to-1-multiplexer-based buffered crossbar backplane that achieve high performance with significant power reduction over existing designs.
Posted Content
SERENADE: A Parallel Randomized Algorithm Suite for Crossbar Scheduling in Input-Queued Switches
TL;DR: This work proposes SERENADE (SERENA, the Distributed Edition), a parallel algorithm suite that emulates SERENA in only O(log N) iterations between input ports and output ports, and hence has a time complexity of only O(log N) per port.
Proceedings Article
DeepHiR: improving high-radix router throughput with deep hybrid memory buffer microarchitecture
TL;DR: This work proposes DeepHiR, a novel deep hybrid (STT-MRAM and SRAM) buffer organization combined with a centralized buffer organization, which provides high performance at minimal cost and leverages an alternative memory technology to increase buffer capacity in high-radix routers.
References
Journal Article
The future of wires
R. Ho, Ken Mai, Mark Horowitz
TL;DR: Wires that shorten in length as technologies scale have delays that either track gate delays or grow slowly relative to gate delays, which is good news since these "local" wires dominate chip wiring.
Proceedings Article
Express virtual channels: towards the ideal interconnection fabric
TL;DR: This paper proposes express virtual channels (EVCs), a novel flow control mechanism which allows packets to virtually bypass intermediate routers along their path in a completely non-speculative fashion, thereby lowering the energy/delay towards that of a dedicated wire while simultaneously approaching ideal throughput with a practical design suitable for on-chip networks.
Posted Content
The Tiny Tera: A Packet Switch Core
TL;DR: The Tiny Tera is a CMOS-based, input-queued, fixed-size packet switch suitable for a wide range of applications, such as a high-performance ATM switch, the core of an Internet router, or a fast multiprocessor interconnect.
Journal Article
Tiny Tera: a packet switch core
TL;DR: Tiny Tera is an input-buffered switch, which makes it the highest-bandwidth switch possible given a particular CMOS and memory technology, but it does not support multicasting.
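An input-queued crossbar such as the one described in the Tiny Tera entries needs a scheduler that computes a conflict-free input-to-output matching every cell time. As an illustration of the general technique only, the sketch below implements one iteration of iSLIP-style request-grant-accept matching over virtual output queues (VOQs). It makes no claim about the scheduler Tiny Tera actually uses; the class and method names are hypothetical, and the caller is assumed to dequeue the matched cells.

```python
from typing import List, Optional


class ISLIPScheduler:
    """Single-iteration iSLIP-style matcher for an N x N input-queued crossbar (sketch).

    voq[i][j] is the number of cells waiting at input i for output j; keeping one
    queue per output (VOQs) avoids head-of-line blocking at the inputs.
    """

    def __init__(self, n: int) -> None:
        self.n = n
        self.grant_ptr = [0] * n   # round-robin pointer of each output
        self.accept_ptr = [0] * n  # round-robin pointer of each input

    def _rr_pick(self, ptr: int, eligible: List[bool]) -> Optional[int]:
        """First eligible index at or after ptr, searching cyclically."""
        for offset in range(self.n):
            k = (ptr + offset) % self.n
            if eligible[k]:
                return k
        return None

    def schedule(self, voq: List[List[int]]) -> List[Optional[int]]:
        """Return match[i] = output matched to input i (or None)."""
        n = self.n
        # Request + grant: each output grants one requesting input, round-robin.
        grants: List[Optional[int]] = [None] * n
        for j in range(n):
            requesting = [voq[i][j] > 0 for i in range(n)]
            grants[j] = self._rr_pick(self.grant_ptr[j], requesting)
        # Accept: each input accepts one granting output, round-robin.
        match: List[Optional[int]] = [None] * n
        for i in range(n):
            granting = [grants[j] == i for j in range(n)]
            j = self._rr_pick(self.accept_ptr[i], granting)
            if j is not None:
                match[i] = j
                # Pointers move only for accepted grants; this desynchronizes the
                # output pointers under sustained load.
                self.accept_ptr[i] = (j + 1) % n
                self.grant_ptr[j] = (i + 1) % n
        return match


if __name__ == "__main__":
    sched = ISLIPScheduler(2)
    full = [[1, 1], [1, 1]]  # every VOQ backlogged
    for _ in range(3):
        print(sched.schedule(full))
```

With every VOQ backlogged, the first cell time yields only a partial match, but the pointer updates spread the outputs' priorities apart, so later cell times reach a full match. Practical designs run several iterations per cell time to fill unmatched ports.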
Journal Article
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
TL;DR: The proposed Virtual Circuit Tree Multicasting (VCTM) router is flexible enough to improve interconnect performance for a broad spectrum of multicasting scenarios, and achieves these benefits with straightforward and inexpensive extensions to a state-of-the-art packet-switched router.