Proceedings ArticleDOI

Cube: A 512-FPGA cluster

TL;DR: Cube, a massively parallel FPGA-based platform, is presented; its RC4 key search engine can perform a full search of the 40-bit key space within 3 minutes, 359 times faster than a multi-threaded software implementation running on a 2.5 GHz Intel Quad-Core Xeon processor.
Abstract: Cube, a massively parallel FPGA-based platform, is presented. The machine is built from boards each containing 64 FPGA devices, and eight boards can be connected in a cube structure for a total of 512 FPGA devices. With high-bandwidth systolic inter-FPGA communication and a flexible programming scheme, the result is a low-power, high-density, and scalable supercomputing machine suitable for various large-scale parallel applications. An RC4 key search engine was built as a demonstration application. In a fully implemented Cube, the engine can perform a full search of the 40-bit key space within 3 minutes, 359 times faster than a multi-threaded software implementation running on a 2.5 GHz Intel Quad-Core Xeon processor.
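To make the demonstration concrete, below is a minimal Python sketch of an exhaustive 40-bit RC4 key search of the kind the engine performs in hardware. The RC4 key schedule and keystream generation follow the standard algorithm; the known-keystream-prefix test, the slicing of the key space, and all function names are illustrative assumptions rather than details taken from the paper.

```python
def rc4_keystream(key: bytes, n: int) -> bytes:
    """Generate the first n RC4 keystream bytes for the given key."""
    S = list(range(256))
    j = 0
    for i in range(256):                          # key-scheduling algorithm (KSA)
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    out, i, j = bytearray(), 0, 0
    for _ in range(n):                            # pseudo-random generation (PRGA)
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) & 0xFF])
    return bytes(out)

def search_slice(known_prefix: bytes, start: int, count: int):
    """Scan one slice of the 40-bit key space for a key that reproduces the prefix."""
    for k in range(start, start + count):
        key = k.to_bytes(5, "big")                # 40-bit key = 5 bytes
        if rc4_keystream(key, len(known_prefix)) == known_prefix:
            return key
    return None
```

In hardware, many such search cores run concurrently on disjoint slices of the key space, which is where the reported speedup over the quad-core software baseline comes from.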


Citations
Proceedings ArticleDOI
Jens Teubner, Rene Mueller
12 Jun 2011
TL;DR: This work presents handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution and gives a new intuition of window semantics, which it believes could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
Abstract: In spite of the omnipresence of parallel (multi-core) systems, the predominant strategy to evaluate window-based stream joins is still strictly sequential, mostly just straightforward along the definition of the operation semantics. In this work we present handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution. Handshake join naturally leverages available hardware parallelism, which we demonstrate with an implementation on a modern multi-core system and on top of field-programmable gate arrays (FPGAs), an emerging technology that has shown distinctive advantages for high-throughput data processing. On the practical side, we provide a join implementation that substantially outperforms CellJoin (the fastest published result) and that will directly turn any degree of parallelism into higher throughput or larger supported window sizes. On the semantic side, our work gives a new intuition of window semantics, which we believe could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
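For intuition, here is a minimal, sequential Python sketch of the handshake-join dataflow described in the abstract: tuples of stream R flow left to right through a chain of segments, tuples of stream S flow right to left, and a tuple probes the opposite stream's store whenever it enters a segment. Count-based windows, the segment count, and the predicate interface are illustrative assumptions; a real parallel implementation additionally needs a hand-over protocol between neighbouring units, which this sketch sidesteps by being sequential.

```python
from collections import deque

class HandshakeJoin:
    def __init__(self, num_segments, r_window, s_window, predicate):
        self.k = num_segments
        self.pred = predicate
        # effective window size is approximately num_segments * per-segment capacity
        self.r_cap = max(1, r_window // num_segments)
        self.s_cap = max(1, s_window // num_segments)
        self.r_seg = [deque() for _ in range(num_segments)]  # R: newest left, oldest right
        self.s_seg = [deque() for _ in range(num_segments)]  # S: newest right, oldest left

    def _probe(self, tup, store):
        # join the arriving tuple against the opposite-stream tuples stored here;
        # the predicate is assumed symmetric (e.g., an equi-join)
        return [(tup, other) for other in store if self.pred(tup, other)]

    def insert_r(self, r):
        results, moving, seg = [], r, 0
        while moving is not None and seg < self.k:
            results += self._probe(moving, self.s_seg[seg])   # R tuple meets S tuples here
            self.r_seg[seg].appendleft(moving)
            over = len(self.r_seg[seg]) > self.r_cap
            moving = self.r_seg[seg].pop() if over else None  # displaced tuple drifts right
            seg += 1                                          # past the last segment it expires
        return results

    def insert_s(self, s):
        results, moving, seg = [], s, self.k - 1
        while moving is not None and seg >= 0:
            results += self._probe(moving, self.r_seg[seg])   # S tuple meets R tuples here
            self.s_seg[seg].append(moving)
            over = len(self.s_seg[seg]) > self.s_cap
            moving = self.s_seg[seg].popleft() if over else None  # displaced tuple drifts left
            seg -= 1
        return results

# Usage (hypothetical equi-join on the first field):
# hj = HandshakeJoin(num_segments=4, r_window=64, s_window=64,
#                    predicate=lambda a, b: a[0] == b[0])
# matches = hj.insert_r(("k1", "payload")) + hj.insert_s(("k1", "other"))
```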

144 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC) and are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency.
Abstract: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. However, this approach limits the number of FPGAs per node and hinders the acceleration of large-scale distributed applications.

66 citations


Cites background from "Cube: A 512-FPGA cluster"

  • ...Usually, when several FPGAs are required for an application, either those FPGAs are soldered on a PCB [15] [16] [17] [18] or connected in a network on a fixed topology [19]...

Patent
29 Jun 2015
TL;DR: A method for processing a deep neural network on an acceleration component is presented; it includes configuring the acceleration component to perform the forward propagation and backpropagation stages of the deep neural network.
Abstract: A method is provided for processing a deep neural network on an acceleration component. The method includes configuring the acceleration component to perform the forward propagation and backpropagation stages of the deep neural network. The acceleration component includes an acceleration component die and a memory stack disposed in an integrated circuit package. The memory stack has a memory bandwidth greater than about 50 GB/sec and a power efficiency of greater than about 20 MB/sec/mW.
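As a point of reference for what the configured forward-propagation and backpropagation stages compute, here is a minimal NumPy sketch of one training step for a single-hidden-layer network. The layer sizes, activation, loss, and learning rate are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((784, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.standard_normal((128, 10)) * 0.01, np.zeros(10)

def forward(x):
    h_pre = x @ W1 + b1
    h = np.maximum(h_pre, 0.0)                       # ReLU activation
    logits = h @ W2 + b2
    return h_pre, h, logits

def backward(x, h_pre, h, logits, y_onehot, lr=0.01):
    global W1, b1, W2, b2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    d_logits = (probs - y_onehot) / x.shape[0]       # softmax + cross-entropy gradient
    dW2, db2 = h.T @ d_logits, d_logits.sum(axis=0)
    d_h = d_logits @ W2.T
    d_h_pre = d_h * (h_pre > 0)                      # ReLU gradient
    dW1, db1 = x.T @ d_h_pre, d_h_pre.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1                   # gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2

# One training step on a random mini-batch (illustrative only):
# x = rng.standard_normal((32, 784)); y = np.eye(10)[rng.integers(0, 10, 32)]
# backward(x, *forward(x), y)
```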

48 citations

Proceedings ArticleDOI
01 Sep 2016
TL;DR: This report discusses the motivation behind and particular objectives of Novo-G#, the work completed so far, the products of that work, and their potential impact, and closes with an invitation to join the project users group.
Abstract: While High-Performance Computing is ever more pervasive and effective, computing capability is currently only a small fraction of what is needed. Three fundamental issues limiting performance are computational efficiency, power density, and communication latency. All of these issues are being addressed through increased heterogeneity, but the last in particular by integrating communication into the accelerator. This integration enables direct and programmable communication among compute components. Novo-G# is a large-scale FPGA-centric cluster being built to investigate and develop architectures, system and tool infrastructure, and applications for this model. In this report we discuss the motivation behind and particular objectives of Novo-G#, the work completed so far, the products of that work, and their potential impact. We end with a description of and an invitation to join the Novo-G# Forum, the project users group.

45 citations


Cites background from "Cube: A 512-FPGA cluster"

  • ...Yet computing capability is currently only a small fraction of what is needed: e.g., detailed biological simulations are limited to small numbers of macro-molecules; additional factors of millions are needed to simulate cells and far more than that for larger structures....

Journal ArticleDOI
TL;DR: An important result of the paper is to demonstrate how the inherent massive parallelism of FPGAs can improve performance of existing algorithms but only after a fundamental redesign of the algorithms.
Abstract: Computing frequent items is an important problem by itself and as a subroutine in several data mining algorithms. In this paper, we explore how to accelerate the computation of frequent items using field-programmable gate arrays (FPGAs) with a threefold goal: increase performance over existing solutions, reduce energy consumption over CPU-based systems, and explore the design space in detail as the constraints on FPGAs are very different from those of traditional software-based systems. We discuss three design alternatives, each one of them exploiting different FPGA features and each one providing different performance/scalability trade-offs. An important result of the paper is to demonstrate how the inherent massive parallelism of FPGAs can improve performance of existing algorithms but only after a fundamental redesign of the algorithms. Our experimental results show that, e.g., the pipelined solution we introduce can reach more than 100 million tuples per second of sustained throughput (four times the best available results to date) by making use of techniques that are not available to CPU-based solutions. Moreover, and unlike in software approaches, the high throughput is independent of the skew of the Zipf distribution of the input and comes at a far lower energy cost.
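For context, a common counter-based baseline for the frequent-items problem is the Space-Saving algorithm, sketched below in Python; treating it as representative of the algorithms such FPGA designs rework is an assumption, since the abstract does not name the exact baseline.

```python
def space_saving(stream, k):
    """Summarize a stream with at most k counters (Space-Saving algorithm)."""
    counters = {}                                 # item -> (count, overestimation error)
    for item in stream:
        if item in counters:
            count, err = counters[item]
            counters[item] = (count + 1, err)     # monitored item: just increment
        elif len(counters) < k:
            counters[item] = (1, 0)               # free slot: start monitoring
        else:
            victim = min(counters, key=lambda i: counters[i][0])
            min_count, _ = counters.pop(victim)   # evict the least-counted item
            counters[item] = (min_count + 1, min_count)  # inherit its count as the error bound
    return counters

# Usage: any item with true frequency above N/k is guaranteed to appear in the summary.
# print(space_saving("abracadabra", k=4))
```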

42 citations


Cites background from "Cube: A 512-FPGA cluster"

  • ...Scalability beyond the available space of a single chip could be achieved by hardware solutions that daisy-chain multiple FPGA chips (such as the BEE3 [15] or Cube [16] systems)....


References
Journal ArticleDOI
TL;DR: The Berkeley Emulation Engine 2 (BEE2) project is developing a reusable, modular, and scalable framework for designing high-end reconfigurable computers, including a processing-module building block and several programming models.
Abstract: The Berkeley Emulation Engine 2 (BEE2) project is developing a reusable, modular, and scalable framework for designing high-end reconfigurable computers, including a processing-module building block and several programming models. Using these elements, BEE2 can provide over 10 times more computing throughput than a DSP-based system with similar power consumption and cost and over 100 times that of a microprocessor-based system.

304 citations

Journal ArticleDOI
TL;DR: The GRAPE-6 system is a massively parallel special-purpose computer for astrophysical N-body simulations; as the successor of GRAPE-4, it integrates more pipelines per chip, raises the clock speed, and increases the chip count to reach a peak speed of 64 Tflops.
Abstract: In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively parallel special-purpose computer for astrophysical N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional systems, though it can be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are: (a) the processor chip of GRAPE-6 integrates six force-calculation pipelines, compared to the single pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction); (b) the clock speed is increased from 32 to 90 MHz; and (c) the total number of processor chips is increased from 1728 to 2048. These improvements result in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.

199 citations


"Cube: A 512-FPGA cluster" refers methods in this paper

  • ...In 1995, the measured peak performance of a completed GRAPE-4 system was reported as 1.08 Tflops [2]....

  • ...In 2002, the GRAPE-6 system with 1728 to 2048 processors achieved 64 Tflops [3]....

  • ...[1] J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto, “GRAPE 4: a one-Tflops special-purpose computer for astrophysical N-body problem,” Supercomputing Proceedings, pp. 429– 438, Nov. 1994....

Journal ArticleDOI
TL;DR: The architecture and performance of the GRAPE-4 system, a massively parallel special-purpose computer for N-body simulation of gravitational collisional systems, is described.
Abstract: In this paper, we describe the architecture and performance of the GRAPE-4 system, a massively parallel special-purpose computer for N-body simulation of gravitational collisional systems. The calculation cost of an N-body simulation of a collisional self-gravitating system is O(N^3). Thus, even with present-day supercomputers, the number of particles one can handle is still around 10,000. In N-body simulations, almost all computing time is spent calculating the force between particles, since the number of interactions is proportional to the square of the number of particles. The computational cost of the rest of the simulation, such as the time integration and the reduction of the result, is generally proportional to the number of particles. The calculation of the force between particles can be greatly accelerated by means of dedicated special-purpose hardware. We have developed a series of hardware systems, the GRAPE (GRAvity PipE) systems, which perform the force calculation. They are used with a general-purpose host computer which performs the rest of the calculation. The GRAPE-4 system is our newest hardware, completed in the summer of 1995. Its peak speed is 1.08 TFLOPS. This speed is achieved by running 1692 pipeline large-scale integrated circuits (LSIs), each providing 640 MFLOPS, in parallel.
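The dominant cost the GRAPE pipelines take off the host is this all-pairs force calculation; a minimal Python sketch of that O(N^2)-per-step kernel is shown below, with a small softening term. The gravitational constant and softening length are illustrative assumptions.

```python
import numpy as np

def accelerations(pos, mass, G=1.0, eps=1e-3):
    """pos: (N, 3) positions, mass: (N,) masses -> (N, 3) gravitational accelerations."""
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        d = pos - pos[i]                          # vectors from particle i to all others
        r2 = (d * d).sum(axis=1) + eps * eps      # softened squared distances
        inv_r3 = r2 ** -1.5
        inv_r3[i] = 0.0                           # exclude the self-interaction
        acc[i] = G * (d * (mass * inv_r3)[:, None]).sum(axis=0)
    return acc
```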

193 citations

Journal ArticleDOI
Tim Güneysu, Timo Kasper, Martin Novotny, Christof Paar, Andy Rupp
TL;DR: This work describes various exhaustive key search attacks on symmetric ciphers, demonstrates an attack on a security mechanism employed in the electronic passport, and introduces efficient implementations of more complex cryptanalysis of asymmetric cryptosystems, e.g., elliptic curve cryptosystems (ECCs) and number cofactorization for RSA.
Abstract: Cryptanalysis of ciphers usually involves massive computations. The security parameters of cryptographic algorithms are commonly chosen so that attacks are infeasible with available computing resources. Thus, in the absence of mathematical breakthroughs to a cryptanalytical problem, a promising way for tackling the computations involved is to build special-purpose hardware exhibiting a (much) better performance-cost ratio than off-the-shelf computers. This contribution presents a variety of cryptanalytical applications utilizing the cost-optimized parallel code breaker (COPACOBANA) machine, which is a high-performance low-cost cluster consisting of 120 field-programmable gate arrays (FPGAs). COPACOBANA appears to be the only such reconfigurable parallel FPGA machine optimized for code breaking tasks reported in the open literature. Depending on the actual algorithm, the parallel hardware architecture can outperform conventional computers by several orders of magnitude. In this work, we focus on novel implementations of cryptanalytical algorithms, utilizing the impressive computational power of COPACOBANA. We describe various exhaustive key search attacks on symmetric ciphers and demonstrate an attack on a security mechanism employed in the electronic passport (e-passport). Furthermore, we describe time-memory trade-off techniques that can, e.g., be used for attacking the popular A5/1 algorithm used in GSM voice encryption. In addition, we introduce efficient implementations of more complex cryptanalysis on asymmetric cryptosystems, e.g., elliptic curve cryptosystems (ECCs) and number cofactorization for RSA. Even though breaking RSA or elliptic curves with parameter lengths used in most practical applications is out of reach with COPACOBANA, our attacks on algorithms with artificially short bit lengths allow us to extrapolate more reliable security estimates for real-world bit lengths. This is particularly useful for deriving estimates about the longevity of asymmetric key lengths.
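Besides plain exhaustive search, the abstract mentions time-memory trade-off techniques. The sketch below shows a toy, single-table Hellman-style trade-off in Python over a deliberately tiny key space; the toy one-way function, reduction function, and table parameters are illustrative assumptions and are far simpler than what an attack on a real cipher such as A5/1 requires.

```python
import hashlib

KEY_BITS = 20                                     # deliberately tiny key space (assumption)
MASK = (1 << KEY_BITS) - 1

def toy_cipher(key: int) -> int:
    # stand-in one-way map: "ciphertext of a fixed plaintext under key"
    return int.from_bytes(hashlib.sha256(key.to_bytes(4, "big")).digest()[:3], "big")

def reduce(x: int, i: int) -> int:
    # reduction function: fold a ciphertext back into the key space, varied per chain step
    return (x ^ (i * 0x9E3779B9)) & MASK

def build_table(num_chains: int, chain_len: int, seed: int = 0) -> dict:
    table = {}
    for c in range(num_chains):
        start = (seed + c * 2654435761) & MASK
        x = start
        for i in range(chain_len):
            x = reduce(toy_cipher(x), i)
        table[x] = start                          # store only (endpoint -> start point)
    return table

def lookup(table: dict, chain_len: int, ciphertext: int):
    # online phase: for each possible chain position, walk to the endpoint,
    # then re-walk the candidate chain from its start to recover the key
    for pos in range(chain_len - 1, -1, -1):
        x = reduce(ciphertext, pos)
        for i in range(pos + 1, chain_len):
            x = reduce(toy_cipher(x), i)
        if x in table:
            k = table[x]
            for i in range(pos):
                k = reduce(toy_cipher(k), i)
            if toy_cipher(k) == ciphertext:       # false alarms are filtered here
                return k
    return None
```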

157 citations


"Cube: A 512-FPGA cluster" refers background in this paper

  • ...In a 2007 implementation [7] the system running at 136MHz can search a full 56-bit DES key space in 12....

Journal Article
TL;DR: The design and realization of the COPACOBANA (Cost-Optimized Parallel Code Breaker) machine is presented, which is optimized for running cryptanalytical algorithms and can be realized for less than US$ 10,000, and it will be shown that the architecture can outperform conventional computers by several orders of magnitude.
Abstract: Cryptanalysis of symmetric and asymmetric ciphers is computationally extremely demanding. Since the security parameters (in particular the key length) of almost all practical crypto algorithms are chosen such that attacks with conventional computers are computationally infeasible, the only promising way to tackle existing ciphers (assuming no mathematical breakthrough) is to build special-purpose hardware. Dedicating those machines to the task of cryptanalysis holds the promise of a dramatically improved cost-performance ratio so that breaking of commercial ciphers comes within reach. This contribution presents the design and realization of the COPACOBANA (Cost-Optimized Parallel Code Breaker) machine, which is optimized for running cryptanalytical algorithms and can be realized for less than US$ 10,000. It will be shown that, depending on the actual algorithm, the architecture can outperform conventional computers by several orders of magnitude. COPACOBANA hosts 120 low-cost FPGAs and is able to, e.g., perform an exhaustive key search of the Data Encryption Standard (DES) in less than nine days on average. As a real-world application, our architecture can be used to attack machine-readable travel documents (ePass). COPACOBANA is intended, but not necessarily restricted, to solving problems related to cryptanalysis. The hardware architecture is suitable for computational problems which are parallelizable and have low communication requirements. The hardware can be used, e.g., to attack elliptic curve cryptosystems and to factor numbers. Even though breaking full-size RSA (1024 bit or more) or elliptic curves (ECC with 160 bit or more) is out of reach with COPACOBANA, it can be used to analyze cryptosystems with a (deliberately chosen) small bit length to provide reliable security estimates of RSA and ECC by extrapolation.

143 citations


"Cube: A 512-FPGA cluster" refers methods in this paper

  • ...In 2006, COPACOBANA, a low cost cryptanalysis system using large numbers of FPGAs was described [6]....
