Proceedings ArticleDOI

Cube: A 512-FPGA cluster

TL;DR: Cube, a massively parallel FPGA-based platform, is presented; its RC4 key search engine can perform a full search of the 40-bit key space within 3 minutes, 359 times faster than a multi-threaded software implementation running on a 2.5 GHz Intel Quad-Core Xeon processor.
Abstract: Cube, a massively parallel FPGA-based platform, is presented. The machine is built from boards each containing 64 FPGA devices, and eight boards can be connected in a cube structure for a total of 512 FPGA devices. With high-bandwidth systolic inter-FPGA communication and a flexible programming scheme, the result is a low-power, high-density, and scalable supercomputing machine suitable for various large-scale parallel applications. An RC4 key search engine was built as a demonstration application. In a fully implemented Cube, the engine can perform a full search of the 40-bit key space within 3 minutes, 359 times faster than a multi-threaded software implementation running on a 2.5 GHz Intel Quad-Core Xeon processor.
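To make the demonstration concrete, below is a minimal Python sketch of an exhaustive 40-bit RC4 key search of the kind the engine performs in hardware. The RC4 key schedule and keystream generation follow the standard algorithm; the known-keystream-prefix test, the slicing of the key space, and all function names are illustrative assumptions rather than details taken from the paper.

```python
def rc4_keystream(key: bytes, n: int) -> bytes:
    """Generate the first n RC4 keystream bytes for the given key."""
    S = list(range(256))
    j = 0
    for i in range(256):                          # key-scheduling algorithm (KSA)
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    out, i, j = bytearray(), 0, 0
    for _ in range(n):                            # pseudo-random generation (PRGA)
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(S[(S[i] + S[j]) & 0xFF])
    return bytes(out)

def search_slice(known_prefix: bytes, start: int, count: int):
    """Scan one slice of the 40-bit key space for a key that reproduces the prefix."""
    for k in range(start, start + count):
        key = k.to_bytes(5, "big")                # 40-bit key = 5 bytes
        if rc4_keystream(key, len(known_prefix)) == known_prefix:
            return key
    return None
```

In hardware, many such search cores run concurrently on disjoint slices of the key space, which is where the reported speedup over the quad-core software baseline comes from.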


Citations
Proceedings ArticleDOI
Jens Teubner, Rene Mueller
12 Jun 2011
TL;DR: This work presents handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution and gives a new intuition of window semantics, which it believes could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
Abstract: In spite of the omnipresence of parallel (multi-core) systems, the predominant strategy to evaluate window-based stream joins is still strictly sequential, mostly just straightforward along the definition of the operation semantics. In this work we present handshake join, a way of describing and executing window-based stream joins that is highly amenable to parallelized execution. Handshake join naturally leverages available hardware parallelism, which we demonstrate with an implementation on a modern multi-core system and on top of field-programmable gate arrays (FPGAs), an emerging technology that has shown distinctive advantages for high-throughput data processing. On the practical side, we provide a join implementation that substantially outperforms CellJoin (the fastest published result) and that will directly turn any degree of parallelism into higher throughput or larger supported window sizes. On the semantic side, our work gives a new intuition of window semantics, which we believe could inspire other stream processing algorithms or ongoing standardization efforts for stream query languages.
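For intuition, here is a minimal, sequential Python sketch of the handshake-join dataflow described in the abstract: tuples of stream R flow left to right through a chain of segments, tuples of stream S flow right to left, and a tuple probes the opposite stream's store whenever it enters a segment. Count-based windows, the segment count, and the predicate interface are illustrative assumptions; a real parallel implementation additionally needs a hand-over protocol between neighbouring units, which this sketch sidesteps by being sequential.

```python
from collections import deque

class HandshakeJoin:
    def __init__(self, num_segments, r_window, s_window, predicate):
        self.k = num_segments
        self.pred = predicate
        # effective window size is approximately num_segments * per-segment capacity
        self.r_cap = max(1, r_window // num_segments)
        self.s_cap = max(1, s_window // num_segments)
        self.r_seg = [deque() for _ in range(num_segments)]  # R: newest left, oldest right
        self.s_seg = [deque() for _ in range(num_segments)]  # S: newest right, oldest left

    def _probe(self, tup, store):
        # join the arriving tuple against the opposite-stream tuples stored here;
        # the predicate is assumed symmetric (e.g., an equi-join)
        return [(tup, other) for other in store if self.pred(tup, other)]

    def insert_r(self, r):
        results, moving, seg = [], r, 0
        while moving is not None and seg < self.k:
            results += self._probe(moving, self.s_seg[seg])   # R tuple meets S tuples here
            self.r_seg[seg].appendleft(moving)
            over = len(self.r_seg[seg]) > self.r_cap
            moving = self.r_seg[seg].pop() if over else None  # displaced tuple drifts right
            seg += 1                                          # past the last segment it expires
        return results

    def insert_s(self, s):
        results, moving, seg = [], s, self.k - 1
        while moving is not None and seg >= 0:
            results += self._probe(moving, self.r_seg[seg])   # S tuple meets R tuples here
            self.s_seg[seg].append(moving)
            over = len(self.s_seg[seg]) > self.s_cap
            moving = self.s_seg[seg].popleft() if over else None  # displaced tuple drifts left
            seg -= 1
        return results

# Usage (hypothetical equi-join on the first field):
# hj = HandshakeJoin(num_segments=4, r_window=64, s_window=64,
#                    predicate=lambda a, b: a[0] == b[0])
# matches = hj.insert_r(("k1", "payload")) + hj.insert_s(("k1", "other"))
```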

144 citations

Proceedings ArticleDOI
01 Dec 2016
TL;DR: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC) and are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency.
Abstract: FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. However, this approach limits the number of FPGAs per node and hinders the acceleration of large-scale distributed applications.

66 citations


Cites background from "Cube: A 512-FPGA cluster"

  • ...Usually, when several FPGAs are required for an application, either those FPGAs are soldered on a PCB [15] [16] [17] [18] or connected in a network on a fixed topology [19]...

Patent
29 Jun 2015
TL;DR: A method for processing a deep neural network on an acceleration component is presented; it includes configuring the acceleration component to perform the forward propagation and backpropagation stages of the deep neural network.
Abstract: A method is provided for processing a deep neural network on an acceleration component. The method includes configuring the acceleration component to perform the forward propagation and backpropagation stages of the deep neural network. The acceleration component includes an acceleration component die and a memory stack disposed in an integrated circuit package. The memory stack has a memory bandwidth greater than about 50 GB/sec and a power efficiency of greater than about 20 MB/sec/mW.
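As a point of reference for what the configured forward-propagation and backpropagation stages compute, here is a minimal NumPy sketch of one training step for a single-hidden-layer network. The layer sizes, activation, loss, and learning rate are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((784, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.standard_normal((128, 10)) * 0.01, np.zeros(10)

def forward(x):
    h_pre = x @ W1 + b1
    h = np.maximum(h_pre, 0.0)                       # ReLU activation
    logits = h @ W2 + b2
    return h_pre, h, logits

def backward(x, h_pre, h, logits, y_onehot, lr=0.01):
    global W1, b1, W2, b2
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    d_logits = (probs - y_onehot) / x.shape[0]       # softmax + cross-entropy gradient
    dW2, db2 = h.T @ d_logits, d_logits.sum(axis=0)
    d_h = d_logits @ W2.T
    d_h_pre = d_h * (h_pre > 0)                      # ReLU gradient
    dW1, db1 = x.T @ d_h_pre, d_h_pre.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1                   # gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2

# One training step on a random mini-batch (illustrative only):
# x = rng.standard_normal((32, 784)); y = np.eye(10)[rng.integers(0, 10, 32)]
# backward(x, *forward(x), y)
```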

48 citations

Proceedings ArticleDOI
01 Sep 2016
TL;DR: This report discusses the motivation behind and particular objectives of Novo-G#, the work completed so far, the products of that work, and their potential impact, and closes with an invitation to join the project users group.
Abstract: While High-Performance Computing is ever more pervasive and effective, computing capability is currently only a small fraction of what is needed. Three fundamental issues limiting performance are computational efficiency, power density, and communication latency. All of these issues are being addressed through increased heterogeneity, but the last in particular by integrating communication into the accelerator. This integration enables direct and programmable communication among compute components. Novo-G# is a large-scale FPGA-centric cluster being built to investigate and develop architectures, system and tool infrastructure, and applications for this model. In this report we discuss the motivation behind and particular objectives of Novo-G#, the work completed so far, the products of that work, and their potential impact. We end with a description of and an invitation to join the Novo-G# Forum, the project users group.

45 citations


Cites background from "Cube: A 512-FPGA cluster"

  • ...Yet computing capability is currently only a small fraction of what is needed: e.g., detailed biological simulations are limited to small numbers of macro-molecules; additional factors of millions are needed to simulate cells and far more than that for larger structures....

Journal ArticleDOI
TL;DR: An important result of the paper is to demonstrate how the inherent massive parallelism of FPGAs can improve performance of existing algorithms but only after a fundamental redesign of the algorithms.
Abstract: Computing frequent items is an important problem by itself and as a subroutine in several data mining algorithms. In this paper, we explore how to accelerate the computation of frequent items using field-programmable gate arrays (FPGAs) with a threefold goal: increase performance over existing solutions, reduce energy consumption over CPU-based systems, and explore the design space in detail as the constraints on FPGAs are very different from those of traditional software-based systems. We discuss three design alternatives, each one of them exploiting different FPGA features and each one providing different performance/scalability trade-offs. An important result of the paper is to demonstrate how the inherent massive parallelism of FPGAs can improve performance of existing algorithms but only after a fundamental redesign of the algorithms. Our experimental results show that, e.g., the pipelined solution we introduce can reach more than 100 million tuples per second of sustained throughput (four times the best available results to date) by making use of techniques that are not available to CPU-based solutions. Moreover, and unlike in software approaches, the high throughput is independent of the skew of the Zipf distribution of the input and comes at a far lower energy cost.
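For context, a common counter-based baseline for the frequent-items problem is the Space-Saving algorithm, sketched below in Python; treating it as representative of the algorithms such FPGA designs rework is an assumption, since the abstract does not name the exact baseline.

```python
def space_saving(stream, k):
    """Summarize a stream with at most k counters (Space-Saving algorithm)."""
    counters = {}                                 # item -> (count, overestimation error)
    for item in stream:
        if item in counters:
            count, err = counters[item]
            counters[item] = (count + 1, err)     # monitored item: just increment
        elif len(counters) < k:
            counters[item] = (1, 0)               # free slot: start monitoring
        else:
            victim = min(counters, key=lambda i: counters[i][0])
            min_count, _ = counters.pop(victim)   # evict the least-counted item
            counters[item] = (min_count + 1, min_count)  # inherit its count as the error bound
    return counters

# Usage: any item with true frequency above N/k is guaranteed to appear in the summary.
# print(space_saving("abracadabra", k=4))
```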

42 citations


Cites background from "Cube: A 512-FPGA cluster"

  • ...Scalability beyond the available space of a single chip could be achieved by hardware solutions that daisy-chain multiple FPGA chips (such as the BEE3 [15] or Cube [16] systems)....


References
Journal ArticleDOI
TL;DR: The Berkeley Emulation Engine 2 (BEE2) project is developing a reusable, modular, and scalable framework for designing high-end reconfigurable computers, including a processing-module building block and several programming models.
Abstract: The Berkeley Emulation Engine 2 (BEE2) project is developing a reusable, modular, and scalable framework for designing high-end reconfigurable computers, including a processing-module building block and several programming models. Using these elements, BEE2 can provide over 10 times more computing throughput than a DSP-based system with similar power consumption and cost and over 100 times that of a microprocessor-based system.

304 citations

Journal ArticleDOI
TL;DR: The GRAPE-6 system is a massively parallel special-purpose computer for astrophysical N-body simulations; as the successor of GRAPE-4, it integrates more pipelines per chip, raises the clock speed, and increases the chip count to reach a peak speed of 64 Tflops.
Abstract: In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively parallel special-purpose computer for astrophysical N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional systems, though it can be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are: (a) the processor chip of GRAPE-6 integrates six force-calculation pipelines, compared to the single pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction); (b) the clock speed is increased from 32 to 90 MHz; and (c) the total number of processor chips is increased from 1728 to 2048. These improvements result in a peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.

199 citations


"Cube: A 512-FPGA cluster" refers methods in this paper

  • ...In 1995, the measured peak performance of a completed GRAPE-4 system was reported as 1.08 Tflops [2]....

  • ...In 2002, the GRAPE-6 system with 1728 to 2048 processors achieved 64 Tflops [3]....

  • ...[1] J. Makino, M. Taiji, T. Ebisuzaki, and D. Sugimoto, “GRAPE 4: a one-Tflops special-purpose computer for astrophysical N-body problem,” Supercomputing Proceedings, pp. 429– 438, Nov. 1994....

Journal ArticleDOI
TL;DR: The architecture and performance of the GRAPE-4 system, a massively parallel special-purpose computer for N-body simulation of gravitational collisional systems, is described.
Abstract: In this paper, we describe the architecture and performance of the GRAPE-4 system, a massively parallel special-purpose computer for N-body simulation of gravitational collisional systems. The calculation cost of an N-body simulation of a collisional self-gravitating system is O(N^3). Thus, even with present-day supercomputers, the number of particles one can handle is still around 10,000. In N-body simulations, almost all computing time is spent calculating the force between particles, since the number of interactions is proportional to the square of the number of particles. The computational cost of the rest of the simulation, such as the time integration and the reduction of the result, is generally proportional to the number of particles. The calculation of the force between particles can be greatly accelerated by means of dedicated special-purpose hardware. We have developed a series of hardware systems, the GRAPE (GRAvity PipE) systems, which perform the force calculation. They are used with a general-purpose host computer which performs the rest of the calculation. The GRAPE-4 system is our newest hardware, completed in the summer of 1995. Its peak speed is 1.08 TFLOPS. This speed is achieved by running 1692 pipeline large-scale integrated circuits (LSIs), each providing 640 MFLOPS, in parallel.
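The dominant cost the GRAPE pipelines take off the host is this all-pairs force calculation; a minimal Python sketch of that O(N^2)-per-step kernel is shown below, with a small softening term. The gravitational constant and softening length are illustrative assumptions.

```python
import numpy as np

def accelerations(pos, mass, G=1.0, eps=1e-3):
    """pos: (N, 3) positions, mass: (N,) masses -> (N, 3) gravitational accelerations."""
    acc = np.zeros_like(pos)
    for i in range(len(pos)):
        d = pos - pos[i]                          # vectors from particle i to all others
        r2 = (d * d).sum(axis=1) + eps * eps      # softened squared distances
        inv_r3 = r2 ** -1.5
        inv_r3[i] = 0.0                           # exclude the self-interaction
        acc[i] = G * (d * (mass * inv_r3)[:, None]).sum(axis=0)
    return acc
```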

193 citations

Journal ArticleDOI
Tim Güneysu, Timo Kasper, Martin Novotny, Christof Paar, Andy Rupp
TL;DR: This work describes various exhaustive key search attacks on symmetric ciphers, demonstrates an attack on a security mechanism employed in the electronic passport, and introduces efficient implementations of more complex cryptanalysis of asymmetric cryptosystems, e.g., elliptic curve cryptosystems (ECCs) and number cofactorization for RSA.
Abstract: Cryptanalysis of ciphers usually involves massive computations. The security parameters of cryptographic algorithms are commonly chosen so that attacks are infeasible with available computing resources. Thus, in the absence of mathematical breakthroughs to a cryptanalytical problem, a promising way for tackling the computations involved is to build special-purpose hardware exhibiting a (much) better performance-cost ratio than off-the-shelf computers. This contribution presents a variety of cryptanalytical applications utilizing the cost-optimized parallel code breaker (COPACOBANA) machine, which is a high-performance low-cost cluster consisting of 120 field-programmable gate arrays (FPGAs). COPACOBANA appears to be the only such reconfigurable parallel FPGA machine optimized for code breaking tasks reported in the open literature. Depending on the actual algorithm, the parallel hardware architecture can outperform conventional computers by several orders of magnitude. In this work, we focus on novel implementations of cryptanalytical algorithms, utilizing the impressive computational power of COPACOBANA. We describe various exhaustive key search attacks on symmetric ciphers and demonstrate an attack on a security mechanism employed in the electronic passport (e-passport). Furthermore, we describe time-memory trade-off techniques that can, e.g., be used for attacking the popular A5/1 algorithm used in GSM voice encryption. In addition, we introduce efficient implementations of more complex cryptanalysis on asymmetric cryptosystems, e.g., elliptic curve cryptosystems (ECCs) and number cofactorization for RSA. Even though breaking RSA or elliptic curves with parameter lengths used in most practical applications is out of reach with COPACOBANA, our attacks on algorithms with artificially short bit lengths allow us to extrapolate more reliable security estimates for real-world bit lengths. This is particularly useful for deriving estimates about the longevity of asymmetric key lengths.
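Besides plain exhaustive search, the abstract mentions time-memory trade-off techniques. The sketch below shows a toy, single-table Hellman-style trade-off in Python over a deliberately tiny key space; the toy one-way function, reduction function, and table parameters are illustrative assumptions and are far simpler than what an attack on a real cipher such as A5/1 requires.

```python
import hashlib

KEY_BITS = 20                                     # deliberately tiny key space (assumption)
MASK = (1 << KEY_BITS) - 1

def toy_cipher(key: int) -> int:
    # stand-in one-way map: "ciphertext of a fixed plaintext under key"
    return int.from_bytes(hashlib.sha256(key.to_bytes(4, "big")).digest()[:3], "big")

def reduce(x: int, i: int) -> int:
    # reduction function: fold a ciphertext back into the key space, varied per chain step
    return (x ^ (i * 0x9E3779B9)) & MASK

def build_table(num_chains: int, chain_len: int, seed: int = 0) -> dict:
    table = {}
    for c in range(num_chains):
        start = (seed + c * 2654435761) & MASK
        x = start
        for i in range(chain_len):
            x = reduce(toy_cipher(x), i)
        table[x] = start                          # store only (endpoint -> start point)
    return table

def lookup(table: dict, chain_len: int, ciphertext: int):
    # online phase: for each possible chain position, walk to the endpoint,
    # then re-walk the candidate chain from its start to recover the key
    for pos in range(chain_len - 1, -1, -1):
        x = reduce(ciphertext, pos)
        for i in range(pos + 1, chain_len):
            x = reduce(toy_cipher(x), i)
        if x in table:
            k = table[x]
            for i in range(pos):
                k = reduce(toy_cipher(k), i)
            if toy_cipher(k) == ciphertext:       # false alarms are filtered here
                return k
    return None
```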

157 citations


"Cube: A 512-FPGA cluster" refers background in this paper

  • ...In a 2007 implementation [7] the system running at 136MHz can search a full 56-bit DES key space in 12....

Journal Article
TL;DR: The design and realization of the COPACOBANA (Cost-Optimized Parallel Code Breaker) machine is presented, which is optimized for running cryptanalytical algorithms and can be realized for less than US$ 10,000, and it will be shown that the architecture can outperform conventional computers by several orders of magnitude.
Abstract: Cryptanalysis of symmetric and asymmetric ciphers is computationally extremely demanding. Since the security parameters (in particular the key length) of almost all practical crypto algorithms are chosen such that attacks with conventional computers are computationally infeasible, the only promising way to tackle existing ciphers (assuming no mathematical breakthrough) is to build special-purpose hardware. Dedicating those machines to the task of cryptanalysis holds the promise of a dramatically improved cost-performance ratio so that breaking of commercial ciphers comes within reach. This contribution presents the design and realization of the COPACOBANA (Cost-Optimized Parallel Code Breaker) machine, which is optimized for running cryptanalytical algorithms and can be realized for less than US$ 10,000. It will be shown that, depending on the actual algorithm, the architecture can outperform conventional computers by several orders of magnitude. COPACOBANA hosts 120 low-cost FPGAs and is able to, e.g., perform an exhaustive key search of the Data Encryption Standard (DES) in less than nine days on average. As a real-world application, our architecture can be used to attack machine-readable travel documents (ePass). COPACOBANA is intended, but not necessarily restricted, to solving problems related to cryptanalysis. The hardware architecture is suitable for computational problems which are parallelizable and have low communication requirements. The hardware can be used, e.g., to attack elliptic curve cryptosystems and to factor numbers. Even though breaking full-size RSA (1024 bit or more) or elliptic curves (ECC with 160 bit or more) is out of reach with COPACOBANA, it can be used to analyze cryptosystems with a (deliberately chosen) small bit length to provide reliable security estimates of RSA and ECC by extrapolation.

143 citations


"Cube: A 512-FPGA cluster" refers methods in this paper

  • ...In 2006, COPACOBANA, a low cost cryptanalysis system using large numbers of FPGAs was described [6]....
