
Showing papers by "Srinivas Devadas published in 2021"


Proceedings ArticleDOI
18 Oct 2021
TL;DR: F1, as discussed by the authors, is the first programmable FHE accelerator, i.e., one capable of executing full FHE programs; it builds on an in-depth architectural analysis of the characteristics of FHE computations that reveals acceleration opportunities.
Abstract: Fully Homomorphic Encryption (FHE) allows computing on encrypted data, enabling secure offloading of computation to untrusted servers. Though it provides ideal security, FHE is expensive when executed in software, 4 to 5 orders of magnitude slower than computing on unencrypted data. These overheads are a major barrier to FHE’s widespread adoption. We present F1, the first FHE accelerator that is programmable, i.e., capable of executing full FHE programs. F1 builds on an in-depth architectural analysis of the characteristics of FHE computations that reveals acceleration opportunities. F1 is a wide-vector processor with novel functional units deeply specialized to FHE primitives, such as modular arithmetic, number-theoretic transforms, and structured permutations. This organization provides so much compute throughput that data movement becomes the key bottleneck. Thus, F1 is primarily designed to minimize data movement. Hardware provides an explicitly managed memory hierarchy and mechanisms to decouple data movement from execution. A novel compiler leverages these mechanisms to maximize reuse and schedule off-chip and on-chip data movement. We evaluate F1 using cycle-accurate simulation and RTL synthesis. F1 is the first system to accelerate complete FHE programs, and outperforms state-of-the-art software implementations by gmean 5,400× and by up to 17,000×. These speedups counter most of FHE’s overheads and enable new applications, like real-time private deep learning in the cloud.
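To make the named FHE primitives concrete, the sketch below shows a plain-software reference of two of them, modular arithmetic and the number-theoretic transform (NTT). This is only an illustrative O(n²) implementation with toy parameters (q = 17, n = 4, ω = 4, all chosen here for readability), not F1's specialized hardware pipeline.

```python
# Minimal software sketch of the NTT primitive over Z_q (toy parameters, not F1's datapath).

def ntt(a, omega, q):
    """Forward NTT: evaluate polynomial `a` at powers of `omega` modulo q."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(a_hat, omega, q):
    """Inverse NTT: interpolate back, scaling by n^{-1} mod q (q prime, Fermat inverse)."""
    n = len(a_hat)
    n_inv = pow(n, q - 2, q)
    omega_inv = pow(omega, q - 2, q)
    return [(n_inv * sum(a_hat[j] * pow(omega_inv, i * j, q) for j in range(n))) % q
            for i in range(n)]

if __name__ == "__main__":
    q, n, omega = 17, 4, 4          # omega has multiplicative order n modulo q
    poly = [3, 1, 4, 1]
    assert intt(ntt(poly, omega, q), omega, q) == poly
```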

98 citations


Journal ArticleDOI
08 Feb 2021
TL;DR: In this paper, the gradient of rigid body dynamics is computed on a CPU, GPU, and FPGA, and the authors show that the relative performance across hardware platforms depends on the number of parallel gradient evaluations required.
Abstract: Computing the gradient of rigid body dynamics is a central operation in many state-of-the-art planning and control algorithms in robotics. Parallel computing platforms such as GPUs and FPGAs can offer performance gains for algorithms with hardware-compatible computational structures. In this letter, we detail the designs of three faster-than-state-of-the-art implementations of the gradient of rigid body dynamics on a CPU, GPU, and FPGA. Our optimized FPGA and GPU implementations provide as much as a 3.0x end-to-end speedup over our optimized CPU implementation by refactoring the algorithm to exploit its computational features, e.g., parallelism at different granularities. We also find that the relative performance across hardware platforms depends on the number of parallel gradient evaluations required.
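The batching pattern the letter studies, many independent gradient evaluations dispatched in parallel, can be sketched as below. The `dynamics_gradient` function here is a hypothetical placeholder standing in for the real CPU/GPU/FPGA kernels; the point is only that the coarse-grained, per-evaluation parallelism grows with the batch size, which is why relative performance depends on how many evaluations are requested at once.

```python
# Illustrative batching sketch; `dynamics_gradient` is a placeholder, not the authors' kernel.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def dynamics_gradient(state):
    """Stand-in for d(forward dynamics)/d(state) of a 7-joint arm."""
    q, qd, tau = state                      # joint positions, velocities, torques
    return np.outer(qd, q) + np.diag(tau)   # arbitrary stand-in computation

def batched_gradients(states, workers=4):
    """Coarse-grained parallelism: one independent gradient evaluation per task."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(dynamics_gradient, states))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = [tuple(rng.standard_normal(7) for _ in range(3)) for _ in range(64)]
    grads = batched_gradients(batch)
    print(len(grads), grads[0].shape)       # 64 gradients, each 7x7
```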

22 citations


Proceedings ArticleDOI
19 Apr 2021
TL;DR: In this article, a methodology to transform robot morphology into a customized hardware accelerator morphology is presented, using robot topology and structure to exploit parallelism and matrix sparsity patterns in accelerator hardware.
Abstract: Robotics applications have hard time constraints and heavy computational burdens that can greatly benefit from domain-specific hardware accelerators. For the latency-critical problem of robot motion planning and control, there exists a performance gap of at least an order of magnitude between joint actuator response rates and state-of-the-art software solutions. Hardware acceleration can close this gap, but it is essential to define automated hardware design flows to keep the design process agile as applications and robot platforms evolve. To address this challenge, we introduce robomorphic computing: a methodology to transform robot morphology into a customized hardware accelerator morphology. We (i) present this design methodology, using robot topology and structure to exploit parallelism and matrix sparsity patterns in accelerator hardware; (ii) use the methodology to generate a parameterized accelerator design for the gradient of rigid body dynamics, a key kernel in motion planning; (iii) evaluate FPGA and synthesized ASIC implementations of this accelerator for an industrial manipulator robot; and (iv) describe how the design can be automatically customized for other robot models. Our FPGA accelerator achieves speedups of 8× and 86× over CPU and GPU when executing a single dynamics gradient computation. It maintains speedups of 1.9× to 2.9× over CPU and GPU, including computation and I/O round-trip latency, when deployed as a coprocessor to a host CPU for processing multiple dynamics gradient computations. ASIC synthesis indicates an additional 7.2× speedup for single computation latency. We describe how this principled approach generalizes to more complex robot platforms, such as quadrupeds and humanoids, as well as to other computational kernels in robotics, outlining a path forward for future robomorphic computing accelerators.
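One way to picture the robomorphic idea is that the robot's kinematic topology directly determines a sparsity pattern a customized accelerator can hard-wire. The sketch below derives such a mask from a joint parent array using the ancestor rule familiar from the joint-space inertia matrix; the branched parent array is a hypothetical example, not one of the paper's robot models.

```python
# Sketch: derive a hard-wirable sparsity mask from robot topology (illustrative only).

def ancestors(parent, i):
    """Return the set of ancestors of joint i (parent[i] == -1 marks the root)."""
    out = set()
    while parent[i] != -1:
        i = parent[i]
        out.add(i)
    return out

def sparsity_mask(parent):
    """mask[i][j] is True when joints i and j lie on a common root-to-leaf path,
    i.e., when the corresponding matrix entry can be structurally nonzero."""
    n = len(parent)
    anc = [ancestors(parent, i) | {i} for i in range(n)]
    return [[(i in anc[j]) or (j in anc[i]) for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    serial_chain = [-1, 0, 1, 2, 3, 4]   # e.g., a 6-joint manipulator
    branched     = [-1, 0, 1, 0, 3]      # hypothetical tree with two branches
    for name, topo in [("chain", serial_chain), ("tree", branched)]:
        mask = sparsity_mask(topo)
        zeros = sum(not x for row in mask for x in row)
        print(f"{name}: {zeros} structurally-zero entries of {len(topo)**2}")
```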

20 citations