scispace - formally typeset
Author

Tomohiro Ueno

Other affiliations: Utsunomiya University
Bio: Tomohiro Ueno is an academic researcher from Tohoku University. The author has contributed to research in the topics of Scalability and Data stream mining. The author has an h-index of 6 and has co-authored 16 publications receiving 77 citations. Previous affiliations of Tomohiro Ueno include Utsunomiya University.

Papers
Proceedings ArticleDOI
18 May 2020
TL;DR: A Communication Integrated Reconfigurable CompUting System (CIRCUS) is proposed to enable the use of high-speed FPGA interconnects from OpenCL; it forms a fused single pipeline combining computation and communication, hiding communication latency by completely overlapping the two.
Abstract: In recent years, many High Performance Computing (HPC) researchers have been attracted to utilizing Field Programmable Gate Arrays (FPGAs) for HPC applications. Thanks to FPGAs' I/O capabilities, we can use them for communication as well as computation. HPC scientists have been unable to utilize FPGAs for their applications because of the difficulty of FPGA development; however, High Level Synthesis (HLS) allows them to do so at an acceptable cost. In this study, we propose a Communication Integrated Reconfigurable CompUting System (CIRCUS) that enables us to utilize the high-speed interconnection of FPGAs from OpenCL. CIRCUS forms a fused single pipeline combining computation and communication, which hides the communication latency by completely overlapping the two. In this paper, we present the details of the implementation and evaluation results using two benchmarks: a pingpong benchmark and an allreduce benchmark.
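The latency-hiding effect of such a fused pipeline can be illustrated with a toy timing model. This is only a sketch of the general overlap principle, not the CIRCUS implementation; all timing values are hypothetical:

```python
def sequential_time(t_comp, t_comm, n_stages):
    # Without overlap, every stage pays computation plus communication.
    return n_stages * (t_comp + t_comm)

def fused_pipeline_time(t_comp, t_comm, n_stages):
    # In a fused pipeline, the communication of one stage overlaps the
    # computation of the next; the steady-state cost per stage is the
    # slower of the two, so the cheaper one is completely hidden.
    return t_comp + t_comm + (n_stages - 1) * max(t_comp, t_comm)

print(sequential_time(10.0, 4.0, 100))      # 1400.0
print(fused_pipeline_time(10.0, 4.0, 100))  # 1004.0
```

With communication cheaper than computation, the overlapped version approaches pure computation time, which is the behavior the paper exploits.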

20 citations

Journal ArticleDOI
TL;DR: A hardware-based bandwidth compression technique that can be applied to field-programmable gate array (FPGA)-based high-performance computation with a logically wider effective memory bandwidth, and a multichannel serializer and deserializer that enable applications to use multiple channels of computational data with the bandwidth compression.
Abstract: Although computational performance is often limited by insufficient bandwidth to/from an external memory, it is not easy to physically increase off-chip memory bandwidth. In this study, we propose a hardware-based bandwidth compression technique that can be applied to field-programmable gate array (FPGA)-based high-performance computation with a logically wider effective memory bandwidth. Our proposed hardware approach can boost the performance of FPGA-based stream computations by applying a data compression technique to effectively transfer more data streams. To apply this data compression technique to bandwidth compression via hardware, several requirements must first be satisfied, including an acceptable level of compression performance and a sufficiently small hardware footprint. Our proposed hardware-based bandwidth compressor utilizes an efficient prediction-based data compression algorithm. Moreover, we propose a multichannel serializer and deserializer that enable applications to use multiple channels of computational data with the bandwidth compression. The serializer encodes compressed data blocks of multiple channels into a data stream, which is efficiently written to an external memory. Based on a preliminary evaluation, we define an encoding format considering both high compression ratio and small hardware area. As a result, we demonstrate that our area-saving bandwidth compressor increases the performance of an FPGA-based fluid dynamics simulation by deploying more processing elements to exploit spatial parallelism with the enhanced memory bandwidth.
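The general idea of prediction-based compression can be sketched in a few lines. This is an illustrative previous-value predictor, not the paper's actual algorithm or encoding format:

```python
def compress(stream):
    """Predict each sample as the previous one and emit residuals.
    Smooth streams produce many zero residuals, which a short code
    (e.g. a single flag bit) can exploit to shrink the data."""
    prediction = 0
    residuals = []
    for value in stream:
        residuals.append(value - prediction)  # residual vs. prediction
        prediction = value                    # predictor: previous sample
    return residuals

def decompress(residuals):
    """Rebuild the stream by replaying the same predictor."""
    prediction, out = 0, []
    for r in residuals:
        value = prediction + r
        out.append(value)
        prediction = value
    return out

data = [5, 5, 5, 6, 7, 7, 7, 7]
res = compress(data)
assert decompress(res) == data
print(res)  # [5, 0, 0, 1, 1, 0, 0, 0]
```

The hardware version must additionally bound the compressor's area and keep up with the stream rate, which is why the paper trades compression ratio against footprint in its encoding format.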

16 citations

Journal ArticleDOI
TL;DR: The detailed design of a custom computing machine for fully-streamed LBM computation on multiple FPGAs is presented, and its efficiency is evaluated with prototype implementation.
Abstract: This paper presents the detailed design of a custom computing machine for fully-streamed LBM computation on multiple FPGAs, and evaluates its efficiency with a prototype implementation. We design a unit for completely streamed computation, including boundary treatment with a newly introduced cell attribute. Experimental results demonstrate that the proposed machine achieves high utilization of PEs, 99% of the peak performance, for one and two FPGAs computing a large lattice. This is due to our fully-streamed design, which allows all arithmetic units to be efficiently utilized with a constant memory bandwidth, and to the architecture exploiting a low-latency accelerator domain network (ADN) of a tightly-coupled FPGA cluster for scalable computation.
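The role of a cell attribute in streamed boundary treatment can be sketched as follows. The update rules here are stand-ins, not real LBM collide/stream operators; the point is that every cell flows through the same datapath and the attribute merely selects the result, as a mux would in hardware:

```python
FLUID, WALL = 0, 1  # hypothetical attribute encoding

def update(cell):
    """One streamed update step for a (attribute, value) cell.
    Both candidate results are computed; the attribute selects one,
    so the pipeline never branches and never stalls."""
    attr, value = cell
    streamed = value + 1.0   # stand-in for the normal collide/stream result
    bounced = -value         # stand-in for bounce-back at a wall boundary
    return (attr, bounced if attr == WALL else streamed)

lattice = [(FLUID, 1.0), (WALL, 2.0), (FLUID, 3.0)]
print([update(c) for c in lattice])  # [(0, 2.0), (1, -2.0), (0, 4.0)]
```

Carrying the boundary information with the cell itself is what lets the machine keep all arithmetic units busy at a constant stream rate, rather than treating boundary cells out-of-band.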

13 citations

Journal ArticleDOI
TL;DR: This paper presents a scalable architecture of a deeply pipelined stream computing platform, where available parallelism and inter-FPGA link characteristics are investigated to achieve a scaled performance.
Abstract: Since the hardware resources of a single FPGA are limited, one way to scale the performance of FPGA-based HPC applications is to expand the design space with multiple FPGAs. This paper presents a scalable architecture of a deeply pipelined stream computing platform, where available parallelism and inter-FPGA link characteristics are investigated to achieve scaled performance. For a practical exploration of this vast design space, a performance model is presented and verified with the evaluation of a tsunami simulation application implemented on Intel Arria 10 FPGAs. Finally, scalability analysis is performed, where speedup is achieved when increasing the computing pipeline over multiple FPGAs while maintaining the problem size of the computation. Performance is scaled with multiple FPGAs; however, performance degradation occurs with insufficient available bandwidth and large pipeline overhead brought by inadequate data stream size. Tsunami simulation results show that the highest scaled performance for 8 cascaded Arria 10 FPGAs is achieved with a single pipeline of 5 stream processing elements (SPEs), which obtained a scaled performance of 2.5 TFlops and a parallel efficiency of 98%, indicating the strong scalability of the multi-FPGA stream computing platform.
Keywords: tsunami simulation, stream computing, scalability, multiple FPGAs, high-performance computing
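The notion of parallel efficiency under strong scaling can be illustrated with a toy model. Every parameter below (per-SPE performance, fill overhead, lattice size) is a hypothetical stand-in, not a figure from the paper; the model only shows why efficiency stays high when pipeline overhead is small relative to a fixed problem size:

```python
def scaled_performance(n_fpgas, spes_per_fpga, perf_per_spe, fill_overhead, cells):
    """Toy strong-scaling model: cascading more SPEs multiplies the work
    done per streamed pass, but the pipeline-fill cost grows with the
    pipeline depth while the problem size (cells) stays fixed."""
    total_spes = n_fpgas * spes_per_fpga
    overhead = fill_overhead * total_spes  # fill cost grows with depth
    return total_spes * perf_per_spe * cells / (cells + overhead)

p1 = scaled_performance(1, 5, 62.5, 1.0, 10_000)  # single-FPGA baseline
p8 = scaled_performance(8, 5, 62.5, 1.0, 10_000)  # 8 cascaded FPGAs
print(p8)                # aggregate performance, arbitrary units
print(p8 / (8 * p1))     # parallel efficiency: close to 1.0
```

Shrinking `cells` (an inadequate data stream size) inflates the relative overhead and degrades efficiency, matching the trend the abstract describes.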

11 citations

Book ChapterDOI
01 Apr 2020
TL;DR: This paper introduces a scalable platform of indirectly-connected FPGAs, where its Ethernet-switching network allows flexibly customized inter-FPGA connectivity and demonstrates good performance and scalability for large HPC applications.
Abstract: As field programmable gate arrays (FPGAs) become a favorable choice for exploring new computing architectures in the post-Moore era, a flexible network architecture for scalable FPGA clusters becomes increasingly important in high performance computing (HPC). In this paper, we introduce a scalable platform of indirectly-connected FPGAs, whose Ethernet-switching network allows flexibly customized inter-FPGA connectivity. However, for certain applications such as stream computing, it is necessary to establish a connection-oriented datapath with backpressure between FPGAs. Due to the lack of a physical backpressure channel in the network, we utilized our existing credit-based network protocol with flow control to provide receiver-FPGA awareness, and tailored it to minimize the overall communication overhead of the proposed framework. To characterize its performance, we implemented the necessary data transfer hardware on Intel Arria 10 FPGAs, modeled and measured its communication performance, and compared it to a direct network. Results show that our proposed indirect framework achieves approximately 3% higher effective network bandwidth than our existing direct inter-FPGA network, demonstrating good performance and scalability for large HPC applications.
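The credit-based flow control mentioned above can be sketched in a few lines. This is a generic illustration of the credit mechanism, not the authors' protocol; buffer sizes and the packet model are hypothetical:

```python
from collections import deque

class CreditLink:
    """Credit-based flow control: the sender starts with one credit per
    receiver buffer slot, consumes a credit per packet, and stalls at
    zero. The receiver returns a credit when it drains a slot, so the
    buffer can never overflow despite no physical backpressure wire."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots  # initial credits = receiver buffer size
        self.buffer = deque()        # stands in for the switched network path

    def send(self, packet):
        if self.credits == 0:
            return False             # backpressure: sender must stall
        self.credits -= 1
        self.buffer.append(packet)
        return True

    def receive(self):
        packet = self.buffer.popleft()
        self.credits += 1            # credit flows back to the sender
        return packet

link = CreditLink(buffer_slots=2)
assert link.send("a") and link.send("b")
assert not link.send("c")            # out of credits: sender stalls
assert link.receive() == "a"         # draining a slot returns a credit
assert link.send("c")                # sender may proceed again
```

In the real system the credit return is itself a network message, which is why the paper tailors the protocol to keep that feedback traffic from eating into the communication bandwidth.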

11 citations


Cited by
Journal ArticleDOI
TL;DR: A deep survey of state-of-the-art research and implementations of HPC algorithms is performed; features relevant to each family are extracted and listed as key factors for obtaining higher performance.
Abstract: High performance computing (HPC) systems currently integrate several resources such as multi-cores (CPUs), graphic processing units (GPUs) and reconfigurable logic devices, like field programmable gate arrays (FPGAs). The role of the latter two has traditionally been confined to acting as secondary accelerators rather than as main execution units. We perform a deep survey of state-of-the-art research and implementations of HPC algorithms; we extract features relevant to each family and list them as key factors for obtaining higher performance. Due to the broad spectrum of the survey, we include only the most complete references found. We provide a general classification of the 13 HPC families with respect to their needs and suitability for hardware implementation. In addition, we present an analysis based on current and future technology availability as well as on particular aspects identified in the survey. Finally, we list general guidelines and opportunities to be accounted for in future heterogeneous designs that employ FPGAs for HPC.

41 citations

Journal ArticleDOI
TL;DR: This paper presents an architecture and design for scalable fluid simulation based on data-flow computing with a state-of-the-art FPGA and introduces spatial and temporal parallelism to further scale the performance by adding more stream processing elements (SPEs) in an array.
Abstract: High-performance and low-power computation is required for large-scale fluid dynamics simulation. Due to their inefficient architectures and structures for this workload, CPUs and GPUs have difficulty improving power efficiency for the target application. Although FPGAs have become promising alternatives for power-efficient and high-performance computation thanks to new architectures with floating-point (FP) DSP blocks, their relatively narrow memory bandwidth requires an appropriate approach to fully exploit this advantage. This paper presents an architecture and design for scalable fluid simulation based on data-flow computing with a state-of-the-art FPGA. To exploit the available hardware resources including FP DSPs, we introduce spatial and temporal parallelism to further scale the performance by adding more stream processing elements (SPEs) in an array. Performance modeling and prototype implementation allow us to explore the design space for both the existing Altera Arria 10 and the upcoming Intel Stratix 10 FPGAs. We demonstrate that the Arria 10 10AX115 FPGA achieves 519 GFlops at 9.67 GFlops/W with a stream bandwidth of only 9.0 GB/s, which is 97.9 percent of the peak performance of the 18 implemented SPEs. We also estimate that the Stratix 10 FPGA can scale up to 6844 GFlops by combining spatial and temporal parallelism adequately.
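A quick back-of-envelope check, using only the figures quoted in the abstract, recovers the implied board power and per-SPE peak performance:

```python
# Figures reported in the abstract.
gflops = 519.0            # achieved performance
gflops_per_watt = 9.67    # achieved power efficiency
fraction_of_peak = 0.979  # 97.9 percent of peak
n_spes = 18               # implemented stream processing elements

power_watts = gflops / gflops_per_watt     # implied power draw
peak_gflops = gflops / fraction_of_peak    # implied peak of 18 SPEs
peak_per_spe = peak_gflops / n_spes        # implied per-SPE peak

print(round(power_watts, 1))   # 53.7 (W)
print(round(peak_gflops, 1))   # 530.1 (GFlops)
print(round(peak_per_spe, 1))  # 29.5 (GFlops per SPE)
```

These derived numbers are arithmetic consequences of the abstract's figures, not values stated in the paper itself.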

30 citations

Journal ArticleDOI
TL;DR: A hardware-based bandwidth compression technique that can be applied to field-programmable gate array (FPGA)-based high-performance computation with a logically wider effective memory bandwidth, and a multichannel serializer and deserializer that enable applications to use multiple channels of computational data with the bandwidth compression.
Abstract: Although computational performance is often limited by insufficient bandwidth to/from an external memory, it is not easy to physically increase off-chip memory bandwidth. In this study, we propose a hardware-based bandwidth compression technique that can be applied to field-programmable gate array (FPGA)-based high-performance computation with a logically wider effective memory bandwidth. Our proposed hardware approach can boost the performance of FPGA-based stream computations by applying a data compression technique to effectively transfer more data streams. To apply this data compression technique to bandwidth compression via hardware, several requirements must first be satisfied, including an acceptable level of compression performance and a sufficiently small hardware footprint. Our proposed hardware-based bandwidth compressor utilizes an efficient prediction-based data compression algorithm. Moreover, we propose a multichannel serializer and deserializer that enable applications to use multiple channels of computational data with the bandwidth compression. The serializer encodes compressed data blocks of multiple channels into a data stream, which is efficiently written to an external memory. Based on a preliminary evaluation, we define an encoding format considering both high compression ratio and small hardware area. As a result, we demonstrate that our area-saving bandwidth compressor increases the performance of an FPGA-based fluid dynamics simulation by deploying more processing elements to exploit spatial parallelism with the enhanced memory bandwidth.

16 citations

Journal ArticleDOI
TL;DR: In this article, the wave equation is solved by using the method of separation of variables based on the eigenvalue technique, and the resonance frequencies as well as the E-field distributions in two exemplary small resonators are presented for a variety of modes.
Abstract: Whispering-gallery (WG) modes in photonic microdevices made of dielectric circularly planar resonators are analyzed. The wave equation is solved using the method of separation of variables based on the eigenvalue technique. The resonant frequency at an azimuthal mode is determined iteratively using the bisection method, based on the continuity conditions at the resonator's peripheral boundary. The radial mode is determined from the critical points of the field intensity profile in the radial direction via the first derivative test. The resonance frequencies as well as the E-field distributions in two exemplary small resonators are presented for a variety of modes. A comparison with numerical predictions is conducted, and good agreement is found. The geometric optics method is found to be inappropriate for small resonators.
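The bisection method used to locate the resonant frequency can be sketched generically. The characteristic function below is a toy stand-in with a known root, not the resonator's actual continuity condition:

```python
import math

def bisect(f, lo, hi, tol=1e-12):
    """Bisection root finder: repeatedly halve a bracketing interval
    [lo, hi] with f(lo) and f(hi) of opposite sign, keeping the half
    that still brackets the sign change."""
    flo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if flo * f(mid) <= 0:
            hi = mid                 # root lies in the lower half
        else:
            lo, flo = mid, f(mid)    # root lies in the upper half
    return 0.5 * (lo + hi)

# Toy stand-in for the characteristic (continuity) equation, with a
# known root at pi/2:
root = bisect(math.cos, 1.0, 2.0)
print(round(root, 6))  # 1.570796
```

Bisection is slow but unconditionally convergent on a bracketed sign change, which suits a characteristic equation whose derivative is awkward to evaluate.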

14 citations