
Showing papers by Koji Nakano published in 2016


Proceedings ArticleDOI
01 Aug 2016
TL;DR: Theoretical lower bounds on the diameter and the ASPL are provided, proving the optimality of the randomly optimized grid graphs, and a diagonal grid layout is presented that significantly reduces the diameter compared to the conventional layout under the edge-length limitation.
Abstract: In this work we present randomly optimized grid graphs that optimize performance measures such as the diameter and the average shortest path length (ASPL), subject to limited edge length on a grid surface. We also provide theoretical lower bounds on the diameter and the ASPL, which prove the optimality of our randomly optimized grid graphs. We further present a diagonal grid layout that significantly reduces the diameter compared to the conventional layout under the edge-length limitation. We finally show their application in three case studies of off- and on-chip interconnection networks. Our design efficiently improves performance measures such as end-to-end communication latency, network power consumption, cost, and execution time of parallel benchmarks.

17 citations


Book ChapterDOI
14 Dec 2016
TL;DR: The main contribution of this paper is to present a hardware LZW decompression algorithm and to implement it in an FPGA; the experimental results show that one proposed module on the Virtex-7 family FPGA XC7VX485T-2 runs up to 2.16 times faster than sequential LZW decompression on a single CPU.
Abstract: The LZW algorithm is one of the most famous dictionary-based compression and decompression algorithms. The main contribution of this paper is to present a hardware LZW decompression algorithm and to implement it in an FPGA. The experimental results show that one proposed module on the Virtex-7 family FPGA XC7VX485T-2 runs up to 2.16 times faster than sequential LZW decompression on a single CPU, where the FPGA operates at 301.02 MHz. Since the proposed module is compactly designed and uses few resources of the FPGA, we have succeeded in implementing 150 identical modules that work in parallel on the FPGA, where the FPGA operates at 245.4 MHz. In other words, our implementation runs up to 264 times faster than a sequential implementation on a single CPU.

14 citations


Book ChapterDOI
14 Dec 2016
TL;DR: A new lossless data compression method, called Light Loss-Less (LLL) compression, is presented; it is designed so that decompression can be highly parallelized and run very efficiently on the GPU.
Abstract: There is no doubt that data compression is very important in computer engineering. However, most lossless data compression and decompression algorithms are very hard to parallelize, because they use dictionaries updated sequentially. The main contribution of this paper is to present a new lossless data compression method that we call Light Loss-Less (LLL) compression. It is designed so that decompression can be highly parallelized and run very efficiently on the GPU. This makes sense for many applications in which compressed data is read and decompressed many times, so decompression is performed far more frequently than compression. We show optimal sequential and parallel algorithms for LLL decompression and implement them to run on a Core i7-4790 CPU and a GeForce GTX 1080 GPU, respectively. To show the potential of the LLL compression method, we evaluated the running time on five images and compared it with the well-known compression methods LZW and LZSS. Our GPU implementation of LLL decompression runs 91.1–176 times faster than the CPU implementation. The GPU running times also show that LLL decompression is 2.49–9.13 times faster than LZW decompression and 4.30–14.1 times faster than LZSS decompression, although their compression ratios are comparable.

12 citations


Journal ArticleDOI
TL;DR: A work-optimal parallel LZW decompression algorithm is presented on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), a standard theoretical parallel computing model with a shared memory, together with an efficient implementation of this parallel algorithm on a CUDA-enabled GPU.
Abstract: The main contribution of this paper is to present a work-optimal parallel algorithm for LZW decompression and to implement it in a CUDA-enabled GPU. Since sequential LZW decompression creates a dictionary table by reading codes in a compressed file one by one, it is not easy to parallelize. We first present a work-optimal parallel LZW decompression algorithm on the CREW-PRAM (Concurrent-Read Exclusive-Write Parallel Random Access Machine), which is a standard theoretical parallel computing model with a shared memory. We then go on to present an efficient implementation of this parallel algorithm on a GPU. The experimental results show that our GPU implementation performs LZW decompression in 1.15 milliseconds for a gray scale TIFF image with 4096 × 3072 pixels stored in the global memory of a GeForce GTX 980. On the other hand, sequential LZW decompression for the same image stored in the main memory of an Intel Core i7 CPU takes 50.1 milliseconds. Thus, our parallel LZW decompression on the global memory of the GPU is 43.6 times faster than sequential LZW decompression on the main memory of the CPU for this image. To show the applicability of our GPU implementation of LZW decompression, we evaluated the SSD-GPU data loading time for three scenarios. The experimental results show that the scenario using our LZW decompression on the GPU is faster than the others.
Key words: data compression, big data, parallel algorithm, GPU, CUDA
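The source of this parallelism can be sketched in a few lines (a hedged illustration of one standard formulation, not necessarily the authors' exact CREW-PRAM scheme; clear/end codes are ignored). In LZW, the dictionary entry created while reading code[i] is always the string of code[i] extended by the first character of the string of code[i+1], so every entry's parent pointer can be set independently per index, and the string lengths then follow by pointer jumping:

    // Entry 256+i is created at step i; its string is string(code[i]) plus
    // the first character of string(code[i+1]). Each iteration below is
    // independent of all others, so one GPU thread per index i works.
    void build_parents(const int *code, int n, int *parent) {
      for (int i = 0; i < 256; i++)
        parent[i] = -1;               // single-character root entries
      for (int i = 0; i + 1 < n; i++)
        parent[256 + i] = code[i];
      // The length of a code's string is its depth in this parent forest;
      // a PRAM/GPU computes all depths at once by pointer jumping in
      // O(log n) rounds.
    }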

12 citations


Proceedings ArticleDOI
01 Nov 2016
TL;DR: The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU that relies on warp shuffle operations, which are used to reduce the communication overhead between threads.
Abstract: The task of finding strings having a partial match to a given pattern is of interest to a number of practical applications, including DNA sequencing and text searching. Owing to its importance, alternatives to accelerate the Approximate String Matching (ASM) have been widely investigated in the literature. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The key idea of our implementation relies on warp shuffle operations, which are used to reduce the communication overhead between threads. Experimental results, carried out on a GeForce GTX 960 GPU, show that the proposed implementation provides acceleration between 1.31 and 1.84 times when compared to another noteworthy alternative.

6 citations


Proceedings ArticleDOI
23 May 2016
TL;DR: It is shown that the pairwise sums of many integers can be computed faster using the BPBC technique if the values of the input integers are not large, and that CKY parsing for context-free grammars can be implemented efficiently on the GPU using this technique.
Abstract: The main contribution of this paper is to present the Bitwise Parallel Bulk Computation (BPBC) technique to accelerate bulk computation, which executes the same algorithm for many instances in turn or in parallel. The idea of the BPBC technique is to simulate a combinational logic circuit for 32 inputs at the same time using the bitwise logic operators for 32-bit integers supported by most processing devices. We will show that the BPBC technique works very efficiently on a CPU as well as on a GPU. As a simple example of the BPBC, we first show that the pairwise sums of many integers can be computed faster using the BPBC technique if the values of the input integers are not large. We also show that CKY parsing for context-free grammars can be implemented efficiently on the GPU using the BPBC technique. The experimental results using an Intel Core i7 CPU and a GeForce GTX TITAN X GPU show that the GPU implementation of CKY parsing can be more than 400 times faster than the CPU implementation.
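To make the technique concrete, the sketch below (my illustration, not the paper's code) adds 32 pairs of B-bit integers at once: the inputs are stored bit-sliced, so that word k holds bit k of all 32 instances, and a ripple-carry adder circuit is simulated with a handful of bitwise operators per bit position:

    // a[k] and b[k] hold bit k of 32 independent instances, so every line
    // below acts on all 32 instances at once. The loop simulates one full
    // adder per bit position, carries included.
    void bpbc_add32(const unsigned int *a, const unsigned int *b,
                    unsigned int *s, int B) {
      unsigned int carry = 0;                  // 32 carry bits, one per instance
      for (int k = 0; k < B; k++) {
        unsigned int x = a[k] ^ b[k];
        s[k] = x ^ carry;                      // sum bit of every instance
        carry = (a[k] & b[k]) | (x & carry);   // carry into the next position
      }
      s[B] = carry;                            // top bit of each of the 32 sums
    }

Using 64-bit words instead doubles the number of instances per operation, which is exactly why the technique pays off when the inputs are short.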

6 citations


Journal ArticleDOI
TL;DR: This paper develops several acceleration techniques for simulating the Game of Life on a GPU, based on the Bitwise Parallel Bulk Computation (BPBC) technique, which works very efficiently on a GPU.
Abstract: Conway’s Game of Life is the most well-known cellular automaton. The universe of the Game of Life is a 2-dimensional array of cells, each of which takes one of two possible states, alive or dead. The state of every cell is repeatedly updated according to those of its eight neighbors: a cell will be alive if exactly three neighbors are alive, or if it is alive and two neighbors are alive. The main contribution of this paper is to develop several acceleration techniques for simulating the Game of Life using a GPU, as follows: (1) the states of 32/64 cells are stored in 32/64-bit words (integers), and the next states are computed by the Bitwise Parallel Bulk Computation (BPBC) technique, (2) the states of cells stored in 2 words are updated at the same time by a thread, (3) the warp shuffle instruction is used to directly transfer the current states stored in registers, and (4) multi-step simulation is performed to reduce the overhead of data transfer and of invoking CUDA kernels. The experimental results show that the performance of our GPU implementation using a GeForce GTX TITAN X is 1350×10^9 updates per second for a 16K-step simulation of 512K × 512K cells stored on the SSD. Since an Intel Core i7 CPU using the same technique performs 13.4×10^9 updates per second, our GPU implementation of the Game of Life achieves a speedup factor of 100. Thus, these techniques work very efficiently on a GPU.
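Technique (1) can be sketched as follows (a hedged reconstruction, not the authors' kernel; bits at word boundaries are ignored for brevity). With 32 cells of a row packed into one word, the eight neighbor bit-planes are summed into a bit-sliced 3-bit counter, and the life rule then costs a few bitwise operations for all 32 cells at once:

    // up, cur, dn: the same 32 columns of the rows above, at, and below.
    // Returns the next states of the 32 cells packed in cur.
    __device__ unsigned int life_step32(unsigned int up, unsigned int cur,
                                        unsigned int dn) {
      unsigned int nb[8] = { up << 1, up, up >> 1,      // row above
                             cur << 1, cur >> 1,        // left and right
                             dn << 1, dn, dn >> 1 };    // row below
      unsigned int c0 = 0, c1 = 0, c2 = 0;   // bit-sliced neighbour counter
      for (int i = 0; i < 8; i++) {
        unsigned int t0 = c0 & nb[i]; c0 ^= nb[i];   // add 1 at weight 1
        unsigned int t1 = c1 & t0;    c1 ^= t0;      // carry to weight 2
        c2 |= t1;                                    // saturate at counts >= 4
      }
      // alive next step iff count == 3, or count == 2 and currently alive
      return ~c2 & c1 & (c0 | cur);
    }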

6 citations


Proceedings ArticleDOI
01 Nov 2016
TL;DR: A very efficient GPU implementation of the bulk computation of eigenvalues for a large number of small non-symmetric real matrices is presented, using three types of assignments of GPU threads to matrices and three memory arrangements in the global memory.
Abstract: The main contribution of this paper is to present a very efficient GPU implementation of the bulk computation of eigenvalues for a large number of small non-symmetric real matrices. This work is motivated by the necessity of such bulk computation in the design of control systems, which requires computing the eigenvalues of hundreds of thousands of non-symmetric real matrices of size up to 30×30. In our GPU implementation, we considered programming issues of the GPU architecture, including warp divergence, coalesced access of the global memory, bank conflicts of the shared memory, etc. In particular, we present three types of assignments of GPU threads to matrices and introduce three memory arrangements in the global memory. The experimental results on an NVIDIA GeForce GTX TITAN X show that our GPU implementation for 500,000 matrices of sizes 5×5 to 30×30 attains a speed-up factor of approximately 15 over the CPU implementation on an Intel Core i7-4790.
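The memory-arrangement issue can be illustrated with a toy kernel (an assumed element-major layout shown for illustration; the paper compares three arrangements). Storing element e of matrix m at a[e * numMat + m] makes the threads of a warp, each owning one matrix, touch consecutive addresses at every step, so global-memory accesses coalesce:

    // One thread per matrix, element-major layout a[e * numMat + m].
    // The index arithmetic, not the operation itself, is the point here.
    __global__ void scale_all(float *a, int numMat, int n, float s) {
      int m = blockIdx.x * blockDim.x + threadIdx.x;   // matrix id = thread id
      if (m >= numMat) return;
      for (int e = 0; e < n * n; e++)
        a[(size_t)e * numMat + m] *= s;                // stride-1 across the warp
    }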

5 citations


Journal ArticleDOI
TL;DR: This work presents a flexible-length arithmetic processor based on the FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations on multiple-length numbers using FPGAs (Field Programmable Gate Arrays).
Abstract: Algorithms requiring fast manipulation of multiple-length numbers are usually implemented in hardware. However, hardware implementation, using an HDL (Hardware Description Language) for instance, is a laborious task, and the quality of the solution relies heavily on the designer's expertise. The main contribution of this work is to present a flexible-length arithmetic processor based on the FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations on multiple-length numbers using FPGAs (Field Programmable Gate Arrays). The proposed processor has been implemented on the Xilinx Virtex-6 FPGA. Arithmetic instructions of the proposed processor architecture include addition, subtraction, and multiplication of integers exceeding 64 bits. To reduce the burden of implementing algorithms directly on the FPGA, applications requiring multiple-length arithmetic operations are written in a C-like language and translated into a machine program. The machine program is then transferred to and executed on the proposed architecture. A 2048-bit RSA encryption/decryption implementation has been used to assess the goodness of the proposed approach. Experimental results show that a 2048-bit RSA encryption on the proposed architecture takes only 2.2 times longer than a direct FPGA implementation. Furthermore, by employing multiple FDFM cores for the same task, the computing time is reduced considerably.
Key words: multiple-length numbers, multiple-length arithmetic, FPGA, RSA, Montgomery modular multiplication
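For concreteness, the following sketch (my illustration; the FDFM core realizes this kind of datapath with one DSP slice and block RAMs, not with C code) shows a multiple-length addition, the simplest of the supported instructions; the carry ripples word by word, exactly the serial schedule that keeps the hardware small:

    #include <stdint.h>

    // Adds two multiple-length numbers stored least-significant word first.
    void mp_add(const uint64_t *a, const uint64_t *b, uint64_t *s, int words) {
      uint64_t carry = 0;
      for (int k = 0; k < words; k++) {
        uint64_t t = a[k] + carry;
        uint64_t c1 = (t < carry);     // overflow of the first addition
        s[k] = t + b[k];
        carry = c1 + (s[k] < t);       // overflow of the second addition
      }
      // a full implementation would also report the final carry
    }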

5 citations


Journal ArticleDOI
TL;DR: A new technique is proposed to generate an ASCII/JIS art that reproduces the original tone and the details of an input grey-scale image, inspired by the local exhaustive search (LES), which optimises binary images for printing based on the characteristics of the human visual system.
Abstract: An ASCII art is a matrix of ASCII code characters that reproduces an original grey-scale image. A JIS art is an ASCII art that uses JIS Kanji code characters instead of ASCII code characters. They are commonly used to represent pseudo grey-scale images in text-based messages. Since automatic generation of high quality ASCII/JIS art images is very hard, they are usually produced by hand. The main contribution of this paper is to propose a new technique to generate an ASCII/JIS art that reproduces the original tone and the details of an input grey-scale image. Our new technique is inspired by the local exhaustive search (LES) used to optimise binary images for printing based on the characteristics of the human visual system. Although it can generate high quality ASCII/JIS art images, the LES requires a lot of computing time. Hence, we have implemented our new technique on a graphics processing unit (GPU) to accelerate the computation. The experimental results show that the GPU implementation can achieve a speedup factor of up to 89.56 over the conventional CPU implementation.

4 citations


Journal ArticleDOI
TL;DR: The main purpose of this work is to implement the bulk execution of a Euclidean algorithm computing the GCD (Greatest Common Divisor) of two large numbers on a GPU; to this end, a semi-oblivious sequential algorithm, which is almost oblivious, is introduced.
Abstract: The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. A sequential algorithm is oblivious if the address accessed at each time unit is independent of the input. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run on a GPU very efficiently. The main purpose of our work is to implement the bulk execution of a Euclidean algorithm computing the GCD (Greatest Common Divisor) of two large numbers on a GPU. We first present a new efficient Euclidean algorithm that we call the Approximate Euclidean algorithm. The idea of the Approximate Euclidean algorithm is to compute an approximation of the quotient by just one 64-bit division and to use it to reduce the number of iterations of the Euclidean algorithm. Unfortunately, the Approximate Euclidean algorithm is not oblivious. To show that the bulk execution of the Approximate Euclidean algorithm can be implemented efficiently on the GPU, we introduce the notion of semi-oblivious sequential algorithms, which are almost oblivious. We show that the Approximate Euclidean algorithm can be implemented as a semi-oblivious algorithm. The experimental results show that our parallel implementation of the Approximate Euclidean algorithm for 1024-bit integers running on a GeForce GTX TITAN X GPU is 90 times faster than the Intel Xeon CPU implementation.
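The approximate-quotient idea admits a compact sketch (a hedged illustration using 128-bit operands in place of the paper's 1024-bit numbers, which live in arrays of words). The leading bits of both operands are divided once in native 64-bit arithmetic; dividing by hb + 1 makes q a guaranteed under-approximation of floor(a/b), so the update a ← a − q·b never underflows:

    #include <stdint.h>
    typedef unsigned __int128 u128;

    static int nbits(u128 x) {                   // index of highest set bit + 1
      int n = 0;
      while (n < 128 && (x >> n)) n++;
      return n;
    }

    u128 gcd_approx(u128 a, u128 b) {
      while (b != 0) {
        if (a < b) { u128 t = a; a = b; b = t; }        // keep a >= b
        int shift = nbits(a) > 64 ? nbits(a) - 64 : 0;  // expose a's top 64 bits
        uint64_t ha = (uint64_t)(a >> shift);
        uint64_t hb = (uint64_t)(b >> shift);
        // one native 64-bit division; q <= floor(a/b) since hb+1 over-
        // estimates b's leading part, so the subtraction below is safe
        u128 q = (hb < UINT64_MAX) ? ha / (hb + 1) : 1;
        if (q == 0) q = 1;                 // degenerate case: subtract once
        a -= q * b;                        // the multiple-length update
      }
      return a;
    }

The reason this fits the bulk-execution framework is that the loop's memory-access pattern barely depends on the input values, which is what the semi-oblivious property formalizes.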

Journal ArticleDOI
TL;DR: The main contribution of this paper is to develop a processor core that executes the Euclidean algorithm for computing the GCD (Greatest Common Divisor) of two large numbers on an FPGA; it is 3.8 times faster than the best GPU implementation and 316 times faster than a sequential implementation on the Intel Xeon CPU.
Abstract: The FDFM (Few DSP slices and Few block Memories) approach is an efficient approach that implements a processor core executing a particular algorithm using few DSP slices and few block RAMs in a single FPGA. Since a processor core based on the FDFM approach uses few hardware resources, hundreds of processor cores working in parallel can be implemented in an FPGA. The main contribution of this paper is to develop a processor core that executes the Euclidean algorithm for computing the GCD (Greatest Common Divisor) of two large numbers on an FPGA. This processor core, which we call the GCD processor core, uses only one DSP slice and one block RAM, and 1280 GCD processor cores can be implemented in a Xilinx Virtex-7 family FPGA XC7VX485T-2. The experimental results show that the performance of this FPGA implementation using 1280 GCD processor cores is 0.0904 µs per GCD computation for two 1024-bit integers. Quite surprisingly, it is 3.8 times faster than the best GPU implementation and 316 times faster than a sequential implementation on the Intel Xeon CPU.

Proceedings ArticleDOI
01 Nov 2016
TL;DR: A sequence of sensing data with timestamps may be transferred asynchronously, so that some data are delayed and the sequence of timestamps t0, t1, ..., tn−1 is not in proper increasing order; such a sequence is d-sorted if ti …
Abstract: Suppose that a sequence of sensing data with timestamps is transferred asynchronously. Some of the sensing data may be delayed by some period of time, so that the sequence is not in proper increasing order of timestamps. A sequence of timestamps t0, t1, ..., tn−1 is d-sorted if ti …

Journal ArticleDOI
TL;DR: A GPU implementation of bulk multiple-length multiplications is presented that adopts a warp-synchronous programming technique and attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation.
Abstract: In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp, which consists of 32 threads. In parallel processing using multiple threads, it is usually costly to synchronize the execution of threads and to communicate between threads. In the warp-synchronous programming technique, however, the execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronization operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing the shared memory. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication as a subroutine for larger bit sizes. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
Key words: multiple-length multiplication, GPU, GPGPU, parallel processing, warp-synchronous
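The communication pattern can be seen in a small fragment (a hedged sketch, not the authors' kernel; positional alignment and carry resolution of the partial products are omitted, so acc may wrap). One warp holds one operand limb per lane, and __shfl_sync broadcasts limb j from lane j to the whole warp with no shared memory and no barrier:

    // One warp, 32 lanes: lane i keeps limb i of x in a register and
    // accumulates the partial products x_i * y_j of its row.
    __global__ void warp_mul_rows(const unsigned int *x, const unsigned int *y,
                                  unsigned long long *row) {
      int lane = threadIdx.x & 31;
      unsigned int xi = x[lane];        // limb i of x
      unsigned int yl = y[lane];        // limb lane of y, to be broadcast
      unsigned long long acc = 0;
      for (int j = 0; j < 32; j++) {
        // register-to-register broadcast: lanes run in lockstep, so no
        // __syncthreads() and no shared-memory round trip is needed
        unsigned int yj = __shfl_sync(0xffffffffu, yl, j);
        acc += (unsigned long long)xi * yj;   // may wrap: carry handling is
      }                                       // the real kernel's remaining work
      row[lane] = acc;
    }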

Proceedings ArticleDOI
01 Nov 2016
TL;DR: This paper proposes a GPU implementation that accelerates the computation of the ACO algorithm for the vertex coloring problem, taking into account programming issues of the GPU architecture such as coalesced access of the global memory and bank conflicts of the shared memory.
Abstract: Vertex coloring is an assignment of colors to the vertices of an undirected graph such that no two vertices sharing the same edge have the same color. The vertex coloring problem is to find the minimum number of colors necessary to color a given graph, which is an NP-hard problem in combinatorial optimization. Ant Colony Optimization (ACO) is a well-known meta-heuristic in which a colony of artificial ants cooperates in exploring good solutions to a combinatorial optimization problem. Several methods applying ACO to the vertex coloring problem have been proposed. The main contribution of this paper is to propose a GPU implementation to accelerate the computation of the ACO algorithm for the vertex coloring problem. In our implementation, we have considered programming issues of the GPU architecture, such as coalesced access of the global memory, bank conflicts of the shared memory, etc. The experimental results show that on an NVIDIA GeForce GTX 1080, our implementation for 1000 vertices runs in 2.740 s, while the CPU implementation on an Intel Core i7-4790 runs in 100.866 s. Thus, our GPU implementation attains a speed-up factor of 36.81.
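For flavor, one ant's color choice can be written as a roulette-wheel draw over the pheromone table (a generic ACO construction step shown for illustration; the paper's exact transition rule may differ). Among the colors not used by an already-colored neighbor, color c is chosen for vertex v with probability proportional to tau[v][c]:

    #include <stdlib.h>

    // tau: V x C pheromone table; ok[c] != 0 iff colour c conflicts with no
    // coloured neighbour of v. Returns the chosen colour, or -1 if none fits.
    int choose_color(const float *tau, const int *ok, int v, int C) {
      float total = 0.0f;
      for (int c = 0; c < C; c++)
        if (ok[c]) total += tau[v * C + c];
      if (total <= 0.0f) return -1;
      float r = total * (float)rand() / (float)RAND_MAX;   // spin the wheel
      int last = -1;
      for (int c = 0; c < C; c++) {
        if (!ok[c]) continue;
        last = c;
        r -= tau[v * C + c];
        if (r <= 0.0f) break;            // landed on colour c
      }
      return last;
    }

On the GPU, each thread plays one ant and would draw its random numbers from the cuRAND library instead of rand().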

Journal ArticleDOI
TL;DR: The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU, called w-SCAN, which relies on warp shuffle instructions to accelerate the communication between threads without resorting to shared memory access.
Abstract: The closeness of a match is an important measure with a number of practical applications, including computational biology, signal processing and text retrieval. The approximate string matching (ASM) problem asks to find a substring of string Y of length n that is most similar to string X of length m. It is well-known that the ASM can be solved by a dynamic programming technique by computing a table of size m × n. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The proposed GPU implementation relies on warp shuffle instructions, which are used to accelerate the communication between threads without resorting to shared memory access. Despite the fact that O(mn) memory access operations are necessary to access all elements of a table of size n × m, the proposed implementation performs only O(mn/w) memory access operations, where w is the warp size. Experimental results carried out on a GeForce GTX 980 GPU show that the proposed implementation, called w-SCAN, provides a speed-up of over two-fold in computing the ASM compared to another prominent alternative.
Key words: approximate string matching, edit distance, GPU, CUDA, shuffle instructions
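The access pattern behind the O(mn/w) bound can be reconstructed in a sketch (my rendering of the general anti-diagonal technique, not the published w-SCAN kernel; for simplicity it computes the plain edit distance of a 32-symbol pattern, and the ASM variant changes only the border initialization). Each lane owns one column of the DP table, the warp sweeps anti-diagonals, and a cell's left and diagonal neighbors arrive by __shfl_up_sync from the adjacent lane, so the table itself never touches memory:

    // Lane `lane` owns column j = lane+1 of the DP table for pattern
    // X[0..31] against text Y[0..n-1]; borders D[0][j] = j, D[i][0] = i.
    __device__ int warp_edit_distance32(const char *X, const char *Y, int n) {
      const unsigned FULL = 0xffffffffu;
      int lane = threadIdx.x & 31;
      int j = lane + 1;
      char xj = X[lane];
      int prev = j, prev2 = j, cur = j, result = 0;
      for (int d = 2; d <= n + 32; d++) {              // sweep anti-diagonals
        int i = d - j;                                 // row handled this step
        int left = __shfl_up_sync(FULL, prev, 1);      // D[i][j-1] from lane-1
        int diag = __shfl_up_sync(FULL, prev2, 1);     // D[i-1][j-1] from lane-1
        if (lane == 0) { left = d - 1; diag = d - 2; } // column-0 border
        if (i >= 1 && i <= n) {
          int sub = diag + (xj != Y[i - 1] ? 1 : 0);   // match or substitute
          cur = min(sub, min(prev, left) + 1);         // prev holds D[i-1][j]
          if (i == n) result = cur;                    // bottom row reached
        }
        prev2 = prev; prev = cur;
      }
      return __shfl_sync(FULL, result, 31);            // D[n][32] is in lane 31
    }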

Book ChapterDOI
14 Dec 2016
TL;DR: The idea of the proposed method is to arrange basic components derived from a (3, g)-cage in a two-dimensional manner and to connect adjacent components by parallel edges of length 4 each; the result of numerical calculations shows that the average distance in the resulting graph is close to the lower bound.
Abstract: This paper proposes a deterministic method to construct 5-regular geometric graphs with short average distance under the constraint that the set of vertices is a subset of ℕ × ℕ and the length of each edge is at most 4. This problem is motivated by the design of an efficient floor plan for parallel computers consisting of a number of computing nodes arranged on a two-dimensional array. In such systems, the degree of the vertices is determined by the number of ports of the routers, and the edge length is limited by a certain value determined by the cycle time. The goodness of the resulting geometric graph is evaluated by the average shortest path length (ASPL) between vertices, which reflects the average communication delay between computing nodes. The idea of the proposed method is to arrange basic components derived from a (3, g)-cage in a two-dimensional manner and to connect adjacent components by parallel edges of length 4 each. The result of numerical calculations shows that the average distance in the resulting graph is close to the lower bound: the gap to the lower bound is less than 0.98 when the number of vertices is 432,000.
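Since the construction is judged by its ASPL, the straightforward evaluator is worth spelling out (a generic sketch, not the authors' program; it assumes the graph is connected): one breadth-first search per source vertex, averaging the distance d(u, v) over all ordered pairs:

    #include <string.h>

    #define MAXV 432000
    static int dist[MAXV], queue_[MAXV];

    // adj[v][k]: k-th neighbour of v; deg[v] <= 5 for a 5-regular graph.
    double aspl(int n, const int deg[], const int adj[][5]) {
      long long sum = 0, pairs = 0;
      for (int s = 0; s < n; s++) {
        memset(dist, -1, n * sizeof(int));
        int head = 0, tail = 0;
        dist[s] = 0; queue_[tail++] = s;
        while (head < tail) {                          // plain BFS from s
          int v = queue_[head++];
          for (int k = 0; k < deg[v]; k++) {
            int w = adj[v][k];
            if (dist[w] < 0) { dist[w] = dist[v] + 1; queue_[tail++] = w; }
          }
        }
        for (int v = 0; v < n; v++)
          if (v != s) { sum += dist[v]; pairs++; }
      }
      return (double)sum / (double)pairs;
    }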

12 Jan 2016
TL;DR: The main contribution of this paper is to present an efficient hardware architecture for the LZW decompression algorithm and to implement it in an FPGA; the implementation runs up to 64 times faster than sequential LZW decompression on a single CPU.
Abstract: The LZW algorithm is one of the most important compression and decompression algorithms. The main contribution of this paper is to present an efficient hardware architecture for the LZW decompression algorithm and to implement it in an FPGA. In our implementation, the codes of a compressed file are read one by one, and the dictionary table is continuously updated until the table is full. For each code of the compressed file, the reversed string corresponding to the code is sequentially written to an output buffer, and the length of this string and the address of its forefront are stored. The string can then be output in the correct order from the output buffer using the stored length and forefront address. Since the output buffer uses dual-port block RAMs, writing the reversed strings and reading out the original strings are performed in parallel. The experimental results show that our FPGA module for LZW decompression on a Virtex-7 family FPGA uses 287 slice registers, 282 slice LUTs and 7 block RAMs of 36k bits each. One LZW decompression module is more than 2 times faster than sequential LZW decompression on a single CPU. Since the proposed FPGA module uses few resources of the FPGA, we implemented 34 LZW decompression modules that work in parallel in the FPGA. In other words, our implementation runs up to 64 times faster than sequential LZW decompression on a single CPU.
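In software terms, the buffer trick corresponds to the sketch below (a hedged C rendering of the mechanism described above, not the RTL; clear/end codes are ignored). Following the parent chain of a code yields its string last character first, so it is written backwards and then read out reversed; the hardware overlaps these two phases through the dual-port block RAM:

    typedef struct { int parent; unsigned char ch; } Entry;  // dictionary node

    // Emits the string of `code` into out[] in correct order; returns length.
    int emit_code(const Entry *dict, int code, unsigned char *out) {
      unsigned char tmp[4096];            // plays the paper's output buffer
      int len = 0;
      for (int c = code; c >= 0; c = dict[c].parent)
        tmp[len++] = dict[c].ch;          // parent chain: last character first
      for (int i = 0; i < len; i++)
        out[i] = tmp[len - 1 - i];        // read back reversed
      return len;
    }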