
Showing papers by "Koji Nakano" published in 2017


Journal ArticleDOI
TL;DR: The authors present adaptive loss-less (ALL) data compression, a lossless data compression method designed so that the compression ratio is moderate but decompression can be performed very efficiently on the GPU.
Abstract: There is no doubt that data compression is very important in computer engineering. However, most lossless data compression and decompression algorithms are very hard to parallelize, because they use dictionaries updated sequentially. The main contribution of this paper is to present a new lossless data compression method that we call adaptive loss-less (ALL) data compression. It is designed so that the data compression ratio is moderate, but decompression can be performed very efficiently on the graphics processing unit (GPU). This makes sense for applications such as training of deep learning, in which compressed archived data are decompressed many times. To show the potential of the ALL data compression method, we have evaluated the running time using five image and five text data sets and compared ALL with previously published lossless data compression methods implemented on the GPU: Gompresso, CULZSS, and LZW. The data compression ratio of ALL data compression is better than the others for eight of these 10 data sets. Also, our GPU implementation on a GeForce GTX 1080 GPU for ALL decompression runs 84.0 to 231 times faster than the CPU implementation on a Core i7-4790 CPU. Further, it runs 1.22 to 23.5 times faster than Gompresso, CULZSS, and LZW running on the same GPU.

17 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: The main contribution of this paper is to introduce task arrays and to present the Single Kernel Soft Synchronization (SKSS) technique, which significantly reduces the kernel-call and synchronization overheads of conventional column-by-column implementations of task arrays.
Abstract: A task array is a 2-dimensional array of tasks with dependency relations. Each task uses the resulting values of some tasks in the columns to its left, and so it can be started only after those tasks are completed. Conventional CUDA implementations repeatedly perform a separate CUDA kernel call for each column, from left to right, to synchronize the computation of tasks. However, this conventional approach has several drawbacks: a CUDA kernel call has a certain overhead, and the running time of a CUDA kernel is determined by the CUDA block that terminates last. Also, every task must write and preserve its resulting values in the global memory, which has low memory access performance, for the following tasks. The main contribution of this paper is to introduce task arrays and to present the Single Kernel Soft Synchronization (SKSS) technique, which significantly reduces these overheads for task arrays. The SKSS performs only one CUDA kernel call and synchronizes the CUDA blocks assigned to the rows of a task array using a global counter. To clarify the potential of our SKSS technique, we have implemented the dynamic programming for the 0-1 knapsack problem, the summed area table computation, and the error diffusion of a gray-scale image using our SKSS technique and compared them with the previously published best GPU implementations. Quite surprisingly, the experimental results using an NVIDIA Titan X show that our SKSS implementations are 1.29-2.11 times faster for the 0-1 knapsack problem, 1.08-1.56 times faster for the summed area table computation, and 1.61-2.11 times faster for the error diffusion.
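The 0-1 knapsack dynamic program mentioned above is a convenient way to picture a task array: each column of the DP table depends only on the previous column, so all rows of a column can proceed once the needed cells to the left are done. The Python sketch below is only a CPU-side illustration of that dependency structure, not the SKSS GPU implementation; the function and variable names are illustrative.

```python
def knapsack_table(weights, values, capacity):
    """0-1 knapsack DP viewed as a task array: column i depends only on
    column i-1, which is the left-to-right dependency that SKSS tracks
    with a global counter instead of one kernel call per column."""
    prev = [0] * (capacity + 1)          # column 0: no items considered
    for i in range(len(weights)):        # one column per item
        cur = [0] * (capacity + 1)
        for c in range(capacity + 1):    # one task per row (capacity value)
            cur[c] = prev[c]             # skip item i
            if weights[i] <= c:          # or take item i
                cur[c] = max(cur[c], prev[c - weights[i]] + values[i])
        prev = cur
    return prev[capacity]

print(knapsack_table([2, 3, 4], [3, 4, 5], 5))  # -> 7 (take items 0 and 1)
```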

12 citations


Proceedings ArticleDOI
01 May 2017
TL;DR: This work presents a Bitwise Parallel Bulk Computation (BPBC) technique that accelerates the Smith-Waterman Algorithm (SWA) by converting its dynamic programming computation into a circuit simulation that computes multiple instances simultaneously.
Abstract: The bulk execution of a sequential algorithm is to execute it for many different inputs in turn or at the same time. It is known that the bulk execution of an oblivious sequential algorithm can be implemented to run efficiently on a GPU. The bulk execution supports fine-grained bitwise parallelism, allowing it to achieve high acceleration over a straightforward sequential computation. The main contribution of this work is to present a Bitwise Parallel Bulk Computation (BPBC) technique to accelerate the Smith-Waterman Algorithm (SWA). More precisely, the dynamic programming for the SWA repeatedly performs the same computation O(mn) times. Thus, our idea is to convert this computation into a circuit simulation using the BPBC technique to compute multiple instances simultaneously. The proposed BPBC technique for the SWA has been implemented on the GPU and CPU. Experimental results show that the proposed BPBC for the SWA accelerates the computation by a factor of more than 447 compared to a single-CPU implementation.
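For reference, the sequential dynamic program that the BPBC technique executes in bulk is the standard Smith-Waterman recurrence. The Python sketch below shows that O(mn) computation for a single instance; the scoring parameters are illustrative assumptions, and the paper's bitwise circuit-simulation encoding is not reproduced here.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Plain O(mn) Smith-Waterman DP returning the best local alignment score.
    The BPBC approach runs this same recurrence for many (a, b) pairs at once."""
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # match / mismatch
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # small example
```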

10 citations


Proceedings ArticleDOI
01 Aug 2017
TL;DR: This paper presents simple and fast parallel algorithms for computing the complete/connected Voronoi maps and the Euclidean distance map, and implements them on the GPU.
Abstract: The complete Voronoi map of a binary image with black and white pixels is a matrix of the same size such that each element is the closest black pixel of the corresponding pixel. The complete Voronoi map visualizes the influence region of each black pixel. However, each region may not be connected due to exclave pixels. The connected Voronoi map is a modification of the complete Voronoi map so that all regions are connected. The Euclidean distance map of a binary image is a matrix in which each element is the distance to the closest black pixel. It has many applications in image processing, such as dilation, erosion, blurring effects, skeletonization, and matching. The main contribution of this paper is to present simple and fast parallel algorithms for computing the complete/connected Voronoi maps and the Euclidean distance map and to implement them on the GPU. Our parallel algorithm first computes the mixed Voronoi map, which is a mixture of the complete and connected Voronoi maps, and then converts it into the complete/connected Voronoi map by exposing/hiding all exclave pixels. After that, the complete Voronoi map is converted into the Euclidean distance map by computing the distance to the closest black pixel for every pixel in an obvious way. The experimental results on a GeForce GTX 1080 GPU show that the computing time for these conversions is relatively small. The throughput of our GPU implementation for computing the Euclidean distance maps of 2K × 2K binary images is up to 2.08 times larger than the previously published best GPU implementation, and up to 172 times larger than a CPU implementation using an Intel Core i7-4790.
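To pin down the two outputs, the brute-force Python sketch below computes, for every pixel, its closest black pixel (the complete Voronoi map) and the distance to it (the Euclidean distance map). This is only a definition-level CPU reference, not the paper's mixed-Voronoi-map GPU algorithm.

```python
import math

def complete_voronoi_and_edm(image):
    """Brute force: for every pixel, find the nearest black pixel and its distance.
    image[y][x] is True for black pixels."""
    h, w = len(image), len(image[0])
    blacks = [(y, x) for y in range(h) for x in range(w) if image[y][x]]
    voronoi = [[None] * w for _ in range(h)]     # complete Voronoi map
    edm = [[math.inf] * w for _ in range(h)]     # Euclidean distance map
    for y in range(h):
        for x in range(w):
            for by, bx in blacks:
                d = math.hypot(y - by, x - bx)
                if d < edm[y][x]:
                    edm[y][x] = d
                    voronoi[y][x] = (by, bx)
    return voronoi, edm

img = [[False, True, False],
       [False, False, False],
       [True, False, False]]
vor, dist = complete_voronoi_and_edm(img)
```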

8 citations


Journal ArticleDOI
TL;DR: This paper proposes a graphics processing unit (GPU) implementation of digital halftoning employing local exhaustive search to produce high-quality binary images, and a GPU implementation of cluster-dot halftoning tailored for local exhaustive search.
Abstract: Digital halftoning is an important process to convert a grayscale image into a binary image with black and white pixels. Local exhaustive search-based halftoning is one of the halftoning methods that can generate high-quality binary images. However, considering the computing time, it is not practical for most applications. As a first contribution, this paper proposes a graphics processing unit (GPU) implementation of digital halftoning employing local exhaustive search to produce high-quality binary images. Programming issues of the GPU architecture have been carefully assessed in implementing the proposed method. Experimental results show that the proposed GPU implementation on an NVIDIA (Santa Clara, CA, USA) GeForce GTX TITAN X attains a speed-up factor of up to 48 over a CPU implementation. Our second contribution is a GPU implementation of cluster-dot halftoning tailored for local exhaustive search. This implementation attains a speed-up factor of 92 over a sequential CPU implementation.
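As a rough illustration of the local-exhaustive-search idea, the Python sketch below replaces each small window of the binary image with whichever of the 2^(win*win) patterns minimizes an error measure. The Gaussian-filter error model, window size, and sweep count are assumptions made only for illustration; the paper's exact error model and search order may differ, and this naive CPU version is deliberately unoptimized.

```python
import itertools
import numpy as np
from scipy.ndimage import gaussian_filter

def les_halftone(gray, win=2, sweeps=2, sigma=1.3):
    """Local exhaustive search halftoning sketch: try every binary pattern in each
    win x win window and keep the one minimizing the (assumed) Gaussian-filtered
    squared error against the grayscale input (values in [0, 1])."""
    h, w = gray.shape
    target = gaussian_filter(gray, sigma)            # assumed perceptual model
    binary = (gray > 0.5).astype(float)              # simple initial halftone
    patterns = [np.array(p, dtype=float).reshape(win, win)
                for p in itertools.product([0.0, 1.0], repeat=win * win)]
    for _ in range(sweeps):
        for y in range(0, h - win + 1, win):
            for x in range(0, w - win + 1, win):
                best, best_err = None, np.inf
                for p in patterns:
                    binary[y:y + win, x:x + win] = p
                    err = np.sum((gaussian_filter(binary, sigma) - target) ** 2)
                    if err < best_err:
                        best, best_err = p, err
                binary[y:y + win, x:x + win] = best
    return binary
```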

8 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: A very efficient FPGA implementation of Approximate String Matching (ASM) for a pattern string of length m and a text string of length n is presented, together with a hybrid circuit that combines the bit-vector and ASM circuits.
Abstract: The main contribution of this paper is to present a very efficient FPGA implementation that performs Approximate String Matching (ASM) for a pattern string and a text string of length m and n, respectively. It is well known that the ASM can be done in O(mn) time by the dynamic programming technique. Myers has presented a sophisticated sequential algorithm called the bit-vector algorithm, which performs the ASM in O(n) time using m-bit addition and bitwise operations. Hoffmann et al. have implemented the bit-vector algorithm in the FPGA and evaluated its performance. However, the performance of the bit-vector circuit implemented in the FPGA is degraded for large m due to a long critical path of length proportional to m. We present a circuit with an O(1)-length critical path that performs the ASM with a very high clock frequency and throughput. Also, to reduce the hardware usage, we present a hybrid circuit combining the bit-vector and our ASM circuits. The experimental results show that our hybrid circuit for the ASM is 20 times more efficient than the bit-vector circuit in terms of performance per circuit resource. To see the potential of the ASM computation on the FPGA, we evaluated the performance of the ASM on the latest FPGA, GPU, and CPU. Our hybrid circuit implemented in a Xilinx Virtex UltraScale+ XCVU9P FPGA is more than 58 times and 1400 times faster than parallel ASM computation on an NVIDIA TITAN X GPU and a Core i7-6700K CPU, respectively. Thus, the FPGA is promising as an accelerator of the ASM.
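For context, Myers' bit-vector algorithm referenced above can be written in a few lines of Python. The version below is the standard O(n) sequential formulation for searching approximate occurrences of the pattern within edit distance k; it is not the authors' FPGA circuit, and the variable names follow the usual presentation of the algorithm.

```python
def myers_search(pattern, text, k):
    """Myers' bit-vector approximate string matching: report end positions j in
    text where some substring ending at j matches pattern within edit distance k.
    Uses one m-bit addition and a few bitwise operations per text character."""
    m = len(pattern)
    mask = (1 << m) - 1
    Peq = {}                                   # Peq[c]: bit i set iff pattern[i] == c
    for i, ch in enumerate(pattern):
        Peq[ch] = Peq.get(ch, 0) | (1 << i)
    Pv, Mv, score = mask, 0, m                 # vertical +1 / -1 delta vectors, current score
    hits = []
    for j, ch in enumerate(text):
        Eq = Peq.get(ch, 0)
        Xv = Eq | Mv
        Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq
        Ph = Mv | (~(Xh | Pv) & mask)
        Mh = Pv & Xh
        if Ph & (1 << (m - 1)):
            score += 1
        elif Mh & (1 << (m - 1)):
            score -= 1
        Ph = (Ph << 1) & mask
        Mh = (Mh << 1) & mask
        Pv = (Mh | (~(Xv | Ph) & mask)) & mask
        Mv = Ph & Xv
        if score <= k:
            hits.append(j)
    return hits

print(myers_search("annual", "annealing", 2))  # end positions with distance <= 2
```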

5 citations


Journal ArticleDOI
TL;DR: An efficient GPU implementation of the bulk computation of eigenvalues for many small, non-symmetric, real matrices is presented; two types of assignments of GPU threads to matrices and three memory arrangements in the global memory are introduced.
Abstract: The main contribution of this paper is to present an efficient GPU implementation of the bulk computation of eigenvalues for many small, non-symmetric, real matrices. This work is motivated by the necessity of such bulk computation in the design of control systems, which requires computing the eigenvalues of hundreds of thousands of non-symmetric real matrices of size up to 30x30. Several efforts have been devoted to accelerating eigenvalue computation, including computer languages, systems, and environments supporting matrix manipulation that offer specific libraries/function calls. Some of them are optimized for computing the eigenvalues of a very large matrix by parallel processing. However, such libraries/function calls are not aimed at accelerating the eigenvalue computation for a large number of small matrices. In our GPU implementation, we considered programming issues of the GPU architecture, including warp divergence, coalesced access to the global memory, utilization of the shared memory, and so forth. In particular, we present two types of assignments of GPU threads to matrices and introduce three memory arrangements in the global memory. Furthermore, to hide CPU-GPU data transfer latency, computation on the GPU is overlapped with the transfer. Experimental results on an NVIDIA TITAN X show that our GPU implementation attains a speed-up factor of up to 83.50 and 17.67 over the sequential CPU implementation and the parallel CPU implementation with eight threads on an Intel Core i7-6700K, respectively.
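As a CPU-side reference for the same workload, NumPy can evaluate the eigenvalues of a whole stack of small matrices in one call, since np.linalg.eigvals broadcasts over leading axes. This is only a baseline sketch of the bulk problem, not the paper's GPU method; the matrix count and size below are illustrative.

```python
import numpy as np

def bulk_eigvals(matrices):
    """Eigenvalues of many small non-symmetric real matrices at once.
    matrices has shape (num, n, n); the result is a (num, n) complex array."""
    return np.linalg.eigvals(matrices)

# Illustrative bulk instance: 10,000 random 30 x 30 matrices.
rng = np.random.default_rng(0)
eigs = bulk_eigvals(rng.random((10_000, 30, 30)))
print(eigs.shape)  # (10000, 30)
```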

5 citations



Journal ArticleDOI
TL;DR: This paper presents a time-optimal implementation for the bulk execution of an oblivious sequential algorithm and develops a tool, named C2CU, which automatically generates a CUDA C program for the bulk execution of an oblivious sequential algorithm.
Abstract: Several important tasks, including matrix computation, signal processing, sorting, dynamic programming, encryption, and decryption, can be performed by oblivious sequential algorithms. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. A bulk execution of a sequential algorithm is to execute it for many independent inputs in turn or in parallel. A number of works have been devoted to designing and implementing parallel algorithms for a single input. However, none of these works evaluated the bulk execution performance of these algorithms. The first contribution of this paper is to present a time-optimal implementation for the bulk execution of an oblivious sequential algorithm. Our second contribution is to develop a tool, named C2CU, which automatically generates a CUDA C program for the bulk execution of an oblivious sequential algorithm. C2CU has been used to generate CUDA C programs for the bulk execution of the bitonic sorting, Floyd-Warshall, and Montgomery modulo multiplication algorithms. Compared to a sequential implementation on a single CPU, the generated CUDA C programs for the above algorithms run, respectively, 199, 54, and 78 times faster.
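Bitonic sorting, one of the three algorithms mentioned, illustrates why oblivious algorithms bulk-execute so well: its compare-exchange sequence is fixed in advance, so the same step can be applied to every input at once. The NumPy sketch below vectorizes the standard bitonic network across a batch of inputs on the CPU; it only illustrates the idea and is not the C2CU-generated CUDA code.

```python
import numpy as np

def bulk_bitonic_sort(batch):
    """Sort each row of batch (shape: num_inputs x n, with n a power of two)
    in ascending order. Because the comparator sequence is oblivious, one
    compare-exchange step is applied to all inputs simultaneously."""
    a = batch.copy()
    n = a.shape[1]
    k = 2
    while k <= n:                       # size of the bitonic blocks being merged
        j = k // 2
        while j >= 1:                   # distance between compared elements
            idx = np.arange(n)
            partner = idx ^ j
            lo = idx < partner          # handle each pair once
            i1, i2 = idx[lo], partner[lo]
            ascending = (i1 & k) == 0   # sorting direction of each block
            x, y = a[:, i1], a[:, i2]
            mn, mx = np.minimum(x, y), np.maximum(x, y)
            a[:, i1] = np.where(ascending, mn, mx)
            a[:, i2] = np.where(ascending, mx, mn)
            j //= 2
        k *= 2
    return a

batch = np.random.randint(0, 100, size=(4, 8))
print(bulk_bitonic_sort(batch))
```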

4 citations


Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work introduces a novel graph called a host-switch graph, which consists of host vertices and switch vertices, formulates a graph problem called the order/radix problem (ORP) for designing low end-to-end-latency interconnection networks, and demonstrates that the optimal number of switches can be predicted.
Abstract: We introduce a novel graph called a host-switch graph, which consists of host vertices and switch vertices. Using host-switch graphs, we formulate a graph problem called the order/radix problem (ORP) for designing low end-to-end-latency interconnection networks. Our focus is on reducing the host-to-host average shortest path length (h-ASPL), since the shortest path length between hosts in a host-switch graph corresponds to the end-to-end latency of a network. We hence define the ORP as follows: given the order (the number of hosts) and the radix (the number of ports per switch), find a host-switch graph with the minimum h-ASPL. We demonstrate that the optimal number of switches can be predicted mathematically. On the basis of the prediction, we run a randomized algorithm to find a host-switch graph with the minimum h-ASPL. Interestingly, our solutions include host-switch graphs in which switches are attached to different numbers of hosts. We then apply host-switch graphs to interconnection networks and evaluate them practically. Compared with three conventional interconnection networks (the torus, the dragonfly, and the fat-tree), we demonstrate that our networks provide higher performance while using fewer switches.
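To make the h-ASPL objective concrete, the Python sketch below computes the average shortest host-to-host path length over a host-switch graph, counting a host-to-host path as the two host links plus the switch-to-switch path between their switches. The exact convention (for example, whether host links are counted) is an assumption here, and the switch graph is assumed to be connected.

```python
from collections import deque
from itertools import combinations

def h_aspl(switch_edges, host_of):
    """Average shortest path length over all host pairs in a host-switch graph.
    switch_edges: undirected switch-switch links; host_of: host -> attached switch."""
    adj = {}
    for u, v in switch_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def bfs(src):                       # hop distances from one switch
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        return dist

    dist_from = {s: bfs(s) for s in set(host_of.values())}
    total = pairs = 0
    for a, b in combinations(host_of, 2):
        total += 2 + dist_from[host_of[a]][host_of[b]]   # host link + switch path + host link
        pairs += 1
    return total / pairs

# 4 hosts on a ring of 3 switches (illustrative only).
print(h_aspl([(0, 1), (1, 2), (2, 0)],
             {"h0": 0, "h1": 0, "h2": 1, "h3": 2}))
```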

4 citations


Proceedings ArticleDOI
01 Nov 2017
TL;DR: The main contribution of this paper is a non-photorealistic rendering method for high-quality pointillism image generation that pastes square patterns onto a canvas, together with a graphics processing unit (GPU) implementation that accelerates the computation.
Abstract: Non-photorealistic rendering is one of the digital art techniques. It generates digital images resembling artistic representations. The main contribution of this paper is to show a non-photorealistic rendering method for high-quality pointillism image generation with squares, which pastes square patterns onto a canvas. Our technique exploits characteristics of the human visual system to optimize the generated images. Although it can generate high-quality pointillistic images, a large amount of computing time is necessary. Hence, we have implemented our technique on a graphics processing unit (GPU) to accelerate the computation. The experimental results show that the GPU implementation achieves a speed-up factor of 160 over the sequential CPU implementation.

Journal ArticleDOI
TL;DR: This paper presents an implementation that performs the exhaustive search to verify the Collatz conjecture using a GPU and achieves a speed-up factor of 249 over the sequential CPU implementation.
Abstract: The main contribution of this paper is to present an implementation that performs an exhaustive search to verify the Collatz conjecture using a GPU. Consider the following operation on an arbitrary positive number: if the number is even, divide it by two; if the number is odd, triple it and add one. The Collatz conjecture asserts that, starting from any positive number m, repeated iteration of this operation eventually produces the value 1. We have implemented the verification on an NVIDIA GeForce GTX TITAN X and evaluated the performance. The experimental results show that our GPU implementation can verify 1.31x10^12 64-bit numbers per second, while the sequential CPU implementation on an Intel Core i7-4790 can verify 5.25x10^9 64-bit numbers per second. Thus, our implementation on the GPU attains a speed-up factor of 249 over the sequential CPU implementation. Additionally, we used the GPU to accelerate the computation of the delay, that is, the number of the above operations performed until a number reaches 1, which is one of the quantities of mathematical interest for the Collatz conjecture. Using a similar idea, we achieved a speed-up factor of 73.
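The operation being verified and the delay being counted are easy to state in code. The Python sketch below is a direct sequential version of both, not the paper's GPU kernel, and the range checked in the example is illustrative.

```python
def collatz_delay(m):
    """Delay of m: the number of 'halve if even, else 3m + 1' steps until m reaches 1.
    Verifying the conjecture for a range means confirming this loop terminates for
    every number in it."""
    delay = 0
    while m != 1:
        m = m // 2 if m % 2 == 0 else 3 * m + 1
        delay += 1
    return delay

# Exhaustively check a small range (the paper does this for 64-bit numbers on a GPU).
print(max(collatz_delay(m) for m in range(1, 100_000)))  # largest delay in the range
```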

Proceedings ArticleDOI
01 May 2017
TL;DR: The main contribution of this paper is to show a new photomosaic generation method that rearranges subimages of an image by reducing the rearrangement problem to a minimum weighted bipartite matching problem.
Abstract: The main contribution of this paper is to show a new photomosaic generation method that rearranges subimages of an image. In photomosaic generation, an input image is divided into small subimages, which are rearranged so that the rearranged image reproduces another image given as a target image. Therefore, this problem can be considered a combinatorial optimization problem of finding the rearrangement that best approximates the target image. Our new idea is to reduce this rearrangement problem to a minimum weighted bipartite matching problem. By solving the matching problem, we can obtain the best rearrangement. Although it generates the most similar photomosaic image, a lot of computing time is necessary. Hence, we also propose an approximation algorithm for photomosaic generation. This approximation algorithm does not obtain the most similar photomosaic image, but the computing time can be shortened considerably. Additionally, we accelerate the computation using the GPU (Graphics Processing Unit). The experimental results show that the GPU implementations of the optimization algorithm and the approximation algorithm accelerate the computation by factors of 40 and 66 over the serial CPU implementation, respectively.
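The reduction to minimum weighted bipartite matching can be sketched with an off-the-shelf assignment solver: one side of the bipartite graph is the set of subimages, the other is the set of tile positions of the target, and the edge weight is a dissimilarity between a subimage and the target region. The squared-pixel-difference cost below is an assumed metric, and the sketch is not the paper's GPU implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rearrange_tiles(source_tiles, target_tiles):
    """Optimal rearrangement of k subimages onto k target tile positions as a
    minimum-weight bipartite matching (Hungarian-style assignment).
    Both inputs have shape (k, tile_h, tile_w[, channels])."""
    k = len(source_tiles)
    src = source_tiles.reshape(k, -1).astype(float)
    tgt = target_tiles.reshape(k, -1).astype(float)
    # cost[i, j] = squared difference of subimage i against target position j
    cost = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)       # minimizes total cost
    placement = np.empty(k, dtype=int)
    placement[cols] = rows                          # placement[j] = subimage placed at slot j
    return placement

# Tiny illustrative instance: 16 random 8 x 8 grayscale tiles.
rng = np.random.default_rng(1)
print(rearrange_tiles(rng.random((16, 8, 8)), rng.random((16, 8, 8))))
```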

Book ChapterDOI
10 Sep 2017
TL;DR: This paper presents an efficient parallel implementation of the O(n^3)-time dynamic programming algorithm for many instances on the GPU (Graphics Processing Unit), considering programming issues of the GPU architecture such as coalesced access to the global memory and warp divergence.
Abstract: The optimal polygon triangulation problem for a convex polygon is an optimization problem to find a triangulation with minimum total weight. It is known that this problem can be solved using the dynamic programming technique in O(n^3) time. The main contribution of this paper is to present an efficient parallel implementation of this O(n^3)-time algorithm for many instances on the GPU (Graphics Processing Unit). In our proposed GPU implementation, we focused on the computation for many instances and considered programming issues of the GPU architecture such as coalesced access to the global memory and warp divergence. Our implementation solves the optimal polygon triangulation problem for 1024 convex 1024-gons in 4.77 s on the NVIDIA TITAN X, while a conventional CPU implementation runs in 241.53 s. Thus, our GPU implementation attains a speedup factor of 50.6.
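The O(n^3) dynamic program referred to above is the classic interval DP over the polygon's vertices. The Python sketch below computes the minimum total weight for a single convex polygon, assuming the weight of a triangle is its perimeter (the paper only requires some per-triangle weight); the bulk GPU execution over many polygons is not reproduced here.

```python
import math

def min_weight_triangulation(points):
    """Optimal triangulation of a convex polygon (vertices given in order).
    dp[i][j] = minimum total weight of triangulating the sub-polygon i..j."""
    n = len(points)

    def d(i, j):
        return math.dist(points[i], points[j])

    dp = [[0.0] * n for _ in range(n)]
    for gap in range(2, n):
        for i in range(n - gap):
            j = i + gap
            dp[i][j] = min(dp[i][k] + dp[k][j] + d(i, k) + d(k, j) + d(i, j)
                           for k in range(i + 1, j))
    return dp[0][n - 1]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(min_weight_triangulation(square))  # two triangles sharing one diagonal
```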

Journal ArticleDOI
TL;DR: This special issue is intended to provide an overview of some key topics and state-of-the-art of recent advances in subjects relevant to High-End Data-Intensive Computing Systems.
Abstract: With the increasing availability of data generated by scientific instruments and simulations, today, solving many of our most important scientific and engineering problems requires high-end computing systems (HECS)1 that may be able to process and store a huge amount of data.2 With this landscape, many synergies between extreme-scale computing, simulations, and data-intensive applications might arise.3,4 However, the high-performance computing and data analysis platforms, paradigms, and tools have evolved in many cases in different fields, having their own specific methodologies, tools, and techniques. We need to evolve systems and paradigms to create High-End Data-Intensive Computing Systems (HEDICS): high-end resources that must be powerful enough in a broad sense (computation, storage, I/O capacity, communications, etc.), but that at the same time have to provide utilities from the Big Data computing (BDC) space to satisfy the data management and analytics needs of near-future applications. Future HECS platforms will likely be characterized by a three to four orders of magnitude increase in concurrency, a substantially larger storage capacity, and a deepening of the storage hierarchy. Moreover, the advent of the Big Data challenges5 has generated new initiatives closely related to ultrascale computing systems in large-scale distributed systems. The current uncoordinated development model of independently applying optimizations at each layer of the system software and I/O software stack will not scale to the required levels of distribution, concurrency, storage hierarchy, and capacity.6 Thus, we need reusable, modular, and scalable frameworks for designing high-end reconfigurable computers, including novel data processing building blocks and innovative programming models. In those aspects, many new topics are open to research: parallel and distributed algorithms for HEDICS; algorithms for aggressive management of information and knowledge from massive data sources; resource management and scheduling in high-end data and computing systems; tools and environments for parallel/distributed high-end software development; new programming models, as well as machine and application abstractions; resilience issues in HEDICS; adaptive software; architectures, networks, and systems suited for extreme-scale and Big Data; massive distributed and parallel data analytics and feature extraction; new I/O and storage systems valid for HEDICS; and novel and redesigned high-end scientific and engineering computing. This special issue is intended to provide an overview of some key topics and the state-of-the-art of recent advances in subjects relevant to High-End Data-Intensive Computing Systems. The general objectives are to address, explore, and exchange information on the challenges and current state-of-the-art in HEDICS, new programming models, run-times, and data facilities design and performance, and their application in various science and engineering domains.

01 Jan 2017
TL;DR: The new idea is to reduce the rearrangement problem to a minimum weighted bipartite matching problem; an approximation method is also proposed that does not obtain the most similar photomosaic image but shortens the computing time considerably.
Abstract: In this paper, we propose a photomosaic generation method that rearranges divided images. In photomosaic generation, an input image is divided into small subimages, which are rearranged so that the rearranged image reproduces another image given as a target image. Therefore, we can consider this problem a combinatorial optimization problem of finding the rearrangement that best approximates the target image. Our new idea is to reduce this rearrangement problem to a minimum weighted bipartite matching problem. By solving the matching problem, we can obtain the best rearrangement. Although it generates the most similar photomosaic image, a lot of computing time is necessary. Hence, we propose an approximation method for photomosaic generation. This approximation method does not obtain the most similar photomosaic image, but the computing time can be shortened considerably. Additionally, we accelerate the approximation method using the GPU (Graphics Processing Unit). The experimental results show that the GPU implementation attains a speed-up factor of 25 over the sequential CPU implementation.