
Showing papers by "Koji Nakano" published in 2013


Proceedings ArticleDOI
Koji Nakano1
20 May 2013
TL;DR: The Hierarchical Memory Machine (HMM) is introduced, which consists of multiple DMMs and a single UMM, and it is proved that the implementation of the direct convolution is time optimal.
Abstract: The Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM) are theoretical parallel computing models that capture the essence of the shared memory access and the global memory access of GPUs. The main contribution of this paper is to introduce the Hierarchical Memory Machine (HMM), which consists of multiple DMMs and a single UMM. The HMM is a more practical parallel computing model that reflects the architecture of current GPUs. We present several fundamental algorithms on the HMM. First, we show that the sum of n numbers can be computed in O(n/w + nl/p + l + log n) time units using p threads on the HMM with width w and latency l, and prove that this computing time is optimal. We also show that the direct convolution of m and n numbers can be done in O(n/w + mn/dw + nl/p + l + log m) time units using p threads on the HMM with d DMMs, width w, and latency l. Finally, we prove that our implementation of the direct convolution is time optimal.
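To make the bound concrete, here is a minimal CUDA sketch of the standard shared-memory sum reduction that the O(n/w + nl/p + l + log n) analysis models; it illustrates the general pattern, not the paper's HMM-tuned algorithm, and the block size and atomicAdd combining step are assumptions.

```cuda
// Each block reduces blockDim.x elements in shared memory; per-block
// partial sums are combined with atomicAdd for brevity.
__global__ void blockSum(const float *a, float *sum, int n) {
    __shared__ float buf[256];                 // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? a[i] : 0.0f;
    __syncthreads();
    // log(blockDim.x) rounds of pairwise addition -- the source of the
    // log n term in the computing time bound.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(sum, buf[0]);
}
```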

35 citations


Proceedings ArticleDOI
26 Sep 2013
TL;DR: This paper shows an optimal parallel algorithm for approximate string matching on the HMM, implements it on a CUDA-enabled GPU, and shows that the GPU implementation attains a speedup factor of 66.1 over the single-CPU implementation.
Abstract: The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The approximate string matching (ASM) for two strings X and Y of length m and n is a task to find a substring of Y most similar to X. The main contribution of this paper is to show an optimal parallel algorithm for the approximate string matching on the HMM and to implement it on a CUDA-enabled GPU. Our algorithm runs in O(n/w + mn/dw + nL/p + mnl/p) time units on the HMM with d streaming processors, memory bandwidth w, global memory access latency L, and shared memory access latency l. Further, we implement our algorithm on the GeForce GTX 580 GPU and evaluate the performance. The experimental results show that the ASM of two strings of 1024 and 4M (= 2^22) characters can be computed in 419.6 ms, while the sequential algorithm computes it in 27720 ms. Thus, our implementation on the GPU attains a speedup factor of 66.1 over the single-CPU implementation.
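For reference, the sequential dynamic programming that underlies ASM is the classic Sellers recurrence; the sketch below assumes unit edit costs (the abstract does not fix the cost model) and returns only the best score, not the matching position.

```cuda
#include <string>
#include <vector>
#include <algorithm>

// d[i][j] = edit distance between X[0..i) and the best-matching
// substring of Y ending at position j.  Row 0 is all zeros so a match
// may start anywhere in Y; the answer is the minimum of the last row.
int asmScore(const std::string &X, const std::string &Y) {
    int m = (int)X.size(), n = (int)Y.size();
    std::vector<int> prev(n + 1, 0), cur(n + 1);
    for (int i = 1; i <= m; ++i) {
        cur[0] = i;
        for (int j = 1; j <= n; ++j)
            cur[j] = std::min({ prev[j] + 1,                         // delete X[i-1]
                                cur[j - 1] + 1,                      // insert Y[j-1]
                                prev[j - 1] + (X[i-1] != Y[j-1]) }); // substitute
        std::swap(prev, cur);
    }
    return *std::min_element(prev.begin(), prev.end());
}
```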

22 citations


Proceedings ArticleDOI
26 Sep 2013
TL;DR: This paper presents an FPGA implementation of Support Vector Machine (SVM) classification using the DSP slices and block RAMs in the Xilinx Virtex-6 family FPGA.
Abstract: This paper presents an FPGA implementation of Support Vector Machine (SVM) classification using the DSP slices and block RAMs in the Xilinx Virtex-6 family FPGA. In our approach, the SVM classification is performed by multiple DSP slices. Our implementation supports 3 types of kernel functions: the sigmoid kernel, the polynomial kernel, and the RBF kernel. We connect DSPs with the built-in cascade logic in a DSP slice. Thus, our architecture consists of a cascaded DSP pipeline and processes the input data with this pipeline. The number of DSP slices included in this cascade connection is equal to the number of support vectors in the SVM. We have implemented the processor core, which includes 768 DSPs for SVM classification, in a Xilinx Virtex-6 FPGA XC6VLX240T-FF1156. The implementation results show that it can be implemented in the FPGA with 768 DSP48E1 slices, 800 block RAMs and 17680 slices. It runs at a 370.096 MHz clock frequency and can evaluate the SVM classification for 128-dimensional feature space data 2.89 × 10^6 times per second.
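The computation each cascade stage performs is one kernel-function term of the SVM decision function. The host-side sketch below shows that reference computation; the kernel parameters (gamma, coef0, degree) are placeholders, not values from the paper.

```cuda
#include <cmath>

enum Kernel { SIGMOID, POLY, RBF };

// One kernel term per support vector, accumulated along the cascade;
// in the FPGA each term corresponds to one DSP slice in the pipeline.
float svmDecision(const float *sv, const float *alpha, int nSV,
                  const float *x, int dim, float bias, Kernel k) {
    float acc = bias;
    for (int s = 0; s < nSV; ++s) {
        float dot = 0.0f, dist2 = 0.0f;
        for (int d = 0; d < dim; ++d) {        // e.g. 128-dimensional data
            float diff = sv[s * dim + d] - x[d];
            dot   += sv[s * dim + d] * x[d];
            dist2 += diff * diff;
        }
        float kv = (k == SIGMOID) ? tanhf(0.5f * dot + 1.0f)       // placeholder params
                 : (k == POLY)    ? powf(0.5f * dot + 1.0f, 3.0f)
                 :                  expf(-0.5f * dist2);           // RBF
        acc += alpha[s] * kv;
    }
    return acc;                                 // sign gives the class
}
```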

21 citations


Proceedings ArticleDOI
04 Dec 2013
TL;DR: A new technique to generate an ASCII art that reproduces the original tone and the details of an input gray-scale image is proposed, inspired by the local exhaustive search to optimize binary images for printing based on the characteristic of the human visual system.
Abstract: An ASCII art is a matrix of characters that reproduces an original gray-scale image. It is commonly used to represent pseudo gray-scale images in text-based messages. Since automatic generation of high-quality ASCII art images is very hard, they are usually produced by hand. The main contribution of this paper is to propose a new technique to generate an ASCII art that reproduces the original tone and the details of an input gray-scale image. Our new technique is inspired by the local exhaustive search used to optimize binary images for printing, based on the characteristics of the human visual system. Although it can generate high-quality ASCII art images, a lot of computing time is necessary for the local exhaustive search. Hence, we have implemented our new technique on a GPU to accelerate the computation. The experimental results show that the GPU implementation can achieve a speedup factor of up to 57.1 over the conventional CPU implementation.
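A minimal host-side sketch of one pass of the local exhaustive search, under the assumption that an error() callback scores a candidate character against the Gaussian-filtered target image (the filtering that models the human visual system); the function names and the printable-ASCII character set are illustrative.

```cuda
// One greedy pass: for every character cell, try all candidate
// characters and keep the one minimizing the perceptual error.
void localSearchPass(int *art, int rows, int cols, const float *target,
                     float (*error)(int ch, int r, int c,
                                    const float *target)) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int best = art[r * cols + c];
            float bestErr = error(best, r, c, target);
            for (int ch = 32; ch < 127; ++ch) {   // printable ASCII
                float e = error(ch, r, c, target);
                if (e < bestErr) { bestErr = e; best = ch; }
            }
            art[r * cols + c] = best;
        }
    // Passes repeat until no cell changes; the per-cell trials are
    // independent, which is what the GPU version exploits.
}
```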

21 citations


Proceedings ArticleDOI
01 Oct 2013
TL;DR: The experimental results of this paper provide a good example of GPU computation showing that a complicated but ingenious implementation with a larger constant factor in computing time can outperform a much simpler conventional algorithm.
Abstract: The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. The offline permutation is a task to copy numbers stored in an array a of size n to an array b of the same size along a permutation P given in advance. A conventional algorithm can complete the offline permutation by executing b[p[i]] ← a[i] for all i in parallel, where an array p stores the permutation P. This conventional algorithm simply performs three rounds of memory access: reading from a, reading from p, and writing to b. The main contribution of this paper is to present an optimal offline permutation algorithm running in O(n/w + L) time units using n threads on the HMM with width w and latency L. We also implement our optimal offline permutation algorithm on the GeForce GTX-680 GPU and evaluate the performance. Quite surprisingly, our optimal offline permutation algorithm achieves better performance than the conventional algorithm for most permutations, although it performs 32 rounds of memory access. For example, the bit-reversal permutation of 4M float (32-bit) numbers can be completed in 780ms by our optimal permutation algorithm, while the conventional algorithm takes 2328ms. We can say that the experimental results of this paper provide a good example of GPU computation showing that a complicated but ingenious implementation with a larger constant factor in computing time can outperform a much simpler conventional algorithm.
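The conventional algorithm the paper compares against is literally a one-line kernel; a sketch:

```cuda
// b[p[i]] <- a[i] for all i in parallel: three global-memory rounds
// (read a, read p, write b).  Whether the scattered write coalesces
// depends entirely on the permutation p.
__global__ void conventionalPermute(const float *a, float *b,
                                    const int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[p[i]] = a[i];
}
```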

17 citations


Journal ArticleDOI
TL;DR: The implementation using two new ideas to accelerate the dynamic programming solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, while a conventional CPU implementation runs in 1939.02 seconds.
Abstract: This paper presents a GPU (Graphics Processing Units) implementation of dynamic programming for the optimal polygon triangulation. Recently, GPUs can be used for general-purpose parallel computation. Users can develop parallel programs running on GPUs using the programming architecture called CUDA (Compute Unified Device Architecture) provided by NVIDIA. The optimal polygon triangulation problem for a convex polygon is an optimization problem to find a triangulation with minimum total weight. It is known that this problem for a convex n-gon can be solved using the dynamic programming technique in O(n^3) time using a work space of size O(n^2). In this paper, we propose an efficient parallel implementation of this O(n^3)-time algorithm on the GPU. In our implementation, we have used two new ideas to accelerate the dynamic programming. The first idea (adaptive granularity) is to partition the dynamic programming algorithm into many sequential kernel calls of CUDA, and to select the best parameters for the size and the number of blocks for each kernel call. The second idea (sliding and mirroring arrangements) is to arrange the working data for coalesced access of the global memory in the GPU to minimize the memory access overhead. Our implementation using these two ideas solves the optimal polygon triangulation problem for a convex 8192-gon in 5.57 seconds on the NVIDIA GeForce GTX 680, while a conventional CPU implementation runs in 1939.02 seconds. Thus, our GPU implementation attains a speedup factor of 348.02.
Key words: dynamic programming, parallel algorithms, coalesced memory access, GPGPU, CUDA
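The O(n^3) dynamic programming being accelerated is the matrix-chain-style recurrence below, evaluated one diagonal at a time (which is also the natural unit for the per-diagonal kernel calls); the triangle weight function w(i, k, j) is a placeholder, since the abstract does not fix it.

```cuda
#include <vector>
#include <algorithm>
#include <cfloat>

// m[i][j] = minimum total weight of triangulating the sub-polygon on
// vertices v_i..v_j; each diagonal of the table depends only on the
// previous ones, so diagonals can be computed in parallel.
double optimalTriangulation(int n, double (*w)(int i, int k, int j)) {
    std::vector<std::vector<double>> m(n, std::vector<double>(n, 0.0));
    for (int len = 2; len < n; ++len)
        for (int i = 0; i + len < n; ++i) {
            int j = i + len;
            m[i][j] = DBL_MAX;
            for (int k = i + 1; k < j; ++k)       // split vertex
                m[i][j] = std::min(m[i][j],
                                   m[i][k] + m[k][j] + w(i, k, j));
        }
    return m[0][n - 1];
}
```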

17 citations


Journal ArticleDOI
Koji Nakano1
TL;DR: This paper shows optimal parallel algorithms to compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM).
Abstract: The main contribution of this paper is to show optimal parallel algorithms to compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified Memory Machine (UMM). The DMM and the UMM are theoretical parallel computing models that capture the essence of the shared memory and the global memory of GPUs. These models have three parameters: the number p of threads, the width w of the memory, and the memory access latency l. We first show that the sum of n numbers can be computed in O(n/w + nl/p + l log n) time units on the DMM and the UMM. We then go on to show that Ω(n/w + nl/p + l log n) time units are necessary to compute the sum. We also present a parallel algorithm that computes the prefix-sums of n numbers in O(n/w + nl/p + l log n) time units on the DMM and the UMM. Finally, we show that the summed area table of size √n × √n can be computed in O(n/w + nl/p + l log n) time units on the DMM and the UMM. Since the computation of the prefix-sums and the summed area table is at least as hard as the sum computation, these parallel algorithms are also optimal.
Key words: memory machine models, prefix-sums computation, parallel algorithm, GPU, CUDA
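As a point of reference for the prefix-sums result, here is the textbook single-block inclusive scan (Hillis–Steele) in CUDA; the paper's algorithm is a multi-stage variant tuned to the DMM/UMM cost model, so this shows only the underlying doubling pattern, with n <= blockDim.x assumed.

```cuda
// Launch with dynamic shared memory of n * sizeof(float).
__global__ void inclusiveScan(float *x, int n) {
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    if (tid < n) buf[tid] = x[tid];
    __syncthreads();
    for (int d = 1; d < n; d <<= 1) {          // ceil(log2 n) doubling steps
        float v = (tid >= d && tid < n) ? buf[tid - d] : 0.0f;
        __syncthreads();                        // read old values before writing
        if (tid < n) buf[tid] += v;
        __syncthreads();
    }
    if (tid < n) x[tid] = buf[tid];
}
```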

17 citations


Proceedings ArticleDOI
Koji Nakano1
04 Dec 2013
TL;DR: This paper shows that the dynamic programming to solve the optimal polygon triangulation problem can be implemented on the UMM using the sequential memory access, and proves that any implementation of the dynamic programming needs Ω(n^3/w + n^3 l/p + nl) time units.
Abstract: The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of the global memory access of GPUs. Although it is a good theoretical model for GPU computing, the performance analysis of parallel algorithms on it is sometimes complicated. The main contribution of this paper is to provide a useful gadget, the sequential memory access, that makes the computing time evaluation easy, and to show its application to dynamic programming. The sequential memory access has two parameters: length n and fragmentation f. We first show that the sequential memory access of length n with fragmentation f can be done in O(n/w + nl/p + l + f) time units using p threads on the UMM with width w and latency l. We next show that the dynamic programming to solve the optimal polygon triangulation problem can be implemented on the UMM using the sequential memory access. The resulting implementation for a convex n-gon runs in O(n^3/w + n^3 l/p + nl) time units using p threads on the UMM with width w and latency l. We also prove that any implementation of the dynamic programming needs Ω(n^3/w + n^3 l/p + nl) time units. Thus, our implementation is time optimal.

16 citations


Proceedings ArticleDOI
20 May 2013
TL;DR: A new FPGA architecture for the Hough transform that identifies straight lines in a binary image, using 178 DSP48E1 slices and 180 18Kbit block RAMs that work in parallel; the resulting computing time is close to optimal.
Abstract: The main contribution of this paper is to present a new FPGA architecture for the Hough transform that identifies straight lines in a binary image. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 family FPGAs have the DSP48E1 slice, a configurable logic block equipped with fast multipliers, adders, pipeline registers, and so on. They also have a dual-port memory with 18Kbits as a block RAM. One of the most important key techniques for accelerating computation using FPGAs is an efficient usage of DSP slices and block RAMs. Our new architecture for the Hough transform uses 178 DSP48E1 slices and 180 block RAMs with 18Kbits that work in parallel. As far as we know, there is no previously published work that fully utilizes DSP slices and block RAMs for the Hough transform. Roughly speaking, a conventional sequential implementation performs 180m voting operations for m edge points. Our architecture performs voting operations in parallel, and outputs the identified straight lines in m + 97 clock cycles. Since 180m voting operations are performed using 178 DSP48E1 slices, the lower bound of the computing time is m clock cycles. Hence our implementation is close to optimal. The implementation results show that the Hough transform for a 512×512 image with 33232 edge points can be done in only 135.75 µs.
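The 180m-operation baseline mentioned above is the following voting loop; the accumulator dimensions and rho offset are illustrative, not the circuit's exact tables.

```cuda
#include <cmath>

// Every edge point casts one vote per angle (180 angles), so m edge
// points cost 180m sequential voting operations -- the work the 178
// DSP slices perform in parallel.
void houghVote(const int *ex, const int *ey, int m,
               int vote[180][1024], int rhoOffset) {
    const double PI = 3.14159265358979323846;
    for (int i = 0; i < m; ++i)
        for (int theta = 0; theta < 180; ++theta) {
            double rad = theta * PI / 180.0;
            int rho = (int)lround(ex[i] * cos(rad) + ey[i] * sin(rad));
            ++vote[theta][rho + rhoOffset];     // one block-RAM update
        }
}
```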

15 citations


Journal ArticleDOI
TL;DR: In this article, a field programmable gate array (FPGA) implementation of a three-layer perceptron using the few DSP blocks and few block RAMs (FDFM) approach, implemented in the Xilinx Virtex-6 family FPGA, is presented.
Abstract: This paper presents a field programmable gate array (FPGA) implementation of a three-layer perceptron using the few DSP blocks and few block RAMs (FDFM) approach, implemented in the Xilinx Virtex-6 family FPGA. In the FDFM approach, multiple processor cores with few DSP slices and few block RAMs are used. We have implemented 150 processor cores for perceptrons in a Xilinx Virtex-6 family FPGA XC6VLX240T-FF1156. The implementation results show that the 150 processor cores for 32-32-32 input–hidden–output layer perceptrons can be implemented in the FPGA using 150 DSP48 slices, 185 block RAMs and 9676 slices. It runs at a 242.89 MHz clock frequency, and a single evaluation by the 150 perceptron cores can be performed 1.65 × 10^7 times per second.
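The computation each FDFM core time-multiplexes onto its single DSP slice is the forward pass of a 32-32-32 perceptron; a plain sketch, with the weight layout and sigmoid activation assumed (the abstract does not state the activation function):

```cuda
#include <cmath>

void mlp32(const float in[32], const float w1[32][32],
           const float w2[32][32], float out[32]) {
    float hidden[32];
    for (int h = 0; h < 32; ++h) {             // input -> hidden layer
        float s = 0.0f;
        for (int i = 0; i < 32; ++i) s += w1[h][i] * in[i];
        hidden[h] = 1.0f / (1.0f + expf(-s));  // sigmoid (assumed)
    }
    for (int o = 0; o < 32; ++o) {             // hidden -> output layer
        float s = 0.0f;
        for (int h = 0; h < 32; ++h) s += w2[o][h] * hidden[h];
        out[o] = 1.0f / (1.0f + expf(-s));
    }
}
```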

15 citations


Proceedings ArticleDOI
04 Dec 2013
TL;DR: An FPGA implementation of template matching using DSP slices is proposed, based on a pixel rearrangement technique: a coarse-to-fine technique that can always find a template image in a base image whenever the template image is actually included in the base image.
Abstract: The main contribution of this paper is to propose an FPGA implementation of template matching using DSP slices. Template matching is a technique for finding the small parts of an image that match a template image. In our approach, we use a pixel rearrangement technique, which is a coarse-to-fine technique. Unlike ordinary coarse-to-fine techniques, it can always find a template image in a base image if the template image is included in the base image. In our implementation, we use multiple matching modules that compute similarity and work in parallel. In each matching module, we efficiently use the embedded DSP slices on the Virtex-6 FPGA. We have implemented the template matching in a Xilinx Virtex-6 FPGA XC6VLX240T-FF1156. The implementation results show that it can be implemented in the FPGA with 352 DSP slices, 3 block RAMs and 455 CLBs. It runs at approximately 280 MHz clock frequency. The computing time of our FPGA implementation is 348.88 and 3.66 times faster than that of CPU and GPU implementations, respectively.
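For orientation, the similarity computed per candidate window is sketched below as a sum of absolute differences; SAD is an assumption, since the abstract does not name the similarity measure, and the exhaustive scan stands in for the pixel-rearrangement coarse-to-fine order.

```cuda
#include <climits>
#include <cstdlib>

// Scan every window position of the base image and keep the position
// whose window is most similar to the template.
void templateMatchSAD(const unsigned char *base, int bw, int bh,
                      const unsigned char *tmpl, int tw, int th,
                      int *bestX, int *bestY) {
    long best = LONG_MAX;
    for (int y = 0; y + th <= bh; ++y)
        for (int x = 0; x + tw <= bw; ++x) {
            long sad = 0;
            for (int v = 0; v < th; ++v)
                for (int u = 0; u < tw; ++u)
                    sad += labs((long)base[(y + v) * bw + (x + u)]
                                - (long)tmpl[v * tw + u]);
            if (sad < best) { best = sad; *bestX = x; *bestY = y; }
        }
}
```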

Journal ArticleDOI
TL;DR: The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU that runs in 167 ns for any permutation, including the random permutation and the worst permutation, although it performs more memory accesses.
Abstract: The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of the shared memory access of GPUs. Bank conflicts should be avoided to maximize the bandwidth of the shared memory access. Offline permutation of an array is a task to copy all elements in array a into array b along a permutation given in advance. The main contribution of this paper is to implement a conflict-free permutation algorithm on the DMM in a GPU. We have also implemented straightforward permutation algorithms on the GPU. The experimental results for 1024 double (64-bit) numbers on the NVIDIA GeForce GTX-680 show that the straightforward permutation algorithm takes 247.8 ns for the random permutation and 1684 ns for the worst permutation, which involves the maximum bank conflicts. Our conflict-free permutation algorithm runs in 167 ns for any permutation, including the random permutation and the worst permutation, although it performs more memory accesses. It follows that our conflict-free permutation is 1.48 times faster for the random permutation and 10.0 times faster for the worst permutation.
Key words: memory machine models, data movement, bank conflict, shared memory, GPU, CUDA
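The cost model behind "bank conflict" here is easy to state in code: on the DMM, address a belongs to bank a mod w, and one access step costs as many shared-memory cycles as the most-loaded bank receives requests. A small helper that evaluates that congestion for a warp's request set (w <= 32 assumed):

```cuda
// Returns the congestion of one access step: 1 means conflict-free,
// w means all requests hit the same bank.
int congestion(const int *addr, int w) {
    int count[32] = {0};           // one counter per bank, w <= 32
    int worst = 0;
    for (int t = 0; t < w; ++t) {
        int bank = addr[t] % w;    // DMM bank mapping
        if (++count[bank] > worst) worst = count[bank];
    }
    return worst;
}
```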

Proceedings ArticleDOI
26 Sep 2013
TL;DR: A new FPGA architecture for the Hough transform for all the pixel data input in raster scan order is presented, using 90 DSP48E1 slices and 181 block RAMs with 18Kbits that work in parallel.
Abstract: Since FPGA chips are relatively inexpensive and reprogrammable, they are widely used in fields that need to update architectures or functions frequently, such as communication and education. In particular, FPGAs are promising devices for mobile applications, which increasingly require computation such as real-time image processing. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 family FPGAs have the DSP48E1 slice, a configurable logic block equipped with fast multipliers, adders, pipeline registers, and so on. They also have a dual-port memory with 18Kbits as a block RAM. Therefore, one of the most important key techniques for accelerating computation using such FPGAs is an efficient usage of DSP slices and block RAMs. The main contribution of this paper is to present a new FPGA architecture for the Hough transform for all the pixel data input in raster scan order. The architecture uses 90 DSP48E1 slices and 181 block RAMs with 18Kbits that work in parallel. The experimental results show that this implementation runs at 247.525 MHz and, given a binary image of size n × n, our circuit can perform the transform in n^2 + √2 n + 379 clock cycles.

20 Jan 2013
TL;DR: A new FPGA architecture is presented for the Hough transform that identifies straight lines in a binary image using DSP blocks and 18Kbit block RAMs that work in parallel.
Abstract: Since FPGA chips are relatively inexpensive and reprogrammable, they are widely used in fields that need to update architectures or functions frequently, such as communication and education. In particular, FPGAs are promising devices for mobile applications, which increasingly require computation such as real-time image processing. The main contribution of this paper is to present a new FPGA architecture for the Hough transform that identifies straight lines in a binary image. Recent FPGAs have hundreds of embedded DSP blocks and block RAMs. For example, Xilinx Virtex-6 family FPGAs have the DSP48E1 block, a configurable logic block equipped with fast multipliers, adders, pipeline registers, and so on. They also have a dual-port memory with 18Kbits as a block RAM. One of the most important key techniques for accelerating computation using FPGAs is an efficient usage of DSP blocks and block RAMs. Our new architecture for the Hough transform uses 178 DSP48E1 blocks and 180 block RAMs with 18Kbits that work in parallel. As far as we know, there is no previously published work that fully utilizes DSP blocks and block RAMs for the Hough transform. Roughly speaking, a conventional sequential implementation performs 180m voting operations for m edge points. Our architecture performs voting operations in parallel, and outputs the identified straight lines in m + 97 clock cycles. Since 180m voting operations are performed using 178 DSP48E1 blocks, the lower bound of the computing time is m clock cycles. Hence our implementation is close to optimal. The implementation results show that the Hough transform for a 512×512 image with 33232 edge points can be done in only 135.75 µs.

Proceedings ArticleDOI
04 Dec 2013
TL;DR: This paper shows that the memory access congestion is expected O(log w/log log w) for any memory access requests including malicious ones by a warp of w threads, and applies the random address shift technique to matrix transpose algorithms.
Abstract: The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access of the streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and w threads in a warp try to access them at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, to develop efficient algorithms, it is very important to reduce the memory access congestion, the maximum number of memory access requests destined for the same bank. The memory access congestion takes a value between 1 and w. The main contribution of this paper is to present a novel algorithmic technique called the random address shift that reduces the memory access congestion. We show that the memory access congestion is expected O(log w/log log w) for any memory access requests, including malicious ones, by a warp of w threads. The simulation results show that the expected congestion for w = 32 threads is only 3.436. Since malicious memory access requests destined for the same bank have congestion 32, our random address shift technique substantially reduces the memory access congestion. We have applied the random address shift technique to matrix transpose algorithms. The experimental results on the GeForce GTX Titan show that the random address shift technique is practical and can accelerate the straightforward matrix transpose algorithms by a factor of 5.
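As we read the technique, the random address shift remaps word a of logical row a/w to a column offset by a random value fixed per row, so that adversarial access patterns are spread across banks; the exact mapping in the paper may differ, and this sketch only illustrates the idea.

```cuda
// shift[] holds one pre-generated random value per row of w words.
__host__ __device__ inline int shiftedAddress(int a, int w,
                                              const int *shift) {
    int row = a / w;
    int col = (a + shift[row]) % w;   // randomized bank within the row
    return row * w + col;
}
```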

Journal ArticleDOI
TL;DR: In this paper, a GPU implementation of computing Euclidean distance map EDM with efficient memory access has been presented, where the main access from/to the global memory enables to be performed by coalesced access.
Abstract: Recent graphics processing units GPUs, which have many processing units, can be used for general purpose parallel computation. To utilise the powerful computing ability, GPUs are widely used for general purpose processing. Since GPUs have very high memory bandwidth, the performance of GPUs greatly depends on memory access. The main contribution of this paper is to present a GPU implementation of computing Euclidean distance map EDM with efficient memory access. Given a two-dimensional 2D binary image, EDM is a 2D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we have considered many programming issues of the GPU system such as coalesced access of global memory and shared memory bank conflicts, and so on. To be concrete, by transposing 2D arrays, which are temporal data stored in the global memory, with the shared memory, the main access from/to the global memory enables to be performed by coalesced access. In practice, we have implemented our parallel algorithm in the following three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results have shown that, for an input binary image with size of 9216 × 9216, our implementation can achieve a speedup factor of 54 over the sequential algorithm implementation.
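The transpose-for-coalescing idea in the middle of the abstract is the classic tiled shared-memory transpose; a minimal sketch (the tile size and padding are standard choices, not values taken from the paper):

```cuda
#define TILE 32

// Both the read of 'in' and the write of 'out' touch consecutive
// global addresses, so each is coalesced; the +1 padding avoids
// shared-memory bank conflicts on the transposed read.
__global__ void transpose(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    int tx = blockIdx.y * TILE + threadIdx.x;   // transposed block origin
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}
```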

Proceedings ArticleDOI
01 Dec 2013
TL;DR: This paper presents the Random Super Discrete Memory Machine (RSDMM), an extension of the DMM that supports a super warp of multiple warps and applies the random address shift technique, yielding a novel and practical parallel computing model in which the congestion is small for any memory access requests.
Abstract: The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access by a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and each warp of w threads accesses the shared memory at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, to develop efficient algorithms, it is very important to reduce the memory access congestion, the maximum number of memory access requests destined for the same bank. However, it is not easy to minimize the memory access congestion for some problems. The main contribution of this paper is to present novel and practical parallel computing models in which the congestion is small for any memory access requests. We first present the Super Discrete Memory Machine (SDMM), an extended version of the DMM, which supports a super warp with multiple warps. Memory access requests by multiple warps in a super warp are packed through pipeline registers to reduce the memory access congestion. We then go on to apply the random address shift technique to the SDMM. The resulting machine, the Random Super Discrete Memory Machine (RSDMM), can equalize memory access requests by a super warp. Quite surprisingly, for any memory access requests by a super warp on the RSDMM, the overhead of the memory access congestion is within a constant factor of perfectly scheduled memory access. Thus, unlike on the DMM, developers of parallel algorithms do not have to consider the memory access congestion on the RSDMM. The congestion on the RSDMM is evaluated by theoretical analysis as well as by experiments.

Proceedings ArticleDOI
04 Dec 2013
TL;DR: In this article, a processor based on the FDFM (Few DSP slices and Few Memory blocks) approach is presented, which supports arithmetic operations with flexibly many bits, including addition, subtraction, and multiplication for operands with variable size longer than 64 bits.
Abstract: Some applications, such as RSA encryption/decryption, need integer arithmetic operations with many bits. However, such operations cannot be performed directly by conventional CPUs, because their instructions support integers with a fixed number of bits, say, 64 bits. Since the CPUs need to repeat arithmetic operations on numbers with fixed bits, they have considerable overhead when executing applications involving integer arithmetic with many bits. On the other hand, we can implement hardware algorithms for such applications in FPGAs for further acceleration. However, the implementation of a hardware algorithm is usually very complicated, and debugging of hardware is very hard. The main contribution of this paper is to present an intermediate approach between software and hardware using FPGAs. More specifically, we present a processor based on the FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations with flexibly many bits, and implement it in the FPGA. Arithmetic instructions of our processor architecture include addition, subtraction, and multiplication for numbers with variable size longer than 64 bits. To show the potential of our processor, we have implemented 2048-bit RSA encryption/decryption in software written in machine instructions. The resulting processor uses only one DSP48E1 slice and four block RAMs (BRAMs), and the RSA encryption software on it runs in 63565ms. It has been shown that a direct hardware implementation of RSA encryption runs in 27726ms. Although our intermediate approach is slower, it has several advantages. Since the algorithm is written in software, development and debugging are easy. Also, it is more flexible and scalable.
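The core primitive such a processor must provide is multi-limb arithmetic; a host-side sketch of variable-length addition on 64-bit limbs (the little-endian limb layout is an assumption):

```cuda
#include <cstdint>

// r = a + b over 'limbs' 64-bit words; returns the final carry.
uint64_t addWide(const uint64_t *a, const uint64_t *b,
                 uint64_t *r, int limbs) {
    uint64_t carry = 0;
    for (int i = 0; i < limbs; ++i) {      // ripple carry across limbs
        uint64_t s = a[i] + carry;
        uint64_t c1 = (s < carry);         // overflow of a[i] + carry
        r[i] = s + b[i];
        carry = c1 + (r[i] < s);           // overflow of s + b[i]
    }
    return carry;
}
```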

Proceedings ArticleDOI
04 Dec 2013
TL;DR: The main contribution of this paper is to present TinyCSE (Tiny Computer System for Education), an extension of TinyCPU supporting interrupts and peripheral controllers.
Abstract: TinyCPU is a small processor that can be implemented in various FPGAs and can be used for education and for the development of small embedded systems. TinyCPU is designed in Verilog HDL, and its source code is only 427 lines. However, it does not support interrupts or peripheral controllers. The main contribution of this paper is to present TinyCSE (Tiny Computer System for Education), an extension of TinyCPU supporting interrupts and peripheral controllers. TinyCSE has controllers for external devices including keyboard, mouse, serial communication, switch, and timer. It also supports hardware interrupts from these external devices. Quite surprisingly, the code sizes of the CPU with the interrupt controller and of the device controllers are 515 and 1339 lines of Verilog HDL, respectively. Our processor is portable, easy to understand, and not difficult to extend. As a real-life application, we have developed a time watch. This application runs at 73 MHz on the Xilinx Spartan-3AN family FPGA XC3S700AN using 832 out of 5888 slices (14.1%). Therefore, our tiny processing system benefits computer system education and small embedded system development.

01 Jan 2013
TL;DR: This paper presents a very efficient method for the random selection of next cities by a number of ants, using iterative random trials that can find next cities at low computational cost with high probability.
Abstract: Recent Graphics Processing Units (GPUs) can be used for general-purpose parallel computation. Ant Colony Optimization (ACO) approaches have been introduced as nature-inspired heuristics to find good solutions of the Traveling Salesman Problem (TSP). In ACO approaches, a number of ants traverse the cities of the TSP to find better solutions of the TSP. The ants randomly select the next visiting cities based on probabilities determined by the total amounts of their pheromone spread on routes. The main contribution of this paper is to present a sophisticated and efficient implementation of one of the ACO approaches on the GPU. In our implementation, we have considered many programming issues of the GPU architecture, including coalesced access of the global memory, shared memory bank conflicts, etc. In particular, we present a very efficient method for the random selection of next cities by a number of ants. Our new method uses iterative random trials, which can find next cities at low computational cost with high probability. This idea can be applied not only to GPU implementations but also to CPU implementations. The experimental results on the NVIDIA GeForce GTX 580 show that our implementation for 1002 cities runs in 8.71 seconds, while the CPU implementation runs in 190.05 seconds. Thus, our GPU implementation attains a speed-up factor of 22.11.
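As we read the abstract, the iterative random trial replaces prefix-sum-based roulette selection with rejection sampling: pick a random candidate city and accept it with probability proportional to its fitness. A host-side sketch under that reading (the acceptance rule and rand()-based sampling are assumptions):

```cuda
#include <cstdlib>

// Assumes at least one unvisited city and fitness values in
// (0, maxFitness]; expected trials stay small unless fitness values
// are extremely skewed.
int pickNextCity(const float *fitness, const bool *visited,
                 int nCities, float maxFitness) {
    for (;;) {
        int c = rand() % nCities;              // random candidate
        if (visited[c]) continue;
        float u = (float)rand() / (float)RAND_MAX;
        if (u * maxFitness <= fitness[c]) return c;   // accept
    }
}
```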

Journal Article
TL;DR: A new implementation of the Hough transform using the GPU is presented, which identifies straight lines in a binary edge image and attains a speed-up factor of 68 over a conventional CPU implementation.
Abstract: Recent Graphics Processing Units (GPUs) have many processing cores. To use this powerful computing ability, GPUs are widely utilized for general-purpose processing. The main contribution of this paper is to present a new implementation of the Hough transform using the GPU. The Hough transform identifies straight lines in a binary edge image. In our GPU implementation, the voting process in the Hough transform is performed for each degree in parallel. Also, the shared memory is used with bank-conflict-free access. We have implemented our parallel algorithm on the NVIDIA GeForce GTX680. The experimental results show that the Hough transform for a 512×512 image with 33232 edge points can be done in only 0.638 ms, while a conventional CPU implementation runs in 43.388 ms. Thus, our GPU implementation attains a speed-up factor of 68.

20 Jan 2013
TL;DR: The main contribution of this paper is to present an efficient implementation of a Support Vector Machine (SVM) in the FPGA that mainly uses the DSP blocks and block RAMs in the Xilinx Virtex-6 family FPGA.
Abstract: The main contribution of this paper is to present an efficient implementation of a Support Vector Machine (SVM) in the FPGA. Our implementation mainly uses the DSP blocks and block RAMs in the Xilinx Virtex-6 family FPGA. Each DSP block is used to compute the product-sum performed for an internal node and the output node of the SVM. The block RAMs are used to store the weights and interim values. They are also used to compute the sigmoid function. The experimental results show that our implementation of an SVM with 128 inputs and 760 output nodes uses 768 DSP48E1 blocks, 800 block RAMs, and 17680 slices in a Xilinx Virtex-6 FPGA XC6VLX240T-FF1156 and runs at 348.554 MHz. Also, it performs the computation for this 128-input, 760-output-node SVM 2.72 × 10^6 times per second.