Showing papers by "Koji Nakano published in 2010"


Proceedings ArticleDOI
17 Nov 2010
TL;DR: The experimental result shows that an implementation of the Canny edge detection algorithm on CUDA achieves a speedup factor of 61 over a conventional software implementation.
Abstract: Recent GPUs, which have many processing units connected to a global memory, can be used for general-purpose parallel computation. Users can develop parallel programs running on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture). The main contribution of this paper is an implementation of the Canny edge detection algorithm on CUDA. The experimental result shows that our implementation achieves a speedup factor of 61 over a conventional software implementation.

120 citations


Proceedings ArticleDOI
17 Nov 2010
TL;DR: Since the circuit uses only one DSP48E1 block and one Block RAM, the implementation is close to optimal in the sense that it has less than 3% overhead in multiplication, and no further improvement is possible as long as a Montgomery-multiplication-based algorithm is used.
Abstract: The main contribution of this paper is to present an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm supporting 2048-bit RSA encryption/decryption is designed to be implemented using one DSP48E1, one BRAM, and a few logic blocks (slices) in a Xilinx Virtex-6 family FPGA. The implementation results show that our RSA module for 2048-bit RSA encryption/decryption runs in 277.26 ms. Quite surprisingly, the multiplier in the DSP48E1 used to compute Montgomery multiplication is active in more than 97% of all clock cycles. Hence, our implementation is close to optimal in the sense that it has less than 3% overhead in multiplication, and no further improvement is possible as long as a Montgomery-multiplication-based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one Block RAM, we can implement a number of RSA modules in one FPGA that work in parallel to attain high-throughput RSA encryption/decryption.

30 citations
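
For readers unfamiliar with the Montgomery multiplication that the RSA entry above builds on, the following is a minimal software sketch in Python. It is not the paper's hardware algorithm: the function names (`montgomery_setup`, `redc`, `mont_mul`, `mod_exp`), the choice of R as a power of two just above the modulus, and the left-to-right square-and-multiply loop are all illustrative assumptions. It needs Python 3.8 or later for the modular inverse via `pow(n, -1, r)`.

```python
def montgomery_setup(n: int, k: int):
    """Precompute constants for an odd modulus n with R = 2**k > n (assumed setup)."""
    assert n % 2 == 1 and n < (1 << k)
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)
    r2 = (r * r) % n                 # R^2 mod n, used to enter Montgomery form
    return n_prime, r2


def redc(t: int, n: int, k: int, n_prime: int) -> int:
    """Montgomery reduction: return t * R^(-1) mod n for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u


def mont_mul(a_bar: int, b_bar: int, n: int, k: int, n_prime: int) -> int:
    """Multiply two values that are already in Montgomery form."""
    return redc(a_bar * b_bar, n, k, n_prime)


def mod_exp(base: int, exp: int, n: int) -> int:
    """Compute base**exp mod n (n odd) with square-and-multiply on Montgomery forms."""
    k = n.bit_length()
    n_prime, r2 = montgomery_setup(n, k)
    x_bar = mont_mul(base % n, r2, n, k, n_prime)   # base in Montgomery form
    acc = redc(r2, n, k, n_prime)                   # R mod n, i.e. 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x_bar, n, k, n_prime)
    return redc(acc, n, k, n_prime)                 # leave Montgomery form
```

The point of the technique is that `redc` replaces division by `n` with a mask and a shift, so the inner loop consists almost entirely of multiplications, which is consistent with the abstract's remark that the multiplier is busy in most clock cycles.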


Proceedings ArticleDOI
Duhu Man1, Kenji Uda1, Hironobu Ueyama1, Yasuaki Ito1, Koji Nakano1 
17 Nov 2010
TL;DR: The main contribution of this paper is to develop a simple parallel algorithm for the EDM and implement it in two parallel platforms: multicore processors and a Graphics Processing Unit (GPU).
Abstract: Given a 2-D binary image of size $n \times n$, the Euclidean Distance Map (EDM) is a 2-D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. It is known that a sequential algorithm can compute the EDM in $O(n^2)$ time, and thus this algorithm is optimal. Also, work-time optimal parallel algorithms for the shared memory model have been presented. However, these algorithms are too complicated to implement on existing shared memory parallel machines. The main contribution of this paper is to develop a simple parallel algorithm for the EDM and implement it on two parallel platforms: multicore processors and a Graphics Processing Unit (GPU). More specifically, we have implemented our parallel algorithm on a Linux server with four Intel six-core processors (Intel Xeon X7460 2.66GHz). We have also implemented it on a modern GPU system, the Tesla C1060. The experimental results show that, for an input binary image of size $10000 \times 10000$, our implementation on the multicore system achieves a speedup factor of 18 over a sequential algorithm using a single processor in the same system. Meanwhile, for the same input binary image, our implementation on the GPU achieves a speedup factor of 5 over the sequential implementation.

23 citations
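
To make the EDM definition in the entry above concrete, here is a naive Python sketch. It is not the paper's parallel algorithm and it is far from the optimal $O(n^2)$ bound: it first computes, per column, the vertical distance to the nearest black pixel, then scans each row over all columns. The function name `euclidean_distance_map` and the convention that a truthy value marks a black pixel are assumptions made for illustration.

```python
import math


def euclidean_distance_map(image):
    """Return the EDM of a binary image (truthy = black pixel, an assumed convention)."""
    rows, cols = len(image), len(image[0])
    inf = float("inf")

    # Phase 1: vertical distance to the nearest black pixel in the same column.
    vert = [[inf] * cols for _ in range(rows)]
    for j in range(cols):
        d = inf
        for i in range(rows):                 # sweep top to bottom
            d = 0 if image[i][j] else d + 1
            vert[i][j] = d
        d = inf
        for i in range(rows - 1, -1, -1):     # sweep bottom to top
            d = 0 if image[i][j] else d + 1
            vert[i][j] = min(vert[i][j], d)

    # Phase 2: for each pixel, combine horizontal offsets with the column distances.
    edm = [[inf] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            best = inf
            for c in range(cols):
                v = vert[i][c]
                if v != inf:
                    best = min(best, (j - c) ** 2 + v * v)
            edm[i][j] = math.sqrt(best)
    return edm
```

In this sketch every column in phase 1 and every row in phase 2 is processed independently of the others, which is the kind of structure that maps naturally onto multicore threads or GPU blocks. An image with no black pixels yields infinity everywhere.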


Journal ArticleDOI
TL;DR: This paper presents a low-latency hardware connected component labeling algorithm for k-concave binary images designed and implemented in FPGA and shows that for a 10-concave binary image of 2048 × 2048, the algorithm runs in approximately 70 ms and its latency is approximately 750 µs.
Abstract: Connected component labeling is a process that assigns unique labels to the connected components of a binary image. The main contribution of this paper is to present a low-latency hardware connected component labeling algorithm for k-concave binary images designed and implemented in FPGA. Pixels of a binary image are given to the FPGA in raster order, and the resulting labels are also output in the same order. The advantages of our labeling algorithm are its low latency and its small use of the FPGA's internal storage. We have implemented our hardware labeling algorithm in an Altera Stratix Family FPGA and evaluated its performance. The implementation result shows that for a 10-concave binary image of 2048 × 2048, our connected component labeling algorithm runs in approximately 70 ms and its latency is approximately 750 µs.

20 citations
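
As background for the labeling entry above, the following Python sketch shows the classical two-pass, union-find approach to connected component labeling with raster-order input. It is a textbook software technique, not the paper's low-latency hardware algorithm, and it ignores the k-concave restriction. The function name `label_components`, the use of 4-connectivity, and the convention that a truthy value is a foreground pixel are assumptions.

```python
def label_components(image):
    """Label 4-connected foreground components (truthy pixels); 0 means background."""
    rows, cols = len(image), len(image[0])
    parent = [0]  # parent[x] is the union-find parent of label x; index 0 is unused

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    labels = [[0] * cols for _ in range(rows)]
    # Pass 1: scan in raster order, assign provisional labels, record equivalences.
    for i in range(rows):
        for j in range(cols):
            if not image[i][j]:
                continue
            up = labels[i - 1][j] if i > 0 else 0
            left = labels[i][j - 1] if j > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))        # new label is its own root
                labels[i][j] = len(parent) - 1
            elif up and left:
                root_up, root_left = find(up), find(left)
                root = min(root_up, root_left)
                parent[max(root_up, root_left)] = root
                labels[i][j] = root
            else:
                labels[i][j] = up or left

    # Pass 2: replace every provisional label by its representative.
    for i in range(rows):
        for j in range(cols):
            if labels[i][j]:
                labels[i][j] = find(labels[i][j])
    return labels
```

Note that this sketch can only emit final labels after the whole image has been read, because a later pixel may merge two earlier provisional labels. That delay is the latency problem the abstract refers to.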


Journal ArticleDOI
TL;DR: The main contribution of this work is to propose a Medium Access Control (MAC) scheme which aims to lessen the effects of deafness and hidden terminal problems in directional communications without precluding spatial reuse.
Abstract: It is known that wireless ad hoc networks employing omnidirectional communications suffer from poor network throughput due to inefficient spatial reuse. Although the use of directional communications is expected to provide significant improvements in this regard, the lack of efficient mechanisms to deal with deafness and hidden terminal problems makes it difficult to fully explore its benefits. The main contribution of this work is to propose a Medium Access Control (MAC) scheme which aims to lessen the effects of deafness and hidden terminal problems in directional communications without precluding spatial reuse. The simulation results have shown that the proposed directional MAC provides significant throughput improvement over both the IEEE802.11DCF MAC protocol and other prominent directional MAC protocols in both linear and grid topologies.

8 citations


Proceedings ArticleDOI
18 Dec 2010
TL;DR: An experiential approach for teaching masters-level advanced computer architecture with the assistance of hands-on laboratory sessions is described, leading students to implement performance-enhancing additions to a simple stack-based CPU called Tiny CPU originally designed by Nakano of Hiroshima University.
Abstract: In many universities, computer architecture is taught using traditional textbook-based methods. However, it is not easy for students to understand how computers work through lecture-style courses alone. This paper describes an experiential approach for teaching masters-level advanced computer architecture with the assistance of hands-on laboratory sessions, leading students to implement performance-enhancing additions to a simple stack-based CPU called Tiny CPU, originally designed by Nakano of Hiroshima University. From the teaching experience at Nanyang Technological University, analysed in this paper, students manage to quickly grasp the concepts of CPU operation, rapidly investigate the effects of adjusting CPU structure on program execution, and learn the skill set necessary to enable them to build and improve custom processors later in their careers.

5 citations


Proceedings ArticleDOI
19 Apr 2010
TL;DR: An efficient implementation of a coprocessor that performs the exhaustive search to verify the Collatz conjecture using DSP48E blocks of a Xilinx Virtex-5 FPGA, each of which contains one multiplier and one adder, is presented.
Abstract: Consider the following operation on an arbitrary positive number: if the number is even, divide it by two, and if the number is odd, triple it and add one. The Collatz conjecture asserts that, starting from any positive number m, repeated iteration of this operation eventually produces the value 1. The main contribution of this paper is to present an efficient implementation of a coprocessor that performs the exhaustive search to verify the Collatz conjecture using DSP48E blocks of a Xilinx Virtex-5 FPGA, each of which contains one multiplier and one adder. The experimental results show that our coprocessor can verify $3.88 \times 10^8$ 64-bit numbers per second.

3 citations
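
The operation described in the Collatz entry above, and the idea of an exhaustive search over a range, can be sketched in a few lines of Python. This is only an illustration of the definition, not the coprocessor design. The names `collatz_steps` and `verify_range`, the `max_steps` cutoff, and the shortcut of stopping once a trajectory falls below its starting value are assumptions that the abstract does not mention.

```python
def collatz_steps(m: int, max_steps: int = 10_000):
    """Return the number of iterations until m reaches 1, or None if the cutoff is hit."""
    steps = 0
    while m != 1:
        if steps >= max_steps:
            return None
        m = m // 2 if m % 2 == 0 else 3 * m + 1
        steps += 1
    return steps


def verify_range(limit: int, max_steps: int = 10_000):
    """Check every start in 2..limit; return the first unresolved start, or None.

    Assumed shortcut: once the trajectory drops below its start value it has
    reached a number that was already checked, so iteration can stop there.
    """
    for start in range(2, limit + 1):
        m, steps = start, 0
        while m >= start:
            if steps >= max_steps:
                return start
            m = m // 2 if m % 2 == 0 else 3 * m + 1
            steps += 1
    return None
```

For example, `collatz_steps(6)` follows 6, 3, 10, 5, 16, 8, 4, 2, 1 and returns 8. The odd branch needs one multiplication and one addition, which matches the multiplier-plus-adder structure of the DSP blocks named in the abstract.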


Proceedings ArticleDOI
17 Nov 2010
TL;DR: An algorithm is shown that can generate a circuit with synchronous ROMs, whenever the original circuit with asynchronous ROMs satisfies this condition, and users can assume that FPGAs support asynchronous ROMs when they design their circuits.
Abstract: A Field Programmable Gate Array (FPGA) is used to embed a circuit designed by users instantly. FPGAs can be used for implementing hardware algorithms. Most FPGAs have Configurable Logic Blocks (CLBs) to implement combinational and sequential circuits and block RAMs to implement Random Access Memories (RAMs) and Read Only Memories (ROMs). Circuit design that minimizes the number of clock cycles is easy if we use asynchronous read operations. However, most RAMs and ROMs in modern FPGAs support synchronous read operations but do not support asynchronous read operations. This is one of the main difficulties users face when implementing hardware algorithms using RAMs and ROMs with synchronous read operations. The main contribution of this paper is to provide a potent method to resolve this problem. We assume that a circuit using asynchronous ROMs designed by a user is given. Our goal is to convert this circuit into an equivalent circuit with synchronous ROMs. We first clarify the condition under which a given circuit with asynchronous ROMs can be converted into a circuit without asynchronous ROMs. For this purpose, we show an algorithm that can generate a circuit with synchronous ROMs whenever the original circuit with asynchronous ROMs satisfies this condition. Using our conversion algorithm, users can assume that FPGAs support asynchronous ROMs when they design their circuits. Finally, we show that we can generate an almost equivalent circuit with synchronous ROMs by modifying the circuit even if it does not satisfy this condition.

3 citations


Journal ArticleDOI
TL;DR: The model-based halftoning is executed via Error Diffusion using the hard circular dot-overlap printer model, and the cluster-dot halftoning is achieved through feed-back error diffusion.
Abstract: The circular dot-overlap model is one of the simplest printer models and is used to predict the actual gray levels of printed images. Model-based halftoning can produce print outputs that render the gray levels of the original input image more correctly and keep more detail in the outputs. Cluster-dot halftoning is a classical method used to minimize dot gain. In this paper, the model-based halftoning is executed via Error Diffusion using the hard circular dot-overlap printer model, and the cluster-dot halftoning is achieved through feed-back error diffusion. The main contributions of this paper are: first, we modify the model-based error diffusion by changing the computing method of equivalent gray values and incorporating edge enhancement information (in the following text, this modified model-based error diffusion is entitled Our Edge-enhanced Bias-reduction Error Diffusion); second, we combine the modified model-based error diffusion with the feed-back error diffusion to give Our Edge-enhanced Bias-reduction Cluster-dot Error Diffusion. Using our new model-based error diffusion, we can obtain resulting images that reproduce the gray levels of the original input gray scale images accurately.

2 citations


Proceedings Article
01 Jan 2010
TL;DR: The experimental results show that the Error Diffusion using a new filter can generate better quality binary images than the previously published known results, and the size of the cluster can be adjusted by an additional feedback operation.
Abstract: Digital halftoning is a process to convert a continuous-tone image into a binary image with black and white dots. This process is necessary to print a continuous-tone image using printers. The Error Diffusion is one of the most popular methods of digital halftoning, because it can generate high quality output images with relatively low computing time. Binary images generated by the Error Diffusion are fine grained in the sense that they have a lot of isolated small black and white dots. However, fine binary images are not appropriate for practical printing, because isolated black and white dots may disappear by dot-gain or dot-loss. Hence cluster-dot halftoning, which generates binary images that have no isolated black and white dots, is important. It is known that the Error Diffusion with some feedback operation can generate clustered-dot binary images. However, the resulting binary images have a strong directional characteristic, which spoils the printing results. The main contribution of this paper is to present new filters of the Error Diffusion for cluster-dot halftoning with no directional characteristic. Quite surprisingly, we can generate cluster-dot binary images with no directional characteristic using our new filter. Also, the size of the cluster can be adjusted by an additional feedback operation. The experimental results show that the Error Diffusion using our new filter can generate better quality binary images than the previously published known results.

1 citation
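
The halftoning entries on this page all build on error diffusion, so a baseline may help. The Python sketch below uses the classical Floyd–Steinberg filter (weights 7/16, 3/16, 5/16, 1/16), which is a well-known standard filter and not the new filter proposed in the entry above. It has no feedback operation and therefore produces the fine-grained, isolated dots the abstract describes rather than clustered dots. The function name `floyd_steinberg`, the 0..255 input range, the threshold of 128, and the output convention (1 = white, 0 = black) are assumptions.

```python
def floyd_steinberg(gray):
    """Halftone a grayscale image (values 0..255) into a binary image (1 = white)."""
    rows, cols = len(gray), len(gray[0])
    work = [[float(v) for v in row] for row in gray]   # working copy that accumulates error
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            old = work[i][j]
            new = 255 if old >= 128 else 0             # threshold to black or white
            out[i][j] = 1 if new else 0
            err = old - new                            # quantization error to distribute
            if j + 1 < cols:
                work[i][j + 1] += err * 7 / 16         # right neighbor
            if i + 1 < rows:
                if j > 0:
                    work[i + 1][j - 1] += err * 3 / 16  # below-left
                work[i + 1][j] += err * 5 / 16          # below
                if j + 1 < cols:
                    work[i + 1][j + 1] += err / 16      # below-right
    return out
```

A different filter, in the sense used in the abstract, would correspond to changing the set of neighbors and weights in the four update lines. Error that would fall outside the image is simply dropped in this sketch.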


01 Jan 2010
TL;DR: A halftoning method that conceals a small binary image in a large binary image and can be used for watermarking as well as amusement purposes is presented.
Abstract: Halftoning is a technique used to convert a continuous-tone image into a binary image with pure black and white pixels. This technique is necessary when printing or displaying a monochrome or color image on a device with limited color levels. The main contribution of this paper is to present a halftoning method that conceals a small binary image in a large binary image. More specifically, two distinct gray scale images are given, such that the smaller one should be hidden in the larger one. Our halftoning method generates two binary images that reproduce the tone of the corresponding original gray scale images. Each pixel of the small binary image is hidden in some pixel of the large binary image through our halftoning method. The small hidden image can be seen when we pick out the pixels of the large binary image at predetermined locations, but it cannot be seen without the location information. Another contribution of this paper is to extend our halftoning method to hide a small image of any size in a correspondingly large image. The resulting images show that our halftoning method hides and recovers the original images. Hence, our halftoning technique can be used for watermarking as well as amusement purposes.

01 Jan 2010
TL;DR: In this paper, a color error diffusion method was proposed to uniformly distribute pixels of 8 combination colors, i.e., CMY, CM, CY, MY, C, M, Y, and W, obtained by combining three process colors.
Abstract: Digital color halftoning is a process to convert a continuous-tone color image into an image with a limited number of colors. This process is required to generate an image reproducing the colors, the tone, and the details of the original continuous-tone color image. An elementary color halftoning method is to apply halftoning techniques independently to each of the color planes. For example, a full color image is separated into three continuous-tone images with C (Cyan), M (Magenta), and Y (Yellow) process colors, and each of them is independently converted to a binary image. Since this method ignores the relation among color planes, three process colors are overlapped randomly and it produces noisy and poor printing results. The main contribution of this paper is to present a new color error diffusion method. The key idea of our new method is to uniformly distribute pixels of 8 combination colors CMY, CM, CY, MY, C, M, Y, and W obtained by combining three process colors C, M, and Y. Also, the brightness of the resulting images is equalized using the error diffusion. The experimental results show that our new color halftoning technique generates better quality printing results compared to the independent