There is, I think, something ethereal about i, the square root of minus one. I remember first hearing about it at school. It seemed an odd beast at that time: an intruder hovering on the edge of reality.

Usually familiarity dulls this sense of the bizarre, but in the case of i it was the reverse: over the years the sense of its surreal nature intensified. It seemed that it was impossible to write mathematics that described the real world in …

I and i

We will review some of the major results in random graphs and some of the more challenging open problems. We will cover algorithmic and structural questions. We will touch on newer models, including those related to the WWW.

Random graphs

Study report on “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”

We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid the foundation for such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. Looking back at what has been achieved so far, we also conjecture what the future may hold for image retrieval research.

/pdf/image-retrieval-ideas-influences-and-trends-of-the-new-age-51whw4w7su.pdf

Image retrieval: Ideas, influences, and trends of the new age

Computational geometry

The problem of computing the convex hull of a set of sorted points in the plane has a vast array of applications and is one of the fundamental tasks in pattern recognition, morphology and image processing. The main contribution of this paper is to show a simple parallel algorithm for computing the convex hull of a set of n sorted points in the plane and to evaluate its performance on dual quad-core processors. The experimental results show that our implementation achieves a speed-up factor of approximately 7 using 8 processors. Since a speed-up factor of more than 8 is not possible, our parallel implementation for computing the convex hull is close to optimal. Also, for 2 or 4 processors, we achieved a superlinear speed-up.

/pdf/a-simple-parallel-convex-hulls-algorithm-for-sorted-points-4ftp2zmjy3.pdf

A Simple Parallel Convex Hulls Algorithm for Sorted Points and the Performance Evaluation on the Multicore Processors
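To make the "sorted points" setting in the abstract above concrete, here is a minimal sequential sketch in Python. It assumes the standard monotone chain scan, which is not necessarily the paper's parallel algorithm; the function names `cross`, `half_hull` and `convex_hull_sorted` are hypothetical. Because the input is already sorted by x (then y), the hull follows from two linear scans with no sorting step.

```python
def cross(o, a, b):
    """Cross product of vectors OA and OB; positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def half_hull(points):
    """Scan points in order, keeping only those that make strict left turns."""
    hull = []
    for p in points:
        # Pop the last point while it would create a right turn or a straight line.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def convex_hull_sorted(points):
    """Convex hull of distinct points already sorted by (x, y).

    Returns the hull vertices in counter-clockwise order.
    Assumption: this is the textbook monotone chain, used here only as an
    illustration of why sorted input allows a linear-time scan.
    """
    if len(points) < 3:
        return list(points)
    lower = half_hull(points)            # left-to-right scan
    upper = half_hull(reversed(points))  # right-to-left scan
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    pts = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]  # already sorted
    print(convex_hull_sorted(pts))
```

A parallel variant could, for example, split the sorted input into contiguous chunks, compute a hull per chunk and then merge them, but that decomposition is an assumption and is not described in the abstract.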

Embedded multicore processors represented by FPGAs and GPUs have lately attracted considerable attention for their potential computation ability and power consumption. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 Family FPGAs have a DSP48E1 slice, which is a configurable logic block equipped with fast multipliers, adders, pipeline registers, and so on. They also have an 18 Kbit dual-port memory as a block RAM. Meanwhile, recent GPUs can be used for general-purpose computation. Users can develop parallel programs running on GPUs using the programming architecture called CUDA provided by NVIDIA. The main contribution of this paper is to present two implementations of the Hough transform, one on the FPGA and one on the GPU. The first idea of the implementations is an efficient usage of DSP slices and block RAMs for FPGAs, and of the shared memory for GPUs. The second idea is to partition the voting space in the Hough transform so that the voting operation is performed in parallel. The implementation results show that the Hough transform for a 512 × 512 image with 33232 edge points can be done in 135.75 μs and 637.88 μs on the FPGA and the GPU, respectively. On the other hand, a conventional CPU implementation runs in 37.10 ms. Thus, both implementations achieve a sufficient speed-up.

/pdf/implementations-of-the-hough-transform-on-the-embedded-11awb0fsyp.pdf

Implementations of the Hough Transform on the Embedded Multicore Processors
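The idea of partitioning the voting space can be sketched in plain Python. This is an illustration under assumptions, not the paper's FPGA or GPU design: it uses the usual line parameterization rho = x cos(theta) + y sin(theta), splits the accumulator by theta, and the names `vote_partition` and `hough_transform` and the parameters `n_theta`, `rho_max` and `n_parts` are hypothetical. Each slice owns its own rows of the accumulator, so slices never write to the same cell and could be processed independently.

```python
import math


def vote_partition(edge_points, thetas, rho_max):
    """Vote for one slice of the (theta, rho) space.

    Returns one row of rho bins per theta in this slice. Bin index rho + rho_max
    holds the votes for the integer distance rho.
    """
    rows = []
    for theta in thetas:
        c, s = math.cos(theta), math.sin(theta)
        row = [0] * (2 * rho_max + 1)
        for x, y in edge_points:
            rho = round(x * c + y * s)
            row[rho + rho_max] += 1
        rows.append(row)
    return rows


def hough_transform(edge_points, n_theta, rho_max, n_parts):
    """Hough voting with the theta axis split into roughly n_parts slices.

    Assumption: rho_max is at least the largest |rho| any point can produce,
    e.g. the image diagonal rounded up.
    """
    thetas = [math.pi * k / n_theta for k in range(n_theta)]
    step = -(-n_theta // n_parts)  # ceiling division
    acc = []
    # Sequential loop here; each iteration is independent of the others.
    for start in range(0, n_theta, step):
        acc.extend(vote_partition(edge_points, thetas[start:start + step], rho_max))
    return acc


if __name__ == "__main__":
    points = [(3, 0), (3, 1), (3, 2), (3, 3)]  # vertical line x = 3
    acc = hough_transform(points, n_theta=4, rho_max=5, n_parts=2)
    print(acc[0][3 + 5])  # votes for theta = 0, rho = 3
```

The result does not depend on the number of slices, which is what makes the partitioning safe to run in parallel.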

The task of finding strings having a partial match to a given pattern is of interest to a number of practical applications, including DNA sequencing and text searching. Owing to its importance, alternatives to accelerate the Approximate String Matching (ASM) have been widely investigated in the literature. The main contribution of this work is to present a memory-access-efficient implementation for computing the ASM on a GPU. The key idea of our implementation relies on warp shuffle operations, which are used to reduce the communication overhead between threads. Experimental results, carried out on a GeForce GTX 960 GPU, show that the proposed implementation provides acceleration between 1.31 and 1.84 times when compared to another noteworthy alternative.

A Memory-Access-Efficient Implementation of the Approximate String Matching Algorithm on GPU
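For readers unfamiliar with approximate string matching, the following sketch shows the classic dynamic-programming formulation in Python. It assumes unit-cost edit distance with a free starting position in the text; the abstract does not state the exact cost model, and the warp shuffle optimization it describes is GPU-specific and is not modeled here. The function name `approximate_match` and the threshold `k` are hypothetical.

```python
def approximate_match(pattern, text, k):
    """Return every index j such that some substring of text ending at j
    is within edit distance k of pattern.

    Assumption: unit costs for substitution, insertion and deletion.
    Only one column of the DP table is kept at a time.
    """
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix: D[i][0] = i
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]    # D[i-1][j] for i = 1
        col[0] = 0            # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur = min(prev_diag + cost,  # match or substitute
                      col[i] + 1,        # extra text character
                      col[i - 1] + 1)    # missing pattern character
            prev_diag = col[i]
            col[i] = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approximate_match("abc", "xxabcxx", 0))
    print(approximate_match("abc", "xxabcxx", 1))
```

Each cell depends on its left, upper and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That dependency pattern is what a parallel implementation has to work around when threads exchange neighbouring values.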

The main contribution of this paper is to present the Bitwise Parallel Bulk Computation (BPBC) technique, to accelerate bulk computation, which executes the same algorithm for a lot of instances in turn or in parallel. The idea of the BPBC technique is to simulate a combinational logic circuit for 32 inputs at the same time using bitwise logic operators for 32-bit integers supported by most processing devices. We will show that the BPBC technique works very efficiently on a CPU as well as on a GPU. As a simple example of the BPBC, we first show that the pairwise sums of a lot of integers can be computed faster using the BPBC technique, if the values of input integers are not large. We also show that the CKY parsing for context-free grammars can be implemented in the GPU efficiently using the BPBC technique. The experimental results using Intel Core i7 CPU and GeForce GTX TITAN X GPU show that the GPU implementation for the CKY parsing can be more than 400 times faster than the CPU implementation.

Bitwise Parallel Bulk Computation on the GPU, with Application to the CKY Parsing for Context-Free Grammars
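The pairwise-sum example from the abstract above can be sketched as follows. This is a minimal illustration in Python, assuming a bit-sliced (transposed) data layout and a ripple-carry adder as the simulated circuit; the abstract does not say which adder circuit the authors use, and the names `to_bit_slices`, `from_bit_slices` and `bulk_add` are hypothetical. Word i holds bit i of all 32 instances, so one bitwise operation advances all 32 additions at once.

```python
LANES = 32                 # number of instances processed together
MASK = (1 << LANES) - 1    # emulate 32-bit words with Python integers


def to_bit_slices(values, n_bits):
    """Transpose up to 32 integers: slice i packs bit i of every value, one lane per value."""
    slices = []
    for i in range(n_bits):
        word = 0
        for lane, v in enumerate(values):
            word |= ((v >> i) & 1) << lane
        slices.append(word)
    return slices


def from_bit_slices(slices, n_values):
    """Inverse of to_bit_slices: rebuild the integer stored in each lane."""
    return [
        sum(((word >> lane) & 1) << i for i, word in enumerate(slices))
        for lane in range(n_values)
    ]


def bulk_add(a_slices, b_slices):
    """Ripple-carry adder evaluated on all lanes at once with bitwise operators."""
    carry = 0
    out = []
    for a, b in zip(a_slices, b_slices):
        x = a ^ b
        out.append((x ^ carry) & MASK)            # sum bit of a full adder
        carry = ((a & b) | (carry & x)) & MASK    # carry-out of a full adder
    out.append(carry)                             # final carry is the top bit
    return out


if __name__ == "__main__":
    a = list(range(32))
    b = [(3 * k) % 16 for k in range(32)]
    sums = from_bit_slices(bulk_add(to_bit_slices(a, 5), to_bit_slices(b, 5)), 32)
    print(sums == [x + y for x, y in zip(a, b)])
```

The adder loop runs once per bit position, not once per instance, which is consistent with the abstract's remark that the approach pays off when the input integers are not large.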

The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access by a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and each warp of w threads accesses the shared memory at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, to develop efficient algorithms it is very important to reduce the memory access congestion, that is, the maximum number of memory access requests destined for the same bank. However, it is not easy to minimize the memory access congestion for some problems. The main contribution of this paper is to present novel and practical parallel computing models in which the congestion is small for any memory access requests. We first present the Super Discrete Memory Machine (SDMM), an extended version of the DMM, which supports a super warp with multiple warps. Memory access requests by multiple warps in a super warp are packed through pipeline registers to reduce the memory access congestion. We then go on to apply the random address shift technique to the SDMM. The resulting machine, the Random Super Discrete Memory Machine (RSDMM), can equalize memory access requests by a super warp. Quite surprisingly, for any memory access requests by a super warp on the RSDMM, the overhead of the memory access congestion is within a constant factor of perfectly scheduled memory access. Thus, unlike the DMM, developers of parallel algorithms do not have to consider the memory access congestion on the RSDMM. The congestion on the RSDMM is evaluated by theoretical analysis as well as by experiments.

/pdf/the-super-warp-architecture-with-random-address-shift-2323nf1tob.pdf
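The notion of congestion defined in the abstract above can be illustrated with a small Python sketch. Two details are assumptions that the abstract does not state: address a is taken to live in bank a mod w, and the random address shift is modeled as rotating each row of a w × w matrix by an independent random offset. Super warps and pipeline registers are not modeled. The names `congestion`, `address` and `shifts` are hypothetical.

```python
import random
from collections import Counter


def congestion(addresses, w):
    """Maximum number of requests destined for the same bank.

    Assumption: address a is stored in bank a % w.
    """
    banks = Counter(addr % w for addr in addresses)
    return max(banks.values())


def address(i, j, w, shifts=None):
    """Address of matrix element (i, j) in a w x w matrix stored row by row.

    With shifts given, row i is rotated by shifts[i] positions (an assumed
    form of random address shift).
    """
    r = shifts[i] if shifts is not None else 0
    return i * w + (j + r) % w


if __name__ == "__main__":
    w = 32
    rng = random.Random(0)
    shifts = [rng.randrange(w) for _ in range(w)]

    row = [address(0, j, w) for j in range(w)]            # one warp reads a row
    col = [address(i, 0, w) for i in range(w)]            # one warp reads a column
    col_shifted = [address(i, 0, w, shifts) for i in range(w)]

    print(congestion(row, w))          # every request hits a different bank
    print(congestion(col, w))          # every request hits the same bank
    print(congestion(col_shifted, w))  # spread over banks by the random offsets
```

Without the shift, a column read sends all w requests to one bank, while a row read spreads them over all w banks. With the per-row rotation, the column read is scattered across banks and a row read remains conflict-free, because a rotation only permutes the elements within a row.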

Koji Nakano

Papers

A Simple Parallel Convex Hulls Algorithm for Sorted Points and the Performance Evaluation on the Multicore Processors

Implementations of the Hough Transform on the Embedded Multicore Processors

A Memory-Access-Efficient Implementation of the Approximate String Matching Algorithm on GPU

Bitwise Parallel Bulk Computation on the GPU, with Application to the CKY Parsing for Context-Free Grammars

The super warp architecture with random address shift