The Grid 2: Blueprint for a New Computing Infrastructure

https://fisica.cab.cnea.gov.ar/gpgpu/images/charlas/medical_image_processing_on_the_gpu.pdf

Medical Image Processing on the GPU: Past, Present and Future

The establishment of image correspondence through robust image registration is critical to many clinical tasks, such as image fusion, organ atlas creation, and tumor growth monitoring, and is a very challenging problem. Since the beginning of the recent deep learning renaissance, the medical imaging research community has developed deep learning-based approaches and achieved the state of the art in many applications, including image registration. The rapid adoption of deep learning for image registration applications over the past few years necessitates a comprehensive summary and outlook, which is the main scope of this survey. This requires placing a focus on the different research areas as well as highlighting challenges that practitioners face. This survey, therefore, outlines the evolution of deep learning-based medical image registration in the context of both research challenges and relevant innovations in the past few years. Further, this survey highlights future research directions to show how this field might be moved forward to the next level.

Deep learning in medical image registration: a survey

https://hal.inria.fr/hal-01017319/document

Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms

The graphics processing unit (GPU) has emerged as a competitive platform for computing massively parallel problems. Many computing applications in medical physics can be formulated as data-parallel tasks that exploit the capabilities of the GPU for reducing processing times. The authors review the basic principles of GPU computing as well as the main performance optimization techniques, and survey existing applications in three areas of medical physics, namely image reconstruction, dose calculation and treatment plan optimization, and image processing.

/pdf/gpu-computing-in-medical-physics-a-review-i97kr9j4ih.pdf

GPU computing in medical physics: a review.

We present a new parallel computational model, named LogGPS, which captures synchronization. The LogGPS model is an extension of the LogGP model, which abstracts communication on parallel platforms. Although the LogGP model captures long messages with one bandwidth parameter (G), it does not capture the synchronization that high-level communication libraries need before sending a long message. Our model has one additional parameter, S, defined as the threshold for message length, above which synchronous messages are sent. We also present some experimental results using both models. The results include (1) a verification of the LogGPS model, (2) an example of synchronization analysis using an MPI program, and (3) a comparison of the models. The results indicate that the LogGPS model is more accurate than the LogGP model, and that analyzing synchronization costs is important when improving parallel program performance.

/pdf/loggps-a-parallel-computational-model-for-synchronization-zgqnq06u1a.pdf

LogGPS: a parallel computational model for synchronization analysis
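To make the role of the threshold S in the LogGPS abstract above concrete, here is a minimal Python sketch of the cost of a single point-to-point message. The parameters L, o, g, G, and P follow the usual LogGP definitions, and S is the threshold described in the abstract. How the handshake of a synchronous message is charged (one small-message round trip plus an optional receiver wait) is an assumption made for illustration and is not taken from the paper. All class, function, and argument names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGPSParams:
    L: float  # network latency
    o: float  # CPU overhead per send or receive
    g: float  # gap between consecutive messages (unused for a single message)
    G: float  # gap per byte for long messages
    P: int    # number of processors
    S: int    # message length above which messages are sent synchronously


def loggp_time(p: LogGPSParams, k: int) -> float:
    """End-to-end time of one k-byte message under LogGP."""
    return p.o + (k - 1) * p.G + p.L + p.o


def loggps_time(p: LogGPSParams, k: int, receiver_wait: float = 0.0) -> float:
    """Same message under the simplified LogGPS reading used here.

    Assumption: a synchronous message first pays a request/acknowledge
    round trip of two 1-byte messages, plus any time the sender waits
    for the receiver to post its receive.
    """
    if k <= p.S:
        return loggp_time(p, k)  # short message: no handshake
    handshake = 2 * loggp_time(p, 1) + receiver_wait
    return handshake + loggp_time(p, k)


if __name__ == "__main__":
    params = LogGPSParams(L=5.0, o=1.0, g=2.0, G=0.01, P=2, S=1024)
    for size in (512, 1024, 2048):
        print(size, loggp_time(params, size), loggps_time(params, size))
```

With these made-up parameter values, messages at or below S cost the same in both models, while longer messages pay the extra handshake, which is the kind of cost the abstract says LogGP alone does not capture.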

Image registration is a technique for defining a geometric relationship between corresponding points in images. This paper presents a data distributed parallel algorithm that is capable of aligning large-scale three-dimensional (3-D) images of deformable objects. The novelty of our algorithm is that it overcomes limitations on both memory space and execution time. To enable this, our algorithm incorporates data distribution, data-parallel processing, and load balancing techniques into Schnabel's registration algorithm, which realizes robust and efficient alignment based on information theory and adaptive mesh refinement. We also present some experimental results obtained on a 128-CPU cluster of PCs interconnected by Myrinet and Fast Ethernet switches. The results show that our algorithm requires less memory, so that it aligns datasets of up to 1024x1024x590 voxels while reducing the execution time from hours to minutes, a clinically compatible time.

/pdf/a-data-distributed-parallel-algorithm-for-nonrigid-image-quhejtl39b.pdf

A data distributed parallel algorithm for nonrigid image registration
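The registration abstract above says the alignment is "based on information theory" without naming the measure. As an illustration only, the sketch below computes mutual information between two equally sized lists of discretized intensities, which is one common information-theoretic similarity measure in image registration. Whether the paper uses this exact measure or a normalized variant is not stated here, and the function name is hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Mutual information (in bits) between two equal-length intensity lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    joint = Counter(zip(a, b))  # joint histogram of intensity pairs
    hist_a = Counter(a)         # marginal histogram of the first image
    hist_b = Counter(b)         # marginal histogram of the second image
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        ratio = count * n / (hist_a[x] * hist_b[y])
        mi += p_xy * log2(ratio)
    return mi


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    print(mutual_information(reference, [0, 0, 1, 1]))  # identical images
    print(mutual_information(reference, [0, 1, 0, 1]))  # statistically independent
```

A registration loop would typically adjust the deformation so that a measure of this kind increases; that optimization is outside the scope of this sketch.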

Compute unified device architecture (CUDA) is a software development platform that allows us to run C-like programs on the nVIDIA graphics processing unit (GPU). This paper presents an acceleration method for cone beam reconstruction using CUDA compatible GPUs. The proposed method accelerates the Feldkamp, Davis, and Kress (FDK) algorithm using three techniques: (1) off-chip memory access reduction for saving memory bandwidth; (2) loop unrolling for hiding memory latency; and (3) multithreading for exploiting multiple GPUs. We describe how these techniques can be incorporated into the reconstruction code. We also show an analytical model to understand the reconstruction performance in multi-GPU environments. Experimental results show that the proposed method runs at 83% of the theoretical memory bandwidth, achieving a throughput of 64.3 projections per second (pps) for reconstruction of a 512^3-voxel volume from 360 512^2-pixel projections. This performance is 41% higher than the previous CUDA-based method and is 24 times faster than a CPU-based method optimized by vector intrinsics. Some detailed analyses are also presented to understand how effectively the acceleration techniques increase the reconstruction performance of a naive method. We also demonstrate out-of-core reconstruction for large-scale datasets, up to a 1024^3-voxel volume.

/pdf/high-performance-cone-beam-reconstruction-using-cuda-27barse48y.pdf

High-performance cone beam reconstruction using CUDA compatible GPUs
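The throughput figures in the cone beam abstract above can be turned into wall-clock estimates with a little arithmetic. The sketch below uses only the numbers quoted in the abstract (64.3 pps, 360 projections, a 512^3-voxel volume, "41% higher", "24 times faster"). The derived values are back-of-the-envelope estimates, not results reported by the paper, and the voxel-update rate assumes every voxel is updated once per projection. All names are hypothetical.

```python
PROJECTIONS = 360           # projections per reconstruction (from the abstract)
VOLUME_VOXELS = 512 ** 3    # voxels in the reconstructed volume (from the abstract)
THROUGHPUT_PPS = 64.3       # reported throughput, projections per second


def reconstruction_seconds(projections: int, pps: float) -> float:
    """Time to process all projections at a given throughput."""
    return projections / pps


def voxel_updates_per_second(voxels: int, pps: float) -> float:
    """Update rate, assuming each voxel is updated once per projection."""
    return voxels * pps


# Throughputs implied by the relative figures in the abstract.
previous_cuda_pps = THROUGHPUT_PPS / 1.41  # "41% higher than the previous method"
cpu_pps = THROUGHPUT_PPS / 24              # "24 times faster than a CPU-based method"

gpu_seconds = reconstruction_seconds(PROJECTIONS, THROUGHPUT_PPS)
cpu_seconds = reconstruction_seconds(PROJECTIONS, cpu_pps)
updates = voxel_updates_per_second(VOLUME_VOXELS, THROUGHPUT_PPS)

if __name__ == "__main__":
    print(f"GPU reconstruction time: {gpu_seconds:.1f} s")
    print(f"Implied CPU reconstruction time: {cpu_seconds:.0f} s")
    print(f"Implied previous CUDA throughput: {previous_cuda_pps:.1f} pps")
    print(f"Voxel updates per second: {updates:.2e}")
```

Under these assumptions the reported throughput corresponds to roughly 5.6 seconds per reconstruction, against a little over two minutes for the CPU baseline.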

This paper describes a design and implementation of the Smith-Waterman algorithm accelerated on the graphics processing unit (GPU). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. The method efficiently uses on-chip shared memory to reduce the amount of data transferred between off-chip memory and processing elements in the GPU. Furthermore, it reduces the number of data fetches by applying a data reuse technique to query and database sequences. We show some experimental results comparing the proposed method with an OpenGL-based method. As a result, the speedup over the OpenGL-based method reaches a factor of 6.4 when using an amino acid sequence database. We also find that shared memory reduces the number of data fetches to 1/140, providing a peak performance of 5.65 giga cell updates per second (GCUPS). This performance is approximately three times faster than a prior CUDA-based implementation.

/pdf/design-and-implementation-of-the-smith-waterman-algorithm-on-5g6m9i2fvi.pdf

Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU
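For readers unfamiliar with the computation being accelerated in the abstract above, here is a minimal sequential Python sketch of Smith-Waterman local alignment scoring, together with the GCUPS metric the abstract reports. It is a plain CPU reference, not the paper's CUDA method. The linear gap penalty and the default scores are assumptions chosen for brevity, since the abstract does not state the scoring scheme, and the function names are hypothetical.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score, keeping only two rows of the DP matrix."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q_char in query:
        curr = [0]  # column 0 is always zero
        for j, t_char in enumerate(target, start=1):
            diagonal = prev[j - 1] + (match if q_char == t_char else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diagonal, up, left)  # local alignment never goes below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def gcups(query_length: int, database_residues: int, seconds: float) -> float:
    """Giga cell updates per second: one cell per (query, database) residue pair."""
    return query_length * database_residues / seconds / 1e9


if __name__ == "__main__":
    print(smith_waterman_score("TGTTACGG", "GGTTGACTA", match=3, mismatch=-3, gap=-2))
    print(gcups(query_length=1000, database_residues=5_650_000, seconds=1.0))
```

Each cell depends only on its left, upper, and upper-left neighbors, which is why cells along an anti-diagonal can be computed in parallel and why reusing query and database data in fast memory, as the abstract describes, can cut the number of fetches.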

Sort-last parallel rendering is a good rendering scheme on distributed memory multiprocessors. This paper presents an improvement on the binary-swap (BS) method, which is an efficient image compositing algorithm for sort-last parallel rendering. Our compositing method uses three acceleration techniques in addition to the original BS method: (1) interleaved splitting, (2) multiple bounding rectangle, and (3) run-length encoding. Through the use of the three techniques, our method balances the compositing workload among processors, exploits more sparsity of the image, and reduces the cost of communication. We also show some experimental results on a PC cluster. The results show that our method completes the image compositing faster than the original BS method, and its speedup over the original increases with the number of processors.

/pdf/an-improved-binary-swap-compositing-for-sort-last-parallel-49slm1blf9.pdf
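The original binary-swap scheme that the abstract above improves on can be sketched as a small single-process simulation. In each stage, every rank pairs with the rank whose index differs in one bit, the pair splits its current image region in half, and each side composites the half it keeps. This sketch covers only the baseline scheme, not the paper's three acceleration techniques. It uses 1-D images and a per-pixel maximum as a stand-in compositing operator, which is an assumption made to avoid depth ordering; all names are hypothetical.

```python
def binary_swap(images):
    """Simulate binary-swap compositing over P ranks (P must be a power of two).

    Returns, for each rank, the tuple (lo, hi, pixels) of the region it owns
    after the final stage.
    """
    p = len(images)
    if p == 0 or p & (p - 1):
        raise ValueError("number of ranks must be a power of two")
    n = len(images[0])
    bufs = [list(img) for img in images]
    regions = [(0, n)] * p  # every rank starts with the whole image
    stage = 1
    while stage < p:
        new_bufs = [list(b) for b in bufs]
        new_regions = list(regions)
        for rank in range(p):
            partner = rank ^ stage  # partner differs in exactly one bit
            lo, hi = regions[rank]  # partner holds the same region
            mid = (lo + hi) // 2
            keep = (lo, mid) if rank < partner else (mid, hi)
            for i in range(*keep):
                # Stand-in compositing operator: per-pixel maximum.
                new_bufs[rank][i] = max(bufs[rank][i], bufs[partner][i])
            new_regions[rank] = keep
        bufs, regions = new_bufs, new_regions
        stage *= 2
    return [(lo, hi, bufs[r][lo:hi]) for r, (lo, hi) in enumerate(regions)]


def assemble(pieces, n):
    """Gather the disjoint per-rank pieces into one final image."""
    final = [None] * n
    for lo, hi, pixels in pieces:
        final[lo:hi] = pixels
    return final


if __name__ == "__main__":
    partial_images = [
        [1, 0, 0, 0, 5, 0, 0, 0],
        [0, 2, 0, 0, 0, 6, 0, 0],
        [0, 0, 3, 0, 0, 0, 7, 0],
        [0, 0, 0, 4, 0, 0, 0, 8],
    ]
    print(assemble(binary_swap(partial_images), 8))
```

Because each rank ends up owning a contiguous 1/P slice, ranks whose slice is mostly background do little work. That workload imbalance and wasted transfer of empty pixels is what the abstract's interleaved splitting, bounding rectangles, and run-length encoding are said to address.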

Fumihiko Ino

Papers

LogGPS: a parallel computational model for synchronization analysis

A data distributed parallel algorithm for nonrigid image registration

High-performance cone beam reconstruction using CUDA compatible GPUs

Design and implementation of the Smith-Waterman algorithm on the CUDA-compatible GPU

An improved binary-swap compositing for sort-last parallel rendering on distributed memory multiprocessors