
Showing papers on "Parallel processing (DSP implementation)" published in 2010


Journal ArticleDOI
TL;DR: It is suggested that GPUs have the potential to facilitate the growth of statistical modeling into complex data-rich domains through the availability of cheap and accessible many-core computation.
Abstract: We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35- to 500-fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data-rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
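A minimal sketch of the embarrassingly parallel pattern these methods rely on, assuming a toy Monte Carlo estimate of pi and CPU threads in place of GPU cores (our own illustration, not the paper's code; on CPython, threads show the decomposition rather than a true speedup):

```python
# Each worker simulates an independent batch with its own seeded RNG;
# the partial results are reduced (summed) at the end, mirroring the
# replica-per-core layout the paper uses on GPUs.
import random
from concurrent.futures import ThreadPoolExecutor

def mc_batch(seed, n):
    """Count hits inside the unit quarter-circle from n uniform draws."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def parallel_pi(total_samples, workers=4):
    per = total_samples // workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(mc_batch, range(workers), [per] * workers))
    return 4.0 * hits / (per * workers)

print(parallel_pi(100_000))
```

The same decomposition applies to the population-based samplers in the paper: each chain or particle evolves independently between (rare) global reduction steps.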

334 citations


Journal ArticleDOI
01 Mar 2010
TL;DR: A parallel implementation based on the insight segmentation and registration toolkit for a multiscale feature extraction and region growing algorithm, applied to retinal blood vessel segmentation, capable of achieving an accuracy comparable to its serial counterpart, but 8 to 10 times faster.
Abstract: This paper presents a parallel implementation, based on the insight segmentation and registration toolkit, of a multiscale feature extraction and region growing algorithm applied to retinal blood vessel segmentation. This implementation is capable of achieving an accuracy (Ac) comparable to its serial counterpart (about 92%), but 8 to 10 times faster. In this paper, the Ac of this parallel implementation is evaluated by comparison with expert manual segmentation (obtained from public databases). In addition, its performance is compared with previously published serial implementations. Both these characteristics make this parallel implementation feasible for the analysis of a larger amount of high-resolution retinal images, achieving a faster and high-quality segmentation of retinal blood vessels.

160 citations


Journal ArticleDOI
TL;DR: This paper considers the data expansion required to pass from the plaintext to the encrypted representation of signals, due to the use of cryptosystems operating on very large algebraic structures, and proposes a general composite signal representation.
Abstract: Signal processing tools working directly on encrypted data could provide an efficient solution to application scenarios where sensitive signals must be protected from an untrusted processing device. In this paper, we consider the data expansion required to pass from the plaintext to the encrypted representation of signals, due to the use of cryptosystems operating on very large algebraic structures. A general composite signal representation allowing us to pack together a number of signal samples and process them as a unique sample is proposed. The proposed representation permits us to speed up linear operations on encrypted signals via parallel processing and to reduce the size of the encrypted signal. A case study-1-D linear filtering-shows the merits of the proposed representation and provides some insights regarding the signal processing algorithms more suited to work on the composite representation.
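The packing idea can be sketched with plain integers: choose a per-slot radix large enough that sums never overflow a slot, and one addition on the packed value then adds every sample at once. The parameters below are illustrative assumptions, not those of any specific cryptosystem:

```python
# Composite-representation sketch: several small samples share one large
# integer, one "digit" (base-R slot) per sample, so a single addition on
# the packed value adds all samples componentwise.
B = 256           # samples assumed in [0, B)
HEADROOM = 16     # packed additions allowed before a slot can overflow
R = B * HEADROOM  # per-slot radix

def pack(samples):
    """Pack samples into one big integer, sample i in digit i of base R."""
    acc = 0
    for i, s in enumerate(samples):
        assert 0 <= s < B
        acc += s * (R ** i)
    return acc

def unpack(packed, k):
    """Recover k slots (valid while every slot stays below R)."""
    return [(packed // (R ** i)) % R for i in range(k)]

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
print(unpack(a + b, 4))   # one addition acts on all four samples -> [11, 22, 33, 44]
```

In the encrypted setting the packed integer plays the role of a single plaintext, so one homomorphic operation processes a whole block of samples, which is the source of both the speedup and the reduced ciphertext size.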

147 citations


Proceedings ArticleDOI
30 Nov 2010
TL;DR: This paper creates a spatial index, Voronoi diagram, for given data points in 2D space and enables efficient processing of a wide range of GQs with the MapReduce programming model.
Abstract: Geospatial queries (GQ) have been used in a wide variety of applications such as decision support systems, profile-based marketing, bioinformatics and GIS. Most of the existing query-answering approaches assume centralized processing on a single machine although GQs are intrinsically parallelizable. Some approaches have been designed for parallel databases and cluster systems; however, these apply only to systems with limited parallel processing capability, far below that of cloud-based platforms. In this paper, we study the problem of parallel geospatial query processing with the MapReduce programming model. Our proposed approach creates a spatial index, a Voronoi diagram, for the given data points in 2D space and enables efficient processing of a wide range of GQs. We evaluated the performance of our proposed techniques and compared them with their closest related work while varying the number of employed nodes.
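A much-simplified, single-machine sketch of the partitioning idea (the pivot set and brute-force per-cell search are our own illustration, not the paper's MapReduce jobs): points are bucketed by nearest pivot, i.e. by Voronoi cell, so a query can be answered per-cell in parallel and the partial results reduced to a global best.

```python
# Bucket points by Voronoi cell of a small pivot set, then answer a
# nearest-neighbor query as a per-cell "map" followed by a global "reduce".
from math import dist

def build_index(points, pivots):
    cells = {i: [] for i in range(len(pivots))}
    for p in points:
        cell = min(range(len(pivots)), key=lambda i: dist(p, pivots[i]))
        cells[cell].append(p)
    return cells

def nearest(query, cells):
    # "map": best candidate within each cell; "reduce": global minimum.
    candidates = [min(pts, key=lambda p: dist(query, p))
                  for pts in cells.values() if pts]
    return min(candidates, key=lambda p: dist(query, p))
```

Searching every cell keeps this exact; a real Voronoi index prunes cells that cannot contain the answer, which is where the efficiency claimed in the paper comes from.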

132 citations


Journal ArticleDOI
TL;DR: A spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior is presented, which captures the detailed dynamics of human behavior during dual-task-performance, including both mean RTs and RT distributions.
Abstract: The human brain efficiently solves certain operations such as object recognition and categorization through a massively parallel network of dedicated processors. However, human cognition also relies on the ability to perform an arbitrarily large set of tasks by flexibly recombining different processors into a novel chain. This flexibility comes at the cost of a severe slowing down and a seriality of operations (100–500 ms per step). A limit on parallel processing is demonstrated in experimental setups such as the psychological refractory period (PRP) and the attentional blink (AB) in which the processing of an element either significantly delays (PRP) or impedes conscious access (AB) of a second, rapidly presented element. Here we present a spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior. The precise mapping of incoming sensory stimuli onto motor representations relies on a “router” network capable of flexibly interconnecting processors and rapidly changing its configuration from one task to another. Simulations show that, when presented with dual-task stimuli, the network exhibits parallel processing at peripheral sensory levels, a memory buffer capable of keeping the result of sensory processing on hold, and a slow serial performance at the router stage, resulting in a performance bottleneck. The network captures the detailed dynamics of human behavior during dual-task-performance, including both mean RTs and RT distributions, and establishes concrete predictions on neuronal dynamics during dual-task experiments in humans and non-human primates.

129 citations


Patent
06 Oct 2010
TL;DR: In this paper, a memory device includes a plurality of storage units in which to store data of a bank, wherein the data has a logical order prior to storage and a physical order different than the logical order within the plurality of storage units, and a within-device reordering unit to reorder the data of the bank into the logical order prior to performing on-chip processing.
Abstract: A memory device includes a plurality of storage units in which to store data of a bank, wherein the data has a logical order prior to storage and a physical order different than the logical order within the plurality of storage units and a within-device reordering unit to reorder the data of a bank into the logical order prior to performing on-chip processing. In another embodiment, the memory device includes an external device interface connectable to an external device communicating with the memory device, an internal processing element to process data stored on the device and multiple banks of storage. Each bank includes a plurality of storage units and each storage unit has two ports, an external port connectable to the external device interface and an internal port connected to the internal processing element.

129 citations


Book ChapterDOI
Zhenhua Lv1, Yingjie Hu1, Haidong Zhong1, Jianping Wu1, Bo Li1, Hui Zhao1 
23 Oct 2010
TL;DR: The color representation of RS images, which means pixels need to be translated into a particular color space CIELAB that is more suitable for distinguishing colors is described, and the programming model MapReduce and a platform Hadoop are briefly introduced.
Abstract: The K-Means clustering is a basic method for analyzing RS (remote sensing) images, which generates a direct overview of objects. Usually, such work can be done by software (e.g. ENVI, ERDAS IMAGINE) on personal computers. However, for PCs, the limitation of hardware resources and the tolerance of time consumption present a bottleneck in processing a large amount of RS images. The techniques of parallel computing and distributed systems are no doubt the suitable choices. Unlike traditional approaches, in this paper we parallelize this algorithm on Hadoop, an open-source system that implements the MapReduce programming model. The paper first describes the color representation of RS images, which means pixels need to be translated into a particular color space, CIELAB, that is more suitable for distinguishing colors. It also gives an overview of traditional K-Means. Then the MapReduce programming model and the Hadoop platform are briefly introduced. This model requires customized 'map/reduce' functions, allowing users to parallelize processing in two stages. In addition, the paper details the map and reduce functions in pseudo-code, and reports performance results based on the experiments. The paper shows that the results are acceptable and may also inspire other approaches to tackling similar problems within the field of remote sensing applications.
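A single-machine rendition of the map/reduce split for one K-Means iteration might look as follows (function names are ours; the paper gives its own pseudo-code): map assigns each pixel to its nearest centroid, and reduce averages each cluster to produce the next centroids.

```python
# One K-Means iteration expressed as map (per-pixel, parallelizable) and
# reduce (per-cluster averaging), the structure Hadoop distributes.
from math import dist

def kmeans_map(pixel, centroids):
    key = min(range(len(centroids)), key=lambda i: dist(pixel, centroids[i]))
    return key, pixel

def kmeans_reduce(pixels):
    n = len(pixels)
    return tuple(sum(coord) / n for coord in zip(*pixels))

def kmeans_iteration(pixels, centroids):
    groups = {}
    for pixel in pixels:                       # map phase
        key, value = kmeans_map(pixel, centroids)
        groups.setdefault(key, []).append(value)
    return [kmeans_reduce(vs) for _, vs in sorted(groups.items())]  # reduce

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_iteration(data, [(0.0, 0.0), (10.0, 10.0)]))  # -> [(0.0, 0.5), (10.0, 10.5)]
```

Iterating this until the centroids stop moving is the full algorithm; in Hadoop each iteration is one MapReduce job over the pixel set.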

127 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: It is argued that adopting a best-effort service model for various software and hardware components of the computing platform stack can lead to drastic improvements in scalability and large improvements in performance and energy efficiency.
Abstract: With the advent of mainstream parallel computing, applications can obtain better performance only by scaling to platforms with larger numbers of cores. This is widely considered to be a very challenging problem due to the difficulty of parallel programming and the bottlenecks to efficient parallel execution. Inspired by how networking and storage systems have scaled to handle very large volumes of packet traffic and persistent data, we propose a new approach to the design of scalable, parallel computing platforms. For decades, computing platforms have gone to great lengths to ensure that every computation specified by applications is faithfully executed. While this design philosophy has remained largely unchanged, applications and the basic characteristics of their workloads have changed considerably. A wide range of existing and emerging computing workloads have an inherent forgiving nature. We therefore argue that adopting a best-effort service model for various software and hardware components of the computing platform stack can lead to drastic improvements in scalability. Applications are cognizant of the best-effort model, and separate their computations into those that may be executed on a best-effort basis and those that require the traditional execution guarantees. Best-effort computations may be exploited to simply reduce the computing workload, shape it to be more suitable for parallel execution, or execute it on unreliable hardware components. Guaranteed computations are realized either through an overlay software layer on top of the best-effort substrate, or through the use of application-specific strategies. We describe a system architecture for a best-effort computing platform, provide examples of parallel software and hardware that embody the best-effort model, and show that large improvements in performance and energy efficiency are possible through the adoption of this approach.

124 citations


Proceedings ArticleDOI
17 Nov 2010
TL;DR: The experimental result shows that the implementation of Canny edge detection algorithm on CUDA achieves a speedup factor of 61 over a conventional software implementation.
Abstract: Recent GPUs, which have many processing units connected to a global memory, can be used for general-purpose parallel computation. Users can develop parallel programs running on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture). The main contribution of this paper is to implement a Canny edge detection algorithm on CUDA. The experimental result shows that our implementation of the Canny edge detection algorithm on CUDA achieves a speedup factor of 61 over a conventional software implementation.

120 citations


Journal ArticleDOI
TL;DR: Designers now accept that although transistors will still get smaller and more numerous on each chip, they aren't going to operate faster than they do today, which explains the shift to assembling them into multiple microprocessor cores instead.
Abstract: Designers now accept that although transistors will still get smaller and more numerous on each chip, they aren't going to operate faster than they do today. And if you tried to incorporate all those transistors into one giant microprocessor, you might well end up with a device that couldn't compute any faster than the chip it was replacing, which explains the shift to assembling them into multiple microprocessor cores instead.

116 citations


Journal ArticleDOI
TL;DR: This paper has built a correlator and a beamformer, using PCI-based ADC cards and a Linux cluster of 48 nodes with dual gigabit inter-node connectivity for real-time data transfer requirements, and believes this is the first instance of such a real- time observatory backend for an intermediate sized array like the GMRT.
Abstract: The new era of software signal processing has a large impact on radio astronomy instrumentation. Our design and implementation of a 32 antennae, 33 MHz, dual polarization, fully real-time software backend for the GMRT, using only off-the-shelf components, is an example of this. We have built a correlator and a beamformer, using PCI-based ADC cards and a Linux cluster of 48 nodes with dual gigabit inter-node connectivity for real-time data transfer requirements. The highly optimized compute pipeline uses cache efficient, multi-threaded parallel code, with the aid of vectorized processing. This backend allows flexibility in final time and frequency resolutions, and the ability to implement algorithms for radio frequency interference rejection. Our approach has allowed relatively rapid development of a fairly sophisticated and flexible backend receiver system for the GMRT, which will greatly enhance the productivity of the telescope. In this paper we describe some of the first lights using this software processing pipeline. We believe this is the first instance of such a real-time observatory backend for an intermediate sized array like the GMRT.

Proceedings ArticleDOI
17 Aug 2010
TL;DR: Barra, a simulator of Graphics Processing Units (GPU) tuned for general purpose processing (GPGPU), is presented, based on the UNISIM framework and it simulates the native instruction set of the Tesla architecture at the functional level.
Abstract: We present Barra, a simulator of Graphics Processing Units (GPU) tuned for general purpose processing (GPGPU). It is based on the UNISIM framework and it simulates the native instruction set of the Tesla architecture at the functional level. The inputs are CUDA executables produced by NVIDIA tools. No alterations are needed to perform simulations. As it uses parallelism, Barra generates detailed statistics on executions in about the time needed by CUDA to operate in emulation mode. We use it to understand and explore the micro-architecture design spaces of GPUs.

Journal ArticleDOI
TL;DR: This review will examine several key studies, primarily electrophysiological, that have tested the hypothesis that the primate auditory cortex is organized in a serial and parallel manner in which there is a dorsal stream processing spatial information and a ventral stream processing non-spatial information.

Patent
Jeffrey Dean1, Sanjay Ghemawat1
12 Jan 2010
TL;DR: A large-scale data processing system and method for processing data in a distributed and parallel processing environment is described in this article, which includes an application-independent framework for processing the data having a plurality of application-independent map modules and reduce modules.
Abstract: A large-scale data processing system and method for processing data in a distributed and parallel processing environment. The system includes an application-independent framework for processing data having a plurality of application-independent map modules and reduce modules. These application-independent modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations. The system also includes a plurality of user-specified, application-specific operators, for use with the application-independent framework to perform a user-specified data processing operation on a user-specified set of input files. The application-specific operators include: a map operator and a reduce operator. The map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce intermediate data values. The reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce final output data.
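The framework/operator split the patent describes can be sketched in miniature, with an application-independent runner and user-supplied operators (word count is our example, not the patent's):

```python
# Application-independent runner: shuffles map outputs by key, then applies
# the reduce operator per key. Users supply only map_op and reduce_op.
from collections import defaultdict

def run_mapreduce(inputs, map_op, reduce_op):
    intermediate = defaultdict(list)
    for record in inputs:                 # map phase: parallelizable per record
        for key, value in map_op(record):
            intermediate[key].append(value)
    return {key: reduce_op(key, values)   # reduce phase: parallelizable per key
            for key, values in intermediate.items()}

# User-specified, application-specific operators (word count).
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(["a b a", "b c"], wc_map, wc_reduce))  # -> {'a': 2, 'b': 2, 'c': 1}
```

The real system adds what this sketch omits: partitioning across machines, intermediate files, and fault tolerance, all handled by the application-independent layer.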

Journal ArticleDOI
TL;DR: The proposed strategy is demonstrated on a coupled analysis of an existing reactor vessel; the parallel processing leads to very good speedup and also enables solving significantly larger problems in acceptable time.

Journal ArticleDOI
TL;DR: This work proposes a novel method named sort and count for efficient parallelization of mutual information (MI) computation designed for massively multi-processing architectures that achieves real-time (less than 1s) rigid registration of 3D medical images using a commodity graphics processing unit (GPU).

Proceedings ArticleDOI
21 Jun 2010
TL;DR: An integral image algorithm that can run in real-time on a Graphics Processing Unit (GPU) via the NVIDIA CUDA programming model that makes use of the work-efficient scan algorithm that is explicated elsewhere.
Abstract: We present an integral image algorithm that can run in real-time on a Graphics Processing Unit (GPU). Our system exploits the parallelism in the computation via the NVIDIA CUDA programming model, which is a software platform for solving non-graphics problems in a massively parallel high-performance fashion. This implementation makes use of the work-efficient scan algorithm that is explicated elsewhere. Treating the rows and the columns of the target image as independent input arrays for the scan algorithm, our method manages to expose a second level of parallelism in the problem. We compare the performance of the parallel approach running on the GPU with the sequential CPU implementation across a range of image sizes and report a speed-up by a factor of 8 for a 4-megapixel input. We further investigate the impact of using packed vector type data on the performance, as well as the effect of double-precision arithmetic on the GPU.
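The two-level decomposition can be shown without a GPU: prefix-sum every row, then every column of the result. In this pure-Python sketch (our own illustration), each row or column scan is the unit that CUDA would hand to an independent work item:

```python
# Integral image via two scan passes: rows first, then columns of the
# row-scanned result. Each row/column is an independent prefix sum.
def integral_image(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):                 # scan rows (mutually independent)
        for x in range(1, w):
            out[y][x] += out[y][x - 1]
    for x in range(w):                 # scan columns (mutually independent)
        for y in range(1, h):
            out[y][x] += out[y - 1][x]
    return out

print(integral_image([[1, 2], [3, 4]]))  # -> [[1, 3], [4, 10]]
```

On the GPU, each serial inner loop is replaced by the work-efficient parallel scan the abstract cites, giving parallelism both across scans and within each scan.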

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A new parallel Canny edge detector FPGA implementation is proposed in this paper that takes advantage of 4-pixel parallel computations to achieve high throughput without increasing the on-chip memory demands.
Abstract: Edge detection is one of the most fundamental algorithms in digital image processing. The Canny edge detector is the most implemented edge detection algorithm because of its ability to detect edges even in images that are intensely contaminated by noise. However, it is a time-consuming algorithm, and its implementations therefore struggle to reach real-time response speeds. Especially nowadays, when the demand for high-resolution image processing is constantly increasing, the need for fast and efficient edge detector implementations is ever more pressing. A new parallel Canny edge detector FPGA implementation is proposed in this paper to answer this demand. This design takes advantage of 4-pixel parallel computations to achieve high throughput without increasing the on-chip memory demands. Synthesis and simulation results are presented to prove the design's efficiency and high frames-per-second rate.

Patent
16 Dec 2010
TL;DR: In this paper, a modular data deduplication pipeline is presented, which allows modules to be replaced, selected or extended, e.g., by selecting modules to increase dedupling quality, performance and/or throughput.
Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
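An illustrative miniature of such a modular pipeline, with pluggable chunking and compression stages (the specific choices here, fixed-size chunks, SHA-256 keys and zlib, are our assumptions, not the patent's):

```python
# Modular dedup pipeline: the chunker and compressor are swappable
# callables; identical chunks are stored once, keyed by their hash,
# and a "recipe" of hashes lets the original data be rebuilt.
import hashlib
import zlib

def fixed_chunker(data, size=4):
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_pipeline(data, chunker=fixed_chunker, compress=zlib.compress):
    store, recipe = {}, []
    for chunk in chunker(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:        # only previously unseen chunks are stored
            store[digest] = compress(chunk)
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe, decompress=zlib.decompress):
    return b"".join(decompress(store[d]) for d in recipe)

store, recipe = dedup_pipeline(b"abcdabcdabcdxyz!")
print(len(store), len(recipe))  # 2 unique chunks referenced 4 times
```

Because each stage is a parameter, a content-defined chunker or a different compressor can be dropped in per data type, which is the pipeline tunability the patent emphasizes.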

Journal ArticleDOI
TL;DR: This paper outlines a CT reconstruction approach for ET that is optimized for the special demands and application setting of ET and describes a novel GPU-amenable approach that effectively compensates for reconstruction errors resulting from the TEM data acquisition on (long) samples which extend the width of the parallel TEM beam.

Proceedings ArticleDOI
27 Mar 2010
TL;DR: Experimental results indicate that, even when the data transfer time between host memory and device memory is taken into account, the two algorithms run approximately 25 and 49 times as fast on the GPU as on the CPU, respectively, and the GPU is practical for image processing.
Abstract: To address the compute-intensive character of image processing, a parallel acceleration technique is proposed that builds on the advantages of GPU parallel operation. First, the GPU architecture is introduced and its computational efficiency is compared with that of the CPU. Then, the Sobel edge detector and homomorphic filtering, two representative image processing algorithms, are implemented on the GPU to validate the technique. Finally, image data of different resolutions are tested on the CPU and GPU hardware platforms to compare their computational efficiency. Experimental results indicate that, even when the data transfer time between host memory and device memory is taken into account, the two algorithms run approximately 25 and 49 times as fast on the GPU as on the CPU, respectively, and that the GPU is practical for image processing.
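The Sobel stage lends itself to a one-output-pixel-per-thread mapping on the GPU; a serial sketch of the operator itself (our own illustration, not the paper's kernel) shows why each pixel is independent:

```python
# Sobel gradient magnitude: every interior (y, x) reads a fixed 3x3
# neighborhood and writes one output value, so all pixels can be
# computed concurrently (one GPU thread per pixel).
def sobel_magnitude(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):          # each (y, x) is an independent task
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

A vertical step edge produces a strong horizontal gradient (gx) and zero vertical gradient (gy), so the magnitude ridge lands on the edge column.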

Proceedings ArticleDOI
03 May 2010
TL;DR: An active storage system that allows data analysis, mining, and statistical operations to be executed from within a parallel I/O interface that consistently outperforms the traditional storage model with a wide variety of input dataset sizes, number of nodes, and computational loads is proposed.
Abstract: As data sizes continue to increase, the concept of active storage is well fitted for many data analysis kernels. Nevertheless, while this concept has been investigated and deployed in a number of forms, enabling it from the parallel I/O software stack has been largely unexplored. In this paper, we propose and evaluate an active storage system that allows data analysis, mining, and statistical operations to be executed from within a parallel I/O interface. In our proposed scheme, common analysis kernels are embedded in parallel file systems. We expose the semantics of these kernels to parallel file systems through an enhanced runtime interface so that execution of embedded kernels is possible on the server. In order to allow complete server-side operations without file format or layout manipulation, our scheme adjusts the file I/O buffer to the computational unit boundary on the fly. Our scheme also uses server-side collective communication primitives for reduction and aggregation using interserver communication. We have implemented a prototype of our active storage system and demonstrate its benefits using four data analysis benchmarks. Our experimental results show that our proposed system improves the overall performance of all four benchmarks by 50.9% on average and that the compute-intensive portion of the k-means clustering kernel can be improved by 58.4% through GPU offloading when executed with a larger computational load. We also show that our scheme consistently outperforms the traditional storage model with a wide variety of input dataset sizes, number of nodes, and computational loads.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: This paper proposes a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles and demonstrates that this approach can identify application and middleware performance deficiencies — resulting in more than 4× run time improvement for both examined applications.
Abstract: Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme scale parallel computers. The principle cause of this is the concurrency explosion of high-end computation, coupled with the complexity of providing parallel file systems that perform reliably at such scales. More than just being a bottleneck, parallel I/O performance at scale is notoriously variable, being influenced by numerous factors inside and outside the application, thus making it extremely difficult to isolate cause and effect for performance events. In this paper, we propose a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles. Using this methodology, we examine two I/O-intensive scientific computations from cosmology and climate science, and demonstrate that our approach can identify application and middleware performance deficiencies — resulting in more than 4× run time improvement for both examined applications.

Proceedings ArticleDOI
22 Nov 2010
TL;DR: In this study, a graphic processing unit (GPU) was used to perform FPM with GPU-FPM to speed up the process, and the experimental results showed that the speed-up ratio of GPU-FPM reaches 14.857 with 16 times as many threads.
Abstract: Extraction of frequent patterns from a transactional database is a fundamental task in data mining. Its applications include association rules, time series, etc. The Apriori approach is a commonly used generate-and-test approach to obtain frequent patterns from a database with a given threshold. Many parallel and distributed methods have been proposed for frequent pattern mining (FPM) to reduce computation time; however, most of them require a cluster or grid system. In this study, a graphic processing unit (GPU) was used to perform FPM with GPU-FPM to speed up the process. Because of GPU hardware limitations, a compact data structure was designed to store an entire database on the GPU. In addition, MemPack and CLProgram template classes were also designed. Two datasets with different conditions were used to verify the performance of GPU-FPM. The experimental results showed that the speed-up ratio of GPU-FPM reaches 14.857 with 16 times as many threads.
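For reference, a minimal serial generate-and-test Apriori (this sketch is ours, not GPU-FPM's kernel); the candidate-counting "test" step is what GPU-FPM parallelizes across threads, since it dominates the runtime:

```python
# Apriori: alternately count candidate itemsets against all transactions
# (test) and extend the survivors by one item (generate).
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets = {}, [frozenset([i]) for i in items]
    k = 1
    while k_sets:
        # test phase: count each candidate in every transaction
        counts = {c: sum(c <= t for t in transactions) for c in k_sets}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # generate phase: (k+1)-candidates whose k-subsets all survived
        units = sorted({i for c in survivors for i in c})
        k += 1
        k_sets = [frozenset(c) for c in combinations(units, k)
                  if all(frozenset(s) in survivors
                         for s in combinations(c, k - 1))]
    return frequent

freq = apriori([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], 2)
```

Each candidate-versus-transaction test is independent of the others, which is why mapping one candidate (or transaction block) per GPU thread works well.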

Patent
19 Jan 2010
TL;DR: In this article, a highly distributed multi-core system with an adaptive scheduler is provided, where applications can be executed in a distributed manner across several types of slave processing cores.
Abstract: There is provided a highly distributed multi-core system with an adaptive scheduler. By resolving data dependencies in a given list of parallel tasks and selecting a subset of tasks to execute based on provided software priorities, applications can be executed in a highly distributed manner across several types of slave processing cores. Moreover, by overriding provided priorities as necessary to adapt to hardware or other system requirements, the task scheduler may provide for low-level hardware optimizations that enable the timely completion of time-sensitive workloads, which may be of particular interest for real-time applications. Through this modularization of software development and hardware optimization, the conventional demand on application programmers to micromanage multi-core processing for optimal performance is thus avoided, thereby streamlining development and providing a higher quality end product.

Patent
30 Mar 2010
TL;DR: In this article, a Message Passing Interface (MPI) devolver enabled PPCA is in communication with the PPE and a host node, where the host node executes at least a parallel processing application and an MPI process.
Abstract: Parallel Processing Communication Accelerator (PPCA) systems and methods for enhancing performance of a Parallel Processing Environment (PPE). In an embodiment, a Message Passing Interface (MPI) devolver enabled PPCA is in communication with the PPE and a host node. The host node executes at least a parallel processing application and an MPI process. The MPI devolver communicates with the MPI process and the PPE to improve the performance of the PPE by offloading MPI process functionality to the PPCA. Offloading MPI processing to the PPCA frees the host node for other processing tasks, for example, executing the parallel processing application, thereby improving the performance of the PPE.

Journal ArticleDOI
TL;DR: This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.
Abstract: A wide class of finite-element (FE) electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture, and programmability make them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.
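For reference, the kernel in question in CSR (compressed sparse row) form; on a GPU each row's dot product becomes an independent work item (this serial sketch is ours, not the paper's algorithm):

```python
# Sparse matrix-vector multiply, CSR storage: values holds the nonzeros,
# col_idx their column indices, and row_ptr[r]:row_ptr[r+1] delimits row r.
def csr_spmv(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):         # one thread per row on a GPU
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# matrix [[10, 0, 0], [0, 20, 30]] in CSR form
values, col_idx, row_ptr = [10.0, 20.0, 30.0], [0, 1, 2], [0, 1, 3]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # -> [10.0, 130.0]
```

The difficulty the paper targets is that FE matrices give rows of very uneven length, so a naive row-per-thread mapping leaves many GPU threads idle; balancing that load is where SMVM algorithms differ.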

Journal ArticleDOI
TL;DR: The real-time display of full-range, 2048 axial pixel × 1024 lateral pixel, Fourier-domain optical-coherence tomography (FD-OCT) images is demonstrated using dual graphic processing units (GPUs) with many stream processors to realize highly parallel processing.
Abstract: The real-time display of full-range, 2048 axial pixel × 1024 lateral pixel, Fourier-domain optical-coherence tomography (FD-OCT) images is demonstrated. The required speed was achieved by using dual graphic processing units (GPUs) with many stream processors to realize highly parallel processing. We used a zero-filling technique, including a forward Fourier transform, a zero padding to increase the axial data-array size to 8192, an inverse Fourier transform back to the spectral domain, a linear interpolation from wavelength to wavenumber, a lateral Hilbert transform to obtain the complex spectrum, a Fourier transform to obtain the axial profiles, and a log scaling. The data-transfer time of the frame grabber was 15.73 ms, and the processing time, which includes the data transfer between the GPU memory and the host computer, was 14.75 ms, for a total time shorter than the 36.70 ms frame-interval time using a line-scan CCD camera operated at 27.9 kHz. That is, our OCT system achieved a processed-image display rate of 27.23 frames/s.

Patent
20 Oct 2010
TL;DR: In this paper, an overhead factor characterizing a change of a parallelism overhead of executing the task with nodes executing in parallel is calculated, relative to a change in a number of the nodes, based on the first performance measurement and the second performance measurement.
Abstract: A first performance measurement of an executing task may be determined, while the task is executed by a first number of nodes operating in parallel. A second performance measurement of the executing task may be determined, while the task is being executed by a second number of nodes operating in parallel. An overhead factor characterizing a change of a parallelism overhead of executing the task with nodes executing in parallel may then be calculated, relative to a change in a number of the nodes, based on the first performance measurement and the second performance measurement. Then, an optimal number of nodes to operate in parallel to continue executing the task may be determined, based on the overhead factor.
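Under an assumed cost model (ours, not the patent's), time(n) = W/n + c(n - 1), where W is the serial work and c the parallelism overhead added per node, the two measurements determine W and c, and minimizing time(n) gives an optimal node count of sqrt(W/c):

```python
# Fit the assumed model time(n) = W/n + c*(n - 1) from two measurements
# (n1, t1) and (n2, t2) by solving the resulting 2x2 linear system, then
# minimize: d(time)/dn = -W/n^2 + c = 0  =>  n* = sqrt(W / c).
def fit_overhead(n1, t1, n2, t2):
    a1, b1 = 1.0 / n1, n1 - 1.0
    a2, b2 = 1.0 / n2, n2 - 1.0
    det = a1 * b2 - a2 * b1
    work = (t1 * b2 - t2 * b1) / det      # serial work W
    overhead = (a1 * t2 - a2 * t1) / det  # per-node overhead factor c
    return work, overhead

def optimal_nodes(work, overhead):
    return round((work / overhead) ** 0.5)

work, overhead = fit_overhead(4, 253.0, 16, 77.5)
print(work, overhead, optimal_nodes(work, overhead))  # -> 1000.0 1.0 32
```

The patent's overhead factor is defined more generally (change in overhead relative to change in node count); any concrete model like the linear one above is one way to make that calculation explicit.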

Journal ArticleDOI
TL;DR: Today, many processors, including digital signal processors, mobile, graphics, and general-purpose central processing units (CPUs) have a multicore design, driven by the demand of higher performance.
Abstract: One of the recent innovations in computer engineering has been the development of multicore processors, which are composed of two or more independent cores in a single physical package. Today, many processors, including digital signal processors (DSPs), mobile, graphics, and general-purpose central processing units (CPUs) have a multicore design, driven by the demand of higher performance. Major CPU vendors have changed strategy away from increasing the raw clock rate to adding on-chip support for multithreading by increases in the number of cores; dual and quad-core processors are now commonplace. Signal and image processing programmers can benefit dramatically from these advances in hardware, by modifying single-threaded code to exploit parallelism to run on multiple cores.