
Showing papers on "Parallel processing (DSP implementation)" published in 2010


Journal ArticleDOI
TL;DR: It is suggested that GPUs have the potential to facilitate the growth of statistical modeling into complex data-rich domains through the availability of cheap and accessible many-core computation.
Abstract: We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35- to 500-fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data-rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
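A minimal sketch of the embarrassingly parallel pattern these methods rely on, assuming a toy Monte Carlo estimate of pi and CPU threads in place of GPU cores (our own illustration, not the paper's code; on CPython, threads show the decomposition rather than a true speedup):

```python
# Each worker simulates an independent batch with its own seeded RNG;
# the partial results are reduced (summed) at the end, mirroring the
# replica-per-core layout the paper uses on GPUs.
import random
from concurrent.futures import ThreadPoolExecutor

def mc_batch(seed, n):
    """Count hits inside the unit quarter-circle from n uniform draws."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def parallel_pi(total_samples, workers=4):
    per = total_samples // workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(mc_batch, range(workers), [per] * workers))
    return 4.0 * hits / (per * workers)

print(parallel_pi(100_000))
```

The same decomposition applies to the population-based samplers in the paper: each chain or particle evolves independently between (rare) global reduction steps.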

334 citations


Journal ArticleDOI
01 Mar 2010
TL;DR: A parallel implementation based on the insight segmentation and registration toolkit for a multiscale feature extraction and region growing algorithm, applied to retinal blood vessel segmentation, capable of achieving an accuracy comparable to its serial counterpart, but 8 to 10 times faster.
Abstract: This paper presents a parallel implementation, based on the insight segmentation and registration toolkit, of a multiscale feature extraction and region growing algorithm applied to retinal blood vessel segmentation. This implementation is capable of achieving an accuracy (Ac) comparable to its serial counterpart (about 92%), but 8 to 10 times faster. In this paper, the Ac of this parallel implementation is evaluated by comparison with expert manual segmentation (obtained from public databases). In addition, its performance is compared with previously published serial implementations. Both these characteristics make this parallel implementation feasible for the analysis of a larger amount of high-resolution retinal images, achieving a faster and high-quality segmentation of retinal blood vessels.

160 citations


Journal ArticleDOI
TL;DR: This paper considers the data expansion required to pass from the plaintext to the encrypted representation of signals, due to the use of cryptosystems operating on very large algebraic structures, and proposes a general composite signal representation.
Abstract: Signal processing tools working directly on encrypted data could provide an efficient solution to application scenarios where sensitive signals must be protected from an untrusted processing device. In this paper, we consider the data expansion required to pass from the plaintext to the encrypted representation of signals, due to the use of cryptosystems operating on very large algebraic structures. A general composite signal representation allowing us to pack together a number of signal samples and process them as a unique sample is proposed. The proposed representation permits us to speed up linear operations on encrypted signals via parallel processing and to reduce the size of the encrypted signal. A case study-1-D linear filtering-shows the merits of the proposed representation and provides some insights regarding the signal processing algorithms more suited to work on the composite representation.
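The packing idea can be sketched with plain integers: choose a per-slot radix large enough that sums never overflow a slot, and one addition on the packed value then adds every sample at once. The parameters below are illustrative assumptions, not those of any specific cryptosystem:

```python
# Composite-representation sketch: several small samples share one large
# integer, one "digit" (base-R slot) per sample, so a single addition on
# the packed value adds all samples componentwise.
B = 256           # samples assumed in [0, B)
HEADROOM = 16     # packed additions allowed before a slot can overflow
R = B * HEADROOM  # per-slot radix

def pack(samples):
    """Pack samples into one big integer, sample i in digit i of base R."""
    acc = 0
    for i, s in enumerate(samples):
        assert 0 <= s < B
        acc += s * (R ** i)
    return acc

def unpack(packed, k):
    """Recover k slots (valid while every slot stays below R)."""
    return [(packed // (R ** i)) % R for i in range(k)]

a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
print(unpack(a + b, 4))   # one addition acts on all four samples -> [11, 22, 33, 44]
```

In the encrypted setting the packed integer plays the role of a single plaintext, so one homomorphic operation processes a whole block of samples, which is the source of both the speedup and the reduced ciphertext size.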

147 citations


Proceedings ArticleDOI
30 Nov 2010
TL;DR: This paper creates a spatial index, Voronoi diagram, for given data points in 2D space and enables efficient processing of a wide range of GQs with the MapReduce programming model.
Abstract: Geospatial queries (GQ) have been used in a wide variety of applications such as decision support systems, profile-based marketing, bioinformatics and GIS. Most of the existing query-answering approaches assume centralized processing on a single machine although GQs are intrinsically parallelizable. Some approaches have been designed for parallel databases and cluster systems; however, these apply only to systems with limited parallel processing capability, far below that of cloud-based platforms. In this paper, we study the problem of parallel geospatial query processing with the MapReduce programming model. Our proposed approach creates a spatial index, a Voronoi diagram, for the given data points in 2D space and enables efficient processing of a wide range of GQs. We evaluated the performance of our proposed techniques and compared them with their closest related work while varying the number of employed nodes.
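A much-simplified, single-machine sketch of the partitioning idea (the pivot set and brute-force per-cell search are our own illustration, not the paper's MapReduce jobs): points are bucketed by nearest pivot, i.e. by Voronoi cell, so a query can be answered per-cell in parallel and the partial results reduced to a global best.

```python
# Bucket points by Voronoi cell of a small pivot set, then answer a
# nearest-neighbor query as a per-cell "map" followed by a global "reduce".
from math import dist

def build_index(points, pivots):
    cells = {i: [] for i in range(len(pivots))}
    for p in points:
        cell = min(range(len(pivots)), key=lambda i: dist(p, pivots[i]))
        cells[cell].append(p)
    return cells

def nearest(query, cells):
    # "map": best candidate within each cell; "reduce": global minimum.
    candidates = [min(pts, key=lambda p: dist(query, p))
                  for pts in cells.values() if pts]
    return min(candidates, key=lambda p: dist(query, p))
```

Searching every cell keeps this exact; a real Voronoi index prunes cells that cannot contain the answer, which is where the efficiency claimed in the paper comes from.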

132 citations


Journal ArticleDOI
TL;DR: A spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior is presented, which captures the detailed dynamics of human behavior during dual-task-performance, including both mean RTs and RT distributions.
Abstract: The human brain efficiently solves certain operations such as object recognition and categorization through a massively parallel network of dedicated processors. However, human cognition also relies on the ability to perform an arbitrarily large set of tasks by flexibly recombining different processors into a novel chain. This flexibility comes at the cost of a severe slowing down and a seriality of operations (100–500 ms per step). A limit on parallel processing is demonstrated in experimental setups such as the psychological refractory period (PRP) and the attentional blink (AB) in which the processing of an element either significantly delays (PRP) or impedes conscious access (AB) of a second, rapidly presented element. Here we present a spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior. The precise mapping of incoming sensory stimuli onto motor representations relies on a “router” network capable of flexibly interconnecting processors and rapidly changing its configuration from one task to another. Simulations show that, when presented with dual-task stimuli, the network exhibits parallel processing at peripheral sensory levels, a memory buffer capable of keeping the result of sensory processing on hold, and a slow serial performance at the router stage, resulting in a performance bottleneck. The network captures the detailed dynamics of human behavior during dual-task-performance, including both mean RTs and RT distributions, and establishes concrete predictions on neuronal dynamics during dual-task experiments in humans and non-human primates.

129 citations


Patent
06 Oct 2010
TL;DR: In this paper, a memory device includes a plurality of storage units in which to store data of a bank, wherein the data has a logical order prior to storage and a physical order different than the logical order within the plurality of storage units, and a within-device reordering unit to reorder the data of the bank into the logical order prior to performing on-chip processing.
Abstract: A memory device includes a plurality of storage units in which to store data of a bank, wherein the data has a logical order prior to storage and a physical order different than the logical order within the plurality of storage units and a within-device reordering unit to reorder the data of a bank into the logical order prior to performing on-chip processing. In another embodiment, the memory device includes an external device interface connectable to an external device communicating with the memory device, an internal processing element to process data stored on the device and multiple banks of storage. Each bank includes a plurality of storage units and each storage unit has two ports, an external port connectable to the external device interface and an internal port connected to the internal processing element.

129 citations


Book ChapterDOI
Zhenhua Lv1, Yingjie Hu1, Haidong Zhong1, Jianping Wu1, Bo Li1, Hui Zhao1 
23 Oct 2010
TL;DR: The color representation of RS images, which means pixels need to be translated into a particular color space CIELAB that is more suitable for distinguishing colors is described, and the programming model MapReduce and a platform Hadoop are briefly introduced.
Abstract: The K-Means clustering is a basic method for analyzing RS (remote sensing) images, which generates a direct overview of objects. Usually, such work can be done by software (e.g. ENVI, ERDAS IMAGINE) on personal computers. However, for PCs, the limitation of hardware resources and the tolerance of time consumption present a bottleneck in processing a large amount of RS images. The techniques of parallel computing and distributed systems are no doubt the suitable choices. Unlike traditional approaches, in this paper we parallelize this algorithm on Hadoop, an open-source system that implements the MapReduce programming model. The paper first describes the color representation of RS images, which means pixels need to be translated into a particular color space, CIELAB, that is more suitable for distinguishing colors. It also gives an overview of traditional K-Means. Then the MapReduce programming model and the Hadoop platform are briefly introduced. This model requires customized 'map/reduce' functions, allowing users to parallelize processing in two stages. In addition, the paper details the map and reduce functions in pseudo-code, and reports performance results based on the experiments. The paper shows that the results are acceptable and may also inspire other approaches to tackling similar problems within the field of remote sensing applications.
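A single-machine rendition of the map/reduce split for one K-Means iteration might look as follows (function names are ours; the paper gives its own pseudo-code): map assigns each pixel to its nearest centroid, and reduce averages each cluster to produce the next centroids.

```python
# One K-Means iteration expressed as map (per-pixel, parallelizable) and
# reduce (per-cluster averaging), the structure Hadoop distributes.
from math import dist

def kmeans_map(pixel, centroids):
    key = min(range(len(centroids)), key=lambda i: dist(pixel, centroids[i]))
    return key, pixel

def kmeans_reduce(pixels):
    n = len(pixels)
    return tuple(sum(coord) / n for coord in zip(*pixels))

def kmeans_iteration(pixels, centroids):
    groups = {}
    for pixel in pixels:                       # map phase
        key, value = kmeans_map(pixel, centroids)
        groups.setdefault(key, []).append(value)
    return [kmeans_reduce(vs) for _, vs in sorted(groups.items())]  # reduce

data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(kmeans_iteration(data, [(0.0, 0.0), (10.0, 10.0)]))  # -> [(0.0, 0.5), (10.0, 10.5)]
```

Iterating this until the centroids stop moving is the full algorithm; in Hadoop each iteration is one MapReduce job over the pixel set.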

127 citations


Proceedings ArticleDOI
13 Jun 2010
TL;DR: It is argued that adopting a best-effort service model for various software and hardware components of the computing platform stack can lead to drastic improvements in scalability and large improvements in performance and energy efficiency.
Abstract: With the advent of mainstream parallel computing, applications can obtain better performance only by scaling to platforms with larger numbers of cores. This is widely considered to be a very challenging problem due to the difficulty of parallel programming and the bottlenecks to efficient parallel execution. Inspired by how networking and storage systems have scaled to handle very large volumes of packet traffic and persistent data, we propose a new approach to the design of scalable, parallel computing platforms. For decades, computing platforms have gone to great lengths to ensure that every computation specified by applications is faithfully executed. While this design philosophy has remained largely unchanged, applications and the basic characteristics of their workloads have changed considerably. A wide range of existing and emerging computing workloads have an inherent forgiving nature. We therefore argue that adopting a best-effort service model for various software and hardware components of the computing platform stack can lead to drastic improvements in scalability. Applications are cognizant of the best-effort model, and separate their computations into those that may be executed on a best-effort basis and those that require the traditional execution guarantees. Best-effort computations may be exploited to simply reduce the computing workload, shape it to be more suitable for parallel execution, or execute it on unreliable hardware components. Guaranteed computations are realized either through an overlay software layer on top of the best-effort substrate, or through the use of application-specific strategies. We describe a system architecture for a best-effort computing platform, provide examples of parallel software and hardware that embody the best-effort model, and show that large improvements in performance and energy efficiency are possible through the adoption of this approach.

124 citations


Proceedings ArticleDOI
17 Nov 2010
TL;DR: The experimental result shows that the implementation of Canny edge detection algorithm on CUDA achieves a speedup factor of 61 over a conventional software implementation.
Abstract: Recent GPUs, which have many processing units connected to a global memory, can be used for general-purpose parallel computation. Users can develop parallel programs running on GPUs using a programming architecture called CUDA (Compute Unified Device Architecture). The main contribution of this paper is to implement a Canny edge detection algorithm on CUDA. The experimental result shows that our implementation of the Canny edge detection algorithm on CUDA achieves a speedup factor of 61 over a conventional software implementation.

120 citations


Journal ArticleDOI
TL;DR: Designers now accept that although transistors will still get smaller and more numerous on each chip, they aren't going to operate faster than they do today, which explains the shift to assembling them into multiple microprocessor cores instead.
Abstract: Designers now accept that although transistors will still get smaller and more numerous on each chip, they aren't going to operate faster than they do today. And if you tried to incorporate all those transistors into one giant microprocessor, you might well end up with a device that couldn't compute any faster than the chip it was replacing, which explains the shift to assembling them into multiple microprocessor cores instead.

116 citations


Journal ArticleDOI
TL;DR: This paper has built a correlator and a beamformer, using PCI-based ADC cards and a Linux cluster of 48 nodes with dual gigabit inter-node connectivity for real-time data transfer requirements, and believes this is the first instance of such a real- time observatory backend for an intermediate sized array like the GMRT.
Abstract: The new era of software signal processing has a large impact on radio astronomy instrumentation. Our design and implementation of a 32 antennae, 33 MHz, dual polarization, fully real-time software backend for the GMRT, using only off-the-shelf components, is an example of this. We have built a correlator and a beamformer, using PCI-based ADC cards and a Linux cluster of 48 nodes with dual gigabit inter-node connectivity for real-time data transfer requirements. The highly optimized compute pipeline uses cache efficient, multi-threaded parallel code, with the aid of vectorized processing. This backend allows flexibility in final time and frequency resolutions, and the ability to implement algorithms for radio frequency interference rejection. Our approach has allowed relatively rapid development of a fairly sophisticated and flexible backend receiver system for the GMRT, which will greatly enhance the productivity of the telescope. In this paper we describe some of the first lights using this software processing pipeline. We believe this is the first instance of such a real-time observatory backend for an intermediate sized array like the GMRT.

Proceedings ArticleDOI
17 Aug 2010
TL;DR: Barra, a simulator of Graphics Processing Units (GPU) tuned for general purpose processing (GPGPU), is presented, based on the UNISIM framework and it simulates the native instruction set of the Tesla architecture at the functional level.
Abstract: We present Barra, a simulator of Graphics Processing Units (GPU) tuned for general purpose processing (GPGPU). It is based on the UNISIM framework and it simulates the native instruction set of the Tesla architecture at the functional level. The inputs are CUDA executables produced by NVIDIA tools. No alterations are needed to perform simulations. As it uses parallelism, Barra generates detailed statistics on executions in about the time needed by CUDA to operate in emulation mode. We use it to understand and explore the micro-architecture design spaces of GPUs.

Journal ArticleDOI
TL;DR: This review will examine several key studies, primarily electrophysiological, that have tested the hypothesis that the primate auditory cortex is organized in a serial and parallel manner in which there is a dorsal stream processing spatial information and a ventral stream processing non-spatial information.

Patent
Jeffrey Dean1, Sanjay Ghemawat1
12 Jan 2010
TL;DR: A large-scale data processing system and method for processing data in a distributed and parallel processing environment is described in this article, which includes an application-independent framework for processing the data having a plurality of application-independent map modules and reduce modules.
Abstract: A large-scale data processing system and method for processing data in a distributed and parallel processing environment. The system includes an application-independent framework for processing data having a plurality of application-independent map modules and reduce modules. These application-independent modules use application-independent operators to automatically handle parallelization of computations across the distributed and parallel processing environment when performing user-specified data processing operations. The system also includes a plurality of user-specified, application-specific operators, for use with the application-independent framework to perform a user-specified data processing operation on a user-specified set of input files. The application-specific operators include: a map operator and a reduce operator. The map operator is applied by the application-independent map modules to input data in the user-specified set of input files to produce intermediate data values. The reduce operator is applied by the application-independent reduce modules to process the intermediate data values to produce final output data.
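The framework/operator split the patent describes can be sketched in miniature, with an application-independent runner and user-supplied operators (word count is our example, not the patent's):

```python
# Application-independent runner: shuffles map outputs by key, then applies
# the reduce operator per key. Users supply only map_op and reduce_op.
from collections import defaultdict

def run_mapreduce(inputs, map_op, reduce_op):
    intermediate = defaultdict(list)
    for record in inputs:                 # map phase: parallelizable per record
        for key, value in map_op(record):
            intermediate[key].append(value)
    return {key: reduce_op(key, values)   # reduce phase: parallelizable per key
            for key, values in intermediate.items()}

# User-specified, application-specific operators (word count).
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(["a b a", "b c"], wc_map, wc_reduce))  # -> {'a': 2, 'b': 2, 'c': 1}
```

The real system adds what this sketch omits: partitioning across machines, intermediate files, and fault tolerance, all handled by the application-independent layer.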

Journal ArticleDOI
TL;DR: The proposed strategy is demonstrated on a coupled analysis of an existing reactor vessel; the parallel processing leads to very good speedup and also enables solving significantly larger problems in acceptable time.

Journal ArticleDOI
TL;DR: This work proposes a novel method named sort and count for efficient parallelization of mutual information (MI) computation designed for massively multi-processing architectures that achieves real-time (less than 1s) rigid registration of 3D medical images using a commodity graphics processing unit (GPU).

Proceedings ArticleDOI
21 Jun 2010
TL;DR: An integral image algorithm that can run in real-time on a Graphics Processing Unit (GPU) via the NVIDIA CUDA programming model that makes use of the work-efficient scan algorithm that is explicated elsewhere.
Abstract: We present an integral image algorithm that can run in real-time on a Graphics Processing Unit (GPU). Our system exploits the parallelism in the computation via the NVIDIA CUDA programming model, which is a software platform for solving non-graphics problems in a massively parallel high-performance fashion. This implementation makes use of the work-efficient scan algorithm that is explicated elsewhere. Treating the rows and the columns of the target image as independent input arrays for the scan algorithm, our method manages to expose a second level of parallelism in the problem. We compare the performance of the parallel approach running on the GPU with the sequential CPU implementation across a range of image sizes and report a speed-up by a factor of 8 for a 4-megapixel input. We further investigate the impact of using packed vector type data on the performance, as well as the effect of double-precision arithmetic on the GPU.
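The two-level decomposition can be shown without a GPU: prefix-sum every row, then every column of the result. In this pure-Python sketch (our own illustration), each row or column scan is the unit that CUDA would hand to an independent work item:

```python
# Integral image via two scan passes: rows first, then columns of the
# row-scanned result. Each row/column is an independent prefix sum.
def integral_image(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):                 # scan rows (mutually independent)
        for x in range(1, w):
            out[y][x] += out[y][x - 1]
    for x in range(w):                 # scan columns (mutually independent)
        for y in range(1, h):
            out[y][x] += out[y - 1][x]
    return out

print(integral_image([[1, 2], [3, 4]]))  # -> [[1, 3], [4, 10]]
```

On the GPU, each serial inner loop is replaced by the work-efficient parallel scan the abstract cites, giving parallelism both across scans and within each scan.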

Proceedings ArticleDOI
01 Dec 2010
TL;DR: A new parallel Canny edge detector FPGA implementation is proposed in this paper that takes advantage of 4-pixel parallel computations to achieve high throughput without increasing the on-chip memory demands.
Abstract: Edge detection is one of the most fundamental algorithms in digital image processing. The Canny edge detector is the most implemented edge detection algorithm because of its ability to detect edges even in images that are intensely contaminated by noise. However, it is a time-consuming algorithm, and its implementations therefore struggle to reach real-time response speeds. Especially nowadays, when the demand for high-resolution image processing is constantly increasing, the need for fast and efficient edge detector implementations is ever more pressing. A new parallel Canny edge detector FPGA implementation is proposed in this paper to answer this demand. This design takes advantage of 4-pixel parallel computations to achieve high throughput without increasing the on-chip memory demands. Synthesis and simulation results are presented to prove the design's efficiency and high frames-per-second rate.

Patent
16 Dec 2010
TL;DR: In this paper, a modular data deduplication pipeline is presented, which allows modules to be replaced, selected or extended, e.g., by selecting modules to increase dedupling quality, performance and/or throughput.
Abstract: The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback.
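An illustrative miniature of such a modular pipeline, with pluggable chunking and compression stages (the specific choices here, fixed-size chunks, SHA-256 keys and zlib, are our assumptions, not the patent's):

```python
# Modular dedup pipeline: the chunker and compressor are swappable
# callables; identical chunks are stored once, keyed by their hash,
# and a "recipe" of hashes lets the original data be rebuilt.
import hashlib
import zlib

def fixed_chunker(data, size=4):
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedup_pipeline(data, chunker=fixed_chunker, compress=zlib.compress):
    store, recipe = {}, []
    for chunk in chunker(data):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:        # only previously unseen chunks are stored
            store[digest] = compress(chunk)
        recipe.append(digest)
    return store, recipe

def rebuild(store, recipe, decompress=zlib.decompress):
    return b"".join(decompress(store[d]) for d in recipe)

store, recipe = dedup_pipeline(b"abcdabcdabcdxyz!")
print(len(store), len(recipe))  # 2 unique chunks referenced 4 times
```

Because each stage is a parameter, a content-defined chunker or a different compressor can be dropped in per data type, which is the pipeline tunability the patent emphasizes.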

Journal ArticleDOI
TL;DR: This paper outlines a CT reconstruction approach for ET that is optimized for the special demands and application setting of ET and describes a novel GPU-amenable approach that effectively compensates for reconstruction errors resulting from the TEM data acquisition on (long) samples which extend the width of the parallel TEM beam.

Proceedings ArticleDOI
27 Mar 2010
TL;DR: Experimental results indicate that, even when the data transfer time between host memory and device memory is taken into account, the two algorithms run approximately 25 and 49 times as fast on the GPU as on the CPU, respectively, and the GPU is practical for image processing.
Abstract: To address the compute-intensive character of image processing, a parallel acceleration technique is proposed that builds on the advantages of GPU parallel operation. First, the GPU architecture is introduced and its computational efficiency is compared with that of the CPU. Then, the Sobel edge detector and homomorphic filtering, two representative image processing algorithms, are implemented on the GPU to validate the technique. Finally, image data of different resolutions are tested on the CPU and GPU hardware platforms to compare their computational efficiency. Experimental results indicate that, even when the data transfer time between host memory and device memory is taken into account, the two algorithms run approximately 25 and 49 times as fast on the GPU as on the CPU, respectively, and that the GPU is practical for image processing.
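The Sobel stage lends itself to a one-output-pixel-per-thread mapping on the GPU; a serial sketch of the operator itself (our own illustration, not the paper's kernel) shows why each pixel is independent:

```python
# Sobel gradient magnitude: every interior (y, x) reads a fixed 3x3
# neighborhood and writes one output value, so all pixels can be
# computed concurrently (one GPU thread per pixel).
def sobel_magnitude(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):          # each (y, x) is an independent task
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

A vertical step edge produces a strong horizontal gradient (gx) and zero vertical gradient (gy), so the magnitude ridge lands on the edge column.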

Proceedings ArticleDOI
03 May 2010
TL;DR: An active storage system that allows data analysis, mining, and statistical operations to be executed from within a parallel I/O interface that consistently outperforms the traditional storage model with a wide variety of input dataset sizes, number of nodes, and computational loads is proposed.
Abstract: As data sizes continue to increase, the concept of active storage is well fitted for many data analysis kernels. Nevertheless, while this concept has been investigated and deployed in a number of forms, enabling it from the parallel I/O software stack has been largely unexplored. In this paper, we propose and evaluate an active storage system that allows data analysis, mining, and statistical operations to be executed from within a parallel I/O interface. In our proposed scheme, common analysis kernels are embedded in parallel file systems. We expose the semantics of these kernels to parallel file systems through an enhanced runtime interface so that execution of embedded kernels is possible on the server. In order to allow complete server-side operations without file format or layout manipulation, our scheme adjusts the file I/O buffer to the computational unit boundary on the fly. Our scheme also uses server-side collective communication primitives for reduction and aggregation using interserver communication. We have implemented a prototype of our active storage system and demonstrate its benefits using four data analysis benchmarks. Our experimental results show that our proposed system improves the overall performance of all four benchmarks by 50.9% on average and that the compute-intensive portion of the k-means clustering kernel can be improved by 58.4% through GPU offloading when executed with a larger computational load. We also show that our scheme consistently outperforms the traditional storage model with a wide variety of input dataset sizes, number of nodes, and computational loads.

Proceedings ArticleDOI
19 Apr 2010
TL;DR: This paper proposes a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles and demonstrates that this approach can identify application and middleware performance deficiencies — resulting in more than 4× run time improvement for both examined applications.
Abstract: Parallel I/O is fast becoming a bottleneck to the research agendas of many users of extreme scale parallel computers. The principle cause of this is the concurrency explosion of high-end computation, coupled with the complexity of providing parallel file systems that perform reliably at such scales. More than just being a bottleneck, parallel I/O performance at scale is notoriously variable, being influenced by numerous factors inside and outside the application, thus making it extremely difficult to isolate cause and effect for performance events. In this paper, we propose a statistical approach to understanding I/O performance that moves from the analysis of performance events to the exploration of performance ensembles. Using this methodology, we examine two I/O-intensive scientific computations from cosmology and climate science, and demonstrate that our approach can identify application and middleware performance deficiencies — resulting in more than 4× run time improvement for both examined applications.

Proceedings ArticleDOI
22 Nov 2010
TL;DR: In this study, a graphic processing unit (GPU) was used to perform FPM with GPU-FPM to speed up the process, and the experimental results showed that the speed-up ratio of GPU-FPM reaches 14.857 with 16 times as many threads.
Abstract: Extraction of frequent patterns from a transactional database is a fundamental task in data mining. Its applications include association rules, time series, etc. The Apriori approach is a commonly used generate-and-test approach to obtain frequent patterns from a database with a given threshold. Many parallel and distributed methods have been proposed for frequent pattern mining (FPM) to reduce computation time; however, most of them require a cluster or grid system. In this study, a graphic processing unit (GPU) was used to perform FPM with GPU-FPM to speed up the process. Because of GPU hardware limitations, a compact data structure was designed to store an entire database on the GPU. In addition, MemPack and CLProgram template classes were also designed. Two datasets with different conditions were used to verify the performance of GPU-FPM. The experimental results showed that the speed-up ratio of GPU-FPM reaches 14.857 with 16 times as many threads.
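For reference, a minimal serial generate-and-test Apriori (this sketch is ours, not GPU-FPM's kernel); the candidate-counting "test" step is what GPU-FPM parallelizes across threads, since it dominates the runtime:

```python
# Apriori: alternately count candidate itemsets against all transactions
# (test) and extend the survivors by one item (generate).
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    frequent, k_sets = {}, [frozenset([i]) for i in items]
    k = 1
    while k_sets:
        # test phase: count each candidate in every transaction
        counts = {c: sum(c <= t for t in transactions) for c in k_sets}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # generate phase: (k+1)-candidates whose k-subsets all survived
        units = sorted({i for c in survivors for i in c})
        k += 1
        k_sets = [frozenset(c) for c in combinations(units, k)
                  if all(frozenset(s) in survivors
                         for s in combinations(c, k - 1))]
    return frequent

freq = apriori([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], 2)
```

Each candidate-versus-transaction test is independent of the others, which is why mapping one candidate (or transaction block) per GPU thread works well.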

Patent
19 Jan 2010
TL;DR: In this article, a highly distributed multi-core system with an adaptive scheduler is provided, where applications can be executed in a distributed manner across several types of slave processing cores.
Abstract: There is provided a highly distributed multi-core system with an adaptive scheduler. By resolving data dependencies in a given list of parallel tasks and selecting a subset of tasks to execute based on provided software priorities, applications can be executed in a highly distributed manner across several types of slave processing cores. Moreover, by overriding provided priorities as necessary to adapt to hardware or other system requirements, the task scheduler may provide for low-level hardware optimizations that enable the timely completion of time-sensitive workloads, which may be of particular interest for real-time applications. Through this modularization of software development and hardware optimization, the conventional demand on application programmers to micromanage multi-core processing for optimal performance is thus avoided, thereby streamlining development and providing a higher quality end product.

Patent
30 Mar 2010
TL;DR: In this article, a Message Passing Interface (MPI) devolver enabled PPCA is in communication with the PPE and a host node, where the host node executes at least a parallel processing application and an MPI process.
Abstract: Parallel Processing Communication Accelerator (PPCA) systems and methods for enhancing performance of a Parallel Processing Environment (PPE). In an embodiment, a Message Passing Interface (MPI) devolver enabled PPCA is in communication with the PPE and a host node. The host node executes at least a parallel processing application and an MPI process. The MPI devolver communicates with the MPI process and the PPE to improve the performance of the PPE by offloading MPI process functionality to the PPCA. Offloading MPI processing to the PPCA frees the host node for other processing tasks, for example, executing the parallel processing application, thereby improving the performance of the PPE.

Journal ArticleDOI
TL;DR: This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.
Abstract: A wide class of finite-element (FE) electromagnetic applications requires computing very large sparse matrix vector multiplications (SMVM). Due to the sparsity pattern and size of the matrices, solvers can run relatively slowly. The rapid evolution of graphic processing units (GPUs) in performance, architecture, and programmability make them very attractive platforms for accelerating computationally intensive kernels such as SMVM. This work presents a new algorithm to accelerate the performance of the SMVM kernel on graphic processing units.
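For reference, the kernel in question in CSR (compressed sparse row) form; on a GPU each row's dot product becomes an independent work item (this serial sketch is ours, not the paper's algorithm):

```python
# Sparse matrix-vector multiply, CSR storage: values holds the nonzeros,
# col_idx their column indices, and row_ptr[r]:row_ptr[r+1] delimits row r.
def csr_spmv(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):         # one thread per row on a GPU
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# matrix [[10, 0, 0], [0, 20, 30]] in CSR form
values, col_idx, row_ptr = [10.0, 20.0, 30.0], [0, 1, 2], [0, 1, 3]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # -> [10.0, 130.0]
```

The difficulty the paper targets is that FE matrices give rows of very uneven length, so a naive row-per-thread mapping leaves many GPU threads idle; balancing that load is where SMVM algorithms differ.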

Journal ArticleDOI
TL;DR: The real-time display of full-range, 2048 axial pixel × 1024 lateral pixel, Fourier-domain optical-coherence tomography (FD-OCT) images is demonstrated using dual graphic processing units (GPUs) with many stream processors to realize highly parallel processing.
Abstract: The real-time display of full-range, 2048 axial pixel × 1024 lateral pixel, Fourier-domain optical-coherence tomography (FD-OCT) images is demonstrated. The required speed was achieved by using dual graphic processing units (GPUs) with many stream processors to realize highly parallel processing. We used a zero-filling technique, including a forward Fourier transform, a zero padding to increase the axial data-array size to 8192, an inverse Fourier transform back to the spectral domain, a linear interpolation from wavelength to wavenumber, a lateral Hilbert transform to obtain the complex spectrum, a Fourier transform to obtain the axial profiles, and a log scaling. The data-transfer time of the frame grabber was 15.73 ms, and the processing time, which includes the data transfer between the GPU memory and the host computer, was 14.75 ms, for a total time shorter than the 36.70 ms frame-interval time using a line-scan CCD camera operated at 27.9 kHz. That is, our OCT system achieved a processed-image display rate of 27.23 frames/s.

Patent
20 Oct 2010
TL;DR: In this paper, an overhead factor characterizing a change of a parallelism overhead of executing the task with nodes executing in parallel is calculated, relative to a change in a number of the nodes, based on the first performance measurement and the second performance measurement.
Abstract: A first performance measurement of an executing task may be determined, while the task is executed by a first number of nodes operating in parallel. A second performance measurement of the executing task may be determined, while the task is being executed by a second number of nodes operating in parallel. An overhead factor characterizing a change of a parallelism overhead of executing the task with nodes executing in parallel may then be calculated, relative to a change in a number of the nodes, based on the first performance measurement and the second performance measurement. Then, an optimal number of nodes to operate in parallel to continue executing the task may be determined, based on the overhead factor.
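Under an assumed cost model (ours, not the patent's), time(n) = W/n + c(n - 1), where W is the serial work and c the parallelism overhead added per node, the two measurements determine W and c, and minimizing time(n) gives an optimal node count of sqrt(W/c):

```python
# Fit the assumed model time(n) = W/n + c*(n - 1) from two measurements
# (n1, t1) and (n2, t2) by solving the resulting 2x2 linear system, then
# minimize: d(time)/dn = -W/n^2 + c = 0  =>  n* = sqrt(W / c).
def fit_overhead(n1, t1, n2, t2):
    a1, b1 = 1.0 / n1, n1 - 1.0
    a2, b2 = 1.0 / n2, n2 - 1.0
    det = a1 * b2 - a2 * b1
    work = (t1 * b2 - t2 * b1) / det      # serial work W
    overhead = (a1 * t2 - a2 * t1) / det  # per-node overhead factor c
    return work, overhead

def optimal_nodes(work, overhead):
    return round((work / overhead) ** 0.5)

work, overhead = fit_overhead(4, 253.0, 16, 77.5)
print(work, overhead, optimal_nodes(work, overhead))  # -> 1000.0 1.0 32
```

The patent's overhead factor is defined more generally (change in overhead relative to change in node count); any concrete model like the linear one above is one way to make that calculation explicit.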

Journal ArticleDOI
TL;DR: Today, many processors, including digital signal processors, mobile, graphics, and general-purpose central processing units (CPUs) have a multicore design, driven by the demand of higher performance.
Abstract: One of the recent innovations in computer engineering has been the development of multicore processors, which are composed of two or more independent cores in a single physical package. Today, many processors, including digital signal processors (DSPs), mobile, graphics, and general-purpose central processing units (CPUs) have a multicore design, driven by the demand of higher performance. Major CPU vendors have changed strategy away from increasing the raw clock rate to adding on-chip support for multithreading by increases in the number of cores; dual and quad-core processors are now commonplace. Signal and image processing programmers can benefit dramatically from these advances in hardware, by modifying single-threaded code to exploit parallelism to run on multiple cores.