Showing papers on "Parallel processing (DSP implementation)" published in 2004


Journal ArticleDOI
TL;DR: A theory of stochastic interactive parallel processing with special emphasis on channel interactions and their relation to system capacity is presented, and new theorems relating response time performance in these designs to earlier and novel issues are established.
Abstract: The authors present a theory of stochastic interactive parallel processing with special emphasis on channel interactions and their relation to system capacity. The approach is based both on linear systems theory augmented with stochastic elements and decisional operators and on a metatheory of parallel channels' dependencies that incorporates standard independent and coactive parallel models as special cases. The metatheory is applied to OR and AND experimental paradigms, and the authors establish new theorems relating response time performance in these designs to earlier and novel issues. One notable outcome is the remarkable processing efficiency associated with linear parallel-channel systems that include mutually positive interactions. The results may offer insight into perceptual and cognitive configural-holistic processing systems.

246 citations


Proceedings ArticleDOI
19 Jun 2004
TL;DR: Experimental results indicate that the extent of information exchange among subpopulations assigned to different processor nodes has a significant impact on the performance of the algorithm.
Abstract: Parallel processing has emerged as a key enabling technology in modern computing. Recent software advances have allowed collections of heterogeneous computers to be used as a concurrent computational resource. In this work we explore how differential evolution can be parallelized, using a ring-network topology, so as to improve both the speed and the performance of the method. Experimental results indicate that the extent of information exchange among subpopulations assigned to different processor nodes has a significant impact on the performance of the algorithm. Furthermore, not all the mutation strategies of the differential evolution algorithm are equally sensitive to the value of this parameter.

242 citations
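
The ring-network exchange described in the abstract above can be illustrated with a minimal sketch. The abstract does not state the migration policy, so the choices here are assumptions: individuals are plain numbers, `ring_migrate` and `fitness` are hypothetical names, each subpopulation sends its best individual to its successor in the ring, and the incoming individual replaces the receiver's worst one. How often this step runs would correspond to the "extent of information exchange" parameter.

```python
def ring_migrate(subpops, fitness):
    """One migration step on a ring of subpopulations (lower fitness is better).

    Assumed policy: every subpopulation receives the best individual of its
    predecessor in the ring and overwrites its own worst individual with it.
    """
    # Pick each subpopulation's best individual before anything is replaced.
    bests = [min(pop, key=fitness) for pop in subpops]
    migrated = []
    for i, pop in enumerate(subpops):
        incoming = bests[i - 1]  # index -1 wraps around, closing the ring
        worst = max(range(len(pop)), key=lambda j: fitness(pop[j]))
        new_pop = list(pop)
        new_pop[worst] = incoming
        migrated.append(new_pop)
    return migrated


# Toy example: minimise |x| over three subpopulations of two individuals each.
subpops = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]
after = ring_migrate(subpops, fitness=abs)
```

In a real parallel run each subpopulation would live on its own processor node and the exchange would be a message between neighbours; the list-based version only shows the data flow.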


Patent
Jeffrey Dean1, Sanjay Ghemawat1
18 Jun 2004
TL;DR: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment.
Abstract: A large-scale data processing system and method includes one or more application-independent map modules configured to read input data and to apply at least one application-specific map operation to the input data to produce intermediate data values, wherein the map operation is automatically parallelized across multiple processors in the parallel processing environment. A plurality of intermediate data structures are used to store the intermediate data values. One or more application-independent reduce modules are configured to retrieve the intermediate data values and to apply at least one application-specific reduce operation to the intermediate data values to provide output data.

193 citations
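
The split between application-independent modules and application-specific operations in the abstract above can be sketched in a few lines. This is an illustrative single-machine toy, not the patented system: `map_reduce`, `word_count_map`, and `sum_reduce` are hypothetical names, a thread pool stands in for the multiple processors, and an in-memory dictionary stands in for the intermediate data structures.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def word_count_map(record):
    """Application-specific map operation (hypothetical): emit (word, 1) pairs."""
    return [(word, 1) for word in record.split()]


def sum_reduce(key, values):
    """Application-specific reduce operation (hypothetical): sum the values."""
    return key, sum(values)


def map_reduce(records, map_fn, reduce_fn, workers=4):
    """Application-independent driver: parallel map, group by key, then reduce."""
    # Map phase: the driver, not the application, spreads map_fn over workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, records))
    # Intermediate data structure: values grouped by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = map_reduce(["a b a", "b c"], word_count_map, sum_reduce)
```

Only the two small functions change from one application to the next; the driver that handles parallelism and grouping stays the same, which is the design point the abstract emphasises.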


Journal ArticleDOI
TL;DR: Two parallel genetic algorithm (PGA) models for the transit route network design (TRND) problem for urban bus operation are proposed, and it is observed that the global PVM model performed better than the other model.
Abstract: A transit route network design (TRND) problem for urban bus operation involves the determination of a set of transit routes and the associated frequencies that achieve the desired objective. This can be formulated as an optimization problem of minimizing the total system cost, which is the sum of the operating cost and the generalized travel cost. A review of previous approaches to solve this problem reveals the deficiency of conventional optimization techniques and the suitability of genetic algorithm (GA) based models to handle such combinatorial optimization problems. Since GAs are computationally intensive optimization techniques, their application to large and complex problems is limited. The computational performance of a GA model can be improved by exploiting its inherent parallel nature. Accordingly, two parallel genetic algorithm (PGA) models are proposed in this study. The first is a global parallel virtual machine (PVM) parallel GA model where the fitness evaluation is done concurrently in a parallel processing environment using PVM libraries. The second is a global message passing interface (MPI) parallel GA model where an MPI environment substitutes for the PVM libraries. An existing GA model for TRND for a large city is used as a case study. These models are tested for computation time, speedup, and efficiency. From the study, it is observed that the global PVM model performed better than the other model.

112 citations


Journal ArticleDOI
TL;DR: Differences in their projections that support the notion of largely segregated parallel processing streams in the auditory thalamus and cerebral cortex are found.
Abstract: The basis for multiple representations of equivalent frequency ranges in auditory cortex was studied with physiological and anatomical methods. Our goal was to trace the convergence of thalamic, commissural, and corticocortical information upon two tonotopic fields in the cat, the primary auditory cortex (AI) and the anterior auditory field (AAF). Both fields are among the first cortical levels of processing. After neurophysiological mapping of characteristic frequency, we injected different retrograde tracers at separate, frequency-matched loci in AI and AAF. We found differences in their projections that support the notion of largely segregated parallel processing streams in the auditory thalamus and cerebral cortex. In each field, ipsilateral cortical input amounts to approximately 70% of the number of cells projecting to an isofrequency domain, while commissural and thalamic sources are each approximately 15%. Labeled thalamic and cortical neurons were concentrated in tonotopically predicted regions and in smaller loci far from their spectrally predicted positions. The few double-labeled thalamic neurons (<2%) are consistent with the hypothesis that information to AI and AAF travels along independent processing streams despite widespread regional overlap of thalamic input sources. Double labeling is also sparse in both the corticocortical and commissural systems (approximately 1%), confirming their independence. The segregation of frequency-specific channels within thalamic and cortical systems is consistent with a model of parallel processing in auditory cortex. The global convergence of cells outside the targeted frequency domain in AI and AAF could contribute to context-dependent processing and to intracortical plasticity and reorganization.

112 citations


BookDOI
01 Jan 2004
TL;DR: This talk outlines an approach based on a web service component architecture for building large scale Grid applications and illustrates how the traditional parallel application can be wrapped by a web service factory and integrated into complex workflows.
Abstract: Large scale Grid applications are often composed of a distributed collection of parallel simulation codes, instrument monitors, data miners, rendering and visualization tools. For example, consider a severe storm prediction system driven by a grid of weather sensors. Typically these applications are very complex to build, so users interact with them through a Grid portal front end. This talk outlines an approach based on a web service component architecture for building these applications and portal interfaces. We illustrate how the traditional parallel application can be wrapped by a web service factory and integrated into complex workflows. Additional issues that are addressed include: grid security, web service tools and workflow composition tools. The talk will try to outline several important classes of unsolved problems and possible new research directions for building grid applications.

107 citations


Proceedings ArticleDOI
26 Apr 2004
TL;DR: This article gives a brief overview of theoretical advances, computing trends, applications and future perspectives in parallel genetic algorithms, and explains basic terms and behavior of (parallel) genetic algorithms.
Abstract: Summary form only given. This article gives a brief overview of theoretical advances, computing trends, applications and future perspectives in parallel genetic algorithms. It explains basic terms and behavior of (parallel) genetic algorithms. Genetic algorithms are easily parallelized, so two possible kinds of parallelism, data parallelism and control parallelism, are described with respect to them. Parallelism of genetic algorithms brings many advantages and gains. Classifications of these algorithms are often based on the type of computing model, the walk strategy and the computing machinery used. Afterwards, significant milestones in the theory and the latest advances are briefly mentioned. Then current trends in parallel computing are briefly reviewed, with stress on computer architectures of parallel systems, interconnection topologies, operating systems, parallel (genetic) libraries and programming paradigms. Sufficient space is devoted to the latest applications of parallel genetic algorithms. After the discussion section, perspectives of the algorithms are predicted till the year 2005. The information in the article is divided into two periods, before and after the year 2000, in all chapters. The second period is more interesting and of higher importance, because it highlights recent research efforts and gives some hints about possible future trends. That is why we devote much space to the second period. As there is no such overview of the recent period of parallel genetic algorithms, our investigation could be appealing and useful in many aspects.

107 citations


Journal ArticleDOI
TL;DR: The aim of this review is to summarize and compare the present concepts of auditory processing by relating behavioral performance to known neuronal mechanisms, and demonstrates that closely related species often use different combinations of temporal parameters in their recognition systems.
Abstract: Insects exhibit an astonishing diversity in the design of their ears and the subsequent processing of information within their auditory pathways. The aim of this review is to summarize and compare the present concepts of auditory processing by relating behavioral performance to known neuronal mechanisms. We focus on three general aspects, that is frequency, directional, and temporal processing. The first part compares the capacity (in some insects high) for frequency analysis in the ear with the rather low specificity of tuning in interneurons by looking at Q10dB values and frequency dependent inhibition of interneurons. Since sharpening of frequency does not seem to be the prime task of a set of differently tuned receptors, alternative hypotheses are discussed. Moreover, the physiological correspondence between tonotopic projections of receptors and dendritic organization of interneurons is not in all cases strong. The second part is concerned with directional hearing and thus with the ability for angular resolution of insects. The present concepts, as derived from behavioral performances, for angular resolution versus lateralization and serial versus parallel processing of directional and pattern information can be traced to the thoracic level of neuronal processing. Contralateral inhibition, a mechanism for enhancing directional tuning, appears to be most effective in parallel pathways, whereas in serial processing it may have detrimental effects on pattern processing. The third part, after some considerations of signal analysis in the temporal domain, demonstrates that closely related species often use different combinations of temporal parameters in their recognition systems. On the thoracic level, analysis of temporal modulation functions and effects of inhibition on spiking patterns reveals relatively simple processing, whereas brain neurons may exhibit more complex properties.

105 citations


Proceedings ArticleDOI
27 Sep 2004
TL;DR: In this paper, the authors present the general problems associated with parallel operation of UPS systems and a control strategy for parallel operation with different ratings; the validity of the proposed control strategy is investigated through simulation and experiment with two UPS systems.
Abstract: Parallel operation of UPS systems has been used to increase the power capacity of the system or to secure a reliable supply of power to critical loads. During parallel operation, load sharing control to maintain the current balance is critical for reliable operation, since load sharing is very sensitive to differences in components of each module such as amplitude/phase difference, line impedance, and output LC filters. To solve these problems various control algorithms have been researched. However, these methods cannot be applied to UPS systems with different ratings. For this case, master and slave control algorithms for parallel operation are adequate. If the ratings of the UPS systems are different, the values of the passive LC filters will be different, and this will affect current sharing. This paper presents the general problems associated with parallel operation of UPS systems, and a control strategy for parallel operation with different ratings. The validity of the proposed control strategy is investigated through simulation and experiment with two UPS systems.

90 citations


Journal ArticleDOI
TL;DR: The proposed algorithm is based on one of the best known sequential techniques, the Frequent Pattern (FP) Growth algorithm, and introduces minimum communication overheads by efficiently partitioning the list of frequent elements over processors.
Abstract: Extraction of frequent patterns in transaction-oriented databases is crucial to several data mining tasks such as association rule generation, time series analysis, classification, etc. Most of these mining tasks require multiple passes over the database, and if the database size is large, which is usually the case, scalable high performance solutions involving multiple processors are required. This paper presents an efficient scalable parallel algorithm for mining frequent patterns on parallel shared-nothing platforms. The proposed algorithm is based on one of the best known sequential techniques, referred to as the Frequent Pattern (FP) Growth algorithm. Unlike most of the earlier parallel approaches based on different variants of the Apriori algorithm, the algorithm presented in this paper does not explicitly result in having the entire counting data structure duplicated on each processor. Furthermore, the proposed algorithm introduces minimum communication (and hence synchronization) overheads by efficiently partitioning the list of frequent elements over processors. The experimental results show scalable performance over different machine and problem sizes. The comparison of implementation results with existing parallel approaches shows significant gains in the speedup. On an 8-processor machine, we report an average speedup of 6 for different problem sizes.

83 citations
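
To put the reported figure in context, speedup and parallel efficiency follow the standard definitions: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of processors. The short sketch below applies them to the numbers in the abstract above (speedup of 6 on 8 processors); the function names and the timing values in the second call are hypothetical.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T_serial / T_parallel."""
    return t_serial / t_parallel


def efficiency(speedup_value, processors):
    """Parallel efficiency E = S / p, the fraction of ideal linear speedup."""
    return speedup_value / processors


# Figure from the abstract: average speedup of 6 on an 8-processor machine.
reported_efficiency = efficiency(6.0, 8)

# Hypothetical timings, only to show how a speedup value is obtained.
example_speedup = speedup(t_serial=12.0, t_parallel=2.0)
```

An efficiency of 0.75 means that, on average, a quarter of the ideal 8-fold gain is lost to overheads such as communication and load imbalance.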


Journal ArticleDOI
28 Jun 2004
TL;DR: A mixed-signal programmable chip for high-speed vision applications that can capture an image, run approximately 150 two-dimensional linear convolutions, and download the result in 8-bit digital format in less than 1 ms, together with the possibility of executing sequences of user-definable instructions makes the chip a true general-purpose sensory/processing device.
Abstract: This paper presents a mixed-signal programmable chip for high-speed vision applications. It consists of an array of processing elements, arranged to operate in accordance with the principles of single instruction multiple data (SIMD) computing architectures. This chip, implemented in a 0.35-μm fully digital CMOS technology, contains ~3.75 M transistors and exhibits peak performance figures of 330 GOPS (8-bit equivalent giga-operations per second), 3.6 GOPS/mm² and 82.5 GOPS/W. It includes structures for image acquisition and for image processing, meaning that it does not require a separate imager for operation. At the sensory side, integration and log-compression sensing circuits are embedded, thus allowing the chip to handle a large variety of illumination conditions. At the processing plane, analog and digital circuits are employed whose parameters can be programmed and their architecture reconfigured for the realization of software-coded processing algorithms. The chip provides, and accepts, 8-bit digitized data through a 32-bit bidirectional data bus which operates at 120 MB/s. Experimental results show that frame rates of 1000 frames per second (FPS) can be achieved under room illumination conditions; applications using exposures of about 50 μs have been recently reached by using special illumination setups. The chip can capture an image, run approximately 150 two-dimensional linear convolutions, and download the result in 8-bit digital format, in less than 1 ms. This feature, together with the possibility of executing sequences of user-definable instructions (stored on a full-custom 32-kb on-chip memory), and storing intermediate results (up to 8 grayscale images) makes the chip a true general-purpose sensory/processing device.

Journal ArticleDOI
TL;DR: New methodologies are employed to assess serial versus parallel processing and find strong evidence for pure serial or pure parallel processing, with some striking apparent differences across individuals and interstimulus conditions.
Abstract: Many mental tasks that involve operations on a number of items take place within a few hundred milliseconds. In such tasks, whether the items are processed simultaneously (in parallel) or sequentially (serially) has long been of interest to psychologists. Although certain types of parallel and serial models have been ruled out, it has proven extremely difficult to entirely separate reasonable serial and limited-capacity parallel models on the basis of typical data. Recent advances in theory-driven methodology now permit strong tests of serial versus parallel processing in such tasks, in ways that bypass the capacity issue and that are distribution and parameter free. We employ new methodologies to assess serial versus parallel processing and find strong evidence for pure serial or pure parallel processing, with some striking apparent differences across individuals and interstimulus conditions.

Journal ArticleDOI
TL;DR: A parallel topology optimization method is proposed to deal with large-scale structural eigenvalue-related design problems and the preconditioned conjugate gradient method and the subspace iteration method are used as parallel solvers.

Proceedings ArticleDOI
15 Aug 2004
TL;DR: Efficient implementations of sufficient (albeit not necessary) partitioning algorithms are presented here, and proved correct.
Abstract: Given a collection of tasks that comprise the software for a real-time system, and a collection of available processing units of different kinds upon which to execute them, the heterogeneous multiprocessor partitioning problem is concerned with determining whether the given tasks can be partitioned among the available processing units in such a manner that all timing constraints are met. It is known that this problem is intractable; efficient implementations of sufficient (albeit not necessary) partitioning algorithms are presented here, and proved correct.

Journal ArticleDOI
TL;DR: In the absence of eye movements, asymmetric visual search, long considered an example of serial deployment of covert attention, is qualitatively and quantitatively consistent with parallel search processes.
Abstract: The difficulty of visual search may depend on assignment of the same visual elements as targets and distractors (search asymmetry). Easy C-in-O searches and difficult O-in-C searches are often associated with parallel and serial search, respectively. Here, the time course of visual search was measured for both tasks with speed-accuracy methods. The time courses of the 2 tasks were similar and independent of display size. New probabilistic parallel and serial search models and sophisticated-guessing variants made predictions about time course and accuracy of visual search. The probabilistic parallel model provided an excellent account of the data, but the serial model did not. Asymptotic search accuracies and display size effects were consistent with a signal-detection analysis, with lower variance encoding of Cs than Os. In the absence of eye movements, asymmetric visual search, long considered an example of serial deployment of covert attention, is qualitatively and quantitatively consistent with parallel search processes.

Patent
01 Jul 2004
TL;DR: A data processing apparatus and method for moving data between registers and memory are provided, in which access logic is responsive to a single access instruction to move a plurality of data elements between a chosen one of the lanes in specified registers and a structure within memory having a structure format.
Abstract: A data processing apparatus and method are provided for moving data between registers and memory. The data processing apparatus comprises a register data store having a plurality of registers operable to store data elements. A processor is operable to perform in parallel a data processing operation on multiple data elements occupying different lanes of parallel processing in at least one of the registers. Access logic is provided which is responsive to a single access instruction to move a plurality of data elements between a chosen one of the lanes in specified registers and a structure within memory having a structure format, the structure format having a plurality of components. The single access instruction identifies the number of components in the structure format, and the access logic is operable to arrange the plurality of data elements as they are moved such that data elements of different components are stored in different specified registers within the chosen lane whilst in memory the data elements are stored as the structure.

Patent
01 Apr 2004
TL;DR: An apparatus and method of decoding coded video bitstreams is described; the apparatus comprises a first processor and a second processor configured to operate in parallel, where the first processor performs dequantization and inverse DCT to recover digital pixel data from the macroblocks.
Abstract: An apparatus and method of decoding coded video bitstreams is disclosed. The apparatus comprises a first processor and a second processor configured to operate in parallel. The main processor (355) receives the coded video bitstream, parses it, and calls the second processor (360) to decode the coded video bitstream to retrieve macroblock data. If an error occurs during decoding, the second processor (360) signals the first processor, which can instruct the second processor to perform an error recovery routine. The first processor (355) then performs dequantization and inverse DCT to recover digital pixel data from the macroblocks so that an image formed from the digital pixel data can be later displayed on a monitor.

Journal ArticleDOI
TL;DR: The design and implementation of a parallel machine on an SOPC development board is described, using multiple instances of a soft IP configurable processor; this machine is used for LU factorization, which facilitates the efficient solution of linear equations at a cost much lower than that of supercomputers and networks of workstations.
Abstract: Configurable computing, where hardware resources are configured appropriately to match specific hardware designs, has recently demonstrated its ability to significantly improve performance for a wide range of computation-intensive applications. With steady advances in silicon technology, as predicted by Moore's Law, Field-Programmable Gate Array (FPGA) technologies have enabled the implementation of System-on-a-Programmable-Chip (SOPC or SOC) computing platforms, which, in turn, have given a significant boost to the field of configurable computing. It is possible to implement various specialized parallel machines in a single silicon chip. In this paper, we describe our design and implementation of a parallel machine on an SOPC development board, using multiple instances of a soft IP configurable processor; we use this machine for LU factorization. LU factorization is widely used in engineering and science to solve efficiently large systems of linear equations. Our implementation facilitates the efficient solution of linear equations at a cost much lower than that of supercomputers and networks of workstations. The intricacies of our FPGA-based design are presented along with tradeoff choices made for the purpose of illustration. Performance results prove the viability of our approach. Copyright © 2004 John Wiley & Sons, Ltd.

Patent
21 Jan 2004
TL;DR: A compiler apparatus translates a source program into a machine language program for a processor that includes a plurality of execution units which can execute instructions in parallel and a plurality of instruction issue units which issue the instructions executed respectively by the plurality of execution units.
Abstract: A compiler apparatus is provided that is capable of generating instruction sequences for causing a processor with parallel processing capability to operate with lower power consumption. The compiler apparatus translates a source program into a machine language program for the processor, which includes a plurality of execution units which can execute instructions in parallel and a plurality of instruction issue units which issue the instructions executed respectively by the plurality of execution units. It includes: a parser unit operable to parse the source program; an intermediate code conversion unit operable to convert the parsed source program into intermediate codes; an optimization unit operable to optimize the intermediate codes so as to reduce the Hamming distance between instructions placed in positions corresponding to the same instruction issue unit in consecutive instruction cycles, without changing dependency between the instructions corresponding to the intermediate codes; and a code generation unit operable to convert the optimized intermediate codes into machine language instructions.
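
The optimization criterion in the abstract above, reducing the Hamming distance between instructions issued by the same unit in consecutive cycles, can be illustrated with a small sketch. Everything here is a simplifying assumption rather than the patented method: instructions are modelled as small integers (hypothetical 4-bit encodings), all instructions are taken to be mutually independent so any order is legal, a single issue slot is considered, and the best order is found by brute force.

```python
from itertools import permutations


def hamming(a, b):
    """Number of bit positions in which two instruction words differ."""
    return bin(a ^ b).count("1")


def total_toggles(sequence):
    """Sum of Hamming distances between consecutive instruction words."""
    return sum(hamming(x, y) for x, y in zip(sequence, sequence[1:]))


def best_order(instructions):
    """Brute-force the order with the fewest bit toggles.

    Only valid under the assumption that the instructions have no
    dependencies on each other; a real compiler must respect those.
    """
    return min(permutations(instructions), key=total_toggles)


# Hypothetical 4-bit instruction encodings for one issue slot.
program = [0b1111, 0b0000, 0b1110, 0b0001]
toggles_before = total_toggles(program)
toggles_after = total_toggles(best_order(program))
```

Fewer toggling bits between consecutive instruction words means fewer signal transitions in the issue logic, which is the link to lower power consumption that the abstract draws.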

DOI
13 Jul 2004
TL;DR: Modifications of the technique of parallel coordinate plot for supporting visual exploration of object classes, in particular, resulting from cluster analysis, are described, applying two general approaches to handling large amounts of data: aggregation and filtering.
Abstract: We describe our modifications of the technique of parallel coordinate plot for supporting visual exploration of object classes, in particular, resulting from cluster analysis. We strove to create a tool that would be suitable for analysis of large datasets. The basic parallel coordinate plot technique with the traditional method for representing classes, multi-coloured brushing, fails to properly convey class-relevant information due to tremendous overlapping of lines. We have applied two general approaches to handling large amounts of data: aggregation and filtering. Thus, information concerning the distribution of characteristics in classes and the entire dataset is shown on parallel coordinates in an aggregated form. This is combined with displaying individual characteristics only for user-selected object subsets.

Patent
04 Aug 2004
TL;DR: An ETL/EAI data warehouse management system and method for processing data by dynamically distributing the computational load across a cluster network of distributed servers using a master node and multiple servant nodes, where each of the servant nodes owns all of its resources independently of the other nodes.
Abstract: An ETL/EAI data warehouse management system and method for processing data by dynamically distributing the computational load across a cluster network of distributed servers using a master node and multiple servant nodes, where each of the servant nodes owns all of its resources independently of the other nodes.

Proceedings ArticleDOI
10 May 2004
TL;DR: The binary and real-valued versions of the PSO algorithm are exploited in two important signal processing paradigms: multiuser detection (MUD) and blind extraction of sources (BES), respectively.
Abstract: The particle swarm optimization (PSO) algorithm, which originated as a simulation of a simplified social system, is an evolutionary computation technique. In this paper the binary and real-valued versions of the PSO algorithm are exploited in two important signal processing paradigms: multiuser detection (MUD) and blind extraction of sources (BES), respectively. The novel approaches are effective and efficient, with a parallel processing structure and relatively feasible implementation. Simulation results show that both PSO-MUD and PSO-BES achieve a significant performance improvement over conventional methods.

Journal ArticleDOI
TL;DR: This paper proposes a new parallel algorithm which uses a hybrid heuristic within a multilevel scheme and is able to obtain very high quality partitions that improve on those obtained by other algorithms previously put forward.
Abstract: One significant optimisation problem that occurs in many scientific areas is graph partitioning. Several heuristics aimed at high quality partitions have been put forward. Multilevel schemes can in fact improve the quality of the solutions. However, the size of the graphs is very large in many applications, making it impossible to effectively explore the search space. In these cases, parallel processing becomes a very useful tool for overcoming this problem. In this paper, we propose a new parallel algorithm which uses a hybrid heuristic within a multilevel scheme. It is able to obtain very high quality partitions that improve on those obtained by other algorithms previously put forward.

Proceedings ArticleDOI
TL;DR: A novel custom image sensor based on smart pixels dedicated to parallel OCT (pOCT) is presented, which overcomes the main challenges for OCT using parallel detection such as data rate, power consumption, circuit size, and optical sensitivity.
Abstract: Optical Coherence Tomography (OCT) is an optical imaging technique allowing the acquisition of three-dimensional images with micrometer resolution. It is very well suited to cross-sectional imaging of highly scattering materials, such as most biomedical tissues. A novel custom image sensor based on smart pixels dedicated to parallel OCT (pOCT) is presented. Massively parallel detection and signal processing enables a significant increase in the 3D frame rate and a reduction of the mechanical complexity of the complete setup compared to conventional point-scanning OCT. This renders the parallel OCT technique particularly advantageous for high-speed applications in industrial and biomedical domains while also reducing overall system costs. The sensor architecture presented in this article overcomes the main challenges for OCT using parallel detection such as data rate, power consumption, circuit size, and optical sensitivity. Each pixel of the pOCT sensor contains a low-power signal demodulation circuit allowing the simultaneous detection of the envelope and the phase information of the optical interferometry signal. An automatic photocurrent offset-compensation circuit, a synchronous sampling stage, programmable time averaging, and random pixel accessing are also incorporated at the pixel level. The low-power demodulation principle chosen as well as alternative implementations are discussed. The characterization results of the sensor exhibit a sensitivity of at least 74 dB, which is within 4 dB of the theoretical limit of a shot-noise limited OCT system. Real-time high-resolution three-dimensional tomographic imaging is demonstrated along with corresponding performance measurements.

Patent
09 Jul 2004
TL;DR: In this patent, an OS service unit, which provides the services of an OS for single processors to a unit of work that can be parallelized within the application, controls a security function with respect to a processing request from the unit of work in response to that request.
Abstract: On a parallel processing system which operates an OS and an existing application for single processors on a multiprocessor to realize parallel processing by the multiprocessor with respect to the application, an OS service unit which provides services of the OS for single processors to a unit of work which can be parallelized within the application controls security function with respect to a processing request from the unit of work in response to the processing request.

Proceedings ArticleDOI
20 Apr 2004
TL;DR: An architecture supporting the single program multiple data model of parallel processing, and results taken from a parallel implementation of the JPEG2000 encoding algorithm and Mandelbrot set generation are presented.
Abstract: The prevalence of software reference code motivates investigation into efficient implementations of software architectures on field-programmable devices. Modern FPGAs allow designers to generate multi-processor architectures that exactly match the processing needs of the algorithm. This paper describes an architecture supporting the single program multiple data model of parallel processing, and presents results taken from a parallel implementation of the JPEG2000 encoding algorithm and Mandelbrot set generation.
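To make the single program multiple data (SPMD) model concrete, the Python sketch below has every worker run the same function on a different slice of a Mandelbrot image, selected by its rank. This only illustrates the programming model: the paper targets multi-processor architectures on FPGAs, whereas this sketch uses Python threads, which show the structure but give no real speedup. The interleaved row split, the image bounds, and all names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def escape_time(c, max_iter):
    """Iterations before |z| exceeds 2 under z -> z*z + c (max_iter if never)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def spmd_worker(rank, n_workers, width, height, max_iter):
    """Same program for every worker; the rank picks rows rank, rank+n, ..."""
    rows = {}
    for y in range(rank, height, n_workers):
        imag = -1.0 + 2.0 * y / (height - 1)
        rows[y] = [
            escape_time(complex(-2.0 + 3.0 * x / (width - 1), imag), max_iter)
            for x in range(width)
        ]
    return rows


def mandelbrot_spmd(n_workers=4, width=16, height=9, max_iter=50):
    """Run the worker once per rank and merge the row sets into one image."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(
            pool.map(
                lambda rank: spmd_worker(rank, n_workers, width, height, max_iter),
                range(n_workers),
            )
        )
    merged = {}
    for part in parts:
        merged.update(part)
    return [merged[y] for y in range(height)]


image = mandelbrot_spmd(n_workers=4)
```

Interleaving rows is one simple way to balance load, since rows near the set cost more iterations than rows far from it.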

Patent
13 Jul 2004
TL;DR: A data processing apparatus and method are provided for performing a SIMD operation in parallel on data elements; the apparatus comprises a register data store having a plurality of registers operable to store data elements, and processing logic operable to perform data processing operations on data elements.
Abstract: A data processing apparatus and method are provided for performing in parallel a data processing operation on data elements. The data processing apparatus comprises a register data store having a plurality of registers operable to store data elements, and processing logic operable to perform data processing operations on data elements. A decoder is operable to decode a data processing instruction, the data processing instruction identifying a lane size and a data element size, the lane size being a multiple of the data element size. Further, the decoder is operable to control the processing logic to define based on the lane size a number of lanes of parallel processing in at least one of the registers, and the processing logic is operable to perform in parallel a data processing operation on the data elements within each lane of parallel processing. This provides significantly improved flexibility in the performance of SIMD operations.
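The relationship between lane size and data element size described in this abstract can be modelled in a few lines of Python. In the sketch below a register is a list of elements, the lane size must be a multiple of the element size, and an operation is applied independently within each lane. The sequential loop stands in for what the hardware would do in parallel, and the element-reversal operation, the 64-bit register, and all names are illustrative assumptions, not details from the patent.

```python
def split_into_lanes(elements, elem_bits, lane_bits):
    """Group a register's elements into lanes of lane_bits each."""
    if lane_bits % elem_bits != 0:
        raise ValueError("lane size must be a multiple of the element size")
    per_lane = lane_bits // elem_bits
    if len(elements) % per_lane != 0:
        raise ValueError("register does not hold a whole number of lanes")
    return [elements[i:i + per_lane] for i in range(0, len(elements), per_lane)]


def lanewise(elements, elem_bits, lane_bits, op):
    """Apply op to each lane independently and reassemble the register."""
    result = []
    for lane in split_into_lanes(elements, elem_bits, lane_bits):
        result.extend(op(lane))
    return result


def reverse_lane(lane):
    """Hypothetical per-lane operation: reverse the element order."""
    return lane[::-1]


# Eight 8-bit elements modelled as one 64-bit register.
register = [0, 1, 2, 3, 4, 5, 6, 7]
rev32 = lanewise(register, 8, 32, reverse_lane)  # two 32-bit lanes
rev16 = lanewise(register, 8, 16, reverse_lane)  # four 16-bit lanes
```

The same operation gives different results for different lane sizes, which is the flexibility the abstract attributes to encoding both sizes in the instruction.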

Proceedings ArticleDOI
12 Jan 2004
TL;DR: GXP is described, a shell for distributed multi-cluster environments that features a very fast parallel (simultaneous) command submission, parallel pipes, and a flexible and efficient method to interactively select a subset of nodes to execute subsequent commands on.
Abstract: We describe GXP, a shell for distributed multi-cluster environments. With GXP, users can quickly submit a command to many nodes simultaneously (approximately 600 milliseconds on over 300 nodes spread across five local-area networks). It therefore brings an interactive and instantaneous response to many cluster/network operations, such as trouble diagnosis, parallel program invocation, installation and deployment, testing and debugging, monitoring, and dead process cleanup. It features (1) a very fast parallel (simultaneous) command submission, (2) parallel pipes (pipes between local command and all parallel commands), and (3) a flexible and efficient method to interactively select a subset of nodes to execute subsequent commands on. It is very easy to start using GXP, because it is designed not to require cumbersome per-node setup and installation and to depend only on a very small number of pre-installed tools and nothing else. We describe how GXP achieves these features and demonstrate through examples how they make many otherwise boring and error-prone tasks simple, efficient, and fun.
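Two of the features listed in this abstract, simultaneous command submission and selecting a subset of nodes for later commands, can be sketched in Python as follows. This is not GXP's implementation or interface: the functions `run_via_ssh`, `submit_all`, and `select` are hypothetical, and launching one `ssh` process per node from a thread pool is an assumption made for illustration only.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_via_ssh(node, command):
    """Assumed transport: run the command on one node through the ssh client."""
    proc = subprocess.run(["ssh", node, command], capture_output=True, text=True)
    return proc.returncode, proc.stdout


def submit_all(nodes, command, run=run_via_ssh):
    """Submit the same command to every node at once; return {node: (code, output)}."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        results = list(pool.map(lambda node: run(node, command), nodes))
    return dict(zip(nodes, results))


def select(results, predicate):
    """Keep the nodes whose (exit code, output) satisfies the predicate."""
    return [node for node, (code, out) in results.items() if predicate(code, out)]


def fake_run(node, command):
    """Stand-in transport for trying the sketch without a cluster."""
    return (1 if node == "n2" else 0), node + ": " + command


results = submit_all(["n1", "n2", "n3"], "uptime", run=fake_run)
healthy = select(results, lambda code, out: code == 0)
```

The `run` parameter is injectable so that the selection logic can be exercised without network access; subsequent commands would then be submitted to `healthy` only.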


Journal ArticleDOI
TL;DR: A simple, efficient, finite state machine-based approach for communication minimization of library-based data parallel regular domain problems, referred to as lazy parallelization, where a sequential program is parallelized automatically at runtime by inserting communication primitives and memory management operations whenever necessary.
Abstract: A popular approach to providing nonexperts in parallel computing with an easy-to-use programming model is to design a software library consisting of a set of preparallelized routines, and hide the intricacies of parallelization behind the library's API. However, for regular domain problems (such as simple matrix manipulations or low-level image processing applications, in which all elements in a regular subset of a dense data field are accessed in turn) speedup obtained with many such library-based parallelization tools is often suboptimal. This is because interoperation optimization (or: time-optimization of communication steps across library calls) is generally not incorporated in the library implementations. We present a simple, efficient, finite state machine-based approach for communication minimization of library-based data parallel regular domain problems. In the approach, referred to as lazy parallelization, a sequential program is parallelized automatically at runtime by inserting communication primitives and memory management operations whenever necessary. Apart from being simple and cheap, lazy parallelization guarantees to generate legal, correct, and efficient parallel programs at all times. The effectiveness of the approach is demonstrated by analyzing the performance characteristics of two typical regular domain problems obtained from the field of low-level image processing. Experimental results show significant performance improvements over nonoptimized parallel applications. Moreover, obtained communication behavior is found to be optimal with respect to the abstraction level of message passing programs.
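The idea of inserting communication only when necessary can be sketched as a small state machine in Python. In the sketch below each data structure carries a distribution state, each library call declares the state it needs, and a scatter or gather is logged only when the two differ. The two states, the transition table, and the names `Dist`, `LazyRuntime`, `parallel_op`, and `sequential_op` are assumptions for illustration; the abstract does not give the paper's actual state machine.

```python
from enum import Enum


class Dist(Enum):
    ROOT_ONLY = "root_only"   # whole structure held by one process
    SCATTERED = "scattered"   # structure split across all processes


# Communication step needed to move between states (assumed, simplified).
TRANSITIONS = {
    (Dist.ROOT_ONLY, Dist.SCATTERED): "scatter",
    (Dist.SCATTERED, Dist.ROOT_ONLY): "gather",
}


class LazyRuntime:
    """Tracks distribution states and logs the steps a run would perform."""

    def __init__(self):
        self.state = {}
        self.log = []

    def create(self, name):
        self.state[name] = Dist.ROOT_ONLY

    def require(self, name, needed):
        """Insert a communication step only if the current state is wrong."""
        current = self.state[name]
        if current is not needed:
            self.log.append((TRANSITIONS[(current, needed)], name))
            self.state[name] = needed

    def parallel_op(self, op, name):
        self.require(name, Dist.SCATTERED)
        self.log.append((op, name))

    def sequential_op(self, op, name):
        self.require(name, Dist.ROOT_ONLY)
        self.log.append((op, name))


rt = LazyRuntime()
rt.create("img")
rt.parallel_op("threshold", "img")
rt.parallel_op("invert", "img")
rt.sequential_op("write_file", "img")
communication_steps = [step for step in rt.log if step[0] in ("scatter", "gather")]
```

Here the two consecutive parallel calls share one scatter and one gather, whereas a library that scatters before and gathers after every call would perform four communication steps for the same program.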