Showing papers on "Degree of parallelism published in 2004"


Proceedings ArticleDOI
Alexander Keller1, Joseph L. Hellerstein1, Joel L. Wolf1, Kun-Lung Wu1, Vijaya Krishnan1 
23 Apr 2004
TL;DR: The CHAMPS system, a prototype under development at IBM Research for Change Management with Planning and Scheduling, is discussed; it achieves a very high degree of parallelism for a set of tasks by exploiting detailed factual knowledge about the structure of a distributed system, derived from dependency information at runtime.
Abstract: Change management is a process by which IT systems are modified to accommodate considerations such as software fixes, hardware upgrades and performance enhancements. This paper discusses the CHAMPS system, a prototype under development at IBM Research for Change Management with Planning and Scheduling. The CHAMPS system achieves a very high degree of parallelism for a set of tasks by exploiting detailed factual knowledge about the structure of a distributed system, derived from dependency information at runtime. In contrast, today's systems expect an administrator to provide such insights, which is often not feasible. Furthermore, the optimization techniques we employ allow the CHAMPS system to produce a very high quality solution to a mathematically intractable problem in time that scales well with the problem size. We have implemented the CHAMPS system and applied it in a TPC-W environment that implements an on-line book store application.

151 citations
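
The abstract does not spell out the CHAMPS planning algorithms, but the core idea of deriving parallelism from dependency information can be illustrated with a small sketch: tasks whose prerequisites are already satisfied can be dispatched together. The task names and dependency graph below are hypothetical, not taken from the paper.

```python
# Hypothetical change tasks and their prerequisites (illustration only,
# not the actual CHAMPS planner).
deps = {
    "stop_app": [],
    "upgrade_db": ["stop_app"],
    "upgrade_web": ["stop_app"],
    "patch_os": [],
    "start_app": ["upgrade_db", "upgrade_web", "patch_os"],
}

def parallel_waves(deps):
    """Group tasks into waves; all tasks in a wave can run in parallel."""
    remaining = dict(deps)
    done = set()
    waves = []
    while remaining:
        ready = [t for t, pre in remaining.items() if all(p in done for p in pre)]
        if not ready:
            raise ValueError("cyclic dependencies")
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves

for i, wave in enumerate(parallel_waves(deps)):
    print(f"wave {i}: {wave}")   # degree of parallelism = len(wave)
```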


Book ChapterDOI
13 Jun 2004
TL;DR: This paper presents a comprehensive characterization of a multi-cluster supercomputer workload using twelve months of scientific research traces; the characterized metrics include system utilization, job arrival rate and interarrival time, job cancellation rate, job size, job runtime, memory usage, and user/group behavior.
Abstract: This paper presents a comprehensive characterization of a multi-cluster supercomputer workload using twelve months of scientific research traces. Metrics that we characterize include system utilization, job arrival rate and interarrival time, job cancellation rate, job size (degree of parallelism), job runtime, memory usage, and user/group behavior. Correlations between metrics (job runtime and memory usage, requested and actual runtime, etc.) are identified and extensively studied. Differences from previously reported workloads are recognized, and statistical distributions are fitted for generating synthetic workloads with the same characteristics. This study provides a realistic basis for experiments in resource management and for evaluations of different scheduling strategies in a multi-cluster research environment.

148 citations
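
As a rough illustration of the kind of characterization described (not the authors' actual analysis or data), the sketch below computes interarrival times from job submission timestamps and fits an exponential arrival rate by maximum likelihood; the trace values are made up.

```python
import statistics

# Hypothetical job-submission timestamps (seconds) and job sizes (CPUs);
# a real study would read these from accounting traces.
arrivals = [0.0, 12.5, 14.1, 40.0, 41.2, 90.7, 95.3, 120.0]
job_sizes = [1, 4, 8, 1, 16, 2, 32, 4]

interarrival = [b - a for a, b in zip(arrivals, arrivals[1:])]
mean_ia = statistics.mean(interarrival)

print(f"mean interarrival time: {mean_ia:.2f} s")
print(f"arrival rate (MLE of exponential lambda): {1.0 / mean_ia:.4f} jobs/s")
print(f"job size (degree of parallelism): "
      f"mean={statistics.mean(job_sizes):.1f}, median={statistics.median(job_sizes)}")
```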


Patent
27 Feb 2004
TL;DR: In this article, an improved method and system for preserving data constraints during parallel apply in asynchronous transaction replication in a database system have been disclosed, preserving secondary unique constraints and referential integrity constraints, while also allowing a high degree of parallelism in the application of asynchronous replication transactions.
Abstract: An improved method and system for preserving data constraints during parallel apply in asynchronous transaction replication in a database system are disclosed. The method and system preserve secondary unique constraints and referential integrity constraints while also allowing a high degree of parallelism in the application of asynchronously replicated transactions. The method and system also detect and resolve ordering problems introduced by referential integrity cascade deletes, and allow the parallel initial loading of parent and child tables of a referential integrity constraint.

77 citations
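
The patent abstract gives no algorithmic detail, so the following is only a generic sketch of the underlying idea: replicated transactions that touch disjoint constrained key sets can be applied in parallel, while conflicting ones are pushed into later batches to preserve source order. The table/key data are invented and this is not the patented method.

```python
# Generic sketch of constraint-aware parallel apply (illustration only).
transactions = [                       # hypothetical (table, key) sets per transaction
    {"id": 1, "keys": {("orders", 10), ("customers", 7)}},
    {"id": 2, "keys": {("orders", 11)}},
    {"id": 3, "keys": {("customers", 7)}},   # conflicts with transaction 1
    {"id": 4, "keys": {("orders", 12)}},
]

def conflict(a, b):
    return bool(a["keys"] & b["keys"])

batches = []   # batches are applied one after another; within a batch, apply in parallel
for tx in transactions:                      # iterate in source commit order
    last_conflict = -1
    for i, batch in enumerate(batches):
        if any(conflict(tx, other) for other in batch):
            last_conflict = i
    if last_conflict + 1 == len(batches):
        batches.append([])
    batches[last_conflict + 1].append(tx)

for i, batch in enumerate(batches):
    print(f"batch {i}: {[tx['id'] for tx in batch]}")
```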


Proceedings ArticleDOI
J. Rivoir1
14 Jul 2004
TL;DR: In this paper, the authors show quantitatively that parallel test is a much more effective test cost reduction method than low-cost ATE, because it reduces all test cost contributors, not only the capital cost of ATE.
Abstract: Today's manufacturers of high-volume consumer devices are under tremendous cost pressure and consequently under extreme pressure to reduce the cost of test. Low-cost ATE has often been promoted as the obvious solution. Parallel test is another well-known approach, where multiple devices are tested in parallel (multi-site test) and/or multiple blocks within one device are tested in parallel (concurrent test). This paper shows quantitatively that parallel test is a much more effective test cost reduction method than low-cost ATE, because it reduces all test cost contributors, not only the capital cost of ATE. It also shows that the optimum number of sites is relatively insensitive to ATE capital cost, operating cost, yield, and various limiting factors, but that the cost benefits diminish quickly if limited independent ATE resources reduce the degree of parallelism and force a partially sequential test.

44 citations
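
The paper's quantitative model is not reproduced in the abstract; the toy model below only illustrates why cost per device drops with the number of sites while a multi-site serialization penalty limits the benefit. All parameter values are invented.

```python
# Toy cost-per-device model for multi-site test (illustrative parameters only).
def cost_per_device(sites,
                    ate_cost_per_hour=400.0,   # assumed capital + operating cost rate
                    test_time_s=2.0,           # assumed single-site test time
                    multisite_overhead=0.05):  # per-extra-site serialization penalty
    # Effective parallelism degrades when limited ATE resources force a partly
    # sequential test (modeled here as a linear overhead per added site).
    effective_time = test_time_s * (1 + multisite_overhead * (sites - 1))
    return ate_cost_per_hour / 3600.0 * effective_time / sites

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} sites: {cost_per_device(n) * 100:.3f} cents/device")
```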


Proceedings Article
10 Mar 2004
TL;DR: This paper deconstructs the notion of commit in an out-of-order processor, and examines the set of necessary conditions under which instructions can be permitted to retire out of program order, providing a detailed analysis of the frequency and relative importance of these conditions.
Abstract: Many modern processors execute instructions out of their original program order to exploit instruction-level parallelism and achieve higher performance. However, even though instructions can execute in an arbitrary order, they must eventually commit, or retire from execution, in program order. This constraint provides a safety mechanism to ensure that mis-speculated instructions are not inadvertently committed, but it can consume valuable processor resources and severely limit the degree of parallelism exposed in a program. We assert that such a constraint is overly conservative, and propose conditions under which it can be relaxed. This paper deconstructs the notion of commit in an out-of-order processor, and examines the set of necessary conditions under which instructions can be permitted to retire out of program order. It provides a detailed analysis of the frequency and relative importance of these conditions, and discusses microarchitectural modifications that relax the in-order commit requirement. Overall, we found that for a given set of processor resources our technique achieves speedups of up to 68% and 8% for floating-point and integer benchmarks, respectively. Conversely, because out-of-order commit allows more efficient utilization of cycle-time-limiting resources, it can alternatively enable simpler designs with potentially higher clock frequencies.

42 citations
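
The exact retirement conditions are given in the paper, not in the abstract; the sketch below only illustrates the general idea with a toy reorder buffer in which a completed instruction may retire ahead of older ones, provided every older instruction is already known to be free of faults and mis-speculation. The conditions and instruction records are simplified and hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RobEntry:
    name: str
    completed: bool       # has finished execution
    may_fault: bool       # could still raise an exception
    speculative: bool     # depends on an unresolved branch prediction

def retirable(rob):
    """Return entries that could retire under a relaxed (out-of-order) commit rule:
    the entry is done and cannot fault or be squashed, and no *older* entry can
    still fault or be squashed (otherwise rollback would be impossible)."""
    safe_prefix = True
    out = []
    for e in rob:            # rob[0] is the oldest instruction
        if e.completed and not e.may_fault and not e.speculative and safe_prefix:
            out.append(e.name)
        if e.may_fault or e.speculative:
            safe_prefix = False   # younger instructions must wait
    return out

rob = [
    RobEntry("mul  r1", completed=False, may_fault=False, speculative=False),  # long latency, cannot fault
    RobEntry("add  r2", completed=True,  may_fault=False, speculative=False),
    RobEntry("load r3", completed=True,  may_fault=True,  speculative=False),
]
print(retirable(rob))   # ['add  r2'] retires before the older mul, i.e. out of program order
```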


Journal ArticleDOI
19 Jun 2004
TL;DR: A steady-state memetic algorithm is compared with a transgenerational memetic algorithm using different crossover operators and hill-climbing methods to find the best number of processors and the best data distribution method for each stage of a parallel program.
Abstract: Determining the optimum data distribution, degree of parallelism and communication structure on distributed memory machines for a given algorithm is not a straightforward task. Assuming that a parallel algorithm consists of consecutive stages, a genetic algorithm is proposed to find the best number of processors and the best data distribution method to be used for each stage of the parallel algorithm. A steady-state genetic algorithm is compared with a transgenerational genetic algorithm using different crossover operators. Performance is evaluated in terms of the total execution time of the program, including communication and computation times. A computation-intensive, a communication-intensive and a mixed implementation are utilized in the experiments. The GA provides satisfactory results for these illustrative examples.

31 citations
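
The abstract leaves the GA encoding and cost model to the paper; the sketch below is only an illustration of the general approach, encoding one (processor count, distribution) pair per stage, using a made-up execution-time estimate as fitness, and applying steady-state replacement.

```python
import random

random.seed(1)

STAGES = 4
PROC_CHOICES = [1, 2, 4, 8, 16]
DISTS = ["block", "cyclic", "block-cyclic"]

def random_individual():
    return [(random.choice(PROC_CHOICES), random.choice(DISTS)) for _ in range(STAGES)]

def estimated_time(ind):
    """Made-up cost model: computation shrinks with processors, communication
    grows with processors and depends on the distribution (illustration only)."""
    comm_factor = {"block": 1.0, "cyclic": 1.6, "block-cyclic": 1.2}
    t = 0.0
    for procs, dist in ind:
        t += 100.0 / procs                    # computation
        t += 2.0 * procs * comm_factor[dist]  # communication / redistribution
    return t

def crossover(a, b):            # one-point crossover over stages
    cut = random.randrange(1, STAGES)
    return a[:cut] + b[cut:]

def mutate(ind, p=0.2):
    return [(random.choice(PROC_CHOICES), random.choice(DISTS)) if random.random() < p else g
            for g in ind]

# Steady-state GA: each step creates one child and replaces the worst individual.
pop = [random_individual() for _ in range(20)]
for _ in range(500):
    a, b = random.sample(pop, 2)
    child = mutate(crossover(a, b))
    worst = max(pop, key=estimated_time)
    if estimated_time(child) < estimated_time(worst):
        pop[pop.index(worst)] = child

best = min(pop, key=estimated_time)
print("best per-stage (processors, distribution):", best)
print("estimated time:", round(estimated_time(best), 2))
```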


Journal ArticleDOI
TL;DR: The analysis phase is responsible for driving the choice between the horizontal and the vertical partitioning techniques, or even the combination of both, in order to assist distribution designers in the fragmentation phase of object databases.
Abstract: The design of distributed databases involves making decisions on the fragmentation and placement of data and programs across the sites of a computer network. The first phase of distribution design in a top-down approach is the fragmentation phase, which clusters into fragments the information accessed simultaneously by applications. Most distribution design algorithms propose a horizontal or vertical class fragmentation. However, the user has no assistance in choosing between these techniques. In this work we present a detailed methodology for the design of distributed object databases that includes: (i) an analysis phase, to indicate the most adequate fragmentation technique to be applied to each class of the database schema; (ii) a horizontal class fragmentation algorithm; and (iii) a vertical class fragmentation algorithm. Basically, the analysis phase is responsible for driving the choice between the horizontal and the vertical partitioning techniques, or even the combination of both, in order to assist distribution designers in the fragmentation phase of object databases. Experiments using our methodology have resulted in fragmentation schemas offering a high degree of parallelism together with an important reduction of irrelevant data.

30 citations


Proceedings ArticleDOI
28 Jun 2004
TL;DR: An algorithm to efficiently schedule parallel task graphs (fork-join structures) that considers more than one factor at the same time: schedulability, reliability of the participating processors, and achieved degree of parallelism.
Abstract: Efficient task scheduling is essential for achieving high performance in distributed computing applications. Most existing real-time systems consider schedulability as the main goal and ignore other effects such as machine failures. In this work we develop an algorithm to efficiently schedule parallel task graphs (fork-join structures). Our scheduling algorithm considers more than one factor at the same time: schedulability, reliability of the participating processors, and achieved degree of parallelism. To achieve most of these goals, we compose an objective function that combines these different factors simultaneously. The proposed objective function is adjustable, providing the user with a way to prefer one factor over the others. The simulation results indicate that our algorithm produces schedules in which application deadlines are met, reliability is maximized and application parallelism is exploited.

24 citations
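
The abstract names the factors but not the exact formula; a hedged sketch of such an adjustable weighted objective (with invented weights and inputs, not the paper's actual function) might look like the following.

```python
def objective(deadline, finish_time, reliability, parallelism, max_parallelism,
              w_sched=0.5, w_rel=0.3, w_par=0.2):
    """Combine schedulability, reliability and achieved parallelism into one
    score; the weights let the user prefer one factor over the others.
    (Illustrative formula only, not the paper's exact objective.)"""
    slack = max(0.0, (deadline - finish_time) / deadline)   # 1 = lots of slack, 0 = deadline missed
    par = parallelism / max_parallelism                      # fraction of fork-join width exploited
    return w_sched * slack + w_rel * reliability + w_par * par

# Compare two candidate schedules for the same fork-join task graph.
print(objective(deadline=100, finish_time=80, reliability=0.99, parallelism=4, max_parallelism=8))
print(objective(deadline=100, finish_time=95, reliability=0.90, parallelism=8, max_parallelism=8))
```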


Proceedings ArticleDOI
08 Mar 2004
TL;DR: The system programming model is presented, which considers two different views of a component interface: one from the point of view of the application programmer and another intended to be used by a configuration tool in order to establish efficient implementations.
Abstract: SBASCO is a new programming environment for the development of parallel and distributed high-performance scientific applications. The approach integrates both skeleton-based and component technologies. The main goal of the proposal is to provide a high-level programmability system for the efficient development of numerical applications with performance portability across different platforms. We present the system programming model, which considers two different views of a component interface: one from the point of view of the application programmer and another intended to be used by a configuration tool in order to establish efficient implementations. This is made possible by interface-level knowledge of the data distribution and processor layout inside each component. The programming model borrows from software skeletons a cost model, enhanced by run-time analysis, which makes it possible to automatically establish a suitable degree of parallelism and replication for the internal structure of a component.

23 citations


01 Jan 2004
TL;DR: This work describes the use and implementation of skeletons in a distributed computation environment, with the Java-based system Lithium as the reference implementation, and proposes three optimizations based on an asynchronous, optimized RMI interaction mechanism, including improved result collection and work-load balancing.
Abstract: Skeletons are common patterns of parallelism, such as farm and pipeline, that can be abstracted and offered to the application programmer as programming primitives. We describe the use and implementation of skeletons in a distributed computation environment, with the Java-based system Lithium as our reference implementation. Our main contribution is a set of optimization techniques based on an asynchronous, optimized RMI interaction mechanism, which we integrated into the macro data flow (MDF) evaluation technology of Lithium. In detail, we show three different optimizations: 1) a lookahead mechanism that allows multiple tasks to be processed concurrently at each single server and thereby increases the overall degree of parallelism, 2) a lazy task-binding technique that reduces interactions between remote servers and the task dispatcher, and 3) dynamic improvements based on process monitoring that optimize the collection of results and the work-load balancing. We report experimental results that demonstrate the improvements achieved by the proposed optimizations on various testbeds, including heterogeneous environments.

20 citations
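
Lithium's optimizations are built on Java RMI; purely as an illustration of the lookahead idea (keeping several tasks outstanding per server so servers are never idle waiting for the dispatcher), here is a small Python sketch that uses a thread pool as a stand-in for remote servers. Task counts and timings are invented.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def server_execute(task):
    """Stand-in for a remote macro-data-flow task executed on a server."""
    time.sleep(0.1)
    return task * task

SERVERS = 3
LOOKAHEAD = 4   # tasks kept in flight per server (the lookahead window)

tasks = list(range(20))

# A single pool stands in for SERVERS servers, each allowed LOOKAHEAD
# outstanding tasks, so work is dispatched ahead of result collection.
with ThreadPoolExecutor(max_workers=SERVERS * LOOKAHEAD) as pool:
    futures = [pool.submit(server_execute, t) for t in tasks]
    results = [f.result() for f in futures]

print(results[:5])
```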


Proceedings ArticleDOI
17 Jun 2004
TL;DR: This paper presents a VLSI processor for reliable stereo matching that establishes correspondence between images by selecting a desirable window size for sum of absolute differences (SAD) computation, using a window-parallel and pixel-serial architecture.
Abstract: This paper presents a VLSI processor for reliable stereo matching that establishes correspondence between images by selecting a desirable window size for sum of absolute differences (SAD) computation. In SAD computation, the degree of parallelism between pixels in a window changes depending on the window size, while the degree of parallelism between windows is predetermined by the input-image size. Based on this consideration, a window-parallel and pixel-serial architecture is proposed to achieve 100% utilization of processing elements. Not only the 100% utilization but also a simple interconnection network between memory modules and processing elements makes the VLSI processor much superior to pixel-parallel-architecture-based VLSI processors.
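
As a software illustration of the SAD computation the processor implements in hardware (window-parallel in silicon, plain array operations here), a minimal disparity search at a single pixel might look like the following; the image data, window size and disparity range are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(64, 64)).astype(np.int32)
right = np.roll(left, -3, axis=1)          # synthetic stereo pair: true disparity = 3

def sad(left, right, y, x, disparity, win):
    """Sum of absolute differences over a (2*win+1)^2 window.
    The number of pixel terms (the per-window parallelism in hardware)
    grows with the window size, while the number of candidate windows is
    fixed by the image size."""
    lw = left[y - win:y + win + 1, x - win:x + win + 1]
    rw = right[y - win:y + win + 1, x - disparity - win:x - disparity + win + 1]
    return int(np.abs(lw - rw).sum())

y, x, win = 32, 32, 3
best = min(range(0, 16), key=lambda d: sad(left, right, y, x, d, win))
print("estimated disparity:", best)        # expected: 3
```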

01 Jan 2004
TL;DR: Comparisons between the most popular implementation alternatives of the Discrete Wavelet Transform, known as the lifting and filter-bank algorithms, show that the lifting algorithm can be efficiently tailored to provide best results despite the data dependencies involved in this scheme.
Abstract: The growing popularity of the Discrete Wavelet Transform (DWT) has boosted its tuning on all sorts of computer systems, from special-purpose hardware for embedded systems to general-purpose microprocessors and multiprocessors. In this paper we continue to investigate possibilities for the implementation of the DWT, focusing on state-of-the-art programmable graphics hardware. Current design trends have transformed these devices into powerful coprocessors with enough flexibility to perform intensive and complex floating-point calculations. This study concentrates on the comparison between the most popular implementation alternatives, known as the lifting and filter-bank algorithms. The characteristics of the filter-bank version suggest a better mapping on current graphics hardware, given its higher degree of parallelism. However, our experiments show that the lifting algorithm, which has lower computational demands, can be efficiently tailored to provide the best results despite the data dependencies involved in this scheme, which make the exploitation of data parallelism more difficult.
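
For reference (this is not the authors' GPU code), the lifting scheme referred to here can be sketched in 1D for the CDF 5/3 wavelet as two passes over the signal, a predict step and an update step; the dependency of the update step on the predict results is the kind of data dependency that makes data-parallel mapping harder than in the filter-bank form.

```python
def dwt53_lifting(x):
    """One level of the CDF 5/3 wavelet via lifting (predict + update).
    1D sketch with simple boundary clamping; even-length input assumed."""
    even = x[0::2]
    odd = x[1::2]

    # Predict: each odd (detail) sample minus the average of its even neighbours.
    d = [odd[i] - 0.5 * (even[i] + even[min(i + 1, len(even) - 1)])
         for i in range(len(odd))]

    # Update: each even (approximation) sample plus a quarter of the neighbouring details.
    s = [even[i] + 0.25 * (d[max(i - 1, 0)] + d[i]) for i in range(len(even))]
    return s, d

approx, detail = dwt53_lifting([float(v) for v in range(16)])
print("approx:", [round(v, 2) for v in approx])
print("detail:", [round(v, 2) for v in detail])   # near-zero details for a linear ramp
```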

Journal ArticleDOI
TL;DR: The hardware implementation of a spiking neuron model, which uses a spike-timing-dependent plasticity (STDP) rule that allows synaptic changes in discrete time steps, is described, and the serial implementation has also been realized.
Abstract: In this paper we describe the hardware implementation of a spiking neuron model which uses a spike-timing-dependent plasticity (STDP) rule that allows synaptic changes in discrete time steps. For this purpose an integrate-and-fire neuron with recurrent local connections is used. The connectivity of this model has been set to 24 neighbours, so there is a high degree of parallelism. After obtaining good results with the hardware implementation of the model, we proceed to simplify this hardware description while trying to keep the same behaviour. Experiments using dynamic grading patterns have been carried out in order to test the learning capabilities of the model. Finally, the serial implementation has been realized.
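
The hardware description itself is not shown in the abstract; as a behavioural illustration only, a discrete-time integrate-and-fire neuron with a step-wise STDP weight update could be sketched as follows (all constants, input rates and the learning window are made up).

```python
import random

random.seed(0)

N_INPUTS = 24            # mirrors the 24-neighbour connectivity mentioned; value illustrative
THRESHOLD = 1.0
LEAK = 0.9
DW = 0.01                # discrete STDP weight step
STDP_WINDOW = 5          # time steps within which pre/post spikes interact

weights = [0.1] * N_INPUTS
last_pre_spike = [-10**9] * N_INPUTS
potential = 0.0

for t in range(200):
    pre_spikes = [random.random() < 0.05 for _ in range(N_INPUTS)]
    for i, spiked in enumerate(pre_spikes):
        if spiked:
            potential += weights[i]
            last_pre_spike[i] = t

    potential *= LEAK                       # leaky integration
    if potential >= THRESHOLD:              # postsynaptic spike
        potential = 0.0
        for i in range(N_INPUTS):
            # discrete STDP: recent pre-before-post spikes are potentiated,
            # all other synapses are slightly depressed
            if t - last_pre_spike[i] <= STDP_WINDOW:
                weights[i] = min(1.0, weights[i] + DW)
            else:
                weights[i] = max(0.0, weights[i] - DW)

print("final weights (first 6):", [round(w, 3) for w in weights[:6]])
```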

Proceedings Article
01 Jan 2004
TL;DR: This work discusses three optimizations: a lookahead mechanism that allows multiple tasks to be processed concurrently at each grid server and thereby increases the overall degree of parallelism, a lazy task-binding technique that reduces interactions between grid servers and the task dispatcher, and dynamic improvements that optimize the collecting of results and the work-load balancing.
Abstract: Skeletons are common patterns of parallelism, such as farm and pipeline, that can be abstracted and offered to the application programmer as programming primitives. We describe the use and implementation of skeletons on emerging computational grids, with the skeleton system Lithium, based on Java and RMI, as our reference programming system. Our main contribution is the exploration of optimization techniques for implementing skeletons on grids based on an optimized, future-based RMI mechanism, which we integrate into the macro-dataflow evaluation mechanism of Lithium. We discuss three optimizations: 1) a lookahead mechanism that allows multiple tasks to be processed concurrently at each grid server and thereby increases the overall degree of parallelism, 2) a lazy task-binding technique that reduces interactions between grid servers and the task dispatcher, and 3) dynamic improvements that optimize the collecting of results and the work-load balancing. We report experimental results that demonstrate the improvements due to our optimizations on various testbeds, including a heterogeneous grid-like environment.

Journal ArticleDOI
TL;DR: This paper reports significant progress towards removing the bound on the degree of parallelism, by means anticipated in [5], namely by “internalising” protocol roles within the “intruder” process.
Abstract: We carry forward the work described in our previous papers [5,18,20] on the application of data independence to the model checking of security protocols using CSP [19] and FDR [10]. In particular, we showed how techniques based on data independence [12,19] could be used to justify, by means of a finite FDR check, systems where agents can perform an unbounded number of protocol runs. Whilst this allows for a more complete analysis, there was one significant incompleteness in the results we obtained: while each individual identity could perform an unlimited number of protocol runs sequentially, the degree of parallelism remained bounded (and small to avoid state space explosion). In this paper, we report significant progress towards the solution of this problem, by means anticipated in [5], namely by “internalising” protocol roles within the “intruder” process. The internalisation of protocol roles (initially only server-type roles) was introduced in [20] as a state-space reduction technique (for which it is usually spectacularly successful). It was quickly noticed that this had the beneficial side-effect of making the internalised server arbitrarily parallel, at least in cases where it did not generate any new values of data independent type. We now consider the case where internal roles do introduce fresh values and address the issue of capturing their state of mind (for the purposes of analysis).

Proceedings ArticleDOI
23 May 2004
TL;DR: This paper describes an efficient structure to implement a system consisting of an M-channel synthesis filterbank followed by an L-channel analysis filterbank that is very efficient in VLSI, FPGA or parallel processor implementation.
Abstract: This paper describes an efficient structure to implement a system consisting of an M-channel synthesis filterbank followed by an L-channel analysis filterbank (where M is a multiple of L or L is a multiple of M). The structure is very efficient for VLSI, FPGA or parallel processor implementation, requiring less area or fewer logic blocks, consuming less power and extending the degree of parallelism. The proposed method is applicable in situations where subband-based processing or encoding follows another subband-based processing or decoding stage and the intermediate synthesized signal is not a desired signal in itself.

Book ChapterDOI
22 Sep 2004
TL;DR: HTAs are implemented as a MATLAB toolbox, overloading conventional operators and array functions such that HTA operations appear to the programmer as extensions of MATLAB.
Abstract: In this paper, we describe our experience in writing parallel numerical algorithms using Hierarchically Tiled Arrays (HTAs). HTAs are classes of objects that encapsulate parallelism. HTAs allow the construction of single-threaded parallel programs where a master process distributes tasks to be executed by a collection of servers holding the components (tiles) of the HTAs. The tiled and recursive nature of HTAs facilitates the development of algorithms with a high degree of parallelism as well as locality. We have implemented HTAs as a MATLAB toolbox, overloading conventional operators and array functions such that HTA operations appear to the programmer as extensions of MATLAB. We have successfully used it to write some widely used parallel numerical programs. The resulting programs are easier to understand and maintain than their MPI counterparts.

Proceedings ArticleDOI
22 Feb 2004
TL;DR: This work describes a methodology for the scheduling and allocation of hardware contexts in applications with a high degree of parallelism, within a Run-Time Reconfiguration (RTR) process for a reconfigurable FPGA.
Abstract: This work describes a methodology for the scheduling and allocation of hardware contexts in applications with a high degree of parallelism, within a Run-Time Reconfiguration (RTR) process for a reconfigurable FPGA. The scheduling approach is based on the distribution of hardware resources in the FPGA architecture. The scheduler is modeled as a Petri net, and the schedule yielding the best performance is selected. Hardware context allocation is based on the left-edge algorithm principle to rationalize resource use in the scheduling approach. The adapted algorithm assumes that pre-located areas in the architecture are used for loading the contexts.
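
The abstract mentions allocation based on the left-edge principle; a generic left-edge sketch (with hypothetical context lifetimes, not the paper's data or its FPGA-specific adaptation) assigns each context to the first area whose previous occupant has already finished.

```python
# Classic left-edge allocation: sort intervals by start time, then greedily
# reuse the first "area" (track) that is free again. Intervals are hypothetical
# (context, load_time, end_time) triples.
contexts = [("A", 0, 4), ("B", 1, 3), ("C", 3, 7), ("D", 4, 9), ("E", 5, 6)]

areas = []           # areas[i] = end time of the last context placed in area i
assignment = {}

for name, start, end in sorted(contexts, key=lambda c: c[1]):
    for i, busy_until in enumerate(areas):
        if busy_until <= start:      # area i is free again: reuse it
            areas[i] = end
            assignment[name] = i
            break
    else:                            # no free area: allocate a new one
        assignment[name] = len(areas)
        areas.append(end)

print(assignment)                    # {'A': 0, 'B': 1, 'C': 1, 'D': 0, 'E': 2}
print("areas needed:", len(areas))
```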

Proceedings ArticleDOI
11 Jul 2004
TL;DR: This work shows the design of the IFFT module corresponding to the baseband processing of an OFDM transmitter according to the IEEE 802.11a-g and Hiperlan/2 standards.
Abstract: This work shows the design of the IFFT module corresponding to the baseband processing of an OFDM transmitter according to the IEEE 802.11a-g and Hiperlan/2 standards. This module will be included in a future OFDM demonstrator, which will be implemented in a programmable logic device. We have used our own algorithm for IFFT computation, based on the recursive property called decimation. This algorithm offers optimal characteristics for hardware implementation: a high degree of parallelism and exactly the same interconnection pattern between any of the algorithm stages. A new point of view on the prototyping design flow and the verification process comes from the use of the latest generation of system-level design environments for DSPs on FPGAs. These environments, called visual data flows, are ideally suited for modeling DSP systems since they allow a high level of functional abstraction with different data types and operators.
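
The authors' decimation-based algorithm is not given in the abstract; as a generic reference point only (not their hardware-oriented, constant-geometry formulation), a recursive radix-2 decimation-in-time IFFT can be sketched as follows.

```python
import cmath

def ifft(x):
    """Recursive radix-2 IFFT (decimation in time); length must be a power of two.
    Textbook formulation, unnormalised."""
    n = len(x)
    if n == 1:
        return list(x)
    even = ifft(x[0::2])
    odd = ifft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(2j * cmath.pi * k / n) * odd[k]   # +j sign for the inverse transform
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def inverse_fft(x):
    return [v / len(x) for v in ifft(x)]   # apply the 1/N normalisation once

# Round-trip check against a direct DFT of a small vector.
signal = [1, 2, 3, 4, 0, 0, 0, 0]
spectrum = [sum(s * cmath.exp(-2j * cmath.pi * k * n / len(signal))
                for n, s in enumerate(signal)) for k in range(len(signal))]
recovered = inverse_fft(spectrum)
print([round(v.real, 6) for v in recovered])   # recovers [1, 2, 3, 4, 0, 0, 0, 0] up to rounding
```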

Journal ArticleDOI
31 Jan 2004
TL;DR: The goal is to study the performance gain from parallel I/O under the constraints of different numbers of commodity storage devices in a Linux cluster, and evaluates two read performance optimisation techniques employed in CEFT-PVFS.
Abstract: In this work, we investigate parallel I/O efficiency in parallelised BLAST, the most popular tool for searching for similarity in biological databases, and implement two variations by incorporating the PVFS and CEFT-PVFS parallel I/O facilities. Our goal is to study the performance gain from parallel I/O under the constraints of different numbers of commodity storage devices in a Linux cluster. We also evaluate two read-performance optimisation techniques employed in CEFT-PVFS: (1) doubling the degree of parallelism is shown to give comparable read performance with respect to PVFS when both systems have the same number of servers; (2) skipping hot-spot nodes can reduce the performance penalty when I/O workloads are highly imbalanced. I/O resource contention between multiple applications running in the same cluster can degrade the performance of the original parallel BLAST and the PVFS version by up to 10- and 21-fold, respectively, whereas the version based on CEFT-PVFS, which has the ability to skip hot-spot nodes, suffered only a two-fold performance degradation.

Book ChapterDOI
20 Jun 2004
TL;DR: This work evaluates a parallel version of AMIGO (Advanced Multidimensional Interval Analysis Global Optimization) algorithm that makes an efficient use of all the available information in continuous differentiable problems to reduce the search domain and to accelerate the search.
Abstract: Interval global optimization based on the branch and bound (B&B) technique is a standard for searching for an optimal solution in the scope of continuous and discrete global optimization. It iteratively creates a search tree where each node represents a problem which is decomposed into several subproblems, provided that a feasible solution can be found by solving this set of subproblems. The enormous computational power needed to solve most B&B global optimization problems and their high degree of parallelism make them suitable candidates for a multiprocessing environment. This work evaluates a parallel version of the AMIGO (Advanced Multidimensional Interval Analysis Global Optimization) algorithm. AMIGO makes efficient use of all the available information in continuous differentiable problems to reduce the search domain and to accelerate the search. Our parallel version takes advantage of the capabilities offered by Charm++. Preliminary results show that our proposal is a good candidate for solving very hard global optimization problems.

Proceedings ArticleDOI
07 Sep 2004
TL;DR: A compiler-based methodology is introduced to speed up and simplify the customization of the data path platform for AS-DSPs, based on the SUIF compiler framework by Stanford University.
Abstract: In order to achieve high performance and low hardware overhead over application-specific integrated circuits (ASICs), application-specific DSPs (AS-DSPs) are more and more widely used. However, designing them is still a tedious, time-consuming and error-prone task since each application has to be analyzed thoroughly, which is usually done by hand. Recently, we proposed a platform approach to designing data paths for AS-DSPs. In this paper we introduce a compiler-based methodology to speed up and simplify the customization of the data path platform. Based on the SUIF compiler framework by Stanford University, we implemented analysis passes to determine the kind and useful number of functional units, the potential degree of parallelism, and the required connectivity between functional units.

Proceedings ArticleDOI
10 May 2004
TL;DR: From block factorizations for any nonsingular transform matrix, two types of parallel elementary reversible matrix (PERM) factorizations are introduced which are helpful for the parallelization of perfectly reversible integer transforms.
Abstract: Integer mapping is critical for lossless source coding, and such techniques have been used for image compression in the new international image compression standard, JPEG 2000. In this paper, starting from block factorizations of any nonsingular transform matrix, we introduce two types of parallel elementary reversible matrix (PERM) factorizations which are helpful for the parallelization of perfectly reversible integer transforms. With an improved degree of parallelism (DOP) and parallel performance, the cost of multiplication and addition can be reduced to O(log N) and O(log^2 N), respectively, for an N-by-N transform matrix. These results make PERM factorizations an effective means of developing parallel integer transforms for large matrices. We also present a scheme to block the matrix and allocate the load across processors for efficient transformation.

Proceedings ArticleDOI
25 Jul 2004
TL;DR: This work presents an adaptive construction of the bitonic balancing network, which tunes its width to the system size in a distributed and local way, and does this with the help of an efficient peer-to-peer lookup service.
Abstract: We present an adaptive construction of the bitonic balancing network. Our network tunes its width (the degree of parallelism) to the system size in a distributed and local way, and does this with the help of an efficient peer-to-peer lookup service. In contrast, all previously known constructions were static, and had the same width irrespective of the system size. Our technique is quite general: though we describe here the construction of the bitonic balancing network, this could be used in the adaptive construction of any distributed data structure which can be decomposed in a recursive manner.