
Showing papers on "Degree of parallelism published in 2003"


Journal ArticleDOI
TL;DR: A word-based version of MM is presented and used to explain the main concepts in the hardware design and gives enough freedom to select the word size and the degree of parallelism to be used, according to the available area and/or desired performance.
Abstract: This paper presents a scalable architecture for the computation of modular multiplication, based on the Montgomery multiplication (MM) algorithm. A word-based version of MM is presented and used to explain the main concepts in the hardware design. The proposed multiplier is able to work with any precision of the input operands, limited only by memory or control constraints. Its architecture gives enough freedom to select the word size and the degree of parallelism to be used, according to the available area and/or desired performance. Design trade-offs are analyzed in order to identify adequate hardware configurations for a given area or bandwidth requirement.
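The word-based MM idea can be sketched in software. Below is a minimal word-serial Montgomery multiplication (a toy model, not the paper's hardware design); the word width `w` and word count `e` stand in for the architecture's word-size and precision parameters.

```python
def mont_mul_word(a, b, n, w, e):
    """Word-serial Montgomery multiplication: processes a in e words of
    w bits each. Returns a*b*2^(-w*e) mod n; requires n odd, a,b < n,
    and 2^(w*e) > n."""
    R_word = 1 << w
    n_prime = pow(-n, -1, R_word)          # -n^(-1) mod 2^w
    t = 0
    for i in range(e):
        a_i = (a >> (w * i)) & (R_word - 1)  # i-th word of a
        t += a_i * b
        m = (t * n_prime) % R_word           # makes t + m*n divisible by 2^w
        t = (t + m * n) >> w
    if t >= n:                               # single conditional subtraction
        t -= n
    return t
```

Two calls compose into a full modular multiply: map operands into the Montgomery domain with `x * R % n` (where `R = 2^(w*e)`), multiply there, and map back by multiplying with 1.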

242 citations


Proceedings ArticleDOI
R. Mishra1, N. Rastogi1, Dakai Zhu1, Daniel Mosse1, Rami Melhem1 
22 Apr 2003
TL;DR: To consider the run-time behavior of tasks, an on-line dynamic power management technique is proposed to further explore the idle periods of processors and it is found that this static technique can save an average of 10% more energy than the simple static power management.
Abstract: Power management has become popular in mobile computing as well as in server farms. Although a lot of work has been done to manage the energy consumption on uniprocessor real-time systems, there is less work done on their multicomputer counterparts. For a set of real-time tasks with precedence constraints executing on a distributed system, we propose new static and dynamic power management schemes. Assuming a given static schedule generated from any list scheduling heuristic algorithm, our static power management scheme uses the static slack (if any) based on the degree of parallelism in the schedule. To consider the run-time behavior of tasks, an on-line dynamic power management technique is proposed to further explore the idle periods of processors. By comparing our static technique with the simple static power management, where the static slack is distributed to the schedule proportionally, we find that our static scheme can save an average of 10% more energy. When combined with dynamic schemes, our schemes significantly improve energy savings.
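As a rough illustration of slack-based frequency scaling (a toy model, not the authors' actual scheme), one can weight each schedule segment by its degree of parallelism when distributing static slack, since slowing a segment in which more processors are active saves more energy under the usual convex power/frequency relationship:

```python
def allocate_slack(segments, deadline):
    """Toy static slack allocation: weight each segment by
    exec_time * degree_of_parallelism and stretch it into its share of
    the static slack. segments: list of (exec_time_at_fmax, parallelism).
    Returns per-segment normalized frequencies f/fmax in (0, 1]."""
    busy = sum(t for t, _ in segments)
    slack = max(0.0, deadline - busy)
    weight = sum(t * p for t, p in segments)
    freqs = []
    for t, p in segments:
        extra = slack * (t * p) / weight if weight else 0.0
        freqs.append(t / (t + extra))      # stretched time = t / f
    return freqs
```

With a 10-time-unit deadline and two 4-unit segments of parallelism 1 and 4, the highly parallel segment receives most of the slack and thus the lower frequency, while the stretched schedule still exactly meets the deadline.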

191 citations


Journal ArticleDOI
TL;DR: C-SAT is presented, a SAT-based procedure capable of dealing with planning domains having incomplete information about the initial state, and whose underlying transition system is specified using the highly expressive action language C.

82 citations


Patent
04 Aug 2003
TL;DR: In this paper, the authors measure the degree of parallelism achieved in executing program instructions and use this to dynamically control the clock speeds and supply voltage levels applied to different processor cores 4, 6 so as to reduce the overall amount of energy consumed by matching the processing performance achieved to the clock speed and voltage levels used.
Abstract: A multi-processing system 2 measures the degree of parallelism achieved in executing program instructions and uses this to dynamically control the clock speeds and supply voltage levels applied to different processor cores 4, 6 so as to reduce the overall amount of energy consumed by matching the processing performance achieved to the clock speeds and voltage levels used.

79 citations


Journal ArticleDOI
TL;DR: A domain-specific modeling technique for energy-efficient kernel design that exploits the knowledge of the algorithm and the target architecture family for a given kernel to develop a high-level model that is used to quickly obtain fairly accurate estimate of the system-wide energy dissipation of data paths configured using FPGAs.
Abstract: Reconfigurable architectures such as FPGAs are flexible alternatives to DSPs or ASICs used in mobile devices for which energy is a key performance metric. Reconfigurable architectures offer several design parameters such as operating frequency, precision, amount of memory, degree of parallelism, etc. These parameters define a large design space that must be explored to find energy-efficient solutions. It is also challenging to predict the energy variation at the early design phases when a design is modified at the algorithm level. Efficient traversal of such a large design space requires high-level modeling to facilitate rapid estimation of system-wide energy. However, FPGAs do not exhibit a high-level structure like, for example, a RISC processor for which high-level as well as low-level energy models are available. To address this scenario, we propose a domain-specific modeling technique for energy-efficient kernel design that exploits the knowledge of the algorithm and the target architecture family for a given kernel to develop a high-level model. This model captures architecture and algorithm features, parameters affecting energy performance, and power estimation functions based on these parameters. A system-wide energy function is derived based on the power functions and the cycle-specific power state of each building block of the architecture. This model is used to understand the impact of various parameters on system-wide energy and can be a basis for the design of energy-efficient algorithms. Our high-level model is used to quickly obtain fairly accurate estimates of the system-wide energy dissipation of data paths configured using FPGAs. We demonstrate our modeling methodology by applying it to four domains.
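The shape of such a system-wide energy function is easy to illustrate (a toy sketch, not the paper's actual model): sum, over every cycle and every building block, the power of the block's power state in that cycle, times the cycle time.

```python
def system_energy(power, schedule, t_cycle):
    """Toy system-wide energy function.
    power: {component: {state: watts}} - per-block power estimation.
    schedule: one {component: state} dict per cycle (cycle-specific
    power states). Returns total energy in joules."""
    joules = 0.0
    for cycle in schedule:
        for comp, state in cycle.items():
            joules += power[comp][state] * t_cycle
    return joules
```

The component names and states here are illustrative; in the paper's methodology the power functions themselves depend on design parameters such as precision and degree of parallelism.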

19 citations


Book ChapterDOI
26 Aug 2003
TL;DR: With recent advances in both hardware and software, it is now possible to create high quality images at interactive rates on commodity PC clusters.
Abstract: Due to its practical significance and its high degree of parallelism, ray tracing has always been an attractive target for research in parallel processing. With recent advances in both hardware and software, it is now possible to create high quality images at interactive rates on commodity PC clusters.

18 citations


Journal ArticleDOI
01 Apr 2003
TL;DR: A parallel implementation of an interior point algorithm for solving sparse convex quadratic programs with bound constraints using an iterative approach based on the conjugate gradient method and on a block diagonal preconditioning technique to obtain an efficient parallel interior point solver for general sparse problems.
Abstract: This paper deals with a parallel implementation of an interior point algorithm for solving sparse convex quadratic programs with bound constraints. The parallelism is introduced at the linear algebra level. Concerning the solution of the linear system arising at each step of the considered algorithm, we use an iterative approach based on the conjugate gradient method and on a block diagonal preconditioning technique. Moreover, we apply an incomplete Cholesky factorization with limited memory within each block, in order to combine the high degree of parallelism of diagonal preconditioning techniques with the greater effectiveness of incomplete factorization procedures. The goal is to obtain an efficient parallel interior point solver for general sparse problems. Results of computational experiments carried out on an IBM SP parallel system by using randomly generated very sparse problems without a particular structure are presented. These results show that the considered inner iterative approach yields a consistent reduction in CPU time as the number of processors increases.
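The inner solver's structure can be sketched as a preconditioned conjugate gradient loop (a minimal dense-matrix sketch, not the paper's sparse parallel implementation). The preconditioner is passed in as a function `M_inv`; a (block-)diagonal choice keeps that application embarrassingly parallel, which is the property the paper exploits.

```python
def pcg(A, b, M_inv, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient for a symmetric positive
    definite matrix A (list of lists). M_inv(r) applies the inverse of
    the preconditioner to a residual vector."""
    n = len(b)
    matvec = lambda M, v: [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n
    r = b[:]                       # residual of the zero initial guess
    z = M_inv(r)
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            break
        z = M_inv(r)               # the parallel-friendly step
        rz_new = dot(r, z)
        beta = rz_new / rz
        rz = rz_new
        p = [zi + beta * pi for zi, pi in zip(z, p)]
    return x
```

With a diagonal `M_inv` such as `lambda r: [r[i] / A[i][i] for i in range(len(r))]`, each component is independent; the paper refines each diagonal block with a limited-memory incomplete Cholesky factorization instead.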

17 citations


Proceedings ArticleDOI
01 Sep 2003
TL;DR: The application characteristics of some representative video pixel processing functions are studied and it is shown with an example that these properties can be exploited to make specialized programmable processors.
Abstract: Media processing system-on-chips (SoCs) mainly consist of audio encoding/decoding (e.g. AC-3, MP3), video encoding/decoding (e.g. H263, MPEG-2) and video pixel processing functions (e.g. de-interlacing, noise reduction). Video pixel processing functions have very high computational demands, as they require a large number of computations on large amounts of data (note that the data are pixels of completely decoded pictures). In this paper, we focus on video pixel processing functions. Usually, these functions are implemented in dedicated hardware. However, flexibility (by means of programmability or reconfigurability) is needed to introduce the latest innovative algorithms, to allow differentiation of products, and to allow bug fixing after fabricating chips. It is impossible to fulfill the computational requirements of these functions with current programmable media processors. To achieve efficient implementations for flexible solutions, we study, in this paper, the application characteristics of some representative video pixel processing functions. The characteristics considered are granularity of operations, amount and kind of data accesses, and degree of parallelism present in these functions. We observe that, from a computational-granularity point of view, many functions can be expressed in terms of kernels, e.g. Median3 (i.e. median of three values), finite impulse response (FIR) filters, table lookups (LUT), etc., that are coarser grain than ALU, Mult, MAC, etc. Regarding the kind of data accesses, we categorize these functions as having regular, regular with some data rearrangement, or irregular data access patterns. Furthermore, the degree of parallelism present in these functions is expressed in terms of data level parallelism (DLP) and instruction/operation level parallelism (ILP). We show with an example that these properties can be exploited to make specialized programmable processors.
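The three kernel families the paper names are small enough to sketch directly (reference versions for illustration; the paper's point is that hardware can implement each as a single coarse-grained operation):

```python
def median3(a, b, c):
    """Median-of-three kernel, coarser grain than a single ALU op."""
    return max(min(a, b), min(max(a, b), c))

def fir(samples, taps):
    """Direct-form FIR filter kernel (zero-padded at the start)."""
    out = []
    for i in range(len(samples)):
        acc = 0
        for j, t in enumerate(taps):
            if i - j >= 0:
                acc += t * samples[i - j]
        out.append(acc)
    return out

def lut_apply(pixels, table):
    """Table-lookup kernel, e.g. for gamma correction."""
    return [table[p] for p in pixels]
```

Note the data-level parallelism: every output pixel of `fir` and `lut_apply` is independent of the others, which is exactly what a specialized programmable processor can exploit.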

13 citations


01 Jul 2003
TL;DR: The implementation and evaluation of a Queue Java compiler (QJAVAC) is described, part of a larger research project at the laboratory targeting highly parallel Queue Java bytecodes without much need for parallelism scheduling.
Abstract: In this paper, we describe the implementation and evaluation of a Queue Java compiler (QJAVAC), part of a larger research project at our laboratory targeting highly parallel Queue Java bytecodes without much need for parallelism scheduling. We also describe a new type of syntax tree, the Queue Abstract Syntax Tree (QAST), which is used for optimized Queue Java Virtual Machine (QJVM) instruction generation. With the QJAVAC compiler, we have successfully compiled Java source code to QJVM byte code. The achieved average degree of parallelism is about 2.11 times greater than that of general Java byte code.

11 citations


Book ChapterDOI
26 Aug 2003
TL;DR: RoCL is a communication library that aims to exploit the low-level communication facilities of today’s cluster networking hardware and to merge, via the resource oriented paradigm, those facilities and the high-level degree of parallelism achieved on SMP systems through multi-threading.
Abstract: RoCL is a communication library that aims to exploit the low-level communication facilities of today’s cluster networking hardware and to merge, via the resource oriented paradigm, those facilities and the high-level degree of parallelism achieved on SMP systems through multi-threading.

10 citations


Proceedings ArticleDOI
01 Jan 2003
TL;DR: Two read performance optimization techniques in CEFT-PVFS are examined and performance results indicate that doubling the degree of parallelism boosts the read performance to approach that of PVFS; and skipping hot-spots can substantially improve the I/O performance when the load on data servers is highly imbalanced.
Abstract: In this paper we analyze the I/O access patterns of a widely-used biological sequence search tool and implement two variations that employ parallel I/O for data access based on PVFS (Parallel Virtual File System) and CEFT-PVFS (cost-effective fault-tolerant PVFS). Experiments show that the two variations outperform the original tool when equal or even fewer storage devices are used in the former. It is also found that although the performance of the two variations improves consistently when initially increasing the number of servers, this performance gain from parallel I/O becomes insignificant with further increases in server number. We examine the effectiveness of two read performance optimization techniques in CEFT-PVFS by using this tool as a benchmark. Performance results indicate: (1) doubling the degree of parallelism boosts the read performance to approach that of PVFS; and (2) skipping hot-spots can substantially improve the I/O performance when the load on data servers is highly imbalanced. The I/O resource contention due to the sharing of server nodes by multiple applications in a cluster has been shown to degrade the performance of the original tool and the variation based on PVFS by up to 10- and 21-fold, respectively, whereas the variation based on CEFT-PVFS only suffered a two-fold performance degradation.
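Round-robin striping is what makes "doubling the degree of parallelism" meaningful for large reads. A minimal sketch of the offset-to-server mapping used by PVFS-style parallel file systems (an illustration of the general layout, not CEFT-PVFS's actual on-disk format):

```python
def stripe_map(offset, stripe_size, n_servers):
    """Map a file byte offset to (server_index, server_local_offset)
    under round-robin striping: consecutive stripes of stripe_size
    bytes cycle through the data servers."""
    stripe = offset // stripe_size
    server = stripe % n_servers
    local = (stripe // n_servers) * stripe_size + offset % stripe_size
    return server, local
```

A client reading `n_servers * stripe_size` consecutive bytes touches every server exactly once, so adding servers directly raises the degree of parallelism of the read.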

Journal Article
TL;DR: In this article, a two-phase clustering algorithm is introduced as a preprocessing step to an existing hardware/software partitioning and scheduling system, which increases the granularity in the partition design, resulting in a higher degree of parallelism and a better mapping to the reconfigurable resource.
Abstract: To achieve a good performance when implementing applications in codesign systems, partitioning and scheduling are important steps. In this paper, a two-phase clustering algorithm is introduced as a preprocessing step to an existing hardware/software partitioning and scheduling system. This preprocessing step increases the granularity in the partition design, resulting in a higher degree of parallelism and a better mapping to the reconfigurable resource. This cluster-driven approach shows improvements in both the makespan of the implementation and the CPU runtime.

Book ChapterDOI
01 Sep 2003
TL;DR: A two-phase clustering algorithm is introduced as a preprocessing step to an existing hardware/software partitioning and scheduling system, resulting in a higher degree of parallelism and a better mapping to the reconfigurable resource.
Abstract: To achieve a good performance when implementing applications in codesign systems, partitioning and scheduling are important steps. In this paper, a two-phase clustering algorithm is introduced as a preprocessing step to an existing hardware/software partitioning and scheduling system. This preprocessing step increases the granularity in the partition design, resulting in a higher degree of parallelism and a better mapping to the reconfigurable resource. This cluster-driven approach shows improvements in both the makespan of the implementation, and the CPU runtime.

Journal ArticleDOI
TL;DR: An integration of task-graph parallelism in OpenMP is presented by extending the parallel sections constructs to include task-index and precedence-relations matrix clauses, and precedence relations are described through simple programmer annotations, with implementation details handled by the system.
Abstract: In a wide variety of scientific parallel applications, both task and data parallelism must be exploited to achieve the best possible performance on a multiprocessor machine. These applications induce task-graph parallelism with coarse-grain granularity. Nevertheless, using the available task-graph parallelism and combining it with data parallelism can increase the performance of parallel applications considerably since an additional degree of parallelism is exploited. The OpenMP standard supports data parallelism but does not support task-graph parallelism. In this paper we present an integration of task-graph parallelism in OpenMP by extending the parallel sections constructs to include task-index and precedence-relations matrix clauses. There are many ways in which task-graph parallelism can be supported in a programming environment. A fundamental design decision is whether the programmer has to write programs with explicit precedence relations, or if the responsibility of precedence relations generation is delegated to the compiler. One of the benefits provided by parallel programming models like OpenMP is that they liberate the programmer from dealing with the underlying details of communication and synchronization, which are cumbersome and error-prone tasks. If task-graph parallelism is to find acceptance, writing task-graph parallel programs must be no harder than writing data parallel programs, and therefore, in our design, precedence relations are described through simple programmer annotations, with implementation details handled by the system. This paper concludes with a description of several parallel application kernels that were developed to study the practical aspects of task-graph parallelism in OpenMP. The examples demonstrate that exploiting data and task parallelism in a single framework is the key to achieving good performance in a variety of applications.
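The precedence-relations matrix idea can be sketched with a toy runtime (Python threads standing in for OpenMP sections; the function and parameter names below are illustrative, not the paper's proposed clauses):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def run_task_graph(tasks, precedes):
    """Toy task-graph executor. tasks: list of callables.
    precedes[i][j] == 1 means task i must finish before task j starts
    (the precedence-relations matrix). Each task waits on its
    predecessors' completion events, so independent tasks run in
    parallel."""
    n = len(tasks)
    done = [threading.Event() for _ in range(n)]
    results = [None] * n

    def run(j):
        for i in range(n):
            if precedes[i][j]:
                done[i].wait()          # block until predecessor i finishes
        results[j] = tasks[j]()
        done[j].set()

    with ThreadPoolExecutor(max_workers=n) as pool:
        for j in range(n):
            pool.submit(run, j)
    return results
```

In the paper's design the programmer only annotates the precedences; synchronization like the events above is generated by the system.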

Proceedings ArticleDOI
09 Mar 2003
TL;DR: It is shown that, in the case where the function involves a Fourier transform, the degree of parallelism in the program generated by automatic differentiation can be increased leading to a rich set of automatic parallelism strategies that are not available when employing a black box automatic parallelization approach.
Abstract: For functions given in the form of a computer program, automatic differentiation is an efficient technique to accurately evaluate the derivatives of that function. Starting from a given computer program, automatic differentiation generates another program for the evaluation of the original function and its derivatives in a fully mechanical way. While the efficiency of this black box approach is already high as compared to numerical differentiation based on divided differences, automatic differentiation can be applied even more efficiently by taking into account high-level knowledge about the given computer program. We show that, in the case where the function involves a Fourier transform, the degree of parallelism in the program generated by automatic differentiation can be increased, leading to a rich set of automatic parallelization strategies that are not available when employing a black box automatic parallelization approach. Experiments with the new automatic parallelization approach are reported on a SunFire 6800 server using up to 20 processors.

Book ChapterDOI
Chih-Ping Chen1
15 Sep 2003
TL;DR: The Intel® Debugger achieves better startup time and user response time than conventional parallel debuggers by setting up a tree-like debugger network, which has a higher degree of parallelism and scalability than a flat network.
Abstract: In addition to being a quality symbolic debugger for serial IA32 and IPF Linux applications written in C, C++, and Fortran, the Intel® Debugger is also capable of debugging parallel applications of Pthreads, OpenMP, and MPI. When debugging a MPI application, the Intel® Debugger achieves better startup time and user response time than conventional parallel debuggers by (1) setting up a tree-like debugger network, which has a higher degree of parallelism and scalability than a flat network, and (2) employing a message aggregation mechanism to reduce the amount of data flowing in the network. This parallel debugging architecture can be further enhanced to support the debugging of mixed-mode and heterogeneous parallel applications. Moreover, a generalized version of this architecture can be applied in areas other than debugging, such as performance profiling of parallel applications.
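The combination of a tree-shaped debugger network and message aggregation can be modeled in a few lines (a toy sketch, not the Intel® Debugger's protocol): leaf debuggers report messages, and each internal node merges identical messages from its subtree into one (message, count) pair before forwarding upward, so the root handles on the order of the number of distinct messages instead of the number of processes.

```python
from collections import Counter

def aggregate(node, children, leaf_msg):
    """Recursively collect messages up a debugger tree.
    children: {node: [child, ...]} (missing or empty -> leaf).
    leaf_msg: {leaf: message}. Returns a Counter of message -> count."""
    if not children.get(node):
        return Counter([leaf_msg[node]])
    merged = Counter()
    for child in children[node]:
        merged += aggregate(child, children, leaf_msg)  # merge duplicates
    return merged
```

In a real deployment each subtree would run on a separate host and the recursive calls would proceed in parallel, which is where the higher degree of parallelism over a flat network comes from.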

Journal ArticleDOI
TL;DR: A meta-heuristic developed for this specific problem combines simulated annealing and hill climbing (SA-HC) to search for the optimum data distribution and degree of parallelism when parallelizing a sequential program for distributed-memory machines.
Abstract: In this study, a global optimization meta-heuristic is developed for the problem of determining the optimum data distribution and degree of parallelism in parallelizing a sequential program for distributed memory machines. The parallel program is considered as the union of consecutive stages and the method deals with all the stages in the entire program rather than proposing solutions for each stage. The meta-heuristic developed here for this specific problem combines simulated annealing and hill climbing (SA-HC) in the search for the optimum configuration. Performance is tested in terms of the total execution time of the program including communication and computation times. Two exemplary codes from the literature, the first being computation intensive and the second being communication intensive, are utilized in the experiments. The performance of the SA-HC algorithm provides satisfactory results for these illustrative examples.
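The SA-HC hybrid can be sketched generically (a toy illustration of the combination, not the paper's tuned algorithm or its cost model): simulated annealing explores the configuration space globally, then hill climbing refines the best configuration found by accepting only improvements.

```python
import math
import random

def sa_hc(initial, neighbor, cost, t0=10.0, cooling=0.95, steps=500, seed=0):
    """Toy simulated-annealing + hill-climbing hybrid.
    neighbor(x, rng) proposes a nearby configuration; cost(x) is the
    objective (e.g. total execution time). Returns (best, best_cost)."""
    rng = random.Random(seed)
    cur, cur_c = initial, cost(initial)
    best, best_c = cur, cur_c
    t = t0
    for _ in range(steps):                     # SA phase: global search
        cand = neighbor(cur, rng)
        c = cost(cand)
        if c < cur_c or rng.random() < math.exp((cur_c - c) / t):
            cur, cur_c = cand, c               # accept (possibly worse) move
            if c < best_c:
                best, best_c = cand, c
        t *= cooling                           # geometric cooling schedule
    for _ in range(steps):                     # HC phase: local refinement
        cand = neighbor(best, rng)
        c = cost(cand)
        if c < best_c:                         # accept only improvements
            best, best_c = cand, c
    return best, best_c
```

In the paper's setting a "configuration" would encode the data distribution and degree of parallelism for every stage of the program, and the cost would be the modeled communication plus computation time.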

Proceedings ArticleDOI
27 Dec 2003
TL;DR: The study concludes that the higher the degree of parallelism, the smaller the effect of a fault condition on the module's maximum output power.
Abstract: This paper proposes two configurations for assembling cells inside a module. These techniques are discussed in detail to characterize the effect of cell configuration on module losses. The study concludes that the higher the degree of parallelism, the smaller the effect of a fault condition on the module's maximum output power. An alternative configuration realizing the same advantage of reduced losses is achieved by a series-parallel configuration.

Proceedings Article
01 Jan 2003
TL;DR: The hardware realization of a Hamming artificial neural network is presented, and its use in a high-speed precision alignment sys- tem is demonstrated, thus realizing a complex operation using a fast and low-power circuit.
Abstract: This paper presents the hardware realization of a Hamming artificial neural network, and demonstrates its use in a high-speed precision alignment sys- tem. High degree of parallelism is exploited in the proposed architecture, where the result of NxN array of sum of products is provided simultaneously. The full operation of the artificial neural network requires three clock cycles, which are shown to be completed within a few tens of nanoseconds, depending on the cho- sen architecture, thus realizing a complex operation using a fast and low-power circuit. Possible applications of the device include industrial image processing such as focus recovery, fast and precise alignment in a noisy environment, and vehicle navigation systems.
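Functionally, a Hamming network is a nearest-exemplar classifier, which a short sketch makes concrete (a software reference model, not the paper's circuit): the feedforward layer scores each stored exemplar by a sum of products with the bipolar input, and a winner-take-all stage picks the best match. All exemplar scores are independent, which is the parallelism the hardware computes simultaneously.

```python
def hamming_net(exemplars, x):
    """Toy Hamming network. exemplars and x are bipolar (+1/-1) vectors;
    the sum-of-products score is equivalent to (matches - mismatches),
    so the winner is the exemplar with minimum Hamming distance to x.
    Returns the index of the winning exemplar."""
    scores = [sum(ei * xi for ei, xi in zip(e, x)) for e in exemplars]
    return max(range(len(scores)), key=scores.__getitem__)
```

In hardware the scoring layer is the NxN sum-of-products array and the winner-take-all stage corresponds to the MAXNET iterations completed in the remaining clock cycles.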

01 Jan 2003
TL;DR: The basic approach was the implementation of cryptographic algorithms on high-end, state-of-the-art DSP chips in order to study the various parameters that optimize the performance of the chip while keeping the overhead of encryption and decryption to a minimum.
Abstract: It is clear that cryptography is computationally intensive. It is also known that embedded systems have slow clock rates and less memory. The idea for this thesis was to study the possibilities for analysis of cryptography on embedded systems. The basic approach was the implementation of cryptographic algorithms on high-end, state-of-the-art DSP chips in order to study the various parameters that optimize the performance of the chip while keeping the overhead of encryption and decryption to a minimum. Embedded systems are very resource sensitive. An embedded system is composed of different components, which are implemented in both hardware and software. Therefore, hardware-software co-synthesis is a crucial factor affecting the performance of embedded systems. Encryption algorithms are generally classified as data-dominated systems rather than ubiquitous control-dominated systems. Data-dominated systems have a high degree of parallelism. Embedded systems populate the

Proceedings ArticleDOI
08 Dec 2003
TL;DR: This work considers parallelism in genetic algorithms while computing the fitness of the population individuals (chromosomes) and proposes a scheme that supports large data sets; that is, the larger the data size, the larger the degree of parallelism achieved.
Abstract: High volumes of data pose a challenge to the scalability of data mining algorithms. Dividing this data into equal partitions and processing it in parallel naturally becomes a choice. Peer-to-peer computing is a promising avenue for exploiting parallelism and maintaining scale-up capability. We consider parallelism in genetic algorithms while computing the fitness of the population individuals (chromosomes). This strategy has an edge over its counterpart, parallelism in genetic operators, because genetic operators tend to be computationally cheap. Simply speaking, this scheme supports large data sets; that is, the larger the data size, the larger the degree of parallelism achieved.
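The core of the strategy is easy to sketch (a toy illustration with threads standing in for peer-to-peer nodes, and with illustrative names): fitness evaluation, the expensive data-dependent step, is farmed out in parallel, while the cheap genetic operators stay serial.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_fitness(population, fitness, workers=4):
    """Evaluate each chromosome's fitness in parallel; results come
    back in population order. The genetic operators (selection,
    crossover, mutation) would remain serial, as the paper argues
    they are computationally cheap."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))
```

In the data-mining setting of the paper, `fitness` would scan a data partition for each chromosome, so more data directly means more parallel work per evaluation.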

Proceedings ArticleDOI
18 Sep 2003
TL;DR: A medium-grained parallel processing algorithm is presented in which every processing stage is done in parallel and the degree of parallelism is task-level; it is well suited to parallel computers with good communication capacity.
Abstract: With the development of SAR processing techniques, high image precision and a high real-time rate have become important indices, especially in the military field. This paper presents a medium-grained parallel processing algorithm in which every processing stage is done in parallel and the degree of parallelism is task-level. It is well suited to parallel computers with good communication capacity. Experiments on DAWNING3000 show that this parallel processing algorithm achieves good results in real-time rate and processing efficiency.

Book ChapterDOI
01 Sep 2003
TL;DR: To derive architecture instances from the template, a design environment called DEfInE is used, which integrates some existing academic and industrial tools with ReSArT-specific components, developed as a part of this work.
Abstract: This paper introduces the ReSArT (Reconfigurable Scalable Architecture Template). Based on a suitable design space model, ReSArT is parametrizable, scalable, and able to support all levels of parallelism. To derive architecture instances from the template, a design environment called DEfInE (Design Environment for ReSArT Instance Generation) is used, which integrates some existing academic and industrial tools with ReSArT-specific components developed as a part of this work. Different architecture instances were tested with a set of 10 benchmark applications as a proof of concept, achieving a maximum degree of parallelism of 30 and an average degree of parallelism of nearly 20 16-bit operations per cycle.

Journal ArticleDOI
TL;DR: The concept of disjoint faults is extended to reduce the number of tests to a time efficiency of Θ(N^(5/6)) for N×N DOMINs, and an algorithm is proposed to find the maximum number of disjoint faults.
Abstract: Dilated Optical Multistage Interconnection Networks (DOMINs) based on 2×2 directional coupler photonic switches play an important role in all-optical high-performance networks, especially for the emerging IP over DWDM architectures. The problem of crosstalk within photonic switches is underestimated: due to the aging of the switching element, control voltage, temperature, and polarization, it causes undesirable coupling of the signal from one path to the other. Previous work [18] designed an efficient algorithm for diagnosing disjoint faults in small networks, which, by overlapping tests with computations, halved the number of tests required in photonic switching networks. Furthermore, this paper generically derives algorithms and mathematical models to find the optimal degree of parallelism of fault diagnosis for N×N dilated blocking networks as the network size grows. Taking advantage of the properties of disjoint faults, diagnosis can be accelerated significantly because the optimal degree of parallel fault diagnosis may grow exponentially. To reduce the diagnosis time, an algorithm is proposed herein to find the maximum number of disjoint faults. Rather than requiring up to 4MN tests as in a naive approach, a two-phase diagnosis algorithm is proposed to reduce the testing requirement to 4N tests. This study extends the concept of disjoint faults to reduce the number of tests to a time efficiency of Θ(N^(5/6)) for N×N DOMINs.

Proceedings ArticleDOI
14 Oct 2003
TL;DR: An efficient multi-level dot diffusion halftoning program for a digital copier using a TMS320C6416 DSP and a lookup table based method is employed in order to avoid conditional branch operations.
Abstract: We have developed an efficient multi-level dot diffusion halftoning program for a digital copier using a TMS320C6416 DSP. Although this processor can execute several arithmetic operations in a cycle, the efficient use of the resources for the implementation of halftoning programs is difficult due to the sequential nature of the algorithm. The dot diffusion based algorithm is selected not only for better image quality, but also to exploit a higher degree of parallelism. The conventional dot diffusion computation procedure is modified in order to increase the regularity of arithmetic operations in the error diffusion process. Although this modification requires more arithmetic operations, the increase of the parallelism significantly shortens the overall processing time. As for multi-level quantization, a lookup table based method is employed in order to avoid conditional branch operations. This implementation can result in 30 PPM (page per minute) throughput for a 600 DPI A4 size digital copier.
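The branch-avoidance trick generalizes well beyond halftoning: precompute the quantizer output for every possible pixel value, so the per-pixel step is a single table read with no conditionals. A minimal sketch (illustrative, not the TMS320C6416 implementation):

```python
def build_quant_lut(levels, depth=256):
    """Precompute the nearest output level for every possible input
    value, so per-pixel multi-level quantization becomes a branch-free
    table lookup."""
    return [min(levels, key=lambda q: abs(q - v)) for v in range(depth)]

def quantize(pixels, lut):
    """Branch-free multi-level quantization via table lookup."""
    return [lut[p] for p in pixels]
```

On a VLIW DSP such as the one used in the paper, eliminating the conditional branches this way keeps the functional units' pipelines full; in the full algorithm the looked-up level also feeds the dot-diffusion error term.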

01 Jan 2003
TL;DR: In this paper, an approach to designing a parallel system based on a local area network is introduced, and primary factors such as single-computer performance, the degree of parallelism, and communication efficiency are analyzed.
Abstract: In order to address the problem of massive parallel processing and computing, this paper introduces an approach to designing a parallel system based on a local area network. It analyses performance measures such as parallel efficiency, speedup, and scaleup, as well as the primary factors that influence the performance of the parallel system, such as the performance of a single computer, the degree of parallelism, and the efficiency of communication. Finally, it also discusses several related problems, including skew, transmission bottlenecks, and symmetry.
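The performance measures the paper analyses have standard textbook definitions, which can be stated directly (these formulas are the conventional ones, not taken from the paper):

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T1 / Tp."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """Parallel efficiency E = S / p for p processors."""
    return speedup(t_serial, t_parallel) / p

def amdahl(serial_fraction, p):
    """Amdahl's law: upper bound on speedup with p processors when a
    fraction of the work is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)
```

For example, with a 10% serial fraction, even 10 processors bound the speedup at about 5.26, which is why communication efficiency and skew (both discussed in the paper) matter so much.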

Journal ArticleDOI
TL;DR: The main objectives of this paper are to introduce (a) optimisation of the time cost for such a parallel structure; and (b) allocation and scheduling techniques for solving the problems associated with the parallel structure (the number of groups is less than the number of shared memory modules, and/or the number of processors is less than the degree of parallelism).
Abstract: Finding the optimal accessing order to the critical section within a parallel structure reduces the overall execution time of the parallel program. A parallel computation model is used to represent the detailed time cost of the parallel structure (a portion of the program). Most previous research considered a parallel structure with one communication node, and a few considered a parallel structure with two communication nodes. The extension of these efforts considered a parallel structure with more than one group; one such extension was developed solely for the parallel structure with one communication node. In this paper another extension of the problem is proposed, that is, a parallel structure with two communication nodes. The main objectives of this paper are to introduce (a) optimisation of the time cost for such a parallel structure; and (b) allocation and scheduling techniques for solving the problems associated with the parallel structure (the number of groups is less than the number of shared memory modules, and/or the number of processors is less than the degree of parallelism).

Proceedings ArticleDOI
01 Jan 2003
TL;DR: Experiments show that the LEDS approach obtains 33.5% higher energy reduction than the previous method.
Abstract: A technique for energy minimization by dynamic voltage scheduling (DVS) for distributed real-time systems at the system level of design is proposed. By considering the variation of the power profile and the degree of parallelism simultaneously while allocating slack time, the proposed approach, named LEDS, explores the space of minimizing energy consumption. Experiments show that the LEDS approach obtains 33.5% higher energy reduction than the previous method.
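The key ingredient, allocating slack while accounting for each task's power profile, can be illustrated with a generic greedy heuristic. This is a first-order sketch (dynamic energy scaled as frequency squared, slack handed out in fixed steps), not the LEDS algorithm itself, whose details are not given in the abstract:

```python
def scaled_energy(wcet, power, extension):
    """Energy of a task stretched by `extension` time units under DVS.
    Frequency scales by f' = wcet / (wcet + extension), and dynamic
    energy per task is modeled as roughly proportional to f'^2."""
    f = wcet / (wcet + extension)
    return power * wcet * f * f

def greedy_slack_allocation(tasks, slack, step=1.0):
    """Hand out slack in small steps to whichever task currently saves
    the most energy per step -- a power-profile-aware heuristic.
    `tasks` is a list of (name, wcet, power) tuples."""
    ext = {name: 0.0 for name, _, _ in tasks}
    while slack >= step:
        best, best_gain = None, 0.0
        for name, wcet, power in tasks:
            gain = (scaled_energy(wcet, power, ext[name])
                    - scaled_energy(wcet, power, ext[name] + step))
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        ext[best] += step
        slack -= step
    return ext
```

Because energy savings grow with the task's power draw, the heuristic naturally routes slack toward high-power tasks, which is the intuition behind power-profile-aware slack allocation.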

01 Jan 2003
TL;DR: This paper identifies several conflicting objectives that must be satisfied by any reliable real-time scheduling scheme and proposes an objective function, which can then guide the scheduling algorithms.
Abstract: Real-time applications are composed of one or more tasks that are required to perform their functions under strict timing constraints, running on a system that consists of a set of homogeneous multiprocessors. These applications have to meet their deadlines amidst contradicting goals, while maximizing resource utilization. A task missing its deadline may result in a domino effect, possibly causing other tasks to miss their deadlines and resulting in a system failure. Scheduling real-time applications on multiprocessor systems is a very complex problem because of the multiple conflicting objectives that must be simultaneously achieved. Existing scheduling techniques for real-time applications on clusters do not: 1) consider the application task structure [2,4]; 2) handle fragmentation of the processing power appropriately; 3) make an effort to minimize communication among the tasks while retaining the degree of parallelism specified by the task structure of the application; 4) use an appropriate performance criterion (they use average execution time, which is not sufficiently accurate); and 5) consider reliability issues [3]. In this paper we identify several conflicting objectives that must be satisfied by any reliable real-time scheduling scheme. In order to consider these conflicting goals in an integrated fashion, we propose an objective function, which can then guide the scheduling algorithms. The main focus of our research is to achieve the required reliability by exploiting available resources and/or application conditions.
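One common way to fold such conflicting goals into a single objective that a scheduler can minimize is a weighted sum. The terms and weights below are purely illustrative assumptions; the paper's actual objective function is not given in the abstract:

```python
def schedule_objective(makespan, comm_cost, reliability, deadline_misses,
                       w_time=1.0, w_comm=0.5, w_rel=2.0, w_miss=10.0):
    """Illustrative scalar objective for a candidate schedule (lower is
    better): penalize completion time, communication, and deadline
    misses; reward reliability.  Weights express the relative
    importance of each conflicting goal."""
    return (w_time * makespan
            + w_comm * comm_cost
            - w_rel * reliability
            + w_miss * deadline_misses)
```

A scheduling heuristic can then compare candidate task-to-processor assignments by this score instead of by average execution time alone.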

Proceedings ArticleDOI
B.G. Patrick1, M. Jack1
20 Oct 2003
TL;DR: Using a simple job model characterized by sequential time-to-completion and degree of parallelism, it is demonstrated via simulation that in most cases, the uninformed strategy of equipartitioning outperforms marginal analysis with respect to system performance and without a commensurate degradation in system efficiency.
Abstract: Given n malleable and nonpreemptable parallel jobs that arrive for execution at time 0, we examine and compare two job scheduling strategies that allocate m identical processors among the n competing jobs. In all cases, n ≤ m. The first strategy is based on the heuristic paradigm of equipartitioning, and the second is based on the notion of marginal analysis. Equipartitioning uses no a priori information when processor allocations are made to parallel jobs. Marginal analysis, on the other hand, assumes full a priori information in order to maximize processor utility. We compare both strategies with respect to average time-to-completion (system performance) and overall time-to-completion (system efficiency). Using a simple job model characterized by sequential time-to-completion and degree of parallelism, it is demonstrated via simulation that in most cases, the uninformed strategy of equipartitioning outperforms marginal analysis with respect to system performance and without a commensurate degradation in system efficiency.
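Both allocation strategies are simple enough to sketch against the stated job model. The runtime model below (linear speedup up to each job's degree of parallelism) and the greedy form of marginal analysis are assumptions consistent with the abstract, not the paper's exact formulation:

```python
def completion(seq_time, par_degree, procs):
    """Runtime of a malleable job: linear speedup up to its degree of
    parallelism, no benefit from extra processors beyond that."""
    return seq_time / max(1, min(procs, par_degree))

def equipartition(jobs, m):
    """Split m processors as evenly as possible among the n jobs,
    using no a priori information about them."""
    n = len(jobs)
    alloc = [m // n] * n
    for i in range(m % n):
        alloc[i] += 1
    return alloc

def marginal_analysis(jobs, m):
    """Give every job one processor, then assign each remaining
    processor to the job whose runtime it reduces the most
    (full a priori knowledge of seq_time and par_degree)."""
    alloc = [1] * len(jobs)
    for _ in range(m - len(jobs)):
        gains = [completion(s, p, a) - completion(s, p, a + 1)
                 for (s, p), a in zip(jobs, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc
```

With jobs (100, 4) and (10, 4) on m = 4 processors, equipartitioning gives [2, 2] while marginal analysis gives [3, 1]: the informed strategy favors the long job, which is exactly the kind of skew the simulation study compares against even splitting.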