
Showing papers on "Speedup published in 2000"


Journal ArticleDOI
TL;DR: This paper presents a new approach that can further enhance the efficiency of ordinal optimization, which determines a highly efficient number of simulation replications or samples and significantly reduces the total simulation cost.
Abstract: Ordinal Optimization has emerged as an efficient technique for simulation and optimization. Exponential convergence rates can be achieved in many cases. In this paper, we present a new approach that can further enhance the efficiency of ordinal optimization. Our approach determines a highly efficient number of simulation replications or samples and significantly reduces the total simulation cost. We also compare several different allocation procedures, including a popular two-stage procedure in simulation literature. Numerical testing shows that our approach is much more efficient than all compared methods. The results further indicate that our approach can obtain a speedup factor of higher than 20 above and beyond the speedup achieved by the use of ordinal optimization for a 210-design example.
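The abstract states the goal (spend replications unevenly, concentrating them where designs are hard to separate) without giving the allocation rule. The minimal Python sketch below illustrates that idea only; the sampler interface, batch size, and priority heuristic are illustrative assumptions, not the paper's procedure.

import random, statistics

def allocate(designs, budget, init=10, batch=20):
    # designs: list of zero-argument samplers, one per design (hypothetical interface)
    samples = {i: [d() for _ in range(init)] for i, d in enumerate(designs)}
    spent = init * len(designs)
    while spent < budget:
        means = {i: statistics.mean(s) for i, s in samples.items()}
        stdevs = {i: statistics.stdev(s) for i, s in samples.items()}
        best = min(means, key=means.get)          # assume minimization
        def priority(i):
            if i == best:
                return max(stdevs.values())
            gap = means[i] - means[best]
            # crude rule: designs hard to separate from the best get more samples
            return stdevs[i] / (abs(gap) + 1e-9)
        target = max(samples, key=priority)
        for _ in range(batch):
            samples[target].append(designs[target]())
        spent += batch
    return min(samples, key=lambda i: statistics.mean(samples[i]))

# toy demo: three noisy designs with unknown means 0.0, 0.3, 1.0
designs = [lambda m=m: random.gauss(m, 1.0) for m in (0.0, 0.3, 1.0)]
print(allocate(designs, budget=600))  # usually selects design 0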

708 citations


Journal ArticleDOI
12 Nov 2000
TL;DR: It is demonstrated that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler, and that a small sample of the possible schedules is sufficient to identify a good schedule quickly.
Abstract: Simultaneous Multithreading machines fetch and execute instructions from multiple instruction streams to increase system utilization and speed up the execution of jobs. When there are more jobs in the system than there is hardware to support simultaneous execution, the operating system scheduler must choose the set of jobs to coschedule. This paper demonstrates that performance on a hardware multithreaded processor is sensitive to the set of jobs that are coscheduled by the operating system jobscheduler. Thus, the full benefits of SMT hardware can only be achieved if the scheduler is aware of thread interactions. Here, a mechanism is presented that allows the scheduler to significantly raise the performance of SMT architectures. This is done without any advance knowledge of a workload's characteristics, using sampling to identify jobs which run well together. We demonstrate an SMT jobscheduler called SOS. SOS combines an overhead-free sample phase which collects information about various possible schedules, and a symbiosis phase which uses that information to predict which schedule will provide the best performance. We show that a small sample of the possible schedules is sufficient to identify a good schedule quickly. On a system with random job arrivals and departures, response time is improved as much as 17% over a schedule which does not incorporate symbiosis.
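As a rough illustration of the sample-then-predict structure described above, here is a hedged Python sketch. The measure() callback stands in for reading hardware performance counters during a sampling quantum, and the additive per-coschedule predictor is a simplification of whatever symbiosis model SOS actually uses.

import random

def sos_schedule(jobs, slots, measure, n_samples=8):
    # Sample phase: try a few random coschedules; measure(group) -> observed IPC.
    best, best_ipc = None, float("-inf")
    for _ in range(n_samples):
        order = random.sample(jobs, len(jobs))
        schedule = [tuple(order[i:i + slots]) for i in range(0, len(order), slots)]
        # Symbiosis phase (simplified): predict whole-schedule goodness as the
        # sum of per-coschedule throughputs observed during sampling.
        ipc = sum(measure(group) for group in schedule)
        if ipc > best_ipc:
            best, best_ipc = schedule, ipc
    return best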

619 citations


Proceedings ArticleDOI
26 Mar 2000
TL;DR: This paper uses fluid model techniques to establish two results concerning the throughput of data switches: for an input-queued switch (with no speedup), a maximum weight algorithm for connecting inputs and outputs delivers a throughput of 100%, and for combined input- and output-queued switches that run at a speedup of 2, any maximal matching algorithm does the same.
Abstract: In this paper we use fluid model techniques to establish two results concerning the throughput of data switches. For an input-queued switch (with no speedup) we show that a maximum weight algorithm for connecting inputs and outputs delivers a throughput of 100%, and for combined input- and output-queued switches that run at a speedup of 2 we show that any maximal matching algorithm delivers a throughput of 100%. The only assumptions on the input traffic are that it satisfies the strong law of large numbers and that it does not oversubscribe any input or any output.
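For a small port count, the maximum weight matching behind the first result can be written down directly: the weights are virtual output queue (VOQ) occupancies and the switch picks the input-to-output permutation of greatest total weight. The brute-force search below is for illustration only; practical schedulers use polynomial-time matching algorithms.

import itertools

def max_weight_match(q):
    # q[i][j] = cells queued at input i destined to output j (VOQ lengths)
    n = len(q)
    best, best_w = None, -1
    for perm in itertools.permutations(range(n)):  # one crossbar configuration
        w = sum(q[i][perm[i]] for i in range(n))
        if w > best_w:
            best, best_w = perm, w
    return best  # input i is connected to output best[i]

print(max_weight_match([[3, 0, 1], [2, 5, 0], [0, 1, 4]]))  # -> (0, 1, 2)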

473 citations


Proceedings ArticleDOI
01 Aug 2000
TL;DR: An algorithm for mining long patterns in databases by using depth first search on a lexicographic tree of itemsets achieves more than one order of magnitude speedup over the recently proposed MaxMiner algorithm.
Abstract: In this paper we present an algorithm for mining long patterns in databases. The algorithm finds large itemsets by using depth-first search on a lexicographic tree of itemsets. The focus of this paper is to develop CPU-efficient algorithms for finding frequent itemsets in the cases when the database contains patterns which are very wide. We refer to this algorithm as DepthProject, and it achieves more than one order of magnitude speedup over the recently proposed MaxMiner algorithm for finding long patterns. These techniques may be quite useful for applications in areas such as computational biology in which the number of records is relatively small, but the itemsets are very long. This necessitates the discovery of patterns using algorithms which are especially tailored to the nature of such domains.
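The depth-first traversal of the lexicographic itemset tree can be sketched in a few lines of Python. This toy version enumerates all frequent itemsets by recursing on lexicographic extensions; DepthProject's actual contributions (database projection, lookahead pruning for maximal patterns) are omitted.

def dfs_frequent(prefix, items, covers, minsup, out):
    # covers: the transactions that contain every item in prefix
    for k, item in enumerate(items):
        sub = [t for t in covers if item in t]
        if len(sub) >= minsup:                      # support-based pruning
            out.append(prefix + [item])
            dfs_frequent(prefix + [item], items[k + 1:], sub, minsup, out)

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c", "d"}]
found = []
dfs_frequent([], sorted({i for t in transactions for i in t}), transactions, 2, found)
print(found)  # [['a'], ['a', 'b'], ['a', 'c'], ['b'], ['b', 'c'], ['c']]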

362 citations


Proceedings ArticleDOI
05 Nov 2000
TL;DR: A top-down hierarchical approach is used in Dragon2000 as mentioned in this paper to solve the large-scale cell placement problem effectively, and the results show that minimizing net-cut is more important than greedily obtaining a wirelength-optimal placement at intermediate hierarchical levels.
Abstract: In this paper, we develop a new standard cell placement tool, Dragon2000, to solve the large-scale placement problem effectively. A top-down hierarchical approach is used in Dragon2000. State-of-the-art partitioning tools are tightly integrated with wirelength minimization techniques to achieve superior performance. We argue that net-cut minimization is a good and important shortcut to solve the large-scale placement problem. Experimental results show that minimizing net-cut is more important than greedily obtaining a wirelength-optimal placement at intermediate hierarchical levels. We run Dragon2000 on the recently released large benchmark suite ISPD98 as well as MCNC circuits. For circuits with more than 100k cells, compared to iTools 1.4.0, Dragon2000 produces slightly better placement results (1.4%) while spending much less time (2× speedup). This is also the first published placement result on the publicly available large industrial circuits.

259 citations


Journal ArticleDOI
12 Nov 2000
TL;DR: This paper introduces new instructions to improve the performance of symmetric key cipher algorithms, and analyses of the original and optimized algorithms suggest future directions for the design of high-performance programmable cryptographic processors.
Abstract: The emergence of the Internet as a trusted medium for commerce and communication has made cryptography an essential component of modern information systems. Cryptography provides the mechanisms necessary to implement accountability, accuracy, and confidentiality in communication. As demands for secure communication bandwidth grow, efficient cryptographic processing will become increasingly vital to good system performance. In this paper, we explore techniques to improve the performance of symmetric key cipher algorithms. Eight popular strong encryption algorithms are examined in detail. Analysis reveals the algorithms are computationally complex and contain little parallelism. Overall throughput on a high-end microprocessor is quite poor: a 600 MHz processor is incapable of saturating a T3 communication line with 3DES (triple DES) encrypted data. We introduce new instructions that improve the efficiency of the analyzed algorithms. Our approach adds instruction set support for fast substitutions, general permutations, rotates, and modular arithmetic. Performance analysis of the optimized ciphers shows an overall speedup of 59% over a baseline machine with rotate instructions and a 74% speedup over a baseline without rotates. Even higher speedups are demonstrated with optimized substitutions (SBOXes) and additional functional unit resources. Our analyses of the original and optimized algorithms suggest future directions for the design of high-performance programmable cryptographic processors.

133 citations


Book ChapterDOI
17 Dec 2000
TL;DR: A simple but effective scheduling strategy is proposed that dynamically measures the execution times of tasks and uses this information to dynamically adjust the number of workers to achieve a desirable efficiency while minimizing the loss of speedup.
Abstract: We address the problem of how many workers should be allocated for executing a distributed application that follows the master-worker paradigm, and how to assign tasks to workers in order to maximize resource efficiency and minimize application execution time. We propose a simple but effective scheduling strategy that dynamically measures the execution times of tasks and uses this information to dynamically adjust the number of workers to achieve a desirable efficiency while minimizing the loss of speedup. The scheduling strategy has been implemented using an extended version of MW, a runtime library that allows quick and easy development of master-worker computations on a computational grid. We report on an initial set of experiments that we have conducted on a Condor pool using our extended version of MW to evaluate the effectiveness of the scheduling strategy.
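A minimal sketch of the kind of measurement-driven adjustment described above; the efficiency thresholds and the one-worker-at-a-time step are illustrative assumptions, not the strategy in the paper.

def adjust_workers(task_times, n_workers, elapsed, target_eff=0.8):
    # task_times: measured execution times of tasks completed in the last interval
    busy = sum(task_times)
    efficiency = busy / (n_workers * elapsed) if n_workers and elapsed else 0.0
    if efficiency < target_eff and n_workers > 1:
        return n_workers - 1   # workers idling: release one to raise efficiency
    if efficiency > 0.95:
        return n_workers + 1   # near saturation: an extra worker may cut makespan
    return n_workers

print(adjust_workers([4.0, 5.0, 3.0], n_workers=4, elapsed=5.0))  # -> 3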

130 citations


Proceedings ArticleDOI
01 Dec 2000
TL;DR: This work proposes Predictor-Directed Stream Buffers (PSB), a scheme in which the stream buffer follows an address prediction stream instead of a fixed stride, and shows that PSB provides a 30% speedup on average over no prefetching, and an average 10% speedup over previously proposed stride-based stream buffers for pointer-intensive applications.
Abstract: An effective method for reducing the effect of load latency in modern processors is data prefetching. One form of data prefetching, stream buffers, has been shown to be particularly effective due to its ability to detect data streams and run ahead of them, prefetching as it goes. Unfortunately, in the past, the applicability of streaming was limited to stride-intensive code. We propose Predictor-Directed Stream Buffers (PSB), a scheme in which the stream buffer follows an address prediction stream instead of a fixed stride. In addition, we examine using confidence techniques to guide the allocation and prioritization of stream buffers and their prefetch requests. Our results show for pointer-based applications that PSB provides a 30% speedup on average over no prefetching, and an average 10% speedup over previously proposed stride-based stream buffers for pointer-intensive applications.
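A sketch of the predictor-directed run-ahead idea in Python. The predictor table and saturating confidence counters are stand-ins; the paper's actual address predictor and its allocation/prioritization policies are richer.

def psb_refill(miss_addr, predictor, confidence, depth=4):
    # predictor: addr -> predicted next address (stand-in for the address predictor)
    # confidence: addr -> saturating counter; only follow predictions we trust
    prefetches, addr = [], miss_addr
    while len(prefetches) < depth:
        nxt = predictor.get(addr)
        if nxt is None or confidence.get(addr, 0) < 2:
            break                      # low confidence: stop running ahead
        prefetches.append(nxt)         # a real buffer would issue this to memory
        addr = nxt                     # keep following the predicted stream
    return prefetches

pred = {0x100: 0x1A0, 0x1A0: 0x240, 0x240: 0x300}
conf = {0x100: 3, 0x1A0: 3, 0x240: 1}
print([hex(a) for a in psb_refill(0x100, pred, conf)])  # ['0x1a0', '0x240']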

128 citations


Proceedings ArticleDOI
01 Nov 2000
TL;DR: This paper presents a simple, uniform schema for pipelining the hardware execution of a broad class of loops, which resembles VLIW software pipelining much more than it resembles hardware synthesis retiming algorithms.
Abstract: The Garp compiler and architecture have been developed in parallel, in part to help investigate whether features of the architecture help facilitate rapid, automatic compilation utilizing Garp's rapidly reconfigurable coprocessor. Previously reported work for compiling to Garp has drawn heavily on techniques from software compilation rather than high-level synthesis. That trend continues in this paper, which describes the extension of those techniques to support pipelined execution of loops on the coprocessor. Even though it targets hardware, our approach resembles VLIW software pipelining much more than it resembles hardware synthesis retiming algorithms. This paper presents a simple, uniform schema for pipelining the hardware execution of a broad class of loops. The loops can have multiple control paths, multiple exits (including exits resulting from hyperblock path exclusion), data-dependent exits, and arbitrary memory accesses. The Garp compiler is fully implemented, and results are presented. A sample benchmark, wavelet image encoding, saw its overall speedup on accelerated loops grow from about 2 without pipelined execution to about 4 with pipelined execution.

125 citations


Journal ArticleDOI
TL;DR: A dependence graph (DG) is presented to visualize and describe a merged multiply-accumulate (MAC) hardware that is based on the modified Booth algorithm, in which an accurate delay model for deep submicron CMOS technology is used.
Abstract: This paper presents a dependence graph (DG) to visualize and describe a merged multiply-accumulate (MAC) hardware that is based on the modified Booth algorithm (MBA). The carry-save technique is used in the Booth encoder, the Booth multiplier, and the accumulator sections to ensure the fastest possible implementation. The DG applies to any MAC data word size and allows designing multiplier structures that are regular and have minimal delay, sign-bit extensions, and datapath width. Using the DG, a fast pipelined implementation is proposed, in which an accurate delay model for deep submicron CMOS technology is used. The delay model describes multi-level gate delays, taking into account input ramp and output loading. Based on the delay model, the proposed pipelined parallel MAC design is three times faster than other parallel MAC schemes that are based on the MBA. The speedup resulted from merging the accumulate and the multiply operations and the wide use of carry-save techniques.

122 citations


Proceedings ArticleDOI
17 Apr 2000
TL;DR: A microcoded Xilinx Virtex based elliptic curve processor that implements curve operations as well as optimal normal basis field operations in F(2^n); the design is parameterized for arbitrary n, and it is microcoded to allow for rapid development of the control part of the processor.
Abstract: Elliptic curve cryptography (ECC) has been the focus of much recent attention since it offers the highest security per bit of any known public key cryptosystem. This benefit of smaller key sizes makes ECC particularly attractive for embedded applications since its implementation requires less memory and processing power. In this paper a microcoded Xilinx Virtex based elliptic curve processor is described. In contrast to previous implementations, it implements curve operations as well as optimal normal basis field operations in F(2^n); the design is parameterized for arbitrary n; and it is microcoded to allow for rapid development of the control part of the processor. The design was successfully tested on a Xilinx Virtex XCV300-4 and, for n=113 bits, utilized 1290 slices at a maximum frequency of 45 MHz and achieved a thirty-fold speedup over an optimized software implementation.
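The curve-level operation such a processor ultimately exposes is scalar multiplication. Below is a generic left-to-right double-and-add sketch in Python, with the point operations over F(2^n) abstracted away; this is the textbook structure, not the paper's microcode.

def scalar_mult(k, P, add, double, identity):
    # k: positive integer scalar; P: curve point
    # add/double: point operations built from field arithmetic in F(2^n)
    R = identity
    for bit in bin(k)[2:]:      # scan scalar bits, most significant first
        R = double(R)
        if bit == "1":
            R = add(R, P)
    return R

# toy check using integers as a stand-in group: scalar_mult computes k*P
print(scalar_mult(113, 1, add=lambda a, b: a + b,
                  double=lambda a: 2 * a, identity=0))  # -> 113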

Journal ArticleDOI
TL;DR: In this paper, a full-band cellular automaton (CA) code for simulation of electron and hole transport in Si and GaAs is presented, where the entire Brillouin zone is discretized using a non-uniform mesh in k-space, and a transition table is generated between all initial and final states.
Abstract: We present a fullband cellular automaton (CA) code for simulation of electron and hole transport in Si and GaAs. In this implementation, the entire Brillouin zone is discretized using a nonuniform mesh in k-space, and a transition table is generated between all initial and final states on the mesh, greatly simplifying the final state selection of the conventional Monte Carlo algorithm. This method allows for fully anisotropic scattering rates within the fullband scheme, at the cost of increased memory requirements for the transition table itself. Good agreement is obtained between the CA model and previously reported results for the velocity-field characteristics and high field distribution function, which illustrate the potential accuracy of the technique. A hybrid CA/Monte Carlo algorithm is introduced which helps alleviate the memory problems of the CA method while preserving the speedup and accuracy.

Proceedings ArticleDOI
01 Feb 2000
TL;DR: This paper describes a C compiler for a mixed Processor/FPGA architecture where the FPGA is a Reconfigurable Functional Unit (RFU) and presents three compilation techniques that can extract computations from applications to put into the RFU.
Abstract: This paper describes a C compiler for a mixed Processor/FPGA architecture where the FPGA is a Reconfigurable Functional Unit (RFU). It presents three compilation techniques that can extract computations from applications to put into the RFU. The results show that large instruction sequences can be created and extracted by these techniques. An average speedup of 2.6 is achieved over a set of benchmarks.

Book ChapterDOI
27 Mar 2000
TL;DR: This article investigates approximate evaluation techniques based on the VA-File for Nearest-Neighbor Search (NN-Search) and develops different approximate query evaluation techniques that have the desired effect when allowing for a small but specific reduction of result quality.
Abstract: In many situations, users would readily accept an approximate query result if evaluation of the query becomes faster. In this article, we investigate approximate evaluation techniques based on the VA-File for Nearest-Neighbor Search (NN-Search). The VA-File contains approximations of feature points. These approximations frequently suffice to eliminate the vast majority of points in a first phase. Then, a second phase identifies the NN by computing exact distances of all remaining points. To develop approximate query-evaluation techniques, we proceed in two steps: first, we derive an analytic model for VA-File based NN-search. This is to investigate the relationship between approximation granularity, effectiveness of the filtering step and search performance. In more detail, we develop formulae for the distribution of the error of the bounds and the duration of the different phases of query evaluation. Based on these results, we develop different approximate query evaluation techniques. The first one adapts the bounds to have a more rigid filtering, the second one skips computation of the exact distances. Experiments show that these techniques have the desired effect: for instance, when allowing for a small but specific reduction of result quality, we observed a speedup of 7 in 50-NN search.
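A minimal sketch of the two-phase VA-File search the article builds on; approx_bounds is a hypothetical callback returning (lower, upper) distance bounds computed from a point's approximation, and exact_dist is the full distance function.

def vafile_nn(query, points, approx_bounds, exact_dist):
    # Phase 1: scan the compact approximations, keeping only points whose
    # lower bound does not exceed the smallest upper bound seen.
    bounds = [approx_bounds(query, i) for i in range(len(points))]
    best_upper = min(u for _, u in bounds)
    survivors = [i for i, (lo, _) in enumerate(bounds) if lo <= best_upper]
    # Phase 2: compute exact distances only for the survivors.
    return min(survivors, key=lambda i: exact_dist(query, points[i]))

The two approximate variants described above map naturally onto this skeleton: scaling the lower bounds up before the phase-1 comparison gives the more rigid filter, and returning the best candidate by bounds alone skips the exact-distance phase entirely.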

Proceedings Article
30 Jul 2000
TL;DR: PQSOLVE, a distributed theorem-prover for Quantified Boolean Formulae, is presented; to the best of the authors' knowledge it is the first efficient parallel QSAT-solver, running efficiently on distributed systems.
Abstract: In this paper, we present PQSOLVE, a distributed theorem-prover for Quantified Boolean Formulae. First, we introduce our sequential algorithm QSOLVE, which uses new heuristics and improves the use of known heuristics to prune the search tree. As a result, QSOLVE is more efficient than the QSAT-solvers previously known. We have parallelized QSOLVE. The resulting distributed QSAT-solver PQSOLVE uses parallel search techniques, which we have developed for distributed game tree search. PQSOLVE runs efficiently on distributed systems, i.e. parallel systems without any shared memory. We briefly present experiments that show a speedup of about 114 on 128 processors. To the best of our knowledge we are the first to introduce an efficient parallel QSAT-solver.

Journal ArticleDOI
TL;DR: A simple and fast parallel graph coloring heuristic that is well suited for shared memory programming and yields an almost linear speedup on the PRAM model is presented.
Abstract: Finding a good graph coloring quickly is often a crucial phase in the development of efficient, parallel algorithms for many scientific and engineering applications. In this paper we consider the problem of solving the graph coloring problem itself in parallel. We present a simple and fast parallel graph coloring heuristic that is well suited for shared memory programming and yields an almost linear speedup on the PRAM model. We also present a second heuristic that improves on the number of colors used. The heuristics have been implemented using OpenMP. Experiments conducted on an SGI Cray Origin 2000 supercomputer using very large graphs from finite element methods and eigenvalue computations validate the theoretical run-time analysis.
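The flavor of such a heuristic, speculative coloring followed by conflict resolution, which is one common shared-memory scheme along these lines and not necessarily the paper's exact algorithm, can be sketched as follows; the loop over `pending` is the part that would run in parallel under OpenMP.

def parallel_color(adj):
    # adj: dict vertex -> set of neighbors (symmetric adjacency assumed)
    color = {v: None for v in adj}
    pending = sorted(adj)
    while pending:
        snapshot = dict(color)
        for v in pending:   # independent given the snapshot: parallelizable loop
            used = {snapshot[u] for u in adj[v] if snapshot[u] is not None}
            c = 0
            while c in used:
                c += 1
            color[v] = c    # speculative: two neighbors may pick the same color
        # conflict resolution: of two adjacent same-colored vertices, the
        # higher-numbered one goes back for recoloring in the next round
        pending = [v for v in pending
                   if any(color[u] == color[v] and u < v for u in adj[v])]
    return color

print(parallel_color({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}))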

Proceedings ArticleDOI
26 Mar 2000
TL;DR: This paper seeks an answer to the question: How can the search tree be drawn so as to minimize the average packet lookup time while keeping the worst-case lookup time within a fixed bound?
Abstract: The problem of route address lookup has received much attention recently and several algorithms and data structures for performing address lookups at high speeds have been proposed. In this paper we consider one such data structure-a binary search tree built on the intervals created by the routing table prefixes. We wish to exploit the difference in the probabilities with which the various leaves of the tree (where the intervals are stored) are accessed by incoming packets in order to speed up the lookup process. More precisely, we seek an answer to the question: "How can the search tree be drawn so as to minimize the average packet lookup time while keeping the worst-case lookup time within a fixed bound?" We use ideas from information theory to derive efficient algorithms for computing near-optimal routing lookup trees. Finally, we consider the practicality of our algorithms through analysis and simulation.

Journal ArticleDOI
01 Sep 2000
TL;DR: The approach is based on a tetrahedral decomposition of the space, chosen both for its suitability to support a particle system and for the ready availability of many techniques recently proposed for the simplification and multiresolution management of 3D simplicial decompositions.
Abstract: Performing a really interactive and physically-based simulation of complex soft objects is still an open problem in computer animation/simulation. Given the application domain of virtual surgery training, a complete model should be quite realistic, interactive and should enable the user to modify the topology of the objects. Recent papers propose the adoption of multiresolution techniques to optimize time performance by representing at high resolution only the object parts considered more important or critical. The speedup obtainable at simulation time is counterbalanced by the need for a preprocessing phase strongly dependent on the topology of the object, with the drawback that performing dynamic topology modification becomes a prohibitive issue. In this paper we present an approach that couples multiresolution and topological modifications, based on the adoption of a particle systems approach to the physical simulation. Our approach is based on a tetrahedral decomposition of the space, chosen both for its suitability to support a particle system and for the ready availability of many techniques recently proposed for the simplification and multiresolution management of 3D simplicial decompositions. The multiresolution simulation system is designed to ensure the required speedup and to support dynamic changes of the topology, e.g. due to cuts or lacerations of the represented tissue.

Proceedings ArticleDOI
26 Mar 2000
TL;DR: This work investigates the locality behavior of Internet traffic (at layer 4) and proposes a near-LRU algorithm that can best harness this behavior, along with a dynamic set-associative scheme that exploits the nice properties of N-universal hash functions.
Abstract: Existing and emerging layer-4 switching technologies require packet classification to be performed on more than one header field, known as layer-4 lookup. Currently, the fastest general layer-4 lookup scheme delivers a throughput of 1 million lookups per second (MLPS), far off from the 25/75 MLPS needed to support a 50/150 Gbps layer-4 router. We propose the use of route caching to speed up layer-4 lookup, and design and implement a cache architecture for this purpose. We investigate the locality behavior of Internet traffic (at layer 4) and propose a near-LRU algorithm that can best harness this behavior. In implementation, to best approximate fully-associative near-LRU using relatively inexpensive set-associative hardware, we invented a dynamic set-associative scheme that exploits the nice properties of N-universal hash functions. The cache architecture achieves a high and stable hit ratio above 90 percent and a fast throughput up to 75 MLPS at a reasonable cost.
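A hedged sketch of a set-associative flow cache with per-set (near-)LRU replacement. Here hashlib is a stand-in for the universal hash family the paper exploits, and slow_path represents the full layer-4 classifier consulted on a miss.

import hashlib

class FlowCache:
    def __init__(self, n_sets=1024, ways=4):
        self.sets = [[] for _ in range(n_sets)]  # each set: (flow, route), MRU first
        self.n_sets, self.ways = n_sets, ways

    def _index(self, flow):
        # stand-in for a universal hash over the layer-4 5-tuple
        digest = hashlib.sha1(repr(flow).encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.n_sets

    def lookup(self, flow, slow_path):
        entries = self.sets[self._index(flow)]
        for k, (f, route) in enumerate(entries):
            if f == flow:
                entries.insert(0, entries.pop(k))  # LRU ordering within the set
                return route
        route = slow_path(flow)                    # miss: full layer-4 lookup
        entries.insert(0, (flow, route))
        del entries[self.ways:]                    # evict beyond the associativity
        return route

# e.g. cache.lookup((src, dst, sport, dport, proto), slow_path=classify)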

Journal ArticleDOI
TL;DR: In this article, the authors study the problem of processor scheduling for n parallel jobs applying the method of competitive analysis and show that for jobs with a single phase of parallelism, a preemptive scheduling algorithm without information about job execution time can achieve a mean completion time within $2-{2\over n+1}$ times the optimum.
Abstract: We study the problem of processor scheduling for n parallel jobs applying the method of competitive analysis. We prove that for jobs with a single phase of parallelism, a preemptive scheduling algorithm without information about job execution time can achieve a mean completion time within $2-{2\over n+1}$ times the optimum. In other words, we prove a competitive ratio of $2-{2\over n+1}$. The result is extended to jobs with multiple phases of parallelism (which can be used to model jobs with sublinear speedup) and to interactive jobs (with phases during which the job has no CPU requirements) to derive solutions guaranteed to be within $4-{4\over n+1}$ times the optimum. In comparison with previous work, our assumption that job execution times are unknown prior to their completion is more realistic, our multiphased job model is more general, and our approximation ratio (for jobs with a single phase of parallelism) is tighter and cannot be improved. While this work presents theoretical results obtained using competitive analysis, we believe that the results provide insight into the performance of practical multiprocessor scheduling algorithms that operate in the absence of complete information.

Proceedings ArticleDOI
05 Jan 2000
TL;DR: This work presents projection merging, a technique to reduce path redundancy, and achieves orders of magnitude speedup of analysis time on programs over that of using cycle elimination alone.
Abstract: Inclusion-based program analyses are implemented by adding new edges to directed graphs. In most analyses, there are many different ways to add a transitive edge between two nodes, namely through each different path connecting the nodes. This path redundancy limits the scalability of these analyses. We present projection merging, a technique to reduce path redundancy. Combined with cycle elimination [7], projection merging achieves orders of magnitude speedup of analysis time on programs over that of using cycle elimination alone.

Journal ArticleDOI
Frank Jenko1
TL;DR: In this article, a hybrid model of drift-kinetic electrons and fluid ions is used to treat electromagnetic drift-wave turbulence in an inhomogeneous collisionless plasma confined by a strong external magnetic field.

Proceedings ArticleDOI
01 Jan 2000
TL;DR: Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors.
Abstract: This paper aims to provide a quantitative understanding of the performance of DSP and multimedia applications on very long instruction word (VLIW), single instruction multiple data (SIMD), and superscalar processors. We evaluate the performance of the VLIW paradigm using Texas Instruments Inc.'s TMS320C62xx processor and the SIMD paradigm using Intel's Pentium II processor (with MMX) on a set of DSP and media benchmarks. Tradeoffs in superscalar performance are evaluated with a combination of measurements on Pentium II and simulation experiments on the SimpleScalar simulator. Our benchmark suite includes kernels (filtering, autocorrelation, and dot product) and applications (audio effects, G.711 speech coding, and speech compression). Optimized assembly libraries and compiler intrinsics were used to create the SIMD and VLIW code. We used the hardware performance counters on the Pentium II and the stand-alone simulator for the C62xx to obtain the execution cycle counts. In comparison to non-SIMD Pentium II performance, the SIMD version exhibits a speedup ranging from 1.0 to 5.5 while the speedup of the VLIW version ranges from 0.63 to 9.0. The benchmarks are seen to contain large amounts of available parallelism, however, most of it is inter-iteration parallelism. Out-of-order execution and branch prediction are observed to be extremely important to exploit such parallelism in media applications.

Proceedings ArticleDOI
26 Mar 2000
TL;DR: Time-stepped hybrid simulation (TSHS) is a framework that offers the user the flexibility to choose the simulation time scale so as to trade off the computational cost of the simulation with its fidelity.
Abstract: Data communication networks have been experiencing tremendous growth in size, complexity, and heterogeneity over the last decade. This trend poses a significant challenge to the design of scalable performance evaluation methodologies. In this paper we propose time-stepped hybrid simulation (TSHS) to deal with the scalability issue faced by traditional packet-level discrete event simulation methods. TSHS is a framework that offers the user the flexibility to choose the simulation time scale so as to trade off the computational cost of the simulation with its fidelity. Simulation speedup is achieved by evaluating the system at coarser time-scales. The potential loss of simulation accuracy when fine-time-scale behavior is evaluated at a coarser time-scale is studied both analytically and experimentally.

Journal ArticleDOI
Junehwa Song1, Boon-Lock Yeo
TL;DR: It is shown how shared information within a macroblock, such as a motion vector and common blocks, can be exploited to yield substantial speedup in computation, and can be implemented on top of any optimized solution.
Abstract: The ability to construct intracoded frames from motion-compensated intercoded frames directly in the compressed domain is important for efficient video manipulation and composition. In the context of motion-compensated discrete cosine transform (DCT)-based coding of video as in MPEG video, this problem of DCT-domain inverse motion compensation has been studied and, subsequently, improved faster algorithms were proposed. These schemes, however, treat each 8×8 block as a fundamental unit, and do not take into account the fact that in MPEG, a macroblock consists of several such blocks. We show how shared information within a macroblock, such as a motion vector and common blocks, can be exploited to yield substantial speedup in computation. Compared to previous brute-force approaches, our algorithms yield about 44% improvement. Our technique is independent of the underlying computational or processor model, and thus can be implemented on top of any optimized solution. We demonstrate an improvement by about 19%, and 13.5% in the worst case, on top of the optimized solutions presented in existing literature.

Proceedings ArticleDOI
22 Oct 2000
TL;DR: In this article, the authors present the SelfAnalyzer, an approach to dynamically analyze the performance of applications (speedup, efficiency and execution time), and the Performance-Driven Processor Allocation (PDPA), a new scheduling policy that distributes processors considering both the global conditions of the system and the particular characteristics of running applications.
Abstract: This work is focused on processor allocation in shared-memory multiprocessor systems, where no knowledge of the application is available when applications are submitted. We perform the processor allocation taking into account the characteristics of the application measured at run-time. We want to demonstrate the importance of an accurate performance analysis and the criteria used to distribute the processors. With this aim, we present the SelfAnalyzer, an approach to dynamically analyzing the performance of applications (speedup, efficiency and execution time), and the Performance-Driven Processor Allocation (PDPA), a new scheduling policy that distributes processors considering both the global conditions of the system and the particular characteristics of running applications. This work also defends the importance of the interaction between the medium-term and the long-term scheduler to control the multiprogramming level in the case of the clairvoyant scheduling policies. We have implemented our proposal in an SGI Origin2000 with 64 processors and we have compared its performance with that of some scheduling policies proposed so far and with the native IRIX scheduling policy. Results show that the combination of the SelfAnalyzer+PDPA with the medium/long-term scheduling interaction outperforms the rest of the scheduling policies evaluated. The evaluation shows that in workloads where a simple equipartition performs well, the PDPA also performs well, and in extreme workloads where all the applications have a bad performance, our proposal can achieve a speedup of 3.9 with respect to an equipartition and 11.8 with respect to the native IRIX scheduling policy.

Proceedings ArticleDOI
Vivek Sarkar1
08 May 2000
TL;DR: This paper addresses the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors, and proposes a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation.
Abstract: In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include a) a more detailed cost model that includes ILP and I-cache considerations, b) a new code generation algorithm for unrolling nested loops that generates more compact code (with fewer remainder loops) than the unroll-and-jam transformation, and c) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2X speedup on matrix multiply, and an average 1.08X speedup on seven of the SPEC95fp benchmarks (with a 1.2X speedup for two benchmarks). These speedups are significant because the baseline compiler used for comparison is the IBM XL Fortran product compiler which generates high quality code with unrolling and software pipelining of innermost loops enabled. Larger performance improvements due to unrolling of nested loops can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
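Enumerating feasible unroll vectors can be sketched as a bounded product-space search. The body-size cap standing in for the I-cache constraint and the opaque cost() callback are assumptions, far simpler than the paper's cost model.

import itertools

def pick_unroll(depth, cost, max_factor=8, max_body=64):
    # depth: number of perfectly nested loops; cost(vec) -> estimated cycles
    # feasibility: product of factors bounds the unrolled body size (I-cache proxy)
    best, best_cost = None, float("inf")
    for vec in itertools.product(range(1, max_factor + 1), repeat=depth):
        body = 1
        for f in vec:
            body *= f
        if body > max_body:
            continue                      # infeasible: unrolled body too large
        c = cost(vec)
        if c < best_cost:
            best, best_cost = vec, c
    return best

# toy cost model favoring balanced 2-D unrolling:
print(pick_unroll(2, cost=lambda v: 100 / (v[0] * v[1]) + abs(v[0] - v[1])))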

Proceedings ArticleDOI
01 Dec 2000
TL;DR: The experimental results show that the dynamic programming algorithm can be efficiently implemented on EARTH systems with high performance, good programmability and reasonable cost.
Abstract: This paper discusses the issues involved in implementing a dynamic programming algorithm for biological sequence comparison on a general-purpose parallel computing platform based on a fine-grain event-driven multithreaded program execution model. Fine-grain multithreading permits efficient parallelism exploitation in this application both by taking advantage of asynchronous point-to-point synchronizations and communication with low overheads and by effectively tolerating latency through the overlapping of computation and communication. We have implemented our scheme on EARTH, a fine-grain event-driven multithreaded execution and architecture model which has been ported to a number of parallel machines with off-the-shelf processors. Our experimental results show that the dynamic programming algorithm can be efficiently implemented on EARTH systems with high performance (e.g., speedup of 90 on 120 nodes), good programmability and reasonable cost.
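The parallelism such threads exploit comes from the anti-diagonal structure of the alignment recurrence: every cell on one anti-diagonal depends only on the two previous diagonals, so all of them can be computed concurrently. Below is a sequential Python sketch of a Smith-Waterman-style table organized by wavefronts; the scoring parameters are illustrative.

def align_scores(a, b, match=2, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):                 # anti-diagonal index d = i + j
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i                             # cells on one diagonal are
            s = match if a[i-1] == b[j-1] else mismatch   # independent: parallel
            H[i][j] = max(0, H[i-1][j-1] + s, H[i-1][j] + gap, H[i][j-1] + gap)
    return H

H = align_scores("GATTACA", "GCATGCA")
print(max(max(row) for row in H))  # best local alignment score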

Journal ArticleDOI
TL;DR: A modification of the coordinate descent algorithm with one-dimensional (1-D) Newton-Raphson approximations to an alternative quadratic is presented, which allows convergence to be proven easily, along with a new algorithm which exploits the FS method to allow parallel updates of arbitrary sets of pixels using computations similar to iterative coordinate descent.
Abstract: Bayesian tomographic reconstruction algorithms generally require the efficient optimization of a functional of many variables. In this setting, as well as in many other optimization tasks, functional substitution (FS) has been widely applied to simplify each step of the iterative process. The function to be minimized is replaced locally by an approximation having a more easily manipulated form, e.g., quadratic, but which maintains sufficient similarity to descend the true functional while computing only the substitute. We provide two new applications of FS methods in iterative coordinate descent for Bayesian tomography. The first is a modification of our coordinate descent algorithm with one-dimensional (1-D) Newton-Raphson approximations to an alternative quadratic which allows convergence to be proven easily. In simulations, we find essentially no difference in convergence speed between the two techniques. We also present a new algorithm which exploits the FS method to allow parallel updates of arbitrary sets of pixels using computations similar to iterative coordinate descent. The theoretical potential speedup of parallel implementations is nearly linear with the number of processors if communication costs are neglected.
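In symbols, the substituted 1-D Newton-Raphson step has the generic form below. This is a sketch of the functional-substitution idea, with $q$ standing for the locally fitted quadratic; the paper's particular surrogate and its positivity handling may differ:

$x_j \leftarrow \Big[\, x_j - \dfrac{\partial \Phi(x)/\partial x_j}{\partial^2 q(x)/\partial x_j^2} \,\Big]_+$

where $\Phi$ is the true objective, $q$ matches the value and gradient of $\Phi$ at the current point while being easier to manipulate, and $[\cdot]_+$ clips to nonnegative pixel values.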

Proceedings ArticleDOI
01 Jan 2000
TL;DR: The utility of the co-estimation tool to explore system-level power tradeoffs for a TCP/IP network interface card sub-system and an automotive controller is shown and the use of the proposed acceleration techniques results in significant speedups in SOC power estimation time.
Abstract: We present efficient power estimation techniques for HW/SW System-On-Chip (SOC) designs. Our techniques are based on concurrent and synchronized execution of multiple power estimators that analyze different parts of the SOC (we refer to this as co-estimation), driven by a system-level simulation master. We motivate the need for power co-estimation, and demonstrate that performing independent power estimation for the various system components can lead to significant errors in the power estimates, especially for control-intensive and reactive embedded systems. We observe that the computation time for performing power co-estimation is dominated by: (i) the requirement to analyze/simulate some parts of the system at lower levels of abstraction in order to obtain accurate estimates of timing and switching activity information and (ii) the need to communicate between and synchronize the various simulators. Thus, a naive implementation of power co-estimation may be too inefficient to be used in an iterative design exploration framework. To address this issue, we present several acceleration (speedup) techniques for power co-estimation. The acceleration techniques are energy caching, software power macromodeling, and statistical sampling. Our speedup techniques reduce the workload of the power estimators for the individual SOC components, as well as their communication/synchronization overhead. Experimental results indicate that the use of the proposed acceleration techniques results in significant (8× to 87×) speedups in SOC power estimation time, with minimal impact on accuracy. We also show the utility of our co-estimation tool to explore system-level power tradeoffs for a TCP/IP network interface card sub-system and an automotive controller.