Showing papers on "Degree of parallelism" published in 1992


Journal ArticleDOI
TL;DR: This paper deals with the problem of finding closed form schedules as affine or piecewise affine functions of the iteration vector and presents an algorithm which reduces the scheduling problem to a parametric linear program of small size, which can be readily solved by an efficient algorithm.
Abstract: Programs and systems of recurrence equations may be represented as sets of actions which are to be executed subject to precedence constraints. In many cases, actions may be labelled by integral vectors in some iteration domains, and precedence constraints may be described by affine relations. A schedule for such a program is a function which assigns an execution date to each action. Knowledge of such a schedule allows one to estimate the intrinsic degree of parallelism of the program and to compile a parallel version for multiprocessor architectures or systolic arrays. This paper deals with the problem of finding closed-form schedules as affine or piecewise affine functions of the iteration vector. An algorithm is presented which reduces the scheduling problem to a parametric linear program of small size, which can be readily solved by an efficient algorithm.
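
A toy illustration of the causality constraints described above (this is not the paper's parametric algorithm; the dependence vectors and the latency objective are invented for the example): for a uniform two-dimensional loop nest, an affine schedule t(i, j) = a*i + b*j is legal if it separates every dependence by at least one time step, and such a schedule can be found with a small linear program.

    # Minimal sketch: pick an affine schedule t(i, j) = a*i + b*j for a loop
    # nest with hypothetical uniform dependence vectors by solving a small LP.
    # Causality requires a*d1 + b*d2 >= 1 for each dependence (d1, d2).
    import numpy as np
    from scipy.optimize import linprog

    deps = np.array([[1, 0], [0, 1], [1, -1]])      # illustrative dependence vectors

    # linprog minimizes c @ x subject to A_ub @ x <= b_ub, so deps @ [a, b] >= 1
    # is written as -deps @ [a, b] <= -1; the objective a + b is a latency proxy.
    res = linprog(c=[1, 1], A_ub=-deps, b_ub=-np.ones(len(deps)),
                  bounds=[(0, None), (0, None)])
    a, b = res.x
    print("affine schedule: t(i, j) = %g*i + %g*j" % (a, b))

For these vectors the program returns t(i, j) = 2*i + 1*j; the spread of t over the iteration domain bounds the number of sequential steps, and dividing the domain size by it gives a rough estimate of the intrinsic degree of parallelism.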

614 citations


Journal ArticleDOI
TL;DR: This paper takes a new look at numerical techniques for solving parabolic equations by the method of lines, approximating the action of the evolution operator on a given state vector by means of a projection process onto a Krylov subspace.
Abstract: This paper takes a new look at numerical techniques for solving parabolic equations by the method of lines. The main motivation for the proposed approach is the possibility of exploiting a high degree of parallelism in a simple manner. The basic idea of the method is to approximate the action of the evolution operator on a given state vector by means of a projection process onto a Krylov subspace. Thus the resulting approximation consists of applying an evolution operator of very small dimension to a known vector, which is, in turn, computed accurately by exploiting high-order rational Chebyshev and Pade approximations to the exponential. Because the rational approximation is only applied to a small matrix, the only operations required with the original large matrix are matrix-by-vector multiplications and, as a result, the algorithm can easily be parallelized and vectorized. Further parallelism is introduced by expanding the rational approximations into partial fractions. Some relevant approximation and ...
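
A compact sketch of that projection idea (the test matrix, subspace size, and time step below are illustrative, not taken from the paper): build an m-step Arnoldi basis V_m with Hessenberg matrix H_m, so that exp(tau*A)v is approximated by beta*V_m*exp(tau*H_m)*e_1, and only matrix-by-vector products with the large matrix A are ever needed.

    # Krylov approximation of exp(tau*A) v: the exponential is applied only to
    # the small Hessenberg matrix H_m.  Test problem and sizes are illustrative.
    import numpy as np
    from scipy.linalg import expm

    def krylov_expv(A, v, tau, m=20):
        n = len(v)
        V = np.zeros((n, m + 1))
        H = np.zeros((m + 1, m))
        beta = np.linalg.norm(v)
        V[:, 0] = v / beta
        for j in range(m):
            w = A @ V[:, j]                      # the only use of the large matrix
            for i in range(j + 1):               # modified Gram-Schmidt (Arnoldi)
                H[i, j] = V[:, i] @ w
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-12:              # happy breakdown
                m = j + 1
                break
            V[:, j + 1] = w / H[j + 1, j]
        e1 = np.zeros(m)
        e1[0] = 1.0
        return beta * V[:, :m] @ (expm(tau * H[:m, :m]) @ e1)

    # 1-D heat equation semi-discretization as a small test case
    n = 200
    A = ((np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
          + np.diag(np.ones(n - 1), -1)) * (n + 1) ** 2)
    v = np.random.rand(n)
    tau = 1e-4
    print(np.linalg.norm(krylov_expv(A, v, tau) - expm(tau * A) @ v))

In the paper the small exponential is itself evaluated through high-order rational Chebyshev and Pade approximations whose partial fractions add further parallelism; the dense scipy expm above simply stands in for that step.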

342 citations


Proceedings ArticleDOI
27 May 1992
TL;DR: A method for computing Reed-Muller expansions for multivalued logic functions is presented, and the complexity in terms of the area-time tradeoff yields a better result than a butterfly algorithm does.
Abstract: A method for computing Reed-Muller expansions for multivalued logic functions is presented. All coefficients are constructed directly without the use of matrix multiplication. Due to the high degree of parallelism, the complexity of the algorithm in terms of the area-time tradeoff (AT^2) yields a better result than a butterfly algorithm does.
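
For reference, the binary, positive-polarity special case of a Reed-Muller expansion can be computed with the butterfly scheme the paper compares against (the paper's direct multivalued construction is not reproduced here, and the example function is made up):

    # Butterfly-style Reed-Muller (ANF) coefficients for a binary function,
    # i.e. the baseline the paper's direct multivalued construction is
    # measured against.  XOR plays the role of addition over GF(2).
    def reed_muller_coeffs(truth_table):
        """truth_table: values of f over all 2^n inputs, index bits = x0 (LSB) .. x_{n-1}."""
        c = list(truth_table)
        n = len(c).bit_length() - 1
        span = 1
        for _ in range(n):                    # one butterfly stage per variable
            for i in range(0, len(c), 2 * span):
                for j in range(i, i + span):
                    c[j + span] ^= c[j]
            span *= 2
        return c

    # f(x2, x1, x0) = x0 XOR (x1 AND x2)
    # coefficient order: [1, x0, x1, x0x1, x2, x0x2, x1x2, x0x1x2]
    print(reed_muller_coeffs([0, 1, 0, 1, 0, 1, 1, 0]))   # -> [0, 1, 0, 0, 0, 0, 1, 0]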

34 citations


Journal ArticleDOI
TL;DR: Maximum-concurrency scheduling in concurrent systems is modeled with Petri nets, and the results are used to predict the amount of effort required in modeling and analysis.

34 citations


Proceedings ArticleDOI
01 Jun 1992
TL;DR: A high performance signature file organization is proposed, integrating the latest developments both in storage structure and parallel computing architectures, and it combines horizontal and vertical approaches to the signature file fragmentation, achieving a new, mixed decomposition scheme.
Abstract: The retrieval capabilities of the signature file access method have become very attractive for many data processing applications dealing with both formatted and unformatted data. However, performance is still a problem, mainly when large files are used and fast response required. In this paper, a high performance signature file organization is proposed, integrating the latest developments both in storage structure and parallel computing architectures. It combines horizontal and vertical approaches to the signature file fragmentation. In this way, a new, mixed decomposition scheme, particularly suitable for parallel implementation, is achieved. The organization, based on this fragmentation scheme, is called Fragmented Signature File. Performance analysis shows that this organization provides very good and relatively stable performance, covering the full range of possible queries. For the same degree of parallelism, it outperforms any other parallel signature file organization that has been defined so far. The proposed method also has other important advantages concerning processing of dynamic files, adaptability to the number of available processors, load balancing, and, to some extent, fault-tolerant query processing.
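
A much-simplified sketch of the kind of organization being discussed (superimposed-coding signatures with a purely horizontal fragmentation; the paper's mixed horizontal/vertical decomposition and all parameter values below are assumptions of this sketch):

    # Each record gets a superimposed-coding bit signature; the signature file
    # is fragmented across P workers, each of which scans only its fragment
    # when a query arrives (the scans are independent, hence parallel).
    import hashlib

    SIG_BITS, BITS_PER_WORD, P = 64, 3, 4

    def word_signature(word):
        sig = 0
        for k in range(BITS_PER_WORD):
            h = hashlib.md5(("%s:%d" % (word, k)).encode()).digest()
            sig |= 1 << (int.from_bytes(h[:4], "big") % SIG_BITS)
        return sig

    def record_signature(text):
        sig = 0
        for w in text.lower().split():
            sig |= word_signature(w)
        return sig

    records = ["parallel signature file organization",
               "query processing on large files",
               "load balancing for dynamic files",
               "fault tolerant query processing"]
    fragments = [[] for _ in range(P)]                  # horizontal fragmentation
    for rid, text in enumerate(records):
        fragments[rid % P].append((rid, record_signature(text)))

    def search(query):
        q = record_signature(query)
        return sorted(rid for frag in fragments        # each fragment scanned independently
                      for rid, sig in frag if (sig & q) == q)

    print(search("query processing"))                  # candidate ids; false drops possible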

33 citations


Journal ArticleDOI
TL;DR: Both the theory and the experimental results reported here strongly indicate that an estimation procedure that is nonstandard in the context of steady-state simulation, yet easy to apply, is required on highly parallel machines, and this nonstandard estimator is a ratio estimator.
Abstract: A simple and effective way to exploit parallel processors in discrete event simulations is to run multiple independent replications, in parallel, on multiple processors and to average the results at the end of the runs. We call this the method of parallel replications. This paper is concerned with using the method of parallel replications for estimating steady-state performance measures. We report on the results of queueing network simulation experiments that compare the statistical properties of several possible estimators that can be formed using this method. The theoretical asymptotic properties of these estimators were determined in Glynn and Heidelberger (1989a, b). Both the theory and the experimental results reported here strongly indicate that an estimation procedure that is nonstandard in the context of steady-state simulation, yet easy to apply, is required on highly parallel machines. This nonstandard estimator is a ratio estimator. The experiments also show that use of the ratio estimator is advantageous even on machines with only a moderate degree of parallelism.
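
A minimal sketch of the two estimators being compared, using an M/M/1 waiting-time simulation as a stand-in for the queueing networks in the paper (replication count, horizon, and traffic intensity are arbitrary): each replication returns a cumulative sum S_r and an observation count N_r, and the ratio estimator is sum(S_r)/sum(N_r) rather than the average of the per-replication means S_r/N_r.

    # Method of parallel replications: the replications are independent, so in
    # practice each would run on its own processor; here they run in a loop.
    import random

    def replicate(seed, horizon=10000.0, lam=0.9, mu=1.0):
        rng = random.Random(seed)
        t = w = total_wait = 0.0
        n = 0
        s = rng.expovariate(mu)                    # service time of current customer
        while True:
            a = rng.expovariate(lam)               # interarrival time to next customer
            t += a
            if t > horizon:
                return total_wait, n               # cumulative waiting time and count
            w = max(0.0, w + s - a)                # Lindley recursion for waiting time
            s = rng.expovariate(mu)
            total_wait += w
            n += 1

    R = 64                                         # conceptually parallel replications
    results = [replicate(seed) for seed in range(R)]
    ratio = sum(sw for sw, cnt in results) / sum(cnt for sw, cnt in results)
    mean_of_means = sum(sw / cnt for sw, cnt in results) / R
    print("ratio estimator:", ratio, " mean of means:", mean_of_means)
    # exact steady-state mean wait for this M/M/1: rho/(mu - lam) = 9

With a fixed time budget per replication the counts N_r are random, which is why the two estimators differ; their asymptotic comparison is the subject of the paper.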

27 citations


Proceedings ArticleDOI
23 Mar 1992
TL;DR: A very-fine-grain, VLSI processor is described, which maintains both a high degree of flexibility and fine grainness by reducing each processing cell to a small RAM and several multiplexers.
Abstract: A very-fine-grain, VLSI processor is described. Very-fine-grain VLSI processors are especially suited for problems with a high degree of parallelism. However, to maintain their fine grainness (i.e., small size) most fine grain processors are relatively inflexible. Attempts to increase flexibility usually increase processor complexity and thereby decrease grainness. The two-dimensional, micrograined processor maintains both a high degree of flexibility and fine grainness by reducing each processing cell to a small RAM and several multiplexers. For even greater speed, arithmetic operations are based on a redundant number representation. Algorithms for single-instruction multiple-data (SIMD), mesh architectures can be easily adapted for the micrograined processor. This is particularly true for algorithms for certain two-dimensional signal and image processing problems.

26 citations


Journal ArticleDOI
TL;DR: The design of constellations that allow a high degree of parallelism in the staged decoder structure and in addition allow soft decoding of the component algebraic codes based on a systolic algorithm amenable to VLSI implementation is considered.
Abstract: The design of constellations that allow a high degree of parallelism in the staged decoder structure and in addition allow soft decoding of the component algebraic codes based on a systolic algorithm amenable to VLSI implementation is considered. Combination of these multidimensional constellations with trellis-coded modulation (TCM) is also considered. Set partitioning is based on the partition of a linear code into a subcode and its cosets, and the author shows that a highly parallel demodulator structure can also be applied when TCM is used. >

14 citations


Proceedings ArticleDOI
26 Apr 1992
TL;DR: The authors have parallelized the AMBER molecular dynamics program for the AP1000 highly parallel computer and showed that a problem with 41095 atoms is processed 226 times faster with a 512 processor AP1000 than by a single processor.
Abstract: The authors have parallelized the AMBER molecular dynamics program for the AP1000 highly parallel computer. To obtain a high degree of parallelism and an even load balance between processors for model problems of protein and water molecules, protein amino acid residues and water molecules are distributed to processors randomly. Global interprocessor communication required by this data mapping is done efficiently using the AP1000 broadcast network to broadcast atom coordinate data for other processors' reference, and its torus network for point-to-point communication to accumulate forces for atoms assigned to other processors. Experiments showed that a problem with 41095 atoms is processed 226 times faster with a 512-processor AP1000 than by a single processor.
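
The data mapping described above can be mimicked in a few lines (a plain numpy stand-in with a toy pair force, not AMBER on an AP1000): atoms are assigned to processors at random, every processor works from a broadcast copy of all coordinates, and the per-processor partial force arrays are summed at the end, which is the accumulation the torus network performs in the paper.

    import numpy as np

    P, n_atoms = 8, 400
    rng = np.random.default_rng(0)
    coords = rng.random((n_atoms, 3))                   # broadcast copy of all coordinates
    owner = rng.integers(0, P, size=n_atoms)            # random distribution of atoms

    def partial_forces(p):
        f = np.zeros((n_atoms, 3))
        for i in np.flatnonzero(owner == p):            # atoms owned by "processor" p
            d = coords[i] - coords                      # vectors to every other atom
            r2 = (d ** 2).sum(axis=1)
            r2[i] = np.inf                              # skip self-interaction
            f[i] = (d / r2[:, None] ** 2).sum(axis=0)   # toy repulsive force, not AMBER's
        return f

    forces = sum(partial_forces(p) for p in range(P))   # accumulation across processors
    print(forces.shape, float(np.abs(forces).max()))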

13 citations


Journal ArticleDOI
01 Jan 1992
TL;DR: The efficiencies obtained by an implementation on a message-passing multiprocessor demonstrate the suitability of the time-parallel extrapolation method for this type of equation.
Abstract: We consider the problem of solving unsteady partial differential equations on an MIMD machine. Conventional parallel methods use a data partitioning type approach in which the solution grid at each time-step is divided amongst the available processors. The sequential nature of the time integration is, however, retained. The algorithm presented in this paper makes use of a time-parallel approach, whereby several processors may be employed to solve at several time-steps simultaneously. The time-parallel method enables the inherent parallelism of the extrapolation scheme to be efficiently exploited, allowing a significant increase both in accuracy and in the degree of parallelism. The efficiencies obtained by an implementation on a message-passing multiprocessor demonstrate the suitability of the time-parallel extrapolation method for this type of equation.
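
As an illustration of where the extrapolation scheme's parallelism comes from (a scalar test problem and explicit Euler as the basic integrator are choices made for this sketch, not details from the paper): inside one macro time-step the integrations with 1, 2, 4, ... sub-steps are mutually independent, so they can be farmed out to different processors before the Richardson/Aitken-Neville combination.

    import numpy as np

    def euler(f, u0, t0, H, substeps):
        u, h = u0, H / substeps
        for k in range(substeps):
            u = u + h * f(t0 + k * h, u)
        return u

    def extrapolated_step(f, u0, t0, H, levels=4):
        seq = [2 ** j for j in range(levels)]             # 1, 2, 4, 8 sub-steps
        T = [euler(f, u0, t0, H, m) for m in seq]         # independent -> parallelizable
        for k in range(1, levels):                        # Aitken-Neville tableau
            for j in range(levels - 1, k - 1, -1):
                r = seq[j] / seq[j - k]
                T[j] = T[j] + (T[j] - T[j - 1]) / (r - 1)
        return T[-1]

    f = lambda t, u: -u                                   # u' = -u, u(0) = 1
    u, t, H = 1.0, 0.0, 0.1
    for _ in range(10):
        u = extrapolated_step(f, u, t, H)
        t += H
    print(u, np.exp(-1.0))                                # compare with the exact solution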

13 citations


Patent
Michael Keith1
11 Dec 1992
TL;DR: In this paper, a single-instruction, multiple-data (SIMD) architecture is adopted to exploit the high degree of parallelism inherent in many video signal processing algorithms.
Abstract: Described is a new class of integrated video signal processors especially suited for real-time processing of two-dimensional images. The single-instruction, multiple-data architecture is adopted to exploit the high degree of parallelism inherent in many video signal processing algorithms. Features have been added to the architecture which support conditional execution and sequencing--an inherent limitation of traditional single-instruction multiple-data machines. A separate transfer engine offloads transaction processing from the execution core, allowing balancing of input/output and compute resources--a critical factor in optimizing performance for video processing. These features, coupled with a scalable architecture, allow a unified programming model and application-driven performance.

Journal ArticleDOI
TL;DR: The generality of the data replication technique in image processing is demonstrated by showing how replicated data algorithms can be developed automatically for any operation that can be described as an image to template operation using image algebra.
Abstract: Data parallel processing on processor array architectures has gained considerable popularity in data intensive applications such as image processing and scientific computing. The data parallel paradigm of assigning one processing element to each data element results in an inefficient utilization of a large processor array when a relatively small data structure is processed on it. The large degree of parallelism of a massively parallel processor array machine does not result in a faster solution to a problem involving relatively small data structures than the modest degree of parallelism of a machine that is just as large as the data structure. In this paper, we present an algorithmic technique, called data replication technique, that speeds up the processing of small data structures both analytically and in practice. The technique combines data parallelism and operation parallelism using multiple copies of the data structure. We demonstrate the technique for two image processing operations, namely, image histogram computation and image convolution, and present the results of implementing them on a Connection Machine CM-2. In each case, we also compare the replicated data algorithm with the data parallel algorithm on three common interconnection network architectures to determine the conditions under which a speedup is obtained. Finally, we demonstrate the generality of the data replication technique in image processing by showing how replicated data algorithms can be developed automatically for any operation that can be described as an image to template operation using image algebra.
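
The histogram case can be sketched in a few lines of numpy (the copy count and image size are arbitrary, and this is only a serial stand-in for the CM-2 algorithm): the histogram is replicated, each copy accumulates a disjoint slice of the image independently, and the copies are merged at the end, which is how a small data structure keeps a large processor array busy.

    import numpy as np

    rng = np.random.default_rng(1)
    image = rng.integers(0, 256, size=(64, 64))
    C = 8                                                      # replicated histogram copies

    slices = np.array_split(image.reshape(-1), C)              # one pixel slice per copy
    partial = [np.bincount(s, minlength=256) for s in slices]  # independent accumulations
    hist = np.sum(partial, axis=0)                             # merge the replicas

    assert np.array_equal(hist, np.bincount(image.reshape(-1), minlength=256))
    print(hist[:8])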

Journal ArticleDOI
TL;DR: In this article, a simple matrix formula for the distance between two flats or affine spaces of R^n is derived, and the geometry of intersection and the degree of parallelism of two flats are elucidated.
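
For a concrete feel for the quantity involved (the numbers are chosen for illustration, and the paper's closed matrix formula is not reproduced here): the distance between two flats F1 = p + span(U) and F2 = q + span(V) is the norm of the component of q - p orthogonal to span(U) + span(V), which a least-squares solve exposes directly.

    import numpy as np

    p = np.array([0.0, 0.0, 0.0, 0.0])
    U = np.array([[1.0, 0.0, 0.0, 0.0]]).T            # F1: the x1-axis through the origin
    q = np.array([0.0, 0.0, 1.0, 2.0])
    V = np.array([[0.0, 1.0, 0.0, 0.0]]).T            # F2: a line parallel to x2 through q

    M = np.hstack([U, -V])
    coef, *_ = np.linalg.lstsq(M, q - p, rcond=None)  # min over x, y of ||(q - p) - (U x - V y)||
    residual = (q - p) - M @ coef
    print(np.linalg.norm(residual))                   # sqrt(5): distance between the two flats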

Journal ArticleDOI
TL;DR: This work analyzes several implementations of logic operations and suggests an implementation whose complexity is no greater than the best theoretical realization of a Boolean function, and demonstrates the optimality of that realization to within a constant multiple for digital optical-computing systems realized by bulk spatially variant elements.
Abstract: Optical computing has been suggested as a means of achieving a high degree of parallelism for both scientific and symbolic applications. While a number of implementations of logic operations have been forwarded, all have some characteristic that prevents their direct extension to functions of a large number of input bits. We analyze several of these implementations and demonstrate that all these implementations require that some measure of the system (area, space–bandwidth product, or time) grow exponentially with the number of inputs. We then suggest an implementation whose complexity is no greater than the best theoretical realization of a Boolean function. We demonstrate the optimality of that realization, to within a constant multiple, for digital optical-computing systems realized by bulk spatially variant elements.

Journal ArticleDOI
TL;DR: The development and implementation of a parallel branch-and-bound algorithm created by adapting a commercial MIP solver and Computational results on a variety of real integer programming problems are reported.
Abstract: The time-consuming process of solving large-scale Mixed Integer Programming problems using the branch-and-bound technique can be speeded up by introducing a degree of parallelism into the basic algorithm. This paper describes the development and implementation of a parallel branch-and-bound algorithm created by adapting a commercial MIP solver. Inherent in the design of this software are certain ad hoc methods, the use of which is necessary in the effective solution of real problems. The extent to which these ad hoc methods can successfully be transferred to a parallel environment, in this case an array of at most nine transputers, is discussed. Computational results on a variety of real integer programming problems are reported.
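
A toy branch-and-bound (a 0/1 knapsack with a fractional bound rather than a commercial MIP solver; the data is invented) makes the source of the parallelism visible: the nodes waiting in the pool are independent subproblems, so several can be bounded and branched at once, which is the kind of degree of parallelism the paper introduces across its transputer array.

    import heapq

    values  = [10, 13, 7, 8, 6]
    weights = [ 4,  6, 3, 5, 2]
    capacity = 10

    # Sort items by value density so the LP (fractional) relaxation bound is greedy.
    order = sorted(range(len(values)), key=lambda i: -values[i] / weights[i])
    values  = [values[i] for i in order]
    weights = [weights[i] for i in order]

    def fractional_bound(level, value, weight):
        """Upper bound for completions of a node that has decided items < level."""
        for v, w in zip(values[level:], weights[level:]):
            if weight + w <= capacity:
                value, weight = value + v, weight + w
            else:
                return value + v * (capacity - weight) / w
        return value

    best = 0
    pool = [(-fractional_bound(0, 0, 0), 0, 0, 0)]      # (-bound, level, value, weight)
    while pool:                                         # each pop could go to a different worker
        nb, level, value, weight = heapq.heappop(pool)
        if -nb <= best or level == len(values):
            continue
        for take in (1, 0):                             # branch on item `level`
            w = weight + take * weights[level]
            if w > capacity:
                continue
            v = value + take * values[level]
            best = max(best, v)
            b = fractional_bound(level + 1, v, w)
            if b > best:
                heapq.heappush(pool, (-b, level + 1, v, w))
    print("optimal value:", best)                       # 23 for this data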

Proceedings ArticleDOI
11 Oct 1992
TL;DR: A finite-element solver for electrical defibrillation analysis has been developed on a massively parallel computer, Thinking Machines Corporation's Connection Machine 2 (CM-2), which allows a high degree of parallelism and the solution of large problems as discussed by the authors.
Abstract: A finite-element solver for electrical defibrillation analysis has been developed on a massively parallel computer, Thinking Machines Corporation's Connection Machine 2 (CM-2), which allows a high degree of parallelism and the solution of large problems. The solver uses a nodal assembly technique where each node in the finite-element grid is mapped to a virtual processor in the computer. Using this solver, potential and current density distributions during transthoracic defibrillation have been calculated for different anatomic models, including a realistic 3-D finite-element model constructed from a series of cross-sectional magnetic resonance imaging (MRI) images of a mongrel dog. Numerical results obtained with this model are presented together with computational performance data for the algorithm. >

Proceedings ArticleDOI
09 Sep 1992
TL;DR: In a dedicated computer/network environment, the wall-clock time required by the resulting distributed application is reduced to that for the AGCM/Physics, with the other two model components and interprocessor communications running in parallel.
Abstract: The authors investigate the distribution of a climate model across homogeneous and heterogeneous computer environments with nodes that can reside at geographically different locations. The application consists of an atmospheric general circulation model (AGCM) coupled to an oceanic general circulation model (OGCM). Three levels of code decomposition are considered to achieve a high degree of parallelism and to mask communication with computation. First, the domains of both the grid-point AGCM and OGCM are divided into sub-domains for which calculations are carried out concurrently (domain decomposition). Second, the model is decomposed based on the diversity of tasks performed by its major components (task decomposition). Last, computation and communication are organized in such a way that the exchange of data between different tasks is carried out in subdomains of the model domain (I/O decomposition). In a dedicated computer/network environment, the wall-clock time required by the resulting distributed application is reduced to that for the AGCM/Physics, with the other two model components and interprocessor communications running in parallel. >

Proceedings ArticleDOI
19 Oct 1992
TL;DR: The authors present a staggered distribution scheme for DOACROSS loops that utilizes processors more efficiently, since, relative to the equal distribution approach, it requires fewer processors to attain maximum speedup.
Abstract: The authors present a staggered distribution scheme for DOACROSS loops. The scheme uses heuristics to distribute the loop iterations unevenly among processors in order to mask the delay caused by data dependencies and inter-PE (processing element) communication. Simulation results have shown that this scheme is effective for loops that have a large degree of parallelism among iterations. The scheme, due to its nature, distributes loop iterations among PEs based on architectural characteristics of the underlying organization, i.e. processor speed and communication cost. The maximum speedup attained is very close to the maximum speedup possible for a particular loop even in the presence of inter-PE communication cost. This scheme utilizes processors more efficiently, since, relative to the equal distribution approach, it requires fewer processors to attain maximum speedup. Although this scheme produces an unbalanced distribution among processors, this can be remedied by considering other loops when making the distribution to produce a balanced load among processors. >

Journal ArticleDOI
TL;DR: The performance of parallel computing in a network of Apollo workstations where the processes use the remote procedure call (RPC) mechanism for communication is addressed and the theoretical maximum speedup is bounded by half of the optimum degree of parallelism.
Abstract: The performance of parallel computing in a network of Apollo workstations where the processes use the remote procedure call (RPC) mechanism for communication is addressed. The speedup in such systems cannot be accurately estimated without taking into account the relatively large communication overheads. Moreover, it decreases by increasing parallelism when the latter exceeds some certain limit. To estimate the speedup and determine the optimum degree of parallelism, the author characterizes the parallelization and the communication overheads in the system considered. Then, parallel applications are modeled and their execution times are expressed for the general case of nonidentical tasks and workstations. The general case study allows the structural constraints of the applications to be taken into account by permitting their partitioning into heterogeneous tasks. A simple expression of the optimum degree of parallelism is obtained for identical tasks where the inherent constraints are neglected. The fact that the theoretical maximum speedup is bounded by half of the optimum degree of parallelism shows the importance of this measure. >
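
A worked version of the bound mentioned in the last sentence, under the simplest cost model consistent with the identical-task case (the linear overhead term c*n is an assumption of this sketch, not a formula quoted from the paper): with total work W and per-process RPC overhead c, T(n) = W/n + c*n, so the optimum degree of parallelism is n_opt = sqrt(W/c) and the speedup there is W/(2*sqrt(W*c)) = n_opt/2.

    import math

    W, c = 400.0, 1.0                     # total work and per-process overhead (assumed units)
    T = lambda n: W / n + c * n           # parallel execution time under the assumed model
    n_opt = math.sqrt(W / c)
    print("n_opt =", n_opt)
    # speedup measured against the overhead-free serial time W
    print("speedup at n_opt =", W / T(n_opt), "  n_opt / 2 =", n_opt / 2)
    for n in (5, 10, 20, 40, 80):         # speedup rises, peaks at n_opt, then falls
        print(n, round(W / T(n), 2))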

Proceedings ArticleDOI
26 Apr 1992
TL;DR: The author identifies two types of dependencies, namely communication dependencies and scheduling dependencies, and proposes to represent these dependencies explicitly in the program dependency graph and presents program transformations that reduce communication related run-time overhead.
Abstract: The single-program multiple-data (SPMD) mode of execution is an effective approach for exploiting parallelism in programs written using the shared-memory programming model on distributed memory machines. However, during SPMD execution one must consider dependencies due to the transfer of data among the processors. Such dependencies can be avoided by reordering the communication operations (sends and receives). However, no formal framework has been developed to explicitly recognize and represent such dependencies. The author identifies two types of dependencies, namely communication dependencies and scheduling dependencies, and proposes to represent these dependencies explicitly in the program dependency graph. Next, he presents program transformations that use this dependency information in transforming the program and increasing the degree of parallelism exploited. Finally, the author presents program transformations that reduce communication-related run-time overhead.

Journal ArticleDOI
TL;DR: The model algorithm developed and applied in Luhar and Britter has been implemented on a Distributed Array of Processors (DAP) and the results show a 14-fold improvement in the processing time over a sequential machine.

Journal ArticleDOI
TL;DR: Results show that the sparse matrix storage schemes studied have a high degree of parallelism in the sparse matrix decomposition process, but require careful attention to the management of cache memories on a massively parallel computer in which cache processing yields a performance gain.

Journal ArticleDOI
TL;DR: This paper presents replicated data algorithms for digital image convolutions and median filtering, and compares their performance with conventional data parallel algorithms for the same on three popular array interconnection networks, namely, the 2-D mesh, the 3-DMesh, and the hypercube.
Abstract: Data parallel processing on processor array architectures has gained popularity in data intensive applications, such as image processing and scientific computing, as massively parallel processor array machines became commercially feasible. The data parallel paradigm of assigning one processing element to each data element results in an inefficient utilization of a large processor array when a relatively small data structure is processed on it. The large degree of parallelism of a massively parallel processor array machine does not result in a faster solution to a problem involving relatively small data structures than the modest degree of parallelism of a machine that is just as large as the data structure. We presented a data replication technique to speed up the processing of small data structures on large processor arrays. In this paper, we present replicated data algorithms for digital image convolutions and median filtering, and compare their performance with conventional data parallel algorithms for the same operations on three popular array interconnection networks, namely, the 2-D mesh, the 3-D mesh, and the hypercube.

Proceedings ArticleDOI
16 Nov 1992
TL;DR: This article reviews the key devices necessary for optical signal and image processors, some of the system application demonstration programs currently in progress, and active research directions for the implementation of next-generation architectures.
Abstract: The maturation in the state-of-the-art of optical components is enabling increased applications for the technology. Most notable is the ever-expanding market for fiber optic data and communications links, familiar in both commercial and military markets. The inherent properties of optics and photonics, however, have suggested that components and processors may be designed that offer advantages over more commonly considered digital approaches for a variety of airborne sensor and signal processing applications. Various academic, industrial, and governmental research groups have been actively investigating and exploiting these properties of high bandwidth, large degree of parallelism in computation (e.g., processing in parallel over a two-dimensional field), and interconnectivity, and have succeeded in advancing the technology to the stage of systems demonstration. Such advantages as computational throughput and low operating power consumption are highly attractive for many computationally intensive problems. This review covers the key devices necessary for optical signal and image processors, some of the system application demonstration programs currently in progress, and active research directions for the implementation of next-generation architectures.

01 Jan 1992
TL;DR: In this article, the design of parallel architectures for computing the 8x8 Discrete Cosine Transform (DCT) was addressed, which concen- trates on direct methods to avoid a row-column decomposition.
Abstract: This paper addresses the design of parallel architectures for computing the 8x8 Discrete Cosine Transform (DCT). It concentrates on direct methods, which avoid a row-column decomposition. Two novel multiplier-free parallel architectures for high-speed 8x8 DCT calculation are proposed. The first architecture, which uses polynomial transforms, is compared with a second architecture, which computes the DCT via the Walsh-Hadamard Transform (WHT). Both architectures achieve a high degree of parallelism and regularity. The proposed architectures are designed for HDTV sampling rates and can be efficiently realized in CMOS technology.
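
The "DCT via WHT" route can be checked numerically in a few lines (1-D 8-point case only, natural-ordered Hadamard matrix; the paper's 2-D direct architecture and the sequency ordering that makes the conversion matrix sparse are not reproduced): the DCT is obtained as a small conversion matrix applied after an addition-only Walsh-Hadamard stage.

    import numpy as np
    from scipy.linalg import hadamard

    N = 8
    C = np.array([[np.cos(np.pi * k * (2 * n + 1) / (2 * N)) for n in range(N)]
                  for k in range(N)])                    # unnormalized DCT-II matrix
    H = hadamard(N)                                      # Walsh-Hadamard matrix (entries +-1)
    T = C @ np.linalg.inv(H)                             # conversion matrix, computed numerically

    x = np.random.rand(N)
    assert np.allclose(T @ (H @ x), C @ x)               # DCT via WHT equals the direct DCT
    print(np.round(T, 3))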

Proceedings ArticleDOI
17 Sep 1992
TL;DR: An approach, based on a string-coding functional and a neural network, to solving the longest common subsequences problem with a high degree of parallelism is presented; the parameters related to the input strings are contained entirely in the linear term of the neural network energy function.
Abstract: Presented is an approach, based on a string-coding functional and neural network, to solving the longest common subsequences (LCS) problem with a high degree of parallelism. In this approach, the parameters related to the input strings are contained entirely in the linear term of the neural network energy function, and the quadratic term only has to do with constraints. It is not necessary to modify the internal parameters and the connection weight matrix with new input strings. The complexities of both the network computing and the hardware implementation are substantially reduced.

Journal ArticleDOI
TL;DR: It turns out that parallelism affords only limited opportunity for reducing the computing time with linear multistage multivalue methods, and parallel one-step methods offer no speedup over serial one- step methods for the standard linear test problem.
Abstract: Numerical methods for ordinary initial value problems that do not depend on special properties of the system are usually found in the class of linear multistage multivalue methods, first formulated by J.C. Butcher. Among these the explicit methods are easiest to implement. For these reasons there has been considerable research activity devoted to generating methods of this class which utilize independent function evaluations that can be performed in parallel. Each such group of concurrent function evaluations can be regarded as a stage of the method. However, it turns out that parallelism affords only limited opportunity for reducing the computing time with such methods. This is most evident for the simple linear homogeneous constant-coefficient test problem, whose solution is essentially a matter of approximating the exponential by an algebraic function. For a given number of stages and a given number of saved values, parallelism offers a somewhat enlarged set of algebraic functions from which to choose. However, there is absolutely no benefit in having the degree of parallelism (number of processors) exceed the number of saved values of the method. Thus, in particular, parallel one-step methods offer no speedup over serial one-step methods for the standard linear test problem. Although the implication of this result for general nonlinear problems is unclear, there are indications that dramatic speedups are not possible in general. Also given are some results relevant to the construction of methods.

Proceedings ArticleDOI
30 Aug 1992
TL;DR: A two-layer architecture for optical flow computation is presented, which uses neural nets in the lower layer and a special relaxation system in the upper layer to make it particularly suitable for real-time applications.
Abstract: A two-layer architecture for optical flow computation is presented, which uses neural nets in the lower layer and a special relaxation system in the upper layer. The high degree of parallelism of the architecture makes it particularly suitable for real-time applications.

01 Jan 1992
TL;DR: As an approach to understanding the properties of traditional compiler optimizations and parallelizing optimizations that are frequently applied to program code to increase the degree of parallelism, a Unifying Framework for Optimizing Transformations (UFOT) was developed for specifying optimizations and automatically producing optimizers.
Abstract: As an approach to understanding the properties of traditional compiler optimizations and parallelizing optimizations that are frequently applied to program code to increase the degree of parallelism, a Unifying Framework for Optimizing Transformations (UFOT) was developed for specifying optimizations and automatically producing optimizers. For a selected set of optimizations, the framework is used to determine those interactions among the optimizations that can create and those that can destroy conditions for applying other optimizations. From these interactions, an application order is derived that attempts to maximize the potential benefits of the optimizations that can be applied to a program. The framework is also used to create a General Optimization Specification Language (GOSpeL) whose implementation culminates in the development of an optimizer generator (GENesis) that automatically produces optimizers from specifications. The specifications are input into the generator, and code is produced for applying the specified optimizations. The optimizations are applied to an extended intermediate representation of a source program, thus making GENesis generally source language independent. GENesis is powerful in that it can generate optimizations that require global conditions, which are needed for traditional and parallelizing optimizations. A prototype implementation of GENesis has been constructed and numerous optimizers have been produced and verified. A variety of experiments have been performed and numerous results have been determined. Experimentation indicates that many theoretical interactions occur in practice, and application points for some optimizations are rarely found. The experiments also indicate that different orderings of optimizations are needed for different code segments of the same program. Thus, optimizations should be applied on a code segment basis rather than to an entire program. Experimentation was also performed to examine the cost of applying an optimization and the expected benefit. It was found that the cost of some optimizations as compared to their benefit is prohibitive, and in some cases the cost of an optimization can be reduced by carefully specifying the optimization. It was also found that the cost of some optimizations varies significantly with different implementations produced by the optimizer generator.