
Showing papers on "Degree of parallelism" published in 2005


Proceedings ArticleDOI
04 Apr 2005
TL;DR: This work presents two adaptive algorithms that achieve average improvements of 10% in performance and 35% in stability for the tested workloads, and provides best parameter configurations for each algorithm.
Abstract: The scheduler is a key component in determining the overall performance of a parallel computer, and as we show here, the schedulers in wide use today exhibit large unexplained gaps in performance during their operation. Also, different scheduling algorithms often vary in the gaps they show, suggesting that choosing the correct scheduler for each time frame can improve overall performance. We present two adaptive algorithms that achieve this: one chooses by recent past performance, and the other by the recent average degree of parallelism, which is shown to be correlated with algorithmic superiority. Simulation results for the algorithms on production workloads are analyzed, and illustrate unique features of the chaotic temporal structure of parallel workloads. We provide the best parameter configurations for each algorithm, both of which achieve average improvements of 10% in performance and 35% in stability for the tested workloads.
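The first adaptive policy, choosing by recent past performance, can be sketched as a sliding-window selector. The window size and scheduler names below are illustrative assumptions, not the paper's configuration:

```python
from collections import deque

class AdaptiveSelector:
    """Choose a scheduler per time frame from recent past performance.

    A sketch of the performance-driven policy: keep a sliding window of
    observed slowdowns per candidate scheduler and run the one with the
    best recent average.  Window size and scheduler names are illustrative.
    """

    def __init__(self, schedulers, window=10):
        self.schedulers = list(schedulers)
        # one bounded history of observed slowdowns per candidate
        self.history = {s: deque(maxlen=window) for s in self.schedulers}

    def record(self, scheduler, slowdown):
        self.history[scheduler].append(slowdown)

    def choose(self):
        # schedulers with no samples yet score infinity; a real policy
        # would also need some exploration to sample them
        def recent_avg(s):
            h = self.history[s]
            return sum(h) / len(h) if h else float("inf")
        return min(self.schedulers, key=recent_avg)
```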

33 citations


Proceedings ArticleDOI
01 Jun 2005
TL;DR: The usage of POSE, the authors' parallel object-oriented simulation environment, for application performance prediction on large parallel machines such as BlueGene is explored and the utility of the simulator is illustrated through prediction and validation studies for a molecular dynamics application.
Abstract: Parallel discrete event simulation (PDES) of models with fine-grained computation remains a challenging problem. We explore the usage of POSE, our parallel object-oriented simulation environment, for application performance prediction on large parallel machines such as BlueGene. This study involves the simulation of communication at the packet level through a detailed network model. This presents an extremely fine-grained simulation: events correspond to the transmission and receipt of packets. Computation is minimal, communication dominates, and strong dependencies between events result in a low degree of parallelism. There is limited look-ahead capability since the outcome of many events is determined by the application whose performance the simulation is predicting. Thus conservative synchronization approaches are challenging for this type of problem. We present recent experiences and performance results for our network simulator and illustrate the utility of our simulator through prediction and validation studies for a molecular dynamics application.

32 citations


Book ChapterDOI
22 Jan 2005
TL;DR: A model with a strictly lower power of computation is obtained by relaxing the hypothesis on the existence of a port numbering; the model studied involves more synchronization than the message passing model, since a high level of synchronization in one atomic computation step makes a model powerful but reduces the degree of parallelism.
Abstract: The different local computation mechanisms are very useful for delimiting the borderline between positive and negative results in distributed computing. Indeed, they make it possible to study the importance of the synchronization level and to understand how important the initial knowledge is. A high level of synchronization involved in one atomic computation step makes a model powerful but reduces the degree of parallelism. Charron-Bost et al. [1] study the difference between synchronous and asynchronous message passing models. The model studied in this paper involves more synchronization than the message passing model: an elementary computation step modifies the states of two neighbours in the network, depending only on their current states. The information the processors initially have can be global information about the network, such as the size, the diameter or the topology of the network. The initial knowledge can also be local: each node can initially know its own degree, for example. Another example of local knowledge is the existence of a port numbering: each processor locally assigns numbers to its incident edges and, in this way, can consistently distinguish its neighbours. In Angluin's model [2], it is assumed that a port numbering exists, whereas this is not the case in our model. In fact, we obtain a model with a strictly lower power of computation by relaxing the hypothesis on the existence of a port numbering.

28 citations


Journal ArticleDOI
01 Dec 2005
TL;DR: A multiprocessor strategy is designed that exploits the computational characteristics of the algorithms proposed in the literature for biological sequence comparison; the problem of aligning biological sequences is attempted for the first time in the domain of DLT.
Abstract: In this paper, we design a multiprocessor strategy that exploits the computational characteristics of the algorithms used for biological sequence comparison proposed in the literature. We employ divisible load theory (DLT), which is suitable for handling large-scale processing on network-based systems. For the first time in the domain of DLT, the problem of aligning biological sequences is attempted. The objective is to minimize the total processing time of the alignment process. In designing our strategy, DLT facilitates a clever partitioning of the entire computation process such that the overall time consumed for aligning the sequences is a minimum. The partitioning takes into account the computation speeds of the nodes and the underlying communication network. Since this is a real-life application, the post-processing phase becomes important, and hence we consider propagating the results back in order to generate an exact alignment. We consider several cases in our analysis, deriving closed-form solutions for the processing time for heterogeneous networks, homogeneous networks, and networks with slow links. Further, we attempt to employ a multi-installment strategy to distribute the tasks such that a higher degree of parallelism can be achieved. For slow networks, our strategy recommends near-optimal solutions. We derive an important condition to identify such cases and propose two heuristic strategies. Also, our strategy can be extended to multisequence alignment by utilizing a clustering strategy such as the Berger-Munson algorithm proposed in the literature. Finally, we use real-life DNA samples of the house mouse mitochondrion (Mus musculus mitochondrion, NC.001569), consisting of 16 295 residues, and the DNA of the human mitochondrion (Homo sapiens mitochondrion, NC.001807), consisting of 16 571 residues, obtainable from GenBank, in our rigorous simulation experiments to illustrate all the theoretical findings.
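The core DLT idea of partitioning by node speed can be illustrated in a computation-only sketch; the paper's closed-form solutions additionally account for link speeds, result propagation, and multi-installment distribution:

```python
def load_shares(speeds):
    """Optimal fractions of a divisible workload for compute-only nodes.

    With node speed s_i (residues per second) and no communication cost,
    all nodes finish simultaneously when alpha_i / s_i is the same for
    every node, i.e. alpha_i proportional to s_i.  The paper's closed-form
    solutions extend this to include link speeds and the propagation of
    results back for post-processing.
    """
    total = float(sum(speeds))
    return [s / total for s in speeds]
```

For three nodes with speeds 4, 2, and 1, the fastest node receives four sevenths of the sequence and every node finishes at the same time.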

26 citations


Book ChapterDOI
30 Aug 2005
TL;DR: An alternative approach to balance the load in parallel adaptive finite element simulations is presented and a heuristic that contains a high degree of parallelism and computes well shaped connected partitions is obtained.
Abstract: Load balancing plays an important role in parallel numerical simulations. State-of-the-art libraries addressing this problem are based on vertex exchange heuristics that are embedded in a multilevel scheme. However, these are hard to parallelize due to their sequential nature. Furthermore, libraries like Metis and Jostle focus on a small edge-cut and cannot obey constraints like connectivity and straight partition boundaries, which are important for some numerical solvers. In this paper we present an alternative approach to balance the load in parallel adaptive finite element simulations. We compute a distribution that is based on solutions of linear equations. Integrated into a learning framework, we obtain a heuristic that contains a high degree of parallelism and computes well-shaped connected partitions. Furthermore, our experiments indicate that we can find solutions that are comparable to those of the two state-of-the-art libraries Metis and Jostle, also with regard to classic metrics like edge-cut and boundary length.
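A generic sketch of linear-system-based balancing is first-order diffusion, whose fixed point is the balanced load; this illustrates the idea, not the paper's exact scheme:

```python
def diffusion_step(load, neighbors, alpha):
    """One first-order diffusion sweep on a processor graph.

    Every node receives alpha * (neighbour load - own load) across each
    edge; the sweep conserves total load, and repeated sweeps converge to
    the uniform (balanced) load on a connected graph when alpha is at most
    1 / (maximum degree + 1).  A generic linear-iteration sketch only.
    """
    new = dict(load)
    for u, nbrs in neighbors.items():
        for v in nbrs:
            new[u] += alpha * (load[v] - load[u])
    return new
```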

25 citations


Journal ArticleDOI
TL;DR: In this paper, the authors describe the application of the parallel integration evaluation model (PIEM) in an industrial case study and propose a design solution that obeys the tradeoff that parallelism introduces into the networked supply operating system: while direct-production/supply time decreases, the overhead of interaction time among the parties, T, increases.
Abstract: This paper describes the application of the parallel integration evaluation model (PIEM) in an industrial case study. The PIEM model is based on modelling the interactions among supply network parties. It generates the parallel configuration of production and supply servers yielding the minimum total production and supply time/cost for the system, Φ. The design solution recommended by the model obeys the tradeoff that parallelism introduces into the networked supply operating system: while direct-production/supply time Π decreases, the overhead of interaction time among the networked parties, T, increases. The interaction time comprises two delay generating factors, limiting the implementation of massively parallel supply networks: the delay due to communication, negotiation, and coordination among the parties, K, and the congestion delay Γ at shared resources in the supply network. These two types of delay factors are positively correlated with the network's degree of parallelism, Ψ, and they affect inve...
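The tradeoff can be illustrated numerically; the functional forms below (direct time falling as Π₀/Ψ, delays K and Γ growing linearly with Ψ) are assumptions for illustration, not the PIEM model's equations:

```python
def total_time(psi, pi0=100.0, k=2.0, g=1.0):
    """Phi(psi): total production and supply time at degree of
    parallelism psi.  Direct time is assumed to fall as pi0 / psi, while
    coordination delay K and congestion delay Gamma are assumed to grow
    linearly (k * psi and g * psi); both forms are illustrative."""
    return pi0 / psi + k * psi + g * psi

# the interior minimum reflects the tradeoff: more parallelism cuts
# direct time but inflates the interaction delays
best_psi = min(range(1, 21), key=total_time)
```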

18 citations


Patent
Kenneth Alan Dockser1
09 Jun 2005
TL;DR: In this article, power control of one or more processing elements matches a degree of parallelism to requirements of a task performed in a highly parallel programmable data processor, where the power control can be selected to conserve power.
Abstract: Selective power control of one or more processing elements matches a degree of parallelism to requirements of a task performed in a highly parallel programmable data processor. For example, when program operations require less than the full width of the data path, a software instruction of the program sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element, that is not needed, can be shut down to conserve power. At a later time, when the added capacity is needed, execution of another software instruction sets the mode of operation to that of the wider data path, typically the full width, and the mode change reactivates the previously shut-down processing element.

18 citations


Proceedings ArticleDOI
04 Jun 2005
TL;DR: This paper will describe the principles and features of SMI++ as well as its integration with an industrial SCADA tool for use by the LHC experiments, and it will be shown that such tools can provide a very convenient mechanism for the automation of large-scale, high-complexity applications.
Abstract: The new LHC experiments at CERN have very large numbers of channels to operate. In order to be able to configure and monitor such large systems, a high degree of parallelism is necessary. The control system is built as a hierarchy of sub-systems distributed over several computers. A toolkit, SMI++, combining two approaches, finite state machines and rule-based programming, allows for the description of the various sub-systems as decentralized deciding entities, reacting in real-time to changes in the system, thus providing for the automation of standard procedures and for the automatic recovery from error conditions in a hierarchical fashion. In this paper we describe the principles and features of SMI++ as well as its integration with an industrial SCADA tool for use by the LHC experiments, and we show that such tools can provide a very convenient mechanism for the automation of large-scale, high-complexity applications.

12 citations


Journal ArticleDOI
TL;DR: Using instruction traces from common applications, quantitative analyses of implicit operands, memory addressing, and condition codes (three sources of significant limitations on the maximum achievable parallelism in the x86 architecture) have been performed, and conclusions are presented relating the obtained degree of parallelism to negative characteristics of the x86 instruction set architecture.

11 citations


Journal ArticleDOI
TL;DR: A two-level scheduling method (TSM) is proposed, which integrates unimodular transformations, the loop tiling technique, and conventional methods used on a single DSP, and can achieve shorter execution times and more scalable speedups.

10 citations


Book ChapterDOI
01 Jan 2005
TL;DR: In this article, the authors describe the use and implementation of skeletons on emerging computational grids, with the skeleton system Lithium, based on Java and RMI, as their reference programming system.
Abstract: Skeletons are common patterns of parallelism, such as farm and pipeline, that can be abstracted and offered to the application programmer as programming primitives. We describe the use and implementation of skeletons on emerging computational grids, with the skeleton system Lithium, based on Java and RMI, as our reference programming system. Our main contribution is the exploration of optimization techniques for implementing skeletons on grids based on an optimized, future-based RMI mechanism, which we integrate into the macro-dataflow evaluation mechanism of Lithium. We discuss three optimizations: 1) a lookahead mechanism that allows multiple tasks to be processed concurrently at each grid server and thereby increases the overall degree of parallelism, 2) a lazy task-binding technique that reduces interactions between grid servers and the task dispatcher, and 3) dynamic improvements that optimize the collecting of results and the work-load balancing. We report experimental results that demonstrate the improvements due to our optimizations on various testbeds, including a heterogeneous grid-like environment.
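A farm skeleton with lookahead can be approximated in a few lines; here the lookahead is modelled by allowing several in-flight tasks per server, a simplification of Lithium's future-based RMI mechanism:

```python
from concurrent.futures import ThreadPoolExecutor

def farm(worker, tasks, servers=4, lookahead=2):
    """Farm skeleton sketch: dispatch independent tasks to a pool.

    Lookahead is modelled by keeping servers * lookahead tasks in flight,
    so each server always has work queued; Lithium instead dispatches
    macro-dataflow tasks to remote grid servers over RMI.
    """
    with ThreadPoolExecutor(max_workers=servers * lookahead) as pool:
        # pool.map preserves task order in the result list
        return list(pool.map(worker, tasks))
```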

Journal ArticleDOI
01 Aug 2005
TL;DR: A method to accurately compute the distribution of the largest (Max) and the smallest execution time of the composite of a number of parallel programming tasks, each having an independent, stochastic, arbitrary workload is presented.
Abstract: Predicting the execution time of parallel programs involves computing the maximum or minimum of the execution times of the tasks involved in the parallel computation. We present a method to accurately compute the distribution of the largest (Max) and the smallest (Min) execution time of the composite of a number of parallel programming tasks, each having an independent, stochastic, arbitrary workload. The Max function applies to the general case in which the composite task completes when its longest constituent task terminates. The Min function applies when the completion of the shortest task terminates the whole parallel process, such as in a parallel searching program. Both the Min and Max density functions of a constituent task are characterized in terms of a Pearson distribution. Due to its accuracy, the presented method is especially of interest when the performance of time-critical parallel applications must be derived. Both prediction methods are tested against three well-known distributions. Furthermore, the Max prediction method is also tested against a number of measured real-life data-parallel programs with different degrees of parallelism. The results show excellent accuracy, better than 1%, with very few exceptions in extreme situations.
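For independent task times the exact Max and Min distributions follow from elementary identities, which the paper's Pearson-based method approximates for arbitrary workloads:

```python
def cdf_max(cdfs, t):
    """P(max_i X_i <= t) for independent task times: the composite
    finishes by t only if every constituent task has finished by t,
    so the CDF is the product of the individual CDFs."""
    p = 1.0
    for F in cdfs:
        p *= F(t)
    return p

def cdf_min(cdfs, t):
    """P(min_i X_i <= t) = 1 - prod_i (1 - F_i(t)): the parallel search
    terminates by t unless every task is still running at t."""
    q = 1.0
    for F in cdfs:
        q *= 1.0 - F(t)
    return 1.0 - q
```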

Journal ArticleDOI
TL;DR: A block red-black coloring is introduced to increase the degree of parallelism in the application of the block ILU preconditioner for solving sparse linear systems arising from convection-diffusion equations discretized using the finite difference scheme (five-point operator).
Abstract: It is well known that the ordering of the unknowns can have a significant effect on the convergence of a preconditioned iterative method and on its implementation on a parallel computer. We therefore introduce a block red-black coloring to increase the degree of parallelism in the application of the block ILU preconditioner for solving sparse linear systems arising from convection-diffusion equations discretized using the finite difference scheme (five-point operator). We study the preconditioned GMRES iterative method for solving these linear systems.
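The checkerboard idea can be sketched directly: with a five-point stencil, blocks of the same colour are not coupled, so each colour class can be processed in parallel:

```python
def red_black_blocks(nx, ny):
    """Checkerboard (red-black) colouring of an nx-by-ny grid of blocks.

    Under a five-point stencil, blocks of one colour have no couplings
    among themselves, so a block preconditioner can be applied to a whole
    colour class in parallel, one colour after the other.
    """
    red, black = [], []
    for j in range(ny):
        for i in range(nx):
            (red if (i + j) % 2 == 0 else black).append((i, j))
    return red, black
```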

Proceedings ArticleDOI
27 Jun 2005
TL;DR: A new scheduling algorithm is introduced, which is based on using an objective function to guide the search for a near optimal solution, which includes different criteria such as real-time deadlines, reliability, and quantitative measures of the communication, degree of parallelism and processing power fragmentation.
Abstract: Improper scheduling of real-time applications on a cluster may lead to missing required deadlines and offset the gain of using the system and software parallelism. Most existing scheduling algorithms do not consider factors such as real-time deadlines, system reliability, processing power fragmentation, inter-task communication and degree of parallelism on performance. In this paper we introduce a new scheduling algorithm, which is based on using an objective function to guide the search for a near optimal solution. This objective function includes different criteria such as real-time deadlines, reliability, and quantitative measures of the communication, degree of parallelism and processing power fragmentation. The presence of different criteria may affect the overall acceptance rate of the applications. We also investigate the effect of reliability on the overall acceptance rate.
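A minimal sketch of an objective-function-guided choice; the metric names, normalisation, and weighted-sum combination are illustrative assumptions, not the paper's formula:

```python
def schedule_objective(metrics, weights):
    """Weighted objective guiding the search for a near-optimal schedule.

    Higher is better; the metric names, normalisation to [0, 1], and the
    weighted-sum combination are illustrative assumptions only."""
    return sum(weights[k] * metrics[k] for k in metrics)

# two hypothetical candidate allocations, scored on the paper's criteria
candidates = [
    {"deadline": 0.9, "reliability": 0.8, "communication": 0.4,
     "parallelism": 0.7, "fragmentation": 0.6},
    {"deadline": 0.5, "reliability": 0.9, "communication": 0.9,
     "parallelism": 0.4, "fragmentation": 0.8},
]
weights = {"deadline": 2.0, "reliability": 1.0, "communication": 1.0,
           "parallelism": 1.0, "fragmentation": 1.0}
best = max(candidates, key=lambda m: schedule_objective(m, weights))
```

Weighting deadlines heavily steers the search toward allocations that meet real-time constraints, at the cost of the other criteria.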

Patent
Kenneth Alan Dockser1
09 Jun 2005
TL;DR: In this paper, automatic power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition in a highly parallel programmable data processor.
Abstract: Automatic selective power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition, in a highly parallel programmable data processor. For example, logic of the parallel processor detects when program operations (e.g. for a particular task or due to a detected temperature) require less than the full width of the data path. In response, the control logic automatically sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element, that is not needed, can be shut down, to conserve energy and/or to reduce heating (i.e., power consumption). At a later time, when operation of the added capacity is appropriate, the logic detects the change in processing conditions and automatically sets the mode of operation to that of the wider data path, typically the full width. The mode change reactivates the previously shut-down processing element.

Journal ArticleDOI
01 Jan 2005
TL;DR: The feasibility of exploiting hardware parallelism to accelerate the interleaving procedure is demonstrated; based on a heuristic algorithm, the possible speedup for different interleavers is presented as a function of the degree of parallelism of the hardware.
Abstract: Today's communications systems especially in the field of wireless communications rely on many different algorithms to provide applications with constantly increasing data rates and higher quality. This development combined with the wireless channel characteristics as well as the invention of turbo codes has particularly increased the importance of interleaver algorithms. In this paper, we demonstrate the feasibility to exploit the hardware parallelism in order to accelerate the interleaving procedure. Based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented. The parallelization is generic in the sense that the assumed underlying hardware is based on a parallel datapath DSP architecture and therefore provides the flexibility of software solutions.

Proceedings ArticleDOI
06 Jun 2005
TL;DR: The authors presented an adaptive construction of the bitonic counting network, layered on an overlay network which provides an efficient peer-to-peer lookup service, and uses the recursive structure present in thebitonic network to adapt its implementation.
Abstract: Counting networks are well-studied parallel and distributed data structures, which are useful in synchronization applications such as distributed counting and load balancing. However, current constructions of counting networks are static, since their width (the degree of parallelism), and hence the size of the network, have to be fixed in advance. This presents an obstacle to implementing them efficiently in a large distributed system whose size may be changing due to nodes joining and leaving the network. The authors present an adaptive construction of the bitonic counting network. The network tunes its width to the system size in a distributed and local way. With high probability, the effective "width" of the network is Omega(N/log^2 N), where N is the number of nodes currently in the system, and the effective "depth" of the network is O(log^2 N). In contrast, a static implementation would have the same width irrespective of the system size. When the system size changes, the network adapts by splitting or merging its components. All decisions and actions are decentralized: these include the decision of when to split and merge the components, and the action of splitting and merging them. The construction is layered on an overlay network which provides an efficient peer-to-peer lookup service, and uses the recursive structure present in the bitonic network to adapt its implementation. Though the bitonic network is discussed, the technique could be applied to build an adaptive implementation of any distributed data structure that can be decomposed in a recursive way.

Proceedings ArticleDOI
17 Oct 2005
TL;DR: Boolean Web-service automata for distributed Web services are introduced as a parallel model for interaction and interoperability between applications, and the generality of BWA leads to a high degree of parallelism and efficient composition among Web service applications.
Abstract: Boolean Web-service automata (BWA) for distributed Web services are introduced as a parallel model for interaction and interoperability between applications. Boolean automata are a generalization of nondeterministic automata. The generality of BWA leads to a high degree of parallelism and efficient composition among Web service applications. We also consider two formalisms: (1) deterministic Web-service automata (DWA), a model supporting Web service composition, and (2) conversation Web-service automata (CWA), a conversation model supporting Web service interaction. DWA and CWA complement BWA in conjunction with the composition and conversation operations.

Book ChapterDOI
22 Oct 2005
TL;DR: Experimental results indicated that on an SMP system the multi-threaded Prolog could achieve a high degree of parallelism while the server could obtain scalability, and the application of the server to clinical decision support in a hospital information system demonstrated that themulti-threading Prolog and the server were sufficiently robust for use in an enterprise application.
Abstract: A knowledge-based system is suitable for realizing advanced functions that require domain-specific expert knowledge in enterprise-mission-critical information systems (enterprise applications). This paper describes a newly implemented multi-threaded Prolog system that evolves single-threaded Inside Prolog. It is intended as a means to apply a knowledge-based system written in Prolog to an enterprise application. It realizes a high degree of parallelism on an SMP system by minimizing mutual exclusion, providing the scalability essential in enterprise use. Also briefly introduced is the knowledge processing server, a framework for operating a knowledge-based system written in Prolog with an enterprise application. Experimental results indicated that on an SMP system the multi-threaded Prolog could achieve a high degree of parallelism while the server could obtain scalability. The application of the server to clinical decision support in a hospital information system also demonstrated that the multi-threaded Prolog and the server were sufficiently robust for use in an enterprise application.

Proceedings ArticleDOI
05 Dec 2005
TL;DR: The block processing engine can satisfy the stringent real-time constraints imposed by emerging technologies and its efficiency has been proven through the implementation of a dual standard frequency domain equalizer supporting 3GPP HSDPA and IEEE 802.11a.
Abstract: This paper presents the block processing engine (BPE), a programmable architecture specifically suited for high-throughput wireless communications. Thanks to a high degree of parallelism and a consistent use of pipelined processing, the BPE can satisfy the stringent real-time constraints imposed by emerging technologies. Its efficiency has been proven through the implementation of a dual standard frequency domain equalizer supporting 3GPP HSDPA and IEEE 802.11a.

Book ChapterDOI
06 Jun 2005
TL;DR: In this paper, a fast and highly parallel algorithm for pricing CDD weather derivatives is presented, which consists of multiple convolutions of functions with a Gaussian distribution and can be computed efficiently with the fast Gauss transform.
Abstract: We present a fast and highly parallel algorithm for pricing CDD weather derivatives, which are financial products for hedging weather risks due to higher-than average temperature in summer. To find the price, we need to compute the expected value of its payoff, namely, the CDD weather index. To this end, we derive a new recurrence formula to compute the probability density function of the CDD. The formula consists of multiple convolutions of functions with a Gaussian distribution and can be computed efficiently with the fast Gauss transform. In addition, our algorithm has a large degree of parallelism because each convolution can be computed independently. Numerical experiments show that our method is more than 10 times faster than the conventional Monte Carlo method when computing the prices of various CDD derivatives on one processor. Moreover, parallel execution on a PC cluster with 8 nodes attains up to six times speedup, allowing the pricing of most of the derivatives to be completed in about 10 seconds.
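The conventional Monte Carlo baseline mentioned in the abstract can be sketched as follows; the 18 °C base temperature and the payoff form tick × CDD are customary assumptions:

```python
import random

def cdd_index(daily_temps, base=18.0):
    """Cooling degree days: positive excursions of the daily average
    temperature above the base (18 degrees C is a customary choice)."""
    return sum(max(t - base, 0.0) for t in daily_temps)

def mc_price(simulate_summer, tick=1.0, n_paths=10000, seed=0):
    """Conventional Monte Carlo pricing: average tick * CDD over
    simulated temperature paths.  This is the baseline the paper beats;
    its own method instead builds the CDD density by repeated
    convolutions with a Gaussian, evaluated via the fast Gauss
    transform, and prices each convolution independently in parallel."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += tick * cdd_index(simulate_summer(rng))
    return total / n_paths
```

`simulate_summer` is a user-supplied temperature model (e.g. a mean-reverting process driven by `rng.gauss`).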

01 Jan 2005
TL;DR: Two techniques to design good S-random interleavers, to be used in parallel and serially concatenated codes with interleaver, are proposed and an example of the advantages is provided in a realistic system framework.
Abstract: In this paper, we propose two techniques to design good S-random interleavers, to be used in parallel and serially concatenated codes with interleavers. The interleavers designed according to these algorithms can be shortened, in order to support different block lengths, in such a way that all the permutations obtained by pruning, when employed in a parallel turbo decoder, are collision-free. The first technique, suitable for short and medium interleavers, guarantees the same performance as non-parallel interleavers in terms of spreading properties, simulated frame-error probabilities, and obtainable minimum distance of the actual codes. The second algorithm, to be used for large block lengths, permits achieving high degrees of parallelism at the price of a slight degradation of the spread properties, and also allows changing the degree of parallelism on the fly. The operations of a parallel turbo decoder employing these interleavers are described, and an example of the advantages of the proposed techniques is provided in a realistic system framework.
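The basic S-random construction that both techniques build on can be sketched as rejection sampling with restarts; the paper's contributions (pruning support, collision-freedom, on-the-fly parallelism changes) sit on top of this spread property:

```python
import random

def s_random_interleaver(n, s, seed=0, max_restarts=1000):
    """Basic S-random construction (spread s >= 1): draw values in a
    random order, accepting one only if it differs by more than s from
    each of the previous s accepted values; restart on a dead end.
    Succeeds with high probability for s up to about sqrt(n / 2)."""
    rng = random.Random(seed)
    for _ in range(max_restarts):
        pool = list(range(n))
        rng.shuffle(pool)
        perm = []
        while pool:
            for k, v in enumerate(pool):
                if all(abs(v - p) > s for p in perm[-s:]):
                    perm.append(pool.pop(k))
                    break
            else:
                break  # dead end: no admissible value left in the pool
        if not pool:
            return perm
    raise RuntimeError("no S-random permutation found; reduce s")
```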

01 Jan 2005
TL;DR: In this paper, the degree of parallelism is defined as the amount of non-redundant parallelism needed in the derivations of Lindenmayer and Bharat systems.
Abstract: In this paper, the degree of parallelism is introduced and investigated. The degree of parallelism is a natural descriptional complexity measure of Lindenmayer and Bharat systems. This concept quantifies the amount of non-redundant parallelism needed in the derivations of those systems. We consider both static and dynamic versions of this notion. Corresponding hierarchy and undecidability results are established. Furthermore, we show that the degree of parallelism links to the notions of growth functions and active symbols.

Book ChapterDOI
02 Nov 2005
TL;DR: In this paper, graph theory is introduced into transient stability analysis in power systems by using a weighted graph, which reflects the degree of parallelism of the computation and improves the speed-up ratio of the system.
Abstract: In this paper, we introduce graph theory into transient stability analysis in power systems. In the weighted graph, vertex weight represents a node's parallel computing workload and edge weight represents the serial computing workload on the border of regions, which reflects the degree of parallelism of the computation and improves the speed-up ratio of the system. In order to reduce the communication time wastage induced by the CSMA protocol in a TCP/IP based LAN, asynchronous message passing is used in our method. Simulation results show that it achieves better performance.

Proceedings ArticleDOI
01 Nov 2005
TL;DR: The 2ke, a flexible and modular computational system, is described; it allows developers to standardise on one processor, instruction set, software architecture and tool chain for many projects while maintaining common development tools.
Abstract: Embedded computational hardware has become prevalent in recent years for communications signal processing for reasons including size and cost. The availability of competing single-processor solutions from traditional vendors gives system designers a degree of choice. Some recent market entrants have even embraced parallel concepts within their architectures. However, the fact remains that while one particular computational device or parallel configuration may suit a given application, it seldom suits a broad range of other applications. This promotes design inefficiency: either developers familiar with one solution from a previous project choose to use it for the next project despite some probable degree of mismatch, or they are faced with the costly learning curve implied in the adoption of a different, but possibly better matched, architecture. A preferable approach is to allow computational hardware to be adapted at a micro- and macro-architectural level to fit requirements on a project-to-project basis, while maintaining a common instruction set and development tools. This gives designers the flexibility to choose the degree of parallelism and the type of parallel arrangement required for their application, but without requiring a new tool and hardware learning curve. This paper describes the 2ke, a flexible and modular computational system that allows developers to standardise on one processor, instruction set, software architecture and tool chain for many projects. Architectural enhancements to its forerunner, the 2k2, are presented to permit micro-architectural parallelism to be chosen along a continuum from SISD at one extreme to full SIMD at the other, whilst the very nature of the 2ke permits extension to MIMD along an orthogonal development direction. Results in terms of logic cell usage, current consumption and memory usage are presented for each arrangement using example application code.

Journal Article
TL;DR: To overcome the shortcomings of the separation of the product design process and the project development process, a systematic project scheduling methodology for complex product development is presented.
Abstract: To overcome the shortcomings of the separation of the product design process and the project development process, a systematic project scheduling methodology for complex product development is presented. First, the product development process is modeled using the DSM (Design Structure Matrix) and is optimized by minimizing the feedback iterations, which generates a controllable set of DSMs. For each resultant DSM, the corresponding CPM (Critical Path Method) network is constructed, based on which the critical path and activities are identified and the project lead-time is calculated. Finally, the optimal DSM and project schedule plan are obtained by using the traditional crashing technique or by increasing the degree of parallelism of sequential activities on the critical path. The feasibility and efficiency of the proposed method are demonstrated by a case study.
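Once the DSM optimisation has removed feedback iterations, the CPM lead-time computation reduces to a longest-path forward pass:

```python
def project_lead_time(durations, preds):
    """Forward pass of the Critical Path Method on an acyclic activity
    network: compute the earliest finish of each activity recursively,
    and return the project lead-time as the maximum over all activities.
    Assumes feedback iterations were removed by the DSM optimisation."""
    ef = {}
    def finish(a):
        if a not in ef:
            ef[a] = durations[a] + max(
                (finish(p) for p in preds.get(a, ())), default=0)
        return ef[a]
    return max(finish(a) for a in durations)
```

The critical path is the chain of activities whose durations sum to this lead-time; crashing or parallelising activities on it shortens the project.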

Proceedings ArticleDOI
03 Jul 2005
TL;DR: The algorithm builds the conjugate-direction decomposition (CDD) of the optimal transversal filter weight vector using a novel stabilized parallel version of the Gram-Schmidt orthogonalization (GSO) using a bootstrapped mechanism of parallel crossed feedbacks.
Abstract: A new fast-converging numerically stable parallel algorithm of adaptive antenna beamforming is introduced. The algorithm builds the conjugate-direction decomposition (CDD) of the optimal transversal filter weight vector using a novel stabilized parallel version of the Gram-Schmidt orthogonalization (GSO). The numerical robustness of the modified GSO version is achieved through a bootstrapped mechanism of parallel crossed feedbacks. Regarding the number of independent input samples, the new algorithm has the same convergence as that of the widely used sample matrix inversion (SMI) method, but its real time of adaptation appears to be much faster due to the high degree of parallelism, reduced numerical complexity, and ability to be implemented with fix-point arithmetic.
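For reference, the textbook (classical) Gram-Schmidt orthogonalization; the paper's contribution is a stabilized parallel variant of this with crossed feedbacks:

```python
def gram_schmidt(vectors):
    """Classical Gram-Schmidt orthonormalization of a list of vectors.

    Serial textbook version, shown only to make the decomposition the
    paper parallelizes concrete; it is known to be numerically fragile,
    which is exactly what the paper's bootstrapped feedbacks address.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            # subtract the projection of w onto each accepted direction
            dot = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - dot * qi for wi, qi in zip(w, q)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > 1e-12:  # skip (numerically) dependent vectors
            basis.append([wi / norm for wi in w])
    return basis
```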

Journal ArticleDOI
TL;DR: A generic framework of sparse parallelization which can be applied to any numerical programs satisfying the usual syntactic constraints of parallelization, based on both a refinement of the data-dependence test proposed by Bernstein and an inspector-executor scheme which is specialized to each input program of the compiler.

Proceedings ArticleDOI
TL;DR: In this paper, a VLSI architecture for the integer-to-integer wavelet transform which is used by JPEG2000 standard for lossless compression is proposed and implemented using Xilinx FPGA device, and its main results are provided.
Abstract: In this paper we propose and examine a VLSI architecture for the integer-to-integer wavelet transform which is used by the JPEG2000 standard for lossless compression. In order to achieve full utilization of hardware resources independently of the bit-depth of the input data, on-line arithmetic (digit-serial computation) is proposed to carry out this architecture. In addition, high throughput is achieved thanks to the high degree of parallelism that on-line arithmetic allows. The design has been simulated and implemented using a Xilinx FPGA device, and its main results are provided.
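The reversible 5/3 lifting step used by JPEG2000 lossless coding can be sketched for one decomposition level; boundary handling here assumes an even-length signal with whole-sample symmetric extension:

```python
def fwd53(x):
    """Forward reversible 5/3 lifting (one level) on an even-length
    integer signal: predict odd samples from their even neighbours,
    then update the even samples from the new highpass values.
    Symmetric extension at the right border (x[n] -> x[n-2]) and for
    the first update (d[-1] -> d[0])."""
    n = len(x)
    half = n // 2
    d = [x[2 * i + 1] - (x[2 * i] + x[2 * i + 2 if 2 * i + 2 < n else n - 2]) // 2
         for i in range(half)]
    s = [x[2 * i] + (d[i - 1 if i > 0 else 0] + d[i] + 2) // 4
         for i in range(half)]
    return s, d

def inv53(s, d):
    """Inverse lifting: undo the update step, then the predict step."""
    half = len(s)
    x = [0] * (2 * half)
    for i in range(half):
        x[2 * i] = s[i] - (d[i - 1 if i > 0 else 0] + d[i] + 2) // 4
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < 2 * half else x[2 * half - 2]
        x[2 * i + 1] = d[i] + (x[2 * i] + right) // 2
    return x
```

Because each lifting step adds an integer quantity that the inverse subtracts exactly, the transform is perfectly reversible, which is what makes it suitable for lossless compression.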