
Showing papers on "Degree of parallelism" published in 2005


Proceedings ArticleDOI
04 Apr 2005
TL;DR: This work presents two adaptive algorithms that achieve average improvements of 10% in performance and 35% in stability for the tested workloads, and provides best parameter configurations for each algorithm.
Abstract: The scheduler is a key component in determining the overall performance of a parallel computer, and as we show here, the schedulers in wide use today exhibit large unexplained gaps in performance during their operation. Also, different scheduling algorithms often vary in the gaps they show, suggesting that choosing the correct scheduler for each time frame can improve overall performance. We present two adaptive algorithms that achieve this: one chooses by recent past performance, and the other by the recent average degree of parallelism, which is shown to be correlated with algorithmic superiority. Simulation results for the algorithms on production workloads are analyzed, and illustrate unique features of the chaotic temporal structure of parallel workloads. We provide the best parameter configurations for each algorithm, both of which achieve average improvements of 10% in performance and 35% in stability for the tested workloads.
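The first adaptive policy, choosing by recent past performance, can be sketched as a sliding-window selector. The window size and scheduler names below are illustrative assumptions, not the paper's configuration:

```python
from collections import deque

class AdaptiveSelector:
    """Choose a scheduler per time frame from recent past performance.

    A sketch of the performance-driven policy: keep a sliding window of
    observed slowdowns per candidate scheduler and run the one with the
    best recent average.  Window size and scheduler names are illustrative.
    """

    def __init__(self, schedulers, window=10):
        self.schedulers = list(schedulers)
        # one bounded history of observed slowdowns per candidate
        self.history = {s: deque(maxlen=window) for s in self.schedulers}

    def record(self, scheduler, slowdown):
        self.history[scheduler].append(slowdown)

    def choose(self):
        # schedulers with no samples yet score infinity; a real policy
        # would also need some exploration to sample them
        def recent_avg(s):
            h = self.history[s]
            return sum(h) / len(h) if h else float("inf")
        return min(self.schedulers, key=recent_avg)
```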

33 citations


Proceedings ArticleDOI
01 Jun 2005
TL;DR: The usage of POSE, the authors' parallel object-oriented simulation environment, for application performance prediction on large parallel machines such as BlueGene is explored and the utility of the simulator is illustrated through prediction and validation studies for a molecular dynamics application.
Abstract: Parallel discrete event simulation (PDES) of models with fine-grained computation remains a challenging problem. We explore the usage of POSE, our parallel object-oriented simulation environment, for application performance prediction on large parallel machines such as BlueGene. This study involves the simulation of communication at the packet level through a detailed network model. This presents an extremely fine-grained simulation: events correspond to the transmission and receipt of packets. Computation is minimal, communication dominates, and strong dependencies between events result in a low degree of parallelism. There is limited look-ahead capability since the outcome of many events is determined by the application whose performance the simulation is predicting. Thus conservative synchronization approaches are challenging for this type of problem. We present recent experiences and performance results for our network simulator and illustrate the utility of our simulator through prediction and validation studies for a molecular dynamics application.

32 citations


Book ChapterDOI
22 Jan 2005
TL;DR: A model with a strictly lower power of computation is obtained by relaxing the hypothesis on the existence of a port numbering; the model studied involves more synchronization than the message passing model, since a high level of synchronization in one atomic computation step makes a model powerful but reduces the degree of parallelism.
Abstract: The different local computation mechanisms are very useful for delimiting the borderline between positive and negative results in distributed computing. Indeed, they make it possible to study the importance of the synchronization level and to understand how important the initial knowledge is. A high level of synchronization involved in one atomic computation step makes a model powerful but reduces the degree of parallelism. Charron-Bost et al. [1] study the difference between synchronous and asynchronous message passing models. The model studied in this paper involves more synchronization than the message passing model: an elementary computation step modifies the states of two neighbours in the network, depending only on their current states. The information the processors initially have can be global information about the network, such as the size, the diameter or the topology of the network. The initial knowledge can also be local: each node can initially know its own degree, for example. Another example of local knowledge is the existence of a port numbering: each processor locally assigns numbers to its incident edges and, in this way, can consistently distinguish its neighbours. In Angluin's model [2], it is assumed that a port numbering exists, whereas this is not the case in our model. In fact, we obtain a model with a strictly lower power of computation by relaxing the hypothesis on the existence of a port numbering.

28 citations


Journal ArticleDOI
01 Dec 2005
TL;DR: A multiprocessor strategy is designed that exploits the computational characteristics of the algorithms proposed in the literature for biological sequence comparison; the problem of aligning biological sequences is attempted for the first time in the domain of DLT.
Abstract: In this paper, we design a multiprocessor strategy that exploits the computational characteristics of the algorithms used for biological sequence comparison proposed in the literature. We employ divisible load theory (DLT), which is suitable for handling large-scale processing on network-based systems. For the first time in the domain of DLT, the problem of aligning biological sequences is attempted. The objective is to minimize the total processing time of the alignment process. In designing our strategy, DLT facilitates a clever partitioning of the entire computation process such that the overall time consumed for aligning the sequences is a minimum. The partitioning takes into account the computation speeds of the nodes and the underlying communication network. Since this is a real-life application, the post-processing phase becomes important, and hence we consider propagating the results back in order to generate an exact alignment. We consider several cases in our analysis, deriving closed-form solutions for the processing time for heterogeneous networks, homogeneous networks, and networks with slow links. Further, we attempt to employ a multi-installment strategy to distribute the tasks such that a higher degree of parallelism can be achieved. For slow networks, our strategy recommends near-optimal solutions. We derive an important condition to identify such cases and propose two heuristic strategies. Also, our strategy can be extended to multisequence alignment by utilizing a clustering strategy such as the Berger-Munson algorithm proposed in the literature. Finally, we use real-life DNA samples of the house mouse mitochondrion (Mus musculus mitochondrion, NC.001569), consisting of 16 295 residues, and the DNA of the human mitochondrion (Homo sapiens mitochondrion, NC.001807), consisting of 16 571 residues, obtainable from GenBank, in our rigorous simulation experiments to illustrate all the theoretical findings.
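The core DLT idea of partitioning by node speed can be illustrated in a computation-only sketch; the paper's closed-form solutions additionally account for link speeds, result propagation, and multi-installment distribution:

```python
def load_shares(speeds):
    """Optimal fractions of a divisible workload for compute-only nodes.

    With node speed s_i (residues per second) and no communication cost,
    all nodes finish simultaneously when alpha_i / s_i is the same for
    every node, i.e. alpha_i proportional to s_i.  The paper's closed-form
    solutions extend this to include link speeds and the propagation of
    results back for post-processing.
    """
    total = float(sum(speeds))
    return [s / total for s in speeds]
```

For three nodes with speeds 4, 2, and 1, the fastest node receives four sevenths of the sequence and every node finishes at the same time.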

26 citations


Book ChapterDOI
30 Aug 2005
TL;DR: An alternative approach to balance the load in parallel adaptive finite element simulations is presented and a heuristic that contains a high degree of parallelism and computes well shaped connected partitions is obtained.
Abstract: Load balancing plays an important role in parallel numerical simulations. State-of-the-art libraries addressing this problem are based on vertex exchange heuristics that are embedded in a multilevel scheme. However, these are hard to parallelize due to their sequential nature. Furthermore, libraries like Metis and Jostle focus on a small edge-cut and cannot obey constraints like connectivity and straight partition boundaries, which are important for some numerical solvers. In this paper we present an alternative approach to balance the load in parallel adaptive finite element simulations. We compute a distribution that is based on solutions of linear equations. Integrated into a learning framework, we obtain a heuristic that contains a high degree of parallelism and computes well-shaped connected partitions. Furthermore, our experiments indicate that we can find solutions that are comparable to those of the two state-of-the-art libraries Metis and Jostle, also with regard to classic metrics like edge-cut and boundary length.
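A generic sketch of linear-system-based balancing is first-order diffusion, whose fixed point is the balanced load; this illustrates the idea, not the paper's exact scheme:

```python
def diffusion_step(load, neighbors, alpha):
    """One first-order diffusion sweep on a processor graph.

    Every node receives alpha * (neighbour load - own load) across each
    edge; the sweep conserves total load, and repeated sweeps converge to
    the uniform (balanced) load on a connected graph when alpha is at most
    1 / (maximum degree + 1).  A generic linear-iteration sketch only.
    """
    new = dict(load)
    for u, nbrs in neighbors.items():
        for v in nbrs:
            new[u] += alpha * (load[v] - load[u])
    return new
```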

25 citations


Journal ArticleDOI
TL;DR: In this paper, the authors describe the application of the parallel integration evaluation model (PIEM) in an industrial case study and propose a design solution that obeys the tradeoff that parallelism introduces into the networked supply operating system: while direct-production/supply time decreases, the overhead of interaction time among the parties, T, increases.
Abstract: This paper describes the application of the parallel integration evaluation model (PIEM) in an industrial case study. The PIEM model is based on modelling the interactions among supply network parties. It generates the parallel configuration of production and supply servers yielding the minimum total production and supply time/cost for the system, Φ. The design solution recommended by the model obeys the tradeoff that parallelism introduces into the networked supply operating system: while direct-production/supply time Π decreases, the overhead of interaction time among the networked parties, T, increases. The interaction time comprises two delay generating factors, limiting the implementation of massively parallel supply networks: the delay due to communication, negotiation, and coordination among the parties, K, and the congestion delay Γ at shared resources in the supply network. These two types of delay factors are positively correlated with the network's degree of parallelism, Ψ, and they affect inve...
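The tradeoff can be illustrated numerically; the functional forms below (direct time falling as Π₀/Ψ, delays K and Γ growing linearly with Ψ) are assumptions for illustration, not the PIEM model's equations:

```python
def total_time(psi, pi0=100.0, k=2.0, g=1.0):
    """Phi(psi): total production and supply time at degree of
    parallelism psi.  Direct time is assumed to fall as pi0 / psi, while
    coordination delay K and congestion delay Gamma are assumed to grow
    linearly (k * psi and g * psi); both forms are illustrative."""
    return pi0 / psi + k * psi + g * psi

# the interior minimum reflects the tradeoff: more parallelism cuts
# direct time but inflates the interaction delays
best_psi = min(range(1, 21), key=total_time)
```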

18 citations


Patent
Kenneth Alan Dockser1
09 Jun 2005
TL;DR: In this article, power control of one or more processing elements matches a degree of parallelism to requirements of a task performed in a highly parallel programmable data processor, where the power control can be selected to conserve power.
Abstract: Selective power control of one or more processing elements matches a degree of parallelism to requirements of a task performed in a highly parallel programmable data processor. For example, when program operations require less than the full width of the data path, a software instruction of the program sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element, that is not needed, can be shut down to conserve power. At a later time, when the added capacity is needed, execution of another software instruction sets the mode of operation to that of the wider data path, typically the full width, and the mode change reactivates the previously shut-down processing element.

18 citations


Proceedings ArticleDOI
04 Jun 2005
TL;DR: This paper will describe the principles and features of SMI++ as well as its integration with an industrial SCADA tool for use by the LHC experiments, and it will be shown that such tools can provide a very convenient mechanism for the automation of large-scale, high-complexity applications.
Abstract: The new LHC experiments at CERN have very large numbers of channels to operate. In order to be able to configure and monitor such large systems, a high degree of parallelism is necessary. The control system is built as a hierarchy of sub-systems distributed over several computers. A toolkit, SMI++, combining two approaches, finite state machines and rule-based programming, allows for the description of the various sub-systems as decentralized deciding entities, reacting in real-time to changes in the system, thus providing for the automation of standard procedures and for the automatic recovery from error conditions in a hierarchical fashion. In this paper we describe the principles and features of SMI++ as well as its integration with an industrial SCADA tool for use by the LHC experiments, and we show that such tools can provide a very convenient mechanism for the automation of large-scale, high-complexity applications.

12 citations


Journal ArticleDOI
TL;DR: Using instruction traces from common applications, quantitative analyses of implicit operands, memory addressing, and condition codes (three sources of significant limitations on the maximum achievable parallelism in the x86 architecture) have been performed, and conclusions are presented relating the obtained degree of parallelism to negative characteristics of the x86 instruction set architecture.

11 citations


Journal ArticleDOI
TL;DR: A two-level scheduling method (TSM) is proposed, which integrates unimodular transformations, the loop tiling technique, and conventional methods used on a single DSP, and can achieve shorter execution times and more scalable speedups.

10 citations


Book ChapterDOI
01 Jan 2005
TL;DR: In this article, the authors describe the use and implementation of skeletons on emerging computational grids, with the skeleton system Lithium, based on Java and RMI, as their reference programming system.
Abstract: Skeletons are common patterns of parallelism, such as farm and pipeline, that can be abstracted and offered to the application programmer as programming primitives. We describe the use and implementation of skeletons on emerging computational grids, with the skeleton system Lithium, based on Java and RMI, as our reference programming system. Our main contribution is the exploration of optimization techniques for implementing skeletons on grids based on an optimized, future-based RMI mechanism, which we integrate into the macro-dataflow evaluation mechanism of Lithium. We discuss three optimizations: 1) a lookahead mechanism that allows multiple tasks to be processed concurrently at each grid server and thereby increases the overall degree of parallelism, 2) a lazy task-binding technique that reduces interactions between grid servers and the task dispatcher, and 3) dynamic improvements that optimize the collecting of results and the work-load balancing. We report experimental results that demonstrate the improvements due to our optimizations on various testbeds, including a heterogeneous grid-like environment.
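A farm skeleton with lookahead can be approximated in a few lines; here the lookahead is modelled by allowing several in-flight tasks per server, a simplification of Lithium's future-based RMI mechanism:

```python
from concurrent.futures import ThreadPoolExecutor

def farm(worker, tasks, servers=4, lookahead=2):
    """Farm skeleton sketch: dispatch independent tasks to a pool.

    Lookahead is modelled by keeping servers * lookahead tasks in flight,
    so each server always has work queued; Lithium instead dispatches
    macro-dataflow tasks to remote grid servers over RMI.
    """
    with ThreadPoolExecutor(max_workers=servers * lookahead) as pool:
        # pool.map preserves task order in the result list
        return list(pool.map(worker, tasks))
```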

Journal ArticleDOI
01 Aug 2005
TL;DR: A method to accurately compute the distribution of the largest (Max) and the smallest execution time of the composite of a number of parallel programming tasks, each having an independent, stochastic, arbitrary workload is presented.
Abstract: Predicting the execution time of parallel programs involves computing the maximum or minimum of the execution times of the tasks involved in the parallel computation. We present a method to accurately compute the distribution of the largest (Max) and the smallest (Min) execution time of the composite of a number of parallel programming tasks, each having an independent, stochastic, arbitrary workload. The Max function applies to the general case in which the composite task completes when its longest constituent task terminates. The Min function applies when the completion of the shortest task terminates the whole parallel process, such as in a parallel searching program. Both the Min and Max density functions of a constituent task are characterized in terms of a Pearson distribution. Due to its accuracy, the presented method is especially of interest when the performance of time-critical parallel applications must be derived. Both prediction methods are tested against three well-known distributions. Furthermore, the Max prediction method is also tested against a number of measured real-life data-parallel programs with different degrees of parallelism. The results show excellent accuracy, better than 1%, with very few exceptions in extreme situations.
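For independent task times the exact Max and Min distributions follow from elementary identities, which the paper's Pearson-based method approximates for arbitrary workloads:

```python
def cdf_max(cdfs, t):
    """P(max_i X_i <= t) for independent task times: the composite
    finishes by t only if every constituent task has finished by t,
    so the CDF is the product of the individual CDFs."""
    p = 1.0
    for F in cdfs:
        p *= F(t)
    return p

def cdf_min(cdfs, t):
    """P(min_i X_i <= t) = 1 - prod_i (1 - F_i(t)): the parallel search
    terminates by t unless every task is still running at t."""
    q = 1.0
    for F in cdfs:
        q *= 1.0 - F(t)
    return 1.0 - q
```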

Journal ArticleDOI
TL;DR: A block red-black coloring is introduced to increase the degree of parallelism in the application of the block ILU preconditioner for solving sparse linear systems arising from convection-diffusion equations discretized using the finite difference scheme (five-point operator).
Abstract: It is well known that the ordering of the unknowns can have a significant effect on the convergence of a preconditioned iterative method and on its implementation on a parallel computer. We therefore introduce a block red-black coloring to increase the degree of parallelism in the application of the block ILU preconditioner for solving sparse linear systems arising from convection-diffusion equations discretized using the finite difference scheme (five-point operator). We study the preconditioned GMRES iterative method for solving these linear systems.
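The checkerboard idea can be sketched directly: with a five-point stencil, blocks of the same colour are not coupled, so each colour class can be processed in parallel:

```python
def red_black_blocks(nx, ny):
    """Checkerboard (red-black) colouring of an nx-by-ny grid of blocks.

    Under a five-point stencil, blocks of one colour have no couplings
    among themselves, so a block preconditioner can be applied to a whole
    colour class in parallel, one colour after the other.
    """
    red, black = [], []
    for j in range(ny):
        for i in range(nx):
            (red if (i + j) % 2 == 0 else black).append((i, j))
    return red, black
```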

Proceedings ArticleDOI
27 Jun 2005
TL;DR: A new scheduling algorithm is introduced, which is based on using an objective function to guide the search for a near optimal solution, which includes different criteria such as real-time deadlines, reliability, and quantitative measures of the communication, degree of parallelism and processing power fragmentation.
Abstract: Improper scheduling of real-time applications on a cluster may lead to missing required deadlines and offset the gain of using the system and software parallelism. Most existing scheduling algorithms do not consider factors such as real-time deadlines, system reliability, processing power fragmentation, inter-task communication and degree of parallelism on performance. In this paper we introduce a new scheduling algorithm, which is based on using an objective function to guide the search for a near optimal solution. This objective function includes different criteria such as real-time deadlines, reliability, and quantitative measures of the communication, degree of parallelism and processing power fragmentation. The presence of different criteria may affect the overall acceptance rate of the applications. We also investigate the effect of reliability on the overall acceptance rate.
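A minimal sketch of an objective-function-guided choice; the metric names, normalisation, and weighted-sum combination are illustrative assumptions, not the paper's formula:

```python
def schedule_objective(metrics, weights):
    """Weighted objective guiding the search for a near-optimal schedule.

    Higher is better; the metric names, normalisation to [0, 1], and the
    weighted-sum combination are illustrative assumptions only."""
    return sum(weights[k] * metrics[k] for k in metrics)

# two hypothetical candidate allocations, scored on the paper's criteria
candidates = [
    {"deadline": 0.9, "reliability": 0.8, "communication": 0.4,
     "parallelism": 0.7, "fragmentation": 0.6},
    {"deadline": 0.5, "reliability": 0.9, "communication": 0.9,
     "parallelism": 0.4, "fragmentation": 0.8},
]
weights = {"deadline": 2.0, "reliability": 1.0, "communication": 1.0,
           "parallelism": 1.0, "fragmentation": 1.0}
best = max(candidates, key=lambda m: schedule_objective(m, weights))
```

Weighting deadlines heavily steers the search toward allocations that meet real-time constraints, at the cost of the other criteria.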

Patent
Kenneth Alan Dockser1
09 Jun 2005
TL;DR: In this paper, automatic power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition in a highly parallel programmable data processor.
Abstract: Automatic selective power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition, in a highly parallel programmable data processor. For example, logic of the parallel processor detects when program operations (e.g. for a particular task or due to a detected temperature) require less than the full width of the data path. In response, the control logic automatically sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element, that is not needed, can be shut down, to conserve energy and/or to reduce heating (i.e., power consumption). At a later time, when operation of the added capacity is appropriate, the logic detects the change in processing conditions and automatically sets the mode of operation to that of the wider data path, typically the full width. The mode change reactivates the previously shut-down processing element.

Journal ArticleDOI
01 Jan 2005
TL;DR: The feasibility of exploiting hardware parallelism to accelerate the interleaving procedure is demonstrated; based on a heuristic algorithm, the possible speedup for different interleavers is presented as a function of the degree of parallelism of the hardware.
Abstract: Today's communications systems especially in the field of wireless communications rely on many different algorithms to provide applications with constantly increasing data rates and higher quality. This development combined with the wireless channel characteristics as well as the invention of turbo codes has particularly increased the importance of interleaver algorithms. In this paper, we demonstrate the feasibility to exploit the hardware parallelism in order to accelerate the interleaving procedure. Based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented. The parallelization is generic in the sense that the assumed underlying hardware is based on a parallel datapath DSP architecture and therefore provides the flexibility of software solutions.

Proceedings ArticleDOI
06 Jun 2005
TL;DR: The authors presented an adaptive construction of the bitonic counting network, layered on an overlay network which provides an efficient peer-to-peer lookup service, and uses the recursive structure present in thebitonic network to adapt its implementation.
Abstract: Counting networks are well-studied parallel and distributed data structures, which are useful in synchronization applications such as distributed counting and load balancing. However, current constructions of counting networks are static, since their width (the degree of parallelism), and hence the size of the network, have to be fixed in advance. This presents an obstacle to implementing them efficiently in a large distributed system whose size may be changing due to nodes joining and leaving the network. The authors present an adaptive construction of the bitonic counting network. The network tunes its width to the system size in a distributed and local way. With high probability, the effective "width" of the network is Omega(N/log^2 N), where N is the number of nodes currently in the system, and the effective "depth" of the network is O(log^2 N). In contrast, a static implementation would have the same width irrespective of the system size. When the system size changes, the network adapts by splitting or merging its components. All decisions and actions are decentralized: these include the decision of when to split and merge the components, and the action of splitting and merging them. The construction is layered on an overlay network which provides an efficient peer-to-peer lookup service, and uses the recursive structure present in the bitonic network to adapt its implementation. Though the bitonic network is discussed, the technique could be applied to build an adaptive implementation of any distributed data structure that can be decomposed in a recursive way.

Proceedings ArticleDOI
17 Oct 2005
TL;DR: Boolean Web-service automata for distributed Web services are introduced as a parallel model for interaction and interoperability between applications, and the generality of BWA leads to a high degree of parallelism and efficient composition among Web service applications.
Abstract: Boolean Web-service automata (BWA) for distributed Web services are introduced as a parallel model for interaction and interoperability between applications. Boolean automata are a generalization of nondeterministic automata. The generality of BWA leads to a high degree of parallelism and efficient composition among Web service applications. We also consider two formalisms: (1) deterministic Web-service automata (DWA), a model supporting Web service composition, and (2) conversation Web-service automata (CWA), a conversation model supporting Web service interaction. DWA and CWA complement BWA in conjunction with the composition and conversation operations.

Book ChapterDOI
22 Oct 2005
TL;DR: Experimental results indicated that on an SMP system the multi-threaded Prolog could achieve a high degree of parallelism while the server could obtain scalability, and the application of the server to clinical decision support in a hospital information system demonstrated that themulti-threading Prolog and the server were sufficiently robust for use in an enterprise application.
Abstract: A knowledge-based system is suitable for realizing advanced functions that require domain-specific expert knowledge in enterprise-mission-critical information systems (enterprise applications). This paper describes a newly implemented multi-threaded Prolog system that evolves single-threaded Inside Prolog. It is intended as a means to apply a knowledge-based system written in Prolog to an enterprise application. It realizes a high degree of parallelism on an SMP system by minimizing mutual exclusion, providing the scalability essential in enterprise use. Also briefly introduced is the knowledge processing server, a framework for operating a knowledge-based system written in Prolog with an enterprise application. Experimental results indicated that on an SMP system the multi-threaded Prolog could achieve a high degree of parallelism while the server could obtain scalability. The application of the server to clinical decision support in a hospital information system also demonstrated that the multi-threaded Prolog and the server were sufficiently robust for use in an enterprise application.

Proceedings ArticleDOI
05 Dec 2005
TL;DR: The block processing engine can satisfy the stringent real-time constraints imposed by emerging technologies and its efficiency has been proven through the implementation of a dual standard frequency domain equalizer supporting 3GPP HSDPA and IEEE 802.11a.
Abstract: This paper presents the block processing engine (BPE), a programmable architecture specifically suited for high-throughput wireless communications. Thanks to a high degree of parallelism and a consistent use of pipelined processing, the BPE can satisfy the stringent real-time constraints imposed by emerging technologies. Its efficiency has been proven through the implementation of a dual standard frequency domain equalizer supporting 3GPP HSDPA and IEEE 802.11a.

Book ChapterDOI
06 Jun 2005
TL;DR: In this paper, a fast and highly parallel algorithm for pricing CDD weather derivatives is presented, which consists of multiple convolutions of functions with a Gaussian distribution and can be computed efficiently with the fast Gauss transform.
Abstract: We present a fast and highly parallel algorithm for pricing CDD weather derivatives, which are financial products for hedging weather risks due to higher-than average temperature in summer. To find the price, we need to compute the expected value of its payoff, namely, the CDD weather index. To this end, we derive a new recurrence formula to compute the probability density function of the CDD. The formula consists of multiple convolutions of functions with a Gaussian distribution and can be computed efficiently with the fast Gauss transform. In addition, our algorithm has a large degree of parallelism because each convolution can be computed independently. Numerical experiments show that our method is more than 10 times faster than the conventional Monte Carlo method when computing the prices of various CDD derivatives on one processor. Moreover, parallel execution on a PC cluster with 8 nodes attains up to six times speedup, allowing the pricing of most of the derivatives to be completed in about 10 seconds.
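The conventional Monte Carlo baseline mentioned in the abstract can be sketched as follows; the 18 °C base temperature and the payoff form tick × CDD are customary assumptions:

```python
import random

def cdd_index(daily_temps, base=18.0):
    """Cooling degree days: positive excursions of the daily average
    temperature above the base (18 degrees C is a customary choice)."""
    return sum(max(t - base, 0.0) for t in daily_temps)

def mc_price(simulate_summer, tick=1.0, n_paths=10000, seed=0):
    """Conventional Monte Carlo pricing: average tick * CDD over
    simulated temperature paths.  This is the baseline the paper beats;
    its own method instead builds the CDD density by repeated
    convolutions with a Gaussian, evaluated via the fast Gauss
    transform, and prices each convolution independently in parallel."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        total += tick * cdd_index(simulate_summer(rng))
    return total / n_paths
```

`simulate_summer` is a user-supplied temperature model (e.g. a mean-reverting process driven by `rng.gauss`).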

01 Jan 2005
TL;DR: Two techniques to design good S-random interleavers, to be used in parallel and serially concatenated codes with interleaver, are proposed and an example of the advantages is provided in a realistic system framework.
Abstract: In this paper, we propose two techniques to design good S-random interleavers, to be used in parallel and serially concatenated codes with interleavers. The interleavers designed according to these algorithms can be shortened, in order to support different block lengths, in such a way that all the permutations obtained by pruning, when employed in a parallel turbo decoder, are collision-free. The first technique, suitable for short and medium interleavers, guarantees the same performance as non-parallel interleavers in terms of spreading properties, simulated frame-error probabilities, and obtainable minimum distance of the actual codes. The second algorithm, to be used for large block lengths, permits achieving high degrees of parallelism at the price of a slight degradation of the spread properties, and also allows changing the degree of parallelism on the fly. The operations of a parallel turbo decoder employing these interleavers are described, and an example of the advantages of the proposed techniques is provided in a realistic system framework.
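The basic S-random construction that both techniques build on can be sketched as rejection sampling with restarts; the paper's contributions (pruning support, collision-freedom, on-the-fly parallelism changes) sit on top of this spread property:

```python
import random

def s_random_interleaver(n, s, seed=0, max_restarts=1000):
    """Basic S-random construction (spread s >= 1): draw values in a
    random order, accepting one only if it differs by more than s from
    each of the previous s accepted values; restart on a dead end.
    Succeeds with high probability for s up to about sqrt(n / 2)."""
    rng = random.Random(seed)
    for _ in range(max_restarts):
        pool = list(range(n))
        rng.shuffle(pool)
        perm = []
        while pool:
            for k, v in enumerate(pool):
                if all(abs(v - p) > s for p in perm[-s:]):
                    perm.append(pool.pop(k))
                    break
            else:
                break  # dead end: no admissible value left in the pool
        if not pool:
            return perm
    raise RuntimeError("no S-random permutation found; reduce s")
```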

01 Jan 2005
TL;DR: In this paper, the degree of parallelism is defined as the amount of non-redundant parallelism needed in the derivations of Lindenmayer and Bharat systems.
Abstract: In this paper, the degree of parallelism is introduced and investigated. The degree of parallelism is a natural descriptional complexity measure of Lindenmayer and Bharat systems. This concept quantifies the amount of non-redundant parallelism needed in the derivations of those systems. We consider both static and dynamic versions of this notion. Corresponding hierarchy and undecidability results are established. Furthermore, we show that the degree of parallelism links to the notions of growth functions and active symbols.

Book ChapterDOI
02 Nov 2005
TL;DR: In this paper, graph theory is introduced into transient stability analysis in power systems by using a weighted graph, which reflects the degree of parallelism of the computation and improves the speed-up ratio of the system.
Abstract: In this paper, we introduce graph theory into transient stability analysis in power systems. In the weighted graph, vertex weight represents a node's parallel computing workload and edge weight represents the serial computing workload on the border of regions, which reflects the degree of parallelism of the computation and improves the speed-up ratio of the system. In order to reduce the communication time wastage induced by the CSMA protocol in a TCP/IP based LAN, asynchronous message passing is used in our method. Simulation results show that it achieves better performance.

Proceedings ArticleDOI
01 Nov 2005
TL;DR: The 2ke, a flexible and modular computational system, is described; it allows developers to standardise on one processor, instruction set, software architecture and tool chain for many projects while maintaining common development tools.
Abstract: Embedded computational hardware has become prevalent in recent years for communications signal processing for reasons including size and cost. The availability of competing single-processor solutions from traditional vendors gives system designers a degree of choice. Some recent market entrants have even embraced parallel concepts within their architectures. However, the fact remains that while one particular computational device or parallel configuration may suit a given application, it seldom suits a broad range of other applications. This promotes design inefficiency: either developers familiar with one solution from a previous project choose to use it for the next project despite some probable degree of mismatch, or they are faced with the costly learning curve implied in the adoption of a different, but possibly better matched, architecture. A preferable approach is to allow computational hardware to be adapted at a micro- and macro-architectural level to fit requirements on a project-to-project basis, while maintaining a common instruction set and development tools. This gives designers the flexibility to choose the degree of parallelism and the type of parallel arrangement required for their application, but without requiring a new tool and hardware learning curve. This paper describes the 2ke, a flexible and modular computational system that allows developers to standardise on one processor, instruction set, software architecture and tool chain for many projects. Architectural enhancements to its forerunner, the 2k2, are presented to permit micro-architectural parallelism to be chosen along a continuum from SISD at one extreme to full SIMD at the other, whilst the very nature of the 2ke permits extension to MIMD along an orthogonal development direction. Results in terms of logic cell usage, current consumption and memory usage are presented for each arrangement using example application code.

Journal Article
TL;DR: To overcome the shortcomings of the separation of the product design process and the project development process, a systematic project scheduling methodology for complex product development is presented.
Abstract: To overcome the shortcomings of the separation of the product design process and the project development process, a systematic project scheduling methodology for complex product development is presented. First, the product development process is modeled using the DSM (Design Structure Matrix) and is optimized by minimizing the feedback iterations, which generates a controllable set of DSMs. For each resultant DSM, the corresponding CPM (Critical Path Method) network is constructed, based on which the critical path and activities are identified and the project lead-time is calculated. Finally, the optimal DSM and project schedule plan are obtained by using the traditional crashing technique or by increasing the degree of parallelism of sequential activities on the critical path. The feasibility and efficiency of the proposed method are demonstrated by a case study.
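Once the DSM optimisation has removed feedback iterations, the CPM lead-time computation reduces to a longest-path forward pass:

```python
def project_lead_time(durations, preds):
    """Forward pass of the Critical Path Method on an acyclic activity
    network: compute the earliest finish of each activity recursively,
    and return the project lead-time as the maximum over all activities.
    Assumes feedback iterations were removed by the DSM optimisation."""
    ef = {}
    def finish(a):
        if a not in ef:
            ef[a] = durations[a] + max(
                (finish(p) for p in preds.get(a, ())), default=0)
        return ef[a]
    return max(finish(a) for a in durations)
```

The critical path is the chain of activities whose durations sum to this lead-time; crashing or parallelising activities on it shortens the project.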

Proceedings ArticleDOI
03 Jul 2005
TL;DR: The algorithm builds the conjugate-direction decomposition (CDD) of the optimal transversal filter weight vector using a novel stabilized parallel version of the Gram-Schmidt orthogonalization (GSO) using a bootstrapped mechanism of parallel crossed feedbacks.
Abstract: A new fast-converging numerically stable parallel algorithm of adaptive antenna beamforming is introduced. The algorithm builds the conjugate-direction decomposition (CDD) of the optimal transversal filter weight vector using a novel stabilized parallel version of the Gram-Schmidt orthogonalization (GSO). The numerical robustness of the modified GSO version is achieved through a bootstrapped mechanism of parallel crossed feedbacks. Regarding the number of independent input samples, the new algorithm has the same convergence as that of the widely used sample matrix inversion (SMI) method, but its real time of adaptation appears to be much faster due to the high degree of parallelism, reduced numerical complexity, and ability to be implemented with fix-point arithmetic.
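For reference, the textbook (classical) Gram-Schmidt orthogonalization; the paper's contribution is a stabilized parallel variant of this with crossed feedbacks:

```python
def gram_schmidt(vectors):
    """Classical Gram-Schmidt orthonormalization of a list of vectors.

    Serial textbook version, shown only to make the decomposition the
    paper parallelizes concrete; it is known to be numerically fragile,
    which is exactly what the paper's bootstrapped feedbacks address.
    """
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            # subtract the projection of w onto each accepted direction
            dot = sum(wi * qi for wi, qi in zip(w, q))
            w = [wi - dot * qi for wi, qi in zip(w, q)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > 1e-12:  # skip (numerically) dependent vectors
            basis.append([wi / norm for wi in w])
    return basis
```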

Journal ArticleDOI
TL;DR: A generic framework of sparse parallelization which can be applied to any numerical programs satisfying the usual syntactic constraints of parallelization, based on both a refinement of the data-dependence test proposed by Bernstein and an inspector-executor scheme which is specialized to each input program of the compiler.

Proceedings ArticleDOI
TL;DR: In this paper, a VLSI architecture for the integer-to-integer wavelet transform which is used by JPEG2000 standard for lossless compression is proposed and implemented using Xilinx FPGA device, and its main results are provided.
Abstract: In this paper we propose and examine a VLSI architecture for the integer-to-integer wavelet transform which is used by the JPEG2000 standard for lossless compression. In order to achieve full utilization of hardware resources independently of the bit-depth of the input data, on-line arithmetic (digit-serial computation) is proposed to carry out this architecture. In addition, high throughput is achieved thanks to the high degree of parallelism that on-line arithmetic allows. The design has been simulated and implemented using a Xilinx FPGA device, and its main results are provided.
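The reversible 5/3 lifting step used by JPEG2000 lossless coding can be sketched for one decomposition level; boundary handling here assumes an even-length signal with whole-sample symmetric extension:

```python
def fwd53(x):
    """Forward reversible 5/3 lifting (one level) on an even-length
    integer signal: predict odd samples from their even neighbours,
    then update the even samples from the new highpass values.
    Symmetric extension at the right border (x[n] -> x[n-2]) and for
    the first update (d[-1] -> d[0])."""
    n = len(x)
    half = n // 2
    d = [x[2 * i + 1] - (x[2 * i] + x[2 * i + 2 if 2 * i + 2 < n else n - 2]) // 2
         for i in range(half)]
    s = [x[2 * i] + (d[i - 1 if i > 0 else 0] + d[i] + 2) // 4
         for i in range(half)]
    return s, d

def inv53(s, d):
    """Inverse lifting: undo the update step, then the predict step."""
    half = len(s)
    x = [0] * (2 * half)
    for i in range(half):
        x[2 * i] = s[i] - (d[i - 1 if i > 0 else 0] + d[i] + 2) // 4
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < 2 * half else x[2 * half - 2]
        x[2 * i + 1] = d[i] + (x[2 * i] + right) // 2
    return x
```

Because each lifting step adds an integer quantity that the inverse subtracts exactly, the transform is perfectly reversible, which is what makes it suitable for lossless compression.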