scispace - formally typeset

Showing papers on "Degree of parallelism published in 2008"


Journal ArticleDOI
01 Jul 2008
TL;DR: A massively parallel machine called Anton is described, which should be capable of executing millisecond-scale classical MD simulations of biomolecular systems; it has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation.
Abstract: The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, chemistry, and medicine. A wide range of biologically interesting phenomena, however, occur over timescales on the order of a millisecond---several orders of magnitude beyond the duration of the longest current MD simulations. We describe a massively parallel machine called Anton, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems. The machine, which is scheduled for completion by the end of 2008, is based on 512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network. Anton has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation. The remainder of the simulation algorithm is executed by a programmable portion of each chip that achieves a substantial degree of parallelism while preserving the flexibility necessary to accommodate anticipated advances in physical models and simulation methods.

778 citations


Proceedings ArticleDOI
15 Nov 2008
TL;DR: This paper proposes dynamic file partitioning methods that adapt to the underlying locking protocols in the parallel file systems and evaluates the performance of four partitioning methods under two locking protocols.
Abstract: Collective I/O, such as that provided in MPI-IO, enables process collaboration among a group of processes for greater I/O parallelism. Its implementation involves file domain partitioning, and having the right partitioning is a key to achieving high-performance I/O. As modern parallel file systems maintain data consistency by adopting a distributed file locking mechanism to avoid centralized lock management, different locking protocols can have a significant impact on the degree of parallelism of a given file domain partitioning method. In this paper, we propose dynamic file partitioning methods that adapt to the underlying locking protocols in the parallel file systems and evaluate the performance of four partitioning methods under two locking protocols. By running multiple I/O benchmarks, our experiments demonstrate that no single partitioning method guarantees the best performance. Using MPI-IO as an implementation platform, we provide guidelines for selecting the most appropriate partitioning methods for various I/O patterns and file systems.
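The core idea, aligning file-domain boundaries to the file system's lock granularity so that no two processes contend for the same locked region, can be sketched in a few lines of Python. The function name and the uniform lock-granularity assumption are illustrative, not taken from the paper:

```python
def partition_file_domain(start, end, nprocs, lock_unit):
    """Split the byte range [start, end) across nprocs processes,
    rounding each interior boundary down to a multiple of the lock
    granularity so no two domains share a locked region."""
    total = end - start
    bounds = []
    for p in range(nprocs + 1):
        b = start + total * p // nprocs
        if 0 < p < nprocs:                  # align interior boundaries only
            b = (b // lock_unit) * lock_unit
        bounds.append(b)
    return [(bounds[p], bounds[p + 1]) for p in range(nprocs)]
```

With a 64-byte lock unit, `partition_file_domain(0, 1000, 4, 64)` yields contiguous domains whose interior boundaries all fall on lock-unit multiples, so neighbouring processes never block each other on a shared lock.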

85 citations


Proceedings ArticleDOI
15 Nov 2008
TL;DR: A portable MPI-IO layer is proposed where certain tasks, such as file caching, consistency control, and collective I/O optimization, are delegated to a small set of compute nodes, collectively termed I/O Delegate nodes, which alleviates the lock contention at I/O servers.
Abstract: Increasingly complex scientific applications require massive parallelism to achieve the goals of fidelity and high computational performance. Such applications periodically offload checkpointing data to the file system for post-processing and program resumption. As a side effect of the high degree of parallelism, I/O contention at servers prevents overall performance from scaling with an increasing number of processors. To bridge the gap between parallel computational and I/O performance, we propose a portable MPI-IO layer where certain tasks, such as file caching, consistency control, and collective I/O optimization, are delegated to a small set of compute nodes, collectively termed I/O Delegate nodes. A collective cache design is incorporated to resolve cache coherence and hence alleviate lock contention at I/O servers. Using popular parallel I/O benchmarks and application I/O kernels, our experimental evaluation indicates considerable performance improvement with a small percentage of compute resources reserved for I/O.

82 citations


Proceedings ArticleDOI
13 Feb 2008
TL;DR: This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures, and shows that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks.
Abstract: This paper examines the scalable parallel implementation of the QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors and an SMP architecture with 16 cores.
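The scheduling principle, execute each block operation as soon as its dependencies are satisfied, is independent of the linear algebra. A minimal sequential sketch of that logic (a runtime like SuperMatrix would dispatch ready tasks to worker threads; names here are mine):

```python
from collections import deque

def run_dag(tasks, deps):
    """tasks: {name: callable}; deps: {name: set of prerequisite names}.
    Runs each task as soon as all of its prerequisites have completed,
    mimicking out-of-order scheduling of an algorithm-by-blocks."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    ready = deque(t for t, d in remaining.items() if not d)
    done, order = set(), []
    while ready:
        t = ready.popleft()
        tasks[t]()                      # a real runtime runs this on a worker
        done.add(t)
        order.append(t)
        for u, d in remaining.items():  # promote newly unblocked tasks
            if u not in done and u not in ready and d <= done:
                ready.append(u)
    return order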

53 citations


Journal ArticleDOI
TL;DR: In this article, the authors designed scalable deep-packet filters on field-programmable gate arrays (FPGAs) to search for all data-independent patterns simultaneously, which can scale linearly to support a greater number of patterns, as well as higher data throughput.
Abstract: Most network routers and switches provide some protection against network attacks. However, the rapidly increasing amount of damage reported over the past few years indicates the urgent need for tougher security. Deep-packet inspection is one of the solutions to capture packets that cannot be identified using traditional methods. It uses a list of signatures to scan the entire content of the packet, providing the means to filter harmful packets out of the network. Since one signature does not depend on another, the filtering process has a high degree of parallelism. Most software and hardware deep-packet filters in use today execute these tasks on a von Neumann architecture, which cannot fully exploit that parallelism. For instance, one of the most widely used network intrusion-detection systems, Snort, configured with 845 patterns and running on a dual 1-GHz Pentium III system, can sustain a throughput of only 50 Mbps. The poor performance is due to the fact that the processor is programmed to execute several tasks sequentially instead of simultaneously. We designed scalable deep-packet filters on field-programmable gate arrays (FPGAs) to search for all data-independent patterns simultaneously. With FPGAs, we have the ability to reprogram the filter whenever the signature set changes. The smallest full-pattern matcher implementation for the latest Snort NIDS fits in a single 400k Xilinx FPGA (Spartan 3-XC3S400) with a sustained throughput of 1.6 Gbps. Given a larger FPGA, the design can scale linearly to support a greater number of patterns, as well as higher data throughput.
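The contrast with sequential scanning is easy to state. In this illustrative sketch (my own, not the paper's circuit), a processor must walk a doubly nested loop, whereas the FPGA design evaluates every pattern comparator in the same clock cycle:

```python
def scan_payload(payload, signatures):
    """Check every signature at every payload offset. On an FPGA all
    per-pattern comparators fire simultaneously on each incoming byte;
    a von Neumann processor is forced through this nested loop."""
    hits = set()
    for i in range(len(payload)):
        for sig in signatures:   # hardware evaluates these in parallel
            if payload.startswith(sig, i):
                hits.add(sig)
    return hits
```

Because no signature depends on another, the inner loop has no cross-iteration state, which is precisely what makes the per-pattern match units independent in hardware.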

43 citations


Proceedings Article
25 Feb 2008
TL;DR: It is demonstrated that this parameter is indeed critical, as it determines the degree of parallelism in the system, and optimal piece sizes for distributing small and large content are investigated.
Abstract: Peer-to-peer content distribution systems have been enjoying great popularity, and are now gaining momentum as a means of disseminating video streams over the Internet. In many of these protocols, including the popular BitTorrent, content is split into mostly fixed-size pieces, allowing a client to download data from many peers simultaneously. This makes piece size potentially critical for performance. However, previous research efforts have largely overlooked this parameter, opting to focus on others instead. This paper presents the results of real experiments with varying piece sizes on a controlled BitTorrent testbed. We demonstrate that this parameter is indeed critical, as it determines the degree of parallelism in the system, and we investigate optimal piece sizes for distributing small and large content. We also pinpoint a related design tradeoff, and explain how BitTorrent's choice of dividing pieces into subpieces attempts to address it.
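The tradeoff is simple arithmetic: smaller pieces mean more pieces, hence more opportunities for parallel download, but also more per-piece protocol traffic. A rough sketch (the 9-byte figure assumes BitTorrent's HAVE message size of 4 length bytes, 1 id byte, and a 4-byte piece index; treat it as an assumption of this illustration):

```python
import math

def piece_tradeoff(content_size, piece_size, have_msg_bytes=9):
    """Return (piece_count, announce_overhead_bytes_per_neighbour).
    More pieces raise the achievable degree of parallelism, but every
    completed piece is announced to each neighbour, so announcement
    overhead grows as pieces shrink."""
    pieces = math.ceil(content_size / piece_size)
    overhead = pieces * have_msg_bytes
    return pieces, overhead
```

For a 700 MiB file, 256 KiB pieces give 2800 pieces versus 175 for 4 MiB pieces: sixteen times the parallelism granularity, and sixteen times the announcement traffic.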

39 citations


Proceedings ArticleDOI
01 Jun 2008
TL;DR: A new efficient multi-objective evolutionary algorithm for solving computationally-intensive optimization problems based on a steady-state design and a new performance metric is suggested that combines convergence and diversity into one single measure.
Abstract: This paper presents a new efficient multi-objective evolutionary algorithm for solving computationally-intensive optimization problems. To support a high degree of parallelism, the algorithm is based on a steady-state design. For improved efficiency the algorithm utilizes a surrogate to identify promising candidate solutions and filter out poor ones. To handle the uncertainties associated with the approximate surrogate evaluations, a new method for multi-objective optimization is described which is generally applicable to all surrogate techniques. In this method, surrogate objective values assigned to offspring are adjusted to account for the error of the surrogate. The algorithm is evaluated on the ZDT benchmark functions and on a real-world problem of manufacturing optimization. In assessing the performance of the algorithm, a new performance metric is suggested that combines convergence and diversity into one single measure. Results from both the benchmark experiments and the real-world test case indicate the potential of the proposed algorithm.

37 citations


Proceedings ArticleDOI
10 Dec 2008
TL;DR: Real hardware implementation shows that the FPGA-based implementation of the Black-Scholes model outperforms an equivalent software implementation running on a workstation cluster with the same number of computing nodes (CPU/FPGA) by a factor of 750, making it the fastest FPGA implementation of this model reported to date.
Abstract: Monte-Carlo simulation is a very widely used technique in scientific computations in general, with huge computational benefits in solving problems where closed-form solutions are impossible to derive. This technique is also characterized by a high degree of parallelism as a large number of different simulation paths need to be calculated, which makes it ideal for a parallel hardware implementation. This paper illustrates the benefits of such an implementation in the context of financial computing, as it implements a financial Monte-Carlo simulation engine on an FPGA-based supercomputer, called Maxwell, developed at the University of Edinburgh. The latter consists of a 32-CPU cluster augmented with 64 Virtex-4 Xilinx FPGAs connected in a 2D torus. Our engine can implement various Monte-Carlo simulations on the Maxwell machine with speed-ups of three orders of magnitude compared to equivalent software implementations. This is illustrated in this paper in the context of an implementation of the Black-Scholes option pricing model. Real hardware measurements show that our FPGA-based implementation of the Black-Scholes model outperforms an equivalent software implementation running on a workstation cluster with the same number of computing nodes (CPU/FPGA) by a factor of 750, making it the fastest FPGA implementation of this model reported to date.
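The pricing kernel itself is compact. A plain-Python Monte-Carlo sketch of the Black-Scholes model under geometric Brownian motion (parameter names are mine; the FPGA engine's advantage is that every path below is independent and can be evaluated in parallel):

```python
import math, random

def mc_european_call(s0, strike, rate, vol, maturity, n_paths, seed=1):
    """Monte-Carlo price of a European call: draw terminal prices under
    geometric Brownian motion, average the discounted payoffs."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(n_paths):             # each path is independent
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + diffusion * z)
        payoff_sum += max(s_t - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / n_paths
```

With S0 = K = 100, r = 5%, volatility 20% and one year to maturity, the estimate converges toward the closed-form Black-Scholes value of roughly 10.45.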

35 citations


01 Jan 2008
TL;DR: A novel EA for numerical optimization inspired by the multiple-universes principle of quantum computing is proposed, and results show that this algorithm can find better solutions, with fewer evaluations, than similar algorithms.
Abstract: Since they were proposed as an optimization method, evolutionary algorithms (EAs) have been used to solve problems in several research fields. This success is due, among other things, to the fact that these algorithms do not require prior assumptions about the problem to be optimized and offer a high degree of parallelism. However, some problems are computationally intensive with regard to solution evaluation, which makes optimization by EAs slow in some situations. This paper proposes a novel EA for numerical optimization inspired by the multiple-universes principle of quantum computing. Results show that this algorithm can find better solutions, with fewer evaluations, than similar algorithms.

30 citations


Journal ArticleDOI
TL;DR: A maintenance-free itinerary-based approach to K-nearest neighbors query processing called density-aware itinerary KNN query processing (DIKNN), which outperforms the second runner with up to a 50 percent saving in energy consumption and a 40 percent reduction in query response time, while rendering the same level of query result accuracy.
Abstract: The K-nearest neighbors (KNN) query has been of significant interest in many studies and has become one of the most important spatial queries in mobile sensor networks. Applications of KNN queries may include vehicle navigation, wildlife social discovery, and squad/platoon searching on the battlefields. Current approaches to KNN search in mobile sensor networks require a certain kind of indexing support. This index could be either a centralized spatial index or an in-network data structure that is distributed over the sensor nodes. Creation and maintenance of these index structures, to reflect the network dynamics due to sensor node mobility, may result in long query response time and low battery efficiency, thus limiting their practical use. In this paper, we propose a maintenance-free itinerary-based approach called density-aware itinerary KNN query processing (DIKNN). The DIKNN divides the search area into multiple cone-shaped areas centered at the query point. It then performs a query dissemination and response collection itinerary in each of the cone-shaped areas in parallel. The design of the DIKNN scheme takes into account several challenging issues, such as the trade-off between the degree of parallelism and network interference on query response time, and the dynamic adjustment of the search radius (in terms of number of hops) according to spatial irregularity or mobility of sensor nodes. To optimize the performance of DIKNN, a detailed analytical model is derived that automatically determines the most suitable degree of parallelism under various network conditions. This model is validated by extensive simulations. The simulation results show that DIKNN yields substantially better performance and scalability over previous work, both as K increases and as the sensor node mobility increases. It outperforms the second runner with up to a 50 percent saving in energy consumption and up to a 40 percent reduction in query response time, while rendering the same level of query result accuracy.
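The cone-partitioning step is straightforward geometry: the plane around the query point is cut into equal angular sectors, one itinerary per sector. A hypothetical helper illustrating the assignment (the paper's scheme additionally routes an itinerary through each cone and tunes the number of cones analytically):

```python
import math

def cone_sector(query, node, n_cones):
    """Index of the cone-shaped area, centred at the query point, that
    a node at position (x, y) falls into; the plane is split into
    n_cones equal angular sectors starting from the positive x-axis."""
    angle = math.atan2(node[1] - query[1], node[0] - query[0]) % (2 * math.pi)
    return int(angle / (2 * math.pi / n_cones))
```

With four cones, nodes due east, north, west, and south of the query land in sectors 0 through 3; `n_cones` is exactly the degree of parallelism the DIKNN model tunes against network interference.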

27 citations


Proceedings ArticleDOI
01 Oct 2008
TL;DR: A matrix-based formulation for the Cyclic Redundancy Check (CRC) computation that is derived from its polynomial-based definition is developed and it is shown that the time-area product follows the critical path delay plot.
Abstract: In this paper, we develop a matrix-based formulation for the Cyclic Redundancy Check (CRC) computation that is derived from its polynomial-based definition. Then, using this formulation, we propose a parallel CRC computation structure with optimizations specific to the case when the degree of parallelism is greater than the degree of the generator polynomial. Afterward, through extensive simulations we obtain the optimum degrees of parallelism in terms of their critical path delays for some common generator polynomials. We also show that the time-area product follows the critical path delay plot.
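Because the CRC register update is linear over GF(2), the combined effect of the old state and of a block of w input bits can be precomputed by probing the bit-serial update with unit vectors; the resulting lookup entries are the columns of the two transition matrices in a matrix-based formulation. A sketch of that idea (my own illustration, not the authors' optimized structure; CRC-8 with generator 0x07 is used for the check, and the degree of parallelism w = 16 exceeds the polynomial degree, matching the case the paper targets):

```python
def crc_serial_step(state, bit, poly, width):
    """One bit-serial CRC register update (MSB-first feedback)."""
    fb = ((state >> (width - 1)) ^ bit) & 1
    state = (state << 1) & ((1 << width) - 1)
    return state ^ (poly if fb else 0)

def crc_bitwise(bits, poly, width, state=0):
    for b in bits:
        state = crc_serial_step(state, b, poly, width)
    return state

def make_parallel_tables(poly, width, w):
    """Columns of the state- and input-transition matrices, obtained by
    running the serial CRC on unit vectors (valid by GF(2) linearity)."""
    state_cols = [crc_bitwise([0] * w, poly, width, 1 << j) for j in range(width)]
    input_cols = [crc_bitwise([1 if k == j else 0 for k in range(w)], poly, width, 0)
                  for j in range(w)]
    return state_cols, input_cols

def crc_parallel(bits, poly, width, w):
    """Process w bits per step: XOR the matrix columns selected by the
    set bits of the state and of the input block (assumes len(bits)
    is a multiple of w)."""
    state_cols, input_cols = make_parallel_tables(poly, width, w)
    state = 0
    for i in range(0, len(bits), w):
        block = bits[i:i + w]
        nxt = 0
        for j in range(width):
            if (state >> j) & 1:
                nxt ^= state_cols[j]
        for j in range(w):
            if block[j]:
                nxt ^= input_cols[j]
        state = nxt
    return state
```

The parallel version must agree bit-for-bit with the serial one; in hardware each XOR tree of selected columns becomes combinational logic evaluated in a single cycle.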

Journal ArticleDOI
TL;DR: A multicore parallelization of Kohn-Sham density functional theory is described, using an accelerator technology made by ClearSpeed Technology; efficiently scaling parallelization over 2304 cores is achieved.
Abstract: A multicore parallelization of Kohn-Sham density functional theory is described, using an accelerator technology made by ClearSpeed Technology. Efficiently scaling parallelization over 2304 cores is achieved. To deliver this degree of parallelism, the Coulomb problem is reformulated to use Poisson density fitting with numerical quadrature of the required three-index integrals; extensive testing reveals negligible errors from the additional approximations.

Journal ArticleDOI
TL;DR: This algorithm is much simpler than previous offline algorithms for scheduling malleable jobs, which require more than a constant number of passes through the job list, and it realistically asserts a relationship between job length and the maximum useful degree of parallelism.

29 Apr 2008
TL;DR: The purpose of this study was to investigate the degree of parallelism of test forms assembled with the WDM heuristic using both CTT and IRT methods and concluded that the CTT approach performed at least as well as the IRT approaches.
Abstract: The automated assembly of alternate test forms for online delivery provides an alternative to computer-administered, fixed test forms, or computerized-adaptive tests when a testing program migrates from paper/pencil testing to computer-based testing. The weighted deviations model (WDM) heuristic is particularly promising for automated test assembly (ATA) because it is computationally straightforward and produces tests with desired properties under realistic testing conditions. Unfortunately, research into the WDM heuristic has focused exclusively on Item Response Theory (IRT) methods even though there are situations under which Classical Test Theory (CTT) item statistics are the only data available to test developers. The purpose of this study was to investigate the degree of parallelism of test forms assembled with the WDM heuristic using both CTT and IRT methods. Alternate forms of a 60-item test were assembled from a pool of 600 items. One CTT and two IRT approaches were used to generate content and psychometric constraints. The three methods were compared in terms of conformity to the test-assembly constraints, average test overlap rate, content parallelism, and statistical parallelism. The results led to a primary conclusion that the CTT approach performed at least as well as the IRT approaches. The possible reasons for the comparability of the three test-assembly approaches are discussed and suggestions for future ATA applications are provided.

Journal ArticleDOI
TL;DR: This work introduces a recursive variant of OBF and experimentally evaluates several different implementations of it that vary in the degree of parallelism, compared with other successful SCC decomposition techniques.

Proceedings ArticleDOI
05 May 2008
TL;DR: It is demonstrated that for modern medical imaging applications, parallel implementations on traditional parallel architectures can be outperformed, both in terms of speed and cost-effectiveness, by new implementations on next-generation architectures like GPUs.
Abstract: We demonstrate that for modern medical imaging applications, parallel implementations on traditional parallel architectures (clusters and multiprocessor servers) can be outperformed, both in terms of speed and cost-effectiveness, by new implementations on next-generation architectures like GPUs (Graphics Processing Units). Although, compared to clusters and multiprocessor servers, GPUs are rather small and much less expensive, they consist of several SIMD processors and thus provide a high degree of parallelism. For an iterative image reconstruction algorithm---the list-mode OSEM---we demonstrate, first, the limitations of parallel reconstructions with this algorithm on the traditional parallel architectures, and second, how the well-analyzed parallel strategies for traditional architectures can be adapted systematically to achieve fast reconstructions on the GPU.

Proceedings ArticleDOI
14 Apr 2008
TL;DR: It is shown that an algorithm-by-blocks exposes a higher degree of parallelism than traditional implementations based on multithreaded BLAS.
Abstract: The scalable parallel implementation, targeting SMP and/or multicore architectures, of dense linear algebra libraries is analyzed. Using the LU factorization as a case study, it is shown that an algorithm-by-blocks exposes a higher degree of parallelism than traditional implementations based on multithreaded BLAS. The implementation of this algorithm using the SuperMatrix runtime system is discussed and the scalability of the solution is demonstrated on two different platforms with 16 processors.

Dissertation
01 Nov 2008
TL;DR: This thesis presents architectures and field-programmable gate array (FPGA) implementations of two variants of the DCD algorithm, known as the cyclic and leading DCD algorithms, for real-valued and complex-valued systems and shows applications of the designs to complex division, antenna array beamforming and adaptive filtering.
Abstract: In areas of signal processing and communications such as antenna array beamforming, adaptive filtering, multi-user and multiple-input multiple-output (MIMO) detection, channel estimation and equalization, echo and interference cancellation and others, solving linear systems of equations often provides an optimal performance. However, this is also a very complicated operation that designers try to avoid by proposing different sub-optimal solutions. The dichotomous coordinate descent (DCD) algorithm allows linear systems of equations to be solved with high computational efficiency. It is a multiplication-free and division-free technique and, therefore, it is well suited for hardware implementation. In this thesis, we present architectures and field-programmable gate array (FPGA) implementations of two variants of the DCD algorithm, known as the cyclic and leading DCD algorithms, for real-valued and complex-valued systems. For each of these techniques, we present architectures and implementations with different degrees of parallelism. The proposed architectures allow a trade-off between FPGA resources and the computation time. The fixed-point implementations provide an accuracy performance which is very close to the performance of floating-point counterparts. We also show applications of the designs to complex division, antenna array beamforming and adaptive filtering. The DCD-based complex divider is based on the idea that the complex division can be viewed as a problem of finding the solution of a 2x2 real-valued system of linear equations, which is solved using the DCD algorithm. Therefore, the new divider uses no multiplications or divisions. Compared with the classical complex divider, the DCD-based complex divider requires a significantly smaller chip area. A DCD-based minimum variance distortionless response (MVDR) beamformer employs the DCD algorithm to find the antenna array weights without multiplications.
An FPGA implementation of the proposed DCD-MVDR beamformer requires a much smaller chip area and achieves much higher throughput than other implementations. The performance of the fixed-point implementation is very close to that of a floating-point implementation of the MVDR beamformer using direct matrix inversion. When incorporating the DCD algorithm in a recursive least squares (RLS) adaptive filter, a new efficient technique, named the RLS-DCD algorithm, is derived. The RLS-DCD algorithm expresses the RLS adaptive filtering problem in terms of auxiliary normal equations with respect to increments of the filter weights. The normal equations are approximately solved by using the DCD iterations. The RLS-DCD algorithm is well suited to hardware implementation and its complexity is as low as O(N^2) operations per sample in a general case and O(N) operations per sample for transversal RLS adaptive filters. The performance of the RLS-DCD algorithm, including both fixed-point and floating-point implementations, can be made arbitrarily close to that of the floating-point classical RLS algorithm. Furthermore, a new dynamically regularized RLS-DCD algorithm is also proposed to reduce the complexity of the regularized RLS problem from O(N^3) to O(N^2) in a general case and to O(N) for transversal adaptive filters. This dynamically regularized RLS-DCD algorithm is simple for finite-precision implementation and requires small chip resources.
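The divider's reformulation is easy to state in code: (a+bj)/(c+dj) is the solution (x, y) of a 2x2 real-valued system. The sketch below sets up that system and, for brevity, solves it directly by Cramer's rule rather than by the DCD iterations the thesis uses:

```python
def complex_divide_via_system(a, b, c, d):
    """Compute (a+bj)/(c+dj) by solving the 2x2 real system
        [ c -d ] [x]   [a]
        [ d  c ] [y] = [b]
    which is the formulation the DCD-based divider iterates on.
    Here the system is solved directly for illustration."""
    det = c * c + d * d          # determinant of the 2x2 matrix
    x = (a * c + b * d) / det
    y = (b * c - a * d) / det
    return complex(x, y)
```

The result matches ordinary complex division; the point of the DCD approach is that its coordinate-descent iterations reach the same solution using only additions and bit shifts.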

Proceedings ArticleDOI
22 May 2008
TL;DR: It is proved that input patterns can be encoded in the synaptic weights by local Hebbian delay-learning where, after learning, the firing time of an output neuron reflects the distance of the evaluated pattern to its learned input pattern, thus realizing a kind of RBF neuron.
Abstract: In this paper we describe a novel, hardware-implementation-friendly model of spiking neurons with "sparse temporal coding". This is then used to implement a neural network on an FPGA platform, yielding a high degree of parallelism. In the first section of this paper the biological background of spiking neural networks is discussed, such as the structure and functionality of natural neurons, which form the basis of the artificially built ones presented later. With a clustering application in mind, we prove that input patterns can be encoded in the synaptic weights by local Hebbian delay-learning where, after learning, the firing time of an output neuron reflects the distance of the evaluated pattern to its learned input pattern, thus realizing a kind of RBF neuron. Further in the paper, we show that temporal spike-time coding and Hebbian learning are a viable means for unsupervised computation in a network of spiking neurons, as the network is capable of clustering realistic data. The modular neuron structure, the multiplier-less, fully parallel FPGA hardware implementation of the network, and the signals acquired during and after the learning phase are given, with an interpretation of the results compared to other results reported in the specific literature.

Journal ArticleDOI
TL;DR: The design and implementation of an efficient reconfigurable parallel prefix computation hardware on field-programmable gate arrays (FPGAs) based on a pipelined dataflow algorithm, and control logic is added to reconfigure the system for arbitrary parallelism degree.
Abstract: This paper presents the design and implementation of an efficient reconfigurable parallel prefix computation hardware on field-programmable gate arrays (FPGAs). The design is based on a pipelined dataflow algorithm, and control logic is added to reconfigure the system for an arbitrary degree of parallelism. The system receives multiple input streams of elements in parallel and produces output streams in parallel. It has the advantage of controlling the degree of parallelism explicitly at run time. The time complexity of the design is O(d+(N-d)/d), where d and N are the parallelism degree and stream size, respectively. When the stream size is sufficiently larger than the initial trigger time of the pipeline (d), the time complexity becomes O(N/d). Unlike the prefix computation circuits found in the literature, the design is scalable for different problem sizes including unknown-sized data. The design is modular, based on a finite state machine, and implemented and tested for target FPGA devices Xilinx Spartan2S XC2S300EFT256-6Q and XC2S600EFG676-6.
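The O(N/d) behaviour, scan d independent streams and then carry each chunk's total into the next, can be mimicked in software. An illustrative sketch (my own; the paper's hardware is a pipelined dataflow circuit, not this two-phase loop):

```python
def parallel_prefix(xs, d):
    """Prefix sum of xs with parallelism degree d: split the stream
    into d chunks, scan each chunk independently (the parallel phase),
    then propagate each chunk's total as an offset into the next."""
    n = len(xs)
    chunk = -(-n // d)                      # ceil(n / d)
    pieces = [xs[i:i + chunk] for i in range(0, n, chunk)]
    scans = []
    for p in pieces:                        # independent per-stream scans
        acc, s = 0, []
        for v in p:
            acc += v
            s.append(acc)
        scans.append(s)
    out, offset = [], 0
    for s in scans:                         # sequential offset propagation
        out.extend(v + offset for v in s)
        offset += s[-1]
    return out
```

The per-chunk scans are the work that runs concurrently on d streams, so the dominant cost is the N/d elements each stream handles, matching the stated asymptotic time.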

Proceedings Article
01 Jan 2008
TL;DR: The effects of increasing the number of items displayed to users in menus through parallelism are examined and it is found that moving from serial to a partially parallel (traditional) menu significantly improved user performance, but moving from a partially Parallel to a fully parallel menu design had more ambiguous results.
Abstract: Menus and toolbars are the primary controls for issuing commands in modern interfaces. As software systems continue to support increasingly large command sets, the user's task of locating the desired command control is progressively time consuming. Many factors influence a user's ability to visually search for and select a target in a set of menus or toolbars, one of which is the degree of parallelism in the display arrangement. A fully parallel layout will show all commands at once, allowing the user to visually scan all items without needing to manipulate the interface, but there is a risk that this will harm performance due to excessive visual clutter. At the other extreme, a fully serial display minimises visual clutter by displaying only one item at a time, but separate interface manipulations are necessary to display each item. This paper examines the effects of increasing the number of items displayed to users in menus through parallelism---displaying multiple menus simultaneously, spanning both horizontally and vertically---and compares it to traditional menus and pure serial display menus. We found that moving from serial to a partially parallel (traditional) menu significantly improved user performance, but moving from a partially parallel to a fully parallel menu design had more ambiguous results. The results have direct design implications for the layout of command interfaces.

Journal ArticleDOI
TL;DR: The proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions and can save more power and chip area than the SMT/CMP approach without significant performance degradation.
Abstract: In most parallel loops of embedded applications, every iteration executes the exact same sequence of instructions while manipulating different data. This fact motivates a new compiler-hardware orchestrated execution framework in which all parallel threads share one fetch unit and one decode unit but have their own execution, memory, and write-back units. This resource sharing enables parallel threads to execute in lockstep with minimal hardware extension and compiler support. Our proposed architecture, called multithreaded lockstep execution processor (MLEP), is a compromise between the single-instruction multiple-data (SIMD) and symmetric multithreading/chip multiprocessor (SMT/CMP) solutions. The proposed approach is more favorable than a typical SIMD execution in terms of degree of parallelism, range of applicability, and code generation, and can save more power and chip area than the SMT/CMP approach without significant performance degradation. For the architecture verification, we extend a commercial 32-bit embedded core AE32000C and synthesize it on Xilinx FPGA. Compared to the original architecture, our approach is 13.5% faster with a 2-way MLEP and 33.7% faster with a 4-way MLEP in EEMBC benchmarks which are automatically parallelized by the Intel compiler. Keywords: ILP, TLP, SMT, CMP, MLEP.

Book ChapterDOI
16 Dec 2008
TL;DR: This work pursues the scalable parallel implementation of the Cholesky factorization of band matrices with medium to large bandwidth targeting SMP and multi-core architectures by decomposing the computation into a large number of fine-grained operations exposing a higher degree of parallelism.
Abstract: We pursue the scalable parallel implementation of the factorization of band matrices with medium to large bandwidth targeting SMP and multi-core architectures. Our approach decomposes the computation into a large number of fine-grained operations exposing a higher degree of parallelism. The SuperMatrix run-time system allows an out-of-order scheduling of operations that is transparent to the programmer. Experimental results for the Cholesky factorization of band matrices on two parallel platforms with sixteen processors demonstrate the scalability of the solution.
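As a rough illustration of the decomposition the abstract describes, the sketch below (not the SuperMatrix implementation itself) breaks a right-looking Cholesky factorization into per-block tasks -- diagonal factor, panel triangular solve, trailing update -- each of which a task-parallel runtime could schedule out of order once its dependencies complete. The block size `nb` and function names are illustrative assumptions.

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Right-looking blocked Cholesky. Each block operation (factor,
    triangular solve, rank-nb update) is a fine-grained task that a
    run-time system such as SuperMatrix could schedule out of order."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Task 1: factor the diagonal block A11 = L11 * L11^T.
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Task 2: panel solve, A21 <- A21 * L11^{-T}.
            A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T
            # Task 3: symmetric rank-nb update of the trailing matrix.
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

For a band matrix, the trailing update only touches blocks inside the bandwidth, which is what keeps the task count (and memory traffic) proportional to the band rather than the full matrix.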

Journal Article
TL;DR: This paper presents a review of the concepts behind turbo product codes and designs an alternative based on the high degree of parallelism available in reconfigurable hardware devices such as FPGAs, using these field-programmable devices to implement functional modules such as encoders.
Abstract: This paper presents a review of the concepts behind turbo product codes, with the aim of designing an alternative implementation based on the high degree of parallelism available in reconfigurable hardware devices such as FPGAs, and uses these field-programmable devices to build functional modules such as encoders. The selected modules have been described in a hardware description language, then synthesized and simulated using the Xilinx ISE 9.2i design tool, with which the components were programmed; findings are presented for the alternatives considered. Keywords: VHDL, reconfigurable hardware, coding, digital communications.
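For intuition about why product codes map well onto parallel hardware, the hedged sketch below encodes a block using single-parity-check component codes in place of the extended Hamming codes normally used in turbo product codes; every row encoder (and then every column encoder) is independent, which is the parallelism an FPGA can exploit. The function name and parameters are illustrative, not from the paper.

```python
def product_encode(bits, k=4):
    """Encode a k*k information block as a (k+1)x(k+1) product codeword
    using single-parity-check component codes (a simplified stand-in
    for extended Hamming component codes). All row encoders are
    independent of one another, as are all column encoders."""
    assert len(bits) == k * k
    rows = [bits[i * k:(i + 1) * k] for i in range(k)]
    # Append a parity bit to every row (rows can be encoded in parallel).
    rows = [r + [sum(r) % 2] for r in rows]
    # Append a parity row: one parity bit per column (also parallel).
    parity_row = [sum(r[j] for r in rows) % 2 for j in range(k + 1)]
    return rows + [parity_row]
```

Every row and every column of the resulting array has even parity, so the iterative row/column decoding of a turbo product code can likewise run its component decoders concurrently.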

Journal ArticleDOI
TL;DR: Within this publication, a thorough characterisation of graph properties typical for task graphs in the field of wireless embedded system design has been undertaken and has led to the development of an entirely new approach for the system partitioning problem.
Abstract: The research field of system partitioning in modern electronic system design began to attract strong attention from scientists about fifteen years ago. Since a multitude of formulations of the partitioning problem exist, an equal multitude can be found in the strategies that address it. Their feasibility is highly dependent on the platform abstraction and the degree of realism that it features. This work originated from the intention to identify the most mature and powerful approaches to system partitioning in order to integrate them into a consistent design framework for wireless embedded systems. Within this publication, a thorough characterisation of graph properties typical for task graphs in the field of wireless embedded system design has been undertaken and has led to the development of an entirely new approach to the system partitioning problem. The restricted range exhaustive search algorithm is introduced and compared to popular and well-reputed heuristic techniques based on tabu search, genetic algorithms, and the global criticality/local phase algorithm. It shows superior performance for a set of system graphs featuring specific properties found in human-made task graphs, since it exploits their typical characteristics such as locality, sparsity, and their degree of parallelism.

Dissertation
01 Jun 2008
TL;DR: A new hardware-based parallel implementation of the iterative Conjugate Gradient (CG) algorithm for solving linear systems of equations is proposed, successfully employed in a set of haptic interaction experiments using static and dynamic linear FE-based models.
Abstract: In the last two decades there has been an increasing interest in the field of haptics science. Real-time simulation of haptic interaction with non-rigid deformable objects/tissue is computationally demanding. The computational bottleneck in finite-element (FE) modeling of deformable objects is in solving a large but sparse linear system of equations at each time step of the simulation. Depending on the mechanical properties of the object, high-fidelity stable haptic simulations require an update rate on the order of 100-1000 Hz. Direct software-based implementations that use conventional computers are fairly limited in the size of the model that they can process at such high rates. In this thesis, a new hardware-based parallel implementation of the iterative Conjugate Gradient (CG) algorithm for solving linear systems of equations is proposed. Sparse matrix-vector multiplication (SpMxV) is the main computational kernel in iterative solution methods such as the CG algorithm. Modern microprocessors exhibit poor performance in executing memory-bound tasks such as SpMxV. In the proposed hardware architecture, a novel organization of on-chip memory resources enables concurrent utilization of a large number of fixed-point computing units on an FPGA device for performing the calculations. The result is a powerful parallel computing platform that can iteratively solve the system of equations arising from the FE models of object deformation within the timing constraints of real-time haptics applications. Numerical accuracy of the fixed-point implementation, the hardware architecture design, and issues pertaining to the degree of parallelism and scalability of the solution are discussed in detail. The proposed computing platform is successfully employed in a set of haptic interaction experiments using static and dynamic linear FE-based models.
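The abstract's computational kernel can be made concrete with a minimal software sketch (not the thesis's fixed-point FPGA design): a CSR sparse matrix-vector product inside a textbook conjugate gradient loop. Each CG iteration costs one SpMxV plus a few vector operations, which is why accelerating SpMxV dominates solver performance. The CSR layout is standard; function names are illustrative.

```python
import numpy as np

def spmxv(data, indices, indptr, x):
    """CSR sparse matrix-vector product y = A x -- the memory-bound
    kernel that the proposed hardware architecture parallelizes."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for j in range(indptr[i], indptr[i + 1]):
            y[i] += data[j] * x[indices[j]]
    return y

def conjugate_gradient(data, indices, indptr, b, tol=1e-10, max_iter=100):
    """Textbook CG for symmetric positive-definite systems; each
    iteration performs exactly one SpMxV plus dot products and AXPYs."""
    x = np.zeros_like(b)
    r = b - spmxv(data, indices, indptr, x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = spmxv(data, indices, indptr, p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Because SpMxV streams the matrix once per iteration with little reuse, its throughput is bound by memory bandwidth on a conventional CPU, which motivates the on-chip-memory organization the thesis proposes.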

Book ChapterDOI
26 Aug 2008
TL;DR: This work uses High-Level Petri Nets (HLPN) to intuitively describe the parallel implementations for distributed-memory machines and identifies parallel functions that can be implemented efficiently on the GPU.
Abstract: Modern Graphics Processing Units (GPUs) consist of several SIMD-processors and thus provide a high degree of parallelism at low cost. We introduce a new approach to systematically develop parallel image reconstruction algorithms for GPUs from their parallel equivalents for distributed-memory machines. We use High-Level Petri Nets (HLPN) to intuitively describe the parallel implementations for distributed-memory machines. By denoting the functions of the HLPN with memory requirements and information about data distribution, we are able to identify parallel functions that can be implemented efficiently on the GPU. For an important iterative medical image reconstruction algorithm --the list-mode OSEM algorithm-- we demonstrate the limitations of its distributed-memory implementation and show how our HLPN-based approach leads to a fast implementation on GPUs, reusable across different medical imaging devices.

Patent
02 Apr 2008
TL;DR: In this article, the authors propose to perform parallel I/O with the expected degree of parallelism by temporarily securing the necessary number of I/O nodes.
Abstract: PROBLEM TO BE SOLVED: To perform parallel I/O with the expected degree of parallelism, by temporarily securing the necessary number of I/O nodes, even when the job executing the parallel I/O does not own enough I/O nodes to obtain the expected degree of parallelism at the time the parallel I/O is started. SOLUTION: When the job is started, only a small number of I/O nodes are secured, rather than enough to perform parallel I/O with the expected degree of parallelism. When the parallel I/O is started, the nodes in short supply are selected, by an I/O node securing/releasing part for parallel I/O and an I/O node group changing part, from the I/O nodes of the normal-I/O group owned by another job, and are temporarily borrowed from that other running job. Information about the borrowed I/O nodes and about the job from which they are borrowed is managed jointly by an I/O node management table and a job management table. For this purpose, the I/O nodes are grouped. COPYRIGHT: (C)2011,JPO&INPIT

Proceedings ArticleDOI
10 Sep 2008
TL;DR: This work proposes a parallel backpropagation implementation on a multiprocessor system-on-chip (SoC) with a large number of independent floating-point processing units, controlled by software running on embedded processors in order to allow flexibility in the selection of the network topology to be trained.
Abstract: The backpropagation algorithm used for the training of multilayer perceptrons (MLPs) has a high degree of parallelism and is therefore well-suited for hardware implementation on an ASIC or FPGA. However, most implementations are lacking in generality of application, either by limiting the range of trainable network topologies or by resorting to fixed-point arithmetic to increase processing speed. We propose a parallel backpropagation implementation on a multiprocessor system-on-chip (SoC) with a large number of independent floating-point processing units, controlled by software running on embedded processors in order to allow flexibility in the selection of the network topology to be trained. It is shown that the speed of such a system is limited by the communication overhead between processing nodes, especially by the management of training vectors. Preliminary performance results on an Altera DE2-70 development board are given and optimal architecture parameters are selected.
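A minimal sketch of the per-pattern parallelism the paper targets, assuming a one-hidden-layer sigmoid MLP without bias terms (an illustrative simplification, not the paper's SoC design): every training vector's forward and backward pass is independent until the per-pattern gradients are summed, and that reduction is precisely the communication step between processing nodes that the authors find limiting. Here the batch dimension of the NumPy arrays stands in for the parallel floating-point units.

```python
import numpy as np

def train_batch(W1, W2, X, y, lr=0.5):
    """One backpropagation step for a 1-hidden-layer sigmoid MLP.
    Rows of X are training vectors; their forward/backward passes are
    independent, so they could run on separate processing units. The
    gradient sums (the matrix products below) are the communication
    step that serializes a distributed implementation."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sig(X @ W1)                       # forward pass, hidden layer
    out = sig(h @ W2)                     # forward pass, output layer
    # Backward pass: output deltas, then deltas backpropagated to hidden.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient accumulation sums contributions from all training vectors.
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_h
    return float(np.mean((out - y) ** 2))
```

In a hardware realization, each product such as `X.T @ d_h` becomes an all-reduce over the per-pattern outer products, which is where the training-vector management overhead the paper measures arises.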

Proceedings Article
06 Apr 2008
TL;DR: This paper investigates the use of Berger code as a means of incorporating CED into a self checking systolic FIFO stack.
Abstract: Advances in VLSI technology have made possible many changes, not only in the amount of hardware that can be integrated into a die, permitting the implementation of single-chip processors, but also in processor architecture. This creates a need for algorithms that can exploit a high degree of pipelining and parallelism. The algorithms currently best suited to incorporating a high degree of parallelism are systolic arrays. Systolic systems have balanced, uniform architectures that typically look like grids, where each line indicates a communication path and each intersection represents a cell, or systolic element. Unfortunately, as the scale of integration has increased, so has the occurrence of intermittent faults. The characteristics of these faults render them undetectable by standard test strategies. This is particularly problematic given the wide use of complex circuits in safety-critical applications. Ensuring the reliability of these systems is a major testing challenge. The detection of intermittent faults requires the use of concurrent error detection (coding) techniques. This paper investigates the use of the Berger code as a means of incorporating CED into a self-checking systolic FIFO stack.
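A Berger code appends, as its check symbol, the binary count of zeros in the information bits; a unidirectional fault (all affected bits flipping the same way) moves the zero-count and the stored check value in opposite directions, so every such fault is detected. The following is a minimal software sketch of the encode/check pair (illustrative names, not the paper's hardware):

```python
def berger_encode(info_bits):
    """Append a Berger check symbol: the binary representation of the
    number of 0s in the information bits, MSB first. Detects all
    unidirectional errors in the codeword."""
    zeros = info_bits.count(0)
    width = max(1, len(info_bits)).bit_length()  # bits to hold 0..k
    check = [(zeros >> i) & 1 for i in reversed(range(width))]
    return info_bits + check

def berger_check(word, k):
    """Recount zeros in the information part and compare with the
    stored check symbol; a mismatch flags a detected fault."""
    info, check = word[:k], word[k:]
    got = 0
    for b in check:
        got = (got << 1) | b
    return info.count(0) == got
```

In a self-checking systolic cell, the checker runs concurrently with the datapath, so intermittent faults are caught during normal operation rather than only during offline test.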