
Showing papers on "Degree of parallelism published in 2007"


Proceedings ArticleDOI
09 Jun 2007
TL;DR: A massively parallel machine called Anton is described, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems and is designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation.
Abstract: The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macromolecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, chemistry and medicine. A wide range of biologically interesting phenomena, however, occur over time scales on the order of a millisecond--about three orders of magnitude beyond the duration of the longest current MD simulations. In this paper, we describe a massively parallel machine called Anton, which should be capable of executing millisecond-scale classical MD simulations of such biomolecular systems. The machine, which is scheduled for completion by the end of 2008, is based on 512 identical MD-specific ASICs that interact in a tightly coupled manner using a specialized high-speed communication network. Anton has been designed to use both novel parallel algorithms and special-purpose logic to dramatically accelerate those calculations that dominate the time required for a typical MD simulation. The remainder of the simulation algorithm is executed by a programmable portion of each chip that achieves a substantial degree of parallelism while preserving the flexibility necessary to accommodate anticipated advances in physical models and simulation methods.

340 citations


Proceedings ArticleDOI
26 Mar 2007
TL;DR: An HEFT-based adaptive rescheduling algorithm is presented, evaluated and compared with traditional static and dynamic strategies, and results show that the proposed strategy not only outperforms the dynamic one but also improves over the traditional static one.
Abstract: Scheduling is the key to the performance of grid workflow applications. Various strategies have been proposed, including static scheduling strategies, which map jobs to resources before execution time, and dynamic alternatives, which schedule each job only when it is ready to execute. While sizable work supports the claim that static scheduling performs better for workflow applications than dynamic scheduling, it is questionable how effectively a static schedule works in a grid environment that changes constantly. This paper proposes a novel adaptive rescheduling concept, which allows the workflow planner to work collaboratively with the run-time executor and to reschedule proactively when the grid environment changes significantly. An HEFT-based adaptive rescheduling algorithm is presented, evaluated and compared with traditional static and dynamic strategies. The experimental results show that the proposed strategy not only outperforms the dynamic one but also improves over the traditional static one. Furthermore, we observed that it performs more efficiently for data-intensive applications with a higher degree of parallelism.
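The abstract does not reproduce the algorithm itself; purely as background on the HEFT list-scheduling core it builds on (upward-rank computation followed by earliest-finish-time processor selection), a minimal sketch follows. The task graph, costs, and processors are invented for illustration, and the insertion-based slot search of full HEFT is omitted.

```python
# Minimal HEFT-style list scheduler (illustrative sketch, not the paper's
# adaptive rescheduling algorithm). All inputs below are made-up examples.
def heft_schedule(tasks, succ, wcost, ccost, procs):
    # succ[t]: successors of t; wcost[t]: average compute cost of t;
    # ccost[(t, s)]: average communication cost of edge t -> s.
    rank = {}

    def upward_rank(t):
        # rank_u(t) = w(t) + max over successors (c(t, s) + rank_u(s))
        if t not in rank:
            rank[t] = wcost[t] + max(
                (ccost[(t, s)] + upward_rank(s) for s in succ[t]), default=0.0)
        return rank[t]

    for t in tasks:
        upward_rank(t)

    free = {p: 0.0 for p in procs}      # time at which each processor becomes free
    finish, placement = {}, {}
    for t in sorted(tasks, key=lambda t: -rank[t]):     # decreasing upward rank
        preds = [u for u in tasks if t in succ[u]]
        best = None
        for p in procs:
            ready = max((finish[u] + (ccost[(u, t)] if placement[u] != p else 0.0)
                         for u in preds), default=0.0)
            eft = max(ready, free[p]) + wcost[t]        # earliest finish time on p
            if best is None or eft < best[0]:
                best = (eft, p)
        finish[t], placement[t] = best
        free[best[1]] = best[0]
    return placement, finish

# Tiny made-up workflow: t1 fans out to t2 and t3, scheduled on two processors.
tasks = ["t1", "t2", "t3"]
succ = {"t1": ["t2", "t3"], "t2": [], "t3": []}
wcost = {"t1": 2.0, "t2": 3.0, "t3": 1.0}
ccost = {("t1", "t2"): 1.0, ("t1", "t3"): 4.0}
print(heft_schedule(tasks, succ, wcost, ccost, procs=["p0", "p1"]))
```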

116 citations


Proceedings ArticleDOI
17 Sep 2007
TL;DR: It is shown that ANNs are effective for identifying energy-efficient concurrency levels in multithreaded scientific applications, demonstrated through physical experimentation on a state-of-the-art quad-core Xeon platform.
Abstract: Multicore microprocessors have been largely motivated by the diminishing returns in performance and the increased power consumption of single-threaded ILP microprocessors. With the industry already shifting from multicore to many-core microprocessors, software developers must extract more thread-level parallelism from applications. Unfortunately, low power-efficiency and diminishing returns in performance remain major obstacles with many cores. Poor interaction between software and hardware, and bottlenecks in shared hardware structures often prevent scaling to many cores, even in applications where a high degree of parallelism is potentially available. In some cases, throwing additional cores at a problem may actually harm performance and increase power consumption. Better use of otherwise limitedly beneficial cores by software components such as hypervisors and operating systems can improve system-wide performance and reliability, even in cases where power consumption is not a main concern. In response to these observations, we evaluate an approach to throttle concurrency in parallel programs dynamically. We throttle concurrency to levels with higher predicted efficiency from both performance and energy standpoints, and we do so via machine learning, specifically artificial neural networks (ANNs). One advantage of using ANNs over similar techniques previously explored is that the training phase is greatly simplified, thereby reducing the burden on the end user. Using machine learning in the context of concurrency throttling is novel. We show that ANNs are effective for identifying energy-efficient concurrency levels in multithreaded scientific applications, and we do so using physical experimentation on a state-of-the-art quad-core Xeon platform.
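As a rough illustration of the idea (not the authors' system), the sketch below trains a small neural network to predict efficiency from a candidate thread count plus hardware-counter features, then picks the concurrency level with the best prediction. The feature names, sample data, and use of scikit-learn's MLPRegressor are assumptions made only for this example.

```python
# Illustrative sketch of ANN-based concurrency throttling (assumed setup).
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_efficiency_model(samples):
    # samples: list of (features, measured_efficiency); features include the
    # thread count plus hypothetical hardware-counter rates gathered offline.
    X = np.array([f for f, _ in samples])
    y = np.array([e for _, e in samples])
    model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
    model.fit(X, y)
    return model

def choose_concurrency(model, counter_rates, max_threads):
    # Evaluate every candidate thread count and keep the most efficient prediction.
    candidates = np.array([[t, *counter_rates] for t in range(1, max_threads + 1)])
    predicted = model.predict(candidates)
    return int(candidates[int(np.argmax(predicted))][0])

# Hypothetical offline samples: (threads, cache_miss_rate, bus_util) -> efficiency.
samples = [([t, 0.1, 0.3], min(t, 4) / t * (1.0 - 0.05 * t)) for t in range(1, 9)]
model = train_efficiency_model(samples)
print(choose_concurrency(model, counter_rates=[0.1, 0.3], max_threads=8))
```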

37 citations


Journal ArticleDOI
TL;DR: The concept of functional-level power analysis (FLPA) for power estimation of programmable processors is extended in order to model embedded as well as heterogeneous processor architectures featuring different embedded processor cores.

33 citations


Proceedings ArticleDOI
22 Oct 2007
TL;DR: An algorithm to detect "direct" semantic interference between parallel changes is designed and implemented, and it is shown to be effective in predicting faults in changes made within a short time period.
Abstract: Parallel developments are becoming increasingly prevalent in the building and evolution of large-scale software systems. Our previous studies of a large industrial project showed that there was a linear correlation between the degree of parallelism and the likelihood of defects in the changes. To further study the relationship between parallel changes and faults, we have designed and implemented an algorithm to detect "direct" semantic interference between parallel changes. To evaluate the analyzer's effectiveness in fault prediction, we designed an experiment in the context of an industrial project. We first mine the change and version management repositories to find sample version sets of different degrees of parallelism. We investigate the interference between the versions with our analyzer. We then mine the change and version repositories to find out what faults were discovered subsequent to the analyzed interfering versions. We use the match rate between semantic interference and faults to evaluate the effectiveness of the analyzer in predicting faults. Our contributions in this evaluative empirical study are twofold. First, we evaluate the semantic interference analyzer and show that it is effective in predicting faults (based on "direct" semantic interference detection) in changes made within a short time period. Second, the design of our experiment is itself a significant contribution and exemplifies how to mine software repositories rather than use artificial cases for rigorous experimental evaluations.

21 citations


Patent
30 Aug 2007
TL;DR: In this article, the maximum supported degree of parallel sort operations in a multi-processor computing environment is determined by an allocation module that allocates a minimum number of sort files to each data source that participates in the parallel sort.
Abstract: An apparatus, system, and method for determining the maximum supported degree of parallel sort operations in a multi-processor computing environment. An allocation module allocates a minimum number of sort files to a sort operation for each data source that participates in the parallel sort. The allocation module attempts to allocate sort files of one-half the sort operation data source file size, and iteratively reduces the sort file size requests in response to determinations that sort files of the requested size are not available. After allocation, a parallel operation module determines whether there is sufficient virtual storage to execute the sort operations in parallel. If there is not, the parallel operations module collapses the two smallest sort operations, thereby reducing the degree of parallelism by one, and repeats the request. The parallel operation module repeats the process until the sorts are executed or the process fails for lack of virtual storage.
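A toy re-creation of the allocation/collapse loop described above might look like the following. The patent does not specify how requests shrink or how storage is checked, so the halving step and the two stubbed predicates are assumptions made only for illustration.

```python
# Illustrative sketch of the described parallel-sort sizing strategy.
# `available(size)` and `fits_in_virtual_storage(ops)` are hypothetical stubs.
def plan_parallel_sorts(source_sizes, available, fits_in_virtual_storage):
    # Allocate sort files for each data source, starting at half the source size
    # and shrinking the request (here by halving) until that size is available.
    ops = []
    for size in source_sizes:
        request = size // 2
        while request > 0 and not available(request):
            request //= 2                     # iteratively reduce the request
        ops.append([max(request, 1)])         # one sort operation per data source

    # Collapse the two smallest sort operations until the set fits in
    # virtual storage, reducing the degree of parallelism by one each time.
    while len(ops) > 1 and not fits_in_virtual_storage(ops):
        ops.sort(key=sum)
        smallest = ops.pop(0)
        ops[0] = ops[0] + smallest
    return ops   # len(ops) is the resulting degree of parallelism

print(plan_parallel_sorts([100, 60, 40, 20],
                          available=lambda size: size <= 25,
                          fits_in_virtual_storage=lambda ops: len(ops) <= 3))
# -> [[15, 10], [20], [25]]
```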

20 citations


Patent
Raul E. Silvera1, Priya Unnikrishnan1
18 Sep 2007
TL;DR: In this paper, the authors propose a mechanism for folding all the data dependencies in a loop into a single, conservative dependence, which leads to one pair of synchronization primitives per loop.
Abstract: A mechanism for folding all the data dependencies in a loop into a single, conservative dependence. This mechanism leads to one pair of synchronization primitives per loop. This mechanism does not require complicated, multi-stage compile time analysis. This mechanism considers only the data dependence information in the loop. The low synchronization cost balances the loss in parallelism due to the reduced overlap between iterations. Additionally, a novel scheme is presented to implement required synchronization to enforce data dependences in a DOACROSS loop. The synchronization is based on an iteration vector, which identifies a spatial position in the iteration space of the loop. Multiple iterations executing in parallel have their own iteration vector for synchronization where they update their position in the iteration space. As no sequential updates to the synchronization variable exist, this method exploits a greater degree of parallelism.
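The patent describes the mechanism abstractly; below is a minimal sketch of a DOACROSS-style loop with one wait/post pair per iteration, where all dependences have been folded into a single conservative dependence of distance 1. The loop body, the dependence distance, and the use of Python threads are invented purely for illustration.

```python
# Sketch of DOACROSS execution with one wait/post pair per iteration.
import threading

N = 8
DIST = 1                       # conservative dependence distance (assumed)
posted = [threading.Event() for _ in range(N)]
a = [0] * (N + 1)

def body(i):
    if i - DIST >= 0:
        posted[i - DIST].wait()        # wait: predecessor reached its post point
    a[i + 1] = a[i] + 1                # region protected by the folded dependence
    posted[i].set()                    # post: let iteration i + DIST proceed
    # ... remaining, independent work of iteration i could overlap here ...

threads = [threading.Thread(target=body, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a)   # [0, 1, 2, ..., 8]
```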

16 citations


Journal ArticleDOI
TL;DR: Preliminary experimental results show a significant benefit in terms of both speedup and power consumption, making the DDM-CMP architecture an attractive architecture for future processors.
Abstract: Although the dataflow model of execution, with its obvious benefits, was proposed a long time ago, it has not yet been successfully exploited. Nevertheless, as traditional systems have recently started to reach their limits in delivering higher performance, new models of execution that use dataflow-like concepts are being studied. Among these, Data-Driven Multithreading (DDM) is a multithreading model that effectively hides the communication delay and synchronisation overheads. In DDM, threads are scheduled as soon as their input data has been produced, that is, in a dataflow-like way. In addition to presenting a motivation for the dataflow model of execution, this paper also presents an overview of the DDM project. In particular, it focuses on the Chip Multiprocessor (CMP) implementation using the DDM model, its hardware, run-time system and performance evaluation. The DDM-CMP inherits the benefits of both the DDM model, which helps overcome the memory wall limitation, and the CMP, which offers a simpler design, a higher degree of parallelism and greater power-performance efficiency, therefore overcoming the power wall. Preliminary experimental results show a significant benefit in terms of both speedup and power consumption, making the DDM-CMP architecture an attractive architecture for future processors.
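DDM itself is a hardware/runtime design; purely to illustrate the dataflow firing rule it relies on (a thread becomes ready once all of its inputs have been produced), a small software sketch follows. The thread graph and counters are invented for this example.

```python
# Toy illustration of data-driven scheduling: a thread is dispatched as soon as
# its input count drops to zero, i.e., all of its producers have completed.
from collections import deque

def data_driven_run(threads, consumers, indegree, run):
    # threads: thread ids; consumers[t]: threads fed by t;
    # indegree[t]: inputs t is still waiting for; run(t): executes thread t.
    ready = deque(t for t in threads if indegree[t] == 0)
    while ready:
        t = ready.popleft()
        run(t)
        for c in consumers[t]:
            indegree[c] -= 1
            if indegree[c] == 0:       # all input data produced: schedule c
                ready.append(c)

# Example: t0 feeds t1 and t2, and both feed t3.
consumers = {"t0": ["t1", "t2"], "t1": ["t3"], "t2": ["t3"], "t3": []}
indegree = {"t0": 0, "t1": 1, "t2": 1, "t3": 2}
data_driven_run(list(consumers), consumers, indegree, print)
```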

14 citations


Journal ArticleDOI
TL;DR: The labelled dependency graph associated with a P system is defined, and this new concept is used for proving some results concerning the maximum number of applications of rules in a single step throughout the computation of a P system.

13 citations


Proceedings ArticleDOI
11 Mar 2007
TL;DR: Various state-of-the-art implementations of synchronization primitives are surveyed, in order to assess their impact on performance and on energy consumption and show that some commonly accepted intuitions in the multiprocessor domain do not hold in the context of MPSoCs.
Abstract: Applications running on Multiprocessor Systems-on-Chips (MP-SoCs) exhibit complex interaction patterns, resulting in significant amounts of time spent synchronizing for mutually exclusive access to shared resources. Such an overhead is expected to increase with the degree of parallelism and with the mutual correlation of concurrent tasks, thus becoming a severe obstacle to the full exploitation of a system's potential. Although the topic has been extensively studied in the literature, in MPSoC architectures, which exhibit different tradeoffs with respect to traditional multiprocessors, the available results may not be valid or may hold only partially. Furthermore, the strict energy budget of MPSoCs also requires the evaluation of the energy efficiency of such synchronization primitives. In this work we survey various state-of-the-art implementations of synchronization primitives, in order to assess their impact on performance and on energy consumption. The results of our analysis show that some commonly accepted intuitions in the multiprocessor domain do not hold in the context of MPSoCs.

10 citations


Book ChapterDOI
28 Aug 2007
TL;DR: A new scheduling algorithm for task graphs arising from parallel multifrontal methods for sparse linear systems is presented, based on the theorem proved by Prasanna and Musicus for tree-shaped task graphs, when all tasks exhibit the same degree of parallelism.
Abstract: We present a new scheduling algorithm for task graphs arising from parallel multifrontal methods for sparse linear systems. This algorithm is based on the theorem proved by Prasanna and Musicus [1] for tree-shaped task graphs, when all tasks exhibit the same degree of parallelism. We propose extended versions of this algorithm that take communication between tasks and memory balancing into account. The efficiency of the proposed approach is assessed by a set of experiments on large sparse matrices from several libraries.
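The abstract only references the underlying theorem; as a rough illustration under the malleable-task model usually associated with it (a task given p processors speeds up as p^alpha), the sketch below allocates processors to independent subtrees so that they finish at the same time, which works out to shares proportional to W^(1/alpha). Both the model and the constant alpha are assumptions for this example, not a statement of the paper's algorithm.

```python
# Illustrative allocation of p processors among independent subtrees, assuming
# each subtree of total work W runs in time W / p**alpha (malleable-task model).
def allocate(works, p, alpha):
    # Equal finish times  =>  W_i / p_i**alpha constant  =>  p_i proportional
    # to W_i ** (1 / alpha).
    shares = [w ** (1.0 / alpha) for w in works]
    total = sum(shares)
    return [p * s / total for s in shares]

print(allocate([8.0, 1.0], p=9, alpha=1.0))   # [8.0, 1.0]: linear-speedup case
print(allocate([8.0, 1.0], p=9, alpha=0.5))   # sublinear speedup shifts the split
```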

Patent
25 May 2007
TL;DR: Trace information and corresponding performance information are obtained from memory, and the task transition state and performance information derived from the trace are displayed superimposed on a task transition chart, together with a degree of parallelism calculated from the trace and displayed in temporal synchronization with the chart.
Abstract: The aim is to enable analysis of the relationship between task transitions and performance information, such as cache misses, in a multiprocessor system, and to clearly identify the relationship between the degree of parallelism and the task transitions of the system's processing. Trace information and corresponding performance information are obtained from memory, and the task transition state and performance information based on the trace information are displayed superimposed on a transition chart. A degree of parallelism corresponding to the operation state of the plurality of processors is calculated on the basis of the trace information, and the degree of parallelism is displayed temporally synchronized with the task transition chart.
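A simple way to compute a time-varying degree of parallelism from task trace records, in the spirit of what the patent describes, is sketched below. The trace format (start/end timestamps per task) is a made-up example, not the patent's format.

```python
# Sketch: derive the degree of parallelism over time from trace records.
def parallelism_profile(intervals):
    events = []
    for start, end in intervals:
        events.append((start, +1))     # a task (and its processor) becomes active
        events.append((end, -1))       # the task finishes
    events.sort()
    profile, active = [], 0
    for time, delta in events:
        active += delta
        profile.append((time, active)) # degree of parallelism from this instant on
    return profile

trace = [(0, 4), (1, 3), (2, 6), (5, 7)]
print(parallelism_profile(trace))
# [(0, 1), (1, 2), (2, 3), (3, 2), (4, 1), (5, 2), (6, 1), (7, 0)]
```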

01 Jan 2007
TL;DR: It is demonstrated that the traditional retiming technique does not always achieve optimal schedules (although it can be used in combination with other techniques to do so), and a new graph-transformation technique, Extended Retiming, is proposed which will.
Abstract: Many iterative or recursive applications commonly found in DSP and image processing applications can be represented by data-flow graphs (DFGs). A great deal of research has been done attempting to optimize such applications by applying various graph transformation techniques to the DFG in order to minimize the schedule length. One of the most effective of these techniques is retiming. In this paper, we demonstrate that the traditional retiming technique does not always achieve optimal schedules (although it can be used in combination with other techniques to do so) and propose a new graph-transformation technique, extended retiming, which will.
Index terms: Scheduling, Data-flow Graphs, Retiming, Graph Transformation, Timing Optimization
1 Introduction
Many iterative or recursive applications, such as image processing, DSP and PDE simulations, can be represented by data-flow graphs, or DFGs [4]. The nodes of a DFG represent tasks, while edges between nodes represent data dependencies among the tasks, either within iterations (an execution of all tasks) or between iterations. To model repeated steps within an algorithm, a DFG may contain loops. To meet the desired throughput, it becomes necessary to use multiple processors or multiple functional units. Due to the expense of such units, it is important for us to minimize the number of processors we involve during execution, while maximizing the use of those processors that we do include. The process of assigning a starting time and processor to each event in the DFG, known as scheduling, becomes a vital step in this process. There are two common approaches for system-level synthesis and scheduling of parallel systems:
1. We can explicitly schedule the DFG as-is.
2. We can first apply a graph transformation technique to the DFG in order to maximize the degree of parallelism, then schedule the acyclic (or DAG) part of the resulting graph.
There are many methods for doing scheduling [2,6,8]; hence the focus of our study will be the optimization of the DFG via graph transformation. We will later show that the second of these two methods is preferable to the first because the schedule it produces requires fewer resources. The execution of all tasks of a DFG is called an iteration, with the length of time it takes to complete an iteration called the schedule length of the DFG. While there are many graph transformation techniques available to us, it is possible to find graphs for which the current techniques will not produce a transformed DFG having minimum schedule length. We will demonstrate that in this paper, as well as propose a new transformation technique which does deliver optimal results. When compared with the traditional methods, our new technique quickly and easily produces a transformed graph without increasing the size of the DFG. A great deal of research has been done attempting to optimize the schedule of tasks for an application after applying various graph transformation techniques to the application's DFG. One of the more effective of these techniques is retiming [1,7], where delays are redistributed among the edges so that the application's function remains the same, but the length of the longest zero-delay path, called the clock period of the DFG G and denoted cl
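For readers unfamiliar with retiming, a compact sketch of the classic operation follows: delays on each edge u -> v are recomputed as d'(e) = d(e) + r(v) - r(u), and the clock period is the length of the longest zero-delay path. The example graph and retiming values are invented, and the paper's extended retiming is not reproduced here.

```python
# Sketch of classic retiming on a small data-flow graph (not the paper's
# extended retiming). Edges are (u, v, delay); node weights are execution times.
def retime(edges, r):
    # d'(u -> v) = d(u -> v) + r(v) - r(u); a legal retiming keeps every delay >= 0.
    return [(u, v, d + r[v] - r[u]) for u, v, d in edges]

def clock_period(edges, weight):
    # Longest path using only zero-delay edges (assumes that subgraph is acyclic).
    zero = {}
    for u, v, d in edges:
        if d == 0:
            zero.setdefault(u, []).append(v)
    memo = {}
    def longest(v):
        if v not in memo:
            memo[v] = weight[v] + max((longest(s) for s in zero.get(v, [])), default=0)
        return memo[v]
    return max(longest(v) for v in weight)

weight = {"A": 1, "B": 2, "C": 3}
edges = [("A", "B", 1), ("B", "C", 0), ("C", "A", 1)]
print(clock_period(edges, weight))                 # 5: zero-delay path B -> C
r = {"A": 0, "B": -1, "C": 0}                      # moves the delay from A->B onto B->C
print(clock_period(retime(edges, r), weight))      # 3: only A -> B is now zero-delay
```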

Proceedings Article
24 Aug 2007
TL;DR: This paper deals with the implementation of the Fast Fourier Transform on a novel graphics architecture offered recently by NVIDIA, and takes into consideration memory reference locality issues, which are crucial when pursuing a high degree of parallelism.
Abstract: The growing computational power of modern graphics processing units is making them very suitable for general purpose computing. These commodity processors operate generally as parallel SIMD platforms and, among other factors, the effectiveness of the codes is subject to the right exploitation of the underlying memory hierarchy. This paper deals with the implementation of the Fast Fourier Transform on a novel graphics architecture recently offered by NVIDIA. Such an implementation takes into consideration memory reference locality issues, which are crucial when pursuing a high degree of parallelism, that is, a good occupancy of the processing elements. The proposed implementation has been tested and compared to the manufacturer's own implementation.
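Purely to illustrate the access-pattern issue the abstract alludes to, a plain iterative radix-2 FFT is sketched below: the butterfly stride doubles at each stage, and it is exactly this changing memory-reference pattern that must be mapped carefully onto a GPU memory hierarchy. The code is a generic textbook FFT in Python, not the paper's GPU implementation.

```python
# Generic iterative radix-2 Cooley-Tukey FFT (length must be a power of two).
import cmath

def fft_iterative(x):
    n = len(x)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: the distance between paired elements doubles each stage,
    # which is the locality pattern a GPU implementation has to manage.
    length = 2
    while length <= n:
        wlen = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1
            for k in range(length // 2):
                u = x[start + k]
                v = x[start + k + length // 2] * w
                x[start + k] = u + v
                x[start + k + length // 2] = u - v
                w *= wlen
        length <<= 1
    return x

print(fft_iterative([1, 2, 3, 4]))   # DFT of [1, 2, 3, 4]: [10, -2+2j, -2, -2-2j]
```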

Proceedings ArticleDOI
07 Jul 2007
TL;DR: In this paper, a parallel model of a multi-objective genetic algorithm assuming a grid environment is discussed, in which crossover is performed between individuals that are close to each other in the objective space, and two individuals forming a crossover pair are transmitted to each slave process.
Abstract: In this paper, a parallel model of a multi-objective genetic algorithm assuming a grid environment is discussed. In this proposed parallel model, we extend the master-slave model, which has a high degree of parallelism, and two individuals forming a crossover pair are transmitted to each slave process. The number of offspring generated by crossover is then changed dynamically, adapting to the performance of each calculation resource. This mechanism is effective for heterogeneous computational resources. In addition, total communication cost can be reduced by increasing the processing load of the slave processes, and a reduction of the overhead time is expected. Moreover, we incorporate neighborhood crossover, in which crossover is performed between individuals that are close to each other in the objective space; therefore, two individuals which are close to each other are sent to each slave process. This neighborhood crossover improves the search ability. Computational experiments on heterogeneous computational resources indicated that the proposed model was able to utilize the maximum performance of all calculation resources and reduce the overhead time.
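To make the dispatching idea concrete, here is a toy, sequential simulation of the described master-slave scheme: the master pairs each individual with its nearest neighbour in objective space and asks each (hypothetical) slave for a number of offspring proportional to its resource speed. Objective values, the crossover operator, and resource speeds are all invented for this sketch.

```python
# Toy simulation of the described master-slave scheme (not the authors' code).
import random

def nearest_pairs(population, objectives):
    # Pair each individual with its nearest neighbour in objective space
    # (neighborhood crossover); distance is Euclidean over objective vectors.
    pairs = []
    for i, fi in enumerate(objectives):
        j = min((j for j in range(len(population)) if j != i),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(fi, objectives[j])))
        pairs.append((population[i], population[j]))
    return pairs

def crossover(p, q):
    # Hypothetical uniform crossover over real-coded genes.
    return [a if random.random() < 0.5 else b for a, b in zip(p, q)]

def master(population, objectives, slave_speeds, offspring_per_unit=2):
    offspring = []
    pairs = nearest_pairs(population, objectives)
    for (p, q), speed in zip(pairs, slave_speeds):
        # Faster resources are asked for more offspring per crossover pair.
        for _ in range(max(1, round(speed * offspring_per_unit))):
            offspring.append(crossover(p, q))
    return offspring

pop = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
objs = pop                                    # pretend genotype == objective vector
print(len(master(pop, objs, slave_speeds=[1.0, 0.5, 2.0, 1.0])))
```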

Patent
Ba-Zhong Shen1, Tak K. Lee1
07 Jun 2007
TL;DR: Reduced-complexity ARP interleaves providing flexible granularity and parallelism adaptable to any possible turbo code block size are presented in this article, where a novel means is presented by which any desired turbo code block size can be employed while requiring, in only some instances, a very small number of dummy bits.
Abstract: Reduced-complexity ARP (almost regular permutation) interleaves providing flexible granularity and parallelism adaptable to any possible turbo code block size. A novel means is presented by which any desired turbo code block size can be employed while requiring, in only some instances, a very small number of dummy bits. This approach is also directly adaptable to parallel turbo decoding, in which any desired degree of parallelism can be employed. Alternatively, as few as one turbo decoder can be employed in a fully non-parallel implementation as well. Also, this approach allows for storage of a reduced number of parameters to accommodate a wide variety of interleaves.
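The patent does not spell out the permutation; as background only, a sketch of a generic ARP-style interleaver is given below, assuming the commonly cited form pi(j) = (P*j + d(j mod C)) mod N with a small dither vector d of period C. The parameters here are invented, and no claim is made that they provide the contention-free parallel access or reduced parameter storage of the patented scheme.

```python
# Generic ARP-style interleaver sketch (assumed textbook form, not the patent's).
# pi(j) = (P * j + d[j % C]) % N, with gcd(P, N) == 1 and dither period C.
from math import gcd

def arp_interleaver(N, P, dither):
    assert gcd(P, N) == 1, "P must be coprime with the block size N"
    C = len(dither)
    return [(P * j + dither[j % C]) % N for j in range(N)]

pi = arp_interleaver(N=16, P=5, dither=[0, 4, 8, 12])   # made-up parameters
print(pi)
print(sorted(pi) == list(range(16)))   # check that it really is a permutation
```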

19 Jun 2007
TL;DR: A new algorithm for applying evolution rules to a multiset of objects, intended for P system implementations on digital devices, which reaches a certain degree of parallelism because a rule can be applied several times in a single step.
Abstract: This paper presents a new algorithm for applying evolution rules to a multiset of objects, for use in P system implementations on digital devices. In each step of this algorithm, two main actions are carried out, eliminating at least one evolution rule from the set of active rules. Therefore, the number of operations executed is bounded, and its worst-case execution time can be known a priori. This is very important, as it allows determining the number of membranes to be located on each processor in distributed implementation architectures of P systems in order to obtain optimal times with minimal resources. Although the algorithm is sequential, it reaches a certain degree of parallelism because a rule can be applied several times in a single step. In addition, experimental tests have shown that the algorithm's execution time is better than those previously published.
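The abstract gives only the general shape of the algorithm; the sketch below illustrates the core idea of applying an active rule as many times as the multiset allows in one step and removing it from the active set. The multiset and rule encodings are invented and the sketch is not the paper's algorithm.

```python
# Sketch of applying evolution rules to a multiset, removing at least one rule
# from the active set per step (illustrative encoding only).
from collections import Counter

def max_applications(rule_lhs, multiset):
    # A rule can be applied k times, where k is limited by the scarcest object.
    return min((multiset[o] // n for o, n in rule_lhs.items()), default=0)

def apply_rules(multiset, rules):
    # rules: list of (lhs, rhs) Counters; returns the multiset after one step.
    produced = Counter()
    active = list(rules)
    while active:
        lhs, rhs = active.pop()                  # a rule leaves the active set
        k = max_applications(lhs, multiset)
        if k:                                    # apply the rule k times in this step
            for o, n in lhs.items():
                multiset[o] -= n * k
            for o, n in rhs.items():
                produced[o] += n * k
    multiset.update(produced)                    # products become visible next step
    return multiset

ms = Counter({"a": 5, "b": 3})
rules = [(Counter({"a": 2}), Counter({"c": 1})),
         (Counter({"a": 1, "b": 1}), Counter({"d": 1}))]
print(apply_rules(ms, rules))
```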

Book ChapterDOI
11 Dec 2007
TL;DR: This paper presents a technique, called ParaSolve, that exploits the sparsity structure of the web graph matrix to improve on the degree of parallelism in a number of distributed approaches for computing PageRank.
Abstract: This paper presents a technique we call ParaSolve that exploits the sparsity structure of the web graph matrix to improve on the degree of parallelism in a number of distributed approaches for computing PageRank. Specifically, a typical algorithm (such as power iteration or GMRES) for solving the linear system corresponding to PageRank, call it LinearSolve, may be converted to a distributed algorithm, Distrib(LinearSolve), by partitioning the problem and applying a standard technique (i.e., Distrib). By reducing the number of inter-partition multiplications, we may greatly increase the degree of parallelism, while achieving a similar degree of accuracy. This should lead to increasingly better performance as we utilize more processors. For example, using GeoSolve (a variant of Jacobi iteration) as our linear solver and the 2001 web graph from Stanford's WebBase project, on 12 processors ParaSolve(GeoSolve) outperforms Distrib(GeoSolve) by a factor of 1.4, while on 32 processors the performance ratio improves to 2.8.
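GeoSolve and ParaSolve themselves are not reproduced here; the sketch below only shows the baseline being distributed: a Jacobi/power-style PageRank iteration over a row-partitioned link structure, where cross-partition updates are the inter-partition work a technique like ParaSolve aims to reduce. The tiny graph, damping factor, and partitioning are illustrative assumptions.

```python
# Baseline sketch: power-style PageRank on a row-partitioned link structure.
import numpy as np

def pagerank_partitioned(links, parts, damping=0.85, iters=50):
    n = len(links)
    out_deg = np.array([max(len(l), 1) for l in links], dtype=float)
    x = np.full(n, 1.0 / n)
    for _ in range(iters):
        contrib = damping * x / out_deg
        new = np.full(n, (1.0 - damping) / n)
        for part in parts:                       # each partition owns a set of pages
            for i in part:
                for j in links[i]:
                    # j may live in another partition: an inter-partition update,
                    # i.e., the communication that ParaSolve tries to minimize.
                    new[j] += contrib[i]
        x = new
    return x

links = [[1, 2], [2], [0], [2]]                  # tiny made-up web graph
print(pagerank_partitioned(links, parts=[[0, 1], [2, 3]]))
```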

Patent
27 Dec 2007
TL;DR: In this article, a method, computer program product and apparatus for utilizing simulated locking prior to starting concurrent execution are described, and the results of this simulated locking are used to define a canonical ordering which controls the order of execution and the degree of parallelism that can be used.
Abstract: A method, computer program product and apparatus for utilizing simulated locking prior to starting concurrent execution are disclosed. The results of this simulated locking are used to define a canonical ordering which controls the order of execution and the degree of parallelism that can be used.
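The claim text is very general; the sketch below shows one way such a scheme could look: lock acquisition is simulated for each task up front, tasks are visited in a canonical order (here, by task id), non-conflicting tasks share a parallel wave, and a conflict forces a task into a later wave. Everything beyond the claim wording is an assumption for illustration.

```python
# Illustrative sketch of "simulated locking" before concurrent execution.
def plan_waves(task_locks):
    waves = []                                    # each entry: [task ids, union of their locks]
    for task in sorted(task_locks):               # canonical ordering by task id
        locks = set(task_locks[task])
        if waves and not (locks & waves[-1][1]):  # no simulated lock conflict
            waves[-1][0].append(task)
            waves[-1][1].update(locks)
        else:                                     # conflict: must run after the current wave
            waves.append([[task], locks])
    return [tasks for tasks, _ in waves]

plan = plan_waves({"t1": {"A"}, "t2": {"A", "B"}, "t3": {"C"}, "t4": {"B"}})
print(plan)   # [['t1'], ['t2', 't3'], ['t4']]; wave width = usable degree of parallelism
```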

Proceedings ArticleDOI
17 Sep 2007
TL;DR: Numerical results are presented which confirm the effectiveness of Gauss-Seidel parallelized with the iteration space alternate tiling technique, specifically compared with owner-computes and red-black coloring based Gauss-Seidel methods, and show that the new method has good parallel performance on distributed memory machines, as well as scalability.
Abstract: Many important scientific kernels compute solutions using finite difference techniques, and the most time consuming part of them is the iterative method, such as Gauss-Seidel or SOR. To improve performance, iterative method can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. This paper describes a new parallel Gauss-Seidel method using iteration space alternate tiling technique, which is developed not only to improve parallelism, intra-iteration, and inter-iteration data locality, but also to decrease communication and synchronization cost in iterative method. Time-skewing can increase cache locality by exploiting locality in the time direction as well as spatial locality. The degree of parallelism can be improved by reordering the execution of cache blocks. Finally numerical results are presented which confirm the effectiveness of Gauss-Seidel parallelized with iteration space alternate tiling technique, specifically compared with owner-computes and red-black coloring based Gauss-Seidel methods, and show that the new method has a good parallel performance on distributed memory machines, as well as scalability.
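As a point of reference, a plain (untiled, sequential) Gauss-Seidel sweep over a 2D five-point stencil is sketched below; the paper's alternate-tiling and time-skewing optimizations reorder exactly these updates into cache- and communication-friendly blocks. The grid size, boundary values, and stencil are representative assumptions, not necessarily the paper's test problem.

```python
# Reference sketch: plain Gauss-Seidel sweeps on a 2D grid (5-point stencil).
import numpy as np

def gauss_seidel(u, f, h, sweeps):
    n, m = u.shape
    for _ in range(sweeps):                      # each sweep = one convergence iteration
        for i in range(1, n - 1):
            for j in range(1, m - 1):
                # In-place update: freshly computed neighbours are reused at once,
                # creating the loop-carried dependences that tiling must respect.
                u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j] +
                                  u[i, j - 1] + u[i, j + 1] - h * h * f[i, j])
    return u

u = np.zeros((8, 8))
u[0, :] = 1.0                                    # arbitrary Dirichlet boundary
f = np.zeros((8, 8))
print(gauss_seidel(u, f, h=1.0, sweeps=100)[4, 4].round(4))
```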

Proceedings ArticleDOI
01 Sep 2007
TL;DR: In this paper, a systematic empirical study of the tradeoffs between degree of parallelism, threshold voltage and power consumption under constant throughput conditions on commercially available FPGAs is presented.
Abstract: As field programmable gate array (FPGA) based systems scale up in complexity, energy-aware design paradigms with strict power budgets require the designer to explore all viable options for minimising dynamic power consumption. The concepts of parallelism and pipelining have long been exploited in CMOS chips to reduce power and energy consumption. In this paper, a systematic empirical study of the tradeoffs between degree of parallelism, threshold voltage and power consumption under constant throughput conditions on commercially available FPGAs is presented. Results indicate that there is excellent scope for reduction in dynamic voltage by suitably applying these tradeoffs in FPGA-based designs in order to achieve energy-efficient implementations.

Proceedings ArticleDOI
16 Oct 2007
TL;DR: In this paper, a compiler infrastructure for the architecture is introduced in detail with discussion of how to support OpenMP APIs and how to integrate the Omni OpenMP compiler with the backend code generator.
Abstract: Embedded applications intrinsically have a high degree of parallelism, but it is difficult to exploit this parallelism due to the resource constraints of embedded platforms. In order to overcome this problem, we introduce a promising processor solution that supports parallel thread execution with good performance while consuming few hardware resources. We call this processor the Multithread Lockstep Execution Processor (MLEP). Since each iteration of a parallel loop performs the same sequence of instructions most of the time while manipulating different data, we only need to partially duplicate pipeline resources to support multithreading. This architecture makes it possible for parallel threads to execute synchronously in a lockstep manner. However, because it provides a totally different kind of thread execution, it sometimes confuses programmers parallelizing code for the processor. In this paper, we introduce a compiler infrastructure for our architecture in detail, with a discussion of how to support the OpenMP APIs and how to integrate the Omni OpenMP compiler with our backend code generator. Also, to verify our compiler system, we show that our code generation scheme delivers the same performance as hand-written code.

Book ChapterDOI
06 Jun 2007
TL;DR: This work suggests that for the more realistic network setting, the heuristics Random and Rarest Block First both allow n clients to download b blocks in b + O(log n) phases, and suggests that the latter bound better reflects the high degree of parallelism of BT observed in reality.
Abstract: BitTorrent (BT) in practice is a very efficient method to share data over a network of clients. In this paper we extend the recent work of Arthur and Panigrahy [1] on modelling the distribution of individual data blocks in BT systems, aiming at a better understanding of why BT can achieve a high degree of parallelism. In particular, we include in our study several new network features that BT systems are using, as well as different local heuristics for routing data blocks in each client. We conduct simulations to figure out to what extent the new network features and routing heuristics would affect the distribution efficiency. Our findings confirm that for the primitive network setting studied in [1], it does require Θ(b log n) phases for n clients to download b data blocks. More interestingly, our work suggests that for the more realistic network setting, the heuristics Random and Rarest Block First both allow n clients to download b blocks in b + O(log n) phases. We believe that the latter bound better reflects the high degree of parallelism of BT observed in reality. It is also worth mentioning that b + log n is the smallest possible number of phases needed; it is interesting to see that some simple local routing heuristics have a performance so close to the optimal.
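To make the two heuristics concrete, below is a toy, fully synchronous simulation in which every client requests one block it lacks per phase, choosing the block either uniformly at Random or by Rarest Block First; the network model (a single seed, one upload per client per phase) is deliberately simplistic and is not the model used in the paper.

```python
# Toy synchronous simulation of block spreading: Random vs Rarest Block First.
import random
from collections import Counter

def simulate(n_clients, n_blocks, heuristic, seed=0):
    random.seed(seed)
    have = [set() for _ in range(n_clients)]
    have.append(set(range(n_blocks)))            # one seed client owns every block
    phases = 0
    while any(len(h) < n_blocks for h in have):
        phases += 1
        counts = Counter(b for h in have for b in h)
        uploads = [False] * len(have)            # each client uploads at most once per phase
        for i in random.sample(range(len(have)), len(have)):
            missing = [b for b in range(n_blocks) if b not in have[i]]
            if not missing:
                continue
            if heuristic == "rarest":
                missing.sort(key=lambda b: counts[b])   # rarest block first
            else:
                random.shuffle(missing)                  # random block
            for b in missing:
                donors = [j for j in range(len(have)) if b in have[j] and not uploads[j]]
                if donors:
                    uploads[random.choice(donors)] = True
                    have[i].add(b)
                    break
    return phases

print(simulate(16, 8, "random"), simulate(16, 8, "rarest"))
```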

01 Jan 2007
TL;DR: A set of tile operations is shown which leads to a natural and easy implementation of different algorithms, both parallel and sequential, with higher clarity and smaller code size, and it is shown that HTA codes need less programming effort with a negligible effect on performance.
Abstract: The importance of tiles or blocks in mathematics and thus computer science cannot be overstated. From a high-level point of view, they are the natural way to express many algorithms. Tiles or sub-tiles are used as basic units in the algorithm description. From a low-level point of view, tiling, either as the unit maintained by the algorithm or as a class of data layouts, is one of the most effective ways to exploit locality, which is a must to achieve good performance in current computers. Finally, tiles and operations on them are also basic to express data distribution and parallelism. Despite the importance of this concept, which makes its widespread usage inevitable, most languages do not support it directly. Programmers have to understand and manage the low-level details along with the introduction of tiling. This gives rise to bloated, potentially error-prone programs in which opportunities for performance are lost. On the other hand, the disparity between the algorithm and the actual implementation grows. This thesis illustrates the power of Hierarchically Tiled Arrays (HTAs), a data type which enables the easy manipulation of tiles in object-oriented languages. The objective is to evolve this data type in order to make the representation of all classes of algorithms with a high degree of parallelism and/or locality as natural as possible. We show in the thesis a set of tile operations which leads to a natural and easy implementation of different algorithms, both parallel and sequential, with higher clarity and smaller code size. In particular, two new language constructs, dynamic partitioning and overlapped tiling, are discussed in detail. They are extensions of the HTA data type that improve its capability to express algorithms with high abstraction and free programmers from tedious low-level programming tasks. To prove these claims, two popular languages, C++ and MATLAB, are extended with our HTA data type. In addition, several important dense linear algebra kernels, stencil computation kernels, as well as some benchmarks from the NAS benchmark suite, were implemented. We show that HTA codes need less programming effort with a negligible effect on performance.
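The thesis provides HTAs as C++ and MATLAB extensions; purely to convey the flavour of tile-level indexing, a very small Python mock-up of a two-level tiled array is shown below. The class name and interface are invented and are not the HTA library's API.

```python
# Minimal mock-up of the hierarchically tiled array idea (invented interface):
# index tiles first, then elements, and operate tile by tile through views.
import numpy as np

class TiledArray2D:
    def __init__(self, data, tile_shape):
        self.data = np.asarray(data)
        self.tile_shape = tile_shape             # (rows per tile, cols per tile)

    def tile(self, ti, tj):
        """Return a view of tile (ti, tj); views let tile-level ops avoid copies."""
        r, c = self.tile_shape
        return self.data[ti * r:(ti + 1) * r, tj * c:(tj + 1) * c]

    def map_tiles(self, fn):
        """Apply fn to every tile, e.g. a per-tile kernel that could run in parallel."""
        rows, cols = (s // t for s, t in zip(self.data.shape, self.tile_shape))
        for ti in range(rows):
            for tj in range(cols):
                fn(self.tile(ti, tj))

a = TiledArray2D(np.arange(16).reshape(4, 4), tile_shape=(2, 2))
print(a.tile(1, 0))                              # bottom-left 2x2 tile
a.map_tiles(lambda t: np.add(t, 100, out=t))     # tile-wise update through views
print(a.data)
```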

Patent
28 Mar 2007
TL;DR: In this article, operations shared among the prediction formulas of the 17 different prediction modes for a 4x4 block of 16 pixels are simplified using a computation-strength-reduction method to remove computational redundancy, yielding an intra-frame predictor that can produce the predicted values of 16 pixels in every clock cycle.
Abstract: The invention belongs to the field of video decoder IC design. Its characteristic is that a computation-strength-reduction method is applied to the operations shared among the prediction formulas of the 17 different prediction modes for a 4x4 block of 16 pixels, removing computational redundancy and providing an intra-frame predictor system with a high degree of parallelism that can produce the predicted values of 16 pixels in every clock cycle. The results show that, compared to a design based on reconstruction, this invention decreases circuit area at the same degree of parallelism and simplifies the control logic.

Journal Article
TL;DR: The degree of parallelism is introduced and investigated, and it is shown that it links to the notions of growth functions and active symbols in Lindenmayer and Bharat systems.
Abstract: In this paper, the degree of parallelism is introduced and investigated. The degree of parallelism is a natural descriptional complexity measure of Lindenmayer and Bharat systems. This concept quantifies the amount of non-redundant parallelism needed in the derivations of those systems. We consider both static and dynamic versions of this notion. Corresponding hierarchy and undecidability results are established. Furthermore, we show that the degree of parallelism links to the notions of growth functions and active symbols.

01 Jan 2007
TL;DR: This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces, and addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations.
Abstract: Reinforcement Learning (RL) is a machine learning discipline in which an agent learns by interacting with its environment. In this paradigm, the agent is required to perceive its state and take actions accordingly. Upon taking each action, a numerical reward is provided by the environment. The goal of the agent is thus to maximize the aggregate rewards it receives over time. Over the past two decades, a large variety of algorithms have been proposed to select actions in order to explore the environment and gradually construct an effective strategy that maximizes the rewards. These RL techniques have been successfully applied to numerous real-world, complex applications including board games and motor control tasks. Almost all RL algorithms involve the estimation of a value function, which indicates how good it is for the agent to be in a given state, in terms of the total expected reward in the long run. Alternatively, the value function may reflect on the impact of taking a particular action at a given state. The most fundamental approach for constructing such a value function consists of updating a table that contains a value for each state (or each state-action pair). However, this approach is impractical for large scale problems, in which the state and/or action spaces are large. In order to deal with such problems, it is necessary to exploit the generalization capabilities of non-linear function approximators, such as artificial neural networks. This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces. In particular, the work addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations. In order to address such intricate problems, novel solutions for both tabular and function-approximation based RL frameworks are proposed. A resource-efficient recurrent neural network algorithm is presented, which exploits adaptive step-size techniques to improve learning characteristics. Moreover, a consolidated actor-critic network is introduced, which omits the modeling redundancy found in typical actor-critic systems. Pivotal concerns are the scalability and speed of the learning algorithms, for which we devise architectures that map efficiently to hardware. As a result, a high degree of parallelism can be achieved. Simulation results that correspond to relevant testbench problems clearly demonstrate the solid performance attributes of the proposed solutions.
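Since the passage describes the tabular value-function approach before motivating function approximation, here is the standard tabular Q-learning update it alludes to, on a throwaway toy problem. The environment is invented for this sketch, and the dissertation's recurrent and actor-critic architectures are not shown.

```python
# Tabular Q-learning sketch: the table-based value function the text contrasts
# with neural-network approximation. Toy 1-D chain environment, invented here.
import random
from collections import defaultdict

def q_learning(episodes=500, alpha=0.1, gamma=0.95, eps=0.1, n_states=6):
    Q = defaultdict(float)                       # Q[(state, action)], actions: -1 or +1
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:                 # rightmost state is terminal
            a = random.choice([-1, 1]) if random.random() < eps else \
                max([-1, 1], key=lambda a: Q[(s, a)])
            s2 = min(max(s + a, 0), n_states - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = r + gamma * max(Q[(s2, -1)], Q[(s2, 1)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

Q = q_learning()
print([round(max(Q[(s, -1)], Q[(s, 1)]), 2) for s in range(6)])
```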

Proceedings ArticleDOI
TL;DR: The results obtained show a considerable reduction in the number of cycles and memory accesses needed to perform the motion compensation as well as an increase in the degree of parallelism when compared with an implementation on a Very Long Instruction Word (VLIW) dedicated processor.
Abstract: The decoding of an H.264/AVC bitstream represents a complex and time-consuming task. For this reason, efficient implementations in terms of performance and flexibility are mandatory for real-time applications. In this sense, the mapping of the motion compensation and deblocking filtering stages onto a coarse-grained reconfigurable architecture named ADRES (Architecture for Dynamically Reconfigurable Embedded Systems) is presented in this paper. The results obtained show a considerable reduction in the number of cycles and memory accesses needed to perform the motion compensation, as well as an increase in the degree of parallelism when compared with an implementation on a Very Long Instruction Word (VLIW) dedicated processor.

Proceedings ArticleDOI
04 Jun 2007
TL;DR: The present paper seeks faster and lower-cost methods for characterizing granular objects in general, and cereal grains in particular, based on simultaneous cross-combined optical and impedimetric sensing techniques in the MHz range, applied to flowing streams to obtain real-time 3D morphological characterization of single objects.
Abstract: The present paper aims at faster and lower-cost methods for characterizing granular objects in general, and cereal grains in particular. The approach is based on simultaneous cross-combined optical and impedimetric sensing techniques in the MHz range, applied to flowing streams so as to determine real-time 3D morphological characterization of single objects. The main advantages of the proposed approach lie in the high discriminatory capacity over the object classes and in the high degree of parallelism, capable of processing large amounts of material on production lines. The use of integrated electronic systems also allows high operating speed, easy calibration and flexibility in the required classification features.

01 Jan 2007
TL;DR: For solution of system of linear equations related to finite difference discretization of partial differential equations, a new parallel Gauss-Seidel is developed based on domain decomposition scheme called convergence iteration space alternate tiling, which is developed to improve parallelism, intra-iteration, and inter-iterations data locality, but also to decrease communication and synchronization cost in iterative method.
Abstract: Many important scientific kernels compute solutions using finite difference techniques, and the most time consuming part of them is the iterative method, such as Gauss-Seidel or SOR. To improve performance, iterative method can exploit parallelism, intra-iteration data reuse, and inter-iteration data reuse. This paper describes a new parallel Gauss-Seidel method using iteration space alternate tiling technique, which is developed not only to improve parallelism, intra-iteration, and inter-iteration data locality, but also to decrease communication and synchronization cost in iterative method. Time-skewing can increase cache locality by exploiting locality in the time direction as well as spatial locality. The degree of parallelism can be improved by reordering the execution of cache blocks. Finally numerical results are presented which confirm the effectiveness of Gauss-Seidel parallelized with iteration space alternate tiling technique, specifically compared with owner-computes and red- black coloring based Gauss-Seidel methods, and show that the new method has a good parallel performance on distributed memory machines, as well as scalability. communication and synchronization cost and maximizing data reuse. In the present study, for solution of system of linear equations related to finite difference discretization of partial differential equations, a new parallel Gauss-Seidel is developed based on domain decomposition scheme called convergence iteration space alternate tiling. The goal of this method is to optimize three different performance aspects: inter-iteration data locality, intra-iteration data locality and parallelism. Intra-iteration locality refers to cache locality upon data reuse within convergence iteration, and inter- iteration locality refers to cache locality upon data reuse between convergence iterations. In this algorithm, Cache blocking (5) can improve the performance of kernel computations for large grids by reorganizing the memory access pattern to improve memory locality. Time-skewing (6) can further increase cache locality by exploiting locality in the time direction as well as spatial locality. Block-reordering can improve the degree of parallelism by reordering the execution of blocks. Each cache block contains computations from multiple convergence iterations for a sub set of unknowns, so it only needs one communication stage per K (number of iterations in a half of tile) iterations, which may lead to good parallel performance on distributed memory machines. Our numerical experiments are presented which confirm the effectiveness of GS parallelized with iteration space tiling, specifically compared with owner-computes and red-black coloring based GS methods, and show that the new method has a good parallel performance on distributed memory machines.