
Showing papers on "Degree of parallelism published in 2006"


Book ChapterDOI
02 Nov 2006
TL;DR: An unbalanced tree search benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing, with versions implemented in two parallel languages, OpenMP and Unified Parallel C, using work stealing as the mechanism for reducing load imbalance.
Abstract: This paper presents an unbalanced tree search (UTS) benchmark designed to evaluate the performance and ease of programming for parallel applications requiring dynamic load balancing. We describe algorithms for building a variety of unbalanced search trees to simulate different forms of load imbalance. We created versions of UTS in two parallel languages, OpenMP and Unified Parallel C (UPC), using work stealing as the mechanism for reducing load imbalance. We benchmarked the performance of UTS on various parallel architectures, including shared-memory systems and PC clusters. We found it simple to implement UTS in both UPC and OpenMP, due to their shared-memory abstractions. Results show that both UPC and OpenMP can support efficient dynamic load balancing on shared-memory architectures. However, UPC cannot alleviate the underlying communication costs of distributed-memory systems. Since dynamic load balancing requires intensive communication, performance portability remains difficult for applications such as UTS, and performance degrades on PC clusters. By varying key work stealing parameters, we expose important tradeoffs between the granularity of load balance, the degree of parallelism, and communication costs.
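
The work-stealing mechanism the authors rely on can be sketched in a few lines (a generic illustration, not the paper's UPC/OpenMP implementation): each worker pops tasks from the bottom of its own deque and, when it runs dry, steals from the top of a random victim's deque, which rebalances an initially uneven load.

```python
import threading, collections, random

def run_with_work_stealing(tasks, n_workers=4):
    """Minimal work-stealing pool: each worker owns a deque, pops from its
    own bottom (LIFO), and steals from a victim's top (FIFO) when idle."""
    deques = [collections.deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):           # initial (possibly unbalanced) split
        deques[i % n_workers].append(t)
    results, lock = [], threading.Lock()

    def worker(me):
        idle_tries = 0
        while idle_tries < 50:              # give up after repeated failed steals
            try:
                task = deques[me].pop()                 # own work first
            except IndexError:
                victim = random.randrange(n_workers)
                try:
                    task = deques[victim].popleft()     # steal oldest task
                except IndexError:
                    idle_tries += 1
                    continue
            idle_tries = 0
            with lock:
                results.append(task())

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

out = run_with_work_stealing([(lambda v=v: v * v) for v in range(8)], n_workers=2)
print(sorted(out))   # [0, 1, 4, 9, 16, 25, 36, 49]
```

The steal granularity (here: one task per steal) is exactly the kind of parameter whose tuning the abstract describes as a tradeoff between load-balance granularity and communication cost.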

172 citations


Proceedings ArticleDOI
06 Sep 2006
TL;DR: This paper analytically studies three important aspects of improving DHT lookup performance under churn, i.e., lookup strategy, lookup parallelism and lookup key replication, and explores the existence of better alternatives.
Abstract: The phenomenon of churn greatly degrades the lookup performance of DHT-based P2P systems. To date, a number of approaches have been proposed to handle it from both the system side and the client side. However, theoretical analysis to guide design choices under different churn levels and the optimal configuration of their parameters has been lacking. In this paper, we analytically study three important aspects of improving DHT lookup performance under churn, i.e., lookup strategy, lookup parallelism and lookup key replication. Our objective is to build a theoretical basis for DHT designers to make better design choices in the future. We first compare the performance of two representative lookup strategies, recursive routing and iterative routing, and explore the existence of better alternatives. Then we show the effectiveness of parallel lookup in systems with different churn levels and how to select the optimal degree of parallelism. Due to the importance of key replication for lookup performance, we also analyze the reliability of replicated keys under two different replication policies, and discuss how to configure them in different environments. Besides the analytical study, our results are validated by simulation, and Kad [1] is taken as a case study to show the applicability of our analysis.
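
The trade-off behind selecting a degree of lookup parallelism can be illustrated with a toy model (our simplification, not the paper's analysis): if each of alpha independent parallel lookups fails with probability p, at least one succeeds with probability 1 - p^alpha, so the churn level dictates the smallest alpha that meets a reliability target.

```python
def min_parallelism(p_fail, target, max_alpha=32):
    """Smallest number of parallel lookups alpha such that
    P(at least one succeeds) = 1 - p_fail**alpha >= target."""
    for alpha in range(1, max_alpha + 1):
        if 1 - p_fail ** alpha >= target:
            return alpha
    return None   # target unreachable within max_alpha copies

# Under heavy churn where each single lookup fails 40% of the time:
print(min_parallelism(0.4, 0.90))   # 3 parallel lookups reach 90% success
print(min_parallelism(0.4, 0.99))   # 6 are needed for 99%
```

Each extra copy costs bandwidth, so the "optimal" alpha is the smallest one meeting the target, which is what the loop returns.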

49 citations


Journal ArticleDOI
Clara Gaspar1, B. Franek
TL;DR: The principles and features of SMI++ as well as its integration with an industrial SCADA tool for use by the LHC experiments are described, and it is shown that such tools can provide a very convenient mechanism for the automation of large-scale, high-complexity applications.
Abstract: The new LHC experiments at CERN will have very large numbers of channels to operate. In order to be able to configure and monitor such large systems, a high degree of parallelism is necessary. The control system is built as a hierarchy of sub-systems distributed over several computers. A toolkit, SMI++, combining two approaches, finite state machines and rule-based programming, allows for the description of the various sub-systems as decentralized deciding entities, reacting in real time to changes in the system, thus providing for the automation of standard procedures and for the automatic recovery from error conditions in a hierarchical fashion. In this paper we describe the principles and features of SMI++ as well as its integration with an industrial SCADA tool for use by the LHC experiments, and we show that such tools can provide a very convenient mechanism for the automation of large-scale, high-complexity applications.

30 citations


Patent
Kenneth Alan Dockser1
25 May 2006
TL;DR: In this paper, power and energy control of one or more processing elements matches the degree of parallelism to a monitored condition in a highly parallel programmable data processor.
Abstract: Automatic selective power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition, in a highly parallel programmable data processor. For example, logic of the parallel processor detects when program operations (e.g. for a particular task or due to a detected temperature) require less than the full width of the data path. In response, the control logic automatically sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element, that is not needed, can be shut down, to conserve energy and/or to reduce heating (i.e., power consumption). At a later time, when operation of the added capacity is appropriate, the logic detects the change in processing conditions and automatically sets the mode of operation to that of the wider data path, typically the full width. The mode change reactivates the previously shut-down processing element.
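
The patent's control idea, matching the number of powered processing elements to the required data width and to a monitored condition such as temperature, can be sketched as follows (hypothetical names and thresholds; the actual logic is hardware, not software):

```python
def select_active_lanes(required_width, lane_width, total_lanes,
                        temp_c, temp_limit_c=85):
    """Pick how many parallel processing lanes stay powered: enough to
    cover the requested data width, throttled to half capacity once the
    monitored temperature passes a (hypothetical) thermal limit."""
    needed = -(-required_width // lane_width)            # ceiling division
    cap = total_lanes if temp_c < temp_limit_c else max(1, total_lanes // 2)
    return min(max(1, needed), cap)

print(select_active_lanes(128, 32, 8, temp_c=60))   # 4 lanes cover 128 bits
print(select_active_lanes(512, 32, 8, temp_c=60))   # 8: capped at full width
print(select_active_lanes(512, 32, 8, temp_c=90))   # 4: thermal throttling
```

Lanes above the returned count would be clock- or power-gated, and reactivated when the monitored condition changes, mirroring the mode switch the abstract describes.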

28 citations


Journal ArticleDOI
TL;DR: An approach to simplify the design of IFFT/FFT cores for OFDM applications using a novel software tool, called AFORE, which employs a parallel architecture, where the degree of parallelism can be varied.
Abstract: In this paper we present an approach to simplify the design of IFFT/FFT cores for OFDM applications. A novel software tool is proposed, called AFORE, which is able to generate efficient single- and multiple-mode IFFT/FFT processors. AFORE employs a parallel architecture in which the degree of parallelism can be varied. This way, the tool can find a trade-off between area and processing time to meet the system specification. In order to assess the quality of the proposed approach, results are provided for some of the most widely used OFDM standards, such as WLAN 802.11a/g, WMAN 802.16a, and DVB-T.
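
The kind of area/time trade-off such a tool explores can be mimicked with a toy cost model (the N·log2(N)/p latency estimate and the linear area figure are our illustrative assumptions, not AFORE's internal models): sweep the degree of parallelism and keep the smallest one whose latency fits the budget.

```python
import math

def pick_parallelism(n_points, cycle_ns, budget_ns, area_per_lane,
                     candidates=(1, 2, 4, 8, 16)):
    """Smallest degree of parallelism whose estimated FFT latency fits the
    budget; returns (p, latency_ns, area).  Toy latency model: a radix-2
    FFT needs about N*log2(N) butterfly operations, spread over p lanes."""
    ops = n_points * int(math.log2(n_points))
    for p in candidates:
        latency = ops // p * cycle_ns
        if latency <= budget_ns:
            return p, latency, p * area_per_lane
    return None   # no candidate meets the specification

# 64-point FFT, 10 ns cycle, 3.2 us processing budget:
print(pick_parallelism(64, 10, 3200, area_per_lane=100))   # (2, 1920, 200)
```

Returning the smallest feasible p mirrors the tool's goal: meet the system specification with the least area.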

23 citations


Book ChapterDOI
28 Aug 2006
TL;DR: A distributed implementation of a load balancing heuristic for parallel adaptive FEM simulations, based on a disturbed diffusion scheme embedded in a learning framework, which helps to omit unnecessary computations and to replace the domain decomposition by an alternative data distribution scheme that reduces the communication overhead.
Abstract: Load balancing is an important issue in parallel numerical simulations. However, state-of-the-art libraries addressing this problem show several deficiencies: they are hard to parallelize, focus on small edge-cuts rather than few boundary vertices, and often produce disconnected partitions. We present a distributed implementation of a load balancing heuristic for parallel adaptive FEM simulations. It is based on a disturbed diffusion scheme embedded in a learning framework. This approach incorporates a high degree of exploitable parallelism and computes well-shaped partitions, as shown in previous publications. Our focus lies on improving the condition of the involved matrix and solving the resulting linear systems with local accuracy. This helps to omit unnecessary computations and allows the domain decomposition to be replaced by an alternative data distribution scheme that reduces the communication overhead, as shown by experiments with our new MPI-based implementation.

14 citations


Journal ArticleDOI
TL;DR: It is shown that it is possible to accelerate Monte Carlo computations significantly using FPGAs; the simple photon transport test case can be evaluated more than 650 times faster on a large FPGA than on a 3.2 GHz Pentium-4 desktop PC running MCNP5.
Abstract: Advancements in parallel and cluster computing have made many complex Monte Carlo simulations possible in the past several years. Unfortunately, cluster computers are large, expensive, and still not fast enough to make the Monte Carlo technique useful for calculations requiring a near real-time evaluation period. For Monte Carlo simulations, a small computational unit called a Field Programmable Gate Array (FPGA) is capable of bringing the power of a large cluster computer into any personal computer (PC). Because an FPGA is capable of executing Monte Carlo simulations with a high degree of parallelism, a simulation run on a large FPGA can be executed at a much higher rate than an equivalent simulation on a modern single-processor desktop PC. In this paper, a simple radiation transport problem involving moderate energy photons incident on a three-dimensional target is discussed. By comparing the evaluation speed of this transport problem on a large FPGA to the evaluation speed of the same transport problem using standard computing techniques, it is shown that it is possible to accelerate Monte Carlo computations significantly using FPGAs. In fact, we have found that our simple photon transport test case can be evaluated in excess of 650 times faster on a large FPGA than on a 3.2 GHz Pentium-4 desktop PC running MCNP5.

13 citations


Book ChapterDOI
01 Jan 2006
TL;DR: This chapter presents a new systolic architecture for the complete back propagation algorithm, which completely parallelizes the entire computation of the learning phase and achieves very favorable performance in the range of 5 GOPS.
Abstract: Back propagation is a well known technique used in the implementation of artificial neural networks. The algorithm can be described essentially as a sequence of matrix-vector multiplications and outer product operations interspersed with the application of a pointwise nonlinear function. The algorithm is compute intensive and lends itself to a high degree of parallelism. These features motivate a systolic design of hardware to implement the back propagation algorithm. We present in this chapter a new systolic architecture for the complete back propagation algorithm. For a neural network with N input neurons, P hidden layer neurons and M output neurons, the proposed architecture with P processors has a running time of (2N + 2M + P + max(M,P)) for each training set vector. This is the first such implementation of the back propagation algorithm that completely parallelizes the entire computation of the learning phase. The array has been implemented on an Annapolis FPGA based coprocessor and it achieves very favorable performance in the range of 5 GOPS. The proposed new design targets Virtex boards.
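
The quoted running time (N inputs, P hidden neurons, M outputs) can be evaluated directly and set against a rough single-processor cost per training vector; the sequential multiply-accumulate count below is our own illustrative estimate, not a figure from the chapter.

```python
def systolic_time(N, M, P):
    """Cycles per training vector for the P-processor systolic array,
    as given in the abstract: 2N + 2M + P + max(M, P)."""
    return 2 * N + 2 * M + P + max(M, P)

def sequential_ops(N, M, P):
    """Rough MAC count for one forward + backward pass of an N-P-M
    network on a single processor (illustrative approximation)."""
    return 2 * (N * P + P * M)

# Example network: 64 inputs, 16 hidden neurons, 10 outputs.
print(systolic_time(64, 10, 16))    # 180 cycles on the array
print(sequential_ops(64, 10, 16))   # ~2368 MACs sequentially
```

With P = 16 processors the array's 180 cycles against roughly 2368 sequential operations gives a speedup around 13x, consistent with near-complete parallelization of the learning phase.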

12 citations


Journal Article
TL;DR: In this article, the concept of functional-level power analysis (FLPA) for power estimation of programmable processors is extended in order to model even embedded general-purpose processors, based on the separation of the processor architecture into functional blocks such as the processing unit, clock network, and internal memory.
Abstract: In this contribution the concept of Functional-Level Power Analysis (FLPA) for power estimation of programmable processors is extended in order to model even embedded general-purpose processors. The basic FLPA approach is based on the separation of the processor architecture into functional blocks such as the processing unit, clock network, and internal memory. The power consumption of these blocks is described by parameterized arithmetic models. By applying a parser-based automated analysis of assembler code, the input parameters of the arithmetic functions, such as the achieved degree of parallelism or the kind and number of memory accesses, can be computed. For modeling an embedded general-purpose processor (here, an ARM940T) the basic FLPA modeling concept had to be extended to a so-called hybrid functional-level and instruction-level model in order to achieve good modeling accuracy. The approach is demonstrated and evaluated on a variety of basic digital signal processing tasks ranging from basic filters to complete audio decoders. Estimated power figures for the inspected tasks are compared to physically measured values. A resulting maximum estimation error of less than 8% is achieved.

11 citations


Book ChapterDOI
17 Jul 2006
TL;DR: The concept of Functional-Level Power Analysis (FLPA) for power estimation of programmable processors is extended in order to model even embedded general purpose processors and a resulting maximum estimation error of less than 8 % is achieved.
Abstract: In this contribution the concept of Functional-Level Power Analysis (FLPA) for power estimation of programmable processors is extended in order to model even embedded general-purpose processors. The basic FLPA approach is based on the separation of the processor architecture into functional blocks such as the processing unit, clock network, and internal memory. The power consumption of these blocks is described by parameterized arithmetic models. By applying a parser-based automated analysis of assembler code, the input parameters of the arithmetic functions, such as the achieved degree of parallelism or the kind and number of memory accesses, can be computed. For modeling an embedded general-purpose processor (here, an ARM940T) the basic FLPA modeling concept had to be extended to a so-called hybrid functional-level and instruction-level model in order to achieve good modeling accuracy. The approach is demonstrated and evaluated on a variety of basic digital signal processing tasks ranging from basic filters to complete audio decoders. Estimated power figures for the inspected tasks are compared to physically measured values. A resulting maximum estimation error of less than 8% is achieved.

11 citations


Proceedings ArticleDOI
01 Aug 2006
TL;DR: This work presents a formal analysis of maximizing FPGA utilization, with adaptations that simplify the optimization problem, and reports on design tools containing extensions that support automated sizing ofFPGA-based computation arrays.
Abstract: Computing applications in FPGAs are commonly built from repetitive structures of computing and/or memory elements. In many cases, application performance depends on the degree of parallelism: ideally, the most that will fit into the fabric of the FPGA being used. Several factors complicate determination of the largest structure that will fit the FPGA: arrays that grow nonlinearly and in uneven step sizes, coupled structures that grow in different polynomial order, multiple design parameters controlling different aspects of the computing structure, and interlocked usage of different hardware resources. Combined with resource usage that depends on application-specific data elements and arithmetic details, these factors defeat any simple approach for scaling the computing structures up to the FPGA's capacity. We present a formal analysis of maximizing FPGA utilization, with adaptations that simplify the optimization problem. We also report on design tools containing extensions that support automated sizing of FPGA-based computation arrays.

Proceedings ArticleDOI
01 Oct 2006
TL;DR: In this article, an optimal implementation of a 128-point FFT/IFFT for low-power IEEE 802.15.3a WPAN using a pseudo-parallel datapath structure is presented, where the 128-point FFT is devolved into 8-point and 16-point FFTs, and then once again by devolving the 16-point FFT into 4×4 and 2×8.
Abstract: An optimal implementation of a 128-point FFT/IFFT for low-power IEEE 802.15.3a WPAN using a pseudo-parallel datapath structure is presented, where the 128-point FFT is devolved into 8-point and 16-point FFTs, and then once again by devolving the 16-point FFT into 4×4 and 2×8. We analyze the 128-point FFT/IFFT architecture for various pseudo-parallel 8-point and 16-point FFTs, and an optimum datapath architecture is explored. It is suggested that there exists an optimum degree of parallelism for the given algorithm. The analysis demonstrates that with a modest increase in area one can achieve a significant reduction in power. The proposed architectures complete one parallel-to-parallel (i.e., when all input data are available in parallel and all output data are generated in parallel) 128-point FFT computation in less than 312 ns, thereby meeting the standard specification. The relative merits and demerits of these architectures have been analyzed from the algorithm as well as the implementation point of view. A detailed power analysis of each of the architectures with different numbers of datapaths at the block level is described. We found that from a power perspective the architecture with eight datapaths is optimum. The core power consumption in the optimum case is 60.6 mW, which is less than half that of the latest reported 128-point FFT design in 0.18 µm technology. Apart from the low power consumption, the advantages of the proposed architectures include reduced hardware complexity, regular data flow, and simple counter-based control.
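
The "devolution" of a 128-point FFT into 8-point and 16-point stages is an instance of the general Cooley-Tukey factorization N = N1·N2: N2 inner DFTs of size N1 over strided input, twiddle-factor multiplies, then N1 outer DFTs of size N2. A reference check in Python (software model only, unrelated to the paper's hardware datapath):

```python
import cmath

def dft(x):
    """Direct O(N^2) reference DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def dft_cooley_tukey(x, N1, N2):
    """Decompose an N = N1*N2 point DFT into N2 DFTs of size N1,
    twiddle multiplies, and N1 DFTs of size N2 (index map n = N2*n1 + n2,
    k = k1 + N1*k2)."""
    N = N1 * N2
    assert len(x) == N
    # N2 inner DFTs of length N1 over strided input samples
    inner = [dft([x[n2 + N2 * n1] for n1 in range(N1)]) for n2 in range(N2)]
    X = [0j] * N
    for k1 in range(N1):
        # apply twiddle factors, then one outer DFT of length N2
        row = [inner[n2][k1] * cmath.exp(-2j * cmath.pi * k1 * n2 / N)
               for n2 in range(N2)]
        outer = dft(row)
        for k2 in range(N2):
            X[k1 + N1 * k2] = outer[k2]
    return X

x = [complex(n % 7, (3 * n) % 5) for n in range(128)]
err = max(abs(a - b) for a, b in zip(dft(x), dft_cooley_tukey(x, 8, 16)))
print(err < 1e-6)   # True: the 8x16 split reproduces the 128-point DFT
```

In hardware, the N2 inner DFTs are independent of one another, which is exactly the parallelism the pseudo-parallel datapaths exploit.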

Dissertation
01 Jan 2006
TL;DR: This thesis shows that multimedia compression algorithms, composed of many independent processing stages, are a good match for the streaming model of computation.
Abstract: Video playback devices rely on compression algorithms to minimize storage, transmission bandwidth, and overall cost. Compression techniques have high realtime and sustained throughput requirements, and the end of CPU clock scaling means that parallel implementations for novel system architectures are needed. Parallel implementations increase the complexity of application design. Current languages force the programmer to trade off productivity for performance; the performance demands dictate that the parallel programmer choose a low-level language in which he can explicitly control the degree of parallelism and tune his code for performance. This methodology is not cost effective because this architecture-specific code is neither malleable nor portable. Reimplementations must be written from scratch for each of the existing parallel and reconfigurable architectures. This thesis shows that multimedia compression algorithms, composed of many independent processing stages, are a good match for the streaming model of computation. Stream programming models afford certain advantages in terms of programmability, robustness, and achieving high performance. This thesis intends to influence language design towards the inclusion of features that lend themselves to the efficient implementation and parallel execution of streaming applications like image and video compression algorithms. Towards this I contribute i) a clean, malleable, and portable implementation of an MPEG-2 encoder and decoder expressed in a streaming fashion, ii) an analysis of how a streaming language improves programmer productivity, iii) an analysis of how a streaming language enables scalable parallel execution, iv) an enumeration of the language features that are needed to cleanly express compression algorithms, v) an enumeration of the language features that support large scale application development and promote software engineering principles such as portability and reusability.
This thesis presents a case study of MPEG-2 encoding and decoding to explicate points about language expressiveness. The work is in the context of the StreamIt programming language.

Journal ArticleDOI
TL;DR: Graph theory can help to create a new framework for fine-grain parallelism analysis, introducing concepts from reduced valence to the data dependence matrix D, the latter characterizing a code sequence in a mathematical manner.
Abstract: The evaluation of computer architectures requires new tools that complement the customary simulations. Graph theory can help to create a new framework for fine-grain parallelism analysis. The differences found between superscalar performance in x86 and non-x86 processors, and the peculiar characteristics of the x86 instruction set architecture, recommend carrying out a thorough study of the available parallelism at the machine language layer. Starting from graph theory foundations, new concepts are introduced, from reduced valence to the data dependence matrix D, the latter characterizing a code sequence in a mathematical manner. This matrix satisfies a series of properties and restrictions and provides information about the ability of the code to be processed concurrently. The different sources of data dependencies can be composed, facilitating a way to analyze their final influence on the degree of parallelism.
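
The role of a dependence matrix can be illustrated with a toy construction (a simplified register-level definition of ours, not the paper's exact formalism): D[i][j] marks a dependence from instruction i to a later instruction j, and the longest chain through D gives the minimum number of parallel steps the sequence needs.

```python
def dependence_matrix(code):
    """code: list of (dest, sources) register tuples in program order.
    D[i][j] = 1 iff, for i < j, instruction j reads i's destination (RAW)
    or overwrites a register that i reads or writes (WAR/WAW)."""
    n = len(code)
    D = [[0] * n for _ in range(n)]
    for j in range(n):
        dj, sj = code[j]
        for i in range(j):
            di, si = code[i]
            if di in sj or dj in si or dj == di:
                D[i][j] = 1
    return D

def critical_path(D):
    """Length of the longest dependence chain: the minimal number of
    parallel steps needed to execute the whole sequence."""
    n = len(D)
    depth = [1] * n
    for j in range(n):
        for i in range(j):
            if D[i][j]:
                depth[j] = max(depth[j], depth[i] + 1)
    return max(depth) if depth else 0

# r1 = r2+r3 ; r4 = r1+r2 ; r5 = r6+r7 ; r8 = r5+r1
code = [("r1", {"r2", "r3"}), ("r4", {"r1", "r2"}),
        ("r5", {"r6", "r7"}), ("r8", {"r5", "r1"})]
D = dependence_matrix(code)
print(critical_path(D))   # 2: instructions 0 and 2 issue together, then 1 and 3
```

Four instructions in two steps gives an average degree of parallelism of 2 for this fragment, the kind of quantity the matrix D is meant to expose.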

Proceedings ArticleDOI
01 Nov 2006
TL;DR: Technology-independent hardware cost analysis for a new class of highly parameterizable coarse-grained reconfigurable architectures called weakly programmable processor arrays is performed.
Abstract: Growing complexity and speed requirements in modern application areas such as wireless communication and multimedia in embedded devices demand flexible and efficient parallel hardware architectures. The inherent parallelism in these application fields has to be reflected at the hardware level to achieve high performance. Coarse-grained reconfigurable architectures support a high degree of parallelism at multiple levels. In this paper, a technology-independent hardware cost analysis for a new class of highly parameterizable coarse-grained reconfigurable architectures called weakly programmable processor arrays is performed.

Proceedings ArticleDOI
01 Dec 2006
TL;DR: This paper describes a cost effective artificial neural network implementation on an FPGA in three easy steps and proposes the manner in which network layers are mapped into a particular hardware structure such that the performance and efficiency of the hardware resources are greatly improved.
Abstract: This paper describes a cost-effective artificial neural network implementation on an FPGA in three easy steps. Furthermore, it proposes the manner in which network layers are mapped onto a particular hardware structure such that the performance, and the efficiency with which the hardware resources are used, are greatly improved. A reconfigurable, parameterised neural node is presented as the basic building block for neural implementations, and is modelled in Verilog (HDL). The results show a high degree of parallelism, fast performance and, most importantly, low area usage.

12 Mar 2006
TL;DR: A novel, hardware-implementation-friendly, "pulse reactive" model of spiking neurons is described and then used to implement a fully connected network, yielding a high degree of parallelism.
Abstract: Neuromorphic neural networks are of interest both from a biological point of view and in terms of robust signaling in noisy environments. The basic question, however, is what type of architecture can be used to efficiently build such neural networks in hardware devices, in order to use them in real-time process control problems. In this paper a novel, hardware-implementation-friendly, "pulse reactive" model of spiking neurons is described. This is then used to implement a fully connected network, yielding a high degree of parallelism. The modular neuron structure, acquired signals and a process control application are given.

Journal ArticleDOI
TL;DR: A heuristic scheduling algorithm is developed, motivated from the observations of a simple cluster configuration, to spatially schedule write operations on the nodes with less load among each mirroring pair to alleviate performance degradation in a RAID-10 style file system.
Abstract: While aggregating the throughput of existing disks on cluster nodes is a cost-effective approach to alleviate the I/O bottleneck in cluster computing, this approach suffers from potential performance degradation due to contention for shared resources on the same node between storage data processing and user task computation. This paper proposes to judiciously utilize the storage redundancy, in the form of mirroring, existing in a RAID-10 style file system to alleviate this performance degradation. More specifically, a heuristic scheduling algorithm is developed, motivated by observations of a simple cluster configuration, to spatially schedule write operations on the node with less load in each mirroring pair. The duplication of modified data to the mirroring nodes is performed asynchronously in the background. The read performance is improved by two techniques: doubling the degree of parallelism and hot-spot skipping. A synthetic benchmark is used to evaluate these algorithms in a real cluster environment, and the proposed algorithms are shown to be very effective in performance enhancement.
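
The spatial scheduling heuristic, sending each write to the less-loaded node of its mirroring pair and syncing the mirror copy asynchronously later, can be sketched as follows (illustrative data structures, not the paper's implementation):

```python
def schedule_writes(writes, pairs, load):
    """Assign each write to the less-loaded node of its mirror pair;
    the mirror copy is assumed to be duplicated asynchronously later.
    writes: list of (block_id, cost); pairs: block_id -> (nodeA, nodeB);
    load: node -> current load, updated as writes are placed."""
    placement = {}
    for block, cost in writes:
        a, b = pairs[block]
        target = a if load[a] <= load[b] else b   # pick the lighter node
        placement[block] = target
        load[target] += cost
    return placement

load = {"n0": 5, "n1": 0, "n2": 2, "n3": 2}
pairs = {10: ("n0", "n1"), 11: ("n2", "n3"), 12: ("n0", "n1")}
placement = schedule_writes([(10, 3), (11, 1), (12, 3)], pairs, load)
print(placement)   # {10: 'n1', 11: 'n2', 12: 'n1'}
```

Because the load map is updated as writes are placed, a busy node (here "n0", loaded by user computation) keeps being skipped until its mirror partner catches up.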

Journal Article
TL;DR: In this article, the authors describe a multi-threaded Prolog system that evolves single-threaded Inside Prolog to achieve a high degree of parallelism on an SMP system by minimizing mutual exclusion, for the scalability essential in enterprise use.
Abstract: A knowledge-based system is suitable for realizing advanced functions that require domain-specific expert knowledge in enterprise-mission-critical information systems (enterprise applications). This paper describes a newly implemented multi-threaded Prolog system that evolves single-threaded Inside Prolog. It is intended as a means to apply a knowledge-based system written in Prolog to an enterprise application. It realizes a high degree of parallelism on an SMP system by minimizing mutual exclusion, which is essential for the scalability required in enterprise use. Also briefly introduced is the knowledge processing server, a framework for operating a knowledge-based system written in Prolog with an enterprise application. Experimental results indicated that on an SMP system the multi-threaded Prolog could achieve a high degree of parallelism while the server could obtain scalability. The application of the server to clinical decision support in a hospital information system also demonstrated that the multi-threaded Prolog and the server were sufficiently robust for use in an enterprise application.

Journal Article
TL;DR: A new recurrence formula is derived to compute the probability density function of the CDD, which consists of multiple convolutions of functions with a Gaussian distribution and can be computed efficiently with the fast Gauss transform.
Abstract: We present a fast and highly parallel algorithm for pricing CDD weather derivatives, which are financial products for hedging weather risks due to higher-than-average temperature in summer. To find the price, we need to compute the expected value of its payoff, namely, the CDD weather index. To this end, we derive a new recurrence formula to compute the probability density function of the CDD. The formula consists of multiple convolutions of functions with a Gaussian distribution and can be computed efficiently with the fast Gauss transform. In addition, our algorithm has a large degree of parallelism because each convolution can be computed independently. Numerical experiments show that our method is more than 10 times faster than the conventional Monte Carlo method when computing the prices of various CDD derivatives on one processor. Moreover, parallel execution on a PC cluster with 8 nodes attains up to six times speedup, allowing the pricing of most of the derivatives to be completed in about 10 seconds.

Journal Article
TL;DR: In this paper, a parallel version of AMIGO (Advanced Multidimensional Interval Analysis Global Optimization) algorithm is proposed to solve very hard global optimization problems in a multiprocessing environment.
Abstract: Interval Global Optimization based on the Branch and Bound (B&B) technique is a standard for searching for an optimal solution in the scope of continuous and discrete Global Optimization. It iteratively creates a search tree where each node represents a problem which is decomposed into several subproblems, provided that a feasible solution can be found by solving this set of subproblems. The enormous computational power needed to solve most B&B Global Optimization problems and their high degree of parallelism make them suitable candidates to be solved in a multiprocessing environment. This work evaluates a parallel version of the AMIGO (Advanced Multidimensional Interval Analysis Global Optimization) algorithm. AMIGO makes efficient use of all the available information in continuous differentiable problems to reduce the search domain and to accelerate the search. Our parallel version takes advantage of the capabilities offered by Charm++. Preliminary results show our proposal to be a good candidate for solving very hard global optimization problems.
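
The structure that makes interval B&B parallel-friendly, independent boxes that can be bounded, pruned, and bisected concurrently, shows up even in a minimal serial sketch (toy objective and tolerance; AMIGO's accelerating devices and Charm++ parallelization are not modeled):

```python
def interval_bnb(f_bounds, lo, hi, tol=1e-4):
    """Interval branch and bound for global minimization.
    f_bounds(a, b) must return guaranteed (lower, upper) bounds of f over
    [a, b].  A box whose lower bound exceeds the best known upper bound
    cannot contain the minimum and is discarded; surviving boxes are
    bisected.  The boxes on the work list are independent, so they could
    be processed in parallel."""
    best_ub = f_bounds(lo, hi)[1]
    work = [(lo, hi)]
    while work:
        a, b = work.pop()
        lb, ub = f_bounds(a, b)
        if lb > best_ub:                 # prune this box
            continue
        best_ub = min(best_ub, ub)
        if b - a > tol:                  # bisect and keep searching
            m = (a + b) / 2
            work += [(a, m), (m, b)]
    return best_ub

def f_bounds(a, b):
    """Interval extension of f(x) = x^2 - 2x over [a, b]."""
    if a >= 0:
        sq = (a * a, b * b)
    elif b <= 0:
        sq = (b * b, a * a)
    else:
        sq = (0.0, max(a * a, b * b))
    return sq[0] - 2 * b, sq[1] - 2 * a

r = interval_bnb(f_bounds, -1.0, 3.0)
print(round(r, 3))   # -1.0: global minimum of x^2 - 2x at x = 1
```

The pruning test is what the shared best upper bound enables; in a parallel version, workers exchanging this bound is the main point of communication.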

01 Jan 2006
TL;DR: The labeled dependency graph associated with a P system is defined, and this new concept is used for proving some results concerning the maximum number of applications of rules in a single step along the computation of a P systems.
Abstract: In the literature, several designs of P systems were used for performing the same task. The use of different techniques or even different P system models makes it very difficult to compare these designs. In this paper, we introduce a new criterion for such a comparison: the degree of parallelism of a P system. To this aim, we define the labeled dependency graph associated with a P system, and we use this new concept for proving some results concerning the maximum number of applications of rules in a single step along the computation of a P system.

Journal ArticleDOI
TL;DR: A differential method for optical flow evaluation is presented that employs a new error formulation that ensures a more than satisfactory image reconstruction in those points which are free of motion discontinuity.
Abstract: Optical flow estimation is a recurrent problem in several disciplines and assumes a primary importance in a number of application fields such as medical imaging [12], computer vision [6], productive process control [4], etc. In this paper, a differential method for optical flow evaluation is presented. It employs a new error formulation that ensures a more than satisfactory image reconstruction in those points which are free of motion discontinuity. A dynamic scheme of brightness-sample processing has been used to regularise the motion field. A technique based on the concurrent processing of sequences with multiple pairs of images has also been developed for improving the detection and resolution of mobile objects in the scene, if they exist. This approach permits the detection of motions ranging from a fraction of a pixel to a few pixels per frame. Good results can be achieved even on noisy sequences, without the need for a filtering pre-processing stage. The intrinsic method structure can be exploited for favourable implementation on multi-processor systems with a scalable degree of parallelism. Several sequences, some with noise and presenting various types of motion, have been used for evaluating the performance and effectiveness of the method.

Proceedings ArticleDOI
09 Jul 2006
TL;DR: In this paper, an Artificial Neural Network (ANN) based model reference adaptive controller has been developed for a positioning system with a flexible transmission element, taking into account hard nonlinearities in the motor and load models.
Abstract: An Artificial Neural Network (ANN) based model reference adaptive controller has been developed for a positioning system with a flexible transmission element, taking into account hard nonlinearities in the motor and load models. Due to the presence of Coulomb friction and of the flexible coupling, the inverse model of the system is not realizable. The ability of ANNs to approximate nonlinear functions is exploited to obtain an approximate inverse model for the positioning system and a reference model is used to define the desired error dynamics. The controller uses desired load position and velocity trajectories with measurement of load position, load velocity and motor velocity. The paper describes a VLSI implementation of the controller on a Virtex2 Pro 2VP30 Field Programmable Gate Array (FPGA) from Xilinx. A pipelined adaptation of the on-line back-propagation algorithm is used. The hardware implementation is capable of a high degree of parallelism and pipelining of neural networks allows the controller to operate at even higher speed. The FPGA implementation on the other hand allows fast prototyping and rapid system deployment. The controller can be used to improve both static and dynamic performance of electromechanical systems.

Proceedings ArticleDOI
25 Apr 2006
TL;DR: The results reveal that the solution of implementing a number of ultra low-power processors in compact packaging is an excellent way to achieve extremely high performance in applications with a certain degree of parallelism.
Abstract: In our research project named "Mega-Scale Computing Based on Low-Power Technology and Workload Modeling", we have been developing a prototype cluster based not on ASICs or FPGAs but solely on commodity technology. Its packaging is extremely compact and dense, and its performance/power ratio is very high. Our previous prototype system, named "MegaProto", demonstrated that one cluster unit consisting of 16 commodity low-power processors can be successfully implemented in just a 1U-height chassis while achieving a performance/power ratio up to 2.8 times higher than that of ordinary high-performance dual-Xeon 1U server units. We have improved MegaProto by replacing the CPU and enhancing the I/O performance. The new cluster unit, named "MegaProto/E", with 16 Transmeta Efficeon processors achieves 32 GFlops of peak performance, 2.2-fold greater than that of the original. The cluster unit is equipped with an independent dual network of Gigabit Ethernet, including dual 24-port switches. The maximum power consumption of the cluster unit is 320 W, comparable with that of today's high-end PC servers for high-performance clusters. Performance evaluation using NPB kernels and HPL shows that the performance of MegaProto/E exceeds that of a dual-Xeon server in all the benchmarks, with performance ratios ranging from 1.3 to 3.7. These results reveal that our solution of implementing a number of ultra-low-power processors in compact packaging is an excellent way to achieve extremely high performance in applications with a certain degree of parallelism. We are now building a multi-unit cluster with 128 CPUs (8 units) to prove that this advantage still holds at higher scalability.

Patent
30 Aug 2006
TL;DR: In this paper, a method for encrypting and decrypting data based on the well-known confusion-diffusion paradigm is disclosed. Unlike previous methods, it performs efficient, scalable and flexible encryption by taking full advantage of the way the data to be encrypted is structured.
Abstract: A method for encrypting and decrypting data is disclosed. Unlike previous methods, the present invention performs efficient, scalable and flexible encryption by taking full advantage of the way the data to be encrypted is structured. The method is based on the well-known confusion-diffusion paradigm. Confusion is performed using random invertible (reversible) s-boxes, while diffusion is achieved at the bit level between neighbouring s-boxes. Correctly structuring the data in a topological space allows optimal diffusion to occur. The method is naturally suited to highly efficient hardware implementation. Its scalable nature allows for variable block sizes, which enables greater performance in both hardware and software. The method is based on a two-dimensional lattice and is characterized by its high degree of parallelism.
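A toy software sketch of a confusion-diffusion round of the kind the abstract describes (the patented diffusion pattern on the 2D lattice is not reproduced here; the pairwise XOR mixing below is an invented, invertible stand-in, and all names are illustrative):

```python
import numpy as np

def make_sbox(seed=0):
    """Random invertible byte substitution table and its inverse (confusion)."""
    rng = np.random.default_rng(seed)
    sbox = rng.permutation(256).astype(np.uint8)
    inv = np.empty(256, dtype=np.uint8)
    inv[sbox] = np.arange(256, dtype=np.uint8)
    return sbox, inv

def encrypt_round(block, sbox):
    """One confusion-diffusion round on a 2D byte lattice (even dimensions)."""
    s = sbox[block]             # confusion: per-cell substitution
    s[:, 1::2] ^= s[:, 0::2]    # diffusion: mix neighbouring column pairs
    s[1::2, :] ^= s[0::2, :]    # diffusion: mix neighbouring row pairs
    return s

def decrypt_round(block, inv_sbox):
    """Invert the round: undo diffusion in reverse order, then the s-box."""
    d = block.copy()
    d[1::2, :] ^= d[0::2, :]    # each XOR-assignment is its own inverse
    d[:, 1::2] ^= d[:, 0::2]
    return inv_sbox[d]
```

All cell substitutions and all pair mixes are independent, which is where the high degree of parallelism comes from; a real scheme would iterate many rounds with key-dependent s-boxes.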

Proceedings ArticleDOI
08 Mar 2006
TL;DR: This paper presents parallelization and high speed hardware implementation of a heuristic 2D rectangle placement algorithm and a performance analysis and evaluation of the suggested mapping onto reconfigurable hardware.
Abstract: Many areas of industry involve computationally intensive layout design problems. A high performance computing system for swiftly generating and analyzing layout alternatives is much desired. The performance of such a system is largely affected by the efficiency of the placement algorithm and its degree of parallelism, and the employed hardware platform. In this paper, we present parallelization and high speed hardware implementation of a heuristic 2D rectangle placement algorithm. A performance analysis and evaluation of the suggested mapping onto reconfigurable hardware is also presented.
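A heuristic 2D rectangle placement of the kind this paper accelerates can be sketched with a simple bottom-left skyline strategy; this sequential Python version is an illustrative assumption, not the paper's parallelized algorithm:

```python
def bottom_left_place(rects, width):
    """Greedy bottom-left placement: each (w, h) rectangle goes to the lowest,
    then leftmost, position where it fits within a strip of fixed width."""
    skyline = [0] * width  # current height profile per unit column
    placed = []
    for w, h in rects:
        best_x, best_y = 0, None
        for x in range(width - w + 1):
            y = max(skyline[x:x + w])  # rectangle rests on top of the profile
            if best_y is None or y < best_y:
                best_x, best_y = x, y
        for x in range(best_x, best_x + w):
            skyline[x] = best_y + h
        placed.append((best_x, best_y))
    return placed, max(skyline)
```

Candidate positions for one rectangle can be evaluated independently, which is the degree of parallelism a hardware mapping can exploit.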

Patent
25 May 2006
TL;DR: In this article, automatic selective power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition in a highly parallel programmable data processor.
Abstract: Automatic selective power and energy control of one or more processing elements matches a degree of parallelism to a monitored condition in a highly parallel programmable data processor. For example, logic of the parallel processor detects when program operations (e.g. for a particular task or due to a detected temperature) require less than the full width of the data path. In response, the control logic automatically sets a mode of operation requiring a subset of the parallel processing capacity. At least one parallel processing element that is not needed can be shut down, to conserve energy and/or to reduce heating (i.e., power consumption). At a later time, when operation of the added capacity is appropriate, the logic detects the change in processing conditions and automatically sets the mode of operation to that of the wider data path, typically the full width. The mode change reactivates the previously shut-down processing element.

Patent
Marco Casarsa1
04 May 2006
TL;DR: In this article, a dual compression architecture is proposed for Design For Testability (DFT) of complex System-on-Chip (SoC) Integrated Circuits (ICs), especially those including Embedded Flash memories, where increasingly stringent quality mandates raise the overall test cost, which conflicts with the current trend of reducing device delivery time as much as possible.
Abstract: In the field of Design For Testability (DFT), the increasing complexity of current System-on-Chip (SoC) Integrated Circuits (ICs), especially but not only those including Embedded Flash memories, along with increasingly stringent quality mandates, gives rise to a consequent increase in the overall test cost, which conflicts with the current trend of reducing device delivery time as much as possible. According to the invention, a dual compression architecture is proposed to effectively overcome this problem, exploiting two different adaptive scan architecture configurations to trade off between the opposing requirements of high parallelism needed by some tests (e.g. EWS) and low parallelism needed by others (e.g. FT). One of the two configurations, say MIN CONF, allows testing of a limited number of external scan chains and will be used, for example, during EWS (Electrical Wafer Sort) testing in order to increase the degree of parallelism of testing. The second, say MAX CONF, allows testing of all scan chains and will be used, for example, during FT (Final Test) to reduce the testing time.