
Showing papers on "Degree of parallelism published in 2002"


Proceedings ArticleDOI
03 Jun 2002
TL;DR: It is shown that using a SIMD parallelism of four, the CPU time for the new algorithms is from 10% to more than four times less than for the traditional algorithms, and superlinear speedups are obtained as a result of the elimination of branch misprediction effects.
Abstract: Modern CPUs have instructions that allow basic operations to be performed on several data elements in parallel. These instructions are called SIMD instructions, since they apply a single instruction to multiple data elements. SIMD technology was initially built into commodity processors in order to accelerate the performance of multimedia applications. SIMD instructions provide new opportunities for database engine design and implementation. We study various kinds of operations in a database context, and show how the inner loop of the operations can be accelerated using SIMD instructions. The use of SIMD instructions has two immediate performance benefits: It allows a degree of parallelism, so that many operands can be processed at once. It also often leads to the elimination of conditional branch instructions, reducing branch mispredictions. We consider the most important database operations, including sequential scans, aggregation, index operations, and joins. We present techniques for implementing these using SIMD instructions. We show that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of SIMD technology. Our study shows that using a SIMD parallelism of four, the CPU time for the new algorithms is from 10% to more than four times less than for the traditional algorithms. Superlinear speedups are obtained as a result of the elimination of branch misprediction effects.
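
The branch-elimination idea is easy to see in miniature. The following sketch is an illustration, not code from the paper: it counts the keys smaller than a pivot, four 32-bit lanes at a time, with SSE2 intrinsics and no data-dependent branch in the loop (the function name and the assumption that n is a multiple of four are invented here).

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Count 32-bit keys less than 'pivot', four lanes at a time; assumes n % 4 == 0.
int count_less_simd(const int32_t* keys, int n, int32_t pivot) {
    __m128i piv = _mm_set1_epi32(pivot);         // broadcast pivot into all 4 lanes
    __m128i acc = _mm_setzero_si128();           // per-lane match counters
    for (int i = 0; i < n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i*)(keys + i));
        __m128i mask = _mm_cmplt_epi32(v, piv);  // lane = 0xFFFFFFFF (i.e. -1) where key < pivot
        acc = _mm_sub_epi32(acc, mask);          // subtracting -1 adds 1 per matching lane
    }
    int32_t lane[4];
    _mm_storeu_si128((__m128i*)lane, acc);
    return lane[0] + lane[1] + lane[2] + lane[3];
}

The loop body executes identically for every input, so there is nothing for the branch predictor to mispredict, which is where the paper's superlinear speedups come from.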

291 citations


01 Jan 2002
TL;DR: The compiler techniques of OpenMP pragma- and directive-guided parallelization developed for the high-performance Intel C++/Fortran compiler are presented, along with a performance evaluation of a set of benchmarks and applications.
Abstract: In the never-ending quest for higher performance, CPUs become faster and faster. Processor resources, however, are generally underutilized by many applications. Intel's Hyper-Threading Technology was developed to resolve this issue. This new technology allows a single processor to manage data as if it were two processors by executing data instructions from different threads in parallel rather than serially. Processors enabled with Hyper-Threading Technology can greatly improve the performance of applications with a high degree of parallelism. However, the potential gain is only obtained if an application is multithreaded, by either manual, automatic, or semiautomatic parallelization techniques. This paper presents the compiler techniques of OpenMP pragma- and directive-guided parallelization developed for the high-performance Intel C++/Fortran compiler. We also present a performance evaluation of a set of benchmarks and applications.
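
As a minimal illustration of directive-guided parallelization (a generic OpenMP idiom, not code from the paper), a single pragma is enough to let the compiler distribute a reduction loop across hardware threads, such as the logical processors that Hyper-Threading exposes.

#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;
    // The directive is the entire parallelization effort: iterations are
    // distributed over threads and the partial sums are combined at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (1.0 + i);
    std::printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

Compiled with the appropriate flag (e.g. -fopenmp), the same source runs serially or in parallel, which is what makes pragma-guided parallelization attractive for compiler support.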

81 citations


Patent
29 Mar 2002
TL;DR: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data as discussed by the authors, and the preferred embodiment provides a store and forward architecture configured around a switch with prioritization on data pathways critical to high throughput.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. The preferred embodiment provides a store-and-forward architecture configured around a switch with prioritization on data pathways critical to high throughput.

57 citations


Book ChapterDOI
08 Apr 2002
TL;DR: In this paper, a modification of the unfolding algorithm is presented which can be efficiently parallelized and admits a more efficient implementation; experiments demonstrate that the degree of parallelism is usually quite high, and the resulting algorithms can potentially achieve significant speedup compared with the sequential case.
Abstract: In this paper, we first present theoretical results helping to understand the unfolding algorithm presented in [6,7]. We then propose a modification of this algorithm, which can be efficiently parallelised and admits a more efficient implementation. Our experiments demonstrate that the degree of parallelism is usually quite high and the resulting algorithms can potentially achieve significant speedup compared with the sequential case.

35 citations


Patent
Taketo Heishi, Shuichi Takayama, Tetsuya Tanaka, Hajime Ogawa, Nobuo Higaki
19 Sep 2002
TL;DR: In this article, the processor decodes a number of instructions that is greater than the number of provided computing units and judges their execution conditions with an instruction issue control portion before the execution stage; instructions for which the condition is false are invalidated, and subsequent valid instructions are assigned so that the computing units (hardware) are used efficiently.
Abstract: In order to overcome the problem that conditionally executed instructions are executed as no-operation instructions if their condition is not fulfilled, leading to poor utilization efficiency of the hardware and lowering the effective performance, the processor decodes a number of instructions that is greater than the number of provided computing units and judges their execution conditions with an instruction issue control portion before the execution stage. Instructions for which the condition is false are invalidated, and subsequent valid instructions are assigned so that the computing units (hardware) are used efficiently. A compiler performs scheduling such that the number of instructions whose execution condition is true does not exceed the upper limit of the degree of parallelism of the hardware. The number of instructions arranged in parallel at each cycle may exceed the degree of parallelism of the hardware.
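
The described mechanism is straightforward to model. The toy sketch below is a rendering of the issue logic as the abstract describes it, not the patent's circuitry; the types and names are invented for illustration.

#include <vector>

struct Inst { bool cond_true; int opcode; };  // hypothetical decoded instruction

// The decode window holds more instructions than there are computing units:
// predicated-false instructions are invalidated and free their slot, and
// later valid instructions move up until the unit count is reached.
std::vector<Inst> issue(const std::vector<Inst>& decoded, int units) {
    std::vector<Inst> issued;
    for (const Inst& in : decoded) {
        if (!in.cond_true) continue;             // invalidated: no unit consumed
        if ((int)issued.size() == units) break;  // hardware limit reached
        issued.push_back(in);                    // subsequent valid instruction fills the slot
    }
    return issued;
}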

25 citations


Journal ArticleDOI
TL;DR: It can be proved that DiagRSMarch can identify all stuck-at, transition, state coupling, and dynamic coupling faults occurring in all memory arrays, and experiments show that its test efficiency is highly dependent on memory topology, defect-type distribution, and degree of parallelism.
Abstract: In this paper, the authors propose a new built-in self-diagnosis method to simultaneously diagnose spatially distributed memory modules with different sizes. Based on the serial interfacing technique, the serial fault masking effect is observed and a bidirectional serial interfacing technique is proposed to deal with such an issue. By tolerating redundant read/write operations, they develop a new march algorithm called DiagRSMarch to achieve the goals of low test signal routing overhead, tolerable diagnostic time, and high diagnostic coverage. It can be proved that DiagRSMarch can identify all stuck-at, transition, state coupling, and dynamic coupling faults occurring in all memory arrays. Experimental results also demonstrate that the test efficiency of DiagRSMarch is highly dependent on memory topology, defect-type distribution, and degree of parallelism.
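
The abstract does not spell out the march elements of DiagRSMarch, so as background the sketch below implements a classic algorithm of the same family, March C-, against a software-modeled memory array; a failed read-verify flags a faulty cell.

#include <cstdint>
#include <vector>

// March C-: ⇕(w0); ⇑(r0,w1); ⇑(r1,w0); ⇓(r0,w1); ⇓(r1,w0); ⇕(r0).
// Returns false as soon as any cell misbehaves.
bool march_c_minus(std::vector<uint8_t>& mem) {
    const int n = (int)mem.size();
    auto sweep = [&](bool up, uint8_t expect, uint8_t write) {
        for (int k = 0; k < n; ++k) {
            int a = up ? k : n - 1 - k;   // sweep direction matters for coupling faults
            if (mem[a] != expect) return false;
            mem[a] = write;
        }
        return true;
    };
    for (int a = 0; a < n; ++a) mem[a] = 0;                        // ⇕(w0)
    if (!sweep(true, 0, 1) || !sweep(true, 1, 0)) return false;    // ascending elements
    if (!sweep(false, 0, 1) || !sweep(false, 1, 0)) return false;  // descending elements
    for (int a = 0; a < n; ++a)                                    // ⇕(r0)
        if (mem[a] != 0) return false;
    return true;
}

March algorithms of this kind detect stuck-at, transition, and coupling faults; DiagRSMarch additionally serializes the interface and tolerates redundant read/write operations so that many spatially distributed arrays can be diagnosed at once.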

23 citations


Journal ArticleDOI
TL;DR: The method developed, called the distributed parallel integration evaluation model (DPIEM), models the workflow in the distributed enterprise based on three integration scenarios and minimizes the integrated tasks' total cost by adding as many parallel servers per task as possible.
Abstract: Distribution has become an increasingly common characteristic of modern service and production companies. Enterprises nowadays rely on distribution of their operations for provision of their supplies and labor, and for selling their products in dynamic global markets. Much of today's enterprises' efforts to cope with global markets are directed towards finding effective means of collaboration among their operations and partners. This research proposes a model for assisting distributed enterprises in modeling their operations by optimizing and integrating their workflow to accomplish the collaborative objective. The method developed, called the distributed parallel integration evaluation model (DPIEM), models the workflow in the distributed enterprise based on three integration scenarios. DPIEM minimizes the integrated tasks' total cost by adding as many parallel servers per task as possible. The method was tested for a case of distributed assembly of two part-types. A total of eight scenarios for the case were analyzed, yielding the recommended number of parallel servers per integrated task. For comparison, each scenario was also simulated with the TIE parallel-computer environment. The TIE simulation results corroborate the DPIEM recommendation based on the lowest total cost for the case analyzed.
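
The abstract gives no formulas, so the underlying trade-off can only be illustrated with a toy cost model (invented here, not DPIEM itself): each added parallel server shortens a task but adds server cost, and the cheapest server count is chosen.

#include <cstdio>

// Hypothetical cost model: total = server cost + delay cost of remaining task time.
int best_server_count(double work_hours, double cost_per_server,
                      double delay_cost_per_hour, int max_servers) {
    int best_s = 1;
    double best_cost = 1e300;
    for (int s = 1; s <= max_servers; ++s) {
        double total = s * cost_per_server + (work_hours / s) * delay_cost_per_hour;
        if (total < best_cost) { best_cost = total; best_s = s; }
    }
    return best_s;
}

int main() {
    // invented numbers: 40 hours of work, 100 per server, 30 per hour of delay
    std::printf("recommended servers: %d\n", best_server_count(40.0, 100.0, 30.0, 16));
    return 0;
}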

20 citations


Proceedings ArticleDOI
10 Dec 2002
TL;DR: The feasibility of exploiting hardware parallelism to accelerate the interleaving procedure is demonstrated, and, based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented.
Abstract: Today's communications systems, especially in the field of wireless communications, rely on many different algorithms to provide applications with constantly increasing data rates and higher quality. This development, combined with the wireless channel characteristics as well as the invention of turbo codes, has particularly increased the importance of interleaver algorithms. In this paper we demonstrate the feasibility of exploiting hardware parallelism in order to accelerate the interleaving procedure. Based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented. The parallelization is generic in the sense that the assumed underlying hardware is based on a parallel datapath DSP architecture and therefore provides the flexibility of software solutions.
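
To make the datapath view concrete, here is a simple sketch (an illustration, not the paper's heuristic mapping algorithm) of applying an interleaver permutation with d parallel datapaths; each outer-loop iteration is one step in which d elements move at once.

#include <cstddef>
#include <vector>

// out must be pre-sized to in.size(); pi is a permutation of [0, n).
void interleave(const std::vector<int>& in, const std::vector<std::size_t>& pi,
                std::vector<int>& out, int d) {
    std::size_t n = in.size();
    for (std::size_t base = 0; base < n; base += d)      // one hardware step
        for (int lane = 0; lane < d && base + lane < n; ++lane)
            out[pi[base + lane]] = in[base + lane];      // d writes per step
}

The achievable speedup is bounded by d but degrades when several lanes target the same memory bank in one step, which is exactly the conflict a heuristic mapping has to work around.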

17 citations


Patent
29 Mar 2002
TL;DR: In this paper, a storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data and is configured as an ASIC with a high degree of parallelism in its interconnections.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. The communications architecture provides saturation of user data pathways with low complexity and low latency by employing multiple memory channels under software control, an efficient parity calculation mechanism and other features.

13 citations


Patent
29 Mar 2002
TL;DR: In this paper, a storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data and buffering may be used to maintain clear paths for priority data, such as user data being read or written, on shared channels.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. Buffering may be used to maintain clear paths for priority data, such as user data being read or written, on shared channels.

10 citations


Proceedings ArticleDOI
24 Jun 2002
TL;DR: This paper shows how techniques based on data independence could be used to justify, by means of a finite FDR check, systems where agents can perform an unbounded number of protocol runs, and addresses the issue of capturing the state of mind of internal agents.
Abstract: We carry forward the work described in our previous papers (Broadfoot et al., 2000, Broadfoot and Roscoe, 2002, and Roscoe, 1998) on the application of data independence to the model checking of cryptographic protocols using CSP and FDR. In particular, we showed how techniques based on data independence could be used to justify, by means of a finite FDR check, systems where agents can perform an unbounded number of protocol runs. Whilst this allows for a more complete analysis, there was one significant incompleteness in the results we obtained: While each individual identity could perform an unlimited number of protocol runs sequentially, the degree of parallelism remained bounded. We report significant progress towards the solution of this problem, by "internalising" all or part of each agent identity within the "intruder" process. We consider the case where internal agents do introduce fresh values and address the issue of capturing the state of mind of internal agents (for the purposes of analysis).

Book ChapterDOI
TL;DR: The method contains a high degree of parallelism at different levels of granularity, which can be exploited when designing distributed implementations such as workcrew computation in a master-slave paradigm; a first implementation at the highest granularity level is presented.
Abstract: This paper presents a parallel implementation of a hybrid data mining technique for multivariate heterogeneous time varying processes based on a combination of neuro-fuzzy techniques and genetic algorithms. The purpose is to discover patterns of dependency in general multivariate time-varying systems, and to construct a suitable representation for the function expressing those dependencies. The patterns of dependency are represented by multivariate, non-linear, autoregressive models. Given a set of time series, the models relate future values of one target series with past values of all such series, including itself. The model space is explored with a genetic algorithm, whereas the functional approximation is constructed with a similarity based neuro-fuzzy heterogeneous network. This approach allows rapid prototyping of interesting interdependencies, especially in poorly known complex multivariate processes. This method contains a high degree of parallelism at different levels of granularity, which can be exploited when designing distributed implementations, such as workcrew computation in a master-slave paradigm. In the present paper, a first implementation at the highest granularity level is presented. The implementation was tested for performance and portability in different homogeneous and heterogeneous Beowulf clusters with satisfactory results. An application example with a known time series problem is presented.
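
At the highest granularity level the workcrew pattern reduces to farming out independent fitness evaluations. The sketch below is a generic rendering, not the paper's code: threads stand in for the slave nodes of a Beowulf cluster, and the objective is a placeholder rather than the paper's neuro-fuzzy model.

#include <future>
#include <vector>

std::vector<double> evaluate_population(const std::vector<std::vector<double>>& pop) {
    auto fitness = [](const std::vector<double>& genome) {  // placeholder objective
        double s = 0.0;
        for (double g : genome) s += g * g;
        return -s;
    };
    std::vector<std::future<double>> jobs;
    for (const auto& genome : pop)        // master: one job per individual
        jobs.push_back(std::async(std::launch::async, fitness, genome));
    std::vector<double> fit;
    for (auto& j : jobs)                  // master: gather results in order
        fit.push_back(j.get());
    return fit;
}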

Journal ArticleDOI
TL;DR: A possible methodology for application design at the architectural level, targeted to embedded systems built upon multicore chipsets with a low degree of parallelism, is proposed; it makes use of performance predictions obtained by simulations.
Abstract: Many software applications demanding a considerable computing power are moving towards the field of embedded systems (and, in particular, hand-held devices). A possible way to increase the computing power of this kind of platform, so that both cost and power consumption are kept low, is the employment of multiple CPU cores on the same chipset. Consequently, it is essential to design applications that meet performance requirements leveraging the underlying parallel platform. As embedded applications are usually built using different components (whose source code is often not available) from different companies, the designer can mostly only operate at the architectural level. So far, methodologies for designing software architectures have mainly addressed general-purpose systems, often relying on hardware platforms with a high degree of parallelism. In this paper, we present our experience in architectural design of parallel embedded applications; as a result, we propose a possible methodology for the application design at the architectural level, targeted to embedded systems built upon multicore chipsets with a low degree of parallelism. It makes use of performance predictions, obtained by simulations. Such a methodology can be employed both for retargeting existing sequential applications to parallel processing platforms and for designing complete applications from scratch. We show the application of the proposed methodology to an embedded digital cartographic system. Starting with a software description using UML diagrams, candidate software architectures (utilizing different parallel solutions) are first defined and then evaluated, to end with the selection of the one yielding the highest performance gain.

Proceedings ArticleDOI
15 Apr 2002
TL;DR: It is shown that both coevolutionary algorithms outperform a sequential GA; the LCGA may be recommended for optimization systems where a high degree of parallelism is possible and only non-global coordination is expected, while the CCGA algorithm is useful where a low degree of parallelism and global coordination are acceptable.
Abstract: The problem of parallel and distributed function optimization is considered. Two coevolutionary algorithms with different degrees of parallelism and different levels of global coordination are used for this purpose and compared with a sequential genetic algorithm (GA). The first coevolutionary algorithm, called a loosely coupled genetic algorithm (LCGA), represents a competitive coevolutionary approach to problem solving and is compared with another coevolutionary algorithm called the cooperative coevolutionary genetic algorithm (CCGA). The algorithms are applied for parallel and distributed optimization of a number of test functions known in the area of evolutionary computation. We show that both coevolutionary algorithms outperform a sequential GA. While both the LCGA and CCGA algorithms offer high-quality solutions, they may compete to outperform each other on some specific test optimization problems. The LCGA may be recommended for optimization systems where a high degree of parallelism is possible and only non-global coordination is expected, while the CCGA algorithm is useful where a low degree of parallelism and global coordination are acceptable.
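
For reference, the cooperative scheme can be sketched in a few lines. This follows the general CCGA recipe of Potter and De Jong rather than the paper's exact setup: one subpopulation per coordinate, each candidate scored inside a context vector assembled from the other subpopulations' current best members; the mutation operator and objective are placeholders.

#include <random>
#include <vector>

double sphere(const std::vector<double>& x) {   // stand-in objective, minimized
    double s = 0.0;
    for (double v : x) s += v * v;
    return s;
}

// One cooperative generation: evolve each coordinate's subpopulation in turn,
// evaluating candidates against the current best values of the other coordinates.
void ccga_step(std::vector<std::vector<double>>& subpops,
               std::vector<double>& best, std::mt19937& rng) {
    std::normal_distribution<double> mut(0.0, 0.1);
    for (std::size_t i = 0; i < subpops.size(); ++i) {
        for (double& candidate : subpops[i]) {
            candidate += mut(rng);                // trivial mutation
            std::vector<double> trial = best;     // collaborators: current bests
            trial[i] = candidate;
            if (sphere(trial) < sphere(best)) best = trial;
        }
    }
}

Each subpopulation can evolve on its own processor, but the shared best vector is a global coordination point, which is why the paper associates CCGA with a lower degree of parallelism than the loosely coupled LCGA.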

Proceedings Article
01 Jan 2002
TL;DR: The concept of disjoint faults is extended to reduce the number of tests to a time efficiency of Θ(N^(5/6)) for N×N DOMINs, and a two-phase diagnosis algorithm is proposed to reduce the testing requirement to 4N tests.
Abstract: Dilated Optical Multistage Interconnection Networks (DOMINs) based on 2 × 2 directional coupler photonic switches play an important role in all-optical high-performance networks, especially for the emerging IP over DWDM architectures. The problem of crosstalk within photonic switches is underestimated; it arises due to the aging of the switching element, control voltage, temperature and polarization, and thus causes undesirable coupling of the signal from one path to the other. Previous work [18] designed an efficient algorithm for diagnosing disjoint faults in small-sized networks, which reduced the number of tests required in photonic switching networks to one half by overlapping the tests with computations. Furthermore, this paper generically derives algorithms and mathematical modules to find the optimal degree of parallelism of fault diagnosis for N × N dilated blocking networks as the size of the network grows. Taking advantage of the properties of disjoint faults, diagnosis can be accelerated significantly because the optimal degree of parallel fault diagnosis may grow exponentially. To reduce the diagnosis time, an algorithm is proposed herein to find the maximum number of disjoint faults. Rather than requiring up to 4MN tests as a naive approach would, a two-phase diagnosis algorithm is proposed to reduce the testing requirement to 4N tests. This study extends the concept of disjoint faults to reduce the number of tests to a time efficiency of Θ(N^(5/6)) for N×N DOMINs.

Patent
18 Oct 2002
TL;DR: In this paper, the authors proposed a symmetric-key cryptographic processing technique capable of realizing both high-speed cryptographic processing having a high degree of parallelism, and alteration detection.
Abstract: PROBLEM TO BE SOLVED: To provide a symmetric-key cryptographic processing technique capable of realizing both high-speed cryptographic processing having a high degree of parallelism, and alteration detection. SOLUTION: A redundancy is added to the data to be encrypted, and the resulting data are encrypted using a key stream whose length is strictly longer than the resulting data, producing a ciphertext. In the corresponding decryption, verifying that the redundancy decrypts correctly detects alteration during communication. The success probability of the alteration detection can be evaluated, and the processing, with its high parallelism and smaller implementation scale, is faster than that of a block cipher.
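
The scheme as described can be sketched directly: pad with a known redundancy block, XOR with a keystream, and check the redundancy after decryption. The code below is a toy illustration; the keystream generator is a stand-in with no security properties, and the 16-byte 0x5a redundancy block is an arbitrary choice.

#include <cstdint>
#include <vector>

static uint8_t toy_keystream(uint64_t key, std::size_t i) {   // NOT a secure generator
    uint64_t x = key + 0x9e3779b97f4a7c15ULL * (uint64_t)(i + 1);
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return (uint8_t)x;
}

std::vector<uint8_t> encrypt(std::vector<uint8_t> msg, uint64_t key) {
    msg.insert(msg.end(), 16, 0x5a);                 // append redundancy block
    for (std::size_t i = 0; i < msg.size(); ++i)
        msg[i] ^= toy_keystream(key, i);             // bytes are independent, so
    return msg;                                      // lanes can run in parallel
}

bool decrypt_and_check(std::vector<uint8_t>& ct, uint64_t key) {
    if (ct.size() < 16) return false;
    for (std::size_t i = 0; i < ct.size(); ++i) ct[i] ^= toy_keystream(key, i);
    for (std::size_t i = ct.size() - 16; i < ct.size(); ++i)
        if (ct[i] != 0x5a) return false;             // redundancy damaged: altered
    ct.resize(ct.size() - 16);                       // strip redundancy
    return true;
}

Because each byte is encrypted independently of its neighbors, the XOR loops parallelize trivially, unlike the chained rounds of a block cipher mode.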

Proceedings ArticleDOI
28 Oct 2002
TL;DR: A new asynchronous, coarse-grain parallel genetic algorithm model is proposed and a design pattern is presented which can achieve a high degree of parallelism, can handle the high communication latency and low communication bandwidth of the Internet, and can contribute to building more robust Web-based parallel GAs.
Abstract: Aiming at developing more efficient and robust genetic algorithms (GAs) over the Internet, a new asynchronous, coarse-grain parallel genetic algorithm model is proposed in this paper. On the basis of the model, we present a design pattern for Web-based parallel GAs, which captures design solutions to core problems in implementing Web-based parallel GAs. This design pattern can achieve a high degree of parallelism, can handle the high communication latency and low communication bandwidth of the Internet, and can contribute to building more robust Web-based parallel GAs.
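
Asynchrony is the key property: an island must never block waiting for migrants over a slow link. The sketch below (a hypothetical structure, not the paper's Web-based design pattern) shows a non-blocking mailbox through which coarse-grain islands could exchange individuals.

#include <deque>
#include <mutex>
#include <vector>

// Non-blocking migrant exchange: post() never waits, and poll() returns
// immediately whether or not a migrant has arrived, so high Internet latency
// delays migration but never stalls an island's evolution loop.
struct Mailbox {
    std::mutex m;
    std::deque<std::vector<double>> migrants;
    void post(std::vector<double> g) {
        std::lock_guard<std::mutex> lk(m);
        migrants.push_back(std::move(g));
    }
    bool poll(std::vector<double>& g) {
        std::lock_guard<std::mutex> lk(m);
        if (migrants.empty()) return false;
        g = std::move(migrants.front());
        migrants.pop_front();
        return true;
    }
};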

Journal Article
TL;DR: In this article, the degree of parallelism in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion), was investigated.
Abstract: We investigate the degree of parallelism (or modularity) in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion). In technical terms, we study the family relation on redexes in λH, and the contribution relation on redex-families, and show that the latter is a forest (as a partial order). This means that hyperbalanced λ-terms allow for maximal possible parallelism in computation. To prove our results, we use and further refine, for the case of hyperbalanced terms, some well known results concerning paths, which allow for static analysis of many fundamental properties of β-reduction.

Book ChapterDOI
22 Jul 2002
TL;DR: The family relation on redexes in λH and the contribution relation on redex-families are studied, and the latter is shown to be a forest as a partial order, meaning that hyperbalanced λ-terms allow for maximal possible parallelism in computation.
Abstract: We investigate the degree of parallelism (or modularity) in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion). In technical terms, we study the family relation on redexes in λH, and the contribution relation on redex-families, and show that the latter is a forest (as a partial order). This means that hyperbalanced λ-terms allow for maximal possible parallelism in computation. To prove our results, we use and further refine, for the case of hyperbalanced terms, some well known results concerning paths, which allow for static analysis of many fundamental properties of β-reduction.
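
As a small worked example (constructed here, not taken from the paper): when two redexes occupy disjoint subterms, neither redex-family contributes to the other, so both can be contracted in a single parallel step.

\[
\big((\lambda x.\,x\,x)\,a\big)\,\big((\lambda y.\,y)\,b\big)\;\longrightarrow\;(a\,a)\,b
\]

The two β-redexes here are unrelated under the contribution relation, the forest-shaped situation the paper proves holds for all redex-families of hyperbalanced terms.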

Patent
20 Sep 2002
TL;DR: In this article, a compiler performs scheduling such that the number of instructions whose execution condition is true does not exceed the upper limit of the degree of parallelism of the hardware, so that the computing units (hardware) are used efficiently.
Abstract: In order to overcome the problem that conditionally executed instructions are executed as no-operation instructions if their condition is not fulfilled, leading to poor utilization efficiency of the hardware and lowering the effective performance, the processor decodes a number of instructions that is greater than the number of provided computing units and judges their execution conditions with an instruction issue control portion before the execution stage. Instructions for which the condition is false are invalidated, and subsequent valid instructions are assigned so that the computing units (hardware) are used efficiently. A compiler performs scheduling such that the number of instructions whose execution condition is true does not exceed the upper limit of the degree of parallelism of the hardware. The number of instructions arranged in parallel at each cycle may exceed the degree of parallelism of the hardware.

Book ChapterDOI
01 Jan 2002
TL;DR: The scheme is a parallelization of the dynamic programming method of evaluating minimum edit distances between the pattern and any substring of the reference string, and is extendible for any pattern size and any parallelism degree.
Abstract: Parallel broadcasting provides an efficient parallel hardware solution to the k-mismatches problem, which is a version of the approximate string matching problem. The scheme is a parallelization of the dynamic programming method of evaluating minimum edit distances between the pattern and any substring of the reference string. Implicit parallelism based on dataflow and explicit parallelism based on a parallel broadcasting mechanism exploit the maximum parallelism. The time complexity of the proposed parallel scheme is O(((n−m)/d) + m), where n and m represent the lengths of the reference and the pattern strings respectively, and d is the degree of parallelism, which is controllable. The proposed scheme is suitable for VLSI implementation. The design is based on a linear systolic array of a simple basic cell, and is extendible for any pattern size and any parallelism degree. For serial and parallel designs, m and d*m identical processing elements are needed, respectively.
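
For orientation, here is the serial form of the dynamic programming the scheme parallelizes (Sellers-style approximate matching; a sketch, not the paper's cell design). The systolic array gains its parallelism by evaluating independent cells of this table concurrently, whereas this version fills one column per text character.

#include <algorithm>
#include <string>
#include <vector>

// D[i][j] = minimum edit distance between pat[0..i) and some substring of
// text ending at position j; report end positions j where D[m][j] <= k.
std::vector<int> match_end_positions(const std::string& text,
                                     const std::string& pat, int k) {
    int m = (int)pat.size(), n = (int)text.size();
    std::vector<int> prev(m + 1), cur(m + 1), hits;
    for (int i = 0; i <= m; ++i) prev[i] = i;   // empty text: i deletions
    for (int j = 1; j <= n; ++j) {
        cur[0] = 0;                             // a match may start anywhere
        for (int i = 1; i <= m; ++i)
            cur[i] = std::min({ prev[i - 1] + (pat[i - 1] != text[j - 1]),  // match/substitute
                                prev[i] + 1,        // skip a text character
                                cur[i - 1] + 1 });  // skip a pattern character
        if (cur[m] <= k) hits.push_back(j);
        std::swap(prev, cur);
    }
    return hits;
}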

Proceedings Article
22 Jul 2002
TL;DR: In this paper, the degree of parallelism in the hyperbalanced λ-calculus, λH, a subcalculus of the λ-calculus containing all simply typable terms (up to a restricted η-expansion), was investigated.
Abstract: We investigate the degree of parallelism (or modularity) in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion). In technical terms, we study the family relation on redexes in λH, and the contribution relation on redex-families, and show that the latter is a forest (as a partial order). This means that hyperbalanced λ-terms allow for maximal possible parallelism in computation. To prove our results, we use and further refine, for the case of hyperbalanced terms, some well known results concerning paths, which allow for static analysis of many fundamental properties of β-reduction.

Book ChapterDOI
15 Jun 2002
TL;DR: Several parallelism-independent algorithms are proposed which are applicable either to distributed computing systems, i.e. systems of autonomous processors connected via communication links, or to tightly coupled multiprocessor systems and architectures exploiting instruction-level parallelism.
Abstract: The objective of parallelism-independent (PI) scheduling is minimization of the completion time of a parallel application for any number of processing elements in the computing system. We propose several parallelism-independent algorithms which are applicable either to distributed computing systems, i.e. systems of autonomous processors connected via communication links (in this case we provide explicit message communication scheduling), or to tightly coupled multiprocessor systems and architectures exploiting instruction-level parallelism. The algorithms are hybrid but predominantly performed at compile time in order to reduce the dynamic overhead and scheduling hardware. All traditional static scheduling algorithms produce machine code with a fixed degree of parallelism, which cannot be executed efficiently on computer systems with different degrees of parallelism. Our algorithms eliminate this problem, which is closely related to the distribution of parallel programs.
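
The abstract does not detail the PI algorithms themselves, so as a baseline here is a generic list-scheduling sketch (not one of the paper's algorithms): the same task graph is scheduled for an arbitrary processor count p, the parameter over which parallelism-independent code must remain efficient.

#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// dur[u] = duration of task u; succ[u] = successors of u in the task DAG.
// Each ready task goes to whichever of the p processors frees up first;
// returns the resulting completion time (makespan). Communication is ignored.
int makespan(const std::vector<int>& dur,
             const std::vector<std::vector<int>>& succ, int p) {
    int n = (int)dur.size();
    std::vector<int> indeg(n, 0), ready_at(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : succ[u]) ++indeg[v];
    using Ev = std::pair<int, int>;                                // (time, task)
    std::priority_queue<Ev, std::vector<Ev>, std::greater<Ev>> ready;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) ready.push({0, u});
    std::priority_queue<int, std::vector<int>, std::greater<int>> proc;  // free times
    for (int i = 0; i < p; ++i) proc.push(0);
    int finish_all = 0;
    while (!ready.empty()) {
        auto [t, u] = ready.top(); ready.pop();
        int start = std::max(t, proc.top()); proc.pop();   // earliest-free processor
        int finish = start + dur[u];
        proc.push(finish);
        finish_all = std::max(finish_all, finish);
        for (int v : succ[u]) {                            // release successors
            ready_at[v] = std::max(ready_at[v], finish);
            if (--indeg[v] == 0) ready.push({ready_at[v], v});
        }
    }
    return finish_all;
}

Running the same scheduler with different values of p illustrates the property the paper targets: the code, not just the schedule, stays valid for any degree of parallelism.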