
Showing papers on "Degree of parallelism published in 2002"


Proceedings ArticleDOI
03 Jun 2002
TL;DR: It is shown that using a SIMD parallelism of four, the CPU time for the new algorithms is from 10% to more than four times less than for the traditional algorithms, and superlinear speedups are obtained as a result of the elimination of branch misprediction effects.
Abstract: Modern CPUs have instructions that allow basic operations to be performed on several data elements in parallel. These instructions are called SIMD instructions, since they apply a single instruction to multiple data elements. SIMD technology was initially built into commodity processors in order to accelerate the performance of multimedia applications. SIMD instructions provide new opportunities for database engine design and implementation. We study various kinds of operations in a database context, and show how the inner loop of the operations can be accelerated using SIMD instructions. The use of SIMD instructions has two immediate performance benefits: It allows a degree of parallelism, so that many operands can be processed at once. It also often leads to the elimination of conditional branch instructions, reducing branch mispredictions. We consider the most important database operations, including sequential scans, aggregation, index operations, and joins. We present techniques for implementing these using SIMD instructions. We show that there are significant benefits in redesigning traditional query processing algorithms so that they can make better use of SIMD technology. Our study shows that using a SIMD parallelism of four, the CPU time for the new algorithms is from 10% to more than four times less than for the traditional algorithms. Superlinear speedups are obtained as a result of the elimination of branch misprediction effects.
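
The branch-elimination idea is easy to see in miniature. The following sketch is an illustration, not code from the paper: it counts the keys smaller than a pivot, four 32-bit lanes at a time, with SSE2 intrinsics and no data-dependent branch in the loop (the function name and the assumption that n is a multiple of four are invented here).

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Count 32-bit keys less than 'pivot', four lanes at a time; assumes n % 4 == 0.
int count_less_simd(const int32_t* keys, int n, int32_t pivot) {
    __m128i piv = _mm_set1_epi32(pivot);         // broadcast pivot into all 4 lanes
    __m128i acc = _mm_setzero_si128();           // per-lane match counters
    for (int i = 0; i < n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i*)(keys + i));
        __m128i mask = _mm_cmplt_epi32(v, piv);  // lane = 0xFFFFFFFF (i.e. -1) where key < pivot
        acc = _mm_sub_epi32(acc, mask);          // subtracting -1 adds 1 per matching lane
    }
    int32_t lane[4];
    _mm_storeu_si128((__m128i*)lane, acc);
    return lane[0] + lane[1] + lane[2] + lane[3];
}

The loop body executes identically for every input, so there is nothing for the branch predictor to mispredict, which is where the paper's superlinear speedups come from.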

291 citations


01 Jan 2002
TL;DR: The compiler techniques of OpenMP pragma- and directive-guided parallelization developed for the high-performance Intel C++/Fortran compiler are presented, along with a performance evaluation of a set of benchmarks and applications.
Abstract: In the never-ending quest for higher performance, CPUs become faster and faster. Processor resources, however, are generally underutilized by many applications. Intel's Hyper-Threading Technology was developed to resolve this issue. This new technology allows a single processor to manage data as if it were two processors by executing data instructions from different threads in parallel rather than serially. Processors enabled with Hyper-Threading Technology can greatly improve the performance of applications with a high degree of parallelism. However, the potential gain is only obtained if an application is multithreaded, by either manual, automatic, or semiautomatic parallelization techniques. This paper presents the compiler techniques of OpenMP pragma- and directive-guided parallelization developed for the high-performance Intel C++/Fortran compiler. We also present a performance evaluation of a set of benchmarks and applications.
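
As a minimal illustration of directive-guided parallelization (a generic OpenMP idiom, not code from the paper), a single pragma is enough to let the compiler distribute a reduction loop across hardware threads, such as the logical processors that Hyper-Threading exposes.

#include <cstdio>
#include <omp.h>

int main() {
    const int n = 1 << 20;
    double sum = 0.0;
    // The directive is the entire parallelization effort: iterations are
    // distributed over threads and the partial sums are combined at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i)
        sum += 1.0 / (1.0 + i);
    std::printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

Compiled with the appropriate flag (e.g. -fopenmp), the same source runs serially or in parallel, which is what makes pragma-guided parallelization attractive for compiler support.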

81 citations


Patent
29 Mar 2002
TL;DR: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data as discussed by the authors, and the preferred embodiment provides a store and forward architecture configured around a switch with prioritization on data pathways critical to high throughput.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. The preferred embodiment provides a store-and-forward architecture configured around a switch with prioritization on data pathways critical to high throughput.

57 citations


Book ChapterDOI
08 Apr 2002
TL;DR: In this paper, a modification of the unfolding algorithm is presented which can be efficiently parallelized and admits a more efficient implementation; experiments demonstrate that the degree of parallelism is usually quite high, and the resulting algorithms can potentially achieve significant speedup compared with the sequential case.
Abstract: In this paper, we first present theoretical results helping to understand the unfolding algorithm presented in [6,7]. We then propose a modification of this algorithm, which can be efficiently parallelised and admits a more efficient implementation. Our experiments demonstrate that the degree of parallelism is usually quite high and the resulting algorithms can potentially achieve significant speedup compared with the sequential case.

35 citations


Patent
Taketo Heishi, Shuichi Takayama, Tetsuya Tanaka, Hajime Ogawa, Nobuo Higaki
19 Sep 2002
TL;DR: In this article, the processor decodes a number of instructions that is greater than the number of provided computing units and judges their execution conditions with an instruction issue control portion before the execution stage; instructions for which the condition is false are invalidated, and subsequent valid instructions are assigned so that the computing units (hardware) are used efficiently.
Abstract: In order to overcome the problem that conditionally executed instructions are executed as no-operation instructions if their condition is not fulfilled, leading to poor utilization efficiency of the hardware and lowering the effective performance, the processor decodes a number of instructions that is greater than the number of provided computing units and judges their execution conditions with an instruction issue control portion before the execution stage. Instructions for which the condition is false are invalidated, and subsequent valid instructions are assigned so that the computing units (hardware) are used efficiently. A compiler performs scheduling such that the number of instructions whose execution condition is true does not exceed the upper limit of the degree of parallelism of the hardware. The number of instructions arranged in parallel at each cycle may exceed the degree of parallelism of the hardware.
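
The described mechanism is straightforward to model. The toy sketch below is a rendering of the issue logic as the abstract describes it, not the patent's circuitry; the types and names are invented for illustration.

#include <vector>

struct Inst { bool cond_true; int opcode; };  // hypothetical decoded instruction

// The decode window holds more instructions than there are computing units:
// predicated-false instructions are invalidated and free their slot, and
// later valid instructions move up until the unit count is reached.
std::vector<Inst> issue(const std::vector<Inst>& decoded, int units) {
    std::vector<Inst> issued;
    for (const Inst& in : decoded) {
        if (!in.cond_true) continue;             // invalidated: no unit consumed
        if ((int)issued.size() == units) break;  // hardware limit reached
        issued.push_back(in);                    // subsequent valid instruction fills the slot
    }
    return issued;
}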

25 citations


Journal ArticleDOI
TL;DR: It can be proved that DiagRSMarch can identify all stuck-at, transition, state coupling, and dynamic coupling faults occurring in all memory arrays, and experiments show that its test efficiency is highly dependent on memory topology, defect-type distribution, and degree of parallelism.
Abstract: In this paper, the authors propose a new built-in self-diagnosis method to simultaneously diagnose spatially distributed memory modules with different sizes. Based on the serial interfacing technique, the serial fault masking effect is observed and a bidirectional serial interfacing technique is proposed to deal with such an issue. By tolerating redundant read/write operations, they develop a new march algorithm called DiagRSMarch to achieve the goals of low test signal routing overhead, tolerable diagnostic time, and high diagnostic coverage. It can be proved that DiagRSMarch can identify all stuck-at, transition, state coupling, and dynamic coupling faults occurring in all memory arrays. Experimental results also demonstrate that the test efficiency of DiagRSMarch is highly dependent on memory topology, defect-type distribution, and degree of parallelism.
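
The abstract does not spell out the march elements of DiagRSMarch, so as background the sketch below implements a classic algorithm of the same family, March C-, against a software-modeled memory array; a failed read-verify flags a faulty cell.

#include <cstdint>
#include <vector>

// March C-: ⇕(w0); ⇑(r0,w1); ⇑(r1,w0); ⇓(r0,w1); ⇓(r1,w0); ⇕(r0).
// Returns false as soon as any cell misbehaves.
bool march_c_minus(std::vector<uint8_t>& mem) {
    const int n = (int)mem.size();
    auto sweep = [&](bool up, uint8_t expect, uint8_t write) {
        for (int k = 0; k < n; ++k) {
            int a = up ? k : n - 1 - k;   // sweep direction matters for coupling faults
            if (mem[a] != expect) return false;
            mem[a] = write;
        }
        return true;
    };
    for (int a = 0; a < n; ++a) mem[a] = 0;                        // ⇕(w0)
    if (!sweep(true, 0, 1) || !sweep(true, 1, 0)) return false;    // ascending elements
    if (!sweep(false, 0, 1) || !sweep(false, 1, 0)) return false;  // descending elements
    for (int a = 0; a < n; ++a)                                    // ⇕(r0)
        if (mem[a] != 0) return false;
    return true;
}

March algorithms of this kind detect stuck-at, transition, and coupling faults; DiagRSMarch additionally serializes the interface and tolerates redundant read/write operations so that many spatially distributed arrays can be diagnosed at once.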

23 citations


Journal ArticleDOI
TL;DR: The method developed, called the distributed parallel integration evaluation model (DPIEM), models the workflow in the distributed enterprise based on three integration scenarios and minimizes the integrated tasks' total cost by adding as many parallel servers per task as possible.
Abstract: Distribution has become an increasingly common characteristic of modern service and production companies. Enterprises nowadays rely on distribution of their operations for provision of their supplies and labor, and for selling their products in dynamic global markets. Much of today's enterprises' efforts to cope with global markets are directed towards finding effective means of collaboration among their operations and partners. This research proposes a model for assisting distributed enterprises in modeling their operations by optimizing and integrating their workflow to accomplish the collaborative objective. The method developed, called the distributed parallel integration evaluation model (DPIEM), models the workflow in the distributed enterprise based on three integration scenarios. DPIEM minimizes the integrated tasks' total cost by adding as many parallel servers per task as possible. The method was tested for a case of distributed assembly of two part-types. A total of eight scenarios for the case were analyzed, yielding the recommended number of parallel servers per integrated task. For comparison, each scenario was also simulated with the TIE parallel-computer environment. The TIE simulation results corroborate the DPIEM recommendation based on the lowest total cost for the case analyzed.
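
The abstract gives no formulas, so the underlying trade-off can only be illustrated with a toy cost model (invented here, not DPIEM itself): each added parallel server shortens a task but adds server cost, and the cheapest server count is chosen.

#include <cstdio>

// Hypothetical cost model: total = server cost + delay cost of remaining task time.
int best_server_count(double work_hours, double cost_per_server,
                      double delay_cost_per_hour, int max_servers) {
    int best_s = 1;
    double best_cost = 1e300;
    for (int s = 1; s <= max_servers; ++s) {
        double total = s * cost_per_server + (work_hours / s) * delay_cost_per_hour;
        if (total < best_cost) { best_cost = total; best_s = s; }
    }
    return best_s;
}

int main() {
    // invented numbers: 40 hours of work, 100 per server, 30 per hour of delay
    std::printf("recommended servers: %d\n", best_server_count(40.0, 100.0, 30.0, 16));
    return 0;
}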

20 citations


Proceedings ArticleDOI
10 Dec 2002
TL;DR: The feasibility of exploiting hardware parallelism to accelerate the interleaving procedure is demonstrated, and, based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented.
Abstract: Today's communications systems, especially in the field of wireless communications, rely on many different algorithms to provide applications with constantly increasing data rates and higher quality. This development, combined with the wireless channel characteristics as well as the invention of turbo codes, has particularly increased the importance of interleaver algorithms. In this paper we demonstrate the feasibility of exploiting hardware parallelism in order to accelerate the interleaving procedure. Based on a heuristic algorithm, the possible speedup for different interleavers as a function of the degree of parallelism of the hardware is presented. The parallelization is generic in the sense that the assumed underlying hardware is based on a parallel datapath DSP architecture and therefore provides the flexibility of software solutions.
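
To make the datapath view concrete, here is a simple sketch (an illustration, not the paper's heuristic mapping algorithm) of applying an interleaver permutation with d parallel datapaths; each outer-loop iteration is one step in which d elements move at once.

#include <cstddef>
#include <vector>

// out must be pre-sized to in.size(); pi is a permutation of [0, n).
void interleave(const std::vector<int>& in, const std::vector<std::size_t>& pi,
                std::vector<int>& out, int d) {
    std::size_t n = in.size();
    for (std::size_t base = 0; base < n; base += d)      // one hardware step
        for (int lane = 0; lane < d && base + lane < n; ++lane)
            out[pi[base + lane]] = in[base + lane];      // d writes per step
}

The achievable speedup is bounded by d but degrades when several lanes target the same memory bank in one step, which is exactly the conflict a heuristic mapping has to work around.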

17 citations


Patent
29 Mar 2002
TL;DR: In this paper, a storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data and is configured as an ASIC with a high degree of parallelism in its interconnections.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. The communications architecture provides saturation of user data pathways with low complexity and low latency by employing multiple memory channels under software control, an efficient parity calculation mechanism and other features.

13 citations


Patent
29 Mar 2002
TL;DR: In this paper, a storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data and buffering may be used to maintain clear paths for priority data, such as user data being read or written, on shared channels.
Abstract: A storage processor particularly suited to RAID systems provides high throughput for applications such as streaming video data. An embodiment is configured as an ASIC with a high degree of parallelism in its interconnections. Buffering may be used to maintain clear paths for priority data, such as user data being read or written, on shared channels.

10 citations


Proceedings ArticleDOI
24 Jun 2002
TL;DR: This paper shows how techniques based on data independence could be used to justify, by means of a finite FDR check, systems where agents can perform an unbounded number of protocol runs, and addresses the issue of capturing the state of mind of internal agents.
Abstract: We carry forward the work described in our previous papers (Broadfoot et al., 2000, Broadfoot and Roscoe, 2002, and Roscoe, 1998) on the application of data independence to the model checking of cryptographic protocols using CSP and FDR. In particular, we showed how techniques based on data independence could be used to justify, by means of a finite FDR check, systems where agents can perform an unbounded number of protocol runs. Whilst this allows for a more complete analysis, there was one significant incompleteness in the results we obtained: While each individual identity could perform an unlimited number of protocol runs sequentially, the degree of parallelism remained bounded. We report significant progress towards the solution of this problem, by "internalising" all or part of each agent identity within the "intruder" process. We consider the case where internal agents do introduce fresh values and address the issue of capturing the state of mind of internal agents (for the purposes of analysis).

Book ChapterDOI
TL;DR: The method contains a high degree of parallelism at different levels of granularity, which can be exploited when designing distributed implementations such as workcrew computation in a master-slave paradigm; a first implementation at the highest granularity level is presented.
Abstract: This paper presents a parallel implementation of a hybrid data mining technique for multivariate heterogeneous time varying processes based on a combination of neuro-fuzzy techniques and genetic algorithms. The purpose is to discover patterns of dependency in general multivariate time-varying systems, and to construct a suitable representation for the function expressing those dependencies. The patterns of dependency are represented by multivariate, non-linear, autoregressive models. Given a set of time series, the models relate future values of one target series with past values of all such series, including itself. The model space is explored with a genetic algorithm, whereas the functional approximation is constructed with a similarity based neuro-fuzzy heterogeneous network. This approach allows rapid prototyping of interesting interdependencies, especially in poorly known complex multivariate processes. This method contains a high degree of parallelism at different levels of granularity, which can be exploited when designing distributed implementations, such as workcrew computation in a master-slave paradigm. In the present paper, a first implementation at the highest granularity level is presented. The implementation was tested for performance and portability in different homogeneous and heterogeneous Beowulf clusters with satisfactory results. An application example with a known time series problem is presented.
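
At the highest granularity level the workcrew pattern reduces to farming out independent fitness evaluations. The sketch below is a generic rendering, not the paper's code: threads stand in for the slave nodes of a Beowulf cluster, and the objective is a placeholder rather than the paper's neuro-fuzzy model.

#include <future>
#include <vector>

std::vector<double> evaluate_population(const std::vector<std::vector<double>>& pop) {
    auto fitness = [](const std::vector<double>& genome) {  // placeholder objective
        double s = 0.0;
        for (double g : genome) s += g * g;
        return -s;
    };
    std::vector<std::future<double>> jobs;
    for (const auto& genome : pop)        // master: one job per individual
        jobs.push_back(std::async(std::launch::async, fitness, genome));
    std::vector<double> fit;
    for (auto& j : jobs)                  // master: gather results in order
        fit.push_back(j.get());
    return fit;
}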

Journal ArticleDOI
TL;DR: A possible methodology for application design at the architectural level, targeted to embedded systems built upon multicore chipsets with a low degree of parallelism, is proposed; it makes use of performance predictions obtained by simulations.
Abstract: Many software applications demanding a considerable computing power are moving towards the field of embedded systems (and, in particular, hand-held devices). A possible way to increase the computing power of this kind of platform, so that both cost and power consumption are kept low, is the employment of multiple CPU cores on the same chipset. Consequently, it is essential to design applications that meet performance requirements leveraging the underlying parallel platform. As embedded applications are usually built using different components (whose source code is often not available) from different companies, the designer can mostly only operate at the architectural level. So far, methodologies for designing software architectures have mainly addressed general-purpose systems, often relying on hardware platforms with a high degree of parallelism. In this paper, we present our experience in architectural design of parallel embedded applications; as a result, we propose a possible methodology for the application design at the architectural level, targeted to embedded systems built upon multicore chipsets with a low degree of parallelism. It makes use of performance predictions, obtained by simulations. Such a methodology can be employed both for retargeting existing sequential applications to parallel processing platforms and for designing complete applications from scratch. We show the application of the proposed methodology to an embedded digital cartographic system. Starting with a software description using UML diagrams, candidate software architectures (utilizing different parallel solutions) are first defined and then evaluated, to end with the selection of the one yielding the highest performance gain.

Proceedings ArticleDOI
15 Apr 2002
TL;DR: It is shown that both coevolutionary algorithms outperform a sequential GA; the LCGA may be recommended for optimization systems where a high degree of parallelism is possible and only non-global coordination is expected, while the CCGA algorithm is useful where a low degree of parallelism and global coordination are acceptable.
Abstract: The problem of parallel and distributed function optimization is considered. Two coevolutionary algorithms with different degrees of parallelism and different levels of global coordination are used for this purpose and compared with a sequential genetic algorithm (GA). The first coevolutionary algorithm, called a loosely coupled genetic algorithm (LCGA), represents a competitive coevolutionary approach to problem solving and is compared with another coevolutionary algorithm called the cooperative coevolutionary genetic algorithm (CCGA). The algorithms are applied for parallel and distributed optimization of a number of test functions known in the area of evolutionary computation. We show that both coevolutionary algorithms outperform a sequential GA. While both the LCGA and CCGA algorithms offer high-quality solutions, they may compete to outperform each other on some specific test optimization problems. The LCGA may be recommended for optimization systems where a high degree of parallelism is possible and only non-global coordination is expected, while the CCGA algorithm is useful where a low degree of parallelism and global coordination are acceptable.
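
For reference, the cooperative scheme can be sketched in a few lines. This follows the general CCGA recipe of Potter and De Jong rather than the paper's exact setup: one subpopulation per coordinate, each candidate scored inside a context vector assembled from the other subpopulations' current best members; the mutation operator and objective are placeholders.

#include <random>
#include <vector>

double sphere(const std::vector<double>& x) {   // stand-in objective, minimized
    double s = 0.0;
    for (double v : x) s += v * v;
    return s;
}

// One cooperative generation: evolve each coordinate's subpopulation in turn,
// evaluating candidates against the current best values of the other coordinates.
void ccga_step(std::vector<std::vector<double>>& subpops,
               std::vector<double>& best, std::mt19937& rng) {
    std::normal_distribution<double> mut(0.0, 0.1);
    for (std::size_t i = 0; i < subpops.size(); ++i) {
        for (double& candidate : subpops[i]) {
            candidate += mut(rng);                // trivial mutation
            std::vector<double> trial = best;     // collaborators: current bests
            trial[i] = candidate;
            if (sphere(trial) < sphere(best)) best = trial;
        }
    }
}

Each subpopulation can evolve on its own processor, but the shared best vector is a global coordination point, which is why the paper associates CCGA with a lower degree of parallelism than the loosely coupled LCGA.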

Proceedings Article
01 Jan 2002
TL;DR: The concept of disjoint faults is extended to reduce the number of tests to a time efficiency of Θ(N^(5/6)) for N×N DOMINs, and a two-phase diagnosis algorithm is proposed to reduce the testing requirement to 4N tests.
Abstract: Dilated Optical Multistage Interconnection Networks (DOMINs) based on 2 × 2 directional coupler photonic switches play an important role in all-optical high-performance networks, especially for the emerging IP over DWDM architectures. The problem of crosstalk within photonic switches is underestimated; it arises due to the aging of the switching element, control voltage, temperature and polarization, and thus causes undesirable coupling of the signal from one path to the other. Previous work [18] designed an efficient algorithm for diagnosing disjoint faults in small-sized networks, which reduced the number of tests required in photonic switching networks to one half by overlapping the tests with computations. Furthermore, this paper generically derives algorithms and mathematical modules to find the optimal degree of parallelism of fault diagnosis for N × N dilated blocking networks as the size of the network grows. Taking advantage of the properties of disjoint faults, diagnosis can be accelerated significantly because the optimal degree of parallel fault diagnosis may grow exponentially. To reduce the diagnosis time, an algorithm is proposed herein to find the maximum number of disjoint faults. Rather than requiring up to 4MN tests as a naive approach would, a two-phase diagnosis algorithm is proposed to reduce the testing requirement to 4N tests. This study extends the concept of disjoint faults to reduce the number of tests to a time efficiency of Θ(N^(5/6)) for N×N DOMINs.

Patent
18 Oct 2002
TL;DR: In this paper, the authors proposed a symmetric-key cryptographic processing technique capable of realizing both high-speed cryptographic processing having a high degree of parallelism, and alteration detection.
Abstract: PROBLEM TO BE SOLVED: To provide a symmetric-key cryptographic processing technique capable of realizing both high-speed cryptographic processing having a high degree of parallelism, and alteration detection. SOLUTION: A redundancy is added to the data to be encrypted, and the resulting data are encrypted using a key stream whose length is strictly longer than the resulting data, producing a ciphertext. In the corresponding decryption, verifying that the redundancy decrypts correctly detects alteration during communication. The success probability of the alteration detection can be evaluated, and the processing, with its high parallelism and smaller implementation scale, is faster than that of a block cipher.
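
The scheme as described can be sketched directly: pad with a known redundancy block, XOR with a keystream, and check the redundancy after decryption. The code below is a toy illustration; the keystream generator is a stand-in with no security properties, and the 16-byte 0x5a redundancy block is an arbitrary choice.

#include <cstdint>
#include <vector>

static uint8_t toy_keystream(uint64_t key, std::size_t i) {   // NOT a secure generator
    uint64_t x = key + 0x9e3779b97f4a7c15ULL * (uint64_t)(i + 1);
    x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
    return (uint8_t)x;
}

std::vector<uint8_t> encrypt(std::vector<uint8_t> msg, uint64_t key) {
    msg.insert(msg.end(), 16, 0x5a);                 // append redundancy block
    for (std::size_t i = 0; i < msg.size(); ++i)
        msg[i] ^= toy_keystream(key, i);             // bytes are independent, so
    return msg;                                      // lanes can run in parallel
}

bool decrypt_and_check(std::vector<uint8_t>& ct, uint64_t key) {
    if (ct.size() < 16) return false;
    for (std::size_t i = 0; i < ct.size(); ++i) ct[i] ^= toy_keystream(key, i);
    for (std::size_t i = ct.size() - 16; i < ct.size(); ++i)
        if (ct[i] != 0x5a) return false;             // redundancy damaged: altered
    ct.resize(ct.size() - 16);                       // strip redundancy
    return true;
}

Because each byte is encrypted independently of its neighbors, the XOR loops parallelize trivially, unlike the chained rounds of a block cipher mode.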

Proceedings ArticleDOI
28 Oct 2002
TL;DR: A new asynchronous, coarse-grain parallel genetic algorithm model is proposed and a design pattern is presented which can achieve a high degree of parallelism, can handle the high communication latency and low communication bandwidth of the Internet, and can contribute to building more robust Web-based parallel GAs.
Abstract: Aiming at developing more efficient and robust genetic algorithms (GAs) over the Internet, a new asynchronous, coarse-grain parallel genetic algorithm model is proposed in this paper. On the basis of the model, we present a design pattern for Web-based parallel GAs, which captures design solutions to core problems in implementing Web-based parallel GAs. This design pattern can achieve a high degree of parallelism, can handle the high communication latency and low communication bandwidth of the Internet, and can contribute to building more robust Web-based parallel GAs.
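
Asynchrony is the key property: an island must never block waiting for migrants over a slow link. The sketch below (a hypothetical structure, not the paper's Web-based design pattern) shows a non-blocking mailbox through which coarse-grain islands could exchange individuals.

#include <deque>
#include <mutex>
#include <vector>

// Non-blocking migrant exchange: post() never waits, and poll() returns
// immediately whether or not a migrant has arrived, so high Internet latency
// delays migration but never stalls an island's evolution loop.
struct Mailbox {
    std::mutex m;
    std::deque<std::vector<double>> migrants;
    void post(std::vector<double> g) {
        std::lock_guard<std::mutex> lk(m);
        migrants.push_back(std::move(g));
    }
    bool poll(std::vector<double>& g) {
        std::lock_guard<std::mutex> lk(m);
        if (migrants.empty()) return false;
        g = std::move(migrants.front());
        migrants.pop_front();
        return true;
    }
};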

Journal Article
TL;DR: In this article, the degree of parallelism in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion), was investigated.
Abstract: We investigate the degree of parallelism (or modularity) in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion). In technical terms, we study the family relation on redexes in λH, and the contribution relation on redex-families, and show that the latter is a forest (as a partial order). This means that hyperbalanced λ-terms allow for maximal possible parallelism in computation. To prove our results, we use and further refine, for the case of hyperbalanced terms, some well known results concerning paths, which allow for static analysis of many fundamental properties of β-reduction.

Book ChapterDOI
22 Jul 2002
TL;DR: The family relation on redexes in λH and the contribution relation on redex-families are studied, and the latter is shown to be a forest as a partial order, meaning that hyperbalanced λ-terms allow for maximal possible parallelism in computation.
Abstract: We investigate the degree of parallelism (or modularity) in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion). In technical terms, we study the family relation on redexes in λH, and the contribution relation on redex-families, and show that the latter is a forest (as a partial order). This means that hyperbalanced λ-terms allow for maximal possible parallelism in computation. To prove our results, we use and further refine, for the case of hyperbalanced terms, some well known results concerning paths, which allow for static analysis of many fundamental properties of β-reduction.
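
As a small worked example (constructed here, not taken from the paper): when two redexes occupy disjoint subterms, neither redex-family contributes to the other, so both can be contracted in a single parallel step.

\[
\big((\lambda x.\,x\,x)\,a\big)\,\big((\lambda y.\,y)\,b\big)\;\longrightarrow\;(a\,a)\,b
\]

The two β-redexes here are unrelated under the contribution relation, the forest-shaped situation the paper proves holds for all redex-families of hyperbalanced terms.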

Patent
20 Sep 2002
TL;DR: In this article, a compiler performs scheduling such that the number of instructions whose execution condition is true does not exceed the upper limit of the degree of parallelism of the hardware, so that the computing units (hardware) are used efficiently.
Abstract: In order to overcome the problem that conditionally executed instructions are executed as no-operation instructions if their condition is not fulfilled, leading to poor utilization efficiency of the hardware and lowering the effective performance, the processor decodes a number of instructions that is greater than the number of provided computing units and judges their execution conditions with an instruction issue control portion before the execution stage. Instructions for which the condition is false are invalidated, and subsequent valid instructions are assigned so that the computing units (hardware) are used efficiently. A compiler performs scheduling such that the number of instructions whose execution condition is true does not exceed the upper limit of the degree of parallelism of the hardware. The number of instructions arranged in parallel at each cycle may exceed the degree of parallelism of the hardware.

Book ChapterDOI
01 Jan 2002
TL;DR: The scheme is a parallelization of the dynamic programming method of evaluating minimum edit distances between the pattern and any substring of the reference string, and is extendible for any pattern size and any parallelism degree.
Abstract: Parallel broadcasting provides an efficient parallel hardware solution to the k-mismatches problem, which is a version of the approximate string matching problem. The scheme is a parallelization of the dynamic programming method of evaluating minimum edit distances between the pattern and any substring of the reference string. Implicit parallelism based on dataflow and explicit parallelism based on a parallel broadcasting mechanism exploit the maximum parallelism. The time complexity of the proposed parallel scheme is O(((n−m)/d) + m), where n and m represent the lengths of the reference and the pattern strings respectively, and d is the degree of parallelism, which is controllable. The proposed scheme is suitable for VLSI implementation. The design is based on a linear systolic array of a simple basic cell, and is extendible for any pattern size and any parallelism degree. For serial and parallel designs, m and d*m identical processing elements are needed, respectively.
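
For orientation, here is the serial form of the dynamic programming the scheme parallelizes (Sellers-style approximate matching; a sketch, not the paper's cell design). The systolic array gains its parallelism by evaluating independent cells of this table concurrently, whereas this version fills one column per text character.

#include <algorithm>
#include <string>
#include <vector>

// D[i][j] = minimum edit distance between pat[0..i) and some substring of
// text ending at position j; report end positions j where D[m][j] <= k.
std::vector<int> match_end_positions(const std::string& text,
                                     const std::string& pat, int k) {
    int m = (int)pat.size(), n = (int)text.size();
    std::vector<int> prev(m + 1), cur(m + 1), hits;
    for (int i = 0; i <= m; ++i) prev[i] = i;   // empty text: i deletions
    for (int j = 1; j <= n; ++j) {
        cur[0] = 0;                             // a match may start anywhere
        for (int i = 1; i <= m; ++i)
            cur[i] = std::min({ prev[i - 1] + (pat[i - 1] != text[j - 1]),  // match/substitute
                                prev[i] + 1,        // skip a text character
                                cur[i - 1] + 1 });  // skip a pattern character
        if (cur[m] <= k) hits.push_back(j);
        std::swap(prev, cur);
    }
    return hits;
}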

Proceedings Article
22 Jul 2002
TL;DR: In this paper, the degree of parallelism in the hyperbalanced λ-calculus, λH, a subcalculus of the λ-calculus containing all simply typable terms (up to a restricted η-expansion), was investigated.
Abstract: We investigate the degree of parallelism (or modularity) in the hyperbalanced λ-calculus, λH, a subcalculus of λ-calculus containing all simply typable terms (up to a restricted η-expansion). In technical terms, we study the family relation on redexes in λH, and the contribution relation on redex-families, and show that the latter is a forest (as a partial order). This means that hyperbalanced λ-terms allow for maximal possible parallelism in computation. To prove our results, we use and further refine, for the case of hyperbalanced terms, some well known results concerning paths, which allow for static analysis of many fundamental properties of β-reduction.

Book ChapterDOI
15 Jun 2002
TL;DR: Several parallelism-independent algorithms are proposed which are applicable either to distributed computing systems, i.e. systems of autonomous processors connected via communication links, or to tightly coupled multiprocessor systems and architectures exploiting instruction-level parallelism.
Abstract: The objective of parallelism-independent (PI) scheduling is minimization of the completion time of a parallel application for any number of processing elements in the computing system. We propose several parallelism-independent algorithms which are applicable either to distributed computing systems, i.e. systems of autonomous processors connected via communication links (in this case we provide explicit message communication scheduling), or to tightly coupled multiprocessor systems and architectures exploiting instruction-level parallelism. The algorithms are hybrid but predominantly performed at compile time in order to reduce the dynamic overhead and scheduling hardware. All traditional static scheduling algorithms produce machine code with a fixed degree of parallelism, which cannot be executed efficiently on computer systems with different degrees of parallelism. Our algorithms eliminate this problem, which is closely related to the distribution of parallel programs.
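
The abstract does not detail the PI algorithms themselves, so as a baseline here is a generic list-scheduling sketch (not one of the paper's algorithms): the same task graph is scheduled for an arbitrary processor count p, the parameter over which parallelism-independent code must remain efficient.

#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// dur[u] = duration of task u; succ[u] = successors of u in the task DAG.
// Each ready task goes to whichever of the p processors frees up first;
// returns the resulting completion time (makespan). Communication is ignored.
int makespan(const std::vector<int>& dur,
             const std::vector<std::vector<int>>& succ, int p) {
    int n = (int)dur.size();
    std::vector<int> indeg(n, 0), ready_at(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : succ[u]) ++indeg[v];
    using Ev = std::pair<int, int>;                                // (time, task)
    std::priority_queue<Ev, std::vector<Ev>, std::greater<Ev>> ready;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) ready.push({0, u});
    std::priority_queue<int, std::vector<int>, std::greater<int>> proc;  // free times
    for (int i = 0; i < p; ++i) proc.push(0);
    int finish_all = 0;
    while (!ready.empty()) {
        auto [t, u] = ready.top(); ready.pop();
        int start = std::max(t, proc.top()); proc.pop();   // earliest-free processor
        int finish = start + dur[u];
        proc.push(finish);
        finish_all = std::max(finish_all, finish);
        for (int v : succ[u]) {                            // release successors
            ready_at[v] = std::max(ready_at[v], finish);
            if (--indeg[v] == 0) ready.push({ready_at[v], v});
        }
    }
    return finish_all;
}

Running the same scheduler with different values of p illustrates the property the paper targets: the code, not just the schedule, stays valid for any degree of parallelism.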