
Showing papers on "Degree of parallelism published in 2001"


Journal ArticleDOI
TL;DR: An observational case study in which the change and configuration management history of a legacy system is collected and analyzed to delineate the boundaries of, and to understand the nature of, the problems encountered in parallel development.
Abstract: An essential characteristic of large-scale software development is parallel development by teams of developers. How this parallel development is structured and supported has a profound effect on both the quality and timeliness of the product. We conduct an observational case study in which we collect and analyze the change and configuration management history of a legacy system to delineate the boundaries of, and to understand the nature of, the problems encountered in parallel development. The results of our studies are (1) that the degree of parallelism is very high, higher than considered by tool builders; (2) there are multiple levels of parallelism, and the data for some important aspects are uniform and consistent for all levels; (3) the tails of the distributions are long, indicating that the tail, rather than the mean, must receive serious attention in providing solutions for these problems; and (4) there is a significant correlation between the degree of parallel work on a given component and the number of quality problems it has. Thus, the results of this study are important both for tool builders and for process and project engineers.

178 citations


Journal ArticleDOI
TL;DR: A new parallel algorithm for mining association rules on shared-memory multiprocessors is presented; the degree of parallelism, synchronization, and data locality issues are studied, and optimizations for fast frequency computation are proposed.
Abstract: In this paper we present a new parallel algorithm for data mining of association rules on shared-memory multiprocessors. We study the degree of parallelism, synchronization, and data locality issues, and present optimizations for fast frequency computation. Experiments show that a significant improvement of performance is achieved using our proposed optimizations. We also achieved good speed-up for the parallel algorithm.
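To make the shared-memory setting concrete, here is a minimal Python sketch (not the algorithm from the paper; the transactions, itemset size and worker count are illustrative) of one common way to trade synchronization for data locality: each worker counts candidate 2-itemsets in a private counter over its chunk of transactions, and the partial counts are merged once at the end instead of locking a shared table on every update.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

transactions = [                      # illustrative market-basket data
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

def count_chunk(chunk):
    local = Counter()                 # private to this worker: no locking needed
    for items in chunk:
        for pair in combinations(sorted(items), 2):
            local[pair] += 1
    return local

def parallel_support(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:          # a single merge replaces per-update synchronization
        total.update(partial)
    return total

print(parallel_support(transactions).most_common(3))
```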

168 citations


Proceedings ArticleDOI
18 Jun 2001
TL;DR: This paper proposes a solution to automate this data analysis problem by applying fundamental statistical techniques to scalability experiment data, and finds that non-parametric correlation of the number of tasks to the ratio of the time for communication operations to overall communication time provides a reliable measure for identifying communication operations that scale poorly.
Abstract: Current trends in high performance computing suggest that users will soon have widespread access to clusters of multiprocessors with hundreds, if not thousands, of processors. This unprecedented degree of parallelism will undoubtedly expose scalability limitations in existing applications, where scalability is the ability of a parallel algorithm on a parallel architecture to effectively utilize an increasing number of processors. Users will need precise and automated techniques for detecting the cause of limited scalability. This paper addresses this dilemma. First, we argue that users face numerous challenges in understanding application scalability: managing substantial amounts of experiment data, extracting useful trends from this data, and reconciling performance information with their application's design. Second, we propose a solution to automate this data analysis problem by applying fundamental statistical techniques to scalability experiment data. Finally, we evaluate our operational prototype on several applications, and show that statistical techniques offer an effective strategy for assessing application scalability. In particular, we find that non-parametric correlation of the number of tasks to the ratio of the time for communication operations to overall communication time provides a reliable measure for identifying communication operations that scale poorly.
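A minimal sketch of the statistical idea, assuming hypothetical operation names and made-up timings: for each communication operation, compute the non-parametric (Spearman rank) correlation between the task count and that operation's share of total communication time; a strongly positive coefficient flags an operation whose share grows with scale, i.e. one that scales poorly.

```python
from scipy.stats import spearmanr

task_counts = [16, 32, 64, 128, 256]          # one scalability run per entry

# Per-run time (seconds) spent in each communication operation (made-up values),
# plus each run's total communication time.
op_times = {
    "MPI_Allreduce": [1.2, 2.9, 6.5, 14.8, 33.1],
    "MPI_Isend":     [4.0, 4.4, 4.1, 4.6, 4.3],
}
total_comm = [6.0, 8.1, 11.4, 20.2, 38.0]

for op, times in op_times.items():
    ratios = [t / total for t, total in zip(times, total_comm)]
    rho, _ = spearmanr(task_counts, ratios)   # rank correlation: no linearity assumed
    print(f"{op:15s} rho = {rho:+.2f}")       # +1.00 flags the poorly scaling Allreduce
```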

145 citations


Patent
14 Mar 2001
TL;DR: In this article, a method and a system of database divisional management for use with a parallel database system comprising an FES (front end server), BES's (back end servers), an IOS (I/O server) and disk units is presented.
Abstract: A method and a system of database divisional management for use with a parallel database system comprising an FES (front end server), BES's (back end servers), an IOS (I/O server) and disk units. The numbers of processors assigned to the FES, BES's and IOS, the number of disk units, and the number of partitions of the disk units are determined in accordance with the load pattern in question. Illustratively, there may be established a configuration of one FES, four BES's, one IOS and eight disk units. The number of BES's is varied from one to four depending on the fluctuation in load, so that a scalable system configuration is implemented. When the number of BES's is increased or decreased, only the management information thereabout is transferred between nodes and not the data, whereby the desired degree of parallelism is obtained for high-speed query processing.

117 citations


Journal ArticleDOI
TL;DR: This paper resolves a long-standing open problem on whether the concurrent write capability of parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in logarithmic time.
Abstract: This paper resolves a long-standing open problem on whether the concurrent-write capability of the parallel random access machine (PRAM) is essential for solving fundamental graph problems like connected components and minimum spanning trees in O(log n) time. Specifically, we present a new algorithm to solve these problems in O(log n) time using a linear number of processors on the exclusive-read exclusive-write PRAM. The logarithmic time bound is actually optimal, since it is well known that even computing the "OR" of n bits requires Ω(log n) time on the exclusive-write PRAM. The efficiency achieved by the new algorithm is based on a new schedule which can exploit a high degree of parallelism.
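For context, the optimality argument can be written out explicitly. Any exclusive-write PRAM needs

\[
\Omega(\log n) \ \text{time to compute}\ \mathrm{OR}(x_1,\dots,x_n) = \bigvee_{i=1}^{n} x_i ,
\]

so an EREW algorithm that solves connected components or minimum spanning trees in \(O(\log n)\) time with a linear number of processors cannot be improved in running time. (Reading "a linear number of processors" as order \(n+m\) for a graph with \(n\) vertices and \(m\) edges, the total work is \(O((n+m)\log n)\); that accounting is the usual one for this setting rather than a figure quoted from the abstract.)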

76 citations


Patent
07 May 2001
TL;DR: In this paper, techniques are presented for increasing the degree of parallelism without incurring the overhead costs associated with inter-nodal communication when performing parallel operations.
Abstract: Techniques are provided for increasing the degree of parallelism without incurring overhead costs associated with inter-nodal communication for performing parallel operations. One aspect of the invention is to distribute the partition-pairs of a parallel partition-wise operation on a pair of objects among the nodes of a database system. The partition-pairs distributed to each node are further partitioned to form a new set of partition-pairs. One partition-pair from the new set is assigned to each slave process on a given node. In addition, a target object may be partitioned by applying an appropriate hash function to the tuples of the target object. The parallel operation is performed by broadcasting each tuple from a source table only to the group of slave processes that is working on the static partition to which the tuple is mapped.

57 citations


Proceedings ArticleDOI
14 May 2001
TL;DR: The results show that "typical" echo-based management operations could be executed within some 18 seconds on the entire Internet, due to the high degree of parallelism and distributed control in this pattern and some specific properties of the Internet topology.
Abstract: Performing global management operations on the Internet in an efficient way is difficult, because of the continuous changes to the Internet topology, its large number of nodes and the lack of an up-to-date global database. In practice, these difficulties appear in the management of large private IP networks and large autonomous systems, which form the sub-topologies of the Internet and are under independent administration. This paper introduces the echo pattern, a scheme for distributing management operations, which addresses these difficulties. Management operations based on this pattern do not need knowledge of the network topology, they can dynamically adapt to changes in the topology, and they scale well in very large networks. A management operation based on the echo pattern has two phases. In the first phase, the network is flooded with management commands to be run on the network elements. In the second phase, the results of the local management operations are aggregated inside the network. We analyze the echo pattern with respect to time and traffic complexity and compare its performance to that of a centralized management scheme. Our results show that "typical" echo-based management operations could be executed within about 18 seconds on the entire Internet. This short time is due to (1) the high degree of parallelism and distributed control in this pattern and (2) some specific properties of the Internet topology.
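The two phases can be sketched in a few lines of Python; this is a plain sequential simulation over a made-up topology, not the management protocol itself, but it shows how the flooding phase implicitly builds a spanning tree and how the local results are then aggregated back toward the initiating node.

```python
from collections import deque

topology = {                      # illustrative adjacency list, not a real network
    "mgr": ["r1", "r2"],
    "r1":  ["mgr", "r2", "h1"],
    "r2":  ["mgr", "r1", "h2"],
    "h1":  ["r1"],
    "h2":  ["r2"],
}

def local_operation(node):
    return 1                      # stand-in for the management command run on each element

def echo(root):
    # Phase 1: flood the operation outward; recording each node's parent
    # implicitly builds a spanning tree of the reachable topology.
    parent, order, queue = {root: None}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbour in topology[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    # Phase 2: the echo, aggregating local results upward along the tree.
    partial = {node: local_operation(node) for node in order}
    for node in reversed(order):  # leaves first, initiator last
        if parent[node] is not None:
            partial[parent[node]] += partial[node]
    return partial[root]

print(echo("mgr"))                # 5: one result aggregated per reachable element
```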

47 citations


Book
01 Jan 2001
TL;DR: This dissertation proposes a new transformation framework for the program domain of arbitrary loop structures with affine array accesses and loop bounds that unifies a large class of transformations, including loop interchange, reversal, skewing, fusion, fission, reindexing, scaling, and statement reordering.
Abstract: To effectively harness the power of modern parallel machines, it is important to find parallelism that does not incur high synchronization cost between the processors. Effective utilization of the memory hierarchy is another crucial factor in getting high performance. In the past, a large number of compiler algorithms have been proposed in the areas of finding loop-level parallelism and improving data locality. Many of these algorithms rely on ad hoc techniques to select a series of transformations. Others attempt to place a number of transformations in a single framework to avoid ad hoc selections. Unfortunately, frameworks proposed so far are either limited to perfectly nested loops or they lack algorithms to select the best possible transformations within the framework. This dissertation proposes a new transformation framework for the program domain of arbitrary loop structures with affine array accesses and loop bounds. This framework unifies a large class of transformations, including loop interchange, reversal, skewing, fusion, fission, reindexing, scaling, and statement reordering. In this framework, each statement is given its own affine mapping describing the partition of its dynamic instances to different processors or to different sequential steps. Two algorithms, one for parallelization and one for locality optimization, were developed under this framework and can be easily combined. The parallelization algorithm derives from data dependence constraints the affine mappings that maximize the degree of parallelism with the least amount of synchronization. The degree of parallelism found by the algorithm is optimal with respect to all the unified transformations. The algorithm also minimizes communication by trading off excess degrees of parallelism and by choosing pipeline parallelism over doall parallelism if doing so can significantly reduce communication cost. To optimize memory performance on a single processor, the locality algorithm performs aggressive affine transformations to separate the independent threads in a program and, if possible, place statements with data reuse into perfectly nested loops. These transformations also have the benefit of enabling blocking and array contraction to be applied across arbitrarily nested loops.
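As a textbook-style illustration of such a per-statement affine mapping (a generic example, not one taken from the dissertation), consider a statement \(S\) in a two-deep loop nest whose dependence vectors are \((1,0)\) and \((0,1)\). Giving \(S\) the mappings

\[
t(i,j) = i + j \quad\text{(sequential step)}, \qquad p(i,j) = j \quad\text{(processor)},
\]

skews the iteration space into wavefronts: every dependence advances \(t\) by exactly one, so the schedule is legal; all iterations with the same value of \(i+j\) can execute in parallel, and only the \((0,1)\) dependences cross processor boundaries and cost communication.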

26 citations


Journal ArticleDOI
TL;DR: The existence of an optimum degree of parallelism (τopt), for which the best performance in terms of efficiency, number of iterations, and effectiveness is obtained, is demonstrated.

12 citations


DOI
01 Jan 2001
TL;DR: A detailed three-dimensional computational model of the human cochlea is developed and refined, which uses the immersed boundary method to calculate the fluid-structure interactions produced in response to incoming sound waves.
Abstract: We have developed and are refining a detailed three-dimensional computational model of the human cochlea. The model uses the immersed boundary method to calculate the fluid-structure interactions produced in response to incoming sound waves. An accurate cochlear geometry obtained from physical measurements is incorporated. The model includes a detailed and realistic description of the various elastic structures present. Initially, a macro-mechanical computational model was developed for execution on a CRAY T90 at the San Diego Supercomputing Center. This code was ported to the latest generation of shared-memory high performance servers from Hewlett Packard. Using compiler-generated threads and OpenMP directives, we have achieved a high degree of parallelism in the executable, which has made it possible to run several large-scale numerical simulation experiments to study the interesting features of the cochlear system. In this paper, we outline the methods, algorithms and software tools that were used to implement and fine-tune the code, and discuss some of the simulation results.

10 citations


12 Dec 2001
TL;DR: This paper focuses on dynamic, large-scale applications that require much larger calculations than are possible at present, and on the development of new latency-tolerant algorithms and sophisticated code frameworks like Cactus to carry out more complex and higher-fidelity simulations with a massive degree of parallelism.
Abstract: Computer simulations are becoming increasingly important as the only means for studying and interpreting the complex processes of nature. Yet the scope and accuracy of these simulations are severely limited by available computational power, even using today's most powerful supercomputers. As we endeavor to simulate the true complexity of nature, we will require much larger scale calculations than are possible at present. Such dynamic and large-scale applications will require computational grids, and grids in turn require the development of new latency-tolerant algorithms and sophisticated code frameworks like Cactus to carry out more complex and high-fidelity simulations with a massive degree of parallelism.

Proceedings ArticleDOI
23 Apr 2001
TL;DR: In this paper, two techniques are used to reduce the probability of rollback: value prediction, which breaks dependencies between the running processes, and producer identification, which transforms the barrier synchronization into per-variable flags.
Abstract: Barrier synchronization is a source of inefficiency in many parallel programs, due to the association of many producer-consumer relations with one synchronization variable. This inefficiency may consume a significant percentage of total execution time, especially as we increase the degree of parallelism while maintaining the problem size. Barrier synchronization wait time can be hidden by speculatively executing instructions after the barrier. The speculative execution must not violate the dependencies imposed by the program. Dependency violation causes rollback, incurring a penalty that may exceed the benefit of speculation. In this work, we investigate how to reduce the probability of rollback through the use of two different techniques: value prediction and producer identification. The first technique tries to break the dependency between the running processes. The second technique tries to respect only true dependencies by transforming the barrier synchronization into per-variable flags. Simulation results using scientific benchmarks, mostly from SPLASH-2, indicate that producer identification promises a greater potential reduction in synchronization time, close to that achievable with actual dependency information, and maintains the rollback percentage below 10% for most benchmarks.
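The producer-identification idea can be illustrated with a small Python sketch (the variable names and values are made up): each shared variable gets its own readiness flag, so a consumer blocks only on the producer it truly depends on rather than on a global barrier that covers every producer-consumer relation.

```python
import threading

data = {}
ready = {"x": threading.Event(), "y": threading.Event()}   # one flag per shared variable

def producer_x():
    data["x"] = 42          # produce the value
    ready["x"].set()        # signal only the consumers of x

def producer_y():
    data["y"] = 7
    ready["y"].set()

def consumer():
    ready["x"].wait()       # true dependence on x only; never waits for y
    print(data["x"] * 2)

threads = [threading.Thread(target=f) for f in (consumer, producer_x, producer_y)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a conventional barrier, the consumer would also have had to wait for producer_y; the per-variable flag lets it proceed as soon as its actual producer has finished.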

Journal ArticleDOI
TL;DR: A stereo-matching algorithm to establish reliable correspondence between images by selecting a desirable window size for SAD (Sum of Absolute Differences) computation is presented and a window-parallel and pixel-serial architecture is proposed to achieve 100% utilization of processing elements.
Abstract: This paper presents a stereo-matching algorithm to establish reliable correspondence between images by selecting a desirable window size for SAD (Sum of Absolute Differences) computation. In SAD computation, the degree of parallelism between pixels in a window changes depending on the window size, while the degree of parallelism between windows is predetermined by the input-image size. Based on this consideration, a window-parallel and pixel-serial architecture is proposed to achieve 100% utilization of processing elements. Not only the 100% utilization but also a simple interconnection network between memory modules and processing elements makes the VLSI processor much superior to conventional processors.
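For reference, the SAD cost evaluated by such an architecture can be written as a short NumPy sketch (a software illustration of the computation only, not of the processor design; the image arrays and disparity range are assumed to be supplied by the caller with the window kept inside both images). Each window is an independent task, while the pixels inside a window are accumulated serially, mirroring the window-parallel, pixel-serial organisation described above.

```python
import numpy as np

def sad(left, right, row, col, disparity, w=5):
    """Sum of Absolute Differences over a w-by-w window (interior pixels assumed)."""
    h = w // 2
    a = left[row - h:row + h + 1, col - h:col + h + 1].astype(np.int32)
    b = right[row - h:row + h + 1,
              col - h - disparity:col + h + 1 - disparity].astype(np.int32)
    return int(np.abs(a - b).sum())        # serial accumulation over the window's pixels

def best_disparity(left, right, row, col, max_d=16, w=5):
    """Each window (and each candidate disparity) is an independent, parallelizable task."""
    costs = [sad(left, right, row, col, d, w) for d in range(max_d)]
    return int(np.argmin(costs))
```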

Proceedings ArticleDOI
04 Nov 2001
TL;DR: Performance and computational density well above those of microprocessor-based systems are obtained, and the present system is going to be introduced in the mass market of general-purpose spectrometers.
Abstract: We discuss the planning, realization and engineering of a system for on-line digital pulse analysis, based on reconfigurable devices. Performance and computational density well above those of microprocessor-based systems are obtained. Data-path structures and algorithms suited to taking full advantage of spatial computing in programmable devices, through efficient implementation and a high degree of parallelism, are presented. The present system is going to be introduced in the mass market of general-purpose spectrometers.

Patent
12 Sep 2001
TL;DR: In this article, a private memory buffer is allocated for holding results, such as a communication message, an operation system call or a new job signal, of a speculatively executed job.
Abstract: In general, the invention is directed towards a multiprocessing system in which jobs are speculatively executed in parallel by multiple processors (30-1, 30-2, ..., 30-N). By speculating on the existence of more coarse-grained parallelism, so-called job-level parallelism, and backing off to sequential execution only in cases where dependencies that prevent parallel execution of jobs are detected, a high degree of parallelism can be extracted. According to the invention, a private memory buffer is speculatively allocated for holding the results, such as a communication message, an operating system call or a new job signal, of a speculatively executed job, and these results are speculatively written directly into the allocated memory buffer. When commit priority is assigned to the speculatively executed job, a pointer referring to the allocated memory buffer is transferred to an input/output (10) device, which may access the memory buffer by means of the transferred pointer. In this way, by speculatively writing messages and signals into private memory buffers, even further parallelism can be extracted.

Proceedings ArticleDOI
10 Sep 2001
TL;DR: A co-synthesis algorithm that combines repartitioning and HW /SW partitioning of the processes in a system specification to provide an efficient design space exploration stractegy and the reported experimental results show the efficiency: when compared with related works: and practicability of the co-Synthesis algorithms.
Abstract: We present a co-synthesis algorithm that combines repartitioning and HW /SW partitioning of the processes in a system specification to provide an efficient design space exploration stractegy. The algorithm is defined on a Partial Order based Model (POM), which is an alternative to model concurrency at a high level of abstraction and has a concise symbolic representation, mainly for systems with high degree of parallelism, as well as allows the use of efficient reachability analysis techniques based on partial order reductions. Our repartitioning algorithm generates a partitioning tree, where the possible partitions of the processes in a specification are represented in a systematic way, according to the communication among them. The HW /SW partitioning algorithm is applied on these possible partitionings and will select the one that minimizes the communication cost between the partitions. In this paper, we will just present the repartitioning: HW /SW partitioning and performance/cost estimates algorithms and how they are used in our design space exploration stractegy. The reported experimental results show the efficiency: when compared with related works: and practicability of our co-synthesis algorithms.

Patent
04 Jan 2001
TL;DR: In this paper, a neural network capable of extracting digital data from a radio signal has a sufficiently high degree of parallelism to dynamically determine at least one suitable channel model representing a selected propagation path of the radio signal.
Abstract: A Neural Network capable of extracting digital data from a radio signal has a sufficiently high degree of parallelism to dynamically determine at least one suitable channel model representing a selected propagation path of the radio signal. The degree of parallelism provided by the Neural Network is sufficient to obtain channel equalisation of the received signal by processing data inputted into the Neural Network as the data are received in real-time from the radio signal.

01 Jan 2001
TL;DR: DyRecT enables parallel applications belonging to different domains to be made adaptive while preserving their best-suited programming model, and provides more general support for developing adaptive parallel applications.
Abstract: Clusters of workstations are now considered a platform for parallel computing along with dedicated multiprocessor systems. The main issue that arises when using non-dedicated clusters of workstations for executing parallel applications is accounting for user activity. Parallel applications should only execute on idle workstations and have the ability to withdraw from a workstation when user activity is detected. Since the set of idle workstations varies over time, parallel applications have to adapt to fluctuations in available computing resources and be able to change their degree of parallelism at run-time. However, most parallel applications are designed to run on a fixed set of computing resources. For a parallel program to execute in a dynamic environment, functionality has to be provided to allow the application to adapt to different scenarios of changing computing resources at run-time. Providing such functionality is a non-trivial task, and several application-level approaches have been proposed to facilitate the development of adaptive parallel applications. These approaches, however, either restrict support to a specific class of applications or require the application to be written following a programming model that easily supports changes in the degree of parallelism. This dissertation presents an application-level approach that relaxes the restrictions imposed by current systems and provides more general support for developing adaptive parallel applications. This novel approach consists of analyzing the tasks required to make parallel programs from different domains adaptive and isolating the common adaptation operations found among these applications. The common adaptation tasks obtained form a general framework for adaptive parallelism describing the operations that need to be executed to allow parallel applications from different domains to adjust to changing computing resources. The major contribution of this dissertation is the design and implementation of a software system that provides the functionality presented above. DyRecT enables parallel applications belonging to different domains to be made adaptive while preserving their best-suited programming model. The system is implemented as a software library providing different levels of abstraction to facilitate the development of adaptive parallel programs.

Proceedings ArticleDOI
06 May 2001
TL;DR: This paper presents a novel path-metric buffering scheme realizing the Add-Compare-Select feedback (ACS-FB) and proves that this technique is optimal in terms of routability, area and power consumption.
Abstract: For the implementation of large constraint length, sequential Viterbi decoders (VD), e.g. for HDSL2 or SDSL, it is necessary to realize the Add-Compare-Select feedback (ACS-FB) with memories. Using butterfly processor elements (BF-PE) for the ACS computation leads to memory I/O and access conflicts, and as a result "ping-pong" architectures are currently implemented, which use twice as much memory as the Viterbi algorithm requires. This paper presents a novel path-metric buffering scheme realizing the ACS-FB. It is proven that this technique is optimal in terms of routability, area and power consumption. Memory I/O and path-metric access conflicts are completely prevented without the use of wait cycles. The introduced architecture always remains the same regardless of the number of BF-PEs, and thus simplifies the ACS-unit design for sequential VD macros. Independent of the degree of parallelism, only one memory is used, and hence the proposed architecture is area- and power-efficient.
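For orientation, the ACS recursion that the path-metric buffer feeds can be sketched as follows; this models only the arithmetic of one radix-2 butterfly (assuming the usual symmetric branch-metric arrangement of rate-1/n convolutional codes), not the memory organisation that is the paper's actual contribution.

```python
def acs(pm_a, pm_b, bm_a, bm_b):
    """Add-Compare-Select: extend two competing predecessor paths into one successor state."""
    cand_a = pm_a + bm_a
    cand_b = pm_b + bm_b
    return (cand_a, 0) if cand_a <= cand_b else (cand_b, 1)   # (new metric, decision bit)

def butterfly(pm, j, num_states, bm0, bm1):
    """One butterfly: predecessors j and j + num_states//2 feed successors 2j and 2j+1."""
    a, b = pm[j], pm[j + num_states // 2]
    succ_even = acs(a, b, bm0, bm1)   # successor state 2j
    succ_odd = acs(a, b, bm1, bm0)    # successor state 2j+1 (branch metrics swapped)
    return succ_even, succ_odd
```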

Book ChapterDOI
01 Jun 2001
TL;DR: This work describes a partitioning approach, based on the above motivation, for the general case of DOALL loops, to achieve a computation + communication load-balanced partitioning through static data and iteration space distribution.
Abstract: Due to the significant communication overhead of sending and receiving data, loop partitioning approaches on distributed memory systems must guarantee not just computation load balance but computation + communication load balance. Previous approaches to loop partitioning have achieved a communication-free, computation load-balanced iteration space partitioning solution for a limited subset of DOALL loops [6]. But a large category of DOALL loops inevitably result in communication, and the tradeoffs between computation and communication must be carefully analyzed for those loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general case of DOALL loops. Our goal is to achieve a computation + communication load-balanced partitioning through static data and iteration space distribution. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and re-used, eliminating a large communication volume. A new "larger partition owns" rule is formulated to minimize the communication overhead for a compute-intensive partition by localizing its references relatively more than a smaller, non-compute-intensive partition. A Partition Interaction Graph is then constructed that is used to merge the partitions to achieve granularity adjustment, computation + communication load balance and mapping onto the actual number of available processors. Relevant theory and algorithms are developed, along with a performance evaluation on the Cray T3D.

Proceedings ArticleDOI
20 Sep 2001
TL;DR: Different aspects of the decoder, including the decoding algorithm, decoder structure and memory requirements, are discussed, and it is shown that small-package, high-performance, high-speed DSPs are well suited for use in portable devices.
Abstract: MPEG-4 is an international coding standard that aims at providing standardized core technologies allowing efficient storage, transmission and manipulation of video data in multimedia environments. As mobility has become one of the key requirements of the information society today, the next generation of mobile communications will be a service-oriented industry, capable of delivering rich multimedia content, especially streaming video, to the palms of mobile subscribers. MPEG-4, with its superior compression, interactivity and systems capabilities, is the most promising future standard, and small-package, high-performance, high-speed DSPs are very suitable for use in portable devices. This paper describes the implementation of an MPEG-4 SVP (simple visual profile) video decoder on the TMS320C6201 DSP on the 'C6x evaluation module (EVM). The Texas Instruments TMS320C62x devices are fixed-point DSPs that feature the VelociTi architecture, a high-performance, advanced, very-long-instruction-word (VLIW) architecture. With this architecture, a high degree of parallelism can be exploited to meet the real-time requirements of video processing such as compression and decompression. In this paper, different aspects of the decoder are discussed: the decoding algorithm, the decoder structure and the memory requirements.

Book ChapterDOI
28 Aug 2001
TL;DR: The design and development of a dynamic scheduler of parallel threads in the Multithreaded multiProcessor Architecture (MPA) is presented, which efficiently assigns resources to threads, and permits them to communicate with great flexibility.
Abstract: This paper presents the design and development of a dynamic scheduler of parallel threads in the Multithreaded multiProcessor Architecture (MPA). The scheduler relies on an on-chip associative memory whose management time is overlapped with the execution of ready threads. The scheduler efficiently assigns resources to threads and permits them to communicate with great flexibility. The results achieved with a small number of threads from programs with a high degree of parallelism are very satisfactory, even under various degrees of cache misses.

Proceedings ArticleDOI
28 Dec 2001
TL;DR: A study of high performance, reusable and scalable DSP architecture of BOPS, which targets specific applications, and an analysis to reduce complexity as well as implementation of G.729a on an array processor using various optimization techniques is presented.
Abstract: A study of BOPS' high-performance, reusable and scalable DSP architecture, which targets specific applications, is carried out. The degree of parallelism supported by the BOPS ManArray architecture and its usability are tested on various algorithmic building blocks along with the more complex and irregular algorithm of the G.729a vocoder, a key requirement of a VoIP gateway DSP engine. An analysis to reduce complexity, as well as an implementation of G.729a on an array processor using various optimization techniques, is presented.

Journal ArticleDOI
TL;DR: This work presents a technique of early simulation in the design phase of concurrent and distributed systems using a P/T net to model the system whose behavior is simulated by the net execution.
Abstract: This work presents a technique of early simulation in the design phase of concurrent and distributed systems. A P/T net is used to model the system, whose behavior is simulated by the net execution; the truly concurrent semantics of P/T nets establishes a partial order among the system events. The designer can interact with the simulator, asking for measures of the system behavior that concern all executions respecting the same partial order. Some measures, such as the degree of parallelism exploited, are not easily obtainable from an interleaving semantics. Moreover, the designer can force the system behavior to reflect resource-constrained environments.
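As a toy illustration of extracting such a measure from a net execution (this is not the authors' simulator, and the net below is made up), one can fire a maximal conflict-free step of enabled transitions in each round and report the step width as the degree of parallelism exploited:

```python
# name -> (preset, postset), each given as a multiset of places
transitions = {
    "t1": ({"p0": 1}, {"p1": 1}),
    "t2": ({"p0": 1}, {"p2": 1}),
    "t3": ({"p1": 1, "p2": 1}, {"p3": 1}),
}

def run(marking, rounds=5):
    widths = []
    for _ in range(rounds):
        available, step = dict(marking), []
        for name, (pre, _) in transitions.items():
            if all(available.get(p, 0) >= n for p, n in pre.items()):
                for p, n in pre.items():
                    available[p] -= n        # reserve tokens so the step stays conflict-free
                step.append(name)
        if not step:
            break
        for name in step:                    # fire the whole step concurrently
            pre, post = transitions[name]
            for p, n in pre.items():
                marking[p] -= n
            for p, n in post.items():
                marking[p] = marking.get(p, 0) + n
        widths.append(len(step))
    return widths                            # observed degree of parallelism per step

print(run({"p0": 2}))                        # [2, 1]: t1 and t2 fire together, then t3
```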