
Degree of parallelism

About: Degree of parallelism is a research topic. Over its lifetime, 1,515 publications have been published on this topic, receiving 25,546 citations.


Papers
Journal ArticleDOI
TL;DR: A generic framework is presented that identifies IOSs and quantifies the DoP via the rank theorem of linear algebra; it is applied to extract algorithmic parallelism at various granularities, namely multigrain parallelism.
Abstract: Degree of parallelism (DoP) is an essential complexity metric that characterizes the number of independent operation sets (IOSs) that can be concurrently executed within an algorithm. This paper presents a generic framework to identify IOSs and to quantify the DoP based on the rank theorem of linear algebra. The framework is applied to extract algorithmic parallelism at various granularities, namely multigrain parallelism. The parallelism it exposes is intrinsic and platform independent, provides insight into architectural requirements, and thus facilitates mapping onto generic platforms and early back-annotation for modifying algorithms. It plays a significant role in the concurrent optimization of algorithms and architectures, referred to as Algorithm/Architecture Coexploration (AAC), by trading off the DoP against the number of operations (NoO). This paper reports three case studies for AAC. The case study on an IDCT shows that the framework accurately quantifies the parallelism available when mapping the algorithm onto generic platforms, including FPGAs and multicore systems; the IDCT parallelized by this technique surpasses a conventional spectral parallelization. By exploiting fine-grain parallelism, the paper also achieves a better porting of a discrete wavelet transform (DWT) onto single-instruction multiple-data (SIMD) machines than a commercial compiler does. Finally, a high-quality deinterlacer is implemented on a low-cost multicore platform for real-time high-definition applications by analyzing its multigrain parallelism. These case studies demonstrate the effectiveness of the parallel-analysis framework, which is applicable to generic systems. Compared with traditional graph-traversal techniques, the linear-algebraic approach has low complexity and remains practical for complicated algorithms.
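The abstract does not detail how the rank theorem yields the DoP, but a standard linear-algebra fact runs in the same direction: for a dependence graph over n operations, the rank of its incidence matrix equals n minus the number of connected components, so operations in different components share no data and can execute concurrently. The sketch below is a minimal NumPy illustration of that idea, not the authors' actual framework; the function name and graph encoding are invented for the example.

```python
import numpy as np

def degree_of_parallelism(n_ops, dependences):
    """Estimate DoP as the number of connected components of the
    dependence graph, obtained via the rank of its incidence matrix:
    for a graph on n vertices, rank(B) = n - #components."""
    if not dependences:
        return n_ops  # no dependences: every operation is independent
    B = np.zeros((len(dependences), n_ops))
    for row, (i, j) in enumerate(dependences):
        B[row, i] = 1.0   # each row encodes one dependence edge (i, j)
        B[row, j] = -1.0
    return n_ops - np.linalg.matrix_rank(B)

# Example: ops {0,1,2} form one chain, {3,4} another, op 5 is isolated,
# giving three independent operation sets that can run concurrently.
print(degree_of_parallelism(6, [(0, 1), (1, 2), (3, 4)]))  # -> 3
```

This is only a coarse proxy at a single granularity; the paper's framework additionally trades the DoP off against the number of operations across multiple granularities.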

18 citations

Journal ArticleDOI
J. Roos1
01 Apr 1989
TL;DR: A single chip VLSI support processor has been designed that provides predictable and uniformly low overhead for the entire semantics of a rendezvous, so that the powerful real-time constructs of Ada can be used freely without performance degradation.
Abstract: Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude using special-purpose hardware, a single-chip VLSI support processor has been designed. By providing predictable and uniformly low overhead for the entire semantics of a rendezvous, the powerful real-time constructs of Ada can be used freely without performance degradation.

The key to high performance is the set of primitive operations implemented in hardware. Each operation is complex enough to replace a considerable amount of code, yet was designed to execute with a minimum of communication overhead. Task control blocks are stored on-chip, as are the headers for the entry, delay, and ready queues. All necessary scheduling is integrated into the operations. Delays are handled completely on-chip using an internal real-time clock.

A multilevel design strategy based on silicon compilation made it possible to run actual Ada programs on a functional emulator of the chip and to use the results to verify the detailed design. A high degree of parallelism and pipelining, together with an elaborate internal addressing scheme, has reduced the number of clock cycles needed to perform each operation. Using 2 μm CMOS, the processor can run at 20 MHz. A complex rendezvous, including the calling sequence and all necessary scheduling, can be performed in less than 15 μs.
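For readers unfamiliar with the construct being accelerated, the following is a minimal software sketch of Ada-style rendezvous semantics: the calling task blocks until the accepting task has executed the entry body and handed back a result. The Python class and names are invented for illustration; the chip described above implements this kind of handshake, plus queuing and scheduling, as hardware primitives.

```python
import threading
import queue

class Entry:
    """Toy model of an Ada entry: call() blocks until the accepting
    task has run the entry body, i.e., a full rendezvous."""
    def __init__(self):
        self.calls = queue.Queue()

    def call(self, arg):
        done = threading.Event()
        box = {}
        self.calls.put((arg, done, box))
        done.wait()                # caller blocks for the whole rendezvous
        return box["result"]

    def accept(self, body):
        arg, done, box = self.calls.get()
        box["result"] = body(arg)  # entry body runs in the accepting task
        done.set()                 # release the caller

entry = Entry()
acceptor = threading.Thread(target=lambda: entry.accept(lambda x: x + 1))
acceptor.start()
print(entry.call(41))  # -> 42
acceptor.join()
```

Each call involves queuing, blocking, and scheduling; that per-rendezvous overhead is precisely what the paper moves into dedicated hardware.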

18 citations

ReportDOI
01 Jun 2011
TL;DR: A new methodology is presented for utilizing all CPU cores and all GPUs of a heterogeneous multicore and multi-GPU system to support matrix computations efficiently, together with an auto-tuning method for determining the tile sizes that attain both high performance and load balancing.
Abstract: Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures. Fengguang Song, Stanimire Tomov, and Jack Dongarra (University of Tennessee; Oak Ridge National Laboratory; University of Manchester). This material is based upon work supported by NSF grants CCF-0811642 and OCI-0910735, by DOE grant DE-FC02-06ER25761, and by Microsoft Research.

We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach achieves the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system as a distributed-memory machine and to use a heterogeneous 1-D block cyclic distribution to allocate data to the host system and the GPUs so as to minimize communication. We have designed heterogeneous algorithms with two different tile sizes (one for CPU cores and the other for GPUs) to cope with processor heterogeneity, and we propose an auto-tuning method to determine the best tile sizes to attain both high performance and load balancing. We have also implemented a new runtime system and applied it to the Cholesky and QR factorizations. Our experiments on a compute node with two Intel Westmere hexa-core CPUs and three Nvidia Fermi GPUs demonstrate good weak scalability, strong scalability, load balance, and efficiency of our approach.

INTRODUCTION. As the performance of both multicore CPUs and GPUs continues to scale at a Moore's-law rate, it is becoming pervasive to use heterogeneous multicore and multi-GPU architectures to attain the highest performance possible from a single compute node. Before making parallel programs run efficiently on a distributed-memory system, it is critical to achieve high performance on a single node first. However, the heterogeneity of multicore and multi-GPU architectures has introduced new challenges to algorithm design and system software.

Over the last few years, our colleagues at the University of Tennessee have developed the PLASMA library [2] to solve linear algebra problems on multicore architectures. In parallel with PLASMA, we have also developed the MAGMA library [27] to solve linear algebra problems on GPUs. While PLASMA and MAGMA aim to provide the same routines as LAPACK [4], the former targets multicore CPUs and the latter a single core with an attached GPU. Our goal is to utilize all cores and all GPUs efficiently on a single multicore and multi-GPU system to support matrix computations.

[Figure 1: An example of a heterogeneous multicore and multi-GPU system. The host system is connected to four GPUs via two PCI Express connections. The host system and the GPUs have separate memory spaces.]

Figure 1 shows the architecture of the heterogeneous multicore and multi-GPU system we are considering: the multicore host is connected to four GPUs via two PCI Express connections, and each pair of GPUs shares a GPU switch. To design new software on this type of heterogeneous architecture, we must consider the following special features: (1) the host and the GPUs have different memory spaces, and an explicit memory copy is required to transfer data between the host and a GPU; (2) the system also differs from a distributed-memory machine, since each GPU is actually controlled by a thread running on the host (more like pthreads on a shared-memory machine); (3) there is processor heterogeneity between CPUs and GPUs; (4) GPUs are optimized for throughput and expect a larger input size than CPUs, which are optimized for latency [24]; (5) as the performance gap between a GPU and its PCI Express interconnection to the host grows, the network eventually becomes the bottleneck for the entire system. In this work we take all these factors into account and strive to meet the following objectives in order to obtain high performance: a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. We propose to design new heterogeneous algorithms and to use a simple but practical static data distribution to achieve these objectives simultaneously.

This paper describes heterogeneous rectangular tile algorithms with hybrid tile sizes, a heterogeneous 1-D block cyclic data distribution, a new runtime system, and an auto-tuning method to determine the hybrid tile sizes. The rectangular tile algorithms build upon the previous tile algorithms, which divide a matrix into square tiles and exhibit a high degree of parallelism and minimized synchronizations [13, 14].
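As a rough illustration of the heterogeneous 1-D block cyclic distribution described above, the sketch below deals tile columns out in rounds, with one block size for the host and another for each GPU. The function name and the tile-width parameters are hypothetical stand-ins for the paper's auto-tuned sizes.

```python
def block_cyclic_owners(n_cols, host_tile, gpu_tile, n_gpus):
    """Map each tile column to an owner using a heterogeneous 1-D
    block cyclic pattern: per round, the host receives host_tile
    columns and each GPU receives gpu_tile columns."""
    owners = []
    round_cols = host_tile + n_gpus * gpu_tile
    for c in range(n_cols):
        r = c % round_cols
        if r < host_tile:
            owners.append("host")
        else:
            owners.append("gpu%d" % ((r - host_tile) // gpu_tile))
    return owners

# 12 tile columns; the host takes 1 per round, each of 2 GPUs takes 2.
print(block_cyclic_owners(12, host_tile=1, gpu_tile=2, n_gpus=2))
# -> ['host', 'gpu0', 'gpu0', 'gpu1', 'gpu1', 'host', ...]
```

Because the pattern is static, every device can compute any tile's owner locally, which is part of what keeps synchronization and communication low.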

18 citations

Book ChapterDOI
26 Aug 2003
TL;DR: With recent advances in both hardware and software, it is now possible to create high quality images at interactive rates on commodity PC clusters.
Abstract: Due to its practical significance and its high degree of parallelism, ray tracing has always been an attractive target for research in parallel processing. With recent advances in both hardware and software, it is now possible to create high quality images at interactive rates on commodity PC clusters.

18 citations

Journal ArticleDOI
TL;DR: Analysis of a study directed to the specification and procurement of a new cockpit simulator for an advanced class of helicopters showed that a particularly cost-effective approach is to employ a large minicomputer acting as host and controller for a special-purpose digital peripheral processor.
Abstract: This paper describes some of the results of a study directed to the specification and procurement of a new cockpit simulator for an advanced class of helicopters. A part of the study was the definition of a challenging benchmark problem, and detailed analyses of it were made to assess the suitability of a variety of simulation techniques. The analyses showed that a particularly cost-effective approach to attaining adequate speed for this extremely demanding application is to employ a large minicomputer acting as host and controller for a special-purpose digital peripheral processor. Various realizations of such peripheral processors, all employing state-of-the-art electronic circuitry and a high degree of parallelism and pipelining, are available or under development. Three types of peripheral processor - array processors, simulation-oriented processors, and arrays of processing elements - are analyzed and compared. These are particularly promising approaches that should be suitable for high-speed simulations of all kinds, the cockpit simulator being a case in point.

18 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations (85% related)
Scheduling (computing): 78.6K papers, 1.3M citations (83% related)
Network packet: 159.7K papers, 2.2M citations (80% related)
Web service: 57.6K papers, 989K citations (80% related)
Quality of service: 77.1K papers, 996.6K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2022    1
2021    47
2020    48
2019    52
2018    70
2017    75