Topic

Degree of parallelism

About: Degree of parallelism is a research topic. Over its lifetime, 1,515 publications have been published within this topic, receiving 25,546 citations.


Papers
Proceedings ArticleDOI
13 Nov 2010
TL;DR: The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices; the resulting tile CA-QR factorization outperforms the de facto standard ScaLAPACK library by up to 4 times and shows good scalability on up to 3,072 cores.
Abstract: As tile linear algebra algorithms continue achieving high performance on shared-memory multicore architectures, it is a challenging task to make them scalable on distributed-memory multicore cluster machines. The main contribution of this paper is the extension to the distributed-memory environment of the previous work done by Hadri et al. on Communication-Avoiding QR (CA-QR) factorizations for tall and skinny matrices (initially done on shared-memory multicore systems). The fine granularity of tile algorithms associated with communication-avoiding techniques for the QR factorization presents a high degree of parallelism where multiple tasks can be concurrently executed, computation and communication largely overlapped, and computation steps fully pipelined. A decentralized dynamic scheduler has been integrated as a runtime system to efficiently schedule tasks across the distributed resources. Our experimental results performed on two clusters (with dual-core and 8-core nodes, respectively) and a Cray XT5 system with 12-core nodes show that the tile CA-QR factorization is able to outperform the de facto standard ScaLAPACK library by up to 4 times for tall and skinny matrices, and has good scalability on up to 3,072 cores.

42 citations
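
The communication-avoiding structure described above can be illustrated with a small, single-process sketch of the tall-and-skinny QR (TSQR) reduction tree that CA-QR builds on: each row block is factored independently, and the small R factors are merged pairwise until one remains. The block count, the `tsqr_r_factor` helper, and the use of NumPy are illustrative assumptions; this is not the paper's distributed, tile-based implementation or its runtime scheduler.

```python
# Minimal single-process sketch of a TSQR-style reduction tree for a
# tall-and-skinny matrix A: factor row blocks independently, then merge
# the small R factors pairwise until one R remains.  A distributed CA-QR
# would run each level's factorizations on different nodes.
import numpy as np

def tsqr_r_factor(A, num_blocks=4):
    """Return the R factor of A via a binary reduction over row blocks."""
    blocks = np.array_split(A, num_blocks, axis=0)
    # Leaf level: independent QR of each row block (parallel in CA-QR).
    rs = [np.linalg.qr(b, mode='r') for b in blocks]
    # Reduction levels: stack pairs of R factors and re-factor them.
    while len(rs) > 1:
        merged = []
        for i in range(0, len(rs) - 1, 2):
            merged.append(np.linalg.qr(np.vstack([rs[i], rs[i + 1]]), mode='r'))
        if len(rs) % 2 == 1:          # odd block carried to the next level
            merged.append(rs[-1])
        rs = merged
    return rs[0]

if __name__ == "__main__":
    A = np.random.rand(4096, 16)       # tall and skinny: many rows, few columns
    R = tsqr_r_factor(A)
    R_ref = np.linalg.qr(A, mode='r')
    # R is unique only up to the signs of its rows, so compare magnitudes.
    print(np.allclose(np.abs(R), np.abs(R_ref)))
```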

Book ChapterDOI
01 Jan 1994
TL;DR: It is argued that multi-coloring can be combined with multiple-step relaxation preconditioners to achieve a good level of parallelism while maintaining good convergence rates.
Abstract: The degree of parallelism in the preconditioned Krylov subspace method using standard preconditioners is limited and can lead to poor performance on massively parallel computers. In this paper we examine this problem and consider a number of alternatives based both on multi-coloring ideas and on polynomial preconditioning. The emphasis is on methods that deal specifically with general unstructured sparse matrices, such as those arising from finite element methods on unstructured grids. It is argued that multi-coloring can be combined with multiple-step relaxation preconditioners to achieve a good level of parallelism while maintaining good convergence rates. We also exploit the idea of multi-coloring and independent set orderings to introduce a multi-elimination incomplete LU factorization, named ILUM, which is related to multifrontal elimination. The main goal of the paper is to discuss some of the prevailing ideas and to compare them on a few test problems.

42 citations
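
A minimal sketch of the multi-coloring idea referred to above: greedily color the adjacency graph of a sparse matrix so that unknowns sharing a color are mutually uncoupled, then relax each color class as one parallel batch. The greedy ordering, the dictionary-based graph, and the Gauss-Seidel sweep below are illustrative choices, not the paper's multiple-step relaxation preconditioners or ILUM.

```python
# Sketch of multi-coloring for parallel relaxation: nodes sharing a color
# have no matrix coupling, so a Gauss-Seidel sweep can update each color
# class simultaneously instead of strictly sequentially.
import numpy as np

def greedy_coloring(adj):
    """Greedy graph coloring; adj maps node -> set of neighbor nodes."""
    colors = {}
    for node in adj:
        used = {colors[nbr] for nbr in adj[node] if nbr in colors}
        c = 0
        while c in used:
            c += 1
        colors[node] = c
    return colors

def colored_gauss_seidel(A, b, colors, sweeps=200):
    """Gauss-Seidel where each color class is updated as one parallel batch."""
    n = len(b)
    x = np.zeros(n)
    by_color = {}
    for node, c in colors.items():
        by_color.setdefault(c, []).append(node)
    for _ in range(sweeps):
        for c in sorted(by_color):
            for i in by_color[c]:        # same-color rows: could run concurrently
                x[i] = (b[i] - A[i, :] @ x + A[i, i] * x[i]) / A[i, i]
    return x

if __name__ == "__main__":
    # 1-D Laplacian: tridiagonal, so its graph is a path and two colors suffice.
    n = 8
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    adj = {i: {j for j in (i - 1, i + 1) if 0 <= j < n} for i in range(n)}
    colors = greedy_coloring(adj)
    x = colored_gauss_seidel(A, np.ones(n), colors)
    print(np.allclose(A @ x, np.ones(n), atol=1e-6))
```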

Proceedings ArticleDOI
01 Sep 1996
TL;DR: This paper shows how to reduce time and space thread overhead using control flow and register liveness information inferred after compilation and reduces the overall execution time of fine-grain threaded programs by 15-30%.
Abstract: Modern languages and operating systems often encourage programmers to use threads, or independent control streams, to mask the overhead of some operations and simplify program structure. Multitasking operating systems use threads to mask communication latency, either with hardware devices or users. Client-server applications typically use threads to simplify the complex control flow that arises when multiple clients are used. Recently, the scientific computing community has started using threads to mask network communication latency in massively parallel architectures, allowing computation and communication to be overlapped. Lastly, some architectures implement threads in hardware, using those threads to tolerate memory latency. In general, it would be desirable if threaded programs could be written to expose the largest degree of parallelism possible, or to simplify the program design. However, threads incur time and space overheads, and programmers often compromise simple designs for performance. In this paper, we show how to reduce time and space thread overhead using control flow and register liveness information inferred after compilation. Our techniques work on binaries, are not specific to a particular compiler or thread library, and reduce the overall execution time of fine-grain threaded programs by ≈ 15-30%. We use execution-driven analysis and an instrumented operating system to show why the execution time is reduced and to indicate areas for future work.

41 citations
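
The register-liveness idea can be sketched as a toy backward dataflow pass over a straight-line trace: only registers live across a thread-switch point need to be saved and restored there. The instruction encoding, register names, and switch-point selection below are assumptions for illustration, not the paper's binary-rewriting tooling.

```python
# Toy backward liveness pass over a straight-line instruction trace.
# At a thread-switch point, only registers live at that point must be
# saved/restored, which is the source of the time and space savings the
# abstract describes.  Instruction format: (defs, uses) register sets.
def liveness_at_switch_points(instrs, switch_points):
    """Return {index: set of registers live on entry} for each switch index."""
    live = set()
    live_before = [None] * len(instrs)
    for i in range(len(instrs) - 1, -1, -1):
        defs, uses = instrs[i]
        live = (live - set(defs)) | set(uses)   # live-in = (live-out - defs) | uses
        live_before[i] = set(live)
    return {i: live_before[i] for i in switch_points}

if __name__ == "__main__":
    program = [
        ({"r1"}, {"r0"}),        # 0: r1 <- f(r0)
        ({"r2"}, {"r1"}),        # 1: r2 <- g(r1)
        (set(),  {"r2"}),        # 2: use r2            <- possible switch point
        ({"r0"}, set()),         # 3: r0 <- const
        (set(),  {"r0"}),        # 4: use r0
    ]
    # Only r2 is live entering instruction 2, so a switch there needs to save
    # one register instead of the full architectural register file.
    print(liveness_at_switch_points(program, switch_points=[2]))
```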

Proceedings ArticleDOI
10 Apr 1994
TL;DR: A multi-assignment language derived from the UNITY formalism is proposed to implement controllers with a high degree of parallelism on the ArMen FPGA-multiprocessor.
Abstract: Embedding an FPGA circular array into MIMD architectures allows one to synthesize fine-grain circuits for global computation support. These circuits operate concurrently with the distributed applications. They provide specific speed-ups or additional services, such as communication protocols or global controllers. This article describes an architectural model for such controllers, with practical examples implemented on the ArMen FPGA-multiprocessor. A multi-assignment language derived from the UNITY formalism is proposed to implement the controllers with a high degree of parallelism. Their hardware synthesis principles are given.

39 citations
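
The UNITY formalism that the controller language derives from uses multi-assignment: every right-hand side is evaluated against the old state, and all writes are committed together, which is what maps naturally onto synchronous parallel hardware. The tiny interpreter below is a generic sketch of that semantics only; the `multi_assign_step` helper is an assumed name for illustration and is unrelated to the ArMen language or its synthesis flow.

```python
# Minimal sketch of UNITY-style multi-assignment semantics: all right-hand
# sides read the *old* state, then every variable is updated simultaneously.
def multi_assign_step(state, assignments):
    """assignments maps variable -> function(old_state); others keep their value."""
    old = dict(state)                        # snapshot: all RHSs read the old state
    new = dict(state)
    for var, fn in assignments.items():
        new[var] = fn(old)                   # writes land together
    return new

if __name__ == "__main__":
    # Simultaneous swap plus a counter:  x, y := y, x  ||  n := n + 1
    state = {"x": 1, "y": 2, "n": 0}
    rules = {
        "x": lambda s: s["y"],
        "y": lambda s: s["x"],
        "n": lambda s: s["n"] + 1,
    }
    for _ in range(3):
        state = multi_assign_step(state, rules)
        print(state)   # the swap works because both RHSs saw the old values
```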

Proceedings Article
25 Feb 2008
TL;DR: It is demonstrated that this parameter is indeed critical, as it determines the degree of parallelism in the system, and optimal piece sizes for distributing small and large content are investigated.
Abstract: Peer-to-peer content distribution systems have been enjoying great popularity, and are now gaining momentum as a means of disseminating video streams over the Internet. In many of these protocols, including the popular BitTorrent, content is split into mostly fixed-size pieces, allowing a client to download data from many peers simultaneously. This makes piece size potentially critical for performance. However, previous research efforts have largely overlooked this parameter, opting to focus on others instead. This paper presents the results of real experiments with varying piece sizes on a controlled BitTorrent testbed. We demonstrate that this parameter is indeed critical, as it determines the degree of parallelism in the system, and we investigate optimal piece sizes for distributing small and large content. We also pinpoint a related design tradeoff, and explain how BitTorrent's choice of dividing pieces into subpieces attempts to address it.

39 citations
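
The tradeoff measured above can be made concrete with a little arithmetic: for a fixed content size, smaller pieces mean more pieces, hence a higher bound on how many peers can usefully serve a client in parallel, but also more per-piece overhead (one SHA-1 digest per piece in the metainfo, plus HAVE/REQUEST chatter). The overhead constants in the sketch below are rough assumptions, not measurements from the paper.

```python
# Back-of-the-envelope view of the piece-size tradeoff for a fixed content
# size: piece count bounds parallelism across peers, while per-piece costs
# (20-byte SHA-1 in the metainfo, HAVE/REQUEST chatter) grow as pieces shrink.
HASH_BYTES = 20            # one SHA-1 digest per piece in the .torrent file
PER_PIECE_MSG_BYTES = 64   # assumed rough cost of HAVE/REQUEST traffic per piece

def piece_size_tradeoff(content_bytes, piece_bytes):
    pieces = -(-content_bytes // piece_bytes)          # ceiling division
    metainfo = pieces * HASH_BYTES
    chatter = pieces * PER_PIECE_MSG_BYTES
    return pieces, metainfo, chatter

if __name__ == "__main__":
    content = 700 * 1024 * 1024                        # a 700 MiB file
    for piece_kib in (64, 256, 1024, 4096):
        pieces, metainfo, chatter = piece_size_tradeoff(content, piece_kib * 1024)
        print(f"{piece_kib:>5} KiB pieces: {pieces:>6} pieces "
              f"(parallelism bound), ~{metainfo // 1024} KiB metainfo, "
              f"~{chatter // 1024} KiB per-peer chatter")
```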


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations (85% related)
Scheduling (computing): 78.6K papers, 1.3M citations (83% related)
Network packet: 159.7K papers, 2.2M citations (80% related)
Web service: 57.6K papers, 989K citations (80% related)
Quality of service: 77.1K papers, 996.6K citations (79% related)
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2022    1
2021    47
2020    48
2019    52
2018    70
2017    75