Topic

Degree of parallelism

About: Degree of parallelism is a research topic. Over the lifetime, 1515 publications have been published within this topic receiving 25546 citations.


Papers
Proceedings ArticleDOI
01 Sep 2021
TL;DR: Zhang et al. propose a near-data processing (NDP) architecture based on 3D-stacking that distributes key operations across NDP cores to exploit a high degree of parallelism and high memory bandwidth.
Abstract: De novo assembly of genomes for which there is no reference is essential for novel species discovery and metagenomics. In this work, we accelerate two key performance bottlenecks of DBG-based assembly, graph construction and graph traversal, with a near-data processing (NDP) architecture based on 3D-stacking. The proposed framework distributes key operations across NDP cores to exploit a high degree of parallelism and high memory bandwidth. We propose several optimizations based on domain-specific properties to improve the performance of our design. We integrate the proposed techniques into an existing DBG assembly tool, and our simulation-based evaluation shows that the proposed NDP implementation can improve the performance of graph construction by 33× and traversal by 16× compared to the state-of-the-art.
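The paper targets custom 3D-stacked hardware, but the partitioning idea behind distributing graph construction can be illustrated in ordinary software. The sketch below is purely illustrative (k-mer length, shard count, and function names are assumptions, not taken from the paper): k-mers are assigned to shards by a deterministic hash so that each shard of the de Bruijn graph can be built independently, which is roughly the kind of independence the NDP cores exploit.

```python
# Illustrative software analogy, not the paper's hardware design.
import zlib
from collections import defaultdict

K = 31           # k-mer length (assumed value)
NUM_SHARDS = 8   # stand-in for the number of NDP cores

def shard_of(kmer: str) -> int:
    # Deterministic hash so every worker agrees on k-mer ownership.
    return zlib.crc32(kmer.encode()) % NUM_SHARDS

def build_dbg_shards(reads):
    """Build one de Bruijn graph shard per 'core'; each shard maps a
    k-mer to the set of its successor k-mers seen in the reads."""
    shards = [defaultdict(set) for _ in range(NUM_SHARDS)]
    for read in reads:
        for i in range(len(read) - K):
            node, succ = read[i:i + K], read[i + 1:i + K + 1]
            shards[shard_of(node)][node].add(succ)  # edge owned by node's shard
    return shards
```

Each shard can then be traversed in parallel; walks that cross a shard boundary would require cross-core communication, which is where hardware support for high memory bandwidth pays off.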

4 citations

Patent
11 Apr 2012
TL;DR: In this article, a signal capture method for a satellite navigation system and a corresponding apparatus are described, in which a test of whether the phases of an energy peak value are consistent is combined with a comparison of the energy peak value against a threshold, so that whether a satellite signal is visible, and the frequency and code phase of that signal, can be determined.
Abstract: The invention provides a signal capture method in a satellite navigation system and an apparatus thereof. In the method and apparatus of the invention, a test of whether the phases of an energy peak value are consistent is combined with a comparison of the energy peak value against a threshold, so that whether a satellite signal is visible, and the frequency and code phase of the satellite signal, can be determined. The traditional capture approach of comparing the detected energy peak value against a threshold alone is not used, so capture time can be effectively reduced. With the degree of parallelism kept unchanged, a receiver using the method of the invention can be started rapidly; alternatively, if the same start-up time is required, the method allows the degree of parallelism of the energy monitor to be reduced, which lowers hardware implementation cost.
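As a rough illustration of the combined test described above, the following sketch (hypothetical function, data layout, and tolerances, none of which come from the patent) declares a satellite visible only when the strongest correlation-energy peak both exceeds a threshold and stays at a consistent location across successive integration rounds.

```python
# Hypothetical sketch of combining a peak-consistency test with a threshold test.
import numpy as np

def acquire(energy_maps, threshold, phase_tolerance=2):
    """energy_maps: list of 2-D arrays (Doppler bins x code phases),
    one per integration round (assumed input format)."""
    peaks = []
    for emap in energy_maps:
        dop, phase = np.unravel_index(np.argmax(emap), emap.shape)
        peaks.append((dop, phase, emap[dop, phase]))

    dops, phases, vals = zip(*peaks)
    # Consistency test: the peak location barely moves between rounds.
    consistent = (max(phases) - min(phases) <= phase_tolerance
                  and max(dops) - min(dops) <= 1)
    # Threshold test on the strongest observed peak.
    strong = max(vals) > threshold

    if consistent and strong:
        return {"visible": True, "doppler_bin": dops[-1], "code_phase": phases[-1]}
    return {"visible": False}
```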

4 citations

01 Aug 2013
TL;DR: This position paper puts forward possible ways to achieve preconditioned Krylov solvers that efficiently use all the resources on many-core chips and are extremely scalable on massively parallel machines.
Abstract: For many HPC codes the major cost lies in the solution of large sparse linear systems [1]. For many problems, the methods of choice for such systems are preconditioned Krylov solvers. However, Krylov solvers are hard to scale to a large number of cores due to two main bottlenecks: the inter-node latency and the on-node bandwidth. In this position paper, we review recently proposed techniques to overcome each of these bottlenecks and we put forward possible ways to achieve preconditioned Krylov solvers that efficiently use all the resources on many-core chips and are extremely scalable on massively parallel machines. A key ingredient in our approach is the use of stencil compilers.

Future supercomputers will have a large number of nodes, each being a many-core processor. In addition, the cores will feature vector processing units (VPUs) with very long vectors. New algorithms and software should exploit these three levels of parallelism. On massively parallel machines, global communication should be avoided as much as possible: it is very expensive due to the large latency on the wire, unless it can be overlapped with calculations. In Krylov solvers, there are usually at least two such global communication phases per iteration, used for orthogonalization and normalization of the Krylov basis vectors. In the standard formulation of most Krylov methods, there is no possibility to overlap this communication with local work, which leads to a bulk synchronous execution pattern that leaves many resources idle.

Recently, pipelined Krylov methods [6, 7] reorganized the algorithms to need only one global reduction per iteration. The reduction's latency can be overlapped with other work such as the (preconditioned) sparse matrix-vector product ((P)SpMV). While the reduction takes place in the background, new Krylov basis vectors can be computed using the (P)SpMV. An orthogonalization and normalization step is performed only once enough (P)SpMVs have been computed to completely hide the global communication latency. This deferred orthogonalization obviously changes the numerical properties of the Krylov algorithm; however, since only very few (P)SpMVs are required to completely hide the global latency, numerical stability is only mildly affected. This can be remedied by introducing shifts in the (P)SpMV that prevent the basis vectors from aligning with the dominant eigenvector, which results in an improved Krylov basis. Pipelined methods lift the main bottleneck for scaling Krylov solvers to extreme numbers of cores, and the resulting solver scales as well as the (P)SpMV.

For many applications, good scalability for the sparse matrix-vector product (SpMV) can be achieved even on 100k cores, if the problem is partitioned such that there is only local communication. For the preconditioner, there is typically a trade-off between parallelism and efficiency; we expect pipelined methods to be better suited for cheap preconditioners with a high degree of parallelism. Although the (P)SpMV may scale well on a distributed memory system, the on-node performance may still be poor. Within a node, the different threads have to share the available
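The core trick of the pipelined methods cited above is hiding the latency of the single global reduction behind local work. The minimal mpi4py sketch below shows only that overlap, a non-blocking all-reduce running while the local sparse matrix-vector product is computed; it is not the actual pipelined recurrences of [6, 7], and the function and variable names are assumptions for illustration.

```python
# Minimal sketch of latency hiding with a non-blocking reduction (mpi4py).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def overlapped_step(A_local, p_local, r_local):
    # Local contribution to a global dot product needed later in the iteration.
    local_dot = np.array([r_local @ r_local])
    global_dot = np.empty(1)
    req = comm.Iallreduce(local_dot, global_dot, op=MPI.SUM)  # starts in the background

    s_local = A_local @ p_local   # local SpMV runs while the reduction is in flight

    req.Wait()                    # the reduction result is only needed now
    return s_local, global_dot[0]
```

In a bulk synchronous formulation the dot product would block before the SpMV could start; reordering the algorithm so that the two can overlap is exactly what removes the latency bottleneck described above.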

4 citations

Journal ArticleDOI
TL;DR: This work describes a multiphase partitioning approach for the general case of DOALL loops that achieves a computation-plus-communication load-balanced partitioning through static data and iteration-space distribution.
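As a small illustration of static iteration-space distribution for a DOALL loop (the helper below is hypothetical and not taken from the paper), contiguous iteration blocks can be sized from statically known per-iteration costs so that each worker receives roughly equal total work rather than an equal number of iterations.

```python
# Illustrative static partitioning of a DOALL iteration space by cost.
import numpy as np

def balanced_blocks(costs, num_workers):
    """Split range(len(costs)) into num_workers contiguous blocks whose
    summed costs are approximately equal (costs assumed known statically)."""
    cum = np.cumsum(costs)
    total = cum[-1]
    bounds = [0]
    for w in range(1, num_workers):
        # first index whose cumulative cost reaches w/num_workers of the total
        bounds.append(int(np.searchsorted(cum, total * w / num_workers)))
    bounds.append(len(costs))
    return [(bounds[i], bounds[i + 1]) for i in range(num_workers)]

# Example: iteration i costs about i (e.g. a triangular loop nest),
# so later blocks contain fewer iterations.
print(balanced_blocks(np.arange(1, 1001), 4))
```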

4 citations

Proceedings ArticleDOI
29 Mar 1976
TL;DR: The power of this modeling technique with respect to comprehensibility, accuracy of representation and ease of validation and modification is demonstrated by application to modeling of the UT-2 operating system for the CDC 6000 series computer system.
Abstract: This paper defines and determines a graph model of a computer system in a form applicable to system performance analysis. The power of this modeling technique with respect to comprehensibility, accuracy of representation, and ease of validation and modification is demonstrated by application to modeling of the UT-2 operating system for the CDC 6000 series computer system. This multiprocessor, multi-programmed operating system with its high degree of parallelism provides an excellent test for the utility and range of application of graph models in performance evaluation. A programmed representation of the kernel monitor is used; all other system processes are represented in graph form and are input data to the simulator. A generally applicable technique for extracting graph representations of processes from event trace data is described and applied to the event trace generated by the UT-2 operating system. The technique is both complete and general and may be profitably applied to either partial or complete models of any type of complex computer system process where data or techniques for automated graph construction are available or can be applied.
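The sketch below illustrates, under assumed event names and trace format (the paper's own trace format is not given here), the kind of automated graph construction from an event trace that the abstract describes: counting transitions between successive events of each process yields a small transition graph that could serve as simulator input.

```python
# Illustrative graph extraction from an event trace (assumed format).
from collections import Counter, defaultdict

def graph_from_trace(trace):
    """trace: list of (process_id, event_name) tuples in time order.
    Returns, per process, a map event -> Counter of successor events."""
    last_event = {}
    graphs = defaultdict(lambda: defaultdict(Counter))
    for pid, event in trace:
        if pid in last_event:
            graphs[pid][last_event[pid]][event] += 1
        last_event[pid] = event
    return graphs

trace = [(1, "request_io"), (2, "compute"), (1, "wait"),
         (1, "io_done"), (2, "request_io"), (1, "compute")]
for pid, g in graph_from_trace(trace).items():
    print(pid, {k: dict(v) for k, v in g.items()})
```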

4 citations


Network Information
Related Topics (5)
Server: 79.5K papers, 1.4M citations, 85% related
Scheduling (computing): 78.6K papers, 1.3M citations, 83% related
Network packet: 159.7K papers, 2.2M citations, 80% related
Web service: 57.6K papers, 989K citations, 80% related
Quality of service: 77.1K papers, 996.6K citations, 79% related
Performance
Metrics
No. of papers in the topic in previous years
Year  Papers
2022  1
2021  47
2020  48
2019  52
2018  70
2017  75