scispace - formally typeset
Author

L. Budin

Bio: L. Budin is an academic researcher from the University of Zagreb. The author has contributed to research in the topics of shared memory and cache coherence. The author has an h-index of 1 and has co-authored 1 publication receiving 12 citations.

Papers
Journal ArticleDOI
TL;DR: New analytical models for predicting the performance of parallel applications under various cache coherence protocol assumptions are introduced and the potential advantage of using dynamic hybrid protocols is shown.
Abstract: In this paper, we introduce new analytical models for predicting the performance of parallel applications under various cache coherence protocol assumptions. The purpose of these models is to determine which protocols are to be used for which data blocks, and, in the case of dynamic protocols, also to determine when to change protocols. Although we focus on tightly-coupled multiprocessor systems, similar models can be derived for loosely-coupled distributed systems, such as networks of workstations. Our models are unique in that they lie between a large body of theoretical models that assume independence and a uniform distribution of memory accesses across processors, and a large body of address-trace oriented models that assume the availability of a precise characterization of interleaving behavior of memory accesses. The former are not very realistic, and the latter are not suitable for compile-time and run-time usage. In contrast, our models enable us to choose different input parameters depending on how the models will be used and depending on the needed accuracy in performance prediction. We present the models and show how the required parameters can be obtained. We assess the accuracy of our models on 15 parallel applications. For these applications, our most complete model predicts performance within a 10 percent margin when compared to a simulation of a sequentially consistent multiprocessor system. As part of this study, we also show the potential advantage of using dynamic hybrid protocols.
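The per-block protocol choice described in this abstract can be illustrated with a toy cost comparison. The parameter names, message costs, and formulas below are illustrative assumptions for exposition only; they are not the paper's actual analytical model:

```python
def block_cost(writes, remote_reads, sharers,
               inv_msg=1, update_word=1, refill=8):
    """Illustrative per-block bus cost (in bus words) under two protocols.

    write-invalidate: each remote read after a write pays one prior
    invalidation message plus a full cache-block refill.
    write-update: every write broadcasts the written word to all sharers.
    All costs here are made-up constants, not measured values.
    """
    return {
        "invalidate": remote_reads * (inv_msg + refill),
        "update": writes * sharers * update_word,
    }

def choose_protocol(writes, remote_reads, sharers):
    cost = block_cost(writes, remote_reads, sharers)
    return min(cost, key=cost.get)

# A block written often but rarely read remotely favors invalidation;
# a block whose writes are consumed by many sharers favors update.
print(choose_protocol(writes=100, remote_reads=2, sharers=4))   # invalidate
print(choose_protocol(writes=10, remote_reads=40, sharers=4))   # update
```

A dynamic hybrid protocol, in this simplified view, would re-evaluate such a cost estimate as the access counts drift and switch the block's protocol when the cheaper option changes.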

12 citations


Cited by
Journal ArticleDOI
TL;DR: This work presents a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels, and further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.
Abstract: In high-performance general-purpose workstations and servers, the workload is typically composed of both sequential and parallel applications. Shared-bus, shared-memory multiprocessors can be used to speed up the execution of such workloads. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing produce useless coherence traffic on the bus, and coping with this problem can be a challenging design issue for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present in dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about the access patterns to shared data exhibited by parallel applications.
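A minimal sketch of the passive-sharing problem described above, using a flush-on-migration policy as a stand-in for the compiler-informed protocol (the cache model, line counts, and the policy itself are illustrative assumptions, not the paper's actual mechanism):

```python
# Toy model: per-processor caches hold (pid, addr) lines. A process's
# *private* data left behind in the old cache after migration is
# "passively shared" and triggers useless invalidation traffic.

class Bus:
    def __init__(self):
        self.coherence_msgs = 0

class CPU:
    def __init__(self, bus):
        self.bus = bus
        self.lines = set()              # cached (pid, addr) pairs

    def write_private(self, pid, addr, others):
        # Plain write-invalidate: stale copies of this private line in
        # other caches must still be invalidated over the bus.
        for cpu in others:
            if (pid, addr) in cpu.lines:
                cpu.lines.discard((pid, addr))
                self.bus.coherence_msgs += 1
        self.lines.add((pid, addr))

def run(flush_on_migration):
    bus = Bus()
    cpu0, cpu1 = CPU(bus), CPU(bus)
    for addr in range(4):               # process 7 caches private data on cpu0
        cpu0.lines.add((7, addr))
    # Scheduler migrates process 7 to cpu1.
    if flush_on_migration:              # OS knows (from the compiler) the
        cpu0.lines = {l for l in cpu0.lines if l[0] != 7}  # lines are private
    for addr in range(4):
        cpu1.write_private(7, addr, [cpu0])
    return bus.coherence_msgs

print(run(flush_on_migration=False))    # 4 useless invalidations
print(run(flush_on_migration=True))     # 0
```

The point of the sketch: the coherence traffic on the stale copies carries no useful data, so a protocol that can identify private lines eliminates it entirely.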

34 citations

01 Jan 1999
TL;DR: In this article, the authors present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels, and evaluate the performance of this protocol and compare it with other solutions proposed in the literature by means of enhanced trace-driven simulation.

21 citations

Journal ArticleDOI
TL;DR: Three cache coherence mechanisms optimized for CMPs are presented, including a dynamic write-update mechanism augmented on top of a write-invalidate protocol, a bandwidth-adaptive mechanism to eliminate performance degradation from write-updates under limited bandwidth, and a proximity-aware mechanism to extend the base adaptive protocol with latency-based optimizations.
Abstract: In chip multiprocessors (CMPs), maintaining cache coherence can account for a major performance overhead. Write-invalidate protocols, adopted by most CMPs, generate high cache-to-cache miss rates under producer–consumer sharing patterns. Accordingly, this paper presents three cache coherence mechanisms optimized for CMPs. First, to reduce the coherence misses observed in write-invalidate-based protocols, we propose a dynamic write-update mechanism augmented on top of a write-invalidate protocol. This mechanism is triggered specifically on detection of a producer–consumer sharing pattern. Second, we extend this adaptive protocol with a bandwidth-adaptive mechanism to eliminate the performance degradation from write-updates under limited bandwidth. Finally, a proximity-aware mechanism is proposed to extend the base adaptive protocol with latency-based optimizations. Experimental analysis is conducted on a set of scientific applications from the SPLASH-2 and NAS parallel benchmark suites. The proposed mechanisms are shown to reduce coherence misses by up to 48% and in turn speed up application performance by up to 30%. The bandwidth-adaptive mechanism is shown to perform well under varying levels of available bandwidth. Our proposed proximity-aware extension demonstrated up to a 6% performance gain over the base adaptive protocol for 64-core tiled CMP runs. In addition, the analytical model provided good estimates of the performance gains from our adaptive protocols.
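The producer–consumer trigger described above can be sketched as a per-line alternation counter that switches a line from invalidate to update mode. The threshold and bookkeeping are illustrative assumptions, not the paper's actual detection mechanism:

```python
from collections import defaultdict

class LineState:
    """Illustrative per-cache-line mode tracker (not the paper's design)."""
    def __init__(self, threshold=3):
        self.threshold = threshold      # writer->reader alternations needed
        self.last_writer = None
        self.alternations = 0
        self.mode = "invalidate"

    def on_write(self, core):
        self.last_writer = core

    def on_read(self, core):
        # A read by a different core right after a write is the
        # producer-consumer signature; enough repeats flip the mode.
        if self.last_writer is not None and core != self.last_writer:
            self.alternations += 1
            if self.alternations >= self.threshold:
                self.mode = "update"    # push new values instead of invalidating

lines = defaultdict(LineState)
for _ in range(3):                      # steady producer-consumer traffic
    lines[0x40].on_write(core=0)        # core 0 produces
    lines[0x40].on_read(core=1)         # core 1 consumes
print(lines[0x40].mode)                 # update
```

In update mode the consumer's next read hits in its own cache, which is exactly the cache-to-cache miss the paper's mechanism targets.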

16 citations

01 Jan 2003
TL;DR: This dissertation provides a framework for evaluating the coherence communication traffic of different protocols and considers using more than one protocol in a DSM multiprocessor, and shows that no single protocol is best suited for all communication patterns.
Abstract: Assessment of Cache Coherence Protocols in Shared-memory Multiprocessors Alexander Grbic Doctor of Philosophy Graduate Department of Electrical and Computer Engineering University of Toronto 2003 The cache coherence protocol plays an important role in the performance of a distributed shared-memory (DSM) multiprocessor. A variety of cache coherence protocols exist and differ mainly in the scope of the sites that are updated by a write operation. These protocols can be complex and their impact on the performance of a multiprocessor system is often difficult to assess. To obtain good performance, both architects and users must understand processor communication, data locality, the properties of the interconnection network, and the nature of the coherence protocols. Analyzing the processor data sharing behavior and determining its effect on cache coherence communication traffic is the first step to a better understanding of overall performance. Toward this goal, this dissertation provides a framework for evaluating the coherence communication traffic of different protocols and considers using more than one protocol in a DSM multiprocessor. The framework consists of a data access characterization and the application of assessment rules. Its usefulness is demonstrated through an investigation into the performance of different cache coherence protocols for a variety of systems and parameters. It is shown to be effective for determining the relative performance of protocols and the effect of changes in system and application parameters. The investigation also shows that no single protocol is best suited for all communication patterns. Consequently, the dissertation also considers using more than one cache coherence protocol in a DSM multiprocessor. The results show that the hybrid protocol can significantly reduce traffic in all levels of the interconnection network with little effect on execution time.
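The dissertation's two-step flow, characterize the data accesses, then apply assessment rules, can be sketched as follows. The pattern labels and the rule table are illustrative stand-ins, not the dissertation's actual characterization or rules:

```python
def characterize(trace):
    """Classify one block's access trace: list of (core, 'r' | 'w').

    The categories below are common sharing-pattern names used for
    illustration; the real framework's characterization is richer.
    """
    writers = {c for c, op in trace if op == "w"}
    readers = {c for c, op in trace if op == "r"}
    writes = sum(1 for _, op in trace if op == "w")
    if writes == 0:
        return "read-mostly"
    if len(writers | readers) == 1:
        return "private"
    if len(writers) == 1 and readers - writers:
        return "producer-consumer"
    return "migratory"

ASSESSMENT_RULES = {                    # pattern -> protocol (illustrative)
    "private": "invalidate",
    "read-mostly": "invalidate",
    "producer-consumer": "update",
    "migratory": "invalidate",
}

def pick_protocol(trace):
    return ASSESSMENT_RULES[characterize(trace)]

print(pick_protocol([(0, "w"), (1, "r"), (0, "w"), (2, "r")]))  # update
print(pick_protocol([(0, "w"), (0, "r"), (1, "w"), (1, "r")]))  # invalidate
```

Running different blocks through such rules yields a per-block protocol assignment, which is the sense in which a hybrid, multi-protocol DSM can beat any single protocol.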

10 citations

Dissertation
01 Jan 1999
TL;DR: The architecture and design process leading to a working 48-processor prototype are described in detail, and analysis of the system is based on a cycle-accurate, execution-driven simulator developed as part of the thesis.
Abstract: This dissertation considers the design and analysis of NUMAchine: a distributed, shared-memory multiprocessor. The architecture and design process leading to a working 48-processor prototype are described in detail. Analysis of the system is based on a cycle-accurate, execution-driven simulator developed as part of the thesis. An exploration of the design space is also undertaken to provide some intuition as to possible future enhancements to the architecture.

4 citations