
Showing papers by "Per Stenström published in 2011"


Proceedings ArticleDOI
31 May 2011
TL;DR: This work develops an HTM system that selects versioning and conflict-resolution policies at the granularity of cache lines; this match with the granularity of the cache coherence protocol yields a design that is very simple yet closely tracks or exceeds the performance of the best-performing policy for a given workload.
Abstract: Hardware Transactional Memory (HTM) systems, in prior research, have either fixed policies of conflict resolution and data versioning for the entire system or allowed a degree of flexibility at the level of transactions. Unfortunately, this results in susceptibility to pathologies, lower average performance over diverse workload characteristics, or high design complexity. In this work we explore a new dimension along which flexibility in policy can be introduced. Recognizing that contention is a property of the data rather than of an atomic code block, we develop an HTM system that allows selection of versioning and conflict resolution policies at the granularity of cache lines. We discover that this neat match in granularity with that of the cache coherence protocol results in a design that is very simple and yet able to closely track or exceed the performance of the best-performing policy for a given workload. It also brings together the benefits of parallel commits (inherent in traditional eager HTMs) and good optimistic concurrency without deadlock avoidance mechanisms (inherent in lazy HTMs), with little increase in complexity.
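The cache-line-granularity idea can be illustrated with a toy model. The sketch below is not the paper's actual mechanism; the class names, the abort-count heuristic, and the 64-byte line size are illustrative assumptions about how a per-line policy table might behave.

```python
# Hypothetical sketch: tracking a versioning/conflict-resolution policy per
# cache line instead of per transaction or per system. A line starts with
# the optimistic (lazy) policy and is flipped to the pessimistic (eager)
# policy once it has caused enough aborts, mirroring the idea that
# contention is a property of the data, not of the atomic block touching it.
from enum import Enum

class Policy(Enum):
    LAZY = "lazy"    # optimistic: buffer updates, detect conflicts at commit
    EAGER = "eager"  # pessimistic: acquire ownership, resolve conflicts early

class CacheLinePolicyTable:
    """Maps a cache-line address to its current policy."""

    def __init__(self, line_size=64, abort_threshold=2):
        self.line_size = line_size
        self.abort_threshold = abort_threshold
        self.abort_counts = {}  # cache-line index -> observed abort count

    def line_of(self, addr):
        return addr // self.line_size

    def policy_for(self, addr):
        line = self.line_of(addr)
        if self.abort_counts.get(line, 0) >= self.abort_threshold:
            return Policy.EAGER
        return Policy.LAZY

    def record_abort(self, addr):
        line = self.line_of(addr)
        self.abort_counts[line] = self.abort_counts.get(line, 0) + 1

table = CacheLinePolicyTable()
table.record_abort(0x1000)
table.record_abort(0x1008)          # same 64-byte line as 0x1000
print(table.policy_for(0x1000))     # contended line -> Policy.EAGER
print(table.policy_for(0x2000))     # untouched line -> Policy.LAZY
```

Because the policy lives with the line, uncontended data keeps the parallel-commit benefits of the lazy path while only the hot lines pay the eager-resolution cost.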

28 citations


Proceedings ArticleDOI
10 Oct 2011
TL;DR: p-TM is introduced, a novel HTM design that uses modest extensions to existing directory-based cache coherence protocols to keep a record of conflicting cache lines as a transaction executes, which allows a consistent cache state to be maintained when transactions commit or abort.
Abstract: Lazy hardware transactional memory (HTM) allows better utilization of available concurrency in transactional workloads than eager HTM, but poses challenges at commit time due to the requirement of en-masse publication of speculative updates to global system state. Early conflict detection can be employed in lazy HTM designs to allow non-conflicting transactions to commit in parallel. Though this has the potential to improve performance, it has not been utilized effectively so far. Prior work in the area burdens common-case transactional execution severely to avoid some relatively uncommon correctness concerns. In this work we investigate this problem and introduce a novel design, p-TM, which eliminates it. p-TM uses modest extensions to existing directory-based cache coherence protocols to keep a record of conflicting cache lines as a transaction executes. This information allows a consistent cache state to be maintained when transactions commit or abort. We observe that contention is typically seen only on a small fraction of shared data accessed by coarse-grained transactions. In p-TM, early conflict detection mechanisms imply additional work only when such contention actually exists. Thus, the design is able to avoid expensive core-to-core and core-to-directory communication for a large part of transactionally accessed data. Our evaluation shows major performance gains when compared to other HTM designs in this class and competitive performance when compared to more complex lazy commit schemes.

17 citations


BookDOI
01 Jan 2011
TL;DR: This volume collects chapters including "A High Performance Adaptive Miss Handling Architecture for Chip Multiprocessors" and "Characterizing Time-Varying Program Behavior Using Phase Complexity Surfaces".
Abstract: Contents:
- A High Performance Adaptive Miss Handling Architecture for Chip Multiprocessors
- Characterizing Time-Varying Program Behavior Using Phase Complexity Surfaces
- Compiler Directed Issue Queue Energy Reduction
- A Systematic Design Space Exploration Approach to Customising Multi-Processor Architectures: Exemplified Using Graphics Processors
- Microvisor: A Runtime Architecture for Thermal Management in Chip Multiprocessors
- Special Section on High-Performance and Embedded Architectures and Compilers (HiPEAC):
- A Highly Scalable Parallel Implementation of H.264
- Communication Based Proactive Link Power Management
- Finding Extreme Behaviors in Microprocessor Workloads
- Hybrid Super/Subthreshold Design of a Low Power Scalable-Throughput FFT Architecture
- Special Section on Selected Papers from the Workshop on Software and Hardware Challenges of Many-core Platforms:
- Transaction Reordering to Reduce Aborts in Software Transactional Memory
- A Parallelizing Compiler Cooperative Heterogeneous Multicore Processor Architecture
- A Modular Simulator Framework for Network-on-Chip Based Manycore Chips Using UNISIM
- Software Transactional Memory Validation - Time and Space Considerations
- Tiled Multi-Core Stream Architecture
- An Efficient and Flexible Task Management for Many Cores
- Special Section on International Symposium on Systems, Architectures, Modeling and Simulation:
- On Two-layer Brain-inspired Hierarchical Topologies: A Rent's Rule Approach
- Advanced Packet Segmentation and Buffering Algorithms in Network Processors
- Energy Reduction by Systematic Run-Time Reconfigurable Hardware Deactivation
- A Cost Model for Partial Dynamic Reconfiguration
- Heterogeneous Design in Functional DIF
- Signature-based Calibration of Analytical Performance Models for System-level Design Space Exploration

12 citations


Proceedings ArticleDOI
13 Sep 2011
TL;DR: The insights in this paper, on the interplay between buffering mechanisms, system policies, and workload characteristics, clearly distinguish performance gains attributable to write-buffering from those attributable to HTM policy.
Abstract: Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several design policies has been discussed at length in prior work within the framework of scalable chip-multiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated, and performance deviations due to the presence or absence of these optimizations remain unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights in this paper, on the interplay between buffering mechanisms, system policies and workload characteristics, clearly distinguish gains in performance to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information will facilitate sound design decisions when incorporating HTMs into parallel architectures.

11 citations


Proceedings ArticleDOI
26 Oct 2011
TL;DR: This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses and proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses.
Abstract: This paper analyzes the sources of performance losses in hardware transactional memory and investigates techniques to reduce the losses. It dissects the root causes of data conflicts in hardware transactional memory (HTM) systems into four classes: true sharing, false sharing, silent store, and write-write conflicts. These conflicts can cause performance and energy losses due to aborts and extra communication. To quantify losses, the paper first proposes the 5C cache-miss classification model that extends the well-established 4C model with a new class of cache misses known as contamination misses. The paper also contributes two techniques for removal of data conflicts: one for removal of false sharing conflicts and another for removal of silent store conflicts. In addition, it revisits and adapts a technique that is able to reduce losses due to both true and false conflicts. All of the proposed techniques can be accommodated in a lazy versioning and lazy conflict resolution HTM built on top of a MESI cache-coherence infrastructure with quite modest extensions. Their ability to reduce performance losses is quantitatively established, individually as well as in combination. Performance is improved substantially.
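The four conflict classes lend themselves to a toy decision procedure. The sketch below uses simplified, assumed criteria (word-offset comparison within one cache line, value comparison for silent stores), not the paper's exact definitions:

```python
# Illustrative classifier for why a writing transaction conflicted with
# another transaction on the same cache line. All predicates are
# simplifying assumptions for the sake of the example.
def classify_conflict(writer_offset, writer_value, old_value,
                      other_offset, other_is_write):
    """Classify a conflict between a transaction writing a word of a cache
    line and another transaction that accessed the same line."""
    if writer_offset != other_offset:
        return "false sharing"   # different words of the same line
    if writer_value == old_value:
        return "silent store"    # the store did not change the value
    if other_is_write:
        return "write-write"     # both transactions wrote the same word
    return "true sharing"        # genuine read-write communication

print(classify_conflict(0, 5, 3, 8, False))  # false sharing
print(classify_conflict(0, 7, 7, 0, True))   # silent store
```

In this toy taxonomy only the last category represents unavoidable communication; the first two are exactly the losses the paper's removal techniques target.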

10 citations


Proceedings ArticleDOI
13 Sep 2011
TL;DR: In this paper, the authors study the scalability of a set of data mining workloads that have negligible serial sections and show that asymmetric CMPs with one large and many tiny cores are only optimal for applications with a low reduction overhead.
Abstract: Amdahl's Law dictates that in parallel applications serial sections establish an upper limit on scalability. Asymmetric chip multiprocessors with a large core in addition to several small cores have recently been advocated as a promising design paradigm, because the large core can accelerate the execution of serial sections and hence mitigate the scalability bottlenecks due to large serial sections. This paper studies the scalability of a set of data mining workloads that have negligible serial sections. The formulation of Amdahl's Law, which optimistically assumes constant serial sections, estimates these workloads to scale to hundreds of cores in a chip multiprocessor (CMP). However, the overhead of carrying out merging (or reduction) operations makes scalability peak at a smaller number of cores. We establish this by extending the Amdahl speedup model to factor in the impact of reduction operations on the speedup of applications on symmetric as well as asymmetric CMP designs. Our analytical model estimates that asymmetric CMPs with one large and many tiny cores are only optimal for applications with a low reduction overhead. However, as the overhead starts to increase, the balance is shifted towards using fewer but more capable cores. This eventually limits the performance advantage of asymmetric over symmetric CMPs.
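The claimed peak can be reproduced with a back-of-the-envelope model. The reduction term below is an assumed form (a tree reduction costing r per level over log2(n) levels), not the paper's exact model:

```python
# Sketch of Amdahl's Law extended with a reduction (merge) overhead term.
import math

def amdahl(serial_frac, n):
    # Classic Amdahl speedup on n cores.
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

def amdahl_with_reduction(serial_frac, n, r):
    # r: per-level reduction overhead as a fraction of total work; a tree
    # reduction over n cores takes log2(n) levels, so the serial-like term
    # grows with the core count instead of staying constant.
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n + r * math.log2(n))

# With a negligible serial section, classic Amdahl predicts near-linear
# scaling to hundreds of cores...
print(amdahl(0.001, 256))
# ...but even a small reduction overhead makes the speedup peak at a much
# smaller core count.
best_n = max((2 ** k for k in range(1, 11)),
             key=lambda n: amdahl_with_reduction(0.001, n, 0.01))
print(best_n)
```

Under these assumed parameters the extended model peaks at a few tens of cores, which is the qualitative effect the abstract describes: growing merge overhead shifts the balance toward fewer, more capable cores.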

7 citations


Proceedings ArticleDOI
09 Oct 2011
TL;DR: This paper introduces the notion of silent loads, load accesses that can be satisfied by values already available in the physical register file, proposes a new architectural concept to exploit them, and shows that their prevalence is mostly inherent in applications.
Abstract: This paper introduces the notion of silent loads to classify load accesses that can be satisfied by already available values of the physical register file and proposes a new architectural concept to exploit such loads. The paper then unifies different approaches of eliminating memory accesses early by contributing with a new architectural scheme. We show that our unified approach covers previously proposed techniques of exploiting forwarded and small-value loads in addition to silent loads. Forwarded loads obtain values through load-to-load and store-to-load forwarding whereas small-value loads return small values that can be coded with 8 bits or less. We find that 22%, 31% and 24% of all dynamic loads are forwarded, small-value and silent, respectively. We demonstrate that the prevalence of such loads is mostly inherent in applications. We establish that a hypothetical scheme that encompasses all the categories can eliminate as many as 42% of all dynamic loads and about 18% of all committed stores. Finally, we contribute with a new architectural technique to implement the unified scheme. We show that our proposed scheme reduces execution time to provide noticeable speedup and reduces overall energy dissipation with very low area overhead.
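The three load categories can be expressed as a toy classifier. The predicates below are simplified assumptions based on the descriptions above (store-to-load forwarding, values representable in 8 bits, values already live in the physical register file); a load may fall into several categories at once, which is why the reported fractions (22%, 31%, 24%) overlap and the unified scheme eliminates 42% rather than their sum:

```python
# Illustrative sketch of classifying a dynamic load as forwarded,
# small-value, and/or silent. The inputs are simplified stand-ins for
# pipeline state; the real mechanism is microarchitectural.
def classify_load(value, addr, recent_stores, reg_file):
    """Return the set of categories a load falls into.

    recent_stores: {address: value} for stores still in flight
    reg_file:      set of values currently held in physical registers
    """
    cats = set()
    if addr in recent_stores:       # satisfied by store-to-load forwarding
        cats.add("forwarded")
    if -128 <= value <= 127:        # value fits in 8 bits or less
        cats.add("small-value")
    if value in reg_file:           # value already in a physical register
        cats.add("silent")
    return cats

print(classify_load(5, 0x10, {0x10: 5}, {5}))  # all three categories
print(classify_load(1000, 0x20, {}, set()))    # an ordinary load
```

A load in any of these sets never needs to reach the data cache, which is the source of the time and energy savings the abstract reports.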

3 citations


Proceedings ArticleDOI
31 May 2011
TL;DR: For symmetric and asymmetric architectures alike, chip area is better spent on fewer, more capable cores than on simply increasing the core count; moreover, the performance potential of asymmetric over symmetric multi-core architectures is limited for such applications.
Abstract: Amdahl's Law estimates parallel applications with negligible serial sections to potentially scale to many cores. However, due to merging phases in data mining applications, the serial sections do not remain constant. We extend Amdahl's model to accommodate this and establish that Amdahl's Law can overestimate the scalability offered by symmetric and asymmetric architectures for such applications. Implications: 1) A better use of the chip area is for fewer and hence more capable cores rather than simply increasing the number of cores for symmetric and asymmetric architectures and 2) The performance potential of asymmetric over symmetric multi-core architectures is limited for such applications.

2 citations


Proceedings ArticleDOI
16 May 2011
TL;DR: This paper studies how coherent buffering, for example in private caches, as proposed in several hardware TM designs, can lead to inefficiencies, and shows how such inefficiencies can be substantially mitigated by complete or partial non-coherent buffering of speculative writes in dedicated structures or suitably adapted standard per-core write-buffers.
Abstract: When supported in silicon, transactional memory (TM) promises to become a fast, simple and scalable parallel programming paradigm for future shared memory multiprocessor systems. Among the multitude of hardware TM design points and policies that have been studied so far, lazy conflict resolution designs often extract the most concurrency, but their inherent need for lazy versioning requires careful management of speculative updates. In this paper we study how coherent buffering, for example in private caches, as has been proposed in several hardware TM designs, can lead to inefficiencies. We then show how such inefficiencies can be substantially mitigated by using complete or partial non-coherent buffering of speculative writes in dedicated structures or suitably adapted standard per-core write-buffers. These benefits are particularly noticeable in scenarios involving large coarse-grained transactions that may write a lot of non-contended data in addition to actively shared data. We believe our analysis provides important insights into some overlooked aspects of TM behaviour and will prove useful to designers wishing to implement lazy TM schemes in hardware.

1 citation


01 Jan 2011
TL;DR: This paper studies the scalability of a set of data mining workloads that have negligible serial sections by extending Amdahl's speedup model to factor in the impact of reduction operations on the speedup of applications on symmetric as well as asymmetric CMP designs.
Abstract: Amdahl's Law estimates parallel applications with negligible serial sections to potentially scale to many cores. However, due to merging phases in data mining applications, the serial sections do not remain constant. We extend Amdahl's model to accommodate this and establish that Amdahl's Law can overestimate the scalability offered by symmetric and asymmetric architectures for such applications. Implications: 1) A better use of the chip area is for fewer and hence more capable cores rather than simply increasing the number of cores for symmetric and asymmetric architectures and 2) The performance potential of asymmetric over symmetric multi-core architectures is limited for such applications. © 2011 Authors.