
Showing papers by "Todd C. Mowry" published in 2000


Proceedings ArticleDOI
01 May 2000
TL;DR: This paper proposes and evaluates a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down).
Abstract: While architects understand how to build cost-effective parallel machines across a wide spectrum of machine sizes (ranging from within a single chip to large-scale servers), the real challenge is how to easily create parallel software to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this paper, we propose and evaluate a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on both single-chip multiprocessors and on larger-scale machines where communication latencies are twenty times larger.
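
To make the coherence-based idea concrete, here is a hedged sketch in C (a software simulation invented for illustration, not the paper's actual hardware protocol): each cache line records which speculative epochs have loaded it, and the invalidation a store sends out doubles as the dependence check that squashes any logically later epoch that read the line too early. The `spec_read` bits and epoch numbering are assumptions of this sketch.

```c
/* Minimal simulation of TLS violation detection piggybacked on
 * invalidation-based coherence (illustrative only). */
#include <stdbool.h>
#include <stdio.h>

#define NUM_EPOCHS 4
#define NUM_LINES  8

static bool spec_read[NUM_LINES][NUM_EPOCHS]; /* "speculatively loaded" bits */
static bool violated[NUM_EPOCHS];

static void tls_load(int epoch, int line)
{
    spec_read[line][epoch] = true;       /* record the speculative read */
}

static void tls_store(int epoch, int line)
{
    /* The invalidation this store generates also checks dependences:
     * any logically LATER epoch that already read this line saw stale data. */
    for (int e = epoch + 1; e < NUM_EPOCHS; e++)
        if (spec_read[line][e]) {
            violated[e] = true;          /* squash and restart epoch e */
            spec_read[line][e] = false;
        }
}

int main(void)
{
    tls_load(2, 5);   /* epoch 2 speculatively reads line 5 */
    tls_store(1, 5);  /* earlier epoch 1 writes line 5: RAW violation */
    printf("epoch 2 violated: %s\n", violated[2] ? "yes" : "no");
    return 0;
}
```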

401 citations


Proceedings ArticleDOI
22 Oct 2000
TL;DR: It is shown that the impact of out-of-core applications on interactive ones can be greatly mitigated through an approach that integrates compiler analysis with simple OS support and a run-time layer that adapts to dynamic conditions.
Abstract: Out-of-core applications consume physical resources at a rapid rate, causing interactive applications sharing the same machine to exhibit poor response times. This behavior is the result of default resource management strategies in the OS that are inappropriate for memory-intensive applications. Using an approach that integrates compiler analysis with simple OS support and a run-time layer that adapts to dynamic conditions, we have shown that the impact of out-of-core applications on interactive ones can be greatly mitigated. A combination of prefetching pages that will soon be needed, and releasing pages no longer in use results in good throughput for the out-of-core task and good response time for the interactive one. Each class of application performs well according to the metric most important to it. In addition, the OS does not need to attempt to identify these application classes, or modify its default resource management policies in any way. We also observe that when an out-of-core application releases pages, it both improves the response time of interactive tasks, and also improves its own performance through better replacement decisions and reduced memory management overhead.
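
The paper's prefetch/release hints are compiler-inserted and cooperate with OS and run-time support; as a rough user-space approximation, the same prefetch-and-release discipline can be expressed on Linux with madvise(2). The sketch below (a sequential scan over a large mmap'd file, with an assumed 4 MB window) is illustrative only, not the paper's mechanism.

```c
/* Prefetch pages needed soon, release pages no longer in use. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (4L * 1024 * 1024)         /* 4 MB processing window */

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) return 1;
    struct stat st;
    fstat(fd, &st);
    char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) return 1;

    long sum = 0;
    for (off_t off = 0; off < st.st_size; off += CHUNK) {
        off_t len = st.st_size - off < CHUNK ? st.st_size - off : CHUNK;

        if (off + CHUNK < st.st_size) {  /* prefetch pages needed soon */
            off_t pre = st.st_size - (off + CHUNK);
            madvise(base + off + CHUNK, pre < CHUNK ? pre : CHUNK,
                    MADV_WILLNEED);
        }
        for (off_t i = 0; i < len; i++)  /* the actual out-of-core work */
            sum += base[off + i];
        /* releasing pages we are done with helps interactive tasks and
         * improves our own replacement decisions */
        madvise(base + off, len, MADV_DONTNEED);
    }
    printf("checksum: %ld\n", sum);
    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```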

49 citations


Proceedings ArticleDOI
08 Jan 2000
TL;DR: A novel approach to switch-on-miss multithreading that is software-controlled rather than hardware-controlled is explored; it requires substantially less hardware support than previous schemes and is not likely to degrade single-thread performance.
Abstract: To help tolerate the latency of accessing remote data in a shared-memory multiprocessor, we explore a novel approach to switch-on-miss multithreading that is software-controlled rather than hardware-controlled. Our technique uses informing memory operations to trigger the thread switches with sufficiently low overhead that we observe speedups of 10% or more for four out of seven applications, with one application speeding up by 14%. By selectively applying register partitioning to reduce thread switching overhead, we can achieve further gains: e.g., an overall speedup of 23% for FFT. Although this software-controlled approach does not match the performance of hardware-controlled schemes on multithreaded workloads, it requires substantially less hardware support than previous schemes and is not likely to degrade single-thread performance. As remote memory accesses continue to become more expensive relative to software overheads, we expect software-controlled multithreading to become increasingly attractive in the future.
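
Informing memory operations are a hardware feature, and portable C cannot observe cache misses, so the `informing_load` helper and its `miss` flag below are stand-ins invented purely to show the switch-on-miss control flow; POSIX ucontext(3) supplies the lightweight thread contexts. This is a hedged sketch, not the paper's mechanism.

```c
/* Fake switch-on-miss multithreading: on a (simulated) miss, the software
 * handler switches to another ready thread to hide the latency. */
#include <stdbool.h>
#include <stdio.h>
#include <ucontext.h>

#define NTHREADS 2

static ucontext_t ctx[NTHREADS];
static int current = 0;

static void switch_thread(void)          /* the software "miss handler" */
{
    int prev = current;
    current = (current + 1) % NTHREADS;  /* round-robin to another thread */
    swapcontext(&ctx[prev], &ctx[current]);
}

static int informing_load(const int *addr, bool miss)
{
    if (miss)              /* real hardware would detect the miss itself */
        switch_thread();   /* hide the miss latency under another thread */
    return *addr;
}

static int data[NTHREADS] = { 10, 20 };

static void worker(int id)
{
    for (int i = 0; i < 3; i++)          /* pretend every other load misses */
        printf("thread %d loaded %d\n", id,
               informing_load(&data[id], i % 2 == 0));
}

int main(void)
{
    static char stacks[NTHREADS][16384];
    ucontext_t done;

    for (int t = 0; t < NTHREADS; t++) {
        getcontext(&ctx[t]);
        ctx[t].uc_stack.ss_sp = stacks[t];
        ctx[t].uc_stack.ss_size = sizeof stacks[t];
        ctx[t].uc_link = &done;          /* fall back to main when finished */
        makecontext(&ctx[t], (void (*)(void))worker, 1, t);
    }
    /* a real scheduler would keep switching until every thread completes;
     * here the program simply exits once the first worker returns */
    swapcontext(&done, &ctx[0]);
    return 0;
}
```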

19 citations


Dissertation
01 Jan 2000
TL;DR: Cooperative instruction prefetching is proposed, a novel technique which significantly outperforms state-of-the-art instruction prefetching schemes by prefetching more aggressively and much further ahead of time while substantially reducing the amount of useless prefetches; new cache miss prediction techniques based on correlation profiling are also proposed.
Abstract: The latency of accessing instructions and data from the memory subsystem is an increasingly crucial performance bottleneck in modern computer systems. While cache hierarchies are an important first step, they alone cannot solve the problem. Further, though a variety of latency-hiding techniques have been proposed, their success has been largely limited to regular, numeric applications. Few promising latency-hiding techniques that can handle irregular, non-numeric codes have been proposed, in spite of the popularity of such codes in computer applications. This dissertation investigates hardware and software techniques for coping with the instruction-access latency and data-access latency in non-numeric applications. To deal with instruction-access latency, we propose cooperative instruction prefetching, a novel technique which significantly outperforms state-of-the-art instruction prefetching schemes by being able to prefetch more aggressively and much further ahead of time while at the same time substantially reducing the amount of useless prefetches. To cope with data-access latency, we investigate three complementary techniques. First, we study how to use compiler-inserted data prefetching to tolerate the latency of accessing pointer-based data structures. To schedule prefetches early enough, we design three prefetching schemes to overcome the pointer-chasing problem associated with these data structures, and we automate them in an optimizing research compiler. Second, we study how to safely perform an important class of locality optimizations, namely dynamic data layout optimizations, in non-numeric codes. Specifically, we propose the use of an architectural mechanism called memory forwarding which can guarantee the safety of data relocation, thereby enabling many aggressive data layout optimizations (which also facilitate prefetching) that cannot be safely performed using current hardware or compiler technology. Finally, in an effort to minimize the overheads of latency tolerance techniques, we propose new cache miss prediction techniques based on correlation profiling. By correlating cache miss behaviors with dynamic execution contexts, these techniques can accurately isolate dynamic miss instances and so pay the latency tolerance overhead only when there would have been cache misses. Detailed design considerations and experimental evaluations are provided for our proposed techniques, confirming them as viable solutions for coping with memory latency in non-numeric applications.
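
As one concrete instance of pointer-chasing prefetching, here is a hand-written sketch of greedy pointer prefetching, in the spirit of the schemes this line of work automates in a compiler. The node layout and GCC/Clang's `__builtin_prefetch` intrinsic are assumptions of this illustration, not details taken from the text.

```c
/* Greedy pointer prefetching over a linked list: launch a non-binding
 * prefetch for the next node as soon as its address is known. */
#include <stddef.h>

struct node {
    struct node *next;
    int payload[28];    /* padding so nodes tend to sit on distinct lines */
};

int sum_list(const struct node *p)
{
    int sum = 0;
    while (p != NULL) {
        if (p->next)    /* the next node's address is known one hop early */
            __builtin_prefetch(p->next, /* rw = */ 0, /* locality = */ 1);
        for (int i = 0; i < 28; i++)   /* work that overlaps the prefetch */
            sum += p->payload[i];
        p = p->next;
    }
    return sum;
}
```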

9 citations


Journal ArticleDOI
TL;DR: A new profiling technique, correlation profiling, is proposed and evaluated that helps predict which dynamic instances of a static memory reference will hit or miss in the cache; it is also demonstrated that software prefetching achieves better performance on a modern superscalar processor when directed by correlation profiling rather than by summary profiling information.
Abstract: Latency-tolerance techniques offer the potential for bridging the ever-increasing speed gap between the memory subsystem and today's high-performance processors. However, to fully exploit the benefit of these techniques, one must be careful to apply them only to the dynamic references that are likely to suffer cache misses; otherwise the runtime overheads can potentially offset any gains. In this paper, we focus on isolating dynamic miss instances in nonnumeric applications, which is a difficult but important problem. Although compilers cannot statically analyze data locality in nonnumeric applications, one viable approach is to use profiling information to measure the actual miss behavior. Unfortunately, the state-of-the-art in cache miss profiling (which we call summary profiling) is inadequate for references with intermediate miss ratios: it either misses opportunities to hide latency, or else inserts overhead that is unnecessary. To overcome this problem, we propose and evaluate a new profiling technique that helps predict which dynamic instances of a static memory reference will hit or miss in the cache: correlation profiling. Our experimental results demonstrate that roughly half of the 21 nonnumeric applications we study can potentially enjoy significant reductions in memory stall time by exploiting at least one of the three forms of correlation profiling we consider: control-flow correlation, self correlation, and global correlation. In addition, our detailed case studies illustrate that self correlation succeeds because a given reference's cache outcomes often contain repeated patterns, and control-flow correlation succeeds because cache outcomes are often call-chain dependent. Finally, we suggest a number of ways to exploit correlation profiling in practice and demonstrate that software prefetching can achieve better performance on a modern superscalar processor when directed by correlation profiling rather than summary profiling information.
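
A minimal sketch of the self-correlation idea follows: for one static load, keep the last few hit/miss outcomes as a bit-pattern history and count, per pattern, how often the next access misses. Patterns strongly biased toward "miss" are the dynamic instances worth paying latency-tolerance overhead for. The history length and the synthetic periodic miss pattern are invented for this illustration.

```c
/* Self correlation profiling for one static reference: condition the
 * next-outcome statistics on the last HIST outcomes of that reference. */
#include <stdio.h>

#define HIST 3                       /* correlate on the last 3 outcomes */
#define PATTERNS (1 << HIST)

static unsigned hits[PATTERNS], misses[PATTERNS];
static unsigned history;             /* low HIST bits: 1 = miss, 0 = hit */

static void record(int miss)
{
    if (miss) misses[history]++; else hits[history]++;
    history = ((history << 1) | (miss & 1)) & (PATTERNS - 1);
}

int main(void)
{
    /* a repeating outcome pattern: every 4th access misses */
    for (int i = 0; i < 400; i++)
        record(i % 4 == 0);

    for (unsigned pat = 0; pat < PATTERNS; pat++) {
        unsigned n = hits[pat] + misses[pat];
        if (n)
            printf("history %u%u%u -> miss ratio %.2f\n",
                   (pat >> 2) & 1, (pat >> 1) & 1, pat & 1,
                   (double)misses[pat] / n);
    }
    return 0;
}
```

Because the demo's outcome stream is periodic, each observed history maps deterministically to the next outcome, so the printed miss ratios are 0.00 or 1.00: exactly the "repeated patterns" effect the paper credits for self correlation's success.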

5 citations