Showing papers by "Todd C. Mowry" published in 2002


Proceedings Article
01 Oct 2002
TL;DR: This paper presents and evaluates dataflow algorithms for three increasingly aggressive instruction scheduling techniques that shorten the critical forwarding path introduced by the synchronization needed to forward scalar values between threads under Thread-Level Speculation (TLS).
Abstract: While there have been many recent proposals for hardware that supports Thread-Level Speculation (TLS), there has been relatively little work on compiler optimizations to fully exploit this potential for parallelizing programs optimistically. In this paper, we focus on one important limitation of program performance under TLS, which is stalls due to forwarding scalar values between threads that would otherwise cause frequent data dependences. We present and evaluate dataflow algorithms for three increasingly-aggressive instruction scheduling techniques that reduce the critical forwarding path introduced by the synchronization associated with this data forwarding. In addition, we contrast our compiler techniques with related hardware-only approaches. With our most aggressive compiler and hardware techniques, we improve performance under TLS by 6.2-28.5% for 6 of 14 applications, and by at least 2.7% for half of the other applications.

134 citations


Proceedings Article
03 Jun 2002
TL;DR: This paper proposes fractal prefetching B+-Trees (fpB+-Trees), which embed cache-optimized trees within disk-optimized trees in order to optimize both cache and I/O performance.
Abstract: B+-Trees have been traditionally optimized for I/O performance with disk pages as tree nodes. Recently, researchers have proposed new types of B+-Trees optimized for CPU cache performance in main memory environments, where the tree node sizes are one or a few cache lines. Unfortunately, due primarily to this large discrepancy in optimal node sizes, existing disk-optimized B+-Trees suffer from poor cache performance while cache-optimized B+-Trees exhibit poor disk performance. In this paper, we propose fractal prefetching B+-Trees (fpB+-Trees), which embed "cache-optimized" trees within "disk-optimized" trees, in order to optimize both cache and I/O performance. We design and evaluate two approaches to breaking disk pages into cache-optimized nodes: disk-first and cache-first. These approaches are somewhat biased in favor of maximizing disk and cache performance, respectively, as demonstrated by our results. Both implementations of fpB+-Trees achieve dramatically better cache performance than disk-optimized B+-Trees: a factor of 1.1-1.8 improvement for search, up to a factor of 4.2 improvement for range scans, and up to a 20-fold improvement for updates, all without significant degradation of I/O performance. In addition, fpB+-Trees accelerate I/O performance for range scans by using jump-pointer arrays to prefetch leaf pages, thereby achieving a speed-up of 2.5-5 on IBM's DB2 Universal Database.

127 citations


Proceedings Article
02 Feb 2002
TL;DR: This paper shows how to apply value prediction, dynamic synchronization and hardware instruction prioritization to improve value communication and hence performance in several SPECint benchmarks that have been automatically transformed by the compiler to exploit TLS.
Abstract: Thread-level speculation (TLS) allows us to automatically parallelize general-purpose programs by supporting parallel execution of threads that might not actually be independent. In this paper, we show that the key to good performance lies in the three different ways to communicate a value between speculative threads: speculation, synchronization and prediction. The difficult part is deciding how and when to apply each method. This paper shows how we can apply value prediction, dynamic synchronization and hardware instruction prioritization to improve value communication and hence performance in several SPECint benchmarks that have been automatically transformed by our compiler to exploit TLS. We find that value prediction can be effective when properly throttled to avoid the high costs of mis-prediction, while most of the gains of value prediction can be more easily achieved by exploiting silent stores. We also show that dynamic synchronization is quite effective for most benchmarks, while hardware instruction prioritization is not. Overall, we find that these techniques have great potential for improving the performance of TLS.

122 citations