
Showing papers by "Todd C. Mowry published in 2005"


Journal ArticleDOI
TL;DR: One possible outcome of continued progress in high-volume nanoscale assembly is the ability to inexpensively produce millimeter-scale units that integrate computing, sensing, actuation, and locomotion mechanisms.
Abstract: In the past 50 years, computers have shrunk from room-size mainframes to lightweight handhelds. This fantastic miniaturization is primarily the result of high-volume nanoscale manufacturing. While this technology has predominantly been applied to logic and memory, it's now being used to create advanced microelectromechanical systems using both top-down and bottom-up processes. One possible outcome of continued progress in high-volume nanoscale assembly is the ability to inexpensively produce millimeter-scale units that integrate computing, sensing, actuation, and locomotion mechanisms. A collection of such units can be viewed as a form of programmable matter.

290 citations


Journal ArticleDOI
TL;DR: This article proposes and evaluates a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down).
Abstract: Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software to allow single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, program performance improves by 86% and 56% for two of the 13 general-purpose applications studied, by more than 8% for four others, and by an average of 16% across all applications---confirming that TLS is a promising way to exploit the naturally multithreaded processing resources of future computer systems.
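
To make the TLS idea concrete, here is a minimal sketch (the loop and comments are invented for illustration; they are not taken from the article) of the kind of loop TLS targets: the compiler cannot prove the iterations independent, since a[] may alias itself through idx[], so conventional auto-parallelization must give up.

```cpp
// Under TLS, each iteration (or chunk of iterations) runs as a speculative
// thread. The extended write-back invalidation-based coherence protocol
// detects cross-thread read-write conflicts at cache-line granularity and
// squashes and re-executes the later thread.
void scale(float* a, const float* b, const int* idx, int n, float k) {
    for (int i = 0; i < n; ++i) {
        // If idx[i] == idx[j] for some i < j, a true dependence exists and
        // the later speculative thread is restarted; otherwise all commit.
        a[idx[i]] += b[i] * k;
    }
}
```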

182 citations


Proceedings ArticleDOI
01 Jan 2005
TL;DR: This paper reports initial positive results from a prototype proxy caching system that point to the feasibility of a third party offering scalability as a subscription service with “per-click” pricing to providers of dynamic Web applications.
Abstract: Providers of dynamic Web applications are currently unable to accommodate heavy usage without significant investment in infrastructure and in-house management capability. Our goal is to develop technology to enable a third party to offer scalability as a subscription service with “per-click” pricing to application providers. To this end we have developed a prototype proxy caching system able to scale delivery of dynamic Web content to a large number of users. In this paper we report initial positive results obtained from our prototype that point to the feasibility of our goal. We also report the shortcomings of our current prototype, the chief one being the lack of a scalable method of managing data consistency. We then present our initial work on a novel approach to scalable consistency management. Our approach is based on a fully distributed mechanism that does not require content providers to assist in managing the consistency of remotely cached data. Finally, we describe our ongoing efforts to characterize the inherent tradeoff between scalability and data secrecy, a crucial issue in environments shared by multiple organizations.

41 citations


Proceedings Article
09 Jul 2005
TL;DR: This work presents substantial challenges in mechanical and electronic design, control, programming, reliability, power delivery, and motion planning, and holds the promise of radically altering the relationship between computation, humans, and the physical world.
Abstract: We demonstrate modular robot prototypes developed as part of the Claytronics Project (Goldstein et al. 2005). Among the novel features of these robots ("catoms") is their ability to reconfigure (move) relative to one another without moving parts. The absence of moving parts is central to one key aim of our work, namely, plausible manufacturability at smaller and smaller physical scales using high-volume, low-unit-cost techniques such as batch photolithography, multimaterial submicron 3D lithographic processing, and self-assembly. Claytronics envisions multi-million-module robot ensembles able to form into three-dimensional scenes, eventually with sufficient fidelity so as to convince a human observer the scenes are real. This work presents substantial challenges in mechanical and electronic design, control, programming, reliability, power delivery, and motion planning (among other areas), and holds the promise of radically altering the relationship between computation, humans, and the physical world.

31 citations


Proceedings Article
30 Aug 2005
TL;DR: This work shows how inspector joins, employing novel statistics and specialized indexes, match or exceed the performance of state-of-the-art cache-friendly hash join algorithms.
Abstract: The key idea behind Inspector Joins is that during the I/O partitioning phase of a hash-based join, we have the opportunity to look at the actual data itself and then use this knowledge in two ways: (1) to create specialized indexes, specific to the given query on the given data, for optimizing the CPU cache performance of the subsequent join phase of the algorithm, and (2) to decide which join phase algorithm best suits this specific query. We show how inspector joins, employing novel statistics and specialized indexes, match or exceed the performance of state-of-the-art cache-friendly hash join algorithms. For example, when run on eight or more processors, our experiments show that inspector joins offer 1.1-1.4X speedups over these previous algorithms, with the speedup increasing as the number of processors increases.
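
A schematic sketch of the inspection idea follows (all names and thresholds are invented here for illustration; this is not the paper's interface): while tuples stream into partitions anyway, per-partition statistics can be gathered essentially for free and then used to pick the join-phase variant before the second pass begins.

```cpp
#include <cstdint>
#include <vector>

// Per-partition summary gathered during the I/O partitioning pass.
struct PartitionStats {
    uint64_t tuples = 0;
    uint64_t top_key_count = 0;  // occurrences of the most frequent key seen
};

enum class JoinVariant { CacheOptimized, SkewResistant };

// Hypothetical decision step: heavily skewed partitions get a join-phase
// algorithm that tolerates skew; the common case gets the cache-optimized
// variant driven by the specialized, query-specific indexes.
JoinVariant choose_variant(const std::vector<PartitionStats>& stats) {
    for (const auto& s : stats)
        if (s.tuples > 0 && s.top_key_count * 2 > s.tuples)  // >50% one key
            return JoinVariant::SkewResistant;
    return JoinVariant::CacheOptimized;
}
```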

21 citations


Proceedings Article
30 Aug 2005
TL;DR: This paper shows how dividing a transaction into speculative threads solves both problems that make exploiting intra-transaction parallelism difficult in existing database systems: it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer.
Abstract: With the advent of chip multiprocessors, exploiting intra-transaction parallelism is an attractive way of improving transaction performance. However, exploiting intra-transaction parallelism in existing database systems is difficult, for two reasons: first, significant changes are required to avoid races or conflicts within the DBMS, and second, adding threads to transactions requires a high level of sophistication from transaction programmers. In this paper we show how dividing a transaction into speculative threads solves both problems---it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer. Our technique requires a limited number of small, localized changes to a subset of the low-level data structures in the DBMS. Through this method of parallelizing transactions we can dramatically improve performance: on a simulated 4-processor chip-multiprocessor, we improve the response time by 36-74% for three of the five TPC-C transactions.
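
As a rough illustration of why the details stay hidden from the transaction programmer, consider a sketch of a New Order-style transaction (the types and the division into epochs below are invented for this sketch; the paper's mechanism lives in the TLS runtime and hardware, not in the transaction code):

```cpp
#include <vector>

// Stand-in types, invented for this sketch.
struct Item  { int id; int quantity; };
struct Order { int id; std::vector<Item> items; };
struct Database {
    void update_stock(int id, int qty) { /* row update */ }
    void insert_order_line(int order_id, const Item& it) { /* insert */ }
};

// The transaction is written sequentially, exactly as it would be today.
// Conceptually, each loop body becomes one speculative thread (epoch):
// epochs run in parallel, and if two of them touch the same row (e.g., the
// same stock entry), the later one is squashed and re-executed.
void new_order(Database& db, const Order& order) {
    for (const Item& item : order.items) {
        db.update_stock(item.id, item.quantity);
        db.insert_order_line(order.id, item);
    }
}
```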

20 citations


01 Jan 2005
TL;DR: This dissertation proposes to use the compiler to orchestrate inter-thread value communication for both memory-resident and register-resident values, and reports the performance impact of several compiler-based value communication optimization techniques on a four-processor single-chip multiprocessor that has been extended to support thread-level speculation.
Abstract: In the context of Thread-Level Speculation (TLS), inter-thread value communication is the key to efficient parallel execution. From the compiler's perspective, TLS supports two forms of inter-thread value communication: speculation and synchronization. Speculation allows for maximum parallel overlap when it succeeds, but becomes costly when it fails. Synchronization, on the other hand, introduces a fixed cost regardless of whether the dependence actually occurs. The fixed cost of synchronization is determined by the critical forwarding path, which is the time from when a thread first receives a value from its predecessor to when a new value is generated and forwarded to its successor. In the baseline implementation used in this dissertation, we synchronize all register-resident values and speculate on all memory-resident values. However, this naive approach yields little performance gain due to the excessive cost of inter-thread value communication. The goal of this dissertation is to develop compiler-based techniques to reduce the cost of inter-thread value communication and improve overall program performance. This dissertation proposes to use the compiler to orchestrate inter-thread value communication for both memory-resident and register-resident values. To improve the efficiency of inter-thread value communication, the compiler must first decide whether to synchronize or to speculate on a potential data dependence, based on how frequently the dependence occurs. If synchronization is necessary, the compiler then inserts the corresponding signal and wait instructions, creating a point-to-point path to forward the values involved in the dependence. Because synchronization could serialize execution by stalling the consumer thread, we use the compiler to avoid such stalling by applying novel dataflow analyses to schedule instructions so as to shrink the critical forwarding path. This dissertation reports the performance impact of several compiler-based value communication optimization techniques on a four-processor single-chip multiprocessor that has been extended to support thread-level speculation. Relative to the performance of the original sequential program executing on a single processor, for the set of loops selected to maximize program performance, parallel execution with the proposed baseline implementation results in a 1% performance degradation for integer benchmarks and a 21% performance improvement for floating point benchmarks; with the optimization techniques we developed, parallel execution achieves 22% and 42% performance improvements for integer and floating point benchmarks, respectively.
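
The scheduling idea is easiest to see in a sketch (the wait/signal primitives below are hypothetical stand-ins for the compiler-inserted instructions, and the example is invented; it is not from the dissertation):

```cpp
// Hypothetical stand-ins for the compiler-inserted instructions.
extern int  tls_wait_sum();        // receive forwarded value from predecessor
extern void tls_signal_sum(int);   // forward new value to successor
extern int  expensive_work(int);   // long computation independent of sum

// Naive placement: the critical forwarding path spans the whole body, so
// each thread stalls for nearly the entire execution of its predecessor.
int body_naive(int i) {
    int sum = tls_wait_sum();
    int x = expensive_work(i);     // needlessly holds up the forward
    sum += x;
    tls_signal_sum(sum);
    return sum;
}

// After scheduling: the independent work is hoisted above the wait,
// shrinking the critical forwarding path to a single add.
int body_scheduled(int i) {
    int x = expensive_work(i);     // overlaps with the predecessor thread
    int sum = tls_wait_sum();
    sum += x;
    tls_signal_sum(sum);
    return sum;
}
```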

14 citations


Proceedings ArticleDOI
02 Nov 2005
TL;DR: This effort envisions multi-million-module robot ensembles able to morph into three-dimensional scenes, eventually with sufficient fidelity so as to convince a human observer the scenes are real.
Abstract: We propose a demonstration of extremely scalable modular robotics algorithms developed as part of the Claytronics Project (http://www-2.cs.cmu.edu/~claytronics/), as well as a demonstration of proof-of-concept prototypes. Our effort envisions multi-million-module robot ensembles able to morph into three-dimensional scenes, eventually with sufficient fidelity so as to convince a human observer the scenes are real. Although this work is potentially revolutionary in the sense that it holds out the possibility of radically altering the relationship between computation, humans, and the physical world, many of the research questions involved are similar in flavor to more mainstream systems research, albeit larger in scale. For instance, as in sensor networks, each robot will incorporate sensing, computation, and communications components. However, unlike most sensor networks each robot will also include mechanisms for actuation and motion. Many of the key challenges in this project involve coordination and communication of sensing and actuation across such large ensembles of independent units.

11 citations


01 Jan 2005
TL;DR: This thesis investigates a different approach: reducing the impact of cache misses through a technique called cache prefetching, and presents a novel algorithm, Inspector Joins, that exploits the free information obtained from one pass of the hash join algorithm to improve the performance of a later pass.
Abstract: Computer systems have enjoyed an exponential growth in processor speed for the past 20 years, while main memory speed has improved only moderately. Today a cache miss to main memory takes hundreds of processor cycles. Recent studies have demonstrated that commercial databases often waste 50% or more of their execution time on memory stalls caused by cache misses. In light of this problem, a number of recent studies focused on reducing the number of cache misses of database algorithms. In this thesis, we investigate a different approach: reducing the impact of cache misses through a technique called cache prefetching. Since prefetching for sequential array accesses has been well studied, we are interested in studying non-contiguous access patterns found in two classes of database algorithms: the B+-Tree index algorithm and the hash join algorithm. We re-examine their designs with cache prefetching in mind, and combine prefetching and data locality optimizations to achieve good cache performance. For B+-Trees, we first propose and evaluate a novel main memory index structure, Prefetching B+-Trees, which uses prefetching to accelerate two major access patterns of B+-Tree indices: searches and range scans. We then apply our findings in the development of a novel index structure, Fractal Prefetching B+-Trees, that optimizes index operations both for CPU cache performance and for disk performance in commercial database systems by intelligently embedding cache-optimized trees into disk pages. For hash joins, we first exploit cache prefetching separately for the I/O partition phase and the join phase of the algorithm. We propose and evaluate two techniques, Group Prefetching and Software-Pipelined Prefetching, that exploit inter-tuple parallelism to overlap cache misses across the processing of multiple tuples. Then we present a novel algorithm, Inspector Joins, that exploits the free information obtained from one pass of the hash join algorithm to improve the performance of a later pass. This new algorithm addresses the memory bandwidth sharing problem in shared-bus multiprocessor systems. We compare our techniques against state-of-the-art cache-friendly algorithms for B+-Trees and hash joins through both simulation studies and real machine experiments. Our experimental results demonstrate dramatic performance benefits of our cache-prefetching-enabled techniques.
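
To give a flavor of the inter-tuple parallelism idea, here is a minimal sketch of group prefetching on the probe side of a hash join, using the GCC/Clang __builtin_prefetch intrinsic (the bucket layout, names, and group size are simplified inventions, not the thesis's actual design):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Bucket { uint32_t key; uint32_t payload; Bucket* next; };

constexpr std::size_t kGroup = 8;  // tuples whose cache misses we overlap

// Instead of probing one tuple at a time and stalling on every bucket-header
// miss, probe in groups: first issue prefetches for every bucket a group of
// keys hashes to, then visit the (hopefully now-cached) buckets.
void probe_grouped(const std::vector<uint32_t>& keys,
                   Bucket* const* table, std::size_t mask,
                   std::vector<uint32_t>& out) {
    for (std::size_t base = 0; base < keys.size(); base += kGroup) {
        std::size_t end = std::min(base + kGroup, keys.size());
        for (std::size_t i = base; i < end; ++i)   // stage 1: overlap misses
            __builtin_prefetch(table[keys[i] & mask]);
        for (std::size_t i = base; i < end; ++i)   // stage 2: chase buckets
            for (const Bucket* b = table[keys[i] & mask]; b; b = b->next)
                if (b->key == keys[i]) out.push_back(b->payload);
    }
}
```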

5 citations


01 Jan 2005
TL;DR: The initial experiments motivated a new compiler scheduling algorithm that is capable of tolerating the large and variable latencies that are common for disk accesses, in the presence of multiply-nested loops with unknown bounds.
Abstract: For a large class of scientific computing applications, the continuing growth in physical memory capacity cannot be expected to eliminate the need to perform I/O throughout their executions. For these out-of-core applications, the large and widening gap between processor performance and disk latency is a major concern. Current operating systems deliver poor performance when an application's working set does not fit in main memory. As a result, programmers who wish to solve these out-of-core problems efficiently are typically faced with the onerous task of rewriting their application to use explicit I/O operations (e.g., read/write). In many cases, the end result is that the size of physical memory determines the size of problem that can be solved. In this dissertation, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and release hints for managing I/O in a virtual memory system, and the operating system cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively insert prefetches ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We implemented our compiler analysis within the SUIF compiler, and used it to target implementations of our run-time and operating system support on both research and commercial systems (HURRICANE and IRIX 6.5, respectively). Our experimental results show large performance gains for out-of-core scientific applications on both systems: more than 50% of the I/O stall time is eliminated in most cases, translating into overall speedups of roughly twofold in many cases. Our initial experiments motivated a new compiler scheduling algorithm that is capable of tolerating the large and variable latencies that are common for disk accesses, in the presence of multiply-nested loops with unknown bounds. On our current experimental systems many of our benchmark applications remain I/O bound; however, we show that the new scheduling algorithms are able to substantially improve performance in some cases, reducing execution time by an additional 36% in the best case. We further show that the new algorithms should enable applications to make more effective use of higher-bandwidth disk systems that will be available in the future.
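
The flavor of the prefetch and release hints can be approximated today with POSIX madvise() (the dissertation's actual interface on HURRICANE and IRIX differs; the code below is only an analogy, and the strip-at-a-time traversal is invented):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Assumes `base` points to a read-only, file-backed mmap of an out-of-core
// array, swept one strip at a time, with strips page-aligned. Both hints
// are non-binding: the OS may ignore them when arbitrating among the
// competing demands of multiple applications.
void sweep(const char* base, std::size_t nstrips, std::size_t strip_bytes) {
    for (std::size_t s = 0; s < nstrips; ++s) {
        char* cur = const_cast<char*>(base) + s * strip_bytes;
        // Prefetch the next strip early enough to overlap its disk latency
        // with computation on the current strip.
        if (s + 1 < nstrips)
            madvise(cur + strip_bytes, strip_bytes, MADV_WILLNEED);
        // ... compute over [cur, cur + strip_bytes) ...
        // Release the finished strip so its (clean) pages are evicted first,
        // rather than stalling a future prefetch.
        madvise(cur, strip_bytes, MADV_DONTNEED);
    }
}
```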

3 citations


01 Jan 2005
TL;DR: This thesis shows how dividing a transaction into speculative threads (or epochs) solves both problems---it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer.
Abstract: Thread-level speculation (TLS) is a promising method of extracting parallelism from both integer and scientific workloads. In this thesis we apply TLS to exploit intra-transaction parallelism in database workloads. Exploiting intra-transaction parallelism without using TLS in existing database systems is difficult, for two reasons: first, significant changes are required to avoid races or conflicts within the DBMS, and second, adding threads to transactions requires a high level of sophistication from transaction programmers. In this thesis we show how dividing a transaction into speculative threads (or epochs) solves both problems---it minimizes the changes required to the DBMS, and the details of parallelization are hidden from the transaction programmer. Our technique requires a limited number of small, localized changes to a subset of the low-level data structures in the DBMS. We also show that previous hardware support for TLS is insufficient for the resulting large speculative threads and the complexity of the dependences between them. In this thesis we extend previous TLS hardware support in three ways to facilitate large speculative threads: (i) we propose a method for buffering speculative state in the L2 cache, instead of solely using an extended store buffer, L1 data cache, or specialized table to track speculative changes; (ii) we tolerate cross-thread data dependences through the use of sub-epochs, significantly reducing the cost of mis-speculation; and (iii) with programmer assistance we escape speculation for database operations which can be performed non-speculatively. With this support we can effectively exploit intra-transaction parallelism in a database and dramatically improve transaction performance: on a simulated 4-processor chip-multiprocessor, we improve the response time by 46--66% for three of the five TPC-C transactions.
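
A toy model of the sub-epoch bookkeeping (the data structures are invented for illustration; the real mechanism is implemented in the TLS hardware support): cutting a large speculative thread into sub-epochs means a violation rolls execution back only to the start of the offending sub-epoch rather than to the start of the whole thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each sub-epoch records a checkpoint and the first speculative-state
// version it produced (versions increase monotonically within a thread).
struct SubEpoch {
    uint64_t checkpoint_id;   // where to resume if this sub-epoch is squashed
    uint64_t first_version;
};

struct SpeculativeThread {
    std::vector<SubEpoch> subs;  // ordered by first_version, ascending

    // A violation detected against speculative version v squashes work back
    // to the sub-epoch that produced v, not the whole thread.
    std::size_t rollback_point(uint64_t v) const {
        std::size_t i = subs.size();
        while (i > 0 && subs[i - 1].first_version > v) --i;
        return i == 0 ? 0 : i - 1;
    }
};
```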