
Showing papers on "Cache published in 1999"


Proceedings ArticleDOI
21 Mar 1999
TL;DR: This paper investigates the page request distribution seen by Web proxy caches using traces from a variety of sources, and considers a simple model in which Web accesses are independent and the reference probability of the documents follows a Zipf-like distribution, suggesting that the various observed properties of hit-ratios and temporal locality are indeed inherent to Web accesses observed by proxies.
Abstract: This paper addresses two unresolved issues about Web caching. The first issue is whether Web requests from a fixed user community are distributed according to Zipf's (1929) law. The second issue relates to a number of studies on the characteristics of Web proxy traces, which have shown that the hit-ratios and temporal locality of the traces exhibit certain asymptotic properties that are uniform across the different sets of the traces. In particular, the question is whether these properties are inherent to Web accesses or whether they are simply an artifact of the traces. An answer to these unresolved issues will facilitate both Web cache resource planning and cache hierarchy design. We show that the answers to the two questions are related. We first investigate the page request distribution seen by Web proxy caches using traces from a variety of sources. We find that the distribution does not follow Zipf's law precisely, but instead follows a Zipf-like distribution with the exponent varying from trace to trace. Furthermore, we find that there is only (i) a weak correlation between the access frequency of a Web page and its size and (ii) a weak correlation between a page's access frequency and its rate of change. We then consider a simple model where the Web accesses are independent and the reference probability of the documents follows a Zipf-like distribution. We find that the model yields asymptotic behaviour that is consistent with the experimental observations, suggesting that the various observed properties of hit-ratios and temporal locality are indeed inherent to Web accesses observed by proxies. Finally, we revisit Web cache replacement algorithms and show that the algorithm that is suggested by this simple model performs best on real trace data. The results indicate that while page requests do indeed reveal short-term correlations and other structures, a simple model for an independent request stream following a Zipf-like distribution is sufficient to capture certain asymptotic properties observed at Web proxies.
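
The independent-reference model studied here is easy to reproduce. The following C sketch (the exponent 0.8, the 10,000-document population, and the LRU policy are illustrative choices, not parameters from the paper) draws requests i.i.d. from a Zipf-like distribution and measures the hit ratio of a fixed-size cache:

```c
/* Independent-reference model: requests drawn i.i.d. from a Zipf-like
 * distribution p(i) ~ 1/i^alpha and fed to an LRU cache. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define NPAGES   10000   /* distinct documents        */
#define NREQS   200000   /* simulated requests        */
#define CACHE      500   /* cache capacity (objects)  */
#define ALPHA      0.8   /* Zipf-like exponent        */

int main(void) {
    static double cdf[NPAGES];
    double sum = 0.0;
    for (int i = 0; i < NPAGES; i++) {
        sum += 1.0 / pow(i + 1, ALPHA);
        cdf[i] = sum;
    }
    static int cache[CACHE];
    int used = 0;
    long hits = 0;
    srand(42);
    for (long r = 0; r < NREQS; r++) {
        /* inverse-transform sampling: binary search the CDF */
        double u = (rand() / (double)RAND_MAX) * sum;
        int lo = 0, hi = NPAGES - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        int page = lo, found = -1;
        for (int i = 0; i < used; i++)
            if (cache[i] == page) { found = i; break; }
        if (found >= 0) {                  /* hit: move to front (LRU) */
            hits++;
            memmove(&cache[1], &cache[0], found * sizeof(int));
        } else {                           /* miss: evict the LRU tail */
            if (used < CACHE) used++;
            memmove(&cache[1], &cache[0], (used - 1) * sizeof(int));
        }
        cache[0] = page;
    }
    printf("hit ratio = %.3f\n", (double)hits / NREQS);
    return 0;
}
```

Varying CACHE and ALPHA reproduces the qualitative trend the paper analyzes: the hit ratio of such a stream grows roughly logarithmically with cache size.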

3,582 citations


Proceedings ArticleDOI
17 Oct 1999
TL;DR: It is proved that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement.
Abstract: This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT, and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms, these algorithms are cache oblivious: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, these algorithms use an optimal amount of work and move data optimally among multiple levels of cache. For a cache of size Z and cache-line length L where Z = Ω(L²), the number of cache misses for an m × n matrix transpose is Θ(1 + mn/L). The number of cache misses for either an n-point FFT or the sorting of n numbers is Θ(1 + (n/L)(1 + log_Z n)). We also give a Θ(mnp)-work algorithm to multiply an m × n matrix by an n × p matrix that incurs Θ(1 + (mn + np + mp)/L + mnp/(L√Z)) cache faults. We introduce an "ideal-cache" model to analyze our algorithms. We prove that an optimal cache-oblivious algorithm designed for two levels of memory is also optimal for multiple levels, and that the assumption of optimal replacement in the ideal-cache model can be simulated efficiently by LRU replacement. We also provide preliminary empirical results on the effectiveness of cache-oblivious algorithms in practice.
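
A minimal sketch of the cache-oblivious style of algorithm, using the transpose as the example: the recursion halves the larger dimension until the base case, so submatrices eventually fit in every level of the hierarchy without the code knowing Z or L. The base-case size 16 is an arbitrary choice for readability, not from the paper:

```c
/* Cache-oblivious out-of-place matrix transpose by recursive
 * divide and conquer; no cache parameters appear in the code. */
#include <stdio.h>

#define N 64

static double A[N][N], B[N][N];

/* transpose the submatrix A[r0..r0+m)[c0..c0+n) into B */
static void rec_transpose(int r0, int c0, int m, int n) {
    if (m <= 16 && n <= 16) {
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                B[c0 + j][r0 + i] = A[r0 + i][c0 + j];
    } else if (m >= n) {        /* split the longer dimension */
        rec_transpose(r0, c0, m / 2, n);
        rec_transpose(r0 + m / 2, c0, m - m / 2, n);
    } else {
        rec_transpose(r0, c0, m, n / 2);
        rec_transpose(r0, c0 + n / 2, m, n - n / 2);
    }
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i * N + j;
    rec_transpose(0, 0, N, N);
    printf("B[3][5] = %.0f (expect %d)\n", B[3][5], 5 * N + 3);
    return 0;
}
```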

789 citations


Proceedings ArticleDOI
16 Nov 1999
TL;DR: Selective cache ways trade a small performance degradation for energy savings, and this tradeoff can produce a significant reduction in cache energy dissipation.
Abstract: Increasing levels of microprocessor power dissipation call for new approaches at the architectural level that save energy by better matching of on-chip resources to application requirements. Selective cache ways provides the ability to disable a subset of the ways in a set associative cache during periods of modest cache activity, while the full cache may remain operational for more cache-intensive periods. Because this approach leverages the subarray partitioning that is already present for performance reasons, only minor changes to a conventional cache are required, and therefore, full-speed cache operation can be maintained. Furthermore, the tradeoff between performance and energy is flexible, and can be dynamically tailored to meet changing application and machine environmental conditions. We show that trading off a small performance degradation for energy savings can produce a significant reduction in cache energy dissipation using this approach.
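
A hedged software model of the idea (not the paper's circuit-level design): a 4-way set-associative cache in which a way-enable mask restricts both lookup and fill to a subset of the ways, so disabling ways saves probe energy at the cost of capacity:

```c
/* Toy model of selective cache ways: only ways enabled in way_mask
 * are probed on lookup or chosen as fill victims. */
#include <stdio.h>
#include <stdint.h>

#define SETS 64
#define WAYS 4
#define LINE 32

typedef struct {
    uint32_t tag[SETS][WAYS];
    uint8_t  valid[SETS][WAYS];
    uint32_t stamp[SETS][WAYS];  /* last-access time, for LRU */
    uint8_t  way_mask;           /* bit w set => way w enabled */
    uint32_t now;
} Cache;

static int cache_access(Cache *c, uint32_t addr) {
    uint32_t set = (addr / LINE) % SETS, tag = addr / (LINE * SETS);
    int victim = -1;
    c->now++;
    for (int w = 0; w < WAYS; w++) {
        if (!(c->way_mask & (1u << w))) continue;  /* disabled: not probed */
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->stamp[set][w] = c->now;
            return 1;                              /* hit */
        }
        if (victim < 0 || !c->valid[set][w] ||
            (c->valid[set][victim] && c->stamp[set][w] < c->stamp[set][victim]))
            victim = w;                            /* invalid, else LRU */
    }
    c->valid[set][victim] = 1;                     /* miss: fill enabled way */
    c->tag[set][victim] = tag;
    c->stamp[set][victim] = c->now;
    return 0;
}

int main(void) {
    static Cache c;
    c.way_mask = 0x3;            /* modest phase: only ways 0 and 1 enabled */
    int hits = 0;
    for (uint32_t i = 0; i < 100000; i++)
        hits += cache_access(&c, (i * 97) % 8192); /* toy address stream */
    printf("hits with 2 of 4 ways enabled: %d / 100000\n", hits);
    return 0;
}
```

Flipping way_mask to 0xF models the fully enabled cache; the difference in hit count is the performance cost paid for probing half as many ways.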

733 citations


Patent
Michel K. Bowman-Amuah1
31 Aug 1999
TL;DR: In this article, a system, method, and article of manufacture are provided for efficiently retrieving data from a server in a single call, where all of the data is bundled into a data structure by the server in response to the single call.
Abstract: A system, method, and article of manufacture are provided for efficiently retrieving data. A total amount of data required for an application executed by a client is determined. In a single call, the total amount of data is requested from a server over a network. All of the data is bundled into a data structure by the server in response to the single call. The bundled data structure is sent to the client over the network, and the data of the data structure is cached on the client. The cached data is used as needed during execution of the application on the client.

636 citations


Journal ArticleDOI
Andras Valko1
01 Jan 1999
TL;DR: Cellular IP is proposed, a new lightweight and robust protocol that is optimized to support local mobility but efficiently interworks with Mobile IP to provide wide area mobility support.
Abstract: This paper describes a new approach to Internet host mobility. We argue that by separating local and wide area mobility, the performance of existing mobile host protocols (e.g. Mobile IP) can be significantly improved. We propose Cellular IP, a new lightweight and robust protocol that is optimized to support local mobility but efficiently interworks with Mobile IP to provide wide area mobility support. Cellular IP shows great benefit in comparison to existing host mobility proposals for environments where mobile hosts migrate frequently, which, we argue, will be the rule rather than the exception as Internet wireless access becomes ubiquitous. Cellular IP maintains distributed caches for location management and routing purposes. A distributed paging cache coarsely maintains the position of 'idle' mobile hosts in a service area. Cellular IP uses this paging cache to quickly and efficiently pinpoint 'idle' mobile hosts that wish to engage in 'active' communications. This approach is beneficial because it can accommodate a large number of users attached to the network without overloading the location management system. A distributed routing cache maintains the position of active mobile hosts in the service area and dynamically refreshes the routing state in response to the handoff of active mobile hosts. These distributed location management and routing algorithms lend themselves to a simple and low-cost implementation of Internet host mobility, requiring no new packet formats, encapsulations or address space allocation beyond what is present in IP.

599 citations


Proceedings Article
07 Sep 1999
TL;DR: This paper examines four commercial DBMSs running on an Intel Xeon and NT 4.0 and introduces a framework for analyzing query execution time, and finds that database developers should not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues.
Abstract: Recent high-performance processors employ sophisticated techniques to overlap and simultaneously execute multiple computation and memory operations. Intuitively, these techniques should help database applications, which are becoming increasingly compute and memory bound. Unfortunately, recent studies report that faster processors do not improve database system performance to the same extent as scientific workloads. Recent work on database systems focusing on minimizing memory latencies, such as cache-conscious algorithms for sorting and data placement, is one step toward addressing this problem. However, to best design high performance DBMSs we must carefully evaluate and understand the processor and memory behavior of commercial DBMSs on today’s hardware platforms. In this paper we answer the question “Where does time go when a database system is executed on a modern computer platform?” We examine four commercial DBMSs running on an Intel Xeon and NT 4.0. We introduce a framework for analyzing query execution time on a DBMS running on a server with a modern processor and memory architecture. To focus on processor and memory interactions and exclude effects from the I/O subsystem, we use a memory resident database. Using simple queries we find that database developers should (a) optimize data placement for the second level of data cache, and not the first, (b) optimize instruction placement to reduce first-level instruction cache stalls, but (c) not expect the overall execution time to decrease significantly without addressing stalls related to subtle implementation issues (e.g., branch prediction).

551 citations


Patent
07 Dec 1999
TL;DR: In this paper, a proxy redirector modifies the complete address specified in a GET request to form an absolute URL before sending it to the proxy cache, which then uses that URL to determine whether it has the requested object stored in its cache.
Abstract: In order to transparently redirect an HTTP connection request that is directed to an origin server (107) to a proxy cache (110-1), a proxy redirector (104) translates the destination address of packets directed to the origin server to the address of the proxy. During a handshaking procedure, a TCP connection is transparently established between the client (110-1) and the proxy cache. When the client transmits a GET request to what it thinks is the origin server, which request specifies the complete address of an object at that origin server that it wants a copy of, the proxy redirector modifies the complete address specified in that GET request before it is sent to the proxy cache. Specifically, the IP address of the origin server found in the destination field in the IP header of the one or more packets from the client containing the GET request is added by the proxy redirector as a prefix to the complete URL in the GET request to form an absolute URL. The proxy cache determines from that absolute URL whether it has the requested object stored in its cache. If it does, it sends the object back to the proxy redirector, which masquerades those packets as coming from the origin server by translating their destination address to the address of the client and their source address to that of the origin server. If the proxy does not have the requested object, a separate TCP connection is established between the proxy and the origin server from where the object is retrieved and then forwarded over the TCP connection between the client and the proxy. In order to account for the additional number of bytes in the GET request, an acknowledgement sequence number in packets returned from the proxy that logically follow receipt of the GET request is decremented by that number by the proxy redirector before being forwarded to the client. Similarly, a sequence number in packets transmitted by the client subsequent to the GET request is incremented by that number before being forwarded by the proxy redirector to the proxy cache.
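
The sequence-number compensation is the subtle part of the splice. A small hypothetical sketch (types and names invented for illustration): once the redirector inserts delta bytes into the GET, client-to-proxy sequence numbers after that point must be raised by delta and proxy-to-client acknowledgements lowered by it:

```c
/* Hypothetical sketch of the sequence/ack compensation: if the
 * redirector adds `delta` bytes to the GET, the two halves of the
 * spliced connection differ by exactly that amount afterwards. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
    uint32_t delta;       /* bytes added to the GET by URL rewriting  */
    uint32_t rewrite_seq; /* client seq at which the bytes were added */
} Splice;

/* client -> proxy: bump seq for segments sent after the rewritten GET */
static uint32_t client_seq_out(const Splice *s, uint32_t seq) {
    return (seq >= s->rewrite_seq) ? seq + s->delta : seq;
}

/* proxy -> client: acks cover delta bytes the client never sent */
static uint32_t proxy_ack_out(const Splice *s, uint32_t ack) {
    return (ack >= s->rewrite_seq + s->delta) ? ack - s->delta : ack;
}

int main(void) {
    Splice s = { .delta = 12, .rewrite_seq = 1000 };
    printf("seq 1460 -> %u\n", client_seq_out(&s, 1460)); /* 1472 */
    printf("ack 1012 -> %u\n", proxy_ack_out(&s, 1012));  /* 1000 */
    return 0;
}
```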

545 citations


Journal ArticleDOI
TL;DR: The main technique, controlled prefix expansion, transforms a set of prefixes into an equivalent set with fewer prefix lengths, and optimization techniques based on dynamic programming, and local transformations of data structures to improve cache behavior are used.
Abstract: Internet (IP) address lookup is a major bottleneck in high-performance routers. IP address lookup is challenging because it requires a longest matching prefix lookup. It is compounded by increasing routing table sizes, increased traffic, higher-speed links, and the migration to 128-bit IPv6 addresses. We describe how IP lookups and updates can be made faster using a set of transformation techniques. Our main technique, controlled prefix expansion, transforms a set of prefixes into an equivalent set with fewer prefix lengths. In addition, we use optimization techniques based on dynamic programming, and local transformations of data structures to improve cache behavior. When applied to trie search, our techniques provide a range of algorithms (Expanded Tries) whose performance can be tuned. For example, using a processor with 1MB of L2 cache, search of the MaeEast database containing 38000 prefixes can be done in 3 L2 cache accesses. On a 300MHz Pentium II which takes 4 cycles for accessing the first word of the L2 cacheline, this algorithm has a worst-case search time of 180 nsec., a worst-case insert/delete time of 2.5 msec., and an average insert/delete time of 4 usec. Expanded tries provide faster search and faster insert/delete times than earlier lookup algorithms. When applied to Binary Search on Levels, our techniques improve worst-case search times by nearly a factor of 2 (using twice as much storage) for the MaeEast database. Our approach to algorithm design is based on measurements using the VTune tool on a Pentium to obtain dynamic clock cycle counts. Our techniques also apply to similar address lookup problems in other network protocols.
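
A minimal sketch of controlled prefix expansion, shrunk to 8-bit addresses for readability: every prefix is expanded to one target length, so lookup becomes a single indexed load, and expanding prefixes in order of increasing length preserves longest-prefix matching because longer originals overwrite the expansions of shorter ones. The toy table and stride are invented for illustration:

```c
/* Controlled prefix expansion to a single target length EXPLEN:
 * a prefix of length l becomes 2^(EXPLEN-l) table entries. */
#include <stdio.h>

#define EXPLEN 4          /* every prefix expanded to 4 bits */

typedef struct { unsigned bits, len; int nexthop; } Prefix;

int main(void) {
    /* toy table, sorted by increasing prefix length */
    Prefix p[] = {
        { 0x0, 1, 1 },    /* 0*   -> hop 1 */
        { 0x2, 2, 2 },    /* 10*  -> hop 2 */
        { 0x5, 3, 3 },    /* 101* -> hop 3 */
    };
    int table[1 << EXPLEN] = {0};   /* index = first EXPLEN address bits */
    for (unsigned i = 0; i < sizeof p / sizeof p[0]; i++) {
        unsigned shift = EXPLEN - p[i].len;
        unsigned base = p[i].bits << shift;
        for (unsigned j = 0; j < (1u << shift); j++)
            table[base + j] = p[i].nexthop;   /* later (longer) overwrites */
    }
    /* a lookup is now one indexed load, not a longest-match search */
    unsigned addr = 0xB;            /* 1011 matches 101* -> hop 3 */
    printf("next hop for 0x%X = %d\n", addr, table[addr]);
    return 0;
}
```

The paper's dynamic-programming contribution is choosing the set of target lengths that minimizes memory for a multibit trie; this sketch fixes a single length only to show the expansion step itself.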

514 citations


Proceedings Article
07 Sep 1999
TL;DR: In this article, the authors discuss how vertically fragmented data structures optimize cache performance on sequential data access, and introduce radix algorithms for partitioned hash-join, which are quantified using a detailed analytical model that incorporates memory access cost.
Abstract: In the past decade, advances in speed of commodity CPUs have far out-paced advances in memory latency. Main-memory access is therefore increasingly a performance bottleneck for many computer applications, including database systems. In this article, we use a simple scan test to show the severe impact of this bottleneck. The insights gained are translated into guidelines for database architecture, in terms of both data structures and algorithms. We discuss how vertically fragmented data structures optimize cache performance on sequential data access. We then focus on equi-join, typically a random-access operation, and introduce radix algorithms for partitioned hash-join. The performance of these algorithms is quantified using a detailed analytical model that incorporates memory access cost. Experiments that validate this model were performed on the Monet database system. We obtained exact statistics on events like TLB misses, L1 and L2 cache misses, by using hardware performance counters found in modern CPUs. Using our cost model, we show how the carefully tuned memory access pattern of our radix algorithms makes them perform well, which is confirmed by experimental results.
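
A sketch of one radix-partitioning pass of the kind the radix join builds on (the cluster count and key set are illustrative): tuples are scattered on the low B bits of the key so that each cluster, and later its hash table, fits in the cache and TLB:

```c
/* One pass of radix partitioning: count, prefix-sum, scatter.
 * Each of the 2^B output clusters is written sequentially. */
#include <stdio.h>
#include <stdlib.h>

#define B 4
#define NPART (1 << B)

void radix_partition(const int *key, int n, int *out, int *part_start) {
    int count[NPART] = {0};
    for (int i = 0; i < n; i++) count[key[i] & (NPART - 1)]++;
    int pos[NPART];
    pos[0] = 0;
    for (int p = 1; p < NPART; p++) pos[p] = pos[p - 1] + count[p - 1];
    for (int p = 0; p < NPART; p++) part_start[p] = pos[p];
    for (int i = 0; i < n; i++)   /* scatter: one write stream per cluster */
        out[pos[key[i] & (NPART - 1)]++] = key[i];
}

int main(void) {
    enum { N = 1000 };
    int key[N], out[N], start[NPART];
    for (int i = 0; i < N; i++) key[i] = rand();
    radix_partition(key, N, out, start);
    printf("cluster 0 starts at %d, cluster 1 at %d\n", start[0], start[1]);
    return 0;
}
```

The paper's multi-pass refinement limits the number of clusters written per pass so the scatter itself stays within the TLB reach; this sketch shows a single pass only.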

398 citations


Proceedings Article
06 Jun 1999
TL;DR: This paper presents the design of a new Web server architecture called the asymmetric multi-process event-driven (AMPED) architecture, and evaluates the performance of an implementation of this architecture, the Flash Web server.
Abstract: This paper presents the design of a new Web server architecture called the asymmetric multi-process event-driven (AMPED) architecture, and evaluates the performance of an implementation of this architecture, the Flash Web server. The Flash Web server combines the high performance of single-process event-driven servers on cached workloads with the performance of multiprocess and multi-threaded servers on disk-bound workloads. Furthermore, the Flash Web server is easily portable since it achieves these results using facilities available in all modern operating systems. The performance of different Web server architectures is evaluated in the context of a single implementation in order to quantify the impact of a server's concurrency architecture on its performance. Furthermore, the performance of Flash is compared with two widely-used Web servers, Apache and Zeus. Results indicate that Flash can match or exceed the performance of existing Web servers by up to 50% across a wide range of real workloads. We also present results that show the contribution of various optimizations embedded in Flash.

396 citations


Proceedings ArticleDOI
01 May 1999
TL;DR: It is demonstrated that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and consequently, their performance.
Abstract: Hardware trends have produced an increasing disparity between processor speeds and memory access times. While a variety of techniques for tolerating or reducing memory latency have been proposed, these are rarely successful for pointer-manipulating programs. This paper explores a complementary approach that attacks the source (poor reference locality) of the problem rather than its manifestation (memory latency). It demonstrates that careful data organization and layout provides an essential mechanism to improve the cache locality of pointer-manipulating programs and consequently, their performance. It explores two placement techniques---clustering and coloring---that improve cache performance by increasing a pointer structure's spatial and temporal locality, and by reducing cache conflicts. To reduce the cost of applying these techniques, this paper discusses two strategies---cache-conscious reorganization and cache-conscious allocation---and describes two semi-automatic tools---ccmorph and ccmalloc---that use these strategies to produce cache-conscious pointer structure layouts. ccmorph is a transparent tree reorganizer that utilizes topology information to cluster and color the structure. ccmalloc is a cache-conscious heap allocator that attempts to co-locate contemporaneously accessed data elements in the same physical cache block. Our evaluations, with microbenchmarks, several small benchmarks, and a couple of large real-world applications, demonstrate that the cache-conscious structure layouts produced by ccmorph and ccmalloc offer large performance benefits---in most cases, significantly outperforming state-of-the-art prefetching.
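
A hedged sketch of the cache-conscious allocation idea (not the authors' ccmalloc code): the allocator takes a hint pointer to an existing element and tries to place the new node in the same cache-block-sized chunk, falling back to a fresh block when it will not fit:

```c
/* Bump allocator with a co-location hint: a new node lands in the
 * same 64-byte block as its neighbor when there is room. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK 64                        /* cache block size (bytes) */

static _Alignas(64) char arena[1 << 16];
static size_t top = 0;

static void *cc_malloc(size_t sz, const void *neighbor) {
    if (neighbor) {
        uintptr_t next = (uintptr_t)&arena[top];
        uintptr_t nb = (uintptr_t)neighbor / BLOCK;
        if (next / BLOCK != nb || (next + sz - 1) / BLOCK != nb)
            /* cannot sit beside the neighbor: start a fresh block so
               this node and ITS future children can share one */
            top = (top + BLOCK - 1) & ~(size_t)(BLOCK - 1);
    }
    void *p = &arena[top];
    top += sz;
    return p;
}

typedef struct node { int key; struct node *left, *right; } node;

int main(void) {
    node *root = cc_malloc(sizeof *root, NULL);
    node *kid  = cc_malloc(sizeof *kid, root);  /* hint: place near root */
    root->key = 1; root->left = kid; root->right = NULL;
    kid->key = 2; kid->left = kid->right = NULL;
    printf("root in block %zu, kid in block %zu\n",
           (size_t)((uintptr_t)root / BLOCK), (size_t)((uintptr_t)kid / BLOCK));
    return 0;
}
```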

Proceedings ArticleDOI
01 Dec 1999
TL;DR: This work provides necessary conditions for when timing anomalies can show up, identifies which architectural features may cause such anomalies, and proposes simple code modification techniques that make it impossible for any anomalies to occur.
Abstract: Previous timing analysis methods have assumed that the worst-case instruction execution time necessarily corresponds to the worst-case behavior. We show that this assumption is wrong in dynamically scheduled processors. A cache miss, for example, can in some cases result in a shorter execution time than a cache hit. Many examples of such timing anomalies are provided. We first provide necessary conditions for when timing anomalies can show up and identify which architectural features may cause such anomalies. We also show that analyzing the effect of these anomalies with known techniques results in prohibitive computational complexity. Instead, we propose some simple code modification techniques to make it impossible for any anomalies to occur. These modifications make it possible to estimate WCET by known techniques. Our evaluation shows that the pessimism imposed by these techniques is fairly limited; it is less than 27% for the programs in our benchmark suite.

Patent
Leonardo C. Massarani1
30 Apr 1999
TL;DR: In this paper, a content-indexing search system and method provides search results consistent with content filtering and blocking policies, ensuring consistency between the results of user content searches and the content filtering/blocking policies.
Abstract: A content-indexing search system and method provides search results consistent with content filtering and blocking policies. The search system comprises a content-indexing search engine including a database coupled to an information network. A user provides search queries to the search engine through a gateway serving as a proxy server and cache and blocking engine. The blocking engine implements content filtering and blocking policies with respect to the search results. Alternative embodiments provide consistency between the results of the user content searches and the content filtering/blocking policies. One embodiment modifies the search engine to implement the same content blocking policy as the caching and filtering engine. Another embodiment modifies the search engine to build an indexing database by searching the caching and filtering engine content. A third embodiment modifies the search engine to go through the cache and filter engine as the search engine builds its indexing database. A fourth embodiment modifies a search engine to go through a caching and filtering engine as it builds an indexing database.

Journal ArticleDOI
17 May 1999
TL;DR: This paper describes the implementation of a consistent-hashing-based system and experiments that support the thesis that it can provide performance improvements; consistent hashing provides an alternative to multicast and directory schemes and has several other advantages in load balancing and fault tolerance.
Abstract: A key performance measure for the World Wide Web is the speed with which content is served to users. As traffic on the Web increases, users are faced with increasing delays and failures in data delivery. Web caching is one of the key strategies that has been explored to improve performance. An important issue in many caching systems is how to decide what is cached where at any given time. Solutions have included multicast queries and directory schemes. In this paper, we offer a new Web caching strategy based on consistent hashing. Consistent hashing provides an alternative to multicast and directory schemes, and has several other advantages in load balancing and fault tolerance. Its performance was analyzed theoretically in previous work; in this paper we describe the implementation of a consistent-hashing-based system and experiments that support our thesis that it can provide performance improvements.
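
A minimal consistent-hashing sketch, with an invented mixing function and an arbitrary number of virtual points per cache: each cache appears at several pseudo-random positions on a hash circle, and a URL is served by the first cache position clockwise from the URL's hash, so adding or removing a cache remaps only the URLs between adjacent positions:

```c
/* Consistent hashing: sorted ring of (point, cache) pairs; a URL maps
 * to the first ring point at or after its hash, wrapping around. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NCACHES 4
#define VNODES  32        /* points per cache smooth the load */

typedef struct { uint32_t point; int cache; } Entry;

static uint32_t mix(uint32_t x) {          /* simple integer hash */
    x ^= x >> 16; x *= 0x45d9f3b; x ^= x >> 16; x *= 0x45d9f3b; x ^= x >> 16;
    return x;
}

static uint32_t hash_str(const char *s) {  /* FNV-1a */
    uint32_t h = 2166136261u;
    for (; *s; s++) { h ^= (uint8_t)*s; h *= 16777619u; }
    return h;
}

static int cmp(const void *a, const void *b) {
    uint32_t x = ((const Entry *)a)->point, y = ((const Entry *)b)->point;
    return (x > y) - (x < y);
}

int main(void) {
    Entry ring[NCACHES * VNODES];
    for (int c = 0; c < NCACHES; c++)
        for (int v = 0; v < VNODES; v++) {
            ring[c * VNODES + v].point = mix(c * 131071u + v * 8191u + 1);
            ring[c * VNODES + v].cache = c;
        }
    qsort(ring, NCACHES * VNODES, sizeof(Entry), cmp);

    const char *url = "http://example.com/index.html";
    uint32_t h = hash_str(url);
    int lo = 0, hi = NCACHES * VNODES;     /* first point >= h */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (ring[mid].point < h) lo = mid + 1; else hi = mid;
    }
    int idx = (lo == NCACHES * VNODES) ? 0 : lo;  /* wrap around the circle */
    printf("%s -> cache %d\n", url, ring[idx].cache);
    return 0;
}
```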

Proceedings ArticleDOI
21 Mar 1999
TL;DR: This paper presents a new approach, data update propagation (DUP), for consistently caching dynamic Web data in order to improve performance; DUP maintains data dependence information between cached objects and the underlying data that affect their values in a graph, and was a critical component of the official Web site for the 1998 Olympic Winter Games.
Abstract: This paper presents a new approach for consistently caching dynamic Web data in order to improve performance. Our algorithm, which we call data update propagation (DUP), maintains data dependence information between cached objects and the underlying data which affect their values in a graph. When the system becomes aware of a change to underlying data, graph traversal algorithms are applied to determine which cached objects are affected by the change. Cached objects which are found to be highly obsolete are then either invalidated or updated. The DUP was a critical component at the official Web site for the 1998 Olympic Winter Games. By using DUP, we were able to achieve cache hit rates close to 100% compared with 80% for an earlier version of our system which did not employ DUP. As a result of the high cache hit rates, the Olympic Games Web site was able to serve data quickly even during peak request periods.
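
A small sketch of the DUP mechanism as the abstract describes it (graph shape and fixed-size adjacency arrays are invented for illustration): edges run from underlying data to the cached objects they affect, and a change triggers a traversal that marks everything reachable as stale:

```c
/* Dependence-graph invalidation: when a datum changes, a depth-first
 * traversal marks every cached object that transitively depends on it. */
#include <stdio.h>

#define NNODES 6
#define MAXDEP 4

static int nedge[NNODES] = { 2, 1, 0, 1, 0, 0 };
static int edge[NNODES][MAXDEP] = {
    { 3, 4 },   /* data 0 affects pages 3 and 4 */
    { 4 },      /* data 1 affects page 4        */
    { 0 },
    { 5 },      /* page 3 is embedded in page 5 */
    { 0 },
    { 0 },
};
static int stale[NNODES];

static void propagate(int n) {
    if (stale[n]) return;
    stale[n] = 1;
    for (int i = 0; i < nedge[n]; i++)
        propagate(edge[n][i]);
}

int main(void) {
    propagate(0);               /* underlying datum 0 changed */
    stale[0] = 0;               /* the datum itself is not a cached page */
    for (int n = 3; n < NNODES; n++)
        if (stale[n]) printf("invalidate cached page %d\n", n);
    return 0;
}
```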

Proceedings ArticleDOI
22 Feb 1999
TL;DR: IO-Lite eliminates all copying and multiple buffering of I/O data and enables various cross-subsystem optimizations, showing performance improvements between 40 and 80% on real workloads as a result of IO-Lite.
Abstract: This paper presents the design, implementation and evaluation of IO-Lite, a unified I/O buffering and caching system for general-purpose operating systems. IO-Lite unifies all buffering and caching in the system, to the extent permitted by the hardware. In particular, it allows applications, interprocess communication, the filesystem, the file cache, and the network subsystem to share a single physical copy of the data safely and concurrently. Protection and security are maintained through a combination of access control and read-only sharing. The various subsystems use (mutable) buffer aggregates to access the data according to their needs. IO-Lite eliminates all copying and multiple buffering of I/O data, and enables various cross-subsystem optimizations. Experiments with a Web server on IO-Lite show performance improvements between 40 and 80% on real workloads.

Proceedings ArticleDOI
07 Sep 1999
TL;DR: A new indexing technique called "Cache-Sensitive Search Trees" (CSS-trees) is proposed to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space.
Abstract: We study indexing techniques for main memory, including hash indexes, binary search trees, T-trees, B+-trees, interpolation search, and binary search on arrays. In a decision-support context, our primary concerns are the lookup time and the space occupied by the index structure. Our goal is to provide faster lookup times than binary search by paying attention to reference locality and cache behavior, without using substantial extra space. We propose a new indexing technique called "Cache-Sensitive Search Trees" (CSS-trees). Our technique stores a directory structure on top of a sorted array. Nodes in this directory have size matching the cache-line size of the machine. We store the directory in an array and do not store internal-node pointers; child nodes can be found by performing arithmetic on array offsets. We compare the algorithms based on their time and space requirements. We have implemented all of the techniques, and present a performance study on two popular modern machines.
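
A deliberately simplified, one-level illustration of the idea (the paper builds a full multi-level directory): a small pointer-free directory of every 16th key fits cache lines tightly, and a search touches the directory plus a single cache-line-sized segment of the sorted array:

```c
/* One-level directory over a sorted array: 16 ints = one 64-byte
 * cache line, and segment location needs no stored pointers. */
#include <stdio.h>

#define N   1024
#define SEG   16

static int data[N], dir[N / SEG];

static int search(int key) {
    /* find the segment: first directory key > key, then step back;
       a linear scan stands in for the paper's in-node search */
    int s = 0;
    while (s < N / SEG && dir[s] <= key) s++;
    int lo = (s == 0) ? 0 : (s - 1) * SEG;
    for (int i = lo; i < lo + SEG; i++)     /* scan one cache line */
        if (data[i] == key) return i;
    return -1;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 2 * i;      /* sorted keys */
    for (int s = 0; s < N / SEG; s++) dir[s] = data[s * SEG];
    printf("found 202 at index %d (expect 101)\n", search(202));
    printf("found 203 at index %d (expect -1)\n", search(203));
    return 0;
}
```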

Journal ArticleDOI
TL;DR: This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code within the SUIF compiler framework.
Abstract: With the ever-widening performance gap between processors and main memory, cache memory, which is used to bridge this gap, is becoming more and more significant. Caches work well for programs that exhibit sufficient locality. Other programs, however, have reference patterns that fail to exploit the cache, thereby suffering heavily from high memory latency. In order to get high cache efficiency and achieve good program performance, efficient memory accessing behavior is necessary. In fact, for many programs, program transformations or source-code changes can radically alter memory access patterns, significantly improving cache performance. Both hand-tuning and compiler optimization techniques are often used to transform codes to improve cache utilization. Unfortunately, cache conflicts are difficult to predict and estimate, precluding effective transformations. Hence, effective transformations require detailed knowledge about the frequency and causes of cache misses in the code. This article describes methods for generating and solving Cache Miss Equations (CMEs) that give a detailed representation of cache behavior, including conflict misses, in loop-oriented scientific code. Implemented within the SUIF compiler framework, our approach extends traditional compiler reuse analysis to generate linear Diophantine equations that summarize each loop's memory behavior. While solving these equations is in general difficult, we show that it is also unnecessary, as mathematical techniques for manipulating Diophantine equations allow us to relatively easily compute and/or reduce the number of possible solutions, where each solution corresponds to a potential cache miss. The mathematical precision of CMEs allows us to find true optimal solutions for transformations such as blocking or padding. The generality of CMEs also allows us to reason about interactions between transformations applied in concert. The article also gives examples of their use to determine array padding and offset amounts that minimize cache misses, and to determine optimal blocking factors for tiled code. Overall, these equations represent an analysis framework that offers the generality and precision needed for detailed compiler optimizations.
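
A tiny worked instance of the kind of condition a CME captures, with illustrative sizes: for a direct-mapped cache, A[i] and B[i] conflict exactly when they map to the same cache set, a linear congruence in i, and a small pad between the arrays can empty the solution set:

```c
/* Conflict counting for a direct-mapped cache: set(addr) is a linear
 * congruence, so padding shifts B's sets away from A's. */
#include <stdio.h>

#define LINE   32               /* bytes per cache line        */
#define NSETS 256               /* direct-mapped: 8 KB total   */

static long conflicts(long baseA, long baseB, int n, int elt) {
    long c = 0;
    for (int i = 0; i < n; i++) {
        long setA = ((baseA + (long)i * elt) / LINE) % NSETS;
        long setB = ((baseB + (long)i * elt) / LINE) % NSETS;
        if (setA == setB) c++;  /* both map to one set: ping-pong */
    }
    return c;
}

int main(void) {
    int n = 2048, elt = sizeof(double);
    long baseA = 0, sizeA = (long)n * elt;  /* 16 KB: 2x the cache */
    printf("no pad:  %ld conflicting iterations\n",
           conflicts(baseA, baseA + sizeA, n, elt));
    printf("pad 64B: %ld conflicting iterations\n",
           conflicts(baseA, baseA + sizeA + 64, n, elt));
    return 0;
}
```

With B placed exactly one cache-size multiple after A, every iteration conflicts; the 64-byte pad makes the congruence unsolvable, which is the effect the CME framework computes analytically rather than by enumeration.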

Proceedings ArticleDOI
31 May 1999
TL;DR: This work describes the design and implementation of an integrated architecture for cache systems that scale to hundreds or thousands of caches with thousands to millions of users, and describes how to construct a scalable, high-performance data-location service that tracks where objects are replicated.
Abstract: We describe the design and implementation of an integrated architecture for cache systems that scale to hundreds or thousands of caches with thousands to millions of users. Rather than simply try to maximize hit rates, we take an end-to-end approach to improving response time by also considering hit times and miss times. We begin by studying several Internet caches and workloads, and we derive three core design principles for large scale distributed caches: minimize the number of hops to locate and access data on both hits and misses; share data among many users and scale to many caches; and cache data close to clients. Our strategies for addressing these issues are built around a scalable, high-performance data-location service that tracks where objects are replicated. We describe how to construct such a service and how to use this service to provide direct access to remote data and push-based data replication. We evaluate our system through trace-driven simulation and find that these strategies together provide response time speedups of 1.27 to 2.43 compared to a traditional three-level cache hierarchy for a range of trace workloads and simulated environments.

Proceedings ArticleDOI
17 Aug 1999
TL;DR: In this paper, a new approach using way prediction for achieving high performance and low energy consumption of set-associative caches is proposed, where only a single cache way is accessed, instead of accessing all the ways in a set.
Abstract: This paper proposes a new approach using way prediction for achieving high performance and low energy consumption in set-associative caches. By accessing only the single predicted cache way, instead of accessing all the ways in a set, the energy consumption can be reduced. This paper shows that the way-predicting set-associative cache improves the ED (energy-delay) product by 60-70% compared to a conventional set-associative cache.
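
An illustrative software model of way prediction (organization and replacement are invented simplifications): each set remembers its MRU way, only that way is probed first, and the remaining ways are probed only on a misprediction, so a correct prediction costs roughly a direct-mapped probe:

```c
/* MRU way prediction: count fast (predicted) hits, slow
 * (mispredicted) hits, and misses for a toy address stream. */
#include <stdio.h>
#include <stdint.h>

#define SETS 64
#define WAYS 4
#define LINE 32

static uint32_t tag[SETS][WAYS];
static uint8_t valid[SETS][WAYS], mru[SETS];
static long fast_hits, slow_hits, misses;

static void touch(uint32_t addr) {
    uint32_t set = (addr / LINE) % SETS, t = addr / (LINE * SETS);
    int w = mru[set];                       /* 1. probe only predicted way */
    if (valid[set][w] && tag[set][w] == t) { fast_hits++; return; }
    for (int v = 0; v < WAYS; v++) {        /* 2. mispredicted: probe rest */
        if (v == w || !valid[set][v] || tag[set][v] != t) continue;
        slow_hits++;
        mru[set] = (uint8_t)v;              /* retrain the predictor */
        return;
    }
    misses++;                               /* 3. miss: crude replacement */
    int victim = (mru[set] + 1) % WAYS;
    valid[set][victim] = 1;
    tag[set][victim] = t;
    mru[set] = (uint8_t)victim;
}

int main(void) {
    for (uint32_t i = 0; i < 200000; i++) touch((i * 37) % 16384);
    printf("fast hits %ld, mispredicted hits %ld, misses %ld\n",
           fast_hits, slow_hits, misses);
    return 0;
}
```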

Patent
15 Apr 1999
TL;DR: In this paper, a high-performance cache is proposed for time and space efficiency for a diverse range of information objects, where objects are stored in portions of a nonvolatile storage device called arenas, which are contiguous regions from which space is allocated in parallel.
Abstract: A high-performance cache is disclosed. The cache is designed for time- and space-efficiency for a diverse range of information objects. Information objects are stored in portions of a non-volatile storage device called arenas, which are contiguous regions from which space is allocated in parallel. Objects are substantially contiguously allocated within an arena and are mapped by name keys and content-based object keys to a tag table, an open directory, and a directory table. The tag table is indexed by the name keys, and stores references to sets in the directory table. The tag table is compact and therefore can be stored in fast main memory, facilitating rapid lookups. The directory table is organized so that at least a frequently-accessed portion of it also usually resides in fast main memory, which further speeds lookups. The tag and directory tables are organized to quickly determine non-presence of objects. Large objects may be chunked into fragments, which are chained using a forward functional-iteration mechanism, to prevent the need for mutating existing on-disk data structures. Garbage collection periodically moves objects within an arena or to other arenas so that inactive objects are deleted and free space becomes contiguous. Because the objects are substantially contiguously allocated, reading and writing a typical object requires only one or two disk head actuator movements; thus, the cache can efficiently and smoothly stream data off of the storage device, providing optimal delivery of multimedia objects. The disclosure also encompasses a computer apparatus, computer program product, and computer data signal embodied in a carrier wave that are similarly configured.

Journal ArticleDOI
TL;DR: This paper proposes the Active Cache scheme, a feasible scheme that can result in significant network bandwidth savings at the expense of moderate CPU costs, and describes the protocol, interface and security mechanisms of the scheme.
Abstract: Dynamic documents constitute an increasing percentage of contents on the Web, and caching dynamic documents becomes an increasingly important issue that affects the scalability of the Web. In this paper, we propose the Active Cache scheme to support caching of dynamic contents at Web proxies. The scheme allows servers to supply cache applets to be attached with documents, and requires proxies to invoke cache applets upon cache hits to furnish the necessary processing without contacting the server. We describe the protocol, interface and security mechanisms of the Active Cache scheme, and illustrate its use via several examples. Through prototype implementation and performance measurements, we show that Active Cache is a feasible scheme that can result in significant network bandwidth savings at the expense of moderate CPU costs.

Proceedings ArticleDOI
01 May 1999
TL;DR: In this article, the authors describe two techniques, structure splitting and field reordering, that improve the cache behavior of data structures larger than a cache block by increasing the number of hot fields that can be placed in the cache block.
Abstract: A program's cache performance can be improved by changing the organization and layout of its data---even complex, pointer-based data structures. Previous techniques improved the cache performance of these structures by arranging distinct instances to increase reference locality. These techniques produced significant performance improvements, but worked best for small structures that could be packed into a cache block. This paper extends that work by concentrating on the internal organization of fields in a data structure. It describes two techniques---structure splitting and field reordering---that improve the cache behavior of structures larger than a cache block. For structures comparable in size to a cache block, structure splitting can increase the number of hot fields that can be placed in a cache block. In five Java programs, structure splitting reduced cache miss rates 10--27% and improved performance 6--18% beyond the benefits of previously described cache-conscious reorganization techniques. For large structures, which span many cache blocks, reordering fields to place those with high temporal affinity in the same cache block can also improve cache utilization. This paper describes bbcache, a tool that recommends C structure field reorderings. Preliminary measurements indicate that reordering fields in 5 active structures improves the performance of Microsoft SQL Server 7.0 by 2--3%.
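
A sketch of what structure splitting looks like at the source level; the struct and its field names are invented for illustration. Hot fields stay in the primary struct, so several instances share a 64-byte cache block, while cold fields move behind a pointer that is only chased on cold paths:

```c
/* Structure splitting: hot fields packed together, cold fields
 * moved behind a pointer. */
#include <stdlib.h>

/* before: 120 bytes, spanning two 64-byte cache blocks per instance */
struct employee_flat {
    int   id;                    /* hot: used by every lookup  */
    int   dept;                  /* hot                        */
    char  name[48];              /* cold: printed occasionally */
    char  address[64];           /* cold                       */
};

/* after: hot part is 16 bytes, so four instances per cache block */
struct employee_cold { char name[48]; char address[64]; };
struct employee {
    int id;                      /* hot fields first, together */
    int dept;
    struct employee_cold *cold;  /* indirection on cold paths only */
};

struct employee *employee_new(int id, int dept) {
    struct employee *e = malloc(sizeof *e);
    e->id = id;
    e->dept = dept;
    e->cold = calloc(1, sizeof *e->cold);  /* touched only when needed */
    return e;
}

int main(void) {
    struct employee *e = employee_new(7, 3);
    int ok = (e->id == 7);                 /* hot-path access: no cold load */
    free(e->cold);
    free(e);
    return !ok;
}
```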

Proceedings ArticleDOI
17 Aug 1999
TL;DR: It is shown that a combination of subbanking, multiple line buffers and bit-line segmentation can reduce the on-chip cache power dissipation by as much as 75% in a technology-independent manner.
Abstract: Modern microprocessors employ one or two levels of on-chip caches to bridge the burgeoning speed disparities between the processor and the RAM. These SRAM caches are a major source of power dissipation. We investigate architectural techniques that do not compromise the processor cycle time for reducing the power dissipation within the on-chip cache hierarchy in superscalar microprocessors. We use a detailed register-level simulator of a superscalar microprocessor that simulates the execution of the SPEC benchmarks, and SPICE measurements for the actual layout of a 0.5 micron, 4-metal layer cache optimized for a 300 MHz clock. We show that a combination of subbanking, multiple line buffers and bit-line segmentation can reduce the on-chip cache power dissipation by as much as 75% in a technology-independent manner.

Patent
27 Aug 1999
TL;DR: In this article, the authors present a method for buffering nodes of a hierarchical index (e.g., R-tree, bang file, hB-tree) during operations on multi-dimensional data represented by the index.
Abstract: Methods are provided for buffering nodes of a hierarchical index (e.g., R-tree, bang file, hB-tree) during operations on multi-dimensional data represented by the index. The methods are particularly suited for query operations, and a different method may be more suitable for one pattern of queries than another. Where queries are distributed in a relatively uniform manner across the domain or dataspace of an index, a node-area buffering method is provided. In this method nodes are cached or buffered in order of their respective areas (e.g., their minimum bounding areas), and a node having a smaller area will be replaced in cache before a node having a larger area. When, however, queries are not uniformly distributed, then a least frequently accessed buffering technique may be applied. According to this method statistics are maintained concerning the frequency with which individual index nodes are accessed. Those accessed less frequently are replaced in cache before those accessed more frequently. Yet another, generic, buffering strategy is provided that is suitable for all patterns of query distribution. In accordance with this method, whenever a node must be removed from cache in order to make room for a newly accessed node, cached nodes are compared to each other to determine which provides the least caching benefit and may therefore be ejected. A comparison may involve three factors—the difference in the nodes' areas, the difference in the frequency with which they have been accessed and the difference between their latest access times. These factors may be weighted to give them more or less effect in relation to each other.
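
A small sketch of the generic strategy's pairwise comparison (the weights and statistics are illustrative, not from the patent): three weighted factors, area, access frequency, and recency, combine into a benefit difference whose sign picks the eviction victim:

```c
/* Weighted three-factor comparison of two cached index nodes:
 * a negative result means `a` provides less caching benefit. */
#include <stdio.h>

typedef struct {
    double area;        /* node's minimum bounding area  */
    long   accesses;    /* how often it has been hit     */
    long   last_access; /* logical clock of the last hit */
} NodeStats;

static const double W_AREA = 1.0, W_FREQ = 2.0, W_RECENCY = 0.5;

static double benefit_diff(const NodeStats *a, const NodeStats *b, long now) {
    return W_AREA    * (a->area - b->area)                /* larger area: keep */
         + W_FREQ    * (double)(a->accesses - b->accesses)
         - W_RECENCY * (double)((now - a->last_access) -
                                (now - b->last_access));  /* staler: evict */
}

int main(void) {
    NodeStats big_hot    = { 50.0, 40, 98 };
    NodeStats small_cold = {  2.0,  3, 10 };
    printf("evict %s\n",
           benefit_diff(&small_cold, &big_hot, 100) < 0 ? "small_cold"
                                                        : "big_hot");
    return 0;
}
```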

Proceedings Article
01 May 1999
TL;DR: This paper describes how the Coda File System has evolved to exploit weak connectivity networks, and the underlying theme of this evolution has been the systematic introduction of adaptivity to eliminate hidden assumptions about strong connectivity.
Abstract: Weak connectivity, in the form of intermittent, low-bandwidth, or expensive networks is a fact of life in mobile computing. In this paper, we describe how the Coda File System has evolved to exploit such networks. The underlying theme of this evolution has been the systematic introduction of adaptivity to eliminate hidden assumptions about strong connectivity. Many aspects of the system, including communication, cache validation, update propagation and cache miss handling have been modified. As a result, Coda is able to provide good performance even when network bandwidth varies over four orders of magnitude - from modem speeds to LAN speeds.

Proceedings ArticleDOI
Chen Ding1, Ken Kennedy1
01 May 1999
TL;DR: It is demonstrated that run-time program transformations can substantially improve computation and data locality and, despite the complexity and cost involved, a compiler can automate such transformations, eliminating much of the associated run- time overhead.
Abstract: With the rapid improvement of processor speed, performance of the memory hierarchy has become the principal bottleneck for most applications. A number of compiler transformations have been developed to improve data reuse in cache and registers, thus reducing the total number of direct memory accesses in a program. Until now, however, most data reuse transformations have been static---applied only at compile time. As a result, these transformations cannot be used to optimize irregular and dynamic applications, in which the data layout and data access patterns remain unknown until run time and may even change during the computation. In this paper, we explore ways to achieve better data reuse in irregular and dynamic applications by building on the inspector-executor method used by Saltz for run-time parallelization. In particular, we present and evaluate a dynamic approach for improving both computation and data locality in irregular programs. Our results demonstrate that run-time program transformations can substantially improve computation and data locality and, despite the complexity and cost involved, a compiler can automate such transformations, eliminating much of the associated run-time overhead.
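
A toy sketch of the inspector-executor pattern the paper builds on: the inspector records the order in which the irregular index array first touches each element and derives a data permutation; the executor then runs the same computation over the reordered data with mostly sequential accesses. Array contents are invented:

```c
/* Inspector-executor: first-touch data reordering for an
 * irregular indexed access pattern. */
#include <stdio.h>

#define NDATA 8
#define NIDX 12

int main(void) {
    double x[NDATA] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    int idx[NIDX]   = { 5, 2, 5, 7, 2, 0, 7, 1, 5, 3, 0, 6 };

    /* inspector: first-touch ordering of the data elements */
    int newpos[NDATA], seen[NDATA] = {0}, next = 0;
    for (int i = 0; i < NIDX; i++)
        if (!seen[idx[i]]) { seen[idx[i]] = 1; newpos[idx[i]] = next++; }
    for (int d = 0; d < NDATA; d++)        /* untouched elements go last */
        if (!seen[d]) newpos[d] = next++;

    /* apply the permutation to the data and to the index array */
    double xr[NDATA];
    for (int d = 0; d < NDATA; d++) xr[newpos[d]] = x[d];
    for (int i = 0; i < NIDX; i++) idx[i] = newpos[idx[i]];

    /* executor: same computation, now with improved spatial locality */
    double sum = 0;
    for (int i = 0; i < NIDX; i++) sum += xr[idx[i]];
    printf("sum = %.1f\n", sum);           /* 43.0 either way */
    return 0;
}
```

The inspector cost is paid once and amortized over the many executor iterations typical of irregular scientific codes, which is the tradeoff the paper evaluates.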

Proceedings ArticleDOI
22 Feb 1999
TL;DR: The design and implementation of Tornado is described, a new operating system designed from the ground up specifically for today's shared memory multiprocessors, which has far better performance characteristics, particularly for multithreaded applications, than existing commercial operating systems.
Abstract: We describe the design and implementation of Tornado, a new operating system designed from the ground up specifically for today's shared memory multiprocessors. The need for improved locality in the operating system is growing as multiprocessor hardware evolves, increasing the costs for cache misses and sharing, and adding complications due to NUMAness. Tornado is optimized so that locality and independence in application requests for operating system services, whether from multiple sequential applications or a single parallel application, are mapped onto locality and independence in the servicing of these requests in the kernel and system servers. By contrast, previous shared memory multiprocessor operating systems all evolved from designs constructed at a time when sharing costs were low, memory latency was low and uniform, and caches were small; for these systems, concurrency was the main performance concern and locality was not an important issue. Tornado achieves this locality by starting with an object-oriented structure, where every virtual and physical resource is represented by an independent object. Locality, as well as concurrency, is further enhanced with the introduction of three key innovations: (i) clustered objects that support the partitioning of contended objects across processors, (ii) a protected procedure call facility that preserves the locality and concurrency of IPCs, and (iii) a new locking strategy that allows all locking to be encapsulated within the objects being protected and greatly simplifies the overall locking protocols. As a result of these techniques, Tornado has far better performance characteristics, particularly for multithreaded applications, than existing commercial operating systems. Tornado has been fully implemented and runs both on Toronto's NUMAchine hardware and on the SimOS simulator.

Proceedings ArticleDOI
01 Jan 1999
TL;DR: Alternative data structures are proposed, along with reordering algorithms to increase the effectiveness of these data structures, to reduce the number of memory indirections in sparse matrix-vector multiplication (SpMxV).
Abstract: Sparse matrix-vector multiplication (SpMxV) is one of the most important computational kernels in scientific computing. It often suffers from poor cache utilization and extra load operations because of memory indirections used to exploit sparsity. We propose alternative data structures, along with reordering algorithms to increase effectiveness of these data structures, to reduce the number of memory indirections. Toledo proposed handling the 1x2 blocks of a matrix separately, doing only one indirection for each block. We propose packing all contiguous nonzeros into a block to reduce the number of memory indirections further. This reduces memory indirections per block to one for the cost of an extra array in storage and a loop during SpMxV. We also propose an algorithm to permute the nonzeros of the matrix into contiguous locations. We state this problem as the traveling salesperson problem and use associated heuristics. Experiments verify the effectiveness of our techniques.
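
A hedged sketch of the blocked format described here, with a hard-coded toy matrix: each run of contiguous nonzeros is stored as a start column, a length, and its values, so one column indirection serves the whole run where plain CSR pays one per nonzero:

```c
/* SpMxV over runs of contiguous nonzeros: one indirection per run,
 * then pure streaming through the values. */
#include <stdio.h>

int main(void) {
    /* 3x6 matrix with 3 contiguous runs of nonzeros:
       row 0: cols 1-3 = 1 2 3; row 1: col 0 = 4; row 2: cols 4-5 = 5 6 */
    double val[]    = { 1, 2, 3, 4, 5, 6 };
    int block_col[] = { 1, 0, 4 };      /* start column of each run */
    int block_len[] = { 3, 1, 2 };
    int row_block[] = { 0, 1, 2, 3 };   /* runs [start,end) per row */

    double x[6] = { 1, 1, 1, 1, 1, 1 }, y[3] = { 0 };
    int v = 0;
    for (int i = 0; i < 3; i++)
        for (int b = row_block[i]; b < row_block[i + 1]; b++) {
            const double *xs = &x[block_col[b]];  /* single indirection */
            for (int k = 0; k < block_len[b]; k++)
                y[i] += val[v++] * xs[k];         /* streamed thereafter */
        }
    printf("y = %.0f %.0f %.0f\n", y[0], y[1], y[2]);  /* 6 4 11 */
    return 0;
}
```

The paper's reordering algorithms aim precisely at lengthening these runs, since longer runs amortize the per-block indirection over more nonzeros.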

Journal ArticleDOI
TL;DR: For interprocedural analysis, existing methods are examined and a new approach that is especially tailored for the cache analysis is presented, which allows for a static classification of the cache behavior of memory references of programs.
Abstract: Abstract interpretation is a technique for the static detection of dynamic properties of programs. It is semantics based, that is, it computes approximative properties of the semantics of programs. On this basis, it supports correctness proofs of analyses. It replaces commonly used ad hoc techniques by systematic, provable ones, and it allows for the automatic generation of analyzers from specifications by existing tools. In this work, abstract interpretation is applied to the problem of predicting the cache behavior of programs. Abstract semantics of machine programs are defined which determine the contents of caches. For interprocedural analysis, existing methods are examined and a new approach that is especially tailored for the cache analysis is presented. This allows for a static classification of the cache behavior of memory references of programs. The calculated information can be used to improve worst-case execution time estimations. It is possible to analyze instruction, data, and combined instruction/data caches for common (re)placement and write strategies. Experimental results are presented that demonstrate the applicability of the analyses.
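
A compact sketch of the "must" analysis flavor of this abstract-interpretation approach, for a single 4-way LRU cache set (the domain encoding is a simplification): each block carries an upper bound on its age, a control-flow join keeps only blocks present on both paths at their maximum age, and any block whose bound stays below the associativity is a guaranteed hit:

```c
/* Must-cache analysis for one LRU set: abstract ages, LRU update,
 * and the join at a control-flow merge. */
#include <stdio.h>

#define WAYS    4
#define NBLOCKS 6
#define ABSENT  WAYS          /* age >= WAYS means "maybe not cached" */

typedef struct { int age[NBLOCKS]; } Must;

/* LRU update on access to block b: b gets age 0, younger blocks age */
static void update(Must *s, int b) {
    int old = s->age[b];
    for (int x = 0; x < NBLOCKS; x++)
        if (x != b && s->age[x] < old && s->age[x] < WAYS) s->age[x]++;
    s->age[b] = 0;
}

/* join at a merge point: present on both paths, maximal age */
static Must join(const Must *a, const Must *b) {
    Must r;
    for (int x = 0; x < NBLOCKS; x++)
        r.age[x] = a->age[x] > b->age[x] ? a->age[x] : b->age[x];
    return r;
}

int main(void) {
    Must p = { { ABSENT, ABSENT, ABSENT, ABSENT, ABSENT, ABSENT } }, q = p;
    update(&p, 0); update(&p, 1);      /* path 1 accesses blocks 0, 1 */
    update(&q, 2); update(&q, 0);      /* path 2 accesses blocks 2, 0 */
    Must m = join(&p, &q);
    for (int x = 0; x < NBLOCKS; x++)
        printf("block %d: %s\n", x,
               m.age[x] < WAYS ? "always hit on next access" : "unknown");
    return 0;
}
```

Only block 0 survives the join (it was accessed on both paths), which is exactly the classification the WCET estimator can rely on.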