
Showing papers on "Cache published in 2004"


Proceedings ArticleDOI
29 Sep 2004
TL;DR: It is found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness, and two algorithms are proposed that optimize fairness.
Abstract: This paper presents a detailed study of fairness in cache sharing between threads in a chip multiprocessor (CMP) architecture. Prior work in CMP architectures has only studied throughput optimization techniques for a shared cache. The issue of fairness in cache sharing, and its relation to throughput, has not been studied. Fairness is a critical issue because the operating system (OS) thread scheduler's effectiveness depends on the hardware to provide fair cache sharing to co-scheduled threads. Without such hardware, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective. This paper makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate very strongly with the execution-time fairness. Execution-time fairness is defined as how uniform the execution times of co-scheduled threads are changed, where each change is relative to the execution time of the same thread running alone. Secondly, using the metrics, the paper proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. The static algorithm, although requiring the cache to maintain LRU stack information, can help the OS thread scheduler to avoid cache thrashing. Finally, this paper studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, on average our algorithms improve fairness by a factor of 4×, while increasing the throughput by 15%, compared to a nonpartitioned shared cache.
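
A minimal Python sketch of the underlying idea (not the paper's exact metric definitions): compare each co-scheduled thread's relative slowdown under sharing against running alone, and summarize the spread; a miss-based version of the same ratio serves as the hardware-measurable proxy. All numbers below are made up for illustration.

# Illustrative only: X_i is thread i's execution time (or miss count) when sharing
# the cache divided by the same quantity when running alone; a spread of 0 across
# threads is perfectly fair.
def relative_changes(shared, alone):
    return [s / a for s, a in zip(shared, alone)]

def unfairness(changes):
    # Spread between the most and least penalized co-scheduled threads.
    return max(changes) - min(changes)

# Execution-time fairness (what the OS ultimately cares about) ...
time_metric = unfairness(relative_changes([140.0, 95.0], [100.0, 90.0]))
# ... and a miss-count proxy that hardware can measure online.
miss_metric = unfairness(relative_changes([5.6e6, 1.1e6], [4.0e6, 1.0e6]))
print(round(time_metric, 2), round(miss_metric, 2))   # 0.34 0.3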

544 citations


Proceedings Article
29 Mar 2004
TL;DR: CoralCDN is a peer-to-peer content distribution network that allows a user to run a web site that offers high performance and meets huge demand, all for the price of a cheap broadband Internet connection.
Abstract: CoralCDN is a peer-to-peer content distribution network that allows a user to run a web site that offers high performance and meets huge demand, all for the price of a cheap broadband Internet connection. Volunteer sites that run CoralCDN automatically replicate content as a side effect of users accessing it. Publishing through CoralCDN is as simple as making a small change to the hostname in an object's URL; a peer-to-peer DNS layer transparently redirects browsers to nearby participating cache nodes, which in turn cooperate to minimize load on the origin web server. One of the system's key goals is to avoid creating hot spots that might dissuade volunteers and hurt performance. It achieves this through Coral, a latency-optimized hierarchical indexing infrastructure based on a novel abstraction called a distributed sloppy hash table, or DSHT.
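
To illustrate the "small change to the hostname," the sketch below rewrites a URL so that its DNS lookup is handled by the peer-to-peer DNS layer. The .nyud.net suffix and port 8090 reflect the publicly deployed CoralCDN service rather than anything stated in the abstract, so treat them as assumptions.

# Sketch of "Coralizing" a URL: append Coral's DNS suffix to the hostname so the
# peer-to-peer DNS layer redirects the browser to a nearby cache node.
# The ".nyud.net" suffix and port 8090 are assumptions about the deployed service.
from urllib.parse import urlsplit, urlunsplit

def coralize(url, suffix="nyud.net", port=8090):
    parts = urlsplit(url)
    host = parts.hostname or ""
    return urlunsplit((parts.scheme, f"{host}.{suffix}:{port}",
                       parts.path, parts.query, parts.fragment))

print(coralize("http://example.com/papers/coral.pdf"))
# http://example.com.nyud.net:8090/papers/coral.pdf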

514 citations


Patent
16 Sep 2004
TL;DR: In this article, an apparatus and method for virtual memory mapping and transaction management in an object-oriented database system having permanent storage for storing data in at least one database, at least a cache memory for temporarily storing data, and a processing unit which runs application programs which request data using virtual addresses.
Abstract: An apparatus and method are provided for virtual memory mapping and transaction management in an object-oriented database system having permanent storage for storing data in at least one database, at least one cache memory for temporarily storing data, and a processing unit which runs application programs which request data using virtual addresses. The system performs data transfers in response to memory faults resulting from requested data not being available at specified virtual addresses and performs mapping of data in cache memory. The data in the database may include pointers containing persistent addresses, which pointers are relocated between persistent addresses and virtual addresses. When a data request is made, either for read or write, from a given client computer in a system, other client computers in the system are queried to determine if the requested data is cached and/or locked in a manner inconsistent with the requested use, and the inconsistent caching is downgraded or the transfer delayed until such downgrading can be performed.
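
The pointer relocation described above (persistent addresses rewritten to virtual addresses when a page is faulted in, and back again on write-back) is often called swizzling. The sketch below is a hypothetical, simplified illustration with a made-up mapping table and 4 KB pages, not the patented mechanism.

# Minimal, hypothetical sketch of relocating ("swizzling") pointers between
# persistent addresses and virtual addresses. Page size and mapping table are
# illustrative assumptions.

PAGE = 4096

class Mapper:
    def __init__(self):
        self.p2v = {}   # persistent page id -> virtual base address
        self.v2p = {}   # virtual base address -> persistent page id

    def map_page(self, persistent_page, virtual_base):
        self.p2v[persistent_page] = virtual_base
        self.v2p[virtual_base] = persistent_page

    def swizzle(self, persistent_addr):
        """Persistent address -> virtual address (used when a page is faulted in)."""
        page, offset = divmod(persistent_addr, PAGE)
        return self.p2v[page] + offset

    def unswizzle(self, virtual_addr):
        """Virtual address -> persistent address (used when a page is written back)."""
        base = virtual_addr - (virtual_addr % PAGE)
        return self.v2p[base] * PAGE + (virtual_addr - base)

m = Mapper()
m.map_page(persistent_page=7, virtual_base=0x40000000)
va = m.swizzle(7 * PAGE + 128)          # 0x40000080
assert m.unswizzle(va) == 7 * PAGE + 128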

442 citations


Journal ArticleDOI
TL;DR: Describes the approach used to improve chip-level performance in the Power5, which, in addition to maintaining compatibility with Power4 systems, specifies increased performance and other functional enhancements for server virtualization, reliability, availability, and serviceability at both chip and system levels.
Abstract: IBM introduced Power4-based systems in 2001. The Power4 design integrates two processor cores on a single chip, a shared second-level cache, a directory for an off-chip third-level cache, and the necessary circuitry to connect it to other Power4 chips to form a system. The dual-processor chip provides natural thread-level parallelism at the chip level. The Power5 is the next-generation chip in this line. One of our key goals in designing the Power5 was to maintain both binary and structural compatibility with existing Power4 systems to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems. With that base requirement, we specified increased performance and other functional enhancements of server virtualization, reliability, availability, and serviceability at both chip and system levels. We describe the approach we used to improve chip-level performance.

410 citations


Journal ArticleDOI
TL;DR: The results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory and can improve the total IPC significantly over the standard least recently used (LRU) replacement policy.
Abstract: This paper proposes dynamic cache partitioning amongst simultaneously executing processes/threads. We present a general partitioning scheme that can be applied to set-associative caches. Since memory reference characteristics of processes/threads can change over time, our method collects the cache miss characteristics of processes/threads at run-time. Also, the workload is determined at run-time by the operating system scheduler. Our scheme combines the information, and partitions the cache amongst the executing processes/threads. Partition sizes are varied dynamically to reduce the total number of misses. The partitioning scheme has been evaluated using a processor simulator modeling a two-processor CMP system. The results show that the scheme can improve the total IPC significantly over the standard least recently used (LRU) replacement policy. In a certain case, partitioning doubles the total IPC over standard LRU. Our results show that smart cache management and scheduling is essential to achieve high performance with shared cache memory.
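
One standard way to turn per-thread miss characteristics into partition sizes, shown here as an illustrative sketch rather than the paper's exact procedure, is greedy allocation by marginal miss reduction: each cache way goes to whichever thread currently saves the most misses by receiving it.

# Illustrative greedy way-allocation from per-thread miss curves (not the paper's
# exact algorithm). misses[t][k] is thread t's miss count when given k ways, as
# would be gathered from run-time miss counters; index 0 is unused.

def partition(misses, total_ways):
    threads = list(misses)
    alloc = {t: 1 for t in threads}                 # every thread gets at least one way
    for _ in range(total_ways - len(threads)):
        def gain(t):
            return misses[t][alloc[t]] - misses[t][alloc[t] + 1]
        best = max(threads, key=gain)               # biggest marginal miss reduction wins
        alloc[best] += 1
    return alloc

miss_curves = {
    "A": [0, 900, 500, 300, 200, 150, 120, 100, 90],   # keeps benefiting from more ways
    "B": [0, 400, 380, 370, 365, 362, 360, 359, 358],  # saturates after one way
}
print(partition(miss_curves, total_ways=8))   # {'A': 7, 'B': 1}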

402 citations


Journal ArticleDOI
01 Feb 2004
TL;DR: This paper discusses the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors, and describes the different optimizations and parameter selection techniques and evaluates them on several machines using over 40 matrices.
Abstract: Sparse matrix-vector multiplication is an important computational kernel that performs poorly on most modern processors due to a low compute-to-memory ratio and irregular memory access patterns. Optimization is difficult because of the complexity of cache-based memory systems and because performance is highly dependent on the non-zero structure of the matrix. The SPARSITY system is designed to address these problems by allowing users to automatically build sparse matrix kernels that are tuned to their matrices and machines. SPARSITY combines traditional techniques such as loop transformations with data structure transformations and optimization heuristics that are specific to sparse matrices. It provides a novel framework for selecting optimization parameters, such as block size, using a combination of performance models and search. In this paper we discuss the optimization of two operations: a sparse matrix times a dense vector and a sparse matrix times a set of dense vectors. Our experience indicates that register level optimizations are effective for matrices arising in certain scientific simulations, in particular finite-element problems. Cache level optimizations are important when the vector used in multiplication is larger than the cache size, especially for matrices in which the non-zero structure is random. For applications involving multiple vectors, reorganizing the computation to perform the entire set of multiplications as a single operation produces significant speedups. We describe the different optimizations and parameter selection techniques and evaluate them on several machines using over 40 matrices taken from a broad set of application domains. Our results demonstrate speedups of up to 4X for the single vector case and up to 10X for the multiple vector case.
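
A plain CSR sketch of the two kernels discussed above may help. Register and cache blocking (the heart of SPARSITY's tuning) are omitted, but the multiple-vector version already shows where the reported speedup comes from: each matrix entry loaded from memory is reused across all vectors.

# Sketch of the two kernels in plain CSR form (register/cache blocking omitted).

def spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a CSR matrix A."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y[i] = acc
    return y

def spmm(values, col_idx, row_ptr, X):
    """Y = A @ X for a set of dense vectors X (one list per vector)."""
    nvec = len(X)
    Y = [[0.0] * (len(row_ptr) - 1) for _ in range(nvec)]
    for i in range(len(row_ptr) - 1):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            a, c = values[j], col_idx[j]      # loaded once, reused nvec times
            for v in range(nvec):
                Y[v][i] += a * X[v][c]
    return Y

# 2x2 example: A = [[4, 0], [1, 2]]
vals, cols, rows = [4.0, 1.0, 2.0], [0, 0, 1], [0, 1, 3]
print(spmv(vals, cols, rows, [1.0, 1.0]))                # [4.0, 3.0]
print(spmm(vals, cols, rows, [[1.0, 1.0], [0.0, 2.0]]))  # [[4.0, 3.0], [0.0, 4.0]]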

351 citations


Proceedings ArticleDOI
07 Mar 2004
TL;DR: A hybrid approach (HybridCache) is proposed, which can further improve the performance by taking advantage of CacheData and CachePath while avoiding their weaknesses, and can significantly reduce the query delay and message complexity when compared to other caching schemes.
Abstract: Most research on ad hoc networks focuses on routing, and not much work has been done on data access. A common technique used to improve the performance of data access is caching. Cooperative caching, which allows the sharing and coordination of cached data among multiple nodes, can further explore the potential of caching techniques. Due to mobility and resource constraints of ad hoc networks, cooperative caching techniques designed for wired networks may not be applicable to ad hoc networks. In this paper, we design and evaluate cooperative caching techniques to efficiently support data access in ad hoc networks. We first propose two schemes: CacheData, which caches the data, and CachePath, which caches the data path. After analyzing the performance of those two schemes, we propose a hybrid approach (HybridCache) which can further improve the performance by taking advantage of CacheData and CachePath while avoiding their weaknesses. Simulation results show that the proposed schemes can significantly reduce the query delay and message complexity when compared to other caching schemes.
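
A rough sketch of the hybrid decision at a forwarding node: keep the data itself when the item is small, otherwise keep only the path to a nearby copy. The size threshold and hop-saving rule below are illustrative assumptions, not the paper's exact criteria.

# Hedged sketch of a HybridCache-style decision when a data item passes through
# an intermediate node: small items are cached directly (CacheData behaviour),
# large items only get a pointer to a nearby caching node (CachePath behaviour).

SIZE_THRESHOLD = 2 * 1024      # bytes: "small enough to store locally" (assumed)
MIN_PATH_SAVING = 2            # only cache a path if it saves at least this many hops (assumed)

def on_data_passing_by(cache, item_id, data, caching_node, hops_to_caching_node,
                       hops_to_server):
    if len(data) <= SIZE_THRESHOLD:
        cache[item_id] = ("DATA", data)
    elif hops_to_server - hops_to_caching_node >= MIN_PATH_SAVING:
        cache[item_id] = ("PATH", caching_node)
    # otherwise cache nothing: the path would save too little to be worth storing

cache = {}
on_data_passing_by(cache, "song.mp3", b"x" * 50_000, caching_node="n17",
                   hops_to_caching_node=2, hops_to_server=6)
print(cache)   # {'song.mp3': ('PATH', 'n17')}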

327 citations


Journal ArticleDOI
02 Mar 2004
TL;DR: An adaptive policy that dynamically adapts to the costs and benefits of cache compression is developed and it is shown that compression can improve performance for memory-intensive commercial workloads by up to 17%.
Abstract: Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop an adaptive policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy where the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement, where each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression (could have) eliminated a miss or incurs an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks (due to unnecessary decompression overhead), degrading performance by up to 18%. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression, while never degrading performance by more than 0.4%.
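
The global saturating counter at the core of the adaptive policy can be sketched as follows. Counter width, increment sizes, and the "top four stack positions would fit uncompressed" test are simplified assumptions in the spirit of the description above, not the paper's exact parameters.

# Minimal sketch of the adaptive policy's global saturating counter. On each L2
# reference, the block's LRU stack depth decides whether compression helped (it
# turned a would-be miss into a hit) or only cost a needless decompression.

COUNTER_MAX = 1023          # saturating counter range [0, COUNTER_MAX] (assumed width)
counter = COUNTER_MAX // 2

def update(stack_depth, line_was_compressed, penalty=1, benefit=5):
    """stack_depth: 1 = MRU. Depths 1-4 would hit even with nothing compressed;
    depths 5-8 are hits only because compression packed extra lines into the set."""
    global counter
    if stack_depth > 4:
        counter = min(COUNTER_MAX, counter + benefit)   # compression avoided a miss
    elif line_was_compressed:
        counter = max(0, counter - penalty)             # paid decompression for nothing
    # depths 1-4 on an uncompressed line: no evidence either way

def should_compress_new_line():
    return counter >= COUNTER_MAX // 2                  # prediction used at allocation time

update(stack_depth=6, line_was_compressed=True)
update(stack_depth=2, line_was_compressed=True)
print(counter, should_compress_new_line())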

304 citations


Journal ArticleDOI
02 Mar 2004
TL;DR: A proposed performance model for superscalar processors consists of a component that models the relationship between instructions issued per cycle and the size of the instruction window under ideal conditions and methods for calculating transient performance penalties due to branch mispredictions, instruction cache misses, and data cache misses.
Abstract: A proposed performance model for superscalar processors consists of 1) a component that models the relationship between instructions issued per cycle and the size of the instruction window under ideal conditions, and 2) methods for calculating transient performance penalties due to branch mispredictions, instruction cache misses, and data cache misses. Using trace-derived data dependence information, data and instruction cache miss rates, and branch misprediction rates as inputs, the model can arrive at performance estimates for a typical superscalar processor that are within 5.8% of detailed simulation on average and within 13% in the worst case. The model also provides insights into the workings of superscalar processors and long-term microarchitecture trends such as pipeline depths and issue widths.
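
The overall shape of such a model (a steady-state IPC from the ideal window curve plus transient penalties charged per misprediction and per cache miss) can be sketched as below. The functional form of ideal_ipc() and the penalty terms are placeholders for illustration, not the paper's fitted equations.

# Hedged sketch of a first-order superscalar performance model: ideal IPC from the
# window-size curve, plus transient CPI penalties per kilo-instruction of
# mispredictions and cache misses. All formulas and constants are placeholders.

def ideal_ipc(window_size, issue_width):
    # Placeholder: IPC grows with window size but saturates at the issue width.
    return issue_width * (1.0 - (1.0 - 1.0 / issue_width) ** (window_size ** 0.5))

def estimated_cpi(window, width, branch_mpki, branch_penalty,
                  il1_mpki, il1_penalty, dl2_mpki, dl2_penalty):
    base_cpi = 1.0 / ideal_ipc(window, width)
    transient_cpi = (branch_mpki * branch_penalty
                     + il1_mpki * il1_penalty
                     + dl2_mpki * dl2_penalty) / 1000.0   # penalties per kilo-instruction
    return base_cpi + transient_cpi

cpi = estimated_cpi(window=128, width=4, branch_mpki=5, branch_penalty=14,
                    il1_mpki=3, il1_penalty=12, dl2_mpki=1.5, dl2_penalty=200)
print(round(cpi, 3), "CPI ->", round(1.0 / cpi, 2), "IPC")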

295 citations


Patent
10 Jun 2004
TL;DR: In this article, the authors describe a web site system that persistently stores event data reflective of events that occur during browsing sessions of web site users, and makes such data available to other applications and services in real time.
Abstract: A web site system (30) includes an event history server system (32) that persistently stores event data reflective of events that occur during browsing sessions of web site users, and makes such data available to other applications and services (38) in real time. The server system (32) may, for example, be used to record information about every mouse click of every recognized user, and may also be used to record other types of events such as impressions and mouse-over events. The event data of a particular user may be retrieved from the server system (32) based on event type, event time of occurrence, and various other criteria. In one embodiment, the server system (32) includes a cache layer (40) that caches event data by session ID, and includes a persistent storage layer (44) that persistently stores the event data by user ID. Also disclosed are various application features that may be implemented using the stored event data.

293 citations


Proceedings ArticleDOI
Ravi Iyer
26 Jun 2004
TL;DR: A new cache management framework (CQoS) that recognizes the heterogeneity in memory access streams, introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity and assigns and enforces priorities to streams based on latency sensitivity, locality degree and application performance needs is presented.
Abstract: Cache hierarchies have been traditionally designed for usage by a single application, thread or core. As multi-threaded (MT) and multi-core (CMP) platform architectures emerge and their workloads range from single-threaded and multithreaded applications to complex virtual machines (VMs), a shared cache resource will be consumed by these different entities generating heterogeneous memory access streams exhibiting different locality properties and varying memory sensitivity. As a result, conventional cache management approaches that treat all memory accesses equally are bound to result in inefficient space utilization and poor performance even for applications with good locality properties. To address this problem, this paper presents a new cache management framework (CQoS) that (1) recognizes the heterogeneity in memory access streams, (2) introduces the notion of QoS to handle the varying degrees of locality and latency sensitivity and (3) assigns and enforces priorities to streams based on latency sensitivity, locality degree and application performance needs. To achieve this, we propose CQoS options for priority classification, priority assignment and priority enforcement. We briefly describe CQoS priority classification and assignment options -- ranging from user-driven and developer-driven to compiler-detected and flow-based approaches. Our focus in this paper is on CQoS mechanisms for priority enforcement -- these include (1) selective cache allocation, (2) static/dynamic set partitioning and (3) heterogeneous cache regions. We discuss the architectural design and implementation complexity of these CQoS options. To evaluate the performance trade-offs for these options, we have modeled these CQoS options in a cache simulator and evaluated their performance in CMP platforms running network-intensive server workloads. Our simulation results show the effectiveness of our proposed options and make the case for CQoS in future multi-threaded/multi-core platforms since it improves shared cache efficiency and increases overall system performance as a result.

Patent
30 Jun 2004
TL;DR: In this article, a client assistant examines its cache for the requested document, and if the client assistant cannot provide the copy, the server seeks it from a document repository rather than the document's web host.
Abstract: Upon receipt of a document request, a client assistant examines its cache for the document. If not successful, a server searches for the requested document in its cache. If the server copy is still not fresh or not found, the server seeks the document from its host. If the host cannot provide the copy, the server seeks it from a document repository. Certain documents are identified from the document repository as being fresh or stable. Information about each of these identified documents is transmitted to the server, which inserts entries into an index if the index does not already contain an entry for the document. If and when such a document is requested and is not present in the server, the server will still contain an entry directing it to obtain the document from the document repository rather than from the document's web host.

Journal ArticleDOI
Nimrod Megiddo, Dharmendra S. Modha
TL;DR: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features.
Abstract: The self-tuning, low-overhead, scan-resistant adaptive replacement cache algorithm outperforms the least-recently-used algorithm by dynamically responding to changing access patterns and continually balancing between workload recency and frequency features. Caching, a fundamental metaphor in modern computing, finds wide application in storage systems, databases, Web servers, middleware, processors, file systems, disk drives, redundant array of independent disks controllers, operating systems, and other applications such as data compression and list updating. In a two-level memory hierarchy, a cache performs faster than auxiliary storage, but it is more expensive. Cost concerns thus usually limit cache size to a fraction of the auxiliary memory's size.
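
For reference, a compact Python rendering of the adaptive replacement cache (ARC) policy following its published pseudocode: T1/T2 hold cached pages leaning toward recency and frequency respectively, B1/B2 are "ghost" lists of recently evicted page ids, and hits in the ghost lists move the adaptation target p. This is a simplified sketch (no payloads, O(n) list operations), not production code.

class ARC:
    def __init__(self, c):
        self.c, self.p = c, 0
        self.T1, self.T2, self.B1, self.B2 = [], [], [], []   # index 0 = LRU end

    def _replace(self, x):
        if self.T1 and ((x in self.B2 and len(self.T1) == self.p) or len(self.T1) > self.p):
            self.B1.append(self.T1.pop(0))     # demote LRU of T1 into ghost list B1
        else:
            self.B2.append(self.T2.pop(0))     # demote LRU of T2 into ghost list B2

    def request(self, x):
        if x in self.T1 or x in self.T2:                      # cache hit
            (self.T1 if x in self.T1 else self.T2).remove(x)
            self.T2.append(x)
            return True
        if x in self.B1:                                      # ghost hit: favour recency
            self.p = min(self.c, self.p + max(len(self.B2) // max(len(self.B1), 1), 1))
            self._replace(x); self.B1.remove(x); self.T2.append(x)
            return False
        if x in self.B2:                                      # ghost hit: favour frequency
            self.p = max(0, self.p - max(len(self.B1) // max(len(self.B2), 1), 1))
            self._replace(x); self.B2.remove(x); self.T2.append(x)
            return False
        # complete miss
        if len(self.T1) + len(self.B1) == self.c:
            if len(self.T1) < self.c:
                self.B1.pop(0); self._replace(x)
            else:
                self.T1.pop(0)
        elif len(self.T1) + len(self.T2) + len(self.B1) + len(self.B2) >= self.c:
            if len(self.T1) + len(self.T2) + len(self.B1) + len(self.B2) == 2 * self.c:
                self.B2.pop(0)
            self._replace(x)
        self.T1.append(x)
        return False

cache = ARC(c=4)
hits = sum(cache.request(x) for x in [1, 2, 3, 1, 1, 4, 5, 2, 6, 7, 1, 2])
print(hits, cache.T1, cache.T2)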

Proceedings Article
31 Mar 2004
TL;DR: A simple and elegant new algorithm, namely, CLOCK with Adaptive Replacement (CAR), that has several advantages over CLOCK: it is scan-resistant, self-tuning and it adaptively and dynamically captures the "recency" and "frequency" features of a workload.
Abstract: CLOCK is a classical cache replacement policy dating back to 1968 that was proposed as a low-complexity approximation to LRU. On every cache hit, the policy LRU needs to move the accessed item to the most recently used position, at which point, to ensure consistency and correctness, it serializes cache hits behind a single global lock. CLOCK eliminates this lock contention, and, hence, can support high concurrency and high throughput environments such as virtual memory (for example, Multics, UNIX, BSD, AIX) and databases (for example, DB2). Unfortunately, CLOCK is still plagued by disadvantages of LRU such as disregard for "frequency", susceptibility to scans, and low performance. As our main contribution, we propose a simple and elegant new algorithm, namely, CLOCK with Adaptive Replacement (CAR), that has several advantages over CLOCK: (i) it is scan-resistant; (ii) it is self-tuning and it adaptively and dynamically captures the "recency" and "frequency" features of a workload; (iii) it uses essentially the same primitives as CLOCK, and, hence, is low-complexity and amenable to a high-concurrency implementation; and (iv) it outperforms CLOCK across a wide range of cache sizes and workloads. The algorithm CAR is inspired by the Adaptive Replacement Cache (ARC) algorithm, and inherits virtually all advantages of ARC including its high performance, but does not serialize cache hits behind a single global lock. As our second contribution, we introduce another novel algorithm, namely, CAR with Temporal filtering (CART), that has all the advantages of CAR, but, in addition, uses a certain temporal filter to distill pages with long-term utility from those with only short-term utility.

Proceedings ArticleDOI
14 Feb 2004
TL;DR: This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy and investigates the effects of four storage cache write policies on disk energy consumption.
Abstract: Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady’s off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.

Proceedings ArticleDOI
07 Oct 2004
TL;DR: Compared with existing methods based on program code and execution intervals, locality phase prediction is unique because it uses locality profiles, and it marks phase boundaries in program code.
Abstract: As computer memory hierarchy becomes adaptive, its performance increasingly depends on forecasting the dynamic program locality. This paper presents a method that predicts the locality phases of a program by a combination of locality profiling and run-time prediction. By profiling a training input, it identifies locality phases by sifting through all accesses to all data elements using variable-distance sampling, wavelet filtering, and optimal phase partitioning. It then constructs a phase hierarchy through grammar compression. Finally, it inserts phase markers into the program using binary rewriting. When the instrumented program runs, it uses the first few executions of a phase to predict all its later executions. Compared with existing methods based on program code and execution intervals, locality phase prediction is unique because it uses locality profiles, and it marks phase boundaries in program code. The second half of the paper presents a comprehensive evaluation. It measures the accuracy and the coverage of the new technique and compares it with best known run-time methods. It measures its benefit in adaptive cache resizing and memory remapping. Finally, it compares the automatic analysis with manual phase marking. The results show that locality phase prediction is well suited for identifying large, recurring phases in complex programs.

01 Jan 2004
TL;DR: This work proposes and evaluates a simple significance-based compression scheme that has a low compression and decompression overhead and provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
Abstract: With the widening gap between processor and memory speeds, memory system designers may find cache compression beneficial to increase cache capacity and reduce off-chip bandwidth. Most hardware compression algorithms fall into the dictionary-based category, which depend on building a dictionary and using its entries to encode repeated data values. Such algorithms are effective in compressing large data blocks and files. Cache lines, however, are typically short (32-256 bytes), and a per-line dictionary places a significant overhead that limits the compressibility and increases decompression latency of such algorithms. For such short lines, significance-based compression is an appealing alternative. We propose and evaluate a simple significance-based compression scheme that has a low compression and decompression overhead. This scheme, Frequent Pattern Compression (FPC) compresses individual cache lines on a word-by-word basis by storing common word patterns in a compressed format accompanied with an appropriate prefix. For a 64-byte cache line, compression can be completed in three cycles and decompression in five cycles, assuming 12 FO4 gate delays per cycle. We propose a compressed cache design in which data is stored in a compressed form in the L2 caches, but are uncompressed in the L1 caches. L2 cache lines are compressed to predetermined sizes that never exceed their original size to reduce decompression overhead. This simple scheme provides comparable compression ratios to more complex schemes that have higher cache hit latencies.
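
A word-level sketch in the spirit of FPC: each 32-bit word is matched against a small set of significance patterns and stored as a short prefix plus only the significant bits. The pattern table below paraphrases the scheme (zero runs are reduced to a single-zero pattern) and the bit counts are illustrative.

# Word-level significance patterns in the spirit of FPC (simplified; zero runs are
# treated as a single-zero pattern and prefix codes are not emitted explicitly).

def sign_extends(w, bits):
    """True if the 32-bit word is just a sign-extended `bits`-bit value."""
    v = w - (1 << 32) if w & 0x80000000 else w
    return -(1 << (bits - 1)) <= v < (1 << (bits - 1))

def half_sign_extends(h):
    """True if the 16-bit halfword is a sign-extended byte."""
    return (h & 0xFF80) in (0x0000, 0xFF80)

def encode_word(w):
    """Return (pattern_name, payload_bits) for one 32-bit word."""
    w &= 0xFFFFFFFF
    lo, hi = w & 0xFFFF, w >> 16
    if w == 0:                                      return ("zero", 0)
    if sign_extends(w, 4):                          return ("4-bit sign-extended", 4)
    if sign_extends(w, 8):                          return ("one byte sign-extended", 8)
    if sign_extends(w, 16):                         return ("halfword sign-extended", 16)
    if lo == 0:                                     return ("halfword padded with zeros", 16)
    if half_sign_extends(lo) and half_sign_extends(hi):
        return ("two halfwords, each a sign-extended byte", 16)
    if len({(w >> s) & 0xFF for s in (0, 8, 16, 24)}) == 1:
        return ("repeated bytes", 8)
    return ("uncompressed", 32)

line = [0x00000000, 0x00000001, 0xFFFFFFFF, 0x00400000, 0x12345678]
total = sum(3 + bits for _, bits in map(encode_word, line))   # 3-bit prefix per word
print(total, "bits vs", 32 * len(line), "uncompressed")       # 71 bits vs 160 uncompressed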

Proceedings Article
01 Jan 2004
TL;DR: A model-counting program that combines component caching with clause learning, one of the most important ideas used in modern SAT solvers, and provides significant evidence that it can outperform existing algorithms for #SAT by orders of magnitude.
Abstract: While there has been very substantial progress in practical algorithms for satisfiability, there are many related logical problems where satisfiability alone is not enough. One particularly useful extension to satisfiability is the associated counting problem, #SAT, which requires computing the number of assignments that satisfy the input formula. #SAT’s practical importance stems in part from its very close relationship to the problem of general Bayesian inference. #SAT seems to be more computationally difficult than SAT since an algorithm for SAT can stop once it has found a single satisfying assignment, whereas #SAT requires finding all such assignments. In fact, #SAT is complete for the class #P which is at least as hard as the polynomial-time hierarchy [10]. Not only is #SAT intrinsically important, it is also an excellent test-bed for algorithmic ideas in propositional reasoning. One of these new ideas is formula caching [7, 1, 5] which seems particularly promising when performed in the form called component caching [1, 2]. In component caching, disjoint components of the formula, generated dynamically during a DPLL search, are cached so that they only have to be solved once. While formula caching in general may have theoretical value even in SAT solvers [5], component caching seems to hold great promise for the practical improvement of #SAT algorithms (and Bayes inference) where there is more of a chance to reuse cached results. In particular, Bacchus, Dalmao, and Pitassi [1] discuss three different caching schemes: simple caching, component caching, and linear-space caching and show that component caching is theoretically competitive with the best of current methods for Bayesian inference (and substantially better in some instances). It has not been clear, however, whether component caching can be as competitive in practice as it is theoretically. We provide significant evidence that it can, demonstrating that on many instances it can outperform existing algorithms for #SAT by orders of magnitude. The key to this success is carefully incorporating component caching with clause learning, one of the most important ideas used in modern SAT solvers. Although both component caching and clause learning involve recording information collected during search, the nature and use of the recorded information is radically different. In clause learning, a clause that captures the reason for failure is computed from every failed search path. Component caching, on the other hand, stores the result computed when solving a subproblem. When that subproblem is encountered again its value can be retrieved from the cache rather than having to solve it again. It is not immediately obvious how to maintain correctness as well as obtain the best performance from a combination of these techniques. In this paper we show how this combination can be achieved so as to obtain the performance improvements just mentioned. Our model-counting program is built on the ZChaff SAT solver [8, 11]. ZChaff already implements clause learning, and we have added new modules and modified many others to support #SAT and to integrate component caching with clause learning. Ours is the first implementation we are aware of that is able to benefit from both component caching and clause learning. We have tested our program against the relsat [4, 3] system, which also performs component analysis, but does not cache the computed values of these components. 
In most instances of both random and structured problems our new solver is significantly faster than relsat, often by up to several orders of magnitude. We begin by reviewing DPLL with caching for #SAT [1], and DPLL with learning for SAT. We then outline a basic approach for efficiently integrating component caching and clause learning. With this basic
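
To make the component-caching idea concrete, here is a toy model counter (clause learning and unit propagation omitted, so this is nothing like the ZChaff-based solver described above): disjoint components share no variables, so their model counts multiply, and each simplified component is cached so it is only ever solved once.

# Toy #SAT counter illustrating component caching. A formula is a frozenset of
# clauses; a clause is a frozenset of integer literals (v or -v).

from math import prod

cache = {}

def variables(clauses):
    return {abs(l) for c in clauses for l in c}

def assign(clauses, lit):
    """Drop satisfied clauses, remove the falsified literal from the rest."""
    return frozenset(c - {-lit} for c in clauses if lit not in c)

def components(clauses):
    """Partition the clauses into groups that share no variables."""
    groups = []
    for c in clauses:
        touching = [g for g in groups if variables(g) & variables({c})]
        merged = frozenset(c2 for g in touching for c2 in g) | {c}
        groups = [g for g in groups if g not in touching] + [frozenset(merged)]
    return groups

def count(clauses):
    """Number of satisfying assignments over the variables appearing in `clauses`."""
    if frozenset() in clauses:
        return 0
    if not clauses:
        return 1
    if clauses in cache:                     # component cache hit: solved once, reused
        return cache[clauses]
    comps = components(clauses)
    if len(comps) > 1:
        result = prod(count(comp) for comp in comps)
    else:
        v = next(iter(variables(clauses)))
        result = 0
        for lit in (v, -v):
            sub = assign(clauses, lit)
            freed = len(variables(clauses)) - len(variables(sub)) - 1
            result += (2 ** freed) * count(sub)   # freed variables are unconstrained
    cache[clauses] = result
    return result

# (x1 or x2) and (x3 or x4): two independent components, 3 * 3 = 9 models.
f = frozenset({frozenset({1, 2}), frozenset({3, 4})})
print(count(f))   # 9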

Proceedings ArticleDOI
15 Apr 2004
TL;DR: Three new meta algorithms are proposed and compared against the de facto one and a recently proposed one by means of synthetic and trace-driven simulations, leading to improved performance under most simulated scenarios, especially under a low availability of storage.
Abstract: Large scale hierarchical caches for Web content have been deployed widely in an attempt to reduce delivery delays and bandwidth consumption and also to improve the scalability of content dissemination through the World Wide Web. Irrespective of the specific replacement algorithm employed in each cache, a de facto characteristic of contemporary hierarchical caches is that a hit for a document at an l-level cache leads to the caching of the document in all intermediate caches (levels l-1,..., 1) on the path towards the leaf cache that received the initial request. This paper presents various algorithms that revise this standard behavior and attempt to be more selective in choosing the caches that get to store a local copy of the requested document. As these algorithms operate independently of the actual replacement algorithm running in each individual cache, they are referred to as meta algorithms. Three new meta algorithms are proposed and compared against the de facto one and a recently proposed one by means of synthetic and trace-driven simulations. The best of the new meta algorithms appears to lead to improved performance under most simulated scenarios, especially under a low availability of storage. The latter observation makes the presented meta algorithms particularly favorable for the handling of large data objects such as stored music files or short video clips. Additionally, a simple load balancing algorithm that is based on the concept of meta algorithms is proposed and evaluated. The algorithm is shown to be able to provide for an effective balancing of load thus possibly addressing the recently discovered "filtering-effect" in hierarchical Web caches.
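
The contrast between the de facto behavior and a more selective meta algorithm can be sketched as follows; the "copy one level down" rule shown is only in the spirit of the paper's meta algorithms, not one of its exact proposals.

# Hedged sketch of a meta algorithm layered on top of per-cache replacement:
# the de facto behaviour copies a hit object into every cache between the hit
# level and the leaf, while a selective policy copies it only one level down.
# caches[0] is the leaf; higher indices are closer to the origin server.

def fetch(caches, obj, meta="selective"):
    hit_level = next((i for i, c in enumerate(caches) if obj in c), len(caches))
    if meta == "everywhere":                      # de facto behaviour
        targets = range(0, hit_level)
    else:                                         # copy only into the next cache on the path
        targets = [hit_level - 1] if hit_level > 0 else []
    for i in targets:
        caches[i].add(obj)                        # a real cache would also run replacement here
    return "hit" if hit_level < len(caches) else "miss (origin server)"

hierarchy = [set(), set(), {"video.mp4"}]         # leaf, intermediate, top-level cache
print(fetch(hierarchy, "video.mp4"))              # hit; copied into hierarchy[1] only
print(hierarchy)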

Patent
22 Jul 2004
TL;DR: In this paper, the authors propose a method of updating a cache in an integrated circuit comprising: the cache; a processor connected to the cache via a cache bus; a memory interface connected to the cache via a first bus and to the processor via a second bus, the first bus being wider than the second bus or the cache bus.
Abstract: A method of updating a cache in an integrated circuit comprising: the cache; a processor connected to the cache via a cache bus; a memory interface connected to the cache via a first bus and to the processor via a second bus, the first bus being wider than the second bus or the cache bus; and memory connected to the memory interface via a memory bus; the method comprising the steps of: (a) following a cache miss, using the processor to issue a request for first data via a first address, the first data being that associated with the cache miss; (b) in response to the request, using the memory interface to fetch the first data from the memory, and sending the first data to the processor; (c) sending, from the memory interface and via the first bus, the first data and additional data, the additional data being that stored in the memory adjacent the first data; (d) updating the cache with the first data and the additional data via the first bus; and (e) updating flags in the cache associated with the first data and the additional data, such that the updated first data and additional data in the cache is valid

Patent
21 May 2004
TL;DR: In this paper, a configuration management system creates ( 602 ) each configuration by assigning a configuration identifier to each configuration, and then tracks ( 604 ) changes to files of the configuration by storing information associating each new file version with the configuration identifier.
Abstract: A configuration management system creates ( 602 ) each configuration by assigning a configuration identifier to each configuration. In addition, relational information is computed ( 706 ) that indicates the relationships between the configuration and any configurations upon which it is based. The system then tracks ( 604 ) changes to files of the configuration by storing information associating each new file version with the configuration identifier. The system also tracks ( 1210 ) changes to file properties. A configuration is then reconstructed ( 608 ) as of a desired date, by identifying ( 2104, 2106 ) the file versions and properties associated with that configuration as of the desired date. A determination is made ( 2110 ) whether a user that has requested the file versions has access privileges by first checking a security cache ( 2600 ) for the user privileges information. If the information is not on the cache, it is computed from a security table ( 2800 ) and stored on the cache. The system automatically compresses ( 3118 ) and reconstitutes ( 3006 ) file versions that are stored in the version store.

Proceedings Article
27 Jun 2004
TL;DR: The design of the CoDeeN Content Distribution Network is discussed, focusing on the reliability and security mechanisms that have kept the service in operation, and whether future services, especially peer-to-peer systems, will require similar mechanisms.
Abstract: With the advent of large-scale, wide-area networking testbeds, researchers can deploy long-running distributed services that interact with other resources on the Web. The CoDeeN Content Distribution Network, deployed on PlanetLab, uses a network of caching Web proxy servers to intelligently distribute and cache requests from a potentially large client population. We have been running this system nearly continuously since June 2003, allowing open access from any client in the world. In that time, it has become the most heavily-used long-running service on PlanetLab, handling over four million accesses per day. In this paper, we discuss the design of our system, focusing on the reliability and security mechanisms that have kept the service in operation. Our reliability mechanisms assess node health, preventing failing nodes from disrupting the operation of the overall system. Our security mechanisms protect nodes from being exploited and from being implicated in malicious activities, problems that commonly plague other open proxies. We believe that future services, especially peer-to-peer systems, will require similar mechanisms as more services are deployed on non-dedicated distributed systems, and as their interaction with existing protocols and systems increases. Our experiences with CoDeeN and our data on its availability should serve as an important starting point for designers of future systems.

Proceedings ArticleDOI
24 May 2004
TL;DR: An analytical framework to investigate the behavior of the communication links of a node in a random mobility environment is developed and an efficient updating strategy for proactive routing protocols based on the derived statistics is designed.
Abstract: In this work, we develop an analytical framework to investigate the behavior of the communication links of a node in a random mobility environment. Analytical expressions characterizing various properties related to the formation, lifetime and expiration of links are derived. The derived framework can be used to design efficient algorithms for medium access, routing and transport control, or to analyze and optimize the performance of existing network protocols. A number of applications of the characteristics investigated, such as selection of stable routes, route cache lifetime optimization, providing Quality-of-Service (QoS) data communication and analysis of route lifetime are discussed. In particular, we focus on designing an efficient updating strategy for proactive routing protocols based on the derived statistics. Using simulations, we show that the proposed strategy can lead to significant performance improvements in terms of reduction in routing overhead, while maintaining high data packet delivery ratio and acceptable latency.

Book
01 Jan 2004
TL;DR: This presentation discusses the design philosophy, implementation, and practicality of the ARM MMU, as well as some of the techniques used to develop and demonstrate an MPU system.
Abstract: Table of Contents: 1. ARM Embedded Systems 1.1 The RISC Design Philosophy 1.2 The ARM Design Philosophy 1.3 Embedded System Hardware 1.4 Embedded System Software 1.5 Summary 2 ARM Processor Fundamentals 2.1 Registers 2.2 Current Program Status Register 2.3 Pipeline 2.4 Exceptions, Interrupts, and the Vector Table 2.5 Core Extensions 2.6 Architecture Revisions 2.7 ARM Processor Families 2.8 Summary 3 Introduction to the ARM Instruction Set 3.1 Data Processing Instructions 3.2 Branch Instructions 3.3 Load-Store Instructions 3.4 Software Interrupt Instruction 3.5 Program Status Register Instructions 3.6 Loading Constants 3.7 ARMv5E Extensions 3.8 Conditional Execution 3.9 Summary 4 Introduction to the Thumb Instruction Set 4.1 Thumb Register Usage 4.2 ARM-Thumb Interworking 4.3 Other Branch Instructions 4.4 Data Processing Instructions 4.5 Single-Register Load-Store Instructions 4.6 Multiple-Register Load-Store Instructions 4.7 Stack Instructions 4.8 Software Interrupt Instruction 4.9 Summary 5 Efficient C Programming 5.1 Overview of C Compilers and Optimization 5.2 Basic C Data Types 5.3 C Looping Structures 5.4 Register Allocation 5.5 Function Calls 5.6 Pointer Aliasing 5.7 Structure Arrangement 5.8 Bit-fields 5.9 Unaligned Data and Endianness 5.10 Division 5.11 Floating Point 5.12 Inline Functions and Inline Assembly 5.13 Portability Issues 5.14 Summary 6 Writing and Optimizing ARM Assembly Code 6.1 Writing Assembly Code 6.2 Profiling and Cycle Counting 6.3 Instruction Scheduling 6.4 Register Allocation 6.5 Conditional Execution 6.6 Looping Constructs 6.7 Bit Manipulation 6.8 Efficient Switches 6.9 Handling Unaligned Data 6.10 Summary 7 Optimized Primitives 7.1 Double-Precision Integer Multiplication 7.2 Integer Normalization and Count Leading Zeros 7.3 Division 7.4 Square Roots 7.5 Transcendental Functions: log, exp, sin, cos 7.6 Endian Reversal and Bit Operations 7.7 Saturated and Rounded Arithmetic 7.8 Random Number Generation 7.9 Summary 8 Digital Signal Processing 8.1 Representing a Digital Signal 8.2 Introduction to DSP on the ARM 8.3 FIR filters 8.4 IIR Filters 8.5 The Discrete Fourier Transform 8.6 Summary 9 Exception and Interrupt Handling 9.1 Exception Handling 9.2 Interrupts 9.3 Interrupt Handling Schemes 9.4 Summary 10 Firmware 10.1 Firmware and Bootloader 10.2 Example: Sandstone 10.3 Summary 11 Embedded Operating Systems 11.1 Fundamental Components 11.2 Example: Simple Little Operating System 11.3 Summary 12 Caches 12.1 The Memory Hierarchy and Cache Memory 12.2 Cache Architecture 12.3 Cache Policy 12.4 Coprocessor 15 and Caches 12.5 Flushing and Cleaning Cache Memory 12.6 Cache Lockdown 12.7 Caches and Software Performance 12.8 Summary 13 Memory Protection Units 13.1 Protected Regions 13.2 Initializing the MPU, Caches, and Write Buffer 13.3 Demonstration of an MPU system 13.4 Summary 14 Memory Management Units 14.1 Moving from an MPU to an MMU 14.2 How Virtual Memory Works 14.3 Details of the ARM MMU 14.4 Page Tables 14.5 The Translation Lookaside Buffer 14.6 Domains and Memory Access Permission 14.7 The Caches and Write Buffer 14.8 Coprocessor 15 and MMU Configuration 14.9 The Fast Context Switch Extension 14.10 Demonstration: A Small Virtual Memory System 14.11 The Demonstration as mmuSLOS 14.12 Summary 15 The Future of the Architecture by John Rayfield 15.1 Advanced DSP and SIMD Support in ARMv6 15.2 System and Multiprocessor Support Additions to ARMv6 15.3 ARMv6 Implementations 15.4 Future Technologies beyond ARMv6 15.5 Conclusions Appendix A: ARM and Thumb Assembler Instructions Appendix B: ARM and Thumb Instruction Encodings Appendix C: Processors and Architecture Appendix D: Instruction Cycle Timings Appendix E: Suggested Reading Index

Proceedings ArticleDOI
10 Mar 2004
TL;DR: This paper presents StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads, based on a probabilistic model of the cache, rather than a functional cache simulator.
Abstract: The widening memory gap reduces performance of applications with poor data locality. Therefore, there is a need for methods to analyze data locality and help application optimization. In this paper we present StatCache, a novel sampling-based method for performing data-locality analysis on realistic workloads. StatCache is based on a probabilistic model of the cache, rather than a functional cache simulator. It uses statistics from a single run to accurately estimate miss ratios of fully-associative caches of arbitrary sizes and generate working-set graphs. We evaluate StatCache using the SPEC CPU2000 benchmarks and show that StatCache gives accurate results with a sampling rate as low as 10^-4. We also provide a proof-of-concept implementation, and discuss potentially very fast implementation alternatives.
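
The core of a probabilistic cache model of this kind is a fixed-point relation between the overall miss ratio and the sampled reuse distances. The sketch below uses a commonly cited form of that relation for random replacement and should be read as an approximation in the spirit of StatCache, not a transcription of the paper's model.

# Approximate idea: with random replacement in a fully associative cache of L
# lines, a block reused after d intervening references survives roughly d*m
# replacements (m = miss ratio), each sparing it with probability (1 - 1/L).
# Solving the resulting fixed point gives a miss-ratio estimate per cache size
# from one set of sampled reuse distances. All constants here are illustrative.

def miss_ratio(reuse_distances, lines, iters=200):
    m = 0.5                                         # initial guess
    for _ in range(iters):
        m = sum(1.0 - (1.0 - 1.0 / lines) ** (d * m)
                for d in reuse_distances) / len(reuse_distances)
    return m

# Sampled reuse distances (in memory references) from a single profiling run can
# be replayed against arbitrary cache sizes without re-simulating each one.
samples = [10, 50, 200, 1000, 5000, 20000, 100000]
for lines in (512, 4096, 32768):
    print(lines, round(miss_ratio(samples, lines), 3))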

Proceedings ArticleDOI
Srikanth T. Srinivasan, Ravi Rajwar, Haitham Akkary, Amit Gandhi, Mike Upton
07 Oct 2004
TL;DR: Continual Flow Pipelines (CFP) is presented as a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large.
Abstract: Increased integration in the form of multiple processor cores on a single die, relatively constant die sizes, shrinking power envelopes, and emerging applications create a new challenge for processor architects. How to build a processor that provides high single-thread performance and enables multiple of these to be placed on the same die for high throughput while dynamically adapting for future applications? Conventional approaches for high single-thread performance rely on large and complex cores to sustain a large instruction window for memory tolerance, making them unsuitable for multi-core chips. We present Continual Flow Pipelines (CFP) as a new non-blocking processor pipeline architecture that achieves the performance of a large instruction window without requiring cycle-critical structures such as the scheduler and register file to be large. We show that to achieve benefits of a large instruction window, inefficiencies in management of both the scheduler and register file must be addressed, and we propose a unified solution. The non-blocking property of CFP keeps key processor structures affecting cycle time and power (scheduler, register file), and die size (second level cache) small. The memory latency-tolerant CFP core allows multiple cores on a single die while outperforming current processor cores for single-thread applications.

Patent
28 Jun 2004
TL;DR: In this article, a system and method for filtering of web-based content in a proxy cache server environment provides a local network having a client, a directory server and a proxy caching server that caches predetermined Internet-derived web content within the network.
Abstract: A system and method for filtering of web-based content in a proxy cache server environment provides a local network having a client, a directory server and a proxy cache server that caches predetermined Internet-derived web content within the network. When content is requested, it is vended to the client only if it meets predefined user policies for acceptability. These policies are implemented based upon one or more ratings lists provided by content rating vendors. The lists are downloaded to the network in whole or part, and cached for use in determining acceptability of content by a filter application. Ratings can be particularly based upon predetermined content categories. Caching occurs in a host or object cache for rapid access. Only if current ratings are not found in the host or object caches are ratings caches or vendors accessed for ratings. Ratings on requested content are then placed in the host or object cache for subsequent use. Object parsing or other techniques can be used to screen returned content that is unrated or otherwise allowed to pass to ensure that it is appropriate.

Journal ArticleDOI
TL;DR: The results show that data and instruction caches require different control strategies for efficient execution, and a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode is proposed.
Abstract: On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.
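
A simple periodic drowsy policy can be sketched as follows: every fixed window of cycles all lines are dropped into the state-preserving drowsy mode, and touching a drowsy line pays a small wake-up latency before it can be used. The window size, wake-up cost, and access pattern below are illustrative assumptions, not the paper's figures.

WINDOW = 4000        # cycles between "put everything drowsy" sweeps (assumed)
WAKE_PENALTY = 1     # extra cycles to restore a drowsy line's voltage (assumed)

class DrowsyCache:
    def __init__(self, num_lines):
        self.awake = [False] * num_lines
        self.extra_cycles = 0
        self.drowsy_line_cycles = 0      # rough proxy for leakage savings
        self.cycle = 0

    def tick(self, accessed_line=None):
        if self.cycle % WINDOW == 0:
            self.awake = [False] * len(self.awake)       # periodic sweep into drowsy mode
        self.drowsy_line_cycles += self.awake.count(False)
        if accessed_line is not None and not self.awake[accessed_line]:
            self.extra_cycles += WAKE_PENALTY            # wake-up before the access completes
            self.awake[accessed_line] = True
        self.cycle += 1

cache = DrowsyCache(num_lines=512)
for c in range(20000):
    cache.tick(accessed_line=(c % 16) if c % 3 == 0 else None)  # small hot working set
print(cache.extra_cycles, "wake-ups;",
      round(cache.drowsy_line_cycles / (512 * cache.cycle), 2), "of line-cycles drowsy")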

Book ChapterDOI
26 Feb 2004
TL;DR: Unlike previous measurement studies of Kazaa and Gnutella that passively recorded peer requests, this paper actively probes peers to get their cache contents information, which provides a map of contents that is used to evaluate the degree of clustering in the system.
Abstract: Peer-to-peer file sharing systems now generate a significant portion of Internet traffic. A good understanding of their workloads is crucial in order to improve their scalability, robustness and performance. Previous measurement studies on Kazaa and Gnutella were based on monitoring peer requests, and mostly concerned with peer and file availability and network traffic. In this paper, we take different measurements: instead of passively recording requests, we actively probe peers to get their cache contents information. This provides us with a map of contents, that we use to evaluate the degree of clustering in the system, and that could be exploited to improve significantly the search process.

Patent
18 Feb 2004
TL;DR: In this article, a data playback device records a plurality of playback start indexes, each playback start index being information regarding a playback position that is determined according to user input which is recorded when the user input is in a prescribed pattern.
Abstract: A hierarchical memory scheme capable of improving a hit rate for the segment containing the random access point rather than improving the overall hit rate of the cache, and a data playback scheme capable of automatically detecting positions that are potentially used as playback start indexes by the user and attaching indexes, are disclosed. The hierarchical storage device stores random access point segment information from which a possibility for each segment to contain a point that can potentially be random accessed in future can be estimated, and controls a selection of the selected segments to be stored in the cache storage device according to the random access point segment information. The data playback device records a plurality of playback start indexes, each playback start index being information regarding a playback position that is determined according to the user input which is recorded when the user input is in a prescribed pattern, and presents the plurality of playback start indexes to a user so as to urge the user to select a desired playback position.