
Showing papers on "Cache published in 2000"


Journal ArticleDOI
TL;DR: This paper demonstrates the benefits of cache sharing, measures the overhead of the existing protocols, and proposes a new protocol called "summary cache", which reduces the number of intercache protocol messages, reduces the bandwidth consumption, and eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP.
Abstract: The sharing of caches among Web proxies is an important technique to reduce Web traffic and alleviate network bottlenecks. Nevertheless it is not widely deployed due to the overhead of existing protocols. In this paper we demonstrate the benefits of cache sharing, measure the overhead of the existing protocols, and propose a new protocol called "summary cache". In this new protocol, each proxy keeps a summary of the cache directory of each participating proxy, and checks these summaries for potential hits before sending any queries. Two factors contribute to our protocol's low overhead: the summaries are updated only periodically, and the directory representations are very economical, as low as 8 bits per entry. Using trace-driven simulations and a prototype implementation, we show that, compared to existing protocols such as the Internet cache protocol (ICP), summary cache reduces the number of intercache protocol messages by a factor of 25 to 60, reduces the bandwidth consumption by over 50%, eliminates 30% to 95% of the protocol CPU overhead, all while maintaining almost the same cache hit ratios as ICP. Hence summary cache scales to a large number of proxies. (This paper is a revision of Fan et al. 1998; we add more data and analysis in this version.).
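The compact per-proxy directory summaries described above are Bloom-filter-style bit vectors: a proxy checks every peer's summary locally before sending any inter-cache query. As a rough illustration of the idea (the bit-vector size and hash construction below are arbitrary choices, not the paper's parameters):

```python
import hashlib

class CacheSummary:
    """Bloom-filter-style summary of a proxy's cache directory (sketch only;
    the bit count and number of hashes here are arbitrary, not the paper's)."""

    def __init__(self, num_bits=8 * 1024, num_hashes=4):
        self.m, self.k = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, url):
        digest = hashlib.sha1(url.encode()).digest()
        for i in range(self.k):
            # Derive k bit positions from disjoint slices of one digest.
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, url):
        # May report false positives; never false negatives (until the
        # summary goes stale between periodic updates).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(url))

# A proxy would query every peer summary locally before sending any ICP-style
# message: if no summary matches, it goes straight to the origin server.
```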

2,174 citations


Proceedings ArticleDOI
01 Aug 2000
TL;DR: Results indicate that gated-Vdd together with a novel resizable cache architecture reduces energy-delay by 62% with minimal impact on performance.
Abstract: Deep-submicron CMOS designs have resulted in large leakage energy dissipation in microprocessors. While SRAM cells in on-chip cache memories always contribute to this leakage, there is a large variability in active cell usage both within and across applications. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy dissipation in instruction caches. We propose gated-Vdd, a circuit-level technique to gate the supply voltage and reduce leakage in unused SRAM cells. Our results indicate that gated-Vdd together with a novel resizable cache architecture reduces energy-delay by 62% with minimal impact on performance.

731 citations


Proceedings ArticleDOI
01 May 2000
TL;DR: The concept of the sphere of replication is introduced, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor, and two mechanisms, slack fetch and branch outcome queue, are proposed and evaluated that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread.
Abstract: Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations. We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor—derived from a Simultaneous Multithreaded (SMT) processor—provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order. This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms—slack fetch and branch outcome queue—that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread. Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%.

672 citations


Proceedings ArticleDOI
01 May 2000
TL;DR: The Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip, is described.
Abstract: The microprocessor industry is currently struggling with higher development costs and longer design times that arise from exceedingly complex processors that are pushing the limits of instruction-level parallelism. Meanwhile, such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. The abundance of explicit thread-level parallelism in commercial workloads, along with advances in semiconductor integration density, identifies chip multiprocessing (CMP) as potentially the most promising approach for designing processors targeted at commercial servers. This paper describes the Piranha system, a research prototype being developed at Compaq that aggressively exploits chip multiprocessing by integrating eight simple Alpha processor cores along with a two-level cache hierarchy onto a single chip. Piranha also integrates further on-chip functionality to allow for scalable multiprocessor configurations to be built in a glueless and modular fashion. The use of simple processor cores combined with an industry-standard ASIC design methodology allows us to complete our prototype within a short time-frame, with a team size and investment that are an order of magnitude smaller than that of a commercial microprocessor. Our detailed simulation results show that while each Piranha processor core is substantially slower than an aggressive next-generation processor, the integration of eight cores onto a single chip allows Piranha to outperform next-generation processors by up to 2.9 times (on a per chip basis) on important workloads such as OLTP. This performance advantage can approach a factor of five by using full-custom instead of ASIC logic. In addition to exploiting chip multiprocessing, the Piranha prototype incorporates several other unique design choices including a shared second-level cache with no inclusion, a highly optimized cache coherence protocol, and a novel I/O architecture.

582 citations


Proceedings ArticleDOI
01 Jun 2000
TL;DR: This paper presents the use of SimplePower to evaluate the impact of a new selective gated pipeline register optimization, a high-level data transformation, and a power-conscious post-compilation optimization on the datapath, memory, and on-chip bus energy, respectively.
Abstract: In this paper, we present the design and use of a comprehensive framework, SimplePower, for evaluating the effect of high-level algorithmic, architectural, and compilation trade-offs on energy. An execution-driven, cycle-accurate RT-level energy estimation tool that uses transition sensitive energy models forms the cornerstone of this framework. SimplePower also provides the energy consumed in the memory system and on-chip buses using analytical energy models. We present the use of SimplePower to evaluate the impact of a new selective gated pipeline register optimization, a high-level data transformation and a power-conscious post compilation optimization (register relabeling) on the datapath, memory and on-chip bus energy, respectively. We find that these three optimizations reduce the energy by 18-36% in the datapath, 62% in the memory system and 12% in the instruction cache data bus, respectively.

495 citations


Book
01 Dec 2000
TL;DR: This is the authoritative reference guide to the ARM RISC architecture and contains detailed information about all versions of the ARM and Thumb instruction sets, the memory management and cache functions, as well as optimized code examples.
Abstract: From the Publisher: This is the authoritative reference guide to the ARM RISC architecture. Produced by the architects who are actively working on the ARM specification, the book contains detailed information about all versions of the ARM and Thumb instruction sets, the memory management and cache functions, as well as optimized code examples.

470 citations


Proceedings ArticleDOI
01 Dec 2000
TL;DR: This paper proposes a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis and demonstrates that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.
Abstract: Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 μm technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-0.1 μm technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional L1 results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

425 citations


Journal ArticleDOI
TL;DR: There is a surprising consistency over time in the relative amount of web traffic from the server along a path, lending stability to the TERC location solution, and these techniques can be used by network providers to reduce traffic load in their network.
Abstract: This paper studies the problem of where to place network caches. Emphasis is given to caches that are transparent to the clients since they are easier to manage and they require no cooperation from the clients. Our goal is to minimize the overall flow or the average delay by placing a given number of caches in the network. We formulate these location problems both for general caches and for transparent en-route caches (TERCs), and identify that, in general, they are intractable. We give optimal algorithms for line and ring networks, and present closed form formulae for some special cases. We also present a computationally efficient dynamic programming algorithm for the single server case. This last case is of particular practical interest. It models a network that wishes to minimize the average access delay for a single web server. We experimentally study the effects of our algorithm using real web server data. We observe that a small number of TERCs are sufficient to reduce the network traffic significantly. Furthermore, there is a surprising consistency over time in the relative amount of web traffic from the server along a path, lending a stability to our TERC location solution. Our techniques can be used by network providers to reduce traffic load in their network.
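To make the single-server dynamic program concrete, here is a heavily simplified sketch for a line network: the server sits at node 0, client i sends requests along the path i, i-1, ..., 0, and every request is answered by the nearest en-route cache. Hit ratios and general flow costs, which the paper's formulation handles, are ignored; the function name and the rate-times-hop-count cost model are illustrative assumptions.

```python
def place_caches_on_line(rates, k):
    """Choose k en-route cache positions 1..n on a line; server at node 0.

    Simplified sketch: client i issues rates[i-1] requests toward the server
    and is served by the nearest cache at a position <= i (else the server);
    cost = request rate * hop distance. Not the paper's full model.
    """
    n = len(rates)
    k = min(k, n)
    if k == 0:
        return [], sum(r * i for i, r in enumerate(rates, start=1))

    def seg(src, lo, hi):
        # Cost of clients lo..hi all served from position src (src <= lo).
        return sum(rates[i - 1] * (i - src) for i in range(lo, hi + 1))

    INF = float("inf")
    # dp[j][p]: min cost of clients 1..p with j caches placed, the last at p.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    prev = [[0] * (n + 1) for _ in range(k + 1)]
    for p in range(1, n + 1):
        dp[1][p] = seg(0, 1, p - 1)          # clients before p go to the server
    for j in range(2, k + 1):
        for p in range(j, n + 1):
            for q in range(j - 1, p):        # position of the previous cache
                cand = dp[j - 1][q] + seg(q, q + 1, p - 1)
                if cand < dp[j][p]:
                    dp[j][p], prev[j][p] = cand, q
    # Clients beyond the right-most cache are served by it.
    total, p = min((dp[k][p] + seg(p, p + 1, n), p) for p in range(k, n + 1))
    positions = []
    for j in range(k, 0, -1):
        positions.append(p)
        p = prev[j][p]
    return sorted(positions), total
```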

400 citations


Journal ArticleDOI
16 May 2000
TL;DR: A new indexing technique called CSB+-Trees is proposed that stores all the child nodes of any given node contiguously and keeps only the address of the first child in each node; two variants of CSB+-Trees are also introduced, one that reduces the copying cost when there is a split and one that preallocates space for the full node group to reduce the split cost.
Abstract: Previous research has shown that cache behavior is important for main memory index structures. Cache conscious index structures such as Cache Sensitive Search Trees (CSS-Trees) perform lookups much faster than binary search and T-Trees. However, CSS-Trees are designed for decision support workloads with relatively static data. Although B+-Trees are more cache conscious than binary search and T-Trees, their utilization of a cache line is low since half of the space is used to store child pointers. Nevertheless, for applications that require incremental updates, traditional B+-Trees perform well. Our goal is to make B+-Trees as cache conscious as CSS-Trees without increasing their update cost too much. We propose a new indexing technique called “Cache Sensitive B+-Trees” (CSB+-Trees). It is a variant of B+-Trees that stores all the child nodes of any given node contiguously, and keeps only the address of the first child in each node. The rest of the children can be found by adding an offset to that address. Since only one child pointer is stored explicitly, the utilization of a cache line is high. CSB+-Trees support incremental updates in a way similar to B+-Trees. We also introduce two variants of CSB+-Trees. Segmented CSB+-Trees divide the child nodes into segments. Nodes within the same segment are stored contiguously and only pointers to the beginning of each segment are stored explicitly in each node. Segmented CSB+-Trees can reduce the copying cost when there is a split since only one segment needs to be moved. Full CSB+-Trees preallocate space for the full node group and thus reduce the split cost. Our performance studies show that CSB+-Trees are useful for a wide range of applications.
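The core trick (one explicit child pointer per node, siblings laid out contiguously, other children reached by offset) can be sketched in a few lines; the node layout below is an illustrative in-memory analogue, not the authors' cache-line-sized C structures:

```python
import bisect

class CSBNode:
    """Node of a CSB+-Tree-style index: the children of a node are stored
    contiguously in one array, so only the index of child 0 is kept."""
    __slots__ = ("keys", "first_child", "values")
    def __init__(self, keys, first_child=None, values=None):
        self.keys = keys                # sorted separator keys (or leaf keys)
        self.first_child = first_child  # array index of child 0, internal only
        self.values = values            # payloads, leaf only

def search(nodes, root, key):
    node = nodes[root]
    while node.values is None:                    # internal node
        offset = bisect.bisect_right(node.keys, key)
        node = nodes[node.first_child + offset]   # child found by offset
    i = bisect.bisect_left(node.keys, key)
    return node.values[i] if i < len(node.keys) and node.keys[i] == key else None

# Tiny two-level example: leaves occupy slots 1..3, so the root stores just 1.
nodes = [CSBNode(keys=[10, 20], first_child=1),
         CSBNode(keys=[1, 5],   values=["a", "b"]),
         CSBNode(keys=[10, 15], values=["c", "d"]),
         CSBNode(keys=[20, 25], values=["e", "f"])]
assert search(nodes, 0, 15) == "d"
```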

398 citations


Proceedings ArticleDOI
17 Sep 2000
TL;DR: This work evaluates the energy usage of each thread and throttles the system activity so that the scheduling goal is achieved, and shows that the correlation of events and energy values provides the necessary information for energy-aware scheduling policies.
Abstract: A prerequisite of energy-aware scheduling is precise knowledge of any activity inside the computer system. Embedded hardware monitors (e.g., processor performance counters) have proved to offer valuable information in the field of performance analysis. The same approach can be applied to investigate the energy usage patterns of individual threads. We use information about active hardware units (e.g., integer/floating-point unit, cache/memory interface) gathered by event counters to establish a thread-specific energy accounting. The evaluation shows that the correlation of events and energy values provides the necessary information for energy-aware scheduling policies. Our approach to OS-directed power management adds the energy usage pattern to the runtime context of a thread. Depending on the field of application we present two scenarios that benefit from applying energy usage patterns: Workstations with passive cooling on the one hand and battery-powered mobile systems on the other hand. Energy-aware scheduling evaluates the energy usage of each thread and throttles the system activity so that the scheduling goal is achieved. In workstations we throttle the system if the average energy use exceeds a predefined power-dissipation capacity. This makes a compact, noiseless and affordable system design possible that meets sporadic yet high demands in computing power. Nowadays, more and more mobile systems offer the features of reducible clock speed and dynamic voltage scaling. Energy-aware scheduling can employ these features to yield a longer battery life by slowing down low-priority threads while preserving a certain quality of service.
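The counter-based accounting can be pictured as a weighted sum over the events each thread generated in its timeslice, compared against a power budget. The event names and per-event energy weights below are made-up placeholders, not the calibrated values the authors derive:

```python
EVENT_WEIGHTS_NJ = {         # hypothetical nanojoules per counted event
    "int_ops": 0.4, "fp_ops": 1.1, "l1_misses": 5.0, "mem_accesses": 9.0,
}

def timeslice_energy_nj(counters):
    """Charge a thread the weighted sum of the events it caused."""
    return sum(w * counters.get(event, 0) for event, w in EVENT_WEIGHTS_NJ.items())

def should_throttle(recent_energies_nj, slice_seconds, power_budget_watts):
    """Throttle when the average dissipation over recent slices exceeds budget."""
    avg_watts = sum(recent_energies_nj) / (len(recent_energies_nj)
                                           * slice_seconds * 1e9)
    return avg_watts > power_budget_watts
```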

382 citations


Journal ArticleDOI
TL;DR: The design of a CMP is motivated, the architecture of the Hydra design is described with a focus on its speculative thread support, and the prototype implementation is described.
Abstract: The Hydra chip multiprocessor (CMP) integrates four MIPS-based processors and their primary caches on a single chip together with a shared secondary cache. A standard CMP offers implementation and performance advantages compared to wide-issue superscalar designs. However, it must be programmed with a more complicated parallel programming model to obtain maximum performance. To simplify parallel programming, the Hydra CMP supports thread-level speculation and memory renaming, a paradigm that allows performance similar to a uniprocessor of comparable die area on integer programs. This article motivates the design of a CMP, describes the architecture of the Hydra design with a focus on its speculative thread support, and describes our prototype implementation. Chip multiprocessors offer an economical, scalable architecture for future microprocessors. Thread-level speculation support allows them to speed up past software.

Patent
24 Nov 2000
TL;DR: In this paper, a preloader uses a cache manager to manage requests for retrievals, insertions, and removal of web page components in a component cache, a cache replacement manager to manage their replacement, and a profile server to predict a user's next content request.
Abstract: A preloader works in conjunction with a web/app server and optionally a profile server to cache web page content elements or components for faster on-demand and anticipatory dynamic web page delivery. The preloader uses a cache manager to manage requests for retrievals, insertions, and removal of web page components in a component cache. The preloader uses a cache replacement manager to manage the replacement of components in the cache. While the cache replacement manager may utilize any cache replacement policy, a particularly effective replacement policy utilizes predictive information to make replacement decisions. Such a policy uses a profile server, which predicts a user's next content request. The components that can be cached are identified by tagging them within the dynamic scripts that generate them. The preloader caches components that are likely to be accessed next, thus improving a web site's scalability.

Journal ArticleDOI
TL;DR: This paper proposes a novel replacement policy, called LRV, which selects for replacement the document with the lowest relative value among those in cache, and shows how LRV outperforms least recently used (LRU) and other policies and can significantly improve the performance of the cache, especially for a small one.
Abstract: In this paper, we analyze access traces to a Web proxy, looking at statistical parameters to be used in the design of a replacement policy for documents held in the cache. In the first part of this paper, we present a number of properties of the lifetime and statistics of access to documents, derived from two large trace sets coming from very different proxies and spanning over time intervals of up to five months. In the second part, we propose a novel replacement policy, called LRV, which selects for replacement the document with the lowest relative value among those in cache. In LRV, the value of a document is computed adaptively based on information readily available to the proxy server. The algorithm has no hardwired constants, and the computations associated with the replacement policy require only a small constant time. We show how LRV outperforms least recently used (LRU) and other policies and can significantly improve the performance of the cache, especially for a small one.
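The replacement decision itself is just "evict the minimum-value document"; what makes LRV interesting is how the value is estimated from the proxy's own access statistics. The sketch below shows only that eviction skeleton with a crude stand-in value function (recency, hit count, and size combined ad hoc), not the paper's adaptively computed value:

```python
import time

def pick_victim(cache):
    """cache: dict url -> {'last_access': epoch_s, 'hits': int, 'size': bytes}.
    Returns the URL with the lowest stand-in value, mirroring how an LRV-style
    policy evicts the document with the lowest estimated relative value."""
    now = time.time()
    def stand_in_value(meta):
        age = now - meta["last_access"]
        # Placeholder: frequently and recently used documents are worth more,
        # and value is discounted by the space a document occupies.
        return (1 + meta["hits"]) / ((1 + age) * meta["size"])
    return min(cache, key=lambda url: stand_in_value(cache[url]))
```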

Patent
03 Mar 2000
TL;DR: In this paper, a method, system, and computer program product for caching dynamically generated content (including, but not limited to, dynamically generated Web pages), as well as determining when the cached content should be invalidated or purged.
Abstract: A method, system, and computer program product for caching dynamically generated content (including, but not limited to, dynamically generated Web pages), as well as determining when the cached content should be invalidated or purged. Rather than caching the generated datastream (i.e. the end result of the computations used in the dynamic generation process) as in the prior art, the interim results of computations (such as a generated bean instance or object, where the interim results may be stored using properties and methods) are cached according to the present invention. The input properties used to generate the bean or object, along with the input property values, are used to distinguish among cached instances and thereby identify when a cached instance may be used to respond to a subsequent request for the same content. Re-execution of the business logic of the bean or object may then be avoided, using the cached bean's or object's output properties to generate the content response. Application-specific, developer-defined criteria may be used in the cache invalidation determination.

Proceedings ArticleDOI
01 Aug 2000
TL;DR: Kinetic data structures (KDSs) as discussed by the authors are a formal framework for designing and analyzing sets of assertions to cache about the environment, so that these assertion sets are at once relatively stable and tailored to facilitate or trivialize the computation of the attribute of interest.
Abstract: Computer systems commonly cache the values of variables to gain efficiency. In applications where the goal is to track attributes of a continuously moving or deforming physical system over time, caching relations between variables works better than caching individual values. The reason is that, as the system evolves, such relationships are more stable than the values of individual variables. Kinetic data structures (KDSs) are a novel formal framework for designing and analyzing sets of assertions to cache about the environment, so that these assertion sets are at once relatively stable and tailored to facilitate or trivialize the computation of the attribute of interest. Formally, a KDS is a mathematical proof animated through time, proving the validity of a certain computation for the attribute of interest. KDSs have rigorous associated measures of performance and their design shares many qualities with that of classical data structures. The KDS framework has led to many new and promising algorithms in applications where the efficient modeling of motion is essential. Among these are collision detection for moving rigid and deformable bodies, connectivity maintenance in ad-hoc networks, local environment tracking for mobile agents, and visibility/occlusion maintenance. This talk will survey the general ideas behind KDSs and illustrate their application to simple geometric problems that arise in virtual and physical environments.
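A toy example conveys the flavour of caching assertions rather than values: to maintain the maximum of points moving linearly on a line, cache the certificate "the current leader is ahead of everyone else" together with the earliest time it can fail, and recompute only at those failure times. This is an illustrative sketch only; a real KDS keeps an event queue of many local certificates rather than rescanning all points.

```python
def failure_time(leader, other, now):
    """Earliest t > now at which `other` overtakes `leader`, for points
    moving as x(t) = p + v * t; None if it never happens."""
    (p1, v1), (p2, v2) = leader, other
    if v2 <= v1:
        return None
    t = (p1 - p2) / (v2 - v1)
    return t if t > now else None

def kinetic_max(points, t_end):
    """Return (time, leader) pairs marking each change of the maximum."""
    t, changes = 0.0, []
    # Break ties by velocity so the imminent leader wins at crossing times.
    key = lambda pv, at: (pv[0] + pv[1] * at, pv[1])
    leader = max(points, key=lambda pv: key(pv, t))
    while t < t_end:
        fails = [f for f in (failure_time(leader, q, t)
                             for q in points if q is not leader)
                 if f is not None and f <= t_end]
        if not fails:
            break                      # cached certificate holds until t_end
        t = min(fails)                 # certificate failure event
        leader = max(points, key=lambda pv: key(pv, t))
        changes.append((t, leader))
    return changes

# e.g. kinetic_max([(0.0, 1.0), (5.0, 0.0)], t_end=10.0) reports the slower
# point being overtaken at t = 5.
```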

Proceedings ArticleDOI
01 May 2000
TL;DR: A new reconfigurable cache design is proposed that enables the cache SRAM arrays to be dynamically divided into multiple partitions that can be used for different processor activities.
Abstract: High performance general-purpose processors are increasingly being used for a variety of application domains - scientific, engineering, databases, and more recently, media processing. It is therefore important to ensure that architectural features that use a significant fraction of the on-chip transistors are applicable across these different domains. For example, current processor designs often devote the largest fraction of on-chip transistors (up to 80%) to caches. Many workloads, however, do not make effective use of large caches; e.g., media processing workloads which often have streaming data access patterns and large working sets. This paper proposes a new reconfigurable cache design. This design enables the cache SRAM arrays to be dynamically divided into multiple partitions that can be used for different processor activities. These activities can benefit applications that would otherwise not use the storage allocated to large conventional caches. Our design involves relatively few modifications to conventional cache design, and analysis using a modification of the CACTI analytical model shows a small impact on cache access time. We evaluate one representative use of reconfigurable caches - instruction reuse for media processing. We find this use gives IPC improvements ranging from 1.04X to 1.20X in simulation across eight media processing benchmarks.

Patent
23 Mar 2000
TL;DR: In this paper, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network, and the cache servers can be managed by receiving a message indicating at least a connected location of a mobile computer in the wireless network from the mobile computer, selecting one or more cache servers located near the mobile computer according to the message, and controlling these one or more cache servers to cache WWW information selected for the mobile computer, so as to enable faster accesses to the selected WWW information by the mobile computer.
Abstract: In the disclosed information delivery scheme for delivering WWW information provided by information servers on the Internet to mobile computers connected to the Internet through a wireless network, a plurality of cache servers capable of caching WWW information provided by the information servers are provided in association with the wireless network. The cache servers can be managed by receiving a message indicating at least a connected location of a mobile computer in the wireless network from the mobile computer, selecting one or more cache servers located nearby the mobile computer according to the message, and controlling these one or more cache servers to cache selected WWW information selected for the mobile computer, so as to enable faster accesses to the selected WWW information by the mobile computer. Also, the cache servers can be managed by selecting one or more cache servers located within a geographic range defined for an information provider who provides WWW information from an information server, and controlling these one or more cache servers to cache selected WWW information selected for the information provider, so as to enable faster accesses to the selected WWW information by the mobile computer.

Proceedings ArticleDOI
01 Aug 2000
TL;DR: The results show that semantic caching is more flexible and effective for use in LDD applications than page caching, whose performance is quite sensitive to the database physical organization.
Abstract: Location-dependent applications are becoming very popular in mobile environments. To improve system performance and facilitate disconnection, caching is crucial to such applications. In this paper, a semantic caching scheme is used to access location dependent data in mobile computing. We first develop a mobility model to represent the moving behaviors of mobile users and formally define location dependent queries. We then investigate query processing and cache management strategies. The performance of the semantic caching scheme and its replacement strategy FAR is evaluated through a simulation study. Our results show that semantic caching is more flexible and effective for use in LDD applications than page caching, whose performance is quite sensitive to the database physical organization. We also notice that the semantic cache replacement strategy FAR, which utilizes the semantic locality in terms of locations, performs robustly under different kinds of workloads.
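The FAR idea can be illustrated with a small eviction helper: cached results whose associated locations lie behind the user's direction of travel are evicted first, furthest ones first. The distance measure and the simple in-front/behind test below are assumptions for illustration, not the paper's exact definition:

```python
import math

def far_victim(cached_regions, position, heading):
    """cached_regions: dict name -> (x, y) centre of a cached query region.
    position, heading: the user's current location and unit movement vector."""
    def rank(centre):
        dx, dy = centre[0] - position[0], centre[1] - position[1]
        ahead = dx * heading[0] + dy * heading[1] > 0
        # Regions behind the user sort first; among them, furthest first.
        return (1 if ahead else 0, -math.hypot(dx, dy))
    return min(cached_regions, key=lambda name: rank(cached_regions[name]))

# e.g. with heading (1, 0), a region at (-3, 0) is evicted before one at (2, 0).
```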

Journal ArticleDOI
Martin Arlitt, Ludmila Cherkasova, John Dilley, Rich Friedrich, Tai Jin
01 Mar 2000
TL;DR: A trace of client requests to a busy Web proxy in an ISP environment is utilized to evaluate the performance of several existing replacement policies and of two new, parameterless replacement policies that are introduced in this paper.
Abstract: The continued growth of the World-Wide Web and the emergence of new end-user technologies such as cable modems necessitate the use of proxy caches to reduce latency, network traffic and Web server loads. Current Web proxy caches utilize simple replacement policies to determine which files to retain in the cache. We utilize a trace of client requests to a busy Web proxy in an ISP environment to evaluate the performance of several existing replacement policies and of two new, parameterless replacement policies that we introduce in this paper. Finally, we introduce Virtual Caches, an approach for improving the performance of the cache for multiple metrics simultaneously.

Patent
23 Jun 2000
TL;DR: In this paper, an apparatus and method for enhancing the infrastructure of a network such as the Internet is disclosed, where multiple edge servers and edge caches are provided at the edge of the network so as to cover and monitor all points of presence.
Abstract: An apparatus and method for enhancing the infrastructure of a network such as the Internet is disclosed. Multiple edge servers and edge caches are provided at the edge of the network so as to cover and monitor all points of presence. The edge servers selectively intercept domain name translation requests generated by downstream clients, coupled to the monitored points of presence, to subscribing Web servers and provide translations which either enhance content delivery services or redirect the requesting client to the edge cache to make its content requests. Further, network traffic monitoring is provided in order to detect malicious or otherwise unauthorized data transmissions.

Proceedings ArticleDOI
26 Mar 2000
TL;DR: This paper proposes a proxy caching mechanism for layered-encoded multimedia streams in the Internet to maximize the delivered quality of popular streams to interested clients, and presents a prefetching mechanism to support higher quality cached streams during subsequent playbacks and improve the quality of the cached stream with its popularity.
Abstract: The Internet has witnessed a rapid growth in deployment of Web-based streaming applications during recent years. In these applications, the server should be able to perform end-to-end congestion control and quality adaptation to match the delivered stream quality to the average available bandwidth. Thus the delivered quality is limited by the bottleneck bandwidth on the path to the client. This paper proposes a proxy caching mechanism for layered-encoded multimedia streams in the Internet to maximize the delivered quality of popular streams to interested clients. The main challenge is to replay a quality-variable cached stream while performing quality adaptation effectively in response to the variations in available bandwidth. We present a prefetching mechanism to support higher quality cached streams during subsequent playbacks and improve the quality of the cached stream with its popularity. We exploit inherent properties of multimedia streams to extend the semantics of popularity and capture both level of interest among clients and usefulness of a layer in the cache. We devise a fine-grain replacement algorithm suited for layered-encoded streams. Our simulation results show that the interaction between the replacement algorithm and prefetching mechanism causes the state of the cache to converge to an efficient state such that the quality of a cached stream is proportional to its popularity, and the variations in quality of a cached stream are inversely proportional to its popularity. This implies that after serving several requests for a stream, the proxy can effectively hide low bandwidth paths to the original server from interested clients.

Journal ArticleDOI
01 May 2000
TL;DR: This paper shows how the microarchitecture analysis can be separated from the path analysis in order to make the overall analysis fast and shows that the approach can be used to analyse executables created by a standard optimising compiler.
Abstract: Precise run-time prediction suffers from a complexity problem when doing an integrated analysis. This problem is characterised by the conflict between an optimal solution and the complexity of the computation of the solution. The analysis of modern hardware consists of two parts: a) the analysis of the microarchitecture's behaviour (caches, pipelines) and b) the search for the longest program path. Because an integrated analysis has a significant computational complexity, we chose to separate these two steps. By this, an ordering problem arises, because the steps depend on each other. In this paper we show how the microarchitecture analysis can be separated from the path analysis in order to make the overall analysis fast. Practical experiments will show that this separation, however, does not make the analysis more pessimistic than existing approaches. Furthermore, we show that the approach can be used to analyse executables created by a standard optimising compiler.

Proceedings ArticleDOI
01 Aug 2000
TL;DR: This paper focuses on the features of the M340 cache sub-system and illustrates the effect on power and performance through benchmark analysis and actual silicon measurements.
Abstract: Advances in technology have allowed portable electronic devices to become smaller and more complex, placing stringent power and performance requirements on the devices' components. The M·CORE M3 architecture was developed specifically for these embedded applications. To address the growing need for longer battery life and higher performance, an 8-Kbyte, 4-way set-associative, unified (instruction and data) cache with programmable features was added to the M3 core. These features allow the architecture to be optimized based on the application's requirements. In this paper, we focus on the features of the M340 cache sub-system and illustrate the effect on power and performance through benchmark analysis and actual silicon measurements.

Patent
07 Sep 2000
TL;DR: In this paper, an audio element cache is provided that is capable of caching audio elements for each user in a personal radio server system, where customized radio content is provided to remote listeners by storing a plurality of audio elements in a file server, retrieving a subset of the audio elements from the file server by predicting the content desired by a remote listener based on a user profile of the remote listener.
Abstract: An audio element cache is provided that is capable of caching audio elements for each user in a personal radio server system. In operation, customized radio content is provided to remote listeners in a personal radio server system by: storing a plurality of audio elements in a file server; retrieving a subset of the plurality of audio elements from the file server by predicting the content desired by a remote listener based on a user profile of the remote listener; storing the subset of the plurality of audio elements in an audio element cache; selecting audio elements to provide to a remote listener from the audio element cache; and transmitting the audio elements to the remote listener. In an embodiment, the plurality of audio elements are stored in the audio element cache when a remote listener logs on to the personal radio server system.

Proceedings ArticleDOI
01 Feb 2000
TL;DR: The rules of thumb for the design of data storage systems are reexamined with a particular focus on performance and price/performance, and the 5-minute rule for disk caching becomes a cache-everything rule for Web caching.
Abstract: This paper reexamines the rules of thumb for the design of data storage systems. Briefly, it looks at storage, processing, and networking costs, ratios, and trends with a particular focus on performance and price/performance. Amdahl's ratio laws for system design need only slight revision after 35 years, the major change being the increased use of RAM. An analysis also indicates storage should be used to cache both database and Web data to save disk bandwidth, network bandwidth, and people's time. Surprisingly, the 5-minute rule for disk caching becomes a cache-everything rule for Web caching.
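The "cache everything" conclusion follows from the classic break-even calculation: keep an object in RAM if it will be re-referenced sooner than the interval at which the cost of the RAM equals the cost of the disk accesses it saves. A sketch with round, illustrative numbers (not the paper's exact figures):

```python
def break_even_seconds(pages_per_mb_ram, ios_per_sec_per_disk,
                       price_per_disk, price_per_mb_ram):
    """Break-even reference interval: cache a page if it is re-used sooner."""
    return (pages_per_mb_ram / ios_per_sec_per_disk) * (
        price_per_disk / price_per_mb_ram)

# Hypothetical round numbers: 128 8-KB pages per MB of RAM, ~80 random I/Os
# per second per disk, a $1000 disk, RAM at about $1 per MB.
print(break_even_seconds(128, 80, 1000, 1))   # -> 1600 s, tens of minutes
```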

Patent
14 Aug 2000
TL;DR: In this paper, a system for updating Web pages stored in cache based on modifications to data stored in a database is described; the system is part of a larger system having a database management system for storing data used to generate Web pages, and its servers are capable of communicating an update command to the cache that contains the stored Web pages associated with the identified modified data, for the purpose of updating the stored Web pages.
Abstract: A system for updating Web pages stored in cache based on modifications to data stored in a database is disclosed. The system for updating stored Web pages may be part of a larger system having a database management system for storing data used to generate Web pages. The database management system is capable of identifying modified data stored in the database. The system for updating stored Web pages is comprised of one or more servers programmed for maintaining associations between the stored Web pages and the stored data, and receiving the identity of modified data from the memory management system. In addition, the servers are capable of determining, from the identified modified data and the maintained associations, which stored Web pages are associated with the identified modified data. Furthermore, the servers are capable of communicating an update command to the cache that contains the stored Web pages associated with the identified modified data, for the purpose of updating the stored Web pages.

Proceedings ArticleDOI
09 Jul 2000
TL;DR: Initial experiments on iterative data-parallel applications show that a locality-guided work-stealing algorithm, which improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor, matches the performance of static partitioning under traditional workloads but improves performance by up to 50% over static partitioning under multiprogrammed workloads.
Abstract: This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines. We present lower and upper bounds on the number of cache misses using work stealing, and introduce a locality-guided work-stealing algorithm along with experimental validation. As a lower bound, we show that there is a family of multi-threaded computations Gn, each member of which requires Θ(n) total instructions (work), for which when using work stealing the number of cache misses on one processor is constant, while even on two processors the total number of cache misses is Ω(n). This implies that for general computations there is no useful bound relating multiprocessor to uniprocessor cache misses. For nested-parallel computations, however, we show that on P processors the expected additional number of cache misses beyond those on a single processor is bounded by O(C⌈m/s⌉PT∞), where m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of cache, and T∞ is the number of nodes on the longest chain of dependences. Based on this we give strong bounds on the total running time of nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work stealing algorithm that improves the data locality of multi-threaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static-partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, the locality-guided work stealing improves the performance of work-stealing up to 80%.
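Restated in notation (with M_P denoting the number of cache misses on P processors, a symbol introduced here for readability), the nested-parallel bound above reads:

```latex
\[
  \mathbb{E}\!\left[ M_P - M_1 \right]
    \;=\; O\!\left( C \left\lceil \tfrac{m}{s} \right\rceil P \, T_{\infty} \right),
\]
```

where C is the cache size, m the cache-miss service time, s the steal time, and T∞ the length of the longest chain of dependences.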

Patent
21 Sep 2000
TL;DR: In this article, the authors propose a traffic analysis mechanism to provide the third party network with real-time data identifying the content delivered by the CDS from the third-party caches.
Abstract: Third party cache appliances are configured into a content delivery service to enable such devices to cache and serve content that has been tagged for delivery by the service. The invention enables the content delivery service to extend the reach of its network while taking advantage of high performance, off-the-shelf cache appliances. If the third party caches comprise part of a third party content delivery network, the interconnection of caches to the CDS according to the present invention enables the CDS and the third party network to share responsibility for delivering the content. To facilitate such “content peering,” the CDS may also include a traffic analysis mechanism to provide the third party network with preferably real-time data identifying the content delivered by the CDS from the third party caches. The CDS may also include a logging mechanism to generate appropriate billing and reporting of the third party content that is delivered from the cache appliances that have been joined into the CDS.

Proceedings ArticleDOI
01 May 2000
TL;DR: A practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement is presented.
Abstract: As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features—full associativity and software management—have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.
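A software analogue hints at how a fully associative, software-managed cache with generational replacement behaves: tags live in a hash table (the indirection), referenced blocks are promoted toward higher-priority pools, and victims come from the oldest entry of the lowest pool. This is a much-simplified sketch, not the IIC hardware design (for instance, it omits the periodic demotion of unreferenced blocks):

```python
from collections import OrderedDict

class GenerationalCache:
    """Fully associative cache sketch: a hash table of tags plus a few
    ordered pools ("generations") used for replacement decisions."""

    def __init__(self, capacity, generations=3):
        self.capacity = capacity
        self.pools = [OrderedDict() for _ in range(generations)]
        self.where = {}                      # tag -> generation index

    def access(self, tag, fill_value=None):
        if tag in self.where:                 # hit: promote one generation
            g = self.where[tag]
            value = self.pools[g].pop(tag)
            g = min(g + 1, len(self.pools) - 1)
            self.pools[g][tag] = value
            self.where[tag] = g
            return value
        if len(self.where) >= self.capacity:  # miss: evict from lowest pool
            for pool in self.pools:
                if pool:
                    victim, _ = pool.popitem(last=False)
                    del self.where[victim]
                    break
        self.pools[0][tag] = fill_value       # new blocks start in generation 0
        self.where[tag] = 0
        return None
```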

Patent
16 Feb 2000
TL;DR: In this article, a field programmable gate array (FPGA) which includes first and second arrays of configurable logic blocks, and first and second configuration cache memories coupled to the first and second arrays, respectively, is described.
Abstract: A field programmable gate array (FPGA) which includes first and second arrays of configurable logic blocks, and first and second configuration cache memories coupled to the first and second arrays of configurable logic blocks, respectively. The first configuration cache memory array can either store values for reconfiguring the first array of configurable logic blocks, or operate as a RAM. Similarly, the second configuration cache array can either store values for reconfiguring the second array of configurable logic blocks, or operate as a RAM. The first configuration cache memory array and the second configuration cache memory array are independently controlled, such that partial reconfiguration of the FPGA can be accomplished. In addition, the second configuration cache memory array can store values for reconfiguring the first (rather than the second) array of configurable logic blocks, thereby providing a second-level reconfiguration cache memory.