
Showing papers on "Cache" published in 2002


Patent
20 Sep 2002
TL;DR: In this article, a system for implementing view caching in a framework to support web-based applications is presented. The system comprises a set of server-side objects managed by an object manager (OM) running on a server, a set of browser-side objects running on a client browser, and a client-side cache of view layouts.
Abstract: According to one aspect of the present invention, a system is provided for implementing view caching in a framework to support web-based applications. The system comprises a set of server-side objects managed by an object manager (OM) running on a server. The system further comprises a set of browser-side objects running on a browser running on a client. The system also comprises a remote procedure call (RPC) mechanism and a notification mechanism to facilitate communication and synchronization between the browser-side objects and the server-side objects. The system additionally comprises a cache on the client to store layouts of views, wherein each view is a display panel consisting of a particular arrangement of applets.

1,158 citations


Journal ArticleDOI
01 May 2002
TL;DR: It is argued that the use of drowsy caches can simplify the design and control of low-leakage caches, and avoid the need to completely turn off selected cache lines and lose their state.
Abstract: On-chip caches represent a sizable fraction of the total power consumption of microprocessors. Although large caches can significantly improve performance, they have the potential to increase power consumption. As feature sizes shrink, the dominant component of this power loss will be leakage. However, during a fixed period of time the activity in a cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large caches by putting the cold cache lines into a state preserving, low-power drowsy mode. Moving lines into and out of drowsy state incurs a slight performance loss. In this paper we investigate policies and circuit techniques for implementing drowsy caches. We show that with simple architectural techniques, about 80%-90% of the cache lines can be maintained in a drowsy state without affecting performance by more than 1%. According to our projections, in a 0.07um CMOS process, drowsy caches will be able to reduce the total energy (static and dynamic) consumed in the caches by 50%-75%. We also argue that the use of drowsy caches can simplify the design and control of low-leakage caches, and avoid the need to completely turn off selected cache lines and lose their state.
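
A rough way to see why the simple policy works is to simulate it: periodically force every line into drowsy mode and charge a small wake-up penalty on the next access to each line. The sketch below is a toy model under assumed parameters (window length, penalty, access pattern), not the paper's simulator.

import random

WINDOW = 4000      # cycles between "put everything drowsy" sweeps (assumed)
WAKE_PENALTY = 1   # extra cycles to wake a drowsy line (assumed)
NUM_LINES = 512

drowsy = [False] * NUM_LINES
extra_cycles = accesses = 0

for cycle in range(1_000_000):
    if cycle % WINDOW == 0:
        drowsy = [True] * NUM_LINES          # simple policy: all lines drowsy
    # toy access pattern: most accesses hit a small hot subset of lines
    line = random.randrange(32) if random.random() < 0.9 else random.randrange(NUM_LINES)
    accesses += 1
    if drowsy[line]:
        extra_cycles += WAKE_PENALTY          # state is preserved, only latency is paid
        drowsy[line] = False

print("avg extra cycles per access:", extra_cycles / accesses)
print("fraction of lines drowsy at end:", sum(drowsy) / NUM_LINES)

Because cold lines stay drowsy between sweeps, leakage savings track the fraction of drowsy lines while the per-access penalty stays tiny, which is the effect the paper quantifies.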

823 citations


Proceedings ArticleDOI
01 Oct 2002
TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.
Abstract: Growing wire delays will force substantive changes in the designs of large caches. Traditional cache architectures assume that each level in the cache hierarchy has a single, uniform access time. Increases in on-chip communication delays will make the hit time of large on-chip caches a function of a line's physical location within the cache. Consequently, cache access times will become a continuum of latencies rather than a single discrete latency. This non-uniformity can be exploited to provide faster access to cache lines in the portions of the cache that reside closer to the processor. In this paper, we evaluate a series of cache designs that provides fast hits to multi-megabyte cache memories. We first propose physical designs for these Non-Uniform Cache Architectures (NUCAs). We extend these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache. We show that, for multi-megabyte level-two caches, an adaptive, dynamic NUCA design achieves 1.5 times the IPC of a Uniform Cache Architecture of any size, outperforms the best static NUCA scheme by 11%, outperforms the best three-level hierarchy--while using less silicon area--by 13%, and comes within 13% of an ideal minimal hit latency solution.
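
The migration idea can be illustrated with a toy dynamic-NUCA policy: on a hit in a far bank, swap the line one bank closer to the processor so frequently used lines gravitate to the fastest banks. The bank latencies and the promote-by-one rule below are illustrative assumptions, not the paper's exact configuration.

# Toy D-NUCA model: one set spread over banks 0 (closest/fastest) .. 3 (farthest).
BANK_LATENCY = [4, 8, 12, 16]          # assumed hit latencies per bank
banks = [None, "B", "C", "A"]          # one cache line per bank in this set

def access(tag):
    for i, line in enumerate(banks):
        if line == tag:
            if i > 0:                  # promotion: move one bank closer to the processor
                banks[i - 1], banks[i] = banks[i], banks[i - 1]
            return BANK_LATENCY[i]
    return None                        # miss (fill policy omitted in this sketch)

for t in ["A", "A", "A", "B"]:
    print(t, "->", access(t), "cycles")
print("bank contents:", banks)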

799 citations


Proceedings ArticleDOI
01 Jun 2002
TL;DR: LIRS effectively addresses the limits of LRU by using recency to evaluate Inter-Reference Recency (IRR) for making a replacement decision, and significantly outperforms LRU, and outperforms other existing replacement algorithms in most cases.
Abstract: Although LRU replacement policy has been commonly used in the buffer cache management, it is well known for its inability to cope with access patterns with weak locality. Previous work, such as LRU-K and 2Q, attempts to enhance LRU capacity by making use of additional history information of previous block references other than only the recency information used in LRU. These algorithms greatly increase complexity and/or can not consistently provide performance improvement. Many recently proposed policies, such as UBM and SEQ, improve replacement performance by exploiting access regularities in references. They only address LRU problems on certain specific and well-defined cases such as access patterns like sequences and loops. Motivated by the limits of previous studies, we propose an efficient buffer cache replacement policy, called Low Inter-reference Recency Set (LIRS). LIRS effectively addresses the limits of LRU by using recency to evaluate Inter-Reference Recency (IRR) for making a replacement decision. This is in contrast to what LRU does: directly using recency to predict next reference timing. At the same time, LIRS almost retains the same simple assumption of LRU to predict future access behavior of blocks. Our objectives are to effectively address the limits of LRU for a general purpose, to retain the low overhead merit of LRU, and to outperform those replacement policies relying on the access regularity detections. Conducting simulations with a variety of traces and a wide range of cache sizes, we show that LIRS significantly outperforms LRU, and outperforms other existing replacement algorithms in most cases. Furthermore, we show that the additional cost for implementing LIRS is trivial in comparison with LRU.
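
The metric at the heart of LIRS, Inter-Reference Recency, is just the number of distinct other blocks touched between two consecutive references to a block. The small helper below illustrates the metric only (not the full LIRS stack and queue machinery): blocks with small IRR are treated as hot (LIR) and protected, while blocks with large or infinite IRR become the first eviction candidates.

def irr_trace(refs):
    """Return, for each reference, the IRR of the referenced block:
    the number of distinct other blocks seen since its previous reference
    (None if this is the first reference)."""
    last_pos, out = {}, []
    for i, b in enumerate(refs):
        if b in last_pos:
            out.append(len(set(refs[last_pos[b] + 1:i])))
        else:
            out.append(None)
        last_pos[b] = i
    return out

refs = list("ABCABDDA")
print(list(zip(refs, irr_trace(refs))))
# A's second reference has IRR 2 (B and C seen in between); D's repeat has IRR 0.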

575 citations


Journal ArticleDOI
TL;DR: Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.
Abstract: This paper aims at finding fundamental design principles for hierarchical Web caching. An analytical modeling technique is developed to characterize an uncooperative two-level hierarchical caching system where the least recently used (LRU) algorithm is locally run at each cache. With this modeling technique, we are able to identify a characteristic time for each cache, which plays a fundamental role in understanding the caching processes. In particular, a cache can be viewed roughly as a low-pass filter with its cutoff frequency equal to the inverse of the characteristic time. Documents with access frequencies lower than this cutoff frequency have good chances to pass through the cache without cache hits. This viewpoint enables us to take any branch of the cache tree as a tandem of low-pass filters at different cutoff frequencies, which further results in the finding of two fundamental design principles. Finally, to demonstrate how to use the principles to guide the caching algorithm design, we propose a cooperative hierarchical Web caching architecture based on these principles. Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.
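
The "characteristic time" view can be made concrete with a small numerical sketch: under an independent-reference model with per-document request rates lambda_i, the characteristic time t_C of an LRU cache of size C is the value at which the expected number of distinct documents requested within t_C equals C, and each document's hit probability is then roughly 1 - exp(-lambda_i * t_C). The Zipf workload and the bisection solver below are my own illustrative choices, not taken from the paper.

import math

def characteristic_time(rates, cache_size):
    """Solve sum_i (1 - exp(-rate_i * t)) = cache_size for t by bisection."""
    def filled(t):
        return sum(1 - math.exp(-r * t) for r in rates)
    lo, hi = 0.0, 1.0
    while filled(hi) < cache_size:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if filled(mid) < cache_size:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Illustrative Zipf-like popularity over 10,000 documents, cache of 500 documents.
rates = [1.0 / i for i in range(1, 10001)]
tc = characteristic_time(rates, 500)
hit_ratio = sum(r * (1 - math.exp(-r * tc)) for r in rates) / sum(rates)
print("characteristic time:", tc)
print("approximate hit ratio:", hit_ratio)
# Documents with rate well below 1/tc (the "cutoff frequency") mostly miss: low-pass behavior.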

512 citations


Proceedings ArticleDOI
21 Jul 2002
TL;DR: This paper proposes and evaluates decentralized web caching algorithms for Squirrel, and discovers that it exhibits performance comparable to a centralized web cache in terms of hit ratio, bandwidth usage and latency.
Abstract: This paper presents a decentralized, peer-to-peer web cache called Squirrel. The key idea is to enable web browsers on desktop machines to share their local caches, to form an efficient and scalable web cache, without the need for dedicated hardware and the associated administrative cost. We propose and evaluate decentralized web caching algorithms for Squirrel, and discover that it exhibits performance comparable to a centralized web cache in terms of hit ratio, bandwidth usage and latency. It also achieves the benefits of decentralization, such as being scalable, self-organizing and resilient to node failures, while imposing low overhead on the participating nodes.
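
The core routing idea in a Squirrel-like system is that each URL hashes to a "home" node in the peer-to-peer overlay, and requests for that URL are directed there; the home node either serves a cached copy or fetches the object. The sketch below mimics that mapping with a plain hashing step; the node names and the SHA-1 choice are mine, and the real system routes through Pastry rather than scanning a node list.

import hashlib

NODES = ["desk-01", "desk-02", "desk-03", "desk-04"]   # hypothetical participating browsers

def _id(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def home_node(url):
    """Pick the node whose id is numerically closest to the URL's hash."""
    target = _id(url)
    return min(NODES, key=lambda n: abs(_id(n) - target))

cache = {}   # each home node would keep its own store; one dict stands in here

def request(url):
    node = home_node(url)
    if (node, url) in cache:
        return node, "hit"
    cache[(node, url)] = "<object>"     # home node fetches from the origin server
    return node, "miss, fetched"

print(request("http://example.com/a"))
print(request("http://example.com/a"))   # second request hits at the same home node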

429 citations


Proceedings ArticleDOI
04 Mar 2002
TL;DR: An algorithm integrated into a compiler is presented which analyses the application and selects program and data parts which are placed into the scratchpad and Comparisons against a cache solution show remarkable advantages between 12% and 43% in energy consumption for designs of the same memory size.
Abstract: The number of embedded systems is increasing and a remarkable percentage is designed as mobile applications. For the latter, energy consumption is a limiting factor because of today's battery capacities. Besides the processor, memory accesses consume a high amount of energy. The use of additional less power hungry memories like caches or scratchpads is thus common. Caches incorporate the hardware control logic for moving data in and out automatically. On the other hand, this logic requires chip area and energy. A scratchpad memory is much more energy efficient, but there is a need for software control of its content. In this paper, an algorithm integrated into a compiler is presented which analyses the application and selects program and data parts which are placed into the scratchpad. Comparisons against a cache solution show remarkable advantages between 12% and 43% in energy consumption for designs of the same memory size.
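
One natural way to read the selection step is as a 0/1 knapsack: each program or data object has a size and an estimated energy saving if moved to the scratchpad, and the compiler picks the subset with maximal saving that fits. The dynamic-programming sketch below uses made-up object names and numbers and stands in for whatever formulation the paper's compiler actually solves.

# (name, size in bytes, estimated energy saving if placed in scratchpad) -- illustrative values
objects = [("main_loop", 512, 90), ("lut_table", 1024, 120), ("isr_code", 256, 35), ("buf", 768, 60)]
CAPACITY = 2048   # scratchpad size in bytes (assumed)

best = [(0, [])] * (CAPACITY + 1)        # best[c] = (saving, chosen objects) using capacity c
for name, size, gain in objects:
    for c in range(CAPACITY, size - 1, -1):
        cand = (best[c - size][0] + gain, best[c - size][1] + [name])
        if cand[0] > best[c][0]:
            best[c] = cand

print("selected for scratchpad:", best[CAPACITY][1], "saving:", best[CAPACITY][0])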

374 citations


Journal ArticleDOI
TL;DR: Results suggest that client latency is not as dependent on aggressive caching as is commonly believed, and that the widespread use of dynamic low-TTL A-record bindings should not greatly increase DNS related wide-area network traffic.
Abstract: This paper presents a detailed analysis of traces of domain name system (DNS) and associated TCP traffic collected on the Internet links of the MIT Laboratory for Computer Science and the Korea Advanced Institute of Science and Technology (KAIST). The first part of the analysis details how clients at these institutions interact with the wide-area domain name system, focusing on client-perceived performance and the prevalence of failures and errors. The second part evaluates the effectiveness of DNS caching. In the most recent MIT trace, 23% of lookups receive no answer; these lookups account for more than half of all traced DNS packets since query packets are retransmitted overly persistently. About 13% of all lookups result in an answer that indicates an error condition. Many of these errors appear to be caused by missing inverse (IP-to-name) mappings or NS records that point to nonexistent or inappropriate hosts. 27% of the queries sent to the root name servers result in such errors. The paper also presents the results of trace-driven simulations that explore the effect of varying time-to-live (TTL) and varying degrees of cache sharing on DNS cache hit rates. Due to the heavy-tailed nature of name accesses, reducing the TTL of address (A) records to as low as a few hundred seconds has little adverse effect on hit rates, and little benefit is obtained from sharing a forwarding DNS cache among more than 10 or 20 clients. These results suggest that client latency is not as dependent on aggressive caching as is commonly believed, and that the widespread use of dynamic low-TTL A-record bindings should not greatly increase DNS related wide-area network traffic.
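
The TTL experiment is easy to reproduce in spirit with a trace-driven toy: replay (timestamp, name) lookups against a shared cache whose entries expire after a chosen TTL, and measure the hit rate as the TTL shrinks. The trace below is synthetic and heavy-tailed purely for illustration; the paper drives this with the real MIT and KAIST traces.

import itertools
import random

def hit_rate(trace, ttl):
    cache, hits = {}, 0
    for t, name in trace:
        if name in cache and t - cache[name] <= ttl:
            hits += 1               # still fresh: answered from the shared cache
        else:
            cache[name] = t         # miss or expired: fetch and cache at time t
    return hits / len(trace)

# Synthetic heavy-tailed trace: popular names dominate, one lookup per second.
random.seed(0)
names = [f"host{i}.example" for i in range(5000)]
cum = list(itertools.accumulate(1.0 / (i + 1) for i in range(5000)))
trace = [(t, random.choices(names, cum_weights=cum)[0]) for t in range(50000)]

for ttl in (30, 300, 3600, 86400):
    print(f"TTL={ttl:>6}s  hit rate={hit_rate(trace, ttl):.2f}")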

358 citations


Patent
30 Sep 2002
TL;DR: In this paper, an interface device is connected to a host by an I/O bus and provides hardware and processing mechanisms for accelerating data transfers between a network and a storage unit, while controlling the data transfers by the host.
Abstract: An interface device is connected to a host by an I/O bus and provides hardware and processing mechanisms for accelerating data transfers between a network and a storage unit, while controlling the data transfers by the host. The interface device includes hardware circuitry for processing network packet headers, and can use a dedicated fast-path for data transfer between the network and the storage unit, the fast-path set up by the host. The host CPU and protocol stack avoids protocol processing for data transfer over the fast-path, freeing host bus bandwidth, and the data need not cross the I/O bus, freeing I/O bus bandwidth. The storage unit may include RAID or other multiple drive configurations and may be connected to the INIC by a parallel channel such as SCSI or by a serial channel such as Ethernet or Fibre Channel. The interface device contains a file cache that stores data transferred between the network and storage unit, with organization of data in the interface device file cache controlled by a file system on the host. Additional interface devices may be connected to the host via the I/O bus, with each additional interface device having a file cache controlled by the host file system, and providing additional network connections and/or being connected to additional storage units.

348 citations


Proceedings ArticleDOI
02 Feb 2002
TL;DR: A scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy is described, which can be used to schedule jobs or to partition the cache to minimize the overall miss-rate.
Abstract: We propose a low overhead, online memory monitoring scheme utilizing a set of novel hardware counters. The counters indicate the marginal gain in cache hits as the size of the cache is increased, which gives the cache miss-rate as a function of cache size. Using the counters, we describe a scheme that enables an accurate estimate of the isolated miss-rates of each process as a function of cache size under the standard LRU replacement policy. This information can be used to schedule jobs or to partition the cache to minimize the overall miss-rate. The data collected by the monitors can also be used by an analytical model of cache and memory behavior to produce a more accurate overall miss-rate for the collection of processes sharing a cache in both time and space. This overall miss-rate can be used to improve scheduling and partitioning schemes.
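
A software analogue of those counters is the classic LRU stack-distance profile: for each access, record how deep the block sits in the LRU stack; the number of accesses with distance at most c is exactly the number of hits a cache of size c would see, so one pass over the trace yields the miss rate as a function of cache size. The sketch below is that software version (the paper implements the marginal-gain counters in hardware).

def stack_distance_profile(trace, max_size):
    stack, hist = [], [0] * (max_size + 1)   # hist[d] = accesses with stack distance d
    cold = 0
    for block in trace:
        if block in stack:
            d = stack.index(block) + 1       # distance from the MRU end
            if d <= max_size:
                hist[d] += 1
            stack.remove(block)
        else:
            cold += 1
        stack.insert(0, block)               # move/insert at the MRU position
    return hist, cold

trace = list("ABCABDABEABC")
hist, cold = stack_distance_profile(trace, max_size=4)
total = len(trace)
for c in range(1, 5):
    hits = sum(hist[1:c + 1])                # hits a c-block LRU cache would see
    print(f"cache size {c}: miss rate {(total - hits) / total:.2f}")
# hist[c] itself is the marginal gain in hits from growing the cache from c-1 to c blocks.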

325 citations


Patent
24 Jul 2002
TL;DR: An apparatus and method for client-side content processing such as filtering and caching of secure content sent using Transport Layer Security (TLS) or Secure Socket Layer (SSL) protocols are provided in this paper.
Abstract: An apparatus and method are provided for client-side content processing such as filtering and caching of secure content sent using Transport Layer Security (TLS) or Secure Socket Layer (SSL) protocols. An appliance functions as a controlled man-in-the-middle on the client side to terminate, cache, switch, and modify secure client side content.

31 May 2002
TL;DR: This thesis is a reference to the Monet system in all its detail, and outlines an SQL front-end that uses Monet as a back-end, for constructing a full-fledged SQL compliant RDBMS including ACID properties.
Abstract: Monet is a database kernel targeted at query-intensive, heavy analysis applications (the opposite of transaction processing), which include OLAP and data mining, but also go beyond the business domain in GIS processing, multi-media retrieval and XML. The clean sheet approach of Monet tries to depart from the traditional RDBMS design and implementation patterns in an attempt to obtain best performance on modern hardware, which has changed a lot since the currently dominant relational database systems were designed and developed. While most hardware components have experienced exponential growth in power over the years (a.k.a. Moore's law), I/O and especially memory latency have been lagging, creating an exponentially growing bottleneck. Additionally, modern hyperpipelined CPUs increasingly require code that is fully predictable (as to avoid branch mispredictions) and independent (to exploit parallel execution units) in order to reach their advertised performance, which makes a tough match with the interpreted, highly unpredictable and interdependent code found in database execution engines. The choice in Monet for the Decomposed Storage Model (DSM), which stores data in binary (2-column) tables only is motivated by the fact that query-intensive access patterns often profit from a vertically fragmented physical data model, which minimizes both I/O and cache misses when queries touch many rows but few columns. The column-wise processing model followed in the MIL language allows for a ``RISC'' (Reduced Instruction Set) query processing algebra, whose operators have a very low degree of freedom, thus allowing for a CPU-wise highly efficient implementation (i.e. one that consists of predictable and independent instructions). Also, specific attention was paid in Monet in developing cache-conscious query processing algorithms, in particular the radix-algorithms for join processing. This thesis is a reference to the Monet system in all its detail, and also outlines an SQL front-end that uses Monet as a back-end, for constructing a full-fledged SQL compliant RDBMS including ACID properties.
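
The Decomposed Storage Model is simple to picture: instead of one array of records, every attribute becomes its own (oid, value) binary table, and operators work column-at-a-time over dense arrays. The tiny sketch below shows the layout and a column-wise selection; the names are mine, and MIL's actual operators are far richer.

# N-ary ("row store") representation of a 3-attribute table ...
rows = [(1, "shoe", 30), (2, "sock", 5), (3, "coat", 90)]

# ... and its DSM decomposition: one binary (oid, value) table per attribute.
oid_name  = [(1, "shoe"), (2, "sock"), (3, "coat")]
oid_price = [(1, 30), (2, 5), (3, 90)]

def select_gt(bat, threshold):
    """Column-wise selection: scan one binary table, return qualifying (oid, value) pairs."""
    return [(oid, v) for oid, v in bat if v > threshold]

def join_on_oid(left, right):
    """Reconstruct (oid, left_value, right_value) tuples for the qualifying oids."""
    rv = dict(right)
    return [(oid, lv, rv[oid]) for oid, lv in left if oid in rv]

expensive = select_gt(oid_price, 20)          # touches only the price column
print(join_on_oid(expensive, oid_name))       # [(1, 30, 'shoe'), (3, 90, 'coat')]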

ReportDOI
10 Jun 2002
TL;DR: In this article, the authors explore the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both.
Abstract: Modern high-end disk arrays often have several gigabytes of cache RAM. Unfortunately, most array caches use management policies which duplicate the same data blocks at both the client and array levels of the cache hierarchy: they are inclusive. Thus, the aggregate cache behaves as if it was only as big as the larger of the client and array caches, instead of as large as the sum of the two. Inclusiveness is wasteful: cache RAM is expensive. We explore the benefits of a simple scheme to achieve exclusive caching, in which a data block is cached at either a client or the disk array, but not both. Exclusiveness helps to create the effect of a single, large unified cache. We introduce a DEMOTE operation to transfer data ejected from the client to the array, and explore its effectiveness with simulation studies. We quantify the benefits and overheads of demotions across both synthetic and real-life workloads. The results show that we can obtain useful—sometimes substantial—speedups. During our investigation, we also developed some new cache-insertion algorithms that show promise for multiclient systems, and report on some of their properties.
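
The protocol is easy to mock up: the client cache, on evicting a block, DEMOTEs it to the array cache, and the array, on serving a read to the client, drops its own copy so that a block lives in exactly one of the two caches. The two-level LRU toy below is a sketch of that idea under made-up cache sizes, not the paper's simulator.

from collections import OrderedDict

class LRU(OrderedDict):
    def __init__(self, size):
        super().__init__()
        self.size = size
    def touch(self, key):
        self[key] = True
        self.move_to_end(key)
    def evict(self):
        return self.popitem(last=False)[0] if len(self) > self.size else None

client, array = LRU(3), LRU(3)

def read(block):
    if block in client:
        client.touch(block)
        return "client hit"
    hit = "array hit" if block in array else "disk read"
    array.pop(block, None)            # exclusivity: array forgets the block it just served
    client.touch(block)
    demoted = client.evict()          # client eviction triggers a DEMOTE ...
    if demoted is not None:
        array.touch(demoted)          # ... the array re-inserts it at its MRU end
        array.evict()                 # and may in turn evict its own LRU block
    return hit

for b in [1, 2, 3, 4, 1, 2]:
    print(b, read(b))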

Proceedings ArticleDOI
11 Aug 2002
TL;DR: This paper proposes several simple optimisations to well-known integer compression schemes, and shows experimentally that these lead to significant reductions in time, and concludes that fast byte-aligned codes should be used to store integers in inverted lists.
Abstract: Compression reduces both the size of indexes and the time needed to evaluate queries. In this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. First, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. Second, we explore the impact of choice of compression scheme on retrieval efficiency. In experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact Golomb-Rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the CPU cache is less for an appropriately compressed index than for an uncompressed index. Moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. We conclude that fast byte-aligned codes should be used to store integers in inverted lists.
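
The byte-aligned ("variable-byte") codes in question are simple: each integer is split into 7-bit chunks, one chunk per byte, with the high bit marking the final byte. A minimal encoder and decoder for this generic scheme (not the paper's tuned implementation) looks like this:

def vbyte_encode(n):
    """Encode one non-negative integer; a set high bit marks the last byte."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    """Decode a stream of vbyte-encoded integers."""
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

# Inverted lists are usually stored as d-gaps (differences between document numbers),
# which keeps the integers small and the codes short.
postings = [3, 17, 18, 150, 30000]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(len(encoded), "bytes; decoded gaps:", vbyte_decode(encoded))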

Patent
25 Jan 2002
TL;DR: In this paper, a cache handoff system for managing cacheable streaming content requested by a mobile node within a network architecture is disclosed, which includes a first caching proxy operable in the first subnet to supply a content stream in response to a request of the mobile node.
Abstract: A cache handoff system for managing cacheable streaming content requested by a mobile node within a network architecture is disclosed. The network architecture includes a first subnet and a second subnet. The cache handoff system includes a first caching proxy operable in the first subnet to supply a content stream in response to a request of the mobile node operable in the first subnet. In addition, the cache handoff system includes a second caching proxy operable in the second subnet. The first caching proxy may initiate a cache handoff of the request to the second caching proxy when the mobile node relocates to the second subnet. The second caching proxy may seamlessly continue to supply the requested content stream as a function of the cache handoff.

Posted Content
TL;DR: In this article, the idea of cache memory being used as a side-channel which leaks information during the run of a cryptographic algorithm has been investigated, and it has been shown that an attacker may be able to reveal or narrow the possible values of secret information held on the target device.
Abstract: We expand on the idea, proposed by Kelsey et al., of cache memory being used as a side-channel which leaks information during the run of a cryptographic algorithm. By using this side-channel, an attacker may be able to reveal or narrow the possible values of secret information held on the target device. We describe an attack which encrypts a set of chosen plaintexts on the target processor in order to collect cache profiles and then performs a number of computational steps to recover the key. As well as describing and simulating the theoretical attack, we discuss how hardware and algorithmic alterations can be used to defend against such techniques.

Patent
13 Aug 2002
TL;DR: In this paper, a switching fabric includes one or more fast paths for handling lightweight, common data operations and at least one control path for handling other data operations, and a locking mechanism is described for controlling access to data shared by the control paths.
Abstract: Described are techniques used in a computer system for handling data operations to storage devices. A switching fabric includes one or more fast paths for handling lightweight, common data operations and at least one control path for handling other data operations. A control path manages one or more fast paths. The fast path and the control path are utilized in mapping virtual to physical addresses using mapping tables. The mapping tables include an extent table of one or more entries corresponding to varying address ranges. The size of an extent may be changed dynamically in accordance with a corresponding state change of physical storage. The fast path may cache only portions of the extent table as needed in accordance with a caching technique. The fast path may cache a subset of the extent table stored within the control path. A set of primitives may be used in performing data operations. A locking mechanism is described for controlling access to data shared by the control paths.

Journal ArticleDOI
TL;DR: The results support the hypothesis that population differences in food caching, memory, and the hippocampus of black-capped chickadees from Alaska and Colorado reflect adaptations to a harsh environment.
Abstract: To test the hypothesis that accurate cache recovery is more critical for birds that live in harsh conditions where the food supply is limited and unpredictable, the authors compared food caching, memory, and the hippocampus of black-capped chickadees (Poecile atricapilla) from Alaska and Colorado. Under identical laboratory conditions, Alaska chickadees (a) cached significantly more food; (b) were more efficient at cache recovery; (c) performed more accurately on one-trial associative learning tasks in which birds had to rely on spatial memory, but did not differ when tested on a nonspatial version of this task; and (d) had significantly larger hippocampal volumes containing more neurons compared with Colorado chickadees. The results support the hypothesis that these population differences may reflect adaptations to a harsh environment. Some species of animals regularly cache food and rely on memory to retrieve their caches at a later time when supplies are less abundant (for a review, see Vander Wall, 1990). A Clark’s nutcracker (Nucifraga columbiana), for example, may cache about 33,000 seeds a year and remember cache locations for as long as 9 months (Balda & Kamil, 1992). Some boreal parids have been estimated to cache up to 500,000 food items per year (Brodin, 1994; Haftorn, 1956; Pravosudov, 1985). The number of items cached and the length of time they are left before recovery vary from species to species. One reason for this species difference may be that reliance on stored food may be greater for those living in harsher environments, where failure to recover food caches in the winter may result in death from starvation (Pravosudov & Grubb,

Journal ArticleDOI
01 May 2002
TL;DR: Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32- entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
Abstract: Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window designs can significantly degrade clock cycle time, undermining the benefits of increased parallelism. This paper presents a new instruction window design targeted at achieving the latency tolerance of large windows with the clock cycle time of small windows. The key observation is that instructions dependent on a long latency operation (e.g., cache miss) cannot execute until that source operation completes. These instructions are moved out of the conventional, small, issue queue to a much larger waiting instruction buffer (WIB). When the long latency operation completes, the instructions are reinserted into the issue queue. In this paper, we focus specifically on load cache misses and their dependent instructions. Simulations reveal that, for an 8-way processor, a 2K-entry WIB with a 32-entry issue queue can achieve speedups of 20%, 84%, and 50% over a conventional 32-entry issue queue for a subset of the SPEC CINT2000, SPEC CFP2000, and Olden benchmarks, respectively.
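
The WIB mechanism can be mocked up at an abstract level: instructions whose source is a pending load miss are parked in a large side buffer instead of occupying the small issue queue, and they are reinserted when the miss returns. The event-driven toy below (invented instruction strings, no timing model) is only meant to show that bookkeeping.

from collections import defaultdict, deque

issue_queue = deque(maxlen=4)          # small, fast structure near the execution units (toy capacity)
wib = defaultdict(list)                # large buffer: pending load -> parked dependent instructions
pending_misses = set()

def dispatch(insn, depends_on=None):
    if depends_on in pending_misses:
        wib[depends_on].append(insn)   # don't waste an issue-queue slot on a stalled instruction
    else:
        issue_queue.append(insn)

def load_miss(load_id):
    pending_misses.add(load_id)

def miss_returns(load_id):
    pending_misses.discard(load_id)
    for insn in wib.pop(load_id, []):  # reinsert dependents once the data is back
        issue_queue.append(insn)

load_miss("ld1")
dispatch("add r3,r2,r1")                     # independent: stays in the issue queue
dispatch("mul r5,r4,r4", depends_on="ld1")   # dependent: parked in the WIB
print("issue queue:", list(issue_queue), "| WIB:", dict(wib))
miss_returns("ld1")
print("issue queue:", list(issue_queue), "| WIB:", dict(wib))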

Proceedings ArticleDOI
03 Jun 2002
TL;DR: A simple extension to the existing federated features in DB2 UDB is presented, which enables a regular DB2 instance to become a DBCache without any application modification, and an extensive set of experiments with an E-Commerce benchmark is conducted to show the benefits of this approach and illustrate tradeoffs in caching considerations.
Abstract: While scaling up to the enormous and growing Internet population with unpredictable usage patterns, E-commerce applications face severe challenges in cost and manageability, especially for database servers that are deployed as those applications' backends in a multi-tier configuration. Middle-tier database caching is one solution to this problem. In this paper, we present a simple extension to the existing federated features in DB2 UDB, which enables a regular DB2 instance to become a DBCache without any application modification. On deployment of a DBCache at an application server, arbitrary SQL statements generated from the unchanged application that are intended for a backend database server, can be answered: at the cache, at the backend database server, or at both locations in a distributed manner. The factors that determine the distribution of workload include the SQL statement type, the cache content, the application requirement on data freshness, and cost-based optimization at the cache. We have developed a research prototype of DBCache, and conducted an extensive set of experiments with an E-Commerce benchmark to show the benefits of this approach and illustrate tradeoffs in caching considerations.

Patent
21 Feb 2002
TL;DR: In this paper, a distributed database caching system with the capability to support and accelerate read and update transactions to and from one or more central database management system (DBMS) servers for multiple concurrent users is described.
Abstract: A system and method are described for implementing a distributed database caching system with the capability to support and accelerate read and update transactions to and from one or more central Database Management System (DBMS) servers for multiple concurrent users. The system and method include a resource abstraction layer in a database client driver in communication with remote server units (RSUs) having a cache database. RSUs respond to user requests using the cache database if possible. If the cache database does not have the needed data, the RSU sends the request to a database subscription manager (DSM) in communication with the DBMS server. The DSM responds to the request and sends predicate data based on queries processed by the DBMS server for use in updating the cache databases.

Journal ArticleDOI
TL;DR: This 64-b microprocessor is the second-generation design of the new Itanium architecture, termed explicitly parallel instruction computing (EPIC), and seeks to extract maximum performance from EPIC by optimizing the memory system and execution resources for a combination of high bandwidth and low latency.
Abstract: This 64-b microprocessor is the second-generation design of the new Itanium architecture, termed explicitly parallel instruction computing (EPIC). The design seeks to extract maximum performance from EPIC by optimizing the memory system and execution resources for a combination of high bandwidth and low latency. This is achieved by tightly coupling microarchitecture choices to innovative circuit designs and the capabilities of the transistors and wires in the 0.18-µm bulk Al metal process. The key features of this design are: a short eight-stage pipeline, 11 sustainable issue ports (six integer, four floating point), half-cycle access level-1 caches, a 64-GB/s level-2 cache, and a 3-MB level-3 cache, all integrated on a 421 mm² die. The chip operates at over 1 GHz and is built on significant advances in CMOS circuits and methodologies. After providing an overview of the processor microarchitecture and design, this paper describes a few of these key enabling circuits and design techniques.

Book ChapterDOI
07 Mar 2002
TL;DR: This work describes Backslash, a collaborative web mirroring system run by a collective of web sites that wish to protect themselves from flash crowds and explores cache diffusion techniques for use in such a system and finds that probabilistic forwarding improves load distribution albeit not dramatically.
Abstract: Flash crowds can cripple a web site's performance. Since they are infrequent and unpredictable, these floods do not justify the cost of traditional commercial solutions. We describe Backslash, a collaborative web mirroring system run by a collective of web sites that wish to protect themselves from flash crowds. Backslash is built on a distributed hash table overlay and uses the structure of the overlay to cache aggressively a resource that experiences an uncharacteristically high request load. By redirecting requests for that resource uniformly to the created caches, Backslash helps alleviate the effects of flash crowds. We explore cache diffusion techniques for use in such a system and find that probabilistic forwarding improves load distribution albeit not dramatically.

Patent
21 Oct 2002
TL;DR: In this paper, a processor with at least two cores, where each of the cores include a first level cache memory, and each of these cores are multi-threaded is represented by a crossbar.
Abstract: In one embodiment, a processor is provided. The processor includes at least two cores, where each of the cores include a first level cache memory. Each of the cores are multi-threaded. In another embodiment, each of the cores includes four threads. In another embodiment a crossbar is included. A plurality of cache bank memories in communication with the cores through the crossbar is provided. Each of the plurality of cache bank memories are in communication with a main memory interface. In another embodiment a buffer switch core in communication with each of the plurality of cache bank memories is also included. A server and a method for optimizing the utilization of a multithreaded processor core are also provided.

Journal ArticleDOI
TL;DR: The partitioned hash-join is refined with a new partitioning algorithm called radix-cluster, which is specifically designed to optimize memory access, and the effect of implementation techniques that optimize CPU resource usage is investigated.
Abstract: In the past decade, the exponential growth in commodity CPU's speed has far outpaced advances in memory latency. A second trend is that CPU performance advances are not only brought by increased clock rates, but also by increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called "radix-cluster", which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is proven by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
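
The radix-cluster idea is to partition both join inputs on the low bits of a hash of the join key, in one or more passes, so that each pair of matching partitions and its hash table fit in the cache before any probing starts. A single-pass sketch, with the bit count chosen arbitrarily, is shown below; the paper's version adds multi-pass clustering and careful cost modelling.

def radix_cluster(tuples, key, bits):
    """Split tuples into 2**bits clusters on the low bits of a hash of the join key."""
    clusters = [[] for _ in range(1 << bits)]
    for t in tuples:
        clusters[hash(key(t)) & ((1 << bits) - 1)].append(t)
    return clusters

def radix_join(left, right, bits=4):
    out = []
    lclusters = radix_cluster(left, lambda t: t[0], bits)
    rclusters = radix_cluster(right, lambda t: t[0], bits)
    for lc, rc in zip(lclusters, rclusters):      # only matching clusters can produce results
        table = {}
        for k, v in lc:
            table.setdefault(k, []).append(v)     # small, cache-resident hash table
        for k, w in rc:
            out.extend((k, v, w) for v in table.get(k, []))
    return out

L = [(i, f"l{i}") for i in range(8)]
R = [(i, f"r{i}") for i in range(0, 8, 2)]
print(radix_join(L, R))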

Journal ArticleDOI
TL;DR: A locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor and improves the performance of work stealing up to 80%.
Abstract: This paper studies the data locality of the work-stealing scheduling algorithm on hardware-controlled shared-memory machines, where movement of data to and from the cache is solely controlled by the hardware. We present lower and upper bounds on the number of cache misses when using work stealing, and introduce a locality-guided work-stealing algorithm and its experimental validation. As a lower bound, we show that a work-stealing application that exhibits good data locality on a uniprocessor may exhibit poor data locality on a multiprocessor. In particular, we show a family of multithreaded computations G_n whose members perform Θ(n) operations (work) and incur a constant number of cache misses on a uniprocessor, while even on two processors the total number of cache misses soars to Ω(n). On the other hand, we show a tight upper bound on the number of cache misses that nested-parallel computations, a large, important class of computations, incur due to multiprocessing. In particular, for nested-parallel computations, we show that on P processors a multiprocessor execution incurs an expected O(C⌈m/s⌉PT∞) more misses than the uniprocessor execution. Here m is the execution time of an instruction incurring a cache miss, s is the steal time, C is the size of the cache, and T∞ is the number of nodes on the longest chain of dependencies. Based on this we give strong execution-time bounds for nested-parallel computations using work stealing. For the second part of our results, we present a locality-guided work-stealing algorithm that improves the data locality of multithreaded computations by allowing a thread to have an affinity for a processor. Our initial experiments on iterative data-parallel applications show that the algorithm matches the performance of static partitioning under traditional work loads but improves the performance up to 50% over static partitioning under multiprogrammed work loads. Furthermore, locality-guided work stealing improves the performance of work stealing up to 80%.
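
A very small model of the locality-guided variant: each worker keeps, besides its ordinary deque, a "mailbox" into which tasks with affinity for it are pushed, and it drains the mailbox before its own deque or any steal. This sequential sketch (round-robin scheduling, invented task strings, affinity tasks placed only in mailboxes) is a simplification meant to illustrate that priority order, not a real multiprocessor runtime.

import random
from collections import deque

P = 4
deques = [deque() for _ in range(P)]     # ordinary work-stealing deques
mailboxes = [deque() for _ in range(P)]  # affinity mailboxes (the locality-guided addition)

def spawn(task, spawner, affinity=None):
    if affinity is None:
        deques[spawner].append(task)
    else:
        mailboxes[affinity].append(task)     # simplified: affinity tasks go to the mailbox only

def step(worker):
    if mailboxes[worker]:
        return "mailbox", mailboxes[worker].popleft()
    if deques[worker]:
        return "own deque", deques[worker].pop()
    victims = [v for v in range(P) if v != worker and deques[v]]
    if victims:
        return "stolen", deques[random.choice(victims)].popleft()
    return "idle", None

# Data-parallel iteration: task i prefers the worker that touched chunk i last time.
for i in range(8):
    spawn(f"chunk{i}", spawner=0, affinity=i % P)

for worker in [0, 1, 2, 3, 0, 1, 2, 3]:
    print(worker, step(worker))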

Proceedings ArticleDOI
07 Nov 2002
TL;DR: This work proposes an enhanced-clustering cache replacement scheme for use in place of LRU, which improved the request hit ratio dramatically while keeping the small average hops per successful request comparable to LRU.
Abstract: Efficient data retrieval in a peer-to-peer system like Freenet is a challenging problem. We study the impact of cache replacement policy on the performance of Freenet. We find that, with Freenet's LRU (least recently used) cache replacement, there is a steep reduction in the hit ratio with increasing load. Based on intuition from the small-world models and the recent theoretical results by Kleinberg, we propose an enhanced-clustering cache replacement scheme for use in place of LRU. Such a replacement scheme forces the routing tables to resemble neighbor relationships in a small-world acquaintance graph - clustering with light randomness. In our simulation, this new scheme improved the request hit ratio dramatically while keeping the small average hops per successful request comparable to LRU. A simple, highly idealized model of Freenet under clustering with light randomness proves that the expected message delivery time in Freenet is O(log² n) if the routing tables satisfy the small-world model and have size Θ(log² n).
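
As a rough reading of the proposed replacement scheme: each node fixes a random "seed" location in the key space and, when its datastore is full, usually evicts the cached key farthest from that seed, but with small probability evicts a random key instead, so the routing state ends up clustered around the seed with a few long-range shortcuts. The distance function, the probability, and the data layout below are assumptions made for illustration, not the paper's exact construction.

import random

KEY_SPACE = 2 ** 16
seed = random.randrange(KEY_SPACE)       # this node's fixed random location
store = {}                               # key -> data held by this node
CAPACITY = 8
RANDOMNESS = 0.1                         # small chance of a random eviction ("light randomness")

def circular_distance(a, b):
    d = abs(a - b)
    return min(d, KEY_SPACE - d)

def insert(key, data):
    if len(store) >= CAPACITY:
        if random.random() < RANDOMNESS:
            victim = random.choice(list(store))            # keep a few long-range entries
        else:
            victim = max(store, key=lambda k: circular_distance(k, seed))  # cluster near the seed
        del store[victim]
    store[key] = data

for _ in range(200):
    insert(random.randrange(KEY_SPACE), "blob")
print("seed:", seed, "| stored keys:", sorted(store))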

Proceedings ArticleDOI
21 May 2002
TL;DR: This paper studies the traffic patterns of Gnutella, a popular large-scale peer-to-peer system, shows that traffic patterns are very bursty even over several time scales, and proposes simple Gnutella caching mechanisms that cache query responses.
Abstract: Peer-to-peer computing and networking, an emerging model of communication and computation, has recently started to gain significant acceptance. This model not only enables clients to take a more active role in the information dissemination process, but also may significantly increase the performance and reliability of the overall system, by eliminating the traditional notion of the "server" which could be a single point of failure, and a potential bottleneck. Although peer-to-peer systems enjoy significant and continually increasing popularity, we still do not have a clear understanding of the magnitude, the traffic patterns, and the potential performance bottlenecks of the recent peer-to-peer networks. In this paper we study the traffic patterns of Gnutella, a popular large-scale peer-to-peer system, and show that traffic patterns are very bursty even over several time scales. We especially focus on the types of the queries submitted by Gnutella peers, and their associated replies. We show that the queries submitted exhibit significant amounts of locality, that is, queries tend to be frequently and repeatedly submitted. To capitalize on this locality, we propose simple Gnutella caching mechanisms that cache query responses. Using trace-driven simulation we evaluate the effectiveness of Gnutella caching and show that it improves performance by as much as a factor of two.
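
The caching mechanism being proposed is essentially memoization of query strings at each peer: remember the responses recently returned for a query and, within some freshness window, answer repeats locally instead of re-flooding them. A tiny sketch with an invented freshness window and a stand-in for the Gnutella flood:

import time

CACHE_TTL = 60.0          # seconds a cached set of responses is considered fresh (assumed)
query_cache = {}          # normalized query -> (timestamp, responses)

def flood_to_neighbors(query):
    # Stand-in for forwarding the query over Gnutella and collecting QueryHit messages.
    return [f"peer{i}:{query}.mp3" for i in range(3)]

def handle_query(query):
    q = " ".join(query.lower().split())          # normalize, since queries repeat verbatim or nearly so
    now = time.time()
    if q in query_cache and now - query_cache[q][0] < CACHE_TTL:
        return query_cache[q][1]                 # answered from the cache: no flooding
    responses = flood_to_neighbors(q)
    query_cache[q] = (now, responses)
    return responses

print(handle_query("Free Music"))
print(handle_query("free  music"))               # repeated query served from the cache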