
Showing papers on "Cache published in 2005"


Posted Content
TL;DR: In this article, the authors describe side-channel attacks based on inter-process leakage through the state of the CPU's memory cache, which can be used for cryptanalysis of cryptographic primitives that employ data-dependent table lookups.
Abstract: We describe several software side-channel attacks based on inter-process leakage through the state of the CPU’s memory cache. This leakage reveals memory access patterns, which can be used for cryptanalysis of cryptographic primitives that employ data-dependent table lookups. The attacks allow an unprivileged process to attack other processes running in parallel on the same processor, despite partitioning methods such as memory protection, sandboxing and virtualization. Some of our methods require only the ability to trigger services that perform encryption or MAC using the unknown key, such as encrypted disk partitions or secure network links. Moreover, we demonstrate an extremely strong type of attack, which requires knowledge of neither the specific plaintexts nor ciphertexts, and works by merely monitoring the effect of the cryptographic process on the cache. We discuss in detail several such attacks on AES, and experimentally demonstrate their applicability to real systems, such as OpenSSL and Linux’s dm-crypt encrypted partitions (in the latter case, the full key can be recovered after just 800 writes to the partition, taking 65 milliseconds). Finally, we describe several countermeasures for mitigating such attacks.
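
To make the leakage channel concrete, below is a minimal toy simulation (not the paper's attack code) of a prime-and-probe style measurement: the attacker fills every cache set with its own data, the victim performs one key-dependent table lookup, and the attacker then sees which set lost its contents. The set count, the single-lookup victim, and the ToyCache model are illustrative assumptions.

```python
# Toy model of cache-set leakage from a key-dependent table lookup.
# This simulates the principle only; it is not an attack on a real CPU cache.
NUM_SETS = 64                       # cache sets covered by the lookup table (assumed)

class ToyCache:
    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.owner = {}             # set index -> whose data last filled it

    def touch(self, set_index, who):
        self.owner[set_index] = who

def victim_lookup(cache, secret_key_byte, plaintext_byte):
    # One AES-like T-table access: the table index, and therefore the cache
    # set it maps to, depends on plaintext XOR key.
    cache.touch((plaintext_byte ^ secret_key_byte) % NUM_SETS, "victim")

def prime_probe_round(cache, secret_key_byte, plaintext_byte):
    for s in range(NUM_SETS):       # PRIME: fill every set with attacker data
        cache.touch(s, "attacker")
    victim_lookup(cache, secret_key_byte, plaintext_byte)
    # PROBE: any set that no longer holds attacker data was touched by the victim.
    return [s for s in range(NUM_SETS) if cache.owner[s] != "attacker"]

cache = ToyCache(NUM_SETS)
evicted = prime_probe_round(cache, secret_key_byte=0x2B, plaintext_byte=0x00)
print("victim touched sets", evicted, "-> key byte mod", NUM_SETS, "is", evicted[0])
```

A real attack measures probe latencies over many encryptions and aggregates them statistically; the toy inspects cache state directly only to keep the example short.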

1,109 citations


Journal ArticleDOI
TL;DR: The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.
Abstract: The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. This is an entirely new implementation of the Sparc V9 architectural specification, which exploits large amounts of on-chip parallelism to provide high throughput. The hardware supports 32 threads with a memory subsystem consisting of an on-board crossbar, level-2 cache, and memory controllers for a highly integrated design that exploits the thread-level parallelism inherent to server applications, while targeting low levels of power consumption.

1,053 citations


Proceedings ArticleDOI
12 Feb 2005
TL;DR: Three performance models are proposed that predict the impact of cache sharing on co-scheduled threads and the most accurate model, the inductive probability model, achieves an average error of only 3.9%.
Abstract: This paper studies the impact of L2 cache sharing on threads that simultaneously share the cache, on a chip multi-processor (CMP) architecture. Cache sharing impacts threads nonuniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as sub-optimal throughput, cache thrashing, and thread starvation for threads that fail to occupy sufficient cache space to make good progress. Unfortunately, there is no existing model that allows extensive investigation of the impact of cache sharing. To allow such a study, we propose three performance models that predict the impact of cache sharing on co-scheduled threads. The input to our models is the isolated L2 cache stack distance or circular sequence profile of each thread, which can be easily obtained on-line or off-line. The output of the models is the number of extra L2 cache misses for each thread due to cache sharing. The models differ by their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation that implements a dual-core CMP architecture, on fourteen pairs of mostly SPEC benchmarks. The most accurate model, the inductive probability model, achieves an average error of only 3.9%. Finally, to demonstrate the usefulness and practicality of the model, a case study that details the relationship between an application's temporal reuse behavior and its cache sharing impact is presented.
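
As a rough illustration of how a stack distance profile turns into a miss count, the sketch below uses the crudest possible estimator, giving each thread a share of the cache ways proportional to its access rate. This is far simpler and less accurate than the paper's inductive probability model; the associativity and the example profiles are made up.

```python
# Estimate extra misses per thread from isolated stack-distance profiles.
# profile[d] = number of accesses with LRU stack distance d within a set;
# the final bucket holds accesses with distance >= ASSOC.
ASSOC = 8   # associativity of the shared L2 (assumed)

def misses(profile, effective_ways):
    """Accesses whose stack distance exceeds the space available to the thread."""
    return sum(count for d, count in enumerate(profile) if d >= effective_ways)

def extra_misses_when_shared(profiles):
    """Crude proportional-share estimate (not the paper's inductive model):
    each thread keeps a number of ways proportional to its access frequency."""
    totals = [sum(p) for p in profiles]
    grand_total = sum(totals)
    extra = []
    for profile, total in zip(profiles, totals):
        eff_ways = max(1, round(ASSOC * total / grand_total))
        extra.append(misses(profile, eff_ways) - misses(profile, ASSOC))
    return extra

# Example: thread A reuses data within a few ways, thread B streams widely.
thread_a = [50, 30, 10, 5, 3, 1, 1, 0, 10]       # last bucket = distance >= 8
thread_b = [10, 10, 10, 10, 10, 10, 10, 10, 120]
print(extra_misses_when_shared([thread_a, thread_b]))   # e.g. [10, 30]
```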

543 citations


Proceedings ArticleDOI
12 Feb 2005
TL;DR: A hardware implementation of unbounded transactional memory, called UTM, is described, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory.
Abstract: Hardware transactional memory should support unbounded transactions: transactions of arbitrary size and duration. We describe a hardware implementation of unbounded transactional memory, called UTM, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory. We performed a cycle-accurate simulation of a simplified architecture, called LTM. LTM is based on UTM but is easier to implement, because it does not change the memory subsystem outside of the processor. LTM allows nearly unbounded transactions, whose footprint is limited only by physical memory size and whose duration by the length of a timeslice. We assess UTM and LTM through microbenchmarking and by automatically converting the SPECjvm98 Java benchmarks and the Linux 2.4.19 kernel to use transactions instead of locks. We use both cycle-accurate simulation and instrumentation to understand benchmark behavior. Our studies show that the common case is small transactions that commit, even when contention is high, but that some applications contain very large transactions. For example, although 99.9% of transactions in the Linux study touch 54 cache lines or fewer, some transactions touch over 8000 cache lines. Our studies also indicate that hardware support is required, because some applications spend over half their time in critical regions. Finally, they suggest that hardware support for transactions can make Java programs run faster than when run using locks and can increase the concurrency of the Linux kernel by as much as a factor of 4 with no additional programming work.
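
The hardware mechanisms of UTM and LTM cannot be reproduced in a short listing, but the programming-model change they support, replacing a lock-protected critical section with a transaction that retries on conflict, can be sketched in software. The optimistic read-validate-commit loop below is a toy software analogue under assumed names, not the paper's hardware design.

```python
import threading

# A tiny optimistic "transaction" over a versioned shared store: read a
# snapshot, compute without holding locks, and commit only if no other
# transaction committed in the meantime. Conflicts abort and retry, loosely
# mirroring how a hardware transaction would be squashed and re-executed.
_store = {"counter": 0}
_version = 0
_commit_lock = threading.Lock()          # serializes commits only, not the whole body

def atomic(update_fn, retries=1000):
    global _version
    for _ in range(retries):
        start_version = _version
        snapshot = dict(_store)          # transactional read set
        new_values = update_fn(snapshot) # speculative work
        with _commit_lock:
            if _version == start_version:    # validate: no conflicting commit
                _store.update(new_values)
                _version += 1
                return True
        # conflict detected: abort and retry
    return False

def increment(snapshot):
    return {"counter": snapshot["counter"] + 1}

threads = [threading.Thread(target=lambda: atomic(increment)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(_store["counter"])                 # 8: every transaction eventually commits
```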

447 citations


Patent
18 Feb 2005
TL;DR: In this paper, the authors present an architecture that provides capabilities to transport and process Internet Protocol (IP) packets from Layer 2 through transport protocol layer and may also provide packet inspection through Layer 7.
Abstract: An architecture provides capabilities to transport and process Internet Protocol (IP) packets from Layer 2 through transport protocol layer and may also provide packet inspection through Layer 7. A set of engines may perform pass-through packet classification, policy processing and/or security processing enabling packet streaming through the architecture at nearly the full line rate. A scheduler schedules packets to packet processors for processing. An internal memory or local session database cache stores a session information database for a certain number of active sessions. The session information that is not in the internal memory is stored and retrieved to/from an additional memory. An application running on an initiator or target can in certain instantiations register a region of memory, which is made available to its peer(s) for access directly without substantial host intervention through RDMA data transfer. A security system is also disclosed that enables a new way of implementing security capabilities inside enterprise networks in a distributed manner using a protocol processing hardware with appropriate security features.

406 citations


Journal ArticleDOI
01 May 2005
TL;DR: Examination of the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multi-core design.
Abstract: This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have a significant effect on the rest of the chip, potentially consuming a significant fraction of the real estate and power budget. This research shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multi-core design. Several examples are presented showing the need for careful co-design. For instance, increasing interconnect bandwidth requires area that then constrains the number of cores or cache sizes, and does not necessarily increase performance. Also, shared level-2 caches become significantly less attractive when the overhead of the resulting crossbar is accounted for. A hierarchical bus structure is examined which negates some of the performance costs of the assumed baseline architecture.

402 citations


Patent
31 Oct 2005
TL;DR: In this paper, the edge DNS cache servers are published as the authoritative servers for customer domains instead of the origin server, and when a request for a DNS record results in a cache miss, the edge cache servers get the information from the origin servers and cache it for use in response to future requests.
Abstract: A distributed DNS network includes a central origin server that actually controls the zone, and edge DNS cache servers configured to cache the DNS content of the origin server. The edge DNS cache servers are published as the authoritative servers for customer domains instead of the origin server. When a request for a DNS record results in a cache miss, the edge DNS cache servers get the information from the origin server and cache it for use in response to future requests. Multiple edge DNS cache servers can be deployed at multiple locations. Since an unlimited number of edge DNS cache servers can be deployed, the system is highly scalable. The disclosed techniques protect against DoS attacks, as DNS requests are not made to the origin server directly.
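
A minimal sketch of the miss path described above: the edge server answers from its cache while a record is fresh, and otherwise fetches it from the origin and caches it with a TTL. The fetch_from_origin callable and the record format are hypothetical stand-ins for the real origin query.

```python
import time

class EdgeDnsCache:
    """Toy edge DNS cache: answer from cache, fetch from the origin on a miss."""
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin     # callable(name, rtype) -> (record, ttl)
        self._cache = {}                    # (name, rtype) -> (record, expires_at)

    def resolve(self, name, rtype="A"):
        key = (name, rtype)
        entry = self._cache.get(key)
        if entry and entry[1] > time.time():        # cache hit, still fresh
            return entry[0]
        record, ttl = self._fetch(name, rtype)      # cache miss: ask the origin
        self._cache[key] = (record, time.time() + ttl)
        return record

def fetch_from_origin(name, rtype):
    # Hypothetical origin zone used for illustration only.
    zone = {("www.example.com", "A"): ("192.0.2.10", 300)}
    return zone[(name, rtype)]

edge = EdgeDnsCache(fetch_from_origin)
print(edge.resolve("www.example.com"))   # miss: fetched from the origin, then cached
print(edge.resolve("www.example.com"))   # served from the edge cache
```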

370 citations


14 Oct 2005
TL;DR: In this article, the authors examined the potential of using the STI Cell processor as a building block for future high-end computing systems and proposed modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Abstract: The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. We are the first to present quantitative Cell performance data on scientific kernels and show direct comparisons against leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1) architectures. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop both analytical models and simulators to predict kernel performance. Our work also explores the complexity of mapping several important scientific algorithms onto the Cell's unique architecture. Additionally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

346 citations


Journal ArticleDOI
01 May 2005
TL;DR: This paper presents a new cache management policy, victim replication, which combines the advantages of private and shared schemes, and shows that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks.
Abstract: In this paper, we consider tiled chip multiprocessors (CMP) where each tile contains a slice of the total on-chip L2 cache storage and tiles are connected by an on-chip network. The L2 slices can be managed using two basic schemes: 1) each slice is treated as a private L2 cache for the tile; 2) all slices are treated as a single large L2 cache shared by all tiles. Private L2 caches provide the lowest hit latency but reduce the total effective cache capacity, as each tile creates local copies of any line it touches. A shared L2 cache increases the effective cache capacity for shared data, but incurs long hit latencies when L2 data is on a remote tile. We present a new cache management policy, victim replication, which combines the advantages of private and shared schemes. Victim replication is a variant of the shared scheme which attempts to keep copies of local primary cache victims within the local L2 cache slice. Hits to these replicated copies reduce the effective latency of the shared L2 cache, while retaining the benefits of a higher effective capacity for shared data. We evaluate the various schemes using full-system simulation of both single-threaded and multi-threaded benchmarks running on an 8-processor tiled CMP. We show that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance than either private or shared schemes.
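
A toy model of the policy (not the authors' simulator): on an L1 eviction the victim line may be replicated into the local L2 slice, and an L2 lookup checks the local slice before the line's home tile. The slice capacity, the static home mapping, and the rule for choosing which replica to overwrite are assumptions.

```python
import random

class Tile:
    """Toy tile whose local L2 slice holds both home lines and replicas."""
    def __init__(self, slice_capacity=4):
        self.capacity = slice_capacity
        self.lines = {}                  # address -> "home" or "replica"

    def replicate(self, addr):
        # Victim replication: keep a copy of an evicted L1 victim locally,
        # sacrificing an existing replica rather than a home line if full.
        if addr in self.lines:
            return
        if len(self.lines) < self.capacity:
            self.lines[addr] = "replica"
            return
        replicas = [a for a, kind in self.lines.items() if kind == "replica"]
        if replicas:                     # choose an arbitrary replica (assumed rule)
            del self.lines[random.choice(replicas)]
            self.lines[addr] = "replica"
        # else: slice is full of home lines, so do not replicate

def l2_access(tiles, tile_id, addr, home_of):
    """Return the latency class of an L2 access under victim replication."""
    if addr in tiles[tile_id].lines:
        return "local-hit"               # replica or home line on this tile
    home = tiles[home_of(addr)]
    if addr in home.lines:
        return "remote-hit"
    home.lines[addr] = "home"            # fill at the home slice
    return "miss"

tiles = [Tile() for _ in range(8)]
home_of = lambda addr: addr % 8          # static home mapping (assumed)
print(l2_access(tiles, 0, 17, home_of))  # "miss": line filled at its home tile 1
tiles[0].replicate(17)                   # L1 on tile 0 evicts 17: keep a local replica
print(l2_access(tiles, 0, 17, home_of))  # "local-hit" via the replica
```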

331 citations


Journal ArticleDOI
01 May 2005
TL;DR: This work proposes controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy, and proposes capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand.
Abstract: Chip multiprocessors (CMPs) substantially increase capacity pressure on the on-chip memory hierarchy while requiring fast access. Neither private nor shared caches can provide both large capacity and fast access in CMPs. We observe that compared to symmetric multiprocessors (SMPs), CMPs change the latency-capacity tradeoff in two significant ways. We propose three novel ideas to exploit the changes: (1) Though placing copies close to requestors allows fast access for read-only sharing, the copies also reduce the already-limited on-chip capacity in CMPs. We propose controlled replication to reduce capacity pressure by not making extra copies in some cases, and obtaining the data from an existing on-chip copy. This option is not suitable for SMPs because obtaining data from another processor is expensive and capacity is not limited to on-chip storage. (2) Unlike SMPs, CMPs allow fast on-chip communication between processors for read-write sharing. Instead of incurring slow access to read-write shared data through coherence misses as do SMPs, we propose in-situ communication to provide fast access without making copies or incurring coherence misses. (3) Accessing neighbors' caches is not as expensive in CMPs as it is in SMPs. We propose capacity stealing in which private data that exceeds a core's capacity is placed in a neighboring cache with less capacity demand. To incorporate our ideas, we use a hybrid of private, per-processor tag arrays and a shared data array. Because the shared data array is slow, we employ non-uniform access and distance associativity from previous proposals to hold frequently-accessed data in regions close to the requestor. We extend the previously-proposed Non-uniform access with Replacement And Placement usIng Distance associativity (NuRAPID) to CMPs, and call our cache CMP-NuRAPID. Our results show that for a 4-core CMP with 8 MB cache, CMP-NuRAPID improves performance by 13% over a shared cache and 8% over private caches for three commercial multithreaded workloads.

279 citations


Patent
26 Oct 2005
TL;DR: In this article, an apparatus and method for enhancing the infrastructure of a network such as the Internet is disclosed, where multiple edge servers and edge caches are provided at the edge of the network so as to cover and monitor all points of presence.
Abstract: An apparatus and method for enhancing the infrastructure of a network such as the Internet is disclosed. Multiple edge servers and edge caches are provided at the edge of the network so as to cover and monitor all points of presence. The edge servers selectively intercept domain name translation requests generated by downstream clients, coupled to the monitored points of presence, to subscribing Web servers and provide translations which either enhance content delivery services or redirect the requesting client to the edge cache to make its content requests. Further, network traffic monitoring is provided in order to detect malicious or otherwise unauthorized data transmissions.

Journal ArticleDOI
C. McNairy1, Rohit Bhatia1
TL;DR: Intel's Montecito is the first Itanium processor to feature duplicate, dual-thread cores and cache hierarchies on a single die, and it features a landmark 1.72 billion transistors and server-focused technologies.
Abstract: Intel's Montecito is the first Itanium processor to feature duplicate, dual-thread cores and cache hierarchies on a single die. It features a landmark 1.72 billion transistors and server-focused technologies, and it requires only 100 watts of power. Intel's Itanium 2 processor series has regularly delivered additional performance through increased frequency and cache capacity, as evidenced by the 6-Mbyte and 9-Mbyte versions.

Patent
07 Mar 2005
TL;DR: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies as mentioned in this paper, such as write-through, write and read-look-ahead.
Abstract: A buffer cache interposed between a non-volatile memory and a host may be partitioned into segments that may operate with different policies. Cache policies include write-through, write-back, and read-look-ahead. Write-through and write-back policies may improve speed. Read-look-ahead cache allows more efficient use of the bus between the buffer cache and non-volatile memory. A session command allows data to be maintained in volatile memory by guaranteeing against power loss.
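
A sketch of the segmentation idea under assumed segment names, with a dict standing in for the non-volatile memory: each segment carries its own policy, so a write-through segment reaches flash immediately, a write-back segment defers writes until a flush, and a read-look-ahead segment prefetches the next sector.

```python
# Toy segmented buffer cache; segment names, sizes, and the flash stand-in
# are illustrative assumptions, not the patented design.
class Segment:
    def __init__(self, policy):
        self.policy = policy     # "write-through", "write-back", or "read-look-ahead"
        self.data = {}           # sector -> bytes
        self.dirty = set()

class BufferCache:
    def __init__(self, nvm):
        self.nvm = nvm           # dict standing in for the non-volatile memory
        self.segments = {
            "control": Segment("write-through"),
            "user-data": Segment("write-back"),
            "sequential": Segment("read-look-ahead"),
        }

    def write(self, seg_name, sector, value):
        seg = self.segments[seg_name]
        seg.data[sector] = value
        if seg.policy == "write-through":
            self.nvm[sector] = value     # safe but touches flash on every write
        else:
            seg.dirty.add(sector)        # write-back: flush later

    def read(self, seg_name, sector):
        seg = self.segments[seg_name]
        if sector not in seg.data:
            seg.data[sector] = self.nvm[sector]
            if seg.policy == "read-look-ahead" and (sector + 1) in self.nvm:
                seg.data.setdefault(sector + 1, self.nvm[sector + 1])  # prefetch
        return seg.data[sector]

    def flush(self):
        for seg in self.segments.values():
            for sector in seg.dirty:
                self.nvm[sector] = seg.data[sector]
            seg.dirty.clear()

nvm = {0: b"boot", 1: b"fat", 2: b"file-a", 3: b"file-b"}
bc = BufferCache(nvm)
bc.write("control", 0, b"boot2")         # write-through: nvm updated immediately
bc.write("user-data", 2, b"file-a2")     # write-back: deferred until flush()
print(nvm[0], nvm[2])                    # b'boot2' b'file-a'
bc.flush()
print(nvm[2])                            # b'file-a2'
```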

Proceedings ArticleDOI
12 Nov 2005
TL;DR: A novel algorithm to solve dense linear systems using graphics processors (GPUs) by reducing matrix decomposition and row operations to a series of rasterization problems on the GPU and demonstrating that the commodity GPU is a useful co-processor for many scientific applications.
Abstract: We present a novel algorithm to solve dense linear systems using graphics processors (GPUs). We reduce matrix decomposition and row operations to a series of rasterization problems on the GPU. These include new techniques for streaming index pairs, swapping rows and columns and parallelizing the computation to utilize multiple vertex and fragment processors. We also use appropriate data representations to match the rasterization order and cache technology of graphics processors. We have implemented our algorithm on different GPUs and compared the performance with optimized CPU implementations. In particular, our implementation on an NVIDIA GeForce 7800 GPU outperforms a CPU-based ATLAS implementation. Moreover, our results show that our algorithm is cache and bandwidth efficient and scales well with the number of fragment processors within the GPU and the core GPU clock rate. We use our algorithm for fluid flow simulation and demonstrate that the commodity GPU is a useful co-processor for many scientific applications.

Patent
13 Jun 2005
TL;DR: A peer-to-peer name resolution protocol (PNRP) is proposed in this paper, which allows resolution of names which are mapped onto the circular number space through a hash function.
Abstract: A serverless name resolution protocol ensures convergence despite the size of the network, without requiring an ever-increasing cache and with a reasonable number of hops. This convergence is ensured through a multi-level cache and a proactive cache initialization strategy. The multi-level cache is built based on a circular number space. Each level contains information from different levels of slivers of the circular space. A mechanism is included to add a level to the multi-level cache when the node determines that the last level is full. A peer-to-peer name resolution protocol (PNRP) includes a mechanism to allow resolution of names which are mapped onto the circular number space through a hash function. Further, the PNRP may also operate with the domain name system by providing each node with an identification consisting of a domain name service (DNS) component and a unique number.
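
The multi-level cache itself can be sketched directly from this description: each level covers a progressively narrower sliver of the circular number space around the node's own ID, and a new level is appended when the last one fills. The space size, level capacity, and narrowing factor below are assumptions, and replacement within a full level is omitted.

```python
# Toy PNRP-style multi-level cache over a circular 2**32 number space.
SPACE = 2 ** 32
LEVEL_SIZE = 20      # max entries per level (assumed)
NARROWING = 10       # each level covers 1/NARROWING of the previous sliver (assumed)

def circular_distance(a, b):
    d = abs(a - b) % SPACE
    return min(d, SPACE - d)

class MultiLevelCache:
    def __init__(self, my_id):
        self.my_id = my_id
        self.levels = [{}]                       # level i maps peer id -> address

    def sliver_radius(self, level):
        return (SPACE // 2) // (NARROWING ** level)

    def insert(self, peer_id, address):
        dist = circular_distance(peer_id, self.my_id)
        # deepest existing level whose sliver around my_id still contains the peer
        depth = max(i for i in range(len(self.levels)) if dist <= self.sliver_radius(i))
        level = self.levels[depth]
        if len(level) < LEVEL_SIZE:
            level[peer_id] = address
        elif depth == len(self.levels) - 1:
            # the last level is full: grow the cache by one narrower level
            self.levels.append({})
            if dist <= self.sliver_radius(depth + 1):
                self.levels[-1][peer_id] = address
        # otherwise the covering level is full; a real cache would apply replacement

    def next_hop(self, target_id):
        # route toward the cached peer whose id is circularly closest to the target
        peers = [p for level in self.levels for p in level]
        return min(peers, key=lambda p: circular_distance(p, target_id), default=None)

cache = MultiLevelCache(my_id=123_456_789)
cache.insert(3_000_000_000, "10.0.0.2")
cache.insert(123_460_000, "10.0.0.3")       # numerically close peer
print(cache.next_hop(123_456_000))          # 123460000: closest known id wins
```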

Proceedings ArticleDOI
17 Sep 2005
TL;DR: Several compiler techniques that aim at automatically generating high quality codes over a wide range of heterogeneous parallelism available on the CELL processor are described and results indicate that significant speedup can be achieved with a high level of support from the compiler.
Abstract: Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double precision floating points up to 16 bytes per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high quality codes over a wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.

Patent
02 Dec 2005
TL;DR: In this paper, a runtime adaptable security processor (6614) is proposed that transports and processes Internet Protocol (IP) packets from Layer 2 through transport protocol layer and may also provide packet inspection through Layer 7.
Abstract: A runtime adaptable security processor (6614) is disclosed. The processor architecture provides capabilities to transport and process Internet Protocol (IP) packets from Layer 2 through transport protocol layer and may also provide packet inspection through Layer 7. Further, a runtime adaptable processor is coupled to the protocol processing hardware (6615) and may be dynamically adapted to perform hardware tasks as per the needs of the network traffic being sent or received and/or the policies programmed or services or applications being supported. A set of engines (6603-6606) may perform pass-through packet classification, policy processing and/or security processing enabling packet streaming through the architecture at nearly the full line rate. A high performance content search and rules processing security processor is disclosed which may be used for application layer and network layer security. A scheduler schedules packets to packet processors for processing. An internal memory or local session database cache stores a session information database for a certain number of active sessions. The session information that is not in the internal memory is stored and retrieved to/from an additional memory. An application running on an initiator or target can in certain instantiations register a region of memory, which is made available to its peer(s) for access directly without substantial host intervention through RDMA data transfer. A security system is also disclosed that enables a new way of implementing security capabilities inside enterprise networks in a distributed manner using a protocol processing hardware with appropriate security features.

Proceedings ArticleDOI
20 Jun 2005
TL;DR: It is demonstrated that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.
Abstract: We propose an organization for the on-chip memory system of a chip multiprocessor, in which 16 processors share a 16MB pool of 256 L2 cache banks. The L2 cache is organized as a non-uniform cache architecture (NUCA) array with a switched network embedded in it for high performance. We show that this organization can support the spectrum of degrees of sharing: unshared, in which each processor has a private portion of the cache, thus reducing hit latency, completely shared, in which every processor shares the entire cache, thus minimizing misses, and every point in between. We find the optimal degree of sharing for a number of cache bank mapping policies, and also evaluate a per-application cache partitioning strategy. We conclude that a static NUCA organization with sharing degrees of two or four work best across a suite of commercial and scientific parallel workloads. We also demonstrate that migratory, dynamic NUCA approaches improve performance significantly for a subset of the workloads at the cost of increased power consumption and complexity, especially as per-application cache partitioning strategies are applied.

Patent
09 May 2005
TL;DR: In this article, the authors present a system and method of caching data employing probabilistic predictive techniques, which has particular application to multimedia systems for providing local storage of a subset of available viewing selections by assigning a value to a selection and retaining selections in the cache depending on the value and size of the selection.
Abstract: The present invention is related to a system and method of caching data employing probabilistic predictive techniques. The system and method has particular application to multimedia systems for providing local storage of a subset of available viewing selections by assigning a value to a selection and retaining selections in the cache depending on the value and size of the selection. The value assigned to an item can represent the time-dependent likelihood that a user will review an item at some time in the future. An initial value of an item can be based on the user's viewing habits, the user's viewing habits over a particular time segment (e.g., early morning, late morning, early afternoon, late afternoon, primetime, late night) and/or the viewing habits of a group of users during a particular time segment. A value assigned to a selection dynamically changes according to a set of cache retention policies, where the value can be a time-dependent function that decays based on the class of the item, as determined by inference about the class or via a label associated with the item. A selection's value may be reduced as the selection ages because a user is less likely to view the selection over time. Additionally, a value of a selection may change based on changes in a user's viewing habits, changes in time segments or a user's modification of the cache retention policies.
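
A compact sketch of value-based retention with class-dependent decay, using made-up decay rates, sizes, and capacity: each cached selection's value decays over time, and eviction removes the items with the lowest value per megabyte until everything fits.

```python
import math
import time

# Decay rate per hour by item class (illustrative numbers).
DECAY_PER_HOUR = {"news": 0.50, "movie": 0.02, "sports": 0.20}

class CachedItem:
    def __init__(self, name, size_mb, item_class, initial_value, cached_at):
        self.name, self.size_mb, self.item_class = name, size_mb, item_class
        self.initial_value, self.cached_at = initial_value, cached_at

    def value(self, now):
        hours = (now - self.cached_at) / 3600.0
        return self.initial_value * math.exp(-DECAY_PER_HOUR[self.item_class] * hours)

class MediaCache:
    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.items = []

    def add(self, item, now):
        self.items.append(item)
        # Evict the lowest value-per-megabyte selections until the cache fits.
        while sum(i.size_mb for i in self.items) > self.capacity_mb:
            worst = min(self.items, key=lambda i: i.value(now) / i.size_mb)
            self.items.remove(worst)

now = time.time()
cache = MediaCache(capacity_mb=100)
cache.add(CachedItem("evening-news", 40, "news", 0.9, now - 6 * 3600), now)
cache.add(CachedItem("movie-night", 80, "movie", 0.6, now), now)
print([i.name for i in cache.items])     # the stale newscast is evicted first
```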

Patent
01 Nov 2005
TL;DR: Improved caching control for streaming media includes one or more cache control directives associated with streaming media content that can be used by a source of the streamed media content to identify how caching proxy servers are to handle the streaming media contents as discussed by the authors.
Abstract: Improved caching control for streaming media includes one or more cache control directives associated with streaming media content that can be used by a source of the streaming media content to identify how caching proxy servers are to handle the streaming media content. Upon receipt of the streaming media content, the caching proxy servers handle the content as indicated by the cache control directive(s).

Journal ArticleDOI
Amit Agarwal1, Bipul C. Paul1, Hamid Mahmoodi1, Animesh Datta1, Kaushik Roy1 
TL;DR: This technique dynamically detects and replaces faulty cells by dynamically resizing the cache and surpasses all the contemporary fault tolerant schemes such as row/column redundancy and error-correcting code (ECC) in handling failures due to process variation.
Abstract: Process parameter variations are expected to be significantly high in a sub-50-nm technology regime, which can severely affect the yield, unless very conservative design techniques are employed. The parameter variations are random in nature and are expected to be more pronounced in minimum geometry transistors commonly used in memories such as SRAM. Consequently, a large number of cells in a memory are expected to be faulty due to variations in different process parameters. We analyze the impact of process variation on the different failure mechanisms in SRAM cells. We also propose a process-tolerant cache architecture suitable for high-performance memory. This technique dynamically detects and replaces faulty cells by dynamically resizing the cache. It surpasses all the contemporary fault tolerant schemes such as row/column redundancy and error-correcting code (ECC) in handling failures due to process variation. Experimental results on a 64-K direct-mapped L1 cache show that the proposed technique can achieve 94% yield compared to its original 33% yield (standard cache) in a 45-nm predictive technology under σVt-inter = σVt-intra = 30 mV.
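
A toy of the resizing idea (not the paper's circuit-level scheme): a fault map marks defective ways, each set operates with whatever ways survive, and a fully faulty set is bypassed to the next level. The fault rate, cache geometry, and LRU bookkeeping are simplified assumptions.

```python
import random

class ResizableCache:
    """Toy set-associative cache that disables ways marked faulty."""
    def __init__(self, num_sets, ways, fault_map):
        self.num_sets, self.ways = num_sets, ways
        self.fault_map = fault_map                   # set of (set_index, way) pairs
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.lru = [[] for _ in range(num_sets)]     # most recently used way last

    def usable_ways(self, set_idx):
        return [w for w in range(self.ways) if (set_idx, w) not in self.fault_map]

    def access(self, address):
        set_idx, tag = address % self.num_sets, address // self.num_sets
        ways = self.usable_ways(set_idx)
        if not ways:
            return "bypass"                          # whole set disabled: go to L2
        for w in ways:
            if self.tags[set_idx][w] == tag:
                self.lru[set_idx].remove(w)
                self.lru[set_idx].append(w)
                return "hit"
        # Miss: fill an empty usable way, otherwise evict the least recent one.
        empty = [w for w in ways if self.tags[set_idx][w] is None]
        victim = empty[0] if empty else self.lru[set_idx].pop(0)
        self.tags[set_idx][victim] = tag
        self.lru[set_idx].append(victim)
        return "miss"

# Assume 5% of cells are unusable due to parameter variation.
faults = {(s, w) for s in range(64) for w in range(4) if random.random() < 0.05}
cache = ResizableCache(num_sets=64, ways=4, fault_map=faults)
print(cache.access(0x1234), cache.access(0x1234))   # typically: miss hit
```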

Journal ArticleDOI
C.W. Slayman1
TL;DR: In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets, and the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability are covered.
Abstract: As the size of the SRAM cache and DRAM memory grows in servers and workstations, cosmic-ray errors are becoming a major concern for systems designers and end users. Several techniques exist to detect and mitigate the occurrence of cosmic-ray upset, such as error detection, error correction, cache scrubbing, and array interleaving. This paper covers the tradeoffs of these techniques in terms of area, power, and performance penalties versus increased reliability. In most system applications, a combination of several techniques is required to meet the necessary reliability and data-integrity targets.

Journal ArticleDOI
01 May 2005
TL;DR: The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite, which translates into an average IPC improvement of 8%.
Abstract: As processor speeds increase and memory latency becomes more critical, intelligent design and management of secondary caches becomes increasingly important. The efficiency of current set-associative caches is reduced because programs exhibit a non-uniform distribution of memory accesses across different cache sets. We propose a technique to vary the associativity of a cache on a per-set basis in response to the demands of the program. By increasing the number of tag-store entries relative to the number of data lines, we achieve the performance benefit of global replacement while maintaining the constant hit latency of a set-associative cache. The proposed variable-way, or V-Way, set-associative cache achieves an average miss rate reduction of 13% on sixteen benchmarks from the SPEC CPU2000 suite. This translates into an average IPC improvement of 8%.
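
The structural idea, more tag entries per set than the set's share of data lines with the data lines drawn from one global pool under a global replacement policy, can be sketched as follows. The reuse-counter sweep is a simplified stand-in for the paper's replacement scheme, and the sizes are illustrative.

```python
# Toy V-Way-style cache: 4 sets x 4 tag entries backed by only 8 data lines,
# so a hot set can hold more lines than its "fair" share of two.
class VWayCache:
    def __init__(self, num_sets=4, tags_per_set=4, num_data_lines=8):
        self.num_sets, self.tags_per_set = num_sets, tags_per_set
        self.tag_store = [dict() for _ in range(num_sets)]   # tag -> data line id
        self.reuse = [0] * num_data_lines                    # per-data-line counter
        self.owner = [None] * num_data_lines                 # data line -> (set, tag)
        self.clock = 0

    def _grab_data_line(self):
        # Global replacement: sweep until a free or reuse-exhausted line is found.
        while True:
            line = self.clock % len(self.reuse)
            self.clock += 1
            if self.owner[line] is None or self.reuse[line] == 0:
                if self.owner[line] is not None:             # invalidate its old tag
                    old_set, old_tag = self.owner[line]
                    del self.tag_store[old_set][old_tag]
                return line
            self.reuse[line] -= 1

    def access(self, address):
        set_idx, tag = address % self.num_sets, address // self.num_sets
        entries = self.tag_store[set_idx]
        if tag in entries:
            self.reuse[entries[tag]] += 1
            return "hit"
        if len(entries) >= self.tags_per_set:                # local tag replacement
            victim_tag = next(iter(entries))
            self.owner[entries.pop(victim_tag)] = None
        line = self._grab_data_line()
        entries[tag] = line
        self.owner[line] = (set_idx, tag)
        self.reuse[line] = 0
        return "miss"

c = VWayCache()
print([c.access(a) for a in (0, 4, 8, 12, 0)])   # set 0 ends up holding four lines
```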

Patent
15 Nov 2005
TL;DR: In this paper, a method for enabling collaboration with web pages and other resources is described, which includes the step of establishing a collaboration session between a first client and a second client.
Abstract: Methods and apparatus for enabling collaboration with web pages and other resources is described. A method includes the step of establishing a collaboration session between a first client and a second client. A requested resource is cached with the session host in response to a request having a first uniform resource locator (URL) issued by the first client, if the requested resource is a pre-determined type of resource. A second URL is provided to the second client. The second URL identifies the requested resource or the cached resource in accordance with whether the requested resource is cached. Apparatus for enabling collaboration includes a web server, a cache, and a filter. The web server provides a requested web page in response to a first client's request. The filter stores the requested web page in the cache, if the requested web page is a pre-determined type of web page. A number of pre-determined characteristics for caching are described in various embodiments of the methods and apparatus. In one embodiment, the requested resource is cached if it is a dynamic web page. In one embodiment an expiration date of the requested resource determines whether the requested resource should be cached. In another embodiment, a filename associated with the requested resource determines whether the requested resource should be cached. In another embodiment, components of the request determine whether the requested web page should be cached.

Patent
Cheryl Senter1, Johannes Wang1
25 Apr 2005
TL;DR: In this article, a load store unit is provided whose main purpose is to make load requests out of order whenever possible to get the load data back for use by an instruction execution unit as quickly as possible.
Abstract: The present invention provides a system and method for managing load and store operations necessary for reading from and writing to memory or I/O in a superscalar RISC architecture environment. To perform this task, a load store unit is provided whose main purpose is to make load requests out of order whenever possible to get the load data back for use by an instruction execution unit as quickly as possible. A load operation can only be performed out of order if there are no address collisions and no write pendings. An address collision occurs when a read is requested at a memory location where an older instruction will be writing. Write pending refers to the case where an older instruction requests a store operation, but the store address has not yet been calculated. The data cache unit returns 8 bytes of unaligned data. The load/store unit aligns this data properly before it is returned to the instruction execution unit. Thus, the three main tasks of the load store unit are: (1) handling out of order cache requests; (2) detecting address collisions; and (3) alignment of data.
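
The two conditions named above, address collisions and write pendings, amount to a scan of the older store-queue entries; the check below is illustrative, not the patented logic.

```python
from collections import namedtuple

# Older stores that a candidate load must be checked against.
Store = namedtuple("Store", "address size ready")   # ready=False: address unknown

def load_may_issue(load_addr, load_size, older_stores):
    """A load may issue out of order only if no older store has a pending
    (not yet computed) address and no older store overlaps the load's bytes."""
    for st in older_stores:
        if not st.ready:
            return False, "write pending"            # store address not yet known
        overlaps = st.address < load_addr + load_size and load_addr < st.address + st.size
        if overlaps:
            return False, "address collision"        # wait for (or forward) the store
    return True, "issue out of order"

older = [Store(address=0x1000, size=8, ready=True)]
print(load_may_issue(0x1008, 8, older))                  # (True, 'issue out of order')
print(load_may_issue(0x1004, 8, older))                  # (False, 'address collision')
print(load_may_issue(0x2000, 8, [Store(0, 0, False)]))   # (False, 'write pending')
```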

Proceedings ArticleDOI
Matteo Frigo1, Volker Strumpen1
20 Jun 2005
TL;DR: This work presents a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods, and it exploits temporal locality optimally throughout the entire memory hierarchy.
Abstract: We present a cache oblivious algorithm for stencil computations, which arise for example in finite-difference methods. Our algorithm applies to arbitrary stencils in n-dimensional spaces. On an "ideal cache" of size Z, our algorithm saves a factor of Θ(Z^{1/n}) cache misses compared to a naive algorithm, and it exploits temporal locality optimally throughout the entire memory hierarchy.
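
A compact Python transcription of the 1D form of the trapezoidal recursion, for a 3-point stencil with unit dependence slopes; the diffusion kernel, clamped boundaries, and problem sizes are illustrative choices, and the n-dimensional generalization in the paper is not shown.

```python
# Cache-oblivious 1D stencil: recursively cut the space-time region into
# trapezoids so that most work happens on data already resident in whatever
# cache level is available; no cache parameters appear anywhere.
N, T = 32, 16
u = [[float(i == N // 2) for i in range(N)], [0.0] * N]   # two time-toggled rows

def kernel(t, x):
    prev, cur = u[(t - 1) % 2], u[t % 2]
    left, right = prev[max(x - 1, 0)], prev[min(x + 1, N - 1)]
    cur[x] = 0.25 * left + 0.5 * prev[x] + 0.25 * right   # simple diffusion step

def walk(t0, t1, x0, dx0, x1, dx1):
    """Process every point (t, x), t0 <= t < t1, inside the trapezoid whose left
    edge starts at x0 with slope dx0 and whose right edge starts at x1 with dx1."""
    dt = t1 - t0
    if dt == 1:
        for x in range(x0, x1):
            kernel(t0, x)
    elif dt > 1:
        if 2 * (x1 - x0) + (dx1 - dx0) * dt >= 4 * dt:    # wide: cut space
            xm = (2 * (x0 + x1) + (2 + dx0 + dx1) * dt) // 4
            walk(t0, t1, x0, dx0, xm, -1)
            walk(t0, t1, xm, -1, x1, dx1)
        else:                                             # tall: cut time
            half = dt // 2
            walk(t0, t0 + half, x0, dx0, x1, dx1)
            walk(t0 + half, t1, x0 + dx0 * half, dx0, x1 + dx1 * half, dx1)

walk(1, T + 1, 0, 0, N, 0)        # run T time steps over the whole row
print(sum(u[T % 2]))              # diffusion conserves mass: prints 1.0
```

The recursion visits each grid point exactly once, in an order that respects the stencil's dependencies, so it computes the same result as the usual doubly nested loop while touching memory in cache-sized chunks at every level of the hierarchy.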

Patent
30 Nov 2005
TL;DR: In this article, a hierarchical storage structure is used to cache content for provision to a plurality of subscribers, which can be used to balance competing concerns of efficiently using network and/or storage resources and providing expedient responses to subscribers' requests for content.
Abstract: Embodiments of the invention provide networked storage of content, which can be used to allow the enhanced provision of real-time and/or on-demand content (such as video content, audio content, etc.) to a subscriber. Merely by way of example, in an aspect of some embodiments, a hierarchical storage structure may be used to cache content for provision to a plurality of subscribers. This can be used to balance competing concerns of efficiently using network and/or storage resources and providing expedient responses to subscribers' requests for content.

Journal ArticleDOI
01 May 2005
TL;DR: The proposed RegionScout, a family of simple filter mechanisms that dynamically detect most non-shared regions, is presented and is used to avoid broadcasts for non-shared regions, thus reducing bandwidth, and snoop-induced tag lookups, thus reducing energy.
Abstract: It has been shown that many requests miss in all remote nodes in shared memory multiprocessors. We are motivated by the observation that this behavior extends to much coarser grain areas of memory. We define a region to be a continuous, aligned memory area whose size is a power of two and observe that many requests find that no other node caches a block in the same region even for regions as large as 16K bytes. We propose RegionScout, a family of simple filter mechanisms that dynamically detect most non-shared regions. A node with a RegionScout filter can determine in advance that a request will miss in all remote nodes. RegionScout filters are implemented as a layered extension over existing snoop-based coherence systems. They require no changes to existing coherence protocols or caches and impose no constraints on what can be cached simultaneously. Their operation is completely transparent to software and the operating system. RegionScout filters require little additional storage and a single additional global signal. These characteristics are made possible by utilizing imprecise information about the regions cached in each node. Since they rely on dynamically collected information, RegionScout filters can adapt to changing sharing patterns. We present two applications of RegionScout: In the first, RegionScout is used to avoid broadcasts for non-shared regions, thus reducing bandwidth. In the second, RegionScout is used to avoid snoop-induced tag lookups, thus reducing energy.
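
The two structures described, an imprecise region-granularity record of what each node caches and a small table of regions the requester has learned are globally non-shared, can be sketched as per-region counters and a set. The region size is an assumption, and invalidating the non-shared table when a remote node later caches the region is omitted.

```python
REGION_BITS = 14                        # 16-Kbyte regions (assumed)

def region_of(block_addr):
    return block_addr >> REGION_BITS

class Node:
    def __init__(self):
        self.region_count = {}          # CRH analogue: region -> cached block count
        self.non_shared = set()         # NSRT analogue: regions nobody else caches

    def cache_block(self, addr):
        r = region_of(addr)
        self.region_count[r] = self.region_count.get(r, 0) + 1

    def has_region(self, region):
        return self.region_count.get(region, 0) > 0

def miss_request(requester, others, addr):
    region = region_of(addr)
    if region in requester.non_shared:
        return "memory access, broadcast avoided"
    # Broadcast: each remote node answers with a region-level presence bit.
    if not any(node.has_region(region) for node in others):
        requester.non_shared.add(region)        # safe to skip the snoop next time
        return "broadcast sent, region learned as non-shared"
    return "broadcast sent, region shared somewhere"

a, b = Node(), Node()
b.cache_block(0x123456)
print(miss_request(a, [b], 0x9ABC00))   # learned as non-shared
print(miss_request(a, [b], 0x9ABC40))   # same region: broadcast avoided
print(miss_request(a, [b], 0x123400))   # region cached at b: must snoop
```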

Proceedings ArticleDOI
25 Jul 2005
TL;DR: This work proposes the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion and is an ideal means to provide very cost effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate.
Abstract: Device scaling and large scale integration have led to growing concerns about soft errors in microprocessors. To date, in all but the most demanding applications, implementing parity and ECC for caches and other large, regular SRAM structures has been sufficient to stem the growing soft error tide. This will not be the case for long, and questions remain as to the best way to detect and recover from soft errors in the remainder of the processor - in particular, the less structured execution core. In this work, we propose the ReStore architecture, which leverages existing performance enhancing checkpointing hardware to recover from soft error events in a low cost fashion. Error detection in the ReStore architecture is novel: symptoms that hint at the presence of soft errors trigger restoration of a previous checkpoint. Example symptoms include exceptions, control flow mis-speculations, and cache or translation look-aside buffer misses. Compared to conventional soft error detection via full replication, the ReStore framework incurs little overhead, but sacrifices some amount of error coverage. These attributes make it an ideal means to provide very cost effective error coverage for processor applications that can tolerate a nonzero, but small, soft error failure rate. Our evaluation of an example ReStore implementation exhibits a 2x increase in MTBF (mean time between failures) over a standard pipeline with minimal hardware and performance overheads. The MTBF increases by 7x if ReStore is coupled with parity protection for certain pipeline structures.

Journal ArticleDOI
TL;DR: Several major potential performance problems with WebGIS are identified, and several possible techniques to improve performance are discussed, including the use of pyramids and hash indices on the server side to handle large images, as well as clustering and multithreading techniques.
Abstract: WebGIS (also known as web‐based GIS and Internet GIS) denotes a type of Geographic Information System (GIS), whose client is implemented in a Web browser. WebGISs have been developed and used extensively in real‐world applications. However, when such a complex web‐based system involves the dissemination of large volumes of data and/or massive user interactions, its performance can become an issue. In this paper, we first identify several major potential performance problems with WebGIS. Then, we discuss several possible techniques to improve the performance. These techniques include the use of pyramids and hash indices on the server side to handle large images. To resolve server‐side conflicts originating from concurrent massive access and user interactions, we suggest clustering and multithreading techniques. Multithreading is also used to break down the long sequential, layer‐based data access to concurrent data access on the client side. Caching is suggested as a means to enhance concurrent data access...