
Showing papers on "Cache published in 2009"


01 Jan 2009
TL;DR: This report details the analytical model assumed for the newly added modules along with their validation analysis of CACTI 6.0, a significantly enhanced version of the tool that primarily focuses on interconnect design for large caches.
Abstract: CACTI 6.0: A Tool to Model Large Caches. Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi. School of Computing, University of Utah, and Hewlett-Packard Laboratories; HP Labs report HPL-2009-85, also published at the International Symposium on Microarchitecture, Chicago, Dec 2007. Future processors will likely have large on-chip caches with a possibility of dedicating an entire die for on-chip storage in a 3D stacked design. With the ever-growing disparity between transistor and wire delay, the properties of such large caches will primarily depend on the characteristics of the interconnection networks that connect various sub-modules of a cache. CACTI 6.0 is a significantly enhanced version of the tool that primarily focuses on interconnect design for large caches. In addition to strengthening the existing analytical model of the tool for dominant cache components, CACTI 6.0 includes two major extensions over earlier versions: first, the ability to model Non-Uniform Cache Access (NUCA), and second, the ability to model different types of wires, such as RC-based wires with different power, delay, and area characteristics and differential low-swing buses. This report details the analytical model assumed for the newly added modules along with their validation analysis.

845 citations
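CACTI's interconnect focus comes down to first-order RC delay and power models for the wires linking cache sub-modules. The sketch below illustrates a repeated-wire, Elmore-style delay estimate of the kind such tools build on; it is a minimal sketch only, and every device and wire parameter in it is a hypothetical placeholder rather than a CACTI 6.0 value.

```python
# Minimal sketch of a first-order repeated-RC-wire delay estimate, in the
# spirit of the interconnect models CACTI-like tools use. All device and
# wire parameters below are hypothetical placeholders, not CACTI 6.0 values.

def repeated_wire_delay(length_mm,
                        r_wire=1000.0,   # ohm per mm (assumed)
                        c_wire=0.2e-12,  # farad per mm (assumed)
                        r_drv=2000.0,    # repeater output resistance, ohm (assumed)
                        c_gate=1e-15,    # repeater input capacitance, farad (assumed)
                        n_repeaters=10):
    """Elmore-style delay of a wire split into equal segments by repeaters."""
    seg_len = length_mm / n_repeaters
    r_seg = r_wire * seg_len
    c_seg = c_wire * seg_len
    # 0.69*RC for the lumped driver, 0.38*RC for the distributed wire segment.
    per_segment = (0.69 * r_drv * (c_seg + c_gate)
                   + 0.38 * r_seg * c_seg
                   + 0.69 * r_seg * c_gate)
    return n_repeaters * per_segment

if __name__ == "__main__":
    for mm in (1, 2, 5, 10):
        print(f"{mm:>2} mm wire: {repeated_wire_delay(mm) * 1e12:7.1f} ps (toy parameters)")
```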


Proceedings ArticleDOI
20 Jun 2009
TL;DR: Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache, is proposed.
Abstract: Increases in on-chip communication delay and the large working sets of server and scientific workloads complicate the design of the on-chip last-level cache for multicore processors. The large working sets favor a shared cache design that maximizes the aggregate cache capacity and minimizes off-chip memory requests. At the same time, the growing on-chip communication delay favors core-private caches that replicate data to minimize delays on global wires. Recent hybrid proposals offer lower average latency than conventional designs, but they address the placement requirements of only a subset of the data accessed by the application, require complex lookup and coherence mechanisms that increase latency, or fail to scale to high core counts. In this work, we observe that the cache access patterns of a range of server and scientific workloads can be classified into distinct classes, where each class is amenable to different block placement policies. Based on this observation, we propose Reactive NUCA (R-NUCA), a distributed cache design which reacts to the class of each cache access and places blocks at the appropriate location in the cache. R-NUCA cooperates with the operating system to support intelligent placement, migration, and replication without the overhead of an explicit coherence mechanism for the on-chip last-level cache. In a range of server, scientific, and multiprogrammed workloads, R-NUCA matches the performance of the best cache design for each workload, improving performance by 14% on average over competing designs and by 32% at best, while achieving performance within 5% of an ideal cache design.

436 citations
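R-NUCA's central idea is that each class of access gets its own placement rule. The following sketch illustrates class-driven placement on a tiled last-level cache, assuming a made-up tile count, replication cluster size, and classification input; it shows the flavor of the approach, not the paper's actual OS-assisted mechanism.

```python
# Hypothetical sketch of class-driven block placement in an R-NUCA-like
# distributed LLC. Classification inputs and tile geometry are illustrative only.

N_TILES = 16          # tiles (core + LLC slice), assumed layout
CLUSTER = 4           # size of a replication cluster for instructions (assumed)

def classify(is_instruction, accessed_by_cores):
    if is_instruction:
        return "instruction"
    return "private" if len(accessed_by_cores) == 1 else "shared"

def home_slice(block_addr, requester, access_class):
    if access_class == "private":
        return requester                       # keep private data in the local slice
    if access_class == "instruction":
        # replicate within a small cluster of nearby slices
        cluster_base = (requester // CLUSTER) * CLUSTER
        return cluster_base + (block_addr % CLUSTER)
    # shared read-write data: single copy, address-interleaved across all slices
    return block_addr % N_TILES

if __name__ == "__main__":
    print(home_slice(0xBEEF, requester=5, access_class=classify(False, {5})))        # private -> slice 5
    print(home_slice(0xBEEF, requester=5, access_class=classify(True, {5, 6})))      # instruction cluster
    print(home_slice(0xBEEF, requester=5, access_class=classify(False, {5, 6, 7})))  # shared, interleaved
```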


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This paper discusses and evaluates two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology.
Abstract: Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D chips and 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers, through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high density memory technology within the same chip footprint gives 18% IPC improvement over the baseline. Furthermore, up to 70% reduction in power consumption over a baseline SRAM-only design is achieved.

375 citations
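A region-based hybrid cache (RHCA) implies some policy for moving hot lines into the fast region. The sketch below assumes a simple hit-counter threshold for promotion and an LRU swap with the fast region, which is only one plausible policy, not necessarily the one evaluated in the paper.

```python
# Illustrative sketch of a hybrid-region cache: a small "fast" region and a
# large "dense" region, with hot lines promoted into the fast region once a
# (hypothetical) hit-count threshold is reached. Not the paper's mechanism.

from collections import OrderedDict

class HybridRegionCache:
    def __init__(self, fast_lines=4, dense_lines=16, promote_after=3):
        self.fast = OrderedDict()     # LRU-ordered: line -> hit count (first = LRU)
        self.dense = OrderedDict()
        self.fast_lines, self.dense_lines = fast_lines, dense_lines
        self.promote_after = promote_after

    def access(self, line):
        if line in self.fast:
            self.fast.move_to_end(line)
            self.fast[line] += 1
            return "hit-fast"
        if line in self.dense:
            self.dense.move_to_end(line)
            self.dense[line] += 1
            if self.dense[line] >= self.promote_after:
                self._promote(line)
            return "hit-dense"
        self._fill(line)
        return "miss"

    def _promote(self, line):
        count = self.dense.pop(line)
        if len(self.fast) >= self.fast_lines:        # swap: demote the fast region's LRU line
            victim, vcount = self.fast.popitem(last=False)
            self.dense[victim] = vcount
        self.fast[line] = count

    def _fill(self, line):
        if len(self.dense) >= self.dense_lines:
            self.dense.popitem(last=False)            # evict the dense region's LRU line
        self.dense[line] = 1
```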


Journal ArticleDOI
18 Feb 2009-Nature
TL;DR: A Chinese laboratory is the only source of a valuable crystal, and it won't share its supplies with the rest of the world; David Cyranoski investigates why.
Abstract: A Chinese laboratory is the only source of a valuable crystal. David Cyranoski investigates why it won't share its supplies.

370 citations


Proceedings ArticleDOI
02 Nov 2009
TL;DR: Results indicate that EWT is an effective and practical scheme to improve the energy efficiency of a STT-RAM cache, and up to 80% of write energy reduction can be achieved through EWT.
Abstract: The emerging Spin Torque Transfer memory (STT-RAM) is a promising candidate for future on-chip caches due to STT-RAM's high density, low leakage, long endurance and high access speed. However, one of the major challenges of STT-RAM is its high write current, which is disadvantageous when used as an on-chip cache since the dynamic power generated is too high. In this paper, we propose Early Write Termination (EWT), a novel technique to significantly reduce write energy with no performance penalty. EWT can be implemented with low complexity and low energy overhead. Our evaluation shows that up to 80% of write energy reduction can be achieved through EWT, resulting in 33% less total energy consumption and a 34% reduction in ED2. These results indicate that EWT is an effective and practical scheme to improve the energy efficiency of a STT-RAM cache.

349 citations
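The energy saving in EWT comes from not completing writes whose target cells already hold the value being written. The toy model below only does the energy accounting for that idea at line granularity; the per-bit energy constant and the line width are placeholders, and the real technique is a circuit-level termination during the write pulse rather than software.

```python
# Toy energy-accounting model of Early Write Termination (EWT): bit positions
# whose stored value already equals the incoming value do not pay write energy.
# The per-bit energy constant is an arbitrary placeholder.

WRITE_ENERGY_PER_BIT = 1.0   # arbitrary units (assumed)

def ewt_write_energy(old_line: int, new_line: int, line_bits: int = 512):
    changed = bin((old_line ^ new_line) & ((1 << line_bits) - 1)).count("1")
    baseline = line_bits * WRITE_ENERGY_PER_BIT          # write every bit
    ewt = changed * WRITE_ENERGY_PER_BIT                 # write only the flipped bits
    return baseline, ewt

if __name__ == "__main__":
    old = int.from_bytes(b"\x00" * 64, "big")
    new = int.from_bytes(b"\x00" * 60 + b"\xff" * 4, "big")   # only 4 bytes differ
    base, ewt = ewt_write_energy(old, new)
    print(f"baseline {base:.0f}, with EWT {ewt:.0f} ({1 - ewt / base:.0%} saved)")
```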


Proceedings ArticleDOI
20 Jun 2009
TL;DR: This work proposes a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism.
Abstract: Many multi-core processors employ a large last-level cache (LLC) shared among the multiple cores. Past research has demonstrated that sharing-oblivious cache management policies (e.g., LRU) can lead to poor performance and fairness when the multiple cores compete for the limited LLC capacity. Different memory access patterns can cause cache contention in different ways, and various techniques have been proposed to target some of these behaviors. In this work, we propose a new cache management approach that combines dynamic insertion and promotion policies to provide the benefits of cache partitioning, adaptive insertion, and capacity stealing all with a single mechanism. By handling multiple types of memory behaviors, our proposed technique outperforms techniques that target only either capacity partitioning or adaptive insertion.

334 citations
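The combined mechanism boils down to two knobs per set: where a newly inserted block enters the recency stack, and how far a block moves up on a hit. Below is a minimal single-set sketch with both knobs exposed; the paper's dynamic, per-workload selection of these positions is not modeled.

```python
# Sketch of a single cache set managed by an insertion/promotion policy:
# new blocks enter at a configurable recency position, and hits promote a
# block by a configurable number of positions instead of jumping to MRU.
# Position 0 is MRU; the last position is LRU (the eviction victim).

class InsertionPromotionSet:
    def __init__(self, ways=8, insert_pos=6, promote_by=2):
        self.ways, self.insert_pos, self.promote_by = ways, insert_pos, promote_by
        self.stack = []                    # index 0 = MRU ... last = LRU

    def access(self, tag):
        if tag in self.stack:              # hit: promote toward MRU by a fixed step
            i = self.stack.index(tag)
            self.stack.insert(max(0, i - self.promote_by), self.stack.pop(i))
            return "hit"
        if len(self.stack) >= self.ways:   # miss: evict the LRU block
            self.stack.pop()
        self.stack.insert(min(self.insert_pos, len(self.stack)), tag)
        return "miss"

if __name__ == "__main__":
    s = InsertionPromotionSet()
    for t in ["a", "b", "c", "a", "a"]:
        print(t, s.access(t), s.stack)
```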


Proceedings ArticleDOI
01 Apr 2009
TL;DR: This paper proposes a hot-page coloring approach enforcing coloring on only a small set of frequently accessed (or hot) pages for each process, and demonstrates that hot page identification and selective coloring can significantly alleviate the coloring-induced adverse effects in practice.
Abstract: Modern multi-core processors present new resource management challenges due to the subtle interactions of simultaneously executing processes sharing on-chip resources (particularly the L2 cache). Recent research demonstrates that the operating system may use the page coloring mechanism to control cache partitioning, and consequently to achieve fair and efficient cache utilization. However, page coloring places additional constraints on memory space allocation, which may conflict with application memory needs. Further, adaptive adjustments of cache partitioning policies in a multi-programmed execution environment may incur substantial overhead for page recoloring (or copying). This paper proposes a hot-page coloring approach enforcing coloring on only a small set of frequently accessed (or hot) pages for each process. The cost of identifying hot pages online is reduced by leveraging the knowledge of spatial locality during a page table scan of access bits. Our results demonstrate that hot page identification and selective coloring can significantly alleviate the coloring-induced adverse effects in practice. However, we also reach the somewhat negative conclusion that without additional hardware support, adaptive page coloring is only beneficial when recoloring is performed infrequently (meaning long scheduling time quanta in multi-programmed executions).

311 citations
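Page coloring works because the physical page number selects which group of cache sets a page's lines can occupy. A minimal sketch of the color arithmetic, assuming a hypothetical 4 MB, 16-way L2 with 64-byte lines and 4 KB pages:

```python
# Minimal page-coloring arithmetic, with an assumed L2 geometry.
# The color is the set-index bits that lie above the page-offset bits,
# so all lines of one physical page map into one color's group of sets.

PAGE_SIZE = 4096          # bytes
LINE_SIZE = 64            # bytes
L2_SIZE   = 4 * 1024**2   # 4 MB (assumed)
L2_WAYS   = 16            # (assumed)

SETS          = L2_SIZE // (LINE_SIZE * L2_WAYS)    # 4096 sets
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE              # 64 sets touched by one page
NUM_COLORS    = SETS // SETS_PER_PAGE               # 64 colors

def page_color(phys_addr: int) -> int:
    return (phys_addr // PAGE_SIZE) % NUM_COLORS

if __name__ == "__main__":
    print(f"{NUM_COLORS} colors; 0x12345000 -> color {page_color(0x12345000)}")
    # A process restricted to colors {0..15} gets 16/64 = 1/4 of the L2 capacity.
```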


Journal ArticleDOI
TL;DR: Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations; this work represents one of the most extensive analyses of stencil optimizations and performance modeling to date.
Abstract: Stencil-based kernels constitute the core of many important scientific applications on block-structured grids. Unfortunately, these codes achieve a low fraction of peak performance, due primarily to the disparity between processor and main memory speeds. In this paper, we explore the impact of trends in memory subsystems on a variety of stencil optimization techniques and develop performance models to analytically guide our optimizations. Our work targets cache reuse methodologies across single and multiple stencil sweeps, examining cache-aware algorithms as well as cache-oblivious techniques on the Intel Itanium2, AMD Opteron, and IBM Power5. Additionally, we consider stencil computations on the heterogeneous multicore design of the Cell processor, a machine with an explicitly managed memory hierarchy. Overall our work represents one of the most extensive analyses of stencil optimizations and performance modeling to date. Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations. We also show that a cache-aware implementation is significantly faster than a cache-oblivious approach, while the explicitly managed memory on Cell enables the highest overall efficiency: Cell attains 88% of algorithmic peak while the best competing cache-based processor achieves only 54% of algorithmic peak performance.

243 citations
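The cache blocking being evaluated is plain loop tiling of the stencil sweep. The sketch below shows that loop structure for a 2D 5-point Jacobi update; the tile sizes are placeholders that would be tuned per cache, and production kernels are written in C or Fortran rather than Python.

```python
# Loop structure of a cache-blocked 2D 5-point stencil sweep (Jacobi update).
# Tile sizes ti/tj are placeholders to be tuned to the target cache; this only
# illustrates the blocking structure, not an optimized kernel.

def blocked_jacobi_sweep(a, b, ti=64, tj=64):
    """One sweep: b[i][j] = average of a's four neighbours, interior points only."""
    n, m = len(a), len(a[0])
    for ii in range(1, n - 1, ti):                     # loop over tiles
        for jj in range(1, m - 1, tj):
            for i in range(ii, min(ii + ti, n - 1)):   # loop within a tile
                for j in range(jj, min(jj + tj, m - 1)):
                    b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j]
                                      + a[i][j - 1] + a[i][j + 1])

if __name__ == "__main__":
    n = 256
    a = [[float(i * n + j) for j in range(n)] for i in range(n)]
    b = [[0.0] * n for _ in range(n)]
    blocked_jacobi_sweep(a, b)
    print(b[1][1])
```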


Proceedings ArticleDOI
12 Sep 2009
TL;DR: This paper presents fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture, based on sophisticated benchmarks to measure the latency and bandwidth between different locations in the memory subsystem.
Abstract: Today's microprocessors have complex memory subsystems with several cache levels. The efficient use of this memory hierarchy is crucial to gain optimal performance, especially on multicore processors. Unfortunately, many implementation details of these processors are not publicly available. In this paper we present such fundamental details of the newly introduced Intel Nehalem microarchitecture with its integrated memory controller, Quick Path Interconnect, and ccNUMA architecture. Our analysis is based on sophisticated benchmarks to measure the latency and bandwidth between different locations in the memory subsystem. Special care is taken to control the coherency state of the data to gain insight into performance relevant implementation details of the cache coherency protocol. Based on these benchmarks we present undocumented performance data and architectural properties.

243 citations
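Latency benchmarks of this kind typically rely on pointer chasing: each load's address depends on the previous load, so prefetchers and out-of-order overlap cannot hide the memory latency. The sketch below shows only that structure; in Python the interpreter dominates the timing, and the paper's benchmarks are carefully written low-level code, so treat this purely as an illustration of the method.

```python
# Structural sketch of a pointer-chasing latency benchmark: each access depends
# on the previous one, defeating prefetching. Illustrative only; real memory
# latency measurements require low-level code, not Python.

import random
import time

def build_chain(n_elements):
    """Random cyclic permutation: chain[i] holds the index of the next element."""
    order = list(range(n_elements))
    random.shuffle(order)
    chain = [0] * n_elements
    for i in range(n_elements - 1):
        chain[order[i]] = order[i + 1]
    chain[order[-1]] = order[0]
    return chain

def chase(chain, steps):
    idx = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]          # dependent load: the next index comes from this load
    return (time.perf_counter() - t0) / steps

if __name__ == "__main__":
    for n in (1 << 10, 1 << 16, 1 << 22):   # working sets of different sizes
        print(f"{n:>8} elements: {chase(build_chain(n), 200_000) * 1e9:6.1f} ns/step")
```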


Proceedings ArticleDOI
13 Nov 2009
TL;DR: Experimental results demonstrate that two resource management approaches are effective in isolating the cache interference impacts that a VM may have on another VM; these approaches are incorporated into the resource management framework of the example cloud infrastructure, which enables the deployment of VMs with isolation-enhanced SLAs.
Abstract: The cloud infrastructure provider (CIP) in a cloud computing platform must provide security and isolation guarantees to a service provider (SP), who builds the service(s) for such a platform. We identify last level cache (LLC) sharing as one of the impediments to finer grain isolation required by a service, and advocate two resource management approaches to provide performance and security isolation in the shared cloud infrastructure - cache hierarchy aware core assignment and page coloring based cache partitioning. Experimental results demonstrate that these approaches are effective in isolating the cache interference impacts that a VM may have on another VM. We also incorporate these approaches in the resource management (RM) framework of our example cloud infrastructure, which enables the deployment of VMs with isolation-enhanced SLAs.

234 citations


Proceedings ArticleDOI
17 May 2009
TL;DR: Two ways in which programs that lack key-dependent control flow and key-dependent cache behavior can still leak timing information on modern x86 implementations such as the Intel Core 2 Duo are demonstrated, and defense mechanisms against them are proposed.
Abstract: This paper studies and evaluates the extent to which automated compiler techniques can defend against timing-based side-channel attacks on modern x86 processors. We study how modern x86 processors can leak timing information through side-channels that relate to control flow and data flow. To eliminate key-dependent control flow and key-dependent timing behavior related to control flow, we propose the use of if-conversion in a compiler backend, and evaluate a proof-of-concept prototype implementation. Furthermore, we demonstrate two ways in which programs that lack key-dependent control flow and key-dependent cache behavior can still leak timing information on modern x86 implementations such as the Intel Core 2 Duo, and propose defense mechanisms against them.
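If-conversion removes key-dependent branches by computing both sides and selecting the result without control flow. A minimal source-level illustration of that selection using masks follows; the paper performs the transformation in a compiler backend on x86 (e.g., via conditional moves), so this is only a sketch of the idea, not their implementation.

```python
# Branch-free ("if-converted") selection: both inputs are always computed and the
# secret bit only steers a mask, so control flow no longer depends on the secret.
# Illustrative only; the paper applies this transformation in a compiler backend.

MASK64 = (1 << 64) - 1

def select_with_branch(secret_bit, a, b):
    # Key-dependent control flow: a branch-predictor / instruction-cache side channel.
    return a if secret_bit else b

def select_branch_free(secret_bit, a, b):
    mask = (-secret_bit) & MASK64            # all ones if bit == 1, zero if bit == 0
    return (a & mask) | (b & ~mask & MASK64)

if __name__ == "__main__":
    for bit in (0, 1):
        assert select_branch_free(bit, 0xAAAA, 0x5555) == select_with_branch(bit, 0xAAAA, 0x5555)
    print("branch-free select matches branching select")
```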

Proceedings ArticleDOI
07 Mar 2009
TL;DR: This work has developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required.
Abstract: Miss rate curves (MRCs) are useful in a number of contexts. In our research, online L2 cache MRCs enable us to dynamically identify optimal cache sizes when cache-partitioning a shared-cache multicore processor. Obtaining L2 MRCs has generally been assumed to be expensive when done in software and consequently, their usage for online optimizations has been limited. To address these problems and opportunities, we have developed a low-overhead software technique to obtain L2 MRCs online on current processors, exploiting features available in their performance monitoring units so that no changes to the application source code or binaries are required. Our technique, called RapidMRC, requires a single probing period of roughly 221 million processor cycles (147 ms), and subsequently 124 million cycles (83 ms) to process the data. We demonstrate its accuracy by comparing the obtained MRCs to the actual L2 MRCs of 30 applications taken from SPECcpu2006, SPECcpu2000, and SPECjbb2000. We show that RapidMRC can be applied to sizing cache partitions, helping to achieve performance improvements of up to 27%.
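A miss rate curve can be derived from an access trace by computing LRU stack (reuse) distances, as in Mattson's classic algorithm. The sketch below shows that derivation over a toy trace; RapidMRC's contribution is obtaining the trace cheaply from the performance monitoring unit and correcting it, which is not modeled here.

```python
# Building a miss rate curve (MRC) from an address trace via LRU stack distances
# (Mattson's algorithm). RapidMRC obtains such a trace from hardware performance
# counters; here it is just a Python list, purely for illustration.

def miss_rate_curve(trace, max_size):
    stack = []                                   # index 0 = most recently used address
    dist_hist = [0] * (max_size + 1)             # dist_hist[d]: accesses with stack distance d
    for addr in trace:
        if addr in stack:
            d = stack.index(addr) + 1            # distance 1 = re-reference of the MRU address
            stack.remove(addr)
            if d <= max_size:
                dist_hist[d] += 1
        stack.insert(0, addr)
    total = len(trace)
    mrc, hits = [], 0
    for size in range(1, max_size + 1):
        hits += dist_hist[size]                  # an LRU cache of `size` lines hits all d <= size
        mrc.append(1.0 - hits / total)
    return mrc

if __name__ == "__main__":
    trace = [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4]
    print([round(m, 2) for m in miss_rate_curve(trace, max_size=5)])
```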

Proceedings ArticleDOI
19 Apr 2009
TL;DR: This paper considers a network in which each router has a local cache that caches files passing through it and develops a simple content caching, location, and routing system that adopts an implicit, transparent, and best-effort approach towards caching.
Abstract: For several years, web caching has been used to meet the ever-increasing Web access loads. A fundamental capability of all such systems is that of inter-cache coordination, which can be divided into two main types: explicit and implicit coordination. While the former allows for greater control over resource allocation, the latter does not suffer from the additional communication overhead needed for coordination. In this paper, we consider a network in which each router has a local cache that caches files passing through it. By additionally storing minimal information regarding caching history, we develop a simple content caching, location, and routing system that adopts an implicit, transparent, and best-effort approach towards caching. Though only best effort, the policy outperforms classic policies that allow explicit coordination between caches.

Proceedings ArticleDOI
06 Mar 2009
TL;DR: A novel router micro-architecture is proposed, called XShare, which exploits data value locality and bimodal traffic characteristics of CMP applications to transfer multiple small flits over a single channel, and helps in enhancing the network throughput by 35%, providing a latency reduction of 14% with synthetic traffic, and improving IPC by 4% on average with application workloads.
Abstract: Performance and power consumption of an on-chip interconnect that forms the backbone of Chip Multiprocessors (CMPs) are directly influenced by the underlying network topology. Both these parameters can also be optimized by application-induced communication locality since applications mapped on a large CMP system will benefit from clustered communication, where data is placed in cache banks closer to the cores accessing it. Thus, in this paper, we design a hierarchical network topology that takes advantage of such communication locality. The two-tier hierarchical topology consists of local networks that are connected via a global network. The local network is a simple, high-bandwidth, low-power shared bus fabric, and the global network is a low-radix mesh. The key insight that enables the hybrid topology is that most communication in CMP applications can be limited to the local network, and thus, using a fast, low-power bus to handle local communication will improve both packet latency and power-efficiency. The proposed hierarchical topology provides up to 63% reduction in energy-delay-product over mesh, 47% over flattened butterfly, and 33% with respect to concentrated mesh across network sizes with uniform and non-uniform synthetic traffic. For real parallel workloads, the hybrid topology provides up to 14% improvement in system performance (IPC) and in terms of energy-delay-product, improvements of 70%, 22%, 30% over the mesh, flattened butterfly, and concentrated mesh, respectively, for a 32-way CMP. Although the hybrid topology scales in a power- and bandwidth-efficient manner with network size, while keeping the average packet latency low in comparison to high radix topologies, it has lower throughput due to high concentration. To improve the throughput of the hybrid topology, we propose a novel router micro-architecture, called XShare, which exploits data value locality and bimodal traffic characteristics of CMP applications to transfer multiple small flits over a single channel. This helps in enhancing the network throughput by 35%, providing a latency reduction of 14% with synthetic traffic, and improving IPC by 4% on average with application workloads.

Proceedings ArticleDOI
20 Jun 2009
TL;DR: Phastlane is presented, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors that achieves 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.
Abstract: Tens and eventually hundreds of processing cores are projected to be integrated onto future microprocessors, making the global interconnect a key component to achieving scalable chip performance within a given power envelope. While CMOS-compatible nanophotonics has emerged as a leading candidate for replacing global wires beyond the 22nm timeframe, on-chip optical interconnect architectures proposed thus far are either limited in scalability or are dependent on comparatively slow electrical control networks. In this paper, we present Phastlane, a hybrid electrical/optical routing network for future large scale, cache coherent multicore microprocessors. The heart of the Phastlane network is a low-latency optical crossbar that uses simple predecoded source routing to transmit cache-line-sized packets several hops in a single clock cycle under contentionless conditions. When contention exists, the router makes use of electrical buffers and, if necessary, a high speed drop signaling network. Overall, Phastlane achieves 2X better network performance than a state-of-the-art electrical baseline while consuming 80% less network power.

Proceedings ArticleDOI
12 Dec 2009
TL;DR: PVC provides strong guarantees, reduces packet delay variation, and enables efficient reclamation of idle network bandwidth without per-flow buffering at the routers and with minimal buffering at the source nodes, and simplifies network management through a flexible allocation mechanism.
Abstract: Future many-core chip multiprocessors (CMPs) and systems-on-a-chip (SOCs) will have numerous processing elements executing multiple applications concurrently. These applications and their respective threads will interfere at the on-chip network level and compete for shared resources such as cache banks, memory controllers, and specialized accelerators. Often, the communication and sharing patterns of these applications will be impossible to predict off-line, making fairness guarantees and performance isolation difficult through static thread and link scheduling. Prior techniques for providing network quality-of-service (QOS) have too much algorithmic complexity, cost (area and/or energy) or performance overhead to be attractive for on-chip implementation. To better understand the preferred solution space, we define desirable features and evaluation metrics for QOS in a network-on-a-chip (NOC). Our insights lead us to propose a novel QOS system called Preemptive Virtual Clock (PVC). PVC provides strong guarantees, reduces packet delay variation, and enables efficient reclamation of idle network bandwidth without per-flow buffering at the routers and with minimal buffering at the source nodes. PVC averts priority inversion through preemption of lower-priority packets. By controlling preemption aggressiveness, PVC enables a trade-off between the strength of the guarantees and overall throughput. Finally, PVC simplifies network management through a flexible allocation mechanism that enables per-application bandwidth provisioning independent of thread count and supports transparent bandwidth recycling among an application's threads.

Patent
17 Aug 2009
TL;DR: In this article, a domain name server includes a processor configured to receive a request from a requester for an edge cache address, identify a first edge cache serving content requests to an anycast address from the requester, and determine the load of the first edge cache.
Abstract: A domain name server includes a processor configured to receive a request from a requester for an edge cache address, identify a first edge cache serving content requests to an anycast address from the requester, and determine the load of the first edge cache. The processor is further configured to provide the unicast address of an alternate edge cache to the requester in response to the request when the load exceeds a threshold, or to provide the anycast address to the requester when the load is below the threshold.
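The claimed decision is a simple load-threshold check in the name server. A minimal sketch of that logic follows, with invented addresses, load values, and data structures, since the patent does not prescribe an implementation:

```python
# Minimal sketch of the claimed decision: answer with the anycast address unless
# the edge cache currently serving this requester's anycast traffic is overloaded,
# in which case answer with an alternate cache's unicast address.
# All data structures and the load metric are invented for illustration.

ANYCAST_ADDR = "198.51.100.1"
EDGE_FOR_REQUESTER = {"203.0.113.7": "edge-east"}           # edge serving this requester's anycast traffic
EDGE_LOAD = {"edge-east": 0.92, "edge-west": 0.40}          # hypothetical utilization
UNICAST = {"edge-east": "192.0.2.10", "edge-west": "192.0.2.20"}
LOAD_THRESHOLD = 0.85

def resolve_edge_cache(requester_ip: str) -> str:
    first_edge = EDGE_FOR_REQUESTER.get(requester_ip, "edge-east")
    if EDGE_LOAD[first_edge] > LOAD_THRESHOLD:
        # pick the least-loaded alternate edge and return its unicast address
        alternate = min((e for e in EDGE_LOAD if e != first_edge), key=EDGE_LOAD.get)
        return UNICAST[alternate]
    return ANYCAST_ADDR

if __name__ == "__main__":
    print(resolve_edge_cache("203.0.113.7"))   # edge-east is over threshold -> unicast of edge-west
```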

Proceedings ArticleDOI
Zeshan A. Chishti, Alaa R. Alameldeen, Christopher B. Wilkerson, Wei Wu, Shih-Lien Lu
12 Dec 2009
TL;DR: A novel adaptive technique to improve cache lifetime reliability and enable low voltage operation, multi-bit segmented ECC (MS-ECC), which addresses both persistent and non-persistent failures.
Abstract: Voltage scaling is one of the most effective mechanisms to reduce microprocessor power consumption. However, the increased severity of manufacturing-induced parameter variations at lower voltages limits voltage scaling to a minimum voltage, Vccmin, below which a processor cannot operate reliably. Memory cell failures in large memory structures (e.g., caches) typically determine the Vccmin for the whole processor. Memory failures can be persistent (i.e., failures at time zero which cause yield loss) or non-persistent (e.g., soft errors or erratic bit failures). Both types of failures increase as supply voltage decreases, and both need to be addressed to achieve reliable operation at low voltages. In this paper, we propose a novel adaptive technique to improve cache lifetime reliability and enable low voltage operation. This technique, multi-bit segmented ECC (MS-ECC), addresses both persistent and non-persistent failures. Like previous work on mitigating persistent failures, MS-ECC trades off cache capacity for lower voltages. However, unlike previous schemes, MS-ECC does not rely on testing to identify and isolate defective bits, and therefore enables error tolerance for non-persistent failures like erratic bits and soft errors at low voltages. Furthermore, MS-ECC's design can allow the operating system to adaptively change the cache size and ECC capability to adjust to system operating conditions. Compared to current designs with single-bit correction, the most aggressive implementation of MS-ECC enables a 30% reduction in supply voltage, reducing power by 71% and energy per instruction by 42%.

Proceedings ArticleDOI
29 Sep 2009
TL;DR: This paper proposes to use Direct Memory Access (DMA), Master (MST) burst, and a dedicated Block RAM (BRAM) cache respectively to reduce the reconfiguration time by one order of magnitude.
Abstract: Run-time Partial Reconfiguration (PR) speed is significant in applications, especially when fast IP core switching is required. In this paper, we propose to use Direct Memory Access (DMA), Master (MST) burst, and a dedicated Block RAM (BRAM) cache respectively to reduce the reconfiguration time. Based on the Xilinx PR technology and the Internal Configuration Access Port (ICAP) primitive in the FPGA fabric, we discuss multiple design architectures and thoroughly investigate their performance with measurements for different partial bitstream sizes. Compared to the reference OPB HWICAP and XPS HWICAP designs, experimental results show that DMA HWICAP and MST HWICAP reduce the reconfiguration time by one order of magnitude, with little resource consumption overhead. The BRAM HWICAP design can even approach the reconfiguration speed limit of the ICAP primitive at the cost of large Block RAM utilization.

Patent
08 Oct 2009
TL;DR: In this paper, repeated data patterns are detected and encoded by compressed meta-data codes that are stored in meta-pattern entries in a meta-pattern cache of a meta-pattern flash block.
Abstract: A flash memory solid-state-drive (SSD) has a smart storage switch that reduces write acceleration that occurs when more data is written to flash memory than is received from the host. Page mapping rather than block mapping reduces write acceleration. Host commands are loaded into a Logical-Block-Address (LBA) range FIFO. Entries are sub-divided and portions invalidated when a new command overlaps an older command in the FIFO. Host data is aligned to page boundaries with pre- and post-fetched data filling in to the boundaries. Repeated data patterns are detected and encoded by compressed meta-data codes that are stored in meta-pattern entries in a meta-pattern cache of a meta-pattern flash block. The sector data is not written to flash. The meta-pattern entries are located using a meta-data mapping table. Storing host CRCs for comparison to incoming host data can detect identical data writes that can be skipped, avoiding a write to flash.

Book ChapterDOI
02 Dec 2009
TL;DR: It is shown that the combination of vector quantization and hidden Markov model cryptanalysis is a powerful tool for automated analysis of cache-timing data; it can be used to recover critical algorithm state such as key material.
Abstract: Cache-timing attacks are a serious threat to security-critical software. We show that the combination of vector quantization and hidden Markov model cryptanalysis is a powerful tool for automated analysis of cache-timing data; it can be used to recover critical algorithm state such as key material. We demonstrate its effectiveness by running an attack on the elliptic curve portion of OpenSSL (0.9.8k and under). This involves automated lattice attacks leading to key recovery within hours. We carry out the attack on live cache-timing data without simulating the side channel, showing these attacks are practical and realistic.

Journal ArticleDOI
TL;DR: Two techniques are presented that predict how the locality of a program changes with its input; they are among the first to enable quantitative analysis of whole-program locality in general sequential code.
Abstract: On modern computer systems, the memory performance of an application depends on its locality. For a single execution, locality-correlated measures like average miss rate or working-set size have long been analyzed using reuse distance, the number of distinct locations accessed between consecutive accesses to a given location. This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. The first is approximate reuse-distance measurement, which is asymptotically faster than exact methods while providing a guaranteed precision. The second is statistical prediction of locality in all executions of a program based on the analysis of a few executions. The prediction process has three steps: dividing data accesses into groups, finding the access patterns in each group, and building parameterized models. The resulting prediction may be used on-line with the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new techniques predicted program locality with good accuracy, even for test executions that are orders of magnitude larger than the training executions. The two techniques are among the first to enable quantitative analysis of whole-program locality in general sequential code. These findings form the basis for a unified understanding of program locality and its many facets. Concluding sections of the article present a taxonomy of related literature along five dimensions of locality and discuss the role of reuse distance in performance modeling, program optimization, cache and virtual memory management, and network traffic analysis.

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance and incurs only a 2.5% increase in bandwidth utilization; analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
Abstract: A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols, and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs. The Tagless Coherence Directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information. Conceptually, TL consists of a grid of small Bloom filters. The grid has one column per core and one row per cache set. TL uses 48% less area, 57% less leakage power, and 44% less dynamic energy than a conventional coherence directory for a 16-core CMP with 1MB private L2 caches. Simulations of commercial and scientific workloads indicate that TL has no statistically significant impact on performance, and incurs only a 2.5% increase in bandwidth utilization. Analytical modelling predicts that TL continues to scale well up to at least 1024 cores.
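The directory is described as a grid of small Bloom filters, one column per core and one row per cache set, giving a conservative (false-positive-only) view of sharers. The sketch below builds such a grid with invented sizing and hash functions; it shows only the sharing query, not the paper's exact hashing, sizing, or handling of evictions.

```python
# Sketch of a Bloom-filter-grid coherence directory: one small filter per
# (cache set, core). Filter size and hash functions are invented; the point is
# the conservative sharing query (no false negatives, possible false positives).

import hashlib

N_CORES, N_SETS, FILTER_BITS, N_HASHES = 16, 1024, 64, 2

grid = [[0] * N_CORES for _ in range(N_SETS)]     # each entry is a FILTER_BITS-wide bitmask

def _hashes(tag: int):
    digest = hashlib.sha256(tag.to_bytes(8, "little")).digest()
    return [digest[i] % FILTER_BITS for i in range(N_HASHES)]

def set_index(block_addr: int) -> int:
    return block_addr % N_SETS

def record_fill(core: int, block_addr: int):
    """Called when `core` caches `block_addr`: insert its tag into that set's filter."""
    s, tag = set_index(block_addr), block_addr // N_SETS
    for h in _hashes(tag):
        grid[s][core] |= 1 << h

def possible_sharers(block_addr: int):
    """Conservative sharer list: every core whose filter matches all hash bits."""
    s, tag = set_index(block_addr), block_addr // N_SETS
    bits = _hashes(tag)
    return [c for c in range(N_CORES) if all(grid[s][c] >> h & 1 for h in bits)]

if __name__ == "__main__":
    record_fill(core=3, block_addr=0xABCDE)
    print(possible_sharers(0xABCDE))    # includes core 3, possibly false positives
    print(possible_sharers(0x12345))    # likely empty
```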

Patent
03 Jun 2009
TL;DR: In this paper, the authors present a method and a node for finding the shortest path to a cache node in a content delivery network (CDN) comprising requested content and a method for creating a virtual representation of a network.
Abstract: Embodiments of the present invention provide a method and a node for finding the shortest path to a cache node in a content delivery network (CDN) comprising requested content, and a method for creating a virtual representation of a network. According to an embodiment of the present invention, the virtual representation is in the form of a virtual, hierarchical topology, and the cache nodes correspond to the cache nodes of the real network. All cache nodes are arranged at a first level, with the virtual nodes arranged at higher levels. In the virtual representation, all nodes (cache and virtual) are connected with virtual links such that there exists only one path between any two arbitrary cache nodes. Further, costs are assigned to the virtual links such that the path cost between any two arbitrary cache nodes in the virtual representation generally corresponds to the lowest path cost between the corresponding cache nodes in the real network.

Proceedings ArticleDOI
20 Apr 2009
TL;DR: A new set of feature-based cache eviction policies that achieve significant improvements over all previous methods, substantially narrowing the existing performance gap to the theoretically optimal (clairvoyant) method.
Abstract: Query processing is a major cost factor in operating large web search engines. In this paper, we study query result caching, one of the main techniques used to optimize query processing performance. Our first contribution is a study of result caching as a weighted caching problem. Most previous work has focused on optimizing cache hit ratios, but given that processing costs of queries can vary very significantly we argue that total cost savings also need to be considered. We describe and evaluate several algorithms for weighted result caching, and study the impact of Zipf-based query distributions on result caching. Our second and main contribution is a new set of feature-based cache eviction policies that achieve significant improvements over all previous methods, substantially narrowing the existing performance gap to the theoretically optimal (clairvoyant) method. Finally, using the same approach, we also obtain performance gains for the related problem of inverted list caching.
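Treating result caching as a weighted caching problem means eviction should weigh how costly a query is to re-execute, not just recency. As one classical baseline in that family, the sketch below implements a GreedyDual/Landlord-style cost-aware cache with an invented cost function; the paper's feature-based eviction policies go beyond this and are not reproduced here.

```python
# Minimal cost-aware result cache in the GreedyDual/Landlord family: each entry
# carries a credit equal to its (re)processing cost; eviction charges every entry
# the victim's remaining credit and removes the entry that reaches zero.
# Illustrative baseline only, not the paper's feature-based eviction policies.

class CostAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.credit = {}          # query -> remaining credit
        self.results = {}         # query -> cached result

    def get(self, query):
        if query in self.results:
            self.credit[query] = self.cost(query)      # refresh credit on a hit
            return self.results[query]
        return None

    def put(self, query, result):
        while len(self.results) >= self.capacity:
            victim = min(self.credit, key=self.credit.get)
            charge = self.credit[victim]
            for q in self.credit:                      # "age" all entries by the victim's credit
                self.credit[q] -= charge
            del self.credit[victim], self.results[victim]
        self.credit[query] = self.cost(query)
        self.results[query] = result

    @staticmethod
    def cost(query):
        return 1 + len(query.split())                  # stand-in for a measured processing cost

if __name__ == "__main__":
    c = CostAwareCache(capacity=2)
    c.put("cheap", [1]); c.put("a much more expensive query", [2]); c.put("new", [3])
    print(list(c.results))                             # the cheap entry was evicted first
```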

Proceedings ArticleDOI
20 Jun 2009
TL;DR: A novel technique, Memory Mapped ECC, is presented, which reduces the cost of providing error correction for SRAM caches and only dedicates SRAM for error detection while the ECC bits are stored within the memory hierarchy as data.
Abstract: This paper presents a novel technique, Memory Mapped ECC, which reduces the cost of providing error correction for SRAM caches. It is important to limit such overheads as processor resources become constrained and error propensity increases. The continuing decrease in SRAM cell size and the growing capacity of caches increases the likelihood of errors in SRAM arrays. To address this, redundant information can be used to correct a value after an error occurs. Information redundancy is typically provided through error-correcting codes (ECC), which append bits to every SRAM row and increase the array's area and energy consumption. We make three observations regarding error protection and utilize them in our architecture: (1) much of the data in a cache is replicated throughout the hierarchy and is inherently redundant; (2) error-detection is necessary for every cache access and is cheaper than error correction, which is very infrequent; (3) redundant information for correction need not be stored in high-cost SRAM. Our unique architecture only dedicates SRAM for error detection while the ECC bits are stored within the memory hierarchy as data. We associate a physical memory address with each cache line for ECC storage and rely on locality to minimize the impact. The cache is dynamically and transparently partitioned between data and ECC with the fraction of ECC growing with the number of dirty cache lines. We show that this has little impact on both performance (1.3% average and

Proceedings ArticleDOI
12 Oct 2009
TL;DR: A scheduling strategy for real-time tasks with both timing and cache space constraints is presented, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task.
Abstract: The major obstacle to using multicores for real-time applications is that we may not predict and provide any guarantee on real-time properties of embedded software on such platforms; the way of handling the on-chip shared resources such as L2 cache may have a significant impact on the timing predictability. In this paper, we propose to use cache space isolation techniques to avoid cache contention for hard real-time tasks running on multicores with shared caches. We present a scheduling strategy for real-time tasks with both timing and cache space constraints, which allows each task to use a fixed number of cache partitions, and makes sure that at any time a cache partition is occupied by at most one running task. In this way, the cache spaces of tasks are isolated at run-time. As technical contributions, we have developed a sufficient schedulability test for non-preemptive fixed-priority scheduling for multicores with shared L2 cache, encoded as a linear programming problem. To improve the scalability of the test, we then present our second schedulability test of quadratic complexity, which is an over approximation of the first test. To evaluate the performance and scalability of our techniques, we use randomly generated task sets. Our experiments show that the first test which employs an LP solver can easily handle task sets with thousands of tasks in minutes using a desktop computer. It is also shown that the second test is comparable with the first one in terms of precision, but scales much better due to its low complexity, and is therefore a good candidate for efficient schedulability tests in the design loop for embedded systems or as an on-line test for admission control.

Proceedings ArticleDOI
20 Apr 2009
TL;DR: It is shown in the simulation that a 37% increase in net benefits could be achieved over the standard method of full cache deployment to cache all POPs' traffic, that CDN traffic is much more efficient than P2P content, and that there is a large skew in the Air Miles between POPs in a typical network.
Abstract: This paper proposes and evaluates a Network Aware Forward Caching approach for determining the optimal deployment strategy of forward caches to a network. A key advantage of this approach is that we can reduce the network costs associated with forward caching to maximize the benefit obtained from their deployment. We show in our simulation that a 37% increase in net benefits could be achieved over the standard method of full cache deployment to cache all POPs' traffic. In addition, we show that this maximal point occurs when only 68% of the total traffic is cached. Another contribution of this paper is the analysis we use to motivate and evaluate this problem. We characterize the Internet traffic of 100K subscribers of a US residential broadband provider. We use both layer 4 and layer 7 analysis to investigate the traffic volumes of the flows as well as study the general characteristics of the applications used. We show that HTTP is a dominant protocol and accounts for 68% of the total downstream traffic and that 34% of that traffic is multimedia. In addition, we show that multimedia content using HTTP exhibits an 83% annualized growth rate and other HTTP traffic has a 53% growth rate versus the 26% overall annual growth rate of broadband traffic. This shows that HTTP traffic will become ever more dominant and increase the potential caching opportunities. Furthermore, we characterize the core backbone traffic of this broadband provider to measure the distance travelled by content and traffic. We find that CDN traffic is much more efficient than P2P content and that there is a large skew in the Air Miles between POPs in a typical network. Our findings show that there are many opportunities in broadband provider networks to optimize how traffic is delivered and cached.

Proceedings ArticleDOI
12 Dec 2009
TL;DR: A set of sophisticated benchmarks for latency and bandwidth measurements to arbitrary locations in the memory subsystem is presented, and the coherency state of cache lines is considered to analyze the cache coherency protocols and their performance impact.
Abstract: Across a broad range of applications, multicore technology is the most important factor that drives today's microprocessor performance improvements. Closely coupled is a growing complexity of the memory subsystems with several cache levels that need to be exploited efficiently to gain optimal application performance. Many important implementation details of these memory subsystems are undocumented. We therefore present a set of sophisticated benchmarks for latency and bandwidth measurements to arbitrary locations in the memory subsystem. We consider the coherency state of cache lines to analyze the cache coherency protocols and their performance impact. The potential of our approach is demonstrated with an in-depth comparison of ccNUMA multiprocessor systems with AMD (Shanghai) and Intel (Nehalem-EP) quad-core x86-64 processors that both feature integrated memory controllers and coherent point-to-point interconnects. Using our benchmarks we present fundamental memory performance data and architectural properties of both processors. Our comparison reveals in detail how the microarchitectural differences tremendously affect the performance of the memory subsystem.

Proceedings ArticleDOI
26 Apr 2009
TL;DR: A new timing simulator is presented that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion).
Abstract: For academic computer architecture research, a large number of publicly available simulators make use of relatively simple abstractions for the microarchitecture of the processor pipeline. For some types of studies, such as those for multi-core cache coherence designs, a simple pipeline model may suffice. For detailed microarchitecture research, such as those that are sensitive to the exact behavior of out-of-order scheduling, ALU and bypass network contention, and resource management (e.g., RS and ROB entries), an over-simplified model is not representative of modern processor organizations. We present a new timing simulator that models a modern x86 microarchitecture at a very low level, including out-of-order scheduling and execution that much more closely mirrors current implementations, a detailed cache/memory hierarchy, as well as many x86-specific microarchitecture features (e.g., simple vs. complex decoders, micro-op decomposition and fusion, microcode lookup overhead for long/complex x86 instructions).