Showing papers on "Cache published in 2008"

PDF

Open Access

Proceedings Article•

Avoiding the disk bottleneck in the data domain deduplication file system

[...]

Benjamin Zhu¹, Kai Li², Hugo Patterson¹•Institutions (2)

26 Feb 2008

TL;DR: Three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck are described, which enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/ sec for multi- stream throughput.

...read moreread less

Abstract: Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

...read moreread less

934 citations

Proceedings Article•DOI•

Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems

[...]

Jiang Lin¹, Qingda Lu², Xiaoning Ding², Zhao Zhang¹, Xiaodong Zhang², P. Sadayappan² - Show less +2 more•Institutions (2)

Iowa State University¹, Ohio State University²

24 Oct 2008

TL;DR: This paper has comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS) and provides new insights into dynamic behaviors and interaction effects.

...read moreread less

Abstract: Cache partitioning and sharing is critical to the effective utilization of multicore processors. However, almost all existing studies have been evaluated by simulation that often has several limitations, such as excessive simulation time, absence of OS activities and proneness to simulation inaccuracy. To address these issues, we have taken an efficient software approach to supporting both static and dynamic cache partitioning in OS through memory address mapping. We have comprehensively evaluated several representative cache partitioning schemes with different optimization objectives, including performance, fairness, and quality of service (QoS). Our software approach makes it possible to run the SPEC CPU2006 benchmark suite to completion. Besides confirming important conclusions from previous work, we are able to gain several insights from whole-program executions, which are infeasible from simulation. For example, giving up some cache space in one program to help another one may improve the performance of both programs for certain workloads due to reduced contention for memory bandwidth. Our evaluation of previously proposed fairness metrics is also significantly different from a simulation-based study. The contributions of this study are threefold. (1) To the best of our knowledge, this is a highly comprehensive execution- and measurement-based study on multicore cache partitioning. This paper not only confirms important conclusions from simulation-based studies, but also provides new insights into dynamic behaviors and interaction effects. (2) Our approach provides a unique and efficient option for evaluating multicore cache partitioning. The implemented software layer can be used as a tool in multicore performance evaluation and hardware design. (3) The proposed schemes can be further refined for OS kernels to improve performance.

...read moreread less

382 citations

Proceedings Article•DOI•

Predicting faults from cached history

[...]

Sunghun Kim¹, Thomas Zimmermann², E. James Whitehead³, Andreas Zeller²•Institutions (3)

Massachusetts Institute of Technology¹, Saarland University², University of California, Santa Cruz³

19 Feb 2008

TL;DR: In the evaluation of seven open source projects with more than 200,000 revisions, the cache selects 10% of the source code files; these files account for 73%-95% of faults--a significant advance beyond the state of the art.

...read moreread less

Abstract: We analyze the version history of 7 software systems to predict the most fault prone entities and files. The basic assumption is that faults do not occur in isolation, but rather in bursts of several related faults. Therefore, we cache locations that are likely to have faults: starting from the location of a known (fixed) fault, we cache the location itself, any locations changed together with the fault, recently added locations, and recently changed locations. By consulting the cache at the moment a fault is fixed, a developer can detect likely fault-prone locations. This is useful for prioritizing verification and validation resources on the most fault prone files or entities. In our evaluation of seven open source projects with more than 200,000 revisions, the cache selects 10% of the source code files; these files account for 73%-95% of faults--a significant advance beyond the state of the art

...read moreread less

364 citations

Journal Article•DOI•

High-performance implementation of the level-3 BLAS

[...]

Kazushige Goto¹, Robert A. van de Geijn¹•Institutions (1)

University of Texas at Austin¹

25 Jul 2008-ACM Transactions on Mathematical Software

TL;DR: A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-Matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented.

...read moreread less

Abstract: A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.

...read moreread less

349 citations

Patent•

Command queuing smart storage transfer manager for striping data to raw-NAND flash modules

[...]

Frank Yu, Charles C. Lee, Abraham C. Ma

15 Oct 2008

TL;DR: In this article, a flash module has raw-NAND flash memory chips accessed over a physical-block address (PBA) bus by a NVM controller, where data striping and interleaving among multiple channels of the flash modules is controlled at a high level by a smart storage transaction manager.

...read moreread less

Abstract: A flash module has raw-NAND flash memory chips accessed over a physical-block address (PBA) bus by a NVM controller. The NVM controller is on the flash module or on a system board for a solid-state disk (SSD). The NVM controller converts logical block addresses (LBA) to physical block addresses (PBA). Data striping and interleaving among multiple channels of the flash modules is controlled at a high level by a smart storage transaction manager, while further interleaving and remapping within a channel may be performed by the NVM controllers. A SDRAM buffer is used by a smart storage switch to cache host data before writing to flash memory. A Q-R pointer table stores quotients and remainders of division of the host address. The remainder points to a location of the host data in the SDRAM. A command queue stores Q, R for host commands.

...read moreread less

341 citations

Proceedings Article•DOI•

Adaptive insertion policies for managing shared caches

[...]

Aamer Jaleel¹, William C. Hasenplaugh¹, Moinuddin K. Qureshi², Julien Sebot¹, Simon C. Steely¹, Joel Emer¹ - Show less +2 more•Institutions (2)

Intel¹, IBM²

25 Oct 2008

TL;DR: This paper proposes Thread-Aware Dynamic Insertion Policy (TADIP), a adaptive insertion policy that can take into account the memory requirements of each of the concurrently executing applications and provides performance benefits similar to doubling the size of an LRU-managed cache.

...read moreread less

Abstract: Chip Multiprocessors (CMPs) allow different applications to concurrently execute on a single chip. When applications with differing demands for memory compete for a shared cache, the conventional LRU replacement policy can significantly degrade cache performance when the aggregate working set size is greater than the shared cache. In such cases, shared cache performance can be significantly improved by preserving the entire working set of applications that can co-exist in the cache and preserving some portion of the working set of the remaining applications. This paper investigates the use of adaptive insertion policies to manage shared caches. We show that directly extending the recently proposed dynamic insertion policy (DIP) is inadequate for shared caches since DIP is unaware of the characteristics of individual applications. We propose Thread-Aware Dynamic Insertion Policy (TADIP) that can take into account the memory requirements of each of the concurrently executing applications. Our evaluation with multi-programmed workloads for 2-core, 4-core, 8-core, and 16-core CMPs show that a TADIP-managed shared cache improves overall throughput by as much as 94%, 64%, 26%, and 16% respectively (on average 14%, 18%, 15%, and 17%) over the baseline LRU policy. The performance benefit of TADIP is 2.6x compared to DIP and 1.3x compared to the recently proposed Utility-based Cache Partitioning (UCP) scheme. We also show that a TADIP-managed shared cache provides performance benefits similar to doubling the size of an LRU-managed cache. Furthermore, TADIP requires a total storage overhead of less than two bytes per core, does not require changes to the existing cache structure, and performs similar to LRU for LRU friendly workloads.

...read moreread less

321 citations

Journal Article•DOI•

Trading off Cache Capacity for Reliability to Enable Low Voltage Operation

[...]

Christopher B. Wilkerson¹, Hongliang Gao², Alaa R. Alameldeen¹, Zeshan A. Chishti¹, Muhammad M. Khellah¹, Shih-Lien Lu¹ - Show less +2 more•Institutions (2)

Intel¹, University of Central Florida²

01 Jun 2008

TL;DR: Two architectural techniques are proposed that enable microprocessor caches (L1 and L2), to operate at low voltages despite very high memory cell failure rates, and enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%.

...read moreread less

Abstract: One of the most effective techniques to reduce a processor’s power consumption is to reduce supply voltage. However, reducing voltage in the context of manufacturing-induced parameter variations cancause many types of memory circuits to fail. As a result, voltage scaling is limited by a minimum voltage, often called Vccmin, beyond which circuits may not operate reliably. Large memory structures (e.g., caches) typically set Vccmin for the whole processor. In this paper, we propose two architectural techniques that enable microprocessor caches (L1and L2), to operate at low voltages despite very high memory cell failure rates. The Word-disable scheme combines two consecutive cache lines, to form a single cache line where only non-failing words are used. The Bit-fix scheme uses a quarter of the ways in a cache set to store positions and fix bits for failing bits in other ways of the set. During high voltage operation, both schemes allow use of the entire cache. During low voltage operation, they sacrifice cache capacity by 50% and 25%, respectively, to reduce Vccmin below 500mV. Compared to current designs with a Vccmin of 825mV, our schemes enable a 40% voltage reduction, which reduces power by 85% and energy per instruction (EPI) by 53%

...read moreread less

289 citations

Proceedings Article•DOI•

Accelerating two-dimensional page walks for virtualized systems

[...]

Ravi Bhargava¹, Benjamin C. Serebrin¹, Francesco Spadini¹, Srilatha Manne¹•Institutions (1)

Advanced Micro Devices¹

01 Mar 2008

TL;DR: An in-depth examination of the 2D page table walk overhead and options for decreasing it is presented, which includes using the AMD Opteron processor's page walk cache to exploit the strong reuse of page entry references.

...read moreread less

Abstract: Nested paging is a hardware solution for alleviating the software memory management overhead imposed by system virtualization. Nested paging complements existing page walk hardware to form a two-dimensional (2D) page walk, which reduces the need for hypervisor intervention in guest page table management. However, the extra dimension also increases the maximum number of architecturally-required page table references.This paper presents an in-depth examination of the 2D page table walk overhead and options for decreasing it. These options include using the AMD Opteron processor's page walk cache to exploit the strong reuse of page entry references. For a mix of server and SPEC benchmarks, the presented results show a 15%-38% improvement in guest performance by extending the existing page walk cache to also store the nested dimension of the 2D page walk. Caching nested page table translations and skipping multiple page entry references produce an additional 3%-7% improvement.Much of the remaining 2D page walk overhead is due to low-locality nested page entry references, which result in additional memory hierarchy misses. By using large pages, the hypervisor can eliminate many of these long-latency accesses and further improve the guest performance by 3%-22%.

...read moreread less

280 citations

Proceedings Article•DOI•

A novel cache architecture with enhanced performance and security

[...]

Zhenghong Wang¹, Ruby B. Lee¹•Institutions (1)

Princeton University¹

08 Nov 2008

TL;DR: The results show that the proposed cache architecture has low miss rates comparable to a highly associative cache and short access times and power efficiency close to that of a direct-mapped cache, and can thwart cache-based software side-channel attacks, providing both legacy and security-enhanced software a much higher degree of security.

...read moreread less

Abstract: Caches ideally should have low miss rates and short access times, and should be power efficient at the same time. Such design goals are often contradictory in practice. Recent findings on efficient attacks based on information leakage in caches have also brought the security issue up front. Design for security introduces even more restrictions and typically leads to significant performance degradation. This paper presents a novel cache architecture that can simultaneously achieve the above goals. Specifically, cache miss rates are reduced with dynamic remapping and longer cache indices, access-time overhead overcome with astute low-level circuit design, and information leakage thwarted by a security-aware cache replacement algorithm together with the performance enhancing mechanisms. We present both theoretical analysis and experimental results, using the SPEC2000 suite to evaluate the cache miss behavior, and CACTI and HSPICE to validate the circuit design. Our results show that the proposed cache architecture has low miss rates comparable to a highly associative cache and short access times and power efficiency close to that of a direct-mapped cache. At the same time it can thwart cache-based software side-channel attacks, providing both legacy and security-enhanced software a much higher degree of security. Additional benefits that the proposed cache architecture can bring, like fault tolerance and hot-spot mitigation, are also discussed briefly.

...read moreread less

278 citations

Patent•

Multimedia visual progress indication system

[...]

Robert P. Vallone, Howard D. Look, Ain Mckendrick

04 Mar 2008

TL;DR: A multimedia visual progress indication system that provides a cache bar that is overlaid onto the program material or displayed on a dedicated display is presented in this paper, where the cache bar indicates the length of a recording session or the duration of stored program material and expands to the right when material is being recorded.

...read moreread less

Abstract: A multimedia visual progress indication system that provides a cache bar that is overlaid onto the program material or displayed on a dedicated display. A cache bar indicates the length of a recording session or the length of stored program material and expands to the right when material is being recorded. Index and/or bookmark indicators are displayed next to the cache bar. A position indicator moves within the cache bar and tells the user visually where his current position is within the program material. Numeric time or counter mark of the current position is displayed in the vicinity of the cache bar. The trick play bar and its associated components are displayed for a predetermined time period.

...read moreread less

268 citations

Journal Article•DOI•

Approximation Algorithms for Data Placement Problems

[...]

Ivan D. Baev, Rajmohan Rajaraman, Chaitanya Swamy¹•Institutions (1)

University of Waterloo¹

01 Jul 2008-SIAM Journal on Computing

TL;DR: This work develops approximation algorithms for the problem of placing replicated data in arbitrary networks, where the nodes may both issue requests for data objects and have capacity for storing data objects so as to minimize the average data-access cost.

...read moreread less

Abstract: We develop approximation algorithms for the problem of placing replicated data in arbitrary networks, where the nodes may both issue requests for data objects and have capacity for storing data objects so as to minimize the average data-access cost. We introduce the data placement problem to model this problem. We have a set of caches $\mathcal{F}$, a set of clients $\mathcal{D}$, and a set of data objects $\mathcal{O}$. Each cache $i$ can store at most $u_i$ data objects. Each client $j\in\mathcal{D}$ has demand $d_j$ for a specific data object $o(j)\in\mathcal{O}$ and has to be assigned to a cache that stores that object. Storing an object $o$ in cache $i$ incurs a storage cost of $f_i^o$, and assigning client $j$ to cache $i$ incurs an access cost of $d_jc_{ij}$. The goal is to find a placement of the data objects to caches respecting the capacity constraints, and an assignment of clients to caches so as to minimize the total storage and client access costs. We present a 10-approximation algorithm for this problem. Our algorithm is based on rounding an optimal solution to a natural linear-programming relaxation of the problem. One of the main technical challenges encountered during rounding is to preserve the cache capacities while incurring only a constant-factor increase in the solution cost. We also introduce the connected data placement problem to capture settings where write-requests are also issued for data objects, so that one requires a mechanism to maintain consistency of data. We model this by requiring that all caches containing a given object be connected by a Steiner tree to a root for that object, which issues a multicast message upon a write to (any copy of) that object. The total cost now includes the cost of these Steiner trees. We devise a 14-approximation algorithm for this problem. We show that our algorithms can be adapted to handle two variants of the problem: (a) a $k$-median variant, where there is a specified bound on the number of caches that may contain a given object, and (b) a generalization where objects have lengths and the total length of the objects stored in any cache must not exceed its capacity.

...read moreread less

Journal Article•DOI•

A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies

[...]

Shyamkumar Thoziyoor¹, Jung Ho Ahn², Matteo Monchiero², Jay B. Brockman¹, Norman P. Jouppi² - Show less +1 more•Institutions (2)

University of Notre Dame¹, Hewlett-Packard²

01 Jun 2008

TL;DR: This study uses CACTI-D to model all components of the memory hierarchy including L1, L2, last level SRAM, logic process based DRAM or commodity DRAM L3 caches, and main memory DRAM chips and finds that commodity DRam technology is most attractive for stacked last level caches, with significantly lower energy-delay products.

...read moreread less

Abstract: In this paper we introduce CACTI-D, a significant enhancement of CACTI 5.0. CACTI-D adds support for modeling of commodity DRAM technology and support for main memory DRAM chip organization. CACTI-D enables modeling of the complete memory hierarchy with consistent models all the way from SRAM based L1 caches through main memory DRAMs on DIMMs. We illustrate the potential applicability of CACTI-D in the design and analysis of future memory hierarchies by carrying out a last level cache study for a multicore multithreaded architecture at the 32nm technology node. In this study we use CACTI-D to model all components of the memory hierarchy including L1, L2, last level SRAM, logic process based DRAM or commodity DRAM L3 caches, and main memory DRAM chips. We carry out architectural simulation using benchmarks with large data sets and present results of their execution time, breakdown of power in the memory hierarchy, and system energy-delay product for the different system configurations. We find that commodity DRAM technology is most attractive for stacked last level caches, with significantly lower energy-delay products.

...read moreread less

Journal Article•DOI•

Counter-Based Cache Replacement and Bypassing Algorithms

[...]

Mazen Kharbutli¹, Yan Solihin²•Institutions (2)

Jordan University of Science and Technology¹, North Carolina State University²

01 Apr 2008-IEEE Transactions on Computers

TL;DR: A new counter-based approach to deal with cache pollution, predicting lines that have become dead and replacing them early from the L2 cache and identifying never-reaccessed lines, which is augmented with an event counter that is incremented when an event of interest such as certain cache accesses occurs.

...read moreread less

Abstract: Recent studies have shown that, in highly associative caches, the performance gap between the least recently used (LRU) and the theoretical optimal replacement algorithms is large, motivating the design of alternative replacement algorithms to improve cache performance. In LRU replacement, a line, after its last use, remains in the cache for a long time until it becomes the LRU line. Such deadlines unnecessarily reduce the cache capacity available for other lines. In addition, in multilevel caches, temporal reuse patterns are often inverted, showing in the L1 cache but, due to the filtering effect of the L1 cache, not showing in the L2 cache. At the L2, these lines appear to be brought in the cache but are never reaccessed until they are replaced. These lines unnecessarily pollute the L2 cache. This paper proposes a new counter-based approach to deal with the above problems. For the former problem, we predict lines that have become dead and replace them early from the L2 cache. For the latter problem, we identify never-reaccessed lines, bypass the L2 cache, and place them directly in the L1 cache. Both techniques are achieved through a single counter-based mechanism. In our approach, each line in the L2 cache is augmented with an event counter that is incremented when an event of interest such as certain cache accesses occurs. When the counter reaches a threshold, the line ";expires"; and becomes replaceable. Each line's threshold is unique and is dynamically learned. We propose and evaluate two new replacement algorithms: Access interval predictor (AIP) and live-time predictor (LvP). AIP and LvP speed up 10 capacity-constrained SPEC2000 benchmarks by up to 48 percent and 15 percent on average (7 percent on average for the whole 21 Spec2000 benchmarks). Cache bypassing further reduces L2 cache pollution and improves the average speedups to 17 percent (8 percent for the whole 21 Spec2000 benchmarks).

...read moreread less

Patent•

Targeted Caching to Reduce Bandwidth Consumption

[...]

Alan L. Glasser¹•Institutions (1)

AT&T¹

27 Aug 2008

TL;DR: In this article, a system includes a name server, an edge cache server and a local cache server, which is configured to respond to the anycast IP address and a unicast IP address.

...read moreread less

Abstract: A system includes a name server, an edge cache server, and a local cache server. The name server is configured to provide an anycast IP address in response to a request for an IP address of an origin hostname from a client system. The edge cache server is configured to respond to the anycast IP address and a unicast IP address and to retrieve content from an origin. The local cache server includes a storage and is configured to respond to the anycast IP address, to retrieve content from the edge cache server, and provide the content to a client system.

...read moreread less

Proceedings Article•DOI•

Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency

[...]

Haiming Liu¹, Michael Ferdman², Jaehyuk Huh³, Doug Burger⁴•Institutions (4)

University of Texas at Austin¹, Carnegie Mellon University², Advanced Micro Devices³, Microsoft⁴

08 Nov 2008

TL;DR: This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block, and evaluates three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching.

...read moreread less

Abstract: Data caches in general-purpose microprocessors often contain mostly dead blocks and are thus used inefficiently. To improve cache efficiency, dead blocks should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed; however, these schemes yield lower prediction accuracy and coverage. Instead, we find that predicting the death of a block when it just moves out of the MRU position gives the best tradeoff between timeliness and prediction accuracy/coverage. Furthermore, the individual reference history of a block in the L1 cache can be irregular because of data/control dependence. This paper proposes a new class of dead-block predictors that predict dead blocks based on bursts of accesses to a cache block. A cache burst begins when a block becomes MRU and ends when it becomes non-MRU. Cache bursts are more predictable than individual references because they hide the irregularity of individual references. When used at the L1 cache, the best burst-based predictor can identify 96% of the dead blocks with a 96% accuracy. With the improved dead-block predictors, we evaluate three ways to increase cache efficiency by eliminating dead blocks early: replacement optimization, bypassing, and prefetching. The most effective approach, prefetching into dead blocks, increases the average L1 efficiency from 8% to 17% and the L2 efficiency from 17% to 27%. This increased cache efficiency translates into higher overall performance: prefetching into dead blocks outperforms the same prefetch scheme without dead-block prediction by 12% at the L1 and by 13% at the L2.

...read moreread less

Proceedings Article•DOI•

All-pairs shortest-paths for large graphs on the GPU

[...]

Gary J. Katz¹, Joseph T. Kider¹•Institutions (1)

University of Pennsylvania¹

20 Jun 2008

TL;DR: A shared memory cache efficient GPU implementation to solve transitive closure and the all-pairs shortest-path problem on directed graphs for large datasets using the NVIDIA G80 GPU architecture using the CUDA API is described.

...read moreread less

Abstract: The all-pairs shortest-path problem is an intricate part in numerous practical applications. We describe a shared memory cache efficient GPU implementation to solve transitive closure and the all-pairs shortest-path problem on directed graphs for large datasets. The proposed algorithmic design utilizes the resources available on the NVIDIA G80 GPU architecture using the CUDA API. Our solution generalizes to handle graph sizes that are inherently larger then the DRAM memory available on the GPU. Experiments demonstrate that our method is able to significantly increase processing large graphs making our method applicable for bioinformatics, internet node traffic, social networking, and routing problems.

...read moreread less

Patent•

Client side caching of synchronized data

[...]

Nisha K. Nair¹, Dinesh K. Nirmal¹, Sandhya C. Turaga¹, David J. Wisneski¹•Institutions (1)

IBM¹

07 Apr 2008

TL;DR: In this paper, the authors propose a method for synchronizing a database with data stored at a client by providing a data feed to receive data by the client from the database and providing received data, caching the received data in a client side cache to provide client side cached data.

...read moreread less

Abstract: A method for synchronizing a database with data stored at a client includes providing a data feed to receive data by the client from the database and provide received data, caching the received data in a client side cache to provide client side cached data, detecting a database change to data within the database corresponding to the client side cached data according to a polling operation to provide a change event, pushing the change event to the client side cached data to update the client side cached data in accordance with the database change and the polling operation, requesting further data from the database, determining whether the further data includes data of the client side cached data to determine remaining data exclusive of the client side cached data and pushing the remaining data to the client side cached data.

...read moreread less

Patent•

Method for Compressed Data with Reduced Dictionary Sizes by Coding Value Prefixes

[...]

Pablo Diaz-Gutierrez¹, Vijayshankar Raman¹, Garret Swart¹•Institutions (1)

IBM¹

08 Jan 2008

TL;DR: In this article, a value prefix coding scheme is presented, wherein value prefixes are stored in the dictionary to get good compression from small dictionaries, and an algorithm is presented to determine the optimal entries for the value prefix dictionary.

...read moreread less

Abstract: The speed of dictionary based decompression is limited by the cost of accessing random values in the dictionary. If the size of the dictionary can be limited so it fits into cache, decompression is made to be CPU bound rather than memory bound. To achieve this, a value prefix coding scheme is presented, wherein value prefixes are stored in the dictionary to get good compression from small dictionaries. Also presented is an algorithm that determines the optimal entries for a value prefix dictionary. Once the dictionary fits in cache, decompression speed is often limited by the cost of mispredicted branches during Huffman code processing. A novel way is presented to quantize Huffman code lengths to allow code processing to be performed with few instructions, no branches, and very little extra memory. Also presented is an algorithm for code length quantization that produces the optimal assignment of Huffman codes and show that the adverse effect of quantization on the compression ratio is quite small.

...read moreread less

Journal Article•DOI•

Benefit-Based Data Caching in Ad Hoc Networks

[...]

Bin Tang¹, Himanshu Gupta², Samir R. Das²•Institutions (2)

Wichita State University¹, Stony Brook University²

01 Mar 2008-IEEE Transactions on Mobile Computing

TL;DR: This article presents a polynomial-time centralized approximation algorithm that provably delivers a solution whose benefit is at least 1/4 (1/2 for uniform-size data items) of the optimal benefit of the cache placement problem of minimizing total data access cost in ad hoc networks with multiple data items and nodes with limited memory capacity.

...read moreread less

Abstract: Data caching can significantly improve the efficiency of information access in a wireless ad hoc network by reducing the access latency and bandwidth usage. However, designing efficient distributed caching algorithms is nontrivial when network nodes have limited memory. In this article, we consider the cache placement problem of minimizing total data access cost in ad hoc networks with multiple data items and nodes with limited memory capacity. The above optimization problem is known to be NP-hard. Defining benefit as the reduction in total access cost, we present a polynomial-time centralized approximation algorithm that provably delivers a solution whose benefit is at least 1/4 (1/2 for uniform-size data items) of the optimal benefit. The approximation algorithm is amenable to localized distributed implementation, which is shown via simulations to perform close to the approximation algorithm. Our distributed algorithm naturally extends to networks with mobile nodes. We simulate our distributed algorithm using a network simulator (ns2) and demonstrate that it significantly outperforms another existing caching technique (by Yin and Cao [33]) in all important performance metrics. The performance differential is particularly large in more challenging scenarios such as higher access frequency and smaller memory.

...read moreread less

Proceedings Article•DOI•

Analysis and approximation of optimal co-scheduling on chip multiprocessors

[...]

Yunlian Jiang¹, Xipeng Shen¹, Jie Chen², Rahul Tripathi³•Institutions (3)

College of William & Mary¹, Thomas Jefferson National Accelerator Facility², University of South Florida³

25 Oct 2008

TL;DR: This paper presents a theoretical analysis of the complexity of co-scheduling, proving its NP-completeness and designs and evaluates a sequence of approximation algorithms, among which, the hierarchical matching algorithm produces near-optimal schedules and shows good scalability.

...read moreread less

Abstract: Cache sharing among processors is important for Chip Multiprocessors to reduce inter-thread latency, but also brings cache contention, degrading program performance considerably. Recent studies have shown that job co-scheduling can effectively alleviate the contention, but it remains an open question how to efficiently find optimal co-schedules. Solving the question is critical for determining the potential of a co-scheduling system. This paper presents a theoretical analysis of the complexity of co-scheduling, proving its NP-completeness. Furthermore, for a special case when there are two sharers per chip, we propose an algorithm that finds the optimal co-schedules in polynomial time. For more complex cases, we design and evaluate a sequence of approximation algorithms, among which, the hierarchical matching algorithm produces near-optimal schedules and shows good scalability. This study facilitates the evaluation of co-scheduling systems, as well as offers some techniques directly usable in proactive job co-scheduling.

...read moreread less

Proceedings Article•DOI•

Watch Global, Cache Local: YouTube Network Traffic at a Campus Network - Measurements and Implications

[...]

Michael Zink¹, Kyoungwon Suh², Yu Gu¹, Jim Kurose¹•Institutions (2)

University of Massachusetts Amherst¹, Illinois State University²

27 Jan 2008

TL;DR: In this article, a measurement study of YouTube traffic in a large university campus network was conducted and the results of these simulations show that client-based local caching, P2P-based distribution, and proxy caching can reduce network traffic significantly and allow faster access to video clips.

...read moreread less

Abstract: Web services such as YouTube which allow the distribution of user-produced media have recently become very popular YouTube-like services are different from existing traditional VoD services because the service provider has only limited control over the creation of new content We analyze how the content distribution in YouTube is realized and then conduct a measurement study of YouTube traffic in a large university campus network The analysis of the traffic shows that: (1) No strong correlation is observed between global and local popularity; (2) neither time scale nor user population has an impact on the local popularity distribution; (3) video clips of local interest have a high local popularity Using our measurement data to drive trace-driven simulations, we also demonstrate the implications of alternative distribution infrastructures on the performance of a YouTube-like VoD service The results of these simulations show that client-based local caching, P2P-based distribution, and proxy caching can reduce network traffic significantly and allow faster access to video clips

...read moreread less

Patent•

Resilient service quality in a managed multimedia delivery network

[...]

Andrey Kisel¹, David Cecil Robinson¹, Tiaan Schutte¹•Institutions (1)

Alcatel-Lucent¹

30 Oct 2008

TL;DR: In this article, a managed multimedia delivery network for providing a multimedia service with resilient service quality is disclosed, comprising a plurality of caching nodes (3) for caching multimedia data segments; an edge caching node (4) for colleting requested multimedia data segment from the caching nodes and serving a user equipment (2) with the collected multimedia data; and a service gateway (5) for providing cache information to the edge caching nodes indicating how to obtain the requested multimedia segments from caching nodes.

...read moreread less

Abstract: A managed multimedia delivery network (1) for providing a multimedia service with resilient service quality is disclosed. The network comprising a plurality of caching nodes (3) for caching multimedia data segments; an edge caching node (4) for colleting requested multimedia data segments from the caching nodes (3) and for serving a user equipment (2) with the collected multimedia data; and a service gateway (5) for providing cache information to the edge caching node (4) indicating how to obtain the requested multimedia data segments from the caching nodes (3). The edge caching node (4) comprises a service quality monitoring unit for monitoring the collection of the data segments from the caching nodes (3) and for requesting cache information from the service gateway (5) when the collection of data segments impacts the service quality.

...read moreread less

Proceedings Article•DOI•

WCET Analysis for Multi-Core Processors with Shared L2 Instruction Caches

[...]

Jun Yan¹, Wei Zhang¹•Institutions (1)

Southern Illinois University Carbondale¹

22 Apr 2008

TL;DR: The proposed approach can reasonably estimate the worst- case shared L2 instruction cache misses by considering inter-thread instruction conflicts and the WCET of applications running on multi-core processors estimated by the approach is much better than the estimation by simply assuming all L2 instructions are misses.

...read moreread less

Abstract: Multi-core chips have been increasingly adopted by microprocessor industry. For real-time systems to safely harness the potential of multi-core computing, designers must be able to accurately obtain the worst-case execution time (WCET) of applications running on multi-core platforms, which is very challenging due to the possible runtime inter-core interferences in using shared resources such as the shared L2 caches. As the first step toward time-predictable multi-core computing, this paper presents a novel approach to bounding the worst-case performance for threads running on multi-core processors with shared L2 instruction caches. The idea of our approach is to compute the worst-case instruction access interferences between different threads based on the program control flow information of each thread, which can be statically analyzed. Our experiments indicate that the proposed approach can reasonably estimate the worst- case shared L2 instruction cache misses by considering inter-thread instruction conflicts. Also, the WCET of applications running on multi-core processors estimated by our approach is much better than the estimation by simply assuming all L2 instruction accesses are misses.

...read moreread less

Patent•

Hybrid unicast/anycast content distribution network system

[...]

Alexandre Gerber¹, Oliver Spatscheck¹, Jacobus Van der Merwe¹•Institutions (1)

AT&T¹

11 Nov 2008

TL;DR: In this article, the authors propose a method to compare an edge cache address to an anycast group, and then determine an optimal edge cache server and a unicast address of the optimal cache server when the requester address is not in the anycast groups.

...read moreread less

Abstract: A method includes receiving a request for an edge cache address, and comparing a requester address to an anycast group. The method can further include providing an anycast edge cache address when the requestor address is in the anycast group. Alternatively, the method can further include determining an optimal cache server, and providing a unicast address of the optimal cache server when the requester address is not in the anycast group.

...read moreread less

Proceedings Article•DOI•

FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue

[...]

John Giacomoni¹, Tipp Moseley¹, Manish Vachharajani¹•Institutions (1)

University of Colorado Boulder¹

20 Feb 2008

TL;DR: FastForward is presented, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models, with up to 5x faster than the next best solution.

...read moreread less

Abstract: Low overhead core-to-core communication is critical for efficient pipeline-parallel software applications. This paper presents FastForward, a cache-optimized single-producer/single-consumer concurrent lock-free queue for pipeline parallelism on multicore architectures, with weak to strongly ordered consistency models. Enqueue and dequeue times on a 2.66 GHz Opteron 2218 based system are as low as 28.5 ns, up to 5x faster than the next best solution. FastForward's effectiveness is demonstrated for real applications by applying it to line-rate soft network processing on Gigabit Ethernet with general purpose commodity hardware.

...read moreread less

Patent•

Content delivery system, cache server, and cache control server

[...]

Kenji Fujihira¹, Daisuke Matsubara¹, Yukiko Takeda¹•Institutions (1)

Hitachi¹

14 Jul 2008

TL;DR: In this paper, cache delete priority assignment is performed from a position where a user finished playback based on whether the user intends to view the content later, and a cache delete inhibit span is determined based on a playback stop position or a normal speed playback time.

...read moreread less

Abstract: In a cache control assuming plural user terminals accessing identical content, cache delete priority assignment is performed from a position where a user finished playback based on whether the user intends to view the content later. A cache control server is provided, and a cache delete inhibit span is determined based on a playback stop position or a normal speed playback time. A cache server deletes the cache based on the delete inhibit span received from the cache control server. Traffic of the core network due to re-cache can thus be reduced.

...read moreread less

Patent•

System and method for reducing latency in call setup and teardown

[...]

Rajat Ghai¹, Jim Towey¹•Institutions (1)

Cisco Systems, Inc.¹

07 Jul 2008

TL;DR: In this paper, a network device with integrated functionalities and a cache is provided that stores policy information to reduce the amount of signaling that is necessary to setup and teardown sessions.

...read moreread less

Abstract: Systems and methods for reducing latency in call setup and teardown are provided. A network device with integrated functionalities and a cache is provided that stores policy information to reduce the amount of signaling that is necessary to setup and teardown sessions. By handling various aspects of the setup and teardown within a network device, latency is reduced and the amount of bandwidth needed for setup signaling is also reduced.

...read moreread less

Patent•

Operation of a non-volatile memory array

[...]

Kobi Danon¹, Shai Eisen¹, Marcelo Krygier¹•Institutions (1)

Spansion¹

14 Nov 2008

TL;DR: In this article, a cache programming operation which requires 2 SRAMs (one for the user and one for the array) may be combined with a multi-level cell (MLC) programming operation, using only a total of two SRAM or buffers.

...read moreread less

Abstract: A cache programming operation which requires 2 SRAMs (one for the user and one for the array) may be combined with a multi-level cell (MLC) programming operation which also requires 2 SRAMs (one for caching the data and one for verifying the data), using only a total of two SRAMs (or buffers). One of the buffers (User SRAM) receives and stores user data. The other of the two buffers (Cache SRAM) may perform a caching function as well as a verify function. In this manner, if a program operation fails, the user can have its original data back so that he can try to reprogram it to a different place (address).

...read moreread less

Patent•

Method and Apparatus for Processing Video Stream in a Digital Video Broadcasting System

[...]

Guo Hui Lin¹, Yonghua Lin¹, Yudong Yang¹, Yu Yuan¹•Institutions (1)

IBM¹

27 Jun 2008

TL;DR: In this paper, a server-based cache mechanism which caches all channels simultaneously in a cache server near from the video playing terminal is proposed for enhancing user experience when switching channel in digital video broadcasting system.

...read moreread less

Abstract: A novel method and system for enhancing user experience when switching channel in digital video broadcasting system is proposed. The invention proposes a server-based cache mechanism which caches all channels simultaneously in a cache server near from the video playing terminal. The channel switch latency could be heavily reduced since the initial part of the current GOP of any channel could be retrieved from the cache server, therefore the user experience is improved greatly.

...read moreread less

Proceedings Article•DOI•

Parallax: virtual disks for virtual machines

[...]

Dutch T. Meyer¹, Gitika Aggarwal¹, Brendan Cully¹, Geoffrey Lefebvre¹, Michael J. Feeley¹, Norman C. Hutchinson¹, Andrew Warfield¹ - Show less +3 more•Institutions (1)

University of British Columbia¹

01 Apr 2008

TL;DR: Parallax offers a comprehensive set of storage features including frequent, low-overhead snapshot of virtual disks, the 'gold-mastering' of template images, and the ability to use local disks as a persistent cache to dampen burst demand on networked storage.

...read moreread less

Abstract: Parallax is a distributed storage system that uses virtualization to provide storage facilities specifically for virtual environments. The system employs a novel architecture in which storage features that have traditionally been implemented directly on high-end storage arrays and switches are relocated into a federation of storage VMs, sharing the same physical hosts as the VMs that they serve. This architecture retains the single administrative domain and OS agnosticism achieved by array- and switch-based approaches, while lowering the bar on hardware requirements and facilitating the development of new features. Parallax offers a comprehensive set of storage features including frequent, low-overhead snapshot of virtual disks, the 'gold-mastering' of template images, and the ability to use local disks as a persistent cache to dampen burst demand on networked storage.

...read moreread less

Collapse