Topic

Cache pollution

About: Cache pollution is a research topic. Over its lifetime, 11,353 publications have been published within this topic, receiving 262,139 citations.


Papers
Journal ArticleDOI
TL;DR: Two schemes for implementing associativity greater than two are proposed: the sequential multicolumn cache, an extension of the column-associative cache, and the parallel multicolumn cache, both of which can effectively reduce the average access time.
Abstract: In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. The associativity of a cache is the number of places in the cache where a block may reside. In a direct-mapped cache, which has an associativity of 1, there is only one location to search for a match for each reference. In a cache with associativity n, an n-way set-associative cache, there are n locations. Increasing associativity reduces the miss rate by decreasing the number of conflict, or interference, misses. The column-associative cache and the predictive sequential associative cache seem to have achieved near-optimal performance for an associativity of two. Increasing associativity beyond two, therefore, is one of the most important ways to further improve cache performance. We propose two schemes for implementing associativity greater than two: the sequential multicolumn cache, which is an extension of the column-associative cache, and the parallel multicolumn cache. For an associativity of four, they achieve the low miss rate of a four-way set-associative cache. Our simulation results show that both schemes can effectively reduce the average access time.
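As a concrete illustration of the associativity concept the abstract builds on, the hedged C++ sketch below implements a plain n-way set-associative lookup; with a single way it degenerates to a direct-mapped cache. It is not the paper's sequential or parallel multicolumn design, and the class name, tag scheme, and naive fill policy are assumptions.

```cpp
// Minimal sketch of an n-way set-associative lookup (illustrative only; not
// the paper's multicolumn schemes). Names, sizes, and policies are assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Line { bool valid = false; uint64_t tag = 0; };

class SetAssociativeCache {
public:
    SetAssociativeCache(size_t numSets, size_t ways)
        : ways_(ways), lines_(numSets * ways) {}

    // Returns true on a hit. With ways_ == 1 this degenerates to a
    // direct-mapped cache: only one location is searched per reference.
    bool access(uint64_t blockAddr) {
        size_t set = blockAddr % (lines_.size() / ways_);
        Line* base = &lines_[set * ways_];
        for (size_t w = 0; w < ways_; ++w)          // n candidate locations
            if (base[w].valid && base[w].tag == blockAddr) return true;
        // Miss: naive fill into way 0 (a real cache would use LRU, etc.).
        base[0].valid = true;
        base[0].tag = blockAddr;
        return false;
    }

private:
    size_t ways_;
    std::vector<Line> lines_;
};
```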

84 citations

Proceedings ArticleDOI
09 Mar 2015
TL;DR: A priority-based cache allocation (PCAL) scheme is proposed that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache.
Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL), which provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache. By tuning thread-level parallelism while optimizing both caching efficiency and other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by 17% on average and by up to 51%.
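A minimal sketch of the allocation decision the abstract outlines: only a token-holding (high-priority) subset of threads may allocate lines in the shared cache, while the remaining threads still execute but have their fills bypass it. The warp/token structures, function names, and token-assignment rule below are illustrative assumptions, not the paper's PCAL implementation.

```cpp
// Hedged sketch of priority-based cache allocation: token-holding warps may
// allocate lines in the shared cache; the rest bypass it so they cannot
// pollute it. All names and the token count are illustrative assumptions.
#include <cstddef>

struct Warp {
    int  id;
    bool hasToken;   // granted preferential cache capacity
};

enum class FillPolicy { AllocateInCache, BypassCache };

// Decide what to do with a line fetched on a miss for this warp.
FillPolicy onMiss(const Warp& w) {
    return w.hasToken ? FillPolicy::AllocateInCache
                      : FillPolicy::BypassCache;
}

// Very rough tuning knob: grant tokens to the first `numTokens` warps,
// keeping thread-level parallelism high while capping cache contention.
void assignTokens(Warp* warps, size_t numWarps, size_t numTokens) {
    for (size_t i = 0; i < numWarps; ++i)
        warps[i].hasToken = (i < numTokens);
}
```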

84 citations

Proceedings Article
Binny S. Gill, Luis Angel D. Bathen
13 Feb 2007
TL;DR: A first formal analysis of the criteria necessary for optimal throughput when using an AA algorithm in a cache shared by multiple steady sequential streams is provided, along with a simple implementation called AMP, which adapts accordingly, leading to near-optimal performance for any kind of sequential workload and cache size.
Abstract: Prefetching is a widely used technique in modern data storage systems. We study the most widely used class of prefetching algorithms, known as sequential prefetching. There are two problems that plague the state-of-the-art sequential prefetching algorithms: (i) cache pollution, which occurs when prefetched data replaces more useful prefetched or demand-paged data, and (ii) prefetch wastage, which happens when prefetched data is evicted from the cache before it can be used. A sequential prefetching algorithm can have a fixed or adaptive degree of prefetch and can be either synchronous (when it can prefetch only on a miss) or asynchronous (when it can also prefetch on a hit). To capture these distinctions, we define four classes of prefetching algorithms: Fixed Synchronous (FS), Fixed Asynchronous (FA), Adaptive Synchronous (AS), and Adaptive Asynchronous (AA). We find that the relatively unexplored class of AA algorithms is in fact the most promising for sequential prefetching. We provide a first formal analysis of the criteria necessary for optimal throughput when using an AA algorithm in a cache shared by multiple steady sequential streams. We then provide a simple implementation called AMP, which adapts accordingly, leading to near-optimal performance for any kind of sequential workload and cache size. Our experimental set-up consisted of an IBM xSeries 345 dual-processor server running Linux with five SCSI disks. We observe that AMP convincingly outperforms all the contending members of the FA, FS, and AS classes for any number of streams and over all cache sizes. As anecdotal evidence, in an experiment with 100 concurrent sequential streams and varying cache sizes, AMP beats the FA, FS, and AS algorithms by 29-172%, 12-24%, and 21-210% respectively, while outperforming OBL by a factor of 8. Even for complex workloads like SPC1-Read, AMP is consistently the best-performing algorithm. For the SPC2 Video-on-Demand workload, AMP can sustain at least 25% more streams than the next best algorithm. Finally, for a workload consisting of short sequences, where optimality is more elusive, AMP is able to outperform all the other contenders in overall performance.
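To make the FS/FA/AS/AA taxonomy concrete, here is a hedged skeleton of an adaptive asynchronous (AA) prefetcher: it prefetches on misses (the synchronous trigger), also on hits to a designated trigger block (the asynchronous trigger), and adjusts its degree of prefetch depending on whether prefetched blocks are used in time or evicted unused. The class, method names, and constants are assumptions; this is not AMP's actual adaptation policy.

```cpp
// Generic adaptive-asynchronous (AA) prefetcher skeleton, illustrating the
// design axes described above. Not AMP's policy; names/constants are assumed.
#include <cstdint>
#include <algorithm>

class SequentialPrefetcher {
public:
    // Called for every access to a block of a sequential stream. `isHit`
    // distinguishes demand hits from misses; `isTriggerBlock` marks the block
    // chosen as the asynchronous trigger.
    void onAccess(uint64_t block, bool isHit, bool isTriggerBlock) {
        if (!isHit) {
            issuePrefetch(block + 1, degree_);      // synchronous: on a miss
        } else if (isTriggerBlock) {
            issuePrefetch(block + 1, degree_);      // asynchronous: on a hit
        }
    }

    // Adaptive part: grow the degree if prefetched data was used in time,
    // shrink it if prefetched data was evicted unused (prefetch wastage).
    void onPrefetchedBlockUsed()    { degree_ = std::min(degree_ + 1, 64); }
    void onPrefetchedBlockEvicted() { degree_ = std::max(degree_ - 1, 1); }

private:
    void issuePrefetch(uint64_t firstBlock, int count) {
        // Submit `count` sequential block reads starting at `firstBlock`
        // (storage- or simulator-specific; elided here).
        (void)firstBlock; (void)count;
    }
    int degree_ = 4;   // current degree of prefetch
};
```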

84 citations

Patent
05 Jun 1989
TL;DR: In this article, a block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache, and each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information.
Abstract: A controller (10) for use with a hard disk (38) or other mass storage medium provides a memory cache (36). A block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache (36). Each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information. Upon a read command, an index is generated corresponding to the address requested by the host computer, and the tag information is matched with a generated tag from the address. Each set is checked until a hit occurs or a miss occurs in every set. After each miss, the usage information (50) corresponding to the memory group (44) is decremented. When reading information from the storage device (32) to the memory cache (36), the controller (10) may selectively read additional sectors. The number of sectors read from the storage device may be selectively controlled by the user or the host processor. Further, a cap may be provided to set a maximum number of sectors to be read.
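A rough sketch of the lookup flow the patent abstract describes: the host address is split into an index and a tag, the indexed memory group in each set is probed in turn, and a probed group's usage counter is decremented on a miss. The field widths, index/tag derivation, and decrement rule below are illustrative assumptions rather than the patented design.

```cpp
// Hedged sketch of a block-descriptor-table lookup; all field widths and the
// index/tag hashing are assumptions, not the patent's exact scheme.
#include <cstddef>
#include <cstdint>
#include <vector>

struct MemoryGroup {
    bool     valid = false;
    uint32_t tag = 0;
    uint32_t cacheAddress = 0;  // where the block lives in the memory cache
    int      usage = 0;         // replacement hint
};

class BlockDescriptorTable {
public:
    BlockDescriptorTable(size_t numSets, size_t entriesPerSet)
        : entriesPerSet_(entriesPerSet),
          sets_(numSets, std::vector<MemoryGroup>(entriesPerSet)) {}

    // Returns true and fills `cacheAddr` on a hit.
    bool lookup(uint32_t hostAddress, uint32_t& cacheAddr) {
        uint32_t index = hostAddress % entriesPerSet_;
        uint32_t tag   = hostAddress / entriesPerSet_;
        for (auto& set : sets_) {                 // check each set in turn
            MemoryGroup& g = set[index];
            if (g.valid && g.tag == tag) {
                cacheAddr = g.cacheAddress;       // hit
                return true;
            }
            --g.usage;                            // decrement usage on a miss
        }
        return false;                             // missed in every set
    }

private:
    size_t entriesPerSet_;
    std::vector<std::vector<MemoryGroup>> sets_;
};
```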

84 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors; i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
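Restating the "work efficient" claim above in symbols (the notation is introduced here for illustration and is not necessarily the paper's): if Q denotes the total cost of the cache misses charged to a computation under the PCO analysis and p the number of processors, the scheduler's runtime T_p satisfies, within constant factors,

```latex
% Informal restatement of the work-efficiency claim; Q and T_p are
% illustrative symbols, not the paper's notation.
T_p \;=\; \Theta\!\left(\frac{Q}{p}\right)
```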

84 citations


Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 93% related
Compiler: 26.3K papers, 578.5K citations, 89% related
Scalability: 50.9K papers, 931.6K citations, 87% related
Server: 79.5K papers, 1.4M citations, 86% related
Static routing: 25.7K papers, 576.7K citations, 84% related
Performance
Metrics
No. of papers in the topic in previous years
Year: Papers
2023: 42
2022: 110
2021: 12
2020: 20
2019: 15
2018: 30