Topic

Cache pollution

About: Cache pollution is a research topic. Over its lifetime, 11,353 publications have been published within this topic, receiving 262,139 citations.


Papers
Journal ArticleDOI
TL;DR: Two schemes for implementing associativity greater than two are proposed: the sequential multicolumn cache, an extension of the column-associative cache, and the parallel multicolumn cache, both of which can effectively reduce the average access time.
Abstract: In the race to improve cache performance, many researchers have proposed schemes that increase a cache's associativity. The associativity of a cache is the number of places in the cache where a block may reside. In a direct-mapped cache, which has an associativity of 1, there is only one location to search for a match for each reference. In a cache with associativity n, an n-way set-associative cache, there are n locations. Increasing associativity reduces the miss rate by decreasing the number of conflict, or interference, misses. The column-associative cache and the predictive sequential associative cache seem to have achieved near-optimal performance for an associativity of two. Increasing associativity beyond two, therefore, is one of the most important ways to further improve cache performance. We propose two schemes for implementing associativity greater than two: the sequential multicolumn cache, which is an extension of the column-associative cache, and the parallel multicolumn cache. For an associativity of four, they achieve the low miss rate of a four-way set-associative cache. Our simulation results show that both schemes can effectively reduce the average access time.
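As a concrete illustration of the associativity concept the abstract builds on, the hedged C++ sketch below implements a plain n-way set-associative lookup; with a single way it degenerates to a direct-mapped cache. It is not the paper's sequential or parallel multicolumn design, and the class name, tag scheme, and naive fill policy are assumptions.

```cpp
// Minimal sketch of an n-way set-associative lookup (illustrative only; not
// the paper's multicolumn schemes). Names, sizes, and policies are assumptions.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Line { bool valid = false; uint64_t tag = 0; };

class SetAssociativeCache {
public:
    SetAssociativeCache(size_t numSets, size_t ways)
        : ways_(ways), lines_(numSets * ways) {}

    // Returns true on a hit. With ways_ == 1 this degenerates to a
    // direct-mapped cache: only one location is searched per reference.
    bool access(uint64_t blockAddr) {
        size_t set = blockAddr % (lines_.size() / ways_);
        Line* base = &lines_[set * ways_];
        for (size_t w = 0; w < ways_; ++w)          // n candidate locations
            if (base[w].valid && base[w].tag == blockAddr) return true;
        // Miss: naive fill into way 0 (a real cache would use LRU, etc.).
        base[0].valid = true;
        base[0].tag = blockAddr;
        return false;
    }

private:
    size_t ways_;
    std::vector<Line> lines_;
};
```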

84 citations

Proceedings ArticleDOI
09 Mar 2015
TL;DR: A priority-based cache allocation (PCAL) scheme is proposed that provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache.
Abstract: GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can, however, increase contention for various system resources, which may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache contention and improve performance. Throttling approaches can, however, lead to under-utilizing thread contexts, on-chip interconnect, and off-chip memory bandwidth. This paper proposes to tightly couple the thread scheduling mechanism with the cache management algorithms such that GPU cache pollution is minimized while off-chip memory throughput is enhanced. We propose priority-based cache allocation (PCAL), which provides preferential cache capacity to a subset of high-priority threads while simultaneously allowing lower-priority threads to execute without contending for the cache. By tuning thread-level parallelism while optimizing both caching efficiency and other shared resource usage, PCAL builds upon previous thread throttling approaches, improving overall performance by 17% on average and by up to 51%.
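A minimal sketch of the allocation decision the abstract outlines: only a token-holding (high-priority) subset of threads may allocate lines in the shared cache, while the remaining threads still execute but have their fills bypass it. The warp/token structures, function names, and token-assignment rule below are illustrative assumptions, not the paper's PCAL implementation.

```cpp
// Hedged sketch of priority-based cache allocation: token-holding warps may
// allocate lines in the shared cache; the rest bypass it so they cannot
// pollute it. All names and the token count are illustrative assumptions.
#include <cstddef>

struct Warp {
    int  id;
    bool hasToken;   // granted preferential cache capacity
};

enum class FillPolicy { AllocateInCache, BypassCache };

// Decide what to do with a line fetched on a miss for this warp.
FillPolicy onMiss(const Warp& w) {
    return w.hasToken ? FillPolicy::AllocateInCache
                      : FillPolicy::BypassCache;
}

// Very rough tuning knob: grant tokens to the first `numTokens` warps,
// keeping thread-level parallelism high while capping cache contention.
void assignTokens(Warp* warps, size_t numWarps, size_t numTokens) {
    for (size_t i = 0; i < numWarps; ++i)
        warps[i].hasToken = (i < numTokens);
}
```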

84 citations

Proceedings Article
Binny S. Gill, Luis Angel D. Bathen
13 Feb 2007
TL;DR: A first formal analysis of the criteria necessary for optimal throughput when using an AA algorithm in a cache shared by multiple steady sequential streams is provided, along with a simple implementation called AMP, which adapts accordingly, leading to near-optimal performance for any kind of sequential workload and cache size.
Abstract: Prefetching is a widely used technique in modern data storage systems. We study the most widely used class of prefetching algorithms, known as sequential prefetching. There are two problems that plague the state-of-the-art sequential prefetching algorithms: (i) cache pollution, which occurs when prefetched data replaces more useful prefetched or demand-paged data, and (ii) prefetch wastage, which happens when prefetched data is evicted from the cache before it can be used. A sequential prefetching algorithm can have a fixed or adaptive degree of prefetch and can be either synchronous (when it can prefetch only on a miss) or asynchronous (when it can also prefetch on a hit). To capture these distinctions, we define four classes of prefetching algorithms: Fixed Synchronous (FS), Fixed Asynchronous (FA), Adaptive Synchronous (AS), and Adaptive Asynchronous (AA). We find that the relatively unexplored class of AA algorithms is in fact the most promising for sequential prefetching. We provide a first formal analysis of the criteria necessary for optimal throughput when using an AA algorithm in a cache shared by multiple steady sequential streams. We then provide a simple implementation called AMP, which adapts accordingly, leading to near-optimal performance for any kind of sequential workload and cache size. Our experimental set-up consisted of an IBM xSeries 345 dual-processor server running Linux with five SCSI disks. We observe that AMP convincingly outperforms all the contending members of the FA, FS, and AS classes for any number of streams and over all cache sizes. As anecdotal evidence, in an experiment with 100 concurrent sequential streams and varying cache sizes, AMP beats the FA, FS, and AS algorithms by 29-172%, 12-24%, and 21-210% respectively, while outperforming OBL by a factor of 8. Even for complex workloads like SPC1-Read, AMP is consistently the best-performing algorithm. For the SPC2 Video-on-Demand workload, AMP can sustain at least 25% more streams than the next best algorithm. Finally, for a workload consisting of short sequences, where optimality is more elusive, AMP is able to outperform all the other contenders in overall performance.
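To make the FS/FA/AS/AA taxonomy concrete, here is a hedged skeleton of an adaptive asynchronous (AA) prefetcher: it prefetches on misses (the synchronous trigger), also on hits to a designated trigger block (the asynchronous trigger), and adjusts its degree of prefetch depending on whether prefetched blocks are used in time or evicted unused. The class, method names, and constants are assumptions; this is not AMP's actual adaptation policy.

```cpp
// Generic adaptive-asynchronous (AA) prefetcher skeleton, illustrating the
// design axes described above. Not AMP's policy; names/constants are assumed.
#include <cstdint>
#include <algorithm>

class SequentialPrefetcher {
public:
    // Called for every access to a block of a sequential stream. `isHit`
    // distinguishes demand hits from misses; `isTriggerBlock` marks the block
    // chosen as the asynchronous trigger.
    void onAccess(uint64_t block, bool isHit, bool isTriggerBlock) {
        if (!isHit) {
            issuePrefetch(block + 1, degree_);      // synchronous: on a miss
        } else if (isTriggerBlock) {
            issuePrefetch(block + 1, degree_);      // asynchronous: on a hit
        }
    }

    // Adaptive part: grow the degree if prefetched data was used in time,
    // shrink it if prefetched data was evicted unused (prefetch wastage).
    void onPrefetchedBlockUsed()    { degree_ = std::min(degree_ + 1, 64); }
    void onPrefetchedBlockEvicted() { degree_ = std::max(degree_ - 1, 1); }

private:
    void issuePrefetch(uint64_t firstBlock, int count) {
        // Submit `count` sequential block reads starting at `firstBlock`
        // (storage- or simulator-specific; elided here).
        (void)firstBlock; (void)count;
    }
    int degree_ = 4;   // current degree of prefetch
};
```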

84 citations

Patent
05 Jun 1989
TL;DR: In this article, a block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache, and each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information.
Abstract: A controller (10) for use with a hard disk (38) or other mass storage medium provides a memory cache (36). A block descriptor table (40) is divided into a plurality of sets (42), depending upon the size of the memory cache (36). Each set is similarly indexed to define memory groups (44) having tag, cache address, and usage information. Upon a read command, an index is generated corresponding to the address requested by the host computer, and the tag information is matched with a generated tag from the address. Each set is checked until a hit occurs or a miss occurs in every set. After each miss, the usage information (50) corresponding to the memory group (44) is decremented. When reading information from the storage device (32) to the memory cache (36), the controller (10) may selectively read additional sectors. The number of sectors read from the storage device may be selectively controlled by the user or the host processor. Further, a cap may be provided to set a maximum number of sectors to be read.
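A rough sketch of the lookup flow the patent abstract describes: the host address is split into an index and a tag, the indexed memory group in each set is probed in turn, and a probed group's usage counter is decremented on a miss. The field widths, index/tag derivation, and decrement rule below are illustrative assumptions rather than the patented design.

```cpp
// Hedged sketch of a block-descriptor-table lookup; all field widths and the
// index/tag hashing are assumptions, not the patent's exact scheme.
#include <cstddef>
#include <cstdint>
#include <vector>

struct MemoryGroup {
    bool     valid = false;
    uint32_t tag = 0;
    uint32_t cacheAddress = 0;  // where the block lives in the memory cache
    int      usage = 0;         // replacement hint
};

class BlockDescriptorTable {
public:
    BlockDescriptorTable(size_t numSets, size_t entriesPerSet)
        : entriesPerSet_(entriesPerSet),
          sets_(numSets, std::vector<MemoryGroup>(entriesPerSet)) {}

    // Returns true and fills `cacheAddr` on a hit.
    bool lookup(uint32_t hostAddress, uint32_t& cacheAddr) {
        uint32_t index = hostAddress % entriesPerSet_;
        uint32_t tag   = hostAddress / entriesPerSet_;
        for (auto& set : sets_) {                 // check each set in turn
            MemoryGroup& g = set[index];
            if (g.valid && g.tag == tag) {
                cacheAddr = g.cacheAddress;       // hit
                return true;
            }
            --g.usage;                            // decrement usage on a miss
        }
        return false;                             // missed in every set
    }

private:
    size_t entriesPerSet_;
    std::vector<std::vector<MemoryGroup>> sets_;
};
```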

84 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The parallel cache-oblivious (PCO) model is presented, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies, and a new scheduler is described, which attains provably good cache performance and runtime on parallel machine models with hierarchical caches.
Abstract: For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches, for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the cost of the cache misses is evenly balanced across the processors; i.e., the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
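Restating the "work efficient" claim above in symbols (the notation is introduced here for illustration and is not necessarily the paper's): if Q denotes the total cost of the cache misses charged to a computation under the PCO analysis and p the number of processors, the scheduler's runtime T_p satisfies, within constant factors,

```latex
% Informal restatement of the work-efficiency claim; Q and T_p are
% illustrative symbols, not the paper's notation.
T_p \;=\; \Theta\!\left(\frac{Q}{p}\right)
```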

84 citations


Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations, 93% related
Compiler: 26.3K papers, 578.5K citations, 89% related
Scalability: 50.9K papers, 931.6K citations, 87% related
Server: 79.5K papers, 1.4M citations, 86% related
Static routing: 25.7K papers, 576.7K citations, 84% related
Performance
Metrics
No. of papers in the topic in previous years
Year: Papers
2023: 42
2022: 110
2021: 12
2020: 20
2019: 15
2018: 30