Topic

Cache pollution

About: Cache pollution is a research topic. Over its lifetime, 11,353 publications have been published within this topic, receiving 262,139 citations.


Papers
Proceedings ArticleDOI
13 Dec 2014
TL;DR: A specialized cache management policy for GPGPUs is proposed: a reuse-distance-based bypass policy is coordinated with warp throttling to dynamically control the number of active warps, and a simple predictor dynamically estimates the optimal number of active warps that can take full advantage of the cache space and on-chip resources.
Abstract: With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch between the throughput-oriented execution model and the cache hierarchy design, which limits system performance and energy efficiency. The massive number of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies, designed for multicore systems, can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by a bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the number of active warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in harmonic mean IPC improvements of 74% and 17% (maximum 661% and 44%) compared to the baseline GPU architecture and optimal static warp throttling, respectively.

142 citations
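
The coordination the abstract describes can be sketched in a few lines. The sketch below is illustrative only: the threshold rule, the warp-count selection helper, and all names are assumptions made for exposition, not the paper's actual mechanism.

```python
# Illustrative sketch (not the paper's implementation): bypass a request
# when its observed reuse distance exceeds the cache's capacity, and pick
# the active-warp count from sampled IPC, as the paper's predictor does
# dynamically at runtime.

class BypassPolicy:
    def __init__(self, cache_lines):
        self.cache_lines = cache_lines  # effective cache capacity in lines
        self.last_use = {}              # line address -> last access time
        self.clock = 0

    def should_bypass(self, addr):
        self.clock += 1
        prev = self.last_use.get(addr)
        self.last_use[addr] = self.clock
        if prev is None:
            return True  # no reuse observed yet: do not pollute the cache
        # A reuse distance larger than the cache cannot be captured anyway.
        return self.clock - prev > self.cache_lines

def choose_active_warps(ipc_by_warp_count):
    """Toy stand-in for the predictor: pick the warp count with best IPC."""
    return max(ipc_by_warp_count, key=ipc_by_warp_count.get)

policy = BypassPolicy(cache_lines=4)
print([policy.should_bypass(a) for a in [1, 2, 1, 3, 4, 5, 6, 7, 1]])
print(choose_active_warps({8: 1.2, 16: 1.9, 24: 1.6}))  # hypothetical IPCs
```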

Patent
31 Dec 1981
TL;DR: In this patent, a multiprocessing three-level memory hierarchy implementation is described which uses a "write" flag and a "share" flag per page of information stored in a level-three main memory.
Abstract: A multiprocessing three-level memory hierarchy implementation is described which uses a "write" flag and a "share" flag per page of information stored in a level-three main memory. These two flag bits communicate from main memory at level three to private and shared caches at memory levels one and two how a given page of information is to be used. Essentially, pages which can be both written and shared are moved from main memory to the shared level-two cache and then to the shared level-one cache, with the processors executing from the shared level-one cache. All other pages are moved from main memory to the private level-two and level-one caches of the requesting processor. Thus, a processor executes either from its private or shared level-one cache. This allows several processors to share a common level-three main memory without encountering cross-interrogation overhead.

142 citations
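
The page-routing rule in this patent reduces to a two-bit decision. Below is a minimal sketch, assuming a simple cache_destination helper (a name invented here) that maps the two flags to the hierarchy path described in the abstract.

```python
# Minimal sketch of the patent's two-flag routing rule (helper name and
# return strings are invented for illustration).

def cache_destination(write_flag: bool, share_flag: bool) -> str:
    # Pages that can be both written and shared execute out of the shared
    # hierarchy, so private caches never need cross-interrogation.
    if write_flag and share_flag:
        return "shared L2 -> shared L1"
    # Read-only or unshared pages go to the requester's private caches.
    return "private L2 -> private L1"

for w in (False, True):
    for s in (False, True):
        print(f"write={w} share={s}: {cache_destination(w, s)}")
```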

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper introduces the permissions-only cache to extend the point at which transactions overflow, allowing the fast, bounded case to be used as frequently as possible, and proposes OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache.
Abstract: Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a general-purpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex.

This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow, allowing the fast, bounded case to be used as frequently as possible. Second, we propose OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system.

The permissions-only cache efficiently maintains the coherence permissions (but not the data) for blocks read or written transactionally that have been evicted from the processor's caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block.

OneTM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In OneTM-Serialized, an overflowed transaction simply stalls all other threads in the application. In OneTM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining OneTM's core simplifying assumption.

141 citations
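
The division of labor between the permissions-only cache and the coherence protocol can be sketched as follows. This is a simplified model under assumed semantics (one transaction, a read/write permission bit per overflowed block), not OneTM's actual hardware design.

```python
# Hypothetical sketch: track coherence permissions (not data) for blocks
# evicted mid-transaction, so incoming coherence requests can still be
# checked for transactional conflicts.

class PermissionsOnlyCache:
    def __init__(self):
        self.perms = {}   # overflowed block address -> "R" or "W"

    def record_eviction(self, addr, written):
        # Retain only a permission bit per block, never the data itself,
        # so the on-chip cost is a few bits per overflowed block.
        self.perms[addr] = "W" if written else "R"

    def conflicts_with(self, addr, remote_is_write):
        mode = self.perms.get(addr)
        if mode is None:
            return False                       # block not tracked
        # A remote write conflicts with any tracked block; a remote read
        # conflicts only with a block this transaction has written.
        return remote_is_write or mode == "W"

poc = PermissionsOnlyCache()
poc.record_eviction(0x1000, written=False)     # transactionally read block
print(poc.conflicts_with(0x1000, remote_is_write=True))    # True
print(poc.conflicts_with(0x1000, remote_is_write=False))   # False
```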

Patent
21 Apr 2003
TL;DR: In this patent, the disk drive has a cache control system configured to respond efficiently to host commands by forming variable-length segments of memory clusters that cache disk data for contiguous ranges of logical block addresses, without regard to the sequential order of the memory clusters.
Abstract: The present invention is embodied in a disk drive having a cache control system that is configured to respond efficiently to host commands by forming variable-length segments of memory clusters for caching disk data in contiguous ranges of logical block addresses, without regard to the sequential order of the memory clusters. The cache control system has a tag memory usable only for defining the segments. The tag memory has a plurality of tag records pointing to cluster control blocks associated with the memory clusters for defining the segments. The tag memory may be accessed and updated by several state machines in the cache control system and by a microprocessor in the disk drive.

141 citations
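
A rough sketch of the tag-record and cluster-control-block structure is below. The field names and the contains helper are assumptions made for illustration; the patent only specifies that tag records point to cluster control blocks that define segments over contiguous LBA ranges.

```python
# Illustrative sketch (names are assumptions, not the patent's): a tag
# record defines a cache segment as a chain of cluster control blocks
# covering a contiguous LBA range, with clusters in arbitrary memory order.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterControlBlock:
    cluster_index: int                              # memory cluster holding data
    next: Optional["ClusterControlBlock"] = None    # next CCB in the segment

@dataclass
class TagRecord:
    start_lba: int        # first logical block address the segment caches
    block_count: int      # contiguous LBAs covered by the segment
    first_ccb: Optional[ClusterControlBlock] = None

    def contains(self, lba):
        return self.start_lba <= lba < self.start_lba + self.block_count

# Clusters 7, 2, 9 (non-sequential in memory) back LBAs 100..147.
seg = TagRecord(100, 48,
                ClusterControlBlock(7, ClusterControlBlock(2, ClusterControlBlock(9))))
print(seg.contains(120), seg.contains(200))   # True False
```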

Journal ArticleDOI
18 Jun 2016
TL;DR: Although Belady's algorithm requires knowledge of the future, this paper explains how a cache replacement algorithm can nonetheless learn from it by applying it to past cache accesses to inform future replacement decisions, and shows that the implementation is surprisingly efficient.
Abstract: Belady's algorithm is optimal but infeasible because it requires knowledge of the future. This paper explains how a cache replacement algorithm can nonetheless learn from Belady's algorithm by applying it to past cache accesses to inform future cache replacement decisions. We show that the implementation is surprisingly efficient, as we introduce a new method of efficiently simulating Belady's behavior, and we use known sampling techniques to compactly represent the long history information that is needed for high accuracy. For a 2MB LLC, our solution uses a 16KB hardware budget (excluding replacement state in the tag array). When applied to a memory-intensive subset of the SPEC 2006 CPU benchmarks, our solution improves performance over LRU by 8.4%, as opposed to 6.2% for the previous state-of-the-art. For a 4-core system with a shared 8MB LLC, our solution improves performance by 15.0%, compared to 12.0% for the previous state-of-the-art.

141 citations
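
The paper's central trick, applying Belady's algorithm to the past, can be sketched as an offline replay. The function below is a simplified, assumed reconstruction: it simulates Belady's optimal replacement over a past access trace and labels each access as an OPT hit or miss, the kind of signal a replacement policy could learn from. It is not the paper's actual hardware mechanism, which uses sampling and compact history rather than a full replay.

```python
# Sketch (assumed, simplified): replay a past trace under Belady's MIN
# and report, per access, whether the optimal policy would have hit.

def belady_labels(trace, capacity):
    # Precompute, for each access, the position of the next use of the
    # same address (infinity if the address never recurs).
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}   # resident address -> position of its next use
    labels = []
    for i, addr in enumerate(trace):
        labels.append(addr in cache)         # would OPT have hit here?
        cache.pop(addr, None)
        if len(cache) >= capacity:
            # Belady's MIN: evict the resident line reused farthest ahead.
            del cache[max(cache, key=cache.get)]
        cache[addr] = next_use[i]
    return labels

trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
print(belady_labels(trace, capacity=2))
```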


Network Information
Related Topics (5)
Cache: 59.1K papers, 976.6K citations (93% related)
Compiler: 26.3K papers, 578.5K citations (89% related)
Scalability: 50.9K papers, 931.6K citations (87% related)
Server: 79.5K papers, 1.4M citations (86% related)
Static routing: 25.7K papers, 576.7K citations (84% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    42
2022    110
2021    12
2020    20
2019    15
2018    30