Author

Chenjie Yu

Bio: Chenjie Yu is an academic researcher from the University of Maryland, College Park. The author has contributed to research in topics: Cache & Cache algorithms. The author has an h-index of 7 and has co-authored 11 publications receiving 117 citations.

Papers
Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work takes a different approach in which tasks' memory bandwidth requirements are taken into account when identifying a cache partitioning for multi-programmed and/or multithreaded workloads, so that the overall system bandwidth requirement is minimized for the target workload.
Abstract: We present a methodology for off-chip memory bandwidth minimization through application-driven L2 cache partitioning in multi-core systems. A major challenge in multi-core system design is the widening gap between the memory demand generated by the processor cores and the limited off-chip memory bandwidth and memory service speed. This severely restricts the number of cores that can be integrated into a multi-core system and the parallelism that can actually be achieved and efficiently exploited, not only for memory-demanding applications but also for workloads consisting of many tasks that utilize a large number of cores and thus exceed the available off-chip bandwidth. Last-level shared cache partitioning has been shown to be a promising technique to enhance cache utilization and reduce miss rates. While most cache partitioning techniques focus on cache miss rates, our work takes a different approach in which tasks' memory bandwidth requirements are taken into account when identifying a cache partitioning for multi-programmed and/or multithreaded workloads. Cache resources are allocated with the objective that the overall system bandwidth requirement is minimized for the target workload. The key insight is that cache miss-rate information may severely misrepresent the actual bandwidth demand of a task, which ultimately determines the overall system performance and power consumption.
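As a rough illustration of the bandwidth-driven objective described above, the sketch below greedily hands out L2 ways to whichever task currently saves the most off-chip bandwidth per additional way. The task mix, profile numbers, and greedy strategy are assumptions made for this example; the paper's actual partitioning algorithm may differ.

/* Minimal sketch of bandwidth-driven cache partitioning (not the paper's
 * actual algorithm): greedily hand out L2 ways to whichever task currently
 * saves the most off-chip bandwidth per extra way. Profile numbers below
 * are made up for illustration. */
#include <stdio.h>

#define NTASKS 3
#define NWAYS  8
#define LINE_BYTES 64

/* mpki[t][w]: profiled misses per kilo-instruction of task t given w+1 ways */
static const double mpki[NTASKS][NWAYS] = {
    { 40, 30, 22, 16, 12, 10,  9,  8 },   /* streaming, bandwidth hungry */
    { 25, 12,  6,  3,  2,  2,  2,  2 },   /* cache friendly              */
    { 15, 14, 14, 13, 13, 13, 13, 13 },   /* mostly cache insensitive    */
};
/* instruction throughput of each task, in kilo-instructions per second */
static const double kips[NTASKS] = { 900.0, 600.0, 400.0 };

static double bw(int t, int ways) {            /* bytes/s of off-chip traffic */
    return kips[t] * mpki[t][ways - 1] * LINE_BYTES;
}

int main(void) {
    int ways[NTASKS] = { 1, 1, 1 };            /* every task needs >= 1 way   */
    int left = NWAYS - NTASKS;

    while (left-- > 0) {                       /* give one way at a time...   */
        int best = 0;
        double best_gain = -1.0;
        for (int t = 0; t < NTASKS; t++) {
            if (ways[t] == NWAYS) continue;
            double gain = bw(t, ways[t]) - bw(t, ways[t] + 1);
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        ways[best]++;                          /* ...to the biggest bandwidth saver */
    }

    double total = 0.0;
    for (int t = 0; t < NTASKS; t++) {
        total += bw(t, ways[t]);
        printf("task %d: %d ways, %.0f bytes/s off-chip\n", t, ways[t], bw(t, ways[t]));
    }
    printf("total off-chip bandwidth: %.0f bytes/s\n", total);
    return 0;
}

The point of the example is that allocation is driven by bandwidth saved (access rate × miss rate × line size) rather than by miss rate alone.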

37 citations

Proceedings ArticleDOI
08 Jun 2008
TL;DR: This paper proposes a compiler-based register reassignment methodology whose purpose is to break up such groups of registers and to uniformly distribute accesses across the register file; it also shows that the underlying problem is NP-hard.
Abstract: Temperature hot-spots have been known to cause severe reliability problems and to significantly increase leakage power. The register file has previously been shown to exhibit the highest temperature of all hardware components in a modern high-end embedded processor, which makes it particularly susceptible to faults and elevated leakage power. We show that this is mostly due to highly clustered register file accesses, where a small set of registers physically placed close to each other are accessed with very high frequency. In this paper we propose a compiler-based register reassignment methodology whose purpose is to break up such groups of registers and to uniformly distribute accesses across the register file. This is achieved with no performance or hardware overhead. We show that the underlying problem is NP-hard, and subsequently introduce an efficient algorithmic heuristic.
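One simple way such a reassignment could look in principle is sketched below: spread the most frequently accessed registers across the physical register file by interleaving them with cold ones, so that no two hot registers sit in adjacent entries. The access profile and the interleaving rule are invented for illustration and are not the heuristic from the paper.

/* Rough sketch (not the paper's heuristic): permute the architectural-to-
 * physical register mapping so that the most frequently accessed registers
 * end up physically far apart, interleaved with cold ones, spreading access
 * (and heat) density across the register file. Access counts would come
 * from compiler profiling; the numbers here are made up. */
#include <stdio.h>
#include <stdlib.h>

#define NREGS 16

typedef struct { int reg; long accesses; } reg_info;

static int by_accesses_desc(const void *a, const void *b) {
    long d = ((const reg_info *)b)->accesses - ((const reg_info *)a)->accesses;
    return (d > 0) - (d < 0);
}

int main(void) {
    reg_info info[NREGS];
    for (int r = 0; r < NREGS; r++) {          /* fake clustered profile:   */
        info[r].reg = r;                       /* low-numbered registers    */
        info[r].accesses = (r < 4) ? 100000 - 1000 * r : 500 + r;  /* are hot */
    }
    qsort(info, NREGS, sizeof info[0], by_accesses_desc);

    /* Walk the register file, alternating hot and cold registers, so that
     * physically adjacent entries never both come from the hot group.      */
    int placement[NREGS];                      /* physical slot -> arch reg  */
    int hot = 0, cold = NREGS - 1;
    for (int slot = 0; slot < NREGS; slot++)
        placement[slot] = (slot % 2 == 0) ? info[hot++].reg : info[cold--].reg;

    for (int slot = 0; slot < NREGS; slot++)
        printf("physical slot %2d <- architectural r%d\n", slot, placement[slot]);
    return 0;
}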

16 citations

Journal ArticleDOI
TL;DR: The proposed architecture effectively implements the queued-lock semantics in a completely decentralized manner through low-cost and distributed synchronization controllers performing distributed synchronization management protocols.
Abstract: We present a framework for a distributed and low-cost implementation of synchronization mechanisms for embedded shared-memory multiprocessors. The proposed architecture effectively implements the queued-lock semantics in a completely decentralized manner through low-cost, distributed synchronization controllers performing distributed synchronization management protocols. The proposed approach achieves three major benefits. First, it completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable. Second, it exhibits extremely low best-case lock-acquisition latency (with zero bus transactions). Third, it enables multiple venues for high energy efficiency, as the local synchronization controllers can efficiently determine, without any bus transactions or local cache spinning, the exact moment when a lock becomes available to, or a barrier is enabled at, the local processor. This makes it possible for the system software or the thread library to employ various low-power policies.
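For readers unfamiliar with queued-lock semantics, the software ticket lock below illustrates the ordering guarantee the controllers provide: requesters are served strictly in arrival order, and each waiter knows exactly when the lock has been handed to it. This is only an illustration of the semantics, using C11 threads and atomics; the paper realizes them in hardware through distributed controllers, with no bus traffic or cache spinning while waiting.

/* Sketch of queued-lock semantics: requesters are served strictly in
 * arrival order, and each waiter learns exactly when the lock is handed
 * to it. This software ticket lock only illustrates the ordering
 * guarantee; the paper implements it with distributed hardware
 * synchronization controllers instead of spinning. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket handed to the next requester   */
    atomic_uint now_serving;   /* ticket currently holding the lock     */
} queued_lock;

static queued_lock qlock = { 0, 0 };
static long counter = 0;

static void qlock_acquire(queued_lock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* join the queue */
    while (atomic_load(&l->now_serving) != my)
        ;  /* in the paper's design a local controller signals this moment  */
}

static void qlock_release(queued_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);                 /* FIFO hand-off  */
}

static int worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        qlock_acquire(&qlock);
        counter++;                              /* protected critical section */
        qlock_release(&qlock);
    }
    return 0;
}

int main(void) {
    thrd_t t[4];
    for (int i = 0; i < 4; i++) thrd_create(&t[i], worker, NULL);
    for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
    return 0;
}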

14 citations

Proceedings ArticleDOI
19 Oct 2008
TL;DR: The proposed approach to synchronization implementation not only completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable, but also achieves very high energy efficiency as the local synchronization controller can efficiently determine, without any bus transactions or local cache spinning, the exact timing of when the lock is made available to the local processor.
Abstract: In this paper we present a framework for a distributed and very low-cost implementation of synchronization controllers and protocols for embedded multiprocessors. The proposed architecture effectively implements the queued-lock semantics in a completely distributed way. The proposed approach to synchronization not only completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable, but also achieves very high energy efficiency, as the local synchronization controller can efficiently determine, without any bus transactions or local cache spinning, the exact timing of when the lock is made available to the local processor. Application-specific information regarding synchronization variables in the local task is exploited in implementing the distributed synchronization protocol. The local synchronization controllers enable the system software or the thread library to implement various low-power policies, such as disabling cache accesses or even completely powering down the local processor while waiting for a synchronization variable.
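To make the last point concrete, the sketch below shows how a thread library might layer such low-power wait policies on top of a local synchronization controller. The controller and power-management calls (sync_ctrl_try_acquire, sync_ctrl_wait_for_grant, cache_gate, core_sleep_until_grant) are hypothetical stubs invented for this example; the paper does not define a software interface at this level of detail.

/* Illustrative sketch only: how a thread library might choose a low-power
 * wait policy on top of a local synchronization controller. Every
 * controller/back-end call here is a hypothetical stub. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { WAIT_SPIN, WAIT_GATE_CACHE, WAIT_POWER_DOWN } wait_policy;

/* --- hypothetical controller/back-end stubs ------------------------------ */
static bool sync_ctrl_try_acquire(int lock_id) { (void)lock_id; return false; }
static void sync_ctrl_wait_for_grant(int lock_id) { (void)lock_id; /* blocks until granted */ }
static void cache_gate(bool off) { printf("cache %s\n", off ? "gated" : "on"); }
static void core_sleep_until_grant(int lock_id) { (void)lock_id; printf("core sleeping\n"); }

/* Acquire lock_id, choosing how to wait if the controller says "not yet". */
static void lib_lock(int lock_id, wait_policy policy) {
    if (sync_ctrl_try_acquire(lock_id))
        return;                                   /* zero-transaction fast path */
    switch (policy) {
    case WAIT_SPIN:                               /* latency-critical section   */
        sync_ctrl_wait_for_grant(lock_id);
        break;
    case WAIT_GATE_CACHE:                         /* expected short wait        */
        cache_gate(true);
        sync_ctrl_wait_for_grant(lock_id);
        cache_gate(false);
        break;
    case WAIT_POWER_DOWN:                         /* expected long wait         */
        core_sleep_until_grant(lock_id);
        break;
    }
}

int main(void) {
    lib_lock(7, WAIT_GATE_CACHE);                 /* lock id 7 is arbitrary     */
    puts("lock acquired");
    return 0;
}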

10 citations

Proceedings ArticleDOI
30 Sep 2007
TL;DR: This work proposes an application-driven customization technique in which application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache tag lookups, even for references to shared data, thus achieving significant power reduction with minimal hardware cost.
Abstract: Snoop-based cache coherence protocols are typically used when multiple processor cores share memory through a common bus. It is well known, however, that these coherence protocols introduce an excessive power overhead. To help alleviate this problem, we propose an application-driven customization technique in which application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache tag lookups, even for references to shared data, thus achieving significant power reduction with minimal hardware cost. Snoop-induced cache tag lookups for accesses to both shared and private data are eliminated when it is ensured that such lookups would not yield any extra knowledge regarding the cache state with respect to the other caches and memories. The proposed methodology relies on combined support from the compiler, the operating system, and the hardware architecture. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.
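As a toy illustration of the filtering idea, the check below consults software-provided sharing information to decide whether a snooped bus address needs a local tag lookup at all. The page-granularity table and its contents are assumptions made for this example, not the paper's mechanism.

/* Toy illustration (not the paper's exact mechanism): use software-provided
 * sharing information to decide whether a snooped bus address even needs a
 * local tag lookup. Region table contents and granularity are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NPAGES     256

/* Filled by OS/compiler: does this core ever cache data from this page,
 * and can another core's access to it require a coherence action here?   */
static bool may_need_snoop[NPAGES];

static bool snoop_filter(uint64_t bus_addr) {
    uint64_t page = (bus_addr >> PAGE_SHIFT) % NPAGES;
    return may_need_snoop[page];     /* false -> skip tag lookup, save power */
}

int main(void) {
    may_need_snoop[3] = true;              /* one shared producer-consumer buffer */

    uint64_t private_addr = (uint64_t)7 << PAGE_SHIFT;   /* private data    */
    uint64_t shared_addr  = (uint64_t)3 << PAGE_SHIFT;   /* shared buffer   */

    printf("snoop 0x%llx -> %s\n", (unsigned long long)private_addr,
           snoop_filter(private_addr) ? "tag lookup" : "filtered");
    printf("snoop 0x%llx -> %s\n", (unsigned long long)shared_addr,
           snoop_filter(shared_addr) ? "tag lookup" : "filtered");
    return 0;
}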

9 citations


Cited by
01 Jan 2010
TL;DR: This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architecture-specific optimization, and verification, as well as other topics relevant to the design of parallel CAD algorithms and software tools.
Abstract: High-performance parallel computer architectures and systems have been improving at a phenomenal rate. In the meantime, VLSI computer-aided design (CAD) software for multibillion-transistor IC design has become increasingly complex and requires prohibitively high computational resources. Recent studies have shown that numerous CAD problems, with their high computational complexity, can greatly benefit from the fast-increasing parallel computation capabilities. However, parallel programming imposes big challenges for CAD applications. Fully exploiting the computational power of emerging general-purpose and domain-specific multicore/many-core processor systems calls for fundamental research and engineering practice across every stage of parallel CAD design, from algorithm exploration, programming models, and design-time and run-time environments to CAD applications such as verification, optimization, and simulation. This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architecture-specific optimization, and verification. More specifically, papers with in-depth and extensive coverage of the following topics will be considered, as well as other topics relevant to the design of parallel CAD algorithms and software tools:
1. Parallel algorithm design and specification for CAD applications
2. Parallel programming models and languages of particular use in CAD
3. Runtime support and performance optimization for CAD applications
4. Parallel architecture-specific design and optimization for CAD applications
5. Parallel program debugging and verification techniques particularly relevant for CAD
Papers should be submitted via the Manuscript Central website and should adhere to standard ACM TODAES formatting requirements (http://todaes.acm.org/). The page count limit is 25.

459 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: A TLP-aware cache management policy for CPU-GPU heterogeneous architectures is proposed, and a core-sampling mechanism to detect how caching affects the performance of a GPGPU application is introduced.
Abstract: Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performance. Lately, many researchers have proposed several shared cache management mechanisms, including dynamic cache partitioning and promotion-based cache management, but no cache management work has been done on CPU-GPU heterogeneous architectures. Sharing the LLC between CPUs and GPUs brings new challenges due to the different characteristics of CPU and GPGPU applications. Unlike most memory-intensive CPU benchmarks that hide memory latency with caching, many GPGPU applications hide memory latency by combining thread-level parallelism (TLP) and caching. In this paper, we propose a TLP-aware cache management policy for CPU-GPU heterogeneous architectures. We introduce a core-sampling mechanism to detect how caching affects the performance of a GPGPU application. Inspired by previous cache management schemes, Utility-based Cache Partitioning (UCP) and Re-Reference Interval Prediction (RRIP), we propose two new mechanisms: TAP-UCP and TAP-RRIP. TAP-UCP improves performance by 5% over UCP and 11% over LRU on 152 heterogeneous workloads, and TAP-RRIP improves performance by 9% over RRIP and 12% over LRU.
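The core-sampling mechanism can be pictured with the toy check below: run the same GPGPU kernel on two sample cores under different cache policies and compare their throughput; if the gap is small, the application hides memory latency through TLP and extra LLC capacity is better given to the CPU workload. The IPC numbers and the 5% threshold are invented for this example.

/* Toy version of the core-sampling decision (numbers are invented): two
 * sample cores run the same GPGPU kernel, one with a cache-friendly policy
 * and one that bypasses the LLC, and we compare their throughput. */
#include <stdbool.h>
#include <stdio.h>

/* Measured instructions-per-cycle of the two sample cores (hypothetical). */
static double ipc_cache_friendly_core = 0.82;
static double ipc_cache_bypass_core   = 0.79;

static bool gpgpu_app_is_cache_sensitive(double ipc_keep, double ipc_bypass,
                                         double threshold) {
    double delta = (ipc_keep - ipc_bypass) / ipc_bypass;  /* relative gain */
    return delta > threshold;
}

int main(void) {
    bool sensitive = gpgpu_app_is_cache_sensitive(ipc_cache_friendly_core,
                                                  ipc_cache_bypass_core,
                                                  0.05 /* 5% gain required */);
    printf("GPGPU app is %scache sensitive -> %s\n",
           sensitive ? "" : "not ",
           sensitive ? "share LLC normally" : "limit its LLC share");
    return 0;
}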

127 citations

Journal ArticleDOI
TL;DR: The aim of this survey is to enable engineers and researchers to get insights into the techniques for improving cache power efficiency and motivate them to invent novel solutions for enabling low-power operation of caches.

125 citations

Proceedings ArticleDOI
22 Oct 2011
TL;DR: This paper evaluates the two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance, and the analysis inspires three better designs, the first being bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously.
Abstract: Memory safety defends against inadvertent and malicious misuse of memory that may compromise program correctness and security. A critical element of memory safety is zero initialization. The direct cost of zero initialization is surprisingly high: up to 12.7%, with average costs ranging from 2.7% to 4.5% on a high-performance virtual machine on IA32 architectures. Zero initialization also incurs indirect costs due to its memory bandwidth demands and cache displacement effects. Existing virtual machines either (a) minimize direct costs by zeroing in large blocks, or (b) minimize indirect costs by zeroing in the allocation sequence, which reduces cache displacement and bandwidth. This paper evaluates these two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance. Our analysis inspires three better designs: (1) bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously, (2) concurrent non-temporal bulk zeroing that exploits parallel hardware to move work off the application's critical path, and (3) adaptive zeroing, which dynamically chooses between (1) and (2) based on available hardware parallelism. The new software strategies offer speedups sometimes greater than the direct overhead, improving total performance by 3% on average. Our findings invite additional optimizations and microarchitectural support.
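The first design, bulk zeroing with non-temporal stores, can be sketched in a few lines of x86/SSE2 code: zero a large block with streaming stores so that the written zeros do not displace useful cache lines. The block size and the alignment handling below are simplified assumptions for the example and do not reflect the paper's virtual-machine implementation.

/* Minimal sketch of non-temporal bulk zeroing: zero a large block with
 * streaming stores so the zeros bypass the cache hierarchy and do not
 * displace useful lines. x86/SSE2 only; simplified for illustration. */
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <stdio.h>
#include <stdlib.h>

static void bulk_zero_nontemporal(void *buf, size_t bytes) {
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;             /* assumes 16-byte alignment */
    for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
        _mm_stream_si128(&p[i], zero);       /* cache-bypassing store      */
    _mm_sfence();                            /* make the stores visible    */
}

int main(void) {
    size_t bytes = 1u << 20;                 /* 1 MiB allocation block     */
    void *block = aligned_alloc(16, bytes);
    if (!block) return 1;

    bulk_zero_nontemporal(block, bytes);
    printf("first byte after zeroing: %d\n", ((unsigned char *)block)[0]);

    free(block);
    return 0;
}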

81 citations

Proceedings ArticleDOI
05 Jun 2011
TL;DR: This paper presents a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks and can achieve 29.29% energy saving on average.
Abstract: Multicore architectures, especially chip multiprocessors, have been widely acknowledged as a successful design paradigm. Existing approaches primarily target application-driven partitioning of the shared cache to alleviate inter-core cache interference so that both performance and energy efficiency are improved. Dynamic cache reconfiguration is a promising technique for reducing the energy consumption of the cache subsystem in uniprocessor systems. In this paper, we present a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks. Our static-profiling-based algorithm is designed to judiciously find beneficial cache configurations (of private caches) for each task as well as partition factors (of the shared cache) for each core so that energy consumption is minimized while task deadlines are satisfied. Experimental results using real benchmarks demonstrate that our approach can achieve 29.29% energy saving on average compared to systems employing only cache partitioning.
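A much-simplified version of the per-task selection step is sketched below: from a statically profiled table of private-cache configurations, pick the lowest-energy one whose execution time still meets the task's deadline. The profile entries are invented, and the real algorithm also chooses shared-cache partition factors per core, which this sketch omits.

/* Simplified sketch of the selection step (not the paper's algorithm): from
 * a statically profiled table of private-cache configurations, pick for one
 * task the lowest-energy configuration that still meets the deadline.
 * Profile entries below are invented. */
#include <stdio.h>

typedef struct {
    const char *config;      /* private cache configuration (size/assoc)     */
    double energy_mj;        /* profiled energy for the task                 */
    double time_ms;          /* profiled execution time for the task         */
} profile_entry;

int main(void) {
    const profile_entry table[] = {
        { "4KB 1-way",  3.1, 9.8 },
        { "8KB 2-way",  3.9, 7.6 },
        { "16KB 2-way", 4.8, 6.9 },
        { "32KB 4-way", 6.5, 6.5 },
    };
    const double deadline_ms = 8.0;

    int best = -1;
    for (int i = 0; i < (int)(sizeof table / sizeof table[0]); i++) {
        if (table[i].time_ms > deadline_ms) continue;      /* misses deadline */
        if (best < 0 || table[i].energy_mj < table[best].energy_mj)
            best = i;
    }

    if (best >= 0)
        printf("chosen config: %s (%.1f mJ, %.1f ms, deadline %.1f ms)\n",
               table[best].config, table[best].energy_mj,
               table[best].time_ms, deadline_ms);
    else
        printf("no configuration meets the deadline\n");
    return 0;
}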

78 citations