Author

Chenjie Yu

Bio: Chenjie Yu is an academic researcher from the University of Maryland, College Park. The author has contributed to research in topics: Cache & Cache algorithms. The author has an h-index of 7 and has co-authored 11 publications receiving 117 citations.

Papers
Proceedings ArticleDOI
13 Jun 2010
TL;DR: This work takes a different approach in which tasks' memory bandwidth requirements are taken into account when identifying a cache partitioning for multi-programmed and/or multithreaded workloads, so that the overall system bandwidth requirement is minimized for the target workload.
Abstract: We present a methodology for off-chip memory bandwidth minimization through application-driven L2 cache partitioning in multi-core systems. A major challenge in multi-core system design is the widening gap between the memory demand generated by the processor cores and the limited off-chip memory bandwidth and memory service speed. This severely restricts the number of cores that can be integrated into a multi-core system and the parallelism that can actually be achieved and efficiently exploited, not only for memory-demanding applications but also for workloads consisting of many tasks that utilize a large number of cores and thus exceed the available off-chip bandwidth. Last-level shared cache partitioning has been shown to be a promising technique to enhance cache utilization and reduce miss rates. While most cache partitioning techniques focus on cache miss rates, our work takes a different approach in which tasks' memory bandwidth requirements are taken into account when identifying a cache partitioning for multi-programmed and/or multithreaded workloads. Cache resources are allocated with the objective that the overall system bandwidth requirement is minimized for the target workload. The key insight is that cache miss-rate information may severely misrepresent the actual bandwidth demand of a task, which ultimately determines the overall system performance and power consumption.
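As a rough illustration of the bandwidth-driven objective described above, the sketch below greedily hands out L2 ways to whichever task currently saves the most off-chip bandwidth per additional way. The task mix, profile numbers, and greedy strategy are assumptions made for this example; the paper's actual partitioning algorithm may differ.

/* Minimal sketch of bandwidth-driven cache partitioning (not the paper's
 * actual algorithm): greedily hand out L2 ways to whichever task currently
 * saves the most off-chip bandwidth per extra way. Profile numbers below
 * are made up for illustration. */
#include <stdio.h>

#define NTASKS 3
#define NWAYS  8
#define LINE_BYTES 64

/* mpki[t][w]: profiled misses per kilo-instruction of task t given w+1 ways */
static const double mpki[NTASKS][NWAYS] = {
    { 40, 30, 22, 16, 12, 10,  9,  8 },   /* streaming, bandwidth hungry */
    { 25, 12,  6,  3,  2,  2,  2,  2 },   /* cache friendly              */
    { 15, 14, 14, 13, 13, 13, 13, 13 },   /* mostly cache insensitive    */
};
/* instruction throughput of each task, in kilo-instructions per second */
static const double kips[NTASKS] = { 900.0, 600.0, 400.0 };

static double bw(int t, int ways) {            /* bytes/s of off-chip traffic */
    return kips[t] * mpki[t][ways - 1] * LINE_BYTES;
}

int main(void) {
    int ways[NTASKS] = { 1, 1, 1 };            /* every task needs >= 1 way   */
    int left = NWAYS - NTASKS;

    while (left-- > 0) {                       /* give one way at a time...   */
        int best = 0;
        double best_gain = -1.0;
        for (int t = 0; t < NTASKS; t++) {
            if (ways[t] == NWAYS) continue;
            double gain = bw(t, ways[t]) - bw(t, ways[t] + 1);
            if (gain > best_gain) { best_gain = gain; best = t; }
        }
        ways[best]++;                          /* ...to the biggest bandwidth saver */
    }

    double total = 0.0;
    for (int t = 0; t < NTASKS; t++) {
        total += bw(t, ways[t]);
        printf("task %d: %d ways, %.0f bytes/s off-chip\n", t, ways[t], bw(t, ways[t]));
    }
    printf("total off-chip bandwidth: %.0f bytes/s\n", total);
    return 0;
}

The point of the example is that allocation is driven by bandwidth saved (access rate × miss rate × line size) rather than by miss rate alone.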

37 citations

Proceedings ArticleDOI
08 Jun 2008
TL;DR: This paper proposes a compiler-based register reassignment methodology whose purpose is to break up such groups of registers and to uniformly distribute accesses across the register file; it also shows that the underlying problem is NP-hard.
Abstract: Temperature hot-spots have been known to cause severe reliability problems and to significantly increase leakage power. The register file has previously been shown to exhibit the highest temperature of all hardware components in a modern high-end embedded processor, which makes it particularly susceptible to faults and elevated leakage power. We show that this is mostly due to highly clustered register file accesses, where a small set of registers physically placed close to each other are accessed with very high frequency. In this paper we propose a compiler-based register reassignment methodology whose purpose is to break up such groups of registers and to uniformly distribute accesses across the register file. This is achieved with no performance or hardware overhead. We show that the underlying problem is NP-hard, and subsequently introduce an efficient algorithmic heuristic.
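One simple way such a reassignment could look in principle is sketched below: spread the most frequently accessed registers across the physical register file by interleaving them with cold ones, so that no two hot registers sit in adjacent entries. The access profile and the interleaving rule are invented for illustration and are not the heuristic from the paper.

/* Rough sketch (not the paper's heuristic): permute the architectural-to-
 * physical register mapping so that the most frequently accessed registers
 * end up physically far apart, interleaved with cold ones, spreading access
 * (and heat) density across the register file. Access counts would come
 * from compiler profiling; the numbers here are made up. */
#include <stdio.h>
#include <stdlib.h>

#define NREGS 16

typedef struct { int reg; long accesses; } reg_info;

static int by_accesses_desc(const void *a, const void *b) {
    long d = ((const reg_info *)b)->accesses - ((const reg_info *)a)->accesses;
    return (d > 0) - (d < 0);
}

int main(void) {
    reg_info info[NREGS];
    for (int r = 0; r < NREGS; r++) {          /* fake clustered profile:   */
        info[r].reg = r;                       /* low-numbered registers    */
        info[r].accesses = (r < 4) ? 100000 - 1000 * r : 500 + r;  /* are hot */
    }
    qsort(info, NREGS, sizeof info[0], by_accesses_desc);

    /* Walk the register file, alternating hot and cold registers, so that
     * physically adjacent entries never both come from the hot group.      */
    int placement[NREGS];                      /* physical slot -> arch reg  */
    int hot = 0, cold = NREGS - 1;
    for (int slot = 0; slot < NREGS; slot++)
        placement[slot] = (slot % 2 == 0) ? info[hot++].reg : info[cold--].reg;

    for (int slot = 0; slot < NREGS; slot++)
        printf("physical slot %2d <- architectural r%d\n", slot, placement[slot]);
    return 0;
}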

16 citations

Journal ArticleDOI
TL;DR: The proposed architecture effectively implements the queued-lock semantics in a completely decentralized manner through low-cost and distributed synchronization controllers performing distributed synchronization management protocols.
Abstract: We present a framework for a distributed and low-cost implementation of synchronization mechanisms for embedded shared-memory multiprocessors. The proposed architecture effectively implements the queued-lock semantics in a completely decentralized manner through low-cost, distributed synchronization controllers performing distributed synchronization management protocols. The proposed approach achieves three major benefits. First, it completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable. Second, it exhibits extremely low best-case lock-acquisition latency (with zero bus transactions). Third, it enables multiple venues for high energy efficiency, as the local synchronization controllers can efficiently determine, without any bus transactions or local cache spinning, the exact moment when a lock becomes available to, or a barrier is enabled at, the local processor. This makes it possible for the system software or the thread library to employ various low-power policies.
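For readers unfamiliar with queued-lock semantics, the software ticket lock below illustrates the ordering guarantee the controllers provide: requesters are served strictly in arrival order, and each waiter knows exactly when the lock has been handed to it. This is only an illustration of the semantics, using C11 threads and atomics; the paper realizes them in hardware through distributed controllers, with no bus traffic or cache spinning while waiting.

/* Sketch of queued-lock semantics: requesters are served strictly in
 * arrival order, and each waiter learns exactly when the lock is handed
 * to it. This software ticket lock only illustrates the ordering
 * guarantee; the paper implements it with distributed hardware
 * synchronization controllers instead of spinning. */
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

typedef struct {
    atomic_uint next_ticket;   /* ticket handed to the next requester   */
    atomic_uint now_serving;   /* ticket currently holding the lock     */
} queued_lock;

static queued_lock qlock = { 0, 0 };
static long counter = 0;

static void qlock_acquire(queued_lock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);   /* join the queue */
    while (atomic_load(&l->now_serving) != my)
        ;  /* in the paper's design a local controller signals this moment  */
}

static void qlock_release(queued_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);                 /* FIFO hand-off  */
}

static int worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        qlock_acquire(&qlock);
        counter++;                              /* protected critical section */
        qlock_release(&qlock);
    }
    return 0;
}

int main(void) {
    thrd_t t[4];
    for (int i = 0; i < 4; i++) thrd_create(&t[i], worker, NULL);
    for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
    return 0;
}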

14 citations

Proceedings ArticleDOI
19 Oct 2008
TL;DR: The proposed approach to synchronization implementation not only completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable, but also achieves very high energy efficiency as the local synchronization controller can efficiently determine, without any bus transactions or local cache spinning, the exact timing of when the lock is made available to the local processor.
Abstract: In this paper we present a framework for a distributed and very low-cost implementation of synchronization controllers and protocols for embedded multiprocessors. The proposed architecture effectively implements the queued-lock semantics in a completely distributed way. The proposed approach to synchronization not only completely eliminates the overwhelming bus contention traffic when multiple cores compete for a synchronization variable, but also achieves very high energy efficiency, as the local synchronization controller can efficiently determine, without any bus transactions or local cache spinning, the exact timing of when the lock is made available to the local processor. Application-specific information regarding synchronization variables in the local task is exploited in implementing the distributed synchronization protocol. The local synchronization controllers enable the system software or the thread library to implement various low-power policies, such as disabling cache accesses or even completely powering down the local processor while waiting for a synchronization variable.
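To make the last point concrete, the sketch below shows how a thread library might layer such low-power wait policies on top of a local synchronization controller. The controller and power-management calls (sync_ctrl_try_acquire, sync_ctrl_wait_for_grant, cache_gate, core_sleep_until_grant) are hypothetical stubs invented for this example; the paper does not define a software interface at this level of detail.

/* Illustrative sketch only: how a thread library might choose a low-power
 * wait policy on top of a local synchronization controller. Every
 * controller/back-end call here is a hypothetical stub. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { WAIT_SPIN, WAIT_GATE_CACHE, WAIT_POWER_DOWN } wait_policy;

/* --- hypothetical controller/back-end stubs ------------------------------ */
static bool sync_ctrl_try_acquire(int lock_id) { (void)lock_id; return false; }
static void sync_ctrl_wait_for_grant(int lock_id) { (void)lock_id; /* blocks until granted */ }
static void cache_gate(bool off) { printf("cache %s\n", off ? "gated" : "on"); }
static void core_sleep_until_grant(int lock_id) { (void)lock_id; printf("core sleeping\n"); }

/* Acquire lock_id, choosing how to wait if the controller says "not yet". */
static void lib_lock(int lock_id, wait_policy policy) {
    if (sync_ctrl_try_acquire(lock_id))
        return;                                   /* zero-transaction fast path */
    switch (policy) {
    case WAIT_SPIN:                               /* latency-critical section   */
        sync_ctrl_wait_for_grant(lock_id);
        break;
    case WAIT_GATE_CACHE:                         /* expected short wait        */
        cache_gate(true);
        sync_ctrl_wait_for_grant(lock_id);
        cache_gate(false);
        break;
    case WAIT_POWER_DOWN:                         /* expected long wait         */
        core_sleep_until_grant(lock_id);
        break;
    }
}

int main(void) {
    lib_lock(7, WAIT_GATE_CACHE);                 /* lock id 7 is arbitrary     */
    puts("lock acquired");
    return 0;
}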

10 citations

Proceedings ArticleDOI
30 Sep 2007
TL;DR: This work proposes an application-driven customization technique in which application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache tag lookups, even for references to shared data, thus achieving significant power reduction with minimal hardware cost.
Abstract: Snoop-based cache coherence protocols are typically used when multiple processor cores share memory through a common bus. It is well known, however, that these coherence protocols introduce an excessive power overhead. To help alleviate this problem, we propose an application-driven customization technique in which application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache tag lookups, even for references to shared data, thus achieving significant power reduction with minimal hardware cost. Snoop-induced cache tag lookups for accesses to both shared and private data are eliminated when it is ensured that such lookups would not yield any extra knowledge regarding the cache state with respect to the other caches and memories. The proposed methodology relies on combined support from the compiler, the operating system, and the hardware architecture. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.
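As a toy illustration of the filtering idea, the check below consults software-provided sharing information to decide whether a snooped bus address needs a local tag lookup at all. The page-granularity table and its contents are assumptions made for this example, not the paper's mechanism.

/* Toy illustration (not the paper's exact mechanism): use software-provided
 * sharing information to decide whether a snooped bus address even needs a
 * local tag lookup. Region table contents and granularity are invented. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define NPAGES     256

/* Filled by OS/compiler: does this core ever cache data from this page,
 * and can another core's access to it require a coherence action here?   */
static bool may_need_snoop[NPAGES];

static bool snoop_filter(uint64_t bus_addr) {
    uint64_t page = (bus_addr >> PAGE_SHIFT) % NPAGES;
    return may_need_snoop[page];     /* false -> skip tag lookup, save power */
}

int main(void) {
    may_need_snoop[3] = true;              /* one shared producer-consumer buffer */

    uint64_t private_addr = (uint64_t)7 << PAGE_SHIFT;   /* private data    */
    uint64_t shared_addr  = (uint64_t)3 << PAGE_SHIFT;   /* shared buffer   */

    printf("snoop 0x%llx -> %s\n", (unsigned long long)private_addr,
           snoop_filter(private_addr) ? "tag lookup" : "filtered");
    printf("snoop 0x%llx -> %s\n", (unsigned long long)shared_addr,
           snoop_filter(shared_addr) ? "tag lookup" : "filtered");
    return 0;
}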

9 citations


Cited by
01 Jan 2010
TL;DR: This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architecture-specific optimization, and verification, as well as other topics relevant to the design of parallel CAD algorithms and software tools.
Abstract: High-performance parallel computer architectures and systems have been improving at a phenomenal rate. In the meantime, VLSI computer-aided design (CAD) software for multibillion-transistor IC design has become increasingly complex and requires prohibitively high computational resources. Recent studies have shown that numerous CAD problems, with their high computational complexity, can greatly benefit from the fast-increasing parallel computation capabilities. However, parallel programming imposes big challenges for CAD applications. Fully exploiting the computational power of emerging general-purpose and domain-specific multicore/many-core processor systems calls for fundamental research and engineering practice across every stage of parallel CAD design, from algorithm exploration, programming models, and design-time and run-time environments to CAD applications such as verification, optimization, and simulation. This journal special section will cover recent progress on parallel CAD research, including algorithm foundations, programming models, parallel architecture-specific optimization, and verification. More specifically, papers with in-depth and extensive coverage of the following topics will be considered, as well as other topics relevant to the design of parallel CAD algorithms and software tools:
1. Parallel algorithm design and specification for CAD applications
2. Parallel programming models and languages of particular use in CAD
3. Runtime support and performance optimization for CAD applications
4. Parallel architecture-specific design and optimization for CAD applications
5. Parallel program debugging and verification techniques particularly relevant for CAD
Papers should be submitted via the Manuscript Central website and should adhere to standard ACM TODAES formatting requirements (http://todaes.acm.org/). The page count limit is 25.

459 citations

Proceedings ArticleDOI
25 Feb 2012
TL;DR: A TLP-aware cache management policy for CPU-GPU heterogeneous architectures is proposed, and a core-sampling mechanism to detect how caching affects the performance of a GPGPU application is introduced.
Abstract: Combining CPUs and GPUs on the same chip has become a popular architectural trend. However, these heterogeneous architectures put more pressure on shared resource management. In particular, managing the last-level cache (LLC) is very critical to performance. Lately, many researchers have proposed several shared cache management mechanisms, including dynamic cache partitioning and promotion-based cache management, but no cache management work has been done on CPU-GPU heterogeneous architectures. Sharing the LLC between CPUs and GPUs brings new challenges due to the different characteristics of CPU and GPGPU applications. Unlike most memory-intensive CPU benchmarks that hide memory latency with caching, many GPGPU applications hide memory latency by combining thread-level parallelism (TLP) and caching. In this paper, we propose a TLP-aware cache management policy for CPU-GPU heterogeneous architectures. We introduce a core-sampling mechanism to detect how caching affects the performance of a GPGPU application. Inspired by previous cache management schemes, Utility-based Cache Partitioning (UCP) and Re-Reference Interval Prediction (RRIP), we propose two new mechanisms: TAP-UCP and TAP-RRIP. TAP-UCP improves performance by 5% over UCP and 11% over LRU on 152 heterogeneous workloads, and TAP-RRIP improves performance by 9% over RRIP and 12% over LRU.
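The core-sampling mechanism can be pictured with the toy check below: run the same GPGPU kernel on two sample cores under different cache policies and compare their throughput; if the gap is small, the application hides memory latency through TLP and extra LLC capacity is better given to the CPU workload. The IPC numbers and the 5% threshold are invented for this example.

/* Toy version of the core-sampling decision (numbers are invented): two
 * sample cores run the same GPGPU kernel, one with a cache-friendly policy
 * and one that bypasses the LLC, and we compare their throughput. */
#include <stdbool.h>
#include <stdio.h>

/* Measured instructions-per-cycle of the two sample cores (hypothetical). */
static double ipc_cache_friendly_core = 0.82;
static double ipc_cache_bypass_core   = 0.79;

static bool gpgpu_app_is_cache_sensitive(double ipc_keep, double ipc_bypass,
                                         double threshold) {
    double delta = (ipc_keep - ipc_bypass) / ipc_bypass;  /* relative gain */
    return delta > threshold;
}

int main(void) {
    bool sensitive = gpgpu_app_is_cache_sensitive(ipc_cache_friendly_core,
                                                  ipc_cache_bypass_core,
                                                  0.05 /* 5% gain required */);
    printf("GPGPU app is %scache sensitive -> %s\n",
           sensitive ? "" : "not ",
           sensitive ? "share LLC normally" : "limit its LLC share");
    return 0;
}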

127 citations

Journal ArticleDOI
TL;DR: The aim of this survey is to enable engineers and researchers to get insights into the techniques for improving cache power efficiency and motivate them to invent novel solutions for enabling low-power operation of caches.

125 citations

Proceedings ArticleDOI
22 Oct 2011
TL;DR: This paper evaluates the two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance, and the analysis inspires three better designs, the first being bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously.
Abstract: Memory safety defends against inadvertent and malicious misuse of memory that may compromise program correctness and security. A critical element of memory safety is zero initialization. The direct cost of zero initialization is surprisingly high: up to 12.7%, with average costs ranging from 2.7% to 4.5% on a high-performance virtual machine on IA32 architectures. Zero initialization also incurs indirect costs due to its memory bandwidth demands and cache displacement effects. Existing virtual machines either (a) minimize direct costs by zeroing in large blocks, or (b) minimize indirect costs by zeroing in the allocation sequence, which reduces cache displacement and bandwidth. This paper evaluates these two widely used zero initialization designs, showing that they make different tradeoffs to achieve very similar performance. Our analysis inspires three better designs: (1) bulk zeroing with cache-bypassing (non-temporal) instructions to reduce the direct and indirect zeroing costs simultaneously, (2) concurrent non-temporal bulk zeroing that exploits parallel hardware to move work off the application's critical path, and (3) adaptive zeroing, which dynamically chooses between (1) and (2) based on available hardware parallelism. The new software strategies offer speedups sometimes greater than the direct overhead, improving total performance by 3% on average. Our findings invite additional optimizations and microarchitectural support.
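The first design, bulk zeroing with non-temporal stores, can be sketched in a few lines of x86/SSE2 code: zero a large block with streaming stores so that the written zeros do not displace useful cache lines. The block size and the alignment handling below are simplified assumptions for the example and do not reflect the paper's virtual-machine implementation.

/* Minimal sketch of non-temporal bulk zeroing: zero a large block with
 * streaming stores so the zeros bypass the cache hierarchy and do not
 * displace useful lines. x86/SSE2 only; simplified for illustration. */
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <stdio.h>
#include <stdlib.h>

static void bulk_zero_nontemporal(void *buf, size_t bytes) {
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)buf;             /* assumes 16-byte alignment */
    for (size_t i = 0; i < bytes / sizeof(__m128i); i++)
        _mm_stream_si128(&p[i], zero);       /* cache-bypassing store      */
    _mm_sfence();                            /* make the stores visible    */
}

int main(void) {
    size_t bytes = 1u << 20;                 /* 1 MiB allocation block     */
    void *block = aligned_alloc(16, bytes);
    if (!block) return 1;

    bulk_zero_nontemporal(block, bytes);
    printf("first byte after zeroing: %d\n", ((unsigned char *)block)[0]);

    free(block);
    return 0;
}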

81 citations

Proceedings ArticleDOI
05 Jun 2011
TL;DR: This paper presents a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks and can achieve 29.29% energy saving on average.
Abstract: Multicore architectures, especially chip multiprocessors, have been widely acknowledged as a successful design paradigm. Existing approaches primarily target application-driven partitioning of the shared cache to alleviate inter-core cache interference so that both performance and energy efficiency are improved. Dynamic cache reconfiguration is a promising technique for reducing the energy consumption of the cache subsystem in uniprocessor systems. In this paper, we present a novel energy optimization technique which employs both dynamic reconfiguration of private caches and partitioning of the shared cache for multicore systems with real-time tasks. Our static-profiling-based algorithm is designed to judiciously find beneficial cache configurations (of private caches) for each task as well as partition factors (of the shared cache) for each core so that energy consumption is minimized while task deadlines are satisfied. Experimental results using real benchmarks demonstrate that our approach can achieve 29.29% energy saving on average compared to systems employing only cache partitioning.
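A much-simplified version of the per-task selection step is sketched below: from a statically profiled table of private-cache configurations, pick the lowest-energy one whose execution time still meets the task's deadline. The profile entries are invented, and the real algorithm also chooses shared-cache partition factors per core, which this sketch omits.

/* Simplified sketch of the selection step (not the paper's algorithm): from
 * a statically profiled table of private-cache configurations, pick for one
 * task the lowest-energy configuration that still meets the deadline.
 * Profile entries below are invented. */
#include <stdio.h>

typedef struct {
    const char *config;      /* private cache configuration (size/assoc)     */
    double energy_mj;        /* profiled energy for the task                 */
    double time_ms;          /* profiled execution time for the task         */
} profile_entry;

int main(void) {
    const profile_entry table[] = {
        { "4KB 1-way",  3.1, 9.8 },
        { "8KB 2-way",  3.9, 7.6 },
        { "16KB 2-way", 4.8, 6.9 },
        { "32KB 4-way", 6.5, 6.5 },
    };
    const double deadline_ms = 8.0;

    int best = -1;
    for (int i = 0; i < (int)(sizeof table / sizeof table[0]); i++) {
        if (table[i].time_ms > deadline_ms) continue;      /* misses deadline */
        if (best < 0 || table[i].energy_mj < table[best].energy_mj)
            best = i;
    }

    if (best >= 0)
        printf("chosen config: %s (%.1f mJ, %.1f ms, deadline %.1f ms)\n",
               table[best].config, table[best].energy_mj,
               table[best].time_ms, deadline_ms);
    else
        printf("no configuration meets the deadline\n");
    return 0;
}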

78 citations