Open Access, Proceedings Article

ProfileMe: hardware support for instruction-level profiling on out-of-order processors

TL;DR
An inexpensive hardware implementation of ProfileMe is described, a variety of software techniques to extract useful profile information from the hardware are outlined, and several ways in which this information can provide valuable feedback for programmers and optimizers are explained.
Abstract
Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.
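As a rough illustration of how software might turn sampled-instruction records into a per-instruction profile, the following Python sketch groups samples by program counter and scales counts by the mean sampling interval. The record layout (`Sample`, `dcache_miss`, `fetch_to_retire`) and the `aggregate` helper are hypothetical names chosen for this example, not the hardware registers or software interface described in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical per-instruction record; field names are illustrative."""
    pc: int               # address of the sampled instruction
    retired: bool         # True if the instruction retired, False if it was aborted
    dcache_miss: bool     # True if the instruction missed in the data cache
    fetch_to_retire: int  # cycles from fetch until the instruction left the pipeline


def aggregate(samples, sampling_interval):
    """Group samples by PC and scale counts by the mean sampling interval.

    Assumes one instruction is sampled, on average, every
    `sampling_interval` fetched instructions.
    """
    by_pc = defaultdict(list)
    for s in samples:
        by_pc[s.pc].append(s)

    profile = {}
    for pc, group in by_pc.items():
        n = len(group)
        misses = sum(s.dcache_miss for s in group)
        profile[pc] = {
            "samples": n,
            "est_fetched": n * sampling_interval,
            "est_dcache_misses": misses * sampling_interval,
            "miss_rate": misses / n,
            "retire_rate": sum(s.retired for s in group) / n,
            "mean_latency": sum(s.fetch_to_retire for s in group) / n,
        }
    return profile


if __name__ == "__main__":
    demo = [
        Sample(pc=0x100, retired=True, dcache_miss=True, fetch_to_retire=40),
        Sample(pc=0x100, retired=True, dcache_miss=False, fetch_to_retire=10),
        Sample(pc=0x200, retired=False, dcache_miss=False, fetch_to_retire=6),
    ]
    for pc, stats in sorted(aggregate(demo, sampling_interval=1000).items()):
        print(hex(pc), stats)
```

Because each record already names the instruction it belongs to, events are attributed by construction. That is the contrast with event counters, whose overflow interrupts may be delivered some number of instructions after the one that caused the event.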

Citations
Proceedings Article

Selective cache ways: on-demand cache resource allocation

TL;DR: This paper proposes trading a small performance degradation for energy savings, and shows that this tradeoff can produce a significant reduction in cache energy dissipation.
Proceedings Article

Cache decay: exploiting generational behavior to reduce cache leakage power

TL;DR: This paper discusses policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused, and proposes adaptive policies that effectively reduce L1 cache leakage energy by 5x for SPEC2000 with only negligible degradations in performance.

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

TL;DR: This paper introduces the design of Dapper, Google’s production distributed systems tracing infrastructure, and describes how it met its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system.
Proceedings Article

Rapidly Selecting Good Compiler Optimizations using Performance Counters

TL;DR: This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings by learning a model off-line which can then be used to determine good settings for any new program.
Journal Article

A Survey of Adaptive Optimization in Virtual Machines

TL;DR: This paper surveys the evolution and current state of adaptive optimization technology in virtual machines and concludes that adaptive optimization has begun to mature as a widespread production-level technology.