ProfileMe: hardware support for instruction-level profiling on out-of-order processors

doi:10.5555/266800.266828

Open Access Proceedings Article

- pp 292-302

TLDR

An inexpensive hardware implementation of ProfileMe is described, a variety of software techniques to extract useful profile information from the hardware are outlined, and several ways in which this information can provide valuable feedback for programmers and optimizers are explained.

Abstract:

Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.
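The difference between counting events and sampling instructions can be illustrated in software. The following is a minimal sketch, not the hardware design from the paper: it assumes a hypothetical `trace` of per-instruction records (each a dict with a `pc` and boolean event flags such as `dcache_miss`), picks instructions at randomized intervals, and scales the per-PC sample counts by the mean sampling interval to estimate event totals. The function names, record layout, and interval distribution are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_instructions(trace, mean_interval, seed=0):
    """Select instructions at randomized intervals and return their records.

    `trace` is a hypothetical list of dicts, one per fetched instruction.
    Intervals are drawn uniformly from [1, 2*mean_interval - 1], so their
    mean is `mean_interval`; randomizing avoids lock-step with loops.
    """
    rng = random.Random(seed)
    samples = []
    i = rng.randint(1, 2 * mean_interval - 1)
    while i < len(trace):
        samples.append(trace[i])
        i += rng.randint(1, 2 * mean_interval - 1)
    return samples


def estimate_event_counts(samples, mean_interval, event):
    """Estimate per-PC totals for `event` from the sampled records.

    Each sample stands in for roughly `mean_interval` instructions, so the
    number of samples at a PC showing the event is scaled by that factor.
    """
    hits = Counter()
    for record in samples:
        if record[event]:
            hits[record["pc"]] += 1
    return {pc: n * mean_interval for pc, n in hits.items()}
```

Because every sampled record carries the program counter of the instruction that experienced the event, the estimate is attributed to that instruction directly, which is the property the abstract says plain event counters lack on out-of-order machines.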

Citations

Proceedings Article

Selective cache ways: on-demand cache resource allocation

David H. Albonesi

TL;DR: This paper proposes trading a small performance degradation for energy savings, and shows that the tradeoff can produce a significant reduction in cache energy dissipation.

Proceedings Article

Cache decay: exploiting generational behavior to reduce cache leakage power

Stefanos Kaxiras, +2 more

TL;DR: This paper discusses policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused, and proposes adaptive policies that effectively reduce L1 cache leakage energy by 5x for the SPEC2000 benchmarks with only negligible degradations in performance.

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Benjamin H. Sigelman, +7 more

TL;DR: This paper introduces the design of Dapper, Google’s production distributed systems tracing infrastructure, and describes how its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met.

Proceedings Article

Rapidly Selecting Good Compiler Optimizations using Performance Counters

John Cavazos, +5 more

TL;DR: This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings by learning a model off-line which can then be used to determine good settings for any new program.

Journal Article

A Survey of Adaptive Optimization in Virtual Machines

Matthew Arnold, +4 more

TL;DR: This paper surveys the evolution and current state of adaptive optimization technology in virtual machines and concludes that adaptive optimization has begun to mature as a widespread production-level technology.

References

Journal Article

Trace Scheduling: A Technique for Global Microcode Compaction

Fisher

- 01 Jul 1981 -

IEEE Transactions on Computers

TL;DR: Compilation of high-level microcode languages into efficient horizontal microcode and good hand coding probably both require effective global compaction techniques.

Proceedings Article

Efficient path profiling

Thomas Ball, +1 more

TL;DR: A new algorithm for path profiling is described, which selects and places profile instrumentation to minimize run-time overhead and identifies longer paths than a previous technique, which predicted paths from edge profiles.

Journal Article

Continuous profiling: where have all the cycles gone?

Jennifer M. Anderson, +9 more

TL;DR: The Digital Continuous Profiling Infrastructure is a sampling-based profiling system designed to run continuously on production systems; it supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel.

Proceedings Article

Operating system support for improving data locality on CC-NUMA compute servers

Ben Verghese, +3 more

TL;DR: The experiments show that dynamic page migration and replication can substantially increase application performance, by as much as 30%, and reduce contention for resources in the NUMA memory system.

Journal Article

Avoiding conflict misses dynamically in large direct-mapped caches

Brian N. Bershad, +3 more

TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.

Related Papers (5)

Continuous profiling: where have all the cycles gone?

Exploiting hardware performance counters with flow and context sensitive profiling

Efficient path profiling

Dynamo: a transparent dynamic optimization system

A framework for reducing the cost of instrumented code