ProfileMe: hardware support for instruction-level profiling on out-of-order processors
Jeffrey Dean, James W. Hicks, Carl A. Waldspurger, William E. Weihl, George Z. Chrysos
- pp 292-302
TLDR
An inexpensive hardware implementation of ProfileMe is described, a variety of software techniques to extract useful profile information from the hardware are outlined, and several ways in which this information can provide valuable feedback for programmers and optimizers are explained.
Abstract
Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.
Citations
Proceedings Article
Selective cache ways: on-demand cache resource allocation
TL;DR: In this paper, a small performance degradation is traded for energy savings, and the tradeoff can produce a significant reduction in cache energy dissipation.
Proceedings Article
Cache decay: exploiting generational behavior to reduce cache leakage power
TL;DR: This paper discusses policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused, and proposes adaptive policies that effectively reduce L1 cache leakage energy by 5x for the SPEC2000 with only negligible degradations in performance.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Benjamin H. Sigelman, Luiz Andre Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, Chandan Shanbhag
TL;DR: This paper introduces the design of Dapper, Google’s production distributed systems tracing infrastructure, and describes how its design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large scale system were met.
Proceedings Article
Rapidly Selecting Good Compiler Optimizations using Performance Counters
TL;DR: This paper proposes a different approach using performance counters as a means of determining good compiler optimization settings by learning a model off-line which can then be used to determine good settings for any new program.
Journal Article
A Survey of Adaptive Optimization in Virtual Machines
TL;DR: This paper surveys the evolution and current state of adaptive optimization technology in virtual machines and concludes that adaptive optimization has begun to mature as a widespread production-level technology.
References
Journal Article
Trace Scheduling: A Technique for Global Microcode Compaction
TL;DR: Compilation of high-level microcode languages into efficient horizontal microcode and good hand coding probably both require effective global compaction techniques.
Proceedings Article
Efficient path profiling
Thomas Ball, James R. Larus
TL;DR: A new algorithm for path profiling is described, which selects and places profile instrumentation to minimize run-time overhead and identifies longer paths than a previous technique, which predicted paths from edge profiles.
Journal Article
Continuous profiling: where have all the cycles gone?
Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika Henzinger, Shun-Tak Albert Leung, Richard L. Sites, Mark T. Vandevoorde, Carl A. Waldspurger, William E. Weihl
TL;DR: The Digital Continuous Profiling Infrastructure is a sampling-based profiling system designed to run continuously on production systems. It supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel.
Proceedings Article
Operating system support for improving data locality on CC-NUMA compute servers
TL;DR: The experiments show that dynamic page migration and replication can substantially increase application performance, by as much as 30%, and reduce contention for resources in the NUMA memory system.
Journal Article
Avoiding conflict misses dynamically in large direct-mapped caches
TL;DR: Using trace-driven simulation of applications and the operating system, it is shown that a CML buffer enables a large direct-mapped cache to perform nearly as well as a two-way set associative cache of equivalent size and speed, although with lower hardware cost and complexity.