Journal ArticleDOI

Pin: building customized program analysis tools with dynamic instrumentation

12 Jun 2005 - Vol. 40, Iss. 6, pp. 190-200
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation; to illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Abstract: Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin. Our goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. Instrumentation tools (called Pintools) are written in C/C++ using Pin's rich API. Pin follows the model of ATOM, allowing the tool writer to analyze an application at the instruction level without the need for detailed knowledge of the underlying instruction set. The API is designed to be architecture independent whenever possible, making Pintools source compatible across different architectures. However, a Pintool can access architecture-specific details when necessary. Instrumentation with Pin is mostly transparent as the application and Pintool observe the application's original, uninstrumented behavior. Pin uses dynamic compilation to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated approach delivers significantly better instrumentation performance than similar tools. For example, Pin is 3.3x faster than Valgrind and 2x faster than DynamoRIO for basic-block counting. To illustrate Pin's versatility, we describe two Pintools in daily use to analyze production software. Pin is publicly available for Linux platforms on four architectures: IA32 (32-bit x86), EM64T (64-bit x86), Itanium®, and ARM. In the ten months since Pin 2 was released in July 2004, there have been over 3000 downloads from its website.
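To make the API concrete, the following is a minimal sketch of a Pintool for basic-block counting, modeled on the instruction-counting examples that ship with Pin (the analysis-routine name docount is illustrative; the Pin API calls are the real ones):

    // Count executed instructions, instrumenting once per basic block.
    #include "pin.H"
    #include <iostream>

    static UINT64 icount = 0;

    // Analysis routine: runs each time a basic block executes. Calls this
    // simple are candidates for the inlining optimization described above.
    VOID docount(UINT32 numInsInBbl) { icount += numInsInBbl; }

    // Instrumentation routine: called by Pin's dynamic compiler once per trace.
    VOID Trace(TRACE trace, VOID* v)
    {
        for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
            BBL_InsertCall(bbl, IPOINT_BEFORE, (AFUNPTR)docount,
                           IARG_UINT32, BBL_NumIns(bbl), IARG_END);
    }

    VOID Fini(INT32 code, VOID* v) { std::cerr << "Instructions: " << icount << std::endl; }

    int main(int argc, char* argv[])
    {
        if (PIN_Init(argc, argv)) return 1;
        TRACE_AddInstrumentFunction(Trace, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

Incrementing one counter per basic block rather than per instruction is the same basic-block counting workload used for the Valgrind and DynamoRIO comparisons quoted above.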


Citations
Proceedings ArticleDOI
04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption behaviors, and it has led to some important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Abstract: This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and GPU platforms. The choice of applications is inspired by Berkeley's dwarf taxonomy. Our characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

2,697 citations


Cites methods from "Pin: building customized program an..."

  • ...While developing and characterizing these benchmarks, we have experienced first-hand the following challenges of the GPU platform: Data Structure Mapping: Programmers must find efficient mappings of their applications’ data structures to CUDA’s hierarchical (grid of thread blocks) domain model....


Proceedings ArticleDOI
10 Jun 2007
TL;DR: Valgrind is described, a DBI framework designed for building heavyweight DBA tools; although lightweight tools built with it run comparatively slowly, it can be used to build more interesting, heavyweight tools that are difficult or impossible to build with other DBI frameworks such as Pin and DynamoRIO.
Abstract: Dynamic binary instrumentation (DBI) frameworks make it easy to build dynamic binary analysis (DBA) tools such as checkers and profilers. Much of the focus on DBI frameworks has been on performance; little attention has been paid to their capabilities. As a result, we believe the potential of DBI has not been fully exploited. In this paper we describe Valgrind, a DBI framework designed for building heavyweight DBA tools. We focus on its unique support for shadow values, a powerful but previously little-studied and difficult-to-implement DBA technique, which requires a tool to shadow every register and memory value with another value that describes it. This support accounts for several crucial design features that distinguish Valgrind from other DBI frameworks. Because of these features, lightweight tools built with Valgrind run comparatively slowly, but Valgrind can be used to build more interesting, heavyweight tools that are difficult or impossible to build with other DBI frameworks such as Pin and DynamoRIO.
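To make the shadow-value idea concrete, here is a small illustrative sketch of a two-level shadow-memory table in C++. This is a simplification written for this summary, not Valgrind's actual data structure (which compresses common states, allocates lazily at a finer grain, and shadows registers as well):

    // Illustrative two-level shadow memory: one shadow byte per application byte.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    class ShadowMemory {
        static const size_t kChunkBits = 16;                       // 64 KiB chunks
        static const size_t kChunkSize = 1u << kChunkBits;
        static const size_t kNumChunks = 1u << (32 - kChunkBits);  // 32-bit toy address space
        uint8_t* chunks_[kNumChunks];                              // second level, allocated on demand
    public:
        ShadowMemory() { std::memset(chunks_, 0, sizeof(chunks_)); }
        // Shadow byte describing the application byte at addr.
        uint8_t& shadow(uint32_t addr) {
            uint8_t*& chunk = chunks_[addr >> kChunkBits];
            if (!chunk) chunk = new uint8_t[kChunkSize]();         // default state: undefined
            return chunk[addr & (kChunkSize - 1)];
        }
        // Example policy hooks a definedness checker might install on every store/load.
        void onStore(uint32_t addr, size_t n) {
            for (size_t i = 0; i < n; i++) shadow(addr + i) = 1;   // mark defined
        }
        bool onLoad(uint32_t addr, size_t n) {
            for (size_t i = 0; i < n; i++) if (!shadow(addr + i)) return false;
            return true;                                           // all bytes defined
        }
    };

A tool such as Memcheck layers its definedness policy over a table like this; the framework's job is to instrument every load and store so the hooks run, which is why shadow-value support shapes so much of Valgrind's design.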

2,540 citations

Proceedings ArticleDOI
12 Dec 2009
TL;DR: Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
Abstract: This paper introduces McPAT, an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap, including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with many performance simulators. Combined with a performance simulator, McPAT enables architects to consistently quantify the cost of new ideas and assess tradeoffs of different architectures using new metrics like energy-delay-area² product (EDA²P) and energy-delay-area product (EDAP). This paper explores the interconnect options of future manycore processors by varying the degree of clustering over generations of process technologies. Clustering will bring interesting tradeoffs between area and performance because the interconnects needed to group cores into clusters incur area overhead, but many applications can make good use of them due to synergies of cache sharing. Combining power, area, and timing results of McPAT with performance simulation of PARSEC benchmarks at the 22nm technology node for both common in-order and out-of-order manycore designs shows that when die cost is not taken into account, clustering 8 cores together gives the best energy-delay product, whereas when cost is taken into account, configuring clusters with 4 cores gives the best EDA²P and EDAP.
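As a worked illustration of the composite metrics named above, the sketch below computes EDP, EDAP, and EDA²P for two hypothetical configurations (the formulas follow directly from the metric names; the numbers are invented for illustration, not McPAT output):

    // EDP = energy x delay; EDAP = energy x delay x area; EDA2P = energy x delay x area^2.
    #include <cstdio>

    struct Config { const char* name; double energy_j; double delay_s; double area_mm2; };

    int main() {
        // Hypothetical cluster configurations; all values are made up.
        Config cfgs[] = { {"8-core clusters", 1.00, 0.85, 140.0},
                          {"4-core clusters", 1.05, 0.90, 100.0} };
        for (const Config& c : cfgs) {
            double edp   = c.energy_j * c.delay_s;
            double edap  = edp * c.area_mm2;
            double eda2p = edap * c.area_mm2;
            std::printf("%-15s EDP=%.3f  EDAP=%.1f  EDA2P=%.0f\n", c.name, edp, edap, eda2p);
        }
        return 0;
    }

With these invented numbers the 8-core configuration wins on plain energy-delay while the smaller 4-core configuration wins once area is weighted in, which is the shape of the tradeoff the abstract reports.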

2,487 citations


Cites methods from "Pin: building customized program an..."

  • ...[17] A. Jain, et al., “A 1.2 GHz Alpha Microprocessor with 44.8 GB/s Chip Pin Bandwidth,” in ISSCC, 2001....


  • ...[26] C.-K. Luk, et al., “Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation,” in PLDI, Jun 2005....


  • ...We modified a user-level thread library [34] in the Pin [26] binary instrumentation tool to support more pthread APIs, and use it as a functional simulator to run applications....


Dissertation
01 Jan 2011
TL;DR: A methodology to design effective benchmark suites is developed, and its effectiveness is demonstrated by developing and deploying PARSEC, a benchmark suite for evaluating multiprocessors that has been adopted by many architecture groups in both research and industry.
Abstract: Benchmarking has become one of the most important methods for quantitative performance evaluation of processor and computer system designs. Benchmarking of modern multiprocessors such as chip multiprocessors is challenging because of their application domain, scalability and parallelism requirements. In my thesis, I have developed a methodology to design effective benchmark suites and demonstrated its effectiveness by developing and deploying a benchmark suite for evaluating multiprocessors. More specifically, this thesis includes several contributions. First, the thesis shows that a new benchmark suite for multiprocessors is needed because the behavior of modern parallel programs is significantly different from those represented by SPLASH-2, the most popular parallel benchmark suite developed over ten years ago. Second, the thesis quantitatively describes the requirements and characteristics of a set of multithreaded programs and their underlying technology trends. Third, the thesis presents a systematic approach to scale and select benchmark inputs with the goal of optimizing benchmarking accuracy subject to constrained execution or simulation time. Finally, the thesis describes a parallel benchmark suite called PARSEC for evaluating modern shared-memory multiprocessors. Since its initial release, PARSEC has been adopted by many architecture groups in both research and industry.

1,043 citations


Cites background or methods from "Pin: building customized program an..."

  • ...Experimental Setup: The data was obtained with Pin [58]....


  • ...CMP$im is a plug-in for Pin [58] that simulates the cache hierarchy of a CMP....


Proceedings ArticleDOI
05 Mar 2011
TL;DR: A lightweight, high-performance persistent object system called NV-heaps is implemented that provides transactional semantics while preventing common pointer-safety errors and providing a model for persistence that is easy to use and reason about.
Abstract: Persistent, user-defined objects present an attractive abstraction for working with non-volatile program state. However, the slow speed of persistent storage (i.e., disk) has restricted their design and limited their performance. Fast, byte-addressable, non-volatile technologies, such as phase change memory, will remove this constraint and allow programmers to build high-performance, persistent data structures in non-volatile storage that is almost as fast as DRAM. Creating these data structures requires a system that is lightweight enough to expose the performance of the underlying memories but also ensures safety in the presence of application and system failures by avoiding familiar bugs such as dangling pointers, multiple free()s, and locking errors. In addition, the system must prevent new types of hard-to-find pointer safety bugs that only arise with persistent objects. These bugs are especially dangerous since any corruption they cause will be permanent. We have implemented a lightweight, high-performance persistent object system called NV-heaps that provides transactional semantics while preventing these errors and providing a model for persistence that is easy to use and reason about. We implement search trees, hash tables, sparse graphs, and arrays using NV-heaps, BerkeleyDB, and Stasis. Our results show that NV-heap performance scales with thread count and that data structures implemented using NV-heaps out-perform BerkeleyDB and Stasis implementations by 32x and 244x, respectively, by avoiding the operating system and minimizing other software overheads. We also quantify the cost of enforcing the safety guarantees that NV-heaps provide and measure the costs of NV-heap primitive operations.
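As a rough illustration of the transactional-semantics idea described above, here is a generic undo-logging sketch in C++. It is invented for this summary and is not NV-heaps' actual API; a real persistent heap would also flush and fence each write out to non-volatile memory:

    // Generic undo-log transaction: log old bytes before each store so an
    // uncommitted transaction can be rolled back after an abort or crash.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct UndoRecord { void* addr; std::vector<uint8_t> old_bytes; };

    class Transaction {
        std::vector<UndoRecord> log_;
        bool committed_ = false;
    public:
        // T should be trivially copyable for the byte-wise rollback to be valid.
        template <typename T>
        void write(T* addr, const T& value) {
            UndoRecord rec{addr, std::vector<uint8_t>(sizeof(T))};
            std::memcpy(rec.old_bytes.data(), addr, sizeof(T));  // save old value
            log_.push_back(std::move(rec));
            *addr = value;                                       // apply the store
        }
        void commit() { committed_ = true; log_.clear(); }
        ~Transaction() {                                         // abort path: roll back in reverse
            if (!committed_)
                for (auto it = log_.rbegin(); it != log_.rend(); ++it)
                    std::memcpy(it->addr, it->old_bytes.data(), it->old_bytes.size());
        }
    };

The pointer-safety guarantees the abstract emphasizes, such as ruling out pointers from persistent structures into volatile memory, sit above such a logging layer and are enforced through the system's typed references and reference counting.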

850 citations


Cites methods from "Pin: building customized program an..."


  • ...Pin: building customized program analysis tools with dynamic instrumentation....


  • ...The system uses Pin [39] to perform a detailed simulation of the system’s memory hierarchy augmented with non-volatile memory technology and the atomicity support that NV-heaps require....


References
Proceedings Article
27 Jun 2004
TL;DR: DTrace can dynamically instrument both user-level and kernel-level software in a unified and absolutely safe fashion, and provides a C-like high-level control language to describe the predicates and actions at a given point of instrumentation.
Abstract: This paper presents DTrace, a new facility for dynamic instrumentation of production systems. DTrace features the ability to dynamically instrument both user-level and kernel-level software in a unified and absolutely safe fashion. When not explicitly enabled, DTrace has zero probe effect: the system operates exactly as if DTrace were not present at all. DTrace allows for many tens of thousands of instrumentation points, with even the smallest of systems offering on the order of 30,000 such points in the kernel alone. We have developed a C-like high-level control language to describe the predicates and actions at a given point of instrumentation. The language features user-defined variables, including thread-local variables and associative arrays. To eliminate the need for most postprocessing, the facility features a scalable mechanism for aggregating data and a mechanism for speculative tracing. DTrace has been integrated into the Solaris operating system and has been used to find serious systemic performance problems on production systems, problems that could not be found using pre-existing facilities.

512 citations


"Pin: building customized program an..." refers methods in this paper


  • ...Example probe-based systems include Dyninst [7], Vulcan [29], and DTrace [9]....


Proceedings ArticleDOI
01 Jun 1995
TL;DR: EEL supports a machine- and system-independent editing model that enables tool builders to modify an executable without being aware of the details of the underlying architecture or operating system or being concerned with the consequences of deleting instructions or adding foreign code.
Abstract: EEL (Executable Editing Library) is a library for building tools to analyze and modify an executable (compiled) program. The systems and languages communities have built many tools for error detection, fault isolation, architecture translation, performance measurement, simulation, and optimization using this approach of modifying executables. Currently, however, tools of this sort are difficult and time-consuming to write and are usually closely tied to a particular machine and operating system. EEL supports a machine- and system-independent editing model that enables tool builders to modify an executable without being aware of the details of the underlying architecture or operating system or being concerned with the consequences of deleting instructions or adding foreign code.

500 citations


"Pin: building customized program an..." refers methods in this paper

  • ...EEL: Machine-independent executable editing....


  • ...Static binary instrumentation was pioneered by ATOM [30], followed by others such as EEL [18], Etch [25], and Morph [31]....



Dissertation
01 Jan 2004
TL;DR: DynamoRIO is presented, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes, with zero to thirty percent time and memory overhead on both Windows and Linux.
Abstract: This thesis addresses the challenges of building a software system for general-purpose runtime code manipulation. Modern applications, with dynamically-loaded modules and dynamically-generated code, are assembled at runtime. While it was once feasible at compile time to observe and manipulate every instruction—which is critical for program analysis, instrumentation, trace gathering, optimization, and similar tools—it can now only be done at runtime. Existing runtime tools are successful at inserting instrumentation calls, but no general framework has been developed for fine-grained and comprehensive code observation and modification without high overheads. This thesis demonstrates the feasibility of building such a system in software. We present DynamoRIO, a fully-implemented runtime code manipulation system that supports code transformations on any part of a program, while it executes. DynamoRIO uses code caching technology to provide efficient, transparent, and comprehensive manipulation of an unmodified application running on a stock operating system and commodity hardware. DynamoRIO executes large, complex, modern applications with dynamically-loaded, generated, or even modified code. Despite the formidable obstacles inherent in the IA-32 architecture, DynamoRIO provides these capabilities efficiently, with zero to thirty percent time and memory overhead on both Windows and Linux. DynamoRIO exports an interface for building custom runtime code manipulation tools of all types. It has been used by many researchers, with several hundred downloads of our public release, and is being commercialized in a product for protection against remote security exploits, one of numerous applications of runtime code manipulation.
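The core mechanism, code caching, can be sketched abstractly: translate each basic block once, store the result keyed by application PC, and dispatch through the cache afterwards. The toy model below was written for this summary; where a real system emits and directly links machine-code fragments, here a fragment is a stand-in callable that just returns the next PC:

    // Toy model of a code cache: translate once per block, then run from the cache.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <unordered_map>

    using PC = uint64_t;
    using Fragment = std::function<PC()>;   // stand-in for an emitted code fragment

    class CodeCache {
        std::unordered_map<PC, Fragment> cache_;
        size_t hits_ = 0, misses_ = 0;
    public:
        // Dispatcher: find (or build) the fragment for pc, then "execute" it.
        PC run(PC pc, const std::function<Fragment(PC)>& translate) {
            auto it = cache_.find(pc);
            if (it == cache_.end()) { ++misses_; it = cache_.emplace(pc, translate(pc)).first; }
            else ++hits_;
            return it->second();
        }
        void stats() const { std::printf("hits=%zu misses=%zu\n", hits_, misses_); }
    };

    int main() {
        CodeCache cc;
        // Hypothetical two-block loop: block 0 jumps to block 1 and back.
        auto translate = [](PC pc) -> Fragment { return [pc]() { return pc ^ 1; }; };
        PC pc = 0;
        for (int i = 0; i < 10; ++i) pc = cc.run(pc, translate);
        cc.stats();                          // prints hits=8 misses=2
        return 0;
    }

The low overheads the thesis reports rest on exactly this amortization: translation cost is paid once per block, hot code then runs repeatedly out of the cache, and fragments are linked directly to one another so the dispatcher drops out of steady-state execution.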

386 citations


"Pin: building customized program an..." refers background in this paper

  • ...Categories and Subject Descriptors: D.2.5 [Software Engineering]: Testing and Debugging: code inspections and walk-throughs, debugging aids, tracing; D.3.4 [Programming Languages]: Processors: compilers, incremental compilers. General Terms: Languages, Performance, Experimentation. Keywords: Instrumentation, program analysis tools, dynamic compilation....



Journal ArticleDOI
TL;DR: A new algorithm for fast global register allocation called linear scan, which allocates registers to variables in a single linear-time scan of the variables' live ranges, is described.
Abstract: We describe a new algorithm for fast global register allocation called linear scan. This algorithm is not based on graph coloring, but allocates registers to variables in a single linear-time scan of the variables' live ranges. The linear scan algorithm is considerably faster than algorithms based on graph coloring, is simple to implement, and results in code that is almost as efficient as that obtained using more complex and time-consuming register allocators based on graph coloring. The algorithm is of interest in applications where compile time is a concern, such as dynamic compilation systems, “just-in-time” compilers, and interactive development environments.
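Since the abstract describes the algorithm compactly, here is a minimal C++ sketch of linear scan over precomputed live intervals. It is a simplified rendition for illustration: interval construction and move insertion are omitted, while the spill heuristic shown (evict the interval with the furthest end point) is the one the paper proposes:

    // Minimal linear-scan register allocation over precomputed live intervals.
    #include <algorithm>
    #include <cstdio>
    #include <list>
    #include <vector>

    struct Interval { int start, end, vreg; int reg = -1; bool spilled = false; };

    void linearScan(std::vector<Interval>& ivals, int numRegs) {
        std::sort(ivals.begin(), ivals.end(),
                  [](const Interval& a, const Interval& b) { return a.start < b.start; });
        std::vector<int> freeRegs;
        for (int r = numRegs - 1; r >= 0; --r) freeRegs.push_back(r);
        std::list<Interval*> active;                      // sorted by increasing end point
        for (Interval& cur : ivals) {
            // Expire intervals that ended before cur starts, reclaiming registers.
            while (!active.empty() && active.front()->end < cur.start) {
                freeRegs.push_back(active.front()->reg);
                active.pop_front();
            }
            if (!freeRegs.empty()) {
                cur.reg = freeRegs.back();
                freeRegs.pop_back();
            } else {
                Interval* furthest = active.back();       // spill candidate: furthest end
                if (furthest->end > cur.end) {
                    cur.reg = furthest->reg;              // steal its register
                    furthest->spilled = true; furthest->reg = -1;
                    active.pop_back();
                } else {
                    cur.spilled = true;                   // cur itself is the worst choice
                    continue;
                }
            }
            auto pos = std::find_if(active.begin(), active.end(),
                                    [&](Interval* i) { return i->end > cur.end; });
            active.insert(pos, &cur);                     // keep active sorted by end
        }
    }

    int main() {
        std::vector<Interval> v = {{0, 4, 0}, {1, 3, 1}, {2, 6, 2}, {5, 7, 3}};
        linearScan(v, 2);                                 // two physical registers
        for (const Interval& i : v)
            if (i.spilled) std::printf("v%d -> spilled\n", i.vreg);
            else           std::printf("v%d -> r%d\n", i.vreg, i.reg);
        return 0;
    }

The single pass over sorted intervals is what makes the algorithm attractive inside a dynamic compiler, which is precisely how Pin uses it (see the excerpt below).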

370 citations


"Pin: building customized program an..." refers methods in this paper

  • ...Rather than obtaining extra registers in an ad-hoc way, Pin re-allocates registers used in both the application and the Pintool, using linear-scan register allocation [24]....


Proceedings ArticleDOI
Harish Patil, Robert Cohn, Mark J. Charney, Rajiv Kapoor, Andrew Y. Sun, Anand Karunanidhi
04 Dec 2004
TL;DR: This work uses the well-known SimPoint methodology to find representative portions of an application to simulate, and develops a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs.
Abstract: Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, making porting the applications to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions - called PinPoints - for simulation. We have developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and have a simple and efficient methodology to determine the step that introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces for predicting performance of those applications on future Itanium processors.
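The selection step operates on per-interval execution profiles. Below is a deliberately simplified sketch of the idea; SimPoint proper clusters basic-block vectors with k-means after dimensionality reduction, whereas the greedy grouping here is a stand-in written for illustration:

    // Simplified "representative interval" selection over execution profiles.
    // Each interval of the program is summarized as a basic-block frequency
    // vector; intervals close to an already-chosen representative are skipped.
    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    using Vec = std::vector<double>;

    static double dist(const Vec& a, const Vec& b) {
        double d = 0;
        for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(d);
    }

    std::vector<size_t> pickRepresentatives(const std::vector<Vec>& intervals, double threshold) {
        std::vector<size_t> reps;
        for (size_t i = 0; i < intervals.size(); ++i) {
            bool covered = false;
            for (size_t r : reps)
                if (dist(intervals[i], intervals[r]) < threshold) { covered = true; break; }
            if (!covered) reps.push_back(i);   // new program behavior: simulate this one
        }
        return reps;
    }

    int main() {
        // Four intervals over three basic blocks; two distinct behaviors.
        std::vector<Vec> ivs = {{9, 1, 0}, {8, 2, 0}, {0, 1, 9}, {1, 0, 9}};
        for (size_t r : pickRepresentatives(ivs, 3.0))
            std::printf("simulate interval %zu\n", r);   // prints intervals 0 and 2
        return 0;
    }

In the actual toolkit the chosen regions are then validated against hardware performance counters, comparing region-predicted CPI with whole-program CPI; that validation is the source of the 8% figure above.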

346 citations


"Pin: building customized program an..." refers background or methods in this paper

  • ...The purpose of the [23] toolkit is to automate the otherwise tedious process of finding regions of programs to simulate, validating that the regions are representative, and generating traces for those regions....


  • ...We also validate that the regions chosen represent whole-program behavior (e.g., the cycles-per-instruction predicted by PinPoints is typically within 10% of the actual value [23])....
