Showing papers by "Jeffrey Dean published in 1997"


Journal ArticleDOI
01 Oct 1997
TL;DR: The Digital Continuous Profiling Infrastructure is a sampling-based profiling system designed to run continuously on production systems; it supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel.
Abstract: This article describes the Digital Continuous Profiling Infrastructure, a sampling-based profiling system designed to run continuously on production systems. The system supports multiprocessors, works on unmodified executables, and collects profiles for entire systems, including user programs, shared libraries, and the operating system kernel. Samples are collected at a high rate (over 5200 samples/sec. per 333MHz processor), yet with low overhead (1–3% slowdown for most workloads). Analysis tools supplied with the profiling system use the sample data to produce a precise and accurate accounting, down to the level of pipeline stalls incurred by individual instructions, of where time is being spent. When instructions incur stalls, the tools identify possible reasons, such as cache misses, branch mispredictions, and functional unit contention. The fine-grained instruction-level analysis guides users and automated optimizers to the causes of performance problems and provides important insights for fixing them.

545 citations
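
The sampling figures quoted in the abstract above can be turned into a rough per-sample cycle budget. The sketch below is only back-of-the-envelope arithmetic: the function names are hypothetical, and it assumes (the abstract does not say this) that the whole slowdown is spent handling samples.

```python
def cycles_between_samples(clock_hz, samples_per_sec):
    """Average number of processor cycles between two samples."""
    return clock_hz / samples_per_sec


def cycle_budget_per_sample(clock_hz, samples_per_sec, slowdown):
    """Cycles available per sample if the entire slowdown is sampling cost.

    Assumption for illustration only: every cycle of overhead is attributed
    to collecting and storing samples.
    """
    return slowdown * cycles_between_samples(clock_hz, samples_per_sec)


if __name__ == "__main__":
    clock_hz = 333e6          # 333MHz processor, from the abstract
    rate = 5200               # samples/sec, from the abstract ("over 5200")
    print(cycles_between_samples(clock_hz, rate))
    for slowdown in (0.01, 0.03):   # 1% and 3% slowdown, from the abstract
        print(slowdown, cycle_budget_per_sample(clock_hz, rate, slowdown))
```

At 5200 samples/sec on a 333MHz processor, a sample is taken roughly every 64,000 cycles, so a 1–3% slowdown leaves on the order of several hundred to about two thousand cycles per sample under the stated assumption.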


Proceedings ArticleDOI
01 Dec 1997
TL;DR: An inexpensive hardware implementation of ProfileMe is described, a variety of software techniques to extract useful profile information from the hardware are outlined, and several ways in which this information can provide valuable feedback for programmers and optimizers are explained.
Abstract: Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor's performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.

338 citations


Proceedings ArticleDOI
09 Oct 1997
TL;DR: In this paper, a parameterized algorithmic framework for call graph construction in the presence of message sends and/or first class functions is presented, which is used to describe and implement a number of well-known and new algorithms.
Abstract: Interprocedural analyses enable optimizing compilers to more precisely model the effects of non-inlined procedure calls, potentially resulting in substantial increases in application performance. Applying interprocedural analysis to programs written in object-oriented or functional languages is complicated by the difficulty of constructing an accurate program call graph. This paper presents a parameterized algorithmic framework for call graph construction in the presence of message sends and/or first class functions. We use this framework to describe and to implement a number of well-known and new algorithms. We then empirically assess these algorithms by applying them to a suite of medium-sized programs written in Cecil and Java, reporting on the relative cost of the analyses, the relative precision of the constructed call graphs, and the impact of this precision on the effectiveness of a number of interprocedural optimizations.

338 citations


Patent
26 Nov 1997
TL;DR: In this article, a method is presented for scheduling execution of a plurality of threads in a multithreaded processor, according to resource utilizations measured while the threads are concurrently executing.
Abstract: A method is provided for scheduling execution of a plurality of threads executed in a multithreaded processor. Resource utilizations of each of the plurality of threads are measured while the plurality of threads are concurrently executing in the multithreaded processor. Each of the plurality of threads is scheduled according to the measured resource utilizations using a thread scheduler.

169 citations


Patent
26 Nov 1997
TL;DR: In this article, a method for estimating execution rates of program execution paths is presented, based on path-identifying state information sampled from selected instructions while executing the program in a processor.
Abstract: A method is provided for estimating execution rates of program execution paths. The method samples path-identifying state information of selected instructions while executing the program in a processor. A control flow graph of the program is supplied; the control flow graph includes a plurality of path segments. The control flow graph is analyzed using the path-identifying state information to identify a set of path segments that are consistent with the sampled state information. The set of path segments can be counted to determine their relative execution frequencies.

148 citations
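
The consistency check described in the abstract above can be illustrated with a small sketch. Everything concrete here is an assumption rather than the patented method: the control flow graph is a plain dictionary, the path-identifying state is a list of taken/not-taken branch outcomes, and `consistent_paths` is a hypothetical helper that walks backward from the sampled block.

```python
def consistent_paths(cfg, sampled_block, history):
    """Return the path segments ending at sampled_block that match history.

    cfg maps a block name to a list of (successor, taken) edges, where taken
    is True/False for a conditional branch edge and None for an unconditional
    edge. history lists branch outcomes, oldest first. Assumes the graph has
    no cycle made only of unconditional edges (illustrative simplification).
    """
    # Invert the graph once so we can walk backward from the sample.
    preds = {}
    for src, edges in cfg.items():
        for dst, taken in edges:
            preds.setdefault(dst, []).append((src, taken))

    results = []

    def walk(block, remaining, path):
        if not remaining:              # every sampled outcome is explained
            results.append(path)
            return
        for src, taken in preds.get(block, []):
            if taken is None:          # unconditional edge consumes no outcome
                walk(src, remaining, [src] + path)
            elif taken == remaining[-1]:
                walk(src, remaining[:-1], [src] + path)

    walk(sampled_block, list(history), [sampled_block])
    return results


if __name__ == "__main__":
    # A diamond: A branches to B (taken) or C (not taken); both reach D.
    cfg = {
        "A": [("B", True), ("C", False)],
        "B": [("D", None)],
        "C": [("D", None)],
        "D": [],
    }
    print(consistent_paths(cfg, "D", [True]))
```

Counting how often each returned segment appears across many samples would then give relative execution frequencies, as the abstract describes.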


Patent
26 Nov 1997
TL;DR: In this paper, a method for scheduling execution contexts in a computer system based on memory interactions is proposed, where the system includes a processor and a hierarchical memory arranged in a plurality of levels.
Abstract: A method schedules execution contexts in a computer system based on memory interactions. The computer system includes a processor and a hierarchical memory arranged in a plurality of levels. Memory transactions are randomly sampled for a plurality of contexts. The contexts can be threads, processes, or hardware contexts. Resource interactions of the plurality of contexts are estimated, and particular contexts are chosen to be scheduled based on the estimated resource interactions.

143 citations


Patent
26 Nov 1997
TL;DR: In this article, a method for optimizing a program by inserting memory prefetch operations in the program executing in a computer system is presented, where a program optimizer uses the measured latencies to estimate the number of cycles that elapse before data of a memory operation are available.
Abstract: A method is provided for optimizing a program by inserting memory prefetch operations in the program executing in a computer system. The computer system includes a processor and a memory. Latencies of instructions of the program are measured by hardware while the instructions are processed by a pipeline of the processor. Memory prefetch instructions are automatically inserted in the program based on the measured latencies to optimize execution of the program. The latencies measure the time from when a load instruction issues a request for data to the memory until the data are available in the processor. A program optimizer uses the measured latencies to estimate the number of cycles that elapse before data of a memory operation are available.

89 citations


Patent
26 Nov 1997
TL;DR: In this article, a method is presented for scheduling instructions executed in a computer system including a processor and a memory subsystem, in which pipeline latencies and resource utilizations are measured by sampling hardware while the instructions are executing.
Abstract: In a method for scheduling instructions executed in a computer system including a processor and a memory subsystem, pipeline latencies and resource utilization are measured by sampling hardware while the instructions are executing. The instructions are then scheduled according to the measured latencies and resource utilizations using an instruction scheduler.

85 citations


Patent
26 Nov 1997
TL;DR: In this article, an apparatus is provided for sampling instructions in a processor pipeline of a computer system, where instructions are fetched into a first stage of the pipeline and a subset of the fetched instructions are identified as selected instructions.
Abstract: An apparatus is provided for sampling instructions in a processor pipeline of a computer system. The pipeline has a plurality of processing stages. Instructions are fetched into a first stage of the pipeline. A subset of the fetched instructions are identified as selected instructions. Event, latency, and state information of the system is sampled while any of the selected instructions are in any stage of the pipeline. Software is informed whenever any of the selected instructions leaves the pipeline to read the event and latency information.

65 citations


Patent
26 Nov 1997
TL;DR: In this paper, an apparatus is provided for sampling multiple concurrently executing instructions in a processor pipeline of a system, where state information of the system is sampled while any of the selected instructions are in any stage of the pipeline.
Abstract: An apparatus is provided for sampling multiple concurrently executing instructions in a processor pipeline of a system. The pipeline has a plurality of processing stages. The apparatus identifies multiple selected instructions when the instructions are fetched into a first stage of the pipeline. A subset of the multiple selected instructions execute concurrently in the pipeline. State information of the system is sampled while any of the multiple selected instructions are in any stage of the pipeline. Software is informed whenever all of the selected instructions leave the pipeline so that the software can read any of the state information.

63 citations


Patent
26 Nov 1997
TL;DR: In this article, a method for guiding virtual-to-physical mapping policies in a computer system including a processor and a memory is provided, where state information is randomly sampled from selected memory references in a stream of memory references issued by the processor to the memory.
Abstract: A method is provided for guiding virtual-to-physical mapping policies in a computer system including a processor and a memory. State information is randomly sampled from selected memory references in a stream of memory references issued by the processor to the memory. Cache hit/miss status, translation-look-aside buffer hit/miss status, and effective virtual and physical memory addresses of the sampled memory references are recorded in a profile record. The recorded information is aggregated by virtual memory address, and a new virtual-to-physical mapping is chosen to reduce cache and translation-look-aside buffer miss rates.
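
The aggregation step described in the abstract above might look like the following minimal sketch. The record layout (`vaddr`, `cache_hit`, `tlb_hit`), the 8 KB page size, and the choice to group by virtual page rather than by exact address are all assumptions made for illustration; the abstract only says that the information is aggregated by virtual memory address.

```python
from collections import defaultdict

PAGE_SIZE = 8192  # assumed page size in bytes, for illustration only


def miss_rates_by_page(samples):
    """Aggregate sampled memory references by virtual page.

    Each sample is a dict with hypothetical keys 'vaddr', 'cache_hit', and
    'tlb_hit'. Returns {page: (cache_miss_rate, tlb_miss_rate)}.
    """
    # Per page: [number of samples, cache misses, TLB misses]
    totals = defaultdict(lambda: [0, 0, 0])
    for sample in samples:
        entry = totals[sample["vaddr"] // PAGE_SIZE]
        entry[0] += 1
        entry[1] += 0 if sample["cache_hit"] else 1
        entry[2] += 0 if sample["tlb_hit"] else 1
    return {
        page: (cache_misses / count, tlb_misses / count)
        for page, (count, cache_misses, tlb_misses) in totals.items()
    }


if __name__ == "__main__":
    samples = [
        {"vaddr": 0, "cache_hit": True, "tlb_hit": True},
        {"vaddr": 100, "cache_hit": False, "tlb_hit": True},
        {"vaddr": 8192, "cache_hit": False, "tlb_hit": False},
    ]
    print(miss_rates_by_page(samples))
```

A mapping policy could then rank pages by these rates when deciding which virtual pages to remap; how the new mapping is actually chosen is not specified in the abstract.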

Patent
26 Nov 1997
TL;DR: In this paper, an apparatus is described for determining the average number of instructions entering a stage of a processor pipeline per clock cycle, by queuing per-cycle counts over N cycles and averaging over the last P cycles, where P is less than or equal to N.
Abstract: An apparatus is provided for determining an average number of instructions entering a stage of a processor pipeline of a computer system during a clock cycle of a processor clock. The number of instructions entering a particular stage of the pipeline is stored in a queue during each of a predetermined number (N) of clock cycles. The total number of instructions processed over the last P clock cycles is computed, where P is less than or equal to N. The total number of instructions processed is divided by P to yield the instantaneous average number of instructions processed for each processor cycle. This average number of instructions processed is communicated to software.
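
The windowed average described in the abstract above can be modeled in software as a bounded queue. This is a sketch of the arithmetic only, not of the hardware: the class and method names are hypothetical, and dividing by the number of cycles seen so far when fewer than P have been recorded is an assumption the abstract does not address.

```python
from collections import deque


class IssueRateWindow:
    """Holds the per-cycle instruction counts for the last N clock cycles."""

    def __init__(self, n):
        # A bounded deque drops the oldest count once N entries are stored.
        self.counts = deque(maxlen=n)

    def record_cycle(self, instructions_entered):
        """Store how many instructions entered the stage this cycle."""
        self.counts.append(instructions_entered)

    def average(self, p):
        """Average instructions per cycle over the last P cycles (P <= N)."""
        if not 1 <= p <= self.counts.maxlen:
            raise ValueError("P must be between 1 and N")
        recent = list(self.counts)[-p:]
        if not recent:
            return 0.0
        # Assumption: before P cycles have elapsed, average over what exists.
        return sum(recent) / len(recent)


if __name__ == "__main__":
    window = IssueRateWindow(n=4)
    for count in [4, 0, 2, 1, 3]:
        window.record_cycle(count)
    print(window.average(2), window.average(4))
```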

Patent
26 Nov 1997
TL;DR: In this article, an apparatus is provided for collecting state information associated with an execution path of recently processed instructions in a processor pipeline of a computer system, and a shift register stores a predetermined number of entries storing selected state information, which is simultaneously sampled along with additional state information about the instruction being executed at the time of sampling.
Abstract: An apparatus is provided for collecting state information associated with an execution path of recently processed instructions in a processor pipeline of a computer system. The apparatus identifies a class of instructions to be sampled. Path-identifying state information of a currently processed instruction is sampled when the currently processed instruction belongs to the identified class of instructions. A shift register stores a predetermined number of entries storing selected state information. The shift register is simultaneously sampled along with additional state information about the instruction being executed at the time of sampling.
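
A software model of such a shift register is sketched below. The abstract does not say which class of instructions is sampled or what state is stored; the sketch assumes conditional branches and a single taken/not-taken bit per entry, and the class name, method names, and the program counter value are hypothetical.

```python
class PathShiftRegister:
    """Fixed-length history of one-bit state from recently processed branches."""

    def __init__(self, entries):
        self.entries = entries   # predetermined number of entries
        self.bits = 0            # newest outcome sits in the lowest bit

    def shift_in(self, taken):
        """Record one branch outcome, discarding the oldest entry if full."""
        mask = (1 << self.entries) - 1
        self.bits = ((self.bits << 1) | int(taken)) & mask

    def sample(self, pc):
        """Read the register together with state of the current instruction."""
        history = format(self.bits, "0{}b".format(self.entries))
        return {"pc": pc, "history": history}


if __name__ == "__main__":
    register = PathShiftRegister(entries=4)
    for outcome in [True, False, True, True, False]:
        register.shift_in(outcome)
    print(register.sample(pc=0x400))
```

Reading the history string left to right gives the oldest to newest recorded outcome, which is the kind of path-identifying state a profiler could match against a control flow graph.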

01 Jan 1997
TL;DR: An inexpensive hardware implementation of ProfileMe is described, a variety of software techniques to extract useful profile information from the hardware are outlined, and several ways in which this information can provide valuable feedback for programmers and optimizers are explained.
Abstract: Profile data is valuable for identifying performance bottlenecks and guiding optimizations. Periodic sampling of a processor’s performance monitoring hardware is an effective, unobtrusive way to obtain detailed profiles. Unfortunately, existing hardware simply counts events, such as cache misses and branch mispredictions, and cannot accurately attribute these events to instructions, especially on out-of-order machines. We propose an alternative approach, called ProfileMe, that samples instructions. As a sampled instruction moves through the processor pipeline, a detailed record of all interesting events and pipeline stage latencies is collected. ProfileMe also supports paired sampling, which captures information about the interactions between concurrent instructions, revealing information about useful concurrency and the utilization of various pipeline stages while an instruction is in flight. We describe an inexpensive hardware implementation of ProfileMe, outline a variety of software techniques to extract useful profile information from the hardware, and explain several ways in which this information can provide valuable feedback for programmers and optimizers.