Proceedings ArticleDOI

DAISY: dynamic compilation for 100% architectural compatibility

Kemal Ebcioglu1, Erik R. Altman1
01 May 1997 - Vol. 25, Iss. 2, pp. 26-37
TL;DR: The architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O are discussed.
Abstract: Although VLIW architectures offer the advantages of simplicity of design and high issue rates, a major impediment to their use is that they are not compatible with the existing software base. We describe new simple hardware features for a VLIW machine we call DAISY (Dynamically Architected Instruction Set from Yorktown). DAISY is specifically intended to emulate existing architectures, so that all existing software for an old architecture (including operating system kernel code) runs without changes on the VLIW. Each time a new fragment of code is executed for the first time, the code is translated to VLIW primitives, parallelized and saved in a portion of main memory not visible to the old architecture, by a Virtual Machine Monitor (software) residing in read-only memory. Subsequent executions of the same fragment do not require a translation (unless cast out). We discuss the architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O. We have implemented the dynamic parallelization algorithms for the PowerPC architecture. The initial results show high degrees of instruction level parallelism with reasonable translation overhead and memory usage.
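
The translate-once, reuse-until-cast-out cycle the abstract describes can be pictured as a dispatch loop over a translation cache. Below is a minimal sketch in Python; translate_fragment, execute_translated, and the fragment format are hypothetical stand-ins for the paper's VLIW translator, not its actual interfaces.

# Sketch of DAISY-style translate-on-first-use dispatch.
translation_cache = {}   # old-architecture PC -> translated VLIW fragment

def translate_fragment(pc, memory):
    # Placeholder: a real translator decodes base-architecture code,
    # reorders it, and packs it into parallel VLIW primitives.
    return ("vliw-fragment-for", pc)

def execute_translated(fragment):
    # Placeholder: a real system runs the VLIW code and returns the next
    # old-architecture PC to continue at (None ends this sketch).
    return None

def dispatch(start_pc, memory):
    pc = start_pc
    while pc is not None:
        if pc not in translation_cache:       # first execution: translate
            translation_cache[pc] = translate_fragment(pc, memory)
        pc = execute_translated(translation_cache[pc])

def invalidate(pc):
    # Self-modifying code: stale translations for a written region must go.
    translation_cache.pop(pc, None)

dispatch(0x1000, memory={})

A real implementation must also bound the cache (the "unless cast out" case) and detect writes into already-translated code, which is where the paper's hardware support comes in.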


Citations
Proceedings ArticleDOI
20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Abstract: We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs. LLVM defines a common, low-level code representation in static single assignment (SSA) form, with several novel features: a simple, language-independent type-system that exposes the primitives commonly used to implement high-level language features; an instruction for typed address arithmetic; and a simple mechanism that can be used to implement the exception handling features of high-level languages (and setjmp/longjmp in C) uniformly and efficiently. The LLVM compiler framework and code representation together provide a combination of key capabilities that are important for practical, lifelong analysis and transformation of programs. To our knowledge, no existing compilation approach provides all these capabilities. We describe the design of the LLVM representation and compiler framework, and evaluate the design in three ways: (a) the size and effectiveness of the representation, including the type information it provides; (b) compiler performance for several interprocedural problems; and (c) illustrative examples of the benefits LLVM provides for several challenging compiler problems.
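
A toy illustration of the SSA property at the heart of this representation (each value assigned exactly once, with uses renamed to the latest definition) for straight-line code; this is an illustrative model, not LLVM's implementation, and it omits control flow and phi nodes.

# Rename straight-line assignments into SSA form: every assignment gets a
# fresh version number, and uses refer to the most recent version.
def to_ssa(statements):
    version = {}                      # variable -> latest version number
    out = []
    for target, used_vars, op in statements:
        uses = [f"{v}{version.get(v, 0)}" for v in used_vars]   # version 0 = input
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", uses, op))
    return out

# x = a + b; x = x + 1   becomes   x1 = a0 + b0; x2 = x1 + 1
print(to_ssa([("x", ["a", "b"], "+"), ("x", ["x"], "+1")]))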

4,841 citations


Cites background from "DAISY: dynamic compilation for 100%..."

  • ...Allowing lifelong reoptimization of the program gives architects the power to evolve processors and exposed interfaces in more flexible ways [11, 20], while allowing legacy applications to run well on new systems....

  • ...There have also been several systems that perform transparent runtime optimization of native code [6, 20, 16]....

Journal ArticleDOI
01 May 2000
TL;DR: The design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor, are described and evaluated.
Abstract: We describe the design and implementation of Dynamo, a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. The input native instruction stream to Dynamo can be dynamically generated (by a JIT for example), or it can come from the execution of a statically compiled native binary. This paper evaluates the Dynamo system in the latter, more challenging situation, in order to emphasize the limits, rather than the potential, of the system. Our experiments demonstrate that even statically optimized native binaries can be accelerated by Dynamo, and often by a significant degree. For example, the average performance of -O optimized SpecInt95 benchmark binaries created by the HP product C compiler is improved to a level comparable to their -O4 optimized version running without Dynamo. Dynamo achieves this by focusing its efforts on optimization opportunities that tend to manifest only at runtime, and hence opportunities that might be difficult for a static compiler to exploit. Dynamo's operation is transparent in the sense that it does not depend on any user annotations or binary instrumentation, and does not require multiple runs, or any special compiler, operating system or hardware support. The Dynamo prototype presented here is a realistic implementation running on an HP PA-8000 workstation under the HPUX 10.20 operating system.
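
The "opportunities that tend to manifest only at runtime" are found by watching execution and promoting hot code. A minimal sketch of threshold-based hot-target detection in that spirit; the threshold, names, and trace representation are illustrative assumptions, not Dynamo's actual design:

# Count executions of candidate trace heads (e.g. targets of backward
# branches); past a threshold, record and optimize a trace starting there.
HOT_THRESHOLD = 50        # arbitrary illustrative value

counters = {}
hot_traces = {}           # trace head PC -> optimized trace (stub)

def on_branch_target(pc):
    if pc in hot_traces:
        return hot_traces[pc]                 # execute the optimized trace
    counters[pc] = counters.get(pc, 0) + 1
    if counters[pc] >= HOT_THRESHOLD:
        hot_traces[pc] = f"optimized-trace@{pc:#x}"   # stand-in for codegen
    return None                               # otherwise keep interpreting

for _ in range(60):
    on_branch_target(0x4000)
print(hot_traces)         # {16384: 'optimized-trace@0x4000'}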

935 citations


Cites background from "DAISY: dynamic compilation for 100%..."

  • ...A lot of work has been done on dynamic translation as a technique for non-native system emulation [8][30][5][31][12][17]....

Book
10 Sep 2007
TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Abstract: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem. The book tells you everything you need to know about the logical design and operation, physical design and operation, performance characteristics and resulting design trade-offs, and the energy consumption of modern memory hierarchies. You learn how to tackle the challenging optimization problems that result from the side-effects that can appear at any point in the entire hierarchy. As a result you will be able to design and emulate the entire memory hierarchy. Understand all levels of the system hierarchy: cache, DRAM, and disk. Evaluate the system-level effects of all design choices. Model performance and energy consumption for each component in the memory hierarchy.
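
As a small worked example of the system-level modeling the book covers, here is the standard average-memory-access-time calculation across a cache/DRAM/disk hierarchy; all latencies and miss rates below are made-up illustrative numbers:

# AMAT = t_cache + m_cache * (t_dram + m_dram * t_disk)
t_cache, t_dram, t_disk = 1e-9, 60e-9, 5e-3   # seconds (illustrative)
m_cache, m_dram = 0.05, 1e-5                  # miss rates (illustrative)

amat = t_cache + m_cache * (t_dram + m_dram * t_disk)
print(f"AMAT = {amat * 1e9:.1f} ns")          # 6.5 ns with these numbers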

659 citations

Journal ArticleDOI
TL;DR: Jalapeno is a virtual machine for Java™ servers written in the Java language to be as self-sufficient as possible and to obtain high quality code for methods that are observed to be frequently executed or computationally intensive.
Abstract: Jalapeno is a virtual machine for Java™ servers written in the Java language. To be able to address the requirements of servers (performance and scalability in particular), Jalapeno was designed "from scratch" to be as self-sufficient as possible. Jalapeno's unique object model and memory layout allows a hardware null-pointer check as well as fast access to array elements, fields, and methods. Run-time services conventionally provided in native code are implemented primarily in Java. Java threads are multiplexed by virtual processors (implemented as operating system threads). A family of concurrent object allocators and parallel type-accurate garbage collectors is supported. Jalapeno's interoperable compilers enable quasi-preemptive thread switching and precise location of object references. Jalapeno's dynamic optimizing compiler is designed to obtain high quality code for methods that are observed to be frequently executed or computationally intensive.
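
The final sentence, recompiling methods observed to be hot, can be sketched as a tiered scheme in which an invocation counter promotes a method from a baseline to an optimizing compiler. This is an illustrative model with an assumed threshold, not Jalapeno's actual controller:

# Tiered compilation sketch: methods start at the baseline tier and are
# promoted to the optimizing tier once invoked often enough.
PROMOTE_AT = 1000                  # illustrative threshold

class Method:
    def __init__(self, name):
        self.name = name
        self.tier = "baseline"
        self.calls = 0

    def invoke(self):
        self.calls += 1
        if self.tier == "baseline" and self.calls >= PROMOTE_AT:
            self.tier = "optimizing"   # hand the method to the opt compiler

m = Method("computeHash")
for _ in range(1200):
    m.invoke()
print(m.name, m.tier)                  # computeHash optimizing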

632 citations

Journal ArticleDOI
TL;DR: This article uses virtual machines to run multiple commodity operating systems on a scalable multiprocessor to reduce the memory overheads associated with running multiple operating systems, and uses the distributed-system support of modern operating systems to export a partial single system image to the users.
Abstract: In this article we examine the problem of extending modern operating systems to run efficiently on large-scale shared-memory multiprocessors without a large implementation effort. Our approach brings back an idea popular in the 1970s: virtual machine monitors. We use virtual machines to run multiple commodity operating systems on a scalable multiprocessor. This solution addresses many of the challenges facing the system software for these machines. We demonstrate our approach with a prototype called Disco that runs multiple copies of Silicon Graphics' IRIX operating system on a multiprocessor. Our experience shows that the overheads of the monitor are small and that the approach provides scalability as well as the ability to deal with the nonuniform memory access time of these systems. To reduce the memory overheads associated with running multiple operating systems, virtual machines transparently share major data structures such as the program code and the file system buffer cache. We use the distributed-system support of modern operating systems to export a partial single system image to the users. The overall solution achieves most of the benefits of operating systems customized for scalable multiprocessors, yet it can be achieved with a significantly smaller implementation effort.
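
The transparent sharing of identical data across virtual machines can be sketched as a content-indexed map of machine pages with copy-on-write on the first write; a toy model under assumed names, not Disco's implementation:

# Identical pages across VMs map to one machine page; a write breaks the
# sharing by giving the writer a private copy (copy-on-write).
machine_pages = {}        # page contents -> machine page id
mappings = {}             # (vm, virtual page) -> machine page id
next_id = 0

def map_page(vm, vpage, contents):
    global next_id
    if contents not in machine_pages:         # first copy of this content
        machine_pages[contents] = next_id
        next_id += 1
    mappings[(vm, vpage)] = machine_pages[contents]

def write_page(vm, vpage):
    global next_id
    mappings[(vm, vpage)] = next_id           # private copy for the writer
    next_id += 1                              # (copying contents omitted)

map_page("vm0", 0, "irix-kernel-text")
map_page("vm1", 0, "irix-kernel-text")        # shares the same machine page
print(mappings[("vm0", 0)] == mappings[("vm1", 0)])   # True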

603 citations


Cites background from "DAISY: dynamic compilation for 100%..."

  • ...DAISY [Ebcioglu and Altman 1997] uses dynamic compilation techniques to run a single virtual machine with a different instruction set architecture than the host processor....


References
Proceedings ArticleDOI
01 May 1994
TL;DR: A tool called Shade is described which combines efficient instruction-set simulation with a flexible, extensible trace generation capability and discusses instruction set emulation in general.
Abstract: Tracing tools are used widely to help analyze, design, and tune both hardware and software systems. This paper describes a tool called Shade which combines efficient instruction-set simulation with a flexible, extensible trace generation capability. Efficiency is achieved by dynamically compiling and caching code to simulate and trace the application program. The user may control the extent of tracing in a variety of ways; arbitrarily detailed application state information may be collected during the simulation, but tracing less translates directly into greater efficiency. Current Shade implementations run on SPARC systems and simulate the SPARC (Versions 8 and 9) and MIPS I instruction sets. This paper describes the capabilities, design, implementation, and performance of Shade, and discusses instruction set emulation in general.
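
Shade's "tracing less translates directly into greater efficiency" trade-off can be sketched as a simulator whose cached translations contain tracing callbacks only when the user requested them; all names here are hypothetical, not Shade's API:

# Cached "translations" are closures; a trace hook is compiled in only when
# tracing is enabled, so untraced runs pay no per-instruction tracing cost.
cache = {}

def translate(pc, trace_hook=None):
    def run(state):
        state["executed"] = state.get("executed", 0) + 1
        if trace_hook is not None:     # present only when tracing is on
            trace_hook(pc, state)
        return None                    # next PC; None ends this sketch
    return run

def simulate(pc, state, trace_hook=None):
    while pc is not None:
        key = (pc, trace_hook is not None)
        if key not in cache:
            cache[key] = translate(pc, trace_hook)
        pc = cache[key](state)

simulate(0x100, {}, trace_hook=lambda pc, s: print(f"trace {pc:#x}"))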

745 citations

Proceedings ArticleDOI
02 Dec 1996
TL;DR: It is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
Abstract: As the issue width of superscalar processors is increased, instruction fetch bandwidth requirements will also increase. It will become necessary to fetch multiple basic blocks per cycle. Conventional instruction caches hinder this effort because long instruction sequences are not always in contiguous cache locations. We propose supplementing the conventional instruction cache with a trace cache. This structure caches traces of the dynamic instruction stream, so instructions that are otherwise noncontiguous appear contiguous. For the Instruction Benchmark Suite (IBS) and SPEC92 integer benchmarks, a 4 kilobyte trace cache improves performance on average by 28% over conventional sequential fetching. Further, it is shown that the trace cache's efficient, low latency approach enables it to outperform more complex mechanisms that work solely out of the instruction cache.
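
The structure itself can be sketched as a cache keyed by a trace's starting address plus the outcomes of the branches inside it, so one lookup supplies several basic blocks of logically contiguous instructions; an illustrative software model of the hardware idea, not the paper's design:

# Toy trace cache: a trace is selected by its start PC and the predicted
# taken/not-taken outcomes of its internal branches.
trace_cache = {}   # (start_pc, branch_outcomes) -> list of basic blocks

def fill(start_pc, outcomes, blocks):
    trace_cache[(start_pc, outcomes)] = blocks

def fetch(start_pc, predicted_outcomes):
    # A hit delivers the whole multi-block trace in a single access.
    return trace_cache.get((start_pc, predicted_outcomes))

fill(0x2000, (True, False), ["blockA", "blockB", "blockC"])
print(fetch(0x2000, (True, False)))    # hit: three blocks in one fetch
print(fetch(0x2000, (False, False)))   # miss: fall back to the i-cache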

637 citations


"DAISY: dynamic compilation for 100%..." refers methods in this paper

  • ...Keywords: INSTRUCTION-LEVEL PARALLELISM, OBJECT CODE COMPATIBLE VLIW, DYNAMIC COMPILATION, BINARY TRANSLATION, SUPERSCALAR...

Book
01 Mar 1995
TL;DR: In this paper, the authors present the results of simulations of 18 different test programs under 375 different models of available parallelism analysis, including branch prediction, register renaming and alias analysis.
Abstract: Growing interest in ambitious multiple-issue machines and heavily pipelined machines requires a careful examination of how much instruction-level parallelism exists in typical programs. Such an examination is complicated by the wide variety of hardware and software techniques for increasing the parallelism that can be exploited, including branch prediction, register renaming, and alias analysis. By performing simulations based on instruction traces, we can model techniques at the limits of feasibility and even beyond. This paper presents the results of simulations of 18 different test programs under 375 different models of available parallelism analysis. This paper replaces Technical Note TN-15, an earlier version of the same material.
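
A limit study of this kind reduces, in its most idealized form, to scheduling a dependence graph as early as possible and dividing instruction count by critical-path length. A minimal sketch under assumed ideal conditions (perfect branch prediction, infinite resources, unit latency):

# Idealized ILP: each instruction issues one cycle after its last producer;
# ILP = number of instructions / critical path length in cycles.
def ilp(trace):
    ready = {}                         # register -> cycle its value is ready
    last_cycle = 0
    for dest, sources in trace:
        cycle = 1 + max([ready.get(s, 0) for s in sources], default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle

# Two independent chains of length 2: 4 instructions in 2 cycles -> ILP 2.0
print(ilp([("a", []), ("b", ["a"]), ("c", []), ("d", ["c"])]))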

587 citations

Book
22 Apr 1986
TL;DR: The Bulldog compiler described here uses several new compilation techniques: trace scheduling to find more parallelism, memory-reference and memory-bank disambiguation to increase memory bandwidth, and new code-generation algorithms.
Abstract: "Bulldog "demonstrates that a symbiosis of new Very Long Instruction Word (VLIW) architectures and new compiling technology is practicable.VLIW architectures are reduced-instruction-set machines with a large number of parallel, pipelined functional units but only a single thread of control. These machines offer the promise of an immediate order-of-magnitude increase in speed for general purpose scientific computing. However, a traditional compiler can't find enough parallelism in scientific programs to utilize a VLIW effectively. The Bulldog compiler described here uses several new compilation techniques: trace scheduling to find more parallelism, memory-reference and memorybank disambiguation to increase memory bandwidth, and new code-generation algorithms.Although originally developed for VLIWs, many of the ideas in "Bulldog "could be applied to pipelined reduced-instruction-set architectures such as the MIPS. Ellis's experiments indicate that speed improvements of thirty to eighty percent are possible for scientific code on such machines.John R. Ellis received his doctorate from Yale University and is currently Principal Software Engineer, Digital Equipment Corporation Systems Research Center, Palo Alto. "Bulldog: A Compiler for VLIW Architectures" is winner of the 1985 ACM Doctoral Dissertation Award.

555 citations

Journal ArticleDOI
TL;DR: Two binary translators are among the migration tools available for Alpha AXP computers: VEST translates OpenVMS VAX binary images to OpenVMS AXP images; mx translates ULTRIX MIPS images to DEC OSF/1 AXP images.
Abstract: Binary translation is a technique used to change an executable program for one computer architecture and operating system into an executable program for a different computer architecture and operating system. Two binary translators are among the migration tools available for Alpha AXP computers: VEST translates OpenVMS VAX binary images to OpenVMS AXP images; mx translates ULTRIX MIPS images to DEC OSF/1 AXP images. In both cases, translated code usually runs on Alpha AXP computers as fast or faster than the original code runs on the original architecture. In contrast to other migration efforts in the industry, the VAX translator reproduces subtle CISC behavior on a RISC machine, and both open-ended translators provide good performance on dynamically modified programs. Alpha AXP binary translators are important migration tools: hundreds of translated OpenVMS VAX and ULTRIX MIPS images currently run on Alpha AXP systems.
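
At its simplest, translators like VEST and mx expand each source-architecture instruction into a short sequence of target instructions, falling back to interpretation for anything that cannot be translated statically. A toy sketch; the opcode names and expansions are invented for illustration:

# Toy static binary translator: expand each source instruction via a table;
# anything unknown falls back to an interpreter call. Names are invented.
EXPANSIONS = {
    "ADDL":  ["add32"],                      # simple one-to-one mapping
    "PUSHL": ["sub_sp_4", "store32_sp"],     # one CISC op -> two RISC ops
}

def translate(source_ops):
    target = []
    for op in source_ops:
        target.extend(EXPANSIONS.get(op, [f"call_interpreter({op})"]))
    return target

print(translate(["PUSHL", "ADDL", "XFC"]))
# ['sub_sp_4', 'store32_sp', 'add32', 'call_interpreter(XFC)']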

341 citations