Modern computer languages are commonly implemented in two main parts: a compiler that targets a virtual instruction set, and a so-called high-level language virtual machine (or simply language VM) that runs the resulting virtual program. This approach simplifies the compiler by eliminating the need for machine-dependent code generation. Tailoring the virtual instruction set can simplify the compiler further by providing operations that perfectly match the functionality of the language.
There are two ways a language VM can run a virtual program. The simplest approach is to interpret the virtual program. An interpreter dispatches a virtual instruction body to emulate each virtual instruction in turn. A more complicated, but faster, approach deploys a dynamic, or just-in-time (JIT), compiler to translate the virtual instructions to machine instructions and dispatch the resulting native code. Mixed-mode systems interpret some parts of a virtual program and compile others. In general, compiled code runs much more quickly than virtual instructions can be interpreted. By judiciously choosing which parts of a virtual program to JIT compile, a mixed-mode system can run much more quickly than the fastest interpreter.
The primary goal of our research is to make it easier to extend an interpreter with a JIT compiler. To this end, we describe a new architecture for a language VM that significantly increases the performance of interpretation while reducing the complexity of extending it to be a mixed-mode system. Our technique has two main features.
First, our JIT identifies and compiles hot interprocedural paths, or traces. Traces are single-entry, multiple-exit regions that are easier to compile than the methods compiled by current systems. In addition, hot traces help predict the destination of virtual branches. This means that even before traces are compiled, they provide a simple way to improve the interpreted performance of virtual branches.
Second, we implement virtual instruction bodies as lightweight, callable routines, and at the same time we closely integrate the JIT compiler and interpreter. This gives JIT developers a simple alternative to compiling each virtual instruction: either a virtual instruction is translated to native code, or a call to the corresponding body is generated. The task of JIT developers is thereby simplified, because it becomes possible to deploy a fully functional JIT compiler that compiles only a subset of virtual instructions. In addition, callable virtual instruction bodies benefit interpreter performance because they enable a simple interpretation technique, subroutine threading, that very efficiently executes straight-line, or non-branching, regions of a virtual program.
We prototype our ideas in Java because there exist many high-quality Java interpreters and JIT compilers with which to compare our results. We are able to determine that the performance of our prototype compares favourably with state-of-the-art interpreters like JamVM and SableVM. An obvious next step would be to apply our techniques to enhance the performance of languages that currently do not offer a JIT.
The discussion in the next few sections refers to many technical terms and techniques that are described in detail in Chapter , which introduces the basic concepts and related work, and Chapter , which provides a tutorial-like description of several interpreter techniques.
Today, the usual approach taken by mixed-mode systems is to identify frequently executed, or hot, methods. Hot methods are passed to the JIT compiler which compiles them to native code. Then, when the interpreter sees an invocation of a compiled method, it dispatches the native code instead.
This method-oriented approach has been followed for many years, but it requires a large up-front investment of effort. Such a system cannot improve the performance of a method until it can compile every feature of the language that appears in it. For significant applications this means the JIT must compile essentially the whole language, including complicated features already implemented by high-level virtual instruction bodies, such as those for method invocation, object creation, and exception handling.
Just because a method is frequently executed does not mean that all the instructions within it are also frequently executed. In fact, regions of a hot method may be cold; that is, they may never have executed. Compiling cold code has more implications than simply wasting compile time. Except at the very highest levels of optimization, where analyzing cold code may prove useful facts about hot regions, there is little point compiling code that never runs. A more serious issue is that cold code increases the complexity of dynamic compilation. We give three examples. First, for late-binding languages such as Java, cold code likely contains references to external symbols which are not yet bound. Thus, when the cold code does eventually run, the generated code and the runtime that supports it must deal with the complexities of late binding [vj_cgo]. Second, certain dynamic optimizations are not possible without runtime profiling information. Foremost amongst these is the optimization of virtual function calls. Since there is no profiling information for cold code, the JIT may have to generate relatively slow, conservative code. This issue is even more important for runtime-typed languages, like Python, in which the types of the operands of a virtual instruction may not be known until run time. Without runtime information, neither a static nor a dynamic Python compiler may be able to determine whether the inputs of simple arithmetic operations such as addition are integers, floats, or strings. Third, as execution proceeds, some of the formerly cold regions in compiled methods may become hot. The conservative assumptions made during the initial compilation may now be a drag on performance. The straightforward-sounding approach of recompiling the method containing the formerly cold code undermines the profitability of compilation.
Furthermore, it is complicated by problems such as what to do about threads that are still executing in the method or that will return to the method in the future.
After a virtual program is loaded by an interpreter into memory, it can be executed by dispatching each virtual instruction body (or just body) in the order specified by the virtual program. From the processor's point of view, this is not a typical workload, because the control transfer from one body to the next is data-dependent on the sequence of instructions making up the virtual program. This makes the dispatch branches hard for a processor to predict. Ertl and Gregg observed that the performance of otherwise efficient interpretation is limited by pipeline stalls and flushes due to extremely poor branch prediction [ertl:dispatch-arch].
The challenges we identified above suggest that the architecture of a gradually extensible mixed-mode virtual machine should have three important properties.
Packaging bodies as callable can also address the prediction problems observed in interpreters. Any straight-line sequence of virtual instructions can be translated to a very simple sequence of generated machine instructions. Corresponding to each virtual instruction we generate a single direct call which dispatches the corresponding virtual instruction body. Executing the resulting generated code thus emulates each virtual instruction in the linear sequence in turn. No branch mispredictions occur because the destination of each direct call is explicit and the return instruction ending each body is predicted perfectly by the return branch predictor present in most modern processors.
Our system compiles frequently executed, dynamically identified interprocedural paths, or traces. Traces contain no cold code, so our system leaves all the complexities of running cold code to the interpreter. Since traces are paths through the virtual program, they explicitly predict the destination of each virtual branch. As a consequence even a very simple implementation of traces can significantly improve performance by reducing branch mispredictions caused by dispatching virtual branches. This is the basis of our new technique, interpreted traces.
In this dissertation we describe a system that supports dynamic compilation units of varying shapes. Just as a virtual instruction body implements a virtual instruction, a region body implements a region of the virtual program. Possible region bodies include single virtual instructions, basic blocks, methods, partial methods, inlined method nests, and traces. The key idea is to package every region body as callable, regardless of the size or shape of the region of the virtual program that it implements. The interpreter can then execute the virtual program by dispatching each region body in sequence.
Region bodies corresponding to longer sequences of virtual instructions will run faster than those compiled from short ones because fewer dispatches are required. In addition, larger region bodies should offer more opportunities for optimization. However, larger region bodies are more complicated and so we expect them to require more development effort to detect and compile than short ones. This suggests that the performance of a mixed-mode VM can be gradually extended by incrementally increasing the scope of region bodies it identifies and compiles. Ultimately, the peak performance of the system should be at least as high as current method-based JIT compilers since, with basically the same engineering effort, inlined method nests could be compiled to region bodies also.
The practicality of our scheme depends on the efficiency of dispatching bodies by calling them. Thus, the first phase of our research, described in Chapters and , was to retrofit SableVM [gagnon:inline-thread-prep-seq], a Java virtual machine, and ocamlrun, an OCaml interpreter [ocaml:book], with a new hybrid dispatch technique we call context threading. We evaluated context threading on PowerPC and Pentium 4 platforms by comparing the branch predictor and runtime performance of common benchmarks to unmodified, direct-threaded versions of the virtual machines. We show that callable bodies can be dispatched more efficiently than with dispatch techniques currently thought to be very efficient. For instance, on a Pentium 4, our subroutine-threaded version of SableVM runs the SPECjvm98 benchmarks about 19% faster than direct threading.
In the second phase of this research, described in Chapters and , we gradually extended JamVM, a cleanly implemented and relatively high-performance Java interpreter [lougher:jamvmsite], to create Yeti (graduallY Extensible Trace Interpreter). We decided to start afresh because it proved difficult to cleanly add trace detection and profiling instrumentation to our implementation of context threading. We chose JamVM as the starting point for Yeti, rather than SableVM, because it is simpler.
We built Yeti in five stages with the explicit intention of providing a design trajectory from a simple system to a high performance implementation. First, we repackaged all virtual instruction bodies as callable. Our initial implementation executed only single virtual instructions which were dispatched via an indirect call from a simple dispatch loop. This is slow compared to context threading but very easy to instrument with profiling code. Second, we identified linear blocks, or sequences of virtual instructions ending in branches. Third, we extended our system to identify and dispatch interpreted traces, or sequences of linear blocks. Traces are significantly more complex region bodies than linear blocks because they must accommodate virtual branch instructions. Fourth, we extended our trace runtime system to link traces together. In the fifth and final stage, we implemented a naive, non-optimizing compiler to compile the traces. An interesting feature of the JIT is that it performs simple compilation and register allocation for some virtual instructions but falls back on calling virtual instruction bodies for others. Our compiler currently generates PowerPC code for about 50 integer and object virtual instructions.
We chose traces as our unit of compilation because traces have several attractive properties: (i) they can extend across the invocation and return of methods, and thus have an interprocedural view of the program; (ii) they contain only hot code; (iii) they are relatively simple to compile, as they are single-entry multiple-exit regions of code; and (iv) it is straightforward to generate new traces and link them onto existing ones as new hot paths reveal themselves.
Instrumentation built into our prototype shows that, on average, traces accurately predict the paths taken by the Java SPECjvm98 benchmark programs. This result corroborates those reported by Bala et al. [Dynamo00] and Duesterwald and Bala [dynamoAsplosLessIsMore2000] for C and Fortran programs. Performance measurements show that the overhead of trace identification is reasonable. Even with our naive compiler, Yeti runs about twice as fast as unmodified JamVM.
The performance of a high level language virtual machine can be more easily enhanced from a simple interpreter to a high performance mixed-mode system if its design includes two main ideas: (i) virtual instruction bodies that are callable, and (ii) dynamic identification, translation into machine code, and execution of regions that contain no cold code. Traces are a good choice because they predict the destination of virtual branch instructions and hence support efficient interpretation. Traces are also simple to compile as they contain no merge points.
We show that if virtual instruction bodies are implemented as callable routines a family of dispatch techniques becomes possible, from very simple, portable and slow, to somewhat machine dependent but much faster. Since the implementation of the virtual instruction bodies makes up a large portion of an interpreter, an attractive aspect of this approach is that there is no need to modify the bodies as more complex, and higher performing, mechanisms are implemented to dispatch them.
The simplest, and most portable, way to build an interpreter with callable bodies is to write a dispatch loop in C that dispatches each instruction via a function pointer. This technique, called direct call threading, or DCT, performs about the same as a switch-threaded interpreter. DCT is a good starting point for our family of techniques because it is simple to code and as portable as gcc. Our strategy is to extend DCT by inserting profiling code into the dispatch loop. The instrumentation dynamically identifies regions of the virtual program and translates them into callable region bodies. These region bodies can then be called from the same dispatch loop, increasing performance.
We introduce a new technique, interpreted traces, to address branch mispredictions caused by dispatch. As virtual instructions are dispatched, our profiling instrumentation uses well known heuristics to identify hot, interprocedural paths, or traces. We say the traces are interpreted because virtual instruction bodies do all the real work. Straight-line portions of each trace are implemented using subroutine threading, whereby a direct call machine instruction is generated to call the virtual instruction body implementing each virtual instruction. We follow the dispatch of each virtual branch instruction with trace exit code that exploits the fact that traces predict the destination of virtual branches. Interpreted traces require the generation of only three machine instructions: direct call, compare immediate, and conditional jump. Thus, the machine dependency of the technique is modest.
We use micro-architectural performance counter measurements to show that interpreted traces result in good branch prediction. We show that interpreted traces improve the performance of our prototype relative to direct threading about the same amount as selective inlining gains over direct threading in SableVM. This means that interpreted traces are competitive with the highest performing techniques to optimize the dispatch performance of an interpreter. We achieve this level of performance despite the fact that our system performs runtime profiling as traces are detected.
Finally, we show that interpreted traces are a good starting point for a trace-based just-in-time (JIT) compiler. We extend our code generator for interpreted traces such that traces may contain a mixture of compiled code for some virtual instructions and subroutine-threaded dispatch for others. By compiling about 50 integer and object virtual instructions to register-allocated compiled code, we improve the performance of our prototype by about 30% over interpreted traces, to run about twice as fast as the direct-threaded system with which we started.
Taken together, direct call threading, interpreted traces, and our trace-based JIT provide a design trajectory for a language VM with a range of performance from switch threading, a very widely deployed entry level technique, to about double the performance of a direct threaded interpreter. The fact that interpreted traces are gradually extensible in this way makes them a good strategic design option for future language virtual machines.
We describe an architecture for a virtual machine interpreter that facilitates the gradual extension to a trace-based mixed-mode JIT compiler. We demonstrate the feasibility of this approach in a prototype, Yeti, and show that performance can be gradually improved as larger program regions are identified and compiled.
In Chapters and we present background and related work on interpreters and JIT compilers. In Chapter we describe the design and implementation of context threading. Chapter describes how we evaluated context threading. The design and implementation of Yeti is described in Chapter . We evaluate the benefits of this approach in Chapter . Finally, we discuss possible avenues for future work and conclude in Chapter .