8 Conclusions and Future Work


Interpreters play an important role in the implementation of computer languages. Initially, language implementors need a language VM to be simple and flexible in order to support the evolution of their language. Later, as their language increases in popularity, performance may become more of a concern.

Today, commonly implemented interpreter designs do not anticipate the need for more performance, and just-in-time (JIT) compiler designs, though capable of very high performance, require a great deal of up-front development. These factors conspire to prevent, or at least delay, important language implementations from improving performance by deploying a JIT. In this dissertation we have responded to this challenge by describing a design for a language VM that explicitly maps out a trajectory of staged deployments, providing gradually increasing performance as development effort is invested.

1 Conclusions and Lessons Learned

Our approach differs from most interpreter designs because we intentionally start out running a simple dispatch mechanism, direct call threading (DCT). DCT is an appropriate choice not because it is particularly fast (it runs at about the same speed as a switch-threaded interpreter) but because it is the simplest way to dispatch callable virtual instruction bodies and because it is easy to augment with profiling. This makes the early versions of a language VM simple to deploy.
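
To make the mechanism concrete, the following is a minimal sketch of a DCT interpreter. The toy virtual program and all names are illustrative, not Yeti's actual code; a real interpreter would share the vPC with the bodies so that virtual branch instructions can update it.

    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*body_t)(void);      /* a callable virtual instruction body */

    static int stack[64], *sp = stack; /* toy expression stack */

    static void push_one(void)  { *sp++ = 1; }
    static void add(void)       { sp--; sp[-1] += *sp; }
    static void print_top(void) { printf("%d\n", sp[-1]); }
    static void halt(void)      { exit(0); }

    int main(void) {
        /* the loaded virtual program: 1 + 1, print, halt */
        body_t program[] = { push_one, push_one, add, print_top, halt };
        body_t *vPC = program;         /* virtual program counter */
        for (;;)
            (*vPC++)();                /* the entire dispatch mechanism */
    }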

To gain performance in later releases, the DCT interpreter can be extended by inserting profiling into the dispatch loop and identifying interpreted traces. When more performance is required, interpreted traces can be enhanced by JIT compiling a subset of virtual instructions.
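
As a sketch of how the instrumentation slots in, consider the following pair of loops. The profiling hook names are hypothetical; the point is that profiling wraps each dispatch, and switching between loops turns instrumentation on and off without touching the bodies.

    typedef void (*body_t)(void);

    extern void profile_enter(body_t *vpc);  /* hypothetical hooks that   */
    extern void profile_leave(body_t *vpc);  /* record blocks, grow traces */

    void interp_profiled(body_t *vPC) {
        for (;;) {
            profile_enter(vPC);
            (*vPC++)();                      /* dispatch one body */
            profile_leave(vPC);
        }
    }

    void interp_plain(body_t *vPC) {         /* same loop, no overhead */
        for (;;)
            (*vPC++)();
    }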

Our approach is motivated by a few observations:

  1. We realized that callable bodies can be very efficiently dispatched by the old technique of subroutine threading, now that processors commonly implement return branch predictors (see the sketch following this list). This is effective for straight-line sections of the virtual program.
  2. We realized that although the overhead of a dispatch loop is high for dispatching single virtual instruction bodies, it may be perfectly reasonable for dispatching callable region bodies generated from dozens or hundreds of virtual instructions. The basic idea behind Yeti's extensibility is that development effort should be invested in identifying and compiling larger and more complex regions of the virtual program which are then dispatched from a profiled dispatch loop.
  3. Optimizing the dispatch of virtual branch instructions, for instance by selective inlining, is typically carried out by an interpreter when a method is loaded. Instead, we identify traces at run time using profiling instrumentation called from the dispatch loop. Hot traces predict paths through the virtual program which we exploit to generate simple trace exit code in otherwise subroutine-threaded interpreted traces.
  4. When even better performance is needed, we show how a trace-based JIT can be built to eliminate dispatch and replace the expression stack with register-to-register compiled code. The novel aspect of our JIT is that it exploits the fact that Yeti's virtual instruction bodies are callable. Unsupported virtual instructions, or difficult compiler corner cases, can be side-stepped by dispatching virtual instruction bodies instead. This allows support for virtual instructions to be added one at a time. The importance of the latter point is hard to quantify, but it seemed to reduce the difficulty of debugging the back end of the compiler significantly.
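
The sketch promised in item 1: the heart of subroutine threading is a load-time loop that emits one native call per straight-line virtual instruction. The fragment below assumes an x86 target and a code buffer that is executable and within a 32-bit displacement of the bodies; operand handling and buffer management are omitted.

    #include <stdint.h>
    #include <string.h>

    /* Emit a native x86 "call rel32" (opcode 0xE8) to a body. */
    static uint8_t *emit_call(uint8_t *code, void (*body)(void)) {
        int32_t rel = (int32_t)((intptr_t)body - (intptr_t)(code + 5));
        *code++ = 0xE8;
        memcpy(code, &rel, 4);    /* displacement from the next instruction */
        return code + 4;
    }

    /* Translating the bytecodes {iload_1, iload_2, iadd} this way yields
     * native code equivalent to:
     *     call iload_1_body
     *     call iload_2_body
     *     call iadd_body
     * Each body ends in a native return, so the processor's return
     * branch predictor anticipates every dispatch. */
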
Most of the elements of our approach become plausible as soon as it has been shown that callable bodies can be efficiently dispatched. However, actual performance improvements depend on a subtle trade-off between the overhead of runtime profiling and the reduction of stalls caused by branch mispredictions. The only way to determine that our ideas were viable was to build a fairly complete prototype. We chose to build a prototype in Java because there are commonly accepted benchmark programs to measure and many high-quality implementations to compare against.

In the process we learned a number of interesting things:

  1. Calling virtual instruction bodies can be very efficient on modern CPUs. Our implementation of subroutine threading (SUB) is very simple and eliminates most of the branch mispredictions caused by switch or direct threading, namely those caused by dispatching straight-line code. SUB outperforms direct threading by about 20%. However, SUB does not address mispredictions caused by dispatching virtual branch instructions. Also, it is difficult to interpose runtime instrumentation into subroutine-threaded execution.
  2. Direct call threading (DCT) is simpler than SUB, but much slower, running about 40% slower than direct threading. This, however, is no worse than switch threading, which is widely deployed in heavily used languages like Python and JavaScript. DCT is very easy to augment with profiling, since instrumentation can simply be called from the dispatch loop before and after dispatching each body. Furthermore, by providing multiple dispatch loops it is easy to turn instrumentation on and off.
  3. Branch inlining, our initial approach to improving the virtual branch performance of SUB, is labor intensive and non-portable. It improves the performance of subroutine threading by about 5%.
  4. Interpreted traces are a powerful interpretation technique. They perform well, as fast as SableVM's inline-threading, running Java benchmarks about 25% faster than direct threading on a PowerPC 970 (see the trace exit sketch following this list). This performance includes the cost of the runtime profiling used to identify traces. A system running interpreted traces has already implemented the infrastructure needed to identify hot regions of a running program, an essential ingredient of a JIT. This makes interpreted traces a good strategic option for language virtual machines that may eventually need to be extended with a JIT.
  5. Our trace compiler was easy to build, and we attribute this primarily to two factors. First, traces contain no merge points, so it is easy to track where expression temporary values are on the expression stack and assign them to registers. Second, callable virtual instruction bodies enabled us to add compiler support for virtual instructions one at a time. By compiling about 50 integer virtual instructions in this way the performance of Yeti was increased to about double the performance of direct threading.
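
The trace exit sketch promised in item 4: after calling a virtual branch body, the generated code of an interpreted trace checks whether the body set the vPC to the destination the trace predicts. The check is written as C here for clarity; in Yeti it is a few generated compare-and-branch instructions, and the handler name is hypothetical.

    typedef void (*body_t)(void);

    extern body_t *vPC;                       /* shared with the bodies */
    extern void trace_exit(int exit_number);  /* hypothetical handler that
                                                 resumes interpretation */

    void trace_step(body_t branch_body, body_t *predicted) {
        branch_body();             /* the virtual branch body updates vPC */
        if (vPC != predicted)      /* did execution leave the hot path?   */
            trace_exit(3);         /* side-exit; does not return          */
        /* otherwise fall through to the next call along the trace */
    }
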
The primary weakness of our prototype is the specific mechanism we used to implement callable virtual instruction bodies. Our approach, as illustrated by Figure [*], hides the return branch from the compiler. This means that the optimizer does not properly understand the control flow graph of the interpreter. The workaround, suitable only for a prototype, is to "fake" the missing control flow by adding computed goto statements, never executed, immediately following each inlined return instruction. Nested functions, a relatively common extension to C, are a promising alternative that will be discussed in the next section.
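
Before turning to that alternative, the following heavily simplified fragment illustrates the workaround just described (x86 shown for concreteness; our prototype targeted PowerPC, where the hidden return is a blr; register and dispatch details are omitted):

    void bodies(void) {
        static void *any_label = &&iadd_body;

      iadd_body:
        /* ... perform the addition on the expression stack ... */
        asm volatile ("ret");  /* return branch gcc cannot see        */
        goto *any_label;       /* never executed: "fakes" the control
                                  flow that the asm hides, so the
                                  optimizer's CFG stays consistent    */
    }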

2 Future work

Substantial additional performance gains are no doubt possible by extending our trace-based JIT to handle more types of instructions (such as the floating point bytecodes) and by applying classical optimizations such as common subexpression elimination. Improving the performance of compiled code by applying classical optimizations is relatively well understood. Hence, on its own, such an effort seems to have relatively little to contribute to research. Moreover, it would require significant engineering work and likely could only be undertaken by a well-funded project.

We will discuss four avenues for further research. First, we consider a way to package virtual instruction bodies as nested functions. Second, we suggest how the approach we describe in Section [*] for optimizing virtual method invocation could be adapted to runtime typed languages. Third, we comment on how new shapes of region bodies could be derived from linked traces. Fourth, we describe our vision of how our design could be used by the implementors of a new language.

1 Virtual instruction bodies as nested functions

A better option for implementing callable virtual instruction bodies might be to define them as nested functions. Nested functions are a common extension to C, implemented by gcc and other C compilers, that allows one function to be declared within another. The idea is that each virtual instruction body is declared as a separate nested function, with all bodies nested within the main interpreter function. Important interpreter variables, like the vPC, are defined as local variables in the main interpreter function, as they are currently, but can also be used from the nested functions implementing the virtual instruction bodies.
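
A sketch of this packaging, using gcc's nested function extension (the names and the toy program are illustrative):

    #include <stdio.h>

    typedef void (*body_t)(void);

    void interp(void) {
        int stack[64], *sp = stack;    /* interpreter state: ordinary   */
        body_t *vPC;                   /* locals, visible to all bodies */

        /* each body is a nested function sharing the locals above */
        void push_one(void)  { *sp++ = 1; }
        void add(void)       { sp--; sp[-1] += *sp; }
        void print_top(void) { printf("%d\n", sp[-1]); }

        body_t program[] = { push_one, push_one, add, print_top };

        for (vPC = program; vPC < program + 4; )
            (*vPC++)();    /* DCT loop; gcc routes each of these calls
                              through a runtime-generated trampoline,
                              the overhead described in the next
                              paragraph */
    }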

The approach is elegant, since functions are a natural way to express virtual instruction bodies, and it is well supported by the tool chain, including the debugger. However, our first attempts in this direction did not perform well. In short, when a nested function is called via a function pointer, as from our DCT dispatch loop, gcc adds an extra level of indirection and calls the nested function via a runtime-generated trampoline. As a result, the DCT dispatch loop runs very slowly.

We investigated the possible performance of nested functions by hand-modifying the assembly generated by gcc to short-circuit the trampoline. In this way, we created a one-off version of OCaml that declares each virtual instruction body in its own nested function and runs a simple DCT dispatch loop like the one illustrated by Figure [*]. On the PowerPC this DCT interpreter runs the same OCaml benchmarks used in Chapter [*] about 22% more slowly than switch threading.

Further improvements to nested function performance should be investigated, possibly including modifications to gcc to create a variant of nested functions more suitable for implementing virtual instruction bodies.

2 Extension to Runtime Typed Languages

An exciting possibility is to create new speculative dynamic optimizations based on the runtime profile data collected while training a trace (see Section [*]). The basic realization is that a mechanism very similar to a trace exit can be used to guard almost any speculative optimization. As a specific example we consider the optimization of arithmetic operations in a runtime typed language.

A runtime typed language is a language that does not force the user to declare the types of variables but instead discovers types at run time. A typical implementation compiles expressions to sequences of virtual instructions that are not type specific. For instance, in Tcl or Python the virtual body for addition will work for integers, floating point numbers or even strings. Performance tends to be poor as each virtual instruction body must check the type of each input before actually calculating its result.
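
The overhead is easy to see in a sketch of such a generic addition body, assuming a simple tagged-value representation (the representation and names are illustrative, not those of any particular VM):

    typedef enum { T_INT, T_FLOAT, T_STRING } tag_t;

    typedef struct {
        tag_t tag;
        union { long i; double f; char *s; } u;
    } value_t;

    extern value_t *sp;                             /* expression stack */
    extern value_t string_concat(value_t, value_t);

    void generic_add(void) {
        value_t b = *--sp, a = *--sp;
        if (a.tag == T_INT && b.tag == T_INT)       /* type dispatch on */
            a.u.i += b.u.i;                         /* every execution  */
        else if (a.tag == T_FLOAT && b.tag == T_FLOAT)
            a.u.f += b.u.f;
        else
            a = string_concat(a, b);                /* string/coercion  */
        *sp++ = a;
    }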

We believe the same profiling infrastructure that we use to optimize callsites in Java (Section [*]) could be used to improve arithmetic bytecodes in a runtime typed language. Whereas the destination of a Java method invocation depends only upon the type of the invoked-upon object, the operation carried out by a polymorphic virtual instruction may depend on the type of each input. For instance, suppose that a specific instance of the addition instruction in Tcl, Python or JavaScript has integer type. (We would know this if its inputs were observed to be integers during trace training.) We could generate one or more trace exits, or guards, to ensure that the inputs are actually integers. Following the guards we could generate specialized integer code, or dispatch a version of the addition virtual instruction body specialized for integers.
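
Continuing the tagged-value sketch above, the specialized code that the trace compiler might emit could look like the following, where the guard plays the role of a trace exit (the handler name is hypothetical):

    extern void trace_exit(int exit_number);  /* hypothetical side exit */

    void specialized_add(void) {
        value_t b = sp[-1], a = sp[-2];
        if (a.tag != T_INT || b.tag != T_INT) /* guard: types as trained? */
            trace_exit(7);                    /* no: leave the trace;
                                                 does not return          */
        sp -= 2;
        sp->tag = T_INT;                      /* yes: unchecked integer add */
        sp->u.i = a.u.i + b.u.i;
        sp++;
    }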

3 New shapes of region body

Just as basic blocks are collected into traces, so traces could be collected into yet larger regions for optimization. An obvious possibility would be to identify loop nests amongst the linked traces, and use these as a higher level unit of compilation.

The data recorded by our trace region payload structures already includes the information necessary to build a flow graph of the program in the code cache. It remains to adapt classical flow graph algorithms to detect nested loops and create a strategy for compiling the resulting code.
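
A sketch of the payload fields involved, with hypothetical names; once traces are linked, the successor pointers form the edges of a flow graph over which classical algorithms (for example, back-edge detection to find natural loops) can run:

    #define MAX_EXITS 8

    typedef struct trace_payload {
        void *entry;                            /* address in code cache  */
        struct trace_payload *succ[MAX_EXITS];  /* trace reached by each
                                                   trace exit, once linked */
        int nsucc;
    } trace_payload;

    /* A loop nest appears as a cycle among linked traces; the traces on
     * the cycle become a single, larger unit of compilation. */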

There seems to be little point, however, in detecting loop nests without any capability of optimizing them. Thus, this extension of our work would only make sense for a system that plans to build an optimizer.

4 Vision for new language implementation

Our vision for a new language implementation would be to start by building a direct call threaded interpreter. Until the issues with nested functions have been dealt with, the virtual bodies would have to be packaged as we described in Chapter [*]. The level of performance would be roughly the same as a switch-threaded interpreter.

Then, as more performance is called for, we would add linear blocks, interpreted traces, and trace linking. It would be natural to make these extensions in separate releases of our implementation. We believe that much of the runtime profiling infrastructure we built for Yeti could be reused as is. Finally, when performance requirements demand it, a JIT compiler could be built. Like Yeti, the first implementation would compile only a subset of the virtual instructions, perhaps only the ones needed to address specific performance issues with a given application.

3 Summary

We have described a design trajectory along which a high-level language virtual machine can be deployed in a sequence of stages, starting with a simple entry-level direct call threaded interpreter, followed by interpreted traces, and finally a trace-based just-in-time compiler.

We have shown that it is beneficial to implement virtual instruction bodies as callable routines, both from the perspective of efficient interpretation and because it allows bodies to be reused by the JIT. We recognized that on modern computers subroutine threading is a very efficient way to dispatch straight-line sequences of virtual instructions. For virtual branches we introduced a new technique, interpreted traces. Our technique exploits the power of traces to predict branch destinations and hence reduce mispredictions caused by the dispatch of virtual branches. Interpreted traces are a state-of-the-art technique, running about 25% faster than direct threading. This is about the same speedup as achieved by inline-threading, SableVM's implementation of selective inlining.

We have shown how interpreted traces can be gradually enhanced with a trace-based JIT compiler. An attractive property of our approach is that compiler support can be added one virtual instruction at a time. Our trace-based JIT currently compiles about 50 integer virtual instructions, running about 30% faster than interpreted traces, or about double the performance of direct threading.

Our hope is that this work will enable more language implementations to deploy better interpreters and JIT compilers and hence deliver better computer language performance to more users.


