Interpreters play an important role in the implementation of computer languages. Initially, language implementors need a language VM to be simple and flexible in order to support the evolution of their language. Later, as their language increases in popularity, performance may become more of a concern.
Today, commonly implemented interpreter designs do not anticipate the need for more performance, and just-in-time (JIT) compiler designs, though capable of very high performance, require a great deal of up-front development effort. These factors conspire to prevent, or at least delay, important language implementations from improving performance by deploying a JIT. In this dissertation we have responded to this challenge by describing a design for a language VM that explicitly maps out a trajectory of staged deployments, providing gradually increasing performance as development effort is invested.
Our approach differs from most interpreter designs because we intentionally start out running a simple dispatch mechanism, direct call threading (DCT). DCT is an appropriate choice not because it is particularly fast (it runs at about the same speed as a regular switch-threaded interpreter) but because it is the simplest way to dispatch callable virtual instruction bodies and because it is easy to augment with profiling. This makes the early versions of a language VM simple to deploy.
To gain performance in later releases, the DCT interpreter can be extended by inserting profiling into the dispatch loop and identifying interpreted traces. When more performance is required, interpreted traces can be enhanced by JIT compiling a subset of virtual instructions.
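As a concrete illustration, such a dispatch loop might look like the following minimal sketch; the instruction layout and the profile_dispatch hook are invented names for illustration, not Yeti's actual structures.

```c
/* A minimal sketch of a direct call threaded dispatch loop with a
 * profiling hook. All names here are illustrative, not Yeti's own. */
typedef void (*body_t)(void);

struct inst {
    body_t body;        /* callable virtual instruction body */
    /* operands, profiling payload, ... */
};

extern struct inst *vPC;                      /* virtual program counter */
extern void profile_dispatch(struct inst *);  /* hypothetical profiler   */

void dispatch_loop(void)
{
    for (;;) {                     /* a halt body would break out,   */
        profile_dispatch(vPC);     /* e.g. via longjmp; omitted here */
        vPC->body();               /* bodies advance vPC themselves  */
    }
}
```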
Our approach is motivated by a few observations:
In the process we learned a number of interesting things:
Substantial additional performance gains are no doubt possible by extending our trace-based JIT to handle more kinds of instructions (such as the floating-point bytecodes) and by applying classical optimizations such as common subexpression elimination. Improving the performance of compiled code by applying classical optimizations is relatively well understood, so on its own such an effort seems to have relatively little to contribute to research. Moreover, it would require significant engineering work and likely could only be undertaken by a well-funded project.
We will discuss four avenues for further research. First, a way to package virtual instruction bodies as nested functions. Second, how the approach we describe in Section to optimize virtual method invocation could be adapted for runtime typed languages. Third, how new shapes of region bodies could be derived from linked traces. Fourth, our vision of how our design could be used by the implementors of a new language.
A better option for implementing callable virtual instruction bodies might be to define them as nested functions. Nested functions are a common extension to C, implemented by gcc and other C compilers, that allows one function to be declared within another. The idea is that each virtual instruction body is declared as a separate nested function, with all bodies nested within the main interpreter function. Important interpreter variables, like the vPC, are defined as local variables in the main interpreter function, as they are now, but remain visible to the nested function implementing each virtual instruction body.
The approach is elegant, since functions are a natural way to express virtual instruction bodies, and it is well supported by the tool chain, including the debugger. However, our first attempts in this direction did not perform well. When a nested function is called through a function pointer, as from our DCT dispatch loop, gcc adds an extra level of indirection and calls the nested function via a runtime-generated trampoline. As a result the DCT dispatch loop runs very slowly.
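To make the scheme concrete, the following is a minimal sketch, not the OCaml interpreter discussed below; the three-instruction bytecode and all names are invented. Taking the addresses of the nested functions is precisely what forces gcc to generate trampolines.

```c
/* A sketch of virtual instruction bodies as gcc nested functions,
 * dispatched by a simple call threaded loop. */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

void interp(const int *code)
{
    int stack[64], *sp = stack;   /* operand stack           */
    const int *vPC = code;        /* virtual program counter */
    int halted = 0;

    /* Bodies nest inside interp(), so each one can read and update
     * vPC and sp directly; no arguments need to be passed. */
    void push(void)  { *sp++ = vPC[1]; vPC += 2; }
    void add(void)   { sp--; sp[-1] += sp[0]; vPC++; }
    void print(void) { printf("%d\n", *--sp); vPC++; }
    void halt(void)  { halted = 1; }

    /* Taking nested function addresses makes gcc route calls
     * through runtime-generated trampolines. */
    void (*table[])(void) = { push, add, print, halt };

    /* One indirect call per virtual instruction. */
    while (!halted)
        table[*vPC]();
}

int main(void)
{
    const int code[] = { OP_PUSH, 2, OP_PUSH, 3,
                         OP_ADD, OP_PRINT, OP_HALT };
    interp(code);   /* prints 5 */
    return 0;
}
```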
We investigated the possible performance of nested functions by hand-modifying
the assembler generated by gcc to short-circuit the trampoline. In
this way, we created a one-off version of OCaml that declares each
virtual instruction body in its own nested function and runs a simple
DCT dispatch loop like the one illustrated by Figure .
On the PowerPC this DCT interpreter runs the same OCaml benchmarks
used in Chapter
about 22% more
slowly than switch threading.
Further improvements to nested function performance should be investigated, possibly including modifications to gcc to create a variant of nested functions more suitable for implementing virtual instruction bodies.
An exciting possibility is to create new speculative dynamic optimizations
based on the runtime profile data collected while training a trace
(See Section .) The basic realization is that
a mechanism very similar to a trace exit can be used to guard almost
any speculative optimization. As a specific example we consider the
optimization of arithmetic operations in a runtime typed language.
A runtime typed language is a language that does not force the user to declare the types of variables but instead discovers types at run time. A typical implementation compiles expressions to sequences of virtual instructions that are not type specific. For instance, in Tcl or Python the virtual instruction body for addition will work for integers, floating point numbers, or even strings. Performance tends to be poor because each virtual instruction body must check the type of each input before actually computing its result.
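The following sketch illustrates the cost; the tagged value representation is invented for illustration and does not correspond to Tcl's or Python's actual implementation.

```c
/* A sketch of a generic addition body in a runtime typed language.
 * The tagged value representation below is invented for illustration. */
typedef enum { T_INT, T_FLOAT, T_STRING } tag_t;

typedef struct {
    tag_t tag;
    union { long i; double f; const char *s; } u;
} value_t;

/* The type dispatch below runs on every execution of the body,
 * even at a site that only ever adds integers. */
value_t generic_add(value_t a, value_t b)
{
    value_t r;
    if (a.tag == T_INT && b.tag == T_INT) {
        r.tag = T_INT;
        r.u.i = a.u.i + b.u.i;
    } else if (a.tag != T_STRING && b.tag != T_STRING) {
        double x = (a.tag == T_FLOAT) ? a.u.f : (double)a.u.i;
        double y = (b.tag == T_FLOAT) ? b.u.f : (double)b.u.i;
        r.tag = T_FLOAT;
        r.u.f = x + y;
    } else {
        /* string concatenation (or coercion) omitted in this sketch */
        r.tag = T_STRING;
        r.u.s = "";
    }
    return r;
}
```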
We believe the same profiling infrastructure that we use to optimize
callsites in Java (Section ) could be used to
improve arithmetic bytecodes in a runtime typed language. Whereas
the destination of a Java method invocation depends only upon the
type of the invoked-upon object, the operation carried out by a polymorphic
virtual instruction may depend on the type of each input. For instance,
suppose that a specific instance of the addition instruction in Tcl,
Python or JavaScript has integer type. (We would know this if its
inputs were observed to be integers during trace training.) We could
generate one or more trace exits, or guards, to ensure that the inputs
are actually integers. Following the guards we could generate specialized
integer code, or dispatch a version of the addition virtual instruction
body specialized for integers.
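Continuing the value_t sketch above, the guarded specialization might look as follows; trace_exit_to_interpreter is a hypothetical stand-in for the real trace-exit machinery.

```c
/* A sketch of the guarded, specialized addition described above,
 * reusing tag_t and value_t from the previous sketch.
 * trace_exit_to_interpreter stands in for a trace exit that
 * resumes the interpreter at this virtual instruction. */
extern value_t trace_exit_to_interpreter(value_t a, value_t b);

value_t specialized_add(value_t a, value_t b)
{
    /* Guards: side-exit the trace if the integer speculation fails. */
    if (a.tag != T_INT || b.tag != T_INT)
        return trace_exit_to_interpreter(a, b);

    /* Speculation held: integer-only fast path, no type dispatch. */
    value_t r = { .tag = T_INT, .u.i = a.u.i + b.u.i };
    return r;
}
```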
Just as basic blocks are collected into traces, so traces could be collected into yet larger regions for optimization. An obvious possibility would be to identify loop nests amongst the linked traces, and use these as a higher level unit of compilation.
The data recorded by our trace region payload structures already includes the information necessary to build a flow graph of the program in the code cache. It remains to adapt classical flow graph algorithms to detect nested loops and create a strategy for compiling the resulting code.
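As a rough sketch (the payload layout below is invented, not Yeti's), the linked traces already form a graph whose edges such algorithms could traverse.

```c
/* A sketch of viewing linked traces as a flow graph. Each trace
 * region is a node; each trace exit that has been linked to another
 * trace is an edge. The layout below is invented for illustration. */
#define MAX_EXITS 8

struct trace_region {
    void *code;                            /* trace in the code cache */
    int nexits;                            /* number of trace exits   */
    struct trace_region *link[MAX_EXITS];  /* linked target, or NULL  */
};

/* Enumerate the edges of the code-cache flow graph; a classical
 * algorithm (e.g. dominator-based loop detection) would run on top. */
void for_each_edge(struct trace_region *const *regions, int n,
                   void (*visit)(struct trace_region *from,
                                 struct trace_region *to))
{
    for (int i = 0; i < n; i++)
        for (int e = 0; e < regions[i]->nexits; e++)
            if (regions[i]->link[e])
                visit(regions[i], regions[i]->link[e]);
}
```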
There seems to be little point, however, in detecting loop nests without any capability of optimizing them. Thus, this extension of our work would only make sense for a system that plans to build an optimizer.
Our vision for a new language implementation would be to start by
building a direct call threaded interpreter. Until the issues with
nested functions have been dealt with, the virtual bodies would have
to be packaged as we described in Chapter .
The level of performance would be roughly the same as a switch-threaded
interpreter.
Then, as more performance is called for, we would add linear blocks, interpreted traces, and trace linking. It would be natural to make these extensions in separate releases of the implementation. We believe that much of the runtime profiling infrastructure we built for Yeti could be reused as is. Finally, when performance requirements demand it, a JIT compiler could be built. Like Yeti's, the first implementation would compile only a subset of the virtual instructions, perhaps only the ones needed to address specific performance issues with a given application.
We have described a design trajectory that shows how a high-level language virtual machine can be deployed in a sequence of stages, starting with a simple entry-level direct call threaded interpreter, followed by interpreted traces, and finally a trace-based just-in-time compiler.
We have shown that it is beneficial to implement virtual instruction bodies as callable routines, both from the perspective of efficient interpretation and because it allows bodies to be reused by the JIT. We recognized that on modern computers subroutine threading is a very efficient way to dispatch straight-line sequences of virtual instructions. For branches we introduced a new technique, interpreted traces. Our technique exploits the power of traces to predict branch destinations and hence reduce the mispredictions caused by the dispatch of virtual branches. Interpreted traces are a state-of-the-art technique, running about 25% faster than direct threading. This is about the same speedup as achieved by inline-threading, SableVM's implementation of selective inlining.
We showed how interpreted traces can be gradually enhanced with a trace-based JIT compiler. An attractive property of our approach is that compiler support can be added one virtual instruction at a time. Our trace-based JIT currently compiles about 50 integer virtual instructions, running about 30% faster than interpreted traces, or about double the performance of direct threading.
Our hope is that this work will enable more language implementations to deploy better interpreters and JIT compilers and hence deliver better computer language performance to more users.