This chapter will describe how to efficiently implement an interpreter that calls its virtual instruction bodies. This investigation was motivated by the suggestion we made in cha:introduction, namely that such an interpreter will be easier to extend with a JIT than an interpreter that is direct-threaded or uses switch dispatch. Before tackling the design of our mixed-mode system we need to ensure that the interpreter is efficient.
An obvious, but slow, way to use callable virtual instruction bodies is to build a direct call threaded (DCT) interpreter (see sec:Direct-Call-Threading for a detailed description of the technique). In a DCT interpreter all bodies are dispatched by the same indirect call instruction. The destination of the indirect call is data driven (i.e., determined by the sequence of virtual instructions that make up the virtual program) and thus very hard for the hardware to predict. As a result, a DCT interpreter suffers a branch misprediction for almost every dispatch.
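To make the shape of a DCT interpreter concrete, the following is a minimal sketch in C. It is not taken from any particular virtual machine: the `slot` type, the three bodies (`op_iconst`, `op_iadd`, `op_halt`) and the toy program are all hypothetical, chosen only to show that every body is reached through one shared indirect call and that each body advances the vPC itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature VM, for illustration only. A slot of the loaded
 * program holds either the address of a callable body or an immediate. */
typedef union slot {
    void (*body)(void);
    int operand;
} slot;

enum { STACK_MAX = 16 };

static const slot *vPC;      /* virtual program counter */
static int stack[STACK_MAX]; /* expression stack */
static int sp;               /* number of values currently on the stack */
static int running;

static void push(int v) { assert(sp < STACK_MAX); stack[sp++] = v; }
static int pop(void) { assert(sp > 0); return stack[--sp]; }

/* Each body maintains the vPC: it skips its opcode slot and any operands. */
static void op_iconst(void) { push(vPC[1].operand); vPC += 2; }
static void op_iadd(void) { int b = pop(); int a = pop(); push(a + b); vPC += 1; }
static void op_halt(void) { running = 0; }

/* Dispatch loop: one indirect call shared by every virtual instruction. */
int run_dct(const slot *program) {
    vPC = program;
    sp = 0;
    running = 1;
    while (running) {
        vPC->body(); /* destination depends on the virtual program */
    }
    return pop();
}

/* Toy program computing (2 + 3) + 4. */
int dct_demo(void) {
    static const slot program[] = {
        { .body = op_iconst }, { .operand = 2 },
        { .body = op_iconst }, { .operand = 3 },
        { .body = op_iadd },
        { .body = op_iconst }, { .operand = 4 },
        { .body = op_iadd },
        { .body = op_halt },
    };
    return run_dct(program);
}
```

Calling `dct_demo()` runs the toy program through the loop. The single call site `vPC->body()` is the branch whose target changes on almost every dispatch.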
The main realization driving our approach is that, to call each body without misprediction, the dispatch branches must be direct call instructions. Since these can only be generated when virtual instructions are loaded, we generate them ourselves. At load time, each straight-line section of virtual instructions is translated to a sequence of direct call native instructions, each dispatching the corresponding virtual instruction body. The loaded program is run by jumping to the beginning of the generated sequence of native code, which then emulates the virtual program by calling each virtual instruction body in turn. This approach is very similar to a Forth compile-time technique called subroutine threading, described in sec:Subroutine-Threading.
Subroutine threading dispatches straight-line sequences of virtual instructions very efficiently because no branch mispredictions occur. The generated direct calls pose no prediction challenge because each has only one explicit destination. The destination of the return ending each body is perfectly predicted by the return branch predictor stack implemented by modern processors. In the next chapter we present data showing that subroutine threading runs the SPECjvm98 suite about 20% faster than direct threading.
Subroutine threading handles straight-line virtual code efficiently, but does nothing to improve the dispatch of virtual branch instructions. We introduce context threading, which, by generating more sophisticated code for virtual branch instructions, eliminates the branch mispredictions caused by the dispatch of virtual branch instructions as well. Context threading improves the performance of the SPECjvm98 suite by about another 5% over subroutine threading.
Generating and dispatching native code obviously makes our implementation of subroutine threading less portable than many dispatch techniques. However, since subroutine threading requires the generation of only one type of machine instruction, a direct call, its hardware dependency is isolated to a few lines of code. Context threading requires much more machine dependent code generation.
In cha:Design-and-Implementation-YETI we will describe another way of handling virtual branches that requires less complex, less machine dependent code generation, but requires additional runtime infrastructure to identify hot runtime interprocedural paths, or traces.
Although direct-threaded interpreters are known to have poor branch prediction properties, they are also known to have a small instruction cache footprint [romer]. Since both branch mispredictions and instruction cache misses are major pipeline hazards, we would like to retain the good cache behavior of direct-threaded interpreters while improving the branch behavior. Subroutine threading minimally affects code size. This is in contrast to techniques like selective inlining, described in Section , which improve branch prediction by replicating entire bodies, in effect trading instruction cache size for better branch prediction. In cha:Evaluation-of-Yeti we will report data showing that subroutine threading incurs very few additional stall cycles from instruction cache misses as compared to direct threading.
In Section we discuss the challenge of virtual branch instructions in general terms. In Section we show how to replace straight-line dispatch with subroutine threading. In Section we show how to inline conditional and indirect jumps, and in Section we discuss handling virtual calls and returns with native calls and returns.
Before describing our design, we start with two observations. First, a virtual program will typically contain several types of control flow: conditional and unconditional branches, indirect branches, and calls and returns. We must also consider the dispatch of straight-line virtual instructions. For direct-threaded interpreters, straight-line execution is just as expensive as handling virtual branches, since all virtual instructions are dispatched with an indirect branch. Second, the dynamic execution path of the virtual program will contain patterns (loops, for example) that are similar in nature to the patterns found when executing native code. These control flow patterns originate in the algorithm that the virtual program implements.
As described in Section , modern microprocessors have considerable resources devoted to identifying these patterns in native code, and exploiting them to predict branches. Direct threading uses only indirect branches for dispatch and, due to the context problem, the patterns that exist in the virtual program are largely hidden from the microprocessor.
The spirit of our approach is to expose these virtual control flow patterns to the hardware, such that the physical execution path matches the virtual execution path. To achieve this goal, we generate dispatch code at load time that enables the different types of hardware prediction resources to predict the different types of virtual control flow transfers. We strive to maintain the property that the virtual program counter is precisely correlated with the physical program counter and in fact, when all our techniques are combined, there is a one-to-one mapping between them at most control flow points.
Figure: Subroutine Threaded Interpreter showing how the CTT contains one generated direct call instruction for each virtual instruction and how the first entry in the DTT corresponding to each virtual instruction points to generated code to dispatch it. Callable bodies are shown here as nested functions for illustration only. All maintenance of the vPC must be done in the bodies. Hence even virtual instructions that take no arguments, like iadd, must bump vPC past the virtual opcode. Virtual instructions, like iload, that take an argument must bump vPC past the argument as well.
Figure: Direct threaded bodies retrofitted as callable routines by inserting inline assembler return instructions. This example is for Pentium 4 and hence ends each body with a ret instruction. The asm statement is an extension to the C language, inline assembler, provided by gcc and many other compilers.
The dispatch of straight-line virtual instructions is the largest single source of branches when executing an interpreter. Any technique that hopes to improve branch prediction accuracy must address straight-line dispatch.
Rather than eliminate dispatch, we describe an alternative organization for the interpreter in which native call and return instructions are used. This approach is conceptually elegant because the subroutine is a natural unit of abstraction to express the implementation of virtual instruction bodies.
Figure illustrates our implementation of subroutine threading, using the same example program as Figure . In this case, we show the state of the virtual machine after the first virtual instruction has been executed. We add a new structure to the interpreter architecture, called the Context Threading Table (CTT), which contains a sequence of native call instructions. Each native call dispatches the body for its virtual instruction. Although Figure shows each body as a nested function, in fact we implement this by ending each non-branching opcode body with a native return instruction as shown in Figure .
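The load-time translation into the CTT can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code of our implementation: it assumes a 32-bit x86 target, writes the bytes into an ordinary buffer rather than executable memory, and the names `emit_call`, `build_ctt`, `body_addr` and `call_site` are hypothetical. The only machine-specific fact it relies on is the encoding of the x86 near direct call: opcode 0xE8 followed by a 32-bit little-endian displacement relative to the address of the next instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CALL_SIZE = 5 }; /* 0xE8 opcode byte + 4 displacement bytes */

/* Encode `call target` as it would appear at native address `site`. */
void emit_call(uint8_t *out, uint32_t site, uint32_t target) {
    /* Displacement is relative to the instruction after the call;
     * unsigned arithmetic wraps, which handles backward targets. */
    uint32_t rel = target - (site + CALL_SIZE);
    out[0] = 0xE8;
    out[1] = (uint8_t)(rel);
    out[2] = (uint8_t)(rel >> 8);
    out[3] = (uint8_t)(rel >> 16);
    out[4] = (uint8_t)(rel >> 24);
}

/* Translate n straight-line virtual opcodes into n direct calls.
 *   body_addr[op] : assumed native address of the body for opcode op
 *   ctt_base      : assumed native address at which `ctt` will execute
 *   call_site[i]  : receives the address of the call generated for
 *                   instruction i, i.e. what its DTT entry would refer to
 * Returns the number of bytes written to `ctt`. */
size_t build_ctt(const int *opcodes, size_t n, const uint32_t *body_addr,
                 uint32_t ctt_base, uint8_t *ctt, size_t ctt_cap,
                 uint32_t *call_site) {
    assert(n * CALL_SIZE <= ctt_cap);
    for (size_t i = 0; i < n; i++) {
        uint32_t site = ctt_base + (uint32_t)(i * CALL_SIZE);
        emit_call(ctt + i * CALL_SIZE, site, body_addr[opcodes[i]]);
        call_site[i] = site;
    }
    return n * CALL_SIZE;
}
```

For example, with the CTT placed at address 0x1000 and a body at 0x2000, the first generated call is the byte sequence E8 FB 0F 00 00, because 0x2000 - 0x1005 = 0xFFB. Everything machine dependent is confined to `emit_call`, which reflects why the hardware dependency of subroutine threading stays small.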
The handling of immediate arguments to virtual instructions is perhaps the biggest difference between our implementation of subroutine threading and the approach used by Forth. Forth words pop all their arguments from the expression stack -- there is no concept of an immediate operand. Thus, there is no need for a structure like the DTT. The virtual instruction set defined by a Java virtual machine includes many instructions which take immediate operands. Hence, in Java, we need both the direct threading table (DTT) and the CTT. (In Section we described how the DTT is used to store immediate operands, and to correctly resolve virtual control transfer instructions.) In direct threading, entries in the DTT point to virtual instruction bodies, whereas in subroutine threading they refer to call sites in the CTT.
It may seem counterintuitive to improve dispatch performance by calling each body because the latency of a call and return may be greater than an indirect jump. This is not the real issue. On modern microprocessors the extra cost of the call (if any) is far outweighed by the benefit of eliminating a large source of unpredictable branches, as the data presented in the next chapter will show.
Subroutine threading handles the branches that implement the dispatch of straight-line virtual instructions; however, the control flow of the virtual program is still hidden from the hardware. That is, bodies that perform virtual branches still have no context. There are two problems, the first relating to shared indirect branch prediction resources, and the second relating to a lack of history context for conditional branch prediction resources.
Figure: Subroutine threading does not address branch instructions. Unlike straight-line virtual instructions, virtual branch bodies end with an indirect branch, just as in direct threading. (Note: When a body is called the vPC always points to the slot in the DTT corresponding to its first argument, or, if there are no operands, to the following instruction.)
Figure introduces a new Java example, this time including a virtual branch. Consider the implementation of ifeq, shaded in the figure. Prediction of the indirect branch at ``(a)'' may be problematic, because all instances of ifeq instructions in the virtual program share the same indirect branch instruction (and hence have a single prediction context).
Figure illustrates branch replication, a simple solution to the first of these problems. The idea is to generate an indirect branch instruction in the CTT immediately following the dispatch of the virtual branch. Virtual branch bodies have been modified to end with a native return instruction and the only result of dispatching a branch body is the side effect of setting the vPC to the destination. The result is that each virtual branch instruction has its own indirect branch predictor entry. Branch replication is an appropriate term because the indirect branch ending the branch body has been copied to potentially many places in the CTT.
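The effect of giving each virtual branch its own predictor entry can be illustrated with a toy model. The sketch below is a simulation under simplifying assumptions, not a description of any real processor: it assumes a predictor that simply remembers the last target seen for each native branch, and the names `branch_event`, `count_mispredictions` and `demo_mispredictions` are hypothetical. With a shared indirect branch all virtual branch sites map to one entry; with replication each site has its own.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_SITES = 8 };

/* One executed virtual branch: which site ran and where it went.
 * Targets are assumed to be non-negative (-1 marks a cold entry). */
typedef struct {
    int site;
    int target;
} branch_event;

/* Count mispredictions of a last-target predictor over a trace.
 * replicated == 0: every site shares entry 0 (one indirect branch in
 *                  the shared body).
 * replicated != 0: each site has its own entry (one indirect branch
 *                  per virtual branch in the CTT). */
int count_mispredictions(const branch_event *trace, size_t n, int replicated) {
    int last_target[MAX_SITES];
    int misses = 0;
    for (int i = 0; i < MAX_SITES; i++) {
        last_target[i] = -1;
    }
    for (size_t i = 0; i < n; i++) {
        assert(trace[i].site >= 0 && trace[i].site < MAX_SITES);
        int entry = replicated ? trace[i].site : 0;
        if (last_target[entry] != trace[i].target) {
            misses++;
        }
        last_target[entry] = trace[i].target;
    }
    return misses;
}

/* Two virtual branch sites executed alternately for 100 iterations,
 * each always transferring to its own fixed destination. */
int demo_mispredictions(int replicated) {
    branch_event trace[200];
    for (int i = 0; i < 100; i++) {
        trace[2 * i] = (branch_event){ .site = 0, .target = 10 };
        trace[2 * i + 1] = (branch_event){ .site = 1, .target = 20 };
    }
    return count_mispredictions(trace, 200, replicated);
}
```

In this model the shared branch mispredicts on all 200 dispatches, because consecutive dispatches always go to different targets, whereas the replicated branches mispredict only on the two cold first executions. Real predictors are more elaborate, so the numbers indicate the trend only.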
Figure: Context threading with branch replication illustrating the ``replicated'' indirect branch (a) in the CTT. The fact that the indirect branch corresponds to only one virtual instruction gives it better prediction context. The heavy arrow from (a) to (b) is followed when the virtual branch is taken. Prediction problems remain in the code compiled from the if statement labelled (c).
Branch replication is attractive because it is simple and produces the desired context with a minimum of new generated instructions. However, it has a number of drawbacks. First, for branching opcodes, we execute three hardware control transfers (a call to the body, a return, and the replicated indirect branch), which is an unnecessary overhead. Second, we still use the overly general indirect branch instruction, even in cases like goto where we would prefer a simpler direct native branch. Third, by only replicating the dispatch part of the virtual instruction, we do not take full advantage of the conditional branch predictor resources provided by the hardware. This is because the if statement in the body, marked (c) in the figure, is shared by all instances of ifeq. Due to these limitations, we only use branch replication for indirect virtual branches and exceptions.
Branch inlining, illustrated by Figure , is a technique that generates code for the bodies of virtual branch instructions into the CTT. In the figure we show how our system inlines the ifeq instruction. The generated native code, shaded in the figure, implements the same if-then-else logic as the original direct-threaded virtual instruction body. The inlined conditional branch instruction (jne, ``(a)'' in the figure) is thus fully exposed to the Pentium's conditional branch prediction hardware.
On the Pentium, branch inlining reduces pressure on the branch target buffer, or BTB, since conditional branches use the conditional branch predictors instead. The virtual conditional branches now appear as real conditional branches to the hardware. The dispatch of the body has been entirely eliminated.
The primary cost of branch inlining is increased code size, but this is modest because, at least for languages like Java and OCaml, virtual branch instructions are simple and have small bodies. For instance, on the Pentium 4, most branch instructions can be inlined with no more than 10 words, at worst a few additional i-cache lines.
The obvious challenge of branch inlining, apart from the hard labor required to implement it, is that the generated code is not portable and assumes detailed knowledge of the virtual bodies it must interoperate with.
Figure: Context-threaded VM Interpreter: Branch Inlining. The dashed arrow (a) illustrates the inlined conditional branch instruction, now fully exposed to the branch prediction hardware, and the heavy arrow (b) illustrates a direct branch implementing the not taken path. The generated code (shaded) assumes the vPC is in register esi and the Java expression stack pointer is in register edi. (In reality, we dedicate registers in the way shown for SableVM on the PowerPC only. On the Pentium 4, due to lack of registers, the vPC is actually stored on the stack.)
The only significant source of control transfers that remain in the virtual program is virtual method invocation and return. For successful branch prediction, the real problem is not the virtual call, which has only a few possible destinations, but rather the virtual return, which potentially has many destinations, one for each callsite of the method. As noted previously, the hardware already has an elegant solution to this problem in the form of the return address stack. We need only to deploy this resource to predict virtual returns.
Figure: Context Threading Apply-Return Inlining on Pentium. The generated code calls the invokestatic virtual instruction body but jumps (instruction at (c) is a jmp) to the return body.
We describe our solution with reference to Figure . The virtual method invocation body, Java's invokestatic in the figure, must transfer control to the first virtual instruction of the callee. Our goal is to generate dispatch code so that the corresponding virtual return instruction makes use of the hardware's return branch predictors.
We begin at the virtual call instruction (just before label ``(a)'' in the figure). The body of the invokestatic creates a new frame for the callee and then sets the vPC to the entry point of the callee (``(b)'' in the figure) before returning back to the CTT. Similar to branch replication, we insert a new native call indirect instruction following ``(a)'' in the CTT to transfer control to the start of the callee, shown as a solid arrow from ``(a)'' to ``(b)'' in the figure. The call indirect has the desired side effect of pushing CTT location (a) onto the hardware's return address stack. The first instruction of the callee is then dispatched. At the end of the callee, we modify the virtual return instruction as follows. In the CTT, at ``(c)'', we emit a native direct jump, an x86 jmp in the figure, to dispatch the body of the virtual return. This direct branch avoids perturbing the return address stack. The body of the virtual return now returns all the way back to the instruction following the original virtual call. This is shown as the dotted arrow from ``(d)'' to following ``(a)''. We refer to this technique as apply/return inlining.
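Why the return address stack suits virtual returns can be seen in a small simulation. As before, this is a hedged sketch and not a model of real hardware: it assumes an idealized return address stack that never overflows, compares it against a last-target predictor standing in for a shared indirect branch in the virtual return body, and uses hypothetical names (`predictors`, `virtual_call`, `virtual_return`, `simulate_returns`) and made-up CTT addresses.

```c
#include <assert.h>
#include <stddef.h>

enum { RAS_DEPTH = 16 };

typedef struct {
    int ras[RAS_DEPTH]; /* idealized hardware return address stack */
    int top;
    int btb_last;       /* last target of a shared indirect return branch */
    int ras_misses;
    int btb_misses;
} predictors;

/* The native call indirect emitted after the invoke body pushes the
 * CTT address that follows the virtual call. */
static void virtual_call(predictors *p, int return_addr) {
    assert(p->top < RAS_DEPTH);
    p->ras[p->top++] = return_addr;
}

/* The virtual return body is reached by a direct jump (nothing is
 * pushed), so its native return pops the entry pushed by the call. */
static void virtual_return(predictors *p, int return_addr) {
    assert(p->top > 0);
    if (p->ras[--p->top] != return_addr) {
        p->ras_misses++;
    }
    if (p->btb_last != return_addr) {
        p->btb_misses++;
    }
    p->btb_last = return_addr;
}

/* Invoke one callee from n_sites different call sites in rotation. */
void simulate_returns(int n_sites, int rounds, int *ras_misses, int *btb_misses) {
    predictors p = { .top = 0, .btb_last = -1 };
    for (int r = 0; r < rounds; r++) {
        for (int s = 0; s < n_sites; s++) {
            int return_addr = 1000 + 10 * s; /* made-up CTT address */
            virtual_call(&p, return_addr);
            virtual_return(&p, return_addr);
        }
    }
    *ras_misses = p.ras_misses;
    *btb_misses = p.btb_misses;
}
```

With three call sites and 50 rounds, the last-target predictor mispredicts all 150 returns in this model, since each return goes somewhere different from the one before, while the return address stack mispredicts none. This is the behavior apply/return inlining aims to obtain from the hardware.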
With this final step, we have a complete technique that aligns all virtual program control flow with the corresponding native flow. There are, however, some practical challenges to implementing our design for apply/return inlining. First, one must take care to match the hardware stack against the virtual program stack. For instance, in OCaml, exceptions unwind the virtual machine stack; the hardware stack must be unwound in a corresponding manner. Second, some runtime environments are extremely sensitive to hardware stack manipulations, since they use or modify the machine stack pointer for their own purposes. In such cases, it is possible to create a separate stack structure and swap between the two at virtual invocation and return points. This approach would introduce significant overhead, and is only justified if apply/return inlining provides a substantial performance benefit.
The code generation described in this chapter is carried out when each virtual method is loaded. The idea is to generate relatively simple code that exposes the dispatch branch instructions to the hardware branch predictors of the processor.
In the next chapter we present data showing that our approach is effective in the sense that branch mispredictions are reduced and performance is improved. Subroutine threading is by far more effective than branch replication, branch inlining, apply-return inlining, or tiny inlining, especially when its relative simplicity and small amount of machine dependent code are taken into account. Branch inlining is the most complicated and least portable.
Our implementation of context threading has at least two potential problems. First, effort is expended at load time for regions of code that may never execute. This could penalize performance when large amounts of cold code are present. Second, it is awkward to interpose profiling instrumentation around the virtual instruction bodies dispatched from the CTT. The difficulty stems from the fact that subroutine threading, like direct threading, does not need a dispatch loop. This means that calls to profiling code must be generated in amongst the generated dispatch code in the CTT. Removing instrumentation once it is no longer needed requires generated code to be rewritten or regenerated.
In cha:Design-and-Implementation-YETI we describe a different approach to efficient interpretation that addresses these two problems by generating simple code for hot interprocedural paths, or traces. This allows us to exploit the efficacy and simplicity of subroutine threading for straight-line code while eliminating the mispredictions caused by virtual branch instructions.