This chapter will describe how to efficiently implement an interpreter
that calls its virtual instruction bodies. This investigation was motivated by the suggestion we made in cha:introduction,
namely that such an interpreter will be easier to extend with a JIT
than an interpreter that is direct-threaded or uses switch dispatch.
Before tackling the design of our mixed-mode system we need to ensure
that the interpreter is efficient.
An obvious, but slow, way to use callable virtual instruction bodies is to build a direct call threaded (DCT) interpreter (see sec:Direct-Call-Threading for a detailed description of the technique). In a DCT interpreter all bodies are dispatched by the same indirect call instruction. The destination of the indirect call is data driven (i.e., determined by the sequence of virtual instructions that make up the virtual program) and thus impossible for the hardware to predict. As a result, a DCT interpreter suffers a branch misprediction for almost every dispatch.
The main realization driving our approach is that, to call each body without misprediction, the dispatch branches must be direct call instructions. Since these can only be generated when virtual instructions are loaded, we generate them ourselves. At load time, each straight-line section of virtual instructions is translated to a sequence of direct call native instructions, each dispatching the corresponding virtual instruction body. The loaded program is run by jumping to the beginning of the generated sequence of native code, which then emulates the virtual program by calling each virtual instruction body in turn. This approach is very similar to a Forth compile-time technique called subroutine threading, described in sec:Subroutine-Threading.
Subroutine threading dispatches straight-line sequences of virtual instructions very efficiently because no branch mispredictions occur. The generated direct calls pose no prediction challenge because each has only one explicit destination. The destination of the return ending each body is perfectly predicted by the return branch predictor stack implemented by modern processors. In the next chapter we present data showing that subroutine threading runs the SPECjvm98 suite about 20% faster than direct threading.
Subroutine threading handles straight-line virtual code efficiently, but does nothing to improve the dispatch of virtual branch instructions. We introduce context threading, which, by generating more sophisticated code for virtual branch instructions, eliminates the branch mispredictions caused by the dispatch of virtual branch instructions as well. Context threading improves the performance of the SPECjvm98 suite by about another 5% over subroutine threading.
Generating and dispatching native code obviously makes our implementation of subroutine threading less portable than many dispatch techniques. However, since subroutine threading requires the generation of only one type of machine instruction, a direct call, its hardware dependency is isolated to a few lines of code. Context threading requires much more machine dependent code generation.
In cha:Design-and-Implementation-YETI we will describe another way of handling virtual branches that requires less complex, less machine dependent code generation, but relies on additional runtime infrastructure to identify hot interprocedural paths, or traces.
Although direct-threaded interpreters are known to have poor branch
prediction properties, they are also known to have a small instruction
cache footprint [#!romer!#]. Since both branch mispredictions and
instruction cache misses are major pipeline hazards, we would like
to retain the good cache behavior of direct-threaded interpreters
while improving the branch behavior. Subroutine threading minimally
affects code size. This is in contrast to techniques like selective
inlining, described in Section , which
improve branch prediction by replicating entire bodies, in effect
trading instruction cache size for better branch prediction. In cha:Evaluation-of-Yeti
we will report data showing that subroutine threading causes very
few additional stall cycles caused by instruction cache misses as
compared to direct threading.
In Section we discuss the challenge
of virtual branch instructions in general terms. In Section
we show how to replace straight-line dispatch with subroutine threading.
In Section
we show how to inline conditional
and indirect jumps, and in Section
we discuss
handling virtual calls and returns with native calls and returns.
Before describing our design, we start with two observations. First, a virtual program will typically contain several types of control flow: conditional and unconditional branches, indirect branches, and calls and returns. We must also consider the dispatch of straight-line virtual instructions. For direct-threaded interpreters, straight-line execution is just as expensive as handling virtual branches, since all virtual instructions are dispatched with an indirect branch. Second, the dynamic execution path of the virtual program will contain patterns (loops, for example) that are similar in nature to the patterns found when executing native code. These control flow patterns originate in the algorithm that the virtual program implements.
As described in Section , modern microprocessors
have considerable resources devoted to identifying these patterns
in native code, and exploiting them to predict branches. Direct threading
uses only indirect branches for dispatch and, due to the context problem,
the patterns that exist in the virtual program are largely hidden
from the microprocessor.
The spirit of our approach is to expose these virtual control flow patterns to the hardware, such that the physical execution path matches the virtual execution path. To achieve this goal, we generate dispatch code at load time that enables the different types of hardware prediction resources to predict the different types of virtual control flow transfers. We strive to maintain the property that the virtual program counter is precisely correlated with the physical program counter and in fact, when all our techniques are combined, there is a one-to-one mapping between them at most control flow points.
Figure:
Subroutine Threaded Interpreter showing how the CTT contains one
generated direct call instruction for each virtual instruction and
how the first entry in the DTT corresponding to each virtual instruction
points to generated code to dispatch it. Callable bodies are shown
here as nested functions for illustration only. All maintenance of
the vPC must be done in the bodies. Hence even virtual instructions
that take no arguments, like iadd, must bump vPC
past the virtual opcode. Virtual instructions, like iload,
that take an argument must bump vPC past the argument as
well.
Figure:
Direct threaded bodies retrofitted as callable routines by inserting
inline assembler return instructions. This example is for Pentium
4 and hence ends each body with a ret instruction. The asm
statement is an extension to the C language, inline assembler, provided
by gcc and many other compilers.
The dispatch of straight-line virtual instructions is the largest single source of branches when executing an interpreter. Any technique that hopes to improve branch prediction accuracy must address straight-line dispatch.
Rather than eliminate dispatch, we describe an alternative organization for the interpreter in which native call and return instructions are used. This approach is conceptually elegant because the subroutine is a natural unit of abstraction to express the implementation of virtual instruction bodies.
Figure illustrates our implementation of
subroutine threading, using the same example program as Figure
.
In this case, we show the state of the virtual machine after
the first virtual instruction has been executed. We add a new structure
to the interpreter architecture, called the Context Threading
Table (CTT), which contains a sequence of native call instructions.
Each native call dispatches the body for its virtual instruction.
Although Figure
shows each body as a nested
function, in fact we implement this by ending each non-branching opcode
body with a native return instruction
as shown in Figure
.
The handling of immediate arguments to virtual instructions is perhaps
the biggest difference between our implementation of subroutine threading
and the approach used by Forth. Forth words pop all their arguments
from the expression stack -- there is no concept of an immediate
operand. Thus, there is no need for a structure like the DTT. The
virtual instruction set defined by a Java virtual machine includes
many instructions which take immediate operands. Hence, in Java, we
need both the direct threading table (DTT) and the CTT. (In Section
we described how the DTT is used to store immediate operands, and
to correctly resolve virtual control transfer instructions.) In direct
threading, entries in the DTT point to virtual instruction bodies,
whereas in subroutine threading they refer to call sites in the CTT.
It may seem counterintuitive to improve dispatch performance by calling each body, because the latency of a call and return may be greater than that of an indirect jump. This is not the real issue. On modern microprocessors the extra cost of the call (if any) is far outweighed by the benefit of eliminating a large source of unpredictable branches, as the data presented in the next chapter will show.
Subroutine threading handles the branches that implement the dispatch of straight-line virtual instructions; however, the control flow of the virtual program is still hidden from the hardware. That is, bodies that perform virtual branches still have no context. There are two problems, the first relating to shared indirect branch prediction resources, and the second relating to a lack of history context for conditional branch prediction resources.
Figure:
Subroutine Threading does not address branch instructions. Unlike
straight-line virtual instructions, virtual branch bodies end with
an indirect branch, just like direct threading. (Note: When a body
is called the vPC always points to the slot in the DTT corresponding
to its first argument, or, if there are no operands, to the following
instruction.)
Figure introduces a new Java example, this
time including a virtual branch. Consider the implementation of ifeq,
shaded in the figure. Prediction of the indirect branch at ``(a)''
may be problematic, because all instances of ifeq
instructions in the virtual program share the same indirect branch
instruction (and hence have a single prediction context).
Figure illustrates branch
replication, a simple solution to the first of these problems. The
idea is to generate an indirect branch instruction in the CTT immediately
following the dispatch of the virtual branch. Virtual branch bodies
have been modified to end with a native return instruction and the
only result of dispatching a branch body is the side effect of setting
the vPC to the destination. The result is that each
virtual branch instruction has its own indirect branch predictor entry.
Branch replication is an appropriate term because the indirect
branch ending the branch body has been copied to potentially many
places in the CTT.
Figure:
Context threading with branch replication illustrating the ``replicated''
indirect branch (a) in the CTT. The fact that the indirect branch
corresponds to only one virtual instruction gives it better prediction
context. The heavy arrow from (a) to (b) is followed when the virtual
branch is taken. Prediction problems remain in the code compiled from
the if statement labelled (c).
Branch replication is attractive because it is simple and produces
the desired context with a minimum of new generated instructions.
However, it has a number of drawbacks. First, for branching opcodes,
we execute three hardware control transfers (a call to the body, a
return, and the replicated indirect branch), which is an unnecessary
overhead. Second, we still use the overly general indirect branch
instruction, even in cases like goto where we would prefer
a simpler direct native branch. Third, by only replicating the dispatch
part of the virtual instruction, we do not take full advantage of
the conditional branch predictor resources provided by the hardware.
This is because the if statement in the body, marked (c)
in the figure, is shared by all instances of ifeq. Due to
these limitations, we only use branch replication for indirect virtual
branches and exceptions.
Branch inlining, illustrated by Figure ,
is a technique that generates the code for the bodies of virtual branch
instructions directly into the CTT. In the figure we show how our system inlines
the ifeq instruction. The generated native code, shaded in
the figure, implements the same if-then-else logic as the original
direct-threaded virtual instruction body. The inlined conditional
branch instruction (jne, ``(a)'' in the figure) is thus
fully exposed to the Pentium's conditional branch prediction hardware.
On the Pentium, branch inlining reduces pressure on the branch target buffer, or BTB, since conditional branches use the conditional branch predictors instead. The virtual conditional branches now appear as real conditional branches to the hardware. The dispatch of the body has been entirely eliminated.
The primary cost of branch inlining is increased code size, but this is modest because, at least for languages like Java and OCaml, virtual branch instructions are simple and have small bodies. For instance, on the Pentium 4, most branch instructions can be inlined with no more than 10 words, at worst a few additional i-cache lines.
The obvious challenge of branch inlining, apart from the hard labor required to implement it, is that the generated code is not portable and assumes detailed knowledge of the virtual bodies it must interoperate with.
Figure:
Context-threaded VM Interpreter: Branch Inlining. The dashed arrow
(a) illustrates the inlined conditional branch instruction, now fully
exposed to the branch prediction hardware, and the heavy arrow (b)
illustrates a direct branch implementing the not taken path. The generated
code (shaded) assumes the vPC is in register esi
and the Java expression stack pointer is in register edi.
(In reality, we dedicate registers in the way shown for SableVM on
the PowerPC only. On the Pentium 4, due to the lack of registers, the vPC
is actually stored on the stack.)
The only significant source of control transfers that remain in the virtual program is virtual method invocation and return. For successful branch prediction, the real problem is not the virtual call, which has only a few possible destinations, but rather the virtual return, which potentially has many destinations, one for each callsite of the method. As noted previously, the hardware already has an elegant solution to this problem in the form of the return address stack. We need only deploy this resource to predict virtual returns.
Figure:
Context Threading Apply-Return Inlining on Pentium. The generated
code calls the invokestatic virtual instruction body
but jumps (instruction at (c) is a jmp) to the return
body.
We describe our solution with reference to Figure .
The virtual method invocation body, Java's invokestatic in
the figure, must transfer control to the first virtual instruction
of the callee. Our goal is to generate dispatch code so that the corresponding
virtual return instruction makes use of the hardware's return branch
predictors.
We begin at the virtual call instruction (just before label ``(a)''
in the figure). The body of the invokestatic creates a new
frame for the callee and then sets the vPC to the entry point
of the callee (``(b)'' in the figure) before returning back to
the CTT. Similar to branch replication, we insert a new native call
indirect instruction following ``(a)'' in the CTT to transfer
control to the start of the callee, shown as a solid arrow from ``(a)''
to ``(b)'' in the figure. The call indirect has the desired side
effect of pushing CTT location (a) onto the hardware's return address
stack. The first instruction of the callee is then dispatched. At
the end of the callee, we modify the virtual return instruction as
follows. In the CTT, at ``(c)'', we emit a native direct jump,
an x86 jmp in the figure, to dispatch the body of the virtual
return. This direct branch avoids perturbing the return address stack.
The body of the virtual return now returns all the way back to the
instruction following the original virtual call. This is shown as
the dotted arrow from ``(d)'' to following ``(a)''. We refer
to this technique as apply/return inlining.
With this final step, we have a complete technique that aligns all virtual program control flow with the corresponding native flow. There are, however, some practical challenges to implementing our design for apply/return inlining. First, one must take care to match the hardware stack against the virtual program stack. For instance, in OCaml, exceptions unwind the virtual machine stack; the hardware stack must be unwound in a corresponding manner. Second, some runtime environments are extremely sensitive to hardware stack manipulations, since they use or modify the machine stack pointer for their own purposes. In such cases, it is possible to create a separate stack structure and swap between the two at virtual invocation and return points. This approach would introduce significant overhead, and is only justified if apply/return inlining provides a substantial performance benefit.
The code generation described in this chapter is carried out when each virtual method is loaded. The idea is to generate relatively simple code that exposes the dispatch branch instructions to the hardware branch predictors of the processor.
In the next chapter we present data showing that our approach is effective in the sense that branch mispredictions are reduced and performance is improved. Subroutine threading is by far the most effective technique, outperforming branch replication, branch inlining, apply/return inlining, and tiny inlining, especially when its relative simplicity and small amount of machine dependent code are taken into account. Branch inlining is the most complicated and least portable.
Our implementation of context threading has at least two potential problems. First, effort is expended at load time for regions of code that may never execute. This could penalize performance when large amounts of cold code are present. Second, it is awkward to interpose profiling instrumentation around the virtual instruction bodies dispatched from the CTT. The difficulty stems from the fact that subroutine threading, like direct threading, does not need a dispatch loop, which means that calls to profiling code must be generated amongst the dispatch code in the CTT. Removing instrumentation after it is no longer needed then requires the generated code to be rewritten or regenerated.
In cha:Design-and-Implementation-YETI we describe a different approach to efficient interpretation that addresses these two problems by generating simple code for hot interprocedural paths, or traces. This allows us to exploit the efficacy and simplicity of subroutine threading for straight-line code while eliminating the mispredictions caused by virtual branch instructions.