Contents
- . Example Java Virtual Program showing
source (on the left) and Java virtual instructions, or bytecodes,
on the right.
- . Example of a Java method containing a polymorphic
callsite
- . A switch interpreter loads each virtual instruction as a virtual
opcode, or token, corresponding to the case of the switch statement
that implements it. Virtual instructions that take immediate operands,
like iconst, must fetch them from the vPC and adjust
the vPC past the operand. Virtual instructions which do not
need operands, like iadd, do not need to adjust the vPC.
- . A direct call-threaded interpreter packages each virtual instruction
body as a function. The shaded box highlights the dispatch loop showing
how virtual instructions are dispatched through a function pointer.
Direct call threading requires the loaded representation of the program
to point to the address of the function implementing each virtual
instruction.
- . Direct-threaded Interpreter showing how Java Source code compiled
to Java bytecode is loaded into the Direct Threading Table (DTT).
The virtual instruction bodies are written in a single C function,
each identified by a separate label. The double-ampersand (&&)
shown in the DTT is gcc syntax for the address of a label.
- . Machine instructions used for direct dispatch. On both platforms
assume that some general purpose register, rx, has been dedicated
for the vPC. Note that on the PowerPC indirect branches are
two-part instructions that first load the ctr register and
then branch to its contents.
- . Subroutine Threaded Interpreter showing how the CTT contains one
generated direct call instruction for each virtual instruction and
how the first entry in the DTT corresponding to each virtual instruction
points to generated code to dispatch it. Callable bodies are shown
here as nested functions for illustration only. All maintenance of
the vPC must be done in the bodies. Hence even virtual instructions
that take no arguments, like iadd, must bump vPC
past the virtual opcode. Virtual instructions, like iload,
that take an argument must bump vPC past the argument as
well.
- . Direct threaded bodies retrofitted as callable routines by inserting
inline assembler return instructions. This example is for Pentium
4 and hence ends each body with a ret instruction. The asm
statement is an extension to the C language, inline assembler, provided
by gcc and many other compilers.
- . Subroutine Threading does not address branch instructions. Unlike
straight line virtual instructions, virtual branch bodies end with
an indirect branch, just like direct threading. (Note: When a body
is called the vPC always points to the slot in the DTT corresponding
to its first argument, or, if there are no operands, to the following
instruction.)
- . Context threading with branch replication illustrating the ``replicated''
indirect branch (a) in the CTT. The fact that the indirect branch
corresponds to only one virtual instruction gives it better prediction
context. The heavy arrow from (a) to (b) is followed when the virtual
branch is taken. Prediction problems remain in the code compiled from
the if statement labelled (c)
- . Context-threaded VM Interpreter: Branch Inlining. The dashed arrow
(a) illustrates the inlined conditional branch instruction, now fully
exposed to the branch prediction hardware, and the heavy arrow (b)
illustrates a direct branch implementing the not taken path. The generated
code (shaded) assumes the vPC is in register esi
and the Java expression stack pointer is in register edi.
(In reality, we dedicate registers in the way shown for SableVM on
the PowerPC only. On the Pentium 4, due to lack of registers, the vPC
is actually stored on the stack.)
- . Context Threading Apply-Return Inlining on Pentium. The generated
code calls the invokestatic virtual instruction body
but jumps (instruction at (c) is a jmp) to the return
body.
- . OCaml Pipeline Hazards Relative to Direct Threading
- . Java Pipeline Hazards Relative to Direct Threading
- . OCaml Elapsed Time Relative to Direct Threading
- . SableVM Elapsed Time Relative to Direct Threading
- . PPC970 Elapsed Time Relative to Direct Threading
- . Reproduction of [ct-tcl2005, Figure 1] showing cycles run per
virtual instruction dispatched for various Tcl and OCaml benchmarks.
- . Elapsed time of subroutine threading relative to direct threading
for OCaml on UltraSPARC III.
- . Virtual program loaded into Yeti showing how dispatcher structures
are initially shared between all instances of a virtual instruction.
The dispatch loop, shaded, is similar to the dispatch loop of direct
call threading except that another level of indirection, through
the dispatcher structure, has been added. Profiling instrumentation
is called before and after the dispatch of the body.
- . Shows a region of the DTT during block recording mode. The body of
each block discovery dispatcher points to the corresponding virtual
instruction body (only the body for the first iload is shown). The
dispatcher's payload field points to instances of instruction payload.
The thread context struct is shown as TCS.
- . Shows a region of the DTT just after block recording mode has finished.
- . Schematic of a trace illustrating how the trace exit table (shaded) in
the trace payload has recorded the on-trace destination of each virtual
branch.
- . PowerPC code for a portion of a trace region
body, showing details of a trace exit and trace exit handler. This
code assumes that r26 has been dedicated for the vPC. In
addition the generated code in the trace exit handler uses r30,
the stack pointer as defined by the ABI, to store the trace exit id
into the TCS.
- . Number of dispatches executed vs region shape. The y-axis has a logarithmic
scale. Numbers above bars, in scientific notation, give the number
of regions dispatched. The x-axis lists the SPECjvm98 benchmarks in
alphabetical order.
- . Number of virtual instructions executed per dispatch for each region
shape. The y-axis has a logarithmic scale. Numbers above bars are
the number of virtual instructions executed per dispatch (rounded
to two significant figures). SPECjvm98
benchmarks appear along the x-axis sorted by the average number of instructions
executed by an LB.
- . Trace completion rate, as a percentage of the virtual instructions
in a trace, and code cache size, as a percentage of the virtual
instructions in all loaded methods, for the SPECjvm98 benchmarks and
scitest.
- . Performance of each stage of Yeti enhancement from DCT interpreter
to trace-based JIT relative to unmodified JamVM-1.3.3 (direct-threaded)
running the SPECjvm98 benchmarks (sorted by LB length).
- . Performance of Linear Blocks (LB) compared to subroutine-threaded
JamVM-1.3.3 (SUB) relative to unmodified JamVM-1.3.3 (direct-threaded)
for the SPECjvm98 benchmarks.
- . Performance of JamVM interpreted traces (i-TR) and selective inlined
SableVM 1.1.8 relative to unmodified JamVM-1.3.3 (direct-threaded)
for the SPECjvm98 benchmarks.
- . Performance of JamVM interpreted traces (i-TR) relative to unmodified
JamVM-1.3.3 (direct-threaded) and selective inlined SableVM 1.1.8
relative to direct threaded SableVM version 1.1.8 for the SPECjvm98
benchmarks.
- . Elapsed time performance of Yeti with JIT compared to Sun Java 1.05.0_6_64
relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks.
- . Performance of Gennady Pekhimenko's Pentium port relative to unmodified
JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks.
- . Cycles relative to JamVM-1.3.3 (direct threading) running SPECjvm98
benchmarks.
- . Stall breakdown for SPECjvm98 benchmarks relative to JamVM-1.3.3
(direct threading).
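The captions above for the switch interpreter and the direct call-threaded interpreter describe two dispatch styles that can be contrasted in a few lines of C. The following is only a minimal sketch under stated assumptions, not code from the dissertation: the opcode names, the slot union, the fixed 16-entry stack, and the demo_* helpers are all hypothetical and exist only for illustration. It runs the virtual program "iconst 2, iconst 3, iadd" both ways.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcodes for a tiny stack machine (illustrative names only). */
enum { OP_ICONST, OP_IADD, OP_HALT };
enum { STACK_MAX = 16 };

/* Switch dispatch: each virtual opcode selects a case of the switch. */
int run_switch(const intptr_t *code) {
    const intptr_t *vpc = code;
    int stack[STACK_MAX];
    int sp = 0;
    for (;;) {
        switch (*vpc++) {               /* load opcode, move vPC past it */
        case OP_ICONST:
            assert(sp < STACK_MAX);
            stack[sp++] = (int)*vpc++;  /* fetch immediate, move vPC past it */
            break;
        case OP_IADD:                   /* no operand: vPC needs no adjustment */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        }
    }
}

/* Direct call threading: the loaded program holds the address of the
   function implementing each virtual instruction, followed by operands. */
typedef union slot {
    void (*body)(void);
    intptr_t operand;
} slot;

static const slot *g_vpc;   /* interpreter state is global in this sketch */
static int g_stack[STACK_MAX];
static int g_sp;
static int g_running;

static void body_iconst(void) {
    assert(g_sp < STACK_MAX);
    g_stack[g_sp++] = (int)(g_vpc++)->operand;  /* consume the operand slot */
}
static void body_iadd(void) {
    assert(g_sp >= 2);
    g_sp--;
    g_stack[g_sp - 1] += g_stack[g_sp];
}
static void body_halt(void) { g_running = 0; }

int run_call_threaded(const slot *program) {
    g_vpc = program;
    g_sp = 0;
    g_running = 1;
    while (g_running) {                 /* the dispatch loop */
        void (*body)(void) = (g_vpc++)->body;
        body();                         /* dispatch through a function pointer */
    }
    return g_stack[g_sp - 1];
}

/* Hypothetical helpers that load and run "iconst 2, iconst 3, iadd". */
int demo_switch(void) {
    static const intptr_t code[] = { OP_ICONST, 2, OP_ICONST, 3, OP_IADD, OP_HALT };
    return run_switch(code);
}
int demo_call_threaded(void) {
    static const slot program[] = {
        { .body = body_iconst }, { .operand = 2 },
        { .body = body_iconst }, { .operand = 3 },
        { .body = body_iadd },   { .body = body_halt },
    };
    return run_call_threaded(program);
}
```

Calling demo_switch() or demo_call_threaded() returns the value left on top of the expression stack. In the switch version the loaded program is an array of opcodes, whereas in the call-threaded version the loaded program already contains the addresses of the bodies, so the loop needs no switch.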
January 22, 2008
Mathew Zaleski
2008-01-22