List of Figures


. Example Java virtual program showing source on the left and Java virtual instructions, or bytecodes, on the right.
. Example of a Java method containing a polymorphic callsite.
. A switch interpreter loads each virtual instruction as a virtual opcode, or token, corresponding to the case of the switch statement that implements it. Virtual instructions that take immediate operands, like iconst, must fetch them from the vPC and adjust the vPC past the operand. Virtual instructions which do not need operands, like iadd, do not need to adjust the vPC.
. A direct call-threaded interpreter packages each virtual instruction body as a function. The shaded box highlights the dispatch loop showing how virtual instructions are dispatched through a function pointer. Direct call threading requires the loaded representation of the program to point to the address of the function implementing each virtual instruction.
. Direct-threaded Interpreter showing how Java Source code compiled to Java bytecode is loaded into the Direct Threading Table (DTT). The virtual instruction bodies are written in a single C function, each identified by a separate label. The double-ampersand (&&) shown in the DTT is gcc syntax for the address of a label.
. Machine instructions used for direct dispatch. On both platforms assume that some general purpose register, rx, has been dedicated for the vPC. Note that on the PowerPC indirect branches are two-part instructions that first load the ctr register and then branch to its contents.
. Subroutine Threaded Interpreter showing how the CTT contains one generated direct call instruction for each virtual instruction and how the first entry in the DTT corresponding to each virtual instruction points to generated code to dispatch it. Callable bodies are shown here as nested functions for illustration only. All maintenance of the vPC must be done in the bodies. Hence even virtual instructions that take no arguments, like iadd, must bump vPC past the virtual opcode. Virtual instructions, like iload, that take an argument must bump vPC past the argument as well.
. Direct threaded bodies retrofitted as callable routines by inserting inline assembler return instructions. This example is for Pentium 4 and hence ends each body with a ret instruction. The asm statement is an extension to the C language, inline assembler, provided by gcc and many other compilers.
. Subroutine Threading does not address branch instructions. Unlike straight line virtual instructions, virtual branch bodies end with an indirect branch, just like direct threading. (Note: When a body is called the vPC always points to the slot in the DTT corresponding to its first argument, or, if there are no operands, to the following instruction.)
. Context threading with branch replication illustrating the "replicated" indirect branch (a) in the CTT. The fact that the indirect branch corresponds to only one virtual instruction gives it better prediction context. The heavy arrow from (a) to (b) is followed when the virtual branch is taken. Prediction problems remain in the code compiled from the if statement labelled (c).
. Context-threaded VM Interpreter: Branch Inlining. The dashed arrow (a) illustrates the inlined conditional branch instruction, now fully exposed to the branch prediction hardware, and the heavy arrow (b) illustrates a direct branch implementing the not taken path. The generated code (shaded) assumes the vPC is in register esi and the Java expression stack pointer is in register edi. (In reality, we dedicate registers in the way shown for SableVM on the PowerPC only. On the Pentium 4, due to lack of registers, the vPC is actually stored on the stack.)
. Context Threading Apply-Return Inlining on Pentium. The generated code calls the invokestatic virtual instruction body but jumps (instruction at (c) is a jmp) to the return body.
. OCaml Pipeline Hazards Relative to Direct Threading
. Java Pipeline Hazards Relative to Direct Threading
. OCaml Elapsed Time Relative to Direct Threading
. SableVM Elapsed Time Relative to Direct Threading
. PPC970 Elapsed Time Relative to Direct Threading
. Reproduction of [ct-tcl2005, Figure 1] showing cycles run per virtual instruction dispatched for various Tcl and OCaml benchmarks.
. Elapsed time of subroutine threading relative to direct threading for OCaml on UltraSPARC III.
. Virtual program loaded into Yeti showing how dispatcher structures are initially shared between all instances of a virtual instruction. The dispatch loop, shaded, is similar to the dispatch loop of direct call threading except that another level of indirection, through the dispatcher structure, has been added. Profiling instrumentation is called before and after the dispatch of the body.
. Shows a region of the DTT during block recording mode. The body of each block discovery dispatcher points to the corresponding virtual instruction body (Only the body for the first iload is shown). The dispatcher's payload field points to instances of instruction payload. The thread context struct is shown as TCS.
. Shows a region of the DTT just after block recording mode has finished.
. Schematic of a trace illustrating how the trace exit table (shaded) in the trace payload has recorded the on-trace destination of each virtual branch.
. PowerPC code for a portion of a trace region body, showing details of a trace exit and trace exit handler. This code assumes that r26 has been dedicated for the vPC. In addition the generated code in the trace exit handler uses r30, the stack pointer as defined by the ABI, to store the trace exit id into the TCS.
. Number of dispatches executed vs region shape. The y-axis has a logarithmic scale. Numbers above bars, in scientific notation, give the number of regions dispatched. The x-axis lists the SPECjvm98 benchmarks in alphabetical order.
. Number of virtual instructions executed per dispatch for each region shape. The y-axis has a logarithmic scale. Numbers above bars are the number of virtual instructions executed per dispatch (rounded to two significant figures). SPECjvm98 benchmarks appear along the x-axis sorted by the average number of instructions executed by an LB.
. Trace completion rate as a percentage of the virtual instructions in a trace, and code cache size as a percentage of the virtual instructions in all loaded methods, for the SPECjvm98 benchmarks and scitest.
. Performance of each stage of Yeti enhancement from DCT interpreter to trace-based JIT relative to unmodified JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks (sorted by LB length).
. Performance of Linear Blocks (LB) compared to subroutine-threaded JamVM-1.3.3 (SUB) relative to unmodified JamVM-1.3.3 (direct-threaded) for the SPECjvm98 benchmarks.
. Performance of JamVM interpreted traces (i-TR) and selective inlined SableVM 1.1.8 relative to unmodified JamVM-1.3.3 (direct-threaded) for the SPECjvm98 benchmarks.
. Performance of JamVM interpreted traces (i-TR) relative to unmodified JamVM-1.3.3 (direct-threaded) and selective inlined SableVM 1.1.8 relative to direct-threaded SableVM 1.1.8 for the SPECjvm98 benchmarks.
. Elapsed time performance of Yeti with JIT compared to Sun Java 1.05.0_6_64 relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks.
. Performance of Gennady Pekhimenko's Pentium port relative to unmodified JamVM-1.3.3 (direct-threaded) running the SPECjvm98 benchmarks.
. Cycles relative to JamVM-1.3.3 (direct threading) running SPECjvm98 benchmarks.
. Stall breakdown for SPECjvm98 benchmarks relative to JamVM-1.3.3 (direct threading).
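
The switch and direct call-threaded dispatch techniques named in the captions above can be sketched in C. This is a minimal illustration only, not code from the dissertation: the three-opcode stack machine, the slot union, and the names run_switch, load_program and run_call_threaded are all assumptions introduced here.

```c
#include <stddef.h>

/* Hypothetical virtual opcodes for a tiny stack machine (not the JVM encoding). */
enum { OP_ICONST, OP_IADD, OP_HALT };

/* Switch dispatch: each virtual opcode is a token selecting a case. */
int run_switch(const int *program)
{
    const int *vpc = program;            /* virtual program counter */
    int stack[16];
    int sp = 0;

    for (;;) {
        switch (*vpc++) {                /* fetch the opcode, step past it */
        case OP_ICONST:
            stack[sp++] = *vpc++;        /* fetch the immediate operand, step past it */
            break;
        case OP_IADD:
            sp--;
            stack[sp - 1] += stack[sp];  /* no operand, so no further vPC adjustment */
            break;
        case OP_HALT:
            return stack[sp - 1];
        default:
            return -1;                   /* unknown opcode */
        }
    }
}

/* Direct call threading: each body is a function, and the loaded program
   holds the address of the function implementing each instruction. */
typedef struct vm vm;
typedef void (*body_fn)(vm *);
typedef union { body_fn body; int operand; } slot;   /* assumed slot layout */
struct vm { const slot *vpc; int stack[16]; int sp; int running; };

void iconst_body(vm *m) { m->stack[m->sp++] = (m->vpc++)->operand; }
void iadd_body(vm *m)   { m->sp--; m->stack[m->sp - 1] += m->stack[m->sp]; }
void halt_body(vm *m)   { m->running = 0; }

/* Translate opcodes into function addresses, copying operands unchanged. */
void load_program(const int *bytecode, size_t n, slot *out)
{
    size_t i = 0;
    while (i < n) {
        switch (bytecode[i]) {
        case OP_ICONST:
            out[i].body = iconst_body;
            out[i + 1].operand = bytecode[i + 1];
            i += 2;
            break;
        case OP_IADD:
            out[i].body = iadd_body;
            i += 1;
            break;
        default:                         /* OP_HALT, or anything unknown */
            out[i].body = halt_body;
            i += 1;
            break;
        }
    }
}

int run_call_threaded(const slot *program)
{
    vm m = { program, {0}, 0, 1 };
    while (m.running) {                  /* the dispatch loop */
        body_fn body = (m.vpc++)->body;  /* load the function pointer, step past it */
        body(&m);                        /* dispatch through it */
    }
    return m.stack[m.sp - 1];
}
```

For the program iconst 2, iconst 3, iadd, halt, both loops finish with 5 on top of the expression stack. The sketch omits stack bounds checks to keep the two dispatch loops easy to compare.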

$RCSfile: matzDissertation.lyx,v $ $Revision: 1.18 $ January 22, 2008

Mathew Zaleski 2008-01-22