
6 Design and Implementation of YETI


This chapter describes our gradually extensible trace interpreter, or Yeti for short. The main goal of this part of our research is to design and implement a language VM that allows for a simple, efficient interpreter and yet can be conveniently, and gradually, extended with a JIT compiler[*].

As we argued in Chapter [*], we believe the key ingredients for this are threefold. First, the system should implement callable virtual instruction bodies that can be dispatched both by the interpreter and from code generated by the JIT compiler. Second, the system should compile, then run, dynamically identified regions that contain only hot code. We pointed out that hot interprocedural paths, or traces, seem like a good choice. Third, the JIT compiler should be able to fall back on generating dispatch code to virtual instruction bodies when it encounters virtual instructions that it does not fully support. The combination of these features enables a gradual style of JIT development in which compiler support for virtual instructions can be added one instruction at a time.

A similar argument can be made that the code generated for each hot region of the virtual program should also be callable and should update interpreter state before returning so that interpretation may resume immediately. We call such generated code a region body because it is essentially a generated virtual instruction body for a newly created, runtime-identified virtual instruction.

Region bodies are called with the interpreter state as the first virtual instruction in the region would have seen it, and they return with the interpreter state as the last virtual instruction would have left it. Within a region body, interpreter state need not be kept up to date. A region body can have multiple return points due to exceptions (in straight-line code) or trace exits.

Packaging generated code as callable also aims to support an incremental style of development, in this case allowing new and presumably larger or more highly optimized regions of the virtual program to be identified, compiled and dispatched. Currently, Yeti dispatches single virtual instruction bodies, subroutine-threaded region bodies for straight-line sections of code, and interpreted and compiled traces.

Section [*] gives an overview of our implementation. Section [*] describes how regions are identified. The runtime environment of a trace is described in Section [*]. Section [*] describes how region bodies are generated for interpreted and JIT compiled traces. Finally, Section [*] describes ways in which our implementation is challenged by the software environment in which it is implemented.


1 Structure and Overview of Yeti

Our system starts operating as a simple direct call threaded (DCT) interpreter, as discussed in Section [*]. After each instruction has run once, instrumentation called from the dispatch loop identifies straight-line sections of the virtual program, and simple subroutine-threaded region bodies are generated for them. These are installed by overwriting the Direct Threading Table (DTT) slot corresponding to the first virtual instruction in the region with the entry point of the new region body. Subsequently, the subroutine-threaded code executes. The system, up to this point, is operating as a lazily loaded subroutine-threaded interpreter. This alone can speed up programs with long linear blocks (like compress and mpeg) relative to direct-threaded performance.

As the program executes, profiling associates and updates event counters in a payload structure corresponding to each region. Eventually, hot traces are identified and translated to region bodies. We will describe two ways traces are compiled. Interpreted traces, described in Section [*], implement traces in the simplest way we could conceive of, whereas JIT compiled traces, described in Section [*], compile the virtual instructions in each trace to register-allocated native code. A novel aspect of our JIT is that it compiles only a subset of virtual instructions while falling back on dispatch for the remainder. Currently, our system generates code for about 50 integer and object virtual instructions, including all of Java's conditional branch instructions. We have invested no effort in classical optimizations apart from a relatively simple variation on inlining when the invocation and return of a method occur in the same trace.

Ordinarily, DCT is slow, because it suffers a branch misprediction penalty for almost every iteration of the dispatch loop, but this turns out not to be a performance problem for Yeti. As hot region bodies are identified, installed, dispatched, and linked together, execution shifts almost entirely to within the region bodies and consequently the overhead of the dispatch loop becomes negligible.

1 Initial Load

Figure [*] shows how our running example (Figure [*]) is loaded by Yeti. In the figure, the bodies are the same C coded virtual instruction bodies we show in Figure [*]. Initially all instances of an instruction, like the two instances of iload in the figure, point to the same shared region bodies. This makes the initial load lightweight, since no code needs to be generated and a small, static set of region bodies and associated profiling payloads is shared by all instances of virtual instructions.

Like direct threading and regular DCT, Yeti loads each virtual instruction into one or more slots in the DTT when the virtual program is loaded. Arguments to virtual instructions are handled exactly as in DCT or direct threading. However, we have enhanced the representation of the virtual opcode significantly. In Yeti, we add a level of indirection - the first DTT slot of each instruction points to an instance of a dispatcher structure instead of the address of a virtual instruction body.

2 Dispatcher

It is the need to efficiently associate the vPC with both the body (for dispatch) and the payload (for profiling) that motivates the extra indirection in our design. The alternative would be to maintain a side table associating the payload and vPC. We chose the current arrangement over a hash table because it is simpler.

The dispatcher structure contains four key fields. The region body to be dispatched is stored in the body field. The preworker and postworker fields store the addresses of instrumentation functions to be called before and after the dispatch of the region body respectively. Finally, the dispatcher has a payload field, which is a region of profiling or other data that the instrumentation needs to associate with the region body. The most obvious use of the payload is to count events associated with each region body. We define specialized payload structures to describe virtual instructions, linear blocks, and traces.
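
As a concrete illustration, the dispatcher might be declared in C roughly as follows. The type and field names here are ours, chosen for exposition, and are not taken from Yeti's source:

struct tcs;          /* thread context structure, described below */
struct dispatcher;

/* instrumentation hook: receives the thread context and the dispatcher */
typedef void (*Worker)(struct tcs *tcs, struct dispatcher *d);

struct dispatcher {
    void  *body;        /* entry point of the region body to dispatch */
    Worker preworker;   /* instrumentation run before the dispatch    */
    Worker postworker;  /* instrumentation run after the dispatch     */
    void  *payload;     /* profiling data describing the region body  */
};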

When a dispatcher is created, specific preworker and postworker functions are chosen depending on the type of region body the dispatcher describes. The design is object-based in the sense that the choice of a given preworker and postworker determines the behavior of the instrumentation for the given region body. In our design, the workers assume that they are always associated with a specific type of payload.

3 Dispatch Loop

The dispatch loop, shaded in Figure [*], requires an extra level of indirection to call each body. The overhead of the extra indirection is of little concern as any given instruction will be executed only a few times using this generic mechanism.

Figure [*] also illustrates how instrumentation code for the region is called before (the preworker) and after (the postworker) the instruction body is executed. Initially instrumentation is interposed around the dispatch of each virtual instruction. This is convenient as it puts the runtime in control when the destination of each virtual branch has been determined but before it is dispatched. Later, as larger region bodies are installed, instrumentation is dispatched before and after the execution of the region body (no longer after each instruction).
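
Expressed in C, the generic dispatch loop is roughly the following sketch; the TCS argument is introduced in the next subsection, and the names are illustrative rather than Yeti's actual identifiers:

extern void **vPC;   /* virtual program counter, points into the DTT */

void interp(struct tcs *tcs)
{
    for (;;) {
        /* extra indirection: the DTT slot holds a dispatcher pointer */
        struct dispatcher *d = (struct dispatcher *)*vPC;
        d->preworker(tcs, d);          /* e.g. region recording       */
        ((void (*)(void))d->body)();   /* callable body updates vPC   */
        d->postworker(tcs, d);         /* e.g. trace exit profiling   */
    }
}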

An interesting feature omitted from the figure is that Yeti actually has several specialized dispatch loops. For instance, when a trace is dispatched the only remaining event to monitor is the emergence of a hot trace exit. Overhead can be significantly reduced by providing a specialized dispatch loop exclusively for traces that inline only the required instrumentation. In general, profiling can be optimized, or turned off altogether, by changing dispatch loops.


Figure: Virtual program loaded into Yeti showing how dispatcher structures are initially shared between all instances of a virtual instruction. The dispatch loop, shaded, is similar to the dispatch loop of direct call threading except that another level of indirection, through the dispatcher structure, has been added. Profiling instrumentation is called before and after the dispatch of the body.

\includegraphics[width=1\columnwidth,keepaspectratio]{figs/yeti-dispatcher}

4 Thread Context Structure

Modern virtual machines support multiple threads of execution. Our design, like many modern interpreters, requires that each new interpreter thread runs in a separate pthread starting with a new invocation of the interp function. This means that any local variables declared in interp are thread-private data. The DTT, dispatchers and region bodies, on the other hand, are shared by all threads.

Yeti needs a small additional amount of thread-private data for its own purposes. To keep all thread-private data together, we have added a new structure to the interp function called the thread context structure, or TCS. The TCS contains only a few fields, mostly in support of region identification and trace exit profiling. For instance, in support of region identification, the TCS provides the recordMode bit, which indicates whether the current thread is actively recording a region, and the history list, which records region bodies as they are executed. Section [*] describes the role played by the TCS in profiling trace exits.
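
A C sketch of the TCS follows. The recordMode bit and history list are as described above; the two trace exit fields anticipate Section [*], which explains how a trace exit handler records which exit was taken. Field names are illustrative:

struct tcs {
    int recordMode;               /* set while recording a region         */
    struct payloadList *history;  /* payloads of region bodies executed   */
    void *texPayload;             /* trace payload of the last trace exit */
    int   texIndex;               /* index of that trace exit             */
};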

A pointer to the TCS is passed to the preworker and postworker each time they are called. For simplicity, the TCS was omitted from Figure [*] but appears in Figure [*], where it is the root of the history list.


2 Region Selection

Our strategy for identifying hot regions of the program is carried out by preworkers and postworkers in conjunction with state information passed in the TCS. When the profiling instrumentation discovers the beginning of a new region to be compiled into a region body it sets the recordMode bit in the TCS. As described below, this may be done by the preworker (as for linear blocks) or the postworker (as for traces). Once the recordMode bit is set, the thread is actively collecting a region of the program. In this mode the preworker appends the payload of each region body about to be executed to the thread-private history list in the TCS.

Eventually a preworker or postworker will recognize that execution has reached the end of the region to be collected and clear recordMode. At this point a new region body is generated from the history list.


1 Initiating Region Discovery

We ignore the first execution of each instance of a virtual instruction before considering it for inclusion in a region body. First, as discussed in Section [*], late binding languages like Java may rewrite some virtual instructions the first time they execute. We should delay region selection until after these instructions have been rewritten. Second, some virtual instructions, for instance static class initialization blocks in Java, only execute once. This suggests that we should always wait until the second execution before considering a virtual instruction.

The obvious way of implementing this is to increment a counter the first time an instruction executes. However, this cannot be implemented with our loading strategy because a shared dispatcher has no simple way of counting how many times a specific instance has been dispatched. For example, in Figure [*] both instances of iload share the same dispatcher and payload, so there is no place to maintain a counter for each instance.

Hence, after the first execution, the preworker replaces the shared dispatcher with a new, non-shared, instance of a block discovery dispatcher. The second time the instruction is dispatched, the block discovery dispatcher sets about identifying linear blocks, as described next.


2 Linear Block Detection


Figure: Shows a region of the DTT during block recording mode. The body field of each block discovery dispatcher points to the corresponding virtual instruction body (only the body for the first iload is shown). The dispatcher's payload field points to an instance of an instruction payload. The thread context structure is shown as TCS.

\includegraphics[width=1\columnwidth]{figs/gradualBbRecordMode}

A linear block is a runtime approximation of a basic block, namely a straight-line section of the virtual program ending with a branch. The process of identifying linear regions of the program is carried out by the block discovery preworker based on state information it is passed in the TCS.

We start our explanation of how block discovery works with a detailed walk-through of how the block discovery preworker identifies a new linear block. Suppose a block discovery preworker is called for an instance of virtual instruction i at vPC. A block discovery dispatcher was installed for i after it executed for the first time. Hence, whenever the block discovery preworker is called there are two possibilities. If recordMode is set then i should simply be appended to the history list (in the TCS) and thus added to the linear region currently being recorded[*]. Otherwise, if recordMode is clear, then i must begin a new linear block. (If there already was a linear region starting at vPC, then a dispatcher for that region body would have executed instead.)

The preworker recognizes the end of the linear region when it encounters a virtual branch instruction. At this point recordMode is cleared, and a new subroutine-threaded region body is generated from the instructions on the history list. Figure [*] illustrates an intermediate stage during the identification of the linear block of our running example. The preworker has appended the payload of each instruction onto the thread's history list, rooted in the TCS. In the figure, a branch instruction, a goto, will end the current linear block.
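
Put together, the block discovery preworker behaves roughly as the following C sketch; the helper functions are hypothetical names for the steps just described:

static void blockDiscoveryPreworker(struct tcs *tcs, struct dispatcher *d)
{
    if (!tcs->recordMode) {            /* i begins a new linear block */
        tcs->recordMode = 1;
        historyClear(tcs);
    }
    historyAppend(tcs, d->payload);    /* add i to the region being recorded */
    if (isVirtualBranch(d->payload)) { /* a branch ends the linear block */
        installLinearBlockBody(tcs);   /* generate subroutine-threaded code
                                          and a new linear block dispatcher */
        tcs->recordMode = 0;
    }
}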

Figure [*] illustrates the situation just after the collection of the linear block. The dispatcher corresponding to the entry point of the linear block has been replaced by a new linear block dispatcher whose job it will be to search for traces. The linear block dispatcher points to a new payload created from the history list; its body field points to a subroutine-threading-style region body that has been generated for the linear block. Note that linear blocks are not basic blocks because they do not end at labels. If the virtual program later branches to a virtual address that happens to be in the middle of a linear block our system will create a new linear block that replicates the tail of the original.


Figure: Shows a region of the DTT just after block recording mode has finished.

\includegraphics[width=0.8\columnwidth,keepaspectratio]{figs/gradualBb}


3 Trace Selection

The postworker of a linear block dispatcher is called after the last virtual instruction of the linear block has executed. Since, by definition, linear blocks end with branches, after executing the last instruction the vPC has been set to the destination of the branch and hence points to one of the successors of the linear block. The postworker runs at exactly the right moment to profile edges of the control flow graph, namely after each branch destination is known, and yet before the destination is executed.

If the vPC of the destination is less than the vPC of the virtual branch instruction itself, this is a reverse branch - a likely candidate for the latch of a loop. According to the heuristics developed by Dynamo (see Section [*]), hot reverse branches are good places to start the search for hot code. Accordingly, when our system detects a reverse branch that has executed 100 times[*], it enters trace recording mode. In trace recording mode, similar to linear block recording mode, the postworker adds each linear block payload to the thread's history list. The situation is very similar to that illustrated in Figure [*], except that the history list describes linear blocks instead of virtual instructions. Our system, like Dynamo, ends a trace (i) when it reaches a reverse branch or finds a cycle, or (ii) when it contains too many (currently 100) linear blocks.
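
The reverse-branch heuristic amounts to only a few lines in the linear block postworker. In the sketch below, only the threshold of 100 comes from the text; the structure and names are our own illustration:

struct blockPayload { void **branchVPC; int reverseCount; /* ... */ };

static void linearBlockPostworker(struct tcs *tcs, struct dispatcher *d)
{
    struct blockPayload *p = d->payload;
    /* vPC already holds the branch destination at this point */
    if (vPC < p->branchVPC && ++p->reverseCount >= 100)
        tcs->recordMode = 1;   /* hot reverse branch: start recording a trace */
}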

When trace generation ends, a new trace dispatcher is created and installed. This is quite similar to Figure [*] apart from the need to support trace exits. The payload of a trace dispatcher includes a table of trace exit descriptors, one for each linear block in the trace. See Figure [*].

Although code could be generated for the trace at this point, we postpone code generation until the trace has run a few times, currently five, in trace training mode[*]. Trace training mode uses a specialized dispatch loop that calls additional instrumentation before and after dispatching each virtual instruction in the trace. The instrumentation is passed pointers to various interpreter variables (top of the expression stack, a description of the currently executing method, etc). In principle, almost any detail of the virtual machine's state can be recorded. Currently, we record the class of every Java object upon which a virtual method is invoked.

Once the trace has been trained, we generate and install a region body. We have implemented two different mechanisms for generating code for a trace. Early in the project we implemented a simple approach, interpreted traces, that generates very simple subroutine-threaded style code for each trace. Then, with a great deal more effort, we implemented our trace-based JIT compiler. Both approaches are described in Section [*].

Before we discuss code generation, we need to describe the runtime of the trace system and especially the operation of trace exits.


3 Trace Exit Runtime

One of the properties that make traces a desirable shape of region body is that they predict hot paths through the virtual program. If the predictions are good, and the Dynamo results suggest that they are, we assume that most trace exits are not taken. The trace exits that are taken, however, quickly become hot and hence new traces must be generated and linked. This means that it will likely pay to burden the implementation of a trace exit with some extra overhead if this makes the path through the trace more efficient.

We use a combination of code generation (in the region body for the trace) and runtime profiling instrumentation (in the postworker called after each trace returns to the dispatch loop) to detect which trace exits are occurring and to decide what to do about them.

Trace exits occur when execution diverges from the path collected during trace generation, or in other words, when the destination of a virtual branch instruction in the trace is different from what was recorded during trace generation. Generated trace exit code in the trace detects the divergence and branches to a trace exit handler. Generated code in the trace exit handler records which trace exit has occurred by storing, into the TCS, the address of the trace payload (to identify the trace) and the index of the trace exit (to identify the specific branch). The trace exit handler then returns to the dispatch loop, which, as usual, calls the postworker. The postworker uses the information in the TCS to update the trace exit profiling information in the trace payload.
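
Expressed as C for clarity (Yeti actually emits machine code for this), a trace exit handler does no more than the following; the payload address and exit index are hard-wired into each handler when it is generated:

static void traceExitHandler_k(void)     /* generated, one per trace exit */
{
    tcs->texPayload = thisTracePayload;  /* identify the trace            */
    tcs->texIndex   = K;                 /* identify the specific exit    */
    /* return to the dispatch loop, which then calls the postworker */
}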

This scheme minimizes overhead for traces that complete or link at the expense of cold trace exits. Conceptually, the postworker has only a few alternatives to choose from:

  1. If the trace exit is still cold, increment the counter corresponding to the trace exit in the trace payload.
  2. Notice that the counter has crossed the hot threshold and arrange to generate a new trace.
  3. Notice that a trace already exists at the destination and link the trace exit handler to the destination trace.
Alternative 1 is trivial: the postworker increments a counter and returns. Alternative 2 is also simple: the postworker sets the recordMode bit in the TCS and the destination trace will start being collected immediately. Alternative 3 is more challenging and will be described in the next section.
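
The postworker's choice among the three alternatives can be sketched in C as below; the helper names, threshold constant, and trace exit descriptor layout are hypothetical:

struct traceExit { int counter; void *handler; };  /* cf. trace exit table */

static void tracePostworker(struct tcs *tcs)
{
    struct traceExit *te = exitDescriptor(tcs->texPayload, tcs->texIndex);
    if (traceExistsAt(vPC))
        linkTraceExit(te, vPC);          /* alternative 3: link the handler */
    else if (++te->counter >= HOT_EXIT)
        tcs->recordMode = 1;             /* alternative 2: record new trace */
    /* else alternative 1: still cold; counter was just incremented */
}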

1 Trace Linking

The goal of trace linking is to rewrite the trace exit handler of a hot trace exit to branch directly to the destination trace rather than return to the dispatch loop. The actual mechanism we use depends on the underlying virtual branch instruction. There are two main cases, branches with only one off-trace destination and branches with multiple off-trace destinations.

Regular conditional branches, like Java's if_icmp, are quite simple. The branch has only two destinations, one on the trace and the other off. When the trace exit becomes hot a new trace is generated starting with the off-trace destination. Then, the next time the trace exit occurs, the postworker links the trace exit handler to the new trace by rewriting the branch instruction in the trace exit handler to jump directly to the destination trace instead of returning to the dispatch loop. Subsequently, execution stays in the code cache for both paths of the program.

Multiple destination branches, like method invocation and return, are more complex. When a trace exit originating from a multi-way branch occurs, we are faced with two additional challenges. First, profiling multiple destinations is more expensive than just maintaining one counter. Second, when one or more of the possible destinations are also traces, the trace exit handler needs some mechanism to jump to the right one.

The first challenge we essentially ignore: we maintain a simple counter and generate traces for all destinations of a hot trace exit as they arise. The danger of this strategy is that we may generate traces for superfluous cold destinations, wasting trace generation time and code cache memory.

The second challenge concerns the efficient selection of a destination trace to which to link, and the mechanism used to branch there. To choose a destination, we follow the heuristic developed by Dynamo for regular branches - that is, we link to destinations in the order they are encountered. The rationale is that the highest probability trace exits will occur first[*]. At link time, we rewrite the code in the trace exit handler with code that checks the value of the vPC. If it equals the vPC of a linked trace, we branch directly to that trace; otherwise we return to the dispatch loop. Because the specific values of the vPC for each destination trace are visible to the postworker, we can hard-wire the comparand in the generated code. In fact, we can generate a sequence of compares checking for each of the multiple destinations in turn. Eventually, a sufficiently long cascade would perform no better than a trip around the dispatch loop. Currently we limit ourselves to two linked destinations per trace exit. This mechanism is similar to the technique used for interpreted traces, described next.
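
After linking, the rewritten tail of such a trace exit handler behaves like the following C; Yeti generates equivalent machine code, with the vPC of each linked destination hard-wired as a compare immediate:

if (vPC == DEST_VPC_1) goto trace_body_1;  /* first linked destination      */
if (vPC == DEST_VPC_2) goto trace_body_2;  /* second (the current limit)    */
/* no linked destination matches: return to the dispatch loop as before */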


Figure: Schematic of a trace illustrating how the trace exit table (shaded) in the trace payload records the on-trace destination of each virtual branch.

\begin{centering}
\includegraphics[width=1\columnwidth,keepaspectratio]{figs/traceRegion}
\end{centering}


4 Generating Code for Traces

Generating code for a trace consists of two main tasks: generating the main body of the trace and generating a trace exit handler for each trace exit. After trace selection, the TCS history list contains the linear block payloads that were selected. By traversing the list we can visit each virtual instruction in the trace.

We describe two different strategies for compiling a trace. Both schemes use the same runtime and carry out trace linking identically. Interpreted traces, described next, represent our simplest approach to generating code for a trace. JIT compiled traces, described in Section [*], contain a mixture of compiled code and dispatch.

Figure [*] gives a schematic for a hypothetical trace. As shown in the figure, the dispatcher is the root of the data structure and points to the payload and the entry point of the region body. The payload contains a counter (not shown in the figure) and a trace exit table. The trace exit table is an array of trace exit descriptors, one for each trace exit in the trace. Each trace exit descriptor contains a counter (not shown) and a pointer to the trace exit handler for each trace exit. The counter is used to determine when a trace exit becomes hot. The pointer to the trace exit handler is used to mark the location that will be rewritten for trace linking.


1 Interpreted Traces

Interpreted traces require only slightly more complex code generation than subroutine threading, but are about as effective as branch inlining (see Section [*]) at reducing the overhead of dispatching virtual branch instructions. We call them interpreted because no virtual instruction bodies are compiled inline; rather, an interpreted trace dispatches all virtual instruction bodies, including virtual branches.

The trace payload identifies each linear block in the trace and each linear block payload lists every virtual instruction. Hence, by iterating over the linear block payloads the straight line portions of a trace can be easily implemented as regions of subroutine-threaded code.

Trace exits require only slightly more complicated code generation. A trace is a hot path through the virtual program, or put another way, a trace predicts the value of the vPC after each of its constituent virtual branch instructions has executed. Taking this view, the purpose of each trace exit is to ensure that the branch it guards has set the vPC to the on-trace destination. The on-trace destination of each virtual branch is recorded in the trace payload as the trace is generated. Hence, the simplest possible implementation of a trace exit must do three things. First, it dispatches the virtual branch body. Second, it compares the value of the vPC, the destination of the branch, to the on-trace vPC predicted by the trace. A compare immediate can be used, since the on-trace value of the vPC is known and is constant. Third, it conditionally branches to the trace exit handler if the comparison fails.
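
In C, the generated code for one such trace exit amounts to the three steps just listed (Yeti emits subroutine-threaded native code for this); ON_TRACE_VPC stands for the hard-wired prediction:

callBody(branch_body);        /* 1: dispatch the virtual branch body     */
if (vPC != ON_TRACE_VPC)      /* 2: compare immediate against prediction */
    goto trace_exit_handler;  /* 3: leave the trace on a mispredict      */
/* fall through: continue with the next on-trace instruction */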

This code is somewhat reminiscent of the branch replication technique we described in Section [*], except that instead of following the dispatch of the virtual branch body with an expensive indirect branch, we generate a compare immediate followed by a direct conditional branch to the trace exit handler. We expect this code to be easy for the underlying processor to predict because the direct conditional branch is fully exposed to the branch history predictors. As we shall show in the next chapter, interpreted traces achieve a level of performance similar to subroutine threading plus branch inlining.


2 JIT Compiled Traces

Our JIT does not perform any classical optimizations and does not build any internal representation before compiling a trace. As traces contain no merge points, we perform a single pass through each trace allocating expression stack slots to registers and generating code.

An important aspect of our JIT design is that it can generate code for a trace before it supports all virtual instructions. Our JIT generates register allocated machine code for contiguous sequences of virtual instructions it recognizes. When an unfamiliar virtual instruction is encountered, code is generated to flush any temporary values held in registers back to the Java expression stack. Then, the bodies of any uncompilable or unfamiliar virtual instructions are dispatched using subroutine threading. This significantly eases development as the compiler can be extended one virtual instruction at a time. The same tactics can be used for virtual instructions that the JIT partially supports. When the compiler encounters an awkward corner case it can simply give up and fall back to subroutine dispatch instead.
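
The compiler's central loop can therefore be sketched as follows; the helper names are hypothetical, but the structure mirrors the description above:

for (struct insPayload *v = firstIns(trace); v != NULL; v = v->next) {
    if (jitSupports(v->opcode) && !isAwkwardCase(v)) {
        compileIns(v);     /* emit register-allocated native code        */
    } else {
        genFlushRegs();    /* spill temporaries to the expression stack  */
        emitCallBody(v);   /* fall back to subroutine-threaded dispatch  */
    }
}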

Expression stack slots are assigned to registers, freeing the generated code from maintaining the expression stack. Immediate arguments to virtual instructions, normally loaded from the DTT, are loaded into registers using load immediate instructions whenever possible. This frees the generated code from maintaining the vPC.

Machine code generation is performed using the ccg [piumarta:ccg] runtime assembler.

1 Dedicated Registers

The code generated by Yeti must be able to load and store values to the same Java expression stack and local variable array referred to by the C code implementing the virtual instruction bodies. Our current PowerPC implementation side-steps this difficulty by dedicating hardware registers for the values that must be shared between our generated code and C generated bodies. At present we dedicate registers for the vPC, the top of the Java expression stack and the pointer to the base of the local variables. Code is generated to adjust the value of the dedicated registers as part of the flush sequence, described below.

On targets with fewer registers, notably Intel's Pentium, there may not be enough general purpose registers to dedicate three of them for our own purposes. There, we plan to generate code that accesses the variables in memory.


2 Register Allocation

Java virtual instructions, and those of many other virtual machines, pop arguments off and push results onto an expression stack (see Section [*]). Naive compilation of the pushes and pops would result in many redundant loads, stores and adjustments of the pointer to the top of the expression stack. Our JIT assigns the temporary values to registers instead.

Our register allocator and code generator are combined and perform only one pass. As we examine each virtual instruction we maintain a compile-time structure we call the shadow stack. The shadow stack associates each value in an expression stack slot with the register to which it has been assigned. Whenever a virtual instruction would pop one of its inputs we first check whether there already is a register for that value in the corresponding shadow stack slot. If so, we use the register instead of generating any code to pop the expression stack. Similarly, whenever a virtual instruction would push a new value onto the expression stack we assign a new register to the value and push this on the shadow stack, forgoing any code to push the value onto the expression stack.

A convenient property of this approach is that every value assigned to a register always has a home location on the expression stack. If we run out of registers we simply spill the register whose home location is deepest on the shadow stack (as all the shallower values will be needed sooner [piumarta:vm04]).
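
For example, compiling Java's iadd under this scheme might proceed as in the sketch below; the helper names are hypothetical:

int rhs = shadowPop();   /* register holding the top slot, if any        */
int lhs = shadowPop();   /* a pop emits a load only for spilled values   */
int dst = allocReg();    /* may spill the deepest shadow stack value     */
emitAdd(dst, lhs, rhs);  /* no expression stack traffic is generated     */
shadowPush(dst);         /* the result stays in a register               */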


3 Flushing Registers to Expression Stack

The simple strategy for assigning expression stack slots to registers we have described assumes that execution remains on the trace and that all instructions have been compiled. However, when a trace exit is taken, or when the JIT needs to fall back to calling a virtual instruction body, all values in registers must be saved back to the expression stack.

Flush code is generated by scanning the shadow stack to find every expression stack slot currently assigned to a register. A store is generated for each such live register, saving it to its home location on the expression stack. Then, the shadow stack is reinitialized to empty and all registers are marked as free.

Generated code typically does not need to maintain the dedicated registers - for instance, the top of the expression stack or the vPC - until it is about to return to the interpreter. Generated flush code therefore updates the values held by the dedicated registers as well.
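
Flush code generation can be sketched as below; the helpers are hypothetical names for the steps described in this subsection:

for (int i = 0; i < shadowDepth(); i++)
    if (slotInRegister(i))
        emitStore(slotRegister(i), homeLocation(i)); /* save to home slot */
emitSetTOS(shadowDepth());  /* update dedicated top-of-stack register     */
emitSetVPC(currentVPC);     /* update dedicated vPC register              */
shadowReset();              /* shadow stack empty, all registers free     */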


4 Trace Exits and Trace Exit Handlers

The virtual branch instruction ending each block is compiled into a trace exit. We follow two different strategies for trace exits. In the first case, regular conditional branch virtual instructions are compiled by our JIT into machine code that conditionally branches to a trace exit handler when execution would leave the trace. The generated code implements the semantics of the virtual instruction body, comparing and conditionally branching on values in registers; it does not access the vPC. PowerPC code for this case appears in Figure [*]. The sense of the conditional branch is adjusted so that the branch is always not-taken for the on-trace path. In the second case, more complex virtual branch instructions, such as method invocation and return, which may have multiple destinations, are handled as for interpreted traces. (Polymorphic method dispatch is also handled this way if it cannot be optimized as described in Section [*].)

Trace exit handlers have two further roles. First, since JIT compiled traces hold values in registers, it may be necessary to flush those values and update the dedicated registers. For instance, in Figure [*], the trace exit handler adjusts the vPC. Flush code is the only difference between trace exit handlers for interpreted and compiled traces. Second, trace linking is achieved by overwriting code in the trace exit handler. (This is the only situation in which we rewrite code.) To link traces, the tail of the trace exit handler is rewritten to branch to the destination trace rather than return to the dispatch loop.

The trace link branch occurs after the flush code, which means that registers are flushed only to be reloaded by the destination trace. We have not yet implemented any optimization to address this redundancy. However, if the shadow stack at the trace exit were to be saved aside, it could be used to prime the compilation of the destination. Then, the trace link could be inserted before the flush code.

Most trace exit handlers are reached only when a conditional trace exit is taken. The only exception occurs when a trace executes to completion. Then, control must return to the dispatch loop. To implement this, each trace ends with an in-line trace exit handler. Like any other trace exit handler, it may later be linked to its destination trace if one becomes hot.


Figure: PowerPC code for a portion of a trace region body, showing details of a trace exit and trace exit handler. This code assumes that r26 has been dedicated for the vPC. In addition, the generated code in the trace exit handler uses r30, the stack pointer as defined by the ABI, to store the trace exit id into the TCS.

\includegraphics[width=1\columnwidth]{figs/traceExit}


3 Trace Optimization

We describe two optimizations here: how loops are handled and how the training data can be used to optimize method invocation.

1 Inner Loops

An intrinsic property of Dynamo's trace selection heuristic is that the innermost loops of a program are often selected into a single trace ending with the loop-closing reverse branch. This occurs because trace generation starts at the target of reverse branches and ends whenever it reaches a reverse branch. Note that there may be many branches, including calls and returns, along the way. When the trace is compiled, the loop is trivial to find because the last virtual instruction in the trace is a virtual conditional branch back to its entry.

Inner loops expose a problem with the way we end a trace. Normally, a trace exit is compiled as a taken branch to the trace exit handler for the off-trace path and a fall-through for the on-trace path. If this approach were followed, each iteration of a hot inner loop would execute through to the inline trace exit handler at the end of the trace and return to the dispatch loop. Soon this trace exit would become hot and trace linking would rewrite the inline trace exit to branch back to the head of the trace. To avoid the extra branch and pointless trace linking, the trace JIT compiles a reverse branch differently - reversing the sense of the trace exit and generating a reverse conditional branch back to the entry point of the trace.

Thus far, we have not exploited this information to optimize the body of the trace. For example, it would be relatively easy to detect loop invariant instructions and move them to a newly constructed loop preheader. However, the flow graph of the resulting unit of compilation would then include a merge point because the head of the loop would have two inbound edges (the back edge and the edge from the preheader). The register allocation scheme we have described does not support merge points.

2 Virtual Method Invocation

So far, all the trace exits we have described have been translations of virtual branch instructions. However, a trace exit can be used to guard other speculative optimizations as well. Our strategy for optimizing virtual method invocation is to generate a guard trace exit that is much cheaper than a full method dispatch. If the guard code falls through, we know execution should continue along the trace.

Specifically, if the class of the invoked-upon object differs from the class recorded when the trace was generated, a trace exit must occur. At trace generation time we know the on-trace destination of each call. From the training profile, we know the class of each invoked-upon object. Thus, we can easily generate a virtual invoke guard that branches to the trace exit handler if the class of the object on top of the expression stack is not the same as recorded during training. Then, we can generate code to perform a faster, stripped-down version of method invocation. The savings are primarily the work associated with looking up the destination given the class of the receiver. This technique was independently invented by Gal et al. [gal:hotpath].
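
In C, the generated guard amounts to the following (Yeti emits equivalent machine code); TRAINED_CLASS is hard-wired from the training profile and the helper names are illustrative:

if (classOf(receiverOnTopOfStack()) != TRAINED_CLASS)
    goto trace_exit_handler;  /* class differs: leave the trace        */
/* fall through: stripped-down invocation of the known destination,
   skipping the lookup of the destination given the receiver's class */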

3 Inlining

Traces are agnostic towards method invocation and return, treating them like any other multiple-destination virtual branch instructions. However, when a return corresponds to an invoke in the same trace, the trace compiler can sometimes remove almost all method invocation overhead. Consider the case where the code between a method invocation and the matching return is relatively simple: for instance, it does not touch the callee's stack frame (other than the expression stack), it cannot throw an exception, and it makes no further method invocations. Then, we can eliminate the invoke altogether, and the only method invocation overhead that remains is the virtual invoke guard. If the inlined method body contains any trace exits, the situation is slightly more complex. In that case, in order to prepare for a return somewhere off-trace, the trace exit handlers for the trace exits in the inlined code must modify the expression stack exactly as the (optimized away) method invocation would have done.


5 Other Implementation Details

Our system, as described in this chapter, generates code that coexists with virtual instruction bodies written in C. Consequently, the generated code must be able to access a few interpreter variables like the vPC, the top of the expression stack, and the base of the local variable array. For these heavily used interpreter variables, on machines with sufficient general purpose registers, we take the obvious approach of assigning the variables to dedicated registers. Dedicating the registers might even improve the quality of the code the compiler generates for the interpreter; we note that on PowerPC, OCaml dedicates registers for the vPC and a few other commonly used values, presumably because it performs better this way.

A related challenge arises in our implementation of trace exit handlers. We want on-trace execution to be free of trace exit related overhead. At the same time, we need a way of recording which trace exit has occurred so that we can determine which trace exits are hot. This means that each trace exit handler, which is a region of code specific to a trace exit generated by Yeti, must have a way of writing into the TCS. On the PowerPC we could dedicate yet another register to point to the TCS. However, this could only hurt the performance of the virtual instruction bodies, since they never refer to the TCS. Instead, we indulge in some unwarranted chumminess with gcc. Using a trick invented by Vitale, we use gcc inline asm statements to obtain a string containing the assembler gcc would generate to access the desired field in the TCS [ct-tcl2005]. Then, we parse the string and extract all the information we need to generate code to access the field.

Our use of a dispatch loop, similar to Figure [*], in conjunction with making virtual bodies callable by inserting inlined assembler return instructions, results in a control flow graph that is not apparent to the optimizer. First, the optimizer cannot know that the label at the head of each virtual instruction body can be reached by the function pointer call in the dispatch loop. (The compiler assumes, quite reasonably, that the function pointer call only reaches the entry point of functions.) Second, the optimizer does not know that control flows from the inlined return instruction back to the dispatch loop. We work around these difficulties by inserting computed gotos (which never actually execute) to simulate the missing edges.
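
A sketch of the workaround appears below; opaque_false is a hypothetical variable initialized to zero that gcc cannot prove constant, so the computed gotos never execute but the missing edges appear in the CFG (this relies on gcc's labels-as-values extension):

if (opaque_false)
    goto *body_entry_label;     /* simulates: dispatch loop -> body label */
/* ... and, after the inlined return instruction inside a body: */
if (opaque_false)
    goto *dispatch_loop_label;  /* simulates: body -> dispatch loop       */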


6 Chapter Summary

In this chapter we have described the design trajectory for a high-level language virtual machine that extends from a very simple interpreter, through a high-performance trace-based interpreter, to an extensible trace-based JIT compiled system. Our design goals are much more ambitious than in the preceding two chapters. There, we concentrated on how an interpreter can be made more efficient. In this chapter we presented a design that supports the evolution of a high-level language VM from a simple interpreter to a JIT. Thus, we favour infrastructure that supports the development of a JIT, for instance our dispatcher-based instrumentation, over infrastructure that is more narrowly focused on a specific interpretation technique.

An aspect of context threading that is somewhat unpalatable is that the effort invested in implementing branch inlining, apply/return inlining and tiny inlining does nothing to facilitate the later addition of a JIT compiler. For instance, implementing branch inlining in the interpreter runs the risk of being a throw-away effort - if evolving performance requirements eventually lead to the implementation of a JIT, then a good deal of the effort spent building branch inlining will have to be duplicated.

In contrast to this, Yeti builds its advanced interpretation techniques on top of infrastructure that is intended to facilitate the addition of a JIT. For instance, interpreted traces require trace-based profiling that is also required to support the trace-based JIT. As we will show in the next chapter, interpreted traces perform just as well as branch inlining.

With the resources at our disposal, it is not feasible to show that the performance potential of our trace-based JIT compiler is equal to an optimizing method-based JIT like those deployed by Sun or IBM. Our design is intended to support any shape of region body, so in a sense, the peak performance of traces is not a limiting factor, since with sufficient engineering effort, peak performance could always be achieved by compiling inlined method nests.

Instead, we concentrated our JIT compiler design efforts on how to support only a subset of virtual instructions, added one at a time. We found this was a convenient way to work, much easier than bringing up a regular compiler, since interactions between code generation bugs were much reduced. Currently our JIT consists of only about 2000 statements of C source code, about half machine dependent, and compiles about 50 integer virtual instructions. Nevertheless, as we will show in the next chapter, our JIT improves the performance of the SPECjvm98 benchmarks by about 24% over interpreted traces.

The main problem with the implementation of our prototype is that our generated code depends too heavily on gcc. There are two main issues. First, our generated code occasionally needs to access interpreter values. On the PowerPC we were able to side-step the potential difficulties by dedicating registers for key interpreter variables, but clearly another approach will be necessary for 32 bit Intel processors, which have too few general purpose registers to dedicate to any interpreter variables. Second, the way we have packaged virtual instruction bodies and called them via a function pointer (Figure [*]) hides the true control flow of the interpreter from the C optimizer. We will discuss how this might be avoided by packaging bodies as nested functions in Chapter [*].

Next, in Chapter [*], we will evaluate the performance of our prototype.

