In this chapter we evaluate context threading by comparing its performance to direct threading and direct-threaded selective inlining. We evaluate the impact of each of our techniques on Pentium 4 and PowerPC processors by measuring the performance of modified versions of SableVM, a Java virtual machine, and ocamlrun, an OCaml interpreter. We explore the differences between context threading and SableVM's selective inlining further by measuring a simple extension of context threading we call tiny inlining. Finally, we illustrate the range of improvement possible with subroutine threading by comparing the performance of subroutine-threaded Tcl and subroutine-threaded OCaml to direct threading on Sparc.
The overall results show that dispatching virtual instructions by calling virtual instruction bodies is very effective for Java and OCaml on Pentium 4 and PowerPC platforms. In fact, subroutine threading outperforms direct threading by a healthy margin of about 20%. Context threading is almost as fast as selective inlining as implemented by SableVM. Since these are dispatch optimizations, they offer performance benefits depending on the proportion of dispatch to real work. Thus, when a Tcl interpreter is modified to be subroutine-threaded, performance relative to direct threading increases by only about 5%. Subroutine-threaded OCaml is 13% faster than direct threading on the same Sparc processor.
We begin by describing our experimental setup in sec:exp_setup. We investigate how effectively our techniques address pipeline branch hazards in sec:exp_hazards, and the overall effect on execution time in sec:exp_performance. sec:inlining demonstrates that context threading is complementary to inlining and results in performance comparable to SableVM's implementation of selective inlining. Finally, sec:Limitations-of-Context discusses a few of the limitations of context threading by studying the performance of Vitale's subroutine-threaded Tcl [#!ct-tcl2005!#, Figure 1] and OCaml on Sparc.
We evaluate our techniques by modifying interpreters for Java and OCaml to run on Pentium 4, PowerPC 7410 and PPC970. The Pentium and PowerPC are processors used by PC and Macintosh workstations and many types of servers. The Pentium and PowerPC provide different architectures for indirect branches (Figure illustrates the differences) so we ensure our techniques work for both approaches.
Our experimental approach is to evaluate performance by measuring elapsed time. This is simple to measure and always relevant. We guard against intermittent events polluting any single run by always averaging across three executions of each benchmark.
We report pipeline hazards using the performance measurement counters of each processor. These vary widely not only between the Pentium and the PowerPC but also within each family. This is a challenge on the PowerPC, where IBM's modern PowerPC 970 is a desirable processor to measure, but has no performance counters for stalls caused by indirect branches. Thus, we use an older processor model, the PowerPC 7410, because it implements performance counters that the PowerPC 970 does not.
We choose two virtual machines for our experiments. OCaml is a simple, very cleanly implemented interpreter. However, there is only one implementation to measure and only a few relatively small benchmark programs are available. For this reason we also modified SableVM, a Java Virtual Machine.
We chose OCaml as representative of a class of efficient, stack-based interpreters that use direct-threaded dispatch. The bytecode bodies of the interpreter, in C, have been hand-tuned extensively, to the point of using gcc inline assembler extensions to hand-allocate important variables to dedicated registers. The implementation of the OCaml interpreter is clean and easy to modify [#!ocaml:book!#,#!ocamlsite!#].
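To make the dispatch technique concrete, the following is a minimal direct-threaded interpreter written with GNU C's computed-goto extension, in the same general style as (though far simpler than) ocamlrun's dispatch loop. The opcode set and program here are invented for illustration; they are not OCaml's actual bytecodes. The point to note is that every body ends with an indirect `goto` through the virtual program counter, the branch whose poor predictability motivates this chapter.

```c
#include <assert.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Direct-threaded interpretation of a tiny program: 1 + 1. */
int interp(void)
{
    static void *program[16];
    void *labels[] = { &&push1, &&add, &&halt };
    int code[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT };
    int stack[8], *sp = stack;
    void **vpc = program;

    /* "Loading": translate opcodes into body addresses. */
    for (int i = 0; i < 4; i++)
        program[i] = labels[code[i]];

    goto **vpc++;               /* initial dispatch */
push1:
    *sp++ = 1;
    goto **vpc++;               /* direct-threaded dispatch: the
                                   frequently mispredicted branch */
add:
    sp--; sp[-1] += sp[0];
    goto **vpc++;
halt:
    return sp[-1];
}
```

Running the sketch on the four-instruction program above computes 1 + 1 and returns 2.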
Table: Description of OCaml benchmarks. Raw elapsed time and branch hazard data for direct-threaded runs.
The benchmarks in Table make up the standard OCaml benchmark suite. Boyer, kb, quicksort and sieve do mostly integer processing, while nucleic and fft are mostly floating point benchmarks. Soli is an exhaustive search algorithm that solves a solitaire peg game. Fib, taku, and takc are tiny, highly-recursive programs which calculate integer values.
Fib, taku, and takc are unusual because they contain very few distinct virtual instructions, and in some cases use only one instance of each. This has two important consequences. First, the indirect branch in direct-threaded dispatch is relatively predictable. Second, even minor changes can have dramatic effects (both positive and negative) because so few instructions contribute to the behavior.
SableVM is a Java Virtual Machine built for quick interpretation. SableVM implements multiple dispatch mechanisms, including switch, direct threading, and selective inlining (which SableVM calls inline threading [#!gagnon:inline-thread-prep-seq!#]). The support for multiple dispatch mechanisms facilitated our work to add context threading and allows for comparisons against other techniques, like inlining, that also address branch mispredictions. Finally, as part of its own inlining infrastructure, SableVM builds tables describing which virtual instruction bodies can be safely inlined using memcpy. This made our tiny inlining implementation very simple.
Table: Description of SPECjvm98 Java benchmarks. Raw elapsed time and branch hazard data for direct-threaded runs.
SableVM experiments were run on the complete SPECjvm98 [#!SPECjvm98!#] suite (compress, db, mpegaudio, raytrace, mtrt, jack, jess and javac), one large object-oriented application (soot [#!vall99soot!#]) and one scientific application (scimark [#!Scimark!#]). Table summarizes the key characteristics of these benchmarks.
On both platforms we measure elapsed time averaged over three runs to mitigate noise caused by intermittent system events. We necessarily use platform- and operating-system-dependent methods to estimate pipeline hazards.
The Pentium 4 (P4) processor speculatively dispatches instructions based on branch predictions. As discussed in Section , the indirect branches used for direct-threaded dispatch are often mispredicted due to the lack of context. Ideally, we could measure the cycles the processor stalls due to mispredictions of these branches, but the P4 does not provide a performance counter for this purpose. Instead, we count the number of mispredicted taken branches (MPT) to measure how our techniques affect branch prediction. We measure time on the P4 with the cycle-accurate time stamp counter (TSC) register. We count both MPT and TSC events using our own Linux kernel module, which collects complete data for the multithreaded Java benchmarks.
We need to characterize the cost of branches differently on the PowerPC than on the P4. On the PPC architecture split branches are used (as shown in Figure (b)) and the PPC stalls until the branch destination is known. Hence, we would like to count the number of cycles stalled due to link and count register dependencies. Unfortunately, PPC970 chips do not provide a performance counter for this purpose; however, the older PPC7410 CPU has a counter (counter 15, ``stall on LR/CTR dependency'') that provides exactly the information we need [#!motorola:mpc7410!#]. On the PPC7410, we also use the hardware counters to obtain overall execution times in terms of clock cycles. We expect that the branch stall penalty should be larger on more deeply-pipelined CPUs like the PPC970, however, we cannot directly verify this. Instead, we report only elapsed execution time for the PPC970.
In presenting our results, we normalize all experiments to the direct threading case, since it is considered a state-of-the-art dispatch technique. (For instance, the source distribution of OCaml configures for direct threading.) We give the absolute execution times and branch hazard statistics for each benchmark and platform using direct threading in Tables and . Bar graphs in the following sections show the contributions of each component of our technique: subroutine threading only (labeled SUB); subroutine threading plus branch inlining and branch replication for exceptions and indirect branches (labeled SUB+BI); and our complete context threading implementation, which includes apply/return inlining (labeled SUB+BI+AR). We include bars for selective inlining in SableVM (labeled SABLEVM) and our own simple inlining technique (labeled TINY) to facilitate comparisons, although inlining results are not discussed until Section . We do not show a bar for direct threading because it would, by definition, have height 1.0. Table provides a key to the acronyms used as labels in the following graphs.
Context threading was designed to align virtual program state with physical machine state to improve branch prediction and reduce pipeline branch hazards. We begin our evaluation by examining how well we have met this goal.
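The mechanism under evaluation can be sketched portably as follows. Real subroutine threading generates, at load time, a context threading table (CTT) of native call instructions, one per virtual instruction; since that generated code is platform-specific, this sketch approximates the CTT with an array of function pointers (the names and program are invented). The property it preserves is the essential one: each body is an ordinary subroutine ending in a native return, so the hardware return-address stack predicts the branch back into the CTT; the indirect call through the table below is exactly what the generated direct calls eliminate.

```c
#include <assert.h>

/* A tiny operand stack shared by the bodies. */
static int stack[8], *sp = stack;

/* Virtual instruction bodies as callable subroutines,
 * each ending in a native return. */
static void body_push1(void) { *sp++ = 1; }
static void body_add(void)   { sp--; sp[-1] += sp[0]; }

/* Stand-in for the CTT: in real subroutine threading each entry
 * is a generated native `call body` instruction, not a pointer. */
int run_ctt(void)
{
    void (*ctt[])(void) = { body_push1, body_push1, body_add };
    for (unsigned i = 0; i < sizeof ctt / sizeof ctt[0]; i++)
        ctt[i]();
    return sp[-1];
}
```

Dispatching the three-instruction program computes 1 + 1 and leaves 2 on the virtual stack.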
Figure reports the extent to which context threading reduces pipeline branch hazards for the OCaml benchmarks, while Figure reports these results for the Java benchmarks on SableVM. At the top of both figures, the graphs labeled (a) present the results on the P4, where we count mispredicted taken branches (MPT). At the bottom of the figures, the graphs labeled (b) present the effect on LR/CTR stall cycles on the PPC7410. The last cluster of each bar graph reports the geometric mean across all benchmarks.
Context threading eliminates most of the mispredicted taken branches (MPT) on the Pentium 4 and LR/CTR stall cycles on the PPC7410, with similar overall effects for both interpreters. Examining Figures and reveals that subroutine threading has the single greatest impact, reducing MPT by an average of 75% for OCaml and 85% for SableVM on the P4, and reducing LR/CTR stalls by 60% and 75% on average for the PPC7410. This result matches our expectations because subroutine threading addresses the largest single source of unpredictable branches--the dispatch used for straight-line sequences of virtual instructions. Branch inlining has the next largest effect, since conditional branches are the most significant remaining pipeline hazard after applying subroutine threading. On the P4, branch inlining cuts the remaining MPTs by about 60%. On the PPC7410, branch inlining has a smaller, yet still significant, effect, eliminating about 25% of the remaining LR/CTR stall cycles. A notable exception to the MPT trend occurs for the OCaml micro-benchmarks Fib, takc and taku. These tiny recursive micro-benchmarks contain few duplicate virtual instructions, so the Pentium's branch target buffer (BTB) mostly predicts correctly and inlining the conditional branches cannot help.
Interestingly, the same three OCaml micro-benchmarks Fib, takc and taku that challenge branch inlining on the P4 also reap the greatest benefit from apply/return inlining, as shown in Figure (a). (This appears as the significant improvement of SUB+BI+AR relative to SUB+BI.) Due to the recursive nature of these benchmarks, their performance is dominated by the behavior of virtual calls and returns. Thus, we expect predicting the returns to have significant impact.
For SableVM on the P4, however, our implementation of apply/return inlining is restricted by the fact that gcc-generated code touches the processor's esp register. Rather than implement a complicated stack switching technique, as discussed in Section , we allow the virtual and machine stacks to become misaligned and then manipulate the esp directly. This reduces the performance of our apply/return inlining implementation, presumably by somehow impeding the operation of the return address stack predictor. This can be seen in Figure (a), where adding apply/return inlining increases mispredicted branches. On the PPC7410, the effect of apply/return inlining on LR/CTR stalls is very small for SableVM.
Having shown that our techniques can significantly reduce pipeline branch hazards, we now examine the impact of these reductions on overall execution time.
Figure: OCaml Elapsed Time Relative to Direct Threading
Context threading improves branch prediction, resulting in better use of the pipelines on both the P4 and the PPC. However, using a native call/return pair for each dispatch increases instruction overhead. In this section, we examine the net result of these two effects on overall execution time. As before, all data is reported relative to direct threading.
Figures and show results for the OCaml and SableVM benchmarks, respectively. They are organized in the same way as the previous figures, with P4 results at the top, labeled (a), and PPC7410 results at the bottom, labeled (b). Figure shows the performance of OCaml and SableVM on the PPC970 CPU. The geometric means (rightmost cluster) in Figures , and show that context threading significantly outperforms direct threading on both virtual machines and on all three architectures. The geometric mean execution time of the OCaml VM is about 19% lower for context threading than for direct threading on the P4, 9% lower on the PPC7410, and 39% lower on the PPC970. For SableVM, SUB+BI+AR, compared with direct threading, runs about 17% faster on the PPC7410 and 26% faster on both the P4 and PPC970. Although we cannot measure the cost of LR/CTR stalls on the PPC970, the greater reductions in execution time are consistent with its more deeply-pipelined design (23 stages vs. 7 for the PPC7410).
Across interpreters and architectures, the effect of our techniques is clear. Subroutine threading has the single largest impact on elapsed time. Branch inlining has the next largest impact, eliminating an additional 3-7% of the elapsed time. In general, the reductions in execution time track the reductions in branch hazards seen in Figures and . The longer path length of our dispatch technique is most evident in the OCaml benchmarks fib and takc on the P4, where the improvements in branch prediction (relative to direct threading) are minor. These tiny benchmarks compile into unique instances of a few virtual instructions. This means that there is little or no sharing of BTB slots between instances and hence fewer mispredictions.
The effect of apply/return inlining on execution time is minimal overall, changing the geometric mean by only 1% with no discernible pattern. Given the limited performance benefit and added complexity, a general deployment of apply/return inlining does not seem worthwhile. Ideally, one would like to detect heavy recursion automatically, and only perform apply/return inlining when needed. We conclude that, for general usage, subroutine threading plus branch inlining provides the best trade-off.
We now demonstrate that context-threaded dispatch is complementary to inlining techniques.
Inlining techniques address the context problem by replicating bytecode bodies and removing dispatch code. This reduces both instructions executed and pipeline hazards. In this section we show that, although both selective inlining and our context threading technique reduce pipeline hazards, context threading is slower due to the overhead of its extra dispatch instructions. We investigate this issue by comparing our own tiny inlining technique with selective inlining.
In Figures , and (b), the bar labeled SABLEVM shows our measurements of Gagnon's selective inlining implementation for SableVM [#!gagnon:inline-thread-prep-seq!#]. From these figures, we see that selective inlining reduces both MPT and LR/CTR stalls significantly as compared to direct threading, but it is not as effective in this regard as subroutine threading alone. The larger reductions in pipeline hazards for context threading, however, do not necessarily translate into better performance over selective inlining. Figure (a) illustrates that SableVM's selective inlining beats context threading on the P4 by roughly 5%, whereas on the PPC7410 and the PPC970, both techniques have roughly the same execution time, as shown in Figure (b) and Figure (a), respectively. These results show that reducing pipeline hazards caused by dispatch is not sufficient to match the performance of selective inlining. By eliminating some dispatch code, selective inlining can do the same real work with fewer instructions than context threading.
Context threading is a dispatch technique, and can be easily combined with an inlining strategy. To investigate the impact of dispatch instruction overhead and to demonstrate that context threading is complementary to inlining, we implemented Tiny Inlining, a simple heuristic that inlines all bodies with a length less than four times the length of our dispatch code. This eliminates the dispatch overhead for the smallest bodies and, as calls in the CTT are replaced with comparably-sized bodies, tiny inlining ensures that the total code growth is low. In fact, the smallest inlined OCaml bodies on P4 were smaller than the length of a relative call instruction (five bytes). Table summarizes the effect of tiny inlining. On the P4, we come within 1% of SableVM's selective inlining implementation. On PowerPC, we outperform SableVM by 7.8% for the PPC7410 and 4.8% for the PPC970.
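The load-time decision the heuristic makes can be sketched as below. The body descriptors, the dispatch-code length, and the nop bytes standing in for a generated `call body` instruction are all illustrative values, not SableVM's actual data structures; the point is simply that bodies marked relocatable and shorter than four dispatch lengths are copied into the CTT with memcpy, while everything else gets a call.

```c
#include <assert.h>
#include <string.h>

#define DISPATCH_LEN 5   /* e.g. a five-byte x86 relative call */

/* Hypothetical descriptor for one virtual instruction body. */
struct body {
    const char *code;    /* native code of the body            */
    int len;             /* its length in bytes                */
    int relocatable;     /* safe to copy with memcpy?          */
};

/* Generate CTT code for a program, applying the tiny inlining
 * heuristic; returns the total number of bytes emitted. */
int gen_ctt(char *out, const struct body *prog, int n)
{
    int pos = 0;
    for (int i = 0; i < n; i++) {
        if (prog[i].relocatable && prog[i].len < 4 * DISPATCH_LEN) {
            memcpy(out + pos, prog[i].code, prog[i].len); /* inline it */
            pos += prog[i].len;
        } else {
            memset(out + pos, 0x90, DISPATCH_LEN); /* stand-in for call */
            pos += DISPATCH_LEN;
        }
    }
    return pos;
}
```

Because an inlined body replaces a call of comparable size, total code growth stays small, which is the property the text relies on.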
Table: Detailed comparison of selective inlining (SABLEVM) vs. SUB+BI+AR and TINY. Numbers are elapsed time relative to direct threading. One difference column gives the difference between selective inlining and SUB+BI+AR; the other gives the difference between selective inlining and TINY (the combination of context threading and tiny inlining).
We discuss two limitations of our technique. The first is that our technique, like most dispatch optimizations, can have only limited impact on virtual machines that implement large virtual instructions. The second is the difficulty we experienced adding profiling to our implementation of context threading.
The techniques described in this chapter address dispatch and hence have greater impact as the frequency of dispatch increases relative to the real work carried out. A key design decision for any virtual machine is the specific mix of virtual instructions. A computation may be carried out by many lightweight virtual instructions or fewer heavyweight ones. Figure shows that a Tcl interpreter typically executes an order of magnitude more cycles per dispatched virtual instruction than OCaml. Another perspective is that OCaml executes proportionately more dispatch because its work is carved up into smaller virtual instructions. In the figure, we see that many OCaml benchmarks average only tens of cycles per dispatched instruction. Thus, the time OCaml spends executing a typical body is of the same order of magnitude as the branch misprediction penalty of a modern CPU. On the other hand, most Tcl benchmarks execute hundreds of cycles per dispatch, many times the misprediction penalty. Thus, we expect subroutine threading to speed up Tcl much less than OCaml. Figure reports the performance of subroutine-threaded OCaml on an UltraSPARC III. As shown in the figure, subroutine threading speeds up OCaml on the UltraSPARC by about 13%. In contrast, the geometric mean of 500 Tcl benchmarks speeds up by only 5.4% [#!ct-tcl2005!#].
Figure: Reproduction of [#!ct-tcl2005!#, Figure 1] showing cycles run per virtual instruction dispatched for various Tcl and OCaml benchmarks.
Figure: Elapsed time of subroutine threading relative to direct threading for OCaml on UltraSPARC III.
Another issue raised by the Tcl implementation was that about 12% of the 500 program benchmark suite slowed down. Very few of these dispatched more than 10,000 virtual instructions. Most were tiny programs that executed as little as a few dozen dispatches. This suggests that for programs that execute only a small number of virtual instructions, the load time overhead of generating code in the CTT may be too high.
Our original scheme for extending our context threaded interpreter with a JIT was to detect hot paths of the virtual program by generating calls to profiling instrumentation amongst the dispatch code in the CTT. We persevered for some time with this approach, and successfully implemented a system that identified traces [#!us_cascon2005!#]. The resulting implementation, though efficient, was fragile and required the generation of more machine specific code for profiling than we considered desirable. In the next chapter we describe a much more convenient approach based on dispatch loops.
SableVM is a very well engineered interpreter. For instance, SableVM's infrastructure for identifying un-relocatable virtual instruction bodies made implementing our TINY inlining experiment simple. However, its heavy use of m4 and cpp macros, used to implement multiple dispatch mechanisms and achieve a high degree of portability, makes debugging awkward. In addition, our efforts to add profiling instrumentation to context threading made many changes that we subsequently realized were ill-advised. Hence, we decided to start from clean sources. For the next stage of our experiment, our trace-based JIT, we decided to abandon SableVM in favour of JamVM.
Our experimentation with subroutine threading has established that calling virtual instruction bodies is an efficient way of dispatching virtual instructions. Subroutine threading is particularly effective at eliminating branch mispredictions caused by the dispatch of straight-line regions of virtual instructions. Branch inlining, though labor intensive to implement, eliminates the branch mispredictions caused by most virtual branches. Once the pipelines are full, the latency of dispatch instructions becomes significant. A suitable technique for addressing this overhead is inlining, and we have shown that context threading is compatible with our ``tiny'' inlining heuristic. With this simple approach, context threading achieves performance roughly equivalent to, and occasionally better than, selective inlining.
Our experiments also resulted in some warnings. First, our attempts to finesse the implementation of virtual branch instructions using branch replication (Section ) and apply/return inlining (Section ) were not successful. It was only when we resorted to the much less portable branch inlining that we improved the performance of virtual branches significantly. Second, the slowdown observed amongst a few Tcl benchmarks (which dispatched very few virtual instructions) raises the concern that even the load time overhead of subroutine threading may be too high. This suggests that we should investigate lazy approaches so we can delay generating code until it is needed.
These results inform our design of a gradually extensible interpreter, to be presented next. We suggested, in cha:introduction, that a JIT compiler would be simpler to build if its code generator has the option of falling back on calling virtual instruction bodies. The resulting fallback code is very similar to code generated at load time by a subroutine-threaded interpreter. In this chapter we have seen that linear sequences of virtual instructions can be dispatched efficiently using subroutine threading. This suggests that there would be little or no performance penalty, relative to interpretation, when a JIT falls back on calling sequences of virtual instructions that it chooses not to compile.
We have shown that dispatching virtual branch instructions efficiently can gain 5% or more performance. We have shown that branch inlining, though not portable, is an effective way of reducing branch mispredictions. However, our experience has been that branch inlining is time consuming to implement. In the next chapter we will show that identifying hot interprocedural paths, or traces, at runtime enables a much simpler way of dealing with virtual branches that performs as well as branch inlining.