Figure 6. An example of a loop with more than one frequently executed path. If we supported just a single hotpath per loop and first encountered A, F, G and recorded a trace for it, all other combinations (A, B, C, E, G and A, B, D, E, G) would result in expensive bail-out operations in which the VM has to recover from failed guard instructions. Our secondary trace approach avoids this problem.
tion. If a loop calls into the memory allocator on every iteration,
the overhead of running the memory allocator often by far out-
weighs any speedup that compilation of the rest of the loop code
could achieve. Thus, not compiling such loops in the first place is an acceptable initial heuristic.3 Since the runtime of the loop is dominated by the runtime of the memory allocator, the performance loss will be small.
There is, however, a class of loops that would benefit from com-
pilation despite the presence of memory allocation instructions.
Since Java does not allow explicit stack allocation of temporary
objects, regular heap allocation instructions are used. As these tem-
porary objects are only live for a single iteration, we could actu-
ally bypass the allocator altogether and allocate temporary stack
space instead. This would enable us to speed up the loop body without building up any unused heap blocks that the garbage collector would have to reclaim (which it cannot, since we block garbage collection while executing compiled code).
We are currently working on a specialized linear-time escape
analysis that will allow us to identify such temporary objects at
runtime.
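As a concrete illustration (our own example, not taken from the paper), the following Java loop allocates a temporary object on every iteration that is live for that iteration only; this is exactly the allocation pattern that a linear-time escape analysis could convert from heap allocation to stack allocation:

```java
// Hypothetical example: each iteration allocates a temporary Point that
// never escapes the loop body, so an escape analysis could replace the
// heap allocation with a stack slot reused across iterations.
final class Point {
    final double x, y;
    Point(double x, double y) { this.x = x; this.y = y; }
    double dist() { return Math.sqrt(x * x + y * y); }
}

public final class EscapeDemo {
    static double sumDistances(double[] xs, double[] ys) {
        double sum = 0.0;
        for (int i = 0; i < xs.length; i++) {
            // 'p' is only live within this single iteration and is never
            // stored anywhere that outlives the iteration.
            Point p = new Point(xs[i], ys[i]);
            sum += p.dist();
        }
        return sum;
    }

    public static void main(String[] args) {
        double s = sumDistances(new double[]{3.0, 0.0}, new double[]{4.0, 0.0});
        System.out.println(s); // prints 5.0
    }
}
```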
4. Trace Merging
Trace-based JIT compilation as presented in the previous two sec-
tions works well for very regular code. Unfortunately, code is rarely
purely regular. Loops often do not contain a single dominant path,
but several frequently executed paths (Figure 6). To achieve good
performance in this more general case, we have to support multiple
hot paths through a loop.
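The situation in Figure 6 can be sketched with a small Java loop (our own example; the branch conditions are arbitrary) in which no single path through the body dominates, so a single recorded hotpath would fail its guards frequently:

```java
// Hypothetical loop mirroring Figure 6: each iteration takes one of three
// paths through the body, and none of them dominates the others.
public final class MultiPath {
    static int run(int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {   // A: loop header
            if (i % 3 == 0) {           // path A, F, G
                acc += 1;
            } else if (i % 3 == 1) {    // path A, B, C, E, G
                acc += 2;
            } else {                    // path A, B, D, E, G
                acc += 3;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        // Each of the three paths is taken 3 times in 9 iterations.
        System.out.println(run(9)); // prints 18
    }
}
```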
For this, similarly to Dynamo, we record secondary (child) traces every time we exit a trace along a non-exception edge (Figure 7).
The guard instruction is updated to jump to the child trace, instead
of returning to the VM. The child trace shares the upstream code
with its parent trace, and is compiled bottom-up just as its parent
trace, with the register allocator initialized to the final state of the
3
Our compiler currently also rejects such loops, since memory allocation requires a callback into the virtual machine and is thus effectively a native method invocation, which in our prototype aborts trace compilation.
Figure 7. Recording secondary traces. Every time we exit a trace
along a non-exception edge, we immediately start recording a new
trace. The guard instruction that triggered the recording of the new
trace is then patched to point to the newly recorded and compiled
trace.
parent trace. This is necessary in order for the child trace to know
in which register the parent trace stored values that were computed
upstream.
If a child trace is left through a side exit, we resume recording; the resulting trace is also associated directly with the parent trace.
In essence we lazily record all actually executed traces through the
loop as we encounter them, and connect them to the single loop
header of the parent trace.
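The lazy recording scheme described above might be sketched roughly as follows. This is our own simplified model in Java, not the paper's actual data structures; the names Trace, TraceTree, and onSideExit are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch (hypothetical): when a guard in a compiled trace fails along a
// non-exception edge, a new trace is recorded and the guard is patched to
// jump to it, so future guard failures no longer bail out to the VM. All
// traces belong to one tree anchored at the parent trace's loop header.
final class Trace {
    final String path;                                   // e.g. "A-F-G"
    final List<Trace> guardTargets = new ArrayList<>();  // patched guard exits
    Trace(String path) { this.path = path; }
}

final class TraceTree {
    final Trace primary;           // anchored at the single loop header
    final List<Trace> allTraces = new ArrayList<>();

    TraceTree(String primaryPath) {
        this.primary = new Trace(primaryPath);
        allTraces.add(primary);
    }

    // Called on a side exit from 'parent': record a child trace and patch
    // the failing guard to point at it.
    Trace onSideExit(Trace parent, String recordedPath) {
        Trace child = new Trace(recordedPath);
        parent.guardTargets.add(child);  // "patch" the guard instruction
        allTraces.add(child);            // still part of the same loop's tree
        return child;
    }
}

public final class TraceTreeDemo {
    public static void main(String[] args) {
        TraceTree tree = new TraceTree("A-F-G");
        Trace c1 = tree.onSideExit(tree.primary, "B-C-E-G");
        tree.onSideExit(c1, "D-E-G");    // side exits off children attach too
        System.out.println(tree.allTraces.size()); // prints 3
    }
}
```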
In certain cases we also have to insert compensation code along
the new edge to the child trace, for example in order to compute a
value that was dead in the parent trace, but is now active in the child
trace. These cases are easily recognizable, because a use refers to a
definition in the parent trace that has no register allocation assigned
to it.
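A hypothetical Java loop (our example) shows when such compensation code arises: a value is dead along the hot path that forms the parent trace, but live on the rare path that becomes a child trace, so it must be computed along the patched guard edge:

```java
// Hypothetical example: 'twice' is used only on the rarely taken branch.
// A trace through the common path treats it as dead and never computes it,
// so entry into the child trace for the rare path needs compensation code
// that computes 'twice' along the guard edge.
public final class Compensation {
    static int run(int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            int twice = i * 2;       // dead on the hot path below
            if (i % 100 == 99) {     // rare path -> child trace
                acc += twice;        // first real use of 'twice'
            } else {
                acc += 1;            // hot path: 'twice' is unused
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        // 198 hot iterations contribute 1 each; i = 99 and i = 199
        // contribute 198 and 398: 198 + 198 + 398 = 794.
        System.out.println(run(200)); // prints 794
    }
}
```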
It should be noted that this approach leads to a certain amount of code duplication; however, it also permits specialized code optimization along each path (Figure 8).
Invariant code motion in the presence of secondary traces also
warrants special consideration. Aggressively hoisting computations out of the initially recorded primary trace would block a large number of registers from further use by secondary traces. To
circumvent this problem, we selectively spill certain invariant val-
ues into memory when switching from parent to child traces.
An interesting effect can be observed in the presence of nested
loops. As shown in Figure 9, in a nested loop construct the inner
part of the loop tends to be "hotter" than the outer parts of the
nested loop. Thus, the first loop header to be detected as such is
likely B, because it is part of the inner loop. Once the trace B − C
has been recorded and is executed, a new trace is recorded as soon as C exits to D instead of following the loop edge. We record a new
child trace D − A and reconnect it to B, which we consider the
loop header. The fact that we only consider B as a loop header
as far as optimization is concerned means that the primary trace
B − C will get more resources allocated than C − D − A. However,
considering that the inner loop is likely the hotter path, this is
actually exactly what we want. Effectively, we have turned the
loop inside out, giving maximum consideration to the hottest loop
region, and slightly disadvantaging the outer loop parts.
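A minimal Java sketch (our own example) of why the inner loop header is detected first: it simply executes far more often than the outer one, so its iteration counter crosses the hotness threshold long before the outer header's does:

```java
// Hypothetical nested loop: the inner header (B) runs outer*inner times,
// while the outer header runs only 'outer' times, so trace recording
// discovers B first and optimizes the inner path B-C preferentially.
public final class NestedLoops {
    static long run(int outer, int inner) {
        long innerIterations = 0;
        for (int i = 0; i < outer; i++) {        // outer header (D-A region)
            for (int j = 0; j < inner; j++) {    // inner header B, body C
                innerIterations++;
            }
        }
        return innerIterations;
    }

    public static void main(String[] args) {
        // Inner header executes 10 * 100 = 1000 times vs. 10 for the outer.
        System.out.println(run(10, 100)); // prints 1000
    }
}
```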