trace has been identified: it is then dynamically compiled into na-
tive code via a nontraditional application of SSA form, which we
call Trace SSA (TSSA).
In the classical use of SSA form, a control-flow graph is trans-
lated into SSA form in its entirety, and φ nodes are placed at
control-flow join points. Our approach differentiates between the
values used in a trace being compiled, which are in SSA form, and
values in the rest of the VM, which are not. The VM explicitly
moves data from the stack and local variables into dedicated SSA
variables before any generated native code is called, and explic-
itly moves non-dead SSA results back onto the stack and into local
variables on every exit from such an optimized trace (including side
exits). This approach enables the just-in-time compiler to perform
aggressive optimizations on the trace, including moving operations
on SSA values across side exit points. Because instruction traces
are essentially linear (they may contain only internal back edges),
liveness analysis and placement of φ nodes are straightforward. Our
system supports fairly sophisticated merging of multiple traces that
have a common ancestor.
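The entry/exit convention described above can be illustrated with a minimal C sketch. All names here (Frame, run_trace) are hypothetical and not the paper's actual API; the point is only the data movement: interpreter state is copied into dedicated SSA variables on trace entry, and live SSA results are written back on every exit, including side exits.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch, not the paper's implementation. Before a
 * compiled trace runs, the VM moves local-variable values into
 * dedicated SSA variables; on every exit (including side exits),
 * non-dead SSA results are written back to the interpreter state. */

typedef struct {
    int32_t locals[4];               /* interpreter local-variable slots */
} Frame;

/* Compiled trace for the loop body "++k; ++i", with a side exit taken
 * when i reaches the limit. Returns the number of the exit taken. */
static int run_trace(Frame *f, int32_t limit) {
    /* trace entry: copy VM state into SSA variables */
    int32_t ssa_i = f->locals[1];
    int32_t ssa_k = f->locals[2];
    for (;;) {
        int32_t ssa_k2 = ssa_k + 1;  /* each definition gets a fresh name */
        int32_t ssa_i2 = ssa_i + 1;
        if (ssa_i2 >= limit) {
            /* side exit: write live SSA results back into the frame */
            f->locals[1] = ssa_i2;
            f->locals[2] = ssa_k2;
            return 1;
        }
        ssa_i = ssa_i2;              /* back edge: φ-like renaming */
        ssa_k = ssa_k2;
    }
}
```

Because the trace body operates only on the SSA variables, the compiler is free to reorder or eliminate operations on them; the frame is guaranteed consistent only at exits.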
The remainder of this paper is organized as follows. In Section 2,
we describe our approach to identifying and extracting linear code
sequences from non-sequential programs. Section 3 describes our
trace compiler, which is a specialized JIT compiler that translates
bytecode traces to PowerPC machine code. In Section 4 we detail
our approach to extending primary traces with secondary traces to
cover irregular control-flow scenarios. Then, in Section 5, we give
a brief overview of the current state of our prototype implementa-
tion. Related work is discussed in Section 6. Our paper concludes
in Section 7.
2. Trace Selection and Recording
When a bytecode program is loaded into memory, traces are not read-
ily visible in the code. In fact, most general-purpose code is not
even purely sequential. The Java bytecode format groups code into
methods that are associated with classes, with all code for a class
stored in individual class files.
To transform such non-sequential code into linear sequences
of instructions, we use an approach called software trace schedul-
ing [9]. Software trace scheduling is based on the observation that
a physical CPU does not actually execute a code graph, but merely
follows linear instruction sequences (traces) along the edges of that
graph. To obtain a trace from a code graph, we record instructions
at runtime as they are executed. As not all code is equally worthy of
compilation and optimization, we limit our efforts to loops and only
start recording bytecode traces once we have identified a frequently
executed loop header.
This section gives an overview of the main steps—identifying
loop headers, recording traces, and conditions for ending the
recording.
2.1 Identifying Loop Headers
To identify loop headers, we use a simple heuristic that first
appeared in Dynamo [3], a framework for dynamic runtime optimization
of binary code.
Initially, our virtual machine interprets the bytecode program in-
struction by instruction. Each time a backwards branch instruction
is executed, the destination of that jump is recorded as a potential
loop header. The rationale of this approach is that the general flow
of control in bytecode is forward, and thus each loop has to contain
at least one backward branch.
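A minimal sketch of this detection step, with illustrative names and a toy array size rather than the paper's actual data structures:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch: whenever the interpreter executes a branch whose
 * target lies at or before the branch itself (a backward branch), the
 * target offset is remembered as a potential loop header. */

static int potential_header[256];      /* flag per bytecode offset (toy size) */

static void on_branch(size_t branch_pc, size_t target_pc) {
    if (target_pc <= branch_pc)        /* control flows backwards: likely a loop */
        potential_header[target_pc] = 1;
}
```

For instance, a `goto` at offset 18 jumping back to offset 7 would mark offset 7 as a potential loop header, while a forward branch marks nothing.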
To filter actual loop headers from the superset of potential
loop headers, we track the invocation frequency of (potential) loop
headers. Only after the execution frequency of a potential loop
header exceeds a certain threshold does our VM start recording a
bytecode trace (Figure 1). To reduce the overhead for code that is
public static void main(String[] args) {
    int i, k = 0;
    for (i = 0; i < 1000; ++i)       // hotspot
        ++k;
    System.out.println(k);
}

   iconst_0
   istore_2
   iconst_0
   istore_1
A: iload_1                           ; hotpath (trace) begins
   sipush 1000
   if_icmpge B
   iinc 2,1
   iinc 1,1
   goto A                            ; hotpath (trace) ends
B: getstatic System.out
   iload_2
   invokevirtual println(int)
   return
Figure 1. Recording bytecode traces. The source program (upper
part of the figure) contains a frequently executed loop that we
want to compile (a hotspot). The other parts of the program are
executed infrequently, and compiling them would likely do little
for overall performance. The bottom part of the figure shows
the resulting bytecode program. A frequently executed backward
branch to A triggers the recording of a trace. The trace is complete
when execution returns to the loop header. We also refer to recorded
traces as "hotpaths", as they are the results of tracing hotspots.
not suitable for trace-based JIT compilation, the execution monitor
disconnects itself from the virtual machine if no usable traces can
be found and a second threshold is exceeded.
As branch instructions appear frequently in Java bytecode, it
is essential to minimize the housekeeping overhead of tracking
the invocation frequency of backward branches. In our
first KVM-based implementation we used a relatively small (977
entries) closed hash table, consisting of 16-bit integer counters, for
this purpose. Every time a backward-branch instruction is executed,
the corresponding counter in the hash table is incremented. Colli-
sions in the hash table are intentionally tolerated as their impact on
code generation is relatively limited. Collisions can merely lead to
overestimation of the "hotness" of a code region, triggering code
generation for "cold" code. In the worst case this overestimation
will cause a slight performance degradation as the VM might be
unable to recover the compilation cost in terms of cycles spent on
code optimization vs. cycles recovered from more efficient program
execution.
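The counter scheme can be sketched as follows. The table size (977) comes from the text; the threshold value and hash function are hypothetical stand-ins:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the small closed hash table of 16-bit counters. Collisions
 * are tolerated: distinct branch targets may alias the same counter,
 * at worst overestimating "hotness". Threshold and hashing are
 * illustrative, not the paper's actual values. */

#define TABLE_SIZE 977
#define HOT_THRESHOLD 50               /* hypothetical threshold */

static uint16_t counters[TABLE_SIZE];

/* Called per backward branch; returns nonzero once the (possibly
 * aliased) counter crosses the threshold. */
static int record_backward_branch(uintptr_t target) {
    uint16_t *c = &counters[target % TABLE_SIZE];
    if (*c < UINT16_MAX)               /* saturate rather than wrap */
        ++*c;
    return *c >= HOT_THRESHOLD;
}
```

Note that two targets whose addresses differ by a multiple of the table size share a counter, which is exactly the tolerated overestimation described above.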
Using a hash table is well suited for interpreters such as KVM,
which interpret the native bytecode format as it exists in Java class
files. JamVM, in contrast, translates the bytecode format into its
own intermediate representation prior to execution. Among other
transformations, relative branch addresses are resolved to absolute
target addresses, and constant values are retrieved from the constant
pool and stored directly in the intermediate representation. To further
reduce the overhead of the branch monitoring approach described
above, we directly embed the profiling information in the interme-
diate representation. This allows us to avoid the additional hash
table lookup per branch instruction.
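A sketch of this embedding, with a struct layout that is illustrative only and not JamVM's actual intermediate representation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: each translated branch instruction carries its
 * own profiling counter, so no hash-table lookup is needed per
 * executed branch. */

typedef struct Insn {
    int opcode;
    struct Insn *target;   /* absolute target, resolved at translation time */
    uint16_t count;        /* embedded per-branch profiling counter */
} Insn;

static int profile_branch(Insn *br, uint16_t threshold) {
    if (br->count < UINT16_MAX)
        ++br->count;
    return br->count >= threshold;
}
```

The counter lives in the same cache line as the instruction the interpreter is already dispatching, so the profiling cost per branch drops to a single increment and compare.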