Java stack and local variables are in fact not updated during trace execution. φ pseudo instructions, in contrast, are by definition loop variant; for our purposes, however, we can still hoist them out of the loop, because only the invariant part of their semantics results in actual code being generated.
φ pseudo instructions have a different semantics, depending on
whether they are executed during initial loop-entry or following a
loop iteration. For the initial loop-entry, similarly to read pseudo
instructions, φ pseudo instructions fetch context information from
the Java stack or local variables. In the case of a loop iteration, however, they carry over the value from the last iteration to the current one. We guarantee the latter, loop-variant semantics through our register allocation algorithm, which assigns φ pseudo instructions the same output register as the loop-carrying input register they are supposed to copy in the case of a loop iteration. As no code generation is necessary to guarantee this property, we can safely move φ instructions out of the loop body.
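The two-phase behavior of a φ pseudo instruction can be illustrated with plain Java control flow (a sketch with hypothetical names, not our actual intermediate representation):

```java
// Sketch: the two semantics of a phi pseudo instruction for a loop
// variable, written out as ordinary Java. All names are hypothetical.
public class PhiSemantics {
    public static void main(String[] args) {
        int contextValue = 10;   // value on the Java stack / in a local
        int i = contextValue;    // loop entry: phi fetches context,
                                 // like a read pseudo instruction
        for (int iter = 0; iter < 3; iter++) {
            int next = i + 1;    // loop body computes the next value
            i = next;            // loop iteration: phi carries the value
                                 // over; allocating 'next' and 'i' to the
                                 // same register makes this copy vanish
        }
        System.out.println(i);   // 10 incremented three times
    }
}
```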
Starting with the read and φ pseudo instructions, all instructions that only depend on loop-invariant inputs are hoisted out of the actual loop body, and are executed before the loop code.
To create more opportunities for invariant code motion, the trace recorder splits complex Java bytecode instructions into sub-instructions, some of which can then be hoisted out of the loop body, or at least shared between similar subsequent instructions. In fact, we did not have to introduce a large number of new instructions, as most of the sub-instructions are simply JVML instructions with relaxed typing (we use IADD for address arithmetic, for example).
A common example of this principle are array access instructions (e.g. IALOAD). Each IALOAD instruction is emitted as a combination of a shift operation (ISHL), which transforms the index into an appropriate byte offset depending on the array element size, and an IADD instruction, which generates an address from the offset and the object reference (the array base address). Subsequent array accesses can then share parts of, or the entire, address calculation. This approach also creates additional opportunities for efficient code generation by the backend, which can fold constant address calculations (e.g. constant index expressions) and emit more compact code using fewer registers.
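As a sketch (with hypothetical names; the ISHL/IADD pairing follows the description above), the address calculation that a single IALOAD decomposes into looks like this:

```java
// Sketch: the ISHL + IADD sub-instruction pair behind one IALOAD.
// For an int array, log2ElemSize is 2 (4-byte elements).
public class ArrayAddress {
    static long elementAddress(long arrayBase, int index, int log2ElemSize) {
        long offset = (long) index << log2ElemSize; // ISHL: index -> byte offset
        return arrayBase + offset;                  // IADD: base + offset
    }

    public static void main(String[] args) {
        // a[3] in an int array whose data starts at (hypothetical)
        // address 0x1000: 4096 + (3 << 2) = 4108. Accesses a[3] and b[3]
        // would share the ISHL result but not the IADD.
        System.out.println(elementAddress(0x1000, 3, 2));
    }
}
```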
Integer modulo instructions (IREM) undergo a similar treatment and are emitted as a combination of an integer division, a multiplication, and a subtraction, since many architectures do not directly support modulo calculation. For architectures that do (e.g. x86), the backend can easily detect and compact the resulting code pattern, while architectures such as PowerPC that have no modulo machine instruction can optimize the resulting instruction stream, e.g. through common subexpression elimination.
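The lowering is simply the identity a % b == a − (a / b) · b under truncating division, which also matches Java's IREM semantics for negative operands. A minimal sketch:

```java
// Sketch: IREM lowered to the IDIV/IMUL/ISUB sub-instruction sequence.
public class RemLowering {
    static int irem(int a, int b) {
        int quotient = a / b;        // IDIV (truncates toward zero)
        int product = quotient * b;  // IMUL
        return a - product;          // ISUB
    }

    public static void main(String[] args) {
        System.out.println(irem(7, 3));   // same as 7 % 3
        System.out.println(irem(-7, 3));  // sign follows the dividend
    }
}
```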
To guarantee proper semantics for memory accesses, we only consider a memory access to be loop invariant if no matching read/write exists with the same base type and field offset (or absolute address in the case of static fields). While this approach is rather pessimistic and does not take any high-level type information into account, it still allows us to hoist most memory reads/writes out of the loop without having to invest a lot of resources in proper alias analysis.
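A minimal sketch of this pessimistic test (all type names and offsets hypothetical):

```java
import java.util.List;

// Sketch: a read is treated as loop invariant only if no write in the
// trace matches it on base type and field offset.
public class MemInvariance {
    record Access(boolean isWrite, String baseType, int fieldOffset) {}

    static boolean isInvariantRead(Access read, List<Access> trace) {
        for (Access a : trace) {
            if (a.isWrite()
                    && a.baseType().equals(read.baseType())
                    && a.fieldOffset() == read.fieldOffset()) {
                return false; // matching write: pessimistically assume aliasing
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Access read = new Access(false, "Point", 8);
        List<Access> trace = List.of(
                new Access(true, "Point", 4),  // different offset: no conflict
                new Access(true, "Line", 8));  // different base type: no conflict
        System.out.println(isInvariantRead(read, trace));
        System.out.println(isInvariantRead(read,
                List.of(new Access(true, "Point", 8)))); // matching write
    }
}
```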
We also perform common subexpression elimination (CSE) and
flag redundant instructions, which the backend phase in turn will
suppress during code generation. The rationale for not immediately
eliminating redundant instructions is that at this point the register
pressure is still unknown, and it might be beneficial to leave the
redundant instruction in the code, instead of having to spill due to
excessive register pressure.²

² This condition rarely occurs on the PowerPC architecture, but is more frequent on architectures with few registers such as the Intel 386.
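The flag-rather-than-eliminate scheme can be sketched as a simple value-numbering pass over a linear trace (the representation and value names are hypothetical):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: CSE on a linear trace. Redundant instructions are only flagged;
// the backend may still materialize a flagged instruction later if
// suppressing it would cause a spill under high register pressure.
public class TraceCSE {
    public static void main(String[] args) {
        // {op, left operand, right operand}; "tN" names the N-th result
        String[][] trace = {
            {"ISHL", "i", "2"},   // t0: index -> byte offset
            {"IADD", "a", "t0"},  // t1: address of a[i]
            {"ISHL", "i", "2"},   // t2: recomputed offset, same as t0
            {"IADD", "a", "t2"},  // t3: same as t1 once t2 maps to t0
        };
        Map<String, String> canon = new HashMap<>(); // value -> canonical value
        Map<String, String> seen = new HashMap<>();  // expression -> defining value
        boolean[] redundant = new boolean[trace.length];
        for (int n = 0; n < trace.length; n++) {
            String l = canon.getOrDefault(trace[n][1], trace[n][1]);
            String r = canon.getOrDefault(trace[n][2], trace[n][2]);
            String key = trace[n][0] + "," + l + "," + r;
            String prev = seen.get(key);
            if (prev != null) {
                redundant[n] = true;       // flag only; backend decides
                canon.put("t" + n, prev);  // later uses see the earlier value
            } else {
                seen.put(key, "t" + n);
            }
        }
        System.out.println(Arrays.toString(redundant));
    }
}
```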
One could argue that by performing CSE and loop-invariant code motion (hoisting) on a trace, we actually perform profile-guided partial redundancy elimination (PRE): we perform CSE only along the hot path (trace), exploiting partial redundancies that regular CSE may not be able to eliminate on the entire graph. This observation underlines our claim that compiling code traces is much simpler than generating code in the presence of code graphs.
The final register allocation and code generation phase is somewhat special in that it works bottom-up, starting at the last instruction in the trace and working its way upstream. Register allocation and
code generation are performed simultaneously. The register alloca-
tor is pre-initialized by assigning fixed register locations to all loop
variables (φ instructions), which is necessary to guarantee that val-
ues generated during a loop iteration are properly carried over to
the next iteration.
We try to coalesce both operands of φ pseudo instructions into
the same hardware register to avoid unnecessary register moves at
the end of the trace (loop). This is of course only legal if the live ranges of the two operands are disjoint. An overlap can occur, for example, when the loop code references the previous value of a loop variable after it has already been updated. A frequent cause for this is the unary operators i++ and i--, which increment/decrement a loop counter (in this case i) but return the previous value as the result of the expression. To ensure proper semantics, the register where the loop expects the incoming value to arrive in the next iteration can only be updated with the new value at the very end of the loop. Thus we have to ensure that the two instances are not coalesced into the same register, but that the value is explicitly transferred through a move instruction at the end of the loop.
As with all other data-flow analyses, live range analysis is trivial on
a linear sequence of instructions. We merely record the last use
of each instruction during the initial iteration over the trace code.
The last-use information is maintained as a simple pointer from
each instruction to the last instruction in the trace that uses the
value generated by it. This information is collected by continuously
updating the last use pointer every time a value defined by an
instruction appears as operand of another instruction. Intermediate
uses are overwritten by subsequent uses, leaving the final use in the
pointer field.
Whether the live ranges of the two operands of a φ pseudo instruction (the incoming value and the value generated during each iteration) overlap can easily be determined by comparing the last-use information of the incoming value (context φ pseudo instruction) with the definition of the loop-generated value.
If the loop-generated value of the φ pseudo instruction is defined
after the last use, the two values can be coalesced into the same
register. Otherwise a move instruction is generated at the end of the
loop (trace).
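Both the single-pass last-use analysis and the resulting coalescing test can be sketched as follows (instruction positions and value names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: last-use analysis on a linear trace and the phi-coalescing test.
// Instructions are numbered by trace position; a value's last use is the
// highest position that reads it (later uses overwrite earlier ones).
public class PhiCoalescing {
    static Map<String, Integer> lastUses(String[][] operandsPerInstruction) {
        Map<String, Integer> lastUse = new HashMap<>();
        for (int pos = 0; pos < operandsPerInstruction.length; pos++) {
            for (String value : operandsPerInstruction[pos]) {
                lastUse.put(value, pos); // intermediate uses are overwritten
            }
        }
        return lastUse;
    }

    // the phi operands can share a register iff the incoming value is dead
    // by the time the loop-generated value is defined
    static boolean canCoalesce(Map<String, Integer> lastUse,
                               String incoming, int loopValueDefPos) {
        return lastUse.getOrDefault(incoming, -1) <= loopValueDefPos;
    }

    public static void main(String[] args) {
        // "i = i + 1": i_in is last used exactly where the new value
        // is defined (position 0), so the ranges are disjoint
        String[][] simple = {{"i_in", "const1"}};
        System.out.println(canCoalesce(lastUses(simple), "i_in", 0));

        // "j = i++": the old i_in is read again (position 1) after the
        // new value is defined (position 0), so a move at the end of
        // the loop is required instead of coalescing
        String[][] postInc = {{"i_in", "const1"}, {"i_in"}};
        System.out.println(canCoalesce(lastUses(postInc), "i_in", 0));
    }
}
```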
3.3 Code Generation
The code generator then emits code, starting at the last instruction and moving backwards. Correspondingly, the machine
code is also generated backwards, which has the subtle advantage that the target address of conditional branches is always encountered before the actual branch instruction, eliminating the need to fix up target addresses in a second pass.
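A toy sketch of why backward emission needs no fix-up pass (one address slot per instruction; all addresses hypothetical):

```java
// Sketch: emitting last-to-first into decreasing addresses. By the time a
// guard's conditional branch is emitted, its forward target (the side-exit
// stub, which sits later in memory) already has a final address, so the
// branch offset can be encoded immediately.
public class BackwardsEmitter {
    public static void main(String[] args) {
        int addr = 100;                // end of the code buffer
        int exitStub = --addr;         // 99: side-exit stub, emitted first
        int loopBody = --addr;         // 98: some arithmetic in the loop body
        int guard = --addr;            // 97: guard branching forward to the stub
        int offset = exitStub - guard; // known at emission time, no second pass
        System.out.println(offset);
    }
}
```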
During code generation, we first try to perform constant folding
for each operand reference of the instruction that is currently being
compiled. Traditionally, constant folding is performed at compile
time by executing the dynamic semantics of constant instructions.
In a mixed-mode interpreter, this leads to significant code duplication between the compiler and the virtual machine, because both essentially perform the same task (executing bytecode instructions).
In our system, instead of folding constants in the compiler, we rely