IADD:
    /* Both operands constant: fold and record the value. */
    if constant(son[0]) && constant(son[1]) {
        value = OP(son[0], son[1])
        return
    }
    /* Canonicalize: a single constant operand goes to the right. */
    if constant(son[0])
        swap(son[0], son[1])
    if int_const(son[1]) && imm16(son[1])
        emit(addi, reg(son[0]), value(son[1]))   /* immediate form */
    else
        emit(add, reg(son[0]), reg(son[1]))
Figure 4. The backend iterates over the code bottom-up, which
allows us to use simple pattern matching to emit specialized
instruction forms. The generator function for the IADD bytecode
instruction, for example, first checks whether its left operand is
constant and swaps the operands in that case. This ensures that
if exactly one operand is constant, it is always the right one, as
expected by the addi machine instruction. Only operands that cannot
be folded into a constant are assigned a register using the reg
function. The corresponding code for the definition is emitted later,
once the code generator arrives at the defining instruction.
on the bytecode interpreter to record the value calculated by each
instruction in the trace. The compiler merely tracks a bit indicating
whether each operand is constant, and if all operands of an instruction
are constant, the result previously recorded by the virtual machine
is assumed to be the correct result value of the instruction.
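To make this concrete, the following C sketch illustrates the
constant-tracking scheme; the Insn structure, its field names, and the
fold_constants function are illustrative assumptions, not JamVM's
actual data structures.

    /* Illustrative sketch: the interpreter records every result value
       during trace recording; the compiler merely propagates a
       constness bit and reuses the recorded value when all operands
       are constant. */
    typedef struct Insn {
        struct Insn *son[2];   /* operand definitions */
        int          is_const; /* are all operands constant? */
        long         value;    /* result recorded by the interpreter */
    } Insn;

    static void fold_constants(Insn *i)
    {
        if (i->son[0]->is_const && i->son[1]->is_const) {
            /* The value recorded at trace-recording time is taken to
               be the instruction's constant result; no machine code
               will be emitted for it. */
            i->is_const = 1;
        }
    }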
The only JVML instruction that warrants special treatment is
the IINC instruction. In contrast to all other JVML instructions, it
does not store the value it produces on top of the stack, so
our stack-based recording mechanism cannot capture the value. Instead,
it writes the new value directly into a local variable. Our recording
code is aware of this irregularity and, for IINC instructions, records
the value of the local variable instead of the top of the stack.
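A minimal sketch of this special case follows; OP_IINC is the real
JVML opcode number, but the function and its parameters are
hypothetical illustrations of the recording logic, not the actual
interpreter hook.

    #define OP_IINC 0x84  /* JVML iinc opcode */

    /* Record the value produced by the instruction just interpreted:
       IINC bypasses the operand stack, so its result is read from
       the affected local variable slot instead of the stack top. */
    static long record_result(int opcode, long *stack, int sp,
                              long *locals, int iinc_index)
    {
        if (opcode == OP_IINC)
            return locals[iinc_index];
        return stack[sp - 1];
    }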
Only if constant folding fails do we assign a register to the in-
struction defining the operand, which is the main reason why we
perform code generation in reverse order: it guarantees that we al-
ways encounter all uses of a value before its definition. The
code for the definition is only generated if not all uses could be
resolved through constant folding and specialized instruction pat-
terns. A common example is the addi instruction, which allows one
operand to be embedded directly as an immediate as long as the value
falls within a certain range. If all uses of a value can be embedded
in this fashion, the value never has to be materialized at its
definition site and no register has to be assigned to it. With
forward code generation, we would have to decide whether to emit the
code for a definition before knowing whether the value will actually
be used, which in turn would require analyzing each use of the value.
As code generation continues, we eventually reach the
instruction we just assigned a register to and generate the code that
produces its value into that register. Code is only generated for
instructions that were previously assigned a register. If we are able
to satisfy all uses statically through constant folding, the instruction
has no register assigned when we arrive at it and is
considered dead.
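The following sketch summarizes this lazy register assignment and the
resulting dead-instruction elimination. All names (NO_REG, next_reg,
emit_insn, has_side_effects) are illustrative assumptions about the
scheme described above, not the actual implementation; side-effecting
instructions such as guards are emitted unconditionally.

    #define NO_REG (-1)

    typedef struct Insn {
        struct Insn *son[2];
        int is_const; /* result known at compile time */
        int reg;      /* NO_REG until a use forces materialization */
    } Insn;

    extern int  has_side_effects(Insn *insn);
    extern void emit_insn(Insn *insn, int *next_reg);

    /* Called from the code emitted for a *use*: assign a register to
       the defining instruction on first demand. */
    static int reg(Insn *def, int *next_reg)
    {
        if (def->reg == NO_REG)
            def->reg = (*next_reg)++;
        return def->reg;
    }

    static void generate(Insn *trace, int n)
    {
        int next_reg = 3;                  /* first scratch register */
        for (int i = n - 1; i >= 0; i--) { /* bottom-up over the trace */
            Insn *insn = &trace[i];
            if (insn->is_const)
                continue;                  /* folded at record time */
            if (!has_side_effects(insn) && insn->reg == NO_REG)
                continue;                  /* no surviving use: dead */
            emit_insn(insn, &next_reg);    /* may call reg() on operands,
                                              forcing their definitions
                                              to be emitted later */
        }
    }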
Besides simplifying register allocation, bottom-up code genera-
tion also enables us to perform efficient pattern matching to emit
specialized machine instruction forms (Figure 4). As described
above, we try to fold all operands into constants, and if we find
that one operand of an add instruction is a constant, for exam-
ple, we emit the specialized addi machine instruction (in the case
of our PowerPC backend).
02801008: cmpwi r2,1000
0280100c: bgel- 0x2801034
02801010: addi r3,r3,1
02801014: addi r2,r2,1
02801018: b 0x2801008
Figure 5. The final code emitted by our backend for the example
introduced above. By assigning fixed registers to the φ instructions,
we avoid any trailing move instructions at the end of the loop that
would otherwise be needed to correct the state of the register
allocator before following the loop edge back to the loop header. The
branch instruction in the original bytecode was compiled as a guard
instruction, which exits the compiled code when execution diverges
from the predicted path. We use the link register of the PowerPC to
record the location where this bail-out happened, which allows the VM
to unwind the execution state accordingly.
The resulting machine code for the example introduced in the
previous section is shown in Figure 5.
3.4 Side Exits
The initial version of our compiler did not generate specific ma-
chine code for individual side exits. Guard instructions were trans-
lated to conditional branch-and-link instructions that all pointed to
the same compensation code. Every time a guard instruction is ex-
ecuted (whether it passes or fails), the processor updates the link
register with the current value of the program counter, which is es-
sentially the address of the conditional branch instruction that was
generated for the guard instruction. We can make such liberal use of
the link register because we do not invoke any methods from within
compiled traces. If a guard instruction fails, execution is transferred
to the side-exit compensation code shared by all guard instructions.
The compensation code first has to locate the appropriate register-
to-local-variable mapping information by translating the machine
address recorded by the branch instruction back to the guard
instruction in the intermediate representation, which ultimately
contains the pointer to the register mapping information.
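A sketch of that reverse lookup is shown below; the Guard structure,
its fields, and the simple search are assumptions chosen for
illustration, not JamVM's actual code.

    /* Illustrative sketch of the shared compensation path: map the
       machine address captured in the link register back to the IR
       guard that carries the register-to-local-variable mapping. */
    typedef struct Guard {
        void *branch_addr; /* machine address of the emitted branch */
        void *reg_mapping; /* register -> local variable mapping */
    } Guard;

    static void *find_mapping(Guard *guards, int n, void *link_reg)
    {
        for (int i = 0; i < n; i++)
            if (guards[i].branch_addr == link_reg)
                return guards[i].reg_mapping;
        return 0; /* unreachable if every guard was registered */
    }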
Benchmarks have shown that this approach creates significant
overhead for traces that are frequently exited through side exits
after only a few iterations (or no complete iteration at all). We have
thus changed our code generator to emit dedicated compensa-
tion code for each side exit.
Each side-exit stub consists of a series of memory stores that
transfer values from hardware registers back into the appropriate
local variable storage area.
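Sketched below is what the emission of such a stub might look like;
the Exit and Live structures, the emitter callbacks, and the
word-sized-slot assumption are hypothetical stand-ins for the
backend's actual emitter interface.

    typedef struct Live { int hw_reg; int slot; } Live;
    typedef struct Exit { int n_live; Live *live; } Exit;

    /* Emit one store (PowerPC stw) per live value, moving it from its
       hardware register back into the local variable area addressed
       by a dedicated base register, then branch to the VM's handler. */
    static void emit_exit_stub(Exit *e, int locals_base_reg,
                               void (*emit_stw)(int, int, int),
                               void (*emit_branch)(void *),
                               void *exit_handler)
    {
        for (int i = 0; i < e->n_live; i++)
            emit_stw(e->live[i].hw_reg, locals_base_reg,
                     e->live[i].slot * 4 /* word-sized slots */);
        emit_branch(exit_handler);
    }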
If the side exit occurred while executing an inlined method, the
compensation code also has to create additional stack frames to write
back the current state.
In our implementation we actually write back the values first
and then create the new stack frames "around" the already written-
back stack and local variable information, because most temporary
values are held in hardware registers, which would have to be
flushed to memory before we could call the high-level JamVM
functions responsible for creating new stack frames. As a
consequence, we have to ensure that the garbage collector is not
triggered while we are writing back the stack frames (or while we
execute compiled code). We do so by holding the garbage collector
lock at all times when compiled code is executed.
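The discipline can be summarized as follows; lock_gc, unlock_gc,
write_back_frames, and the Trace structure are hypothetical stand-ins
for the corresponding JamVM facilities.

    typedef struct Trace { long (*entry)(void); } Trace;

    extern void lock_gc(void);
    extern void unlock_gc(void);
    extern void write_back_frames(long exit_state);

    /* Run a compiled trace with the garbage collector locked out, so
       a collection can never observe the partially written-back
       frames produced on a side exit. */
    static long run_compiled_trace(Trace *t)
    {
        lock_gc();                     /* no collection from here on */
        long exit_state = t->entry();  /* execute native trace code */
        write_back_frames(exit_state); /* stores + frame creation */
        unlock_gc();                   /* heap state consistent again */
        return exit_state;
    }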
While in principle this blocks the garbage collector from re-
claiming unused memory while compiled code is executed, and
could thus cause a memory overflow, in practice this does not seem to
be a significant limitation: performance-critical code such as a loop
rarely performs any memory allocations.
The main reason for seeing so few memory allocations inside
loop bodies is the inherent cost of the memory allocation opera-