Java stack and local variables are in fact not updated during trace execution. φ pseudo instructions, in contrast, are by definition loop variant; for our purposes, however, we can still hoist them out of the loop, because only the invariant part of their semantics results in actual code being generated.
φ pseudo instructions have a different semantics, depending on
whether they are executed during initial loop-entry or following a
loop iteration. For the initial loop-entry, similarly to read pseudo
instructions, φ pseudo instructions fetch context information from
the Java stack or local variables. In the case of a loop iteration, however, they carry over the value from the last iteration to the current one. We guarantee the latter, loop-variant semantics through our register allocation algorithm, which assigns φ pseudo instructions the same output register as the loop-carrying input register they are supposed to copy in the case of a loop iteration. As no code generation is necessary to guarantee this property, we can safely move φ instructions out of the loop body.
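The two-phase behavior of a φ pseudo instruction can be illustrated with plain Java control flow (a sketch with hypothetical names, not our actual intermediate representation):

```java
// Sketch: the two semantics of a phi pseudo instruction for a loop
// variable, written out as ordinary Java. All names are hypothetical.
public class PhiSemantics {
    public static void main(String[] args) {
        int contextValue = 10;   // value on the Java stack / in a local
        int i = contextValue;    // loop entry: phi fetches context,
                                 // like a read pseudo instruction
        for (int iter = 0; iter < 3; iter++) {
            int next = i + 1;    // loop body computes the next value
            i = next;            // loop iteration: phi carries the value
                                 // over; allocating 'next' and 'i' to the
                                 // same register makes this copy vanish
        }
        System.out.println(i);   // 10 incremented three times
    }
}
```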
Starting with the read and φ pseudo instructions, all instructions that only depend on loop-invariant inputs are hoisted out of the actual loop body, and are executed before the loop code.
To create more opportunities for invariant code motion, the trace recorder splits complex Java bytecode instructions into sub-instructions, some of which can then be hoisted out of the loop body, or at least shared between similar subsequent instructions. In fact, we did not have to introduce a large number of new instructions, as most of the sub-instructions are simply JVML instructions with relaxed typing (we use IADD for address arithmetic, for example).
A common example of this principle are array access instructions (e.g. IALOAD). Each IALOAD instruction is emitted as a combination of a shift operation (ISHL), which transforms the index into an appropriate byte offset depending on the array element size, and an IADD instruction, which generates an address from the offset and the object reference (the array base address). Subsequent array accesses can then share parts of, or the entire, address calculation. This approach also creates additional opportunities for efficient code generation by the backend, which can fold constant address calculations (e.g. constant index expressions) and emit more compact code using fewer registers.
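As a sketch (with hypothetical names; the ISHL/IADD pairing follows the description above), the address calculation that a single IALOAD decomposes into looks like this:

```java
// Sketch: the ISHL + IADD sub-instruction pair behind one IALOAD.
// For an int array, log2ElemSize is 2 (4-byte elements).
public class ArrayAddress {
    static long elementAddress(long arrayBase, int index, int log2ElemSize) {
        long offset = (long) index << log2ElemSize; // ISHL: index -> byte offset
        return arrayBase + offset;                  // IADD: base + offset
    }

    public static void main(String[] args) {
        // a[3] in an int array whose data starts at (hypothetical)
        // address 0x1000: 4096 + (3 << 2) = 4108. Accesses a[3] and b[3]
        // would share the ISHL result but not the IADD.
        System.out.println(elementAddress(0x1000, 3, 2));
    }
}
```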
Integer modulo instructions (IREM) undergo a similar treatment and are emitted as a combination of an integer division, a multiplication, and a subtraction, since many architectures do not directly support modulo calculation. For architectures that do (e.g. x86), the backend can easily detect and compact the resulting code pattern, while architectures such as PowerPC that have no modulo machine instruction can optimize the resulting instruction stream, e.g. through common subexpression elimination.
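The lowering is simply the identity a % b == a − (a / b) · b under truncating division, which also matches Java's IREM semantics for negative operands. A minimal sketch:

```java
// Sketch: IREM lowered to the IDIV/IMUL/ISUB sub-instruction sequence.
public class RemLowering {
    static int irem(int a, int b) {
        int quotient = a / b;        // IDIV (truncates toward zero)
        int product = quotient * b;  // IMUL
        return a - product;          // ISUB
    }

    public static void main(String[] args) {
        System.out.println(irem(7, 3));   // same as 7 % 3
        System.out.println(irem(-7, 3));  // sign follows the dividend
    }
}
```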
To guarantee proper semantics for memory accesses, we only consider a memory access to be loop invariant if no matching read/write exists with the same base type and field offset (or absolute address in the case of static fields). While this approach is rather pessimistic and does not take any high-level type information into account, it still allows us to hoist most memory reads/writes out of the loop without having to invest a lot of resources in proper alias analysis.
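A minimal sketch of this pessimistic test (all type names and offsets hypothetical):

```java
import java.util.List;

// Sketch: a read is treated as loop invariant only if no write in the
// trace matches it on base type and field offset.
public class MemInvariance {
    record Access(boolean isWrite, String baseType, int fieldOffset) {}

    static boolean isInvariantRead(Access read, List<Access> trace) {
        for (Access a : trace) {
            if (a.isWrite()
                    && a.baseType().equals(read.baseType())
                    && a.fieldOffset() == read.fieldOffset()) {
                return false; // matching write: pessimistically assume aliasing
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Access read = new Access(false, "Point", 8);
        List<Access> trace = List.of(
                new Access(true, "Point", 4),  // different offset: no conflict
                new Access(true, "Line", 8));  // different base type: no conflict
        System.out.println(isInvariantRead(read, trace));
        System.out.println(isInvariantRead(read,
                List.of(new Access(true, "Point", 8)))); // matching write
    }
}
```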
We also perform common subexpression elimination (CSE) and
flag redundant instructions, which the backend phase in turn will
suppress during code generation. The rationale for not immediately
eliminating redundant instructions is that at this point the register
pressure is still unknown, and it might be beneficial to leave the
redundant instruction in the code, instead of having to spill due to
excessive register pressure.²

² This condition rarely occurs on the PowerPC architecture, but is more frequent on architectures with few registers such as the Intel 386.
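The flag-rather-than-eliminate scheme can be sketched as a simple value-numbering pass over a linear trace (the representation and value names are hypothetical):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch: CSE on a linear trace. Redundant instructions are only flagged;
// the backend may still materialize a flagged instruction later if
// suppressing it would cause a spill under high register pressure.
public class TraceCSE {
    public static void main(String[] args) {
        // {op, left operand, right operand}; "tN" names the N-th result
        String[][] trace = {
            {"ISHL", "i", "2"},   // t0: index -> byte offset
            {"IADD", "a", "t0"},  // t1: address of a[i]
            {"ISHL", "i", "2"},   // t2: recomputed offset, same as t0
            {"IADD", "a", "t2"},  // t3: same as t1 once t2 maps to t0
        };
        Map<String, String> canon = new HashMap<>(); // value -> canonical value
        Map<String, String> seen = new HashMap<>();  // expression -> defining value
        boolean[] redundant = new boolean[trace.length];
        for (int n = 0; n < trace.length; n++) {
            String l = canon.getOrDefault(trace[n][1], trace[n][1]);
            String r = canon.getOrDefault(trace[n][2], trace[n][2]);
            String key = trace[n][0] + "," + l + "," + r;
            String prev = seen.get(key);
            if (prev != null) {
                redundant[n] = true;       // flag only; backend decides
                canon.put("t" + n, prev);  // later uses see the earlier value
            } else {
                seen.put(key, "t" + n);
            }
        }
        System.out.println(Arrays.toString(redundant));
    }
}
```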
One could argue that by performing CSE and loop-invariant code motion (hoisting) on a trace, we actually perform profile-guided partial redundancy elimination (PRE): we perform CSE only along the hot path (trace), exploiting partial redundancies that regular CSE may not be able to eliminate on the entire graph. This observation underlines our claim that compiling code traces is much simpler than generating code in the presence of code graphs.
The final register allocation and code generation phase is somewhat special in that it works bottom-up, starting at the last instruction in the trace and working its way upstream. Register allocation and
code generation are performed simultaneously. The register alloca-
tor is pre-initialized by assigning fixed register locations to all loop
variables (φ instructions), which is necessary to guarantee that val-
ues generated during a loop iteration are properly carried over to
the next iteration.
We try to coalesce both operands of φ pseudo instructions into
the same hardware register to avoid unnecessary register moves at
the end of the trace (loop). This is of course only legal if the live ranges of the two operands are disjoint. An overlap can occur, for example, when the loop code references the previous value of a loop variable after it has already been updated. A frequent cause for this is the unary operators i++ and i--, which increment/decrement a loop counter (in this case i) but return the previous value as the result of the expression. To ensure proper semantics, the register where the loop expects the incoming value to arrive in the next iteration can only be updated with the new value at the very end of the loop. Thus we have to ensure that the two instances are not coalesced into the same register, but that the value is explicitly transferred through a move instruction at the end of the loop.
As with all other data-flow analyses, live range analysis is trivial on
a linear sequence of instructions. We merely record the last use
of each instruction during the initial iteration over the trace code.
The last-use information is maintained as a simple pointer from
each instruction to the last instruction in the trace that uses the
value generated by it. This information is collected by continuously
updating the last use pointer every time a value defined by an
instruction appears as operand of another instruction. Intermediate
uses are overwritten by subsequent uses, leaving the final use in the
pointer field.
Whether the live ranges of the two operands of a φ pseudo instruction (the incoming value and the value generated during each iteration) overlap can easily be determined by comparing the last-use information of the incoming value (context φ pseudo instruction) with the definition of the loop-generated value.
If the loop-generated value of the φ pseudo instruction is defined
after the last use, the two values can be coalesced into the same
register. Otherwise a move instruction is generated at the end of the
loop (trace).
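Both the single-pass last-use analysis and the resulting coalescing test can be sketched as follows (instruction positions and value names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: last-use analysis on a linear trace and the phi-coalescing test.
// Instructions are numbered by trace position; a value's last use is the
// highest position that reads it (later uses overwrite earlier ones).
public class PhiCoalescing {
    static Map<String, Integer> lastUses(String[][] operandsPerInstruction) {
        Map<String, Integer> lastUse = new HashMap<>();
        for (int pos = 0; pos < operandsPerInstruction.length; pos++) {
            for (String value : operandsPerInstruction[pos]) {
                lastUse.put(value, pos); // intermediate uses are overwritten
            }
        }
        return lastUse;
    }

    // the phi operands can share a register iff the incoming value is dead
    // by the time the loop-generated value is defined
    static boolean canCoalesce(Map<String, Integer> lastUse,
                               String incoming, int loopValueDefPos) {
        return lastUse.getOrDefault(incoming, -1) <= loopValueDefPos;
    }

    public static void main(String[] args) {
        // "i = i + 1": i_in is last used exactly where the new value
        // is defined (position 0), so the ranges are disjoint
        String[][] simple = {{"i_in", "const1"}};
        System.out.println(canCoalesce(lastUses(simple), "i_in", 0));

        // "j = i++": the old i_in is read again (position 1) after the
        // new value is defined (position 0), so a move at the end of
        // the loop is required instead of coalescing
        String[][] postInc = {{"i_in", "const1"}, {"i_in"}};
        System.out.println(canCoalesce(lastUses(postInc), "i_in", 0));
    }
}
```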
3.3 Code Generation
The code generator then emits code, starting at the last instruction and moving backwards. Correspondingly, the machine
code is also generated backwards, which has the subtle advantage that the target address of conditional branches is always encountered before the actual branch instruction, eliminating the need to fix up target addresses in a second pass.
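A toy sketch of why backward emission needs no fix-up pass (one address slot per instruction; all addresses hypothetical):

```java
// Sketch: emitting last-to-first into decreasing addresses. By the time a
// guard's conditional branch is emitted, its forward target (the side-exit
// stub, which sits later in memory) already has a final address, so the
// branch offset can be encoded immediately.
public class BackwardsEmitter {
    public static void main(String[] args) {
        int addr = 100;                // end of the code buffer
        int exitStub = --addr;         // 99: side-exit stub, emitted first
        int loopBody = --addr;         // 98: some arithmetic in the loop body
        int guard = --addr;            // 97: guard branching forward to the stub
        int offset = exitStub - guard; // known at emission time, no second pass
        System.out.println(offset);
    }
}
```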
During code generation, we first try to perform constant folding
for each operand reference of the instruction that is currently being
compiled. Traditionally, constant folding is performed at compile
time by executing the dynamic semantics of constant instructions.
In a mixed-mode interpreter, this leads to significant code duplication between the compiler and the virtual machine, because both essentially perform the same task (executing bytecode instructions).
In our system, instead of folding constants in the compiler, we rely