HotpathVM: An Effective jit compiler for Resource-constrained Devices

Yüklə 217,56 Kb.

Pdf görüntüsü

səhifə	8/8
tarix	07.11.2018
ölçüsü	217,56 Kb.
	#78936

1 2 3 4 5 6 7 8

Figure 8. Code duplication resulting from trace-based compila-

tion. A compiler designed to compiler whole control-ﬂow graphs

is able to generate shared code for F , while our trace compiler has

to generate versions of F separately for each trace. While this is

a disadvantage from a code density perspective, it does allow our

compiler to individually optimize each version of F , according to

its speciﬁc pre-context (B − D and C − E).

C

D

Figure 9. Handling of nested loops. As inner loops tend to be

executed more frequently than the outer parts of the loop, the trace

through the inner loop is recognized as the primary trace. The loop

header of the inner loop B becomes the sole loop header. When C

branches to D instead of looping back to B, we start recording a

child trace and connect it back to A once the inner loop is entered

again.

Once we support trace merging, the question arises how this

approach is still different from compilation of whole control-ﬂow

graphs. The two most signiﬁcant advantages over traditional com-

pilation are that paths are split and optimized individually (down-

stream), and that relevant paths are compiled and optimized only,

whereas compilers designed to compile whole control-ﬂow graphs

often compile entire ”hot” methods, even though only parts of the

method are in fact hot, while certain other parts are rarely or never

executed.

Benchmarks

We have implemented a prototype of the trace-based HotpathVM

JIT compiler. While initially our prototype was built based on Sun’s

KVM [24, 23] virtual machine, the current implementation we

are reporting on in this section has been ported to JamVM [17].

Similarly to KVM, JamVM is a small virtual machine suitable

for embedded devices. In contrast to traditional JIT compilers that

heavily depend on the internals of the virtual machine they are

hosted by, and are deeply interwoven with virtual machine code,

for both, KVM and JamVM, we only had to touch approximately

20 lines of the VM source code to add our compiler to the VM. The

most relevant change was to invoke our trace recorder every time a

branch instruction is encountered, to decide whether to start a new

recording, or to execute an already existing trace. The remainder

of the changes signal trace abort conditions to the JIT compiler, for

example when exceptions are raised or a native method is executed.

The JIT compiler itself consists of roughly 1800 lines of C

code, from which 900 lines are used by the frontend (phase 1

and 2), and 900 by the backend (phase 3, PowerPC). Our initial

implementation that performed manual constant folding instead of

recording constant values through the VM stack was 700 lines

large, which is almost a 50% savings over the current size of the

JIT compiler.

Compiled to native code, our JIT compiler has a footprint of

approximately 50 kBytes, which is noteworthy for a compiler that

implements a series of aggressive optimizations that were previ-

ously inaccessible to embedded mobile code frameworks. For com-

parison, JamVM alone consists of over 20,000 lines of C code

(180 kBytes of PowerPC native code) [13].

Additionally to the static size of the code, during compilation

our JIT compiler requires approximately 48 bytes of heap per in-

struction in the trace. The space is freed up in between compiler

runs. To cache compiled traces, we use a 64 kBytes code buffer,

bringing our overall minimum memory consumption to approxi-

mately 128 kBytes.

We have measured the performance of our Hotpath Virtual Ma-

chine for a set of benchmark programs from the SciMark2 [19]

and Java Grande [18] benchmark suites. The set contains programs

that are ideal for trace compilation (SOR, LU) but also programs

that can be considered the worst-case scenario for trace compila-

tion (NumericSort and MonteCarlo), because they contain branches

without clear preferred branch direction (taken and not taken are

both equally likely).

The performance was compared to the standard interpreting

Java VM, JamVM 1.3.3, which is one of the fastest purely interpret-

ing Java virtual machines, and ﬁnally Sun’s HotSpot just-in-time

compiler (Figure 10). Ideally, we would have preferred to com-

pare our trace-based compiler to a commercial just-in-time com-

piler such as Sun’s CLDC HotSpot VM [15, 20] as well. Unfor-

tunately, however, Sun was unable to license CLDC Hotspot to a

non-commercial entity such as the University of California in ei-

ther binary or source code form.

As expected, our JIT compiler performs well on highly regu-

lar and sequential programs such as LU, SOR, and Linpack. FFT

and SparseCompRow still yield a speedup of factor two over in-

terpretation but suffer from the lack of long traces that could be

exploited. We are unable to record meaningful traces for Monte-

Carlo and NumericSort because our current prototype does not sup-

port trace merging across (inlined) method invocations, which both

benchmarks would require.

Our compiler is efﬁcient for pure Java, but fails to optimize a

single trace in the presence of native method invocations (because

we abort trace recording when encountering a native method invo-

cation). This points to a signiﬁcant problem regarding the current

Java libraries. Because Java interpreters are generally relatively

slow, some performance critical methods (such as arraycopy)

are implemented in native C. This ”optimization” improves perfor-

mance in case of interpretation, but inhibits our compiler from

performing actual JIT compilation, resulting in very poor per-

formance. To improve performance, we will likely have to re-

implement all relevant parts of the Java library that do not rely

on unsafe language constructs in pure Java. This would be partic-

ularly beneﬁcial in an embedded context since Java code is often

100

150

200

250

SOR

SparseCompR

Linpack

NumSort

MonteCarlo

execution time [s]

Java -Xint

JamVM

JamVM/TJIT

Java/Hotspot

Figure 10. Execution time for a set of benchmark programs from

the SciMark2 and Java Grande benchmark suites, executed by for

pure interpretation with the standard Java VM (java -Xint),

JamVM 1.3.3 (jamvm), our trace-based JIT compiler (TJIT), and

ﬁnally Sun’s Hotspot VM (Hotspot). As expected, our JIT com-

piler performs well on highly regular and sequential programs such

as LU, SOR, and Linpack. FFT and SparseCompRow still yield a

speedup of factor two over interpretation but suffer from the lack of

long traces that could be exploited. We are unable to record mean-

ingful traces for MonteCarlo and NumericSort because our current

prototype does not support trace merging across (inlined) method

invocations, which both benchmarks would require.

signiﬁcantly more compact than equivalent pre-compiled native

code.

Currently, we do not plan on adding support for inlining native

methods to our compiler. Previous work in this area [22] has shown

that efﬁcient inlining of native code blocks requires translating

the machine code instructions inside the native method into the

higher-level intermediate representation used internally by the JIT

compiler. In essence this amounts to decompiling the native code,

so that it can be (re-)optimized, specialized, and interwoven with

the surrounding Java code. It is unlikely that this analysis effort

would pay off in an embedded context.

Related Work

Our research is related to a number of existing works and projects.

The main inspiration for our trace selection and recording mecha-

nism was Dynamo [3]. Our system differs from Dynamo, because

we record bytecode traces, and then compile them to native ma-

chine code, whereas Dynamo performs dynamic optimization on

native machine code, emitting optimized native machine code. Our

actual trace recording algorithms also differ from Dynamo, be-

cause we have more high-level information available than Dynamo,

which only sees native machine code instructions.

Our approach of bottom-up code generation is closely related to

Burg [11, 10] that extends work by [1, 2]. Just like BEG [8] and

Twig [1], Burg works on tree grammars and generates a tree parser,

which in turn makes two passes over the tree for code generation.

The main difference for all these approaches with respect to our

work is that they all need a structured IR to work, while our input

is a mere linear sequence of instructions (that is read backwards).

Our approach also combines code generation, constant folding,

and register allocation, and all three steps are performed in inter-

2

4

SOR

SparseCompR

Linpack

NumSort

MonteCarlo

speedup

Java -Xint

JamVM

JamVM/TJIT

Java/Hotspot

Figure 11. Speedup of trace-based JIT compilation over pure in-

terpretation. For highly regular programs our compiler achieves a

speedup of 7-11 over a fast interpreter. For code with signiﬁcant

side exists the speedup drops to factor 3. For highly irregular code

our JIT compiler is unable to extract useful traces and does not im-

prove execution time.

lock in a single pass over the code. We can perform all three

steps in a single pass because we iterate backwards over the code,

guaranteeing that we will always see all uses before the actual

deﬁnition, which we utilize for on-the-ﬂy dead code analysis and

The work by Chang et al. [5] on superblock formation and

optimization is closely related to our loop compilation mechanism.

Similar to Chang, we transform loop code into suitable single-entry

traces through re-tracing and duplicating the loop body along each

possible execution path. Our work differs from Chang in that we do

so at runtime and lazily: we only add additional child traces to the

loop construct as we encounter them dynamically at runtime.

A number of existing virtual machines target the same mobile

domain our work aims at. Sun has produced its own JIT com-

piler for the Kilobyte Virtual Machine, called KJIT [21]. KJIT is a

lightweight dynamic compiler. Similarly to our work, it only com-

piles a subset of bytecodes to reduce the size of the compiler. In

contrast to our JIT compiler, however, KJIT does not perform any

signiﬁcant code optimization but merely maps bytecode instruc-

tions to machine code sequences. Also, KJIT seems to be an inter-

nal research project only and we have not been able to use it for

comparative benchmarks.

Sun’s current implementation of an embedded JIT compiler is

called CLDC Hotspot VM [15, 20]. Unfortunately, very little is

known about the internal details of this compiler. According to

Sun’s white papers, CLDC Hotspot performs some basic optimiza-

tions including constant folding, constant propagation, and loop

peeling, while our compiler also applies common subexpression

elimination and invariant code motion.

Other VM’s for the embedded domain include E-Bunny [7] and

Jeode EVM [16]. E-bunny is a simple, non-optimizing compiler for

x86, that uses stack-based code generation, which is very fast as far

as compile time is concerned, but yields poor code quality in com-

parison to optimizing compilers. Jeode is an optimizing compiler

that uses a simpliﬁed form of dynamic compilation. Unfortunately,

little is known about its internals.

Summary and Conclusion

We have presented an approach to just-in-time compilation that

records frequently executed code traces and compiles them to

executable native code. Instead of supporting generalized code

graphs, our compiler is highly specialized to support only instruc-

tion traces, which greatly simpliﬁes the algorithms involved in

compilation and code optimization. Whereas traditional systems

often compile entire methods, regardless of whether all parts of the

method are hot or not, our approach focuses solely on hot code

paths and leaves cold regions to the interpreter.

Furthermore, our approach opens up additional optimization po-

tential by not allowing irrelevant code parts (such as rarely executed

exception edges) to interfere with the code quality of hot areas. Ad-

ditionally, compiler analyses and optimizations that are complex

and costly when applied to graphs of basic blocks are trivial to im-

plement when the input is restricted to traces.

To support irregular control ﬂow, our system implements a trace

merging algorithm that compiles secondary traces when an initial

trace is exited along a non-exception edge. Secondary traces are

then directly attached to the parent trace and future executions

along this new path will be nearly as efﬁcient as continuing along

the edges of the original primary trace.

We plan to port our JIT compiler to additional target architec-

tures. Due to the simplicity and small size of the compiler (which

is a direct result of limiting the code input to traces), we expect to

be able to add additional backends without much effort. A Strong-

ARM backend would be particularly interesting as it would allow

a direct comparison with existing embedded virtual machines such

as Jeode EVM.

Preliminary performance results from our prototype implemen-

tation are highly encouraging, and position our approach favorably

among the price/performance trade-offs made by embedded just-

in-time compilers. As a simple and effective technique for reduc-

ing the size of optimizing just-in-time compilers, and thereby the

trusted code base of critical systems, our approach may even be

applicable outside of the embedded systems space.

Acknowledgment

This research effort was partially funded by the State of California

and Microsoft Research under the Microeletronics Innovation and

Computer Research Opportunities (MICRO) program, the Bavaria

California Technology Center (BacaTec), and the National Science

Foundation (NSF) under grants TC-0209163 and ITR-0205712.

The U.S. Government is authorized to reproduce and distribute

reprints for Governmental purposes notwithstanding any copyright

annotation thereon.

The views and conclusions contained herein are those of the

authors and should not be interpreted as necessarily representing

the ofﬁcial policies or endorsements, either expressed or implied,

of the National Science Foundation or any other agency of the U.S.

Government.

References

[1] A. V. Aho, M. Ganapathi, and S. W. K. Tjiang. Code Generation

Using Tree Matching and Dynamic Programming. ACM Transactions

on Programming Languages and Systems

, 11(4):491–516, Oct. 1989.

[2] A. Appel. Concise Speciﬁcation of Locally Optimal Code Generators.

Technical Report CS-TR-080-87, Princeton University, 1987.

[3] V. Bala, E. Duesterwald, and S. Banerjia. Transparent Dynamic

Optimization: The Design and Implementation of Dynamo

. Technical

Report HPL-1999-78, Hewlett Packard Laboratories, June 1999.

[4] J. R. Bell. Threaded code. Communications of the ACM, 16(6):370–

372, 1973.

[5] P. P. Chang, S. A. Mahlke, and W.-M. W. Hwu. Using Proﬁle

Information to Assist Classic Code Optimizations. Software—

Practice and Experience

, 21(12):1301–1321, December 1991.

[6] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K.

Zadeck. Efﬁciently Computing Static Single Assignment Form and

the Control Dependence Graph. ACM Transactions on Programming

Languages and Systems

, 13(4):451–490, October 1991.

[7] M. Debbabi, A. Mourad, and N. Tawbi. Armed E-Bunny: a selective

dynamic compiler for embedded java virtual machine targeting ARM

processors. In SAC ‘05: Proceedings of the 2005 ACM symposium

on Applied computing

, pages 874–878, New York, NY, USA, 2005.

ACM Press.

[8] H. Emmelmann, F.-W. Schr¨oer, and L. Landwehr. BEG: A Generator

for Efﬁcient Back Ends.

In Proceedings of the SIGPLAN ’89

Conference on Programming Language Design and Implementation

pages 227–237, 1989.

[9] J. A. Fisher. Trace Scheduling: A Technique for Global Microcode

Compaction. IEEE Transactions on Computers, C-30(7):478–490,

1981.

[10] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering

a Simple, Efﬁcient Code-Generator Generator. ACM Letters on

Programming Languages and Systems

, 1(3):213–226, September

1992.

[11] C. W. Fraser, R. R. Henry, and T. A. Proebsting. BURG: Fast Optimal

Instruction Selection And Tree Parsing. ACM SIGPLAN Notices,

27(4):68–76, Apr. 1992.

[12] Free Software Foundation. GNU c compiler, Dec. 2005. http:

//gcc.gnu.org

[13] A. Gal. KVM [24] compiled for PowerPC/Linux with GNU C

Compiler 3.2, April 2005.

[14] A. Gal. Measured on a Sharp Zaurus PDA, 206 MHz SA-1110, 64MB

RAM, using a modiﬁed subset of SpecJVM98, July 2005.

[15] CLDC HotSpot Implementation Virtual Machine, available at

http://java.sun.com/j2me/docs/pdf/CLDC-HI

whitepaper-February 2005.pdf

, Feb. 2005.

[16] Insignia Solutions. Jeode Platform: Java for Resource-constrained

Devices, White Paper, 2002.

[17] R. Lougher. JamVM Virtual Machine. http://jamvm.sf.net/,

Nov. 2005.

[18] J. A. Mathew, P. D. Coddington, and K. A. Hawick. Analysis and

Development of Java Grande Benchmarks. In Proceedings of the

ACM 1999 Java Grande Conference, San Francisco

, 1999.

[19] R. Pozo and B. Miller. SciMark2 http://math.nist.gov/

scimark2

, Mar. 2004.

[20] K. Schmid. Jbed Micro Edition CLDC and Jbed Proﬁle for MID.

Technical report, Esmertec AG, Dubendorf, Switzerland, 2002.

[21] N. Shaylor. A Just-in-Time Compiler for Memory-Constrained Low-

Power Devices. In Proceedings of the 2nd Java and Virtual Machine

Research and Technology Symposium

, pages 119–126, Berkeley, CA,

USA, 2002. USENIX Association.

[22] L. Stepanian, A. D. Brown, A. Kielstra, G. Koblents, and K. Stoodley.

Inlining java native calls at runtime. In VEE ‘05: Proceedings of the

1st ACM/USENIX international conference on Virtual execution

environments

, pages 121–131, New York, NY, USA, 2005. ACM

Press.

[23] SUN J2ME’s Homepage. http://java.sun.com/j2me.

[24] Sun Microsystems. J2ME Building Blocks for Mobile Devices, White

Paper on KVM and the Connected, Limited Device Conﬁguration

http://java.sun.com/products/cldc/wp/KVMwp.

pdf

, May 2000.

Yüklə 217,56 Kb.

Dostları ilə paylaş:

1 2 3 4 5 6 7 8