Figure 4: Performance evaluations. (a) Circuit performance vs Legion (GFLOPS vs total CPUs). (b) PENNANT performance vs OpenMP (zones per second, in millions, vs total CPUs). (c) MiniAero single-node performance vs MPI+Kokkos (cells per second, in millions, vs total CPUs). (d) MiniAero multi-node performance vs MPI+Kokkos (cells per second, in millions, vs total nodes, 8 CPUs per node).
which are generally compute or memory bound, certain per-
formance critical kernels in PENNANT contain long chains
of dependent math instructions, which in turn depend on
conditional memory accesses (when dynamic branch elision
is not enabled). At 10 cores, throughput improves by 15%
if dynamic branches can be eliminated.
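The kind of transformation dynamic branch elision performs can be sketched in C. This is an illustrative example of the general technique, not PENNANT's actual kernel code: a data-dependent conditional load is replaced by an unconditional pair of loads and an arithmetic blend, removing the control dependence that stalls the dependent math chain.

```c
#include <assert.h>

/* Branchy form: which value is loaded depends on a data-dependent
   condition, so the long chain of dependent math behind this load
   stalls on the branch. */
double gather_branchy(const double *a, const double *b, int cond, int i) {
    if (cond)
        return a[i];
    return b[i];
}

/* Branch-elided form: load both candidates unconditionally and
   blend them arithmetically. No control dependence remains, so the
   compiler is free to emit a vector select for the surrounding
   loop. (Valid only when both a[i] and b[i] are in bounds.) */
double gather_elided(const double *a, const double *b, int cond, int i) {
    double m = (double)(cond != 0);     /* 1.0 or 0.0 */
    return m * a[i] + (1.0 - m) * b[i];
}
```

Both forms compute the same value; the elided form trades an extra load for straight-line, vectorizable code, which is profitable exactly when the branch would otherwise break a dependent instruction chain.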
6.3 MiniAero
MiniAero [3] is a computational fluid dynamics mini-app
that uses a Runge-Kutta fourth-order time marching scheme
to solve the compressible Navier-Stokes equations on a 3D
unstructured mesh. The baseline version of MiniAero is im-
plemented as a hybrid code, using MPI for inter-node com-
munication and Trilinos Kokkos [21] for intra-node paral-
lelism. Figures 4c and 4d compare performance between a
Regent implementation and the baseline MPI+Kokkos ver-
sion on a problem size with 4M cells and 13M faces running
on up to 4 nodes on Certainty. The Regent implementation
for MiniAero is approximately 30% shorter by lines of code
than the reference.
Regent outperforms MPI+Kokkos on 8 cores by a factor
of 2.8X through the use of a hybrid SOA-AOS data layout,
an approach similar to that taken in [10]. The improved
data layout substantially boosts cache reuse and improves
utilization of memory bandwidth.
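The hybrid SOA-AOS idea can be illustrated with a small C sketch. The field names here are hypothetical, chosen for illustration rather than taken from MiniAero's actual data structures: scalar fields touched by separate kernels get their own dense arrays, while fields always consumed together stay grouped.

```c
#include <assert.h>
#include <stddef.h>

/* Pure AOS: every field of a cell is stored inline, so a kernel
   that reads only density strides over 40 bytes per element and
   wastes most of each cache line on unused fields. */
struct CellAOS {
    double density, energy, vx, vy, vz;
};

/* Hybrid SOA-AOS: density and energy get their own dense arrays
   (SOA), while the velocity components, which are always consumed
   together, remain grouped (AOS). A density-only kernel now
   streams 8 bytes per element. */
struct Vel { double vx, vy, vz; };
struct CellsHybrid {
    double *density;   /* SOA: 8-byte stride for a density kernel */
    double *energy;    /* SOA: 8-byte stride for an energy kernel */
    struct Vel *vel;   /* AOS group: 24-byte stride, all bytes used */
};
```

The layout choice is per-field-group rather than all-or-nothing, which is what lets cache lines carry only the data a given kernel actually touches.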
Figure 5b demonstrates that MiniAero is sensitive to the
combination of index launches and mapping elision. When
both optimizations are disabled, the code runs serially. Mini-
Aero is otherwise relatively resilient to the choice of opti-
mizations performed. MiniAero is not significantly impacted
by dynamic branch elision (not shown).
6.4 Limitations
While the Regent compiler significantly simplifies aspects
Figure 5: Knockout experiments: (a) Circuit (10 CPUs, GFLOPS); (b) MiniAero (8 CPUs, cells per second in millions); (c) PENNANT (10 CPUs, zones per second in millions). Each bar corresponds to a set of enabled optimizations, ranging from none through all-minus-one to all. The red line in each graph shows the best sequential Regent performance.
of writing programs for the Legion programming model, our current
implementation does have limitations. The most important
one is that Legion’s support for simultaneous access to re-
gions by tasks is not currently implemented in Regent [9].
Briefly, simultaneous access allows Legion programs to im-
plement SPMD style patterns, such as are typically used in
UPC [14, 2], Titanium [35] and MPI [31] programs, where
multiple concurrently executing tasks share access to the
same region, mediated by explicit synchronization opera-
tions.
Simultaneous access exists in Legion to address a
problem that any dynamic task-based model with subtasks
(i.e., not just Legion) faces.
The performance of a task that launches subtasks depends on the number of subtasks launched and the duration of those subtasks; the runtime overhead incurred is proportional to the number of subtasks launched, and so the more subtasks a task launches, the longer those subtasks must run to keep the ratio of runtime overhead to useful work low. When launching
very large numbers of tasks, say across a petascale super-
computer, the runtime overheads will dominate the useful
computation of the tasks unless the tasks run for a long
time. For applications in which the bulk of the available par-
allelism comes from repeated and regular data-parallel task
launches, the SPMD structure allows a very large number of
long-running tasks to be launched at application start, and
then be further subdivided using the functional-style task
calls we have shown in our examples in this paper (where
each task has exclusive access to its region arguments [9]).
In our experience, such applications should be written in the SPMD style at the outermost level of control if they are to scale past somewhere between 10 and 100 nodes, depending on the application domain. Because of that, and because
the optimizations described in this paper are effective even
on a single node, we have focused on experiments on one
or a small number of nodes. There is no technical barrier
to adding support for simultaneous access to Regent as it
already exists in Legion. This approach is known to provide
scalability to more than 10K nodes in Legion [7], and it may
be the solution we eventually adopt, but simultaneous access
does add another dimension of complexity to the program-
ming model that we would like to avoid in Regent if possible.
As future work we are interested in whether it is possible to
provide scalability to large numbers of nodes in Regent for
applications with regular (and repetitive) data parallel tasks
by compiler transformations that target Legion’s simultane-
ous coherence without exposing that feature to the Regent
programmer.
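The overhead argument above can be made concrete with a toy model. This is our illustration, not a measured Legion cost model: assume each subtask pays a fixed runtime overhead c, so a subtask of duration t spends the fraction t / (t + c) of its wall-clock time on useful work.

```c
#include <assert.h>

/* Fraction of time spent on useful work for a subtask of duration
   t when each launch costs a fixed overhead c (same time units). */
double efficiency(double t, double c) {
    return t / (t + c);
}

/* Minimum task duration needed to keep the overhead fraction at or
   below a target f: solving c / (t + c) <= f gives
   t >= c * (1 - f) / f. */
double min_task_time(double c, double f) {
    return c * (1.0 - f) / f;
}
```

For example, holding overhead to 1% of execution time requires tasks that run 99 times longer than the per-launch cost, which is why a petascale run with short tasks needs the long-running SPMD shell described above.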
7. RELATED WORK
Legion [9], though implemented as a runtime system, is
designed to be a programming model for which many prop-
erties could in principle be statically checked. In particular,
there is a static type system for key Legion abstractions [33].
Regent extends that work into a complete language and com-
piler, with a focus on productivity and performance. Among
other things, partitions in Regent are first-class and can
be passed to subtasks, facilitating certain interesting design
patterns impossible in the initial type system.
Legion is itself the spiritual successor of Sequoia [22]. Most
notably, Sequoia, in contrast to Legion, was entirely static:
the application, machine specification, and mapping from
application to machine were all given as inputs to the com-
piler, which produced an optimized (but entirely static) ex-
ecutable.
Legion grew out of the difficulties encountered
in adapting Sequoia to irregular, and thus more dynamic,
applications [8]. Regent itself differs from the Sequoia com-
piler [29] in that the compiler plays a different role with
respect to the runtime system, facilitating the efficient op-
eration of the runtime, rather than the other way around.
As a result, the optimization needs of each system differ.
Deterministic Parallel Java (DPJ) [11, 12] is a parallel
extension to the Java programming language adding sup-
port for regions. Regions in DPJ also express locality, as
they do in Regent. However, DPJ is a fully static system,
while Regent is a hybrid system with a static type system
and compiler optimizations but also an aggressively optimiz-
ing dynamic runtime. As a result, while Regent’s semantic
safety properties must be enforced at compile time, Regent
is free to discover parallelism at runtime, giving it the ability
to exploit dynamic information about the application when
insufficient static information is available.
An older but related language is Jade [30]. Jade provides
an apparently-serial programming model that implicitly par-
allelizes through the use of a dynamic dataflow graph. Jade
differs substantially in design and implementation from Le-
gion because of the characteristics of the hardware at the
time, when processors were much slower relative to net-
works and it was plausible to track dynamic dataflow on
a per-object basis. Regent addresses these challenges by ag-
gregating data with logical regions and providing language
constructs for hierarchical decomposition of those regions.
Regent’s design also allows it to employ a hierarchical, dis-
tributed scheduling algorithm [33].
Swift/T [34, 5] (not to be confused with Apple Swift) is
a more recent effort focusing on high-productivity scripted
parallel workflows. Both Swift/T and Regent employ an im-
plicitly parallel programming model, and both achieve par-
allelism through dataflow. Swift/T differs from Regent in
that it is purely functional, while Regent allows tasks to have
side-effects on regions, but serializes execution when tasks
have the potential to interfere. Unlike regions, Swift/T’s
aggregate data types do not have a first-class concept of
partitioning, as in Regent. Beyond this, while both Swift/T
and Regent optimize for parallelism, Swift/T makes no
attempt to generate efficient sequential code while Regent is
able to match the performance of hand-tuned and manually
vectorized kernels.
Other runtime systems with support for task-based pro-
gramming include Charm++ [27], Uintah [19], StarPU [6],
and OCR [4]. While these systems differ substantially in the
details, they all provide a common abstraction of a task as a
fundamental unit of execution, with a graph of dependencies
between tasks providing communication and synchroniza-
tion. These systems differ from Regent in that, as dynamic
runtime systems, they impose a small but potentially signif-
icant runtime overhead which must be overcome to achieve
performance. Beyond this, these systems do not provide the
ability to partition data multiple ways and to migrate data
dynamically between these views as the application moves
between different phases of computation. This makes sev-
eral design patterns more straightforward to implement in
Regent compared to the above systems.
Another related class of work is represented by PGAS
and related models, such as UPC [14, 2], Titanium [35],
Chapel [16, 15], X10 [18, 17, 25] and HPX [26]. Regent is
similar to PGAS programming models in so far as point-
ers created in Regent are valid everywhere in the machine.
However, Regent pointers can only be dereferenced when the
task in question has privileges to the region that it points
into. This property allows Regent to preserve a number of
other properties important for performance and
correctness, such as guaranteeing non-interference of tasks
executing in parallel.
8. CONCLUSION
We have presented Regent, a high-productivity program-
ming language for high performance computing with logical
regions. Regent provides a sequential programming model
that parallelizes implicitly by performing dynamic dataflow
on regions. We have presented an optimizing compiler for
Regent which produces efficient Legion implementations, and
evaluated the implementation on three benchmark applica-
tions.
Acknowledgments
The authors would like to thank Charles R. Ferenbaugh
for his advice and assistance with the PENNANT code.
The authors wish to express their appreciation to Janine
C. Bennett, Greg Sjaardema, Kenneth Franko, Hemanth
Kolla, Jeremiah Wilke, David Hollman, and the Mantevo
project [3] for support with the MiniAero code.
This work was supported by the Department of Energy
National Nuclear Security Administration under Award Num-
ber DE-NA0002373-1, and by Los Alamos National Labora-
tories Subcontract No. 173315-1 through the U.S. Depart-
ment of Energy under Contract No. DE-AC52-06NA25396.
The work used computing resources at the High Perfor-
mance Computing Center at Stanford University [1], sup-
ported by the National Science Foundation under Award
Number 0960306.
9. REFERENCES
[1] High Performance Computing Center at Stanford
University. http://hpcc.stanford.edu/.
[2] UPC language specification v1.2.
upc.lbl.gov/docs/user/upc_spec_1.2.pdf, 2011.
[3] Mantevo project. https://mantevo.org/, Nov. 2014.
[4] The Open Community Runtime interface.
https://xstackwiki.modelado.org/images/1/13/
Ocr-v0.9-spec.pdf, 2014.
[5] T. G. Armstrong, J. M. Wozniak, M. Wilde, and I. T.
Foster. Compiler techniques for massively scalable
implicit task parallelism. In Supercomputing (SC),
2014.
[6] C. Augonnet et al. StarPU: A unified platform for
task scheduling on heterogeneous multicore
architectures. Concurrency and Computation: Practice
and Experience, 23:187–198, Feb. 2011.
[7] M. Bauer. Legion: Programming Distributed
Heterogeneous Architectures with Logical Regions. PhD
thesis, Stanford University, 2014.
[8] M. Bauer, J. Clark, E. Schkufza, and A. Aiken.
Programming the memory hierarchy revisited:
Supporting irregular parallelism in Sequoia. In
Proceedings of the Symposium on Principles and
Practice of Parallel Programming, 2011.
[9] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken.
Legion: Expressing locality and independence with
logical regions. In Supercomputing (SC), 2012.
[10] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken.
Structure slicing: Extending logical regions with fields.
In Supercomputing (SC), 2014.
[11] R. Bocchino et al. A type and effect system for
deterministic parallel Java. In OOPSLA, 2009.
[12] R. L. Bocchino, Jr., S. Heumann, N. Honarmand,
S. V. Adve, V. S. Adve, A. Welc, and T. Shpeisman.
Safe nondeterminism in a deterministic-by-default
parallel language. POPL, 2011.
[13] D. E. Burton. Consistent finite-volume discretization
of hydrodynamics conservation laws for unstructured
grids. Technical Report UCRL-JC-118788, Lawrence
Livermore National Laboratory, Livermore, CA, 1994.
[14] W. Carlson, J. Draper, D. Culler, K. Yelick,
E. Brooks, and K. Warren. Introduction to UPC and
language specification. UC Berkeley Technical Report:
CCS-TR-99-157, 1999.
[15] B. Chamberlain, S. Choi, S. Deitz, D. Iten, and
V. Litvinov. Authoring user-defined domain maps in
Chapel. 2011.
[16] B. Chamberlain et al. Parallel programmability and
the Chapel language. Int’l Journal of HPC Apps.,
2007.
[17] S. Chandra et al. Type inference for locality analysis
of distributed data structures. In PPoPP, pages
11–22, 2008.
[18] P. Charles et al. X10: An object-oriented approach to
non-uniform cluster computing. In OOPSLA, 2005.
[19] J. Davison de St. Germain, J. McCorquodale,
S. Parker, and C. Johnson. Uintah: a massively
parallel problem solving environment. In
High-Performance Distributed Computing, 2000.
Proceedings. The Ninth International Symposium on,
pages 33–41, 2000.
[20] Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, and
J. Vitek. Terra: a multi-stage language for
high-performance computing. PLDI, 2013.
[21] H. Edwards and C. Trott. Kokkos: Enabling
performance portability across manycore
architectures. In Extreme Scaling Workshop (XSW),
2013, pages 18–24, Aug 2013.
[22] K. Fatahalian et al. Sequoia: Programming the
memory hierarchy. In Supercomputing, November 2006.
[23] C. R. Ferenbaugh. PENNANT: an unstructured mesh
mini-app for advanced architecture research.
Concurrency and Computation: Practice and
Experience, 2014.
[24] R. Ierusalimschy, L. H. De Figueiredo, and
W. Celes Filho. Lua - an extensible extension
language. Softw., Pract. Exper., 1996.
[25] M. Joyner, Z. Budimlic, and V. Sarkar. Subregion
analysis and bounds check elimination for high level
arrays. In Compiler Construction, pages 246–265,
2011.
[26] H. Kaiser, T. Heller, B. Adelstein-Lelbach, A. Serio,
and D. Fey. HPX: A task based programming model
in a global address space. In Partitioned Global
Address Space Programming Models, 2014.
[27] L. Kalé and S. Krishnan. CHARM++: A portable
concurrent object oriented system based on C++. In
Proceedings of OOPSLA’93, pages 91–108, 1993.
[28] C. Lattner and V. Adve. LLVM: A compilation
framework for lifelong program analysis &
transformation. In Proceedings of the 2004
International Symposium on Code Generation and
Optimization (CGO’04), Palo Alto, California, Mar
2004.
[29] M. Ren, J. Y. Park, M. Houston, A. Aiken, and
W. Dally. A tuning framework for software-managed
memory hierarchies. In Int’l Conference on Parallel
Architectures and Compilation Techniques, pages
280–291, 2008.
[30] M. C. Rinard and M. S. Lam. The design,
implementation, and evaluation of Jade. ACM Trans.
Program. Lang. Syst., 1998.
[31] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and
J. Dongarra. MPI-The Complete Reference. MIT
Press, 1998.
[32] W. Taha and T. Sheard. MetaML and multi-stage
programming with explicit annotations. Theoretical
Computer Science, 2000.
[33] S. Treichler, M. Bauer, and A. Aiken. Language
support for dynamic, hierarchical data partitioning. In
Object Oriented Programming, Systems, Languages,
and Applications (OOPSLA), 2013.
[34] J. M. Wozniak, T. G. Armstrong, M. Wilde, D. S.
Katz, E. Lusk, and I. T. Foster. Swift/T: Large-scale
application composition via distributed-memory
dataflow processing. In Cluster, Cloud and Grid
Computing (CCGrid), 2013.
[35] K. Yelick et al. Titanium: A high-performance Java
dialect. In Workshop on Java for High-Performance
Network Computing, 1998.