Key    Optimization
map    Mapping Elision
leaf   Leaf Task Optimization
idx    Index Launch Optimization
fut    Future Optimization
dbr    Dynamic Branch Elision
vec    Vectorization
all    All of the Optimizations Above
Figure 2: Legend key for knockout experiments.
Application   Regent   Reference
Circuit          825        1701
PENNANT         1789        2416
MiniAero        2836        3993
Figure 3: Lines of code (non-comment, non-blank) for Regent
and reference implementations.
erence implementations of each application. First, we com-
pare Regent absolute performance against the reference on
the target machine.
Next, to demonstrate the impact of the compiler opti-
mizations performed by Regent, we perform knockout exper-
iments for each application, disabling each optimization pre-
sented in Section 4 in turn. In addition, we perform double
knockout experiments, measuring performance with all pos-
sible pairs of two optimizations disabled, and call out a few
interesting combinations. As several of the optimizations
impact the achieved parallelism, we evaluate each configu-
ration in a parallel configuration and compare against the
best sequential performance achieved by Regent. The la-
bels for the various optimizations are described in Figure 2.
Pointer check elision has been previously demonstrated to
have a significant impact on performance [33] and has been
left out of the knockout to reduce clutter.
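The set of configurations implied by the single and double knockout experiments can be enumerated mechanically. The following is a minimal sketch, not the authors' harness: the function names are hypothetical, the keys are taken from the legend in Figure 2, and it assumes that the "all" label corresponds to a run with every listed optimization disabled.

```python
from itertools import combinations

# Optimization keys from the legend in Figure 2 ("all" is handled separately).
OPTIMIZATIONS = ["map", "leaf", "idx", "fut", "dbr", "vec"]


def knockout_configurations(optimizations):
    """Return the sets of disabled optimizations to measure.

    Covers the fully optimized build, every single knockout, every
    double knockout, and (by assumption) the build with everything disabled.
    """
    configs = [frozenset()]  # nothing disabled
    configs += [frozenset(c) for c in combinations(optimizations, 1)]
    configs += [frozenset(c) for c in combinations(optimizations, 2)]
    configs.append(frozenset(optimizations))  # assumed meaning of "all"
    return configs


def speedup_over_sequential(best_sequential_seconds, parallel_seconds):
    """Speedup of a parallel run relative to the best sequential run."""
    return best_sequential_seconds / parallel_seconds


configs = knockout_configurations(OPTIMIZATIONS)
print(len(configs))  # 1 + 6 + 15 + 1 = 23
```

With six optimizations there are 6 single and 15 double knockouts, so exhaustive pairwise measurement stays small enough to run for every application.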
Finally, we evaluate the productivity of Regent by comparing
the number of lines of code in each Regent implementation
against its reference. Figure 3 summarizes the
results. Application-specific details are described along with
each application below.
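As a quick check on the line counts in Figure 3, the relative size of each Regent implementation can be computed directly. This is an illustrative sketch only; the helper name `loc_ratio` is hypothetical and the numbers are copied from Figure 3.

```python
# Line counts copied from Figure 3: (Regent, reference).
LINES_OF_CODE = {
    "Circuit": (825, 1701),
    "PENNANT": (1789, 2416),
    "MiniAero": (2836, 3993),
}


def loc_ratio(regent_lines, reference_lines):
    """Regent line count as a fraction of the reference line count."""
    return regent_lines / reference_lines


for app, (regent, reference) in LINES_OF_CODE.items():
    print(f"{app}: {loc_ratio(regent, reference):.1%} of reference")
```

The ratios come out to roughly 49% for Circuit, 74% for PENNANT, and 71% for MiniAero.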
6.1 Circuit
Circuit, introduced in [9], is a distributed circuit simula-
tion, operating over an arbitrary graph of nodes and wires.
While in principle the topology of the graph can be arbi-
trary, Circuit is concerned primarily with topologies with
interconnected dense subgraphs. Such graphs allow Circuit
to achieve some level of scalability, though that scalability
is ultimately limited by the global all-to-all communication
pattern between the subgraphs.
We compare the performance of Regent against a hand-
tuned and manually vectorized CPU implementation written
to the C++ Legion API. We evaluate both implementations
on a graph with 800K wires connecting 200K nodes. Fig-
ure 4a shows the performance of Regent against the baseline
C++ Legion implementation running on up to 8 nodes on
Certainty. Notably, the fully-optimized Regent implemen-
tation—which is written in a straightforward way with no
use of explicit vectors or vector intrinsics, and is less than
half the total number of lines of code—achieves performance
comparable to the manually vectorized C++ code, exceed-
ing the performance that can be achieved by using the LLVM
3.5 vectorizer alone.
Figure 5a shows the impact of disabling individual
optimizations and pairs of optimizations on the performance
of the Regent implementation of Circuit. Certain optimizations
affect the parallelism available in the application; index
launch optimization and mapping elision are two such
optimizations. When both are disabled simultaneously, the
code runs sequentially. As described in Section 4.1, the Le-
gion runtime, in the absence of the map and unmap calls
placed by the compiler, must copy back the results of each
task execution before returning control to the caller. This
creates an effective barrier between consecutive tasks, but the
effect is not noticeable as long as index launch optimization
is able to parallelize the task launches. Only when both
optimizations are disabled does the code serialize; if either
optimization is disabled by itself, the application continues
to run in parallel at somewhat reduced throughput.
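The interaction described above reduces to a simple rule, sketched here as a toy model. It only encodes the observed behavior for Circuit (sequential execution when both optimizations are off) and does not model the Legion runtime; the names are hypothetical and the keys follow the legend in Figure 2.

```python
# Optimization keys follow the legend in Figure 2.
PARALLELISM_OPTIMIZATIONS = {"idx", "map"}


def runs_sequentially(disabled):
    """True when both index launch optimization and mapping elision are off."""
    return PARALLELISM_OPTIMIZATIONS <= set(disabled)


for disabled in [set(), {"idx"}, {"map"}, {"idx", "map"}]:
    label = "+".join(sorted(disabled)) or "none"
    mode = "sequential" if runs_sequentially(disabled) else "parallel"
    print(f"disabled={label}: {mode}")
```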
Dynamic branch elision does not have a significant im-
pact on the performance of Circuit and has been omitted to
reduce clutter.
6.2 PENNANT
PENNANT [23] is a mini-app for Lagrangian hydrody-
namics representing a subset of the functionality of FLAG [13],
a LANL production code. Figure 4b evaluates Regent against
an OpenMP implementation of PENNANT on a problem
containing approximately 2.6M zones. Implementation de-
tails of the Regent version are discussed in Section 2.
Regent performs better than OpenMP for all core counts
up to 10, surpassing OpenMP by 8% at 10 cores. Starting
at 12 cores, Regent performance degrades because the ad-
ditional compute threads interfere with threads Legion uses
for dynamic dependence analysis and data movement. The
Legion runtime is also unable to exclusively allocate physical
cores for each thread and abandons pinning altogether, lead-
ing to increased interference between application threads.
PENNANT is largely memory-bound, and is thus sig-
nificantly impacted by the NUMA architecture of the ma-
chine.
OpenMP performance was substantially impacted
by CPU affinity, and a manual assignment of threads to
cores was needed for optimal performance. Regent automatically
binds threads to cores when possible and distributes
threads round-robin between NUMA domains, thus performing
well with minimal manual tuning.
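Round-robin placement across NUMA domains can be illustrated with a short sketch. This is not the Legion runtime's actual placement logic; the function name, the thread count, and the number of NUMA domains are hypothetical values chosen for illustration.

```python
def round_robin_numa(num_threads, num_domains):
    """Assign thread i to NUMA domain i % num_domains."""
    if num_domains <= 0:
        raise ValueError("num_domains must be positive")
    return [thread % num_domains for thread in range(num_threads)]


# Hypothetical example: 10 compute threads on a machine with 2 NUMA domains.
placement = round_robin_numa(10, 2)
print(placement)  # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
```

Alternating domains in this way spreads memory traffic evenly across the available memory controllers, which matters for a memory-bound code such as PENNANT.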
Regent achieves good performance despite an overall de-
crease in lines of code. The numbers listed in Figure 3 ex-
clude a number of routines, shared between both implemen-
tations, for generating the input mesh and exporting the
output of the simulation to files.
Figure 5c shows that the Regent implementation of PEN-
NANT exhibits varied behavior with certain combinations
of optimizations disabled. As with Circuit, disabling both
index launch optimization and mapping elision causes the application
to execute sequentially. However, PENNANT responds differently
from Circuit when both the index launch and leaf task optimizations
are disabled. PENNANT’s pattern of task launches is such that
when leaf optimization is disabled, the Legion runtime must
stall for mapping to complete in order to ensure that all the
dependencies are correctly captured. Circuit is structured
differently from PENNANT and is therefore not impacted
significantly by the leaf optimization (in combination with
index launches or otherwise).
PENNANT also shows the most benefit from eliminat-
ing dynamic branches. In contrast to Circuit and MiniAero