2.1. HARDWARE TRENDS
[Figure 2.1 chart omitted: "How much parallelism must be handled by the program?" From Peter Kogge (on behalf of the Exascale Working Group), "Architectural Challenges at the Exascale Frontier", June 20, 2008.]
Figure 2.1: Explicit parallelism of HPC systems will be increasing at an exponential rate for the foreseeable future.
[Figure 2.2 chart omitted: picojoules per 64-bit operation, on a scale of 1 to 10000 pJ, for a DP FLOP, a register access, 1 mm/5 mm/15 mm on-chip wires, off-chip/DRAM access, local interconnect, and cross-system communication, comparing 2008 (45 nm) and 2018 (11 nm) technology.]
Figure 2.2: The cost of moving the data operands exceeds the cost of performing an operation upon them.
decreases. Since the energy efficiency of transistors improves as their sizes shrink while the energy efficiency of wires does not, the point is rapidly approaching where the energy needed to move data exceeds the energy used to perform the operation on those data.
Data locality has long been a concern for application development on supercomputers. Since the advent
of caches, vertical data locality has been extraordinarily important for performance, but recent architecture
trends have exacerbated these challenges to the point that they can no longer be accommodated using existing
methods such as loop blocking or compiler techniques. This report will identify numerous opportunities to
greatly improve automation in these areas.
2.1.3 Increasingly Hierarchical Machine Model
Future performance and energy efficiency improvements will require fundamental changes to hardware ar-
chitectures. The most significant consequence of this assertion is the impact on the scientific applications
that run on current high performance computing (HPC) systems, many of which codify years of scientific
domain knowledge and refinements for contemporary computer systems. In order to adapt to exascale archi-
tectures, developers must be able to reason about new hardware and determine what programming models
[Figure 2.3 diagram omitted: an abstract exascale node with fat cores and thin cores/accelerators in a coherence domain, 3D stacked memory (low capacity, high bandwidth), DRAM and NVRAM (high capacity, low bandwidth), and an integrated NIC for off-chip communication.]
Figure 2.3: The increasingly hierarchical nature of emerging exascale machines will make conventional manual methods for expressing hierarchy and locality very challenging to carry forward into the next decade.
and algorithms will provide the best blend of performance and energy efficiency into the future. While many
details of the exascale architectures are undefined, an abstract machine model (AMM) enables application
developers to focus on the aspects of the machine that are important or relevant to performance and code
structure. These models are intended as communication aids between application developers and hardware
architects during the co-design process. Figure 2.3 depicts an AMM that captures the anticipated evolution
of existing node architectures based on current industry roadmaps [3].
As represented in the figure, future systems will expose more levels of hierarchy than the two levels assumed by our existing MPI+X programming models. Not only are there more levels of hierarchy, but the topology of communication is likely to become an important target for optimization. Programmers already face NUMA (non-uniform memory access) performance challenges within the node, and future systems will exhibit increasing NUMA effects between cores within an individual chip die. Overall, our current programming models and methodologies are ill-equipped to accommodate these changes to the underlying abstract machine model, and this breaks our current programming systems.
2.2 Limitations of Current Approaches
Architectural trends break our existing programming paradigm because the current software tools are focused
on partitioning computational work equally. In doing so, they implicitly assume that all the processing elements are equidistant from one another and from their local memories within a node. For example, commonly
used modern threading models allow a programmer to describe how to parallelize loop iterations by dividing
the iteration space (the computation) evenly among processors and allowing the memory hierarchy and cache
coherence to move data to the location of the compute. Such a compute-centric approach no longer reflects
the realities of the underlying machine architecture where data locality and the underlying topology of the
data-path between computing elements are crucial for performance and energy efficiency. There is a critical need for a new data-centric programming paradigm that treats data layout and topology as primary optimization criteria and migrates compute to operate on data where it resides.
The applications community will need to refactor their applications to align with this emerging data-centric paradigm, but modern programming environments offer few abstractions for managing data locality. Absent such facilities, application programmers and algorithm developers must manage data locality by hand, using techniques such as strip-mining and loop-blocking. To optimize data movement, applications must be tuned both for vertical data movement in the memory hierarchy and for horizontal data movement between processing units, as shown in Figure 2.4.
Such transformations are labor-intensive and easily break with minor evolutions of successive generations of
compute platforms. Hardware features such as cache-coherence further inhibit our ability to manage data