processor when the data reuse occurs. This can be done by iteration space tiling, data space tiling,
or domain decomposition.
Iteration Space Tiling is a code restructuring technique used to reduce working set size of a program
by dividing the iteration space defined by the loop structures into tiles.
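As a minimal sketch of iteration space tiling (a Python illustration with an invented 4×4 iteration space and tile size 2, not taken from the text), tiling changes only the order in which the same set of loop indices is visited, so that each tile's working set is touched completely before the traversal moves on:

```python
def tiled_indices(n, m, tile):
    """Yield (i, j) pairs of an n-by-m iteration space in tiled order
    rather than plain row-major order."""
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    yield i, j

# Both orders visit every index exactly once; only the visiting order
# (and hence the working set between reuses of nearby data) changes.
untiled = [(i, j) for i in range(4) for j in range(4)]
tiled = list(tiled_indices(4, 4, 2))
assert sorted(tiled) == sorted(untiled)
assert tiled[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)]  # first 2x2 tile
```

In a real loop nest the tile size would be chosen so that one tile of each accessed array fits in the targeted memory level.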
Data Space Tiling involves subdividing a rectilinear array into a regular array of rectilinear chunks
within the same memory space in order to maximize the consecutive memory locations that are
visited during traversal.
Data Decomposition is the partitioning of a data structure’s arrays into smaller chunks with the
intent of assigning each chunk to a different memory space for introducing parallelism across
chunks or improving data locality.
Data Placement/Pinning is the mapping of the tiles or chunks of data from a domain decomposed data
structure to memory spaces.
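The two definitions above can be sketched together (a hypothetical Python illustration; the memory-space names are invented stand-ins, not an API from the text): an array is decomposed into contiguous chunks, and a placement then maps each chunk to a memory space.

```python
def decompose(data, nchunks):
    """Split data into nchunks contiguous, nearly equal chunks."""
    n, r = divmod(len(data), nchunks)
    chunks, start = [], 0
    for k in range(nchunks):
        end = start + n + (1 if k < r else 0)  # spread the remainder
        chunks.append(data[start:end])
        start = end
    return chunks

spaces = ["numa0", "numa1"]          # stand-ins for physical memory spaces
chunks = decompose(list(range(10)), 2)
# Placement: map each chunk of the decomposition to a memory space.
placement = {space: chunk for space, chunk in zip(spaces, chunks)}
assert placement["numa0"] == [0, 1, 2, 3, 4]
assert placement["numa1"] == [5, 6, 7, 8, 9]
```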
Task Placement/Pinning is the assignment of threads of execution to a particular physical processor
resource and its related hierarchy of memory spaces. Many contemporary programming environments
do not offer an automatic method to directly relate the task placement to the data placement, aside
from very loose policies such as “first touch memory affinity.”
Data Layout is the mapping of a data structure’s chunk to the physical addresses corresponding to the
addressing scheme used by a given abstraction layer, which in turn provides affine mapping from the
logical address space seen by the programmer to the physical layout of the data arrays in a memory.
There may be multiple abstraction layers, which together serve a diversity of objectives, as outlined above.
For example, users may prefer to refer to the members of an array of structures (AoS), whereas an array of structures
of arrays (AoSoA) layout may be more profitable at the physical storage level. Physical layout may differ
between a software-managed memory that, for example, uses Hilbert curves, and a hardware-managed cache.
There must be a mapping between the data layouts in the various abstraction layers, and potentially in each
of the memory spaces.
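The AoS, SoA, and AoSoA layouts mentioned above can be sketched in a few lines (a Python illustration with invented particle data; real implementations would use contiguous typed arrays rather than Python lists):

```python
# Array of structures (AoS): the programmer-friendly logical view.
particles_aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0},
                 {"x": 5.0, "y": 6.0}, {"x": 7.0, "y": 8.0}]

# Structure of arrays (SoA): one contiguous array per member.
soa = {"x": [p["x"] for p in particles_aos],
       "y": [p["y"] for p in particles_aos]}

# Array of structures of arrays (AoSoA): SoA chunks of a fixed width,
# so each member is contiguous within a chunk.
CHUNK = 2
aosoa = [{"x": soa["x"][i:i + CHUNK], "y": soa["y"][i:i + CHUNK]}
         for i in range(0, len(particles_aos), CHUNK)]

assert aosoa[0]["y"] == [2.0, 4.0]
assert aosoa[1]["x"] == [5.0, 7.0]
```

A mapping layer such as the one described above would let the programmer keep writing `particles[i].x` while the storage uses the SoA or AoSoA form.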
A computation’s memory access is dictated by its data access pattern, which is a composition of
1. how the computation’s parallel tasks traverse the data structure (the iteration space and traversal order),
2. how the computation’s tasks are scheduled (i.e., the task placement),
3. data layout,
4. data placement, and
5. data decomposition.
Performance is constrained by the cost (time and energy) of moving data in service of the computation,
which is directly impacted by the computation’s data access pattern as represented in Figure 2.2 in the
Background chapter of this document. Data locality is a proxy measure for the volume and distance of
data movement; smaller “distance” and lower volumes imply lower cost. The idea is to shorten the distance
between successive references to the same memory location, so that it is more probable that the memory
word resides in the memory levels near to the processor when the data reuse occurs. Data locality has both
spatial and temporal dimensions as depicted in Figure 2.4 in the Background chapter of this document.
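Temporal locality in particular is often quantified as reuse distance: the number of distinct memory locations touched between two successive accesses to the same location. The following sketch (an illustrative Python measure over an invented address trace, not from the text) makes the notion concrete; the smaller the distance, the more likely the word still resides near the processor at reuse time.

```python
def reuse_distances(trace):
    """Return the reuse distance of every repeated access in a trace:
    the count of distinct other addresses touched since the previous
    access to the same address."""
    last_seen, distances = {}, []
    for pos, addr in enumerate(trace):
        if addr in last_seen:
            between = set(trace[last_seen[addr] + 1: pos])
            distances.append(len(between - {addr}))
        last_seen[addr] = pos
    return distances

# "a b a" reuses a at distance 1; "a b c a" at distance 2.
assert reuse_distances(["a", "b", "a"]) == [1]
assert reuse_distances(["a", "b", "c", "a"]) == [2]
```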
4.2 Key Points
The research community has been developing data structure abstractions for several years to improve the
expressiveness of parallel programming environments in order to improve application performance. Recently,
there has been a shift towards more data-centric programming to provide locality abstractions for data
structures because locality has become one of the most important design objectives. During the PADAL
workshop, we identified the following design principles as important and desired by the application program-
mers.
• We seek to improve performance by controlling the separate components (traversal, execution policy,
data layout, data placement, and data decomposition) whose composition determines a computation’s
data access pattern.
• Separation of concerns is important to maximize expressivity and optimize performance. That prin-
ciple implies a clear distinction between a logical, semantic specification, and lower-level controls over
the physical implementation. At the semantic level, sequential work is mapped onto parallel data
collections, using logical abstractions of data layout which best promote productivity. Domain experts
can also expose semantic characteristics like reference patterns that can inform trade-offs made by
tools.
• A computation’s algorithm is expressed in terms of operations on certain abstract data types, such as
matrices, trees, etc. The code usually expresses data traversal patterns at high level, such as accessing
the parent node in a tree, or the elements in a row of a matrix. This expresses the algorithmic locality
of the data, which may not be related to the actual locality of the data access. The actual locality
is obtained when the algorithm’s traversals, tasks, and data structures are mapped to the physical
architecture.
• The efficiency of a computation’s on-node data access pattern may depend upon the architecture of the
computer on which it executes. Performance may be improved by choosing a data layout appropriate
for the architecture. This choice can be made without modifying the computation’s source code if the
data structure has a polymorphic data layout.
• Performance tuners, who may be distinct from the domain experts, may exercise control over perfor-
mance with a set of interfaces that provide more explicit control over the data movement, and can be
distinct from those that specify semantics. Through these interfaces, they may decompose, distribute
and map parallel data collections onto efficient physical layouts with specialized characteristics. This
may be done in a modular fashion that leaves the semantic expression of the code intuitive, readable
and maintainable. This approach presents an explicit “separation of concerns” by offering different
interfaces to different programmers (the domain scientists vs. the computer science and hardware
experts).
• A key goal of separation of concerns is to separate the interface of the architecture-agnostic abstract
machine (AM) from that of the physical machine (PM) using hierarchical abstractions. An AM is nec-
essary to express an algorithm, whether in a functional language, as a PGAS abstraction, or something
else. An algorithm written for an AM must be translated into a program for a physical machine.
The PM may or may not be closely related to the AM. The translation effort depends on the “distance”
between the AM and the PM. Since there are many different PMs, and the lifetime of applications
should be much longer than the lifetime of an architecture, the usual requirement is that an algorithm
for an AM should be executed efficiently in the widest possible class of foreseeable PMs. This is the
well-known problem of performance portability. From this perspective, an AM should be distinct from
PMs for two reasons: first, it gives the user a uniform view of the programming space, often at a much
higher level of abstraction than any specific programming model for a PM; second, the hope is that
by being equidistant from any PM, the translation can lead to equally efficient implementations with
similar effort.
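The polymorphic data layout described in the bullets above can be sketched as follows (a Python illustration with an invented `View2D` class, not an API from the text): the computation indexes the view logically, while a layout object owns the affine map to physical (flattened) storage, so switching row-major to column-major requires no change to the computation's source.

```python
class View2D:
    """A 2D view whose physical layout is chosen at construction time."""

    def __init__(self, rows, cols, layout="row_major"):
        self.rows, self.cols = rows, cols
        # The affine index map is the only layout-dependent piece.
        if layout == "row_major":
            self._index = lambda i, j: i * cols + j
        else:  # column-major
            self._index = lambda i, j: j * rows + i
        self.data = [0.0] * (rows * cols)

    def __setitem__(self, ij, value):
        self.data[self._index(*ij)] = value

    def __getitem__(self, ij):
        return self.data[self._index(*ij)]

# The same logical access works unchanged for either physical layout,
# but the value lands in a different physical slot.
vr = View2D(2, 3, "row_major"); vr[0, 1] = 7.0
vc = View2D(2, 3, "col_major"); vc[0, 1] = 7.0
assert vr[0, 1] == vc[0, 1] == 7.0      # identical logical result
assert vr.data.index(7.0) == 1          # row-major physical slot
assert vc.data.index(7.0) == 2          # column-major physical slot
```

A performance tuner could select the layout per architecture while the domain scientist's code that reads and writes `view[i, j]` stays untouched.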
4.3 State of the Art
There is a concern that data structure and layout features that are needed to “get us to exascale” are lacking
in existing, standard languages. There are a few options for mitigating this situation.