4.4. DISCUSSION
[Figure 4.1: Separation of concerns in data structure abstractions. The diagram separates a logical layer, at which a semantic specification is given, from a physical layer, at which a performant implementation is produced, spanning both functional and object orientation. Key challenges: (1) define and discover the mapping between layers; (2) circumvent limitations posed by each application domain; (3) choose between functional and object orientation; (4) expose opportunity and promote productivity. Semantic control takes the form of descriptive annotation by domain experts; performance control takes the form of execution policy set by performance programmers.]
Separation of concerns. High performance computing presents scientists and performance tuners with
two key challenges: exposing parallelism and effectively harvesting that parallelism. A natural separation of
concerns arises from these two efforts: the scope of effort for domain experts and for performance tuning
experts can each be limited, and the two can be decoupled without overly restricting each other. Charting
a solid path toward a clean separation of concerns, defining appropriate levels of abstraction, and
highlighting properties of language interfaces that will be effective at managing data abstractions are the
subjects of this effort.
Domain experts specify the work to be accomplished. They want the freedom to use a representation of
data that is natural to them, and most prefer not to be forced to specify performance-related details.
Performance tuning experts want full control over performance without having to become domain experts,
so they want domain experts to fully expose opportunities for parallelism without over-specifying how that
parallelism is to be harvested. This leads to a natural separation of concerns between a logical abstraction
layer, at which semantics are specified (the upper box in Figure 4.1), and a lower, physical abstraction layer,
at which performance is tuned and the harvesting of parallelism on a particular target machine is controlled
(the lower box in the figure). This separation allows code modifications to be localized within each of
the semantic and performance control layers. The use of abstraction allows high-level expressions to be
polymorphic across a variety of low-level trade-offs at the physical implementation layer. Several interfaces
are now emerging that maintain the discipline of this separation of concerns and that offer alternative ways
of mapping between the logical and physical abstraction layers.
Performance-related controls pertain to (1) data and (2) execution policy.
1. Data controls may be used to manage:
• Decomposition, which tends to be either trivial (parameterizable and automatic; perhaps along
multiple dimensions) or not (explicit, and perhaps hierarchical)
• Mechanisms for, and timing of, distribution to data space/locality bindings
• Data layout, the arrangement of data according to the addressing scheme, mapping logical ab-
stractions to arrangement in physical storage
• Binding of storage: to memories that support a particular access pattern (e.g. read-only, stream-
ing), to a phase-dependent depth in the memory hierarchy (prefetching, marking non-temporal),
to memory structures that support different kinds of sharing (software-managed or hardware-
managed cache), or to certain kinds of near storage (e.g. scalar or SIMD registers)
2. Execution policy controls may be used to manage:
• Decomposition of work, e.g. iterations, nested iterations, hierarchical tasks
• Ordering of work, e.g. recursive subdivision, work stealing
• Association of work with data, e.g. moving work to data, binding to hierarchical domains like a
node, an OpenMP place, a thread or a SIMD lane
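The decomposition and binding controls above can be made concrete with a small sketch. The following C++ fragment (all names are hypothetical, not drawn from any particular runtime) performs a block decomposition of an iteration space; in a real runtime each contiguous block would be bound to a thread, an OpenMP place, or a group of SIMD lanes, moving the work to where its data lives:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: block decomposition of an iteration space [0, n).
// Each worker receives one contiguous [lo, hi) range. Binding each range
// to a hierarchical domain (node, place, thread, lane group) is left to
// the runtime and is not shown here.
std::vector<std::pair<std::size_t, std::size_t>>
decompose_blocked(std::size_t n, std::size_t workers) {
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    std::size_t chunk = (n + workers - 1) / workers;  // ceiling division
    for (std::size_t w = 0; w < workers; ++w) {
        std::size_t lo = w * chunk;
        if (lo >= n) break;                           // no empty trailing blocks
        std::size_t hi = lo + chunk < n ? lo + chunk : n;
        blocks.emplace_back(lo, hi);
    }
    return blocks;
}
```

Because the decomposition is parameterized (by `workers` and implicitly by the chunking rule), a tuner can change it without touching the semantic code that runs inside each block.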
Mechanisms. These controls may be applied at different scopes and granularities through a variety
of mechanisms:
• Data types - these are fine-grained, applying to one variable or parameter at a time; they apply across
the lexical scope of the variable
• Function or construct modifiers - instead of applying to individual variables, these apply a policy to
everything in a function or control construct
• Environmental variable controls - these global policies apply across a whole program
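The three mechanism scopes can be illustrated side by side. In this C++ sketch, the policy names, the `process` function, and the `LAYOUT_POLICY` environment variable are all hypothetical placeholders for whatever controls a real interface would expose:

```cpp
#include <cstdlib>
#include <string>

// Hypothetical policy that a tuner might want to control at three scopes.
enum class Policy { Blocked, Interleaved };

// 1. Per-variable scope: the policy is carried in the data type, so it
//    applies across the lexical scope of each variable individually.
template <Policy P>
struct Buffer {
    static constexpr Policy policy = P;
};

// 2. Function scope: a modifier-like template parameter applies one
//    policy to everything the function touches.
template <Policy P, typename A, typename B>
Policy process(const A&, const B&) { return P; }

// 3. Whole-program scope: an environment variable (assumed name) supplies
//    a global default policy.
Policy global_policy() {
    const char* v = std::getenv("LAYOUT_POLICY");
    return (v && std::string(v) == "interleaved") ? Policy::Interleaved
                                                  : Policy::Blocked;
}
```

The finer-grained mechanisms naturally override the coarser ones: a type-level policy on one variable can differ from the function-level policy, which can in turn differ from the program-wide default.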
Note that, through modularity, the scope to which data types and function or construct modifiers apply
may be refined. For example, when a variable is passed through many functions, its data type may vary by call site.
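A type-level layout control of this kind might look as follows. This C++ sketch (class and policy names are illustrative assumptions) maps a logical (row, column) index to physical storage through a layout policy carried in the type, so the same high-level code is polymorphic across row-major and column-major arrangements, and the choice can differ at each declaration or call site:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout policies: each maps a logical (row, col) index to a
// physical offset in a flat buffer.
struct RowMajor {
    static std::size_t offset(std::size_t r, std::size_t c,
                              std::size_t /*rows*/, std::size_t cols) {
        return r * cols + c;   // elements of a row are contiguous
    }
};

struct ColMajor {
    static std::size_t offset(std::size_t r, std::size_t c,
                              std::size_t rows, std::size_t /*cols*/) {
        return c * rows + r;   // elements of a column are contiguous
    }
};

// Semantic code indexes logically; the physical arrangement is fixed only
// by the Layout type argument.
template <typename T, typename Layout>
class Matrix {
    std::size_t rows_, cols_;
    std::vector<T> data_;
public:
    Matrix(std::size_t r, std::size_t c) : rows_(r), cols_(c), data_(r * c) {}
    T& operator()(std::size_t r, std::size_t c) {
        return data_[Layout::offset(r, c, rows_, cols_)];
    }
    const std::vector<T>& raw() const { return data_; }
};
```

A tuner can switch the arrangement by changing only the type argument at the declaration site; the semantic code using `operator()` is unchanged.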
The interactions and interdependencies of data controls, execution controls, and the management of granularity
or scope can become quite complex. We use SIMD as a motivating example to illustrate some of the issues.
• The order among dimensions in a multi-dimensional array obviously impacts which dimension’s ele-
ments are contiguous. This, in turn, can affect which dimension is best to vectorize with a unit stride, to
maximize memory access efficiency. Relevant compiler transformations may include loop interchange,
data re-layout, and the selection of which loop nesting level to vectorize. The dimension order for an
array can be specified with a data type, or a new ABI at a function boundary can be used to create
a smaller scope from which indicators of the physical layout cannot escape, so that the compiler
is left free to re-lay out data within that scope.
• A complicating factor is that the best data layout may vary within a scope. For example,
one loop nest may perform best with one ordering of dimensions, or one AoSoA (array of structures of
arrays) arrangement, while the next loop nest may favor a different layout.
This can lead to a complex set of trade-offs that are not optimally solved by greedy schemes. It
is possible to isolate the different nests in different functions, and to use an ABI to hide where data
re-layout occurs.
• The number of elements that can be packed into a fixed-width SIMD register depends on the
precision of each element, for some target architectures. Consider a loop iteration that contains a mix
of double- and single-precision operands: vlen single-precision operations may be packed into
a single SIMD instruction, whereas only vlen/2 double-precision operations fit in one instruction, so
two instructions are required. Matching the number of available operations may
require unrolling the loop by an additional factor of 2. For situations such as this, hard-encoding
the number of elements in each SIMD register using types may inhibit the compiler's freedom to make tuning
trade-offs.
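The packing arithmetic in the last bullet can be made concrete. This minimal C++ sketch assumes an illustrative 256-bit register (AVX-style); the function names are hypothetical:

```cpp
#include <cstddef>

// Assumed register width for illustration; real targets vary.
constexpr std::size_t register_bits = 256;

// Number of lanes (packed elements) for an element of the given byte width.
constexpr std::size_t lanes(std::size_t elem_bytes) {
    return register_bits / (8 * elem_bytes);
}

// Extra unroll factor needed so the wider type issues enough instructions
// to cover the same number of logical iterations as the narrower type.
constexpr std::size_t extra_unroll(std::size_t narrow_bytes,
                                   std::size_t wide_bytes) {
    return lanes(narrow_bytes) / lanes(wide_bytes);
}
```

With 4-byte single precision and 8-byte double precision this yields vlen = 8 and vlen/2 = 4, and an extra unroll factor of 2, matching the mixed-precision scenario described above. Baking `lanes(4)` into a fixed vector type would lock in exactly the trade-off the text warns about.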
We have many areas of agreement. Freeing the domain programmer from having to specify performance
controls makes them more productive, and allowing the tuner to add controls without introducing bugs
through inadvertent changes to semantics makes the tuner more productive. If the tuner is more productive,
has a clear set of priorities for which controls to apply, and can apply them effectively, better performance
can be achieved more quickly. Finally, isolating the performance tuning controls and presenting them in a
fashion that allows target-specific implementations to be applied easily makes performance more portable.
Programming Abstractions for Data Locality