4.3. STATE OF THE ART
• Employ standard language features, e.g. in library-based solutions. Languages vary in their support
for library-based data layout abstractions.
– C++ seems to provide an opportunity for enabling data layout abstractions because its template metaprogramming facilities can extend the base language syntax. C++ metaprogramming
can cover many of the desired capabilities, e.g. polymorphic data layout and execution policy, where
the specialization can be hidden in the template implementation and controlled by template parameters.
Although C++ template metaprogramming offers more syntactic elegance for expressing solutions,
the result is ultimately a library-based approach: the code generated by the template
metaprogram is not understood by the baseline compiler, so the compiler cannot provide
optimizations that take advantage of the higher-level abstractions implemented by the templates.
The primary opportunity in the C++ approach is the hope that the C++ standards committee
would adopt such abstractions as part of the standard. The committee has been aggressive about adoption, and
the language already supports advanced features such as lambdas and the C++ Standard Library. However, even if
these syntactic extensions are adopted, compiler writers would still need to explicitly target those
templates and make use of the higher-level semantics they represent. At present it is not clear
how well this strategy will work out.
– In contrast to C++, Fortran is relatively limited and inflexible in its ability to extend the
syntax, but having multidimensional arrays as first-class objects gives it an advantage in expressing
data locality. A huge number of applications are implemented in Fortran, and computations
with regular data addressing are common. Library-based approaches to extending locality-aware
constructs in Fortran can exploit the explicit support for multidimensional arrays in the
base language. However, these library-based approaches may seem less elegant in Fortran because
of the inability to perform syntactic extensions in the base language.
– Dynamically typed scripting languages like Python, Perl, and MATLAB provide a great deal of flexibility to
users, and enable some forms of metaprogramming, but some of that flexibility can make optimization
difficult. Approaches to overcoming the performance challenges of scripting languages use
the integrated language introspection capabilities of these languages (particularly Python),
which enable the scripting system to intercept known motifs in the code and apply just-in-time (JIT)
rewriting or specialization. Examples include Copperhead [19] and PyCUDA [57], which recognize
data-parallel constructs and rewrite and recompile them as CUDA code for GPUs. SEJITS and
the Asp framework [18] are other examples that use specializers to recognize particular algorithmic
motifs and invoke specialized code-rewriting rules to optimize those constructs. This same
machinery can be used to recognize and rewrite code that uses metaprogramming constructs to
express data locality information.
• Augment base languages with directives or embedded domain-specific languages (DSLs). Examples
include OpenMP, OpenACC, Threading Building Blocks (https://www.threadingbuildingblocks.org), and Thrust (http://docs.nvidia.com/cuda/thrust).
Most contributors to this report worked within the confines of existing language standards, thereby maximizing impact and leveraging the market breadth of the supporting tool chain (e.g., compilers, debuggers,
profilers). Wherever profitable, the research plan is to “redeem” existing languages by amending or extending
them, e.g. via changes to the specifications or by introducing new ABIs.
The interfaces the authors of this report are developing are Kokkos [31], TiDA [87], C++ type support,
OpenMP extensions to support SIMD, GridTools [34], hStreams (https://software.intel.com/en-us/articles/prominent-features-of-the-intel-manycore-platform-software-stack-intel-mpss-version-34), and DASH [35]. The Kokkos library
supports expressing multidimensional arrays in C++, in which the polymorphic layout can be decided at
compile time. An algorithm written with Kokkos uses the AM of C++, with data specification and access
provided by the interface of Kokkos arrays. Locality is managed explicitly by matching the data layout with
the algorithm's logical locality.
TiDA allows the programmer to express data locality and layout at array construction. Under TiDA,
each array is extended with metadata that describes its layout, tiling policy, and topological affinity for
an intelligent mapping onto cores. This metadata follows the array through the program, so a different
configuration of layout or tiling strategy does not require any of the loop nests to be modified. Various layout
scenarios are supported to enable multidimensional decomposition of data across NUMA and cache-coherence
domains. As in Kokkos, the metadata describing the layout of each array is carried throughout the program
and into libraries, thereby offering a pathway toward better library composability.
TiDA is currently packaged as a Fortran library and is minimally invasive to Fortran codes. It provides a tiling traversal
interface, which can hide complicated loop traversals, parallelization, and execution strategies. Extensions are
being considered for the C++ type system to express semantics related to the consistency (varying or
uniform) of values in SIMD lanes. This is potentially complementary to ongoing investigations into
introducing new OpenMP-compatible ABIs that define a scope within which more relaxed language rules
may allow greater layout optimization, e.g. for physical storage layouts that are more amenable to SIMD.
GridTools provides a set of libraries for expressing distributed-memory implementations of regular-grid
applications, such as stencils. It is not meant to be universal: non-regular-grid applications
should not be expressed using GridTools libraries, even though that is possible in principle, for performance
reasons. Since the constructs provided by GridTools are high level and semi-functional, locality issues are
handled by performance tuners rather than by application programmers; at the semantic
level, locality is taken into consideration only implicitly. The hStreams library provides mechanisms
for expressing and implementing data decomposition, distribution, data binding, data layout, data reference
characteristics, and execution policy on heterogeneous platforms. DASH is built on a one-sided communication
substrate and provides a PGAS abstraction in C++ using operator overloading. The DASH AM is
essentially a distributed parallel machine with a concept of hierarchical locality.
As can be seen, there is no single way of treating locality concerns, and there is no consensus on which
approach is best. Each is appealing in different scenarios that depend on the scope of the
particular application domain. There is also an opportunity to naturally build higher-level interfaces on top of
lower-level ones. For instance, TiDA or DASH multidimensional arrays could be implemented using Kokkos
arrays, and GridTools parallel algorithms could use the DASH library, with Kokkos arrays for storage.
This is a potential benefit of interoperability that arises from using a common language with
generic programming capabilities.
Ultimately, the use of lambdas to abstract the iteration space and of metadata to carry information about
the abstracted data layouts are common themes across all of these implementations. This points to the
potential for a lower-level standardization of data structures and APIs that can be used under the covers by
all of these APIs (a common abstraction layer that could be used by each library solution). One outcome of
the workshop is to initiate efforts to explicitly define the requirements for a common runtime infrastructure
that could be used interoperably across these library solutions.
4.4 Discussion
This chapter presents and begins to resolve several key challenges:
• Defining the abstraction layers. The logical layer is where the domain expert provides a semantic
specification and offers hints about the program’s execution patterns. The physical layer is where
the performance tuner specifies data and execution policy controls that are designed to provide best
performance on target machines. These layers are illustrated in Figure 4.1.
• Enumerating the data and execution policy controls of interest. These are listed below in this section
and are highlighted in Figure 4.1.
• Suggesting some mechanisms that enable a flexible and effective mapping from the logical layer down to the
physical layer, while maintaining a clean separation of controls and without limiting the freedom of
expression and efficiency at each layer. One class of mechanisms is oriented around individual data
objects, e.g. with data types, and another is oriented around control structures, e.g. with ABIs that
enable a relaxation of language rules. The choice between these two orientations is illustrated in Figure
3.1.
Programming Abstractions for Data Locality