or reliable solutions that they can adopt. For example, viewed on a multi-year time-scale, GROMACS has
re-implemented all of its high-performance code several times, always to make better use of data locality
[42, 77, 74], and has almost never been able to use existing portable high-performance external libraries
or DSLs. Many of those internal efforts have since been duplicated by third parties as general-purpose,
re-usable code that GROMACS could have used had it been available at the time.
The research communities that have been engaged in the discussions about peta- and exa-scale computing
are well informed about the challenges they face. Many have started to experiment with approaches recom-
mended by the researchers from the compilers, programming abstractions and runtime systems communities
in order to be better prepared for the platform heterogeneity. At the workshop, examples of many such efforts
were presented. These efforts fall mainly into three classes: approaches based on domain-specific languages
(DSLs), library-based methods, and combinations of the two. For example, OP2/PyOP2 [79],
STELLA (STEncil Loop LAnguage) [72] and HIPAcc (Heterogeneous Image Processing Acceleration) [64] are
embedded domain-specific languages (see Section 5), each tailored to a particular application domain and
abstracting away the details of a parallel implementation on a given target architecture. PyOP2 uses Python
as the host language, while OP2, STELLA, and HIPAcc use C++. OP2 and PyOP2 target mesh-based
simulation codes over unstructured meshes, generating code for MPI clusters, multicore CPUs and GPUs.
STELLA considers stencil codes on structured grids. An OpenMP and a CUDA back end are currently under
development. HIPAcc targets the domain of geometric multigrid applications on two-dimensional structured
grids [65] and provides code generation for accelerators such as GPUs (CUDA, OpenCL, RenderScript)
and FPGAs (C code suited for high-level synthesis). The latest efforts in GROMACS have likewise moved
the code from hand-tuned, inflexible assembly CPU kernels to an internally developed, compiler- and
hardware-portable SIMD intrinsics layer, from which kernels are generated automatically
for a wide range of model physics and hardware, including accelerators. In effect, this is an embedded DSL
for high-performance short-ranged particle-particle interactions. All the aforementioned approaches can
increase productivity in their target science domain—e.g., by reducing the lines of application code and
debugging time—as well as performance portability across different compute architectures.
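To make the division of labor concrete, the sketch below shows an embedded DSL in miniature. The `Stencil` class and its backend are purely illustrative inventions, not the API of OP2, STELLA, or HIPAcc: the application states *what* the stencil computes, and the framework decides *how* to execute it for a given target.

```python
# Illustrative sketch of an embedded DSL (hypothetical, not a real API):
# the user declares a stencil; a backend chooses the execution strategy.
import numpy as np

class Stencil:
    """Declarative description of a 1D stencil: (offset, coefficient) taps."""
    def __init__(self, taps):
        self.taps = taps

    def apply(self, field, backend="numpy"):
        # The backend choice stands in for code generation targeting
        # MPI clusters, multicore CPUs, or GPUs in the real DSLs.
        if backend == "numpy":  # vectorized execution on the interior
            out = np.zeros_like(field)
            n = len(field)
            for off, c in self.taps:
                lo, hi = max(0, -off), min(n, n - off)
                out[lo:hi] += c * field[lo + off:hi + off]
            return out
        raise ValueError(f"no backend named {backend!r}")

# A second-difference stencil, written once; the execution strategy is
# chosen by the framework, not hard-coded by the application.
laplace = Stencil([(-1, 1.0), (0, -2.0), (1, 1.0)])
```

Adding, say, a threaded or GPU backend would change only the `apply` dispatch, never the application-level stencil description, which is the productivity argument made above.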
Other efforts, such as the use of tiling within patches in AMR to expose greater parallelism, rely on
a library-based approach. Many of these efforts have met with some degree of success. Some are in effect
a usable component of an overall solution to be found in future, while others are experiments that are far
more informative about how to rethink the data and algorithmic layout of the core solvers. Though all these
efforts are helpful in part, and hold promise for the future, none of them currently provides a
complete, stable solution that applications can use in their transition to long-term viability.
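As a toy illustration of the tiling idea mentioned above, the sketch below splits a patch into tiles, each an independent unit of work that a runtime could hand to a separate thread. The `tiles` helper is hypothetical and not taken from any AMR library.

```python
# Hypothetical sketch: cover an n-D patch with tiles, each of which can
# be processed independently (exposing parallelism within a patch).
import itertools
import numpy as np

def tiles(shape, tile):
    """Yield tuples of slices covering an n-D patch, one tuple per tile."""
    ranges = [range(0, s, t) for s, t in zip(shape, tile)]
    for starts in itertools.product(*ranges):
        yield tuple(slice(a, min(a + t, s))
                    for a, t, s in zip(starts, tile, shape))

# Each tile is independent work: here a trivial in-place scaling; in an
# AMR code each tile would run a solver kernel, possibly on its own thread.
patch = np.arange(64, dtype=float).reshape(8, 8)
for tl in tiles(patch.shape, (4, 4)):
    patch[tl] *= 2.0
```

The same decomposition also improves locality: a tile sized to fit in cache keeps its working set resident for the duration of the kernel.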
3.2 Discussion
There are many factors that affect the decision by the developers of a large scientific library or an application
code base to use an available programming paradigm, but the biggest non-negotiable requirement is the
stability of the paradigm. The fear of adding a non-robustly-supported critical dependency prevents code
developers who use high-end platforms from adopting technologies that can otherwise benefit them. This
fear may stem from the lack of any guarantee of continued support, or from the experimental nature of a
technology that might compromise portability on current and/or future platforms. Often the developers
opt for a suboptimal or custom-built solution that does not get in the way of being able to run their
simulations. For example, even today there are research groups that use their own I/O solutions, with all the
attendant performance and maintenance overhead, because those solutions were built before the parallel I/O libraries
became mature, robust, and well supported. A group that has adopted a custom-built solution that suits its
purpose is much harder to persuade to adopt external solutions, even when those are better. It is
in general hard to promote adoption of higher-level abstractions unless there is a bound on the dependency
through the use of a library that was implemented in a highly portable way, and easy to install and link.
For this reason, embedded DSLs, code transformation technologies, and other approaches that provide
an incremental path of transition are more attractive to the applications. Any solution that demands a
complete rewrite of the code from scratch presents two formidable challenges to the users and developers of
the application codes: (1) verifying correctness of algorithms and implementations, since direct comparison
with subsections of existing code may not always be possible, and (2) a long gap when production must
continue with existing, non-evolving (and therefore sometimes outdated) code until the new version of the
code comes online. The trade-off of the incremental path is the possibility of missing out on some compiler-level optimizations.
There are other less considered but possibly equally critical concerns that have to do with expressibility.
The application developers can have a very clear notion of their data model, yet no effective way of expressing
that model in the available data structures and language constructs. The situation is even worse
for expressing the latent locality in their data model to compilers or other translation tools. None
of the prevalent mature high-level languages used in scientific computing have constructs for expressing
data locality. There is no theoretical basis for the analysis of data movement within
the local memory or remote memory. Because there is no formalism to inform the application developers
about the implications of their choices, the data structures get locked into the implementation before the
algorithm design is fully fleshed out. The typical development cycle of a numerical algorithm is to focus on
correctness and stability first, and then performance. By the time performance analysis tools are applied,
it is usually too late for anything but incremental corrective measures, which usually reduce the readability
and maintainability of the code. A better approach would be to model the expected performance of a
given data model before completing the implementation, and let the design be informed by the expected
performance model throughout the process. Such a modeling tool would need to be highly configurable, so
that its conclusions might be portable across a range of compilers and hardware, and valid into the future,
in much the same way that numerical simulations often use ensembles of input-parameter space in order to
obtain conclusions with reduced bias.
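A minimal sketch of what such a configurable modeling tool could look like, assuming a simple roofline-style model. Every function name and machine parameter here is an illustrative assumption, meant only to show how a layout choice could be evaluated before any implementation exists.

```python
# Hypothetical sketch of a pre-implementation performance model:
# estimate per-point cost of a kernel under different data layouts.

def traffic_per_point(fields_read, fields_written, bytes_per_value=8,
                      layout_overhead=1.0):
    """Bytes of main-memory traffic per grid point for one kernel sweep.

    layout_overhead > 1 models wasted bandwidth, e.g. an interleaved
    layout that streams unused components through the cache.
    """
    return (fields_read + fields_written) * bytes_per_value * layout_overhead

def time_per_point(flops, traffic_bytes, peak_flops=1e11, bandwidth=1e11):
    """Roofline-style estimate: limited by the slower of compute and memory."""
    return max(flops / peak_flops, traffic_bytes / bandwidth)

# Comparing a compact layout to one that streams 2x the needed data:
compact = time_per_point(10, traffic_per_point(3, 1))
wasteful = time_per_point(10, traffic_per_point(3, 1, layout_overhead=2.0))
```

Sweeping `peak_flops`, `bandwidth`, and `layout_overhead` over plausible ranges is the analogue of the input-parameter ensembles mentioned above: conclusions that hold across the sweep are the ones worth baking into the design.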
3.3 Application requirements
Most languages provide standard containers and data structures that are easy to use in high-level code, yet
very few languages or libraries provide interfaces for the application developer to inform the compiler about
expectations of data locality, data layout, or memory alignment. For example, a common concern for the
PDE solvers is the data structure for multiple field components that share an identical spatial layout:
should it be an array with an added dimension for the field components, or a structure; and within the array
or structure, in what order should the values be stored in memory for performance? The algorithm itself
is agnostic to the layout, but the choice of data structure bakes in the platform on which the code will
perform best. There are many similar situations that force application implementations to become
more platform specific than they need to be. Furthermore, the lack of expressibility can also present false
dependencies to the compiler and prevent optimization.
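The layout question can be made concrete with a small NumPy sketch: the same four-field state stored field-major (components contiguous across points) versus point-major (each point's components contiguous). The names and sizes are illustrative only.

```python
# Illustrative sketch: two functionally equivalent layouts for the same
# multi-field PDE state. The algorithm is layout-agnostic; the memory
# access pattern, and hence the fast platform, is not.
import numpy as np

npts, nfields = 1000, 4

soa = np.zeros((nfields, npts))   # field-major: soa[f] is one contiguous field
aos = np.zeros((npts, nfields))   # point-major: aos[i] is one point's state

# A sweep over a single field touches contiguous memory in the
# field-major layout...
soa[2, :] += 1.0
# ...but strided memory in the point-major layout, which on many CPUs
# streams unused components through the cache:
aos[:, 2] += 1.0
```

Which layout wins depends on the kernel mix and the hardware, which is exactly why the choice should be expressible to the compiler rather than baked into the application source.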
3.4 The Wish List
In response to the application requirements identified above, we constructed a Wish List for programming
environment features that will efficiently address application needs. The list outlines some abstractions
and/or language constructs that would allow the applications to avoid false constraints and to expose
optimization opportunities to the software stack more expressively.
• Ability to write dimension-independent code easily.
• Ability to express functionally equivalent data structures, so that any one of them can be picked for
implementing the algorithm, with the understanding that the compiler can transform it into whichever
equivalent form suits the target architecture.
• The ability to map abstract processes to given architectures, and coupled to this, the ability to express
these mappings in either memory-unaware or at least flexible formulations.
• More, and more complex, information has been demanded from modelers that is relevant not to the
model itself but to the machines. Requiring less of this information, and allowing formulations close to
the models in natural language, is a critical point for applications. This does not restrict the possible
approaches; quite the opposite: for example, well-engineered libraries that provide the memory and
operator mechanics can be very successful if they provide intelligent interfaces.
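The first wish above can be illustrated with a short sketch: a gradient-magnitude kernel written once for any number of dimensions, using axis indices rather than hard-coded nested loops per dimension. The function is a hypothetical example, not taken from any library.

```python
# Illustrative sketch of dimension-independent code: one kernel valid
# unchanged in 1D, 2D, or 3D.
import numpy as np

def grad_norm_sq(field, spacing=1.0):
    """Sum of squared central differences along every axis of `field`,
    evaluated on the interior points."""
    interior = tuple(slice(1, -1) for _ in range(field.ndim))
    total = np.zeros(field[interior].shape)
    for ax in range(field.ndim):
        fwd = [slice(1, -1)] * field.ndim
        bwd = [slice(1, -1)] * field.ndim
        fwd[ax] = slice(2, None)    # shifted forward along this axis
        bwd[ax] = slice(None, -2)   # shifted backward along this axis
        d = (field[tuple(fwd)] - field[tuple(bwd)]) / (2 * spacing)
        total += d * d
    return total
```

The loop over `range(field.ndim)` is the whole point: the dimensionality is a runtime property of the data, not a property of the source code.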