or reliable solutions that they can adopt. For example, viewed on a multi-year time-scale, GROMACS has
re-implemented all of its high-performance code several times, always to make better use of data locality
[42, 77, 74], and has almost never been able to use existing portable high-performance external libraries
or DSLs. Many of those internal efforts have since been duplicated by third parties as general-purpose,
re-usable code that GROMACS could have used had it been available at the time.
The research communities that have been engaged in the discussions about peta- and exa-scale computing
are well informed about the challenges they face. Many have started to experiment with approaches recom-
mended by the researchers from the compilers, programming abstractions and runtime systems communities
in order to be better prepared for the platform heterogeneity. At the workshop, examples of many such efforts
were presented. These efforts fall mainly into three classes: approaches based on domain-specific languages
(DSLs), library-based methods, and combinations of the two. For example, OP2/PyOP2 [79],
STELLA (STEncil Loop LAnguage) [72] and HIPAcc (Heterogeneous Image Processing Acceleration) [64] are
embedded domain-specific languages (see Section 5), each tailored to a particular application domain and
abstracting away the details of a parallel implementation on a given target architecture. PyOP2 uses Python
as the host language, while OP2, STELLA, and HIPAcc use C++. OP2 and PyOP2 target mesh-based
simulation codes over unstructured meshes, generating code for MPI clusters, multicore CPUs and GPUs.
STELLA considers stencil codes on structured grids. An OpenMP and a CUDA back end are currently under
development. HIPAcc targets the domain of geometric multigrid applications on two-dimensional structured
grids [65] and provides code generation for accelerators such as GPUs (CUDA, OpenCL, RenderScript)
and FPGAs (C code suited for high-level synthesis). The latest efforts in GROMACS have likewise moved
the code from hand-tuned, inflexible assembly CPU kernels to an internally developed, compiler- and
hardware-portable SIMD intrinsics layer, from which kernels are generated automatically
for a wide range of model physics and hardware, including accelerators. In effect, this is an embedded DSL
for high-performance short-ranged particle-particle interactions. All the aforementioned approaches can
increase productivity in their target science domain—e.g., by reducing the lines of application code and
debugging time—as well as performance portability across different compute architectures.
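To make the division of labor concrete, the sketch below shows an embedded DSL in miniature. The `Stencil` class and its backend are purely illustrative inventions, not the API of OP2, STELLA, or HIPAcc: the application states *what* the stencil computes, and the framework decides *how* to execute it for a given target.

```python
# Illustrative sketch of an embedded DSL (hypothetical, not a real API):
# the user declares a stencil; a backend chooses the execution strategy.
import numpy as np

class Stencil:
    """Declarative description of a 1D stencil: (offset, coefficient) taps."""
    def __init__(self, taps):
        self.taps = taps

    def apply(self, field, backend="numpy"):
        # The backend choice stands in for code generation targeting
        # MPI clusters, multicore CPUs, or GPUs in the real DSLs.
        if backend == "numpy":  # vectorized execution on the interior
            out = np.zeros_like(field)
            n = len(field)
            for off, c in self.taps:
                lo, hi = max(0, -off), min(n, n - off)
                out[lo:hi] += c * field[lo + off:hi + off]
            return out
        raise ValueError(f"no backend named {backend!r}")

# A second-difference stencil, written once; the execution strategy is
# chosen by the framework, not hard-coded by the application.
laplace = Stencil([(-1, 1.0), (0, -2.0), (1, 1.0)])
```

Adding, say, a threaded or GPU backend would change only the `apply` dispatch, never the application-level stencil description, which is the productivity argument made above.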
Other efforts, such as the use of tiling within patches in AMR to expose greater parallelism, rely on
a library-based approach. Many of these efforts have met with some degree of success. Some are in effect
a usable component of an overall solution to be found in future, while others are experiments that are far
more informative about how to rethink the data and algorithmic layout of the core solvers. Though all these
efforts are helpful in part, and hold promise for the future, none of them currently provides a
complete, stable solution that applications can use in their transition to long-term viability.
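As a toy illustration of the tiling idea mentioned above, the sketch below splits a patch into tiles, each an independent unit of work that a runtime could hand to a separate thread. The `tiles` helper is hypothetical and not taken from any AMR library.

```python
# Hypothetical sketch: cover an n-D patch with tiles, each of which can
# be processed independently (exposing parallelism within a patch).
import itertools
import numpy as np

def tiles(shape, tile):
    """Yield tuples of slices covering an n-D patch, one tuple per tile."""
    ranges = [range(0, s, t) for s, t in zip(shape, tile)]
    for starts in itertools.product(*ranges):
        yield tuple(slice(a, min(a + t, s))
                    for a, t, s in zip(starts, tile, shape))

# Each tile is independent work: here a trivial in-place scaling; in an
# AMR code each tile would run a solver kernel, possibly on its own thread.
patch = np.arange(64, dtype=float).reshape(8, 8)
for tl in tiles(patch.shape, (4, 4)):
    patch[tl] *= 2.0
```

The same decomposition also improves locality: a tile sized to fit in cache keeps its working set resident for the duration of the kernel.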
3.2 Discussion
There are many factors that affect the decision by the developers of a large scientific library or an application
code base to use an available programming paradigm, but the biggest non-negotiable requirement is the
stability of the paradigm. The fear of adding a non-robustly-supported critical dependency prevents code
developers who use high-end platforms from adopting technologies that can otherwise benefit them. This
fear may stem from the lack of any guarantee of continued support, or from the experimental nature of a
technology that might compromise portability on current and/or future platforms. Often the developers
opt for a suboptimal or custom-built solution that does not get in the way of being able to run their
simulations. For example, even today there are research groups that use their own I/O solutions, with all the
attendant performance and maintenance overhead, because those solutions were built before the parallel I/O libraries
became mature, robust, and well supported. A group that has adopted a custom-built solution that suits its
purpose is much harder to persuade to adopt external solutions, even when those are better. It is
in general hard to promote adoption of higher-level abstractions unless there is a bound on the dependency
through the use of a library that was implemented in a highly portable way, and easy to install and link.
For this reason, embedded DSLs, code transformation technologies, and other approaches that provide
an incremental path of transition are more attractive to the applications. Any solution that demands a
complete rewrite of the code from scratch presents two formidable challenges to the users and developers of
the application codes: (1) verifying correctness of algorithms and implementations, since direct comparison
with subsections of existing code may not always be possible, and (2) a long gap when production must
continue with existing, non-evolving (and therefore sometimes outdated) code until the new version of the
code comes online. The trade-off of the incremental path is the possibility of missing out on some compiler-level optimizations.
There are other less considered but possibly equally critical concerns that have to do with expressibility.
The application developers can have a very clear notion of their data model, yet no effective way of expressing
that model in the available data structures and language constructs. The situation is even worse
for expressing the latent locality in their data model to compilers or other translation tools. None
of the prevalent mature high-level languages used in scientific computing have constructs for expressing
data locality. There is no theoretical basis for the analysis of data movement within
the local memory or remote memory. Because there is no formalism to inform the application developers
about the implications of their choices, the data structures get locked into the implementation before the
algorithm design is fully fleshed out. The typical development cycle of a numerical algorithm is to focus on
correctness and stability first, and then performance. By the time performance analysis tools are applied,
it is usually too late for anything but incremental corrective measures, which usually reduce the readability
and maintainability of the code. A better approach would be to model the expected performance of a
given data model before completing the implementation, and let the design be informed by the expected
performance model throughout the process. Such a modeling tool would need to be highly configurable, so
that its conclusions might be portable across a range of compilers and hardware, and valid into the future,
in much the same way that numerical simulations often use ensembles of input-parameter space in order to
obtain conclusions with reduced bias.
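A minimal sketch of what such a configurable modeling tool could look like, assuming a simple roofline-style model. Every function name and machine parameter here is an illustrative assumption, meant only to show how a layout choice could be evaluated before any implementation exists.

```python
# Hypothetical sketch of a pre-implementation performance model:
# estimate per-point cost of a kernel under different data layouts.

def traffic_per_point(fields_read, fields_written, bytes_per_value=8,
                      layout_overhead=1.0):
    """Bytes of main-memory traffic per grid point for one kernel sweep.

    layout_overhead > 1 models wasted bandwidth, e.g. an interleaved
    layout that streams unused components through the cache.
    """
    return (fields_read + fields_written) * bytes_per_value * layout_overhead

def time_per_point(flops, traffic_bytes, peak_flops=1e11, bandwidth=1e11):
    """Roofline-style estimate: limited by the slower of compute and memory."""
    return max(flops / peak_flops, traffic_bytes / bandwidth)

# Comparing a compact layout to one that streams 2x the needed data:
compact = time_per_point(10, traffic_per_point(3, 1))
wasteful = time_per_point(10, traffic_per_point(3, 1, layout_overhead=2.0))
```

Sweeping `peak_flops`, `bandwidth`, and `layout_overhead` over plausible ranges is the analogue of the input-parameter ensembles mentioned above: conclusions that hold across the sweep are the ones worth baking into the design.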
3.3 Application requirements
Most languages provide standard containers and data structures that are easy to use in high-level code, yet
very few languages or libraries provide interfaces for the application developer to inform the compiler about
expectations of data locality, data layout, or memory alignment. For example, a common concern for the
PDE solvers is the data structure for multiple field components that share an identical spatial layout:
should it be an array with an added dimension for the field components, or a structure; and within the array
or structure, in what order should the values be stored in memory for performance? The algorithm itself
is agnostic to the layout, but the choice of data structure bakes in the platform on which the code will
perform best. There are many similar situations that force application implementations to become
more platform specific than they need to be. Furthermore, the lack of expressibility can also present false
dependencies to the compiler and prevent optimization.
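The layout question can be made concrete with a small NumPy sketch: the same four-field state stored field-major (components contiguous across points) versus point-major (each point's components contiguous). The names and sizes are illustrative only.

```python
# Illustrative sketch: two functionally equivalent layouts for the same
# multi-field PDE state. The algorithm is layout-agnostic; the memory
# access pattern, and hence the fast platform, is not.
import numpy as np

npts, nfields = 1000, 4

soa = np.zeros((nfields, npts))   # field-major: soa[f] is one contiguous field
aos = np.zeros((npts, nfields))   # point-major: aos[i] is one point's state

# A sweep over a single field touches contiguous memory in the
# field-major layout...
soa[2, :] += 1.0
# ...but strided memory in the point-major layout, which on many CPUs
# streams unused components through the cache:
aos[:, 2] += 1.0
```

Which layout wins depends on the kernel mix and the hardware, which is exactly why the choice should be expressible to the compiler rather than baked into the application source.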
3.4 The Wish List
In response to the application requirements identified above, we constructed a Wish List for programming
environment features that will efficiently address application needs. The list outlines some abstractions
and/or language constructs that would allow the applications to avoid false constraints and to expose
optimization opportunities to the software stack more expressively.
• Ability to write dimension-independent code easily.
• Ability to express functionally equivalent data structures, so that any one of them can be picked for
implementing the algorithm, with the understanding that the compiler can transform it into whichever
equivalent form suits the target architecture.
• The ability to map abstract processes to given architectures, and coupled to this, the ability to express
these mappings in either memory-unaware or at least flexible formulations.
• More, and more complex, information has been demanded from modelers that is relevant not to the
model itself but to the machines. Requiring less of this information, and allowing formulations close to
the models in natural language, is a critical point for applications. This does not restrict the possible
approaches; quite the opposite: for example, well-engineered libraries that provide the memory and
operator mechanics can be very successful if they provide intelligent interfaces.
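The first wish above can be illustrated with a short sketch: a gradient-magnitude kernel written once for any number of dimensions, using axis indices rather than hard-coded nested loops per dimension. The function is a hypothetical example, not taken from any library.

```python
# Illustrative sketch of dimension-independent code: one kernel valid
# unchanged in 1D, 2D, or 3D.
import numpy as np

def grad_norm_sq(field, spacing=1.0):
    """Sum of squared central differences along every axis of `field`,
    evaluated on the interior points."""
    interior = tuple(slice(1, -1) for _ in range(field.ndim))
    total = np.zeros(field[interior].shape)
    for ax in range(field.ndim):
        fwd = [slice(1, -1)] * field.ndim
        bwd = [slice(1, -1)] * field.ndim
        fwd[ax] = slice(2, None)    # shifted forward along this axis
        bwd[ax] = slice(None, -2)   # shifted backward along this axis
        d = (field[tuple(fwd)] - field[tuple(bwd)]) / (2 * spacing)
        total += d * d
    return total
```

The loop over `range(field.ndim)` is the whole point: the dimensionality is a runtime property of the data, not a property of the source code.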