processor when the data reuse occurs. This can be done by iteration space tiling, data space tiling,
or domain decomposition.
Iteration Space Tiling is a code restructuring technique used to reduce working set size of a program
by dividing the iteration space defined by the loop structures into tiles.
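As a minimal sketch of iteration space tiling (a Python illustration with an invented 4×4 iteration space and tile size 2, not taken from the text), tiling changes only the order in which the same set of loop indices is visited, so that each tile's working set is touched completely before the traversal moves on:

```python
def tiled_indices(n, m, tile):
    """Yield (i, j) pairs of an n-by-m iteration space in tiled order
    rather than plain row-major order."""
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, m)):
                    yield i, j

# Both orders visit every index exactly once; only the visiting order
# (and hence the working set between reuses of nearby data) changes.
untiled = [(i, j) for i in range(4) for j in range(4)]
tiled = list(tiled_indices(4, 4, 2))
assert sorted(tiled) == sorted(untiled)
assert tiled[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)]  # first 2x2 tile
```

In a real loop nest the tile size would be chosen so that one tile of each accessed array fits in the targeted memory level.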
Data Space Tiling involves subdividing a rectilinear array into a regular array of rectilinear chunks
within the same memory space in order to maximize the consecutive memory locations that are
visited during traversal.
Data Decomposition is the partitioning of a data structure’s arrays into smaller chunks with the
intent of assigning each chunk to a different memory space for introducing parallelism across
chunks or improving data locality.
Data Placement/Pinning is the mapping of the tiles or chunks of data from a domain decomposed data
structure to memory spaces.
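The two definitions above can be sketched together (a hypothetical Python illustration; the memory-space names are invented stand-ins, not an API from the text): an array is decomposed into contiguous chunks, and a placement then maps each chunk to a memory space.

```python
def decompose(data, nchunks):
    """Split data into nchunks contiguous, nearly equal chunks."""
    n, r = divmod(len(data), nchunks)
    chunks, start = [], 0
    for k in range(nchunks):
        end = start + n + (1 if k < r else 0)  # spread the remainder
        chunks.append(data[start:end])
        start = end
    return chunks

spaces = ["numa0", "numa1"]          # stand-ins for physical memory spaces
chunks = decompose(list(range(10)), 2)
# Placement: map each chunk of the decomposition to a memory space.
placement = {space: chunk for space, chunk in zip(spaces, chunks)}
assert placement["numa0"] == [0, 1, 2, 3, 4]
assert placement["numa1"] == [5, 6, 7, 8, 9]
```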
Task Placement/Pinning is the assignment of threads of execution to a particular physical processor
resource and its related hierarchy of memory spaces. Many contemporary programming environments
do not offer an automatic method to directly relate the task placement to the data placement, aside
from very loose policies such as “first touch memory affinity.”
Data Layout is the mapping of a data structure’s chunk to the physical addresses corresponding to the
addressing scheme used by a given abstraction layer, which in turn provides affine mapping from the
logical address space seen by the programmer to the physical layout of the data arrays in a memory.
There may be multiple abstraction layers, which together serve a diversity of objectives, as outlined above.
For example, users may prefer to refer to the members of an array of structures (AoS), whereas an array of structures
of arrays (AoSoA) layout may be more profitable at the physical storage level. Physical layout may differ
between a software-managed memory that, for example, uses Hilbert curves, and a hardware-managed cache.
There must be a mapping between the data layouts in the various abstraction layers, and potentially in each
of the memory spaces.
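The AoS, SoA, and AoSoA layouts mentioned above can be sketched in a few lines (a Python illustration with invented particle data; real implementations would use contiguous typed arrays rather than Python lists):

```python
# Array of structures (AoS): the programmer-friendly logical view.
particles_aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0},
                 {"x": 5.0, "y": 6.0}, {"x": 7.0, "y": 8.0}]

# Structure of arrays (SoA): one contiguous array per member.
soa = {"x": [p["x"] for p in particles_aos],
       "y": [p["y"] for p in particles_aos]}

# Array of structures of arrays (AoSoA): SoA chunks of a fixed width,
# so each member is contiguous within a chunk.
CHUNK = 2
aosoa = [{"x": soa["x"][i:i + CHUNK], "y": soa["y"][i:i + CHUNK]}
         for i in range(0, len(particles_aos), CHUNK)]

assert aosoa[0]["y"] == [2.0, 4.0]
assert aosoa[1]["x"] == [5.0, 7.0]
```

A mapping layer such as the one described above would let the programmer keep writing `particles[i].x` while the storage uses the SoA or AoSoA form.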
A computation’s memory access is dictated by its data access pattern, which is a composition of
1. how the computation’s parallel tasks traverse the data structure (the iteration space and traversal order),
2. how the computation’s tasks are scheduled (i.e., the task placement),
3. data layout,
4. data placement, and
5. data decomposition.
Performance is constrained by the cost (time and energy) of moving data in service of the computation,
which is directly impacted by the computation’s data access pattern as represented in Figure 2.2 in the
Background chapter of this document. Data locality is a proxy measure for the volume and distance of
data movement; smaller “distance” and lower volumes imply lower cost. The idea is to shorten the distance
between successive references to the same memory location, so that it is more probable that the memory
word resides in the memory levels near to the processor when the data reuse occurs. Data locality has both
spatial and temporal dimensions as depicted in Figure 2.4 in the Background chapter of this document.
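Temporal locality in particular is often quantified as reuse distance: the number of distinct memory locations touched between two successive accesses to the same location. The following sketch (an illustrative Python measure over an invented address trace, not from the text) makes the notion concrete; the smaller the distance, the more likely the word still resides near the processor at reuse time.

```python
def reuse_distances(trace):
    """Return the reuse distance of every repeated access in a trace:
    the count of distinct other addresses touched since the previous
    access to the same address."""
    last_seen, distances = {}, []
    for pos, addr in enumerate(trace):
        if addr in last_seen:
            between = set(trace[last_seen[addr] + 1: pos])
            distances.append(len(between - {addr}))
        last_seen[addr] = pos
    return distances

# "a b a" reuses a at distance 1; "a b c a" at distance 2.
assert reuse_distances(["a", "b", "a"]) == [1]
assert reuse_distances(["a", "b", "c", "a"]) == [2]
```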
4.2 Key Points
The research community has been developing data structure abstractions for several years to improve the
expressiveness of parallel programming environments in order to improve application performance. Recently,
there has been a shift towards more data-centric programming to provide locality abstractions for data
structures because locality has become one of the most important design objectives. During the PADAL
workshop, we identified the following design principles as important and desired by the application program-
mers.
• We seek to improve performance by controlling the separate components (traversal, execution policy,
data layout, data placement, and data decomposition) whose composition determines a computation’s
data access pattern.
• Separation of concerns is important to maximize expressivity and optimize performance. That prin-
ciple implies a clear distinction between a logical, semantic specification, and lower-level controls over
the physical implementation. At the semantic level, sequential work is mapped onto parallel data
collections, using logical abstractions of data layout which best promote productivity. Domain experts
can also expose semantic characteristics like reference patterns that can inform trade-offs made by
tools.
• A computation’s algorithm is expressed in terms of operations on certain abstract data types, such as
matrices, trees, etc. The code usually expresses data traversal patterns at high level, such as accessing
the parent node in a tree, or the elements in a row of a matrix. This expresses the algorithmic locality
of the data, which may not be related to the actual locality of the data access. The actual locality
is obtained when the algorithm’s traversals, tasks, and data structures are mapped to the physical
architecture.
• The efficiency of a computation’s on-node data access pattern may depend upon the architecture of the
computer on which it executes. Performance may be improved by choosing a data layout appropriate
for the architecture. This choice can be made without modifying the computation’s source code if the
data structure has a polymorphic data layout.
• Performance tuners, who may be distinct from the domain experts, may exercise control over perfor-
mance with a set of interfaces that provide more explicit control over the data movement, and can be
distinct from those that specify semantics. Through these interfaces, they may decompose, distribute
and map parallel data collections onto efficient physical layouts with specialized characteristics. This
may be done in a modular fashion that leaves the semantic expression of the code intuitive, readable
and maintainable. This approach presents an explicit “separation of concerns” by offering different
interfaces to different programmers (the domain scientists vs. the computer science and hardware
experts).
• A key goal of separation of concerns is to separate the interface of the architecture-agnostic abstract
machine (AM) from that of the physical machine (PM) using hierarchical abstractions. An AM is nec-
essary to express an algorithm, whether in a functional language, as a PGAS abstraction, or something
else. An algorithm written for an AM must be translated into a program for a physical machine.
The PM may or may not be closely related to the AM. The translation effort depends on the “distance”
between the AM and the PM. Since there are many different PMs, and the lifetime of applications
should be much longer than the lifetime of an architecture, the usual requirement is that an algorithm
for an AM should be executed efficiently in the widest possible class of foreseeable PMs. This is the
well-known problem of performance portability. From this perspective, an AM should be distinct from
PMs for two reasons: first, it gives the user a uniform view of the programming space, often at a much
higher level of abstraction than any specific programming model for a PM; second, the hope is that
by being equidistant from any PM, the translation can lead to equally efficient implementations with
similar effort.
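The polymorphic data layout described in the bullets above can be sketched as follows (a Python illustration with an invented `View2D` class, not an API from the text): the computation indexes the view logically, while a layout object owns the affine map to physical (flattened) storage, so switching row-major to column-major requires no change to the computation's source.

```python
class View2D:
    """A 2D view whose physical layout is chosen at construction time."""

    def __init__(self, rows, cols, layout="row_major"):
        self.rows, self.cols = rows, cols
        # The affine index map is the only layout-dependent piece.
        if layout == "row_major":
            self._index = lambda i, j: i * cols + j
        else:  # column-major
            self._index = lambda i, j: j * rows + i
        self.data = [0.0] * (rows * cols)

    def __setitem__(self, ij, value):
        self.data[self._index(*ij)] = value

    def __getitem__(self, ij):
        return self.data[self._index(*ij)]

# The same logical access works unchanged for either physical layout,
# but the value lands in a different physical slot.
vr = View2D(2, 3, "row_major"); vr[0, 1] = 7.0
vc = View2D(2, 3, "col_major"); vc[0, 1] = 7.0
assert vr[0, 1] == vc[0, 1] == 7.0      # identical logical result
assert vr.data.index(7.0) == 1          # row-major physical slot
assert vc.data.index(7.0) == 2          # column-major physical slot
```

A performance tuner could select the layout per architecture while the domain scientist's code that reads and writes `view[i, j]` stays untouched.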
4.3 State of the Art
There is a concern that data structure and layout features that are needed to “get us to exascale” are lacking
in existing, standard languages. There are a few options for mitigating this situation.