Programming abstractions for

6.4. RESEARCH PLAN

• Information on the execution flow of the application in terms of the tasks and data-elements
defined by the user;

• Information about the mapping of those tasks and data-elements to the computing and memory
resources.

The former area will help the user determine whether the application is executing correctly, while
the latter will help in identifying performance issues due to improper choices in the runtime (bad
scheduling, bad locality, etc.). Both areas will be important for different audiences: the former
will be more important for application developers, while the latter will prove crucial for runtime
system developers.

6.4.3 Hint framework

Task-based programming models do a good job of requiring very little information from the
programmer: the tasks, the data items, and the dependencies between them. While this is sufficient
to produce a correct execution of the program, it may not be sufficient to produce an efficient one.
As previously mentioned, task-based runtimes that are oblivious to locality constraints will perform
badly.

It is therefore important to understand the type of information that a programmer can easily supply
and how best to communicate it through the programming model to the runtime system. Task-based
runtimes already know the dependence structure between tasks; the research question is what other
information they could make use of that the programmer can easily provide to improve their
scheduling and placement heuristics.

Programming Abstractions for Data Locality

Chapter 7

System-Scale Data Locality Management

As the complexity of parallel computers rises, applications find it increasingly difficult to access
data efficiently. Today’s machines feature hundreds of thousands of cores, a deep memory hierarchy
with several cache layers, non-uniform memory access with several levels of memory (e.g., flash,
non-volatile, RAM), elaborate topologies (both at the shared-memory level and at the
distributed-memory level using state-of-the-art interconnection networks), advanced parallel storage
systems, and resource management tools. With so many degrees of freedom, arranging bytes becomes
significantly more complex than computing flops, and the expected decrease in memory per core in the
coming years will only worsen the situation.

To address the data locality issue system-wide, one possible approach is to examine and refactor the
application ecosystem, i.e., the execution characteristics, the way in which the resources are used,
the application’s relation with the topology, or its interaction with other executing applications.
This approach implies a strong need for new models and algorithms, or at least significantly
refactored ones. It also points to the need for new mechanisms and tools for improving (1)
topology-aware data accesses, (2) data movements across the various software layers, and (3) data
locality and transfers for applications.

7.1 Key points

There is a rich literature documenting the tremendous efforts expended on optimizing various
applications statically. The approaches mentioned in the literature include data layout
optimizations, compilation optimization, and parallelism structuring. These approaches have in
general been extremely effective; however, several additional factors cannot be optimized prior to
execution, yet can have a dramatic impact on application performance beyond the static
optimizations. Among these factors are:

• the configuration of allocated resources for the specific run,

• the network traffic induced, or any other interference caused by other running applications,

• the topology of the target machine,

• the relative location of the data accessed by the application on the storage system,

• the dependence of the execution on the input,

and many more.

Often these runtime factors are orthogonal to the static optimizations that can be performed. For
instance, recent results [27, 60] show that a non-contiguous allocation can reduce performance by
more than 30%. However, a batch scheduler cannot always provide a contiguous allocation, and even if
it could, the way processes are mapped to the allocated resources still has a big impact on
performance [23, 28, 45, 50]. The reason is often the complex network and memory topology of modern
HPC systems, combined with the fact that some pairs of processes exchange more data than others.
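To make the effect of placement concrete, the following minimal Python sketch computes a hop-bytes cost (volume exchanged multiplied by network hops) for three placements of four processes. Everything in it is an illustrative assumption rather than data from the cited studies: the eight-node ring network, the traffic volumes, and the helper names `ring_hops` and `hop_bytes`.

```python
def ring_hops(a, b, n_nodes):
    """Hop distance between nodes a and b on a ring of n_nodes nodes."""
    d = abs(a - b)
    return min(d, n_nodes - d)


def hop_bytes(traffic, mapping, n_nodes):
    """Sum of (volume * hops) over all communicating process pairs."""
    return sum(
        volume * ring_hops(mapping[p], mapping[q], n_nodes)
        for (p, q), volume in traffic.items()
    )


# Hypothetical traffic: pairs (0,1) and (2,3) exchange heavily,
# pairs (1,2) and (0,3) exchange very little.
traffic = {(0, 1): 100, (2, 3): 100, (1, 2): 1, (0, 3): 1}
N_NODES = 8

# process -> node placements
contiguous = {0: 0, 1: 1, 2: 2, 3: 3}   # contiguous allocation, good mapping
permuted = {0: 0, 1: 2, 2: 1, 3: 3}     # same allocation, poor mapping
scattered = {0: 0, 1: 4, 2: 2, 3: 6}    # non-contiguous allocation

cost_contiguous = hop_bytes(traffic, contiguous, N_NODES)
cost_permuted = hop_bytes(traffic, permuted, N_NODES)
cost_scattered = hop_bytes(traffic, scattered, N_NODES)

print(cost_contiguous, cost_permuted, cost_scattered)  # prints: 204 404 804
```

Under these assumptions the scattered allocation costs roughly four times as much as the contiguous one, and a poor mapping within the same contiguous allocation roughly doubles the cost. Hop-bytes is only one possible metric; it ignores contention and routing.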

Furthermore, energy constraints imposed by exascale goals are altering the balance of interconnect
capabilities, reducing the bandwidth-to-compute ratio while increasing injection rates. This shift
is causing a fundamental reconsideration of the bulk synchronous parallel (BSP) programming model
and of interconnect design. One of the leading contenders for a new interconnect is a multi-level
direct network such as Dragonfly [4, 56]. Such networks are formed from highly connected parts,
placing every node within a few hops of all other nodes in the system. This may benefit the
unstructured communications that often occur in graph algorithms, but the limited connections
between parts can be bottlenecks for structured communication patterns [76]. At the node level, a
promising approach for fully utilizing higher core counts on next-generation architectures is
over-decomposed task parallelism [52], which will stress the interconnect in ways different from the
traditional BSP model.

In order to optimize system-scale application execution, we need models of the machine at different
scales. We also need models of the application and its algorithms, and tools to optimize the
execution within the whole ecosystem. The literature provides many models and abstractions for
writing parallel codes that have been successful in the past. However, these models alone may not be
sufficient for scaling future applications because of data traffic and coherence management
constraints [78]. Current models and abstractions are more concerned with computations than with the
costs incurred by data movement, topology, and synchronization. It is important to provide new
hardware models to account for these phenomena, as well as abstractions to enable the design of
efficient topology-aware algorithms and tools.

A hardware model is needed to understand how to control locality. Modeling future large-scale
parallel machines will require work in the following directions: (1) better ways to describe the
memory hierarchy, (2) an integrated view of nodes and the network, (3) the inclusion of qualitative
knowledge such as latencies, bandwidths, or buffering strategies, and (4) ways to express the
multi-scale properties of the machine.

Applications need abstractions that allow them to express their behavior and requirements in terms
of data access, locality, and communication. For this, we need to define metrics to capture the
notions of data access, affinity, and network traffic. The MPI standard offers the process topology
interface, which allows an application to specify the dataflow between processes [43]. While this
interface is a viable first step, it is limited to BSP-style MPI applications. More general
solutions are needed for wider coverage. To optimize execution at system scale, we need to extract
the application requirements from the application models and abstractions and apply them to the
mechanisms, tools, and algorithms provided by the network model. With enough such information
available, it becomes feasible to perform several optimizations, such as improving storage access,
mapping processes onto resources based on their affinity [28, 44, 45, 80], and selecting resources
according to the communication pattern of the application and those of the currently running
applications. It is also possible to couple allocation and mapping. Appropriate abstractions in the
context of storage bring in the possibility of exploiting an often overlooked aspect of locality.
Not only is it possible for applications to request that data and execution be local to each other,
it is also possible to spread out data and the corresponding execution to take advantage of the
parallelism inherent in the system.

7.2 State of the Art

Various approaches for topology mapping have been developed. TreeMatch [51] maps processes onto
computing resources in the case of a tree topology (such as current NUMA nodes and fat-tree
topologies). LibTopoMap [45] addresses the same problem as TreeMatch but for arbitrary topologies
such as torus, grid, or completely unstructured networks. The general problem is NP-hard and no good
approximation schemes are known, which forces developers to rely on various heuristics. However,
several specialized versions of the problem can be solved near-optimally in polynomial time, for
example, mapping Cartesian topologies to Dragonfly networks [76].
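As an illustration of why heuristics are used, the sketch below contrasts a simple greedy placement with an exhaustive search on a tiny instance. It is not the algorithm of TreeMatch or LibTopoMap; the ring network, the traffic values, the hop-bytes cost function, and the names `greedy_map` and `optimal_map` are all assumptions made for the example.

```python
from itertools import permutations


def hops(a, b, n_nodes):
    """Hop distance on a ring of n_nodes nodes (assumed topology)."""
    d = abs(a - b)
    return min(d, n_nodes - d)


def cost(traffic, mapping, n_nodes):
    """Hop-bytes: sum of volume * hops over communicating pairs."""
    return sum(v * hops(mapping[p], mapping[q], n_nodes)
               for (p, q), v in traffic.items())


def greedy_map(traffic, n_procs, n_nodes):
    """Place the heaviest communicators first, each on the free node
    that adds the least cost with respect to already placed partners."""
    weight = [0] * n_procs
    for (p, q), v in traffic.items():
        weight[p] += v
        weight[q] += v
    order = sorted(range(n_procs), key=lambda p: -weight[p])

    mapping, free = {}, set(range(n_nodes))
    for p in order:
        def added_cost(node):
            total = 0
            for (a, b), v in traffic.items():
                if a == p and b in mapping:
                    total += v * hops(node, mapping[b], n_nodes)
                elif b == p and a in mapping:
                    total += v * hops(node, mapping[a], n_nodes)
            return total

        best = min(sorted(free), key=added_cost)  # ties go to the lowest node id
        mapping[p] = best
        free.remove(best)
    return mapping


def optimal_map(traffic, n_procs, n_nodes):
    """Exhaustive search over all placements; only viable for tiny inputs."""
    best = min(permutations(range(n_nodes), n_procs),
               key=lambda perm: cost(traffic, dict(enumerate(perm)), n_nodes))
    return dict(enumerate(best))


# Hypothetical instance: four processes on a six-node ring.
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (0, 3): 10, (0, 2): 1}
N_PROCS, N_NODES = 4, 6

greedy = greedy_map(traffic, N_PROCS, N_NODES)
optimal = optimal_map(traffic, N_PROCS, N_NODES)
greedy_cost = cost(traffic, greedy, N_NODES)
optimal_cost = cost(traffic, optimal, N_NODES)
print("greedy:", greedy_cost, "optimal:", optimal_cost)
```

The exhaustive search examines every injective placement, so its running time grows factorially with the number of processes, whereas the greedy pass is polynomial but offers no guarantee of matching the optimum.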

Topology mapping can also be seen as a graph embedding problem in which an application graph is
embedded into a machine graph. Therefore, graph partitioners such as Scotch [33] or ParMetis [54]
could provide a solution, though they may require more precise information than more specialized
tools, and the solutions are not always good [51]. Zoltan2 [9, 28] is a toolkit that, after
processes are allocated to an application, can map these processes to resources based on geometric
partitioning, where processes and computing units are identified by coordinates in a
multidimensional space.
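The idea of coordinate-based mapping can be sketched with a simplified recursive bisection, shown below in Python. This is not Zoltan2's algorithm or API; it is a hypothetical illustration in which both the tasks and the compute nodes are split at the median of their widest coordinate, and the matching halves are paired recursively. The function names and the 2x2 example coordinates are assumptions.

```python
def bisect(points, ids):
    """Split ids into two halves at the median of the widest coordinate."""
    dims = range(len(points[ids[0]]))
    widest = max(
        dims,
        key=lambda d: max(points[i][d] for i in ids) - min(points[i][d] for i in ids),
    )
    # The id is used as a tie-breaker so the split is deterministic.
    ordered = sorted(ids, key=lambda i: (points[i][widest], i))
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]


def geometric_map(task_xy, node_xy, tasks=None, nodes=None):
    """Map tasks to nodes by bisecting both point sets in lockstep.

    Assumes the same number of tasks and nodes."""
    if tasks is None:
        tasks = list(range(len(task_xy)))
        nodes = list(range(len(node_xy)))
    if len(tasks) != len(nodes):
        raise ValueError("sketch assumes as many nodes as tasks")
    if len(tasks) == 1:
        return {tasks[0]: nodes[0]}
    t_lo, t_hi = bisect(task_xy, tasks)
    n_lo, n_hi = bisect(node_xy, nodes)
    mapping = geometric_map(task_xy, node_xy, t_lo, n_lo)
    mapping.update(geometric_map(task_xy, node_xy, t_hi, n_hi))
    return mapping


# Hypothetical example: four tasks on the corners of a unit square and
# four nodes of a 2x2 mesh, listed in a different order.
task_xy = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
node_xy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]

mapping = geometric_map(task_xy, node_xy)
print(mapping)  # task id -> node id
```

In this example each task lands on the node with identical coordinates, so tasks that are neighbors in the application space stay neighbors on the machine. The approach needs only coordinates, not an explicit communication graph.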

Hardware Locality (hwloc) [38, 47] is a library and a set of tools aimed at discovering and exposing
the hardware topology of machines, including processors, cores, threads, shared caches, NUMA memory
nodes and I/O devices. Netloc [39, 68] is a network model extension of hwloc to account for locality
requirements
