6.1. KEY POINTS
A big challenge is how to communicate the hierarchical data properties of an application to the runtime
so that they can be exploited to generate efficient schedules. Classical random work stealers (e.g. Cilk-
like [8]) do not exploit this. Socket-aware policies exist (e.g. Qthread [70]) that perform hierarchical work
stealing: a) first among cores in a socket and b) then among sockets. Some programming models expose
an API that allows programmers to specify on which NUMA node/socket a collection of tasks should be
executed (e.g. OmpSs [7]). Configurable work stealers which can be customized with scheduling hints have
also been developed [91]. Finally, a more extreme option is to allow the application programmer to attach
a custom work stealing function to the application [67]. How to effectively specify this information in a
programmer-friendly way is an open topic of research.
b) Task Granularity
Large (i.e., coarse-grained) tasks can result in load imbalance at synchronization points, and reducing task sizes is a common method to improve load balance. On the other hand, the granularity of tasks directly influences scheduling overheads: overly fine-grained tasks increase the amount of time spent inside the runtime performing jobs such as scheduling and work stealing. Task granularity also impacts locality, since larger tasks have bigger memory footprints. Task sizes are therefore not only a trade-off between parallelism and runtime overheads, but should also be set to exploit the memory hierarchy efficiently. Shared last-level caches provide larger storage that can be exploited particularly well when groups of tasks share some of their inputs; this is called constructive cache sharing [24].
Finding the best granularity is a complex problem since all three metrics to be optimized (overheads,
load balance, data locality) are connected. Since the optimum configuration may depend on the input,
auto-tuning techniques are a promising approach. Enabling auto-tuning involves programmer effort to find
ways to partition data sets in a parametric way, allowing the runtime to tune the task size.
6.1.2 Expressing Data Locality with Task-Based Systems
Nested parallelism:
Nested or recursive algorithmic formulation is a well-known technique to increase data reuse in the upper levels of the memory hierarchy and thereby reduce memory latency overheads. It often requires slight changes to the original algorithm. At the same time, to realize the performance benefits of such a formulation, it is critical that the resulting nested algorithm is mapped correctly onto the underlying architecture: task-parallel runtimes should exploit recursive formulations by scheduling nested parallel tasks onto processing units that share some level of cache. This should not prevent runtimes from performing work stealing when load imbalance is detected, as long as it does not hinder overall performance.
Distance-aware scheduling:
Reducing data movement needs to happen at all levels of the system architecture to be effective: from a single CPU socket within a multi-socket shared-memory node up to multiple distributed-memory nodes linked through a high-performance network interconnect. This bottom-up approach makes improving data locality within the socket the first priority. Task-parallel runtimes therefore need to provide an abstraction that lets algorithm developers schedule tasks that share data dependencies on the same socket. As with the nested parallelism technique, when load imbalance arises, work should first be stolen from the closest CPU sockets. This is especially important on NUMA architectures.
Pipelining across Multiple Computational Stages:
Many numerical algorithms are built on top of optimized basic blocks. For instance, dense eigensolvers require three computational stages: reduction of the matrix to condensed form, an iterative solver to extract the eigenvalues, and a back transformation to obtain the associated eigenvectors. Each stage aggregates several computational kernels, which may already be optimized independently for data locality. However, pipelining across subsequent basic blocks (e.g., through look-ahead optimizations) may undermine these locality benefits, since task-parallel runtimes may not have the global directed acyclic graph before execution. There is, therefore, an urgent need to express and maintain data locality across the overall directed acyclic graph without delaying the execution of tasks on the critical path. While this is still an open research problem in the
context of distributed-memory systems, this issue is handled on shared-memory machines through specific data locality flags provided by the user.
6.2 State of the art
There are many software projects on task-parallel runtimes. We list a number of them here but it is by no
means a complete list.
Nanos++
The Nanos++ dynamic runtime system interfaces the task-parallel application with the under-
lying hardware architecture. A core component of the runtime is in charge of handling data movement and
ensuring coherency, so that the programmer does not need to deal with memory allocation and movements.
Another core component is the set of scheduling policies, which includes a newly introduced policy for mitigating work stealing overhead: the distance-aware work stealing policy. The OmpSs programming
model, a high-level task-based parallel programming model, is powered by the Nanos++ runtime. It relies on
compiler technology and supports task parallelism using synchronization based on data-dependencies. Data
parallelism is also supported by means of services mapped on top of its task support. OmpSs also supports
heterogeneous programming, as reflected in the modular support for different architectures in Nanos++
(SMP, CUDA, OpenCL, simulators such as TaskSim, etc.). Nanos++ also provides instrumentation tools
to help identify performance issues.
QUARK
QUARK (QUeuing And Runtime for Kernels) provides a library that enables the dynamic execution
of tasks with data dependencies in a multi-core, multi-socket, shared-memory environment. QUARK infers
data dependencies and precedence constraints between tasks from the way that the data is used, and then
executes the tasks in an asynchronous and dynamic fashion in order to achieve a high utilization of the
available resources. The dependencies between the tasks form an implicit DAG; however, this DAG is never explicitly realized in the scheduler. Instead, the structure is maintained by queuing tasks on data items, where they wait for the appropriate access to the data. Tasks are inserted into the scheduler, which stores
them and executes them when all the dependencies are satisfied. In other words, a task is ready to be
executed when all parent tasks have completed. The execution of ready tasks is handled by worker threads
that simply wait for tasks to become ready and execute them using a combination of default task assignments
and work stealing.
SuperMatrix
Similarly to QUARK, SuperMatrix is a runtime system that mainly parallelizes matrix op-
erations for SMP and/or multi-core architectures. This runtime system demonstrates how code described at
a high level of abstraction can achieve high performance on such architectures, while completely hiding the
parallelism from the library programmer. The key insight is to view matrices hierarchically, as consisting of blocks that serve as units of data, with operations over those blocks treated as units of computation.
The implementation transparently enqueues the required operations, internally tracking dependencies, and
then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchi-
tectures. This separation of concerns allows library developers to implement algorithms without concerning
themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can
be implemented in the runtime system independently of the code that enqueues the operations.
PaRSEC
PaRSEC is a generic framework for architecture aware scheduling and management of micro-
tasks on distributed many-core heterogeneous architectures. Applications are expressed as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies in a
totally distributed fashion. PaRSEC assigns computation threads to the cores, overlaps communications and
computations and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA
nodes and algorithmic features such as data reuse. The framework includes libraries, a runtime system, and
development tools to help application developers tackle the difficult task of porting their applications to
distributed memory, highly heterogeneous and diverse environments.