Programming abstractions for

Yüklə 0,54 Mb.

Pdf görüntüsü

səhifə	16/23
tarix	24.12.2017
ölçüsü	0,54 Mb.
	#17201

1 ... 12 13 14 15 16 17 18 19 ... 23

Chapter 6

Data Locality in Runtimes for Task Mod-

els

Runtimes for task models enable a problem-centric description of an application’s parallelism while hiding

the details of task scheduling to complex architectures from the programmer. This separation of concerns is

probably the most important reason for the success of using runtime environment systems for task models.

It enables developers to “taskify” their applications while focusing on the scientiﬁc algorithms they are most

familiar with. Programmers delegate all responsibilities related to eﬃcient execution to the task scheduling

runtime thereby achieving higher productivity and portability across architectures. The initial niche of these

task model oriented runtimes was the dense linear algebra community [12, 88] upon which many scientiﬁc

ﬁelds depend on for high performance computing. More recently, it has been extended to other scientiﬁc

domains such as computational astronomy [21], ﬂuid-structure interactions [66] and N-body problems [63, 2].

Runtimes for task models can be roughly divided into two groups:

Task-parallel runtime In a task-parallel runtime (e.g., Cilk [8], Intel TBB task groups [49], OpenMP 3.0

tasks [71]), created tasks are ready to be executed in parallel as soon as they are created. The data

dependencies in these runtimes are often expressed through parent-child states (for instance, due to the

fork-join paradigm). Child states will inherit resolved dependencies from parent state(s) and, therefore,

no dependence discovery is necessary for these runtimes.

Task-dataﬂow runtime In a task-dataﬂow runtime (e.g., OmpSs [7], StarPU [5], QUARK [55], Super-

Matrix [20], OpenMP 4.0 tasks [71], Intel TBB graph parallelism [49], PaRSEC [13], or OCR [69]),

instantiated tasks may not be immediately ready to execute and may depend on data generated from

previous tasks. In these runtimes, data dependencies must be resolved prior to executing tasks.

When it comes to parallel eﬃciency, runtimes for task models usually provide decent performance. How-

ever, due to their dynamic or non deterministic behaviors, wrong scheduling decisions at runtime may con-

siderably increase time to solution. For instance, many task-based runtimes make use of distributed queues

for performance purposes, which enable local task scheduling but may come at the price of introducing load

imbalance. In such runtimes, work stealing is the mechanism by which idle cores/threads/workers obtain

tasks to execute and, by the same token, improves load balancing. When tasks are “stolen”, the working set

of those tasks is also frequently stolen, i.e., a work stealing operation usually results in data migration.

In fact, an oblivious work stealing strategy can further exacerbate the overhead of data motion, which

may hide the existing beneﬁt of data locality at the ﬁrst place. To deal with such issues, it is necessary to

make the runtime aware of high level data access pattern information so that it can perform locality-aware

scheduling decisions. In this chapter, we explore the proposition that task-based programming models and

the runtime system should oﬀer a systematic way to express locality information, while still maintaining

high productivity for end users.

6.1. KEY POINTS

6.1

Key Points

We now brieﬂy discuss the key issues of data locality in task-based systems.

6.1.1

Concerns for Task-Based Programming Model

For both task-parallel and task-dataﬂow runtimes, the programmer and/or the runtime must consider two

issues: a) the runtime’s scheduling mode, and b) ﬁnding the appropriate task granularity.

a) Runtime Scheduling Mode

Previous eﬀorts to optimize and improve the performance of a given application have always involved tuning

an application to a particular machine, for example by optimizing for cache size, memory hierarchy, or

the number of cores. Moving forward, this approach is untenable as machine variability will only increase

therefore making a static decision impractical. At the same time, static or deterministic scheduling enables

oﬄine data locality optimizations but cannot deal with runtime hazards such as load imbalance issues. Two

levels of scheduling (static and dynamic) have shown promising results by proposing an interesting trade oﬀ

between concurrency and data locality [29].

Finding the assignment of a set of ﬁxed-time tasks to a set of processors that minimizes the makespan

is a hard computational problem for which no fast optimal algorithm exists for more than two processors.

In a shared memory multiprocessor, the assignment itself perturbs the execution time for each task due to

resource sharing. This makes this already complex problem even harder. In practice, runtimes rely on simple

heuristics to generate decent scheduling policies. Depending on the tasking semantics this involves diﬀerent

amounts of work. In task-parallel runtimes, scheduling involves fetching a task from the ready queue(s)

or running the work-stealing loop if there are no ready tasks in the queues. As long as work stealing is

minimized, this scheme features low overhead and generates locality eﬃcient execution for codes optimized

for locality on the sequential path. In task-dataﬂow runtimes, there are two scheduling levels a) resolving

dependencies to ﬁnd ready tasks and b) scheduling ready tasks to workers. The latter is in general performed

as in task-parallel schemes. Task-dataﬂow schemes can improve performance by weakening synchronization

points. However, since tasks may be inserted in the task queues out of order, the inter-task locality is often

not exploited as eﬃciently [75].

Extending these runtimes to perform locality-aware schedules in the general case is a complex task. First

of all, how to provide locality information to the runtime is still an open problem. In the optimal case,

the runtime should have a global notion of the application’s parallel structure and data access pattern.

However, such information can not generally be extracted automatically, and it is also a diﬃcult task for

the programmer. Explicitly stating information on task inputs and outputs allows the runtime to analyze

the data reuse of a window of tasks. On the other hand, such optimizations often trade oﬀ concurrency for

locality. Spending too much time in the runtime analyzing tasks instead of scheduling them can result in

performance loss. Any information that a runtime takes into account for locality should still result in a quick

scheduling decision.

Impact of work stealing

In task-based systems, work stealing is a common scheduling technique used

primarily for load-balancing. Work stealing can be optimized to exploit locality across tasks. If sets of

tasks can constructively share the cache then limiting the work stealing to the tasks within the set will limit

the working set migration to the private caches (shared caches will not be aﬀected). Parallel-depth-ﬁrst

schedulers attempt to constructively share the cache by scheduling the oldest task in sequential order in

the set of cores sharing the cache and only resorting to global work stealing when the task queues become

empty [70].

In general we want to minimize the number of steals. This works well if the application can quickly be

partitioned into almost equivalent sets of work that can then proceed independently. This is generally the

case for divide-and-conquer parallelism, but for more general approaches such as loop-style parallelism (i.e.,

tasks generated inside a for loop) using work stealing to keep all cores busy is usually not eﬃcient. This

happens because distributing N tasks generated by a N -iteration loop will require exactly N work steals. If

N is larger than the number of processors, then a work-sharing partitioning of the loop can be more eﬃcient.

Programming Abstractions for Data Locality

Yüklə 0,54 Mb.

Dostları ilə paylaş:

1 ... 12 13 14 15 16 17 18 19 ... 23