StarPU
StarPU is a runtime system that schedules tasks onto accelerator-based platforms. It is meant to
be used as a back-end for parallel language compilation environments and high-performance libraries. The
two basic principles of StarPU are, first, that tasks can have several implementations, one for some or all of the heterogeneous processing units available in the machine, and, second, that transfers of data to these processing units are handled transparently by StarPU. Thanks to its auto-tuning facilities, StarPU transparently predicts execution times and data transfer overheads. This allows StarPU’s dynamic scheduler to avoid load imbalance while enforcing data locality, thereby reducing pressure on the memory subsystem, which is a critical resource on accelerator-based platforms.
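These two principles can be illustrated with a toy model (hypothetical names and constants; this is not the actual StarPU C API): a task carries one implementation per kind of processing unit, and the scheduler picks the unit that minimizes predicted compute time plus data-transfer cost, so a small task stays where its data already resides.

```python
class Task:
    def __init__(self, impls, data_size):
        self.impls = impls          # e.g. {"cpu": fn, "gpu": fn}
        self.data_size = data_size  # bytes the task reads/writes

def pick_unit(task, perf_model, latency, bandwidth, data_location):
    """Choose the unit minimizing predicted compute + transfer time."""
    best_unit, best_cost = None, float("inf")
    for unit in task.impls:
        compute = perf_model[unit](task.data_size)
        if data_location == unit:
            transfer = 0.0          # data already resides on this unit
        else:
            transfer = latency[unit] + task.data_size / bandwidth[unit]
        if compute + transfer < best_cost:
            best_unit, best_cost = unit, compute + transfer
    return best_unit

# Hypothetical calibration: the GPU computes faster, but moving data
# to it costs a fixed latency plus a bandwidth-proportional term.
perf = {"cpu": lambda n: n * 1e-8, "gpu": lambda n: n * 1e-9}
lat = {"cpu": 0.0, "gpu": 1e-5}
bw = {"cpu": 1e12, "gpu": 1e10}

big = Task({"cpu": None, "gpu": None}, data_size=1_000_000)
small = Task({"cpu": None, "gpu": None}, data_size=100)
print(pick_unit(big, perf, lat, bw, "cpu"))    # transfer pays off -> gpu
print(pick_unit(small, perf, lat, bw, "cpu"))  # stay with the data -> cpu
```

The point of the sketch is only the shape of the decision: with calibrated performance models, a large task justifies the transfer to the faster unit, while a small one is cheaper to run where its data already is.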
The Open Community Runtime (OCR)
OCR is a task-dataflow programming model and research runtime aimed at determining what works, and what does not, for large-scale task-dataflow programming models and runtimes. It provides a simple low-level API with which programmers, or higher-level programming languages and abstractions, can inform the runtime system of tasks, of the data those tasks use (called data-blocks), and of the dependencies between them.
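The flavor of such an API can be sketched as follows (illustrative Python with invented names; OCR's real interface is a C API): tasks, data-blocks, and the dependencies between them are registered with the runtime, and a task becomes runnable only once every data-block it depends on has been produced.

```python
class MiniRuntime:
    def __init__(self):
        self.tasks = []   # list of (fn, input block names, output block name)
        self.blocks = {}  # data-block name -> value

    def create_block(self, name, value):
        self.blocks[name] = value

    def create_task(self, fn, inputs, output):
        self.tasks.append((fn, inputs, output))

    def run(self):
        pending = list(self.tasks)
        while pending:
            for t in pending:
                fn, inputs, output = t
                if all(i in self.blocks for i in inputs):  # deps satisfied
                    self.blocks[output] = fn(*(self.blocks[i] for i in inputs))
                    pending.remove(t)
                    break
            else:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
        return self.blocks

rt = MiniRuntime()
rt.create_task(lambda a, b: a + b, ["x", "y"], "sum")  # depends on x, y
rt.create_task(lambda s: s * 2, ["sum"], "result")     # depends on sum
rt.create_block("x", 3)
rt.create_block("y", 4)
print(rt.run()["result"])  # tasks fire in dependency order -> 14
```

Note that the second task was registered before its input existed; the runtime, not program order, decides when it runs, which is the essence of the task-dataflow model.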
From a data locality perspective, all aforementioned dynamic runtime systems provide mechanisms to
limit data motion.
This is achieved in one of two ways.
One approach is to do it internally through
scheduling policies, in a transparent-to-the-user fashion. This method is typical for runtimes using source-
to-source compiler technology since dependencies are discovered ahead of the actual execution and therefore
decent scheduling decisions can be made to preserve data locality. An alternative approach is via scheduling
optimization flags, a method which may require user intervention to prevent moving data excessively between
worker threads. This method is representative of dynamic scheduling libraries, which discover dependencies
only at runtime.
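A minimal sketch of such a locality-preserving scheduling policy (hypothetical, not tied to any particular runtime): among candidate workers, prefer the one that already holds the largest share of the task's working set, so data moves only when no better placement exists.

```python
def choose_worker(task_data, worker_caches):
    """Pick the worker holding the most bytes of the task's data locally.

    task_data:     dict mapping data-piece name -> size in bytes
    worker_caches: dict mapping worker id -> set of locally held names
    """
    def local_bytes(cache):
        return sum(size for name, size in task_data.items() if name in cache)
    return max(worker_caches, key=lambda w: local_bytes(worker_caches[w]))

caches = {"w0": {"A", "B"}, "w1": {"C"}}
task = {"A": 4096, "B": 4096, "C": 1024}  # name -> bytes
print(choose_worker(task, caches))        # w0 holds 8192 of 9216 bytes
```

A real scheduler would weigh this locality score against load balance; the sketch shows only the locality half of that trade-off.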
6.3 Discussion
Areas of Agreement
• Static partitioning will become increasingly hard: The performance and energy characteristics
of today’s hardware are difficult to predict; the unpredictability of a hit or miss in a cache, the variable
latencies introduced by branch mis-predictions, etc. are only some of the factors that contribute to a
very dynamic hardware behavior. This trend will only accelerate with future hardware as near threshold voltage (NTV) and aggressive energy management increase the level of performance variability.
Architectures, unfortunately, do not provide any guarantees on execution time or energy consumption,
impeding the task of time and energy estimation.
In light of this, toolchains will become increasingly unable to statically partition and schedule code
to efficiently utilize future parallel resources. Hardware characteristics will vary dynamically and we
therefore cannot do without a dose of dynamism in the runtime: dynamic task based programming
models need to be a part of the solution.
• Optimal Granularity: Task-based programming models rely on the computation being split into
chunks called “tasks”. The partitioning of code into tasks implicitly partitions data into the corresponding per-task working sets. The optimal size of tasks (the granularity) is difficult to determine as
it needs to balance the overheads of the task-based runtime system with the need to expose sufficient
parallelism, while making efficient use of processor caches, which depends on the size and access pattern
of the working set. A static granularity will be sub-optimal for future machines.
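The trade-off can be illustrated with a toy cost model (all constants hypothetical): too fine a grain and per-task runtime overhead dominates; too coarse and parallelism is lost.

```python
import math

def parallel_time(n, grain, workers, overhead):
    """Toy model: n units of work in tasks of size `grain` on `workers`
    workers, each task paying a fixed runtime `overhead`."""
    tasks = math.ceil(n / grain)
    rounds = math.ceil(tasks / workers)  # tasks execute in waves
    return rounds * (grain + overhead)   # each wave costs grain + overhead

n, p, ovh = 10_000, 8, 50
times = {g: parallel_time(n, g, p, ovh) for g in (10, 100, 1000, 10_000)}
print(times)  # a sweet spot emerges between the two extremes
```

In this model a grain of 10 drowns in overhead (7500 time units), a single task of 10,000 serializes everything (10,050), and the optimum sits in between; since the constants shift with the machine and the working-set/cache relationship, a statically chosen grain cannot stay optimal.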
• Separation of Concerns vs Co-design: One of the main reasons behind the success of these task-
based dynamic runtime systems is their capacity to abstract the underlying hardware complexity.
This separation of concerns between scheduling and algorithmic development allows end users to focus
primarily on designing their numerical algorithms sequentially while adding parallelism features at a
later stage. This separation of functionality and performance enhances productivity. However, moving
forward with exascale systems, this free lunch may come to an end if one wants to exploit systems with
billions of cores efficiently. Future programming environments and runtimes will need to target stronger
software-hardware co-design. Algorithms based on recursive formulations are a good example of such
co-design, as divide-and-conquer approaches can often express parallelism while implicitly preserving
data locality. The memory hierarchy at all levels of the system can then be exploited and further
performance improvement can be expected in case of data reuse.
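A minimal sketch of such a recursive formulation: the two halves of each split are independent and could run as parallel tasks, while each leaf touches only a contiguous chunk small enough to stay cache-resident (CUTOFF is a hypothetical cache-sized threshold).

```python
CUTOFF = 1024  # hypothetical: leaf working set sized to fit in cache

def rec_sum(xs, lo, hi):
    """Divide-and-conquer reduction over xs[lo:hi]."""
    if hi - lo <= CUTOFF:
        return sum(xs[lo:hi])       # leaf: cache-resident working set
    mid = (lo + hi) // 2            # halves are independent subtasks
    return rec_sum(xs, lo, mid) + rec_sum(xs, mid, hi)

data = list(range(10_000))
print(rec_sum(data, 0, len(data)))  # -> 49995000
```

Because the recursion subdivides the data along with the work, locality is preserved at every level of the memory hierarchy without the algorithm naming any particular cache size above the leaf cutoff.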
Areas of Disagreement
• Level of Standardization: There is agreement that standardization needs to happen for task-based programming models, but there is disagreement as to the level at which this
standardization should happen. One option is to standardize the APIs at the runtime level in a way
similar to the Open Community Runtime (OCR). Another option is to standardize the interface of the
programming model, as OpenMP and OmpSs do. Currently there is no clear argument favoring a particular scheme, so both approaches are being actively researched.
• Expression of Granularity: Another area of disagreement is how best to express granularity; specifically, is it better for the programmer to break tasks down and have the runtime system re-assemble them if needed, or is it preferable to have the programmer express coarse-grained tasks and data and allow the runtime system to break them down further? The latter approach has
been used successfully in runtimes targeted to problems that are recursively divisible. The former
approach would require some sort of “recipe” for the runtime to be able to stitch smaller tasks or
chunks of data into larger ones. There is a debate as to which approach is simpler and more likely to
yield positive results.
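The latter approach can be sketched as follows (hypothetical interface): the runtime halves any coarse task whose range exceeds its current grain target, for example when idle workers need work.

```python
def split_if_needed(task, grain):
    """task is a (lo, hi) index range; return subtasks no larger than grain."""
    lo, hi = task
    if hi - lo <= grain:
        return [task]                       # fine enough: schedule as-is
    mid = (lo + hi) // 2                    # halve and recurse on both parts
    return split_if_needed((lo, mid), grain) + split_if_needed((mid, hi), grain)

print(split_if_needed((0, 8), grain=2))  # -> [(0, 2), (2, 4), (4, 6), (6, 8)]
```

The converse approach, stitching fine tasks back into coarse ones, has no such simple sketch: it is exactly the "recipe" problem described above, since the runtime must know which small tasks and data chunks are safe to fuse.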
6.4 Research Plan
There are several areas that would benefit from further research in task-based models and runtime systems:
a) a better understanding of the broader performance implications of task-based systems, b) better tools for debugging task-based models, and c) abstractions for the programmer to be able to
encapsulate and express hints regarding task granularity, data locality and other information not already
captured by the expression of dependencies.
Apart from these three main areas of research, detailed below, the community would also greatly benefit
from experiments showing the benefits and downsides of task-based programming models and runtimes.
6.4.1 Performance of task-based runtimes
Wall-clock execution time has long been the gold standard of performance metrics. This metric, however, is no longer sufficient for understanding the performance of task-based runtime systems, especially in how they relate to data locality.
Task-based runtimes require that programmers relinquish control of aspects they have traditionally controlled: scheduling, data placement, etc. For these runtimes to be accepted, programmers need to be
convinced that the runtime systems are not making “stupid” decisions; in other words, that by delegating
some of the traditional aspects of parallel programming, the programmer is not sacrificing inordinate amounts
of performance. To that end, metrics that capture changes to data locality, load balancing, etc. need to be
developed. Separately, metrics showing the specific benefits of task based systems also need to be developed:
programmer productivity, better resiliency, etc.
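One concrete example of such a metric (a common formulation, given here only as a sketch): load imbalance as the maximum over the mean of per-worker busy times, where 1.0 indicates perfect balance and larger values flag a scheduling problem.

```python
def load_imbalance(busy_times):
    """Ratio of the most-loaded worker's busy time to the mean; 1.0 is ideal."""
    mean = sum(busy_times) / len(busy_times)
    return max(busy_times) / mean

print(load_imbalance([9.0, 10.0, 11.0, 10.0]))  # -> 1.1
```

Analogous counters, e.g. bytes moved between workers per task executed, would expose the locality side of the runtime's decisions in the same way.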
6.4.2 Debugging tools
Task-based programming models are notoriously difficult to reason about and debug, given that the parameters specified by the programmer to constrain execution (dependencies) purposefully allow for a wide range
of execution options. Certain task-based runtime systems, which allow the dynamic construction of the
task-graph (such as OCR), only exacerbate these problems. Tools allowing the programmer to understand
the execution of a task-based program need to be developed.
These tools will need to cover two broad areas: