StarPU
StarPU is a runtime system that schedules tasks onto accelerator-based platforms. It is meant to
be used as a back-end for parallel language compilation environments and high-performance libraries. The
two basic principles of StarPU are, first, that tasks can have several implementations, one for some or all of the heterogeneous processing units available in the machine, and, second, that transfers of data to these processing units are handled transparently by StarPU. Thanks to its auto-tuning facilities, StarPU transparently predicts execution times and data transfer overheads. This allows StarPU’s dynamic scheduler to avoid load imbalance while enforcing data locality, thereby reducing pressure on the memory subsystem, which is a critical resource on accelerator-based platforms.
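These two principles can be illustrated with a toy model (hypothetical names and constants; this is not the actual StarPU C API): a task carries one implementation per kind of processing unit, and the scheduler picks the unit that minimizes predicted compute time plus data-transfer cost, so a small task stays where its data already resides.

```python
class Task:
    def __init__(self, impls, data_size):
        self.impls = impls          # e.g. {"cpu": fn, "gpu": fn}
        self.data_size = data_size  # bytes the task reads/writes

def pick_unit(task, perf_model, latency, bandwidth, data_location):
    """Choose the unit minimizing predicted compute + transfer time."""
    best_unit, best_cost = None, float("inf")
    for unit in task.impls:
        compute = perf_model[unit](task.data_size)
        if data_location == unit:
            transfer = 0.0          # data already resides on this unit
        else:
            transfer = latency[unit] + task.data_size / bandwidth[unit]
        if compute + transfer < best_cost:
            best_unit, best_cost = unit, compute + transfer
    return best_unit

# Hypothetical calibration: the GPU computes faster, but moving data
# to it costs a fixed latency plus a bandwidth-proportional term.
perf = {"cpu": lambda n: n * 1e-8, "gpu": lambda n: n * 1e-9}
lat = {"cpu": 0.0, "gpu": 1e-5}
bw = {"cpu": 1e12, "gpu": 1e10}

big = Task({"cpu": None, "gpu": None}, data_size=1_000_000)
small = Task({"cpu": None, "gpu": None}, data_size=100)
print(pick_unit(big, perf, lat, bw, "cpu"))    # transfer pays off -> gpu
print(pick_unit(small, perf, lat, bw, "cpu"))  # stay with the data -> cpu
```

The point of the sketch is only the shape of the decision: with calibrated performance models, a large task justifies the transfer to the faster unit, while a small one is cheaper to run where its data already is.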
The Open Community Runtime (OCR)
OCR is a task-dataflow programming model and research runtime aimed at determining what works, and what does not, for large-scale task-dataflow programming models and runtimes. It provides a simple low-level API with which programmers, or higher-level programming languages and abstractions, can inform the runtime system of tasks, of the data those tasks use (called data-blocks), and of the dependencies between them.
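The flavor of such an API can be sketched as follows (illustrative Python with invented names; OCR's real interface is a C API): tasks, data-blocks, and the dependencies between them are registered with the runtime, and a task becomes runnable only once every data-block it depends on has been produced.

```python
class MiniRuntime:
    def __init__(self):
        self.tasks = []   # list of (fn, input block names, output block name)
        self.blocks = {}  # data-block name -> value

    def create_block(self, name, value):
        self.blocks[name] = value

    def create_task(self, fn, inputs, output):
        self.tasks.append((fn, inputs, output))

    def run(self):
        pending = list(self.tasks)
        while pending:
            for t in pending:
                fn, inputs, output = t
                if all(i in self.blocks for i in inputs):  # deps satisfied
                    self.blocks[output] = fn(*(self.blocks[i] for i in inputs))
                    pending.remove(t)
                    break
            else:
                raise RuntimeError("cyclic or unsatisfiable dependencies")
        return self.blocks

rt = MiniRuntime()
rt.create_task(lambda a, b: a + b, ["x", "y"], "sum")  # depends on x, y
rt.create_task(lambda s: s * 2, ["sum"], "result")     # depends on sum
rt.create_block("x", 3)
rt.create_block("y", 4)
print(rt.run()["result"])  # tasks fire in dependency order -> 14
```

Note that the second task was registered before its input existed; the runtime, not program order, decides when it runs, which is the essence of the task-dataflow model.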
From a data locality perspective, all aforementioned dynamic runtime systems provide mechanisms to
limit data motion.
This is achieved in one of two ways.
One approach is to do it internally through
scheduling policies, in a transparent-to-the-user fashion. This method is typical for runtimes using source-
to-source compiler technology since dependencies are discovered ahead of the actual execution and therefore
decent scheduling decisions can be made to preserve data locality. An alternative approach is via scheduling
optimization flags, a method which may require user intervention to prevent moving data excessively between
worker threads. This method is representative of dynamic scheduling libraries, which discover dependencies
only at runtime.
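A minimal sketch of such a locality-preserving scheduling policy (hypothetical, not tied to any particular runtime): among candidate workers, prefer the one that already holds the largest share of the task's working set, so data moves only when no better placement exists.

```python
def choose_worker(task_data, worker_caches):
    """Pick the worker holding the most bytes of the task's data locally.

    task_data:     dict mapping data-piece name -> size in bytes
    worker_caches: dict mapping worker id -> set of locally held names
    """
    def local_bytes(cache):
        return sum(size for name, size in task_data.items() if name in cache)
    return max(worker_caches, key=lambda w: local_bytes(worker_caches[w]))

caches = {"w0": {"A", "B"}, "w1": {"C"}}
task = {"A": 4096, "B": 4096, "C": 1024}  # name -> bytes
print(choose_worker(task, caches))        # w0 holds 8192 of 9216 bytes
```

A real scheduler would weigh this locality score against load balance; the sketch shows only the locality half of that trade-off.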
6.3 Discussion
Areas of Agreement
• Static partitioning will become increasingly hard: The performance and energy characteristics
of today’s hardware are difficult to predict; the unpredictability of a hit or miss in a cache, the variable
latencies introduced by branch mis-predictions, etc. are only some of the factors that contribute to a
very dynamic hardware behavior. This trend will only accelerate with future hardware as near threshold voltage (NTV) and aggressive energy management increase the level of performance variability.
Architectures, unfortunately, do not provide any guarantees on execution time or energy consumption,
impeding the task of time and energy estimation.
In light of this, toolchains will become increasingly unable to statically partition and schedule code
to efficiently utilize future parallel resources. Hardware characteristics will vary dynamically and we
therefore cannot do without a dose of dynamism in the runtime: dynamic task based programming
models need to be a part of the solution.
• Optimal Granularity: Task-based programming models rely on the computation being split into
chunks called “tasks”. The partitioning of code into tasks implicitly partitions data into the corresponding per-task working sets. The optimal size of tasks (the granularity) is difficult to determine as
it needs to balance the overheads of the task-based runtime system with the need to expose sufficient
parallelism, while making efficient use of processor caches, which depends on the size and access pattern
of the working set. A static granularity will be sub-optimal for future machines.
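The trade-off can be illustrated with a toy cost model (all constants hypothetical): too fine a grain and per-task runtime overhead dominates; too coarse and parallelism is lost.

```python
import math

def parallel_time(n, grain, workers, overhead):
    """Toy model: n units of work in tasks of size `grain` on `workers`
    workers, each task paying a fixed runtime `overhead`."""
    tasks = math.ceil(n / grain)
    rounds = math.ceil(tasks / workers)  # tasks execute in waves
    return rounds * (grain + overhead)   # each wave costs grain + overhead

n, p, ovh = 10_000, 8, 50
times = {g: parallel_time(n, g, p, ovh) for g in (10, 100, 1000, 10_000)}
print(times)  # a sweet spot emerges between the two extremes
```

In this model a grain of 10 drowns in overhead (7500 time units), a single task of 10,000 serializes everything (10,050), and the optimum sits in between; since the constants shift with the machine and the working-set/cache relationship, a statically chosen grain cannot stay optimal.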
• Separation of Concerns vs Co-design: One of the main reasons behind the success of these task-
based dynamic runtime systems is their capacity to abstract the underlying hardware complexity.
This separation of concerns between scheduling and algorithmic development allows end users to focus
primarily on designing their numerical algorithms sequentially while adding parallelism features at a
later stage. This separation of functionality and performance enhances productivity. However, moving
forward with exascale systems, this free lunch may come to an end if one wants to exploit systems with
billions of cores efficiently. Future programming environments and runtimes will need to target stronger
software-hardware co-design. Algorithms based on recursive formulations are a good example of such
co-design, as divide-and-conquer approaches can often express parallelism while implicitly preserving
data locality. The memory hierarchy at all levels of the system can then be exploited and further
performance improvement can be expected in case of data reuse.
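A minimal sketch of such a recursive formulation: the two halves of each split are independent and could run as parallel tasks, while each leaf touches only a contiguous chunk small enough to stay cache-resident (CUTOFF is a hypothetical cache-sized threshold).

```python
CUTOFF = 1024  # hypothetical: leaf working set sized to fit in cache

def rec_sum(xs, lo, hi):
    """Divide-and-conquer reduction over xs[lo:hi]."""
    if hi - lo <= CUTOFF:
        return sum(xs[lo:hi])       # leaf: cache-resident working set
    mid = (lo + hi) // 2            # halves are independent subtasks
    return rec_sum(xs, lo, mid) + rec_sum(xs, mid, hi)

data = list(range(10_000))
print(rec_sum(data, 0, len(data)))  # -> 49995000
```

Because the recursion subdivides the data along with the work, locality is preserved at every level of the memory hierarchy without the algorithm naming any particular cache size above the leaf cutoff.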
Areas of Disagreement
• Level of Standardization: There is agreement that standardization needs to happen for task-based programming models, but there is disagreement as to the level at which this
standardization should happen. One option is to standardize the APIs at the runtime level in a way
similar to the Open Community Runtime (OCR). Another option is to standardize the interface of the
programming model, as OpenMP and OmpSs do. Currently there is no clear argument favoring a particular scheme, so both approaches are being actively researched.
• Expression of Granularity: Another area of disagreement is how best to express granularity; specifically, is it better for the programmer to break tasks down and have the runtime system re-assemble them if needed, or is it preferable to have the programmer express coarse-grained tasks and data and allow the runtime system to break them down further? The latter approach has
been used successfully in runtimes targeted to problems that are recursively divisible. The former
approach would require some sort of “recipe” for the runtime to be able to stitch smaller tasks or
chunks of data into larger ones. There is a debate as to which approach is simpler and more likely to
yield positive results.
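The latter approach can be sketched as follows (hypothetical interface): the runtime halves any coarse task whose range exceeds its current grain target, for example when idle workers need work.

```python
def split_if_needed(task, grain):
    """task is a (lo, hi) index range; return subtasks no larger than grain."""
    lo, hi = task
    if hi - lo <= grain:
        return [task]                       # fine enough: schedule as-is
    mid = (lo + hi) // 2                    # halve and recurse on both parts
    return split_if_needed((lo, mid), grain) + split_if_needed((mid, hi), grain)

print(split_if_needed((0, 8), grain=2))  # -> [(0, 2), (2, 4), (4, 6), (6, 8)]
```

The converse approach, stitching fine tasks back into coarse ones, has no such simple sketch: it is exactly the "recipe" problem described above, since the runtime must know which small tasks and data chunks are safe to fuse.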
6.4 Research Plan
There are several areas that would benefit from further research in task-based models and runtime systems:
a) a better understanding of the broader performance implications of task-based systems, b) better tools for debugging task-based models, and c) abstractions for the programmer to be able to
encapsulate and express hints regarding task granularity, data locality and other information not already
captured by the expression of dependencies.
Apart from these three main areas of research, detailed below, the community would also greatly benefit
from experiments showing the benefits and downsides of task-based programming models and runtimes.
6.4.1 Performance of task-based runtimes
Wall-clock execution time has long been the gold standard of performance metrics. This metric, however, is no longer sufficient for understanding the performance of task-based runtime systems, especially in how they relate to data locality.
Task-based runtimes require that programmers relinquish control of aspects they have traditionally controlled: scheduling, data placement, etc. For these runtimes to be accepted, programmers need to be
convinced that the runtime systems are not making “stupid” decisions; in other words, that by delegating
some of the traditional aspects of parallel programming, the programmer is not sacrificing inordinate amounts
of performance. To that end, metrics that capture changes to data locality, load balancing, etc. need to be
developed. Separately, metrics showing the specific benefits of task based systems also need to be developed:
programmer productivity, better resiliency, etc.
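One concrete example of such a metric (a common formulation, given here only as a sketch): load imbalance as the maximum over the mean of per-worker busy times, where 1.0 indicates perfect balance and larger values flag a scheduling problem.

```python
def load_imbalance(busy_times):
    """Ratio of the most-loaded worker's busy time to the mean; 1.0 is ideal."""
    mean = sum(busy_times) / len(busy_times)
    return max(busy_times) / mean

print(load_imbalance([9.0, 10.0, 11.0, 10.0]))  # -> 1.1
```

Analogous counters, e.g. bytes moved between workers per task executed, would expose the locality side of the runtime's decisions in the same way.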
6.4.2 Debugging tools
Task-based programming models are notoriously difficult to reason about and debug, given that the parameters specified by the programmer to constrain execution (dependencies) purposefully allow for a wide range
of execution options. Certain task-based runtime systems, which allow the dynamic construction of the
task-graph (such as OCR), only exacerbate these problems. Tools allowing the programmer to understand
the execution of a task-based program need to be developed.
These tools will need to cover two broad areas: