6.1. KEY POINTS
A big challenge is how to communicate the hierarchical data properties of an application to the runtime
so that they can be exploited to generate efficient schedules. Classical random work stealers (e.g. Cilk-
like [8]) do not exploit this. Socket-aware policies exist (e.g. Qthread [70]) that perform hierarchical work
stealing: a) first among cores in a socket and b) then among sockets. Some programming models expose
an API that allows programmers to specify on which NUMA node/socket a collection of tasks should be
executed (e.g. OmpSs [7]). Configurable work stealers which can be customized with scheduling hints have
also been developed [91]. Finally, a more extreme option is to allow the application programmer to attach
a custom work stealing function to the application [67]. How to effectively specify this information in a
programmer-friendly way is an open topic of research.
b) Task Granularity
Large (i.e., coarse-grained) tasks can result in load imbalance at synchronization points, and reducing task sizes is a common method to improve load balance. On the other hand, the granularity of tasks directly influences scheduling overheads: overly fine-grained tasks increase the amount of time spent inside the runtime performing jobs such as scheduling and work stealing. Task granularity also impacts locality, since larger tasks have bigger memory footprints. Task sizes are therefore not only a trade-off between parallelism and runtime overheads, but should also be set to exploit the memory hierarchy efficiently. Shared last-level caches provide larger storage that can be exploited particularly well when groups of tasks share some of their inputs; this is called constructive cache sharing [24].
Finding the best granularity is a complex problem since all three metrics to be optimized (overheads,
load balance, data locality) are connected. Since the optimum configuration may depend on the input,
auto-tuning techniques are a promising approach. Enabling auto-tuning involves programmer effort to find
ways to partition data sets in a parametric way, allowing the runtime to tune the task size.
6.1.2 Expressing Data Locality with Task-Based Systems
Nested parallelism:
Nested or recursive algorithmic formulation is a well-known technique to increase data reuse in the upper levels of the memory hierarchy and thereby reduce memory latency overheads. It often requires slight changes to the original algorithm. At the same time, to realize the performance benefits of such a formulation, it is critical that the resulting nested algorithm is mapped correctly onto the underlying architecture: task-parallel runtimes should exploit recursive formulations by scheduling nested parallel tasks onto processing units that share some level of cache. This should not prevent runtimes from performing work stealing when load imbalance is detected, as long as it does not hinder overall performance.
Distance-aware scheduling:
Reducing data movement needs to happen at all levels of the system architecture to be effective: from a single CPU socket within a multi-socket shared-memory node up to multiple distributed-memory nodes linked through a high-performance network interconnect. This bottom-up approach makes improving data locality within the socket the first priority. Task-parallel runtimes therefore need to provide an abstraction that lets algorithm developers schedule tasks that share data dependencies on the same socket. As with the nested parallelism technique, when load imbalance arises, work should first be stolen from the closest CPU sockets. This is especially important on NUMA architectures.
Pipelining across Multiple Computational Stages:
Many numerical algorithms are built on top of optimized basic blocks. For instance, dense eigensolvers require three computational stages: reduction of the matrix to condensed form, an iterative solver to extract the eigenvalues, and a back transformation to obtain the associated eigenvectors. Each stage aggregates several computational kernels, which may already be optimized independently for data locality. However, pipelining across subsequent basic blocks (e.g., through look-ahead optimizations) may undermine these locality benefits, since task-parallel runtimes may not have the global directed acyclic graph before execution. There is, therefore, an urgent need to express and maintain data locality across the overall directed acyclic graph without delaying the execution of tasks on the critical path. While this is still an open research problem in the
context of distributed-memory systems, this issue is handled on shared-memory machines through specific data locality flags provided by the user.
6.2 State of the art
There are many software projects on task-parallel runtimes. We list a number of them here but it is by no
means a complete list.
Nanos++
The Nanos++ dynamic runtime system interfaces the task-parallel application with the under-
lying hardware architecture. A core component of the runtime is in charge of handling data movement and
ensuring coherency, so that the programmer does not need to deal with memory allocation and movements.
Another core component is the set of scheduling policies, which includes a newly introduced policy for mitigating work stealing overhead: the distance-aware work stealing policy. The OmpSs programming
model, a high-level task-based parallel programming model, is powered by the Nanos++ runtime. It relies on
compiler technology and supports task parallelism using synchronization based on data-dependencies. Data
parallelism is also supported by means of services mapped on top of its task support. OmpSs also supports
heterogeneous programming, as reflected in the modular support for different architectures in Nanos++
(SMP, CUDA, OpenCL, simulators such as TaskSim, etc.). Nanos++ also provides instrumentation tools
to help identify performance issues.
QUARK
QUARK (QUeuing And Runtime for Kernels) provides a library that enables the dynamic execution
of tasks with data dependencies in a multi-core, multi-socket, shared-memory environment. QUARK infers
data dependencies and precedence constraints between tasks from the way that the data is used, and then
executes the tasks in an asynchronous and dynamic fashion in order to achieve a high utilization of the
available resources. The dependencies between the tasks form an implicit DAG; however, this DAG is never explicitly realized in the scheduler. Instead, the structure is maintained by queuing tasks on data items, where they wait for the appropriate access to the data. Tasks are inserted into the scheduler, which stores
them and executes them when all the dependencies are satisfied. In other words, a task is ready to be
executed when all parent tasks have completed. The execution of ready tasks is handled by worker threads
that simply wait for tasks to become ready and execute them using a combination of default task assignments
and work stealing.
SuperMatrix
Similarly to QUARK, SuperMatrix is a runtime system that mainly parallelizes matrix op-
erations for SMP and/or multi-core architectures. This runtime system demonstrates how code described at
a high level of abstraction can achieve high performance on such architectures, while completely hiding the
parallelism from the library programmer. The key insight is to view matrices hierarchically, as consisting of blocks that serve as units of data, with operations over those blocks treated as units of computation.
The implementation transparently enqueues the required operations, internally tracking dependencies, and
then executes the operations utilizing out-of-order execution techniques inspired by superscalar microarchi-
tectures. This separation of concerns allows library developers to implement algorithms without concerning
themselves with the parallelization aspect of the problem. Different heuristics for scheduling operations can
be implemented in the runtime system independently of the code that enqueues the operations.
PaRSEC
PaRSEC is a generic framework for architecture aware scheduling and management of micro-
tasks on distributed many-core heterogeneous architectures. Applications are expressed as a Directed Acyclic Graph (DAG) of tasks with labeled edges designating data dependencies. DAGs are represented in a compact, problem-size-independent format that can be queried on demand to discover data dependencies in a
totally distributed fashion. PaRSEC assigns computation threads to the cores, overlaps communications and
computations and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA
nodes and algorithmic features such as data reuse. The framework includes libraries, a runtime system, and
development tools to help application developers tackle the difficult task of porting their applications to
distributed memory, highly heterogeneous and diverse environments.