1.3. SUMMARY OF FINDINGS AND RECOMMENDATIONS
to express tasks for more flexible task migration are also well suited to locality-aware
or data-centric models, but work remains on effective locality-aware heuristics for runtime scheduling
algorithms that properly trade off load-balancing efficiency against the cost of data movement.
Recommendations: Information on data locality, together with lightweight models of the cost of migrating
data, will play a central role in creating locality-aware runtime systems that can schedule tasks in a manner
that minimizes data movement. Such locality-aware advanced runtime systems are still an active area of
research. The semantics for expressing inter-task locality are emerging in runtime systems such as the Open
Community Runtime, CHARM++, Swarm, and HPX, but the optimal set of heuristics or mechanisms to
effectively exploit that information requires further research.
1.3.5 System-Scale Data Locality Management
Findings: The cost of allocating, moving, or accessing data is increasing relative to the cost of processing
it. With the deepening of the memory hierarchy and the greater complexity of interconnection networks,
coherence management, traffic, and contention in the interconnection network will have a major impact on
an application's runtime and energy consumption. Moreover, in a large-scale system multiple applications
run simultaneously and therefore compete for resources. Locality management must therefore take into
account both local constraints (the way the application behaves) and system-scale constraints (the way it
accesses the resources). A global integration of these two types of constraints is key to enabling scalable
application execution in the future.
Recommendations: Our recommendation is to address several complementary directions: models, abstractions,
and algorithms for managing data locality at system scale. New models are required to describe
the topology on which the application is running, both at the node level and at the network level. New
abstractions will provide means for applications to express how they access system-level services such as
storage or the batch scheduler. These abstractions must expose the topology in a generic manner
without deeply impacting the programming model, while also providing scalable mapping algorithms that
account for the deep hierarchy and complex topology. It is critical that this research be done cooperatively
with other aspects of data management in order to avoid optimization conflicts and to offer a unified
view of the system and its locality management.
Programming Abstractions for Data Locality
5
Chapter 2
Background
The cost of data movement has become the dominant factor in a high performance computing system, both
in terms of energy consumption and performance. To minimize data movement, applications have to be
optimized both for vertical data movement within the memory hierarchy and for horizontal data movement
between processing units. Until recently these hardware challenges were modest enough that the community
could largely rely on compiler technology and software engineering practices, such as manual loop blocking
or two-level MPI+X parallelism, to mitigate their coarse-grained effects. These manual techniques were
sufficient to enable codes to perform well across different architectures. However, with the
exponential rise in explicit parallelism and the increasing energy cost of data movement relative to computation,
application developers need a set of programming abstractions to describe data locality on the new computing
ecosystems.
2.1 Hardware Trends
This section will briefly cover the primary hardware architecture trends that have motivated the move from
a compute-centric programming model towards a data-centric model.
2.1.1 The End of Classical Performance Scaling
The year 2004 marked the approximate end of Dennard Scaling because chip manufacturers could no longer
reduce voltages at the historical rates. Other gains in energy efficiency were still possible; for example,
smaller transistors with lower capacitance consume less energy, but those gains would be dwarfed by leakage
currents. The inability to reduce the voltages further did mean, however, that clock rates could no longer be
increased within the same power budget. With the end of voltage scaling, single processing core performance
no longer improved with each generation, but performance could be improved, theoretically, by packing more
cores into each processor. This multicore approach continues to drive up the theoretical peak performance
of the processing chips, and we are on track to have chips with thousands of cores by 2020. This increase
in parallelism via raw core count is clearly visible in the black trend line in Peter Kogge’s classic diagram
(Figure 2.1) from the 2008 DARPA report [58]. This is an important development in that programmers
outside the small cadre of those with experience in parallel computing must now contend with the challenge
of making their codes run effectively in parallel. Parallelism has become everyone’s problem and this will
require deep rethinking of the commercial software and algorithm infrastructure.
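The chain of reasoning above (fixed voltage implies fixed clock rate within a power budget) follows from the standard dynamic-power approximation; the equation below is the textbook relation, added here for clarity rather than taken from the report itself:

P_dyn ≈ α C V² f,

where α is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Under Dennard scaling, V and C shrank with each generation, so f could rise at constant power density. With V no longer decreasing, any increase in f raises power roughly linearly, so the fixed power budget is instead spent on additional cores running at a constant clock rate.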
2.1.2 Data Movement Dominates Costs
Since the loss of Dennard Scaling, a new technology scaling regime has emerged. Due to the laws of electrical
resistance and capacitance, the intrinsic energy efficiency of a fixed-length wire does not improve
appreciably as it shrinks with Moore's law improvements in lithography, as shown in Figure 2.2. In
contrast, the power consumption of transistors continues to decrease as their gate size (and hence capacitance)