1.3. SUMMARY OF FINDINGS AND RECOMMENDATIONS
to express tasks for more flexible task migration are also well suited to locality-aware
or data-centric models, but work remains on effective locality-aware heuristics for runtime scheduling
algorithms that properly trade off load-balancing efficiency against the cost of data movement.
Recommendations: Information on data locality, together with lightweight models of the cost of migrating
data, will play a central role in creating locality-aware runtime systems that can schedule tasks in a manner
that minimizes data movement. Such locality-aware advanced runtime systems are still an active area of
research. The semantics for expressing inter-task locality are emerging in runtime systems such as the Open
Community Runtime, CHARM++, Swarm, and HPX, but the optimal set of heuristics or mechanisms to
effectively exploit that information requires further research.
1.3.5 System-Scale Data Locality Management
Findings: The cost of allocating, moving, or accessing data is increasing relative to the cost of processing
it. With the deepening of the memory hierarchy and the greater complexity of interconnection networks,
coherence management, traffic, and contention in the interconnection network will have a major impact on
an application's runtime and energy consumption. Moreover, in a large-scale system multiple applications
run simultaneously and therefore compete for resources. Locality management must therefore take into
account both local constraints (the way the application behaves) and system-scale constraints (the way it
accesses the resources). A global integration of these two types of constraints is key to enabling scalable
application execution in the future.
Recommendations: Our recommendation is to address several complementary directions: models, abstractions,
and algorithms for managing data locality at system scale. New models are required to describe
the topology on which the application is running, both at the node level and at the network level. New
abstractions will provide means for applications to express how they access system-level services such as
storage or the batch scheduler. These abstractions must expose the topology in a generic manner
without deeply impacting the programming model, while also providing scalable mapping algorithms that
account for the deep hierarchy and complex topology. It is critical that this research be done cooperatively
with other aspects of data management in order to avoid optimization conflicts and to offer a unified
view of the system and its locality management.
Programming Abstractions for Data Locality
5
Chapter 2
Background
The cost of data movement has become the dominant factor in a high performance computing system, both
in terms of energy consumption and performance. To minimize data movement, applications have to be
optimized both for vertical data movement within the memory hierarchy and for horizontal data movement
between processing units. Until recently these hardware challenges were modest enough that the community
could largely rely on compiler technology and software engineering practices, such as manual loop blocking
or two-level MPI+X parallelism, to mitigate their coarse-grained effects. These manual techniques were
sufficient to enable codes to perform well across different architectures. However, with the
exponential rise in explicit parallelism and the increasing energy cost of data movement relative to computation,
application developers need a set of programming abstractions to describe data locality on the new computing
ecosystems.
2.1 Hardware Trends
This section will briefly cover the primary hardware architecture trends that have motivated the move from
a compute-centric programming model towards a data-centric model.
2.1.1 The End of Classical Performance Scaling
The year 2004 marked the approximate end of Dennard Scaling because chip manufacturers could no longer
reduce voltages at the historical rates. Other gains in energy efficiency were still possible; for example,
smaller transistors with lower capacitance consume less energy, but those gains would be dwarfed by leakage
currents. The inability to reduce the voltages further did mean, however, that clock rates could no longer be
increased within the same power budget. With the end of voltage scaling, single processing core performance
no longer improved with each generation, but performance could be improved, theoretically, by packing more
cores into each processor. This multicore approach continues to drive up the theoretical peak performance
of the processing chips, and we are on track to have chips with thousands of cores by 2020. This increase
in parallelism via raw core count is clearly visible in the black trend line in Peter Kogge’s classic diagram
(Figure 2.1) from the 2008 DARPA report [58]. This is an important development in that programmers
outside the small cadre of those with experience in parallel computing must now contend with the challenge
of making their codes run effectively in parallel. Parallelism has become everyone’s problem and this will
require deep rethinking of the commercial software and algorithm infrastructure.
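The chain of reasoning above (fixed voltage implies fixed clock rate within a power budget) follows from the standard dynamic-power approximation; the equation below is the textbook relation, added here for clarity rather than taken from the report itself:

P_dyn ≈ α C V² f,

where α is the switching activity factor, C the switched capacitance, V the supply voltage, and f the clock frequency. Under Dennard scaling, V and C shrank with each generation, so f could rise at constant power density. With V no longer decreasing, any increase in f raises power roughly linearly, so the fixed power budget is instead spent on additional cores running at a constant clock rate.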
2.1.2 Data Movement Dominates Costs
Since the loss of Dennard Scaling, a new technology scaling regime has emerged. Due to the laws of electrical
resistance and capacitance, the intrinsic energy efficiency of a fixed-length wire does not improve
appreciably as it shrinks with Moore's law improvements in lithography, as shown in Figure 2.2. In
contrast, the power consumption of transistors continues to decrease as their gate size (and hence capacitance)