2.1. HARDWARE TRENDS
[Figure 2.1 chart omitted: "How much parallelism must be handled by the program?" From Peter Kogge (on behalf of the Exascale Working Group), "Architectural Challenges at the Exascale Frontier", June 20, 2008.]
Figure 2.1: Explicit parallelism of HPC systems will be increasing at an exponential rate for the foreseeable future.
[Figure 2.2 chart omitted: picojoules per 64-bit operation, on a scale of 1 to 10000 pJ, for a DP FLOP, a register access, 1 mm/5 mm/15 mm on-chip wires, off-chip/DRAM access, local interconnect, and cross-system communication, comparing 2008 (45 nm) and 2018 (11 nm) technology.]
Figure 2.2: The cost of moving the data operands exceeds the cost of performing an operation upon them.
decreases. Since the energy efficiency of transistors improves as their sizes shrink while the energy efficiency of wires does not, the point is rapidly approaching where the energy needed to move data exceeds the energy used to perform the operation on those data.
Data locality has long been a concern for application development on supercomputers. Since the advent
of caches, vertical data locality has been extraordinarily important for performance, but recent architecture
trends have exacerbated these challenges to the point that they can no longer be accommodated using existing
methods such as loop blocking or compiler techniques. This report will identify numerous opportunities to
greatly improve automation in these areas.
2.1.3 Increasingly Hierarchical Machine Model
Future performance and energy efficiency improvements will require fundamental changes to hardware ar-
chitectures. The most significant consequence of this assertion is the impact on the scientific applications
that run on current high performance computing (HPC) systems, many of which codify years of scientific
domain knowledge and refinements for contemporary computer systems. In order to adapt to exascale archi-
tectures, developers must be able to reason about new hardware and determine what programming models
[Figure 2.3 diagram omitted: an abstract exascale node with fat cores and thin cores/accelerators in a coherence domain, 3D stacked memory (low capacity, high bandwidth), DRAM and NVRAM (high capacity, low bandwidth), and an integrated NIC for off-chip communication.]
Figure 2.3: The increasingly hierarchical nature of emerging exascale machines will make conventional manual methods for expressing hierarchy and locality very challenging to carry forward into the next decade.
and algorithms will provide the best blend of performance and energy efficiency into the future. While many
details of the exascale architectures are undefined, an abstract machine model (AMM) enables application
developers to focus on the aspects of the machine that are important or relevant to performance and code
structure. These models are intended as communication aids between application developers and hardware
architects during the co-design process. Figure 2.3 depicts an AMM that captures the anticipated evolution
of existing node architectures based on current industry roadmaps [3].
As represented in the figure, future systems will expose more levels of hierarchy than the two levels assumed by our existing MPI+X programming models. Not only are there more levels of hierarchy, but the topology of communication is likely to become an important target for optimization. Programmers already face NUMA (non-uniform memory access) performance challenges within the node, and future systems will exhibit increasing NUMA effects between cores within an individual chip die. Overall, our current programming models and methodologies are ill-equipped to accommodate these changes to the underlying abstract machine model, and this breaks our current programming systems.
2.2 Limitations of Current Approaches
Architectural trends break our existing programming paradigm because the current software tools are focused
on partitioning computational work equally. In doing so, they implicitly assume that all the processing elements are equidistant from one another and from their local memories within a node. For example, commonly
used modern threading models allow a programmer to describe how to parallelize loop iterations by dividing
the iteration space (the computation) evenly among processors and allowing the memory hierarchy and cache
coherence to move data to the location of the compute. Such a compute-centric approach no longer reflects
the realities of the underlying machine architecture where data locality and the underlying topology of the
data-path between computing elements are crucial for performance and energy efficiency. There is a critical need for a new data-centric programming paradigm that treats data layout and topology as primary optimization criteria and migrates compute to operate on data where it resides.
The applications community will need to refactor their applications to align with this emerging data-centric paradigm, but modern programming environments offer few abstractions for managing data locality. Absent such facilities, application programmers and algorithm developers must manage data locality by hand, using techniques such as strip-mining and loop-blocking. To optimize data movement, applications must be tuned both for vertical data movement in the memory hierarchy and for horizontal data movement between processing units, as shown in Figure 2.4.
Such transformations are labor-intensive and easily break with minor evolutions of successive generations of
compute platforms. Hardware features such as cache-coherence further inhibit our ability to manage data