A systematic Characterization of Application Sensitivity to Network Performance

Yüklə 0,74 Mb.

Pdf görüntüsü

səhifə	21/51
tarix	15.10.2018
ölçüsü	0,74 Mb.
	#74178

1 ... 17 18 19 20 21 22 23 24 ... 51

Sample Sort
Parallel Mur Ä
Connected Components

50
cessors the application spends 98% of its time in the communication phases.
The communication density plot of Figure 3.1a is useful in understanding the communication
behavior of this application. The darkness of cell
¹»º½¼
indicates the fraction of messages sent
from processor
¹
to processor
¼
. The dark line off the diagonal reﬂects the global histogram
phase, where the ranks are accumulated across processors in a kind of pipelined cyclic shift.
The grey background is the global distribution phase. Overall, the communication is frequent,
write-based and balanced.
¾
EM3D: EM3D [28] is the kernel of an application that models propagation of electromagnetic
waves through objects in three dimensions. It ﬁrst spreads an irregular bipartite graph over
all processors. During each time-step, changes in the electric ﬁeld are calculated as a linear
function of the neighboring magnetic ﬁeld values and vice versa. We use two complementary
versions of EM3D, one write-based and the other read-based. Both versions contain relatively
short computation steps. The write-based EM3D uses pipelined writes to propagate updates by
augmenting the graph with special boundary nodes. EM3D(write) represents a large class of
bulk synchronous applications, alternating between local computation and global communi-
cation phases. The read version uses simple blocking reads to pull update information locally
and does not need to create special boundary nodes. The locality of connectivity in the graph
for both versions is indicated by the dark swath in Figures 3.1b and 3.1c.
¾
Sample Sort: is a probabilistic algorithm which sorts a large collection of 32-bit keys by ﬁrst
choosing
¿ÁÀÂ
“good” splitter values and broadcasting them to all processors. Every processor
distributes its keys to the proper destination processor, based on the splitter values, and ﬁnally,
a local radix sort is performed on the received keys. An interesting aspect of this application
is the potential for unbalanced all-to-all communication as each processor potentially receives
a different number of keys. This is reﬂected in the vertical bars in Figure 3.1d. For our input
size, the local sort time is dominated by the distribution of keys to their proper destinations.
For our input of 16 million keys Sample sort spends 85% of the time in the two communication
phases.
¾
Barnes: Our implementation of this hierarchical N-Body force calculation is similar to the
version in the SPLASH benchmark suite [110]. However, the main data structure, a spatial
oct-tree, is replicated in software rather than hardware. Each timestep consists of two phases,
a tree construction phase and an interaction phase among the simulated bodies. Updates of the

51
oct-tree are synchronized through blocking locks. During the interaction phase, the processors
cache oct-tree nodes owned by remote processors in a software managed cache. Communica-
tion is generally balanced, as the solid grey square shows in Figure 3.1e.
Ã
P-Ray: This scene passing ray tracing program distributes a read-only spatial oct-tree over all
processors. The processors evenly divide ownership of objects in the scene. When a processor
needs access to an object stored on a remote processor, the object is cached locally in a ﬁxed
sized software-managed cache. Communication thus consists entirely of blocking read opera-
tions; the frequency of such operations is a function on the scene complexity and the software
caching algorithm. The dark spots in Figure 3.1f indicate the presence of “hot” objects which
are visible from multiple points in the scene.
Ã
Parallel Mur
Ä
: In this parallel version of a popular protocol veriﬁcation tool [34, 97], the ex-
ponential space of all reachable protocol states is explored to catch protocol bugs. Each pro-
cessor maintains a work queue of unexplored states. A hash function maps states to “owning”
processors. When a new state is discovered, it is sent to the proper processor. On reception of
a state description, a processor ﬁrst checks if the state has been reached before. If the state is
new, the processor adds it to the work queue to be validated against an assertion list.
Ã
Connected Components: First, a graph is spread across all processors [69]. Each proces-
sor then performs a connected components on its local subgraph to collapse portions of its
components into representative nodes. Next, the graph is globally adjusted to point remote
edges (crossing processor boundaries) at the respective representative nodes. Finally, a global
phase successively merges components between neighboring processors. The communication
to computation ratio is determined by the size of the graph.
Ã
NOW-sort: The version of NOW-sort used in this study sorts records from disk-to-disk in two
passes [8]. The sort is highly tuned, setting a the MinuteSort world record in 1997. The sort-
ing algorithm contains two phases. In the ﬁrst phase, each processor reads the records from
disk and sends them to the ﬁnal destination processor. The perfectly balanced nature of the
communication of phase 1 is shown by the solid black square in Figure 3.1i. The sort uses
one-way Active Messages directly, sending bulk messages at the rate the records can be read
from disk. Phase 2 of the algorithm consists of entirely local disk operations. Unlike the other
applications, NOW-sort performs a large amount of I/O, so can overlap communication over-
head with disk accesses.

Yüklə 0,74 Mb.

Dostları ilə paylaş:

1 ... 17 18 19 20 21 22 23 24 ... 51