A systematic Characterization of Application Sensitivity to Network Performance

Date: 15.10.2018
| Program | Description | Input Set | 16-node Time (sec) | 32-node Time (sec) |
|---|---|---|---|---|
| Radix | Integer radix sort | 16 Million 32-bit keys | 13.66 | 7.76 |
| EM3D(write) | Electro-magnetic wave propagation | 80000 Nodes, 40% remote, degree 20, 100 steps | 88.59 | 37.98 |
| EM3D(read) | Electro-magnetic wave propagation | 80000 Nodes, 40% remote, degree 20, 100 steps | 230.0 | 114.0 |
| Sample | Integer sample sort | 32 Million 32-bit keys | 24.65 | 13.23 |
| Barnes | Hierarchical N-Body simulation | 1 Million Bodies | 77.89 | 43.24 |
| P-Ray | Ray Tracer | 1 Million pixel image, 16390 objects | 23.47 | 17.91 |
| Murφ | Protocol Verification | SCI protocol, 2 procs, 1 line, 1 memory each | 67.68 | 35.33 |
| Connect | Connected Components | 4 Million nodes, 2-D mesh, 30% connected | 2.29 | 1.17 |
| NOW-sort | Disk-to-Disk Sort | 32 Million 100-byte records | 127.2 | 56.87 |
| Radb | Bulk version of Radix sort | 16 Million 32-bit keys | 6.96 | 3.73 |
Table 3.1: Split-C Applications and Data Sets
This table describes our applications, the input set, the application’s communication pattern, and the
base run time on 16 and 32 nodes. The 16 and 32 node run times show that most of the applications
are quite scalable between these two machine sizes.
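The scalability claim in the caption can be checked with a few lines of arithmetic. The sketch below is illustrative only: the run times are copied from Table 3.1, while the names `RUN_TIMES` and `scaling` and the definition of efficiency as speedup divided by the ideal factor of 2 are assumptions introduced here, not part of the original study.

```python
# Run times in seconds copied from Table 3.1: (16-node, 32-node).
RUN_TIMES = {
    "Radix": (13.66, 7.76),
    "EM3D(write)": (88.59, 37.98),
    "EM3D(read)": (230.0, 114.0),
    "Sample": (24.65, 13.23),
    "Barnes": (77.89, 43.24),
    "P-Ray": (23.47, 17.91),
    "Murphi": (67.68, 35.33),
    "Connect": (2.29, 1.17),
    "NOW-sort": (127.2, 56.87),
    "Radb": (6.96, 3.73),
}


def scaling(t16, t32):
    """Return (speedup, efficiency) for doubling the machine from 16 to 32 nodes."""
    speedup = t16 / t32
    efficiency = speedup / 2.0  # assumed ideal: 2x nodes gives 2x speedup
    return speedup, efficiency


if __name__ == "__main__":
    for name, (t16, t32) in RUN_TIMES.items():
        s, e = scaling(t16, t32)
        print(f"{name:12s} speedup={s:.2f} efficiency={e:.0%}")
```

Under these assumptions most programs land close to a speedup of 2, while P-Ray stands out with a noticeably lower value.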
available site¹.
To ensure that the data is not overly influenced by startup characteristics, the applications must use reasonably large data sets. Given the experimental space we wish to explore, it is not practical to choose data sets taking hours to complete; however, an effort was made to choose realistic data sets for each of the applications. We used the following criteria to characterize applications in our benchmark suite and to ensure that the applications demonstrate a wide range of architectural requirements:
• Message Frequency: The more communication intensive the application, the more we would expect its performance to be affected by the machine’s communication performance. For applications that use short messages, the most important factor is the message frequency, or equivalently the average interval between messages. However, the behavior may be influenced by the burstiness of communication and the balance in traffic between processors.

¹ ftp.cs.berkeley.edu/pub/CASTLE/Split-C/release/sc961015

[Figure 3.1 panels: (a) Radix, (b) EM3D(write), (c) EM3D(read), (d) Sample, (e) Barnes, (f) P-Ray, (g) Murφ, (h) Connect, (i) NOW-sort, (j) Radb]
Figure 3.1: Split-C Communication Balance
This figure demonstrates the communication balance between each of the 32 processors for our 10 Split-C applications. The greyscale for each pixel represents a message count. Each application is individually scaled from white, representing zero messages, to black, representing the maximum message count per processor as shown in Table 3.2. The y-coordinate tracks the message sender and the x-coordinate tracks the receiver.
• Write or Read Based: Applications that read remote data and wait for the result are more likely to be sensitive to latency than applications that mostly write remote data. The latter are likely to be more sensitive to bandwidth. However, dependences that cause waiting can appear in applications in many forms.
• Short or Long Messages: The Active Message layer used for this study provides two types of messages, short packets and bulk transfers. Applications that use bulk messages may have high data bandwidth requirements, even though message initiations are infrequent.
• Synchronization: Applications can be bulk synchronous or task queue based. Tightly synchronized applications are likely to be dependent on network round-trip times, and so may be very sensitive to latency. Task queue applications may tolerate latency, but may be sensitive to overhead. A task queue based application attempts to overlap message operations with local computation from a task queue. An increase in overhead decreases the available overlap between the communication and local computation.
• Communication Balance: Balance is simply the ratio of the maximum number of messages sent per processor to the average number of messages sent per processor. It is difficult to predict the influence of network performance on applications with a relatively large communication imbalance since varying LogP parameters may exacerbate or may actually alleviate the imbalance.

| Program | Avg. Msg./Proc | Max Msg./Proc | Msg./Proc/ms | Msg. Interval (µs) | Barrier Interval (ms) | Percent Bulk Msg. | Percent Reads | Bulk Msg. (KB/s) | Small Msg. (KB/s) |
|---|---|---|---|---|---|---|---|---|---|
| Radix | 1,278,399 | 1,279,018 | 164.76 | 6.1 | 408 | 0.01% | 0.00% | 26.7 | 4,612.9 |
| EM3D(write) | 4,737,955 | 4,765,319 | 124.76 | 8.0 | 122 | 0.00% | 0.00% | 0.6 | 3,493.2 |
| EM3D(read) | 8,253,885 | 8,316,063 | 72.39 | 13.8 | 369 | 0.00% | 97.07% | 0.0 | 2,026.9 |
| Sample | 1,015,894 | 1,294,967 | 76.76 | 13.0 | 1,203 | 0.00% | 0.00% | 0.0 | 2,149.2 |
| Barnes | 819,067 | 852,564 | 18.94 | 52.8 | 279 | 23.25% | 20.57% | 110.4 | 407.1 |
| P-Ray | 114,682 | 278,556 | 6.40 | 156.2 | 1,120 | 47.85% | 96.49% | 358.5 | 93.5 |
| Connect | 6,399 | 6,724 | 5.45 | 183.5 | 47 | 0.06% | 67.42% | 0.0 | 152.5 |
| Murφ | 166,161 | 168,657 | 4.70 | 212.6 | 11,778 | 49.99% | 0.00% | 3,876.6 | 65.8 |
| NOW-sort | 69,574 | 69,813 | 1.22 | 817.4 | 1,834 | 49.82% | 0.00% | 3,125.1 | 17.2 |
| Radb | 4,372 | 5,010 | 1.17 | 852.7 | 25 | 34.73% | 0.04% | 33.6 | 21.4 |

Table 3.2: Split-C Communication Summary
For a 32 processor configuration, the table shows run times, average number of messages sent per processor, and the maximum number of messages sent by any processor. Also shown is the message frequency expressed in the average number of messages per processor per millisecond, the average message interval in microseconds, the average barrier interval in milliseconds, the percentage of the messages using the Active Message bulk transfer mechanism, the percentage of total messages which are read requests or replies, the average bandwidth per processor for bulk messages in kilobytes per second, and the average bandwidth per processor for small messages in kilobytes per second.
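The message frequency, message interval, and communication balance criteria described above can be worked through numerically. The following is a minimal sketch, assuming that the message frequency is the average message count per processor divided by the 32-node run time from Table 3.1; that assumption reproduces the tabulated values to within rounding but is not stated explicitly in the text. The names `SAMPLES` and `comm_metrics` are hypothetical.

```python
# (avg msgs/proc, max msgs/proc, 32-node run time in seconds),
# copied from Tables 3.2 and 3.1 for three of the ten programs.
SAMPLES = {
    "Radix": (1_278_399, 1_279_018, 7.76),
    "Sample": (1_015_894, 1_294_967, 13.23),
    "P-Ray": (114_682, 278_556, 17.91),
}


def comm_metrics(avg_msgs, max_msgs, run_time_s):
    """Return (msgs per proc per ms, avg interval in microseconds, balance)."""
    rate_per_ms = avg_msgs / (run_time_s * 1000.0)  # assumed definition of frequency
    interval_us = 1000.0 / rate_per_ms              # interval is the inverse of frequency
    balance = max_msgs / avg_msgs                   # ratio of maximum to average, as defined above
    return rate_per_ms, interval_us, balance


if __name__ == "__main__":
    for name, row in SAMPLES.items():
        rate, interval, balance = comm_metrics(*row)
        print(f"{name:8s} {rate:8.2f} msg/ms  {interval:7.1f} us  balance={balance:.2f}")
```

A balance close to 1 indicates that every processor sends roughly the same number of messages; under this sketch Radix is nearly perfectly balanced, while P-Ray's busiest processor sends well over twice the average.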
3.1.1 Split-C Benchmark Suite
Table 3.1 summarizes the programs we chose for our benchmark suite as run on both a 16 and a 32 node cluster. Most applications are well parallelized when scaled from 16 to 32 processors. It is important to note the history of these applications when examining our results. All of the applications were designed for low overhead MPPs or NOWs. The program designers were often able to exploit the low-overhead aspect of these machine architectures in the program design. Each application is discussed briefly below.
• Radix Sort: sorts a large collection of 32-bit keys spread over the processors, and is thoroughly analyzed in [39]. It progresses as two iterations of three phases. First, each processor determines the local rank for one digit of its keys. Second, the global rank of each key is calculated from local histograms. Finally, each processor uses the global histogram to distribute the keys to the proper location. For our input set of one million keys per processor on 32 pro-
