A systematic Characterization of Application Sensitivity to Network Performance



our measured applications, the sensitivity to overhead and gap is much stronger than sensitivity to
latency and per-byte bandwidth.
3.4.3 Network Architecture
The most interesting architectural result is that all the applications display a linear
dependence on both overhead and gap. This relationship suggests that continued architectural
improvements in these areas should yield a corresponding improvement in application
performance (limited by Amdahl's Law). If, instead, network performance were "good enough"
for the applications (i.e., some other part of the system were the bottleneck), we would
observe a region where the applications did not slow down as network performance decreased.
By contrast, efforts to improve network latency will not yield comparable performance
improvements across as wide a class of applications.
A second architectural result is that there is an interesting tradeoff between processor
performance and communication performance. For many parallel applications, relatively small
improvements in network overhead and gap can result in a factor of two performance improvement.
This result suggests that in some cases, rather than making a significant investment to double a
machine's processing capacity, the investment may be better directed toward improving the
performance of the communication system.
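The tradeoff can be illustrated with a two-term run-time model. This is a minimal sketch, not the model used in this study: it assumes computation and communication do not overlap, and the 40 s / 60 s split is a hypothetical communication-bound application rather than a measured one. The function name `run_time` and its parameters are introduced here for illustration only.

```python
def run_time(compute_s, comm_s, cpu_speedup=1.0, comm_speedup=1.0):
    """Run time when computation and communication do not overlap (assumption)."""
    return compute_s / cpu_speedup + comm_s / comm_speedup


# Hypothetical application: 40 s of computation, 60 s of communication cost.
base = run_time(40.0, 60.0)                           # 100.0 s
faster_cpu = run_time(40.0, 60.0, cpu_speedup=2.0)    # 80.0 s
faster_net = run_time(40.0, 60.0, comm_speedup=2.0)   # 70.0 s

print(f"baseline:            {base:.1f} s")
print(f"2x processor:        {faster_cpu:.1f} s (speedup {base / faster_cpu:.2f})")
print(f"2x communication:    {faster_net:.1f} s (speedup {base / faster_net:.2f})")
```

Under these assumed numbers, halving the communication cost helps more than doubling processor speed, because Amdahl's Law bounds each improvement by the fraction of run time it affects.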
3.4.4 Modeling
In the modeling area, we found that for both overhead and gap, simple models are able to
predict sensitivity to these parameters for most of our applications. The effects of latency, on the
other hand, are harder to predict because they depend more on application structure. The
applications used a wide variety of latency-tolerating techniques, including pipelining (radix, sample),
batching (EM3D, Radb), caching (Barnes, P-ray) and overlapping (Murφ and NowSort). Each of
these techniques requires a more sophisticated model to capture the effect of added latency than our
frequency-cost model allows.
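A frequency-cost prediction can be sketched as follows. This is a minimal sketch under a stated assumption: that the model takes the linear form "baseline run time plus event count times added cost per event". The exact form used in this study is not given in this section, and the function name, parameter names, and numbers below are hypothetical.

```python
def predict_run_time(base_s, num_events, added_cost_us):
    """Linear frequency-cost prediction (assumed form).

    base_s        -- measured run time with no added cost, in seconds
    num_events    -- how often the cost is paid (e.g. messages per processor)
    added_cost_us -- cost added to each event, in microseconds
    """
    return base_s + num_events * added_cost_us * 1e-6


# Hypothetical program: 10 s baseline, one million messages per processor.
for added_us in (0, 5, 10, 20):
    predicted = predict_run_time(10.0, 1_000_000, added_us)
    print(f"added overhead {added_us:2d} us -> predicted {predicted:.1f} s")
```

A model of this shape charges the full added cost on every event. That is why it cannot describe pipelining, batching, caching, or overlapping, each of which hides part of the added latency behind other work.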

Chapter 4
NAS Parallel Benchmark Sensitivity
... it is a consistent theme that each generation of computers obsoletes the performance
evaluation techniques of the prior generation.
    Hennessy & Patterson, Computer Architecture: A Quantitative Approach
The NAS Parallel Benchmarks (NPB) are widely used to evaluate parallel machines. To
date, every vendor of large parallel machines has presented NPB version 1.0 results [10]. With the
recent convergence of parallel machines and the introduction of a standard programming model (MPI),
the NAS group created version 2.2 of the benchmark suite [11]. In contrast to the vendor-specific
implementations of version 1.0, NPB 2.2 presents a consistent, portable, and readily available workload to
parallel machine designers, analogous to the SPECcpu benchmarks for single-processor machines.
Much has been written about the theoretical techniques in these codes [13, 111], but an understanding
of their practical communication behavior is at best incomplete.
In this chapter, we examine the sensitivities of three of the six NAS parallel benchmarks:
FT, IS and MG. The three are computational kernels from numerical aerodynamic simulation codes.
The other three benchmarks, SP, BT and LU, are longer codes that are considered
pseudo-applications. Unfortunately, apparatus limitations did not allow us to run the much larger
pseudo-applications for this study. The input set comes in three sizes: class A, B, and C. All our
experiments are run on the class B size, which is appropriate for 32 nodes but does not scale down
to a single node. However, because we are not measuring the scalability of the codes, class B is a
reasonable input set size to run on our apparatus.

                      Collective            MPI Device            Active Message
          Run Time    All-to-All(v)         Level                 Level               Max-Min
Program   (sec)       Msgs.   Bytes         Msgs.   Bytes         Small   4K Frag     Ratio
FT        173.2       20      325058560     660     325058560     1980    79360       0.0%
IS        18.7        20      40833600      670     40833600      2010    9969        17.0%
MG        17.8        2854    27570880      2854    27570880      8562    6731        0.1%
Table 4.1: NPB Communication Summary
For a 32 processor configuration, the table shows run times, the number and size of collective
operations at the MPI level, the maximum number and size of operations at the MPI Device level (per
processor), the resulting number of messages at the Active Message level (per processor), and the
percentage skew, measured in bytes, between the processors that sent the maximum and minimum
number of bytes.
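A few derived quantities make the table easier to read. The sketch below uses only the numbers in Table 4.1; the dictionary keys and the `derived` helper are names introduced here, and treating "4K" as 4096 bytes is an assumption. The relationships it reports are arithmetic observations about the table, not statements about how the MPI or Active Message layers are implemented.

```python
FRAG_BYTES = 4096  # assumption: a "4K Frag" carries 4096 bytes

# Per-processor values copied from Table 4.1.
TABLE_4_1 = {
    "FT": {"mpi_msgs": 660, "mpi_bytes": 325058560, "am_small": 1980, "am_frag": 79360},
    "IS": {"mpi_msgs": 670, "mpi_bytes": 40833600, "am_small": 2010, "am_frag": 9969},
    "MG": {"mpi_msgs": 2854, "mpi_bytes": 27570880, "am_small": 8562, "am_frag": 6731},
}


def derived(row):
    """Compute summary ratios for one row of the table."""
    return {
        "avg_bytes_per_mpi_msg": row["mpi_bytes"] / row["mpi_msgs"],
        "small_am_per_mpi_msg": row["am_small"] / row["mpi_msgs"],
        "whole_frags_in_bytes": row["mpi_bytes"] // FRAG_BYTES,
    }


for name, row in TABLE_4_1.items():
    d = derived(row)
    print(
        f"{name}: {d['avg_bytes_per_mpi_msg']:.0f} bytes per MPI device message, "
        f"{d['small_am_per_mpi_msg']:.1f} small AMs per MPI device message, "
        f"{d['whole_frags_in_bytes']} whole 4K fragments in the byte total "
        f"(table lists {row['am_frag']})"
    )
```

In all three rows the small Active Message count is exactly three times the MPI Device message count, and the fragment count equals the byte total divided by 4096, rounded down. FT moves roughly fifty times more data per MPI Device message than MG, which matches the bulk transpose versus neighbor exchange distinction drawn in the text.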
4.1 Characterization
In this section we characterize the NPB much in the same way as we did the Split-C/AM
benchmarks in the previous chapter. We begin with a brief description of each benchmark, followed
by a balance graph and message count analysis. We compare the communication characteristics of
the benchmarks to the Split-C/AM programs. The benchmarks are:
• FT: This kernel performs a 3-D Fast Fourier Transform (FFT) on a 3-D grid [dimensions
  illegible in the source]. The program implements the FFT as a series of 1-D FFTs. In each
  iteration, the program does a global transpose of all the data, thus performing a perfectly
  balanced all-to-all communication pattern. The FFT requires a large amount of computation
  in addition to a large volume of communication.
• IS: The Integer Sort (IS) benchmark performs a bucket sort on 1 million 32-bit keys per
  processor. Notice that the run time for IS is much higher than for the equivalent Split-C
  sorts [39]. Like sample sort, this sort performs unbalanced all-to-all communication, relying
  on a random key distribution for load balancing. Thus, the communication pattern depends on
  the data set.
• MG: This program solves a Poisson equation on a grid [dimensions illegible in the source]
  using a multigrid "W" algorithm. Unlike FT and IS, communication is quite localized, occurring
  between neighboring processors on the different grid levels. The communication pattern does
  not depend on the data.
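The data dependence of the IS pattern can be seen in a small simulation. This is a minimal sketch, not the NPB IS code: it assumes uniformly random keys, range-partitioned buckets (one bucket per processor), and 4-byte keys, and it defines skew as (max - min) / max, which may differ from the definition behind Table 4.1. All names and sizes are hypothetical.

```python
import random

KEY_BYTES = 4  # 32-bit keys


def bucket_exchange(num_procs, keys_per_proc, key_range, seed=0):
    """Return sent[src][dst]: number of keys processor src routes to processor dst."""
    rng = random.Random(seed)
    sent = [[0] * num_procs for _ in range(num_procs)]
    for src in range(num_procs):
        for _ in range(keys_per_proc):
            key = rng.randrange(key_range)
            dst = key * num_procs // key_range  # range-partitioned bucket owner
            sent[src][dst] += 1
    return sent


def bytes_sent_remote(sent):
    """Bytes each processor sends to other processors (local keys excluded)."""
    return [
        KEY_BYTES * (sum(row) - row[src])
        for src, row in enumerate(sent)
    ]


sent = bucket_exchange(num_procs=8, keys_per_proc=10_000, key_range=1 << 20)
volumes = bytes_sent_remote(sent)
skew = (max(volumes) - min(volumes)) / max(volumes)  # assumed skew definition
print("bytes sent per processor:", volumes)
print(f"max-min skew: {skew:.1%}")
```

Each processor's outgoing volume depends on where its keys happen to fall, so the volumes differ from processor to processor and from one data set to the next. In the FT transpose, by contrast, every processor sends the same fixed share of the grid, so the corresponding skew is zero by construction.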
Figure 4.1 shows that the three NPB codes are well structured. FT and MG are perfectly
balanced; each processor sends and receives the same amount of data. IS is slightly unbalanced,
