[Figure: Bulk Gap Calibration. Plot of added Delay (usec), 0 to 90, versus Bulk Gap (usec/byte), 0 to 0.9.]
Figure 2.4: Calibration of Bulk Gap for the Parallel Program Apparatus
This figure shows the empirical calibration for bulk Gap. The dependent variable shows the added delay in µs per 100 bytes of packet size. The independent variable is the Gap expressed in µs per byte (1/bandwidth) at a 2KB packet size. After a small delay, the relationship is linear, showing that the apparatus for adjusting bulk Gap is quite accurate.
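The calibration suggests the shape of the delay mechanism: each outgoing packet is stalled in proportion to its size. The sketch below is a minimal illustration of that idea, not the apparatus's actual code; inflate_bulk_gap and stall_usec are hypothetical names.

/* Minimal sketch, assuming bulk Gap is inflated by stalling each outgoing
 * packet in proportion to its size. stall_usec() is a hypothetical
 * busy-wait primitive; nothing here is the apparatus's actual code. */
extern void stall_usec(double usec);

void inflate_bulk_gap(double gap_per_byte_usec, int packet_bytes)
{
    /* The added delay grows linearly with packet size; e.g. a 2KB packet
     * at 0.05 usec/byte is held back an extra 102.4 usec. */
    stall_usec(gap_per_byte_usec * (double)packet_bytes);
}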
2.3.2 MPI Apparatus
The last few years have seen a standard message passing interface, aptly named the Message Passing Interface (MPI) [75], emerge from the parallel programming community. In this section, we describe the construction and performance of MPI on top of our basic apparatus described in the previous section. Recall that this apparatus is used in our study of the NAS Parallel Benchmark suite. We conclude this section with a simple model which describes how MPI will react to changes in the LogGP parameters.
The MPI specification is quite complex, including many collective operations, four semantic classes of point-to-point messages, methods of grouping processes (communicators), and many ways of tag-matching between sends and receives. In order to manage this complexity, the MPICH implementation [6] layers the more complex MPI abstractions on top of simpler ones. For example, collective operations, such as MPI_Alltoall, are implemented as standard point-to-point messages using MPI_Send. The point-to-point messages are, in turn, mapped to the lowest layer, the MPI abstract device (MPID). The MPID layer is quite small; it implements just three ways to send point-to-point messages.
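To make the layering concrete, here is a minimal sketch, not MPICH's actual code, of how a collective such as MPI_Alltoall can be expressed as point-to-point exchanges; the function name naive_alltoall and the fixed block layout are illustrative assumptions, and MPI_Sendrecv is used in place of bare MPI_Send so the blocking exchanges cannot deadlock.

/* Illustrative sketch: an all-to-all built from point-to-point messages,
 * in the spirit of the MPICH layering described above. */
#include <mpi.h>
#include <string.h>

static void naive_alltoall(const char *sendbuf, char *recvbuf,
                           int blocksize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* The local block needs no network traffic. */
    memcpy(recvbuf + rank * blocksize, sendbuf + rank * blocksize, blocksize);

    /* Exchange one block with every other rank, rotating through peers. */
    for (int i = 1; i < size; i++) {
        int dst = (rank + i) % size;
        int src = (rank - i + size) % size;
        MPI_Sendrecv(sendbuf + dst * blocksize, blocksize, MPI_CHAR, dst, 0,
                     recvbuf + src * blocksize, blocksize, MPI_CHAR, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

MPICH's actual implementation differs in detail, but the structural point stands: the collective ultimately decomposes into the same point-to-point sends that the MPID layer carries.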
[Figure: Transfer Time (usec) versus Bytes, 0 to 30000; measured MPI data plotted against two linear-model fits, one with To=25.4 and B=0.058 and one with To=148.5 and B=0.027.]
Figure 2.5: Baseline MPI Performance
This figure shows the baseline performance of the MPI-GAM system. The figure plots half the round trip time for different size messages. Two distinct performance regimes are observable, one for messages $<$ 4KB and the other for messages $\geq$ 4KB. The modeled start-up cost, $T_0$, is obtained from the y-intercept of a least squares fit to each of the two performance regimes. The per-byte cost, $B = 1/\mathit{bandwidth}$, is obtained from the slopes of the fitted lines.
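The least-squares fits mentioned in the caption are ordinary linear regression; the sketch below recovers $T_0$ and $B$ from (bytes, time) samples for one regime. The function name and interface are our own illustrative choices.

/* Sketch: ordinary least-squares fit of T_n = T0 + n*B over k samples.
 * n[] holds message sizes in bytes, t[] holds measured times in usec. */
#include <stddef.h>

void fit_regime(const double *n, const double *t, size_t k,
                double *T0, double *B)
{
    double sn = 0, st = 0, snn = 0, snt = 0;
    for (size_t i = 0; i < k; i++) {
        sn  += n[i];
        st  += t[i];
        snn += n[i] * n[i];
        snt += n[i] * t[i];
    }
    *B  = (k * snt - sn * st) / (k * snn - sn * sn); /* slope, usec/byte */
    *T0 = (st - *B * sn) / k;                        /* intercept, usec  */
}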
thus requires all memory addresses to be known in advance of the call. A key point of the GAM implementation is that am_store internally maps to a sequence of am_request calls. The GAM LCP can pipeline these requests, resulting in the maximum bandwidth of 38 MB/s for a long sequence of 4KB am_request messages.
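The mapping just described can be sketched as follows; the signatures below are assumptions for illustration, since only the names am_store and am_request, the 4KB fragment size, and the pipelining behavior come from the text.

/* Sketch: a bulk am_store expressed as a pipeline of 4KB am_request
 * fragments. Signatures are assumed for illustration. */
#include <stddef.h>

#define GAM_FRAG 4096 /* 4KB fragments reach the 38 MB/s peak */

extern void am_request(int node, void *remote, const void *local, size_t n);

void am_store(int node, void *remote, const void *local, size_t nbytes)
{
    size_t off = 0;
    while (off < nbytes) {
        size_t chunk = (nbytes - off < GAM_FRAG) ? nbytes - off : GAM_FRAG;
        /* Fragments are issued back to back so the LCP can overlap them. */
        am_request(node, (char *)remote + off,
                   (const char *)local + off, chunk);
        off += chunk;
    }
}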
Performance
In this section, we investigate the performance of the MPI message passing layer built on top of GAM. The goal is to understand how the inflation of the LogGP parameters at the GAM level affects the performance of MPI. We show how the different implementations of the standard MPI_Send result in different performance regimes.
MPI benchmarks traditionally use a linear model of performance. In the traditional linear model, a per-message start-up cost of $T_0$ is paid on every message. A second parameter, $B$, captures the bandwidth limitations of the machine. The cost to send an n-byte message, $T_n$, is thus modeled as $T_n = T_0 + n \cdot B$. Fitting this model into the LogGP perspective requires care, although by definition, $G \equiv B$.
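Instantiating the model with the constants fitted in Figure 2.5 gives a concrete two-regime predictor. The 4KB crossover and the fitted values are taken from the figure; the code itself is only an illustrative sketch.

/* Sketch: the two-regime linear model T_n = T0 + n*B, using the
 * constants fitted in Figure 2.5. */
#include <stdio.h>

static double mpi_time_usec(double nbytes)
{
    if (nbytes < 4096.0)
        return 25.4 + nbytes * 0.058;  /* small-message regime */
    return 148.5 + nbytes * 0.027;     /* large-message regime */
}

int main(void)
{
    /* e.g. an 8KB transfer: 148.5 + 8192 * 0.027, about 370 usec */
    printf("T(8192) = %.1f usec\n", mpi_time_usec(8192.0));
    return 0;
}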
Modeling $T_0$ requires knowledge of the underlying implementation. For example, $T_0$