[Figure: Bulk Gap Calibration. Plot of added Delay (usec), 0 to 90, versus Bulk Gap (usec/byte), 0 to 0.9.]
Figure 2.4: Calibration of Bulk Gap for the Parallel Program Apparatus
This figure shows the empirical calibration for bulk Gap. The dependent variable shows the added delay in µs per 100 bytes of packet size. The independent variable is the Gap expressed in µs per byte (1/bandwidth) at a 2KB packet size. After a small delay, the relationship is linear, showing that the apparatus for adjusting bulk Gap is quite accurate.
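The calibration suggests the shape of the delay mechanism: each outgoing packet is stalled in proportion to its size. The sketch below is a minimal illustration of that idea, not the apparatus's actual code; inflate_bulk_gap and stall_usec are hypothetical names.

/* Minimal sketch, assuming bulk Gap is inflated by stalling each outgoing
 * packet in proportion to its size. stall_usec() is a hypothetical
 * busy-wait primitive; nothing here is the apparatus's actual code. */
extern void stall_usec(double usec);

void inflate_bulk_gap(double gap_per_byte_usec, int packet_bytes)
{
    /* The added delay grows linearly with packet size; e.g. a 2KB packet
     * at 0.05 usec/byte is held back an extra 102.4 usec. */
    stall_usec(gap_per_byte_usec * (double)packet_bytes);
}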
2.3.2 MPI Apparatus
The last few years have seen a standard message passing interface, aptly named the Message Passing Interface (MPI) [75], emerge from the parallel programming community. In this section, we describe the construction and performance of MPI on top of our basic apparatus described in the previous section. Recall that this apparatus is used in our study of the NAS Parallel Benchmark suite. We conclude this section with a simple model which describes how MPI will react to changes in the LogGP parameters.
The MPI specification is quite complex, including many collective operations, four semantic classes of point-to-point messages, methods of grouping processes (communicators), and many ways of tag-matching between sends and receives. In order to manage this complexity, the MPICH implementation [6] layers the more complex MPI abstractions on top of simpler ones. For example, collective operations, such as MPI_Alltoall, are implemented as standard point-to-point messages using MPI_Send. The point-to-point messages are, in turn, mapped to the lowest layer, the MPI abstract device (MPID). The MPID layer is quite small; it implements just three ways to send point-to-point messages.
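To make the layering concrete, here is a minimal sketch, not MPICH's actual code, of how a collective such as MPI_Alltoall can be expressed as point-to-point exchanges; the function name naive_alltoall and the fixed block layout are illustrative assumptions, and MPI_Sendrecv is used in place of bare MPI_Send so the blocking exchanges cannot deadlock.

/* Illustrative sketch: an all-to-all built from point-to-point messages,
 * in the spirit of the MPICH layering described above. */
#include <mpi.h>
#include <string.h>

static void naive_alltoall(const char *sendbuf, char *recvbuf,
                           int blocksize, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* The local block needs no network traffic. */
    memcpy(recvbuf + rank * blocksize, sendbuf + rank * blocksize, blocksize);

    /* Exchange one block with every other rank, rotating through peers. */
    for (int i = 1; i < size; i++) {
        int dst = (rank + i) % size;
        int src = (rank - i + size) % size;
        MPI_Sendrecv(sendbuf + dst * blocksize, blocksize, MPI_CHAR, dst, 0,
                     recvbuf + src * blocksize, blocksize, MPI_CHAR, src, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}

MPICH's actual implementation differs in detail, but the structural point stands: the collective ultimately decomposes into the same point-to-point sends that the MPID layer carries.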
[Figure: Transfer Time (usec) versus Bytes, 0 to 30000; measured MPI data plotted against two linear-model fits, one with To=25.4 and B=0.058 and one with To=148.5 and B=0.027.]
Figure 2.5: Baseline MPI Performance
This figure shows the baseline performance of the MPI-GAM system. The figure plots half the round trip time for different size messages. Two distinct performance regimes are observable, one for messages $<$ 4KB and the other for messages $\geq$ 4KB. The modeled start-up cost, $T_0$, is obtained from the y-intercept of a least squares fit to each of the two performance regimes. The per-byte cost, $B = 1/\mathit{bandwidth}$, is obtained from the slopes of the fitted lines.
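The least-squares fits mentioned in the caption are ordinary linear regression; the sketch below recovers $T_0$ and $B$ from (bytes, time) samples for one regime. The function name and interface are our own illustrative choices.

/* Sketch: ordinary least-squares fit of T_n = T0 + n*B over k samples.
 * n[] holds message sizes in bytes, t[] holds measured times in usec. */
#include <stddef.h>

void fit_regime(const double *n, const double *t, size_t k,
                double *T0, double *B)
{
    double sn = 0, st = 0, snn = 0, snt = 0;
    for (size_t i = 0; i < k; i++) {
        sn  += n[i];
        st  += t[i];
        snn += n[i] * n[i];
        snt += n[i] * t[i];
    }
    *B  = (k * snt - sn * st) / (k * snn - sn * sn); /* slope, usec/byte */
    *T0 = (st - *B * sn) / k;                        /* intercept, usec  */
}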
thus requires all memory addresses to be known in advance of the call. A key point of the GAM implementation is that am_store internally maps to a sequence of am_request calls. The GAM LCP can pipeline these requests, resulting in the maximum bandwidth of 38 MB/s for a long sequence of 4KB am_request messages.
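The mapping just described can be sketched as follows; the signatures below are assumptions for illustration, since only the names am_store and am_request, the 4KB fragment size, and the pipelining behavior come from the text.

/* Sketch: a bulk am_store expressed as a pipeline of 4KB am_request
 * fragments. Signatures are assumed for illustration. */
#include <stddef.h>

#define GAM_FRAG 4096 /* 4KB fragments reach the 38 MB/s peak */

extern void am_request(int node, void *remote, const void *local, size_t n);

void am_store(int node, void *remote, const void *local, size_t nbytes)
{
    size_t off = 0;
    while (off < nbytes) {
        size_t chunk = (nbytes - off < GAM_FRAG) ? nbytes - off : GAM_FRAG;
        /* Fragments are issued back to back so the LCP can overlap them. */
        am_request(node, (char *)remote + off,
                   (const char *)local + off, chunk);
        off += chunk;
    }
}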
Performance
In this section, we investigate the performance of the MPI message passing layer built on top of GAM. The goal is to understand how the inflation of the LogGP parameters at the GAM level affects the performance of MPI. We show how the different implementations of the standard MPI_Send result in different performance regimes.
MPI benchmarks traditionally use a linear model of performance. In the traditional linear model, a per-message start-up cost of $T_0$ is paid on every message. A second parameter, $B$, captures the bandwidth limitations of the machine. The cost to send an n-byte message, $T_n$, is thus modeled as $T_n = T_0 + n \cdot B$. Fitting this model into the LogGP perspective requires care, although by definition, $G \equiv B$.
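Instantiating the model with the constants fitted in Figure 2.5 gives a concrete two-regime predictor. The 4KB crossover and the fitted values are taken from the figure; the code itself is only an illustrative sketch.

/* Sketch: the two-regime linear model T_n = T0 + n*B, using the
 * constants fitted in Figure 2.5. */
#include <stdio.h>

static double mpi_time_usec(double nbytes)
{
    if (nbytes < 4096.0)
        return 25.4 + nbytes * 0.058;  /* small-message regime */
    return 148.5 + nbytes * 0.027;     /* large-message regime */
}

int main(void)
{
    /* e.g. an 8KB transfer: 148.5 + 8192 * 0.027, about 370 usec */
    printf("T(8192) = %.1f usec\n", mpi_time_usec(8192.0));
    return 0;
}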
Modeling $T_0$ requires knowledge of the underlying implementation. For example, $T_0$