Chapter 5
Performance Evaluation
So far, we have presented the programming interface of the MapReduce skeleton as
well as its internal structure. Built on top of Skandium, our system completely hides
threading issues from the programmer. With the addition of the predefined muscles into
the Skandium library, the system also hides from the programmer complex function-
ality that is independent of the problem in hand. The user can focus on implementing
only the map and reduce muscles and optionally, the split and merge muscles. The
latter makes the skeleton more flexible: it can be used to process any input and it can
produce any output. At the same time the skeleton’s interface is similar to that of the
other Skandium skeletons so it can be used with minimum disruption to the Skandium
programmer. Moreover, the skeleton’s implementation naturally fits the existing Skad-
nium infrastructure and its modular design provides a clear solution.
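To give a flavour of what the user writes, the sketch below implements word-count map and reduce muscles in plain Java. The `MapMuscle` and `ReduceMuscle` interfaces are hypothetical stand-ins for illustration only, not the skeleton's actual API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WordCountMuscles {

    // Hypothetical muscle interfaces, for illustration only; the real
    // skeleton's signatures may differ.
    interface MapMuscle<K, V> {
        List<Map.Entry<K, V>> map(String record);
    }

    interface ReduceMuscle<K, V> {
        V reduce(K key, List<V> values);
    }

    // Word-count map muscle: emit a (word, 1) pair for every word
    // in the input record.
    static class WordCountMap implements MapMuscle<String, Integer> {
        public List<Map.Entry<String, Integer>> map(String record) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : record.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new SimpleEntry<>(word, 1));
                }
            }
            return pairs;
        }
    }

    // Word-count reduce muscle: sum the counts emitted for one word.
    static class WordCountReduce implements ReduceMuscle<String, Integer> {
        public Integer reduce(String key, List<Integer> values) {
            int sum = 0;
            for (int v : values) sum += v;
            return sum;
        }
    }

    public static void main(String[] args) {
        WordCountMap m = new WordCountMap();
        // "the cat sat on the mat" emits six (word, 1) pairs.
        System.out.println(m.map("the cat sat on the mat").size());
    }
}
```

The grouping of pairs by key and the invocation of the reduce muscle per distinct key are exactly the parts the skeleton handles internally, which is why the user code stays this small.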
The next question is: how well does our solution perform? Performance often conflicts with ease of use, so this chapter is dedicated to presenting the evaluation of the skeleton's performance. The chapter begins with an evaluation of the four implementation schemes described in the previous chapter. The schemes are evaluated against five different applications on two different multi-core platforms. The best performing scheme is then evaluated further and compared, in terms of programmability and performance, with traditional parallel programming techniques based on Java threads.
5.1 Experimental Method
This section describes the experimental methodology we used to evaluate the MapRe-
duce skeleton.
                   System 1                               System 2
Hardware Settings
CPU Type           Intel Xeon E7320                       Intel Xeon E5645
Clock Freq         2.13 GHz                               2.40 GHz
Processor Count    4                                      2
Cores/Processor    4                                      6
L1 Cache           32KB, 8-way SA, 64-byte line size      32KB, 8-way SA, 64-byte line size
L2 Cache           2 x 2MB, 8-way SA, 64-byte line size   6 x 256KB, 8-way SA, 64-byte line size
L3 Cache           -                                      12MB
Software Settings
Operating System   Scientific Linux 5.5                   Red Hat Enterprise Linux Server release 5.4
Java               OpenJDK 64-Bit Server VM               HotSpot 64-Bit Server VM

Table 5.1: The characteristics of the two systems that were used for the skeleton's evaluation
5.1.1 Shared Memory Systems
One of the most important factors that affect the model's performance on shared-memory platforms is the machine's memory hierarchy [5]. As we want to investigate how this parameter affects our implementation, we ran our experiments on two machines with considerably different memory sub-systems. The two systems are described in Table 5.1. System 1 is a 4-processor machine with four cores per processor and System 2 is a 2-processor machine with six cores per processor. The first system has two L2 caches per processor, shared among its four cores, while the second system has a separate L2 cache per core and a common L3 cache shared by all six cores.
5.1.2 Applications
In chapter 2, we presented some typical MapReduce applications that we used for the performance evaluation of the skeleton. Table 5.2 describes the data sets that were used for each application as well as the characteristics of each workload, i.e. the number and size of the emitted pairs and the number of distinct keys, which, as we will see in the next section, considerably affect the skeleton's performance.
                  Data Set                                No of Pairs   Average Pair Size (Bytes)   No of Distinct Keys
Word Count        text 100MB                              18 M          38                          150 K
Matrix Multiply   2 matrices 1600*1600                    2.5 M         24                          2.5 M
Inverted Index    text 100MB, 10 words to be searched     4 M           38                          10
Histogram         bitmap 100MB                            100 M         48                          768
KMeans            1M points, 3 dimensions, 100 clusters   1 M           28                          100

Table 5.2: The characteristics of the five applications that were used for the skeleton's evaluation
For our evaluation, we implemented the sequential algorithm for all benchmarks to serve as the baseline for speedups. In the first set of experiments, presented in the following sections, we set the initial heap size of the Java virtual machine to a very large value in order to eliminate all garbage collections. In this way, we can focus on the other parameters that affect the performance of our solution. The performance penalty caused by the garbage collector is evaluated separately in the last section of this chapter.
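Concretely, pinning the JVM heap can be done with the standard `-Xms`/`-Xmx` flags; the sizes and class name below are illustrative placeholders, not the settings used in our experiments:

```shell
# Illustrative placeholders: setting the initial heap equal to a large
# maximum heap means the heap never needs to grow and, for large enough
# values, no collection is triggered during the measured run.
# -verbose:gc logs any collections that do occur, confirming there were none.
java -Xms8g -Xmx8g -verbose:gc MapReduceBenchmark
```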
5.2 Evaluation of the Four Implementation Schemes
Figure 5.1 shows the efficiency of the four implementation schemes relative to the sequential algorithm for the two machines described in Table 5.1. Efficiency is calculated as the ratio of the speedup over the sequential algorithm to the number of cores in the machine. In this experiment, each implementation scheme uses, at its parallel stages, as many workers as there are cores in the machine, i.e. 16 for System 1 and 12 for System 2. There are three main observations we can make about these results.
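The efficiency metric used above can be computed directly from the measured running times; a minimal sketch, where the timings are made-up placeholders rather than measured values:

```java
public class Efficiency {
    // Efficiency = speedup / cores, where speedup = T_sequential / T_parallel.
    static double efficiency(double seqSeconds, double parSeconds, int cores) {
        double speedup = seqSeconds / parSeconds;
        return speedup / cores;
    }

    public static void main(String[] args) {
        // Placeholder timings: a run that is 12x faster than the
        // sequential baseline on 16 cores has an efficiency of 0.75.
        System.out.println(efficiency(48.0, 4.0, 16)); // prints 0.75
    }
}
```

An efficiency of 1.0 would mean perfect linear speedup, so the metric makes results on the 16-core and 12-core machines directly comparable.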
Our first observation is that some applications do not perform well, regardless of
the machine or the implementation scheme. These are the Histogram and KMeans
applications. The reason for this is that the MapReduce model is not an efficient fit for
these two applications [5].
Histogram is a good example of the model's inefficiency when the problem at hand is not naturally associated with any keys. Unlike that of applications such as word count, the output of Histogram is predictable. It consists of three fixed-width tables (i.e. 3 x 256)