Chapter 5. Performance Evaluation
find a user-specified word in the text. Figure 5.6 shows the performance of the manual threading implementation compared to MapReduce. Although the performance gap is smaller than in the matrix multiplication case, the manual threading implementation still outperforms the MapReduce implementation. We should also note that while the MapReduce implementation suffers significantly when the number of workers grows past the number of cores in the machine (due to the dispersion of the pairs that we described in the previous section), the efficiency of the manual threading implementation remains almost constant (and in fact improves slightly) as the number of threads increases.
Figure 5.6: MapReduce vs Manual Threading for inverted index on System1 (left) and
System2 (right)
5.4.1.4 KMeans
In KMeans, a subset of the input data points is assigned to each classifier thread. The centroids are maintained in a ConcurrentHashMap along with a list of the points that are close to each centroid. Each classifier thread calculates the distance of each point from each centroid and updates the list of values accordingly. When all the values have been calculated, the CentroidEstimator threads start; they are responsible for calculating the new centroid for each group. When all the estimators complete their computations, the second iteration of the algorithm begins, with the classifiers grouping the points according to the new centroids. The manual threading algorithm involved more complex synchronization between the threads than the other applications, since all the classifier threads must finish their work before the estimators start, and vice versa. The synchronization was achieved using Java’s CyclicBarrier object.
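The barrier scheme described above can be sketched as follows. The class and method names here are illustrative placeholders (the thesis code is not reproduced), but the alternation between the classification and estimation phases via CyclicBarrier.await() follows the description in the text:

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch of two-phase synchronization with CyclicBarrier:
// classifier threads and the estimator alternate, each phase gated
// by a barrier crossing. Names are illustrative, not from the thesis.
public class TwoPhaseKMeans {

    // Runs 'iterations' classify/estimate rounds with 'nClassifiers'
    // worker threads; returns the number of estimation phases completed.
    static int runIterations(int nClassifiers, int iterations) throws Exception {
        // All classifiers plus the estimator thread synchronize here.
        CyclicBarrier barrier = new CyclicBarrier(nClassifiers + 1);
        AtomicInteger classified = new AtomicInteger();

        Thread[] workers = new Thread[nClassifiers];
        for (int i = 0; i < nClassifiers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    for (int it = 0; it < iterations; it++) {
                        classified.incrementAndGet(); // stand-in for distance computation
                        barrier.await();              // classification phase finished
                        barrier.await();              // new centroids published
                    }
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            workers[i].start();
        }

        int estimations = 0;
        for (int it = 0; it < iterations; it++) {
            barrier.await();   // wait until every classifier is done
            estimations++;     // stand-in for centroid re-estimation
            barrier.await();   // release classifiers into the next round
        }
        for (Thread t : workers) t.join();
        return estimations;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runIterations(4, 3));
    }
}
```

A CyclicBarrier fits this pattern because it resets automatically after each crossing, so the same object gates every iteration of the algorithm.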
Figure 5.7 illustrates the performance of this implementation compared to MapReduce. Manual threading is the clear winner: its performance scales almost perfectly with an increasing number of threads, up to the point where the number of threads equals the number of cores in the machine.
Figure 5.7: MapReduce vs Manual Threading for KMeans on System1 (left) and Sys-
tem2 (right)
5.4.2 Discussion
From the comparison between the manual threading and the MapReduce implementations we can draw the following conclusions about the performance and the programmability of each approach.
As far as performance is concerned, the manual threading scheme is the clear winner. The many stages of our MapReduce implementation that accommodate pair manipulation hurt performance compared to the low-level threading scheme. However, as the word count application showed, the programmer must make some careful implementation decisions when porting an application to a multi-threaded environment, since some less obvious bottlenecks can seriously degrade its performance. Our skeletal implementation cannot compete with traditional low-level threading. Nevertheless, it can still provide a safety net for the user and protect the programmer from some of the pitfalls of multi-threaded programming, as Figure 5.4 shows.
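A hypothetical example of such a "less obvious" bottleneck (illustrative only, not code from our implementation): routing every update through a single synchronized structure serializes all workers, whereas ConcurrentHashMap combined with per-key LongAdder counters lets threads proceed largely independently:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative contrast between a contended and a scalable word counter.
// Names and workload are hypothetical, chosen only to show the pitfall.
public class CountBottleneck {

    // Contended variant: a single lock guards every update,
    // so all worker threads serialize on this method.
    static final Map<String, Long> locked = new HashMap<>();
    static synchronized void countLocked(String word) {
        locked.merge(word, 1L, Long::sum);
    }

    // Scalable variant: ConcurrentHashMap shards its internal locks,
    // and LongAdder stripes the counter cells across threads.
    static final ConcurrentHashMap<String, LongAdder> scalable = new ConcurrentHashMap<>();
    static void countScalable(String word) {
        scalable.computeIfAbsent(word, k -> new LongAdder()).increment();
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) countScalable("word" + (i % 8));
        };
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) (ts[i] = new Thread(work)).start();
        for (Thread t : ts) t.join();
        // Each of 8 words gets 12,500 updates per thread, across 4 threads.
        System.out.println(scalable.get("word0").sum()); // prints 50000
    }
}
```

Both variants compute the same counts; only the contention behavior differs, which is exactly the kind of decision that is invisible in sequential code but dominates multi-threaded performance.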
Concerning the programmability of each scheme, although this comparison is rather subjective, we believe that the MapReduce implementation marks a clear win in this area. The amount of code required from the user is far smaller, and we observed that it was much
easier to build a working application since the skeletal code is present and we only had
to provide the sequential parts without re-implementing the whole application.
5.5 Evaluating the Garbage Collector’s impact
All our previous experiments used a fixed JVM heap size that was large enough to minimize any GC activity. This allowed us to evaluate, in isolation and more effectively, the other parameters that affect the performance of our implementation, i.e. the underlying hardware, the schemes used for storing/partitioning the intermediate pairs, and the number of workers. In this section, we quantify the impact of the Garbage Collector on the performance of our implementation.
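GC activity during such an experiment can be observed from inside the JVM through the standard management interface. The following sketch (illustrative, not part of our implementation) uses GarbageCollectorMXBean to report how many collections ran and how much time they consumed while an allocation-heavy loop executes:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: quantifying GC activity via the platform MXBeans.
// The allocation loop exists only to provoke some collections.
public class GcProbe {

    // Sum collection counts and accumulated pause time over all collectors;
    // the beans report -1 when a value is unavailable, so clamp to 0.
    static long[] gcStats() {
        long count = 0, millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new long[] { count, millis };
    }

    public static void main(String[] args) {
        long[] before = gcStats();
        byte[] sink;
        for (int i = 0; i < 200_000; i++) {
            sink = new byte[4096];   // short-lived garbage
        }
        long[] after = gcStats();
        System.out.println("collections: " + (after[0] - before[0])
                + ", gc time (ms): " + (after[1] - before[1]));
    }
}
```

Running such a probe with different -Xmx settings makes the trade-off discussed below directly visible: a smaller heap yields more, but individually cheaper, collections.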
Figure 5.8 shows how the speedup of the four MapReduce applications drops as we decrease the size of the heap. Reducing the heap typically means more garbage collections, each of which takes significant CPU time away from the application threads. In this experiment the number of workers is fixed at 16. We should note, though, that the more workers we have, the greater the heap space pressure, because each thread has its own local data structures, which increase the proportion of live data in the heap [9].
As we can see in the figure, Word Count is the most seriously affected by the GC, as its memory requirements are quite large and trigger many garbage collections. This is because Word Count produces many key-value pair objects, and the word count pairs are quite large in terms of memory. On the other hand, KMeans is almost unaffected by the garbage collector, as its memory requirements are quite low, with fewer emitted pairs of smaller size.
We should note that although we observe a general decline of the speedup as the heap shrinks, for some heap sizes a smaller heap performs equally well or even slightly better than a larger one (this is particularly evident for inverted index at heap sizes between 4G and 512M). This is because each individual collection is generally cheaper on a smaller heap. That is, if a 4G heap and a 2G heap both trigger 10 garbage collections, the performance drop will be smaller for the 2G heap.
In this section, we identified the garbage collection bottleneck for our MapReduce
implementation. We will revisit this problem in the next chapter, where we present a
set of mechanisms and garbage collector tuning settings that aim at reducing the GC
penalty.