Figure 5.1: The efficiency of each of the four implementation schemes for System1 (left) and System2 (right)
where the frequency of each RGB value is stored. The obvious sequential algorithm for this problem maintains three separate tables for the red, green and blue values and, for each byte in the bitmap, increments a counter in the corresponding entry of the appropriate table. The MapReduce model introduces an unnecessary key manipulation overhead, which is even more serious for this application because a very large number of keys is produced, as we can see in Table 5.2.
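For reference, the sequential algorithm described above can be sketched as follows. This is a minimal illustration; the class name and the assumption that pixels are stored as consecutive R, G, B bytes are ours, not the actual implementation:

    public class SequentialHistogram {

        // Computes the per-channel frequency tables for a bitmap whose
        // pixels are stored as consecutive R, G, B bytes (an assumed layout).
        public static int[][] histogram(byte[] bitmap) {
            int[] red = new int[256];
            int[] green = new int[256];
            int[] blue = new int[256];
            for (int i = 0; i + 2 < bitmap.length; i += 3) {
                red[bitmap[i] & 0xFF]++;       // mask to treat the byte as unsigned
                green[bitmap[i + 1] & 0xFF]++;
                blue[bitmap[i + 2] & 0xFF]++;
            }
            return new int[][] { red, green, blue };
        }
    }

Every byte of input causes exactly one counter increment, which is why any per-pair key manipulation added by MapReduce weighs so heavily here.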
KMeans is also not naturally associated with keys and, on top of that, it is an iterative algorithm, so the overheads of key manipulation can become even more serious. However, unlike Histogram, where the mappers are repeatedly allocating pairs, the map phase in KMeans involves a non-trivial computation part in which the distances of all the points from the centroids are computed. This means that the application benefits from the parallelism in the map phase, which is reflected in its slight performance advantage over Histogram.
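The distance computation that dominates the KMeans map phase can be sketched as follows. The names and the dense double[] representation are assumptions for illustration; the sketch uses squared Euclidean distance, which is sufficient for comparing centroids:

    public class KMeansMap {

        // Returns the index of the centroid closest to the given point.
        // The map phase would emit a pair of this index and the point.
        public static int nearestCentroid(double[] point, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dist = 0.0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids[c][d];
                    dist += diff * diff;   // squared distance is enough for comparison
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            return best;
        }
    }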
Our second observation is that for some applications, there are good and bad implementation schemes regardless of the machine. The cache scheme, for example, performs well for word count and inverted index. This is not surprising, since these applications produce a large number of duplicate keys, inverted index most of all. However, for matrix multiply, where all the keys are distinct, cache offers no performance benefit over the other schemes. The Phoenix scheme, on the other hand, while it generally performs well for all applications, underperforms for matrix multiply because it involves a join stage that is expensive and unnecessary for this application: since all the keys are distinct, none of the pairs have been grouped together by this stage, so the join operates on the full set of pairs.
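The benefit of the cache scheme for duplicate-heavy workloads can be illustrated with a sketch of local combining. The class and method names here are hypothetical and do not come from the framework's code:

    import java.util.HashMap;
    import java.util.Map;

    public class LocalCombiner {

        private final Map<String, Integer> cache = new HashMap<>();

        // For word count, repeated words hit the same entry, so the number
        // of pairs handed to the rest of the pipeline shrinks dramatically.
        public void emit(String key) {
            cache.merge(key, 1, Integer::sum);
        }

        // With all-distinct keys (matrix multiply), every emit creates a
        // fresh entry and the cache adds overhead without saving anything.
        public Map<String, Integer> pairs() {
            return cache;
        }
    }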
Our third observation is that the machine significantly affects the performance of some implementation schemes. For example, the parStore scheme, which introduces parallelism in the Store muscle, exhibits quite different behavior on the two machines. This is apparent in the applications that produce many pairs, i.e. word count and inverted index. For inverted index, we can see that on System1 the parallel store is actually worse than the sequential store (seqStore), while on System2 it remains at about the same level. The reason for this is most likely the different memory hierarchies of the two machines.
Inverted index is characterized by only 10 distinct keys and 2 million pairs. This means that the store threads are essentially constantly updating the values associated with these ten keys, which are stored in the ConcurrentHashMap. With more store threads, we get more cache invalidations, and this is probably the reason for the decreased performance compared to the sequential Store on System1. The cache architecture of System2, on the other hand, seems to handle the many cache invalidations more efficiently, probably because more processors are grouped together on the same chip and share a common L3 cache, so updated values are exchanged between the processors more efficiently. We can also see that the cache implementation scheme generally performs better on System2. Again, we believe the reason is that the architecture of System2 is able to exploit locality better, so there is plenty of synergy with the cache implementation scheme.
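The access pattern behind this effect can be sketched as follows. The thread and iteration counts are arbitrary assumptions for illustration, not the measured benchmark:

    import java.util.concurrent.ConcurrentHashMap;

    public class StoreContention {

        public static void main(String[] args) throws InterruptedException {
            ConcurrentHashMap<Integer, Long> table = new ConcurrentHashMap<>();
            int storeThreads = 4;
            Thread[] threads = new Thread[storeThreads];
            for (int t = 0; t < storeThreads; t++) {
                threads[t] = new Thread(() -> {
                    // Each update touches one of only 10 entries, so the
                    // cache lines holding them bounce between the cores.
                    for (int i = 0; i < 200_000; i++) {
                        table.merge(i % 10, 1L, Long::sum);
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.println(table);
        }
    }

With so few keys, adding threads increases contention on the same entries rather than spreading the work, which matches the behavior observed on System1.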
In the next sections, we narrow the problem space by further evaluating only the Phoenix scheme, as it generally performs better than the other schemes for most of the applications, on both machines. The next section investigates how the performance of this scheme varies as we change the number of workers. In addition, we will not further evaluate the Histogram application, as its performance is quite poor according to these preliminary results.
5.3 Evaluation of the Phoenix Scheme
Figure 5.2 shows the speedup of the implementation for different numbers of workers. In general, the performance scales almost linearly as we increase the number of workers, up to the point where the number of workers equals the number of cores in the machine, and then drops with any further increase (i.e. word count, inverted index, matrix multiply) or remains at about the same level (i.e. KMeans). As the number of workers is increased past the number of cores in the machine, we get no further benefit from parallel execution, since not all workers can be executed in parallel anymore. Moreover, each worker has its own data structure to store the intermediate pairs, which means that the more workers we have, the more dispersed the intermediate pairs are across the different data structures. This slows down the join phase, which in turn leads to the performance drop that we can observe in the figure.
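The join cost described above can be illustrated with a hypothetical sketch that merges per-worker intermediate maps; all names and the map-of-lists representation are our assumptions:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class JoinPhase {

        // Merges the intermediate maps of all workers into one grouped map.
        // With more workers, each key's values are scattered across more
        // maps, so this pass over all of them grows more expensive.
        public static Map<String, List<Integer>> join(
                List<Map<String, List<Integer>>> perWorker) {
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (Map<String, List<Integer>> workerMap : perWorker) {
                for (Map.Entry<String, List<Integer>> e : workerMap.entrySet()) {
                    grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                           .addAll(e.getValue());
                }
            }
            return grouped;
        }
    }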
Figure 5.2: The speedup with increasing number of workers for System1
Figure 5.3 shows the results when running the same experiment on System2. The
behavior is similar to the previous case for all applications apart from word count,
where we can observe a slight increase in the speedup even when the number of work-
ers increases past twelve, which is the number of cores in the machine.
Figure 5.3: The speedup with increasing number of workers for System2