Chapter 5. Performance Evaluation
Figure 5.8: The garbage collector's impact on the performance of the four MapReduce applications
Chapter 6
Optimization of the MapReduce Skeleton
From the results of the previous chapter, it is evident that, as far as performance is concerned, our solution has some clear weaknesses. From the comparison with manual Java threading we can conclude that, although our implementation protects the programmer from some pitfalls of parallelism, it is outperformed by far by the manual threading scheme. Our next consideration is how to narrow the performance gap between the two approaches, so that we can enjoy the benefits of our skeletal scheme, i.e. better programmability, without losing too much performance compared to manual threading. In this chapter, we identify the problems of our implementation and propose a set of optimizations that aim to make the skeleton's performance match, or at least be comparable to, that of the manual threading approach.
In addition, we will consider the potential for auto-tuning the skeleton, and we will present a simple auto-tuning mechanism that allows the implementation to bypass stages of the MapReduce pipeline that are unnecessary for certain types of applications.
6.1 Hash Table Improvements
6.1.1 Choosing an Optimal Hash Table Size
In our original MapReduce implementation, the size of the hash arrays used to store the intermediate pairs is fixed at 256 entries. This fixed size turns out to be ineffective for applications like word count, where the number of distinct keys is quite large (150,000 distinct keys for word count). The large number of distinct keys and the few
hash array entries mean that the sorted ArrayLists of Combiners in each entry of the hash array can grow large. This slows down the store process, since larger lists must be searched for an existing Combiner, and their elements must be shifted down to insert the Combiner if it is not already in the list.
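The store path described above can be sketched as follows. This is a minimal illustration, not our actual implementation: the class and method names are hypothetical, and the Combiner values are elided so that only the key lookup and insertion cost is visible.

```java
import java.util.ArrayList;
import java.util.Collections;

// Minimal sketch of a hash array with one sorted key list per entry.
class IntermediateStore {
    private final ArrayList<String>[] hashArray;

    @SuppressWarnings("unchecked")
    IntermediateStore(int size) {
        hashArray = new ArrayList[size];
        for (int i = 0; i < size; i++) {
            hashArray[i] = new ArrayList<>();
        }
    }

    // O(1) to hash to the entry, O(log k) binary search in the sorted list,
    // but an O(k) element shift when a new key has to be inserted.
    void store(String key) {
        ArrayList<String> bucket =
                hashArray[Math.floorMod(key.hashCode(), hashArray.length)];
        int pos = Collections.binarySearch(bucket, key);
        if (pos < 0) {
            bucket.add(-pos - 1, key); // insertion point keeps the list sorted
        }
        // else: key already present; a real store would combine the value here
    }

    // Total number of distinct keys held across all entries.
    int totalKeys() {
        int n = 0;
        for (ArrayList<String> bucket : hashArray) {
            n += bucket.size();
        }
        return n;
    }
}
```

With only 256 entries and 150,000 distinct keys, each bucket's list holds roughly 600 keys on average, so the O(k) shift on insertion dominates the store cost.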
Ideally, we want a large hash array whose size is proportional to the number of distinct keys in the application. This leads to short lists in each entry and fast storing of the pairs. At the same time, we do not want a hashArray much bigger than the number of distinct keys: first, because we would be wasting memory, and second, because the Combiners would then be more dispersed across the array, which makes the join stage more expensive.
Figure 6.1 shows the speedup of the word count application as we change the size of the hash array. We show the results on both System1 and System2.
Figure 6.1: The speedup with varying hash array size for word count on System1 (left) and System2 (right)
On System1, we can see that the speedup is very low for a small hash array and gradually increases until it reaches a maximum at 64K entries (i.e. about 3 Combiners per entry), after which it drops again. The hashArray used in our original version had 256 entries and is shown in red in the figure. Similar behavior is observed on System2.
Matrix multiplication benefits even more from increasing the size of the hash array. This is because the number of Combiners is significantly larger in this case (2.5M pairs), meaning that for a small hashArray the lists in each entry can grow too large. As illustrated in Figure 6.2, the performance benefits over our original size are substantial on both systems.
Figure 6.3 shows the same experiment for inverted index. The speedup remains
Figure 6.2: The speedup with varying hash array size for matrix multiply on System1 (left) and System2 (right)
almost invariant, and a significant decrease is observed on both systems only when the hash array grows too large. Since a small hashArray in this application performs as well as a larger one, it is more beneficial to use a small hash array and reduce the memory consumed by the structure.
Figure 6.3: The speedup with varying hash array size for inverted index on System1 (left) and System2 (right)
Figure 6.4 shows the same experiment for KMeans. As with inverted index, the speedup remains uniform as we increase the hash array size, and drops only when the array grows too large.
A careful look at these figures indicates that on System1 a significant speedup drop occurs for all applications when the hashArray is increased past 64K entries, while on System2 this significant drop occurs at 512K entries. The fact that the drop occurs at the same point for all applications, and the fact that this decrease occurs at different
Figure 6.4: The speedup with varying hash array size for KMeans on System1 (left) and System2 (right)
array sizes on the two machines, lead us to the assumption that the machine itself must impose a hard limit on the size of the data structure. This hard limit is most probably related to the cache hierarchy of each machine. For example, if the hashArray grows too large, it may not fit in the L1 cache, and the resulting L1 cache misses, either at the store or at the partition stage, degrade the performance of the application. Our two systems have L1 caches of equal size; nevertheless, the hierarchy of System2, with a private L2 cache per processor and a common L3 cache before main memory, apparently handles L1 cache misses better than the architecture of System1, which has an L2 cache shared between pairs of processors and no L3 cache.
To summarize, no single hash array size works best for all workloads. For applications that produce a small number of keys, the size of the hashArray should be kept to a minimum. On the other hand, for applications that produce a large number of pairs, the size of the hashArray should be increased, up to the point where it interacts negatively with the system's cache subsystem, or where the values become too dispersed in the hashArray, which would slow down the join stage.
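As a rough illustration of this sizing rule, the hash array size could be derived from an estimate of the distinct-key count and then clamped at the machine-dependent limit. This is only a sketch under assumed parameter names; the values passed in are illustrative, not tuned constants from our implementation.

```java
// Hypothetical sizing heuristic: grow the hash array with the expected
// number of distinct keys, but never past the cache-friendly hard limit.
class HashSizing {
    static int chooseHashArraySize(long estimatedKeys, int keysPerEntry, int maxEntries) {
        // Target entry count for the desired average list length per entry,
        // clamped to a sane upper bound before the int conversion.
        long target = Math.min(Math.max(1, estimatedKeys / keysPerEntry), 1 << 30);
        // Round up to the next power of two so bucket indexing stays cheap.
        int size = Integer.highestOneBit((int) target);
        if (size < target) {
            size <<= 1;
        }
        // Clamp at the machine-dependent limit observed experimentally
        // (e.g. 64K entries on System1, 512K on System2).
        return Math.min(size, maxEntries);
    }
}
```

For word count (150,000 keys, about three Combiners per entry) this yields 65,536 entries, in line with the 64K optimum observed on System1, while matrix multiplication's 2.5M pairs would be capped by the cache limit.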
6.1.2 Using a Binary Search Tree for an O(log n) Store
In this section, we present a second improvement to the data structure that holds the intermediate keys. This optimization has also been used by [14]. The cost for the store muscle to insert a Combiner into the hash array depends on whether the key is already present in the relevant hash array entry's list. If the key is present, and the number of stored Combiners in the list is k, the cost is O(1) to hash to the correct hashArray entry and O(log k) to find the existing key in the entry's list. If the key is