Figure 5.1: The efficiency of each of the four implementation schemes for System1 (left) and System2 (right)
where the frequency of each RGB value is stored. The obvious sequential algorithm for this problem maintains three separate tables for the red, green and blue values and, for each byte in the bitmap, increments a counter in the corresponding entry of the appropriate table. The MapReduce model introduces an unnecessary key manipulation overhead, which is even more serious for this application because a very large number of keys is produced, as we can see in Table 5.2.
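For reference, the sequential algorithm described above can be sketched as follows. This is a minimal illustration; the class name and the assumption that pixels are stored as consecutive R, G, B bytes are ours, not the actual implementation:

    public class SequentialHistogram {

        // Computes the per-channel frequency tables for a bitmap whose
        // pixels are stored as consecutive R, G, B bytes (an assumed layout).
        public static int[][] histogram(byte[] bitmap) {
            int[] red = new int[256];
            int[] green = new int[256];
            int[] blue = new int[256];
            for (int i = 0; i + 2 < bitmap.length; i += 3) {
                red[bitmap[i] & 0xFF]++;       // mask to treat the byte as unsigned
                green[bitmap[i + 1] & 0xFF]++;
                blue[bitmap[i + 2] & 0xFF]++;
            }
            return new int[][] { red, green, blue };
        }
    }

Every byte of input causes exactly one counter increment, which is why any per-pair key manipulation added by MapReduce weighs so heavily here.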
KMeans is also not naturally associated with keys and, on top of that, it is an iterative algorithm, so the overheads of key manipulation can become even more serious. However, unlike Histogram, where the mappers are repeatedly allocating pairs, the map phase in KMeans involves a non-trivial computation part in which the distances of all the points from the centroids are computed. This means that the application benefits from the parallelism in the map phase, which is reflected in its slight performance advantage over Histogram.
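The distance computation that dominates the KMeans map phase can be sketched as follows. The names and the dense double[] representation are assumptions for illustration; the sketch uses squared Euclidean distance, which is sufficient for comparing centroids:

    public class KMeansMap {

        // Returns the index of the centroid closest to the given point.
        // The map phase would emit a pair of this index and the point.
        public static int nearestCentroid(double[] point, double[][] centroids) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double dist = 0.0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids[c][d];
                    dist += diff * diff;   // squared distance is enough for comparison
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            return best;
        }
    }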
Our second observation is that for some applications, there are good and bad implementation schemes regardless of the machine. The cache scheme, for example, performs well for word count and inverted index. This is not surprising, since these applications produce a large number of duplicate keys, inverted index most of all. However, for matrix multiply, where all the keys are distinct, cache offers no performance benefit over the other schemes. The Phoenix scheme, on the other hand, while it generally performs well for all applications, underperforms for matrix multiply because it involves a join stage that is expensive and unnecessary for this application: since all the keys are distinct, none of the pairs have been grouped together by this stage, so the join operates on the full set of pairs.
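The benefit of the cache scheme for duplicate-heavy workloads can be illustrated with a sketch of local combining. The class and method names here are hypothetical and do not come from the framework's code:

    import java.util.HashMap;
    import java.util.Map;

    public class LocalCombiner {

        private final Map<String, Integer> cache = new HashMap<>();

        // For word count, repeated words hit the same entry, so the number
        // of pairs handed to the rest of the pipeline shrinks dramatically.
        public void emit(String key) {
            cache.merge(key, 1, Integer::sum);
        }

        // With all-distinct keys (matrix multiply), every emit creates a
        // fresh entry and the cache adds overhead without saving anything.
        public Map<String, Integer> pairs() {
            return cache;
        }
    }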
Our third observation is that the machine significantly affects the performance of some implementation schemes. For example, the parStore scheme, which introduces parallelism in the Store muscle, exhibits quite different behavior on the two machines. This is apparent in the applications that produce many pairs, i.e. word count and inverted index. For inverted index, we can see that on System1 the parallel store is actually worse than the sequential store (seqStore), while on System2 it remains at about the same level. The reason for this is most likely the different memory hierarchies of the two machines.
Inverted index is characterized by only 10 distinct keys and 2 million pairs. This means that the store threads are essentially constantly updating the values associated with these ten keys, which are stored in the ConcurrentHashMap. With more store threads, we get more cache invalidations, and this is probably the reason for the decreased performance compared to the sequential Store on System1. The cache architecture of System2, on the other hand, seems to handle the many cache invalidations more efficiently, probably because more processors are grouped together on the same chip and share a common L3 cache, so updated values are exchanged between the processors more efficiently. We can also see that the cache implementation scheme generally performs better on System2. Again, we believe the reason is that the architecture of System2 is able to exploit locality better, so there is plenty of synergy with the cache implementation scheme.
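The access pattern behind this effect can be sketched as follows. The thread and iteration counts are arbitrary assumptions for illustration, not the measured benchmark:

    import java.util.concurrent.ConcurrentHashMap;

    public class StoreContention {

        public static void main(String[] args) throws InterruptedException {
            ConcurrentHashMap<Integer, Long> table = new ConcurrentHashMap<>();
            int storeThreads = 4;
            Thread[] threads = new Thread[storeThreads];
            for (int t = 0; t < storeThreads; t++) {
                threads[t] = new Thread(() -> {
                    // Each update touches one of only 10 entries, so the
                    // cache lines holding them bounce between the cores.
                    for (int i = 0; i < 200_000; i++) {
                        table.merge(i % 10, 1L, Long::sum);
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) {
                t.join();
            }
            System.out.println(table);
        }
    }

With so few keys, adding threads increases contention on the same entries rather than spreading the work, which matches the behavior observed on System1.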
In the next sections, we narrow the problem space by further evaluating only the Phoenix scheme, as it generally performs better than the other schemes for most of the applications, on both machines. The next section investigates how the performance of this scheme varies as we change the number of workers. In addition, we will not further evaluate the Histogram application, as its performance is quite poor according to these preliminary results.
5.3 Evaluation of the Phoenix Scheme
Figure 5.2 shows the speedup of the implementation for different numbers of workers. In general, the performance scales almost linearly as we increase the number of workers, up to the point where the number of workers equals the number of cores in the machine, and then drops with any further increase (i.e. word count, inverted index, matrix multiply) or remains at about the same level (i.e. KMeans). As the number of workers is increased past the number of cores in the machine, we get no further benefit from parallel execution, since not all workers can be executed in parallel anymore. Moreover, each worker has its own data structure to store the intermediate pairs, which means that the more workers we have, the more dispersed the intermediate pairs are across the different data structures. This slows down the join phase, which in turn leads to the performance drop that we can observe in the figure.
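The join cost described above can be illustrated with a hypothetical sketch that merges per-worker intermediate maps; all names and the map-of-lists representation are our assumptions:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class JoinPhase {

        // Merges the intermediate maps of all workers into one grouped map.
        // With more workers, each key's values are scattered across more
        // maps, so this pass over all of them grows more expensive.
        public static Map<String, List<Integer>> join(
                List<Map<String, List<Integer>>> perWorker) {
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (Map<String, List<Integer>> workerMap : perWorker) {
                for (Map.Entry<String, List<Integer>> e : workerMap.entrySet()) {
                    grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                           .addAll(e.getValue());
                }
            }
            return grouped;
        }
    }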
Figure 5.2: The speedup with increasing number of workers for System1
Figure 5.3 shows the results when running the same experiment on System2. The
behavior is similar to the previous case for all applications apart from word count,
where we can observe a slight increase in the speedup even when the number of work-
ers increases past twelve, which is the number of cores in the machine.
Figure 5.3: The speedup with increasing number of workers for System2