Chapter 6. Optimization of the MapReduce Skeleton
the lookup overhead in the pool. However, it begins to pay back for smaller sizes.
Matrix multiplication is shown in Figure 6.7. Unlike the previous two applications,
where many keys are duplicated, in matrix multiplication all the keys are distinct,
which means that the key pool provides no benefit. In addition, the lookup overhead
leads to worse performance than the unoptimized version, as illustrated in the figure.
Figure 6.7: The Speedup as the Heap Size decreases with and without the key pool for
matrix multiplication
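For illustration, a key pool of this kind could be sketched as below; the names are hypothetical and this is not the thesis's actual implementation. With many duplicate keys the pool lets pairs share one canonical key object, while with all-distinct keys every lookup is pure overhead:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a key pool: one canonical object is kept per
// distinct key, so pairs with duplicate keys share a single instance
// instead of allocating a new key object for every emitted pair.
class KeyPool<K> {
    private final ConcurrentHashMap<K, K> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for 'key', inserting it on first sight.
    // When all keys are distinct, every call pays the hash lookup and still
    // inserts, which is why matrix multiplication sees no benefit.
    K intern(K key) {
        K canonical = pool.putIfAbsent(key, key);
        return canonical != null ? canonical : key;
    }

    // The pool size doubles as a count of the distinct keys seen so far.
    int distinctKeys() {
        return pool.size();
    }
}
```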
6.2.2 Tuning the Garbage Collector
In this section, we present how the impact of the garbage collector can be further
reduced if we tune some of the parameters that are provided by the JVM.
Our first tuning is to increase the size of the new generation. Since we have a large
number of short-lived objects when we use the key pool, it should be beneficial to have
a large new generation as well. Figure 6.8 shows the speedup for word count and
inverted index as we decrease the heap size. We use the key pool mechanism
as in the previous experiment and, in addition, we set the old-to-new generation
ratio to 1 instead of the default value of 2. The results show that we do indeed gain
performance by increasing the size of the new generation.
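For reference, the generation sizing described above maps to standard HotSpot command-line flags; the heap size and class name below are placeholders, not the actual experimental setup:

```
# -XX:NewRatio gives the ratio of the old generation to the new generation.
# The default of 2 gives the new generation one third of the heap;
# setting it to 1 gives the new generation half of the heap.
java -Xmx512m -XX:NewRatio=1 WordCountApp
```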
Figure 6.8: The Speedup as the Heap Size decreases using a larger new generation
for word count (left) and inverted index (right)

Our second setting is to change the garbage collector. The JVM in our system
uses by default the parallel scavenge collector in the new generation, which is
optimized for large heaps. We experimented with the parallel GC, which performs better
with smaller heaps. The results are shown in Figure 6.9 for the word count application.
As expected, the parallel collector performs worse than the parallel scavenge collector
for large heaps, but it begins to perform better as the heap shrinks. Although we
experimented with the inverted index application as well, we did not observe the same
pattern; the performance of the two collectors was approximately the same, probably
because inverted index triggers fewer garbage collections on average.
Figure 6.9: The Speedup as the Heap Size decreases using the parallel GC for the
word count application
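The collector itself is also selected through HotSpot flags. The exact flag names are version-dependent, so the line below is an assumption based on the HotSpot collectors of that era, where -XX:+UseParallelGC selects the parallel scavenge young-generation collector and -XX:+UseParNewGC selects the alternative parallel young-generation collector:

```
# Select an alternative parallel young-generation collector instead of the
# default parallel scavenge (flag names are JVM-version dependent).
java -Xmx256m -XX:+UseParNewGC WordCountApp
```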
6.3 Auto Tuning
Our results from the previous sections have shown that the parameters of the internal
data structures (i.e. size or type) used by the skeleton can significantly affect its per-
formance and a correct setting can lead to a considerably better speedup. However,
according to the same results, there is no one-size-fits-all solution and the optimal pa-
rameters generally depend on the characteristics of the input. This means that in order
to achieve a good speedup for a variety of applications, our implementation should be
able to dynamically adapt to the workload characteristics. There are two ways that this
can be achieved.
The first way is to expose the system’s implementation details and allow the user
to tune the implementation for the problem he is trying to solve, i.e., the user can set
the size of the intermediate data structure, select one of the many different data
structures that the system provides, or even implement his own and plug it into the
system. This approach is followed by Phoenix++.
The second way is to make the system able to auto-tune and automatically select
the best parameters for a particular workload. We follow this approach in our imple-
mentation. Exposing the core system functionality to the user leads us into the
trap of granting too much responsibility to the programmer. This may still not be a
significant problem for Phoenix, which provides only the MapReduce skeleton, and the
user can easily learn how to tune a few parameters just to achieve better performance.
However, this solution is clearly not applicable in our case, where we see MapReduce
as just another algorithmic skeleton in the toolbox of the parallel systems program-
mer, integrated into a skeletal library. The fact that each skeleton would require its own
tuning parameters, and the fact that a parallel program typically consists of not one
but several skeletons, probably nested, that cooperate to solve a problem, make the
user’s tuning non-trivial and require effort that would probably distract the user from
the conceptually straightforward idea of skeletal programming.
6.3.1 Auto-tuning for the MapReduce skeleton
Considering how we can implement auto-tuning for the MapReduce skeleton, we can
make two key observations about our implementation. The first is that the store phase
begins after the map phase. Thus, by the time the store phase begins, the system
knows exactly the total number of pairs that have been emitted from the mappers.
The second observation is that the key pool mechanism that we described in the
previous section can easily be used as a means to count the number of distinct keys
in the application. The pool lookup happens when the pairs are created, well before
the store stage. This means that at the beginning of the store stage the system already
knows two critical workload characteristics, i.e., the total number of pairs and the
number of distinct keys. The system could then use this knowledge to adapt the
intermediate data structure to the workload. For example, a simple auto-tuning
mechanism could select the size of the hash array to be equal or proportional to
the number of distinct keys in the application.
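For illustration, such a sizing rule could look like the following sketch; the class name and the load-factor parameter are assumptions, not part of the thesis implementation:

```java
// Hypothetical auto-tuning step: size the intermediate hash array from the
// distinct-key count gathered by the key pool, rounded up to the next power
// of two so that indexing can use cheap bit masking.
final class HashSizing {
    static int hashArraySize(int distinctKeys, double loadFactor) {
        int needed = (int) Math.ceil(distinctKeys / loadFactor);
        int size = 1;
        while (size < needed) {
            size <<= 1; // next power of two >= needed
        }
        return size;
    }
}
```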
However, this simple mechanism solves only one aspect of the problem. The per-
formance of the application does not depend solely on the input; it depends on the
underlying hardware as well. This is reflected in our results in section 6.1.1, where the
memory subsystem imposes a hard limit on the size of the data structure that is used to
hold the emitted pairs, and if the size is increased past that limit, performance is
seriously harmed. In other words, an auto-tuning solution will fail unless it takes
into account the characteristics of the machine and how they affect the implementation.
The hardware interacts with the implementation in non-obvious ways. For exam-
ple, in our results in section 6.1.1, one might expect the two systems to exhibit
similar behavior since their L1 caches are of the same size. However, this is not the
case; apparently the whole cache hierarchy, the interaction between the caches at
different levels, and the cache sharing all affect the skeleton’s performance.
This means that the implementation of an auto-tuning scheme becomes even more chal-
lenging, since all these interactions should be taken into account. At the same time,
this complexity gives another good reason why tuning should definitely be abstracted
away from the programmer.
In Java, on top of all that, we have the added complexity introduced by the JVM. We
showed in section 5.5 how the automatic memory management of the Java platform
seriously affects performance and how this penalty can be reduced by tuning some
parameters of the garbage collector. Again, there is no one-size-fits-all solution, and
the optimal parameters depend on the input (number and size of the pair objects) and
on the machine (available memory for the heap, number of cores).
Finally, the MapReduce skeleton is not a stand-alone implementation but is in-
tegrated into a skeletal library. This means that it can be nested and combined with
other skeletons to create more complex structures. In this environment, the configura-
tion of one skeleton may interact negatively with the configuration of other skeletons
in non-obvious ways. This means that the auto-tuning framework should also consider
the skeleton’s environment.
To summarize, an auto-tuning scheme should take into account the workload char-
acteristics, the underlying hardware, the JVM and the interactions between the skele-
tons. Because of this complexity, implementing a proper auto-tuning scheme is beyond
the scope of this work. However, in the next section we present a very simple auto-tuning
mechanism that we implemented in the context of this work. This mechanism allows
the system to bypass parts of the pipeline that are not necessary for specific types of
applications.
6.3.2 A Simple Auto-tuning Mechanism
Our auto-tuning mechanism is useful for applications like matrix multiplication, where
all the keys emitted from the mappers are distinct and there is no reduce stage. For
these applications, the generic MapReduce pipeline introduces unnecessary overhead
due to the pair-manipulation stages. Figure 6.10 illustrates our solution. The Store,
Join and Reduce stages of the pipeline have been nested in a Skandium If skeleton.
The other branch of the skeleton is an identity muscle which simply returns the value at
its input. The condition of the If skeleton is whether all the keys are distinct, which
we can determine by checking whether the total number of pairs is equal to the number
of distinct keys. As described above, both quantities are known by this stage. If the
condition of the If skeleton holds, the key-value pairs emitted from the mappers are
simply returned to the user without the unnecessary reduction and key-manipulation
stages. Otherwise the pairs pass through the MapReduce pipeline.
Figure 6.10: High Level view of the auto-tuning mechanism for MapReduce
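The bypass decision described above can be sketched as follows; the class and method names are illustrative, and the actual Skandium If-skeleton wiring is omitted:

```java
import java.util.List;

// Illustrative condition for the Skandium If skeleton: the Store, Join and
// Reduce stages are bypassed when every emitted key is distinct, i.e. when
// the total pair count equals the distinct-key count from the key pool.
final class AllKeysDistinct {
    static boolean test(long totalPairs, long distinctKeys) {
        return totalPairs == distinctKeys;
    }

    // Identity branch of the If skeleton: pairs are returned unchanged.
    static <P> List<P> identity(List<P> pairs) {
        return pairs;
    }
}
```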
Figure 6.11 illustrates how this simple mechanism achieves good benefits for
matrix multiplication on both systems.
Figure 6.11: The auto-tuning benefits compared to the unoptimized version
Chapter 7
Conclusions and Future Work
In this project, we implemented the MapReduce skeleton for the parallel patterns li-
brary Skandium. We then evaluated the skeleton for a selection of typical MapReduce
applications and optimized its performance.
Based on our experience with the integration of the skeleton into Skandium and on
our experimental results from the performance evaluation and optimization phases of
the project, we can draw the following conclusions:
1. The high level of abstraction in Java gives us certain advantages regarding the
implementation of the skeleton. The language’s high-level constructs provide the flex-
ibility that allowed us to express a high-level programming model like MapReduce
effortlessly. One of the abstractions that we use extensively in our implementation is
Java’s Collections Framework. The Collection interface is used in the interaction
between the user code and the library code, and allows the two to inter-operate
seamlessly although they were written independently. It also provides great flexibility
on the user’s side as well as on the system’s side: the user can select the most suitable
Collection implementation for his problem and use it to interact with the system.
Similarly, the system can easily be tuned by switching the collection implementation
it uses to interact with the user, with no changes to the interface. Another abstraction
that we use is the Map interface and its concrete implementations, which associate a
key with a value. This interface provides a very natural and efficient data structure for
storing the intermediate key-value pairs. However, Java’s high level of abstraction and
the clear and flexible solutions that it provides come at a significant cost to performance.
2. Java’s automatic memory management can have a serious impact on the skele-
ton’s performance. For smaller heap sizes, the garbage collector can take significant
CPU time from the application threads, seriously degrading the overall performance
of a MapReduce application. To minimize the performance penalty, the MapReduce
skeleton should be implemented in a way that keeps its memory requirements
low, and careful tuning of the garbage collector may also be required.
3. Our solution provides an easy-to-use programming model where the threading
issues and the generic functionality of the MapReduce skeleton, other than the applica-
tion-dependent map and reduce functions, are completely hidden from the programmer.
However, it lacks performance in comparison with a careful manual threading imple-
mentation. This must be attributed to the extensive use of Java’s abstract structures
in our implementation, as well as to the fact that the MapReduce skeleton is not a
stand-alone implementation but is implemented on top of existing Skandium skeletons.
4. The skeleton can achieve significantly better performance by adapting to the
characteristics of the workload. This should be achieved through auto-tuning rather
than exposing the skeleton’s implementation details to the programmer. Our results
have shown that an auto-tuning scheme should consider several factors apart from the
workload characteristics including the underlying hardware and the JVM configura-
tion.
7.1 Future Work
This project can be extended in many different directions. An obvious extension is
the further evaluation and optimization of the current implementation. In our evalua-
tion, we used data sets of the same size for each application. A future evaluation could
use datasets of different sizes and, in addition, expand the set of MapReduce
applications that are evaluated. This would give us better insight into how the behavior
of the skeleton changes according to the workload characteristics.
Our evaluation showed that the garbage collector is a very significant bottleneck, so
future optimization should aim at reducing the memory footprint of the implementa-
tion as much as possible. This should involve a more systematic way of measuring the
memory consumption of the MapReduce application and resolving possible trade-offs
between the performance and the memory requirements of the skeleton.
Another extension is the integration of more skeletons into Skandium. In this
project we extended the Skandium library to hold a repository of predefined mus-
cles that capture the generic functionality of the MapReduce skeleton, other than the
user-defined and application-dependent map and reduce functions. It would be inter-
esting to investigate to what extent this scheme is applicable to other skeletons and
how it facilitates their integration into the library. Yet another extension is the
integration of MapReduce into another skeletal library that targets shared-memory
architectures, to see how it compares to our implementation in terms of performance
as well as in terms of usability of the programming interface.
Finally, an interesting and challenging extension to this project would be in the
area of auto-tuning. Our results have shown that the performance of the skeleton
depends on a number of factors, i.e., the implementation parameters, the hardware and
the GC configuration. To handle this complexity, the auto-tuning framework could use
machine learning techniques to predict the optimal configuration that maximizes the
skeleton’s performance. For example, key hardware parameters (e.g. the size of the
caches, the cache topology, the CPU clock speed, the number of cores) as well as key
workload characteristics (distribution of keys, total pairs, pair size) could be used as
features for the learning algorithm, which would predict a suitable setting of the
implementation parameters (e.g. the size/type of the intermediate data structure, the
level of parallelism in each stage) as well as a good GC policy (the GC algorithm, the
new-to-old generation ratio).
Bibliography

[1] Matlab Central. http://www.mathworks.com/matlabcentral/fileexchange/19344-efficient-k-means-clustering-using-jit.

[2] Pixelero. http://pixelero.wordpress.com/2008/06/19/flash-10-bitmapdatahistogram/.

[3] Skandium. http://skandium.niclabs.cl/tutorial/.

[4] S. Borkar. Thousand core chips: a technology perspective. In Proceedings of the 44th Annual Design Automation Conference (DAC), pp. 746-749, 2007.

[5] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 13th IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2007.

[6] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. Cambridge, MA, USA: MIT Press, 1991.

[7] D. Caromel and M. Leyton. Fine tuning algorithmic skeletons. In 13th International Euro-Par Conference: Parallel Processing, ser. Lecture Notes in Computer Science, vol. 4641. Springer-Verlag, 2007.

[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, vol. 51, no. 1, pp. 107-113, 2008.

[9] J. Singer, G. Kovoor, G. Brown, and M. Luján. Garbage collection auto-tuning for Java MapReduce on multi-cores. In Proceedings of the International Symposium on Memory Management (ISMM), 2011.

[10] J. Talbot, R. M. Yoo, and C. Kozyrakis. Phoenix++: modular MapReduce for shared-memory systems. In MapReduce '11, June 8, 2011, San Jose, California, USA, 2011.

[11] M. Leyton. Advanced features for algorithmic skeleton programming. PhD thesis, Université de Nice - Sophia Antipolis, 2008.

[12] M. Leyton and J. M. Piquer. Skandium: multi-core programming with algorithmic skeletons. In Proceedings of the 18th Euromicro PDP. IEEE CS Press, Pisa, 2010.

[13] R. M. Yoo, A. Romano, and C. Kozyrakis. Phoenix rebirth: scalable MapReduce on a large-scale shared-memory system. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009.

[14] Y. Mao, R. Morris, and F. Kaashoek. Optimizing MapReduce for multicore architectures. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2010.

[15] Y. Zhao, J. Shi, K. Zheng, H. Wang, H. Lin, and L. Shao. Allocation wall: a limiting factor of Java applications on emerging multi-core platforms. ACM SIGPLAN Notices, 2009.