Chapter 3
MapReduce for Skandium: The programming model
This chapter presents how the MapReduce skeleton has been integrated into the Skandium
library, and describes the skeleton’s programming interface along with an example of
its use. A detailed discussion of the skeleton’s internal structure follows in the next
chapter.
3.1 The MapReduce integration into Skandium
Skandium’s runtime system completely hides threading issues from the programmer.
However, Skandium provides only a thin layer of functionality: while the library
successfully handles parallel execution, the programmer is still responsible for
implementing much of the overall functionality, i.e. the split and merge muscles of
the Map skeleton and the input/output parameters of each muscle.
Most importantly, the programmer defines muscles that are not related to the problem
at hand but exist only to support parallel execution, i.e. the split and merge muscles
required by the Map skeleton. A more abstract programming model would require the
programmer to implement only the execute muscle, hiding the split and merge muscles
in the runtime. However, the programming model in this case would inevitably be more
restricted, e.g. the input of the skeleton would have to be of a certain type. This more
restricted model might make the skeleton unsuitable for certain applications.
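The trade-off described above can be sketched in a few lines of Java. The interfaces below are simplified, hypothetical stand-ins for the split, execute and merge muscles, not Skandium's actual API, and the skeleton runs sequentially for brevity. The sketch contrasts a flexible Map, where the programmer supplies all three muscles, with a restricted variant where only the execute muscle is user code and the input is forced to be a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical stand-ins for the three muscle kinds of a Map skeleton.
// They are NOT Skandium's real interfaces, and execution here is sequential.
final class MapSkeletonSketch {
    interface Split<P, S> { List<S> split(P input); }
    interface Execute<S, T> { T execute(S part); }
    interface Merge<T, R> { R merge(List<T> results); }

    // Flexible model: the programmer supplies all three muscles.
    static <P, S, T, R> R map(Split<P, S> split, Execute<S, T> execute,
                              Merge<T, R> merge, P input) {
        List<T> results = new ArrayList<>();
        for (S part : split.split(input)) {
            results.add(execute.execute(part)); // a real runtime would run these in parallel
        }
        return merge.merge(results);
    }

    // Restricted model: only the execute muscle is user code, but the input
    // must be a List and the output is always a List.
    static <E, T> List<T> mapRestricted(Execute<E, T> execute, List<E> input) {
        Split<List<E>, E> split = list -> list;       // one sub-problem per element
        Merge<T, List<T>> merge = results -> results; // results kept in input order
        return map(split, execute, merge, input);
    }

    // Application-specific execute muscle: count the words in one line.
    static int countWords(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Word count in the flexible model: split and merge are written by the user too.
    static int wordCount(String text) {
        Split<String, String> byLine = t -> Arrays.asList(t.split("\n"));
        Merge<Integer, Integer> sum =
            counts -> counts.stream().mapToInt(Integer::intValue).sum();
        return map(byLine, MapSkeletonSketch::countWords, sum, text);
    }

    public static void main(String[] args) {
        System.out.println(wordCount("a b c\nd e"));                  // prints 5
        System.out.println(mapRestricted(MapSkeletonSketch::countWords,
                                         Arrays.asList("a b", "c"))); // prints [2, 1]
    }
}
```

In the restricted variant the user code shrinks to `countWords`, but a caller whose input is not already a list has to convert it first, which is the loss of flexibility the text refers to.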
In the general case, there is a clear trade-off between simplicity and flexibility.
However, in complex skeletons like MapReduce we can safely sacrifice some flexibility
for simplicity through abstraction. In MapReduce, unlike existing Skandium skeletons,
a considerable amount of functionality surrounds the application-specific Map and
Reduce muscles, i.e. the efficient storing and partitioning of the intermediate keys.
The code and effort needed to implement these additional muscles would overshadow the
code and effort needed to implement the application-specific Map and Reduce muscles,
which solve the simple problem at hand, e.g. counting words. This means that we cannot
strictly follow the flexible Skandium programming model for MapReduce, nor implement
the skeleton in the same way as other Skandium skeletons, because in the MapReduce
case the programmer needs far more support from the runtime system.
To enable this support, we extended the library with a repository of predefined
muscles that encapsulate all the application-independent MapReduce functionality.
When an instance of the skeleton is created, the runtime system nests and combines
existing Skandium skeletons and fills them with both the user-defined and the
library-defined muscles.
Figure 3.1 illustrates a high-level view of the skeleton. MapReduce has been
implemented in Skandium as a pipe of Skandium Map skeletons; the figure shows it as a
pipe of two such skeletons, one for the map stage and one for the reduce stage. Note
that this nesting represents one of our initial implementations; this simple scheme is
presented here for clarity and to give an overview of the skeleton’s integration into
the library. In our final implementation, where we wanted to achieve good performance
by parallelizing the store and partition muscles, the skeleton is a combination of
more than two Skandium Map skeletons. We discuss our different implementation schemes
in Chapter 4, where we present the implementation details of the skeleton.
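As a rough illustration of the simple two-stage scheme, the following Java sketch runs a map stage, stores and partitions the intermediate keys, and then runs a reduce stage. It is a sequential sketch under stated assumptions: the `Mapper` and `Reducer` interfaces, the `mapReduce` method and the hash-based partitioning are all hypothetical names and choices made for this example, not Skandium's API or the implementation discussed in Chapter 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Sequential sketch of the two-stage scheme; all names are hypothetical.
final class TwoStagePipeSketch {
    interface Mapper<I, K, V> { void map(I chunk, BiConsumer<K, V> emit); }
    interface Reducer<K, V> { V reduce(K key, List<V> values); }

    static <I, K extends Comparable<K>, V> Map<K, V> mapReduce(
            List<I> chunks, Mapper<I, K, V> mapper, Reducer<K, V> reducer, int partitions) {
        // Map stage: run the mapper on every chunk and store emitted values by key.
        Map<K, List<V>> store = new HashMap<>();
        for (I chunk : chunks) {
            mapper.map(chunk, (k, v) -> store.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        // Partition: assign every key to one partition by its hash code.
        List<Map<K, List<V>>> parts = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            parts.add(new HashMap<>());
        }
        for (Map.Entry<K, List<V>> e : store.entrySet()) {
            int p = Math.floorMod(e.getKey().hashCode(), partitions);
            parts.get(p).put(e.getKey(), e.getValue());
        }
        // Reduce stage: reduce each partition independently, then merge the results
        // into one map sorted by key.
        Map<K, V> result = new TreeMap<>();
        for (Map<K, List<V>> part : parts) {
            for (Map.Entry<K, List<V>> e : part.entrySet()) {
                result.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
            }
        }
        return result;
    }

    // The only application-specific code: a word-count mapper and reducer.
    static Map<String, Integer> wordCount(List<String> lines) {
        Mapper<String, String, Integer> mapper = (line, emit) -> {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) emit.accept(w, 1);
            }
        };
        Reducer<String, Integer> reducer =
            (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();
        return mapReduce(lines, mapper, reducer, 2);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In this sketch the store and partition steps sit between the two stages and run sequentially, which corresponds to the part the final implementation is said to parallelize.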
After the addition of the muscle repository to the library, the muscles that flesh
out the skeleton are of three types:
Application-specific muscles: These muscles are defined by the user. For the
MapReduce skeleton, these are the mapper and the reducer.
Completely generic muscles: These muscles are defined by the library. For the
MapReduce skeleton, these are the Store and Partition muscles. They are responsible
for storing and partitioning the keys in the intermediate data structure efficiently,
and they are generic across all applications. Typically these muscles are located at
the heart of the runtime system and always perform the same function regardless of
the application.
Partially generic muscles: These muscles are also defined by the library. For
MapReduce, these are the Split and Merge muscles. These muscles cover a wide range
of applications; however, there may be applications that they do not fit, e.g. when
we want to split a specific kind of input. These muscles are typically located at the
skeleton’s edge. Because we want flexibility at this point, they can be substituted
by user-defined muscles, unlike the completely generic muscles.
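The three muscle types can be illustrated with a small Java sketch. Everything in it is an assumption made for illustration: the class `MuscleKindsSketch`, its `withSplit` method and the line-based default split are hypothetical and do not reflect Skandium's interface. The mapper and reducer are passed in by the user (application-specific), the store step is private and fixed (completely generic), and the split has a library default that the user may replace (partially generic). The partition and merge muscles are left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch of the three muscle types; not Skandium's API.
final class MuscleKindsSketch {
    // Application-specific muscles: always written by the user.
    interface Mapper { void map(String chunk, BiConsumer<String, Integer> emit); }
    interface Reducer { int reduce(String key, List<Integer> values); }

    private final Mapper mapper;
    private final Reducer reducer;

    // Partially generic muscle: a library default (split by line) that can be replaced.
    private Function<String, List<String>> split = text -> Arrays.asList(text.split("\n"));

    MuscleKindsSketch(Mapper mapper, Reducer reducer) {
        this.mapper = mapper;
        this.reducer = reducer;
    }

    // Substitute the default split with a user-defined one.
    MuscleKindsSketch withSplit(Function<String, List<String>> custom) {
        this.split = custom;
        return this;
    }

    // Completely generic muscle: private, identical for every application.
    private static void store(Map<String, List<Integer>> table, String key, Integer value) {
        table.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    Map<String, Integer> run(String input) {
        Map<String, List<Integer>> table = new TreeMap<>();
        for (String chunk : split.apply(input)) {
            mapper.map(chunk, (k, v) -> store(table, k, v));
        }
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : table.entrySet()) {
            out.put(e.getKey(), reducer.reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    // Word count: the user writes only the mapper and the reducer.
    static MuscleKindsSketch wordCounter() {
        return new MuscleKindsSketch(
            (chunk, emit) -> {
                for (String w : chunk.trim().split("\\s+")) {
                    if (!w.isEmpty()) emit.accept(w, 1);
                }
            },
            (word, ones) -> ones.size());
    }

    public static void main(String[] args) {
        // Default split by line.
        System.out.println(wordCounter().run("a b\nb"));   // prints {a=1, b=2}
        // Records separated by ';' do not fit the default, so the split is substituted.
        System.out.println(wordCounter()
            .withSplit(t -> Arrays.asList(t.split(";")))
            .run("a b;b"));                                // prints {a=1, b=2}
    }
}
```

Keeping `store` private while exposing `withSplit` mirrors the distinction drawn above: flexibility is offered only at the skeleton's edge.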
Figure 3.1: A high-level view of MapReduce for Skandium
Our approach to integrating MapReduce into Skandium has the following advantages,
which can be generalized to the integration of other skeletons into the library as
well.
1. Since the layer of functionality of the Skandium runtime is thin, adding the
complex MapReduce functionality to Skandium’s core would not be a natural fit.
Our approach encapsulates all the functionality of the new model in muscle objects,
which lie at the system’s edge, with no changes to the core of the library.
2. The muscles are well-defined components, meaning that the added functionality
is localized instead of being scattered across the runtime system. This means that a
library muscle can easily be substituted by another, for example if it under-performs
on a different machine or for an application with distinct characteristics. Other
advantages of modular design, like code clarity and ease of debugging, are preserved
as well.
3. The muscles are components familiar to the Skandium user. If a library muscle
is not applicable to a specific application, the user can substitute the muscle with
their own, as in the case of the partially generic muscles.
However, the disadvantage of our approach is the performance overhead it adds.