Chapter 2. Related Background
runtime provides a default partitioning function based on key hashing.
The Phoenix runtime was developed on top of Pthreads. At the core of the runtime
system there is a scheduler that dynamically assigns map and reduce tasks to worker
threads. Additionally, the scheduler provides support for fault tolerance by re-executing
tasks that take too long to finish and by blacklisting workers that repeatedly report a task
failure. The first version of Phoenix was shown to exhibit performance comparable to
Pthreads for applications that are easily expressed with the MapReduce model.
The Phoenix system was further optimized in [13] and in [10]. The study presented
in [13] optimizes the runtime for large-scale systems with NUMA characteristics by
replacing the original scheduler with a more sophisticated one that sends tasks to
worker threads according to the memory location of the corresponding data chunk. The
study presented in [10] shows that the static pipeline that Phoenix adopts is inefficient
for many workloads and introduces a modular, application-aware scheme in which
users can adapt performance-critical elements of the pipeline to match characteristics
of the workload, e.g., the number and the distribution of keys. For example, the user can
choose the internal data structure that most efficiently stores the intermediate key-value
pairs.
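The default key-hashing partitioner mentioned at the start of this section can be sketched as follows. Phoenix's actual partitioner is written in C; the Java class and method names below are purely illustrative, showing only the general technique of mapping an intermediate key to one of a fixed number of reduce tasks:

```java
import java.util.Objects;

// Illustrative sketch of a hash-based partitioning function, in the
// spirit of Phoenix's default partitioner (names are hypothetical).
class HashPartitioner {
    // Map an intermediate key to one of `numReducers` reduce tasks.
    static int partition(Object key, int numReducers) {
        // Mask the sign bit so the result is always a valid index
        // in the range [0, numReducers).
        return (Objects.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }
}
```

The important property is determinism: every occurrence of the same key hashes to the same reduce task, so all values for that key end up at a single reducer.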
2.4 The Skandium Library
Skandium is a recent skeletal library that targets shared-memory architectures; it is
a re-implementation of Calcium [7], which targets distributed parallel machines. The
library provides the most common parallel algorithmic skeletons, such as the task farm
and the pipeline, which can be arbitrarily nested and combined into more complex structures.
The library completely hides threading from the user, who can focus on the functional
part of the problem at hand. In the following sections, Skandium's architecture and its
programming model are introduced.
2.4.1 The programming model
The user of the Skandium library is presented with a set of classes, each of which
corresponds to a skeleton. The skeleton classes have no embedded distributed behavior;
their sole objective is to allow type-safe composition of the skeleton pattern by
holding references to sub-skeletons and to muscle objects, which provide the business
logic of the application [11]. The library defines four interfaces that correspond to
four different kinds of muscles: Execution (fe), Split (fs), Merge (fm) and Condition
(fc), each of which requires the programmer to implement a specific method. The
programmer creates an instance of the skeleton that best fits the problem to be
solved, implements the muscle interfaces required by the skeleton and passes the
muscle objects to the skeleton's constructor. To keep the programming model
concise and as error-free as possible, Skandium is based on two critical hypotheses
[12]:
Single input/output: A skeleton pattern is limited to receiving only one input and
producing only one result at a time. This strict hypothesis is adopted for the sake of
simplicity in the overall program structure, as well as to ease skeleton nesting and to
reduce programming errors.
Passive Skeletons: Skandium does not allow skeletons to generate output without
receiving an input. This precondition is helpful in detecting termination conditions.
When the programmer creates an instance of the skeleton and inputs the actual data,
the Skandium runtime takes over, orchestrating the execution of the muscles either
sequentially or in parallel according to the definition of the skeleton. The next section
describes the functionality of the runtime system in more detail.
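The division of labor described above, where the user supplies only business logic through muscle objects, can be illustrated with simplified stand-ins for the four muscle interfaces. The declarations below are approximations for exposition, not Skandium's exact signatures or package names:

```java
// Simplified stand-ins for Skandium's four muscle interfaces
// (signatures are illustrative, not the library's exact declarations).
interface Execute<P, R>  { R execute(P param) throws Exception; }
interface Split<P, R>    { R[] split(P param) throws Exception; }
interface Merge<P, R>    { R merge(P[] results) throws Exception; }
interface Condition<P>   { boolean condition(P param) throws Exception; }

// The user implements only the business logic; for example, an
// Execution muscle that squares an integer. No threading code
// appears anywhere in user-written muscles.
class SquareMuscle implements Execute<Integer, Integer> {
    public Integer execute(Integer x) { return x * x; }
}
```

A muscle object such as `SquareMuscle` would then be passed to a skeleton's constructor, and the runtime decides when and on which thread its method runs.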
2.4.2 The runtime system
The internal representation of each skeleton is called an instruction and, unlike the
skeleton class, it is hidden from the programmer. The nested skeleton tree is automatically
transformed into an instruction stack by the runtime. Instructions from the top of the
stack are interpreted on the computation nodes and, when data parallelism is encountered,
new instruction stacks are generated. Each instruction stack is encapsulated
inside a task, along with references to subtasks and information such as performance
metrics.
The internal architecture of Skandium is depicted in Figure 2.2. Each newly generated
task that is stored on the ready queue is sent for execution in the thread pool. New
subtasks are dynamically created, and those that are ready are reinserted into the queue,
waiting for a worker thread to handle their execution. Tasks can also be in a waiting
state, where they have generated new subtasks and are waiting for them to finish, or in
a finished state, where the computation is complete and results are returned. Depending
on the skeleton that was initially selected, a different set of instructions is interpreted
to achieve parallelism in each case.
Figure 2.2: Skandium’s Internal Architecture. Figure taken from [12]
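The task life cycle just described can be sketched in a deliberately simplified, single-threaded form. The class and field names below are hypothetical and do not reflect Skandium's actual internals; in the real runtime, a pool of worker threads drains the ready queue concurrently:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, single-threaded sketch of the task life cycle
// described above (names are illustrative, not Skandium's internals).
enum TaskState { READY, WAITING, FINISHED }

class SkeletonTask {
    TaskState state = TaskState.READY;
    final Runnable body;
    SkeletonTask(Runnable body) { this.body = body; }
}

class ReadyQueueSketch {
    // Ready tasks wait here for a worker to pick them up.
    private final Deque<SkeletonTask> readyQueue = new ArrayDeque<>();

    void submit(SkeletonTask task) { readyQueue.add(task); }

    // Drain the queue the way a single worker thread would:
    // run each task's body, then mark it finished.
    void runAll() {
        while (!readyQueue.isEmpty()) {
            SkeletonTask task = readyQueue.poll();
            task.body.run();
            task.state = TaskState.FINISHED;
        }
    }
}
```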
2.4.3 The Skandium Map Skeleton
As we will see in chapter 4, where the implementation details of the MapReduce
skeleton are presented, the MapReduce skeleton has been built on top of Skandium
Map skeletons. For this reason, we take some time at this point to present the Skandium
Map skeleton along with an example of its use. This section provides better insight
into Skandium's programming interface. Given this knowledge, the design decisions
about the MapReduce interface that will be presented in chapter 4 will be easier to
justify.
The Skandium Map skeleton is depicted in Figure 2.3. The Map skeleton requires
the user to define a split muscle, which splits a single element into multiple elements;
an execute muscle, several instances of which are spawned and executed on each split;
and a merge muscle, which unites the outputs of the execute muscles back into a single
result. As with other Skandium skeletons, the user is free to use any object to transfer
the computation between the muscles. These objects can be either new types defined
by the programmer or existing types.
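The split/execute/merge dataflow can be illustrated with a self-contained sketch that applies the three muscles in sequence. All names here are hypothetical, and the muscles are simple static methods rather than Skandium muscle objects; in the real library the execute instances would run in parallel under the runtime's control:

```java
import java.util.Arrays;

// Sequential illustration of the Map skeleton's dataflow:
// split one input into parts, execute on each part, merge the
// partial results. Names and signatures are illustrative only.
class MapSketch {
    // Split muscle: divide the input array into two halves.
    static int[][] split(int[] input) {
        int mid = input.length / 2;
        return new int[][] {
            Arrays.copyOfRange(input, 0, mid),
            Arrays.copyOfRange(input, mid, input.length)
        };
    }

    // Execute muscle: sum one part (one instance runs per split).
    static int execute(int[] part) {
        int sum = 0;
        for (int v : part) sum += v;
        return sum;
    }

    // Merge muscle: unite the partial results into a single value.
    static int merge(int[] partials) {
        int total = 0;
        for (int p : partials) total += p;
        return total;
    }

    static int run(int[] input) {
        int[][] parts = split(input);
        int[] partials = new int[parts.length];
        for (int i = 0; i < parts.length; i++) partials[i] = execute(parts[i]);
        return merge(partials);
    }
}
```

Because the splits are independent, the execute calls carry no shared state, which is precisely what lets the runtime schedule them on different worker threads.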
As an example of the skeleton's use, let us consider the case where the user wants
to compute the number π up to a specified number of decimals using the Bailey-
Borwein-Plouffe formula. This example is taken from [3]. The user first defines an
Interval object that holds the start and the end positions of the decimals to compute. He
then inputs the Interval to the split function, which creates multiple non-overlapping