Chapter 2. Related Background
11
Figure 2.3: The Skandium Map skeleton
internals that represent the original interval and passes each interval to a different ex-
ecute muscle. Each execute muscle performs computations in parallel to compute the
decimals of Pi for a given interval end outputs a BigDecimal object. The BigDecimals
from all the execute muscles are then combined into a single BigDecimal that forms
the final number.
From the above example it is apparent that the threading issues are completely
hidden from the programmer and the programming model is quite flexible, as the pro-
grammer is free to define apart from the execute muscle which is the core computation,
the muscle that splits the input, the muscle that merges the result as well as the in-
put/output parameters. However, as we will see in chapter 3 this flexible programming
model is problematic for skeletons like MapReduce where we need better support from
the runtime. We will discuss this problem in detail in the next chapter.
2.5
Typical MapReduce Applications
In this section, we present some typical MapReduce applications that were used for
evaluating the skeleton’s performance. As we will discuss in detail in chapter 5, the
skeleton’s performance is sensitive to the characteristics of the input i.e. the total
number of the emitted key-value pairs and the distribution of the distinct keys. For this
reason, we also highlight in this section the distinct characteristics of each application
Chapter 2. Related Background
12
that affect the skeleton’s performance.
2.5.1
Word Count
The word count application counts the number of occurrences of each word in a given
input text file. The Map tasks processes different sections of the original text file, and
emit key-value pairs where the key is a word of the original text and the value is 1, to
indicate that the word was found. The reducers add up all the values for each word.
The word count application typically produces a big number of intermediate pairs,
depending on the size of the original text. The number of duplicate keys is usually
small but considerable. Both the total number of the emitted pairs and the number of
duplicate keys is unpredictable in the general case.
2.5.2
Inverted Index
The inverted index application generates position lists for words in a document. The
input is a text file along with a user-specified list of words. The Map tasks scan different
sections of the original text file and emit pairs, where the key is one of the words of
the input list and the value is the position in the text where the word was found. The
reducer emits a {word,list(position)} pair.
The Inverted Index application typically produces a big number of intermediate
pairs. The number depends on the size of the original text and the frequency of oc-
currence of the user-specified words in the text. The total number of emitted pairs is
generally unpredictable. However the number of distinct keys is known and it is equal
to the size of the user-specified list of words. Because the user-specified list is usually
small, the percentage of the duplicate keys is typically quite big.
2.5.3
Matrix Multiplication
Each map task computes the results for a set of rows of the result matrix and emits
key-value pairs where the key is the (x,y) location of each element and the value is the
result of the computation. The reduce task is just the identity function. The emitted
keys in this application are all distinct and their number is known.
Chapter 2. Related Background
13
2.5.4
Histogram
The Histogram application takes as input a bitmap image and calculates the distribution
of colors in it (Figure 2.4). Different sections of the image are are assigned to different
Map tasks which compute the occurrence of the values of the RGB components of the
pixels. For each pixel, the mapper emits three pairs, one pair for each component. For
the red component it emits the value of the first byte of the pixel as the key along with
the value 1 that indicates occurrence. For the green component it emits the value of the
second byte of the pixel plus 255 as the key along with the value of 1. Finally for the
blue component it emits the value of the third byte of the pixel plus 512 as the key and
the value 1. The reducer adds up all the values for each RGB value.
Figure 2.4: Example of a color histogram. Picture taken from [2]
This application typically produces a very big number of pairs for a high-resolution
image since three pairs are emitted for each pixel. The total number of pairs is known,
given the resolution of the image. In addition, all the emitted keys fall in the range
0-767 which means that the number of distinct keys is also known. The percentage of
duplicates in this application is quite big for a high resolution image.
2.5.5
KMeans
This application groups a set of input of points in the n-dimensional space into k clus-
ters (Figure 2.5). The algorithm first randomly chooses k of the input points to act
as centroids. It then calculates the Euclidean distance between the input points and
the centroids and assigns each input point to the group that has the nearest centroid.
When all objects have been assigned to a cluster, it recalculates the positions of the