A mapReduce Skeleton for Skandium

Yüklə 427,52 Kb.

Pdf görüntüsü

səhifə	6/19
tarix	05.03.2018
ölçüsü	427,52 Kb.
	#30153

1 2 3 4 5 6 7 8 9 ... 19

Chapter 2. Related Background

Figure 2.3: The Skandium Map skeleton

internals that represent the original interval and passes each interval to a different ex-

ecute muscle. Each execute muscle performs computations in parallel to compute the

decimals of Pi for a given interval end outputs a BigDecimal object. The BigDecimals

from all the execute muscles are then combined into a single BigDecimal that forms

the ﬁnal number.

From the above example it is apparent that the threading issues are completely

hidden from the programmer and the programming model is quite ﬂexible, as the pro-

grammer is free to deﬁne apart from the execute muscle which is the core computation,

the muscle that splits the input, the muscle that merges the result as well as the in-

put/output parameters. However, as we will see in chapter 3 this ﬂexible programming

model is problematic for skeletons like MapReduce where we need better support from

the runtime. We will discuss this problem in detail in the next chapter.

2.5

Typical MapReduce Applications

In this section, we present some typical MapReduce applications that were used for

evaluating the skeleton’s performance. As we will discuss in detail in chapter 5, the

skeleton’s performance is sensitive to the characteristics of the input i.e. the total

number of the emitted key-value pairs and the distribution of the distinct keys. For this

reason, we also highlight in this section the distinct characteristics of each application

Chapter 2. Related Background

that affect the skeleton’s performance.

2.5.1

Word Count

The word count application counts the number of occurrences of each word in a given

input text ﬁle. The Map tasks processes different sections of the original text ﬁle, and

emit key-value pairs where the key is a word of the original text and the value is 1, to

indicate that the word was found. The reducers add up all the values for each word.

The word count application typically produces a big number of intermediate pairs,

depending on the size of the original text. The number of duplicate keys is usually

small but considerable. Both the total number of the emitted pairs and the number of

duplicate keys is unpredictable in the general case.

2.5.2

Inverted Index

The inverted index application generates position lists for words in a document. The

input is a text ﬁle along with a user-speciﬁed list of words. The Map tasks scan different

sections of the original text ﬁle and emit pairs, where the key is one of the words of

the input list and the value is the position in the text where the word was found. The

reducer emits a {word,list(position)} pair.

The Inverted Index application typically produces a big number of intermediate

pairs. The number depends on the size of the original text and the frequency of oc-

currence of the user-speciﬁed words in the text. The total number of emitted pairs is

generally unpredictable. However the number of distinct keys is known and it is equal

to the size of the user-speciﬁed list of words. Because the user-speciﬁed list is usually

small, the percentage of the duplicate keys is typically quite big.

2.5.3

Matrix Multiplication

Each map task computes the results for a set of rows of the result matrix and emits

key-value pairs where the key is the (x,y) location of each element and the value is the

result of the computation. The reduce task is just the identity function. The emitted

keys in this application are all distinct and their number is known.

Chapter 2. Related Background

13

2.5.4

Histogram

The Histogram application takes as input a bitmap image and calculates the distribution

of colors in it (Figure 2.4). Different sections of the image are are assigned to different

Map tasks which compute the occurrence of the values of the RGB components of the

pixels. For each pixel, the mapper emits three pairs, one pair for each component. For

the red component it emits the value of the ﬁrst byte of the pixel as the key along with

the value 1 that indicates occurrence. For the green component it emits the value of the

second byte of the pixel plus 255 as the key along with the value of 1. Finally for the

blue component it emits the value of the third byte of the pixel plus 512 as the key and

the value 1. The reducer adds up all the values for each RGB value.

Figure 2.4: Example of a color histogram. Picture taken from [2]

This application typically produces a very big number of pairs for a high-resolution

image since three pairs are emitted for each pixel. The total number of pairs is known,

given the resolution of the image. In addition, all the emitted keys fall in the range

0-767 which means that the number of distinct keys is also known. The percentage of

duplicates in this application is quite big for a high resolution image.

2.5.5

KMeans

This application groups a set of input of points in the n-dimensional space into k clus-

ters (Figure 2.5). The algorithm ﬁrst randomly chooses k of the input points to act

as centroids. It then calculates the Euclidean distance between the input points and

the centroids and assigns each input point to the group that has the nearest centroid.

When all objects have been assigned to a cluster, it recalculates the positions of the

Yüklə 427,52 Kb.

Dostları ilə paylaş:

1 2 3 4 5 6 7 8 9 ... 19