Chapter 2. Related Background
14
k centroids for each cluster by calculating the mean vector of all the assigned points.
The final two steps are repeated until the centroids no longer move.
Figure 2.5: Example of the KMeans clustering in the 2-d space. Picture taken from [1]
In each iteration, each mapper takes as input a subset of the data points along with
the positions of the current centroids. It then calculates the distance between each point
and each centroid and for each point, it emits the id of the nearest cluster as the key
and the point itself as value. The points with the same key are then sent to the same
reducer which calculates their mean vector. The reducer emits the cluster id as the key
and the new centroid as the value which are passed to the mapper’s input.
In each iteration, the mapper produces a known number of pairs which is equal to
the number of input points. The number of distinct keys is also known and it is equal
to the number of clusters. Typically, the number of clusters is small comparing to the
number of points so the percentage of duplicate keys is quite significant.
2.6
Garbage Collection principles
One of the most serious performance issues when building a MapReduce framework
in Java is the way the MapReduce application interacts with the Garbage Collector [9].
One of our goals in this project, is to investigate how the garbage collector affects our
MapReduce implementation. We quantify the garbage collection bottleneck for our
implementation in the performance evaluation chapter (chapter 5), while in the opti-
mization chapter (chapter 6), we present some implementation mechanisms and tuning
settings that minimize the garbage collector’s impact on the MapReduce skeleton. This
Chapter 2. Related Background
15
introductory section is dedicated to present the principles behind Garbage Collection
and the basic tuning options that are provided by the JVM.
The JVM makes use of basically two garbage collector algorithms: the copying
collector and the mark-compact collector.
The copying collector divides the heap into two equally sized semi-spaces one
of which is active at any given time. During a garbage collection, all live objects are
copied from the active space to the inactive space. At the end of the garbage collection,
the roles of the spaces are flipped, with the old inactive space becoming the new active
space. The inactive space now contains all the dead objects which can be collected
very quickly at the next garbage collection. The advantage of this algorithm is that it
does not visit the dead objects at all, meaning that it performs very well when most of
the objects in the program are short-lived. On the other hand, it performs poorly when
the objects are mostly long-lived since it repeatedly copies them back and forth from
one semi-space to another.
The mark-compact garbage collector, instead of copying each live object to a new
heap on every garbage-collection pass, it compacts all the live objects down to one end
of the heap. In the mark phase, the collector marks the live objects and also keeps track
of the object’s new location in the heap. In the compact phase all the references are
updated and the live objects are gathered on the one end of the heap. In contrast to the
copying collector, the mark-compact collector works very well for long-lived objects
since these tend to accumulate at the bottom of the heap and then do not need to be
copied repeatedly.
The Java Virtual Machine uses a technique that combines the advantages of the
two algorithms described above. The heap is divided into two basic regions called
generations: the new generation and the old generation. The objects are created in the
new generation and if they survive a certain number of collections, they are promoted
to the old generation. This scheme takes advantage of the fact that most of the objects
created in a Java application are short-lived. When the new generation fills up and the
allocator is unable to fulfill an allocation request, a collection on the young generation
is triggered (minor collection). The minor collection uses a copying garbage collector
algorithm. Given the fact that most of the the objects are short-lived, they will be
already dead by the time the minor collection occurs and so they can be collected very
quickly with the copying algorithm. Objects that survive a few minor collections are
considered to be long-lived and they are promoted to the old generation. A collection in
the old generation(major collection) uses the mark-compact algorithm. Major garbage
Chapter 2. Related Background
16
collectors typically happen less frequently during the program’s execution, only if the
minor collection was unable to reclaim enough space.
Most of the Java Virtual Machines use a copying collector in the young and a mark-
compact collector in the old generation by default. However, the Java runtime provides
another three Garbage Collectors that the user can select by a command line parame-
ter: the parallel, the concurrent and parallel scavenge collector. The parallel collector
is a copying collector for the new generation that parallelizes the collection over mul-
tiple threads. This resolves the sequential bottleneck of the single-threaded garbage
collector on multi-CPU systems. The concurrent collector is used in the old genera-
tion and does most of the collection concurrently with the execution of the application,
unlike other collectors where all the application threads must pause for the duration
of the collection. Finally, the parallel scavenge collector is used in the new genera-
tion, it is similar to the parallel collector but tuned for gigabyte heaps (over 10GB).
This collector is used by default in java virual machines that run on multi-CPU server
machines.
The JVM includes numerous other options to tune the garbage collector. One of
the most important parameters is the initial and the maximum heap size as well as the
proportion of the heap assigned to the new generation. The latter is a very important
parameter to tune if the application includes a big number of short-lived objects. In
this case, having a big new generation means that the minor collection will be delayed
long enough, that when it occurs, most of the objects in the new generation will al-
ready be dead and so they will be garbage-collected quickly. On the other hand, if the
new generation is small, some of the short-lived objects will get promoted to the old
generation before they die and they will slow down the mark-compact algorithm which
does not perform well when it has to deal with many garbage objects.