Data Mining
for the Masses
Professional: A yes/no column indicating whether or not the respondent is currently a
member of a professional organization with local chapter meetings, such as a chapter of a
law or medical society, a small business owner’s group, etc.
Religious: A yes/no column indicating whether or not the respondent is currently a
member of a church in the community.
Support_Group: A yes/no column indicating whether or not the respondent is currently
a member of a support-oriented community organization, such as Alcoholics Anonymous,
an anger management group, etc.
In order to preserve a level of personal privacy, individual respondents’ names were not collected
through the survey, and no respondent was asked to give personally identifiable information when
responding.
DATA PREPARATION
A CSV data set for this chapter’s exercise is available for download at the book’s companion web
site (https://sites.google.com/site/dataminingforthemasses/). If you wish to follow along with
the exercise, go ahead and download the Chapter05DataSet.csv file now and save it into your
RapidMiner data folder. Then, complete the following steps to prepare the data set for association
rule mining:
1) Import the Chapter 5 CSV data set into your RapidMiner data repository. Save it with the
name Chapter5. If you need a refresher on how to bring this data set into your
RapidMiner repository, refer to steps 7 through 14 of the Hands On Exercise in Chapter 3.
The steps will be the same, with the exception of which file you select to import. Import
all attributes, and accept the default data types. This is the same process as was done in
Chapter 4, so hopefully by now, you are getting comfortable with the steps to import data
into RapidMiner.
2) Drag your Chapter5 data set into a new process window in RapidMiner, and run the model
in order to inspect the data. When running the model, if prompted, save the process as
Chapter5_Process, as shown in Figure 5-1.
Chapter 5:
Association Rules
Figure 5-1. Adding the data for the Chapter 5 example model.
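For readers who want to double-check the import outside RapidMiner, the inspection in steps 1 and 2 can be approximated with a short Python sketch. This is only an illustration under stated assumptions: the sample rows below are hypothetical stand-ins for Chapter05DataSet.csv, and count_missing is a helper name invented here, not part of RapidMiner.

```python
import csv
import io

# Hypothetical sample standing in for Chapter05DataSet.csv. The values are
# made up for illustration and only a few of the attributes are shown.
SAMPLE_CSV = """Age,Family,Hobbies,Religious
34,1,0,1
41,0,1,0
29,1,,1
"""


def count_missing(rows, columns):
    """Return a dict mapping each column name to its number of empty cells."""
    missing = {name: 0 for name in columns}
    for row in rows:
        for name in columns:
            # Treat absent, empty, or whitespace-only cells as missing.
            if (row.get(name) or "").strip() == "":
                missing[name] += 1
    return missing


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)

print(len(rows), "observations,", len(reader.fieldnames), "attributes")
print(missing)
```

To run the same check against the downloaded file, replace io.StringIO(SAMPLE_CSV) with an open file handle, for example open("Chapter05DataSet.csv", newline=""). The third sample row deliberately leaves Hobbies empty so that the missing-value count has something to report.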
3) In results perspective, look first at the Meta Data view (Figure 5-2). Note that we do not have
any missing values among any of the 12 attributes across 3,483 observations. In examining
the statistics, we do not see any inconsistent data. For numeric data types, RapidMiner has
given us the average (avg), or mean, for each attribute, as well as the standard deviation for
each attribute. Standard deviations are measurements of how dispersed or varied the
values in an attribute are, and so can be used to watch for inconsistent data. A good rule
of thumb is that any value more than two standard deviations below the mean (or
arithmetic average), or more than two standard deviations above the mean, is a statistical outlier. For
example, in the Age attribute in Figure 5-2, the average age is 36.731, while the standard
deviation is 10.647. Two standard deviations above the mean would be 58.025
(36.731+(2*10.647)), and two standard deviations below the mean would be 15.437
(36.731-(2*10.647)). If we look at the Range column in Figure 5-2, we can see that the Age
attribute has a range of 17 to 57, so all of our observations fall within two standard
deviations of the mean. We find no inconsistent data in this attribute. This won’t always
be the case, so a data miner should always be watchful for such indications of inconsistent
data. It’s important to realize also that while two standard deviations is a guideline, it’s not
a hard-and-fast rule. Data miners should be thoughtful about why some observations may
be legitimate and yet far from the mean, or why some values that fall within two standard
deviations of the mean should still be scrutinized. One other item should be noted as we
examine Figure 5-2: the yes/no attributes about whether or not a person was a member of
various types of community organizations were recorded as 0 or 1, and those attributes
were imported as ‘integer’ data types. The association rule operators we’ll be using in
RapidMiner require attributes to be of ‘binominal’ data type, so we still have some data
preparation yet to do.
Figure 5-2. Meta data of our community group involvement survey.
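The two-standard-deviation check described in step 3 can be sketched in a few lines of Python. The mean, standard deviation, and range for Age are the figures quoted above from Figure 5-2; the function names two_sigma_bounds and outside_bounds are hypothetical helpers introduced for this illustration, and the check itself remains a guideline rather than a hard-and-fast rule.

```python
def two_sigma_bounds(mean, std_dev):
    """Return (lower, upper) bounds two standard deviations from the mean."""
    spread = 2 * std_dev
    return mean - spread, mean + spread


def outside_bounds(values, lower, upper):
    """Return the values that fall below lower or above upper."""
    return [v for v in values if v < lower or v > upper]


# Age statistics as reported in the text: mean 36.731, std. deviation 10.647.
lower, upper = two_sigma_bounds(36.731, 10.647)
print(f"Two-sigma interval for Age: {lower:.3f} to {upper:.3f}")

# The observed range for Age is 17 to 57, so checking the two extremes is
# enough to confirm that no observation lies outside the interval.
flagged = outside_bounds([17, 57], lower, upper)
print("Values outside the interval:", flagged)
```

With the Age figures, the interval works out to 15.437 through 58.025 and nothing is flagged, which matches the conclusion in the text. Values that are flagged for another attribute deserve a closer look, not automatic removal.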
4) Switch back to design perspective. We have a fairly good understanding of our objectives
and our data, but we know that some additional preparation is needed. First off, we need
to reduce the number of attributes in our data set. The elapsed time each person took to
complete the survey isn’t necessarily interesting in the context of our current question,
which is whether or not there are existing connections between types of organizations in
our community, and if so, where those linkages exist. In order to reduce our data set to
only those attributes related to our question, add a Select Attributes operator to your
stream (as was demonstrated in Chapter 3), and select the following attributes for inclusion,
as illustrated in Figure 5-3: Family, Hobbies, Social_Club, Political, Professional, Religious,
Support_Group. Once you have these attributes selected, click OK to return to your main
process.
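As a rough analogue of what the Select Attributes operator does in step 4, the Python sketch below keeps only the seven membership attributes from a single record and then turns the 0/1 integers into true/false values, which is the kind of two-valued data that step 3 says the association rule operators require. This is not RapidMiner's implementation: the respondent record, the Elapsed_Time column name, and the helper names select_attributes and to_binominal are all assumptions made for illustration.

```python
# The seven attributes listed in step 4.
KEEP = [
    "Family", "Hobbies", "Social_Club", "Political",
    "Professional", "Religious", "Support_Group",
]


def select_attributes(row, keep=KEEP):
    """Return a copy of row containing only the attributes named in keep."""
    return {name: row[name] for name in keep}


def to_binominal(row):
    """Map 0/1 integer flags to False/True (a two-valued representation)."""
    return {name: value == 1 for name, value in row.items()}


# Hypothetical respondent; Elapsed_Time is an assumed column name.
respondent = {
    "Age": 34, "Elapsed_Time": 12.5,
    "Family": 1, "Hobbies": 0, "Social_Club": 0, "Political": 0,
    "Professional": 1, "Religious": 1, "Support_Group": 0,
}

selected = select_attributes(respondent)
flags = to_binominal(selected)
print(flags)
```

Selecting first and converting second means attributes such as Age are dropped before the conversion runs, so they are never mistakenly collapsed into true/false values.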