Data Mining for the Masses



Professional: A yes/no column indicating whether or not the respondent is currently a
member of a professional organization with local chapter meetings, such as a chapter of a
law or medical society, a small business owner’s group, etc.


Religious: A yes/no column indicating whether or not the respondent is currently a
member of a church in the community.


Support_Group: A yes/no column indicating whether or not the respondent is currently
a member of a support-oriented community organization, such as Alcoholics Anonymous,
an anger management group, etc.

In order to preserve a level of personal privacy, individual respondents’ names were not collected
through the survey, and no respondent was asked to give personally identifiable information when
responding.

DATA PREPARATION

A CSV data set for this chapter’s exercise is available for download at the book’s companion web
site (https://sites.google.com/site/dataminingforthemasses/). If you wish to follow along with
the exercise, go ahead and download the Chapter05DataSet.csv file now and save it into your
RapidMiner data folder. Then, complete the following steps to prepare the data set for association
rule mining:

1) Import the Chapter 5 CSV data set into your RapidMiner data repository. Save it with the
name Chapter5. If you need a refresher on how to bring this data set into your
RapidMiner repository, refer to steps 7 through 14 of the Hands On Exercise in Chapter 3.
The steps will be the same, with the exception of which file you select to import. Import
all attributes, and accept the default data types. This is the same process as was done in
Chapter 4, so hopefully by now, you are getting comfortable with the steps to import data
into RapidMiner.

2) Drag your Chapter5 data set into a new process window in RapidMiner, and run the model
in order to inspect the data. When running the model, if prompted, save the process as
Chapter5_Process, as shown in Figure 5-1.

Chapter 5: Association Rules

Figure 5-1. Adding the data for the Chapter 5 example model.

3) In results perspective, look first at Meta Data view (Figure 5-2). Note that we do not have
any missing values among any of the 12 attributes across 3,483 observations. In examining
the statistics, we do not see any inconsistent data. For numeric data types, RapidMiner has
given us the average (avg), or mean, for each attribute, as well as the standard deviation for
each attribute. Standard deviations are measurements of how dispersed or varied the
values in an attribute are, and so can be used to watch for inconsistent data. A good rule
of thumb is that any value more than two standard deviations below the mean (or
arithmetic average), or more than two standard deviations above it, is a statistical outlier. For
example, in the Age attribute in Figure 5-2, the average age is 36.731, while the standard
deviation is 10.647. Two standard deviations above the mean would be 58.025
(36.731+(2*10.647)), and two standard deviations below the mean would be 15.437
(36.731-(2*10.647)). If we look at the Range column in Figure 5-2, we can see that the Age
attribute has a range of 17 to 57, so all of our observations fall within two standard
deviations of the mean. We find no inconsistent data in this attribute. This won’t always
be the case, so a data miner should always be watchful for such indications of inconsistent
data. It’s important to realize also that while two standard deviations is a guideline, it’s not
a hard-and-fast rule. Data miners should be thoughtful about why some observations may
be legitimate and yet far from the mean, or why some values that fall within two standard
deviations of the mean should still be scrutinized. One other item should be noted as we
examine Figure 5-2: the yes/no attributes about whether or not a person was a member of
various types of community organizations were recorded as a 0 or 1, and those attributes
were imported as ‘integer’ data types. The association rule operators we’ll be using in
RapidMiner require attributes to be of ‘binominal’ data type, so we still have some data
preparation yet to do.

Figure 5-2. Meta data of our community group involvement survey.
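
The two-standard-deviation rule of thumb from step 3 can also be checked outside of RapidMiner. The following is a minimal Python sketch, not part of the book's RapidMiner workflow; the function names are made up for illustration, and the only data used are the Age mean, standard deviation, and range quoted in the text.

```python
def two_sigma_bounds(mean, std_dev):
    """Return the (lower, upper) bounds two standard deviations from the mean."""
    spread = 2 * std_dev
    return mean - spread, mean + spread


def outside_two_sigma(values, mean, std_dev):
    """Return the values that fall outside the two-standard-deviation band."""
    lower, upper = two_sigma_bounds(mean, std_dev)
    return [v for v in values if v < lower or v > upper]


# Figures quoted in the text for the Age attribute.
age_mean, age_std = 36.731, 10.647
lower, upper = two_sigma_bounds(age_mean, age_std)
print(f"lower bound: {lower:.3f}, upper bound: {upper:.3f}")

# The reported range is 17 to 57, so checking the two endpoints is enough
# to confirm that no observation is flagged by the rule of thumb.
print(outside_two_sigma([17, 57], age_mean, age_std))
```

As the text cautions, a value flagged this way is only a candidate for closer inspection, not proof of bad data.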

4) Switch back to design perspective. We have a fairly good understanding of our objectives
and our data, but we know that some additional preparation is needed. First off, we need
to reduce the number of attributes in our data set. The elapsed time each person took to
complete the survey isn’t necessarily interesting in the context of our current question,
which is whether or not there are existing connections between types of organizations in
our community, and if so, where those linkages exist. In order to reduce our data set to
only those attributes related to our question, add a Select Attributes operator to your
stream (as was demonstrated in Chapter 3), and select the following attributes for inclusion,
as illustrated in Figure 5-3: Family, Hobbies, Social_Club, Political, Professional, Religious,
Support_Group. Once you have these attributes selected, click OK to return to your main
process.
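
For readers who want to see the same preparation idea in code, here is a rough Python analogue of keeping only the seven membership attributes and turning their 0/1 values into true/false flags. It is a sketch under stated assumptions, not how RapidMiner implements Select Attributes or the binominal data type: the sample rows are invented, the column name Elapsed_Time is a hypothetical stand-in for the survey's elapsed-time attribute, and select_and_convert is an illustrative helper.

```python
import csv
import io

# The seven membership attributes named in step 4.
KEEP = ["Family", "Hobbies", "Social_Club", "Political",
        "Professional", "Religious", "Support_Group"]

# Invented sample rows for illustration only; Elapsed_Time is a
# hypothetical column name, not taken from the real data set.
SAMPLE_CSV = """Elapsed_Time,Age,Family,Hobbies,Social_Club,Political,Professional,Religious,Support_Group
8.5,34,1,0,0,0,1,1,0
6.2,45,0,1,1,0,0,0,0
"""


def select_and_convert(csv_text, keep=KEEP):
    """Keep only the listed columns and turn 0/1 values into booleans."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] == "1" for name in keep} for row in reader]


rows = select_and_convert(SAMPLE_CSV)
for row in rows:
    print(row)
```

Each resulting row holds only two-valued attributes, which is the shape of data that association rule mining expects.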
