Data Mining
for the Masses
REVIEW QUESTIONS
1)
What is the appropriate data type for independent variables (predictor attributes) in logistic
regression? What about for the dependent variable (target or label attribute)?
2)
Compare the predictions for Rows 15 and 669 in the chapter’s example model.
a.
What is the single difference between these two people, and how does it affect their
predicted 2nd_Heart_Attack risk?
b.
Locate other 67-year-old men in the results and compare them to the men on rows
15 and 669. How do they compare?
c.
Can you spot areas where the men represented on rows 15 and 669 could improve
their chances of not suffering a second heart attack?
3)
What is the difference between confidence(Yes) and confidence(No) in this chapter’s
example?
4)
How can you set an attribute’s role to be ‘label’ in RapidMiner without using the Set Role
operator? What is one drawback to doing it that way?
EXERCISE
For this chapter’s exercise, you will use logistic regression to try to predict whether or not young
people you know will eventually graduate from college. Complete the following steps:
1)
Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet
there will be three default tabs labeled Sheet1, Sheet2, Sheet3. Rename the first one
Training and the second one Scoring. You can rename the tabs by double-clicking on their
labels. You can delete or ignore the third default sheet.
2)
On the training sheet, starting in cell A1 and going across, create attribute labels for five
attributes: Parent_Grad, Gender, Income_Level, Num_Siblings, and Graduated.
3)
Copy each of these attribute names except Graduated into the Scoring sheet.
Chapter 9:
Logistic Regression
4)
On the Training sheet, enter values for each of these attributes for several adults that you
know who are at the age that they could have graduated from college by now. These could
be family members, friends and neighbors, coworkers or fellow students, etc. Try to do at
least 20 observations; 30 or more would be better. Enter husband and wife couples as two
separate observations. Use the following to guide your data entry:
a.
For Parent_Grad, enter a 0 if neither of the person’s parents graduated from college,
a 1 if one parent did, and a 2 if both parents did. If the person’s parents went on to
earn graduate degrees, you could experiment with making this attribute even more
interesting by using it to hold the total number of college degrees earned by the person’s
parents. For example, if the person represented in the observation had a mother
who earned a bachelor’s, master’s and doctorate, and a father who earned a
bachelor’s and a master’s, you could enter a 5 in this attribute for that person.
b.
For Gender, enter 0 for female and 1 for male.
c.
For Income_Level, enter a 0 if the person lives in a household with an income level
that you would consider to be below average, a 1 for average, and a 2 for
above average. You can estimate or generalize. Be sensitive to others when
gathering your data—don’t snoop too much or risk offending your data subjects.
d.
For Num_Siblings, enter the number of siblings the person has.
e.
For Graduated, put ‘Yes’ if the person has graduated from college and ‘No’ if they
have not.
5)
Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice
Calc. Repeat the data entry process for at least 20 (more is better) young people between
the ages of 0 and 18 that you know. You will use the training set to try to predict whether
or not these young people will graduate from college, and if so, how confident you are in
your prediction. Remember that this is your scoring data, so you won’t provide the Graduated
attribute; you’ll predict it shortly.
6)
Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring
sheets as CSV files.
7)
Import your two CSV files into your RapidMiner repository. Be sure to give them
descriptive names.
8)
Drag your two data sets into a new process window. If you have prepared your data well
in OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with,
so data preparation should be minimal. Rename the two retrieve operators so you can tell
the difference between your training and scoring data sets.
9)
One necessary data preparation step is to add a Set Role operator and define the Graduated
attribute as your label in your training data. Alternatively, you can set your Graduated
attribute as the label during data import.
10)
Add a Logistic Regression operator to your Training stream.
11)
Apply your Logistic Regression model to your scoring data and run your model. Evaluate
and report your results. Are your confidence percentages interesting? Surprising? Do the
predicted Graduated values seem reasonable and consistent with your training data? Does
any one independent variable (predictor attribute) seem to be a particularly good predictor
of the dependent variable (label or prediction attribute)? If so, why do you think so?
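If you would like to see the same workflow outside of RapidMiner, the steps above can be sketched in Python. This is only a rough illustration under stated assumptions, not the RapidMiner operator: it fits a plain logistic regression by gradient descent, and the data rows, the function names (`load`, `fit_logistic`, `score`), the learning rate, and the number of passes are all hypothetical choices made for this example. Your own Training and Scoring CSV files would take the place of the made-up rows.

```python
import csv
import io
import math

# Hypothetical training rows in the layout from steps 2 and 4 (made-up values).
TRAINING_CSV = """Parent_Grad,Gender,Income_Level,Num_Siblings,Graduated
2,0,2,1,Yes
2,1,1,2,Yes
1,0,1,0,Yes
1,1,2,3,Yes
0,1,0,4,No
0,0,0,2,No
0,1,1,5,No
1,0,0,3,No
"""

# Hypothetical scoring rows: same attributes, but no Graduated column (step 3).
SCORING_CSV = """Parent_Grad,Gender,Income_Level,Num_Siblings
2,1,2,1
0,0,0,4
"""

PREDICTORS = ["Parent_Grad", "Gender", "Income_Level", "Num_Siblings"]


def load(text):
    """Parse CSV text into a list of dictionaries keyed by attribute name."""
    return list(csv.DictReader(io.StringIO(text)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, label="Graduated", rate=0.1, epochs=5000):
    """Fit an intercept and one weight per predictor by batch gradient descent."""
    xs = [[1.0] + [float(r[p]) for p in PREDICTORS] for r in rows]  # 1.0 = intercept
    ys = [1.0 if r[label] == "Yes" else 0.0 for r in rows]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - rate * g / len(xs) for wi, g in zip(w, grad)]
    return w


def score(w, row):
    """Return a prediction and both confidences for one scoring row."""
    x = [1.0] + [float(row[p]) for p in PREDICTORS]
    conf_yes = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return {
        "prediction": "Yes" if conf_yes >= 0.5 else "No",
        "confidence(Yes)": conf_yes,
        "confidence(No)": 1.0 - conf_yes,
    }


weights = fit_logistic(load(TRAINING_CSV))
scoring_rows = load(SCORING_CSV)
results = [score(weights, row) for row in scoring_rows]
for row, res in zip(scoring_rows, results):
    print(row, res)
```

With only eight invented observations, the confidences printed here mean very little; the point is simply to show how a label in the training data becomes a prediction and a pair of confidence values in the scoring data.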
Challenge Step!
12)
Change your Logistic Regression operator to a different type of Logistic operator
(for example, the Weka W-Logistic operator). Re-run your model. Consider
doing some research to learn about the differences between the algorithms underlying
different logistic approaches. Compare your new results to the original Logistic Regression
results and report any interesting findings or differences.
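One difference you may come across in your research is regularization. Weka describes its Logistic classifier as a logistic regression model with a ridge estimator, meaning that large coefficients are penalized during fitting. The sketch below is not Weka's or RapidMiner's implementation; it is a hand-written gradient descent, with made-up data and hypothetical names and settings (`fit`, `coef_norm`, `ridge=1.0`), intended only to show that the same training data can yield different coefficients once a ridge penalty is added.

```python
import math

# Hypothetical rows: Parent_Grad, Gender, Income_Level, Num_Siblings, label
# (label 1 = graduated, 0 = did not). All values are made up for illustration.
ROWS = [
    (2, 0, 2, 1, 1), (2, 1, 1, 2, 1), (1, 0, 1, 0, 1), (1, 1, 2, 3, 1),
    (0, 1, 0, 4, 0), (0, 0, 0, 2, 0), (0, 1, 1, 5, 0), (1, 0, 0, 3, 0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, ridge=0.0, rate=0.1, epochs=5000):
    """Gradient descent on mean log loss + (ridge / 2) * sum of squared coefficients.

    The intercept (w[0]) is left unpenalized; ridge=0.0 means no penalty at all.
    """
    xs = [[1.0] + [float(v) for v in r[:-1]] for r in rows]
    ys = [float(r[-1]) for r in rows]
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(xs)
        for j in range(1, len(w)):  # add the penalty gradient, skipping the intercept
            grad[j] += ridge * w[j]
        w = [wi - rate * g for wi, g in zip(w, grad)]
    return w


def coef_norm(w):
    """Overall size of the predictor coefficients (intercept excluded)."""
    return math.sqrt(sum(v * v for v in w[1:]))


plain = fit(ROWS)
penalized = fit(ROWS, ridge=1.0)
print("no penalty:   ", [round(v, 3) for v in plain], "size", round(coef_norm(plain), 3))
print("ridge penalty:", [round(v, 3) for v in penalized], "size", round(coef_norm(penalized), 3))
```

The penalized coefficients come out smaller overall, which in turn pulls the confidence values toward 50%. When you compare operators in RapidMiner, differences like this one are a plausible place to start looking for why two "logistic" models disagree.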