Data Mining for the Masses

Date: 08.10.2017

REVIEW QUESTIONS

1) What is the appropriate data type for independent variables (predictor attributes) in logistic
regression? What about for the dependent variable (target or label attribute)?

2) Compare the predictions for rows 15 and 669 in the chapter’s example model.
   a. What is the single difference between these two people, and how does it affect their
      predicted 2nd_Heart_Attack risk?
   b. Locate other 67-year-old men in the results and compare them to the men on rows
      15 and 669. How do they compare?
   c. Can you spot areas where the men represented on rows 15 and 669 could improve
      their chances of not suffering a second heart attack?

3) What is the difference between confidence(Yes) and confidence(No) in this chapter’s
example?

4) How can you set an attribute’s role to be ‘label’ in RapidMiner without using the Set Role
operator? What is one drawback to doing it that way?

EXERCISE

For this chapter’s exercise, you will use logistic regression to try to predict whether or not young
people you know will eventually graduate from college. Complete the following steps:

1) Open a new blank spreadsheet in OpenOffice Calc. At the bottom of the spreadsheet
there will be three default tabs labeled Sheet1, Sheet2, and Sheet3. Rename the first one
Training and the second one Scoring. You can rename the tabs by double-clicking on their
labels. You can delete or ignore the third default sheet.

2) On the Training sheet, starting in cell A1 and going across, create attribute labels for five
attributes: Parent_Grad, Gender, Income_Level, Num_Siblings, and Graduated.

3) Copy each of these attribute names except Graduated into the Scoring sheet.

Chapter 9: Logistic Regression

4) On the Training sheet, enter values for each of these attributes for several adults you
know who are old enough to have graduated from college by now. These could
be family members, friends and neighbors, coworkers or fellow students, etc. Try to enter at
least 20 observations; 30 or more would be better. Enter husband and wife couples as two
separate observations. Use the following to guide your data entry:
   a. For Parent_Grad, enter a 0 if neither of the person’s parents graduated from college,
      a 1 if one parent did, and a 2 if both parents did. If the person’s parents went on to
      earn graduate degrees, you could experiment with making this attribute even more
      interesting by using it to hold the total number of college degrees earned by the person’s
      parents. For example, if the person represented in the observation had a mother
      who earned a bachelor’s, master’s and doctorate, and a father who earned a
      bachelor’s and a master’s, you could enter a 5 in this attribute for that person.
   b. For Gender, enter 0 for female and 1 for male.
   c. For Income_Level, enter a 0 if the person lives in a household with an income level
      that you would consider to be below average, a 1 for average, and a 2 for
      above average. You can estimate or generalize. Be sensitive to others when
      gathering your data: don’t snoop too much, or you risk offending your data subjects.
   d. For Num_Siblings, enter the number of siblings the person has.
   e. For Graduated, put ‘Yes’ if the person has graduated from college and ‘No’ if they
      have not.

5) Once you’ve compiled your Training data set, switch to the Scoring sheet in OpenOffice
Calc. Repeat the data entry process for at least 20 (more is better) young people you know
between the ages of 0 and 18. You will use the training set to try to predict whether
or not these young people will graduate from college, and how confident you are in
each prediction. Remember that this is your scoring data, so you won’t provide the Graduated
attribute; you’ll predict it shortly.

6) Use the File > Save As menu option in OpenOffice Calc to save your Training and Scoring
sheets as CSV files.

7) Import your two CSV files into your RapidMiner repository. Be sure to give them
descriptive names.

8) Drag your two data sets into a new process window. If you have prepared your data well
in OpenOffice Calc, you shouldn’t have any missing or inconsistent data to contend with,
so data preparation should be minimal. Rename the two Retrieve operators so you can tell
the difference between your training and scoring data sets.

9) One necessary data preparation step is to add a Set Role operator and define the Graduated
attribute as your label in your training data. Alternatively, you can set your Graduated
attribute as the label during data import.

10) Add a Logistic Regression operator to your Training stream.

11) Apply your Logistic Regression model to your scoring data and run your model. Evaluate
and report your results. Are your confidence percentages interesting? Surprising? Do the
predicted Graduated values seem reasonable and consistent with your training data? Does
any one independent variable (predictor attribute) seem to be a particularly good predictor
of the dependent variable (label or prediction attribute)? If so, why do you think so?
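
If you would like to see the same workflow outside of RapidMiner, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV rows are invented stand-ins for your own Training and Scoring files, the helper names (load, fit, score) are hypothetical, and the model is fit by plain gradient descent. It is not meant to reproduce RapidMiner's Logistic Regression operator, so the numbers it produces may differ from yours.

```python
import csv
import io
import math

# Invented stand-ins for the Training and Scoring CSV exports. These rows
# are illustrative only; substitute the observations you collected.
TRAINING_CSV = """Parent_Grad,Gender,Income_Level,Num_Siblings,Graduated
2,0,2,1,Yes
2,1,2,0,Yes
1,0,1,2,Yes
1,1,2,1,Yes
2,0,1,3,Yes
0,1,0,4,No
0,0,1,3,No
1,1,0,2,No
0,0,0,5,No
0,1,1,1,No
"""
SCORING_CSV = """Parent_Grad,Gender,Income_Level,Num_Siblings
2,0,2,0
0,1,0,5
1,1,1,2
"""
PREDICTORS = ["Parent_Grad", "Gender", "Income_Level", "Num_Siblings"]


def load(text, with_label):
    """Parse CSV text into numeric predictor rows and, optionally, 0/1 labels."""
    rows = list(csv.DictReader(io.StringIO(text)))
    features = [[float(row[name]) for name in PREDICTORS] for row in rows]
    labels = None
    if with_label:
        labels = [1.0 if row["Graduated"] == "Yes" else 0.0 for row in rows]
    return features, labels


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def linear(weights, row):
    """Intercept (weights[0]) plus the weighted sum of the predictors."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


def fit(features, labels, rate=0.1, epochs=2000):
    """Fit the weights by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(PREDICTORS) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for row, target in zip(features, labels):
            error = sigmoid(linear(weights, row)) - target
            gradient[0] += error
            for j, x in enumerate(row, start=1):
                gradient[j] += error * x
        weights = [w - rate * g / len(features) for w, g in zip(weights, gradient)]
    return weights


def score(weights, row):
    """Return the predicted label and both confidences for one observation."""
    conf_yes = sigmoid(linear(weights, row))
    return {
        "prediction": "Yes" if conf_yes >= 0.5 else "No",
        "confidence(Yes)": conf_yes,
        "confidence(No)": 1.0 - conf_yes,
    }


train_x, train_y = load(TRAINING_CSV, with_label=True)
score_x, _ = load(SCORING_CSV, with_label=False)
weights = fit(train_x, train_y)
results = [score(weights, row) for row in score_x]
for row, result in zip(score_x, results):
    print(row, result["prediction"], round(result["confidence(Yes)"], 3))
```

The fitted weights give a rough sense of which predictor pulls hardest toward ‘Yes’ or ‘No’, which is one way to approach the last question in step 11. With only a handful of observations, treat any such reading as tentative.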

Challenge Step!

12) Change your Logistic Regression operator to a different type of Logistic operator
(for example, maybe try the Weka W-Logistic operator). Re-run your model. Consider
doing some research to learn about the differences between the algorithms underlying different
logistic approaches. Compare your new results to the original Logistic Regression results
and report any interesting findings or differences.
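
One way to organize the comparison is to line up the two sets of predictions row by row and flag the observations where the predicted label flips or the confidence moves noticeably. The sketch below assumes you have copied each run's results into a list of records with prediction and confidence(Yes) fields; the function name compare_runs, the 0.10 threshold, and the sample values are all hypothetical and chosen only for illustration.

```python
def compare_runs(baseline, challenger, threshold=0.10):
    """List rows whose label flips or whose confidence(Yes) shifts by >= threshold."""
    report = []
    for row, (old, new) in enumerate(zip(baseline, challenger), start=1):
        shift = new["confidence(Yes)"] - old["confidence(Yes)"]
        flipped = old["prediction"] != new["prediction"]
        if flipped or abs(shift) >= threshold:
            report.append({"row": row, "flipped": flipped, "shift": round(shift, 3)})
    return report


# Invented results from two runs over the same three scoring rows.
baseline = [
    {"prediction": "Yes", "confidence(Yes)": 0.91},
    {"prediction": "No", "confidence(Yes)": 0.42},
    {"prediction": "Yes", "confidence(Yes)": 0.55},
]
challenger = [
    {"prediction": "Yes", "confidence(Yes)": 0.88},
    {"prediction": "Yes", "confidence(Yes)": 0.51},
    {"prediction": "Yes", "confidence(Yes)": 0.70},
]

differences = compare_runs(baseline, challenger)
for item in differences:
    print(item)
```

Rows that flip are usually the ones whose confidence sat near 50% in the first run, so they are a natural place to start when you report your findings.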
