Data Mining for the Masses




3) What are linear regression coefficients? What does ‘weight’ mean?

4) What is the linear regression mathematical formula, and how is it arranged?

5) How are linear regression results interpreted?

Extra thought question:

6) If you have an attribute that you want to use in a linear regression model, but it contains
text data, such as the make or model of a car, what could you do in order to be able to use
that attribute in your model?
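
As a starting point for thinking about question 6, one widely used technique (offered here as a sketch, not as the chapter's own answer) is dummy coding: each text category becomes its own 0/1 attribute. The function name `dummy_code` and the car makes below are hypothetical and exist only for illustration.

```python
def dummy_code(values):
    """Turn a list of text categories into 0/1 indicator columns.

    The alphabetically first category is left out as the baseline, so k
    categories yield k - 1 columns. This keeps the new columns from being
    perfectly collinear with the regression intercept.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # drop the baseline category
    return {f"is_{cat}": [1 if v == cat else 0 for v in values] for cat in kept}


# Hypothetical car makes, made up purely for illustration.
makes = ["Ford", "Honda", "Toyota", "Honda", "Ford"]
encoded = dummy_code(makes)

for column, flags in encoded.items():
    print(column, flags)
```

A car whose flags are all 0 belongs to the baseline category (here, the alphabetically first make), so no information is lost by leaving that column out.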

EXERCISE

In the Chapter 4 exercise, you compiled your own data set about professional athletes.  For this
exercise, we will enhance this data set and then build a linear regression model on it.  Complete the
following steps:

1) Open the data set you compiled for the Chapter 4 exercise.  If you did not do that exercise,
please turn back to Chapter 4 and complete steps 1 – 4.

2) Split your data set’s observations in two: a training portion and a scoring portion.  Be sure
that you have at least 20 observations in your training data set, and at least 10 in your
scoring data set.  More would be better, so if you only have 30 observations total, consider
taking some time to look up ten or so more athletes to add to your scoring data set.  We are
going to try to predict each athlete’s salary, so if Salary is not one of your attributes, look
it up for each athlete in your training data set.  Don’t look it up for the athletes in the
scoring data set; theirs are the salaries we are going to try to predict.  If there are other
attributes that you don’t have, but that you think would be good predictors of salary, look
these up and add them to both your training and scoring data sets.  These might be things
like points per game, defensive statistics, etc.  Be sure your attributes are numeric.

Chapter 8: Linear Regression
3) Import both of your data sets into your RapidMiner repository.  Be sure to give them
descriptive names.  Drag and drop them into a new process, and rename them as Training
and Scoring so that you can tell them apart.

4) Use a Set Role operator to designate the Salary attribute as the label for the training data.

5) Add a linear regression operator and apply your model to your scoring data set.

6) Run your model.  In results perspective, examine your attribute coefficients and the
predictions for the athletes’ salaries in your scoring data set.

7) Report your results:

   a. Which attributes have the greatest weight?

   b. Were any attributes dropped from the data set as non-predictors?  If so, which ones,
      and why do you think they weren’t effective predictors?

   c. Look up the actual salaries for a few of your scoring data athletes and compare them
      to the predicted salaries.  Are they close?  Why or why not, do you think?

   d. What other attributes do you think would help your model better predict
      professional athletes’ salaries?
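
For readers who would like to see what the steps above compute outside of RapidMiner, here is a minimal sketch in plain Python. It is not how RapidMiner's operator is implemented: it fits ordinary least squares through the normal equations and does no attribute selection. The attribute names (points per game, years of experience), the function names, and every number are assumptions made up for illustration. The training salaries were generated from a known formula so that the fitted coefficients can be checked by eye.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + list(r) for r in rows]  # prepend a column of 1s for the intercept
    n = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(n)]
         + [sum(x[i] * t for x, t in zip(X, targets))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Apply the fitted model to one observation."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical training data: (points per game, years of experience).
training = [(10, 2), (15, 4), (20, 3), (25, 8), (30, 6), (12, 1)]
# Made-up salaries in millions, built as 0.5 + 0.2 * ppg + 0.3 * years.
salaries = [3.1, 4.7, 5.4, 7.9, 8.3, 3.2]

# Hypothetical scoring data: same attributes, no Salary (the label we predict).
scoring = [(18, 5), (22, 2)]

coefs = fit_linear_regression(training, salaries)
predictions = [predict(coefs, row) for row in scoring]

print("intercept and coefficients:", [round(c, 3) for c in coefs])
print("predicted salaries:", [round(p, 3) for p in predictions])
```

Real salary data will not fit a line this neatly, which is exactly what step 7c asks you to think about.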


CHAPTER NINE:
LOGISTIC REGRESSION

CONTEXT AND PERSPECTIVE

Remember Sonia, the health insurance program director from Chapter 6?  Well, she’s back for
more help too!  Her k-means clustering project was so helpful in finding groups of folks who could
benefit from her programs that she wants to do more.  This time around, she is concerned with
helping those who have suffered heart attacks.  She wants to help them improve lifestyle choices,
including management of weight and stress, in order to improve their chances of not suffering a
second heart attack.  Sonia is wondering if, with the right training data, we can predict the chances
of her company’s policy holders suffering second heart attacks.  She feels like she could really help
some of her policy holders who have suffered heart attacks by offering weight, cholesterol and
stress management classes or support groups.  By lowering these key heart attack risk factors, her
employer’s clients will live healthier lives, and her employer’s risk of having to pay costs associated
with treatment of second heart attacks will also go down.  Sonia thinks she might even be able to
educate the insured individuals about ways to save money in other aspects of their lives, such as
their life insurance premiums, by being able to demonstrate that they are now lower-risk policy
holders.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:


• Explain what logistic regression is, how it is used and the benefits of using it.

• Recognize the necessary format for data in order to perform predictive logistic regression.

• Develop a logistic regression data mining model in RapidMiner using a training data set.

• Interpret the model’s outputs and apply them to a scoring data set in order to deploy the
  model.

ORGANIZATIONAL UNDERSTANDING

Sonia’s desire is to expand her data mining activities to determine what kinds of programs she
should develop to help victims of heart attacks avoid suffering a recurrence.  She knows that
several risk factors such as weight, high cholesterol and stress contribute to heart attacks,
particularly in those who have already suffered one.  She also knows that the cost of providing
programs developed to help mitigate these risks is a fraction of the cost of providing medical care
for a patient who has suffered multiple heart attacks.  Getting her employer on board with funding
the programs is the easy part.  Figuring out which patients will benefit from which programs is
trickier.  She is looking to us to provide some guidance, based on data mining, to figure out which
patients are good candidates for which programs.  Sonia’s bottom line is that she wants to know
whether or not something (a second heart attack) is likely to happen, and if so, how likely.
Logistic regression is an excellent tool for predicting the likelihood of something happening
or not.
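
To make ‘how likely’ a little more concrete: a logistic regression model adds up weighted attribute values, much as linear regression does, and then passes that sum through the logistic function, which squeezes any number into the range between 0 and 1 so it can be read as a probability. The sketch below shows only that final calculation. The intercept, the coefficients, and the person’s attribute values are invented for illustration and were not fitted from any data; the function names are likewise our own.

```python
import math


def logistic(z):
    """Map any real-valued score z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(intercept, coefs, attributes):
    """Weighted sum of the attributes, passed through the logistic function."""
    z = intercept + sum(c * a for c, a in zip(coefs, attributes))
    return logistic(z)


# Hypothetical model: these numbers are assumptions, not fitted values.
intercept = -4.0
coefs = [0.5, 0.01]   # assumed weights for a weight score and a cholesterol level
person = [2, 200]     # made-up attribute values; weighted sum z = -4 + 1 + 2 = -1

p = predict_probability(intercept, coefs, person)
print(f"Estimated probability: {p:.3f}")
```

A score of 0 maps to a probability of exactly 0.5; larger scores push the probability toward 1 and smaller ones toward 0.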

DATA UNDERSTANDING

Sonia has access to the company’s medical claims database.  With this access, she is able to
generate two data sets for us.  The first is a list of people who have suffered heart attacks, with an
attribute indicating whether or not they have had more than one; the second is a list of those
who have had a first heart attack, but not a second.  The former data set, comprising 138
observations, will serve as our training data, while the latter, comprising 690 people’s data, will be
for scoring.  Sonia’s hope is to help this latter group of people avoid becoming second heart attack
victims.  In compiling the two data sets, we have defined the following attributes:


• Age: The age in years of the person, rounded to the nearest whole year.

• Marital_Status: The person’s current marital status, indicated by a coded number:
  0–Single, never married; 1–Married; 2–Divorced; 3–Widowed.

• Gender: The person’s gender: 0 for female; 1 for male.

• Weight_Category: The person’s weight categorized into one of three levels: 0 for normal
  weight range; 1 for overweight; and 2 for obese.

• Cholesterol: The person’s cholesterol level, as recorded at the time of their treatment for
  their most recent heart attack (their only heart attack, in the case of those individuals in the
  scoring data set).
