Data Mining for the Masses
3) What are linear regression coefficients? What does ‘weight’ mean?
4) What is the linear regression mathematical formula, and how is it arranged?
5) How are linear regression results interpreted?
6) If you have an attribute that you want to use in a linear regression model, but it contains
text data, such as the make or model of a car, what could you do in order to be able to use
that attribute in your model?
EXERCISE
In the Chapter 4 exercise, you compiled your own data set about professional athletes. For this
exercise, we will enhance this data set and then build a linear regression model on it. Complete the
following steps:
1) Open the data set you compiled for the Chapter 4 exercise. If you did not do that exercise,
please turn back to Chapter 4 and complete steps 1 – 4.
2) Split your data set’s observations in two: a training portion and a scoring portion. Be sure
that you have at least 20 observations in your training data set, and at least 10 in your
scoring data set. More would be better, so if you have only 30 observations in total, consider
looking up ten or so more athletes to add to your scoring data set. We are going to try to
predict each athlete’s salary, so if Salary is not one of your attributes, look it up for each
athlete in your training data set. Don’t look it up for the athletes in your scoring data set;
those are the salaries we are going to try to predict. If there are other attributes that you
don’t have, but that you think would be good predictors of salary, look these up and add them
to both your training and scoring data sets. These might be things like points per game or
defensive statistics. Be sure your attributes are numeric.
Chapter 8: Linear Regression
3) Import both of your data sets into your RapidMiner repository. Be sure to give them
descriptive names. Drag and drop them into a new process, and rename them as Training
and Scoring so that you can tell them apart.
4) Use a Set Role operator to designate the Salary attribute as the label for the training data.
5) Add a linear regression operator and apply your model to your scoring data set.
6) Run your model. In results perspective, examine your attribute coefficients and the
predictions for the athletes’ salaries in your scoring data set.
7) Report your results:
   a. Which attributes have the greatest weight?
   b. Were any attributes dropped from the data set as non-predictors? If so, which ones
   and why do you think they weren’t effective predictors?
   c. Look up a few of the salaries for some of your scoring data athletes and compare
   their actual salary to the predicted salary. Is it very close? Why or why not, do you
   think?
   d. What other attributes do you think would help your model better predict
   professional athletes’ salaries?
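For readers who want to see the arithmetic behind the exercise outside of RapidMiner, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute names (points per game, rebounds per game), the salary figures, and the helper functions fit_linear_regression and predict are all invented for this sketch, and the fit uses plain least squares rather than whatever attribute selection RapidMiner's operator may apply. The training salaries are deliberately noise-free so that the fitted coefficients are easy to check by hand.

```python
# Minimal sketch of the exercise workflow: train on labeled athletes,
# then score an athlete whose salary is unknown.
# All data and names below are hypothetical, not taken from the book.

def fit_linear_regression(rows, labels):
    """Return [intercept, coef_1, ..., coef_k] by ordinary least squares."""
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Build the normal equations (X^T X) b = X^T y as an augmented matrix.
    A = []
    for i in range(n):
        lhs = [sum(x[i] * x[j] for x in X) for j in range(n)]
        rhs = sum(x[i] * y for x, y in zip(X, labels))
        A.append(lhs + [rhs])
    # Solve with Gauss-Jordan elimination and partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("attributes are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]

def predict(coefficients, row):
    """Apply the fitted formula: intercept + sum of (coefficient * attribute)."""
    intercept, weights = coefficients[0], coefficients[1:]
    return intercept + sum(w * v for w, v in zip(weights, row))

# Training data: (points per game, rebounds per game) with a known Salary label.
# Salaries are in thousands and were generated as 100 + 50*points + 20*rebounds.
training_rows = [(10, 5), (20, 3), (15, 8), (25, 10), (8, 2)]
training_salaries = [700, 1160, 1010, 1550, 540]

coefficients = fit_linear_regression(training_rows, training_salaries)

# Scoring data: same attributes, but no Salary; the model supplies a prediction.
scoring_row = (12, 4)
predicted_salary = predict(coefficients, scoring_row)
print(coefficients, predicted_salary)
```

With real athlete data the salaries will not fall exactly on a line, so the predictions will differ from the actual salaries; comparing the two is the point of step 7c.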
CHAPTER NINE:
LOGISTIC REGRESSION
CONTEXT AND PERSPECTIVE
Remember Sonia, the health insurance program director from Chapter 6? Well, she’s back for
more help too! Her k-means clustering project was so helpful in finding groups of folks who could
benefit from her programs, that she wants to do more. This time around, she is concerned with
helping those who have suffered heart attacks. She wants to help them improve lifestyle choices,
including management of weight and stress, in order to improve their chances of not suffering a
second heart attack. Sonia is wondering if, with the right training data, we can predict the chances
of her company’s policy holders suffering second heart attacks. She feels like she could really help
some of her policy holders who have suffered heart attacks by offering weight, cholesterol and
stress management classes or support groups. By lowering these key heart attack risk factors, her
employer’s clients will live healthier lives, and her employer’s risk of having to pay costs associated
with treatment of second heart attacks will also go down. Sonia thinks she might even be able to
educate the insured individuals about ways to save money in other aspects of their lives, such as
their life insurance premiums, by demonstrating that they are now lower-risk policy
holders.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what logistic regression is, how it is used and the benefits of using it.
Recognize the necessary format for data in order to perform predictive logistic regression.
Develop a logistic regression data mining model in RapidMiner using a training data set.
Interpret the model’s outputs and apply them to a scoring data set in order to deploy the
model.
ORGANIZATIONAL UNDERSTANDING
Sonia’s desire is to expand her data mining activities to determine what kinds of programs she
should develop to help victims of heart attacks avoid suffering a recurrence. She knows that
several risk factors such as weight, high cholesterol and stress contribute to heart attacks,
particularly in those who have already suffered one. She also knows that the cost of providing
programs developed to help mitigate these risks is a fraction of the cost of providing medical care
for a patient who has suffered multiple heart attacks. Getting her employer on board with funding
the programs is the easy part. Figuring out which patients will benefit from which programs is
trickier. She is looking to us to provide some guidance, based on data mining, to figure out which
patients are good candidates for which programs. Sonia’s bottom line is that she wants to know
whether or not something (a second heart attack) is likely to happen, and if so, how likely it is that
it will or will not happen. Logistic regression is an excellent tool for predicting the likelihood of
something happening or not.
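To make the idea of 'how likely' concrete, here is a minimal Python sketch of the logistic function, which turns a weighted sum of attribute values into a probability between 0 and 1. It is an illustration only: the intercept and weights below are invented for the example and are not coefficients from any model built in this chapter, and the helper names logistic and predict_probability are hypothetical.

```python
import math

def logistic(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    # Two branches keep math.exp from overflowing on large-magnitude scores.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(intercept, weights, values):
    """Combine attributes linearly, then squash the score with logistic()."""
    score = intercept + sum(w * v for w, v in zip(weights, values))
    return logistic(score)

# Hypothetical coefficients for two attributes: a weight category code
# and a cholesterol reading. These numbers are made up for illustration.
intercept = -4.0
weights = [0.8, 0.01]

lower_risk = predict_probability(intercept, weights, (0, 150))
higher_risk = predict_probability(intercept, weights, (2, 240))
print(lower_risk, higher_risk)
```

A score of zero corresponds to a probability of one half; larger scores push the probability toward 1 and smaller scores push it toward 0, which is why the output can be read as the likelihood of the event happening.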
DATA UNDERSTANDING
Sonia has access to the company’s medical claims database. With this access, she is able to
generate two data sets for us. The first is a list of people who have suffered heart attacks, with an
attribute indicating whether or not they have had more than one; the second is a list of those
who have had a first heart attack, but not a second. The former data set, consisting of 138
observations, will serve as our training data, while the latter, consisting of 690 people’s data, will be
for scoring. Sonia’s hope is to help this latter group of people avoid becoming second heart attack
victims. In compiling the two data sets we have defined the following attributes:
Age: The age in years of the person, rounded to the nearest whole year.
Marital_Status: The person’s current marital status, indicated by a coded number: 0–Single,
never married; 1–Married; 2–Divorced; 3–Widowed.
Gender: The person’s gender: 0 for female; 1 for male.
Weight_Category: The person’s weight categorized into one of three levels: 0 for normal
weight range; 1 for overweight; and 2 for obese.
Cholesterol: The person’s cholesterol level, as recorded at the time of their treatment for
their most recent heart attack (their only heart attack, in the case of those individuals in the
scoring data set).