Chapter 9: Logistic Regression
With the label attribute set, we are now prepared to begin…
MODELING
7) Using the search field in the Operators tab, locate the Logistic Regression operator. You
will see that if you just search for the word ‘logistic’ (as has been done in Figure 9-6), there
are several different logistic and logistic regression operators available to you in
RapidMiner. We will use the first one in this example; however, you are certainly
encouraged to experiment with the others as you would like. Drag the Logistic Regression
operator into your training stream.
Figure 9-6. The Logistic Regression operator in our training stream.
8) The Logistic Regression operator will generate coefficients for each of our predictor
attributes, in much the same way that the linear regression operator did. If you would like
to see these, you can run your model now. The algebraic formula for logistic regression is
different from, and a bit more complicated than, the one for linear regression. We are no longer
calculating the slope of a straight line; rather, we are trying to determine the likelihood
of an observation falling at a given point along a curved, less well-defined imaginary line
through a data set. The coefficients for logistic regression are used in that formula.
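For reference, the standard textbook form of the logistic regression formula can be written as shown here. This is a general sketch, not necessarily the exact computation RapidMiner's operator performs internally, and the symbols are notation assumed for illustration: $b_0$ is the intercept, $b_1, \dots, b_n$ are the coefficients, and $x_1, \dots, x_n$ are the values of the predictor attributes for one observation.

```latex
\[
P(\text{Yes} \mid x_1, \dots, x_n)
  = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n)}}
\]
```

The expression in the exponent is the same kind of weighted sum used in linear regression. Wrapping it in this function bends the straight line into an S-shaped curve that stays between 0 and 1, which is why the result can be read as a likelihood.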
Data Mining for the Masses
9) If you ran your model to see your coefficients, return now to design perspective. As you
have done in the examples of the last few chapters, add an Apply Model operator to your
stream, to bring the training and scoring data sets together. Remember that you may need
to disconnect and reconnect some ports, as we did in Chapter 7 (step 13), in order to
merge your two streams together. Be sure your lab and mod ports are both connected to
res ports.
Figure 9-7. Applying the model to the scoring data set.
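For readers who like to see the idea outside of RapidMiner, the train-then-apply pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the attribute values are hypothetical, already standardized numbers (not the chapter's data set), the function names `train` and `apply_model` are made up for this sketch, and the simple gradient descent used here is a stand-in, not the algorithm RapidMiner's operator uses.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, labels, learning_rate=0.1, epochs=500):
    """Learn an intercept and one coefficient per attribute.

    Uses plain stochastic gradient descent (an assumption for this sketch).
    """
    intercept = 0.0
    coefficients = [0.0] * len(rows[0])
    for _ in range(epochs):
        for values, label in zip(rows, labels):
            z = intercept + sum(c * v for c, v in zip(coefficients, values))
            error = label - sigmoid(z)  # positive if we under-predicted 'Yes'
            intercept += learning_rate * error
            coefficients = [c + learning_rate * error * v
                            for c, v in zip(coefficients, values)]
    return intercept, coefficients

def apply_model(model, rows):
    """Score new observations with a model learned from the training data."""
    intercept, coefficients = model
    return [sigmoid(intercept + sum(c * v for c, v in zip(coefficients, values)))
            for values in rows]

# Hypothetical training data: two standardized attributes per person,
# label 1 = suffered a second heart attack, 0 = did not.
training_rows = [[-1.0, -1.2], [-0.8, -0.5], [0.9, 0.7], [1.1, 1.0]]
training_labels = [0, 0, 1, 1]

# Hypothetical scoring data: same attributes, no label.
scoring_rows = [[-0.9, -0.8], [1.0, 0.8]]

model = train(training_rows, training_labels)   # like the Logistic Regression operator
scores = apply_model(model, scoring_rows)       # like the Apply Model operator
```

The key design point mirrors the two streams in the process window: the coefficients are learned only from the labeled training rows, and the scoring rows are touched only when the finished model is applied to them.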
We are finished building the model. Run it now, and we will proceed to…
EVALUATION
Figure 9-8. Coefficients for each predictor attribute.
The initial tab shown in results perspective is a list of our coefficients. These coefficients are used
in the logistic regression algorithm to predict whether or not each person in our scoring data set
In Figure 9-10, we can see that each person has been given a prediction of ‘No’ (they won’t suffer
a second heart attack), or ‘Yes’ (they will). It is critically important to remember at this point of
our evaluation that if this were real, and not a textbook example, these would be real people, with
names, families and lives. Yes, we are using data to evaluate their health, but we shouldn’t treat
these people like numbers. Hopefully our work and analysis will help our imaginary client Sonia in
her efforts to serve these people better. When data mining, we should always keep the human
element in mind, and we’ll talk more about this in Chapter 14.
So we have these predictions that some people in our scoring data set are on the path to a second
heart attack and others are not, but how confident are we in these predictions? The
confidence(Yes) and confidence(No) attributes can help us answer that question. To start, let’s
just consider the person represented on Row 1. This is a single (never been married) 61-year-old
man. He has been classified as overweight, but has lower than average cholesterol (the mean
shown in our meta data in Figure 9-9 is just over 178). He scored right in the middle on our trait
anxiety test at 50, and has attended stress management class. With these personal attributes,
compared with those in our training data, our model offers us an 86.1% level of confidence that
the ‘No’ prediction is correct. This leaves us with 13.9% worth of doubt in our prediction. The
confidence(No) and confidence(Yes) values will always total 1, or in other words, 100%. For each person in the data
set, their attributes are fed into the logistic regression model, and a prediction with confidence
percentages is calculated.
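The relationship between the two confidence attributes and the prediction can be sketched in Python as follows. The two input values come from the rows discussed in this section (13.9% and 99.2% confidence in ‘Yes’); the function name `summarize` is hypothetical, and the rule that the prediction is simply the label with the higher confidence is an assumption made for this sketch.

```python
def summarize(confidence_yes):
    """Derive confidence(No) and the predicted label from confidence(Yes)."""
    confidence_no = 1.0 - confidence_yes  # the two confidences always total 1
    # Assumption: the prediction is whichever label has the higher confidence.
    prediction = "Yes" if confidence_yes > confidence_no else "No"
    return {
        "prediction": prediction,
        "confidence(Yes)": confidence_yes,
        "confidence(No)": confidence_no,
    }

row_1 = summarize(0.139)    # Row 1: 13.9% confidence in 'Yes'
row_11 = summarize(0.992)   # Row 11: 99.2% confidence in 'Yes'

for name, row in (("Row 1", row_1), ("Row 11", row_11)):
    print(name, row["prediction"],
          round(row["confidence(Yes)"], 3), round(row["confidence(No)"], 3))
```

Because one confidence is always 1 minus the other, either value alone is enough to recover both the prediction and our level of doubt in it.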
Let’s consider one other person as an example in Figure 9-10. Look at Row 11. This is a 66-year-old
man who’s been divorced. He’s above the average values in every attribute. While he’s not as
old as some in our data set, he is getting older, and he’s obese. His cholesterol is among the
highest in our data set, he scored higher than average on the trait anxiety test and hasn’t been to a
stress management class. We’re predicting, with 99.2% confidence, that this man will suffer a
second heart attack. The warning signs are all there, and Sonia can now see them fairly easily.
With an understanding of how to read the output, Sonia can now proceed to…