Chapter 9:
Logistic Regression
DEPLOYMENT
In the context of the person represented on Row 11, it seems pretty obvious that Sonia should
reach out to this gentleman right away and offer help on every front. She may want to help him
find a weight loss support group, such as Overeaters Anonymous, provide information about
dealing with divorce and/or stress, and encourage him to work with his doctor to better
regulate his cholesterol through diet and perhaps medication as well. A number of the 690
individuals may fairly clearly need specific help of this kind. To find them, click twice on the
attribute name confidence(Yes). Clicking on a column heading (the attribute name) in RapidMiner's
results perspective will sort the data set by that attribute. Click it once to sort in ascending
order, twice to re-sort in descending order, and a third time to return the data set to its
original order. Figure 9-11 shows our results sorted in descending order on the confidence(Yes)
attribute.
Figure 9-11. Results sorted by confidence(Yes) in descending order (two clicks on the attribute
name).
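For readers who like to see the same idea outside of the RapidMiner interface, here is a minimal sketch in Python of sorting scored rows by confidence(Yes) in descending order, and then counting how many rows meet a cutoff. The row numbers and confidence values are invented for illustration and are not taken from our data set; the function names are hypothetical as well.

```python
# Hypothetical scored rows: (row number, confidence(Yes)).
# These values are invented for illustration only.
scored = [
    (1, 0.412),
    (2, 0.987),
    (3, 0.949),
    (4, 0.950),
    (5, 0.073),
]


def sort_by_confidence(rows, descending=True):
    """Return a new list of rows ordered by their confidence(Yes) value."""
    return sorted(rows, key=lambda row: row[1], reverse=descending)


def count_at_or_above(rows, cutoff):
    """Count rows whose confidence(Yes) meets the cutoff, with no rounding up."""
    return sum(1 for _, confidence in rows if confidence >= cutoff)


ranked = sort_by_confidence(scored)           # highest confidence(Yes) first
high_risk = count_at_or_above(scored, 0.950)  # 0.949 does not qualify
```

Note that the comparison is made on the stored value itself, so a row at 0.949 is left out of the count rather than being rounded up to 0.95.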
If you were to count down from the first record (Row 667) to the point at which our
confidence(Yes) value is 0.950, you would find that there are 140 individuals in the data set for
whom we have a 95% or better confidence that they are at risk for heart attack recurrence (and
that’s not rounding up those who have a 0.949 in the ‘Yes’ column). So there are some who are
certainty that the first will not suffer another heart attack, while predicting with almost 80%
that the other will. Even their weight categories are similar, though being overweight certainly
plays into the second woman’s risk. But what is really evident in comparing these two women is
that the second woman has a cholesterol level that nearly touches the top of our range in this
data set (the upper bound shown in Figure 9-9 is 239), and she hasn’t been to stress management
classes. Perhaps Sonia can use such comparisons to help this woman understand just how
dramatically she can improve her chances of avoiding another heart attack. In essence, Sonia
could say: “There are women who are a lot like you who have almost zero chance of suffering
another heart attack. By lowering your cholesterol, learning to manage your stress, and perhaps
getting your weight down closer to a normal level, you can almost eliminate your risk for another
heart attack.” Sonia could follow up by offering this woman programs targeted specifically at
cholesterol, weight, or stress management.
CHAPTER SUMMARY
Logistic regression is an excellent way to predict whether or not something will happen, and how
confident we are in such predictions. It takes a number of numeric attributes into account, learns
from a training data set how those attributes relate to the outcome, and then predicts the
probable outcomes in a comparable scoring data set. Logistic regression uses a nominal target
attribute (or label, in RapidMiner) to categorize observations in a scoring data set into their
probable outcomes.
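To make the idea of a confidence value a little more concrete, the sketch below shows the standard textbook logistic function turning a weighted sum of numeric attributes into a value between 0 and 1. This is an illustration only: the intercept, the coefficients and the attribute values are all hypothetical, and we are not claiming that this is exactly how RapidMiner's operator computes confidence(Yes) internally.

```python
import math


def confidence_yes(intercept, coefficients, attributes):
    """Logistic function: map a weighted sum of numeric attributes into (0, 1)."""
    z = intercept + sum(c * x for c, x in zip(coefficients, attributes))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model and observation (values invented for illustration).
intercept = -5.0
coefficients = [0.02, 1.0]   # one weight per numeric attribute
observation = [220.0, 1.0]   # the attribute values for one person

p_yes = confidence_yes(intercept, coefficients, observation)
p_no = 1.0 - p_yes           # the two confidences sum to 1

# The nominal prediction is whichever outcome has the higher confidence.
prediction = "Yes" if p_yes >= 0.5 else "No"
```

With a two-value label such as Yes/No, the two confidences always add up to 1, which is why a high confidence(Yes) goes hand in hand with a low confidence(No).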
As with linear regression, the values in the scoring data must fall within the ranges of their
corresponding training data attributes. Outside those bounds, it is unsafe and unwise to draw
conclusions about observations in the scoring data set, since there are no comparable observations
in the training data upon which to base your predictions. When used within these bounds, however,
logistic regression can help us quickly and easily predict the outcome of some phenomenon in a
data set, and determine how confident we can be in the accuracy of that prediction.
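The range requirement described above can be checked mechanically. The following is a minimal sketch in Python, assuming each observation is a dictionary of numeric attributes; the attribute names, the values and the helper function names are all hypothetical and serve only to illustrate the check.

```python
def attribute_ranges(training_rows):
    """Return {attribute: (minimum, maximum)} as observed in the training rows."""
    ranges = {}
    for row in training_rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def within_training_ranges(row, ranges):
    """True when every attribute in the row lies inside its training range."""
    return all(ranges[name][0] <= value <= ranges[name][1]
               for name, value in row.items())


# Hypothetical training and scoring observations (invented for illustration).
training = [
    {"age": 45, "cholesterol": 180},
    {"age": 70, "cholesterol": 239},
    {"age": 58, "cholesterol": 150},
]
scoring = [
    {"age": 60, "cholesterol": 200},  # inside both training ranges
    {"age": 75, "cholesterol": 210},  # age exceeds the training maximum
]

ranges = attribute_ranges(training)
usable = [row for row in scoring if within_training_ranges(row, ranges)]
```

Only the observations kept in `usable` have comparable observations in the training data, so those are the only ones we would want to score.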