Chapter 9:
Logistic Regression
143
Stress_Management: A binary attribute indicating whether or not the person has
previously attended a stress management course: 0 for no; 1 for yes.
Trait_Anxiety: A score on a scale of 0 to 100 measuring the level of each person’s natural
stress levels and abilities to cope with stress. A short time after each person in each of the
two data sets had recovered from their first heart attack, they were administered a standard
test of natural anxiety. Their scores are tabulated and recorded in this attribute along five
point increments. A score of 0 would indicate that the person never feels anxiety, pressure
or stress in any situation, while a score of 100 would indicate that the person lives in a
constant state of being overwhelmed and unable to deal with his or her circumstances.
2nd_Heart_Attack: This attribute exists only in the training data set. It will be our label,
the prediction or target attribute. In the training data set, the attribute is set to ‘yes’ for
individuals who have suffered
second heart attacks, and ‘no’ for those who have not.
DATA PREPARATION
Two data sets have been prepared and are available for you to download from the companion web
site. These are labeled Chapter09DataSet_Training.csv, and Chapter09DataSet_Scoring.csv. If
you would like to follow along with this chapter’s example, download these two datasets now, and
complete the following steps:
1)
Begin the process of importing the training data set first. For the most part, the process
will be the same as what you have done in past chapters, but for logistic regression, there
are a few subtle differences. Be sure to set the first row as the attribute names. On the
fourth step, when setting data types and attribute roles, you will need to make at least one
change. Be sure to set the 2nd_Heart_Attack data type to ‘nominal’, rather than binominal.
Even though it is a yes/no field, and RapidMiner will default it to binominal because of
that, the Logistic Regression operator we’ll be using in our modeling phase expects the
label to be nominal. RapidMiner does not offer binominal-to-nominal or integer-to-
nominal operators, so we need to be sure to set this target attribute to the needed data type
of ‘nominal’ as we import it. This is shown in Figure 9-1:
Data Mining
for the Masses
144
Figure 9-1. Setting the 2nd_Heart_Attack attribute’s
data type to ‘nominal’ during import.
2)
At this time you can also change the 2nd_Heart_Attack attribute’s role to ‘label’, if you
wish. We have not done this in Figure 9-1, and subsequently we will be adding a Set Role
operator to our stream as we continue our data preparation.
3)
Complete the data import process for the training data, then drag and drop the data set
into a new, blank main process. Rename the data set’s Retrieve operator as Training.
4)
Import the scoring data set now. Be sure the data type for all attributes is ‘integer’. This
should be the default, but may not be, so double check. Since the 2nd_Heart_Attack
attribute is not included in the scoring data set, you don’t need to worry about changing it
as you did in step 1. Complete the import process, drag and drop the scoring data set into
your main process and rename this data set’s Retrieve operator to be Scoring. Your model
should now appear similar to Figure 9-2.
Chapter 9: Logistic Regression
145
Figure 9-2. The training and
scoring data sets in a
new main process window in RapidMiner.
5)
Run the model and compare the ranges for all attributes between the scoring and training
result set tabs (Figures 9-3 and 9-4, respectively). You should find that the ranges are the
same. As was the case with Linear Regression, the scoring values must all fall within the
lower and upper bounds set by the corresponding values in the training data set. We can
see in Figures 9-3 and 9-4 that this is the case, so our data are very clean, they were
prepared during extraction from Sonia’s source database, and we will not need to do
further data preparation in order to filter out observations with inconsistent values or
modify missing values.
Figure 9-3. Meta data
for the scoring data set
(note absence of 2nd_Heart_Attack attrtibute).
Data Mining for the Masses
146
Figure 9-4. Meta data for the training data set (2nd_Heart_Attack
attribute is present
with ‘nominal’ data type.) Note that all scoring ranges fall within all training ranges.
6)
Switch back to design perspective and add a Set Role operator to your training stream.
Remember that if you designated 2nd_Heart_Attack to have a ‘label’ role during the
import process, you won’t need to add a Set Role operator at this time. We did not do this
in the book example, so we need the operator to designate 2
nd
_Heart_Attack as our label,
our target attribute:
Figure 9-5. Configuring the 2nd_Heart_Attack attribute’s role in
preparation for logistic regression mining.