Data Mining for the Masses

Yüklə 4,8 Kb.

Pdf görüntüsü

səhifə	40/65
tarix	08.10.2017
ölçüsü	4,8 Kb.
	#3815

1 ... 36 37 38 39 40 41 42 43 ... 65

Chapter 9: Logistic Regression
143


Stress_Management:    A  binary  attribute  indicating  whether  or  not  the  person  has
previously attended a stress management course: 0 for no; 1 for yes.


Trait_Anxiety:  A score on a scale of 0 to 100 measuring the level of each person’s natural
stress levels and abilities to cope with stress.  A short time after each person in each of the
two data sets had recovered from their first heart attack, they were administered a standard
test of natural anxiety.  Their scores are tabulated and recorded in this attribute along five
point increments.  A score of 0 would indicate that the person never feels anxiety, pressure
or  stress  in  any  situation,  while  a  score  of  100  would  indicate  that  the  person  lives  in  a
constant state of being overwhelmed and unable to deal with his or her circumstances.


2nd_Heart_Attack:  This attribute exists only in the training data set.  It will be our label,
the prediction or target attribute.   In the training data set, the attribute is set to ‘yes’ for
individuals who have suffered second heart attacks, and ‘no’ for those who have not.

DATA PREPARATION

Two data sets have been prepared and are available for you to download from the companion web
site.    These  are  labeled  Chapter09DataSet_Training.csv,  and  Chapter09DataSet_Scoring.csv.    If
you would like to follow along with this chapter’s example, download these two datasets now, and
complete the following steps:

1)

Begin the process of importing the training data set first.  For the most part, the process
will be the same as what you have done in past chapters, but for logistic regression, there
are a few subtle differences.  Be sure to set the first row as the attribute names.  On the
fourth step, when setting data types and attribute roles, you will need to make at least one
change. Be sure to set the 2nd_Heart_Attack data type to ‘nominal’, rather than binominal.
Even though it is a yes/no field, and RapidMiner will default it to binominal because of
that,  the  Logistic  Regression  operator  we’ll  be  using  in  our  modeling  phase  expects  the
label  to  be  nominal.    RapidMiner  does  not  offer  binominal-to-nominal  or  integer-to-
nominal operators, so we need to be sure to set this target attribute to the needed data type
of ‘nominal’ as we import it. This is shown in Figure 9-1:

Data Mining for the Masses
144

Figure 9-1. Setting the 2nd_Heart_Attack attribute’s
data type to ‘nominal’ during import.

2)

At  this  time  you  can  also  change  the  2nd_Heart_Attack  attribute’s  role  to  ‘label’,  if  you
wish.  We have not done this in Figure 9-1, and subsequently we will be adding a Set Role
operator to our stream as we continue our data preparation.

3)

Complete  the  data  import  process  for  the  training  data,  then  drag  and  drop  the  data  set
into a new, blank main process.  Rename the data set’s Retrieve operator as Training.

4)

Import the scoring data set now.  Be sure the data type for all attributes is ‘integer’. This
should  be  the  default,  but  may  not  be,  so  double  check.    Since  the  2nd_Heart_Attack
attribute is not included in the scoring data set, you don’t need to worry about changing it
as you did in step 1.  Complete the import process, drag and drop the scoring data set into
your main process and rename this data set’s Retrieve operator to be Scoring.  Your model
should now appear similar to Figure 9-2.

Chapter 9: Logistic Regression
145

Figure 9-2. The training and scoring data sets in a
new main process window in RapidMiner.

5)

Run the model and compare the ranges for all attributes between the scoring and training
result set tabs (Figures 9-3 and 9-4, respectively).  You should find that the ranges are the
same.  As was the case with Linear Regression, the scoring values must all fall within the
lower and upper bounds set by the corresponding values in the training data set.  We can
see  in  Figures  9-3  and  9-4  that  this  is  the  case,  so  our  data  are  very  clean,  they  were
prepared  during  extraction  from  Sonia’s  source  database,  and  we  will  not  need  to  do
further  data  preparation  in  order  to  filter  out  observations  with  inconsistent  values  or
modify missing values.

Figure 9-3. Meta data for the scoring data set
(note absence of 2nd_Heart_Attack attrtibute).

Data Mining for the Masses
146

Figure 9-4. Meta data for the training data set (2nd_Heart_Attack attribute is present
with ‘nominal’ data type.)  Note that all scoring ranges fall within all training ranges.

6)

Switch  back  to  design  perspective  and  add  a  Set  Role  operator  to  your  training  stream.
Remember  that  if  you  designated  2nd_Heart_Attack  to  have  a  ‘label’  role  during  the
import process, you won’t need to add a Set Role operator at this time.  We did not do this
in the book example, so we need the operator to designate 2
nd
_Heart_Attack as our label,
our target attribute:

Figure 9-5. Configuring the 2nd_Heart_Attack attribute’s role in
preparation for logistic regression mining.

Yüklə 4,8 Kb.

Dostları ilə paylaş:

1 ... 36 37 38 39 40 41 42 43 ... 65