Data Mining for the Masses


Chapter 9: Logistic Regression

With the label attribute set, we are now prepared to begin…

MODELING

7)

Using the search field in the Operators tab, locate the Logistic Regression operator. If you
search for just the word ‘logistic’ (as has been done in Figure 9-6), you will see that several
different logistic and logistic regression operators are available to you in RapidMiner. We
will use the first one in this example; however, you are certainly encouraged to experiment
with the others as you would like. Drag the Logistic Regression operator into your training
stream.

Figure 9-6. The Logistic Regression operator in our training stream.

8)

The Logistic Regression operator will generate coefficients for each of our predictor
attributes, in much the same way that the linear regression operator did. If you would like
to see these, you can run your model now. The algebraic formula for logistic regression is
different from, and a bit more complicated than, the one for linear regression. We are no
longer calculating the slope of a straight line; rather, we are trying to determine the
likelihood of an observation falling at a given point along a curved, less well-defined
imaginary line through a data set. The coefficients for logistic regression are used in that
formula.
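
To make the contrast with a straight line concrete, here is a minimal Python sketch of the logistic (S-shaped) function that logistic regression is built on. It is only an illustration, not RapidMiner's implementation: the intercept and slope are made-up values for a single hypothetical predictor, and the function names are our own.

```python
import math

def logistic(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept and slope for one predictor (illustration only;
# these are not coefficients produced by RapidMiner).
INTERCEPT = -4.0
SLOPE = 0.08

def linear_score(x):
    """The straight-line part: it can grow without bound in either direction."""
    return INTERCEPT + SLOPE * x

def probability(x):
    """The curved part: the linear score squeezed into the range 0 to 1."""
    return logistic(linear_score(x))

for x in (0, 25, 50, 75, 100):
    print(f"x={x:>3}  linear score={linear_score(x):+.2f}  probability={probability(x):.3f}")
```

The linear score keeps rising by the same amount for each step in x, while the probability flattens out as it approaches 0 or 1. That flattening is the "curve" described above.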


9)

If you ran your model to see your coefficients, return now to design perspective. As you
have done in the last few chapter examples, add an Apply Model operator to your stream to
bring the training and scoring data sets together. Remember that you may need to
disconnect and reconnect some ports, as we did in Chapter 7 (step 13), in order to merge
your two streams together. Be sure your lab and mod ports are both connected to res
ports.

Figure 9-7. Applying the model to the scoring data set.

We are finished building the model. Run it now, and we will proceed to…

EVALUATION

Figure 9-8. Coefficients for each predictor attribute.

The initial tab shown in results perspective is a list of our coefficients. These coefficients are used
in the logistic regression algorithm to predict whether or not each person in our scoring data set
will suffer a second heart attack, and if so, how confident we are that the prediction will come true.
Switch to the Scoring results tab. We will look first at the meta data (Figure 9-9).

Figure 9-9. Meta data for our scoring predictions.

We can see in this figure that RapidMiner has generated three new attributes for us:
confidence(Yes), confidence(No), and prediction(2nd_Heart_Attack). In our Statistics column, we
find that out of the 690 people represented, we’re predicting that 357 will not suffer second heart
attacks, and that 333 will. Sonia’s hope is that she can engage these 333, and perhaps some of the
357 with low confidence levels on their ‘No’ prediction, in programs to improve their health, and
thus their chances of avoiding another heart attack. Let’s switch to Data View.
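
As a rough illustration of how Sonia might pull out the people she wants to reach, the Python sketch below counts predictions and selects everyone predicted ‘Yes’ plus anyone whose ‘No’ prediction has low confidence. The first two rows use the confidence values quoted in this chapter for Rows 1 and 11; the other two rows, the field names, and the 0.60 cutoff are hypothetical and chosen only for the example.

```python
from collections import Counter

# Rows 1 and 11 use the confidences quoted in the chapter text.
# The two "hypothetical" rows are made up for illustration.
scored = [
    {"row": 1, "prediction": "No", "confidence_yes": 0.139, "confidence_no": 0.861},
    {"row": 11, "prediction": "Yes", "confidence_yes": 0.992, "confidence_no": 0.008},
    {"row": "hypothetical-a", "prediction": "No", "confidence_yes": 0.45, "confidence_no": 0.55},
    {"row": "hypothetical-b", "prediction": "Yes", "confidence_yes": 0.70, "confidence_no": 0.30},
]

def prediction_counts(rows):
    """Count how many rows received each predicted label."""
    return Counter(r["prediction"] for r in rows)

def outreach_candidates(rows, no_threshold=0.60):
    """Return rows predicted 'Yes', plus 'No' rows whose confidence(No)
    falls below no_threshold (an assumed, adjustable cutoff)."""
    return [
        r for r in rows
        if r["prediction"] == "Yes" or r["confidence_no"] < no_threshold
    ]

print(prediction_counts(scored))
print([r["row"] for r in outreach_candidates(scored)])
```

Raising or lowering the cutoff changes how many of the ‘No’ predictions are included, so the right value depends on how many people the programs can actually serve.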

Figure 9-10. Predictions for our 690 patients who have suffered a first heart attack.


In Figure 9-10, we can see that each person has been given a prediction of ‘No’ (they won’t suffer
a second heart attack), or ‘Yes’ (they will). It is critically important to remember at this point of
our evaluation that if this were real, and not a textbook example, these would be real people, with
names, families and lives. Yes, we are using data to evaluate their health, but we shouldn’t treat
these people like numbers. Hopefully our work and analysis will help our imaginary client Sonia in
her efforts to serve these people better. When data mining, we should always keep the human
element in mind, and we’ll talk more about this in Chapter 14.

So we have these predictions that some people in our scoring data set are on the path to a second
heart attack and others are not, but how confident are we in these predictions? The
confidence(Yes) and confidence(No) attributes can help us answer that question. To start, let’s
just consider the person represented on Row 1. This is a single (never been married) 61-year-old
man. He has been classified as overweight, but has lower than average cholesterol (the mean
shown in our meta data in Figure 9-9 is just over 178). He scored right in the middle on our trait
anxiety test at 50, and has attended stress management class. With these personal attributes,
compared with those in our training data, our model offers us an 86.1% level of confidence that
the ‘No’ prediction is correct. This leaves us with 13.9% worth of doubt in our prediction. The
‘No’ and ‘Yes’ confidence values will always total 1, or in other words, 100%. For each person in
the data set, their attributes are fed into the logistic regression model, and a prediction with
confidence percentages is calculated.
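
The Python sketch below shows, under stated assumptions, how one person's attributes could be turned into a prediction with two confidence values that total 1. The coefficients, the intercept, the attribute names, the sample attribute values, and the 0.5 cutoff are all hypothetical; they are not the values RapidMiner reports in Figure 9-8, and RapidMiner's internal calculation may differ in its details.

```python
import math

# Hypothetical coefficients and intercept (NOT the values from Figure 9-8).
COEFFICIENTS = {"age": 0.05, "cholesterol": 0.02, "trait_anxiety": 0.03}
INTERCEPT = -9.0

def score(person):
    """Return (prediction, confidence_yes, confidence_no) for one person."""
    # Weighted sum of the person's attributes plus the intercept.
    z = INTERCEPT + sum(COEFFICIENTS[name] * person[name] for name in COEFFICIENTS)
    # The logistic function converts the sum into a value between 0 and 1.
    confidence_yes = 1.0 / (1.0 + math.exp(-z))
    confidence_no = 1.0 - confidence_yes
    # Assumed rule: predict 'Yes' when confidence(Yes) is at least 0.5.
    prediction = "Yes" if confidence_yes >= 0.5 else "No"
    return prediction, confidence_yes, confidence_no

# Hypothetical attribute values for two people (illustration only).
lower_risk = {"age": 61, "cholesterol": 170, "trait_anxiety": 50}
higher_risk = {"age": 66, "cholesterol": 240, "trait_anxiety": 75}

for person in (lower_risk, higher_risk):
    prediction, conf_yes, conf_no = score(person)
    print(f"{prediction}: confidence(Yes)={conf_yes:.3f}, confidence(No)={conf_no:.3f}")
```

Because confidence(No) is computed as 1 minus confidence(Yes), the two values always add up to 100%, which matches what we see for every row in Figure 9-10.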

Let’s consider one other person as an example in Figure 9-10. Look at Row 11. This is a 66-year-old
man who’s been divorced. He’s above the average values in every attribute. While he’s not as
old as some in our data set, he is getting older, and he’s obese. His cholesterol is among the
highest in our data set, he scored higher than average on the trait anxiety test, and he hasn’t been
to a stress management class. We’re predicting, with 99.2% confidence, that this man will suffer a
second heart attack. The warning signs are all there, and Sonia can now see them fairly easily.
With an understanding of how to read the output, Sonia can now proceed to…
