Data Mining for the Masses


web site does not distinguish single types of people; those who are divorced or widowed
are included with those who have never been married (indicated in the data set as ‘S’).


Website_Activity:  This attribute is an indication of how active each customer is on the
company’s web site.  Working with Richard, we used the web site database’s information,
which records the duration of each customer’s visits to the web site, to calculate how
frequently, and for how long each time, the customers use the web site.  This is then
translated into one of three categories: Seldom, Regular, or Frequent.


Browsed_Electronics_12Mo:  This is simply a Yes/No column indicating whether or not
the person browsed for electronic products on the company’s web site in the past year.


Bought_Electronics_12Mo:  Another Yes/No column indicating whether or not they
purchased an electronic item through Richard’s company’s web site in the past year.


Bought_Digital_Media_18Mo:  This attribute is a Yes/No field indicating whether or
not the person has purchased some form of digital media (such as MP3 music) in the past
year and a half. This attribute does not include digital book purchases.


Bought_Digital_Books:  Richard believes that this attribute will likely be the best
indicator of buying behavior relative to the company’s new eReader.  Thus, this
attribute has been set apart from the purchase of other types of digital media.  Further, this
attribute indicates whether or not the customer has ever bought a digital book, not just in
the past year or so.


Payment_Method:  This attribute indicates how the person pays for their purchases.  In
cases where the person has paid in more than one way, the mode, or most frequent
method of payment, is used. There are four options:


Bank Transfer: payment via e-check or other form of wire transfer directly from the
bank to the company.


Website Account: the customer has set up a credit card or permanent electronic
funds transfer on their account so that purchases are directly charged through their
account at the time of purchase.


Credit Card: the person enters a credit card number and authorization each time
they purchase something through the site.


Monthly Billing: the person makes purchases periodically and receives a paper or
electronic bill, which they pay later either by mailing a check or through the
company web site’s payment system.

Chapter 10: Decision Trees


eReader_Adoption:  This attribute exists only in the training data set.  It consists of data
for customers who purchased the previous-generation eReader.  Those who purchased within
a week of the product’s release are recorded in this attribute as ‘Innovator’.  Those who
purchased after the first week but within the second or third weeks are entered as ‘Early
Adopter’.  Those who purchased after three weeks but within the first two months are
‘Early Majority’.  Those who purchased after the first two months are ‘Late Majority’.  This
attribute will serve as our label when we apply our training data to our scoring data.
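
Two of the attributes described above are derived rather than recorded directly: Payment_Method is the mode of a customer’s payment methods, and eReader_Adoption is a category based on how long after release the customer bought. The following is a minimal Python sketch of how such values might be computed; it is not how Richard’s company actually built the data set. The function names are hypothetical, and the day counts (7, 21 and 60 days for one week, three weeks and two months) and the tie-breaking rule are assumptions that the text does not specify.

```python
from collections import Counter


def most_frequent_payment(methods):
    """Return the mode (most frequent value) of a customer's payment methods.

    Assumption: ties go to the method that appears first in the list;
    the text does not say how ties were resolved.
    """
    if not methods:
        raise ValueError("no payment methods recorded")
    return Counter(methods).most_common(1)[0][0]


def adoption_category(days_after_release):
    """Map days between product release and purchase to an adoption label.

    Assumed cutoffs: one week = 7 days, three weeks = 21 days,
    two months = 60 days.
    """
    if days_after_release < 0:
        raise ValueError("purchase cannot precede release")
    if days_after_release <= 7:
        return "Innovator"
    if days_after_release <= 21:
        return "Early Adopter"
    if days_after_release <= 60:
        return "Early Majority"
    return "Late Majority"


# Invented example values, for illustration only.
example_method = most_frequent_payment(
    ["Credit Card", "Bank Transfer", "Credit Card"]
)
example_label = adoption_category(10)
```

Only the four label values and the four payment options come from the chapter; everything else in the sketch is illustrative.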

With Richard’s data and an understanding of what it means, we can now proceed to…

DATA PREPARATION

This chapter’s example consists of two data sets: Chapter10DataSet_Training.csv and
Chapter10DataSet_Scoring.csv.  Download these from the companion web site now, then
complete the following steps:

1) Import both data sets into your RapidMiner repository.  You do not need to worry about
attribute data types because the Decision Tree operator can handle all types of data.  Be
sure to designate the first row of each of the data sets as the attribute names as
you import.  Save them in the repository with descriptive names, so that you will be able to
tell what they are.

2) Drag and drop both of the data sets into a new main process window.  Rename the
Retrieve objects as Training and Scoring respectively.  Run your model to examine the data
and familiarize yourself with the attributes.

Figure 10-2. Meta data for the scoring data set.


3) Switch back to design perspective.  While there are no missing or apparently inconsistent
values in the data set, there is still some data preparation to do.  First of all, the
User_ID is an arbitrarily assigned value for each customer.  The customer doesn’t use this
value for anything; it is simply a way to uniquely identify each customer in the data set.  It
is not something that relates to each person in any way that would correlate to, or be
predictive of, their buying and technology adoption tendencies.  As such, it should not be
included in the model as an independent variable.

We can handle this attribute in one of two ways.  First, we can remove the attribute using a
Select Attributes operator, as was demonstrated back in Chapter 3.  Alternatively, we can
try a new way of handling a non-predictive attribute.  This is accomplished using the Set
Role operator.  Using the search field in the Operators tab, find and add Set Role operators
to both your training and scoring streams.  In the Parameters area on the right hand side of
the screen, set the role of the User_ID attribute to ‘id’.  This will leave the attribute in the
data set throughout the model, but the model won’t consider the attribute as a predictor for
the label attribute.  Be sure to do this for both the training and scoring data sets, since the
User_ID attribute is found in both of them (Figure 10-3).

Figure 10-3. Setting the User_ID attribute to an ‘id’ role, so
it won’t be considered in the predictive model.
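
For readers who want to see the same idea outside RapidMiner, here is a rough Python sketch of the preparation steps so far: reading a CSV whose first row holds the attribute names, then keeping User_ID with each row while excluding it from the predictors. It is only an analogy to the Set Role operator, not a description of how RapidMiner works internally. The sample rows are invented stand-ins for Chapter10DataSet_Training.csv, and the helper names `load_rows` and `split_by_role` are hypothetical.

```python
import csv
import io

# Invented stand-in for the training file; only the attribute names
# come from the chapter, the values are made up for illustration.
SAMPLE_TRAINING_CSV = """User_ID,Website_Activity,Payment_Method,eReader_Adoption
1001,Seldom,Bank Transfer,Late Majority
1002,Frequent,Credit Card,Innovator
"""


def load_rows(csv_file):
    """Read a CSV, treating the first row as the attribute names."""
    return list(csv.DictReader(csv_file))


def split_by_role(rows, id_attr="User_ID", label_attr=None):
    """Separate id, predictor, and (optional) label values.

    The id stays available for identifying rows but is left out of the
    predictors, loosely mirroring an 'id' role.
    """
    ids = [row[id_attr] for row in rows]
    labels = [row[label_attr] for row in rows] if label_attr else None
    excluded = {id_attr, label_attr}
    predictors = [
        {name: value for name, value in row.items() if name not in excluded}
        for row in rows
    ]
    return ids, predictors, labels


rows = load_rows(io.StringIO(SAMPLE_TRAINING_CSV))
ids, predictors, labels = split_by_role(rows, label_attr="eReader_Adoption")
```

For the scoring data, which has no eReader_Adoption column, the same helper could be called without `label_attr`, in which case `labels` is `None`.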
