Data Mining
for the Masses
160
web site does not distinguish among types of single people; those who are divorced or widowed
are included with those who have never been married (indicated in the data set as ‘S’).
Website_Activity: This attribute indicates how active each customer is on the
company’s web site. Working with Richard, we used the web site database, which records
the duration of each customer’s visits, to calculate how frequently, and for how long
each time, customers use the web site. This is then translated into one of three
categories: Seldom, Regular, or Frequent.
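As a rough illustration of how visit records might be translated into the three categories, here is a minimal Python sketch. The text does not state the cut-offs that were actually used, so the thresholds, the function name website_activity, and its parameters are all hypothetical.

```python
from statistics import mean

# Hypothetical cut-offs; the text does not state the thresholds actually used.
REGULAR_MIN_VISITS = 4       # visits per month
FREQUENT_MIN_VISITS = 12     # visits per month
FREQUENT_MIN_MINUTES = 5.0   # mean minutes per visit


def website_activity(visit_minutes, months_observed):
    """Translate a list of visit durations (in minutes) into an activity category."""
    if not visit_minutes:
        return "Seldom"
    visits_per_month = len(visit_minutes) / months_observed
    avg_minutes = mean(visit_minutes)
    # Frequent requires both many visits and visits of meaningful length.
    if visits_per_month >= FREQUENT_MIN_VISITS and avg_minutes >= FREQUENT_MIN_MINUTES:
        return "Frequent"
    if visits_per_month >= REGULAR_MIN_VISITS:
        return "Regular"
    return "Seldom"
```

Combining visit frequency and visit length in a single rule is only one possible design; a real implementation would use whatever thresholds the business considers meaningful.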
Browsed_Electronics_12Mo: This is simply a Yes/No column indicating whether or not
the person browsed for electronic products on the company’s web site in the past year.
Bought_Electronics_12Mo: Another Yes/No column, this one indicating whether or not the
person purchased an electronic item through Richard’s company’s web site in the past year.
Bought_Digital_Media_18Mo: This attribute is a Yes/No field indicating whether or
not the person has purchased some form of digital media (such as MP3 music) in the past
year and a half. This attribute does not include digital book purchases.
Bought_Digital_Books: Richard believes that this attribute will likely be the best indicator
of buying behavior relative to the company’s new eReader. Thus, this attribute has been set
apart from the purchase of other types of digital media. Further, this attribute indicates
whether or not the customer has ever bought a digital book, not just in the past year or so.
Payment_Method: This attribute indicates how the person pays for their purchases. In
cases where the person has paid in more than one way, the mode, or most frequent
method of payment, is used. There are four options:
Bank Transfer: payment via e-check or other form of wire transfer directly from the
bank to the company.
Website Account: the customer has set up a credit card or permanent electronic
funds transfer on their account so that purchases are directly charged through their
account at the time of purchase.
Credit Card: the person enters a credit card number and authorization each time
they purchase something through the site.
Monthly Billing: the person makes purchases periodically and receives a paper or
electronic bill which they pay later either by mailing a check or through the
company web site’s payment system.
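Taking the mode of a customer's payment history can be sketched in a few lines of Python. This is an illustrative sketch only: the function name most_frequent_payment is hypothetical, and the text does not say how ties were resolved, so the tie-breaking behavior here (the method seen first wins) is an assumption.

```python
from collections import Counter


def most_frequent_payment(methods):
    """Return the most frequent payment method in a customer's purchase history.

    Returns None for an empty history. Ties go to the method seen first,
    which is an assumption; the text does not describe tie handling.
    """
    if not methods:
        return None
    counts = Counter(methods)
    # most_common(1) yields a one-item list of (method, count) pairs.
    return counts.most_common(1)[0][0]
```

For example, a history of two Credit Card purchases and one Bank Transfer purchase would be recorded as Credit Card.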
Chapter 10:
Decision Trees
eReader_Adoption: This attribute exists only in the training data set. It consists of data
for customers who purchased the previous-generation eReader. Those who purchased within a
week of the product’s release are recorded in this attribute as ‘Innovator’. Those who
purchased after the first week but within the second or third weeks are entered as ‘Early
Adopter’. Those who purchased after three weeks but within the first two months are
‘Early Majority’. Those who purchased after the first two months are ‘Late Majority’. This
attribute will serve as our label when we apply our training data to our scoring data.
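The adoption categories described above amount to a set of cut-offs on the number of days between product release and purchase. The following Python sketch shows one way to express them. The function name adoption_category is hypothetical, "two months" is approximated as 60 days, and the treatment of the exact boundary days is an assumption, since the text does not specify either.

```python
from datetime import date


def adoption_category(release_date, purchase_date):
    """Map a purchase date to an eReader_Adoption category."""
    days = (purchase_date - release_date).days
    if days < 0:
        raise ValueError("purchase_date is before release_date")
    if days <= 7:        # within a week of release
        return "Innovator"
    if days <= 21:       # second or third week
        return "Early Adopter"
    if days <= 60:       # assumption: two months treated as 60 days
        return "Early Majority"
    return "Late Majority"
```

Any release date passed to this function is supplied by the caller; the text does not give the actual release date of the previous-generation eReader.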
With Richard’s data and an understanding of what it means, we can now proceed to…
DATA PREPARATION
This chapter’s example consists of two data sets: Chapter10DataSet_Training.csv and
Chapter10DataSet_Scoring.csv. Download these from the companion web site now, then
complete the following steps:
1) Import both data sets into your RapidMiner repository. You do not need to worry about
attribute data types because the Decision Tree operator can handle all types of data. Be
sure to designate the first row of each data set as the attribute names as
you import. Save them in the repository with descriptive names, so that you will be able to
tell what they are.
2) Drag and drop both of the data sets into a new main process window. Rename the
Retrieve objects as Training and Scoring, respectively. Run your model to examine the data
and familiarize yourself with the attributes.
Figure 10-2. Meta data for the scoring data set.
3) Switch back to design perspective. While there are no missing or apparently inconsistent
values in the data set, there is still some data preparation to do. First of all, the
User_ID is an arbitrarily assigned value for each customer. The customer doesn’t use this
value for anything; it is simply a way to uniquely identify each customer in the data set. It
is not something that relates to each person in any way that would correlate to, or be
predictive of, their buying and technology adoption tendencies. As such, it should not be
included in the model as an independent variable.
We can handle this attribute in one of two ways. First, we can remove the attribute using a
Select Attributes operator, as was demonstrated back in Chapter 3. Alternatively, we can
try a new way of handling a non-predictive attribute. This is accomplished using the Set
Role operator. Using the search field in the Operators tab, find and add Set Role operators
to both your training and scoring streams. In the Parameters area on the right-hand side of
the screen, set the role of the User_ID attribute to ‘id’. This will leave the attribute in the
data set throughout the model, but the model won’t consider the attribute as a predictor for
the label attribute. Be sure to do this for both the training and scoring data sets, since the
User_ID attribute is found in both of them (Figure 10-3).
Figure 10-3. Setting the User_ID attribute to an ‘id’ role, so
it won’t be considered in the predictive model.
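For readers who want to see the same preparation outside RapidMiner, the following is a minimal Python sketch, not the book's method. It reads CSV data whose first row holds the attribute names, then separates each row into an id, a set of predictors, and a label. This is roughly what designating the header row on import and setting the User_ID role to ‘id’ accomplish. The function names read_rows and split_roles are hypothetical; the attribute names User_ID and eReader_Adoption come from the data description.

```python
import csv

ID_ATTRIBUTE = "User_ID"              # kept for identification, never used as a predictor
LABEL_ATTRIBUTE = "eReader_Adoption"  # present only in the training data


def read_rows(lines):
    """Parse CSV text whose first row holds the attribute names.

    `lines` may be an open file or any iterable of strings.
    """
    return [dict(row) for row in csv.DictReader(lines)]


def split_roles(row):
    """Return (id_value, predictors, label) for one row.

    The label is None for scoring rows, which have no eReader_Adoption column.
    """
    predictors = {
        name: value
        for name, value in row.items()
        if name not in (ID_ATTRIBUTE, LABEL_ATTRIBUTE)
    }
    return row[ID_ATTRIBUTE], predictors, row.get(LABEL_ATTRIBUTE)
```

To load the training file, one could open Chapter10DataSet_Training.csv with open(..., newline="") and pass the file object to read_rows. As with the Set Role approach, the id stays attached to each row, so predictions can still be traced back to a customer, but it never appears among the predictors.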