Data Mining
for the Masses
178
Career_PP: This is the same as the Personal Points attribute, except it is cumulative for
the athlete’s entire career.
Career_TP: This is the same as the Total Points attribute, except it is cumulative for the
athlete’s entire career.
Career_Assists: This is the same as the Assists attribute, except it is cumulative
for the athlete’s entire career.
Career_Con: This is the same as the Concessions attribute, except it is
cumulative for the athlete’s entire career.
Team_Value: This is a categorical attribute summarizing the athlete’s value to his team.
It is present only in the training data, as it will serve as our label to predict a
Team_Value for each observation in the scoring data set. There are four categories:
Role Player: This is an athlete who is good enough to play at the professional
level, and may be really good in one area, but is not excellent overall.
Contributor: This is an athlete who contributes across several categories of
defense and offense and can be counted on to regularly help the team win.
Franchise Player: This is an athlete whose skills are so broad, strong and
consistent that the team will want to hang on to them for a long time. These
players are of such a talent level that they can form the foundation of a really
good, competitive team.
Superstar: This is that rare individual whose gifts are so superior that they make a
difference in every game. Most teams in the league will have one such player,
but teams with two or three always contend for the league title.
Juan’s data are ready and we understand the attributes available to us. We can now proceed to…
DATA PREPARATION
Access the book’s companion web site and download two files: Chapter11DataSet_Training.csv
and Chapter11DataSet_Scoring.csv. These files contain the 263 current professional athletes and
the 59 prospects, respectively. Complete the following steps:
1) Import both Chapter 11 data sets into your RapidMiner repository. Be sure to designate
the first row as attribute names. You can accept the defaults for data types. Save them
with descriptive names, then drag them and drop them into a new main process window.
Be sure to rename the retrieve objects as Training and Scoring.
2) Add three Set Role operators: two to your training stream and one to your scoring stream.
Use the first in the training stream to set the Player_Name attribute’s role to ‘id’, so it will
not be included in the neural network’s prediction calculations. Do the same with the Set
Role operator in the scoring stream. Finally, use the second Set Role operator in the
training stream to set the Team_Value attribute as the ‘label’ for our model. When you are
finished with steps 1 and 2, your process should look like Figure 11-1.
Figure 11-1. Data preparation for neural network analysis.
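For readers who like to see the idea outside of RapidMiner, the effect of the Set Role operators can be sketched in a few lines of Python. This is only an illustration, not what RapidMiner does internally: the function name split_by_role and the two toy rows are made up for the example, and only the attribute names Player_Name, Career_PP, Career_TP and Team_Value come from the data set described above.

```python
import csv
import io

# Hypothetical rows standing in for the training file. Only the column
# names come from the chapter; the values are invented for illustration.
TOY_TRAINING_CSV = """Player_Name,Career_PP,Career_TP,Team_Value
Player A,120,340,Role Player
Player B,410,980,Superstar
"""


def split_by_role(rows, id_attr, label_attr=None):
    """Separate each row into its id, its regular attributes, and its label.

    The id is kept for reference but left out of the inputs, and the label
    (when present) is kept apart as the value to be predicted.
    """
    ids, features, labels = [], [], []
    for row in rows:
        ids.append(row[id_attr])
        if label_attr is not None:
            labels.append(row[label_attr])
        features.append({
            name: float(value)
            for name, value in row.items()
            if name not in (id_attr, label_attr)
        })
    return ids, features, labels


rows = list(csv.DictReader(io.StringIO(TOY_TRAINING_CSV)))
ids, features, labels = split_by_role(rows, "Player_Name", "Team_Value")
```

For a scoring set, the same function would be called without a label attribute, since Team_Value is present only in the training data.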
3) Go ahead and run the model. Use the meta data view for each of the two data sets to
familiarize yourself with the data. Ensure that your special attributes have their roles set as
they should, in accordance with the parameters you configured in step 2 (see Figures 11-2
and 11-3, which show the meta data).
Figure 11-2. The scoring data set’s meta data with special attribute
Player_Name designated as an ‘id’.
Figure 11-3. The training data set with two special attributes:
Player_Name (‘id’) and Team_Value (‘label’).
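One thing worth checking while you review the meta data is whether each attribute's range in the scoring data falls inside its range in the training data. The sketch below shows one way to make that comparison by hand. It assumes each data set has already been read into a list of dictionaries of numeric values; the function names and the toy numbers are hypothetical and are not taken from the chapter's files.

```python
def attribute_ranges(rows):
    """Return {attribute: (min, max)} for a list of numeric dict rows."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges


def out_of_range(training_rows, scoring_rows):
    """List attributes whose scoring range extends past the training range."""
    train = attribute_ranges(training_rows)
    score = attribute_ranges(scoring_rows)
    report = {}
    for name, (s_low, s_high) in score.items():
        t_low, t_high = train[name]
        if s_low < t_low or s_high > t_high:
            report[name] = {
                "training": (t_low, t_high),
                "scoring": (s_low, s_high),
            }
    return report


# Invented values: Career_PP in the scoring rows goes beyond the training
# range on both ends, while Career_TP stays inside it.
training = [
    {"Career_PP": 100.0, "Career_TP": 300.0},
    {"Career_PP": 400.0, "Career_TP": 900.0},
]
scoring = [
    {"Career_PP": 50.0, "Career_TP": 500.0},
    {"Career_PP": 450.0, "Career_TP": 800.0},
]
flagged = out_of_range(training, scoring)
```

Any attribute that appears in the report has scoring values the model never saw during training.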
4) As you review the data sets, note one characteristic that sets these two apart from
prior example data sets: the ranges for the scoring data set are not within the ranges for
the training data set. Neural network algorithms, including the one used in RapidMiner,
often employ a concept called fuzzy logic, which is an inferential, probability-based
approach to data comparisons allowing us to infer, based on probabilities, the strength of
the relationship between attributes in our data sets. This gives us added flexibility over
some of the other predictive data mining techniques previously shown in this book.
Having reviewed the data sets’ meta data, return to design perspective so that we can
continue with...
MODELING
5) Using the search field on the Operators tab, locate the Neural Net operator and add it to
your training stream. Use Apply Model to apply your neural network to your scoring data
set. Be sure both the mod and lab ports are connected to res ports (Figure 11-4).
Figure 11-4. Generating a neural network model and applying it to our scoring data set.
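To make the train-then-apply pattern concrete, here is a minimal sketch of a feed-forward network with one hidden layer, trained by backpropagation and then applied to new observations. It is a toy written for illustration and is not RapidMiner's implementation: the class name TinyNet, the number of hidden nodes, the learning rate and the four training rows are all assumptions, and the inputs are assumed to be already scaled to the 0-1 range. Only the category names Role Player and Superstar come from the data set.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(zs):
    top = max(zs)
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


class TinyNet:
    """One hidden layer of sigmoid nodes feeding a softmax output layer."""

    def __init__(self, n_in, n_hidden, labels, seed=0):
        rng = random.Random(seed)
        self.labels = list(labels)
        # The last weight in every row is the bias for that node.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in self.labels]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w1]
        hb = hidden + [1.0]
        probs = softmax([sum(w * v for w, v in zip(row, hb))
                         for row in self.w2])
        return xb, hb, probs

    def train(self, rows, targets, epochs=1000, rate=0.5):
        for _ in range(epochs):
            for x, target in zip(rows, targets):
                xb, hb, probs = self.forward(x)
                # Output error for softmax with cross-entropy loss.
                d_out = [p - (1.0 if lab == target else 0.0)
                         for p, lab in zip(probs, self.labels)]
                # Error passed back to each hidden node (bias excluded).
                d_hid = [hb[j] * (1.0 - hb[j])
                         * sum(d * self.w2[k][j] for k, d in enumerate(d_out))
                         for j in range(len(hb) - 1)]
                for k, d in enumerate(d_out):
                    for j, h in enumerate(hb):
                        self.w2[k][j] -= rate * d * h
                for j, d in enumerate(d_hid):
                    for i, v in enumerate(xb):
                        self.w1[j][i] -= rate * d * v

    def predict(self, x):
        """Return the most likely label and a confidence per label."""
        probs = self.forward(x)[2]
        confidences = dict(zip(self.labels, probs))
        return max(confidences, key=confidences.get), confidences


# Invented, pre-scaled training rows: low values for one category,
# high values for the other.
train_rows = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_labels = ["Role Player", "Role Player", "Superstar", "Superstar"]

net = TinyNet(n_in=2, n_hidden=3, labels=["Role Player", "Superstar"])
net.train(train_rows, train_labels)

low_label, low_conf = net.predict([0.15, 0.15])
high_label, high_conf = net.predict([0.85, 0.85])
```

Training plays the part of the Neural Net operator and predict plays the part of Apply Model: each new observation receives a predicted category along with a confidence for every category.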
Run the model again. In results perspective, you will find both a graphical model and our
predictions. At this stage we can begin our…
EVALUATION
Neural networks use what is called a ‘hidden layer’ to compare all attributes in a data set to all
other attributes. The circles in the neural network graph are nodes, and the lines between nodes