Data Mining for the Masses




Career_PP:  This is the same as the Personal Points attribute, except it is cumulative for
the athlete’s entire career.


Career_TP:  This is the same as the Total Points attribute, except it is cumulative for the
athlete’s entire career.


Career_Assists:  This is the same as the Assists attribute, except it is cumulative
for the athlete’s entire career.


Career_Con:  This is the same as the Concessions attribute, except it is cumulative
for the athlete’s entire career.


Team_Value:  This is a categorical attribute summarizing the athlete’s value to his team.
It is present only in the training data, as it will serve as our label to predict a
Team_Value for each observation in the scoring data set.  There are four categories:


Role Player:  This is an athlete who is good enough to play at the professional
level, and may be really good in one area, but is not excellent overall.


Contributor:  This is an athlete who contributes across several categories of
defense and offense and can be counted on to regularly help the team win.


Franchise Player:  This is an athlete whose skills are so broad, strong and
consistent that the team will want to hang on to them for a long time.  These
players are of such a talent level that they can form the foundation of a really
good, competitive team.


Superstar:  This is that rare individual whose gifts are so superior that they make a
difference in every game.  Most teams in the league will have one such player,
but teams with two or three always contend for the league title.

Juan’s data are ready and we understand the attributes available to us. We can now proceed to…

DATA PREPARATION

Access the book’s companion web site and download two files:  Chapter11DataSet_Training.csv
and Chapter11DataSet_Scoring.csv.  These files contain the 263 current professional athletes and
the 59 prospects respectively. Complete the following steps:

1) Import both Chapter 11 data sets into your RapidMiner repository.  Be sure to designate
the first row as attribute names.  You can accept the defaults for data types.  Save them
with descriptive names, then drag them and drop them into a new main process window.
Be sure to rename the retrieve objects as Training and Scoring.
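
For readers who want to see what "designate the first row as attribute names" amounts to outside of RapidMiner, the following is a minimal Python sketch.  It assumes a tiny, made-up stand-in for the training file (the player names and numbers are invented for illustration, and the real file has 263 rows and more attributes); the helper name load_examples is hypothetical.

```python
import csv
import io

# Made-up stand-in for Chapter11DataSet_Training.csv (illustration only).
# The real file has 263 rows and more attributes than the ones shown here.
TRAINING_CSV = """Player_Name,Career_PP,Career_Assists,Team_Value
Player A,1200,340,Role Player
Player B,5400,910,Superstar
"""

def load_examples(text):
    """Read CSV text, treating the first row as attribute names."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

# Name the loaded data set descriptively, as the step suggests.
training = load_examples(TRAINING_CSV)
print(list(training[0].keys()))
print(len(training), "examples loaded")
```

With the real files, the same function could be fed the contents of each downloaded CSV, giving one list for Training and one for Scoring.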

2) Add three Set Role operators:  two to your training stream and one to your scoring stream.
Use the first in the training stream to set the Player_Name attribute’s role to ‘id’, so it will
not be included in the neural network’s prediction calculations.  Do the same with the Set
Role operator in the scoring stream.  Finally, use the second Set Role operator in the
training stream to set the Team_Value attribute as the ‘label’ for our model.  When you are
finished with steps 1 and 2, your process should look like Figure 11-1.

Figure 11-1. Data preparation for neural network analysis.
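
The effect of the Set Role operators can be sketched in a few lines of Python.  This is only an illustration of the idea, not RapidMiner's implementation:  the function name set_roles and the sample rows are hypothetical, and the values are made up.  The point is that the ‘id’ and ‘label’ attributes are set aside so that only the remaining attributes feed the model.

```python
def set_roles(examples, id_attr, label_attr=None):
    """Separate each example into its id, its label (if any) and its features."""
    special = {id_attr, label_attr}
    prepared = []
    for row in examples:
        # Regular attributes: everything that is neither the id nor the label.
        features = {name: float(value)
                    for name, value in row.items() if name not in special}
        label = row[label_attr] if label_attr is not None else None
        prepared.append({"id": row[id_attr], "label": label, "features": features})
    return prepared

# Made-up rows for illustration; the scoring data has no Team_Value attribute.
training_rows = [{"Player_Name": "Player A", "Career_PP": "1200",
                  "Team_Value": "Role Player"}]
scoring_rows = [{"Player_Name": "Prospect X", "Career_PP": "800"}]

training = set_roles(training_rows, "Player_Name", "Team_Value")  # id + label
scoring = set_roles(scoring_rows, "Player_Name")                  # id only
print(training[0])
print(scoring[0])
```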

3) Go ahead and run the model.  Use the meta data view for each of the two data sets to
familiarize yourself with the data.  Ensure that your special attributes have their roles set as
they should, in accordance with the parameters you configured in step 2 (see Figures 11-2
and 11-3, which show meta data).


Figure 11-2. The scoring data set’s meta data with special attribute
Player_Name designated as an ‘id’.

Figure 11-3. The training data set with two special attributes:
Player_Name (‘id’) and Team_Value (‘label’).
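
The meta data view lists a range for each attribute, so it is natural to compare the ranges of the two data sets.  A minimal sketch of that comparison in Python follows; the function names and the sample values are hypothetical and made up for illustration, and the rows are assumed to hold only numeric, regular attributes.

```python
def attribute_ranges(rows):
    """Return {attribute: (min, max)} for a list of numeric dictionaries."""
    ranges = {}
    for row in rows:
        for name, value in row.items():
            low, high = ranges.get(name, (value, value))
            ranges[name] = (min(low, value), max(high, value))
    return ranges

def out_of_range(training_rows, scoring_rows):
    """List attributes whose scoring range extends beyond the training range."""
    train = attribute_ranges(training_rows)
    score = attribute_ranges(scoring_rows)
    flagged = {}
    for name, (low, high) in score.items():
        train_low, train_high = train[name]
        if low < train_low or high > train_high:
            flagged[name] = {"training": (train_low, train_high),
                             "scoring": (low, high)}
    return flagged

# Made-up values for illustration only.
training_rows = [{"Career_PP": 100.0, "Career_Assists": 20.0},
                 {"Career_PP": 900.0, "Career_Assists": 300.0}]
scoring_rows = [{"Career_PP": 50.0, "Career_Assists": 100.0},
                {"Career_PP": 400.0, "Career_Assists": 250.0}]

flagged = out_of_range(training_rows, scoring_rows)
print(flagged)
```

In this toy example only Career_PP is flagged, because its smallest scoring value falls below the smallest training value.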

4) As you review the data sets, note that these two have one characteristic that differs from
prior example data sets:  the ranges for the scoring data set are not within the ranges for
the training data set.  Neural network algorithms, including the one used in RapidMiner,
often employ a concept called fuzzy logic, which is a probability-based approach to data
comparisons that lets us infer the strength of the relationship between attributes in our
data sets.  This gives us added flexibility over some of the other predictive data mining
techniques previously shown in this book.  Having reviewed the data sets’ meta data,
return to design perspective so that we can continue with...

MODELING

5) Using the search field on the Operators tab, locate the Neural Net operator and add it to
your training stream.  Use Apply Model to apply your neural network to your scoring data
set.  Be sure both the mod and lab ports are connected to res ports (Figure 11-4).

Figure 11-4. Generating a neural network model and applying it to our scoring data set.
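
To give a feel for what applying a trained network to a scoring example involves, here is a minimal Python sketch of a forward pass through a network with one hidden layer.  Everything specific in it is an assumption made for illustration:  the weights are invented rather than learned, the two inputs are assumed to be already scaled, and the softmax step used to turn output scores into confidences is a common choice that may differ from what RapidMiner's Neural Net operator does internally.  Only the four category names come from the Team_Value attribute described earlier.

```python
import math

CATEGORIES = ["Role Player", "Contributor", "Franchise Player", "Superstar"]

def sigmoid(x):
    """Squash a weighted sum into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    """Convert raw scores into confidences that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def apply_model(inputs, hidden_weights, output_weights):
    """Forward pass: inputs -> sigmoid hidden layer -> softmax confidences.

    Each weight row has the form [bias, w1, w2, ...].
    """
    hidden = [sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
              for row in hidden_weights]
    scores = [row[0] + sum(w * h for w, h in zip(row[1:], hidden))
              for row in output_weights]
    confidences = softmax(scores)
    best = max(range(len(CATEGORIES)), key=lambda i: confidences[i])
    return CATEGORIES[best], dict(zip(CATEGORIES, confidences))

# Invented weights: two inputs, two hidden nodes, four output nodes.
HIDDEN_WEIGHTS = [[0.1, 0.8, -0.5],
                  [-0.3, 0.4, 0.9]]
OUTPUT_WEIGHTS = [[0.2, -1.0, -1.0],
                  [0.0, 0.5, -0.5],
                  [0.0, -0.5, 0.5],
                  [-0.2, 1.0, 1.0]]

prediction, confidences = apply_model([0.6, 0.7], HIDDEN_WEIGHTS, OUTPUT_WEIGHTS)
print(prediction)
for category, confidence in confidences.items():
    print(f"{category}: {confidence:.3f}")
```

The output has the same general shape as the predictions in results perspective:  one predicted category per example, plus a confidence for each of the four possible categories.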

Run the model again.  In results perspective, you will find both a graphical model and our
predictions.  At this stage we can begin our…

EVALUATION

Neural networks use what is called a ‘hidden layer’ to compare all attributes in a data set to all
other attributes.  The circles in the neural network graph are nodes, and the lines between nodes
