Data Mining for the Masses

REVIEW QUESTIONS

1) What are some of the limitations of correlation models?

2) What is a correlation coefficient? How is it interpreted?

3) What is the difference between a positive and a negative correlation? If two attributes have
values that decrease at essentially the same rate, is that a negative correlation? Why or why
not?

4) How is correlation strength measured? What are the ranges for strengths of correlation?

5) The number of heating oil consuming devices was suggested as a possibly interesting
attribute that could be added to the example data set for this chapter. Can you think of
others? Why might they be interesting? To what other attributes in the data set do you
think your suggested attributes might be correlated? What would be the value in knowing
if they are?

EXERCISE

It is now your turn to develop a correlation model, generate a coefficient matrix, and analyze the
results. To complete this chapter’s exercise, follow the steps below.

1) Select a professional sporting organization that you enjoy, or of which you are aware.
Locate that organization’s web site and search it for statistics, facts and figures about the
athletes in that organization.

2) Open OpenOffice Calc and, starting in cell A1 and working across Row 1 of the spreadsheet,
define some attributes (at least three or four) to hold data about each athlete. Some possible
attributes you may wish to consider are annual_salary, points_per_game, years_as_pro,
height, weight, and age. The list is potentially unlimited, will vary based on the type of
sport you choose, and will depend on the data available to you on the web site you’ve
selected. Measurements of the athletes’ salaries and performance in competition are
likely to be the most interesting. You may include the athletes’ names; however, keep in
mind that correlation coefficients can only be calculated on numeric data, so the name
attribute would need to be removed from your data set before creating your correlation
matrix. (Remember the Select Attributes operator!)

3) Look up the statistics for each of your selected attributes and enter them as observations
into your spreadsheet. Try to find as many as you can; at least thirty is a good rule of
thumb for achieving at least a basic level of statistical validity. More is better.

4) Once you’ve created your data set, use the menu to save it as a CSV file. Click File, then
Save As. Enter a file name, and change ‘Save as type:’ to Text CSV (.csv). Be sure to
save the file in your data mining data folder.

5) Open RapidMiner and import your data set into your RapidMiner repository. Name it
Chapter4Exercise, or something descriptive so that you will remember what data are
contained in the data set when you look in your repository.

6) Add the data set to a new process in RapidMiner. Ensure that the out port is connected to
a res port and run your model. Save your process with a descriptive name if you wish.
Examine your data in results perspective and ensure there are no missing, inconsistent, or
other potentially problematic data that might need to be handled as part of your Data
Preparation phase. Return to design perspective and handle any data preparation tasks that
may be necessary.

7) Add a Correlation Matrix operator to your stream and ensure that the mat port is
connected to a res port. Run your model again. Interpret your correlation coefficients as
displayed on the matrix tab.

8) Document your findings. What correlations exist? How strong are they? Are they
surprising to you, and if so, why? What other attributes would you like to add? Are there
any you’d eliminate now that you’ve mined your data?
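
If you would like to check the coefficients that RapidMiner reports, the calculation can be sketched outside the tool. The Python example below is only an illustration under stated assumptions, not RapidMiner's implementation: the sample rows are invented placeholder values (not real athlete statistics), the function names pearson and correlation_matrix are made up for this sketch, and any column containing a non-numeric or missing value is simply dropped, which plays the role of removing the name attribute in step 2.

```python
import csv
import io
import math

# Invented placeholder data in the shape described in step 2 (not real statistics).
SAMPLE_CSV = """name,years_as_pro,points_per_game,annual_salary
Player A,1,4.0,10
Player B,2,6.0,20
Player C,3,8.0,30
Player D,4,10.0,25
"""


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would raise ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(csv_text):
    """Build a nested dict of pairwise coefficients for the numeric columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    numeric = {}
    for name in rows[0]:
        try:
            numeric[name] = [float(row[name]) for row in rows]
        except (ValueError, TypeError):
            # Assumption of this sketch: drop any column that holds text
            # or a missing value instead of trying to repair it.
            continue
    names = list(numeric)
    return {a: {b: pearson(numeric[a], numeric[b]) for b in names} for a in names}


matrix = correlation_matrix(SAMPLE_CSV)
for attribute, row in matrix.items():
    print(attribute, {other: round(r, 3) for other, r in row.items()})
```

To try it on your own data, you could read the CSV file you saved in step 4 into a string and pass that to correlation_matrix in place of SAMPLE_CSV. Dropping a whole column because of one missing value is a blunt choice; in practice you would handle missing data deliberately, as step 6 asks.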


Challenge step!
9) While still in results perspective, click on the ExampleSet tab (which exists assuming you
left the exa port connected to a res port when you were in design perspective). Click on the
Plot View radio button. Examine the correlations that you found in your model visually by
creating a scatter plot of your data. Choose one attribute for your x-Axis and a correlated
one for your y-Axis. Experiment with the Jitter slide bar. What is it doing? (Hint: Try an
Internet search on the term ‘jittering statistics’.) For an additional visual experience, try a
Scatter 3D or Scatter 3D Color plot. Consider Figures 4-8 and 4-9 as examples. Note that
with 3D plots in RapidMiner, you can click and hold to rotate your plot in order to better
see the interactions between the data.

Figure 4-8. A two-dimensional scatterplot with a
colored third dimension and a slight jitter.

Chapter 4: Correlation

Figure 4-9. A three-dimensional scatterplot with a colored fourth dimension.
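
As a hint for the jitter question in the challenge step, the general statistical idea can be sketched in a few lines of Python. This is a minimal illustration of jittering as commonly described, assuming uniform random noise; it is not a description of how RapidMiner's Jitter slide bar is implemented, and the function name jitter and its parameters are hypothetical.

```python
import random


def jitter(values, amount, seed=None):
    """Return a copy of values with small uniform random noise added.

    Each value is shifted by a random offset between -amount and +amount.
    The noise is for plotting only; the underlying data are left unchanged.
    """
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Three athletes with the same number of years as a pro would be drawn
# on top of each other in a scatter plot.
years_as_pro = [3, 3, 3]
plotted = jitter(years_as_pro, amount=0.2, seed=0)
print(plotted)
```

After jittering, the three identical values are plotted at slightly different positions, so overlapping points become visible as separate marks. The amount parameter corresponds loosely to how far the slider is moved: more jitter separates points further but also distorts their apparent positions.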
