Data Mining for the Masses


Chapter 11: Neural Networks
REVIEW QUESTIONS

1) Where do neural networks get their name? What characteristics of the model make it
‘neural’?

2) Find another observation in this chapter’s example that is interesting but not obvious,
similar to the Lance Goodwin observation. Why is the observation you found interesting?
Why is it less obvious than some?

3) How should confidence percentages be used in conjunction with a neural network’s
predictions?

4) Why might a data miner prefer a neural network over a decision tree?

5) If you want to see a node’s details in a RapidMiner graph of a neural network, what can
you do?

EXERCISE

For this chapter’s exercise, you will create a neural network to predict risk levels for loan applicants
at a bank. Complete the following steps.

1) Access the companion web site for this text. Locate and download the training data set
labeled Chapter11Exercise_TrainingData.csv.

2) Import the training data set into your RapidMiner repository and name it descriptively.
Drag and drop the data set into a new, blank main process.

3) Set the Credit_Risk attribute as your label. Remember that Applicant_ID is not predictive.

4) Add a Neural Net operator to your model.

5) Create your own scoring data set using the attributes in the training data set as a guide.
Enter at least 20 observations. You can enter data for people that you know (you may
have to estimate some of their attribute values, e.g. their credit score), or you can simply
test different values for each of the attributes. For example, you might choose to enter
five consecutive observations with the same values in all attributes except for the credit
score, where you might increment each observation’s credit score by 100 from 400 up to
800.

6) Import your scoring data set and apply your model to it.

7) Run your model and review your predictions for each of your scoring observations.
Report your results, including any interesting or unexpected results.
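
If you would rather generate part of the scoring data set from step 5 than type it by hand, the idea of holding every attribute constant while stepping one of them can be sketched in a few lines of Python. This is only an illustration: Credit_Score and Income are hypothetical column names (use the headers actually found in the training CSV), and the base values are made up.

```python
import csv
import io

# Hypothetical attribute names and values; replace them with the headers
# found in Chapter11Exercise_TrainingData.csv.
BASE_APPLICANT = {"Income": 50000}

def build_scoring_rows(base, attribute, start, stop, step):
    """Return one row per value of `attribute`, all other values held constant."""
    rows = []
    for value in range(start, stop + 1, step):
        row = dict(base)          # copy so each row is independent
        row[attribute] = value
        rows.append(row)
    return rows

def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Credit scores 400, 500, 600, 700 and 800 with everything else unchanged.
rows = build_scoring_rows(BASE_APPLICANT, "Credit_Score", 400, 800, 100)
csv_text = rows_to_csv(rows)
```

The resulting text can be saved to a .csv file and imported into RapidMiner like any other scoring data set.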

Challenge Step!

8) See if you can experiment with different lower bounds for each attribute to find the point
at which a person will be predicted in the ‘DO NOT LEND’ category. Use a combination
of Declare Missing Values and Replace Missing Values operators to try different thresholds
on various attributes. Report your results.
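
Once you have scored a range of values for one attribute, locating the boundary can also be done outside RapidMiner. The sketch below assumes you have copied the attribute values and their predictions into a list of pairs; the function name and the example predictions (including the category labels other than 'DO NOT LEND') are invented for illustration and are not output from the exercise model.

```python
def lowest_approved_value(scored):
    """Given (attribute_value, prediction) pairs, return the smallest value
    whose prediction is not 'DO NOT LEND', or None if every row is declined."""
    approved = [value for value, prediction in scored
                if prediction != "DO NOT LEND"]
    return min(approved) if approved else None

# Made-up predictions for illustration only.
example = [
    (400, "DO NOT LEND"),
    (500, "DO NOT LEND"),
    (600, "Moderate"),
    (700, "Low"),
]
threshold = lowest_approved_value(example)
```

Narrowing the step size between the last declined value and the first approved one then gives a finer estimate of the threshold.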


CHAPTER TWELVE:
TEXT MINING

CONTEXT AND PERSPECTIVE

Gillian is a historian and archivist at a national museum in the United States. She has recently
curated an exhibit on the Federalist Papers. The Federalist Papers are a series of dozens of essays
that were written and published in the late 1700s. The essays were published in two different
newspapers in the state of New York over the course of about one year, and they were released
anonymously under the author name ‘Publius’. Their intent was to educate the American people
about the new nation’s proposed constitution, and to advocate in favor of its ratification. No one
really knew at the time if ‘Publius’ was one individual or many, but several individuals familiar with
the authors and framers of the constitution had spotted some patterns in vocabulary and sentence
structure that seemed similar to sections of the U.S. Constitution. Years later, after Alexander
Hamilton died in 1804, some notes were discovered that revealed that he, James Madison and
John Jay had been the authors of the papers. The notes indicated specific authors for some
papers, but not for others. Specifically, John Jay was revealed to be the author of papers 3, 4
and 5; Madison of paper 14; and Hamilton of paper 17. Paper 18 had no author named, but there
was evidence that Hamilton and Madison worked on that one together.

LEARNING OBJECTIVES

After completing the reading and exercises in this chapter, you should be able to:

• Explain what text mining is, how it is used and the benefits of using it.

• Recognize the various formats that text can be in, in order to perform text mining.

• Connect to and import text as a data source for a text mining model.

• Develop a text mining model in RapidMiner including common text-parsing operators
such as tokenization, stop word filtering, n-gram construction, stemming, etc.

• Apply other data mining models to text mining results in order to predict or classify
based on textual analysis.
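
To give a feel for what the text-parsing steps named above do, here is a minimal Python sketch of tokenization, stop word filtering, stemming and n-gram construction. It is not how RapidMiner implements its operators: the stop word list is a tiny illustrative one, the stemmer is a deliberately naive suffix stripper rather than a real stemming algorithm, and joining n-grams with an underscore is simply a choice made for this example.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n=2):
    """Join each run of n consecutive tokens into a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The essays were published in two newspapers"
tokens = remove_stop_words(tokenize(text))
stems = [naive_stem(t) for t in tokens]
bigrams = ngrams(stems, 2)
```

Each function takes the output of the previous one, which mirrors the way parsing steps are chained one after another in a text mining process.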

ORGANIZATIONAL UNDERSTANDING

Gillian would like to analyze paper 18’s content in the context of the other papers with known
authors, to see if she can generate some evidence that the suspected collaboration between
Hamilton and Madison is in fact a likely scenario. She feels that text mining might be a good
method to analyze the text in a structured way, and has enlisted our help. Having studied all of the
Federalist Papers and other writings by the three statesmen who wrote them, Gillian feels
confident that paper 18 is a collaboration that John Jay did not contribute to: his vocabulary and
grammatical structure were quite different from those of Hamilton and Madison, even when all
three wrote on the same topic, as they had with the Federalist Papers. She would like to look for
word and phrase choice frequencies and present the outcome as part of her exhibit on the papers.
We will help Gillian by constructing a text mining model using the text from the Federalist Papers
and some standard text mining methodologies.
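
The idea of comparing word choice frequencies can be illustrated outside RapidMiner with a short Python sketch. It counts words in each text and compares the counts with cosine similarity, which is one common way to compare frequency profiles and not necessarily the measure used later in this chapter. The snippets below are placeholders invented for the example, not quotations from the Federalist Papers.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count how often each lowercase alphabetic word occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Return the cosine of the angle between two word-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Placeholder snippets standing in for texts by known authors.
known = {
    "author_a": "union union government people",
    "author_b": "commerce trade commerce treaty",
}
disputed = word_counts("union government union")

scores = {name: cosine_similarity(word_counts(text), disputed)
          for name, text in known.items()}
closest = max(scores, key=scores.get)
```

A score near 1 means the two texts use words in similar proportions, while a score of 0 means they share no words at all.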

DATA UNDERSTANDING

Gillian’s data set is simple: we will include the full text of Federalist Papers numbers 5 (Jay), 14
(Madison), 17 (Hamilton), and 18 (suspected collaboration between Madison and Hamilton). The
Federalist Papers are available through a number of sources: they have been re-published in book
form, they are available on a number of different web sites, and their text is archived in many
libraries throughout the world. For this chapter’s exercise, the text of these four papers has been
added to the book’s companion web site. There are four files for you to download:

• Chapter12_Federalist05_Jay.txt

• Chapter12_Federalist14_Madison.txt

• Chapter12_Federalist17_Hamilton.txt

• Chapter12_Federalist18_Collaboration.txt

Please download these now, but do not import them into a RapidMiner repository. The process of
handling textual data in RapidMiner is a bit different from what we have done in past chapters.
With these four papers’ text available to us, we can move directly into the CRISP-DM phase of…
