Chapter 11: Neural Networks
REVIEW QUESTIONS
1) Where do neural networks get their name? What characteristics of the model make it
‘neural’?
2) Find another observation in this chapter’s example that is interesting but not obvious,
similar to the Lance Goodwin observation. Why is the observation you found interesting?
Why is it less obvious than some?
3) How should confidence percentages be used in conjunction with a neural network’s
predictions?
4) Why might a data miner prefer a neural network over a decision tree?
5) If you want to see a node’s details in a RapidMiner graph of a neural network, what can
you do?
EXERCISE
For this chapter’s exercise, you will create a neural network to predict risk levels for loan applicants
at a bank. Complete the following steps.
1) Access the companion web site for this text. Locate and download the training data set
labeled Chapter11Exercise_TrainingData.csv.
2) Import the training data set into your RapidMiner repository and name it descriptively.
Drag and drop the data set into a new, blank main process.
3) Set the Credit_Risk attribute as your label. Remember that Applicant_ID is not predictive.
4) Add a Neural Net operator to your model.
Data Mining for the Masses
5) Create your own scoring data set using the attributes in the training data set as a guide.
Enter at least 20 observations. You can enter data for people that you know (you may
have to estimate some of their attribute values, e.g. their credit score), or you can simply
test different values for each of the attributes. For example, you might choose to enter
five consecutive observations with the same values in all attributes except for the credit
score, where you might increment each observation’s credit score by 100 from 400 up to
800.
6) Import your scoring data set and apply your model to it.
7) Run your model and review your predictions for each of your scoring observations.
Report your results, including any interesting or unexpected results.
Challenge Step!
8) See if you can experiment with different lower bounds for each attribute to find the point
at which a person will be predicted in the ‘DO NOT LEND’ category. Use a combination
of Declare Missing Values and Replace Missing Values operators to try different thresholds
on various attributes. Report your results.
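The credit score sweep suggested in step 5 can also be generated rather than typed by hand. The following is a minimal sketch, not part of the exercise files: the attribute names Credit_Score, Late_Payments and Months_In_Job are hypothetical placeholders and must be replaced with the actual column headers from Chapter11Exercise_TrainingData.csv. It holds every attribute constant except one and steps that one from 400 to 800 in increments of 100.

```python
import csv
import io

# Hypothetical attribute names and values: copy the real column headers
# from Chapter11Exercise_TrainingData.csv before using this.
BASE_APPLICANT = {"Credit_Score": 400, "Late_Payments": 1, "Months_In_Job": 24}


def build_scoring_rows(base, attribute, start, stop, step):
    """Return copies of `base` that differ only in the value of `attribute`."""
    rows = []
    for value in range(start, stop + 1, step):
        row = dict(base)          # copy so each observation is independent
        row[attribute] = value    # vary only the attribute under test
        rows.append(row)
    return rows


def to_csv_text(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_scoring_rows(BASE_APPLICANT, "Credit_Score", 400, 800, 100)
csv_text = to_csv_text(rows)
print(csv_text)
```

The printed text can be saved as a .csv file and imported as the scoring data set in step 6. The label attribute (Credit_Risk) is deliberately left out, since the scoring data is what the model is asked to predict.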
Chapter 12: Text Mining
CHAPTER TWELVE:
TEXT MINING
CONTEXT AND PERSPECTIVE
Gillian is a historian and archivist at a national museum in the United States. She has recently
curated an exhibit on the Federalist Papers. The Federalist Papers are a series of dozens of essays
that were written and published in the late 1700s. The essays were published in two different
newspapers in the state of New York over the course of about one year, and they were released
anonymously under the author name ‘Publius’. Their intent was to educate the American people
about the new nation’s proposed constitution, and to advocate in favor of its ratification. No one
really knew at the time whether ‘Publius’ was one individual or many, but several individuals familiar with
the authors and framers of the Constitution had spotted patterns in vocabulary and sentence
structure that seemed similar to sections of the U.S. Constitution. Years later, after Alexander
Hamilton died in 1804, notes were discovered revealing that Hamilton,
James Madison and John Jay had been the authors of the papers. The notes indicated specific
authors for some papers, but not for others. Specifically, John Jay was revealed to be the author
for papers 3, 4 and 5; Madison for paper 14; and Hamilton for paper 17. Paper 18 had no author
named, but there was evidence that Hamilton and Madison worked on that one together.
LEARNING OBJECTIVES
After completing the reading and exercises in this chapter, you should be able to:
Explain what text mining is, how it is used and the benefits of using it.
Recognize the various formats that text can be in, in order to perform text mining.
Connect to and import text as a data source for a text mining model.
Develop a text mining model in RapidMiner including common text-parsing operators
such as tokenization, stop word filtering, n-gram construction, stemming, etc.
Apply other data mining models to text mining results in order to predict or classify
based on textual analysis.
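To make the text-parsing terms in the objectives above concrete, here is a minimal sketch in plain Python of tokenization, stop word filtering and n-gram construction. It is an illustration only, not how RapidMiner implements these operators: the stop word list is a tiny hypothetical sample, the sample sentence is invented for the demonstration, and stemming is left out because a faithful stemmer needs more than a few lines.

```python
import re

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop very common words that carry little meaning on their own."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n):
    """Return each run of n consecutive tokens, joined with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Invented sample sentence, used only to exercise the functions.
sample = "The safety of the people is the first object of government."
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)
bigrams = ngrams(filtered, 2)
print(filtered)
print(bigrams)
```

Note that the n-grams here are built after stop words are removed, so a bigram such as safety_people joins words that were not adjacent in the original sentence. Whether to filter before or after building n-grams is a modeling choice.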
ORGANIZATIONAL UNDERSTANDING
Gillian would like to analyze paper 18’s content in the context of the other papers with known
authors, to see if she can generate some evidence that the suspected collaboration between
Hamilton and Madison is in fact a likely scenario. She feels that text mining might be a good
method to analyze the text in a structured way, and has enlisted our help. Having studied all of the
Federalist Papers and other writings by the three statesmen who wrote them, Gillian feels
confident that paper 18 is a collaboration that John Jay did not contribute to: his vocabulary and
grammatical structure were quite different from those of Hamilton and Madison, even when all
three wrote on the same topic, as they had with the Federalist Papers. She would like to look for
word and phrase choice frequencies and present the outcome as part of her exhibit on the papers.
We will help Gillian by constructing a text mining model using the text from the Federalist Papers
and some standard text mining methodologies.
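The idea of comparing word choice frequencies across papers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: cosine similarity is one common way to compare frequency profiles and is chosen here for simplicity, not because the chapter prescribes it, and the short strings below are invented placeholders rather than text from the Federalist Papers.

```python
from collections import Counter
import math
import re


def word_frequencies(text):
    """Map each word to its share of all the words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Return a score from 0.0 (no shared words) to 1.0 (identical profiles)."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no profile to compare
    return dot / (norm_a * norm_b)


# Invented placeholder texts; substitute the full text of each paper.
known_texts = {
    "author_a": "upon the whole upon reflection the plan",
    "author_b": "whilst the union whilst the states the plan",
}
disputed_text = "upon the union whilst the states"

disputed_profile = word_frequencies(disputed_text)
for author, text in known_texts.items():
    score = cosine_similarity(word_frequencies(text), disputed_profile)
    print(author, round(score, 3))
```

A higher score means the disputed text uses words in proportions closer to that author's sample. On texts this short the scores mean little; the comparison only becomes informative with full documents.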
DATA UNDERSTANDING
Gillian’s data set is simple: we will include the full text of Federalist Papers number 5 (Jay), 14
(Madison), 17 (Hamilton), and 18 (suspected collaboration between Madison and Hamilton). The
Federalist Papers are available through a number of sources: they have been re-published in book
form, they are available on a number of different web sites, and their text is archived in many
libraries throughout the world. For this chapter’s exercise, the text of these four papers has been
added to the book’s companion web site. There are four files for you to download:
Chapter12_Federalist05_Jay.txt
Chapter12_Federalist14_Madison.txt
Chapter12_Federalist17_Hamilton.txt
Chapter12_Federalist18_Collaboration.txt
Please download these now, but do not import them into a RapidMiner repository. The process of
handling textual data in RapidMiner is a bit different from what we have done in past chapters.
With these four papers’ text available to us, we can move directly into the CRISP-DM phase of…