Data Mining for the Masses



Chapter 12: Text Mining

DEPLOYMENT

Gillian wanted to investigate the similarities and differences among several of the
Federalist Papers in order to lend credence to the belief that Alexander Hamilton and James
Madison collaborated on paper 18.

Figure 12-27. Final cluster results after training our text mining model to
recognize John Jay’s writing style.

Gillian now has the evidence she had hoped to find. As we continued to train our model on John
Jay’s writing style, we found that he was indeed consistent across papers 3, 4, and 5:
RapidMiner found these documents to be the most similar and clustered them together in
cluster_1. At the same time, RapidMiner found paper 18, the suspected collaboration between
Hamilton and Madison, to be associated first with one author, then with the other, and finally
with both of them together. Gillian could further strengthen her model by adding more papers
from all three authors, or she could go ahead and add what we’ve already found to her exhibit
at the museum.

CHAPTER SUMMARY

Text mining is a powerful way of analyzing data in an unstructured format, such as paragraphs
of text. Text can be fed into a model in different ways and then broken down into tokens. Once
tokenized, words can be further manipulated to address matters such as case sensitivity,
phrases or word groupings (n-grams), and word stems. The results of these analyses can reveal
the frequency and commonality of strong words or grams across groups of documents. This can
reveal trends in the text, such as which topics are most important to the author(s), or what
message should be taken away from the documents.
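
The token handling described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions: the sample sentence is made up, the function names (tokenize, naive_stem, bigrams) are hypothetical, and the suffix-stripping rule is a crude stand-in for a real stemming algorithm, not what RapidMiner's stemming operators do.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word tokens, lowercased to remove case sensitivity."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping (an assumption for illustration, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bigrams(tokens):
    """Join each adjacent pair of tokens into a 2-gram."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

# Made-up sample text, not taken from the Federalist Papers.
text = "The Union protects the States. The states protected the union."

tokens = tokenize(text)
stems = [naive_stem(t) for t in tokens]

term_counts = Counter(stems)           # frequency of each stemmed word
gram_counts = Counter(bigrams(stems))  # frequency of each 2-gram

print(term_counts.most_common(3))
print(gram_counts.most_common(3))
```

Lowercasing merges "Union" with "union", and stemming merges "protects" with "protected", so the counts reflect word usage rather than surface spelling.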

Further, once the documents’ tokens are organized into attributes, the documents can be
modeled just as other, more structured data sets can be modeled. Multiple documents can be
handled by a single Process Documents operator in RapidMiner, which applies the same set of
tokenization and token handlers to all documents at once through its sub-process stream. After
a model has been applied to a set of documents, additional documents can be added to the
stream, passed through the document processor, and run through the model to yield
better-trained and more specific results.
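
The idea of one shared processing step feeding a model can be sketched as follows. This is a minimal illustration, not RapidMiner's implementation: the three documents are invented, the names process_document, to_vector, and cosine are hypothetical, and cosine similarity is used here as a simple stand-in for the k-Means clustering used in the chapter.

```python
import math
import re
from collections import Counter

def process_document(text):
    """The single shared handler chain: tokenize and lowercase, then count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def to_vector(counts, vocabulary):
    """Turn token counts into one attribute value per vocabulary term."""
    return [counts.get(term, 0) for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented sample documents (assumptions for illustration only).
documents = {
    "doc_a": "the union must defend the states",
    "doc_b": "the states must defend the union",
    "doc_c": "commerce and trade enrich a nation",
}

# Every document passes through the same processing function.
processed = {name: process_document(text) for name, text in documents.items()}
vocabulary = sorted(set().union(*processed.values()))
vectors = {name: to_vector(counts, vocabulary) for name, counts in processed.items()}

# A document added later goes through the identical pipeline before comparison.
new_vector = to_vector(process_document("trade and commerce enrich the nation"), vocabulary)
closest = max(vectors, key=lambda name: cosine(new_vector, vectors[name]))
print(closest)
```

Because the new document is processed by the same function and mapped onto the same vocabulary, its attribute values are directly comparable to those of the documents already in the set.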

REVIEW QUESTIONS

1) What are some of the benefits of text mining as opposed to the other models you’ve
learned in this book?

2) What are some ways that text-based data is imported into RapidMiner?

3) What is a sub-process and when do you use one in RapidMiner?

4) Define the following terms: token, stem, n-gram, case-sensitive.

5) How does tokenization enable the application of data mining models to text-based data?

6) How do you view a k-Means cluster’s details?

EXERCISE

For this chapter’s exercise, you will mine text for common complaints against a company or
industry. Complete the following steps.

1) Using your favorite search engine, locate a web site or discussion forum on the Internet
where people have posted complaints, criticisms or pleas for help regarding a company or
an industry (e.g. airlines, utility companies, insurance companies, etc.).

2) Copy and paste at least ten of these posts or comments into a text editor, saving each one
as its own text document with a unique name.

3) Open a new, blank process in RapidMiner, and using the Read Documents operator,
connect to each of your ten (or more) text documents containing the customer complaints
you found.

4) Process these documents in RapidMiner. Be sure you tokenize and use other handlers in
your sub-process as you deem appropriate/necessary. Experiment with grams and stems.

5) Use a k-Means cluster to group your documents into two, three or more clusters. Output
your word list as well.

6) Report the following:

   a. Based on your word list, what seem to be the most common complaints or issues in
   your documents? Why do you think that is? What evidence can you give to
   support your claim?

   b. Based on your word list, are there some terms or phrases that show up in all, or at
   least most of your documents? Why do you think these are so common?

   c. Based on your clusters, what groups did you get? What are the common themes in
   each of your clusters? Is this surprising? Why or why not?

   d. How might a customer service manager use your model to address the common
   concerns or issues you found?
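
As a cross-check on the word list questions in step 6, the number of documents each term appears in can be counted directly. The sketch below is an assumption-laden illustration: the three posts are invented placeholders for the complaints you collect, and the helper name terms is hypothetical.

```python
import re
from collections import Counter

# Invented sample posts standing in for the complaints gathered in step 2.
posts = [
    "My flight was delayed and nobody answered the phone.",
    "Delayed again! The refund still has not arrived.",
    "Lost baggage, delayed flight, no refund.",
]

def terms(text):
    """Return the set of distinct lowercase word tokens in one post."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Count, for each term, how many posts contain it at least once.
document_frequency = Counter()
for post in posts:
    document_frequency.update(terms(post))

# Terms that show up in every post.
in_every_post = sorted(t for t, n in document_frequency.items() if n == len(posts))
print(in_every_post)
print(document_frequency.most_common(5))
```

Counting documents rather than total occurrences keeps one long, repetitive post from dominating the list.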

Challenge Step!

7) Using your knowledge from past chapters, remove the k-Means clustering operator and
try to apply a different data mining methodology, such as association rules or decision
trees, to your text documents. Report your results.


SECTION THREE: SPECIAL CONSIDERATIONS IN DATA MINING
