Chapter 12: Text Mining
DEPLOYMENT
Gillian wanted to investigate the similarities and differences among several of the
Federalist Papers in order to lend credence to the belief that Alexander Hamilton and James
Madison collaborated on paper 18.
Figure 12-27. Final cluster results after training our text mining model to
recognize John Jay’s writing style.
Gillian now has the evidence she had hoped to find. As we continued to train our model on John
Jay’s writing style, we found that he was indeed consistent across papers 3, 4, and 5:
RapidMiner found these documents to be the most similar and clustered them together in
cluster_1. At the same time, RapidMiner consistently found paper 18, the suspected
collaboration between Hamilton and Madison, to be associated first with one author, then with
the other, and finally with both of them together. Gillian could further strengthen her model by
adding more papers from all three authors, or she could go ahead and add what we’ve already
found to her exhibit at the museum.
CHAPTER SUMMARY
Text mining is a powerful way of analyzing data in an unstructured format, such as paragraphs of
text. Text can be fed into a model in different ways, and then that text can be broken down into
tokens. Once tokenized, words can be further manipulated to address matters such as case
sensitivity, phrases or word groupings, and word stems. The results of these analyses can reveal
the frequency and commonality of strong words or grams across groups of documents. This can
reveal trends in the text, such as which topics are most important to the author(s), or what
message should be taken away from the text when reading the documents.
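The steps summarized above can be illustrated outside RapidMiner with a short Python sketch. It is only a minimal illustration, not RapidMiner's implementation: the function names, the crude suffix-stripping stemmer (a stand-in for a real stemming algorithm), and the sample sentence are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens, which removes case sensitivity."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def n_grams(tokens, n=2):
    """Join each run of n consecutive tokens into a single term."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def term_frequencies(text, n=2):
    """Count stemmed single words together with the n-grams built from them."""
    stems = [naive_stem(t) for t in tokenize(text)]
    return Counter(stems + n_grams(stems, n))

# Made-up sample text, used only to exercise the functions.
sample = "The united States. The United States governed."
freqs = term_frequencies(sample)
print(freqs.most_common(5))
```

Because the tokens are lowercased and stemmed before counting, "united" and "United" fall into the same term, and the two-word gram built from "the united" is counted twice.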
Further, once the documents’ tokens are organized into attributes, the documents can be modeled,
just as other, more structured data sets can be modeled. Multiple documents can be handled by a
single Process Documents operator in RapidMiner, which will apply the same set of tokenization
and token handlers to all documents at once through the sub-process stream. After a model has
been applied to a set of documents, additional documents can be added to the stream, passed
through the document processor, and run through the model to yield a better-trained model and
more specific results.
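To make the idea of tokens becoming attributes more concrete, the following Python sketch turns a handful of documents into rows of word frequencies and groups them with a plain k-means routine. It is a simplified illustration under stated assumptions, not a reproduction of RapidMiner's operators: the four short documents are invented, the function names are hypothetical, and the centroids are seeded with the first k documents purely to keep the example deterministic.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def to_vectors(docs):
    """Give every document one row of relative word frequencies, one column per word."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero for an empty document
        rows.append([c[w] / total for w in vocab])
    return vocab, rows

def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(rows, k, iterations=10):
    """Plain k-means; the first k rows seed the centroids (a simplification)."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign every document to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j])) for r in rows]
        # Move each centroid to the mean of the documents assigned to it.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Invented documents, used only to show the mechanics.
docs = [
    "treaty war foreign nations peace",
    "taxes revenue commerce trade",
    "foreign war peace treaty dangers",
    "commerce trade revenue duties",
]
vocab, rows = to_vectors(docs)
labels = k_means(rows, k=2)
print(labels)
```

Every document passes through the same tokenizer before it is modeled, which mirrors the way a single document-processing step applies the same handlers to all documents. Adding another document means appending it to `docs` and rerunning the two calls.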
REVIEW QUESTIONS
1) What are some of the benefits of text mining as opposed to the other models you’ve
learned in this book?
2) What are some ways that text-based data can be imported into RapidMiner?
3) What is a sub-process and when do you use one in RapidMiner?
4) Define the following terms: token, stem, n-gram, case-sensitive.
5) How does tokenization enable the application of data mining models to text-based data?
6) How do you view a k-Means cluster’s details?
EXERCISE
For this chapter’s exercise, you will mine text for common complaints against a company or
industry. Complete the following steps.
1) Using your favorite search engine, locate a web site or discussion forum on the Internet
where people have posted complaints, criticisms or pleas for help regarding a company or
an industry (e.g. airlines, utility companies, insurance companies, etc.).
2) Copy and paste at least ten of these posts or comments into a text editor, saving each one
as its own text document with a unique name.
3) Open a new, blank process in RapidMiner, and using the Read Documents operator,
connect to each of your ten (or more) text documents containing the customer complaints
you found.
4) Process these documents in RapidMiner. Be sure you tokenize and use other handlers in
your sub-process as you deem appropriate/necessary. Experiment with grams and stems.
5) Use a k-Means cluster to group your documents into two, three or more clusters. Output
your word list as well.
6) Report the following:
   a. Based on your word list, what seem to be the most common complaints or issues in
   your documents? Why do you think that is? What evidence can you give to
   support your claim?
   b. Based on your word list, are there some terms or phrases that show up in all, or at
   least most of your documents? Why do you think these are so common?
   c. Based on your clusters, what groups did you get? What are the common themes in
   each of your clusters? Is this surprising? Why or why not?
   d. How might a customer service manager use your model to address the common
   concerns or issues you found?
Challenge Step!
7) Using your knowledge from past chapters, remove the k-Means clustering operator, and
try to apply a different data mining methodology, such as association rules or decision trees,
to your text documents. Report your results.
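As a cross-check on the word list that RapidMiner produces in the exercise, a similar list can be approximated in a few lines of Python. The sketch below is only an illustration under assumptions: the function name `build_word_list`, the `*.txt` file pattern, and the two sample complaints are invented for this example, and no stemming or n-grams are applied.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def build_word_list(folder):
    """Return {word: (total occurrences, number of documents containing the word)}."""
    total, in_docs = Counter(), Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        tokens = re.findall(r"[a-z]+", path.read_text(encoding="utf-8").lower())
        total.update(tokens)         # every occurrence counts
        in_docs.update(set(tokens))  # each document counts at most once per word
    return {w: (total[w], in_docs[w]) for w in total}

# Demonstration with two made-up complaints written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "post1.txt").write_text("Flight delayed. Delayed again!", encoding="utf-8")
    Path(folder, "post2.txt").write_text("My flight was cancelled.", encoding="utf-8")
    word_list = build_word_list(folder)

# Show the most frequent words first.
ranked = sorted(word_list.items(), key=lambda kv: -kv[1][0])
for word, (occurrences, documents) in ranked[:3]:
    print(word, occurrences, documents)
```

Comparing the two counts helps with questions 6a and 6b: a word with a high total but a low document count is concentrated in a few posts, while a word that appears in most documents points to a widely shared complaint.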
Data Mining for the Masses
SECTION THREE: SPECIAL CONSIDERATIONS IN DATA MINING