Chapter 12: Text Mining
DATA PREPARATION
The text mining module of RapidMiner is an optional add-in. When you installed RapidMiner
(way back in Step 4 of the ‘Preparing RapidMiner…’ section of Chapter 3), we mentioned that you
might want to include the Text Processing component. Whether you did or did not at that time,
we will need it for this chapter’s example, so we can add it now. Even if you did add it earlier, it
might be a good idea to complete all of the steps below to ensure your Text Processing add-in is
up-to-date.
1) Open RapidMiner to a new, blank process. From the application menu, select Help >
Update RapidMiner…
Figure 12-1. Updating RapidMiner add-ins.
2) Your computer will need to be connected to the Internet, so that it can check Rapid-I’s
servers to see if any updates are available. Once the connection has been established and
the software has checked for available updates, you will see a window similar to Figure 12-2.
Locate Text Processing in the list (it should be about fourth from the top). If it is
grayed out, that means that the add-in is installed and up-to-date on your computer. If it is
not installed, or not up to the current version, it will be orange. You can double click the
Data Mining for the Masses
small square to the left of the Text Processing icon (the circle with ‘ABC’ in it). Then click
the Install button to add or update the module. When it is finished, the window will
disappear and you will be back to your main RapidMiner window.
Figure 12-2. Adding/updating the RapidMiner Text Processing add-in.
3) In the Operators tab in the lower left hand area of your RapidMiner window, locate and
expand the Text Processing operators folder by clicking on the + sign next to it.
Figure 12-3. Finding tools in the Text Processing operator area.
4) Within the Text Processing menu tree, there is an operator called Read Document. Drag
this operator and drop it into your main process window. Right click on it and rename it
‘Paper 5’, as shown in Figure 12-4.
Figure 12-4. Adding a Read Document operator to our model.
5) In the Parameters area of the RapidMiner window (right hand side), note that you must
specify a ‘file’ that RapidMiner can read. Click on the folder icon to the right of the file
parameter to browse for our first text file.
Figure 12-5. Locating the John Jay Federalist Paper (No. 5).
6) In this case, we have saved the text files containing the papers’ text in a folder called
Chapter Data Sets. We have browsed to this folder, and highlighted the John Jay paper.
We can click Open to connect our RapidMiner operator to this text file. This will return us
to our main process window in RapidMiner. Repeat steps 4 and 5 three more times, each
time connecting one of the other papers, preferably in numerical order, to a Read
Document operator in RapidMiner. Use care to ensure that you connect the right operator
to the right text file, so that you can keep the text of each paper straight with the operator
that’s handling it. Once finished, your model should look like Figure 12-6.
Figure 12-6. All four Federalist Paper text files are now connected in RapidMiner.
7) Go ahead and run the model. You will see that each of the four papers has been read
into RapidMiner and can be reviewed in results perspective. After reviewing the text,
return to design perspective.
Figure 12-7. Reviewing the suspected collaboration paper (no. 18) in results perspective.
8) We now have our four essays available in RapidMiner. Reading the papers is not enough,
however. Gillian’s goal is to analyze the papers. For this, we will use a Process
Documents operator. It is located just above the Read Document operator in the Text
Processing menu tree. Drag this operator into your process and drop it into the Paper 5
stream. There will be an empty doc port on the bottom left hand side of the Process
Documents operator. Disconnect the Paper 14 operator’s out port from its res port and connect it
to the open doc port instead. Remember that you can rearrange port connections by
clicking on the first port, then clicking on the second. You will get a warning message asking
you to confirm the disconnect/reconnect action each time you do this. Repeat this process
until all four documents are feeding into the Process Documents operator, as is the case in
Figure 12-8.
Figure 12-8. All four Federalist Papers feeding into a single document processor.
9) Next, double click on the Process Documents operator. This will take us into a
sub-process window.
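
For readers who like to see the same idea outside of a point-and-click tool, the steps above can be sketched in a few lines of Python. This is not RapidMiner code and is not how RapidMiner works internally; it is only a rough analogy. The function names read_documents and collect_for_processing are our own inventions for illustration, and any file paths you pass in are assumptions that depend on where you saved the text files. Each label is tied to exactly one file, much as each renamed Read Document operator is tied to one paper, and all of the texts are then gathered into a single collection, much as all four out ports feed one Process Documents operator.

```python
from pathlib import Path


def read_documents(paths_by_label):
    """Read each text file and return a dict of {label: text}.

    paths_by_label maps a label you choose (for example 'Paper 5')
    to the location of one text file. Both are supplied by the caller;
    nothing here knows the real file names.
    """
    documents = {}
    for label, path in paths_by_label.items():
        # One label bound to exactly one file, so texts cannot get mixed up.
        documents[label] = Path(path).read_text(encoding="utf-8")
    return documents


def collect_for_processing(documents):
    """Gather every document into one ordered list of (label, text) pairs.

    This mirrors feeding all of the documents into a single processing
    step instead of inspecting each one separately.
    """
    return [(label, text) for label, text in documents.items()]
```

To try it, you would build a dictionary that pairs a label such as 'Paper 5' with the path to the matching text file on your own computer, pass it to read_documents, and then hand the result to collect_for_processing. Keeping the label and the file together in one place serves the same purpose as the care we took in step 6 to connect the right operator to the right file.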