Data Mining for the Masses

Chapter 12: Text Mining

DATA PREPARATION

The text mining module of RapidMiner is an optional add-in. When you installed RapidMiner
(way back in Step 4 of the ‘Preparing RapidMiner…’ section of Chapter 3), we mentioned that you
might want to include the Text Processing component. Whether or not you did so at that time,
we will need it for this chapter’s example, so we can add it now. Even if you did add it earlier, it
might be a good idea to complete all of the steps below to ensure your Text Processing add-in is
up to date.

1) Open RapidMiner to a new, blank process. From the application menu, select Help >
Update RapidMiner…

Figure 12-1. Updating RapidMiner add-ins.

2) Your computer will need to be connected to the Internet so that it can check Rapid-I’s
servers for available updates. Once the connection has been established and the software
has checked for updates, you will see a window similar to Figure 12-2. Locate Text
Processing in the list (it should be about fourth from the top). If it is grayed out, the
add-in is installed and up to date on your computer. If it is not installed, or not up to the
current version, it will be orange. In that case, double-click the small square to the left of
the Text Processing icon (the circle with ‘ABC’ in it), then click the Install button to add or
update the module. When it is finished, the window will disappear and you will be back at
your main RapidMiner window.

Figure 12-2. Adding/updating the RapidMiner Text Processing add-in.

3) In the Operators tab in the lower left-hand area of your RapidMiner window, locate and
expand the Text Processing operators folder by clicking on the + sign next to it.

Figure 12-3. Finding tools in the Text Processing operator area.

4) Within the Text Processing menu tree, there is an operator called Read Document. Drag
this operator and drop it into your main process window. Right-click on it and rename it
‘Paper 5’, as shown in Figure 12-4.

Figure 12-4. Adding a Read Document operator to our model.

5) In the Parameters area of the RapidMiner window (right-hand side), note that you must
specify a ‘file’ that RapidMiner can read. Click on the folder icon to the right of the file
parameter to browse for our first text file.

Figure 12-5. Locating the John Jay Federalist Paper (No. 5).

6) In this case, we have saved the text files containing the papers’ text in a folder called
Chapter Data Sets. We have browsed to this folder and highlighted the John Jay paper.
We can click Open to connect our RapidMiner operator to this text file. This will return us
to our main process window in RapidMiner. Repeat steps 4 through 6 three more times, each
time connecting one of the other papers, preferably in numerical order, to a new Read
Document operator in RapidMiner. Take care to connect the right operator to the right text
file, so that you can keep the text of each paper straight with the operator that’s handling
it. Once finished, your model should look like Figure 12-6.

Figure 12-6. All four Federalist Paper text files are now connected in RapidMiner.

7) Go ahead and run the model. You will see that each of the four papers has been read
into RapidMiner and can be reviewed in results perspective. After reviewing the text,
return to design perspective.

Figure 12-7. Reviewing the suspected collaboration paper (no. 18) in results perspective.

8) We now have our four essays available in RapidMiner. Reading the papers is not enough,
however. Gillian’s goal is to analyze the papers. For this, we will use a Process
Documents operator. It is located just above the Read Document operator in the Text
Processing menu tree. Drag this operator into your process and drop it into the Paper 5
stream. There will be an empty doc port on the bottom left-hand side of the Process
Documents operator. Disconnect the out port of your Paper 14 operator from its res port
and connect it to the open doc port instead. Remember that you can rearrange port
connections by clicking on the first port, then clicking on the second. You will get a
warning message asking you to confirm the disconnect/reconnect action each time you do
this. Repeat this process until all four documents are feeding into the Process Documents
operator, as is the case in Figure 12-8.

Figure 12-8. All four Federalist Papers feeding into a single document processor.

9) Next, double-click on the Process Documents operator. This will take us into a
sub-process window.
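
For readers who like to see the same idea outside the RapidMiner interface, the steps above can be loosely sketched in Python: several labelled text files are read in, and all of them are then passed through one shared processing step. This is only an illustration, not how RapidMiner works internally. The function names, the file names, and the placeholder text are all invented for this sketch, and the simple word count stands in for whatever we later place inside the Process Documents sub-process.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path


def read_documents(paths):
    """Load each labelled text file, much like one Read Document operator per paper."""
    return {label: Path(path).read_text(encoding="utf-8") for label, path in paths.items()}


def process_documents(documents):
    """Run every document through one shared step: lowercase, split into words, count."""
    return {
        label: Counter(re.findall(r"[a-z]+", text.lower()))
        for label, text in documents.items()
    }


if __name__ == "__main__":
    # Placeholder text and file names, invented for illustration only;
    # substitute the paths to your own copies of the papers.
    placeholder_text = {
        "Paper 5": "union safety union",
        "Paper 14": "republic union republic",
    }
    with tempfile.TemporaryDirectory() as folder:
        paths = {}
        for label, text in placeholder_text.items():
            path = Path(folder) / (label.replace(" ", "_") + ".txt")
            path.write_text(text, encoding="utf-8")
            paths[label] = path

        # All documents feed into the same processing function.
        word_counts = process_documents(read_documents(paths))
        for label, counts in word_counts.items():
            print(label, counts.most_common(2))
```

Keeping a label attached to each document mirrors the advice in step 6: it is what lets us keep each paper's text straight with the results produced for it.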
