The arboretum procedure


ROWPRO|WORDPRO = SAS-data-set




Specifies the data set to which the projection of the rows of the input matrix onto the rows of the matrix V will be written. If the IN_V option is specified, the data set named in that option is used for the projection; otherwise, V is calculated from the input data set. If SCALEROW|SCALEWORD or SCALEALL is specified together with IN_V, then IN_S must also be specified.



SCALECOL|SCALEDOC, SCALEROW|SCALEWORD, SCALEALL

Requests that the associated projections (column, row, or all) be scaled by the inverse of the singular values. SCALEALL specifies that both the document (column) and the word (row) projections be scaled. SCALECOL or SCALEDOC specifies that the document (columns of the input matrix) projections be scaled. SCALEROW or SCALEWORD specifies that the term (rows of the input matrix) projections be scaled. If $p_{ij}$ is the $i$th coordinate of the projected image of the $j$th document, then scaling replaces the formula

$$p_{ij} = u_i^T a_j$$

with

$$p_{ij} = \frac{u_i^T a_j}{s_i}$$

where $u_i$ is the $i$th column of U, $a_j$ is the $j$th column of the weighted input matrix, and $s_i$ is the $i$th singular value (the $i$th entry on the diagonal of S).

Scaling has two functions. First, it puts more weight on those themes in a document that are uncommon in the document collection. Second, if either the terms or the documents, but not both, are scaled and both are placed in the same space, then terms and documents that are highly associated are more likely to be near each other.
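
As a rough numerical illustration of the scaling described above, the following Python sketch projects one document vector onto two singular vectors and then divides each coordinate by the matching singular value. It is not SAS code and does not claim to reproduce the procedure's internals; the matrix U, the singular values s, and the document vector are made-up values chosen so that the arithmetic is easy to follow.

```python
# Illustrative sketch only: U_cols, s, and doc are hypothetical values,
# not output from PROC SPSVD.

def project(doc, U_cols, singular_values=None):
    """Project doc onto each column of U; optionally divide by s_i."""
    coords = []
    for i, u in enumerate(U_cols):
        p = sum(u_k * d_k for u_k, d_k in zip(u, doc))  # u_i^T a_j
        if singular_values is not None:
            p /= singular_values[i]                      # scale by 1 / s_i
        coords.append(p)
    return coords

# Two orthonormal columns of U, stored as a list of columns.
U_cols = [[0.6, 0.8], [-0.8, 0.6]]
s = [5.0, 2.0]      # singular values, largest first
doc = [1.0, 2.0]    # one weighted document (a column of the input matrix)

unscaled = project(doc, U_cols)
scaled = project(doc, U_cols, s)
print(unscaled)
print(scaled)
```

With these values the unscaled coordinates are approximately (2.2, 0.4) and the scaled coordinates approximately (0.44, 0.2). The first coordinate shrinks by a factor of 5 and the second only by a factor of 2, so the dimension with the smaller singular value carries more relative weight after scaling.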

NORMCOL|NORMDOC, NORMROW|NORMWORD, NORMALL

Requests that the Euclidean length of the document (column) projections, the word (row) projections, or both be normalized. For example, if NORMCOL, NORMDOC, or NORMALL is specified, then each observation in the data set specified by the DOCPRO option will have a length of 1. This is useful because it brings documents with similar content but different lengths closer together. For most text mining applications, NORMALL is suggested.
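
The effect of normalization on documents of different lengths can be sketched in a few lines of Python. This is an illustration rather than the procedure's implementation: the two projection vectors are invented, and leaving an all-zero vector unchanged is an assumption of this sketch, not documented behavior.

```python
import math

def normalize(vec):
    """Rescale vec to Euclidean length 1.

    An all-zero vector is returned unchanged; that choice is an
    assumption of this sketch.
    """
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0.0:
        return list(vec)
    return [x / length for x in vec]

short_doc = [3.0, 4.0]   # hypothetical projection of a short document
long_doc = [6.0, 8.0]    # same direction, twice the length

n_short = normalize(short_doc)
n_long = normalize(long_doc)
print(n_short, n_long)
```

Before normalization the two vectors are 5 units apart; afterward both map to the same point on the unit circle, which is the sense in which documents with similar content but different lengths are brought closer together.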

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.



The SPSVD Procedure

Example


Example 1: Use the SPSVD procedure for training and validation

Suppose there are two data sets, SASUSER.TRAIN and SASUSER.VALID, produced by the Text Parsing node in Enterprise Miner. You want to use SASUSER.TRAIN for training a predictive model and SASUSER.VALID for validation.

/* Training: weight the counts, compute the SVD, and project the documents */
PROC SPSVD DATA=SASUSER.TRAIN K=200 P=50 LOCAL=LOG GLOBAL=ENTROPY;
  ROW KEY;
  COL DOC;
  ENTRY COUNT;
  OUTPUT U=SASUSER.U V=SASUSER.V S=SASUSER.S NORMALL SCALEALL
         DOCPRO=SASUSER.TRAINDP GWGT=SASUSER.WEIGHTS;
RUN;

/* Validation: reuse U, S, and the global weights from the training step */
PROC SPSVD DATA=SASUSER.VALID IN_U=SASUSER.U IN_S=SASUSER.S
     LOCAL=LOG IN_GLOBAL=SASUSER.WEIGHTS;
  ROW KEY;
  COL DOC;
  ENTRY COUNT;
  OUTPUT NORMALL SCALEALL DOCPRO=SASUSER.VALIDDP;
RUN;


The first PROC statement applies a local log, global entropy weighting scheme to the training data set. The ROW, COL, and ENTRY statements specify the names given to these variables by the Text Parsing node in Enterprise Miner. Once the procedure has weighted the matrix, it calculates 200 (specified in the K= option) columns of U and V based on this weighted matrix. The weighted training data set is then projected onto the first 200 columns of U, scaled by the inverse singular values, and normalized to unit length. The result of the projection is written to SASUSER.TRAINDP. The calculated global weights (entropy in this case) are saved to the data set SASUSER.WEIGHTS.
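
To make the weighting step more concrete, here is a small Python sketch of a log-entropy scheme. It uses the definitions commonly given in the text-mining literature (local weight log2(f + 1), global weight 1 plus the normalized entropy sum); whether PROC SPSVD uses exactly these formulas is an assumption, and the count matrix is invented for illustration.

```python
import math

def entropy_weights(counts):
    """Global entropy weight per term.

    counts[i][j] is the frequency of term i in document j.
    Requires at least two documents so that log2(n_docs) is nonzero.
    """
    n_docs = len(counts[0])
    weights = []
    for row in counts:
        total = sum(row)
        acc = 0.0
        for f in row:
            if f > 0:
                p = f / total            # share of the term's occurrences
                acc += p * math.log2(p)  # always <= 0
        weights.append(1.0 + acc / math.log2(n_docs))
    return weights

def weight_matrix(counts, global_weights):
    """Apply the local log weight, then multiply each row by its global weight."""
    return [[g * math.log2(f + 1) for f in row]
            for row, g in zip(counts, global_weights)]

# Hypothetical term-by-document counts: 2 terms, 4 documents.
counts = [[1, 1, 1, 1],   # term spread evenly over all documents
          [4, 0, 0, 0]]   # term concentrated in one document

g = entropy_weights(counts)
weighted = weight_matrix(counts, g)
print(g)
print(weighted)
```

Under these definitions the evenly spread term receives a global weight of 0 and the concentrated term a weight of 1, so terms that occur everywhere contribute nothing to the weighted matrix while terms that discriminate between documents keep their full local weight.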

The second PROC statement projects the validation data set using the calculations from the training data set. This is done by specifying the U and S matrices calculated in the first PROC step with the IN_U and IN_S options. Notice that you do not need to specify the V data set because you are not projecting the terms. To project the documents in the validation data set, specify the same local weighting option for the validation data set and pass the calculated global weights via the IN_GLOBAL option. Then, request that the normalized, scaled projection be written to SASUSER.VALIDDP. This way the validation data set is weighted in exactly the same way as the training data set. Using the GLOBAL option on the validation data set would cause new global weights to be calculated based on the data in this set, which is not appropriate in this example because you want each dimension in the validation data set to correspond to a dimension in the training data set.
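
The idea of reusing the training weights can be sketched in Python as follows. The term names, weights, and counts are all hypothetical, the local weight is assumed to be log2(f + 1), and skipping terms that have no training weight is a choice made for this sketch rather than documented behavior of the procedure.

```python
import math

# Hypothetical global weights saved from the training step, keyed by term.
train_weights = {"cat": 0.2, "dog": 0.7}

# Hypothetical validation counts: term -> per-document frequencies.
valid_counts = {"cat": [2, 0], "dog": [0, 3], "emu": [1, 1]}

def apply_training_weights(counts, weights):
    """Weight validation counts with the local log and the TRAINING global weights.

    Terms without a training weight are skipped; that is an assumption
    of this sketch.
    """
    weighted = {}
    for term, row in counts.items():
        if term in weights:
            weighted[term] = [weights[term] * math.log2(f + 1) for f in row]
    return weighted

valid_weighted = apply_training_weights(valid_counts, train_weights)
print(sorted(valid_weighted))
print(valid_weighted["dog"])
```

Because no weight is recomputed from the validation counts, each term is scaled exactly as it was in training, which is what keeps the projected validation dimensions comparable with the training dimensions.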





The STDIZE Procedure


Overview

Procedure Syntax

PROC STDIZE Statement

BY Statement

FREQ Statement

LOCATION Statement

SCALE Statement

VAR Statement

WEIGHT Statement



Details

Examples

Example 1: Getting Started with the STDIZE Procedure

Example 2: Unstandardizing a Data Set

Example 3: Replacing Missing Values with Standardizing

Example 4: Replacing Missing Values without Standardizing the Variables

References
