ROWPRO|WORDPRO = SAS-data-set
Specifies the data set to which the projection of the rows of the input matrix onto the rows of the
matrix V is written. If the IN_V option is specified, the data set that it names is used for the
projection; otherwise V is calculated from the input data set. If SCALEROW|SCALEWORD or
SCALEALL is specified together with IN_V, then IN_S must also be specified.
SCALECOL|SCALEDOC, SCALEROW|SCALEWORD, SCALEALL
Requests that the associated projections (column, row, or all) be scaled by the inverse of the
singular values. SCALEALL specifies that both the document (column) and the word (row)
projections be scaled. SCALECOL or SCALEDOC specifies that the document (columns
of the input matrix) projections be scaled. SCALEROW or SCALEWORD specifies that the term
(rows of the input matrix) projections be scaled. If $p_{ij}$ is the $i$th coordinate of the projected
image of the $j$th document, then scaling replaces $p_{ij}$ with $p_{ij} / s_i$, where $s_i$ is the $i$th
singular value (the $i$th entry on the diagonal of S).
Scaling has two functions. First, it puts more weight on those themes in a document that are
uncommon in the document collection. Second, if either the terms or the documents, but not both,
are scaled, and both are placed in the same space, then the terms and documents that are highly
associated are more likely to be near each other.
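The effect of scaling can be illustrated with a minimal sketch in plain Python rather than SAS. The function name `scale_projection` and all numbers are made up for illustration and are not part of PROC SPSVD; the sketch only shows the division of each coordinate by the singular value of its dimension.

```python
def scale_projection(projection, singular_values):
    """Divide the i-th coordinate of a projected vector by the i-th singular value."""
    if len(projection) != len(singular_values):
        raise ValueError("one singular value is needed per coordinate")
    return [p / s for p, s in zip(projection, singular_values)]


# Hypothetical 3-dimensional projection of one document, with made-up singular values.
doc = [8.0, 2.0, 1.0]
sigma = [4.0, 2.0, 0.5]
scaled = scale_projection(doc, sigma)
print(scaled)
```

In this made-up case the third coordinate, which belongs to the smallest singular value, grows from 1.0 to 2.0 while the first shrinks from 8.0 to 2.0, so dimensions with small singular values gain weight relative to the dominant ones.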
NORMCOL|NORMDOC, NORMROW|NORMWORD, NORMALL
Requests that the Euclidean length of the document (column) projections, the word (row)
projections, or both be normalized. For example, if NORMCOL, NORMDOC, or NORMALL is
specified, then each observation in the data set specified by the DOCPRO option has a length of 1.
This is useful because it brings documents with similar content but different lengths closer
together. For most text mining applications, NORMALL is suggested.
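As a small illustration of why normalization helps, the following Python sketch (not SAS; the function `normalize` and the two vectors are hypothetical) rescales two projections that point in the same direction but differ in length.

```python
import math


def normalize(vector):
    """Rescale a vector to Euclidean length 1; a zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)
    return [x / length for x in vector]


short_doc = [3.0, 4.0]    # hypothetical projection of a short document
long_doc = [30.0, 40.0]   # same direction, ten times longer
print(normalize(short_doc))
print(normalize(long_doc))
```

Both vectors end up at the same point on the unit circle, which is the sense in which documents with similar content but different lengths are brought together.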
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The SPSVD Procedure
Example
Example 1: Use the SPSVD procedure for training and validation
Suppose there are two data sets, SASUSER.TRAIN and SASUSER.VALID, produced by the Text
Parsing node in Enterprise Miner. You want to use SASUSER.TRAIN for training a predictive model
and SASUSER.VALID for validation.
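/* Training run: weight the counts, compute the SVD, and write the */
/* scaled, normalized document projections and the global weights. */
PROC SPSVD DATA=SASUSER.TRAIN K=200 P=50 LOCAL=LOG GLOBAL=ENTROPY;
   ROW KEY;
   COL DOC;
   ENTRY COUNT;
   OUTPUT U=SASUSER.U V=SASUSER.V S=SASUSER.S NORMALL SCALEALL
          DOCPRO=SASUSER.TRAINDP GWGT=SASUSER.WEIGHTS;
RUN;

/* Validation run: reuse U, S, and the global weights from training. */
PROC SPSVD DATA=SASUSER.VALID IN_U=SASUSER.U IN_S=SASUSER.S
           LOCAL=LOG IN_GLOBAL=SASUSER.WEIGHTS;
   ROW KEY;
   COL DOC;
   ENTRY COUNT;
   OUTPUT NORMALL SCALEALL DOCPRO=SASUSER.VALIDDP;
RUN;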
The first PROC statement applies a local log, global entropy weighting scheme to the training data set.
The ROW, COL, and ENTRY statements specify the names given to these variables by the Text Parsing
node in Enterprise Miner. Once the procedure has weighted the matrix, it calculates 200 (specified in the
K= option) columns of U, V, and S based on this weighted matrix. The weighted training data set is
then projected onto the first 200 columns of U, scaled by the inverse singular values, and normalized
to unit length. The result of the projection is written to SASUSER.TRAINDP. The calculated global
weights (entropy in this case) are saved to the data set SASUSER.WEIGHTS.
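To give a feel for what a global entropy weight measures, here is a minimal Python sketch (not SAS). This section does not state the formula that PROC SPSVD uses, so the sketch assumes one common definition of the entropy weight, 1 + sum(p * log2(p)) / log2(n), where p is the share of a term's occurrences that fall in a given document and n is the number of documents. The function name `entropy_weights` and the counts are hypothetical.

```python
import math


def entropy_weights(counts):
    """Return one global weight per term.

    counts[t][d] is the frequency of term t in document d.
    Assumed formula (not taken from the PROC SPSVD documentation):
    1 + sum_d(p * log2(p)) / log2(n_docs), with p = counts[t][d] / total of row t.
    """
    n_docs = len(counts[0])
    if n_docs < 2:
        raise ValueError("at least two documents are needed")
    weights = []
    for row in counts:
        total = sum(row)
        if total == 0:
            weights.append(0.0)  # term never occurs, so its weight does not matter
            continue
        sum_p_log_p = sum((f / total) * math.log2(f / total) for f in row if f > 0)
        weights.append(1.0 + sum_p_log_p / math.log2(n_docs))
    return weights


training_counts = [
    [1, 1, 1, 1],  # term spread evenly over all four documents
    [4, 0, 0, 0],  # term concentrated in a single document
]
print(entropy_weights(training_counts))
```

Under this assumed definition, the evenly spread term receives weight 0 and the concentrated term receives weight 1, so terms that discriminate between documents count for more.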
The second PROC statement projects the validation data set using the calculations from the training
data set. This is done by specifying the U and S matrices calculated in the first PROC step with the
IN_U and IN_S options. Notice that you do not need to specify the V data set because you are not
projecting the terms. To project the documents in the validation data set, specify the same local
weighting option for the validation data set and pass the calculated global weights via the IN_GLOBAL
option. Then, request that the normalized, scaled projection be written to SASUSER.VALIDDP. This way
the validation data set is weighted in exactly the same way as the training data set. Using the GLOBAL
option on the validation data set would cause new global weights to be calculated based on the data in
this set, which is not appropriate in this example because you want each dimension in the validation data
set to correspond to a dimension in the training data set.
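The validation step can be sketched in plain Python (not SAS) under stated assumptions: the local log weight is taken to be log2(count + 1), which this section does not confirm, and the small U, S, and global weight values stand in for the data sets saved by the training run. The names `weight_document` and `project_document` are hypothetical.

```python
import math


def weight_document(counts, global_weights):
    """Apply the local log weight and the stored global weight to each term count."""
    # Assumption: the local log weight is log2(count + 1).
    return [math.log2(c + 1.0) * g for c, g in zip(counts, global_weights)]


def project_document(weighted, u, s):
    """Project onto the columns of u, scale by 1 / s[i], then normalize to length 1."""
    coords = []
    for i in range(len(s)):
        dot = sum(u[t][i] * weighted[t] for t in range(len(weighted)))
        coords.append(dot / s[i])
    length = math.sqrt(sum(c * c for c in coords))
    return [c / length for c in coords] if length > 0.0 else coords


# Hypothetical results saved by a training run: 3 terms, 2 dimensions.
u = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # stands in for the U data set
s = [2.0, 1.0]                             # stands in for the S data set
global_weights = [1.0, 0.5, 0.25]          # stands in for the saved global weights

validation_counts = [3, 1, 0]              # term counts of one validation document
weighted = weight_document(validation_counts, global_weights)
projected = project_document(weighted, u, s)
print(projected)
```

Because the global weights, U, and S all come from the training run, the validation document lands in the same coordinate system as the training documents; nothing is re-estimated from the validation data.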
The STDIZE Procedure
Overview
Procedure Syntax
PROC STDIZE Statement
BY Statement
FREQ Statement
LOCATION Statement
SCALE Statement
VAR Statement
WEIGHT Statement
Details
Examples
Example 1: Getting Started with the STDIZE Procedure
Example 2: Unstandardizing a Data Set
Example 3: Replacing Missing Values with Standardizing
Example 4: Replacing Missing Values without Standardizing the Variables
References