The arboretum procedure

ROWPRO|WORDPRO = SAS-data-set



Specifies the data set to which the projection of the rows of the input matrix, computed with the matrix V, is written. If the IN_V option is specified, the data set that it names is used for the projection; otherwise, V is calculated from the input data set. If SCALEROW|SCALEWORD or SCALEALL is specified together with IN_V, then IN_S must also be specified.

SCALECOL|SCALEDOC, SCALEROW|SCALEWORD, SCALEALL

Requests that the associated projections (column, row, or all) be scaled by the inverse of the singular values. SCALEALL specifies that both the document (column) and the word (row) projections be scaled. SCALECOL or SCALEDOC specifies that the document (columns of the input matrix) projections be scaled. SCALEROW or SCALEWORD specifies that the term (rows of the input matrix) projections be scaled. If $p_{ij}$ is the $i$th coordinate of the projected image of the $j$th document, then scaling replaces $p_{ij}$ with $p_{ij} / s_i$, where $s_i$ is the $i$th singular value (the $i$th entry on the diagonal of S).

Scaling has two functions. First, it puts more weight on those themes in a document that are uncommon in the document collection. Second, if either the terms or the documents, but not both, are scaled and both are placed in the same space, then the terms and documents that are highly associated are more likely to be near each other.
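The effect of scaling on a single projected document can be illustrated with a short sketch. It is written in Python rather than SAS and is not how PROC SPSVD is implemented; the function name `scale_projection` and the numeric values are hypothetical, chosen only to show the division by the singular values.

```python
def scale_projection(coords, singular_values):
    """Divide the i-th coordinate of a projected document by the i-th singular value."""
    if len(coords) != len(singular_values):
        raise ValueError("need exactly one singular value per coordinate")
    return [p / s for p, s in zip(coords, singular_values)]


# Hypothetical projected document in 3 dimensions, with singular values
# listed in decreasing order (both sets of numbers are made up).
projected = [8.0, 2.0, 1.0]
singular = [4.0, 2.0, 0.5]

print(scale_projection(projected, singular))  # [2.0, 1.0, 2.0]
```

Before scaling, the first coordinate dominates the third by a factor of 8. After scaling they are equal, which is the sense in which scaling puts more weight on themes that are uncommon in the collection (those tied to small singular values).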

NORMCOL|NORMDOC, NORMROW|NORMWORD, NORMALL

Requests that the Euclidean length of the document (column) projections, the word (row) projections, or both be normalized. For example, if NORMCOL, NORMDOC, or NORMALL is specified, then each observation in the data set specified by the DOCPRO option has a length of 1. This is useful because it brings documents with similar content but different lengths closer together. For most text mining applications, NORMALL is suggested.
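As a rough illustration of why normalization brings documents of different lengths together, the following Python sketch rescales two hypothetical projections to unit Euclidean length. The helper `normalize` and the sample vectors are assumptions made for this example and are not part of the procedure.

```python
import math


def normalize(vec):
    """Rescale vec to Euclidean length 1; an all-zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0.0:
        return list(vec)
    return [x / length for x in vec]


# Two hypothetical projections with the same direction: the second could
# come from a document that is twice as long as the first.
short_doc = [3.0, 4.0]
long_doc = [6.0, 8.0]

print(normalize(short_doc))
print(normalize(long_doc))
```

Both calls print the same unit vector, so after normalization the two documents coincide even though their raw projections differ in length.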

The SPSVD Procedure

Example

Example 1: Use the SPSVD procedure for training and validation


Suppose there are two data sets, SASUSER.TRAIN and SASUSER.VALID, produced by the Text Parsing node in Enterprise Miner. You want to use SASUSER.TRAIN for training a predictive model and SASUSER.VALID for validation.

/* Training: weight the matrix, compute the SVD, and project the documents */
PROC SPSVD DATA=SASUSER.TRAIN K=200 P=50 LOCAL=LOG GLOBAL=ENTROPY;
   ROW KEY;
   COL DOC;
   ENTRY COUNT;
   OUTPUT U=SASUSER.U V=SASUSER.V S=SASUSER.S NORMALL SCALEALL
          DOCPRO=SASUSER.TRAINDP GWGT=SASUSER.WEIGHTS;
RUN;

/* Validation: reuse U, S, and the global weights from the training step */
PROC SPSVD DATA=SASUSER.VALID IN_U=SASUSER.U IN_S=SASUSER.S
           LOCAL=LOG IN_GLOBAL=SASUSER.WEIGHTS;
   ROW KEY;
   COL DOC;
   ENTRY COUNT;
   OUTPUT NORMALL SCALEALL DOCPRO=SASUSER.VALIDDP;
RUN;

The first PROC statement applies a local log, global entropy weighting scheme to the training data set. The ROW, COL, and ENTRY statements specify the names given to these variables by the Text Parsing node in Enterprise Miner. Once the procedure has weighted the matrix, it calculates 200 columns (specified in the K= option) of U, S, and V based on this weighted matrix. The weighted training data set is then projected onto the first 200 columns of U, scaled by the inverse singular values, and normalized to unit length. The result of the projection is written to SASUSER.TRAINDP. The calculated global weights (entropy in this case) are saved to the data set SASUSER.WEIGHTS.

The second PROC statement projects the validation data set using the calculations from the training data set. This is done by specifying the U and S matrices calculated in the first PROC step with the IN_U and IN_S options. Notice that you do not need to specify the V data set because you are not projecting the terms. To project the documents in the validation data set, specify the same local weighting option for the validation data set and pass the calculated global weights via the IN_GLOBAL option. Then, request that the normalized, scaled projection be written to SASUSER.VALIDDP. This way, the validation data set is weighted in exactly the same way as the training data set. Using the GLOBAL option on the validation data set would cause new global weights to be calculated from the data in this set, which is not appropriate in this example because you want each dimension in the validation data set to correspond to a dimension in the training data set.
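The reason for reusing the training weights can be sketched outside SAS. The Python example below covers only the weighting step, not the projection with U and S. It assumes the common textbook log-entropy definition (local weight log2(1 + f), global weight 1 + sum_j p_j log2(p_j) / log2(n)); whether PROC SPSVD uses exactly these formulas is an assumption, not something stated in this section. The function names and counts are hypothetical.

```python
import math


def entropy_weights(train_counts):
    """train_counts: {term: [frequency in each training document]} -> {term: global weight}."""
    weights = {}
    for term, freqs in train_counts.items():
        n = len(freqs)
        if n < 2:
            raise ValueError("entropy weights need at least two documents")
        total = sum(freqs)
        if total == 0:
            continue  # a term that never occurs gets no weight
        # Sum of p * log2(p) over the documents that contain the term.
        h = sum((f / total) * math.log2(f / total) for f in freqs if f > 0)
        weights[term] = 1.0 + h / math.log2(n)
    return weights


def weight_document(doc_counts, weights):
    """Apply the local log weight and the supplied global weights to one document."""
    return {
        term: math.log2(1 + f) * weights[term]
        for term, f in doc_counts.items()
        if term in weights  # terms unseen in training have no dimension to map to
    }


# Hypothetical training counts for two terms across two documents.
train = {"sas": [1, 1], "svd": [2, 0]}
global_w = entropy_weights(train)

# A hypothetical validation document, weighted with the training weights.
valid_doc = {"sas": 3, "svd": 1, "new": 5}
print(weight_document(valid_doc, global_w))
```

Because the validation document is weighted with `global_w` from the training data, every weighted term corresponds to a term, and therefore a dimension, that the training step already knows about. Recomputing the weights from the validation data would change them and break that correspondence.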

The STDIZE Procedure


Overview

Procedure Syntax

PROC STDIZE Statement

BY Statement

FREQ Statement

LOCATION Statement

SCALE Statement

VAR Statement

WEIGHT Statement

Details

Examples

Example 1: Getting Started with the STDIZE Procedure

Example 2: Unstandardizing a Data Set

Example 3: Replacing Missing Values with Standardizing

Example 4: Replacing Missing Values without Standardizing the Variables

References
