Note that the first k columns of $V$ form a best-fit subspace with respect to the rows of $A$. For this reason, PROC SPSVD allows the user to project the rows onto the first k columns of $V$. In the framework of text mining, this projection is viewed as a representation of the terms in the data set. Possible uses include clustering the reduced-dimension representation of the terms to find concepts prevalent in the document collection.
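
The projection can be written out as a short worked equation. This is a sketch that assumes the usual SVD notation: $A$ is the input matrix, $U$ and $V$ hold the left and right singular vectors, $S$ is the diagonal matrix of singular values, and $U_k$, $S_k$, $V_k$ (names introduced here for illustration) denote the first k columns of $U$, the leading k-by-k block of $S$, and the first k columns of $V$.

```latex
A = U S V^{T}
\qquad\Longrightarrow\qquad
A V_k = U S V^{T} V_k = U_k S_k
```

Under these assumptions, row i of $A$ is represented by the k coordinates in row i of $U_k S_k$, because the columns of $V$ are orthonormal and $V^{T} V_k$ keeps only the first k columns of $U S$.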

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.

*The SPSVD Procedure*
**Procedure Syntax**
**PROC SPSVD** <*option(s)*>;

**ROW** *variable*;

**COL** *variable*;

**ENTRY** *variable*;

**OUTPUT** <*option(s)*>;

**PROC SPSVD Statement**
**Invoke the SPSVD procedure.**
**PROC SPSVD** <*option(s)*>;

**Options**
**DATA =** *SAS-data-set*
Specifies the data set to be analyzed. This data set should be in compressed form as described in the overview. If you omit the DATA= option, the procedure uses the most recently created SAS data set.
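
As a rough illustration of what a compressed input might look like, the following Python sketch converts a small dense matrix into one record per nonzero cell. The overview that defines the compressed form is not part of this section, so the layout here (a row index, a column index, and an entry value, mirroring the ROW, COL, and ENTRY statements) and the 1-based indexing are assumptions, and `to_compressed` is a hypothetical helper name.

```python
def to_compressed(dense):
    """Return (row, col, entry) triples for the nonzero cells of a dense matrix.

    Assumption: indices are 1-based and zero cells are simply omitted.
    """
    triples = []
    for i, row in enumerate(dense, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                triples.append((i, j, value))
    return triples


# A tiny 2 x 3 count matrix with three nonzero cells.
counts = [
    [2, 0, 1],
    [0, 0, 3],
]
triples = to_compressed(counts)
print(triples)
```

Each triple would correspond to one observation in the input data set, with the three values stored in the variables named on the ROW, COL, and ENTRY statements.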

**IN_GLOBAL =** *SAS-data-set*
Specifies a data set containing previously calculated global weights to apply to the input data set. This option is generally used in conjunction with the GWGT option in the OUTPUT statement. In a predictive modeling setting, global weights should be calculated from the training data set only. These weights are written to a data set using the GWGT option and can then be applied to another data set. For example, the weights can be applied to the validation data set by using the IN_GLOBAL option. Note that the IN_GLOBAL option and the GLOBAL option cannot both be specified.
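
The train-then-apply workflow described for IN_GLOBAL can be sketched in Python. This is an illustration of the idea only, not the procedure's implementation: the weight formula (an inverse-document-frequency style weight), the assumption that rows are terms and columns are documents, the handling of terms unseen in training, and the function names are all hypothetical.

```python
import math


def idf_weights(triples, n_docs):
    """Hypothetical global weight per term: log2(n_docs / document frequency) + 1."""
    doc_freq = {}
    for term, _doc, _count in triples:
        doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log2(n_docs / df) + 1 for term, df in doc_freq.items()}


def apply_weights(triples, weights):
    """Scale each entry by its term's stored weight.

    Assumption: terms that have no stored weight are dropped; how PROC SPSVD
    treats such terms is not stated in this section.
    """
    return [(t, d, c * weights[t]) for t, d, c in triples if t in weights]


train = [(1, 1, 2), (1, 2, 1), (2, 1, 4)]  # (term, document, count)
valid = [(1, 1, 3), (2, 1, 1), (3, 1, 5)]  # term 3 never appears in training

weights = idf_weights(train, n_docs=2)          # computed from training data only
weighted_valid = apply_weights(valid, weights)  # same weights reused on validation data
print(weights)
print(weighted_valid)
```

The point of the design is that the validation data never influences the weights, which keeps the validation estimate honest.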

**IN_U =** *SAS-data-set*
Specifies a $U$ matrix to be used for a column projection. If this option is used, the SVD of the input matrix is not calculated. Rather, the specified $U$ matrix is used for projections. For example, if COLPRO is specified, but ROWPRO is not, then only the matrix $U$ is needed as long as SCALECOL or SCALEALL has not been specified. If SCALECOL had been specified, then IN_S would be needed.

**IN_S =** *SAS-data-set*
Specifies an $S$ matrix to be used for a projection. If this option is used, the SVD of the input matrix is not calculated. Rather, the $U$, $S$, and $V$ matrices specified are used for projections. Only the matrices needed for the requested projections need to be supplied. Thus, for example, if COLPRO is specified, but ROWPRO is not, then only the matrix $U$ is needed as long as SCALECOL or SCALEALL has not been specified. If SCALECOL had been specified, then IN_S would be needed.

**IN_V =** *SAS-data-set*
Specifies a $V$ matrix to be used for a row projection. If this option is used, the SVD of the input matrix is not calculated. Rather, the $V$ matrix is used for projections. Only the matrices needed for the requested projections need to be supplied. Thus, for example, if ROWPRO is specified, but COLPRO is not, then only the matrix $V$ is needed as long as SCALECOL or SCALEALL has not been specified. If SCALECOL had been specified, then IN_S would be needed.
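
To make the idea of projecting with a supplied matrix concrete, here is a minimal Python sketch of an unscaled row projection that uses a given $V$ and computes no SVD. It assumes the input arrives as (row, col, entry) triples with 1-based indices and that $V$ has one row per input column; `project_rows` and the sample values are hypothetical, and the scaling options are not modeled.

```python
def project_rows(triples, v):
    """Project sparse rows onto the columns of v.

    v[j] holds the k values for input column j + 1, so the number of
    dimensions k is read from v rather than supplied separately.
    """
    k = len(v[0])
    projected = {}
    for row, col, entry in triples:
        coords = projected.setdefault(row, [0.0] * k)
        for d in range(k):
            coords[d] += entry * v[col - 1][d]
    return projected


# Hypothetical V: 3 rows (one per column of the input matrix), k = 2 columns.
v = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
triples = [(1, 1, 2.0), (1, 3, 1.0), (2, 2, 3.0)]
coords = project_rows(triples, v)
print(coords)
```

Only the nonzero entries are visited, so the cost scales with the number of stored triples times k rather than with the full matrix size.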

**k =** *integer*
Represents the number of dimensions to which the data set will be reduced, and controls the number of columns of $U$, $S$, and $V$ to be calculated and used for projections. The procedure calculates only the number specified, so specifying a k larger than needed causes the procedure to run for an unnecessarily long time. See the Overview section for tips on choosing k. If $U$ and $V$ are passed to the procedure via IN_U or IN_V, then the user does not need to specify k, as it is deduced from the number of columns in the passed data sets. If the user would like the SVD calculated, then k must be specified.

**p =** *integer*
Specifies the number of iterations, beyond k, before the procedure restarts. PROC SPSVD is an iterative procedure. As iterations continue, more and more memory is used and the procedure slows down due to the number of calculations required. If the desired quantities have not been calculated to acceptable accuracy within k+p iterations, the procedure restarts, retaining much of the information learned in the first k+p iterations. Setting p low causes frequent restarts, which uses less memory. However, restarting takes time, so this may slow the procedure. Conversely, if p is too large, the routine begins to slow due to the calculations required. If this value is not specified, it defaults to min{k, 75}.
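
The default rule for p and the resulting restart point can be checked with a few lines of Python. This is only arithmetic on the values stated above; `restart_length` is a hypothetical helper, not part of the procedure.

```python
def restart_length(k, p=None):
    """Iterations allowed before a restart: k + p, with p defaulting to min(k, 75)."""
    if p is None:
        p = min(k, 75)
    return k + p


print(restart_length(50))       # default p = 50
print(restart_length(200))      # default p is capped at 75
print(restart_length(200, 25))  # explicit p = 25
```

For small k the default doubles the iteration budget, while for large k it adds a fixed 75 iterations.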

**LOCAL =** *BINARY|LOG*
Specifies a local weight, $L_{ij}$, to be used to weight the entries of the input matrix prior to any calculations. If the WGT = option is specified, the weighted matrix will be written out. Local and global weights are combined so that an entry, $\hat{a}_{ij}$, of the new matrix is calculated from an entry, $a_{ij}$, of the old matrix as $\hat{a}_{ij} = G_i L_{ij}$. If the local weight is not specified, it defaults to $L_{ij} = a_{ij}$. If a global weight is not specified, it defaults to $G_i = 1$.
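
A small Python sketch can illustrate how a local and a global weight might be applied to a single entry. It assumes the two weights are multiplied together, that an unspecified local weight leaves the entry unchanged, and that an unspecified global weight is 1. Only the BINARY weight is modeled, using its definition from the table of cell weights; `weight_entry` is a hypothetical helper.

```python
def weight_entry(a_ij, local=None, global_weight=1.0):
    """Return the weighted entry under the stated assumptions.

    local=None keeps the raw entry; local="BINARY" maps any nonzero
    entry to 1 and a zero entry to 0.
    """
    if local is None:
        l_ij = a_ij
    elif local == "BINARY":
        l_ij = 1 if a_ij != 0 else 0
    else:
        raise ValueError("only the BINARY local weight is modeled in this sketch")
    return l_ij * global_weight


print(weight_entry(3))                 # no weights: the entry is unchanged
print(weight_entry(3, "BINARY"))       # binary local weight only
print(weight_entry(3, "BINARY", 2.0))  # binary local weight times a global weight
```

With both defaults in effect the weighted matrix equals the input matrix, which is why omitting the weighting options leaves the data untouched.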

The following table lists the available local weights:

**Table of Cell Weight**
Binary

If the term appears in the document, then $L_{ij} = 1$. Otherwise, $L_{ij} = 0$.

Log