Note that the first k columns of V form a best-fit subspace with respect to the rows of A. For this
reason, PROC SPSVD allows the user to project the rows of A onto the first k columns of V. In the
framework of text mining, this projection is viewed as a representation of the terms in the data set.
One possible use is to cluster the reduced-dimension representation of the terms to find concepts
prevalent in the document collection.
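The row projection described above can be sketched in a few lines of Python. This is an illustration only, not the procedure's implementation: the helper name project_rows and the toy matrices are hypothetical, and the matrix V1 is assumed to have been computed already (here it is known analytically because every row of A is a multiple of (1, 1)).

```python
import math

def project_rows(a, v_k):
    """Return the coordinates of each row of `a` in the basis formed by
    the columns of `v_k`.

    a   : list of rows (n_terms x n_docs)
    v_k : list of rows (n_docs x k); columns are assumed orthonormal
    """
    k = len(v_k[0])
    return [
        [sum(row[j] * v_k[j][c] for j in range(len(row))) for c in range(k)]
        for row in a
    ]

# Toy term-by-document matrix (made-up values). Every row is a multiple
# of (1, 1), so the best-fit one-dimensional subspace for the rows is
# spanned by (1, 1) / sqrt(2).
A = [[1.0, 1.0],
     [2.0, 2.0],
     [3.0, 3.0]]
V1 = [[1 / math.sqrt(2)],
      [1 / math.sqrt(2)]]

# One k-dimensional coordinate vector per term (row of A).
term_coords = project_rows(A, V1)
```

Each row of term_coords is the reduced-dimension representation of one term; these vectors are what a clustering step would take as input.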
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The SPSVD Procedure
Procedure Syntax
PROC SPSVD <option(s)>;
ROW variable;
COL variable;
ENTRY variable;
OUTPUT <option(s)>;
PROC SPSVD Statement
Invoke the SPSVD procedure.
PROC SPSVD <option(s)>;
Options
DATA = SAS-data-set
Specifies the data set to be analyzed. This data set should be in compressed form as described in
the overview. If you omit the DATA= option, the procedure uses the most recently created SAS
data set.
IN_GLOBAL = SAS-data-set
Specifies a data set containing previously calculated global weights, which are then applied to
the input data set. This option is generally used in conjunction with the GWGT option in the
OUTPUT statement. In a predictive modeling setting, global weights should be calculated from the
training data set only. These weights are written to a data set by using the GWGT option, and the
same weights can then be applied to another data set. For example, the weights can be applied to
the validation data set by using the IN_GLOBAL option. Note that the IN_GLOBAL option and the
GLOBAL option cannot both be specified.
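The train-then-reuse workflow described for IN_GLOBAL can be sketched outside of SAS as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data are made-up (row, col, entry) triples, and an IDF-style formula stands in for whatever global weight the procedure actually computes.

```python
import math
from collections import defaultdict

def fit_global_weights(train_triples, n_train_docs):
    """Compute one weight per term (row) from the training data only.

    ASSUMPTION: an IDF-style weight, log2(n_docs / doc_freq) + 1, is used
    purely as a stand-in for the procedure's own global weights.
    """
    docs_per_term = defaultdict(set)
    for row, col, entry in train_triples:
        if entry != 0:
            docs_per_term[row].add(col)
    return {row: math.log2(n_train_docs / len(cols)) + 1.0
            for row, cols in docs_per_term.items()}

def apply_global_weights(triples, weights, default=1.0):
    """Scale each entry by the stored weight of its term.

    ASSUMPTION: terms that never occurred in the training data are left
    unchanged (weight 1.0).
    """
    return [(row, col, entry * weights.get(row, default))
            for row, col, entry in triples]

# (row = term, col = document, entry = frequency); values are made up.
train = [(1, 1, 2.0), (1, 2, 1.0), (2, 1, 3.0), (3, 2, 1.0)]
valid = [(1, 1, 1.0), (2, 1, 2.0), (4, 1, 5.0)]

g = fit_global_weights(train, n_train_docs=2)    # training data only
valid_weighted = apply_global_weights(valid, g)  # same weights reused
```

The point of the design is that the validation data never influence the weights: they are fitted once on the training data, stored, and then applied unchanged.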
IN_U = SAS-data-set
Specifies a U matrix to be used for a column projection. If this option is used, the SVD of the
input matrix is not calculated. Rather, the specified U matrix is used for projections. For example,
if COLPRO is specified, but ROWPRO is not, then only the matrix U is needed as long as
SCALECOL or SCALEALL has not been specified. If SCALECOL is specified, then IN_S is also
needed.
IN_S = SAS-data-set
Specifies an S matrix to be used for a projection. If this option is used, the SVD of the input
matrix is not calculated. Rather, the U, S, and V matrices specified are used for projections.
Only the matrices needed for the requested projections need to be supplied. Thus, for example, if
COLPRO is specified, but ROWPRO is not, then only the matrix U is needed as long as
SCALECOL or SCALEALL has not been specified. If SCALECOL is specified, then IN_S is also
needed.
IN_V = SAS-data-set
Specifies a V matrix to be used for a row projection. If this option is used, the SVD of the input
matrix is not calculated. Rather, the specified V matrix is used for projections. Only the matrices
needed for the requested projections need to be supplied. Thus, for example, if ROWPRO is
specified, but COLPRO is not, then only the matrix V is needed as long as SCALECOL or SCALEALL
has not been specified. If SCALECOL is specified, then IN_S is also needed.
k = integer
Specifies the number of dimensions to which the data set is reduced, and controls the number of
columns of U, S, and V that are calculated and used for projections. The procedure calculates only
the number specified. Therefore, specifying a k larger than is needed causes the procedure to run
for an unnecessarily long time. See the Overview section for tips on choosing k. If U and V are
passed to the procedure via IN_U or IN_V, then the user does not need to specify k because it is
deduced from the number of columns in the passed data sets. If the user would like the SVD
calculated, then k must be specified.
p = integer
Specifies the number of iterations, beyond k, before the procedure restarts. PROC SPSVD is an
iterative procedure. As iterations continue, more and more memory is used and the procedure
slows down due to the number of calculations required. If the desired quantities have not been
calculated to acceptable accuracy within k+p iterations, the procedure restarts, retaining much
of the information learned in the first k+p iterations. Setting p low causes frequent restarts,
which uses less memory. However, restarting takes time, so this may slow the procedure.
Conversely, if p is too large, the routine slows down because of the calculations required. If
this value is not specified, it defaults to min{k,75}.
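As a small worked example of the default described above, the following sketch computes how many iterations are allowed before a restart for a few values of k. The function names are hypothetical and only the arithmetic min{k,75} and k+p is taken from the text.

```python
def default_p(k):
    """Default for p when it is not specified: min{k, 75}."""
    return min(k, 75)

def restart_length(k, p=None):
    """Number of iterations (k + p) allowed before the procedure restarts."""
    if p is None:
        p = default_p(k)
    return k + p

small = restart_length(50)           # p defaults to 50
large = restart_length(200)          # p defaults to 75
custom = restart_length(200, p=20)   # explicit p: more frequent restarts
```

For k at or below 75 the default doubles the iteration budget; above 75 the budget grows only by a fixed 75 iterations, which bounds the extra memory used between restarts.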
LOCAL = BINARY|LOG
Specifies a local weight, L_ij, to be used to weight the entries of the input matrix prior to any
calculations. If the WGT = option is specified, the weighted matrix will be written out. Local and
global weights are combined so that an entry, â_ij, of the new matrix is calculated from an entry,
a_ij, of the old matrix as â_ij = G_i L_ij. If the local weight is not specified, it defaults to
L_ij = a_ij. If a global weight is not specified, it defaults to G_i = 1.
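The way local and global weights combine can be sketched as follows. This is an illustration only, assuming that the two weights combine as a product of a per-term global weight and the local weight, that the raw entry is used when no local weight is requested, and that the global weight is 1 when none is given. All function names and data values are hypothetical.

```python
def binary_local(a_ij):
    """Binary local weight: 1 if the term occurs in the document, else 0."""
    return 1.0 if a_ij > 0 else 0.0

def identity_local(a_ij):
    """ASSUMPTION: with no local weight requested, the raw entry is used."""
    return a_ij

def weight_entries(triples, local=identity_local, global_weights=None):
    """Return (i, j, weighted entry) triples.

    ASSUMPTION: the weighted entry is the product G_i * L_ij, with G_i = 1
    for any term that has no global weight.
    """
    g = global_weights or {}
    return [(i, j, g.get(i, 1.0) * local(a)) for i, j, a in triples]

# (i = term, j = document, a = frequency); values are made up. The zero
# entry is included only to show the binary weight's "otherwise" case.
entries = [(1, 1, 3.0), (1, 2, 0.0), (2, 1, 1.0)]

unweighted = weight_entries(entries)
binary = weight_entries(entries, local=binary_local)
binary_global = weight_entries(entries, local=binary_local,
                               global_weights={1: 2.0})
```

With both defaults in place the matrix is returned unchanged, which is why omitting the weighting options leaves the input data as they are.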
The following table lists the available local weights:
Table of Cell Weight
Binary    If the term appears in the document, then L_ij = 1. Otherwise, L_ij = 0.
Log