The arboretum procedure



Date: 30.04.2018

It should be noted that the first k columns of V form a best-fit subspace with respect to the rows
of A. For this reason, PROC SPSVD allows the user to project the rows onto the first k columns of V.
In the framework of text mining, this is viewed as a representation for the terms in the data set.
Possible uses for this include clustering the reduced-dimension representation of the terms to find
concepts prevalent in the document collection.
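As an illustration only, the following Python sketch shows the rank-one case (k = 1) of this idea. It assumes the usual SVD notation, approximates the first column of V by power iteration on A'A (a textbook method chosen here for brevity, not necessarily the algorithm PROC SPSVD uses), and then projects the rows of a small made-up matrix onto that direction. The function names and the data are hypothetical.

```python
import math


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]


def top_right_singular_vector(A, iters=200):
    """Approximate the first column of V by power iteration on A'A.

    Assumption: the all-ones starting vector is not orthogonal to the
    dominant direction, which holds for the nonnegative example below.
    """
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical 3-term by 2-document matrix (rows are terms).
A = [
    [2.0, 2.0],
    [1.0, 1.0],
    [0.0, 0.0],
]

v1 = top_right_singular_vector(A)
# Row projection: one coordinate per row (term) along the direction v1.
row_coords = matvec(A, v1)
print(v1)
print(row_coords)
```

Each row of A is reduced to a single coordinate here. With k greater than 1, every row would receive k coordinates, one for each retained column of V.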

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.



The SPSVD Procedure

Procedure Syntax

PROC SPSVD <option(s)>;

ROW variable;

COL variable;

ENTRY variable;

OUTPUT <option(s)>;


PROC SPSVD Statement

Invoke the SPSVD procedure.

PROC SPSVD <option(s)>;

Options

DATA = SAS-data-set

Specifies the data set to be analyzed. This data set should be in compressed form as described in
the overview. If you omit the DATA= option, the procedure uses the most recently created SAS data
set.
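The overview is not reproduced in this section, but the ROW, COL, and ENTRY statements suggest a coordinate layout with one observation per nonzero cell. Under that assumption, a dense matrix could be converted to such a layout roughly as follows. The function name, the 1-based indexing, and the sample data are illustrative choices, not taken from the SAS documentation.

```python
def to_triplets(dense):
    """Return (row, col, entry) for every nonzero cell of a dense matrix.

    Assumption: indices are 1-based and zero cells are simply omitted.
    """
    triplets = []
    for i, row in enumerate(dense, start=1):
        for j, value in enumerate(row, start=1):
            if value != 0:
                triplets.append((i, j, value))
    return triplets


# Hypothetical 2-term by 3-document frequency matrix.
dense = [
    [2, 0, 1],
    [0, 0, 3],
]

triplets = to_triplets(dense)
for row, col, entry in triplets:
    print(row, col, entry)
```

For a sparse term-by-document matrix, most cells are zero, so a layout of this kind stores far fewer values than the dense form.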



IN_GLOBAL = SAS-data-set

Specifies the data set containing previously calculated global weights, which are applied to the
input data set. This option is generally used in conjunction with the GWGT option in the OUTPUT
statement. In a predictive modeling setting, global weights should be calculated from the training
data set only. These weights are then written to a data set by using the GWGT option, and the same
weights can be applied to another data set. For example, the weights can be applied to the
validation data set by using the IN_GLOBAL option. Note that the IN_GLOBAL option and the GLOBAL
option cannot both be specified.
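The train-then-apply workflow described above can be sketched in Python. This is a conceptual illustration, not PROC SPSVD code. The weight used here (an inverse-document-frequency style value), the multiplication of each entry by its term's weight, and the decision to drop terms unseen in training are all assumptions made for the example.

```python
import math


def global_weights(train_triplets, n_docs):
    """Compute one hypothetical global weight per term from training data.

    train_triplets holds (term, doc, count) tuples. The weight
    log2(n_docs / document_frequency) + 1 is an illustrative choice.
    """
    docs_per_term = {}
    for term, doc, _count in train_triplets:
        docs_per_term.setdefault(term, set()).add(doc)
    return {
        term: math.log2(n_docs / len(docs)) + 1
        for term, docs in docs_per_term.items()
    }


def apply_weights(triplets, weights):
    """Scale each entry by its term's stored weight.

    Assumption: terms that never occurred in training are dropped.
    """
    return [
        (term, doc, count * weights[term])
        for term, doc, count in triplets
        if term in weights
    ]


# Weights are learned from the training data only ...
train = [(1, 1, 2), (1, 2, 1), (2, 1, 3)]
weights = global_weights(train, n_docs=2)

# ... and then reused unchanged on the validation data.
validation = [(1, 1, 4), (2, 1, 1), (3, 1, 5)]
weighted_validation = apply_weights(validation, weights)
print(weights)
print(weighted_validation)
```

Reusing the stored weights keeps information from the validation data out of the weighting step, which is the point of calculating the weights on the training data alone.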



IN_U = SAS-data-set

Specifies a U matrix to be used for a column projection. If this option is used, the SVD of the
input matrix is not calculated. Rather, the specified U matrix is used for projections. For example,
if COLPRO is specified but ROWPRO is not, then only the matrix U is needed, as long as SCALECOL or
SCALEALL has not been specified. If SCALECOL has been specified, then IN_S is also needed.
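To make the idea of projecting with a supplied matrix concrete, here is a minimal Python sketch. It assumes that a column projection maps each column a of the input matrix to U'a, where U has k columns, and it does not model the SCALECOL or SCALEALL options. The function name and the numbers are hypothetical.

```python
def project_columns(A, U):
    """Project each column of A onto the columns of a supplied U.

    A is m x n and U is m x k, both given as lists of rows. The result
    holds one k-dimensional coordinate list per column of A. No SVD is
    computed: k is taken from the number of columns of U.
    """
    m = len(A)
    n = len(A[0])
    k = len(U[0])
    return [
        [sum(U[i][c] * A[i][j] for i in range(m)) for c in range(k)]
        for j in range(n)
    ]


# Hypothetical U with orthonormal columns (3 terms, k = 2).
U = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]

# Hypothetical 3-term by 2-document matrix to be projected.
A = [
    [1, 2],
    [3, 4],
    [5, 6],
]

coords = project_columns(A, U)
print(coords)
```

A row projection with a supplied V matrix would follow the same pattern, with each row of the input matrix multiplied by V instead.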

IN_S = SAS-data-set

Specifies an S matrix to be used for a projection. If this option is used, the SVD of the input
matrix is not calculated. Rather, the U, S, and V matrices specified are used for projections.
Only the matrices needed for the requested projections need to be supplied. Thus, for example, if
COLPRO is specified but ROWPRO is not, then only the matrix U is needed, as long as SCALECOL or
SCALEALL has not been specified. If SCALECOL has been specified, then IN_S is also needed.



IN_V = SAS-data-set

Specifies a V matrix to be used for a row projection. If this option is used, the SVD of the input
matrix is not calculated. Rather, the specified V matrix is used for projections. Only the matrices
needed for the requested projections need to be supplied. Thus, for example, if ROWPRO is specified
but COLPRO is not, then only the matrix V is needed, as long as SCALECOL or SCALEALL has not been
specified. If SCALECOL has been specified, then IN_S is also needed.



k = integer

Specifies the number of dimensions to which the data set is reduced, and controls the number of
columns of U, S, and V that are calculated and used for projections. The procedure calculates only
the number specified. Therefore, specifying a k larger than is needed causes the procedure to run
for an unnecessarily long time. See the Overview section for tips on choosing k. If U and V are
passed to the procedure via IN_U or IN_V, then the user does not need to specify k, because it is
deduced from the number of columns in the passed data sets. If the user would like the SVD
calculated, then k must be specified.

p = integer

Specifies the number of iterations, beyond k, before the procedure restarts. PROC SPSVD is an
iterative procedure. As iterations continue, more and more memory is used and the procedure slows
down because of the number of calculations required. If the desired quantities have not been
calculated to acceptable accuracy within k+p iterations, the procedure restarts, retaining much of
the information learned in the first k+p iterations. Setting p low causes frequent restarts, which
use less memory. However, restarting takes time, so this may slow the procedure. Conversely, if p
is too large, the routine slows down because of the calculations required. If this value is not
specified, it defaults to min{k,75}.
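The arithmetic behind the default can be written out in a few lines of Python. This only restates the rule given above (p defaults to min{k,75}, and a restart happens after k+p iterations). The helper names are hypothetical and nothing here calls SAS.

```python
def default_p(k):
    """Default for p when it is not specified: min{k, 75}."""
    return min(k, 75)


def iterations_before_restart(k, p=None):
    """Number of iterations (k + p) that run before a restart."""
    if p is None:
        p = default_p(k)
    return k + p


print(iterations_before_restart(50))         # p defaults to 50
print(iterations_before_restart(100))        # p defaults to 75
print(iterations_before_restart(100, p=20))  # explicit, smaller p
```

With k = 100, lowering p from the default of 75 to 20 shortens each cycle from 175 to 120 iterations, which trades memory use against more frequent restarts.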

LOCAL = BINARY|LOG

Specifies a local weight, L_ij, to be used to weight the entries of the input matrix prior to any
calculations. If the WGT= option is specified, the weighted matrix is written out. Local and
global weights are combined so that each entry of the new matrix is calculated from the
corresponding entry, a_ij, of the old matrix as [formula not shown]. If the local weight is not
specified, it defaults to [formula not shown]. If a global weight is not specified, it defaults to
[formula not shown].

The following table lists the available local weights:

The following table lists the available local weights:



Table of Cell Weights

Weight    Definition
Binary    If the term appears in the document, then L_ij = 1. Otherwise, L_ij = 0.
Log       [formula not shown]
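The two local weights can be illustrated with a short Python sketch. The binary weight follows directly from its description. The exact Log formula is not shown in this section, so the form log2(a_ij + 1) used below is an assumption based on a common text-mining convention, and the function names are hypothetical.

```python
import math


def binary_weight(a):
    """Binary local weight: 1 if the term appears in the document, else 0."""
    return 1 if a != 0 else 0


def log_weight(a):
    """Log local weight.

    Assumption: log2(a + 1), a common convention. The formula actually
    used by the procedure is not shown in this section.
    """
    return math.log2(a + 1)


# Hypothetical raw frequencies for one term across four documents.
counts = [0, 1, 3, 7]

binary = [binary_weight(a) for a in counts]
logged = [log_weight(a) for a in counts]
print(binary)
print(logged)
```

Both weights dampen the influence of very frequent terms: the binary weight ignores frequency altogether, while a logarithmic weight grows slowly as the frequency increases.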




