The arboretum procedure



GLOBAL = NORMAL|GFIDF|IDF|ENTROPY

Specifies a global weight, $G_i$, to be used to weight the entries of the input matrix prior to any
calculations. If the WGT= option is specified, the weighted matrix is written out. Local and
global weights are combined so that an entry, $\hat{a}_{ij}$, of the new matrix is calculated from an
entry, $a_{ij}$, of the old matrix as $\hat{a}_{ij} = G_i L_{ij}$, where $L_{ij}$ is the local weight. If the
local weight is not specified, it defaults to $L_{ij} = a_{ij}$. If a global weight is not specified, it
defaults to $G_i = 1$. The GLOBAL option may not be used in conjunction with the IN_GLOBAL option.
The GWGT option on the OUTPUT statement enables you to save the calculated global weights so they
can be applied to subsequent data sets by using the IN_GLOBAL option.
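The way the two weights combine can be sketched in a few lines of Python. This is only an illustration, not the procedure's implementation: it assumes the combination is the product of a global weight per row and a local weight per entry, and that the defaults leave an entry unchanged. The function name `apply_weights`, the `binary` local weight, and the sample data are all hypothetical.

```python
def apply_weights(triples, global_w=None, local=None):
    """Weight sparse (row, col, value) entries as G_i * L_ij.

    Assumptions: with no local weight the entry itself is used (L_ij = a_ij),
    and with no global weight every row gets G_i = 1.
    """
    weighted = []
    for row, col, a in triples:
        l_ij = a if local is None else local(a)            # local weight of this entry
        g_i = 1.0 if global_w is None else global_w[row]   # global weight of this row
        weighted.append((row, col, g_i * l_ij))
    return weighted


def binary(a):
    """Hypothetical local weight: 1 for any nonzero entry."""
    return 1.0 if a != 0 else 0.0


entries = [(1, 1, 2.0), (1, 2, 4.0), (2, 1, 3.0)]
print(apply_weights(entries))                               # defaults: entries unchanged
print(apply_weights(entries, global_w={1: 0.5, 2: 2.0}))    # row weights only
print(apply_weights(entries, local=binary))                 # local weight only
```

Keeping the global weights in a mapping keyed by row mirrors the idea of saving them once and reusing them on a later data set.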

Global weights are functions of the row entries of the original, noncompressed, sparse matrix. The
following table lists the available global weights:

Table of Row Weights

Weight                                                            Formula
Normal                                                            $G_i = 1 / \sqrt{\sum_j f_{ij}^2}$
Global Frequency divided by Inverse Document Frequency (GFIDF)    $G_i = g_i / d_i$
Inverse Document Frequency (IDF)                                  $G_i = \log_2(n / d_i) + 1$
Entropy                                                           $G_i = 1 + \sum_j \frac{p_{ij} \log_2 p_{ij}}{\log_2 n}$

where $f_{ij}$ is the frequency of term $i$ in document $j$, $d_i$ is the number of documents in which
term $i$ appears, $g_i$ is the number of times that term $i$ appears in the whole document collection,
$n$ is the number of documents in the collection, and $p_{ij} = f_{ij} / g_i$.
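For readers who want to check the four weights by hand, the following Python sketch computes them from sparse (term, document, frequency) triples. It assumes the formulas commonly given for these weights in the text mining literature (listed in the docstring), that each (term, document) pair appears at most once, and that the collection has more than one document. The function name `global_weights` and the sample data are hypothetical, and the procedure itself may differ in detail.

```python
import math
from collections import defaultdict


def global_weights(triples, n_docs, kind):
    """Return {term: G_i} for sparse (term, doc, freq) triples.

    Assumed formulas:
      NORMAL : 1 / sqrt(sum_j f_ij^2)
      GFIDF  : g_i / d_i
      IDF    : log2(n / d_i) + 1
      ENTROPY: 1 + sum_j p_ij * log2(p_ij) / log2(n), with p_ij = f_ij / g_i
    """
    rows = defaultdict(list)               # term -> its nonzero frequencies
    for term, _doc, freq in triples:
        if freq != 0:
            rows[term].append(freq)

    weights = {}
    for term, freqs in rows.items():
        g = sum(freqs)                     # g_i: occurrences in the whole collection
        d = len(freqs)                     # d_i: documents containing the term
        if kind == "NORMAL":
            w = 1.0 / math.sqrt(sum(f * f for f in freqs))
        elif kind == "GFIDF":
            w = g / d
        elif kind == "IDF":
            w = math.log2(n_docs / d) + 1.0
        elif kind == "ENTROPY":
            w = 1.0 + sum((f / g) * math.log2(f / g) for f in freqs) / math.log2(n_docs)
        else:
            raise ValueError(f"unknown global weight: {kind}")
        weights[term] = w
    return weights


# Term 1 occurs twice in each of two documents; term 2 occurs three times in one.
counts = [(1, 1, 2), (1, 2, 2), (2, 1, 3)]
for kind in ("NORMAL", "GFIDF", "IDF", "ENTROPY"):
    print(kind, global_weights(counts, n_docs=2, kind=kind))
```

Under these assumed formulas, a term spread evenly over every document gets an entropy weight of 0, while a term concentrated in a single document gets 1.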

TOL = number

Specifies the tolerance at which the procedure stops finding eigenvalues of $A^T A$, which is the
computation the procedure actually performs. Suppose $\lambda$ is an eigenvalue estimate and $y$ is
the corresponding eigenvector estimate; the procedure terminates when all $k$ sets of values satisfy
$\| A^T A y - \lambda y \| \le \mathrm{TOL}$. If TOL is not specified, it defaults to $10^{-6}$, which
is more than adequate for most text mining problems.
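A residual-based stopping rule of this kind can be illustrated with a small Python sketch. It uses plain power iteration for the single largest eigenvalue of $A^T A$, which is a stand-in chosen for brevity and almost certainly not the algorithm the procedure uses. The Euclidean norm, the less-than-or-equal comparison, and the names `matvec` and `dominant_eigenpair` are assumptions made for the illustration.

```python
import math


def matvec(m, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def dominant_eigenpair(a, tol=1e-6, max_iter=10_000):
    """Power iteration on A^T A; stops when ||A^T A y - lam * y|| <= tol."""
    at = [list(col) for col in zip(*a)]        # A^T
    n = len(at)                                # number of columns of A
    y = [1.0 / math.sqrt(n)] * n               # unit-length starting vector
    for _ in range(max_iter):
        z = matvec(at, matvec(a, y))           # z = A^T (A y)
        lam = sum(zi * yi for zi, yi in zip(z, y))   # Rayleigh quotient, since ||y|| = 1
        residual = math.sqrt(sum((zi - lam * yi) ** 2 for zi, yi in zip(z, y)))
        if residual <= tol:
            return lam, y, residual
        norm = math.sqrt(sum(zi * zi for zi in z))
        y = [zi / norm for zi in z]            # renormalize for the next step
    raise RuntimeError("no convergence within max_iter")


A = [[3.0, 0.0], [0.0, 1.0]]
lam, y, residual = dominant_eigenpair(A)
print(lam, math.sqrt(lam), residual)           # eigenvalue, singular value, final residual
```

The square root of the returned eigenvalue is the corresponding singular value of A, which is why a tolerance on the eigenproblem controls the accuracy of the decomposition.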


The SPSVD Procedure

ROW Statement

Specifies the row variable. This statement is not required if the row variable has a name of ROW.

ROW variable;

variable

Specifies the name of the variable in the input data set that contains the row variable for the

compressed matrix format as described in the overview.

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.





COL Statement

Specifies the column variable. This statement is not required if the column variable has a name of
COL.

COL variable;

variable

Specifies the name of the variable in the input data set that contains the column variable for the

compressed matrix format as described in the overview.


ENTRY Statement

Specifies the variable name of the entry values. This statement is not required if the variable has a

name of ENTRY.

ENTRY variable;

variable

Specifies the name of the variable in the input data set that contains the entry values for the

compressed matrix format as described in the overview.
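The overview that defines the compressed matrix format is not reproduced here. Assuming it is the usual coordinate layout, with one observation per nonzero entry and 1-based row and column indices, the three variables named by the ROW, COL, and ENTRY statements could be built from a dense matrix as in this Python sketch. The function name `to_triples` and the sample matrix are hypothetical.

```python
def to_triples(dense):
    """Return one record per nonzero entry of a dense matrix.

    Assumption: indices are 1-based and the variables carry the default
    names ROW, COL, and ENTRY.
    """
    return [
        {"ROW": i, "COL": j, "ENTRY": value}
        for i, row in enumerate(dense, start=1)
        for j, value in enumerate(row, start=1)
        if value != 0
    ]


term_by_doc = [
    [0, 2, 0],
    [1, 0, 0],
]
for record in to_triples(term_by_doc):
    print(record)
```

With variables named this way, none of the three statements would be needed, since each one is optional when the default name is used.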


OUTPUT Statement

Specifies the data sets to be output.

OUTPUT <option(s)>;

Options

S = SAS-data-set

Specifies the name of the data set to store the calculated $S$ matrix. The matrix is written with
rows of the matrix as observations in the SAS data set and columns as variables. The variables are
named COL1-COLk. You cannot specify S= if the IN_S option has been specified.

U = SAS-data-set

Specifies the name of the data set to store the calculated $U$ matrix. The matrix is written with
rows of the matrix as observations in the SAS data set and columns as variables. The variables are
named COL1-COLk. You cannot specify U= if the IN_U option has been specified.

V = SAS-data-set

Specifies the name of the data set to store the calculated $V$ matrix. The matrix is written with
rows of the matrix as observations in the SAS data set and columns as variables. The variables are
named COL1-COLk. You cannot specify V= if the IN_V option has been specified.

GWGT = SAS-data-set

Specifies the name of the data set that contains the calculated global weights. This data set can be

applied to other data sets by using the IN_GLOBAL option. This option must be used in

conjunction with the GLOBAL option.



WGT = SAS-data-set

Specifies the name of the data set to which the procedure writes the weighted matrix when the
LOCAL option, the GLOBAL option, or both are used. If LOCAL and/or GLOBAL is specified but WGT=
is not, all calculations performed by the procedure are still based on the weighted matrix, but the
weighted matrix is not saved. If WGT= is specified and COLPRO, ROWPRO, U=, S=, and V= are not
specified, the matrix is weighted and written to disk; no other calculations are performed.

COLPRO|DOCPRO = SAS-data-set

Specifies the data set to which the projection of the columns of the input matrix onto the columns
of the matrix $U$ is written. If the IN_U option is specified, the data in the data set specified by
the IN_U option is used for the projection. Otherwise, $U$ is calculated from the input data set. If
SCALECOL|SCALEDOC or SCALEALL is specified and IN_U is specified, then IN_S must also be
specified.
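As a rough illustration of what a column projection involves, the Python sketch below computes, for each column of a sparse input matrix, its coordinates with respect to the columns of U (that is, the corresponding column of $U^T A$). It assumes an unscaled projection, so the effect of SCALECOL|SCALEDOC and SCALEALL is not modeled, and it assumes 1-based (row, col, value) triples. The function name `project_columns` and the sample data are hypothetical.

```python
def project_columns(triples, u):
    """Project each column of a sparse matrix onto the columns of U.

    triples: (row, col, value) entries with 1-based indices.
    u: the matrix U as a list of rows, each of length k.
    Returns {col: [k coordinates]}, i.e. column `col` of U^T A.
    """
    k = len(u[0])
    projection = {}
    for row, col, value in triples:
        coords = projection.setdefault(col, [0.0] * k)
        for c in range(k):
            coords[c] += u[row - 1][c] * value   # accumulate u[row][c] * a[row][col]
    return projection


# U has three rows (terms) and k = 2 columns; the values are made up.
U = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
docs = [(1, 1, 2.0), (3, 1, 5.0), (2, 2, 4.0)]
print(project_columns(docs, U))
```

Only the nonzero entries are visited, so the cost grows with the number of stored entries times k rather than with the full matrix size.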


