The arboretum procedure


GLOBAL = NORMAL|GFIDF|IDF|ENTROPY

Specifies a global weight, $G_i$, to be used to weight the entries of the input matrix prior to any calculations. If the WGT= option is specified, the weighted matrix is written out. Local and global weights are combined so that an entry, $\hat{a}_{ij}$, of the new matrix is calculated from an entry, $a_{ij}$, of the old matrix as $\hat{a}_{ij} = G_i L_{ij}$. If the local weight is not specified, it defaults to $L_{ij} = a_{ij}$. If a global weight is not specified, it defaults to $G_i = 1$. The GLOBAL option cannot be used in conjunction with the IN_GLOBAL option. The GWGT option on the OUTPUT statement enables you to save the calculated global weights so that they can be applied to subsequent data sets by using the IN_GLOBAL option.
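The combination of local and global weights can be illustrated with a minimal Python sketch. It is not the procedure's implementation. It assumes that the combined entry is the product of the global weight for the row and the local weight of the entry, that the default local weight leaves the entry unchanged, and that the default global weight is 1. The function name `apply_weights`, the triple layout, and the sample weights are hypothetical.

```python
def apply_weights(entries, global_weights=None, local=None):
    """Weight (row, col, value) triples as G_i * L_ij.

    Assumptions (not taken from the procedure's source):
    the default local weight is L_ij = a_ij and the default
    global weight is G_i = 1.
    """
    if local is None:
        local = lambda a: a  # default local weight: entry unchanged
    weighted = []
    for row, col, a in entries:
        # Look up the saved weight for this row, or fall back to 1.
        g = 1.0 if global_weights is None else global_weights[row]
        weighted.append((row, col, g * local(a)))
    return weighted


# Hypothetical data: three nonzero entries and previously saved row weights,
# playing the role of weights saved with GWGT= and reused with IN_GLOBAL.
entries = [(1, 1, 2.0), (1, 2, 1.0), (2, 2, 4.0)]
saved_weights = {1: 0.5, 2: 2.0}

weighted = apply_weights(entries, saved_weights)
unweighted = apply_weights(entries)
```

Passing a dictionary of saved weights mirrors the idea of computing weights once and applying them to later data sets; with no weights supplied, the entries pass through unchanged.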

Global weights are functions of the row entries of the original, noncompressed, sparse matrix. The

following table lists the available global weights:

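Table of Row Weights

Normal (NORMAL): $G_i = 1 / \sqrt{\sum_j f_{ij}^2}$

Global Frequency divided by Inverse Document Frequency (GFIDF): $G_i = g_i / d_i$

Inverse Document Frequency (IDF): $G_i = \log_2(n / d_i) + 1$

Entropy (ENTROPY): $G_i = 1 + \sum_j \frac{p_{ij} \log_2 p_{ij}}{\log_2 n}$

where $f_{ij}$ is the frequency of term $i$ in document $j$, $d_i$ is the number of documents in which term $i$ appears, $g_i$ is the number of times that term $i$ appears in the whole document collection, $n$ is the number of documents in the collection, and $p_{ij} = f_{ij} / g_i$.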
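As a rough illustration of how the four global weights behave, the following Python sketch computes them for a small dense term-by-document matrix. It assumes the commonly used definitions: Normal is $1/\sqrt{\sum_j f_{ij}^2}$, GFIDF is $g_i/d_i$, IDF is $\log_2(n/d_i) + 1$, and Entropy is $1 + \sum_j p_{ij}\log_2 p_{ij} / \log_2 n$ with $p_{ij} = f_{ij}/g_i$. The function name and the sample matrix are hypothetical, and the sketch assumes every term occurs at least once and that there is more than one document.

```python
import math


def global_weights(matrix):
    """Compute global weights per term (row) of a dense frequency matrix.

    matrix[i][j] is f_ij, the frequency of term i in document j.
    The formulas are the commonly used definitions (an assumption here).
    """
    n = len(matrix[0])  # number of documents
    result = []
    for row in matrix:
        g = sum(row)                      # g_i: total occurrences of the term
        d = sum(1 for f in row if f > 0)  # d_i: documents containing the term
        entropy_sum = sum((f / g) * math.log2(f / g) for f in row if f > 0)
        result.append({
            "NORMAL": 1.0 / math.sqrt(sum(f * f for f in row)),
            "GFIDF": g / d,
            "IDF": math.log2(n / d) + 1.0,
            "ENTROPY": 1.0 + entropy_sum / math.log2(n),
        })
    return result


# Term 0 is spread evenly over four documents; term 1 occurs in only one.
sample = [
    [1, 1, 1, 1],
    [4, 0, 0, 0],
]
weights = global_weights(sample)
```

Under these definitions the evenly spread term gets an entropy weight of 0 and the concentrated term gets 1, which shows why entropy weighting favors terms that discriminate between documents.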

TOL = number

Specifies the tolerance at which the procedure stops finding eigenvalues of $A^{T}A$. The procedure computes the decomposition by finding eigenvalues of $A^{T}A$. If $\lambda$ is the eigenvalue estimate and $y$ is the eigenvector estimate, then the procedure terminates when all $k$ pairs of values satisfy $\|A^{T}A y - \lambda y\|_2 \le \mathrm{TOL}$. If TOL is not specified, it defaults to $10^{-6}$, which is more than adequate for most text mining problems.
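The role of a tolerance of this kind can be sketched in Python. The example below uses plain power iteration, which is a stand-in chosen for brevity and is not claimed to be the algorithm the procedure uses. It assumes the stopping test is a bound on the residual norm of the eigenpair estimate for $A^{T}A$, and it finds only the largest eigenvalue. All names are hypothetical, and the sketch assumes a nonzero input matrix.

```python
import math


def matvec(m, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def largest_eigenpair(a, tol=1e-6, max_iter=10000):
    """Estimate the largest eigenpair of A^T A by power iteration.

    Stops when the residual norm ||A^T A y - lam * y|| is at most tol
    (an assumed form of the convergence test).
    """
    at = [list(col) for col in zip(*a)]  # transpose of A
    size = len(a[0])
    y = [1.0 / math.sqrt(size)] * size   # unit-length starting vector
    for _ in range(max_iter):
        z = matvec(at, matvec(a, y))                 # A^T A y
        lam = sum(zi * yi for zi, yi in zip(z, y))   # Rayleigh quotient
        residual = math.sqrt(sum((zi - lam * yi) ** 2 for zi, yi in zip(z, y)))
        if residual <= tol:
            return lam, y, residual
        norm = math.sqrt(sum(v * v for v in z))
        y = [v / norm for v in z]                    # normalize next iterate
    raise RuntimeError("power iteration did not converge")


# A^T A is diag(9, 1), so the largest eigenvalue is 9 (singular value 3).
lam, y, residual = largest_eigenpair([[3.0, 0.0], [0.0, 1.0]])
```

A smaller tolerance forces more iterations before the estimates are accepted; a looser one trades accuracy for speed.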

The SPSVD Procedure

ROW Statement

Specifies the row variable. This statement is not required if the row variable has a name of ROW.

ROW variable;

variable

Specifies the name of the variable in the input data set that contains the row variable for the

compressed matrix format as described in the overview.


COL Statement

Specifies the column variable. This statement is not required if the column variable has a name of COL.

COL variable;

variable

Specifies the name of the variable in the input data set that contains the column variable for the

compressed matrix format as described in the overview.


ENTRY Statement

Specifies the variable name of the entry values. This statement is not required if the variable has a

name of ENTRY.

ENTRY variable;

variable

Specifies the name of the variable in the input data set that contains the entry values for the

compressed matrix format as described in the overview.
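To make the roles of the ROW, COL, and ENTRY variables concrete, here is a small Python sketch that expands a set of observations into a dense matrix. It assumes the compressed format stores one nonzero matrix entry per observation, identified by a 1-based row index and a 1-based column index; that layout is an assumption, since the overview is not reproduced here. The function name and the variable names `term`, `doc`, and `count` are hypothetical.

```python
def to_dense(observations, row="ROW", col="COL", entry="ENTRY"):
    """Expand (row, column, entry) observations into a dense matrix.

    Assumes 1-based indices and one nonzero entry per observation.
    The keyword arguments play the part of the ROW, COL, and ENTRY
    statements: they name the fields that hold each role.
    """
    n_rows = max(obs[row] for obs in observations)
    n_cols = max(obs[col] for obs in observations)
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for obs in observations:
        dense[obs[row] - 1][obs[col] - 1] = obs[entry]
    return dense


# The fields are not named ROW, COL, and ENTRY, so the roles are given explicitly.
observations = [
    {"term": 1, "doc": 1, "count": 2.0},
    {"term": 1, "doc": 3, "count": 1.0},
    {"term": 2, "doc": 2, "count": 5.0},
]
dense = to_dense(observations, row="term", col="doc", entry="count")
```

When the fields already carry the default names, the three keyword arguments can be omitted, just as the corresponding statements are optional.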


OUTPUT Statement

Specifies the data sets to be output.

OUTPUT <option(s)>;

Options

S = SAS-data-set

Specifies the name of the data set to store the calculated S matrix. The matrix is written with rows of the matrix as observations in the SAS data set and columns as variables. The variables are named COL1-COLk. You cannot specify S= if the IN_S option has been specified.

U = SAS-data-set

Specifies the name of the data set to store the calculated U matrix. The matrix is written with rows of the matrix as observations in the SAS data set and columns as variables. The variables are named COL1-COLk. You cannot specify U= if the IN_U option has been specified.

V = SAS-data-set

Specifies the name of the data set to store the calculated V matrix. The matrix is written with rows of the matrix as observations in the SAS data set and columns as variables. The variables are named COL1-COLk. You cannot specify V= if the IN_V option has been specified.

GWGT = SAS-data-set

Specifies the name of the data set that contains the calculated global weights. This data set can be

applied to other data sets by using the IN_GLOBAL option. This option must be used in

conjunction with the GLOBAL option.

WGT = SAS-data-set

Specifies the name of the data set to which the procedure writes the weighted matrix when the LOCAL option, the GLOBAL option, or both are used. If LOCAL or GLOBAL is specified but WGT= is not, all calculations performed by the procedure are still based on the weighted matrix, but the weighted matrix is not saved. If WGT= is specified and COLPRO, ROWPRO, U=, S=, and V= are not specified, the matrix is weighted and written to disk, and no other calculations are performed.

COLPRO|DOCPRO = SAS-data-set

Specifies the data set to which the projection of the columns of the input matrix onto the columns of the matrix U is written. If the IN_U option is specified, the data in the data set specified by the IN_U option is used for the projection. Otherwise, U is calculated from the input data set. If SCALECOL|SCALEDOC or SCALEALL is specified and IN_U is specified, then IN_S must also be specified.
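The column projection can be sketched in Python as follows. The sketch assumes that projecting a column $a_j$ of the input matrix onto the columns of U means computing $U^{T} a_j$, which gives one $k$-dimensional vector per column (document). It does not model the SCALECOL, SCALEDOC, or SCALEALL options, and the function name and sample matrices are hypothetical.

```python
def project_columns(a, u):
    """Project each column of A onto the columns of U.

    a is an m x n matrix and u is an m x k matrix, both as lists of rows.
    Returns n rows of k values: row j holds U^T a_j (assumed definition).
    """
    a_columns = list(zip(*a))  # the n columns of A
    u_columns = list(zip(*u))  # the k columns of U
    return [
        [float(sum(x * y for x, y in zip(u_col, a_col))) for u_col in u_columns]
        for a_col in a_columns
    ]


# Three terms, two documents; U keeps the first two coordinate directions.
a = [[1, 2],
     [3, 4],
     [5, 6]]
u = [[1, 0],
     [0, 1],
     [0, 0]]
projection = project_columns(a, u)
```

Because U can be supplied rather than recomputed, the same projection can be applied to new documents, which is the situation the IN_U option addresses.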
