The arboretum procedure



Yüklə 3.07 Mb.

səhifə134/148
tarix30.04.2018
ölçüsü3.07 Mb.
1   ...   130   131   132   133   134   135   136   137   ...   148
: documentation
documentation -> From cyber-crime to insider trading, digital investigators are increasingly being asked to
documentation -> EnCase Forensic Transform Your Investigations
documentation -> File Sharing Documentation Prepared by Alan Halter Created: 1/7/2016 Modified: 1/7/2016
documentation -> Gaia Data Release 1 Documentation release 0

The SPSVD Procedure

The SPSVD Procedure

Overview

Procedure Syntax

PROC SPSVD Statement

ROW Statement

COL Statement

ENTRY Statement

OUTPUT Statement



EXAMPLES-SECTION

Example 1: Use the SPSVD procedure for training and validation



References

Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.




The SPSVD Procedure

Overview

The SPSVD procedure has two main functions. The first is to calculate a truncated singular value

decomposition of a large sparse matrix. The second is to project the rows or columns of a sparse matrix

onto the columns of a dense matrix.

Proc SPSVD takes at least one data set as input. This data set must be a compressed representation of a

sparse matrix. The data set must contain at least 3 variables, with one observation for each non-zero

entry in the matrix. One variable indicates the row number, another variable indicates the column

number and a third indicates the entry. Thus the following matrix

would be represented in compressed form as:

The last observation in this data set indicates that the number 1 appears in the 3rd row and 4th column of

the original matrix. The Text Parsing node in Enterprise Miner produces a data set that is in the correct

format. For text mining we generally consider the terms as rows and the documents as columns. The

number of times that the term appears in the document is the entry. Thus if term 5 appears in document

10 three times we would say there is a 3 in row 5, column 10. These correspond to the numeric key for

the term, the document number, and the number of times that the term appears in the current document.

Note that other matrices may be passed to the procedure (via the IN_U option for example) and these

matrices are not passed in the sparse matrix format.

PROC SPSVD computes a truncated singular value decomposition of the sparse matrix. Thus, if we call




the sparse matrix 

, PROC SPSVD will compute the first k (k is chosen by the user) columns of 3 new

matrices, 

, and 



 such that 

. Since only the first k columns of these matrices

are calculated, this is referred as a truncated SVD. The procedure will calculate the leading columns of

these matrices must faster than it can calculate the later columns. Thus smaller values of k result in faster

execution times. The algorithm used in this routine is designed to find these leading columns only. The

value of k should be less than the minimum of the number of rows and number of columns in the

original noncompressed matrix. If a user attempts to calculate too many columns the calculations may

not be accurate and calculation times may be slow (in the former case a warning will be written to the

log). The matrices 

 and 


 are orthonormal (i.e. each column is orthogonal to every other column in

the matrix, and the Euclidean norm of each column is 1). 

 is diagonal with monotonically decreasing

non-negative real entries. These matrices have many interesting and important qualities. For text mining

we are interested in using them for dimension reduction.

To see how we accomplish this, we view the columns of the original matrix 

 as coordinates. Each

column then specifies the position of a data point (document) in a high dimensional space. Similarly, the

first column of 

 defines a point in space. If we join this point and the origin, this defines a line. That

line is the best least squares fit to the data (distance measured perpendicular to the line). In other words,

the first column of 

, u

1

, minimize



where a

i

 is the i



th

 column of 

, c

i

 is some constant, n is the number of columns in 



, and 

 is the


Euclidean norm. The second column of 

, u


2

, also defines a line, and together with u

1

, it defines a



plane. This plane is the best fit 2-dimensional subspace to the data (again in a least squares sense).

Generally, the first k columns of 

 form a best-fit least-square k-dimensional subspace. We can

therefore project the columns of 

 onto the first k columns of 

 to reduce the dimension. The result of

projecting a document, d, is k real numbers, p

i

, each of which is the inner-product of d onto a column of



. This projection is defined by 

.

For text mining, PROC SPSVD is generally used to calculate the SVD for the training data set. The



training data set is then projected onto the first k columns of 

 as discussed above. k is chosen by the

user. It is difficult to define an optimal value for k. Generally, k must be large enough to capture much of

the meaning in the document collection, but it should not be so large as to capture the noise. Values

between 100 and 200 work well for large document collection that contains thousand of documents.

Once the training data set has been projected, the matrices used for the projection are saved, and the

PROC is invoked again to project the validation and any test or score data sets onto these same matrices.

Many options exist for weighting the input matrix and scaling or normalizing the projected image. These

are discussed in the syntax sections.




Dostları ilə paylaş:
1   ...   130   131   132   133   134   135   136   137   ...   148


Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2017
rəhbərliyinə müraciət

    Ana səhifə