# The arboretum procedure

Yüklə 3.07 Mb.

 səhifə 134/148 tarix 30.04.2018 ölçüsü 3.07 Mb.
 : documentationdocumentation -> From cyber-crime to insider trading, digital investigators are increasingly being asked todocumentation -> EnCase Forensic Transform Your Investigationsdocumentation -> File Sharing Documentation Prepared by Alan Halter Created: 1/7/2016 Modified: 1/7/2016documentation -> Gaia Data Release 1 Documentation release 0 The SPSVD Procedure Overview Procedure Syntax PROC SPSVD Statement ROW Statement COL Statement ENTRY Statement OUTPUT Statement EXAMPLES-SECTION Example 1: Use the SPSVD procedure for training and validation References Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved. The SPSVD Procedure Overview The SPSVD procedure has two main functions. The first is to calculate a truncated singular value decomposition of a large sparse matrix. The second is to project the rows or columns of a sparse matrix onto the columns of a dense matrix. Proc SPSVD takes at least one data set as input. This data set must be a compressed representation of a sparse matrix. The data set must contain at least 3 variables, with one observation for each non-zero entry in the matrix. One variable indicates the row number, another variable indicates the column number and a third indicates the entry. Thus the following matrix would be represented in compressed form as: The last observation in this data set indicates that the number 1 appears in the 3rd row and 4th column of the original matrix. The Text Parsing node in Enterprise Miner produces a data set that is in the correct format. For text mining we generally consider the terms as rows and the documents as columns. The number of times that the term appears in the document is the entry. Thus if term 5 appears in document 10 three times we would say there is a 3 in row 5, column 10. These correspond to the numeric key for the term, the document number, and the number of times that the term appears in the current document. Note that other matrices may be passed to the procedure (via the IN_U option for example) and these matrices are not passed in the sparse matrix format. PROC SPSVD computes a truncated singular value decomposition of the sparse matrix. Thus, if we call the sparse matrix  , PROC SPSVD will compute the first k (k is chosen by the user) columns of 3 new matrices,  ,  , and   such that  . Since only the first k columns of these matrices are calculated, this is referred as a truncated SVD. The procedure will calculate the leading columns of these matrices must faster than it can calculate the later columns. Thus smaller values of k result in faster execution times. The algorithm used in this routine is designed to find these leading columns only. The value of k should be less than the minimum of the number of rows and number of columns in the original noncompressed matrix. If a user attempts to calculate too many columns the calculations may not be accurate and calculation times may be slow (in the former case a warning will be written to the log). The matrices   and   are orthonormal (i.e. each column is orthogonal to every other column in the matrix, and the Euclidean norm of each column is 1).   is diagonal with monotonically decreasing non-negative real entries. These matrices have many interesting and important qualities. For text mining we are interested in using them for dimension reduction. To see how we accomplish this, we view the columns of the original matrix   as coordinates. Each column then specifies the position of a data point (document) in a high dimensional space. Similarly, the first column of   defines a point in space. If we join this point and the origin, this defines a line. That line is the best least squares fit to the data (distance measured perpendicular to the line). In other words, the first column of  , u 1 , minimize where a i  is the i th  column of  , c i  is some constant, n is the number of columns in  , and   is the Euclidean norm. The second column of  , u 2 , also defines a line, and together with u 1 plane. This plane is the best fit 2-dimensional subspace to the data (again in a least squares sense). Generally, the first k columns of   form a best-fit least-square k-dimensional subspace. We can therefore project the columns of   onto the first k columns of   to reduce the dimension. The result of projecting a document, d, is k real numbers, p i , each of which is the inner-product of d onto a column of . This projection is defined by  . For text mining, PROC SPSVD is generally used to calculate the SVD for the training data set. The training data set is then projected onto the first k columns of   as discussed above. k is chosen by the user. It is difficult to define an optimal value for k. Generally, k must be large enough to capture much of the meaning in the document collection, but it should not be so large as to capture the noise. Values between 100 and 200 work well for large document collection that contains thousand of documents. Once the training data set has been projected, the matrices used for the projection are saved, and the PROC is invoked again to project the validation and any test or score data sets onto these same matrices. Many options exist for weighting the input matrix and scaling or normalizing the projected image. These are discussed in the syntax sections. Dostları ilə paylaş:

Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2017
rəhbərliyinə müraciət

Ana səhifə