The SPSVD Procedure
The SPSVD Procedure
Overview
Procedure Syntax
PROC SPSVD Statement
ROW Statement
COL Statement
ENTRY Statement
OUTPUT Statement
EXAMPLES-SECTION
Example 1: Use the SPSVD procedure for training and validation
References
Copyright 2000 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The SPSVD Procedure
Overview
The SPSVD procedure has two main functions. The first is to calculate a truncated singular value
decomposition of a large sparse matrix. The second is to project the rows or columns of a sparse matrix
onto the columns of a dense matrix.
Proc SPSVD takes at least one data set as input. This data set must be a compressed representation of a
sparse matrix. The data set must contain at least 3 variables, with one observation for each non-zero
entry in the matrix. One variable indicates the row number, another variable indicates the column
number and a third indicates the entry. Thus the following matrix
would be represented in compressed form as:
The last observation in this data set indicates that the number 1 appears in the 3rd row and 4th column of
the original matrix. The Text Parsing node in Enterprise Miner produces a data set that is in the correct
format. For text mining we generally consider the terms as rows and the documents as columns. The
number of times that the term appears in the document is the entry. Thus if term 5 appears in document
10 three times we would say there is a 3 in row 5, column 10. These correspond to the numeric key for
the term, the document number, and the number of times that the term appears in the current document.
Note that other matrices may be passed to the procedure (via the IN_U option for example) and these
matrices are not passed in the sparse matrix format.
PROC SPSVD computes a truncated singular value decomposition of the sparse matrix. Thus, if we call
the sparse matrix
, PROC SPSVD will compute the first k (k is chosen by the user) columns of 3 new
matrices,
,
, and
such that
. Since only the first k columns of these matrices
are calculated, this is referred as a truncated SVD. The procedure will calculate the leading columns of
these matrices must faster than it can calculate the later columns. Thus smaller values of k result in faster
execution times. The algorithm used in this routine is designed to find these leading columns only. The
value of k should be less than the minimum of the number of rows and number of columns in the
original noncompressed matrix. If a user attempts to calculate too many columns the calculations may
not be accurate and calculation times may be slow (in the former case a warning will be written to the
log). The matrices
and
are orthonormal (i.e. each column is orthogonal to every other column in
the matrix, and the Euclidean norm of each column is 1).
is diagonal with monotonically decreasing
non-negative real entries. These matrices have many interesting and important qualities. For text mining
we are interested in using them for dimension reduction.
To see how we accomplish this, we view the columns of the original matrix
as coordinates. Each
column then specifies the position of a data point (document) in a high dimensional space. Similarly, the
first column of
defines a point in space. If we join this point and the origin, this defines a line. That
line is the best least squares fit to the data (distance measured perpendicular to the line). In other words,
the first column of
, u
1
, minimize
where a
i
is the i
th
column of
, c
i
is some constant, n is the number of columns in
, and
is the
Euclidean norm. The second column of
, u
2
, also defines a line, and together with u
1
, it defines a
plane. This plane is the best fit 2-dimensional subspace to the data (again in a least squares sense).
Generally, the first k columns of
form a best-fit least-square k-dimensional subspace. We can
therefore project the columns of
onto the first k columns of
to reduce the dimension. The result of
projecting a document, d, is k real numbers, p
i
, each of which is the inner-product of d onto a column of
. This projection is defined by
.
For text mining, PROC SPSVD is generally used to calculate the SVD for the training data set. The
training data set is then projected onto the first k columns of
as discussed above. k is chosen by the
user. It is difficult to define an optimal value for k. Generally, k must be large enough to capture much of
the meaning in the document collection, but it should not be so large as to capture the noise. Values
between 100 and 200 work well for large document collection that contains thousand of documents.
Once the training data set has been projected, the matrices used for the projection are saved, and the
PROC is invoked again to project the validation and any test or score data sets onto these same matrices.
Many options exist for weighting the input matrix and scaling or normalizing the projected image. These
are discussed in the syntax sections.