This enables you to identify which documents contain words about
the United States and North Carolina, respectively.
TOKEN = ALPHANUM|ALPHA|SPACEDELIM
Specifies which character sequences qualify as terms to be indexed. The default value is
ALPHANUM. The TOKEN option works in conjunction with the ENHANCE option. The following
descriptions apply when the ENHANCE option is not used.
ALPHANUM -- Terms are space and punctuation delimited. Each term consists only of
alphanumeric characters.
ALPHA -- Terms are space, punctuation, and digit delimited. Each term consists only of
alphabetical characters.
SPACEDELIM -- Terms are space delimited. Terms contain all characters, including
punctuation.
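The three TOKEN settings can be illustrated with a minimal Python sketch. This is not the procedure's implementation: it assumes ASCII text and no case folding, and the function name `tokenize` is hypothetical.

```python
import re

def tokenize(text, token="ALPHANUM"):
    """Approximate the TOKEN rules when ENHANCE is not used (illustrative only)."""
    if token == "ALPHANUM":
        # Spaces and punctuation delimit; terms hold only letters and digits.
        return re.findall(r"[A-Za-z0-9]+", text)
    if token == "ALPHA":
        # Spaces, punctuation, and digits delimit; terms hold only letters.
        return re.findall(r"[A-Za-z]+", text)
    if token == "SPACEDELIM":
        # Only whitespace delimits; punctuation stays inside the term.
        return text.split()
    raise ValueError(f"unknown TOKEN value: {token}")

sample = "Don't stop at 42nd St."
print(tokenize(sample, "ALPHANUM"))    # ['Don', 't', 'stop', 'at', '42nd', 'St']
print(tokenize(sample, "ALPHA"))       # ['Don', 't', 'stop', 'at', 'nd', 'St']
print(tokenize(sample, "SPACEDELIM"))  # ["Don't", 'stop', 'at', '42nd', 'St.']
```

Note how ALPHANUM and ALPHA split the contraction at the apostrophe, while SPACEDELIM keeps it whole along with the trailing period.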
ENHANCE = language
This option performs a limited amount of language-specific parsing. Currently only ENGLISH
is supported. If you are parsing any other language, do not use the ENHANCE option. The
ENHANCE option works in conjunction with the TOKEN option.
The following descriptions apply when ENHANCE=ENGLISH is used.
ALPHANUM -- Terms are space delimited. Terms that contain punctuation are omitted
unless they are contractions or terms with a single punctuation mark at the end.
Contractions are kept in their original form. A term with a single punctuation mark at the
end is kept, but the punctuation mark is removed.
ALPHA -- Terms are space delimited. Terms that contain punctuation or digits are omitted
unless they are contractions or terms with a single punctuation mark at the end.
Contractions are kept in their original form. A term with a single punctuation mark at the
end is kept, but the punctuation mark is removed.
SPACEDELIM -- Terms are space delimited. In addition, end-of-word punctuation is
removed.
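The ENHANCE=ENGLISH rules above can be approximated with the following Python sketch. It rests on several assumptions that the text does not settle: a contraction is taken to be letters, one apostrophe, then letters; punctuation means ASCII punctuation; and SPACEDELIM strips every trailing punctuation character. The function name `tokenize_enhanced` is hypothetical.

```python
import re
import string

# Assumed definition of a contraction: letters, an apostrophe, letters.
CONTRACTION = re.compile(r"[A-Za-z]+'[A-Za-z]+")
BODY = {
    "ALPHANUM": re.compile(r"[A-Za-z0-9]+"),  # letters and digits allowed
    "ALPHA": re.compile(r"[A-Za-z]+"),        # letters only
}

def tokenize_enhanced(text, token="ALPHANUM"):
    """Approximate the TOKEN rules under ENHANCE=ENGLISH (illustrative only)."""
    terms = []
    for raw in text.split():  # terms are always space delimited
        if token == "SPACEDELIM":
            term = raw.rstrip(string.punctuation)  # drop end-of-word punctuation
            if term:
                terms.append(term)
        elif CONTRACTION.fullmatch(raw):
            terms.append(raw)  # contractions are kept in their original form
        else:
            # Remove at most one trailing punctuation mark, then require a clean body.
            term = raw[:-1] if raw[-1] in string.punctuation else raw
            if BODY[token].fullmatch(term):
                terms.append(term)
    return terms

sample = "Don't stop, at 42nd St. (well-known) e.g."
print(tokenize_enhanced(sample, "ALPHANUM"))  # ["Don't", 'stop', 'at', '42nd', 'St']
print(tokenize_enhanced(sample, "ALPHA"))     # ["Don't", 'stop', 'at', 'St']
```

Under these assumptions, "(well-known)" and "e.g." are omitted by ALPHANUM and ALPHA because punctuation remains after the single trailing mark is removed.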
Copyright 2001 by SAS Institute Inc., Cary, NC, USA. All rights reserved.
The TPARS Procedure
COPY Statement
COPY variables;
variables
Specifies the variables that you want to keep from the input data set.
OUTPUT Statement
OUTPUT option(s);
Options
OUT = SAS-data-set
Specifies the name of the data set that will contain the term-by-document frequency table.
KEY = SAS-data-set
Specifies the name of the data set that will contain the index/term pairs for the terms that are
indexed in the OUT data set.
MERGE = SAS-data-set
Specifies the data set that will contain all of the variables listed in the COPY statement along with
a new variable called DOC. The DOC variable holds the document number, which corresponds to
the values of the _DOCUMENT_ variable in the OUT data set. Use the MERGE data set after a call
to the SPSVD procedure to merge the original data set with the reduced-dimension data.
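The role of the DOC variable as a join key can be sketched in Python. This section does not describe the layout of the SPSVD output, so the `dim1` and `dim2` columns, the `author` variable, and all values below are made-up placeholders. Only the idea of matching rows on the document number comes from the text.

```python
# Hypothetical MERGE data set: the COPY variables plus the DOC index.
merge_rows = [
    {"DOC": 1, "author": "Smith"},
    {"DOC": 2, "author": "Jones"},
]

# Hypothetical reduced-dimension rows keyed by document number
# (placeholder column names and values, not real SPSVD output).
reduced_rows = [
    {"DOC": 1, "dim1": 0.12, "dim2": -0.40},
    {"DOC": 2, "dim1": 0.85, "dim2": 0.07},
]

def merge_on_doc(left, right):
    """Join two lists of rows on their shared DOC value."""
    by_doc = {row["DOC"]: row for row in right}
    return [{**row, **by_doc[row["DOC"]]} for row in left if row["DOC"] in by_doc]

merged = merge_on_doc(merge_rows, reduced_rows)
for row in merged:
    print(row)
```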
Output
The TPARS procedure generates two output data sets. One of the output data sets is a table in sparse
matrix format that contains the following variables:
_TERM_ -- is the parsed text.
_TERMNUM_ -- is a unique numerical index associated with each term.
_DOCUMENT_ -- is the document number.
_COUNT_ -- is the number of times that the term appears in the document.
The table can be interpreted as an encoding of a sparse matrix. The following example represents a
collection of four documents.
The collection is indexed by only three words. House appears one time in Document 1 and
two times in Document 4, Garage appears three times in Document 2, and so on.
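To make the sparse encoding concrete, the following Python sketch expands (term number, document number, count) triples into a dense term-by-document matrix. The House and Garage rows follow the counts stated above. The Sleep row is a hypothetical placeholder, because its counts are not given in the text, and the helper name `to_dense` is also an assumption.

```python
# Triples mirroring the _TERMNUM_, _DOCUMENT_, and _COUNT_ variables.
sparse_rows = [
    (1, 1, 1),  # House appears one time in Document 1
    (1, 4, 2),  # House appears two times in Document 4
    (2, 2, 3),  # Garage appears three times in Document 2
    (3, 3, 1),  # Sleep: hypothetical entry, not taken from the source
]

def to_dense(rows, n_terms, n_docs):
    """Expand sparse triples into a term-by-document count matrix."""
    dense = [[0] * n_docs for _ in range(n_terms)]
    for termnum, document, count in rows:
        dense[termnum - 1][document - 1] = count  # both indexes are 1-based
    return dense

matrix = to_dense(sparse_rows, n_terms=3, n_docs=4)
for term, row in zip(["House", "Garage", "Sleep"], matrix):
    print(f"{term:<7}", row)
```

Only nonzero cells are stored in the sparse form, which is why a collection with many documents and terms stays compact.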
Because the words are encoded numerically, the procedure also outputs a KEY data set that
contains the following variables:
TERM -- is the parsed text.
KEY -- is a unique numerical index associated with each term.
FREQ -- is the total number of times that a term appears in the document collection.
NUMDOC -- is the number of documents in the collection that contain the term.
As an example, the following KEY data set indicates that the terms House, Garage, and Sleep are
identified by 1, 2, and 3, respectively. The term House appears three times in the document collection,
and two documents in the collection contain the word House.
Note: The values of _TERMNUM_ and KEY are identical.
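The relationship between the two data sets can be sketched in Python: FREQ is the sum of a term's counts and NUMDOC is the number of documents in which it appears. This is only an illustration of the arithmetic, not the procedure's code. The Sleep row is a hypothetical placeholder and the helper name `build_key` is an assumption.

```python
from collections import defaultdict

terms = {1: "House", 2: "Garage", 3: "Sleep"}

# (term number, document number, count) triples from the sparse table.
sparse_rows = [
    (1, 1, 1),  # House, Document 1
    (1, 4, 2),  # House, Document 4
    (2, 2, 3),  # Garage, Document 2
    (3, 3, 1),  # Sleep: hypothetical entry, not taken from the source
]

def build_key(rows, terms):
    """Derive TERM, KEY, FREQ, and NUMDOC from the sparse triples."""
    freq = defaultdict(int)
    numdoc = defaultdict(int)
    for termnum, _document, count in rows:
        freq[termnum] += count  # total occurrences across the collection
        numdoc[termnum] += 1    # one row per (term, document) pair
    return [
        {"TERM": terms[k], "KEY": k, "FREQ": freq[k], "NUMDOC": numdoc[k]}
        for k in sorted(freq)
    ]

key_table = build_key(sparse_rows, terms)
print(key_table[0])  # {'TERM': 'House', 'KEY': 1, 'FREQ': 3, 'NUMDOC': 2}
```

The House row reproduces the figures in the example: three occurrences spread over two documents.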