2 The suffix array data structure
2.1. How do we build a suffix array?
The first method we may think of is sorting the suffixes of A using an O(n lg n) sorting algorithm.
Since comparing two suffixes is done in O(n), the final complexity will reach O(n
2
lg n). Even if this
seems daunting, there is an algorithm of complexity O(n lg n), relatively easy to understand and code. If
asymptotically its building time is greater that that of a suffix tree practice
taught us that in reality
constructing a suffix array is much faster, because of the big constant that makes the linear algorithm to be
slower than we might think. Moreover, the amount of memory used implementing a suffix array with O(n)
memory is 3 to 5 times smaller than the amount needed by a suffix tree.
The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2
k
long prefixes. We shall execute m = [log
2
n] (ceil) steps, computing the order of the prefixes of length 2
k
at the k
th
step. It is used an m x n sized matrix. Let’s
denote by A
i
k
the subsequence of A of length 2
k
starting at position i. The position of A
i
k
in the sorted array of A
j
k
subsequences (j = 1, n) is kept in P
(k, i).
When passing from step k to step k + 1, all
the pairs of subsequences A
i
k
and A
i+2^k
k
are
concatenated, therefore obtaining the substrings of length 2
k+1
. For establishing the new order relation
among these, the information computed at the previous step must be used. For each index i it is kept a pair
formed by P
(k, i)
and P
(k, i + 2^k)
. The fact that i + 2
k
may exceed the string’s bounds must not bother us,
because we shall fill the string with the “$”
character, about which we shall consider that it’s
lexicographically smaller than any other character. After sorting, the pairs will be arranged conforming to
the lexicographic order of the strings of length 2
k+1
. One last thing that must
be remembered is that at a
certain step k there may be several substrings A
i
k
= A
j
k
, and these must be labeled identically (P
(k, i)
must
be equal to P
(k, j)
).
An image tells more that a thousand words:
bobocel
step 0:
Dostları ilə paylaş: