Suffix arrays – a contest approach



Yüklə 247,7 Kb.
Pdf görüntüsü
səhifə2/13
tarix20.06.2022
ölçüsü247,7 Kb.
#89807
1   2   3   4   5   6   7   8   9   ...   13
suffix-array

abac = A
0
ac = A
2
bac = A
1
c = A
3
It is easy to see that these are sorted ascending. To store them, it is not necessary to keep a vector 
of strings; it is enough to maintain the index of every suffix in the sorted array. For the example above, we 
get the array 
P = (0, 2, 1, 3),
this being the suffix array for the string “abac”. 
 
 
 
 
 
 
 
 
 
3


2 The suffix array data structure 
 
 
2.1. How do we build a suffix array? 
The first method we may think of is sorting the suffixes of A using an O(n lg n) sorting algorithm. 
Since comparing two suffixes is done in O(n), the final complexity will reach O(n
2
lg n). Even if this 
seems daunting, there is an algorithm of complexity O(n lg n), relatively easy to understand and code. If 
asymptotically its building time is greater that that of a suffix tree practice taught us that in reality 
constructing a suffix array is much faster, because of the big constant that makes the linear algorithm to be 
slower than we might think. Moreover, the amount of memory used implementing a suffix array with O(n) 
memory is 3 to 5 times smaller than the amount needed by a suffix tree. 
The algorithm is mainly based on maintaining the order of the string’s suffixes sorted by their 2
k
long prefixes. We shall execute m = [log
2
n] (ceil) steps, computing the order of the prefixes of length 2
k
at the k
th
step. It is used an m x n sized matrix. Let’s denote by A
i
k
the subsequence of A of length 2
k
starting at position i. The position of A
i
k
in the sorted array of A
j
k
subsequences (j = 1, n) is kept in P
(k, i).
When passing from step k to step k + 1, all the pairs of subsequences A
i
k
and A
i+2^k
k
are 
concatenated, therefore obtaining the substrings of length 2
k+1
. For establishing the new order relation 
among these, the information computed at the previous step must be used. For each index i it is kept a pair 
formed by P
(k, i) 
and P
(k, i + 2^k) 
. The fact that i + 2
k
may exceed the string’s bounds must not bother us, 
because we shall fill the string with the “$” character, about which we shall consider that it’s 
lexicographically smaller than any other character. After sorting, the pairs will be arranged conforming to 
the lexicographic order of the strings of length 2
k+1
. One last thing that must be remembered is that at a 
certain step k there may be several substrings A
i
k
= A
j
k
, and these must be labeled identically (P
(k, i) 
must 
be equal to P
(k, j)
). 
An image tells more that a thousand words:
bobocel 
step 0: 

Yüklə 247,7 Kb.

Dostları ilə paylaş:
1   2   3   4   5   6   7   8   9   ...   13




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©genderi.org 2024
rəhbərliyinə müraciət

    Ana səhifə