Suffix Array implementation C++

Suffix array

Before learning suffix array, I would recommend you to read about “tries“. Today’s question is : given a word for example:”blogger”, you want to print all suffixes in sorted order or want to find the kth suffix in lexicographical order.

Naive Solution:
In naive solution you just need to find  all suffixes ,save them in an array of string (this will take around O(n2) operation) and sort them in lexicographical order(this will be taking O(n2 log2(n)) so total complexity will be around O(n2(1 + log2(n))) which is quite large.Then finding the kth suffix in it.

All suffixes of word “blogger”:
blogger
logger
ogger
gger
ger
er
r

Solution using Tries:
In this method we can create the trie for all the suffixes of the word and for finding kth suffix in  lexicographical order, we can pre-process and find lexicographical kth suffix in O(n) (You just figure it out how you can do it).

Even if the idea of suffix trie would be very pleasing,but it’s simplest implementation in which at every step one of the strings suffixes is inserted into structure leads to O(n2) complexity algorithm.There is a “suffix tree” which can be built in O(n) complexity algorithm but creating a suffix tree is quite a long code to write,I will talk about suffix trees in future blogs.Here I will talk about another data structure suffix arrays whose code is quite short to write and it gives around O(n log2(n)) and memory used in implementing suffix array with O(n) memory is 3 to 5 times less than the amount needed by a suffix tree.

Suffix Arrays

Suffix Array is the data structure which stores the sorted form of suffixes of a given strings.It will be more clear after seeing the diagram given below:-

SUFFIX ARRAY

How to build suffix array:

As it is clear from the diagram above, We have to sort the suffixes of the String in lexicographic order for creating suffix arrays. So the big question is How to do it? We will first sort according to the first character of the suffixes of the string and store it in suffix array, for example:

b  l  o  g  g  e  r
 Suffix array –                             0  3 4  2  2  1 5

Now,we will be gradually increasing the sorting criteria i.e from sorting according to single character to two characters, then four characters and so on in the increasing power of two. Until we have reached a length such that we can be sure all the suffixes of the string are themselves sorted.(it will be more clear from the diagram below).

So we have m = ceil(log2(n))” steps to create suffix array where ‘n’ is the length of the string. Let us denote each step by ‘y’, where 0<=y<=m .At step y we will sort the suffixes according to the first 2y characters of each suffix, For example at step 2, we will sort the suffixes according to first 4 (22) characters of each suffix.How to optimally implement the above stated condition? I’ll explain it in step by step manner.

At Step 0(y = 0) :You just sort according to the first character of each suffixes and give sort index(sort index:position of the suffix in the sorted array) to each suffix according to it,if the first character is same for two or more suffixes, give same sort index to  them.See the example below.

Step 0:               Sort-index  
      b                  0                                            
      l                  3                                                      
      o                  4
      g                  2
      g                  2
      e                  1
      r                  5

At Step 1(y = 1):We want to sort the suffixes according to the first two characters. For creating the sort-index at step 1 ,take the help of sort-index created at step 0.Create a pair of indexes in which first value is the sort-index of the respective suffix got in step 0 and second value will be the sort-index of the suffix that starts 1 positions later that we obtain in step 0. If the length of the suffix is less than 2 characters, then we can keep the second value as -1. Now sort according to this pair to get new a sort-index. It will be more helpful if you relate it to diagram given below:


   Step 1(0 - 1)        (concatenating at i  with  i + 21)           Sort-index
        
           bl                        (0,3)                                        0    
           lo                        (3,4)                                        4
           og                        (4,2)                                        5
           gg                        (2,2)                                        3
           ge                        (2,1)                                        2 
           er                        (1,5)                                        1
           r$                        (5,-1)                                       6

Here i denotes index of the suffix starting at position ‘i’ in the string.
At Step 2(y = 2):Same work has to be done at step 2.Here we want to sort according to the first four characters.Create a pair of indexes in which first will be the sort-index of the respective suffix and second will be the sort-index that starts 2 positions later like concatenate bl(1st position) with og(3rd position) to get first four characters of the respective suffix.Do this with all other suffixes and get the sort-index of each position.

 
 Step 2             (1 - 2)(concatenating at i  with  i + 21)     Sort-index
 blog                 (0,5)                                        0
 logg                 (4,3)                                        4
 ogge                 (5,2)                                        5
 gger                 (3,1)                                        3
 ger$                 (2,6)                                        2
 er$$                 (1,-1)                                       1
 r$$$                 (6,-1)                                       6

Same will happen for step 3.

So to generalize this,if we go from step y to step y + 1 , we will concatenate the sort-index of suffix array starting at position ‘i’ with sort-index at ‘i + 2y‘ to obtain the length of 2y + 1 for creating the sort-index at step ‘y + 1’.

Code for building Suffix Arrays:

typedef struct temp{
	int nr[2];
	int p;
}pair_suffix;  
                                                 
bool waytosort(pair_suffix i,pair_suffix j)
{
    if(i.nr[0] == j.nr[0]){
        if(i.nr[1] < j.nr[1])
            return 1;
        else
            return 0;
    }
    else if(i.nr[0] < j.nr[0])
        return 1;
    else 
        return 0;
}
void build_suffix_array(string s){
    long long int len = s.length();
      /*this is the array which stores the pair in 
       suffix array which is to be used for creating 
       sort index at each step */
    pair_suffix L[len];
    //no. of steps to calculate suffix array
    int steps = ceil(log(len)/log(2)); 
    int P[steps + 1][len];
    
    for(int i = 0;i < len;i++){
        //Initialization step not included in steps
        P[0][i] = s[i] - 'A';	
    }
    int cnt = 1;
    int k;
    for(k=1; k<=steps; k++,cnt <<= 1){
        //steps to calculate suffix array
        for(int i=0; i<len; i++){
            L[i].nr[0] = P[k - 1][i];
            L[i].nr[1] = i + cnt < len? P[k - 1][i + cnt]:-1;
            L[i].p = i; 
        }
        sort(L,L + len,waytosort);
        
        for(int i=0; i<len; i++){ P[k][L[i].p] = i > 0 
                           && L[i].nr[0] == L[i - 1].nr[0] 
                           && L[i].nr[1] == L[i - 1].nr[1] 
                           ? P[k][L[i - 1].p]:i;
        }
     }
     //suffix array will be found on the last row of matrix P.
}

Time Complexity:Complexity for building suffix array is O(n*log22n).
Now you can find kth suffix in lexicographical order using suffix array in O(n) complexity.Suffix arrays are quite helpful in competitive programming because they are fast to code.

Application:
1) Compute the longest common prefix(LCP) using suffix arrays.
2) Suffix Arrays are helpful in various string searching and string matching problems.

Questions to practice:
Easy:
http://www.spoj.com/problems/SARRAY/
http://www.codechef.com/problems/TASTR
Medium:
http://www.codechef.com/problems/SSTORY
http://www.spoj.com/problems/DISUBSTR/

  • Nigel Horspool

    Take a look at the Burrows-Wheeler compression algorithm (used in bzip2). Its first step is to create a sorted list of all rotations of a (very long) string. This is almost the same problem.