Knuth Morris Pratt algorithm for string searching

Knuth Morris Pratt algorithm for string searching

You are given two strings – a Text and a Pattern. Determine whether the Pattern exists in the Text or not.E.g. – Consider a Text ABCDGBCDLMand a Pattern BCD. Final output should be all the indexes where the Pattern exists, here it is 1 and 5 (0 based index).This post explains Knuth Morris Pratt algorithm to do so.

The naive method is very basic and straightforward. For every position in the Text, consider it as the starting position and see if a match with the pattern is found.

Considering a text and a pattern of small length, this method would work properly giving O (N*M) complexity, where N and M are the lengths of text and pattern respectively.

The C implementation of the naive method is given below

#include <stdio.h>
#include <string.h>

void stringMatching(char text[], char pattern[])
{
	//Lengths of text and pattern
	int textLen = strlen(text);
	int patternLen = strlen(pattern);
	
	for(int i = 0; i < textLen; i++){
		int j = 0;
		for(j = 0; j < patternLen && i+j < textLen; j++){
			if(text[i + j] != pattern[j]) break;
		}
		
		if(j == patternLen){
			//match found, print the indexes
			printf("\n%d", i);	
		}
	}
}

int main(){
	stringMatching("cdabcde", "abc");
	return 0;
}

The Knuth Morris Pratt Algorithm

If we are able to identify all the positions where an existing match of the pattern ends, in a single iteration over the text. This would solve our problem, since we know the length of the pattern it is easy to identify the starting position of the match. Knuth-Morris-Prat algorithm works on this principle.

In this algorithm, we apply the concept of automaton. An automaton is a kind of an abstract object. At each step of this automaton, some information is presented to it. Using this information the automaton goes to a new state from its current state. whenever the automaton reaches its final state, we have found an end position of a match. The general idea behind the automaton is very simple, consider this Pattern – BBABBB.  Let us list all the prefixes of the pattern.

1. Empty string
2. B
3. BB
4. BBA
5. BBAB
6. BBABB
7. BBABBB

Now for every prefix listed above, consider the longest suffix that is also a prefix of it. The suffix should be different from the string itself.
1. Empty string (No suffix for an empty string)
2. Empty string (The suffix is B but it is same as the string itself)
3. B     (BB) (Here B is the suffix that is also the prefix of string BB)
4. Empty string (No suffix is present which is also the prefix)
5. B     (BBAB(Here B is the suffix that is also the prefix of string BBAB)
6. BB   (BBABB)(Here BB is the suffix that is also the prefix of string BBABB)
7. BB (BBABBB)(Here BB is the suffix that is also the prefix of string BBABBB)

Let us call a suffix that is also a prefix of a string a “Partial match” of that string.We can see that if at some point, the pattern matches till BBABB (which is the prefix of the pattern), then there is also a match till BB which is Partial match of BBABB.
For example :
Let us consider text : BBBABBABBBA and the Pattern BBABBB, the pattern matches till BBABB (If we check from the 2nd character of the text), since BB is partial match of BBABB, the text also matches BB (BB is the largest suffix of BBABB which is also prefix of it), so if we cannot extend our match of BBABB to full pattern with the text, we can start matching the pattern again starting from the suffix BB(BB is the largest suffix of BBABB which is also prefix of it
         
          0 1 2 3 4 5 6 7 8 9 10   (Indexes)
Text : BBBABBABBBA
             BBABBB
Here we wanted the next character of the text to be A to complete our match but it is B so we can start check for next match from 4th index of text because BB is also a partial match now. So here two cases arise,
Consider  pattern :  BBABBB and some text, and till now we have found a match till BBABB, as shown in the figure below.
BBBABB…………..(Text continues)
              BBABB    (Match found till BBABB)

Case 1:  If the next character of the text is ‘B’, that means a match is found, we can start checking in  similar manner starting from the next character  of the text to find another match if there is any.
Case 2: If the next character of the text is anything other than ‘B’ that means the match is not found, so we return to the largest partial match of the string we have found match till now (which is BB for string BBABB), now if the next character of the text matches, our match extends, but if it does not match we will find the next partial match of BB and so on until the partial match becomes empty, then we will skip the next character of the text and start  a fresh check.
Let us see the above operations of Knuth Morris Pratt algorithm with an example.
Text – BBBABBABBBA
Pattern – BBABBB
The automaton for the pattern will look like this…

Knuth Morris Pratt algorithm

 

Knuth Morris Pratt algorithm example

 Step 1: we will start by matching pattern from the first character of the text. Here we found a mismatch at 3rd character. so we will find partial match of BB, which is B and start match from 2nd character of the text. 

Knuth Morris Pratt implementation

Step 2: We when checking from 2nd character, mismatch is found at 6th character of the text, so we start again from the partial match of BBABBB which is BB.

Step 3: We found a match this time at the 5th character of the text.
Now, to find partial matches of prefixes (till we have found a match) of the pattern, we will make an array of length equal to the length of the pattern, let us call it F[pattern_length]. F[i]
(0 <= i <= pattern_length)  contains the length of longest partial match of a prefix (till we have found a match) of pattern till ‘i’.
So here :

F[0] = 0          
F[1] = 0
F[2] = 1
F[3] = 0
F[4] = 1
F[5] = 2
F[6] = 2

The array F not only contains the largest partial match till index i, but also all the partial matches of it, so F[i] is the largest partial match till index i, F[F[i]] is second largest partial match, F[F[F[i]]] is third largest and so on.
Let us call it failure function of the pattern.Below is the function to build the array

void failure_function(char pattern[]){
	int j;
	int ff[100];
	//length of pattern
	int patternLen = strlen(pattern);

	/*let the array be 'ff'
	i = 0 is empty string and i = 1 is a single character,always true */

	ff[0] = ff[1] = 0;                      
	
	for(long long int i = 2; i <= patternLen; i++){
		/* j is the index of the largest next partial match 
		(the largest suffix/prefix) of the string till  
		index i - 1 */
		
		j = ff[i-1];                     
		for( ; ; ){              
			if(pattern[j] == pattern[i-1]){           
				ff[i] = j + 1;       
				break;
			}
			if(j == 0){                      
				ff[i] = 0;
				break;
			}                               
			j = ff[j];
		}
	}
}

Now we have to use failure function to find match. whenever we cannot expand the current partial match, we go to next best partial match of the current partial match and so on. If we are unable to expand even the empty string we just skip this character from the text and go to the next one in the text.

Knuth Morris Pratt algorithm implementation

void KMP(char text[], char pattern[]){
	failure_function(pattern[]); 
	int pattern_length = strlen(pattern);
	int text_length = strlen(text);
	int index = 0;   j = 0;
  
	/* let "occurrences" be the vector container that
	contains that stores indexes whenever a match is found */
  	for(int i = 0; i < text_length; i++){
            for( ; ; ){
                if(text[i] == pattern[index]){
                    index++;
                    if(index == len)
                        occurences.insert(i - len);
		
                    break;
                 }
                 else if(index > 0){
                     index = ff[index];
                 }
                 else{
                     break;
                 }
             }
	 }
    }
}

Please share if there is something is missing or wrong. If you are interested in contributing to algorithms and me, please contact us.