Minimum Edit Distance Dynamic Programming

Minimum edit distance between two strings

Minimum Edit distance between two strings is minimum number of operations one need to perform on one string so that it transforms into another string. Operations allowed are: insertion of a character, deletion of a character and substitution of a character. For example,

String S1  = EXPONENTIAL
String S2 = POLYNOMIAL
minimum edit distance between two strings exampleFrom above example, we see that in order to find minimum edit distance, we have to find best possible alignment of two strings. However, there are so many alignments possible with two strings, it will be very costly to consider each alignment and look for the best one. Can we break problem in smaller and easy to solve subproblems?

Problem at hand is to find minimum edit distance between X[1…n] and Y[1…m] strings. Consider prefix of each string X[1…i] and Y[1…j], let’s find edit distance for these prefixes and lets call it Edit(i,j).
At the end, we need to calculate Edit(n,m).

If we align two string, we start aligning right most part first. So there three possibilities at to treat right most characters.

edit distance

Let’s consider each case one by one:
Case 1:
If last character of S1 does not match with last character of S2, let’s say we delete character from S1. Cost of this operation will be 1. Now, there are i-1 characters in X and j characters in Y to consider which is nothing but Edit(i-1,j).

Case 2:
If last character of S1 does not match with last character of S2, and a new character us added to S1. Cost of insert operation will be 1. There are i characters in X and j-1 characters in Y to consider which is nothing but Edit(i,j-1).

Case 3:
In this case, we do not insert or delete character. There two possibilities : Either the aligned characters match or they do not.
If they match, then find edit distance for i-1 and j-1 length prefix. No cost will incur if they match.
If they don’t match, then we need to substitute one with other. Cost of which will be 1 and our problem reduces to i-1 and j-1 prefixes.

Finally we are able to define our problem into subproblems which can be solved recursively.

Edit(i,j) = min { 1+ Edit(i,j-1), 1 + Edit(i-1,j),
                        Edit(i-1, j-1) if X[i] == Y[j]
                        1+ Edit(i-1, j-1) if X[i] == Y[j] )

What will be the base case for recursion? If both strings are of length zero, cost will be 0.
If one string is of length 0, then cost will be length of other string.

Edit distance between two string : Recursive implementation

#include<stdio.h>
#include<string.h>
 
int min(int a, int b) {
	int min = a > b ? b : a;
	return min;
}
 
int editDistance(char *s1, char *s2, int length1, int length2){
 
	if(length1 == 0 && length2 == 0) return 0;
 
	if(length1 == 0) return length2;
 
	if(length2 == 0) return length1;
 
	int isCharacterEqual = s1[length1] == s2[length2] ? 0 : 1;
        return min(min( (1+ editDistance(s1,s2, length1-1, length2)),
        		( 1+ editDistance(s1,s2,length1, length2-1))),
        		( isCharacterEqual + editDistance(s1,s2, length1-1,length2-1)) );
}
//Driver program
int main(){
 
    char *s = "EXPONENTIAL";
    char *d = "POLYNOMIAL";
    printf("Minimum distance between two strings is : %d", 
            editDistance(s,d, strlen(s), strlen(d)));
    return 0;
}

It is evident that we are solving one sub problem again and again. To avoid that, we can use dynamic programming. There are two necessary conditions to apply dynamic programming to any problem : Problem should optimal subproblems and subproblems should be overlapping. These two conditions are met here.

To implement above formula in dynamic programming, a two dimensional table is required where
Table(i,j) stores Edit(i,j) and every cell can be calculated with bottom up approach. At the end Table(n,m) gives the final edit distance. Does not matter, if we fill table row wise or column wise, when we reach at cell (i,j), we will have all the required cells already filled in.

Minimum edit distance using dynamic programming

#include<stdio.h>
#include<string.h>
 
int editDistance(char *s1, char *s2){
    int n = strlen(s1);
    int m = strlen(s2);
 
    int minimumDistance = 0;
    int currentMinimum  = 0;
    int Table[n+1][m+1] ;
 
    memset(Table,0, sizeof(Table));

    for(int i=0; i<=n; i++)
        Table[i][0] =i;
 
    for(int i=1; i<=m; i++)
        Table[0][i] = i;
 
    for(int i=1; i<=n; i++){
        for(int j=1; j<=m; j++){
            //Case 3 : Possibility 1 :If X[i] == Y[j] 
            if(s1[i-1] == s2[j-1]){
                currentMinimum = Table[i-1][j-1];
            }
            //Case 3 : Possibility 2 :If X[i] != Y[j] 
            else{
                currentMinimum =  Table[i-1][j-1] + 1;
            }
            //Case 1 : Deletion of character from S1 
            if(Table[i][j-1] > Table[i-1][j]){
                minimumDistance = Table[i-1][j] + 1;
            }
            //Case 2 : Addition of character on S1 
            else {
                minimumDistance = Table[i][j-1] + 1;
            }
 
            if(currentMinimum < minimumDistance){
            	minimumDistance = currentMinimum;
            }
            Table[i][j] = minimumDistance;
        }
    }
    return Table[n-1][m-1];
}
//Driver program
int main(){
 
    char *s = "EXPONENTIAL";
    char *d = "POLYNOMIAL";
    printf("Minimum distance between two strings is : %d", 
            editDistance(s,d));
    return 0;
}

Complexity of  finding edit distance between two strings problem is O(NM) with extra space complexity of O(NM).

Applications of edit distance algorithm

1.  Used in spell check program to find closest word.
2. In computational biology to align two genes.
3. In speech recognition,  machine translations and information extraction.

Please share if there is something wrong or missing. If you want to contribute to algorithms and me, please contact us and refer to publishing terms and conditions.