Longest Common Subsequence

Longest common subsequence problem

Given two strings X and Y, find longest common subsequence in two strings. A subsequence of string is set of all the characters in left to right order and not necessarily contiguous.For example: string ABCDEG has ABC, ADG, EG, BCDEG subsequences,whereas BDA is not a subsequence of given string. Longest common subsequence in two strings is maximum number of common characters in two strings with order of appearance in string is same.  For example
X = “ABCDSEFGD”
Y = “ACFEFXVGAB”
Longest Common Subsequence  will be ACEFG.

Brute force method to solve this is to find all subsequences of two string (exponential order) and then find the common one with maximum length. As evident, complexity of this method is exorbitant.

Longest common subsequence using dynamic programming

To come up with dynamic programming solution, first we have to figure out the recursive nature of problem, so that problem can be divided into smaller subproblems and solved.  Second condition for applying dynamic programming is to have overlapping problems so that recursive solution can be optimized by storing results of subproblems.

Given two strings A and B, we start from first characters of them. If first character of A and B match, then this character is definitely part of longest subsequence of two string. We add that to out LCS and find longest subsequence now remaining characters of A and B. Problem reduced by one character.
What if first characters differ? Then there are three possibilities, either longest subsequence starts with first character of string A or it starts with first character of B or it contains none of them.  We have to find longest subsequence with these three cases and take the maximum of three. Actually, there are only two cases as third one implicit in other two.

If LCS[i,j] is the length of the LCS of A[1..i] with B[1..j]. How can we solve for LCS[i,j] in terms of the LCS’s of other smaller problems?

Case 1 : Check if A[i] == B[j]. If yes, problem is now reduced to find longest common subsequence in A[1…i-1] and B[1…j-1].

Case 2 : A[i] != B[j]. Exclude character A[i] and B[j] and find LCS for remaining string. First, we exclude character A[i]. Problem reduces to finding longest common subsequence of A[1…i-1] and B[1…j]. Now, exclude character B[j] and problem reduces to A[1…i] and B[1…j-1]. Take maximum of two cases. So the recursive relation comes up as

 
lcs  =  1 + longestCommonSubsequence(A[i-1], B[j-1]) if A[i] == B[j]
     =  MAX (longestCommonSubsequence( A[i-1],B[j]),
              longestCommonSubsequence( A[i], B[j-1]) if A[i] != B[j]
#include <stdio.h>
 
int max(int a, int b){
	return a>b ? a:b;
}
int longestCommonSubsequence(char *A, char *B){
 
	if (*A == '\0' || *B == '\0') return 0;
 
	else if (*A == *B) {
		return 1 + longestCommonSubsequence(A+1, B+1);
	}
	else {
		return	max(longestCommonSubsequence(A+1,B), 
		    	longestCommonSubsequence(A,B+1));
	}
}
 
int main(void) {
	char *a = "ABCDSEFGD";
	char *b = "ACFEFXVGAB";
 
	printf("\n Longest common subsequence : %d",
			longestCommonSubsequence(a,b));
 
	return 0;
}

Python implementation by Prince Mishra

# start looking from the end, if current characters 
#match, lcs = 1 + lcs of previous items


def lcs_naive(A, B):
    # base case
    # if any string cases to exist,there is 
    # no further checking
    if not A or not B:
        return 0

    # propagation
    if A[-1] == B[-1]:
        return 1 + lcs_naive(A[:-1], B[:-1])

    return max(lcs_naive(A[:-1], B), lcs_naive(A, B[:-1]))

A = '1234'
B = '5162'
print lcs_naive(A, B)

Complexity of recursive method to find longest subsequence is O(2n).

Notice that there are subproblems which are solved multiple times. How do we know that? Function call is made with suffix of A and B, there are (m+1) * (n+1) such combinations and if complexity is exponential, function is called with same suffixes multiple time.  To avoid solving those subproblems again and again, we can store values of those subproblems.

This gives us a perfect case for application of dynamic programming. We create a two dimensional table of size M X N. Table[i][j] stores the longest common subsequence of A[0…i] and B[0..j].

Table[i,j]  =  1 + Table[i-1, j-1] if A[i] == B[j]
            =  MAX (Table[i-1,j], Table[i, j-1]) if A[i] != B[j]
#include <stdio.h>
#include <string.h>
 
int max(int a, int b){
	return a>b ? a:b;
}

int longestCommonSubsequence(char * A, char * B){
    int lenA = strlen(A);
    int lenB = strlen(B);
 
     int Table[lenA+1][lenB+1];

     for (int i=0; i<=lenA; i++){
         Table[i][0] = 0;
     }
     for (int j=0; j<=lenB; j++){
         Table[0][j] = 0;
     }
 
     for (int i=1; i<=lenA; i++){
         for (int j=1; j<=lenB; j++){
             if (A[i] == B[j])
                 Table[i][j] = 1 + Table[i-1][j-1];		
             else
                 Table[i][j] = max(Table[i-1][j], Table[i][j-1]);
	 }
      }

      return Table[lenA][lenB];
}
 
int main(void) {
	char *a = "ABCDSEFGD";
	char *b = "ACFEFXVGAB";
 
	printf("\n Longest common subsequence : %d",
			longestCommonSubsequence(a,b));
 
	return 0;
}

How to find actual sequence of characters? To find sequence, we just walk backwards through
matrix starting from Table[lenA][lenB].Now, for each i and j we cross till we reach i and j as zero, do following steps:

If either Table[i-1][j] or Table[i][j-1] have value equal to Table[i][j], then move to that either or them. 
If both Table[i-1][j] and Table[i][j-1] less than Table[i][j], then move to Table[i-1][j-1]. Output the associated character. 
Note that this will output the characters in the LCS in reverse order.

There is another approach where table is filled bottom up.In this method, we start from the end of two strings and go towards the first character. Table[0][0] gives us the answer.
Advantage of this method is that we do not need to initialize the matrix.

#include <stdio.h>
#include <string.h>
 
int max(int a, int b){
	return a>b ? a:b;
}
int longestCommonSubsequence(char * A, char * B){
 	int lenA = strlen(A);
 	int lenB = strlen(B);
 
	int Table[lenA+1][lenB+1];
	for (int i=0; i <= lenA; i++){
	    Table[i][0] = 0;
	}
 
	for (int j=0; j <= lenB; j++){
	    Table[0][j] = 0;
	}
 
	for (int i=1; i<= lenA; i++){
	    for (int j=1; j <= lenB; j++){
	        if (A[i] == B[j])
		    Table[i][j] = 1 + Table[i-1][j-1];		
                else
		    Table[i][j] = max(Table[i-1][j], Table[i][j-1]);
	    }
	}
	return Table[lenA][lenB];
}
 
int main(void) {
	char *a = "ABCDSEFGD";
	char *b = "ACFEFXVGAB";
 
	printf("\n Longest common subsequence : %d",
			longestCommonSubsequence(a,b));
 
	return 0;
}

Finding longest common subsequence using dynamic programming has complexity of O(n2) with additional space complexity of O(n2).

There is one optimization which can be done in above implementation to reduce space complexity. Notice that to calculate Table[i][j], you need three values : Table[i-i][j-1], Table[i-1][j] and Table[i][j-1]. Once row i is calculate row i-1 has no use. Can we use this insight to save space?

#include <stdio.h>
#include <string.h>
 
int max(int a, int b){
	return a>b ? a:b;
}
 int longestCommonSubsequence(char * A, char * B){
 	int lenA = strlen(A);
 	int lenB = strlen(B);
 
	int Table[lenA+1][lenB+1];
 
	for (int i=lenA; i >= 0; i--){
	    for (int j=lenB; j >= 0; j--){
		if (A[i] == '\0' || B[j] == '\0')
		    Table[i][j] = 0;	
		else if (A[i] == B[j])
		    Table[i][j] = 1 + Table[i+1][j+1];		
		else 
		    Table[i][j] = max(Table[i+1][j], Table[i][j+1]);
	    }
	}
	return Table[0][0];
}
 
int main(void) {
	char *a = "ABCDSEFGD";
	char *b = "ACFEFXVGAB";
 
	printf("\n Longest common subsequence : %d",
			longestCommonSubsequence(a,b));
 
	return 0;
}
 

This method takes O(m*n*(min(m,n))) time and space complexity is O(min(m,n)).

Applications of longest common subsequence problems

Molecular biology. DNA sequences (genes) can be represented as sequences of four letters ACGT, corresponding to the four sub-molecules forming DNA. When biologists find a new sequences, they typically want to know what other sequences it is most similar to. One way of computing how similar two sequences are is to find the length of their longest common subsequence.

File comparison. The Unix program “diff” is used to compare two different versions of the same file, to determine what changes have been made to the file.

Screen redisplay. Many text editors display part of a file on the screen, updating the screen image as the file is changed to save network bandwidth. It is possible to view the computation of the minimum length sequence of characters needed to update the terminal as being a sort of common subsequence problem (the common subsequence tells you the parts of the display that are already correct and don’t need to be changed).

Please share if you find something wrong or missing. If you want to contribute to site, please refer : Publish on Algorithms and Me and contact us. We would be happy to publish your work and in turn will pay you too.

Reference : https://www.ics.uci.edu/~eppstein/161/960229.html

  • http://www.blogger.com/profile/02306978234925515887 prevents email spam

    From the article:

    X = ABCDSEFGD
    Y = ACFEFXVGAB
    LCS Z would be ACFEFG.

    I don’t think I see how ACFEFG could be the LCS? There is only one F in X, right?

    • http://www.blogger.com/profile/07885594236543253450 Jitendra Sangar

      Agree, thanks a lot for pointing it out. Will correct it.

  • http://algorithmsandme.in/ Jitendra Sangar

    HI am just trying this new feature in commenting section.

  • Pingback: Longest common substring – Algorithms and Me()