
Merge Sort Algorithm

We can classify sorting algorithms based on their complexity: selection sort, bubble sort and insertion sort have complexity O(n^2), while heap sort, merge sort and quick sort (with some assumptions) have complexity O(n log n), and counting and radix sorts are linear O(n) algorithms. In this post, we will concentrate on the merge sort algorithm.

Key points about merge sort algorithm

  • Class of algorithm : divide and conquer
  • Time complexity : O(n log n)
  • Space complexity : O(n)
  • Stable : Yes
  • In place : No

Merge sort falls in the divide and conquer class of algorithms, where the input space is divided into subproblems, these subproblems are solved individually, and their solutions are combined to solve the original problem. The figure below explains the divide and conquer paradigm.

Divide and conquer strategy

In merge sort, the input space is divided into two subproblems until we reach the smallest subproblem; the smallest subproblems are then sorted and combined back up until the original input is sorted. The question that arises is: what is the smallest subproblem? The smallest subproblem is one where the input cannot be divided further, i.e. a subproblem with one item to be sorted.

Once the split is done down to the smallest size, start merging. Going from the bottom, start with two one-item subproblems and combine them to create a two-item subproblem, then combine two two-item subproblems to create a four-item subproblem. Go on until you get back to a problem of the original size.

merge sort algorithm

Implementation of Merge sort algorithm

As is the case with all divide and conquer problems, the same processing is applied to each subproblem that was applied to the original problem – a natural case for a recursive solution. Recursion needs a termination condition. The base condition to stop recursion in merge sort is when a subproblem has only one element to be sorted. How do you sort a single element? You don't; just return the element.

Now, how do we combine the two subproblem solutions? This step is the conquer part: sort the smaller parts and combine them with a merge operation. In the merge operation, scan both sorted arrays element by element and, based on the comparison, put the smaller of the two items into the output array, until both arrays are scanned. If one array finishes before the other, just copy all remaining elements of the other array into the output array. Finally, copy this output array back into the original input array so that it can be combined with bigger subproblems until the solution to the original problem is derived.

#include <stdio.h>

/* Merge the two sorted halves a[start..mid] and a[mid+1..end] */
void merge(int a[], int start, int mid, int end){
    int i, j, k;
    int temp[end - start + 1];
    i = start; j = mid + 1; k = 0;
    /* Compare elements of the two sorted halves and store
       the smaller one in the temporary array */
    while(i <= mid && j <= end){
        if(a[i] < a[j]){
            temp[k++] = a[i++];
        } else {
            temp[k++] = a[j++];
        }
    }
    /* Copy whatever is left in either half */
    while(i <= mid){
        temp[k++] = a[i++];
    }
    while(j <= end){
        temp[k++] = a[j++];
    }
    /* Copy the temporary array back into the original array */
    for(i = 0; i < k; i++){
        a[i + start] = temp[i];
    }
}

void mergeSort(int a[], int start, int end){
    if(start < end){
        int mid = (start + end) / 2;
        /* sort the left part */
        mergeSort(a, start, mid);
        /* sort the right part */
        mergeSort(a, mid + 1, end);
        /* merge them together */
        merge(a, start, mid, end);
    }
}

int main(){
	int a[] = {2,3,4,1,8,7,6,9};
	int size = sizeof(a)/sizeof(a[0]);
	mergeSort(a, 0, size-1);
	for(int i = 0; i < size; i++){
		printf("%d ", a[i]);
	}
	return 0;
}

Let's analyse the complexity of merge sort. It can be divided into two parts:
1. The split step, which takes a constant amount of time: all it does is compute the middle index to create two halves of the array, irrespective of the size of the array. So, the complexity of this step is O(1).

2. The combine (merge) step, whose complexity is O(n), where n is the number of elements.
The complexity of each divide and conquer level is therefore O(n). So, how many such levels are there?
For n elements in the array, there are O(log n) levels. Hence, the complexity of merge sort is O(n log n).

However, there is one caveat: every time we merge two sub-arrays, an auxiliary array is needed to temporarily store the elements, and that incurs O(n) space complexity on merge sort.

There are some improvements which can be made to this algorithm.
1. When the number of elements is less than some threshold, one can use insertion or selection sort to sort those elements. Why? Because when n is small, O(n^2) is as good as O(n log n), and it saves the extra overhead of splitting and merging. All in all, using insertion sort on small sub-arrays can give better performance.

2. Before merging, check whether all the elements in the right sub-array are greater than those in the left sub-array; if yes, there is no need to merge. This can easily be checked by comparing a[mid] with a[mid+1]. If a[mid] is less than a[mid+1], the two sub-arrays together are already sorted and we can skip the merge.

Please let us know what you think of this article and whether there is something wrong or missing.


Binary Indexed Trees (Fenwick Tree)

Why Binary Indexed Tree?

Consider an array A : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; you have to find the sum of a range, say (1,5). One way of doing this is by keeping the cumulative sum of elements from index 0 to index i, where 0 <= i <= n. The array with all the cumulative sums is sum : {1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136}, and to calculate the sum from index 1 to 5 we can simply do sum[5] – sum[0] (in general, sum[r] – sum[l – 1]). This takes O(n) pre-computation time and O(q) for q queries, with O(1) per query.

Let us modify the question now. Suppose we have another kind of query that modifies the value at some index i; this forces us to recalculate the cumulative sums from index i to n. Now the complexity becomes O(q*n) if there are q queries. Segment trees can be used to solve this in O(q*log(n)) (refer to this post for segment trees).

Coding a segment tree can be a lengthy and hectic process, segment trees require a large memory space, and debugging segment tree code is difficult. Another approach to the above problem is the Binary Indexed Tree data structure, which also has O(q*log(n)) complexity, but BITs (Binary Indexed Trees) are much easier to code and require much less memory than segment trees. Binary Indexed Trees are also called Fenwick Trees.

Representation of Binary Indexed Tree

Consider an input array A : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}. A Binary Indexed Tree or Fenwick tree is represented using an array of size n, where n is the length of the input array. Let's call the binary indexed tree of this array "tree[n]". tree[idx] (where idx is some index of the BIT) stores the sum of values of the input array from index (idx – 2^r + 1) to index idx. Here, r is the position of the last (least significant) set bit in the binary representation of the index idx.

So, for example, for idx = 6 the binary representation is 0110; the last set bit is at position 1 (considering 0-based positions from the right), which makes r = 1. Therefore tree[6] stores the sum from index 5 to index 6.
The diagram below shows the value of r for every index from 1 to 16. The color of the rth bit is changed for better understanding.

Binary Indexed Tree

Therefore, tree[12] = A[9] + A[10] + A[11] + A[12]. To calculate tree[idx], we can store the cumulative sum up to index idx in an array sum and then subtract sum[idx – 2^r] from sum[idx]. This gives the value of tree[idx] in O(1).
So, tree[idx] = sum[idx] – sum[idx – 2^r].

How to Find the value of the last set bit?

Let num be an integer whose last set bit we want to isolate. Note that num can be represented in the form a1b, where a represents the series of bits before last set bit and b represents all the zeros after the last set bit.

The integer (-num) can be found using the 2's complement of num, which is obtained by adding 1 to the bitwise inverse of num. The expression for the two's complement is as follows:

(a1b)¯ + 1 = a¯0b¯ + 1.

Since b consists of all zeroes, b¯ consists of all ones.
Therefore, finally we have

-num = (a1b)¯ + 1 = a¯0b¯ + 1 = a¯0(0…0)¯ + 1 = a¯0(1…1) + 1 = a¯1(0…0) = a¯1b

Now, we can easily isolate the last set bit by taking the bitwise AND of num and -num:

        num   =        a 1 b
     & -num   =       a¯1 b
              =  (0…0) 1 (0…0)

So, to strip off the range covered by tree[idx], i.e. to move from idx to (idx – 2^r), we just have to do the following operation:

idx = idx – (idx & -idx);

Construction of Binary Indexed Tree 

For every index idx, tree[idx] is calculated in O(1) complexity using the expression tree[idx] = sum[idx] – sum[idx – 2^r], where sum[idx] stores the cumulative sum up to index idx (with sum[0] = 0).

The code is shown below.
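
As a sketch in C++ (assuming a 1-based tree[] and a prefix-sum array sum[] with sum[0] = 0; function and variable names are illustrative):

#include <vector>
using namespace std;

// Build tree[1..n] from prefix sums: tree[idx] = sum[idx] - sum[idx - 2^r],
// where (idx & -idx) is exactly 2^r, the size of the range covered by tree[idx].
void buildTree(const vector<long long>& sum, vector<long long>& tree, int n) {
    for (int idx = 1; idx <= n; idx++) {
        int rangeSize = idx & -idx;                  // 2^r
        tree[idx] = sum[idx] - sum[idx - rangeSize];
    }
}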

Sum Between Two Indices :

To calculate the sum between two given indices l and r, we calculate the cumulative sum up to index r, then the cumulative sum up to index l – 1, and take the difference of the two results.

Let us consider the example of index 13. To calculate the cumulative sum up to index 13, the array tree plays the major role. We know that tree[13] stores the sum of index 13 only, tree[12] stores the sum from index 9 to index 12, and tree[8] stores the sum of the first 8 elements. So, adding tree[8] + tree[12] + tree[13] gives us the cumulative sum up to index 13.

sum(1…13) = tree[13] + tree[12] + tree[8]

sum(1101) = tree[1101] + tree[1100] + tree[1000]                   (binary representation)

Note that the complexity of our algorithm to calculate the cumulative sum up to index idx is O(log(idx)).

The diagram below illustrates this.

Binary Indexed Tree

The code to find the cumulative sum up to index idx is shown below.
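
A sketch of that read operation (same 1-based layout as above):

#include <vector>
using namespace std;

// Cumulative sum of the first idx elements: keep adding tree[idx]
// and dropping the last set bit until idx becomes 0.
long long readSum(const vector<long long>& tree, int idx) {
    long long total = 0;
    while (idx > 0) {
        total += tree[idx];
        idx -= idx & -idx;    // idx = idx - 2^r
    }
    return total;
}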

Update a value at some position and update the BIT:

If the value at some index idx is increased by some value, say val, then we will have to update the tree at all those places which are affected by this index.
For example, if the value at index 9 is changed, then tree[9], tree[10], tree[12], tree[16] and so on will change, because

tree[10] = A[9] + A[10];
tree[12] = A[9] + A[10] + A[11] + A[12];

While reading the sum, we kept removing the last set bit from the index until it became zero. While updating the tree, we instead keep adding the last set bit to the index until it becomes greater than the length n.

Below is the code to do that.
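
A sketch of the update (same layout; val is the amount added at position idx):

#include <vector>
using namespace std;

// Add val to the element at position idx: every tree[] range that covers idx
// is reached by repeatedly adding the last set bit, until idx exceeds n.
void updateTree(vector<long long>& tree, int n, int idx, long long val) {
    while (idx <= n) {
        tree[idx] += val;
        idx += idx & -idx;    // move to the next range that covers idx
    }
}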

Binary Indexed Trees are easy to code, the code length is very short, and they should be used wherever possible.

The Links of some practice problems are given below :


Counting permutations : Programming as an art

I am writing this blog on algorithms to lay emphasis not only on the solution to the problem, but also on some other aspects which get completely ignored when writing the implementation of algorithms. To start with, let's first look at the question:
Implement a program that finds the total number of ways of picking one number each from N bags (each bag containing K numbers) such that the sum of the picked numbers is greater than a given threshold (H), where K, N > 1. Assume that the program is invoked on the command line, with N, K, input-file-name and H as parameters, and outputs the result to its standard output.

Brute Force Algorithm

To start with, let us look at how we would solve this problem in the simplest manner. If we could somehow iterate through all the combinations (one number picked from each bag), then we simply have to add the numbers selected in the current combination, compare the sum with the threshold, and increment the count if it is greater.
The algorithm is simple and can be written as follows:
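
For instance, a minimal C++ sketch of this brute force (representing the bags as a vector<vector<int>> and recursing over them; names are illustrative):

#include <vector>
using namespace std;

// Count the ways of picking one number from each bag so that the sum exceeds H.
long long countGreater(const vector<vector<int>>& bags, size_t bag, long long sum, long long H) {
    if (bag == bags.size())                 // one number picked from every bag
        return sum > H ? 1 : 0;
    long long count = 0;
    for (int x : bags[bag])                 // try every number of the current bag
        count += countGreater(bags, bag + 1, sum + x, H);
    return count;
}
// usage: countGreater(bags, 0, 0, H)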

The complexity of this simple algorithm is O(N*K^N), since there are K^N combinations and for each of them the sum of the N chosen numbers has to be computed.

Think before you Program

I had given this as an exercise to some students. I am reproducing here, without prejudice, one of the answers, to highlight some of the other aspects that were completely ignored.

The biggest problem in the above code is locating where the algorithm is actually implemented. The program is simple, and in essence the recursive implementation does, with some optimization, exactly what the algorithm written above does. However, it is spread across multiple places in the program: partly in the recursive method and partly in the main method.
Consider the situation where I have to use this algorithm for some analytical reasoning in a business application and I have asked for a POC to be done. Using the above implementation would probably be difficult for me, since the program does not separate concerns by keeping logically similar concerns together and dissimilar ones loosely coupled.
Another big problem here is the field 'count' being held in a class-scoped variable. This is very dangerous considering that one may want to calculate the count multiple times in parallel in production.
One concern an architect would have is the ability to use different algorithms in different situations, which requires a proper model of the problem. Since there is none, the architect would have to sit down and do the modelling afresh.

Art or Science

Now, the question is whether the aspects that were missed in the above solution were so difficult that they had to be completely ignored. So let's put some thought into it and list the high-level concerns in the problem (I think this is known as a CRC [class responsibility collaboration] card). At a very high level, I see the following concerns:

  1. Bag containing the numbers – I intend to model it via an interface Bag which is responsible for the following:
    1. Count of numbers in the bag
    2. Some kind of iterator through the numbers in the bag
    3. Min, Max numbers in the bag
    4. Mechanism to hold the numbers in a particular order as desired by the algorithm
    5. With a Self Cloning, Self Factory default implementation
  2. Algorithm to find the count – I intend to model it via an interface PermutationCounter which is responsible for the following:
    1. Name of the algorithm
    2. A method to calculate the count for a given array of bags and the threshold
  3. Take problem input from the screen/file and invoke the algorithm for the problem – this will be done by the main method

Simple, was it not? The advantage of this approach is that the responsibilities are clearly demarcated, and it is anybody's guess which parts can easily be reused: the Bag and PermutationCounter, along with all their implementations, can be reused. Now, let's look at the adjacent diagram, which captures the class hierarchy and how it is to be used by the client (for the scope of this problem, the main method). In practice, one can experiment and decide on the strategy to follow for using the implementations.

Now let me return to the point where I left the iterative algorithm for this problem. How are the aspects present in the algorithm mentioned at the start actually implemented? The current combination can be held in an integer array of the same size as the number of bags, where each element represents the index of the chosen number in the corresponding bag. To start with, this array is initialized, say for N = 4, as [0, 0, 0, -1], and is then advanced to produce the arrays (for K = 3) [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 2], [0, 0, 1, 0], and so on till [2, 2, 2, 2]. These two methods can be present in the IterativePermutationCounter class as private methods (implementation as below).
Given the above two implementations, implementation of the algorithm is as follows:
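
A compact C++ sketch of this iterative counting (the index array is advanced like a base-K counter; the class structure from the diagram is omitted and all names are illustrative):

#include <vector>
using namespace std;

// Advance the index array to the next combination; returns false when exhausted.
bool advance(vector<int>& idx, int K) {
    for (int pos = idx.size() - 1; pos >= 0; pos--) {
        if (++idx[pos] < K) return true;
        idx[pos] = 0;          // carry over to the previous position
    }
    return false;
}

long long countGreater(const vector<vector<int>>& bags, long long H) {
    int K = bags[0].size();
    vector<int> idx(bags.size(), 0);
    idx.back() = -1;           // so that the first advance() yields [0, 0, ..., 0]
    long long count = 0;
    while (advance(idx, K)) {
        long long sum = 0;
        for (size_t b = 0; b < bags.size(); b++) sum += bags[b][idx[b]];
        if (sum > H) count++;
    }
    return count;
}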
For the complete code and some other interesting implementations for the problem, you can reach out to the author.


Suffix Arrays

What are Suffix arrays and why are they useful ?

Before learning suffix arrays, I would recommend that you read about "tries" here.
Question: You have a word, for example "blogger", and you want to print all its suffixes in sorted order, or find the kth suffix in lexicographical order.
Naive solution:
In the naive solution you just find all the suffixes and save them in an array of strings (this takes around O(n^2) operations), then sort them in lexicographical order (comparing two suffixes is O(n), so the sort takes O(n^2 log n)). The total complexity is therefore around O(n^2 log n), which is quite large. Then you find the kth suffix in the sorted array.
All suffixes of the word "blogger": blogger, logger, ogger, gger, ger, er, r.
Solution using tries:
In this method we can create a trie of all the suffixes of the word, and for finding the kth suffix in lexicographical order we can pre-process the trie and answer in O(n) (try to figure out how).
Even though the idea of a suffix trie is pleasing, its simplest implementation, in which the suffixes are inserted into the structure one by one, leads to an O(n^2) algorithm. There is a "suffix tree", which can be built in O(n), but a suffix tree takes quite a long code to write; I will talk about suffix trees in future blogs. Here I will talk about another data structure, the suffix array, whose code is quite short to write, which can be built in around O(n log^2 n) time, and whose O(n)-memory implementation needs 3 to 5 times less space than a suffix tree.

Suffix Arrays

A suffix array is a data structure which stores the suffixes of a given string in sorted order. It will be clearer after seeing the diagram given below:


How to build suffix array:

As is clear from the diagram above, we have to sort the suffixes of the string in lexicographic order to create the suffix array. So the big question is: how do we do it?

We will first sort according to the first character of each suffix of the string and store the result in the suffix array, for example:

 Character at suffix start :  b  l  o  g  g  e  r
 Sort index                :  0  3  4  2  2  1  5
Now we will gradually increase the sorting criterion, i.e. from sorting according to a single character to two characters, then four characters, and so on in increasing powers of two, until we reach a length at which we can be sure all the suffixes of the string are themselves sorted (it will be clearer from the diagram below).
So we have m = ceil(log2(n)) steps to create the suffix array, where n is the length of the string. Let us denote each step by y, where 0 <= y <= m. At step y we sort the suffixes according to the first 2^y characters of each suffix; for example, at step 2 we sort the suffixes according to their first 4 (2^2) characters.

How do we implement the above idea optimally?
I'll explain it in a step by step manner.

At Step 0 (y = 0): just sort according to the first character of each suffix and give a sort index (sort index: the position of the suffix in the sorted array) to each suffix accordingly; if the first character is the same for two or more suffixes, give them the same sort index. See the example below.

Step 0:             Sort index
           b        0
           l        3
           o        4
           g        2
           g        2
           e        1
           r        5

At Step 1 (y = 1): we want to sort the suffixes according to their first two characters. To create the sort-index at step 1, take the help of the sort-index created at step 0. Create a pair of indexes in which the first value is the sort-index of the respective suffix from step 0 and the second value is the sort-index of the suffix that starts 1 position later (also from step 0). If the length of the suffix is less than 2 characters, keep the second value as -1. Now sort according to this pair to get the new sort-index. It will be more helpful if you relate it to the diagram given below:
Step 1 (combining the sort-index at i with the sort-index at i + 2^0, i.e. i + 1):
           bl      (0,3)     0
           lo      (3,4)     4
           og      (4,2)     5
           gg      (2,2)     3
           ge      (2,1)     2
           er      (1,5)     1
           r$      (5,-1)    6

Here ‘i’ denotes index of the suffix starting at position ‘i’ in the string.

At Step 2 (y = 2): the same work has to be done at step 2. Here we want to sort according to the first four characters. Create a pair of indexes in which the first is the sort-index of the respective suffix and the second is the sort-index of the suffix that starts 2 positions later, e.g. concatenate bl (1st position) with og (3rd position) to cover the first four characters of that suffix. Do this for all other suffixes and get the sort-index of each position.
Step 2 (combining the sort-index at i with the sort-index at i + 2^1, i.e. i + 2):
           blog    (0,5)     0
           logg    (4,3)     4
           ogge    (5,2)     5
           gger    (3,1)     3
           ger$    (2,6)     2
           er$$    (1,-1)    1
           r$$$    (6,-1)    6

Same will happen for step 3.

To generalize this: when we go from step y to step y + 1, we concatenate the sort-index of the suffix starting at position i with the sort-index at position i + 2^y, so that a length of 2^(y+1) is covered when creating the sort-index at step y + 1.

Code for building Suffix Arrays:
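
A C++ sketch of the O(n log^2 n) construction described above, sorting with std::sort at every doubling step (variable names such as rnk and sa are illustrative):

#include <algorithm>
#include <string>
#include <vector>
using namespace std;

vector<int> buildSuffixArray(const string& s) {
    int n = s.size();
    vector<int> sa(n), rnk(n), tmp(n);
    for (int i = 0; i < n; i++) { sa[i] = i; rnk[i] = s[i]; }   // step 0: rank by first character

    for (int len = 1; ; len *= 2) {
        // pair for suffix i: (rnk[i], rnk[i + len]), with -1 past the end of the string
        auto cmp = [&](int a, int b) {
            if (rnk[a] != rnk[b]) return rnk[a] < rnk[b];
            int ra = a + len < n ? rnk[a + len] : -1;
            int rb = b + len < n ? rnk[b + len] : -1;
            return ra < rb;
        };
        sort(sa.begin(), sa.end(), cmp);

        // re-number: equal pairs get equal sort-indexes
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rnk = tmp;

        if (rnk[sa[n - 1]] == n - 1) break;   // all sort-indexes distinct: fully sorted
    }
    return sa;   // sa[k] = starting index of the k-th smallest suffix
}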

Time complexity: the complexity of building the suffix array this way is O(n*log^2 n).
Now you can find the kth suffix in lexicographical order using the suffix array in O(n) complexity. Suffix arrays are quite helpful in competitive programming because they are fast to code.

1) Compute the longest common prefix(LCP) using suffix arrays.
2) Suffix Arrays are helpful in various string searching and string matching problems.

Questions to practice:

Range Sum Using Segment Tree

What is and why Segment Tree?

Take an example of an array {1, 4, 8, 11, 15, 22}; the task is to find the sum of a range, say (2,6). The naive approach is to run a loop from two to six, giving O(n) time per query, so with q queries it becomes O(n*q). We can bring this down to O(q) if we pre-compute the sums so that sum[i] holds the sum of the first i elements (with sum[0] = 0), which makes the new array sum[] = {0, 1, 5, 13, 24, 39, 61}. Now a range sum query like (3,5) can be computed as just sum[5] – sum[2], giving O(1) per query.

Now, a slight modification of the question: if we can change any particular number in the array at index i, then we have to modify the whole sum array from index i to n. So, if we have alternating queries of changing a value and finding a range sum, the complexity becomes O(n*q). Here the segment tree comes to the rescue: by using a segment tree we can reduce the complexity from O(n*q) to O(q*log(n)), which is a big difference when n is large. Now we have the answer to "why segment tree?".

Segment Tree(Divide and Conquer Solution)

Problem: We have an array arr[0…n – 1] and we have to perform two types of queries:
                 1) Find the sum of elements from index l to r, where 0 <= l <= r <= n – 1.
                 2) Change the value at a specified index: arr[i] = x.
I'll talk about segment trees in three levels: construction, query, and update.

The solution with a segment tree will be:
1) If the range contains a single element, that element itself is stored in the node.
2) Otherwise, divide the range into two smaller ranges, each half the size of the original.
The recursive function follows directly from this and is written out in the construction below.

Level 1: Construction

In the construction of the segment tree I'll use the recursive (top-down) approach: start with the root node, which makes recursive calls to its two children (see the recursive definition above), each storing the sum of a range half the size of its parent's, until we reach the leaf nodes, which store the original elements of the array. The recursion then returns, storing in each parent the sum of its two children; so in a parent node we have the sum of its two children. See the diagram below for more clarity. We use an array named sum[] to store this tree, where index i stores the sum of a node and its children's sums are stored at 2*i and 2*i + 1. It is a full binary tree because we divide each segment into two parts, so an array of size around 2*2^(ceil(log(n))) – 1 is needed to store it. Since it is a binary tree, its height is log(n).

Below is the code for creating the segment tree:
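
A sketch of the construction, assuming the tree is kept in an array sum[] with the root at index 1 and the children of node i at 2*i and 2*i + 1, as described above:

#include <vector>
using namespace std;

// build(node, l, r): node is responsible for arr[l..r]
void build(int node, int l, int r, const vector<int>& arr, vector<long long>& sum) {
    if (l == r) {                                    // leaf: stores a single array element
        sum[node] = arr[l];
        return;
    }
    int mid = (l + r) / 2;
    build(2 * node, l, mid, arr, sum);               // left half
    build(2 * node + 1, mid + 1, r, arr, sum);       // right half
    sum[node] = sum[2 * node] + sum[2 * node + 1];   // parent stores the sum of its children
}
// usage: sum.assign(4 * n, 0); build(1, 0, n - 1, arr, sum);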

Level 2: Query
Now we have created the tree. Since the query operation is a bit involved, I'll explain it with an example. Suppose we have to query for f(0,4). We can see that there is no single node holding exactly this range, but we notice that f(0,4) = f(0,2) + f(3,4), and there are nodes in the segment tree containing those two ranges; so we can conclude that we have a general expression f(x,y) = f(x,z) + f(z + 1,y).

When querying a segment tree, we select a subset of the nodes with the property that the union of their ranges is exactly the interval whose sum we seek. To do so, we start at the root and recurse over nodes whose intervals have at least one element in common with the query interval. Hence, in our example of f(0,4), we notice that both the left and right subtrees contain elements of the query interval, so we recurse on both. The left child of the root serves as a base case (see the figure), since its interval is contained entirely within the query interval; hence it is chosen. At the right child of the root, we notice that its left child has descendants in the query interval (figure) but its right child does not; hence we recurse on the former and not the latter. The former is now another base case; it is chosen, and then we return the sum of the given range.
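
That selection of nodes can be written as a short recursive query; a sketch under the same array layout (a node is used whole when its range lies inside the query, skipped when disjoint, and split otherwise):

#include <vector>
using namespace std;

// sum of arr[ql..qr]; node covers arr[l..r]
long long query(int node, int l, int r, int ql, int qr, const vector<long long>& sum) {
    if (qr < l || r < ql) return 0;              // no overlap with the query range
    if (ql <= l && r <= qr) return sum[node];    // node range fully inside the query
    int mid = (l + r) / 2;
    return query(2 * node, l, mid, ql, qr, sum)
         + query(2 * node + 1, mid + 1, r, ql, qr, sum);
}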
Time complexity: since it is a binary tree, a query takes O(log n). You can verify this yourself.

Level 3: Update
In an update you just have to check in which interval the index lies, choose that interval recursively till you reach the leaf of the tree, and update that leaf. Then, while recursing back, update every node whose range satisfies l <= index <= r (l and r are the range of the node, index is the position being updated). Below is the code for updating the tree.
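
A sketch of that update under the same layout: descend to the leaf for the index, then recompute the sums on the way back up:

#include <vector>
using namespace std;

// set arr[idx] = newValue; node covers arr[l..r]
void update(int node, int l, int r, int idx, int newValue, vector<long long>& sum) {
    if (l == r) {                                    // reached the leaf for idx
        sum[node] = newValue;
        return;
    }
    int mid = (l + r) / 2;
    if (idx <= mid) update(2 * node, l, mid, idx, newValue, sum);
    else            update(2 * node + 1, mid + 1, r, idx, newValue, sum);
    sum[node] = sum[2 * node] + sum[2 * node + 1];   // recompute on the way back
}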
Time complexity: since the height of the tree is log n, the worst case time complexity of an update is O(log n).

Questions for practice:



Knuth Morris Pratt (KMP) algorithm for string searching


You are given two strings – a Text and a Pattern. Determine whether the Pattern exists in the Text or not.
E.g. consider the Text ABCDGBCDLM and the Pattern BCD. The final output should be all the indexes where the Pattern occurs; here they are 1 and 5 (0-based index).

Naive Method

The naive method is very basic and straightforward. For every position in the Text, consider it as the starting position and see if a match with the pattern is found.

For a text and a pattern of small length, this method works fine, giving O(N*M) complexity, where N and M are the lengths of the text and pattern respectively.

The C++ implementation of the naive method is given below…
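
A minimal C++ sketch of the naive method (the function name naiveSearch is illustrative):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// report every index where pattern occurs in text
vector<int> naiveSearch(const string& text, const string& pattern) {
    vector<int> matches;
    for (size_t i = 0; i + pattern.size() <= text.size(); i++) {
        size_t j = 0;
        while (j < pattern.size() && text[i + j] == pattern[j]) j++;
        if (j == pattern.size()) matches.push_back(i);   // full pattern matched at i
    }
    return matches;
}

int main() {
    for (int idx : naiveSearch("ABCDGBCDLM", "BCD"))
        cout << idx << " ";        // prints 1 5
    return 0;
}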

The Knuth-Morris-Pratt Algorithm

If we are able to identify, in a single iteration over the text, all the positions where a match of the pattern ends, our problem is solved: since we know the length of the pattern, it is easy to identify the starting position of each match. The Knuth-Morris-Pratt algorithm works on this principle.

In this algorithm, we apply the concept of an automaton. An automaton is a kind of abstract object: at each step, some information is presented to it, and using this information the automaton moves from its current state to a new state. Whenever the automaton reaches its final state, we have found an end position of a match. The general idea behind the automaton is very simple. Consider this pattern – BBABBB – and let us list all the prefixes of the pattern.

1. Empty string
2. B
3. BB
4. BBA
5. BBAB
6. BBABB
7. BBABBB

Now for every prefix listed above, consider the longest suffix that is also a prefix of it. The suffix should be different from the string itself.

1. Empty string (no suffix for an empty string)
2. Empty string (the suffix B is the same as the string itself)
3. B     (BB) (here B is the suffix that is also a prefix of BB)
4. Empty string (no suffix of BBA is also a prefix)
5. B     (BBAB) (here B is the suffix that is also a prefix of BBAB)
6. BB   (BBABB) (here BB is the suffix that is also a prefix of BBABB)
7. BB   (BBABBB) (here BB is the suffix that is also a prefix of BBABBB)

Let us call a suffix that is also a prefix of a string a "partial match" of that string. We can see that if at some point the pattern matches till BBABB (which is a prefix of the pattern), then there is also a match, ending at the same place, for BB, which is the partial match of BBABB.
For example : 
Let us consider the text BBBABBABBBA. The pattern matches till BBABB if we check from the 2nd character of the text. Since BB is the partial match of BBABB (BB is the largest suffix of BBABB which is also a prefix of it), the text also matches BB ending at the same place. So, if we cannot extend our match of BBABB to the full pattern, we can start matching the pattern again from the suffix BB.

          Index: 0 1 2 3 4 5 6 7 8 9 10
          Text:  B B B A B B A B B B A

Here we wanted the next character of the text to be B to complete our match, but it is A, so we can start checking for the next match from the 4th index of the text, because BB is already a partial match there.

So here two cases arise.
Consider the pattern BBABBB and some text, and suppose till now we have found a match till BBABB, as shown below.

           Text:     B B B A B B ………….. (text continues)
           Pattern:    B B A B B          (match found till BBABB)

Case 1: If the next character of the text is 'B', a full match is found; we can continue checking in a similar manner from the next character of the text to find another match, if there is any.
Case 2: If the next character of the text is anything other than 'B', the match cannot be extended, so we fall back to the largest partial match of the string matched so far (which is BB for BBABB). If the next character of the text then matches, our match extends; if it does not, we fall back to the next partial match of BB, and so on, until the partial match becomes empty, at which point we skip the next character of the text and start a fresh check.

Let us see the above operations of Knuth Morris Pratt algorithm with an example.
Pattern – BBABBB
The automaton for the pattern will look like this…

Step 1: We start by matching the pattern from the first character of the text. Here we find a mismatch at the 3rd character, so we fall back to the partial match of BB, which is B, and start matching from the 2nd character of the text.

Step 2: When checking from the 2nd character, a mismatch is found at the 6th character of the text, so we start again from the partial match of BBABB, which is BB.

Step 3: We find a match this time, at the 5th character of the text.

Now, to find the partial matches of the prefixes of the pattern, we will make an array of length one more than the length of the pattern; let us call it F[]. F[i]
(0 <= i <= pattern_length) contains the length of the longest partial match of the prefix of length i of the pattern.
So here : F[0] = 0
               F[1] = 0
               F[2] = 1
               F[3] = 0
               F[4] = 1
               F[5] = 2
               F[6] = 2
The array F not only contains the largest partial match of the prefix of length i, but implicitly all its partial matches: F[i] is the largest partial match, F[F[i]] is the second largest, F[F[F[i]]] the third largest, and so on.
Let us call F the failure function of the pattern. Below is the function to build the array.
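
A sketch of building F[] as defined above (F has pattern_length + 1 entries; F[i] is the longest partial match of the prefix of length i; names are illustrative):

#include <string>
#include <vector>
using namespace std;

// F[i] = length of the longest proper suffix of the first i characters of the
// pattern that is also a prefix of the pattern.
vector<int> buildFailure(const string& pattern) {
    int m = pattern.size();
    vector<int> F(m + 1, 0);
    for (int i = 2; i <= m; i++) {
        int cand = F[i - 1];
        // fall back to shorter partial matches until the next character extends one
        while (cand > 0 && pattern[cand] != pattern[i - 1]) cand = F[cand];
        F[i] = (pattern[cand] == pattern[i - 1]) ? cand + 1 : 0;
    }
    return F;
}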

Now we have to use the failure function to find matches. Whenever we cannot expand the current partial match, we go to the next best partial match of the current partial match, and so on. If we are unable to expand even the empty string, we just skip this character of the text and go to the next one.
Below is the code to do that.
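
A sketch of the search using buildFailure from the sketch above; matched holds the length of the current partial match:

#include <string>
#include <vector>
using namespace std;

// scan the text once, reporting the start index of every full match
vector<int> kmpSearch(const string& text, const string& pattern) {
    vector<int> F = buildFailure(pattern);     // from the previous sketch
    vector<int> matches;
    int matched = 0;
    for (size_t i = 0; i < text.size(); i++) {
        // cannot extend the current partial match: fall back to the next best one
        while (matched > 0 && text[i] != pattern[matched]) matched = F[matched];
        if (text[i] == pattern[matched]) matched++;
        if (matched == (int)pattern.size()) {
            matches.push_back(i - pattern.size() + 1);   // start index of this match
            matched = F[matched];                        // continue with its partial match
        }
    }
    return matches;
}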

Find Euler path in a graph

Find Euler circuit and path in a graph using Fleury’s algorithm.

In the last post, Graphs: Find bridges in connected graphs, we discussed how to find bridges in an undirected graph. In this post, we will discuss an algorithm which uses bridges to find an Euler path in a graph; the algorithm is called Fleury's algorithm.

As per the definition, an Euler path or circuit in a graph is a path which traverses each edge of the graph exactly once between source and destination. If source and destination are the same, it becomes an Euler circuit. To understand how to check whether an Euler path exists in a graph, please follow this post: Graphs : Euler circuit and Euler path in graphs

Fleury's algorithm to find an Euler path or circuit

  1. Start from the source node; call it the current node u. If there are nodes with odd degree (there can be at most two such nodes), start from one of them. Else start from any node in the graph.
  2. Traverse any edge (u, v) from current node which is not a bridge edge.
  3. Set current as v and go to step 2
  4. When all edges which are not bridge edges are traversed, start traversing bridge edges.
  5. Stop when all nodes are traversed.

Walk through Euler path step by step

1. The graph is as follows; there are two nodes, 1 and 6, which have odd degree, so our starting candidates are either of these. Start from 6.

Initial graph with node 1 and 6 with odd degree

2. Traverse to 5, then to 1, then to 3, followed by 2, one by one, as these are not bridge edges.

3. At 2, there are three edges: (2,1), (2,4) and (2,4). However, if we follow edge (2,1) we will burn the bridge and never be able to come back to the rest of the graph, hence we will not take that edge. So, we move along edge (2,4).

4. At 4 again, we have three edges: (4,2), (4,6) and (4,3). But if we go through (4,2) we will again burn a bridge, as (4,2) is a bridge edge. Hence, we traverse (4,3).

5. At 3, we are left with only one edge and we follow it, which is followed by edge (6,4).
At 4, the bridge edge now has to be followed because there is no other option, and the same goes for edge (2,1).

Hence, Euler path will be 6-5-1-3-2-4-3-6-4-2-1

Code for Fleury's algorithm
Please refer to Graphs : Euler circuit and Euler path in graphs for the details of graph creation and of finding whether an Euler path exists in the graph.
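
As a sketch of Fleury's algorithm (assuming an undirected graph stored as adjacency lists where a removed edge is marked with -1; the bridge test simply compares how many vertices are reachable from u before and after removing the edge; names are illustrative):

#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

vector<vector<int>> adj;                          // adj[u] holds neighbours of u; -1 marks a removed edge

int dfsCount(int u, vector<bool>& visited) {      // number of vertices reachable from u
    visited[u] = true;
    int count = 1;
    for (int v : adj[u])
        if (v != -1 && !visited[v])
            count += dfsCount(v, visited);
    return count;
}

void addEdge(int u, int v) { adj[u].push_back(v); adj[v].push_back(u); }

void removeEdge(int u, int v) {
    *find(adj[u].begin(), adj[u].end(), v) = -1;
    *find(adj[v].begin(), adj[v].end(), u) = -1;
}

// (u, v) may be taken next if it is the only edge left at u, or if it is not a bridge
bool isValidNextEdge(int u, int v) {
    int remaining = count_if(adj[u].begin(), adj[u].end(), [](int x) { return x != -1; });
    if (remaining == 1) return true;

    vector<bool> visited(adj.size(), false);
    int before = dfsCount(u, visited);

    removeEdge(u, v);                             // temporarily remove the edge
    fill(visited.begin(), visited.end(), false);
    int after = dfsCount(u, visited);
    addEdge(u, v);                                // put it back

    return before == after;                       // a bridge would reduce the reachable count
}

void printEulerFrom(int u) {
    for (size_t i = 0; i < adj[u].size(); i++) {
        int v = adj[u][i];
        if (v != -1 && isValidNextEdge(u, v)) {
            printf("%d-%d ", u, v);
            removeEdge(u, v);
            printEulerFrom(v);
        }
    }
}
// usage: call printEulerFrom() from a vertex of odd degree if one exists, otherwise from any vertex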
The complexity of the above algorithm is O((V + E) * (V + E)): for each edge we traverse, we may need to check whether it is a bridge, which takes O(V + E) time. Hence the overall complexity is as given above.

Find bridges in connected graphs

Detect bridge edges in graph

Given a graph, detect all bridges in it. An edge is called a bridge edge if and only if, on removal of that edge, the graph becomes disconnected if it was connected, or the number of connected components increases by one if it was already disconnected. For example, in the below graphs, bridges are shown in red.

Bridges in graphs
The concept of detecting bridges in a graph will be useful in solving the Euler path or tour problem.


Depth First Search of a graph can be used to see whether the graph is connected or not. We can use the same concept: one by one, remove each edge and check whether the graph is still connected using DFS. If yes, the edge is not a bridge edge; if not, it is. However, this method entails quite a high complexity of O(E * (V + E)), where E is the number of edges and V is the number of vertices.

Let's think of something better. Consider that we are looking at edge (u,v) in the graph. Under what condition can we say that it is a bridge edge?
If we can somehow reach node u or any ancestor of u from any node which is a descendant of v, that means the graph is still connected and (u,v) is not a bridge edge. If the above is not possible, then (u,v) is a bridge.

How can we determine that there is no edge from a descendant of v to u or its ancestors? For that we need to maintain the time when each node was discovered during depth first search; call it d[].
d[u] is the time when node u was discovered using DFS. If d[u] < d[v], u was discovered before v.
Below is graph with d[] filled for each node.

DFS traversal order of graph

Now, figure out the lowest d[x] which can be reached from each node. The reason to find it is to see whether there is a node x which is reachable from the children of v and has d[x] less than d[u], i.e. x is an ancestor of u reachable from the children of v. Store the lowest discovery time reachable from node u in an array low[u].

So low[u] = min(low[u], low[v]) for every tree edge (u,v), and low[u] = min(low[u], d[v]) for a back edge (u,v).
The idea here is that if there is no back edge from the subtree of v to u or an ancestor of u, then low[v] will be no smaller than d[v]; if there is a back edge to some ancestor x from the subtree of v, then the minimum d[x] reached by a node of that subtree becomes low[v], and it propagates to low[u].
Below diagram shows calculation of low[] in graph.

Calculation of low[]

Finally, if low[v] > d[u], that means the discovery time of u is less than the least discovery time reachable from the subtree of v, and we have a bridge: there is no way to reach u or any ancestor of u once we disconnect edge (u,v).

In the above diagram, if we remove edge (3,4), the graph is still connected, as low[4] = 2, which is less than d[3] = 3. Hence edge (3,4) is not a bridge.

Lots of theory; let's code it. We will modify the Depth First Search implementation to keep track of d[] and low[].

Code for finding bridges in graph
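
A sketch of the DFS computing d[] and low[] as described above (discovery times start at 1, so d[v] == 0 means not yet visited; names are illustrative):

#include <algorithm>
#include <vector>
using namespace std;

// d[u]  : discovery time of u; low[u] : lowest discovery time reachable from u's subtree
void bridgeDFS(int u, int parent, const vector<vector<int>>& adj,
               vector<int>& d, vector<int>& low, int& timer,
               vector<pair<int,int>>& bridges) {
    d[u] = low[u] = ++timer;
    for (int v : adj[u]) {
        if (v == parent) continue;
        if (d[v] == 0) {                         // tree edge: explore the child first
            bridgeDFS(v, u, adj, d, low, timer, bridges);
            low[u] = min(low[u], low[v]);
            if (low[v] > d[u])                   // nothing in v's subtree reaches u or above
                bridges.push_back({u, v});
        } else {
            low[u] = min(low[u], d[v]);          // back edge to an already discovered vertex
        }
    }
}
// usage: d, low sized V and zero-initialised; call bridgeDFS(s, -1, ...) for every unvisited vertex s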

Complexity Analysis

The complexity of the bridge-finding code is O(V + E), where V is the number of vertices and E is the number of edges in the graph.

Graphs : Euler circuit and Euler path in graphs

To understand basics of graph, please follow  link:  Graphs : Basics and representation
Today we will discuss a typical problem on graphs: finding whether there exists an Euler path or Euler circuit in a graph and, if yes, tracing that path or circuit.

Problem statement

Find whether a connected graph has an Euler path or circuit.
An Euler path is a trail in a graph which visits each edge exactly once from source to destination. For example, in the following graph, the Euler path will be:

Euler Path of a graph

Similarly, an Euler circuit is an Euler path whose source and destination are the same node.


There are two types of graphs, directed and undirected, and for each of them the conditions for having an Euler path or circuit vary a bit. Consider undirected graphs first.
For undirected graphs:
For an undirected graph to have an Euler path, each and every node of the graph should have even degree, except the start and end nodes of the path, which may have odd degree. To understand what degree is, please follow the link at the top which explains the basics of graphs.

To have an Euler circuit, all nodes should have even degree, with no exception.
For directed graphs:
For a directed graph to have an Euler path, at each node the number of incoming edges should be equal to the number of outgoing edges, except at the start node, where the out-degree is one more than the in-degree, and at the end node, where the in-degree is one more than the out-degree.
To have an Euler circuit, all nodes should have in-degree equal to out-degree.
Keep in mind that, for both directed and undirected graphs, the above conditions hold when all nodes with non-zero degree are part of a single (strongly) connected component of the graph. To read more on strongly connected components, please follow this link: Graphs : Connected components

Now, since we know the conditions for an Euler path and circuit, we can write code for it. All we need to do is count the in-degree and out-degree of each node (or just the degree, for an undirected graph) and check the conditions above. If they are met, we are good to say yes; else no Euler path exists.
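
A minimal sketch of that check for the undirected case, assuming the graph is connected and stored as adjacency lists:

#include <vector>
using namespace std;

// returns 2 for an Euler circuit, 1 for an Euler path, 0 for neither
int eulerType(const vector<vector<int>>& adj) {
    int odd = 0;
    for (const auto& neighbours : adj)
        if (neighbours.size() % 2 != 0) odd++;   // count vertices of odd degree
    if (odd == 0) return 2;
    if (odd == 2) return 1;
    return 0;
}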


The complexity of the above code is O(V + E).

Once we have figured out that there is an Euler path, we need to find that path. There are two algorithms for it:

  1. Hierholzer’s algorithm
  2. Fleury’s algorithm
In the next posts, we will discuss these two algorithms and their implementations.


Find N top occurring words in a file

Given a dictionary-like text file, find the n top occurring words in it, i.e. the n words whose counts are the maximum; in other words, find the most frequently occurring words.

For example, if the given file is as follows:
one, two, two, three, three, three, four, four, four, four, five, five, five, five, five, six, six, six, six, six, six – and we are looking for the three most frequent words, the output should be: six, five, four.

Let's solve a problem which involves two concepts: one is hash maps and the other is heaps.


The first step to solve this problem is to read the file. We can create a file object, open the file, and read it using fgets or gets. In this code, we are using C++ to read the file. C code to read a file can be found here.

Second, we need to count the frequency of each word. So when a word is read, we need to increment the count corresponding to that word. Which data structure is best suited for this kind of operation? A hash, of course. A hash gives us constant time complexity to look up a string and increment its count.

Now that we have the list of words and the corresponding frequency of each word, we need to figure out the top n most frequently appearing words.
The rule of thumb says that whenever we have to find the top or bottom N from a collection of items, heaps are the best bet. To find the top N, we will go with a min heap.
This is how it works:
  • Take the first N words appearing in the map, along with their counts, and create a min heap with these N elements.
  • Now, one by one, read each remaining word from the hash and check whether its frequency is more than the minimum in the heap, i.e. the top of the heap.
  • If yes, remove the top and add this word into the heap. If not, continue with the next word.
  • When done with all words in the hash, the min heap will contain the N most frequently occurring words in the file.
Finding the K minimum elements using a min heap is also explained here: Heaps : Find K smallest element from a given array

Implementation notes:

The problem has been solved using the STL in C++, with the map, vector and priority_queue libraries.
Also, a hash map is created to map each string to an object which contains the string and its frequency, to be used for the min heap.
Also, the default priority_queue in the STL gives a max heap; to use it as a min heap, a new comparator is written.

Code to find top n occurring words
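
A sketch of the approach with map and priority_queue, assuming the file contains whitespace-separated words (punctuation handling is omitted and all names are illustrative):

#include <fstream>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <vector>
using namespace std;

typedef pair<int, string> WordCount;              // (frequency, word): pairs compare by frequency first

vector<string> topNWords(const string& fileName, size_t n) {
    // 1. count the frequency of every word
    map<string, int> freq;
    ifstream in(fileName);
    string word;
    while (in >> word) freq[word]++;

    // 2. keep the n most frequent words in a min heap
    priority_queue<WordCount, vector<WordCount>, greater<WordCount>> minHeap;
    for (const auto& entry : freq) {
        minHeap.push(WordCount(entry.second, entry.first));
        if (minHeap.size() > n) minHeap.pop();    // evict the least frequent of the kept words
    }

    vector<string> result;
    while (!minHeap.empty()) { result.push_back(minHeap.top().second); minHeap.pop(); }
    return result;                                // least frequent of the top n comes first
}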

Reading M words from the file requires O(M) time, whereas creating the N-element heap takes O(N).
Scanning through the remaining words takes O((M – N) log N).
Hence the overall complexity to find the top N occurring words in a file is O(M log N).

The same problem can be asked in another way:
Given a file with a million points (x, y), where x and y are the coordinates of a point, find the 100 points closest to the origin.
Just update the hash to store each point and its distance. Then create a max heap with the first 100 points, read the remaining points one by one, and update the max heap accordingly. At the end, the max heap will contain the 100 points closest to the origin.