Merge Sort Algorithm

We can classify sorting algorithms based on their complexity: selection sort, bubble sort and insertion sort have a complexity of O(n^2), while heap sort, merge sort and quicksort (with some assumptions) run in O(n log n), and counting and radix sort are linear, O(n), algorithms.
In this post I will concentrate on one sorting algorithm: merge sort.
Merge sort falls in the divide and conquer class of algorithms, where the input space is divided into subproblems; these subproblems are then solved individually and combined together to give the solution to the original problem. The figure below explains the divide and conquer paradigm.
Divide and conquer
In merge sort, the input space is repeatedly divided into two subproblems until we reach the smallest subproblem; the smallest subproblems are then solved, that is sorted, and combined back up until the entire original input is sorted. The question that arises is: what is the smallest subproblem? The smallest subproblem is the point at which the input cannot be divided further. For an array of integers, that is when there is only one element in the array.
From the above explanation, it is clear that the subproblems are subjected to the same processing as the original input. That rings a bell for recursion, and the base condition to stop the recursion was derived in the previous paragraph: there is only one element in the array.
Now, how do we combine the solutions of two subproblems? This step is the conquer part: sort the smallest parts and combine them with a merge operation. In the merge operation, we scan both sorted arrays and, based on comparisons, put one of the two items into the output array, until both arrays are scanned. If one array finishes before the other, we just copy all the remaining elements of the other array into the output array. Finally, we copy the output array back into the original input array so it can be combined with bigger subproblems until the solution to the original problem is derived.

Below figure shows the merge sort operation.
merge step of merge sort
Let's take an example and work through merge sort, and then we will implement it.
Divide step
Divide step of merge sort
Conquer step
Conquer step of merge sort
Merge sort implementation
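A minimal C++ sketch of the implementation (written as an illustration, not taken from the original post; the function signatures follow the description in the next section):

#include <vector>
using namespace std;

// merges the two sorted halves a[start..mid] and a[mid+1..end]
void merge(vector<int>& a, int start, int mid, int end) {
    vector<int> out;                        // auxiliary array, O(n) extra space
    int i = start, j = mid + 1;
    while (i <= mid && j <= end)
        out.push_back(a[i] <= a[j] ? a[i++] : a[j++]);
    while (i <= mid) out.push_back(a[i++]); // copy whatever is left in either half
    while (j <= end) out.push_back(a[j++]);
    for (int k = start; k <= end; k++)      // copy back into the input array
        a[k] = out[k - start];
}

void mergesort(vector<int>& a, int start, int end) {
    if (start >= end) return;               // base case: at most one element
    int mid = start + (end - start) / 2;
    mergesort(a, start, mid);                // sort left half
    mergesort(a, mid + 1, end);              // sort right half
    merge(a, start, mid, end);               // combine the two sorted halves
}
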
Code explanation
The mergesort function takes three parameters: the input array a[], the start index and the end index of the array. The start and end change in every recursive invocation of the mergesort function.
We find the middle index using the start and end indices of the input array and invoke the same function on two parts: one from start to mid and the other from mid+1 to end.
Once the base condition is hit, we start winding up and call the merge function. The merge function takes four parameters: the input array, the start index, the end index, and the middle index derived from start and end.
The merge function merges the two subarrays (one from start to mid and the other from mid+1 to end) into a single sorted range from start to end. Control then returns to the calling function, which merges progressively larger parts of the array.
Complexity analysis
Every time, the input is divided into two halves, so, as with binary search, the depth of this division is O(log n), where n is the input size. The second part, the merge step, places every element of the current range in its proper place, which is linear, O(n), work per level of division. Since O(n) work is done at each of the O(log n) levels, the total complexity of merge sort is O(n log n).
However, there is one caveat: every time we merge two subarrays, an auxiliary array is needed to temporarily store the elements, and that incurs O(n) space complexity on the merge sort algorithm.

There are some improvements which can be done on this algorithm.
1. When the number of elements is less than some threshold, one can use insertion or selection sort to sort those numbers. Why? Because when n is small, O(n^2) is as good as O(n log n), and it saves the extra overhead of splitting and merging. All in all, using insertion sort for small subarrays can give better performance.

2. Before merging, check whether all the elements in the right subarray are greater than all the elements in the left subarray; if so, there is no need to merge. This can easily be checked by comparing a[mid] with a[mid+1]: if a[mid] is less than or equal to a[mid+1], the two subarrays together are already sorted and we do not need to perform the merge.

External Merge sort

Merge sort is best suited when the data is huge in size compared to the available memory, like sorting 2 GB of data when we have only 100 MB of RAM or physical memory. The data cannot fit in memory and resides on disk, from where it is fetched in chunks and merged.
There are two passes in the external merge sort: the sort pass and the merge pass. The figure below explains this.

external merge sort

Sort pass

  • Divide the input into N chunks, where N = (size of total data) / (size of available memory).
  • Load each chunk into main memory, sort it with any conventional sorting algorithm, and write the sorted chunk back to disk.
  • Now load a predefined block of each sorted chunk into memory again, and keep a buffer to store the sorted output.

Merge Pass

  • Now perform an N-way merge of the loaded blocks and put the output into the buffer. As the buffer gets full, write it out to disk.
  • If the block taken from any chunk is exhausted, load the next block from the same chunk (a sketch of this merge pass follows below).
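For illustration, here is a simplified C++ sketch of the merge pass only, assuming integer records stored one per line and chunk files that the sort pass has already written to disk in sorted order (file names and the function name are hypothetical):

#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// N-way merge of sorted chunk files into one output file using a min-heap
void mergeChunks(const vector<string>& chunkFiles, const string& outFile) {
    vector<ifstream> in;
    for (const string& f : chunkFiles) in.emplace_back(f);

    // heap entry: (value, index of the chunk it came from)
    priority_queue<pair<long long, int>, vector<pair<long long, int>>, greater<>> heap;
    long long v;
    for (int i = 0; i < (int)in.size(); i++)
        if (in[i] >> v) heap.push({v, i});     // prime the heap with one value per chunk

    ofstream out(outFile);
    while (!heap.empty()) {
        auto [val, idx] = heap.top(); heap.pop();
        out << val << "\n";                    // in practice this goes through an output buffer
        if (in[idx] >> v) heap.push({v, idx}); // refill from the chunk just consumed
    }
}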

External merge sort can be improved significantly using parallelism, where data chunks are written to different disks so that reads and writes can be performed in parallel. Also, different chunks can be sorted on different processors in parallel. If extra processors are not available, the merge routine can take advantage of multithreading.
There is a classic problem which uses external merge sort; it is mentioned here as problem number 5 and solved here: External merge sort.

If you have any other optimization trick or better way to implement merge sort please comment below.

Binary Indexed Trees (Fenwick Tree)

Why Binary Indexed Tree?

Consider an array A : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; you have to find the sum of a range, say (1, 5). One way of doing this is to keep the cumulative sum of elements from index 0 up to index i, for every 0 <= i < n. The array of cumulative sums is sum : {1, 3, 6, 10, 15, 21, 28, 36, 45, 55, 66, 78, 91, 105, 120, 136}, and the sum of a range (l, r) is simply sum[r] − sum[l − 1] (with sum[−1] taken as 0), so for (1, 5) it is sum[5] − sum[0]. This takes O(n) precomputation time and O(q) for q queries, with O(1) per query.

Let us modify the question now: suppose another type of query modifies the value at some index i. That forces us to recompute the cumulative sums from index i to n − 1, so with q queries the complexity becomes O(q*n). Segment trees can be used to solve this in O(q*log(n)) (refer to the post on segment trees).

Coding a segment tree can be a lengthy and hectic process, segment trees require a lot of memory, and debugging segment tree code is difficult. Another approach to the above problem is the Binary Indexed Tree data structure, which also has O(q*log(n)) complexity, but BITs (Binary Indexed Trees) are much easier to code and require much less memory than segment trees. Binary Indexed Trees are also called Fenwick trees.

Representation of Binary Indexed Tree


Consider an input array A : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}. A Binary Indexed Tree (Fenwick tree) is represented using an array of size n, where n is the length of the input array. Let's call the binary indexed tree of this array "tree[n]". tree[idx] (where idx is some index of the BIT) stores the sum of values of the input array from index (idx − 2^r + 1) to index idx. Here, r is the position of the last set bit (the least significant set bit) in the binary representation of idx.

So, for example, for idx = 6 the binary representation is 0110, and the last set bit is at position 1 (0-based, counting from the least significant bit), which makes r = 1. Therefore tree[6] stores the sum from index 5 to index 6.
The diagram below shows value of r for every index from 1 to 16. The color of rth index is changed for better understanding.


Binary Indexed Tree

Therefore, tree[12] = A[9] + A[10] + A[11] + A[12]. To calculate tree[idx], we can store the cumulative sums of the input array in an array sum, where sum[idx] is the sum of the first idx elements, and then subtract sum[idx − 2^r] from sum[idx]. This gives the value of tree[idx] in O(1).
So, tree[idx] = sum[idx] − sum[idx − 2^r].

How to Find the value of the last set bit?


Let num be an integer whose last set bit we want to isolate. Note that num can be represented in the form a1b, where a represents the series of bits before last set bit and b represents all the zeros after the last set bit.

The integer (−num) can be found using the 2's complement of num, which is obtained by adding 1 to the bitwise inverse of num. The expression for the two's complement is as follows:

(a1b)¯ + 1 = a¯0b¯ + 1.

Since b consists of all zeroes, b¯ consists of all ones.
Therefore, finally we have

-num = (a1b)¯ + 1 = a¯0b¯ + 1 = a¯0(0…0)¯ + 1 = a¯0(1…1) + 1 = a¯1(0…0) = a¯1b

Now we can easily isolate the last set bit, using the bitwise AND of num and −num:

          num  =   a 1 b
     &  (−num) =   a¯1 b
     ——————————————————————
               =  (0…0)1(0…0)


So, to calculate (idx − 2^r), i.e. to strip the last set bit from idx (the range covered by tree[idx] starts at idx − 2^r + 1), we just have to do the following operation:

idx = idx – (idx & -idx);

Construction of Binary Indexed Tree 


For every index "idx", tree[idx] is calculated in O(1) using the expression tree[idx] = sum[idx] − sum[idx − 2^r], where sum[idx] stores the cumulative sum of the first idx elements of the input array.


The code is shown below.
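A minimal C++ sketch of this construction (illustrative; the input array a is 1-based, with a[1..n] holding the data):

#include <vector>
using namespace std;

// builds the BIT in O(n): tree[idx] = sum[idx] - sum[idx - 2^r],
// where sum[i] is the cumulative sum of the first i elements
vector<int> buildTree(const vector<int>& a) {            // a[0] is unused
    int n = (int)a.size() - 1;
    vector<int> sum(n + 1, 0), tree(n + 1, 0);
    for (int i = 1; i <= n; i++) sum[i] = sum[i - 1] + a[i];
    for (int idx = 1; idx <= n; idx++)
        tree[idx] = sum[idx] - sum[idx - (idx & -idx)];  // idx & -idx is 2^r
    return tree;
}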


Sum Between Two Indices :


To calculate the sum between two given indices l and r, we calculate the cumulative sum up to index r, then the cumulative sum up to index l − 1, and take the difference of the two results.


Let us consider the example of index 13. To calculate the cumulative sum up to index 13, the tree array plays the major role: we know that tree[13] stores the value at index 13 only, tree[12] stores the sum from index 9 to index 12, and tree[8] stores the sum from index 1 to index 8. So adding tree[8] + tree[12] + tree[13] gives us the cumulative sum up to index 13.

sum up to 13 = tree[13] + tree[12] + tree[8]

sum up to 1101 = tree[1101] + tree[1100] + tree[1000]                   (binary representation)


Note that the complexity of calculating the cumulative sum up to index idx is O(log(idx)).

The diagram below illustrates this.


Binary Indexed Tree


The code to find the cumulative sum up to index idx is shown below.
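A minimal C++ sketch of the read operation (illustrative, continuing the buildTree sketch above with the same 1-based indexing):

// returns the cumulative sum of positions 1..idx
int read(const vector<int>& tree, int idx) {
    int s = 0;
    while (idx > 0) {
        s += tree[idx];
        idx -= idx & -idx;   // strip the last set bit, jumping to the previous range
    }
    return s;
}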


Update Value at some position and update BIT : 


If the value at some index idx is increased by some value, say val, then we have to update the tree at all those places which are affected by this index.
For example, if the value at index 9 is changed, then tree[9], tree[10], tree[12], tree[16], and so on, will change, because

tree[10] = A[9] + A[10];
tree[12] = A[9] + A[10] + A[11] + A[12];

While we were reading the sum, we kept removing the last set bit from the index until it became zero. While updating the tree, we instead keep adding the last set bit to the index (idx = idx + (idx & -idx)) until the index becomes greater than the length of the array.


Below is the code to do that.
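A minimal C++ sketch of the update operation (illustrative, continuing the sketch above):

// adds val at position idx (1-based) and fixes every tree node that covers idx
void update(vector<int>& tree, int idx, int val) {
    int n = (int)tree.size() - 1;
    while (idx <= n) {
        tree[idx] += val;
        idx += idx & -idx;   // add the last set bit to move to the next covering node
    }
}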

Binary Indexed Trees are easy to code, the code is very short, and they should be used wherever possible.

The Links of some practice problems are given below : 
http://www.spoj.com/problems/UPDATEIT/
http://www.spoj.com/problems/CTRICK/

Counting permutations : Programming as an art

I am writing this blog on algorithms to lay emphasis not only on the solution to a problem, but also on some other aspects which get completely ignored when writing implementations of algorithms. To start with, let's first look at the question:
Implement a program that finds the total number of ways of picking one number from each of N bags (each bag containing K numbers) such that the sum of the picked numbers is greater than a given threshold H, where K, N > 1. Assume that the program is invoked on the command line, with N, K, input-file-name and H as parameters to the program, and outputs the result to its standard output.

Brute Force Algorithm

To start with, let us look at how we would solve this problem in the simplest of manners. If we could somehow iterate through all the permutations, we would simply add the numbers selected in the current permutation, compare the sum with the threshold, and increment the count if it is greater.
The algorithm is simple and can be written as follows:
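A minimal C++ sketch of this brute force (illustrative only; the bag contents, threshold and function name below are made up for the example):

#include <iostream>
#include <vector>
using namespace std;

// recurse over the bags, trying every number from each bag
long long countPicks(const vector<vector<int>>& bags, size_t bag, long long sum, long long H) {
    if (bag == bags.size()) return sum > H ? 1 : 0;   // one complete permutation evaluated
    long long count = 0;
    for (int x : bags[bag])
        count += countPicks(bags, bag + 1, sum + x, H);
    return count;
}

int main() {
    // hypothetical input: N = 2 bags of K = 3 numbers, threshold H = 10
    vector<vector<int>> bags = {{1, 5, 9}, {2, 6, 7}};
    cout << countPicks(bags, 0, 0, 10) << "\n";       // prints 5 for this data
}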

The complexity of this simple algorithm is O(N·K^N), since there would be K^N permutations and for each one of them the sum of the N numbers in that permutation needs to be found.

Think before you Program

I had given this as an exercise to some students. I am reproducing here, without prejudice, one of the answers, to highlight some of the other aspects that were completely ignored.

The biggest problem with the above code, I think, is locating where the algorithm is implemented. The problem is simple, so it should not be difficult to find the real algorithm in the program; yet, even though the implementation is recursive and in essence does exactly what the algorithm written above does (with some optimization), it is spread across multiple places in the program: partly in the recursive method and partly in the main method.
Consider the situation where I have to use this algorithm for some analytical reasoning in a business application and have asked for a POC to be done. Using the above implementation would probably be difficult, since the program does not separate concerns by putting logically similar concerns together and keeping the dissimilar ones apart with loose coupling.
Another big problem here is the field 'count' being held in a class-scoped variable. This is very dangerous, considering that one would want to calculate the count multiple times in parallel in production.
One concern an architect would have is being able to use different algorithms in different situations, which requires a proper model of the problem. Since there is none, the architect would have to sit down and do the modeling afresh.

Art or Science

Now, the question is whether the aspects that were completely missed in the above solution are so difficult that they had to be ignored. So let's put some thought into it and list the high-level concerns in the problem (I think this is known as a CRC [class responsibility collaboration] card). At a very high level, I see the following concerns:

  1. Bag containing the numbers – I intend to model it via an interface Bag which is responsible for the following:
    1. Count of numbers in the bag
    2. Some kind of iterator through the numbers in the bag
    3. Min and max numbers in the bag
    4. Mechanism to hold the numbers in a particular order as desired by the algorithm
    5. A self-cloning, self-factory default implementation
  2. Algorithm to find the count – I intend to model it via an interface PermutationCounter which is responsible for the following:
    1. Name of the algorithm
    2. A method to calculate the count for a given array of bags and the threshold
  3. Take the problem input from the screen/file and invoke the algorithm – this will be done by the main method

Simple, was it not? The advantage of this approach is that the responsibilities are clearly demarcated, and it is anybody's guess which parts can easily be reused: the Bag and PermutationCounter, along with all their implementations, can be reused. Now, let's look at the adjacent diagram, which captures the class hierarchy and how it is to be used by the client (for the scope of this problem, the main method). In practice, one can experiment and decide on the strategy to follow for using the implementations.

Now let me return to the point where I left the iterative algorithm for this problem. How are the aspects present in the algorithm mentioned at the start actually implemented? The current permutation can be held in an integer array of the same size as the number of bags, where each element represents the index of the chosen number in the corresponding bag. To start with, this array is initialized, say for N = 4, as [0, 0, 0, -1], and then manipulated to produce the arrays (for K = 3) [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 2], [0, 0, 1, 0], and so on till [2, 2, 2, 2]. These two methods can be present in the IterativePermutationCounter class as private methods (implementation as below).
Given the above two methods, the implementation of the algorithm is as follows:
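As an illustration only, a compact C++ sketch of this index-array iteration (the names advancePick and countPicks are hypothetical, and the counter here simply starts at [0, 0, ..., 0] instead of [0, 0, 0, -1]):

#include <vector>
using namespace std;

// advances pick[] like a base-K counter; returns false when all permutations are done
bool advancePick(vector<int>& pick, int k) {
    for (int i = (int)pick.size() - 1; i >= 0; i--) {
        if (++pick[i] < k) return true;
        pick[i] = 0;                       // carry over into the next bag
    }
    return false;
}

long long countPicks(const vector<vector<int>>& bags, long long H) {
    int k = (int)bags[0].size();
    vector<int> pick(bags.size(), 0);      // pick[i] = index chosen in bag i
    long long count = 0;
    do {
        long long sum = 0;
        for (size_t i = 0; i < bags.size(); i++) sum += bags[i][pick[i]];
        if (sum > H) count++;
    } while (advancePick(pick, k));
    return count;
}
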
For the complete code and some interesting implementations for this problem, you can ask the author.

Suffix Arrays

What are Suffix arrays and why are they useful ?

Before learning suffix arrays I would recommend you to read about “tries” here http://algorithmsandme.blogspot.in/2014/03/tries-basics.html.
Question: You have a word, for example "blogger", and you want to print all its suffixes in sorted order, or find the kth suffix in lexicographical order.
Naive solution:
In the naive solution you just find all the suffixes and save them in an array of strings (this takes about O(n^2) operations), then sort them in lexicographical order (this takes O(n^2 log n), since comparing two suffixes is O(n)), so the total complexity is around O(n^2 log n), which is quite large. Then you find the kth suffix in the sorted array.
All suffixes of word “blogger”:
blogger
logger
ogger
gger
ger
er
Solution using tries:
In this method we create a trie of all the suffixes of the word, and for finding the kth suffix in lexicographical order we can preprocess the trie and find it in O(n) (try to figure out how).
Even though the idea of a suffix trie is very pleasing, its simplest implementation, in which the suffixes of the string are inserted into the structure one by one, leads to an O(n^2) algorithm. There is a "suffix tree", which can be built in O(n), but a suffix tree takes quite a lot of code to write; I will talk about suffix trees in future posts. Here I will talk about another data structure, the suffix array, whose code is quite short to write, which can be built in about O(n log^2 n), and whose O(n)-memory implementation needs 3 to 5 times less memory than a suffix tree.

Suffix Arrays

A suffix array is a data structure that stores the suffixes of a given string in sorted order. It will be clearer after seeing the diagram given below:

SUFFIX ARRAY


How to build suffix array:

As is clear from the diagram above, we have to sort the suffixes of the string in lexicographic order to create the suffix array. So the big question is: how do we do it?

We will first sort according to the first character of each suffix and assign each position its sort index, for example:

 Characters  –   b  l  o  g  g  e  r
 Sort index  –   0  3  4  2  2  1  5
Now, we will gradually increase the sorting criterion, i.e. from sorting according to a single character to two characters, then four characters, and so on in increasing powers of two, until we reach a length at which we can be sure all the suffixes of the string are themselves sorted (it will be clearer from the diagram below).
So we have m = ceil(log2(n)) steps to create the suffix array, where n is the length of the string. Let us denote each step by y, where 0 <= y <= m. At step y we sort the suffixes according to the first 2^y characters of each suffix; for example, at step 2 we sort the suffixes according to their first 4 (2^2) characters.

How to optimally implement the above stated condition?
I'll explain it in a step-by-step manner.

At step 0 (y = 0): just sort according to the first character of each suffix and give each suffix a sort index (sort index: the position of the suffix in the sorted array); if the first character is the same for two or more suffixes, give them the same sort index. See the example below.

Step 0:               Sort-index                        
           b                 0                                              
           l                  3                                            
           o                 4                                             
           g                 2                                            
           g                 2                                              
           e                 1                                             
           r                 5                                             
  

At step 1 (y = 1): we want to sort the suffixes according to their first two characters. To create the sort index at step 1, take the help of the sort index created at step 0. Create a pair of indexes in which the first value is the sort index of the suffix itself from step 0 and the second value is the sort index (again from step 0) of the suffix that starts 1 position later. If the length of the suffix is less than 2 characters, keep the second value as -1. Now sort according to these pairs to get the new sort index. It will be more helpful if you relate it to the diagram given below:
Step 1 (0 – 1) (pairing the sort index at i with the one at i + 2^0):
                                                    Sort-index
           bl                  (0,3)                    0      
           lo                  (3,4)                    4
           og                 (4,2)                    5 
           gg                 (2,2)                    3
           ge                 (2,1)                     2 
           er                 (1,5)                     1
           r$                 (5,-1)                    6

Here ‘i’ denotes index of the suffix starting at position ‘i’ in the string.

At step 2 (y = 2): the same work has to be done. Here we want to sort according to the first four characters. Create a pair of indexes in which the first value is the sort index of the suffix itself (from step 1) and the second is the sort index of the suffix that starts 2 positions later; for example, pair bl (the suffix at position 0) with og (the suffix at position 2) to cover the first four characters of the suffix starting at position 0. Do this for all the other suffixes and get the sort index of each position.
Step 2 (1 – 2) (pairing the sort index at i with the one at i + 2^1):
                                                  Sort-index
blog           (0,5)                             0
logg           (4,3)                             4        
ogge          (5,2)                             5
gger          (3,1)                             3
ger$          (2,6)                             2
er$$          (1,-1)                            1
r$$$          (6,-1)                            6

Same will happen for step 3.

So, to generalize this: going from step y to step y + 1, we pair the sort index of the suffix starting at position i with the sort index at position i + 2^y, which ranks the suffixes by their first 2^(y+1) characters and gives the sort index for step y + 1.

Code for building Suffix Arrays:
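A C++ sketch of the prefix-doubling construction described above (illustrative, not the author's original code; O(n log² n)):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// at step y, suffixes are ranked by their first 2^y characters
vector<int> buildSuffixArray(const string& s) {
    int n = (int)s.size();
    vector<int> sa(n), rank_(n), tmp(n);
    for (int i = 0; i < n; i++) { sa[i] = i; rank_[i] = s[i]; }

    for (int len = 1; ; len *= 2) {
        // pair for suffix i: (rank of its first len chars, rank of the suffix starting len later)
        auto cmp = [&](int a, int b) {
            if (rank_[a] != rank_[b]) return rank_[a] < rank_[b];
            int ra = a + len < n ? rank_[a + len] : -1;
            int rb = b + len < n ? rank_[b + len] : -1;
            return ra < rb;
        };
        sort(sa.begin(), sa.end(), cmp);

        tmp[sa[0]] = 0;                              // assign new sort indexes
        for (int i = 1; i < n; i++)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank_ = tmp;
        if (rank_[sa[n - 1]] == n - 1) break;        // all suffixes already distinct
    }
    return sa;                                       // sa[k] = start of the k-th smallest suffix
}

int main() {
    for (int pos : buildSuffixArray("blogger"))
        cout << pos << " ";                          // prints 0 5 4 3 1 2 6
    cout << "\n";
}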

Time complexity: the complexity of building the suffix array this way is O(n·log² n).
Now you can find the kth suffix in lexicographical order using the suffix array in O(n). Suffix arrays are quite helpful in competitive programming because they are fast to code.

Application:
1) Compute the longest common prefix(LCP) using suffix arrays.
2) Suffix Arrays are helpful in various string searching and string matching problems.

Questions to practice:
Easy:
http://www.spoj.com/problems/SARRAY/
http://www.codechef.com/problems/TASTR
Medium:
http://www.codechef.com/problems/SSTORY
http://www.spoj.com/problems/DISUBSTR/

Range Sum Using Segment Tree

What is a Segment Tree, and why use it?

Take an example array {1, 4, 8, 11, 15, 22}, and the task is to find the sum of a range, say (2, 6) (positions here are 1-based). The naive approach is to run a loop from 2 to 6, giving O(n) time per query, so with q queries it becomes O(n*q). We can answer each query in O(1) (O(q) for all queries) if we precompute the prefix sums, so that sum[i] holds the sum of the first i elements. This makes the new array sum[] = {0, 1, 5, 13, 24, 39, 61}, and a range-sum query like (3, 5) is computed simply as sum[5] − sum[2].

Now a slight modification of the question: if we can change any particular number in the array at index i, then we have to modify the whole prefix-sum array from index i to n, so with alternating update and range-sum queries the complexity becomes O(n*q). Now the segment tree comes to the rescue: by using a segment tree we can reduce the complexity from O(n*q) to O(q*log(n)), which is a big difference when n is large. That answers why we need a segment tree.

Segment Tree(Divide and Conquer Solution)

Problem: We have an array arr[0 … n − 1] and we have to perform two types of queries:
                 1) Find the sum of elements from index l to r, where 0 <= l <= r <= n − 1
                 2) Change the value at a specified index: set arr[i] to a new value.
I'll talk about the segment tree at three levels:

Structure
The segment tree is built as follows:
1) If the range contains a single element, the element itself will be stored in that node.
2) Otherwise, divide the range into two smaller ranges, each half the size of the original, and store in the node the sum of its two children.
The recursive relation can therefore be written as: sum over [l, r] = sum over [l, mid] + sum over [mid + 1, r], where mid = (l + r) / 2.
Representation:

Level 1: Construction

In the construction of the segment tree I'll use the recursive (top-down) approach: start with the root node, which makes recursive calls to its two children (see the recursive relation above); each child stores the sum of a range half the size of its parent's, until we reach the leaf nodes, which store the original elements of the array. The recursion then unwinds, storing in each parent node the sum of its two children. See the diagram below for more clarity. We will use an array named sum[] to store this tree, where index i stores the sum for a node and the sums of its children are stored at 2*i and 2*i + 1. It is a full binary tree, because we always divide a segment into two parts, so an array of size around 2*2^ceil(log2(n)) is required to store it. Since it is a binary tree, its height is O(log n).

I added the code for creating the segment tree:  
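A minimal C++ sketch of the build step (illustrative; sum[] is sized 4*n for simplicity, which always covers 2*2^ceil(log2(n))):

#include <vector>
using namespace std;

// sum[node] stores the sum of arr[l..r]; children live at 2*node and 2*node + 1
void build(const vector<int>& arr, vector<long long>& sum, int node, int l, int r) {
    if (l == r) {                                  // leaf: a single array element
        sum[node] = arr[l];
        return;
    }
    int mid = (l + r) / 2;
    build(arr, sum, 2 * node, l, mid);             // left half
    build(arr, sum, 2 * node + 1, mid + 1, r);     // right half
    sum[node] = sum[2 * node] + sum[2 * node + 1]; // parent holds the sum of its children
}

// usage sketch: vector<long long> sum(4 * n); build(arr, sum, 1, 0, n - 1);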

Level 2: Query
Now we have created the tree. Since the query operation is quite complex, I'll explain it with an example. Suppose we have to answer the query f(0,4). We can see that there is no single node holding exactly this range, but we notice that f(0,4) = f(0,2) + f(3,4), and there are nodes in the segment tree containing those two ranges, so we can conclude the general expression f(x,y) = f(x,z) + f(z + 1, y).

When querying a segment tree, we select a subset of the nodes with the property that the union of their ranges is exactly the interval whose sum we seek. To do so, we start at the root and recurse over nodes whose intervals have at least one element in common with the query interval. Hence, in our example of f(0,4), we notice that both the left and right subtrees contain elements of the query interval, so we recurse on both. The left child of the root serves as a base case (see the figure), since its interval is contained entirely within the query interval; hence it is chosen. At the right child of the root, we notice that its left child has descendants in the query interval (see the figure), but its right child does not; hence we recurse on the former and not the latter. The former is now another base case; it is chosen, and then we return the sum of the given range.
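A sketch of the query function, consistent with the build sketch above (illustrative):

// sum of arr[ql..qr]; the current node covers [l..r]
long long query(const vector<long long>& sum, int node, int l, int r, int ql, int qr) {
    if (qr < l || r < ql) return 0;               // disjoint: contributes nothing
    if (ql <= l && r <= qr) return sum[node];     // fully contained: base case, take node value
    int mid = (l + r) / 2;                        // partial overlap: recurse on both halves
    return query(sum, 2 * node, l, mid, ql, qr)
         + query(sum, 2 * node + 1, mid + 1, r, ql, qr);
}
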
Time complexity: since it is a binary tree, a query takes O(log n). You can check this yourself.

Level 3: Update
For an update, you just check in which interval the index lies, choose that interval recursively till you reach the corresponding leaf of the tree, and update that leaf. Then, while recursing back, you update every node whose range contains the index, i.e. every node with l <= index <= r (where l and r are the range of the node and index is the position being updated). Below is the code for updating the tree.
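A sketch of the update function, continuing the same sketch (illustrative; it sets arr[idx] to val and repairs the sums on the way back up):

// point update: set position idx to val; the current node covers [l..r]
void update(vector<long long>& sum, int node, int l, int r, int idx, long long val) {
    if (l == r) {                                  // reached the leaf for idx
        sum[node] = val;
        return;
    }
    int mid = (l + r) / 2;
    if (idx <= mid) update(sum, 2 * node, l, mid, idx, val);
    else            update(sum, 2 * node + 1, mid + 1, r, idx, val);
    sum[node] = sum[2 * node] + sum[2 * node + 1]; // recompute every node on the path back up
}
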
Time complexity: since the height of the tree is O(log n), the worst-case time complexity for an update is O(log n).

Questions for practice:
Easy:
http://www.spoj.com/problems/GSS1/
http://www.spoj.com/problems/KGSS/
http://www.spoj.com/problems/HORRIBLE/

Medium:
http://www.codechef.com/problems/QSET
http://www.codechef.com/problems/FLIPCOIN

Knuth Morris Pratt (KMP) algorithm for string searching

Problem

You are given two strings – a Text and a Pattern. Determine whether the Pattern exists in the Text or not.
E.g. consider the text ABCDGBCDLM and the pattern BCD. The final output should be all the indexes where the pattern occurs; here they are 1 and 5 (0-based index).

Naive Method

The naive method is very basic and straightforward. For every position in the Text, consider it as the starting position and see if a match with the pattern is found.

For a text and a pattern of small length this method works well, giving O(N*M) complexity, where N and M are the lengths of the text and the pattern respectively.

The C++ implementation of the naive method is given below…
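A minimal C++ sketch of the naive method (illustrative):

#include <iostream>
#include <string>
#include <vector>
using namespace std;

// try every starting position in the text: O(N*M)
vector<int> naiveSearch(const string& text, const string& pattern) {
    vector<int> matches;
    for (int i = 0; i + (int)pattern.size() <= (int)text.size(); i++) {
        int j = 0;
        while (j < (int)pattern.size() && text[i + j] == pattern[j]) j++;
        if (j == (int)pattern.size()) matches.push_back(i);  // full match starting at i
    }
    return matches;
}

int main() {
    for (int idx : naiveSearch("ABCDGBCDLM", "BCD"))
        cout << idx << " ";                                  // prints 1 5
    cout << "\n";
}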

The Knuth-Morris-Pratt Algorithm

If we are able to identify, in a single pass over the text, all the positions where a match of the pattern ends, our problem is solved: since we know the length of the pattern, it is easy to identify the starting position of each match. The Knuth-Morris-Pratt algorithm works on this principle.

In this algorithm, we apply the concept of an automaton. An automaton is a kind of abstract machine: at each step some information is presented to it, and using this information it moves from its current state to a new state. Whenever the automaton reaches its final state, we have found the end position of a match. The general idea behind the automaton is very simple; consider this pattern – BBABBB. Let us list all the prefixes of the pattern.

1. Empty string
2. B
3. BB
4. BBA
5. BBAB
6. BBABB
7. BBABBB

Now, for every prefix listed above, consider the longest suffix that is also a prefix of it. The suffix must be different from the string itself (a proper suffix).

1. Empty string (no suffix exists for an empty string)
2. Empty string (the only suffix is B, but it is the same as the string itself)
3. B (for BB, B is the suffix that is also a prefix)
4. Empty string (no proper suffix of BBA is also a prefix)
5. B (for BBAB, B is the suffix that is also a prefix)
6. BB (for BBABB, BB is the suffix that is also a prefix)
7. BB (for BBABBB, BB is the suffix that is also a prefix)

Let us call a suffix that is also a prefix of a string a "partial match" of that string. We can see that if at some point the pattern matches up to BBABB (which is a prefix of the pattern), then there is also a match of BB, the partial match of BBABB.
For example : 
For example, consider the text BBBABBABBBA. The pattern matches up to BBABB if we start checking from the 2nd character of the text. Since BB is a partial match of BBABB (BB is the largest suffix of BBABB that is also a prefix of it), the text also matches BB at that point. So, if we cannot extend our match of BBABB to the full pattern, we can restart matching the pattern from its prefix BB instead of starting from scratch.
         
          Indexes : 0 1 2 3 4 5 6 7 8 9 10
          Text    : B B B A B B A B B B A
          Pattern :   B B A B B B
Here we wanted the next character of the text to be B to complete our match, but it is A, so we can continue checking for the next match from the 4th index of the text, because BB is already a partial match there.

So here two cases arise.
Consider the pattern BBABBB and some text, and suppose we have found a match till BBABB, as shown below.

           BBBABB…………..(text continues)
            BBABB           (match found till BBABB)

Case 1: If the next character of the text is 'B', a full match is found; we can then continue checking in a similar manner, starting from the next character of the text, to find another match if there is any.
Case 2: If the next character of the text is anything other than 'B', the match fails, so we fall back to the largest partial match of the string matched so far (which is BB for BBABB). If the next character of the text then matches, our match extends; if it does not, we fall back to the next partial match of BB, and so on, until the partial match becomes empty. If even then the character does not match, we skip it and start a fresh check from the next character of the text.

Let us see the above operations of Knuth Morris Pratt algorithm with an example.
Text – BBBABBABBBA
Pattern – BBABBB
The automaton for the pattern will look like this…


Step 1: we start by matching the pattern from the first character of the text. We find a mismatch at the 3rd character, so we fall back to the partial match of BB, which is B, and continue matching from the 2nd character of the text.


Step 2: when checking from the 2nd character, a mismatch is found at the 7th character of the text, so we fall back to the partial match of what has matched so far (BBABB), which is BB.


Step 3: this time we find a full match, starting at the 5th character of the text.


Now, to find the partial matches of the prefixes of the pattern, we build an array of length pattern_length + 1; let us call it F[]. F[i] (0 <= i <= pattern_length) contains the length of the longest partial match of the prefix of the pattern of length i.
So here : F[0] = 0
               F[1] = 0
               F[2] = 1
               F[3] = 0
               F[4] = 1
               F[5] = 2
                  F[6] = 2
The array F not only gives the largest partial match for each prefix length i, it implicitly gives all of its partial matches: F[i] is the largest partial match, F[F[i]] is the second largest, F[F[F[i]]] the third largest, and so on.
Let us call F the failure function of the pattern. Below is the function to build the array.
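A C++ sketch of building the failure function, indexed by prefix length as in the table above (illustrative):

#include <string>
#include <vector>
using namespace std;

// F[i] = length of the longest proper suffix of the length-i prefix that is also a prefix of it
vector<int> buildFailure(const string& pattern) {
    int m = (int)pattern.size();
    vector<int> F(m + 1, 0);
    for (int i = 2; i <= m; i++) {
        int k = F[i - 1];                         // try to extend the previous partial match
        while (k > 0 && pattern[k] != pattern[i - 1])
            k = F[k];                             // fall back to the next smaller partial match
        if (pattern[k] == pattern[i - 1]) k++;
        F[i] = k;
    }
    return F;
}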

Now we use the failure function to find matches: whenever we cannot extend the current partial match, we fall back to the next best partial match of the current one, and so on. If we cannot extend even the empty string, we simply skip this character of the text and go on to the next one.
Below is the code to do that.
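A C++ sketch of the search, using buildFailure from the sketch above (illustrative):

// scans the text once and reports every index where the pattern starts
vector<int> kmpSearch(const string& text, const string& pattern) {
    vector<int> F = buildFailure(pattern);
    vector<int> matches;
    int k = 0;                                    // length of the current partial match
    for (int i = 0; i < (int)text.size(); i++) {
        while (k > 0 && pattern[k] != text[i])
            k = F[k];                             // cannot extend: fall back
        if (pattern[k] == text[i]) k++;           // extend the match by one character
        if (k == (int)pattern.size()) {
            matches.push_back(i - k + 1);         // a full match ends at position i
            k = F[k];                             // keep looking for overlapping matches
        }
    }
    return matches;
}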


Find Euler circuit and path in a graph using Fleury’s algorithm.

In the last post, Graphs: Find bridges in connected graphs, we discussed how to find bridges in an undirected graph. In this post, we discuss an algorithm which uses bridges to find an Euler path in a graph; the algorithm is called Fleury's algorithm.

As per the definition, an Euler path in a graph is a path which traverses each edge of the graph exactly once between a source and a destination. If the source and destination are the same, it becomes an Euler circuit. To understand how to check whether an Euler path exists in a graph, please follow this post: Graphs : Euler circuit and Euler path in graphs.

Fleury's algorithm to find an Euler path or circuit

  1. Start from the source node; call it the current node u. If there are nodes with odd degree (there can be at most two such nodes), start from one of them; otherwise start from any node in the graph.
  2. Traverse any edge (u, v) from the current node which is not a bridge edge.
  3. Set the current node to v and go to step 2.
  4. When all edges which are not bridges have been traversed, start traversing bridge edges.
  5. Stop when all edges are traversed.

Walk through Euler path step by step

1. The graph is as follows; nodes 1 and 6 have odd degree, so our candidates for the starting node are these two. Start from 6.

Initial graph with node 1 and 6 with odd degree

2. Traverse to 5, then to 1, then to 3, followed by 2, one by one, as these are not bridge edges.

3. At 2, there are three edges: (2,1), (2,4) and (2,4). However, if we follow edge (2,1), we will burn the bridge and never be able to come back to the rest of the graph, hence we will not take that edge. So we move along edge (2,4).

4. At 4, we again have three edges: (4,2), (4,6) and (4,3). But if we go through (4,2) we will again burn a bridge, as (4,2) is now a bridge edge. Hence, we traverse (4,3).

5. At 3, we are left with only one edge, so we follow it, and it is followed by edge (6,4).
At 4, the bridge edge now has to be followed because there is no other option, and the same goes for edge (2,1).

Hence, Euler path will be 6-5-1-3-2-4-3-6-4-2-1

Code for Fleury's algorithm
Please refer to Graphs : Euler circuit and Euler path in graphs for details of graph creation and of checking whether an Euler path exists in the graph.
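For completeness, here is a self-contained C++ sketch of Fleury's algorithm on the walk-through multigraph above (illustrative; the bridge test simply removes an edge and re-counts the vertices reachable from the current node). It prints one valid Euler path, which may differ from the exact order derived above.

#include <iostream>
#include <vector>
using namespace std;

const int N = 6;                 // vertices 1..6, as in the walk-through
int g[N + 1][N + 1];             // adjacency matrix of edge multiplicities

int reachable(int u, vector<bool>& vis) {        // counts vertices reachable from u
    vis[u] = true;
    int cnt = 1;
    for (int v = 1; v <= N; v++)
        if (g[u][v] > 0 && !vis[v]) cnt += reachable(v, vis);
    return cnt;
}

// edge (u,v) may be taken if it is the last edge at u, or if it is not a bridge
bool isSafe(int u, int v) {
    int deg = 0;
    for (int w = 1; w <= N; w++) deg += g[u][w];
    if (deg == 1) return true;
    vector<bool> vis(N + 1, false);
    int before = reachable(u, vis);
    g[u][v]--; g[v][u]--;                        // tentatively remove the edge
    vis.assign(N + 1, false);
    int after = reachable(u, vis);
    g[u][v]++; g[v][u]++;                        // restore it
    return after == before;                      // removing a bridge shrinks the reachable set
}

void eulerPath(int u) {
    for (int v = 1; v <= N; v++) {
        if (g[u][v] > 0 && isSafe(u, v)) {
            cout << u << "-";
            g[u][v]--; g[v][u]--;
            eulerPath(v);
            return;
        }
    }
    cout << u << "\n";                           // no edges left: end of the path
}

int main() {
    // the multigraph from the walk-through; edge (2,4) appears twice
    int edges[][2] = {{6,5},{5,1},{1,3},{3,2},{2,4},{2,4},{4,3},{3,6},{6,4},{2,1}};
    for (auto& e : edges) { g[e[0]][e[1]]++; g[e[1]][e[0]]++; }
    eulerPath(6);                                // start from a vertex of odd degree
}
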
The complexity of the above algorithm is O((V + E) * (V + E)): for every edge we traverse, we may need to check whether an edge is a bridge, and each such check takes O(V + E) time. Hence, the overall complexity is as given above.