Program to check if a string is a Pangram

What is a Pangram?

A pangram is a sentence containing every letter of the alphabets. Our job today is to try to find if the given string is a pangram or not. This will be an O(n) time and O(2n) space complexity problem with n being the length of the string. Reason being that we need to traverse the string all the way to understand if all the alphabets are in the string or not. One can add an exit condition where if the pangram is found, we ignore the rest of the string but again, that means we put in additional checks after each find of an alphabet. Not worth the extra computations required except if we are talking of very big strings. Lets try to tackle that in part 2.

Our simple code converts all the characters in a given string to lower case first. We use the std::transform function in <algorithms> to do that. We also have initially declared a map of all the alphabets with value set to false. Now our task is pretty simple. We iterate through the string and mark the value of the alphabet in the map (i.e. change it to true) if that alphabet is encountered in the map. Accessing map elements is pretty fast with a runtime complexity of O(1). Similarly, accessing array elements in string is O(1). We need to iterate over the whole string. Lastly, we iterate over our map and see if all the elements are set to true. If yes, the string is a pangram otherwise it is not.

Code to find if sentence is pangram or not?

Remember that we skip constants in calculating space and time complexity. Time complexity in this case is O(n) and space complexity is O(2N).

Visit me at: https://www.naresh.se/
Follow me on Twitter: @wolverine2k

This post is contributed by Naresh, please contact us if you are willing to contribute to algorithms and me.

Reverse a string O(n/2) time, O(n) space

Reverse a string

We all know how to reverse a string right? But the quest to always find something better than existing algorithms makes us quite excited. But before jumping into the code, let’s try and see the best case i.e.(Ω) for this problem. It will allow us to find out the best possible time we can have to solve the problem. Nothing else faster than that can be achieved.

Algorithm to reverse a string

1. Iterate from beginning of string to n/2 
2. Swap the elements boundary elements 
3. If string is of even length, continue till n/2 and swap that one 
4. Else we let the middle element remain as it is (i.e. continue to (n-1)/2)

This algorithm to reverse a string gives best possible runtime of Ω(n/2). Can we do any better? Looks like not, since we will need to traverse the string halfway anyways. So, if we implement an algorithm to have a Big-O of n/2 as well, then our Θ i.e. average time will also be n/2. That’s a relief, we will have a linear time complexity growth. Our space complexity will always be O(n) since we will not allocate any space and do in-place replacements in the string. I have not been able to find anything else faster than this one. But write in the comments if there is something faster. I am thinking for maybe dynamic programming or multi-threading but we address that in the next post.

Let’s see if I can explain it better. Suppose the string is “ABCDE”. Our loop will iterate 2 times (5/4). In the first iteration, A will be swapped with E. And the 2nd iteration B will be swapped with D. E will remain at its place. So the reversed string will be “EDCBA”. Looks like that is the one we want. Try similar with an even string and it will still be the same.

Now the code! I am using c++ and will use c++14 where possible. Of course we can use the std::reverse as well BUT that one is a time complexity O(n). Check the logic here. The code for our logic is as below:

Cheers,
Naresh

visit me at: https://www.naresh.se/
Follow me on twitter: @wolverine2k

Thanks Naresh for contributing to Algorithms and Me, if you want to contribute, please contact us and refer publishing at Algorithms and Me.

Longest Substring Without Repeating Characters

Longest Substring Without Repeating Characters

Given a string, find longest substring without repeating characters in it. For example, S = “abcaabaca”, longest substring without repeating characters will be “abc”

Brute force solution will be to scan all substrings of given string and check which one has longest length and no repeating characters. For a string with size n, there will be n * (n-1) substrings, and to check it each for unique characters, it will take n comparison in worst case. So, worst case complexity of this algorithm is O(n3) with additional space of O(n). Code is simple enough.

#include <stdio.h>
#include <string.h>
 
#define true 1
#define false 0
 
#define MAX(a, b) a > b ? a:b
 
unsigned int checkIfAllUniqueCharacters(char *s, int start, int end) {
	unsigned int table[255];
 
	for(int i=0; i<255; i++){
		table[i] = false;
	}
	for (int i = start; i < end; i++) {
	    char ch = s[i];
	    if (table[ch-'a']) return false;
 
            table[ch-'a'] = true;
    }
    return true;
}
 
int longestSubstringWithoutRepeatingCharacters(char *s) {
	int n = strlen(s);
	int maxLength = 0;
 
	for (int i = 0; i < n; i++){
	    for (int j = i + 1; j <= n; j++){
		int length = j-i;
		if (checkIfAllUniqueCharacters(s, i, j)){
			maxLength = MAX(maxLength, length);
		}
	    }
	}
	return maxLength;
}
 
 
int main(void) {
 
	char *s ="abcabcbb";
	printf("Longest substring without repeating characters : %d", 
		longestSubstringWithoutRepeatingCharacters(s));
 
	return 0;
}

Longest Substring Without Repeating Characters: Sliding window approach

A sliding window is an abstract concept commonly used in array/string problems. A window is a range of elements in the array/string which usually defined by the start and end indices. A sliding window is a window “slides” its two boundaries to the certain direction.

In brute force approach, we repeatedly checked each substring for unique characters. Do we need to check each substring? If a substring s[i,j-1] contains non repeating characters, while adding jth character, check if that character is already present in substring s[i,j]. Since we are scanning substring to ascertain uniqueness of new character being added, complexity of this algorithm is O(n2).

How about optimizing the scanning part? What if we use a hash to store characters which are already seen in substring s[i,j], then checking uniqueness is done in O(1) and hence overall algorithm complexity becomes linear. Here is how to use hash to find solution.

Store characters of substring[i,j] (sliding window) in hash. When new character is processed, check if that character is present in hash,If character is not present in hash, add it to hash and increment j. If character already present in substring s[i,j], that means, it cannot be added to longest substring. Find length of substring (j-i) and compare it with current max, if it is greater, max length of longest substring without repeating characters is (j-i). Also, window also moves from i to i+1 and check for longest substring from i+1 index.

Longest substring without repeating characters implementation

#include <stdio.h>
#include <string.h>
 
#define true 1
#define false 0
 
#define MAX(a, b) a > b ? a:b
 
char * substr(char *s, int start, int end){
	char *dest =  (char *)malloc(end-start+1);
	int k = 0;
	int i = start;
	while(i < end){
	    dest[k++] = s[i++];
	}
	return dest;
}
int longestSubstringWithoutRepeatingCharacters(char *s) {
	int n = strlen(s);
	unsigned int table[255];
 
	for(int i=0; i<255; i++){
		table[i] = false;
	}
 
	int maxLength = 0, i = 0, j = 0;
	while (i < n && j < n) {
		if (!table[s[j] -'a']){
			table[s[j++] -'a'] = true;
			int length = j-i;
			maxLength = MAX(maxLength, length);
		}
		else {
			table[s[i++] -'a'] = false;
		}
	}
	return maxLength;
}
 
int main(void) {
 
	char *s ="abcabcbb";
	printf("Longest substring without repeating characters : %d", 
			longestSubstringWithoutRepeatingCharacters(s));
 
	return 0;
}

As mentioned above, complexity of this method is O(n), with additional space complexity of O(n).

Below is example execution of above code.

Current Character : a
Substring (  ) does not contain a
New length of substring without repeating character : 1
Current Character : b
Substring ( a ) does not contain b
New length of substring without repeating character : 2

Current Character : c
Substring ( ab ) does not contain c
New length of substring without repeating character : 3

Current Character : a
Substring (abc) contains a
Advancing i to 1

Current Character : a
Substring ( bc ) does not contain a
New length of substring without repeating character : 3

Current Character : b
Substring (bca) contains b
Advancing i to 2

Current Character : b
Substring ( ca ) does not contain b
New length of substring without repeating character : 3

Current Character : c
Substring (cab) contains c
Advancing i to 3

Current Character : c
Substring ( ab ) does not contain c
New length of substring without repeating character : 3

Current Character : b
Substring (abc) contains b
Advancing i to 4

Current Character : b
Substring (bc) contains b
Advancing i to 5

Current Character : b
Substring ( c ) does not contain b
New length of substring without repeating character : 3

Current Character : b
Substring (cb) contains b
Advancing i to 6

Current Character : b
Substring (b) contains b
Advancing i to 7

Current Character : b
Substring (  ) does not contain b
New length of substring without repeating character : 3

Longest substring without repeating characters : 3

There is a small optimization which helps us to skip more characters when repeating character is found instead skipping one at a time. Store the index of each character seen in substring [i,j-1].  While processing jth character, if it is already in hash, we know the index j’ where that character is in string. There is no way that any substring can contain unique characters till j’ and j are in it. So, we skip all indices from i to j’ and start from j’+1 instead of i+1 as in above method.

#include <stdio.h>
#include <string.h>
 
#define true 1
#define false 0
 
#define MAX(a, b) a > b ? a:b
 
int longestSubstringWithoutRepeatingCharacters(char *s) {
	int n = strlen(s);
	unsigned int table[255];
 
	for(int i=0; i<255; i++){
		table[i] = false;
	}
 
	int maxLength = 0, i = 0, j = 0;
	for (int j = 0, i = 0; j < n; j++) {
		if (table[s[j]-'a']) {
			int currentIndex = table[s[j]-'a'];
			i = MAX(currentIndex, i);
		}
		int currentLength = j - i + 1;
		maxLength = MAX(maxLength, currentLength);
            table[s[j]-'a'] = j + 1;
	}
	return maxLength;
}
 
int main(void) {
 
	char *s ="abcabcbb";
	printf("Longest substring without repeating characters : %d", 
			longestSubstringWithoutRepeatingCharacters(s));
 
	return 0;
}

Complexity of find longest substring without repeating character is hence O(n) with additional space complexity of O(n).

Please share if something is wrong or missing. We would love to hear from you.

Suffix Array implementation C++

Suffix array

Before learning suffix array, I would recommend you to read about “tries“. Today’s question is : given a word for example:”blogger”, you want to print all suffixes in sorted order or want to find the kth suffix in lexicographical order.

Naive Solution:
In naive solution you just need to find  all suffixes ,save them in an array of string (this will take around O(n2) operation) and sort them in lexicographical order(this will be taking O(n2 log2(n)) so total complexity will be around O(n2(1 + log2(n))) which is quite large.Then finding the kth suffix in it.

All suffixes of word “blogger”:
blogger
logger
ogger
gger
ger
er
r

Solution using Tries:
In this method we can create the trie for all the suffixes of the word and for finding kth suffix in  lexicographical order, we can pre-process and find lexicographical kth suffix in O(n) (You just figure it out how you can do it).

Even if the idea of suffix trie would be very pleasing,but it’s simplest implementation in which at every step one of the strings suffixes is inserted into structure leads to O(n2) complexity algorithm.There is a “suffix tree” which can be built in O(n) complexity algorithm but creating a suffix tree is quite a long code to write,I will talk about suffix trees in future blogs.Here I will talk about another data structure suffix arrays whose code is quite short to write and it gives around O(n log2(n)) and memory used in implementing suffix array with O(n) memory is 3 to 5 times less than the amount needed by a suffix tree.

Suffix Arrays

Suffix Array is the data structure which stores the sorted form of suffixes of a given strings.It will be more clear after seeing the diagram given below:-

SUFFIX ARRAY

How to build suffix array:

As it is clear from the diagram above, We have to sort the suffixes of the String in lexicographic order for creating suffix arrays. So the big question is How to do it? We will first sort according to the first character of the suffixes of the string and store it in suffix array, for example:

b  l  o  g  g  e  r
 Suffix array –                             0  3 4  2  2  1 5

Now,we will be gradually increasing the sorting criteria i.e from sorting according to single character to two characters, then four characters and so on in the increasing power of two. Until we have reached a length such that we can be sure all the suffixes of the string are themselves sorted.(it will be more clear from the diagram below).

So we have m = ceil(log2(n))” steps to create suffix array where ‘n’ is the length of the string. Let us denote each step by ‘y’, where 0<=y<=m .At step y we will sort the suffixes according to the first 2y characters of each suffix, For example at step 2, we will sort the suffixes according to first 4 (22) characters of each suffix.How to optimally implement the above stated condition? I’ll explain it in step by step manner.

At Step 0(y = 0) :You just sort according to the first character of each suffixes and give sort index(sort index:position of the suffix in the sorted array) to each suffix according to it,if the first character is same for two or more suffixes, give same sort index to  them.See the example below.

Step 0:               Sort-index  
      b                  0                                            
      l                  3                                                      
      o                  4
      g                  2
      g                  2
      e                  1
      r                  5

At Step 1(y = 1):We want to sort the suffixes according to the first two characters. For creating the sort-index at step 1 ,take the help of sort-index created at step 0.Create a pair of indexes in which first value is the sort-index of the respective suffix got in step 0 and second value will be the sort-index of the suffix that starts 1 positions later that we obtain in step 0. If the length of the suffix is less than 2 characters, then we can keep the second value as -1. Now sort according to this pair to get new a sort-index. It will be more helpful if you relate it to diagram given below:


   Step 1(0 - 1)        (concatenating at i  with  i + 21)           Sort-index
        
           bl                        (0,3)                                        0    
           lo                        (3,4)                                        4
           og                        (4,2)                                        5
           gg                        (2,2)                                        3
           ge                        (2,1)                                        2 
           er                        (1,5)                                        1
           r$                        (5,-1)                                       6

Here i denotes index of the suffix starting at position ‘i’ in the string.
At Step 2(y = 2):Same work has to be done at step 2.Here we want to sort according to the first four characters.Create a pair of indexes in which first will be the sort-index of the respective suffix and second will be the sort-index that starts 2 positions later like concatenate bl(1st position) with og(3rd position) to get first four characters of the respective suffix.Do this with all other suffixes and get the sort-index of each position.

 
 Step 2             (1 - 2)(concatenating at i  with  i + 21)     Sort-index
 blog                 (0,5)                                        0
 logg                 (4,3)                                        4
 ogge                 (5,2)                                        5
 gger                 (3,1)                                        3
 ger$                 (2,6)                                        2
 er$$                 (1,-1)                                       1
 r$$$                 (6,-1)                                       6

Same will happen for step 3.

So to generalize this,if we go from step y to step y + 1 , we will concatenate the sort-index of suffix array starting at position ‘i’ with sort-index at ‘i + 2y‘ to obtain the length of 2y + 1 for creating the sort-index at step ‘y + 1’.

Code for building Suffix Arrays:

typedef struct temp{
	int nr[2];
	int p;
}pair_suffix;  
                                                 
bool waytosort(pair_suffix i,pair_suffix j)
{
    if(i.nr[0] == j.nr[0]){
        if(i.nr[1] < j.nr[1])
            return 1;
        else
            return 0;
    }
    else if(i.nr[0] < j.nr[0])
        return 1;
    else 
        return 0;
}
void build_suffix_array(string s){
    long long int len = s.length();
      /*this is the array which stores the pair in 
       suffix array which is to be used for creating 
       sort index at each step */
    pair_suffix L[len];
    //no. of steps to calculate suffix array
    int steps = ceil(log(len)/log(2)); 
    int P[steps + 1][len];
    
    for(int i = 0;i < len;i++){
        //Initialization step not included in steps
        P[0][i] = s[i] - 'A';	
    }
    int cnt = 1;
    int k;
    for(k=1; k<=steps; k++,cnt <<= 1){
        //steps to calculate suffix array
        for(int i=0; i<len; i++){
            L[i].nr[0] = P[k - 1][i];
            L[i].nr[1] = i + cnt < len? P[k - 1][i + cnt]:-1;
            L[i].p = i; 
        }
        sort(L,L + len,waytosort);
        
        for(int i=0; i<len; i++){ P[k][L[i].p] = i > 0 
                           && L[i].nr[0] == L[i - 1].nr[0] 
                           && L[i].nr[1] == L[i - 1].nr[1] 
                           ? P[k][L[i - 1].p]:i;
        }
     }
     //suffix array will be found on the last row of matrix P.
}

Time Complexity:Complexity for building suffix array is O(n*log22n).
Now you can find kth suffix in lexicographical order using suffix array in O(n) complexity.Suffix arrays are quite helpful in competitive programming because they are fast to code.

Application:
1) Compute the longest common prefix(LCP) using suffix arrays.
2) Suffix Arrays are helpful in various string searching and string matching problems.

Questions to practice:
Easy:
http://www.spoj.com/problems/SARRAY/
http://www.codechef.com/problems/TASTR
Medium:
http://www.codechef.com/problems/SSTORY
http://www.spoj.com/problems/DISUBSTR/

Knuth Morris Pratt algorithm for string searching

Knuth Morris Pratt algorithm for string searching

You are given two strings – a Text and a Pattern. Determine whether the Pattern exists in the Text or not.E.g. – Consider a Text ABCDGBCDLMand a Pattern BCD. Final output should be all the indexes where the Pattern exists, here it is 1 and 5 (0 based index).This post explains Knuth Morris Pratt algorithm to do so.

The naive method is very basic and straightforward. For every position in the Text, consider it as the starting position and see if a match with the pattern is found.

Considering a text and a pattern of small length, this method would work properly giving O (N*M) complexity, where N and M are the lengths of text and pattern respectively.

The C implementation of the naive method is given below

#include <stdio.h>
#include <string.h>

void stringMatching(char text[], char pattern[])
{
	//Lengths of text and pattern
	int textLen = strlen(text);
	int patternLen = strlen(pattern);
	
	for(int i = 0; i < textLen; i++){
		int j = 0;
		for(j = 0; j < patternLen && i+j < textLen; j++){
			if(text[i + j] != pattern[j]) break;
		}
		
		if(j == patternLen){
			//match found, print the indexes
			printf("\n%d", i);	
		}
	}
}

int main(){
	stringMatching("cdabcde", "abc");
	return 0;
}

The Knuth Morris Pratt Algorithm

If we are able to identify all the positions where an existing match of the pattern ends, in a single iteration over the text. This would solve our problem, since we know the length of the pattern it is easy to identify the starting position of the match. Knuth-Morris-Prat algorithm works on this principle.

In this algorithm, we apply the concept of automaton. An automaton is a kind of an abstract object. At each step of this automaton, some information is presented to it. Using this information the automaton goes to a new state from its current state. whenever the automaton reaches its final state, we have found an end position of a match. The general idea behind the automaton is very simple, consider this Pattern – BBABBB.  Let us list all the prefixes of the pattern.

1. Empty string
2. B
3. BB
4. BBA
5. BBAB
6. BBABB
7. BBABBB

Now for every prefix listed above, consider the longest suffix that is also a prefix of it. The suffix should be different from the string itself.
1. Empty string (No suffix for an empty string)
2. Empty string (The suffix is B but it is same as the string itself)
3. B     (BB) (Here B is the suffix that is also the prefix of string BB)
4. Empty string (No suffix is present which is also the prefix)
5. B     (BBAB(Here B is the suffix that is also the prefix of string BBAB)
6. BB   (BBABB)(Here BB is the suffix that is also the prefix of string BBABB)
7. BB (BBABBB)(Here BB is the suffix that is also the prefix of string BBABBB)

Let us call a suffix that is also a prefix of a string a “Partial match” of that string.We can see that if at some point, the pattern matches till BBABB (which is the prefix of the pattern), then there is also a match till BB which is Partial match of BBABB.
For example :
Let us consider text : BBBABBABBBA and the Pattern BBABBB, the pattern matches till BBABB (If we check from the 2nd character of the text), since BB is partial match of BBABB, the text also matches BB (BB is the largest suffix of BBABB which is also prefix of it), so if we cannot extend our match of BBABB to full pattern with the text, we can start matching the pattern again starting from the suffix BB(BB is the largest suffix of BBABB which is also prefix of it
         
          0 1 2 3 4 5 6 7 8 9 10   (Indexes)
Text : BBBABBABBBA
             BBABBB
Here we wanted the next character of the text to be A to complete our match but it is B so we can start check for next match from 4th index of text because BB is also a partial match now. So here two cases arise,
Consider  pattern :  BBABBB and some text, and till now we have found a match till BBABB, as shown in the figure below.
BBBABB…………..(Text continues)
              BBABB    (Match found till BBABB)

Case 1:  If the next character of the text is ‘B’, that means a match is found, we can start checking in  similar manner starting from the next character  of the text to find another match if there is any.
Case 2: If the next character of the text is anything other than ‘B’ that means the match is not found, so we return to the largest partial match of the string we have found match till now (which is BB for string BBABB), now if the next character of the text matches, our match extends, but if it does not match we will find the next partial match of BB and so on until the partial match becomes empty, then we will skip the next character of the text and start  a fresh check.
Let us see the above operations of Knuth Morris Pratt algorithm with an example.
Text – BBBABBABBBA
Pattern – BBABBB
The automaton for the pattern will look like this…

Knuth Morris Pratt algorithm

 

Knuth Morris Pratt algorithm example

 Step 1: we will start by matching pattern from the first character of the text. Here we found a mismatch at 3rd character. so we will find partial match of BB, which is B and start match from 2nd character of the text. 

Knuth Morris Pratt implementation

Step 2: We when checking from 2nd character, mismatch is found at 6th character of the text, so we start again from the partial match of BBABBB which is BB.

Step 3: We found a match this time at the 5th character of the text.
Now, to find partial matches of prefixes (till we have found a match) of the pattern, we will make an array of length equal to the length of the pattern, let us call it F[pattern_length]. F[i]
(0 <= i <= pattern_length)  contains the length of longest partial match of a prefix (till we have found a match) of pattern till ‘i’.
So here :

F[0] = 0          
F[1] = 0
F[2] = 1
F[3] = 0
F[4] = 1
F[5] = 2
F[6] = 2

The array F not only contains the largest partial match till index i, but also all the partial matches of it, so F[i] is the largest partial match till index i, F[F[i]] is second largest partial match, F[F[F[i]]] is third largest and so on.
Let us call it failure function of the pattern.Below is the function to build the array

void failure_function(char pattern[]){
	int j;
	int ff[100];
	//length of pattern
	int patternLen = strlen(pattern);

	/*let the array be 'ff'
	i = 0 is empty string and i = 1 is a single character,always true */

	ff[0] = ff[1] = 0;                      
	
	for(long long int i = 2; i <= patternLen; i++){
		/* j is the index of the largest next partial match 
		(the largest suffix/prefix) of the string till  
		index i - 1 */
		
		j = ff[i-1];                     
		for( ; ; ){              
			if(pattern[j] == pattern[i-1]){           
				ff[i] = j + 1;       
				break;
			}
			if(j == 0){                      
				ff[i] = 0;
				break;
			}                               
			j = ff[j];
		}
	}
}

Now we have to use failure function to find match. whenever we cannot expand the current partial match, we go to next best partial match of the current partial match and so on. If we are unable to expand even the empty string we just skip this character from the text and go to the next one in the text.

Knuth Morris Pratt algorithm implementation

void KMP(char text[], char pattern[]){
	failure_function(pattern[]); 
	int pattern_length = strlen(pattern);
	int text_length = strlen(text);
	int index = 0;   j = 0;
  
	/* let "occurrences" be the vector container that
	contains that stores indexes whenever a match is found */
  	for(int i = 0; i < text_length; i++){
            for( ; ; ){
                if(text[i] == pattern[index]){
                    index++;
                    if(index == len)
                        occurences.insert(i - len);
		
                    break;
                 }
                 else if(index > 0){
                     index = ff[index];
                 }
                 else{
                     break;
                 }
             }
	 }
    }
}

Please share if there is something is missing or wrong. If you are interested in contributing to algorithms and me, please contact us.

Print all anagrams together C implementation

Print all anagrams together

Many a times, we need to sort some elements which are not integers, like strings. How do we go about using quicksort. I did not know that C library has a function for it. This function is very powerful and can easily sort anything given compare function. So in today’s post, we will use quicksort function and learn how to solve our problem. Problem statement is : given a sequence of words, print all anagrams together. Instead of printing, problem can be asked as group all anagrams together. For example, if input is like :cat, tac, god ,gdo, ball, act, dog, output should be

cat
tac
act
dog
god
gdo

Print all anagrams together : Algorithm

First of all, understand how to identify strings as anagrams. There are many efficient ways to find if two string are anagrams or not, but a simple one is to sort two strings based on characters and if two resultant strings match, two strings are anagrams.

Now problem reduces to sorting each individual string and then sort entire array of strings. Anagrams will automatically be grouped together.

Look closely, there is another problem with solution mentioned, that is when we sort original strings individually, original strings are lost. So, idea is to have a duplicate array of strings. First, copy all string to new array and then sort string individually.  Still there is one problem : after sorting new array, we will loose index of string before sorting. Hence, we need to store index of each string before sorting new array.

We now know, what to store, let’s discuss how to store those information? Original strings are stored in an array of character pointers. Also, we need a duplicate array to sort strings. To store positions of strings in original array.  Hence, let’s create a structure which stores both information, string and position.


typedef struct duplicate{
    char *word;
    int pos;
} Duplicate;

First step is to sort individual strings in duplicate array, using library function qsort().
This function is very easy to use, there are four parameters to pass:
1. The buffer or array to sort.
2. The size of buffer or array
3. Size of individual element of buffer.
4. And last and most important is compare function, to be written in code and passed as function pointer to this function.

For further details of qsort() function usage, refer :qsort in c

Once all strings are individually sorted, we sort entire array of strings using same qsort() function with different compare function, which now runs on entire string instead of each character.

We are almost done! All anagrams are placed next to each other, however, we need to print original string together. However, all anagrams will be same string in sorted array. That’s where index stored in duplicate array will help. Print words from original array using the positions given in duplicate array.

Print all anagrams together : Implementation

#include&lt;stdio.h&gt;
#define MAX_WORDS 101

typedef struct duplicate{
    char *word;
    int pos;
}duplicate;
/* Compare function for string comparison */
int compare(const void * a, const void * b){
    return *(char *)a - *(char *)b;
}
/* Compare function for string array sorting */
int compare_str(const void *a, const void *b){
    duplicate *word_1  = (duplicate *) a;
    duplicate *word_2  = (duplicate *) b;
    return strcmp(word_1-&gt;word, word_2-&gt;word);
}
/* This function copies array in duplicate array */
void copy_in_duplicate(char * array[], 
                       duplicate  dup_array[], 
                       int count){
    int i ;
    for(i=0; i&lt;=count; i++){
        dup_array[i].word = (char *) malloc(sizeof(char) 
                           *strlen(array[i]) + 1));
        strcpy(dup_array[i].word, array[i]);
        dup_array[i].pos = i;
    }
}
int main(){
    unsigned int count = -1;
    unsigned int i;
    char *array[MAX_WORDS];
    duplicate dup_array[MAX_WORDS];
    char word[100];
    /*Reading from the standard input */
    while(fgets(word, sizeof(word), stdin) != NULL){
        count++;
        array[count] = (char*)malloc(sizeof(char) * 
                             (strlen(word) +1));
        if(sscanf(word, "%s", array[count])  == 0){
            printf("\n Error inreading strings");
        }
    }
    
    copy_in_duplicate(array, dup_array, count);
    /*Sort each string individually */
    for(i=0;i&lt;=count; i++){
        qsort(dup_array[i].word, strlen(dup_array[i].word), 1, compare);
    }
    /* sort all strings in array */
    qsort(dup_array, count+1, sizeof(dup_array[0]), compare_str);
    /* printing all words with anagrams together */
    for(i=0;i&lt;=count; i++){
        printf("\n %s", array[dup_array[i].pos]);
    }
	return 0;
}

Implementation will perform sorting on M strings, each of size N characters, so complexity of algorithm to print all anagrams together will be O(MlogM + MNlogN) MNlogN because NlogN for sorting each string of size N and there are M strings,MlogM to sort M strings in array.

Strings : Find if any anagram of string is palindrome.

Find if any anagram of string is palindrome

Given a string, find out if any anagram of given string is palindrome.
For example, “AABBC” has an anagram “ABCBA” which is palindrome.

Analysis

Brute force solution will be to find all anagrams of string and then check each one if that is palindrome. For N character long string, we will have n! anagrams and checking each one of that for palindrome will be computation intensive task, so not a good solution.
What is required for a string to be palindrome string?
We need first half of the string to contain same characters as second half. In case of odd length string we can safely ignore the middle character.
In other terms, each character should occur even number of time except one which is middle character which may occur odd number of times.
Best way to check if character occurs even number of times is to increment to count when it is zero and decrement when it is 1.If at the end we have count corresponding to that character as zero, that means we have even number of occurrence of this character, if not then we have odd number of occurrence.
Extending the same idea from previous post: we will be using a hash table and increment- decrement corresponding count of character in hash.
At the end we will sum the array and if sum of array is greater than 1, then we cannot have any anagram which will be palindrome else at least anagram of string will be palindrome.

Code

Complexity of above code is O(N) with constant amount of memory required.

Find if two strings are anagrams

Find if two strings are anagrams

Today, we will discuss a problem on string, problem statement is : Given two strings, find if those two strings are anagrams.

What are anagrams?

Two strings are anagrams if they are composed of same characters, these characters may be in different order in two strings. For example, below strings are anagrams

cider cried dicer 
ether there three
heaps phase shape 
manes manse means

With definition, we know part of solution already;  check if both strings contain same characters, not necessarily in same order. Then they are anagrams.
First approach to check two string contains same set of characters is to sort both strings and compare results. If results are same, strings are anagrams, else not.  With sort, characters will be arranged in lexicographical order within strings. Time complexity of this method is O(N log N) + O(M log M ) where N and M are sizes of strings.

Can we do better? In order to check if each character in a string is present in other, let’s keep track of characters in one string in another. Count number of occurrence of each characters in first string in hash table with keys as characters. Again, count number of occurrence of each characters in second string.  If both hash tables are equal, strings are anagrams. Two hash tables are used for solution. We can do away with one hash table.

While traversing first string, increment count for each occurrence of character. While traversing second string, decrement count for each character, in same hash table. At any point of time,  if count of character is less than zero, then character is present in second string but not in first, and hence, strings are not anagrams.

Once, second string is completely scanned, check if there is any character in hash which has count greater than 0. There is easier way, just add all the values in hash and see if it is zero or not.

To avoid check hash table at end for non zero count, first check lengths of two strings, if lengths are not equal, directly say non anagrams. (Think how it avoids check of non zero count at the end :))

#include <stdio.h>
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
int is_anagram(char *s, char *d);
 
#define NUM_CHAR 256
#define true 1
#define false 0
 
int isAnagrams(char *s, char *d){
 
    int i = 0;
    int lengthA = strlen(s);
    int lengthB = strlen(d);
 
    int hash[NUM_CHAR];
 
    if(lengthA != lengthB) return false;
 
    for(i=0; i< NUM_CHAR; i++){
        hash[i] = false;
    }
 
    while(*s != '\0'){
        hash[(*s) -'0']++;
        s++;
    }
    while(*d != '\0'){
        hash[(*d) -'0']--;
        if(hash[(*d) - '0'] < 0) return false;
        d++;
    }
 
    return true;
}
 
int main(){
    char *s  = "cider";
    char *d  = "dicer";
 
    if(isAnagrams(s,d)){
        printf("\nYES");
    }
    else{
        printf("\nNO");
    }
    return 0;
}

Complexity of algorithm to check if two strings are anagrams is O(N) where N is length of string.

Check if strings are anagrams using prime factorization

There is mathematical proof (which I don’t know) that a number can be decomposed into a unique set of prime factors. It is not possible to add or take away any factor of that number if number has only prime factors. Prime factorization wiki . How can we use this information to solve our problem.

Assign a unique prime number to each character in alphabet (considering only lower case ). Now for each word, take corresponding prime number for each character of word and multiply. If results of multiplication of two words are same, words are anagrams.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
 
int is_anagram(char *s, char *d);
 
#define NUM_CHAR 256
#define true 1
#define false 0
int hashPrime[26] =  { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
                       53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 };
 
int isAnagrams(char *s, char *d){
 
    int lengthA = strlen(s);
    int lengthB = strlen(d);
 
    if(lengthA != lengthB) return false;
 
	int multiplicationA = 1;
	int multiplicationB = 1;
 
	char *stringA = tolower(s);
	char *stringB = tolower(d);
 
    while(*stringA != '\0'){
        multiplicationA = multiplicationA * hashPrime[(*stringA) -'a'];
 
        stringA++;
    }
    while(*stringB != '\0'){
    	multiplicationB = multiplicationB * hashPrime[(*stringB) -'a'];
        stringB++;
    }
 
    return multiplicationA == multiplicationB ;
}
 
int main(){
    char *s  = "cider";
    char *d  = "dicer";
 
    if(isAnagrams(s,d)){
        printf("\nYES");
    }
    else{
        printf("\nNO");
    }
    return 0;
}

This is fast method, however, there are pitfalls, one must be aware of. Multiplication of integers can easily lead to over flow. For 26 characters, we need at least 26 unique prime numbers and even if we chose lowest 26 prime number, this overflow may occur as soon as we see words with length greater than 9.

Second problem with prime factorization method is that mapping is skewed between input and output space. There is  infinite number of anagrams where at max with 64 bit machine, we can store only 264 words. However, with limit number of words with restricted length this method works good, else move to sorting method discussed earlier.

Please share if something is missing or wrong or something you liked about it. Sharing is caring.

Find first non repeating character in stream of characters

Non repeating character in stream of characters

In post here , we learned how to find first non repeating character in a string. In this case, entire set of character was available to us up front. Now, let’s see a problem where there is a stream of characters and not a string. So, the problem statement is something like this : Given a stream of characters, find first non repeating character at any given point of time. For example stream is like a,b,c,c,c,b,a,d,e,e,f,…. First non repeating character in this stream till now is d. So, we have to return first non repeating character at a given point of time.

If there is predefined set of characters and task is to figure out first non repeating character, then it is very simple. Create a hash table keyed on character and as and when character is encountered in given set, increase the count. Once all characters in set are processed, go through the set again and return the first character which occurs once in set.

However, this approach cannot work when there is continuous stream of characters. We need a different approach. First of all, keep in mind that we have find first non repeating character a given point of time. If we query at time t0 and next instance of character comes at t1 where t1> t0, we will return the character at t0. When character is seen second time, it will be discarded. But what will be first non repeating character at t1 then?

Figure out data structures and behavior

In order to answer above question, store all the characters which are non repeating till t0 and when queried for first non repeating character, return first character in the list. Got some hint?
There is a pattern of characters coming in and leaving out and that is First come first out. Which data structure provides that? Yes, Queues. We have finalized one thing : To get non repeating character at any time we need to implement queue and store each character which is seen only once in that queue.

Second part is to remove a character when it occurs again. As characters seen only once till given point of time are stored in queue, we need to remove node from queue.  This node can be at anywhere in queue. What is the best structure where to implement queue when we have to delete node at anywhere, from start, end or middle ? Doubly linked list.  Till now we have figured out how to delete a node or character entry from queue when it occurs second time and how to get first non repeating character.

However, how to find the exact node to be deleted in queue implemented using doubly linked list. Either we traverse DLL each time with O(n) complexity, or we can store a mapping. Mapping will be between character and node pointer in DLL. Given a node pointer in DLL, how to delete it can be read here :Delete a node from doubly linked list

There is a small catch here : What if the character is seen third time ? We go to the hash <character, node> and get the node. Then try to free the node again which is already freed when it occurred second time. In this case, segmentation fault for sure! 🙂

One solution is to remove node from the hash. But then there is no way if the character never appeared or appeared twice.
To check if character is seen two or more times, use another hash with character as key and boolean as value. Set hash <character> to true when character is seen second time. So, first check in this hash, if the character entry in map is true, do not do any processing.

Great! We have got data structures. Now let’s summarize algorithm to find first non repeating character in stream of characters
1. Take the input character. 
2. Check in 'visited' hash if this character is seen twice already.
3. If yes, don't do anything, process next character.

4. If it not seen twice yet, then we need check if it is seen first time. If seen first time, DDL node entry corresponding to that character will be null. 
5. Add node in DLL and update the hash table with the node value. 

6. If character is already seen once, we will have a node entry in hash table. Remove the node from queue and marked 'visited' entry as true
Let’s work out an example: Stream of characters is : a,t,r,e,r,a,p,p,p,a,u,i,b,t,………….Continues. At first all our containers are false and NULL, queue is empty.

Initial state

Character ‘a’ comes : Looking at visited array,we see that it is definitely not seen more than twice, else the place would have been marked true. Also looking at node map array we can see that this character is still not in queue, it means we have seen it first time. So we add this to queue and store pointer in node map array.
Data structure condition :
 Now comes ‘t’, same above conditions are true, hence ‘t’ is also added to queue. Similarly for r and e too.

first non repeating character in streamWhen first instance of a character is seen

Twist comes when ‘r’ is seen again. This time we have a node pointer stored in node map array corresponding to ‘r’, but visited flag is false. This means we have seen ‘r’ second time. We till remove node ‘r’ from queue and mark visited[‘r’] as true, hence forth character will not considered as first non repeated character. So our data structures at this point look like this:
When second instance of character is seen

After this if query for first non repeated character, output will be ‘a’.
We process next character ‘a’. Again same fate as above ‘r’. We will remove node from the queue and marked visited[‘a’] as true.
After this if query for first non repeated character, output will be ‘t’.
This will continue is similar fashion. Please comment if further explanation is required on example.

Code to find first non repeating character

Complexity of to find first non repeating character will be O(1) to find first non repeating character at a given point of time. We would use NO_OF_CHARACTERS extra space.

Check if string is interleaved of two strings

Given three strings A,B and C, find if C string is interleaved of two strings. A and B. C is said to be interleaved if it contains all the characters of A and B and order of characters in respective strings is maintained. For example string C in figure is interleaved string of A and B.

check string is interleaved of two strings

Find if string is interleaved of two strings

Simple algorithm to find if a string is interleaved of two other string is to start checking length of C as length of A + length of B. If it is not true return false (why?).

If length criteria passes, check first character of C and A. If they match, move to second character of C and A. Keep B at first character. If character in A does not match with character in C, check if first character of C and B match, if they do, then move to second character of C and B. A remains at first character.

Once done, repeat above steps with next characters of strings, while character in C matches character in A or B. If both current character in C does not match with any of the current characters in A and B, return false.

With above algorithm, with each comparison, problem reduces to smaller size, with C with length N-1, one of A or B with M-1. From the description, it is clear that recursion can be used to solve interleaving problem.

isInterleaved(A,B,C)  = isInterleaved(A+1, B, C+1) 
                        || isInterleaved(A, B+1,C+1)

What shall be base case for above recursion?
If end of C is reached, all characters in C are checked, return true if all characters in other two strings are also considered. If not returned false.
Other way, if both string are A and B are fully considered and there are still characters left in C to be considered, then return false. (This case is anyways caught at length check).

Recursive Implementation of interleaved strings algorithm


#include <stdio.h>
#include <string.h>
#define true 1
#define false 0; 
 
int isInterleaved(char *c, char *a, char *b){
 
    if(!(*c) && !(*a) && !(*b))
        return true;
 
    if(*c == '\0'){
        return false;
    }
	 // if character of a and c match
    return ((*c == *a) && isInterleaved(c+1,a+1,b)) || 
    		// if character of b and c match
            ((*c == *b) && isInterleaved(c+1,a,b+1)); 
}
 
int main(void) {
	char *c = "abxycdz";
	char *a = "abcd";
	char *b = "xyz";
	if(isInterleaved(c,a,b)){
	    printf("\n String is interleaved");
	}
	else{
	    printf("\n String is not interleaved");
	}
	return 0;
}

Iterative implementation is as follows

#include <stdio.h>
#include <string.h>
#define true 1
#define false 0; 
 
int isInterleaved(char *c, char *a, char *b){
 
    while(*c != '\0'){
    	if(*c == *a){
            a++;
        }
        else if(*c == *b){
            b++;
        }
        else{
            return false;
        }
        c++;
    }
    if(*a != '\0' || b != '\0')
        return false;
 
    return true;
}
 
int main(void) {
	// your code goes here
	char *c = "abxycdz";
	char *a = "abcd";
	char *b = "xyz";
	if(isInterleaved(c,a,b)){
		printf("\n String is interleaved");
	}
	else{
		printf("\n String is not interleaved");
	}
	return 0;
}

Iterative implementation will not work with input with common characters in string A and B, for example A = XXY and B = XXZ and if C = XXZXXY will return false.

Complexity of iterative implementation will be linear O(N), N being length of string C, where as complexity of recursive solution will b O(2N) but it does not fail in when two strings have common character.

Dynamic programming approach to interleaved strings problem

Look closely and notice that many subproblems are being calculated again and again. Recursion tree for input A = XXY and B = XXZ and C = XXZXXY

interleaved strings with dynamic programming

Looking at the recursion tree, it is clear that if results of subproblems are stored, they need not be calculated again and again. Create a two dimensional table. Table(i,j) = true only if C[i+j-1] is interleaved string if A[i] and B[j]. Empty string is interleaved of two other strings so, Table[0][0] = true
If one of the strings was empty: Table(i,0) = A[i] == C[i] && Table(i-1, 0) that is to say if till i-1 characters C was interleaved of A, then for ith character it will be true if ith character matches ith character of A. Note that B is empty. Again if string A is empty, then Table(0,j) = Table(0, j-1). With these base cases, we can fill table bottom up as follows

Table(i,j) = Table(i-1,j)  if (A[i] == C[i+j]) && (B[j] != C[i+j])
Table(i,j) = Table(i,j-1) (B[i] == C[i+j])&& (A[i] != C[i+j])
Table(i,j) = Table(i-1,j) || Table(i, j-1) if (A[i] == C[i+j]) && (B[j] == C[i+j])

Dynamic programming solution to find if string is interleaved of two strings

#include <stdio.h>
#include <string.h>
#define true 1
#define false 0; 
int isInterleaved(char *c, char *a, char *b){
 
    int lenA = strlen(a);
    int lenB = strlen(b);
    int i,j;
 
    int Table[lenA+1][lenB+1];
    // Initialization
	for(i=0; i<=lenA; i++){
        for(j=0; j<=lenB; j++){
        	Table[i][j] = false;
        }
	}
    for(i=0; i<=lenA; i++){
        for(j=0; j<=lenB; j++){
        	// Both strings are empty
            if(i==0 && j==0)
                Table[i][j] = true;
    		// string A is empty, compare characters in C and B
            if(i==0 && c[j-1] == b[j-1]){
                Table[i][j] =  Table[i][j-1];
            }
            // string B is empty, compare characters in C and A
	        else if(j==0 && c[i-1] == a[i-1]){
                Table[i][j] =  Table[i-1][j];
            }
            // Both strings are not empty
            //1. If character of A matches with character of C
            // but not of B
            else if (a[i-1] == c[i+j-1] && b[j-1] != c[i+j-1]){
                Table[i][j] = Table[i-1][j];
            }
            //2. If character of B matches with character of C
            // but not of A
            else if (a[i-1] != c[i+j-1] && b[j-1] == c[i+j-1]){
                Table[i][j] = Table[i][j-1];
            }
            //1. If character of A matches with character of C
            // and character of B also matches with C
            else if (a[i-1] == c[i+j-1] && b[j-1] == c[i+j-1]){
                Table[i][j] = Table[i-1][j] || Table[i][j-1];
            }
        }
    }
    return Table[lenA][lenB];
}
 
int main(void) {
	// your code goes here
	char *c = "abxycdz";
	char *a = "abcd";
	char *b = "xyz";
	if(isInterleaved(c,a,b)){
		printf("\n String is interleaved");
	}
	else{
		printf("\n String is not interleaved");
	}
	return 0;
}

Complexity of dynamic programming algorithm to find if a string is interleaved of two strings will be O(n2).

For more dynamic programming problems, please refer : Dynamic programming problems

Please share if there is something wrong or missing. We would love to hear what you have to say. If you are interested in contribute to website, please contact us.