# Word count problem

Problem at hand is very simple, we have to count number of occurrences of words in a given file. This is called as word count problem. For example if file is as below,

```input.txt
Hello world. World is so beautiful. I am so busy today, not hello, world is also busy today and tomorrow. Say hello to your beautiful wife.```

Output should look like this, which mentions number of time a word appears in file.

```output.txt
Hello - 3 World - 3 is - 2 so -2
beautiful - 2 I - 1 am - 1 busy - 2 today - 2
tomorrow -1 not - 1 also - 1 Say - 1 hello - 1
to - 1 your - 1 wife - 1```

## Word count problem at small scale

Word count problem can be easily solved using hash. Keys of hash will be words and value being their count in file. Each word is read from file, checked for presence in hash, if word already present in hash, increment the count, else add word to hash with count as 1.

What are limitation of above solution? First, it’s apparent that extra space will be required for storing hash. Now, if input set is very large with mostly unique words, space requirement will be of order N.

Second, while processing words, we have bring in entire file into memory and process each word in serial manner. Complexity will be O(N) where N is number of words in file.

## Word Count problem at large scale

Hash based solution described above works for smaller set of input where input and output can live memory at the same time and parallelism is not required. What if we have process GB or TB of data? Will above method scale to that?

Apparent thing at scale is to divide and conquer. We divide input set into manageable smaller sets, work with them and generate output and then at last, combine smaller output to produce output for the original problem. This approach gives us two advantages: We don’t need to bring in entire input into memory at a time and smaller tasks can be run in parallel (given that output of one task does not depend on output of another, of simplicity sake.)

There is a method called Map Reduce, which uses divide and conquer approach to solve such problem at scale.

Map function
Map function performs certain processing on data sets where processing on each data set is independent of other. For example, if sum of squares of list of number is to be found, map function will find square of each number in list. Finding square of a number is completely independent of other.

Reduce Function
Reduce function works on output generated by map function and combines them to produce final output. Significant difference between map function and reduce function is that reduce may or may not perform processing on data sets independently. So in above example. reduce function takes output of map function that is square of each number and adds them.

Since the map function can independently on data sets, it is candidate for parallelism. Multiple map function can run on multiple servers on independent data sets and generate some key value pairs which then can be consumed by reduce function for further processing.
In the same way more than one reduce functions can be invoked on more than on server which run different batches of inputs (output from map) which are related and dependent on each other.

It is important to know that map and reduce function do not talk to each other hence no data sharing can happen between any two instances of them.

Coming back to our problem of word count, mapper function will read a batch of words and generate tuple as (word, 1)  and send it to reducer where based on word, it will increment count. Below diagram explains it.

There is one problem with this, there is only one reducer and if all tuples of words are going to same reducer, we end up storing all words on single machine memory. Also, single machine is processing all words, so we lose the benefit of parallelism of mapper as reducer becomes bottleneck. How can we have multiple reducer so avoid both of above problems?

If we group words by their first letter and send them to different reducers, we can run 26 different reduces which cater to small set of words and can run in parallel and produce output.

There is another method which to group all same words and send them to reducer. Now, reducer does not have store these words, all it need to do is count. Group by step can be intermediate step between map and reduce.

Code snippets

Mapper function

Reducer function

Driver function

Code snippets are copied from : Hadoop tutorial

This is how we solve or design solutions for word count problem which involve large scale. Please share if there is something wrong or missing.

# Unique Anagrams in file

What are anagrams? If you have any doubt, please refer to post : Find if two strings are anagrams . Today’s problem is on top of finding if a string is anagram of other or not. There is a file or list of words, we have to find a unique anagrams in that file, that means, there can be many permutations of same characters and we have to discard all but one. Fro example, if file contains following words.

`cider ether there three cried dicer heaps phase shape manes manse means`

Output should be

`cider ether heaps manes`

Sometimes question can be tweaked and asked like:

given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other.

Before solving for unique anagrams, let’s solve problem of finding anagrams and then find uniqueness by book keeping. There are two basic steps to solve problem. First, unify each word, to compare words with each other. Second step is to identify if new word just seen, is already seen or not, that to find it is unique anagram or not.

First step : Find if two words are anagrams

How to find if two strings are anagrams? The first method is to sort both string based on characters and then compare them for equal. If they are equal, strings are anagrams of each other else not.

There is another method which is much quicker and does not require sorting of string. This method is called assigning prime factorization.

## Check if strings are anagrams using prime factorization

There is mathematical proof (which I don’t know) that a number can be decomposed into a unique set of prime factors. It is not possible to add or take away any factor of that number if number has only prime factors. Prime factorization wiki . How can we use this information to solve our problem.

Assign a unique prime number to each character in alphabet (considering only lower case ). Now for each word, take corresponding prime number for each character of word and multiply. If results of multiplication of two words are same, words are anagrams.

This is fast method, however, there are pitfalls, one must be aware of. Multiplication of integers can easily lead to over flow. For 26 characters, we need at least 26 unique prime numbers and even if we chose lowest 26 prime number, this overflow may occur as soon as we see words with length greater than 9.

Second problem with prime factorization method is that mapping is skewed between input and output space. There is  infinite number of anagrams where at max with 64 bit machine, we can store only 264 words. However, with limit number of words with restricted length this method works good, else move to sorting method discussed earlier.

Now, we know how to find if words are anagrams or not. All, we need to do is to scan the file or list of words and apply above method and eliminate all but one word..

Second step : To identify unique anagrams

How to easily identify that we have already seen a thing before. Simply hash it. If you have already seen it, hashing will cause collision. Hashing is O(1) solution for it, but how about space requirement. Since, we don’t know how many strings are unique, we have allocate memory for worst case which O(N). Can we optimize on space front?

How about binary search tree base storage. Finding a words already present and insert operation can be done simultaneously in O( log N ) if tree is near balanced.  However, we save a lot of space as nodes are allocated as and when required. Drawback is if unique anagrams words are seen only in ascending or descending order (while doing this for a dictionary, search complexity goes for O(N) in worst case. )

Third data structure which can save space as well as limit time complexity is trie. Basics of trie can be learned here : Trie basics

Search complexity reduces to length of string being searched for and space required is also dependent on longest word in the list. Well this method can also work when we have prime factorization method employed by converting number into string.

Now, we have solved problem for reasonable size of input. How about insane size like billions of words are to be processed. Can we distribute the work? Does our solution can be implemented on a distributed architecture and if yes, how?

First of all, there are two basic steps in the algorithm and we can easily make a pipeline of two steps. First stage do sorting of number while second stage does the comparison.

Sorting stage can have multiple processes, each working on subset of input, sort them and output them to comparator. If we look closely, comparator becomes a space expensive as we have to store each unique word. In worst case, we may hit memory limit at comparator.

We can have multiple comparators as well, each doing comparison on specific set of output produced by stage 1. This classification can be done on as simple as first character of word or complex as range of hash of word. However, this method distributes data on multiple machine which can then produce final output. Take care that distribution is exclusive.

I have tried to explain problem of finding unique anagrams in a file from scratch and take it to a real life production problem and various design and data structure decisions one has to take to solve this problem. Please let me know if something is missing or wrong. Sharing is caring.

## Storage classes in C explained

We will discuss today a very fundamental question asked in interviews, what are different types of storage classes in C? To answer this question, first of all we must know what is storage class.

## What are storage classes?

Storage classes in C is a notion which helps compiler and run time executioner to decided where to allocate memory for variable (memory or register) and also decide scope of it (global or local). Apart from location and scope, storage class also determines initial value of variable.

There are four different types of storage classes in C

1. Auto or automatic storage class

This is default class and any variable defined inside a function without specifying any class is an automatic variable. There are created and destroyed inside the function itself. Also, scope of automatic variable is block in which they are declared. Automatic variables are allocated on memory and there is no default value associated with them.

2. Register storage class

When a variables needs faster access, one can specifically request compiler to allocate memory for variable in CPU using registers. Usually, frequently used variables are stored in register as there is a limit on how many variables can be stored in registers. Point to be noted here is that not all variable types are eligible to be store in register storage class, it’s dependent on CPU architecture. There is not default value assigned to register variable too.

3. Static storage class

Static variable can be defined within a function or a file and maintain their values across function calls. As static variable can be defined inside a function and in a file, there is subtle difference which comes from place where static variable is declared.

If static variable is declared and defined within a function, it’s values are preserved across function calls ( this is as good as a global variable ), however, scope of variable in confined within a function or block it is defined in. These are also called as internal static.

When a variable is defined static at file level, it is visible to all functions within that file. Functions in other files have no access or knowledge to this variable.

Static variables are initialized only once with value as 0.

4. External storage class

These variables are declared in one file and are used in other files. This actually can be better understand if we understand what is difference between declaration and definition. Declaration is nothing but creation of a name of variable, actual memory allocation happens when it is defined. There can be multiple declarations of a variable but only one definition.

Using this distinction between declaration and definition, external storage class avoid double definition. If a variable is declared as external, compiler knows that it is defined elsewhere and no need to allocate memory again for it.

There is no initial value to external variable as they are just declared and not defined.

This was heads up on what are different storage classes in C and how they affect your programs. Please share if there is something missing or wrong in this post. Sharing is caring.

References

1. The Complete Reference C
2. The C Programming Language by Brian W Kernighan and Dennis M Ritchie

# Flip bits of unsigned number

Let us learn a small trick today using bit wise operation. Problem at hand is to flip bits of a given number, unsigned integer. For example, if given number is 17, bit wise representation of number is 0001 0001, the output should be 1110 1110 which is 238.

First of all, understand that we need to make 1 as 0 and 0 as 1, what operation can do so for us? XOR can do. If 1 is XOR with 1, it results in zero and 1 XORed with 0 return 1. So, if XOR every bit of number with 1, it will flip itself. And that’s our solution.

## Flip bits implementation in C

Complexity of method to flip bits of a number is O(1).

This method can also be used to toggle a specific bit of an unsigned integer. For example, given a number toggle 5th bit of number. Only difference in implementation will be to create number to XORed. Idea is to start with 1 and left shift it by K-1 bits (in this case 4 times). Now, 1 is at 5th position. XOR this number with given integer and you got Kth bit toggled.

Reference :  Hacker rank problem

There was a recent question asked in Microsoft interview: Given the numbers n and k, change the kth bit of the number n to ‘0’ if it is ‘1’, else return the number n itself. This problem can be easily solved if you understand above two questions.

# Non-repeating element in array

We have seen problems like missing element and repeating elements in array. Today, we are going to discuss another similar problem. Problem statement is : Find single non-repeating element in unsorted array. In array, all elements except one element, are repeated even number of time. For sake of simplicity, let’s say elements are repeated twice.

For example : A = [1,1,2,3,3,4,4,5,5], non repeated element is 2.

One solution is to sort array and then check which element is not repeated. Complexity of this solution is dominated by the complexity of sorting algorithm.

There is one trick, we must be aware of to solve this problem efficiently. What happens when a number is XORed with itself? By virtue of XOR operation, odd number of 1s in operation, it returns 1 else 0. For example, 1 XOR 1 is 0 where as 1 XOR 1 XOR 1 is 1. Again, with this information, what happens when number is XORed with itself? As same bits will be set and reset in both operands of XOR, XOR will result in zero.

Also, zero XORed with an integer returns same integer and zero XORed with zero, returns zero.

With above information at hand, how can we solve our lonely integer problem? As we know, all numbers except from one are repeated even number of times. If they are XORed, they will end up as zero. Once all repeated numbers are zero, if non repeated number is XORed, that will return as the output right? Yes, it would.

## Find non-repeating element in array implementation

Input to program is two lines, first lines mentions how many elements are there in array and then actual elements in array.

Complexity of this solution to find single non repeating number is O(N).

Please share if something is wrong or missing in post. we would love to hear from you.

# Largest power of 2 less than given number

A number is said to be power of 2 if number can be represented as N = 2x where X is whole integer. Problem here is to find largest power of 2 which is less than a given number N. For example, if given number is 7, largest power of 2 less than 7 is 4 (22). Again for 18, answer would be 16.

Simplest solution is to start with v = 1, multiply v with 2 and check if resultant v is less than N or not. If it more than N, return v/2. Why? Because we multiplied once extra and step over N, since we need v less than N, we divide it again and hence v/2 will be greatest power less than N.

What is problem with above solution? Try it for N = 0x7fffffff, and loop does not terminates because of overflow.

A better approach is to put check in while loop condition itself. Following code takes care of overflow issue.

There is another interesting way to find greatest power of 2 less than N. It uses the binary representation of N. Our required number is V, such that it’s most significant bit is set and all rest bits are be zero. We start with V = N and propagate most significant bit to lower bits. We can do so by below method:

```V--
V | = V >> 1        This propagate MSB to it's neighbor.
V | = V >> 2        This propagates MSB and second MSB to next two bits
V | = V >> 4        Similarly for all other bits.
V | = V >> 8
V | = V >> 16
V++

V = V >> 1```

Let’s see how it works with an example:

```N = 100 V = 01100100  = 0000 0000 0110 0100
V-- = 0110 0011
V | = V >> 1  =  0110 0011 | 0011 0001 = 0111 0011
V | = V >> 2  =  0111 0011 | 0001 1100 = 0111 1111
V | = V >> 4  =  0111 1111 | 0001 1111 = 0111 1111
Actually now it does not matter as all will be one anyways

V++  = 0111 1111 + 0000 0001 = 1000 0000 = 128```

This number is actually greater than N, where as we want less than N. All we need to do is to divide V by two, or in bitwise operation right shift by 1 and we are done.

## Largest power of 2 implementation

This method solves another problem which is find largest power of 2 which is greater than given number N. All we need to do is not to divide by 2 at end.

Please share if something is wrong or missing. We would love to hear from you.

# If number is power of 2?

In last post we discussed how to get count of set bits in an integer. Today, we will discuss another interesting and frequently asked question : Find if given number is power of 2? A number N is called as power of two if N = 2x where x is whole integer. For example, 4 is power of 2 as 4 = 22, also 128 is also power of two as 128 = 27

If we have a look at binary representation of numbers which are power of two, we see a pattern.

```2 =  0000 0010
8 =  0000 1000
32 = 0010 0000```

See that all such numbers have only one bit set. Where as numbers which are not power of 2, more than one bits are set.

```3  = 0000 0011
9  = 0000 1001
33 = 0010 0001```

So, all we need to do is count number of set bits in number, if it is 1, then it’s power of 2 else it not. We already know how to find number of set bits and use it to get set bits and check it is 1. We are done.

There is one simple with while loop solution, notice that we can call a number not power of 2 as soon as we see more than one bit set. Exit the while loop as soon as you see second bit set. In worst case, you have to scan all 32 bits of number.

## Decrement and AND method to check if number is power of 2

There is another easier and more efficient method to solve this problem which is called as decrement and AND method.

As observed above, any number which has only one bit set will be power of 2. This set bit is the leftmost bit in the binary representation of number. What happens when we subtract 1 from this number? As per borrow rule of subtraction, borrowing will impact the only set bit in number and make it zero, while all rest bits become one. There resultant number’s binary representation is exactly opposite of the actual number (zeros are replaced by 1 and 1 by zero). What happens when we AND them? Result should be zero.

```N = 64 = 0x0100 0000
N-1 = 63 = 0x 0011 1111

N AND N-1 = 0000 0000```

So theory is if ANDing N with N-1 results in zero, number is power of 2. Is there a counter example? If there are more than one bits set in number, borrowing will be cut short in between and not end up at most significant set bit, which will not result in change of it to zero. While ANDing these two numbers, we will not get zero, but a non zero integer.

```N   = 34 = 0x0010 0010
N-1 = 33 = 0x0010 0001

N & N-1 = 0x0010 0000 = 32```

Make sure we have an additional condition to check for zero as that is also power of two  but above method works only for integers greater than 0.

Please share if you have any doubts about explanation or method, or you some additional method to check if a number is power of 2 or not?

# Number of set bits

Counting number of set bits in an integer is quite frequently asked interview problem. As we know, everything at the end is represented as bits (0 and 1) and so are integers. For example, 5 is represented as 101 where as 17 as 10001. Ask of the question is to count how many bits are set in binary representation of a number. Extending the example, 5 has 2 bits set where 16 as only one bit set. Number of set bits in a number in binary format is also called as population count.

## Number of set bits with iterative method

This is naive method, we look at each bit of number and increment count if it is set. Complexity of this method depends on number of bits representing number. In worst case, on a 32-bit word with only the most significant bit set, it will loop through 32 iterations.

## Brian Kernighan’s Algorithm

This is a bit optimized algorithm where one does not need to iterate over all bit bit but only through those which are set. Idea is to bitwise AND given number N and N-1 in each iteration till result is not zero. Keep count of number of iterations done and that will be number of bits set in integer.

## Divide and Conquer Strategy

This method is optimized than the previous one. Divide the problem into smaller ones and then solve them  to get result for the bigger one. We start counting number of set bits in every two bits, then adding them to get set bits in every four bits and then add them to get for every eights bits and so on.

How can we get number of set bit in every two bits of a given integer?

Let’s take an example and understand it. Say we have to count for 9. Binary representation of 9 is 1001. In 32 bit representation, if we split this number in chunks of two bits, it looks like:

`Ox00,00 00,00 00,00 10,01`

If we AND 9 with Ox55, what happens? it actually preserves all set bits which are at odd positions. Now, shift 9 by one bit and then again AND it with 0X55? By shifting by one, we lost the last bit and moved all set bits at even positions to odd position. If we AND it with ox55, we get all bits which are set at even positions. What if we add them up? We get pairs of two bits representing number of set bits in them.

```N    = Ox0000 0000 0000 1001
0X55 = Ox0000 0000 0101 0101

N AND OX55 = Ox0000 0000 0000 0001              (1)

N>>1 = Ox0000 0000 0000 0100
N>>1 AND Ox55 = 0000 0000 0000 0100             (2)

Result  = Ox0000 0000 0000 0101```

How can we add number of bits set in 2 bits to get total? Now, we have to add set bits in every four bits. AND it with OX33333333. Then shift integer by 2 positions and again AND it with Ox33333333, adding these two will give us number of bit sets in group of 4 bits.

```N          = 0x0000 0000 0000 0101
0X3333     = 0x0011 0011 0011 0011

N AND 0X3333 = 0x0000 0000 0000 0001                 (1)

N>>2       = 0x0000 0000 0000 0001
0x3333     = 0x0011 0011 0011 0011

(N>>2) AND 0x3333 =  0x0000 0000 0000 0001           (2)
Result  = 0x0000 0000 0000 0010```

Similarly, AND resulting integer and integer shifted by 4 bits with Ox0F0F0F0F and then adding them will give us number of bits set in every eight bit.

Move it by 8 bit and add integer to itself, it gives count per 16 bit and then move it by 16 bit and add it to itself given count in 32 bit. This count is stored at last 6 bits (max value being 32), AND result with Ox003F and you have your answer.

Usually, there are inbuilt functions and instructions in processor (like x86’s popcnt) to do this job and it is rarely done at application level, however it is good to know.

Please share if there is some thing wrong or missing in this post, we would love to correct it. Sharing is caring.

## Replace node with sum of all greater node in binary search tree

Today, we will see a problem, solution to which uses traversal of tree in a bit different manner. Problem statement is : Replace a node with sum of all nodes which are greater than it. First question which comes to mind is : what should be done with rightmost child of tree? Should it remain as such or replaced with zero as there is node greater than it? Based on clarification, we can design our solution. In our case, for simplicity, we will replace right most node with zero.

Example : Original tree

After replace nodes with sum of all greater nodes.

How can we solve it? Easiest way is to do inorder traversal of BST and store that into an array. Now, starting from end of array, replace element of array with sum of all elements on the right. With bit of trick, this algorithm works in O(N) time, however, uses two traversals of tree and O(N) extra space.

We know that inorder traversal of BST gives nodes in sorted in ascending order. Since, we need to visit greater nodes before the node in order to calculate their sum, can we just have nodes of BST visited in sorted but in descending order.

How about reversing the inorder traversal itself? In this, traverse right subtree first, then node and at last left subtree. When right subtree is traversed for a node, we are actually traversing all node which are greater that node, all we need to do is return sum from that traversal. While traversing left subtree, we have to just propagate sum of right subtree and node.

Solution can be implemented using recursion with base case being right most node of tree. If we reached right most tree,  replace it with zero and return the value of the node to parent.

## Implementation of replacing a node with sum of all greater nodes

Complexity of algorithm to replace node with sum of all nodes greater than it, is O(N) with no extra space.

# subtree of binary search tree

We learned basics of binary search tree and it’s properties and attributes. Today, we will discuss an interesting problem : to find if a given BST is subtree of binary search tree? When does tree becomes subtree of binary search tree? When  all nodes in smaller tree are present in bigger one and in the same order, then we call smaller tree as subtree of bigger tree.

Below figures define what qualifies as subtree and not a subtree

To find if a tree is subtree of a given tree, we have to find where does subtree lies in bigger tree. Start with roots of both trees and check nodes are equal, if yes, check if left subtree and right subtree of smaller tree are identical to left subtree and right subtree of bigger tree.

If nodes are not equal or any of the subtree is not identical, we move down to left or right of bigger tree and start the process again.

In this procedure, we are comparing root first, then left subtree and in last right subtree. It’s a preorder traversal. There are two preorder traversal involved. One of the bigger tree for selecting the candidate root and other of both trees simultaneously to check if they are identical. While traversing, instead of printing nodes, compare them and see if they are identical or not. Please refer : how to check if two trees are identical or not?

## Algorithm to find if tree is subtree of binary search tree

```1. Start from root of tree1 and tree2 call them currentNodes
2. If both are null, return true
3. Check if currentNodes are identical?
3.1 Check if data in currentNodes of both tree is equal.
3.2 Recursively, check if left and right subtree are identical
3.3 If 3.1 and 3.2 are true, return true else false
4. If currentNodes are not identical
4.1 Move down to left subtree of tree1 and repeat 3
4.2 Move down to right subtree of tree1 and repeat 3```

Below is the modified preorder traversal of tree which does the trick for us.

## Subtree of binary search tree : Implementation

```#include<stdio.h>
#include<stdlib.h>
#include<math.h>

struct node{
int value;
struct node *left;
struct node *right;
};
typedef struct node Node;

#define true 1
#define false 0
#define MAX(a,b)  (a < b ? b : a)

int isIdenticalBST(Node *node1, Node *node2){
if(!node1 && !node2) return true;

if(!node1 || !node2) return false;

return (node1->value==node2->value
&& isIdenticalBST(node1->left, node2->left)
&& isIdenticalBST(node1->right, node2->right));
}

int isSubtree(Node *node1, Node * node2){
if(!node1) return false;
if(!node2) return true;

if (isIdenticalBST(node1, node2)) return true;

return isSubtree(node1->left, node2)
|| isSubtree(node1->right, node2);
}

Node *createNode(int value){
Node * newNode =  (Node *)malloc(sizeof(Node));
newNode->value = value;
newNode->right= NULL;
newNode->left = NULL;

return newNode;
}
if(!node){
return createNode(value);
}
if (node->value > value)
else

return node;
}

/* Driver program for the function written above */
int main(){
Node *root = NULL;
//Creating a binary tree

Node *root2 = NULL;
int height = 0;
printf( "Is tree subtree : %s",
isSubtree( root, root2 ) ? "Yes" : "No" );

return 0;
}
```

Complexity of algorithm to find if a tree is subtree of binary search tree is O(mn) where m and n are nodes in smaller and bigger tree.

There is another way, it can be done in O(n) complexity where inorder and preorder traversals of two trees are stored in arrays. Check if inorder traversal of smaller tree is subarray of inorder traversal of larger tree and check if preorder traversal of smaller tree is subarray of preorder traversal bigger tree.

If above conditions are true, tree is subtree of larger tree else not. However, we require O(n) extra space.

Why we need to compare two traversals and not only one? Because at least two traversals of tree are required to uniquely define a binary tree.