Word count problem using MapReduce

Word count problem

The problem at hand is simple: we have to count the number of occurrences of each word in a given file. This is called the word count problem. For example, if the file is as below,

input.txt
Hello world. World is so beautiful. I am so busy today, not hello, world is also busy today and tomorrow. Say hello to your beautiful wife.

The output should look like this, listing the number of times each word appears in the file (counting words case-insensitively):

output.txt
hello - 3 world - 3 is - 2 so - 2
beautiful - 2 I - 1 am - 1 busy - 2 today - 2
tomorrow - 1 not - 1 also - 1 and - 1 say - 1
to - 1 your - 1 wife - 1

Word count problem at small scale

The word count problem can be easily solved using a hash table. Keys of the hash are words and values are their counts in the file. Each word is read from the file and checked for presence in the hash: if the word is already present, increment its count, else add the word to the hash with a count of 1.
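As a concrete illustration, here is a minimal single-machine sketch in C, assuming lower-case, whitespace-separated words up to 63 characters; the table size and hash function are arbitrary choices, and collisions are handled by chaining:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024

typedef struct Entry {
    char word[64];
    int count;
    struct Entry *next;        /* chain for collisions */
} Entry;

static Entry *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 5381;         /* djb2-style string hash */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* increment count of word, inserting it with count 1 if absent */
static void countWord(const char *word) {
    unsigned h = hash(word);
    for (Entry *e = table[h]; e; e = e->next)
        if (strcmp(e->word, word) == 0) { e->count++; return; }
    Entry *e = malloc(sizeof(Entry));
    strcpy(e->word, word);
    e->count = 1;
    e->next = table[h];
    table[h] = e;
}

int main(void) {
    char word[64];
    while (scanf("%63s", word) == 1)   /* read words from stdin */
        countWord(word);
    for (int i = 0; i < TABLE_SIZE; i++)
        for (Entry *e = table[i]; e; e = e->next)
            printf("%s - %d\n", e->word, e->count);
    return 0;
}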

What are the limitations of the above solution? First, it is apparent that extra space is required for storing the hash. If the input set is very large with mostly unique words, the space requirement is of order N.

Second, while processing words, we have to bring the entire file into memory and process each word serially. The complexity is O(N), where N is the number of words in the file.

Word count problem at large scale

The hash-based solution described above works for smaller inputs, where input and output can live in memory at the same time and parallelism is not required. What if we have to process GBs or TBs of data? Will the above method scale?

The apparent thing to do at scale is divide and conquer. We divide the input set into manageable smaller sets, work on them to generate intermediate outputs, and at last combine the smaller outputs to produce the output for the original problem. This approach gives us two advantages: we don't need to bring the entire input into memory at once, and the smaller tasks can run in parallel (assuming, for simplicity's sake, that the output of one task does not depend on the output of another).

There is a method called MapReduce which uses this divide and conquer approach to solve such problems at scale.

Map function
The map function performs processing on data items, where the processing of each item is independent of the others. For example, if the sum of squares of a list of numbers is to be found, the map function finds the square of each number in the list. Finding the square of one number is completely independent of the others.

[Figure: map function]

Reduce function
The reduce function works on the output generated by the map function and combines it to produce the final output. The significant difference between the map and reduce functions is that reduce may not process data items independently. In the above example, the reduce function takes the output of the map function, that is, the square of each number, and adds them up.
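To make the two roles concrete, here is a tiny single-machine sketch of the sum-of-squares example; map and reduce here are ordinary C functions that only stand in for their distributed counterparts:

#include <stdio.h>

/* map step: apply f to every element independently */
static void map(const int *in, int *out, int n, int (*f)(int)) {
    for (int i = 0; i < n; i++)
        out[i] = f(in[i]);
}

static int square(int x) { return x * x; }

/* reduce step: combine the mapped values into one result */
static int reduce(const int *in, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += in[i];
    return sum;
}

int main(void) {
    int numbers[] = {1, 2, 3, 4};
    int squares[4];
    map(numbers, squares, 4, square);
    printf("sum of squares = %d\n", reduce(squares, 4));  /* prints 30 */
    return 0;
}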

[Figure: word count problem implementation]

Since the map function works on data items independently, it is a candidate for parallelism. Multiple map functions can run on multiple servers over independent data sets and generate key-value pairs, which are then consumed by reduce functions for further processing.
In the same way, more than one reduce function can be invoked on more than one server, each running on a different batch of map output, where the records within a batch are related to each other.

It is important to note that map and reduce instances do not talk to each other, hence no data sharing can happen between any two instances of them.

Coming back to our word count problem, the mapper function reads a batch of words, generates a tuple (word, 1) for each, and sends it to the reducer, which increments the count for that word. The diagram below explains it.

[Figure: word count problem example]

There is one problem with this: there is only one reducer, and if the tuples for all words go to the same reducer, we end up storing all words in a single machine's memory. Also, a single machine processes all the words, so we lose the benefit of the mappers' parallelism as the reducer becomes a bottleneck. How can we have multiple reducers and avoid both of these problems?

If we group words by their first letter and send them to different reducers, we can run 26 different reducers, each catering to a small set of words, running in parallel and producing output.

[Figure: word count problem]

Another method is to group all occurrences of the same word together and send them to the same reducer. Now a reducer does not have to store the words; all it needs to do is count. This group-by step can be an intermediate step between map and reduce. A sketch of both routing schemes follows.
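Here is a minimal sketch of the routing decision, assuming lower-case ASCII words; the hash version sends all occurrences of the same word to the same reducer and spreads load more evenly than first letters, which are skewed in real text:

#include <stdio.h>
#include <ctype.h>

/* route by first letter: 26 reducers, one per letter a..z */
int reducerByLetter(const char *word) {
    return tolower((unsigned char)word[0]) - 'a';
}

/* route by hash: the same word always lands on the same reducer */
int reducerByHash(const char *word, int numReducers) {
    unsigned h = 5381;
    while (*word) h = h * 33 + (unsigned char)*word++;
    return (int)(h % (unsigned)numReducers);
}

int main(void) {
    printf("%d %d\n", reducerByLetter("hello"), reducerByHash("hello", 26));
    return 0;
}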

Code snippets

The original mapper, reducer, and driver snippets shown here were copied from the Hadoop tutorial.
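Since those snippets are not reproduced here, below is a single-process C sketch of the same pipeline, assuming a toy "framework" that collects emitted (word, 1) pairs, groups them by sorting (the shuffle step), and hands each group to the reducer; emit, mapLine, and reduceWord are illustrative names, not a real Hadoop API:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024

typedef struct { char word[64]; int value; } Pair;
static Pair pairs[MAX_PAIRS];
static int npairs = 0;

/* mapper side: emit a (word, 1) pair into the framework */
static void emit(const char *word, int value) {
    if (npairs < MAX_PAIRS) {
        strcpy(pairs[npairs].word, word);
        pairs[npairs].value = value;
        npairs++;
    }
}

/* mapper: emits (word, 1) for every whitespace-separated word */
static void mapLine(char *line) {
    for (char *w = strtok(line, " \t\n"); w; w = strtok(NULL, " \t\n"))
        emit(w, 1);
}

/* reducer: receives one word with all its values, prints the sum */
static void reduceWord(const char *word, const int *values, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += values[i];
    printf("%s - %d\n", word, sum);
}

static int cmpPair(const void *a, const void *b) {
    return strcmp(((const Pair *)a)->word, ((const Pair *)b)->word);
}

int main(void) {
    char line[] = "hello world world is so beautiful hello";
    mapLine(line);

    /* shuffle: sorting groups identical words together */
    qsort(pairs, npairs, sizeof(Pair), cmpPair);

    /* hand each group of identical words to the reducer */
    for (int i = 0; i < npairs; ) {
        int values[MAX_PAIRS], j = i;
        while (j < npairs && strcmp(pairs[j].word, pairs[i].word) == 0) {
            values[j - i] = pairs[j].value;
            j++;
        }
        reduceWord(pairs[i].word, values, j - i);
        i = j;
    }
    return 0;
}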

This is how we design solutions for the word count problem at large scale. Please share if there is something wrong or missing.

Unique anagrams in file problem

Unique Anagrams in file

What are anagrams? If you have any doubt, please refer to the post: Find if two strings are anagrams. Today's problem builds on finding whether one string is an anagram of another. Given a file or list of words, we have to find the unique anagrams in that file; that is, there can be many permutations of the same characters, and we have to discard all but one. For example, if the file contains the following words,

cider ether there three cried dicer heaps phase shape manes manse means

Output should be

cider ether heaps manes

Sometimes the question is tweaked and asked like this:

given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other.

Before solving for unique anagrams, let's solve the problem of finding anagrams, and then find uniqueness by bookkeeping. There are two basic steps. First, canonicalize each word so that words can be compared with each other. Second, identify whether the word just seen has already been seen, that is, whether it is a unique anagram or not.

First step : Find if two words are anagrams

How to find if two strings are anagrams? The first method is to sort both strings by character and then compare them for equality. If they are equal, the strings are anagrams of each other; otherwise they are not.
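A minimal sketch of the sorting approach, assuming words shorter than 64 characters:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmpChar(const void *a, const void *b) {
    return *(const char *)a - *(const char *)b;
}

/* returns 1 if a and b are anagrams, 0 otherwise */
static int areAnagrams(const char *a, const char *b) {
    char sa[64], sb[64];
    if (strlen(a) != strlen(b) || strlen(a) >= sizeof(sa))
        return 0;
    strcpy(sa, a);
    strcpy(sb, b);
    qsort(sa, strlen(sa), 1, cmpChar);  /* sort both copies by character */
    qsort(sb, strlen(sb), 1, cmpChar);
    return strcmp(sa, sb) == 0;
}

int main(void) {
    printf("%d\n", areAnagrams("cider", "cried"));  /* 1 */
    printf("%d\n", areAnagrams("cider", "three"));  /* 0 */
    return 0;
}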

There is another method which is quicker and does not require sorting the strings. This method is based on prime factorization.

Check if strings are anagrams using prime factorization

There is a mathematical proof, the fundamental theorem of arithmetic, that every number decomposes into a unique set of prime factors; it is not possible to add or take away any factor without changing the number (see the prime factorization wiki). How can we use this information to solve our problem?

Assign a unique prime number to each character of the alphabet (considering only lower case). Now, for each word, take the prime number corresponding to each character and multiply them together. If the products for two words are the same, the words are anagrams.
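A sketch of this prime signature, assuming lower-case ASCII words; note the overflow caveat discussed next:

#include <stdio.h>

/* one prime per letter a..z */
static const unsigned long long primes[26] = {
     2,  3,  5,  7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
    43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
};

static unsigned long long signature(const char *word) {
    unsigned long long product = 1;
    for (; *word; word++)
        product *= primes[*word - 'a'];  /* may overflow for long words */
    return product;
}

int main(void) {
    printf("%d\n", signature("cider") == signature("cried"));  /* 1 */
    printf("%d\n", signature("cider") == signature("three"));  /* 0 */
    return 0;
}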

This is a fast method; however, there are pitfalls one must be aware of. Multiplication of integers can easily overflow. For 26 characters we need at least 26 unique prime numbers, and even if we choose the lowest 26 primes, overflow may occur as soon as we see words longer than 9 characters.

The second problem with the prime factorization method is that the mapping between input and output space is skewed. There are infinitely many possible words, while on a 64-bit machine we can represent at most 2^64 distinct products. However, with a limited number of words of restricted length, this method works well; otherwise, fall back to the sorting method discussed earlier.

Now we know how to find whether two words are anagrams. All we need to do is scan the file or list of words, apply the above method, and eliminate all but one word of each anagram group.

Second step : To identify unique anagrams

How do we easily identify that we have already seen something before? Simply hash it: if it has already been seen, hashing will cause a collision. Hashing is an O(1) solution, but what about the space requirement? Since we don't know how many strings are unique, we have to allocate memory for the worst case, which is O(N). Can we optimize on the space front?
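As an illustration, here is a sketch that combines both steps on the example list above: sort each word to get a canonical key and keep the keys seen so far in a small chained hash set (table size and hash function are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024

typedef struct Seen { char key[64]; struct Seen *next; } Seen;
static Seen *seen[TABLE_SIZE];

static int cmpChar(const void *a, const void *b) {
    return *(const char *)a - *(const char *)b;
}

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* returns 1 the first time a key is seen, 0 on every later sighting */
static int markSeen(const char *key) {
    unsigned h = hash(key);
    for (Seen *e = seen[h]; e; e = e->next)
        if (strcmp(e->key, key) == 0) return 0;
    Seen *e = malloc(sizeof(Seen));
    strcpy(e->key, key);
    e->next = seen[h];
    seen[h] = e;
    return 1;
}

int main(void) {
    const char *words[] = {"cider", "ether", "there", "three",
                           "cried", "dicer", "heaps", "phase",
                           "shape", "manes", "manse", "means"};
    for (int i = 0; i < 12; i++) {
        char key[64];
        strcpy(key, words[i]);
        qsort(key, strlen(key), 1, cmpChar);  /* canonical form */
        if (markSeen(key))
            printf("%s ", words[i]);  /* prints: cider ether heaps manes */
    }
    printf("\n");
    return 0;
}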

How about binary search tree based storage? Finding whether a word is already present and inserting it can be done together in O(log N) if the tree is near balanced, and we save a lot of space, as nodes are allocated only as required. The drawback is that if the unique anagram words arrive in ascending or descending order (as when processing a dictionary), the tree degenerates and search complexity goes to O(N) in the worst case.

The third data structure, which can save space as well as limit time complexity, is the trie. Basics of the trie can be learned here: Trie basics.

Search complexity reduces to the length of the string being searched for, and the space required depends on the longest word in the list. This method also works with the prime factorization approach, by converting the product into a string.

Now we have solved the problem for a reasonable size of input. What about an insane size, like billions of words to be processed? Can we distribute the work? Can our solution be implemented on a distributed architecture, and if yes, how?

First of all, there are two basic steps in the algorithm, so we can easily make a pipeline of two stages. The first stage canonicalizes each word (by sorting it), while the second stage does the comparison.

[Figure: find unique anagrams]

The sorting stage can have multiple processes, each working on a subset of the input, sorting the words and passing them to the comparator. If we look closely, the comparator becomes space-expensive, as we have to store each unique word. In the worst case, we may hit the memory limit at the comparator.

We can have multiple comparators as well, each comparing a specific subset of the output produced by stage 1. The classification can be as simple as the first character of the word or as complex as a range of the hash of the word. This method distributes the data over multiple machines, which then produce the final output. Take care that the distribution is mutually exclusive.

[Figure: unique anagrams example]

I have tried to explain the problem of finding unique anagrams in a file from scratch, taking it to a real-life production scale, along with the various design and data structure decisions one has to make to solve it. Please let me know if something is missing or wrong. Sharing is caring.

Tinyurl system design and implementation

Design tinyurl system

Designing a tinyurl system is a very frequently asked question in design interviews. The system converts a long URL into a short URL; typically people use it when they post on platforms with a limit on the number of characters, like Twitter.

This system is a perfect use case for a hash table or hash map. Hash tables are data structures which store key-value pairs, where the lookup operation for a given key is O(1). So, given a key, its corresponding value can be found in constant time.

As mentioned above, a tinyurl system creates a small url for the usual long url; hence the short url is our key and the long url is our value. Now, how long can the short url be? It depends. Say we create three-character short urls. As the allowed characters in a url are a-z, A-Z, and 0-9, there are 62 possibilities for each of the three places, which can represent 62^3 (approx. 238,000) urls. If urls never expire and more urls need to be shortened, the system has to move to 4 characters, which will eventually hit its limit too, and the size increases to 5 characters, and so on.

Well, let's talk design now. The following steps are a must to design a tinyurl system:

1. Take the url and hash it. Define a good hash function which has a very low collision rate. (What are good hash functions?)
2. Store the entry in the hash table with the tiny url as key and the long url as value.
3. When the tinyurl is accessed, look up hash[tinyurl] and redirect to that page.

However, it is worth mentioning that normal hashing functions will not work here. The keys they generate are usually too long; for example, the MD5 hash function generates a 128-bit digest. Our system should instead use some kind of permutation scheme which incrementally goes through the permutations of characters and assigns one to each url.

Design tinyurl system : mapping function

One thing we need to think through in more detail is how to implement the hash: given a tinyurl, it should fetch the correct index in the hash, and at that index we have our long URL stored. Our tinyurl consists of alphanumeric characters [A-Z][a-z][0-9], a total of 62 possibilities per position. If we store the long URL in a simple index-based hash and convert that index to base 62, every digit in base 62 gives us a character. For example, take the index to be 134; in base 62 it can be represented as

134 = 2 × 62^1 + 10 × 62^0

So the digits which we get are [2, 10]. Now, how do we get the character corresponding to each digit? Create a table such that

table[0] ='A'
table[1] ='B'
table[2] ='C'
...
table[26] ='a'
table[27] ='b'
....
table[52] ='0'
table[53] ='1'
...
table[61] ='9'

Look up each digit in this table to get the corresponding character; here table[2] = 'C' and table[10] = 'K', so index 134 maps to the short key CK. The other problem which needs to be solved is the reverse lookup: given a tinyurl, how do we get the index in the hash table?

To convert, take each character in the url and do a reverse lookup in the table. Resolve the position of each character in the table and convert those digits into a base-10 number. For example, if the tinyurl is a93, we look up the table and find that the corresponding positions are [26, 61, 55], so the decimal conversion is:

a93 (base 62) = [26, 61, 55] = 26 × 62^2 + 61 × 62^1 + 55 × 62^0 = 99,944 + 3,782 + 55 = 103,781

Go and find the long url at index 103,781. This is a good mapping because it is a bijective function: a function f(n) such that for any i and j with i != j, f(i) != f(j); hence there is no chance of collision.
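Here is a sketch of both directions of the mapping, with the 62-entry table flattened into a string (table[0] = 'A' through table[61] = '9', as above):

#include <stdio.h>
#include <string.h>

static const char table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

/* convert a table index to its base-62 key, e.g. 103781 -> "a93" */
static void encode(unsigned long index, char *buf) {
    char tmp[16];
    int n = 0;
    do {
        tmp[n++] = table[index % 62];
        index /= 62;
    } while (index > 0);
    for (int i = 0; i < n; i++)   /* digits came out least significant first */
        buf[i] = tmp[n - 1 - i];
    buf[n] = '\0';
}

/* convert a base-62 key back to its index, e.g. "a93" -> 103781 */
static unsigned long decode(const char *key) {
    unsigned long index = 0;
    for (; *key; key++)
        index = index * 62 + (unsigned long)(strchr(table, *key) - table);
    return index;
}

int main(void) {
    char key[16];
    encode(103781, key);
    printf("%s\n", key);             /* a93 */
    printf("%lu\n", decode("a93"));  /* 103781 */
    return 0;
}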

Below figure shows the entire design concept.

[Figure: design tinyurl system]

Other insights

These are very good points which I read on Stack Overflow, in an answer by Abel. They are worth mentioning, as they will earn points for you in an interview.

Whenever someone clicks on a tiny url, an HTTP request is sent to its server. Remember that the key generated above is the last part of the real url; for example, in a tinyurl like biturl.in/aDffg1, only aDffg1 is our key, while biturl.in routes the request to the designated server. So one design aspect of the system is to set up a server behind the given domain.

Now, the server looks up the corresponding long url and uses HTTP 302, a temporary redirect, to send the user to the actual url. The most important part Abel talks about is this, and I quote:

If you were to use files or first load HTML and then redirect, the browser would add TinyUrl to the history, which is not what you want. Also, the site that is redirected to will see the referrer (the site that you originally come from) as being the site the TinyUrl link is on (i.e., twitter.com, your own site, wherever the link is). This is just as important, so that site owners can see where people are coming from. This too, would not work if a page gets loaded that redirects.

PS: there are more types of redirect. HTTP 301 means: redirect permanent. If that would happen, the browser will not request the bit.ly or TinyUrl site anymore and those sites want to count the hits. That’s why HTTP 302 is used, which is a temporary redirect. The browser will ask TinyUrl.com or bit.ly each time again, which makes it possible to count the hits for you (some tiny url services offer this).

A good explanation to talk about in an interview is given in the Coding Horror blog entry on tinyurl. So this is what you should cover when somebody asks you to design a tinyurl system in a design round. Please share your views on this in the comments. You can like the page for more updates and interview experiences.

N-ary tree implementation in C

N-ary tree : Way to store hierarchical data

Till now, we have learned about the binary tree and the binary search tree. There is one more form of tree: the N-ary tree. As the name suggests, an N-ary tree is a tree where each node has up to N children. Advanced implementations of N-ary trees allow a different N for each node of the tree; the tree we implement here will be one such tree. The figure below shows an N-ary tree.

[Figure: N-ary tree]

What are the usages or applications of an N-ary tree? One application is to store hierarchical information such as organization charts, store inventories, comment threads with replies, etc.
Let's see what a node of an N-ary tree looks like:
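Below is a minimal sketch of the node definition, reconstructed from the description that follows (the includes also cover the later snippets in this post):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct Node {
    void *data;              /* pointer to caller-defined data */
    int nChildren;           /* number of children attached */
    struct Node **children;  /* array of pointers to child nodes */
} Node;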

 

In this definition, there is a void pointer representing the data; this allows us to use the same tree for various data types, something close to a C++ template class. This pointer holds the data information. After that, there is an integer nChildren which tells how many children this node has. Last, we have a pointer to pointer, which is nothing but an array of pointers to all the children of this node.

Create a node in N-ary tree

To create a node, we need to pass two parameters: first, the number of children this node has (if we don't know, we can always pass 0; later we will see how to append a child to a parent node), and second, a pointer to the data. The create node function looks like this:
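A sketch of such a create function, building on the Node struct above (the name createNode is illustrative):

Node *createNode(int nChildren, void *data) {
    Node *node = malloc(sizeof(Node));
    node->data = data;
    node->nChildren = nChildren;
    node->children = nChildren ? malloc(nChildren * sizeof(Node *)) : NULL;
    for (int i = 0; i < nChildren; i++)
        node->children[i] = NULL;   /* slots filled as children are attached */
    return node;
}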

A pointer to the data, for example an employee record, can be created as follows.
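For instance, a hypothetical employee record could be allocated like this and passed as the data pointer:

typedef struct Employee {
    char name[32];
    int id;
} Employee;

Employee *createEmployee(const char *name, int id) {
    Employee *e = malloc(sizeof(Employee));
    strncpy(e->name, name, sizeof(e->name) - 1);
    e->name[sizeof(e->name) - 1] = '\0';   /* ensure termination */
    e->id = id;
    return e;
}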

Delete N-ary tree

Deletion of the tree happens in a similar fashion as for a binary tree: delete the children first and then the parent node. In an N-ary tree, we have multiple child nodes, so we go through all of them one by one and delete each. Once all child nodes are deleted, delete the parent node. The code to delete a tree is below.
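A sketch of the recursive deletion, assuming the tree owns the data pointers:

void deleteTree(Node *node) {
    if (node == NULL) return;
    for (int i = 0; i < node->nChildren; i++)
        deleteTree(node->children[i]);   /* delete children first */
    free(node->children);
    free(node->data);                    /* assumes tree owns the data */
    free(node);
}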

Append a child in N-ary tree

To append a child to a parent node, we need to insert the child's pointer into the parent's array of children. However, we have already allocated that array based on the count provided when the node was created. How can we grow it? There is a function in C called realloc, which takes two parameters, the old pointer and the new size to allocate, and returns a new pointer for the resized block. Most of the time the new pointer points to the same memory location and the old one is done away with; however, that is not guaranteed, and the new pointer may point to a completely different memory location. Note that realloc itself copies the existing contents to the new location, so never access the old pointer after the call.
We will use this realloc function to grow our children array and add the new child at its end. The code is shown below.
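A sketch of appending a child using realloc:

void appendChild(Node *parent, Node *child) {
    Node **grown = realloc(parent->children,
                           (parent->nChildren + 1) * sizeof(Node *));
    if (grown == NULL) return;            /* allocation failed, keep old tree */
    parent->children = grown;             /* may be a new memory location */
    parent->children[parent->nChildren++] = child;
}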

Delete a child

To delete a node, we first need to find it: scan through the tree and locate the node in its parent's children array, then remove the node's entry from that array.

Before deleting a node, clarify with the interviewer or requirement gatherer what happens to the children of the node. If the children have to move up to the parent of the deleted node, do that before deleting it.

If the children have to be deleted along with the node, recursively delete the tree rooted at that node using the method described under 'Delete N-ary tree'. A sketch of removing the entry from the parent's array follows.
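What happens to the removed subtree (moved up or deleted) is handled by the caller as discussed above; removing the entry itself is just a shift:

void removeChildAt(Node *parent, int idx) {
    for (int i = idx; i < parent->nChildren - 1; i++)
        parent->children[i] = parent->children[i + 1];  /* shift left */
    parent->nChildren--;
}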

Searching for a node in an N-ary tree is a costly operation. To improve search, we can maintain a hash which keeps track of each node's pointer, keyed by some node identifier. Whenever we want to delete a node, we take its pointer from the hash and delete it. Similarly, while inserting a node, use the hash to locate the parent pointer.

Update a child in N-ary tree

Using the hash mentioned before, locate the node's pointer and update its information, or add a new node to the children array of its parent. Simple!

To print an N-ary tree in level order

No big deal here; it is the same as binary search tree level order traversal. Use a queue: for each visited node, add all of its children to the queue, then take the next node out of the queue. Repeat until the queue is empty. See also: Level order traversal of N-ary tree implementation.
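A sketch of the traversal with an array-backed queue, assuming at most 128 nodes and Employee data as in the earlier sketch:

void levelOrder(Node *root) {
    Node *queue[128];
    int head = 0, tail = 0;
    if (root) queue[tail++] = root;
    while (head < tail) {
        Node *node = queue[head++];                 /* dequeue next node */
        printf("%s\n", ((Employee *)node->data)->name);
        for (int i = 0; i < node->nChildren; i++)   /* enqueue children */
            queue[tail++] = node->children[i];
    }
}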

 

The above N-ary tree structure can be used to represent an organizational structure where an employee can be a CEO, MANAGER, or EMPLOYEE. The CEO is at the first level, MANAGERs at the second level, and EMPLOYEEs at the last level. The figure shows the layout of the N-ary tree for an organizational tree:

[Figure: n-ary tree implementation]

So this is about the N-ary tree; use it wherever you need to store hierarchical data.

Please share if there is something wrong or missing. If you are interested in contributing to the website or have an interview experience to share, please contact us.

Design and implement Least Recently Used cache

Implement Least Recently Used (LRU) cache

Before discussing the implementation of a least recently used cache, let's understand what a cache is. In plain computer architecture terms, a cache is a small buffer of pages the OS maintains in order to avoid more expensive main memory accesses. Caches are faster than main memory; however, they are smaller in size. Therefore, there is a probability that pages are swapped in and out of the cache. If a page is not found in the cache and main memory is accessed to fetch it, it's a cache miss. The page is then brought into the cache, and the next time it is accessed, it is served from the cache.

What if there is no space left in the cache when a cache miss occurs? The new page has to be swapped in for one of the pages already in the cache. How do we decide which page goes out of the cache, so that the increase in cache misses is minimal? There are many approaches (eviction methods) to decide which page goes out to make space for the new page, like First In First Out, Least Recently Used, Least Frequently Used, etc.

What is a least recently used cache?

In the 'First In First Out' approach, the OS selects the page which has been in the cache the longest and swaps it with the new page. In the 'Least Recently Used' approach, the OS selects the page which was not accessed for the longest period of time. In the 'Least Frequently Used' approach, the OS selects the page which has been accessed the least number of times up to a given point in time.

In this post, we concentrate on the Least Recently Used approach and implement it.

An LRU cache is similar to first in first out (FIFO) storage, where pages which came in first are evicted first. The difference between FIFO storage and an LRU cache comes from the fact that when a page is accessed again, it is moved to the end.

If a page entered the cache first, it is the first candidate to go out, provided it is not accessed again before the cache fills up and a cache miss happens. FIFO is best implemented using a queue.

Let's take an example. We have a cache of size 4, and the order of accessing pages is 1, 2, 3, 4, 5, 2, 6, 6.

The first 4 steps would be:
[Figure: implement least recently used cache]

Now comes page 5: it's a cache miss, so it needs to be brought in from memory, and there is no space in the cache. We evict the least recently used page, which sits at the front of the queue. Remove page 1, as it was the least recently used among the four pages in the cache.

[Figure: design LRU cache]
When page 2 is accessed, it's a cache hit, so we do not swap anything here.

Page 6 is accessed; it's a cache miss, and the cache is already full. If, as per the previous approach, we swap out the page at the front of the queue, the wrong page will be evicted, as page 2 is the most recently used.

How can the eviction of page 2 be avoided? Remove it from its current position and put it at the tail of the queue when it is accessed. That leaves page 3 at the front when page 6 is accessed, and hence the correct page is removed, as page 3 is now the least recently used page.
[Figure: least recently used cache design]
Insertion and removal operations are O(1) with a doubly linked list implementation of the queue. With a singly linked list, deletion may be O(N), because we have to walk the list to find the previous node.

There is still a problem: how do we check for a cache hit or miss efficiently? Scanning through the whole queue searching for a page would be O(N). How can we make lookup fast? Lookups are best optimized using hash tables.

Keep a hash table with the page number as key and a pointer to the node in the queue as value. With the hash table, any page can be looked up in the queue in O(1).

Putting it all together, the LRU cache design works as follows:

1. If the cache has a free slot, add the page entry at the end of the queue.
2. If the page is already in the cache (a cache hit), remove it from its present location and add it to the end of the queue.
3. If the cache is full and it's a cache miss, remove the page at the front and insert the new page at the end of the queue.
4. To check hit or miss, use the hash table.

Execution Trace

page 1: miss, add to cache
page 2: miss, add to cache
page 3: miss, add to cache
page 4: miss, add to cache
page 5: miss, cache full, evict page 1
page 2: hit, move to end of queue
page 6: miss, cache full, evict page 3
page 6: hit
So, we saw how a least recently used cache is implemented using a queue and a hash.
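Below is a minimal sketch of the whole design in C, assuming integer page numbers in a small range so that a plain array can play the role of the hash table; the doubly linked list keeps the least recently used page at the front and the most recently used at the tail:

#include <stdio.h>
#include <stdlib.h>

#define CACHE_SIZE 4
#define MAX_PAGES  128

typedef struct QNode {
    int page;
    struct QNode *prev, *next;
} QNode;

static QNode *front = NULL, *tail = NULL;  /* front = LRU, tail = MRU */
static QNode *nodeOfPage[MAX_PAGES];       /* the "hash": page -> node */
static int used = 0;

static void unlinkNode(QNode *n) {
    if (n->prev) n->prev->next = n->next; else front = n->next;
    if (n->next) n->next->prev = n->prev; else tail = n->prev;
}

static void pushTail(QNode *n) {
    n->prev = tail;
    n->next = NULL;
    if (tail) tail->next = n; else front = n;
    tail = n;
}

static void accessPage(int page) {
    QNode *n = nodeOfPage[page];
    if (n) {                        /* hit: move to most recently used */
        unlinkNode(n);
        pushTail(n);
        printf("page %d: hit\n", page);
        return;
    }
    if (used == CACHE_SIZE) {       /* miss on a full cache: evict front */
        QNode *lru = front;
        unlinkNode(lru);
        nodeOfPage[lru->page] = NULL;
        printf("page %d: miss, evict page %d\n", page, lru->page);
        free(lru);
        used--;
    } else {
        printf("page %d: miss\n", page);
    }
    n = malloc(sizeof(QNode));      /* bring the new page in at the tail */
    n->page = page;
    pushTail(n);
    nodeOfPage[page] = n;
    used++;
}

int main(void) {
    int accesses[] = {1, 2, 3, 4, 5, 2, 6, 6};
    for (int i = 0; i < 8; i++)
        accessPage(accesses[i]);
    return 0;
}

Running it on the access sequence 1, 2, 3, 4, 5, 2, 6, 6 reproduces the execution trace above.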

Please share if there is something wrong or missing. We would love to hear from you.