Word count problem using map reduce

Word count problem

Problem at hand is very simple, we have to count number of occurrences of words in a given file. This is called as word count problem. For example if file is as below,

input.txt
Hello world. World is so beautiful. I am so busy today, not hello, world is also busy today and tomorrow. Say hello to your beautiful wife.

Output should look like this, which mentions number of time a word appears in file.

output.txt
Hello - 3 World - 3 is - 2 so -2
beautiful - 2 I - 1 am - 1 busy - 2 today - 2
tomorrow -1 not - 1 also - 1 Say - 1 hello - 1 
to - 1 your - 1 wife - 1

Word count problem at small scale

Word count problem can be easily solved using hash. Keys of hash will be words and value being their count in file. Each word is read from file, checked for presence in hash, if word already present in hash, increment the count, else add word to hash with count as 1.

What are limitation of above solution? First, it’s apparent that extra space will be required for storing hash. Now, if input set is very large with mostly unique words, space requirement will be of order N.

Second, while processing words, we have bring in entire file into memory and process each word in serial manner. Complexity will be O(N) where N is number of words in file.

Word Count problem at large scale

Hash based solution described above works for smaller set of input where input and output can live memory at the same time and parallelism is not required. What if we have process GB or TB of data? Will above method scale to that?

Apparent thing at scale is to divide and conquer. We divide input set into manageable smaller sets, work with them and generate output and then at last, combine smaller output to produce output for the original problem. This approach gives us two advantages: We don’t need to bring in entire input into memory at a time and smaller tasks can be run in parallel (given that output of one task does not depend on output of another, of simplicity sake.)

There is a method called Map Reduce, which uses divide and conquer approach to solve such problem at scale.

Map function
Map function performs certain processing on data sets where processing on each data set is independent of other. For example, if sum of squares of list of number is to be found, map function will find square of each number in list. Finding square of a number is completely independent of other.

map function

Reduce Function
Reduce function works on output generated by map function and combines them to produce final output. Significant difference between map function and reduce function is that reduce may or may not perform processing on data sets independently. So in above example. reduce function takes output of map function that is square of each number and adds them.

word count problem implementation

Since the map function can independently on data sets, it is candidate for parallelism. Multiple map function can run on multiple servers on independent data sets and generate some key value pairs which then can be consumed by reduce function for further processing.
In the same way more than one reduce functions can be invoked on more than on server which run different batches of inputs (output from map) which are related and dependent on each other.

It is important to know that map and reduce function do not talk to each other hence no data sharing can happen between any two instances of them.

Coming back to our problem of word count, mapper function will read a batch of words and generate tuple as (word, 1)  and send it to reducer where based on word, it will increment count. Below diagram explains it.

Word count problem example

There is one problem with this, there is only one reducer and if all tuples of words are going to same reducer, we end up storing all words on single machine memory. Also, single machine is processing all words, so we lose the benefit of parallelism of mapper as reducer becomes bottleneck. How can we have multiple reducer so avoid both of above problems?

If we group words by their first letter and send them to different reducers, we can run 26 different reduces which cater to small set of words and can run in parallel and produce output.

word count problem

There is another method which to group all same words and send them to reducer. Now, reducer does not have store these words, all it need to do is count. Group by step can be intermediate step between map and reduce.

Code snippets

Mapper function

Reducer function

Driver function

Code snippets are copied from : Hadoop tutorial

This is how we solve or design solutions for word count problem which involve large scale. Please share if there is something wrong or missing.

Unique anagrams in file problem

Unique Anagrams in file

What are anagrams? If you have any doubt, please refer to post : Find if two strings are anagrams . Today’s problem is on top of finding if a string is anagram of other or not. There is a file or list of words, we have to find a unique anagrams in that file, that means, there can be many permutations of same characters and we have to discard all but one. Fro example, if file contains following words.

cider ether there three cried dicer heaps phase shape manes manse means

Output should be

cider ether heaps manes

Sometimes question can be tweaked and asked like:

given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other.

Before solving for unique anagrams, let’s solve problem of finding anagrams and then find uniqueness by book keeping. There are two basic steps to solve problem. First, unify each word, to compare words with each other. Second step is to identify if new word just seen, is already seen or not, that to find it is unique anagram or not.

First step : Find if two words are anagrams

How to find if two strings are anagrams? The first method is to sort both string based on characters and then compare them for equal. If they are equal, strings are anagrams of each other else not.

There is another method which is much quicker and does not require sorting of string. This method is called assigning prime factorization.

Check if strings are anagrams using prime factorization

There is mathematical proof (which I don’t know) that a number can be decomposed into a unique set of prime factors. It is not possible to add or take away any factor of that number if number has only prime factors. Prime factorization wiki . How can we use this information to solve our problem.

Assign a unique prime number to each character in alphabet (considering only lower case ). Now for each word, take corresponding prime number for each character of word and multiply. If results of multiplication of two words are same, words are anagrams.

This is fast method, however, there are pitfalls, one must be aware of. Multiplication of integers can easily lead to over flow. For 26 characters, we need at least 26 unique prime numbers and even if we chose lowest 26 prime number, this overflow may occur as soon as we see words with length greater than 9.

Second problem with prime factorization method is that mapping is skewed between input and output space. There is  infinite number of anagrams where at max with 64 bit machine, we can store only 264 words. However, with limit number of words with restricted length this method works good, else move to sorting method discussed earlier.

Now, we know how to find if words are anagrams or not. All, we need to do is to scan the file or list of words and apply above method and eliminate all but one word..

Second step : To identify unique anagrams

How to easily identify that we have already seen a thing before. Simply hash it. If you have already seen it, hashing will cause collision. Hashing is O(1) solution for it, but how about space requirement. Since, we don’t know how many strings are unique, we have allocate memory for worst case which O(N). Can we optimize on space front?

How about binary search tree base storage. Finding a words already present and insert operation can be done simultaneously in O( log N ) if tree is near balanced.  However, we save a lot of space as nodes are allocated as and when required. Drawback is if unique anagrams words are seen only in ascending or descending order (while doing this for a dictionary, search complexity goes for O(N) in worst case. )

Third data structure which can save space as well as limit time complexity is trie. Basics of trie can be learned here : Trie basics

Search complexity reduces to length of string being searched for and space required is also dependent on longest word in the list. Well this method can also work when we have prime factorization method employed by converting number into string.

Now, we have solved problem for reasonable size of input. How about insane size like billions of words are to be processed. Can we distribute the work? Does our solution can be implemented on a distributed architecture and if yes, how?

First of all, there are two basic steps in the algorithm and we can easily make a pipeline of two steps. First stage do sorting of number while second stage does the comparison.

find unique anagrams

Sorting stage can have multiple processes, each working on subset of input, sort them and output them to comparator. If we look closely, comparator becomes a space expensive as we have to store each unique word. In worst case, we may hit memory limit at comparator.

We can have multiple comparators as well, each doing comparison on specific set of output produced by stage 1. This classification can be done on as simple as first character of word or complex as range of hash of word. However, this method distributes data on multiple machine which can then produce final output. Take care that distribution is exclusive.

unique anagrams example

I have tried to explain problem of finding unique anagrams in a file from scratch and take it to a real life production problem and various design and data structure decisions one has to take to solve this problem. Please let me know if something is missing or wrong. Sharing is caring.

Design patterns : Adapter pattern

If you have ever travelled to United States from countries like England or India, you face a very common problem. Sockets. In US of A, electric sockets are rectangular in shape where in India or England, it cylindrical in  shape and so are the plugs being put into them. So how to connect? Easy, we carry an adapter, which allows us to work with two incompatible things , and that is our topic of today that is adapter pattern.

Adapter pattern is structural design pattern and works as a bridge between two incompatible interfaces. Intent of adapter pattern is to convert a class or interface into one which is required by client. This lets classes which are independent or incompatible to work together. There are four parts into adapter pattern: Target : this is the interface which is used by client, Adapter : Adapts the interface which needs to be used by client but not compatible, Adaptee : the incompatible interface, Client : Our beloved client.

Adapter-2

There are two ways to implement adapter pattern:

Class Adapter: uses java inheritance to extend source interface

In this method, the class which needs to be used by client and has incompatible methods is extended and a child class is created. This new class which is created contains compatible methods. These compatible methods will eventually call incompatible methods of base class. This method can be implemented in languages which support multiple inheritance (Java, C# don’t support multiple inheritance and hence this cannot be used in these languages). This is because adapter needs to inherit both target and adaptee. However, if we make target as interface, then this problem goes away as class can implement as many interface as possible.

To illustrate, we will take example which we discussed above, the rectangular socket (returning 120V) and cylindrical plug (may take only 40V, 12V, 3V). First let’s create a class to represent rectangular socket returning 120V:

Volts class

RectangularSocket class returning 120V volts

Now, we have cylindrical plug and which needs to get power from socket. However, the socket available takes on rectangular plug. What we will do is to inherit rectangular socket class and create an adapter class which will also take cylindrical plug too which can take 3V, 40V or 120V.

Cool now we have a client (cylindrical plug), Adaptee (rectangular socket) and adapter (class derived from rectangular socket). We have to make target interface which will be implemented by adapter and will be used by client, lets call it as IRectangularSocketAdapter.

Client using above interface:

Object Method: It uses java composition and adapter contains object of Adaptee class.

This is very simple way to implement adapter pattern, just add base class as an attribute to adapter class and use that attribute to call incompatible methods. This method has its own advantages: first, it adapts the base class and all inherited class of that base class which is not case in  class base method. This comes with a caveat that if subclasses add new methods, adapter needs to change in order to expose those methods to clients.

Everything from above example of socket and plug still remains, only change is in adapter class which now contains Adaptee class (rectangular socket) as an attribute. Code below shows how adapter pattern using object method is used.

There are some pertinent questions : What amount of work should be done in adapter. That is simple to answer, the amount you require to make communication between two incompatible classes. If target and adaptee are more or less compatible, then adapter just needs to delegate calls to adaptee. If target and adaptee are incompatible, we may have to do a good heavy lifting, like data structure conversions etc.

Other question, when should we use adapter pattern?  Again simple question, most of the times we face it. When you want to use a class which cannot be changed and you already have implemented logic to use that class, but taking some assumptions which are not correct or you are replacing the class which was supporting your earlier implementation and replacing with new class which cannot be changed. This case is best suited for adaption. You write an adapter which provides a way to use new class using your existing code.

Today, we learned something about a method which is heavily used to use legacy codes and this is called as adapter pattern.Please leave comments/suggestions if you like explanation or if there is something which can be improved or corrected.

Design Patterns : Strategy pattern

In last post on design patterns, we learned about builder pattern, in this post we will be discussing strategy pattern.  Before going into details, first understand basic intent of strategy pattern. Intent is to define a family of algorithms, encapsulate each one, and make them interchangeable. Strategy lets the algorithm vary without any change in clients that use it.

It is a common solution for representing a family of algorithms and letting you choose among them at runtime.

Strategy pattern
Strategy pattern core

Strategy pattern consist of following components:

  • An interface to represent some algorithm (the interface Strategy)
  • One or more concrete implementations of that interface to represent multiple algorithms (the concrete classes ConcreteClassA, ConcreteClassB)
  • One or more clients that use the strategy objects

Example of family of algorithms can be sorting algorithms  (insert sort, quick sort, merge sort and so on), hashing algorithms (MD5, SHA), mathematical operation (addition, subtraction,multiplication),payment methods (credit card, Paypal, wallet). In strategy pattern, we try to define each member of the family and encapsulate implementation via an interface. Basic tenet of this pattern is: encapsulate anything which can change.

Lets talk about some problems and see how strategy pattern solves our maintenance nightmares.

Sorting algorithm using strategy pattern

First example is very simple, easy to understand and make the concept very clear. Example is about sorting algorithm. We want to implement a sorting algorithm, which works on given input and return output in sorted order. There are more sorting algorithms than I have underpants. Which one should we use? Let’s say we deliberate and come up with one sorting algorithm which satisfies our current requirements, let it be insert sort.

We will implement a function insert(List<Integer> input) and use it in clients which needs input to be sorted.
Going down, we want to change sorting algorithm to quick  sort instead of insertion sort. And the hell will break lose. How many place you need to update the code because there will be thousands of clients who will be using it? There is a way out. We implement quick sort and a function fun() which calls insertion sort or quick sort  based on input.

Well, now if we want to implement merge sort, our dear function fun() will have another if and else block. And if in future, a new algorithm is discovered, we need to add additional block for that. So, it is quite evident that code is not extensible and open for the change.

So, let’s get back to tenet of strategy pattern, which says, code which can be changed needs to be encapsulated. Best way to encapsulate things is interface. We will define an interface say ISortingAlgorithm, which declares a function called sort().

Now, we will define three different classes one each for insertion sort, merge sort, quick sort. Each of these classes implement ISortingAlgorithm. 
Any client who wants to use, will send its choice of sorting algorithm. With this approach, any new sorting algorithm discovered can be easily be accommodated. Let’s see the code.

Hashing algorithm using strategy pattern

Second example is classic example which is used to explain strategy pattern and the example is of hashing. Let’s say, we are writing a hash function. At the time of design, we zeroed in on MD5 as our hashing function and we neatly defined and released function. Our function is so good that it is being used in so thousands of clients. Definition of our hash function is something like this:

Things go wrong when one fine day we find that MD5 hashing function has a flaw or MD5 hashing is not what we want. We deliberate and come up with another algorithm to be used say SHA256. And this leads to change to our hash function which we neatly written early. Now, our hashing function becomes:

Problem with above solution is that we still need to support MD5 implementation because, as I said, code has been already released and there are thousands of clients using that. We cannot afford to change all those clients, hence need to support the older API too. Well, we salvaged the situation. But thing to ponder is what if there comes another hashing algorithm which is more efficient than Md5 and SHA256 and we want to change our function and support all legacy implementation and usages. With the golden rule, that if code can change, encapsulate it, and what better way than interface to encapsulate things. We defined an interface let’s say IHasher with one function named hash().
Our earlier Hashing function now takes two parameters, one input to be hashed and other IHasher object. Implementation becomes:
We now create a class for each of the hashing algorithm and class implements IHasher interface.
And now, we call hash(“test String”, new MD5Hasher()) instead of hash(“test string”). However, if need to change MD5 to SHA256, we still need to replace all call with MD5Hasher() with SHA256Hasher(). Again maintenance nightmare. To avoid that we will create and pass objects as IHasher and not of actual class as shown in code below and there is no problem no matter how many times our algorithm changes. We don’t need to change our code. We add another class for that algorithm which implements IHasher interface and we are done. We that is principle is called as open for extension but not for modification. Our core logic remains same, still we can extend it to as many hashing algorithms we can thing of.

Payment Method using strategy pattern

Last example is related payment method. We have a shopping cart and then we have to associate a payment method to that cart. Payment method can be credit card or net banking or Cash on Delivery. To encapsulation of payment method, we will define an interface IPaymentMethod. 

Now we implement class one each for payment method and then a test class to verify that correct payment method is used.


I think no need to explain this example,as it is quite self-explanatory.

As per Gang of Four Strategy patterns book, use strategy pattern when:

  1. Many related classes differ only in behaviour. Strategies provide a way to configure a class with one of many behaviour.
  2. You want different variants of same algorithm, like MD5 and SHA256 variants of hashing algorithm.
  3. Algorithm uses some data which should not be exposed to client. Pattern provides a way to hide complex and algorithm specific data to clients.
  4. A class define many behaviour and that appears as if-else block.                        To conclude, we today understood what is strategy design pattern and when to apply it. We also discussed some examples which are good candidates for application  of strategy pattern.

Design patterns : builder pattern

Today we will discussing a very important pattern in design pattern called as builder pattern, there are class of problem which can be solved using this pattern effectively and efficiently without cluttering the client code which uses those classes.

First of all let’s understand why we need builder pattern and what pattern they solve.

Typical way to instantiate a class is to use its constructor. A constructor is set of instructions which sets up the object for you with a specific state to start with. As we all know, a constructor can take parameters to it. If no constructor is defined for class, instantiate using default constructor which does not take any input parameter. Constructors with parameter set values for the member variables of class. However, many a times only a few member variables of class are mandatory at the time of object creation. Rest all are optional.  So, how do we create an object of such class?  We cannot have constructor with only parameters corresponding to mandatory member variables, as in some cases optional variables are also required to create object correctly.

Example of such a class which has some mandatory and rest optional member variables is FoodItem. It can have some mandatory members like MRP, Manufacturing date, Batch number, and some optional members like nutrients like total fat, saturated fat, trans fat, cholesterol, sodium, MSG, proteins and so on. Not every FoodItem object should have values for total fat, saturated fat, trans fat, cholesterol, sodium or MSG or proteins.

There are two ways to solve this problem part from builder pattern. Let’s look at them  before we move to builder patterns.

Telescoping constructor

First one is to create telescoping constructors. In this we create a default constructor, a constructor with all mandatory member variables and then create a constructor adding each optional variable to it. First constructor with one optional parameter, second with two, third with three and so on. So there are four mandatory and 20 optional, we would end up creating 1 + 20 constructor for the class. While creating object, one need to call constructor with minimum parameters matching the requirement. Many times the constructor with more number of parameters needs to be used because you want to set optional parameter which is after non required parameter in constructor parameter list. In that case, all non required optional parameters are set to zero. This makes code very unmanageable on the client side who needs to create objects of you class.

Telescopic constructor pattern solves problem where one need to send parameters which are not required for object creation. However, it comes with a caveat, if two parameters next to each other are same type, then putting values for one parameter in other is a common mistake and will not result in compile time warning, but may not create desired object. This method is not scalable.

Setter method

Second method is to have a constructor with mandatory members and the a setter function for each parameter. At the time of object creation, only mandatory parameters are passed and setter methods are used to set values for required optional parameters.

There are some problems with this approach. One, use of setter, makes your class mutable. So beware of using this kind of objects in multi-threaded applications. Second, during creation of object, there are intermediate states of object before the final object is created. However, the object can be used by other threads before setters of optional values are executed and a incomplete object is seen by the new thread.

Builder pattern

There is third way which comes with best of above two and deals with disadvantages of them too. This pattern is called as build pattern. In this pattern, we create a static member class of our class called as builder. First we create a builder class with all the mandatory parameters and the using that object set values of optional fields and then finally call build function which creates the object of our class. Let’s take the same example as above and use builder pattern to create an object.

Client Code :There are two advantages of this approach : First we are not putting setter function on member variables of class but on builder class which makes our class as immutable. Second advantage is our final object is created in on instruction, and not in series of instructions like in second method, so there is no risk of inconsistent copies object in different threads.

This is not the only use of builder pattern. Many a times we need to create an object but the contents of those object can be different. For example, one wants to create a Menu for a fast food restaurant which offer meals. Now,meals can be veg and no-veg. Based on request either meal object contains veg dishes or non-veg dishes. Here we have to create an same object but with slight difference. This cannot be achieved by using constructor as in constructor we need to pass all the member variable and in this case one of the variable will be not use either veg or non-veg item.

This is where builder comes into play. We create a builder which returns us a meal based on inputs we have given. It combines items into meal and return that back. Let’s take an example and see.

We have veg meal which contain a veg burger and a cold drink. Non veg meal has chicken burger and ice tea.

First of all, we will create an interface which will be implemented by the solid food and drink as well.
Create a class burger which implements above interface and extend it to create veg burger and chicken burger. Cool that is first part! Similarly we can create a class drink and based on type of meal we can create object with appropriate values.

Veg burger class
Chicken Burger class

Now, create a cold drink class which implements FoodItem interface and extend it to create classes Pepsi and Ice Tea.
Pepsi class

Ice Tea class
Now we create a builder which creates meals for us using these objects.

To use this build, we write a client code which creates veg meal and non-veg meal based on requirement.

There are some disadvantages with builder pattern, first being lot of extraneous code, it may be possible that developer skips adding any of new field into builder and problems occur. However, the best advantage that it keeps class as immutable given multi-threaded world we live in, overcomes the extra lines of code which is being written.

Tinyurl system design and implementation

Design tinyurl system

Design a tinyurl system is a very frequently asked design question in  This system converts a long URL into some small URL, typically people use it when they post something on platforms where there is limit on number of characters like Twitter.

This is system has perfect use case for hash table or hash maps. Hash tables are data structures which store key value pairs where lookup operation for a given key is O(1). So, given a key it’s corresponding value can be found in constant time.

As mentioned above, tinyurl system creates a small url for long usual url. Hence short url is key and the long url is our value. Now, how long short url can be? It depends. Let’s say one creates three character short url. As allowed characters in url are a-z, A-Z and 0-9, the number of possibilities for each place in three character long url is 62, which can represent 2^62  (approx. 250,000) urls. If urls never expire and more url needs to be shortened, the system has to use 4 character which will eventually hit the limit and hence size will be increase to 5 characters.

Well, let’s talk design now. Following are steps are must to design tinyurl system:

1. Take the url and hash it. Define a good hash function which has very low collision rate. What are good hash functions?
2. Store the entry into hash table with tiny url as key and long url as value.
3. When tinyurl is accessed, go to the hash table and access hash[tinyurl] and redirect to that page.

However, it is good to mention here that normal hashing functions will not work here. Usually the hash key generated by them is too long. For example MD5 hashing function generates 32 bit key. Our system should some kind of permutation algorithm which incrementally goes through all the permutations of characters and assign value to it.

Design tinyurl system : mapping function

One thing we need to think of in more detail is how to implement hash, where if tinyurl is pass it gives fetches the correct index in the hash, and at that index we have our long URL stored. Our tinyurl consists of alphanumeric letters [A-Z][a-z][0-9] that is total of 62 digit.  If we store long URL in simple indexed based hash, and convert that index to base 62, every digit in base 62 will give a us character, for example, take index to be 134, in base 62, it can be represented as

134 = 2 X 621 + a X 620

So the digits which we get are [2,a]. Now how to get character corresponding to these digits?  Create a table such that

table[0] ='A'
table[1] ='B'
table[2] ='C'
...
table[26] ='a'
table[27] ='b'
....
table[52] ='0'
table[53] ='1'
...
table[61] ='9'

Lookup each digit in this table and get the corresponding URL. Other problem which needs to be solved is lookup. That is given a tinyurl, how to get the index in hash table?

To convert, take each character in url, do a reverse lookup on alphabets. Resolve the position of each alphabet in table and covert those digits into base 10 numbers. For example, if the tinyurl is a93, then we look up the table and find that corresponding positions are [26,61, 53], So the decimal conversion will be like:

a9362  = [26,61,53] = 26 X 622 + 62 X 611 + 53 X 620 = 99,944 + 3782 + 53 = 103,779

Go and find the long url on index 103,779. This is good hashing function because it is bijective function.  A bijective function is a function f(n) such that for any i and j with i != j, f(i) != f(j); hence collision chances are none.

Below figure shows the entire design concept.

design tinyurl system

Other insights

Very good points which I read from stack overflow by Abel. It’s worth mentioning as it will earn point for you in interview.j

It says that whenever someone clicks on a tiny url, HTTP request is sent to their server. Remember that out key which was generated above is last part of the real url for example, in tinyurl like biturl.in/aDffg1, on aDffg1 is our key, biturl.in is used to send request to designated server. So one of the design aspect of the system will be to set up a server with given url.

Now, server looks for the corresponding long url and uses 302, that is for temporary redirect to redirect user to actual url. Most important part Adel talks about is this and I quote

If you were to use files or first load HTML and then redirect, the browser would add TinyUrl to the history, which is not what you want. Also, the site that is redirected to will see the referrer (the site that you originally come from) as being the site the TinyUrl link is on (i.e., twitter.com, your own site, wherever the link is). This is just as important, so that site owners can see where people are coming from. This too, would not work if a page gets loaded that redirects.

PS: there are more types of redirect. HTTP 301 means: redirect permanent. If that would happen, the browser will not request the bit.ly or TinyUrl site anymore and those sites want to count the hits. That’s why HTTP 302 is used, which is a temporary redirect. The browser will ask TinyUrl.com or bit.ly each time again, which makes it possible to count the hits for you (some tiny url services offer this).

Good explanation to talk about in interview is given coding horror blog entry for tinyurl. So this is what you should talk when somebody asks you to design tinyurl system when asked in design round. Please share your views on this in comments. You can like the page for more updates and interview experiences.