Word count problem
The problem at hand is very simple: we have to count the number of occurrences of each word in a given file. This is called the word count problem. For example, if the file is as below,
input.txt:
Hello world. World is so beautiful. I am so busy today, not hello, world is also busy today and tomorrow. Say hello to your beautiful wife.
The output should look like this, listing the number of times each word appears in the file (counting case-insensitively, so "Hello" and "hello" are the same word).

output.txt:
Hello - 3
World - 3
is - 2
so - 2
beautiful - 2
I - 1
am - 1
busy - 2
today - 2
not - 1
also - 1
and - 1
tomorrow - 1
Say - 1
to - 1
your - 1
wife - 1
Word count problem at small scale
The word count problem can be easily solved using a hash. The keys of the hash are the words and the values are their counts in the file. Each word is read from the file and checked for presence in the hash; if the word is already present, its count is incremented, else it is added to the hash with a count of 1.
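The hash-based approach above can be sketched in Python as follows. This is a minimal sketch, assuming words are separated by whitespace, counted case-insensitively, and stripped of trailing punctuation:

```python
from collections import defaultdict

def word_count(path):
    """Count word occurrences in a file using a hash (a Python dict)."""
    counts = defaultdict(int)  # words not yet seen start at 0
    with open(path) as f:
        for line in f:
            for raw in line.split():
                # strip punctuation and lowercase so "Hello" and "hello" match
                word = raw.strip(".,!?").lower()
                if word:
                    counts[word] += 1
    return dict(counts)
```

A `defaultdict` folds the "check for presence, else add with count 1" logic into a single increment.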
What are the limitations of the above solution? First, it is apparent that extra space is required to store the hash. If the input set is very large with mostly unique words, the space requirement will be of order N.
Second, while processing words, we have to bring the entire file into memory and process each word serially. The time complexity will be O(N), where N is the number of words in the file.
Word Count problem at large scale
The hash-based solution described above works for a smaller input set, where the input and output can live in memory at the same time and parallelism is not required. What if we have to process GBs or TBs of data? Will the above method scale to that?
The apparent thing to do at scale is to divide and conquer. We divide the input set into manageable smaller sets, work on them to generate outputs, and at last combine the smaller outputs to produce the output for the original problem. This approach gives us two advantages: we don't need to bring the entire input into memory at a time, and the smaller tasks can run in parallel (given that the output of one task does not depend on the output of another, for simplicity's sake).
There is a programming model called MapReduce, which uses the divide and conquer approach to solve such problems at scale.
The map function performs certain processing on data sets, where the processing of each data item is independent of the others. For example, if the sum of squares of a list of numbers is to be found, the map function will find the square of each number in the list. Finding the square of one number is completely independent of the others.
The reduce function works on the output generated by the map function and combines it to produce the final output. The significant difference between the map function and the reduce function is that reduce may or may not process data items independently. So in the above example, the reduce function takes the output of the map function, that is the square of each number, and adds them.
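The sum-of-squares example above can be written directly with Python's built-in `map` and `functools.reduce`, which mirror the two phases:

```python
from functools import reduce

numbers = [3, 1, 4, 2]

# Map step: square each number; each squaring is independent of the others
squares = list(map(lambda x: x * x, numbers))   # [9, 1, 16, 4]

# Reduce step: combine all mapped values into one result
sum_of_squares = reduce(lambda acc, x: acc + x, squares, 0)

print(sum_of_squares)  # 9 + 1 + 16 + 4 = 30
```

Because each squaring touches only its own input, the map step could be split across machines; the reduce step is where the partial results meet.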
Since the map function operates independently on data sets, it is a candidate for parallelism. Multiple map functions can run on multiple servers over independent data sets and generate key-value pairs, which can then be consumed by the reduce function for further processing.
In the same way, more than one reduce function can be invoked on more than one server, each running on a different batch of inputs (the output from map), where the records within a batch are related to each other.
It is important to know that the map and reduce functions do not talk to each other, hence no data sharing can happen between any two instances of them.
Coming back to our word count problem, the mapper function reads a batch of words and generates a tuple (word, 1) for each, sending it to the reducer, which increments the count for that word. The below diagram explains it.
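A minimal sketch of this mapper/reducer pair in Python, assuming input arrives as batches of text and punctuation is stripped in the mapper:

```python
def mapper(batch):
    """Emit a (word, 1) tuple for every word in a batch of text."""
    for raw in batch.split():
        yield (raw.strip(".,").lower(), 1)

def reducer(pairs):
    """Receive (word, 1) tuples and keep a running count per word."""
    counts = {}
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one
    return counts

# Two mapper instances could run in parallel on different batches;
# here their outputs are simply concatenated and fed to one reducer.
batches = ["Hello world", "hello again world"]
pairs = [pair for batch in batches for pair in mapper(batch)]
print(reducer(pairs))  # {'hello': 2, 'world': 2, 'again': 1}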
There is one problem with this: there is only one reducer, and if all word tuples go to the same reducer, we end up storing all words in a single machine's memory. Also, a single machine processes all the words, so we lose the benefit of the mappers' parallelism as the reducer becomes a bottleneck. How can we have multiple reducers to avoid both of the above problems?
If we group words by their first letter and send them to different reducers, we can run 26 different reducers, each catering to a small set of words; they can run in parallel and produce output.
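The first-letter routing can be sketched as a partition function that splits the mapper output into one bucket per reducer. This is an illustrative sketch, assuming every word starts with an ASCII letter:

```python
def partition_by_first_letter(pairs, num_reducers=26):
    """Route each (word, count) pair to one of 26 reducer buckets
    based on the word's first letter."""
    buckets = [[] for _ in range(num_reducers)]
    for word, count in pairs:
        index = (ord(word[0].lower()) - ord("a")) % num_reducers
        buckets[index].append((word, count))
    return buckets

pairs = [("apple", 1), ("banana", 1), ("avocado", 1), ("cherry", 1)]
buckets = partition_by_first_letter(pairs)
print(buckets[0])  # [('apple', 1), ('avocado', 1)] -- the "a" reducer's share
```

Each bucket can then be shipped to its own reducer, so no single machine sees the whole vocabulary.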
There is another method, which is to group all occurrences of the same word and send them to one reducer. Now the reducer does not have to store these words; all it needs to do is count. This group-by step can be an intermediate step between map and reduce.
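The intermediate group-by step (often called the shuffle) can be sketched like this, assuming all mapper output fits in one list for illustration:

```python
from itertools import groupby

def shuffle(pairs):
    """Intermediate group-by step: bring all (word, 1) pairs for the
    same word together so a single reducer gets all of them."""
    ordered = sorted(pairs)  # sorting places identical words next to each other
    return {word: [n for _, n in group]
            for word, group in groupby(ordered, key=lambda p: p[0])}

def reduce_word(word, ones):
    """A reducer now only sums its group; it never stores the word set."""
    return (word, sum(ones))

pairs = [("hello", 1), ("world", 1), ("hello", 1)]
grouped = shuffle(pairs)           # {'hello': [1, 1], 'world': [1]}
results = [reduce_word(w, ones) for w, ones in grouped.items()]
print(results)  # [('hello', 2), ('world', 1)]
```

Because each reducer receives every pair for its word and nothing else, the per-reducer state shrinks to a single running sum.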
Code snippets are copied from: Hadoop tutorial
This is how we design solutions for the word count problem at large scale. Please share if there is something wrong or missing.