K most frequent words from file

k most frequent words from file

Given a text file, find K most frequent words in file i.e. top k words whose count is maximum. For example, if given file is like follows:

one, two, two, three, three, three, four,four, four,four, five, five, five, five, five, six,six, six, six, six, six and we are looking for three most frequent words, output should be :  six,five, four.

Let’s solve this problem which involves two concepts : One of hash maps and other heap.

First step towards solution is to read given file. We can create an file object and open file and read it using fgets() or gets().

Second, we have to count frequency of each word. So, when a word is read from file, increment count of that word. Which data structure is best suited for this kind of operation? Hash, of course! Hash gives us constant time operation to lookup string and increment its count.

Now, we have a list of words and corresponding frequency of each word. Finally, time to figure out top n most frequent words.
So, thump of rule says that whenever we have to figure our top or least from collection of items, heaps are our best bet. To find top K words, we will go with min heap.

Algorithm to find k most frequent words in file using heap

1. Take first K words appearing in has map along with their count and create a min heap with these k elements. 
2. Now, one by one, read each word from hash and check if frequency of the word is more than the minimum in heap, that is top of heap. 
3. If yes, remove top of heap and add new word in heap. 
4. If not, continue with next word. 
5. When done with all words in hash map, min heap will contain k most frequent words in file.

To find K minimum elements using min heap is also explained here: Heaps : Find K smallest element in given array

Implementation notes
Problem has been solve using STL in C++, using map, vector and priority queue libraries.
Hash map is created to map string with an object which contains string and its frequency to be used for min heap purpose.
Also, default priority queue in implementation in STL gives max heap, to use it as min heap, new comparator is written.
Even trie can be used to count frequency of words in given file. That way we save a lot of space which used in hash. Common prefixes of words are stored only once and not multiple time. Given implementation uses hash map, to read tries and how can they be used in this problem please refer : Trie data structure

Find K most frequent words in file : Implementation

Heap class

/**
 * Created by sangar jitendra on 12/11/15.
 */
/**
 *  Java Program to Implement Binary Heap
 */

import java.util.Arrays;
import java.util.Map;
import java.util.NoSuchElementException;

public class BinaryHeap
{
    /** The number of children each node has **/
    private static final int d = 2;
    private int heapSize;
    private HeapNode[] heap;

    /** Constructor **/
    public BinaryHeap(int capacity)
    {
        heapSize = 0;
        heap = new HeapNode[capacity + 1];
        Arrays.fill(heap, null);
    }

    /** Function to check if heap is empty **/
    public boolean isEmpty( )
    {
        return heapSize == 0;
    }

    /** Check if heap is full **/
    public boolean isFull( )
    {
        return heapSize == heap.length;
    }

    /** Clear heap */
    public void makeEmpty( )
    {
        heapSize = 0;
    }

    /**Get size of heap **/
    public int getSize( )
    {
        return heapSize;
    }

    /** Function to  get index parent of i **/
    private int parent(int i)
    {
        return (i - 1)/d;
    }

    /** Function to get index of k th child of i **/
    private int kthChild(int i, int k)
    {
        return d * i + k;
    }

    /** Function to insert element */
    public int insert(HeapNode x)
    {
        if (isFull( ) )
            throw new NoSuchElementException("Overflow Exception");
        /** Percolate up **/
        heap[heapSize++] = x;
        return heapifyUp(heapSize - 1);
    }

    /** Function to find least element **/
    public HeapNode findMin( )
    {
        if (isEmpty() )
            throw new NoSuchElementException("Underflow Exception");
        return heap[0];
    }

    /** Function to delete min element **/
    public HeapNode deleteMin(Map<String,Integer> onHeap)
    {
        HeapNode keyItem = heap[0];
        delete(0,onHeap);
        return keyItem;
    }

    /** Function to delete element at an index **/
    public HeapNode delete(int ind, Map<String,Integer> onHeap)
    {
        if (isEmpty() )
            throw new NoSuchElementException("Underflow Exception");
        HeapNode keyItem = heap[ind];
        heap[ind] = heap[heapSize - 1];
        heapSize--;
        heapifyDown(ind,onHeap);
        return keyItem;
    }

    /** Function heapifyUp  **/
    private int heapifyUp(int childInd)
    {
        HeapNode tmp = heap[childInd];
        while (childInd > 0 
               && tmp.getCount() < heap[parent(childInd)].getCount()){
            heap[childInd] = heap[ parent(childInd) ];
            childInd = parent(childInd);
        }
        heap[childInd] = tmp;
        return childInd;
    }

    /** Function heapifyDown **/
    private void heapifyDown(int ind,Map<String,Integer> onHeap)
    {
        int child;
        HeapNode tmp = heap[ ind ];
        while (kthChild(ind, 1) < heapSize)
        {
            child = minChild(ind);
            if (heap[child].getCount() < tmp.getCount()){
                heap[ind] = heap[child];
                onHeap.put(heap[child].getWord(),ind);
            }
            else
                break;
            ind = child;
        }
        heap[ind] = tmp;
        onHeap.put(tmp.getWord(),ind);
    }

    /** Function updateKey  **/
    public void updateKey(int childInd, Map<String, Integer> onHeap)
    {
        int child = 0;
        HeapNode tmp = heap[childInd];
        tmp.seCount(tmp.getCount()+1);

        while (kthChild(childInd, 1) < heapSize)
        {
            child = minChild(childInd);
            if (heap[child].getCount() < tmp.getCount()) {
                heap[childInd] = heap[child];
                onHeap.put(heap[child].getWord(),childInd);
            }
            else
                break;
            childInd = child;
        }
        heap[childInd] = tmp;
        onHeap.put(heap[childInd].getWord(),childInd);
    }
    /** Function to get smallest child **/
    private int minChild(int ind)
    {
        int bestChild = kthChild(ind, 1);
        int k = 2;
        int pos = kthChild(ind, k);
        while ((k <= d) && (pos < heapSize))
        {
            if (heap[pos].getCount() < heap[bestChild].getCount())
                bestChild = pos;
            pos = kthChild(ind, k++);
        }
        return bestChild;
    }

    /** Function to print heap **/
    public void printHeap()
    {
        System.out.println("\nHeap = ");
        for (int i = 0; i < heapSize; i++)
            System.out.println(heap[i].getWord() +" " + heap[i].getCount());
        System.out.println();
    }
}

Client code

/**
 * Created by sangar jitendra on 12/11/15.
 */
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class HeapClient {

    public static void main(String[] args) {
        Map<String,Integer> wordCountMap =  new HashMap<>();
        Map<String, Integer> onHeap =  new HashMap<>();
        BinaryHeap binaryHeap = new BinaryHeap(1000);

        if(args == null || args.length < 2)
            return;
        int K = Integer.valueOf(args[0]);
        String fileName = args[1];

        try (BufferedReader br = new BufferedReader(new FileReader(fileName)))
        {
            String sCurrentLine;
            while ((sCurrentLine = br.readLine()) != null) {
                Integer wordCount = wordCountMap.get(sCurrentLine);
                //Add first K elements into the heap
                if(binaryHeap.getSize() < K) {
                    if(wordCount != null) {
                        //word has been been already encountered.
                        if(onHeap.get(sCurrentLine) != null) {
                            //If word is already on heap, update it
                            binaryHeap.updateKey(
                            	   onHeap.get(sCurrentLine),onHeap);
                        }
                        else{
                            //If not, then insert in heap
                            onHeap.put(sCurrentLine, 
                                       binaryHeap.insert(new HeapNode(sCurrentLine, 1)));
                        }
                        wordCountMap.put(sCurrentLine, wordCount);
                    }
                    else{
                        //If words has not been encountered.
                        onHeap.put(sCurrentLine, 
                              binaryHeap.insert(new HeapNode(sCurrentLine, 1)));
                        wordCountMap.put(sCurrentLine,1);
                    }
                }
                else {
                    /*Once we have already entered K elements in heap, 
                      check if word has been already seen
                      If yes, then we will check if word is on heap, 
                      increase count and update */
                    if (wordCount != null) {
                        wordCountMap.put(sCurrentLine, ++wordCount);
                        if (onHeap.get(sCurrentLine) != null) {
                            binaryHeap.updateKey(onHeap.get(sCurrentLine),onHeap);
                        } else {
                          /*If word is not in heap, we will check if current 
                            count of word is more than min on heap, 
                            i.e root of heap,if yes, remove root and add new 
                            node.*/
                            HeapNode minNode = binaryHeap.findMin();
                            if (minNode.getCount() < wordCount) {
                                binaryHeap.deleteMin(onHeap);
                                onHeap.remove(minNode.getWord());
                                onHeap.put(sCurrentLine, 
                                           binaryHeap.insert( new HeapNode(sCurrentLine,  wordCount)));
                            }
                        }
                    }
                    else{
                        wordCountMap.put(sCurrentLine,1);
                    }
                }
            }
            //Final heap contains K most frequent words in file
            binaryHeap.printHeap();
           }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Heap node definition would be :

/* package whatever; // don't place package name! */
 
import java.util.*;
import java.lang.*;
import java.io.*;
 
/* Name of the class has to be "Main" only if the class is public. */
class HeapNode
{
	int count;
	String word;
 
	public HeapNode(String word, int count){
		this.word = word;
		this.count = count;
	}
 
	public int getCount(){
		return count;
	}
 
	public String getWord(){
		return word;
	}
}

Reading M words from file will require O(M) time, where as creating N element heap will take O(N). Also scanning through all elements will take O((M-N) Log N) complexity. Hence overall complexity to find k most frequent words in a file will  be O(M Log N).

Same problem can be asked in other way:
Given  a file with million points (x,y) where x and y are x and y co-ordinates of point. find 100 closest points to origin.
Just update the hash to store the point and it’s distance. Then create a max heap with first 100 points and then read one by one all points and update the max heap accordingly. At the end max heap will contain 100 most closest points to origin.

Please share if there is something missing or wrong. If you want to contribute to website or have an interview experience to share, please contact us and earn.