Unique Anagrams in file
What are anagrams? If you have any doubt, please refer to post : Find if two strings are anagrams . Today’s problem is on top of finding if a string is anagram of other or not. There is a file or list of words, we have to find a unique anagrams in that file, that means, there can be many permutations of same characters and we have to discard all but one. Fro example, if file contains following words.
cider ether there three cried dicer heaps phase shape manes manse means
Output should be
cider ether heaps manes
Sometimes question can be tweaked and asked like:
given a file containing one word per line, print out all the combinations of words that are anagrams; each line in the output contains all the words from the input that are anagrams of each other.
Before solving for unique anagrams, let’s solve problem of finding anagrams and then find uniqueness by book keeping. There are two basic steps to solve problem. First, unify each word, to compare words with each other. Second step is to identify if new word just seen, is already seen or not, that to find it is unique anagram or not.
First step : Find if two words are anagrams
How to find if two strings are anagrams? The first method is to sort both string based on characters and then compare them for equal. If they are equal, strings are anagrams of each other else not.
There is another method which is much quicker and does not require sorting of string. This method is called assigning prime factorization.
Check if strings are anagrams using prime factorization
There is mathematical proof (which I don’t know) that a number can be decomposed into a unique set of prime factors. It is not possible to add or take away any factor of that number if number has only prime factors. Prime factorization wiki . How can we use this information to solve our problem.
Assign a unique prime number to each character in alphabet (considering only lower case ). Now for each word, take corresponding prime number for each character of word and multiply. If results of multiplication of two words are same, words are anagrams.
This is fast method, however, there are pitfalls, one must be aware of. Multiplication of integers can easily lead to over flow. For 26 characters, we need at least 26 unique prime numbers and even if we chose lowest 26 prime number, this overflow may occur as soon as we see words with length greater than 9.
Second problem with prime factorization method is that mapping is skewed between input and output space. There is infinite number of anagrams where at max with 64 bit machine, we can store only 264 words. However, with limit number of words with restricted length this method works good, else move to sorting method discussed earlier.
Now, we know how to find if words are anagrams or not. All, we need to do is to scan the file or list of words and apply above method and eliminate all but one word..
Second step : To identify unique anagrams
How to easily identify that we have already seen a thing before. Simply hash it. If you have already seen it, hashing will cause collision. Hashing is O(1) solution for it, but how about space requirement. Since, we don’t know how many strings are unique, we have allocate memory for worst case which O(N). Can we optimize on space front?
How about binary search tree base storage. Finding a words already present and insert operation can be done simultaneously in O( log N ) if tree is near balanced. However, we save a lot of space as nodes are allocated as and when required. Drawback is if unique anagrams words are seen only in ascending or descending order (while doing this for a dictionary, search complexity goes for O(N) in worst case. )
Third data structure which can save space as well as limit time complexity is trie. Basics of trie can be learned here : Trie basics
Search complexity reduces to length of string being searched for and space required is also dependent on longest word in the list. Well this method can also work when we have prime factorization method employed by converting number into string.
Now, we have solved problem for reasonable size of input. How about insane size like billions of words are to be processed. Can we distribute the work? Does our solution can be implemented on a distributed architecture and if yes, how?
First of all, there are two basic steps in the algorithm and we can easily make a pipeline of two steps. First stage do sorting of number while second stage does the comparison.
Sorting stage can have multiple processes, each working on subset of input, sort them and output them to comparator. If we look closely, comparator becomes a space expensive as we have to store each unique word. In worst case, we may hit memory limit at comparator.
We can have multiple comparators as well, each doing comparison on specific set of output produced by stage 1. This classification can be done on as simple as first character of word or complex as range of hash of word. However, this method distributes data on multiple machine which can then produce final output. Take care that distribution is exclusive.
I have tried to explain problem of finding unique anagrams in a file from scratch and take it to a real life production problem and various design and data structure decisions one has to take to solve this problem. Please let me know if something is missing or wrong. Sharing is caring.