Find if two strings are anagrams

Find if two strings are anagrams

Today, we will discuss a problem on string, problem statement is : Given two strings, find if those two strings are anagrams.

What are anagrams?

Two strings are anagrams if they are composed of same characters, these characters may be in different order in two strings. For example, below strings are anagrams

cider cried dicer 
ether there three
heaps phase shape 
manes manse means

With definition, we know part of solution already;  check if both strings contain same characters, not necessarily in same order. Then they are anagrams.
First approach to check two string contains same set of characters is to sort both strings and compare results. If results are same, strings are anagrams, else not.  With sort, characters will be arranged in lexicographical order within strings. Time complexity of this method is O(N log N) + O(M log M ) where N and M are sizes of strings.

Can we do better? In order to check if each character in a string is present in other, let’s keep track of characters in one string in another. Count number of occurrence of each characters in first string in hash table with keys as characters. Again, count number of occurrence of each characters in second string.  If both hash tables are equal, strings are anagrams. Two hash tables are used for solution. We can do away with one hash table.

While traversing first string, increment count for each occurrence of character. While traversing second string, decrement count for each character, in same hash table. At any point of time,  if count of character is less than zero, then character is present in second string but not in first, and hence, strings are not anagrams.

Once, second string is completely scanned, check if there is any character in hash which has count greater than 0. There is easier way, just add all the values in hash and see if it is zero or not.

To avoid check hash table at end for non zero count, first check lengths of two strings, if lengths are not equal, directly say non anagrams. (Think how it avoids check of non zero count at the end :))

#include <stdio.h>
 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
int is_anagram(char *s, char *d);
 
#define NUM_CHAR 256
#define true 1
#define false 0
 
int isAnagrams(char *s, char *d){
 
    int i = 0;
    int lengthA = strlen(s);
    int lengthB = strlen(d);
 
    int hash[NUM_CHAR];
 
    if(lengthA != lengthB) return false;
 
    for(i=0; i< NUM_CHAR; i++){
        hash[i] = false;
    }
 
    while(*s != '\0'){
        hash[(*s) -'0']++;
        s++;
    }
    while(*d != '\0'){
        hash[(*d) -'0']--;
        if(hash[(*d) - '0'] < 0) return false;
        d++;
    }
 
    return true;
}
 
int main(){
    char *s  = "cider";
    char *d  = "dicer";
 
    if(isAnagrams(s,d)){
        printf("\nYES");
    }
    else{
        printf("\nNO");
    }
    return 0;
}

Complexity of algorithm to check if two strings are anagrams is O(N) where N is length of string.

Check if strings are anagrams using prime factorization

There is mathematical proof (which I don’t know) that a number can be decomposed into a unique set of prime factors. It is not possible to add or take away any factor of that number if number has only prime factors. Prime factorization wiki . How can we use this information to solve our problem.

Assign a unique prime number to each character in alphabet (considering only lower case ). Now for each word, take corresponding prime number for each character of word and multiply. If results of multiplication of two words are same, words are anagrams.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
 
int is_anagram(char *s, char *d);
 
#define NUM_CHAR 256
#define true 1
#define false 0
int hashPrime[26] =  { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 
                       53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101 };
 
int isAnagrams(char *s, char *d){
 
    int lengthA = strlen(s);
    int lengthB = strlen(d);
 
    if(lengthA != lengthB) return false;
 
	int multiplicationA = 1;
	int multiplicationB = 1;
 
	char *stringA = tolower(s);
	char *stringB = tolower(d);
 
    while(*stringA != '\0'){
        multiplicationA = multiplicationA * hashPrime[(*stringA) -'a'];
 
        stringA++;
    }
    while(*stringB != '\0'){
    	multiplicationB = multiplicationB * hashPrime[(*stringB) -'a'];
        stringB++;
    }
 
    return multiplicationA == multiplicationB ;
}
 
int main(){
    char *s  = "cider";
    char *d  = "dicer";
 
    if(isAnagrams(s,d)){
        printf("\nYES");
    }
    else{
        printf("\nNO");
    }
    return 0;
}

This is fast method, however, there are pitfalls, one must be aware of. Multiplication of integers can easily lead to over flow. For 26 characters, we need at least 26 unique prime numbers and even if we chose lowest 26 prime number, this overflow may occur as soon as we see words with length greater than 9.

Second problem with prime factorization method is that mapping is skewed between input and output space. There is  infinite number of anagrams where at max with 64 bit machine, we can store only 264 words. However, with limit number of words with restricted length this method works good, else move to sorting method discussed earlier.

Please share if something is missing or wrong or something you liked about it. Sharing is caring.