Jumbled Word: A string of characters is given, the task is to find all the meaningful words that can be created by rearranging its letters. Solving a jumble word means to find all the meaningful words that can be made with the initial string.
Objective Of The Article
In this article first we will describe how a jumbled word can be solved and then we present a very simple computer program. After this we present an advanced computer program which will make the solution very fast with the help of a specially designed tree.
Article Revision: 3
Notice 25.9.2009 : This article is undergoing an update and will be made online soon .
Here is another related post Jumbled word solver with C++ and Perl implementation with hash and list: Jumble word solver again.
*Please Note:* This article is very old and lengthy, I am planning for either a revision or a rewrite for the trie tree part which reflect the recent modifications. At the time I wrote this I did not know that the datastructure which I worked out was a trie :) therefore the long explanation.
Article Index
- Jumbled Word: Definition
- Objective of the article
- Introduction
- Sequential File Searching method
- The Tree Searching Method
Introduction
Rearranging the characters of a string is actually generating new permutations of the characters present in the string, but how can we possibly know that the new permutation (string) is meaningful ? To do this we need to check if the newly formed sting is within a list of valid words. For this we need a word list file which will contain a lot of meaningful words stored in.
Primarily one might think of generating all the possible permutations of the given string and check each of them with the word list. If we find a match then we have found one such meaningful word and then we continue the search for more with new permutations. Though this technique seems very simple but actually is very bad and time consuming. To understand why this is not a good idea. If the user enters a word of length n” and the wordlist has x” words. The string would have n! permutations, each permutation is compared with x words in the word list file making it ( x * n! ) comparisons to find the matches. As it can be seen this function grows factorially, it will take a huge time and the time increases with the length of the input string factorially. Just think of a word list file with 20,000 words and an input string of length 5 then we need ( 20,000 * 5! ) = 2,400,000 comparisons plus the computation needed to generate the permutations. We will not use technique, instead we inspect the properties of different permutations of strings and device a technique.
Sequential File Searching
If the characters of a string string1 is rearranged to form another string string2 then the below properties hold.
- The length of both the strings are same
- Each character in one string is present in the other string
If the characters of string1 can be rearranged to form string2 then we tell that string2 is a ‘solution‘ of string1 (or vice-versa)
The first property is clear, if the length of the two strings differ then the characters of the two strings cannot be rearranged to get each other. If the two strings have the same length but do not have the same characters, then they cannot be rearranged to form each other. This suggests the following: (1) Compare the length of string1 with string2 (read from the word list), if it matches then, (2) check if they have the same characters, if this is also true then string2, from the word list, can be formed by rearranging the characters of string1. We need both the condition to match, if any of the condition fails we quit comparing with that word and move to the next word in the word list file. The technique is outlined below.
Sequential File Search Algorithm Outline
READ the input string into 'string1' WHILE the word list file has not finished, DO READ the next word from the word list file into 'string2' IF length of 'string1' equal to length of 'string2', THEN IF each character in 'string1' is present in 'string2, THEN 'string2' IS a solution of 'string1' ELSE, 'string2' is NOT a solution, continue searching ('else' of inner 'if') ELSE, 'string2' is NOT a solution,continue searching ('else' of outer 'if') END of while loop
We present the above idea a bit formally: If there are two strings string1 = w1w2w3w4….wn and string2 = x1x2x3x4….xm, then the characters of string1 can be permuted to form string2 or vice-versa if and only if m = n and each wi has one and only one match xj in the second string.
To make clear what we want to express by each character wi of a string should have one and only one match xj in the other string, is described with an example:
The string “hello” and “eloho” both apperantly have the same characters, but as we told each character should have only one match, this is not satisfied by these two strings: “hello” contains ‘l’ in 2nd and 3rd position, but they will be considered different because they are not in the same positions. When it is matched with another string “eloho” the first ‘l’ in “hello” is matched in “eloho” but the second one is not, because the only ‘l’ in “eloho” is already matched.
To make the above process work we now device an algorithm to find if string1 can be permuted to form string2, that is, if each character in string1 has one and only match in string2. We will present two techniques to do this.
Mark Matching: Checking Character Occurences, The First Method
To match characters in two strings and check if both have the same characters we do the following: Sequentially take each character from string1 and check if it is present in string2, if yes then, mark the matched position with some marker character say a ‘*’, then we stop searching with the current character and continue with the next next character of string1. If a character of string1 is not found in string2, then we stop the whole process and come to the conclution that string1 is not a solution of string2. If all characters in string1 is found in string2 then we conclude that string1 is a solution of string2 Let us see this with an example.
The general outline / algorithm of the above mark matching process is described below. The end of a string is determined by a NUL character. string[i] means the ith character of the string:
Mark Matching Algorithm: To check if two strings have the same characters or not
INITIALIZE 'i' to 0 WHILE 'string1[i]' is not NUL, DO INITIALIZE 'J' TO 0 WHILE 'string2[j]' is not NUL, DO IF 'string1[i]' is equal to 'string2[j]', THEN MARK 'string2[j]' as "matched", and BREAK from the inner loop ELSE INCREMENT 'j' : 'j'='j+1' END of inner while loop IF no characters of 'string2' in the inner loop has been marked, THEN BREAK from outer loop ELSE INCREMENT i' : 'i=i+1' END of outer loop IF all characters in 'string1' has been matched, THEN 'string2' IS a solution of 'string1' ELSE 'string2' is NOT a solution of 'string1'
The C function for the above method is below. The whole program will be presented later.
check_for_match() function with mark matching:
/* This check function is designed with the first matching technique * * where each character matched from string1 in string2 is marked */ bool check_for_match(char *string1, char *string2) { char copy_of_string1[MAX_LENGTH], copy_of_string2[MAX_LENGTH]; int i, j; /* If string length differs return FALSE */ if( strlen(string1) != strlen(string2) ) { return FALSE; } /* Mave a copy of string 1 and string2 and operate on the copy, * * because they will get modified in the matching process */ strcpy( copy_of_string1, string1 ); strcpy( copy_of_string2, string2 ); /* Make the upper case characters to lower */ tolower_word( copy_of_string1 ); tolower_word( copy_of_string2 ); for( i = 0 ; copy_of_string1[i] != NUL ; i++ ) { for(j = 0 ; copy_of_string2[j] != NUL ; j++ ) { /* If string1[i] and string2[j] is equal mark string2[j] */ if( copy_of_string1[i] == copy_of_string2[j] ) { copy_of_string2[j] = MARKER; break; } } /* If string2[j] is NUL that means that no match was found in inner * * for loop so string2 is not a solution os string1, return FALSE */ if( copy_of_string2[j] == NUL ) { return FALSE; } } /* If the outer loop terminates, that means that string1[i] is NUL * * which means all the characters in string1 is matched in string2 * * that is string2 is a solution of string1, return TRUE */ return TRUE; }
Note: All the strings are first converted to lower case before doing anything
Examples
Let string1 = “loleh” and the string read from the word list is string2 = “holme“. Now we perform the below operations.
- Take from string1, the 0th character: string1[0] = 'l', and check if it is present in string2 = “holme”.It is found in the 2nd position and is marked and made “ho*me”.
- Take from string1, the 1st character: string1[1] = 'o', and check if it is present in string2 = “ho*me”.It in found in the 1st position and is marked and made “h**me”.
- Take from string1, the 2nd character: string1[2] = 'l', and check if it is present in string2 = “h**me”.It is NOT found, so we come to a conclution that “loleh” cannot be rearranged to form the word “hello” hence it is not a solution
Note above the first occurence of the character ‘l’ in the second string was crossed in the first match, so when the second time ‘l’ was seached it was not found.
Let there be an input string “loleh” and the second string for matching is “hello“.
- Take from string1, the 0th character: string1[0] = 'l' and check if it is present in string2 = “hello”.It is found in the 2nd position and is marked and made “he*lo”.
- Take from string1, the 1st character” string1[1] = 'o' and check if it is present in string2 = “he*lo”.It is found in the 4th position and is marked and made “he*l*”.
- Take from string1, the 2nd character: string1[2] = 'l' and check if it is present in string2 = “he*l*”.It is found in the 3rd position and is marked and made “he***”.
- Take from string1, the 3th character: string1[3] = 'e' and check if it is present in string2 = “he***”.It is found in the 1st position and is marked and made “h****”.
- Take from string1, the 4th character: string1[4] = 'h' and check if it is present in string2 = “h****”.It is found in the 0th position and is marked and made “*****”.
- string1 has ended, this means all the characters of string1 is found in string2, so “hello” is a solution to “loleh”.
Note that the first time the character ‘l’ is found we only cross the very first match, and the second occurence of ‘l’ is crossed by the second ‘l’ in of string1. Precisely the job is to find atleast and atmost one match for each character.
Because the matched position of the second string is marked with a marker we call this the marked match method
In the above process string2 is destroyed, so it is need to copied elsewhere before using it in this process.
Matching by Sorting: Checking Character Occurences, The Second Method
We now talk about a second process with which we can check if the characters of two strings can be permuted to get each other.
Let string2 be a ‘solution‘ of string1. Then the sorted character sequence of string1 and string2 will be the same. This means if all the characters of string1 occurences in string2 and both the string has the same length, then sorting their characters will position them into a certain relative positions which will be same in both the string. Note that, the sorted character pattern of any string is just one permutaion of it. What we need is just sort string1 and string2 and check if the sorted strings are exactly the same or not. We discuss this matter with a few examples.
the string “sopt” has 6 meaningful permutations among its 24 permutations:
- opts
- post
- pots
- spot
- stop
- tops
Notice that if we sort the sequence of the characters of any of the 24 permutations we get the same string which is “opst“. That means that all permutations of a string has the same sorted pattern. So, if we sort any of the above 6 words we will have “opst”.
It is easily understandable that all the permutations of the string has a unique sorted pattern. We name the input string string1 and the second string (read from the word list) as string2. The outline is described below, The process of checking if two strings can be rearranged to form each other with this technique follows:
Matching by Sorting Algorithm: To check if two strings have the same characters or not:
SORT 'string1' SORT 'string2' IF 'string1' is equal to 'string2', THEN 'string2' IS a solution of 'string1' ELSE 'string2' is NOT a solution of 'string1'
The sorting algorithms are not described. Any general method can be followed. A decent sorting mechanism is to be implemented to minimize the search time. It is good to implement the quicksort method to get good performance.
We also present the in C Language function. We are using the string library functions “stdlib.h” to compare the strings. The sort function is not described here.
check_for_match() function with matching by sorting
/* This check_for_match function is designed with the Matching by Sorting method */ bool check_for_match(char *string1, char *string2) { char copy_of_string1[MAX_LENGTH], copy_of_string2[MAX_LENGTH]; /* If string length differs return FALSE */ if( strlen( string1 ) != strlen( string2 ) ) return FALSE; /* Mave a copy of string 1 and string2 and operate on it, because they * * get modified in the matching process */ strcpy( copy_of_string1 , string1 ); strcpy( copy_of_string2 , string2 ); /* Make the upper case characters to lower */ tolower_word( copy_of_string1 ); tolower_word( copy_of_string2 ); /* Sort strings */ sort_char( copy_of_string1 ); sort_char( copy_of_string2 ); /* Compare the sorted strings, if same then a match id found */ if( strcmp( copy_of_string1 , copy_of_string2 ) == 0 ) return TRUE; /* Else no match */ return FALSE; }
Note: All the strings are first converted to lower case before doing anything
Which One To Use
The first method is better than the second one. This is because in the second method two strings need to be sorted at first and then both are compared with each other. So for each word read from the dictinary it first needs to be sorted before it is compared. But in the first method sorting is not needed, instead we search for the presense of a character, and if we do not find one we can at once break from the process saving the remaining comparisons, which could not be avoided when using the sorting method. The sorting method is important which we will see in the next process where we will use tree to search the words.
The Whole Picture
We have now deviced a mechanism to check if all the characters in a string are present in another string. Previously we discussed that the length of the strings should be same. We have presented the character occurences checking functions. Now we will write the main function and the other functions and build the whole program.
The main function reads each word from the word list and drives the matching process iterating on each word. The process for matching string1 and string2 the check_for_match() function is called from main.
The main function
/* This function takes input from the user, then reads each word from the * * word list and checks if the scanned words from the file is a solution * * of the input string. The matching is done by the check_for_match() function */ /* Default word list name */ char word_list[FILE_NAME_LENGTH] = "words"; int main(int argc, char *argv[]) { FILE *fp; char jumble_word[MAX_LENGTH],scanned_word[MAX_LENGTH]; int i=0; /* Copy the passed wordlist name */ if( argc >1 ) { strcpy(word_list,argv[1]); } fp = fopen(word_list,"rb"); if( fp == NULL ) { perror("Cannot open file "); printf("\"%s\"",word_list); printf("\nSyntax: %s <wordlist_file>\n",argv[0]); exit(1); } printf("\nEnter \'q\' to quit "); while( TRUE ) { i=0; printf("\n\nJumble Word: "); scanf(" %[^\n]",jumble_word); if( jumble_word[0] == 'q' && jumble_word[1] == NUL ) break; rewind( fp ); while( ! feof( fp ) ) { fscanf( fp ," %[^\n\r]", scanned_word); if( check_for_match( jumble_word, scanned_word ) == TRUE ) { i++; printf("\n[%d] %s",i,scanned_word); } } if( i == 0 ) { printf("<no matches>"); } } return 0; }
Note: Any one version of check_for_match() can be used with the main function to make the program, and no code modification is needed in the main function for that
sort_char() function, to sort the characters of a string:
/* the qsort() function of standard library is used to sort the characters */ void sort_char(char *word) { qsort( word , strlen( word ) , sizeof( char ) , compare_char ); } /* the compare function for qsort */ int compare_char(const void *a, const void *b) { return *( (const char *) a ) - *( (const char *) b ); }
tolower_word() function, to make all the characters in a string lowercase:
/* A function to convert all the upper case letters to lower case */ void tolower_word(char *word) { short int i; for( i = 0 ; word[i] != NUL ; i++ ) { word[i] = tolower( word[i] ); } }
Preprocessor directives
#include <stdio .h> #include <string .h> #include <stdlib .h> #define a_minus_A 32 //pre calculated value of 'a' - 'A' #define FILE_NAME_LENGTH 256 #define MAX_LENGTH 150 #define MARKER '*' #define TRUE 1 #define FALSE 0 #define bool short int #ifndef NUL #define NUL '' #endif
Now is the time to run the code and check what it has to say. Below is the download link of the code which contains a big word list, and both of the source codes with the two versions of check_for_match() function
Source Code of Jumble Word Solver: Sequential File Searching method
Now its time to get the C Programming Language source code of the above methods described
Click here to download the C codes and the word list
The compressed archive contains
- wordlist file: “words.txt”
- source file1: “jumble_mark_match_method.c”
- source file2: “jumble_sort_match_method.c”
- read me file: “readme.txt”
Optimizations
The sequential file searching needs to sweep one single pass through the whole word list file comparing with each word. Some slight modifications could be done in the wordlist file itself and change the code accordingly to make the search faster and get better performance. We do not present source code but discuss the optimizations strategies below
- The modification: Sort the word list file according to assending string length order.The benifit: The strings with lesser length will be skipped, the strings with equal length will be matched, when ever the first string with a greater string length than the input string is encountered comparison can be stopped, no need to scan more words further, break at once. Thus the strings with greater string length are not processed, saving computation.
- The modification: Sort the word list file with string length as the primary key and then sort according to the sorted word pattern as the secondary key.The benifit: Like above, the strigns with more length than the input string are avoided. The strings with same lengths are compared, after the first match is found the next word will also be a match, and if the next word is not a match then no further word can be a match, stop at once scanning further. This is because the words with the same character permutation are placed adjacent.
- The modification: Sort each word in the list and store the sorted string patterns in another index file and beside each sorted pattern also store the file byte offsets of the meaningful words in the word list.The benifit: The input word is sorted and searched for in the index file. When found the file byte offsets of the word list files can be used to read the meaningful words and then display.
For these optimizations the first two needs the word list to be designed specially, and the third one needs an extra index file to store the unique sorted patterns ans the meaningful file offsets for each pattern. The last concept will be used in our next tree method which will index the whole word list in memory.
Complexity of sequential searching method
Each word sweeps the whole file once. If the word list file has x words then each words sweeps through x words. The length of each string is computed which needs total (x * n) operations where n is the average length of the strings. Only the strings with the same length are compared. Let there be m strings of the same length of the input string. With the first method each character from string1 is compared with string2. For each string compared worst case needs n2 operations, making a total of (x * n) + (m * n2) operations are needed. But it is to be nodes worst case occurs only when a match is found. So the worst case occurs only a few times, making the complexity mainly depending on m then the number of matches and x.
The second method has sorting and needs sorting of strings in each iteration. If the quick sort is used the worst complexity will be O(n*log n). Total m strings are only sorted making it O(n*log n) * m + (x * n) operations.The complexity here also depends mainly on m, then the length of each of the m strings, and x
The Tree Searching method
Now we talk about a tree which we will use as an index table to find match and then print the locations. Here each word will be sorted and inserted into a tree in such a manner so that there is no need of a linear search and the list of meaningful words can be accessed almost instantly. The maximum number of operations needed by this process depends on the string length. So this process is much much better than the previous one.
Let us understand how the tree is structured. Again look at the example of string “sopt” which has 6 meaningful words among its 24 permutations:
- opts
- post
- pots
- spot
- stop
- tops
Think of a table, where all meaningful words with the same sorted permutations recide at the same location. For example, all the 6 meaningful perutations of “sopt” above will fall in the same location of the table. Let the user input any permutation of the 4 letters ‘s’,’o’,’p’ and ‘t’ we sort it and we get the same sorted permutation “opst”. With this sorted permutation the table is searched and the matched location will contain all the meaningful words.
We will make a special kind of search tree to load the words from a word list. Now we describe how we design a the tree with an example to be searched. First we describe the structure of the node, and how it is made and accessed.
Node structure
typedef struct histogram_structure { int *file_pos; struct histogram_structure *next[MAX_NODES]; }histogram_tree;
Each node has an array of pointers, each element will point to a node. Each element of the array represents each characters. Each character is represented with one node. The character a node represents is determined by the index of the next pointer which points to it.
For example:
- 0th element of next will stand for ‘a’
- 1st element of next will stand for ‘b’
- 2nd element of next will stand for ‘c’
- .
- .
- 25th element of next will stand for ‘z’
If the ith element of the next pointer array contains NULL then it represents that that location do not point to the ith character of the alphabet. If the element does not contain NULL then it points to the ith character of the alphabet, and it can proceed to the next node. A string is stored after sorting and by dividing each of the character to each node, where each node points to the next node mantaining the sorted character sequence through the next pointers. index of the next pointer by which a node is pointed by, determines which character is represented by it.
The meaningful words are actually stored in a wordlist file. The node has another array which contains the byte offset of the meaningful word read from the wordlist file. When access to the word is needed, it is read from the word list using the stored byte offset value stored in the array. The first element of the file_pos array does not store a byte offset, instead it is used to store a counter, counting the number of file byte offset values are stored in that array. This is initillized to 0. If the array contains 5 file byte offsets then the counter (the 0th location of the file_pos array) will contain 5 and the five byte offset values will be stored in the array location 1 to array location 5. Note the file_pos is declared as a long int type pointer. When allocating a node we use ‘malloc()’ to allocate him the first location and we initilize the counter. When indexing the words in the tree when we need to insert a new file offset value into this file_pos array we use ‘realloc()’ function to resize the file_pos array and then insert at the last. This saves a lot of memory space.
The structure is named ‘histogram_structure’ because it creates a histogram like structure of the words of the word list.
The Indexing And Searching Mechanism (direct index)
Each character is represented with one node, the character it represents is determined by the index of the next pointer which points to it. If a node ‘node2‘ is pointed by the 10th location of the next array of another node ‘node1‘ then the node ‘node2‘ will be represented as ‘k‘ the 10th letter (count starts from 0). The sorted permutation “aehrs” will be indexed like below.
- The 0th location (index of ‘a’) of next array of the head node will point to another node, say ‘node1‘.
- The 4th location (index of ‘e’) of next array of the ‘node1‘ node will point to another node, say ‘node2‘.
- The 7th location (index of ‘h’) of next array of the ‘node2‘ node will point to another node, say ‘node3‘.
- The 16th location (index of ‘r’) of next array of the ‘node3‘ node will point to another node, say ‘node4‘.
- The 18th location (index of ‘s’) of next array of the ‘node4‘ node will point to another node, say ‘node5‘.
- And the ‘node5‘ will contain the byte offset of the permutation “aehrs” where it is stored.
To search with the string (character sorted) “aehrs” and check if it is present in the tree (if it creates a valid path)
Start from the head node.
- If the pointer in the 0th location of the next array (index for ‘a’) is not null then move to next[0] (node1)
- If the pointer in the 4th location of the next array in ‘node1’ (index for ‘e’) is not null then move to next[4] (node2)
- If the pointer in the 7th location of the next array in ‘node2’ (index for ‘h’) is not null then move to next[7] (node3)
- If the pointer in the 17th location of the next array in ‘node3’ (index for ‘r’) is not null then move to next[17] (node4)
- If the pointer in the 18th location of the next array in ‘node4’ (index for ‘s’) is not null then move to next[18] (node5)
- The string has ended. We see if the file_pos array stores any file offset by checking if the first position (the counter) is greater than 0, if yes the search is successful and we have found the string. Now we will read the word list file with the file offsets and show the meaningful words.
If the applied string creates a ‘valid path’, that is, it traverses down the tree, and the file_pos counter at the last node is greater than zero, then the string will have file_pos[0] number of meaningful words. When traversing with a certain pattern, if a NULL link is encountered that means that pattern is not valid and has no meaningful permutations. If a string successfully traverses but the file_pos[0] is zero then, though it has successfullt traversed there are no meaningful words for that pattern. This is in the case when the applied string is a substring of a ‘valid path’ down the tree.
To have meaningful permutations, that is to have a valid path in the tree the conditions below must be satisfied:
- The sorted character pattern of a string must traverse down the tree
- After traversing down, the value of file_pos[0] of the last node visited must be greater than 0
Now we come up with a general algorithm
Algorithm to INDEX the tree with string2 read from word list:
ALLOCATE the 'head' node WHILE the word list is not finished, DO READ the file offset of the next word into 'current_offset' READ next word from word list into 'string2' SORT 'string2' INITIALIZE 'i' to 0 INITIALIZE 'current_node' with 'head' WHILE 'string2[i]' not equal to NUL convert the character 'string2[i]' to numerical index and store it in 'index' IF 'current_node -> next[index]' is equal to NULL, THEN ALLOCATE new node 'new' LINK 'new' to the index location of 'current_node' : 'current_node -> next[index]' = 'new' MOVE to newly allocated node: 'current_node' = 'current_node -> next[index]' ELSE IF 'current_node -> next[index]' is NOT equal to NULL, THEN MOVE to the node pointed by index: 'current_node' = 'current_node -> next[index]' INCREMENT 'i' : 'i'='i+1' END of inner while loop //Path Created, now to populate the offset INCREMENT 'current_node -> file_pos[0]' by 1 INSERT the current word offset value in the next 'file_pos' location: 'current_node -> file_pos[ file_pos[0] ] = current_offset' END of outer while loop
Algorithm to SEARCH the tree with string1 to find for a solition:
Read the jumble word to solve from user into 'string1' Sort string1' Initilize 'i' to 0 Initilize 'current_node' with 'head' While 'string1[i]' not equal to NUL convert the character 'string1[i]' to numerical index and store it in 'index' If 'current_node -> next[index]' is equal to NULL, Then 'string1' is not indexed in the tree: No solution for 'string1', search ends, break from loop Else If 'current_node -> next[index]' is NOT equal to NULL, Then Move to the node pointed by index: 'current_node' = 'current_node -> next[index]' Increment 'i' : 'i'='i+1' End of while loop If ('string1[i]' is equal to NUL, ie:'string1' is found ie)-AND-('current_node ->file_pos[0] > 0', ie: 'current_node' has file offset stored), Then Read file offset values from 'file_pos' array, and apply then to fetch the word from wordlist Else, There is solution to 'string1', ie: No meaningful words can be constructed by permuting the character sequence of 'string1'
This process is named ‘direct index’ because the index value of the next pointer array is calculated directly from the value of the character.
Now we will present equivalent C Language code of the above method. Only the main portion (the main loop) of the loading and searching method is presented below. This code is used in the poogram source code presented at the end of the section.
The C Language code segment to INDEX the word list is as below
while( ! feof( fp ) ) { current_node = head; current_file_position = ftell( fp ); fscanf(fp," %[^\n\r]",word); /* If the string word contains invalid characters, skip the word */ if( scan_for_valid_word( word ) == FAIL ) { continue; } word_count++; tolower_word( word ); sort_char( word ); /* While end of the word not reached advance in the histogram tree */ for( i = 0 ; word[i] != NUL ; i++ ) { index = char_to_index( word[i] ); if( node_is_empty( current_node -> next[index] ) == TRUE ) { temp = allocate_new_node(); current_node -> next[index] = temp; } current_node = current_node -> next[index]; } /* After a word has end, store word location offset */ set_current_word_location(current_node,current_file_position); } fclose( fp ); return head; }
The C Language code segment to SEARCH the word list is as below
/* Searches the tree pointed by head for the unsorted pattern word * * The function sorts the word characters and them searches the tree */ current_node = head; /* Backup original word */ strcpy( pattern , word ); /* Sort pattern and make lowercase to make it ready to be searched */ tolower_word( pattern ); sort_char( pattern ); /* While pattern has not ended, advance in the histogram tree through the links * * whenever the converted index points to null ie no path, stop searching, no match * * If the pattern traverses through the tree successfully and the pattern has ended * * check if the last node visited is a word end node, which is detected by, if the * * file_pos array stores atleast one word location offset. If yes search succesful * * return match node, else failed return NULL */ for( i = 0 ; pattern[i] != NUL ; i++ ) { index = char_to_index( pattern[i] ); if( node_is_empty( current_node -> next[index] ) == TRUE ) { return NULL; } current_node = current_node -> next[index]; } if( current_node -> file_pos[0] > 0 ) return current_node; /* this returned node will be passed to another function which will print the words */ else return NULL; }
These codes are used to construct the whole program. The whole program can be found in the downloads section.
More Examples
Now let us inspect some more cases:
Let us again consider the the six meaningful words of “opst” from the first example. Each word is scanned from the wordlist and then it is indexed in the tree making a unique “path” for all the strings which are permutations of each other. When the six words are read and sorted they will generate the same sorted permutation “opst” as we described before. Each such sorted permutation corresponding to a meaningful word creates a ‘vaild path’ which we load into the tree. Now we describe how the six meaningful words, after being read, is indexed/loaded in the tree.
Note: counting starts from zero
Start with the head node of the tree.
- The first character is ‘o‘, the 14th letter of the alphabet, allocate a new node ‘node1‘ and link it to the 14th location of the next array of the head node, and move to ‘node1‘.
- Next we read ‘p‘, the 15th letter of the alphabet, allocate a new node ‘node2‘ and link it to the 15th location of the next array of the current node ‘node1, and move to ‘node2‘.
- Next we read ‘s‘, the 18th letter of the alphabet, allocate a new node ‘node3‘ and link it to the 18th location of the next array of the current node ‘node2‘, and move to ‘node3‘.
- Next we read ‘t‘, the 19th letter of the alphabet, allocate a new node ‘node4‘ and link it to the 19th location of the next array of the current node ‘node3‘, and move to ‘node4‘.
- The string has ended. Now we read the offset value of the word from the wordlist file and then store it to the file_pos array in the current node ‘node4‘ and update the counter by incrementing it to 1
Thus all the words will be scanned, sorted and then indexed. If a node already exists then there is no need to allocate it and we will simply move to it. For example after the ‘valid path’ “opst” is created above, when the next string with the same sorted permutation will be inserted, it will have the same path, so when entering it there is not need to create new nodes instead it only needs to traverse down the path and at the end store the file offset value for the new word and update the counter. So for all those meaningful strings having same sorted character permutations travel the same path but have their file offsets stored in different locations in the file_pos array.
The above case is when two words has same sorted permutation. Now let us see the case of when the paths of two sorted permutation matches partially.
Let the program read a word “pros” which has the sorted permutation “oprs”.
- First we read ‘o‘, the 14th letter of the alphabet. The 14th location of the next array of the head node is already allocated, move to next[14] of current node (to node1)
- Next we read ‘p‘, the 15th letter of the alphabet. The 15th location of the next array of the current node (node1) is already allocated, move to next[15] of current node (to node2)
- Next we read ‘r‘, the 17th letter of the alphabet. The 17th location of the next array of the current node (node2) representing ‘r’ is empty, so we allocate it and then link that allocated node to the 17th location of the next array of the current node, and then move to it (current node’s next[17]). (At this point the path splits)
- Next we read ‘s‘, the 18th letter of the alphabet. The 18th location of the next array of the current node representing ‘s’ is empty, so we allocate it and then link that allocated node to the 18th location of the next array of the current node, and then move to it.
- The word has ended so we read the file offset value of the current word from the wordlist file and enter it into the file_pos array of the current node and update the file_pos counter accordingly
Note The sorted permutation from the previous examples “opst” and “oprs” has a common starting path because of which we find that the nodes are already made and we can proceed, but because the permutations are differnt at a point the paths pointed by the two strings split.
Similarly think of a word “fadge” and “deaf”. The first one has the sorted pattern “adefg” and the second one has “adef” . After the first one is read into the tree and we attempt to enter the second one we see that both the paths which the words would traverse are the same, but the length is different which makes the difference in the permutations. So when we reach the end of the string “adef” what we do is just store the file offset in the file location array and update the file_pos counter.
When we attempt to search for the sorted pattern “opxt” the tree with “opst” indexed in it, the below operations are done. Note how the search instantly breaks out when the first ,match is not found.
- First we read ‘o‘, the 14th letter of the alphabet. The 14th location of the next array of the head node, this is not empty, move to next[14] (to node1)
- Next we read ‘p‘, the 15th letter of the alphabet. The 15th location of the next array of the current node (node1), this is not empty, move to next[15] (to node2)
- Next we read ‘x‘, the 23th letter of the alphabet. The 23th location of the next array of the current node, This IS empty, break from the search, no ‘valid path’ exists, no solutions.
Take the example of “rampage”. Let “rampage” be indexed first, which creates a path “aaegmpr”. If we search the tree with the sorted permutation “aaegm”, then it is clear that “aaegm” will successfully traverse the tree, because it has a common path as “aaegmpr”. When the traversal will finish with “aaegm” it will also need to satisfy another condition, which is the file_pos[0] of the current node is greater than 0 (meaningful word exists), to make the path a ‘valid one’. And if both these satisfy then only we find a successful match
The Indexing Function
This is a very simple function. This function will return the position number in the alphabet. Like ‘a’ will be indexed as 0, b will be indexed as 1 and the values will be returned. Accessing the next pointer array with the letter ‘j’ means accessing the next pointer array with the index value 9. This can be very simply done by subtracting the ASCII value of ‘a’ (97) from each character. The upper case characters are first translated to lowercase then the index calculated. There is no significance in making separate uppercase letters in the tree because this carries no significance in jumble words and also increase huge amount of memory for pointers because the upper cases will be needed to indexed separately. The function also returns index values of three other special characters, hyphen: ‘-‘ , backtick: ‘`’ , apostrophe: ‘\” . The hypen is the most common occuring in words, like “man-to-man”.
The index function will be modified when we introduce an optimized version of the tree.
The indexing function code segment for the direct index method is as below
/* Convert a character to index to be used in the historgram_tree -> next array * * to move to the next node in the path. Each index is accessed directly * * Needs lower case only characters to be passed, else error will incur */ short int char_to_index(char c) { short int index; /* Convert the character to lowercase */ c = tolower(c); if( c >= 'a' && c <= 'z' ) { index = ( short int ) ( c - 'a' ); } else if( c == '\'' ) { index = 26; } else if( c == '-' ) { index = 27; } else if( c == '`' ) { index = 28; } else { /* Not a valid character */ return EMPTY; } return index; }
Case Sensitivity
The program only deals with lower case characters. Distingwishing between upper and lower case characters are not significant to solve jumble words. Whenever a string is read from thw wordlist or from the user both should be first converted to lower case before applying sorting and tree indexing or searching function.
The Memory Optimized Tree (searched index)
The version of the tree just presented using direct index takes huge amount of memory. Like for our test wordlist takes 86 Megabytes of memory in the ram. This is because a huge amount of the next pointer arrray locations go unused. We will make a change in the structure and change the routines to tally with this change to make the memory allocation optimum. The next array is made like the file_pos array. Instead of statically allocating all the 29 pointers we will first allocate only the first location. After each new node is to be inserted, the ‘realloc()’ function would be used to resize the next pointer and insert the pointer in the new created location. After this modification it cannot be determined in which location a character will be indexed, because now the each letter do not point to any unique fixed position. To solve this, a companion character array next_index is introduced. This is also like the file_pos array. The first position of next_index array contains the number of non-NULL links, that is the number of characters a certain node points to. A character is searched in the next_index array, if it is found in position i, then the link for the character is stored in the ith index of the next pointer array.
The process of storing:
When indexing with a certain character, it is searched in the next_index. If it is not found then the array is reallocated and one single storage location is increased, the counter at next_index[0] is increased, and the new character is stored at the new position (last). Else if the character is present in the next_index at the ith location, then the next link is accessed with the index i and proceeded to the next node.
The process of accessing:
First the next_index array is searched for the character to be indexed. Let it be found in the ith position of the array. That means the index for the character in question is (i-1) and next[i-1] represents the link. Else if the character is not found in the next_index list then the character has a NULL link, ie, it does not exists. In the previous versions these NULL links needed to be stored, which resulted in a wastage of memory. The i minus 1 is done because the first location of the next_index array stores a counter, but the array next first location does not store any counter so the index needs to be adjusted. Tip: The first location of the next array can be left unused to avoid the adjustment.
The modified structure is as below
typedef struct histogram_structure { int *file_pos; char *next_index; struct histogram_structure **next; }histogram_tree;
To implement this change basically the main change is needed in the indexing functino which will search for the requested character in the next_index list and return it’s index, with which the next pointer array will be accessed. The modified version of indexing function to work with the optimized search tree , the searched index method, is presented below.
/* Convert a character to index to be used in the historgram_tree -> next array * * to move to the next node in the path. Each index is accessed directly */ short int char_to_index(char c, histogram_tree *current_node) { short int index, n; /* The first location of the next_index stores the counter of how many indexes it has got stored */ /* Locate the position of character c in the next_index */ for( index = 1 , n = current_node -> next_index[0] ; index <= n ; index++) { if( current_node -> next_index[index] == c ) { /* If c is found at position 'index' then the link for character c is in the * * 'index' number location of the next histogram_tree pointer array */ return ( index - 1 ); } } return EMPTY; }
After implementing this modification, loading the same test wordlist takes 48 Megabytes in the memory. It reduces to 55% . The new structure looks like below. Because an array is needed to be searched to get the index of a character this process is named the searched index.
This process determins the index of the character in next array by searching for it in another companion array, this is because this process is named ‘searched index’.
Source Code of Jumble Word Solver: Tree Searching Method
Now we give the links of the C Programming Language source code of the both the above direct index and searched index methods described
The compressed archive contains
- wordlist file: words.txt
- direct_index tree source files in the direct_index directory, and a single file version of the same.
- searched_index tree source files in the searched_index directory, and a single file version of the same.
- compile.sh : a shell script file to compile the sources with gcc
- read me file : readme.txt
Complexity of the tree method
It is clear that there is no more to sweep down whole the file to search each word. The word list is indexed into memory at the begining. Then to find if the input jubmble word is in the tree it is sorted and then traversed down n levels, where n is the length of the string. If the string is a valid one and can be rearranged to form a meaningful word, then the search will be successful. To detect if the search is successful for a n length string only n oprations are needed. And if the search is unsuccessful then the search loop breaks before the string completes making less than n comparisons. Making the search O(n) and where n = length of search string / jumble word. Because the sort routine needs to be invoked before searching once, the searching algorithm will determin the overall complexity. If the quick sort is used the complexity will be O(n*log n) where n = length of search string / jumble word. The average length of words in the dictinary is limited, and small. For an invalid long input the search will break before the whole string is traversed. This makes the search technique very fast, needing only a few operations. After a match is found the words can be at once directly read from the word list file with the offset values, without traversing the file linearly.
The loding routine needs some time to index the word list, because it sorts each string in the word list and then indexes them into the tree. The main time is taken for sorting the strings which again grow with O(n*log n) time. if there are x words in the word list then the function is x * O(n*logn), where n = average length of words in word list
hey
r u using visual basic 6.0 for the codings??
absolutely not. The code is completely written in C language.
You are a Born Genius, Phoxis! Congratulations and all the best.
The design for the weblog is a tad off in Epiphany. Nevertheless I like your web site. I may need to install a
hey, its a really mind blowing code…i really liked it.
I also have a C code in which you enter jumbled words and that program gives you meaningful english words.
Hope you would like my code. Its very short and purely written in C.
for more details please visit
http://codingloverlavi.blogspot.in/2013/04/enter-jumbled-up-alphabets-and-find-all.html
This is a very old one from my blog. Actually this code has undergone many changes and I haven’t updated here. For example now instead of storing the file offsets in the dictionary file, the latest code loads the entire dictionary into memory, which doesn’t take much memory. Also using this module I have made single player scrabble word solver, the one comes in the news paper.
I will definitely visit your blog and check it out. Thanks for visiting.
Really you are a genius.
you manage the code very efficiently, its really great.
thanks to you also for uploading such interesting code and helping us to have a new thought for the same problem that we had already encountered.
I hope I get some time to update the post. Thanks for visiting again.
phoxis
can i challenge u can u convert it into java
Yes, definitely, but at this time I am involved with other stuffs and cannot do it. Actually I thought once to make it with Java. This post is a bit out dated. Here is the latest code: https://github.com/phoxis/wm You may want to give it a look.
Thanks for stopping by!
thx for a quick reply i am using java eclipse and some of the codes is unknown.. but when i run it on MS visual c++ its working fine
Ya, it’s in C, so MSVC++ should run it just fine. Java has a lot of classes with which you can do the same stuff without manually coding the trie. For a trie based implementation, I don’t know if Java already has Trie based class.
Thank you very much sir You help me a lot.
Glad to know it helped!
is there a class on java that will separate string to single chars
Check the methods charAt and getBytes of the String class.
docs.oracle.com/javase/6/docs/api/java/lang/String.html
Nice blog you havee
Thanks Ivan!