A wide character trie implementation

It’s been a long time since I posted here. Therefore I decided to dig out some quick and straightforward stuffs from my disk which I previously decided should’t be in this blog.

I have posted multiple trie and dictionary search based programs in C, C++ and Perl before Jumble Work Solver and Jumble Work Solver Again. This time (again!) it’s about a trie. Although this time there was a specific requirement from a group who needed to implement a trie based word distance counting for bangla language, therefore Unicode support. This was supposed to be modified more and plugged into a spelling correction for scanned OCR text in the Bengali language.

Initially I suggested that a ternary tree would be more appropriate as the memory cost for standard trie (not compressed) would be huge. Although, finally the decision was to go with plain and simple trie. I know it is mad, but this is what it is :D.

Also, before going into the implementation, I should note that there is an efficient implementation of Radix Tree present in libcprops library. Though which we won’t be using for this one. I would recommend you people to have a look into this library if you already haven’t seen it yet.

Let’s say, we have a word “hello” and a node with an array of pointers of length 26 representing each character of the English language, each of which indicates that if the i^th character follows the character represented by this node. Therefore for this example “hello”, the head node’s array of pointers will have the location 7 pointing to another node, which will have the 4th location of the pointer pointing to another node, whose pointer array’s 11 th location will point to another node and so on. Note here the indexing starts from zero. When the word ends, then the next pointer can be pointed to a special terminal marker node, which can be common to all, and the node is marked as a terminal node. This is essential as a valid word can be a substring of another valid word, in which case the shorter would need to be decided. When another new word like “help” comes in, we will follow the same path upto “hel” created by the word “hello”, and as we find there are no pointer for “p” pointing to any node, a new node will be created as explained before.

Have a look here and here on how trie works.

The essential component of the trie here the mapping of the character to the pointer index in the node. In the case of ASCII it is simple, as the possible combinations are limited, and each character can be mapped directly to an integer by doing a plain subtraction. Easiest way is to treat all the characters as lowercase and then subtract the ASCII value of “a”, as all the alphabetical characters are in a sequence.

In the case of Unicode we just need to change the mapping function. That is exactly what is done in this implementation.

The main functions are

void init_wchar_mapper (wchar_t *chars_to_map)

This function init_wchar_mapper, initialises the system with a character mapping set which is to be used. In case of English we make a map of Unicode English characters like as follows.

wchar_t english_map[] = {
                          L'a', L'b', L'c', L'd', L'e', L'f', L'g', L'h', L'i', L'j', L'k', 
                          L'l', L'm', L'n', L'o', L'p', L'q', L'r', L's', L't', L'u', L'v', 
                          L'w', L'x', L'y', L'z', L'A', L'B', L'C', L'D', L'E', L'F', L'G',
                          L'H', L'I', L'J', L'K', L'L', L'M', L'N', L'O', L'P', L'Q', L'R', 
                          L'S', L'T', L'U', L'V', L'W', L'X', L'Y', L'Z', L'1', L'2', L'3', 
                          L'4', L'5', L'6', L'7', L'8', L'9', L'0', // Add more here
                          L'\0'
                        };

And then in the main program before doing anything, call the function as follows.

init_wchar_mapper (lang_map);

This will map each of the Unicode characters in the passed map into a hash wchar_mapper, where the key is the character and the value is its corresponding index. The indices are populated in the same order of the characters in the map.

Also, another hash is made wchar_reverse_mapper which does the reverse mapping, that is, the index of an array to Unicode character. This is done to perform tasks as autocomplete or find all the words starting from any arbitrary node, or whereever we need to know the character which is represented by a link.

static std::unordered_map<wchar_t,int> wchar_mapper;
static std::unordered_map<int,wchar_t> wchar_reverse_mapper; // For autocomplete link search
int max_keys = 0;

/* Last element of 'chars_to_map' SHOULD be L'\0'. 
 * Else we will be in trouble.
 */
void init_wchar_mapper (wchar_t *chars_to_map)
{
  int i;
  
  for (i=0; chars_to_map[i] != L'\0'; i++)
  {
    max_keys++;
    wchar_mapper.insert (std::pair<wchar_t,int> (chars_to_map[i], i));
    wchar_reverse_mapper.insert (std::pair<int,wchar_t> (i, chars_to_map[i]));
  }
}

Once this is made the Unicode character to integer mapping is straightforward.

int wchar_map_to_int (wchar_t key)
{
  std::unordered_map<wchar_t,int>::const_iterator retpair = wchar_mapper.find (key);
  
  if (retpair == wchar_mapper.end ())
  {
    return -1;
  }

  return retpair->second;
}

And so is the reverse mapping


wchar_t int_map_to_wchar (int key)
{
  std::unordered_map<int,wchar_t>::const_iterator retpair = wchar_reverse_mapper.find (key);
  
  if (retpair == wchar_reverse_mapper.end ())
  {
    return -1;
  }
  
  return retpair->second;
}

Once this is made, any kind of indexing operation on the pointer array is performed using the wchar_map_to_int function, and when doing a Depth First Search (DFS) to list all possible words, we can use int_map_to_wchar to get back the represented character by from the pointer index.

Here is a Github link of the code: https://github.com/phoxis/codez/tree/master/wchar_trie

Note that the bengali word list file is not uploaded as I do not have the permission to attach it here. For your language you need to add your maps to set_maps.hpp and recompile. From a more flexible implementation, one can always decouple the mapping of the character to the index in a plain text file, and then read them on the run. In this way the application will not need recompilation, though this was not done here.

Let me know how you would handle such a task, would you use such a simple trie with huge memory requirements, or go for other datastructures.

A wide character trie implementation

Share this:

Related

Leave a comment Cancel reply