A few weeks ago, I went to Québec Ouvert Hackathon 3.3, and I was most interested in Michael Mulley's Open Parliament. One possible addition to the project is to cross-reference entries based not only on the parliament-supplied subject tags but also on the content of the text itself.

One possibility is to learn embeddings on bags of words, but using stemmed words to reduce the dimensionality of the one-hot vector, which is essentially a bitmap where the bit corresponding to a word is set to 1 if that word appears in the bag of words. So let us start at the beginning: stemming.

Stemming consists, more or less, in finding the longest shared prefix amongst a group of similar words. For example, the closely related words commie, commies, commune, communes, communist, communists, community, and communities have the common stem "comm". Some have longer common prefixes; for example, community and communities have "communit" in common. In order not to conflate too many words onto the same stem, we would like a method that somehow favors long stems over large groups of words. In other words, it may be OK if a stem corresponds to only 2-3 words, as long as it keeps a lexicographical or semantic unity among the words it represents.

One data structure that allows us to construct prefix trees quite easily is the trie. A trie is basically a tree where each node contains a letter and has any number of children, and where a word is stored as the path from the root of the tree to a leaf. We can have either "real" leaves, that is, nodes with no children, effectively marking the end of a word, or "internal" leaves, nodes with children that may also mark the end of a word. In the examples above, we want to be able to retrieve both commune and communes.

Inserting these words in such a trie would result in the following structure

where nodes that can also be leaves (either “real” or “internal”) are shown dotted. We can extract back all the words from the trie using a simple depth-first search, printing the current word/path whenever we reach a dotted node.

So now that we have a trie, we want to prune it in order to find stems that are as short as possible while keeping consistency within the group of words sharing each stem. We want something like this:

where the black nodes correspond to the stems and the gray nodes are pruned.

Luckily, implementing a trie isn't too complicated, especially for the operations we want to consider here: adding words, computing the number of words in the subtree of a given node, and finding the stems.

A trie is a tree, but not a binary tree: any node can have any number of children. One could use an array with the size of the alphabet, or not bother really and use an STL map. When we add a word, we start at the root of the trie (which corresponds to the empty word), seek or add a child corresponding to the first letter of the word, make that child the current node, then seek/add a child corresponding to the second letter of the word, and so on until we have added all the letters of the word to the trie. The next word added can either share the nodes already created or add new ones. Adding a word to a trie would look like

```cpp
class trie
 {
  ...

 public:

  void add(std::string word)
   {
    trie_node * current_node = root;
    current_node->add_weight(1);

    for (size_t i = 0; i < word.size(); i++)
     {
      trie_node * next_node = current_node->get_child(word[i]);

      if (next_node)
       current_node = next_node;
      else
       current_node = current_node->new_child(word[i]);

      current_node->add_weight(1);
     }

    current_node->set_leaf();
   }

  ...
 };
```

The `set_leaf()` and `add_weight()` functions play an essential role in stemming. Luckily for us, we can update the number of words in a sub-tree at the same time as we add words, yielding an algorithm linear in the total number of letters added (it's linear in the number of words added, and adding a word is linear in the number of letters in that word).

* * *

Now that we have a trie containing all our words and with each node knowing how many leaves are (at or) below it, we can find the stems.

The algorithm I implemented is not complicated at all: a simple depth-first search that stops whenever it reaches a node with a weight strictly less than `leaf_weight`. It looks something like

```cpp
void show_stems(const trie_node * this_node,
                std::string & stack,
                size_t leaf_weight)
 {
  std::string local_stack = stack + this_node->get_symbol();

  if (this_node->get_weight() < leaf_weight)
   std::cout << local_stack.substr(1) << std::endl;
  else
   for (std::map<char, trie_node *>::const_iterator
         i = this_node->get_children().begin();
        i != this_node->get_children().end();
        ++i)
    show_stems(i->second, local_stack, leaf_weight);
 }
```

Here, the string `stack` keeps track of the word seen so far.

* * *

How well does it work? Well, not too bad! For a few example words, with a `leaf_weight` of 10, we get:

| Word | Stems |
|------|-------|
| communalise communalism communalist communality communalization communalize communard commune communer communicability communicableness communicant communicate communicatee communication communicativeness communicator communing communion communique communisation communise communism communist communitarian communitarianism community communization communize | communa commune communic communin communio communiq communis communit communiz |

* * *

To stem a word, we can either look for the longest matching prefix in the stem table (an operation whose complexity depends on how you implement the stem table), or we can use the trie itself and walk it down until we reach a node with a weight strictly less than the threshold.

* * *

Now that we have stems, we can compute the one-hot vector (all this for that). We just enumerate the stems, the first being the zeroth, the second the first, and so on. We scan a paragraph, stem each word, and set the hot bits corresponding to each stem found.

We’ll continue from there in a future entry.

* * *

The complete source (with two example dictionaries) can be found here.