Ambiguous Domain Names

Two weeks ago I attended the Hackreduce Hackathon at Notman House to learn about Hadoop. I joined a few people I knew (and some I had just met) to work on a project whose goal was to extract images from Wikipedia and see whether an image’s popularity, measured as the number of references to it, correlates with some of its intrinsic characteristics.

But two other guys I know (David and Ian) worked on a rather amusing problem: finding domain names that can be parsed in multiple, hilarious ways. I decided to redo their experiment, just for fun.

So the problem is as follows: can a domain name be parsed in different ways in the absence of dashes and dots? For example

http://www.speedofart

can be parsed as “speed of art” (which I guess is the intended parse). The domain

http://www.antiquesexchange.com

is actually hard to parse as “antiques exchange”.

But how do we find the alternate parsings? First, we get the largest list of words we can find. The one I used is the AGID-4 dictionary, from Kevin’s Word List Page. The file is far from perfect (a lot of plurals, such as “women”, and simple words like “bigger” or “of” are missing), but it is usable for the purpose of creating a parser that generates alternative parses.

The algorithm I used is a simple greedy parser that uses depth-first search to generate the alternative parses, plus some heuristics to eliminate uninteresting solutions. The dictionary knows the individual letters a, b, …, z, which causes unknown words to be broken into a large number of pieces; for example, docktrous may end up being parsed as “dock t r o us”, which is not likely to be a valid parsing.

So the algorithm proceeds as follows: it tries to find known words that match the start of the domain name, trying the longest first and the shortest last. Once a word is found that matches the beginning of the domain name, it is pushed on a word stack, removed from the domain name, and the procedure is applied again to the remainder of the domain name. When the recursive call returns, the word is popped from the stack so that shorter candidates can be tried as well. The recursion bottoms out when the domain name has been entirely cut up into words, at which point the stack of words is (if deemed interesting) printed as a possible parse.

The word list is stored as a std::set, which lets us test for valid words very simply; furthermore, we use the string itself to provide candidate words rather than rummaging through the dictionary to find words that may fit. The main loop looks like:

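#include <iostream>
#include <set>
#include <string>
#include <vector>

// fragments() is a helper that is not shown here; it counts
// the 1-letter words in the stack.
void multi_parse(const std::set<std::string> & dictionary,
                 std::vector<std::string> & stack,
                 std::string remains,
                 size_t & best_so_far)
 {
  if (remains.size())
   {
    // try every prefix of what remains, longest first
    for (size_t len=remains.size();len;len--)
     {
      std::string  word=remains.substr(0,len);
      if (dictionary.find(word)!=dictionary.end())
       {
        // known word: keep it and parse the rest
        stack.push_back(word);
        multi_parse(dictionary,
                    stack,
                    remains.substr(len),
                    best_so_far);
        stack.pop_back(); // backtrack, try shorter prefixes
       }
     }
   }
  else
   {
    // nothing remains: the stack holds a complete parse
    // show stack
    if ((stack.size()<8) // number of words
        && (fragments(stack)<4) // checks for the number of 1-letter words
        && (stack.size()<=best_so_far) // smaller parse?
        )
     {
      best_so_far=stack.size();
      for (size_t i=0;i<stack.size();i++)
       std::cout << " " << stack[i];
      std::cout << std::endl;
     }
   }
 }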
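
The fragments helper called in the main loop is not listed in the post. Going only by the comment next to the call (“checks for the number of 1-letter words”), it could look something like the sketch below. The signature and the body are my guess at it, not the original code:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical reconstruction: counts the one-letter words
// in a candidate parse (the original helper is not shown).
std::size_t fragments(const std::vector<std::string> & stack)
 {
  std::size_t count=0;
  for (const std::string & word : stack)
   if (word.size()==1)
    count++;
  return count;
 }
```

With this version, the parse “a mini storage online” counts one fragment (the “a”), which is below the limit of 4 used in the main loop.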

Called using the AGID-4 dictionary on a domain name such as “www.aministorageonline.com” (stripping both “www.” and “.com”), the above yields the following “interesting” parses:

aministorageonline
 amin is tora geo n line
 amin is tora ge online
 amin is tor age online
 amin is to rage online
 amin i storage online
 a mini storage online

We find the intended parse “a mini storage online” and the amusing alternate “Amin is to rage online”. I do not know why “ge” is a word in the dictionary, but that’s another story.
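
The stripping of “www.” and “.com” mentioned above is not shown in the post either. A minimal sketch of that step, assuming the name arrives as a plain string without a scheme or a trailing slash; the name strip_domain is made up for this example:

```cpp
#include <string>

// Hypothetical helper: removes a leading "www." and a
// trailing ".com", leaving the part that gets parsed.
std::string strip_domain(std::string name)
 {
  const std::string prefix="www.";
  const std::string suffix=".com";
  // drop the prefix if the name starts with it
  if (name.compare(0,prefix.size(),prefix)==0)
   name.erase(0,prefix.size());
  // drop the suffix if the name ends with it
  if (name.size()>=suffix.size()
      && name.compare(name.size()-suffix.size(),
                      suffix.size(),suffix)==0)
   name.erase(name.size()-suffix.size());
  return name;
 }
```

The result, for instance strip_domain("www.aministorageonline.com"), is what would be handed to multi_parse as remains, with an empty stack and best_so_far initialized to something large.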

*
* *

Some are even hard to parse as intended. Here are a few I found:

http://www.aceofart.com → Ace of Art

http://www.aspenislandranch.com → Aspen Island Ranch

http://www.bobcoshandjob.com → Bob Cosh and Job

http://www.europeniscola.com → Europe Niscola

http://www.antiquesexposition.com → Antiques Exposition

http://www.actorsexchange.com → Actors’ Exchange

Some are just… well, they make you wonder what the people who registered them were thinking:

http://www.aaaaaughibbrgubugbugrguburgle.com/

http://www.lalalalalalalalalalalalalalalalalala.com/

*
* *

So, if you’re going to choose a domain name, maybe you want to pass it through the alternate parser, you know, just to make sure.
