Friday, October 31, 2008

Catch them in the URL net.

In The Problem With URLs, Codinghorror ponders how to regex his way to catching URLs mixed in with users' text.


The whole discussion then revolves around either writing a better rule or regex, or forcing users to delimit their URLs properly (with angle brackets, square brackets, bbcode tags...).


But this is foolish. 


Parsing text for things like URLs is a problem similar to detecting spam - the inputs are as varied as people can imagine them. So stop trying to handle it with a single set of fixed, universal rules!


Suggested algorithm (a rough Python sketch follows the list):


  1. Get the feasible string: everything from the beginning of what you think might be a URL up to the first space (or other illegal character), allowing for i18n - you can use your regex here - along with a list of any brackets left open in the surrounding paragraph ( (, [, < etc.).

  2. Generate a series of possible URLs from it by dropping, from the end, each character that could be wrong:
    " Give me a URL (like http://www.example.com/querya?b)(ideally)? " becomes:

    1. "http://www.example.com/querya?b)(ideally)?"

    2. "http://www.example.com/querya?b)(ideally)"

    3. "http://www.example.com/querya?b)"

    4. "http://www.example.com/querya?b"

    5. "http://www.example.com/querya"


    and any other variations you find useful.

  3. Assign each one a rating based on:
    • whether there are unbalanced parentheses inside

    • whether a close bracket would balance one left open in the paragraph - in this example the bracket opened before "like" would be balanced by the close bracket after "?b", so that lowers the scores for candidates 1 and 2

    • whether the URL is sensible - "blah.com/)" is less sensible than "blah.com/"

    • any other good/bad valuation you can think of


  4. Rank the options

  5. If the top two (or more) options are very close or equal in ranking, test for the existence of each by polling the URLs in ranked order until you find a real one. If you tune the threshold of how close is close, you should only be polling in rare cases. If you don't like polling, just pick the top one - you can't outwit every idiot or mistake.

  6. Finally, return the selected URL.
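
Here is a rough sketch of the whole list in Python. It is a toy under stated assumptions, not a definitive implementation: the regex, the set of 'questionable' characters, the scoring weights and the tie threshold are all mine, and real code would want a fuller character set and smarter heuristics.

    import re
    import urllib.request

    URL_START = re.compile(r"https?://\S+", re.IGNORECASE)
    PAIRS = {"(": ")", "[": "]", "<": ">"}
    QUESTIONABLE = set("()[]<>?.,;:!'\"")

    def candidates(text):
        # Steps 1-2: grab the feasible string, then generate one variant
        # per truncation point at a questionable character.
        match = URL_START.search(text)
        if not match:
            return []
        raw = match.group()
        # Brackets left open in the prose before the URL started.
        open_before = {c for c in text[:match.start()] if c in PAIRS}
        variants = {raw}
        for i, ch in enumerate(raw):
            if ch in QUESTIONABLE and i > len("http://"):
                variants.add(raw[:i])
        return [(v, score(v, open_before)) for v in variants]

    def score(url, open_before):
        # Step 3: rate a candidate; higher means more plausible.
        s = 0
        depth = {opener: 0 for opener in PAIRS}
        closer_of = {c: o for o, c in PAIRS.items()}
        for ch in url:
            if ch in PAIRS:
                depth[ch] += 1
            elif ch in closer_of:
                opener = closer_of[ch]
                if depth[opener] > 0:
                    depth[opener] -= 1   # balanced inside the URL
                elif opener in open_before:
                    s -= 2               # probably closes the prose's bracket
                else:
                    s -= 1               # stray closer, suspicious
        s -= sum(depth.values())         # openers never closed
        if url[-1] in "?.,;:!'\"":
            s -= 1                       # trailing punctuation is suspect
        return s

    def best_url(text, poll=False, threshold=0):
        # Steps 4-6: rank, optionally poll near-ties, return the winner.
        ranked = sorted(candidates(text),
                        key=lambda cv: (cv[1], len(cv[0])), reverse=True)
        if not ranked:
            return None
        tie = len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= threshold
        if poll and tie:
            for url, _ in ranked:
                try:
                    req = urllib.request.Request(url, method="HEAD")
                    urllib.request.urlopen(req, timeout=3)
                    return url           # first candidate that really exists
                except Exception:
                    continue
        return ranked[0][0]

On the example above, "http://www.example.com/querya?b" comes out top: the bracketed variants are penalised, and length breaks the remaining tie with the shorter prefixes. With poll=True, that score tie would trigger a pass of HEAD requests.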


There are endless ways to improve it beyond even that - you could even try balancing the parentheses such that your Wikipedia article has its missing bracket fixed. At some point, perhaps, it becomes a bit pointless, but if this is all in a library and isn't too slow, nobody need rewrite it again, and the users are happy.
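
To give a flavour of that repair step, here is a minimal sketch; the rule is my own guess at what 'fixing' might mean (append the missing closer at the end rather than truncating), and the Wikipedia URL is just an illustration:

    def repair_brackets(url):
        # Append a closer for each bracket the URL opens but never
        # closes. Blunt assumption: the closer belongs at the very end.
        for opener, closer in {"(": ")", "[": "]"}.items():
            missing = url.count(opener) - url.count(closer)
            url += closer * max(missing, 0)
        return url

    # repair_brackets("http://en.wikipedia.org/wiki/Firefly_(TV_series")
    #   -> "http://en.wikipedia.org/wiki/Firefly_(TV_series)"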


For me, the power of the method is in using ranking to allow unlikely options - unless you can cleanly separate all the possible inputs on a Venn diagram (and here you can't), some rules will work for some sets of inputs and others for others, and you'll never find a complete set that works for all of them.


In short, regexes on their own only work for uniform and predictable input.


Other fun games:


  • finding wrongly-typed URLs and correcting them for the user

  • suggesting 'better' URLs (did you mean .com instead of .vom? - see the sketch after this list)

  • suggesting (and automating) the use of tinyurl or other URL-compacting services for URLs that are very long or span multiple lines
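
For the middle suggestion, a toy sketch of the '.vom' case - the whitelist and the one-character-substitution rule are assumptions; a real checker would use the full list of registered TLDs:

    COMMON_TLDS = ["com", "org", "net", "edu", "gov", "info"]

    def suggest_tld(url):
        # If the TLD isn't recognised but is one substitution away
        # from a known one, offer the corrected URL (.vom -> .com).
        host = url.split("//")[-1].split("/")[0]
        tld = host.rsplit(".", 1)[-1]
        if tld in COMMON_TLDS:
            return None
        for known in COMMON_TLDS:
            if (len(known) == len(tld)
                    and sum(a != b for a, b in zip(known, tld)) == 1):
                return url.replace("." + tld, "." + known, 1)
        return None

    # suggest_tld("http://www.example.vom/page")
    #   -> "http://www.example.com/page"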


Droplet or sign of a coming flood?
