Friday, October 31, 2008

Catch them in the URL net.

In The Problem With URLs, Codinghorror ponders how to regex his way to catching URLs mixed in with users' text.


The whole discussion revolves around either how to write a better rule or regex, or whether to force users to delimit their URLs properly (with angle brackets, square brackets, BBCode tags...)


But this is foolish. 


Parsing text for things like URLs is a problem similar to detecting spam - the inputs are as varied as people can imagine them. So stop trying to handle it with a series of fixed and universal laws!


Suggested algorithm:


  1. Get the feasible string: everything from the start of what you think might be a URL to the first space (or illegal character), allowing for i18n - your regex can do this - along with a list of any brackets left open earlier in the paragraph ( '(', '[', '<', etc.).

  2. Generate a series of possible URLs from it by dropping, from the end, each character that could be wrong:
    " Give me a URL (like http://www.example.com/querya?b)(ideally)? " becomes:

    1. "http://www.example.com/querya?b)(ideally)?"

    2. "http://www.example.com/querya?b)(ideally)"

    3. "http://www.example.com/querya?b)"

    4. "http://www.example.com/querya?b"

    5. "http://www.example.com/querya"


    and any other variations you find useful.

  3. Assign each one a rating based on:
    • whether there are unbalanced parentheses inside

    • whether the parenthesis would balance open ones in the paragraph - in this example the paragraph's open bracket would be balanced by the URL's close bracket, so that lowers the scores for options 1 and 2

    • whether the URL is sensible - "blah.com/)" is less sensible than "blah.com/"

    • any other good/bad valuation you can think of


  4. Rank the options

  5. If the top two (or more) options are very close or equal in ranking, test each for existence by polling the URLs in ranked order until you find a real one. If you tune the threshold for how close is close, you should only be polling in rare cases. If you don't like polling, just pick one - you can't outwit every idiot or mistake.

  6. Finally, return the selected URL
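The steps above can be sketched in a few dozen lines - a minimal sketch, assuming Java, where the trim set, the scoring weights, and the longer-wins tie-break are all my own placeholder choices, not a tuned model:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class UrlNet {

    // Characters legal in a URL but suspect as break points (my own trim set).
    private static final String SUSPECT = "()[]<>?!.,;:";

    // Step 2: generate candidates by cutting the string before each suspect character.
    static List<String> candidates(String feasible) {
        List<String> out = new ArrayList<>();
        out.add(feasible);
        for (int i = feasible.length() - 1; i > 0; i--) {
            if (SUSPECT.indexOf(feasible.charAt(i)) >= 0) {
                out.add(feasible.substring(0, i));
            }
        }
        return out;
    }

    // Step 3: a toy rating - the weights are placeholders.
    static int score(String url, int parensOpenInParagraph) {
        int depth = 0, strayClose = 0;
        for (char c : url.toCharArray()) {
            if (c == '(') depth++;
            else if (c == ')') { if (depth > 0) depth--; else strayClose++; }
        }
        int s = 0;
        if (depth == 0 && strayClose == 0) s += 2;               // parentheses balance inside the URL
        if (strayClose > 0 && parensOpenInParagraph > 0) s -= 2; // ')' probably closes the paragraph's '('
        if (depth > 0) s -= 1;                                   // unbalanced '(' left inside
        if (url.endsWith("/)") || url.endsWith(".")) s -= 1;     // "blah.com/)" is less sensible
        return s;
    }

    // Steps 4-6: rank (longer candidate wins ties) and return the best.
    static String best(String feasible, int parensOpenInParagraph) {
        return candidates(feasible).stream()
                .max(Comparator.comparingInt((String u) -> score(u, parensOpenInParagraph))
                        .thenComparingInt(String::length))
                .orElse(feasible);
    }
}
```

On the worked example, `best("http://www.example.com/querya?b)(ideally)?", 1)` picks "http://www.example.com/querya?b" - the longest candidate whose parentheses balance.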


There are endless ways to improve it beyond even that - you could even try balancing the parentheses so that your Wikipedia article has its missing bracket fixed. At some point, perhaps, it becomes a bit pointless, but if this is all in a library and isn't too slow, nobody need rewrite it again, and the users are happy.


For me, the power of the method is in using ranking to allow unlikely options - unless you can separate all the possible inputs on a Venn diagram (which you can't here), then some rules will work for some sets of inputs, and others for others, and you'll never find a complete set that works for all of them.


In short, regexes on their own only work for uniform and predictable input.


Other fun games:


  • consider trying to find wrongly-typed URLs and correcting them for the user

  • providing suggestions on 'better' URLs (did you mean .com instead of .vom?)

  • suggesting (and automating) the use of tinyurl or other URL compacting services for long or multiple-line-spanning URLs
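The ".vom" suggestion in particular could be a simple edit-distance check against a list of known TLDs - a hedged sketch, where the tiny sample TLD list and the one-character-slip threshold are my own choices:

```java
import java.util.List;

class TldSuggester {

    // A tiny sample of known TLDs - a real list would be much longer.
    private static final List<String> KNOWN_TLDS = List.of("com", "net", "org", "edu", "gov");

    // Classic Levenshtein edit distance between two short strings.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // "Did you mean .com instead of .vom?" - nearest known TLD, or null if none is close.
    static String suggest(String tld) {
        if (KNOWN_TLDS.contains(tld)) return tld;  // already valid
        String best = null;
        int bestDist = 2;                          // only suggest for one-character slips
        for (String known : KNOWN_TLDS) {
            int dist = editDistance(tld, known);
            if (dist < bestDist) { bestDist = dist; best = known; }
        }
        return best;
    }
}
```

So `suggest("vom")` yields "com", while something like "xyz" is too far from anything known and gets no suggestion.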


Droplet or sign of a coming flood?

Friday, October 24, 2008

Interface-defined variables (or A quick idea on hybrid or diluted type systems 2)

Having expounded a vague hope in A quick idea on hybrid or diluted type systems, I came to realise that what I was really doing was defining a series of interfaces.

We can already define objects via interfaces:
ISomeInterface MyProperty = new ClassImplementingISomeInterface();


What if we limited ourselves to defining every method or property via interfaces? Thus I would have a class MyClass, with properties IStringable NameString, IEnumerable<INumber> NumList, etc. That way the class would be maximally flexible, as each property would make only the minimum requirements of its inputs. Of course, if I pass an object that implements an interface derived from the required one, it can still be cast to the minimum required interface.
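A rough Java translation of the idea, with the standard CharSequence and Iterable<Integer> standing in for the hypothetical IStringable and IEnumerable<INumber> (the member names are illustrative):

```java
// Every member is declared by the most minimal interface it needs,
// so any implementing type - however derived - can be supplied.
class MyClass {
    CharSequence nameString;    // String, StringBuilder, ... all qualify
    Iterable<Integer> numList;  // List, Set, any custom iterable

    MyClass(CharSequence name, Iterable<Integer> nums) {
        this.nameString = name;
        this.numList = nums;
    }

    int sum() {
        int total = 0;
        for (int n : numList) total += n;  // only Iterable's contract is used
        return total;
    }
}
```

The same class then accepts `new MyClass("abc", List.of(1, 2, 3))` just as happily as a StringBuilder paired with an Arrays.asList - the dynamic-feeling flexibility the paragraph above is after.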


Excellent, a dynamic and gregarious acceptance of many different static types - some of the innate strengths of dynamic languages in a statically-typed one.


Web Services


Let us step back for a moment and consider how this would help in the world of web services. A web service needs to be forgiving in what it will accept, and strict in its output, if it is to be used by the maximum set of other web services. If we assume that the input and output are various flavours of xml, then the static type is equivalent to the exact DTD of an xml doc, and the set of interfaces is, well, unknown.


In looking at REST interfaces, I came across a discussion of how to handle the inevitable change of interface versions when providing a RESTful API. Kalsey decides, in the end, to provide a version tag in the xml output file, containing pointers to the current, previous and latest API versions, with date-based URLs to provide access to specific versions of the API interface.


If, instead, we consider a request on a REST URI as a request for certain data conforming to a certain interface, then we can skip versioning. We define, instead, a series of smaller interfaces, and allow the user to request any subset of them. Thus my request xml should supply input data in a form that, once deserialised from xml into an object, can be cast into any of the simpler interfaces.
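One way to picture "request any subset of interfaces" - a sketch with invented interface names, where a single response object implements several small interfaces and the service simply checks which of the requested ones it satisfies:

```java
// Hypothetical small interfaces a response might satisfy.
interface HasTitle   { String title(); }
interface HasAuthor  { String author(); }
interface HasSummary { String summary(); }

// One response object can implement several of them at once; a client
// casts down to whichever subset it asked for and ignores the rest.
class ArticleResponse implements HasTitle, HasAuthor {
    public String title()  { return "Interface-defined variables"; }
    public String author() { return "anonymous"; }
}

class RestSketch {
    // The service answers: does this response satisfy the requested interface?
    static boolean satisfies(Object response, Class<?> iface) {
        return iface.isInstance(response);
    }
}
```

A client asking only for HasTitle never notices when HasAuthor is added or reshaped - which is the versioning escape hatch the paragraph above describes.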

Stack Overflow Collab

Stack Overflow is a new hub for the exchange of coding knowledge, built to leverage the innate tendencies of coding types to want to be right, to want to solve problems, and to gain kudos. So it has attracted many coders in search of recognition, and shows the important trait that its backers (Jeff Atwood, Joel Spolsky etc.) are willing to follow the curve of user feedback. Thus SO is a community responding to individual problems.

The Open Source movement, by contrast, has built up from existing partnerships (largely from the Real World) forming relaxed coding collaborations. The larger projects have then acquired others from across the net who were interested and got involved. The key here is that OSS is a community responding to a collective demand.

So how about filling in that great Venn overlap of community drawn by solutions to individual problems, and the productivity of a community solving a larger problem together? In other words, SO, give us a collaboration hub. Surely it's the ideal place for the inexperienced to participate at the fringes of an OSS project, and for experienced coders to progress up the kudos chain by contributing work and experience to a project, rather than a single point of expertise.

The system, interface, ranking, discovery and how it all overlaps with the overflowing stack of questions and answers on SO are subsidiary issues. But since people's individual projects gave rise to problems, which gave rise to the need for SO, let the circle complete, and the solvers coagulate and combine to produce new projects.

I came, I saw, I solved, I wondered about the project, I got involved, we collaborated, we solved, we progressed, we share the story, we conquer.