Friday, October 31, 2008

Catch them in the URL net.

In The Problem With URLs, Codinghorror ponders how to regex his way to catching URLs mixed in with users' text.

The whole discussion becomes about either how to write a better rule or regex, or whether to force users to delimit their URLs properly (with angle brackets, square brackets, bbcode tags...)

But this is foolish. 

Parsing text for things like URLs is a similar problem to trying to detect spam - the inputs are as varied as people can imagine them. So stop trying to deal with it using a series of fixed and universal laws!

Suggested algorithm:

  1. Get the feasible string: from the beginning of what you think might be a URL to the first space (or illegal character), allowing for i18n - you can use your regex here - and a list of any open parenthesis-type characters in the paragraph ('(', '[', '<', etc.).

  2. Generate a series of possible URLs from it, by dropping each of the characters from the end that could be wrong:
    "Give me a URL (like http://example.com/page), thanks" becomes:

    1. "http://example.com/page),"

    2. "http://example.com/page)"

    3. "http://example.com/page,"

    4. "http://example.com/page"

    5. "http://example.com"

    and any other variations you find useful.

  3. Assign each one a rating based on:
    • whether there are unbalanced parentheses inside

    • whether the parenthesis would balance open ones in the paragraph - in this example the open bracket in "(like" would be balanced by the close bracket, so that lowers the scores for 1. and 2.

    • whether the URL is sensible - "http://example.com/page)," is less sensible than "http://example.com/page"

    • any other good/bad valuation you can think of

  4. Rank the options

  5. If the top two (or more) options are very close or equal in ranking, then test for the existence of each by just polling the URLs in ranked order until you find a real one. If you adjust the threshold of how close is close, you should only be testing in rare cases. If you don't like polling, just pick one, you can't out-unwit every idiot or mistake.

  6. Finally, return the selected URL
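For the curious, here is a Python sketch of those steps (minus the polling). The trimming rules and the scoring weights are my own arbitrary choices, just to show the shape of the thing - a real version would tune them against real user text:

```python
import re

URL_RE = re.compile(r'https?://\S+')
TRAILING = '.,)]>;:!?\'"'   # characters at the end that could be wrong

def candidate_urls(text):
    """Steps 1-2: grab the feasible string and generate trimmed variants."""
    match = URL_RE.search(text)
    if not match:
        return []
    feasible = match.group(0)
    candidates = [feasible]
    while feasible and feasible[-1] in TRAILING:
        feasible = feasible[:-1]
        candidates.append(feasible)
    return candidates

def score(candidate, open_parens=0):
    """Step 3: a toy rating - higher is more URL-like."""
    s = 0
    if candidate.count('(') != candidate.count(')'):
        s -= 2   # unbalanced parentheses inside the URL are suspicious
    if open_parens and candidate.endswith(')'):
        s -= 3   # that ')' probably closes the paragraph's '(' instead
    if candidate and candidate[-1] in '.,;:!?':
        s -= 1   # trailing punctuation is rarely part of a URL
    return s

def best_url(text):
    """Steps 4-6: rank the candidates and return the winner."""
    candidates = candidate_urls(text)
    if not candidates:
        return None
    before = text[:URL_RE.search(text).start()]
    open_parens = before.count('(') - before.count(')')
    return max(candidates, key=lambda c: score(c, open_parens))
```

Note that a URL whose own parentheses balance, like a Wikipedia article with brackets in its title, keeps its closing bracket - exactly the case that defeats a lone regex.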

There are endless ways to improve it beyond even that - you could even try balancing the parentheses such that your wikipedia article has its missing bracket fixed. At some point, perhaps, it becomes a bit pointless, but if this is all in a library and isn't too slow, nobody need rewrite it again, and the users are happy.

For me, the power of the method is in using ranking to allow unlikely options - unless you can separate all the possible inputs on a Venn diagram (which you can't here), then some rules will work for some sets of inputs, and others for others, and you'll never find a complete set that works for all of them.

In short, regexes on their own only work for uniform and predictable input.

Other fun games:

  • consider trying to find wrongly-typed URLs and correcting them for the user

  • providing suggestions on 'better' URLs (did you mean .com instead of .vom?)

  • suggesting (and automating) the use of tinyurl or other URL compacting services for long or multiple-line-spanning URLs

Droplet or sign of a coming flood?

Friday, October 24, 2008

Interface-defined variables (or A quick idea on hybrid or diluted type systems 2)

Having expounded a vague hope in A quick idea on hybrid or diluted type systems, I came to realise that what I was really doing was defining a series of interfaces.

We can already define objects via interfaces:
ISomeInterface MyProperty = new ClassImplementingISomeInterface();

What if we limited ourselves to defining every method or property with interfaces? Thus I would have a class MyClass, with properties IStringable NameString, IEnumerable<INumber> NumList, etc. That way the class would be maximally flexible, as the properties would make only the minimum demands of their inputs. Of course, if I pass an object that implements an interface derived from the required one, it can still be cast into the minimum required interface.
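A rough sketch of the same idea in Python, using typing.Protocol as the minimal interface - Stringable and Person are names invented for illustration:

```python
from typing import Iterable, Protocol

class Stringable(Protocol):
    """The minimal interface: anything that can render itself as a string."""
    def __str__(self) -> str: ...

class Person:
    # The attributes demand only the minimum interface of their inputs.
    def __init__(self, name: Stringable, nums: Iterable[int]) -> None:
        self.name = name
        self.nums = nums

# Any object satisfying the minimal interface is acceptable - an int,
# a Path, a custom class - which is the maximal flexibility we wanted.
p = Person(name=42, nums=range(3))
```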

Excellent, a dynamic and gregarious acceptance of many different static types - some of the innate strengths of dynamic languages in a statically-typed one.

Web Services

Let us step back for a moment and consider how this would help in the world of web services. A web service needs to be forgiving in what it will accept, and strict in its output, if it is to be used by the maximum set of other web services. If we assume that the input and output are various flavours of xml, then the static type is equivalent to the exact DTD of an xml doc, and the set of interfaces is, well, unknown.

In looking at REST interfaces, I came across a discussion of how to handle the inevitable change of interface versions when providing a RESTful API. Kalsey decides, in the end, to provide a version tag in the xml output file, containing pointers to the current, previous and latest API versions, with date-based URLs to provide access to specific versions of the API interface.

If, instead, we consider a request on a REST URI as a request for certain data conforming to a certain interface, then we can skip versioning. We define, instead, a series of smaller interfaces, and allow the user to request any subset of them. Thus my request xml should supply input data in an xml form that can be cast into any of the simpler interfaces, given an xml-generated object.
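As a sketch of the request-for-interfaces idea (in Python with dicts standing in for xml, and with invented field groups playing the part of the smaller interfaces):

```python
# Hypothetical field groups standing in for the "smaller interfaces".
INTERFACES = {
    "identity": lambda rec: {"id": rec["id"], "name": rec["name"]},
    "contact":  lambda rec: {"email": rec["email"]},
}

def handle_request(record, wanted):
    """Return only the interfaces the caller asked for - no API version needed."""
    response = {}
    for name in wanted:
        if name in INTERFACES:
            response.update(INTERFACES[name](record))
    return response

rec = {"id": 7, "name": "Ada", "email": "ada@example.com"}
```

A client that only ever asks for "identity" never notices when "contact" changes shape - which is the point of skipping versioning.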

Stack Overflow Collab

Stack Overflow is a new hub for the exchange of coding knowledge, built to leverage the innate tendencies of coding types to want to be right, to want to solve problems, and to gain kudos. So it has attracted many coders in search of recognition, and shows the important trait that its backers (Jeff Atwood, Joel Spolsky etc) are willing to follow the curve of user feedback. Thus SO is a community responding to individual problems.

The Open Source movement, by contrast, has built up from existing partnerships (largely from the Real World) forming relaxed coding collaborations. The larger projects have then acquired others from across the net who were interested and got involved. The key here is that OSS is a community responding to a collective demand.

So how about filling in that great Venn overlap of community drawn by solutions to individual problems, and the productivity of a community solving a larger problem together? In other words, SO, give us a collaboration hub. Surely it's the ideal place for the inexperienced to participate at the fringes of an OSS project, and for experienced coders to progress up the kudos chain by contributing work and experience to a project, rather than a single point of expertise.

The system, interface, ranking, discovery and how it all overlaps with the overflowing stack of questions and answers on SO are subsidiary issues. But since people's individual projects gave rise to problems, which gave rise to the need for SO, let the circle complete, and the solvers coagulate and combine to produce new projects.

I came, I saw, I solved, I wondered about the project, I got involved, we collaborated, we solved, we progressed, we share the story, we conquer.

Tuesday, July 29, 2008

Missing the value of information

This begins as critique, but I'm sure ideas will surface.

Mike of Mike On Ads has written a javascript example to mine the user's browser history for top sites, and use statistics to work out their gender. In the comments, many many people have posted their scores and whether the score was correct. Some are right, some are wrong.

But there are two complications here. Firstly, a complication of the quality of the data - many PCs have multiple users, particularly in family homes, which will render their scores useless. Also, no frequency data is included - a hundred visits to one site will be cancelled out by a single visit to another.

The second, and more interesting, point is this: why do you want their info? For any normal commercial reason - in particular serving ads or choosing content - what is interesting is not the true gender of the viewer; it is their statistical gender. If someone arrives at your sport&sewing site, should you pitch it as a sport site or a sewing site based on their gender, or on the ratio of crochet sites to live sport news sites in their history? You might think you want their gender, but what you really want is correlation.
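To make the point concrete, here is a toy Python version of the statistical approach - the sites and their male:female visit ratios are invented, and this is not Mike's actual script:

```python
# Invented male:female visit ratios, purely for illustration.
SITE_RATIOS = {
    "sportnews.example":    2.0,   # visited twice as often by men
    "crochetworld.example": 0.25,  # visited four times as often by women
}

def statistical_gender(visited):
    """Multiply the ratios of visited sites; >1 leans male, <1 leans female."""
    score = 1.0
    for site in visited:
        score *= SITE_RATIOS.get(site, 1.0)  # unknown sites carry no signal
    return score
```

The output is a correlation score, not a fact about the visitor - which, for serving ads, is exactly what you wanted anyway.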

I should mention that I wouldn't endorse the practice, and that soon enough it should be prevented by, I suggest, returning the default link colours rather than the live colours for anything but links from the parent domain - i.e. strengthening the XSS protection.

Even without this kind of temporary naughtiness, you can still find out about your general user base - watch what links they click, videos they watch, news items they find interesting. And build useful statistics - perhaps your site is visited by three fairly separate groups, which might determine your site development strategy. This time, you have frequency data, timing data, everything.

But remember not to be evil.

Wednesday, June 25, 2008

A quick idea on hybrid or diluted type systems

One of the main uses of object-orientation in programming is to collect things and encapsulate them. Thus, I can pick up any MyBendableObj made out of the BendableObj class which implements the Bendable interface and know that I can do MyBendableObj.bend(). This reflects how we tend to think as humans - I have a piece of paper, I can fold it. I don't really care if it's 80gsm or 85gsm, unless I'm trying to fold it more than a few times.

A large rift between camps of programming languages (or rather, between their developers) is between statically typed languages like C++ and dynamically typed languages like Python. There are some nice midway-points - you can use Rhino to do Javascript (dynamic) on top of a Java (static) VM, for example.

The old argument goes that static means fast, because the compiler knows what memory requirements are going to be needed, and dynamic means easy to code because you can change your mind about what a variable will hold. In PHP, $data can be a value one minute and an array of mixed objects the next. In Fortran, once declared an integer, always an integer.

But I suspect there is a hybrid possible. For the most part, my choice of type for a variable conveys information to the compiler for its compile-time work. It's a promise not to try to change the variable's type. But there are other ways to do that.

Suppose I take an otherwise dynamically typed language and add a language extension that allows loose declaration of type; I declare it to be a single variable, an object, a collection of objects, etc. Nobody really puts completely different things in $data, unless they're writing appallingly bad code.

So let me do this:
singlevar val = new String();
mixedarray outputCollection = array(i, 4.235, myObj);
singleobj house = new House();

Thus we give the compiler information it can use. I promise not to do too much funny stuff. But it means
  • Flexibility: my IOobject can be a file, a socket or a commandline pipe with the flexibility of a dynamic language (singleobj myIOobj = getIOObjectFromHandle(thisHandle)).
  • Compile errors: I can catch myself doing mySingleObj = objectArrayGenerator(); and fix it.
  • Speed: My compiler is (presumably) doing much the same optimisation for all my fixed-length types, allocating my object handles on the heap, etc etc. And it can do that here, too. Variable length strings are tricky, I'll grant you, but leave them as objects.
  • JIT compilation: hopefully the extra information can aid the JIT compiler
  • Not annoying the coder: dynamic languages are great. You do what you mean to do, and you don't worry about what's coming back. Except that you do, for the most part, know what's coming back. Even if it's a function coming back and you're about to do a functional filter lambda thing with it, that's a good start.
  • Confine implicit conversions to a single type stratum: float to integer, integer to char array, these are easy. MyMixedObjectCollection to boolean? It's not even meaningful.
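As a taste of the idea, here is a Python sketch that fakes the singlevar promise at runtime with a descriptor - obviously a real implementation would live in the compiler, not in user code, and Config/port are invented names:

```python
class SingleVar:
    """A 'singlevar' promise: once set, the variable keeps its first type."""
    def __set_name__(self, owner, name):
        self.name = "_" + name
    def __get__(self, obj, objtype=None):
        return getattr(obj, self.name)
    def __set__(self, obj, value):
        old = getattr(obj, self.name, None)
        if old is not None and type(value) is not type(old):
            raise TypeError(f"{self.name[1:]} promised to stay {type(old).__name__}")
        setattr(obj, self.name, value)

class Config:
    port = SingleVar()

c = Config()
c.port = 8080      # fine
c.port = 9090      # fine - still an int
# c.port = "9090"  # would raise TypeError: the promise is broken
```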
Stepping away from all the theory and the lemmas and the other stuff I haven't much clue about (which is probably why I'm somehow wrong), it feels natural to say something about what this variable will hold, without having to nail down exactly what. I have a CD case that can hold a CD. Or I might put a DVD in it. Or a bluray disc. In the dynamic world I don't care, I could put an elephant, two monkeys and a mathematical axiom in there. In the static world the DVD wouldn't cut it.

Templates should get a lot easier with these type strata - a vector of ints, floats, bools, all make some sense.

Perhaps this would lead the way to per-object overloading: destroy all these books but coat that one in paraffin - we understand exceptions to rules in the everyday world. Why not let us overload the destroy() method in this one, then pass the whole collection to be systematicallyDestroy()ed?

There is much for me to learn about functional programming, typing, OO and design patterns, so I'm not well placed to see whether this would all work. But in my mind, it spun a fine drop.

Saturday, June 14, 2008

Synthetic synaesthesia aid to spelling, reading

Why not apply consistent colours to the letters of e-books and in word processors to allow people with difficulty spelling or reading to recognise words by the colour pattern? Interesting links at the bottom.

Synaesthesia is a mixing of the senses; numbers and letters have colours, smells have sounds, and so on. There is a great synaesthesia generator which you can play with to get the idea, but as with most properties of the mind, I suspect we all have it a bit. More info on wikipedia. [I think I have a form of it called Number Form relating to times and dates - my weeks have a shape that fits into the shape of my year. Audible words also have faint visual echoes of their written shape.]

Learning to read and spell
It is well known that when we read, we learn first to recognise individual letters. Slowly, we get used to common words, and as our vocabulary expands and we read more, we stop looking at individual letters and start recognising whole words.

Those who read a lot get used to seeing the right shape for a given word, so when they come to spell it themselves, they can immediately see whether the word 'looks right'. For example, if I write hospital as hosptial, you can still guess the right word, but you know it is wrong.

Some people have more difficulty reading or spelling, either from reading less or due to learning difficulties like dyslexia. It's entirely understandable - mass reading is a very recent (and fairly unnatural) behaviour which the brain is inevitably going to have difficulty with.

The Original Idea
Could we make use of synaesthesia to provide additional visual cues to the mind? If we added consistent colours to the letters, would that allow us to add more mechanisms to support the response that a word doesn't look right?
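A sketch of how the colouring itself might be generated - Python emitting HTML spans, with each letter hashed to a colour as one arbitrary way of keeping the colours consistent:

```python
import hashlib

def letter_colour(ch):
    """Derive a consistent (arbitrary) colour for each letter."""
    digest = hashlib.md5(ch.lower().encode()).hexdigest()
    return "#" + digest[:6]

def colourise(word):
    """Wrap each letter in a span carrying its fixed colour."""
    return "".join(
        f'<span style="color:{letter_colour(c)}">{c}</span>' for c in word
    )

# The same letter always gets the same colour, so 'hosptial'
# shows a visibly different colour pattern from 'hospital'.
```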

A Flaw
One page on the net quotes the statistic that 15% of synaesthetes have a close family member with dyslexia, which suggests some commonality in the mechanisms of the two.

In a blog post, a synaesthete wondered about this topic in May 07. From a recent comment:
"My daughter, aged 11 is dyslexic and has grapheme-colour synesthesia. The colors negatively affect her ability to read and to spell, since some of the letters have the same color. E and U are both green, for example. She also tends to group the colors and hence inserts letters into words because she thinks their colors “go together”." - Mary G.

These would suggest that adding colours to the letters would make things worse rather than better, as there is more going on. In fact, one site considers ADHD a form of sensory overload, and, in a sample of people with learning difficulties, found 25% of synaesthetes to have ADHD.

So this may well limit usage of the letter colouring to non-synaesthetes.

Thursday, June 12, 2008

Approximation in the Web World

In a recent Coding Horror article on the Wide-Finder 2 project, it dawned on me that there is Another Way to solve I/O-limited Top 10 URL problems and generate a faster log analyzer: approximate.

As a Physicist, I know that not only is an approximation frequently perfectly adequate, but in many scenarios it is more valuable to have a good, quick answer than a perfect, slow one. Imagine if Google really searched the whole of its multi-petabyte index just for your lolcats query.

The Wide Finder 2 problem is that you have a very large dataset, out of which you want to retrieve the top 10 URLs. Most of the discussion seems to centre on which language to use, and how to parallelise the code.

A fast start
Suppose you've created something clever that uses:
  • a thread to read in the dataset and pile it sequentially into chunks of N megs of memory
  • a batch of threads that do an initial match but not perfectly - making an unsorted statistic for each chunk, idling if there is nothing to process
  • a third batch that sorts these chunks and atomically integrates the result into a main list
Then you tidy up at the end.
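That pipeline might be sketched in Python like so - chunk size, queue depth and worker count are placeholder choices, and the lines are assumed to already be the matched URLs rather than raw log lines:

```python
import threading, queue, collections

def run_pipeline(lines, n_workers=4, chunk_size=1000):
    chunks = queue.Queue(maxsize=8)   # reader thread -> matcher threads
    totals = collections.Counter()    # the main list
    lock = threading.Lock()

    def matcher():
        while True:
            chunk = chunks.get()
            if chunk is None:         # sentinel: time to tidy up
                break
            local = collections.Counter(chunk)  # unsorted per-chunk statistic
            with lock:                          # atomic integration
                totals.update(local)

    workers = [threading.Thread(target=matcher) for _ in range(n_workers)]
    for w in workers:
        w.start()
    # The reader's role: pile the dataset sequentially into chunks.
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            chunks.put(chunk)
            chunk = []
    if chunk:
        chunks.put(chunk)
    for _ in workers:                 # one sentinel per worker
        chunks.put(None)
    for w in workers:
        w.join()
    return totals.most_common(10)
```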

But the problem will still be I/O-limited. This is what happens with supercomputers - you just convert a compute-limited problem into an I/O-limited one.

A clever approximation
Although it's not in the spirit of the original project (it doesn't 'mimic' the original script), in many ways adding an approximation speedup is exactly what people need here. It makes the resulting numbers slightly worse in order to make the speed a lot better. No, I know, speed isn't the most important thing when processing log files. But this is a project centred on speed, so adding this dimension might make people think a bit harder. So what's the idea?

Skip most of the data.

The statistics (think Pareto, normal distribution, etc.) will tell you that for a popular site, a small proportion of the articles will absorb most of the hits. Hence the top 10 list. So we can make use of that by reading a sequential chunk of data, then skipping a big chunk, and repeating. If we simply modify that first thread in this way, we can (with a bit of experimentation) skip a large proportion of the file. The only requirement is that the statistics on the combined selection we do read give the same top 10 as the dataset as a whole.

Monte Carlo log file analysis

If we take the Monte Carlo approach, and randomly decide how much of the file to skip, we can simply keep reading chunks of the file (scanning back to the beginning if necessary) until the top 10 list stops changing. Note that we may be encountering new URLs, we may be adding more to the statistics of the top ten, but all we need are the right 10 in the right order.
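A Python sketch of that loop, with the log file stood in for by a list of lines - the chunk size and the "stable for N rounds" stopping rule are arbitrary choices:

```python
import random, collections

def monte_carlo_top10(lines, chunk=200, stable_rounds=20, seed=1):
    """Read random chunks until the top-10 list stops changing."""
    rng = random.Random(seed)
    counts = collections.Counter()
    last_top, unchanged = None, 0
    while unchanged < stable_rounds:
        # Skip a random amount, then read a sequential chunk.
        start = rng.randrange(max(1, len(lines) - chunk))
        counts.update(lines[start:start + chunk])
        top = [url for url, _ in counts.most_common(10)]
        if top == last_top:
            unchanged += 1          # the top 10 looks settled
        else:
            last_top, unchanged = top, 0
    return last_top
```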

One of the beauties of Monte Carlo methods is that they parallelize superbly. By throwing determinism out the window and letting the sizes of the chunks we read, the sizes of the sections we skip and the decision to quit all be probabilistic, each thread can run independently. I run until I stop seeing changes, then I decide to quit. The only inter-thread communication becomes getting the data to process. We can use multi-core friendly while loops rather than incrementing shared integers, or waiting a number of steps to read/quit.

So there you go. Physics methods applied to log file analysis; approximation in the Web World. Remember, Google are smart, and they give you approximate answers: "Results 1 - 10 of about 26,900,000 for kittens".

Thursday, June 5, 2008

Ethical/green driving lessons

The Pitch
I haven't covered many straightforward business ideas here, but let's start with some bandwagon jumping: Green driving lessons.

The differences:
  • The instructor's car is a hybrid of some kind, or perhaps a hydrogen/electric car eventually.
  • Lessons are on driving an automatic. All hybrids and electric cars are automatic or paddle-shift, because the standard gearing system and technique is designed to give a petrol engine mid-range revs, where it has its best torque output. I think.
  • It should help shelter the instructor from the rising price of fuel.
  • But above all it's got that warm, fuzzy, superiority of doing something normal in a green way.

Did you wonder why 'ethical' sits pressed against the 'green' in the title? Ethically, you'd be better off feeding the starving than saving the environment, in my view. So perhaps the fuel money saved (significant - learners spend even more time at low speeds) could go to aid charities, or at least give the student the choice of adding a pound to their lesson price to go straight to a particular charity. They'd go for it, they're paying for green lessons.

Business Sense
There is good business sense behind all this:

Target demographic: younger females.
Most learners are teenagers. Most teenagers are still quite left-wing, liberal, eco-friendly, moral, etc. They're not bitter yet. So to many of them, particularly those who aspire to driving hybrids later on, green driving lessons would make perfect sense. Admittedly, a more female demographic, and some of the boys will baulk at the automatic-only nature of the deal and the girly image. So consider this a niche market. A niche 50%.

Marketing message: aspirational.
Most marketing, and generally the most manipulative marketing, is aspirational. And here you're certainly trading on that: if you want to be green, you'll have eco-friendly lessons.

Lower overheads = wider profit margins
One of the largest overheads for any driving instructor is the fuel. As the price of petrol rises with the price of oil and tax increases, the instructor's profit margin is squeezed. On the other hand, as this pushes up the average price of driving lessons, a green driving instructor would see their profit margin increase as the extra money mostly went into their pocket.

The Hypocrisy and the Pragmatism
There is hypocrisy (no, not irony) in eco-friendly driving lessons. The best lesson is to simply get a bus instead. But there is pragmatism at work - if kids are going to learn to drive, teach them how to do it with good fuel economy. The clutch/accelerator games disappear in an automatic, so there isn't the tradeoff that lower revs means more stalling. And they'd be genuinely interested in learning. Smooth driving is generally safer anyway.

The Final Kicker: Parents as a new (niche) market
With high petrol prices, inevitably the parents could benefit from learning greener driving, which opens a new market: parents that take a few lessons on green driving technique. Of course, you don't have to be a parent to do it, but parents are the most immediate choice: you already have the customer relationship in teaching their child to drive. So it shouldn't be too difficult to add a few hours on for Mum to learn smoother driving and up her fuel economy. Admittedly, a fair proportion of the time would be spent teaching them to drive more carefully and ironing out bad habits, but that can only be a good thing.

You can always earn commission on referring them to your nearest hybrid dealership.

Tuesday, May 27, 2008

Daily exercise for your creative muscles

Create a flow gadget or RSS feed or other auto-updating thing to give people a short and straightforward creative exercise to do, up to twice a day. In the style of OneWord as an iGoogle gadget, for example.

Could make it social with games like these:
* Write oneword type paragraph, then provide a (related) word for the next person. Show your previous posts in the string.
* Web 2 whispers - you write for an unknown random number of seconds, then the text is snatched away for the next person to continue - they get as many seconds as you had to read it, then a random number to write.

But really, variety would be a great thing. Get people doing word association, describing a day as an X, relating two words, perhaps even writing for the length of a particular song, taking a phone photo of some interesting bit of ceiling, writing a haiku, chewing an interesting shape, drawing on a banana, following on from a randomly chosen sentence, following on from other people's following on sentences, taking a photo of two and a half things, explaining their choice of colour for 'think', imagining the world without an X, thinking up new ideas about a randomly chosen product, venting their spleen, getting angry, sending an extract from a receipt in their pocket, drawing a both-ways-up face, designing a tshirt, spilling some water (intentionally) and giving it a caption, "You are a dog, what do you think of fridges?", asking any of a thousand useful (and perhaps user contributed) questions - what would make a train journey better?, things to do with paperclips/cds/receipts/short pencils/unwanted cushions, invent a new magazine.

How would it work? Just provide a stimulus for the first few people, and occasionally drop in the question: what would you ask of people?

Setup: Blog-style post-and-reply would work well, but the replies must become the centre stage, rather than the post, and it must allow for more complicated supply and reply mechanisms for many of the above games.

Ah! How about a dynamically generated RSS feed unique to each user, supplying a reply-type link for the responses, which updates to show the state of the game when they're finished.

Further rumination, generalising yet further: make the whole thing a game. Let people cluster towards fellow users who supply their kind of game or stimulus, allow them to form their own mini communities. I don't want to say 'Facebook app', but...

Monday, May 26, 2008

Dealing with long words

In his blog entry Injecting Word Breaks With Javascript, John Resig suggests using javascript to insert word-break positions in long words, solving long-word-broke-my-page-layout problems. The hyphenation problem has been around a long time, and in decent typesetting systems like LaTeX, libraries do the job for you, breaking in the right places to maintain the flow of a sentence and not leave you on a new line starting with a spare letter. One commenter relays a story of their own algorithm, which broke up words too long for the lines but grew too computationally expensive. That implies it was being done every time the page was loaded.

As far as I can see, there are two different challenges here: dealing with long words and dealing with URLs. My comments:

Long words
Given that the best length of a line for legibility is somewhere in the 20-40em range, very few normal words will break your layout. Perhaps just process it when you are about to insert the text into the database - use a proper hyphenation script to insert zero width spaces or wbrs or whatever. You can always replace them all later. This way each chunk of text gets hyphenated properly once, instead of badly repeatedly.
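As a sketch of that input-time processing - a real hyphenation library would choose linguistically sensible break points, whereas here any over-long word just gets a zero-width space every 20 characters:

```python
import re

ZWSP = "\u200b"  # zero-width space: an invisible legal break point

def soften_long_words(text, limit=20):
    """Insert break opportunities into any word longer than `limit`."""
    def soften(match):
        word = match.group(0)
        return ZWSP.join(word[i:i + limit] for i in range(0, len(word), limit))
    return re.sub(r"\S{%d,}" % (limit + 1), soften, text)
```

Run once on the way into the database, and every later render gets the break points for free.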

URLs
Unless you maintain a site dedicated to long strings of text - livingwithspacebarphobia or organic chemistry, say - URLs are the vast majority of long strings. Why treat them differently? Because they are not expected to be shown in their raw form; most people would prefer the long URL to be hidden behind the usual linktext.

  • Replace the linktext with a shortened form - keep some of the beginning (so we can see the domain) and perhaps the text between the last slash and the following dot (the page name), to make something like ''
  • Steal the title text of the linked page - in processing the form data, follow each link and get hold of the page title, and use that (or, again, a shortened form) as the link text.
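The first of those options might be sketched like this - how much of the URL to keep, and what counts as the page name, are arbitrary choices:

```python
from urllib.parse import urlparse

def short_linktext(url):
    """Keep the domain plus the final path segment's name, eliding the middle."""
    parts = urlparse(url)
    page = parts.path.rstrip("/").rsplit("/", 1)[-1]
    page = page.split(".", 1)[0]        # the text before the following dot
    if page:
        return f"{parts.netloc}/…/{page}"
    return parts.netloc                 # bare domain: nothing to elide
```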
The overriding principle? Sanitize user input. Even when all they are doing is typing in words and pasting in URLs, users can cause unexpected problems. And do the sanitizing on input, not as a reformatting exercise on output - the input processing happens once and the result is served up many times.

Friday, May 9, 2008

Social side of bookshops

Bookshops are a hallowed ground on the high street, or on campus. They have the high spending per square metre of a shop, but the quietness and sensible nature of a library. No pumping music, no anorexia-inducing models on display - unlike the rest of retail, the focus is on knowledge and choosing the right product. It is generally a solitary activity, even if you're there with a friend or spouse.

Yet those that frequent a bookshop are more united than most other shoppers. If I stand in front of the Computing section and leaf through Agile Methods books or Web Site Zen, I am united with the other browsers of the section. The basic elements for a classic community are there - geography (we're in the same bookshop, in the same town), commonality of interest (the section we're browsing) and probably social status (although that shouldn't separate us, really).

A golden opportunity.

The web two-point-oh 'revolution' is about interaction and user-generated content. But that's just community with shiny buttons. The internet has always been about information, porn and community. Assuming we've grown out of the second, much of the latest stuff has just been usable combinations of information and community. And if I'm in community with my fellow information seekers in a bookshop, we've shortcircuited the painful path to video-based social networking websites dedicated to specific groups.

The idea is this: have book nights for given subjects. Get an author or other 'name' in, provide coffee/beer and have a short talk and a Q&A. Let people ask questions about which books are actually useful, or whatever they like related to the subject. An example: "Computing book night, with the author of Agile Methods. Bring an inquisitive mind and questions about Agile methods."

There's some truth in the stereotype that geeks lack social skills, yet everyone needs community. I suspect that for many subjects of these book nights, if you build it they will come. But tell them it's for their education, and let the social side accidentally happen.

Here's the stimulus; my suggestion for a campus bookshop which also sells snacks and drinks:
"It's summer, so let's have an al-fresco reading area outside - provide some cheap novels, expect people to buy their books first. Nice comfy sofas from the union communal area should work.

That should get people to come to the bookshop, and consume more snacks."