Thursday, June 12, 2008

Approximation in the Web World

While reading a recent Coding Horror article on the Wide Finder 2 project, it dawned on me that there is Another Way to solve I/O-limited Top 10 URL problems and build a faster log analyzer: approximate.

As a physicist, I know that not only is an approximation frequently perfectly adequate, but in many scenarios it is more valuable to have a good, quick answer than a perfect, slow one. Imagine if Google really searched the whole of its multi-petabyte index just for your lolcats query.

The Wide Finder 2 problem is that you have a very large dataset, out of which you want to retrieve the top 10 URLs. Most of the discussion seems to centre on which language to use, and how to parallelise the code.

A fast start
Suppose you've created something clever that uses:
  • a thread that reads in the dataset and piles it sequentially into chunks of N megs of memory
  • a batch of threads that do an initial, imperfect match, building an unsorted tally of URL counts for each chunk and idling if there is nothing to process
  • a third batch that sorts these per-chunk tallies and atomically merges the results into a main list
Then you tidy up at the end.
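The pipeline above can be sketched in a few lines of Python. The names, chunk size and the crude regex are my own illustrative choices, and records split across chunk boundaries are ignored for brevity:

```python
import collections
import heapq
import queue
import re
import threading

CHUNK_SIZE = 8 * 1024 * 1024           # 'N megs' per chunk (illustrative)
URL_RE = re.compile(rb'GET (\S+)')     # crude initial match; real logs need more care

raw_chunks = queue.Queue(maxsize=4)    # reader -> matchers
counted = queue.Queue()                # matchers -> integrator
totals = collections.Counter()
lock = threading.Lock()

def reader(path):
    # First stage: pile the dataset sequentially into chunks.
    with open(path, 'rb') as f:
        while chunk := f.read(CHUNK_SIZE):
            raw_chunks.put(chunk)
    raw_chunks.put(None)               # sentinel: no more data

def matcher():
    # Second stage: imperfect match, producing an unsorted tally per chunk.
    while (chunk := raw_chunks.get()) is not None:
        counted.put(collections.Counter(URL_RE.findall(chunk)))
    raw_chunks.put(None)               # pass the sentinel on to sibling matchers
    counted.put(None)

def integrator():
    # Third stage: atomically merge per-chunk tallies into the main totals.
    while (counts := counted.get()) is not None:
        with lock:
            totals.update(counts)

def top10(path):
    threads = [threading.Thread(target=t, args=a) for t, a in
               [(reader, (path,)), (matcher, ()), (integrator, ())]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return heapq.nlargest(10, totals.items(), key=lambda kv: kv[1])
```

Scaling up the matcher and integrator stages is just a matter of starting more threads on those two functions; the sentinel hand-off keeps the shutdown tidy.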

But the problem will still be I/O-limited. This is what happens with supercomputers - you just convert a compute-limited problem into an I/O-limited one.

A clever approximation
Although it's not in the spirit of the original project (it doesn't 'mimic' the original script), in many ways an approximation speedup is exactly what people need here. It makes the resulting numbers slightly worse in order to make the speed a lot better. Yes, I know, speed isn't the most important thing when processing log files. But this is a project centred on speed, so adding this dimension might make people think a bit harder. So what's the idea?

Skip most of the data.

The statistics (think Pareto, normal distribution, etc.) will tell you that for a popular site, a small proportion of the articles will absorb most of the hits. Hence the top 10 list. So we can make use of that by reading a sequential chunk of data, then skipping a big chunk, and repeating. If we simply modify that first reader thread in this way, we can (with a bit of experimentation) skip a large proportion of the file. The only requirement is that the statistics on the combined selection we do read give the same top 10 as the dataset as a whole.
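A minimal sketch of that read-then-skip loop; the READ and SKIP sizes here are pure guesses that the 'bit of experimentation' would have to tune:

```python
import os

READ = 1 * 1024 * 1024    # bytes to read per sample (a guess, to be tuned)
SKIP = 9 * 1024 * 1024    # bytes to skip between samples (reads ~10% of the file)

def sampled_chunks(path):
    """Yield sequential chunks, leaping over a big stretch after each one."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(READ)
            if not chunk:
                return                    # end of file
            yield chunk
            f.seek(SKIP, os.SEEK_CUR)     # skip a big chunk of the file
```

The rest of the pipeline is untouched: the matcher threads just see fewer chunks.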

Monte Carlo log file analysis

If we take the Monte Carlo approach, and randomly decide how much of the file to skip, we can simply keep reading chunks of the file (wrapping back to the beginning if necessary) until the top 10 list stops changing. Note that we may still be encountering new URLs, and adding more to the counts of the existing top ten, but all we need are the right 10 in the right order.
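The stopping rule ('until the top 10 list stops changing') might look like this sketch, where each item of `chunks` is a tally for one sampled chunk, and `patience` (a knob of my own invention) is how many unchanged rounds count as 'stopped changing':

```python
import collections
import heapq

def monte_carlo_top10(chunks, patience=5):
    totals = collections.Counter()
    last, stable = None, 0
    for counts in chunks:               # counts: a Counter for one sampled chunk
        totals.update(counts)
        top = [url for url, _ in
               heapq.nlargest(10, totals.items(), key=lambda kv: kv[1])]
        if top == last:
            stable += 1
            if stable >= patience:      # unchanged for `patience` rounds: quit
                break
        else:
            last, stable = top, 0       # the list changed; keep sampling
    return last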

One of the beauties of Monte Carlo methods is that they parallelise superbly. By throwing determinism out the window and making the sizes of the chunks we read, the sizes of the sections we skip and the decision to quit all probabilistic, each thread can run independently: it runs until it stops seeing changes in the top 10, then decides to quit. The only inter-thread communication becomes getting the data to process. We can use multi-core-friendly while loops rather than incrementing shared integers or waiting a fixed number of steps before reading or quitting.
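One such independent worker might look like this sketch, where each thread gets its own file handle and its own random stream; the exponential draws for the read and skip sizes are my own choice of distribution, not anything prescribed by the problem:

```python
import os
import random

def worker_chunks(path, rng, mean_read=1 << 20, mean_skip=8 << 20):
    """Endlessly yield chunks with randomised read/skip sizes, wrapping at EOF."""
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        f.seek(rng.randrange(size))                     # start somewhere random
        while True:
            chunk = f.read(int(rng.expovariate(1 / mean_read)) or 1)
            if not chunk:
                f.seek(0)                               # scan back to the beginning
                continue
            yield chunk
            f.seek(int(rng.expovariate(1 / mean_skip)), os.SEEK_CUR)
```

The generator just runs forever; it is the consuming thread that decides, probabilistically, when it has stopped seeing changes and quits.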

So there you go. Physics methods applied to log file analysis; approximation in the Web World. Remember, Google are smart, and they give you approximate answers: "Results 1 - 10 of about 26,900,000 for kittens".


Paddy3118 said...

Nice one!
But each parallel access to the file is going to hit I/O, so why not have one process randomly sampling the file?

Do we need to always seek forward in the file or can we randomly seek for a record to process?

Needs some experimentation methinks!

- Paddy.

Phil H said...

One process was what I was aiming for - hence the 'one thread' as the first in the initial version.

I suspect that seeking forward would be faster - that way the read head is doing the least complicated thing (although I am very hazy on what is fast or slow for a hard disk).

Randomly seeking to a single record would add more overhead because of seek time and disk buffering. The drive needs time to get back to full streaming speed, and each skip has to pass over enough of the file to offset the new seek time. If the seek time is 10ms, and the disk reads at some 50MB/s, then each seek costs you something like 0.5MB of reading.

Then there is the buffering delay - usually disks have a buffer of 8MB or so, so it makes sense to pull this much data at once, then skip ahead. Hopefully with command queueing and speed matching, the hard disk's built in cleverness would be fully employed.

Experimentation definitely a good thing.