As a physicist, I know that not only is an approximation frequently perfectly adequate, in many scenarios it is more valuable to have a good, quick answer than a perfect, slow one. Imagine if Google really searched the whole of its multi-petabyte index just for your lolcats query.
The Wide Finder 2 problem is that you have a very large dataset, from which you want to retrieve the top 10 URLs. Most of the discussion seems to centre on which language to use and how to parallelise the code.
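Concretely, the serial baseline is just a scan and tally. Here's a minimal Python sketch, assuming a space-separated log format with the requested URL as the second field (an illustrative simplification, not the benchmark's actual matching rule):

```python
from collections import Counter

def top_ten(path):
    # Serial baseline: scan every line, tally the requested URL.
    # Assumes a space-separated log format with the URL as the 2nd field
    # (illustrative only; the real benchmark matches with its own regex).
    counts = Counter()
    with open(path, 'rb') as f:
        for line in f:
            fields = line.split()
            if len(fields) > 1:
                counts[fields[1]] += 1
    return counts.most_common(10)
```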
A fast start
Suppose you've created something clever that uses:
- a thread to read in the dataset and pile it sequentially into chunks of N megs of memory
- a batch of threads that do an initial, imperfect match - building an unsorted statistic for each chunk, and idling when there is nothing to process
- a third batch of threads that sorts these statistics and atomically integrates the results into a main list
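That pipeline might be sketched like this in Python, with `queue.Queue` carrying chunks between the stages; the chunk size, thread counts and the crude second-field match are all illustrative assumptions:

```python
import threading, queue
from collections import Counter

CHUNK = 1 << 20              # "N megs of memory" per chunk (assumed 1 MiB)
N_MATCHERS = 4               # size of the matcher batch (assumed)
chunks = queue.Queue(maxsize=8)   # reader -> matchers
partials = queue.Queue()          # matchers -> merger
totals = Counter()
lock = threading.Lock()

def reader(path):
    # First thread: pile the dataset sequentially into chunks.
    with open(path, 'rb') as f:
        while block := f.read(CHUNK):
            chunks.put(block)
    for _ in range(N_MATCHERS):
        chunks.put(None)      # one stop signal per matcher

def matcher():
    # Second batch: an initial, imperfect match - lines cut at chunk
    # boundaries are miscounted, which is fine for a rough statistic.
    while (block := chunks.get()) is not None:
        stats = Counter()
        for line in block.splitlines():
            fields = line.split()
            if len(fields) > 1:
                stats[fields[1]] += 1   # crude match: take the 2nd field
        partials.put(stats)
    partials.put(None)

def merger():
    # Third batch: atomically integrate each partial into the main list.
    done = 0
    while done < N_MATCHERS:
        stats = partials.get()
        if stats is None:
            done += 1
            continue
        with lock:
            totals.update(stats)

def run(path):
    threads = [threading.Thread(target=reader, args=(path,))]
    threads += [threading.Thread(target=matcher) for _ in range(N_MATCHERS)]
    threads += [threading.Thread(target=merger)]
    for t in threads: t.start()
    for t in threads: t.join()
    return totals.most_common(10)
```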
But the problem will still be I/O-limited. This is what happens with supercomputers - you just convert a compute-limited problem into an I/O-limited one.
A clever approximation
Although it's not in the spirit of the original project (it doesn't 'mimic' the original script), in many ways adding an approximation speedup is exactly what people need here. It makes the resulting numbers slightly worse in order to make the speed a lot better. No, I know, speed isn't the most important thing when processing log files. But this is a project centred on speed, so adding this dimension might make people think a bit harder. So what's the idea?
Skip most of the data.
The statistics (think Pareto, normal distribution, etc.) will tell you that for a popular site, a small proportion of the articles will absorb most of the hits. Hence the top 10 list. So we can make use of that by reading a sequential chunk of data, then skipping a big chunk, and repeating. If we simply modify that first thread in this way, we can (with a bit of experimentation) skip a large proportion of the file. The only requirement is that the statistics on the combined selection we do read give the same top 10 as the dataset as a whole.
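A sketch of that modified reader, with an assumed 1:9 read-to-skip ratio (so roughly 10% of the file gets read; the right ratio is the part that needs experimentation):

```python
from collections import Counter

def sampled_counts(path, read_bytes=1 << 20, skip_bytes=9 << 20):
    # Read a sequential chunk, skip a bigger one, repeat.
    # The 1 MiB / 9 MiB split is an assumption to tune by experiment.
    counts = Counter()
    with open(path, 'rb') as f:
        while True:
            block = f.read(read_bytes)
            if not block:
                break
            lines = block.split(b'\n')
            lines.pop()            # last line may be cut off: drop it
            for line in lines:
                fields = line.split()
                if len(fields) > 1:
                    counts[fields[1]] += 1
            f.seek(skip_bytes, 1)  # skip ahead, relative to current position
            f.readline()           # resync to the next line boundary
    return counts
```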
Monte Carlo log file analysis
If we take the Monte Carlo approach, and randomly decide how much of the file to skip, we can simply keep reading chunks of the file (scanning back to the beginning if necessary) until the top 10 list stops changing. Note that we may still be encountering new URLs, and still be adding to the statistics of the top ten, but all we need are the right 10 in the right order.
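A sketch of that stopping rule, with an assumed `patience` of 5 unchanged rounds standing in for "the top 10 list stops changing":

```python
import random
from collections import Counter

def monte_carlo_top10(path, file_size, chunk=1 << 20, patience=5):
    # Keep reading randomly-placed chunks until the top 10 is unchanged
    # for `patience` consecutive rounds (a hypothetical stopping rule).
    # `file_size` is the total size in bytes, e.g. os.path.getsize(path).
    counts = Counter()
    stable, last_top = 0, None
    with open(path, 'rb') as f:
        while stable < patience:
            pos = random.randrange(max(1, file_size - chunk))
            f.seek(pos)
            if pos:
                f.readline()       # landed mid-line: skip the partial line
            lines = f.read(chunk).split(b'\n')
            lines.pop()            # drop the cut-off tail
            for line in lines:
                fields = line.split()
                if len(fields) > 1:
                    counts[fields[1]] += 1
            top = [u for u, _ in counts.most_common(10)]
            stable = stable + 1 if top == last_top else 0
            last_top = top
    return last_top
```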
One of the beauties of Monte Carlo methods is that they parallelise superbly. By throwing determinism out the window and letting the sizes of the chunks we read, the sizes of the sections we skip and the decision to quit all be probabilistic, each thread can run independently. I run until I stop seeing changes, then I decide to quit. The only inter-thread communication becomes getting the data to process. We can use multi-core-friendly local loops rather than incrementing shared integers or waiting a fixed number of steps to read or quit.
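Putting it together, each thread can run the whole sample-until-stable loop on purely local state, with the only shared step being the final merge; the thread count and stopping rule are assumptions as before:

```python
import random, threading
from collections import Counter

def worker(path, file_size, out, chunk=1 << 20, patience=5):
    # Each thread samples independently and decides to quit on its own:
    # no shared counters, no coordination beyond the final merge.
    rng = random.Random()          # thread-local randomness
    counts = Counter()
    stable, last_top = 0, None
    with open(path, 'rb') as f:    # each thread opens its own file handle
        while stable < patience:
            pos = rng.randrange(max(1, file_size - chunk))
            f.seek(pos)
            if pos:
                f.readline()       # landed mid-line: skip the partial line
            lines = f.read(chunk).split(b'\n')
            lines.pop()            # drop the cut-off tail
            for line in lines:
                fields = line.split()
                if len(fields) > 1:
                    counts[fields[1]] += 1
            top = [u for u, _ in counts.most_common(10)]
            stable = stable + 1 if top == last_top else 0
            last_top = top
    out.append(counts)             # list.append is atomic in CPython

def parallel_top10(path, file_size, n_threads=4):
    results = []
    threads = [threading.Thread(target=worker, args=(path, file_size, results))
               for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return [u for u, _ in sum(results, Counter()).most_common(10)]
```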
So there you go. Physics methods applied to log file analysis; approximation in the Web World. Remember, Google are smart, and they give you approximate answers: "Results 1 - 10 of about 26,900,000 for kittens".