Wednesday, June 25, 2008

A quick idea on hybrid or diluted type systems

One of the main uses of object-orientation in programming is to collect things and encapsulate them. Thus, I can pick up any MyBendableObj made out of the BendableObj class which implements the Bendable interface and know that I can do MyBendableObj.bend(). This reflects how we tend to think as humans - I have a piece of paper, I can fold it. I don't really care if it's 80gsm or 85gsm, unless I'm trying to fold it more than a few times.

A large rift between camps of programming languages (or rather, between their developers) is between statically typed languages like C++ and dynamically typed languages like Python. There are some nice midway points - you can use Rhino to do JavaScript (dynamic) on top of a Java (static) VM, for example.

The old argument goes that static means fast, because the compiler knows what memory requirements are going to be needed, and dynamic means easy to code because you can change your mind about what a variable will hold. In PHP, $data can be a value one minute and an array of mixed objects the next. In Fortran, once declared an integer, always an integer.

But I suspect there is a hybrid possible. For the most part, my choice of type for a variable conveys information to the compiler for its compile-time work. It's a promise not to try to change the variable's type. But there are other ways to do that.

Suppose I take an otherwise dynamically typed language and add a language extension that allows loose declaration of type; I declare it to be a single variable, an object, a collection of objects, etc. Nobody really puts completely different things in $data, unless they're writing appallingly bad code.

So let me do this:
singlevar val = new String();
mixedarray outputCollection = array(i, 4.235, myObj);
singleobj house = new House();

Thus we give the compiler information it can use. I promise not to do too much funny stuff. But it means:
  • Flexibility: my IOobject can be a file, a socket or a commandline pipe with the flexibility of a dynamic language (singleobj myIOobj = getIOObjectFromHandle(thisHandle)).
  • Compile errors: I can catch myself doing mySingleObj = objectArrayGenerator(); and fix it.
  • Speed: My compiler is (presumably) doing much the same optimisation for all my fixed-length types, allocating my object handles on the heap, etc etc. And it can do that here, too. Variable length strings are tricky, I'll grant you, but leave them as objects.
  • JIT compilation: hopefully the extra information can aid the JIT compiler
  • Not annoying the coder: dynamic languages are great. You do what you mean to do, and you don't worry about what's coming back. Except that you do, for the most part, know what's coming back. Even if it's a function coming back and you're about to do a functional filter lambda thing with it, that's a good start.
  • Confine implicit conversions to a single type stratum: float to integer, integer to char array, these are easy. MyMixedObjectCollection to boolean? It's not even meaningful.
Stepping away from all the theory and the lemmas and the other stuff I haven't much clue about (which is probably why I'm somehow wrong), it feels natural to say something about what this variable will hold, without having to nail down exactly what. I have a CD case that can hold a CD. Or I might put a DVD in it. Or a Blu-ray disc. In the dynamic world I don't care, I could put an elephant, two monkeys and a mathematical axiom in there. In the static world the DVD wouldn't cut it.
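To make the strata idea a little more concrete, here is a rough Python sketch. It checks the promise at run time rather than compile time, so it only illustrates the rule, not the speed benefit. The names Stratum, stratum_of and Loose are mine, invented for illustration, and the choice of which built-in types count as a "single variable" is an assumption.

```python
from enum import Enum


class Stratum(Enum):
    SINGLEVAR = "singlevar"      # one scalar value: number, string, bool
    SINGLEOBJ = "singleobj"      # one object of any class
    MIXEDARRAY = "mixedarray"    # a collection of anything


def stratum_of(value):
    """Classify a runtime value into one of the three loose strata."""
    if isinstance(value, (bool, int, float, complex, str, bytes)):
        return Stratum.SINGLEVAR
    if isinstance(value, (list, tuple, set, frozenset, dict)):
        return Stratum.MIXEDARRAY
    return Stratum.SINGLEOBJ


class Loose:
    """A variable that may change its value, but never its stratum."""

    def __init__(self, stratum, value):
        self.stratum = stratum
        self.set(value)

    def set(self, value):
        actual = stratum_of(value)
        if actual is not self.stratum:
            raise TypeError(
                f"declared {self.stratum.value}, got {actual.value}"
            )
        self.value = value


class House:
    pass


house = Loose(Stratum.SINGLEOBJ, House())
house.set(object())                  # fine: still a single object
try:
    house.set([House(), House()])    # a collection where one object was promised
except TypeError as err:
    print(err)                       # declared singleobj, got mixedarray
```

A real implementation would want to reject the last assignment before the program ever runs; this only shows which assignments the rule would forbid.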

Templates should get a lot easier with these type strata - a vector of ints, floats, bools, all make some sense.

Perhaps this would lead the way to per-object overloading: destroy all these books but coat that one in paraffin - we understand exceptions to rules in the everyday world. Why not let us overload the destroy() method in this one, then pass the whole collection to be systematicallyDestroy()ed?
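
As it happens, some dynamic languages already allow something close to per-object overloading. A minimal Python sketch, where Book, destroy_with_paraffin and systematically_destroy are made-up names for the example above:

```python
import types


class Book:
    def __init__(self, title):
        self.title = title
        self.state = "intact"

    def destroy(self):
        self.state = "shredded"


def destroy_with_paraffin(self):
    # Hypothetical exception to the rule, attached to one object only.
    self.state = "coated in paraffin"


def systematically_destroy(collection):
    for item in collection:
        item.destroy()


books = [Book("A"), Book("B"), Book("C")]
# Rebind destroy() on the second book alone; the class is untouched.
books[1].destroy = types.MethodType(destroy_with_paraffin, books[1])

systematically_destroy(books)
print([b.state for b in books])
# ['shredded', 'coated in paraffin', 'shredded']
```

The collection code never needs to know that one of its members is an exception to the rule.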

There is much for me to learn about functional programming, typing, OO and design patterns, so I'm not well placed to see whether this would all work. But in my mind, it spun a fine drop.

Saturday, June 14, 2008

Synthetic synaesthesia aid to spelling, reading

Summary
Why not apply consistent colours to the letters of e-books and in word processors to allow people with difficulty spelling or reading to recognise words by the colour pattern? Interesting links at the bottom.

Synaesthesia
Synaesthesia is a mixing of the senses; numbers and letters have colours, smells have sounds, and so on. There is a great synaesthesia generator which you can play with to get the idea, but as with most properties of the mind, I suspect we all have it a bit. More info on Wikipedia. [I think I have a form of it called Number Form relating to times and dates - my weeks have a shape that fits into the shape of my year. Audible words also have faint visual echoes of their written shape.]

Learning to read and spell
It is well known that when we read, we learn first to recognise individual letters. Slowly, we get used to common words, and as our vocabulary expands and we read more, we stop looking at individual letters and start recognising whole words.

Those who read a lot get used to seeing the right shape for a given word, so when they come to spell it themselves, they can immediately see whether the word 'looks right'. For example, if I write hospital as hosptial, you can still guess the right word, but you know it is wrong.

Some people have more difficulty reading or spelling, either from reading less or due to learning difficulties like dyslexia. It's entirely understandable - mass reading is a very recent (and fairly unnatural) behaviour which the brain is inevitably going to have difficulty with.

The Original Idea
Could we make use of synaesthesia to provide additional visual cues to the mind? If we added consistent colours to the letters, would that allow us to add more mechanisms to support the response that a word doesn't look right?
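
As a rough sketch of what "consistent colours" could mean in practice, here is some Python that gives each letter a fixed colour and wraps text in HTML spans. The colour scheme (26 evenly spaced hues) and the names letter_colours and colourise are my own assumptions; any fixed mapping that keeps the letters distinct would do.

```python
import colorsys
import html
import string


def letter_colours():
    """Give each of the 26 letters its own evenly spaced hue."""
    colours = {}
    for index, letter in enumerate(string.ascii_lowercase):
        r, g, b = colorsys.hsv_to_rgb(index / 26, 0.9, 0.7)
        colours[letter] = "#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return colours


COLOURS = letter_colours()


def colourise(text):
    """Wrap each letter in a span with its fixed colour; leave the rest alone."""
    parts = []
    for char in text:
        colour = COLOURS.get(char.lower())
        if colour is None:
            parts.append(html.escape(char))
        else:
            parts.append(f'<span style="color:{colour}">{char}</span>')
    return "".join(parts)


# The same letters in a different order give a different colour pattern.
print(colourise("hospital") != colourise("hosptial"))
```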

A Flaw
One page on the net quotes the statistic that 15% of synaesthetes have a close family member with dyslexia, which suggests some commonality in the mechanisms of the two.

In a blog post, a synaesthete wondered about this topic in May 07. From a recent comment:
"My daughter, aged 11 is dyslexic and has grapheme-colour synesthesia. The colors negatively affect her ability to read and to spell, since some of the letters have the same color. E and U are both green, for example. She also tends to group the colors and hence inserts letters into words because she thinks their colors “go together”." - Mary G.

These would suggest that adding colours to the letters would make things worse rather than better, as there is more going on. In fact, one site considers ADHD a form of sensory overload and, in a sample of people with learning difficulties, found 25% of the synaesthetes to have ADHD.

So this may well limit usage of the letter colouring to non-synaesthetes.

Thursday, June 12, 2008

Approximation in the Web World

While reading a recent Coding Horror article on the Wide Finder 2 project, I realised that there is Another Way to solve I/O-limited Top 10 URL problems and generate a faster log analyzer: approximate.

As a physicist, I know that not only is an approximation frequently perfectly adequate, in many scenarios it is more valuable to have a good, quick answer than a perfect, slow one. Imagine if Google really searched the whole of its multi-petabyte index just for your lolcats query.

The Wide Finder 2 problem is that you have a very large dataset, out of which you want to retrieve the top 10 URLs. Most of the discussion seems to centre on which language to use, and how to parallelise the code.

A fast start
Suppose you've created something clever that uses:
  • a thread to read in the dataset and pile it sequentially into chunks of N megs of memory
  • a batch of threads that do an initial match but not perfectly - making an unsorted statistic for each chunk, idling if there is nothing to process
  • a third batch that sorts these chunks and atomically integrates the result into a main list
Then you tidy up at the end.
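
For the sake of argument, the shape of that pipeline might look something like this in Python. It is a simplified sketch, not a tuned solution: the merge happens in the calling thread rather than in a third batch of threads, and I assume the URL is the first whitespace-separated field of each log line. The names read_chunks, count_chunk and top_urls are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def read_chunks(lines, chunk_size):
    """Reader stage: hand out the log in sequential chunks of chunk_size lines."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def count_chunk(chunk):
    """Worker stage: an unsorted hit count for one chunk."""
    # Assumes the URL is the first whitespace-separated field of each line.
    return Counter(line.split()[0] for line in chunk if line.strip())


def top_urls(lines, chunk_size=1000, workers=4, n=10):
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, read_chunks(lines, chunk_size)):
            totals.update(partial)   # merge stage, done in the calling thread
    return totals.most_common(n)     # the tidy-up: sort once at the end
```

Merging in one place avoids the need for locks around the main list, at the cost of making that step serial.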

But the problem will still be I/O-limited. This is what happens with supercomputers - you just convert a compute-limited problem into an I/O-limited one.

A clever approximation
Although it's not in the spirit of the original project (it doesn't 'mimic' the original script), in many ways adding an approximation speedup is exactly what people need here. It makes the resulting numbers slightly worse in order to make the speed a lot better. No, I know, speed isn't the most important thing when processing log files. But this is a project centred on speed, so adding this dimension might make people think a bit harder. So what's the idea?

Skip most of the data.


The statistics (think Pareto, normal distribution, etc) will tell you that for a popular site, a small proportion of the articles will absorb most of the hits. Hence the top 10 list. So we can make use of that by reading a sequential chunk of data, then skipping a big chunk, and repeating. If we simply modify that first thread in this way, we can (with a bit of experimentation) skip a large proportion of the file. The only requirement is that the statistics on the combined selection we do read give the same top 10 as the dataset as a whole.
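
A minimal sketch of the read-a-bit, skip-a-lot reader, in Python. The chunk sizes are placeholders you would have to find by experiment, the function name sampled_top_urls is made up, and I again assume the URL is the first whitespace-separated field of each line.

```python
import os
from collections import Counter


def sampled_top_urls(path, read_bytes=1 << 20, skip_bytes=9 << 20, n=10):
    """Count URLs in read_bytes of log, seek past skip_bytes, and repeat."""
    counts = Counter()
    size = os.path.getsize(path)
    with open(path, "rb") as log:
        offset = 0
        while offset < size:
            log.seek(offset)
            if offset:
                log.readline()   # discard the line we landed in the middle of
            while log.tell() < offset + read_bytes:
                line = log.readline()
                if not line:
                    break        # end of file
                fields = line.split()
                if fields:
                    counts[fields[0].decode("utf-8", "replace")] += 1
            offset += read_bytes + skip_bytes
    return [url for url, _ in counts.most_common(n)]
```

With these defaults only about a tenth of the file is ever read. Whether the sampled top 10 matches the true one depends entirely on how lopsided the hit distribution is.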

Monte Carlo log file analysis

If we take the Monte Carlo approach, and randomly decide how much of the file to skip, we can simply keep reading chunks of the file (scanning back to the beginning if necessary) until the top 10 list stops changing. Note that we may be encountering new files, we may be adding more to the statistics of the top ten, but all we need are the right 10 in the right order.

One of the beauties of Monte Carlo methods is that they parallelize superbly. By throwing determinism out the window and letting the sizes of the chunks we read, the sizes of the sections we skip and the decision to quit be probabilistic, each thread can run independently. I run until I stop seeing changes, then I decide to quit. The only inter-thread communication becomes getting the data to process. We can use multi-core friendly while loops rather than incrementing shared integers, or waiting a number of steps to read/quit.
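
Here is a rough single-worker sketch of the Monte Carlo version in Python. Several things are my assumptions rather than part of the idea: the file is non-empty, the URL is the first field of each line, and a fixed "patience" (quit after this many rounds without the top list changing) stands in for a properly probabilistic decision to quit. The name monte_carlo_top_urls and all the parameters are hypothetical.

```python
import os
import random
from collections import Counter


def monte_carlo_top_urls(path, n=10, patience=20, max_chunk=1 << 16,
                         max_rounds=10_000, seed=None):
    """Sample random chunks of the log until the top-n list stops changing."""
    rng = random.Random(seed)
    size = os.path.getsize(path)                # assumed to be non-zero
    counts = Counter()
    previous, unchanged = None, 0
    with open(path, "rb") as log:
        for _ in range(max_rounds):
            log.seek(rng.randrange(size))       # random place to land
            log.readline()                      # drop the partial line
            chunk = log.read(rng.randint(1, max_chunk))  # random chunk size
            # The last line of the chunk may be cut short; drop it too.
            for line in chunk.split(b"\n")[:-1]:
                fields = line.split()
                if fields:
                    counts[fields[0].decode("utf-8", "replace")] += 1
            current = [url for url, _ in counts.most_common(n)]
            unchanged = unchanged + 1 if current == previous else 0
            previous = current
            if unchanged >= patience:
                break
    return previous
```

Each worker would run this loop on its own with a different seed. How to combine their answers at the end is left open here.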

Outro
So there you go. Physics methods applied to log file analysis; approximation in the Web World. Remember, Google are smart, and they give you approximate answers: "Results 1 - 10 of about 26,900,000 for kittens".

Thursday, June 5, 2008

Ethical/green driving lessons

The Pitch
I haven't covered many straightforward business ideas here, but let's start with some bandwagon jumping: Green driving lessons.

The differences:
  • The instructor's car is a hybrid of some kind, or perhaps a hydrogen/electric car eventually.
  • Lessons are on driving an automatic. All hybrids and electric cars are automatic or paddle-shift, because the standard gearing system and technique is designed to give a petrol engine mid-range revs where it has its best torque output. I think.
  • It should help shelter the instructor from the rising price of fuel
  • But above all it's got that warm, fuzzy, superiority of doing something normal in a green way.

Did you wonder why 'ethical' sits pressed against the 'green' in the title? Ethically, you'd be better off feeding the starving than saving the environment, in my view. So perhaps the fuel money saved (significant - learners spend even more time at low speeds) could go to aid charities, or at least give the student the choice of adding a pound to their lesson price to go straight to a particular charity. They'd go for it, they're paying for green lessons.

Business Sense
There is good business sense behind all this:

Target demographic: younger females.
Most learners are teenagers. Most teenagers are still quite left-wing, liberal, eco-friendly, moral, etc. They're not bitter yet. So to many of them, particularly those who aspire to driving hybrids later on, green driving lessons would make perfect sense. Admittedly, a more female demographic, and some of the boys will baulk at the automatic-only nature of the deal and the girly image. So consider this a niche market. A niche 50%.

Marketing message: aspirational.
Most marketing, and generally the most manipulative marketing, is aspirational. And here you're certainly trading on that: if you want to be green, you'll have eco-friendly lessons.

Lower overheads = wider profit margins
One of the largest overheads for any driving instructor is the fuel. As the price of petrol rises with the price of oil and tax increases, the instructor's profit margin is squeezed. On the other hand, as this pushes up the average price of driving lessons, a green driving instructor would see their profit margin increase as the extra money mostly went into their pocket.

The Hypocrisy and the Pragmatism
There is hypocrisy (no, not irony) in eco-friendly driving lessons. The best lesson is to simply get a bus instead. But there is pragmatism at work - if kids are going to learn to drive, teach them how to do it with good fuel economy. The clutch/accelerator games disappear in an automatic, so there isn't the tradeoff that lower revs means more stalling. And they'd be genuinely interested in learning. Smooth driving is generally safer anyway.

The Final Kicker: Parents as a new (niche) market
With high petrol prices, the parents could benefit from learning greener driving too, which opens a new market: parents who take a few lessons on green driving technique. Of course, you don't have to be a parent to do it, but parents are the most immediate choice: you already have the customer relationship in teaching their child to drive. So it shouldn't be too difficult to add a few hours on for Mum to learn smoother driving and up her fuel economy. Inevitably, a fair proportion of the time would be spent teaching them to drive more carefully and ironing out bad habits, but that can only be a good thing.

You can always earn commission on referring them to your nearest hybrid dealership.