Monday, May 26, 2008

Dealing with long words

In his blog entry Injecting Word Breaks With Javascript, John Resig suggests using JavaScript to insert word-break positions in long words, solving the long-word-broke-my-page-layout problem. The hyphenation problem has been around a long time, and decent typesetting systems like LaTeX ship with libraries that do the job for you, breaking words in the right places to maintain the flow of a sentence and not leave you on a new line starting with a spare letter. One commenter relays a story of their own algorithm, which broke up words too long for the lines but grew too computationally expensive - which implies it was being run every time the page was loaded.

As far as I can see, there are two different challenges here: dealing with long words and dealing with URLs. My comments:

Long words
Given that the best line length for legibility is somewhere in the 20-40em range, very few normal words will break your layout. Perhaps just process the text when you are about to insert it into the database - use a proper hyphenation script to insert zero-width spaces or <wbr> tags or whatever. You can always strip them out later. This way each chunk of text gets hyphenated properly once, instead of badly and repeatedly.
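As a minimal sketch of the insert-on-input idea: the function below breaks any over-long run of non-space characters by inserting zero-width spaces at fixed intervals. A real hyphenation library would pick linguistically sensible break points instead; the function name and the 20-character threshold are illustrative, not from the original post.

```javascript
// Naive sketch: insert a zero-width space (U+200B) after every
// `maxLen` characters of an unbroken run of non-whitespace, so the
// browser always has somewhere to wrap. A proper hyphenation
// library would choose dictionary-based break points instead.
function softBreakLongWords(text, maxLen = 20) {
  // Match maxLen non-space chars only when another non-space follows,
  // so we never append a break at the very end of a word.
  const run = new RegExp(`\\S{${maxLen}}(?=\\S)`, "g");
  return text.replace(run, chunk => chunk + "\u200B");
}
```

Because the breaks are plain characters stored with the text, this runs once on input rather than on every page load, and the inserted U+200B characters can be stripped out again later if needed.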

URLs
Unless you maintain a site dedicated to long strings of text - livingwithspacebarphobia, say, or organic chemistry - URLs make up the vast majority of long strings. Why treat them differently? Because they are not expected to be shown in their raw form; most people would prefer the long URL to be hidden behind the usual link text.

  • Replace the link text with a shortened form - keep some of the beginning (so we can see the domain) and perhaps the text between the last slash and the following dot (the page name), to make something like ''
  • Steal the title text of the linked page - when processing the form data, follow each link, fetch the page title, and use that (or, again, a shortened form) as the link text.
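The first option above can be sketched in a few lines, assuming the submitted text is processed server-side in JavaScript. The `shortenLinkText` name and the exact truncation rule (host, ellipsis, page name) are illustrative assumptions, not from the original post.

```javascript
// Hypothetical helper: build short display text for a long URL by
// keeping the host plus the final path segment up to its first dot
// (the "page name" from the bullet above).
function shortenLinkText(url) {
  const u = new URL(url);
  // Last non-empty path segment, e.g. "dealing-with-long-words.html"
  const lastSegment = u.pathname.split("/").filter(Boolean).pop() || "";
  // Text between the last slash and the following dot
  const pageName = lastSegment.split(".")[0];
  return pageName ? `${u.hostname}/…/${pageName}` : u.hostname;
}
```

For example, a deep archive URL collapses to `example.com/…/page-name`, which can never be long enough to break a column layout.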
The overriding principle? Sanitize user input. Even when all they are doing is typing in words and pasting in URLs, users can cause unexpected problems. And do the sanitizing on input, not as a reformatting exercise on output - the input processing happens once, while the result is served up many times.
