
Unicode and the Shavian Alphabet II

In Unicode and the Shavian Alphabet, I wrote about an incompatibility between two online tools: the transliterator at shavian.org, which converts English into the Shaw alphabet, and the converter at Pīnyīn.info, which translates characters into their Unicode numbers. To summarise: the Shaw alphabet, also known as the Shavian alphabet, was invented in a competition to design an alphabet in which English is spelled as it sounds. I used it as an alien programming language in a cartoon, generating my text with shavian.org's transliterator. I then tried to convert the transliteration into Unicode numbers by pasting it into Pīnyīn.info's converter. But the result had the wrong codes, and twice too many of them. Thomas Thurman, author of shavian.org's transliteration script, mailed me to explain why:

With reference to your column at http://www.drdobbs.com/blog/archives/2010/06/unicode_and_the.html : the reason the translator at http://www.pinyin.info/tools/converter/chars2uninumbers.html choked on the Shavian characters you gave it is because all Shavian characters have codepoints above 0xFFFF, and therefore (if you're using UTF-16, which the pinyin.info translator appears to be) they won't fit in a single word and will have to be represented using surrogate pairs. Wikipedia has a reasonable coverage of surrogate pairs: http://en.wikipedia.org/wiki/Surrogate_pair , but briefly, it's a way to represent a Unicode character whose codepoint is too high by using a pair of otherwise illegal characters, both of whose codepoints are low enough. Hence the effect you noted of having "the wrong codes, and twice too many of them".

The fault is presumably with the pinyin.info translator, which shouldn't give out surrogate pairs unless explicitly asked, but it does go to show that, as Wikipedia puts it, "code is often not tested thoroughly with surrogate pairs. This leads to persistent bugs, and potential security holes, even in popular and well-reviewed application software", or as you put it, "computing still is not mature".

Thomas (author of the transliterator script on shavian.org).
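
To make Thomas's explanation concrete, here is a minimal Python sketch of the arithmetic behind surrogate pairs. The function names are my own, chosen for illustration, and I am only assuming that the pinyin.info converter walks its input one 16-bit word at a time; I have not inspected its source. The sample uses U+10450 and U+10451, the first two code points of the Shavian block.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_code_units(text: str) -> list:
    """What a converter sees if it reads the text one 16-bit word at a time
    (the assumed behaviour of the tool that produced the doubled codes)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def code_points(text: str) -> list:
    """What a converter should report: one number per character."""
    return [ord(ch) for ch in text]


sample = "\U00010450\U00010451"   # two Shavian letters
print([hex(u) for u in utf16_code_units(sample)])  # ['0xd801', '0xdc50', '0xd801', '0xdc51']
print([hex(c) for c in code_points(sample)])       # ['0x10450', '0x10451']
```

Two characters go in, and the word-at-a-time reading yields four numbers, none of which is a Shavian code point: the wrong codes, and twice too many of them. Passing each pair through from_surrogate_pair recovers the numbers I expected.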