Thursday, June 15, 2006

this test app can break

Wow. I didn't expect a post about a little-used Windows API function to generate 30,000 page views. In any event, some folks still doubt the "IsTextUnicode()" explanation, so I'm putting up the test app that I used to validate my theory before I blogged it.

Just run the app, and enter a string into the edit control. As you type, the app repeatedly calls IsTextUnicode() and shows both the result (Unicode/not Unicode) and the flags that IsTextUnicode() returns to indicate which tests it used.
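If you'd rather poke at the function yourself than run my app, here's a rough Python sketch of the same idea. It's not the test app's source. The helper name is_text_unicode and the ALL_TESTS constant are mine, and I'm assuming that passing all four documented flag masks OR'd together is a sensible way to ask for every test. Off Windows it simply returns None.

```python
import ctypes
import sys

# Assumption: request every test group by OR-ing the four documented masks
# (UNICODE 0x000F, REVERSE 0x00F0, NOT_UNICODE 0x0F00, NOT_ASCII 0xF000).
ALL_TESTS = 0xFFFF


def is_text_unicode(data: bytes):
    """Return (guess, flags) from IsTextUnicode(), or None when not on Windows."""
    if sys.platform != "win32":
        return None
    # On input, flags says which tests to run; on output, which tests passed.
    flags = ctypes.c_int(ALL_TESTS)
    guess = ctypes.windll.advapi32.IsTextUnicode(
        data, len(data), ctypes.byref(flags)
    )
    return bool(guess), flags.value


result = is_text_unicode(b"this app can break")
if result is not None:
    guess, flags = result
    print("Unicode" if guess else "not Unicode", hex(flags))
```

What it prints will depend on your Windows version, so treat the output as something to experiment with rather than a fixed answer.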

Updated: I had pasted in the relevant chunk of the app source code, but it appears this blog template chokes on 70-column preformatted text. If you really want it, drop me a line.

Wednesday, June 14, 2006

this api can break

Over at WinCustomize, someone thought they'd found an Easter Egg in the Windows Notepad application. If you:
  1. Open Notepad
  2. Type the text "this app can break" (without quotes)
  3. Save the file
  4. Re-open the file in Notepad
Notepad displays seemingly random Chinese characters, or boxes if your default Notepad font doesn't support those characters.

It's not an Easter egg (even though it seems like a funny one), and as it turns out, Notepad writes the file correctly. It's only when Notepad reads the file back in that it seems to lose its mind.

But we can't even blame Notepad: it's a limitation of Windows itself, specifically the Windows function that Notepad uses to figure out if a text file is Unicode or not.

You see, text files containing Unicode (more correctly, UTF-16-encoded Unicode) are supposed to start with a "Byte-Order Mark" (BOM), which is a two-byte flag that tells a reader how the following UTF-16 data is encoded. Given that these two bytes are exceedingly unlikely to occur at the beginning of an ASCII text file, the BOM is commonly used to tell whether a text file is encoded in UTF-16.
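Checking for the marker is the easy part. Here's a minimal Python sketch (the function name sniff_utf16_bom is mine, just for illustration) that looks only at the first two bytes and deliberately ignores every other encoding:

```python
import codecs


def sniff_utf16_bom(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, based only on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None  # no BOM: the reader has to guess, or be told


with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(sniff_utf16_bom(with_bom))
print(sniff_utf16_bom(b"this app can break"))
```

The second call comes back with None, which is exactly the situation where a reader has to fall back on something else.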

But plenty of applications don't bother writing this marker at the beginning of a UTF-16-encoded file. So what's an app like Notepad to do?

Windows helpfully provides a function called IsTextUnicode()--you pass it some data, and it tells you whether it's UTF-16-encoded or not.

Sorta.

It actually runs a couple of heuristics over the first 256 bytes of the data and provides its best guess. As it turns out, these tests aren't terribly reliable for very short ASCII strings that contain an even number of lower-case letters, like "this app can break", or more appropriately, "this api can break".
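You don't need Windows to see what happens once the guess goes wrong. This little Python sketch takes the 18 ASCII bytes and decodes them the way a reader would if it believed they were little-endian UTF-16. It only reproduces the misreading, not the heuristics inside IsTextUnicode():

```python
text = "this app can break"
raw = text.encode("ascii")          # 18 bytes, one per character

# Pretend the bytes are UTF-16LE: each pair of bytes becomes one code unit,
# with the second byte of the pair as the high byte.
misread = raw.decode("utf-16-le")

for pair_start, ch in zip(range(0, len(raw), 2), misread):
    pair = raw[pair_start:pair_start + 2]
    print(repr(pair.decode("ascii")), "->", f"U+{ord(ch):04X}")

# Every resulting code point falls in the CJK Unified Ideographs block.
print(all(0x4E00 <= ord(ch) <= 0x9FFF for ch in misread))
```

The first pair, "th", is bytes 0x74 0x68, which read little-endian is U+6874. Lower-case letters and spaces in the high byte put you squarely in the CJK range, which is why the garbage looks Chinese.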

The documentation for IsTextUnicode says:

These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpBuffer points to the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, though failure would be preferable.

Indeed.

As a wise man once said, "In the face of ambiguity, refuse the temptation to guess."

Competency and Layers

Larry Osterman has yet another great network programming post on his blog. To sum up, he declares his second rule of "making things go fast on the network":

You can't design your application protocol in a vacuum. You need to understand how the layers below your application work before you deploy it.

An excellent rule. Actually, I've often heard (and used) a more general form:

You can't be competent doing computer work at level N unless you have a good grasp of level N-1.

Programming is all about abstractions, and we as programmers are fond of thinking that our abstractions mean you "don't need to know" what's under the covers. But abstractions aren't perfect, and if you don't know what's under your current level of abstraction, then you're simply not competent.

For example, if you want to be a good MFC programmer, you need to have a decent grasp of Win32 API fundamentals. If you want to work in Python, you don't need to be a Python core hacker, but you'd better know enough about the implementation to know why, for example, repeated string concatenation is slow. And if you're using an object-relational mapper on top of a relational database, you still need to know your way around SQL.
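To make the Python case concrete, here's a small sketch (function names are mine). Strings are immutable, so each concatenation in a loop can, in principle, copy everything built so far. Some CPython versions optimize certain cases of this, so I'm not quoting timings; the point is knowing why join() is the idiom.

```python
def concat_naive(parts):
    """Build the result by repeated concatenation.

    Because str is immutable, each += may allocate a new string and copy
    the old contents, which can add up to quadratic work overall.
    """
    result = ""
    for part in parts:
        result += part
    return result


def concat_join(parts):
    """Build the result with str.join, which sizes the output once and copies each part once."""
    return "".join(parts)


parts = [str(n) for n in range(1000)]
print(concat_naive(parts) == concat_join(parts))
```

Both functions return the same string; only the amount of copying along the way differs.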

I first heard this from Dr. Ralph Droms, one of my professors at Bucknell (who also invented DHCP). If I recall correctly, he was quoting one of the "elder statesmen" of computer science (Dijkstra, maybe?), but I can't recall just who it was.