From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <410a85752905d9259d72921e37207d01@plan9.bell-labs.com>
To: 9fans@cse.psu.edu
Subject: Re: [9fans] wchar_t in ANSI C (was "Announce: port")
From: "rob pike, esq."
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
Date: Mon, 29 Apr 2002 08:42:20 -0400
Topicbox-Message-UUID: 802b1832-eaca-11e9-9e20-41e7f4b1d025

> Sounds like somebody who doesn't use them enough to know.
> wchar_t is closely analogous to rune.
> The real problem is that "char" is inadequate for encoding a character,
> largely a consequence of Dennis chiming in on the sizeof(char)==1 side.

I don't want to debate whether sizeof(char) should be 1, but I do think you're being too forgiving about wchar_t, at least in the original standard. There were too many holes in the standard, such as no defined format for printing wchar_t strings, no defined conversion between strings of either type (just of individual characters) and no defined input method. In short, no stdio support! Too much last-minute committee design, I find.

Footnotes 119 and 122 in the I/O section of the standard (printf, scanf) both read: "No special provisions are made for multibyte characters." Give me a break! How hard would it have been to define %lc and %ls, for instance? The answer is surprisingly subtle, and is answered in my next paragraph.

The issue that cheeses me most still remains even in the new standard: the clumsiness of converting in the face of conversion errors such as malformed UTF-8, which turn up a lot when you're scanning binary data looking for strings, or just get handed something like Latin-1 when you're expecting UTF-8. Most programs (e.g. grep) can do nothing useful in the face of errors except barge on, but the ANSI C standard makes the standard character processing loop a real mess.
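For illustration, here is a minimal sketch of what "barging on" looks like with the restartable conversion function mbrtowc from the amended standard. The function name count_chars and the skip-one-byte recovery policy are assumptions of this sketch, not something the standard or the message prescribes; the result also depends on the current locale, which the caller is assumed to have set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in buf[0..n) under the current locale,
 * skipping one byte after each conversion error.
 * *nbad receives the number of bytes that could not be converted.
 * (count_chars is a hypothetical helper for this sketch.)
 */
size_t
count_chars(const char *buf, size_t n, size_t *nbad)
{
	mbstate_t st;
	wchar_t wc;
	size_t i = 0, nchars = 0, r;

	assert(buf != NULL && nbad != NULL);
	memset(&st, 0, sizeof st);	/* initial conversion state */
	*nbad = 0;
	while (i < n) {
		r = mbrtowc(&wc, buf + i, n - i, &st);
		if (r == (size_t)-1) {
			/* Invalid sequence: the state is now undefined,
			 * so reset it, step over one byte, and carry on. */
			memset(&st, 0, sizeof st);
			(*nbad)++;
			i++;
			continue;
		}
		if (r == (size_t)-2) {
			/* Incomplete sequence at the end of the buffer. */
			*nbad += n - i;
			break;
		}
		if (r == 0)
			r = 1;	/* a NUL byte: mbrtowc returns 0, not 1 */
		i += r;
		nchars++;
	}
	return nchars;
}
```

Three distinct special return values and a manual state reset are needed just to keep going past bad input, which is the clumsiness being described.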
The same error model also makes scanf with %ls or %lc impossible to write consistently with the rest of the standard, since you need to stop if there's a conversion error, which is almost never what you want. This issue is a matter of taste, but I feel it's done wrong. The Plan 9 model, with the concept of an "Error Rune", makes it easy to ignore errors but also easy to handle them, as you decide. Plan 9's is a much better model because it was a model born of experience rather than design without implementation.

I reiterate that the error handling issue is one of taste, but that there is no excuse for omitting wchar_t support in stdio.

We wrote about this in our UTF paper
	http://plan9.bell-labs.com/sys/doc/utf.pdf .html .ps
(The .html version has some character set awkwardness!). If we could have used ANSI C's design for wide characters, we would have, but it was inadequate.

-rob
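The "Error Rune" idea described above can be sketched as follows. This is not Plan 9's code: decode_rune, count_runes, and RUNE_ERROR are hypothetical names, and the value 0xFFFD (the Unicode replacement character) is this sketch's choice rather than necessarily the value Plan 9 used. The point is only the shape of the interface: a bad sequence yields a distinguished rune and consumes exactly one byte, so the loop never has to stop.

```c
#include <stddef.h>

enum { RUNE_ERROR = 0xFFFD };	/* hypothetical error rune for this sketch */

/*
 * Decode one UTF-8 sequence from s[0..n), n > 0.
 * Stores the code point in *r and returns the bytes consumed.
 * On malformed input, stores RUNE_ERROR and returns 1.
 */
size_t
decode_rune(unsigned long *r, const unsigned char *s, size_t n)
{
	unsigned long c = s[0], min = 0;
	size_t len, i;

	if (c < 0x80) {
		*r = c;
		return 1;
	}
	if ((c & 0xE0) == 0xC0) {
		len = 2; c &= 0x1F; min = 0x80;
	} else if ((c & 0xF0) == 0xE0) {
		len = 3; c &= 0x0F; min = 0x800;
	} else if ((c & 0xF8) == 0xF0) {
		len = 4; c &= 0x07; min = 0x10000;
	} else
		goto bad;		/* stray continuation or invalid lead byte */
	if (len > n)
		goto bad;		/* truncated sequence */
	for (i = 1; i < len; i++) {
		if ((s[i] & 0xC0) != 0x80)
			goto bad;	/* missing continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}
	/* Reject overlong forms, surrogates, and out-of-range values. */
	if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		goto bad;
	*r = c;
	return len;
bad:
	*r = RUNE_ERROR;
	return 1;
}

/* Count runes in s[0..n); *nbad receives the number of bad bytes. */
size_t
count_runes(const unsigned char *s, size_t n, size_t *nbad)
{
	unsigned long r;
	size_t i, w, nrunes = 0;

	*nbad = 0;
	for (i = 0; i < n; i += w) {
		w = decode_rune(&r, s + i, n - i);
		if (r == RUNE_ERROR && w == 1)	/* optional: caller's choice */
			(*nbad)++;
		nrunes++;
	}
	return nrunes;
}
```

A program that does not care about errors simply drops the test inside the loop. One that does care can tell a decoding failure from a genuinely encoded U+FFFD because the latter occupies three bytes, so the returned width is 3 rather than 1.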