From mboxrd@z Thu Jan 1 00:00:00 1970
To: 9fans@cse.psu.edu
From: "Douglas A. Gwyn"
Message-ID: <3CCE1366.EAB27A38@null.net>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
References: <991a7d99caeee7b2557f759e7b5a8a77@caldo.demon.co.uk>
Subject: Re: [9fans] wchar_t in ANSI C (was "Announce: port")
Date: Tue, 30 Apr 2002 09:40:27 +0000
Topicbox-Message-UUID: 81ebde0e-eaca-11e9-9e20-41e7f4b1d025

forsyth@caldo.demon.co.uk wrote:
> does it insist that it be `self-synchronising' ... ?

The C language standard doesn't insist on much at all for
multibyte encodings, because they are not under control of
the programming language.

It happens that almost any encoding scheme *will* self-synchronize
within a few more coded characters after a coding error; in fact
there is a cute "mind-reading" magic trick that exploits the
underlying phenomenon: Spread out a deck of 52 cards in a row
face-up, ask the victim to pick any card among the first ten,
then count forward *mentally* that many cards (J=10, etc.) and
iterate with the card reached until the last card runs him past
the end of the deck. When he says he's done, you instantly tell
him the last card he reached. The trick is that you perform the
same procedure using your own choice of starting card; odds are
good that the sequences merge before the end.

The one real constraint the C standard imposed on multibyte
encodings was that there be no embedded 0-valued bytes. The
idea was that (before 1994) it was expected that m.b. sequences
would be copied etc. using the char-oriented legacy functions
and we all know that the 0 byte has special meaning there.
Unfortunately, with the spread of UTF-16 as an external encoding,
this constraint has led to a real problem, which is being worked
on by interested parties.

Different people can draw different conclusions from such
situations. For example, I take it as one more example of the
evil of stealing perfectly legitimate code values for in-band
control purposes.
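
To make the zero-byte point above concrete, here is a minimal sketch in C of why UTF-16 data does not survive the char-oriented functions. The helper names ascii_to_utf16le and length_seen_by_strlen are hypothetical (they are not part of any standard library), and the little-endian byte order and ASCII-only input are assumptions made purely to keep the illustration short.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a standard function): encode an ASCII-only
 * string as UTF-16LE bytes, followed by a two-byte zero terminator.
 * Returns the number of payload bytes written, excluding the terminator.
 * Input that does not fit in outsize is silently truncated.
 */
size_t ascii_to_utf16le(const char *s, unsigned char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize < 2)
        return 0;                        /* no room even for the terminator */
    while (*s != '\0' && n + 4 <= outsize) {
        out[n++] = (unsigned char)*s++;  /* low byte: the ASCII code */
        out[n++] = 0;                    /* high byte: zero for every ASCII character */
    }
    out[n] = 0;                          /* UTF-16 terminator, low byte */
    out[n + 1] = 0;                      /* UTF-16 terminator, high byte */
    return n;
}

/*
 * What a char-oriented legacy function makes of the same buffer:
 * strlen stops at the first 0-valued byte it meets.
 */
size_t length_seen_by_strlen(const unsigned char *buf)
{
    return strlen((const char *)buf);
}
```

Encoding "port" this way yields 8 payload bytes, yet length_seen_by_strlen reports 1, because the second byte of the first code unit is already 0. The same text in an encoding with no embedded zero bytes (plain ASCII, or UTF-8) would be seen in full by strlen, strcpy and friends, which is the property the constraint was meant to protect.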