9fans - fans of the OS Plan 9 from Bell Labs
* Re: [9fans] wchar_t in ANSI C (was "Announce: port")
@ 2002-04-30 18:16 David Gordon Hogan
  0 siblings, 0 replies; 6+ messages in thread
From: David Gordon Hogan @ 2002-04-30 18:16 UTC (permalink / raw)
  To: 9fans

>> Now you're being needlessly pedantic. Plan 9 does not run on any
>> 16-bit platform and I explicitly said it was a Plan 9 example.
>
> Well, okay, I was looking toward what would be needed were the
> example to be extended to support more general encodings, in
> anticipation of a complaint that Standard C also requires
> "int" be changed to an appropriate typedef.  If Plan 9 were
> addressing a wider problem domain it would need the same kind
> of thing done for it too.

Yeah, but C doesn't have parametric polymorphism ;-)




* Re: [9fans] wchar_t in ANSI C (was "Announce: port")
       [not found] <991a7d99caeee7b2557f759e7b5a8a77@caldo.demon.co.uk>
@ 2002-04-30  9:40 ` Douglas A. Gwyn
  0 siblings, 0 replies; 6+ messages in thread
From: Douglas A. Gwyn @ 2002-04-30  9:40 UTC (permalink / raw)
  To: 9fans

forsyth@caldo.demon.co.uk wrote:
> does it insist that it be `self-synchronising' ... ?

The C language standard doesn't insist on much at all for multibyte
encodings, because they are not under control of the programming
language.  It happens that almost any encoding scheme *will*
self-synchronize within a few more coded characters after a coding
error; in fact there is a cute "mind-reading" magic trick that
exploits the underlying phenomenon:  Spread out a deck of 52 cards
in a row face-up, ask the victim to pick any card among the first
ten, then count forward *mentally* that many cards (J=10, etc.) and
iterate with the card reached until the last card runs him past the
end of the deck.  When he says he's done, you instantly tell him
the last card he reached.  The trick is that you perform the same
procedure using your own choice of starting card; odds are good
that the sequences merge before the end.
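
The merging effect in the card trick can be sketched in a few lines of C. This is only an illustration under assumptions: `last_card` is a hypothetical helper, and the ten-card "deck" in the usage note is made up for the example. A decoder that starts at the wrong byte behaves much like a player who starts on a different card.

```c
#include <stddef.h>

/*
 * Follow the counting procedure from the card trick: starting at index
 * `start` (which must be < n), repeatedly step forward by the value of
 * the current card, and return the index of the last card reached
 * before a step would run past the end of the deck.
 * Every card value must be >= 1, otherwise the loop would not advance.
 */
size_t last_card(const int *deck, size_t n, size_t start)
{
	size_t i = start;

	while (i + (size_t)deck[i] < n)
		i += (size_t)deck[i];
	return i;
}
```

With the made-up deck {3, 1, 2, 4, 1, 2, 3, 1, 2, 1}, starting from index 0, 1 or 2 ends on the same card (index 8): once two walks land on a common card they stay together.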

The one real constraint the C standard imposed on multibyte encodings
was that there be no embedded 0-valued bytes.  Before 1994, multibyte
sequences were expected to be copied etc. using the char-oriented legacy
functions, and we all know that the 0 byte has special meaning there.
Unfortunately, with the spread of UTF-16 as an external encoding, this
constraint has led to a real problem, which is being worked on by
interested parties.
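
A small sketch of why embedded 0 bytes matter to the char-oriented functions. The byte arrays and the `visible_length` helper are made up for illustration; the UTF-16 bytes are written out by hand in little-endian order.

```c
#include <string.h>

/* "AB" as UTF-16LE: each ASCII character is followed by a 0 high byte. */
static const char utf16le_ab[] = { 'A', 0, 'B', 0, 0, 0 };

/* The same text as UTF-8: no 0 byte appears before the terminator. */
static const char utf8_ab[] = "AB";

/* What a char-oriented legacy function such as strlen sees of the data. */
size_t visible_length(const char *s)
{
	return strlen(s);
}
```

Here `visible_length(utf16le_ab)` is 1, because strlen stops at the first 0 byte, while `visible_length(utf8_ab)` is 2.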

Different people can draw different conclusions from such situations.
For example, I take it as one more example of the evil of stealing
perfectly legitimate code values for in-band control purposes.



* Re: [9fans] wchar_t in ANSI C (was "Announce: port")
       [not found] <9b8b28678237726753936b99587567ed@plan9.bell-labs.com>
@ 2002-04-30  9:40 ` Douglas A. Gwyn
  0 siblings, 0 replies; 6+ messages in thread
From: Douglas A. Gwyn @ 2002-04-30  9:40 UTC (permalink / raw)
  To: 9fans

"rob pike, esq." wrote:
> Now you're being needlessly pedantic. Plan 9 does not run on any
> 16-bit platform and I explicitly said it was a Plan 9 example.

Well, okay, I was looking toward what would be needed were the
example to be extended to support more general encodings, in
anticipation of a complaint that Standard C also requires
"int" be changed to an appropriate typedef.  If Plan 9 were
addressing a wider problem domain it would need the same kind
of thing done for it too.



* Re: [9fans] wchar_t in ANSI C (was "Announce: port")
  2002-04-29 12:42 rob pike, esq.
@ 2002-04-29 16:18 ` Douglas A. Gwyn
  0 siblings, 0 replies; 6+ messages in thread
From: Douglas A. Gwyn @ 2002-04-29 16:18 UTC (permalink / raw)
  To: 9fans

"rob pike, esq." wrote:
> there is no excuse for omitting wchar_t support in stdio.

But it was added in 1994.

There is now a complete duplication of the legacy "char"
text-handling facilities for "wide character" text-handling.
In fact I predicted that that would be necessary back when the
form of support for multibyte encodings was still being debated,
but the C committee was assured by the affected parties who
bothered to participate (dig) that minimal support for
character-at-a-time translation would satisfy their needs,
which is why that was all that was specified in the initial
version.
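
For readers who have not used the wide-character facilities, a minimal sketch follows. The helper names `format_wide` and `wide_length` are invented for the example; `wcslen` and the `%ls` / `%lc` conversions are, as far as I know, part of the wide-character additions referred to above.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/*
 * Format a wide string and a wide character into a char buffer with the
 * %ls and %lc conversions.  The caller must supply a large enough buffer.
 * In the default "C" locale only characters from the basic character set
 * are guaranteed to convert.  Returns sprintf's result.
 */
int format_wide(char *buf, const wchar_t *ws, wchar_t wc)
{
	return sprintf(buf, "%ls|%lc", ws, (wint_t)wc);
}

/* wcslen is the wide-character counterpart of strlen. */
size_t wide_length(const wchar_t *ws)
{
	return wcslen(ws);
}
```

Calling `format_wide(buf, L"abc", L'd')` fills `buf` with "abc|d", and `wide_length(L"abc")` is 3, mirroring what strlen would report for "abc".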



* Re: [9fans] wchar_t in ANSI C (was "Announce: port")
@ 2002-04-29 12:53 rob pike, esq.
  0 siblings, 0 replies; 6+ messages in thread
From: rob pike, esq. @ 2002-04-29 12:53 UTC (permalink / raw)
  To: 9fans

Here's a simple version of the problem.  Imagine you have a (bio) loop
along the lines of

	int c;

	while((c = Bgetc(&bWin)) != Beof){
		c = process(c);
		Bputc(&out, c);
	}

To make this work with UTF-8, all you do is change 'c' to 'rune'
in the calls:

	int c;

	while((c = Bgetrune(&bWin)) != Beof){
		c = process(c);
		Bputrune(&out, c);
	}

Loops like this are everywhere in the Plan 9 tools.  Bgetrune gets
called much more than Bgetc, I bet, at least for programs operating
on text.

No such charm works in ANSI C.
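
For comparison, here is a rough sketch of a decoding loop written against the standard's restartable conversion function mbrtowc. It is one possible shape, not a canonical one: `count_chars` is a hypothetical helper, the recovery policy (skip one byte and reset the state) is my assumption, and which byte sequences count as valid depends on the current locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the characters in s[0..n) with mbrtowc.  On an invalid or
 * truncated sequence, skip one byte, reset the conversion state and
 * carry on; *errors receives the number of such recoveries.
 */
size_t count_chars(const char *s, size_t n, size_t *errors)
{
	mbstate_t st;
	wchar_t wc;
	size_t i = 0, count = 0, k;

	memset(&st, 0, sizeof st);	/* initial conversion state */
	*errors = 0;
	while (i < n) {
		k = mbrtowc(&wc, s + i, n - i, &st);
		if (k == (size_t)-1 || k == (size_t)-2) {
			/* error or incomplete input: the state is no longer usable */
			memset(&st, 0, sizeof st);
			(*errors)++;
			k = 1;
		} else if (k == 0) {
			k = 1;		/* decoded an embedded null character */
		}
		i += k;
		count++;
	}
	return count;
}
```

The caller has to thread the error count, the state reset and the special return values through what was a three-line loop.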

-rob




* Re: [9fans] wchar_t in ANSI C (was "Announce: port")
@ 2002-04-29 12:42 rob pike, esq.
  2002-04-29 16:18 ` Douglas A. Gwyn
  0 siblings, 1 reply; 6+ messages in thread
From: rob pike, esq. @ 2002-04-29 12:42 UTC (permalink / raw)
  To: 9fans

> Sounds like somebody who doesn't use them enough to know.
> wchar_t is closely analogous to rune.
> The real problem is that "char" is inadequate for encoding a character,
> largely a consequence of Dennis chiming in on the sizeof(char)==1 side.

I don't want to debate whether sizeof(char) should be 1, but I do
think you're being too forgiving about wchar_t, at least in the
original standard.  There were too many holes in the standard, such as
no defined format for printing wchar_t strings, no defined conversion
between strings of either type (just of individual characters) and no
defined input method.  In short, no stdio support!  Too much last-minute
committee design, I find.

Footnotes 119 and 122 in the I/O section of the standard (printf,
scanf) both read: "No special provisions are made for multibyte
characters."  Give me a break!  How hard would it have been to define
%lc and %ls, for instance?

The answer is surprisingly subtle; I give it in the next paragraph.

The issue that cheeses me most still remains even in the new standard:
the clumsiness of converting in the face of conversion errors such as
malformed UTF-8, which turn up a lot when you're scanning binary data
looking for strings, or just get handed something like Latin-1 when
you're expecting UTF-8.  Most programs (e.g. grep) can do nothing
useful in the face of errors except barge on, but the ANSI C standard
makes the standard character processing loop a real mess.  It also
makes scanf("%ls or %lc") impossible to write consistently with the
rest of the standard, since you need to stop if there's a conversion
error, almost never what you want.  This issue is a matter of taste,
but I feel it's done wrong.  The Plan 9 model, with the concept of an
"Error Rune", makes it easy to ignore errors but also easy to handle
them, as you decide.  Plan 9's is a much better model because it was a
model born of experience rather than design without implementation.
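
The error-rune idea can be sketched in portable C. This is not Plan 9's code: `decode_rune`, `count_runes` and the `RUNE_ERROR` value of 0xFFFD are assumptions made for the example, and the decoder only handles sequences of up to three bytes.

```c
#include <stddef.h>

enum { RUNE_ERROR = 0xFFFD };	/* assumed stand-in for the error rune */

/*
 * Decode one character from s (n >= 1 bytes available) into *r and
 * return the number of bytes consumed.  Malformed input yields
 * RUNE_ERROR and consumes exactly one byte, so a caller's loop never
 * has to stop.  A real U+FFFD in the input also decodes to RUNE_ERROR;
 * a caller that cares can tell the two apart by the returned length.
 */
size_t decode_rune(const unsigned char *s, size_t n, long *r)
{
	long c;

	if (s[0] < 0x80) {			/* 1 byte: 0xxxxxxx */
		*r = s[0];
		return 1;
	}
	if ((s[0] & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
		c = ((long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		if (c >= 0x80) {		/* reject overlong forms */
			*r = c;
			return 2;
		}
	}
	if ((s[0] & 0xF0) == 0xE0 && n >= 3 &&
	    (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
		c = ((long)(s[0] & 0x0F) << 12) |
		    ((long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		if (c >= 0x800) {		/* reject overlong forms */
			*r = c;
			return 3;
		}
	}
	*r = RUNE_ERROR;			/* malformed: report it, skip one byte */
	return 1;
}

/* Count runes in a buffer, barging on past errors. */
size_t count_runes(const unsigned char *s, size_t n, size_t *errors)
{
	size_t i = 0, count = 0;
	long r;

	*errors = 0;
	while (i < n) {
		i += decode_rune(s + i, n - i, &r);
		if (r == RUNE_ERROR)
			(*errors)++;
		count++;
	}
	return count;
}
```

Fed the bytes 'a', 0xC3, 0xA9, 0xE9, 'z' (a UTF-8 é followed by a stray Latin-1 é), `count_runes` reports four runes, one of them an error, and the loop body needs no special case to keep going.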

I reiterate that the error handling issue is one of taste, but that
there is no excuse for omitting wchar_t support in stdio.

We wrote about this in our UTF paper
	http://plan9.bell-labs.com/sys/doc/utf.pdf .html .ps
(The .html version has some character set awkwardness!).  If we could
have used ANSI C's design for wide characters, we would have, but it
was inadequate.

-rob



