From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <7359f0490704270741i76150820yb43bdae6603e83bc@mail.gmail.com>
Date: Fri, 27 Apr 2007 07:41:10 -0700
From: "Rob Pike" <robpike@gmail.com>
To: "Fans of the OS Plan 9 from Bell Labs" <9fans@cse.psu.edu>
Subject: Re: [9fans] speaking of kenc
In-Reply-To: <eada9dbbf85f37aabb76006763392fcf@coraid.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
References: <7871fcf50704270721q25223196rea21e7a64ff8ad58@mail.gmail.com>
	<eada9dbbf85f37aabb76006763392fcf@coraid.com>
Topicbox-Message-UUID: 5152ca7a-ead2-11e9-9d60-3106f5b1d025

\u doesn't add anything useful to the plan 9 c compiler because of the
way its input language is defined.  it is useful in c99, because of the
way its input language is defined.

in plan 9, each escape sequence in a string constant represents a
unicode code point.  "\x1234" represents a string with a single
character with value 0x1234.  but in c99, that is an erroneous
string because each escape sequence represents a byte and 0x1234
does not fit in one. thus
to represent a unicode value, one is expected to write out the
utf-8 byte sequence.  plan 9's "\x1234" becomes, in c99,
"\xe1\x88\xb4".  the \u escapes were created to give plan 9's
functionality without breaking compatibility with the existing
implicit meaning of the \x escapes.
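
to make the byte-level difference concrete, here is a minimal sketch
in standard c of the utf-8 encoding step that turns the code point
0x1234 into the three bytes above.  the names utf8encode and check1234
are made up for this illustration; they are not part of plan 9's
libraries or of any standard, and this is not how either compiler
implements its escapes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * hypothetical helper (an assumption for illustration only):
 * encode code point cp as utf-8 into buf, which must have room
 * for 4 bytes.  returns the number of bytes written, or 0 if cp
 * is a surrogate or lies beyond 0x10FFFF.
 */
size_t
utf8encode(unsigned long cp, unsigned char *buf)
{
	if(cp < 0x80){			/* 1 byte: 0xxxxxxx */
		buf[0] = (unsigned char)cp;
		return 1;
	}
	if(cp < 0x800){			/* 2 bytes: 110xxxxx 10xxxxxx */
		buf[0] = (unsigned char)(0xC0 | (cp >> 6));
		buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if(cp < 0x10000){		/* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
		if(cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* surrogates are not valid code points */
		buf[0] = (unsigned char)(0xE0 | (cp >> 12));
		buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if(cp <= 0x10FFFF){		/* 4 bytes: 11110xxx then three 10xxxxxx */
		buf[0] = (unsigned char)(0xF0 | (cp >> 18));
		buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/* self-check: 0x1234 encodes to the byte-wise c99 string shown above */
void
check1234(void)
{
	unsigned char buf[4];

	assert(utf8encode(0x1234, buf) == 3);
	assert(memcmp(buf, "\xe1\x88\xb4", 3) == 0);
}
```

calling check1234() from a small main should pass silently.  with a
c99 compiler whose execution character set is utf-8 (an assumption
about the compiler's configuration, not something the standard
guarantees), "\u1234" would be expected to produce the same three bytes.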

this subject is quite long and involved - where does utf-8 fit in?
how does source encoding interact with internal representation?
output encoding? etc. etc. - but the key point about \u is that
it makes sense in a utf-8 world with standard c and c++.

plan 9's c is very non-standard in this regard.  i prefer its design
but i don't find \u to be a bad solution.  there are a number of
related notations in the standards pipeline to deal with some of
the other issues, such as forcing utf-8 byte sequences. the
notation is going to get pretty ugly.

-rob